Policy Improvement
Policy improvement is the process of constructing a better policy based on the current value function estimates.
As with policy evaluation, policy improvement can work with either the state value function or the action value function. For DP methods, the state value function will be used.
Now that you can estimate the state value function for any policy, a natural next step is to explore whether there are any policies better than the current one. One way of doing this is to consider taking a different action $a$ in a state $s$, and to follow the current policy $\pi$ afterwards. If this sounds familiar, it's because this is similar to how we define the action value function:

$$q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_\pi(s') \right]$$
If this new value $q_\pi(s, a)$ is greater than the original state value $v_\pi(s)$, it indicates that taking action $a$ in state $s$ and then continuing with policy $\pi$ leads to better outcomes than strictly following policy $\pi$. Since states are independent, it's optimal to always select action $a$ whenever state $s$ is encountered. Therefore, we can construct an improved policy $\pi'$, identical to $\pi$ except that it selects action $a$ in state $s$, and this new policy is superior to the original policy $\pi$.
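As a quick illustration, here is a minimal sketch of this one-step lookahead check in Python. It assumes a tabular model stored in the Gym/Gymnasium convention `P[s][a] = [(prob, next_state, reward, done), ...]`; the helper name `q_from_v` and the toy MDP numbers are made up for illustration.

```python
import numpy as np

def q_from_v(P, v, s, a, gamma=0.9):
    """One-step lookahead: estimate q_pi(s, a) from the current v_pi.

    Assumes a Gym/Gymnasium-style model:
    P[s][a] = [(prob, next_state, reward, done), ...]
    """
    return sum(prob * (reward + gamma * v[next_state] * (not done))
               for prob, next_state, reward, done in P[s][a])

# Toy 2-state MDP (hypothetical numbers, just to illustrate the check)
P = {
    0: {0: [(1.0, 0, 0.0, False)],   # action 0: stay in state 0, reward 0
        1: [(1.0, 1, 1.0, False)]},  # action 1: move to state 1, reward 1
    1: {0: [(1.0, 1, 0.0, False)],
        1: [(1.0, 0, 0.0, False)]},
}
v = np.array([0.0, 5.0])  # pretend these are v_pi estimates for some policy pi

s, a = 0, 1
q_sa = q_from_v(P, v, s, a)
print(q_sa > v[s])  # True: switching to action a in state s improves on pi
```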
Policy Improvement Theorem
The reasoning described above can be generalized as the policy improvement theorem: if $\pi$ and $\pi'$ are two deterministic policies such that, for all states $s$,

$$q_\pi(s, \pi'(s)) \ge v_\pi(s),$$

then $\pi'$ is at least as good as $\pi$, that is, $v_{\pi'}(s) \ge v_\pi(s)$ for all states $s$.
The proof of this theorem is relatively simple, and can be achieved by repeatedly substituting the assumption $v_\pi(s) \le q_\pi(s, \pi'(s))$ and expanding the expectation one step at a time under $\pi'$:

$$
\begin{aligned}
v_\pi(s) &\le q_\pi(s, \pi'(s)) \\
&= \mathbb{E}\left[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = \pi'(s) \right] \\
&\le \mathbb{E}_{\pi'}\left[ R_{t+1} + \gamma q_\pi(S_{t+1}, \pi'(S_{t+1})) \mid S_t = s \right] \\
&\le \mathbb{E}_{\pi'}\left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 v_\pi(S_{t+2}) \mid S_t = s \right] \\
&\;\;\vdots \\
&\le \mathbb{E}_{\pi'}\left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s \right] = v_{\pi'}(s).
\end{aligned}
$$
Improvement Strategy
While updating the actions for certain states can lead to improvements, it's more effective to update the actions for all states simultaneously. Specifically, for each state $s$, select the action $a$ that maximizes the action value $q_\pi(s, a)$:

$$\pi'(s) = \arg\max_a q_\pi(s, a) = \arg\max_a \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_\pi(s') \right]$$
where $\arg\max_a$ (short for argument of the maximum) is an operator that returns the value of the variable $a$ that maximizes the given expression.
The resulting greedy policy, denoted by $\pi'$, satisfies the conditions of the policy improvement theorem by construction, guaranteeing that $\pi'$ is at least as good as the original policy $\pi$, and typically better.
If $\pi'$ is as good as, but not better than $\pi$, then both $\pi$ and $\pi'$ are optimal policies: their value functions are equal and satisfy the Bellman optimality equation:

$$v_{\pi'}(s) = \max_a \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_{\pi'}(s') \right]$$
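Putting the improvement strategy together, below is a minimal sketch of one greedy policy improvement sweep over all states. It again assumes the Gym/Gymnasium-style model `P[s][a] = [(prob, next_state, reward, done), ...]`; the function name `policy_improvement` is illustrative.

```python
import numpy as np

def policy_improvement(P, v, n_states, n_actions, gamma=0.9):
    """Greedy policy improvement: pi'(s) = argmax_a q_pi(s, a).

    P is assumed to follow the Gym/Gymnasium convention
    P[s][a] = [(prob, next_state, reward, done), ...].
    Returns a deterministic policy as an array of action indices.
    """
    new_policy = np.zeros(n_states, dtype=int)
    for s in range(n_states):
        q = np.zeros(n_actions)
        for a in range(n_actions):
            # One-step lookahead using the current value estimates
            for prob, next_state, reward, done in P[s][a]:
                q[a] += prob * (reward + gamma * v[next_state] * (not done))
        new_policy[s] = np.argmax(q)  # greedy action for state s
    return new_policy

# Usage with the toy 2-state model from the earlier snippet:
# improved = policy_improvement(P, v, n_states=2, n_actions=2)
# print(improved)  # array([1, 0]) -- the greedy action in each state
```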