Introduction to Reinforcement Learning: Dynamic Programming

Optimality Conditions
In the previous chapter, you learned about Bellman equations for state value and state-action value functions. These equations describe how state values can be recursively defined through the values of other states, with the values being dependent on a given policy. However, not all policies are equally effective. In fact, value functions provide a partial ordering for policies, which can be described as follows:

$$\pi \ge \pi' \iff v_\pi(s) \ge v_{\pi'}(s) \qquad \forall s \in S$$

So policy $\pi$ is better than or equal to policy $\pi'$ if, for every possible state, the expected return under $\pi$ is not less than the expected return under $\pi'$.
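This ordering can be checked numerically. The sketch below evaluates two fixed policies on a tiny two-state MDP with iterative policy evaluation, then tests whether $\pi \ge \pi'$ state by state. All transition dynamics, rewards, and the `P` encoding are invented here for illustration:

```python
# Toy two-state MDP (dynamics invented for illustration).
# P[s][a] is a list of (probability, next_state, reward) transitions.
P = {
    0: {0: [(1.0, 0, 1.0)], 1: [(1.0, 1, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
gamma = 0.9

def evaluate(policy, tol=1e-8):
    """Iterative policy evaluation: v(s) = sum_a pi(a|s) * E[r + gamma * v(s')]."""
    v = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            new_v = sum(
                policy[s][a] * sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v

pi = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 1.0}}        # always take action 1
pi_prime = {0: {0: 1.0, 1: 0.0}, 1: {0: 1.0, 1: 0.0}}  # always take action 0

v_pi, v_pi_prime = evaluate(pi), evaluate(pi_prime)

# pi >= pi' iff v_pi(s) >= v_pi'(s) in *every* state:
pi_is_at_least_as_good = all(v_pi[s] >= v_pi_prime[s] for s in P)
```

On this particular MDP the first policy dominates in both states, so the two policies are comparable; in general, two policies may each win in different states, which is why the ordering is only partial.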

Optimal Policy

An optimal policy $\pi_*$ is a policy that is better than or equal to every other policy: $v_{\pi_*}(s) \ge v_\pi(s)$ for every policy $\pi$ and every state $s$. Although there may be many optimal policies, all of them are denoted as $\pi_*$.

Why does an optimal policy always exist?

You might be wondering why an optimal policy always exists for any MDP. That's a great question, and the intuition behind it is surprisingly simple. Remember, states in an MDP fully capture the environment's condition. This implies each state is independent from all others: the action chosen in one state doesn't affect the rewards or outcomes achievable in another. Therefore, by selecting the optimal action in each state separately, you naturally arrive at the overall best sequence of actions across the entire process. And this set of optimal actions in each state is an optimal policy.

Moreover, there is always at least one policy that is both optimal and deterministic. Indeed, if for some state $s$, two actions $a$ and $a'$ yield the same expected return, selecting just one of them will not affect the policy's optimality. Applying this principle to every single state will make the policy deterministic while preserving its optimality.

Optimal Value Functions

Optimal policies share the same value functions, a fact that follows from how policies are compared: if two policies are both optimal, neither can have a strictly greater value in any state, so their values must coincide everywhere. This means that optimal policies share both the state value function and the action value function.

Additionally, optimal value functions satisfy their own Bellman equations, which can be written without reference to any specific policy. These equations are called Bellman optimality equations.

Optimal state value function

The optimal state value function is usually denoted as $V_*$ or $v_*$.

It can be mathematically defined as such:

$$v_*(s) = \max_\pi v_\pi(s) = \mathbb{E}_{\pi_*}[G_t | S_t = s]$$

The Bellman optimality equation for this value function can be derived like this:

$$\begin{aligned} v_*(s) &= \sum_a \pi_*(a | s) \sum_{s', r} p(s', r | s, a)\Bigl(r + \gamma v_*(s')\Bigr)\\ &= \max_a \sum_{s', r} p(s', r | s, a)\Bigl(r + \gamma v_*(s')\Bigr) \end{aligned}$$

Intuition

As you already know, there always exists at least one policy that is both optimal and deterministic. Such a policy would, for each state, consistently select one particular action that maximizes expected returns. Therefore, the probability of choosing this optimal action would always be 1, and the probability of choosing any other action would be 0. Given this, the original Bellman equation no longer needs the summation operator. Instead, since we know we will always select the best possible action, we can simply replace the sum by taking a maximum over all available actions.
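This backup can be turned directly into an algorithm: value iteration repeatedly applies the $\max_a$ form of the equation until the values stop changing. The sketch below runs it on a hypothetical two-state MDP; the `P` transition dictionary, rewards, and discount are all invented for illustration:

```python
# Value iteration: repeatedly apply the Bellman optimality backup
#   v(s) <- max_a sum_{s',r} p(s', r | s, a) * (r + gamma * v(s'))
# P[s][a] is a list of (probability, next_state, reward) transitions
# (toy dynamics, invented for illustration).
P = {
    0: {0: [(1.0, 0, 1.0)], 1: [(1.0, 1, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
gamma = 0.9

def backup(v, s, a):
    """Expected one-step return of taking action a in state s under values v."""
    return sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])

def value_iteration(tol=1e-8):
    v = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            new_v = max(backup(v, s, a) for a in P[s])  # the max replaces the sum over pi
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v

v_star = value_iteration()

# A deterministic optimal policy is greedy with respect to v*:
pi_star = {s: max(P[s], key=lambda a: backup(v_star, s, a)) for s in P}
```

Note that the greedy policy read off at the end is deterministic, matching the earlier claim that at least one optimal policy is both optimal and deterministic.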

Optimal action value function

The optimal action value function is usually denoted as $Q_*$ or $q_*$.

It can be mathematically defined as such:

$$q_*(s, a) = \max_\pi q_\pi(s, a) = \mathbb{E}_{\pi_*}[G_t | S_t = s, A_t = a]$$

The Bellman optimality equation for this value function can be derived like this:

$$\begin{aligned} q_*(s, a) &= \sum_{s', r} p(s', r | s, a)\Bigl(r + \gamma \sum_{a'} \pi_*(a' | s') q_*(s', a')\Bigr)\\ &= \sum_{s', r} p(s', r | s, a)\Bigl(r + \gamma \max_{a'} q_*(s', a')\Bigr) \end{aligned}$$

Intuition

As with the state value function, the inner sum over the policy's action probabilities can be replaced by a maximum over all available actions in the next state.
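The same fixed-point idea works directly on action values. Below is a minimal Q-value iteration sketch on an invented two-state MDP (same hypothetical `P` encoding as a list of `(probability, next_state, reward)` triples); at the end, the optimal state values can be recovered as $v_*(s) = \max_a q_*(s, a)$:

```python
# Q-value iteration: repeatedly apply the Bellman optimality backup for q*
#   q(s, a) <- sum_{s',r} p(s', r | s, a) * (r + gamma * max_{a'} q(s', a'))
# on a toy two-state MDP with invented dynamics.
P = {
    0: {0: [(1.0, 0, 1.0)], 1: [(1.0, 1, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
gamma = 0.9

def q_value_iteration(tol=1e-8):
    q = {s: {a: 0.0 for a in P[s]} for s in P}
    while True:
        delta = 0.0
        for s in P:
            for a in P[s]:
                new_q = sum(
                    p * (r + gamma * max(q[s2].values()))  # max over next actions
                    for p, s2, r in P[s][a]
                )
                delta = max(delta, abs(new_q - q[s][a]))
                q[s][a] = new_q
        if delta < tol:
            return q

q_star = q_value_iteration()

# Recover the optimal state values: v*(s) = max_a q*(s, a)
v_star = {s: max(q_star[s].values()) for s in P}
```

Working with $q_*$ has a practical advantage: the greedy action in a state is simply the argmax over $q_*(s, \cdot)$, with no need to know the transition model at decision time.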

Section 3. Chapter 3