Introduction to Reinforcement Learning: Dynamic Programming

Optimality Conditions
In the previous chapter, you learned about Bellman equations for state value and state-action value functions. These equations describe how state values can be recursively defined through the values of other states, with the values being dependent on a given policy. However, not all policies are equally effective. In fact, value functions provide a partial ordering for policies, which can be described as follows:

$$\pi \ge \pi' \iff v_\pi(s) \ge v_{\pi'}(s) \qquad \forall s \in S$$

So policy $\pi$ is better than or equal to policy $\pi'$ if, for every possible state, the expected return under $\pi$ is not less than the expected return under $\pi'$.
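This ordering can be checked numerically. The sketch below evaluates two fixed policies on a tiny two-state MDP with iterative policy evaluation, then tests whether $\pi \ge \pi'$ state by state. All transition dynamics, rewards, and the `P` encoding are invented here for illustration:

```python
# Toy two-state MDP (dynamics invented for illustration).
# P[s][a] is a list of (probability, next_state, reward) transitions.
P = {
    0: {0: [(1.0, 0, 1.0)], 1: [(1.0, 1, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
gamma = 0.9

def evaluate(policy, tol=1e-8):
    """Iterative policy evaluation: v(s) = sum_a pi(a|s) * E[r + gamma * v(s')]."""
    v = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            new_v = sum(
                policy[s][a] * sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v

pi = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 1.0}}        # always take action 1
pi_prime = {0: {0: 1.0, 1: 0.0}, 1: {0: 1.0, 1: 0.0}}  # always take action 0

v_pi, v_pi_prime = evaluate(pi), evaluate(pi_prime)

# pi >= pi' iff v_pi(s) >= v_pi'(s) in *every* state:
pi_is_at_least_as_good = all(v_pi[s] >= v_pi_prime[s] for s in P)
```

On this particular MDP the first policy dominates in both states, so the two policies are comparable; in general, two policies may each win in different states, which is why the ordering is only partial.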

Optimal Policy

An optimal policy $\pi_*$ is a policy that is better than or equal to every other policy: $v_{\pi_*}(s) \ge v_\pi(s)$ for every policy $\pi$ and every state $s$. Although there may be many optimal policies, all of them are denoted as $\pi_*$.

Why does an optimal policy always exist?

You might be wondering why an optimal policy always exists for any MDP. That's a great question, and the intuition behind it is surprisingly simple. Remember, states in an MDP fully capture the environment's condition. This implies each state is independent from all others: the action chosen in one state doesn't affect the rewards or outcomes achievable in another. Therefore, by selecting the optimal action in each state separately, you naturally arrive at the overall best sequence of actions across the entire process. And this set of optimal actions in each state is an optimal policy.

Moreover, there is always at least one policy that is both optimal and deterministic. Indeed, if for some state $s$, two actions $a$ and $a'$ yield the same expected return, selecting just one of them will not affect the policy's optimality. Applying this principle to every single state will make the policy deterministic while preserving its optimality.

Optimal Value Functions

Optimal policies share the same value functions, a fact that follows from how policies are compared: if two policies are both optimal, neither can have a strictly greater value in any state, so their values must coincide everywhere. This means that optimal policies share both the state value function and the action value function.

Additionally, optimal value functions satisfy their own Bellman equations, which can be written without reference to any specific policy. These equations are called Bellman optimality equations.

Optimal state value function

The optimal state value function is usually denoted as $V_*$ or $v_*$.

It can be mathematically defined as such:

$$v_*(s) = \max_\pi v_\pi(s) = \mathbb{E}_{\pi_*}[G_t | S_t = s]$$

The Bellman optimality equation for this value function can be derived like this:

$$\begin{aligned} v_*(s) &= \sum_a \pi_*(a | s) \sum_{s', r} p(s', r | s, a)\Bigl(r + \gamma v_*(s')\Bigr)\\ &= \max_a \sum_{s', r} p(s', r | s, a)\Bigl(r + \gamma v_*(s')\Bigr) \end{aligned}$$

Intuition

As you already know, there always exists at least one policy that is both optimal and deterministic. Such a policy would, for each state, consistently select one particular action that maximizes expected returns. Therefore, the probability of choosing this optimal action would always be 1, and the probability of choosing any other action would be 0. Given this, the original Bellman equation no longer needs the summation operator. Instead, since we know we will always select the best possible action, we can simply replace the sum by taking a maximum over all available actions.
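This backup can be turned directly into an algorithm: value iteration repeatedly applies the $\max_a$ form of the equation until the values stop changing. The sketch below runs it on a hypothetical two-state MDP; the `P` transition dictionary, rewards, and discount are all invented for illustration:

```python
# Value iteration: repeatedly apply the Bellman optimality backup
#   v(s) <- max_a sum_{s',r} p(s', r | s, a) * (r + gamma * v(s'))
# P[s][a] is a list of (probability, next_state, reward) transitions
# (toy dynamics, invented for illustration).
P = {
    0: {0: [(1.0, 0, 1.0)], 1: [(1.0, 1, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
gamma = 0.9

def backup(v, s, a):
    """Expected one-step return of taking action a in state s under values v."""
    return sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])

def value_iteration(tol=1e-8):
    v = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            new_v = max(backup(v, s, a) for a in P[s])  # the max replaces the sum over pi
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v

v_star = value_iteration()

# A deterministic optimal policy is greedy with respect to v*:
pi_star = {s: max(P[s], key=lambda a: backup(v_star, s, a)) for s in P}
```

Note that the greedy policy read off at the end is deterministic, matching the earlier claim that at least one optimal policy is both optimal and deterministic.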

Optimal action value function

The optimal action value function is usually denoted as $Q_*$ or $q_*$.

It can be mathematically defined as such:

$$q_*(s, a) = \max_\pi q_\pi(s, a) = \mathbb{E}_{\pi_*}[G_t | S_t = s, A_t = a]$$

The Bellman optimality equation for this value function can be derived like this:

$$\begin{aligned} q_*(s, a) &= \sum_{s', r} p(s', r | s, a)\Bigl(r + \gamma \sum_{a'} \pi_*(a' | s') q_*(s', a')\Bigr)\\ &= \sum_{s', r} p(s', r | s, a)\Bigl(r + \gamma \max_{a'} q_*(s', a')\Bigr) \end{aligned}$$

Intuition

As with the state value function, the inner sum over the policy's action probabilities can be replaced by a maximum over all available actions in the next state.
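The same fixed-point idea works directly on action values. Below is a minimal Q-value iteration sketch on an invented two-state MDP (same hypothetical `P` encoding as a list of `(probability, next_state, reward)` triples); at the end, the optimal state values can be recovered as $v_*(s) = \max_a q_*(s, a)$:

```python
# Q-value iteration: repeatedly apply the Bellman optimality backup for q*
#   q(s, a) <- sum_{s',r} p(s', r | s, a) * (r + gamma * max_{a'} q(s', a'))
# on a toy two-state MDP with invented dynamics.
P = {
    0: {0: [(1.0, 0, 1.0)], 1: [(1.0, 1, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
gamma = 0.9

def q_value_iteration(tol=1e-8):
    q = {s: {a: 0.0 for a in P[s]} for s in P}
    while True:
        delta = 0.0
        for s in P:
            for a in P[s]:
                new_q = sum(
                    p * (r + gamma * max(q[s2].values()))  # max over next actions
                    for p, s2, r in P[s][a]
                )
                delta = max(delta, abs(new_q - q[s][a]))
                q[s][a] = new_q
        if delta < tol:
            return q

q_star = q_value_iteration()

# Recover the optimal state values: v*(s) = max_a q*(s, a)
v_star = {s: max(q_star[s].values()) for s in P}
```

Working with $q_*$ has a practical advantage: the greedy action in a state is simply the argmax over $q_*(s, \cdot)$, with no need to know the transition model at decision time.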

Section 3. Chapter 3