Bellman Equations

A Bellman equation is a functional equation that defines a value function in a recursive form. To clarify this definition:

  • A functional equation is an equation whose solution is a function. For the Bellman equation, this solution is the value function for which the equation was formulated;
  • A recursive form means that the value at the current state is expressed in terms of values at future states.

In short, solving the Bellman equation gives the desired value function, and deriving this equation requires identifying a recursive relationship between current and future states.

State Value Function

As a reminder, here is a state value function in compact form:

$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$$

To obtain the Bellman equation for this value function, let's expand the right side of the equation and establish a recursive relationship:

$$
\begin{aligned}
v_\pi(s) &= \mathbb{E}_\pi[G_t \mid S_t = s]\\
&= \mathbb{E}_\pi[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s]\\
&= \mathbb{E}_\pi\Bigl[R_{t+1} + \gamma \sum_{k=0}^\infty \gamma^k R_{t+k+2} \,\Big|\, S_t = s\Bigr]\\
&= \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s]\\
&= \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\Bigl(r + \gamma \mathbb{E}_\pi\bigl[G_{t+1} \mid S_{t+1} = s'\bigr]\Bigr)\\
&= \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\bigl(r + \gamma v_\pi(s')\bigr)
\end{aligned}
$$

The last equation in this chain is the Bellman equation for the state value function.

Intuition

To find the value of a state $s$, you:

  1. Consider all possible actions $a$ you might take from this state, each weighted by how likely you are to choose that action under your current policy $\pi(a \mid s)$;
  2. For each action $a$, consider all possible next states $s'$ and rewards $r$, weighted by their likelihood $p(s', r \mid s, a)$;
  3. For each of these outcomes, take the immediate reward $r$ plus the discounted value of the next state, $\gamma v_\pi(s')$.

By summing all these possibilities together, you get the total expected value of the state $s$ under your current policy.
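
To make this concrete, here is a minimal sketch in Python, assuming a small hypothetical two-state MDP (the transition table `p`, the policy, and the discount factor are invented purely for illustration). Repeatedly applying the Bellman equation for $v_\pi$ as an update rule converges to the value function, which is the idea behind iterative policy evaluation:

```python
# A minimal sketch of evaluating v_pi on a hypothetical two-state MDP.
gamma = 0.9  # discount factor (assumed for this example)

# p[s][a] is a list of (next_state, reward, probability) outcomes,
# i.e. a tabular form of p(s', r | s, a); policy[s][a] is pi(a | s).
p = {
    0: {0: [(0, 1.0, 0.5), (1, 0.0, 0.5)],  # action 0 in state 0
        1: [(1, 2.0, 1.0)]},                # action 1 in state 0
    1: {0: [(0, 0.0, 1.0)],                 # action 0 in state 1
        1: [(1, -1.0, 1.0)]},               # action 1 in state 1
}
policy = {
    0: {0: 0.4, 1: 0.6},
    1: {0: 0.7, 1: 0.3},
}

def bellman_state_value(v, s):
    """Right-hand side of the Bellman equation for v_pi at state s:
    sum_a pi(a|s) * sum_{s',r} p(s',r|s,a) * (r + gamma * v(s'))."""
    total = 0.0
    for a, pi_a in policy[s].items():
        for s_next, r, prob in p[s][a]:
            total += pi_a * prob * (r + gamma * v[s_next])
    return total

# Repeatedly applying the equation as an update rule converges to v_pi.
v = {s: 0.0 for s in p}
for _ in range(500):
    v = {s: bellman_state_value(v, s) for s in p}

print(v)  # approximate v_pi for each state
```

Each sweep replaces $v(s)$ with the right-hand side of the Bellman equation; since the equation holds exactly only for the true $v_\pi$, the fixed point of this update is the desired value function.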

Action Value Function

Here is an action value function in compact form:

$$q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$$

The derivation of the Bellman equation for this function is quite similar to the previous one:

$$
\begin{aligned}
q_\pi(s, a) &= \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]\\
&= \mathbb{E}_\pi[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s, A_t = a]\\
&= \mathbb{E}_\pi\Bigl[R_{t+1} + \gamma \sum_{k=0}^\infty \gamma^k R_{t+k+2} \,\Big|\, S_t = s, A_t = a\Bigr]\\
&= \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a]\\
&= \sum_{s', r} p(s', r \mid s, a)\Bigl(r + \gamma \mathbb{E}_\pi\bigl[G_{t+1} \mid S_{t+1} = s'\bigr]\Bigr)\\
&= \sum_{s', r} p(s', r \mid s, a)\Bigl(r + \gamma \sum_{a'} \pi(a' \mid s')\,\mathbb{E}_\pi\bigl[G_{t+1} \mid S_{t+1} = s', A_{t+1} = a'\bigr]\Bigr)\\
&= \sum_{s', r} p(s', r \mid s, a)\Bigl(r + \gamma \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a')\Bigr)
\end{aligned}
$$

The last equation in this chain is the Bellman equation for the action value function.

Intuition

To find the value of a state-action pair $(s, a)$, you:

  1. Consider all possible next states $s'$ and rewards $r$, weighted by their likelihood $p(s', r \mid s, a)$;
  2. For each of these outcomes, take the immediate reward $r$ plus the discounted value of the next state;
  3. To compute the value of the next state $s'$, multiply each action value $q_\pi(s', a')$ by the probability of choosing $a'$ in state $s'$ under the current policy, $\pi(a' \mid s')$, and then sum these products over all actions $a'$ available in $s'$.

By summing all these possibilities together, you get the total expected value of the state-action pair $(s, a)$ under your current policy.
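
As with the state value function, this equation can be checked numerically. The sketch below reuses the same hypothetical MDP (again, the transition table, policy, and discount factor are invented for illustration) and applies the Bellman equation for $q_\pi$ as an update rule:

```python
# A minimal sketch of evaluating q_pi on the same hypothetical two-state MDP.
gamma = 0.9  # discount factor (assumed for this example)

# p[s][a] -> list of (next_state, reward, probability); policy[s][a] -> pi(a | s).
p = {
    0: {0: [(0, 1.0, 0.5), (1, 0.0, 0.5)], 1: [(1, 2.0, 1.0)]},
    1: {0: [(0, 0.0, 1.0)], 1: [(1, -1.0, 1.0)]},
}
policy = {0: {0: 0.4, 1: 0.6}, 1: {0: 0.7, 1: 0.3}}

def bellman_action_value(q, s, a):
    """Right-hand side of the Bellman equation for q_pi at (s, a):
    sum_{s',r} p(s',r|s,a) * (r + gamma * sum_{a'} pi(a'|s') * q(s',a'))."""
    total = 0.0
    for s_next, r, prob in p[s][a]:
        next_value = sum(policy[s_next][a_next] * q[(s_next, a_next)]
                         for a_next in policy[s_next])
        total += prob * (r + gamma * next_value)
    return total

# Repeatedly applying the equation as an update rule converges to q_pi.
q = {(s, a): 0.0 for s in p for a in p[s]}
for _ in range(500):
    q = {(s, a): bellman_action_value(q, s, a) for s in p for a in p[s]}

print(q)  # approximate q_pi for each state-action pair
```

Note that $v_\pi(s) = \sum_a \pi(a \mid s)\, q_\pi(s, a)$, so the values produced by the two sketches are consistent with each other.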

