Action Values

The action value is a fundamental concept in the MAB problem. It plays a pivotal role in many bandit algorithms, including epsilon-greedy and upper confidence bound, and its primary purpose is to estimate the expected reward of choosing a specific action. It is similar to a state-action value, but is independent of any state due to the stateless nature of the MAB problem. Understanding action values is essential for implementing effective bandit algorithms.

Definition of Action Value

Formally, the action value, denoted as $Q(a)$, represents the expected reward of choosing action $a$:

$$Q(a) = \mathbb{E}[R \mid A = a]$$

where:

  • $R$ is the reward received;
  • $A$ is the action selected.

Since the true reward distribution is typically unknown, we have to estimate $Q(a)$ using observed data.
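For example, consider a hypothetical action $a$ that yields a reward of $1$ with probability $0.7$ and $0$ otherwise (the numbers here are made up purely for illustration). Its true action value is

$$Q(a) = \mathbb{E}[R \mid A = a] = 0.7 \cdot 1 + 0.3 \cdot 0 = 0.7$$

In practice, however, this distribution is hidden from the agent, which is why the estimation methods below are needed.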

Estimating Action Values

There are several ways to estimate $Q(a)$ based on observed rewards. The most common method is the sample average estimate, which calculates the mean reward received from selecting action $a$ up to time $t$:

$$Q_t(a) = \frac{R_1 + R_2 + \dots + R_{N_t(a)}}{N_t(a)} = \frac{\sum_{i=1}^{N_t(a)} R_i}{N_t(a)}$$

where:

  • $Q_t(a)$ is the estimated value of action $a$ at time step $t$;
  • $N_t(a)$ is the number of times action $a$ has been chosen up to time $t$;
  • $R_i$ is the reward obtained on the $i$-th time action $a$ was selected.

As more samples are collected, this estimate converges to the true expected reward $Q_*(a)$, assuming the reward distribution remains stationary.
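As a quick illustration, here is a minimal Python sketch of the sample average estimate for a single action, assuming its observed rewards have already been collected in a list (the reward values are made up):

```python
# Hypothetical rewards observed so far when selecting one particular action
rewards = [1.0, 0.0, 1.0, 1.0]

# Sample average estimate: Q_t(a) = (R_1 + ... + R_{N_t(a)}) / N_t(a)
q_estimate = sum(rewards) / len(rewards)
print(q_estimate)  # 0.75
```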

Incremental Update Rule

While the formula above can be used to estimate action values, doing so directly requires storing every previous reward and recomputing their sum at each time step. With incremental updates, this becomes unnecessary. The incremental update rule can be derived as follows:

$$\begin{aligned}
Q_{k+1} &= \frac{1}{k} \sum_{i=1}^{k} R_i\\
&= \frac{1}{k} \left(R_k + \sum_{i=1}^{k-1} R_i\right)\\
&= \frac{1}{k} \left(R_k + (k-1) Q_k\right)\\
&= \frac{1}{k} \left(R_k + k Q_k - Q_k\right)\\
&= Q_k + \frac{1}{k} \left(R_k - Q_k\right)
\end{aligned}$$

where, for a given action:

  • $Q_k$ is the estimate before the $k$-th reward is observed, equal to the average of the first $k-1$ rewards;
  • $R_k$ is the actual $k$-th reward received.
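To make the derivation concrete, the sketch below (using a made-up reward list) applies the incremental rule step by step and checks that it reproduces the plain sample average without storing all past rewards:

```python
import math

# Hypothetical rewards observed for one action
rewards = [1.0, 0.0, 1.0, 1.0]

q = 0.0  # running estimate; the very first update overwrites this initial value
for k, reward in enumerate(rewards, start=1):
    # Q_{k+1} = Q_k + (1/k) * (R_k - Q_k):
    # move the estimate toward the new reward by 1/k of the current error
    q += (reward - q) / k

# The incremental estimate matches the batch sample average (up to float error)
print(math.isclose(q, sum(rewards) / len(rewards)))  # True
```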

Intuition

Given the current estimate $Q_k$, which serves as a prediction of the $k$-th reward, and the actual $k$-th reward $R_k$, the error can be measured as the difference between these two values. The next estimate is then obtained by adjusting the previous one slightly in the direction of the actual reward, reducing this error.

This intuition leads to a more general form of the update rule:

$$Q_{k+1} = Q_k + \alpha (R_k - Q_k)$$

where $\alpha$ is a step-size parameter controlling the rate of learning. As in the previous formula, $\alpha$ can be set to $\frac{1}{k}$, which yields the sample average estimate. Alternatively, a constant $\alpha$ is commonly used, as it requires no additional memory (to store how many times an action has been taken) and allows adaptation to non-stationary environments by placing more weight on recent observations.
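The sketch below (with an arbitrary reward stream and an arbitrary $\alpha$) illustrates the non-stationary case: after the underlying reward level shifts, the constant step-size estimate tracks the recent rewards rather than the long-run average:

```python
# Hypothetical reward stream whose underlying mean jumps up halfway through
rewards = [0.2, 0.3, 0.1, 0.2, 0.9, 1.0, 0.8, 0.9]

alpha = 0.5  # constant step size: recent rewards receive exponentially more weight
q = 0.0

for reward in rewards:
    # Q_{k+1} = Q_k + alpha * (R_k - Q_k)
    q += alpha * (reward - q)

# The estimate ends up near the recent rewards (~0.84),
# not the overall average of all rewards (0.55)
print(round(q, 3))
```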

Optimistic Initialization

At the beginning of training, action value estimates are based on very few samples and can vary significantly, which may lead to premature exploitation. This means the agent may exploit its initial knowledge too early, favoring suboptimal actions based on limited experience. To mitigate this issue and encourage initial exploration, one simple and effective technique is optimistic initialization.

In optimistic initialization, action values are initialized to relatively high values (e.g., $Q_0(a) = 1$ instead of $0$). This creates the impression that all actions are initially promising, so the agent is incentivized to explore each action multiple times before settling on the best choice. This technique is most effective when used in combination with a constant step size.
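As a rough illustration, here is a minimal sketch of a purely greedy agent that combines optimistic initial values with a constant step size; the three-armed bandit, its true values, and all hyperparameters below are assumptions made up for this example:

```python
import numpy as np

rng = np.random.default_rng(0)
true_values = np.array([0.2, 0.5, 0.8])  # hidden expected rewards of three arms

q = np.full(3, 1.0)  # optimistic initial estimates: every arm looks promising
alpha = 0.1          # constant step size, combined with optimism as suggested above

for _ in range(200):
    action = int(np.argmax(q))                     # purely greedy selection
    reward = rng.normal(true_values[action], 0.1)  # noisy reward from the chosen arm
    q[action] += alpha * (reward - q[action])      # constant step-size update

# The optimism forces the agent to try every arm before its estimates settle,
# so it typically ends up preferring the best arm (index 2 here)
print(np.argmax(q), np.round(q, 2))
```

Because every estimate starts higher than the arms' true values, early pulls tend to be "disappointing", which pushes the agent to keep switching arms until the estimates approach their true values.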
