Gradient Bandits Algorithm

When dealing with multi-armed bandits, traditional methods like epsilon-greedy and UCB estimate action values to decide which action to take. However, gradient bandits take a different approach — they learn preferences for actions instead of estimating their values. These preferences are adjusted over time using stochastic gradient ascent.

Preferences

Instead of maintaining action value estimates Q(a), gradient bandits maintain preference values H(a) for each action a. These preferences are updated using a stochastic gradient ascent approach to maximize expected rewards. Each action's probability is computed using a softmax function:

P(A_t = a) = \frac{e^{H_t(a)}}{\sum_{b=1}^n e^{H_t(b)}} = \pi_t(a)

where:

  • H_t(a) is the preference for action a at time step t;
  • P(A_t = a) is the probability of selecting action a at time step t;
  • The denominator ensures that probabilities sum to 1.

Softmax is a crucial function in ML, commonly used to convert a vector of real numbers into a probability distribution. It serves as a smooth approximation to the argmax function, enabling natural exploration by giving lower-preference actions a non-zero chance of being selected.
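As a concrete illustration, here is a minimal NumPy sketch of this softmax computation over preferences; the function and variable names are illustrative rather than taken from any particular library:

```python
import numpy as np

def softmax(preferences):
    """Convert preference values H(a) into selection probabilities pi(a)."""
    # Subtracting the maximum preference keeps the exponentials numerically
    # stable; it does not change the result, since only differences matter.
    exp_h = np.exp(preferences - np.max(preferences))
    return exp_h / exp_h.sum()

H = np.array([0.5, 1.2, -0.3, 0.0])  # example preferences for 4 actions
pi = softmax(H)
print(pi, pi.sum())  # the probabilities sum to 1
```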

Update Rule

After selecting an action A_t at time step t, the preference values are updated using the following rule:

\begin{aligned}
H_{t+1}(A_t) &\gets H_t(A_t) + \alpha (R_t - \bar R_t)\bigl(1 - \pi_t(A_t)\bigr) \\
H_{t+1}(a) &\gets H_t(a) - \alpha (R_t - \bar R_t)\,\pi_t(a) \qquad \forall a \ne A_t
\end{aligned}

where:

  • α is the step-size,
  • R_t is the reward received,
  • R̄_t is the average reward observed so far.
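Both cases of this update can be applied in a single vectorized step, since the selected action receives a coefficient of (1 - π_t(A_t)) and every other action receives -π_t(a). Below is a minimal sketch assuming NumPy arrays; the names H, pi, baseline, and alpha are illustrative:

```python
import numpy as np

def update_preferences(H, pi, action, reward, baseline, alpha):
    """Apply one gradient-bandit preference update after taking `action`."""
    # one_hot is 1 for the selected action and 0 elsewhere, so this single
    # vectorized expression covers both cases of the update rule.
    one_hot = np.zeros_like(H)
    one_hot[action] = 1.0
    return H + alpha * (reward - baseline) * (one_hot - pi)
```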

Intuition

At each time step, every preference is shifted slightly. The direction of the shift depends on how the received reward compares to the average reward:

  • If the received reward is higher than average, the selected action becomes more preferred, while the other actions become less preferred;
  • If the received reward is lower than average, the selected action's preference decreases, while the preferences of the other actions increase, encouraging exploration.

Sample Code

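Below is a minimal end-to-end sketch of a gradient bandit agent on a simulated Gaussian bandit. The environment (normally distributed rewards), the hyperparameters, and all variable names are illustrative assumptions, not a prescribed implementation:

```python
import numpy as np

np.random.seed(42)  # illustrative seed for reproducibility

n_actions = 5
true_means = np.random.normal(0, 1, n_actions)  # hidden mean reward of each arm

alpha = 0.1              # step-size
n_steps = 1000
H = np.zeros(n_actions)  # action preferences, initialized to zero
baseline = 0.0           # running average of observed rewards

def softmax(h):
    exp_h = np.exp(h - np.max(h))  # numerically stable softmax
    return exp_h / exp_h.sum()

for t in range(1, n_steps + 1):
    pi = softmax(H)
    action = np.random.choice(n_actions, p=pi)        # sample A_t from pi_t
    reward = np.random.normal(true_means[action], 1)  # observe R_t

    baseline += (reward - baseline) / t               # incremental average reward

    # Gradient ascent on preferences: raise the selected action's preference
    # when the reward beats the baseline, lower the others (and vice versa).
    one_hot = np.zeros(n_actions)
    one_hot[action] = 1.0
    H += alpha * (reward - baseline) * (one_hot - pi)

print("True means:       ", np.round(true_means, 2))
print("Learned policy pi:", np.round(softmax(H), 2))
print("Greedy action:", np.argmax(H), "| true best action:", np.argmax(true_means))
```

With these settings the probability mass typically concentrates on the arm with the highest true mean. The incremental average used as the baseline here is one common choice; any action-independent baseline would also work, as discussed below.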

Additional Information

Gradient bandits have several interesting properties:

  • Preference relativity: the absolute values of action preferences do not affect the action selection process — only their relative differences matter. Shifting all preferences by the same constant (e.g., adding 100) results in the same probability distribution, as shown in the sketch after this list;
  • Effect of the baseline in the update rule: although the update formula typically includes the average reward as a baseline, this value can be replaced with any constant that is independent of the chosen action. The baseline influences the speed of convergence but does not alter the optimal solution;
  • Impact of the step-size: the step-size should be tuned based on the task at hand. A smaller step-size ensures more stable learning, while a larger value accelerates the learning process.
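A quick way to see the preference-relativity property in code is to compare softmax outputs before and after adding the same constant to every preference; this small check reuses the illustrative softmax helper from above:

```python
import numpy as np

def softmax(h):
    exp_h = np.exp(h - np.max(h))
    return exp_h / exp_h.sum()

H = np.array([0.5, 1.2, -0.3])
print(np.allclose(softmax(H), softmax(H + 100)))  # True: only differences matter
```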

Summary

Gradient bandits provide a powerful alternative to traditional bandit algorithms by leveraging preference-based learning. Their most notable feature is their ability to naturally balance exploration and exploitation.

