Gradient Bandits Algorithm

When dealing with multi-armed bandits, traditional methods like epsilon-greedy and UCB estimate action values to decide which action to take. However, gradient bandits take a different approach — they learn preferences for actions instead of estimating their values. These preferences are adjusted over time using stochastic gradient ascent.

Preferences

Instead of maintaining action value estimates Q(a), gradient bandits maintain preference values H(a) for each action a. These preferences are updated using a stochastic gradient ascent approach to maximize expected rewards. Each action's probability is computed using a softmax function:

P(A_t = a) = \frac{e^{H_t(a)}}{\sum_{b=1}^n e^{H_t(b)}} = \pi_t(a)

where:

  • H_t(a) is the preference for action a at time step t;
  • P(A_t = a) is the probability of selecting action a at time step t;
  • The denominator ensures that probabilities sum to 1.

Softmax is a crucial function in ML, commonly used to convert a vector of real numbers into a probability distribution. It serves as a smooth approximation to the argmax function, enabling natural exploration by giving lower-preference actions a non-zero chance of being selected.
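As a concrete illustration, here is a minimal NumPy sketch of this softmax computation over preferences; the function and variable names are illustrative rather than taken from any particular library:

```python
import numpy as np

def softmax(preferences):
    """Convert preference values H(a) into selection probabilities pi(a)."""
    # Subtracting the maximum preference keeps the exponentials numerically
    # stable; it does not change the result, since only differences matter.
    exp_h = np.exp(preferences - np.max(preferences))
    return exp_h / exp_h.sum()

H = np.array([0.5, 1.2, -0.3, 0.0])  # example preferences for 4 actions
pi = softmax(H)
print(pi, pi.sum())  # the probabilities sum to 1
```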

Update Rule

After selecting an action A_t at time step t, the preference values are updated using the following rule:

\begin{aligned}
H_{t+1}(A_t) &\gets H_t(A_t) + \alpha (R_t - \bar R_t)\bigl(1 - \pi_t(A_t)\bigr) \\
H_{t+1}(a) &\gets H_t(a) - \alpha (R_t - \bar R_t)\,\pi_t(a) \qquad \forall a \ne A_t
\end{aligned}

where:

  • α is the step-size,
  • R_t is the reward received,
  • R̄_t is the average reward observed so far.
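Both cases of this update can be applied in a single vectorized step, since the selected action receives a coefficient of (1 - π_t(A_t)) and every other action receives -π_t(a). Below is a minimal sketch assuming NumPy arrays; the names H, pi, baseline, and alpha are illustrative:

```python
import numpy as np

def update_preferences(H, pi, action, reward, baseline, alpha):
    """Apply one gradient-bandit preference update after taking `action`."""
    # one_hot is 1 for the selected action and 0 elsewhere, so this single
    # vectorized expression covers both cases of the update rule.
    one_hot = np.zeros_like(H)
    one_hot[action] = 1.0
    return H + alpha * (reward - baseline) * (one_hot - pi)
```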

Intuition

At each time step, every preference is shifted slightly. The direction of the shift depends on how the received reward compares to the average reward:

  • If the received reward is higher than average, the selected action becomes more preferred, while the other actions become less preferred;
  • If the received reward is lower than average, the selected action's preference decreases, while the preferences of the other actions increase, encouraging exploration.

Sample Code

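Below is a minimal end-to-end sketch of a gradient bandit agent on a simulated Gaussian bandit. The environment (normally distributed rewards), the hyperparameters, and all variable names are illustrative assumptions, not a prescribed implementation:

```python
import numpy as np

np.random.seed(42)  # illustrative seed for reproducibility

n_actions = 5
true_means = np.random.normal(0, 1, n_actions)  # hidden mean reward of each arm

alpha = 0.1              # step-size
n_steps = 1000
H = np.zeros(n_actions)  # action preferences, initialized to zero
baseline = 0.0           # running average of observed rewards

def softmax(h):
    exp_h = np.exp(h - np.max(h))  # numerically stable softmax
    return exp_h / exp_h.sum()

for t in range(1, n_steps + 1):
    pi = softmax(H)
    action = np.random.choice(n_actions, p=pi)        # sample A_t from pi_t
    reward = np.random.normal(true_means[action], 1)  # observe R_t

    baseline += (reward - baseline) / t               # incremental average reward

    # Gradient ascent on preferences: raise the selected action's preference
    # when the reward beats the baseline, lower the others (and vice versa).
    one_hot = np.zeros(n_actions)
    one_hot[action] = 1.0
    H += alpha * (reward - baseline) * (one_hot - pi)

print("True means:       ", np.round(true_means, 2))
print("Learned policy pi:", np.round(softmax(H), 2))
print("Greedy action:", np.argmax(H), "| true best action:", np.argmax(true_means))
```

With these settings the probability mass typically concentrates on the arm with the highest true mean. The incremental average used as the baseline here is one common choice; any action-independent baseline would also work, as discussed below.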

Additional Information

Gradient bandits have several interesting properties:

  • Preference relativity: the absolute values of action preferences do not affect the action selection process — only their relative differences matter. Shifting all preferences by the same constant (e.g., adding 100) results in the same probability distribution, as shown in the sketch after this list;
  • Effect of the baseline in the update rule: although the update formula typically includes the average reward as a baseline, this value can be replaced with any constant that is independent of the chosen action. The baseline influences the speed of convergence but does not alter the optimal solution;
  • Impact of the step-size: the step-size should be tuned based on the task at hand. A smaller step-size ensures more stable learning, while a larger value accelerates the learning process.
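A quick way to see the preference-relativity property in code is to compare softmax outputs before and after adding the same constant to every preference; this small check reuses the illustrative softmax helper from above:

```python
import numpy as np

def softmax(h):
    exp_h = np.exp(h - np.max(h))
    return exp_h / exp_h.sum()

H = np.array([0.5, 1.2, -0.3])
print(np.allclose(softmax(H), softmax(H + 100)))  # True: only differences matter
```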

Summary

Gradient bandits provide a powerful alternative to traditional bandit algorithms by leveraging preference-based learning. Their most notable feature is their ability to naturally balance exploration and exploitation.

