Upper Confidence Bound Algorithm

The upper confidence bound (UCB) algorithm is a popular and effective approach to the multi-armed bandit problem. It comes with strong mathematical guarantees of fast convergence because it directs exploration toward actions whose values are still uncertain.

Despite its effectiveness in solving the MAB problem, the UCB algorithm has some notable limitations that restrict its application in broader reinforcement learning:

  • Assumption of stationary rewards: the UCB algorithm assumes that reward distributions do not change over time;
  • Constraints on state and action spaces: before it can start selecting actions according to its formula, the UCB algorithm must try every action in every state at least once.

While the first limitation can be addressed by modifying the algorithm slightly, the second limitation remains a significant challenge in many practical applications.
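
For instance, one common way to relax the stationarity assumption (sketched below, and not taken from this chapter) is to replace the sample-average value update with a constant step size so that recent rewards carry more weight; `alpha` here is an assumed hyperparameter.

```python
import numpy as np

alpha = 0.1                # constant step size (assumed value); recent rewards weigh more
Q = np.zeros(3)            # estimated action values
action, reward = 2, 1.0    # an example observation

# A sample-average update, Q[a] += (reward - Q[a]) / N[a], assumes stationary rewards.
# A constant step size instead keeps tracking reward distributions that drift over time.
Q[action] += alpha * (reward - Q[action])
print(Q)  # [0.  0.  0.1]
```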

How it Works

The UCB algorithm balances exploration and exploitation by assigning a confidence interval to each action's estimated value and selecting the action with the highest upper bound. This approach ensures that actions with uncertain rewards are explored while favoring actions that appear to be optimal.

The steps of the UCB algorithm are identical to those of the epsilon-greedy algorithm, except for how an action is chosen. The UCB algorithm selects an action $A_t$ at time step $t$ using the following formula:

$$A_t = \argmax_a \Biggl( Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \Biggr)$$

where:

  • $Q_t(a)$ is the estimated reward of action $a$ at time $t$;
  • $N_t(a)$ is the number of times action $a$ has been selected up to time $t$;
  • $c > 0$ is a tunable parameter that controls the balance between exploration and exploitation, similar to $\varepsilon$ in the $\varepsilon$-greedy algorithm;
  • $\ln$ is the natural logarithm function;
  • $\argmax$ denotes the value of the argument ($a$, in this case) that maximizes the expression.
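
As a quick illustration of this selection rule (not part of the original chapter), the formula can be evaluated directly with NumPy. The arrays `Q` and `N` and the values of `t` and `c` below are made-up placeholders for the quantities defined above, and every action is assumed to have already been tried at least once so that `N` contains no zeros.

```python
import numpy as np

Q = np.array([0.4, 0.6, 0.5])   # estimated action values Q_t(a)
N = np.array([10, 5, 20])       # selection counts N_t(a)
t, c = 35, 1.0                  # current time step and exploration parameter

ucb_values = Q + c * np.sqrt(np.log(t) / N)   # upper confidence bound for each action
action = int(np.argmax(ucb_values))           # A_t: action with the highest upper bound
print(ucb_values, action)
```

In this toy example the least-tried action (index 1) receives the largest exploration bonus and is selected, even though its estimated value alone is not the highest.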

Intuition

$\argmax$ chooses the action that maximizes the sum of two parts: the estimated action value and a confidence interval. The confidence interval is scaled by a factor $c$, where larger values make the interval wider, meaning the agent is less confident about the action's value, which encourages exploration.

The size of this confidence interval depends on two factors:

  1. Time: as more time passes, the agent becomes less confident in the action's value, so the interval widens;
  2. Action frequency: the more often an action is chosen, the more confident the agent becomes in its value, so the interval shrinks.
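
For example, with $c = 1$ at time step $t = 100$, an action that has been selected only 5 times receives an exploration bonus of $\sqrt{\ln 100 / 5} \approx 0.96$, while an action selected 50 times receives only $\sqrt{\ln 100 / 50} \approx 0.30$, so the rarely tried action is much more likely to be picked even if its estimated value is somewhat lower.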

Sample Code

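The interactive code sample from the course is not reproduced here. As a stand-in, the following is a minimal sketch of the algorithm described above, assuming a Gaussian-reward bandit; the function name `run_ucb`, the arm means, and all hyperparameter values are illustrative choices rather than part of the course material.

```python
import numpy as np

def run_ucb(true_means, steps=1000, c=1.0, seed=0):
    """Run the UCB algorithm on a Gaussian-reward bandit and return total reward."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_means)
    Q = np.zeros(n_arms)   # estimated value of each action
    N = np.zeros(n_arms)   # number of times each action was selected
    total_reward = 0.0

    for t in range(1, steps + 1):
        if t <= n_arms:
            # Play every arm once so that N(a) > 0 before applying the UCB formula
            action = t - 1
        else:
            ucb_values = Q + c * np.sqrt(np.log(t) / N)
            action = int(np.argmax(ucb_values))

        reward = rng.normal(true_means[action], 1.0)   # sample a noisy reward
        N[action] += 1
        Q[action] += (reward - Q[action]) / N[action]  # incremental sample average
        total_reward += reward

    return total_reward, Q

total, estimates = run_ucb(true_means=[0.2, 0.5, 0.8], steps=1000, c=1.0)
print(f"Total reward: {total:.1f}")
print(f"Estimated values: {np.round(estimates, 2)}")
```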

Additional Information

The UCB algorithm incorporates a mechanism for exploration, which requires careful tuning of the $c$ hyperparameter to function effectively. The optimal $c$ value varies depending on the specific problem. Here are some general guidelines:

  • High variance in rewards: a larger $c$ value ensures sufficient exploration;
  • Stable rewards: a smaller $c$ value allows the algorithm to quickly focus on the optimal action;
  • Common default: a typical starting point is $c = 1$, but it often requires experimental tuning for best results.
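
In practice, the effect of $c$ can also be checked empirically. The short loop below is a hypothetical usage example that reuses the `run_ucb` helper sketched in the Sample Code section and compares total reward for a few candidate values:

```python
# Assumes run_ucb from the sketch in the Sample Code section is already defined.
for c in (0.5, 1.0, 2.0):
    total, _ = run_ucb(true_means=[0.2, 0.5, 0.8], steps=1000, c=c)
    print(f"c = {c}: total reward = {total:.1f}")
```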

Summary

The UCB algorithm is a powerful and well-founded method for balancing exploration and exploitation in multi-armed bandit problems. By selecting actions based on both estimated rewards and uncertainty, it ensures efficient learning while minimizing regret.
