Upper Confidence Bound Algorithm

The upper confidence bound (UCB) algorithm is a popular and effective approach to the multi-armed bandit problem. It comes with strong mathematical guarantees of fast convergence because it directs exploration toward actions whose values are still uncertain.

Despite its effectiveness in solving the MAB problem, the UCB algorithm has some notable limitations that restrict its application in broader reinforcement learning:

  • Assumption of stationary rewards: the UCB algorithm assumes that reward distributions do not change over time;
  • Constraints on state and action spaces: before it can start selecting actions according to its formula, the UCB algorithm must try every action in every state at least once.

While the first limitation can be addressed by modifying the algorithm slightly, the second limitation remains a significant challenge in many practical applications.
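
For instance, one common way to relax the stationarity assumption (sketched below, and not taken from this chapter) is to replace the sample-average value update with a constant step size so that recent rewards carry more weight; `alpha` here is an assumed hyperparameter.

```python
import numpy as np

alpha = 0.1                # constant step size (assumed value); recent rewards weigh more
Q = np.zeros(3)            # estimated action values
action, reward = 2, 1.0    # an example observation

# A sample-average update, Q[a] += (reward - Q[a]) / N[a], assumes stationary rewards.
# A constant step size instead keeps tracking reward distributions that drift over time.
Q[action] += alpha * (reward - Q[action])
print(Q)  # [0.  0.  0.1]
```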

How it Works

The UCB algorithm balances exploration and exploitation by assigning a confidence interval to each action's estimated value and selecting the action with the highest upper bound. This approach ensures that actions with uncertain rewards are explored while favoring actions that appear to be optimal.

The steps of the UCB algorithm are identical to those of the epsilon-greedy algorithm, except for how an action is chosen. The UCB algorithm selects an action $A_t$ at time step $t$ using the following formula:

$$A_t = \argmax_a \Biggl( Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \Biggr)$$

where:

  • $Q_t(a)$ is the estimated reward of action $a$ at time $t$;
  • $N_t(a)$ is the number of times action $a$ has been selected up to time $t$;
  • $c > 0$ is a tunable parameter that controls the balance between exploration and exploitation, similar to $\varepsilon$ in the $\varepsilon$-greedy algorithm;
  • $\ln$ is the natural logarithm function;
  • $\argmax$ denotes the value of the argument ($a$, in this case) that maximizes the expression.
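
As a quick illustration of this selection rule (not part of the original chapter), the formula can be evaluated directly with NumPy. The arrays `Q` and `N` and the values of `t` and `c` below are made-up placeholders for the quantities defined above, and every action is assumed to have already been tried at least once so that `N` contains no zeros.

```python
import numpy as np

Q = np.array([0.4, 0.6, 0.5])   # estimated action values Q_t(a)
N = np.array([10, 5, 20])       # selection counts N_t(a)
t, c = 35, 1.0                  # current time step and exploration parameter

ucb_values = Q + c * np.sqrt(np.log(t) / N)   # upper confidence bound for each action
action = int(np.argmax(ucb_values))           # A_t: action with the highest upper bound
print(ucb_values, action)
```

In this toy example the least-tried action (index 1) receives the largest exploration bonus and is selected, even though its estimated value alone is not the highest.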

Intuition

$\argmax$ chooses the action that maximizes the sum of two parts: the estimated action value and a confidence interval. The confidence interval is scaled by a factor $c$, where larger values make the interval wider, meaning the agent is less confident about the action's value, which encourages exploration.

The size of this confidence interval depends on two factors:

  1. Time: as more time passes, the agent becomes less confident in the action's value, so the interval widens;
  2. Action frequency: the more often an action is chosen, the more confident the agent becomes in its value, so the interval shrinks.
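
For example, with $c = 1$ at time step $t = 100$, an action that has been selected only 5 times receives an exploration bonus of $\sqrt{\ln 100 / 5} \approx 0.96$, while an action selected 50 times receives only $\sqrt{\ln 100 / 50} \approx 0.30$, so the rarely tried action is much more likely to be picked even if its estimated value is somewhat lower.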

Sample Code

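The interactive code sample from the course is not reproduced here. As a stand-in, the following is a minimal sketch of the algorithm described above, assuming a Gaussian-reward bandit; the function name `run_ucb`, the arm means, and all hyperparameter values are illustrative choices rather than part of the course material.

```python
import numpy as np

def run_ucb(true_means, steps=1000, c=1.0, seed=0):
    """Run the UCB algorithm on a Gaussian-reward bandit and return total reward."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_means)
    Q = np.zeros(n_arms)   # estimated value of each action
    N = np.zeros(n_arms)   # number of times each action was selected
    total_reward = 0.0

    for t in range(1, steps + 1):
        if t <= n_arms:
            # Play every arm once so that N(a) > 0 before applying the UCB formula
            action = t - 1
        else:
            ucb_values = Q + c * np.sqrt(np.log(t) / N)
            action = int(np.argmax(ucb_values))

        reward = rng.normal(true_means[action], 1.0)   # sample a noisy reward
        N[action] += 1
        Q[action] += (reward - Q[action]) / N[action]  # incremental sample average
        total_reward += reward

    return total_reward, Q

total, estimates = run_ucb(true_means=[0.2, 0.5, 0.8], steps=1000, c=1.0)
print(f"Total reward: {total:.1f}")
print(f"Estimated values: {np.round(estimates, 2)}")
```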

Additional Information

The UCB algorithm incorporates a mechanism for exploration, which requires careful tuning of the $c$ hyperparameter to function effectively. The optimal $c$ value varies depending on the specific problem. Here are some general guidelines:

  • High variance in rewards: a larger $c$ value ensures sufficient exploration;
  • Stable rewards: a smaller $c$ value allows the algorithm to quickly focus on the optimal action;
  • Common default: a typical starting point is $c = 1$, but it often requires experimental tuning for best results.
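
In practice, the effect of $c$ can also be checked empirically. The short loop below is a hypothetical usage example that reuses the `run_ucb` helper sketched in the Sample Code section and compares total reward for a few candidate values:

```python
# Assumes run_ucb from the sketch in the Sample Code section is already defined.
for c in (0.5, 1.0, 2.0):
    total, _ = run_ucb(true_means=[0.2, 0.5, 0.8], steps=1000, c=c)
    print(f"c = {c}: total reward = {total:.1f}")
```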

Summary

The UCB algorithm is a powerful and well-founded method for balancing exploration and exploitation in multi-armed bandit problems. By selecting actions based on both estimated rewards and uncertainty, it ensures efficient learning while minimizing regret.
