Upper Confidence Bound Algorithm
The upper confidence bound (UCB) algorithm is a popular and effective approach for solving the multi-armed bandit problem. It comes with strong mathematical guarantees: its regret grows only logarithmically with time, so exploration effort is spent efficiently.
Despite its effectiveness in solving the MAB problem, the UCB algorithm has some notable limitations that restrict its application in broader reinforcement learning:
- Assumption of stationary rewards: the UCB algorithm assumes that reward distributions do not change over time;
- Constraints on state and action spaces: before it can start choosing actions according to its selection rule, the UCB algorithm must try each action in each state at least once, which becomes impractical when these spaces are large.
While the first limitation can be addressed by modifying the algorithm slightly, the second limitation remains a significant challenge in many practical applications.
How it Works
The UCB algorithm balances exploration and exploitation by assigning a confidence interval to each action's estimated value and selecting the action with the highest upper bound. This approach ensures that actions with uncertain rewards are explored while favoring actions that appear to be optimal.
The steps of the UCB algorithm are identical to those of the $\varepsilon$-greedy algorithm, except for the step of choosing an action. The UCB algorithm selects an action $A_t$ at time step $t$ using the following formula:

$$A_t = \arg\max_a \left( Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right)$$

where:
- $Q_t(a)$ is the estimated reward of action $a$ at time $t$;
- $N_t(a)$ is the number of times action $a$ has been selected up to time $t$;
- $c$ is a tunable parameter that controls the balance between exploration and exploitation, similar to $\varepsilon$ in the $\varepsilon$-greedy algorithm;
- $\ln$ is the natural logarithm function;
- $\arg\max$ is the value of the argument ($a$, in this case) that maximizes the expression.
Intuition
UCB chooses the action that maximizes the sum of two parts: the estimated action value $Q_t(a)$ and a confidence interval $c \sqrt{\frac{\ln t}{N_t(a)}}$. The confidence interval is scaled by the factor $c$: larger values make the interval wider, meaning the agent is less certain about the action's value, which encourages exploration.
The size of this confidence interval depends on two factors, as the worked example after this list shows:
- Time: as total time $t$ grows, the $\ln t$ term increases, so the interval for an action widens and the agent becomes less confident in its value unless that action keeps being selected;
- Action frequency: the more often an action is chosen, the larger $N_t(a)$ becomes, shrinking the interval and making the agent more confident in its value.
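To make this concrete, consider an illustrative situation (the numbers here are hypothetical, not from the original lesson) with $c = 2$ and $t = 100$:

$$Q_t(a_1) = 0.5,\ N_t(a_1) = 90:\quad 0.5 + 2\sqrt{\frac{\ln 100}{90}} \approx 0.95$$

$$Q_t(a_2) = 0.3,\ N_t(a_2) = 10:\quad 0.3 + 2\sqrt{\frac{\ln 100}{10}} \approx 1.66$$

Although $a_1$ has the higher estimated value, the rarely tried $a_2$ has a much wider confidence interval, so UCB selects $a_2$.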
Sample Code
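The original code sample is not preserved here, so the following is a minimal sketch of UCB on a multi-armed bandit with Gaussian rewards. The helper name `ucb_bandit`, the true means, and the reward model are assumptions for illustration:

```python
import numpy as np

def ucb_bandit(true_means, steps=1000, c=2.0, seed=0):
    """Run UCB on a bandit with Gaussian rewards.

    Returns value estimates, selection counts, and the average reward.
    (Hypothetical helper for illustration.)
    """
    rng = np.random.default_rng(seed)
    n_actions = len(true_means)
    Q = np.zeros(n_actions)  # estimated value of each action
    N = np.zeros(n_actions)  # times each action has been selected

    total_reward = 0.0
    for t in range(1, steps + 1):
        if t <= n_actions:
            # Try each action once first so N[a] > 0 (avoids division by zero)
            action = t - 1
        else:
            # UCB rule: argmax_a [ Q(a) + c * sqrt(ln t / N(a)) ]
            action = int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))

        reward = rng.normal(true_means[action], 1.0)  # noisy reward
        N[action] += 1
        Q[action] += (reward - Q[action]) / N[action]  # incremental mean update
        total_reward += reward

    return Q, N, total_reward / steps

estimates, counts, avg_reward = ucb_bandit([0.1, 0.5, 0.8], steps=2000, c=2.0)
print("Value estimates:", np.round(estimates, 2))
print("Pulls per arm:  ", counts.astype(int))
print("Average reward: ", round(avg_reward, 3))
```

Note that the incremental mean update mirrors the estimation step of the $\varepsilon$-greedy algorithm; only the action-selection line differs.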
Additional Information
The UCB algorithm incorporates a built-in mechanism for exploration, which requires careful tuning of the hyperparameter $c$ to function effectively. The optimal value varies depending on the specific problem. Here are some general guidelines (a small tuning sweep is sketched after this list):
- High variance in rewards: a larger $c$ value ensures sufficient exploration;
- Stable rewards: a smaller $c$ value allows the algorithm to quickly focus on the optimal action;
- Common default: typical starting points in the literature are $c = \sqrt{2}$ (the classic UCB1 choice) and $c = 2$, but the value often requires experimental tuning for best results.
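One simple way to pick $c$ in practice is a small sweep over candidate values. The snippet below reuses the hypothetical `ucb_bandit` helper from the sample above:

```python
import numpy as np  # ucb_bandit from the sample above is assumed to be in scope

# Compare several exploration strengths on the same bandit
for c in [0.5, 1.0, np.sqrt(2), 2.0, 4.0]:
    _, _, avg_reward = ucb_bandit([0.1, 0.5, 0.8], steps=2000, c=c, seed=42)
    print(f"c = {c:.2f} -> average reward = {avg_reward:.3f}")
```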
Summary
The UCB algorithm is a powerful and well-founded method for balancing exploration and exploitation in multi-armed bandit problems. By selecting actions based on both estimated rewards and uncertainty, it ensures efficient learning while minimizing regret.