Learn the Epsilon-Greedy Algorithm | Multi-Armed Bandit Problem
Introduction to Reinforcement Learning
Epsilon-Greedy Algorithm

The epsilon-greedy (ε-greedy) algorithm is a straightforward yet highly effective strategy for addressing the multi-armed bandit problem. Although it may not be as robust as some other methods for this specific task, its simplicity and versatility make it widely applicable in the field of reinforcement learning.

How it Works

The algorithm follows these steps:

  1. Initialize action value estimates Q(a) for each action a.
  2. Choose an action using the following rule:
    • With probability ε: select a random action (exploration);
    • With probability 1 − ε: select the action with the highest estimated value (exploitation).
  3. Execute the action and observe the reward.
  4. Update the action value estimate Q(a) based on the observed reward.
  5. Repeat steps 2-4 for a fixed number of time steps.

The hyperparameter ε (epsilon) controls the trade-off between exploration and exploitation:

  • A high ε (e.g., 0.5) encourages more exploration.
  • A low ε (e.g., 0.01) favors exploitation of the best-known action.

Sample Code

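The original code sample was not preserved on this page. The sketch below is a minimal, self-contained implementation of the steps described above. The bandit environment itself (rewards drawn from a normal distribution around fixed, hidden arm means) is an illustrative assumption, not part of the algorithm.

```python
import numpy as np

def epsilon_greedy(true_means, epsilon=0.1, steps=1000, seed=0):
    """Run ε-greedy on a hypothetical Gaussian bandit and return
    the final value estimates Q and the average reward per step."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_means)
    Q = np.zeros(n_arms)  # action value estimates Q(a)
    N = np.zeros(n_arms)  # number of times each action was taken

    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            action = int(rng.integers(n_arms))  # explore: random action
        else:
            action = int(np.argmax(Q))          # exploit: best-known action

        # Observe a noisy reward from the chosen arm (assumed environment).
        reward = rng.normal(true_means[action], 1.0)

        # Incremental sample-average update of Q(a).
        N[action] += 1
        Q[action] += (reward - Q[action]) / N[action]
        total_reward += reward

    return Q, total_reward / steps

# Three hypothetical arms; the third has the highest true mean reward.
Q, avg_reward = epsilon_greedy([0.0, 0.5, 2.0], epsilon=0.1)
```

With enough steps, the estimates Q(a) approach the true arm means, and the greedy choice settles on the best arm while ε keeps a small amount of exploration going.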

Additional Information

The efficiency of the ε-greedy algorithm depends heavily on the value of ε. Two strategies are commonly used to select this value:

  • Fixed ε: the most generic option, where ε is set to a constant (e.g., 0.1);
  • Decaying ε: ε decreases over time according to some schedule (e.g., starting at 1 and gradually decreasing to 0) to encourage exploration in the early stages.
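A decaying schedule can be sketched as a simple function of the time step. The exponential form and its constants below are illustrative choices, not prescribed by the algorithm:

```python
def decayed_epsilon(step, eps_start=1.0, eps_min=0.01, decay=0.995):
    """Exponentially decay ε from eps_start toward a floor eps_min.
    The decay rate and floor are hypothetical tuning values."""
    return max(eps_min, eps_start * decay ** step)
```

Early on, ε is close to 1 and almost every action is exploratory; as the estimates improve, ε shrinks toward the floor and the agent exploits what it has learned.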

Summary

The ε-greedy algorithm is a baseline approach for balancing exploration and exploitation. While simple, it serves as a foundation for understanding more advanced strategies like upper confidence bound (UCB) and gradient bandits.

