Epsilon-Greedy Algorithm | Multi-Armed Bandit Problem
Introduction to Reinforcement Learning with Python

Epsilon-Greedy Algorithm

The epsilon-greedy (ε-greedy) algorithm is a straightforward yet highly effective strategy for addressing the multi-armed bandit problem. Although it may not be as robust as some other methods for this specific task, its simplicity and versatility make it widely applicable in the field of reinforcement learning.

How it Works

The algorithm follows these steps:

  1. Initialize action value estimates Q(a) for each action a;
  2. Choose an action using the following rule:
    • With probability ε: select a random action (exploration);
    • With probability 1 − ε: select the action with the highest estimated value (exploitation).
  3. Execute the action and observe the reward;
  4. Update the action value estimate Q(a) based on the observed reward;
  5. Repeat steps 2-4 for a fixed number of time steps.

The hyperparameter ε (epsilon) controls the trade-off between exploration and exploitation:

  • A high ε (e.g., 0.5) encourages more exploration;
  • A low ε (e.g., 0.01) favors exploitation of the best-known action.
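This trade-off can be made concrete by writing out the per-action selection probabilities: the exploration step spreads probability mass ε uniformly over all actions, and the remaining 1 − ε goes to the greedy action. The helper below is purely illustrative (the function name and arguments are not from the lesson):

```python
import numpy as np

def selection_probabilities(epsilon, n_actions, greedy_action):
    """Per-action selection probabilities under epsilon-greedy,
    assuming a single fixed greedy action."""
    # The random choice spreads epsilon uniformly over all actions
    probs = np.full(n_actions, epsilon / n_actions)
    # Exploitation adds the remaining probability mass to the greedy action
    probs[greedy_action] += 1 - epsilon
    return probs

print(selection_probabilities(0.5, 4, greedy_action=2))
# -> [0.125 0.125 0.625 0.125]
print(selection_probabilities(0.01, 4, greedy_action=2))
# -> [0.0025 0.0025 0.9925 0.0025]
```

Note that even the greedy action still receives an ε/n share from the exploration step, so its total probability is 1 − ε + ε/n rather than exactly 1 − ε.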

Sample Code

import numpy as np

class EpsilonGreedyAgent:
  def __init__(self, n_actions, epsilon):
    """Initialize an agent"""
    self.n_actions = n_actions # Number of available actions
    self.epsilon = epsilon # Exploration probability
    self.Q = np.zeros(self.n_actions) # Estimated action values
    self.N = np.zeros(self.n_actions) # Action selection counters

  def select_action(self):
    """Select an action according to the epsilon-greedy strategy"""
    # With probability epsilon - random action
    if np.random.rand() < self.epsilon:
      return np.random.randint(self.n_actions)
    # Otherwise - action with highest estimated action value
    else:
      return np.argmax(self.Q)

  def update(self, action, reward):
    """Update the values using the sample average estimate"""
    # Increment the action selection counter
    self.N[action] += 1
    # Update the estimated action value incrementally
    self.Q[action] += (reward - self.Q[action]) / self.N[action]
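The agent above can be exercised on a toy Bernoulli bandit. The reward probabilities, seed, and step count below are illustrative choices, not part of the lesson; with enough steps, the sample-average estimates should converge toward the true arm means:

```python
import numpy as np

class EpsilonGreedyAgent:
    # Same agent as in the sample code, repeated so this example is self-contained
    def __init__(self, n_actions, epsilon):
        self.n_actions = n_actions
        self.epsilon = epsilon
        self.Q = np.zeros(n_actions)
        self.N = np.zeros(n_actions)

    def select_action(self):
        if np.random.rand() < self.epsilon:
            return np.random.randint(self.n_actions)
        return np.argmax(self.Q)

    def update(self, action, reward):
        self.N[action] += 1
        self.Q[action] += (reward - self.Q[action]) / self.N[action]

np.random.seed(42)
true_probs = [0.2, 0.5, 0.8]  # illustrative Bernoulli reward probabilities
agent = EpsilonGreedyAgent(n_actions=3, epsilon=0.1)

for _ in range(5000):
    action = agent.select_action()
    # Bernoulli reward: 1 with probability true_probs[action], else 0
    reward = float(np.random.rand() < true_probs[action])
    agent.update(action, reward)

print(np.argmax(agent.Q))       # the agent should identify arm 2 as best
print(np.round(agent.Q, 2))     # estimates near the true probabilities
```

Because Q is initialized to zeros, early greedy choices can lock onto a suboptimal arm; it is the ε-driven exploration that eventually samples the better arm often enough for its estimate to overtake.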

Additional Information

The efficiency of the ε-greedy algorithm depends heavily on the value of ε. Two strategies are commonly used to select this value:

  • Fixed ε: this is the simplest option, where the value of ε is chosen to be a constant (e.g., 0.1);
  • Decaying ε: the value of ε decreases over time according to some schedule (e.g., starting at 1 and gradually decreasing to 0) to encourage exploration in the early stages.
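A decaying schedule can be implemented in many ways; the linear decay below is just one common choice (exponential decay is another), and the function name and parameters are illustrative:

```python
def linear_epsilon(step, total_steps, eps_start=1.0, eps_end=0.0):
    """Linearly decay epsilon from eps_start to eps_end over total_steps."""
    frac = min(step / total_steps, 1.0)  # clamp so epsilon stays at eps_end afterwards
    return eps_start + frac * (eps_end - eps_start)

# Epsilon shrinks as training progresses, shifting from exploration to exploitation
for step in [0, 250, 500, 750, 1000]:
    print(step, round(linear_epsilon(step, total_steps=1000), 2))
# -> 0 1.0, 250 0.75, 500 0.5, 750 0.25, 1000 0.0
```

In practice, the agent would recompute ε at every step of the loop and use it in place of the fixed `self.epsilon`.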

Summary

The ε-greedy algorithm is a baseline approach for balancing exploration and exploitation. While simple, it serves as a foundation for understanding more advanced strategies like upper confidence bound (UCB) and gradient bandits.
