Epsilon-Greedy Algorithm | Multi-Armed Bandit Problem
Introduction to Reinforcement Learning with Python

Epsilon-Greedy Algorithm

The epsilon-greedy (ε-greedy) algorithm is a straightforward yet highly effective strategy for addressing the multi-armed bandit problem. Although it may not be as robust as some other methods for this specific task, its simplicity and versatility make it widely applicable in the field of reinforcement learning.

How it Works

The algorithm follows these steps:

  1. Initialize action value estimates Q(a) for each action a;
  2. Choose an action using the following rule:
    • With probability ε: select a random action (exploration);
    • With probability 1 − ε: select the action with the highest estimated value (exploitation).
  3. Execute the action and observe the reward;
  4. Update the action value estimate Q(a) based on the observed reward;
  5. Repeat steps 2-4 for a fixed number of time steps.

The hyperparameter ε (epsilon) controls the trade-off between exploration and exploitation:

  • A high ε (e.g., 0.5) encourages more exploration;
  • A low ε (e.g., 0.01) favors exploitation of the best-known action.
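This trade-off can be made concrete by writing out the per-action selection probabilities: the exploration step spreads probability mass ε uniformly over all actions, and the remaining 1 − ε goes to the greedy action. The helper below is purely illustrative (the function name and arguments are not from the lesson):

```python
import numpy as np

def selection_probabilities(epsilon, n_actions, greedy_action):
    """Per-action selection probabilities under epsilon-greedy,
    assuming a single fixed greedy action."""
    # The random choice spreads epsilon uniformly over all actions
    probs = np.full(n_actions, epsilon / n_actions)
    # Exploitation adds the remaining probability mass to the greedy action
    probs[greedy_action] += 1 - epsilon
    return probs

print(selection_probabilities(0.5, 4, greedy_action=2))
# -> [0.125 0.125 0.625 0.125]
print(selection_probabilities(0.01, 4, greedy_action=2))
# -> [0.0025 0.0025 0.9925 0.0025]
```

Note that even the greedy action still receives an ε/n share from the exploration step, so its total probability is 1 − ε + ε/n rather than exactly 1 − ε.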

Sample Code

import numpy as np

class EpsilonGreedyAgent:
  def __init__(self, n_actions, epsilon):
    """Initialize an agent"""
    self.n_actions = n_actions # Number of available actions
    self.epsilon = epsilon # Exploration probability
    self.Q = np.zeros(self.n_actions) # Estimated action values
    self.N = np.zeros(self.n_actions) # Action selection counters

  def select_action(self):
    """Select an action according to the epsilon-greedy strategy"""
    # With probability epsilon - random action
    if np.random.rand() < self.epsilon:
      return np.random.randint(self.n_actions)
    # Otherwise - action with highest estimated action value
    else:
      return np.argmax(self.Q)

  def update(self, action, reward):
    """Update the values using the sample average estimate"""
    # Increment the action selection counter
    self.N[action] += 1
    # Update the estimated action value incrementally
    self.Q[action] += (reward - self.Q[action]) / self.N[action]
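The agent above can be exercised on a toy Bernoulli bandit. The reward probabilities, seed, and step count below are illustrative choices, not part of the lesson; with enough steps, the sample-average estimates should converge toward the true arm means:

```python
import numpy as np

class EpsilonGreedyAgent:
    # Same agent as in the sample code, repeated so this example is self-contained
    def __init__(self, n_actions, epsilon):
        self.n_actions = n_actions
        self.epsilon = epsilon
        self.Q = np.zeros(n_actions)
        self.N = np.zeros(n_actions)

    def select_action(self):
        if np.random.rand() < self.epsilon:
            return np.random.randint(self.n_actions)
        return np.argmax(self.Q)

    def update(self, action, reward):
        self.N[action] += 1
        self.Q[action] += (reward - self.Q[action]) / self.N[action]

np.random.seed(42)
true_probs = [0.2, 0.5, 0.8]  # illustrative Bernoulli reward probabilities
agent = EpsilonGreedyAgent(n_actions=3, epsilon=0.1)

for _ in range(5000):
    action = agent.select_action()
    # Bernoulli reward: 1 with probability true_probs[action], else 0
    reward = float(np.random.rand() < true_probs[action])
    agent.update(action, reward)

print(np.argmax(agent.Q))       # the agent should identify arm 2 as best
print(np.round(agent.Q, 2))     # estimates near the true probabilities
```

Because Q is initialized to zeros, early greedy choices can lock onto a suboptimal arm; it is the ε-driven exploration that eventually samples the better arm often enough for its estimate to overtake.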

Additional Information

The efficiency of the ε-greedy algorithm depends heavily on the value of ε. Two strategies are commonly used to select this value:

  • Fixed ε: this is the simplest option, where the value of ε is chosen to be a constant (e.g., 0.1);
  • Decaying ε: the value of ε decreases over time according to some schedule (e.g., starting at 1 and gradually decreasing to 0) to encourage exploration in the early stages.
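A decaying schedule can be implemented in many ways; the linear decay below is just one common choice (exponential decay is another), and the function name and parameters are illustrative:

```python
def linear_epsilon(step, total_steps, eps_start=1.0, eps_end=0.0):
    """Linearly decay epsilon from eps_start to eps_end over total_steps."""
    frac = min(step / total_steps, 1.0)  # clamp so epsilon stays at eps_end afterwards
    return eps_start + frac * (eps_end - eps_start)

# Epsilon shrinks as training progresses, shifting from exploration to exploitation
for step in [0, 250, 500, 750, 1000]:
    print(step, round(linear_epsilon(step, total_steps=1000), 2))
# -> 0 1.0, 250 0.75, 500 0.5, 750 0.25, 1000 0.0
```

In practice, the agent would recompute ε at every step of the loop and use it in place of the fixed `self.epsilon`.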

Summary

The ε-greedy algorithm is a baseline approach for balancing exploration and exploitation. While simple, it serves as a foundation for understanding more advanced strategies like upper confidence bound (UCB) and gradient bandits.
