Summary  
This chapter introduces the multi-armed bandit problem, a probabilistic decision-making setting in which an agent must balance exploration and exploitation to maximize cumulative reward.

General domain of usage  
Online advertising

The **multi-armed bandit (MAB) problem** is a well-known challenge in reinforcement learning, decision-making, and probability theory. It involves an agent repeatedly choosing between **multiple actions**, each offering a reward drawn from some fixed **probability distribution** that the agent does not know in advance. The goal is to **maximize the return**, that is, the total reward accumulated over a fixed number of **time steps**.
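
The setup described above can be sketched in a few lines of Python. This is only an illustrative sketch, not a reference implementation: the class name `GaussianBandit`, the choice of normally distributed rewards, the arm means, and the uniformly random agent are all assumptions made for the example.

```python
import random


class GaussianBandit:
    """Hypothetical k-armed bandit: each arm pays a reward drawn from a
    fixed normal distribution whose mean the agent does not know."""

    def __init__(self, means, std=1.0, seed=0):
        self.means = list(means)      # true mean reward of each arm (hidden from the agent)
        self.std = std                # shared standard deviation (assumption)
        self.rng = random.Random(seed)

    def pull(self, arm):
        """Sample one reward from the chosen arm's distribution."""
        return self.rng.gauss(self.means[arm], self.std)


def run_random_agent(bandit, steps, seed=1):
    """Pull a uniformly random arm at every time step and return the
    total reward (the return) collected over all steps."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(steps):
        arm = rng.randrange(len(bandit.means))
        total += bandit.pull(arm)
    return total


# Assumed arm means, chosen only for illustration.
bandit = GaussianBandit(means=[0.2, 0.5, 1.0])
total_return = run_random_agent(bandit, steps=1000)
print(f"Return of a uniformly random agent over 1000 steps: {total_return:.1f}")
```

A random agent ignores everything it observes, so its return serves only as a baseline that a better strategy should beat.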

The term **"multi-armed bandit"** comes from the slot machine, nicknamed a **"one-armed bandit"** because of its single lever and its tendency to take players' money. Imagine facing **multiple slot machines**, or one machine with **multiple levers (arms)**, where each arm has its own **distinct probability distribution** over rewards. With only a limited number of pulls available, the task is to choose which lever to pull at each step so that the **total return** is as large as possible.

The **MAB problem** captures the challenge of balancing **exploration** and **exploitation**:

- **Exploration**: trying different arms to gather information about their payouts;
- **Exploitation**: pulling the arm that currently seems best to maximize immediate rewards.

A **naive approach**, such as playing a single arm repeatedly, might lead to **suboptimal returns** if a better arm exists but remains unexplored. Conversely, **excessive exploration** can **waste resources** on low-reward options.
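
The trade-off can be made concrete with a small simulation. The sketch below compares three strategies on hypothetical Bernoulli arms: always playing one arm (no exploration), always choosing at random (no exploitation), and an ε-greedy rule that explores with a small probability and otherwise exploits the best current estimate. The function name `simulate`, the payout probabilities, and ε = 0.1 are assumptions chosen for illustration, and ε-greedy is used here only as one simple way to mix the two behaviors.

```python
import random


def simulate(strategy, probs, steps=5000, epsilon=0.1, seed=0):
    """Return the average reward per step of a strategy on Bernoulli arms.

    probs[a] is the (hidden) probability that arm a pays a reward of 1.
    """
    rng = random.Random(seed)
    k = len(probs)
    counts = [0] * k        # number of pulls per arm
    estimates = [0.0] * k   # sample-average reward observed per arm
    total = 0.0
    for _ in range(steps):
        if strategy == "single_arm":
            arm = 0                          # never explores
        elif strategy == "random":
            arm = rng.randrange(k)           # never exploits
        else:  # "epsilon_greedy"
            if rng.random() < epsilon:
                arm = rng.randrange(k)       # explore a random arm
            else:
                # exploit the best estimate; ties go to the lowest index
                arm = max(range(k), key=lambda a: estimates[a])
        reward = 1.0 if rng.random() < probs[arm] else 0.0
        counts[arm] += 1
        # incremental update of the sample average for the pulled arm
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total += reward
    return total / steps


# Hypothetical payout probabilities; arm 0 is deliberately not the best arm.
probs = [0.3, 0.5, 0.8]
results = {s: simulate(s, probs) for s in ("single_arm", "random", "epsilon_greedy")}
for name, avg in results.items():
    print(f"{name:>15}: average reward {avg:.3f}")
```

Under these assumed probabilities, the single-arm strategy is stuck near 0.3 and the random strategy near the mean payout of about 0.53, while ε-greedy should approach roughly 0.9 · 0.8 + 0.1 · 0.53 ≈ 0.77 once its estimates have settled.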

While originally framed in gambling, the **MAB problem** appears in **many fields**:
- **Online advertising**: choosing the best ad to display based on user engagement;
- **Clinical trials**: testing multiple treatments to find the most effective one;
- **Recommendation systems**: serving the most relevant content to users.
