Introduction to Reinforcement Learning
Problem Introduction

The multi-armed bandit (MAB) problem is a well-known challenge in reinforcement learning, decision-making, and probability theory. It involves an agent repeatedly choosing between multiple actions, each offering a reward from some fixed probability distribution. The goal is to maximize the return over a fixed number of time steps.
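The setup above can be sketched in code. Below is a minimal bandit environment, assuming Gaussian reward distributions for illustration (the class name `Bandit` and the reward model are assumptions, not part of any particular library):

```python
import random

class Bandit:
    """A k-armed bandit: each arm pays out from its own fixed distribution."""

    def __init__(self, means, stdev=1.0, seed=0):
        self.means = means          # true expected reward of each arm (unknown to the agent)
        self.stdev = stdev          # shared noise level of the reward distributions
        self.rng = random.Random(seed)

    def pull(self, arm):
        """Sample a reward from the chosen arm's fixed distribution."""
        return self.rng.gauss(self.means[arm], self.stdev)

# Three arms with different expected rewards; the agent sees only samples.
bandit = Bandit(means=[0.1, 0.5, 0.9])
rewards = [bandit.pull(2) for _ in range(1000)]
print(sum(rewards) / len(rewards))  # averages out near arm 2's true mean, 0.9
```

An agent interacting with this environment only observes the sampled rewards, never the `means` list, which is exactly what makes the problem interesting.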

Origin of the Problem

The term "multi-armed bandit" comes from the analogy to a slot machine, often called a "one-armed bandit" because of its lever. Imagine having multiple slot machines, or a single machine with multiple levers (arms), where each arm is associated with a distinct probability distribution of rewards. The goal is to maximize the return over a limited number of attempts by carefully choosing which lever to pull.

The Challenge

The MAB problem captures the challenge of balancing exploration and exploitation:

  • Exploration: trying different arms to gather information about their payouts;

  • Exploitation: pulling the arm that currently seems best to maximize immediate rewards.

A naive approach, playing a single arm repeatedly, might lead to suboptimal returns if a better arm exists but remains unexplored. Conversely, excessive exploration can waste resources on low-reward options.

Real-World Applications

While originally framed in gambling, the MAB problem appears in many fields:

  • Online advertising: choosing the best ad to display based on user engagement;

  • Clinical trials: testing multiple treatments to find the most effective one;

  • Recommendation systems: serving the most relevant content to users.


Section 2. Chapter 1

