Introduction to Reinforcement Learning
What are Monte Carlo Methods?
Monte Carlo (MC) methods are a class of computational algorithms that rely on random sampling to estimate numerical results. These methods are widely used in fields such as physics, finance, engineering, and machine learning.
Monte Carlo methods are used when deterministic solutions are difficult or impossible to obtain. They replace exact computations with approximations that improve with the number of random samples.
How Do They Work?
Monte Carlo methods vary from one task to another, but most of them follow the same general pattern (a minimal sketch follows this list):
- Define a domain of possible inputs;
- Generate random inputs from a probability distribution;
- Evaluate a function on these inputs;
- Aggregate the results to produce an estimate.
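As a minimal illustration of this pattern (the function and sample count here are arbitrary choices, not tied to any specific task), consider estimating the expected value of a function over uniformly random inputs:

```python
import numpy as np

# 1. Input domain: the interval [0, 1]
# 2. Generation: draw random inputs from a uniform distribution
samples = np.random.uniform(0, 1, 10000)

# 3. Evaluation: apply a function to each input
#    (exp(-x^2) is an arbitrary example function)
values = np.exp(-samples ** 2)

# 4. Aggregation: average the results to estimate E[f(X)]
estimate = values.mean()
print(f"Monte Carlo estimate: {estimate}")
```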
Examples
While the pattern described above may sound complex, these examples should help clarify the idea behind it.
Integral Computation
Computing integrals analytically is a non-trivial task that often requires combining several techniques to arrive at the correct result.
Let's try to apply the Monte Carlo method to solve this integral:

$$\int_0^1 \int_0^1 \frac{1}{1 + (x + y)^2} \, dx \, dy$$
- Input domain: this double integral has two variables, $x$ and $y$, each ranging over $[0, 1]$;
- Generation: both variables are independent of each other and uniformly distributed on $[0, 1]$;
- Evaluation: to get a point value, the integrand $\frac{1}{1 + (x + y)^2}$ is evaluated at each sampled point;
- Aggregation: the value of this integral equals the volume under the surface. This volume can be computed as the product of the base area and the average height. The base area is 1 (the unit square), and the average height is the average of the values obtained in the previous step.
Now, look at the implementation of this process in code:
```python
import numpy as np

result = 0

# Many samples are required for estimates to be precise
for i in range(100000):
    # Generation of random variables
    x, y = np.random.uniform(), np.random.uniform()
    # Computation of point value
    value = 1 / (1 + (x + y) ** 2)
    # Mean aggregation
    result += (value - result) / (i + 1)

true_result = 2*np.arctan(2) - np.pi/2 - (1/2)*np.log(5) + np.log(2)

print(f"Approximated result: {result}")
print(f"True result: {true_result}")
```
Approximation of $\pi$
Approximating $\pi$ is one of the most iconic uses of the Monte Carlo method. It illustrates how random sampling can solve a geometric problem without any complex calculus.
Consider a unit square with a quarter circle inscribed in it:
- The square spans $[0, 1] \times [0, 1]$;
- The quarter circle has radius 1 and is centered at the origin.
The area of the quarter circle is $\frac{\pi r^2}{4}$, or $\frac{\pi}{4}$ since $r = 1$, and the area of the square is 1. Now let's sample random points inside the square. With a big enough sample size, the fraction of points that land inside the quarter circle approaches the ratio of the two areas:

$$\frac{\text{points inside}}{\text{total points}} \approx \frac{\pi}{4}$$

So the value of $\pi$ can be computed as

$$\pi \approx 4 \cdot \frac{\text{points inside}}{\text{total points}}$$
Now, look at the code:
```python
import numpy as np
import matplotlib.pyplot as plt

# Lists for coordinates
inside = []
outside = []

# Many samples are required for estimates to be precise
for _ in range(100000):
    # Generation of random variables
    x, y = np.random.uniform(), np.random.uniform()
    # Splitting points inside and outside of the circle
    if x**2 + y**2 <= 1:
        inside.append((x, y))
    else:
        outside.append((x, y))

# Plotting points
plt.figure(figsize=(6, 6))
plt.scatter(*zip(*inside), color='blue', s=1, label='Inside')
plt.scatter(*zip(*outside), color='red', s=1, label='Outside')
plt.legend()
plt.xlabel("x")
plt.ylabel("y")
plt.show()

estimate = 4 * len(inside) / (len(inside) + len(outside))
print(f"Estimated value of pi: {estimate}")
print(f"True value of pi: {np.pi}")
```
Multi-Armed Bandits
In the multi-armed bandit setting, a key objective is to estimate the action value for each arm — that is, the expected reward of choosing a particular action. One common strategy is to estimate these values by averaging the observed rewards obtained from pulling each arm over time. This technique is, in fact, a Monte Carlo method.
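A minimal sketch of this idea for a single simulated arm is shown below; the reward distribution is an assumption chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical arm: rewards are normally distributed with mean 1.5
# (unknown to the agent); this distribution is assumed for the demo
def pull_arm():
    return rng.normal(loc=1.5, scale=1.0)

q_estimate = 0.0  # running Monte Carlo estimate of the action value

for n in range(1, 10001):
    reward = pull_arm()
    # Incremental average: the same aggregation used in the
    # integral example above
    q_estimate += (reward - q_estimate) / n

print(f"Estimated action value: {q_estimate:.3f}")
```

The incremental update avoids storing all past rewards while producing exactly the same result as averaging them directly.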
Monte Carlo Methods for MDPs
Unlike dynamic programming methods, which rely on a complete and accurate model of the environment’s dynamics, Monte Carlo methods learn solely from experience — that is, from actual or simulated sequences of states, actions, and rewards.
This makes Monte Carlo approaches especially powerful: they don’t require any prior knowledge about how the environment works. Instead, they extract value estimates directly from what happens during interaction. In many real-world scenarios, where modeling the environment is impractical or impossible, this ability to learn from raw experience is a major advantage.
When direct interaction with the environment is costly, risky, or slow, Monte Carlo methods can also learn from simulated experience, provided a reliable simulation exists. This allows for exploration and learning in a controlled, repeatable setting — though it does assume access to a model capable of generating plausible transitions.
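As a hedged sketch of how this looks in practice, the snippet below implements first-visit Monte Carlo prediction on a toy episode generator; the environment, discount factor, and episode count are all assumptions made purely for demonstration:

```python
import random
from collections import defaultdict

GAMMA = 0.9          # discount factor (assumed value)
NUM_EPISODES = 5000  # number of sampled episodes (assumed value)

def generate_episode():
    """Toy stand-in for actual or simulated experience:
    returns a list of (state, reward) pairs ending at a terminal state."""
    episode, state = [], 0
    while state < 3:
        reward = random.choice([0, 1])
        episode.append((state, reward))
        state += 1 if random.random() < 0.5 else 2
    return episode

returns = defaultdict(list)

for _ in range(NUM_EPISODES):
    episode = generate_episode()
    g = 0.0
    # Walk the episode backwards, accumulating the discounted return
    for t in reversed(range(len(episode))):
        state, reward = episode[t]
        g = reward + GAMMA * g
        # First-visit check: record the return only for the earliest
        # occurrence of this state in the episode
        if state not in (s for s, _ in episode[:t]):
            returns[state].append(g)

# Value estimate: the average of observed returns per state
values = {state: sum(gs) / len(gs) for state, gs in returns.items()}
print(values)
```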