Incremental Implementations

Storing every return for each state-action pair can quickly exhaust memory and significantly increase computation time — especially in large environments. This limitation affects both on-policy and off-policy Monte Carlo control algorithms. To address this, we adopt incremental computation strategies, similar to those used in multi-armed bandit algorithms. These methods allow value estimates to be updated on the fly, without retaining entire return histories.

On-Policy Monte Carlo Control

For the on-policy method, the update rule looks similar to the one used in MAB algorithms:

Q(s, a) \gets Q(s, a) + \alpha (G - Q(s, a))

where $\alpha = \frac{1}{N(s, a)}$ for the mean estimate. The only values that have to be stored are the current action value estimates $Q(s, a)$ and the number of times the state-action pair $(s, a)$ has been visited, $N(s, a)$.

Pseudocode
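Below is a minimal sketch of how this incremental update could look in practice. It assumes a Gymnasium-style environment (`env.reset()` returning `(observation, info)`, `env.step()` returning a 5-tuple, a discrete action space with hashable states) and an $\varepsilon$-greedy policy; the every-visit variant is used for brevity, and all names are illustrative rather than taken from the course.

```python
import numpy as np
from collections import defaultdict


def on_policy_mc_control(env, num_episodes, gamma=1.0, epsilon=0.1):
    """On-policy every-visit MC control with incremental mean updates.

    Stores only Q(s, a) and the visit counts N(s, a); each return G is folded
    into the estimate with the step size alpha = 1 / N(s, a).
    """
    n_actions = env.action_space.n
    Q = defaultdict(lambda: np.zeros(n_actions))
    N = defaultdict(lambda: np.zeros(n_actions))

    for _ in range(num_episodes):
        # Generate an episode with an epsilon-greedy policy w.r.t. the current Q.
        episode = []
        state, _ = env.reset()
        done = False
        while not done:
            if np.random.rand() < epsilon:
                action = env.action_space.sample()
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, terminated, truncated, _ = env.step(action)
            episode.append((state, action, reward))
            state = next_state
            done = terminated or truncated

        # Walk the episode backwards, accumulating the discounted return G.
        G = 0.0
        for state, action, reward in reversed(episode):
            G = gamma * G + reward
            N[state][action] += 1
            alpha = 1.0 / N[state][action]              # alpha = 1 / N(s, a)
            Q[state][action] += alpha * (G - Q[state][action])

    return Q
```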

Off-Policy Monte Carlo Control

For the off-policy method with ordinary importance sampling, everything is the same as for the on-policy method.

A more interesting situation arises with weighted importance sampling. The update equation looks the same:

Q(s, a) \gets Q(s, a) + \alpha (G - Q(s, a))

but $\alpha = \frac{1}{N(s, a)}$ can't be used, because:

  1. Each return is weighted by $\rho$;
  2. The final sum is divided not by $N(s, a)$, but by $\sum \rho(s, a)$.

The value of $\alpha$ that can actually be used in this case is $\frac{W}{C(s, a)}$, where:

  • $W$ is the $\rho$ of the current trajectory;
  • $C(s, a)$ is equal to $\sum \rho(s, a)$.

Each time the state-action pair $(s, a)$ occurs, the $\rho$ of the current trajectory is added to $C(s, a)$:

C(s, a) \gets C(s, a) + W
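For intuition, here is a short derivation (not from the course text) of why this step size reproduces the weighted importance sampling average. Writing $W_k$ for the weight of the $k$-th return $G_k$ observed at $(s, a)$ and $C_n = \sum_{k=1}^{n} W_k$, so that $Q_n = \frac{\sum_{k=1}^{n-1} W_k G_k}{C_{n-1}}$ and $C_{n-1} = C_n - W_n$:

Q_{n+1} = \frac{\sum_{k=1}^{n} W_k G_k}{C_n} = \frac{C_{n-1} Q_n + W_n G_n}{C_n} = Q_n + \frac{W_n}{C_n} \left( G_n - Q_n \right)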

Pseudocode
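A possible sketch of the weighted importance sampling version is shown below, again assuming a hypothetical Gymnasium-style environment with discrete actions. The behavior policy is $\varepsilon$-greedy, the target policy is greedy with respect to $Q$, and $C(s, a)$ accumulates the weights $W$ exactly as described above.

```python
import numpy as np
from collections import defaultdict


def off_policy_mc_control(env, num_episodes, gamma=1.0, epsilon=0.1):
    """Off-policy MC control with incremental weighted importance sampling.

    Stores only Q(s, a) and the cumulative weights C(s, a); each update uses
    the step size alpha = W / C(s, a).
    """
    n_actions = env.action_space.n
    Q = defaultdict(lambda: np.zeros(n_actions))
    C = defaultdict(lambda: np.zeros(n_actions))  # C(s, a) = sum of weights rho

    for _ in range(num_episodes):
        # Generate an episode with an epsilon-greedy behavior policy,
        # remembering the probability of each chosen action under it.
        episode = []
        state, _ = env.reset()
        done = False
        while not done:
            greedy = int(np.argmax(Q[state]))
            if np.random.rand() < epsilon:
                action = env.action_space.sample()
            else:
                action = greedy
            b_prob = epsilon / n_actions + (1.0 - epsilon if action == greedy else 0.0)
            next_state, reward, terminated, truncated, _ = env.step(action)
            episode.append((state, action, reward, b_prob))
            state = next_state
            done = terminated or truncated

        # Process the episode backwards; the target policy is greedy w.r.t. Q.
        G = 0.0
        W = 1.0  # importance-sampling ratio rho for the trajectory tail
        for state, action, reward, b_prob in reversed(episode):
            G = gamma * G + reward
            C[state][action] += W                       # C(s, a) <- C(s, a) + W
            Q[state][action] += (W / C[state][action]) * (G - Q[state][action])
            if action != int(np.argmax(Q[state])):
                # The greedy target policy would never take this action, so the
                # ratio for all earlier steps is zero: stop the update early.
                break
            W /= b_prob  # target probability of the greedy action is 1

    return Q
```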
