Incremental Implementations

Storing every return for each state-action pair can quickly exhaust memory and significantly increase computation time, especially in large environments. This limitation affects both on-policy and off-policy Monte Carlo control algorithms. To address this, we adopt incremental computation strategies, similar to those used in multi-armed bandit algorithms. These methods allow value estimates to be updated on the fly, without retaining entire return histories.

On-Policy Monte Carlo Control

For the on-policy method, the update rule looks similar to the one used in MAB algorithms:

$$Q(s, a) \gets Q(s, a) + \alpha \left(G - Q(s, a)\right)$$

where $\displaystyle \alpha = \frac{1}{N(s, a)}$ for a mean estimate. The only values that have to be stored are the current action value estimates $Q(s, a)$ and the number of times each state-action pair $(s, a)$ has been visited, $N(s, a)$.

Pseudocode
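A minimal Python sketch of on-policy Monte Carlo control with this incremental update, assuming a Gymnasium-style environment with discrete states and actions, every-visit updates, and an ε-greedy policy (the function and variable names are illustrative, not the course's exact pseudocode):

```python
import numpy as np
from collections import defaultdict


def on_policy_mc_control(env, num_episodes, gamma=0.99, epsilon=0.1):
    """Every-visit on-policy MC control with incremental averaging (illustrative sketch)."""
    n_actions = env.action_space.n
    Q = defaultdict(lambda: np.zeros(n_actions))  # action value estimates Q(s, a)
    N = defaultdict(lambda: np.zeros(n_actions))  # visit counts N(s, a)

    def policy(state):
        # epsilon-greedy with respect to the current estimates
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax(Q[state]))

    for _ in range(num_episodes):
        # Generate an episode by following the epsilon-greedy policy
        episode = []
        state, _ = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            episode.append((state, action, reward))
            state = next_state
            done = terminated or truncated

        # Walk the episode backwards, updating Q(s, a) incrementally
        G = 0.0
        for state, action, reward in reversed(episode):
            G = gamma * G + reward
            N[state][action] += 1
            alpha = 1.0 / N[state][action]  # alpha = 1 / N(s, a)
            Q[state][action] += alpha * (G - Q[state][action])

    return Q
```

Only `Q` and `N` are kept in memory; no list of returns is ever stored.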

Off-Policy Monte Carlo Control

For the off-policy method with ordinary importance sampling, everything is the same as for the on-policy method; the only difference is that each return $G$ is first weighted by the trajectory's importance sampling ratio $\rho$.

A more interesting situation arises with weighted importance sampling. The update equation looks the same:

$$Q(s, a) \gets Q(s, a) + \alpha \left(G - Q(s, a)\right)$$

but $\displaystyle \alpha = \frac{1}{N(s, a)}$ can't be used, because:

  1. Each return is weighted by $\rho$;
  2. The final sum is divided not by $N(s, a)$, but by $\sum \rho(s, a)$ (the resulting estimator is shown below).
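In other words, the weighted importance sampling estimate being maintained is:

$$Q(s, a) = \frac{\sum_i \rho_i G_i}{\sum_i \rho_i}$$

where the sums run over all returns observed for the pair $(s, a)$.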

The value of $\alpha$ that can actually be used in this case is $\displaystyle \frac{W}{C(s, a)}$, where:

  • $W$ is the $\rho$ of the current trajectory;
  • $C(s, a)$ is equal to $\sum \rho(s, a)$.

Each time the state-action pair $(s, a)$ occurs, the $\rho$ of the current trajectory is added to $C(s, a)$:

$$C(s, a) \gets C(s, a) + W$$

Pseudocode
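A matching Python sketch for off-policy Monte Carlo control with weighted importance sampling, updated incrementally through $C(s, a)$. It assumes the same Gymnasium-style environment, an ε-greedy behavior policy, and a greedy target policy; as before, the names are illustrative:

```python
import numpy as np
from collections import defaultdict


def off_policy_mc_control(env, num_episodes, gamma=0.99, epsilon=0.1):
    """Off-policy MC control with weighted importance sampling (illustrative sketch)."""
    n_actions = env.action_space.n
    Q = defaultdict(lambda: np.zeros(n_actions))  # action value estimates Q(s, a)
    C = defaultdict(lambda: np.zeros(n_actions))  # cumulative importance weights C(s, a)

    def behavior_probs(state):
        # epsilon-greedy behavior policy probabilities with respect to current Q
        probs = np.full(n_actions, epsilon / n_actions)
        probs[int(np.argmax(Q[state]))] += 1.0 - epsilon
        return probs

    for _ in range(num_episodes):
        # Generate an episode with the behavior policy
        episode = []
        state, _ = env.reset()
        done = False
        while not done:
            probs = behavior_probs(state)
            action = int(np.random.choice(n_actions, p=probs))
            next_state, reward, terminated, truncated, _ = env.step(action)
            episode.append((state, action, reward, probs[action]))
            state = next_state
            done = terminated or truncated

        # Walk the episode backwards; the target policy is greedy w.r.t. Q
        G = 0.0
        W = 1.0  # importance sampling ratio rho for the tail of the trajectory
        for state, action, reward, b_prob in reversed(episode):
            G = gamma * G + reward
            C[state][action] += W  # C(s, a) <- C(s, a) + W
            Q[state][action] += (W / C[state][action]) * (G - Q[state][action])
            if action != int(np.argmax(Q[state])):
                break              # the greedy target policy would not take this action
            W /= b_prob            # pi(a|s) = 1 for the greedy target, so W <- W / b(a|s)
```

Note that $W$ is built up step by step while walking the trajectory backwards, so the full $\rho$ of the trajectory never has to be stored explicitly.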
