Incremental Implementations

Storing every return for each state-action pair can quickly exhaust memory and significantly increase computation time, especially in large environments. This limitation affects both on-policy and off-policy Monte Carlo control algorithms. To address this, we adopt incremental computation strategies, similar to those used in multi-armed bandit algorithms. These methods allow value estimates to be updated on the fly, without retaining entire return histories.

On-Policy Monte Carlo Control

For the on-policy method, the update rule looks similar to the one used in MAB algorithms:

Q(s, a) \gets Q(s, a) + \alpha (G - Q(s, a))

where $\alpha = \frac{1}{N(s, a)}$ for the mean estimate. The only values that have to be stored are the current action value estimates $Q(s, a)$ and the number of times each state-action pair $(s, a)$ has been visited, $N(s, a)$.

Pseudocode
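The platform's pseudocode block is not reproduced on this page, so here is a minimal Python sketch of the on-policy update above. It assumes a Gymnasium-style environment with discrete, hashable states; the names `env`, `num_episodes`, `gamma`, and `epsilon` are illustrative assumptions, not the course's exact code.

```python
import numpy as np
from collections import defaultdict

def on_policy_mc_control(env, num_episodes=10_000, gamma=1.0, epsilon=0.1):
    """Every-visit on-policy MC control with incremental mean updates."""
    n_actions = env.action_space.n
    Q = defaultdict(lambda: np.zeros(n_actions))  # action value estimates Q(s, a)
    N = defaultdict(lambda: np.zeros(n_actions))  # visit counts N(s, a)

    def behavior(state):
        # Epsilon-greedy policy with respect to the current Q estimates
        if np.random.rand() < epsilon:
            return env.action_space.sample()
        return int(np.argmax(Q[state]))

    for _ in range(num_episodes):
        # Generate one episode under the epsilon-greedy policy
        episode = []
        state, _ = env.reset()
        done = False
        while not done:
            action = behavior(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            episode.append((state, action, reward))
            state = next_state
            done = terminated or truncated

        # Walk the episode backwards, updating Q incrementally
        G = 0.0
        for state, action, reward in reversed(episode):
            G = gamma * G + reward                  # return from this step onward
            N[state][action] += 1
            alpha = 1.0 / N[state][action]          # 1 / N(s, a)
            Q[state][action] += alpha * (G - Q[state][action])

    return Q
```

Note that only `Q` and `N` are stored, exactly as described above; no return histories are kept.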

Off-Policy Monte Carlo Control

For the off-policy method with ordinary importance sampling, everything is the same as for the on-policy method.
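Concretely, if we assume the estimate is the plain average of the $\rho$-weighted returns, the update keeps $\alpha = \frac{1}{N(s, a)}$ and simply replaces $G$ with $\rho G$:

Q(s, a) \gets Q(s, a) + \frac{1}{N(s, a)} \left(\rho G - Q(s, a)\right)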

A more interesting situation arises with weighted importance sampling. The equation looks the same:

Q(s, a) \gets Q(s, a) + \alpha (G - Q(s, a))

but $\alpha = \frac{1}{N(s, a)}$ can't be used, because:

  1. Each return is weighted by $\rho$;

  2. The final sum is divided not by $N(s, a)$, but by $\sum \rho(s, a)$.

The value of $\alpha$ that can actually be used in this case is equal to $\frac{W}{C(s, a)}$, where:

  • $W$ is the $\rho$ of the current trajectory;

  • $C(s, a)$ is equal to $\sum \rho(s, a)$.

Each time the state-action pair $(s, a)$ occurs, the $\rho$ of the current trajectory is added to $C(s, a)$:

C(s, a) \gets C(s, a) + W

Pseudocode
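As before, the pseudocode block itself is not shown here; the sketch below illustrates one common way to implement the weighted importance sampling update, with a greedy target policy and an epsilon-greedy behavior policy. The environment interface and parameter names are the same assumptions as in the on-policy sketch.

```python
import numpy as np
from collections import defaultdict

def off_policy_mc_control(env, num_episodes=10_000, gamma=1.0, epsilon=0.1):
    """Off-policy MC control with weighted importance sampling, incremental form."""
    n_actions = env.action_space.n
    Q = defaultdict(lambda: np.zeros(n_actions))  # action value estimates Q(s, a)
    C = defaultdict(lambda: np.zeros(n_actions))  # cumulative weights C(s, a)

    for _ in range(num_episodes):
        # Behavior policy: epsilon-greedy with respect to the current Q
        episode = []
        state, _ = env.reset()
        done = False
        while not done:
            if np.random.rand() < epsilon:
                action = env.action_space.sample()
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, terminated, truncated, _ = env.step(action)
            episode.append((state, action, reward))
            state = next_state
            done = terminated or truncated

        # Walk the episode backwards; W is the importance sampling ratio so far
        G, W = 0.0, 1.0
        for state, action, reward in reversed(episode):
            G = gamma * G + reward
            C[state][action] += W                            # C(s, a) <- C(s, a) + W
            Q[state][action] += (W / C[state][action]) * (G - Q[state][action])
            if action != int(np.argmax(Q[state])):
                break  # the greedy target policy would never take this action, so rho = 0
            # Divide by the behavior policy's probability of the greedy action
            W /= 1 - epsilon + epsilon / n_actions

    return Q
```

Because the target policy is greedy, its probability of the taken action is either 1 (the greedy action) or 0 (anything else), which is why the inner loop either divides $W$ by the behavior probability or terminates early.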
