Incremental Implementations

Storing every return for each state-action pair can quickly exhaust memory and significantly increase computation time — especially in large environments. This limitation affects both on-policy and off-policy Monte Carlo control algorithms. To address this, we adopt incremental computation strategies, similar to those used in multi-armed bandit algorithms. These methods allow value estimates to be updated on the fly, without retaining entire return histories.

On-Policy Monte Carlo Control

For the on-policy method, the update rule looks similar to the one used in MAB algorithms:

Q(s, a) \gets Q(s, a) + \alpha (G - Q(s, a))

where α = 1/N(s, a) yields the sample mean. The only values that have to be stored are the current action value estimates Q(s, a) and the number of times each state-action pair (s, a) has been visited, N(s, a).

Pseudocode
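
A minimal Python sketch of this update (not the course's original pseudocode; the Gymnasium-style environment interface, discrete action space, ε-greedy behavior policy, every-visit updates, and all hyperparameter values are assumptions made for illustration):

```python
import random
from collections import defaultdict

def on_policy_mc_control(env, num_episodes=10_000, gamma=0.99, epsilon=0.1):
    """Every-visit on-policy MC control with incremental mean updates.

    Only Q(s, a) and the visit counts N(s, a) are stored;
    no lists of returns are kept in memory.
    """
    n_actions = env.action_space.n            # assumed discrete action space
    Q = defaultdict(float)                    # Q[(s, a)]: action-value estimates
    N = defaultdict(int)                      # N[(s, a)]: visit counts

    def epsilon_greedy(state):
        # Behave epsilon-greedily with respect to the current Q estimates
        if random.random() < epsilon:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        # Generate one episode by following the epsilon-greedy policy
        episode = []
        state, _ = env.reset()
        done = False
        while not done:
            action = epsilon_greedy(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            episode.append((state, action, reward))
            done = terminated or truncated
            state = next_state

        # Walk the episode backwards, accumulating the discounted return G
        G = 0.0
        for state, action, reward in reversed(episode):
            G = gamma * G + reward
            N[(state, action)] += 1
            alpha = 1.0 / N[(state, action)]                 # alpha = 1 / N(s, a)
            Q[(state, action)] += alpha * (G - Q[(state, action)])

    return Q
```

With α = 1/N(s, a), each Q(s, a) is exactly the running mean of the returns observed for that pair, so the result matches averaging stored returns while keeping only two tables in memory.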

Off-Policy Monte Carlo Control

For the off-policy method with ordinary importance sampling, everything is the same as for the on-policy method: each return G is simply replaced by its importance-weighted counterpart ρG, while the step size remains α = 1/N(s, a).
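
Spelled out, with ρ denoting the importance sampling ratio of the trajectory that produced G, the update becomes:

Q(s, a) \gets Q(s, a) + \frac{1}{N(s, a)} (\rho G - Q(s, a))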

A more interesting situation arises with weighted importance sampling. The update rule looks the same:

Q(s, a) \gets Q(s, a) + \alpha (G - Q(s, a))

but α = 1/N(s, a) can't be used, because:

  1. Each return is weighted by ρ;

  2. The final sum is divided not by N(s, a), but by ∑ρ(s, a).

The value of α that can be used in this case is W / C(s, a), where:

  • W is the ρ of the current trajectory;

  • C(s, a) is equal to ∑ρ(s, a).

Each time the state-action pair (s, a) occurs, the ρ of the current trajectory is added to C(s, a):

C(s, a) \gets C(s, a) + W

Pseudocode
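
A minimal Python sketch of off-policy MC control with weighted importance sampling and these incremental updates (not the course's original pseudocode; the greedy target policy, uniform random behavior policy, and Gymnasium-style environment interface are assumptions made for illustration):

```python
import random
from collections import defaultdict

def off_policy_mc_control(env, num_episodes=10_000, gamma=0.99):
    """Off-policy MC control with weighted importance sampling.

    Only Q(s, a) and the cumulative ratio sums C(s, a) are stored;
    the step size is W / C(s, a), matching the update above.
    """
    n_actions = env.action_space.n            # assumed discrete action space
    Q = defaultdict(float)                    # Q[(s, a)]: action-value estimates
    C = defaultdict(float)                    # C[(s, a)]: sum of ratios for (s, a)

    def greedy_action(state):
        return max(range(n_actions), key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        # Behavior policy b: uniform random (any soft policy would do)
        episode = []
        state, _ = env.reset()
        done = False
        while not done:
            action = random.randrange(n_actions)
            next_state, reward, terminated, truncated, _ = env.step(action)
            episode.append((state, action, reward))
            done = terminated or truncated
            state = next_state

        # Process the episode backwards, maintaining the trajectory weight W
        G = 0.0
        W = 1.0
        for state, action, reward in reversed(episode):
            G = gamma * G + reward
            C[(state, action)] += W                  # C(s, a) <- C(s, a) + W
            Q[(state, action)] += (W / C[(state, action)]) * (G - Q[(state, action)])
            if action != greedy_action(state):
                break    # greedy target policy assigns this action probability 0
            # W <- W * pi(a|s) / b(a|s): pi(a|s) = 1 (greedy), b(a|s) = 1 / n_actions
            W *= n_actions

    return Q
```

Here W / C(s, a) plays the role of α, and the accumulation of W into C(s, a) implements the update described above.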
