Incremental Implementations

Storing every return for each state-action pair can quickly exhaust memory and significantly increase computation time — especially in large environments. This limitation affects both on-policy and off-policy Monte Carlo control algorithms. To address this, we adopt incremental computation strategies, similar to those used in multi-armed bandit algorithms. These methods allow value estimates to be updated on the fly, without retaining entire return histories.

On-Policy Monte Carlo Control

For the on-policy method, the update rule looks similar to the one used in MAB algorithms:

$$Q(s, a) \gets Q(s, a) + \alpha \bigl(G - Q(s, a)\bigr)$$

where $\displaystyle \alpha = \frac{1}{N(s, a)}$ for a mean estimate. The only values that have to be stored are the current action value estimates $Q(s, a)$ and the number of times each state-action pair $(s, a)$ has been visited, $N(s, a)$.
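A brief sanity check of this choice of $\alpha$, assuming $G_k$ denotes the $k$-th return observed for $(s, a)$ and $Q_n$ the estimate after $n$ of them:

$$Q_n = Q_{n-1} + \frac{1}{n}\bigl(G_n - Q_{n-1}\bigr) = \frac{n-1}{n}\,Q_{n-1} + \frac{1}{n}\,G_n = \frac{1}{n}\sum_{k=1}^{n} G_k$$

(the last equality follows by induction with $Q_1 = G_1$), so the incremental rule reproduces the ordinary average of all returns without storing them.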

Pseudocode
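Below is a minimal Python sketch of on-policy Monte Carlo control with $\varepsilon$-greedy exploration and the incremental update above (every-visit variant). The environment interface (`env.reset()`, `env.step()`, `env.actions`) and the hyperparameters are assumptions made for illustration.

```python
import random
from collections import defaultdict

def on_policy_mc_control(env, num_episodes=10_000, gamma=0.99, epsilon=0.1):
    """On-policy every-visit Monte Carlo control with incremental updates.

    Assumed (illustrative) environment interface:
      env.reset() -> state
      env.step(action) -> (next_state, reward, done)
      env.actions -> list of available actions
    """
    Q = defaultdict(float)  # Q[(s, a)]: action-value estimates
    N = defaultdict(int)    # N[(s, a)]: visit counts

    def epsilon_greedy(state):
        # Explore with probability epsilon, otherwise act greedily w.r.t. Q
        if random.random() < epsilon:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        # 1. Generate an episode by following the epsilon-greedy policy
        episode = []
        state, done = env.reset(), False
        while not done:
            action = epsilon_greedy(state)
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        # 2. Walk the episode backwards, updating Q incrementally
        G = 0.0
        for state, action, reward in reversed(episode):
            G = gamma * G + reward
            N[(state, action)] += 1
            alpha = 1.0 / N[(state, action)]  # alpha = 1 / N(s, a)
            Q[(state, action)] += alpha * (G - Q[(state, action)])

    return Q
```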

Off-Policy Monte Carlo Control

For the off-policy method with ordinary importance sampling, everything is the same as in the on-policy case: the scaled return $\rho G$ simply takes the place of $G$ in the update, and $\alpha = \dfrac{1}{N(s, a)}$ still yields the correct average.

A more interesting situation arises with weighted importance sampling. The update equation looks the same:

$$Q(s, a) \gets Q(s, a) + \alpha \bigl(G - Q(s, a)\bigr)$$

but $\displaystyle \alpha = \frac{1}{N(s, a)}$ can't be used, because:

  1. Each return is weighted by $\rho$;

  2. The final sum is divided not by $N(s, a)$, but by $\sum \rho(s, a)$.

The value of $\alpha$ that can actually be used in this case is $\displaystyle \frac{W}{C(s, a)}$, where:

  • $W$ is the $\rho$ of the current trajectory;

  • $C(s, a)$ is equal to $\sum \rho(s, a)$.

Each time the state-action pair $(s, a)$ occurs, the $\rho$ of the current trajectory is added to $C(s, a)$:

$$C(s, a) \gets C(s, a) + W$$
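This choice of $\alpha$ reproduces the weighted importance sampling estimate exactly. Writing $G_k$ and $W_k$ for the $k$-th return and its weight observed for $(s, a)$, and $C_n = \sum_{k=1}^{n} W_k$:

$$Q_n = \frac{\sum_{k=1}^{n} W_k G_k}{\sum_{k=1}^{n} W_k} = Q_{n-1} + \frac{W_n}{C_n}\bigl(G_n - Q_{n-1}\bigr)$$

which is the update above with $\alpha = \dfrac{W}{C(s, a)}$. Note that the very first update has $\alpha = 1$ (since $C(s, a) = W$ at that point), so it overwrites whatever initial value $Q(s, a)$ was given.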

Pseudocode
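Here is a comparable Python sketch of off-policy Monte Carlo control with weighted importance sampling, where the behaviour policy is $\varepsilon$-greedy and the target policy is greedy with respect to $Q$; the same illustrative environment interface as in the on-policy sketch is assumed.

```python
import random
from collections import defaultdict

def off_policy_mc_control(env, num_episodes=10_000, gamma=0.99, epsilon=0.1):
    """Off-policy Monte Carlo control with weighted importance sampling.

    Behaviour policy: epsilon-greedy w.r.t. Q (used to generate episodes).
    Target policy: greedy w.r.t. Q (the policy being learned).
    Uses the same assumed environment interface as the on-policy sketch.
    """
    Q = defaultdict(float)  # Q[(s, a)]: action-value estimates
    C = defaultdict(float)  # C[(s, a)]: cumulative weights, i.e. sum of rho

    def behaviour_action(state):
        if random.random() < epsilon:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: Q[(state, a)])

    def behaviour_prob(state, action):
        # Probability that the epsilon-greedy behaviour policy picks `action`
        greedy = max(env.actions, key=lambda a: Q[(state, a)])
        prob = epsilon / len(env.actions)
        if action == greedy:
            prob += 1.0 - epsilon
        return prob

    for _ in range(num_episodes):
        # 1. Generate an episode with the behaviour policy
        episode = []
        state, done = env.reset(), False
        while not done:
            action = behaviour_action(state)
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        # 2. Backward pass with weighted importance sampling updates
        G, W = 0.0, 1.0
        for state, action, reward in reversed(episode):
            G = gamma * G + reward
            C[(state, action)] += W               # C(s, a) <- C(s, a) + W
            alpha = W / C[(state, action)]        # alpha = W / C(s, a)
            Q[(state, action)] += alpha * (G - Q[(state, action)])
            # Target policy is greedy: if the behaviour action is not the
            # greedy one, the importance ratio for earlier steps is zero.
            if action != max(env.actions, key=lambda a: Q[(state, a)]):
                break
            W /= behaviour_prob(state, action)    # pi(a|s) = 1 for the greedy action

    return Q
```

The early `break` is worth noting: because the target policy is deterministic, any step where the behaviour policy deviates from the greedy action makes $\rho = 0$ for all earlier state-action pairs in the episode, so processing the episode backwards lets the loop stop there.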
