Incremental Implementations
Storing every return for each state-action pair can quickly exhaust memory and significantly increase computation time — especially in large environments. This limitation affects both on-policy and off-policy Monte Carlo control algorithms. To address this, we adopt incremental computation strategies, similar to those used in multi-armed bandit algorithms. These methods allow value estimates to be updated on the fly, without retaining entire return histories.
On-Policy Monte Carlo Control
For the on-policy method, the update strategy looks similar to the one used in MAB algorithms:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \bigl(G_t - Q(S_t, A_t)\bigr)$$

where $\alpha = \dfrac{1}{N(S_t, A_t)}$ for the mean estimate. The only values that have to be stored are the current estimates of action values $Q(s, a)$ and the number of times each state-action pair has been visited, $N(s, a)$.
Pseudocode
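Below is a minimal Python sketch of incremental on-policy Monte Carlo control as described above, assuming a Gymnasium-style environment with discrete states and actions, an epsilon-greedy policy, and every-visit updates; the environment interface, function name, and hyperparameter values are illustrative assumptions, not part of the original algorithm statement.

```python
import numpy as np
from collections import defaultdict

def on_policy_mc_control(env, num_episodes=10_000, gamma=0.99, epsilon=0.1):
    Q = defaultdict(lambda: np.zeros(env.action_space.n))  # action-value estimates Q(s, a)
    N = defaultdict(lambda: np.zeros(env.action_space.n))  # visit counts N(s, a)

    def epsilon_greedy(state):
        # Explore with probability epsilon, otherwise act greedily w.r.t. Q
        if np.random.rand() < epsilon:
            return env.action_space.sample()
        return int(np.argmax(Q[state]))

    for _ in range(num_episodes):
        # Generate one episode with the current epsilon-greedy policy
        episode = []
        state, _ = env.reset()
        done = False
        while not done:
            action = epsilon_greedy(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            episode.append((state, action, reward))
            state = next_state
            done = terminated or truncated

        # Walk the episode backwards, updating estimates incrementally:
        # Q(s, a) <- Q(s, a) + (1 / N(s, a)) * (G - Q(s, a))
        G = 0.0
        for state, action, reward in reversed(episode):
            G = gamma * G + reward
            N[state][action] += 1
            Q[state][action] += (G - Q[state][action]) / N[state][action]

    return Q
```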
Off-Policy Monte Carlo Control
For the off-policy method with ordinary importance sampling, everything is the same as for the on-policy method.
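As a sketch, writing $\rho$ for the importance sampling ratio of the current trajectory (the notation used below) and keeping the rest of the on-policy notation, the incremental update with ordinary importance sampling can be written as:

$$Q(s, a) \leftarrow Q(s, a) + \frac{1}{N(s, a)}\bigl(\rho G - Q(s, a)\bigr)$$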
A more interesting situation arises with weighted importance sampling. The equation looks the same:

$$Q(s, a) \leftarrow Q(s, a) + \frac{1}{N(s, a)}\bigl(G - Q(s, a)\bigr)$$

but it can't be used, because:

- Each return $G$ is weighted by the importance sampling ratio $\rho$;
- The final sum is divided not by $N(s, a)$, but by the sum of all the weights.

The value of $\alpha$ that can actually be used in this case is equal to $\dfrac{W}{C(s, a)}$, where:

- $W$ is the importance sampling ratio $\rho$ of the current trajectory;
- $C(s, a)$ is equal to the cumulative sum of the weights $W$ over all visits to the pair $(s, a)$.

And each time the state-action pair occurs, the $W$ of the current trajectory is added to $C(s, a)$:

$$C(s, a) \leftarrow C(s, a) + W$$
Pseudocode
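Below is a minimal Python sketch of incremental off-policy Monte Carlo control with weighted importance sampling, following the update rules above. It assumes a Gymnasium-style discrete environment, an epsilon-greedy behaviour policy, and a greedy target policy; the environment interface, function name, and hyperparameter values are illustrative assumptions.

```python
import numpy as np
from collections import defaultdict

def off_policy_mc_control(env, num_episodes=10_000, gamma=0.99, epsilon=0.1):
    n_actions = env.action_space.n
    Q = defaultdict(lambda: np.zeros(n_actions))  # action-value estimates Q(s, a)
    C = defaultdict(lambda: np.zeros(n_actions))  # cumulative sums of the weights W

    for _ in range(num_episodes):
        # Generate an episode with an epsilon-greedy behaviour policy b,
        # remembering b(a | s) for every action actually taken
        episode = []
        state, _ = env.reset()
        done = False
        while not done:
            greedy = int(np.argmax(Q[state]))
            if np.random.rand() < epsilon:
                action = env.action_space.sample()
            else:
                action = greedy
            b_prob = (1 - epsilon + epsilon / n_actions) if action == greedy else epsilon / n_actions
            next_state, reward, terminated, truncated, _ = env.step(action)
            episode.append((state, action, reward, b_prob))
            state = next_state
            done = terminated or truncated

        # Walk the episode backwards, maintaining the trajectory weight W
        # and the cumulative weights C(s, a), as described above
        G = 0.0
        W = 1.0
        for state, action, reward, b_prob in reversed(episode):
            G = gamma * G + reward
            C[state][action] += W  # C(s, a) <- C(s, a) + W
            Q[state][action] += (W / C[state][action]) * (G - Q[state][action])
            # The target policy is greedy w.r.t. Q, so its probability is 1 for
            # the greedy action and 0 otherwise; stop once the two policies diverge
            if action != int(np.argmax(Q[state])):
                break
            W *= 1.0 / b_prob

    return Q
```

Processing the episode backwards lets the weight $W$ be built up step by step, and the early break reflects that once the behaviour action differs from the greedy target action, the importance sampling ratio for all earlier steps becomes zero.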