Incremental Implementations
Storing every return for each state-action pair can quickly exhaust memory and significantly increase computation time — especially in large environments. This limitation affects both on-policy and off-policy Monte Carlo control algorithms. To address this, we adopt incremental computation strategies, similar to those used in multi-armed bandit algorithms. These methods allow value estimates to be updated on the fly, without retaining entire return histories.
On-Policy Monte Carlo Control
For the on-policy method, the update rule looks similar to the one used in MAB algorithms:

Q(s,a) ← Q(s,a) + α(G − Q(s,a))

where α = 1/N(s,a) yields the mean estimate. The only values that have to be stored are the current action-value estimates Q(s,a) and the number of times each state-action pair (s,a) has been visited, N(s,a).
Pseudocode
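The update above can be sketched in Python. The one-step environment below (state "s", two actions, deterministic rewards) is purely hypothetical, invented so the loop is self-contained; only the incremental update line reflects the rule from the text:

```python
import random
from collections import defaultdict

def run_episode(policy):
    """One episode of a toy one-step MDP (hypothetical, for illustration):
    in state "s", action 0 yields reward 1 and action 1 yields reward 0."""
    a = policy("s")
    r = 1.0 if a == 0 else 0.0
    return [("s", a, r)]

def on_policy_mc_control(num_episodes=500, epsilon=0.1, n_actions=2):
    Q = defaultdict(float)   # current action-value estimates Q(s, a)
    N = defaultdict(int)     # visit counts N(s, a)

    def policy(s):
        # epsilon-greedy with respect to the current Q
        if random.random() < epsilon:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[(s, a)])

    for _ in range(num_episodes):
        episode = run_episode(policy)
        G = 0.0
        # walk the episode backwards, accumulating the return (gamma = 1 here)
        for s, a, r in reversed(episode):
            G = r + G
            N[(s, a)] += 1
            # incremental mean: Q <- Q + (1/N)(G - Q)
            Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]
    return Q

random.seed(0)
Q = on_policy_mc_control()
```

Because only Q and N are kept, memory stays constant per state-action pair no matter how many episodes are run.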
Off-Policy Monte Carlo Control
For the off-policy method with ordinary importance sampling, everything works the same as for the on-policy method: each return is simply weighted by the importance-sampling ratio ρ before the update.
A more interesting situation arises with weighted importance sampling. The update equation looks the same:

Q(s,a) ← Q(s,a) + α(G − Q(s,a))

but α = 1/N(s,a) can't be used, because:

- Each return is weighted by ρ;
- The final sum is divided not by N(s,a) but by ∑ρ(s,a).
The value of α that can actually be used in this case is W / C(s,a), where:

- W is the ρ of the current trajectory;
- C(s,a) is the cumulative sum ∑ρ(s,a) over all trajectories that have visited (s,a).
And each time the state-action pair (s,a) occurs, the ρ of the current trajectory is added to C(s,a):

C(s,a) ← C(s,a) + W

Pseudocode
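A minimal sketch of the weighted importance-sampling update as a standalone function. The state, action, returns, and weights fed to it below are invented solely to show that the estimate converges to the weight-weighted average of the returns:

```python
from collections import defaultdict

Q = defaultdict(float)   # action-value estimates Q(s, a)
C = defaultdict(float)   # cumulative importance-sampling weights C(s, a)

def weighted_is_update(state, action, G, W):
    """Weighted importance-sampling update:
    C(s,a) <- C(s,a) + W;  Q(s,a) <- Q(s,a) + (W / C(s,a)) (G - Q(s,a))."""
    key = (state, action)
    C[key] += W
    Q[key] += (W / C[key]) * (G - Q[key])

# Two weighted returns for the same (hypothetical) pair: the estimate
# becomes the weighted average (1*10 + 3*2) / (1 + 3) = 4.0.
weighted_is_update("s0", "a0", G=10.0, W=1.0)
weighted_is_update("s0", "a0", G=2.0, W=3.0)
```

As in the on-policy case, only two scalars per state-action pair (Q and C) are stored, rather than the full history of weighted returns.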