Incremental Implementations
Storing every return for each state-action pair can quickly exhaust memory and significantly increase computation time — especially in large environments. This limitation affects both on-policy and off-policy Monte Carlo control algorithms. To address this, we adopt incremental computation strategies, similar to those used in multi-armed bandit algorithms. These methods allow value estimates to be updated on the fly, without retaining entire return histories.
On-Policy Monte Carlo Control
For the on-policy method, the update rule looks similar to the one used in MAB algorithms:

Q(s,a) ← Q(s,a) + α(G − Q(s,a))

where α = 1/N(s,a) yields the mean estimate. The only values that have to be stored are the current action-value estimates Q(s,a) and the number of times each state-action pair (s,a) has been visited, N(s,a).
Pseudocode
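The update above can be sketched in Python. The one-step environment below (state "s", two actions, deterministic rewards) is purely hypothetical, invented so the loop is self-contained; only the incremental update line reflects the rule from the text:

```python
import random
from collections import defaultdict

def run_episode(policy):
    """One episode of a toy one-step MDP (hypothetical, for illustration):
    in state "s", action 0 yields reward 1 and action 1 yields reward 0."""
    a = policy("s")
    r = 1.0 if a == 0 else 0.0
    return [("s", a, r)]

def on_policy_mc_control(num_episodes=500, epsilon=0.1, n_actions=2):
    Q = defaultdict(float)   # current action-value estimates Q(s, a)
    N = defaultdict(int)     # visit counts N(s, a)

    def policy(s):
        # epsilon-greedy with respect to the current Q
        if random.random() < epsilon:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[(s, a)])

    for _ in range(num_episodes):
        episode = run_episode(policy)
        G = 0.0
        # walk the episode backwards, accumulating the return (gamma = 1 here)
        for s, a, r in reversed(episode):
            G = r + G
            N[(s, a)] += 1
            # incremental mean: Q <- Q + (1/N)(G - Q)
            Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]
    return Q

random.seed(0)
Q = on_policy_mc_control()
```

Because only Q and N are kept, memory stays constant per state-action pair no matter how many episodes are run.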
Off-Policy Monte Carlo Control
For the off-policy method with ordinary importance sampling, everything works the same as for the on-policy method: each return is simply weighted by the importance-sampling ratio ρ before the update.
A more interesting situation arises with weighted importance sampling. The update equation looks the same:

Q(s,a) ← Q(s,a) + α(G − Q(s,a))

but α = 1/N(s,a) can't be used, because:

- Each return is weighted by ρ;
- The final sum is divided not by N(s,a) but by ∑ρ(s,a).
The value of α that can actually be used in this case is W / C(s,a), where:

- W is the ρ of the current trajectory;
- C(s,a) is the cumulative sum ∑ρ(s,a) over all trajectories that have visited (s,a).
And each time the state-action pair (s,a) occurs, the ρ of the current trajectory is added to C(s,a):

C(s,a) ← C(s,a) + W

Pseudocode
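A minimal sketch of the weighted importance-sampling update as a standalone function. The state, action, returns, and weights fed to it below are invented solely to show that the estimate converges to the weight-weighted average of the returns:

```python
from collections import defaultdict

Q = defaultdict(float)   # action-value estimates Q(s, a)
C = defaultdict(float)   # cumulative importance-sampling weights C(s, a)

def weighted_is_update(state, action, G, W):
    """Weighted importance-sampling update:
    C(s,a) <- C(s,a) + W;  Q(s,a) <- Q(s,a) + (W / C(s,a)) (G - Q(s,a))."""
    key = (state, action)
    C[key] += W
    Q[key] += (W / C[key]) * (G - Q[key])

# Two weighted returns for the same (hypothetical) pair: the estimate
# becomes the weighted average (1*10 + 3*2) / (1 + 3) = 4.0.
weighted_is_update("s0", "a0", G=10.0, W=1.0)
weighted_is_update("s0", "a0", G=2.0, W=3.0)
```

As in the on-policy case, only two scalars per state-action pair (Q and C) are stored, rather than the full history of weighted returns.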