Incremental Implementations

Storing every return for each state-action pair can quickly exhaust memory and significantly increase computation time, especially in large environments. This limitation affects both on-policy and off-policy Monte Carlo control algorithms. To address this, we adopt incremental computation strategies, similar to those used in multi-armed bandit algorithms. These methods allow value estimates to be updated on the fly, without retaining entire return histories.

On-Policy Monte Carlo Control

For the on-policy method, the update rule looks similar to the one used in MAB algorithms:

Q(s, a) \gets Q(s, a) + \alpha (G - Q(s, a))

where $\alpha = \frac{1}{N(s, a)}$ for the mean estimate. The only values that have to be stored are the current action value estimates $Q(s, a)$ and the number of times each state-action pair $(s, a)$ has been visited, $N(s, a)$.

Pseudocode
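The platform's pseudocode block is not reproduced on this page, so here is a minimal Python sketch of the on-policy update above. It assumes a Gymnasium-style environment with discrete, hashable states; the names `env`, `num_episodes`, `gamma`, and `epsilon` are illustrative assumptions, not the course's exact code.

```python
import numpy as np
from collections import defaultdict

def on_policy_mc_control(env, num_episodes=10_000, gamma=1.0, epsilon=0.1):
    """Every-visit on-policy MC control with incremental mean updates."""
    n_actions = env.action_space.n
    Q = defaultdict(lambda: np.zeros(n_actions))  # action value estimates Q(s, a)
    N = defaultdict(lambda: np.zeros(n_actions))  # visit counts N(s, a)

    def behavior(state):
        # Epsilon-greedy policy with respect to the current Q estimates
        if np.random.rand() < epsilon:
            return env.action_space.sample()
        return int(np.argmax(Q[state]))

    for _ in range(num_episodes):
        # Generate one episode under the epsilon-greedy policy
        episode = []
        state, _ = env.reset()
        done = False
        while not done:
            action = behavior(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            episode.append((state, action, reward))
            state = next_state
            done = terminated or truncated

        # Walk the episode backwards, updating Q incrementally
        G = 0.0
        for state, action, reward in reversed(episode):
            G = gamma * G + reward                  # return from this step onward
            N[state][action] += 1
            alpha = 1.0 / N[state][action]          # 1 / N(s, a)
            Q[state][action] += alpha * (G - Q[state][action])

    return Q
```

Note that only `Q` and `N` are stored, exactly as described above; no return histories are kept.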

Off-Policy Monte Carlo Control

For the off-policy method with ordinary importance sampling, everything is the same as for the on-policy method.
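Concretely, if we assume the estimate is the plain average of the $\rho$-weighted returns, the update keeps $\alpha = \frac{1}{N(s, a)}$ and simply replaces $G$ with $\rho G$:

Q(s, a) \gets Q(s, a) + \frac{1}{N(s, a)} \left(\rho G - Q(s, a)\right)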

A more interesting situation arises with weighted importance sampling. The equation looks the same:

Q(s, a) \gets Q(s, a) + \alpha (G - Q(s, a))

but $\alpha = \frac{1}{N(s, a)}$ can't be used, because:

  1. Each return is weighted by $\rho$;

  2. The final sum is divided not by $N(s, a)$, but by $\sum \rho(s, a)$.

The value of $\alpha$ that can actually be used in this case is equal to $\frac{W}{C(s, a)}$, where:

  • $W$ is the $\rho$ of the current trajectory;

  • $C(s, a)$ is equal to $\sum \rho(s, a)$.

Each time the state-action pair $(s, a)$ occurs, the $\rho$ of the current trajectory is added to $C(s, a)$:

C(s, a) \gets C(s, a) + W

Pseudocode
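As before, the pseudocode block itself is not shown here; the sketch below illustrates one common way to implement the weighted importance sampling update, with a greedy target policy and an epsilon-greedy behavior policy. The environment interface and parameter names are the same assumptions as in the on-policy sketch.

```python
import numpy as np
from collections import defaultdict

def off_policy_mc_control(env, num_episodes=10_000, gamma=1.0, epsilon=0.1):
    """Off-policy MC control with weighted importance sampling, incremental form."""
    n_actions = env.action_space.n
    Q = defaultdict(lambda: np.zeros(n_actions))  # action value estimates Q(s, a)
    C = defaultdict(lambda: np.zeros(n_actions))  # cumulative weights C(s, a)

    for _ in range(num_episodes):
        # Behavior policy: epsilon-greedy with respect to the current Q
        episode = []
        state, _ = env.reset()
        done = False
        while not done:
            if np.random.rand() < epsilon:
                action = env.action_space.sample()
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, terminated, truncated, _ = env.step(action)
            episode.append((state, action, reward))
            state = next_state
            done = terminated or truncated

        # Walk the episode backwards; W is the importance sampling ratio so far
        G, W = 0.0, 1.0
        for state, action, reward in reversed(episode):
            G = gamma * G + reward
            C[state][action] += W                            # C(s, a) <- C(s, a) + W
            Q[state][action] += (W / C[state][action]) * (G - Q[state][action])
            if action != int(np.argmax(Q[state])):
                break  # the greedy target policy would never take this action, so rho = 0
            # Divide by the behavior policy's probability of the greedy action
            W /= 1 - epsilon + epsilon / n_actions

    return Q
```

Because the target policy is greedy, its probability of the taken action is either 1 (the greedy action) or 0 (anything else), which is why the inner loop either divides $W$ by the behavior probability or terminates early.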
