Monte Carlo Control

By replacing the policy evaluation step in the standard policy iteration algorithm with the Monte Carlo estimation techniques described in the previous chapter, we can already derive a new variation of policy iteration, one that relies on sampled experience instead of dynamic programming.

However, there's a critical limitation. In traditional policy iteration, the policy improvement step depends on having access to a complete model of the environment. Specifically, to update the policy, we use the following expression:

\pi(s) \gets \argmax_a \sum_{s', r} \textcolor{red}{p(s', r \mid s, a)} \Bigl(r + \gamma v(s')\Bigr)

This equation assumes that we know the transition probabilities p(s', r | s, a). But this is precisely the problem: Monte Carlo methods are designed for model-free settings, where the environment's transition dynamics are unknown. If a complete model is available, then we might as well use dynamic programming throughout, including for policy evaluation, since it would be more efficient and precise.

Therefore, while substituting Monte Carlo methods for value estimation is a step toward model-free reinforcement learning, we must also find a way to perform policy improvement without relying on knowledge of the model. This requires shifting from the state value function to the action value function.

Why Action Values?

By using action values, it is possible to perform policy improvement without needing a model of the environment. Instead of relying on transition probabilities to compute expected returns, we can directly select actions that appear to yield the highest value. The policy improvement step then becomes:

\pi(s) \gets \argmax_a q(s, a) \qquad \forall s \in S
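As a quick illustration, here is a minimal sketch of this greedy improvement step in a tabular setting. The array Q of shape (n_states, n_actions) is an illustrative assumption, not part of any particular library:

```python
import numpy as np

def greedy_improvement(Q):
    """Return the greedy policy for tabular action value estimates.

    Q is assumed to have shape (n_states, n_actions); the result maps
    each state index to the action with the highest estimated value.
    """
    return np.argmax(Q, axis=1)
```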

It is not hard to prove that the new policy is no worse than the old one, since the policy improvement theorem can still be applied:

\begin{aligned}
q_{\pi_{k}}(s, \pi_{k+1}(s)) &= q_{\pi_k}\bigl(s, \argmax_a q_{\pi_k}(s, a)\bigr)\\
&= \max_a q_{\pi_k}(s, a)\\
&\ge q_{\pi_k}(s, \pi_k(s))\\
&= v_{\pi_k}(s)
\end{aligned}

As with DP, this theorem guarantees that either \pi_{k+1} is better than \pi_k, or they are both equal and optimal.

Action Value Function Estimation

The estimation process is almost identical to that of the state value function: every idea used for estimating state values can also be applied to estimating action values.

Pseudocode
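As a concrete sketch, here is one way every-visit Monte Carlo estimation of q(s, a) under a fixed policy can look in Python. It assumes a Gymnasium-style environment interface (env.reset(), env.step()) and a policy given as a function from states to actions; treat it as an illustrative sketch rather than a definitive implementation:

```python
from collections import defaultdict

def mc_action_value_estimation(env, policy, num_episodes, gamma=0.99):
    """Every-visit Monte Carlo estimation of q(s, a) for a fixed policy."""
    returns_sum = defaultdict(float)   # cumulative return per (state, action)
    returns_count = defaultdict(int)   # number of returns per (state, action)
    Q = defaultdict(float)             # current action value estimates

    for _ in range(num_episodes):
        # Generate one episode by following the given policy
        episode = []
        state, _ = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            episode.append((state, action, reward))
            done = terminated or truncated
            state = next_state

        # Walk the episode backwards, accumulating the return G
        G = 0.0
        for state, action, reward in reversed(episode):
            G = gamma * G + reward
            returns_sum[(state, action)] += G
            returns_count[(state, action)] += 1
            Q[(state, action)] = returns_sum[(state, action)] / returns_count[(state, action)]

    return Q
```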

This way, with enough iterations, the estimated action values should approach the true action values.

With this, you can already build a method similar to policy iteration that doesn't rely on a model: replace the policy evaluation and policy improvement steps with the processes described above.

Optimization

While the evaluation step can be performed using Monte Carlo estimation as described, it tends to be computationally inefficient. As you've already seen, Monte Carlo methods typically require a large number of samples to produce reasonably accurate estimates. If we follow a structure similar to policy iteration, this inefficiency is magnified: after each policy improvement, we must re-run Monte Carlo estimation to re-evaluate the new policy, resulting in substantial overhead and slow learning.

A more natural alternative is to update the policy immediately after processing each episode. Instead of waiting to complete a full sweep of policy evaluation, we allow the agent to refine its behavior episode by episode, using the most recent action value estimates.

This results in a method that more closely resembles value iteration, combining aspects of evaluation and improvement in a single step. It improves sample efficiency and speeds up learning.

Pseudocode
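For reference, here is a minimal every-visit sketch of Monte Carlo control with exploring starts. The helpers sample_start (returning a random state-action pair) and step (returning next_state, reward, done) are hypothetical stand-ins for the environment interaction, so this is a sketch under those assumptions rather than the exact pseudocode:

```python
import numpy as np
from collections import defaultdict

def mc_control_exploring_starts(sample_start, step, n_actions, num_episodes, gamma=0.99):
    """Monte Carlo control with exploring starts (every-visit variant)."""
    Q = defaultdict(lambda: np.zeros(n_actions))       # action value estimates
    counts = defaultdict(lambda: np.zeros(n_actions))  # visit counts per (s, a)
    policy = defaultdict(int)                          # greedy action per state

    for _ in range(num_episodes):
        # Exploring start: the episode begins with a random state-action pair
        state, action = sample_start()

        # Generate the rest of the episode by following the current greedy policy
        episode = []
        done = False
        while not done:
            next_state, reward, done = step(state, action)
            episode.append((state, action, reward))
            state = next_state
            action = policy[state]

        # Evaluation and improvement in one backward pass over the episode
        G = 0.0
        for state, action, reward in reversed(episode):
            G = gamma * G + reward
            counts[state][action] += 1
            # Incremental mean update of q(s, a)
            Q[state][action] += (G - Q[state][action]) / counts[state][action]
            # Make the policy greedy with respect to the updated estimates
            policy[state] = int(np.argmax(Q[state]))

    return Q, policy
```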

This algorithm follows the GPI framework, as it contains both policy evaluation and policy improvement steps, and it is called Monte Carlo control. The one major downside of this specific implementation is the assumption of exploring starts. In the next chapters, you will see why this is a problem and how it can be dealt with.

