Monte Carlo Control

By replacing the policy evaluation step in the standard policy iteration algorithm with the Monte Carlo estimation techniques described in the previous chapter, we can already derive a new variation of policy iteration, one that relies on sampled experience instead of dynamic programming.

However, there's a critical limitation. In traditional policy iteration, the policy improvement step depends on having access to a complete model of the environment. Specifically, to update the policy, we use the following expression:

\pi(s) \gets \argmax_a \sum_{s', r} \textcolor{red}{p(s', r \mid s, a)} \Bigl(r + \gamma v(s')\Bigr)

This equation assumes that we know the transition probabilities p(s', r | s, a). But this is precisely the problem: Monte Carlo methods are designed for model-free settings, where the environment's transition dynamics are unknown. If a complete model is available, then we might as well use dynamic programming throughout, including for policy evaluation, since it would be more efficient and precise.

Therefore, while substituting Monte Carlo methods for value estimation is a step toward model-free reinforcement learning, we must also find a way to perform policy improvement without relying on knowledge of the model. This requires shifting from the state value function to the action value function.

Why Action Values?

By using action values, it is possible to perform policy improvement without needing a model of the environment. Instead of relying on transition probabilities to compute expected returns, we can directly select actions that appear to yield the highest value. The policy improvement step then becomes:

\pi(s) \gets \argmax_a q(s, a) \qquad \forall s \in S
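As a quick illustration, here is a minimal sketch of this greedy improvement step in a tabular setting. The array Q of shape (n_states, n_actions) is an illustrative assumption, not part of any particular library:

```python
import numpy as np

def greedy_improvement(Q):
    """Return the greedy policy for tabular action value estimates.

    Q is assumed to have shape (n_states, n_actions); the result maps
    each state index to the action with the highest estimated value.
    """
    return np.argmax(Q, axis=1)
```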

It is not hard to prove that the new policy is no worse than the old one, since the policy improvement theorem can still be applied:

\begin{aligned}
q_{\pi_{k}}(s, \pi_{k+1}(s)) &= q_{\pi_k}\bigl(s, \argmax_a q_{\pi_k}(s, a)\bigr)\\
&= \max_a q_{\pi_k}(s, a)\\
&\ge q_{\pi_k}(s, \pi_k(s))\\
&= v_{\pi_k}(s)
\end{aligned}

As with DP, this theorem guarantees that either \pi_{k+1} is better than \pi_k, or they are both equal and optimal.

Action Value Function Estimation

The estimation process is almost identical to that of the state value function: every idea used for estimating state values can also be applied to estimating action values.

Pseudocode
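As a concrete sketch, here is one way every-visit Monte Carlo estimation of q(s, a) under a fixed policy can look in Python. It assumes a Gymnasium-style environment interface (env.reset(), env.step()) and a policy given as a function from states to actions; treat it as an illustrative sketch rather than a definitive implementation:

```python
from collections import defaultdict

def mc_action_value_estimation(env, policy, num_episodes, gamma=0.99):
    """Every-visit Monte Carlo estimation of q(s, a) for a fixed policy."""
    returns_sum = defaultdict(float)   # cumulative return per (state, action)
    returns_count = defaultdict(int)   # number of returns per (state, action)
    Q = defaultdict(float)             # current action value estimates

    for _ in range(num_episodes):
        # Generate one episode by following the given policy
        episode = []
        state, _ = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            episode.append((state, action, reward))
            done = terminated or truncated
            state = next_state

        # Walk the episode backwards, accumulating the return G
        G = 0.0
        for state, action, reward in reversed(episode):
            G = gamma * G + reward
            returns_sum[(state, action)] += G
            returns_count[(state, action)] += 1
            Q[(state, action)] = returns_sum[(state, action)] / returns_count[(state, action)]

    return Q
```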

This way, with enough iterations, the estimated action values should approach the true action values.

With this, you can already build a method similar to policy iteration that doesn't rely on a model: replace the policy evaluation and policy improvement steps with the processes described above.

Optimization

While the evaluation step can be performed using Monte Carlo estimation as described, it tends to be computationally inefficient. As you've already seen, Monte Carlo methods typically require a large number of samples to produce reasonably accurate estimates. If we follow a structure similar to policy iteration, this inefficiency is magnified: after each policy improvement, we must re-run Monte Carlo estimation to re-evaluate the new policy, resulting in substantial overhead and slow learning.

A more natural alternative is to update the policy immediately after processing each episode. Instead of waiting to complete a full sweep of policy evaluation, we allow the agent to refine its behavior episode by episode, using the most recent action value estimates.

This results in a method that more closely resembles value iteration, combining aspects of evaluation and improvement in a single step. It improves sample efficiency and speeds up learning.

Pseudocode
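For reference, here is a minimal every-visit sketch of Monte Carlo control with exploring starts. The helpers sample_start (returning a random state-action pair) and step (returning next_state, reward, done) are hypothetical stand-ins for the environment interaction, so this is a sketch under those assumptions rather than the exact pseudocode:

```python
import numpy as np
from collections import defaultdict

def mc_control_exploring_starts(sample_start, step, n_actions, num_episodes, gamma=0.99):
    """Monte Carlo control with exploring starts (every-visit variant)."""
    Q = defaultdict(lambda: np.zeros(n_actions))       # action value estimates
    counts = defaultdict(lambda: np.zeros(n_actions))  # visit counts per (s, a)
    policy = defaultdict(int)                          # greedy action per state

    for _ in range(num_episodes):
        # Exploring start: the episode begins with a random state-action pair
        state, action = sample_start()

        # Generate the rest of the episode by following the current greedy policy
        episode = []
        done = False
        while not done:
            next_state, reward, done = step(state, action)
            episode.append((state, action, reward))
            state = next_state
            action = policy[state]

        # Evaluation and improvement in one backward pass over the episode
        G = 0.0
        for state, action, reward in reversed(episode):
            G = gamma * G + reward
            counts[state][action] += 1
            # Incremental mean update of q(s, a)
            Q[state][action] += (G - Q[state][action]) / counts[state][action]
            # Make the policy greedy with respect to the updated estimates
            policy[state] = int(np.argmax(Q[state]))

    return Q, policy
```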

This algorithm follows the GPI framework, as it contains both policy evaluation and policy improvement steps, and it is called Monte Carlo control. The one major downside of this specific implementation is the assumption of exploring starts. In the next chapters, you will see why this is a problem and how it can be dealt with.

