Monte Carlo Control
By replacing the policy evaluation step in the standard policy iteration algorithm with the Monte Carlo estimation techniques described in the previous chapter, we can already derive a new variation of policy iteration: one that relies on sampled experience instead of dynamic programming.
However, there's a critical limitation. In traditional policy iteration, the policy improvement step depends on having access to a complete model of the environment. Specifically, to update the policy, we use the following expression:

$$\pi(s) \leftarrow \arg\max_a \sum_{s', r} p(s', r \mid s, a)\Bigl[r + \gamma v_\pi(s')\Bigr]$$

This equation assumes that we know the transition probabilities $p(s', r \mid s, a)$. But this is precisely the problem: Monte Carlo methods are designed for model-free settings, where the environment's transition dynamics are unknown. If a complete model is available, then we might as well use dynamic programming throughout, including for policy evaluation, since it would be more efficient and precise.
Therefore, while substituting Monte Carlo methods for value estimation is a step toward model-free reinforcement learning, we must also find a way to perform policy improvement without relying on knowledge of the model. This requires shifting from the state value function $v(s)$ to the action value function $q(s, a)$.
Why Action Values?
By using action values, it is possible to perform policy improvement without needing a model of the environment. Instead of relying on transition probabilities to compute expected returns, we can directly select actions that appear to yield the highest value. The policy improvement step then becomes:

$$\pi(s) \leftarrow \arg\max_a q_\pi(s, a)$$
And it is not hard to prove that the new policy is not worse than the old one, as the policy improvement theorem can still be applied:

$$q_\pi(s, \pi'(s)) \ge v_\pi(s) \quad \text{for all } s$$

And, like with DP, this theorem guarantees that either $\pi'$ is better than $\pi$, or that they are both equal and optimal.
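In a tabular setting, this greedy improvement step can be written in a few lines. The sketch below assumes action values stored in a NumPy array of shape (number of states, number of actions) and a policy stored as an array of action indices; all names are illustrative, not the course's exact code.

```python
import numpy as np

def improve_policy_greedily(Q, policy):
    """Greedy policy improvement from action value estimates.

    Q: array of shape (n_states, n_actions) with estimated action values.
    policy: array of length n_states mapping each state to an action index.
    """
    for s in range(Q.shape[0]):
        # No transition probabilities are needed: simply pick the action
        # with the highest estimated value in this state.
        policy[s] = np.argmax(Q[s])
    return policy
```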
Action Value Function Estimation
The process of estimating action values is almost identical to that of estimating state values. All the ideas used to estimate state values can be applied to action values as well.
Pseudocode
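As a concrete illustration, here is a minimal every-visit Monte Carlo sketch in Python, assuming a Gymnasium-style environment with discrete states and actions and a fixed deterministic policy given as a state-to-action mapping (all names and signatures here are assumptions for illustration, not the course's exact pseudocode).

```python
import numpy as np
from collections import defaultdict

def estimate_action_values(env, policy, num_episodes, gamma=1.0):
    """Every-visit Monte Carlo estimation of q(s, a) for a fixed policy."""
    returns_sum = defaultdict(float)   # cumulative return per (state, action)
    returns_count = defaultdict(int)   # number of returns per (state, action)
    Q = defaultdict(float)             # current action value estimates

    for _ in range(num_episodes):
        # Generate one episode following the given policy
        episode = []
        state, _ = env.reset()
        done = False
        while not done:
            action = policy[state]
            next_state, reward, terminated, truncated, _ = env.step(action)
            episode.append((state, action, reward))
            state = next_state
            done = terminated or truncated

        # Walk the episode backwards, accumulating discounted returns
        G = 0.0
        for state, action, reward in reversed(episode):
            G = gamma * G + reward
            returns_sum[(state, action)] += G
            returns_count[(state, action)] += 1
            Q[(state, action)] = returns_sum[(state, action)] / returns_count[(state, action)]

    return Q
```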
This way, with enough iterations, the estimated action values approach the true action values.
With this, you can already build a method similar to policy iteration that doesn't rely on a model. To do this, you replace the policy evaluation and policy improvement steps with the processes described above, as sketched below.
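Putting the two pieces together, a model-free analogue of policy iteration might look like the following sketch. It reuses the `estimate_action_values` and `improve_policy_greedily` helpers from the sketches above; the episode budget and iteration count are arbitrary illustrative choices.

```python
import numpy as np

def monte_carlo_policy_iteration(env, n_states, n_actions,
                                 eval_episodes=1000, iterations=20, gamma=1.0):
    """Model-free policy iteration: Monte Carlo evaluation + greedy improvement.

    Reuses the illustrative helpers defined in the sketches above.
    """
    policy = np.zeros(n_states, dtype=int)  # start with an arbitrary policy

    for _ in range(iterations):
        # Policy evaluation: estimate q values for the current policy from samples
        Q_dict = estimate_action_values(env, policy, eval_episodes, gamma)

        # Convert the estimates into an array and improve the policy greedily
        Q = np.zeros((n_states, n_actions))
        for (s, a), value in Q_dict.items():
            Q[s, a] = value
        policy = improve_policy_greedily(Q, policy)

    return policy
```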
Optimization
While the evaluation step can be performed using Monte Carlo estimation as described, it tends to be computationally inefficient. As you've already seen, Monte Carlo methods typically require a large number of samples to produce reasonably accurate estimates. If we follow a structure similar to policy iteration, this inefficiency is magnified: after each policy improvement, we must re-run Monte Carlo estimation to re-evaluate the new policy, resulting in substantial overhead and slow learning.
A more natural alternative is to update the policy immediately after processing each episode. Instead of waiting to complete a full sweep of policy evaluation, we allow the agent to refine its behavior episode by episode, using the most recent action value estimates.
This results in a method that more closely resembles value iteration, combining aspects of evaluation and improvement in a single step. This improves sample efficiency and, as a result, speeds up learning.
Pseudocode
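Below is a minimal Python sketch of this episode-by-episode scheme. It assumes exploring starts are approximated by a Gymnasium-style `reset()` that randomizes the initial state plus a uniformly random first action; everything else (names, incremental averaging) is an illustrative choice rather than the course's exact pseudocode.

```python
import numpy as np
from collections import defaultdict

def monte_carlo_control_es(env, n_actions, num_episodes, gamma=1.0):
    """Monte Carlo control with exploring starts (every-visit variant).

    After each episode, action value estimates are updated and the policy
    is made greedy with respect to them, so evaluation and improvement
    are interleaved episode by episode.
    """
    Q = defaultdict(float)             # action value estimates
    returns_count = defaultdict(int)   # visit counts for incremental averaging
    policy = defaultdict(int)          # greedy policy, one action per state

    for _ in range(num_episodes):
        # Exploring start: random first action (assumes reset() randomizes the state)
        state, _ = env.reset()
        action = np.random.randint(n_actions)

        episode = []
        done = False
        while not done:
            next_state, reward, terminated, truncated, _ = env.step(action)
            episode.append((state, action, reward))
            state = next_state
            done = terminated or truncated
            action = policy[state]     # follow the current greedy policy afterwards

        # Update estimates and improve the policy from this single episode
        G = 0.0
        for state, action, reward in reversed(episode):
            G = gamma * G + reward
            returns_count[(state, action)] += 1
            # Incremental average of observed returns
            Q[(state, action)] += (G - Q[(state, action)]) / returns_count[(state, action)]
            # Greedy improvement in the visited state
            policy[state] = int(np.argmax([Q[(state, a)] for a in range(n_actions)]))

    return policy, Q
```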
This algorithm follows the GPI framework, as it alternates policy evaluation and policy improvement steps, and it is called Monte Carlo control. The one major downside of this specific implementation is the assumption of exploring starts. In the next chapters, you will see why this is a problem and how it can be dealt with.