Introduction to Reinforcement Learning
Value Function Estimation
Let's begin by revisiting a familiar concept: the state value function, denoted as $v_\pi(s)$. It can be defined as

$$v_\pi(s) = \mathbb{E}_\pi\bigl[G_t \mid S_t = s\bigr],$$

where $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$ is the discounted return obtained from time step $t$ onward while following policy $\pi$.
Our goal in this chapter is to estimate this function from data, assuming we are given a fixed policy but have no access to the environment's model.
Monte Carlo Estimation
Monte Carlo methods approach this estimation task by sampling episodes of experience under the policy $\pi$, then using these samples to form empirical estimates of $v_\pi(s)$.
In general, the process can be split into the following steps:
Generate an episode using policy $\pi$;
Save the obtained return for each state appearing in the episode (the return computation is sketched after this list);
Repeat steps 1-2 for a sufficient number of episodes;
Compute the new $v_\pi(s)$ values by averaging the returns collected for each state.
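As an illustration of step 2, the short sketch below computes the return that follows each time step of one episode by accumulating rewards backwards ($G_t = R_{t+1} + \gamma G_{t+1}$); the episode format, a list of (state, reward) pairs, and the discount factor are assumptions made for this example.

```python
def returns_per_step(episode, gamma=0.9):
    """Compute the return G_t following every time step of one episode.

    episode: list of (state, reward) pairs, where the reward is the one
    received after leaving that state.
    """
    returns = []
    g = 0.0
    # Iterate backwards: G_t = R_{t+1} + gamma * G_{t+1}
    for state, reward in reversed(episode):
        g = reward + gamma * g
        returns.append((state, g))
    returns.reverse()  # restore chronological order
    return returns

# Example: a 3-step episode with a single reward at the end
episode = [(0, 0.0), (1, 0.0), (2, 1.0)]
print(returns_per_step(episode, gamma=0.9))
# [(0, 0.81), (1, 0.9), (2, 1.0)] up to floating-point rounding
```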
Collecting the Returns
Monte Carlo estimation of the value function requires collecting the returns from generated episodes. To compute these returns, two primary approaches can be used (both are illustrated in the sketch after this list):
First-visit: for each state encountered in an episode, only the return following its first occurrence is considered. Subsequent occurrences of the same state within the same episode are ignored for estimation purposes;
Every-visit: every occurrence of a state within an episode is used. That is, the return following each visit to the state is included in the estimate, even if the state appears multiple times in the same episode.
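As a rough illustration of the difference, the sketch below collects returns from a single episode under both schemes; the episode format and the first_visit flag are assumptions made for this example, not part of any particular library.

```python
from collections import defaultdict

def collect_returns(episode, gamma=0.9, first_visit=True):
    """Collect the returns observed for each state in a single episode.

    episode: list of (state, reward) pairs, where the reward is received
    after leaving that state. Returns a dict: state -> list of returns.
    """
    # Backward accumulation of G_t, as in the previous sketch.
    g = 0.0
    step_returns = []
    for state, reward in reversed(episode):
        g = reward + gamma * g
        step_returns.append((state, g))
    step_returns.reverse()  # chronological order again

    collected = defaultdict(list)
    seen = set()
    for state, g in step_returns:
        if first_visit and state in seen:
            continue  # first-visit: ignore repeated occurrences within the episode
        seen.add(state)
        collected[state].append(g)
    return dict(collected)

# State 1 appears twice in this episode:
episode = [(0, 0.0), (1, 0.0), (2, 0.0), (1, 0.0), (3, 1.0)]
print(collect_returns(episode, first_visit=True))   # one return recorded for state 1
print(collect_returns(episode, first_visit=False))  # two returns recorded for state 1
```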
Exploring Starts
Imagine a simple one-dimensional world represented by a line that extends from -10 to +10. The agent begins at position 0, and its current policy dictates that it always moves to the right at each time step.
If we attempt to generate episodes under this policy, what happens? The agent will continuously move toward the positive end of the line, visiting states like 1, 2, 3, and so on, but it will never visit any negative states. As a result, we cannot estimate value functions for states to the left of the origin, simply because the agent never experiences them.
So the main problem is: if certain parts of the state space are never explored, their value estimates will remain inaccurate or undefined. One common solution to this problem is the use of exploring starts.
With exploring starts, each episode begins not at a fixed starting state like 0, but at a randomly selected state. Once the episode begins, the agent follows its current policy as usual. Over time, by starting from many different points across the state space, the agent is able to visit all states, not just the ones its policy would naturally lead it to. This allows the Monte Carlo method to produce more accurate and complete value estimates for the entire state space.
Pseudocode
This pseudocode uses the every-visit approach together with exploring starts.
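A minimal Python sketch of the same procedure is given below, using the one-dimensional line world described above as the example environment; the reward of +1 for reaching the right end, the discount factor, and all function names are assumptions made for this sketch rather than parts of a specific library.

```python
import random
from collections import defaultdict

# Sketch: every-visit Monte Carlo estimation with exploring starts,
# using the 1D line world from the text (states -10..10, policy: always move right).
# The +1 reward at the right terminal state is an assumption made for this example.

GAMMA = 0.9
TERMINALS = {-10, 10}

def generate_episode(start_state):
    """Follow the fixed policy (always step right) from start_state.

    Returns a list of (state, reward) pairs, where the reward is the one
    received after leaving that state.
    """
    episode = []
    state = start_state
    while state not in TERMINALS:
        next_state = state + 1            # the policy always moves right
        reward = 1.0 if next_state == 10 else 0.0
        episode.append((state, reward))
        state = next_state
    return episode

def mc_every_visit_exploring_starts(num_episodes=5000):
    returns = defaultdict(list)           # state -> list of observed returns
    non_terminal = [s for s in range(-10, 11) if s not in TERMINALS]

    for _ in range(num_episodes):
        # Exploring start: pick a random non-terminal initial state.
        start_state = random.choice(non_terminal)
        episode = generate_episode(start_state)

        # Every-visit: record the return following each occurrence of a state.
        g = 0.0
        for state, reward in reversed(episode):
            g = reward + GAMMA * g
            returns[state].append(g)

    # Value estimate = average of the collected returns per state.
    return {state: sum(gs) / len(gs) for state, gs in returns.items()}

values = mc_every_visit_exploring_starts()
print(values[0])   # approximately GAMMA ** 9: the +1 reward lies ten steps to the right
```

Because every episode starts from a randomly chosen non-terminal state, the negative states also receive value estimates, which would not happen if every episode started at 0.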
1. How does the first-visit MC method differ from the every-visit MC method?
2. What is the main advantage of using exploring starts in Monte Carlo methods?