Introduction to Reinforcement Learning
Value Function Estimation
Let's begin by revisiting a familiar concept: the state value function, denoted as $v_\pi(s)$. It can be defined as

$$v_\pi(s) = \mathbb{E}_\pi\bigl[G_t \mid S_t = s\bigr],$$

the expected return $G_t$ when starting in state $s$ and following policy $\pi$ thereafter.
Our goal in this chapter is to estimate this function from data, assuming we are given a fixed policy but have no access to the environment's model.
Monte Carlo Estimation
Monte Carlo methods approach this estimation task by sampling episodes of experience under the policy $\pi$, then using these samples to form empirical estimates of $v_\pi(s)$.
In general, the process can be split into the following steps:
- Generate an episode using policy $\pi$;
- Record the return obtained for each state appearing in the episode;
- Repeat steps 1-2 for a number of episodes;
- Compute the new value estimates by averaging the returns recorded for each state, as in the formula below.
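Concretely, the averaging step yields a sample-mean estimate of the value function defined earlier. Writing $N(s)$ for the number of returns recorded for state $s$ and $G_i(s)$ for the $i$-th of those returns:

$$\hat{v}_\pi(s) = \frac{1}{N(s)} \sum_{i=1}^{N(s)} G_i(s)$$

By the law of large numbers, this average converges to $v_\pi(s)$ as more returns are collected.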
First-Visit and Every-Visit MC
There are two primary approaches to obtaining state returns: first-visit and every-visit.
First-Visit MC
In the first-visit approach, each episode contributes a return only for the first occurrence of a state. For example, if a state appears 3 times in an episode, only the return following its first appearance counts; the other two returns are ignored.
Every-Visit MC
In the every-visit approach, every occurrence of a state counts. For example, if a state appears 3 times in an episode, all 3 returns are saved.
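To make the difference concrete, here is a small Python sketch (not from the original lesson) that collects returns from a single episode in both modes. The episode format (a list of `(state, reward)` pairs, where the reward is the one received after leaving that state) and the discount factor `gamma` are assumptions made for this illustration.

```python
# A minimal sketch, assuming episodes are lists of (state, reward) pairs and
# that the reward in each pair is the one received after leaving that state.

def episode_returns(episode, gamma=1.0):
    """Compute the return G_t following each time step, working backwards."""
    returns = [0.0] * len(episode)
    g = 0.0
    for t in reversed(range(len(episode))):
        _, reward = episode[t]
        g = reward + gamma * g
        returns[t] = g
    return returns

def collect_returns(episode, gamma=1.0, first_visit=True):
    """Map each state to the list of returns credited to it under the chosen mode."""
    returns = episode_returns(episode, gamma)
    collected = {}
    seen = set()
    for t, (state, _) in enumerate(episode):
        if first_visit and state in seen:
            continue  # first-visit: ignore later occurrences of an already-seen state
        seen.add(state)
        collected.setdefault(state, []).append(returns[t])
    return collected

# State 'A' appears twice: first-visit keeps one return for it, every-visit keeps both.
episode = [('A', 1), ('B', 0), ('A', 2), ('C', 1)]
print(collect_returns(episode, first_visit=True))   # {'A': [4.0], 'B': [3.0], 'C': [1.0]}
print(collect_returns(episode, first_visit=False))  # {'A': [4.0, 3.0], 'B': [3.0], 'C': [1.0]}
```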
Exploring Starts
Imagine a simple one-dimensional world represented by a line that extends from -10 to +10. The agent begins at position 0, and its current policy dictates that it always moves to the right at each time step.
If we attempt to generate episodes under this policy, what happens? The agent will continuously move toward the positive end of the line — visiting states like 1, 2, 3, and so on — but it will never visit any negative states. As a result, we cannot estimate value functions for states to the left of the origin, simply because the agent never experiences them.
The main problem, then, is that if certain parts of the state space are never explored, their value estimates remain inaccurate or undefined. One common solution to this problem is the use of exploring starts.
With exploring starts, each episode begins not at a fixed starting state like 0, but at a randomly selected state. Once the episode begins, the agent follows its current policy as usual. Over time, by starting from many different points across the state space, the agent is able to visit all states — not just the ones its policy would naturally lead it to. This allows the Monte Carlo method to produce more accurate and complete value estimates for the entire state space.
Pseudocode
This pseudocode uses the every-visit approach together with exploring starts.
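A minimal Python sketch of that combination, assuming the 1D line world described above (states -10 to +10 with a policy that always moves right); the reward of +1 for reaching the right end and the discount factor of 0.9 are illustrative assumptions, not part of the lesson.

```python
import random
from collections import defaultdict

STATES = list(range(-10, 11))  # the 1D line world from -10 to +10
GAMMA = 0.9                    # assumed discount factor

def generate_episode(start_state):
    """Follow the fixed policy (always move right) until reaching the terminal state +10."""
    episode = []  # list of (state, reward) pairs
    state = start_state
    while state < 10:
        next_state = state + 1
        reward = 1.0 if next_state == 10 else 0.0  # assumed reward: +1 on reaching +10
        episode.append((state, reward))
        state = next_state
    return episode

returns = defaultdict(list)  # state -> all returns observed from that state
V = defaultdict(float)       # state -> current value estimate

for _ in range(1000):
    # Exploring starts: each episode begins in a randomly chosen non-terminal state.
    start = random.choice(STATES[:-1])
    episode = generate_episode(start)

    # Every-visit MC: walk the episode backwards, crediting the running return to each visit.
    g = 0.0
    for state, reward in reversed(episode):
        g = reward + GAMMA * g
        returns[state].append(g)
        V[state] = sum(returns[state]) / len(returns[state])

print({s: round(V[s], 3) for s in sorted(V)})  # here V(s) converges to 0.9 ** (9 - s)
```

Because every non-terminal state can be drawn as a starting point, states to the left of 0 also receive value estimates, which the fixed rightward policy starting at 0 would never provide.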
1. How does the first-visit MC method differ from the every-visit MC method?
2. What is the main advantage of using exploring starts in Monte Carlo methods?