Value Function Estimation

Let's begin by revisiting a familiar concept: the state value function, denoted as v_\pi(s). It can be defined as

v_\pi(s) = \mathbb{E}_\pi\left[ G_t \mid S_t = s \right]
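Here G_t is the return: the cumulative discounted reward collected after time step t, written with rewards R and discount factor \gamma in the usual way:

G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}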

Our goal in this chapter is to estimate this function from data, assuming we are given a fixed policy \pi but have no access to the environment's model.

Monte Carlo Estimation

Monte Carlo methods approach this estimation task by sampling episodes of experience under the policy \pi, and then using these samples to form empirical estimates of v_\pi(s).

In general, the process can be split into the following steps:

  1. Generate an episode using policy π\pi;
  2. Save the obtained return value for each state appearing in the episode;
  3. Repeat steps 1-2 until enough episodes have been collected;
  4. Compute the value estimates by averaging the collected returns for each state.
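As a rough Python sketch of step 2, assuming an episode is represented as a list of (state, reward) pairs (S_t, R_{t+1}) and gamma is the discount factor (this representation is an assumption, not the course's code):

```python
def episode_returns(episode, gamma=1.0):
    """Compute (state, G_t) pairs for one episode by accumulating rewards backwards."""
    result = []
    G = 0.0
    for state, reward in reversed(episode):
        G = reward + gamma * G        # G_t = R_{t+1} + gamma * G_{t+1}
        result.append((state, G))
    result.reverse()                  # restore chronological order
    return result
```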

First-Visit and Every-Visit MC

There are two primary approaches to obtaining state returns: first-visit and every-visit.

First-Visit MC

In the first-visit approach, each episode contributes a return only for the first occurrence of a state. For example, if state s appears 3 times, only the return following its first appearance counts; the other two returns are ignored.
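A sketch of the first-visit bookkeeping, reusing the hypothetical episode_returns helper from the earlier sketch:

```python
def first_visit_returns(episode, gamma=1.0):
    """Keep a return only for the first occurrence of each state in one episode."""
    seen = set()
    kept = []
    for state, G in episode_returns(episode, gamma):   # helper sketched above
        if state not in seen:                          # later occurrences are ignored
            seen.add(state)
            kept.append((state, G))
    return kept
```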

Every-Visit MC

In the every-visit approach, every occurrence of a state counts. For example, if state s appears 3 times, each of the 3 returns is saved.
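For comparison, a sketch of the every-visit bookkeeping across many episodes, again reusing the hypothetical episode_returns helper; the value estimate is simply the average of all collected returns:

```python
from collections import defaultdict

def every_visit_estimate(episodes, gamma=1.0):
    """Average the return of every occurrence of every state across episodes."""
    returns = defaultdict(list)
    for episode in episodes:
        for state, G in episode_returns(episode, gamma):   # nothing is filtered out
            returns[state].append(G)                       # a state seen 3 times adds 3 returns
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```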

Exploring Starts

Imagine a simple one-dimensional world represented by a line that extends from -10 to +10. The agent begins at position 0, and its current policy dictates that it always moves to the right at each time step.

If we attempt to generate episodes under this policy, what happens? The agent will continuously move toward the positive end of the line — visiting states like 1, 2, 3, and so on — but it will never visit any negative states. As a result, we cannot estimate value functions for states to the left of the origin, simply because the agent never experiences them.

The main problem is this: if certain parts of the state space are never visited, their value estimates remain inaccurate or undefined. One common solution to this problem is the use of exploring starts.

With exploring starts, each episode begins not at a fixed starting state like 0, but at a randomly selected state. Once the episode begins, the agent follows its current policy as usual. Over time, by starting from many different points across the state space, the agent is able to visit all states — not just the ones its policy would naturally lead it to. This allows the Monte Carlo method to produce more accurate and complete value estimates for the entire state space.
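As an illustration only, exploring starts in this line world could be generated as below; the reward scheme is an assumption made for the sketch, since the course does not specify rewards for this example:

```python
import random

def generate_episode_exploring_start():
    """One episode in the line world: random start in -10..9, then always move right."""
    state = random.randint(-10, 9)                 # exploring start: random non-terminal state
    episode = []
    while state < 10:                              # +10 is treated as the terminal state here
        next_state = state + 1                     # fixed policy: always move right
        reward = 1.0 if next_state == 10 else 0.0  # hypothetical reward: reaching +10 pays 1
        episode.append((state, reward))
        state = next_state
    return episode
```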

Pseudocode

The pseudocode combines the every-visit approach with exploring starts.
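A minimal Python rendering of such a procedure might look as follows; step(state, action), policy(state), and the list of non-terminal states are assumed environment helpers, not definitions from this course:

```python
import random
from collections import defaultdict

def mc_prediction_exploring_starts(states, step, policy, num_episodes, gamma=1.0):
    """Every-visit Monte Carlo prediction with exploring starts (sketch)."""
    returns = defaultdict(list)                    # state -> list of observed returns

    for _ in range(num_episodes):
        # Exploring start: begin each episode in a randomly chosen state
        state = random.choice(states)

        # Generate an episode by following the fixed policy
        episode = []
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = step(state, action)
            episode.append((state, reward))
            state = next_state

        # Every-visit: record the discounted return for every state occurrence
        G = 0.0
        for s, r in reversed(episode):
            G = r + gamma * G
            returns[s].append(G)

    # Estimate v_pi(s) as the average of the returns collected for s
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```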

1. How does the first-visit MC method differ from the every-visit MC method?

2. What is the main advantage of using exploring starts in Monte Carlo methods?

