Value Function Estimation

Let's begin by revisiting a familiar concept: the state value function, denoted as $v_\pi(s)$. It can be defined as

$$v_\pi(s) = \mathbb{E}_\pi\bigl[G_t \mid S_t = s\bigr]$$

Our goal in this chapter is to estimate this function from data, assuming we are given a fixed policy $\pi$ but have no access to the environment's model.

Monte Carlo Estimation

Monte Carlo methods approach this estimation task by sampling episodes of experience under the policy $\pi$, then using these samples to form empirical estimates of $v_\pi(s)$.
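
Concretely, the estimate for a state is simply the average of the returns observed after visits to that state. Writing $V(s)$ for the estimate, $N(s)$ for the number of returns collected for state $s$, and $G_i(s)$ for the $i$-th of them (notation introduced here for illustration):

$$V(s) = \frac{1}{N(s)} \sum_{i=1}^{N(s)} G_i(s)$$

As more returns are collected, this average converges to $v_\pi(s)$.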

In general, the process can be split into the following steps (a minimal code sketch follows the list):

  1. Generate an episode using policy $\pi$;

  2. Save the obtained return value for each state appearing in the episode;

  3. Repeat steps 1–2 until enough episodes have been collected;

  4. Compute the new values by averaging returns for each state.
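
As a rough illustration only, these four steps might be arranged like this in Python. The `generate_episode` argument is a hypothetical helper that rolls out one episode under $\pi$ and returns a list of `(state, reward)` pairs, where each reward is the one received after leaving that state; the discount factor `gamma` is chosen arbitrarily for the sketch:

```python
from collections import defaultdict

def mc_prediction(generate_episode, num_episodes=1000, gamma=0.9):
    """Sketch of Monte Carlo estimation of state values under a fixed policy."""
    returns = defaultdict(list)              # state -> list of observed returns

    for _ in range(num_episodes):            # step 3: repeat for many episodes
        episode = generate_episode()         # step 1: one rollout under pi
        G = 0.0
        for state, reward in reversed(episode):
            G = reward + gamma * G           # return that followed this state
            returns[state].append(G)         # step 2: save it for this state

    # Step 4: the value estimate is the average of the collected returns
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```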

Collecting the Returns

Monte Carlo estimation of the value function requires collecting returns from generated episodes. To compute these returns, two primary approaches can be used:

  • First-visit: for each state $s$ encountered in an episode, only the return following its first occurrence is considered. Subsequent occurrences of the same state within the same episode are ignored for estimation purposes;

  • Every-visit: every occurrence of a state $s$ within an episode is used. That is, the return following each visit to the state is included in the estimate, even if the state appears multiple times in the same episode.
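
To make the distinction concrete, here is a small helper written for this page (not taken from the original) that extracts the returns from a single episode under either convention. As above, the episode is assumed to be a list of `(state, reward)` pairs, with each reward being the one received after leaving that state:

```python
def collect_returns(episode, gamma=0.9, first_visit=True):
    """Map each state to the list of returns credited to it in one episode."""
    # Compute the return G_t at every time step with a single backward sweep.
    G = 0.0
    step_returns = []
    for state, reward in reversed(episode):
        G = reward + gamma * G
        step_returns.append((state, G))
    step_returns.reverse()                   # restore chronological order

    collected, seen = {}, set()
    for state, G in step_returns:
        if first_visit and state in seen:
            continue                         # first-visit: ignore later occurrences
        seen.add(state)
        collected.setdefault(state, []).append(G)
    return collected
```

With `first_visit=True` each state contributes at most one return per episode; with `first_visit=False` (every-visit) every occurrence contributes.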

Exploring Starts

Imagine a simple one-dimensional world represented by a line that extends from -10 to +10. The agent begins at position 0, and its current policy dictates that it always moves to the right at each time step.

If we attempt to generate episodes under this policy, what happens? The agent will continuously move toward the positive end of the line β€” visiting states like 1, 2, 3, and so on β€” but it will never visit any negative states. As a result, we cannot estimate value functions for states to the left of the origin, simply because the agent never experiences them.

The main problem, then, is that if certain parts of the state space are never explored, their value estimates remain inaccurate or undefined. One common solution to this problem is the use of exploring starts.

With exploring starts, each episode begins not at a fixed starting state like 0, but at a randomly selected state. Once the episode begins, the agent follows its current policy as usual. Over time, by starting from many different points across the state space, the agent is able to visit all states β€” not just the ones its policy would naturally lead it to. This allows the Monte Carlo method to produce more accurate and complete value estimates for the entire state space.

Pseudocode

The pseudocode for this algorithm uses the every-visit approach together with exploring starts.
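
The original pseudocode block is not reproduced here. As a stand-in, the following is a minimal Python sketch of every-visit Monte Carlo prediction with exploring starts, applied to the one-dimensional line world from the example above; the reward of -1 per step and the discount factor are assumptions made for this sketch, not values from the original:

```python
import random
from collections import defaultdict

# Line world from the example: states -10..+10, where +10 is terminal.
TERMINAL = 10

def policy(state):
    """Fixed policy from the example: always step to the right."""
    return +1

def generate_episode(start_state):
    """Roll out one episode as a list of (state, reward) pairs.

    A reward of -1 per step is assumed for illustration.
    """
    episode, state = [], start_state
    while state != TERMINAL:
        episode.append((state, -1.0))
        state = state + policy(state)
    return episode

def mc_exploring_starts(num_episodes=5000, gamma=1.0):
    """Every-visit Monte Carlo prediction with exploring starts."""
    returns = defaultdict(list)
    for _ in range(num_episodes):
        start = random.randint(-10, 9)       # exploring starts: random non-terminal state
        G = 0.0
        for state, reward in reversed(generate_episode(start)):
            G = reward + gamma * G
            returns[state].append(G)         # every-visit: record each occurrence
    return {s: sum(g) / len(g) for s, g in returns.items()}

if __name__ == "__main__":
    V = mc_exploring_starts()
    # With reward -1 per step and gamma = 1, the dynamics are deterministic,
    # so V(s) comes out as exactly -(10 - s) for every non-terminal state s.
    print({s: round(v, 2) for s, v in sorted(V.items())})
```

Because exploring starts scatter the initial state across the whole line, every state from -10 to 9 receives a value estimate, even though the policy itself never moves left.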

1. How does the first-visit MC method differ from the every-visit MC method?

2. What is the main advantage of using exploring starts in Monte Carlo methods?
