Exploration Approaches

The exploring starts assumption is useful for ensuring that every state-action pair is visited over time. However, in most real-world tasks it has a major drawback: it requires a model of the environment to start the agent from arbitrary states.

In rare cases, when the environment naturally begins episodes from random states that cover the entire state space, exploring starts can be applied without issue. But more commonly, tasks have a fixed or limited set of starting states, making such randomization impossible without a partial model. This model should at least be capable of simulating one step of the environment from any state. While this is still less demanding than needing a full model, it's often impractical.

Alternative Exploration Approaches

If starting from a random state (state-action pair) is not an option, the alternative is to ensure that every action has a non-zero probability of being selected in every state. This guarantees that, over time, the agent will explore all reachable parts of the state space. If a state can be reached through some valid sequence of actions, it eventually will be; and if it can't be reached at all under the environment's dynamics, then it's irrelevant to the learning process.

This idea leads to the use of stochastic policies, where the agent does not always choose the best-known action, but instead selects actions with some degree of randomness. A common strategy for this is the familiar ε-greedy policy, which chooses the greedy action most of the time, but with probability ε selects a random action instead. This ensures continual exploration while still favoring high-value actions.
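As a minimal sketch, here is what ε-greedy action selection might look like for a tabular agent. The names Q, state, and n_actions are illustrative assumptions: a NumPy array of action values, the current state index, and the number of available actions.

```python
import numpy as np

def epsilon_greedy_action(Q, state, n_actions, epsilon=0.1, rng=None):
    """Select an action epsilon-greedily from a tabular action-value array Q."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore: uniform random action
    return int(np.argmax(Q[state]))           # exploit: greedy (best-known) action
```

Under this scheme every action keeps probability at least ε/|A| of being selected, while the greedy action is chosen with probability 1 - ε + ε/|A|, which is exactly the non-zero-probability guarantee described above.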

At this point, it's also useful to distinguish between two major classes of methods:

  • On-policy methods evaluate and improve the same policy that is used to generate the data;
  • Off-policy methods evaluate and improve one policy (the target policy) while generating the data with a different policy (the behavior policy), as sketched in the example below.
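To make the distinction concrete, here is a hedged sketch under assumed names (target_policy, behavior_policy, and a tabular Q are illustrative, not part of the lesson's notation):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))      # illustrative tabular action values

def target_policy(state):
    # Policy being evaluated and improved: greedy with respect to Q.
    return int(np.argmax(Q[state]))

def behavior_policy(state, epsilon=0.1):
    # Policy generating the data: epsilon-greedy for continual exploration.
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return target_policy(state)

# On-policy:  the data-generating policy and the learned policy are the same
#             (e.g., both are epsilon-greedy with respect to Q).
# Off-policy: data comes from behavior_policy, while target_policy (here the
#             greedy policy) is the one being evaluated and improved.
```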

1. What is the main issue with the exploring starts assumption?

2. What is the difference between on-policy and off-policy methods in reinforcement learning?

