Exploration Approaches

The exploring starts assumption is useful for ensuring that every state-action pair is visited over time. However, in most real-world tasks it has a major drawback: it requires a model of the environment to start the agent from arbitrary states.

In rare cases, when the environment naturally begins episodes from random states that cover the entire state space, exploring starts can be applied without issue. But more commonly, tasks have a fixed or limited set of starting states, making such randomization impossible without a partial model. This model should at least be capable of simulating one step of the environment from any state. While this is still less demanding than needing a full model, it's often impractical.

Alternative Exploration Approaches

If starting from a random state (state-action pair) is not an option, the alternative is to ensure that every action has a non-zero probability of being selected in every state. This guarantees that, over time, the agent will explore all reachable parts of the state space. If a state can be reached through some valid sequence of actions, it eventually will be; and if it can't be reached at all under the environment's dynamics, then it's irrelevant to the learning process.

This idea leads to the use of stochastic policies, where the agent does not always choose the best-known action, but instead selects actions with some degree of randomness. A common strategy for this is the familiar ε-greedy policy, which chooses the greedy action most of the time, but with probability ε selects a random action instead. This ensures continual exploration while still favoring high-value actions.
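As a minimal sketch, here is what ε-greedy action selection might look like for a tabular agent. The names Q, state, and n_actions are illustrative assumptions: a NumPy array of action values, the current state index, and the number of available actions.

```python
import numpy as np

def epsilon_greedy_action(Q, state, n_actions, epsilon=0.1, rng=None):
    """Select an action epsilon-greedily from a tabular action-value array Q."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore: uniform random action
    return int(np.argmax(Q[state]))           # exploit: greedy (best-known) action
```

Under this scheme every action keeps probability at least ε/|A| of being selected, while the greedy action is chosen with probability 1 - ε + ε/|A|, which is exactly the non-zero-probability guarantee described above.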

At this point, it's also useful to distinguish between two major classes of methods:

  • On-policy methods evaluate and improve the same policy that is used to generate the data;
  • Off-policy methods evaluate and improve one policy (the target policy) while generating the data with a different policy (the behavior policy), as sketched in the example below.
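To make the distinction concrete, here is a hedged sketch under assumed names (target_policy, behavior_policy, and a tabular Q are illustrative, not part of the lesson's notation):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))      # illustrative tabular action values

def target_policy(state):
    # Policy being evaluated and improved: greedy with respect to Q.
    return int(np.argmax(Q[state]))

def behavior_policy(state, epsilon=0.1):
    # Policy generating the data: epsilon-greedy for continual exploration.
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return target_policy(state)

# On-policy:  the data-generating policy and the learned policy are the same
#             (e.g., both are epsilon-greedy with respect to Q).
# Off-policy: data comes from behavior_policy, while target_policy (here the
#             greedy policy) is the one being evaluated and improved.
```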

1. What is the main issue with the exploring starts assumption?

2. What is the difference between on-policy and off-policy methods in reinforcement learning?

