Introduction to Reinforcement Learning
Exploration Approaches
The exploring starts assumption is useful for ensuring that all states (state-action pairs) are visited over time. However, in most real-world tasks, it has a major drawback: it requires a model to initialize the agent in arbitrary states.
In rare cases — when the environment naturally begins episodes from random states that cover the entire state space — exploring starts can be applied without issue. But more commonly, tasks have a fixed or limited set of starting states, making such randomization impossible without a partial model. This model should at least be capable of simulating one step of the environment from any state. While this is still less demanding than needing a full model, it's often impractical.
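To make the requirement concrete, here is a rough sketch of generating an episode under exploring starts. The `env` object, its `set_state` method, and the `policy` callable are assumptions made for illustration only, not an API from the course:

```python
import random

def episode_with_exploring_starts(env, policy, states, actions):
    """Sketch: generate one episode that begins from a random state-action pair.

    env.set_state(state) is a hypothetical method; needing such access is
    exactly the drawback described above, since ordinary environments only
    offer a reset() to a fixed or limited set of start states.
    """
    state, action = random.choice(states), random.choice(actions)
    env.set_state(state)                              # requires at least a partial model
    episode = []
    done = False
    while not done:
        next_state, reward, done = env.step(action)   # assumed one-step interface
        episode.append((state, action, reward))
        state, action = next_state, policy(next_state)
    return episode
```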
Alternative Exploration Approaches
If starting from a random state (state-action pair) is not an option, the alternative is to ensure that every action has a non-zero probability of being selected in every state. This guarantees that, over time, the agent will explore all reachable parts of the state space. If a state can be reached through some valid sequence of actions, it eventually will be; and if it can't be reached at all under the environment's dynamics, then it's irrelevant to the learning process.
This idea leads to the use of stochastic policies, where the agent does not always choose the best-known action, but instead selects actions with some degree of randomness. A common strategy for this is the familiar ε-greedy policy, which chooses the greedy action most of the time, but with probability ε selects a random action instead. This ensures continual exploration while still favoring high-value actions.
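A minimal sketch of such a policy in Python (the function name and the toy action values are illustrative, not part of the course material):

```python
import numpy as np

def epsilon_greedy_action(q_values, epsilon, rng=None):
    """Return a random action with probability epsilon, otherwise the greedy one."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore: any action, uniformly at random
    return int(np.argmax(q_values))               # exploit: best-known action

# Example: action-value estimates for a single state
q = np.array([0.2, 1.5, -0.3, 0.7])
action = epsilon_greedy_action(q, epsilon=0.1)    # usually action 1, occasionally random
```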
At this point, it's also useful to distinguish between two major classes of methods:
- On-policy methods evaluate and improve the same policy that is used to generate the data;
- Off-policy methods evaluate and improve one policy (the target policy), while the data is generated by a different policy (the behavior policy).
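A rough way to see the difference in code (the Q table, the policies, and the numbers below are made up for illustration; they are not the course's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 3))                  # toy action-value table: 5 states, 3 actions

def greedy(q_row):
    return int(np.argmax(q_row))

def epsilon_greedy(q_row, epsilon):
    # With probability epsilon choose uniformly at random, otherwise act greedily
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))
    return greedy(q_row)

state = 2

# On-policy: the policy that generates the data is the policy being evaluated and improved.
policy = lambda s: epsilon_greedy(Q[s], epsilon=0.1)
action = policy(state)                       # the environment sees this action; the same policy is learned

# Off-policy: an exploratory behavior policy generates the data,
# while a different target policy (here, the greedy one) is evaluated and improved.
behavior_policy = lambda s: epsilon_greedy(Q[s], epsilon=0.3)
target_policy = lambda s: greedy(Q[s])
action = behavior_policy(state)              # the environment sees actions from the behavior policy
learned_action = target_policy(state)        # but the greedy target policy is the one being learned
```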
1. What is the main issue with the exploring starts assumption?
2. What is the difference between on-policy and off-policy methods in reinforcement learning?