Value Function Estimation

Let's begin by revisiting a familiar concept: the state value function, denoted as v_\pi(s). It can be defined as

v_\pi(s) = \mathbb{E}_\pi\left[ G_t \mid S_t = s \right]
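Here G_t is the return: the cumulative discounted reward collected after time step t, written with rewards R and discount factor \gamma in the usual way:

G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}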

Our goal in this chapter is to estimate this function from data, assuming we are given a fixed policy \pi but have no access to the environment's model.

Monte Carlo Estimation

Monte Carlo methods approach this estimation task by sampling episodes of experience under the policy \pi, and then using these samples to form empirical estimates of v_\pi(s).

In general, the process can be split into the following steps:

  1. Generate an episode using policy π\pi;
  2. Save the obtained return value for each state appearing in the episode;
  3. Repeat steps 1-2 until enough episodes have been collected;
  4. Compute the value estimates by averaging the collected returns for each state.
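As a rough Python sketch of step 2, assuming an episode is represented as a list of (state, reward) pairs (S_t, R_{t+1}) and gamma is the discount factor (this representation is an assumption, not the course's code):

```python
def episode_returns(episode, gamma=1.0):
    """Compute (state, G_t) pairs for one episode by accumulating rewards backwards."""
    result = []
    G = 0.0
    for state, reward in reversed(episode):
        G = reward + gamma * G        # G_t = R_{t+1} + gamma * G_{t+1}
        result.append((state, G))
    result.reverse()                  # restore chronological order
    return result
```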

First-Visit and Every-Visit MC

There are two primary approaches to obtaining state returns: first-visit and every-visit.

First-Visit MC

In the first-visit approach, each episode contributes a return only for the first occurrence of a state. For example, if state s appears 3 times, only the return following its first appearance counts; the other two returns are ignored.
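A sketch of the first-visit bookkeeping, reusing the hypothetical episode_returns helper from the earlier sketch:

```python
def first_visit_returns(episode, gamma=1.0):
    """Keep a return only for the first occurrence of each state in one episode."""
    seen = set()
    kept = []
    for state, G in episode_returns(episode, gamma):   # helper sketched above
        if state not in seen:                          # later occurrences are ignored
            seen.add(state)
            kept.append((state, G))
    return kept
```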

Every-Visit MC

In the every-visit approach, every occurrence of a state counts. For example, if state s appears 3 times, each of the 3 returns is saved.
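For comparison, a sketch of the every-visit bookkeeping across many episodes, again reusing the hypothetical episode_returns helper; the value estimate is simply the average of all collected returns:

```python
from collections import defaultdict

def every_visit_estimate(episodes, gamma=1.0):
    """Average the return of every occurrence of every state across episodes."""
    returns = defaultdict(list)
    for episode in episodes:
        for state, G in episode_returns(episode, gamma):   # nothing is filtered out
            returns[state].append(G)                       # a state seen 3 times adds 3 returns
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```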

Exploring Starts

Imagine a simple one-dimensional world represented by a line that extends from -10 to +10. The agent begins at position 0, and its current policy dictates that it always moves to the right at each time step.

If we attempt to generate episodes under this policy, what happens? The agent will continuously move toward the positive end of the line — visiting states like 1, 2, 3, and so on — but it will never visit any negative states. As a result, we cannot estimate value functions for states to the left of the origin, simply because the agent never experiences them.

The main problem is this: if certain parts of the state space are never visited, their value estimates remain inaccurate or undefined. One common solution to this problem is the use of exploring starts.

With exploring starts, each episode begins not at a fixed starting state like 0, but at a randomly selected state. Once the episode begins, the agent follows its current policy as usual. Over time, by starting from many different points across the state space, the agent is able to visit all states — not just the ones its policy would naturally lead it to. This allows the Monte Carlo method to produce more accurate and complete value estimates for the entire state space.
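As an illustration only, exploring starts in this line world could be generated as below; the reward scheme is an assumption made for the sketch, since the course does not specify rewards for this example:

```python
import random

def generate_episode_exploring_start():
    """One episode in the line world: random start in -10..9, then always move right."""
    state = random.randint(-10, 9)                 # exploring start: random non-terminal state
    episode = []
    while state < 10:                              # +10 is treated as the terminal state here
        next_state = state + 1                     # fixed policy: always move right
        reward = 1.0 if next_state == 10 else 0.0  # hypothetical reward: reaching +10 pays 1
        episode.append((state, reward))
        state = next_state
    return episode
```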

Pseudocode

The pseudocode combines the every-visit approach with exploring starts.
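A minimal Python rendering of such a procedure might look as follows; step(state, action), policy(state), and the list of non-terminal states are assumed environment helpers, not definitions from this course:

```python
import random
from collections import defaultdict

def mc_prediction_exploring_starts(states, step, policy, num_episodes, gamma=1.0):
    """Every-visit Monte Carlo prediction with exploring starts (sketch)."""
    returns = defaultdict(list)                    # state -> list of observed returns

    for _ in range(num_episodes):
        # Exploring start: begin each episode in a randomly chosen state
        state = random.choice(states)

        # Generate an episode by following the fixed policy
        episode = []
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = step(state, action)
            episode.append((state, reward))
            state = next_state

        # Every-visit: record the discounted return for every state occurrence
        G = 0.0
        for s, r in reversed(episode):
            G = r + gamma * G
            returns[s].append(G)

    # Estimate v_pi(s) as the average of the returns collected for s
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```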

1. How does the first-visit MC method differ from the every-visit MC method?

2. What is the main advantage of using exploring starts in Monte Carlo methods?

