Off-Policy Monte Carlo Control
While on-policy methods learn by following and improving the same policy, off-policy methods introduce a twist: they learn about one policy (the target policy) while following another (the behavior policy). This separation is powerful — it allows us to evaluate or improve a target policy without needing to actually follow it during data collection.
Analogy
Let's return to the ice cream shop from the previous chapter. You and your friend walk in, and once again, the three familiar flavors are on offer: chocolate, vanilla, and strawberry. Chocolate is your personal favorite, and your first instinct is to order it. But this shop is new to you, and you're not quite sure if choosing chocolate is right. Fortunately, your friend is a prominent ice cream lover who's visited nearly every shop in the city. You ask for their opinion. "Chocolate here is okay," they say, "but trust me — the strawberry is exceptional." So, based on their experience, you decide to skip your usual choice and go with strawberry instead.
That decision — relying on someone else's experience to guide your own choice — is the essence of off-policy methods. You're trying to improve your decision-making using data collected under someone else's behavior. It's still exploration — but it's guided by external experience rather than your own.
Importance Sampling
Because the agent follows the behavior policy during episode generation, we have to account for the mismatch between what the behavior policy generates and what the target policy would generate. This is where importance sampling comes in.
Importance sampling provides a way to adjust the returns observed under the behavior policy so they're valid estimates for the target policy.
Let's look at a trajectory that starts from some state $S_t$ and follows some policy until the episode terminates at time $T$. Specifically, we observe:

$$A_t, S_{t+1}, A_{t+1}, \dots, S_T$$
Now, what is the probability of this trajectory occurring under a policy $\pi$? It depends on both the policy's action probabilities and the environment's transition dynamics:

$$\Pr\{A_t, S_{t+1}, A_{t+1}, \dots, S_T \mid S_t, A_{t:T-1} \sim \pi\} = \prod_{k=t}^{T-1} \pi(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)$$
Now suppose the trajectory was actually generated by a different policy: the behavior policy $b$. To properly use this trajectory to estimate expectations under the target policy $\pi$, we must account for how much more or less likely this sequence of actions would have been under $\pi$ compared to $b$.
This is where the importance sampling ratio comes in. It is defined as the relative likelihood of the trajectory under the two policies:

$$\rho_{t:T-1} = \frac{\prod_{k=t}^{T-1} \pi(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)}{\prod_{k=t}^{T-1} b(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}$$

Notice that the transition probabilities cancel out, since both policies operate in the same environment; the value of $\rho_{t:T-1}$ therefore depends only on the policies, not on the environment.
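For a concrete feel for the ratio, here is a minimal Python sketch; the `target_prob` and `behavior_prob` callables are hypothetical and assumed to return $\pi(a \mid s)$ and $b(a \mid s)$:

```python
def importance_sampling_ratio(trajectory, target_prob, behavior_prob):
    """Compute rho for a trajectory given as a list of (state, action) pairs.

    target_prob(s, a) and behavior_prob(s, a) are assumed to return
    pi(a|s) and b(a|s) for the target and behavior policies.
    """
    rho = 1.0
    for state, action in trajectory:
        # Transition probabilities cancel, so only the policy ratios remain.
        rho *= target_prob(state, action) / behavior_prob(state, action)
    return rho
```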
Why This Matters
The ratio $\rho_{t:T-1}$ tells us how to reweight the return $G_t$ observed under the behavior policy so that it becomes an unbiased estimate of what the return would have been under the target policy:

$$\mathbb{E}\bigl[\rho_{t:T-1} G_t \mid S_t = s\bigr] = v_\pi(s)$$
In other words, even though the data was collected using $b$, we can still estimate expected returns under $\pi$, provided that $b$ gives non-zero probability to every action that $\pi$ might take (the coverage assumption).
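To see the reweighting at work, here is a tiny sketch with made-up numbers: a one-step problem where returns are collected under a uniform behavior policy, yet the reweighted average recovers the expected return of a different target policy.

```python
import random

# Hypothetical one-step problem: two actions with fixed rewards.
rewards = {0: 1.0, 1: 2.0}
pi = {0: 0.1, 1: 0.9}   # target policy probabilities
b = {0: 0.5, 1: 0.5}    # behavior policy probabilities (covers pi)

samples = []
for _ in range(100_000):
    a = random.choices([0, 1], weights=[b[0], b[1]])[0]  # act with b
    g = rewards[a]                                        # observed return
    samples.append((pi[a] / b[a]) * g)                    # reweight by rho

# Averages to roughly 0.1 * 1.0 + 0.9 * 2.0 = 1.9, the expectation under pi.
print(sum(samples) / len(samples))
```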
Practical Considerations
Importance Sampling Variance
Incorporating importance sampling is conceptually straightforward. We adjust the estimated action value function by weighting each observed return with the corresponding importance sampling ratio. The simplest formulation looks like this:

$$Q(s, a) = \frac{\sum_{i=1}^{N(s, a)} \rho_i G_i}{N(s, a)}$$

where:
- $\rho_i$ is the importance sampling ratio for the $i$-th trajectory starting from $(s, a)$;
- $G_i$ is the return from that trajectory;
- $N(s, a)$ is the number of times $(s, a)$ has been visited.
This is known as ordinary importance sampling. It provides an unbiased estimate of $q_\pi(s, a)$, but can suffer from very high variance, especially when the behavior and target policies differ significantly.
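A direct translation of the ordinary estimator might look like this sketch, where `rho_return_pairs` is assumed to hold one `(rho, G)` pair per recorded visit to a given state-action pair:

```python
def ordinary_is_estimate(rho_return_pairs):
    """Ordinary importance sampling: average rho_i * G_i over all visits."""
    if not rho_return_pairs:
        return 0.0
    return sum(rho * g for rho, g in rho_return_pairs) / len(rho_return_pairs)
```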
To mitigate the variance issue, we can use a more stable alternative: weighted importance sampling. This method normalizes the importance weights, which reduces the impact of large ratios and leads to more stable learning:

$$Q(s, a) = \frac{\sum_{i=1}^{N(s, a)} \rho_i G_i}{\sum_{i=1}^{N(s, a)} \rho_i}$$
In this version the numerator is the same weighted sum of returns, but the denominator is now the sum of the importance weights, rather than a simple count.
This makes the estimate biased, but the bias diminishes as more samples are collected. In practice, weighted importance sampling is preferred due to its significantly lower variance and greater numerical stability.
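Only the denominator changes in a matching sketch of the weighted estimator:

```python
def weighted_is_estimate(rho_return_pairs):
    """Weighted importance sampling: normalize by the sum of the ratios."""
    total_weight = sum(rho for rho, _ in rho_return_pairs)
    if total_weight == 0.0:
        return 0.0  # no usable returns yet
    return sum(rho * g for rho, g in rho_return_pairs) / total_weight
```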
Policies
As in the on-policy case, let's use $\varepsilon$-greedy policies for both the target policy $\pi$ and the behavior policy $b$.
At first glance, it seems natural to make the target policy fully greedy; after all, our ultimate goal is a greedy policy. In practice, however, this causes a major problem: if at any step $\pi(a \mid s) = 0$ for the action that was actually taken by the behavior policy, the importance sampling ratio becomes zero and the remaining part of the episode is effectively discarded.
By using a small positive $\varepsilon$ in the target policy, we ensure $\pi(a \mid s) > 0$ for every action, so the ratio never collapses to zero mid-episode. Once training is done, it's trivial to convert the learned $\varepsilon$-greedy policy into a strictly greedy one. As with on-policy learning, a decaying $\varepsilon$ should be used in the behavior policy, but this time mostly for numerical stability: the ratio can still underflow to zero mid-episode because of the limits of floating-point representation.
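To make the non-zero guarantee explicit, here is a small sketch of the action probabilities under an $\varepsilon$-greedy policy; `q_values` is assumed to be a plain list of action values:

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities for an epsilon-greedy policy over q_values.

    Every action gets at least epsilon / |A| probability, so with epsilon > 0
    the importance sampling ratio never multiplies in an exact zero.
    """
    n = len(q_values)
    probs = [epsilon / n] * n
    greedy = q_values.index(max(q_values))
    probs[greedy] += 1.0 - epsilon
    return probs

# e.g. epsilon_greedy_probs([0.1, 0.5, 0.2], epsilon=0.1)
# -> [0.0333..., 0.9333..., 0.0333...]
```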
Pseudocode
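The lesson's original pseudocode isn't reproduced here, so below is a minimal Python sketch of off-policy Monte Carlo control with weighted importance sampling in its standard incremental form. The `generate_episode` helper, the `n_actions` parameter, and the hyperparameter defaults are assumptions for illustration.

```python
from collections import defaultdict
import numpy as np

def off_policy_mc_control(env, generate_episode, num_episodes,
                          n_actions, gamma=1.0, eps_target=0.01):
    """Sketch of off-policy MC control with weighted importance sampling.

    `generate_episode(env, Q)` is a hypothetical helper that plays one episode
    with an epsilon-greedy behavior policy over Q and returns a list of
    (state, action, reward, b_prob) tuples, where b_prob = b(action | state)
    at the moment the action was chosen.
    """
    Q = defaultdict(lambda: np.zeros(n_actions))  # action-value estimates
    C = defaultdict(lambda: np.zeros(n_actions))  # cumulative IS weights

    def target_prob(state, action):
        # pi(a | s) for the epsilon-greedy target policy over the current Q.
        greedy = int(np.argmax(Q[state]))
        p = eps_target / n_actions
        return p + (1.0 - eps_target) if action == greedy else p

    for _ in range(num_episodes):
        episode = generate_episode(env, Q)  # collected with the behavior policy

        G, W = 0.0, 1.0
        # Walk the episode backwards, accumulating return and importance weight.
        for state, action, reward, b_prob in reversed(episode):
            G = gamma * G + reward
            C[state][action] += W
            # Weighted-IS incremental update toward the observed return.
            Q[state][action] += (W / C[state][action]) * (G - Q[state][action])
            # Multiply in this step's ratio pi(a|s) / b(a|s).
            W *= target_prob(state, action) / b_prob
            if W == 0.0:
                break  # the rest of the episode would contribute nothing

    return Q
```

After training, the greedy policy can be read off as the argmax of `Q[s]` in each state, as described above.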