Off-Policy Monte Carlo Control
While on-policy methods learn by following and improving the same policy, off-policy methods introduce a twist: they learn about one policy (the target policy) while following another (the behavior policy). This separation is powerful — it allows us to evaluate or improve a target policy without needing to actually follow it during data collection.
Analogy
Let's return to the ice cream shop from the previous chapter. You and your friend walk in, and once again, the three familiar flavors are on offer: chocolate, vanilla, and strawberry. Chocolate is your personal favorite, and your first instinct is to order it. But this shop is new to you, and you're not quite sure if choosing chocolate is right. Fortunately, your friend is a prominent ice cream lover who's visited nearly every shop in the city. You ask for their opinion. "Chocolate here is okay," they say, "but trust me — the strawberry is exceptional." So, based on their experience, you decide to skip your usual choice and go with strawberry instead.
That decision — relying on someone else's experience to guide your own choice — is the essence of off-policy methods. You're trying to improve your decision-making using data collected under someone else's behavior. It's still exploration — but it's guided by external experience rather than your own.
Importance Sampling
Because the agent follows the behavior policy during episode generation, we have to account for the mismatch between what the behavior policy generates and what the target policy would generate. This is where importance sampling comes in.
Importance sampling provides a way to adjust the returns observed under the behavior policy so they're valid estimates for the target policy.
Let's look at a trajectory that starts from some state $S_t$ and follows some policy until the episode terminates at time $T$. Specifically, we observe:

$$A_t, S_{t+1}, A_{t+1}, \dots, S_T$$
Now, what is the probability of this trajectory occurring under a policy $\pi$? It depends on both the policy's action probabilities and the environment's transition dynamics:

$$\Pr\{A_t, S_{t+1}, A_{t+1}, \dots, S_T \mid S_t, A_{t:T-1} \sim \pi\} = \prod_{k=t}^{T-1} \pi(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)$$
Now suppose the trajectory was actually generated by a different policy: the behavior policy $b$. To properly use this trajectory to estimate expectations under the target policy $\pi$, we must account for how much more or less likely this sequence of actions would have been under $\pi$ compared to $b$.
This is where the importance sampling ratio comes in. It is defined as the relative likelihood of the trajectory under the two policies:

$$\rho_{t:T-1} = \frac{\prod_{k=t}^{T-1} \pi(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)}{\prod_{k=t}^{T-1} b(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}$$

Notice that the transition probabilities cancel out, since both policies operate in the same environment; the value of $\rho_{t:T-1}$ therefore depends only on the policies, not on the environment.
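For a concrete feel for the ratio, here is a minimal Python sketch; the `target_prob` and `behavior_prob` callables are hypothetical and assumed to return $\pi(a \mid s)$ and $b(a \mid s)$:

```python
def importance_sampling_ratio(trajectory, target_prob, behavior_prob):
    """Compute rho for a trajectory given as a list of (state, action) pairs.

    target_prob(s, a) and behavior_prob(s, a) are assumed to return
    pi(a|s) and b(a|s) for the target and behavior policies.
    """
    rho = 1.0
    for state, action in trajectory:
        # Transition probabilities cancel, so only the policy ratios remain.
        rho *= target_prob(state, action) / behavior_prob(state, action)
    return rho
```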
Why This Matters
The ratio $\rho_{t:T-1}$ tells us how to reweight the return $G_t$ observed under the behavior policy so that it becomes an unbiased estimate of what the return would have been under the target policy:

$$\mathbb{E}\bigl[\rho_{t:T-1} G_t \mid S_t = s\bigr] = v_\pi(s)$$
In other words, even though the data was collected using $b$, we can still estimate expected returns under $\pi$, provided that $b$ gives non-zero probability to every action that $\pi$ might take (the coverage assumption).
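To see the reweighting at work, here is a tiny sketch with made-up numbers: a one-step problem where returns are collected under a uniform behavior policy, yet the reweighted average recovers the expected return of a different target policy.

```python
import random

# Hypothetical one-step problem: two actions with fixed rewards.
rewards = {0: 1.0, 1: 2.0}
pi = {0: 0.1, 1: 0.9}   # target policy probabilities
b = {0: 0.5, 1: 0.5}    # behavior policy probabilities (covers pi)

samples = []
for _ in range(100_000):
    a = random.choices([0, 1], weights=[b[0], b[1]])[0]  # act with b
    g = rewards[a]                                        # observed return
    samples.append((pi[a] / b[a]) * g)                    # reweight by rho

# Averages to roughly 0.1 * 1.0 + 0.9 * 2.0 = 1.9, the expectation under pi.
print(sum(samples) / len(samples))
```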
Practical Considerations
Importance Sampling Variance
Incorporating importance sampling is conceptually straightforward. We adjust the estimated action value function by weighting each observed return with the corresponding importance sampling ratio. The simplest formulation looks like this:

$$Q(s, a) = \frac{\sum_{i=1}^{N(s, a)} \rho_i G_i}{N(s, a)}$$

where:
- $\rho_i$ is the importance sampling ratio for the $i$-th trajectory starting from $(s, a)$;
- $G_i$ is the return from that trajectory;
- $N(s, a)$ is the number of times $(s, a)$ has been visited.
This is known as ordinary importance sampling. It provides an unbiased estimate of $q_\pi(s, a)$, but can suffer from very high variance, especially when the behavior and target policies differ significantly.
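A direct translation of the ordinary estimator might look like this sketch, where `rho_return_pairs` is assumed to hold one `(rho, G)` pair per recorded visit to a given state-action pair:

```python
def ordinary_is_estimate(rho_return_pairs):
    """Ordinary importance sampling: average rho_i * G_i over all visits."""
    if not rho_return_pairs:
        return 0.0
    return sum(rho * g for rho, g in rho_return_pairs) / len(rho_return_pairs)
```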
To mitigate the variance issue, we can use a more stable alternative: weighted importance sampling. This method normalizes the importance weights, which reduces the impact of large ratios and leads to more stable learning:

$$Q(s, a) = \frac{\sum_{i=1}^{N(s, a)} \rho_i G_i}{\sum_{i=1}^{N(s, a)} \rho_i}$$
In this version the numerator is the same weighted sum of returns, but the denominator is now the sum of the importance weights, rather than a simple count.
This makes the estimate biased, but the bias diminishes as more samples are collected. In practice, weighted importance sampling is preferred due to its significantly lower variance and greater numerical stability.
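Only the denominator changes in a matching sketch of the weighted estimator:

```python
def weighted_is_estimate(rho_return_pairs):
    """Weighted importance sampling: normalize by the sum of the ratios."""
    total_weight = sum(rho for rho, _ in rho_return_pairs)
    if total_weight == 0.0:
        return 0.0  # no usable returns yet
    return sum(rho * g for rho, g in rho_return_pairs) / total_weight
```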
Policies
As in the on-policy case, let's use $\varepsilon$-greedy policies for both the target policy $\pi$ and the behavior policy $b$.
At first glance, it seems natural to make the target policy fully greedy; after all, our ultimate goal is a greedy policy. In practice, however, this causes a major problem: if at any step $\pi(a \mid s) = 0$ for the action that was actually taken by the behavior policy, the importance sampling ratio becomes zero and the remaining part of the episode is effectively discarded.
By using a small positive $\varepsilon$ in the target policy, we ensure $\pi(a \mid s) > 0$ for every action, so the ratio never collapses to zero mid-episode. Once training is done, it's trivial to convert the learned $\varepsilon$-greedy policy into a strictly greedy one. As with on-policy learning, a decaying $\varepsilon$ should be used in the behavior policy, but this time mostly for numerical stability: the ratio can still underflow to zero mid-episode because of the limits of floating-point representation.
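To make the non-zero guarantee explicit, here is a small sketch of the action probabilities under an $\varepsilon$-greedy policy; `q_values` is assumed to be a plain list of action values:

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities for an epsilon-greedy policy over q_values.

    Every action gets at least epsilon / |A| probability, so with epsilon > 0
    the importance sampling ratio never multiplies in an exact zero.
    """
    n = len(q_values)
    probs = [epsilon / n] * n
    greedy = q_values.index(max(q_values))
    probs[greedy] += 1.0 - epsilon
    return probs

# e.g. epsilon_greedy_probs([0.1, 0.5, 0.2], epsilon=0.1)
# -> [0.0333..., 0.9333..., 0.0333...]
```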
Pseudocode
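The lesson's original pseudocode isn't reproduced here, so below is a minimal Python sketch of off-policy Monte Carlo control with weighted importance sampling in its standard incremental form. The `generate_episode` helper, the `n_actions` parameter, and the hyperparameter defaults are assumptions for illustration.

```python
from collections import defaultdict
import numpy as np

def off_policy_mc_control(env, generate_episode, num_episodes,
                          n_actions, gamma=1.0, eps_target=0.01):
    """Sketch of off-policy MC control with weighted importance sampling.

    `generate_episode(env, Q)` is a hypothetical helper that plays one episode
    with an epsilon-greedy behavior policy over Q and returns a list of
    (state, action, reward, b_prob) tuples, where b_prob = b(action | state)
    at the moment the action was chosen.
    """
    Q = defaultdict(lambda: np.zeros(n_actions))  # action-value estimates
    C = defaultdict(lambda: np.zeros(n_actions))  # cumulative IS weights

    def target_prob(state, action):
        # pi(a | s) for the epsilon-greedy target policy over the current Q.
        greedy = int(np.argmax(Q[state]))
        p = eps_target / n_actions
        return p + (1.0 - eps_target) if action == greedy else p

    for _ in range(num_episodes):
        episode = generate_episode(env, Q)  # collected with the behavior policy

        G, W = 0.0, 1.0
        # Walk the episode backwards, accumulating return and importance weight.
        for state, action, reward, b_prob in reversed(episode):
            G = gamma * G + reward
            C[state][action] += W
            # Weighted-IS incremental update toward the observed return.
            Q[state][action] += (W / C[state][action]) * (G - Q[state][action])
            # Multiply in this step's ratio pi(a|s) / b(a|s).
            W *= target_prob(state, action) / b_prob
            if W == 0.0:
                break  # the rest of the episode would contribute nothing

    return Q
```

After training, the greedy policy can be read off as the argmax of `Q[s]` in each state, as described above.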