Off-Policy Monte Carlo Control

While on-policy methods learn by following and improving the same policy, off-policy methods introduce a twist: they learn about one policy (the target policy) while following another (the behavior policy). This separation is powerful — it allows us to evaluate or improve a target policy without needing to actually follow it during data collection.

Analogy

Let's return to the ice cream shop from the previous chapter. You and your friend walk in, and once again the three familiar flavors are on offer: chocolate, vanilla, and strawberry. Chocolate is your personal favorite, and your first instinct is to order it. But this shop is new to you, and you're not sure whether chocolate is the right choice here. Fortunately, your friend is a devoted ice cream lover who has visited nearly every shop in the city. You ask for their opinion. "Chocolate here is okay," they say, "but trust me — the strawberry is exceptional." So, based on their experience, you skip your usual choice and go with strawberry instead.

That decision — relying on someone else's experience to guide your own choice — is the essence of off-policy methods. You're trying to improve your decision-making using data collected under someone else's behavior. It's still exploration — but it's guided by external experience rather than your own.

Importance Sampling

Because the agent follows the behavior policy during episode generation, we have to account for the mismatch between what the behavior policy generates and what the target policy would generate. This is where importance sampling comes in.

Importance sampling provides a way to adjust the returns observed under the behavior policy so they're valid estimates for the target policy.

Let's look at a trajectory that starts from some state $S_t$ and follows some policy $\pi$ until the episode terminates at time $T$. Specifically, we observe:

$$A_t, S_{t+1}, A_{t+1}, \dots, S_T$$

Now, what is the probability of this trajectory occurring under a policy $\pi$? It depends on both the policy's action probabilities and the environment's transition dynamics:

$$p(\text{trajectory} \mid \pi) = \prod_{k=t}^{T-1} \pi(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)$$

Now suppose the trajectory was actually generated by a different policy — the behavior policy $b$. To properly use this trajectory to estimate expectations under the target policy $\pi$, we must account for how much more or less likely this sequence of actions would have been under $\pi$ compared to $b$.

This is where the importance sampling ratio comes in. It is defined as the relative likelihood of the trajectory under the two policies:

$$\rho = \frac{p(\text{trajectory} \mid \pi)}{p(\text{trajectory} \mid b)} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)}{b(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}$$

The transition probabilities cancel out because both policies operate in the same environment, so the value of $\rho$ depends only on the policies, not on the environment.
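
To make the cancellation concrete, here is a minimal sketch (the tabular policies and trajectory are made up for illustration) that computes $\rho$ directly from the two policies' action probabilities, with no transition model involved:

```python
def importance_sampling_ratio(trajectory, target_policy, behavior_policy):
    """Compute rho for a trajectory as a product of per-step policy ratios.

    `trajectory` is a list of (state, action) pairs; `target_policy[s][a]`
    and `behavior_policy[s][a]` give pi(a|s) and b(a|s) respectively.
    """
    rho = 1.0
    for state, action in trajectory:
        # Transition probabilities cancel, so only the policy ratio remains
        rho *= target_policy[state][action] / behavior_policy[state][action]
    return rho


# Toy tabular policies over two states and two actions
target_policy = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}
behavior_policy = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}

trajectory = [(0, 0), (1, 1), (0, 1)]
print(importance_sampling_ratio(trajectory, target_policy, behavior_policy))
# (0.9 / 0.5) * (0.5 / 0.5) * (0.1 / 0.5) = 0.36
```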

Why This Matters

The ratio $\rho$ tells us how to reweight the return $G_t$ observed under the behavior policy so that it becomes an unbiased estimate of what the return would have been under the target policy:

$$\mathbb{E}_\pi[G_t] = \mathbb{E}_b[\rho \cdot G_t]$$

In other words, even though the data was collected using $b$, we can still estimate expected returns under $\pi$ — provided that $b$ gives non-zero probability to every action that $\pi$ might take (the coverage assumption).
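
As a quick sanity check of this identity, the sketch below uses a hypothetical one-state, one-step problem: sampling actions under $b$ and averaging the raw returns estimates the behavior policy's value, while averaging the $\rho$-weighted returns recovers the target policy's value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical one-state, one-step problem with two actions and fixed rewards
rewards = np.array([1.0, 0.0])
pi = np.array([0.9, 0.1])   # target policy pi(a|s)
b = np.array([0.5, 0.5])    # behavior policy b(a|s)

n = 100_000
actions = rng.choice(2, size=n, p=b)    # data collected by following b
returns = rewards[actions]
rhos = pi[actions] / b[actions]         # per-sample importance sampling ratio

print(returns.mean())           # ~0.5, expected return under b
print((rhos * returns).mean())  # ~0.9, estimate of the expected return under pi
print(float(pi @ rewards))      # 0.9, true expected return under pi
```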

Practical Considerations

Importance Sampling Variance

Incorporating importance sampling is conceptually straightforward. We adjust the estimated action value function $q(s, a)$ by weighting each observed return with the corresponding importance sampling ratio. The simplest formulation looks like this:

$$q(s, a) = \frac{\sum_{i=0}^{N(s, a)} \rho_i(s, a) \cdot \text{Returns}_i(s, a)}{N(s, a)}$$

where:

  • $\rho_i(s, a)$ is the importance sampling ratio for the $i$-th trajectory starting from $(s, a)$;
  • $\text{Returns}_i(s, a)$ is the return from that trajectory;
  • $N(s, a)$ is the number of times $(s, a)$ has been visited.

This is known as ordinary importance sampling. It provides an unbiased estimate of $q(s, a)$, but can suffer from very high variance, especially when the behavior and target policies differ significantly.

To mitigate the variance issue, we can use a more stable alternative: weighted importance sampling. This method normalizes the importance weights, which reduces the impact of large ratios and leads to more stable learning:

$$q(s, a) = \frac{\sum_{i=0}^{N(s, a)} \rho_i(s, a) \cdot \text{Returns}_i(s, a)}{\sum_{i=0}^{N(s, a)} \rho_i(s, a)}$$

In this version the numerator is the same weighted sum of returns, but the denominator is now the sum of the importance weights, rather than a simple count.

This makes the estimate biased, but the bias diminishes as more samples are collected. In practice, weighted importance sampling is preferred due to its significantly lower variance and greater numerical stability.
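
To see the difference in behavior, here is a small sketch comparing the two estimators on a handful of made-up ratios and returns (the numbers are purely illustrative): a single large ratio pulls the ordinary estimate far more than the weighted one.

```python
import numpy as np

def ordinary_importance_sampling(rhos, returns):
    """Unbiased estimate: sum of rho-weighted returns divided by the visit count."""
    rhos, returns = np.asarray(rhos), np.asarray(returns)
    return (rhos * returns).sum() / len(returns)

def weighted_importance_sampling(rhos, returns):
    """Biased but lower-variance estimate: normalize by the sum of the weights."""
    rhos, returns = np.asarray(rhos), np.asarray(returns)
    total_weight = rhos.sum()
    return (rhos * returns).sum() / total_weight if total_weight > 0 else 0.0


# Illustrative data: three returns observed under b with their ratios w.r.t. pi
rhos = [0.1, 0.2, 4.0]     # one trajectory is far more likely under pi than under b
returns = [1.0, 0.5, 2.0]

print(ordinary_importance_sampling(rhos, returns))   # ~2.73, dominated by the large ratio
print(weighted_importance_sampling(rhos, returns))   # ~1.91, the normalization damps it
```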

Policies

As in the on-policy case, let's use $\varepsilon$-greedy policies for both the target policy $\pi(a \mid s)$ and the behavior policy $b(a \mid s)$.

At first glance, it seems natural to make the target policy fully greedy — after all, our ultimate goal is a greedy policy. In practice, however, this causes a major problem: if at any step $\pi(a \mid s) = 0$ for the action actually taken by the behavior policy, the importance sampling ratio $\rho$ becomes zero and the rest of the episode is effectively discarded.

By using a small $\varepsilon$ (e.g., $\varepsilon = 0.01$) in the target policy, we ensure $\pi(a \mid s) > 0$ for every action, so $\rho$ never collapses to zero mid-episode. Once training is done, it's trivial to convert the learned $\varepsilon$-greedy policy into a strictly greedy one. As with on-policy learning, a decaying $\varepsilon$ should be used in the behavior policy, but here it mostly serves numerical stability: $\rho$ can still underflow to zero mid-episode due to the limits of floating-point representation.
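
To make this concrete, here is a small helper (a hypothetical name, not from the lesson) that builds the action probabilities of an $\varepsilon$-greedy policy: with any $\varepsilon > 0$, every action keeps probability at least $\varepsilon$ divided by the number of actions, so the target policy can never assign exactly zero to the action the behavior policy took.

```python
import numpy as np

def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of an epsilon-greedy policy for one state's Q-values."""
    n_actions = len(q_values)
    probs = np.full(n_actions, epsilon / n_actions)   # every action gets at least eps / |A|
    probs[np.argmax(q_values)] += 1.0 - epsilon       # the greedy action gets the rest
    return probs

# Target policy: nearly greedy, but never exactly zero anywhere
print(epsilon_greedy_probs([1.0, 2.0, 0.5], epsilon=0.01))  # ≈ [0.0033, 0.9933, 0.0033]
# Behavior policy: more exploratory
print(epsilon_greedy_probs([1.0, 2.0, 0.5], epsilon=0.2))   # ≈ [0.0667, 0.8667, 0.0667]
```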

Pseudocode
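
Below is a minimal Python sketch of how the pieces above can fit together: episodes are generated with a decaying $\varepsilon$-greedy behavior policy, the target policy is $\varepsilon$-greedy with a small fixed $\varepsilon$, and action values are updated with an incremental form of the weighted importance sampling estimate. It assumes a Gymnasium-style environment with discrete states and actions; the function names and the $\varepsilon$ schedule are illustrative assumptions, not a definitive implementation.

```python
import numpy as np
from collections import defaultdict

def epsilon_greedy_probs(q_row, epsilon):
    """Action probabilities of an epsilon-greedy policy for one state."""
    n_actions = len(q_row)
    probs = np.full(n_actions, epsilon / n_actions)
    probs[np.argmax(q_row)] += 1.0 - epsilon
    return probs

def off_policy_mc_control(env, n_episodes, gamma=0.99, target_eps=0.01):
    """Off-policy Monte Carlo control with weighted importance sampling.

    Assumes a Gymnasium-style `env` with discrete states and actions:
    `env.reset()` -> (state, info) and `env.step(a)` -> (next_state, reward,
    terminated, truncated, info).
    """
    n_actions = env.action_space.n
    Q = defaultdict(lambda: np.zeros(n_actions))   # action-value estimates
    C = defaultdict(lambda: np.zeros(n_actions))   # cumulative importance weights
    rng = np.random.default_rng()

    for i in range(n_episodes):
        # Behavior policy: epsilon-greedy with a decaying epsilon
        behavior_eps = max(0.05, 1.0 - i / n_episodes)

        # 1. Generate an episode by following the behavior policy
        episode = []
        state, _ = env.reset()
        done = False
        while not done:
            b_probs = epsilon_greedy_probs(Q[state], behavior_eps)
            action = int(rng.choice(n_actions, p=b_probs))
            next_state, reward, terminated, truncated, _ = env.step(action)
            episode.append((state, action, reward, b_probs[action]))
            state = next_state
            done = terminated or truncated

        # 2. Walk the episode backwards, accumulating the return G and the
        #    importance weight W, and update Q with weighted importance sampling
        G, W = 0.0, 1.0
        for state, action, reward, b_prob in reversed(episode):
            G = gamma * G + reward
            C[state][action] += W
            Q[state][action] += (W / C[state][action]) * (G - Q[state][action])

            # The weight applied to earlier time steps also covers this action
            pi_probs = epsilon_greedy_probs(Q[state], target_eps)
            W *= pi_probs[action] / b_prob
            if W == 0.0:  # possible only through floating-point underflow
                break

    return Q
```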
