Generalization of TD Learning

So far, we have considered two extreme cases of learning from experience:

  • TD(0): uses the one-step return;
  • Monte Carlo: waits until the end of the episode to compute the return.

But what if we want something in between? Something that leverages more future information than TD(0), yet doesn't need to wait for the full episode like Monte Carlo?

This is where n-step TD learning and TD(λ) come in: methods that unify and generalize the ideas we've seen so far.

n-Step TD Learning

The idea behind n-step TD learning is simple: instead of using just the next step or the entire episode, we use the next n steps, then bootstrap:

G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})

This allows for a tradeoff:

  • When n = 1: it's just TD(0);
  • When n = ∞: it becomes Monte Carlo.

This return can then be used as the target in the TD(0) update rule:

V(S_t) \gets V(S_t) + \alpha\Bigl(G_t^{(n)} - V(S_t)\Bigr)
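To make this concrete, below is a minimal Python sketch of a single n-step TD update computed from one recorded trajectory. The function and variable names (`n_step_td_update`, `states`, `rewards`) are illustrative only, not taken from any course code, and the terminal state is assumed to have value zero, so the return is simply truncated when the episode ends within n steps.

```python
def n_step_td_update(V, states, rewards, t, n, alpha, gamma):
    """One n-step TD update of V(S_t) from a recorded trajectory.

    V       -- dict (or array) of value estimates, one entry per state
    states  -- states[k] is the state visited at time step k (length T + 1)
    rewards -- rewards[k] is R_{k+1}, the reward for the transition
               from states[k] to states[k + 1] (length T)
    """
    T = len(rewards)                 # number of transitions in the episode
    horizon = min(t + n, T)          # truncate at the end of the episode

    # Discounted sum of the next n rewards: R_{t+1} + ... + gamma^{n-1} R_{t+n}
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, horizon))

    # Bootstrap with gamma^n * V(S_{t+n}) only if the episode is still running
    # (the value of the terminal state is taken to be zero)
    if t + n < T:
        G += gamma ** n * V[states[t + n]]

    # Move V(S_t) toward the n-step return, exactly as in the update rule above
    V[states[t]] += alpha * (G - V[states[t]])
    return G
```

With n = 1 this reduces to the ordinary TD(0) update, and with n large enough to reach the end of the episode it becomes a Monte Carlo update, matching the two special cases listed above.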

TD(λ)

TD(λ) is a clever idea that builds on top of n-step TD learning: instead of choosing a fixed n, we combine all n-step returns together:

L_t = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}

where λ ∈ [0, 1] controls the weighting (the weights sum to one, as shown below):

  • If λ = 0: only the one-step return → TD(0);
  • If λ = 1: the full return → Monte Carlo;
  • Intermediate values blend multiple n-step returns.
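The (1 - λ) factor in front is what makes this a proper average: for λ ∈ [0, 1), the weights form a geometric series that sums to one,

(1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} = (1 - \lambda) \cdot \frac{1}{1 - \lambda} = 1

so L_t is a weighted average of n-step returns rather than an arbitrary mixture.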

So λ acts as a bias-variance tradeoff knob:

  • Low λ: more bias, less variance;
  • High λ: less bias, more variance.

L_t can then be used as an update target in the TD(0) update rule:

V(S_t) \gets V(S_t) + \alpha\Bigl(L_t - V(S_t)\Bigr)
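Below is a forward-view sketch in Python of how this λ-return can be computed and used, under a few assumptions not spelled out above: the episode has already finished (so the infinite sum truncates at the terminal time step), the terminal state has value zero, and all names (`lambda_return`, `td_lambda_update`, `states`, `rewards`) are hypothetical.

```python
def lambda_return(V, states, rewards, t, lam, gamma):
    """Forward-view lambda-return L_t for state S_t of a finished episode."""
    T = len(rewards)                     # number of transitions in the episode

    def n_step_return(n):
        """G_t^{(n)}: n discounted rewards plus a bootstrapped tail."""
        horizon = min(t + n, T)
        G = sum(gamma ** (k - t) * rewards[k] for k in range(t, horizon))
        if t + n < T:                    # episode still running: bootstrap from V
            G += gamma ** n * V[states[t + n]]
        return G

    # Weighted mixture of all n-step returns. Every return that reaches the
    # terminal state equals the Monte Carlo return, so their remaining weight
    # is lumped into a single final term.
    L = sum((1 - lam) * lam ** (n - 1) * n_step_return(n) for n in range(1, T - t))
    L += lam ** (T - t - 1) * n_step_return(T - t)   # Monte Carlo tail
    return L


def td_lambda_update(V, states, rewards, t, lam, gamma, alpha):
    """Nudge V(S_t) toward the lambda-return, as in the update rule above."""
    L = lambda_return(V, states, rewards, t, lam, gamma)
    V[states[t]] += alpha * (L - V[states[t]])
```

Setting lam = 0 reproduces the TD(0) update and lam = 1 reproduces the Monte Carlo update. Note that this forward view needs the complete episode before L_t can be computed; it is shown here to make the weighted mixture explicit, not as the most efficient way to implement TD(λ).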