Introduction to Reinforcement Learning
Generalization of TD Learning
So far, we have considered two extreme cases of learning from experience:
- TD(0): uses the one-step return;
- Monte Carlo: waits until the end of the episode to compute the return.
But what if we want something in between? Something that leverages more future information than TD(0), yet doesn't need to wait for the full episode like Monte Carlo?
This is where $n$-step TD learning and TD($\lambda$) come in: methods that unify and generalize the ideas we've seen so far.
$n$-Step TD Learning
The idea behind $n$-step TD learning is simple: instead of using just the next step or the entire episode, we use the next $n$ steps, then bootstrap from the value estimate of the state we reach:

$$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n} V(S_{t+n})$$
This allows for a tradeoff:
- When $n = 1$: it's just TD(0);
- When $n \to \infty$ (the return covers the rest of the episode): it becomes Monte Carlo.
This return can then be used to replace the target in the TD(0) update rule:

$$V(S_t) \leftarrow V(S_t) + \alpha \left[ G_t^{(n)} - V(S_t) \right]$$
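As a concrete illustration, here is a minimal sketch of tabular $n$-step TD prediction in Python, following the formulas above. The environment interface (`env.reset()` returning a state index, `env.step(action)` returning `(next_state, reward, done)`) and the uniformly random behaviour policy are assumptions made for this sketch, not part of the material above.

```python
import numpy as np

def n_step_td_prediction(env, n_states, n_actions, n=3,
                         alpha=0.1, gamma=0.99, n_episodes=500):
    """Tabular n-step TD prediction under a uniformly random policy (sketch)."""
    V = np.zeros(n_states)

    for _ in range(n_episodes):
        # Assumed interface: env.reset() -> state index,
        # env.step(a) -> (next_state, reward, done).
        states = [env.reset()]
        rewards = []
        T = float("inf")   # episode length, unknown until termination
        t = 0

        while True:
            if t < T:
                action = np.random.randint(n_actions)
                next_state, reward, done = env.step(action)
                states.append(next_state)
                rewards.append(reward)
                if done:
                    T = t + 1

            tau = t - n + 1            # time step whose value gets updated
            if tau >= 0:
                # n-step return: discounted rewards, then bootstrap with V
                G = sum(gamma ** (i - tau) * rewards[i]
                        for i in range(tau, min(tau + n, T)))
                if tau + n < T:
                    G += gamma ** n * V[states[tau + n]]
                # same update rule as TD(0), with G_t^(n) as the target
                V[states[tau]] += alpha * (G - V[states[tau]])

            if tau == T - 1:
                break
            t += 1

    return V
```

Note that updates lag the agent by $n$ steps: the value of $S_\tau$ can only be updated once the rewards up to $R_{\tau+n}$ have been observed.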
TD($\lambda$)
TD($\lambda$) is a clever idea that builds on top of $n$-step TD learning: instead of choosing a fixed $n$, we combine all $n$-step returns together:

$$G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$$
where $\lambda \in [0, 1]$ controls the weighting:
- If $\lambda = 0$: only the one-step return remains, i.e., TD(0);
- If $\lambda = 1$: the full return, i.e., Monte Carlo;
- Intermediate values blend multiple $n$-step returns.
So $\lambda$ acts as a bias-variance tradeoff knob, as the quick numeric check below illustrates:
- Low $\lambda$: more bias, less variance;
- High $\lambda$: less bias, more variance.
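To make the knob concrete, here is a small numeric check (the $\lambda$ values are chosen purely for illustration) of the weights $(1-\lambda)\lambda^{n-1}$ that the $\lambda$-return assigns to each $n$-step return:

```python
# Weight assigned to the n-step return in the lambda-return: (1 - lam) * lam**(n - 1)
for lam in (0.1, 0.5, 0.9):
    weights = [(1 - lam) * lam ** (n - 1) for n in range(1, 6)]
    print(f"lambda={lam}: " + ", ".join(f"{w:.3f}" for w in weights))

# lambda=0.1: 0.900, 0.090, 0.009, 0.001, 0.000  -> almost all weight on the 1-step return
# lambda=0.5: 0.500, 0.250, 0.125, 0.062, 0.031
# lambda=0.9: 0.100, 0.090, 0.081, 0.073, 0.066  -> weight spread over many long returns
```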
$G_t^{\lambda}$ can then be used as an update target in the TD(0) update rule:

$$V(S_t) \leftarrow V(S_t) + \alpha \left[ G_t^{\lambda} - V(S_t) \right]$$
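Below is a minimal forward-view sketch of that update for a single finished episode, assuming tabular integer states and episode data stored as `states = [S_0, ..., S_T]` and `rewards = [R_1, ..., R_T]`; the helper names are hypothetical. For a finite episode, every $n$-step return with $n \ge T - t$ equals the full Monte Carlo return, so their geometric weights are lumped into a single $\lambda^{T-t-1}$ term.

```python
import numpy as np

def lambda_return(t, states, rewards, V, lam=0.9, gamma=0.99):
    """Forward-view lambda-return G_t^lambda for one finished episode (sketch)."""
    T = len(rewards)  # episode length; states has T + 1 entries

    def n_step_return(n):
        # G_t^(n): up to n discounted rewards, then bootstrap unless the episode ended
        end = min(t + n, T)
        G = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
        if t + n < T:
            G += gamma ** n * V[states[t + n]]
        return G

    # Finite-episode form: all n >= T - t give the full return, so their
    # geometric weights collapse into a single lam**(T - t - 1) term.
    G_lam = sum((1 - lam) * lam ** (n - 1) * n_step_return(n)
                for n in range(1, T - t))
    G_lam += lam ** (T - t - 1) * n_step_return(T - t)  # full Monte Carlo return
    return G_lam

def td_lambda_forward_update(states, rewards, V, alpha=0.1, lam=0.9, gamma=0.99):
    """TD(0)-style update with G_t^lambda as the target, applied for every t."""
    for t in range(len(rewards)):
        G = lambda_return(t, states, rewards, V, lam, gamma)
        V[states[t]] += alpha * (G - V[states[t]])
    return V

# Example usage with made-up episode data over 5 tabular states:
V = np.zeros(5)
states = [0, 1, 2, 3, 4]        # S_0 .. S_T (state 4 terminal)
rewards = [0.0, 0.0, 0.0, 1.0]  # R_1 .. R_T
V = td_lambda_forward_update(states, rewards, V)
```

Note that computing $G_t^{\lambda}$ exactly this way requires the finished episode, since it sums over all $n$-step returns; this is known as the forward view of TD($\lambda$).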