Generalization of TD Learning


So far, we have considered two extreme cases of learning from experience:

  • TD(0): uses the one-step return;
  • Monte Carlo: waits until the end of the episode to compute the full return.

But what if we want something in between? Something that leverages more future information than TD(0), yet doesn't need to wait for the full episode like Monte Carlo?

This is where $n$-step TD learning and TD($\lambda$) come in: methods that unify and generalize the ideas we've seen so far.

$n$-Step TD Learning

The idea behind $n$-step TD learning is simple: instead of using just the next step or the entire episode, we use the next $n$ steps and then bootstrap:

$$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})$$

This allows for a tradeoff:

  • When $n = 1$: it's just TD(0);
  • When $n = \infty$: it becomes Monte Carlo.

This return can then be used as the target in the TD(0) update rule:

$$V(S_t) \gets V(S_t) + \alpha \Bigl( G_t^{(n)} - V(S_t) \Bigr)$$
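As a minimal sketch of how this could look in plain Python, here is one $n$-step update on a tabular value function stored in a dict. The function name `n_step_td_update`, the trajectory layout, and the parameter values are illustrative assumptions, not part of any particular library:

```python
from collections import defaultdict

def n_step_td_update(V, states, rewards, t, n, alpha=0.1, gamma=0.9):
    """Apply one n-step TD update to V(S_t).

    Assumed layout: states[k] is S_k, rewards[k] is R_{k+1}, and the
    trajectory has been collected up to termination (T = len(rewards)).
    """
    T = len(rewards)
    end = min(t + n, T)
    # Discounted sum of the next n rewards (truncated at episode end)
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    # Bootstrap from V(S_{t+n}) only if the episode hasn't terminated
    # by then; the terminal state's value is zero, so we skip it otherwise
    if t + n < T:
        G += gamma ** n * V[states[t + n]]
    V[states[t]] += alpha * (G - V[states[t]])

# Tiny usage example on a made-up 3-step trajectory
V = defaultdict(float)
states = ["s0", "s1", "s2", "terminal"]
rewards = [1.0, 0.0, 2.0]          # R_1, R_2, R_3
n_step_td_update(V, states, rewards, t=0, n=2)
print(V["s0"])                      # 0.1, i.e. alpha * G_0^{(2)} from V = 0
```

With `n=1` the loop reduces to the TD(0) target, and with `n` at least the episode length no bootstrapping happens, which is exactly the Monte Carlo case described above.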

TD($\lambda$)

TD($\lambda$) is a clever idea that builds on top of $n$-step TD learning: instead of choosing a fixed $n$, we combine all $n$-step returns together:

$$L_t = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$$

where $\lambda \in [0, 1]$ controls the weighting:

  • If $\lambda = 0$: only the one-step return $\to$ TD(0);
  • If $\lambda = 1$: the full return $\to$ Monte Carlo;
  • Intermediate values blend multiple $n$-step returns. For example, with $\lambda = 0.5$ the weights on $G_t^{(1)}, G_t^{(2)}, G_t^{(3)}, \dots$ are $0.5, 0.25, 0.125, \dots$, which sum to 1.

So $\lambda$ acts as a bias-variance tradeoff knob:

  • Low $\lambda$: more bias, less variance;
  • High $\lambda$: less bias, more variance.

$L_t$ can then be used as the update target in the TD(0) update rule:

$$V(S_t) \gets V(S_t) + \alpha \Bigl( L_t - V(S_t) \Bigr)$$
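Below is a sketch of how $L_t$ could be computed for a finished episode, assuming the same tabular setup as in the earlier example. For a finite episode the infinite sum truncates: every $n$-step return that bootstraps gets weight $(1-\lambda)\lambda^{n-1}$, and the final Monte Carlo return absorbs the leftover weight $\lambda^{T-t-1}$ so the weights still sum to 1. The helper name `lambda_return` is hypothetical:

```python
def lambda_return(V, states, rewards, t, lam=0.8, gamma=0.9):
    """Truncated λ-return L_t for a finished episode (T = len(rewards)).

    Each G_t^{(n)} with t + n < T gets weight (1-λ)λ^(n-1); the full
    Monte Carlo return gets the remaining weight λ^(T-t-1).
    """
    T = len(rewards)
    G, n_step_returns = 0.0, []
    for n in range(1, T - t + 1):
        G += gamma ** (n - 1) * rewards[t + n - 1]
        if t + n < T:                        # bootstrap while non-terminal
            n_step_returns.append(G + gamma ** n * V[states[t + n]])
        else:                                # episode ended: full return
            n_step_returns.append(G)
    L = (1 - lam) * sum(
        lam ** (n - 1) * G_n
        for n, G_n in enumerate(n_step_returns[:-1], start=1)
    )
    L += lam ** (len(n_step_returns) - 1) * n_step_returns[-1]
    return L

# The λ-return then replaces the target in the same update rule:
# V[states[t]] += alpha * (lambda_return(V, states, rewards, t) - V[states[t]])
```

Setting `lam=0` recovers the TD(0) target and `lam=1` recovers the Monte Carlo return, matching the two extremes listed above.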
