Generalization of TD Learning


So far, we have considered two extreme cases of learning from experience:

  • TD(0): uses the one-step return;
  • Monte Carlo: waits until the end of the episode to compute the full return.

But what if we want something in between? Something that leverages more future information than TD(0), yet doesn't need to wait for the full episode like Monte Carlo?

This is where $n$-step TD learning and TD($\lambda$) come in: methods that unify and generalize the ideas we've seen so far.

$n$-Step TD Learning

The idea behind $n$-step TD learning is simple: instead of using just the next step or the entire episode, we use the next $n$ steps and then bootstrap:

$$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})$$

This allows for a tradeoff:

  • When $n = 1$: it's just TD(0);
  • When $n = \infty$: it becomes Monte Carlo.

This return can then be used as the target in the TD(0) update rule:

$$V(S_t) \gets V(S_t) + \alpha \Bigl( G_t^{(n)} - V(S_t) \Bigr)$$
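As a minimal sketch of how this could look in plain Python, here is one $n$-step update on a tabular value function stored in a dict. The function name `n_step_td_update`, the trajectory layout, and the parameter values are illustrative assumptions, not part of any particular library:

```python
from collections import defaultdict

def n_step_td_update(V, states, rewards, t, n, alpha=0.1, gamma=0.9):
    """Apply one n-step TD update to V(S_t).

    Assumed layout: states[k] is S_k, rewards[k] is R_{k+1}, and the
    trajectory has been collected up to termination (T = len(rewards)).
    """
    T = len(rewards)
    end = min(t + n, T)
    # Discounted sum of the next n rewards (truncated at episode end)
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    # Bootstrap from V(S_{t+n}) only if the episode hasn't terminated
    # by then; the terminal state's value is zero, so we skip it otherwise
    if t + n < T:
        G += gamma ** n * V[states[t + n]]
    V[states[t]] += alpha * (G - V[states[t]])

# Tiny usage example on a made-up 3-step trajectory
V = defaultdict(float)
states = ["s0", "s1", "s2", "terminal"]
rewards = [1.0, 0.0, 2.0]          # R_1, R_2, R_3
n_step_td_update(V, states, rewards, t=0, n=2)
print(V["s0"])                      # 0.1, i.e. alpha * G_0^{(2)} from V = 0
```

With `n=1` the loop reduces to the TD(0) target, and with `n` at least the episode length no bootstrapping happens, which is exactly the Monte Carlo case described above.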

TD($\lambda$)

TD($\lambda$) is a clever idea that builds on top of $n$-step TD learning: instead of choosing a fixed $n$, we combine all $n$-step returns together:

$$L_t = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$$

where $\lambda \in [0, 1]$ controls the weighting:

  • If $\lambda = 0$: only the one-step return $\to$ TD(0);
  • If $\lambda = 1$: the full return $\to$ Monte Carlo;
  • Intermediate values blend multiple $n$-step returns. For example, with $\lambda = 0.5$ the weights on $G_t^{(1)}, G_t^{(2)}, G_t^{(3)}, \dots$ are $0.5, 0.25, 0.125, \dots$, which sum to 1.

So $\lambda$ acts as a bias-variance tradeoff knob:

  • Low $\lambda$: more bias, less variance;
  • High $\lambda$: less bias, more variance.

$L_t$ can then be used as the update target in the TD(0) update rule:

$$V(S_t) \gets V(S_t) + \alpha \Bigl( L_t - V(S_t) \Bigr)$$
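Below is a sketch of how $L_t$ could be computed for a finished episode, assuming the same tabular setup as in the earlier example. For a finite episode the infinite sum truncates: every $n$-step return that bootstraps gets weight $(1-\lambda)\lambda^{n-1}$, and the final Monte Carlo return absorbs the leftover weight $\lambda^{T-t-1}$ so the weights still sum to 1. The helper name `lambda_return` is hypothetical:

```python
def lambda_return(V, states, rewards, t, lam=0.8, gamma=0.9):
    """Truncated λ-return L_t for a finished episode (T = len(rewards)).

    Each G_t^{(n)} with t + n < T gets weight (1-λ)λ^(n-1); the full
    Monte Carlo return gets the remaining weight λ^(T-t-1).
    """
    T = len(rewards)
    G, n_step_returns = 0.0, []
    for n in range(1, T - t + 1):
        G += gamma ** (n - 1) * rewards[t + n - 1]
        if t + n < T:                        # bootstrap while non-terminal
            n_step_returns.append(G + gamma ** n * V[states[t + n]])
        else:                                # episode ended: full return
            n_step_returns.append(G)
    L = (1 - lam) * sum(
        lam ** (n - 1) * G_n
        for n, G_n in enumerate(n_step_returns[:-1], start=1)
    )
    L += lam ** (len(n_step_returns) - 1) * n_step_returns[-1]
    return L

# The λ-return then replaces the target in the same update rule:
# V[states[t]] += alpha * (lambda_return(V, states, rewards, t) - V[states[t]])
```

Setting `lam=0` recovers the TD(0) target and `lam=1` recovers the Monte Carlo return, matching the two extremes listed above.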
