TD(0): Value Function Estimation

The simplest version of TD learning is called TD(0). It updates the value of a state based on the immediate reward and the estimated value of the next state. It is a one-step TD method.

Update Rule

Given a state $S_t$, a reward $R_{t+1}$, and the next state $S_{t+1}$, the update rule looks like this:

$$V(S_t) \gets V(S_t) + \alpha\Bigl(R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\Bigr)$$

where

  • $\alpha$ is the learning rate, or step size;
  • $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is the TD error.
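
As a quick illustration, here is a minimal Python sketch of a single TD(0) update on a tabular value estimate; the function and variable names (`td0_update`, `value`, `alpha`, `gamma`) are illustrative, not part of the course material.

```python
def td0_update(value, state, reward, next_state, alpha=0.1, gamma=0.99):
    """Apply one TD(0) update to a tabular value estimate.

    value: dict mapping states to current estimates V(s)
    alpha: learning rate (step size), gamma: discount factor
    """
    td_target = reward + gamma * value[next_state]  # R_{t+1} + gamma * V(S_{t+1})
    td_error = td_target - value[state]             # delta_t
    value[state] += alpha * td_error                # V(S_t) <- V(S_t) + alpha * delta_t
    return td_error
```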

Intuition

The state value function $v_\pi$ can be defined and expanded as follows:

$$\begin{aligned} v_\pi(s) &= \mathbb{E}_\pi[G_t \mid S_t = s] \\ &= \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s] \\ &= \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s] \end{aligned}$$

This gives the first part of $\delta_t$: the experienced return $R_{t+1} + \gamma V(S_{t+1})$. The second part of $\delta_t$ is the expected return $V(S_t)$, our current estimate. The TD error $\delta_t$ is therefore the observable discrepancy between what actually happened and what we previously believed would happen, so the update rule adjusts the previous belief a little on each step, moving it closer to the truth.
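
For concreteness, here is a small worked example with made-up numbers: suppose $\gamma = 0.9$, $\alpha = 0.1$, the current estimates are $V(S_t) = 0.5$ and $V(S_{t+1}) = 1.0$, and the observed reward is $R_{t+1} = 0.2$. Then

$$\delta_t = 0.2 + 0.9 \cdot 1.0 - 0.5 = 0.6, \qquad V(S_t) \gets 0.5 + 0.1 \cdot 0.6 = 0.56$$

so the estimate for $S_t$ moves a small step toward the one-step target.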

TD(0) vs Monte Carlo Estimation

Both TD(0) and Monte Carlo estimation use sampled experience to estimate the state value function $v_\pi(s)$ for a policy $\pi$. Under standard convergence conditions, both converge to the true $v_\pi(s)$ as the number of visits to each state goes to infinity. In practice, however, we only ever have a finite amount of data, and the two methods differ significantly in how they use that data and how quickly they learn.

Bias-Variance Tradeoff

From a bias–variance tradeoff perspective:

Monte Carlo estimation waits until an episode ends and then uses the full return to update values. This yields unbiased estimates — the returns truly reflect the underlying distribution — but they can swing dramatically, especially in long or highly stochastic tasks. High variance means many episodes are required to average out the noise and obtain stable value estimates.

TD(0) bootstraps by combining each one-step reward with the current estimate of the next state's value. This introduces bias, since early updates rely on imperfect estimates, but it keeps variance low, because each update is based on a small, incremental error. Lower variance lets TD(0) propagate reward information through the state space more quickly, even though the initial bias can slow convergence.
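
The difference is easy to see in code. The sketch below is illustrative (the episode format, the `value` dictionary, and the function names are assumptions, not taken from the course): Monte Carlo waits for the full return of an episode, while TD(0) updates after every step using a bootstrapped target.

```python
from collections import defaultdict

# value: tabular estimate V(s), e.g. value = defaultdict(float)
# episode: list of (state, reward) pairs, where reward is the one received after leaving state

def mc_update(value, episode, alpha=0.1, gamma=0.99):
    """Every-visit Monte Carlo: move each visited state toward the full return G_t."""
    g = 0.0
    for state, reward in reversed(episode):         # accumulate returns backwards
        g = reward + gamma * g                      # G_t = R_{t+1} + gamma * G_{t+1}
        value[state] += alpha * (g - value[state])  # unbiased target, high variance

def td0_updates(value, episode, alpha=0.1, gamma=0.99):
    """TD(0): move each state toward the one-step bootstrapped target."""
    for i, (state, reward) in enumerate(episode):
        next_value = value[episode[i + 1][0]] if i + 1 < len(episode) else 0.0
        target = reward + gamma * next_value        # biased (bootstrapped) target, low variance
        value[state] += alpha * (target - value[state])
```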

Learning Data vs Learning Model

Another way to look at these two methods is to analyze what each of them really learns:

Monte Carlo estimation learns directly from the observed returns, effectively fitting its value estimates to the specific episodes it has seen. This means it minimizes error on those training trajectories, but because it never builds an explicit view of how states lead to one another, it can struggle to generalize to new or slightly different situations.

TD(0), by contrast, bootstraps on each one-step transition, combining the immediate reward with its estimate of the next state's value. In doing so, it effectively captures the relationships between states — an implicit model of the environment's dynamics. This model‑like understanding lets TD(0) generalize better to unseen transitions, often yielding more accurate value estimates on new data.
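
A classic worked example from the RL literature (with $\gamma = 1$) makes this concrete: suppose a batch of eight episodes contains one episode $A, 0, B, 0$ and seven episodes that start in $B$, six ending with reward $1$ and one with reward $0$. Monte Carlo estimation fits the observed returns and sets $V(A) = 0$, since the only return seen from $A$ was $0$. Batch TD(0), in effect, models the transition structure: $A$ always moved to $B$ with reward $0$, and $B$ yielded reward $1$ on $6$ of its $8$ visits, so $V(B) = 0.75$ and $V(A) = 0.75$ as well, which is usually the better estimate for future data.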

Pseudocode
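
A minimal Python sketch of tabular TD(0) prediction is given below. It assumes a Gym-style environment interface (`env.reset()` returning a state and `env.step(action)` returning `(next_state, reward, done, info)`) and a `policy(state)` function; these names are assumptions rather than part of the original pseudocode.

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes=1000, alpha=0.1, gamma=0.99):
    """Estimate v_pi with tabular TD(0) under the given policy."""
    V = defaultdict(float)                            # V(s) initialized to 0 for all states
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done, _ = env.step(action)
            # Terminal states have value 0 by definition.
            target = reward + gamma * (0.0 if done else V[next_state])
            V[state] += alpha * (target - V[state])   # V(S_t) += alpha * delta_t
            state = next_state
    return V
```

With a fixed policy and a suitably small step size, the estimates move toward $v_\pi$ as the number of episodes grows.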

