Introduction to Reinforcement Learning
TD(0): Value Function Estimation
The simplest version of TD learning is called TD(0). It updates the value of a state based on the immediate reward and the estimated value of the next state. It is a one-step TD method.
Update Rule
Given a state $S_t$, reward $R_{t+1}$ and next state $S_{t+1}$, the update rule looks like this:

$$V(S_t) \leftarrow V(S_t) + \alpha \bigl(R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\bigr)$$

where

- $\alpha$ is a learning rate, or step size;
- $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is a TD error.
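As a minimal sketch of this rule (assuming a tabular value function stored in a dict-like object and illustrative parameter values), a single update can be written as:

```python
def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.99):
    """Apply one TD(0) update to the tabular value function V."""
    # TD error: one-step target minus the current estimate
    td_error = reward + gamma * V[next_state] - V[state]
    # Nudge the estimate toward the target by a step of size alpha
    V[state] += alpha * td_error
    return td_error
```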
Intuition
The state value function can be defined and expanded as follows:

$$v_\pi(s) = \mathbb{E}_\pi\bigl[G_t \mid S_t = s\bigr] = \mathbb{E}_\pi\bigl[R_{t+1} + \gamma G_{t+1} \mid S_t = s\bigr] = \mathbb{E}_\pi\bigl[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s\bigr]$$

This gives the first part of $\delta_t$: the experienced return $R_{t+1} + \gamma V(S_{t+1})$. The second part of $\delta_t$ is the expected return $V(S_t)$. The TD error $\delta_t$ is therefore the observable discrepancy between what actually happened and what we previously believed would happen. So the update rule adjusts the previous belief a little on each step, moving it closer to the truth.
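As a small numeric illustration with made-up values: suppose $\alpha = 0.1$, $\gamma = 0.9$, $V(S_t) = 2.0$, $R_{t+1} = 1$ and $V(S_{t+1}) = 3.0$. Then

$$\delta_t = 1 + 0.9 \cdot 3.0 - 2.0 = 1.7, \qquad V(S_t) \leftarrow 2.0 + 0.1 \cdot 1.7 = 2.17.$$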
TD(0) vs Monte Carlo Estimation
Both TD(0) and Monte Carlo estimation use sampled experience to estimate the state value function $v_\pi$ for a policy $\pi$. Under standard convergence conditions, both converge to the true $v_\pi$ as the number of visits to each state goes to infinity. In practice, however, we only ever have a finite amount of data, and the two methods differ significantly in how they use that data and how quickly they learn.
Bias-Variance Tradeoff
From a bias–variance tradeoff perspective:
Monte Carlo estimation waits until an episode ends and then uses the full return to update values. This yields unbiased estimates — the returns truly reflect the underlying distribution — but they can swing dramatically, especially in long or highly stochastic tasks. High variance means many episodes are required to average out the noise and obtain stable value estimates.
TD(0) bootstraps by combining each one‑step reward with the current estimate of the next state's value. This introduces bias — early updates rely on imperfect estimates — but keeps variance low, since each update is based on a small, incremental error. Lower variance lets TD(0) propagate reward information through the state space more quickly, even though initial bias can slow down the convergence.
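To make the contrast concrete, here is a rough sketch of how the two methods consume the same episode. It assumes `V` is a dict-like mapping from states to value estimates and `episode` is a list of `(state, reward)` pairs, where `reward` is the reward received after leaving `state`; these conventions are illustrative, and a real TD(0) implementation would update online as transitions arrive rather than from a stored list.

```python
def mc_updates(V, episode, alpha=0.1, gamma=0.99):
    """Monte Carlo: wait until the episode ends, then update toward full returns."""
    G = 0.0
    # Walk backwards so G accumulates the discounted return from each state onward
    for state, reward in reversed(episode):
        G = reward + gamma * G
        V[state] += alpha * (G - V[state])  # unbiased target, but high variance

def td0_updates(V, episode, alpha=0.1, gamma=0.99):
    """TD(0): update after every transition using the bootstrapped one-step target."""
    for t, (state, reward) in enumerate(episode):
        # Value of the next state; the terminal state after the last transition is worth 0
        next_value = V[episode[t + 1][0]] if t + 1 < len(episode) else 0.0
        target = reward + gamma * next_value    # biased (bootstrapped), but low variance
        V[state] += alpha * (target - V[state])
```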
Learning Data vs Learning Model
Another way to look at these two methods is to analyze what each of them really learns:
Monte Carlo estimation learns directly from the observed returns, effectively fitting its value estimates to the specific episodes it has seen. This means it minimizes error on those training trajectories, but because it never builds an explicit view of how states lead to one another, it can struggle to generalize to new or slightly different situations.
TD(0), by contrast, bootstraps on each one-step transition, combining the immediate reward with its estimate of the next state's value. In doing so, it effectively captures the relationships between states — an implicit model of the environment's dynamics. This model‑like understanding lets TD(0) generalize better to unseen transitions, often yielding more accurate value estimates on new data.
Pseudocode
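Tabular TD(0) prediction can be sketched in Python as follows. The environment interface here is simplified and hypothetical: `reset()` returns the initial state, `step(action)` returns `(next_state, reward, done)`, and `policy` is a function mapping states to actions.

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes=1000, alpha=0.1, gamma=0.99):
    """Tabular TD(0) policy evaluation: estimate v_pi for the given policy."""
    V = defaultdict(float)  # value estimates, initialized to 0 for every state
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)                        # act according to pi
            next_state, reward, done = env.step(action)   # one-step transition
            # TD target: immediate reward plus discounted value of the next state
            target = reward if done else reward + gamma * V[next_state]
            V[state] += alpha * (target - V[state])       # move toward the target
            state = next_state
    return V
```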