Introduction to Reinforcement Learning
What is Temporal Difference Learning?
Both dynamic programming and Monte Carlo methods have some great ideas and some major drawbacks.
Dynamic Programming
Dynamic programming can efficiently compute the state value function and derive an optimal policy from it. It relies on bootstrapping: computing the current state's value from the values of its successor states.
And while the idea of bootstrapping is powerful, dynamic programming itself has two major drawbacks:
- It requires a complete and explicit model of the environment;
- State values are computed for every state, even if the state is nowhere near the optimal path.
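
To make the bootstrapping idea concrete, here is a minimal value-iteration style sketch over a hypothetical toy model; the transition probabilities `P`, rewards `R`, and discount factor `gamma` below are illustrative assumptions, not part of the course material. Note how it needs the full model (`P` and `R`) and how every sweep touches every state:

```python
import numpy as np

# Hypothetical toy model: 3 states, 2 actions (assumed known, which is the DP requirement).
# P[s, a, s'] = probability of landing in s' after taking action a in state s.
# R[s, a]     = expected immediate reward for taking action a in state s.
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.9, 0.0]],
    [[0.0, 0.5, 0.5], [0.0, 0.1, 0.9]],
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],  # state 2 is absorbing
])
R = np.array([
    [0.0, 0.0],
    [0.0, 1.0],
    [0.0, 0.0],
])
gamma = 0.9

# Bootstrapped backup: each state's value is computed from the current
# estimates of its successor states' values, for ALL states on every sweep.
V = np.zeros(3)
for _ in range(100):
    Q = R + gamma * (P @ V)   # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] * V[s']
    V = Q.max(axis=1)         # greedy backup over actions (value iteration)

print(V)
```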
Monte Carlo Methods
Monte Carlo methods fix the two drawbacks dynamic programming has:
- They don't require a model, as they learn from experience;
- Learning from sampled experience naturally focuses on the states the agent actually visits, so unimportant states are rarely evaluated.
But they introduce a new drawback: learning happens only after an episode concludes. This limits Monte Carlo methods to small episodic tasks, since larger tasks would require an absurdly large number of actions before the episode ends and any update can be made.
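
For contrast, here is a minimal sketch of every-visit Monte Carlo value estimation; `run_episode` is a hypothetical callable standing in for interaction with some environment. Notice that no estimate can be updated until the episode has fully concluded:

```python
from collections import defaultdict

def mc_value_estimation(run_episode, num_episodes=1000, gamma=0.9):
    """Every-visit Monte Carlo prediction (sketch).

    `run_episode` is a hypothetical callable returning one finished episode
    as a list of (state, reward) pairs, where reward is the reward received
    after leaving that state. Values are updated only once the episode ends.
    """
    values = defaultdict(float)
    counts = defaultdict(int)
    for _ in range(num_episodes):
        episode = run_episode()        # must wait for the whole episode to finish
        G = 0.0
        # Walk backwards, accumulating the discounted return from each state.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            counts[state] += 1
            # Incremental average of the observed returns for this state.
            values[state] += (G - values[state]) / counts[state]
    return values
```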
Temporal Difference Learning
Temporal difference (TD) learning is the result of combining ideas from both dynamic programming and Monte Carlo methods. It takes the learning-from-experience approach of Monte Carlo methods and combines it with the bootstrapping of dynamic programming.
As a result, TD learning fixes the major issues the two methods have:
- Learning from experience addresses both the need for a model and the problem of large state spaces;
- Bootstrapping addresses the issue of episodic learning.
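
This combination shows up in the standard TD(0) update for state values, where the bootstrapped target (reward plus the discounted value of the next state) replaces the full Monte Carlo return; here $\alpha$ is the step size and $\gamma$ the discount factor:

$$V(S_t) \leftarrow V(S_t) + \alpha \bigl[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \bigr]$$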
How Does it Work?
TD learning works through a simple loop:
- Estimate the value: the agent starts with an initial guess of how good the current state is;
- Take an action: it performs an action, receives a reward, and ends up in a new state;
- Update the estimate: using the reward and the value of the new state, the agent slightly adjusts its original estimate to make it more accurate;
- Repeat: over time, by repeating this loop, the agent gradually builds better and more accurate value estimates for different states.
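
Put together, the loop above might look like the following minimal TD(0) prediction sketch; the Gymnasium-style `env` and the `policy` callable are assumptions made for illustration, not objects defined in this chapter:

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes=1000, alpha=0.1, gamma=0.9):
    """Minimal TD(0) state-value prediction (sketch).

    Assumes a Gymnasium-style `env` with `reset()` -> (state, info) and
    `step(action)` -> (next_state, reward, terminated, truncated, info),
    plus a `policy(state)` callable that returns an action.
    """
    V = defaultdict(float)                      # initial guess: all state values 0
    for _ in range(num_episodes):
        state, _ = env.reset()
        done = False
        while not done:
            action = policy(state)              # take an action
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            # Bootstrapped target: reward plus discounted value of the new state.
            target = reward + gamma * (0.0 if terminated else V[next_state])
            # Nudge the old estimate toward the target (update the estimate).
            V[state] += alpha * (target - V[state])
            state = next_state                  # repeat from the new state
    return V
```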
Comparison Table

| | Dynamic Programming | Monte Carlo | TD Learning |
|---|---|---|---|
| Requires a model of the environment | Yes | No | No |
| Learns from experience | No | Yes | Yes |
| Uses bootstrapping | Yes | No | Yes |
| Must wait for the episode to end before updating | No | Yes | No |
| Computes values for unimportant states | Yes | Rarely | Rarely |