Learn Policy Iteration | Dynamic Programming
Introduction to Reinforcement Learning

Policy Iteration

The idea behind policy iteration is simple:

  1. Take some initial π and v.
  2. Use policy evaluation to update v until it's consistent with π.
  3. Use policy improvement to update π until it's greedy with respect to v.
  4. Repeat steps 2-3 until convergence, that is, until policy improvement no longer changes π.
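
The two updates in steps 2 and 3 can be written out explicitly. The notation here is assumed rather than taken from this chapter: p(s', r | s, a) stands for the environment's transition probabilities and γ for the discount factor, following the usual textbook convention. The improved policy is deterministic, so π(s) denotes the single action it picks in state s.

```latex
\begin{align*}
\text{Policy evaluation:}\quad
  & v(s) \leftarrow \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v(s')\bigr]
  && \text{for all } s \\
\text{Policy improvement:}\quad
  & \pi(s) \leftarrow \arg\max_{a} \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v(s')\bigr]
  && \text{for all } s
\end{align*}
```

The evaluation update is repeated over all states until v stops changing; the improvement update is then applied once to every state.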

In this method, there are no partial updates:

  • During policy evaluation, values are updated for every state until they are consistent with the current policy;
  • During policy improvement, the policy is made greedy with respect to the value function.

Pseudocode
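
The loop described above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the course's own implementation: the MDP is assumed to be small and fully known, the policy is assumed deterministic, and the names `P` (transition probabilities, shape states × actions × states), `R` (expected rewards, shape states × actions), `gamma`, and `theta` are hypothetical choices made for this example.

```python
import numpy as np

def policy_evaluation(P, R, policy, v, gamma=0.9, theta=1e-8):
    """Sweep over all states until v is consistent with `policy` (within theta)."""
    n_states = R.shape[0]
    while True:
        delta = 0.0
        for s in range(n_states):
            a = policy[s]
            # Expected one-step return when following the current policy.
            new_v = R[s, a] + gamma * (P[s, a] @ v)
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v

def policy_improvement(P, R, v, gamma=0.9):
    """Make the policy greedy with respect to v in every state."""
    q = R + gamma * (P @ v)          # action values, shape (states, actions)
    return np.argmax(q, axis=1)

def policy_iteration(P, R, gamma=0.9, theta=1e-8):
    n_states = R.shape[0]
    policy = np.zeros(n_states, dtype=int)   # arbitrary initial policy
    v = np.zeros(n_states)                   # arbitrary initial values
    while True:
        v = policy_evaluation(P, R, policy, v, gamma, theta)
        new_policy = policy_improvement(P, R, v, gamma)
        if np.array_equal(new_policy, policy):
            return policy, v                 # policy is stable: stop
        policy = new_policy

# Hypothetical toy MDP with 2 states and 2 actions (made up for this sketch).
# State 0: action 0 stays (reward 0), action 1 moves to state 1 (reward 1).
# State 1: action 0 stays (reward 2), action 1 moves to state 0 (reward 0).
P = np.zeros((2, 2, 2))
P[0, 0, 0] = 1.0
P[0, 1, 1] = 1.0
P[1, 0, 1] = 1.0
P[1, 1, 0] = 1.0
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])

policy, v = policy_iteration(P, R, gamma=0.9)
print(policy, v)
```

In this toy MDP the best behaviour is to move to state 1 and stay there, so the returned policy picks action 1 in state 0 and action 0 in state 1, with values close to 19 and 20. The stopping test compares whole policies; if several actions tie in some state, a more careful implementation might compare action values instead to avoid switching between equally good policies.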


Section 3. Chapter 7