Learn Policy Iteration | Dynamic Programming
Introduction to Reinforcement Learning
Policy Iteration

The idea behind policy iteration is simple:

  1. Take some initial $\pi$ and $v$;

  2. Use policy evaluation to update $v$ until it's consistent with $\pi$;

  3. Use policy improvement to update $\pi$ so that it's greedy with respect to $v$;

  4. Repeat steps 2-3 until convergence.
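
Steps 2 and 3 can be written as update rules. The notation below is an assumption borrowed from the standard tabular setting and does not appear in the lesson text: $p(s', r \mid s, a)$ stands for the environment dynamics, $\gamma$ for the discount factor, and $\pi(a \mid s)$ for the probability of taking action $a$ in state $s$.

```latex
% Step 2 (policy evaluation): sweep over all states, repeating until v stops changing
v(s) \leftarrow \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \, v(s') \right]

% Step 3 (policy improvement): make the policy greedy with respect to v
\pi(s) \leftarrow \arg\max_{a} \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \, v(s') \right]
```

For a deterministic policy, the outer sum in the first rule reduces to the single action the policy selects in $s$.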

In this method, there are no partial updates:

  • During policy evaluation, values are updated for every state until they are consistent with the current policy;

  • During policy improvement, the policy is made greedy with respect to the value function.

Pseudocode
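
The steps above can be sketched in Python. This is a minimal sketch under stated assumptions, not the course's own pseudocode: the MDP is tabular and given as a hypothetical structure `P[s][a]`, a list of `(probability, next_state, reward)` tuples, and the names `policy_evaluation`, `policy_improvement`, `policy_iteration`, `gamma`, and `theta` are illustrative choices.

```python
def action_value(P, v, s, a, gamma):
    # Expected return of taking action a in state s, then following v.
    return sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])

def policy_evaluation(P, policy, v, gamma, theta):
    # Sweep over every state until the largest change falls below theta.
    while True:
        delta = 0.0
        for s in range(len(P)):
            new_v = action_value(P, v, s, policy[s], gamma)
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v

def policy_improvement(P, policy, v, gamma):
    # Make the policy greedy with respect to v; report whether it changed.
    stable = True
    for s in range(len(P)):
        q = [action_value(P, v, s, a, gamma) for a in range(len(P[s]))]
        best = max(range(len(q)), key=q.__getitem__)
        # Switch only on a strict improvement, so ties cannot cause cycling.
        if q[best] > q[policy[s]] + 1e-12:
            policy[s] = best
            stable = False
    return policy, stable

def policy_iteration(P, gamma=0.9, theta=1e-8):
    policy = [0] * len(P)   # arbitrary initial policy (assumption)
    v = [0.0] * len(P)      # arbitrary initial values (assumption)
    while True:
        v = policy_evaluation(P, policy, v, gamma, theta)
        policy, stable = policy_improvement(P, policy, v, gamma)
        if stable:
            return policy, v

# Hypothetical 3-state chain: action 0 moves left, action 1 moves right.
# Moving right from state 1 reaches the absorbing state 2 with reward 1.
P = [
    [[(1.0, 0, 0.0)], [(1.0, 1, 0.0)]],  # state 0
    [[(1.0, 0, 0.0)], [(1.0, 2, 1.0)]],  # state 1
    [[(1.0, 2, 0.0)], [(1.0, 2, 0.0)]],  # state 2 (absorbing)
]
policy, v = policy_iteration(P)
print(policy, v)
```

In this toy chain the resulting policy moves right in states 0 and 1. The strict-improvement check in `policy_improvement` is a design choice of this sketch: it keeps the current action when several actions are tied, so the loop cannot alternate forever between equally good policies.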


Based on the pseudocode, what condition causes the outer loop of policy iteration to stop?


Section 3. Chapter 7
