Learn Policy Iteration | Dynamic Programming
Introduction to Reinforcement Learning
Policy Iteration

The idea behind policy iteration is simple:

  1. Take some initial policy $\pi$ and value function $v$;

  2. Use policy evaluation to update $v$ until it is consistent with $\pi$;

  3. Use policy improvement to update $\pi$ so that it is greedy with respect to $v$;

  4. Repeat steps 2-3 until convergence, that is, until the policy no longer changes.
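
For reference, the two updates in steps 2 and 3 can be written out as follows. This is a sketch in standard notation that the lesson text itself does not introduce: it assumes a deterministic policy $\pi(s)$, transition dynamics $p(s', r \mid s, a)$, and a discount factor $\gamma$.

```latex
% Step 2, policy evaluation: sweep over all states until v stops changing
v(s) \leftarrow \sum_{s', r} p(s', r \mid s, \pi(s)) \bigl[ r + \gamma \, v(s') \bigr]

% Step 3, policy improvement: pick the greedy action in every state
\pi(s) \leftarrow \arg\max_{a} \sum_{s', r} p(s', r \mid s, a) \bigl[ r + \gamma \, v(s') \bigr]
```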

In this method, there are no partial updates:

  • During policy evaluation, values are updated for every state until they are consistent with the current policy;

  • During policy improvement, the policy is made greedy with respect to the value function.
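
The loop described above can be sketched in plain Python. This is a minimal illustration, not the course's own implementation. The MDP format (`P[state][action]` as a list of `(probability, next_state, reward)` tuples), the function names, the tiny three-state example, and the parameters `gamma` and `theta` are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical MDP format: P[state][action] is a list of
# (probability, next_state, reward) tuples. Terminal states have no actions.
Transitions = Dict[int, Dict[int, List[Tuple[float, int, float]]]]


def q_value(P: Transitions, v: Dict[int, float], s: int, a: int, gamma: float) -> float:
    """Expected return of taking action a in state s, then following v."""
    return sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])


def policy_evaluation(P, policy, v, gamma, theta):
    """Sweep over all states until v is consistent with the policy (within theta)."""
    while True:
        delta = 0.0
        for s in P:
            if not P[s]:  # terminal state: its value stays at 0
                continue
            new_v = q_value(P, v, s, policy[s], gamma)
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v


def policy_improvement(P, policy, v, gamma):
    """Make the policy greedy with respect to v; return True if nothing changed."""
    stable = True
    for s in P:
        if not P[s]:
            continue
        best = max(P[s], key=lambda a: q_value(P, v, s, a, gamma))
        # Switch only on a strict improvement, so ties cannot flip back and forth.
        if q_value(P, v, s, best, gamma) > q_value(P, v, s, policy[s], gamma) + 1e-12:
            policy[s] = best
            stable = False
    return stable


def policy_iteration(P: Transitions, gamma: float = 0.9, theta: float = 1e-8):
    v = {s: 0.0 for s in P}                             # initial value function
    policy = {s: next(iter(P[s])) for s in P if P[s]}   # arbitrary initial policy
    while True:
        policy_evaluation(P, policy, v, gamma, theta)   # step 2
        if policy_improvement(P, policy, v, gamma):     # step 3
            return policy, v                            # policy is stable: stop


# Toy chain: states 0 and 1 are non-terminal, state 2 is terminal.
# Action 0 moves left (or stays in place), action 1 moves right.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 2, 1.0)]},
    2: {},
}
policy, v = policy_iteration(P)
print(policy, v)
```

In this sketch the outer loop ends when an improvement pass leaves the policy unchanged. On the toy chain the resulting policy moves right in both non-terminal states.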

Pseudocode

Based on the pseudocode, what condition causes the outer loop of policy iteration to stop?
