Value Iteration

While policy iteration is an effective approach for solving MDPs, it has a significant drawback: each iteration involves a separate policy evaluation step. When policy evaluation is performed iteratively, it requires multiple sweeps over the entire state space, leading to considerable computational overhead.

A good alternative is value iteration, a method that merges policy evaluation and policy improvement into a single step. This method updates the value function directly until it converges to the optimal value function. Once convergence is achieved, the optimal policy can be derived directly from this optimal value function.
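Concretely, once the optimal value function $v_*$ is known, an optimal policy can be extracted by acting greedily with respect to it:

$$
\pi_*(s) = \arg\max_a \sum_{s',r} p(s',r|s,a)\Bigl(r + \gamma v_*(s')\Bigr)
$$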

How Does It Work?

Value iteration works by performing only a single policy evaluation backup before each policy improvement step. This results in the following update formula:

$$
v_{k+1}(s) \gets \max_a \sum_{s',r} p(s',r|s,a)\Bigl(r + \gamma v_k(s')\Bigr) \qquad \forall s \in S
$$

By turning the Bellman optimality equation into an update rule, policy evaluation and policy improvement are merged into a single step.
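For reference, this update repeatedly applies the Bellman optimality equation for the state-value function,

$$
v_*(s) = \max_a \sum_{s',r} p(s',r|s,a)\Bigl(r + \gamma v_*(s')\Bigr),
$$

as an assignment rather than solving it as a system of equations.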

Pseudocode
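Below is a minimal Python sketch of the loop described above. It is an illustration under assumed data structures, not the lesson's exact pseudocode: it assumes the MDP dynamics are given as a nested mapping `P[s][a]` returning a list of `(probability, next_state, reward)` triples, and it stops once the largest per-state change in a sweep falls below a small threshold `theta`.

```python
def value_iteration(P, states, actions, gamma=0.9, theta=1e-8):
    """Tabular value iteration for an MDP with known dynamics.

    P is assumed to map P[s][a] -> list of (prob, next_state, reward)
    triples; this representation is an assumption for the sketch.
    """
    # Initialize v_0(s) = 0 for every state.
    V = {s: 0.0 for s in states}

    while True:
        delta = 0.0  # largest change across states in this sweep
        for s in states:
            # Bellman optimality backup: best expected return over actions.
            best = max(
                sum(p * (r + gamma * V[s_next]) for p, s_next, r in P[s][a])
                for a in actions
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        # Stop once the value function has (approximately) converged.
        if delta < theta:
            break

    # Extract the greedy policy with respect to the converged values.
    policy = {}
    for s in states:
        policy[s] = max(
            actions,
            key=lambda a: sum(
                p * (r + gamma * V[s_next]) for p, s_next, r in P[s][a]
            ),
        )
    return V, policy
```

The stopping rule in this sketch is the standard convergence test: the sweep loop ends when `delta`, the largest absolute change of any state's value during a sweep, drops below `theta`, after which the greedy policy is read off the converged value function.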

