Introduction to Reinforcement Learning

Course Content

1. RL Core Theory
2. Multi-Armed Bandit Problem
3. Dynamic Programming
4. Monte Carlo Methods
5. Temporal Difference Learning

Value Iteration

While policy iteration is an effective approach for solving MDPs, it has a significant drawback: each iteration includes a separate policy evaluation step. When policy evaluation is performed iteratively, it requires multiple sweeps over the entire state space, which adds considerable computational overhead.

A good alternative is value iteration, a method that merges policy evaluation and policy improvement into a single step. This method updates the value function directly until it converges to the optimal value function. Once convergence is achieved, the optimal policy can be derived directly from this optimal value function.

How Does It Work?

Value iteration works by performing only a single sweep of policy evaluation (one backup per state) before each policy improvement step. This results in the following update formula:

$$
v_{k+1}(s) \gets \max_a \sum_{s',r} p(s',r \mid s,a)\Bigl(r+\gamma v_k(s')\Bigr) \qquad \forall s \in S
$$

By turning the Bellman optimality equation into an update rule, policy evaluation and policy improvement are merged into a single step.
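To make the update concrete, here is a minimal Python sketch of a single backup. The representation is an assumption for illustration, not something given in the lesson: the dynamics p(s', r | s, a) are stored as a dictionary p[(s, a)] of (probability, next state, reward) tuples, and the value function v is a dictionary keyed by state.

```python
# Minimal sketch of one value-iteration backup for a single state s.
# Assumed (illustrative) representation: p[(s, a)] is a list of
# (probability, next_state, reward) tuples; v maps states to values.

def backup(s, actions, p, v, gamma=0.9):
    """Compute max_a sum_{s',r} p(s',r|s,a) * (r + gamma * v[s'])."""
    return max(
        sum(prob * (reward + gamma * v[s_next])
            for prob, s_next, reward in p[(s, a)])
        for a in actions
    )
```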

Pseudocode
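As a sketch of how the complete procedure might be implemented in Python, the following uses the same assumed dictionary-based model representation as above; the convergence threshold theta is also an illustrative assumption, not part of the course's pseudocode.

```python
# Hedged sketch of value iteration, assuming p[(s, a)] is a list of
# (probability, next_state, reward) tuples. `theta` is an illustrative
# stopping threshold.

def value_iteration(states, actions, p, gamma=0.9, theta=1e-6):
    """Sweep the Bellman optimality backup over all states until the
    value function stops changing, then extract a greedy policy."""
    v = {s: 0.0 for s in states}  # arbitrary initialization

    while True:
        delta = 0.0
        for s in states:
            old = v[s]
            # Bellman optimality backup for state s
            v[s] = max(
                sum(prob * (r + gamma * v[s2])
                    for prob, s2, r in p[(s, a)])
                for a in actions
            )
            delta = max(delta, abs(old - v[s]))
        if delta < theta:  # values have (approximately) converged
            break

    # Optimal policy: act greedily with respect to the optimal values
    policy = {}
    for s in states:
        policy[s] = max(
            actions,
            key=lambda a: sum(prob * (r + gamma * v[s2])
                              for prob, s2, r in p[(s, a)]),
        )
    return v, policy
```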
