Value Iteration | Dynamic Programming

Introduction to Reinforcement Learning with Python

While policy iteration is an effective approach to solving MDPs, it has a significant drawback: every iteration includes a separate policy evaluation step. Because policy evaluation is itself iterative, it requires multiple sweeps over the entire state space, adding considerable computational overhead.

A good alternative is value iteration, a method that merges policy evaluation and policy improvement into a single step. This method updates the value function directly until it converges to the optimal value function. Once convergence is achieved, the optimal policy can be derived directly from this optimal value function.

How Does it Work?

Value iteration works by performing only a single backup during policy evaluation before moving on to policy improvement. This results in the following update formula:

$$v_{k+1}(s) \gets \max_a \sum_{s',r} p(s',r|s,a)\Bigl(r+\gamma v_k(s')\Bigr) \qquad \forall s \in S$$

By turning the Bellman optimality equation into an update rule, policy evaluation and policy improvement are merged into a single step.
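To make the update concrete, here is a minimal sketch of one such backup sweep in Python. The 2-state, 2-action MDP below is a hypothetical example chosen for illustration (it is not from this course); transitions are stored as `(probability, next_state, reward)` tuples.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP for illustration only.
# P[s][a] is a list of (probability, next_state, reward) transitions.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 1, 0.0)], 1: [(1.0, 0, 2.0)]},
}
gamma = 0.9  # discount factor

def bellman_backup(v, P, gamma):
    """One sweep of v_{k+1}(s) = max_a sum_{s',r} p(s',r|s,a) (r + gamma * v_k(s'))."""
    v_new = np.zeros_like(v)
    for s in P:
        v_new[s] = max(
            sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
            for a in P[s]
        )
    return v_new

v = np.zeros(2)
v = bellman_backup(v, P, gamma)  # starting from v_0 = 0: v becomes [0.8, 2.0]
```

Note that each sweep both evaluates and improves in one shot: the `max` over actions replaces the separate greedy policy-improvement step of policy iteration.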

Pseudocode
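The pseudocode itself does not survive page extraction, so here is a Python sketch of the standard value iteration loop, again on a small hypothetical MDP (the transition table, `theta`, and `gamma` values are illustrative assumptions, not the course's example). Sweeps repeat until the largest single-state change falls below a threshold `theta`, after which the greedy policy is extracted from the converged values.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative; not the course's example).
# P[s][a] is a list of (probability, next_state, reward) transitions.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 1, 0.0)], 1: [(1.0, 0, 2.0)]},
}

def q_value(v, s, a, gamma):
    """Expected return of taking action a in state s, then following v."""
    return sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])

def value_iteration(P, gamma=0.9, theta=1e-8):
    v = np.zeros(len(P))
    while True:
        delta = 0.0
        for s in P:  # in-place sweep over all states
            old = v[s]
            v[s] = max(q_value(v, s, a, gamma) for a in P[s])
            delta = max(delta, abs(old - v[s]))
        if delta < theta:  # stop once the largest update is below theta
            break
    # Derive the optimal policy greedily from the converged value function.
    policy = {s: max(P[s], key=lambda a: q_value(v, s, a, gamma)) for s in P}
    return v, policy

v, policy = value_iteration(P)
```

The stopping rule is the key detail: iteration ends when `delta`, the largest change across the state space in one sweep, drops below `theta`, at which point the value function has (approximately) converged to the optimal one.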


Section 3.8
