Value Iteration
While policy iteration is an effective approach for solving MDPs, it has a significant drawback: each iteration involves a separate policy evaluation step. When policy evaluation is performed iteratively, it requires multiple sweeps over the entire state space, leading to considerable computational overhead and longer computation times.
A good alternative is value iteration, a method that merges policy evaluation and policy improvement into a single step. This method updates the value function directly until it converges to the optimal value function. Once convergence is achieved, the optimal policy can be derived directly from this optimal value function.
How It Works
Value iteration works by performing only a single backup during policy evaluation before moving on to policy improvement. This results in the following update formula:
$$v_{k+1}(s) \leftarrow \max_a \sum_{s', r} p(s', r \mid s, a)\bigl(r + \gamma v_k(s')\bigr) \quad \forall s \in \mathcal{S}$$

By turning the Bellman optimality equation into an update rule, policy evaluation and policy improvement are merged into a single step.
Pseudocode
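The update rule above can be sketched in code. The following is a minimal illustration, not a definitive implementation: the tiny three-state MDP (states, actions, rewards, and the names `transitions`, `GAMMA`, `THETA`) is a hypothetical example chosen for clarity, with deterministic transitions so that $p(s', r \mid s, a) = 1$.

```python
import numpy as np

# Hypothetical MDP for illustration: states 0..2, state 2 terminal.
# Actions: 0 = "left", 1 = "right". Transitions are deterministic,
# so each sum over (s', r) collapses to a single term.
N_STATES = 3
GAMMA, THETA = 0.9, 1e-8  # discount factor, convergence threshold

# transitions[s][a] = (next_state, reward)
transitions = {
    0: {0: (0, 0.0), 1: (1, 0.0)},
    1: {0: (0, 0.0), 1: (2, 1.0)},
    2: {0: (2, 0.0), 1: (2, 0.0)},  # terminal: self-loop, zero reward
}

def value_iteration():
    v = np.zeros(N_STATES)
    while True:
        delta = 0.0
        for s in range(N_STATES):
            # Bellman optimality backup: max over actions of r + gamma * v(s')
            best = max(r + GAMMA * v[s2]
                       for s2, r in transitions[s].values())
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < THETA:  # stop once the value function has converged
            break
    # Derive the optimal policy greedily from the converged value function
    policy = [
        max(transitions[s],
            key=lambda a: transitions[s][a][1] + GAMMA * v[transitions[s][a][0]])
        for s in range(N_STATES)
    ]
    return v, policy

v, policy = value_iteration()
print(v, policy)  # → [0.9 1.  0. ] [1, 1, 0]
```

Note that each sweep applies a single backup per state and immediately reuses the updated values; there is no separate inner policy-evaluation loop, which is exactly what distinguishes value iteration from policy iteration.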