Value Iteration | Dynamic Programming

Introduction to Reinforcement Learning with Python

While policy iteration is an effective approach to solving MDPs, it has a significant drawback: every iteration includes a separate policy evaluation step. Because policy evaluation is itself iterative, it requires multiple sweeps over the entire state space, adding considerable computational overhead.

A good alternative is value iteration, a method that merges policy evaluation and policy improvement into a single step. This method updates the value function directly until it converges to the optimal value function. Once convergence is achieved, the optimal policy can be derived directly from this optimal value function.

How Does it Work?

Value iteration works by performing only a single backup during policy evaluation before moving on to policy improvement. This results in the following update formula:

$$v_{k+1}(s) \gets \max_a \sum_{s',r} p(s',r|s,a)\Bigl(r+\gamma v_k(s')\Bigr) \qquad \forall s \in S$$

By turning the Bellman optimality equation into an update rule, policy evaluation and policy improvement are merged into a single step.
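To make the update concrete, here is a minimal sketch of one such backup sweep in Python. The 2-state, 2-action MDP below is a hypothetical example chosen for illustration (it is not from this course); transitions are stored as `(probability, next_state, reward)` tuples.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP for illustration only.
# P[s][a] is a list of (probability, next_state, reward) transitions.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 1, 0.0)], 1: [(1.0, 0, 2.0)]},
}
gamma = 0.9  # discount factor

def bellman_backup(v, P, gamma):
    """One sweep of v_{k+1}(s) = max_a sum_{s',r} p(s',r|s,a) (r + gamma * v_k(s'))."""
    v_new = np.zeros_like(v)
    for s in P:
        v_new[s] = max(
            sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
            for a in P[s]
        )
    return v_new

v = np.zeros(2)
v = bellman_backup(v, P, gamma)  # starting from v_0 = 0: v becomes [0.8, 2.0]
```

Note that each sweep both evaluates and improves in one shot: the `max` over actions replaces the separate greedy policy-improvement step of policy iteration.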

Pseudocode
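The pseudocode itself does not survive page extraction, so here is a Python sketch of the standard value iteration loop, again on a small hypothetical MDP (the transition table, `theta`, and `gamma` values are illustrative assumptions, not the course's example). Sweeps repeat until the largest single-state change falls below a threshold `theta`, after which the greedy policy is extracted from the converged values.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative; not the course's example).
# P[s][a] is a list of (probability, next_state, reward) transitions.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 1, 0.0)], 1: [(1.0, 0, 2.0)]},
}

def q_value(v, s, a, gamma):
    """Expected return of taking action a in state s, then following v."""
    return sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])

def value_iteration(P, gamma=0.9, theta=1e-8):
    v = np.zeros(len(P))
    while True:
        delta = 0.0
        for s in P:  # in-place sweep over all states
            old = v[s]
            v[s] = max(q_value(v, s, a, gamma) for a in P[s])
            delta = max(delta, abs(old - v[s]))
        if delta < theta:  # stop once the largest update is below theta
            break
    # Derive the optimal policy greedily from the converged value function.
    policy = {s: max(P[s], key=lambda a: q_value(v, s, a, gamma)) for s in P}
    return v, policy

v, policy = value_iteration(P)
```

The stopping rule is the key detail: iteration ends when `delta`, the largest change across the state space in one sweep, drops below `theta`, at which point the value function has (approximately) converged to the optimal one.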


Section 3.8
