Summary  
This chapter explains the policy iteration algorithm, which alternates between fully evaluating a given policy’s value function and greedily improving the policy based on that value until convergence.

General domain of usage  
Reinforcement learning

The idea behind **policy iteration** is simple:
1. Take some initial $$\pi$$ and $$v$$;
2. Use policy evaluation to update $$v$$ until it's consistent with $$\pi$$;
3. Use policy improvement to update $$\pi$$ until it's greedy with respect to $$v$$;
4. Repeat steps 2-3 until convergence.

In this method, there are **no partial updates**:
- During **policy evaluation**, values are updated for each state, until they are consistent with current policy;
- During **policy improvement**, policy is made greedy with respect to value function.

Based on the pseudocode, what condition causes the outer loop of policy iteration to stop?

Reinforcement Learning (RL) is a powerful branch of machine learning focused on training intelligent agents through interaction with their environment. In this course, you'll learn how agents gradually discover effective behaviors through trial and error. Beginning with core concepts like Markov decision processes and multi-armed bandits, you'll work your way through dynamic programming, Monte Carlo methods, and temporal difference learning.

Discover how to train agents to make optimal decisions through trial and error. Explore the essentials of reinforcement learning theory. Get hands-on experience setting up and running a Gymnasium environment.

Master the exploration-exploitation trade-off through the multi-armed bandit problem. Implement action-value estimation, ε-greedy, upper confidence bound, and gradient-bandit methods. Evaluate algorithms' performance on simulated reward-maximization tasks.

Master dynamic programming for model-based RL. Discover how Bellman equations can be used to evaluate and improve policies. Implement policy and value iteration algorithms. Explore generalized policy iteration as the theoretical foundation for model-free methods.

Master Monte Carlo methods for model-free RL. Estimate value functions and derive optimal policies from complete episodes. Implement on-policy and off-policy Monte Carlo control algorithms. Discover exploration strategies to optimize model-free learning.

Master temporal difference learning for model-free RL. Estimate value functions from partial episodes using TD(0) updates. Implement on-policy SARSA and off-policy Q-Learning algorithms. Discover how Monte Carlo methods and TD learning combine in n-step TD and TD(λ).

Policy Iteration

Pseudocode