Introduction to Reinforcement Learning
Policy Evaluation
As you know, the state value function $v_\pi$ of a given policy $\pi$ can be determined by solving the Bellman equation:

$$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\Bigl[r + \gamma v_\pi(s')\Bigr]$$
If you have a complete model of the environment (i.e., known transition probabilities and expected rewards for all state-action pairs), the only unknown variables remaining in the equation are the state values. Therefore, the equation above can be reformulated as a system of $|S|$ linear equations with $|S|$ unknowns, where $|S|$ is the number of states.
A unique solution to this linear system is guaranteed if at least one of the following conditions holds:
- The discount factor satisfies $\gamma < 1$;
- The policy $\pi$, when followed from any state $s$, ensures that the episode eventually terminates.
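To make the linear-system view concrete, here is a minimal NumPy sketch, assuming a small hypothetical MDP in which the policy's transition matrix `P_pi` and expected-reward vector `r_pi` are known (both the names and the numbers are illustrative):

```python
import numpy as np

# Hypothetical 3-state MDP under a fixed policy pi (illustrative numbers).
# P_pi[s, s_next] is the probability of moving from s to s_next under pi,
# and r_pi[s] is the expected immediate reward in state s under pi.
P_pi = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.0, 0.0, 1.0],   # state 2 is absorbing
])
r_pi = np.array([1.0, 0.5, 0.0])
gamma = 0.9  # discount factor; gamma < 1 guarantees a unique solution

# Bellman equation in matrix form: v = r_pi + gamma * P_pi @ v.
# Rearranging gives (I - gamma * P_pi) v = r_pi, a standard linear system.
v_pi = np.linalg.solve(np.eye(len(r_pi)) - gamma * P_pi, r_pi)
print(v_pi)
```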
Iterative Policy Evaluation
The solution can be computed directly, but an iterative approach is more commonly used due to its ease of implementation. This method begins by assigning arbitrary initial values to all states, except for terminal states, which are set to 0. The values are then updated iteratively, using the Bellman equation as the update rule:

$$v_{k+1}(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\Bigl[r + \gamma v_k(s')\Bigr]$$
The estimated state value function $v_k$ eventually converges to the true state value function $v_\pi$ as $k \to \infty$, if $v_\pi$ exists.
Value Backup Strategies
When updating value estimates, new estimates are computed based on previous values. The process of preserving previous estimates is known as a backup. There are two common strategies for performing backups:
- Full backup: this method involves storing the new estimates in a separate array, distinct from the one containing the previous (backed-up) values. Consequently, two arrays are required: one for maintaining the previous estimates and another for storing the newly computed values;
- In-place backup: this approach maintains all values within a single array. Each new estimate immediately replaces the previous value. This method reduces memory usage, as only one array is needed.
Typically, the in-place backup method is preferred because it requires less memory and converges more rapidly, due to the immediate use of the latest estimates.
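The difference between the two strategies can be seen in a short sketch, assuming the same matrix form of the Bellman update as in the NumPy example above (the arrays `v`, `P_pi`, and `r_pi` are hypothetical inputs):

```python
import numpy as np

def sweep_full_backup(v, P_pi, r_pi, gamma):
    # Full backup: new estimates are written to a separate array, so every
    # update in this sweep uses only values from the previous sweep.
    v_new = np.empty_like(v)
    for s in range(len(v)):
        v_new[s] = r_pi[s] + gamma * P_pi[s] @ v
    return v_new

def sweep_in_place_backup(v, P_pi, r_pi, gamma):
    # In-place backup: each new estimate overwrites the old one immediately,
    # so later updates within the same sweep already see the latest values.
    for s in range(len(v)):
        v[s] = r_pi[s] + gamma * P_pi[s] @ v
    return v
```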
When to stop updating?
In iterative policy evaluation, there is no exact point at which the algorithm should stop. While convergence is guaranteed in the limit, continuing computations beyond a certain point is unnecessary in practice. A simple and effective stopping criterion is to track the absolute difference between consecutive value estimates, $|v_{k+1}(s) - v_k(s)|$, and compare it to a small threshold $\theta$. If, after a full update cycle (where values for all states are updated), no changes exceed $\theta$, the process can be safely terminated.
Pseudocode
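A minimal Python sketch of iterative policy evaluation with in-place backups and the $\theta$-based stopping rule described above. The interfaces `policy[s][a]` and `transitions[s][a]` are illustrative assumptions rather than a fixed API:

```python
def iterative_policy_evaluation(states, actions, policy, transitions,
                                gamma=0.9, theta=1e-6):
    """Evaluate a fixed policy with in-place backups.

    Assumed (illustrative) interfaces:
      policy[s][a]      -> probability of taking action a in state s
      transitions[s][a] -> list of (prob, next_state, reward, done) tuples
    """
    # Arbitrary initial values; terminal states effectively stay at 0
    # because 'done' transitions contribute no future value below.
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0  # largest change observed during this sweep
        for s in states:
            v_old = V[s]
            # Bellman expectation update for state s under the policy.
            V[s] = sum(
                policy[s][a] * sum(
                    prob * (reward + gamma * (0.0 if done else V[s_next]))
                    for prob, s_next, reward, done in transitions[s][a]
                )
                for a in actions
            )
            delta = max(delta, abs(V[s] - v_old))
        if delta < theta:  # no state changed by more than theta: stop
            return V
```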