Introduction to Reinforcement Learning
Policy Evaluation
As you know, the state value function $v_\pi$ of a given policy $\pi$ can be determined by solving the Bellman equation:

$$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\Bigl[r + \gamma v_\pi(s')\Bigr]$$
If you have a complete model of the environment (i.e., known transition probabilities and expected rewards for all state-action pairs), the only unknown variables remaining in the equation are the state values. Therefore, the equation above can be reformulated as a system of $|S|$ linear equations with $|S|$ unknowns, where $|S|$ is the number of states.
A unique solution to this linear system is guaranteed if at least one of the following conditions holds:
- The discount factor satisfies $\gamma < 1$;
- The policy $\pi$, when followed from any state $s$, ensures that the episode eventually terminates.
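To make the linear-system view concrete, here is a minimal NumPy sketch, assuming a small hypothetical MDP in which the policy's transition matrix `P_pi` and expected-reward vector `r_pi` are known (both the names and the numbers are illustrative):

```python
import numpy as np

# Hypothetical 3-state MDP under a fixed policy pi (illustrative numbers).
# P_pi[s, s_next] is the probability of moving from s to s_next under pi,
# and r_pi[s] is the expected immediate reward in state s under pi.
P_pi = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.0, 0.0, 1.0],   # state 2 is absorbing
])
r_pi = np.array([1.0, 0.5, 0.0])
gamma = 0.9  # discount factor; gamma < 1 guarantees a unique solution

# Bellman equation in matrix form: v = r_pi + gamma * P_pi @ v.
# Rearranging gives (I - gamma * P_pi) v = r_pi, a standard linear system.
v_pi = np.linalg.solve(np.eye(len(r_pi)) - gamma * P_pi, r_pi)
print(v_pi)
```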
Iterative Policy Evaluation
The solution can be computed directly, but an iterative approach is more commonly used due to its ease of implementation. This method begins by assigning arbitrary initial values to all states, except for terminal states, which are set to 0. The values are then updated iteratively, using the Bellman equation as the update rule:

$$v_{k+1}(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\Bigl[r + \gamma v_k(s')\Bigr]$$
The estimated state value function $v_k$ eventually converges to the true state value function $v_\pi$ as $k \to \infty$, if $v_\pi$ exists.
Value Backup Strategies
When updating value estimates, new estimates are computed based on previous values. The process of preserving previous estimates is known as a backup. There are two common strategies for performing backups:
- Full backup: this method involves storing the new estimates in a separate array, distinct from the one containing the previous (backed-up) values. Consequently, two arrays are required: one for maintaining the previous estimates and another for storing the newly computed values;
- In-place backup: this approach maintains all values within a single array. Each new estimate immediately replaces the previous value. This method reduces memory usage, as only one array is needed.
Typically, the in-place backup method is preferred because it requires less memory and converges more rapidly, due to the immediate use of the latest estimates.
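The difference between the two strategies can be seen in a short sketch, assuming the same matrix form of the Bellman update as in the NumPy example above (the arrays `v`, `P_pi`, and `r_pi` are hypothetical inputs):

```python
import numpy as np

def sweep_full_backup(v, P_pi, r_pi, gamma):
    # Full backup: new estimates are written to a separate array, so every
    # update in this sweep uses only values from the previous sweep.
    v_new = np.empty_like(v)
    for s in range(len(v)):
        v_new[s] = r_pi[s] + gamma * P_pi[s] @ v
    return v_new

def sweep_in_place_backup(v, P_pi, r_pi, gamma):
    # In-place backup: each new estimate overwrites the old one immediately,
    # so later updates within the same sweep already see the latest values.
    for s in range(len(v)):
        v[s] = r_pi[s] + gamma * P_pi[s] @ v
    return v
```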
When to stop updating?
In iterative policy evaluation, there is no exact point at which the algorithm should stop. While convergence is guaranteed in the limit, continuing computations beyond a certain point is unnecessary in practice. A simple and effective stopping criterion is to track the absolute difference between consecutive value estimates, $|v_{k+1}(s) - v_k(s)|$, and compare it to a small threshold $\theta$. If, after a full update cycle (where values for all states are updated), no changes exceed $\theta$, the process can be safely terminated.
Pseudocode
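A minimal Python sketch of iterative policy evaluation with in-place backups and the $\theta$-based stopping rule described above. The interfaces `policy[s][a]` and `transitions[s][a]` are illustrative assumptions rather than a fixed API:

```python
def iterative_policy_evaluation(states, actions, policy, transitions,
                                gamma=0.9, theta=1e-6):
    """Evaluate a fixed policy with in-place backups.

    Assumed (illustrative) interfaces:
      policy[s][a]      -> probability of taking action a in state s
      transitions[s][a] -> list of (prob, next_state, reward, done) tuples
    """
    # Arbitrary initial values; terminal states effectively stay at 0
    # because 'done' transitions contribute no future value below.
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0  # largest change observed during this sweep
        for s in states:
            v_old = V[s]
            # Bellman expectation update for state s under the policy.
            V[s] = sum(
                policy[s][a] * sum(
                    prob * (reward + gamma * (0.0 if done else V[s_next]))
                    for prob, s_next, reward, done in transitions[s][a]
                )
                for a in actions
            )
            delta = max(delta, abs(V[s] - v_old))
        if delta < theta:  # no state changed by more than theta: stop
            return V
```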