SARSA: On-Policy TD Learning

Just like with Monte Carlo methods, we can follow the generalized policy iteration (GPI) framework to move from estimating value functions to learning optimal policies. This again raises a familiar challenge: the exploration-exploitation tradeoff. And just as before, there are two approaches we can use: on-policy and off-policy.

What is SARSA?

SARSA is an on-policy TD control algorithm used to estimate the action value function $q_\pi(s, a)$. It updates its estimates based on the action actually taken, making it an on-policy algorithm.

The acronym SARSA comes from the five key components used in the update:

  • S: current state $S_t$;
  • A: action taken $A_t$;
  • R: reward received $R_{t+1}$;
  • S: next state $S_{t+1}$;
  • A: next action $A_{t+1}$.

Update Rule

The update rule is similar to that of TD(0), only replacing the state value function with the action value function:

$$Q(S_t, A_t) \gets Q(S_t, A_t) + \alpha \Bigl(R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)\Bigr)$$

Here $A_{t+1}$ is the action that will actually be taken on the next step, and it is selected according to the current policy. This means that the effects of exploration are incorporated into the learning process.

After each update of the action value function, the policy is updated as well, allowing the agent to immediately use the new estimates.
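As a concrete illustration (the numbers below are made up for the example), suppose $\alpha = 0.1$, $\gamma = 0.9$, $Q(S_t, A_t) = 2.0$, $R_{t+1} = 1$, and $Q(S_{t+1}, A_{t+1}) = 3.0$. The update then gives:

$$Q(S_t, A_t) \gets 2.0 + 0.1\bigl(1 + 0.9 \cdot 3.0 - 2.0\bigr) = 2.0 + 0.1 \cdot 1.7 = 2.17$$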

Pseudocode
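The loop below is a minimal Python sketch of tabular SARSA control, assuming a Gymnasium-style environment with discrete state and action spaces; the environment name, hyperparameters, and the `epsilon_greedy` helper are illustrative placeholders rather than part of this course's reference implementation.

```python
import numpy as np
import gymnasium as gym

def epsilon_greedy(Q, state, n_actions, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise a greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[state]))

def sarsa(env, episodes=5000, alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n_states = env.observation_space.n
    n_actions = env.action_space.n
    Q = np.zeros((n_states, n_actions))

    for _ in range(episodes):
        state, _ = env.reset()
        action = epsilon_greedy(Q, state, n_actions, epsilon, rng)
        done = False
        while not done:
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            # Choose the next action with the same behavior policy (on-policy).
            next_action = epsilon_greedy(Q, next_state, n_actions, epsilon, rng)
            # SARSA update: bootstrap from the action that will actually be taken.
            td_target = reward + gamma * Q[next_state, next_action] * (not terminated)
            Q[state, action] += alpha * (td_target - Q[state, action])
            state, action = next_state, next_action
    return Q

# Example usage on a small stochastic gridworld:
# Q = sarsa(gym.make("FrozenLake-v1", is_slippery=True))
```

Because actions are always selected ε-greedily with respect to the current Q table, each update to Q immediately changes the policy the agent acts with, which is exactly the on-policy behavior described above.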

When to Use SARSA?

SARSA is preferable when:

  • You are dealing with environments with high stochasticity (e.g., slippery surfaces, unreliable transitions);
  • You're okay with slower convergence in exchange for safer behavior during learning (since SARSA's value estimates account for the exploratory actions it actually takes).