SARSA: On-Policy TD Learning

Just like with Monte Carlo methods, we can follow the generalized policy iteration (GPI) framework to move from estimating value functions to learning optimal policies. This again raises a familiar challenge: the exploration-exploitation tradeoff. And just as before, there are two approaches we can use: on-policy and off-policy.

What is SARSA?

SARSA is an on-policy TD control algorithm used to estimate the action value function $q_\pi(s, a)$. It updates its estimates based on the action actually taken, making it an on-policy algorithm.

The acronym SARSA comes from the five key components used in the update:

  • S: current state $S_t$;
  • A: action taken $A_t$;
  • R: reward received $R_{t+1}$;
  • S: next state $S_{t+1}$;
  • A: next action $A_{t+1}$.

Update Rule

The update rule is similar to that of TD(0), only replacing the state value function with the action value function:

$$Q(S_t, A_t) \gets Q(S_t, A_t) + \alpha \Bigl(R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)\Bigr)$$

Here $A_{t+1}$ is the action that will actually be taken on the next step, and it is selected according to the current policy. This means that the effects of exploration are incorporated into the learning process.

After each update of the action value function, the policy is updated as well, allowing the agent to immediately use the new estimates.
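As a concrete illustration (the numbers below are made up for the example), suppose $\alpha = 0.1$, $\gamma = 0.9$, $Q(S_t, A_t) = 2.0$, $R_{t+1} = 1$, and $Q(S_{t+1}, A_{t+1}) = 3.0$. The update then gives:

$$Q(S_t, A_t) \gets 2.0 + 0.1\bigl(1 + 0.9 \cdot 3.0 - 2.0\bigr) = 2.0 + 0.1 \cdot 1.7 = 2.17$$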

Pseudocode
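The loop below is a minimal Python sketch of tabular SARSA control, assuming a Gymnasium-style environment with discrete state and action spaces; the environment name, hyperparameters, and the `epsilon_greedy` helper are illustrative placeholders rather than part of this course's reference implementation.

```python
import numpy as np
import gymnasium as gym

def epsilon_greedy(Q, state, n_actions, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise a greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[state]))

def sarsa(env, episodes=5000, alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n_states = env.observation_space.n
    n_actions = env.action_space.n
    Q = np.zeros((n_states, n_actions))

    for _ in range(episodes):
        state, _ = env.reset()
        action = epsilon_greedy(Q, state, n_actions, epsilon, rng)
        done = False
        while not done:
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            # Choose the next action with the same behavior policy (on-policy).
            next_action = epsilon_greedy(Q, next_state, n_actions, epsilon, rng)
            # SARSA update: bootstrap from the action that will actually be taken.
            td_target = reward + gamma * Q[next_state, next_action] * (not terminated)
            Q[state, action] += alpha * (td_target - Q[state, action])
            state, action = next_state, next_action
    return Q

# Example usage on a small stochastic gridworld:
# Q = sarsa(gym.make("FrozenLake-v1", is_slippery=True))
```

Because actions are always selected ε-greedily with respect to the current Q table, each update to Q immediately changes the policy the agent acts with, which is exactly the on-policy behavior described above.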

When to Use SARSA?

SARSA is preferable when:

  • You are dealing with environments with high stochasticity (e.g., slippery surfaces, unreliable transitions);
  • You're okay with slower convergence in exchange for safer behavior during learning (since SARSA's value estimates account for the exploratory actions it actually takes).