SARSA: On-Policy TD Learning
Just like with Monte Carlo methods, we can follow the generalized policy iteration (GPI) framework to move from estimating value functions to learning optimal policies. However, this process introduces a familiar challenge: the exploration-exploitation tradeoff. As before, there are two approaches we can use: on-policy and off-policy.
What is SARSA?
SARSA is an on-policy TD control algorithm used to estimate the action value function $Q(s, a)$. It updates its estimates based on the action actually taken, making it an on-policy algorithm.
The acronym SARSA comes from the five key components used in the update:
- S: current state $S_t$;
- A: action taken $A_t$;
- R: reward received $R_{t+1}$;
- S: next state $S_{t+1}$;
- A: next action $A_{t+1}$.
Update Rule
The update rule is similar to that of TD(0), only replacing the state value function with the action value function:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \Bigl[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \Bigr]$$

Here, $A_{t+1}$ is the action that will actually be taken during the next step, selected according to the current policy. This means that the effects of exploration are incorporated into the learning process.
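As a quick illustration (with made-up numbers, purely for intuition): suppose $\alpha = 0.5$, $\gamma = 0.9$, $Q(S_t, A_t) = 2$, the reward is $R_{t+1} = 1$, and the next state-action value is $Q(S_{t+1}, A_{t+1}) = 3$. The TD target is $1 + 0.9 \cdot 3 = 3.7$, so the update gives $Q(S_t, A_t) \leftarrow 2 + 0.5 \cdot (3.7 - 2) = 2.85$.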
After each update of the action value function, the policy is updated as well, allowing the agent to immediately use the new estimates.
Pseudocode
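Below is a minimal sketch of tabular SARSA in Python, assuming a Gymnasium-style environment interface (`env.reset()`, `env.step()`, a discrete `env.action_space`) and ε-greedy action selection; the function names and hyperparameter values here are illustrative, not part of the original lesson.

```python
import numpy as np
from collections import defaultdict


def epsilon_greedy(Q, state, n_actions, epsilon):
    # With probability epsilon explore a random action, otherwise act greedily
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[state]))


def sarsa(env, n_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    # Q-table: maps each state to an array of action values, initialized to zero
    Q = defaultdict(lambda: np.zeros(env.action_space.n))

    for _ in range(n_episodes):
        state, _ = env.reset()
        action = epsilon_greedy(Q, state, env.action_space.n, epsilon)

        done = False
        while not done:
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            # Select the next action with the same epsilon-greedy policy (on-policy)
            next_action = epsilon_greedy(Q, next_state, env.action_space.n, epsilon)

            # SARSA update: bootstrap from the action actually taken next;
            # no bootstrapping past a terminal state
            td_target = reward + gamma * Q[next_state][next_action] * (not terminated)
            Q[state][action] += alpha * (td_target - Q[state][action])

            state, action = next_state, next_action

    return Q
```

For example, running this on Gymnasium's `CliffWalking-v0` environment (`sarsa(gymnasium.make("CliffWalking-v0"))`) yields action values that reflect the exploratory ε-greedy behavior during learning.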
When to Use SARSA?
SARSA is preferable when:
- You are dealing with environments with high stochasticity (e.g., slippery surfaces, unreliable transitions);
- You're okay with slower convergence in exchange for safer behavior during learning.