Q-Learning: Off-Policy TD Learning
Learning an optimal policy with SARSA can be challenging. Similar to on-policy Monte Carlo control, it typically requires a gradual decay of ε over time, eventually approaching zero to shift from exploration to exploitation. This process is often slow and may demand extensive training time. An alternative is to use an off-policy method like Q-learning.
Q-learning is an off-policy TD control algorithm used to estimate the optimal action value function q*(s, a). It updates its estimates using the value of the best next action rather than the action the agent actually takes, which is what makes it off-policy.
Update Rule
Unlike in off-policy Monte Carlo control, Q-learning does not require importance sampling to correct for differences between behavior and target policies. Instead, it relies on a direct update rule that closely resembles SARSA, but with a key difference.
The Q-learning update rule is:
Q(St, At) ← Q(St, At) + α(Rt+1 + γ max_a Q(St+1, a) − Q(St, At))

The only difference from SARSA is in the target value. Instead of using the value of the next action actually taken, as SARSA does:

γ Q(St+1, At+1)

Q-learning uses the value of the best possible next action:

γ max_a Q(St+1, a)

This subtle change has a big impact: it allows Q-learning to evaluate actions using an estimate of the optimal policy, even while the agent is still exploring. That's what makes it an off-policy method — it learns about the greedy policy, regardless of the actions chosen during training.
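The difference between the two targets is easy to see side by side in code. Below is a minimal sketch (the state/action counts, learning rate, and discount are illustrative choices, not from the text): the Q-learning update maximizes over next actions, while the SARSA update uses the next action actually taken.

```python
import numpy as np

# Illustrative setup: 5 states, 2 actions, with assumed hyperparameters.
n_states, n_actions = 5, 2
alpha, gamma = 0.1, 0.99
Q = np.zeros((n_states, n_actions))

def q_learning_update(Q, s, a, r, s_next):
    """Q-learning: the target uses the best possible next action (greedy),
    regardless of what the behavior policy does next."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next):
    """SARSA: the target uses the next action actually taken."""
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])

# One Q-learning step: target = 1.0 + 0.99 * max(Q[2]) = 1.0,
# so Q[0, 1] moves from 0 toward 1.0 by a factor of alpha.
q_learning_update(Q, s=0, a=1, r=1.0, s_next=2)
```

Note that the two functions differ only in the target: `np.max(Q[s_next])` versus `Q[s_next, a_next]`, exactly mirroring the equations above.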
When to Use Q-Learning?
Q-learning is preferable when:
- You are dealing with deterministic environments, or environments with little stochasticity;
- You need faster convergence.
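Putting the pieces together, here is a sketch of a full tabular Q-learning loop on a hypothetical toy environment (a deterministic 5-state chain with a reward at one end; the environment, episode count, and hyperparameters are assumptions for illustration). The agent behaves ε-greedily, yet every update bootstraps from the greedy (max) value:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy deterministic chain: states 0..4, actions 0 = left, 1 = right.
# Reaching state 4 yields reward +1 and ends the episode.
N_STATES, GOAL = 5, 4

def step(s, a):
    s_next = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == GOAL else 0.0
    return s_next, reward, s_next == GOAL

alpha, gamma, epsilon = 0.1, 0.9, 0.1
Q = np.zeros((N_STATES, 2))

for _ in range(500):
    s, done = 0, False
    while not done:
        # Behavior policy: ε-greedy (keeps exploring throughout training)...
        a = int(rng.integers(2)) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s_next, r, done = step(s, a)
        # ...but the target policy is greedy: bootstrap from max_a Q(s', a).
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next

greedy_policy = np.argmax(Q, axis=1)
```

Because the target always uses the max, no ε decay is needed here for the greedy policy to converge toward "always move right" on this chain; ε can stay fixed, which is the practical advantage over SARSA-style on-policy control mentioned above.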