Q-Learning: Off-Policy TD Learning

Learning an optimal policy with SARSA can be challenging. Similar to on-policy Monte Carlo control, it typically requires a gradual decay of $\varepsilon$ over time, eventually approaching zero to shift from exploration to exploitation. This process is often slow and can demand extensive training time. An alternative is to use an off-policy method like Q-learning.

What is Q-Learning?

Q-learning is an off-policy TD control algorithm used to estimate the optimal action value function $q_*(s, a)$.

Update Rule

Unlike in off-policy Monte Carlo control, Q-learning does not require importance sampling to correct for differences between behavior and target policies. Instead, it relies on a direct update rule that closely resembles SARSA, but with a key difference.

The Q-learning update rule is:

$$Q(S_t, A_t) \gets Q(S_t, A_t) + \alpha \Bigl(R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)\Bigr)$$
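As a minimal sketch, this update can be implemented for a tabular setting in Python. The array `Q`, the index variables, and the hyperparameter defaults below are illustrative assumptions, not part of the course material:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Apply one tabular Q-learning update to Q[s, a] in place.

    Q      : array of shape (n_states, n_actions) of action value estimates
    s, a   : indices of the current state and the action taken
    r      : reward R_{t+1} observed after taking a in s
    s_next : index of the resulting next state S_{t+1}
    """
    # The target bootstraps from the best next action,
    # not from the action the agent will actually take next
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
```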

The only difference from SARSA is in the target value. Instead of using the value of the next action actually taken, as SARSA does:

$\gamma Q(S_{t+1}, A_{t+1})$

Q-learning uses the value of the best possible next action:

$\gamma \max_a Q(S_{t+1}, a)$
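In code, the two targets differ by a single line. Assuming the same tabular `Q` array as in the sketch above, with `a_next` denoting the action the agent actually takes next:

```python
# SARSA target: value of the next action actually taken (on-policy)
sarsa_target = r + gamma * Q[s_next, a_next]

# Q-learning target: value of the best available next action (off-policy)
q_learning_target = r + gamma * np.max(Q[s_next])
```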

This subtle change has a big impact: it allows Q-learning to evaluate actions using an estimate of the optimal policy, even while the agent is still exploring. That's what makes it an off-policy method: it learns about the greedy policy, regardless of the actions chosen during training.
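To make the off-policy character concrete, here is a hedged sketch of a complete training loop. It assumes a Gymnasium-style environment with discrete states and actions (the `reset`/`step` interface and the `n` attributes are assumptions about the environment, not prescribed by the course): the agent behaves $\varepsilon$-greedily, yet the update always bootstraps from the greedy action.

```python
import numpy as np

def q_learning(env, n_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behavior policy."""
    Q = np.zeros((env.observation_space.n, env.action_space.n))

    for _ in range(n_episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # Behavior policy: epsilon-greedy over the current estimates
            if np.random.rand() < epsilon:
                a = env.action_space.sample()
            else:
                a = int(np.argmax(Q[s]))

            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated

            # Target policy: greedy; bootstrap only from non-terminal states
            best_next = 0.0 if terminated else np.max(Q[s_next])
            Q[s, a] += alpha * (r + gamma * best_next - Q[s, a])
            s = s_next
    return Q
```

Note that no importance sampling appears anywhere: because the target already evaluates the greedy policy directly, no correction between behavior and target policies is needed.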

When to Use Q-Learning?

Q-learning is preferable when:

  • You are dealing with deterministic environments, or environments with low stochasticity;
  • You need faster convergence.