Q-Learning: Off-Policy TD Learning

Learning an optimal policy with SARSA can be challenging. Similar to on-policy Monte Carlo control, it typically requires a gradual decay of $\varepsilon$ over time, eventually approaching zero to shift from exploration to exploitation. This process is often slow and may demand extensive training time. An alternative is to use an off-policy method like Q-learning.
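To make the decay requirement concrete, here is a minimal sketch of one common schedule: multiplicative decay of $\varepsilon$ toward a small floor after each episode. The starting value, decay rate, and floor below are illustrative assumptions, not values prescribed by the course.

    # Illustrative epsilon-decay schedule for an epsilon-greedy agent such as SARSA.
    # The starting value, decay rate, and floor are arbitrary example values.
    epsilon = 1.0        # start fully exploratory
    epsilon_min = 0.01   # keep a small amount of exploration
    decay_rate = 0.995   # multiplicative decay applied after each episode

    for episode in range(1000):
        # ... run one episode with epsilon-greedy action selection here ...
        epsilon = max(epsilon_min, epsilon * decay_rate)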

Definition

Q-learning is an off-policy TD control algorithm used to estimate the optimal action value function $q_*(s, a)$. Its updates bootstrap from the best action available in the next state rather than the action the agent actually takes, which is what makes it off-policy.

Update Rule

Unlike in off-policy Monte Carlo control, Q-learning does not require importance sampling to correct for differences between behavior and target policies. Instead, it relies on a direct update rule that closely resembles SARSA, but with a key difference.

The Q-learning update rule is:

Q(S_t, A_t) \gets Q(S_t, A_t) + \alpha \Bigl(R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)\Bigr)

The only difference from SARSA is in the target value. Instead of using the value of the next action actually taken, as SARSA does:

\gamma Q(S_{t+1}, A_{t+1})

Q-learning uses the value of the best possible next action:

\gamma \max_a Q(S_{t+1}, a)

This subtle change has a big impact: it allows Q-learning to evaluate actions using an estimate of the optimal policy, even while the agent is still exploring. That is what makes it an off-policy method; it learns about the greedy policy regardless of the actions chosen during training.
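As a concrete illustration, here is a minimal sketch of a single tabular Q-learning update for one observed transition; the table size, learning rate, discount, and transition values are assumed for the example.

    import numpy as np

    # Tabular action values: Q[state, action]; shape and hyperparameters are example values.
    n_states, n_actions = 5, 2
    Q = np.zeros((n_states, n_actions))
    alpha, gamma = 0.1, 0.99

    # One observed transition (S_t, A_t, R_{t+1}, S_{t+1}).
    s, a, r, s_next = 0, 1, 1.0, 2

    # Q-learning target: bootstrap from the best action in the next state (greedy target policy).
    td_target = r + gamma * np.max(Q[s_next])

    # SARSA would instead use the next action actually taken, e.g.:
    # td_target = r + gamma * Q[s_next, a_next]

    Q[s, a] += alpha * (td_target - Q[s, a])

The behavior policy only determines which transitions are observed; the max inside the target is what the algorithm learns about.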

When to Use Q-Learning?

Q-learning is preferable when:

  • You are dealing with deterministic environments;

  • You need faster convergence (a minimal end-to-end training sketch follows this list).
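To tie the pieces together, below is a minimal, self-contained training sketch on a small deterministic chain environment invented here for illustration (it is not part of the course material). The behavior policy is $\varepsilon$-greedy with random tie-breaking, while every update bootstraps from the greedy action; all hyperparameters are example values.

    import numpy as np

    # Hypothetical deterministic chain: states 0..4, actions 0 (left) and 1 (right).
    # Reaching state 4 yields reward +1 and ends the episode; every other step yields 0.
    N_STATES, N_ACTIONS, GOAL = 5, 2, 4

    def step(state, action):
        next_state = min(max(state + (1 if action == 1 else -1), 0), GOAL)
        reward = 1.0 if next_state == GOAL else 0.0
        return next_state, reward, next_state == GOAL

    def epsilon_greedy(q_row, epsilon, rng):
        # Behavior policy: random action with probability epsilon,
        # otherwise greedy with ties broken at random.
        if rng.random() < epsilon:
            return int(rng.integers(len(q_row)))
        return int(rng.choice(np.flatnonzero(q_row == q_row.max())))

    rng = np.random.default_rng(0)
    Q = np.zeros((N_STATES, N_ACTIONS))
    alpha, gamma, epsilon = 0.1, 0.99, 0.1   # example hyperparameters

    for episode in range(500):
        state, done = 0, False
        for _ in range(100):                 # cap episode length for safety
            action = epsilon_greedy(Q[state], epsilon, rng)
            next_state, reward, done = step(state, action)
            # Target policy: greedy (the max), regardless of the action chosen above.
            Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
            state = next_state
            if done:
                break

    # Greedy policy extracted from Q: action 1 (right) in every non-terminal state.
    print(np.argmax(Q, axis=1))

Even though $\varepsilon$ keeps the behavior exploratory throughout training, the printed argmax reflects the learned greedy policy, which is exactly the off-policy property described above.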

