Q-Learning: Off-Policy TD Learning
Learning an optimal policy with SARSA can be challenging. Similar to on-policy Monte Carlo control, it typically requires a gradual decay of ε over time, eventually approaching zero to shift from exploration to exploitation. This process is often slow and may demand extensive training time. An alternative is to use an off-policy method like Q-learning.
Q-learning is an off-policy TD control algorithm used to estimate the optimal action value function q*(s, a). It updates its estimates using the value of the best next action rather than the action the agent actually takes, which is what makes it off-policy.
Update Rule
Unlike in off-policy Monte Carlo control, Q-learning does not require importance sampling to correct for differences between behavior and target policies. Instead, it relies on a direct update rule that closely resembles SARSA, but with a key difference.
The Q-learning update rule is:
Q(St, At) ← Q(St, At) + α(Rt+1 + γ max_a Q(St+1, a) − Q(St, At))

The only difference from SARSA is in the target value. Instead of using the value of the next action actually taken, as SARSA does:

γ Q(St+1, At+1)

Q-learning uses the value of the best possible next action:

γ max_a Q(St+1, a)

This subtle change has a big impact: it allows Q-learning to evaluate actions using an estimate of the optimal policy, even while the agent is still exploring. That's what makes it an off-policy method — it learns about the greedy policy, regardless of the actions chosen during training.
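The difference between the two targets is easy to see side by side in code. Below is a minimal sketch (the state/action counts, learning rate, and discount are illustrative choices, not from the text): the Q-learning update maximizes over next actions, while the SARSA update uses the next action actually taken.

```python
import numpy as np

# Illustrative setup: 5 states, 2 actions, with assumed hyperparameters.
n_states, n_actions = 5, 2
alpha, gamma = 0.1, 0.99
Q = np.zeros((n_states, n_actions))

def q_learning_update(Q, s, a, r, s_next):
    """Q-learning: the target uses the best possible next action (greedy),
    regardless of what the behavior policy does next."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next):
    """SARSA: the target uses the next action actually taken."""
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])

# One Q-learning step: target = 1.0 + 0.99 * max(Q[2]) = 1.0,
# so Q[0, 1] moves from 0 toward 1.0 by a factor of alpha.
q_learning_update(Q, s=0, a=1, r=1.0, s_next=2)
```

Note that the two functions differ only in the target: `np.max(Q[s_next])` versus `Q[s_next, a_next]`, exactly mirroring the equations above.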
When to Use Q-Learning?
Q-learning is preferable when:
- You are dealing with deterministic environments, or environments with little stochasticity;
- You need faster convergence.
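Putting the pieces together, here is a sketch of a full tabular Q-learning loop on a hypothetical toy environment (a deterministic 5-state chain with a reward at one end; the environment, episode count, and hyperparameters are assumptions for illustration). The agent behaves ε-greedily, yet every update bootstraps from the greedy (max) value:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy deterministic chain: states 0..4, actions 0 = left, 1 = right.
# Reaching state 4 yields reward +1 and ends the episode.
N_STATES, GOAL = 5, 4

def step(s, a):
    s_next = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == GOAL else 0.0
    return s_next, reward, s_next == GOAL

alpha, gamma, epsilon = 0.1, 0.9, 0.1
Q = np.zeros((N_STATES, 2))

for _ in range(500):
    s, done = 0, False
    while not done:
        # Behavior policy: ε-greedy (keeps exploring throughout training)...
        a = int(rng.integers(2)) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s_next, r, done = step(s, a)
        # ...but the target policy is greedy: bootstrap from max_a Q(s', a).
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next

greedy_policy = np.argmax(Q, axis=1)
```

Because the target always uses the max, no ε decay is needed here for the greedy policy to converge toward "always move right" on this chain; ε can stay fixed, which is the practical advantage over SARSA-style on-policy control mentioned above.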