Introduction to Reinforcement Learning

Q-Learning: Off-Policy TD Learning

Learning an optimal policy with SARSA can be challenging. Similar to on-policy Monte Carlo control, it typically requires a gradual decay of $\varepsilon$ over time, eventually approaching zero to shift from exploration to exploitation. This process is often slow and may demand extensive training time. An alternative is to use an off-policy method like Q-learning.

Definition

Q-learning is an off-policy TD control algorithm used to estimate the optimal action value function $q_*(s, a)$. Each update bootstraps from the highest-valued action in the next state, regardless of which action the agent actually takes next, and this is what makes it an off-policy algorithm.

Update Rule

Unlike in off-policy Monte Carlo control, Q-learning does not require importance sampling to correct for differences between behavior and target policies. Instead, it relies on a direct update rule that closely resembles SARSA, but with a key difference.

The Q-learning update rule is:

$$Q(S_t, A_t) \gets Q(S_t, A_t) + \alpha \Bigl(R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)\Bigr)$$
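To make the rule concrete, here is a minimal sketch of a single update on a tabular `Q` stored as a NumPy array. The function name `q_learning_update`, the default values of `alpha` and `gamma`, and the tiny 2-state table are assumptions chosen for illustration. The `terminal` flag reflects the common convention that a terminal state has value zero, which the rule above does not spell out.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9, terminal=False):
    """Apply one Q-learning update to the table Q in place and return the TD error."""
    # Bootstrap from the best action in the next state.
    # Assumed convention: a terminal next state contributes a value of 0.
    best_next = 0.0 if terminal else np.max(Q[s_next])
    td_error = r + gamma * best_next - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error

# Hypothetical 2-state, 2-action table, used only for illustration.
Q = np.array([[0.0, 0.0],
              [1.0, 2.0]])
td = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)

# The target is 1.0 + 0.9 * 2.0 = 2.8, so Q[0, 1] moves 10% of the way
# from its old value 0 towards 2.8.
print(Q[0, 1], td)
```

Note that the update never asks which action the agent will take in `s_next`; it only looks at the largest entry of `Q[s_next]`.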

The only difference from SARSA is in the target value. Instead of using the value of the next action actually taken, as SARSA does:

$$\gamma Q(S_{t+1}, A_{t+1})$$

Q-learning uses the value of the best possible next action:

$$\gamma \max_a Q(S_{t+1}, a)$$

This subtle change has a big impact: it allows Q-learning to evaluate actions using an estimate of the optimal policy, even while the agent is still exploring. That's what makes it an off-policy method — it learns about the greedy policy, regardless of the actions chosen during training.
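The separation between the exploring behavior and the greedy target can be sketched on a toy problem. The corridor environment below, its `step` function, and all hyperparameters are hypothetical and exist only for this example; they are not part of the course material. The agent follows an $\varepsilon$-greedy behavior policy with a fixed $\varepsilon$ that never decays, while the update still bootstraps from the greedy action.

```python
import numpy as np

N_STATES = 5          # hypothetical corridor: states 0..4, state 4 is terminal
LEFT, RIGHT = 0, 1

def step(state, action):
    """Deterministic toy dynamics: reward 1 only on reaching the right end."""
    next_state = max(state - 1, 0) if action == LEFT else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.3, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((N_STATES, 2))
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Behavior policy: epsilon-greedy with a fixed, non-decaying epsilon.
            if rng.random() < epsilon:
                action = int(rng.integers(2))
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done = step(state, action)
            # Target policy: greedy, expressed by the max over next actions.
            best_next = 0.0 if done else np.max(Q[next_state])
            Q[state, action] += alpha * (reward + gamma * best_next - Q[state, action])
            state = next_state
    return Q

Q = train()
greedy_policy = np.argmax(Q[:-1], axis=1)   # skip the terminal state
print(greedy_policy)
```

With these settings the greedy policy read off from `Q` is expected to choose `RIGHT` in every non-terminal state, even though the agent kept taking random actions 30% of the time throughout training. Under SARSA, the same fixed $\varepsilon$ would instead yield values for the exploring policy itself.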

When to Use Q-Learning?

Q-learning is preferable when:

You are dealing with deterministic environments;
You need faster convergence.
