Introduction to Reinforcement Learning

Policy Iteration

The idea behind policy iteration is simple:

1. Take some initial $\pi$ and $v$;
2. Use policy evaluation to update $v$ until it is consistent with $\pi$;
3. Use policy improvement to update $\pi$ so that it is greedy with respect to $v$;
4. Repeat steps 2-3 until convergence, that is, until the policy no longer changes.
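
In symbols, steps 2 and 3 can be sketched as follows. This assumes the usual finite-MDP notation, which this chapter's text does not define: a transition model $p(s', r \mid s, a)$, a discount factor $\gamma$, and a deterministic policy $\pi(s)$.

$$
v(s) \leftarrow \sum_{s', r} p(s', r \mid s, \pi(s)) \bigl[ r + \gamma v(s') \bigr] \quad \text{for all } s, \text{ repeated until } v \text{ stops changing}
$$

$$
\pi(s) \leftarrow \arg\max_{a} \sum_{s', r} p(s', r \mid s, a) \bigl[ r + \gamma v(s') \bigr] \quad \text{for all } s
$$

The first update is the evaluation sweep of step 2, and the second is the greedy improvement of step 3.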

In this method, there are no partial updates:

During policy evaluation, values are updated for every state until they are consistent with the current policy;
During policy improvement, the policy is made greedy with respect to the value function.

Pseudocode
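
The loop described above can be sketched in Python as follows. This is a minimal illustration, not the course's own pseudocode: the three-state MDP `P`, the helper names (`q_value`, `policy_evaluation`, `policy_improvement`, `policy_iteration`), the discount `gamma`, and the tolerance `theta` are all assumptions made for the example.

```python
# Hypothetical toy MDP (an assumption for illustration only).
# P[s][a] is a list of (probability, next_state, reward) tuples.
# State 2 is absorbing and yields no reward.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 2, 1.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},
}


def q_value(P, v, s, a, gamma):
    """Expected return of taking action a in state s, then following v."""
    return sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])


def policy_evaluation(P, policy, v, gamma=0.9, theta=1e-8):
    """Sweep over all states until v is consistent with the policy."""
    while True:
        delta = 0.0
        for s in P:
            new_v = q_value(P, v, s, policy[s], gamma)
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v


def policy_improvement(P, policy, v, gamma=0.9):
    """Make the policy greedy w.r.t. v; return True if nothing changed."""
    stable = True
    for s in P:
        old = policy[s]
        best = max(P[s], key=lambda a: q_value(P, v, s, a, gamma))
        # Switch only on strict improvement so ties do not cause flipping.
        if q_value(P, v, s, best, gamma) > q_value(P, v, s, old, gamma) + 1e-12:
            policy[s] = best
            stable = False
    return stable


def policy_iteration(P, gamma=0.9):
    """Alternate full evaluation and greedy improvement until stable."""
    policy = {s: next(iter(P[s])) for s in P}  # arbitrary start: first action
    v = {s: 0.0 for s in P}
    while True:
        policy_evaluation(P, policy, v, gamma)
        if policy_improvement(P, policy, v, gamma):
            return policy, v


policy, v = policy_iteration(P)
print(policy, v)
```

On this toy MDP the loop should settle on action 1 in states 0 and 1, since that is the only path to the single reward. Each pass runs evaluation to convergence before touching the policy, which matches the "no partial updates" description above.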

Section 3. Chapter 7
