Introduction to Reinforcement Learning
What is Dynamic Programming?
Dynamic programming (DP) methods solve complex problems by breaking them into smaller subproblems and solving those recursively. Instead of solving the same subproblem repeatedly, DP reuses previously computed solutions to speed up the process.
It is particularly useful in reinforcement learning (RL) for solving Markov decision processes (MDPs) efficiently when a complete model of the environment is available.
Starting from this chapter, all environments are assumed to be finite MDPs. Finite MDPs have a finite state space, a finite action space, and a finite reward set.
Conditions for Applying DP
Not every problem can be solved with DP. There are two key attributes that a problem must have in order for DP to work:
Optimal substructure: the optimal solution to a problem is derived from the optimal solutions of its subproblems. In MDPs, this means that the optimal policy at any state depends on the optimal policies of its successor states. Because decisions in an MDP are sequential, solving smaller subproblems (finding the best action for future states) leads to solving the overall problem (finding the best action for the current state);
Overlapping subproblems: solutions to subproblems are reused to solve larger problems. In MDPs, this is evident because the value of a state is repeatedly calculated across different decision sequences. Since states are often revisited, previously computed values can be stored and reused, reducing redundant computations and improving efficiency.
A classic illustration is the recursion tree for computing Fibonacci numbers. Each node in the image represents a recursive call to compute Fib(n), and the tree structure shows how these calls are broken down into smaller subproblems. Notice that subproblems like Fib(2) and Fib(1) appear multiple times, demonstrating overlapping subproblems, while the solution to Fib(5) is built from the optimal solutions of its subproblems, demonstrating optimal substructure. This redundancy is exactly what dynamic programming eliminates by storing and reusing results.
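To see both properties in a familiar setting, here is a minimal Python sketch (not part of the course material itself) comparing a naive recursive Fibonacci with a memoized DP version; the function names are illustrative only.

```python
from functools import lru_cache

# Naive recursion: Fib(2) and Fib(1) are recomputed many times
# (overlapping subproblems), and Fib(n) is built from the answers
# to Fib(n - 1) and Fib(n - 2) (optimal substructure).
def fib_naive(n: int) -> int:
    if n < 2:
        return n
    return fib_naive(n - 1) + fib_naive(n - 2)

# DP version: caching each subproblem's result means every Fib(k)
# is computed exactly once and then reused.
@lru_cache(maxsize=None)
def fib_dp(n: int) -> int:
    if n < 2:
        return n
    return fib_dp(n - 1) + fib_dp(n - 2)

print(fib_naive(10))  # 55, but with many redundant recursive calls
print(fib_dp(10))     # 55, each subproblem solved only once
```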
Since MDPs exhibit both optimal substructure and overlapping subproblems, they are well-suited for DP-based solutions.
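As a concrete sketch of how DP exploits these properties in an MDP, the snippet below runs iterative policy evaluation on a made-up two-state finite MDP. The model, the policy, and all numbers are hypothetical assumptions for illustration; the key point is that each state's value is backed up from the stored values of its successor states, and those stored values are reused across sweeps.

```python
# Hypothetical two-state finite MDP: model[state][action] is a list of
# (probability, next_state, reward) triples. All numbers are illustrative.
model = {
    "s0": {"a": [(0.8, "s1", 1.0), (0.2, "s0", 0.0)]},
    "s1": {"a": [(1.0, "s0", 2.0)]},
}
policy = {"s0": "a", "s1": "a"}  # a fixed deterministic policy
gamma = 0.9                      # discount factor
theta = 1e-6                     # convergence threshold

# Iterative policy evaluation: repeatedly back up each state's value
# from the stored values of its successor states.
V = {s: 0.0 for s in model}
while True:
    delta = 0.0
    for s in model:
        v_new = sum(
            p * (r + gamma * V[s_next])          # reuse stored V[s_next]
            for p, s_next, r in model[s][policy[s]]
        )
        delta = max(delta, abs(v_new - V[s]))
        V[s] = v_new
    if delta < theta:
        break

print(V)  # approximate state values under the fixed policy
```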
Why Use DP in RL?
DP offers several advantages in RL:
Optimality guarantees: DP methods are guaranteed to converge to the optimal policy when the full model of the environment is known;
Efficiency for general solutions: DP computes a complete solution efficiently, meaning the resulting policy is optimal for every single state, not just for a particular starting state;
Foundational for other methods: concepts from DP serve as the basis for other RL methods, such as Monte Carlo and temporal difference learning.
However, DP is not feasible for large-scale problems due to its reliance on a full model and computational demands, which leads to the challenges discussed next.
Challenges and Limitations of Dynamic Programming
While DP provides an elegant framework for solving RL problems, it comes with significant challenges that limit its applicability in real-world scenarios:
Computational complexity: DP methods require computations for every single state in an environment. As the state space grows, the number of required computations increases significantly, making DP impractical for complex problems;
Need for a known model: DP assumes that the environment's transition probabilities and rewards are known in advance. However, in many real-world RL applications, this information is unavailable, making model-free approaches more practical.
As the number of state variables increases, the state space expands exponentially, a challenge known as the curse of dimensionality. This makes storing or computing optimal solutions impractical, limiting DP's scalability.
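The scale of the problem can be seen with a small back-of-the-envelope sketch (the numbers are assumptions for illustration): if each state variable can take 10 values, the size of the table a DP method would have to sweep grows exponentially with the number of variables.

```python
# Illustrative only: size of a tabular value function as the number of
# state variables grows, assuming each variable takes 10 possible values.
values_per_variable = 10
for num_variables in (1, 2, 4, 8):
    num_states = values_per_variable ** num_variables
    print(f"{num_variables} variables -> {num_states:,} states to evaluate")
```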