Introduction to Reinforcement Learning
What is Dynamic Programming?
Dynamic programming (DP) methods solve complex problems by breaking them into smaller subproblems and solving those recursively. Instead of solving the same subproblem repeatedly, DP reuses previously computed solutions to speed up the process.
It is particularly useful in reinforcement learning (RL) for solving Markov decision processes (MDPs) efficiently when a complete model of the environment is available.
Starting from this chapter, all environments are assumed to be finite MDPs. Finite MDPs have a finite state space, a finite action space, and a finite reward set.
Conditions for Applying DP
Not every problem can be solved with DP. There are two key attributes that a problem must have in order for DP to work:
Optimal substructure: the optimal solution to a problem is derived from the optimal solutions of its subproblems. In MDPs, this means that the optimal policy at any state depends on the optimal policies of its successor states. Because decisions in an MDP are sequential, solving smaller subproblems (finding the best action for future states) leads to solving the overall problem (finding the best action for the current state);
Overlapping subproblems: solutions to subproblems are reused to solve larger problems. In MDPs, this is evident because the value of a state is repeatedly calculated across different decision sequences. Since states are often revisited, previously computed values can be stored and reused, reducing redundant computations and improving efficiency.
A classic illustration is the recursion tree for computing Fibonacci numbers. Each node in the image represents a recursive call to compute Fib(n), and the tree structure shows how these calls are broken down into smaller subproblems. Notice that subproblems like Fib(2) and Fib(1) appear multiple times, demonstrating overlapping subproblems, while the solution to Fib(5) is built from the optimal solutions of its subproblems, demonstrating optimal substructure. This redundancy is exactly what dynamic programming eliminates by storing and reusing results.
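To see both properties in a familiar setting, here is a minimal Python sketch (not part of the course material itself) comparing a naive recursive Fibonacci with a memoized DP version; the function names are illustrative only.

```python
from functools import lru_cache

# Naive recursion: Fib(2) and Fib(1) are recomputed many times
# (overlapping subproblems), and Fib(n) is built from the answers
# to Fib(n - 1) and Fib(n - 2) (optimal substructure).
def fib_naive(n: int) -> int:
    if n < 2:
        return n
    return fib_naive(n - 1) + fib_naive(n - 2)

# DP version: caching each subproblem's result means every Fib(k)
# is computed exactly once and then reused.
@lru_cache(maxsize=None)
def fib_dp(n: int) -> int:
    if n < 2:
        return n
    return fib_dp(n - 1) + fib_dp(n - 2)

print(fib_naive(10))  # 55, but with many redundant recursive calls
print(fib_dp(10))     # 55, each subproblem solved only once
```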
Since MDPs exhibit both optimal substructure and overlapping subproblems, they are well-suited for DP-based solutions.
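As a concrete sketch of how DP exploits these properties in an MDP, the snippet below runs iterative policy evaluation on a made-up two-state finite MDP. The model, the policy, and all numbers are hypothetical assumptions for illustration; the key point is that each state's value is backed up from the stored values of its successor states, and those stored values are reused across sweeps.

```python
# Hypothetical two-state finite MDP: model[state][action] is a list of
# (probability, next_state, reward) triples. All numbers are illustrative.
model = {
    "s0": {"a": [(0.8, "s1", 1.0), (0.2, "s0", 0.0)]},
    "s1": {"a": [(1.0, "s0", 2.0)]},
}
policy = {"s0": "a", "s1": "a"}  # a fixed deterministic policy
gamma = 0.9                      # discount factor
theta = 1e-6                     # convergence threshold

# Iterative policy evaluation: repeatedly back up each state's value
# from the stored values of its successor states.
V = {s: 0.0 for s in model}
while True:
    delta = 0.0
    for s in model:
        v_new = sum(
            p * (r + gamma * V[s_next])          # reuse stored V[s_next]
            for p, s_next, r in model[s][policy[s]]
        )
        delta = max(delta, abs(v_new - V[s]))
        V[s] = v_new
    if delta < theta:
        break

print(V)  # approximate state values under the fixed policy
```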
Why Use DP in RL?
DP offers several advantages in RL:
Optimality guarantees: DP methods are guaranteed to converge to the optimal policy when the full model of the environment is known;
Efficiency for general solutions: DP computes a complete solution efficiently, meaning the resulting policy is optimal for every single state, not just for a particular starting state;
Foundational for other methods: concepts from DP serve as the basis for other RL methods, such as Monte Carlo and temporal difference learning.
However, DP is not feasible for large-scale problems due to its reliance on a full model and computational demands, which leads to the challenges discussed next.
Challenges and Limitations of Dynamic Programming
While DP provides an elegant framework for solving RL problems, it comes with significant challenges that limit its applicability in real-world scenarios:
Computational complexity: DP methods require computations for every single state in an environment. As the state space grows, the number of required computations increases significantly, making DP impractical for complex problems;
Need for a known model: DP assumes that the environment's transition probabilities and rewards are known in advance. However, in many real-world RL applications, this information is unavailable, making model-free approaches more practical.
As the number of state variables increases, the state space expands exponentially, a challenge known as the curse of dimensionality. This makes storing or computing optimal solutions impractical, limiting DP's scalability.
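The scale of the problem can be seen with a small back-of-the-envelope sketch (the numbers are assumptions for illustration): if each state variable can take 10 values, the size of the table a DP method would have to sweep grows exponentially with the number of variables.

```python
# Illustrative only: size of a tabular value function as the number of
# state variables grows, assuming each variable takes 10 possible values.
values_per_variable = 10
for num_variables in (1, 2, 4, 8):
    num_states = values_per_variable ** num_variables
    print(f"{num_variables} variables -> {num_states:,} states to evaluate")
```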