Introduction to Reinforcement Learning
Optimality Conditions
In the previous chapter, you learned about Bellman equations for state value and state-action value functions. These equations describe how state values can be recursively defined through the values of other states, with the values being dependent on a given policy. However, not all policies are equally effective. In fact, value functions provide a partial ordering for policies, which can be described as follows:
A policy $\pi$ is considered better than or equal to a policy $\pi'$ if its expected return is at least as high in every state:

$$\pi \ge \pi' \iff v_\pi(s) \ge v_{\pi'}(s) \quad \text{for all } s \in \mathcal{S}.$$

The ordering is only partial because two policies can be incomparable: each may be better in some states and worse in others.
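To make this comparison concrete, here is a minimal Python sketch (assuming NumPy and two made-up value functions for hypothetical policies $\pi$ and $\pi'$) that checks the ordering by comparing state values elementwise:

```python
import numpy as np

# Hypothetical state value functions of two policies, evaluated on the same
# five states. The numbers are made up purely for illustration.
v_pi = np.array([1.0, 2.5, 0.0, 3.2, 1.1])          # v_pi(s) for each state s
v_pi_prime = np.array([0.8, 2.5, -0.4, 3.0, 1.1])   # v_pi'(s) for each state s

def is_better_or_equal(v_a: np.ndarray, v_b: np.ndarray) -> bool:
    """True if policy A is better than or equal to policy B,
    i.e. v_A(s) >= v_B(s) for every state s."""
    return bool(np.all(v_a >= v_b))

print(is_better_or_equal(v_pi, v_pi_prime))  # True:  pi >= pi'
print(is_better_or_equal(v_pi_prime, v_pi))  # False: pi' is not >= pi
```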
Optimal Policy
An optimal policy $\pi_*$ is a policy that is better than or equal to every other policy, that is, $\pi_* \ge \pi$ for every policy $\pi$. Although there may be many optimal policies, all of them are denoted as $\pi_*$.
Why does an optimal policy always exist?
You might be wondering why an optimal policy always exists for any MDP. That's a great question, and the intuition behind it is surprisingly simple. Remember, states in an MDP fully capture the environment's condition. This implies each state can be treated independently of the others: the action chosen in one state doesn't affect the rewards or outcomes achievable in another. Therefore, by selecting the optimal action in each state separately, you naturally arrive at the overall best sequence of actions across the entire process. This set of per-state optimal actions is an optimal policy.
Moreover, there is always at least one policy that is both optimal and deterministic. Indeed, if for some state $s$ two actions $a_1$ and $a_2$ yield the same expected return, selecting just one of them will not affect the policy's optimality. Applying this principle to every single state makes the policy deterministic while preserving its optimality.
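As a sketch of this tie-breaking argument, the snippet below uses a made-up table of optimal action values and extracts a deterministic policy by greedily picking one maximizing action per state:

```python
import numpy as np

# Hypothetical optimal action values q*(s, a) for 3 states and 2 actions.
# In state 1 both actions are equally good, so either choice keeps the policy
# optimal; argmax simply picks the first one, making the policy deterministic.
q_star = np.array([
    [1.0, 0.5],   # state 0: action 0 is strictly better
    [2.0, 2.0],   # state 1: tie between the two actions
    [0.3, 0.9],   # state 2: action 1 is strictly better
])

pi_star = np.argmax(q_star, axis=1)  # one action per state
print(pi_star)  # [0 0 1]
```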
Optimal Value Functions
All optimal policies share the same value functions, a fact that becomes clear when we consider how policies are compared: if two policies are both optimal, neither can have a strictly greater value in any state, so their values must coincide everywhere. This means that optimal policies share both the state value function and the action value function.
Additionally, optimal value functions have their own Bellman equations that can be written without reference to any specific policy. These equations are called Bellman optimality equations.
Optimal state value function
The optimal state value function is usually denoted as $v_*$ or $v^*$.
It can be mathematically defined as follows:

$$v_*(s) = \max_\pi v_\pi(s) \quad \text{for all } s \in \mathcal{S}.$$
The Bellman optimality equation for this value function can be derived like this:

$$
\begin{aligned}
v_*(s) &= \max_a q_*(s, a) \\
&= \max_a \mathbb{E}\bigl[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a\bigr] \\
&= \max_a \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma v_*(s')\bigr].
\end{aligned}
$$
Intuition
As you already know, there always exists at least one policy that is both optimal and deterministic. Such a policy would, for each state, consistently select one particular action that maximizes expected returns. Therefore, the probability of choosing this optimal action would always be 1, and the probability of choosing any other action would be 0. Given this, the original Bellman equation no longer needs the sum over actions weighted by the policy's probabilities. Instead, since we know we will always select the best possible action, we can simply replace that sum with a maximum over all available actions.
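As an illustration, here is a minimal value-iteration sketch that repeatedly applies this backup on a tiny made-up MDP; the transition model P, rewards, and discount factor below are assumptions chosen purely for the example:

```python
import numpy as np

# Toy MDP: P[s][a] = list of (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)],                   # stay in state 0, no reward
        1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},   # mostly move to state 1
    1: {0: [(1.0, 1, 2.0)],                   # stay in state 1, reward 2
        1: [(1.0, 0, 0.0)]},                  # go back to state 0
}
gamma = 0.9
v = np.zeros(len(P))

# Repeatedly apply the Bellman optimality backup
#   v(s) <- max_a sum_{s', r} p(s', r | s, a) * (r + gamma * v(s'))
for _ in range(1000):
    v_new = np.array([
        max(
            sum(p * (r + gamma * v[s_next]) for p, s_next, r in P[s][a])
            for a in P[s]
        )
        for s in P
    ])
    if np.max(np.abs(v_new - v)) < 1e-8:
        break
    v = v_new

print(v)  # approximate optimal state values v*(s)
```

The fixed point of this backup is exactly $v_*$, which is why iterating it converges to the optimal state values.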
Optimal action value function
The optimal action value function is usually denoted as $q_*$ or $q^*$.
It can be mathematically defined as follows:

$$q_*(s, a) = \max_\pi q_\pi(s, a) \quad \text{for all } s \in \mathcal{S},\ a \in \mathcal{A}(s).$$
The Bellman optimality equation for this value function can be derived like this:

$$
\begin{aligned}
q_*(s, a) &= \mathbb{E}\bigl[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \mid S_t = s, A_t = a\bigr] \\
&= \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma \max_{a'} q_*(s', a')\bigr].
\end{aligned}
$$
Intuition
Similarly to the state value function, the sum over next actions weighted by the policy's probabilities can be replaced by taking a maximum over all actions available in the next state.
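Analogously, here is a short q-value-iteration sketch that applies this backup directly to a table of action values, reusing the same made-up MDP as in the earlier example:

```python
import numpy as np

# Toy MDP in the same format: P[s][a] = list of (probability, next_state, reward).
P = {
    0: {0: [(1.0, 0, 0.0)],
        1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 1, 2.0)],
        1: [(1.0, 0, 0.0)]},
}
gamma = 0.9
q = np.zeros((2, 2))  # q(s, a) for 2 states and 2 actions

# Repeatedly apply the Bellman optimality backup
#   q(s, a) <- sum_{s', r} p(s', r | s, a) * (r + gamma * max_{a'} q(s', a'))
for _ in range(1000):
    q_new = np.zeros_like(q)
    for s in P:
        for a in P[s]:
            q_new[s, a] = sum(
                p * (r + gamma * np.max(q[s_next]))
                for p, s_next, r in P[s][a]
            )
    if np.max(np.abs(q_new - q)) < 1e-8:
        break
    q = q_new

print(q)              # approximate optimal action values q*(s, a)
print(q.max(axis=1))  # recovers v*(s) = max_a q*(s, a)
```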