Introduction to Reinforcement Learning
Policy Improvement
Now that you can estimate the state value function for any policy, a natural next step is to explore whether there are any policies better than the current one. One way of doing this is to consider taking a different action $a$ in some state $s$, and then following the current policy $\pi$ afterwards. If this sounds familiar, it's because this is similar to how we define the action value function:

$$q_\pi(s, a) = \mathbb{E}\left[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a \right]$$
If this new value $q_\pi(s, a)$ is greater than the original state value $v_\pi(s)$, it indicates that taking action $a$ in state $s$ and then continuing with policy $\pi$ leads to better outcomes than strictly following $\pi$. Since this reasoning applies to each state independently, it's optimal to always select action $a$ whenever state $s$ is encountered. Therefore, we can construct an improved policy $\pi'$, identical to $\pi$ except that it selects action $a$ in state $s$, which would be superior to the original policy $\pi$.
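To see the comparison concretely, here is a minimal Python sketch. The three-state MDP, its transition table `P`, the discount `gamma`, and the values `v_pi` are all illustrative assumptions rather than anything from the lesson; the sketch computes $q_\pi(s, a)$ by a one-step lookahead from $v_\pi$ and checks whether switching the action in a single state is an improvement:

```python
import numpy as np

# Hypothetical 3-state, 2-action MDP used only for illustration.
# P[s][a] is a list of (probability, next_state, reward) transitions.
P = {
    0: {0: [(1.0, 1, 0.0)], 1: [(1.0, 2, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 2, 2.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},  # absorbing state
}
gamma = 0.9

def q_from_v(v, s, a):
    """One-step lookahead: q_pi(s, a) computed from the state values v_pi."""
    return sum(p * (r + gamma * v[s_next]) for p, s_next, r in P[s][a])

# Assume policy evaluation has already produced v_pi for the current policy
# (these numbers are placeholders, not the result of an actual evaluation).
v_pi = np.array([0.5, 2.0, 0.0])

s, a = 0, 1
q_sa = q_from_v(v_pi, s, a)
print(f"q_pi(s={s}, a={a}) = {q_sa:.2f}, v_pi(s={s}) = {v_pi[s]:.2f}")
if q_sa > v_pi[s]:
    print("Taking this action in state s and following pi afterwards is an improvement.")
```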
Policy Improvement Theorem
The reasoning described above can be generalized as the policy improvement theorem: if $\pi$ and $\pi'$ are any pair of deterministic policies such that, for all states $s$,

$$q_\pi(s, \pi'(s)) \ge v_\pi(s),$$

then the policy $\pi'$ is at least as good as $\pi$, that is, for all states $s$,

$$v_{\pi'}(s) \ge v_\pi(s).$$
The proof of this theorem is relatively simple, and can be achieved by repeated substitution: starting from the condition of the theorem, keep expanding $v_\pi$ one step further and re-applying the inequality:

$$
\begin{aligned}
v_\pi(s) &\le q_\pi(s, \pi'(s)) \\
&= \mathbb{E}\left[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = \pi'(s) \right] \\
&= \mathbb{E}_{\pi'}\left[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s \right] \\
&\le \mathbb{E}_{\pi'}\left[ R_{t+1} + \gamma q_\pi(S_{t+1}, \pi'(S_{t+1})) \mid S_t = s \right] \\
&= \mathbb{E}_{\pi'}\left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 v_\pi(S_{t+2}) \mid S_t = s \right] \\
&\;\;\vdots \\
&\le \mathbb{E}_{\pi'}\left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s \right] \\
&= v_{\pi'}(s).
\end{aligned}
$$
Improvement Strategy
While updating the actions for only certain states can already lead to improvements, it's more effective to update the actions for all states simultaneously. Specifically, for each state $s$, select the action that maximizes the action value $q_\pi(s, a)$:

$$\pi'(s) = \arg\max_a q_\pi(s, a) = \arg\max_a \mathbb{E}\left[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a \right]$$

The resulting greedy policy, denoted by $\pi'$, satisfies the conditions of the policy improvement theorem by construction, guaranteeing that $\pi'$ is at least as good as the original policy $\pi$, and typically better.
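Continuing the sketch above (and reusing the assumed `P`, `gamma`, `q_from_v`, and `v_pi`), the greedy improvement step is a single sweep over all states:

```python
def greedy_policy(v):
    """pi'(s) = argmax_a q_pi(s, a): pick the action with the best one-step lookahead value."""
    pi_prime = {}
    for s in P:
        q_values = {a: q_from_v(v, s, a) for a in P[s]}
        pi_prime[s] = max(q_values, key=q_values.get)  # argmax over actions
    return pi_prime

print("Improved (greedy) policy:", greedy_policy(v_pi))
```

Ties in the `max` call are broken arbitrarily (here, by the first action encountered), which is consistent with the theorem: any action that attains the maximum satisfies the improvement condition.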
If $\pi'$ is as good as, but not better than, $\pi$, then $v_{\pi'} = v_\pi$, and both $\pi$ and $\pi'$ are optimal policies, as their shared value function satisfies the Bellman optimality equation:

$$v_{\pi'}(s) = \max_a \mathbb{E}\left[ R_{t+1} + \gamma v_{\pi'}(S_{t+1}) \mid S_t = s, A_t = a \right]$$