Policy Improvement | Dynamic Programming
Introduction to Reinforcement Learning

Policy Improvement

Now that you can estimate the state value function for any policy, a natural next step is to explore whether there are any policies better than the current one. One way of doing this is to consider taking a different action $a$ in a state $s$, and then following the current policy $\pi$ afterwards. If this sounds familiar, it's because this is similar to how we define the action value function:

$$q_\pi(s, a) = \sum_{s', r} p(s', r | s, a)\Bigl(r + \gamma v_\pi(s')\Bigr)$$

If this new value is greater than the original state value $v_\pi(s)$, it indicates that taking action $a$ in state $s$ and then continuing with policy $\pi$ leads to better outcomes than strictly following policy $\pi$. Since states are independent, it's optimal to always select action $a$ whenever state $s$ is encountered. Therefore, we can construct an improved policy $\pi'$, identical to $\pi$ except that it selects action $a$ in state $s$, which would be superior to the original policy $\pi$.
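As a rough illustration, this one-step lookahead can be computed directly for a small tabular MDP. The array names `P`, `R`, `v` and the discount `gamma` below are assumptions about how the model and value estimates might be stored, not something defined in this course:

```python
import numpy as np

def action_value(P, R, v, s, a, gamma=0.9):
    """One-step lookahead: estimate q_pi(s, a) from the model and v_pi.

    Assumed (hypothetical) tabular representation:
      P[s, a, s'] -- transition probability p(s' | s, a)
      R[s, a, s'] -- expected reward for the transition (s, a, s')
      v[s']       -- current estimate of v_pi(s')
    """
    # Sum over successor states: p(s' | s, a) * (r + gamma * v(s'))
    return np.sum(P[s, a] * (R[s, a] + gamma * v))
```

If `action_value(P, R, v, s, a)` exceeds `v[s]`, switching to action `a` in state `s` (and following $\pi$ otherwise) yields exactly the improved policy described above.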

Policy Improvement Theorem

The reasoning described above can be generalized as the policy improvement theorem:

$$\begin{aligned}
q_\pi(s, \pi'(s)) &\ge v_\pi(s) \quad \forall s \in S\\
\implies v_{\pi'}(s) &\ge v_\pi(s) \quad \forall s \in S
\end{aligned}$$

The proof of this theorem is relatively simple and can be obtained by repeated substitution:

$$\begin{aligned}
v_\pi(s) &\le q_\pi(s, \pi'(s))\\
&= \mathbb{E}_{\pi'}[R_{t+1} + \gamma v_\pi(S_{t+1}) | S_t = s]\\
&\le \mathbb{E}_{\pi'}[R_{t+1} + \gamma q_\pi(S_{t+1}, \pi'(S_{t+1})) | S_t = s]\\
&= \mathbb{E}_{\pi'}[R_{t+1} + \gamma \mathbb{E}_{\pi'}[R_{t+2} + \gamma v_\pi(S_{t+2})] | S_t = s]\\
&= \mathbb{E}_{\pi'}[R_{t+1} + \gamma R_{t+2} + \gamma^2 v_\pi(S_{t+2}) | S_t = s]\\
&\;\;\vdots\\
&\le \mathbb{E}_{\pi'}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots | S_t = s]\\
&= v_{\pi'}(s)
\end{aligned}$$

Improvement Strategy

While updating actions for certain states can lead to improvements, it's more effective to update actions for all states simultaneously. Specifically, for each state $s$, select the action $a$ that maximizes the action value $q_\pi(s, a)$:

$$\begin{aligned}
\pi'(s) &\gets \operatorname*{arg\,max}_a q_\pi(s, a)\\
&\gets \operatorname*{arg\,max}_a \sum_{s', r} p(s', r | s, a)\Bigl(r + \gamma v_\pi(s')\Bigr)
\end{aligned}$$

The resulting greedy policy, denoted by $\pi'$, satisfies the conditions of the policy improvement theorem by construction, guaranteeing that $\pi'$ is at least as good as the original policy $\pi$, and typically better.
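As a sketch of how this improvement step might look in code, reusing the hypothetical `P`, `R`, `v`, and `gamma` representation from the previous snippet:

```python
import numpy as np

def improve_policy(P, R, v, gamma=0.9):
    """Policy improvement: act greedily with respect to v_pi in every state.

    Same assumed tabular model as before:
      P[s, a, s'] -- p(s' | s, a), shape (n_states, n_actions, n_states)
      R[s, a, s'] -- expected reward for that transition
      v[s']       -- state values of the current policy
    Returns a deterministic policy as an array of action indices.
    """
    n_states, n_actions, _ = P.shape
    q = np.zeros((n_states, n_actions))
    for s in range(n_states):
        for a in range(n_actions):
            # One-step lookahead for q_pi(s, a)
            q[s, a] = np.sum(P[s, a] * (R[s, a] + gamma * v))
    # pi'(s) = argmax_a q_pi(s, a)
    return q.argmax(axis=1)
```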

If $\pi'$ is as good as, but not better than, $\pi$, then both $\pi'$ and $\pi$ are optimal policies, as their value functions are equal and satisfy the Bellman optimality equation:

$$v_\pi(s) = \max_a \sum_{s', r} p(s', r | s, a)\Bigl(r + \gamma v_\pi(s')\Bigr)$$
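This stopping condition can also be checked numerically. The helper below is a hypothetical illustration using the same assumed arrays as the earlier sketches: it tests whether the current value function already satisfies the Bellman optimality equation, i.e. whether greedy improvement would leave the policy unchanged:

```python
import numpy as np

def satisfies_bellman_optimality(P, R, v, gamma=0.9, tol=1e-8):
    """Return True if v(s) = max_a sum_{s', r} p(s', r | s, a)(r + gamma v(s'))
    holds (up to tol) in every state."""
    n_states, n_actions, _ = P.shape
    for s in range(n_states):
        # One-step lookahead values for every action in state s
        lookahead = [np.sum(P[s, a] * (R[s, a] + gamma * v))
                     for a in range(n_actions)]
        if abs(v[s] - max(lookahead)) > tol:
            return False
    return True
```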

How does adopting a greedy policy guarantee an improvement over the previous policy?

