On-Policy Monte Carlo Control

The idea behind on-policy methods is intuitive: an agent learns by following its current policy and improves that policy based on the outcomes it experiences. To discover better actions and avoid getting stuck in suboptimal behavior, the agent incorporates a degree of randomness — occasionally trying alternative actions to encourage exploration.

Analogy

Imagine you're at an ice cream shop and there are three flavors available: chocolate, vanilla, and strawberry. You love chocolate, so that's what you usually pick. But one day, out of curiosity, you decide to try strawberry instead. It turns out the strawberry ice cream at this shop is incredibly tasty, and from then on it becomes your pick on every visit.

Trying a new flavor wasn't necessarily the most logical choice based on past experience, but it gave you a chance to discover something new. And this kind of exploration lies at the core of on-policy methods.

Stochastic Policies

Formally, adopting this idea means replacing the deterministic (hard) policies used in dynamic programming with stochastic (soft) policies, denoted as π(a | s), where:

$$\pi(a \mid s) > 0 \qquad \forall s \in S,\ a \in A(s)$$

In other words, every action in every state has a non-zero probability of being selected. This ensures that all parts of the environment can eventually be explored, which is essential when learning from experience.
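
As a quick illustration, the sketch below contrasts a deterministic policy with a soft one for a toy problem; the state and action counts are arbitrary and the helper function is hypothetical, not part of the course material:

```python
import numpy as np

# Toy problem with 4 states and 3 actions (numbers are arbitrary).
n_states, n_actions = 4, 3

# Deterministic (hard) policy: each state maps to exactly one action,
# so most actions have probability 0 and would never be explored.
hard_policy = np.zeros((n_states, n_actions))
hard_policy[np.arange(n_states), [0, 2, 1, 0]] = 1.0

# Soft policy: every action in every state keeps a strictly positive probability.
soft_policy = np.full((n_states, n_actions), 1.0 / n_actions)

def is_soft(policy):
    """Check the soft-policy condition: rows are distributions and pi(a|s) > 0 for all s, a."""
    return np.allclose(policy.sum(axis=1), 1.0) and np.all(policy > 0)

print(is_soft(hard_policy))  # False
print(is_soft(soft_policy))  # True
```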

ε-greedy Policies

To incorporate exploration into the policy, let's borrow the concept of ε-greedy exploration from the multi-armed bandit problem. This allows us to define a stochastic policy that balances exploiting the best-known action with exploring alternatives:

$$\pi(a \mid s) \gets \begin{cases} 1 - \varepsilon + \dfrac{\varepsilon}{|A(s)|} & \text{if } a = \arg\max_{a'} q_\pi(s, a') \\[6pt] \dfrac{\varepsilon}{|A(s)|} & \text{otherwise} \end{cases}$$

This policy behaves greedily most of the time — choosing the action with the highest estimated value — but with probability ε, it selects a random action, ensuring that all actions have a non-zero chance of being taken (even the greedy one again, via uniform sampling).
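
To make the formula concrete, here is a small Python sketch (the function names and example values are illustrative, not taken from the course material) that converts the action-value estimates of a single state into ε-greedy probabilities and samples an action from them:

```python
import numpy as np

def epsilon_greedy_probs(q_values, epsilon):
    """Return pi(. | s) for one state, following the epsilon-greedy formula above.

    q_values: 1-D array of action-value estimates q(s, .) for a single state.
    """
    n_actions = len(q_values)
    probs = np.full(n_actions, epsilon / n_actions)   # uniform exploration mass
    probs[np.argmax(q_values)] += 1.0 - epsilon       # extra mass on the greedy action
    return probs

def epsilon_greedy_action(q_values, epsilon):
    """Sample an action according to the epsilon-greedy distribution."""
    return np.random.choice(len(q_values), p=epsilon_greedy_probs(q_values, epsilon))

# Example: with epsilon = 0.1 and 4 actions, the greedy action gets
# 1 - 0.1 + 0.1/4 = 0.925 and every other action gets 0.025.
print(epsilon_greedy_probs(np.array([0.5, 2.0, -1.0, 0.3]), 0.1))
```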

At first glance, this approach seems problematic: since the policy never becomes purely greedy, it will never converge to the exact optimal policy. As such, it doesn't strictly satisfy the conditions for GPI if we expect exact optimality in the limit.

However, GPI doesn't require the policy to become optimal immediately — it only requires that each policy improves (or stays the same) compared to the last, progressively moving toward optimality. The ε-greedy policy satisfies this condition: making the policy ε-greedy with respect to the current action-value estimates yields a policy at least as good as the previous one, while the built-in exploration keeps improving those estimates.

To address the issue of convergence to the truly optimal policy, we can gradually reduce ε over time. This strategy lets the policy become increasingly greedy as learning progresses. In the early stages, exploration helps gather diverse experience, while in later stages, the agent exploits its improved knowledge. With a properly decaying ε, the method does converge to an optimal policy in the limit.
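
A decaying ε can be implemented in many ways; the two schedules sketched below are common illustrative choices rather than part of the method itself. Note that a schedule with a fixed lower floor keeps exploring forever, while one that actually drives ε to zero is what the convergence argument above requires:

```python
def exponential_epsilon(episode, eps_start=1.0, eps_min=0.01, decay=0.995):
    """Exponential decay toward a small floor (a common practical choice;
    the floor keeps a little exploration forever)."""
    return max(eps_min, eps_start * decay ** episode)

def harmonic_epsilon(episode):
    """epsilon_k = 1 / k: goes to zero in the limit, yet every action keeps a
    non-zero selection probability at every finite step."""
    return 1.0 / (episode + 1)
```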

Pseudocode
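
A possible Python sketch of the algorithm is shown below. It combines episode generation under the current ε-greedy policy with first-visit Monte Carlo updates of the action values; the environment interface (reset() returning a state, step(action) returning (next_state, reward, done)) and all hyperparameter values are assumptions made for illustration:

```python
import numpy as np

def on_policy_mc_control(env, n_states, n_actions,
                         num_episodes=10_000, gamma=0.99, epsilon=0.1):
    """On-policy first-visit Monte Carlo control with an epsilon-greedy policy.

    Assumed (hypothetical) environment interface:
      env.reset()  -> initial state, an int in [0, n_states)
      env.step(a)  -> (next_state, reward, done)
    """
    Q = np.zeros((n_states, n_actions))              # action-value estimates
    visit_counts = np.zeros((n_states, n_actions))   # counts for incremental averaging

    def act(state):
        # Epsilon-greedy action selection with respect to the current Q.
        if np.random.random() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax(Q[state]))

    for _ in range(num_episodes):
        # 1. Generate one episode by following the current epsilon-greedy policy.
        episode = []
        state, done = env.reset(), False
        while not done:
            action = act(state)
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        # 2. Record the first time step at which each (state, action) pair occurs.
        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), t)

        # 3. Walk the episode backwards, accumulating the return G,
        #    and update Q only at first visits (incremental sample average).
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if first_visit[(s, a)] == t:
                visit_counts[s, a] += 1
                Q[s, a] += (G - Q[s, a]) / visit_counts[s, a]

        # Policy improvement is implicit: 'act' is always epsilon-greedy
        # with respect to the freshly updated Q.

    return Q
```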

