Introduction to Reinforcement Learning
Action Values
Action value is a fundamental concept in the MAB problem. It plays a pivotal role in various algorithms, including epsilon-greedy and upper confidence bound. The primary purpose of an action value is to provide an estimate of the expected reward when a specific action is chosen. It is similar to a state-action value, but is independent of a state due to the stateless nature of the MAB problem. Understanding the concept of action values is essential for implementing effective bandit algorithms.
Definition of Action Value
Formally, the action value, denoted as $Q(a)$, represents the expected reward of choosing action $a$:

$$Q(a) = \mathbb{E}[R \mid A = a]$$

where:
- $R$ is the reward received;
- $A$ is the action selected.
Since the true reward distribution is typically unknown, we have to estimate $Q(a)$ using observed data.
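As a quick illustration with hypothetical numbers: if pulling an arm $a$ yields a reward of 1 with probability 0.7 and a reward of 0 otherwise, its action value is

$$Q(a) = \mathbb{E}[R \mid A = a] = 0.7 \cdot 1 + 0.3 \cdot 0 = 0.7$$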
Estimating Action Values
There are several ways to estimate $Q(a)$ based on observed rewards. The most common method is the sample average estimate, which calculates the mean reward received from selecting action $a$ up to time $t$:

$$Q_t(a) = \frac{\sum_{i=1}^{N_t(a)} R_i}{N_t(a)}$$

where:
- $Q_t(a)$ is the estimated value of action $a$ at time step $t$;
- $N_t(a)$ is the number of times action $a$ has been chosen up to time $t$;
- $R_i$ is the reward obtained in each instance when action $a$ was taken.
As more samples are collected, this estimate converges to the true expected reward assuming the reward distribution remains stationary.
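Below is a minimal sketch, not from the course material, of how the sample average estimate can be computed in code. The three-armed Bernoulli bandit and its payout probabilities are made up purely for illustration, and the agent here picks arms uniformly at random just to gather samples.

```python
import numpy as np

# Hypothetical Bernoulli bandit: each arm pays 1 with its own probability.
true_means = [0.2, 0.5, 0.8]   # assumed values, unknown to the agent
n_actions = len(true_means)
rng = np.random.default_rng(0)

# Store every reward observed for each action.
rewards_per_action = [[] for _ in range(n_actions)]

for t in range(1000):
    action = int(rng.integers(n_actions))              # pick an arm uniformly at random
    reward = float(rng.random() < true_means[action])  # sample a 0/1 reward
    rewards_per_action[action].append(reward)

# Sample average estimate: mean of all rewards observed for each action.
q_estimates = [np.mean(r) if r else 0.0 for r in rewards_per_action]
print(q_estimates)  # each estimate should be close to the corresponding true mean
```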
Incremental Update Rule
While the formula above can be used to estimate action values, it requires storing all previous rewards and recomputing their sum on every time step. With incremental updates, this becomes unnecessary. The formula for incremental updates can be derived as follows:

$$Q_{k+1} = \frac{1}{k}\sum_{i=1}^{k} R_i = \frac{1}{k}\left(R_k + \sum_{i=1}^{k-1} R_i\right) = \frac{1}{k}\Bigl(R_k + (k-1)Q_k\Bigr) = Q_k + \frac{1}{k}\bigl(R_k - Q_k\bigr)$$

where for some action:
- $Q_k$ is the estimate of the $k$-th reward, which can be expressed as the average of the first $k-1$ rewards;
- $R_k$ is the actual $k$-th reward.
Intuition
Knowing the estimate of the $k$-th reward, $Q_k$, and the actual $k$-th reward, $R_k$, you can measure the error as the difference between these values. The next estimate can then be calculated by adjusting the previous estimate slightly in the direction of the actual reward, to reduce the error.
This intuition leads to another formula, which looks like this:

$$Q_{k+1} = Q_k + \alpha \bigl(R_k - Q_k\bigr)$$

where $\alpha$ is a step-size parameter controlling the rate of learning. As in the previous formula, $\alpha$ can be $\frac{1}{k}$, which results in the sample average estimate. Alternatively, a constant $\alpha$ is commonly used, as it doesn't require any additional space (to store how many times an action was taken) and allows adaptation to non-stationary environments by placing more weight on recent observations.
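A minimal sketch of this incremental update is shown below; the `update` helper is hypothetical, not part of the course. Passing the count `k` gives the sample-average step size $1/k$, while passing a constant `alpha` gives the variant that favors recent rewards.

```python
def update(q, reward, k=None, alpha=None):
    """Incrementally update an action value estimate.

    q      -- current estimate Q_k
    reward -- observed reward R_k
    k      -- number of times the action has been taken (sample-average mode)
    alpha  -- constant step size (used instead of 1/k if given)
    """
    step = alpha if alpha is not None else 1.0 / k
    return q + step * (reward - q)  # Q_{k+1} = Q_k + step * (R_k - Q_k)

# Sample-average mode: equivalent to averaging all rewards seen so far.
q = 0.0
for k, r in enumerate([1.0, 0.0, 1.0, 1.0], start=1):
    q = update(q, r, k=k)
print(q)  # 0.75, the mean of the four rewards

# Constant step size: recent rewards carry more weight (non-stationary setting).
q = 0.0
for r in [1.0, 0.0, 1.0, 1.0]:
    q = update(q, r, alpha=0.1)
print(q)
```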
Optimistic Initialization
At the beginning of a training process, action value estimates are based on very few samples and can be highly unreliable, which may lead to premature exploitation. This means the agent may exploit its initial knowledge too early, favoring suboptimal actions based on limited experience. To mitigate this issue and encourage initial exploration, one simple and effective technique is optimistic initialization.
In optimistic initialization, action values are initialized to deliberately high values rather than 0. This approach creates the impression that all actions are initially promising. As a result, the agent is incentivized to explore each action multiple times before settling on the best choice. This technique is most effective when used in combination with a constant step size.
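As a rough sketch, with made-up arm probabilities and an assumed optimistic initial value of 2.0: only the initialization changes, yet a purely greedy agent still explores, because every untried action looks better than any reward observed so far.

```python
import numpy as np

true_means = [0.2, 0.5, 0.8]   # hypothetical Bernoulli arms, unknown to the agent
n_actions = len(true_means)
rng = np.random.default_rng(1)

q = np.full(n_actions, 2.0)    # optimistic initial values, well above any possible reward
alpha = 0.1                    # constant step size

for t in range(1000):
    action = int(np.argmax(q))                          # purely greedy selection
    reward = float(rng.random() < true_means[action])   # observed reward is at most 1
    q[action] += alpha * (reward - q[action])           # incremental update

print(q)  # after enough steps the agent mostly pulls the arm with the highest true mean
```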