Introduction to Reinforcement Learning
Gradient Bandits Algorithm
When dealing with multi-armed bandits, traditional methods like epsilon-greedy and UCB estimate action values to decide which action to take. However, gradient bandits take a different approach — they learn preferences for actions instead of estimating their values. These preferences are adjusted over time using stochastic gradient ascent.
Preferences
Instead of maintaining action value estimates $Q_t(a)$, gradient bandits maintain preference values $H_t(a)$ for each action $a$. These preferences are updated using a stochastic gradient ascent approach to maximize expected rewards. Each action's probability is computed using a softmax function:

$$
\pi_t(a) = P(A_t = a) = \frac{e^{H_t(a)}}{\sum_{b} e^{H_t(b)}}
$$

where:

- $H_t(a)$ is the preference for action $a$ at time step $t$;
- $\pi_t(a)$ is the probability of selecting action $a$ at time step $t$;
- The denominator ensures that probabilities sum to 1.
Softmax is a crucial function in ML, commonly used to convert lists of real numbers into lists of probabilities. This function serves as a smooth approximation to the $\arg\max$ function, enabling natural exploration by giving lower-preference actions a non-zero chance of being selected.
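As a quick illustration, here is a minimal sketch of converting preferences into action probabilities with a numerically stable softmax; it assumes NumPy is available, and the preference values are arbitrary example numbers, not values from the lesson:

```python
import numpy as np

def softmax(preferences):
    """Convert preference values H(a) into a probability distribution pi(a)."""
    # Subtracting the maximum preference improves numerical stability;
    # shifting all preferences by a constant does not change the result.
    exp_h = np.exp(preferences - np.max(preferences))
    return exp_h / exp_h.sum()

# Example: three actions with arbitrary preference values
H = np.array([0.5, 1.0, -0.3])
print(softmax(H))  # probabilities sum to 1; higher preference -> higher probability
```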
Update Rule
After selecting an action $A_t$ at time $t$, the preference values are updated using the following rule:

$$
H_{t+1}(A_t) = H_t(A_t) + \alpha \bigl(R_t - \bar{R}_t\bigr)\bigl(1 - \pi_t(A_t)\bigr)
$$

$$
H_{t+1}(a) = H_t(a) - \alpha \bigl(R_t - \bar{R}_t\bigr)\pi_t(a) \quad \text{for all } a \neq A_t
$$

where:

- $\alpha$ is the step-size,
- $R_t$ is the reward received,
- $\bar{R}_t$ is the average reward observed so far.
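To make the update concrete, here is a small worked example with arbitrary numbers (not taken from the lesson). Suppose $\alpha = 0.1$, $R_t = 1$, $\bar{R}_t = 0.4$, the selected action $A_t$ has $H_t(A_t) = 0.2$ with $\pi_t(A_t) = 0.5$, and another action $a$ has $H_t(a) = 0.1$ with $\pi_t(a) = 0.3$:

$$
H_{t+1}(A_t) = 0.2 + 0.1 \cdot (1 - 0.4) \cdot (1 - 0.5) = 0.23
$$

$$
H_{t+1}(a) = 0.1 - 0.1 \cdot (1 - 0.4) \cdot 0.3 = 0.082
$$

Because the reward was above the baseline, the selected action's preference increased while the other action's preference decreased.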
Intuition
On each time step, all preferences are shifted slightly. The direction and size of the shift depend mostly on how the received reward compares to the average reward, and can be explained like this:
- If the received reward is higher than average, the selected action becomes more preferred, while the other actions become less preferred;
- If the received reward is lower than average, the selected action's preference decreases, while the preferences of other actions increase, encouraging exploration.
Sample Code
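Below is a minimal sketch of a gradient bandit agent that follows the formulas above. The class and method names (`GradientBandit`, `select_action`, `update`) and the toy Bernoulli environment are illustrative assumptions, not the course's reference implementation:

```python
import numpy as np

class GradientBandit:
    """Minimal gradient bandit agent (illustrative sketch, not the course's reference code)."""

    def __init__(self, n_actions, alpha=0.1):
        self.n_actions = n_actions
        self.alpha = alpha                # step-size
        self.H = np.zeros(n_actions)      # preferences H_t(a)
        self.avg_reward = 0.0             # baseline: average reward observed so far
        self.t = 0                        # number of steps taken

    def action_probs(self):
        # Softmax over preferences (shifted by the max for numerical stability)
        exp_h = np.exp(self.H - np.max(self.H))
        return exp_h / exp_h.sum()

    def select_action(self):
        return np.random.choice(self.n_actions, p=self.action_probs())

    def update(self, action, reward):
        self.t += 1
        # Incrementally update the average-reward baseline
        self.avg_reward += (reward - self.avg_reward) / self.t
        pi = self.action_probs()
        diff = reward - self.avg_reward
        # Gradient ascent on preferences:
        # all actions get -alpha * diff * pi(a); the chosen action gets an extra +alpha * diff,
        # which together yields +alpha * diff * (1 - pi(A_t)) for the chosen action.
        self.H -= self.alpha * diff * pi
        self.H[action] += self.alpha * diff

# Illustrative usage on a toy Bernoulli bandit (reward probabilities are arbitrary)
if __name__ == "__main__":
    true_probs = [0.2, 0.5, 0.8]
    agent = GradientBandit(n_actions=3, alpha=0.1)
    for _ in range(5000):
        a = agent.select_action()
        r = float(np.random.rand() < true_probs[a])
        agent.update(a, r)
    print("Preferences:", agent.H)
    print("Action probabilities:", agent.action_probs())
```

With enough steps, the preference of the best arm grows relative to the others, so its selection probability approaches 1 while the remaining arms keep a small non-zero probability.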
Additional Information
Gradient bandits have several interesting properties:
- Preference relativity: the absolute values of action preferences do not affect the action selection process — only their relative differences matter. Shifting all preferences by the same constant (e.g., adding 100) results in the same probability distribution (see the sketch after this list);
- Effect of the baseline in the update rule: although the update formula typically includes the average reward as a baseline, this value can be replaced with any constant that is independent of the chosen action. The baseline influences the speed of convergence but does not alter the optimal solution;
- Impact of the step-size: the step-size $\alpha$ should be tuned based on the task at hand. A smaller step-size ensures more stable learning, while a larger value accelerates the learning process.
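As a quick check of the preference relativity property, the snippet below (an illustrative sketch reusing a small softmax helper) shows that adding a constant to all preferences leaves the probabilities unchanged:

```python
import numpy as np

def softmax(h):
    exp_h = np.exp(h - np.max(h))
    return exp_h / exp_h.sum()

H = np.array([0.5, 1.0, -0.3])
shifted = H + 100.0  # shift every preference by the same constant

# Both calls print the same probability distribution,
# because the constant cancels out in the softmax ratio.
print(softmax(H))
print(softmax(shifted))
```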
Summary
Gradient bandits provide a powerful alternative to traditional bandit algorithms by leveraging preference-based learning. Their most interesting feature is their ability to naturally balance exploration and exploitation.