Introduction to Reinforcement Learning
Gradient Bandits Algorithm
When dealing with multi-armed bandits, traditional methods like epsilon-greedy and UCB estimate action values to decide which action to take. However, gradient bandits take a different approach — they learn preferences for actions instead of estimating their values. These preferences are adjusted over time using stochastic gradient ascent.
Preferences
Instead of maintaining action value estimates $Q_t(a)$, gradient bandits maintain preference values $H_t(a)$ for each action $a$. These preferences are updated using a stochastic gradient ascent approach to maximize expected rewards. Each action's probability is computed using a softmax function:

$$\pi_t(a) = P(A_t = a) = \frac{e^{H_t(a)}}{\sum_{b=1}^{n} e^{H_t(b)}}$$

where:
- $H_t(a)$ is the preference for action $a$ at time step $t$;
- $\pi_t(a)$ is the probability of selecting action $a$ at time step $t$;
- The denominator ensures that the probabilities sum to 1.
Softmax is a crucial function in ML, commonly used to convert a vector of real numbers into a probability distribution. It serves as a smooth approximation to the $\arg\max$ function, enabling natural exploration by giving lower-preference actions a non-zero chance of being selected.
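As a quick illustration, here is a minimal sketch (using NumPy, with made-up preference values) that turns a small preference vector into selection probabilities:

import numpy as np

# Hypothetical preferences for three actions
H = np.array([2.0, 1.0, 0.0])

# Softmax converts preferences into a probability distribution
probs = np.exp(H - np.max(H)) / np.sum(np.exp(H - np.max(H)))
print(probs)        # approximately [0.665, 0.245, 0.090]
print(probs.sum())  # 1.0

The action with the highest preference gets the largest probability, but every action keeps a non-zero chance of being selected.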
Update Rule
After selecting an action $A_t$ at time step $t$, the preference values are updated using the following rule:

$$H_{t+1}(A_t) = H_t(A_t) + \alpha (R_t - \bar{R}_t)\bigl(1 - \pi_t(A_t)\bigr)$$
$$H_{t+1}(a) = H_t(a) - \alpha (R_t - \bar{R}_t)\pi_t(a) \quad \text{for all } a \neq A_t$$

where:
- $\alpha$ is the step-size;
- $R_t$ is the reward received;
- $\bar{R}_t$ is the average reward observed so far.
Intuition
On each time step, all preferences are shifted slightly. The size and direction of the shift depend on the difference between the received reward and the average reward:
- If the received reward is higher than the average, the selected action becomes more preferred, while the other actions become less preferred;
- If the received reward is lower than the average, the selected action's preference decreases, while the preferences of the other actions increase, encouraging exploration.
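The sketch below applies a single update step by hand, using made-up values for the preferences, the reward, and the average reward; it only serves to show the direction in which the preferences move:

import numpy as np

alpha = 0.1                            # step-size
H = np.array([0.5, 0.0, -0.5])         # current preferences (hypothetical)
probs = np.exp(H) / np.sum(np.exp(H))  # softmax probabilities

action = 0          # the selected action
reward = 2.0        # reward received
reward_avg = 1.0    # average reward so far (the baseline)

# Gradient bandit update: the reward is above the baseline,
# so the selected action's preference rises and the others fall
H_new = H - alpha * (reward - reward_avg) * probs
H_new[action] += alpha * (reward - reward_avg)
print(H_new)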
Sample Code
import numpy as np

def softmax(x):
    """Simple softmax implementation"""
    # Subtracting the maximum improves numerical stability
    z = x - np.max(x)
    return np.exp(z) / np.sum(np.exp(z))

class GradientBanditsAgent:
    def __init__(self, n_actions, alpha):
        """Initialize an agent"""
        self.n_actions = n_actions    # Number of available actions
        self.alpha = alpha            # Step-size
        self.H = np.zeros(n_actions)  # Preferences
        self.reward_avg = 0           # Average reward
        self.t = 0                    # Time step counter

    def select_action(self):
        """Select an action according to the gradient bandits strategy"""
        # Compute probabilities from preferences with softmax
        probs = softmax(self.H)
        # Choose an action according to the probabilities
        return np.random.choice(self.n_actions, p=probs)

    def update(self, action, reward):
        """Update preferences"""
        # Increase the time step counter
        self.t += 1
        # Incrementally update the average reward
        self.reward_avg += (reward - self.reward_avg) / self.t
        # Compute probabilities from preferences with softmax
        probs = softmax(self.H)
        # Update preferences using stochastic gradient ascent:
        # lower every preference in proportion to its probability...
        self.H -= self.alpha * (reward - self.reward_avg) * probs
        # ...then add the full step back for the selected action,
        # leaving it with alpha * (reward - avg) * (1 - probs[action])
        self.H[action] += self.alpha * (reward - self.reward_avg)
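Below is a minimal usage sketch building on the class above; it assumes a simple simulated bandit with Gaussian rewards, and the arm means, seed, and hyperparameter values are invented for illustration only:

np.random.seed(0)

true_means = [1.0, 2.0, 1.5]  # hypothetical true mean rewards of each arm
agent = GradientBanditsAgent(n_actions=3, alpha=0.1)

for _ in range(1000):
    action = agent.select_action()
    # Noisy reward drawn from the chosen arm
    reward = np.random.normal(true_means[action], 1.0)
    agent.update(action, reward)

# The best arm (index 1) should end up with the highest preference
print(agent.H)
print(softmax(agent.H))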
Additional Information
Gradient bandits have several interesting properties:
- Preference relativity: the absolute values of action preferences do not affect the action selection process; only their relative differences matter. Shifting all preferences by the same constant (e.g., adding 100) results in the same probability distribution, as verified in the snippet after this list;
- Effect of the baseline in the update rule: although the update formula typically includes the average reward as a baseline, this value can be replaced with any constant that is independent of the chosen action. The baseline influences the speed of convergence but does not alter the optimal solution;
- Impact of the step-size: the step-size should be tuned to the task at hand. A smaller step-size gives more stable learning, while a larger value accelerates the learning process.
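The preference relativity property is easy to check numerically. The snippet below reuses the softmax function from the sample code above, with arbitrary example preference values:

H = np.array([0.2, -1.0, 3.0])  # arbitrary example preferences

p_original = softmax(H)
p_shifted = softmax(H + 100)    # shift every preference by the same constant

# Both distributions are identical (up to floating-point error)
print(np.allclose(p_original, p_shifted))  # True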
Summary
Gradient bandits provide a powerful alternative to traditional bandit algorithms by leveraging preference-based learning. Their most notable feature is their ability to naturally balance exploration and exploitation.