Introduction to Reinforcement Learning
Gradient Bandits Algorithm
When dealing with multi-armed bandits, traditional methods like epsilon-greedy and UCB estimate action values to decide which action to take. However, gradient bandits take a different approach — they learn preferences for actions instead of estimating their values. These preferences are adjusted over time using stochastic gradient ascent.
Preferences
Instead of maintaining action value estimates $Q_t(a)$, gradient bandits maintain preference values $H_t(a)$ for each action $a$. These preferences are updated using a stochastic gradient ascent approach to maximize expected rewards. Each action's probability is computed using a softmax function:

$$\pi_t(a) = P(A_t = a) = \frac{e^{H_t(a)}}{\sum_{b=1}^{n} e^{H_t(b)}}$$

where:
- $H_t(a)$ is the preference for action $a$ at time step $t$;
- $\pi_t(a)$ is the probability of selecting action $a$ at time step $t$;
- The denominator ensures that the probabilities sum to 1.
Softmax is a crucial function in ML, commonly used to convert a vector of real numbers into a probability distribution. It serves as a smooth approximation to the $\arg\max$ function, enabling natural exploration by giving lower-preference actions a non-zero chance of being selected.
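As a quick illustration, here is a minimal sketch (using NumPy, with made-up preference values) that turns a small preference vector into selection probabilities:

import numpy as np

# Hypothetical preferences for three actions
H = np.array([2.0, 1.0, 0.0])

# Softmax converts preferences into a probability distribution
probs = np.exp(H - np.max(H)) / np.sum(np.exp(H - np.max(H)))
print(probs)        # approximately [0.665, 0.245, 0.090]
print(probs.sum())  # 1.0

The action with the highest preference gets the largest probability, but every action keeps a non-zero chance of being selected.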
Update Rule
After selecting an action $A_t$ at time step $t$, the preference values are updated using the following rule:

$$H_{t+1}(A_t) = H_t(A_t) + \alpha (R_t - \bar{R}_t)\bigl(1 - \pi_t(A_t)\bigr)$$
$$H_{t+1}(a) = H_t(a) - \alpha (R_t - \bar{R}_t)\pi_t(a) \quad \text{for all } a \neq A_t$$

where:
- $\alpha$ is the step-size;
- $R_t$ is the reward received;
- $\bar{R}_t$ is the average reward observed so far.
Intuition
On each time step, all preferences are shifted slightly. The size and direction of the shift depend on the difference between the received reward and the average reward:
- If the received reward is higher than the average, the selected action becomes more preferred, while the other actions become less preferred;
- If the received reward is lower than the average, the selected action's preference decreases, while the preferences of the other actions increase, encouraging exploration.
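The sketch below applies a single update step by hand, using made-up values for the preferences, the reward, and the average reward; it only serves to show the direction in which the preferences move:

import numpy as np

alpha = 0.1                            # step-size
H = np.array([0.5, 0.0, -0.5])         # current preferences (hypothetical)
probs = np.exp(H) / np.sum(np.exp(H))  # softmax probabilities

action = 0          # the selected action
reward = 2.0        # reward received
reward_avg = 1.0    # average reward so far (the baseline)

# Gradient bandit update: the reward is above the baseline,
# so the selected action's preference rises and the others fall
H_new = H - alpha * (reward - reward_avg) * probs
H_new[action] += alpha * (reward - reward_avg)
print(H_new)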
Sample Code
import numpy as np

def softmax(x):
    """Simple softmax implementation"""
    # Subtracting the maximum improves numerical stability
    z = x - np.max(x)
    return np.exp(z) / np.sum(np.exp(z))

class GradientBanditsAgent:
    def __init__(self, n_actions, alpha):
        """Initialize an agent"""
        self.n_actions = n_actions    # Number of available actions
        self.alpha = alpha            # Step-size
        self.H = np.zeros(n_actions)  # Preferences
        self.reward_avg = 0           # Average reward
        self.t = 0                    # Time step counter

    def select_action(self):
        """Select an action according to the gradient bandits strategy"""
        # Compute probabilities from preferences with softmax
        probs = softmax(self.H)
        # Choose an action according to the probabilities
        return np.random.choice(self.n_actions, p=probs)

    def update(self, action, reward):
        """Update preferences"""
        # Increase the time step counter
        self.t += 1
        # Incrementally update the average reward
        self.reward_avg += (reward - self.reward_avg) / self.t
        # Compute probabilities from preferences with softmax
        probs = softmax(self.H)
        # Update preferences using stochastic gradient ascent:
        # lower every preference in proportion to its probability...
        self.H -= self.alpha * (reward - self.reward_avg) * probs
        # ...then add the full step back for the selected action,
        # leaving it with alpha * (reward - avg) * (1 - probs[action])
        self.H[action] += self.alpha * (reward - self.reward_avg)
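Below is a minimal usage sketch building on the class above; it assumes a simple simulated bandit with Gaussian rewards, and the arm means, seed, and hyperparameter values are invented for illustration only:

np.random.seed(0)

true_means = [1.0, 2.0, 1.5]  # hypothetical true mean rewards of each arm
agent = GradientBanditsAgent(n_actions=3, alpha=0.1)

for _ in range(1000):
    action = agent.select_action()
    # Noisy reward drawn from the chosen arm
    reward = np.random.normal(true_means[action], 1.0)
    agent.update(action, reward)

# The best arm (index 1) should end up with the highest preference
print(agent.H)
print(softmax(agent.H))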
Additional Information
Gradient bandits have several interesting properties:
- Preference relativity: the absolute values of action preferences do not affect the action selection process; only their relative differences matter. Shifting all preferences by the same constant (e.g., adding 100) results in the same probability distribution, as verified in the snippet after this list;
- Effect of the baseline in the update rule: although the update formula typically includes the average reward as a baseline, this value can be replaced with any constant that is independent of the chosen action. The baseline influences the speed of convergence but does not alter the optimal solution;
- Impact of the step-size: the step-size should be tuned to the task at hand. A smaller step-size gives more stable learning, while a larger value accelerates the learning process.
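The preference relativity property is easy to check numerically. The snippet below reuses the softmax function from the sample code above, with arbitrary example preference values:

H = np.array([0.2, -1.0, 3.0])  # arbitrary example preferences

p_original = softmax(H)
p_shifted = softmax(H + 100)    # shift every preference by the same constant

# Both distributions are identical (up to floating-point error)
print(np.allclose(p_original, p_shifted))  # True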
Summary
Gradient bandits provide a powerful alternative to traditional bandit algorithms by leveraging preference-based learning. Their most notable feature is their ability to naturally balance exploration and exploitation.