Action Values

The action value is a fundamental concept in the MAB problem. It plays a pivotal role in many bandit algorithms, including epsilon-greedy and upper confidence bound, and its primary purpose is to estimate the expected reward of choosing a specific action. It is similar to a state-action value, but is independent of any state due to the stateless nature of the MAB problem. Understanding action values is essential for implementing effective bandit algorithms.

Definition of Action Value

Formally, the action value, denoted as $Q(a)$, represents the expected reward of choosing action $a$:

$$Q(a) = \mathbb{E}[R \mid A = a]$$

where:

  • $R$ is the reward received;
  • $A$ is the action selected.

Since the true reward distribution is typically unknown, we have to estimate $Q(a)$ using observed data.
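For example, consider a hypothetical action $a$ that yields a reward of $1$ with probability $0.7$ and $0$ otherwise (the numbers here are made up purely for illustration). Its true action value is

$$Q(a) = \mathbb{E}[R \mid A = a] = 0.7 \cdot 1 + 0.3 \cdot 0 = 0.7$$

In practice, however, this distribution is hidden from the agent, which is why the estimation methods below are needed.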

Estimating Action Values

There are several ways to estimate $Q(a)$ based on observed rewards. The most common method is the sample average estimate, which calculates the mean reward received from selecting action $a$ up to time $t$:

$$Q_t(a) = \frac{R_1 + R_2 + \dots + R_{N_t(a)}}{N_t(a)} = \frac{\sum_{i=1}^{N_t(a)} R_i}{N_t(a)}$$

where:

  • $Q_t(a)$ is the estimated value of action $a$ at time step $t$;
  • $N_t(a)$ is the number of times action $a$ has been chosen up to time $t$;
  • $R_i$ is the reward obtained on the $i$-th time action $a$ was selected.

As more samples are collected, this estimate converges to the true expected reward $Q_*(a)$, assuming the reward distribution remains stationary.
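As a quick illustration, here is a minimal Python sketch of the sample average estimate for a single action, assuming its observed rewards have already been collected in a list (the reward values are made up):

```python
# Hypothetical rewards observed so far when selecting one particular action
rewards = [1.0, 0.0, 1.0, 1.0]

# Sample average estimate: Q_t(a) = (R_1 + ... + R_{N_t(a)}) / N_t(a)
q_estimate = sum(rewards) / len(rewards)
print(q_estimate)  # 0.75
```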

Incremental Update Rule

While the formula above can be used to estimate action values, doing so directly requires storing every previous reward and recomputing their sum at each time step. With incremental updates, this becomes unnecessary. The incremental update rule can be derived as follows:

$$\begin{aligned}
Q_{k+1} &= \frac{1}{k} \sum_{i=1}^{k} R_i\\
&= \frac{1}{k} \left(R_k + \sum_{i=1}^{k-1} R_i\right)\\
&= \frac{1}{k} \left(R_k + (k-1) Q_k\right)\\
&= \frac{1}{k} \left(R_k + k Q_k - Q_k\right)\\
&= Q_k + \frac{1}{k} \left(R_k - Q_k\right)
\end{aligned}$$

where, for a given action:

  • $Q_k$ is the estimate before the $k$-th reward is observed, equal to the average of the first $k-1$ rewards;
  • $R_k$ is the actual $k$-th reward received.
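To make the derivation concrete, the sketch below (using a made-up reward list) applies the incremental rule step by step and checks that it reproduces the plain sample average without storing all past rewards:

```python
import math

# Hypothetical rewards observed for one action
rewards = [1.0, 0.0, 1.0, 1.0]

q = 0.0  # running estimate; the very first update overwrites this initial value
for k, reward in enumerate(rewards, start=1):
    # Q_{k+1} = Q_k + (1/k) * (R_k - Q_k):
    # move the estimate toward the new reward by 1/k of the current error
    q += (reward - q) / k

# The incremental estimate matches the batch sample average (up to float error)
print(math.isclose(q, sum(rewards) / len(rewards)))  # True
```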

Intuition

Given the current estimate $Q_k$, which serves as a prediction of the $k$-th reward, and the actual $k$-th reward $R_k$, the error can be measured as the difference between these two values. The next estimate is then obtained by adjusting the previous one slightly in the direction of the actual reward, reducing this error.

This intuition leads to a more general form of the update rule:

$$Q_{k+1} = Q_k + \alpha (R_k - Q_k)$$

where $\alpha$ is a step-size parameter controlling the rate of learning. As in the previous formula, $\alpha$ can be set to $\frac{1}{k}$, which yields the sample average estimate. Alternatively, a constant $\alpha$ is commonly used, as it requires no additional memory (to store how many times an action has been taken) and allows adaptation to non-stationary environments by placing more weight on recent observations.
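The sketch below (with an arbitrary reward stream and an arbitrary $\alpha$) illustrates the non-stationary case: after the underlying reward level shifts, the constant step-size estimate tracks the recent rewards rather than the long-run average:

```python
# Hypothetical reward stream whose underlying mean jumps up halfway through
rewards = [0.2, 0.3, 0.1, 0.2, 0.9, 1.0, 0.8, 0.9]

alpha = 0.5  # constant step size: recent rewards receive exponentially more weight
q = 0.0

for reward in rewards:
    # Q_{k+1} = Q_k + alpha * (R_k - Q_k)
    q += alpha * (reward - q)

# The estimate ends up near the recent rewards (~0.84),
# not the overall average of all rewards (0.55)
print(round(q, 3))
```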

Optimistic Initialization

At the beginning of training, action value estimates are based on very few samples and can vary significantly, which may lead to premature exploitation. This means the agent may exploit its initial knowledge too early, favoring suboptimal actions based on limited experience. To mitigate this issue and encourage initial exploration, one simple and effective technique is optimistic initialization.

In optimistic initialization, action values are initialized to relatively high values (e.g., $Q_0(a) = 1$ instead of $0$). This creates the impression that all actions are initially promising, so the agent is incentivized to explore each action multiple times before settling on the best choice. This technique is most effective when used in combination with a constant step size.
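As a rough illustration, here is a minimal sketch of a purely greedy agent that combines optimistic initial values with a constant step size; the three-armed bandit, its true values, and all hyperparameters below are assumptions made up for this example:

```python
import numpy as np

rng = np.random.default_rng(0)
true_values = np.array([0.2, 0.5, 0.8])  # hidden expected rewards of three arms

q = np.full(3, 1.0)  # optimistic initial estimates: every arm looks promising
alpha = 0.1          # constant step size, combined with optimism as suggested above

for _ in range(200):
    action = int(np.argmax(q))                     # purely greedy selection
    reward = rng.normal(true_values[action], 0.1)  # noisy reward from the chosen arm
    q[action] += alpha * (reward - q[action])      # constant step-size update

# The optimism forces the agent to try every arm before its estimates settle,
# so it typically ends up preferring the best arm (index 2 here)
print(np.argmax(q), np.round(q, 2))
```

Because every estimate starts higher than the arms' true values, early pulls tend to be "disappointing", which pushes the agent to keep switching arms until the estimates approach their true values.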
