Perhaps the simplest, and the most studied, instance of a bandit problem. We have arms (typically is taken to be finite) each with an unknown reward distribution. We then pull the arms sequentially, attempting either best-arm identification or regret minimization. The MAB problem is perhaps the canonical example of the exploration-exploitation dilemma: Do we explore new arms, reducing our uncertainty about their reward distribution, or stick with the best arm we’ve found so far?

There is a massive literature on this topic. For a detailed overview see this survey.

History

  • The introduction of the MAB problem is typically credited to Lai and Robbins in 1985. They gave lower bounds on regret and also introduced the technique of using upper confidence bounds (of a frequentist flavor) to analyze regret.
  • In 1995, Agrawal simplified the upper confidence bound to something that is efficiently computable using only the history of each arm.
  • In 2002, Peter Auer and others introduced the famous UCB1 algorithm, which builds on Agrawal’s ideas and gives new upper confidence bounds based on Hoeffding’s bound.

Known results

  • Lai and Robbins gave the first lower bound for the MAB problem. If is the number of times arm is pulled, they show that for any suboptimal arm (i.e., an arm without the highest mean reward),

where is arm ‘s arm distribution, is the optimal arm, and is the KL divergence. They also gave an upper confidence bound style algorithm which achieves this lower bound asymptotically in one-parameter exponential families.

  • This lower bound was later refined by Burnetas and Katehakis in 1996. They replaced with
KL_{\inf}(a,a^*) = \inf_{P\in\calP: \E_P[X]< \E_{P_{a^*}}[X]} \kl(P,P_{a^*}). $$This looks a bit gross but is intuitive: You don't need to compare the precise divergence between the arm distributions, just the worst case among the distributions under consideration. For many distribution classes $\calP$ these will be the same thing, but there are classes for which KL-inf is strictly smaller. - The famous [UCB1 algorithm of Auer et al 2002](https://aima.eecs.berkeley.edu/~russell/classes/cs294/s11/readings/Auer+al:2002.pdf) matches these lower bounds at all times (up to constants), instead of asymptotically. It holds for any bounded reward distributions, though they also give a specialised UCB1-Normal algorithm for Gaussian rewards. - For Bayesian bandits with discounted rewards, the Gittins Index algorithm is optimal. - [Agrawal et al.](https://proceedings.mlr.press/v134/agrawal21a/agrawal21a.pdf) give asymptotically optimal (up to first order terms) algorithms for regret minimization in the heavy-tailed setting: The reward distributions are assumed to only have a $(1+\eps)$ moment for some $\eps>0$.