Perhaps the simplest, and the most studied, instance of a bandit problem. We have arms (typically is taken to be finite) each with an unknown reward distribution. We then pull the arms sequentially, attempting either best-arm identification or regret minimization. The MAB problem is perhaps the canonical example of the exploration-exploitation dilemma: Do we explore new arms, reducing our uncertainty about their reward distribution, or stick with the best arm we’ve found so far?
There is a massive literature on this topic. For a detailed overview see this survey.
History
- The introduction of the MAB problem is typically credited to Lai and Robbins in 1985. They gave lower bounds on regret and also introduced the technique of using upper confidence bounds (of a frequentist flavor) to analyze regret.
- In 1995, Agrawal simplified the upper confidence bound to something that is efficiently computable using only the history of each arm.
- In 2002, Peter Auer and others introduced the famous UCB1 algorithm, which builds on Agrawal’s ideas and gives new upper confidence bounds based on Hoeffding’s bound.
Known results
- Lai and Robbins gave the first lower bound for the MAB problem. If is the number of times arm is pulled, they show that for any suboptimal arm (i.e., an arm without the highest mean reward),
where is arm ‘s arm distribution, is the optimal arm, and is the KL divergence. They also gave an upper confidence bound style algorithm which achieves this lower bound asymptotically in one-parameter exponential families.
- This lower bound was later refined by Burnetas and Katehakis in 1996. They replaced with
