Here we list concentration inequalities for scalar-valued random variables that are bounded with probability 1 (sometimes only bounded on one side). This is in contrast to light-tailed, unbounded scalar concentration, which does not assume boundedness, and to heavy-tailed concentration, where we assume only a few moments.
Many of these concentration inequalities are proven using corresponding exponential inequalities. There are also time-uniform versions of many of these inequalities which stem from the sub-psi process given by the corresponding exponential inequality.
Hoeffding’s bound
Let $X_1, \dots, X_n$ be independent and set $S_n = \sum_{i=1}^n X_i$. Suppose $a_i \le X_i \le b_i$ almost surely for each $i$. Hoeffding showed that, for any $t > 0$,

$$\mathbb{P}(S_n - \mathbb{E}S_n \ge t) \le \exp\left(-\frac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right).$$
The natural two-sided version also exists. This is proved with the Chernoff method (sketched below). Hoeffding’s bound is generalized by McDiarmid’s inequality (bounded difference inequalities).
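For completeness, here is the shape of the Chernoff argument (a standard sketch; the second step is Hoeffding’s lemma applied to each summand): for any $\lambda > 0$,

$$\mathbb{P}(S_n - \mathbb{E}S_n \ge t) \le e^{-\lambda t}\, \mathbb{E}\, e^{\lambda (S_n - \mathbb{E}S_n)} \le \exp\left(-\lambda t + \frac{\lambda^2}{8} \sum_{i=1}^n (b_i - a_i)^2\right),$$

and taking $\lambda = 4t / \sum_{i=1}^n (b_i - a_i)^2$ yields the bound above.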
Hoeffding’s bound is suboptimal in a somewhat cheap and straightforward way: if we take $t > \sum_{i=1}^n (b_i - \mathbb{E}X_i)$, then $\mathbb{P}(S_n - \mathbb{E}S_n \ge t) = 0$ since $S_n$ can never exceed $\sum_i b_i$, but the bound does not capture this behavior. However, it’s also suboptimal in a more fundamental way, namely that there’s a missing factor of order $\sigma\sqrt{n}/t$, which can be seen by appealing to central limit theorems. See the missing factor in Hoeffding’s bounds. This factor is recovered by Talagrand’s inequality and Bentkus’ inequality below.
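To see where this factor comes from: if the $X_i$ are iid with variance $\sigma^2$, the central limit theorem suggests $\mathbb{P}(S_n - \mathbb{E}S_n \ge t) \approx \overline{\Phi}(x)$ with $x = t/(\sigma\sqrt{n})$, where $\overline{\Phi}$ is the standard Gaussian tail, and

$$\overline{\Phi}(x) \le \frac{\phi(x)}{x} = \frac{1}{x\sqrt{2\pi}}\, e^{-x^2/2},$$

whereas the Chernoff method, applied even to a Gaussian itself, produces only $e^{-x^2/2}$. The Gaussian tail is thus smaller by a factor of order $1/x = \sigma\sqrt{n}/t$.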
Bennett’s inequality
Hoeffding’s bound doesn’t use any information besides the boundedness of the observations. It therefore must (implicitly) assume a worst-case bound on the variance. If we know a bound on the variance, we can do better. Both Bennett’s inequality and Bernstein’s inequality use such information to tighten the bound.
Let $X_1, \dots, X_n$ be independent with finite variance and one-sided boundedness, i.e., $X_i \le B$ almost surely for some $B > 0$. If $\mathbb{E}X_i = 0$ for all $i$ and $\sigma^2 = \frac{1}{n}\sum_{i=1}^n \mathbb{E}X_i^2$, then for all $t \ge 0$,

$$\mathbb{P}(S_n \ge t) \le \exp\left(-\frac{n\sigma^2}{B^2}\, h\!\left(\frac{Bt}{n\sigma^2}\right)\right),$$
where $h(u) = (1+u)\log(1+u) - u$. If we assume that $|X_i| \le B$, then we can get a bound on the variance, namely $\sigma^2 \le B^2$, and plug it in above. This trend of first presenting a result using only the boundedness of the observations and then giving a variance-adaptive result is a common one; see from boundedness to variance adaptivity.
If the $X_i$ have conditional mean zero (i.e., they form a martingale difference sequence), then we can replace $\mathbb{E}X_i^2$ with $\mathbb{E}_{i-1}[X_i^2]$, where $\mathbb{E}_{i-1}$ is the expectation conditional on the first $i-1$ observations.
Like Hoeffding’s bound, Bennett’s inequality does not recover the missing factor of order $\sigma\sqrt{n}/t$ that we expect from the central limit theorem. This is because it is also based on the Chernoff method, but it bounds the exponential function a little differently than Hoeffding’s bound does, in order to take advantage of the variance information.
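Concretely, the standard step in Bennett’s proof uses that $x \mapsto (e^{\lambda x} - 1 - \lambda x)/x^2$ is nondecreasing, so that for $x \le B$,

$$e^{\lambda x} \le 1 + \lambda x + \frac{x^2}{B^2}\left(e^{\lambda B} - 1 - \lambda B\right),$$

whence $\mathbb{E}\, e^{\lambda X_i} \le \exp\left(\frac{\mathbb{E}X_i^2}{B^2}\left(e^{\lambda B} - 1 - \lambda B\right)\right)$. Optimizing the resulting Chernoff bound over $\lambda > 0$ produces the function $h$ above; Hoeffding’s lemma instead bounds the moment generating function using only the range.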
Bernstein bound 1
There are several bounds that go under the name “Bernstein bound”. The most common is perhaps a relaxation of Bennett’s bound, which uses that

$$h(u) \ge \frac{u^2}{2(1 + u/3)}$$

to show that

$$\mathbb{P}(S_n \ge t) \le \exp\left(-\frac{t^2}{2(n\sigma^2 + Bt/3)}\right),$$
where $\sigma^2$ and $B$ are as above. We can of course obtain two-sided versions of Bennett’s and Bernstein’s bounds if we assume two-sided boundedness ($|X_i| \le B$), apply the bound twice, and use a union bound.
A useful form of Bernstein’s bound that I’m writing down because re-deriving it is annoying is the following: with probability at least $1 - \delta$,

$$S_n \le \sqrt{2n\sigma^2 \log(1/\delta)} + \frac{2B\log(1/\delta)}{3}.$$
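To verify this without solving the quadratic exactly, write $v = n\sigma^2$ and $L = \log(1/\delta)$ and plug $t = \sqrt{2vL} + \frac{2BL}{3}$ into Bernstein’s exponent:

$$t^2 = 2vL + \frac{4BL}{3}\sqrt{2vL} + \frac{4B^2L^2}{9} \ge 2vL + \frac{2BL}{3}\left(\sqrt{2vL} + \frac{2BL}{3}\right) = 2\left(v + \frac{Bt}{3}\right)L,$$

so $\mathbb{P}(S_n \ge t) \le \exp\left(-\frac{t^2}{2(v + Bt/3)}\right) \le e^{-L} = \delta$.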
There are two regimes in Bernstein’s bound: a sub-Gaussian regime where the tail decays at a sub-Gaussian rate, and a sub-exponential regime where it decays at a sub-exponential rate. The former occurs when the contribution of $Bt/3$ is small relative to $n\sigma^2$, so the tail decays as $\exp(-t^2/(4n\sigma^2))$. When $Bt/3$ is large relative to $n\sigma^2$, the bound decays as $\exp(-3t/(4B))$. Therefore, if we have good a priori knowledge of small variance, Bernstein’s (and Bennett’s) inequality can be a big improvement over Hoeffding’s, scaling as $\sigma\sqrt{n\log(1/\delta)}$ instead of $B\sqrt{n\log(1/\delta)}$.
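As a quick numerical illustration (a hypothetical example with made-up parameters; the code just evaluates the two high-probability widths above):

```python
# Compare high-probability deviation widths: Hoeffding vs. Bernstein,
# for mean-zero observations in [-B, B] with variance sigma2 << B^2.
import numpy as np

n, B, sigma2, delta = 10_000, 1.0, 0.01, 0.05
L = np.log(1 / delta)

# Hoeffding: P(S_n >= t) <= exp(-t^2 / (2 n B^2))  =>  t = B * sqrt(2 n L)
hoeffding = B * np.sqrt(2 * n * L)
# Bernstein: t = sqrt(2 n sigma2 L) + 2 B L / 3
bernstein = np.sqrt(2 * n * sigma2 * L) + 2 * B * L / 3

print(f"Hoeffding width: {hoeffding:.1f}")   # ~ 244.8
print(f"Bernstein width: {bernstein:.1f}")   # ~ 26.5, roughly a 10x improvement
```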
This result was first presented by Bernstein in 1927. It was not until 1962 that it was sharpened by Bennett, resulting in Bennett’s inequality above.
Hoeffding’s bound remastered
It is somewhat of a shame that Hoeffding’s name came to be associated with the first bound above, since this is much weaker than the following more general result that he proved in his famous 1963 paper, which recovers both Bennett’s and Bernstein’s bounds.
If $X_1, \dots, X_n$ are iid with mean zero, variance $\sigma^2$, and lying in $(-\infty, b]$, then, writing $\bar{X}_n = S_n/n$, for all $t \ge 0$,

$$\mathbb{P}(\bar{X}_n \ge t) \le \exp\left(-n\, \mathrm{kl}(p + s \,\|\, p)\right), \qquad p = \frac{\sigma^2}{\sigma^2 + b^2}, \quad s = \frac{bt}{\sigma^2 + b^2},$$

where $\mathrm{kl}(\cdot\|\cdot)$ denotes the binary relative entropy: $\mathrm{kl}(p\|q) = p\log\frac{p}{q} + (1-p)\log\frac{1-p}{1-q}$. If $t > b$ then $p + s > 1$, and we interpret $\mathrm{kl}(p + s\|p)$ as infinity, meaning the bound cuts off (equals zero) at that point. Thus, this bound circumvents one of the drawbacks mentioned above.
This is the sharpest bound using the Chernoff method. We can recover the first Hoeffding bound by assuming worst-case variance.
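To make the last claim concrete (a standard two-line reduction): a mean-zero random variable taking values in $[a, b]$ (with $a \le 0 \le b$) has variance at most $-ab$. Plugging $\sigma^2 = -ab$ into the bound gives $p = \frac{-a}{b - a}$ and $s = \frac{t}{b - a}$, and Pinsker’s inequality $\mathrm{kl}(p + s \,\|\, p) \ge 2s^2$ yields

$$\mathbb{P}(\bar{X}_n \ge t) \le \exp\left(-n\,\mathrm{kl}(p + s\|p)\right) \le \exp\left(-\frac{2nt^2}{(b-a)^2}\right),$$

which is the first Hoeffding bound.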
Talagrand’s inequality
Talagrand’s inequality improves Hoeffding’s inequality above by recovering the missing factor in Hoeffding’s bounds. If $X_1, \dots, X_n$ are iid with mean zero, variance $\sigma^2$, and lying in $[-B, B]$, then for all $0 \le t \le n\sigma^2/(cB)$,

$$\mathbb{P}(S_n \ge t) \le c\, \theta\!\left(\frac{t}{\sigma\sqrt{n}}\right) \exp\left(-\frac{t^2}{2n\sigma^2}\right)$$

for some universal constant $c$, where

$$\theta(x) = \frac{\overline{\Phi}(x)}{\phi(x)}$$

is the Mills ratio of the standard Gaussian, which is of the order $\min(1, 1/x)$. Therefore, for $t \ge \sigma\sqrt{n}$, we have $\theta(t/(\sigma\sqrt{n})) \lesssim \sigma\sqrt{n}/t$. Hence the multiplier in front of the exponential is of the order $\sigma\sqrt{n}/t$, which is clearly an improvement over Hoeffding’s bound above.
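The claimed order of $\theta$ follows from the standard Mills ratio bounds on the Gaussian tail,

$$\frac{x}{x^2 + 1}\,\phi(x) \le \overline{\Phi}(x) \le \frac{\phi(x)}{x} \qquad (x > 0),$$

together with the fact that $\theta$ is decreasing with $\theta(0) = \sqrt{\pi/2}$, so that $\theta(x) \asymp \min(1, 1/x)$.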
Talagrand proved this bound in 1995, and it was the first improvement since Hoeffding’s original article in 1963.
Bentkus’ inequality
Bentkus’ inequality is another way to recover the missing factor of order $\sigma\sqrt{n}/t$. The approach is significantly different from Talagrand’s, and is based on interpolating between Markov and Chernoff. Instead of applying Markov’s inequality to the exponential, we apply it to the function $x \mapsto (x - u)_+^2$ for any $u < t$ and obtain, for independent $X_1, \dots, X_n$ with finite variance,

$$\mathbb{P}(S_n \ge t) \le \inf_{u < t} \frac{\mathbb{E}(S_n - u)_+^2}{(t - u)^2}.$$
The right hand side can be bounded by considering the worst-case random variables. If $\mathbb{E}X_i = 0$ and $X_i \le B$ with $\mathbb{E}X_i^2 \le \sigma^2$, then defining iid random variables $Z_1, \dots, Z_n$ with $\mathbb{P}(Z_i = B) = \frac{\sigma^2}{\sigma^2 + B^2}$ and $\mathbb{P}(Z_i = -\sigma^2/B) = \frac{B^2}{\sigma^2 + B^2}$ (which have mean zero and variance $\sigma^2$), we obtain that

$$\mathbb{P}(S_n \ge t) \le \inf_{u < t} \frac{\mathbb{E}(T_n - u)_+^2}{(t - u)^2}, \qquad T_n = \sum_{i=1}^n Z_i.$$
This is not a closed-form bound, but it can be computed numerically. See e.g. here and references therein.
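As a sketch of how one might compute it (a hypothetical implementation; `bentkus_bound` is our own helper name, and SciPy is assumed): since $T_n = KB - (n-K)\sigma^2/B$ with $K \sim \mathrm{Binomial}(n, p)$ and $p = \sigma^2/(\sigma^2 + B^2)$, the infimum can be evaluated directly.

```python
# A minimal sketch (not a library function), assuming iid X_i <= B
# with E[X_i] = 0 and E[X_i^2] <= sigma2.
import numpy as np
from scipy.stats import binom
from scipy.optimize import minimize_scalar

def bentkus_bound(n, B, sigma2, t):
    """Numerically evaluate inf_{u<t} E(T_n - u)_+^2 / (t - u)^2, where T_n
    is the sum of n iid worst-case two-point variables."""
    p = sigma2 / (sigma2 + B**2)               # P(Z_i = B); else Z_i = -sigma2/B
    k = np.arange(n + 1)
    support = k * B - (n - k) * sigma2 / B     # values of T_n when K = k
    pmf = binom.pmf(k, n, p)

    def markov_bound(u):
        # Markov's inequality applied to (T_n - u)_+^2 at level t
        return np.dot(pmf, np.maximum(support - u, 0.0) ** 2) / (t - u) ** 2

    res = minimize_scalar(markov_bound, bounds=(support.min(), t - 1e-9),
                          method="bounded")
    return min(1.0, res.fun)

# Example: n = 100 observations bounded above by 1, variance 0.1, threshold t = 10.
print(bentkus_bound(n=100, B=1.0, sigma2=0.1, t=10.0))
```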
Empirical Bernstein bounds
Empirical Bernstein bounds replace the oracle variance with an estimated variance. This is useful because the true variance is not always known. These bounds deserve their own page: empirical Bernstein bounds.
References
- Concentration inequalities by Boucheron, Lugosi, and Massart.
- On the Bennett-Hoeffding inequality by Pinelis.
- On Hoeffding’s inequalities by Bentkus.