Here we list concentration inequalities for scalar-valued random variables that are bounded with probability 1 (sometimes only bounded on one side). This is in contrast to light-tailed, unbounded scalar concentration, which does not assume boundedness, and heavy-tailed concentration, where we assume only a few moments.

Many of these concentration inequalities are proven using corresponding exponential inequalities. There are also time-uniform versions of many of these inequalities which stem from the sub-psi process given by the corresponding exponential inequality.

Hoeffding’s bound

Let $X_{1}, \dots, X_{n}$ be independent and set $S_{n} = \sum_{i \leq n} (X_{i} - E X_{i})$. Suppose $a_{i} \leq X_{i} \leq b_{i}$. Hoeffding showed that, for any $t > 0$,

$$ P (S_{n} \geq t) \leq \exp \left( \frac{- 2 t^{2}}{\sum_{i = 1}^{n} (b_{i} - a_{i})^{2}} \right) . $$

The natural two-sided version also exists. The bound is proved with the Chernoff method. Hoeffding’s bound is generalized by McDiarmid’s inequality (bounded difference inequalities).
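As a quick numerical illustration, here is a minimal sketch of the one-sided bound with the squared widths $(b_{i} - a_{i})^{2}$ in the denominator. The function name `hoeffding_bound` and the example values are my own choices for illustration, not part of any library.

```python
import math

def hoeffding_bound(t, a, b):
    """Hoeffding upper bound on P(S_n >= t) for independent X_i in [a_i, b_i].

    S_n is the centered sum, so only the widths b_i - a_i matter.
    """
    if t <= 0:
        return 1.0  # the bound is trivial for non-positive t
    width_sq = sum((bi - ai) ** 2 for ai, bi in zip(a, b))
    return math.exp(-2.0 * t ** 2 / width_sq)

# Hypothetical example: 100 observations in [0, 1], deviation t = 10.
n = 100
bound = hoeffding_bound(10.0, [0.0] * n, [1.0] * n)  # exp(-2), about 0.135
two_sided = min(1.0, 2.0 * bound)                    # union bound over both tails
print(bound, two_sided)
```

The two-sided value simply doubles the one-sided bound, which is the union bound argument.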

Hoeffding’s bound is suboptimal in a somewhat cheap and straightforward way: $S_{n}$ can never exceed $\sum_{i \leq n} (b_{i} - E X_{i})$, so $P (S_{n} \geq t) = 0$ for any $t$ larger than this, but the bound does not capture this behavior. However, it’s also suboptimal in a more fundamental way, namely that there’s a missing factor of $1/t$, which can be seen by appealing to central limit theorems. See the missing factor in Hoeffding’s bounds. This factor is recovered by Talagrand’s inequality and Bentkus’ inequality below.

Bennett’s inequality

Hoeffding’s bound doesn’t use any information besides boundedness of the observations. It therefore must (implicitly) assume a worst-case bound on the variance. If we know a bound on the variance, we can do better. Both Bennett’s inequality and Bernstein’s inequality use such information to tighten the bound.

Let $X_{1}, \dots, X_{n}$ be independent with finite variance and one-sided boundedness, i.e., $X_{i} \leq B$ for some $B$. If $σ^{2} = \sum_{i \leq n} E [(X_{i} - E X_{i})^{2}]$ and $S_{n} = \sum_{i \leq n} (X_{i} - E X_{i})$ then, for any $t > 0$,

$$ P (S_{n} \geq t) \leq \exp \left( - \frac{σ^{2}}{B^{2}} h \left( \frac{B t}{σ^{2}} \right) \right), $$

where $h (u) = (1 + u) \log (1 + u) - u$. If we assume that $|X_{i}| \leq B$ then we can get a bound on $|S_{n}|$. This trend of first presenting a result using only the boundedness of observations and then giving a variance-adaptive result is a common one, see from boundedness to variance adaptivity.

If the $X_{i}$ are not independent but each has conditional mean $μ$ given the past, then we can replace $σ^{2}$ with $\sum_{i \leq n} E_{i - 1} (X_{i} - μ)^{2}$, where $E_{i - 1}$ is the expectation conditional on the first $i - 1$ observations.

Like Hoeffding’s bound, Bennett’s inequality does not recover the missing factor of $1/t$ that we expect from the central limit theorem. This is because it is also based on the Chernoff method; it merely bounds the exponential moment a little differently from Hoeffding’s bound in order to take advantage of the variance information.
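To see how much the variance information can buy, here is a minimal sketch that evaluates Bennett’s bound next to Hoeffding’s. The names `bennett_h` and `bennett_bound` and the example numbers (100 observations in $[0, 1]$, each with variance 0.01) are assumptions made for illustration.

```python
import math

def bennett_h(u):
    """h(u) = (1 + u) log(1 + u) - u, using log1p for accuracy near u = 0."""
    return (1.0 + u) * math.log1p(u) - u

def bennett_bound(t, sigma_sq, B):
    """Bennett upper bound on P(S_n >= t).

    sigma_sq is the SUM of the variances and B the one-sided bound X_i <= B.
    """
    if t <= 0:
        return 1.0
    return math.exp(-(sigma_sq / B ** 2) * bennett_h(B * t / sigma_sq))

# Hypothetical example: n = 100, X_i in [0, 1], Var(X_i) = 0.01, t = 10.
n, B, t = 100, 1.0, 10.0
bennett = bennett_bound(t, sigma_sq=n * 0.01, B=B)
hoeffding = math.exp(-2.0 * t ** 2 / (n * 1.0 ** 2))  # widths b_i - a_i = 1
print(bennett, hoeffding)
```

With this small variance, Bennett’s bound is many orders of magnitude below Hoeffding’s $e^{-2}$.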

Bernstein bound 1

There are several bounds that go under the name “Bernstein bound”. The most common is perhaps a relaxation of Bennett’s bound and uses that

$$ h (u) \geq \frac{u^{2}}{2 (1 + u / 3)}, $$

to show that

$$ P (S_{n} \geq t) \leq \exp \left( - \frac{t^{2}}{2 (σ^{2} + B t / 3)} \right), $$

where $σ^{2}$ and $B$ are as above. We can of course obtain two-sided versions of Bennett’s and Bernstein’s bounds if we assume two-sided boundedness ($|X_{i}| \leq B$), apply the bound twice, and use a union bound.

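A useful form of Bernstein’s bound that I’m writing down because re-deriving it is annoying is the following: if $|X_{i}| \leq B$ then, with probability at least $1 - δ$,

$$ \left| \frac{S_{n}}{n} \right| \leq \frac{\sqrt{2 σ^{2} \log (2 / δ)}}{n} + \frac{2 B \log (2 / δ)}{3 n} . $$

Here $σ^{2}$ is the sum of the variances, as above. If every $X_{i}$ has the same variance $v$, so that $σ^{2} = n v$, the first term reads $\sqrt{2 v \log (2 / δ) / n}$.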
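Since the re-derivation is the annoying part, here is a sketch that does it numerically. Setting the one-sided Bernstein bound equal to $δ/2$ gives a quadratic in $t$ whose positive root is $BL/3 + \sqrt{B^{2} L^{2} / 9 + 2 σ^{2} L}$ with $L = \log (2/δ)$; relaxing the square root via $\sqrt{x + y} \leq \sqrt{x} + \sqrt{y}$ gives the simpler radius $\sqrt{2 σ^{2} L} + 2BL/3$. The function names and example values are hypothetical, and $σ^{2}$ is taken to be the sum of the variances.

```python
import math

def bernstein_tail(t, sigma_sq, B):
    """One-sided Bernstein bound on P(S_n >= t)."""
    return math.exp(-t ** 2 / (2.0 * (sigma_sq + B * t / 3.0)))

def bernstein_radius_exact(n, sigma_sq, B, delta):
    """Exact root of 2 * bernstein_tail(t) = delta, divided by n."""
    L = math.log(2.0 / delta)
    return (B * L / 3.0 + math.sqrt((B * L / 3.0) ** 2 + 2.0 * sigma_sq * L)) / n

def bernstein_radius(n, sigma_sq, B, delta):
    """Relaxed radius: sqrt(2 sigma^2 L) / n + 2 B L / (3 n)."""
    L = math.log(2.0 / delta)
    return (math.sqrt(2.0 * sigma_sq * L) + 2.0 * B * L / 3.0) / n

# Hypothetical example: n = 1000, sum of variances 10, |X_i| <= 1, delta = 0.05.
n, sigma_sq, B, delta = 1000, 10.0, 1.0, 0.05
r_exact = bernstein_radius_exact(n, sigma_sq, B, delta)
r_relaxed = bernstein_radius(n, sigma_sq, B, delta)
print(r_exact, r_relaxed)
print(2.0 * bernstein_tail(n * r_exact, sigma_sq, B))  # recovers delta up to rounding
```

The relaxed radius is never smaller than the exact one, so it remains a valid (slightly looser) confidence radius.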

There are two regimes in Bernstein’s bound: a sub-Gaussian regime, where the tail decays at a sub-Gaussian rate, and a sub-exponential regime, where it decays at a sub-exponential rate. The former occurs when $Bt/3$ is small relative to $σ^{2}$, so that the exponent scales as $t^{2} / σ^{2}$. When $Bt/3$ is large relative to $σ^{2}$, the exponent scales only as $t / B$. Therefore, if we have good a priori knowledge that the variance is small, Bernstein’s (and Bennett’s) inequality can be a big improvement over Hoeffding’s, with an exponent scaling as $t^{2} / σ^{2}$ instead of $t^{2} / (n B^{2})$.
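The comparison with Hoeffding can be checked numerically. The sketch below assumes $|X_{i}| \leq B$ (so the Hoeffding widths are $2B$) and uses made-up values; the function names are mine. It also shows the flip side: when the variance is as large as boundedness allows, $σ^{2} = n B^{2}$, Bernstein is slightly worse than Hoeffding because of the extra $Bt/3$ term.

```python
import math

def bernstein_bound(t, sigma_sq, B):
    """One-sided Bernstein bound; sigma_sq is the sum of the variances."""
    return math.exp(-t ** 2 / (2.0 * (sigma_sq + B * t / 3.0)))

def hoeffding_symmetric(t, n, B):
    """Hoeffding bound when |X_i| <= B, i.e. b_i - a_i = 2B for every i."""
    return math.exp(-2.0 * t ** 2 / (n * (2.0 * B) ** 2))

# Hypothetical example: n = 1000, |X_i| <= 1, deviation t = 10.
n, B, t = 1000, 1.0, 10.0
small_var = bernstein_bound(t, sigma_sq=n * 0.01, B=B)    # per-observation variance 0.01
worst_var = bernstein_bound(t, sigma_sq=n * B ** 2, B=B)  # worst-case variance B^2
hoeff = hoeffding_symmetric(t, n, B)
print(small_var, hoeff, worst_var)
```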

This result was first presented by Bernstein in 1927. It was not until 1962 that this was sharpened by Bennett, which resulted in Bennett’s inequality above.

Hoeffding’s bound remastered

It is somewhat of a shame that Hoeffding’s name came to be associated with the first bound above, since this is much weaker than the following more general result that he proved in his famous 1963 paper, which recovers both Bennett’s and Bernstein’s bounds.

If $X_{1}, \dots, X_{n}$ are iid with (per-observation) variance $σ^{2}$ and lying in $(-\infty, c]$, then

$$ P (S_{n} \geq t) \leq \exp \left( - n D_{KL} \left( \frac{σ^{2} + \frac{t}{n} c}{σ^{2} + c^{2}} \,\Big\|\, \frac{σ^{2}}{σ^{2} + c^{2}} \right) \right), $$

where $D_{KL}$ denotes the binary relative entropy, i.e., the KL divergence between Bernoulli distributions with means $p$ and $q$: $D_{KL} (p \| q) = p \log \frac{p}{q} + (1 - p) \log \frac{1 - p}{1 - q}$. If $p > 1$ (that is, $t > n c$) we interpret $D_{KL} (p \| q)$ as infinity, meaning the bound cuts off at that point. Thus, this bound circumvents the first of the drawbacks mentioned above.

This is the sharpest bound that the Chernoff method can give using only the variance and the upper bound $c$. We can recover the first Hoeffding bound by assuming worst-case variance.
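Here is a minimal sketch of this bound, including the cutoff when $p > 1$, compared against Bennett’s bound with $B = c$ and total variance $n σ^{2}$. The function names and example values are assumptions for illustration; `var` is the per-observation variance.

```python
import math

def binary_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q); infinite if p > 1."""
    if p > 1.0:
        return math.inf
    def term(x, y):
        return 0.0 if x == 0.0 else x * math.log(x / y)
    return term(p, q) + term(1.0 - p, 1.0 - q)

def hoeffding_kl_bound(t, n, var, c):
    """Variance-aware Hoeffding bound on P(S_n >= t) for iid X_i with upper bound c."""
    denom = var + c ** 2
    p = (var + (t / n) * c) / denom
    q = var / denom
    return math.exp(-n * binary_kl(p, q))

def bennett_bound(t, sigma_sq, B):
    """Bennett's bound, with sigma_sq the SUM of the variances."""
    u = B * t / sigma_sq
    return math.exp(-(sigma_sq / B ** 2) * ((1.0 + u) * math.log1p(u) - u))

# Hypothetical example: n = 100, per-observation variance 0.01, c = 1.
n, var, c = 100, 0.01, 1.0
kl_bound = hoeffding_kl_bound(10.0, n, var, c)
bennett = bennett_bound(10.0, n * var, c)
cutoff = hoeffding_kl_bound(101.0, n, var, c)  # t > n * c, so the bound is 0
print(kl_bound, bennett, cutoff)
```

In this example the KL form is tighter than Bennett’s bound, and it returns exactly zero once $t$ exceeds $n c$.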

Talagrand’s inequality

Talagrand’s inequality improves Hoeffding’s inequality above by recovering the missing factor in Hoeffding’s bounds. If $X_{1}, \dots, X_{n}$ are iid with variance $σ^{2}$ and lying in $(-\infty, c]$, then

$$ P (S_{n} \geq t) \leq \left( θ \left( \frac{t}{σ} \right) + \frac{B c}{σ} \right) \exp \left( - n D_{KL} \left( \frac{σ^{2} + \frac{t}{n} c}{σ^{2} + c^{2}} \,\Big\|\, \frac{σ^{2}}{σ^{2} + c^{2}} \right) \right), \quad t \leq \frac{σ^{2}}{B c}, $$

for some constant $B$ , where

$$ θ (x) = \frac{1}{\sqrt{2 π}} e^{x^{2} / 2} \int_{x}^{\infty} e^{- u^{2} / 2} \, d u, $$

so that $θ (t / σ)$ is of the order $σ / t$ when $t / σ$ is large. For $t \leq σ^{2} / (B c)$, we have $B c / σ \leq σ / t$. Hence the multiplier in front of the exponential is of the order $1/t$, which is clearly an improvement over Hoeffding’s bound above.
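The $1/x$ behavior of $θ$ is easy to check numerically. The sketch below assumes the Gaussian normalization $1 / \sqrt{2 π}$, so that $θ (x) = e^{x^{2} / 2} P (Z \geq x)$ for a standard normal $Z$, and evaluates the tail with the complementary error function. It is only meant for moderate $x$, since $e^{x^{2} / 2}$ overflows for large arguments.

```python
import math

def theta(x):
    """theta(x) = exp(x^2 / 2) * P(Z >= x) for a standard normal Z.

    Uses P(Z >= x) = erfc(x / sqrt(2)) / 2. Intended for moderate x only.
    """
    return math.exp(x * x / 2.0) * 0.5 * math.erfc(x / math.sqrt(2.0))

# x * theta(x) should approach 1 / sqrt(2 * pi) as x grows.
limit = 1.0 / math.sqrt(2.0 * math.pi)
for x in (1.0, 2.0, 5.0, 10.0):
    print(x, theta(x), x * theta(x) / limit)
```

The last column tends to 1, which is the sense in which $θ (t / σ)$ is of order $σ / t$.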

Talagrand proved this bound in 1995, and it was the first improvement since Hoeffding’s original article in 1963.

Bentkus’ inequality

Bentkus’ inequality is another way to recover the missing factor of $1/t$. The approach is significantly different from Talagrand’s, and is based on interpolating between Markov and Chernoff. Instead of applying Markov’s inequality to the exponential, we apply it to the function $h_{α} (x) = (1 + x / α)_{+}^{α}$ for any $α > 0$ and obtain, for $X_{1}, \dots, X_{n}$ independent with finite variance,

$$ P (S_{n} \geq t) \leq \inf_{x < t} \frac{E (S_{n} - x)_{+}^{α}}{(t - x)_{+}^{α}} . $$

The right-hand side can be bounded by considering the worst-case random variables. If $V (X_{i}) \leq σ_{i}^{2}$ and $X_{i} \leq c$, then defining independent $G_{i}$ with $P (G_{i} = - σ_{i}^{2} / c) = c^{2} / (σ_{i}^{2} + c^{2})$ and $P (G_{i} = c) = 1 - P (G_{i} = - σ_{i}^{2} / c)$, we obtain that

$$ P (S_{n} \geq t) \leq \inf_{x < t} \frac{E \left( \sum_{i} G_{i} - x \right)_{+}^{α}}{(t - x)_{+}^{α}} . $$

This is not a closed-form bound, but it can be computed. See eg here and references therein.
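To make “it can be computed” concrete, here is a rough sketch for the iid case $σ_{i} = σ$. Then $\sum_{i} G_{i} = K c - (n - K) σ^{2} / c$ with $K$ binomial, so the expectation is a finite sum, and the infimum over $x$ can be approximated on a grid. Any grid point gives a valid bound, so the grid only costs tightness. The choice $α = 2$, the grid size, and all names are assumptions of this sketch, not prescriptions from the references.

```python
import math

def bentkus_bound(t, n, var, c, alpha=2.0, grid=2000):
    """Grid approximation of inf_{x < t} E(sum G_i - x)_+^alpha / (t - x)^alpha.

    iid case: each G_i equals c with probability q = var / (var + c^2)
    and -var / c otherwise, so it has mean 0 and variance var.
    """
    q = var / (var + c ** 2)
    low = -var / c
    # Distribution of sum G_i via K ~ Binomial(n, q), the number of G_i equal to c.
    pmf = [math.comb(n, k) * q ** k * (1.0 - q) ** (n - k) for k in range(n + 1)]
    values = [k * c + (n - k) * low for k in range(n + 1)]
    x_min = n * low  # smallest possible value of the sum
    best = 1.0       # a probability never exceeds 1
    for j in range(grid):
        x = x_min + (t - x_min) * j / grid  # grid over [x_min, t)
        moment = sum(p * max(v - x, 0.0) ** alpha for p, v in zip(pmf, values))
        best = min(best, moment / (t - x) ** alpha)
    return best

# Hypothetical example: n = 100, per-observation variance 0.01, c = 1, t = 10.
bentkus = bentkus_bound(10.0, n=100, var=0.01, c=1.0)
print(bentkus)
```

For these values the result comes out smaller than the Chernoff-type bound $\exp (- n D_{KL})$ for the same inputs, as expected from the interpolation argument.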

Empirical Bernstein bounds

Empirical Bernstein bounds replace the oracle variance with an estimated variance. This is useful because the true variance is not always known. These bounds deserve their own page: empirical Bernstein bounds.

References

Concentration inequalities by Boucheron, Lugosi, and Massart.
On the Bennett-Hoeffding inequality by Pinelis.
On Hoeffding’s inequalities by Bentkus.
