A technique for multivariate concentration. Let $(S_t)_{t\geq 1}$ be some stochastic process, say in $\mathbb{R}^d$. For instance, $S_t = \sum_{i\leq t}(X_i - \mu)$ for multivariate observations $X_1, X_2, \dots$ with mean $\mu$. We are aiming to generate a high probability bound on $\|S_t\|$.
The idea is to use a PAC-Bayes approach (which is itself based on variational inequalities, hence the name), in order to simultaneously bound $\langle S_t, \theta\rangle$ in each direction $\theta \in \mathbb{S}^{d-1}$. Recall that a PAC-Bayes bound has the form
$$\Pr\Big(\forall \rho:\ \mathbb{E}_{\theta\sim\rho}[f(\theta, Z)] \leq \mathrm{KL}(\rho\|\pi) + \log(1/\delta)\Big) \geq 1-\delta,$$
where $\pi$ is a prior over a parameter space $\Theta$, $\rho$ ranges over all "posteriors" (distributions over $\Theta$), and $f$ is some function of $\theta$ and the data $Z$.
That is, a PAC-Bayes bound provides a high probability bound simultaneously over all posteriors. The variational approach to concentration translates this into a high probability bound over all directions.
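To make this translation concrete, here is the worked identity that drives everything (a small sketch; the posteriors $\rho_u$ indexed by a mean vector $u$ are the ones used in the examples below):
$$\|S_t\| = \sup_{u\in\mathbb{S}^{d-1}}\langle S_t, u\rangle, \qquad \mathbb{E}_{\theta\sim\rho_u}\langle S_t, \theta\rangle = \langle S_t, u\rangle \ \text{ whenever } \mathbb{E}_{\theta\sim\rho_u}[\theta] = u.$$
So a bound that holds simultaneously over all posteriors $\rho_u$ yields a bound on $\langle S_t, u\rangle$ for every direction $u$ (after accounting for the KL and slack terms), and taking the supremum over $u$ recovers a bound on the norm $\|S_t\|$.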
This approach was pioneered by Catoni and Giulini (here and here), and has now been used by a few authors to prove bounds in a variety of settings:
- Zhivotovskiy, for bounding the singular values of random matrices;
- Nakakita et al., for bounding the mean of high-dimensional random matrices under heavy tails;
- Giulini, for estimating the Gram operator in Hilbert spaces;
- myself and others, for estimating the mean of random vectors.
Recall a very general PAC-Bayes inequality:
Master theorem
Let $(M_t(\theta))_{t\geq 0}$ be a nonnegative supermartingale with initial value 1 for all $\theta\in\Theta$. Let $\pi$ be a data-free prior over $\Theta$. Then, with probability at least $1-\delta$, simultaneously for all $t\geq 1$ and all posteriors $\rho$ over $\Theta$,
$$\mathbb{E}_{\theta\sim\rho}[\log M_t(\theta)] \leq \mathrm{KL}(\rho\|\pi) + \log(1/\delta).$$
This is the time-uniform version of the master theorem, but we can also state a fixed-time version. This reads: let $M(\theta)$ be nonnegative with expected value at most 1 (i.e., it is an e-value) for all $\theta\in\Theta$. Then, with probability at least $1-\delta$, simultaneously for all posteriors $\rho$,
$$\mathbb{E}_{\theta\sim\rho}[\log M(\theta)] \leq \mathrm{KL}(\rho\|\pi) + \log(1/\delta).$$
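For intuition, here is the standard two-step argument behind master theorems of this kind (a sketch, not necessarily the exact proof in any of the papers above). First, the Donsker–Varadhan variational formula gives, for any posterior $\rho \ll \pi$,
$$\mathbb{E}_{\theta\sim\rho}[\log M_t(\theta)] \leq \mathrm{KL}(\rho\|\pi) + \log \mathbb{E}_{\theta\sim\pi}[M_t(\theta)].$$
Second, since $\pi$ is data-free, the mixture $\overline{M}_t = \mathbb{E}_{\theta\sim\pi}[M_t(\theta)]$ is itself a nonnegative supermartingale with initial value 1, so Ville's inequality gives $\Pr(\exists t: \overline{M}_t \geq 1/\delta) \leq \delta$. On the complement of this event, $\log\overline{M}_t \leq \log(1/\delta)$ for all $t$, which combined with the first display yields the theorem. The fixed-time version replaces Ville's inequality with Markov's inequality.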
To see how this works, let’s consider two use cases.
Example 1: Sub-Gaussian random vectors
This comes from our paper on time-uniform confidence spheres. Consider iid copies $X_1, X_2, \dots$ of a $\Sigma$-sub-Gaussian random vector with mean $\mu$ (see sub-Gaussian distributions). That is,
$$\mathbb{E}\exp\Big(\lambda\langle\theta, X_i - \mu\rangle - \frac{\lambda^2}{2}\theta^\top\Sigma\theta\Big) \leq 1,$$
for all $\theta\in\mathbb{R}^d$ and $\lambda\in\mathbb{R}$. This implies that, for any fixed $\lambda$,
$$M_t(\theta) = \exp\Big(\lambda\Big\langle\theta, \sum_{i\leq t}(X_i-\mu)\Big\rangle - \frac{t\lambda^2}{2}\theta^\top\Sigma\theta\Big)$$
has expectation at most 1 (i.e., it is an e-value). Let $\pi$ be a Gaussian with mean 0 and covariance $\beta^{-1}I_d$ for some $\beta>0$. Consider the family of distributions $\{\rho_u: u\in\mathbb{R}^d\}$, where $\rho_u$ is a Gaussian with mean $u$ and covariance $\beta^{-1}I_d$. Then the KL divergence between $\rho_u$ and $\pi$ is $\beta\|u\|^2/2$. Using the master theorem above, we obtain that, with probability $1-\delta$, simultaneously for all distributions $\rho_u$,
$$\mathbb{E}_{\theta\sim\rho_u}\Big[\lambda\Big\langle\theta, \sum_{i\leq t}(X_i-\mu)\Big\rangle - \frac{t\lambda^2}{2}\theta^\top\Sigma\theta\Big] \leq \frac{\beta\|u\|^2}{2} + \log\frac{1}{\delta}.$$
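As a quick check of the KL term (a standard Gaussian identity, not specific to this setting): for Gaussians with a shared covariance $\Sigma_0$,
$$\mathrm{KL}\big(N(u,\Sigma_0)\,\|\,N(0,\Sigma_0)\big) = \tfrac{1}{2}\,u^\top\Sigma_0^{-1}u,$$
so with $\Sigma_0 = \beta^{-1}I_d$ this is indeed $\beta\|u\|^2/2$.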
Now, for $\theta\sim\rho_u$ with $u\in\mathbb{S}^{d-1}$,
$$\mathbb{E}_{\theta\sim\rho_u}\Big\langle\theta, \sum_{i\leq t}(X_i-\mu)\Big\rangle = \Big\langle u, \sum_{i\leq t}(X_i-\mu)\Big\rangle, \quad\text{and}\quad \mathbb{E}_{\theta\sim\rho_u}[\theta^\top\Sigma\theta] = u^\top\Sigma u + \frac{\mathrm{tr}(\Sigma)}{\beta} \leq \|\Sigma\|\Big(1 + \frac{d}{\beta}\Big),$$
using basic properties of the expectation of quadratic forms under Gaussian distributions (see e.g. here), and the definition of the operator norm as $\|\Sigma\| = \sup_{\|v\|=1} v^\top\Sigma v$. Since this holds simultaneously for all $u\in\mathbb{S}^{d-1}$, we obtain that, with probability $1-\delta$,
$$\sup_{u\in\mathbb{S}^{d-1}}\lambda\Big\langle u, \sum_{i\leq t}(X_i-\mu)\Big\rangle \leq \frac{t\lambda^2}{2}\|\Sigma\|\Big(1 + \frac{d}{\beta}\Big) + \frac{\beta}{2} + \log\frac{1}{\delta}.$$
The left hand side is equal to $\lambda\,\|\sum_{i\leq t}(X_i-\mu)\|$ (for $\lambda>0$), which gives us our concentration result. One can then optimize $\lambda$ (and $\beta$) using some calculus.
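As a small numerical sanity check of the e-value claim above, here is a sketch in Python (with made-up parameter values) that estimates $\mathbb{E}[M_t(\theta)]$ by Monte Carlo for Gaussian data, which is $\Sigma$-sub-Gaussian; the estimate should come out close to 1.

```python
import numpy as np

rng = np.random.default_rng(0)
d, t, lam = 3, 10, 0.3                      # dimension, time horizon, free parameter lambda
mu = np.zeros(d)
Sigma = np.diag([1.0, 2.0, 0.5])            # Gaussian data with covariance Sigma is Sigma-sub-Gaussian
theta = rng.standard_normal(d)
theta /= np.linalg.norm(theta)              # a fixed direction theta on the unit sphere

def M(X, theta, lam, Sigma, mu):
    """M_t(theta) = exp( lam * <theta, sum_i (X_i - mu)> - t * lam^2/2 * theta' Sigma theta )."""
    S = (X - mu).sum(axis=0)
    return np.exp(lam * theta @ S - X.shape[0] * lam**2 / 2 * theta @ Sigma @ theta)

# Monte Carlo estimate of E[M_t(theta)]; the e-value property says this is at most 1.
vals = [M(rng.multivariate_normal(mu, Sigma, size=t), theta, lam, Sigma, mu)
        for _ in range(20_000)]
print(np.mean(vals))                        # ≈ 1
```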
Example 2: Random matrices with finite Orlicz-norm
This example is adapted from Zhivotovskiy (2024). Let $X_1,\dots,X_n$ be iid copies of a zero-mean random matrix $X$ with finite sub-exponential Orlicz norm, in the sense that, for some $\alpha > 0$,
$$\|\langle u, X v\rangle\|_{\psi_1} \leq \alpha,$$
for all $u, v\in\mathbb{S}^{d-1}$, where $\|Y\|_{\psi_1} = \inf\{t > 0: \mathbb{E}\exp(|Y|/t)\leq 2\}$.
Here we highlight the main ingredients that are used in Zhivotovskiy’s results. For the details, see his paper or my blog post.
We take our parameter space in the master theorem above to be $\Theta = \mathbb{R}^d\times\mathbb{R}^d$. Let $\pi$ again be Gaussian with mean 0 and covariance $\beta^{-1}I_d$, and let $\tilde\rho_u$ be a truncated Gaussian with mean $u$, covariance $\beta^{-1}I_d$, and radius $r$ (that is, $\rho_u = N(u, \beta^{-1}I_d)$ conditioned to lie in the ball $B(u,r)$ of radius $r$ around $u$). For a vector $u\in\mathbb{S}^{d-1}$, the KL-divergence between the truncated normal $\tilde\rho_u$ and $\pi$ is
$$\mathrm{KL}(\tilde\rho_u\|\pi) = \frac{\beta\|u\|^2}{2} + \log\frac{1}{p}, \quad\text{where } p = \rho_u(B(u,r)),$$
where $B(u,r)$ is the ball of radius $r$ centered at $u$. Equivalently, $p = \Pr(\|Z\|\leq r)$ where $Z$ is a normal with mean 0 and covariance $\beta^{-1}I_d$. Hence $p \geq 1 - d/(\beta r^2)$ by Markov's inequality, since $\mathbb{E}\|Z\|^2 = d/\beta$. Thus, taking $\beta = 2d/r^2$ (so that $p\geq 1/2$) yields $\mathrm{KL}(\tilde\rho_u\|\pi) \leq \frac{\beta\|u\|^2}{2} + \log 2$, and we obtain, for $u\in\mathbb{S}^{d-1}$,
$$\mathrm{KL}(\tilde\rho_u\|\pi) \leq \frac{d}{r^2} + \log 2.$$
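Here is a short worked derivation of the KL decomposition used above (a sketch under the notation just introduced: $\tilde\rho_u$ is $\rho_u = N(u,\beta^{-1}I_d)$ conditioned on the ball $B(u,r)$, and $p = \rho_u(B(u,r))$). Conditioning multiplies the density by $\mathbf{1}_{B(u,r)}/p$, so
$$\mathrm{KL}(\tilde\rho_u\|\pi) = \mathbb{E}_{\tilde\rho_u}\Big[\log\frac{d\tilde\rho_u}{d\pi}\Big] = \mathbb{E}_{\tilde\rho_u}\Big[\log\frac{d\rho_u}{d\pi}\Big] + \log\frac{1}{p}.$$
Moreover, $\log\frac{d\rho_u}{d\pi}(x) = \frac{\beta}{2}\big(\|x\|^2 - \|x-u\|^2\big) = \beta\langle x, u\rangle - \frac{\beta\|u\|^2}{2}$, and since the truncation region is a ball centered at $u$, symmetry gives $\mathbb{E}_{\tilde\rho_u}[x] = u$, so the first term equals $\beta\|u\|^2 - \frac{\beta\|u\|^2}{2} = \frac{\beta\|u\|^2}{2}$.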
Now it remains to construct a relevant quantity to use in the PAC-Bayes theorem. Consider
$$M(\theta,\nu) = \prod_{i=1}^n \frac{\exp\big(\lambda\langle\theta, X_i\nu\rangle\big)}{\mathbb{E}\exp\big(\lambda\langle\theta, X_i\nu\rangle\big)},$$
where the expectation in the denominator is over $X_i$. It's easy to see this has expectation at most 1 (it can be written as the product of terms each with expectation exactly one). Apply the master theorem with the product distribution $\tilde\rho_u\otimes\tilde\rho_v$ for $u, v$ in the unit sphere. Then, to bound the expectations appearing in the process, use one of the exponential inequalities. In particular, for a random variable $Y$ with $\|Y\|_{\psi_1}\leq\alpha$, there are absolute constants $c, C > 0$ such that
$$\mathbb{E}\exp\big(\lambda(Y - \mathbb{E}Y)\big) \leq \exp\big(C\lambda^2\alpha^2\big) \quad\text{for all } |\lambda|\leq c/\alpha.$$
A bunch of algebra then gives the result.
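To make the Orlicz-norm assumption more tangible, here is a small Python sketch (the helper `psi1_norm` is hypothetical, not from Zhivotovskiy's paper) that estimates $\|Y\|_{\psi_1} = \inf\{t>0: \mathbb{E}\exp(|Y|/t)\leq 2\}$ from samples by bisection; for a Laplace(0,1) variable the true value is exactly 2.

```python
import numpy as np

def psi1_norm(samples, tol=1e-3):
    """Monte Carlo estimate of the sub-exponential Orlicz norm
    ||Y||_{psi_1} = inf{ t > 0 : E exp(|Y|/t) <= 2 },
    found by bisection (t -> E exp(|Y|/t) is decreasing in t)."""
    a = np.abs(np.asarray(samples, dtype=float))

    def ok(t):  # True if the empirical E exp(|Y|/t) is at most 2
        with np.errstate(over="ignore"):  # overflow -> inf, which correctly fails the test
            return np.mean(np.exp(a / t)) <= 2

    lo, hi = 1e-8, 1.0
    while not ok(hi):          # grow hi until the defining constraint holds
        hi *= 2
    while hi - lo > tol:       # bisect down to the infimum
        mid = (lo + hi) / 2
        lo, hi = (lo, mid) if ok(mid) else (mid, hi)
    return hi

rng = np.random.default_rng(1)
Y = rng.laplace(size=200_000)  # Laplace(0,1) is sub-exponential with ||Y||_{psi_1} = 2
print(psi1_norm(Y))            # prints a value near 2 (the estimate is noisy near the threshold)
```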