A technique for multivariate concentration. Let $(S_t)_{t\geq 1}$ be some stochastic process, say in $\mathbb{R}^d$. For instance, $S_t = \sum_{i\leq t}(X_i - \mu)$ for multivariate observations $X_1, X_2, \dots$ with mean $\mu$. We are aiming to generate a high probability bound on $\|S_t\|$.
The idea is to use a PAC-Bayes approach (which is itself based on variational inequalities, hence the name), in order to simultaneously bound $\langle S_t, \theta\rangle$ in each direction $\theta \in \mathbb{S}^{d-1}$. Recall that a PAC-Bayes bound has the form
$$\Pr\Big(\forall \rho:\ \mathbb{E}_{\theta\sim\rho}[f(\theta, Z)] \leq \mathrm{KL}(\rho\|\pi) + \log(1/\delta)\Big) \geq 1-\delta,$$
where $\pi$ is a prior over a parameter space $\Theta$, $\rho$ ranges over all "posteriors" (distributions over $\Theta$), and $f$ is some function of $\theta$ and the data $Z$.
That is, a PAC-Bayes bound provides a high probability bound simultaneously over all posteriors. The variational approach to concentration translates this into a high probability bound over all directions.
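To make this translation concrete, here is the worked identity that drives everything (a small sketch; the posteriors $\rho_u$ indexed by a mean vector $u$ are the ones used in the examples below):
$$\|S_t\| = \sup_{u\in\mathbb{S}^{d-1}}\langle S_t, u\rangle, \qquad \mathbb{E}_{\theta\sim\rho_u}\langle S_t, \theta\rangle = \langle S_t, u\rangle \ \text{ whenever } \mathbb{E}_{\theta\sim\rho_u}[\theta] = u.$$
So a bound that holds simultaneously over all posteriors $\rho_u$ yields a bound on $\langle S_t, u\rangle$ for every direction $u$ (after accounting for the KL and slack terms), and taking the supremum over $u$ recovers a bound on the norm $\|S_t\|$.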
This approach was pioneered by Catoni and Giulini (here and here), and has now been used by a few authors to prove bounds in a variety of settings:
- Zhivotovskiy, for bounding the singular values of random matrices;
- Nakakita et al., for bounding the mean of high-dimensional random matrices under heavy tails;
- Giulini, for estimating the Gram operator in Hilbert spaces;
- myself and others, for estimating the mean of random vectors.
Recall a very general PAC-Bayes inequality:
Master theorem
Let $(M_t(\theta))_{t\geq 0}$ be a nonnegative supermartingale with initial value 1 for all $\theta\in\Theta$. Let $\pi$ be a data-free prior over $\Theta$. Then, with probability at least $1-\delta$, simultaneously for all $t\geq 1$ and all posteriors $\rho$ over $\Theta$,
$$\mathbb{E}_{\theta\sim\rho}[\log M_t(\theta)] \leq \mathrm{KL}(\rho\|\pi) + \log(1/\delta).$$
This is the time-uniform version of the master theorem, but we can also state a fixed-time version. This reads: let $M(\theta)$ be nonnegative with expected value at most 1 (i.e., it is an e-value) for all $\theta\in\Theta$. Then, with probability at least $1-\delta$, simultaneously for all posteriors $\rho$,
$$\mathbb{E}_{\theta\sim\rho}[\log M(\theta)] \leq \mathrm{KL}(\rho\|\pi) + \log(1/\delta).$$
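For intuition, here is the standard two-step argument behind master theorems of this kind (a sketch, not necessarily the exact proof in any of the papers above). First, the Donsker–Varadhan variational formula gives, for any posterior $\rho \ll \pi$,
$$\mathbb{E}_{\theta\sim\rho}[\log M_t(\theta)] \leq \mathrm{KL}(\rho\|\pi) + \log \mathbb{E}_{\theta\sim\pi}[M_t(\theta)].$$
Second, since $\pi$ is data-free, the mixture $\overline{M}_t = \mathbb{E}_{\theta\sim\pi}[M_t(\theta)]$ is itself a nonnegative supermartingale with initial value 1, so Ville's inequality gives $\Pr(\exists t: \overline{M}_t \geq 1/\delta) \leq \delta$. On the complement of this event, $\log\overline{M}_t \leq \log(1/\delta)$ for all $t$, which combined with the first display yields the theorem. The fixed-time version replaces Ville's inequality with Markov's inequality.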
To see how this works, let’s consider two use cases.
Example 1: Sub-Gaussian random vectors
This comes from our paper on time-uniform confidence spheres. Consider iid copies $X_1, X_2, \dots$ of a $\Sigma$-sub-Gaussian random vector with mean $\mu$ (see sub-Gaussian distributions). That is,
$$\mathbb{E}\exp\Big(\lambda\langle\theta, X_i - \mu\rangle - \frac{\lambda^2}{2}\theta^\top\Sigma\theta\Big) \leq 1,$$
for all $\theta\in\mathbb{R}^d$ and $\lambda\in\mathbb{R}$. This implies that, for any fixed $\lambda$,
$$M_t(\theta) = \exp\Big(\lambda\Big\langle\theta, \sum_{i\leq t}(X_i-\mu)\Big\rangle - \frac{t\lambda^2}{2}\theta^\top\Sigma\theta\Big)$$
has expectation at most 1 (i.e., it is an e-value). Let $\pi$ be a Gaussian with mean 0 and covariance $\beta^{-1}I_d$ for some $\beta>0$. Consider the family of distributions $\{\rho_u: u\in\mathbb{R}^d\}$, where $\rho_u$ is a Gaussian with mean $u$ and covariance $\beta^{-1}I_d$. Then the KL divergence between $\rho_u$ and $\pi$ is $\beta\|u\|^2/2$. Using the master theorem above, we obtain that, with probability $1-\delta$, simultaneously for all distributions $\rho_u$,
$$\mathbb{E}_{\theta\sim\rho_u}\Big[\lambda\Big\langle\theta, \sum_{i\leq t}(X_i-\mu)\Big\rangle - \frac{t\lambda^2}{2}\theta^\top\Sigma\theta\Big] \leq \frac{\beta\|u\|^2}{2} + \log\frac{1}{\delta}.$$
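As a quick check of the KL term (a standard Gaussian identity, not specific to this setting): for Gaussians with a shared covariance $\Sigma_0$,
$$\mathrm{KL}\big(N(u,\Sigma_0)\,\|\,N(0,\Sigma_0)\big) = \tfrac{1}{2}\,u^\top\Sigma_0^{-1}u,$$
so with $\Sigma_0 = \beta^{-1}I_d$ this is indeed $\beta\|u\|^2/2$.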
Now, for $\theta\sim\rho_u$ with $u\in\mathbb{S}^{d-1}$,
$$\mathbb{E}_{\theta\sim\rho_u}\Big\langle\theta, \sum_{i\leq t}(X_i-\mu)\Big\rangle = \Big\langle u, \sum_{i\leq t}(X_i-\mu)\Big\rangle, \quad\text{and}\quad \mathbb{E}_{\theta\sim\rho_u}[\theta^\top\Sigma\theta] = u^\top\Sigma u + \frac{\mathrm{tr}(\Sigma)}{\beta} \leq \|\Sigma\|\Big(1 + \frac{d}{\beta}\Big),$$
using basic properties of the expectation of quadratic forms under Gaussian distributions (see e.g. here), and the definition of the operator norm as $\|\Sigma\| = \sup_{\|v\|=1} v^\top\Sigma v$. Since this holds simultaneously for all $u\in\mathbb{S}^{d-1}$, we obtain that, with probability $1-\delta$,
$$\sup_{u\in\mathbb{S}^{d-1}}\lambda\Big\langle u, \sum_{i\leq t}(X_i-\mu)\Big\rangle \leq \frac{t\lambda^2}{2}\|\Sigma\|\Big(1 + \frac{d}{\beta}\Big) + \frac{\beta}{2} + \log\frac{1}{\delta}.$$
The left hand side is equal to $\lambda\,\|\sum_{i\leq t}(X_i-\mu)\|$ (for $\lambda>0$), which gives us our concentration result. One can then optimize $\lambda$ (and $\beta$) using some calculus.
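As a small numerical sanity check of the e-value claim above, here is a sketch in Python (with made-up parameter values) that estimates $\mathbb{E}[M_t(\theta)]$ by Monte Carlo for Gaussian data, which is $\Sigma$-sub-Gaussian; the estimate should come out close to 1.

```python
import numpy as np

rng = np.random.default_rng(0)
d, t, lam = 3, 10, 0.3                      # dimension, time horizon, free parameter lambda
mu = np.zeros(d)
Sigma = np.diag([1.0, 2.0, 0.5])            # Gaussian data with covariance Sigma is Sigma-sub-Gaussian
theta = rng.standard_normal(d)
theta /= np.linalg.norm(theta)              # a fixed direction theta on the unit sphere

def M(X, theta, lam, Sigma, mu):
    """M_t(theta) = exp( lam * <theta, sum_i (X_i - mu)> - t * lam^2/2 * theta' Sigma theta )."""
    S = (X - mu).sum(axis=0)
    return np.exp(lam * theta @ S - X.shape[0] * lam**2 / 2 * theta @ Sigma @ theta)

# Monte Carlo estimate of E[M_t(theta)]; the e-value property says this is at most 1.
vals = [M(rng.multivariate_normal(mu, Sigma, size=t), theta, lam, Sigma, mu)
        for _ in range(20_000)]
print(np.mean(vals))                        # ≈ 1
```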
Example 2: Random matrices with finite Orlicz-norm
This example is adapted from Zhivotovskiy (2024). Let $X_1,\dots,X_n$ be iid copies of a zero-mean random matrix $X$ with finite sub-exponential Orlicz norm, in the sense that, for some $\alpha > 0$,
$$\|\langle u, X v\rangle\|_{\psi_1} \leq \alpha,$$
for all $u, v\in\mathbb{S}^{d-1}$, where $\|Y\|_{\psi_1} = \inf\{t > 0: \mathbb{E}\exp(|Y|/t)\leq 2\}$.
Here we highlight the main ingredients that are used in Zhivotovskiy’s results. For the details, see his paper or my blog post.
We take our parameter space in the master theorem above to be $\Theta = \mathbb{R}^d\times\mathbb{R}^d$. Let $\pi$ again be Gaussian with mean 0 and covariance $\beta^{-1}I_d$, and let $\tilde\rho_u$ be a truncated Gaussian with mean $u$, covariance $\beta^{-1}I_d$, and radius $r$ (that is, $\rho_u = N(u, \beta^{-1}I_d)$ conditioned to lie in the ball $B(u,r)$ of radius $r$ around $u$). For a vector $u\in\mathbb{S}^{d-1}$, the KL-divergence between the truncated normal $\tilde\rho_u$ and $\pi$ is
$$\mathrm{KL}(\tilde\rho_u\|\pi) = \frac{\beta\|u\|^2}{2} + \log\frac{1}{p}, \quad\text{where } p = \rho_u(B(u,r)),$$
where $B(u,r)$ is the ball of radius $r$ centered at $u$. Equivalently, $p = \Pr(\|Z\|\leq r)$ where $Z$ is a normal with mean 0 and covariance $\beta^{-1}I_d$. Hence $p \geq 1 - d/(\beta r^2)$ by Markov's inequality, since $\mathbb{E}\|Z\|^2 = d/\beta$. Thus, taking $\beta = 2d/r^2$ (so that $p\geq 1/2$) yields $\mathrm{KL}(\tilde\rho_u\|\pi) \leq \frac{\beta\|u\|^2}{2} + \log 2$, and we obtain, for $u\in\mathbb{S}^{d-1}$,
$$\mathrm{KL}(\tilde\rho_u\|\pi) \leq \frac{d}{r^2} + \log 2.$$
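Here is a short worked derivation of the KL decomposition used above (a sketch under the notation just introduced: $\tilde\rho_u$ is $\rho_u = N(u,\beta^{-1}I_d)$ conditioned on the ball $B(u,r)$, and $p = \rho_u(B(u,r))$). Conditioning multiplies the density by $\mathbf{1}_{B(u,r)}/p$, so
$$\mathrm{KL}(\tilde\rho_u\|\pi) = \mathbb{E}_{\tilde\rho_u}\Big[\log\frac{d\tilde\rho_u}{d\pi}\Big] = \mathbb{E}_{\tilde\rho_u}\Big[\log\frac{d\rho_u}{d\pi}\Big] + \log\frac{1}{p}.$$
Moreover, $\log\frac{d\rho_u}{d\pi}(x) = \frac{\beta}{2}\big(\|x\|^2 - \|x-u\|^2\big) = \beta\langle x, u\rangle - \frac{\beta\|u\|^2}{2}$, and since the truncation region is a ball centered at $u$, symmetry gives $\mathbb{E}_{\tilde\rho_u}[x] = u$, so the first term equals $\beta\|u\|^2 - \frac{\beta\|u\|^2}{2} = \frac{\beta\|u\|^2}{2}$.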
Now it remains to construct a relevant quantity to use in the PAC-Bayes theorem. Consider
$$M(\theta,\nu) = \prod_{i=1}^n \frac{\exp\big(\lambda\langle\theta, X_i\nu\rangle\big)}{\mathbb{E}\exp\big(\lambda\langle\theta, X_i\nu\rangle\big)},$$
where the expectation in the denominator is over $X_i$. It's easy to see this has expectation at most 1 (it can be written as the product of terms each with expectation exactly one). Apply the master theorem with the product distribution $\tilde\rho_u\otimes\tilde\rho_v$ for $u, v$ in the unit sphere. Then, to bound the expectations appearing in the process, use one of the exponential inequalities. In particular, for a random variable $Y$ with $\|Y\|_{\psi_1}\leq\alpha$, there are absolute constants $c, C > 0$ such that
$$\mathbb{E}\exp\big(\lambda(Y - \mathbb{E}Y)\big) \leq \exp\big(C\lambda^2\alpha^2\big) \quad\text{for all } |\lambda|\leq c/\alpha.$$
A bunch of algebra then gives the result.
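To make the Orlicz-norm assumption more tangible, here is a small Python sketch (the helper `psi1_norm` is hypothetical, not from Zhivotovskiy's paper) that estimates $\|Y\|_{\psi_1} = \inf\{t>0: \mathbb{E}\exp(|Y|/t)\leq 2\}$ from samples by bisection; for a Laplace(0,1) variable the true value is exactly 2.

```python
import numpy as np

def psi1_norm(samples, tol=1e-3):
    """Monte Carlo estimate of the sub-exponential Orlicz norm
    ||Y||_{psi_1} = inf{ t > 0 : E exp(|Y|/t) <= 2 },
    found by bisection (t -> E exp(|Y|/t) is decreasing in t)."""
    a = np.abs(np.asarray(samples, dtype=float))

    def ok(t):  # True if the empirical E exp(|Y|/t) is at most 2
        with np.errstate(over="ignore"):  # overflow -> inf, which correctly fails the test
            return np.mean(np.exp(a / t)) <= 2

    lo, hi = 1e-8, 1.0
    while not ok(hi):          # grow hi until the defining constraint holds
        hi *= 2
    while hi - lo > tol:       # bisect down to the infimum
        mid = (lo + hi) / 2
        lo, hi = (lo, mid) if ok(mid) else (mid, hi)
    return hi

rng = np.random.default_rng(1)
Y = rng.laplace(size=200_000)  # Laplace(0,1) is sub-exponential with ||Y||_{psi_1} = 2
print(psi1_norm(Y))            # prints a value near 2 (the estimate is noisy near the threshold)
```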