PAC-Bayes bounds were originally crafted as a method to prove guarantees in learning theory that didn’t suffer from the same difficulties as PAC learning. However, they’ve become useful for proving more general concentration inequalities, especially in multivariate settings. They are the cornerstone of the variational approach to concentration.
Traditional PAC bounds are usually proven via uniform convergence arguments, which yield bounds that depend on various notions of complexity, such as the VC dimension of the class of learners. For sufficiently rich classes (such as neural nets), however, these complexities can be massive (or infinite), resulting in vacuous bounds.
The PAC-Bayes approach is to discard the idea of a worst-case analysis (which is the idea of uniform convergence) and instead take a Bayesian perspective. We place a prior $\pi$ over the class of functions $\mathcal{F}$ that we are trying to learn and develop a bound which depends not on the complexity of $\mathcal{F}$, but instead on some divergence (often the KL divergence) between our prior and any "posterior" $\rho$ over $\mathcal{F}$.
For instance, one of the earliest and most famous PAC-Bayes bounds comes from Catoni in 2003 (that guy did a lot of stuff). It says that for all $\lambda > 0$, $\delta \in (0,1)$, and priors $\pi$ over $\mathcal{F}$, with probability at least $1 - \delta$,
$$\mathbb{E}_{f \sim \rho}[R(f)] \;\le\; \mathbb{E}_{f \sim \rho}[\widehat{R}_n(f)] + \frac{\lambda}{8n} + \frac{\mathrm{KL}(\rho \,\|\, \pi) + \log\frac{1}{\delta}}{\lambda} \quad \text{for all } \rho \in \mathcal{P}(\mathcal{F}),$$
where $\mathcal{P}(\mathcal{F})$ is the set of distributions over $\mathcal{F}$, $R(f) = \mathbb{E}[\ell(f(X), Y)]$ is the risk, $\widehat{R}_n(f) = \frac{1}{n}\sum_{i=1}^n \ell(f(X_i), Y_i)$ is the empirical risk, and $\ell$ is some loss function bounded in $[0,1]$ (see statistical decision theory). So now instead of arguing about worst-case loss, we're arguing about average loss. The bound is uniform over the distributions $\rho$, meaning it holds simultaneously for all of them. But if we pick a $\rho$ that looks nothing like $\pi$, then $\mathrm{KL}(\rho \,\|\, \pi)$ will blow up.
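To make the pieces concrete, here is a minimal numerical sketch in Python. Everything in it is hypothetical (a finite class of 100 hypotheses, a uniform prior, made-up empirical risks); it just evaluates the right-hand side of Catoni's bound in the $[0,1]$-bounded-loss form above, with $\lambda$ fixed in advance.

```python
import numpy as np

def catoni_rhs(emp_risks, rho, pi, n, lam, delta):
    """Right-hand side of Catoni's bound (loss bounded in [0, 1], fixed lambda):
    E_{f~rho}[empirical risk] + lam/(8n) + (KL(rho || pi) + log(1/delta)) / lam."""
    kl = np.sum(rho * np.log(rho / pi))       # KL(rho || pi) for a finite class
    avg_emp_risk = np.dot(rho, emp_risks)     # E_{f~rho}[empirical risk]
    return avg_emp_risk + lam / (8 * n) + (kl + np.log(1 / delta)) / lam

# Entirely hypothetical setup: 100 hypotheses, uniform prior, n = 1000 samples.
rng = np.random.default_rng(0)
K, n = 100, 1000
pi = np.full(K, 1 / K)                        # uniform prior over the class
emp_risks = rng.uniform(0.1, 0.4, size=K)     # made-up empirical risks

# One possible "posterior": a Gibbs-style reweighting toward low empirical risk.
rho = np.exp(-np.sqrt(n) * emp_risks)
rho /= rho.sum()

print(catoni_rhs(emp_risks, rho, pi, n, lam=np.sqrt(8 * n), delta=0.05))
```

The bound holds simultaneously for every choice of `rho`, so you are free to pick one concentrated on low empirical risk, but the KL term charges you for straying far from the prior.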
What good is a bound on $\mathbb{E}_{f \sim \rho}[R(f)]$, you may ask? Aren't we after a bound on $R(f)$ for some particular $f$? Well yes, and sometimes this poses a problem. But sometimes it doesn't. Sometimes $f$ is a randomized predictor, in which case you really care about $\mathbb{E}_{f \sim \rho}[R(f)]$ anyway. But if not, then you need to perturb $f$ a bit to induce a distribution over it (e.g., place Gaussians over the weights in a neural net). This sounds crazy but actually led to non-vacuous bounds for neural nets, which PAC bounds have not been able to do.
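To see what "perturb $f$ a bit" can look like, here is a hedged sketch (not any specific paper's construction): a diagonal Gaussian posterior centered at some hypothetical trained weights, against an $N(0, I)$ prior, using the closed-form KL between diagonal Gaussians. That KL is the divergence term that would enter the bound.

```python
import numpy as np

def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, diag(sigma_q^2)) || N(mu_p, diag(sigma_p^2)) ), summed over coords."""
    return 0.5 * np.sum(
        (sigma_q / sigma_p) ** 2
        + ((mu_q - mu_p) / sigma_p) ** 2
        - 1.0
        + 2.0 * np.log(sigma_p / sigma_q)
    )

rng = np.random.default_rng(1)
w = rng.normal(scale=0.1, size=10_000)             # stand-in for trained weights
sigma_q = np.full_like(w, 0.05)                    # posterior std dev per weight
mu_p, sigma_p = np.zeros_like(w), np.ones_like(w)  # N(0, I) prior

# This KL plays the role of KL(rho || pi) in the bound above.
print(kl_diag_gaussians(w, sigma_q, mu_p, sigma_p))

# Evaluating the randomized predictor = running the net with noisy weights.
w_sample = w + sigma_q * rng.normal(size=w.shape)
```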
So are they useful? As always, depends what you’re trying to do.
Master theorem
You can give a very general PAC-Bayes bound that is removed from learning theory altogether, but recovers known learning theory bounds. We gave this bound in a 2023 paper. I’ll call it the “master theorem” because I’m trying to add more gravitas to my life.
Let $(M_t(f))_{t \ge 0}$ be a nonnegative supermartingale with initial value 1 for all $f \in \mathcal{F}$. Let $\pi$ be a data-free prior over $\mathcal{F}$. Then, with probability at least $1 - \delta$, simultaneously for all times $t$ and all distributions $\rho$ over $\mathcal{F}$,
$$\mathbb{E}_{f \sim \rho}[\log M_t(f)] \;\le\; \mathrm{KL}(\rho \,\|\, \pi) + \log\tfrac{1}{\delta}.$$
This is the time-uniform version of the master theorem, but we can also state a fixed-time version. This reads: let $M(f)$ be nonnegative and have expected value at most 1 (i.e., it is an e-value) for all $f \in \mathcal{F}$. Then, with probability at least $1 - \delta$, for all $\rho$,
$$\mathbb{E}_{f \sim \rho}[\log M(f)] \;\le\; \mathrm{KL}(\rho \,\|\, \pi) + \log\tfrac{1}{\delta}.$$
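To get a feel for how this recovers familiar bounds, here is a sketch in my notation (assuming i.i.d. data and a loss bounded in $[0,1]$, with $\lambda > 0$ fixed in advance) of plugging a Hoeffding-style e-value into the fixed-time version and getting back Catoni's bound from above.

```latex
% Sketch: recovering Catoni's bound from the fixed-time master theorem.
% Assumptions: i.i.d. data, loss bounded in [0,1], lambda > 0 fixed in advance.
% By Hoeffding's lemma, M(f) = exp( lambda (R(f) - \hat{R}_n(f)) - lambda^2/(8n) )
% is an e-value for each f.
\begin{align*}
  \mathbb{E}_{f\sim\rho}[\log M(f)]
    &\le \mathrm{KL}(\rho \,\|\, \pi) + \log\tfrac{1}{\delta}
    && \text{(master theorem, w.p.\ at least } 1-\delta\text{)} \\
  \lambda\,\mathbb{E}_{f\sim\rho}\!\big[R(f) - \widehat{R}_n(f)\big] - \tfrac{\lambda^2}{8n}
    &\le \mathrm{KL}(\rho \,\|\, \pi) + \log\tfrac{1}{\delta}
    && \text{(plug in } \log M(f)\text{)} \\
  \mathbb{E}_{f\sim\rho}[R(f)]
    &\le \mathbb{E}_{f\sim\rho}[\widehat{R}_n(f)] + \frac{\lambda}{8n}
       + \frac{\mathrm{KL}(\rho\,\|\,\pi) + \log\frac{1}{\delta}}{\lambda}
    && \text{(rearrange and divide by } \lambda\text{)}
\end{align*}
```

Other choices of supermartingale or e-value give other known bounds; that is the sense in which this is a "recipe."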
Refs
- User-friendly introduction to PAC-Bayes bounds, by Alquier. Extremely nice and simple overview.
- A primer on PAC-Bayesian learning, by Guedj. Slightly more technical and general intro.
- A unified recipe for deriving PAC-Bayes bounds, by yours truly.