PAC-Bayes bounds were originally crafted as a method to prove guarantees in learning theory that didn’t suffer from the same difficulties as PAC learning. However, they’ve become useful for proving more general concentration inequalities, especially in multivariate settings. They are the cornerstone of the variational approach to concentration.
Traditional PAC bounds are usually proven via uniform convergence arguments, which yield bounds that depend on various notions of complexity, such as the VC dimension of the class of learners. For sufficiently rich classes (such as neural nets), however, these complexities can be massive (or infinite), resulting in vacuous bounds.
The PAC-Bayes approach is to discard the idea of a worst-case analysis (which is the idea of uniform convergence) and instead take a Bayesian perspective. We place a prior $\pi$ over the class of functions $\mathcal{F}$ that we are trying to learn and develop a bound which depends not on the complexity of $\mathcal{F}$, but instead on some divergence (often the KL divergence) between our prior and any "posterior" $\rho$ over $\mathcal{F}$.
For instance, one of the earliest and most famous PAC-Bayes bounds comes from Catoni in 2003 (that guy did a lot of stuff). It says that for all $\lambda > 0$, $\delta \in (0,1)$, and priors $\pi$ over $\mathcal{F}$, with probability at least $1 - \delta$,
$$\mathbb{E}_{f \sim \rho}[R(f)] \;\le\; \mathbb{E}_{f \sim \rho}[\widehat{R}_n(f)] + \frac{\lambda}{8n} + \frac{\mathrm{KL}(\rho \,\|\, \pi) + \log\frac{1}{\delta}}{\lambda} \quad \text{for all } \rho \in \mathcal{P}(\mathcal{F}),$$
where $\mathcal{P}(\mathcal{F})$ is the set of distributions over $\mathcal{F}$, $R(f) = \mathbb{E}[\ell(f(X), Y)]$ is the risk, $\widehat{R}_n(f) = \frac{1}{n}\sum_{i=1}^n \ell(f(X_i), Y_i)$ is the empirical risk, and $\ell$ is some loss function bounded in $[0,1]$ (see statistical decision theory). So now instead of arguing about worst-case loss, we're arguing about average loss. The bound is uniform over the distributions $\rho$, meaning it holds simultaneously for all of them. But if we pick a $\rho$ that looks nothing like $\pi$, then $\mathrm{KL}(\rho \,\|\, \pi)$ will blow up.
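To make the pieces concrete, here is a minimal numerical sketch in Python. Everything in it is hypothetical (a finite class of 100 hypotheses, a uniform prior, made-up empirical risks); it just evaluates the right-hand side of Catoni's bound in the $[0,1]$-bounded-loss form above, with $\lambda$ fixed in advance.

```python
import numpy as np

def catoni_rhs(emp_risks, rho, pi, n, lam, delta):
    """Right-hand side of Catoni's bound (loss bounded in [0, 1], fixed lambda):
    E_{f~rho}[empirical risk] + lam/(8n) + (KL(rho || pi) + log(1/delta)) / lam."""
    kl = np.sum(rho * np.log(rho / pi))       # KL(rho || pi) for a finite class
    avg_emp_risk = np.dot(rho, emp_risks)     # E_{f~rho}[empirical risk]
    return avg_emp_risk + lam / (8 * n) + (kl + np.log(1 / delta)) / lam

# Entirely hypothetical setup: 100 hypotheses, uniform prior, n = 1000 samples.
rng = np.random.default_rng(0)
K, n = 100, 1000
pi = np.full(K, 1 / K)                        # uniform prior over the class
emp_risks = rng.uniform(0.1, 0.4, size=K)     # made-up empirical risks

# One possible "posterior": a Gibbs-style reweighting toward low empirical risk.
rho = np.exp(-np.sqrt(n) * emp_risks)
rho /= rho.sum()

print(catoni_rhs(emp_risks, rho, pi, n, lam=np.sqrt(8 * n), delta=0.05))
```

The bound holds simultaneously for every choice of `rho`, so you are free to pick one concentrated on low empirical risk, but the KL term charges you for straying far from the prior.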
What good is a bound on $\mathbb{E}_{f \sim \rho}[R(f)]$, you may ask? Aren't we after a bound on $R(f)$ for some particular $f$? Well yes, and sometimes this poses a problem. But sometimes it doesn't. Sometimes $f$ is a randomized predictor, in which case you really care about $\mathbb{E}_{f \sim \rho}[R(f)]$ anyway. But if not, then you need to perturb $f$ a bit to induce a distribution over it (e.g., place Gaussians over the weights in a neural net). This sounds crazy but actually led to non-vacuous bounds for neural nets, which PAC bounds have not been able to do.
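To see what "perturb $f$ a bit" can look like, here is a hedged sketch (not any specific paper's construction): a diagonal Gaussian posterior centered at some hypothetical trained weights, against an $N(0, I)$ prior, using the closed-form KL between diagonal Gaussians. That KL is the divergence term that would enter the bound.

```python
import numpy as np

def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, diag(sigma_q^2)) || N(mu_p, diag(sigma_p^2)) ), summed over coords."""
    return 0.5 * np.sum(
        (sigma_q / sigma_p) ** 2
        + ((mu_q - mu_p) / sigma_p) ** 2
        - 1.0
        + 2.0 * np.log(sigma_p / sigma_q)
    )

rng = np.random.default_rng(1)
w = rng.normal(scale=0.1, size=10_000)             # stand-in for trained weights
sigma_q = np.full_like(w, 0.05)                    # posterior std dev per weight
mu_p, sigma_p = np.zeros_like(w), np.ones_like(w)  # N(0, I) prior

# This KL plays the role of KL(rho || pi) in the bound above.
print(kl_diag_gaussians(w, sigma_q, mu_p, sigma_p))

# Evaluating the randomized predictor = running the net with noisy weights.
w_sample = w + sigma_q * rng.normal(size=w.shape)
```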
So are they useful? As always, depends what you’re trying to do.
Master theorem
You can give a very general PAC-Bayes bound that is removed from learning theory altogether, but recovers known learning theory bounds. We gave this bound in a 2023 paper. I’ll call it the “master theorem” because I’m trying to add more gravitas to my life.
Let $(M_t(f))_{t \ge 0}$ be a nonnegative supermartingale with initial value 1 for all $f \in \mathcal{F}$. Let $\pi$ be a data-free prior over $\mathcal{F}$. Then, with probability at least $1 - \delta$, simultaneously for all times $t$ and all distributions $\rho$ over $\mathcal{F}$,
$$\mathbb{E}_{f \sim \rho}[\log M_t(f)] \;\le\; \mathrm{KL}(\rho \,\|\, \pi) + \log\tfrac{1}{\delta}.$$
This is the time-uniform version of the master theorem, but we can also state a fixed-time version. This reads: let $M(f)$ be nonnegative and have expected value at most 1 (i.e., it is an e-value) for all $f \in \mathcal{F}$. Then, with probability at least $1 - \delta$, for all $\rho$,
$$\mathbb{E}_{f \sim \rho}[\log M(f)] \;\le\; \mathrm{KL}(\rho \,\|\, \pi) + \log\tfrac{1}{\delta}.$$
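To get a feel for how this recovers familiar bounds, here is a sketch in my notation (assuming i.i.d. data and a loss bounded in $[0,1]$, with $\lambda > 0$ fixed in advance) of plugging a Hoeffding-style e-value into the fixed-time version and getting back Catoni's bound from above.

```latex
% Sketch: recovering Catoni's bound from the fixed-time master theorem.
% Assumptions: i.i.d. data, loss bounded in [0,1], lambda > 0 fixed in advance.
% By Hoeffding's lemma, M(f) = exp( lambda (R(f) - \hat{R}_n(f)) - lambda^2/(8n) )
% is an e-value for each f.
\begin{align*}
  \mathbb{E}_{f\sim\rho}[\log M(f)]
    &\le \mathrm{KL}(\rho \,\|\, \pi) + \log\tfrac{1}{\delta}
    && \text{(master theorem, w.p.\ at least } 1-\delta\text{)} \\
  \lambda\,\mathbb{E}_{f\sim\rho}\!\big[R(f) - \widehat{R}_n(f)\big] - \tfrac{\lambda^2}{8n}
    &\le \mathrm{KL}(\rho \,\|\, \pi) + \log\tfrac{1}{\delta}
    && \text{(plug in } \log M(f)\text{)} \\
  \mathbb{E}_{f\sim\rho}[R(f)]
    &\le \mathbb{E}_{f\sim\rho}[\widehat{R}_n(f)] + \frac{\lambda}{8n}
       + \frac{\mathrm{KL}(\rho\,\|\,\pi) + \log\frac{1}{\delta}}{\lambda}
    && \text{(rearrange and divide by } \lambda\text{)}
\end{align*}
```

Other choices of supermartingale or e-value give other known bounds; that is the sense in which this is a "recipe."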
Refs
- User-friendly introduction to PAC-Bayes bounds, by Alquier. Extremely nice and simple overview.
- A primer on PAC-Bayesian learning, by Guedj. Slightly more technical and general intro.
- A unified recipe for deriving PAC-Bayes bounds, by yours truly.