Given a distribution $\mathcal{D}$ over feature-label pairs $(x, y) \in \mathcal{X} \times [0, 1]$, the goal of mean calibration is to produce a model $f : \mathcal{X} \to [0, 1]$ such that

$$\mathbb{E}_{(x,y) \sim \mathcal{D}}[y \mid f(x) = v] = v \quad \text{for all } v \in \mathrm{range}(f),$$

and the goal of quantile calibration is to produce a model $f$ such that, for a target quantile $q \in (0, 1)$,

$$\Pr_{(x,y) \sim \mathcal{D}}[y \le f(x) \mid f(x) = v] = q \quad \text{for all } v \in \mathrm{range}(f).$$
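As a toy illustration, suppose $f$ only ever predicts $0.2$ or $0.8$. Mean calibration asks that the average label among the points receiving prediction $0.2$ be exactly $0.2$ (and likewise for $0.8$), while quantile calibration with, say, $q = 0.9$ asks that within each of those two groups the label fall at or below the prediction $90\%$ of the time.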
The most interesting setting for calibration is online calibration.

Patching

Most research into calibration is not trying to generate a new model which is calibrated, but instead supposes we are given an uncalibrated model and asks whether we can “fix” it to make it more calibrated. Yes, yes we can.

We have a target average calibration error (see below) of $\alpha$. Let $m = |\mathrm{range}(f)|$, and suppose some value $v \in \mathrm{range}(f)$ violates calibration by more than its share of the budget:

$$\Pr_{(x,y) \sim \mathcal{D}}[f(x) = v]\,\bigl|\mathbb{E}[y \mid f(x) = v] - v\bigr| > \frac{\alpha}{m}.$$

Then we simply set $f'(x) = \mathbb{E}[y \mid f(x) = v]$ if $f(x) = v$ and $f'(x) = f(x)$ otherwise. Repeating this procedure leads to a model with average calibration error at most $\alpha$. Moreover, it also reduces squared error: each patch lowers $\mathbb{E}[(f(x) - y)^2]$ by $\Pr[f(x) = v]\,(\mathbb{E}[y \mid f(x) = v] - v)^2 > \alpha^2/m^2$, and since the squared error starts out at most $1$, you can show that this runs for at most $T$ iterations where $T \le m^2/\alpha^2$.
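Here is a minimal sketch of this loop in Python, assuming the “true distribution” is a finite population handed to us as arrays of predictions and labels (so the conditional means are exact). The function name and the per-bucket threshold $\alpha/m$ are illustrative choices, not from any particular source.

```python
import numpy as np

def patch_to_calibration(preds, labels, alpha, max_iter=100_000):
    """Iteratively patch a finite-range predictor until every bucket's
    weighted calibration error is at most alpha / m (so the average
    calibration error is at most alpha). Returns the patched predictions."""
    preds = np.asarray(preds, dtype=float).copy()
    labels = np.asarray(labels, dtype=float)
    m = len(np.unique(preds))          # size of the original range
    threshold = alpha / m

    for _ in range(max_iter):
        patched = False
        for v in np.unique(preds):
            bucket = preds == v
            weight = bucket.mean()               # Pr[f(x) = v]
            mean_y = labels[bucket].mean()       # E[y | f(x) = v]
            if weight * abs(mean_y - v) > threshold:
                preds[bucket] = mean_y           # patch the offending bucket
                patched = True
                break
        if not patched:                          # every bucket is within budget
            break
    return preds
```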

There’s also a one-shot algorithm that accomplishes this goal. If $f$ has finite range, then simply let $f'(x) = \mathbb{E}[y \mid f(x)]$. Then $f'$ has zero calibration error.

Obviously, these assume access to the true distribution. One can approximate the means with the empirical distribution and then obtain bounds on the calibration error using standard concentration inequalities.
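For the empirical version of the one-shot patch, here is a sketch (essentially histogram-style recalibration on a held-out sample; the function name is my own):

```python
import numpy as np

def one_shot_recalibrate(cal_preds, cal_labels):
    """Fit the one-shot patch on a calibration sample: map each prediction
    value v to the empirical mean of y over the points where f(x) = v."""
    cal_preds = np.asarray(cal_preds, dtype=float)
    cal_labels = np.asarray(cal_labels, dtype=float)
    table = {float(v): float(cal_labels[cal_preds == v].mean())
             for v in np.unique(cal_preds)}

    def recalibrated(preds):
        # Prediction values never seen in the calibration sample pass through
        # unchanged; exact float matching is fine because the range is finite.
        return np.array([table.get(float(v), float(v)) for v in preds])

    return recalibrated

# Usage: fit on a held-out calibration split, then apply to test predictions.
# recal = one_shot_recalibrate(preds_cal, y_cal)
# preds_test_recal = recal(preds_test)
```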

Similar arguments apply if we’re interested in quantile calibration. There’s also a one-shot version.
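For instance, here is a sketch of the quantile analogue of the one-shot patch (my own rendering, for a target quantile $q$): within each bucket, replace the predicted value with the empirical $q$-quantile of the labels.

```python
import numpy as np

def one_shot_quantile_recalibrate(cal_preds, cal_labels, q):
    """Map each prediction value v to the empirical q-quantile of y over the
    points where f(x) = v, so that within every bucket roughly a q fraction
    of labels fall at or below the new prediction."""
    cal_preds = np.asarray(cal_preds, dtype=float)
    cal_labels = np.asarray(cal_labels, dtype=float)
    table = {float(v): float(np.quantile(cal_labels[cal_preds == v], q))
             for v in np.unique(cal_preds)}
    return lambda preds: np.array([table.get(float(v), float(v)) for v in preds])
```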

Calibration error

How do we actually measure calibration error? Following Roth, define the average calibration error of a predictor $f$ with finite range on distribution $\mathcal{D}$ to be

$$K_1(f, \mathcal{D}) = \sum_{v \in \mathrm{range}(f)} \Pr_{(x,y) \sim \mathcal{D}}[f(x) = v]\,\bigl|\mathbb{E}_{(x,y) \sim \mathcal{D}}[y \mid f(x) = v] - v\bigr|.$$

This is similar to $\ell_1$ loss. We can also write something that’s similar to squared error:

$$K_2(f, \mathcal{D}) = \sum_{v \in \mathrm{range}(f)} \Pr_{(x,y) \sim \mathcal{D}}[f(x) = v]\,\bigl(\mathbb{E}_{(x,y) \sim \mathcal{D}}[y \mid f(x) = v] - v\bigr)^2.$$

Or we could naturally write something similar to $\ell_\infty$ by taking the maximum over $v$ instead of the weighted sum. Note that each error term is weighted by how often the model actually makes that prediction. So if our error is small, it’s small according to our own model’s lights, i.e., in expectation over our own predictions.
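Here’s a quick sketch of these three quantities computed from a finite sample, treating the sample as the distribution itself (the names follow the $K_1$, $K_2$, $K_\infty$ notation above):

```python
import numpy as np

def calibration_errors(preds, labels):
    """Compute the l1-style, squared-style, and max-style calibration errors
    of a finite-range predictor on the empirical distribution (preds, labels)."""
    preds = np.asarray(preds, dtype=float)
    labels = np.asarray(labels, dtype=float)
    k1 = k2 = kinf = 0.0
    for v in np.unique(preds):
        bucket = preds == v
        weight = bucket.mean()                   # Pr[f(x) = v]
        gap = abs(labels[bucket].mean() - v)     # |E[y | f(x) = v] - v|
        k1 += weight * gap
        k2 += weight * gap ** 2
        kinf = max(kinf, gap)
    return k1, k2, kinf
```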

Similar error notions exist for quantile calibration. Here we have, e.g.,

$$Q_1(f, \mathcal{D}) = \sum_{v \in \mathrm{range}(f)} \Pr_{(x,y) \sim \mathcal{D}}[f(x) = v]\,\bigl|\Pr_{(x,y) \sim \mathcal{D}}[y \le f(x) \mid f(x) = v] - q\bigr|.$$

We can also write down $Q_2$, $Q_\infty$, etc.
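And a matching sketch for the quantile error $Q_1$, under the same finite-sample reading:

```python
import numpy as np

def quantile_calibration_error(preds, labels, q):
    """Empirical Q1: each bucket's deviation from the target coverage q,
    weighted by how often the model predicts that bucket."""
    preds = np.asarray(preds, dtype=float)
    labels = np.asarray(labels, dtype=float)
    q1 = 0.0
    for v in np.unique(preds):
        bucket = preds == v
        coverage = (labels[bucket] <= v).mean()  # Pr[y <= f(x) | f(x) = v]
        q1 += bucket.mean() * abs(coverage - q)
    return q1
```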