Given a distribution $\mathcal{D}$ over feature-label pairs $(x, y) \in \mathcal{X} \times [0, 1]$, the goal of mean calibration is to produce a model $f : \mathcal{X} \to [0, 1]$ such that

$$\mathbb{E}_{(x,y) \sim \mathcal{D}}[y \mid f(x) = v] = v \quad \text{for all } v \in \mathrm{range}(f),$$

and the goal of quantile calibration is to produce a model $f$ such that, for a target quantile $q \in (0, 1)$,

$$\Pr_{(x,y) \sim \mathcal{D}}[y \le f(x) \mid f(x) = v] = q \quad \text{for all } v \in \mathrm{range}(f).$$
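As a toy illustration, suppose $f$ only ever predicts $0.2$ or $0.8$. Mean calibration asks that the average label among the points receiving prediction $0.2$ be exactly $0.2$ (and likewise for $0.8$), while quantile calibration with, say, $q = 0.9$ asks that within each of those two groups the label fall at or below the prediction $90\%$ of the time.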
The most interesting setting for calibration is online calibration.

Patching

Most research into calibration is not trying to generate a new model which is calibrated, but instead supposes we are given an uncalibrated model and asks whether we can “fix” it to make it more calibrated. Yes, yes we can.

We have a target average calibration error (see below) of $\alpha$. Let $m = |\mathrm{range}(f)|$, and suppose some value $v \in \mathrm{range}(f)$ violates calibration by more than its share of the budget:

$$\Pr_{(x,y) \sim \mathcal{D}}[f(x) = v]\,\bigl|\mathbb{E}[y \mid f(x) = v] - v\bigr| > \frac{\alpha}{m}.$$

Then we simply set $f'(x) = \mathbb{E}[y \mid f(x) = v]$ if $f(x) = v$ and $f'(x) = f(x)$ otherwise. Repeating this procedure leads to a model with average calibration error at most $\alpha$. Moreover, it also reduces squared error: each patch lowers $\mathbb{E}[(f(x) - y)^2]$ by $\Pr[f(x) = v]\,(\mathbb{E}[y \mid f(x) = v] - v)^2 > \alpha^2/m^2$, and since the squared error starts out at most $1$, you can show that this runs for at most $T$ iterations where $T \le m^2/\alpha^2$.
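Here is a minimal sketch of this loop in Python, assuming the “true distribution” is a finite population handed to us as arrays of predictions and labels (so the conditional means are exact). The function name and the per-bucket threshold $\alpha/m$ are illustrative choices, not from any particular source.

```python
import numpy as np

def patch_to_calibration(preds, labels, alpha, max_iter=100_000):
    """Iteratively patch a finite-range predictor until every bucket's
    weighted calibration error is at most alpha / m (so the average
    calibration error is at most alpha). Returns the patched predictions."""
    preds = np.asarray(preds, dtype=float).copy()
    labels = np.asarray(labels, dtype=float)
    m = len(np.unique(preds))          # size of the original range
    threshold = alpha / m

    for _ in range(max_iter):
        patched = False
        for v in np.unique(preds):
            bucket = preds == v
            weight = bucket.mean()               # Pr[f(x) = v]
            mean_y = labels[bucket].mean()       # E[y | f(x) = v]
            if weight * abs(mean_y - v) > threshold:
                preds[bucket] = mean_y           # patch the offending bucket
                patched = True
                break
        if not patched:                          # every bucket is within budget
            break
    return preds
```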

There’s also a one-shot algorithm that accomplishes this goal. If $f$ has finite range, then simply let $f'(x) = \mathbb{E}[y \mid f(x)]$. Then $f'$ has zero calibration error.

Obviously, these assume access to the true distribution. One can approximate the means with the empirical distribution and then obtain bounds on the calibration error using standard concentration inequalities.
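For the empirical version of the one-shot patch, here is a sketch (essentially histogram-style recalibration on a held-out sample; the function name is my own):

```python
import numpy as np

def one_shot_recalibrate(cal_preds, cal_labels):
    """Fit the one-shot patch on a calibration sample: map each prediction
    value v to the empirical mean of y over the points where f(x) = v."""
    cal_preds = np.asarray(cal_preds, dtype=float)
    cal_labels = np.asarray(cal_labels, dtype=float)
    table = {float(v): float(cal_labels[cal_preds == v].mean())
             for v in np.unique(cal_preds)}

    def recalibrated(preds):
        # Prediction values never seen in the calibration sample pass through
        # unchanged; exact float matching is fine because the range is finite.
        return np.array([table.get(float(v), float(v)) for v in preds])

    return recalibrated

# Usage: fit on a held-out calibration split, then apply to test predictions.
# recal = one_shot_recalibrate(preds_cal, y_cal)
# preds_test_recal = recal(preds_test)
```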

Similar arguments apply if we’re interested in quantile calibration. There’s also a one-shot version.
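For instance, here is a sketch of the quantile analogue of the one-shot patch (my own rendering, for a target quantile $q$): within each bucket, replace the predicted value with the empirical $q$-quantile of the labels.

```python
import numpy as np

def one_shot_quantile_recalibrate(cal_preds, cal_labels, q):
    """Map each prediction value v to the empirical q-quantile of y over the
    points where f(x) = v, so that within every bucket roughly a q fraction
    of labels fall at or below the new prediction."""
    cal_preds = np.asarray(cal_preds, dtype=float)
    cal_labels = np.asarray(cal_labels, dtype=float)
    table = {float(v): float(np.quantile(cal_labels[cal_preds == v], q))
             for v in np.unique(cal_preds)}
    return lambda preds: np.array([table.get(float(v), float(v)) for v in preds])
```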

Calibration error

How do we actually measure calibration error? Following Roth, define the average calibration error of a predictor $f$ with finite range on distribution $\mathcal{D}$ to be

$$K_1(f, \mathcal{D}) = \sum_{v \in \mathrm{range}(f)} \Pr_{(x,y) \sim \mathcal{D}}[f(x) = v]\,\bigl|\mathbb{E}_{(x,y) \sim \mathcal{D}}[y \mid f(x) = v] - v\bigr|.$$

This is similar to $\ell_1$ loss. We can also write something that’s similar to squared error:

$$K_2(f, \mathcal{D}) = \sum_{v \in \mathrm{range}(f)} \Pr_{(x,y) \sim \mathcal{D}}[f(x) = v]\,\bigl(\mathbb{E}_{(x,y) \sim \mathcal{D}}[y \mid f(x) = v] - v\bigr)^2.$$

Or we could naturally write something similar to $\ell_\infty$ by taking the maximum over $v$ instead of the weighted sum. Note that each error term is weighted by how often the model actually makes that prediction. So if our error is small, it’s small according to our own model’s lights, i.e., in expectation over our own predictions.
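Here’s a quick sketch of these three quantities computed from a finite sample, treating the sample as the distribution itself (the names follow the $K_1$, $K_2$, $K_\infty$ notation above):

```python
import numpy as np

def calibration_errors(preds, labels):
    """Compute the l1-style, squared-style, and max-style calibration errors
    of a finite-range predictor on the empirical distribution (preds, labels)."""
    preds = np.asarray(preds, dtype=float)
    labels = np.asarray(labels, dtype=float)
    k1 = k2 = kinf = 0.0
    for v in np.unique(preds):
        bucket = preds == v
        weight = bucket.mean()                   # Pr[f(x) = v]
        gap = abs(labels[bucket].mean() - v)     # |E[y | f(x) = v] - v|
        k1 += weight * gap
        k2 += weight * gap ** 2
        kinf = max(kinf, gap)
    return k1, k2, kinf
```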

Similar error notions exist for quantile calibration. Here we have, e.g.,

$$Q_1(f, \mathcal{D}) = \sum_{v \in \mathrm{range}(f)} \Pr_{(x,y) \sim \mathcal{D}}[f(x) = v]\,\bigl|\Pr_{(x,y) \sim \mathcal{D}}[y \le f(x) \mid f(x) = v] - q\bigr|.$$

We can also write down $Q_2$, $Q_\infty$, etc.
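And a matching sketch for the quantile error $Q_1$, under the same finite-sample reading:

```python
import numpy as np

def quantile_calibration_error(preds, labels, q):
    """Empirical Q1: each bucket's deviation from the target coverage q,
    weighted by how often the model predicts that bucket."""
    preds = np.asarray(preds, dtype=float)
    labels = np.asarray(labels, dtype=float)
    q1 = 0.0
    for v in np.unique(preds):
        bucket = preds == v
        coverage = (labels[bucket] <= v).mean()  # Pr[y <= f(x) | f(x) = v]
        q1 += bucket.mean() * abs(coverage - q)
    return q1
```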