Conformal Prediction

Conformal prediction is a popular and practical tool in uncertainty quantification. We’re in a supervised learning setting and have a black-box model $f$. Given a new observation $X$, we want to construct a confidence set for its label $Y$.

It turns out we can do this simply by having access to predictions $f(X_{1}), f(X_{2}), \dots, f(X_{n})$ and outcomes $Y_{1}, Y_{2}, \dots, Y_{n}$. We need only assume that the pairs $(X_{1}, Y_{1}), \dots, (X_{n}, Y_{n})$, together with the new pair $(X, Y)$, are exchangeable, nothing else.

There are several flavors of conformal prediction, split conformal being the easiest both conceptually and in terms of implementation.

Split conformal prediction

Introduce an arbitrary score function $s$ which maps observation-label pairs to a real number. We should think of $s$ as quantifying a heuristic notion of uncertainty, with larger values meaning that the label agrees less with the model, but it can be anything. If $f$ is a regressor then a popular choice is $s(X, Y) = |f(X) - Y|$. For classification we might take $s(X, Y) = 1 - f_{Y}(X)$, where $f_{Y}(X)$ is the probability that $f$ assigns to $X$ being in class $Y$.
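
As a rough illustration, the two score functions mentioned above might look like this in Python. The function names and the dictionary representation of class probabilities are assumptions made for this sketch, not part of any particular library.

```python
def regression_score(prediction, y):
    """Absolute-residual score s(X, Y) = |f(X) - Y|.

    `prediction` stands in for f(X); the model itself stays a black box.
    """
    return abs(prediction - y)


def classification_score(class_probs, y):
    """Score s(X, Y) = 1 - f_Y(X).

    `class_probs` is a hypothetical dict mapping each label to the
    probability the model assigns to it for the given X.
    """
    return 1.0 - class_probs[y]


# A label the model considers unlikely receives a large score.
probs = {"cat": 0.7, "dog": 0.3}
cat_score = classification_score(probs, "cat")
dog_score = classification_score(probs, "dog")
```

Here `dog_score` exceeds `cat_score` because the model puts less probability on "dog".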

Suppose we have some held-out validation data $(X_{i}, Y_{i})$, $1 \leq i \leq n$, and we compute the scores $s_{i} = s(X_{i}, Y_{i})$. Given a new covariate $X_{n + 1}$, our uncertainty set is based on the conformal p-value

$$Q_{n}(X_{n + 1}, y) = \frac{1 + \sum_{i \leq n} \mathbf{1}\{ s(X_{i}, Y_{i}) \geq s(X_{n + 1}, y) \}}{n + 1}.$$

Let $Y_{n + 1}$ be the true label for $X_{n + 1}$. If $(X_{i}, Y_{i})$, $1 \leq i \leq n + 1$, are exchangeable, then $Q_{n}(X_{n + 1}, Y_{n + 1})$ is a valid p-value, meaning that it is at most $\alpha$ with probability at most $\alpha$. Therefore, given $\alpha \in (0, 1)$, our confidence set is

$$C_{n}(X_{n + 1}) = \{ y \in \mathcal{Y} : Q_{n}(X_{n + 1}, y) > \alpha \},$$

which, by the definition of a p-value, satisfies $P(Y_{n + 1} \notin C_{n}(X_{n + 1})) = P(Q_{n}(X_{n + 1}, Y_{n + 1}) \leq \alpha) \leq \alpha$. Intuitively, if $y \neq Y_{n + 1}$ then $s(X_{n + 1}, y)$ should be large, making $Q_{n}(X_{n + 1}, y)$ small, which tends to exclude $y$ from $C_{n}(X_{n + 1})$.
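
The p-value construction can be sketched in a few lines of Python. This is a minimal sketch for a finite label set; the function names and the `candidate_scores` dictionary (mapping each candidate label $y$ to $s(X_{n + 1}, y)$) are assumptions made for illustration.

```python
def conformal_p_value(cal_scores, test_score):
    """Q_n = (1 + #{i : s_i >= test_score}) / (n + 1)."""
    n = len(cal_scores)
    at_least_as_extreme = sum(1 for s in cal_scores if s >= test_score)
    return (1 + at_least_as_extreme) / (n + 1)


def conformal_set(cal_scores, candidate_scores, alpha):
    """Keep every candidate label whose conformal p-value exceeds alpha.

    `candidate_scores` is a hypothetical dict: label y -> s(X_{n+1}, y).
    """
    return {
        y
        for y, score in candidate_scores.items()
        if conformal_p_value(cal_scores, score) > alpha
    }


# Nine held-out scores and three candidate labels for the new point.
cal_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
candidate_scores = {"a": 0.15, "b": 0.55, "c": 0.95}
prediction_set = conformal_set(cal_scores, candidate_scores, alpha=0.2)
```

Label "c" scores higher than every held-out point, so its p-value is $1/10$ and it is dropped at $\alpha = 0.2$, while "a" and "b" are kept.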

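An alternative way to describe the same algorithm is as follows, writing $\delta$ for the miscoverage level (the role played by $\alpha$ above). Compute the $\lceil (1 - \delta)(n + 1) \rceil / n$ empirical quantile of $s_{1}, \dots, s_{n}$, that is, the $\lceil (1 - \delta)(n + 1) \rceil$-th smallest score. Call this $q$. Then our confidence set for a new observation $X_{n + 1}$ is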

$$C_{n}(X_{n + 1}) = \{ y : s(X_{n + 1}, y) \leq q \}.$$
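
A minimal sketch of the quantile version, using only the Python standard library. The function names are hypothetical, and returning infinity when $\lceil (1 - \delta)(n + 1) \rceil > n$ (so that every label is kept) is a convention assumed here for the case where there are too few held-out points.

```python
import math


def split_conformal_quantile(cal_scores, delta):
    """Return the ceil((1 - delta) * (n + 1))-th smallest held-out score."""
    n = len(cal_scores)
    # Floating-point rounding can matter when (1 - delta) * (n + 1) is
    # mathematically an integer; this sketch does not guard against that.
    k = math.ceil((1 - delta) * (n + 1))
    if k > n:
        # Assumed convention: not enough data, so the set is trivial.
        return math.inf
    return sorted(cal_scores)[k - 1]


def in_conformal_set(candidate_score, q):
    """A label y is kept when s(X_{n+1}, y) <= q."""
    return candidate_score <= q


cal_scores = [0.5, 0.1, 0.9, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6]
q = split_conformal_quantile(cal_scores, delta=0.2)
```

With $n = 9$ and $\delta = 0.2$, the index is $\lceil 0.8 \cdot 10 \rceil = 8$, so `q` is the eighth smallest score.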

The set $C_{n}(X_{n + 1})$ can be shown to obey

$$1 - \delta \leq P_{(X_{n + 1}, Y_{n + 1}), (X_{i}, Y_{i})_{i = 1}^{n}} \left( Y_{n + 1} \in C_{n}(X_{n + 1}) \right) \leq 1 - \delta + \frac{1}{n},$$

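where we note that the probability is over both the new test point and the held-out validation data, not conditional on either. The upper bound additionally requires the scores to be almost surely distinct (no ties).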
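
One way to get a feel for this guarantee is a small Monte Carlo check. The sketch below is built on assumptions: a made-up data-generating process ($Y = 2X + $ Gaussian noise), a fixed stand-in model `f`, and hypothetical function names. Each trial draws fresh held-out data and a fresh test point, matching the fact that the probability is over both.

```python
import math
import random


def split_conformal_quantile(cal_scores, delta):
    """ceil((1 - delta) * (n + 1))-th smallest score, or inf if k > n."""
    n = len(cal_scores)
    k = math.ceil((1 - delta) * (n + 1))
    return math.inf if k > n else sorted(cal_scores)[k - 1]


def f(x):
    """Hypothetical fixed black-box regressor."""
    return 2.0 * x


def draw_pair(rng):
    """Assumed toy distribution: X uniform on [0, 1], Y = 2X + N(0, 1)."""
    x = rng.uniform(0.0, 1.0)
    return x, 2.0 * x + rng.gauss(0.0, 1.0)


def estimate_coverage(n=50, delta=0.1, trials=2000, seed=0):
    """Fraction of trials in which Y_{n+1} lands in C_n(X_{n+1})."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        held_out = [draw_pair(rng) for _ in range(n)]
        q = split_conformal_quantile(
            [abs(f(x) - y) for x, y in held_out], delta
        )
        x_new, y_new = draw_pair(rng)
        hits += abs(f(x_new) - y_new) <= q
    return hits / trials


coverage = estimate_coverage()
```

Up to Monte Carlo error, `coverage` should fall between $1 - \delta = 0.9$ and $1 - \delta + 1/n = 0.92$ for these settings.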

The intuition for the quantile version is very straightforward: we’re just letting the holdout scores $s(X_{1}, Y_{1}), \dots, s(X_{n}, Y_{n})$ tell us what the most extreme values of $s(X, Y)$ are empirically. If we had access to the full distribution of $s(X, Y)$, then we could compute the precise $1 - \delta$ quantile $q$ and our set $C(X)$ would be an exact $1 - \delta$ confidence set. The extra factor of $(n + 1) / n$ is a finite-sample correction.
