Most statistical inference assumes the data is somehow supplied to you (even in sequential settings; see sequential statistics). You are given data, and you are asked to estimate some parameter.

But suppose you are in a supervised learning setting and you get to choose which data points to label. Which points should you label in order to help your inference (e.g., obtain smaller confidence intervals)? Zrnic and Candes introduce the framework of active statistical inference to answer this question.

Suppose we have data $(X_i, Y_i)$, $i = 1, \dots, n$, where $Y_i$ is unobserved and $X_i$ is observed. We have a model $f$ which estimates $Y_i$ from $X_i$. The goal is to estimate $\theta^*$, the solution to an M-estimation problem:

$$\theta^* = \arg\min_\theta \, \mathbb{E}\big[\ell_\theta(X, Y)\big],$$

where $\ell_\theta$ is assumed to be convex in $\theta$. This handles mean estimation, linear regression coefficients, and quantile estimation. We have a budget of $n_b$ observations that we’re allowed to sample (in expectation).
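
For concreteness, standard choices of $\ell_\theta$ for those three examples are the usual textbook losses (my notation, not necessarily the paper's):

$$\begin{aligned}
\text{mean:} \quad & \ell_\theta(x, y) = \tfrac{1}{2}(y - \theta)^2, \\
\text{linear regression:} \quad & \ell_\theta(x, y) = \tfrac{1}{2}(y - x^\top\theta)^2, \\
\text{$q$-quantile:} \quad & \ell_\theta(x, y) = \max\{q\,(y - \theta),\ (q - 1)\,(y - \theta)\}.
\end{aligned}$$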

That the budget holds only in expectation seems like a drawback of this model. Would that actually be the case in practice? Probably you would instead have some absolute upper limit on the budget.

Intuition

Consider mean estimation, $\ell_\theta(x, y) = \frac{1}{2}(y - \theta)^2$ (so $\theta^* = \mathbb{E}[Y]$). Let $\pi(X_i)$ be the probability of sampling $Y_i$. Consider the doubly robust estimator:

$$\hat\theta = \frac{1}{n} \sum_{i=1}^n \left( f(X_i) + \frac{\xi_i}{\pi(X_i)} \big( Y_i - f(X_i) \big) \right), \qquad \xi_i \sim \mathrm{Bernoulli}(\pi(X_i)),$$
which is unbiased. The variance is

$$\mathrm{Var}(\hat\theta) = \frac{1}{n} \mathrm{Var}(Y) + \frac{1}{n} \mathbb{E}\!\left[ \frac{1 - \pi(X)}{\pi(X)} \big( Y - f(X) \big)^2 \right].$$
If $f(X) \approx Y$, or $\pi(X) \approx 1$ wherever the model errs, then $\mathrm{Var}(\hat\theta)$ is much smaller than the variance of the sample mean of just $n_b$ randomly selected points, which is $\mathrm{Var}(Y)/n_b$. So, ideally, we want to choose $\pi$ such that if the model is bad at $X_i$, then $\pi(X_i) \approx 1$.
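
As a sanity check, here is a minimal simulation of this estimator (my own illustration, not code from the paper); the Gaussian data, noise level, and constant sampling rate of 0.1 are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
Y = rng.normal(loc=2.0, scale=1.0, size=n)     # true labels (only observed where sampled)
f_preds = Y + rng.normal(scale=0.3, size=n)    # model predictions f(X_i)
pi = np.full(n, 0.1)                           # sampling probabilities pi(X_i)

xi = rng.binomial(1, pi)                       # xi_i ~ Bernoulli(pi(X_i))
# Doubly robust estimate: unsampled Y_i are multiplied by xi_i = 0, so they are never used.
theta_dr = np.mean(f_preds + (xi / pi) * (Y - f_preds))

# Baseline: classical sample mean using roughly the same labeling budget.
budget = int(pi.sum())
theta_classical = Y[rng.choice(n, size=budget, replace=False)].mean()

print(f"doubly robust: {theta_dr:.4f}, classical with same budget: {theta_classical:.4f}")
```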

Choosing $\pi$

They choose $\pi$ as a function of the model uncertainty $u$. It’s hard to give rigorous guarantees here — the methods are mainly heuristics.

Regression: We train a model $u$ to predict the error $|Y - f(X)|$. This should be trained on a dataset different from the one on which $f$ was trained.
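
A minimal sketch of this step, assuming a held-out labeled set `(X_held, y_held)` that was not used to fit $f$, an sklearn-style prediction model, and a gradient-boosting regressor for $u$ (these choices are mine, not the paper's):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fit_uncertainty_model(X_held, y_held, f):
    """Fit u(x) to predict the absolute error |y - f(x)| of the prediction model."""
    residuals = np.abs(y_held - f.predict(X_held))  # errors of f on data it was not trained on
    return GradientBoostingRegressor().fit(X_held, residuals)

# u = fit_uncertainty_model(X_held, y_held, f)
# u.predict(X) then serves as the uncertainty score u(X).
```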

Classification: Here we assume that $f(X) = \arg\max_y p_y(X)$, where $p_y(X)$ is the predicted probability of class $y$ (out of $K$ classes). Then we set

$$u(X) = 1 - \max_y p_y(X).$$
If the classifier is maximally uncertain, then $p_y(X) = 1/K$ for all $y$, so $u(X) = 1 - 1/K$. In practice, it helps to mix $u$ with a uniform so that small sampling probabilities don’t blow up the variance.
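
A sketch of this heuristic; the mixing weight `eps` and the particular mixing scheme are assumptions on my part:

```python
import numpy as np

def classification_uncertainty(probs, eps=0.1):
    """probs: (n, K) array of predicted class probabilities p_y(X_i)."""
    u = 1.0 - probs.max(axis=1)       # u(X_i) = 1 - max_y p_y(X_i)
    return (1.0 - eps) * u + eps      # mix in a uniform component so u stays bounded away from 0
```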

Once $u$ is determined, we might consider sampling strategies of the form $\pi(X) = \min\{\gamma\, u(X),\, 1\}$, where $\gamma$ is some hyperparameter meant to ensure that we respect the budget $n_b$. They suggest choosing $\gamma = \hat\gamma$, where $\hat\gamma$ is a data-driven selected parameter. The actual sampling is then done by drawing $\xi_i \sim \mathrm{Bernoulli}(\pi(X_i))$ (i.e., whether $Y_i$ is sampled is independent of the other observations). Note that this means we only meet the budget in expectation:

$$\mathbb{E}\left[ \sum_{i=1}^n \xi_i \right] = \sum_{i=1}^n \pi(X_i) \le n_b,$$
where the expectation is over the sampling only.
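
A sketch of a budget-respecting rule of this form; the bisection used to pick $\gamma$ is an illustrative stand-in for the paper's data-driven choice:

```python
import numpy as np

def calibrate_sampling_probs(u_vals, n_b, tol=1e-6):
    """Return pi(X_i) = min(gamma * u(X_i), 1) with gamma chosen so that sum_i pi(X_i) ~= n_b."""
    assert n_b <= len(u_vals), "budget cannot exceed the number of observations"
    u_vals = np.maximum(u_vals, 1e-12)              # guard against all-zero uncertainties
    expected = lambda gamma: np.minimum(gamma * u_vals, 1.0).sum()
    lo, hi = 0.0, 1.0
    while expected(hi) < n_b:                       # grow hi until the budget is reachable
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if expected(mid) < n_b else (lo, mid)
    return np.minimum(hi * u_vals, 1.0)

def sample_labels(pi, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    # xi_i ~ Bernoulli(pi(X_i)), independent across i: only the *expected* label count is controlled.
    return rng.binomial(1, pi)
```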

General setting

We design a sampling algorithm which samples $Y_i$ with probability $\pi(X_i)$. That is, we draw $\xi_i \sim \mathrm{Bernoulli}(\pi(X_i))$ and if $\xi_i = 1$ we sample $Y_i$, otherwise we don’t. This implies that all observations are sampled independently from one another. $\pi$ is scaled such that $\sum_{i=1}^n \pi(X_i) \le n_b$ to ensure that the expected number of labeled observations does not exceed the budget.

Let $\ell_\theta(X_i, f(X_i))$ be the estimate of the loss on example $i$, and $\ell_\theta(X_i, Y_i)$ be the true loss on example $i$. An unbiased estimator of the population loss $L(\theta) = \mathbb{E}[\ell_\theta(X, Y)]$ is

$$\hat L(\theta) = \frac{1}{n} \sum_{i=1}^n \left( \ell_\theta(X_i, f(X_i)) + \frac{\xi_i}{\pi(X_i)} \big( \ell_\theta(X_i, Y_i) - \ell_\theta(X_i, f(X_i)) \big) \right),$$

and the active estimate is $\hat\theta = \arg\min_\theta \hat L(\theta)$.
Note that this is just the doubly robust estimator. If $\pi$ is just the uniform rule and doesn’t prioritize some observations over others, this recovers the prediction-powered inference estimator.
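
A generic sketch of this estimator using numerical minimization; `loss(theta, X, y)` stands for any loss convex in $\theta$, and the helper names are mine rather than the paper's:

```python
import numpy as np
from scipy.optimize import minimize

def active_objective(theta, X, Y, f_preds, xi, pi, loss):
    pred_loss = loss(theta, X, f_preds)             # l_theta(X_i, f(X_i)) for every i
    true_loss = loss(theta, X, Y)                   # l_theta(X_i, Y_i); only used where xi_i = 1
    correction = np.where(xi == 1, (true_loss - pred_loss) / pi, 0.0)
    return np.mean(pred_loss + correction)          # unbiased estimate of E[l_theta(X, Y)]

def active_m_estimate(X, Y, f_preds, xi, pi, loss, theta0):
    return minimize(active_objective, theta0, args=(X, Y, f_preds, xi, pi, loss)).x

def squared_loss(theta, X, y):
    # Mean estimation: unsampled entries of Y may hold any placeholder value,
    # since they are zeroed out by the xi == 0 branch in active_objective.
    return 0.5 * (y - theta) ** 2

# theta_hat = active_m_estimate(X, Y, f_preds, xi, pi, squared_loss, theta0=np.zeros(1))
```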

The resulting confidence interval for our estimate comes from a CLT, from which a Wald interval is constructed. The CLT mainly follows from the usual CLT for M-estimation, but needs to correct for the fact that $\pi$ is not independent of the data. This is done by assuming that $\pi \to \pi_0$ in probability for some fixed rule $\pi_0$.
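
For the mean-estimation case, a sketch of the resulting Wald interval; the plug-in variance below treats the per-observation terms as i.i.d. and glosses over the $\pi \to \pi_0$ correction:

```python
import numpy as np
from scipy.stats import norm

def active_mean_ci(f_preds, Y, xi, pi, alpha=0.05):
    Z = f_preds + np.where(xi == 1, (Y - f_preds) / pi, 0.0)  # per-observation doubly robust terms
    n = len(Z)
    theta_hat = Z.mean()
    half_width = norm.ppf(1 - alpha / 2) * Z.std(ddof=1) / np.sqrt(n)
    return theta_hat - half_width, theta_hat + half_width
```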

They also study a sequential setting, where the model and sampling rule can be iteratively updated. Then they apply a martingale CLT to get a CI for the estimate.

References
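
Tijana Zrnic and Emmanuel J. Candès. Active Statistical Inference. 2024.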