Most statistical inference assumes the data is somehow supplied to you (even in sequential settings; see sequential statistics). You are given data, and you are asked to estimate some parameter.
But suppose we are in a supervised learning setting and you get to choose which data points to label. How should you choose them in order to help your inference (e.g., obtain smaller confidence intervals)? Zrnic and Candes introduce the framework of active statistical inference to answer this question.
Suppose we have data $(X_1, Y_1), \dots, (X_n, Y_n)$, where $Y_i$ is unobserved and $X_i$ is observed. We have a model $f$ which estimates $Y$ from $X$. The goal is to estimate $\theta^*$, the solution to an M-estimation problem:

$$\theta^* = \arg\min_\theta \; E[\ell_\theta(X, Y)],$$

where the loss $\ell_\theta$ is assumed to be convex in $\theta$. This handles mean estimation, linear regression coefficients, and quantile estimation. We have a budget of $n_b$ observations that we're allowed to sample (in expectation).
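For example (a standard instance, not specific to the paper), the pinball loss recovers the $q$-quantile:

$$\ell_\theta(x, y) = (y - \theta)\bigl(q - \mathbf{1}\{y < \theta\}\bigr) \quad\Longrightarrow\quad \theta^* = \inf\{t : P(Y \le t) \ge q\}.$$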
That the budget holds only in expectation seems like a drawback of this model. Would that actually be the case in practice? More likely you would have some absolute upper limit on the budget.
Intuition
Consider mean estimation, $\ell_\theta(x, y) = \tfrac{1}{2}(y - \theta)^2$ (so $\theta^* = E[Y]$). Let $\pi(X_i)$ be the probability of sampling $Y_i$, and let $\xi_i \sim \mathrm{Bernoulli}(\pi(X_i))$ indicate whether $Y_i$ is actually sampled. Consider the doubly robust estimator:

$$\hat\theta = \frac{1}{n} \sum_{i=1}^n \left( f(X_i) + \frac{\xi_i}{\pi(X_i)} \bigl( Y_i - f(X_i) \bigr) \right),$$
which is unbiased. The variance is

$$\mathrm{Var}(\hat\theta) = \frac{1}{n} \left( \mathrm{Var}(Y) + E\left[ \frac{1 - \pi(X)}{\pi(X)} \bigl( Y - f(X) \bigr)^2 \right] \right).$$
If $f(X) \approx Y$, or if $\pi(X) \approx 1$ wherever the residual $Y - f(X)$ is large, then $\mathrm{Var}(\hat\theta)$ is much smaller than the variance of the sample mean of $n_b$ randomly selected labeled points, which is $\mathrm{Var}(Y)/n_b$. So, ideally, we want to choose $\pi$ such that if the model is bad on $X$, then $\pi(X) \approx 1$.
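A minimal simulation sketch of this estimator for mean estimation (not from the paper; the data-generating process, the constant $\pi$, and all variable names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

X = rng.normal(size=n)
Y = X + rng.normal(scale=0.5, size=n)  # true labels; E[Y] = 0
f = X                                  # model predictions of Y from X

pi = np.full(n, 0.1)                   # sampling probabilities (constant here)
xi = rng.binomial(1, pi)               # which labels we actually observe

# Doubly robust estimate of E[Y]: uses f(X) everywhere and corrects it with
# inverse-propensity-weighted residuals on the labeled points.
theta_dr = np.mean(f + xi / pi * (Y - f))

# Baseline: plain sample mean of the same expected number of labeled points.
theta_classical = Y[rng.random(n) < 0.1].mean()

print(theta_dr, theta_classical)
```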
Choosing $\pi$
They choose $\pi$ as a function of the model uncertainty. It’s hard to give rigorous guarantees here — the methods are mainly heuristics.
Regression: We train a model $u$ to predict the error $|Y - f(X)|$. This should be trained on a dataset different from the one on which $f$ was trained.
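A small sketch of this step, under the assumption that the uncertainty model is a regressor fit to the absolute residuals on held-out data (the model class and all names are illustrative):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fit_uncertainty_model(X_holdout, Y_holdout, f):
    # Targets for the uncertainty model: how wrong f is on held-out data.
    residuals = np.abs(Y_holdout - f(X_holdout))
    return GradientBoostingRegressor().fit(X_holdout, residuals)

# u_model = fit_uncertainty_model(X_holdout, Y_holdout, f)
# u = u_model.predict(X_new)   # uncertainty scores u(x) for new points
```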
Classification: Here we assume that $f(x) = \arg\max_y p_y(x)$, where $p_y(x)$ is the model's estimate of $P(Y = y \mid X = x)$ over $K$ classes. Then we set

$$u(x) = 1 - \max_y p_y(x).$$
If the classifier is maximally uncertain, then $p_y(x) = 1/K$ for all $y$, so $u(x) = 1 - 1/K$. In practice, it helps to mix $u$ with a uniform score so that small sampling probabilities don't blow up the variance.
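A sketch of this score, assuming `probs` holds the predicted class probabilities of shape `(n, K)`; the mixing weight `mix` is an illustrative choice:

```python
import numpy as np

def classification_uncertainty(probs: np.ndarray, mix: float = 0.1) -> np.ndarray:
    u = 1.0 - probs.max(axis=1)      # u(x) = 1 - max_y p_y(x)
    return (1.0 - mix) * u + mix     # mix with a uniform (constant) score
```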
Once $u$ is determined, we might consider sampling strategies of the form

$$\pi(x) = \min\{\gamma \, u(x), \, 1\},$$

where $\gamma$ is some hyperparameter meant to ensure that we respect the budget $n_b$. They suggest choosing $\gamma$ in a data-driven way, calibrated so that $\sum_{i=1}^n \pi(X_i) \approx n_b$. The actual sampling is then done by drawing $\xi_i \sim \mathrm{Bernoulli}(\pi(X_i))$ (i.e., whether $Y_i$ is sampled is independent of the other observations). Note that this means we only meet the budget in expectation:

$$E\left[\sum_{i=1}^n \xi_i\right] = \sum_{i=1}^n \pi(X_i) \le n_b,$$

where the expectation is over the sampling $\xi$ only.
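A sketch of this step (the normalization $\gamma = n_b / \sum_i u(X_i)$ and the clipping at 1 are assumptions consistent with the rule above, not necessarily the paper's exact tuning):

```python
import numpy as np

def sampling_probs(u: np.ndarray, n_budget: int) -> np.ndarray:
    gamma = n_budget / u.sum()             # data-driven scaling
    return np.clip(gamma * u, 0.0, 1.0)    # sum(pi) <= n_budget after clipping

def sample_labels(u: np.ndarray, n_budget: int, rng=None):
    rng = rng or np.random.default_rng()
    pi = sampling_probs(u, n_budget)
    xi = rng.binomial(1, pi)               # independent Bernoulli draws
    return pi, xi                          # E[xi.sum()] = pi.sum() <= n_budget
```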
General setting
We design a sampling algorithm which samples $Y_i$ with probability $\pi(X_i)$. That is, we draw $\xi_i \sim \mathrm{Bernoulli}(\pi(X_i))$ and sample $Y_i$ if $\xi_i = 1$, otherwise we don't. This implies that all observations are sampled independently from one another. $\pi$ is scaled such that $\sum_{i=1}^n \pi(X_i) \le n_b$, to ensure that the expected number of labeled observations does not exceed the budget.
Let $L^f_i(\theta) = \ell_\theta(X_i, f(X_i))$ be the model-based estimate of the loss on example $i$, and $L_i(\theta) = \ell_\theta(X_i, Y_i)$ be the true loss on example $i$. An unbiased estimator of $L(\theta) = E[\ell_\theta(X, Y)]$ is

$$\hat L(\theta) = \frac{1}{n} \sum_{i=1}^n \left( L^f_i(\theta) + \frac{\xi_i}{\pi(X_i)} \bigl( L_i(\theta) - L^f_i(\theta) \bigr) \right),$$

and the active estimate $\hat\theta$ is the minimizer of $\hat L(\theta)$.
Note that this is just the doubly robust estimator. If $\pi$ is just the uniform rule, $\pi(x) \equiv n_b / n$, and doesn't prioritize some observations over others, this recovers the prediction-powered inference estimator.
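A sketch of the general estimator for a scalar parameter (e.g. a mean or a quantile): minimize the doubly robust loss estimate over $\theta$. The function names, the placeholder handling, and the SciPy-based optimizer are assumptions, not the paper's code:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def active_m_estimate(ell, preds, labels, xi, pi):
    """ell(y, theta) -> per-example losses; `labels` only needs valid entries where xi == 1."""
    def dr_loss(theta):
        L_f = ell(preds, theta)                           # model-based loss estimate
        L = ell(np.where(xi == 1, labels, preds), theta)  # true loss (placeholders where unlabeled)
        return np.mean(L_f + xi / pi * (L - L_f))         # doubly robust loss estimate
    return minimize_scalar(dr_loss).x

# Example: mean estimation via the squared loss.
# theta_hat = active_m_estimate(lambda y, t: 0.5 * (y - t) ** 2, f_preds, Y, xi, pi)
```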
The resulting confidence interval for $\hat\theta$ comes from a CLT, from which a Wald interval is constructed. The CLT mainly follows from the usual CLT for M-estimation, but needs to correct for the fact that $\pi$ is not independent of the data. This is done by assuming that $\pi \to \pi_0$ in probability for some fixed sampling rule $\pi_0$.
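For the mean-estimation case, a sketch of the resulting Wald interval (plug-in variance of the per-example doubly robust terms; this assumes $\pi$ has stabilized as in the condition above, and the names are illustrative):

```python
import numpy as np
from scipy.stats import norm

def dr_mean_ci(preds, labels, xi, pi, alpha=0.05):
    """`labels` only needs valid entries where xi == 1."""
    y = np.where(xi == 1, labels, preds)       # placeholders where unlabeled
    terms = preds + xi / pi * (y - preds)      # doubly robust per-example terms
    theta_hat = terms.mean()
    se = terms.std(ddof=1) / np.sqrt(len(terms))
    z = norm.ppf(1 - alpha / 2)
    return theta_hat - z * se, theta_hat + z * se
```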
They also study a sequential setting, where the model and sampling rule can be iteratively updated. Then they apply a martingale CLT to get a CI for the estimate.
References
- Active statistical inference by Zrnic and Candes.
- Efficient adaptive experimental design for ATE estimation. Very similar ideas, but with more focus on estimation as opposed to choosing the sampling strategy.