Sufficient statistics attempt to capture precisely what is important about a distribution. A sufficient statistic is a statistic of the data which, informally, we should be able to use in place of the data itself when doing our analysis.

For a more concrete example of how knowing a sufficient statistic suffices for downstream analysis, see sufficiency and the likelihood.

Definition

Formally, given a model $p(x \mid \theta)$, we say $T(X)$ is a sufficient statistic for $\theta$ if, after conditioning on $T(X)$, the distribution no longer depends on $\theta$: $p(x \mid T(X) = t, \theta) = p(x \mid T(X) = t)$, i.e. $X$ is independent of $\theta$ given $T(X)$.
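For concreteness, here is a worked sketch of this definition using the usual textbook example of i.i.d. Bernoulli draws with $T$ taken to be the sum (the choice of example here is mine):

```latex
% X_1, ..., X_n iid Bernoulli(theta), T(X) = sum_i X_i.
% Any sequence x with sum t has the same probability:
\[
P(X = x \mid \theta) = \prod_{i=1}^{n} \theta^{x_i} (1-\theta)^{1-x_i}
                     = \theta^{t} (1-\theta)^{n-t},
\qquad t = \sum_{i=1}^{n} x_i .
\]
% Conditioning on T(X) = t, the theta-dependence cancels:
\[
P(X = x \mid T(X) = t, \theta)
  = \frac{\theta^{t} (1-\theta)^{n-t}}
         {\binom{n}{t}\, \theta^{t} (1-\theta)^{n-t}}
  = \frac{1}{\binom{n}{t}},
\]
% which is free of theta, so T(X) = sum_i X_i is sufficient.
```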

An alternative definition is that the information processing inequality $I(\theta; T(X)) \le I(\theta; X)$ holds with equality: $I(\theta; T(X)) = I(\theta; X)$. This is intuitive: knowing $T(X)$ tells you everything that $X$ would tell you about $\theta$.
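A short justification of why the two definitions agree (a sketch, treating $\theta$ as a random variable with some prior so that the mutual informations are defined):

```latex
% T = T(X) is a function of X, so theta -> X -> T is a Markov chain, and
% the data processing inequality gives I(theta; T) <= I(theta; X).
% Sufficiency says X is independent of theta given T, so theta -> T -> X
% is also a Markov chain, and the same inequality runs the other way:
\[
I(\theta; T(X)) \;\le\; I(\theta; X) \;\le\; I(\theta; T(X))
\quad\Longrightarrow\quad
I(\theta; X) = I(\theta; T(X)).
\]
```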

Many statistics can be sufficient; trivially, the full data $T(X) = X$ always is. For a stricter definition see minimal sufficiency.

The Rao-Blackwell theorem says that an estimator can be improved (its risk never increased) by conditioning it on a sufficient statistic.
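Concretely, the standard statement for squared error loss looks like this (a sketch; the notation $\delta$, $\tilde{\delta}$ is mine):

```latex
% Rao-Blackwellization: replace an estimator delta(X) of theta by its
% conditional expectation given the sufficient statistic T:
\[
\tilde{\delta}(T) = \mathbb{E}\big[\delta(X) \mid T\big].
\]
% Sufficiency ensures this conditional expectation does not depend on
% theta, so tilde-delta is a genuine estimator. By conditional Jensen,
\[
\mathbb{E}\big[(\tilde{\delta}(T) - \theta)^2\big]
\;\le\;
\mathbb{E}\big[(\delta(X) - \theta)^2\big],
\]
% i.e. the Rao-Blackwellized estimator has no worse mean squared error.
```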

In the discrete case, we can view the sufficient statistic as partitioning the values of $X$ into the sets $A_t = \{x : T(x) = t\}$ for the possible values $t$ of the statistic $T$. If the conditional distribution of $X$ within each element of the resulting partition does not depend on $\theta$, then $T$ is a sufficient statistic. Eg: Draw $X_1, \dots, X_n \sim \text{Bernoulli}(\theta)$ i.i.d. and let $T(X) = \sum_{i=1}^n X_i$; a quick numerical check of this partition view follows below.
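Here is a small simulation of that check (a sketch of my own, not from the text): sample sequences under several values of $\theta$, condition on landing in the partition cell $T = t$, and observe that the conditional distribution over sequences is the same (uniform) in every case.

```python
# For X_1..X_n iid Bernoulli(theta) and T = sum(X), sufficiency predicts
# the conditional distribution of the full sequence given T = t is the
# same for every theta (here: uniform over sequences with sum t).
import random
from collections import Counter

def empirical_conditional(theta, n=4, t=2, trials=200_000, seed=0):
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(trials):
        x = tuple(int(rng.random() < theta) for _ in range(n))
        if sum(x) == t:            # condition on the partition cell T(x) = t
            counts[x] += 1
    total = sum(counts.values())
    return {x: c / total for x, c in sorted(counts.items())}

for theta in (0.2, 0.5, 0.8):
    dist = empirical_conditional(theta)
    # All C(4, 2) = 6 sequences should appear with frequency ~ 1/6,
    # no matter which theta generated the data.
    print(theta, [round(p, 3) for p in dist.values()])
```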

Fisher-Neyman characterization

In general, though, we can appeal to the Fisher-Neyman characterization:

Thm: $T(X)$ is sufficient for $\theta$ iff the joint pdf of $X$ can be factored as
$$p(x \mid \theta) = h(x)\, g(T(x), \theta).$$
That is, it can be factored into a product of a function of the data only (no parameter) and a function of $T(x)$ and the parameter.
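As an illustration of reading off a sufficient statistic from the factorization (a standard example, sketched here with i.i.d. Poisson draws; the example choice is mine):

```latex
% X_1, ..., X_n iid Poisson(lambda):
\[
p(x \mid \lambda)
 = \prod_{i=1}^{n} \frac{\lambda^{x_i} e^{-\lambda}}{x_i!}
 = \underbrace{\frac{1}{\prod_{i=1}^{n} x_i!}}_{h(x)}
   \cdot
   \underbrace{\lambda^{\sum_i x_i}\, e^{-n\lambda}}_{g(T(x),\,\lambda)},
\]
% so by the factorization theorem, T(x) = sum_i x_i is sufficient
% for lambda.
```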
