Let . The goal of density estimation is to determine the density of , call it . Here we want to make as few assumptions about as possible (i.e., we don’t assume that comes from some parametric family; see parametric versus nonparametric statistics).
An obvious solution is to simply take the empirical distribution, in which case our estimator is
But this solution obviously overfits the given data and has very few nice properties (continuity, smoothness, etc). It doesn’t generalize from the data at all; it sucks.
Common methods to nonparametric density estimation include:
- histograms
- kernel density estimation
- Series estimators, in which one chooses a basis for a function class, and then estimates the basis coefficients of the density.
In terms of evaluating a particular estimator (see statistical decision theory), typically we’re interested in loss, i.e.,
Here is treated as a fixed function of the training data. The risk is then the expectation of the loss over the training data:
As usual, the risk can be decomposed into a bias term and variance term (cf squared error).
A solution to nonparametric density estimation also provides a solution to nonparametric regression as follows. Suppose is an estimate of the distribution . Then, for , we can generate an estimate of with
We can estimate both of and with nonparametric density estimation. Then we can plug this into the empirical distribution of the to estimate the integral.