Nonparametric Density Estimation

Let $X_{1}, \dots, X_{n} \sim P$. The goal of density estimation is to estimate the density of $P$, call it $p$. Here we want to make as few assumptions about $p$ as possible (i.e., we don’t assume that $P$ comes from some parametric family; see parametric versus nonparametric statistics).

An obvious solution is to simply take the empirical distribution, in which case our estimator is

$$\hat{p}(x) = \frac{1}{n} \sum_{i=1}^{n} 1(x = X_{i}).$$

But this solution obviously overfits the given data and has very few nice properties (continuity, smoothness, etc.). It puts all of its mass on the observed points, so it doesn’t generalize from the data at all.

Common methods for nonparametric density estimation include:

histograms
kernel density estimation
series estimators, in which one chooses a basis for a function class and then estimates the basis coefficients of the density
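As a rough illustration of the first two methods in the list, here is a minimal NumPy sketch of a histogram estimator and a kernel density estimator. The function names (`histogram_density`, `gaussian_kde_estimate`), the choice of a Gaussian kernel, and the bin range and bandwidth arguments are assumptions made for this example, not something fixed by the note.

```python
import numpy as np


def histogram_density(data, num_bins, lo, hi):
    """Histogram density estimate with equal-width bins on [lo, hi)."""
    data = np.asarray(data, dtype=float)
    width = (hi - lo) / num_bins

    def bin_index(x):
        # Index of the bin that each point falls into (may be out of range).
        return np.floor((np.asarray(x, dtype=float) - lo) / width).astype(int)

    idx = bin_index(data)
    idx = idx[(idx >= 0) & (idx < num_bins)]          # drop points outside [lo, hi)
    counts = np.bincount(idx, minlength=num_bins)
    heights = counts / (len(data) * width)            # bar heights, area = fraction in range

    def p_hat(x):
        x = np.atleast_1d(np.asarray(x, dtype=float))
        j = bin_index(x)
        inside = (j >= 0) & (j < num_bins)
        out = np.zeros(x.shape)
        out[inside] = heights[j[inside]]
        return out

    return p_hat


def gaussian_kde_estimate(data, bandwidth):
    """Kernel density estimate with a Gaussian kernel (an assumed kernel choice)."""
    data = np.asarray(data, dtype=float)

    def p_hat(x):
        x = np.atleast_1d(np.asarray(x, dtype=float))
        z = (x[:, None] - data[None, :]) / bandwidth  # pairwise scaled distances
        kernel = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
        return kernel.sum(axis=1) / (len(data) * bandwidth)

    return p_hat
```

Both functions return a callable $\hat{p}$ that can be evaluated on a grid. Unlike the empirical distribution, each spreads mass over a neighbourhood of every data point, with the bin width or bandwidth controlling how much.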

In terms of evaluating a particular estimator $\hat{p}$ (see statistical decision theory), typically we’re interested in $L_{2}$ loss, i.e.,

$$L(p, \hat{p}) = \int \left( p(x) - \hat{p}(x) \right)^{2} \, dx.$$

Here $\hat{p}$ is treated as a fixed function determined by the training data. The risk is then the expectation of the loss over the training data:

$$R(p, \hat{p}) = E \left[ L(p, \hat{p}) \right].$$
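To make the distinction between loss and risk concrete, the sketch below approximates the $L_{2}$ loss by a Riemann sum on a grid and the risk by averaging that loss over repeated simulated training sets. Everything specific here is an assumption for illustration: the true density is taken to be standard normal, the estimator is a Gaussian kernel density estimate, and the helper names (`normal_pdf`, `kde_on_grid`, `l2_loss`, `monte_carlo_risk`) are hypothetical.

```python
import numpy as np


def normal_pdf(x):
    """Standard normal density, used both as the assumed truth and as the kernel."""
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)


def kde_on_grid(data, bandwidth, grid):
    """Gaussian kernel density estimate evaluated at every grid point."""
    z = (grid[:, None] - data[None, :]) / bandwidth
    return normal_pdf(z).sum(axis=1) / (len(data) * bandwidth)


def l2_loss(p_true, p_est, dx):
    """Riemann-sum approximation of the integral of (p - p_hat)^2."""
    return float(np.sum((p_true - p_est) ** 2) * dx)


def monte_carlo_risk(n, bandwidth, num_reps=200, seed=0):
    """Average the L2 loss over num_reps independent training sets of size n."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(-6.0, 6.0, 601)
    dx = grid[1] - grid[0]
    p_true = normal_pdf(grid)
    losses = []
    for _ in range(num_reps):
        data = rng.normal(size=n)                 # fresh training sample from the assumed P
        p_est = kde_on_grid(data, bandwidth, grid)
        losses.append(l2_loss(p_true, p_est, dx))
    return float(np.mean(losses))
```

A single call to `l2_loss` gives the loss for one training set; `monte_carlo_risk` averages over many training sets, which is the expectation in the definition of the risk. In practice the true density is unknown, so a simulation like this is only useful for studying an estimator under a known $P$.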

As usual, the risk can be decomposed into a bias term and a variance term (cf. squared error).
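Written out for the $L_{2}$ loss, and assuming the expectation and the integral can be exchanged (which holds here because the integrand is nonnegative), the decomposition is

$$R(p, \hat{p}) = \int E \left[ \left( \hat{p}(x) - p(x) \right)^{2} \right] dx = \int \left( E[\hat{p}(x)] - p(x) \right)^{2} dx + \int \mathrm{Var} \left( \hat{p}(x) \right) dx.$$

The first integral is the integrated squared bias and the second is the integrated variance; the cross term vanishes pointwise because $E \left[ \hat{p}(x) - E[\hat{p}(x)] \right] = 0$.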

A solution to nonparametric density estimation also provides a solution to nonparametric regression as follows. Suppose $(X_{1}, Y_{1}), \dots, (X_{n}, Y_{n}) \sim P$ and $\hat{p}$ is an estimate of the joint density $p(x, y)$ of $P$. Then, for $m(x) = E[Y \mid X = x]$, we can generate an estimate of $m$ with

$$\hat{m}(x) = \int y \, \hat{p}(y \mid x) \, dy = \int y \, \frac{\hat{p}(x, y)}{\hat{p}(x)} \, dy.$$

We can estimate both $p(x, y)$ and $p(x)$ with nonparametric density estimation. The integral over $y$ can then be estimated using the empirical distribution of $Y_{1}, \dots, Y_{n}$.
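One concrete instance of this plug-in idea, sketched under the assumption that both densities are estimated with a Gaussian product kernel: the integral over $y$ then reduces to a kernel-weighted average of the $Y_{i}$ (the Nadaraya-Watson form), because each Gaussian kernel in $y$ has mean $Y_{i}$. The code below computes the estimate both ways, by numerically integrating the plug-in formula and by the weighted average. The function names, bandwidths `hx` and `hy`, and the integration grid are assumptions of this example.

```python
import numpy as np


def gauss(z):
    """Standard normal density, used as the kernel."""
    return np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)


def plug_in_regression(x0, xs, ys, hx, hy, y_grid):
    """Integral of y * p_hat(x0, y) / p_hat(x0) over y, via a Riemann sum on y_grid."""
    kx = gauss((x0 - xs) / hx) / hx                          # kernel in x, shape (n,)
    ky = gauss((y_grid[:, None] - ys[None, :]) / hy) / hy    # kernel in y, shape (m, n)
    joint = (ky * kx[None, :]).mean(axis=1)                  # p_hat(x0, y) on the grid
    marginal = kx.mean()                                     # p_hat(x0)
    dy = y_grid[1] - y_grid[0]
    return float(np.sum(y_grid * joint) * dy / marginal)


def kernel_weighted_average(x0, xs, ys, hx):
    """Closed form of the same estimate: weights proportional to the x-kernel."""
    w = gauss((x0 - xs) / hx)
    return float(np.sum(w * ys) / np.sum(w))
```

With a symmetric kernel in $y$, the bandwidth `hy` drops out of the closed form, so only the smoothing in $x$ affects the regression estimate.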
