The histogram is possibly the simplest method for nonparametric density estimation after the empirical distribution.
We split the space $[0,1]$ into $m$ equal-sized bins $B_1, \dots, B_m$. Note that the true probability of bin $B_j$ under the density $f$ is
$$p_j = \int_{B_j} f(x) \, dx.$$
We estimate $p_j$ by $\hat{p}_j = \frac{1}{n} \sum_{i=1}^n \mathbb{1}\{x_i \in B_j\}$, i.e., by simply counting the fraction of training points which landed in bin $B_j$. We estimate $f(x)$ for $x \in B_j$ by placing a uniform distribution over the bin $B_j$, i.e., we assume $f$ is constant on each bin. Our estimator is therefore
$$\hat{f}_h(x) = \sum_{j=1}^m \frac{\hat{p}_j}{h} \, \mathbb{1}\{x \in B_j\},$$
where $h = 1/m$ is the bin width. Note that $m$ is the number of bins and $n$ is the number of training points.
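As a minimal sketch, here is the estimator in NumPy, assuming one-dimensional data supported on $[0,1]$ (the function name `histogram_density` is illustrative, not from any library):

```python
import numpy as np

def histogram_density(x_train, m):
    """Histogram density estimate on [0, 1] with m equal-width bins.

    Returns a function f_hat that evaluates the estimated density.
    """
    n = len(x_train)
    h = 1.0 / m                                  # bin width h = 1/m
    # hat{p}_j: fraction of training points landing in each bin.
    counts = np.histogram(x_train, bins=m, range=(0.0, 1.0))[0]
    p_hat = counts / n

    def f_hat(x):
        # Each point inherits the uniform density of its bin: hat{p}_j / h.
        j = np.clip((np.asarray(x) / h).astype(int), 0, m - 1)
        return p_hat[j] / h

    return f_hat

# Example: estimate the density of a Beta(2, 5) sample with 10 bins.
rng = np.random.default_rng(0)
f_hat = histogram_density(rng.beta(2, 5, size=1000), m=10)
print(f_hat([0.05, 0.25, 0.9]))
```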
If $x \in B_j$, then $\mathbb{E}[\hat{f}_h(x)] = p_j / h$ and $\mathrm{Var}(\hat{f}_h(x)) = \frac{p_j (1 - p_j)}{n h^2}$. Obviously, the choice of $h$ is important and affects performance. The worst-case risk (statistical decision theory) of the histogram over a 1-Hölder space is bounded by
$$R(\hat{f}_h, f) \le C_1 h^2 + \frac{C_2}{n h},$$
so minimizing over $h$ gives $h^* \asymp n^{-1/3}$ and a risk of $O(n^{-2/3})$. This rate of $n^{-2/3}$ is much slower than that of parametric models (parametric density estimation), which is usually on the order of $O(n^{-1})$. In a $\beta$-Hölder space the minimax rate is $n^{-2\beta/(2\beta+1)}$, which the histogram attains only for $\beta \le 1$; histograms are therefore not minimax optimal, unlike kernel density estimation.
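To see where the $n^{-1/3}$ bandwidth and the $n^{-2/3}$ rate come from, set the derivative of the bound above to zero:
$$\frac{d}{dh}\left(C_1 h^2 + \frac{C_2}{n h}\right) = 2 C_1 h - \frac{C_2}{n h^2} = 0 \quad\Longrightarrow\quad h^* = \left(\frac{C_2}{2 C_1 n}\right)^{1/3} \asymp n^{-1/3},$$
and substituting $h^*$ back in makes both terms, and hence the risk, of order $n^{-2/3}$.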
There are also high-probability guarantees. For example, if $f$ is $L$-Lipschitz, then with probability at least $1 - \delta$,
$$\sup_x \big|\hat{f}_h(x) - f(x)\big| \le L h + \frac{1}{h} \sqrt{\frac{\log(2m/\delta)}{2n}},$$
where $f$ is the true density and $\hat{f}_h$ is the histogram with bin size $h$; the first term is the bias of binning, and the second follows from Hoeffding's inequality together with a union bound over the $m$ bins.
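A quick simulation can sanity-check the estimation-error term of such a bound. This sketch assumes the Hoeffding-plus-union-bound threshold above and uses Uniform(0,1) data so the true bin probabilities $p_j = 1/m$ are known exactly:

```python
import numpy as np

# max_j |p_hat_j - p_j| should exceed sqrt(log(2m/delta) / (2n))
# in at most roughly a delta fraction of trials.
rng = np.random.default_rng(1)
n, m, delta, trials = 2000, 20, 0.05, 500
p_true = np.full(m, 1.0 / m)                     # Uniform(0,1): p_j = 1/m
eps = np.sqrt(np.log(2 * m / delta) / (2 * n))   # union-bound threshold

exceed = 0
for _ in range(trials):
    counts = np.histogram(rng.random(n), bins=m, range=(0.0, 1.0))[0]
    if np.max(np.abs(counts / n - p_true)) > eps:
        exceed += 1

print(f"threshold eps = {eps:.4f}, "
      f"empirical exceedance = {exceed / trials:.3f} (should be <= {delta})")
```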
There have been efforts to choose the bin size adaptively. These estimators are sometimes called density trees. See:
- Density estimation trees, by Ram and Gray.
- Density estimation via adaptive sequential partition, by Li, Yang, and Wong.
- Density estimation via adaptive partitioning, by Liu and Wong.