Median of means: an approach to heavy-tailed concentration in scalar and multivariate settings.

The overall idea is straightforward: Split the $n$ data points into $k$ buckets and compute the sample mean $\bar X_j$ of each bucket. Then the overall estimator is the median of the bucket means, $\hat\mu_{\mathrm{MoM}} = \mathrm{Med}(\bar X_1, \dots, \bar X_k)$.

There are several questions to answer to implement this in practice. First, how to choose $k$? Second (especially relevant in multivariate settings), how do we define the median?
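A minimal sketch of the estimator in code (illustrative; the function and variable names are my own):

```python
import statistics

def median_of_means(data, k):
    """Median-of-means: split `data` into k buckets, average each
    bucket, and return the median of the bucket means."""
    m = len(data) // k  # bucket size (leftover points are dropped for simplicity)
    bucket_means = [
        sum(data[j * m : (j + 1) * m]) / m for j in range(k)
    ]
    return statistics.median(bucket_means)

# Example: one wild outlier corrupts only its own bucket,
# so it barely moves the estimate.
values = [1.0, 2.0, 3.0, 2.0, 1.0, 3.0, 2.0, 2.0, 1000.0]
print(median_of_means(values, 3))  # → 2.0
```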

Scalar case

By Chebyshev, each bucket mean $\bar X_j$ is not too far from the true mean with probability at least $3/4$. So for the median to be far from the mean, many (at least half) of $k$ independent Bernoulli events, each with success probability $3/4$, must fail to occur.
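The argument can be sketched with standard constants (these particular constants are one common choice, not the only one):

```latex
% Split n samples into k buckets of m = n/k points each.
% A bucket mean \bar X_j has variance \sigma^2 / m, so by Chebyshev
\Pr\!\left( |\bar X_j - \mu| > \tfrac{2\sigma}{\sqrt{m}} \right) \le \tfrac{1}{4}.
% The median exceeds this deviation only if at least k/2 of the k
% independent failure events occur; by Hoeffding,
\Pr\!\left( \mathrm{Bin}(k, 1/4) \ge k/2 \right) \le e^{-k/8}.
```

Setting $e^{-k/8} = \delta$ motivates the choice of $k$ below.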

If we choose $k \approx 8\log(1/\delta)$, then with probability at least $1-\delta$ the median-of-means estimator satisfies $|\hat\mu_{\mathrm{MoM}} - \mu| \lesssim \sigma\sqrt{\log(1/\delta)/n}$, a sub-Gaussian rate.
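A quick, self-contained sanity check of this choice of $k$ (a hypothetical demo with its own copy of the estimator; the contamination setup is mine):

```python
import math
import random
import statistics

def median_of_means(data, k):
    m = len(data) // k
    return statistics.median(
        sum(data[j * m : (j + 1) * m]) / m for j in range(k)
    )

random.seed(0)
delta = 0.01
n = 10_000

# Standard Gaussians plus a handful of huge outliers (true mean ~ 0).
data = [random.gauss(0.0, 1.0) for _ in range(n)] + [1000.0] * 10
random.shuffle(data)

k = math.ceil(8 * math.log(1 / delta))  # k ≈ 8 log(1/δ) buckets
naive = sum(data) / len(data)           # dragged up by the outliers
robust = median_of_means(data, k)       # outliers corrupt few buckets

print(f"naive mean:      {naive:.3f}")
print(f"median of means: {robust:.3f}")
```

The ten outliers can corrupt at most ten of the roughly 37 buckets, so the median of the bucket means is essentially unaffected, while the naive mean is shifted by about 1.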

The MoM estimator can also be extended to situations where the distribution has no variance. If only a $(1+\epsilon)$-th central moment exists, i.e. $\mathbb{E}|X - \mu|^{1+\epsilon} \le v$ for some $\epsilon \in (0, 1]$, then with probability at least $1-\delta$, $|\hat\mu_{\mathrm{MoM}} - \mu| \lesssim v^{1/(1+\epsilon)} \left( \log(1/\delta)/n \right)^{\epsilon/(1+\epsilon)}$.

This was originally proved by Bubeck, Cesa-Bianchi, and Lugosi in the context of heavy-tailed bandits. I have a post with the proofs here.

You can sequentialize the MoM estimator using the Dubins-Savage inequality. I have a post about that here.

Finite-dimensional vector case

Defining the median in $\mathbb{R}^d$ is slightly trickier.

One option is the geometric median, which is studied by Minsker (see below). This achieves rate $\sqrt{\mathrm{Tr}(\Sigma)\log(1/\delta)/n}$, which is not quite sub-Gaussian.

Lugosi and Mendelson define a specialized version of the median and get sub-Gaussian rates. Blog post here. Hopkins made this computationally tractable in 2020.

Unlike the scalar setting, MoM in $\mathbb{R}^d$ has not been extended to settings where the variance doesn’t exist.

Infinite-dimensional case

MoM with the geometric median was studied in Banach spaces by Minsker and in Polish spaces by Yun and Park:

Heavy-tailed concentration

In 2015, Minsker proposed the geometric median-of-means for Banach spaces. This is a general method for boosting weak (polynomial rate) estimators into an estimator with exponential rate. The idea is similar to the Lugosi-Mendelson median-of-means (see multivariate heavy-tailed concentration), but the weak estimators are aggregated using the geometric median. This can be computed in polynomial time (in the number of bucket means) using Weiszfeld’s algorithm, since the objective is convex. This estimator was proposed simultaneously by Hsu and Sabato.
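A sketch of Weiszfeld's iteration for the geometric median of the bucket means (my own illustrative implementation; the classical algorithm needs care when an iterate lands exactly on a data point, which this sketch sidesteps with a small epsilon):

```python
def geometric_median(points, n_iter=100, eps=1e-10):
    """Weiszfeld's algorithm: an iteratively re-weighted average that
    converges to the minimizer of the sum of Euclidean distances."""
    d = len(points[0])
    # Start from the coordinate-wise mean.
    y = [sum(p[i] for p in points) / len(points) for i in range(d)]
    for _ in range(n_iter):
        # Weights are inverse distances to the current iterate.
        weights = []
        for p in points:
            dist = sum((pi - yi) ** 2 for pi, yi in zip(p, y)) ** 0.5
            weights.append(1.0 / max(dist, eps))
        total = sum(weights)
        y = [
            sum(w * p[i] for w, p in zip(weights, points)) / total
            for i in range(d)
        ]
    return y

# Three corrupted bucket means out of eleven: the geometric median
# stays at the uncorrupted value [0, 0], since a majority of the
# points pull toward the origin.
means = [[0.0, 0.0]] * 8 + [[100.0, 100.0]] * 3
print(geometric_median(means))
```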

In 2022, Yun and Park extended geometric median-of-means to Polish spaces, which include separable Banach spaces. They seem to get the same rates as Minsker, which are not quite sub-Gaussian in the confidence parameter $\delta$.
