Dan Dennett once said of Darwin’s theory of evolution that it was the best idea that anyone has ever had. You could say the same about the maximum likelihood estimator (MLE) in the realm of statistical inference. It’s simple and elegant and sometimes optimal.

Given a parametric model $\{p_\theta : \theta \in \Theta\}$ (parametric versus nonparametric statistics) and data $X_1, \dots, X_n \overset{\text{iid}}{\sim} p_{\theta_0}$, we solve

$$\hat{\theta}_n = \operatorname*{arg\,max}_{\theta \in \Theta} \prod_{i=1}^n p_\theta(X_i) = \operatorname*{arg\,max}_{\theta \in \Theta} \sum_{i=1}^n \log p_\theta(X_i).$$

So, given the data, we just optimize over the parameters that could have generated that data. Badaboom-badabing, and we have a solution to parametric density estimation.
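To make this concrete, here’s a minimal sketch (my own illustration, not from the post) that fits a Gaussian model by numerically minimizing the negative log-likelihood with scipy; the data, parameterization, and function names are assumptions for the example.

```python
# A minimal sketch of parametric density estimation via MLE, assuming a
# Gaussian model p_theta with theta = (mu, log_sigma). Data are simulated
# purely for illustration.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=500)  # pretend this is our sample

def neg_log_likelihood(theta, x):
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)  # parameterize by log(sigma) to keep sigma > 0
    return np.sum(0.5 * np.log(2 * np.pi * sigma**2) + (x - mu)**2 / (2 * sigma**2))

result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]), args=(data,))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])

# For the Gaussian the MLE has a closed form: the sample mean and the
# (biased) sample standard deviation. The numerical optimum should match.
print(mu_hat, data.mean())
print(sigma_hat, data.std())
```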

Of course, the idea that we should just find the parameters under which the observed data are most probable is not some bedrock philosophical principle that can’t be debated. And, as you might imagine, people do debate it, Bayesians in particular. The MLE is frequentist by nature: parameters are fixed and there are no priors. It also doesn’t provide natural uncertainty quantification, since we just get a point estimate. Of course, this is where central limit theorems kick in: the asymptotic normality of the MLE (see below) gives approximate confidence intervals.

The MLE can be seen as empirical risk minimization with the loss $\ell(\theta, x) = -\log p_\theta(x)$. The associated risk (see statistical decision theory) is $R(\theta) = \mathbb{E}_{\theta_0}[-\log p_\theta(X)]$, which equals the KL divergence between $p_{\theta_0}$ and $p_\theta$ up to an additive constant that doesn’t depend on $\theta$. This connection can be used to show that, under certain regularity conditions, the MLE is a consistent estimator. The key conditions are strong identifiability and that the empirical risk obeys a uniform LLN.
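To spell out why the risk is the KL divergence up to a constant: writing $H(p_{\theta_0})$ for the entropy of the data-generating distribution,

$$
R(\theta) = \mathbb{E}_{\theta_0}\!\left[-\log p_\theta(X)\right]
= \mathbb{E}_{\theta_0}\!\left[\log \frac{p_{\theta_0}(X)}{p_\theta(X)}\right] + \mathbb{E}_{\theta_0}\!\left[-\log p_{\theta_0}(X)\right]
= \mathrm{KL}(p_{\theta_0} \,\|\, p_\theta) + H(p_{\theta_0}).
$$

The second term doesn’t depend on $\theta$, so minimizing the risk is the same as minimizing the KL divergence, whose unique minimizer (under identifiability) is $\theta_0$.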

When our model is misspecified (i.e., the data are being generated by some distribution that’s not in our model), we can use the connection between the KL divergence and the MLE to see that the MLE is finding the parameter that minimizes the KL divergence between the true data-generating distribution and $p_\theta$. That is, if the true data-generating distribution is $q$, we have

$$\hat{\theta}_n \xrightarrow{p} \theta^\star = \operatorname*{arg\,min}_{\theta \in \Theta} \mathrm{KL}(q \,\|\, p_\theta),$$

where $\hat{\theta}_n$ is the MLE.
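As a quick sanity check (an illustrative simulation of my own; the Laplace true distribution is an assumption for the example), the Gaussian MLE fit to misspecified data should converge to the Gaussian matching the mean and variance of $q$, since that’s the KL projection of $q$ onto the Gaussian family:

```python
# Fit a Gaussian model to data drawn from a Laplace distribution. The KL
# projection of Laplace(0, b) onto the Gaussian family matches its mean (0)
# and variance (2 b^2), so the MLE should converge to those values.
import numpy as np

rng = np.random.default_rng(1)
b = 1.0
data = rng.laplace(loc=0.0, scale=b, size=200_000)  # true q is Laplace, not Gaussian

# Gaussian MLE has a closed form: sample mean and (biased) sample variance.
mu_hat = data.mean()
var_hat = data.var()

print(mu_hat, 0.0)        # ~0
print(var_hat, 2 * b**2)  # ~2
```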

Under enough regularity conditions, the MLE obeys a CLT with asymptotic variance equal to the inverse of the Fisher information, thus matching the Cramér-Rao lower bound.
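Concretely, in the well-specified case,

$$\sqrt{n}\,\bigl(\hat{\theta}_n - \theta_0\bigr) \xrightarrow{d} \mathcal{N}\bigl(0,\, I(\theta_0)^{-1}\bigr),$$

where $I(\theta_0)$ is the Fisher information. Here’s a small simulation (my own example, with an assumed Bernoulli model) checking that the scaled estimation error has variance close to $1/I(\theta_0)$:

```python
# Illustrative check of the MLE CLT for a Bernoulli(p) model, where the MLE is
# the sample mean and the Fisher information is I(p) = 1 / (p (1 - p)).
import numpy as np

rng = np.random.default_rng(2)
p, n, reps = 0.3, 1_000, 20_000

samples = rng.binomial(n=1, p=p, size=(reps, n))
p_hats = samples.mean(axis=1)                # MLE for each replication

empirical_var = np.var(np.sqrt(n) * (p_hats - p))
print(empirical_var, p * (1 - p))            # both should be ~0.21 = 1 / I(p)
```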