Suppose we want to do parametric density estimation, but our parametric class is a family of complicated non-negative functions $f_\theta$ that may not be proper probability distributions. For instance, think of deep neural networks.

One approach is to normalize as

$$p_\theta(x) = \frac{f_\theta(x)}{\int f_\theta(x')\,dx'}.$$

The log-likelihood is then

$$\log p_\theta(x) = \log f_\theta(x) - \log \int f_\theta(x')\,dx'.$$

In order to maximize this, we must solve (or approximate) the integral $\int f_\theta(x')\,dx'$, which is highly non-trivial for complex function families. This is the problem of approximate inference in deep learning, which can be tackled by, e.g., variational inference.
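To make the bottleneck concrete, here is a minimal sketch that brute-forces the normalizing integral on a 1D grid. The function `f_theta` and its parameters are toy stand-ins invented for illustration (in practice $f_\theta$ would be a neural net); the grid approach is hopeless in high dimensions, which is exactly why approximate inference is needed.

```python
import numpy as np

# Toy stand-in for f_theta: any non-negative function of x.
# The Laplace-like shape and parameters are invented purely for
# illustration; in practice this would be a deep neural network.
def f_theta(x, mu=0.0, s=1.0):
    return np.exp(-np.abs(x - mu) / s)  # non-negative, but not normalized

# Brute-force Z = \int f_theta(x) dx with a Riemann sum on a 1D grid.
# This only works in low dimensions -- in high dimensions the grid
# blows up exponentially, which is the problem described above.
xs = np.linspace(-10.0, 10.0, 10_001)
dx = xs[1] - xs[0]
Z = f_theta(xs).sum() * dx

# Normalized log-likelihood of a data point: log f_theta(x) - log Z.
x0 = 0.5
log_lik = np.log(f_theta(x0)) - np.log(Z)
print(f"Z ~ {Z:.4f}, log p(x0) ~ {log_lik:.4f}")
```

For this toy choice the true $Z$ is $2$ (the integral of $e^{-|x|}$), so the grid estimate can be sanity-checked by eye.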

Energy-based models have a similar idea, but they define the probability distribution as

$$p_\theta(x) = \frac{e^{-E_\theta(x)}}{Z(\theta)}, \qquad Z(\theta) = \int e^{-E_\theta(x')}\,dx'.$$

This also requires computing the normalizing constant $Z(\theta)$, as above, though the exponential form lets $E_\theta$ be any real-valued network, since $e^{-E_\theta(x)}$ is automatically positive. These are called energy-based because $E_\theta$ is a neural net that represents the "energy," borrowing from statistical physics, where lower energy corresponds to higher probability.
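As a minimal sketch, assuming PyTorch and a toy one-dimensional energy network (the architecture is invented for illustration), the unnormalized log-density is just $-E_\theta(x)$, while $Z(\theta)$ can only be brute-forced here because we are in 1D:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy energy network E_theta: R -> R. Any real-valued network works,
# since e^{-E} is automatically positive; this architecture is an
# arbitrary choice for the demo.
energy_net = nn.Sequential(
    nn.Linear(1, 32), nn.Tanh(),
    nn.Linear(32, 1),
)

def log_unnormalized(x):
    # log e^{-E_theta(x)} = -E_theta(x): cheap to evaluate pointwise.
    return -energy_net(x).squeeze(-1)

# The expensive part is Z(theta) = \int e^{-E_theta(x)} dx. In 1D we
# can brute-force it on a grid (truncating the domain to [-10, 10]);
# in high dimensions we cannot, which is why EBM training typically
# approximates the gradient of log Z (e.g. with MCMC samples) rather
# than computing Z itself.
xs = torch.linspace(-10.0, 10.0, 10_001).unsqueeze(-1)
dx = xs[1, 0] - xs[0, 0]
with torch.no_grad():
    Z = torch.exp(log_unnormalized(xs)).sum() * dx
    log_p = log_unnormalized(torch.tensor([[0.5]])) - torch.log(Z)
print(f"Z ~ {Z.item():.4f}, log p(0.5) ~ {log_p.item():.4f}")
```

Note that evaluating $-E_\theta(x)$ at a point is cheap; everything hard about EBMs is concentrated in $Z(\theta)$ and its gradient.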

There are other approaches that sidestep the intractable normalizing constant, such as normalizing flows, autoregressive models, and GANs.