Energy-Based Models (EBMs) are a class of generative models that work with unnormalized densities. They allow very flexible architectures: unlike autoregressive models, they require no ordering on the predicted variables, and unlike flow-based models, no invertible maps. GANs share this flexibility, but computing likelihoods with them is intractable, and their min-max objective makes optimization unstable; EBMs have no such objective.

The general strategy is to model $p_\theta(x)$ as an unnormalized density:

\(p_\theta(x) = \frac{1}{Z(\theta)} \exp\left(f_\theta(x)\right)\)

Where $Z(\theta)$ is the partition function, i.e. the normalization constant:

\(Z(\theta) = \int_X \exp(f_\theta(x)) dx\)

So that we are guaranteed to have a probability distribution. Since $\exp(x)$ is always positive, this lets us use any function $f_\theta$ to get a valid density. The "Energy" in the name comes from the analogy to thermodynamics: if the energy of a state $s$ is $-f_\theta(s)$, then that state is more likely relative to other states when it has lower energy (higher $f$).
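To make this concrete, here is a minimal sketch in NumPy. The quadratic $f_\theta$ below is a hypothetical stand-in for an arbitrary network; it is chosen so that the partition function has a known closed form (a Gaussian integral), which lets us check the normalization numerically, something that is intractable for a real EBM.

```python
import numpy as np

# Toy 1-D EBM: f_theta(x) = -(x - mu)^2 / 2, with a single parameter mu.
# exp(f_theta(x)) is an unnormalized density; any other f would work too.
mu = 1.5

def f(x):
    return -(x - mu) ** 2 / 2

def unnormalized_density(x):
    return np.exp(f(x))

# For this toy choice Z(theta) is a Gaussian integral, sqrt(2 * pi),
# so we can verify the normalization numerically on a fine grid.
xs = np.linspace(mu - 20.0, mu + 20.0, 200_001)
dx = xs[1] - xs[0]
Z_numeric = unnormalized_density(xs).sum() * dx
print(Z_numeric)  # ≈ 2.5066 ≈ sqrt(2 * pi)
```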

We're giving up knowledge of $Z(\theta)$, and that makes several tasks more complicated. First, taking samples from $p_\theta$ becomes difficult. Optimization also becomes non-trivial: taking gradient steps that locally increase $f_\theta(x')$ for some sample $x'$ does not actually guarantee that $p_\theta(x')$ increases, since the update might bump up the unnormalized likelihood of other samples in a way that reduces $p_\theta(x')$ after normalization. This cannot happen with normalized models, where upweighting one point necessarily downweights other points because of the normalization constraint. Besides this, this formulation doesn't directly give us high-level features for $x$ (although we can later add latent variables to mitigate this).

In spite of this, some applications actually do not require knowing $Z(\theta)$.
For instance, notice that $Z(\theta)$ cancels out in the likelihood ratio between two samples.
Thus, we can do anomaly detection by comparing the likelihood ratio between the test sample
and samples from the reference distribution. We can also do "denoising": taking one noisy
sample (e.g., a blurry image) and making it more likely under a clean distribution (e.g., a
model trained on high-quality images) by optimizing the likelihood ratio between them.
In general, tasks that can be done with only *relative comparisons* are easy to do
directly with the energy function, without knowing $Z(\theta)$.
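A sketch of the anomaly-detection idea; the quadratic energy below is a hypothetical stand-in for a trained network:

```python
# Hypothetical trained 1-D EBM: f(x) is the network's output, i.e. the
# log of the unnormalized density, with mass concentrated around 0.
def f(x):
    return -(x ** 2) / 2

# log p(x1) - log p(x2) = f(x1) - f(x2): the log Z(theta) terms cancel,
# so relative comparisons never need the partition function.
def log_likelihood_ratio(x1, x2):
    return f(x1) - f(x2)

reference = 0.1   # a typical sample from the reference distribution
candidate = 5.0   # a test sample to screen for anomalies
print(log_likelihood_ratio(candidate, reference))  # large negative => anomalous
```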

Another interesting property is the ability to compose unnormalized densities of different
energy-based models in a very simple way. Suppose we have two models $p_{\theta_1}$ and
$q_{\theta_2}$. Say $p$ models the distribution of faces of people with glasses,
and $q$ models the distribution of faces with hats. A mixture model of $p$ and $q$
would give us samples that have either glasses *or* hats (only sometimes both).
If we want samples of faces with both glasses *and* hats (a Product of Experts), we instead take
their product, which keeps only samples that are likely under both models.
However, this leaves us with an unnormalized density, and we'd have to normalize it:

\(r_{AND}(x) = \frac{1}{Z(\theta_1, \theta_2)} p_{\theta_1}(x) q_{\theta_2}(x)\)

But if we have EBMs, we already don't care about normalization anyway, so we can simply multiply the two unnormalized densities (equivalently, add the two energies) to get another EBM.
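A toy 1-D sketch of this composition; the two quadratics are hypothetical stand-ins for the "glasses" and "hats" experts:

```python
import numpy as np

# Two hypothetical experts: one prefers x near -1, the other x near +1.
def f_glasses(x):
    return -((x + 1.0) ** 2)

def f_hats(x):
    return -((x - 1.0) ** 2)

# Product of Experts: exp(f1) * exp(f2) = exp(f1 + f2), so the product
# of the unnormalized densities is again an EBM with energy f1 + f2.
def f_and(x):
    return f_glasses(x) + f_hats(x)

xs = np.linspace(-3.0, 3.0, 601)
mode = xs[np.argmax(f_and(xs))]
print(mode)  # the product's mode sits between the two experts' modes
```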

These applications, of course, assume we already have trained EBMs. How can we do that in the first place?

A first attempt to train EBMs could be to maximize the log-likelihood. For a training sample $x$, this would be:

\begin{align*} \max_\theta\ \log \left[ \frac{1}{Z(\theta)} \exp(f_\theta(x)) \right] = \max_\theta\ f_\theta(x) - \log Z(\theta) \end{align*}

We don't know $Z(\theta)$, but there's still a way around that. By taking gradients, we get:

\begin{align*} \nabla_\theta \left[ f_\theta(x) - \log Z(\theta) \right] &= \nabla_\theta f_\theta(x) - \nabla_\theta \log Z(\theta) \\ &= \nabla_\theta f_\theta(x) - \frac{1}{Z(\theta)} \nabla_\theta Z(\theta) \\ &= \nabla_\theta f_\theta(x) - \frac{1}{Z(\theta)} \nabla_\theta \int_X \exp(f_\theta(x')) \, dx' \\ &= \nabla_\theta f_\theta(x) - \frac{1}{Z(\theta)} \int_X \nabla_\theta \exp(f_\theta(x')) \, dx' \\ &= \nabla_\theta f_\theta(x) - \frac{1}{Z(\theta)} \int_X \exp(f_\theta(x')) \nabla_\theta f_\theta(x') \, dx' \\ &= \nabla_\theta f_\theta(x) - \int_X \frac{1}{Z(\theta)} \exp(f_\theta(x')) \nabla_\theta f_\theta(x') \, dx' \\ &= \nabla_\theta f_\theta(x) - \int_X p_\theta(x') \nabla_\theta f_\theta(x') \, dx' \\ &= \nabla_\theta f_\theta(x) - \mathbb{E}_{x' \sim p_\theta} \left[ \nabla_\theta f_\theta(x') \right] \\ &\approx \nabla_\theta f_\theta(x) - \nabla_\theta f_\theta(x_{sample}) \\ \end{align*}

Here we expanded the gradient of the $\log$, then the definition of $Z(\theta)$ as an integral (over a dummy variable $x'$, to avoid clashing with the training sample $x$), and rearranged until the second term became an expectation under $p_\theta$, which we approximated with a single Monte Carlo sample.

Thus, if we're able to sample from the model, we can train it using this method,
which is known as *contrastive divergence*: we take a training example $x$,
sample from the model to obtain $x_{sample}$, then take a gradient step on the
difference $f_\theta(x) - f_\theta(x_{sample})$, which approximates a gradient
ascent step on the log-likelihood.
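A toy sketch of contrastive-divergence training, assuming an energy $f_\theta(x) = -(x-\theta)^2/2$ with a single parameter $\theta$; the model is then a unit Gaussian with mean $\theta$, which (unrealistically) lets us draw exact model samples instead of running MCMC:

```python
import numpy as np

rng = np.random.default_rng(0)

# "Training set": samples from the data distribution N(2, 1).
data = rng.normal(2.0, 1.0, size=10_000)

def grad_f(x, theta):
    # d f_theta(x) / d theta for f_theta(x) = -(x - theta)^2 / 2,
    # written in closed form here instead of using autodiff.
    return x - theta

theta, lr = -4.0, 0.1
for step in range(200):
    x_batch = rng.choice(data, size=64)            # training examples
    x_samples = rng.normal(theta, 1.0, size=64)    # exact model samples (toy luxury)
    # Contrastive divergence: push f up at data points, down at model samples.
    theta += lr * (grad_f(x_batch, theta).mean() - grad_f(x_samples, theta).mean())

print(theta)  # ≈ 2.0, the data mean
```

With a real EBM, `x_samples` would come from an MCMC procedure and `grad_f` from backpropagation, but the update has exactly this shape.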

However, this leaves an open issue: how do we obtain $x_{sample}$ efficiently?

A simple way to obtain samples given an unnormalized density is with Markov Chain Monte Carlo methods. For instance, we could run Metropolis-Hastings just with the unnormalized density, since it only uses likelihood ratios, to obtain samples from $p_\theta$. However, MCMC takes an undetermined number of iterations to converge to the target distribution, and would be too expensive to run inside the training loop.

Still, we can improve on Metropolis-Hastings, which works for black-box functions, by exploiting the fact that we have access to gradients of $f_\theta$. In particular, we can use Langevin MCMC, which leverages gradients to substantially improve convergence. Given the current point $x_t$ and a step size $\epsilon > 0$, Langevin MCMC first samples noise $z_t$ from a standard multivariate normal distribution, and then proposes:

\(x_{t+1} = x_t + \epsilon \nabla_{x_t} \log p_\theta(x_t) + \sqrt{2\epsilon}\, z_t \)

Note that the gradients are with respect to $x_t$, not $\theta$.
This gradient is known as the *score function*. This is especially interesting
for EBMs since:

\(\nabla_x \log p_\theta(x) = \nabla_x f_\theta(x) - \nabla_x \log Z(\theta) \)

But $\log Z(\theta)$ is a constant with respect to $x$, so the second gradient is zero. Thus, computing the score function of an EBM is simple, since we never need the partition function.

Without the noise term, the update above would simply amount to optimizing the likelihood of $x$, climbing towards a mode of the distribution. Adding carefully scaled noise makes Langevin MCMC sample from $p_\theta(x)$ as the number of steps $T \rightarrow \infty$ and $\epsilon \rightarrow 0$. In practice, this is significantly faster than vanilla Metropolis-Hastings, and can actually be used to obtain high-quality samples from EBMs in as few as 10 iterations (for instance, see the samples from this paper). Still, it is too slow for generating samples inside the training loop.
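A sketch of Langevin MCMC on a toy EBM with $f_\theta(x) = -x^2/2$ (a standard Gaussian), whose score $\nabla_x \log p_\theta(x) = -x$ we can write in closed form; for a real EBM the score would come from backpropagating through $f_\theta$ with respect to the input:

```python
import numpy as np

rng = np.random.default_rng(0)

def score(x):
    # grad_x log p_theta(x) = grad_x f_theta(x) = -x for f_theta(x) = -x^2 / 2.
    return -x

eps = 0.01                   # step size
x = np.full(5_000, 10.0)     # 5000 chains, all started far from the mode
for t in range(2_000):
    z = rng.standard_normal(x.shape)
    x = x + eps * score(x) + np.sqrt(2 * eps) * z

print(x.mean(), x.std())  # ≈ 0 and ≈ 1: the chains sample the target
```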

The need for samples during training came from trying to estimate gradients of the log-likelihood. We can work around that by using other objectives. The observation that the score function is easy to compute for EBMs makes the Fisher divergence particularly interesting. Given our current distribution $p_\theta$ and the target distribution $q = p_{data}$, the Fisher divergence is defined as:

\(D_F(q \,\|\, p_\theta) = \frac{1}{2} \mathbb{E}_{x \sim q} \left[ \left\| \nabla_x \log q(x) - \nabla_x \log p_\theta(x) \right\|_2^2 \right] \)

Like the KL divergence, the Fisher divergence is non-negative, asymmetric (the expectation is taken under only one of the two distributions), and zero only when $p_\theta = q$.
Thus, we can try to train our model by minimizing the Fisher divergence, which only depends
on the score functions of $p_\theta$ and $p_{data}$. This is known as *score matching*.
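To make the objective concrete, here is a toy where the data score happens to be known in closed form (data drawn from $N(2,1)$, model energy $f_\theta(x) = -(x-\theta)^2/2$), with the expectation approximated by training samples. In practice $\nabla_x \log p_{data}$ is exactly what we do not have, which is the problem addressed next.

```python
import numpy as np

rng = np.random.default_rng(0)

data = rng.normal(2.0, 1.0, size=10_000)   # samples from p_data = N(2, 1)

def score_model(x, theta):
    return -(x - theta)    # grad_x f_theta(x); no partition function needed

def score_data(x):
    return -(x - 2.0)      # known in closed form only because this is a toy

theta, lr = -3.0, 0.5
for _ in range(100):
    diff = score_model(data, theta) - score_data(data)
    # Gradient of 0.5 * E[diff^2] w.r.t. theta; d(diff)/d(theta) = 1 here.
    theta -= lr * diff.mean()

print(theta)  # ≈ 2.0: matching the scores recovers the data mean
```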

While $\nabla_x \log p_\theta$ is easy to compute, we need to somehow estimate $\nabla_x \log p_{data}$.