This is the paper that introduced the now classical Variational Auto-Encoder (VAE). Not all points of the derivations are clear to me at this point, but the main idea is easy to understand. The goal is to train a generative model from an unlabeled dataset (e.g. of images of faces). On the surface, auto-encoders could be used for that: we train an encoder $\phi(x)$ and a decoder that reconstructs $x$ from $\phi(x)$. Now if we sample a random vector in $\phi$ space and run it through the decoder, we should get an image. Unfortunately, we do not know what distribution to sample from, and these representation spaces do not necessarily behave uniformly. So the idea of Auto-Encoding Variational Bayes is to force a certain distribution on the representations (e.g. a standard Gaussian). What $\phi(x)$ now gives is the parameters (means and variances) of a Gaussian over latent vectors for $x$. Given those, we sample a latent vector using those parameters, and use the decoder to reconstruct the input from it.
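
As far as I understand, the paper "forces" the standard Gaussian by adding a KL divergence penalty between the encoder's Gaussian and $\mathcal{N}(0, I)$ to the reconstruction loss. For a diagonal Gaussian this term has a closed form, sketched below in plain Python. The function name `kl_to_standard_normal` and the choice to have the encoder output log-variances (`log_var`) instead of variances are my own assumptions for illustration, not something taken from the paper's code.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions.

    mu, log_var: lists of floats, one entry per latent dimension
    (hypothetical encoder outputs; log_var = log(sigma^2)).
    """
    return 0.5 * sum(
        m * m + math.exp(lv) - lv - 1.0
        for m, lv in zip(mu, log_var)
    )

# An encoder output that is already a standard Gaussian pays no penalty.
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))  # 0.0
# Moving the mean away from zero is penalized.
print(kl_to_standard_normal([1.0], [0.0]))            # 0.5
```

The penalty is zero only when every mean is 0 and every variance is 1, which is what pushes the representation space toward something we know how to sample from.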

That almost works, except that the sampling step is stochastic, and we can’t backpropagate through it to the encoder’s parameters. Here comes the famous reparameterization trick. Instead of sampling directly from the Gaussian parameterized by $\phi(x)$, we notice that a Gaussian sample with mean $\mu$ and variance $\sigma^2$ can be written as a standard Gaussian sample times $\sigma$ plus $\mu$, i.e. $z = \mu + \sigma \epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$. So now we sample $\epsilon$ from a standard Gaussian, and then, given the $\mu$ and $\sigma$ produced by $\phi(x)$, we multiply it by $\sigma$ and add $\mu$ to obtain the input to the decoder. The randomness now lives in $\epsilon$, which does not depend on any parameter, so we can backpropagate through $\mu$ and $\sigma$ and train the model end-to-end.
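
A minimal sketch of the trick in plain Python, assuming a diagonal Gaussian whose parameters come from a hypothetical encoder as `mu` and `log_var` (the function name `reparameterize` and the log-variance convention are my assumptions, not the paper's notation):

```python
import math
import random

def reparameterize(mu, log_var, rng=random):
    """Draw z ~ N(mu, diag(exp(log_var))) as z = mu + sigma * eps.

    eps is sampled from N(0, 1) independently of mu and log_var, so z is a
    deterministic, differentiable function of the encoder outputs:
    dz/dmu = 1 and dz/dsigma = eps.
    Returns (z, eps) so the noise can be inspected.
    """
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    z = [m + math.exp(0.5 * lv) * e for m, lv, e in zip(mu, log_var, eps)]
    return z, eps

# Sanity check: samples built this way have the requested mean and std.
rng = random.Random(0)
mu, sigma = 2.0, 0.5
samples = [reparameterize([mu], [math.log(sigma ** 2)], rng)[0][0]
           for _ in range(20000)]
mean = sum(samples) / len(samples)
std = math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))
print(mean, std)  # should be close to 2.0 and 0.5
```

In a real model the same line `mu + sigma * eps` would be written with an autodiff library's tensors, so that gradients of the loss flow back into the encoder through `mu` and `sigma`.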

Qualitatively, in the faces dataset, they find features in the representation that correspond to several interpretable concepts, like the left-right orientation of the face, and how much it is smiling. The main problem with VAEs is that it’s been harder to produce high-quality outputs with them. It seems GANs had way more success on that end, though one could argue they’re less statistically principled than VAEs.