Representation Learning with Contrastive Predictive Coding

Aaron van den Oord, Yazhe Li, Oriol Vinyals (arXiv 2018)


This paper introduces the now-popular contrastive InfoNCE objective for unsupervised representation learning. It took me a few passes over the derivations to appreciate what they mean, but the idea is simple. Say we have a dataset $D$, and we want to learn representations for the items $x \in D$. One paradigm for doing this without labels is autoencoding: we learn a model that maps $x$ into a lower-dimensional $\phi(x)$, and then back to $x$. Then, $\phi(x)$ serves as a representation.

Contrastive objectives avoid having to learn to reconstruct $x$ from its representation by instead relying on a notion of ``views'' (not the term used in the paper, but popular nowadays). Views are transformations from the input space to itself that do not change identity (i.e., intuitively, $x \approx v(x)$ for some view $v$). For images, for example, small crops, rotations, and color changes arguably do not change what an image is (depending on the task, of course).

InfoNCE is a simple cross-entropy classification objective: given $\phi(x)$, the representation $\phi(v(x))$ of a view of $x$, and $N$ negative examples $x^{(n)}_i$ (other items from the dataset, different from $x$), it trains a classifier to distinguish $\phi(v(x))$ from the negatives' representations $\phi(v(x^{(n)}_i))$. This forces the representation to keep the information that differentiates views of $x$ from views of other items, while making it invariant to the choice of view.
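To make the classification view concrete, here is a minimal sketch of the loss for a single item. It is an illustration under stated assumptions, not the paper's implementation: the function names `dot` and `info_nce_loss` are hypothetical, and the score is a plain dot product with a `temperature` parameter (a convention from later contrastive work), whereas the paper uses a log-bilinear score.

```python
import math

def dot(u, v):
    """Dot product of two equal-length lists of floats."""
    return sum(a * b for a, b in zip(u, v))

def info_nce_loss(anchor, positive, negatives, temperature=1.0):
    """Cross-entropy of picking the positive among 1 + len(negatives) candidates.

    anchor    : phi(x), a list of floats
    positive  : phi(v(x)), the representation of a view of x
    negatives : list of phi(v(x_i)) for other items x_i
    The dot-product score and temperature are assumptions of this sketch.
    """
    # Score every candidate against the anchor; index 0 is the positive.
    candidates = [positive] + negatives
    logits = [dot(anchor, c) / temperature for c in candidates]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    # Negative log-probability that the classifier assigns to the positive.
    return log_z - logits[0]

# Toy 2-d representations: the positive is close to the anchor,
# the two negatives point elsewhere.
anchor = [1.0, 0.0]
positive = [0.9, 0.1]
negatives = [[0.0, 1.0], [-1.0, 0.0]]
loss = info_nce_loss(anchor, positive, negatives)

# A representation that carries no information (all scores equal)
# can do no better than chance, i.e. log(number of candidates).
chance_loss = info_nce_loss([0.0, 0.0], positive, negatives)
```

In the toy example the loss falls below the chance level `chance_loss`, because the anchor scores its own view higher than the views of the other items. In practice the negatives are usually the other items in the same mini-batch, and the loss is averaged over the batch.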

Curiously, the InfoNCE objective also gives a lower bound on the mutual information between $x$ and $v(x)$. The proof consists of two parts. First, they show that the optimal scoring function $f(x, v(x))$ is proportional to $\frac{p(x|v(x))}{p(x)}$ (note that this corresponds to exponentiated point-wise mutual information). Then, they show that the loss attained by this optimal $f$ bounds the mutual information from below, up to a constant $\log N$ determined by the number of negative examples.
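The two steps can be sketched as follows, in this note's notation. This is a condensed reading of the paper's argument under a few labeled assumptions: $c = v(x)$ plays the role of the paper's context, $X$ is the candidate set with one positive and $N - 1$ negatives (so $N$ here counts all candidates), $X_{\mathrm{neg}}$ is the set of negatives, and $\mathcal{L}_N^{\mathrm{opt}}$ denotes the loss at the optimal $f$. The third line replaces the sum over negatives by its expectation, using $\mathbb{E}_{x_j \sim p}\left[p(x_j \mid c) / p(x_j)\right] = 1$, so the argument is approximate and gets more accurate as $N$ grows.

```latex
\begin{align*}
\mathcal{L}_N
  &= -\mathbb{E}\left[\log \frac{f(x, c)}{\sum_{x_j \in X} f(x_j, c)}\right],
  \qquad c = v(x) \\
\mathcal{L}_N^{\mathrm{opt}}
  &= \mathbb{E}\left[\log\left(1 + \frac{p(x)}{p(x \mid c)}
     \sum_{x_j \in X_{\mathrm{neg}}} \frac{p(x_j \mid c)}{p(x_j)}\right)\right] \\
  &\approx \mathbb{E}\left[\log\left(1 + \frac{p(x)}{p(x \mid c)} (N - 1)\right)\right] \\
  &\geq \mathbb{E}\left[\log\left(\frac{p(x)}{p(x \mid c)} N\right)\right]
   = -I(x; c) + \log N
\end{align*}
```

Rearranging gives $I(x; v(x)) \geq \log N - \mathcal{L}_N^{\mathrm{opt}}$. Since any other $f$ can only have a larger loss than the optimal one, the same bound holds with the loss of whatever $f$ was actually trained, which is why lowering the InfoNCE loss raises the lower bound.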

The paper reports experiments in four domains: speech, images, text, and reinforcement learning. As noted above, ``views'' is terminology popularized by later papers rather than by this one. Judging by how many papers use the objective today, InfoNCE clearly contributed to the momentum that contrastive learning has gained recently.

Gabriel Poesia
Computer Science PhD student