# Representation Learning with Contrastive Predictive Coding (arXiv 2018)

### Aaron van den Oord, Yazhe Li, Oriol Vinyals

This paper introduces the now popular contrastive InfoNCE objective for
unsupervised representation learning. It took me a few passes over the
derivations to appreciate what they mean.
The idea is simple. Say we have a dataset $D$, and we want to learn
representations for the items $x \in D$. One paradigm for doing this
without labels is autoencoding: we learn a model that maps $x$ into
a lower-dimensional $\phi(x)$, and then back to $x$. Then, $\phi(x)$
serves as a representation. Contrastive objectives avoid having to
learn to reconstruct $x$ from its representation; instead, they rely
on a notion of "views" (which is not the term used in the InfoNCE paper,
but seems popular nowadays). Views are transformations from the input
space to itself that do not change identity
(i.e. intuitively, $x \approx v(x)$ for some view $v$).
In images, for example, small crops, rotations, and color changes
arguably do not change what an image essentially is (depending on the task, of course).
InfoNCE is a simple cross-entropy classification objective:
given $\phi(x)$, the representation of a view of $x$, $\phi(v(x))$,
and $N$ negative examples (i.e. other items $x_1, \dots, x_N$ from the dataset, different from $x$),
InfoNCE trains a classifier that distinguishes $\phi(v(x))$ from the $\phi(v(x_i))$
(the negatives). Thus, you force the representation to keep the information
that allows you to differentiate between views of $x$ and views of
other items, while making it invariant across the views themselves.
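As a concrete (and hypothetical) sketch, the objective can be written as an $(N+1)$-way cross-entropy in a few lines of NumPy. This is not the paper's code: the paper uses a log-bilinear critic, while here plain dot products score the candidates, and the function name is my own.

```python
import numpy as np

def info_nce_loss(z_view, z_x, z_negs):
    """InfoNCE as an (N+1)-way cross-entropy: identify phi(x) among
    N negatives, given the view representation phi(v(x)).

    z_view: (d,)   representation of the view, phi(v(x))
    z_x:    (d,)   representation of the positive item, phi(x)
    z_negs: (N, d) representations of (views of) the negatives
    """
    # Score each candidate against the view with a dot product
    # (a simple choice of critic; the paper uses a log-bilinear one).
    logits = np.concatenate([[z_view @ z_x], z_negs @ z_view])
    logits -= logits.max()  # stabilize the softmax
    # Cross-entropy with the positive fixed at index 0.
    return -(logits[0] - np.log(np.exp(logits).sum()))
```

When the encoder places $x$ and its views close together and far from the negatives, the positive's score dominates the softmax and the loss approaches zero.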

Curiously, the InfoNCE objective also gives a lower bound on the mutual information between
$x$ and $v(x)$. The proof consists of two parts. First, they show that the
optimal classifier's scores satisfy $f(x, v(x)) \propto \frac{p(x \mid v(x))}{p(x)}$
(note that the logarithm of this ratio is the point-wise mutual information).
Then, they show that plugging this optimal $f$ into the loss bounds the mutual information
from below, up to a constant $\log N$ (the number of negative examples).
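The second step can be sketched as follows, using this note's convention of $N$ negatives (the paper counts $N$ samples in total, one of which is the positive, so its constant differs slightly). Plugging the optimal $f(x, v(x)) = \frac{p(x \mid v(x))}{p(x)}$ into the loss:

$$
\mathcal{L}_N^{\text{opt}}
= \mathbb{E} \log \left[ 1 + \frac{p(x)}{p(x \mid v(x))} \sum_{i=1}^{N} \frac{p(x_i \mid v(x))}{p(x_i)} \right]
\approx \mathbb{E} \log \left[ 1 + \frac{p(x)}{p(x \mid v(x))} \, N \right]
\geq \mathbb{E} \log \left[ N \, \frac{p(x)}{p(x \mid v(x))} \right]
= \log N - I(x; v(x)),
$$

where the approximation uses $\mathbb{E}\big[p(x_i \mid v(x)) / p(x_i)\big] = 1$ for negatives drawn independently of $v(x)$. Rearranging gives $I(x; v(x)) \geq \log N - \mathcal{L}_N$.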

The paper has a number of experiments, in video, images, text, and reinforcement learning.
(Again, "views" is terminology that got popularized in later papers.)
Judging by the number of papers using it today, it's easy to see that
InfoNCE contributed much of the momentum that contrastive learning has gained recently.