# Wasserstein Generative Adversarial Networks (@ ICML 2017)

### Martin Arjovsky, Soumith Chintala, Léon Bottou

The [[roam:f-GANs]] generalize GANs by optimizing arbitrary f-divergences.
However, since the f-divergence involves $q$ in the denominator, it can
be numerically unstable when the support of $q$ does not cover the support
of $p$ (i.e., $q$ assigns very low or zero probability to a sample from $p$).

This paper solves this problem by using the Wasserstein (a.k.a. earth mover's) distance
between $p$ and $q$.
If $p$ is our target distribution, and $q$ is the current distribution given by our
generator, this distance estimates how much density we need to move from where $q$
places it so that it matches $p$.
This movement can be encoded by a joint distribution.
Suppose $\Pi(p, q)$ is the set of all joint distributions $\gamma(x, y)$
whose marginals are $p(x)$ and $q(y)$.
Intuitively, given some $x$, its density $p(x)$ must entirely "come" from somewhere in $q$.
The EMD is the least amount of movement necessary, and is defined as:

\( \mathrm{EMD}(p, q) = \inf_{\gamma \in \Pi(p, q)} \mathbb{E}_{(x, y) \sim \gamma} \|x - y\|_1 \)
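For discrete distributions, this infimum is a small linear program: minimize the transport cost over all couplings $\gamma$ whose marginals are $p$ and $q$. A minimal sketch (the `emd` helper is illustrative, not from the paper; assumes SciPy is available):

```python
import numpy as np
from scipy.optimize import linprog

def emd(p, q, xs, ys):
    """Solve the transport LP: minimize sum_ij gamma_ij * |x_i - y_j|
    over couplings gamma with row sums p and column sums q."""
    n, m = len(p), len(q)
    cost = np.abs(xs[:, None] - ys[None, :]).ravel()  # |x_i - y_j|, row-major
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0  # row i of gamma sums to p_i
    for j in range(m):
        A_eq[n + j, j::m] = 1.0           # column j of gamma sums to q_j
    b_eq = np.concatenate([p, q])
    res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.fun

# All of p's mass sits at 0 and must be moved to 2, so the cost is 2.
print(emd(np.array([1.0]), np.array([1.0]),
          np.array([0.0]), np.array([2.0])))  # → 2.0
```

This scales poorly (the LP has $n \times m$ variables), which is one reason the paper works with the dual instead.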

How do we use this? As in the f-GAN, the primal form is impractical: it requires
optimizing over all couplings $\gamma$.
Instead, we can use the equivalent Kantorovich-Rubinstein dual:

\( EMD(p, q) = \sup_{f} \mathbb{E}_{x \sim p} f(x) - \mathbb{E}_{x \sim q} f(x) \)

Where $f$ is constrained to be 1-Lipschitz continuous. Again, by magic, all this involves is sampling from $p$
(the training set) and $q$ (the generator), and evaluating $f$ (the "discriminator", which wants to assign
high value to samples from $p$ and low value to samples from $q$). This again gives a min-max objective:
the EMD is estimated by the inner maximization over $f$, while the generator is updated in an
outer loop to minimize that estimate.
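The min-max objective above can be sketched as one critic step and one generator step in PyTorch (the networks and data are illustrative stand-ins; the paper's RMSProp optimizer with learning rate $5 \times 10^{-5}$ is used, but enforcing the Lipschitz constraint on $f$ is deferred to the clipping trick discussed below and omitted here):

```python
import torch
import torch.nn as nn

critic = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))      # f
generator = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 2))

opt_f = torch.optim.RMSprop(critic.parameters(), lr=5e-5)
opt_g = torch.optim.RMSprop(generator.parameters(), lr=5e-5)

real = torch.randn(128, 2) + 3.0   # stand-in samples from p (the data)
noise = torch.randn(128, 8)

# Inner maximization over f: maximize E_p[f] - E_q[f]
# (equivalently, minimize its negative).
fake = generator(noise).detach()
critic_loss = -(critic(real).mean() - critic(fake).mean())
opt_f.zero_grad(); critic_loss.backward(); opt_f.step()

# Outer minimization over the generator: minimize the EMD estimate,
# which (dropping the E_p[f] term, constant in the generator) is -E_q[f].
gen_loss = -critic(generator(noise)).mean()
opt_g.zero_grad(); gen_loss.backward(); opt_g.step()
```

In practice the paper takes several critic steps per generator step, so the inner maximization is closer to convergence before each outer update.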

The paper enforces the Lipschitz constraint with a crude trick: after each critic
update, the critic's weights are clipped to a small box $[-c, c]$.