[[roam:f-GANs]] generalize GANs by optimizing arbitrary f-divergences. However, since an f-divergence involves the ratio $p/q$ (with $q$ in the denominator), the estimate can be numerically unstable when the support of $q$ does not cover the support of $p$ (i.e., $q$ assigns very low or zero probability to a sample from $p$).
This paper addresses the problem by using the Wasserstein (a.k.a. earth mover's) distance between $p$ and $q$. If $p$ is our target distribution and $q$ is the current distribution given by our generator, this distance measures how much probability mass we need to move, and how far, so that the density assigned by $q$ matches $p$. Such a movement plan can be encoded by a joint distribution: let $\Pi(p, q)$ be the set of all joint distributions $\gamma(x, y)$ whose marginals are $p(x)$ and $q(y)$. Intuitively, given some $x$, its density $p(x)$ must entirely "come from" somewhere in $q$. The EMD is the least amount of movement over all such plans:
\( EMD(p, q) = \inf_{\gamma \in \Pi(p, q)} \mathbb{E}_{(x, y) \sim \gamma} \left[ \| x - y \|_1 \right] \)
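As a small sanity check on this primal definition, the sketch below estimates the EMD between two 1-D empirical samples with SciPy's `scipy.stats.wasserstein_distance` (which solves the 1-D optimal transport problem directly); the sample sizes and Gaussian parameters are arbitrary choices for illustration, not anything from the paper.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

# Samples from the "target" p and the "generator" q (illustrative choices).
x_p = rng.normal(loc=0.0, scale=1.0, size=5000)   # p = N(0, 1)
x_q = rng.normal(loc=3.0, scale=1.0, size=5000)   # q = N(3, 1)

# 1-D EMD between the two empirical distributions.
# For two Gaussians with equal variance this should be close to |3 - 0| = 3.
print(wasserstein_distance(x_p, x_q))
```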
How to use this? Again, as in the f-GAN, optimizing the EMD in its primal form directly is tricky. Instead, we can use the equivalent Kantorovich-Rubinstein dual:
\( EMD(p, q) = \sup_{f} \left[ \mathbb{E}_{x \sim p} f(x) - \mathbb{E}_{x \sim q} f(x) \right] \)
where $f$ is constrained to be 1-Lipschitz continuous. Again, by magic, all this involves is sampling from $p$ (the training set) and from $q$ (the generator), and evaluating $f$ (the "discriminator", called the critic in the paper, which wants to assign high values to samples from $p$ and low values to samples from $q$). This, again, gives a min-max objective: the EMD is estimated by maximizing over $f$, but we want the generator to minimize the EMD, which implies an outer minimization over the generator's parameters.
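As a rough sketch (not the paper's exact training procedure), the PyTorch snippet below shows how the two sides of this min-max objective could look; the network sizes, learning rate, and `noise_dim` are made-up illustrative values.

```python
import torch
import torch.nn as nn

noise_dim, data_dim = 64, 2  # illustrative sizes
critic = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU(), nn.Linear(128, 1))
generator = nn.Sequential(nn.Linear(noise_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))

opt_c = torch.optim.RMSprop(critic.parameters(), lr=5e-5)
opt_g = torch.optim.RMSprop(generator.parameters(), lr=5e-5)

def critic_step(x_real):
    # Inner maximization: f wants E_p[f(x)] - E_q[f(x)] to be large,
    # so we minimize the negative of that quantity.
    z = torch.randn(x_real.size(0), noise_dim)
    x_fake = generator(z).detach()
    loss_c = -(critic(x_real).mean() - critic(x_fake).mean())
    opt_c.zero_grad(); loss_c.backward(); opt_c.step()

def generator_step(batch_size):
    # Outer minimization: the generator wants to lower the estimated EMD,
    # i.e. raise E_q[f(x)], so it minimizes -E_q[f(x)].
    z = torch.randn(batch_size, noise_dim)
    loss_g = -critic(generator(z)).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```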
The paper enforces the Lipschitz constraint with a fairly crude trick: after each critic update, the critic's weights are clipped to a small box (e.g., $[-0.01, 0.01]$), which bounds the Lipschitz constant of $f$.
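Continuing the sketch above, this clipping would be a small step applied after every critic update (0.01 is the threshold used in the paper's experiments; the helper name is my own):

```python
clip_value = 0.01  # clipping threshold used in the paper's experiments

def clip_critic_weights():
    # Crude Lipschitz enforcement: keep every critic parameter in [-c, c].
    with torch.no_grad():
        for p in critic.parameters():
            p.clamp_(-clip_value, clip_value)
```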