Lernapparat

Maschinelles Lernen Lernen

Geometric Intuition on Improved Wasserstein GANs

April 13, 2017

Recently, Gulrajani et al published Improved Training of Wasserstein GANs. It adds a relaxed constraint to the original Wasserstein GAN discriminator training objective described by Arjovsky et al. We take a geometric look at why it is important.

Before I start, I can heartily recommend Alex Irpan's read-through of Arjovsky et al.'s Wasserstein GAN article.

My intuition why Wasserstein distance is great

The original Wasserstein GAN paper observed that the $W_1$ Wasserstein metric on a space of probability distributions can be a very useful loss criterion for GAN training.

To me, the number one intuitive property that makes the Wasserstein family of metrics attractive is that is differentiates between locality of deviations between different measures. Consider the three distributions in Figure 1. What are their distances?

Histograms of three distributions.
Figure 1: Histograms of three distributions. What is the distance between them?

The Kullback-Leibler divergence and the total variation essentially consider only the height of the bars for each category. Thus all three distributions seem equally far apart. If we put the four classes on the for points $1,...,4$ on the real line and use the distance function, we expect the distributions for A and B to be closer than B and C and that A and C are the farthest apart. The Wasserstein metric achieves just that by considering a metric (more generally a cost function) on the metric space on which the probability measures are defined.

Another way to look at this is that if you have a distribution on $X$, for example $X \subset \mathbb{R}^n$ and you apply a small distortion, i.e. a transformation $T : X \rightarrow X$ that is with $|x-T(x)|$ small, the Wasserstein distance between the original measure and the push-forward (the "distorted" probability distribution) will be small: $W_1(p, T_\# p) = \int |x-T(x)| dp(x)$. Thus the Wasserstein distance considers interior variations (on the $x$-es) rather than exterior variations (on the $p(x)$-es) like Kullback-Leibler or Total Variation. In this sense, the Wasserstein distance captures a sense of closeness that other metrics cannot see (think of a distorted image that we can easly match with the original, but expressing that in a distance function is hard).

For some applications we may need both. One way might be the unnormalized distance presented in Frogner et. al. Learning with a Wasserstein loss, but to me this is still a bit open (and my experience with implementing the Wasserstein loss is that it seemed rather unstable on single precision).

Intuition on what is improved

The Improved Training article has a neat description of experiments to illustrate the imperfectness of the gradient clipping method to achieve a Lipschitz bound. But what went wrong?

The Wasserstein GAN uses the dual representation of $W_1$ using Lipschitz functions with Lipschitz-constant at most $1$. $W_1 (\mu, \nu) = \sup \{ \int_M f(x) \mathrm{d} (\mu - \nu) (x) | f : M \to \mathbb{R}, \mathrm{Lip} (f) \leq 1 \} $. Naturally, replacing 1 by any other constant only multiplies $W_1$ with that constant.

In the original Wasserstein paper, the test space is constrained by clipping the neural network parameters. This makes the parameter space compact and - as the mapping of parameters to functions is continuous - also the set of test function, say $\mathcal{M}$, compact in the space of Lipschitz functions (I'm tempted to write $W^{1,\infty}$ following the usual Sobolev space notation, but the various $W$s would be confusing). This leads to the range of test functions $\mathcal{M}$ being in some Lipschitz ball because compact sets in metric spaces are always bounded.

However, to get the Wasserstein metric, it is crucial that the test function set covers the entire Lipschitz ball (well, in reality the intersection with the set of functions that can be represented by the network). Otherwise the supremum can be achieved somewhere totally different. To see how this can happen, consider the following simple task: given some point $x \in \mathbb{R}^2$ (the red dot in Figure 2) find the closest point in the unit ball (green). This is vastly different to finding the closest ball on the square inscribed in the unit ball. On the ray through the corner of the square, the two distances will coincide but in generally they will be different like for the red dot. By picking two points on that ray that are close (but appropriately nudged) to the distances of the red dot to the circle and square, one can see that the ordering of points by distance is not the same for the circle and the square. Thus one can expect the gradients of the Wasserstein GAN's loss function and the Wasserstein distance to point in different directions.

Find the closest point
Figure 2: Given the task to find the point on the green circle that is closest to the red dot, one ends up with a very different point and a very different distance when clipping the coordinates to the blue square.

Thus the Improved Training objective aims to find a better way to impose a "uniform" Lipschitz bound in that any relevant non-constant Lipschitz function divided by the Lipschitz constant is in the test set. In figure 2, we want to avoid that the maximum distance of a point $P$ to the origin $(0)$ depends on the angle of the line $\overline{0P}$ to the horizonal line. This is OK for the circle, but not the square.

This is my geometric intuition of what went wrong. In terms of the underuse of capacity discussed in the article, the simple functions would lie close to the corner of the square and the more complex ones in near middle of the edges.

A back-of-the-envelope argument why the penalty term is reasonable

So we wish to find the supremum over all Lipschitz functions with Lipschitz constant at most 1. Clearly the Lipschitz constant of a maximizer will be one: If it were less, one could divide by the Lipschitz constant. This would divide the integral by the Lipschitz constant, making it larger. But the Lipschitz constant is the (essential) maximum over $x$ of the gradient norm $|\nabla f(x)|$, why is it OK to favour the gradient to be of length one everywhere?

The authors note (and give a proof) that the maximizer would have gradient one (almost) everywhere, but the proof is technical enough to be banned to the appendix. For a brief intuition about this, consider the (special) case when we actually have a transport map $T$ such that $\nu = T_\# \mu$. Then given any 1-Lipschitz test function $f$, we can write $$ \int_X f(x) \mathrm{d} (\mu - \nu) (x) = \int_X f(x)-f(T(x)) \mathrm{d} \mu (x) \leq \int_X |x-T(x)| \mathrm{d} \mu (x) = W_1(\mu, \nu). $$ where we have used our assumption that $T$ is an optimal transport map for the distance as the cost function. The Kantorovich-Rubinstein duality theorem tells us that taking the supremum over all 1-Lipschitz functions on the left side we have equality. Actually, under rather mild conditions, this supremum is achieved (see e.g. the version of duality in Theorem 5.10 of Villani's excellent book Optimal Transport Old and New that is also used for the proof in the Improved Training article).

But as 1-Lipschitz functions always satisfy $|f(x)-f(y)| \leq |x-y|$, equality in the integral implies that equality holds for the integrand for $\mu$-almost all $x \in X$.1

If $f$ is regular enough along the way, we can apply the argument again: To get from $f(x)$ to $f(y)$ we must change by $|f(x)-f(y)|=|x-y|$ in total, so with rate $1$ on average (and always in the same direction), but with the rate of change being less than $1$ by the 1-Lipschitz condition, it has to be of rate $1$ almost everywhere between $x$ and $y$. More formally, the 1-Lipschitz function $f$ has $|\nabla f| \leq 1$ almost everywhere, but as $$ |y-x|=|f(y)-f(x)| = | \int_0^1 (y-x) \cdot \nabla f(x+t (y-x)) dt| \leq |y-x| \int_0^1 |\nabla f(x+t (y-x))| dt, $$ we must have that |\nabla f| is of norm 1 along the line from $x$ to $y$.

This is why it is natural to focus on test functions with $|\nabla f|=1$. The authors of Improved Training relax the requirement and introduce the penalty term $\lambda E[(|\nabla f(x)|-1)^2]$.

Making the constraint hard or even choosing a large will produce an energy landscape with deep narrow trenches which makes optimization hard. By relaxing the problem to only include a penalty, one may expect the length of the gradient to systematically exceed $1$. (Similar to almost all optimization in learning, we cannot guarantee that what the gradient ascent finds is a global maximum, but we likely have to accept that and carry on.)

Great article.

Semi-Improved Training of Wasserstein GANs

(This section has been added 2017-04-16.)

Looking at the code accompanying the Improved Training, I noticed that the penalty actually samples a single interpolated point per real-fake data pair and computed the gradient there. Having finished my writeup linked above, this - and the fact that pytorch did not support second derivatives - led me to think that maybe one could take a step back - namely stick with $|f(x)-f(y)| = |x-y|$ instead of testing $|\nabla f| = 1$.

As this cuts one step from Improved Training of Wasserstein GANs I call this method Semi-Improved Training of Wasserstein GANs. I have implemented two variants in a Jupyter Notebook on github:

  • Variant 1 takes $x$ and $y$ as a pair of real and fake data and applies the same penalty as Improved Training: $p(x,y) = \lambda (1-|f(x)-f(y)|/|x-y|)^2$.
  • Variant 2 takes $x$ and $y$, and a randomly chosen interpolation point $z=\alpha x+(1-\alpha) y$ (just like the point where Improved Training evaluates $|\nabla f|$). It then uses $(p(x,z)+p(y,z))/2$ as the penalty term.

As in Improved Training, these replace the gradient clipping of the original Wasserstein GAN.

On the toy problems of Improved Training both variants seem to work fine. The second one could help if there is a problem with test functions being steeper than 1 (i.e. violating the 1-Lipschitz constraint) in some part and less steep in another.

Things to try

  • One might use with an asymmetric penalty term like $\lambda E(max(|\nabla f|-1,0))$ with a higher weight $\lambda$ - the intuition being that the optimization will approach a good $f$ from the interior of the Lipschitz ball.
  • Also, one might increase the weight $\lambda$ as the discriminator convergence.
  • Maybe compare the discriminator-computed Wasserstein distance to one computed with one of the Wasserstein algorithms. I have been playing around with the Sinkhorn family of algorithms to compute a regularized Wasserstein distance, but it seems that single precision is quite a limitation.

  1. I love the Garrison Keillor's Lake Wobegon motto: Where all the women are strong, all the men are good looking and all the children are above average. 

'