More Improved Training of Wasserstein GANs and DRAGAN

May 29, 2017

This is following up on my post on improved and semi-improved training of Wasserstein GANs. A few days ago, Kodaldi et al published How to Train Your DRAGAN. They introduce an algorithmic game theory approach and propose to apply the gradient penalty only close to the real-data manifold. We take a look at their objective function, offer a new possible interpretation and also consider what might be wrong in Improved Training objective. While doing so we introduce PRODGAN and SLOGAN.

I won't be rehashing the construction of the Wasserstein GAN but instead assume that you have just read either of the articles or my previous blog post and jump right in.

Where to constrain the discrimination function?

The key difference between DRAGAN and the Improved Training article (aka WGAN-GP for Wasserstein GAN with Gradient Penalty) is where the discrimination function $f$ calculated by the critic network is Gradient-constrained (in both cases penalised at a random location per sample to favor $|\nabla f|=1$).

In the WGAN-GP, the gradient location is picked on a random line between a real and randomly generated fake sample. Thus is tries to globally enforce $|\nabla f| = 1$.
In the DRAGAN, the gradient is sampled "close" to the real sample. In this sense, this locally enforces $|\nabla f| = 1$ only in a neighborhood of the real sample support.

This distinction is emphasized in the section 4.3 as well as the text accompanying the DRAGAN code. Figure 1 shows the difference graphically.

Local vs. Global. — Figure 1: Local vs. global - the difference between WGAN-GP and DRAGAN is where the gradient penalty is enforced.

Note that in Improved Training, Gulrajani et al emphasize the advantage of the Wasserstein distance as a globally meaningful distance with meaningful gradients (as opposed to KL-Divergence, TV etc.). So moving to local may come with severe drawbacks when the generator is far from the real-data manifold. The main advantage is the lesser rigidity of DRAGANs.

The authors do suggest that by penalising the gradient near the real-data manifold¹, they keep the desirable rigidity that prevents mode collapse but that the WGAN-GP is overly constrained.

One experimental verification offered in the DRAGAN article uses the toy problems introduced by Gulrajani et al. In the code accompanying the latter, the authors note that smaller $\lambda$ seems to help for toy tasks specifically and set the weight of the gradient penalty to $0.1$ rather than $10$ as in the other experiments. The DRAGAN experiment sets $\lambda = 10$ and indeed, then DRAGAN seems to converge better than WGAN-GP. So there may be something to be said about WGAN-GP's penalty term.

In fact I do think that there is a point about the rigidity to be made, but I think that it may be educational to look at the duality theorem that plays a crucial rôle for WGAN-GP. This is what we do below, but first, let's quickly invent another type of GAN.²

Improved DRAGAN and PRODGAN

One curious implementation detail about DRAGAN (as of v3 from May 25th) is that the authors recomend a method of generating samples where all real inputs are changed in a positive direction. It would seem curious if it is important to move off the real input but it is OK to only do so in one particular quadrant. I have implemented this method in the Improved WGAN and DRAGAN jupyter notebook notebook, selected by lipschitz_constraint=4.

I also implemented a version that seemed to more intuitive to me. With lipschitz_constraint=5, the standard deviations are calculated for each input location over minibatch and a normal random number is drawn for each input and batch item. This way, constant inputs would not be varied and also both signs do happen. This might be considered an improved DRAGAN.

Given that the DRAGAN offsetting appears to work, we can go further and impose the Penalty Right On the Data for a novel GAN variant. PRODGAN's critic network thus minimizes the objective function $$ L_{PRODGAN-C} := E(f(x_{Fake}) - E(f(x_{Real})) + E((|\nabla f(x_{Real})|-1)^2) $$ On the global-local scale, this is the most local it gets.

I included this in my Improved WGAN and DRAGAN jupyter notebook when choosing lipschitz_constraint=6. For the toy swiss roll problem it seems to work reasonably well.

Note that I argue below that PRODGAN may not be a good solution.

A closer look at the duality theorem underlying WGANs

When we specialise (d) of the duality theorem 5.10 of Villani's Optimal Transport Old and New, the duality states that $f(y)-f(x) \leq |x-y|$ everywhere with equality $\pi$-almost everywhere where $\pi$ is the optimal transport plan. Thus, we can only expect $f(y)-f(x) = |x-y|$ when $y$ is indeed mapped to $x$ in the moving plan. But this means that imposing $|\nabla f| = 1$ on lines between arbitrary points as done by Improved Training of WGANs (as well as the Semi-Improved variant) may be undesirable.

Of course, under the assumption that the data manifold and the generated probability distribution are in general enough position to ensure that the rays between matched points cover the rays between any points (or even the entire space). However, the data manifold seems sparse and elusive enough that this is quite an assumption. Indeed, the swiss roll experiment seems to show that the Lipschitz-bound is not sharp for, say, an $x$ in the inner part of the roll and a $y$ from the outer part.

This nicely reflects the observation that imposing the WGAN-GP penalty as is between (to paraphrase the text accompanying the DRAGAN code) a real cat image and a blurry air-plane image is trying to be overly ridgid.

Why does DRAGAN apparently fare better?

One explanation might be that close to the real data points the points where the gradient penalty is applied might more likely lie on (or close to) rays between matched points in the distributions and for those the gradient is (approximately) of unit norm.

Another potential explanation is that DRAGAN encourages the discriminator function $f$ to be a good discriminator and solve an Eikonal equation, so essentially if $f$ were to vanish on the reasonably behaved real-data manifold, the discriminator function $f$ behaves like a distance to the manifold in the input space in a neighborhood of the real-data manifold (rather than going via Wasserstein-distance-on-probability-space). This interpretation shows why the discriminator moves the support of the generated distribution towards the real-data manifold, but does not cover alignment of densities between the two. Interestingly, the usual GAN quality criteria seem to have a similar blind spot: As long as it is not too concentrated, looking at generated samples would not give evidence for or against the agreement of the real and generated distributions.

Level-set view of the distance. — Figure 3: Level set view of the signed distance function.

Of course, if we want the $f$ to be this distance (rather than signed distance), the gradient would have to jump around the data manifold, so PRODGAN does not fit this interpretation.

Revisiting the two-sided penalty

Our inspection of the duality theorem suggested that the two-sided gradient penalty $$ E((|\nabla f(\alpha x_{real}-(1-\alpha) x_{fake})|-1)^2) $$ imposes a property that the dual maximizer in the Wasserstein problem does not actually have. Thus it is preferrable to replace it by the one-sided penalty $(\max(|\nabla f|-1, 0))^2$. I should note that Gulrajani et al report that in their experiments that they had faster convergence and better optima using the two sided penalty. They argue that in addition the optimal discriminator had $|\nabla f| = 1$ a.e., but as discussed above, there may be a rather strong implicit assumption in there. So if you think the assumption is quite speculative and value theory over experiments, then the one-sided penalty might be a better fit.

There is one more intuitive reason not to use a two-sided penalty: Consider the situation when the functional with the two-sided penalty gives a stronger gradient on the weights $w$. This will be precisely when the discrimination term $\nabla_w f(x_{real} - \nabla_w f(x_{fake})$ is small compared to $\nabla_w ((|\nabla_x f(x_{interp})|-1)^2)$, in other words when the penalty causes the optimization to more or less ignore the maximization of the discrimination term and only considers the boundary condition. This is particularly bad when the two have oppositie sign for the weights and $|\nabla f|<1$, because then the right thing to do would be to maximize the discrimination term. In this way, the two-sided gradient penalty introduces a (or an additional) nonconvexity that may cause spurious local optima.

Non-convexity of WGAN-GP penalty — Figure 4: The non-convexity of the gradient penalty in WGAN-GP and DRAGAN may introduce spurious local minima, in particular for sparse or complicated geometries of the real-data manifold. The 'erroneous' gradient is fairly strong for small $|\nabla_x f |$.

Note that with the Eikonal interpretation of DRAGAN's objective introduced above, $|\nabla f|=1$ a.e. in the neighborhood of the real-data manifold is the actual condition. However, I do think that the nonconvexity makes it a much more difficult problem for DRAGAN as well.

SLOGAN

In fact, in my experiments with the swiss roll problem, the one-sided version of Lipschitz Penalty 2 (variant 2 of semi-improved training) was stable for penalty weight $\lambda=10$ while the two-sided version was unstable (it did converge for $\lambda = 0.1$ and $\lambda = 1$). Maybe the "mystery" of the choice of $\lambda$ for the toy problems is just that the non-convexity is decidedly unhelpful. So we need to name the semi-improved WGAN with one-sided penalty. We call it Single-sided Lipschitz Objective GAN (SLOGAN).

In fact, one of the difficulties of WGAN-LP with two-sided penalty seems to be that for LP it is more important to use the one-sided penalty. In hindsight this is no surprise, because the Lipschitz quotient $|f(x)-f(y)|/|x-y|$ can only be 1 if the line is parallel to the gradient, which is only the case when $x$ and $y$ are matched, but not in general, e.g. if one of them is perturbed. [Updated:] Note that this equally affects the proposals to approximate the gradient by a finite difference along $x-y$ (I don't speak enough Chinese to follow along, but on the Pytorch forums, 郑华滨 is credited with this blog post, which seems to be a popular discussion of the Improved Training article). The equation does not test the gradient, but only the projection of the gradient onto the line $x-y$, so you cannot expect that to have unit norm but norm less or equal than one. As such, I would recommend using the one-sided penalty with that approximation as well.

Gradient vs. Lipschitz quotient. — Figure 5: Why the Lipschitz quotient should not be forced to be 1. This was wrong in the *semi-improved training*, but fixed by the one-sided penalty of SLOGAN

My impression is that, although the DRAGAN will also converge, that SLOGAN (or WGAN-single-sided-GP for that matter) converges faster on the swiss roll. And it seems that away from the data the level sets are more regular for SLOGAN than for DRAGAN. This is not surprising, but it might be disadvantageous in training.

Swiss Roll SLOGAN — Figure 6: SLOGAN with $\lambda=10$

Swiss Roll WGAN one-sided GP — Figure 7: WGAN-GP with $\lambda=10$ and one-sided penalty converges well, too.

Swiss Roll WGAN GP — Figure 8: The original WGAN-GP with $\lambda=10$ does not converge.

Algorithmic game theory behind DRAGAN

One of the great things about the DRAGAN paper is that the game theory interpretation gives justification to the practice of only partially training the discriminator between steps. I have not yet fully understood the connection between the game-theoretic interpretation of DRAGAN and the choice of penalisation. Also, what would the error term for Follow the Regularized Leader be in this setting?

So I must admit that I have not understood the game theory behind this to really appreciate this aspect of the DRAGAN contribution fully, and so my discussion of DRAGAN leaves out quite an important contribution of the article.

Conclusion

I like the game-theoretic motivation, but my main conclusion is that I should learn more game theory before I can fully appreciate that. It never is hard to find things that one doesn't know is it?

It will be interesting to see how the quality metrics proposed work out. I have not really followed the evaluation state of the art for GANs.

The DRAGAN authors make an important point that the Improved Training objective may be too ridgid. And the swiss roll experiment with $\lambda=10$ seems to be a good case to study that excessive rigidity.

However, I still think that the Wasserstein distance interpretation is great. My personal suspicion is that the question of one-sided versus two-sided penalty may be more crucial here than the local versus global and that the locality may be actually disadvantageous when the generator is not good yet. Maybe there are other methods to overcome the shortcomings of the one-sided gradient penalty observed by the authors of Improved Training. For example, it may be desirable to test the Lipschitz condition introduced by Semi-improved Training on all pairs of the minibatch. Or one could try to deliberately produce a fake sample close to the real one.

Part of the success of DRAGAN might be in the Eikonal-equation they solve.

We have seen PRODGAN as the extreme of the local-global scale. I would not recommend it, though. We have also derived SLOGAN as a fixed semi-improved training method after seeing what is wrong with it. I think that SLOGAN is likely a very good methodology.

Things to do

Systematically look at (weight) gradient norms when $\lambda=10$ causes Improved Training to not succeed. Maybe the contribution of the gradient penalty indeed distorts the discrimination term's gradient when the $|\nabla_x f|$ is small.
Does the DRAGAN discriminator in fact learn a distance-like function close to the real-data manifold?
Set up an illustrative experiment on what DRAGAN does within the support of the real-data manifold.
More real-world experiments with SLOGAN (the one-sided penalty).
Does it help to test the Lipschitz condition for more minibatch items? This would likely have to be done on real problems, not just the toy ones...
What happens if one combines DRAGAN's gradient penalty and the (one-sided) improved training one?

The mathematician in me likes to think about the data as lying on a manifold but expects that whatever it is the support of the real data distribution is very unlikely to be an actual manifold (but a union of manifolds or somesuch). It seems that real-data locus is mathematically vague enough to be more appropriate technically, but it seems that real-data manifold is the more commonly used term. ↩
I should have called Semi-Improved-WGANs WGAN-LP for Lipschitz penalty. So here is another GAN, this time I am learning to invent fancy acronyms. ↩

Lernapparat