Chapter 6
Wasserstein GAN
Earth Mover distance meets adversarial learning. We swap the Jensen-Shannon divergence for transport geometry, enforce 1-Lipschitz critics, and steady the duel with gradient penalties.
Why Wasserstein GAN?
Traditional GANs lean on Jensen-Shannon divergence and can stall, oscillate, or collapse. The Earth Mover (Wasserstein) distance offers a smoother, more meaningful gradient landscape that tracks how mass must move to align two distributions.
Distance with geometry
Accounts for how far probability mass travels, not just overlap.
Stable objectives
Smoother critic signals reduce mode collapse and training jitter.
Lipschitz control
Weight clipping or gradient penalties enforce 1-Lipschitz critics.
Kantorovich-Rubinstein Dual Form of Wasserstein Distance
The Wasserstein distance, also known as the Earth Mover's distance, compares distributions by the cost of optimal transport. Given two probability distributions $P$ and $Q$ on metric space $(\mathcal{X}, d)$:
$$ W_p(P, Q) = \left(\inf_{\pi \in \Gamma(P, Q)} \int_{\mathcal{X} \times \mathcal{X}} d(x, y)^p \, d\pi(x, y)\right)^{\tfrac{1}{p}} $$
Here $\Gamma(P,Q)$ are couplings with marginals $P$ and $Q$.
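As a quick numerical illustration (a minimal sketch, assuming NumPy and SciPy are available; `scipy.stats.wasserstein_distance` computes the 1-D $W_1$ between empirical samples):

```python
# 1-D sanity check of the Earth Mover distance between two empirical distributions.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
p_samples = rng.normal(loc=0.0, scale=1.0, size=100_000)  # samples from P
q_samples = rng.normal(loc=2.0, scale=1.0, size=100_000)  # samples from Q

# For two equal-variance Gaussians, W_1 is exactly the gap between the means (2.0 here),
# so the empirical estimate should land close to 2.
print(wasserstein_distance(p_samples, q_samples))
```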
For $p=1$, the dual (KR) form is
$$ W_1(P, Q) = \sup_{\|f\|_{\text{Lip}} \leq 1} \left\{\int f(x) \, dP(x) - \int f(x) \, dQ(x)\right\} $$
Supremum over all 1-Lipschitz $f: \mathcal{X}\to\mathbb{R}$; $\|f\|_{\text{Lip}}$ is the Lipschitz constant.
The KR dual form highlights the connection between optimal transport and Lipschitz functions and gives a tractable way to optimize Wasserstein distance in practice.
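In practice the supremum is approximated by a parametric critic $f_w$ and the expectations by minibatch averages. A minimal sketch of the resulting dual estimate (PyTorch assumed; the network size and placeholder data are illustrative, and the 1-Lipschitz constraint still has to be enforced separately, e.g. by weight clipping or a gradient penalty):

```python
import torch
import torch.nn as nn

# A small critic f_w standing in for the 1-Lipschitz function in the KR dual.
critic = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))

x_p = torch.randn(256, 2)         # minibatch from P (placeholder data)
x_q = torch.randn(256, 2) + 3.0   # minibatch from Q (placeholder data)

# Empirical KR dual objective: E_P[f_w(x)] - E_Q[f_w(x)].
# Maximizing this over (constrained) critic parameters w estimates W_1(P, Q).
dual_estimate = critic(x_p).mean() - critic(x_q).mean()
print(dual_estimate.item())
```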
Why KR duality matters for WGANs
The Kantorovich-Rubinstein theorem matters for the WGAN optimization problem because it lets us estimate the Wasserstein distance between two distributions $p$ and $q$ by maximizing over 1-Lipschitz functions instead of minimizing over couplings. This is useful because the Wasserstein distance compares distributions more gracefully than total variation or the Kullback-Leibler divergence: it remains finite and informative even when the supports barely overlap.
The Wasserstein GAN uses this result to reformulate its objective as a saddle-point (min-max) problem between a generator and a 1-Lipschitz critic, solved with alternating stochastic gradient steps. This yields noticeably more stable training than traditional GANs, which suffer from mode collapse and instability.
Specifically, Arjovsky et al. [2017] enforce the Lipschitz constraint with weight clipping (the gradient penalty of Gulrajani et al. [2017] came later) and show that the resulting critic loss is smoother, correlates with sample quality, and supplies useful gradients to the generator almost everywhere. They also demonstrate that this mitigates mode collapse and instability, leading to better-quality samples and more reliable training.
Furthermore, Arjovsky et al. [2017] compare WGANs with traditional GANs on benchmark image datasets, reporting comparable or better sample quality and diversity together with markedly more stable training. These results provide strong empirical evidence for the benefits of using the Wasserstein distance and 1-Lipschitz critics in generative modeling.
Remarks on using KR duality
The reformulation of the Wasserstein distance as a maximization over 1-Lipschitz functions simplifies the optimization problem in several ways:
- The dual problem is well behaved: at the level of functions, the objective is linear in $f$ and the set of 1-Lipschitz functions is convex, so the inner maximization is a convex problem. Convexity is lost once $f$ is parameterized by a neural network, but the resulting objective remains far better behaved than the saturating loss of a standard GAN.
- The Lipschitz constraint is easy to enforce (approximately): weight clipping or a gradient penalty are simple to implement and work well in practice; a minimal clipping snippet follows this list.
- The objective is more stable: the Wasserstein objective is continuous in the generator's parameters and differentiable almost everywhere, so the critic keeps supplying useful gradients even when the real and generated distributions are far apart, which in practice reduces mode collapse and training jitter.
- The Wasserstein distance respects geometry: unlike total variation or the Kullback-Leibler divergence, which saturate or blow up when supports do not overlap, it measures how far mass must move and varies smoothly as one distribution is transported toward the other.
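For instance, the weight-clipping route amounts to a single projection step after each critic update (a minimal sketch; the toy `critic` is an assumption, and $c=0.01$ matches the default of the original algorithm given later in this chapter):

```python
import torch
import torch.nn as nn

critic = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))
c = 0.01  # clipping parameter (original WGAN default)

# After each critic gradient step, project every weight back into [-c, c].
with torch.no_grad():
    for p in critic.parameters():
        p.clamp_(-c, c)
```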
Kantorovich-Rubinstein Theorem
Statement
$$ W(p,q)=\inf_{\pi \in \Pi(p,q)} \mathbb{E}_{(x,y) \sim \pi}[ \| x-y \|_2]= \sup_{\|h\|_L \leq 1} \big[\mathbb{E}_{x \sim p} [h(x)] -\mathbb{E}_{y \sim q}[h(y)]\big] $$
Full proof walkthrough
We note that $p$ and $q$ are marginals for joint $\pi$. Hence,
$$ p(x)= \int \pi(x,y) \, dy, \qquad q(y) =\int \pi(x,y) \, dx, $$
which gives the equality constraints
\( h_1(x) : p(x)- \int \pi(x,y) \, dy =0, \quad h_2(y) : q(y)- \int \pi(x,y) \, dx =0. \)
Construct the Lagrangian with multipliers $f,g$:
$$ L(\pi,f,g) =\int \|x-y\|_2 \, \pi(x,y) \, dx \, dy +\int \left( p(x)-\int \pi(x,y) \, dy \right) f(x) \, dx +\int \left( q(y)-\int \pi(x,y) \, dx \right) g(y) \, dy. $$
Collecting terms:
$$ L(\pi,f,g) = \mathbb{E}_{x \sim p}[ f(x) ]+\mathbb{E}_{y \sim q}[ g(y) ] + \int \left( \| x-y \|_2 -f(x)-g(y) \right) \, \pi(x,y) \, dy \, dx. $$
Slater's condition holds (convex objective, no inequalities), so strong duality gives
$$ W(p,q) = \inf_{\pi} \sup_{f,g} L(\pi, f,g) = \sup_{f,g} \inf_{\pi} L(\pi,f,g). $$
To avoid $L \to -\infty$, enforce $\|x-y\|_2 \geq f(x)+g(y)$ for all $x,y$, making the infimum occur at $\pi(x,y)=0$.
$$ \sup_{f,g} \inf_{\pi} L(\pi,f,g) = \sup_{f,g} [\mathbb{E}_{x \sim p}[f(x)]+\mathbb{E}_{y \sim q} [g(y)] ] = W(p,q) $$
Lower bound: for any $h$ with $\|h\|_L \leq 1$ and any coupling $\pi \in \Pi(p,q)$,
$$ \mathbb{E}_{x \sim p}[h(x)]-\mathbb{E}_{y \sim q}[h(y)] = \int (h(x)-h(y)) \, \pi(x,y) \, dx \, dy \leq \int \|x-y\|_2 \, \pi(x,y) \, dx \, dy. $$
The left-hand side depends only on the marginals $p$ and $q$, not on the particular coupling, so we may take the infimum over $\pi$ on the right-hand side:
$$ \mathbb{E}_{x \sim p}[h(x)]-\mathbb{E}_{y \sim q}[h(y)] \leq \inf_{\pi} \int \|x-y\|_2 \, \pi(x,y) \, dx \, dy = W(p,q). $$
Since it holds for all $h$ with $\|h\|_L\leq 1$:
$$ \sup_{\|h\|_L \leq 1} [\mathbb{E}_{x \sim p}[h(x)]-\mathbb{E}_{y \sim q}[h(y)]] \leq W(p,q) $$
Upper bound: given $f,g$ with $f(x)+g(y) \leq \|x-y\|_2$ for all $x,y$, define the infimal convolution $k(x)= \inf_u [ \|x-u\|_2-g(u) ]$. Then $k$ is $1$-Lipschitz (an infimum of $1$-Lipschitz functions of $x$), $f(x) \leq k(x)$ (take the infimum of the constraint over $u$), and $k(y) \leq -g(y)$ (take $u=y$). Hence
$$ \mathbb{E}_{x \sim p}[f(x)]+\mathbb{E}_{y \sim q} [g(y)] \leq \mathbb{E}_{x \sim p}[k(x)] -\mathbb{E}_{y \sim q}[k(y)]. $$
Therefore $W(p,q)=\sup_{f(x)+g(y) \leq \|x-y\|_2}[\mathbb{E}_p f+\mathbb{E}_q g] \leq \sup_{\|h\|_L\leq1}[\mathbb{E}_p h-\mathbb{E}_q h]$.
Combining bounds yields the KR duality:
$$ W(p,q) = \sup_{\|h\|_L \leq 1} [ \mathbb{E}_{x \sim p}[h(x)]-\mathbb{E}_{y \sim q}[h(y)] ]. $$
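As a sanity check of the duality (a worked example, not part of the original proof): take two point masses $p=\delta_a$ and $q=\delta_b$ in $\mathbb{R}^n$ with $a \neq b$. The only coupling puts all mass on the pair $(a,b)$, so the primal side gives $W(p,q)=\|a-b\|_2$. On the dual side, the linear function $h(x)=\langle a-b, x\rangle / \|a-b\|_2$ is $1$-Lipschitz and
$$ \mathbb{E}_{x \sim p}[h(x)]-\mathbb{E}_{y \sim q}[h(y)] = h(a)-h(b) = \frac{\langle a-b, a-b\rangle}{\|a-b\|_2} = \|a-b\|_2, $$
so this $h$ attains the supremum and both sides of the duality agree.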
Improved WGAN (WGAN-GP)
The paper by Gulrajani et al. (2017) begins by discussing the limitations of traditional GAN training methodologies, particularly in terms of mode collapse and training instability. It highlights how the Wasserstein distance, also known as the Earth-Mover (EM) distance, provides a more meaningful and smooth gradient flow for training GANs.
A key innovation described is the use of a gradient penalty term added to the Wasserstein loss function, which enforces a Lipschitz constraint for the critic (discriminator). This approach, unlike previous methods that clip weights (see the original WGAN algorithm below), allows for more effective training dynamics and avoids common pitfalls such as gradient vanishing or exploding.
The authors present extensive experiments comparing their method against standard GAN training setups and against the weight-clipped WGAN. The results show improved training stability, reduced mode collapse, and higher-quality generated images, evaluated with the Inception Score alongside qualitative inspection of samples.
Weight clipping
Simple to implement but can constrain capacity and bias the critic.
Gradient penalty
Penalizes deviations of $\|\nabla f\|_2$ from 1 at points sampled along interpolants between real and generated data, giving smoother training (see the sketch below).
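A minimal sketch of that penalty term (PyTorch assumed; the two-sided penalty and the interpolation follow the WGAN-GP recipe, while `critic`, the flat `(batch, features)` input shape, and the weight `lam=10.0` are illustrative assumptions):

```python
import torch

def gradient_penalty(critic, x_real, x_fake, lam=10.0):
    """Return lam * E[(||grad_x f(x_t)||_2 - 1)^2] on random interpolants x_t."""
    batch_size = x_real.size(0)
    # Random interpolation point per sample: x_t = (1 - t) * x_real + t * x_fake.
    # For image tensors, reshape t to (batch, 1, 1, 1) so it broadcasts correctly.
    t = torch.rand(batch_size, 1, device=x_real.device)
    x_t = ((1.0 - t) * x_real + t * x_fake).requires_grad_(True)
    scores = critic(x_t)
    # Gradient of the critic's output with respect to the interpolated inputs.
    grads, = torch.autograd.grad(outputs=scores.sum(), inputs=x_t, create_graph=True)
    grad_norm = grads.reshape(batch_size, -1).norm(2, dim=1)
    return lam * ((grad_norm - 1.0) ** 2).mean()
```

The critic is then trained to minimize `critic(x_fake).mean() - critic(x_real).mean() + gradient_penalty(critic, x_real, x_fake)`, with no weight clipping.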
Optimal critic geometry
Proposition
Let $\mathbb{P}_r$ and $\mathbb{P}_g$ be two distributions on $\mathcal{X}$, a compact metric space. Then there is a $1$-Lipschitz function $f^*$ which is the optimal solution of
$$ \max_{\|f \|_L \leq 1} \mathbb{E}_{y \sim \mathbb{P}_r} [f(y)] -\mathbb{E}_{x \sim \mathbb{P}_g} [f(x)]. $$
Let $\pi$ be the optimal coupling between $ \mathbb{P}_r$ and $\mathbb{P}_g$, defined as minimizer of
$$ W(\mathbb{P}_r,\mathbb{P}_g) = \inf_{\pi \in \Pi(\mathbb{P}_r,\mathbb{P}_g) } \mathbb{E}_{(x,y) \sim \pi}[\, \| x-y \| \,], $$
where $\Pi(\mathbb{P}_r,\mathbb{P}_g)$ is the set of joint distributions $\pi(x,y)$ whose marginals are $\mathbb{P}_g$ (for $x$) and $\mathbb{P}_r$ (for $y$). Then if $f^*$ is differentiable, $\pi(x=y)=0$, and $x_t= (1-t)x+ty$ with $0 \leq t \leq 1$, it holds that
$$ \mathbb{P}_{(x,y) \sim \pi} \left[ \nabla f^*(x_t)=\frac{y-x_t}{\| y-x_t \| } \right] = 1. $$
Proof
Since $\mathcal{X}$ is a compact space, an optimal $f^*$ exists. Moreover, if $\pi$ is an optimal coupling, then $\mathbb{E}_{(x,y)\sim\pi}[f^*(y)-f^*(x)] = W(\mathbb{P}_r,\mathbb{P}_g) = \mathbb{E}_{(x,y)\sim\pi}[\|y-x\|]$, while $f^*(y)-f^*(x) \leq \|y-x\|$ pointwise because $f^*$ is $1$-Lipschitz; the two can only agree in expectation if they agree almost surely, i.e.,
$$ \mathbb{P}_{(x,y) \sim \pi} [ \, f^*(y)-f^*(x)= \|y-x \| \, ] = 1. $$
Let $(x,y)$ be such that $f^*(y)-f^*(x) = \|y-x \|$. We can also assume $x \neq y$, since $\pi(x=y)=0$ by assumption. Let $\psi(t)=f^*(x_t)-f^*(x)$. We claim that $$\psi(t)=\| x_t-x \| = t \| y-x \|.$$
Let $t,t' \in [0,1]$; then
$$ |\psi(t)-\psi(t')| = | f^*(x_t)-f^*(x_{t'}) | \leq \|x_t-x_{t'} \| = |t-t'| \, \|x-y \|. $$
Therefore, $\psi$ is $\|x-y\|$-Lipschitz.
In particular, $\psi(1)-\psi(t) \leq (1-t)\|x-y\|$ and $\psi(t)-\psi(0) \leq t\|x-y\|$, so
$$ \psi(1)-\psi(0) \leq (1-t) \| x-y \|+ \psi(t)-\psi(0) \leq (1-t) \|x-y \| + t \|x-y \| = \| x-y \|. $$
However, $ \psi(1)-\psi(0)=|f^*(y)-f^*(x)|=\| y-x \|$ so the inequalities have to actually be equalities. In particular, $\psi(t)-\psi(0)=t \|x-y \|$ and $\psi(0)=0$. Therefore, $\psi(t)=t \|x-y \|$.
Let
$$ v= \frac{y-x_t}{\|y-x_t \|}= \frac{(1-t)(y-x)}{|1-t| \|y-x \|}= \frac{y-x}{\|y-x \|}. $$
Now we know that $f^*(x_t)-f^*(x)=\psi(t)=t \|x-y \|$, hence, $$f^*(x_t)=f^*(x)+t \|x-y \|.$$ Since $x_t + hv = x_{\,t+h/\|x-y\|}$ for small $h$, we get $f^*(x_t+hv)-f^*(x_t)=h$, and so
$$ \frac{\partial }{\partial v} f^*(x_t)= \lim_{h \rightarrow 0} \frac{f^*(x_t+hv)-f^*(x_t)}{h}=1. $$
If $f^*$ is differentiable at $x_t$, then $\| \nabla f^*(x_t) \| \leq 1$, since $f^*$ is a 1-Lipschitz continuous function, and the directional derivative above gives $\langle v, \nabla f^*(x_t)\rangle = 1$. Decomposing $\nabla f^*(x_t)$ into its components along and orthogonal to the unit vector $v$ (Pythagoras):
$$ 1 \leq 1+ \| \nabla f^*(x_t)-v \|^2 = \langle v, \nabla f^*(x_t)\rangle^2+ \| \nabla f^*(x_t)-\langle v, \nabla f^*(x_t)\rangle v \|^2 = \| \nabla f^*(x_t) \|^2 \leq 1. $$
The equality chain forces $\nabla f^*(x_t)=v=\dfrac{y-x_t}{\| y-x_t \|}.$
This shows that $ \mathbb{P}_{(x,y)\sim \pi} \left[ \nabla f^*(x_t)=\frac{y-x_t}{\| y-x_t \|} \right] = 1.$
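For intuition (an illustrative special case, not part of the proof): let $\mathbb{P}_r=\delta_{y_0}$ and $\mathbb{P}_g=\delta_{x_0}$ be point masses with $x_0 \neq y_0$. The only coupling puts all mass on $(x_0,y_0)$, and $f^*(z)=-\|z-y_0\|$ is an optimal $1$-Lipschitz critic, since $f^*(y_0)-f^*(x_0)=\|y_0-x_0\|=W(\mathbb{P}_r,\mathbb{P}_g)$. At any interpolant $x_t=(1-t)x_0+ty_0$ with $t<1$,
$$ \nabla f^*(x_t) = \frac{y_0-x_t}{\|y_0-x_t\|}, $$
a unit vector pointing from the interpolant toward the real sample, exactly as the proposition predicts. This is the geometric fact the gradient penalty exploits by pushing $\|\nabla f\|_2$ toward $1$ at points sampled along such interpolants.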
Exercises
- If $v$ is a unit vector, show that for any vector $x$, the vectors $v$ and $(I - vv^T)x$ are orthogonal.
- If $v$ is a unit vector, show that for any vector $x$ the following holds: $$ \| x \|^2 = (x^T v)^2 + \| x - (x^T v)v \|^2. $$
WGAN Algorithm
The Wasserstein GAN algorithm proceeds as follows:
- The algorithm initializes with default values for the learning rate $(\alpha)$, clipping parameter $(c)$, batch size $(m)$, and the number of critic iterations per generator iteration $(n_{\text{critic}})$. Initial parameters for the critic $(w_0)$ and the generator $(\theta_0)$ are also set.
- The training loop continues until the generator's parameters converge.
- Within the training loop, the following steps occur for a specified number of critic iterations $(n_{\text{critic}})$:
- A batch of real data $\{x^{(i)}\}_{i=1}^m$ is sampled from the real data distribution.
- A batch of prior samples $\{z^{(i)}\}_{i=1}^m$ is drawn from the latent space's prior distribution.
- The critic's stochastic gradient is computed based on the difference in evaluations between the real data and the generated data from the generator.
- The critic's parameters are updated using the RMSProp optimizer with the computed gradient and the learning rate.
- The critic's weights are then clipped to fall within the range defined by the clipping parameter to maintain the Lipschitz constraint.
- After completing the critic updates, a new batch of prior samples is drawn.
- The generator's stochastic gradient is then computed based on the critic's evaluations of the generated data.
- The generator's parameters are updated using the RMSProp optimizer with the negative of the computed gradient, which amounts to gradient ascent on the critic's score of the generated samples.
- These steps are repeated until the generator's parameters have sufficiently converged, indicating the end of training.
Algorithm (original defaults)
Require: α learning rate, c clipping parameter, m batch size, n_critic critic iterations per generator iteration
Require: w₀ critic params, θ₀ generator params
while θ not converged:
    for t = 0 … n_critic:
        sample {x^(i)}₁ᵐ ~ P_r
        sample {z^(i)}₁ᵐ ~ p(z)
        g_w ← ∇_w [ 1/m Σ f_w(x^(i)) - 1/m Σ f_w(g_θ(z^(i))) ]
        w ← w + α · RMSProp(w, g_w)
        w ← clip(w, -c, c)
    sample {z^(i)}₁ᵐ ~ p(z)
    g_θ ← -∇_θ 1/m Σ f_w(g_θ(z^(i)))
    θ ← θ - α · RMSProp(θ, g_θ)
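A minimal runnable transcription of this loop on toy 2-D data (PyTorch assumed; the data, network sizes, and iteration count are placeholder assumptions, while the learning rate $\alpha=5\cdot10^{-5}$, clipping value $c=0.01$, batch size $m=64$, and $n_{\text{critic}}=5$ follow the original defaults):

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 8, 2
alpha, c, m, n_critic = 5e-5, 0.01, 64, 5   # original WGAN defaults

generator = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
critic = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1))
opt_g = torch.optim.RMSprop(generator.parameters(), lr=alpha)
opt_c = torch.optim.RMSprop(critic.parameters(), lr=alpha)

def sample_real(m):
    # Stand-in for the real data distribution P_r: a Gaussian centered at (3, 3).
    return torch.randn(m, data_dim) + 3.0

for step in range(2000):                     # "while θ has not converged"
    for _ in range(n_critic):
        x_real = sample_real(m)
        z = torch.randn(m, latent_dim)
        x_fake = generator(z).detach()
        # Critic ascends E[f_w(x)] - E[f_w(g_θ(z))]; we minimize the negation.
        loss_c = -(critic(x_real).mean() - critic(x_fake).mean())
        opt_c.zero_grad()
        loss_c.backward()
        opt_c.step()
        with torch.no_grad():
            for p in critic.parameters():
                p.clamp_(-c, c)              # weight clipping keeps f_w in a bounded family
    # Generator ascends E[f_w(g_θ(z))]; again minimize the negation.
    z = torch.randn(m, latent_dim)
    loss_g = -critic(generator(z)).mean()
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
```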