Divergence Measures
How far is one distribution from another? Meet KL, JS, Wasserstein, Total Variation, and the $f$-divergence family.
"My favorite distance? The one that tells me when my generator is daydreaming vs delivering."
Why divergences matter
Divergence measures quantify the difference between two probability distributions. They steer generative models, stabilize GANs, evaluate synthetic data, and keep us honest about how close $P$ (real) and $Q$ (model) really are.
Key players
KL, JS, Wasserstein (Earth Mover), Total Variation, and the $f$-divergence family (which bundles many of the above).
Use cases
GAN training stability, likelihood-based models, transport-based metrics, evaluation of generated samples, and safety analysis.
Kullback-Leibler (KL) Divergence
Measures how one distribution $Q$ misses mass where $P$ puts it. Asymmetric but information-rich—also known as relative entropy.
Discrete
$D_{\text{KL}}(P \Vert Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$
Non-negative; asymmetric: $D_{\text{KL}}(P \Vert Q) \neq D_{\text{KL}}(Q \Vert P)$.
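A quick numerical check of the definition and the asymmetry (a minimal NumPy sketch; `kl_divergence` is just an illustrative name):

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete KL divergence D_KL(P || Q) in nats.

    Terms with p[i] == 0 contribute 0; if q[i] == 0 where p[i] > 0,
    the result is +inf.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    if np.any(q[mask] == 0):
        return np.inf
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([1.0, 0.0])
q = np.array([0.5, 0.5])
print(kl_divergence(p, q))  # log 2 ≈ 0.693
print(kl_divergence(q, p))  # inf: Q puts mass where P has none
```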
Continuous
$D_{\text{KL}}(P \Vert Q) = \int_x P(x) \log \frac{P(x)}{Q(x)} \; dx$
Same vibe, integrals instead of sums.
KL is an $f$-divergence
Choose $f(u) = -\log u$ in $D_f(P, Q) = \sum_x P(x) f\big(\tfrac{Q(x)}{P(x)}\big)$ to recover $D_{\text{KL}}(P \Vert Q)$; choosing $f(u) = u \log u$ under the same convention gives the reverse direction, $D_{\text{KL}}(Q \Vert P)$.
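Spelled out, using that convention (the same one as in the $f$-divergence section below):
$D_f(P, Q) = \sum_x P(x)\,\big(-\log \tfrac{Q(x)}{P(x)}\big) = \sum_x P(x) \log \tfrac{P(x)}{Q(x)} = D_{\text{KL}}(P \Vert Q)$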
Not a metric
Symmetry fails: on $\{a,b\}$ with $P=(1,0)$ and $Q=(0.5,0.5)$, $D_{\text{KL}}(P||Q)=\log 2$ while $D_{\text{KL}}(Q||P)=\infty$. The triangle inequality also fails. Keep this in mind when calling it a “distance.”
Jensen-Shannon Divergence
A smoothed, symmetric cousin of KL: $D_{\text{JS}}(P||Q) = \tfrac{1}{2} D_{\text{KL}}(P||M) + \tfrac{1}{2} D_{\text{KL}}(Q||M)$ where $M=\tfrac{1}{2}(P+Q)$. Always finite, and bounded between 0 and $\log 2$ (between 0 and 1 if you use log base 2).
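Reusing the `kl_divergence` helper from the KL snippet above, JS is two lines (an illustrative sketch, reporting nats):

```python
def js_divergence(p, q):
    """Jensen-Shannon divergence in nats (bounded by log 2)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)  # the mixture M = (P + Q) / 2
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Finite even when the supports barely overlap:
print(js_divergence([1.0, 0.0], [0.5, 0.5]))  # ≈ 0.216 nats
```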
Wasserstein Distance (Earth Mover)
Minimum “work” to morph $P$ into $Q$. In one dimension, the order-$p$ distance has a closed form via quantile functions: $W_p(P,Q)=\big( \int_0^1 |F_P^{-1}(u)-F_Q^{-1}(u)|^p \, du \big)^{1/p}$; in general it is the optimal-transport cost over all couplings of $P$ and $Q$. Feels geometric and stabilizes GAN training (hello WGANs).
Earth mover story
Imagine piles of “earth” shaped like $P$ and $Q$. Cost = amount moved × distance moved. EMD is the minimal cost plan (optimal transport).
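In one dimension this is cheap to compute exactly; SciPy's `scipy.stats.wasserstein_distance` gives $W_1$ between empirical samples. A small sketch with assumed toy data:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
a = rng.normal(loc=0.0, scale=1.0, size=5_000)  # samples from P
b = rng.normal(loc=3.0, scale=1.0, size=5_000)  # samples from Q (mean shifted by 3)

# W_1 between two unit Gaussians differing only by a mean shift equals the shift.
print(wasserstein_distance(a, b))  # ≈ 3.0, up to sampling noise
```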
Weak convergence
If $W_p(\mu_n,\mu) \to 0$ then $\mu_n$ converges weakly to $\mu$ (expectations of bounded continuous $f$ converge). Kantorovich–Rubinstein duality ties $W_1$ to Lipschitz test functions.
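For reference, the duality says $W_1$ is the largest gap in expectations that any 1-Lipschitz test function $f$ can expose:
$W_1(P,Q) = \sup_{\lVert f \rVert_L \le 1} \; \mathbb{E}_{x\sim P}[f(x)] - \mathbb{E}_{y\sim Q}[f(y)]$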
Total Variation (TV) Divergence
Maximum difference in probability assigned to any event, $\sup_A |P(A)-Q(A)|$, which for discrete distributions equals $D_{\text{TV}}(P,Q)=\tfrac{1}{2}\sum_x |P(x)-Q(x)|$. Symmetric, bounded in $[0,1]$, and actually a metric.
Compute it (discrete demo)
For $P=[0.1,0.3,0.4,0.2], Q=[0.2,0.2,0.4,0.2]$: $\tfrac{1}{2}(0.1+0.1+0+0)=0.1$.
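The same computation in NumPy (a one-line sketch):

```python
import numpy as np

p = np.array([0.1, 0.3, 0.4, 0.2])
q = np.array([0.2, 0.2, 0.4, 0.2])

tv = 0.5 * np.abs(p - q).sum()  # half the L1 distance between the mass vectors
print(tv)                       # 0.1
```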
Image-processing vibe
Used as a regularizer (this is the total-variation seminorm of an image, a namesake relative of the divergence): minimize $\alpha\,TV(I) + \frac{1}{2}\lVert I-I_{\text{noisy}}\rVert^2$ to denoise. TV sums gradient magnitudes, so it smooths away noise in flat regions while preserving edges.
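A minimal sketch of that objective, assuming a grayscale image stored as a 2-D array and using the simple anisotropic TV (names and the $\alpha$ value are illustrative):

```python
import numpy as np

def image_total_variation(img):
    """Anisotropic TV: sum of absolute differences between adjacent pixels."""
    dv = np.abs(np.diff(img, axis=0)).sum()  # vertical gradients
    dh = np.abs(np.diff(img, axis=1)).sum()  # horizontal gradients
    return dv + dh

def denoising_objective(img, noisy, alpha=0.1):
    """alpha * TV(I) + 0.5 * ||I - I_noisy||^2 (the quantity being minimized)."""
    return alpha * image_total_variation(img) + 0.5 * np.sum((img - noisy) ** 2)
```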
$f$-Divergence family
For convex $f$ with $f(1)=0$, $D_f(P,Q)=\sum_x P(x)\,f\big(\tfrac{Q(x)}{P(x)}\big)$. KL, reverse KL, JS, total variation, and Pearson $\chi^2$ all live here for suitable choices of $f$.
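A tiny generic helper makes the family concrete (a sketch that assumes strictly positive probabilities, so no zero-mass conventions are needed; names are illustrative):

```python
import numpy as np

def f_divergence(p, q, f):
    """D_f(P, Q) = sum_x P(x) * f(Q(x) / P(x)), for strictly positive p, q."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sum(p * f(q / p))

p = np.array([0.7, 0.3])
q = np.array([0.4, 0.6])

kl     = f_divergence(p, q, lambda u: -np.log(u))            # D_KL(P || Q)
rev_kl = f_divergence(p, q, lambda u: u * np.log(u))         # D_KL(Q || P)
tv     = f_divergence(p, q, lambda u: 0.5 * np.abs(u - 1))   # total variation
print(kl, rev_kl, tv)
```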
Not a metric
Symmetry counterexample: on $\{a,b\}$ with $P=(0.9,0.1)$, $Q=(0.5,0.5)$ and the Pearson $\chi^2$ choice $f(u)=(u-1)^2$, $D_f(P,Q)\approx 1.78$ while $D_f(Q,P)=0.64$. The triangle inequality fails too: with the same $f$ and $P=(0.5,0.5)$, $Q=(0.6,0.4)$, $R=(0.7,0.3)$, $D_f(P,R)=0.16$ exceeds $D_f(P,Q)+D_f(Q,R)\approx 0.08$.
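Plugging those distributions into the `f_divergence` helper above confirms the numbers (same positive-mass assumption):

```python
chi2 = lambda u: (u - 1) ** 2  # Pearson chi-square generator

# Symmetry failure
print(f_divergence([0.9, 0.1], [0.5, 0.5], chi2))  # ≈ 1.778
print(f_divergence([0.5, 0.5], [0.9, 0.1], chi2))  # 0.64

# Triangle inequality failure: D(P, R) > D(P, Q) + D(Q, R)
P, Q, R = [0.5, 0.5], [0.6, 0.4], [0.7, 0.3]
print(f_divergence(P, R, chi2))                             # 0.16
print(f_divergence(P, Q, chi2) + f_divergence(Q, R, chi2))  # ≈ 0.082
```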
Metric comparison
Wasserstein is a metric (non-negativity, symmetry, triangle inequality). KL and most $f$-divergences are not (total variation is the notable exception); use them for likelihood thinking, not for triangle-based geometry.
Wasserstein vs KL: pros & cons
Wasserstein perks
- Accounts for geometry; robust to small perturbations.
- True metric; great for GAN stability and avoiding mode collapse.
- Convergence in $W_p$ implies weak convergence.
Cons: solving OT can be expensive, especially in high dims.
KL vibes
- Sensitive to shape; easy to compute for many models.
- Information-theoretic interpretation (entropy, mutual info).
- Great for explicit density training.
Cons: not symmetric, blows up to $\infty$ when $Q(x)=0$ while $P(x)>0$, and ignores the geometry of the sample space (see the snippet below contrasting the two on disjoint supports).
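The classic contrast in a few lines: two point masses that never overlap. KL is infinite no matter how close they sit, while $W_1$ reports exactly how far apart they are (a sketch reusing the `kl_divergence` helper from above):

```python
import numpy as np
from scipy.stats import wasserstein_distance

theta = 0.01                # small offset between the two point masses

# Discrete view on the joint support {0, theta}:
p = np.array([1.0, 0.0])    # P: all mass at 0
q = np.array([0.0, 1.0])    # Q: all mass at theta

# KL(P || Q) is +inf however small theta is: no useful training signal.
print(kl_divergence(p, q))                   # inf

# W_1 sees the geometry: the distance is exactly theta and shrinks smoothly.
print(wasserstein_distance([0.0], [theta]))  # 0.01
```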
Relevance for Generative Modelling
GANs minimize divergence between $P_r$ (real) and $P_g$ (generated). Classic GANs use JS/KL-style signals; Wasserstein GANs swap in $W$ to cure mode collapse and vanishing gradients. Weak convergence of Wasserstein ensures generated samples move meaningfully toward reality.
Objective
$\min_G \max_{D\,:\,\lVert D \rVert_L \le 1} \; \mathbb{E}_{x\sim P_r}[D(x)] - \mathbb{E}_{z\sim P_z}[D(G(z))]$ (WGAN flavor), where the critic $D$ must be 1-Lipschitz (enforced in practice by weight clipping or a gradient penalty).
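A minimal PyTorch sketch of how that objective becomes training losses (weight-clipping variant; the architecture and hyperparameters are placeholder choices, not a reference implementation):

```python
import torch
import torch.nn as nn

latent_dim, data_dim, clip_value = 64, 2, 0.01  # illustrative sizes

G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU(), nn.Linear(128, 1))  # critic: no sigmoid

opt_g = torch.optim.RMSprop(G.parameters(), lr=5e-5)
opt_d = torch.optim.RMSprop(D.parameters(), lr=5e-5)

def critic_step(real):
    z = torch.randn(real.size(0), latent_dim)
    fake = G(z).detach()
    # Maximize E[D(real)] - E[D(fake)]  <=>  minimize its negation.
    loss_d = -(D(real).mean() - D(fake).mean())
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # Crude 1-Lipschitz enforcement: clip the critic's weights.
    for p in D.parameters():
        p.data.clamp_(-clip_value, clip_value)
    return loss_d.item()

def generator_step(batch_size):
    z = torch.randn(batch_size, latent_dim)
    # The generator wants the critic to score its samples highly.
    loss_g = -D(G(z)).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_g.item()
```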
Why care
Better gradients, fewer collapse issues, meaningful convergence, evaluation that aligns with human perception.
🧠 Quick Quiz (earn +10 vibes)
Which divergence is a true metric and accounts for geometry?
True / False speed round
Mini Lab: pick a divergence
Click a chip to see how you might use that divergence in a project (GANs, denoising, evaluation, or theory).
Keep going
Ready to put divergences to work? Next up: explicit vs implicit training with these measures.