Divergence Measures

How far is one distribution from another? Meet KL, JS, Wasserstein, Total Variation, and the $f$-divergence family.

"My favorite distance? The one that tells me when my generator is daydreaming vs delivering."

📏

Why divergences matter

Divergence measures quantify the difference between two probability distributions. They steer generative models, stabilize GANs, evaluate synthetic data, and keep us honest about how close $P$ (real) and $Q$ (model) really are.

Key players

KL, JS, Wasserstein (Earth Mover), Total Variation, and the $f$-divergence family (which bundles many of the above).

Use cases

GAN training stability, likelihood-based models, transport-based metrics, evaluation of generated samples, and safety analysis.

🧮

Kullback-Leibler (KL) Divergence

Measures how badly $Q$ misses mass where $P$ puts it: the expected extra code length when you model data from $P$ using $Q$. Asymmetric but information-rich, also known as relative entropy.

Discrete

$D_{\text{KL}}(P \Vert Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$

Non-negative; in general $D_{\text{KL}}(P||Q) \neq D_{\text{KL}}(Q||P)$
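A quick numerical check of the discrete formula, assuming numpy and scipy are on hand (scipy.special.rel_entr computes the elementwise terms $P(x)\log\tfrac{P(x)}{Q(x)}$):

```python
import numpy as np
from scipy.special import rel_entr  # elementwise p * log(p / q)

P = np.array([0.1, 0.3, 0.4, 0.2])
Q = np.array([0.2, 0.2, 0.4, 0.2])

# D_KL(P || Q) = sum_x P(x) log(P(x) / Q(x))
kl_pq = rel_entr(P, Q).sum()
kl_qp = rel_entr(Q, P).sum()

print(f"KL(P||Q) = {kl_pq:.4f}")  # ~0.052 nats
print(f"KL(Q||P) = {kl_qp:.4f}")  # ~0.058 nats: the asymmetry in action
```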

Continuous

$D_{\text{KL}}(P \Vert Q) = \int_x P(x) \log \frac{P(x)}{Q(x)} \; dx$

Same vibe, integrals instead of sums.
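You rarely do the integral by hand. The univariate Gaussian case has a closed form, $D_{\text{KL}}\big(\mathcal{N}(\mu_1,\sigma_1^2)\,\Vert\,\mathcal{N}(\mu_2,\sigma_2^2)\big)=\log\tfrac{\sigma_2}{\sigma_1}+\tfrac{\sigma_1^2+(\mu_1-\mu_2)^2}{2\sigma_2^2}-\tfrac12$, and a Monte Carlo average of $\log p(x)-\log q(x)$ under $P$ recovers it. A sketch assuming numpy:

```python
import numpy as np

rng = np.random.default_rng(0)
mu1, sigma1 = 0.0, 1.0   # P = N(mu1, sigma1^2)
mu2, sigma2 = 1.0, 2.0   # Q = N(mu2, sigma2^2)

# Closed form: KL(P||Q) = log(s2/s1) + (s1^2 + (mu1 - mu2)^2) / (2 s2^2) - 1/2
kl_exact = np.log(sigma2 / sigma1) + (sigma1**2 + (mu1 - mu2)**2) / (2 * sigma2**2) - 0.5

# Monte Carlo estimate: E_{x~P}[log p(x) - log q(x)]
x = rng.normal(mu1, sigma1, size=1_000_000)
log_p = -0.5 * ((x - mu1) / sigma1) ** 2 - np.log(sigma1) - 0.5 * np.log(2 * np.pi)
log_q = -0.5 * ((x - mu2) / sigma2) ** 2 - np.log(sigma2) - 0.5 * np.log(2 * np.pi)
kl_mc = np.mean(log_p - log_q)

print(f"closed form: {kl_exact:.4f}, Monte Carlo: {kl_mc:.4f}")
```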

KL is an $f$-divergence

Choose $f(u) = u \log u$ in $D_f(P \Vert Q) = \sum_x Q(x)\, f\big(\tfrac{P(x)}{Q(x)}\big)$ to recover KL: the $Q(x)$ factors cancel, leaving exactly $\sum_x P(x) \log \tfrac{P(x)}{Q(x)}$.

Not a metric

Symmetry fails: with $P(a)=1, P(b)=0$ and $Q(a)=Q(b)=0.5$, we get $D_{\text{KL}}(P||Q)=\log 2$ while $D_{\text{KL}}(Q||P)=\infty$. The triangle inequality also fails. Keep this in mind when calling it a “distance.”

🪩

Jensen-Shannon Divergence

A smoothed, symmetric cousin of KL: $D_{\text{JS}}(P||Q) = \tfrac{1}{2} D_{\text{KL}}(P||M) + \tfrac{1}{2} D_{\text{KL}}(Q||M)$ where $M=\tfrac{1}{2}(P+Q)$. Bounded between 0 and $\log 2$ (0 and 1 if you use base-2 logs), and its square root is a true metric.
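A minimal sketch (assuming numpy/scipy) that builds JS out of two KL terms against the mixture $M$; with base-2 logs it lands in $[0,1]$ even when the supports don't overlap. (Heads-up: scipy.spatial.distance.jensenshannon returns the square root of this quantity.)

```python
import numpy as np
from scipy.special import rel_entr

def js_divergence(p, q, base=2.0):
    """JS(P||Q) = 0.5 * KL(P||M) + 0.5 * KL(Q||M), with M = (P + Q) / 2."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    js_nats = 0.5 * rel_entr(p, m).sum() + 0.5 * rel_entr(q, m).sum()
    return js_nats / np.log(base)  # convert nats to the chosen log base

P = [1.0, 0.0]   # disjoint supports: KL would be infinite,
Q = [0.0, 1.0]   # but JS stays finite (and maximal)
print(js_divergence(P, Q))  # 1.0 in bits
```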

🌊

Wasserstein Distance (Earth Mover)

Minimum “work” to morph $P$ into $Q$. For order $p$ in one dimension, $W_p(P,Q)=\big( \int_0^1 |F_P^{-1}(u)-F_Q^{-1}(u)|^p \, du \big)^{1/p}$, where $F^{-1}$ is the quantile function. Feels geometric and stabilizes GAN training (hello WGANs).
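In one dimension the quantile formula is directly computable. A sketch assuming numpy and scipy, comparing a quantile-average estimate of $W_1$ with scipy.stats.wasserstein_distance:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=5000)   # samples ~ P
y = rng.normal(2.0, 1.0, size=5000)   # samples ~ Q

# W_1 via the quantile formula: average |F_P^{-1}(u) - F_Q^{-1}(u)| over u in (0, 1)
u = np.linspace(0.001, 0.999, 999)
w1_quantile = np.mean(np.abs(np.quantile(x, u) - np.quantile(y, u)))

# Reference: scipy's 1-D earth mover distance
w1_scipy = wasserstein_distance(x, y)

print(f"quantile estimate: {w1_quantile:.3f}, scipy: {w1_scipy:.3f}")  # both ~2.0
```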

Earth mover story

Imagine piles of “earth” shaped like $P$ and $Q$. Cost = amount moved × distance moved. EMD is the minimal cost plan (optimal transport).

Weak convergence

If $W_p(\mu_n,\mu) \to 0$ then $\mu_n$ converges weakly to $\mu$ (expectations of bounded continuous $f$ converge). Kantorovich–Rubinstein duality ties $W_1$ to Lipschitz test functions: $W_1(P,Q)=\sup_{\lVert f\rVert_L \le 1} \mathbb{E}_{x\sim P}[f(x)] - \mathbb{E}_{x\sim Q}[f(x)]$.
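This is the property WGANs lean on: for (nearly) disjoint supports, KL saturates at infinity while $W_1$ still shrinks smoothly as the distributions approach each other. A toy sketch with point masses, assuming numpy/scipy:

```python
import numpy as np
from scipy.special import rel_entr
from scipy.stats import wasserstein_distance

# P is a point mass at 0, Q_theta a point mass at theta: disjoint supports for theta != 0.
p_on_atoms = np.array([1.0, 0.0])   # P's mass on the two atoms {0, theta}
q_on_atoms = np.array([0.0, 1.0])   # Q_theta's mass on the same atoms

for theta in [4.0, 2.0, 1.0, 0.5]:
    kl = rel_entr(p_on_atoms, q_on_atoms).sum()   # infinite for every theta
    w1 = wasserstein_distance([0.0], [theta])     # equals |theta|, shrinks to 0
    print(f"theta={theta}: KL={kl}, W1={w1}")
```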

📡

Total Variation (TV) Divergence

Maximum difference in probability mass across events: $D_{\text{TV}}(P,Q)=\tfrac{1}{2}\sum_x |P(x)-Q(x)|$. Symmetric, bounded in $[0,1]$.

Compute it (discrete demo)

For $P=[0.1,0.3,0.4,0.2], Q=[0.2,0.2,0.4,0.2]$: $\tfrac{1}{2}(0.1+0.1+0+0)=0.1$.
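The same demo as a sketch in numpy:

```python
import numpy as np

def tv_divergence(p, q):
    """D_TV(P, Q) = 0.5 * sum_x |P(x) - Q(x)|, always in [0, 1]."""
    return 0.5 * np.abs(np.asarray(p, float) - np.asarray(q, float)).sum()

P = [0.1, 0.3, 0.4, 0.2]
Q = [0.2, 0.2, 0.4, 0.2]
print(tv_divergence(P, Q))  # 0.1, matching the hand calculation
```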

Image-processing vibe

Used as a regularizer: minimize $\alpha\,TV(I) + \frac{1}{2}\lVert I-I_{\text{noisy}}\rVert^2$ to denoise. TV sums gradient magnitudes, so it penalizes noisy oscillations far more than a few sharp edges, smoothing out noise while preserving edges.
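A minimal sketch (assuming numpy) of the anisotropic TV term itself; a full denoiser would minimize it jointly with the data-fidelity term above:

```python
import numpy as np

def total_variation(img):
    """Anisotropic TV: sum of absolute horizontal and vertical pixel differences."""
    dx = np.abs(np.diff(img, axis=1)).sum()
    dy = np.abs(np.diff(img, axis=0)).sum()
    return dx + dy

flat = np.full((8, 8), 0.5)                   # constant image: TV = 0
edge = np.zeros((8, 8)); edge[:, 4:] = 1.0    # one full-contrast edge: TV = 8
noisy = flat + 0.1 * np.random.default_rng(2).standard_normal((8, 8))

print(total_variation(flat), total_variation(edge), total_variation(noisy))
# low-amplitude noise already racks up more TV than one clean, full-contrast edge
```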

🧩

$f$-Divergence family

For convex $f$ with $f(1)=0$, $D_f(P\Vert Q)=\sum_x Q(x)\,f\big(\tfrac{P(x)}{Q(x)}\big)$. KL ($f(u)=u\log u$), JS, $\chi^2$ ($f(u)=(u-1)^2$), and Total Variation ($f(u)=\tfrac12|u-1|$) all live here.
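A small sketch (assuming numpy and that $Q(x)>0$ everywhere) of the family with a pluggable $f$, showing KL, $\chi^2$, and TV falling out of the same formula:

```python
import numpy as np

def f_divergence(p, q, f):
    """D_f(P||Q) = sum_x Q(x) * f(P(x)/Q(x)) for convex f with f(1) = 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    r = p / q                      # assumes Q(x) > 0 everywhere
    return np.sum(q * f(r))

P = [0.1, 0.3, 0.4, 0.2]
Q = [0.2, 0.2, 0.4, 0.2]

kl   = f_divergence(P, Q, lambda u: u * np.log(u))        # matches the KL demo above
chi2 = f_divergence(P, Q, lambda u: (u - 1) ** 2)         # chi-squared divergence
tv   = f_divergence(P, Q, lambda u: 0.5 * np.abs(u - 1))  # matches the TV demo above
print(kl, chi2, tv)
```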

Not a metric

Symmetry counterexample: with $f(u)=(u-1)^2$ (the $\chi^2$ divergence), $P=(\tfrac12,\tfrac12)$ and $Q=(\tfrac34,\tfrac14)$ give $D_f(P\Vert Q)=\tfrac13$ but $D_f(Q\Vert P)=\tfrac14$. The triangle inequality also fails: for $P=(0.5,0.5)$, $Q=(0.6,0.4)$, $R=(0.7,0.3)$ on $\{a,b\}$, $D_f(P\Vert R)\approx 0.19$ exceeds $D_f(P\Vert Q)+D_f(Q\Vert R)\approx 0.09$.
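The same counterexamples checked numerically, in a self-contained sketch using the $\chi^2$ form $\sum_x (P(x)-Q(x))^2/Q(x)$:

```python
import numpy as np

def chi2(p, q):
    """chi-squared f-divergence: sum_x (P(x) - Q(x))^2 / Q(x), i.e. f(u) = (u - 1)^2."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum((p - q) ** 2 / q)

# Symmetry fails
P, Q = [0.5, 0.5], [0.75, 0.25]
print(chi2(P, Q), chi2(Q, P))                 # 0.3333 vs 0.25

# Triangle inequality fails: D(P,R) > D(P,Q) + D(Q,R)
P, Q, R = [0.5, 0.5], [0.6, 0.4], [0.7, 0.3]
print(chi2(P, R), chi2(P, Q) + chi2(Q, R))    # 0.1905 > 0.0893
```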

Metric comparison

Wasserstein is a metric (non-negativity, symmetry, triangle inequality). KL/$f$-divergences are not; use them for likelihood thinking, not for triangle-based geometry.

⚖️

Wasserstein vs KL: pros & cons

Wasserstein perks

  • Accounts for geometry; robust to small perturbations.
  • True metric; great for GAN stability and avoiding mode collapse.
  • Convergence in $W_p$ implies weak convergence.

Cons: solving OT can be expensive, especially in high dims.

KL vibes

  • Sensitive to shape; easy to compute for many models.
  • Information-theoretic interpretation (entropy, mutual info).
  • Great for explicit density training.

Cons: not symmetric, blows up to infinity when $Q(x)=0$ while $P(x)>0$, and ignores the underlying geometry.

🎯

Relevance for Generative Modelling

GANs minimize divergence between $P_r$ (real) and $P_g$ (generated). Classic GANs use JS/KL-style signals; Wasserstein GANs swap in $W$ to cure mode collapse and vanishing gradients. Because convergence in Wasserstein implies weak convergence, a falling $W$ means the generated samples are genuinely moving toward the real distribution.

Objective

$\min_G \max_{D \in \mathcal{D}} \; \mathbb{E}_{x\sim P_r}[D(x)] - \mathbb{E}_{z\sim P_z}[D(G(z))]$ (WGAN flavor), where $\mathcal{D}$ is the set of 1-Lipschitz critics (enforced in practice via weight clipping or a gradient penalty), following the Kantorovich–Rubinstein dual of $W_1$.
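A hedged PyTorch-style sketch of the critic-side loss, using a gradient penalty (the WGAN-GP variant) to approximate the 1-Lipschitz constraint; `critic`, `real`, and `fake` are placeholder names, not from the original text:

```python
import torch

def critic_loss_wgan_gp(critic, real, fake, gp_weight=10.0):
    """WGAN critic loss: maximize E[D(real)] - E[D(fake)],
    plus a penalty pushing ||grad D|| toward 1 (approximate 1-Lipschitz)."""
    # Wasserstein part (negated because optimizers minimize)
    loss = critic(fake).mean() - critic(real).mean()

    # Gradient penalty on random interpolates between real and fake samples
    eps = torch.rand(real.size(0), *([1] * (real.dim() - 1)), device=real.device)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grad = torch.autograd.grad(
        outputs=critic(interp).sum(), inputs=interp, create_graph=True
    )[0]
    grad_norm = grad.flatten(start_dim=1).norm(2, dim=1)
    return loss + gp_weight * ((grad_norm - 1) ** 2).mean()
```

Weight clipping, as in the original WGAN, is the simpler alternative; the penalty version is generally the smoother one to train.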

Why care

Better gradients, fewer collapse issues, meaningful convergence, evaluation that aligns with human perception.

🧠 Quick Quiz Earn +10 vibes

Which divergence is a true metric and accounts for geometry?

True / False speed round

🧪

Mini Lab: pick a divergence

Click a chip to see how you might use that divergence in a project (GANs, denoising, evaluation, or theory).

Choose a divergence to see an idea.

Keep going

Ready to put divergences to work? Next up: explicit vs implicit training with these measures.