VAE: Mathematical Foundations

Divergence minimization, variational inference, and the reparameterization trick

📐 Divergence Minimization

Definition: Divergence Function

A divergence function $D[\cdot \| \cdot]: \mathcal{P} \times \mathcal{P} \to \mathbb{R}$ quantifies dissimilarity between distributions with properties:

1. Non-negativity: $D[P \| Q] \geq 0$ for all $P, Q \in \mathcal{P}$

2. Identity: $D[P \| Q] = 0 \iff P = Q$

Note: Unlike a metric, a divergence need not be symmetric or satisfy the triangle inequality

KL Divergence

$D_{KL}[P\|Q] = \mathbb{E}_P[\log P/Q]$

JS Divergence

Symmetrized, bounded variant of KL: each distribution is compared (via KL) to their mixture $\tfrac{1}{2}(P+Q)$

Wasserstein

Earth mover's distance
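
The three divergences can be checked numerically. The sketch below, assuming SciPy is available, compares two small discrete distributions; note that `scipy.spatial.distance.jensenshannon` returns the JS *distance* (the square root of the JS divergence), so it is squared here.

```python
# A minimal numeric check of the three divergences on two discrete
# distributions over the support {0, 1, 2, 3}.
import numpy as np
from scipy.stats import entropy, wasserstein_distance
from scipy.spatial.distance import jensenshannon

support = np.arange(4)
p = np.array([0.1, 0.4, 0.4, 0.1])
q = np.array([0.25, 0.25, 0.25, 0.25])

# KL divergence D_KL[p || q] = sum_i p_i * log(p_i / q_i)  (natural log)
kl_pq = entropy(p, q)
kl_qp = entropy(q, p)
print(f"KL[p||q] = {kl_pq:.4f}, KL[q||p] = {kl_qp:.4f}  (asymmetric)")

# jensenshannon returns the JS *distance* = sqrt(JS divergence)
js = jensenshannon(p, q) ** 2
print(f"JS[p, q] = {js:.4f}  (symmetric, bounded)")

# 1-D earth mover's distance between the two weighted point sets
w = wasserstein_distance(support, support, p, q)
print(f"W1(p, q) = {w:.4f}")
```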

🕸️ Probabilistic Graphical Models

Directed PGMs (Bayesian networks) use DAGs to represent conditional dependencies between variables:

Joint Distribution Factorization

$p(x_1, \ldots, x_D) = \prod_{i=1}^{D} p(x_i \mid \text{pa}(x_i))$

where $\text{pa}(x_i)$ denotes parents of node $x_i$

VAE/GAN Model

(Diagram: $z \to x$)

$p(x,z) = p(x|z)p(z)$

Conditional Model

(Diagram: $z \to y$, $z \to x$, $y \to x$)

$p(x,y,z) = p(x|z,y)p(y|z)p(z)$
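
The factorization directly prescribes how to sample from the joint: draw each node after its parents (ancestral sampling). Below is a minimal sketch for the $z \to x$ graph, with an illustrative linear-Gaussian choice for $p(x|z)$ standing in for a learned decoder.

```python
# Ancestral sampling for the latent-variable graph z -> x, following
# p(x, z) = p(x|z) p(z). The Gaussian choices below are illustrative
# stand-ins, not a trained decoder.
import numpy as np

rng = np.random.default_rng(0)

def sample_joint(n):
    z = rng.standard_normal(n)               # z ~ p(z) = N(0, 1)
    x = rng.normal(loc=2.0 * z, scale=0.5)   # x ~ p(x|z), here N(2z, 0.5^2)
    return x, z

x, z = sample_joint(10_000)
print("empirical Var[x] ≈", x.var())  # should be near 2^2 * 1 + 0.5^2 = 4.25
```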

🎓 Key Theorems

Theorem: Jensen's Inequality

Let $f: \mathbb{R} \to \mathbb{R}$ be convex. For any distribution $p(x)$:

$\mathbb{E}_{p(x)}[f(x)] \geq f(\mathbb{E}_{p(x)}[x])$

Equality holds iff $f$ is affine on the support of $p$, or $p(x)$ is a point mass.


Proof sketch for discrete case:

Let $X$ take values $x_1, \ldots, x_n$ with probabilities $p_1, \ldots, p_n$.

Since $f$ is convex, the two-point definition of convexity extends by induction to $f\left(\sum_i p_i x_i\right) \leq \sum_i p_i f(x_i)$ (the finite form of Jensen's inequality).

This directly gives $f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)]$. ∎

Theorem: LOTUS (Law of Unconscious Statistician)

For $Y = g(X)$ where $\mathbb{E}[|g(X)|] < \infty$:

$\mathbb{E}_{p_Y(y)}[y] = \mathbb{E}_{p_X(x)}[g(x)]$

Allows computing expectations without knowing $p_Y$ explicitly!
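
A quick Monte Carlo sketch of LOTUS (NumPy assumed, values illustrative): with $X \sim \mathcal{N}(0,1)$ and $g(x) = x^2$, the sample average of $g(X)$ estimates $\mathbb{E}[g(X)] = 1$ without ever deriving the density of $Y = g(X)$.

```python
# LOTUS in practice: estimate E[g(X)] by averaging g over samples of X,
# without ever writing down the density of Y = g(X).
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)   # X ~ N(0, 1)
g = lambda x: x ** 2                 # Y = g(X) = X^2 (its density is never needed)

print("Monte Carlo E[g(X)] ≈", g(x).mean())  # true value: E[X^2] = 1
```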

Definition: KL Divergence

$D_{KL}[p \| q] = \int p(x) \log \frac{p(x)}{q(x)} dx$

Proof: $D_{KL}[p \| q] \geq 0$

Let $f(x) = -\log x$ (convex), $g(x) = q(x)/p(x)$. By Jensen's inequality:

$D_{KL}[p\|q] = \mathbb{E}_p[-\log g(x)] \geq -\log \mathbb{E}_p[g(x)]$

$= -\log \int p(x)\frac{q(x)}{p(x)}dx = -\log \int q(x)dx = -\log 1 = 0$

Equality iff $p(x) = q(x)$ almost everywhere. ∎

🧠 Quiz 1: Divergence Properties

Which property does KL divergence NOT satisfy?

Maximum Likelihood ≡ KL Minimization

Key Result

Given data distribution $p_{\text{data}}(x)$ and model $p_\theta(x)$:

$\theta^* = \arg\min_\theta D_{KL}[p_{\text{data}} \| p_\theta]$

Expanding the KL divergence:

$D_{KL}[p_{\text{data}} \| p_\theta] = \mathbb{E}_{p_{\text{data}}}[\log p_{\text{data}}(x)] - \mathbb{E}_{p_{\text{data}}}[\log p_\theta(x)]$

The first term is constant w.r.t. $\theta$, so:

$\theta^* = \arg\max_\theta \mathbb{E}_{p_{\text{data}}}[\log p_\theta(x)]$

This is Maximum Likelihood Estimation: with the empirical distribution of the training set in place of $p_{\text{data}}$, the objective becomes the average log-likelihood of the data.
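
A small numeric illustration of the equivalence, under the assumption $p_{\text{data}} = \mathcal{N}(2, 1)$ and model family $p_\theta = \mathcal{N}(\theta, 1)$ (so $D_{KL}[p_{\text{data}} \| p_\theta] = (\theta - 2)^2/2$ in closed form): the $\theta$ that maximizes the average log-likelihood of the samples matches the $\theta$ that minimizes the KL.

```python
# Sketch: with p_data = N(2, 1) and model family p_theta = N(theta, 1),
# the theta maximizing the average log-likelihood of samples from p_data
# is also the theta minimizing KL[p_data || p_theta].
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=50_000)   # samples from p_data
thetas = np.linspace(0.0, 4.0, 401)

def avg_log_lik(theta):
    # log N(x; theta, 1) = -0.5*log(2*pi) - 0.5*(x - theta)^2
    return (-0.5 * np.log(2 * np.pi) - 0.5 * (data - theta) ** 2).mean()

def kl_closed_form(theta):
    # KL[N(2,1) || N(theta,1)] = (theta - 2)^2 / 2
    return 0.5 * (theta - 2.0) ** 2

lls = np.array([avg_log_lik(t) for t in thetas])
kls = kl_closed_form(thetas)
print("argmax log-lik :", thetas[lls.argmax()])   # ≈ 2 (the sample mean)
print("argmin KL      :", thetas[kls.argmin()])   # exactly 2
```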

🔬 Variational Inference

For a latent-variable model $p_\theta(x) = \int p_\theta(x|z)p(z)\,dz$, direct maximization of the likelihood is intractable!

⚠️ Challenge:

Computing $p_\theta(x) = \int p_\theta(x|z)p(z)\,dz$ requires integrating over the entire latent space!

Derivation: Variational Lower Bound (ELBO)

Start with log-likelihood:

$\log p_\theta(x) = \log \int p_\theta(x|z)p(z)dz$

Introduce variational distribution $q(z)$:

$= \log \int \frac{p_\theta(x|z)p(z)}{q(z)} q(z)dz = \log \mathbb{E}_q\left[\frac{p_\theta(x|z)p(z)}{q(z)}\right]$

Apply Jensen's inequality ($\log$ is concave):

$\geq \mathbb{E}_q\left[\log\frac{p_\theta(x|z)p(z)}{q(z)}\right]$

Rearrange to ELBO:

$= \mathbb{E}_q[\log p_\theta(x|z)] - D_{KL}[q(z) \| p(z)] = \mathcal{L}(x, q, \theta)$
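
The bound can be verified numerically on a toy model where everything is tractable: with $p(z) = \mathcal{N}(0,1)$ and $p(x|z) = \mathcal{N}(z,1)$ (illustrative choices), the marginal is $p(x) = \mathcal{N}(0,2)$ and the true posterior is $\mathcal{N}(x/2, 1/2)$. The ELBO matches $\log p(x)$ when $q$ equals the posterior and sits strictly below it otherwise. (NumPy/SciPy assumed.)

```python
# Numeric check of the bound: for p(z)=N(0,1), p(x|z)=N(z,1) we know
# p(x)=N(0,2) exactly, so the ELBO under any q(z) must sit at or below
# log p(x). The true posterior is N(x/2, 1/2).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = 1.3                                      # an observed data point

def elbo(mu_q, sigma_q, n_samples=200_000):
    z = rng.normal(mu_q, sigma_q, n_samples)             # z ~ q(z)
    log_joint = norm.logpdf(x, loc=z, scale=1.0) + norm.logpdf(z, 0.0, 1.0)
    log_q = norm.logpdf(z, mu_q, sigma_q)
    return (log_joint - log_q).mean()        # E_q[log p(x,z) - log q(z)]

print("log p(x)            =", norm.logpdf(x, 0.0, np.sqrt(2.0)))
print("ELBO, q = posterior  ≈", elbo(x / 2, np.sqrt(0.5)))   # ≈ log p(x)
print("ELBO, q = N(0, 1)    ≈", elbo(0.0, 1.0))              # strictly below
```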

🔍 Spot the Mistake!

Find the error in this "proof" that KL divergence is symmetric:

$D_{KL}[p \| q] = \int p(x) \log \frac{p(x)}{q(x)} dx$

$= \int p(x) (\log p(x) - \log q(x)) dx$

$= \int p(x) \log p(x) dx - \int p(x) \log q(x) dx$

$= \int p(x) \log p(x) dx - \int q(x) \log q(x) dx$

$= D_{KL}[q \| p]$

KL Divergence Between Gaussians

Theorem: KL for Factorized Gaussians

For $q(z|x) = \mathcal{N}(\mu_\phi(x), \text{diag}(\sigma^2_\phi(x)))$ and $p(z) = \mathcal{N}(0,I)$:

$D_{KL}[q \| p] = \frac{1}{2}\left(\|\mu_\phi(x)\|^2 + \|\sigma_\phi(x)\|^2 - 2\sum_{i=1}^{d}\log\sigma_{\phi,i}(x) - d\right)$

Proof:

For dimension $i$: $q(z_i) = \mathcal{N}(\mu_i, \sigma_i^2)$, $p(z_i) = \mathcal{N}(0,1)$

$D_{KL}[q(z_i) \| p(z_i)] = \mathbb{E}_q[\log q(z_i) - \log p(z_i)]$

$= \mathbb{E}_q\left[-\frac{1}{2}\log(2\pi\sigma_i^2) - \frac{(z_i-\mu_i)^2}{2\sigma_i^2} + \frac{1}{2}\log(2\pi) + \frac{z_i^2}{2}\right]$

Using $\mathbb{E}[(z_i-\mu_i)^2] = \sigma_i^2$ and $\mathbb{E}[z_i^2] = \mu_i^2 + \sigma_i^2$:

$= -\log\sigma_i - \frac{1}{2} + \frac{1}{2}(\mu_i^2 + \sigma_i^2)$

Sum over all dimensions $i=1,\ldots,d$ to get result. ∎
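
The closed form can be sanity-checked against a library implementation. The sketch below assumes PyTorch is available and uses arbitrary illustrative values for a $d = 3$ latent.

```python
# Checking the closed-form KL for a factorized Gaussian posterior against
# torch.distributions. mu/sigma are arbitrary illustrative values.
import torch
from torch.distributions import Normal, kl_divergence

mu = torch.tensor([0.5, -1.0, 0.2])
sigma = torch.tensor([0.8, 1.5, 0.3])

# Closed form: 0.5 * sum_i (mu_i^2 + sigma_i^2 - 2*log sigma_i - 1)
closed = 0.5 * (mu ** 2 + sigma ** 2 - 2 * torch.log(sigma) - 1).sum()

# Library value: sum of per-dimension KLs between N(mu_i, sigma_i^2) and N(0,1)
lib = kl_divergence(Normal(mu, sigma), Normal(torch.zeros(3), torch.ones(3))).sum()

print(closed.item(), lib.item())   # the two numbers should match
```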

🎯 The Reparameterization Trick

⚠️ Problem:

Cannot backpropagate through sampling: $z \sim q_\phi(z|x)$

Solution: Reparameterization

❌ Non-differentiable:

$z \sim \mathcal{N}(\mu_\phi(x), \sigma^2_\phi(x))$

✓ Differentiable:

$z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$

where $\epsilon \sim \mathcal{N}(0,I)$

Gradient Flow with Reparameterization

(Diagram: computation graph $\phi \to \mu_\phi(x), \sigma_\phi(x) \to z$, with noise $\epsilon$ entering $z$ as an external input; gradients $\partial/\partial z$ and $\partial/\partial \phi$ flow along the deterministic path.)

Noise $\epsilon$ is independent of $\phi$, allowing gradients to flow!
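
A minimal autograd sketch (PyTorch assumed): because $\epsilon$ carries all the randomness, a Monte Carlo estimate of $\mathbb{E}_q[f(z)]$ built from $z = \mu + \sigma \odot \epsilon$ is an ordinary differentiable function of $(\mu, \log\sigma)$, and `backward()` recovers the expected gradients. The test function $f$ and parameter values are illustrative.

```python
# The trick in code: sampling via z = mu + sigma * eps makes a Monte Carlo
# estimate of E_q[f(z)] differentiable w.r.t. the variational parameters.
import torch

mu = torch.tensor([0.0, 0.0], requires_grad=True)
log_sigma = torch.tensor([0.0, 0.0], requires_grad=True)

eps = torch.randn(10_000, 2)                 # eps ~ N(0, I), independent of params
z = mu + torch.exp(log_sigma) * eps          # z ~ N(mu, sigma^2), differentiably

f = lambda z: ((z - torch.tensor([1.0, -2.0])) ** 2).sum(dim=1)
loss = f(z).mean()                           # Monte Carlo estimate of E_q[f(z)]
loss.backward()

print("d loss / d mu        =", mu.grad)         # ≈ 2*(mu - [1, -2]) = [-2, 4]
print("d loss / d log_sigma =", log_sigma.grad)  # ≈ 2*sigma^2 = [2, 2]
```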

🧠 Quiz 2: Reparameterization

What makes the reparameterization trick work?

Complete VAE Objective

VAE Optimization

$\max_{\phi,\theta} \mathcal{L}(\phi,\theta)$

$= \mathbb{E}_{p_{\text{data}}(x)}\left[\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}[q_\phi(z|x) \| p(z)]\right]$

Reconstruction Term

$\mathbb{E}_{q_\phi}[\log p_\theta(x|z)]$

How well decoder reconstructs

KL Regularization

$D_{KL}[q_\phi(z|x) \| p(z)]$

Keep posterior close to prior
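
Putting the two terms together, here is a single-batch sketch of the (negative) ELBO in PyTorch with a Gaussian encoder and Bernoulli decoder. The tiny MLPs, layer sizes, and random stand-in batch are illustrative placeholders, not a reference implementation.

```python
# A minimal single-batch VAE loss, combining the reconstruction term and
# the closed-form Gaussian KL derived above.
import torch
import torch.nn as nn
import torch.nn.functional as F

x_dim, z_dim, h_dim = 784, 16, 256   # illustrative sizes

encoder = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, 2 * z_dim))
decoder = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, x_dim))

def elbo_loss(x):
    mu, log_var = encoder(x).chunk(2, dim=-1)                  # q_phi(z|x) parameters
    z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)   # reparameterized sample
    logits = decoder(z)                                        # p_theta(x|z) Bernoulli logits
    recon = F.binary_cross_entropy_with_logits(logits, x, reduction="none").sum(-1)
    kl = 0.5 * (mu ** 2 + log_var.exp() - log_var - 1).sum(-1)
    return (recon + kl).mean()                                 # negative ELBO over the batch

x = torch.rand(32, x_dim)            # stand-in batch with values in [0, 1]
print("negative ELBO:", elbo_loss(x).item())
```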

Posterior Collapse

Problem: Encoder Ignored

Posterior collapse occurs when $q_\phi(z_i|x) \approx p(z_i)$ for some dimension $i$, for (nearly) all inputs $x$.

This means the encoder is not using that latent dimension—it carries no information about $x$!

Causes

  • Powerful decoder ignores latents
  • KL penalty too strong
  • Bad local minima
  • High-dimensional latent space

Remedies

  • KL annealing (warm-up; see the sketch after this list)
  • β-VAE: $\beta D_{KL}$ with $\beta < 1$
  • Weaker decoder
  • Free bits constraint (also sketched below)
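
Two of these remedies written out as small helpers (PyTorch assumed; the warm-up length and free-bits threshold are illustrative, untuned values): a linear KL annealing weight, and a free-bits floor that removes the gradient pressure on dimensions whose KL is already below the threshold.

```python
# Two of the remedies above as small helpers (illustrative schedules and
# thresholds, not tuned values).
import torch

def kl_weight(step, warmup_steps=10_000):
    """Linear KL annealing: the weight ramps from 0 to 1 over warmup_steps."""
    return min(1.0, step / warmup_steps)

def free_bits_kl(kl_per_dim, free_bits=0.5):
    """Free bits: dimensions whose KL is below `free_bits` nats are clamped
    to that constant, so the optimizer gains nothing by shrinking them further.
    kl_per_dim has shape (batch, z_dim)."""
    return torch.clamp(kl_per_dim, min=free_bits).sum(-1).mean()

# Usage inside a training loop, with kl_per_dim the per-dimension KL
# (shape (batch, z_dim)) before summing, as in the loss sketch above:
# loss = recon.mean() + kl_weight(step) * free_bits_kl(kl_per_dim)
```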

📝 Key Takeaways

🎯 MLE = KL Min

Maximum likelihood is equivalent to minimizing KL divergence

📊 ELBO

Variational lower bound makes intractable inference tractable

🎲 Reparameterization

Enables backpropagation through stochastic sampling

"From divergence to generation—the mathematical beauty of VAEs"