VAE: Mathematical Foundations
Divergence minimization, variational inference, and the reparameterization trick
Divergence Minimization
Definition: Divergence Function
A divergence function $D[\cdot \| \cdot]: \mathcal{P} \times \mathcal{P} \to \mathbb{R}$ quantifies the dissimilarity between two distributions and satisfies:
1. Non-negativity: $D[P \| Q] \geq 0$ for all $P, Q \in \mathcal{P}$
2. Identity: $D[P \| Q] = 0 \iff P = Q$
Note: Unlike a metric, a divergence need not be symmetric or satisfy the triangle inequality
KL Divergence
$D_{KL}[P\|Q] = \mathbb{E}_P[\log P/Q]$
JS Divergence
Symmetrized, smoothed version of KL (computed against the mixture $(P+Q)/2$)
Wasserstein
Earth mover's distance
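As a quick numerical illustration (a minimal NumPy sketch, not part of the definitions above), the asymmetry of KL and the symmetry of JS can be checked on two small discrete distributions:

```python
import numpy as np

def kl(p, q):
    # D_KL[p || q] = sum_i p_i * log(p_i / q_i), assuming supp(p) ⊆ supp(q)
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

def js(p, q):
    # Jensen-Shannon: average KL of p and q against their mixture m = (p + q) / 2
    m = (np.asarray(p, dtype=float) + np.asarray(q, dtype=float)) / 2
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.7, 0.2, 0.1]
q = [0.3, 0.4, 0.3]

print(kl(p, q), kl(q, p))   # differ: KL is not symmetric
print(js(p, q), js(q, p))   # equal: JS is symmetric
```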
Probabilistic Graphical Models
PGMs use DAGs to represent conditional dependencies between variables:
Joint Distribution Factorization
$p(x_1, \ldots, x_D) = \prod_{i=1}^{D} p(x_i \mid \text{pa}(x_i))$
where $\text{pa}(x_i)$ denotes parents of node $x_i$
VAE/GAN Model
$p(x,z) = p(x|z)p(z)$
Conditional Model
$p(x,y,z) = p(x|z,y)p(y|z)p(z)$
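Read as a sampling recipe, the factorization says: draw each node after its parents. A minimal NumPy sketch for the $p(x,z) = p(x|z)p(z)$ model above, with an arbitrary linear-Gaussian likelihood chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d_z, d_x = 2, 4
W = rng.standard_normal((d_z, d_x))   # a fixed, made-up linear "decoder" for illustration

def sample_joint(n):
    # Ancestral sampling follows the DAG: z has no parents, x has parent z.
    z = rng.standard_normal((n, d_z))                  # z ~ p(z) = N(0, I)
    x = z @ W + 0.1 * rng.standard_normal((n, d_x))    # x ~ p(x | z) = N(zW, 0.1^2 I)
    return x, z

x, z = sample_joint(5)   # samples from the joint p(x, z) = p(x | z) p(z)
```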
Key Theorems
Theorem: Jensen's Inequality
Let $f: \mathbb{R} \to \mathbb{R}$ be convex. For any distribution $p(x)$:
$\mathbb{E}_{p(x)}[f(x)] \geq f(\mathbb{E}_{p(x)}[x])$
Equality holds iff $f$ is affine on the support of $p$, or $p(x)$ is a delta measure.
Proof sketch for discrete case:
Let $X$ take values $x_1, \ldots, x_n$ with probabilities $p_1, \ldots, p_n$.
Since $f$ is convex, $f\left(\sum_i p_i x_i\right) \leq \sum_i p_i f(x_i)$; this finite form follows from the two-point definition of convexity by induction on $n$.
This directly gives $f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)]$. ∎
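As a quick sanity check with $f(x) = x^2$: if $X$ is $0$ or $2$ with probability $\tfrac{1}{2}$ each, then $\mathbb{E}[f(X)] = 2 \geq 1 = f(\mathbb{E}[X])$.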
Theorem: LOTUS (Law of Unconscious Statistician)
For $Y = g(X)$ where $\mathbb{E}[|g(X)|] < \infty$:
$\mathbb{E}_{p_Y(y)}[y] = \mathbb{E}_{p_X(x)}[g(x)]$
Allows computing expectations without knowing $p_Y$ explicitly!
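For example, if $X \sim \mathcal{N}(0,1)$ and $Y = X^2$, then $\mathbb{E}[Y] = \int x^2\, p_X(x)\,dx = 1$, without ever writing down the chi-squared density $p_Y$.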
Definition: KL Divergence
$D_{KL}[p \| q] = \int p(x) \log \frac{p(x)}{q(x)} dx$
Proof: $D_{KL}[p \| q] \geq 0$
Let $f(x) = -\log x$ (convex), $g(x) = q(x)/p(x)$. By Jensen's inequality:
$D_{KL}[p\|q] = \mathbb{E}_p\left[-\log\frac{q(x)}{p(x)}\right] = \mathbb{E}_p[f(g(x))] \geq f(\mathbb{E}_p[g(x)]) = -\log \mathbb{E}_p[g(x)]$
$= -\log \int p(x)\frac{q(x)}{p(x)}dx = -\log \int q(x)dx = -\log 1 = 0$
Equality iff $p(x) = q(x)$ almost everywhere. ∎
🧠 Quiz 1: Divergence Properties
Which property does KL divergence NOT satisfy?
Maximum Likelihood ≡ KL Minimization
Key Result
Given data distribution $p_{\text{data}}(x)$ and model $p_\theta(x)$:
$\theta^* = \arg\min_\theta D_{KL}[p_{\text{data}} \| p_\theta]$
Expanding the KL divergence:
$D_{KL}[p_{\text{data}} \| p_\theta] = \mathbb{E}_{p_{\text{data}}}[\log p_{\text{data}}] - \mathbb{E}_{p_{\text{data}}}[\log p_\theta]$
First term constant w.r.t. $\theta$, so:
$\theta^* = \arg\max_\theta \mathbb{E}_{p_{\text{data}}}[\log p_\theta(x)]$
This is Maximum Likelihood Estimation!
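A tiny numerical sketch of this equivalence (assumed setup, chosen only for illustration: data drawn from $\mathcal{N}(3,1)$ and a model family $p_\theta = \mathcal{N}(\theta, 1)$):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.0, size=10_000)   # samples from p_data

def avg_log_lik(theta, x):
    # Average of log N(x; theta, 1), dropping the constant -0.5*log(2*pi)
    return np.mean(-0.5 * (x - theta) ** 2)

thetas = np.linspace(0.0, 6.0, 601)
theta_mle = thetas[np.argmax([avg_log_lik(t, data) for t in thetas])]
print(theta_mle)   # ≈ 3.0: the theta that also minimizes D_KL[p_data || p_theta]
```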
Variational Inference
For latent variable model $p_\theta(x) = \int p_\theta(x|z)p(z)dz$, direct maximization is intractable!
⚠️ Challenge:
Computing $p_\theta(x) = \int p_\theta(x|z)p(z)dz$ requires integrating over entire latent space!
Derivation: Variational Lower Bound (ELBO)
Start with log-likelihood:
$\log p_\theta(x) = \log \int p_\theta(x|z)p(z)dz$
Introduce variational distribution $q(z)$:
$= \log \int \frac{p_\theta(x|z)p(z)}{q(z)} q(z)dz = \log \mathbb{E}_q\left[\frac{p_\theta(x|z)p(z)}{q(z)}\right]$
Apply Jensen's inequality ($\log$ is concave):
$\geq \mathbb{E}_q\left[\log\frac{p_\theta(x|z)p(z)}{q(z)}\right]$
Rearrange to ELBO:
$= \mathbb{E}_q[\log p_\theta(x|z)] - D_{KL}[q(z) \| p(z)] = \mathcal{L}(x, q, \theta)$
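To see the bound numerically, here is a small sketch on a toy model where the evidence is available in closed form ($p(z) = \mathcal{N}(0,1)$, $p(x|z) = \mathcal{N}(z,1)$, hence $p(x) = \mathcal{N}(0,2)$ and the true posterior is $\mathcal{N}(x/2, 1/2)$); the KL term uses the Gaussian closed form derived in the next section:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: p(z) = N(0, 1), p(x | z) = N(z, 1)  =>  p(x) = N(0, 2)
def log_p_x(x):
    return -0.5 * np.log(2 * np.pi * 2.0) - x**2 / (2 * 2.0)

def elbo(x, m, s, n_samples=100_000):
    # Monte Carlo estimate of E_q[log p(x|z)] - KL[q(z) || p(z)] with q = N(m, s^2)
    z = m + s * rng.standard_normal(n_samples)
    log_lik = -0.5 * np.log(2 * np.pi) - (x - z) ** 2 / 2
    kl = 0.5 * (m**2 + s**2 - 2 * np.log(s) - 1)   # closed-form Gaussian KL
    return log_lik.mean() - kl

x = 1.5
print(log_p_x(x))                        # true log-evidence
print(elbo(x, m=0.0, s=1.0))             # loose lower bound (q = prior)
print(elbo(x, m=x / 2, s=np.sqrt(0.5)))  # tight: q equals the true posterior N(x/2, 1/2)
```

The gap between $\log p_\theta(x)$ and the ELBO is exactly $D_{KL}[q(z) \| p_\theta(z|x)]$, which is why the bound becomes tight when $q$ matches the true posterior.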
🔍 Spot the Mistake!
Find the error in this "proof" that KL divergence is symmetric:
$D_{KL}[p \| q] = \int p(x) \log \frac{p(x)}{q(x)} dx$
$= \int p(x) (\log p(x) - \log q(x)) dx$
$= \int p(x) \log p(x) dx - \int p(x) \log q(x) dx$
$= \int p(x) \log p(x) dx - \int q(x) \log q(x) dx$
$= D_{KL}[q \| p]$
KL Divergence Between Gaussians
Theorem: KL for Factorized Gaussians
For $q(z|x) = \mathcal{N}(\mu_\phi(x), \text{diag}(\sigma^2_\phi(x)))$ and $p(z) = \mathcal{N}(0,I)$:
$D_{KL}[q \| p] = \frac{1}{2}\left(\|\mu_\phi(x)\|^2 + \|\sigma_\phi(x)\|^2 - 2\sum_{i=1}^{d}\log\sigma_{\phi,i}(x) - d\right)$
Proof:
For dimension $i$: $q(z_i) = \mathcal{N}(\mu_i, \sigma_i^2)$, $p(z_i) = \mathcal{N}(0,1)$
$D_{KL}[q(z_i) \| p(z_i)] = \mathbb{E}_q[\log q(z_i) - \log p(z_i)]$
$= \mathbb{E}_q\left[-\frac{1}{2}\log(2\pi\sigma_i^2) - \frac{(z_i-\mu_i)^2}{2\sigma_i^2} + \frac{1}{2}\log(2\pi) + \frac{z_i^2}{2}\right]$
Using $\mathbb{E}[(z_i-\mu_i)^2] = \sigma_i^2$ and $\mathbb{E}[z_i^2] = \mu_i^2 + \sigma_i^2$:
$= -\log\sigma_i - \frac{1}{2} + \frac{1}{2}(\mu_i^2 + \sigma_i^2)$
Sum over all dimensions $i=1,\ldots,d$ to get result. ∎
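The closed form can be sanity-checked against a Monte Carlo estimate of $\mathbb{E}_q[\log q(z) - \log p(z)]$ (a small NumPy sketch with arbitrary $\mu$ and $\sigma$):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0, 2.0])
sigma = np.array([0.8, 1.5, 0.3])
d = mu.size

# Closed form: 1/2 * sum_i (mu_i^2 + sigma_i^2 - 2*log(sigma_i) - 1)
kl_closed = 0.5 * np.sum(mu**2 + sigma**2 - 2 * np.log(sigma) - 1)

# Monte Carlo estimate with q = N(mu, diag(sigma^2)) and p = N(0, I)
z = mu + sigma * rng.standard_normal((1_000_000, d))
log_q = np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (z - mu) ** 2 / (2 * sigma**2), axis=1)
log_p = np.sum(-0.5 * np.log(2 * np.pi) - z**2 / 2, axis=1)
kl_mc = np.mean(log_q - log_p)

print(kl_closed, kl_mc)   # should agree to about two decimal places
```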
The Reparameterization Trick
⚠️ Problem:
Cannot backpropagate through sampling: $z \sim q_\phi(z|x)$
Solution: Reparameterization
❌ Non-differentiable:
$z \sim \mathcal{N}(\mu_\phi(x), \text{diag}(\sigma^2_\phi(x)))$
✓ Differentiable:
$z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$
where $\epsilon \sim \mathcal{N}(0,I)$
Gradient Flow with Reparameterization
Noise $\epsilon$ is independent of $\phi$, allowing gradients to flow!
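A minimal PyTorch sketch (tensor names are made up for illustration) showing that gradients reach $\mu_\phi$ and $\log\sigma_\phi$ through the reparameterized sample; `torch.distributions.Normal(...).rsample()` implements the same idea:

```python
import torch

mu = torch.tensor([0.0, 1.0], requires_grad=True)
log_sigma = torch.tensor([0.0, -1.0], requires_grad=True)

# Reparameterized sample: all randomness lives in eps, which does not depend on mu/log_sigma
eps = torch.randn(2)
z = mu + torch.exp(log_sigma) * eps

loss = (z ** 2).sum()   # any downstream loss of z
loss.backward()
print(mu.grad, log_sigma.grad)   # both populated: gradients flow through the sample
```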
🧠 Quiz 2: Reparameterization
What makes the reparameterization trick work?
Complete VAE Objective
VAE Optimization
$\max_{\phi,\theta} \mathcal{L}(\phi,\theta)$
$= \mathbb{E}_{p_{\text{data}}(x)}\left[\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}[q_\phi(z|x) \| p(z)]\right]$
Reconstruction Term
$\mathbb{E}_{q_\phi}[\log p_\theta(x|z)]$
How well decoder reconstructs
KL Regularization
$D_{KL}[q_\phi(z|x) \| p(z)]$
Keep posterior close to prior
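Putting both terms together, a single-sample Monte Carlo estimate of the negative objective for one batch might look like the sketch below (assumptions, not part of the original text: a Bernoulli decoder over binarized inputs, and hypothetical `encoder`/`decoder` modules returning $(\mu, \log\sigma^2)$ and logits respectively):

```python
import torch
import torch.nn.functional as F

def vae_loss(x, encoder, decoder):
    # Encoder outputs the parameters of q_phi(z|x) = N(mu, diag(sigma^2))
    mu, log_var = encoder(x)

    # Reparameterized sample z = mu + sigma * eps  (single-sample MC estimate)
    z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)

    # Reconstruction term: E_q[log p_theta(x|z)] for a Bernoulli decoder
    logits = decoder(z)
    recon = -F.binary_cross_entropy_with_logits(logits, x, reduction="sum")

    # KL term: D_KL[q_phi(z|x) || N(0, I)], closed form per dimension
    kl = 0.5 * torch.sum(mu**2 + torch.exp(log_var) - log_var - 1)

    # Maximize the ELBO  <=>  minimize the negative ELBO
    return -(recon - kl)
```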
Posterior Collapse
Problem: Encoder Ignored
Posterior collapse occurs when $q_\phi(z_i|x) \approx p(z_i)$ for some dimension $i$.
This means the encoder is not using that latent dimension—it carries no information about $x$!
Causes
- Powerful decoder ignores latents
- KL penalty too strong
- Bad local minima
- High-dimensional latent space
Remedies
- KL annealing (warm-up); sketched below together with free bits
- β-VAE: $\beta D_{KL}$ with $\beta < 1$
- Weaker decoder
- Free bits constraint
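KL annealing and free bits both amount to small modifications of the KL term; a minimal sketch (the hyperparameters `warmup_steps` and `free_bits` are illustrative, and shapes are per-example for simplicity):

```python
import torch

def kl_term(mu, log_var, step, warmup_steps=10_000, free_bits=0.5):
    # Per-dimension KL against N(0, I)
    kl_per_dim = 0.5 * (mu**2 + torch.exp(log_var) - log_var - 1)

    # Free bits: each dimension contributes at least `free_bits` nats, so the
    # optimizer gains nothing by pushing a dimension's KL below that floor.
    kl = torch.clamp(kl_per_dim, min=free_bits).sum()

    # KL annealing: ramp the weight from 0 to 1 over the warm-up period.
    beta = min(1.0, step / warmup_steps)
    return beta * kl
```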
Key Takeaways
🎯 MLE = KL Min
Maximum likelihood is equivalent to minimizing KL divergence
📊 ELBO
Variational lower bound makes intractable inference tractable
🎲 Reparameterization
Enables backpropagation through stochastic sampling
"From divergence to generation—the mathematical beauty of VAEs"