Sinkhorn Generative Modeling
From optimal transport to fast entropy-regularized algorithms for generative modeling
Optimal Transport: The Assignment Problem
Problem: How to optimally assign points from set $X = \{x_1, \ldots, x_n\}$ to points in set $Y = \{y_1, \ldots, y_m\}$ while minimizing total cost?
Interactive Assignment Visualization
Animated flows show one-to-one matching minimizing total distance
⚠️ Limitations of Simple Assignment:
- ❌ Requires $|X| = |Y|$ (equal sizes)
- ❌ No out-of-sample extension for new points
- ❌ Computational cost: $O(n^3)$ via Hungarian algorithm
- ❌ Cannot handle probability distributions directly
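For concreteness, here is a minimal sketch of the plain assignment problem using SciPy's Hungarian-style solver. The point clouds, their size, and the Euclidean cost are illustrative assumptions; note that `linear_sum_assignment` already exhibits the equal-size, one-to-one restriction discussed above.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
n = 5
X = rng.normal(size=(n, 2))                  # source points x_1..x_n
Y = rng.normal(size=(n, 2))                  # target points y_1..y_n (must have equal size)

# Pairwise Euclidean cost C[i, j] = ||x_i - y_j||
C = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)

# One-to-one matching minimizing total cost (Hungarian-style, O(n^3))
row_ind, col_ind = linear_sum_assignment(C)
print("matching:", list(zip(row_ind.tolist(), col_ind.tolist())))
print("total cost:", C[row_ind, col_ind].sum())
```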
Optimal Transport: Formal Definition
Definition: Optimal Transport
Let $(X,p)$, $(Y,q)$ be finite probability spaces with $|X|=n$, $|Y|=m$. Let $\Pi(p,q)$ be the set of joint distributions on $X \times Y$ with marginals $p$ and $q$:
$\Pi(p,q) = \{\pi : \pi \mathbf{1}_m = p, \; \pi^\top \mathbf{1}_n = q\}$
Given cost function $c: X \times Y \to \mathbb{R}_+$, the optimal transport distance is:
$d_c(p,q) = \min_{\pi \in \Pi(p,q)} \langle c, \pi \rangle = \min_{\pi \in \Pi(p,q)} \sum_{x,y} c(x,y)\pi(x,y)$
Marginal Constraints
Row sums match source $p$:
$\sum_y \pi(x,y) = p(x)$
Column sums match target $q$:
$\sum_x \pi(x,y) = q(y)$
Special Case: Wasserstein
If $c(x,y) = \|x-y\|_2$:
$d_c(p,q) = W_1(p,q)$
(1-Wasserstein distance)
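Since $d_c(p,q)$ is a linear program in the $nm$ entries of $\pi$, small discrete instances can be solved directly with a generic LP solver. The sketch below uses `scipy.optimize.linprog`; the support points, the cost $c(x,y) = |x-y|$, and the problem sizes are illustrative assumptions (one of the $n+m$ marginal constraints is redundant, which the HiGHS backend tolerates).

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, m = 4, 6
X = rng.normal(size=(n, 1))                  # source support points
Y = rng.normal(size=(m, 1))                  # target support points
p = np.full(n, 1.0 / n)                      # source marginal p
q = np.full(m, 1.0 / m)                      # target marginal q
C = np.abs(X - Y.T)                          # c(x, y) = |x - y|, so d_c = W_1

# Decision variable: pi flattened row-major, pi[i, j] = z[i * m + j]
A_eq = np.zeros((n + m, n * m))
for i in range(n):                           # row sums equal p
    A_eq[i, i * m:(i + 1) * m] = 1.0
for j in range(m):                           # column sums equal q
    A_eq[n + j, j::m] = 1.0
b_eq = np.concatenate([p, q])

res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
pi = res.x.reshape(n, m)
print("d_c(p, q) =", res.fun)
print("marginals satisfied:", np.allclose(pi.sum(axis=1), p), np.allclose(pi.sum(axis=0), q))
```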
🧠 Quiz 1: Understanding
What does the constraint $\pi \mathbf{1}_m = p$ ensure in optimal transport?
Entropy Regularization
To improve stability and enable fast algorithms, we add an entropy regularization term:
Entropy-Regularized OT
$d_c^\lambda(p,q) = \min_{\pi \in \Pi(p,q)} \langle c, \pi \rangle - \lambda H(\pi)$
where $H(\pi) = -\sum_{x,y} \pi(x,y) \log \pi(x,y)$ is the entropy
Key Insight: KL Divergence Reformulation
Define $K(x,y) = e^{-c(x,y)/\lambda}$ and Gibbs distribution $p_K^\lambda = K/z_\lambda$ where $z_\lambda = \sum_{x,y} K(x,y)$. Then:
$D_{KL}(\pi \| p_K^\lambda) = \sum_{x,y} \pi(x,y) \log\frac{\pi(x,y)}{K(x,y)/z_\lambda}$
$\quad = \sum_{x,y} \pi(x,y) \log \pi(x,y) - \sum_{x,y} \pi(x,y)\left(-\frac{c(x,y)}{\lambda}\right) + \log z_\lambda$
$\quad = -H(\pi) + \frac{1}{\lambda}\langle c, \pi \rangle + \log z_\lambda$
Therefore: $\arg\min_{\pi \in \Pi(p,q)} \langle c, \pi \rangle - \lambda H(\pi) = \arg\min_{\pi \in \Pi(p,q)} D_{KL}(\pi \| p_K^\lambda)$
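The identity above is easy to check numerically. The following sketch verifies it on a random cost matrix and a random joint distribution; all values are arbitrary and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, lam = 3, 4, 0.5
C = rng.uniform(size=(n, m))                  # arbitrary cost matrix c(x, y)
K = np.exp(-C / lam)                          # K(x, y) = exp(-c(x, y) / lambda)
z = K.sum()                                   # normalizer z_lambda

pi = rng.uniform(size=(n, m))
pi /= pi.sum()                                # any joint distribution over X x Y

H = -(pi * np.log(pi)).sum()                  # entropy H(pi)
kl = (pi * np.log(pi * z / K)).sum()          # D_KL(pi || K / z_lambda)
rhs = -H + (C * pi).sum() / lam + np.log(z)   # -H(pi) + <c, pi>/lambda + log z_lambda

print(np.isclose(kl, rhs))                    # True: both sides agree
```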
✅ Benefits
- Unique solution (strongly convex objective)
- Fast algorithm: $O(nm)$ cost per iteration
- Smooth optimization landscape
📉 Convergence
As $\lambda \to 0$:
$d_c^\lambda(p,q) \to d_c(p,q)$
Sinkhorn Algorithm
Proposition: Matrix Scaling Form
The optimal transport plan has the form:
$\pi_\lambda^* = \text{diag}(u) K \text{diag}(v)$
where $K_{xy} = e^{-c(x,y)/\lambda}$ and $u \in \mathbb{R}^n, v \in \mathbb{R}^m$ are scaling vectors.
Proof:
Using Lagrangian with multipliers $f, g$:
$L(\pi,f,g) = \langle c, \pi \rangle - \lambda H(\pi) + \langle f, p - \pi\mathbf{1}_m \rangle + \langle g, q - \pi^\top\mathbf{1}_n \rangle$
First-order optimality at $(x,y)$:
$c(x,y) + \lambda(\log\pi(x,y) + 1) - f_x - g_y = 0$
Solving for $\pi(x,y)$:
$\pi(x,y) = e^{(f_x/\lambda - 1/2)} e^{-c(x,y)/\lambda} e^{(g_y/\lambda - 1/2)}$
Setting $u_x = e^{f_x/\lambda - 1/2}$ and $v_y = e^{g_y/\lambda - 1/2}$ yields the result. ∎
Algorithm: Sinkhorn Iterations
1: Compute $K = \exp(-C/\lambda)$
2: Initialize $u^{(0)} = \mathbf{1}_n, v^{(0)} = \mathbf{1}_m$
3: for $\ell = 1$ to $L$:
4: $u^{(\ell)} = p \oslash (K v^{(\ell-1)})$ (element-wise division)
5: $v^{(\ell)} = q \oslash (K^T u^{(\ell)})$
6: Return $\pi = \text{diag}(u^{(L)}) K \text{diag}(v^{(L)})$
✅ Complexity: $O(Lnm)$, where $L$ is the number of iterations
Much faster than exact LP solvers, which take $O(nm(n+m)\log(n+m))$
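A direct NumPy transcription of the iterations above, as a minimal sketch: the test data, $\lambda$, and the fixed iteration count are illustrative choices, and for very small $\lambda$ one would switch to log-domain updates to keep $K = e^{-C/\lambda}$ from underflowing.

```python
import numpy as np

def sinkhorn(p, q, C, lam=0.1, n_iter=200):
    """Entropy-regularized OT plan via Sinkhorn matrix scaling (steps 1-6 above)."""
    K = np.exp(-C / lam)                     # 1: Gibbs kernel
    u, v = np.ones_like(p), np.ones_like(q)  # 2: initialize scaling vectors
    for _ in range(n_iter):                  # 3-5: alternating marginal corrections
        u = p / (K @ v)
        v = q / (K.T @ u)
    return u[:, None] * K * v[None, :]       # 6: pi = diag(u) K diag(v)

rng = np.random.default_rng(0)
n, m = 5, 7
C = rng.uniform(size=(n, m))                 # arbitrary cost matrix
p = np.full(n, 1.0 / n)                      # uniform source marginal
q = np.full(m, 1.0 / m)                      # uniform target marginal

pi = sinkhorn(p, q, C)
print("row sums match p:", np.allclose(pi.sum(axis=1), p))
print("col sums match q:", np.allclose(pi.sum(axis=0), q))
print("regularized transport cost <C, pi> =", (C * pi).sum())
```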
🔍 Spot the Mistake! Challenge
Which statement about the Sinkhorn algorithm is INCORRECT?
Sinkhorn Generative Modeling
Use the Sinkhorn algorithm to train generators by minimizing the optimal transport distance between the generated and real distributions!
Sinkhorn Generative Architecture
Generator produces samples → compare to real data via Sinkhorn → backprop loss to update $\theta$
Algorithm: Sinkhorn Generative Modeling
Input: Real samples, prior $p_z(z)$, iterations $N$, regularization $\epsilon$, learning rate $\alpha$
1: Initialize generator parameters $\theta$ randomly
2: for $k=1$ to $N$:
3: Sample mini-batch $\{y^i\}_{i=1}^n$ from real data
4: Sample $\{z^i\}_{i=1}^m$ from $p_z(z)$
5: Generate $\{x^i\}_{i=1}^m$ via $x^i = G_\theta(z^i)$
6: Compute cost matrix $C$ with $C_{ij} = c(x^i, y^j)$
7: Run Sinkhorn to get $(u, v)$
8: Compute loss: $\hat{E}_L(\theta) = \langle u_L, (C \odot K)\, v_L \rangle = \langle C, \pi \rangle$
9: Update: $\theta \leftarrow \theta - \alpha \nabla_\theta \hat{E}_L(\theta)$
💡 Key Insight: Loss is differentiable w.r.t. $\theta$ through the generated samples $x^i$!
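A minimal PyTorch sketch of this training loop, assuming a toy 2-D target distribution, a small MLP generator, squared Euclidean cost, and uniform batch weights; none of these choices (architecture, $\epsilon$, batch sizes, optimizer) come from the text, they are illustrative. Because every step of the Sinkhorn loop is differentiable, `loss.backward()` propagates gradients through $u$, $v$, and $C$ into $\theta$.

```python
import torch
import torch.nn as nn

def sinkhorn_loss(x, y, eps=0.5, n_iter=50):
    """Differentiable entropic OT cost between two uniform point clouds."""
    m, n = x.shape[0], y.shape[0]
    C = torch.cdist(x, y, p=2) ** 2                  # C_ij = ||x_i - y_j||^2
    K = torch.exp(-C / eps)                          # Gibbs kernel (underflows for tiny eps;
                                                     # use log-domain updates in that regime)
    p = torch.full((m,), 1.0 / m)
    q = torch.full((n,), 1.0 / n)
    u, v = torch.ones_like(p), torch.ones_like(q)
    for _ in range(n_iter):                          # Sinkhorn scaling iterations
        u = p / (K @ v)
        v = q / (K.T @ u)
    return torch.einsum("i,ij,j->", u, C * K, v)     # <C, diag(u) K diag(v)>

# Toy setup: 2-D Gaussian "real" data and a small MLP generator (all choices illustrative).
generator = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(generator.parameters(), lr=1e-3)

for step in range(1000):
    y = torch.randn(128, 2) + torch.tensor([1.0, 0.5])   # mini-batch of "real" samples
    z = torch.randn(128, 8)                              # prior samples z ~ p_z
    x = generator(z)                                     # generated samples x = G_theta(z)
    loss = sinkhorn_loss(x, y)                           # entropic OT cost, differentiable in theta
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 200 == 0:
        print(f"step {step}: sinkhorn loss = {loss.item():.4f}")
```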
Key Takeaways
📊 OT Basics
Optimal transport finds the minimum-cost coupling between distributions
⚡ Sinkhorn
Entropy regularization enables a fast $O(Lnm)$ matrix-scaling algorithm
🎨 Generation
Use the OT distance as a differentiable loss for training generators
"From earth moving to pixel pushingβoptimal transport for generative AI!"