Sinkhorn Generative Modeling
From optimal transport to fast entropy-regularized algorithms for generative modeling
Optimal Transport: The Assignment Problem
Problem: How to optimally assign points from set $X = \{x_1, \ldots, x_n\}$ to points in set $Y = \{y_1, \ldots, y_m\}$ while minimizing total cost?
Interactive Assignment Visualization
Animated flows show one-to-one matching minimizing total distance
⚠️ Limitations of Simple Assignment:
- ❌ Requires $|X| = |Y|$ (equal sizes)
- ❌ No out-of-sample extension for new points
- ❌ Computational cost: $O(n^3)$ via Hungarian algorithm
- ❌ Cannot handle probability distributions directly
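For concreteness, here is a minimal sketch of the plain assignment problem using SciPy's Hungarian-style solver. The point clouds, their size, and the Euclidean cost are illustrative assumptions; note that `linear_sum_assignment` already exhibits the equal-size, one-to-one restriction discussed above.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
n = 5
X = rng.normal(size=(n, 2))                  # source points x_1..x_n
Y = rng.normal(size=(n, 2))                  # target points y_1..y_n (must have equal size)

# Pairwise Euclidean cost C[i, j] = ||x_i - y_j||
C = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)

# One-to-one matching minimizing total cost (Hungarian-style, O(n^3))
row_ind, col_ind = linear_sum_assignment(C)
print("matching:", list(zip(row_ind.tolist(), col_ind.tolist())))
print("total cost:", C[row_ind, col_ind].sum())
```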
Optimal Transport: Formal Definition
Definition: Optimal Transport
Let $(X,p)$, $(Y,q)$ be finite probability spaces with $|X|=n$, $|Y|=m$. Let $\Pi(p,q)$ be the set of joint distributions on $X \times Y$ with marginals $p$ and $q$:
$\Pi(p,q) = \{\pi : \pi \mathbf{1}_m = p, \; \pi^\top \mathbf{1}_n = q\}$
Given cost function $c: X \times Y \to \mathbb{R}_+$, the optimal transport distance is:
$d_c(p,q) = \min_{\pi \in \Pi(p,q)} \langle c, \pi \rangle = \min_{\pi \in \Pi(p,q)} \sum_{x,y} c(x,y)\pi(x,y)$
Marginal Constraints
Row sums match source $p$:
$\sum_y \pi(x,y) = p(x)$
Column sums match target $q$:
$\sum_x \pi(x,y) = q(y)$
Special Case: Wasserstein
If $c(x,y) = \|x-y\|_2$:
$d_c(p,q) = W_1(p,q)$
(1-Wasserstein distance)
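Since $d_c(p,q)$ is a linear program in the $nm$ entries of $\pi$, small discrete instances can be solved directly with a generic LP solver. The sketch below uses `scipy.optimize.linprog`; the support points, the cost $c(x,y) = |x-y|$, and the problem sizes are illustrative assumptions (one of the $n+m$ marginal constraints is redundant, which the HiGHS backend tolerates).

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, m = 4, 6
X = rng.normal(size=(n, 1))                  # source support points
Y = rng.normal(size=(m, 1))                  # target support points
p = np.full(n, 1.0 / n)                      # source marginal p
q = np.full(m, 1.0 / m)                      # target marginal q
C = np.abs(X - Y.T)                          # c(x, y) = |x - y|, so d_c = W_1

# Decision variable: pi flattened row-major, pi[i, j] = z[i * m + j]
A_eq = np.zeros((n + m, n * m))
for i in range(n):                           # row sums equal p
    A_eq[i, i * m:(i + 1) * m] = 1.0
for j in range(m):                           # column sums equal q
    A_eq[n + j, j::m] = 1.0
b_eq = np.concatenate([p, q])

res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
pi = res.x.reshape(n, m)
print("d_c(p, q) =", res.fun)
print("marginals satisfied:", np.allclose(pi.sum(axis=1), p), np.allclose(pi.sum(axis=0), q))
```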
🧠 Quiz 1: Understanding
What does the constraint $\pi \mathbf{1}_m = p$ ensure in optimal transport?
Entropy Regularization
To improve stability and enable fast algorithms, we add an entropy regularization term:
Entropy-Regularized OT
$d_c^\lambda(p,q) = \min_{\pi \in \Pi(p,q)} \langle c, \pi \rangle - \lambda H(\pi)$
where $H(\pi) = -\sum_{x,y} \pi(x,y) \log \pi(x,y)$ is the entropy
Key Insight: KL Divergence Reformulation
Define $K(x,y) = e^{-c(x,y)/\lambda}$ and Gibbs distribution $p_K^\lambda = K/z_\lambda$ where $z_\lambda = \sum_{x,y} K(x,y)$. Then:
$D_{KL}(\pi \| p_K^\lambda) = \sum_{x,y} \pi(x,y) \log\frac{\pi(x,y)}{K(x,y)/z_\lambda}$
$\quad = \sum_{x,y} \pi(x,y) \log \pi(x,y) - \sum_{x,y} \pi(x,y)\left(-\frac{c(x,y)}{\lambda}\right) + \log z_\lambda$
$\quad = -H(\pi) + \frac{1}{\lambda}\langle c, \pi \rangle + \log z_\lambda$
Therefore: $\arg\min_{\pi \in \Pi(p,q)} \langle c, \pi \rangle - \lambda H(\pi) = \arg\min_{\pi \in \Pi(p,q)} D_{KL}(\pi \| p_K^\lambda)$
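The identity above is easy to check numerically. The following sketch verifies it on a random cost matrix and a random joint distribution; all values are arbitrary and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, lam = 3, 4, 0.5
C = rng.uniform(size=(n, m))                  # arbitrary cost matrix c(x, y)
K = np.exp(-C / lam)                          # K(x, y) = exp(-c(x, y) / lambda)
z = K.sum()                                   # normalizer z_lambda

pi = rng.uniform(size=(n, m))
pi /= pi.sum()                                # any joint distribution over X x Y

H = -(pi * np.log(pi)).sum()                  # entropy H(pi)
kl = (pi * np.log(pi * z / K)).sum()          # D_KL(pi || K / z_lambda)
rhs = -H + (C * pi).sum() / lam + np.log(z)   # -H(pi) + <c, pi>/lambda + log z_lambda

print(np.isclose(kl, rhs))                    # True: both sides agree
```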
✅ Benefits
- Unique solution (strongly convex objective)
- Fast algorithm: $O(nm)$ cost per iteration
- Smooth optimization landscape
📉 Convergence
As $\lambda \to 0$:
$d_c^\lambda(p,q) \to d_c(p,q)$
Sinkhorn Algorithm
Proposition: Matrix Scaling Form
The optimal transport plan has the form:
$\pi_\lambda^* = \text{diag}(u) K \text{diag}(v)$
where $K_{xy} = e^{-c(x,y)/\lambda}$ and $u \in \mathbb{R}^n, v \in \mathbb{R}^m$ are scaling vectors.
Proof:
Using Lagrangian with multipliers $f, g$:
$L(\pi,f,g) = \langle c, \pi \rangle - \lambda H(\pi) + \langle f, p - \pi\mathbf{1}_m \rangle + \langle g, q - \pi^\top\mathbf{1}_n \rangle$
First-order optimality at $(x,y)$:
$c(x,y) + \lambda(\log\pi(x,y) + 1) - f_x - g_y = 0$
Solving for $\pi(x,y)$:
$\pi(x,y) = e^{(f_x/\lambda - 1/2)} e^{-c(x,y)/\lambda} e^{(g_y/\lambda - 1/2)}$
Setting $u_x = e^{f_x/\lambda - 1/2}$ and $v_y = e^{g_y/\lambda - 1/2}$ yields the result. ∎
Algorithm: Sinkhorn Iterations
1: Compute $K = \exp(-C/\lambda)$
2: Initialize $u^{(0)} = \mathbf{1}_n, v^{(0)} = \mathbf{1}_m$
3: for $\ell = 1$ to $L$:
4: $u^{(\ell)} = p \oslash (K v^{(\ell-1)})$ (element-wise division)
5: $v^{(\ell)} = q \oslash (K^T u^{(\ell)})$
6: Return $\pi = \text{diag}(u^{(L)}) K \text{diag}(v^{(L)})$
✅ Complexity: $O(Lnm)$, where $L$ is the number of iterations
Much faster than exact LP solvers, which take $O(nm(n+m)\log(n+m))$
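A direct NumPy transcription of the iterations above, as a minimal sketch: the test data, $\lambda$, and the fixed iteration count are illustrative choices, and for very small $\lambda$ one would switch to log-domain updates to keep $K = e^{-C/\lambda}$ from underflowing.

```python
import numpy as np

def sinkhorn(p, q, C, lam=0.1, n_iter=200):
    """Entropy-regularized OT plan via Sinkhorn matrix scaling (steps 1-6 above)."""
    K = np.exp(-C / lam)                     # 1: Gibbs kernel
    u, v = np.ones_like(p), np.ones_like(q)  # 2: initialize scaling vectors
    for _ in range(n_iter):                  # 3-5: alternating marginal corrections
        u = p / (K @ v)
        v = q / (K.T @ u)
    return u[:, None] * K * v[None, :]       # 6: pi = diag(u) K diag(v)

rng = np.random.default_rng(0)
n, m = 5, 7
C = rng.uniform(size=(n, m))                 # arbitrary cost matrix
p = np.full(n, 1.0 / n)                      # uniform source marginal
q = np.full(m, 1.0 / m)                      # uniform target marginal

pi = sinkhorn(p, q, C)
print("row sums match p:", np.allclose(pi.sum(axis=1), p))
print("col sums match q:", np.allclose(pi.sum(axis=0), q))
print("regularized transport cost <C, pi> =", (C * pi).sum())
```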
🔍 Spot the Mistake! Challenge
Which statement about the Sinkhorn algorithm is INCORRECT?
Sinkhorn Generative Modeling
Use the Sinkhorn algorithm to train generators by minimizing the optimal transport distance between the generated and real distributions!
Sinkhorn Generative Architecture
Generator produces samples → compare to real data via Sinkhorn → backprop loss to update $\theta$
Algorithm: Sinkhorn Generative Modeling
Input: Real samples, prior $p_z(z)$, iterations $N$, regularization $\epsilon$, learning rate $\alpha$
1: Initialize generator parameters $\theta$ randomly
2: for $k=1$ to $N$:
3: Sample mini-batch $\{y^i\}_{i=1}^n$ from real data
4: Sample $\{z^i\}_{i=1}^m$ from $p_z(z)$
5: Generate $\{x^i\}_{i=1}^m$ via $x^i = G_\theta(z^i)$
6: Compute cost matrix $C$ with $C_{ij} = c(x^i, y^j)$
7: Run Sinkhorn to get $(u, v)$
8: Compute loss: $\hat{E}_L(\theta) = \langle u_L, (C \odot K)\, v_L \rangle = \langle C, \pi \rangle$
9: Update: $\theta \leftarrow \theta - \alpha \nabla_\theta \hat{E}_L(\theta)$
💡 Key Insight: Loss is differentiable w.r.t. $\theta$ through the generated samples $x^i$!
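A minimal PyTorch sketch of this training loop, assuming a toy 2-D target distribution, a small MLP generator, squared Euclidean cost, and uniform batch weights; none of these choices (architecture, $\epsilon$, batch sizes, optimizer) come from the text, they are illustrative. Because every step of the Sinkhorn loop is differentiable, `loss.backward()` propagates gradients through $u$, $v$, and $C$ into $\theta$.

```python
import torch
import torch.nn as nn

def sinkhorn_loss(x, y, eps=0.5, n_iter=50):
    """Differentiable entropic OT cost between two uniform point clouds."""
    m, n = x.shape[0], y.shape[0]
    C = torch.cdist(x, y, p=2) ** 2                  # C_ij = ||x_i - y_j||^2
    K = torch.exp(-C / eps)                          # Gibbs kernel (underflows for tiny eps;
                                                     # use log-domain updates in that regime)
    p = torch.full((m,), 1.0 / m)
    q = torch.full((n,), 1.0 / n)
    u, v = torch.ones_like(p), torch.ones_like(q)
    for _ in range(n_iter):                          # Sinkhorn scaling iterations
        u = p / (K @ v)
        v = q / (K.T @ u)
    return torch.einsum("i,ij,j->", u, C * K, v)     # <C, diag(u) K diag(v)>

# Toy setup: 2-D Gaussian "real" data and a small MLP generator (all choices illustrative).
generator = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(generator.parameters(), lr=1e-3)

for step in range(1000):
    y = torch.randn(128, 2) + torch.tensor([1.0, 0.5])   # mini-batch of "real" samples
    z = torch.randn(128, 8)                              # prior samples z ~ p_z
    x = generator(z)                                     # generated samples x = G_theta(z)
    loss = sinkhorn_loss(x, y)                           # entropic OT cost, differentiable in theta
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 200 == 0:
        print(f"step {step}: sinkhorn loss = {loss.item():.4f}")
```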
Key Takeaways
📊 OT Basics
Optimal transport finds the minimum-cost coupling between distributions
⚡ Sinkhorn
Entropy regularization enables a fast $O(Lnm)$ matrix-scaling algorithm
🎨 Generation
Use the OT distance as a differentiable loss for training generators
"From earth moving to pixel pushingβoptimal transport for generative AI!"