🌊

Diffusion Models

From Hierarchical VAEs to Denoising Diffusion Probabilistic Models, Score Matching, and Langevin Dynamics

HVAE VDM DDPM Score Matching Langevin Dynamics
๐Ÿ—๏ธ

Hierarchical VAE

VAE, or Variational Autoencoder, appears straightforward at first glance, doesn't it? But what if we envision a more expansive version? How could we achieve this? And perhaps most importantly, why should we pursue such a modification?

Furthermore, is it feasible to stack multiple VAEs one on top of the other, creating a hierarchical structure? How might we go about implementing such a scheme? These questions form the basis of our exploration into extending VAEs to hierarchical VAEs, paving the way for deeper understanding and more powerful generative models.

💡 The Idea

In envisioning the stacking of VAEs, we can consider a chain of nodes: \( x_1, x_2, \dots, x_5 \). Each node in the sequence is "conditioned" on all previous nodes. For instance, \( x_3 \) is conditioned on \( x_1 \) and \( x_2 \). This conditioning reflects the encoding or "forward process" in the direction of the arrows.

Figure 1: Is this what a hierarchical VAE would look like?

x₁ x₂ x₃ x₄ x₅

Using the product rule, we can express the joint distribution \( q(x_1, \ldots, x_5) \) as:

\[ q(x_1, \ldots, x_5) = q(x_1) q(x_2 \mid x_1) q(x_3 \mid x_1, x_2) q(x_4 \mid x_1, x_2, x_3) q(x_5 \mid x_1, x_2, x_3, x_4) \]

This equation illustrates how each variable \( x_i \) is conditioned on the previous variables in the forward process.

โ“ Critical Questions

Should we consider all possible dependencies?

How could we simplify? Can we drop some dependencies?

These questions prompt us to consider the trade-offs between model complexity and computational efficiency. While capturing all possible dependencies may lead to a more accurate representation of the data, it also increases the computational burden and model complexity.

⛓️

Markovian VAE

The answer is yes. We can simplify by using the Markov property: each node depends only on the previous node!

Figure 2: Markovian forward process over \( x \) and the latents \( z_{1:T} \); each node depends only on the previous node (Markov property).

Figure 3: Markovian backward process over the same chain; the reverse process is also Markovian!

๐Ÿ“ Joint Distributions

The joint distribution for the forward and backward processes becomes much simpler:

Joint Distribution (simplified thanks to Markov!):

\[ p(x, z_{1:T}) = p(z_T)p_{\theta}(x|z_1) \prod_{t=2}^{T} p_{\theta}(z_{t-1}|z_t) \]

Posterior Distribution:

\[ q_\phi(z_{1:T}|x) = q_\phi (z_1|x) \prod_{t=2}^{T} q_\phi(z_t|z_{t-1}) \]
📊

ELBO Derivation for Markovian VAE

We now derive the Evidence Lower Bound (ELBO) step by step:

๐Ÿ” Step-by-Step Derivation

Step 1: Start with log-likelihood

\[ \log p(x) = \log \int p(x, z_{1:T}) dz_{1:T} \]

(Integrate out latent variables)

Step 2: Multiply and divide by \(q_{\phi}\)

\[ = \log \int \frac{p(x, z_{1:T})}{q_\phi(z_{1:T}|x)} q_\phi(z_{1:T}|x) dz_{1:T} \]

Step 3: Write as expectation

\[ = \log \mathbb{E}_{q_\phi(z_{1:T}|x)} \left[ \frac{p(x, z_{1:T})}{q_\phi(z_{1:T}|x)} \right] \]

Step 4: Apply Jensen's inequality

\[ \geq \mathbb{E}_{q_\phi(z_{1:T}|x)} \left[ \log \frac{p(x, z_{1:T})}{q_\phi(z_{1:T}|x)} \right] \]

(Log is concave, so Jensen's inequality applies)
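Jensen's inequality in Step 4 is easy to see numerically: for any positive random variable \( r \), \( \log \mathbb{E}[r] \geq \mathbb{E}[\log r] \). A minimal sketch, with a toy log-normal sample standing in for the importance ratio \( p(x, z_{1:T})/q_\phi(z_{1:T}|x) \):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the importance ratio p(x, z)/q(z|x): any positive random variable works.
ratios = np.exp(rng.normal(size=100_000))

log_of_mean = np.log(ratios.mean())   # the log-likelihood side: log E[r]
mean_of_log = np.log(ratios).mean()   # the ELBO side:           E[log r]

# Since log is concave, Jensen gives log E[r] >= E[log r].
assert log_of_mean >= mean_of_log
```

The gap between the two sides is exactly the KL divergence between \( q_\phi \) and the true posterior, which is why tightening the ELBO improves the approximate posterior.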

🌊

VDM: Variational Diffusion Models

🔄 Key Changes from HVAE to VDM

  • \( z_i \) is replaced by \( x_i \): same-dimensional latents!
  • Traditional HVAE assumes \( \text{size}(z_i) < \text{size}(x_0) \), but VDM has \( \text{size}(x_i)=\text{size}(x_0) \)
  • The encoder is a fixed, unparametrized \( q \) rather than a learned \( q_{\phi} \)
  • If the latent encoder is not learned, how do we run the forward process?
  • At each timestep, \( q \) is assumed to be a linear Gaussian that is "easy" to propagate
  • The Gaussian parameters are chosen so that the final distribution \( p_T \) is a standard Gaussian \( \mathcal{N}(0,I) \)

📋 The Forward Process Plan

  • Define the Gaussian encoder's mean as \( \mu_t(x_t) = \sqrt{\alpha_t}x_{t-1} \)
  • Define the Gaussian encoder's variance as \( \Sigma_t(x_t) = (1 - \alpha_t)I \)
  • Transitions within the encoder represented as: \[ q(x_t|x_{t-1}) = \mathcal{N}(x_t; \sqrt{\alpha_t}x_{t-1}, (1 - \alpha_t)I) \]

🎯 Explicit Gaussian Form for \( q(x_t|x_0) \)

A beautiful closed-form solution emerges from recursive application:

\[ x_t \sim \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}x_0, (1 - \bar{\alpha}_t)I), \quad \text{where}~ \bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i \]

Proof by recursion:

\begin{align*} x_t &= \sqrt{\alpha_t} x_{t-1} + \sqrt{1 - \alpha_t} \epsilon_{t-1}^* \\ &= \sqrt{\alpha_t}(\sqrt{\alpha_{t-1}} x_{t-2} + \sqrt{1 - \alpha_{t-1}} \epsilon_{t-2}^*) + \sqrt{1 - \alpha_t} \epsilon_{t-1}^* \\ &= \sqrt{\alpha_t\alpha_{t-1}} x_{t-2} + \sqrt{\alpha_t - \alpha_t\alpha_{t-1}} \epsilon_{t-2}^* + \sqrt{1 - \alpha_t} \epsilon_{t-1}^* \\ &= \sqrt{\alpha_t \alpha_{t-1}} x_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}} \epsilon_{t-2} \end{align*}

Note: We used the property that \( \gamma \epsilon_i + \zeta \epsilon_j \sim \mathcal{N}(0, (\gamma^2 + \zeta^2)I) \) for independent \( \epsilon_i, \epsilon_j \sim \mathcal{N}(0, I) \)

Continuing this recursion until \( t=0 \):

\[ x_t = \sqrt{\bar{\alpha}_t}x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon_0 \sim \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}x_0, (1 - \bar{\alpha}_t)I) \]
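The closed form can be sanity-checked numerically: pushing many samples through the step-by-step recursion should reproduce the mean and variance of \( q(x_t|x_0) \). The schedule below is an assumed toy choice, not something fixed by the text:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50
alphas = 1.0 - np.linspace(1e-4, 0.02, T)   # assumed toy schedule for alpha_t
alpha_bar = np.cumprod(alphas)              # alpha_bar_t = prod_{i<=t} alpha_i

x0, n = 2.0, 200_000                        # a fixed scalar "data point", many chains

# Step-by-step recursion: x_t = sqrt(alpha_t) x_{t-1} + sqrt(1 - alpha_t) * eps
x = np.full(n, x0)
for a in alphas:
    x = np.sqrt(a) * x + np.sqrt(1.0 - a) * rng.normal(size=n)

# Closed form: x_T ~ N(sqrt(alpha_bar_T) * x0, 1 - alpha_bar_T)
mean_cf = np.sqrt(alpha_bar[-1]) * x0
var_cf = 1.0 - alpha_bar[-1]

print(x.mean(), mean_cf)   # the two means should agree
print(x.var(), var_cf)     # the two variances should agree
```

This is exactly why DDPM training can jump straight to any timestep \( t \) without simulating the whole chain.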
⚙️

DDPM Training Algorithm

🔧 Training Procedure

Input:

Training data \( \{x^{(\ell)}\}_{\ell=1}^L \) (e.g., images)

Repeat until convergence:

  1. Sample a training example: \( x_0 \sim \{x^{(\ell)}\}_{\ell=1}^L \)
  2. Sample random timestep: \( t \sim \mathcal{U}(1, T) \)
  3. Sample noise: \( z \sim \mathcal{N}(0, I) \)
  4. Generate noisy sample: \( x_t = \sqrt{\bar{\alpha}_t}x_0 + \sqrt{1 - \bar{\alpha}_t}z \)
  5. Take a gradient step on: \( \nabla_{\theta} \| \widehat{x}_{\theta}(x_t) - x_0 \|^2 \)
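The training loop can be sketched end to end. Everything model-related below is a stand-in, since the text fixes no architecture: a scalar linear "denoiser" \( \widehat{x}_\theta(x_t) = \theta x_t \), a toy 1-D dataset, and an assumed noise schedule:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100
alphas = 1.0 - np.linspace(1e-4, 0.02, T)         # assumed noise schedule
alpha_bar = np.cumprod(alphas)

data = rng.normal(loc=3.0, scale=0.5, size=1024)  # toy 1-D "dataset"
theta, lr = 0.0, 1e-3                             # toy linear denoiser x_hat(x_t) = theta * x_t

for step in range(2000):
    x0 = rng.choice(data, size=64)                # sample training examples
    t = rng.integers(0, T, size=64)               # sample random timesteps
    z = rng.normal(size=64)                       # sample noise
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * z
    # Gradient of the batch-mean of ||x_hat(x_t) - x0||^2 w.r.t. theta:
    grad = np.mean(2.0 * (theta * xt - x0) * xt)
    theta -= lr * grad
```

With a real network, \( \theta \) becomes millions of parameters and the hand-written gradient step becomes an optimizer update, but the sampling-and-regression structure is the same.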

🎨 Sampling/Inference Procedure

Initialize:

Start with pure noise: \( x_T \sim \mathcal{N}(0, I) \)

For \( t = T \) down to 1:

  1. Compute denoised estimate: \( \widehat{x}_{\theta}(x_t) \)
  2. Sample noise: \( z \sim \mathcal{N}(0, I) \) if \( t > 1 \), else \( z = 0 \)
  3. Update: \[ x_{t-1} = \frac{(1 - \bar{\alpha}_{t-1})\sqrt{\alpha_t}}{1-\bar{\alpha}_t} x_t + \frac{(1 - \alpha_t)\sqrt{\bar{\alpha}_{t-1}}}{1 - \bar{\alpha}_t} \widehat{x}_{\theta}(x_t) + \sigma_q(t)z \]

Output: \( x_0 \) (generated sample)
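The sampling loop above can be sketched as follows. The denoiser is a placeholder, the schedule is assumed, and \( \sigma_q(t) \) is taken to be \( \sqrt{(1-\alpha_t)(1-\bar{\alpha}_{t-1})/(1-\bar{\alpha}_t)} \), the posterior standard deviation that falls out of the same Gaussian derivation:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100
alphas = 1.0 - np.linspace(1e-4, 0.02, T)    # assumed noise schedule
alpha_bar = np.cumprod(alphas)

def x_hat(xt, t):
    """Placeholder denoiser; a real DDPM uses a trained network here."""
    return np.sqrt(alpha_bar[t]) * xt

x = rng.normal(size=16)                      # x_T ~ N(0, I)
for t in range(T - 1, -1, -1):               # t = T down to 1 (0-indexed here)
    ab_prev = alpha_bar[t - 1] if t > 0 else 1.0
    z = rng.normal(size=16) if t > 0 else 0.0
    sigma = np.sqrt((1 - alphas[t]) * (1 - ab_prev) / (1 - alpha_bar[t]))
    x = ((1 - ab_prev) * np.sqrt(alphas[t]) / (1 - alpha_bar[t]) * x
         + (1 - alphas[t]) * np.sqrt(ab_prev) / (1 - alpha_bar[t]) * x_hat(x, t)
         + sigma * z)
```

Note how the last step (\( t = 1 \) in the text's indexing) adds no noise, matching step 2 of the procedure.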

🎯 Quiz 1: Test Your Understanding of DDPM

What is the key difference between HVAE and VDM?

A) VDM uses larger latent dimensions
B) VDM has learned forward process
C) VDM has same-dimensional latents and fixed Gaussian forward process
D) VDM doesn't use variational inference
📡

Signal-to-Noise Ratio (SNR)

๐Ÿ“ SNR Definition

For the forward process \( q(x_t \mid x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} x_0, (1 - \bar{\alpha}_t) I) \):

\[ \text{SNR}(t) = \frac{\bar{\alpha}_t}{1 - \bar{\alpha}_t} \]

This ratio tells us how much signal remains versus noise at timestep \( t \).

💡 Key Insight

The loss can be rewritten in terms of SNR difference:

\[ L_t(x) = \frac{1}{2} \left( \text{SNR}(t-1) - \text{SNR}(t) \right) \| \widehat{x}_{\theta}(x_t, t) - x_0\|_2^2 \]

Since SNR decreases with \( t \), this means:

  • Minimizing loss = Denoising!
  • Larger SNR difference → More emphasis on that timestep
  • We're learning to reverse the noise addition process
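The monotone decrease of SNR is easy to verify numerically. With an assumed linear schedule for \( 1 - \alpha_t \) (a common toy choice, not specified in the text), \( \text{SNR}(t) = \bar{\alpha}_t / (1 - \bar{\alpha}_t) \) falls from very large to near zero:

```python
import numpy as np

T = 1000
alphas = 1.0 - np.linspace(1e-4, 0.02, T)   # assumed linear schedule for 1 - alpha_t
alpha_bar = np.cumprod(alphas)

snr = alpha_bar / (1.0 - alpha_bar)

assert np.all(np.diff(snr) < 0)   # SNR strictly decreases with t
print(snr[0], snr[-1])            # large at t = 1, near zero at t = T
```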
🎯

Score Matching & Unnormalized Models

📊 Unnormalized Probability Models

Consider an unnormalized probability model:

\[ p_\theta(x) = \frac{e^{-f_\theta(x)}}{Z_\theta}, \quad Z_\theta = \int e^{-f_\theta(x)} dx \]

Problem: \( Z_\theta \) is often intractable to compute!

🌟 Score Function

The score is the gradient of log-density:

\[ \nabla_x \log p_{\theta}(x) = -\nabla_x f_{\theta}(x) \]

✨ Beautiful property: the score doesn't depend on \( Z_\theta \)!

🎓 Fisher Divergence / Score Matching

We can train unnormalized models by minimizing:

\[ \frac{1}{2} \mathbb{E}_{p_{\text{data}}} \left[ \| \nabla_x \log p_{\text{data}}(x) - \nabla_x \log p_\theta(x) \|_2^2 \right] \]

This bypasses the need to compute \( Z_\theta \)!
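A quick numeric illustration with toy, assumed values: for a 1-D Gaussian energy \( f(x) = (x-\mu)^2 / 2\sigma^2 \), the score \( -(x-\mu)/\sigma^2 \) can be checked by finite differences on \( -f \) alone; the constant \( \log Z_\theta \) would cancel in the derivative anyway.

```python
import numpy as np

mu, sigma = 1.5, 0.7   # toy parameters

def f(x):
    """Energy of an unnormalized Gaussian; Z_theta is never computed."""
    return (x - mu) ** 2 / (2 * sigma ** 2)

def score(x):
    """Analytic score: grad_x log p(x) = -grad_x f(x)."""
    return -(x - mu) / sigma ** 2

# Central finite difference of log p ~ -f (the constant log Z cancels in any derivative).
x, h = 0.3, 1e-5
fd = (-f(x + h) + f(x - h)) / (2 * h)
print(fd, score(x))   # both ~ 2.449
```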

🔄

Langevin Dynamics

Langevin dynamics provides a powerful method to sample from a distribution using only its score function!

🌀 Langevin Dynamics Algorithm

Given score function \( \nabla_x \log p(x) \), iterate:

\[ x_{t+1} = x_t + \tau \nabla_x \log p(x_t) + \sqrt{2\tau} z, \quad z \sim \mathcal{N}(0, I) \]

\( \tau \): Step size parameter

\( \nabla_x \log p(x_t) \): Guides toward high probability regions

\( \sqrt{2\tau} z \): Stochastic noise for exploration

๐Ÿ” Connection to Gradient Ascent

Langevin dynamics is like stochastic gradient ascent on \( \log p(x) \):

  • Deterministic part: \( \tau \nabla_x \log p(x_t) \) moves toward maxima
  • Stochastic part: \( \sqrt{2\tau} z \) enables exploration
  • Balance between exploitation and exploration!
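The update rule can be sketched for a target whose score we know in closed form, a 1-D Gaussian \( \mathcal{N}(\mu, \sigma^2) \) with arbitrary toy values: after enough iterations, the particle cloud settles into the target distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 0.5   # toy target N(mu, sigma^2)

def score(x):
    """Score of the target: grad_x log p(x) = -(x - mu) / sigma^2."""
    return -(x - mu) / sigma ** 2

tau = 1e-3                           # step size
x = rng.normal(size=10_000)          # particles initialized far from the target
for _ in range(2000):
    z = rng.normal(size=10_000)
    x = x + tau * score(x) + np.sqrt(2 * tau) * z

print(x.mean(), x.std())             # should approach mu = 2.0 and sigma = 0.5
```

Dropping the noise term would collapse every particle onto the mode at \( \mu \); the \( \sqrt{2\tau} z \) term is what spreads the samples into the full distribution.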
🧹

Denoising Score Matching

🎯 Denoising Score Matching Loss

Instead of explicit score matching, we can train on noisy data:

\[ \mathcal{J}_{\text{DSM}}(\theta) = \mathbb{E}_{x \sim p(x),\; z \sim \mathcal{N}(0, I)} \left[ \frac{1}{2} \left\| s_\theta(x + \sigma z) + \frac{z}{\sigma} \right\|^2 \right] \]

Key idea: Train the model to predict the noise that was added!

🌟 Vincent's Theorem (2011)

Denoising score matching and explicit score matching are equivalent (up to a constant):

\[ \mathcal{J}_{\text{DSM}}(\theta) = \mathcal{J}_{\text{ESM}}(\theta) + C \]

This means training on noisy data is as good as having access to the true score!
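A toy sanity check of the DSM objective (an assumed setup, not from the text): fit a linear score model \( s_\theta(y) = \theta y \) to standard-normal data corrupted with noise scale \( \sigma \). The loss is quadratic in \( \theta \), so the minimizer has a closed form, and it recovers the true score slope \( -1/(1+\sigma^2) \) of the noised marginal \( \mathcal{N}(0, 1+\sigma^2) \):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5
x = rng.normal(size=200_000)         # clean data ~ N(0, 1)
z = rng.normal(size=200_000)         # noise ~ N(0, I)
y = x + sigma * z                    # noisy data

# For a linear model s_theta(y) = theta * y, the DSM loss
# E[ 0.5 * (theta * y + z / sigma)^2 ] is quadratic in theta,
# so the minimizer has a closed form:
theta = -np.mean(y * z / sigma) / np.mean(y ** 2)

# True score of the noised marginal N(0, 1 + sigma^2) is -y / (1 + sigma^2).
print(theta, -1.0 / (1.0 + sigma ** 2))   # both ~ -0.8
```

Even though the model only ever sees noisy samples and the added noise, it ends up matching the true score of the noisy distribution, which is Vincent's equivalence in miniature.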

🎯 Quiz 2: Score Matching Understanding

Why is the score function useful for unnormalized models?

A) It's easier to compute than the probability
B) It doesn't depend on the intractable normalization constant
C) It gives exact samples from the distribution
D) It reduces computational complexity to O(1)

What role does noise play in Langevin dynamics?

A) It slows down convergence
B) It enables exploration and escaping local maxima
C) It's not necessary, just traditional
D) It makes the algorithm deterministic
📚

Summary: The Diffusion Journey

๐Ÿ—๏ธ Hierarchical VAE โ†’ Markovian VAE

Simplified dependencies using the Markov property: each latent depends only on the previous one

🌊 VDM (Variational Diffusion Models)

Same-dimensional latents, fixed Gaussian forward process, learn reverse process

โš™๏ธ DDPM Algorithm

Training: add noise at random timesteps. Inference: iteratively denoise from pure noise

📡 Signal-to-Noise Ratio

\( \text{SNR}(t) = \bar{\alpha}_t/(1 - \bar{\alpha}_t) \) controls the diffusion schedule and decreases with time

🎯 Score Matching

Train unnormalized models via gradients, bypasses intractable partition function

🔄 Langevin Dynamics

Sample using score function: gradient ascent + stochastic noise

🎓 The Big Picture

Diffusion models are a powerful class of generative models that gradually add noise to data (forward process) and learn to reverse this process (backward process). They achieve state-of-the-art results in image generation, connect deeply to score matching and Langevin dynamics, and provide a principled probabilistic framework for generation!