Hierarchical VAE
VAE, or Variational Autoencoder, appears straightforward at first glance, doesn't it? But what if we envision a more expansive version? How could we achieve this? And perhaps most importantly, why should we pursue such a modification?
Furthermore, is it feasible to stack multiple VAEs one on top of the other, creating a hierarchical structure? How might we go about implementing such a scheme? These questions form the basis of our exploration into extending VAEs to hierarchical VAEs, paving the way for deeper understanding and more powerful generative models.
💡 The Idea
In envisioning the stacking of VAEs, we can consider a chain of nodes: \( x_1, x_2, \dots, x_5 \). Each node in the sequence is "conditioned" on all previous nodes. For instance, \( x_3 \) is conditioned on \( x_1 \) and \( x_2 \). This conditioning reflects the encoding or "forward process" in the direction of the arrows.
Figure 1: Is this what a hierarchical VAE would look like?
Using the product rule (the chain rule of probability), we can express the joint distribution \( q(x_1, \ldots, x_5) \) as:
\[ q(x_1, \ldots, x_5) = q(x_1)\, q(x_2|x_1)\, q(x_3|x_1, x_2)\, q(x_4|x_1, x_2, x_3)\, q(x_5|x_1, \ldots, x_4) \]
This factorization shows how each variable \( x_i \) is conditioned on all previous variables in the forward process.
❓ Critical Questions
Should we consider all possible dependencies?
How could we simplify? Can we drop some dependencies?
These questions prompt us to consider the trade-offs between model complexity and computational efficiency. While capturing all possible dependencies may lead to a more accurate representation of the data, it also increases the computational burden and model complexity.
Markovian VAE
The answer is yes. We can simplify using the Markov property: each node depends only on the previous node!
Figure 2: Markovian Forward Process
Each node depends only on the node before it (Markov property)
Figure 3: Markovian Backward Process
Reverse process - also Markovian!
📊 Joint Distributions
The joint distribution for the forward and backward processes becomes much simpler:
Joint Distribution (simplified thanks to Markov!):
\[ p(x, z_{1:T}) = p(z_T)\, p_{\theta}(x|z_1) \prod_{t=2}^{T} p_{\theta}(z_{t-1}|z_t) \]
Posterior Distribution:
\[ q_\phi(z_{1:T}|x) = q_\phi(z_1|x) \prod_{t=2}^{T} q_\phi(z_t|z_{t-1}) \]
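To see this factorization concretely, here is a minimal numpy sketch that evaluates the log-posterior as a sum of per-transition Gaussian log-densities. The identity-mean, fixed-variance transitions are a toy assumption purely for illustration; a real encoder would predict the means and variances.

```python
import numpy as np

def gauss_logpdf(x, mean, var):
    """Log-density of an isotropic Gaussian N(mean, var * I), summed over dimensions."""
    return -0.5 * np.sum((x - mean) ** 2 / var + np.log(2.0 * np.pi * var), axis=-1)

def log_q_forward(x, zs, sigma2=1.0):
    """log q(z_{1:T} | x) under the Markov factorization: q(z_1|x) * prod_t q(z_t|z_{t-1})."""
    logq = gauss_logpdf(zs[0], x, sigma2)               # q(z_1 | x)
    for t in range(1, len(zs)):
        logq += gauss_logpdf(zs[t], zs[t - 1], sigma2)  # q(z_t | z_{t-1}): only one parent!
    return logq
```

Note how the Markov property shows up directly in the loop: each term looks back exactly one step.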
ELBO Derivation for Markovian VAE
We now derive the Evidence Lower Bound (ELBO) step by step:
📝 Step-by-Step Derivation
Step 1: Start with log-likelihood
\[ \log p(x) = \log \int p(x, z_{1:T})\, dz_{1:T} \]
(Integrate out the latent variables.)
Step 2: Multiply and divide by \(q_{\phi}\)
\[ = \log \int \frac{p(x, z_{1:T})}{q_\phi(z_{1:T}|x)} q_\phi(z_{1:T}|x)\, dz_{1:T} \]
Step 3: Write as an expectation
\[ = \log \mathbb{E}_{q_\phi(z_{1:T}|x)} \left[ \frac{p(x, z_{1:T})}{q_\phi(z_{1:T}|x)} \right] \]
Step 4: Apply Jensen's inequality
\[ \geq \mathbb{E}_{q_\phi(z_{1:T}|x)} \left[ \log \frac{p(x, z_{1:T})}{q_\phi(z_{1:T}|x)} \right] \]
(The log is concave, so Jensen's inequality gives a lower bound.)
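The inequality in Step 4 is easy to check numerically. For any positive random variable \( w \) (standing in for the ratio \( p/q \) inside the expectation), \( \log \mathbb{E}[w] \geq \mathbb{E}[\log w] \); a quick sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.lognormal(size=100_000)   # an arbitrary positive random variable
print(np.log(w.mean()))           # log E[w]  ~ 0.5
print(np.log(w).mean())           # E[log w] ~ 0.0  <=  log E[w], as Jensen promises
```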
VDM: Variational Diffusion Models
🔑 Key Changes from HVAE to VDM
- \( z_i \) is replaced by \( x_i \): same-dimensional latents!
- Traditional HVAE assumes \( \text{size}(z_i) < \text{size}(x_0) \), but VDM has \( \text{size}(x_i)=\text{size}(x_0) \)
- The encoder is a fixed, unparameterized \( q \) rather than a learned \( q_{\phi} \)
- If the latent encoder is not learned, how do we define the forward process?
- At each timestep, \( q \) is assumed to be a linear Gaussian that is "easy" to propagate
- The Gaussian parameters are chosen so that the final distribution \( p_T \) is a standard Gaussian \( \mathcal{N}(0,I) \)
📋 The Forward Process Plan
- Define the Gaussian encoder's mean as \( \mu_t = \sqrt{\alpha_t}\, x_{t-1} \)
- Define the Gaussian encoder's variance as \( \Sigma_t = (1 - \alpha_t)I \)
- Transitions within the encoder represented as: \[ q(x_t|x_{t-1}) = \mathcal{N}(x_t; \sqrt{\alpha_t}x_{t-1}, (1 - \alpha_t)I) \]
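As a concrete illustration, here is a minimal numpy sketch of a single forward transition; the schedule value \( \alpha_t \) is assumed to come from some externally chosen noise schedule, which the text has not yet specified.

```python
import numpy as np

def q_step(x_prev, alpha_t, rng):
    """One forward transition q(x_t | x_{t-1}) = N(sqrt(alpha_t) * x_{t-1}, (1 - alpha_t) * I),
    sampled via the reparameterization x_t = sqrt(alpha_t) * x_{t-1} + sqrt(1 - alpha_t) * eps."""
    eps = rng.standard_normal(x_prev.shape)   # eps ~ N(0, I)
    return np.sqrt(alpha_t) * x_prev + np.sqrt(1.0 - alpha_t) * eps
```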
🎯 Explicit Gaussian Form for \( q(x_t|x_0) \)
A beautiful closed-form solution emerges from recursive application:
\[ x_t \sim \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}x_0, (1 - \bar{\alpha}_t)I), \quad \text{where}~ \bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i \]
Proof by recursion:
\begin{align*} x_t &= \sqrt{\alpha_t} x_{t-1} + \sqrt{1 - \alpha_t} \epsilon_{t-1}^* \\ &= \sqrt{\alpha_t}(\sqrt{\alpha_{t-1}} x_{t-2} + \sqrt{1 - \alpha_{t-1}} \epsilon_{t-2}^*) + \sqrt{1 - \alpha_t} \epsilon_{t-1}^* \\ &= \sqrt{\alpha_t\alpha_{t-1}} x_{t-2} + \sqrt{\alpha_t - \alpha_t\alpha_{t-1}} \epsilon_{t-2}^* + \sqrt{1 - \alpha_t} \epsilon_{t-1}^* \\ &= \sqrt{\alpha_t \alpha_{t-1}} x_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}} \epsilon_{t-2} \end{align*}
Note: We used the property that, for independent \( \epsilon_i, \epsilon_j \sim \mathcal{N}(0, I) \), the sum \( \gamma \epsilon_i + \zeta \epsilon_j \sim \mathcal{N}(0, (\gamma^2 + \zeta^2)I) \).
Continuing this recursion until \( t=0 \):
\[ x_t = \sqrt{\bar{\alpha}_t}x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon_0 \sim \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}x_0, (1 - \bar{\alpha}_t)I) \]
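In code, the closed form lets us jump to any timestep in one shot instead of iterating \( t \) transitions. A hedged numpy sketch, where `alphas` is an assumed array holding \( \alpha_1, \dots, \alpha_T \):

```python
import numpy as np

def q_sample(x0, t, alphas, rng):
    """Sample x_t ~ q(x_t | x_0) directly, using alpha_bar_t = prod_{i=1..t} alpha_i."""
    alpha_bar_t = np.prod(alphas[:t])        # alphas[:t] holds alpha_1 .. alpha_t
    eps = rng.standard_normal(x0.shape)      # eps ~ N(0, I)
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
```

Iterating the single-step transition \( t \) times yields the same distribution; the one-shot form simply collapses the recursion above.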
DDPM Training Algorithm
🔧 Training Procedure
Input:
Training data \( \{x^{(\ell)}\}_{\ell=1}^{L} \) (e.g., images)
Repeat until convergence:
- Sample a random timestep: \( t \sim \mathrm{Uniform}\{1, \dots, T\} \)
- Sample noise: \( z \sim \mathcal{N}(0, I) \)
- Generate noisy sample: \( x_t = \sqrt{\bar{\alpha}_t}x_0 + \sqrt{1 - \bar{\alpha}_t}z \)
- Take a gradient step on \( \nabla_{\theta} \| \widehat{x}_{\theta}(x_t, t) - x_0 \|^2 \)
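Below is a minimal PyTorch sketch of this training loop. The tiny MLP denoiser, the linear beta schedule, and the random stand-in data are all illustrative assumptions, not choices made in the text.

```python
import torch

T = 1000
alphas = 1.0 - torch.linspace(1e-4, 0.02, T)      # assumed linear beta schedule
alpha_bars = torch.cumprod(alphas, dim=0)         # alpha_bar_t

class Denoiser(torch.nn.Module):
    """Hypothetical x0-predictor x_hat_theta(x_t, t); any architecture would do."""
    def __init__(self, dim):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim + 1, 128), torch.nn.ReLU(), torch.nn.Linear(128, dim))

    def forward(self, x_t, t):
        t_feat = t.float().unsqueeze(-1) / T      # crude timestep conditioning
        return self.net(torch.cat([x_t, t_feat], dim=-1))

model = Denoiser(dim=2)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(1000):                          # "repeat until convergence"
    x0 = torch.randn(64, 2)                       # stand-in for real training data
    t = torch.randint(0, T, (64,))                # random timestep (0-indexed here)
    z = torch.randn_like(x0)                      # z ~ N(0, I)
    ab = alpha_bars[t].unsqueeze(-1)
    x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * z  # x_t = sqrt(ab) x0 + sqrt(1 - ab) z
    loss = ((model(x_t, t) - x0) ** 2).mean()     # || x_hat_theta(x_t, t) - x0 ||^2
    opt.zero_grad()
    loss.backward()
    opt.step()
```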
🎨 Sampling/Inference Procedure
Initialize:
Start with pure noise: \( x_T \sim \mathcal{N}(0, I) \)
For \( t = T \) down to 1:
- Compute denoised estimate: \( \widehat{x}_{\theta}(x_t) \)
- Sample noise: \( z \sim \mathcal{N}(0, I) \) if \( t > 1 \), else \( z = 0 \)
- Update: \[ x_{t-1} = \frac{(1 - \bar{\alpha}_{t-1})\sqrt{\alpha_t}}{1-\bar{\alpha}_t} x_t + \frac{(1 - \alpha_t)\sqrt{\bar{\alpha}_{t-1}}}{1 - \bar{\alpha}_t} \widehat{x}_{\theta}(x_t) + \sigma_q(t)z \]
Output: \( x_0 \) (generated sample)
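A numpy sketch of this sampler, assuming a trained x0-predicting `denoiser(x_t, t)` is available as a callable. The text leaves \( \sigma_q(t) \) unspecified; the sketch uses the common choice \( \sigma_q(t)^2 = (1 - \alpha_t)(1 - \bar{\alpha}_{t-1}) / (1 - \bar{\alpha}_t) \), which is an assumption here.

```python
import numpy as np

def ddpm_sample(denoiser, alphas, shape, rng):
    """Ancestral sampling: start from pure noise and iteratively denoise (0-indexed arrays)."""
    T = len(alphas)
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)                        # x_T ~ N(0, I)
    for t in range(T - 1, 0, -1):
        x0_hat = denoiser(x, t)                           # denoised estimate of x_0
        ab_t, ab_prev = alpha_bars[t], alpha_bars[t - 1]
        coef_xt = (1.0 - ab_prev) * np.sqrt(alphas[t]) / (1.0 - ab_t)
        coef_x0 = (1.0 - alphas[t]) * np.sqrt(ab_prev) / (1.0 - ab_t)
        sigma_t = np.sqrt((1.0 - alphas[t]) * (1.0 - ab_prev) / (1.0 - ab_t))
        z = rng.standard_normal(shape) if t > 1 else 0.0  # no noise on the final step
        x = coef_xt * x + coef_x0 * x0_hat + sigma_t * z
    return x
```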
🎯 Quiz 1: Test Your Understanding of DDPM
What is the key difference between HVAE and VDM?
Signal-to-Noise Ratio (SNR)
📈 SNR Definition
For the forward process \( q(x_t \mid x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} x_0, (1 - \bar{\alpha}_t) I) \):
\[ \text{SNR}(t) = \frac{\bar{\alpha}_t}{1 - \bar{\alpha}_t} \]
This ratio tells us how much signal remains versus noise at timestep \( t \).
💡 Key Insight
The loss can be rewritten in terms of SNR difference:
\[ L_t(x) = \frac{1}{2} \left( \text{SNR}(t-1) - \text{SNR}(t) \right) \| \widehat{x}_{\theta}(x_t, t) - x_0\|_2^2 \]
Since SNR decreases with \( t \), this means:
- Minimizing loss = Denoising!
- A larger SNR difference means more emphasis on that timestep
- We're learning to reverse the noise addition process
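A small numpy sketch of these quantities, again assuming a linear beta schedule for concreteness:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)           # assumed linear beta schedule
alpha_bars = np.cumprod(1.0 - betas)

snr = alpha_bars / (1.0 - alpha_bars)        # SNR(t) = alpha_bar_t / (1 - alpha_bar_t)
weights = 0.5 * (snr[:-1] - snr[1:])         # loss weight: (SNR(t-1) - SNR(t)) / 2

assert np.all(np.diff(snr) < 0)              # SNR strictly decreases with t ...
assert np.all(weights > 0)                   # ... so every timestep gets a positive weight
```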
Score Matching & Unnormalized Models
🧮 Unnormalized Probability Models
Consider an unnormalized probability model:
\[ p_\theta(x) = \frac{e^{-f_\theta(x)}}{Z_\theta}, \quad Z_\theta = \int e^{-f_\theta(x)}\, dx \]
Problem: \( Z_\theta \) is often intractable to compute!
📐 Score Function
The score is the gradient of log-density:
\[ \nabla_x \log p_{\theta}(x) = -\nabla_x f_{\theta}(x) \]
✨ Beautiful property: the score doesn't depend on \( Z_\theta \)!
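A toy illustration with a quadratic energy (a standard Gaussian up to normalization, chosen here only because its gradient is analytic): the score never references \( Z_\theta \), and shifting the energy by a constant, which rescales \( Z_\theta \), leaves the score untouched.

```python
import numpy as np

def f(x, c=0.0):
    """Unnormalized energy f(x) = 0.5 * ||x||^2 + c; the constant c only rescales Z."""
    return 0.5 * np.sum(x ** 2) + c

def score(x):
    """Score = -grad_x f(x) = -x, computed analytically; Z never appears."""
    return -x

x = np.array([1.0, -2.0])
print(score(x))   # [-1.  2.]: points toward the mode at the origin,
                  # and is identical whether c = 0 or c = 100 in f
```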
📏 Fisher Divergence / Score Matching
We can train unnormalized models by minimizing:
\[ \frac{1}{2} \mathbb{E}_{p_{\text{data}}} \left[ \| \nabla_x \log p_{\text{data}}(x) - \nabla_x \log p_\theta(x) \|_2^2 \right] \]
This bypasses the need to compute \( Z_\theta \)!
Langevin Dynamics
Langevin dynamics provides a powerful method to sample from a distribution using only its score function!
🌀 Langevin Dynamics Algorithm
Given score function \( \nabla_x \log p(x) \), iterate:
\[ x_{t+1} = x_t + \tau \nabla_x \log p(x_t) + \sqrt{2\tau}\, z, \quad z \sim \mathcal{N}(0, I) \]
\( \tau \): step size parameter
\( \nabla_x \log p(x_t) \): Guides toward high probability regions
\( \sqrt{2\tau} z \): Stochastic noise for exploration
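A minimal numpy sketch of the iteration, assuming access to an analytic score; here we sample from \( \mathcal{N}(\mu, I) \), whose score is \( \mu - x \):

```python
import numpy as np

def langevin_sample(score_fn, x_init, tau, n_steps, rng):
    """Unadjusted Langevin dynamics: x <- x + tau * score(x) + sqrt(2 * tau) * z."""
    x = x_init.copy()
    for _ in range(n_steps):
        z = rng.standard_normal(x.shape)     # fresh exploration noise each step
        x = x + tau * score_fn(x) + np.sqrt(2.0 * tau) * z
    return x

rng = np.random.default_rng(0)
mu = np.array([3.0, -1.0])
chains = langevin_sample(lambda x: mu - x,   # score of N(mu, I)
                         rng.standard_normal((5000, 2)), tau=0.1, n_steps=500, rng=rng)
print(chains.mean(axis=0))                   # close to mu
```

Dropping the noise term recovers plain gradient ascent on \( \log p(x) \); the noise is what makes this a sampler rather than an optimizer.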
🔗 Connection to Gradient Ascent
Langevin dynamics is like stochastic gradient ascent on \( \log p(x) \):
- Deterministic part: \( \tau \nabla_x \log p(x_t) \) moves toward maxima
- Stochastic part: \( \sqrt{2\tau} z \) enables exploration
- Balance between exploitation and exploration!
Denoising Score Matching
🎯 Denoising Score Matching Loss
Instead of explicit score matching, we can train on noisy data:
\[ \mathcal{J}_{\text{DSM}}(\theta) = \mathbb{E}_{p(x)} \left[ \frac{1}{2} \left\| s_\theta(x + \sigma z) + \frac{z}{\sigma} \right\|^2 \right], \quad z \sim \mathcal{N}(0, I) \]
Key idea: Train the model to predict the noise that was added!
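A hedged numpy sketch of this loss, with a small sanity check: when the data is standard Gaussian (an assumption used only for the check), the noisy marginal is \( \mathcal{N}(0, (1+\sigma^2)I) \), whose analytic score should beat a useless zero score.

```python
import numpy as np

def dsm_loss(score_fn, x, sigma, rng):
    """Monte Carlo estimate of E[ 0.5 * || s(x + sigma * z) + z / sigma ||^2 ]."""
    z = rng.standard_normal(x.shape)                   # the noise that was added
    resid = score_fn(x + sigma * z) + z / sigma        # model score vs. the -z/sigma target
    return 0.5 * np.mean(np.sum(resid ** 2, axis=-1))

rng = np.random.default_rng(0)
x = rng.standard_normal((10000, 2))                    # toy data ~ N(0, I)
sigma = 0.5
print(dsm_loss(lambda y: -y / (1.0 + sigma ** 2), x, sigma, rng))  # ~3.2 (near-optimal score)
print(dsm_loss(lambda y: np.zeros_like(y), x, sigma, rng))         # ~4.0 (zero score is worse)
```

Consistent with Vincent's theorem, even the optimal score leaves a nonzero constant in the DSM loss.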
📜 Vincent's Theorem (2011)
Denoising score matching and explicit score matching are equivalent (up to a constant):
\[ L_{\text{DSM}}(\theta) = L_{\text{ESM}}(\theta) + C \]
This means training on noisy data is as good as having access to the true score!
🎯 Quiz 2: Score Matching Understanding
Why is the score function useful for unnormalized models?
What role does noise play in Langevin dynamics?
Summary: The Diffusion Journey
🏗️ Hierarchical VAE → Markovian VAE
Simplified dependencies using the Markov property: each latent depends only on the previous one
🔄 VDM (Variational Diffusion Models)
Same-dimensional latents, fixed Gaussian forward process, learn reverse process
⚙️ DDPM Algorithm
Training: add noise at random timesteps. Inference: iteratively denoise from pure noise
💡 Signal-to-Noise Ratio
\( \text{SNR}(t) = \bar{\alpha}_t / (1 - \bar{\alpha}_t) \) controls the diffusion schedule and decreases with time
🎯 Score Matching
Train unnormalized models via their score (the gradient of the log-density), bypassing the intractable partition function
🌀 Langevin Dynamics
Sample using score function: gradient ascent + stochastic noise
🌍 The Big Picture
Diffusion models are a powerful class of generative models that gradually add noise to data (forward process) and learn to reverse this process (backward process). They achieve state-of-the-art results in image generation, connect deeply to score matching and Langevin dynamics, and provide a principled probabilistic framework for generation!