Hierarchical VAE
VAE, or Variational Autoencoder, appears straightforward at first glance, doesn't it? But what if we envision a more expansive version? How could we achieve this? And perhaps most importantly, why should we pursue such a modification?
Furthermore, is it feasible to stack multiple VAEs one on top of the other, creating a hierarchical structure? How might we go about implementing such a scheme? These questions form the basis of our exploration into extending VAEs to hierarchical VAEs, paving the way for deeper understanding and more powerful generative models.
💡 The Idea
In envisioning the stacking of VAEs, we can consider a chain of nodes: \( x_1, x_2, \dots, x_5 \). Each node in the sequence is "conditioned" on all previous nodes. For instance, \( x_3 \) is conditioned on \( x_1 \) and \( x_2 \). This conditioning reflects the encoding or "forward process" in the direction of the arrows.
Figure 1: Is this what a hierarchical VAE would look like?
Using the product rule (the chain rule of probability), we can express the joint distribution \( q(x_1, \ldots, x_5) \) as:
\[ q(x_1, \ldots, x_5) = q(x_1)\, q(x_2|x_1)\, q(x_3|x_1, x_2)\, q(x_4|x_1, x_2, x_3)\, q(x_5|x_1, \ldots, x_4) \]
This factorization shows how each variable \( x_i \) is conditioned on all previous variables in the forward process.
❓ Critical Questions
Should we consider all possible dependencies?
How could we simplify? Can we drop some dependencies?
These questions prompt us to consider the trade-offs between model complexity and computational efficiency. While capturing all possible dependencies may lead to a more accurate representation of the data, it also increases the computational burden and model complexity.
Markovian VAE
The answer is yes. We can simplify using the Markov property: each node depends only on the previous node!
Figure 2: Markovian Forward Process
Each node depends only on the node before it (Markov property)
Figure 3: Markovian Backward Process
Reverse process - also Markovian!
📊 Joint Distributions
The joint distribution for the forward and backward processes becomes much simpler:
Joint Distribution (simplified thanks to Markov!):
\[ p(x, z_{1:T}) = p(z_T)\, p_{\theta}(x|z_1) \prod_{t=2}^{T} p_{\theta}(z_{t-1}|z_t) \]
Posterior Distribution:
\[ q_\phi(z_{1:T}|x) = q_\phi(z_1|x) \prod_{t=2}^{T} q_\phi(z_t|z_{t-1}) \]
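To see this factorization concretely, here is a minimal numpy sketch that evaluates the log-posterior as a sum of per-transition Gaussian log-densities. The identity-mean, fixed-variance transitions are a toy assumption purely for illustration; a real encoder would predict the means and variances.

```python
import numpy as np

def gauss_logpdf(x, mean, var):
    """Log-density of an isotropic Gaussian N(mean, var * I), summed over dimensions."""
    return -0.5 * np.sum((x - mean) ** 2 / var + np.log(2.0 * np.pi * var), axis=-1)

def log_q_forward(x, zs, sigma2=1.0):
    """log q(z_{1:T} | x) under the Markov factorization: q(z_1|x) * prod_t q(z_t|z_{t-1})."""
    logq = gauss_logpdf(zs[0], x, sigma2)               # q(z_1 | x)
    for t in range(1, len(zs)):
        logq += gauss_logpdf(zs[t], zs[t - 1], sigma2)  # q(z_t | z_{t-1}): only one parent!
    return logq
```

Note how the Markov property shows up directly in the loop: each term looks back exactly one step.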
ELBO Derivation for Markovian VAE
We now derive the Evidence Lower Bound (ELBO) step by step:
📝 Step-by-Step Derivation
Step 1: Start with log-likelihood
\[ \log p(x) = \log \int p(x, z_{1:T})\, dz_{1:T} \]
(Integrate out the latent variables.)
Step 2: Multiply and divide by \(q_{\phi}\)
\[ = \log \int \frac{p(x, z_{1:T})}{q_\phi(z_{1:T}|x)} q_\phi(z_{1:T}|x)\, dz_{1:T} \]
Step 3: Write as an expectation
\[ = \log \mathbb{E}_{q_\phi(z_{1:T}|x)} \left[ \frac{p(x, z_{1:T})}{q_\phi(z_{1:T}|x)} \right] \]
Step 4: Apply Jensen's inequality
\[ \geq \mathbb{E}_{q_\phi(z_{1:T}|x)} \left[ \log \frac{p(x, z_{1:T})}{q_\phi(z_{1:T}|x)} \right] \]
(The log is concave, so Jensen's inequality gives a lower bound.)
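The inequality in Step 4 is easy to check numerically. For any positive random variable \( w \) (standing in for the ratio \( p/q \) inside the expectation), \( \log \mathbb{E}[w] \geq \mathbb{E}[\log w] \); a quick sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.lognormal(size=100_000)   # an arbitrary positive random variable
print(np.log(w.mean()))           # log E[w]  ~ 0.5
print(np.log(w).mean())           # E[log w] ~ 0.0  <=  log E[w], as Jensen promises
```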
VDM: Variational Diffusion Models
🔑 Key Changes from HVAE to VDM
- \( z_i \) is replaced by \( x_i \): same-dimensional latents!
- Traditional HVAE assumes \( \text{size}(z_i) < \text{size}(x_0) \), but VDM has \( \text{size}(x_i)=\text{size}(x_0) \)
- The encoder is a fixed, unparameterized \( q \) rather than a learned \( q_{\phi} \)
- If the latent encoder is not learned, how do we define the forward process?
- At each timestep, \( q \) is assumed to be a linear Gaussian that is "easy" to propagate
- The Gaussian parameters are chosen so that the final distribution \( p_T \) is a standard Gaussian \( \mathcal{N}(0,I) \)
📋 The Forward Process Plan
- Define the Gaussian encoder's mean as \( \mu_t = \sqrt{\alpha_t}\, x_{t-1} \)
- Define the Gaussian encoder's variance as \( \Sigma_t = (1 - \alpha_t)I \)
- Transitions within the encoder represented as: \[ q(x_t|x_{t-1}) = \mathcal{N}(x_t; \sqrt{\alpha_t}x_{t-1}, (1 - \alpha_t)I) \]
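As a concrete illustration, here is a minimal numpy sketch of a single forward transition; the schedule value \( \alpha_t \) is assumed to come from some externally chosen noise schedule, which the text has not yet specified.

```python
import numpy as np

def q_step(x_prev, alpha_t, rng):
    """One forward transition q(x_t | x_{t-1}) = N(sqrt(alpha_t) * x_{t-1}, (1 - alpha_t) * I),
    sampled via the reparameterization x_t = sqrt(alpha_t) * x_{t-1} + sqrt(1 - alpha_t) * eps."""
    eps = rng.standard_normal(x_prev.shape)   # eps ~ N(0, I)
    return np.sqrt(alpha_t) * x_prev + np.sqrt(1.0 - alpha_t) * eps
```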
🎯 Explicit Gaussian Form for \( q(x_t|x_0) \)
A beautiful closed-form solution emerges from recursive application:
\[ x_t \sim \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}x_0, (1 - \bar{\alpha}_t)I), \quad \text{where}~ \bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i \]
Proof by recursion:
\begin{align*} x_t &= \sqrt{\alpha_t} x_{t-1} + \sqrt{1 - \alpha_t} \epsilon_{t-1}^* \\ &= \sqrt{\alpha_t}(\sqrt{\alpha_{t-1}} x_{t-2} + \sqrt{1 - \alpha_{t-1}} \epsilon_{t-2}^*) + \sqrt{1 - \alpha_t} \epsilon_{t-1}^* \\ &= \sqrt{\alpha_t\alpha_{t-1}} x_{t-2} + \sqrt{\alpha_t - \alpha_t\alpha_{t-1}} \epsilon_{t-2}^* + \sqrt{1 - \alpha_t} \epsilon_{t-1}^* \\ &= \sqrt{\alpha_t \alpha_{t-1}} x_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}} \epsilon_{t-2} \end{align*}
Note: We used the property that, for independent \( \epsilon_i, \epsilon_j \sim \mathcal{N}(0, I) \), the sum \( \gamma \epsilon_i + \zeta \epsilon_j \sim \mathcal{N}(0, (\gamma^2 + \zeta^2)I) \).
Continuing this recursion until \( t=0 \):
\[ x_t = \sqrt{\bar{\alpha}_t}x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon_0 \sim \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}x_0, (1 - \bar{\alpha}_t)I) \]
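In code, the closed form lets us jump to any timestep in one shot instead of iterating \( t \) transitions. A hedged numpy sketch, where `alphas` is an assumed array holding \( \alpha_1, \dots, \alpha_T \):

```python
import numpy as np

def q_sample(x0, t, alphas, rng):
    """Sample x_t ~ q(x_t | x_0) directly, using alpha_bar_t = prod_{i=1..t} alpha_i."""
    alpha_bar_t = np.prod(alphas[:t])        # alphas[:t] holds alpha_1 .. alpha_t
    eps = rng.standard_normal(x0.shape)      # eps ~ N(0, I)
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
```

Iterating the single-step transition \( t \) times yields the same distribution; the one-shot form simply collapses the recursion above.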
DDPM Training Algorithm
🔧 Training Procedure
Input:
Training data \( \{x^{(\ell)}\}_{\ell=1}^{L} \) (e.g., images)
Repeat until convergence:
- Sample a random timestep: \( t \sim \mathrm{Uniform}\{1, \dots, T\} \)
- Sample noise: \( z \sim \mathcal{N}(0, I) \)
- Generate noisy sample: \( x_t = \sqrt{\bar{\alpha}_t}x_0 + \sqrt{1 - \bar{\alpha}_t}z \)
- Take a gradient step on \( \nabla_{\theta} \| \widehat{x}_{\theta}(x_t, t) - x_0 \|^2 \)
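Below is a minimal PyTorch sketch of this training loop. The tiny MLP denoiser, the linear beta schedule, and the random stand-in data are all illustrative assumptions, not choices made in the text.

```python
import torch

T = 1000
alphas = 1.0 - torch.linspace(1e-4, 0.02, T)      # assumed linear beta schedule
alpha_bars = torch.cumprod(alphas, dim=0)         # alpha_bar_t

class Denoiser(torch.nn.Module):
    """Hypothetical x0-predictor x_hat_theta(x_t, t); any architecture would do."""
    def __init__(self, dim):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim + 1, 128), torch.nn.ReLU(), torch.nn.Linear(128, dim))

    def forward(self, x_t, t):
        t_feat = t.float().unsqueeze(-1) / T      # crude timestep conditioning
        return self.net(torch.cat([x_t, t_feat], dim=-1))

model = Denoiser(dim=2)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(1000):                          # "repeat until convergence"
    x0 = torch.randn(64, 2)                       # stand-in for real training data
    t = torch.randint(0, T, (64,))                # random timestep (0-indexed here)
    z = torch.randn_like(x0)                      # z ~ N(0, I)
    ab = alpha_bars[t].unsqueeze(-1)
    x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * z  # x_t = sqrt(ab) x0 + sqrt(1 - ab) z
    loss = ((model(x_t, t) - x0) ** 2).mean()     # || x_hat_theta(x_t, t) - x0 ||^2
    opt.zero_grad()
    loss.backward()
    opt.step()
```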
🎨 Sampling/Inference Procedure
Initialize:
Start with pure noise: \( x_T \sim \mathcal{N}(0, I) \)
For \( t = T \) down to 1:
- Compute denoised estimate: \( \widehat{x}_{\theta}(x_t) \)
- Sample noise: \( z \sim \mathcal{N}(0, I) \) if \( t > 1 \), else \( z = 0 \)
- Update: \[ x_{t-1} = \frac{(1 - \bar{\alpha}_{t-1})\sqrt{\alpha_t}}{1-\bar{\alpha}_t} x_t + \frac{(1 - \alpha_t)\sqrt{\bar{\alpha}_{t-1}}}{1 - \bar{\alpha}_t} \widehat{x}_{\theta}(x_t) + \sigma_q(t)z \]
Output: \( x_0 \) (generated sample)
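A numpy sketch of this sampler, assuming a trained x0-predicting `denoiser(x_t, t)` is available as a callable. The text leaves \( \sigma_q(t) \) unspecified; the sketch uses the common choice \( \sigma_q(t)^2 = (1 - \alpha_t)(1 - \bar{\alpha}_{t-1}) / (1 - \bar{\alpha}_t) \), which is an assumption here.

```python
import numpy as np

def ddpm_sample(denoiser, alphas, shape, rng):
    """Ancestral sampling: start from pure noise and iteratively denoise (0-indexed arrays)."""
    T = len(alphas)
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)                        # x_T ~ N(0, I)
    for t in range(T - 1, 0, -1):
        x0_hat = denoiser(x, t)                           # denoised estimate of x_0
        ab_t, ab_prev = alpha_bars[t], alpha_bars[t - 1]
        coef_xt = (1.0 - ab_prev) * np.sqrt(alphas[t]) / (1.0 - ab_t)
        coef_x0 = (1.0 - alphas[t]) * np.sqrt(ab_prev) / (1.0 - ab_t)
        sigma_t = np.sqrt((1.0 - alphas[t]) * (1.0 - ab_prev) / (1.0 - ab_t))
        z = rng.standard_normal(shape) if t > 1 else 0.0  # no noise on the final step
        x = coef_xt * x + coef_x0 * x0_hat + sigma_t * z
    return x
```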
🎯 Quiz 1: Test Your Understanding of DDPM
What is the key difference between HVAE and VDM?
Signal-to-Noise Ratio (SNR)
📈 SNR Definition
For the forward process \( q(x_t \mid x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} x_0, (1 - \bar{\alpha}_t) I) \):
\[ \text{SNR}(t) = \frac{\bar{\alpha}_t}{1 - \bar{\alpha}_t} \]
This ratio tells us how much signal remains versus noise at timestep \( t \).
💡 Key Insight
The loss can be rewritten in terms of SNR difference:
\[ L_t(x) = \frac{1}{2} \left( \text{SNR}(t-1) - \text{SNR}(t) \right) \| \widehat{x}_{\theta}(x_t, t) - x_0\|_2^2 \]
Since SNR decreases with \( t \), this means:
- Minimizing loss = Denoising!
- A larger SNR difference means more emphasis on that timestep
- We're learning to reverse the noise addition process
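A small numpy sketch of these quantities, again assuming a linear beta schedule for concreteness:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)           # assumed linear beta schedule
alpha_bars = np.cumprod(1.0 - betas)

snr = alpha_bars / (1.0 - alpha_bars)        # SNR(t) = alpha_bar_t / (1 - alpha_bar_t)
weights = 0.5 * (snr[:-1] - snr[1:])         # loss weight: (SNR(t-1) - SNR(t)) / 2

assert np.all(np.diff(snr) < 0)              # SNR strictly decreases with t ...
assert np.all(weights > 0)                   # ... so every timestep gets a positive weight
```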
Score Matching & Unnormalized Models
🧮 Unnormalized Probability Models
Consider an unnormalized probability model:
\[ p_\theta(x) = \frac{e^{-f_\theta(x)}}{Z_\theta}, \quad Z_\theta = \int e^{-f_\theta(x)}\, dx \]
Problem: \( Z_\theta \) is often intractable to compute!
📐 Score Function
The score is the gradient of log-density:
\[ \nabla_x \log p_{\theta}(x) = -\nabla_x f_{\theta}(x) \]
✨ Beautiful property: the score doesn't depend on \( Z_\theta \)!
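A toy illustration with a quadratic energy (a standard Gaussian up to normalization, chosen here only because its gradient is analytic): the score never references \( Z_\theta \), and shifting the energy by a constant, which rescales \( Z_\theta \), leaves the score untouched.

```python
import numpy as np

def f(x, c=0.0):
    """Unnormalized energy f(x) = 0.5 * ||x||^2 + c; the constant c only rescales Z."""
    return 0.5 * np.sum(x ** 2) + c

def score(x):
    """Score = -grad_x f(x) = -x, computed analytically; Z never appears."""
    return -x

x = np.array([1.0, -2.0])
print(score(x))   # [-1.  2.]: points toward the mode at the origin,
                  # and is identical whether c = 0 or c = 100 in f
```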
📏 Fisher Divergence / Score Matching
We can train unnormalized models by minimizing:
\[ \frac{1}{2} \mathbb{E}_{p_{\text{data}}} \left[ \| \nabla_x \log p_{\text{data}}(x) - \nabla_x \log p_\theta(x) \|_2^2 \right] \]
This bypasses the need to compute \( Z_\theta \)!
Langevin Dynamics
Langevin dynamics provides a powerful method to sample from a distribution using only its score function!
🌀 Langevin Dynamics Algorithm
Given score function \( \nabla_x \log p(x) \), iterate:
\[ x_{t+1} = x_t + \tau \nabla_x \log p(x_t) + \sqrt{2\tau}\, z, \quad z \sim \mathcal{N}(0, I) \]
\( \tau \): step size parameter
\( \nabla_x \log p(x_t) \): Guides toward high probability regions
\( \sqrt{2\tau} z \): Stochastic noise for exploration
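A minimal numpy sketch of the iteration, assuming access to an analytic score; here we sample from \( \mathcal{N}(\mu, I) \), whose score is \( \mu - x \):

```python
import numpy as np

def langevin_sample(score_fn, x_init, tau, n_steps, rng):
    """Unadjusted Langevin dynamics: x <- x + tau * score(x) + sqrt(2 * tau) * z."""
    x = x_init.copy()
    for _ in range(n_steps):
        z = rng.standard_normal(x.shape)     # fresh exploration noise each step
        x = x + tau * score_fn(x) + np.sqrt(2.0 * tau) * z
    return x

rng = np.random.default_rng(0)
mu = np.array([3.0, -1.0])
chains = langevin_sample(lambda x: mu - x,   # score of N(mu, I)
                         rng.standard_normal((5000, 2)), tau=0.1, n_steps=500, rng=rng)
print(chains.mean(axis=0))                   # close to mu
```

Dropping the noise term recovers plain gradient ascent on \( \log p(x) \); the noise is what makes this a sampler rather than an optimizer.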
🔗 Connection to Gradient Ascent
Langevin dynamics is like stochastic gradient ascent on \( \log p(x) \):
- Deterministic part: \( \tau \nabla_x \log p(x_t) \) moves toward maxima
- Stochastic part: \( \sqrt{2\tau} z \) enables exploration
- Balance between exploitation and exploration!
Denoising Score Matching
🎯 Denoising Score Matching Loss
Instead of explicit score matching, we can train on noisy data:
\[ \mathcal{J}_{\text{DSM}}(\theta) = \mathbb{E}_{p(x)} \left[ \frac{1}{2} \left\| s_\theta(x + \sigma z) + \frac{z}{\sigma} \right\|^2 \right], \quad z \sim \mathcal{N}(0, I) \]
Key idea: Train the model to predict the noise that was added!
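A hedged numpy sketch of this loss, with a small sanity check: when the data is standard Gaussian (an assumption used only for the check), the noisy marginal is \( \mathcal{N}(0, (1+\sigma^2)I) \), whose analytic score should beat a useless zero score.

```python
import numpy as np

def dsm_loss(score_fn, x, sigma, rng):
    """Monte Carlo estimate of E[ 0.5 * || s(x + sigma * z) + z / sigma ||^2 ]."""
    z = rng.standard_normal(x.shape)                   # the noise that was added
    resid = score_fn(x + sigma * z) + z / sigma        # model score vs. the -z/sigma target
    return 0.5 * np.mean(np.sum(resid ** 2, axis=-1))

rng = np.random.default_rng(0)
x = rng.standard_normal((10000, 2))                    # toy data ~ N(0, I)
sigma = 0.5
print(dsm_loss(lambda y: -y / (1.0 + sigma ** 2), x, sigma, rng))  # ~3.2 (near-optimal score)
print(dsm_loss(lambda y: np.zeros_like(y), x, sigma, rng))         # ~4.0 (zero score is worse)
```

Consistent with Vincent's theorem, even the optimal score leaves a nonzero constant in the DSM loss.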
📜 Vincent's Theorem (2011)
Denoising score matching and explicit score matching are equivalent (up to a constant):
\[ L_{\text{DSM}}(\theta) = L_{\text{ESM}}(\theta) + C \]
This means training on noisy data is as good as having access to the true score!
🎯 Quiz 2: Score Matching Understanding
Why is the score function useful for unnormalized models?
What role does noise play in Langevin dynamics?
Summary: The Diffusion Journey
🏗️ Hierarchical VAE → Markovian VAE
Simplified dependencies using the Markov property: each latent depends only on the previous one
🔄 VDM (Variational Diffusion Models)
Same-dimensional latents, fixed Gaussian forward process, learn reverse process
⚙️ DDPM Algorithm
Training: add noise at random timesteps. Inference: iteratively denoise from pure noise
💡 Signal-to-Noise Ratio
\( \text{SNR}(t) = \bar{\alpha}_t / (1 - \bar{\alpha}_t) \) controls the diffusion schedule and decreases with time
🎯 Score Matching
Train unnormalized models via their score (the gradient of the log-density), bypassing the intractable partition function
🌀 Langevin Dynamics
Sample using score function: gradient ascent + stochastic noise
🌍 The Big Picture
Diffusion models are a powerful class of generative models that gradually add noise to data (forward process) and learn to reverse this process (backward process). They achieve state-of-the-art results in image generation, connect deeply to score matching and Langevin dynamics, and provide a principled probabilistic framework for generation!