Deep Learning for Generative Models

From perceptrons to Transformers: how layers, normalization, and attention power modern generators.

"Layer by layer we paint a thought, weights remember what gradients taught."

A (very) quick history

  • 1943 McCulloch-Pitts neurons kick off logical neuron modeling.
  • 1958 Rosenblatt's perceptron learns simple patterns.
  • 1969 Minsky & Papert show XOR limits: single-layer pain.
  • 1975/1986 Werbos, Rumelhart, Hinton, Williams popularize backprop.
  • 1990s CNNs for grids (LeCun); RNNs/Elman for sequences.
  • 2010s Deep networks explode: vision, speech, NLP.
  • 2017+ Transformers and attention take over sequence modeling.
  • Today Diffusion, large generative models, and ever-bigger stacks.

Reference credits: McCulloch & Pitts, Rosenblatt, Minsky & Papert, Werbos, Rumelhart-Hinton-Williams, LeCun, Elman, Vaswani et al., and friends.

🧠

What is a neural network?

A stack of neurons that take inputs, apply weights, bias, and an activation. Feedforward nets push signals forward; losses + gradient descent pull weights into shape.

Perceptron

Weighted sum + activation decides a class. Great for lines, stuck on XOR.
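
For concreteness, here is a minimal NumPy sketch of the perceptron update rule on an AND-gate toy set; the data, zero initialization, learning rate, and epoch count are illustrative choices, not from the text:

```python
import numpy as np

# Perceptron learning rule on the AND function (linearly separable toy data).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1], dtype=float)

w, b, lr = np.zeros(2), 0.0, 0.1   # zero init and learning rate are illustrative

for epoch in range(20):
    for xi, yi in zip(X, y):
        pred = float(np.dot(w, xi) + b > 0)   # step activation
        w += lr * (yi - pred) * xi            # nudge weights toward the target
        b += lr * (yi - pred)

print(w, b)  # separates AND; the same loop never settles on XOR labels
```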

Deep stacks

Input → hidden layers (nonlinear) → output. Learn via backprop + gradient descent on a loss (MSE, cross-entropy, hinge).

🔢

Example: 2-layer feedforward

Inputs $\mathbf{x} = [x_1, x_2]^T$, one hidden neuron, one output neuron.

Parameters

$$ \mathbf{W}^{(1)} = \begin{bmatrix} w_{11}^{(1)} & w_{12}^{(1)} \end{bmatrix},\; \mathbf{b}^{(1)} = \begin{bmatrix} b_1^{(1)} \end{bmatrix} $$ $$ \mathbf{W}^{(2)} = \begin{bmatrix} w_{11}^{(2)} \end{bmatrix},\; \mathbf{b}^{(2)} = \begin{bmatrix} b_1^{(2)} \end{bmatrix} $$

So $\mathbf{W}^{(1)}$ is $1 \times 2$ (two inputs into one hidden neuron) and $\mathbf{W}^{(2)}$ is $1 \times 1$ (one hidden neuron into one output).

Feedforward recipe

$$ \mathbf{z}^{(l)} = \mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}, \quad \mathbf{a}^{(l)} = \sigma(\mathbf{z}^{(l)}) $$

Repeat for layers $l = 1 \dots L$, return $\mathbf{a}^{(L)}$.

Backprop sketch (2-layer)

Assume ReLU hidden units, a sigmoid output, and squared-error loss. Compute the output error $\delta^{(2)} = (\hat{y}-y)\,\hat{y}(1-\hat{y})$, update $\mathbf{W}^{(2)} \leftarrow \mathbf{W}^{(2)} - \eta\,\delta^{(2)}\mathbf{h}^T$ (and $\mathbf{b}^{(2)} \leftarrow \mathbf{b}^{(2)} - \eta\,\delta^{(2)}$), then propagate the hidden error $\delta^{(1)} = (\mathbf{W}^{(2)T}\delta^{(2)}) \odot \text{ReLU}'(\mathbf{W}^{(1)}\mathbf{x}+\mathbf{b}^{(1)})$ and update $\mathbf{W}^{(1)}, \mathbf{b}^{(1)}$ the same way.

Algorithm flavor: for $l=1\dots L$ do forward ($\mathbf{z}^{(l)}$, $\mathbf{a}^{(l)}$); for $l=L\dots 1$ backpropagate $\delta^{(l)}$ with chain rule; step weights with learning rate $\eta$.
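
A minimal NumPy sketch of the recipe above, assuming ReLU hidden units, a sigmoid output, and squared-error loss as in the backprop sketch; the hidden width of 3, the toy sample, and the learning rate are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# x in R^2, one hidden layer, one sigmoid output unit.
# A hidden width of 3 is an illustrative choice, not fixed by the text.
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)
eta = 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y = np.array([0.5, -1.0]), np.array([1.0])   # toy sample

for _ in range(100):
    # Forward: z = W a + b, a = activation(z)
    z1 = W1 @ x + b1
    h = np.maximum(z1, 0.0)          # ReLU hidden layer
    y_hat = sigmoid(W2 @ h + b2)     # sigmoid output

    # Backward (squared-error loss, as in the sketch above)
    delta2 = (y_hat - y) * y_hat * (1.0 - y_hat)
    delta1 = (W2.T @ delta2) * (z1 > 0)          # ReLU' is a 0/1 indicator

    # Gradient-descent updates with learning rate eta
    W2 -= eta * np.outer(delta2, h); b2 -= eta * delta2
    W1 -= eta * np.outer(delta1, x); b1 -= eta * delta1

print(y_hat.round(3))  # moves toward the target 1.0
```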

⚖️

Batch Normalization

Normalize per feature per mini-batch to fight internal covariate shift (Ioffe & Szegedy).

  1. Compute batch mean/variance per feature.
  2. Normalize: $\hat{x}_i = (x_i-\mu)/\sqrt{\sigma^2+\epsilon}$.
  3. Scale/shift with learnable $\gamma, \beta$.

Equations

$$\mu = \frac{1}{N}\sum_{i=1}^N x_i,\quad \sigma^2 = \frac{1}{N}\sum_{i=1}^N (x_i-\mu)^2$$ $$\hat{x}_i = \frac{x_i-\mu}{\sqrt{\sigma^2+\epsilon}},\quad y_i = \gamma \hat{x}_i + \beta$$
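
A NumPy sketch of the training-time transform above; the toy batch, the $\epsilon$, and the $\gamma, \beta$ initialization are illustrative, and the running statistics used at inference time are omitted:

```python
import numpy as np

def batch_norm_train(X, gamma, beta, eps=1e-5):
    """X: (N, F) mini-batch. Normalize each feature over the batch axis."""
    mu = X.mean(axis=0)                      # per-feature batch mean
    var = X.var(axis=0)                      # per-feature batch variance
    X_hat = (X - mu) / np.sqrt(var + eps)    # normalize
    return gamma * X_hat + beta              # learnable scale and shift

X = np.random.default_rng(0).normal(2.0, 3.0, size=(8, 4))   # toy batch
gamma, beta = np.ones(4), np.zeros(4)
Y = batch_norm_train(X, gamma, beta)
print(Y.mean(axis=0).round(6), Y.std(axis=0).round(3))  # ~0 mean, ~1 std per feature
```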

Why it helps

  • Faster convergence, higher learning rates.
  • Regularization via noisy batch stats.
  • Less sensitive to initialization.

📐

Layer Normalization

Normalize across features per sample (Ba et al.), ideal when batch sizes vary or are tiny.

Equations

$$\mu_i = \frac{1}{F}\sum_{j=1}^{F} x_{ij}, \quad \sigma_i^2 = \frac{1}{F}\sum_{j=1}^{F} (x_{ij}-\mu_i)^2$$ $$\hat{x}_{ij} = \frac{x_{ij}-\mu_i}{\sqrt{\sigma_i^2+\epsilon}}, \quad y_{ij} = \gamma_j \hat{x}_{ij} + \beta_j$$
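
The same idea per sample rather than per batch, as a NumPy sketch (the toy shapes and $\epsilon$ are illustrative):

```python
import numpy as np

def layer_norm(X, gamma, beta, eps=1e-5):
    """X: (N, F). Normalize each row (sample) over its F features."""
    mu = X.mean(axis=1, keepdims=True)        # per-sample mean
    var = X.var(axis=1, keepdims=True)        # per-sample variance
    X_hat = (X - mu) / np.sqrt(var + eps)
    return gamma * X_hat + beta               # per-feature gamma_j, beta_j

X = np.random.default_rng(1).normal(size=(2, 5))   # toy samples
print(layer_norm(X, np.ones(5), np.zeros(5)))
```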

Why it helps

  • Batch-size independent; friendly to NLP with variable lengths.
  • Reduces covariate shift; speeds convergence.
  • Plays nicely inside Transformers with small batches.

🔁

Recurrent Nets → LSTMs

RNNs carry state through time: $\mathbf{h}_t = f(\mathbf{x}_t, \mathbf{h}_{t-1};\theta)$, $\mathbf{y}_t = g(\mathbf{h}_t;\theta)$. Trained with BPTT, but watch for vanishing/exploding gradients.

LSTM gates

$$ \mathbf{f}_t = \sigma(\mathbf{W}_f[\mathbf{h}_{t-1},\mathbf{x}_t]+\mathbf{b}_f),\quad \mathbf{i}_t = \sigma(\mathbf{W}_i[\mathbf{h}_{t-1},\mathbf{x}_t]+\mathbf{b}_i) $$ $$ \tilde{\mathbf{c}}_t = \tanh(\mathbf{W}_c[\mathbf{h}_{t-1},\mathbf{x}_t]+\mathbf{b}_c),\quad \mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t $$ $$ \mathbf{o}_t = \sigma(\mathbf{W}_o[\mathbf{h}_{t-1},\mathbf{x}_t]+\mathbf{b}_o),\quad \mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t) $$

Why LSTMs?

  • Gate what to forget, what to write, what to read.
  • Stabilize gradients on long sequences.
  • Great for speech, language, time-series.

LSTM algorithm (one step)

Given $x_t, h_{t-1}, C_{t-1}$: compute gates $i_t=\sigma(W_i x_t + U_i h_{t-1}+b_i)$, $f_t=\sigma(W_f x_t + U_f h_{t-1}+b_f)$, $o_t=\sigma(W_o x_t + U_o h_{t-1}+b_o)$, candidate $g_t=\tanh(W_g x_t+U_g h_{t-1}+b_g)$; update $C_t=f_t\odot C_{t-1}+ i_t\odot g_t$, then $h_t=o_t\odot \tanh(C_t)$.
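
A NumPy sketch of this single step using the $W/U$ parametrization above; the dimensions and random parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid = 4, 3  # illustrative sizes

# One (W, U, b) triple per gate (i, f, o) plus the candidate g.
params = {k: (rng.normal(size=(d_hid, d_in)),
              rng.normal(size=(d_hid, d_hid)),
              np.zeros(d_hid)) for k in "ifog"}

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev):
    gate = lambda k, act: act(params[k][0] @ x_t + params[k][1] @ h_prev + params[k][2])
    i_t, f_t, o_t = gate("i", sigmoid), gate("f", sigmoid), gate("o", sigmoid)
    g_t = gate("g", np.tanh)                  # candidate cell state
    c_t = f_t * c_prev + i_t * g_t            # forget old content, write new
    h_t = o_t * np.tanh(c_t)                  # read through the output gate
    return h_t, c_t

h, c = np.zeros(d_hid), np.zeros(d_hid)
h, c = lstm_step(rng.normal(size=d_in), h, c)
print(h.round(3), c.round(3))
```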

🛰️

Transformers & Attention

Self-attention scores tokens against each other, enabling global context without recurrence. Masked self-attention keeps autoregressive models honest (no peeking at the future).

Scaled dot-product

$$Z = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Multi-head attention runs several of these in parallel, then concatenates the heads' outputs and applies a final linear projection.

Masked self-attention

Apply $-\infty$ mask for $j>i$ so position $i$ only attends to $1\dots i$. Great for language modeling.

Transformer poem

"Embed the words, add position hue,
Heads attend to what is true.
Skip the loops, let norms align,
Decode the future, one token at a time."

🛡️

Masked self-attention (steps)

  1. Compute scores $S_{ij} = Q_i K_j^\top / \sqrt{d_k}$.
  2. Mask future: add $-\infty$ where $j>i$ so token $i$ cannot see token $j$ ahead.
  3. Softmax over masked scores to get $\alpha_{ij}$.
  4. Output $Z_i = \sum_j \alpha_{ij} V_j$.

Keeps autoregressive models honest while retaining parallelism.
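
A NumPy sketch of steps 1-4 for a single attention head; the shapes and the `causal` flag are illustrative:

```python
import numpy as np

def masked_attention(Q, K, V, causal=True):
    """Q, K, V: (T, d_k). Returns (T, d_k) outputs Z."""
    d_k = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d_k)                                        # 1. scaled scores
    if causal:
        S = np.where(np.triu(np.ones_like(S), k=1) == 1, -np.inf, S)  # 2. mask j > i
    S = S - S.max(axis=-1, keepdims=True)                             # numerical stability
    A = np.exp(S) / np.exp(S).sum(axis=-1, keepdims=True)             # 3. softmax -> alpha_ij
    return A @ V                                                      # 4. weighted sum of values

T, d_k = 5, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(T, d_k)) for _ in range(3))
Z = masked_attention(Q, K, V)
print(Z.shape)  # (5, 8); row i depends only on positions 1..i
```

Setting `causal=False` gives the unmasked scaled dot-product attention used in encoders; multi-head attention applies this per head to projected Q, K, V.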

🧠 Quick Quiz

Which trick tackles internal covariate shift during training?

🗺️

Transformer family (infographic list)

Transformer (2017)

Encoder-decoder with self-attention; "Attention is All You Need."

BERT (2018)

Bidirectional encoder; masked LM + NSP pretraining for rich context.

GPT series

Decoder-only, autoregressive: GPT → GPT-2 → GPT-3 → GPT-4 scale up parameters and data.

RoBERTa

BERT re-tuned: larger batches, more data, no NSP step.

T5

Text-to-text: every task framed as input text → output text with encoder-decoder.

Use this list as a mental map of the Transformer family.

🤖

GPT-3 and GPT-4

Decoder-only Transformers with massive parameter counts (GPT-3: 175B). Self-attention over tokens enables few-shot/zero-shot “in-context” learning. Used for generation, summarization, translation, QA, sentiment analysis, and more.

Limitations: can hallucinate, sensitive to prompts, costly to run at scale. Still, they pushed NLP forward and opened broad generative applications.

Memory hook (poem)

"Perceptrons learn but not XOR dreams,
Backprop mends with gradient streams.
Gates remember, norms align,
Attention whispers: context, shine."
