GAN Architectures for Vision Tasks
From conditional generation to high-resolution synthesis: exploring the evolution of GANs for computer vision.
"One architecture to rule them all? Not quiteβbut each brings its own magic!"
Introduction
Since the introduction of GANs, researchers have developed numerous specialized architectures to tackle specific vision tasks. These innovations address challenges like generating high-resolution images, conditioning on labels or semantic maps, and achieving photorealistic results.
In this chapter, we'll explore prominent GAN architectures designed for vision applications: from conditional GANs that give you control, to progressive training that scales to high resolutions, and specialized models for image-to-image translation.
Conditional GAN (cGAN)
Key Idea: Guide the data generation process by conditioning both the generator and discriminator on additional information (labels, text, images, etc.).
How cGANs Work
cGAN Architecture: Both generator and discriminator receive the conditioning label y, enabling controlled generation.
Generator G(z, y)
Takes both noise vector z and condition y (e.g., class label "cat") to produce samples that match the condition.
Discriminator D(x, y)
Judges both authenticity AND relevance to condition y, ensuring generated samples match the specified condition.
Example: when training on labeled MNIST, passing noise plus the label "7" generates a handwritten digit 7. Without conditioning, the generator produces arbitrary digits with no control!
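The conditioning mechanism can be sketched in a few lines of NumPy. This is an illustrative stand-in, not a trained model: the single-matrix "generator", the layer sizes, and the helper names are all hypothetical, chosen only to show how the condition y is concatenated with the inputs of both networks.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_CLASSES = 10   # e.g. MNIST digits 0-9
NOISE_DIM = 64
IMG_SIDE = 28

def one_hot(label, num_classes=NUM_CLASSES):
    """Encode a class label as a one-hot condition vector y."""
    y = np.zeros(num_classes)
    y[label] = 1.0
    return y

# Hypothetical single-layer "generator" weights, just to show the shapes;
# a real cGAN generator is a deep network trained adversarially.
W_g = rng.standard_normal((NOISE_DIM + NUM_CLASSES, IMG_SIDE * IMG_SIDE)) * 0.01

def generator(z, label):
    """G(z, y): the condition y is concatenated with the noise vector z,
    so the same network can be steered toward any class at sampling time."""
    y = one_hot(label)
    img = np.tanh(np.concatenate([z, y]) @ W_g)
    return img.reshape(IMG_SIDE, IMG_SIDE)

def discriminator_input(img, label):
    """D(x, y): the discriminator sees the (flattened) image together with
    the same condition, so it can punish class mismatches, not just fakes."""
    return np.concatenate([img.ravel(), one_hot(label)])

z = rng.standard_normal(NOISE_DIM)
fake_seven = generator(z, label=7)           # ask for a "7"
d_in = discriminator_input(fake_seven, 7)

print(fake_seven.shape)   # (28, 28)
print(d_in.shape)         # (794,) = 784 pixels + 10-way label
```

The key point is structural: both networks receive y, so the discriminator rejects samples that are realistic but belong to the wrong class, which is what forces the generator to respect the condition.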
Applications of cGANs
Image-to-Image Translation
Sketches → Photos, Day → Night scenes
Photo Enhancement
Super-resolution, denoising, restoration
Data Augmentation
Generate samples for rare classes
Style Transfer
Apply artistic styles to images
Text-to-Image
Generate images from descriptions
Medical Imaging
Synthetic data for privacy-sensitive domains
Advantages
- Controlled Generation: Specify what you want to generate
- Higher Quality: Conditioning guides meaningful outputs
- Versatile: Works with labels, images, text, etc.
- Cross-Domain: Art, science, medical, entertainment
Challenges
- Training Stability: Still inherits GAN training difficulties
- Mode Collapse: Can collapse despite conditioning
- Condition Quality: Effectiveness depends on how informative the condition is
- Architecture Tuning: Requires careful design
Quick Quiz 1: Test your understanding
What is the main difference between GAN and cGAN?
Progressive Growing of GANs
Key Innovation: Instead of training on high-resolution images from the start, progressively add layers to both generator and discriminator, beginning with low resolutions and gradually increasing detail.
Progressive Training Stages
Stage 1: 4×4
Train on low-res
Stage 2: 4×4 → 64×64
Add layers, refine details
Stage 3: 4×4 → 1024×1024
High-res synthesis!
Benefits of Progressive Training
Faster Training
Start with coarse structures, gradually add fine details
Better Stability
Gradual complexity increase prevents training collapse
Higher Quality
Produces unprecedented quality images at high resolutions
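The trick that makes growing stable is the fade-in: when a new resolution stage is added, its output is blended with the upsampled output of the previous stage, with a blending weight that ramps from 0 to 1 over training. A minimal NumPy sketch (the nearest-neighbour upsampler and the 4×4/8×8 shapes are illustrative):

```python
import numpy as np

def upsample_nearest(x):
    """2x nearest-neighbour upsampling: (H, W) -> (2H, 2W)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def faded_output(old_rgb_low, new_rgb_high, alpha):
    """During a resolution transition, Progressive GAN blends the upsampled
    output of the old (low-res) head with the output of the newly added
    (high-res) layers. alpha ramps 0 -> 1 over training, so the new layers
    are introduced gradually instead of abruptly destabilizing training."""
    return alpha * new_rgb_high + (1.0 - alpha) * upsample_nearest(old_rgb_low)

rng = np.random.default_rng(0)
old = rng.random((4, 4))   # e.g. output of the already-trained 4x4 stage
new = rng.random((8, 8))   # output of the freshly added 8x8 layers

start = faded_output(old, new, alpha=0.0)   # pure upsampled old output
end = faded_output(old, new, alpha=1.0)     # new layers fully in charge
```

At alpha = 0 the network behaves exactly like the previous, already-stable stage; at alpha = 1 the new layers have fully taken over, which is why the complexity increase never arrives all at once.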
Specialized GAN Architectures
Pix2Pix
Image-to-Image Translation with Conditional Adversarial Networks
Learns mapping from input images to output images using cGAN with combined adversarial + L1 loss for pixel-wise accuracy.
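The combined generator objective can be sketched in NumPy. The λ = 100 weighting is the value reported in the Pix2Pix paper; the tiny arrays and helper names below are illustrative only:

```python
import numpy as np

def bce(pred, target):
    """Binary cross-entropy on discriminator probabilities."""
    eps = 1e-7
    pred = np.clip(pred, eps, 1 - eps)
    return float(-np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred)))

def pix2pix_generator_loss(d_fake, fake_img, real_img, lam=100.0):
    """Adversarial term (fool the discriminator) plus a lambda-weighted L1
    term (stay pixel-wise close to the paired ground truth). The L1 term is
    what gives Pix2Pix its crisp, input-aligned outputs."""
    adv = bce(d_fake, np.ones_like(d_fake))          # generator wants D(fake) = 1
    l1 = float(np.mean(np.abs(fake_img - real_img)))  # pixel-wise accuracy
    return adv + lam * l1

d_fake = np.array([0.4, 0.6])               # discriminator scores on fakes
fake, real = np.zeros((2, 2)), np.ones((2, 2))
loss = pix2pix_generator_loss(d_fake, fake, real)
```

Because the task is paired (each input image has a known target), the L1 term is well-defined per pixel; the adversarial term then sharpens details that L1 alone would blur.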
StyleGAN
Style-Based Generator Architecture for High-Quality Images
Uses Adaptive Instance Normalization (AdaIN) to control styles at each layer, enabling fine-grained control over texture, color, and structure.
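AdaIN itself is simple enough to state directly. A NumPy sketch, with the caveat that in StyleGAN the per-layer (scale, shift) pair comes from a learned affine transform of the style vector w, whereas here it is passed in by hand:

```python
import numpy as np

def adain(content, style_scale, style_shift, eps=1e-5):
    """Adaptive Instance Normalization: normalize each channel of the
    content features to zero mean / unit variance, then re-scale and
    re-shift it with style-derived statistics. Applied at every layer,
    this lets the style vector control texture and color independently
    of the spatial structure carried by the features."""
    mu = content.mean(axis=(1, 2), keepdims=True)    # per-channel mean
    sigma = content.std(axis=(1, 2), keepdims=True)  # per-channel std
    normalized = (content - mu) / (sigma + eps)
    return style_scale[:, None, None] * normalized + style_shift[:, None, None]

rng = np.random.default_rng(0)
feat = rng.standard_normal((3, 8, 8)) * 5 + 2   # (channels, H, W)
out = adain(feat,
            style_scale=np.array([1.0, 2.0, 0.5]),
            style_shift=np.array([0.0, 1.0, -1.0]))
```

After the call, each output channel's statistics match the requested style: channel 0 has mean ≈ 0 and std ≈ 1, channel 1 mean ≈ 1 and std ≈ 2, and so on, regardless of the input statistics.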
GigaGAN
Scaling Up GANs for Text-to-Image Synthesis
Scales GANs to billions of parameters for text-to-image synthesis, pairing a text-conditioned base generator with a GAN-based upsampler, taking images from coarse shapes at low resolution to fine detail at high resolution.
GauGAN (SPADE)
Semantic Image Synthesis with Spatially-Adaptive Normalization
SPADE (Spatially-Adaptive Normalization) modulates features based on semantic layout, producing photorealistic images from semantic maps.
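A toy NumPy version of the modulation step. The per-class lookup table below is a deliberate simplification: real SPADE learns a small convolutional network that predicts gamma and beta maps from the segmentation input.

```python
import numpy as np

def spade(features, seg_map, gamma_per_class, beta_per_class, eps=1e-5):
    """SPADE: batch-norm-style normalization whose scale (gamma) and shift
    (beta) vary *per pixel*, driven by the semantic layout. Unlike plain
    normalization, the semantic information is re-injected here, so it is
    not washed out by the normalization itself."""
    mu = features.mean()
    sigma = features.std()
    normalized = (features - mu) / (sigma + eps)
    gamma = gamma_per_class[seg_map]   # (H, W) spatially-varying scale
    beta = beta_per_class[seg_map]     # (H, W) spatially-varying shift
    return gamma * normalized + beta

rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 4))
seg = np.array([[0, 0, 1, 1]] * 4)    # left half class 0, right half class 1
out = spade(feat, seg,
            gamma_per_class=np.array([1.0, 2.0]),
            beta_per_class=np.array([0.0, 5.0]))
```

Each region of the output is modulated according to its semantic class, which is what lets GauGAN turn a painted label map into region-consistent photorealistic texture.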
Quick Quiz 2: Challenge
What is the main advantage of Progressive GAN?
Architecture Comparison
| Architecture | Key Innovation | Best For | Conditioning |
|---|---|---|---|
| cGAN | Conditional inputs | Controlled generation | Yes (labels, text, images) |
| Progressive GAN | Growing architecture | High-res synthesis | No |
| Pix2Pix | Paired image translation | Image-to-image tasks | Yes (input image) |
| StyleGAN | AdaIN style control | Disentangled generation | Yes (style vectors) |
| GigaGAN | Text-to-image scaling | Text descriptions | Yes (text) |
| GauGAN | SPADE normalization | Semantic synthesis | Yes (semantic maps) |
Key Takeaways
Control
Conditional GANs enable precise control over generation through labels, text, or images
Scaling
Progressive training enables stable, high-quality synthesis at unprecedented resolutions
Specialization
Task-specific architectures (Pix2Pix, StyleGAN, etc.) excel at their designed applications
From Control to Quality...
"Each GAN architecture is a tool in the creative toolkitβ
choose wisely based on your task!"