1. Moving to Keras 📦
Slides 1-11
Last week, we built a Neural Network from scratch ($\sim 95\%$ accuracy). Now, let's use Keras (with the TensorFlow backend) to do it professionally.
Our Baseline Model
- Input: 784 pixels (28x28)
- Hidden: 30 neurons (Sigmoid)
- Output: 10 neurons (Sigmoid/Softmax)
- Loss: Mean Squared Error (MSE)
Even with this simple model, Keras gets us to ~96.6% accuracy. But we can do better!
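A minimal Keras sketch of this baseline might look like the following (the SGD learning rate and the commented training settings are assumptions, not values from the slides):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Baseline: 784 -> 30 (sigmoid) -> 10 (sigmoid), trained with MSE.
model = keras.Sequential([
    keras.Input(shape=(784,)),               # flattened 28x28 images
    layers.Dense(30, activation="sigmoid"),
    layers.Dense(10, activation="sigmoid"),
])
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.1),  # assumed rate
              loss="mean_squared_error",
              metrics=["accuracy"])
# model.fit(x_train, y_train_onehot, epochs=30, batch_size=32,
#           validation_data=(x_test, y_test_onehot))
```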
2. Loss Functions & Initialization 📉
Slides 12-22
The Problem: Slow Learning
If we initialize weights poorly (e.g., too large), the sigmoid saturates ($\sigma(z) \approx 0 \text{ or } 1$). Its derivative $\sigma'(z)$ then becomes nearly zero, and gradient descent stalls.
The Solution: Cross-Entropy
Cross-Entropy loss cancels out the $\sigma'(z)$ term in the gradient, fixing the slow learning problem!
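To see why, compare the weight gradient of a single sigmoid output neuron (weighted input $z = wx + b$, activation $a = \sigma(z)$, target $y$) under the two losses:

$$\frac{\partial C_{\text{MSE}}}{\partial w} \propto (a - y)\,\sigma'(z)\,x \qquad\text{vs.}\qquad \frac{\partial C_{\text{CE}}}{\partial w} = (a - y)\,x$$

With MSE, a saturated neuron (tiny $\sigma'(z)$) barely learns; with cross-entropy, the learning speed is driven only by the error $(a - y)$.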
Softmax Output
For multi-class classification, use Softmax + Categorical Cross-Entropy. Softmax turns the raw outputs into a probability distribution (non-negative, summing to 1).
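A sketch of the output layer and loss in Keras (layer sizes carried over from the baseline above; the plain SGD optimizer is an assumption):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(784,)),
    layers.Dense(30, activation="sigmoid"),
    layers.Dense(10, activation="softmax"),   # outputs form a probability distribution
])
# categorical_crossentropy expects one-hot labels
# (use sparse_categorical_crossentropy for integer labels)
model.compile(optimizer="sgd",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```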
3. Fighting Overfitting 🛡️
Slides 24-37
Overfitting: the model memorizes the training data but fails to generalize to unseen test data. It shows up when the training loss keeps dropping while the validation loss rises.
Technique 1: L2 Regularization
Add a penalty for large weights ($ \lambda \sum w^2 $).
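In Keras this is typically a `kernel_regularizer` on the layer; the $\lambda$ value below is an assumed example:

```python
from tensorflow.keras import layers, regularizers

# Adds lambda * sum(w^2) for this layer's weights to the loss
dense = layers.Dense(30, activation="sigmoid",
                     kernel_regularizer=regularizers.l2(1e-4))
```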
Technique 2: Dropout
Randomly turn off neurons during training.
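In Keras, dropout is a layer placed after the layer whose activations should be dropped; the 50% rate below is an assumed example:

```python
from tensorflow.keras import layers

# Zeroes 50% of incoming activations on each training step;
# Keras disables dropout automatically at inference time.
drop = layers.Dropout(0.5)
```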
Technique 3: Early Stopping
Stop training automatically when validation loss stops improving.
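In Keras this is the `EarlyStopping` callback; the `patience` and `restore_best_weights` settings below are assumptions:

```python
from tensorflow.keras.callbacks import EarlyStopping

# Stop once val_loss has not improved for 5 consecutive epochs,
# then roll the weights back to the best epoch seen.
early_stop = EarlyStopping(monitor="val_loss", patience=5,
                           restore_best_weights=True)
# model.fit(x_train, y_train, validation_split=0.1, epochs=100,
#           callbacks=[early_stop])
```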
4. Data Augmentation 🖼️
Slides 38-39
More data is the best cure for overfitting. We can artificially create data by rotating or shifting existing images.
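A sketch using Keras's `ImageDataGenerator` (the rotation and shift ranges are assumed values for digit images):

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Small random rotations and shifts; expects images shaped (n, 28, 28, 1)
datagen = ImageDataGenerator(rotation_range=10,
                             width_shift_range=0.1,
                             height_shift_range=0.1)
# model.fit(datagen.flow(x_train, y_train, batch_size=32), epochs=20)
```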
5. Optimization Tricks ⚡
Slides 40-48
Learning Rate
Too small = slow. Too large = unstable.
Momentum
Accelerates SGD in the relevant direction and dampens oscillations.
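Both knobs live on the optimizer; the values below are illustrative assumptions:

```python
from tensorflow.keras.optimizers import SGD

# Plain SGD with an explicit learning rate and classical momentum
opt = SGD(learning_rate=0.01, momentum=0.9)
# model.compile(optimizer=opt, loss="categorical_crossentropy", metrics=["accuracy"])
```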
Advanced Optimizers
Instead of tuning the learning rate manually, use adaptive optimizers such as Adagrad, RMSprop, Adam, or Adadelta.
(Note: Batch Size also affects training dynamics. Smaller batches are noisier but generalize better. See `l303-example-06b.py` in the code list below.)
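A sketch of swapping in an adaptive optimizer and setting the batch size (the epoch count and batch size are assumed values):

```python
from tensorflow.keras.optimizers import Adam, Adadelta

# Adaptive optimizers keep a per-parameter learning rate;
# the Keras defaults are a common starting point.
opt = Adam()          # or Adadelta(), RMSprop(), Adagrad()
# model.compile(optimizer=opt, loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(x_train, y_train, epochs=20, batch_size=128)  # smaller batches = noisier updates
```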
6. Going Deeper & Activations 🌌
Slides 49-58
Vanishing Gradient
The sigmoid derivative is at most $0.25$. In a deep network, backpropagation multiplies one such factor per layer, so the gradients reaching the early layers shrink exponentially toward zero.
Solution: ReLU
Rectified Linear Unit ($f(x) = \max(0,x)$). Derivative is 1 for $x>0$. No vanishing gradient!
The Final "Monster" Model
Combining everything: 512 neurons, ReLU, Dropout, Adadelta.
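A sketch of this model; only 512 units / ReLU / Dropout / Adadelta come from the summary above, while the dropout rate, output activation, and number of hidden layers are assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(784,)),
    layers.Dense(512, activation="relu"),
    layers.Dropout(0.2),                      # assumed rate
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer=keras.optimizers.Adadelta(),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```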
Variant: Deeper network (4 layers):
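A hypothetical 4-hidden-layer version (the layer widths and dropout rates below are assumptions):

```python
from tensorflow import keras
from tensorflow.keras import layers

deep_model = keras.Sequential([
    keras.Input(shape=(784,)),
    layers.Dense(512, activation="relu"),
    layers.Dropout(0.2),
    layers.Dense(512, activation="relu"),
    layers.Dropout(0.2),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.2),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.2),
    layers.Dense(10, activation="softmax"),
])
deep_model.compile(optimizer=keras.optimizers.Adadelta(),
                   loss="categorical_crossentropy",
                   metrics=["accuracy"])
```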