Tricks for Improving Neural Networks

From Keras basics to avoiding overfitting and vanishing gradients.
Lecture 3-3

1. Moving to Keras 📦

Slides 1-11

Last week, we built a neural network from scratch ($\sim 95\%$ accuracy). Now let's use Keras (with the TensorFlow backend) to do the same thing professionally.

Our Baseline Model

  • Input: 784 pixels (28x28)
  • Hidden: 30 neurons (Sigmoid)
  • Output: 10 neurons (Sigmoid/Softmax)
  • Loss: Mean Squared Error (MSE)
🐍 + 🔥

```bash
pip install keras tensorflow
```
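
A minimal Keras sketch of the baseline above (the optimizer, epoch count, and batch size here are illustrative choices, not prescribed by the slides):

```python
# Baseline: 784 -> 30 (sigmoid) -> 10 (sigmoid), trained with MSE.
from tensorflow import keras

# Load MNIST, flatten to 784 pixels, scale to [0, 1], one-hot encode the labels.
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0
y_train = keras.utils.to_categorical(y_train, 10)
y_test = keras.utils.to_categorical(y_test, 10)

model = keras.Sequential([
    keras.Input(shape=(784,)),
    keras.layers.Dense(30, activation="sigmoid"),
    keras.layers.Dense(10, activation="sigmoid"),
])
model.compile(optimizer="sgd", loss="mean_squared_error", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=30, batch_size=32,
          validation_data=(x_test, y_test))
```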

Even with this simple model, Keras gets us to ~96.6% accuracy. But we can do better!

2. Loss Functions & Initialization 📉

Slides 12-22

The Problem: Slow Learning

If we initialize weights poorly (e.g., too large), the sigmoid saturates ($ \sigma(z) \approx 0 \text{ or } 1 $). Its derivative is then near zero, so gradient descent barely updates the weights and learning crawls.

The Solution: Cross-Entropy

$$ C = - \frac{1}{n} \sum [y \ln a + (1-y) \ln (1-a)] $$

Cross-Entropy loss cancels out the $\sigma'(z)$ term in the gradient, fixing the slow learning problem!
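
For a sigmoid output neuron this is a standard one-line result (not derived on the slides): the $\sigma'(z)$ factor from the chain rule is cancelled by the derivative of the cross-entropy, leaving

$$ \frac{\partial C}{\partial w_j} = \frac{1}{n} \sum x_j \,(a - y), \qquad \frac{\partial C}{\partial b} = \frac{1}{n} \sum (a - y) $$

so the gradient is proportional to the error $a - y$: the bigger the mistake, the faster the learning.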

Softmax Output

For multi-class classification, use Softmax + Categorical Cross-Entropy. Softmax turns the raw outputs into a probability distribution over the classes (non-negative values that sum to 1).
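
A sketch of the corresponding change to the baseline (the same 30-neuron hidden layer is assumed; only the output activation and the loss change):

```python
# Softmax output + categorical cross-entropy instead of sigmoid + MSE.
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(784,)),
    keras.layers.Dense(30, activation="sigmoid"),
    keras.layers.Dense(10, activation="softmax"),   # outputs form a probability distribution
])
model.compile(optimizer="sgd",
              loss="categorical_crossentropy",      # expects one-hot labels
              metrics=["accuracy"])
```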

3. Fighting Overfitting 🛡️

Slides 24-37

Overfitting: the model memorizes the training data but fails to generalize to unseen test data. The telltale sign is a training loss that keeps dropping while the validation loss starts rising.

Technique 1: L2 Regularization

Add a penalty for large weights ($ \lambda \sum w^2 $).
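
In Keras this is a per-layer `kernel_regularizer` (the $\lambda$ value below is an illustrative guess):

```python
# L2 penalty: adds lambda * sum(w^2) for this layer's weights to the loss.
from tensorflow import keras
from tensorflow.keras import regularizers

layer = keras.layers.Dense(30, activation="sigmoid",
                           kernel_regularizer=regularizers.l2(1e-4))
```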

Technique 2: Dropout

Randomly switch off a fraction of the neurons at each training step, so the network cannot rely too heavily on any single neuron. At test time all neurons are active.
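
As a Keras layer (the 50% rate is a common default, used here as an assumption):

```python
# Dropout is only applied during training; it is a no-op at inference time.
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(784,)),
    keras.layers.Dense(30, activation="sigmoid"),
    keras.layers.Dropout(0.5),                      # drop ~50% of these activations per step
    keras.layers.Dense(10, activation="softmax"),
])
```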

Technique 3: Early Stopping

Stop training automatically when validation loss stops improving.
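
In Keras this is a callback (the monitored quantity and patience below are illustrative):

```python
# Stop when val_loss has not improved for `patience` consecutive epochs.
from tensorflow import keras

early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=5,
    restore_best_weights=True,   # roll back to the weights of the best epoch
)
# model.fit(x_train, y_train, validation_split=0.1,
#           epochs=100, callbacks=[early_stop])
```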

4. Data Augmentation 🖼️

Slides 38-39

More data is the best cure for overfitting. When we cannot collect more, we can artificially enlarge the training set by rotating or shifting the existing images.
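
One way to do this in Keras is `ImageDataGenerator` (the rotation and shift ranges below are illustrative):

```python
# Random small rotations and shifts of the training images.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=10,        # rotate by up to +/-10 degrees
    width_shift_range=0.1,    # shift horizontally by up to 10% of the width
    height_shift_range=0.1,   # shift vertically by up to 10% of the height
)
# Image generators expect 4-D input, e.g. x_train with shape (n, 28, 28, 1):
# model.fit(datagen.flow(x_train, y_train, batch_size=32), epochs=30)
```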

5. Optimization Tricks ⚡

Slides 40-48

Learning Rate

Too small and training crawls; too large and the loss oscillates or diverges.

Momentum

Accelerates SGD in the relevant direction and dampens oscillations by accumulating a velocity term over past gradients.
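
In Keras, both the learning rate and the momentum are arguments of the SGD optimizer (the values below are illustrative):

```python
# Plain SGD vs. SGD with momentum.
from tensorflow import keras

plain_sgd = keras.optimizers.SGD(learning_rate=0.1)
momentum_sgd = keras.optimizers.SGD(learning_rate=0.1, momentum=0.9)
```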

Advanced Optimizers

Instead of tuning the learning rate by hand, use adaptive optimizers such as Adagrad, Adadelta, RMSprop, or Adam, which adapt the step size per parameter as training progresses.
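
Switching optimizers is a one-line change at compile time (the model here is just the earlier softmax baseline, repeated so the snippet stands alone):

```python
# Swap in an adaptive optimizer by name, or pass an optimizer object for custom settings.
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(784,)),
    keras.layers.Dense(30, activation="sigmoid"),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",                      # or "rmsprop", "adagrad", "adadelta"
              loss="categorical_crossentropy",
              metrics=["accuracy"])
# Equivalent, with explicit settings: keras.optimizers.Adam(learning_rate=1e-3)
```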

(Note: Batch Size also affects training dynamics. Smaller batches are noisier but generalize better. See `l303-example-06b.py` in the code list below.)

6. Going Deeper & Activations 🌌

Slides 49-58

Vanishing Gradient

The sigmoid derivative is at most 0.25 ($\sigma'(z) = \sigma(z)(1-\sigma(z)) \le 1/4$). During backpropagation the gradient of an early layer is a product of many such factors, so in deep networks it shrinks toward zero.

Solution: ReLU

Rectified Linear Unit: $f(x) = \max(0,x)$. Its derivative is 1 for $x>0$ (and 0 for $x<0$), so gradients pass through active units without shrinking. No vanishing gradient!

The Final "Monster" Model

Combining everything: 512 ReLU neurons, Dropout, and the Adadelta optimizer.
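
A sketch of this model (the single hidden layer, the dropout rate, and the softmax/cross-entropy output are assumptions based on the earlier sections):

```python
# "Monster" model: 512 ReLU units + Dropout, trained with Adadelta.
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(784,)),
    keras.layers.Dense(512, activation="relu"),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adadelta",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```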

Variant: a deeper network (4 layers), sketched below.
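
The exact layer sizes of the deeper variant are not listed here; one hypothetical reading, repeating the 512-unit ReLU + Dropout block, is:

```python
# Hypothetical 4-hidden-layer variant: layer widths and dropout rates are assumptions.
from tensorflow import keras

model = keras.Sequential([keras.Input(shape=(784,))])
for _ in range(4):                                   # four hidden ReLU layers
    model.add(keras.layers.Dense(512, activation="relu"))
    model.add(keras.layers.Dropout(0.5))
model.add(keras.layers.Dense(10, activation="softmax"))

model.compile(optimizer="adadelta",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```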

🧠 Knowledge Check