1. Recap: Where are we? 📍
Slides 2-3
Last time, we classified MNIST digits using Linear Methods (LDA, Linear SVM).
Accuracy: ~92% (using all pixels). Not bad, but we hit a wall.
The Problem
Life isn't always linear. Sometimes the path from Start to End is a messy squiggle.
"Start --〰️--🌀--> End" (Reality)
2. Nonlinear SVM & The Doughnut Problem 🍩
Slides 4-13
Remember the Kernel Trick? We map data to higher dimensions to make it separable.
Let's test this on a "Double Doughnut" dataset (concentric circles), which is impossible for linear classifiers.
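A minimal sketch of that experiment, using scikit-learn's `make_circles` as a stand-in for the Double Doughnut data (the sample count and noise level are illustrative):

```python
# Concentric circles: hopeless for a linear SVM, easy for an RBF kernel.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=1000, factor=0.5, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ('linear', 'rbf'):
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    print(kernel, clf.score(X_test, y_test))  # expect: linear near chance, rbf near 1.0
```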
Hyperparameter Tuning
The RBF kernel has two key parameters:
- C: Regularization strength (how heavily we penalize misclassified training points).
- gamma ($\gamma$): Inverse width of the Gaussian kernel (large $\gamma$ = narrow kernel, wiggly decision boundary).
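One way to tune both knobs at once is a cross-validated grid search; here is a sketch with scikit-learn's `GridSearchCV` on the doughnut split from the snippet above (the grid values are illustrative):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {'C': [0.1, 1, 5, 10], 'gamma': [0.005, 0.05, 0.5, 5]}
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
search.fit(X_train, y_train)  # X_train, y_train from the doughnut snippet above
print(search.best_params_, search.best_score_)
```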
Back to MNIST
Can we beat our 92% linear score? Using a tuned RBF SVM ($C=5, \gamma=0.05$):
Result:
Accuracy jumps to ~96.6% (from 91.7%). Nonlinearity works!
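A sketch of that run with scikit-learn; the data loading, scaling, and split here are assumptions about the setup, and a 10k-image training subsample is used because fitting an RBF SVM on all 60k digits is slow:

```python
# Tuned RBF SVM on MNIST (sketch). Pixels are scaled to [0, 1].
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
X = X / 255.0
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=10000, test_size=10000,
                                           random_state=0)

clf = SVC(kernel='rbf', C=5, gamma=0.05).fit(X_tr, y_tr)
print('Test accuracy:', clf.score(X_te, y_te))  # lecture's full-data result: ~0.966
```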
3. Artificial Neural Networks (ANN) 🧠
Slides 16-30
The Biological Inspiration
Inspired by the brain, but mathematically simple.
Perceptron: Inputs ($x_i$) $\to$ Weighted Sum ($\sum_i w_i x_i$) $\to$ Threshold $\to$ Output (0 or 1).
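As a minimal sketch (the function name and the zero threshold default are illustrative):

```python
import numpy as np

def perceptron(x, w, threshold=0.0):
    # Weighted sum of the inputs, then a hard threshold -> output 0 or 1
    return 1 if np.dot(w, x) > threshold else 0
```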
The Solution: Sigmoid Neurons
The hard threshold isn't differentiable, so gradient-based learning can't get traction. Replace it with a smooth Sigmoid (or Tanh) activation: smooth, differentiable, and bounded between 0 and 1.
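Concretely, the logistic sigmoid, which is exactly what the `sigma` helper below computes via a tanh identity:

$$\sigma(z) = \frac{1}{1 + e^{-z}} = \tfrac{1}{2}\left(\tanh\tfrac{z}{2} + 1\right)$$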
Implementing a Neural Network from Scratch
We define a class `neurons` that handles:
- Feedforward: Calculating outputs layer by layer.
- Backpropagation: Calculating gradients to update weights.
```python
import random
import numpy as np

def sigma(z):
    # Logistic sigmoid, written via a tanh identity for numerical stability
    return 0.5*(np.tanh(0.5*z)+1.)

def sigma_p(z):
    # Derivative of the sigmoid: sigma'(z) = sigma(z)*(1 - sigma(z))
    return sigma(z)*(1.-sigma(z))

class neurons(object):
    def __init__(self, shape):
        # shape = [n_inputs, n_hidden_1, ..., n_outputs]
        self.shape = shape
        self.v = [np.zeros((n,1)) for n in shape]         # activations per layer
        self.z = [np.zeros((n,1)) for n in shape[1:]]     # weighted inputs per layer
        self.w = [np.random.randn(n,m) for n,m in zip(shape[1:],shape[:-1])]
        self.b = [np.random.randn(n,1) for n in shape[1:]]
        self.delw = [np.zeros(w.shape) for w in self.w]   # gradients w.r.t. weights
        self.delb = [np.zeros(b.shape) for b in self.b]   # gradients w.r.t. biases

    def predict(self, x):
        # Feedforward: propagate the input through the layers
        self.v[0] = x.reshape(self.v[0].shape)
        for l in range(len(self.shape)-1):
            self.z[l] = np.dot(self.w[l],self.v[l])+self.b[l]
            self.v[l+1] = sigma(self.z[l])
        return self.v[-1]

    def gradient(self, y):
        # Backpropagation: walk backwards from the output layer
        for l in range(len(self.shape)-2,-1,-1):
            if l==len(self.shape)-2:
                # Output layer: derivative of the quadratic (MSE) loss
                delta = (self.v[-1]-y.reshape(self.v[-1].shape))*sigma_p(self.z[l])
            else:
                # Hidden layers: pull the next layer's delta back through its weights
                delta = np.dot(self.w[l+1].T,self.delb[l+1])*sigma_p(self.z[l])
            self.delb[l] = delta
            self.delw[l] = np.dot(delta,self.v[l].T)

    def fit(self, x_data, y_data, epochs, batch_size, eta):
        # Mini-batch stochastic gradient descent
        samples = list(zip(x_data, y_data))
        for ep in range(epochs):
            print('Epoch: %d/%d' % (ep+1,epochs))
            random.shuffle(samples)
            sum_delw = [np.zeros(w.shape) for w in self.w]
            sum_delb = [np.zeros(b.shape) for b in self.b]
            batch_count = 0
            for x,y in samples:
                self.predict(x)
                self.gradient(y)
                for l in range(len(self.shape)-1):
                    sum_delw[l] += self.delw[l]
                    sum_delb[l] += self.delb[l]
                batch_count += 1
                # Apply the averaged update once a mini-batch (or the last sample) is done
                if batch_count>=batch_size or (x is samples[-1][0]):
                    for l in range(len(self.shape)-1):
                        self.w[l] -= eta/batch_count*sum_delw[l]
                        self.b[l] -= eta/batch_count*sum_delb[l]
                        sum_delw[l] = np.zeros(self.w[l].shape)  # reset accumulators
                        sum_delb[l] = np.zeros(self.b[l].shape)
                    batch_count = 0
            ret = self.evaluate(x_data, y_data)
            print('Loss: %.4f, Acc: %.4f' % ret)

    def evaluate(self, x_data, y_data):
        # Mean squared error and classification accuracy over a dataset
        loss, cnt = 0., 0.
        for x,y in zip(x_data, y_data):
            self.predict(x)
            loss += ((self.v[-1]-y.reshape(self.v[-1].shape))**2).sum()
            if np.argmax(self.v[-1])==np.argmax(y): cnt += 1.
        loss /= 2.*len(x_data)
        return loss, cnt/len(x_data)
```
Before training, the network output is random noise. The histograms of "0" and "1" predictions overlap completely.
4. Training with Backpropagation 🏋️♂️
Slides 31-48
The Algorithm: SGD
Stochastic Gradient Descent (SGD):
- Pick a mini-batch of data.
- Compute Loss (MSE).
- Compute Gradients via Backpropagation (Chain Rule).
- Update weights: $w \leftarrow w - \eta \nabla L$
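The loss here is the same mean squared error computed in `evaluate` above, averaged over the $N$ samples:

$$L = \frac{1}{2N} \sum_{n=1}^{N} \left\| \hat{y}^{(n)} - y^{(n)} \right\|^2$$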
Training on Full MNIST (All Pixels)
Let's connect 784 inputs -> 30 hidden -> 10 outputs.
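A usage sketch with the `neurons` class above; the epoch count, batch size, and learning rate are illustrative choices, and the labels are assumed to be one-hot encoded:

```python
# x_train: 784-pixel vectors scaled to [0, 1]; y_train: one-hot digit labels (length 10).
net = neurons([784, 30, 10])
net.fit(x_train, y_train, epochs=30, batch_size=10, eta=3.0)
loss, acc = net.evaluate(x_test, y_test)
print('Test loss: %.4f, test accuracy: %.4f' % (loss, acc))
```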
5. Optimization & Overfitting 🛠️
Slides 49-56
The Overfitting Problem
The network memorizes the training data but fails on new data.
Symptom: Training loss keeps going down, but test loss goes up (or flattens out).
A Simple Fix: Weight Initialization
Initializing weights with large random numbers saturates the sigmoid (gradients $\to$ 0), so learning stalls.
Fix: Scale the initial weights by $1/\sqrt{N_{\text{inputs}}}$.
Notice how the scaled initialization converges much faster!
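A sketch of that change as a small subclass of the `neurons` class above (the class name is illustrative):

```python
import numpy as np

class neurons_scaled(neurons):
    def __init__(self, shape):
        super().__init__(shape)
        # Re-draw the weights at ~1/sqrt(N_inputs) scale so the initial weighted
        # sums stay in the sigmoid's steep region instead of its flat tails.
        self.w = [np.random.randn(n, m) / np.sqrt(m)
                  for n, m in zip(shape[1:], shape[:-1])]
```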
6. The Future: Keras & TensorFlow 🚀
Slides 57-58
We built a Neural Network from scratch to understand it. In practice, we use libraries like TensorFlow and Keras.
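As a teaser, here is a sketch of the same 784 $\to$ 30 $\to$ 10 sigmoid network in Keras (the hyperparameters mirror the from-scratch version and are illustrative):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(784,)),
    tf.keras.layers.Dense(30, activation='sigmoid'),
    tf.keras.layers.Dense(10, activation='sigmoid'),
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=3.0),
              loss='mse', metrics=['accuracy'])
# model.fit(x_train, y_train, epochs=30, batch_size=10)
```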
Coming up in Lecture 3-3...