Incorporating Nonlinear Models

From "Double Doughnuts" to Neural Networks.
Lecture 3-2

1. Recap: Where are we? 📍

Slides 2-3

Last time, we classified MNIST digits using Linear Methods (LDA, Linear SVM).
Accuracy: ~92% (using all pixels). Not bad, but we hit a wall.

🚧

The Problem

Life isn't always linear. Sometimes the path from Start to End is a messy squiggle.

"Start --------> End" (Expectation)
"Start --〰️--🌀--> End" (Reality)

2. Nonlinear SVM & The Doughnut Problem 🍩

Slides 4-13

Remember the Kernel Trick? We map data to higher dimensions to make it separable.
Let's test this on a "Double Doughnut" dataset (concentric circles), which is impossible for linear classifiers.

Hyperparameter Tuning

The RBF kernel has two key parameters (a small tuning sketch follows the list):

  • C: Regularization strength (how heavily we punish misclassified points).
  • gamma ($\gamma$): Inverse width of the Gaussian kernel (larger $\gamma$ means a narrower kernel and a more flexible decision boundary).
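
A quick way to explore these two knobs is a small grid search with cross-validation. Below is a minimal sketch (not the lecture's code), assuming scikit-learn and using make_circles as a stand-in for the "Double Doughnut" data; the parameter ranges are illustrative.

doughnut_svm.py (tuning sketch)
from sklearn.datasets import make_circles
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Stand-in "Double Doughnut": two concentric rings with a little noise.
X, y = make_circles(n_samples=500, factor=0.5, noise=0.08, random_state=0)

# Grid-search C (error penalty) and gamma (kernel scale) with 5-fold cross-validation.
param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)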

Back to MNIST

Can we beat our 92% linear score? Using a tuned RBF SVM ($C=5, \gamma=0.05$):

Result:

Accuracy jumps to ~96.6% (from 91.7%). Nonlinearity works!
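
For reference, here is a minimal sketch of how such a model could be fit with scikit-learn. It assumes x_train, y_train, x_test, y_test are already loaded, with the 784 pixel values flattened and scaled to [0, 1]; the data-loading step is omitted.

mnist_rbf_svm.py (sketch)
from sklearn.svm import SVC

# Assumes flattened 784-pixel vectors scaled to [0, 1].
clf = SVC(kernel='rbf', C=5, gamma=0.05)
clf.fit(x_train, y_train)
print('Test accuracy:', clf.score(x_test, y_test))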

3. Artificial Neural Networks (ANN) 🧠

Slides 16-30

The Biological Inspiration

Inspired by the brain, but mathematically simple.
Perceptron: Inputs ($x_i$) $\to$ Weighted Sum ($\sum w_i x_i$) $\to$ Threshold $\to$ Output (0 or 1).
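
Written out, the perceptron's decision rule is a hard threshold on the weighted sum:

$$ \text{output} = \begin{cases} 1 & \text{if } \sum_i w_i x_i > \text{threshold} \\ 0 & \text{otherwise} \end{cases} $$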

Problem: Binary step functions aren't differentiable. We can't use calculus to train them!

The Solution: Sigmoid Neurons

Replace the hard threshold with a smooth Sigmoid (or Tanh) function.

$$ \sigma(z) = \frac{1}{1 + e^{-z}} $$

Smooth, differentiable, 0 to 1.
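
Its derivative also takes a conveniently simple form (this is exactly what `sigma_p` computes in the code below):

$$ \sigma'(z) = \sigma(z)\,\big(1 - \sigma(z)\big) $$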

Implementing a Neural Network from Scratch

We define a class `neurons` that handles the forward pass (`predict`), backpropagation (`gradient`), mini-batch SGD training (`fit`), and evaluation (`evaluate`):

neurons.py (Core Logic)
import random
import numpy as np

# Sigmoid written via tanh for numerical stability: sigma(z) = 1/(1+exp(-z)).
def sigma(z):
    return 0.5*(np.tanh(0.5*z)+1.)

# Derivative of the sigmoid: sigma'(z) = sigma(z)*(1-sigma(z)).
def sigma_p(z):
    return sigma(z)*(1.-sigma(z))

class neurons(object):
    def __init__(self, shape):
        self.shape = shape                                   # layer sizes, e.g. [784, 30, 10]
        self.v = [np.zeros((n,1)) for n in shape]            # activations per layer
        self.z = [np.zeros((n,1)) for n in shape[1:]]        # weighted sums per layer
        self.w = [np.random.randn(n,m) for n,m in zip(shape[1:],shape[:-1])]  # weight matrices
        self.b = [np.random.randn(n,1) for n in shape[1:]]   # bias vectors
        self.delw = [np.zeros(w.shape) for w in self.w]      # gradients w.r.t. weights
        self.delb = [np.zeros(b.shape) for b in self.b]      # gradients w.r.t. biases

    def predict(self, x):
        # Forward pass: feed x through every layer and return the output activations.
        self.v[0] = x.reshape(self.v[0].shape)
        for l in range(len(self.shape)-1):
            self.z[l] = np.dot(self.w[l],self.v[l])+self.b[l]
            self.v[l+1] = sigma(self.z[l])
        return self.v[-1]

    def gradient(self, y):
        # Backward pass: propagate the error from the output layer toward the input.
        for l in range(len(self.shape)-2,-1,-1):
            if l==len(self.shape)-2:
                # Output layer: derivative of the MSE loss times sigma'(z).
                delta = (self.v[-1]-y.reshape(self.v[-1].shape))*sigma_p(self.z[l])
            else:
                # Hidden layer: pull the next layer's error back through its weights.
                delta = np.dot(self.w[l+1].T,self.delb[l+1])*sigma_p(self.z[l])
            self.delb[l] = delta
            self.delw[l] = np.dot(delta,self.v[l].T)

    def fit(self, x_data, y_data, epochs, batch_size, eta):
        # Mini-batch SGD: accumulate gradients over a batch, then take one averaged step.
        samples = list(zip(x_data, y_data))
        for ep in range(epochs):
            print('Epoch: %d/%d' % (ep+1,epochs))
            random.shuffle(samples)
            sum_delw = [np.zeros(w.shape) for w in self.w]
            sum_delb = [np.zeros(b.shape) for b in self.b]
            batch_count = 0
            for x,y in samples:
                self.predict(x)
                self.gradient(y)
                for l in range(len(self.shape)-1):
                    sum_delw[l] += self.delw[l]
                    sum_delb[l] += self.delb[l]
                batch_count += 1
                # Update at the end of each batch (or when the last shuffled sample is reached).
                if batch_count>=batch_size or (x is samples[-1][0]):
                    for l in range(len(self.shape)-1):
                        self.w[l] -= eta/batch_count*sum_delw[l]
                        self.b[l] -= eta/batch_count*sum_delb[l]
                        # Reset accumulators (the scalar 0. broadcasts back to an array on the next +=).
                        sum_delw[l],sum_delb[l] = 0.,0.
                    batch_count = 0
            ret = self.evaluate(x_data, y_data)
            print('Loss: %.4f, Acc: %.4f' % ret)

    def evaluate(self, x_data, y_data):
        # Return (mean squared error / 2, classification accuracy) over the whole data set.
        loss, cnt = 0., 0.
        for x,y in zip(x_data, y_data):
            self.predict(x)
            loss += ((self.v[-1]-y.reshape(self.v[-1].shape))**2).sum()
            if np.argmax(self.v[-1])==np.argmax(y): cnt += 1.
        loss /= 2.*len(x_data)
        return loss, cnt/len(x_data)

Before training, the network output is random noise. The histograms of "0" and "1" predictions overlap completely.

4. Training with Backpropagation 🏋️‍♂️

Slides 31-48

The Algorithm: SGD

Stochastic Gradient Descent (SGD):

  1. Pick a mini-batch of data.
  2. Compute Loss (MSE).
  3. Compute Gradients via Backpropagation (Chain Rule; spelled out after this list).
  4. Update weights: $w \leftarrow w - \eta \nabla L$
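
Concretely, these are the equations our `gradient()` method implements, writing $a^l$ for the activations (`self.v`), $z^l$ for the weighted sums, and $\odot$ for element-wise multiplication:

$$ \delta^{L} = (a^{L} - y) \odot \sigma'(z^{L}), \qquad \delta^{l} = \left(W^{l+1}\right)^{T} \delta^{l+1} \odot \sigma'(z^{l}) $$

$$ \frac{\partial L}{\partial W^{l}} = \delta^{l} \left(a^{l-1}\right)^{T}, \qquad \frac{\partial L}{\partial b^{l}} = \delta^{l} $$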

Training on Full MNIST (All Pixels)

Let's connect 784 inputs -> 30 hidden -> 10 outputs.
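
With the `neurons` class above, that network might be trained like this (a sketch: it assumes MNIST images flattened to 784-dimensional vectors in [0, 1] and one-hot encoded labels; the hyperparameter values shown are illustrative, not the lecture's):

train_mnist.py (sketch)
# Assumes x_train has shape (N, 784) with values in [0, 1]
# and y_train has shape (N, 10) with one-hot labels.
net = neurons([784, 30, 10])
net.fit(x_train, y_train, epochs=30, batch_size=10, eta=3.0)
print(net.evaluate(x_test, y_test))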

5. Optimization & Overfitting 🛠️

Slides 49-56

The Overfitting Problem

The network memorizes the training data but fails on new data.
Symptom: Training loss keeps going down, but testing loss goes up (or flattens out).

A Simple Fix: Weight Initialization

Initializing weights with large random numbers saturates the sigmoid function (gradients $\to$ 0).
Fix: Scale weights by $1/\sqrt{N_{inputs}}$.
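
One way to apply this fix in our `neurons` class is to rescale each weight matrix by its fan-in at construction time; a minimal sketch of the changed lines in `__init__`:

        # Scaled initialization: m is the number of inputs feeding the layer,
        # so each weight now has standard deviation 1/sqrt(m).
        self.w = [np.random.randn(n,m)/np.sqrt(m) for n,m in zip(shape[1:],shape[:-1])]
        self.b = [np.random.randn(n,1) for n in shape[1:]]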

With the scaled initialization, training converges much faster.

6. The Future: Keras & TensorFlow 🚀

Slides 57-58
📦

We built a Neural Network from scratch to understand it. In practice, we use libraries like TensorFlow and Keras.

pip install tensorflow keras
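
As a preview, the same 784-30-10 sigmoid network can be written in a few lines of Keras. The sketch below is only illustrative, not the lecture's own code:

keras_preview.py (sketch)
import tensorflow as tf

# Same architecture as our from-scratch network: 784 -> 30 -> 10,
# sigmoid activations, MSE loss, plain SGD.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    tf.keras.layers.Dense(30, activation='sigmoid'),
    tf.keras.layers.Dense(10, activation='sigmoid'),
])
model.compile(optimizer='sgd', loss='mse', metrics=['accuracy'])
# model.fit(x_train, y_train_onehot, epochs=30, batch_size=10)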

Coming up in Lecture 3-3...

🧠 Knowledge Check