⚔️

Support Vector Machines

Maximizing the margin with convex quadratic programming.

🎯

1. Motivation

The Problem

Find a linear decision surface (hyperplane) that separates two classes and has the largest possible distance (margin) between itself and the closest points of each class.

  • Support Vectors: The data points that lie exactly on the margin boundaries; they alone determine the hyperplane.
  • Margin: The distance between the separating hyperplane and the nearest data points of either class (equivalently, the width of the gap between the two margin boundaries).

If the data is linearly separable, there are infinitely many hyperplanes that can separate the classes. SVM finds the "best" one: the one that maximizes the margin.

📐

2. Primal Formulation

Let the hyperplane be defined by \( w \cdot x + b = 0 \). We want to classify data points \( (x_i, y_i) \) where \( y_i \in \{+1, -1\} \).

Geometric Insight

We scale \( w \) and \( b \) such that the support vectors lie on the hyperplanes:

\[ w \cdot x + b = 1 \quad \text{and} \quad w \cdot x + b = -1 \]

The distance (margin) between these planes is \( \frac{2}{\|w\|} \). To maximize the margin, we minimize \( \|w\|^2 \).
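To see where the \( \frac{2}{\|w\|} \) comes from: the distance from a point \( x_0 \) to the hyperplane \( w \cdot x + b = 0 \) is \( \frac{|w \cdot x_0 + b|}{\|w\|} \), so each of the two margin planes sits at distance \( \frac{1}{\|w\|} \) from the separating hyperplane, and

\[ \text{margin} = \frac{1}{\|w\|} + \frac{1}{\|w\|} = \frac{2}{\|w\|}. \]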

SVM Primal Optimization Problem

\[ \begin{aligned} & \text{minimize} && \frac{1}{2} \|w\|^2 \\ & \text{subject to} && y_i (w \cdot x_i + b) \geq 1, \quad i=1,\dots,m \end{aligned} \]

This is a Convex Quadratic Programming (QP) problem.
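As a concrete illustration, the primal can be handed directly to a generic QP solver by stacking the variable as \( z = (w, b) \). Below is a minimal sketch using `cvxopt`; the helper name `hard_margin_primal` and the toy data are illustrative choices, and the data is assumed to be linearly separable (otherwise this hard-margin QP is infeasible).

```python
import numpy as np
from cvxopt import matrix, solvers

def hard_margin_primal(X, y):
    """Solve min (1/2)||w||^2 s.t. y_i (w . x_i + b) >= 1, over z = (w, b)."""
    m, n = X.shape
    P = np.zeros((n + 1, n + 1))
    P[:n, :n] = np.eye(n)                  # quadratic term acts on w only; b is unpenalized
    q = np.zeros(n + 1)
    # Rewrite y_i (w . x_i + b) >= 1 as -y_i (w . x_i + b) <= -1  (the G z <= h form)
    G = -y[:, None] * np.hstack([X, np.ones((m, 1))])
    h = -np.ones(m)
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
    z = np.array(sol["x"]).ravel()
    return z[:n], z[n]                     # w, b

# Illustrative toy data: two well-separated blobs with labels -1 / +1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3, 0.5, (10, 2)), rng.normal(3, 0.5, (10, 2))])
y = np.array([-1.0] * 10 + [1.0] * 10)
w, b = hard_margin_primal(X, y)
print("margin width:", 2 / np.linalg.norm(w))
```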

🔄

3. Dual Formulation

By forming the Lagrangian and setting its derivatives with respect to \( w \) and \( b \) to zero, we obtain the dual problem. The dual has one variable per training sample rather than per feature, so it is particularly convenient when the number of features is large compared to the number of samples.
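Sketching the intermediate step: the Lagrangian attaches a multiplier \( \lambda_i \geq 0 \) to each constraint,

\[ L(w, b, \lambda) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^m \lambda_i \big[ y_i (w \cdot x_i + b) - 1 \big]. \]

Setting \( \partial L / \partial w = 0 \) gives \( w = \sum_{i} \lambda_i y_i x_i \), and \( \partial L / \partial b = 0 \) gives \( \sum_{i} \lambda_i y_i = 0 \). Substituting these back into \( L \) yields the dual below.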

SVM Dual Optimization Problem

\[ \begin{aligned} & \text{maximize} && \sum_{i=1}^m \lambda_i - \frac{1}{2} \sum_{i,j=1}^m \lambda_i \lambda_j y_i y_j (x_i \cdot x_j) \\ & \text{subject to} && \lambda_i \geq 0, \quad i=1,\dots,m \\ & && \sum_{i=1}^m \lambda_i y_i = 0 \end{aligned} \]

Key Observation

The dual problem depends only on the dot products \( x_i \cdot x_j \). This allows us to use the Kernel Trick to handle non-linearly separable data by implicitly mapping it to a higher-dimensional space!

🌫️

4. Soft-Margin SVM

Real-world data is often noisy and not perfectly separable. We introduce slack variables \( \xi_i \) to allow some misclassification, but penalize them with a parameter \( C \).

Primal Soft-Margin

\[ \min \frac{1}{2} \|w\|^2 + C \sum_{i=1}^m \xi_i \]

Subject to \( y_i(w \cdot x_i + b) \geq 1 - \xi_i \) and \( \xi_i \geq 0 \)

Dual Soft-Margin

Same objective as standard dual, but with box constraints:

\[ 0 \leq \lambda_i \leq C \]
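In practice the soft-margin problem is handed to a library solver. A minimal sketch using scikit-learn's `SVC` with a linear kernel; the toy blobs and the choice `C=1.0` are illustrative, not prescribed by the chapter:

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D data: two slightly overlapping Gaussian blobs with labels -1 / +1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (20, 2)), rng.normal(1, 1, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

# C controls the slack penalty: large C -> few margin violations (stiffer margin),
# small C -> wider margin but more misclassification tolerated.
clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(len(clf.support_vectors_), "support vectors")
```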

5. The Kernel Trick

What if the data is not linearly separable? We can map the data to a higher-dimensional feature space where it is separable.

Implicit Mapping

Instead of computing the mapping \( \Phi(x) \) explicitly (which can be expensive or infinite-dimensional), we use a Kernel Function \( K(x_i, x_j) \) that computes the dot product in the feature space:

\[ K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j) \]

The dual formulation depends only on dot products, so we simply replace \( x_i \cdot x_j \) with \( K(x_i, x_j) \).

Popular Kernels

Polynomial Kernel

\( K(x, z) = (x \cdot z + c)^d \)

Gaussian (RBF) Kernel

\( K(x, z) = \exp(-\gamma \|x - z\|^2) \)
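As a quick sanity check, both kernels are one-liners in NumPy; the helper names and the default parameter values below are illustrative choices:

```python
import numpy as np

def poly_kernel(x, z, c=1.0, d=3):
    """Polynomial kernel K(x, z) = (x . z + c)^d."""
    return (np.dot(x, z) + c) ** d

def rbf_kernel(x, z, gamma=0.5):
    """Gaussian (RBF) kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

x, z = np.array([1.0, 2.0]), np.array([2.0, 0.0])
print(poly_kernel(x, z), rbf_kernel(x, z))   # kernel values for one pair of points
```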

📉

6. Unconstrained Form

We can rewrite the Soft-Margin SVM as an unconstrained optimization problem involving the Hinge Loss.

Loss + Penalty

\[ \text{minimize} \quad \sum_{i=1}^m \max(0, 1 - y_i f(x_i)) + \lambda \|w\|^2 \]
  • Loss: \( \max(0, 1 - y_i f(x_i)) \), where \( f(x_i) = w \cdot x_i + b \) (Hinge Loss) - zero for points classified correctly with margin at least 1, positive for points that are misclassified or fall inside the margin.
  • Penalty: \( \lambda \|w\|^2 \) - regularization term; a larger \( \lambda \) plays the role of a smaller \( C \).
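A minimal sketch of minimizing this loss-plus-penalty objective by subgradient descent on the linear model \( f(x) = w \cdot x + b \); the learning rate, the value of \( \lambda \), and the toy data are arbitrary illustrative choices, and this is only one of many ways to optimize the hinge objective:

```python
import numpy as np

def hinge_objective(w, b, X, y, lam):
    """Sum of hinge losses plus L2 penalty: sum max(0, 1 - y_i f(x_i)) + lam * ||w||^2."""
    margins = 1 - y * (X @ w + b)
    return np.maximum(0, margins).sum() + lam * np.dot(w, w)

def subgradient_step(w, b, X, y, lam, lr=0.01):
    """One subgradient step; only points with margin < 1 contribute to the loss term."""
    violating = (1 - y * (X @ w + b)) > 0
    grad_w = -(y[violating, None] * X[violating]).sum(axis=0) + 2 * lam * w
    grad_b = -y[violating].sum()
    return w - lr * grad_w, b - lr * grad_b

# Illustrative toy data: two 2-D blobs with labels -1 / +1.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([-1.0] * 20 + [1.0] * 20)
w, b = np.zeros(2), 0.0
for _ in range(200):
    w, b = subgradient_step(w, b, X, y, lam=0.01)
print("objective:", round(hinge_objective(w, b, X, y, lam=0.01), 3))
```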
📊

7. SVM Regression (SVR)

SVM can also be used for regression! The goal is to find a function that fits the data within an \( \epsilon \)-margin.

Optimization Model

\[ \begin{aligned} & \text{minimize} && \frac{1}{2} \|w\|^2 \\ & \text{subject to} && |y_i - (w \cdot x_i + b)| \leq \epsilon \end{aligned} \]

We ignore errors smaller than \( \epsilon \), creating a "tube" around the regression line.
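A minimal usage sketch with scikit-learn's `SVR`; the RBF kernel, `epsilon=0.1`, and the noisy sine data are illustrative choices:

```python
import numpy as np
from sklearn.svm import SVR

# Noisy 1-D regression data.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, (40, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 40)

# Errors smaller than epsilon fall inside the "tube" and are not penalized.
reg = SVR(kernel="rbf", epsilon=0.1, C=1.0).fit(X, y)
print(reg.predict([[2.5]]))
```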

⚙️

8. Solving via QP

To solve the Dual Kernel SVM, we map it to a standard Quadratic Programming (QP) solver.

Standard QP form:

\[ \begin{aligned} & \text{minimize} && \frac{1}{2} x^\top P x + q^\top x \\ & \text{subject to} && Gx \leq h, \quad Ax = b \end{aligned} \]

SVM Mapping:

  • Variable: \( x = (\lambda_1, \dots, \lambda_m)^\top \)
  • Matrix: \( P_{ij} = y_i y_j K(x_i, x_j) \)
  • Linear term: \( q = (-1, \dots, -1)^\top \)
  • Equality constraint: \( \sum_{i=1}^m \lambda_i y_i = 0 \), i.e. \( A = y^\top, \; b = 0 \)
  • Bounds: \( 0 \leq \lambda_i \leq C \), encoded through \( G \) and \( h \)
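Putting the mapping together, here is a minimal sketch using the `cvxopt` QP solver; the helper name `solve_dual_svm`, the RBF kernel, `C=1.0`, and the toy data are illustrative assumptions:

```python
import numpy as np
from cvxopt import matrix, solvers

def solve_dual_svm(X, y, kernel, C=1.0):
    """Solve the dual: min (1/2) l'Pl + q'l  s.t.  0 <= l_i <= C,  sum_i l_i y_i = 0."""
    m = X.shape[0]
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    P = matrix(np.outer(y, y) * K)                        # P_ij = y_i y_j K(x_i, x_j)
    q = matrix(-np.ones(m))                               # q = (-1, ..., -1)
    G = matrix(np.vstack([-np.eye(m), np.eye(m)]))        # box constraints as G l <= h:
    h = matrix(np.hstack([np.zeros(m), C * np.ones(m)]))  #   -l_i <= 0  and  l_i <= C
    A = matrix(y.reshape(1, -1))                          # equality: sum_i l_i y_i = 0
    b = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b)
    return np.array(sol["x"]).ravel()                     # the multipliers lambda_i

# Illustrative usage with an RBF kernel on toy data.
rbf = lambda x, z, gamma=0.5: np.exp(-gamma * np.sum((x - z) ** 2))
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (15, 2)), rng.normal(1, 1, (15, 2))])
y = np.array([-1.0] * 15 + [1.0] * 15)
lam = solve_dual_svm(X, y, rbf)
print((lam > 1e-5).sum(), "support vectors")
```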
✍️

9. Test Your Understanding

1. What is the main advantage of the Kernel Trick?

2. In Hinge Loss, when is the loss zero?
