1. Motivation
The Problem
Find a linear decision surface (hyperplane) that separates two classes and has the largest possible distance (margin) to the closest points of either class.
- Support Vectors: The data points that lie closest to the hyperplane, on the margin boundaries.
- Margin: The width of the gap between the two classes around the hyperplane.
If the data is linearly separable, there are infinitely many hyperplanes that can separate the classes. SVM finds the "best" one: the hyperplane that maximizes the margin.
2. Primal Formulation
Let the hyperplane be defined by \( w \cdot x + b = 0 \). We want to classify data points \( (x_i, y_i) \) where \( y_i \in \{+1, -1\} \).
Geometric Insight
We scale \( w \) and \( b \) such that the support vectors lie on the hyperplanes:
\[ w \cdot x + b = 1 \quad \text{and} \quad w \cdot x + b = -1 \]
The distance (margin) between these two planes is \( \frac{2}{\|w\|} \). To maximize the margin, we minimize \( \|w\|^2 \).
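One way to see where the \( \frac{2}{\|w\|} \) value comes from: take any point \( x_+ \) on the \( +1 \) plane and any point \( x_- \) on the \( -1 \) plane, subtract the two plane equations, and project the difference onto the unit normal \( \frac{w}{\|w\|} \):
\[ \begin{aligned} (w \cdot x_+ + b) - (w \cdot x_- + b) &= 1 - (-1) = 2 \\ \Rightarrow \quad \frac{w}{\|w\|} \cdot (x_+ - x_-) &= \frac{2}{\|w\|} \end{aligned} \]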
SVM Primal Optimization Problem
\[ \begin{aligned} & \text{minimize} && \frac{1}{2} \|w\|^2 \\ & \text{subject to} && y_i (w \cdot x_i + b) \geq 1, \quad i=1,\dots,m \end{aligned} \]
This is a Convex Quadratic Programming (QP) problem.
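As a concrete illustration, here is a minimal sketch of solving this primal QP numerically with cvxopt; the variable names and the tiny regularizer on the \( b \) entry (added only so the solver sees a positive definite matrix) are my own choices, and any convex QP solver would work just as well.

```python
import numpy as np
from cvxopt import matrix, solvers

def hard_margin_primal(X, y):
    """Solve min 1/2 ||w||^2  s.t.  y_i (w.x_i + b) >= 1.

    X: (m, d) array of samples, y: (m,) array of labels in {+1, -1}.
    Returns (w, b). Assumes the data is linearly separable.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    m, d = X.shape
    # QP variable z = [w; b] of size d + 1.
    P = np.zeros((d + 1, d + 1))
    P[:d, :d] = np.eye(d)          # the 1/2 ||w||^2 term
    P[d, d] = 1e-8                 # tiny value so P stays positive definite for the solver
    q = np.zeros(d + 1)
    # Constraint y_i (w.x_i + b) >= 1 rewritten in the solver's form G z <= h.
    G = -y[:, None] * np.hstack([X, np.ones((m, 1))])
    h = -np.ones(m)
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
    z = np.array(sol["x"]).ravel()
    return z[:d], z[d]
```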
3. Dual Formulation
By forming the Lagrangian and setting derivatives to zero, we obtain the dual problem. This is useful when the number of features is large compared to the number of samples.
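Concretely, the Lagrangian is \( L(w, b, \lambda) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^m \lambda_i \left[ y_i (w \cdot x_i + b) - 1 \right] \), and setting its derivatives to zero gives:
\[ \begin{aligned} \frac{\partial L}{\partial w} = 0 &\;\Rightarrow\; w = \sum_{i=1}^m \lambda_i y_i x_i \\ \frac{\partial L}{\partial b} = 0 &\;\Rightarrow\; \sum_{i=1}^m \lambda_i y_i = 0 \end{aligned} \]
Substituting \( w \) back into \( L \) eliminates \( w \) and \( b \) and yields the dual problem below.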
SVM Dual Optimization Problem
\[ \begin{aligned} & \text{maximize} && \sum_{i=1}^m \lambda_i - \frac{1}{2} \sum_{i,j=1}^m \lambda_i \lambda_j y_i y_j (x_i \cdot x_j) \\ & \text{subject to} && \lambda_i \geq 0, \quad i=1,\dots,m \\ & && \sum_{i=1}^m \lambda_i y_i = 0 \end{aligned} \]
Key Observation
The dual problem depends only on the dot products \( x_i \cdot x_j \). This allows us to use the Kernel Trick to handle non-linearly separable data by implicitly mapping it to a higher-dimensional space!
4. Soft-Margin SVM
Real-world data is often noisy and not perfectly separable. We introduce slack variables \( \xi_i \) to allow some misclassification, but penalize them with a parameter \( C \).
Primal Soft-Margin
\[ \min \frac{1}{2} \|w\|^2 + C \sum_{i=1}^m \xi_i \]
Subject to \( y_i(w \cdot x_i + b) \geq 1 - \xi_i \) and \( \xi_i \geq 0 \).
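A quick way to see the effect of \( C \) is a short sketch with scikit-learn's SVC and a linear kernel (the toy data here is made up): a small \( C \) tolerates many margin violations, while a large \( C \) approaches the hard-margin solution.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two noisy, slightly overlapping blobs (toy data).
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(+1.0, 1.0, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # As C grows the margin narrows, so fewer points end up as support vectors.
    print(f"C={C:>6}: {len(clf.support_vectors_)} support vectors")
```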
Dual Soft-Margin
Same objective as standard dual, but with box constraints:
\[ 0 \leq \lambda_i \leq C \]
5. The Kernel Trick
What if the data is not linearly separable? We can map the data to a higher-dimensional feature space where it is separable.
Implicit Mapping
Instead of computing the mapping \( \Phi(x) \) explicitly (which can be expensive or infinite-dimensional), we use a Kernel Function \( K(x_i, x_j) \) that computes the dot product in the feature space:
\[ K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j) \]
The dual formulation depends only on dot products, so we simply replace \( x_i \cdot x_j \) with \( K(x_i, x_j) \).
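A small numerical check of this identity, assuming 2-D inputs and the degree-2 polynomial kernel \( K(x, z) = (x \cdot z + 1)^2 \), whose explicit feature map is \( \Phi(x) = (x_1^2,\, x_2^2,\, \sqrt{2}\,x_1 x_2,\, \sqrt{2}\,x_1,\, \sqrt{2}\,x_2,\, 1) \):

```python
import numpy as np

def phi(x):
    """Explicit degree-2 polynomial feature map for 2-D input (6-dimensional output)."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2)*x1*x2, np.sqrt(2)*x1, np.sqrt(2)*x2, 1.0])

def poly_kernel(x, z):
    """Degree-2 polynomial kernel: the same dot product, without ever forming phi."""
    return (np.dot(x, z) + 1.0) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(np.dot(phi(x), phi(z)))   # 4.0
print(poly_kernel(x, z))        # 4.0
```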
Popular Kernels
Polynomial Kernel
\( K(x, z) = (x \cdot z + c)^d \)
Gaussian (RBF) Kernel
\( K(x, z) = \exp(-\gamma \|x - z\|^2) \)
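As a minimal NumPy sketch (the function and variable names are my own), here is how the RBF Gram matrix can be built; this is the matrix that replaces the dot products \( x_i \cdot x_j \) in the dual:

```python
import numpy as np

def rbf_gram(X, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2) for the rows of X."""
    sq_norms = np.sum(X**2, axis=1)
    # ||x_i - x_j||^2 = ||x_i||^2 + ||x_j||^2 - 2 x_i.x_j, computed for all pairs at once.
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))
```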
6. Unconstrained Form
We can rewrite the Soft-Margin SVM as an unconstrained optimization problem involving the Hinge Loss.
Loss + Penalty
\[ \text{minimize} \quad \sum_{i=1}^m \max(0, 1 - y_i f(x_i)) + \lambda \|w\|^2 \]
- Loss: \( \max(0, 1 - y_i f(x_i)) \) (Hinge Loss) - penalizes points that are misclassified or fall inside the margin.
- Penalty: \( \lambda \|w\|^2 \) - regularization term (\( \lambda \) plays the role of \( 1/C \) in the constrained form).
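A minimal sketch of this objective for a linear model \( f(x) = w \cdot x + b \), assuming NumPy arrays X, y, w, b (names are my own):

```python
import numpy as np

def hinge_objective(w, b, X, y, lam):
    """Unconstrained soft-margin objective: sum of hinge losses + lam * ||w||^2."""
    margins = y * (X @ w + b)                 # y_i * f(x_i)
    hinge = np.maximum(0.0, 1.0 - margins)    # zero once a point is safely beyond the margin
    return hinge.sum() + lam * np.dot(w, w)
```

This is the form that gradient-based solvers (e.g., stochastic subgradient methods) typically minimize instead of solving a QP.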
7. SVM Regression (SVR)
SVM can also be used for regression! The goal is to find a function that fits the data within an \( \epsilon \)-margin.
Optimization Model
\[ \begin{aligned} & \text{minimize} && \frac{1}{2} \|w\|^2 \\ & \text{subject to} && |y_i - (w \cdot x_i + b)| \leq \epsilon \end{aligned} \]
We ignore errors smaller than \( \epsilon \), creating a "tube" around the regression line. (In practice, slack variables and a penalty \( C \) are added, just as in the soft-margin classifier, so that points outside the tube are allowed but penalized.)
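A brief scikit-learn sketch of \( \epsilon \)-insensitive regression (the toy data and parameter values here are made up):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.linspace(0, 5, 80).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0 + rng.normal(0.0, 0.2, 80)   # noisy line (toy data)

# Points strictly inside the epsilon-tube contribute no loss and are not support vectors.
reg = SVR(kernel="linear", C=10.0, epsilon=0.3).fit(X, y)
print(reg.predict([[2.5]]))                          # close to 2*2.5 + 1 = 6
print(len(reg.support_), "support vectors on or outside the tube")
```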
8. Solving via QP
To solve the Dual Kernel SVM, we map it to a standard Quadratic Programming (QP) solver, which expects the form \( \min_x \frac{1}{2} x^\top P x + q^\top x \) subject to \( Gx \leq h \), \( Ax = b \).
SVM Mapping:
- Variable x = [λ₁, ..., λₘ]ᵀ
- Matrix Pᵢⱼ = yᵢyⱼ K(xᵢ, xⱼ)
- Vector q = [-1, ..., -1]ᵀ
- Equality: Σ λᵢyᵢ = 0 (A = yᵀ, b = 0)
- Bounds: 0 ≤ λᵢ ≤ C
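Putting this mapping together, a minimal sketch with cvxopt (the function and variable names are my own, and the kernel is passed in as a function that returns the Gram matrix; the bounds are expressed through G and h since the solver has no separate bound argument):

```python
import numpy as np
from cvxopt import matrix, solvers

def dual_svm(X, y, C, kernel):
    """Solve the soft-margin dual SVM via QP.

    X: (m, d) samples, y: (m,) labels in {+1, -1}, kernel(X) -> (m, m) Gram matrix.
    Returns the multipliers lambda, one per sample.
    """
    m = X.shape[0]
    K = kernel(X)
    P = matrix(np.outer(y, y) * K)                 # P_ij = y_i y_j K(x_i, x_j)
    q = matrix(-np.ones(m))                        # q = [-1, ..., -1]^T
    # Bounds 0 <= lambda_i <= C written as G lambda <= h.
    G = matrix(np.vstack([-np.eye(m), np.eye(m)]))
    h = matrix(np.hstack([np.zeros(m), C * np.ones(m)]))
    A = matrix(np.asarray(y, dtype=float).reshape(1, m))   # equality: sum_i lambda_i y_i = 0
    b = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b)
    return np.array(sol["x"]).ravel()
```

From the multipliers, the bias \( b \) can be recovered from any support vector with \( 0 < \lambda_i < C \) via the KKT conditions, and predictions take the form \( f(x) = \sum_i \lambda_i y_i K(x_i, x) + b \).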