1. The Training Problem
The core challenge is to exploit the structure of the training problem to solve it efficiently.
Least Squares Minimization
We want to minimize the error between our approximation \( \tilde{J} \) and the target values \( \beta \):
\[ \min_r \sum_{s=1}^{q} \left( \tilde{J}(x^s, r) - \beta^s \right)^2 \]
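As a concrete illustration, here is a minimal Python sketch of this objective for a generic parametric approximator; the names `training_loss` and `J_tilde`, and the quadratic feature map used in the example, are illustrative choices and not part of the original formulation.

```python
import numpy as np

def training_loss(r, J_tilde, states, betas):
    """Least squares training error: sum_s (J_tilde(x^s, r) - beta^s)^2.

    J_tilde is any parametric approximator (a callable here), states holds
    the sample states x^s, and betas the corresponding target values beta^s.
    """
    return sum((J_tilde(x, r) - beta) ** 2 for x, beta in zip(states, betas))

# Example with a linear feature-based approximator J_tilde(x, r) = r . phi(x)
phi = lambda x: np.array([1.0, x, x ** 2])   # assumed feature map, for illustration
J_lin = lambda x, r: float(r @ phi(x))
states = [0.0, 0.5, 1.0]
betas = [1.0, 1.5, 3.0]
print(training_loss(np.array([1.0, 0.0, 2.0]), J_lin, states, betas))
```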
Challenges
- Nonconvexity: Many local minima, especially with Neural Networks.
- Complicated Landscape: The cost function graph is "horribly complicated".
- Large Sums: Standard gradient/Newton methods are too slow for large \( q \).
The Solution
Incremental Iterative Methods that operate on a single term at each iteration have proven effective.
2. Incremental Gradient Methods
Consider minimizing a sum of terms: \( f(y) = \sum_{i=1}^{m} f_i(y) \).
Ordinary Gradient
Updates using the full gradient sum:
\[ y^{k+1} = y^k - \gamma^k \sum_{i=1}^{m} \nabla f_i(y^k) \]
Slow per iteration (computes all \( m \) gradients).
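A minimal NumPy sketch of one such iteration, assuming the terms' gradients are supplied as a list of callables; the function name and the quadratic example terms are invented for illustration.

```python
import numpy as np

def full_gradient_step(y, grads, gamma):
    """One ordinary gradient iteration: all m gradients are evaluated at y^k."""
    return y - gamma * sum(g(y) for g in grads)

# Example terms f_i(y) = 0.5 * (a_i . y - b_i)^2 with gradient a_i * (a_i . y - b_i)
a = [np.array([1.0, 0.0]), np.array([0.0, 2.0]), np.array([1.0, 1.0])]
b = [1.0, 2.0, 0.0]
grads = [lambda y, a_i=a_i, b_i=b_i: a_i * (a_i @ y - b_i) for a_i, b_i in zip(a, b)]

y = np.zeros(2)
for k in range(100):
    y = full_gradient_step(y, grads, gamma=0.1)
print(y)  # approaches the minimizer of the sum of the three terms
```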
Incremental Gradient
Updates using a single term \( i_k \):
\[ y^{k+1} = y^k - \gamma^k \nabla f_{i_k}(y^k) \]
Fast per iteration. \( i_k \) can be chosen cyclically or randomly.
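A corresponding sketch of the incremental update, with the index \( i_k \) chosen either cyclically or uniformly at random; the function name, arguments, and stepsize handling are illustrative.

```python
import numpy as np

def incremental_gradient(y0, grads, gamma, num_iters, order="cyclic", seed=0):
    """Incremental gradient: y^{k+1} = y^k - gamma^k * grad f_{i_k}(y^k).

    Only one term's gradient is evaluated per iteration; i_k is chosen
    cyclically or uniformly at random.  gamma may be a constant or a
    function of k (a diminishing stepsize).
    """
    rng = np.random.default_rng(seed)
    m = len(grads)
    y = np.array(y0, dtype=float)
    for k in range(num_iters):
        i_k = k % m if order == "cyclic" else rng.integers(m)
        step = gamma(k) if callable(gamma) else gamma
        y = y - step * grads[i_k](y)
    return y

# Reusing quadratic terms like those in the previous sketch:
# y_hat = incremental_gradient(np.zeros(2), grads, gamma=lambda k: 0.1 / (k + 1),
#                              num_iters=300, order="random")
```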
Performance Comparison
Far from convergence: Incremental is much faster (effective speedup of \( m \)).
Close to convergence: Incremental gets "confused" and needs smaller steps.
3. Advanced Variants
Incremental Aggregated Gradient
Aims to accelerate convergence by using memory.
\[ y^{k+1} = y^k - \gamma^k \sum_{\ell=0}^{m-1} \nabla f_{i_{k-\ell}}(y^{k-\ell}) \]
It uses previously calculated gradients as if they were up-to-date. This provides a better approximation of the full gradient.
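One way to realize this, sketched below under the assumption of cyclic index selection, is to keep a table with the most recently computed gradient of each term and maintain their running sum; after the first pass this running sum coincides with the sum in the formula above. Initializing the table with one full pass of gradient evaluations is just one convenient choice.

```python
import numpy as np

def incremental_aggregated_gradient(y0, grads, gamma, num_iters):
    """Incremental aggregated gradient with cyclic index selection (a sketch)."""
    m = len(grads)
    y = np.array(y0, dtype=float)
    memory = [grads[i](y) for i in range(m)]   # most recent gradient of each term
    aggregate = np.sum(memory, axis=0)         # running sum of the stored gradients
    for k in range(num_iters):
        i_k = k % m                            # cyclic index selection
        new_grad = grads[i_k](y)               # only one fresh gradient per iteration
        aggregate += new_grad - memory[i_k]    # keep the running sum up to date
        memory[i_k] = new_grad
        step = gamma(k) if callable(gamma) else gamma
        y = y - step * aggregate               # step along the aggregated gradient
    return y
```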
Stochastic Gradient Descent (SGD)
Minimizes an expected value: \( f(y) = \mathbb{E} \{ F(y, w) \} \).
\[ y^{k+1} = y^k - \gamma^k \nabla_y F(y^k, w^k) \]
Insight: The incremental gradient method with random index selection is mathematically equivalent to SGD (viewing the sum as an expectation over a uniform distribution of indices).
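A minimal SGD sketch; `grad_F` and `sample_w` are assumed user-supplied callables for \( \nabla_y F(y, w) \) and for drawing samples of \( w \). Taking `sample_w` to return a uniformly random index in \( \{1, \dots, m\} \) and `grad_F(y, i)` to return \( \nabla f_i(y) \) recovers the incremental gradient method, as noted above.

```python
import numpy as np

def sgd(y0, grad_F, sample_w, gamma, num_iters, seed=0):
    """Stochastic gradient descent on f(y) = E{ F(y, w) }."""
    rng = np.random.default_rng(seed)
    y = np.array(y0, dtype=float)
    for k in range(num_iters):
        w_k = sample_w(rng)                      # draw one sample of w
        step = gamma(k) if callable(gamma) else gamma
        y = y - step * grad_F(y, w_k)            # step along one sampled gradient
    return y
```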
4. Implementation Issues
- Stepsize (\( \gamma^k \)): Usually diminishing, e.g., \( \gamma^k = \frac{\gamma}{k+1} \), to handle the region of confusion; see the sketch after this list.
- Term Order: Cyclic (1 to \( m \)) or random (sampling). Random order tends to break unfavorable cyclic patterns and often works better in practice.
- Diagonal Scaling: Using a different stepsize for each component of \( y \) to handle ill-conditioning.
- Region of Confusion: Detecting when progress stalls due to noise and reducing the stepsize.
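The sketch below combines several of these choices: a diminishing stepsize, random reshuffling of the terms at every pass, and optional diagonal scaling. The particular combination and the function name are illustrative, not prescriptive.

```python
import numpy as np

def incremental_gradient_practical(y0, grads, gamma0, num_epochs, diag=None, seed=0):
    """Incremental gradient with common practical choices (a sketch)."""
    rng = np.random.default_rng(seed)
    m = len(grads)
    y = np.array(y0, dtype=float)
    scale = np.ones_like(y) if diag is None else np.asarray(diag, dtype=float)
    k = 0
    for _ in range(num_epochs):
        for i in rng.permutation(m):            # random term order within each pass
            step = gamma0 / (k + 1)             # diminishing stepsize gamma / (k+1)
            y = y - step * scale * grads[i](y)  # componentwise (diagonally scaled) step
            k += 1
    return y
```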
5. Neural Network Approximation
Neural Networks are the primary nonlinear architecture used today.
Training Formulation
Given pairs \( (x^s, \beta^s) \), we find parameters \( (A, b, r) \) to minimize:
\[ \min_{A, b, r} \sum_{s=1}^{q} \left( \sum_{\ell=1}^{m} r_{\ell} \sigma \left( (A y(x^s) + b)_{\ell} \right) - \beta^s \right)^2 \]
where \( y(x^s) \) is the input (feature) vector associated with state \( x^s \) and \( \sigma \) is the scalar nonlinear activation function applied componentwise.
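A small NumPy sketch of this objective for a single hidden layer; `np.tanh` stands in for \( \sigma \) (a sigmoid or rectifier would do equally well), and all dimensions and sample values are made up for illustration.

```python
import numpy as np

def nn_output(A, b, r, y_x, sigma=np.tanh):
    """Single hidden layer output: sum_l r_l * sigma((A y(x) + b)_l)."""
    return float(r @ sigma(A @ y_x + b))

def training_error(A, b, r, features, betas, sigma=np.tanh):
    """Least squares training error over the pairs (x^s, beta^s);
    'features' holds the input vectors y(x^s)."""
    return sum((nn_output(A, b, r, y_x, sigma) - beta) ** 2
               for y_x, beta in zip(features, betas))

# A tiny example: 3 hidden neurons, 2-dimensional inputs (all values assumed)
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 2))
b = rng.standard_normal(3)
r = rng.standard_normal(3)
features = [np.array([0.0, 1.0]), np.array([1.0, -1.0])]
betas = [0.5, -0.2]
print(training_error(A, b, r, features, betas))
```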
Universal Approximation Property
A neural network with a single hidden layer and sufficiently many neurons can approximate any continuous function arbitrarily well (uniformly on compact sets).