📉

Value and Policy Approximation

Training Nonlinear Architectures and Incremental Gradient Methods.

🎯

1. The Training Problem

The core challenge is to exploit the structure of the training problem to solve it efficiently.

Least Squares Minimization

We want to minimize, over the parameter vector \( r \), the squared error between the approximation \( \tilde{J}(x^s, r) \) at the sample states \( x^s \) and the target values \( \beta^s \):

\[ \min_r \sum_{s=1}^{q} \left( \tilde{J}(x^s, r) - \beta^s \right)^2 \]
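As a concrete reference point, here is a minimal Python sketch of this objective, assuming the architecture is exposed as a callable `j_tilde(x, r)` (a hypothetical placeholder for any parametric approximator):

```python
def training_loss(r, x_samples, beta, j_tilde):
    """Least-squares training loss: sum over the q samples of the squared
    error between the approximation j_tilde(x^s, r) and the target beta^s."""
    return sum((j_tilde(x, r) - b) ** 2 for x, b in zip(x_samples, beta))
```

For a linear feature-based architecture, `j_tilde(x, r)` is an inner product of `r` with a feature vector of `x`, and the problem reduces to ordinary linear least squares; the difficulties below arise in the nonlinear case.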

Challenges

  • Nonconvexity: Many local minima, especially with neural networks.
  • Complicated landscape: The graph of the cost function is "horribly complicated".
  • Large sums: Standard gradient and Newton methods are too slow for large \( q \), since every iteration must process all \( q \) terms.

The Solution

Incremental iterative methods, which operate on a single term of the sum at each iteration, have proven effective in practice.

2. Incremental Gradient Methods

Consider minimizing a sum of terms: \( f(y) = \sum_{i=1}^{m} f_i(y) \).

Ordinary Gradient

Updates using the full gradient sum:

\[ y^{k+1} = y^k - \gamma^k \sum_{i=1}^{m} \nabla f_i(y^k) \]

Slow per iteration (computes all \( m \) gradients).

Incremental Gradient

Updates using a single term \( i_k \):

\[ y^{k+1} = y^k - \gamma^k \nabla f_{i_k}(y^k) \]

Fast per iteration. \( i_k \) can be chosen cyclically or randomly.
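The contrast is easy to see in code. Below is a minimal sketch, assuming the component gradients \( \nabla f_i \) are given as a list of Python callables (all names are illustrative):

```python
import numpy as np

def ordinary_step(y, grads, gamma):
    """Full-gradient step: evaluates all m component gradients per iteration."""
    return y - gamma * sum(g(y) for g in grads)

def incremental_step(y, grads, gamma, k):
    """Incremental step: evaluates a single component gradient per iteration."""
    i_k = k % len(grads)  # cyclic order; np.random.randint(len(grads)) gives random order
    return y - gamma * grads[i_k](y)

# Example: three quadratic terms f_i(y) = (a_i * y - b_i)**2
a, b = np.array([1.0, 2.0, 3.0]), np.array([1.0, 0.0, -1.0])
grads = [lambda y, i=i: 2 * a[i] * (a[i] * y - b[i]) for i in range(3)]
y = 5.0
for k in range(300):
    y = incremental_step(y, grads, gamma=0.02, k=k)
```

Each `incremental_step` costs roughly \( 1/m \) of an `ordinary_step`, which is the source of the speedup discussed next.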

Performance Comparison

```mermaid
graph TD
    subgraph "Far from Convergence"
        A[Incremental] -->|1/m work| B[Same Progress as Ordinary]
    end
    subgraph "Close to Convergence"
        C[Incremental] -->|Oscillates| D[Needs Diminishing Stepsize]
    end
    style A fill:#dbeafe,stroke:#3b82f6
    style C fill:#fee2e2,stroke:#ef4444
```

Far from convergence: Incremental is much faster, with an effective speedup of \( m \) over the ordinary method.
Close to convergence: Incremental enters a "region of confusion" around the minimum, oscillating among the minima of the individual terms, so a diminishing stepsize is needed.
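A standard remedy (a general fact from the stochastic approximation literature, not specific to these notes) is a diminishing stepsize such as \( \gamma^k = c/(k+1) \), which satisfies

\[ \sum_{k=0}^{\infty} \gamma^k = \infty, \qquad \sum_{k=0}^{\infty} \left( \gamma^k \right)^2 < \infty, \]

so the iterates can settle into the minimum instead of oscillating within the region of confusion.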

🚀

3. Advanced Variants

Incremental Aggregated Gradient

Aims to accelerate convergence by using memory.

\[ y^{k+1} = y^k - \gamma^k \sum_{\ell=0}^{m-1} \nabla f_{i_{k-\ell}}(y^{k-\ell}) \]

It uses previously calculated gradients as if they were up-to-date. This provides a better approximation of the full gradient.
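One common implementation (a sketch assuming the gradient table of all \( m \) components fits in memory; names are illustrative) stores the most recent gradient of each component, refreshes one entry per iteration, and steps along the running sum:

```python
import numpy as np

def iag(y0, grads, gamma, num_iters):
    """Incremental aggregated gradient: per iteration, recompute one stored
    component gradient and step along the sum of all stored gradients."""
    m = len(grads)
    y = np.asarray(y0, dtype=float)
    table = [g(y) for g in grads]   # stored (possibly stale) component gradients
    total = sum(table)              # running sum of the stored gradients
    for k in range(num_iters):
        i_k = k % m                 # cyclic selection, as in the formula above
        total -= table[i_k]
        table[i_k] = grads[i_k](y)  # only this one gradient is recomputed
        total += table[i_k]
        y = y - gamma * total
    return y
```

Only one gradient evaluation is needed per iteration, yet the step aggregates information from all \( m \) components; each stored entry is stale by at most \( m \) iterations.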

Stochastic Gradient Descent (SGD)

Minimizes an expected value: \( f(y) = \mathbb{E} \{ F(y, w) \} \).

\[ y^{k+1} = y^k - \gamma^k \nabla_y F(y^k, w^k) \]

Insight: The incremental gradient method with random index selection is mathematically equivalent to SGD (viewing the sum as an expectation over a uniform distribution of indices).
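To see the equivalence, note that when \( i \) is drawn uniformly from \( \{1, \dots, m\} \), the incremental direction is an unbiased estimate of a multiple of the full gradient:

\[ \mathbb{E}_i \left\{ \nabla f_i(y) \right\} = \frac{1}{m} \sum_{i=1}^{m} \nabla f_i(y) = \frac{1}{m} \nabla f(y). \]

Thus the random-index method is SGD applied to \( F(y, w) = f_w(y) \), with \( w \) uniform over the indices.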

⚙️

4. Implementation Issues

🧠

5. Neural Network Approximation

Neural Networks are the primary nonlinear architecture used today.

Training Formulation

Given sample pairs \( (x^s, \beta^s) \), with \( y(x) \) denoting the encoded input for state \( x \), we find parameters \( (A, b, r) \) to minimize:

\[ \min_{A, b, r} \sum_{s=1}^{q} \left( \sum_{\ell=1}^{m} r_{\ell} \sigma \left( (A y(x^s) + b)_{\ell} \right) - \beta^s \right)^2 \]
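A direct transcription into Python (a sketch assuming \( \sigma = \tanh \) and the encoded inputs \( y(x^s) \) stacked as rows of a matrix; names are illustrative):

```python
import numpy as np

def nn_loss(A, b, r, X, beta, sigma=np.tanh):
    """Single-hidden-layer least-squares loss from the formulation above.

    X    : (q, n) matrix whose rows are the encoded inputs y(x^s)
    beta : (q,) target values beta^s
    A, b : (m, n) hidden-layer weights and (m,) biases
    r    : (m,) output weights
    """
    hidden = sigma(X @ A.T + b)   # (q, m): sigma((A y(x^s) + b)_ell)
    preds = hidden @ r            # (q,): sum_ell r_ell * sigma(...)
    return np.sum((preds - beta) ** 2)
```

Training applies the incremental methods above, with each sample \( s \) contributing one term of the sum.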

Universal Approximation Property

A neural network with a single hidden layer and sufficiently many neurons can approximate any continuous function arbitrarily well.
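Formally (one standard statement of the property): for any continuous function \( f \) on a compact set \( K \) and any \( \varepsilon > 0 \), there exist \( m \) and parameters \( r, A, b \) such that

\[ \sup_{x \in K} \left| f(x) - \sum_{\ell=1}^{m} r_\ell \, \sigma\big( (A x + b)_\ell \big) \right| < \varepsilon, \]

provided \( \sigma \) is a suitable (e.g., sigmoidal) activation.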

📝

6. Test Your Knowledge

1. Why are standard gradient methods ill-suited for the training problem?

2. When is the Incremental Gradient method most effective?

3. What is the "Region of Confusion"?

4. How does SGD relate to Incremental Gradient?

5. What does the Universal Approximation Property state?
