1. The Training Problem
The core challenge is to exploit the structure of the training problem to solve it efficiently.
Least Squares Minimization
We want to minimize the error between our approximation \( \tilde{J} \) and the target values \( \beta \):
\[ \min_r \sum_{s=1}^{q} \left( \tilde{J}(x^s, r) - \beta^s \right)^2 \]
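As a concrete illustration, here is a minimal Python sketch of this objective for a generic parametric approximator; the names `training_loss` and `J_tilde`, and the quadratic feature map used in the example, are illustrative choices and not part of the original formulation.

```python
import numpy as np

def training_loss(r, J_tilde, states, betas):
    """Least squares training error: sum_s (J_tilde(x^s, r) - beta^s)^2.

    J_tilde is any parametric approximator (a callable here), states holds
    the sample states x^s, and betas the corresponding target values beta^s.
    """
    return sum((J_tilde(x, r) - beta) ** 2 for x, beta in zip(states, betas))

# Example with a linear feature-based approximator J_tilde(x, r) = r . phi(x)
phi = lambda x: np.array([1.0, x, x ** 2])   # assumed feature map, for illustration
J_lin = lambda x, r: float(r @ phi(x))
states = [0.0, 0.5, 1.0]
betas = [1.0, 1.5, 3.0]
print(training_loss(np.array([1.0, 0.0, 2.0]), J_lin, states, betas))
```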
Challenges
- Nonconvexity: Many local minima, especially with Neural Networks.
- Complicated Landscape: The cost function graph is "horribly complicated".
- Large Sums: Standard gradient/Newton methods are too slow for large \( q \).
The Solution
Incremental Iterative Methods that operate on a single term at each iteration have proven effective.
2. Incremental Gradient Methods
Consider minimizing a sum of terms: \( f(y) = \sum_{i=1}^{m} f_i(y) \).
Ordinary Gradient
Updates using the full gradient sum:
\[ y^{k+1} = y^k - \gamma^k \sum_{i=1}^{m} \nabla f_i(y^k) \]
Slow per iteration (computes all \( m \) gradients).
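A minimal NumPy sketch of one such iteration, assuming the terms' gradients are supplied as a list of callables; the function name and the quadratic example terms are invented for illustration.

```python
import numpy as np

def full_gradient_step(y, grads, gamma):
    """One ordinary gradient iteration: all m gradients are evaluated at y^k."""
    return y - gamma * sum(g(y) for g in grads)

# Example terms f_i(y) = 0.5 * (a_i . y - b_i)^2 with gradient a_i * (a_i . y - b_i)
a = [np.array([1.0, 0.0]), np.array([0.0, 2.0]), np.array([1.0, 1.0])]
b = [1.0, 2.0, 0.0]
grads = [lambda y, a_i=a_i, b_i=b_i: a_i * (a_i @ y - b_i) for a_i, b_i in zip(a, b)]

y = np.zeros(2)
for k in range(100):
    y = full_gradient_step(y, grads, gamma=0.1)
print(y)  # approaches the minimizer of the sum of the three terms
```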
Incremental Gradient
Updates using a single term \( i_k \):
\[ y^{k+1} = y^k - \gamma^k \nabla f_{i_k}(y^k) \]
Fast per iteration. \( i_k \) can be chosen cyclically or randomly.
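A corresponding sketch of the incremental update, with the index \( i_k \) chosen either cyclically or uniformly at random; the function name, arguments, and stepsize handling are illustrative.

```python
import numpy as np

def incremental_gradient(y0, grads, gamma, num_iters, order="cyclic", seed=0):
    """Incremental gradient: y^{k+1} = y^k - gamma^k * grad f_{i_k}(y^k).

    Only one term's gradient is evaluated per iteration; i_k is chosen
    cyclically or uniformly at random.  gamma may be a constant or a
    function of k (a diminishing stepsize).
    """
    rng = np.random.default_rng(seed)
    m = len(grads)
    y = np.array(y0, dtype=float)
    for k in range(num_iters):
        i_k = k % m if order == "cyclic" else rng.integers(m)
        step = gamma(k) if callable(gamma) else gamma
        y = y - step * grads[i_k](y)
    return y

# Reusing quadratic terms like those in the previous sketch:
# y_hat = incremental_gradient(np.zeros(2), grads, gamma=lambda k: 0.1 / (k + 1),
#                              num_iters=300, order="random")
```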
Performance Comparison
Far from convergence: Incremental is much faster (effective speedup of \( m \)).
Close to convergence: Incremental gets "confused" and needs smaller steps.
3. Advanced Variants
Incremental Aggregated Gradient
Aims to accelerate convergence by using memory.
\[ y^{k+1} = y^k - \gamma^k \sum_{\ell=0}^{m-1} \nabla f_{i_{k-\ell}}(y^{k-\ell}) \]
It uses previously calculated gradients as if they were up-to-date. This provides a better approximation of the full gradient.
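One way to realize this, sketched below under the assumption of cyclic index selection, is to keep a table with the most recently computed gradient of each term and maintain their running sum; after the first pass this running sum coincides with the sum in the formula above. Initializing the table with one full pass of gradient evaluations is just one convenient choice.

```python
import numpy as np

def incremental_aggregated_gradient(y0, grads, gamma, num_iters):
    """Incremental aggregated gradient with cyclic index selection (a sketch)."""
    m = len(grads)
    y = np.array(y0, dtype=float)
    memory = [grads[i](y) for i in range(m)]   # most recent gradient of each term
    aggregate = np.sum(memory, axis=0)         # running sum of the stored gradients
    for k in range(num_iters):
        i_k = k % m                            # cyclic index selection
        new_grad = grads[i_k](y)               # only one fresh gradient per iteration
        aggregate += new_grad - memory[i_k]    # keep the running sum up to date
        memory[i_k] = new_grad
        step = gamma(k) if callable(gamma) else gamma
        y = y - step * aggregate               # step along the aggregated gradient
    return y
```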
Stochastic Gradient Descent (SGD)
Minimizes an expected value: \( f(y) = \mathbb{E} \{ F(y, w) \} \).
\[ y^{k+1} = y^k - \gamma^k \nabla_y F(y^k, w^k) \]
Insight: The incremental gradient method with random index selection is mathematically equivalent to SGD (viewing the sum as an expectation over a uniform distribution of indices).
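A minimal SGD sketch; `grad_F` and `sample_w` are assumed user-supplied callables for \( \nabla_y F(y, w) \) and for drawing samples of \( w \). Taking `sample_w` to return a uniformly random index in \( \{1, \dots, m\} \) and `grad_F(y, i)` to return \( \nabla f_i(y) \) recovers the incremental gradient method, as noted above.

```python
import numpy as np

def sgd(y0, grad_F, sample_w, gamma, num_iters, seed=0):
    """Stochastic gradient descent on f(y) = E{ F(y, w) }."""
    rng = np.random.default_rng(seed)
    y = np.array(y0, dtype=float)
    for k in range(num_iters):
        w_k = sample_w(rng)                      # draw one sample of w
        step = gamma(k) if callable(gamma) else gamma
        y = y - step * grad_F(y, w_k)            # step along one sampled gradient
    return y
```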
4. Implementation Issues
- Stepsize (\( \gamma^k \)): Usually diminishing, e.g., \( \gamma^k = \frac{\gamma}{k+1} \), to handle the region of confusion; see the sketch after this list.
- Term Order: Cyclic (1 to \( m \)) or random (sampling). Random order tends to break unfavorable cyclic patterns and often works better in practice.
- Diagonal Scaling: Using a different stepsize for each component of \( y \) to handle ill-conditioning.
- Region of Confusion: Detecting when progress stalls due to noise and reducing the stepsize.
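The sketch below combines several of these choices: a diminishing stepsize, random reshuffling of the terms at every pass, and optional diagonal scaling. The particular combination and the function name are illustrative, not prescriptive.

```python
import numpy as np

def incremental_gradient_practical(y0, grads, gamma0, num_epochs, diag=None, seed=0):
    """Incremental gradient with common practical choices (a sketch)."""
    rng = np.random.default_rng(seed)
    m = len(grads)
    y = np.array(y0, dtype=float)
    scale = np.ones_like(y) if diag is None else np.asarray(diag, dtype=float)
    k = 0
    for _ in range(num_epochs):
        for i in rng.permutation(m):            # random term order within each pass
            step = gamma0 / (k + 1)             # diminishing stepsize gamma / (k+1)
            y = y - step * scale * grads[i](y)  # componentwise (diagonally scaled) step
            k += 1
    return y
```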
5. Neural Network Approximation
Neural Networks are the primary nonlinear architecture used today.
Training Formulation
Given pairs \( (x^s, \beta^s) \), we find parameters \( (A, b, r) \) to minimize:
\[ \min_{A, b, r} \sum_{s=1}^{q} \left( \sum_{\ell=1}^{m} r_{\ell} \sigma \left( (A y(x^s) + b)_{\ell} \right) - \beta^s \right)^2 \]
where \( y(x^s) \) is the input (feature) vector associated with state \( x^s \) and \( \sigma \) is the scalar nonlinear activation function applied componentwise.
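A small NumPy sketch of this objective for a single hidden layer; `np.tanh` stands in for \( \sigma \) (a sigmoid or rectifier would do equally well), and all dimensions and sample values are made up for illustration.

```python
import numpy as np

def nn_output(A, b, r, y_x, sigma=np.tanh):
    """Single hidden layer output: sum_l r_l * sigma((A y(x) + b)_l)."""
    return float(r @ sigma(A @ y_x + b))

def training_error(A, b, r, features, betas, sigma=np.tanh):
    """Least squares training error over the pairs (x^s, beta^s);
    'features' holds the input vectors y(x^s)."""
    return sum((nn_output(A, b, r, y_x, sigma) - beta) ** 2
               for y_x, beta in zip(features, betas))

# A tiny example: 3 hidden neurons, 2-dimensional inputs (all values assumed)
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 2))
b = rng.standard_normal(3)
r = rng.standard_normal(3)
features = [np.array([0.0, 1.0]), np.array([1.0, -1.0])]
betas = [0.5, -0.2]
print(training_error(A, b, r, features, betas))
```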
Universal Approximation Property
A neural network with a single hidden layer and sufficiently many neurons can approximate any continuous function arbitrarily well (uniformly on compact sets).