1. Sequential Backward Training
We train cost approximations \( \tilde{J}_N, \tilde{J}_{N-1}, \dots, \tilde{J}_0 \) sequentially, going backwards in time.
The Process
- Start: Set \( \tilde{J}_N = g_N \) (Terminal Cost).
- Generate Data: Given \( \tilde{J}_{k+1} \), use one-step lookahead to create training pairs \( (x_k^s, \beta_k^s) \): \[ \beta_k^s = \min_{u \in U_k(x_k^s)} E \left\{ g_k(x_k^s, u, w_k) + \tilde{J}_{k+1} \left( f_k(x_k^s, u, w_k), r_{k+1} \right) \right\} \]
- Train: Fit an architecture \( \tilde{J}_k(\cdot, r_k) \) to this data by minimizing the least squares error (a code sketch follows this list): \[ \min_{r_k} \sum_{s=1}^{q} \left( \tilde{J}_k(x_k^s, r_k) - \beta_k^s \right)^2 \]
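To make the backward pass concrete, here is a minimal sketch in Python. Everything named here, `f`, `g`, `g_N`, `U`, `sample_states`, `sample_w`, and the linear least squares architecture, is an assumption for illustration, not part of the original method description:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def backward_training(N, q, m, f, g, g_N, U, sample_states, sample_w):
    """Train J~_N, J~_{N-1}, ..., J~_0 sequentially, backwards in time."""
    J_tilde = [None] * (N + 1)
    J_tilde[N] = g_N                        # start: J~_N = g_N (terminal cost)
    for k in range(N - 1, -1, -1):
        X = sample_states(k, q)             # states x_k^1, ..., x_k^q
        W = sample_w(k, m)                  # disturbance samples for E{.}
        betas = []
        for x in X:
            # one-step lookahead target beta_k^s; expectation by Monte Carlo
            beta = min(
                np.mean([g(k, x, u, w) + J_tilde[k + 1](f(k, x, u, w))
                         for w in W])
                for u in U(k, x)
            )
            betas.append(beta)
        # fit the architecture J~_k(., r_k) by least squares (linear here)
        model = LinearRegression()
        model.fit(np.array(X).reshape(len(X), -1), np.array(betas))
        J_tilde[k] = lambda x, m_=model: float(
            m_.predict(np.asarray(x).reshape(1, -1))[0])
    return J_tilde
```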
Advantage
Can be combined with on-line play/approximation in value space.
Challenge
The \( \min_u E\{\cdot\} \) operation complicates data collection: computing the expected value for every sample requires an explicit model of the system.
2. Model-Free Q-Factors
To avoid the model requirement (the expected value calculation), we can approximate Q-factors instead of cost functions.
The Trick
Reverse the order of Expectation and Minimization:
\[ \tilde{Q}_k(x_k, u_k, r_k) \approx E \left\{ g_k(x_k, u_k, w_k) + \min_{u} \tilde{Q}_{k+1}(x_{k+1}, u, r_{k+1}) \right\} \]
where \( x_{k+1} = f_k(x_k, u_k, w_k) \). Now, samples \( \beta_k^s \) can be obtained from a simulator without knowing the explicit model equations.
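A minimal sketch of this sample-generation step, assuming a black-box simulator `step(k, x, u)` that draws \( w_k \) internally and returns a sampled stage cost and next state (all names hypothetical):

```python
def q_training_pairs(k, state_control_samples, step, Q_next, controls):
    """Generate Q-factor training pairs from a black-box simulator."""
    pairs = []
    for x, u in state_control_samples:
        cost, x_next = step(k, x, u)     # one simulator call; w_k drawn inside
        # target: sampled g_k + min over u' of Q~_{k+1}(x_{k+1}, u')
        beta = cost + min(Q_next(x_next, u2) for u2 in controls)
        pairs.append(((x, u), beta))     # fit Q~_k(., ., r_k) to these pairs
    return pairs
```

Since each target uses a single disturbance sample, the expectation in the formula above is realized implicitly when the least squares fit averages over many pairs.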
Online Control Simplification
No model or expected value is needed online! Just minimize the learned Q-factor \( \tilde{Q}_k(x_k, u, r_k) \) over \( u \).
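As a one-line illustration, assuming a finite control set `controls` and a trained Q-factor `Q_k`:

```python
# Online play: no model, no expectation -- just a minimization (assumed names)
u_star = min(controls, key=lambda u: Q_k(x, u))
```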
3. The Danger of Disproportionate Terms
Should we approximate \( Q(x, u) \) directly, or the difference \( Q(x, u) - Q(x, u') \)?
Example: Small Time Steps
Consider a continuous-time system discretized with a small time step \( \delta \). The Q-factor often has the form:
\[ Q(x, u) = \underbrace{V(x)}_{\text{Huge}} + \underbrace{\delta \cdot \text{Advantage}(x, u)}_{\text{Tiny}} \]
If we approximate \( Q(x, u) \) directly, the neural network will focus on fitting the huge \( V(x) \) term and effectively ignore the tiny term that actually determines the optimal control. The policy improvement information is "lost".
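A toy numeric illustration of the scale mismatch, with made-up numbers (\( V(x) = 1000 \), \( \delta = 10^{-3} \)):

```python
delta = 1e-3
V_x = 1000.0                        # "huge" state-dependent term V(x)
advantage = {"u1": 2.0, "u2": 1.0}  # "tiny" advantages; u2 is optimal
Q = {u: V_x + delta * a for u, a in advantage.items()}
print(Q)                            # {'u1': 1000.002, 'u2': 1000.001}
# A mere 0.01% error in fitting V(x) (0.1 in absolute terms) is 100x larger
# than the 0.001 gap that decides which control is optimal.
```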
The Remedy: Baselines
Subtract a state-dependent constant (a baseline) from the Q-factors to remove the huge term. Two common variants, sketched after this list:
- Differential Training: Learn \( D(x, x') = J(x) - J(x') \).
- Advantage Updating: Learn \( A(x, u) = Q(x, u) - \min_{u'} Q(x, u') \).
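Minimal sketches of the two training targets, assuming generic `J` and `Q` callables and a finite control set (all names hypothetical):

```python
def differential_target(J, x, x_ref):
    # differential training: learn D(x, x') = J(x) - J(x')
    return J(x) - J(x_ref)

def advantage_target(Q, x, u, controls):
    # advantage updating: learn A(x, u) = Q(x, u) - min over u' of Q(x, u')
    return Q(x, u) - min(Q(x, u2) for u2 in controls)
```

In both cases the huge common term cancels in the subtraction, so the fitted architecture spends its capacity on the differences that actually rank the controls.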