🧩

Fitted Value Iteration

Training cost approximations sequentially backwards in time, and model-free Q-factor learning.

🔙

1. Sequential Backward Training

We train cost approximations \( \tilde{J}_N, \tilde{J}_{N-1}, \dots, \tilde{J}_0 \) sequentially, going backwards in time.

The Process

  1. Start: Set \( \tilde{J}_N = g_N \) (Terminal Cost).
  2. Generate Data: Given \( \tilde{J}_{k+1} \), use one-step lookahead to create training pairs \( (x_k^s, \beta_k^s) \): \[ \beta_k^s = \min_{u \in U_k(x_k^s)} E \left\{ g_k(x_k^s, u, w_k) + \tilde{J}_{k+1} \left( f_k(x_k^s, u, w_k), r_{k+1} \right) \right\} \]
  3. Train: Fit an architecture \( \tilde{J}_k \) to this data by minimizing the least squares error: \[ \min_{r_k} \sum_{s=1}^{q} \left( \tilde{J}_k(x_k^s, r_k) - \beta_k^s \right)^2 \]
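
Putting the three steps above together, here is a minimal sketch of the backward training loop. The problem-specific helpers (`sample_states`, `controls`, `sample_disturbances`, `step_cost`, `dynamics`, `terminal_cost`) are hypothetical placeholders, a generic neural-network regressor stands in for the architecture, and a Monte Carlo average stands in for the expectation.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def fitted_value_iteration(N, num_samples=1000, num_noise=20):
    # States are assumed to be feature vectors so they can be fed to the regressor.
    J = [None] * (N + 1)
    J[N] = terminal_cost                               # step 1: J_N = g_N

    for k in range(N - 1, -1, -1):                     # backwards in time
        X, betas = sample_states(k, num_samples), []   # sample states x_k^s
        for x in X:
            best = np.inf
            for u in controls(k, x):                   # min over u in U_k(x_k^s)
                # Monte Carlo estimate of E{ g_k(x,u,w) + J_{k+1}(f_k(x,u,w)) }
                vals = [step_cost(k, x, u, w) + J[k + 1](dynamics(k, x, u, w))
                        for w in sample_disturbances(k, num_noise)]
                best = min(best, float(np.mean(vals)))
            betas.append(best)                         # target beta_k^s

        # step 3: least squares fit of the architecture J_k(., r_k)
        net = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000)
        net.fit(np.asarray(X), np.asarray(betas))
        J[k] = lambda x, net=net: float(net.predict(np.atleast_2d(x))[0])
    return J
```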

Advantage

Can be combined with on-line play/approximation in value space.

Challenge

The \( \min_u E\{\cdot\} \) operation complicates data collection: computing each target \( \beta_k^s \) requires an expected value, and hence a model of the system and the disturbance distribution, for every sample.

🤖

2. Model-Free Q-Factors

To avoid the model requirement (expected value calculation), we can approximate Q-factors instead of Value functions.

The Trick

Reverse the order of Expectation and Minimization:

\[ \tilde{Q}_k(x_k, u_k, r_k) \approx E \left\{ g_k(x_k, u_k, w_k) + \min_{u} \tilde{Q}_{k+1}(x_{k+1}, u, r_{k+1}) \right\} \]

Because the expectation now sits outside the minimization, a single simulated transition yields an unbiased sample of the bracketed term. The targets \( \beta_k^s \) can therefore be obtained from a simulator, without knowing the explicit model equations; the least squares fit averages out the simulation noise.
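
As a concrete sketch (the names are hypothetical: `simulator(k, x, u)` returns one sampled stage cost and next state, `Q_next` is the already-fitted \( \tilde{Q}_{k+1} \), and `control_grid` enumerates candidate controls), the training pairs for \( \tilde{Q}_k \) could be collected as follows.

```python
import numpy as np

def q_targets(k, states, controls_fn, simulator, Q_next, control_grid):
    """Generate training pairs ((x, u), beta) for fitting Q_k."""
    inputs, targets = [], []
    for x in states:
        for u in controls_fn(k, x):
            # One simulated transition; no model equations or explicit E{.} needed,
            # since the least squares fit of Q_k averages these noisy targets.
            cost, x_next = simulator(k, x, u)
            beta = cost + min(Q_next(x_next, u2) for u2 in control_grid(k + 1, x_next))
            inputs.append(np.concatenate([np.atleast_1d(x), np.atleast_1d(u)]))
            targets.append(beta)
    return np.asarray(inputs), np.asarray(targets)
```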

Online Control Simplification

```mermaid
graph LR
    State("State x_k") -->|Evaluate Q| QFunc("Q(x_k, u, r_k)")
    QFunc -->|Minimize over u| Control("Select u_k")
    style Control fill:#cffafe,stroke:#06b6d4
```

No model or expected value needed online! Just minimize the learned Q-function.
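
A sketch of this online step, assuming a learned callable `Q_k` and a finite control set `U_k`:

```python
def online_control(Q_k, x_k, U_k):
    """Pick the control minimizing the learned Q-factor; no model, no E{.}."""
    return min(U_k, key=lambda u: Q_k(x_k, u))

# usage: u_k = online_control(Q_k, x_k, U_k)
```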

⚠️

3. The Danger of Disproportionate Terms

Should we approximate \( Q(x, u) \) directly, or the difference \( Q(x, u) - Q(x, u') \)?

Example: Small Time Steps

Consider a continuous-time problem discretized with a small time step \( \delta \). The Q-factor often has the form:

\[ Q(x, u) = \underbrace{V(x)}_{\text{Huge}} + \underbrace{\delta \cdot \text{Advantage}(x, u)}_{\text{Tiny}} \]

If we approximate \( Q(x, u) \) directly, the neural network will focus on fitting the huge \( V(x) \) term and effectively ignore the tiny term that actually determines the optimal control. The policy improvement information is "lost".
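
A toy numerical illustration (the magnitudes of \( \delta \), \( V(x) \), and the advantages are made up for the example):

```python
delta = 1e-3
V_x = 500.0                          # "huge" state-value part, same for every control
advantage = {"u1": 0.0, "u2": 2.0}   # "tiny" part that actually ranks the controls

Q = {u: V_x + delta * a for u, a in advantage.items()}
print(Q)                             # approx {'u1': 500.0, 'u2': 500.002}
# A fit of Q with even a 0.01 approximation error can reverse the ranking of
# u1 and u2, while a fit of the advantage term alone would not.
```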

The Remedy: Baselines

Subtract a baseline that depends on the state but not the control from the Q-factors, removing the huge term.

  • Differential Training: Learn \( D(x, x') = J(x) - J(x') \).
  • Advantage Updating: Learn \( A(x, u) = Q(x, u) - \min_{u'} Q(x, u') \).
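
A minimal sketch of how Advantage Updating targets could be formed from a learned Q-factor (assuming a hypothetical callable `Q(x, u)` and a `control_grid(x)` that enumerates controls): the state-dependent baseline \( \min_{u'} Q(x, u') \) is subtracted so only the part that ranks the controls remains to be learned.

```python
def advantage_targets(Q, states, control_grid):
    data = []
    for x in states:
        q_vals = {u: Q(x, u) for u in control_grid(x)}
        baseline = min(q_vals.values())             # state-dependent baseline
        for u, q in q_vals.items():
            data.append(((x, u), q - baseline))     # A(x, u) = Q(x, u) - min_u' Q(x, u')
    return data
```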

📝

4. Test Your Knowledge

1. In Fitted Value Iteration, in which direction do we train?

2. What is the main advantage of learning Q-factors?

3. Why might approximating Q-factors directly fail for small time steps?

4. What is "Advantage Updating"?

5. Can Fitted Value Iteration be combined with online play?
