1. The Core Idea
Exact DP is often too expensive, because it requires computing the optimal cost-to-go \( J^*_{k+1} \) over the entire state space. Instead, we replace \( J^*_{k+1} \) with an approximate cost function \( \tilde{J}_{k+1} \).
One-Step Lookahead Minimization
At state \( \tilde{x}_k \), we choose control \( \tilde{u}_k \) by solving:
\[ \tilde{u}_k \in \arg\min_{u_k \in U_k(\tilde{x}_k)} \left[ g_k(\tilde{x}_k, u_k) + \tilde{J}_{k+1} (f_k(\tilde{x}_k, u_k)) \right] \]
Then we move to the next state: \( \tilde{x}_{k+1} = f_k(\tilde{x}_k, \tilde{u}_k) \).
Visualizing the Process
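To make the process concrete, here is a minimal Python sketch of one lookahead step on a toy scalar problem. Everything below (the control set `U`, dynamics `f_k`, stage cost `g_k`, and the quadratic guess `J_next` standing in for \( \tilde{J}_{k+1} \)) is a made-up placeholder, not something prescribed by the method.

```python
# Toy placeholders: scalar state, a small finite control set, quadratic costs.
U = [-1.0, 0.0, 1.0]                    # control set U_k(x), state-independent here

def f_k(x, u):                          # system dynamics: x_{k+1} = f_k(x_k, u_k)
    return 0.9 * x + u

def g_k(x, u):                          # stage cost g_k(x_k, u_k)
    return x**2 + 0.5 * u**2

def J_next(x):                          # approximate cost-to-go J~_{k+1}(x)
    return 2.0 * x**2

def one_step_lookahead(x):
    """Pick u~_k by minimizing g_k(x, u) + J~_{k+1}(f_k(x, u)) over the control set."""
    costs = {u: g_k(x, u) + J_next(f_k(x, u)) for u in U}
    u_best = min(costs, key=costs.get)
    return u_best, costs

x = 1.5
u_tilde, costs = one_step_lookahead(x)  # minimize over candidate controls
x_next = f_k(x, u_tilde)                # then move to the next state
print("candidate costs:", costs)
print("chosen control:", u_tilde, "-> next state:", x_next)
```

Printing the candidate costs traces the process at each step: every feasible control is scored by its stage cost plus the approximate cost-to-go of the state it leads to, and the cheapest one is applied.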
2. Approximate Q-Factors
Definition
The approximate Q-factor \( \tilde{Q}_k(x_k, u_k) \) is the cost of applying control \( u_k \) at state \( x_k \), plus the approximate cost-to-go \( \tilde{J}_{k+1} \) evaluated at the resulting next state.
\[ \tilde{Q}_k(x_k, u_k) = g_k(x_k, u_k) + \tilde{J}_{k+1}(f_k(x_k, u_k)) \]
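In code, reusing the toy placeholders `f_k`, `g_k`, and `J_next` from the sketch above, the approximate Q-factor is just this composition:

```python
def Q_tilde(x, u):
    # Q~_k(x, u) = g_k(x, u) + J~_{k+1}(f_k(x, u))
    return g_k(x, u) + J_next(f_k(x, u))
```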
The control selection then simply becomes minimizing the Q-factor:
\[ \tilde{u}_k \in \arg\min_{u_k \in U_k(\tilde{x}_k)} \tilde{Q}_k(\tilde{x}_k, u_k) \]
3. Offline vs Online
Online Approximation
Compute \( \tilde{J}_{k+1} \) on the fly, e.g., by rollout: simulate a base heuristic from the candidate next state and use its accumulated cost as the approximation (see the sketch after this list).
- Pros: Adapts to the current state.
- Cons: Computationally expensive at each step.
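As one illustration of the online option, here is a hedged sketch of rollout on the toy placeholders from the first sketch: \( \tilde{J}_{k+1} \) at a candidate next state is estimated on the fly by simulating a simple base heuristic (here, always applying `u = 0`, an arbitrary choice) for a fixed horizon and summing its stage costs.

```python
HORIZON = 20                              # rollout horizon; an arbitrary choice

def base_heuristic(x):
    # A deliberately simple base policy: apply no control.
    return 0.0

def rollout_cost(x):
    """Estimate the cost-to-go at x by simulating the base heuristic and summing costs."""
    total = 0.0
    for _ in range(HORIZON):
        u = base_heuristic(x)
        total += g_k(x, u)
        x = f_k(x, u)
    return total

def rollout_lookahead(x):
    # One-step lookahead with the rollout estimate playing the role of J~_{k+1}.
    return min(U, key=lambda u: g_k(x, u) + rollout_cost(f_k(x, u)))

print("rollout control at x = 1.5:", rollout_lookahead(1.5))
```

Every control decision triggers a fresh batch of simulations, which is exactly the online expense noted above.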
Offline Q-Factors
Train \( \tilde{Q}_k \) beforehand (e.g., with a neural network fit to sampled Q-factor values) and query it directly online (see the sketch after this list).
- Pros: Very fast online execution.
- Cons: Performance depends on training quality; approximation errors in \( \tilde{Q}_k \) carry over into the controls it selects.
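To illustrate the offline option, here is a hedged sketch, again on the toy placeholders from the first sketch, with a plain least-squares fit over hand-picked quadratic features standing in for a neural network: sample states offline, compute target Q-factors from the model, fit \( \tilde{Q}_k \), and then the online step reduces to a cheap argmin over the fitted function.

```python
import numpy as np

rng = np.random.default_rng(0)

def features(x, u):
    # Hand-picked quadratic features; a neural network would replace this choice.
    return np.array([1.0, x, u, x * x, u * u, x * u])

# Offline phase: sample states, compute target Q-factors using the model,
# and fit the parameters by least squares.
X_rows, targets = [], []
for _ in range(500):
    x = rng.uniform(-3.0, 3.0)
    for u in U:
        X_rows.append(features(x, u))
        targets.append(g_k(x, u) + J_next(f_k(x, u)))  # target: g_k + J~_{k+1}(f_k(x, u))
theta, *_ = np.linalg.lstsq(np.array(X_rows), np.array(targets), rcond=None)

# Online phase: no dynamics or cost evaluations, only the fitted Q~ and an argmin.
def Q_hat(x, u):
    return features(x, u) @ theta

def offline_policy(x):
    return min(U, key=lambda u: Q_hat(x, u))

print("offline-trained control at x = 1.5:", offline_policy(1.5))
```

If the fit is poor on the states actually visited, the argmin inherits those errors, which is the training-quality caveat above.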