1. The Core Idea
Exact DP is often too expensive, because it requires computing the optimal cost-to-go \( J^*_{k+1} \) over the entire state space. Instead, we replace \( J^*_{k+1} \) with an approximate cost function \( \tilde{J}_{k+1} \).
One-Step Lookahead Minimization
At state \( \tilde{x}_k \), we choose control \( \tilde{u}_k \) by solving:
\[ \tilde{u}_k \in \arg\min_{u_k \in U_k(\tilde{x}_k)} \left[ g_k(\tilde{x}_k, u_k) + \tilde{J}_{k+1} (f_k(\tilde{x}_k, u_k)) \right] \]
Then we move to the next state: \( \tilde{x}_{k+1} = f_k(\tilde{x}_k, \tilde{u}_k) \).
Visualizing the Process
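To make the process concrete, here is a minimal Python sketch of one lookahead step on a toy scalar problem. Everything below (the control set `U`, dynamics `f_k`, stage cost `g_k`, and the quadratic guess `J_next` standing in for \( \tilde{J}_{k+1} \)) is a made-up placeholder, not something prescribed by the method.

```python
# Toy placeholders: scalar state, a small finite control set, quadratic costs.
U = [-1.0, 0.0, 1.0]                    # control set U_k(x), state-independent here

def f_k(x, u):                          # system dynamics: x_{k+1} = f_k(x_k, u_k)
    return 0.9 * x + u

def g_k(x, u):                          # stage cost g_k(x_k, u_k)
    return x**2 + 0.5 * u**2

def J_next(x):                          # approximate cost-to-go J~_{k+1}(x)
    return 2.0 * x**2

def one_step_lookahead(x):
    """Pick u~_k by minimizing g_k(x, u) + J~_{k+1}(f_k(x, u)) over the control set."""
    costs = {u: g_k(x, u) + J_next(f_k(x, u)) for u in U}
    u_best = min(costs, key=costs.get)
    return u_best, costs

x = 1.5
u_tilde, costs = one_step_lookahead(x)  # minimize over candidate controls
x_next = f_k(x, u_tilde)                # then move to the next state
print("candidate costs:", costs)
print("chosen control:", u_tilde, "-> next state:", x_next)
```

Printing the candidate costs traces the process at each step: every feasible control is scored by its stage cost plus the approximate cost-to-go of the state it leads to, and the cheapest one is applied.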
2. Approximate Q-Factors
Definition
The approximate Q-factor \( \tilde{Q}_k(x_k, u_k) \) is the cost of applying control \( u_k \) at state \( x_k \), plus the approximate cost-to-go \( \tilde{J}_{k+1} \) evaluated at the resulting next state.
\[ \tilde{Q}_k(x_k, u_k) = g_k(x_k, u_k) + \tilde{J}_{k+1}(f_k(x_k, u_k)) \]
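In code, reusing the toy placeholders `f_k`, `g_k`, and `J_next` from the sketch above, the approximate Q-factor is just this composition:

```python
def Q_tilde(x, u):
    # Q~_k(x, u) = g_k(x, u) + J~_{k+1}(f_k(x, u))
    return g_k(x, u) + J_next(f_k(x, u))
```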
The control selection then simply becomes minimizing the Q-factor:
\[ \tilde{u}_k \in \arg\min_{u_k \in U_k(\tilde{x}_k)} \tilde{Q}_k(\tilde{x}_k, u_k) \]
3. Offline vs Online
Online Approximation
Compute \( \tilde{J}_{k+1} \) on the fly, e.g., by rollout: simulate a base heuristic from the candidate next state and use its accumulated cost as the approximation (see the sketch after this list).
- Pros: Adapts to the current state.
- Cons: Computationally expensive at each step.
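As one illustration of the online option, here is a hedged sketch of rollout on the toy placeholders from the first sketch: \( \tilde{J}_{k+1} \) at a candidate next state is estimated on the fly by simulating a simple base heuristic (here, always applying `u = 0`, an arbitrary choice) for a fixed horizon and summing its stage costs.

```python
HORIZON = 20                              # rollout horizon; an arbitrary choice

def base_heuristic(x):
    # A deliberately simple base policy: apply no control.
    return 0.0

def rollout_cost(x):
    """Estimate the cost-to-go at x by simulating the base heuristic and summing costs."""
    total = 0.0
    for _ in range(HORIZON):
        u = base_heuristic(x)
        total += g_k(x, u)
        x = f_k(x, u)
    return total

def rollout_lookahead(x):
    # One-step lookahead with the rollout estimate playing the role of J~_{k+1}.
    return min(U, key=lambda u: g_k(x, u) + rollout_cost(f_k(x, u)))

print("rollout control at x = 1.5:", rollout_lookahead(1.5))
```

Every control decision triggers a fresh batch of simulations, which is exactly the online expense noted above.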
Offline Q-Factors
Train \( \tilde{Q}_k \) beforehand (e.g., with a neural network fit to sampled Q-factor values) and query it directly online (see the sketch after this list).
- Pros: Very fast online execution.
- Cons: Performance depends on training quality; approximation errors in \( \tilde{Q}_k \) carry over into the controls it selects.
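To illustrate the offline option, here is a hedged sketch, again on the toy placeholders from the first sketch, with a plain least-squares fit over hand-picked quadratic features standing in for a neural network: sample states offline, compute target Q-factors from the model, fit \( \tilde{Q}_k \), and then the online step reduces to a cheap argmin over the fitted function.

```python
import numpy as np

rng = np.random.default_rng(0)

def features(x, u):
    # Hand-picked quadratic features; a neural network would replace this choice.
    return np.array([1.0, x, u, x * x, u * u, x * u])

# Offline phase: sample states, compute target Q-factors using the model,
# and fit the parameters by least squares.
X_rows, targets = [], []
for _ in range(500):
    x = rng.uniform(-3.0, 3.0)
    for u in U:
        X_rows.append(features(x, u))
        targets.append(g_k(x, u) + J_next(f_k(x, u)))  # target: g_k + J~_{k+1}(f_k(x, u))
theta, *_ = np.linalg.lstsq(np.array(X_rows), np.array(targets), rcond=None)

# Online phase: no dynamics or cost evaluations, only the fitted Q~ and an argmin.
def Q_hat(x, u):
    return features(x, u) @ theta

def offline_policy(x):
    return min(U, key=lambda u: Q_hat(x, u))

print("offline-trained control at x = 1.5:", offline_policy(1.5))
```

If the fit is poor on the states actually visited, the argmin inherits those errors, which is the training-quality caveat above.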