Model-Free & Truncated Rollout

Handling large horizons with truncation and unknown costs with expert feedback.

1. Truncated Rollout

For problems with very long horizons (large \( N \)), simulating the base heuristic all the way to the end of the horizon is too expensive.

The Strategy

  • Run the base heuristic for only \( \ell \) steps (truncation).
  • Add a Terminal Cost Approximation \( \tilde{J}(x_{k+\ell}) \) at the leaf nodes.
  • This bridges the gap between full rollout and pure value space approximation.

Truncated Lookahead

```mermaid
graph TD
    Root("State x_k") -->|u_k| Next("x_{k+1}")
    Next -->|Heuristic Step 1| H1("...")
    H1 -->|Heuristic Step l| Leaf("Leaf x_{k+l}")
    Leaf -->|Approx Cost| Cost("Terminal Cost J~(x_{k+l})")
    style Leaf fill:#e0e7ff,stroke:#6366f1
    style Cost fill:#f3e8ff,stroke:#a855f7
```
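The scheme above can be sketched in code. This is a minimal toy sketch, not a general implementation: the dynamics `step`, stage cost `step_cost`, heuristic, and terminal approximation `J_tilde` are all illustrative stand-ins on a one-dimensional problem.

```python
# Sketch of truncated rollout on a toy deterministic problem.
# step, step_cost, heuristic, and J_tilde are hypothetical examples.

def step(x, u):
    """Toy dynamics: move on the integer line."""
    return x + u

def step_cost(x, u):
    """Toy stage cost: distance from the origin plus control effort."""
    return abs(x) + abs(u)

def heuristic(x):
    """Base heuristic: greedily move one step toward the origin."""
    return -1 if x > 0 else (1 if x < 0 else 0)

def J_tilde(x):
    """Terminal cost approximation applied at the truncation leaf."""
    return 2 * abs(x)

def truncated_rollout_control(x_k, controls=(-1, 0, 1), ell=3):
    """Pick u_k by simulating the heuristic for only ell steps,
    then adding J_tilde at the leaf instead of rolling to horizon N."""
    best_u, best_q = None, float("inf")
    for u in controls:
        x = step(x_k, u)
        q = step_cost(x_k, u)
        for _ in range(ell):            # truncated heuristic rollout
            u_h = heuristic(x)
            q += step_cost(x, u_h)
            x = step(x, u_h)
        q += J_tilde(x)                 # terminal cost at leaf x_{k+ell}
        if q < best_q:
            best_u, best_q = u, q
    return best_u

print(truncated_rollout_control(5))   # → -1 (moves toward the origin)
```

With `ell = 0` this degenerates to pure one-step approximation in value space using \( \tilde{J} \); with `ell = N` it is full rollout.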
2. Model-Free Rollout

What if we don't know the cost function \( G(u) \) or the constraints?

The Setup

  • Base Heuristic: Can generate complete feasible solutions from partial ones.
  • Expert: A human or software oracle that can rank any two solutions (better/worse) without giving numbers.

Expert-Guided Selection

```mermaid
graph LR
    Partial("Partial Solution (u_0...u_k)") -->|Extend u^1| Sol1("Complete Sol 1")
    Partial -->|Extend u^2| Sol2("Complete Sol 2")
    Sol1 --> Expert("Expert Ranking")
    Sol2 --> Expert
    Expert --> Choice("Select Best u_{k+1}")
    style Expert fill:#fef3c7,stroke:#f59e0b
```
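One step of this selection can be sketched as follows. The completion rule and the expert are hypothetical toy stand-ins; the key point is that `expert_prefers` exposes only a ranking, never a numeric cost.

```python
# Sketch of one model-free rollout step with a comparison oracle.
# complete and expert_prefers are illustrative placeholders.

def complete(partial, u):
    """Base heuristic: extend the partial solution with u, then fill
    the rest (toy version: pad with zeros to length 5)."""
    sol = partial + [u]
    return sol + [0] * (5 - len(sol))

def expert_prefers(sol_a, sol_b):
    """Expert: True if sol_a is ranked better than sol_b. Internally it
    may use any criterion; only the comparison is observable."""
    return sum(abs(v) for v in sol_a) < sum(abs(v) for v in sol_b)

def model_free_rollout_step(partial, candidates=(-1, 0, 1)):
    """Choose the next control using only pairwise expert rankings."""
    best_u = candidates[0]
    best_sol = complete(partial, best_u)
    for u in candidates[1:]:
        sol = complete(partial, u)
        if expert_prefers(sol, best_sol):   # a ranking, not a cost value
            best_u, best_sol = u, sol
    return best_u

print(model_free_rollout_step([1, -1]))   # → 0
```

Repeating this step for \( k = 0, 1, \ldots \) yields a full rollout trajectory without ever evaluating \( G(u) \) explicitly.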
3. LLMs as Rollout Policies

Large Language Models (LLMs) like GPT can be viewed as policies for generating text (sequential decisions).

The Analogy

  • State: Context window (partial text).
  • Control: Next word selection.
  • Heuristic: The LLM generating the rest of the sentence.

Challenges

  • Cost Function: Hard to define "quality" numerically.
  • Expert: Need a way to rank completions (e.g., Reward Models in RLHF).
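The analogy can be made concrete with a sketch. Both `llm_complete` (the base heuristic) and `reward_model` (the expert) are hypothetical stubs standing in for a real language model and a real RLHF reward model.

```python
# Sketch: next-word selection as rollout with a reward-model expert.
# llm_complete and reward_model are toy stand-ins, not real APIs.

def llm_complete(context):
    """Base heuristic: the 'LLM' finishes the text (toy stub)."""
    return context + " ...and that is the answer."

def reward_model(text):
    """Expert: scores a full completion (toy: prefers shorter text)."""
    return -len(text)

def choose_next_word(context, candidates):
    """Rollout: extend the context with each candidate word, let the
    LLM complete it, and rank the completions with the reward model."""
    return max(
        candidates,
        key=lambda w: reward_model(llm_complete(context + " " + w)),
    )

print(choose_next_word("Rollout is", ["useful", "a technique"]))
```

Here the context window plays the role of the state \( x_k \), the candidate word is the control \( u_k \), and the reward model replaces a numeric cost function.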
4. Test Your Knowledge

1. Why do we use Truncated Rollout?

2. What compensates for the truncation error?

3. In Model-Free Rollout, what replaces the cost function?

4. Does the Expert need to provide numerical values?

5. How can an LLM act as a base heuristic?
