Model-Free & Truncated Rollout

Handling large horizons with truncation and unknown costs with expert feedback.

1. Truncated Rollout

For problems with very long horizons (large \( N \)), simulating the base heuristic all the way to the end of the horizon is too expensive.

The Strategy

  • Run the base heuristic for only \( \ell \) steps (truncation).
  • Add a Terminal Cost Approximation \( \tilde{J}(x_{k+\ell}) \) at the leaf nodes.
  • This bridges the gap between full rollout and pure value space approximation.

Truncated Lookahead

```mermaid
graph TD
    Root("State x_k") -->|u_k| Next("x_{k+1}")
    Next -->|Heuristic Step 1| H1("...")
    H1 -->|Heuristic Step l| Leaf("Leaf x_{k+l}")
    Leaf -->|Approx Cost| Cost("Terminal Cost J~(x_{k+l})")
    style Leaf fill:#e0e7ff,stroke:#6366f1
    style Cost fill:#f3e8ff,stroke:#a855f7
```
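The scheme above can be sketched in code. This is a minimal toy sketch, not a general implementation: the dynamics `step`, stage cost `step_cost`, heuristic, and terminal approximation `J_tilde` are all illustrative stand-ins on a one-dimensional problem.

```python
# Sketch of truncated rollout on a toy deterministic problem.
# step, step_cost, heuristic, and J_tilde are hypothetical examples.

def step(x, u):
    """Toy dynamics: move on the integer line."""
    return x + u

def step_cost(x, u):
    """Toy stage cost: distance from the origin plus control effort."""
    return abs(x) + abs(u)

def heuristic(x):
    """Base heuristic: greedily move one step toward the origin."""
    return -1 if x > 0 else (1 if x < 0 else 0)

def J_tilde(x):
    """Terminal cost approximation applied at the truncation leaf."""
    return 2 * abs(x)

def truncated_rollout_control(x_k, controls=(-1, 0, 1), ell=3):
    """Pick u_k by simulating the heuristic for only ell steps,
    then adding J_tilde at the leaf instead of rolling to horizon N."""
    best_u, best_q = None, float("inf")
    for u in controls:
        x = step(x_k, u)
        q = step_cost(x_k, u)
        for _ in range(ell):            # truncated heuristic rollout
            u_h = heuristic(x)
            q += step_cost(x, u_h)
            x = step(x, u_h)
        q += J_tilde(x)                 # terminal cost at leaf x_{k+ell}
        if q < best_q:
            best_u, best_q = u, q
    return best_u

print(truncated_rollout_control(5))   # → -1 (moves toward the origin)
```

With `ell = 0` this degenerates to pure one-step approximation in value space using \( \tilde{J} \); with `ell = N` it is full rollout.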
2. Model-Free Rollout

What if we don't know the cost function \( G(u) \) or the constraints?

The Setup

  • Base Heuristic: Can generate complete feasible solutions from partial ones.
  • Expert: A human or software oracle that can rank any two solutions (better/worse) without giving numbers.

Expert-Guided Selection

```mermaid
graph LR
    Partial("Partial Solution (u_0...u_k)") -->|Extend u^1| Sol1("Complete Sol 1")
    Partial -->|Extend u^2| Sol2("Complete Sol 2")
    Sol1 --> Expert("Expert Ranking")
    Sol2 --> Expert
    Expert --> Choice("Select Best u_{k+1}")
    style Expert fill:#fef3c7,stroke:#f59e0b
```
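One step of this selection can be sketched as follows. The completion rule and the expert are hypothetical toy stand-ins; the key point is that `expert_prefers` exposes only a ranking, never a numeric cost.

```python
# Sketch of one model-free rollout step with a comparison oracle.
# complete and expert_prefers are illustrative placeholders.

def complete(partial, u):
    """Base heuristic: extend the partial solution with u, then fill
    the rest (toy version: pad with zeros to length 5)."""
    sol = partial + [u]
    return sol + [0] * (5 - len(sol))

def expert_prefers(sol_a, sol_b):
    """Expert: True if sol_a is ranked better than sol_b. Internally it
    may use any criterion; only the comparison is observable."""
    return sum(abs(v) for v in sol_a) < sum(abs(v) for v in sol_b)

def model_free_rollout_step(partial, candidates=(-1, 0, 1)):
    """Choose the next control using only pairwise expert rankings."""
    best_u = candidates[0]
    best_sol = complete(partial, best_u)
    for u in candidates[1:]:
        sol = complete(partial, u)
        if expert_prefers(sol, best_sol):   # a ranking, not a cost value
            best_u, best_sol = u, sol
    return best_u

print(model_free_rollout_step([1, -1]))   # → 0
```

Repeating this step for \( k = 0, 1, \ldots \) yields a full rollout trajectory without ever evaluating \( G(u) \) explicitly.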
3. LLMs as Rollout Policies

Large Language Models (LLMs) like GPT can be viewed as policies for generating text (sequential decisions).

The Analogy

  • State: Context window (partial text).
  • Control: Next word selection.
  • Heuristic: The LLM generating the rest of the sentence.

Challenges

  • Cost Function: Hard to define "quality" numerically.
  • Expert: Need a way to rank completions (e.g., Reward Models in RLHF).
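The analogy can be made concrete with a sketch. Both `llm_complete` (the base heuristic) and `reward_model` (the expert) are hypothetical stubs standing in for a real language model and a real RLHF reward model.

```python
# Sketch: next-word selection as rollout with a reward-model expert.
# llm_complete and reward_model are toy stand-ins, not real APIs.

def llm_complete(context):
    """Base heuristic: the 'LLM' finishes the text (toy stub)."""
    return context + " ...and that is the answer."

def reward_model(text):
    """Expert: scores a full completion (toy: prefers shorter text)."""
    return -len(text)

def choose_next_word(context, candidates):
    """Rollout: extend the context with each candidate word, let the
    LLM complete it, and rank the completions with the reward model."""
    return max(
        candidates,
        key=lambda w: reward_model(llm_complete(context + " " + w)),
    )

print(choose_next_word("Rollout is", ["useful", "a technique"]))
```

Here the context window plays the role of the state \( x_k \), the candidate word is the control \( u_k \), and the reward model replaces a numeric cost function.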
4. Test Your Knowledge

1. Why do we use Truncated Rollout?

2. What compensates for the truncation error?

3. In Model-Free Rollout, what replaces the cost function?

4. Does the Expert need to provide numerical values?

5. How can an LLM act as a base heuristic?
