🎛️

Approximation in Policy Space

Parametrizing policies directly, Expert Supervised Training, and Unconventional Information Structures.

🏗️

1. General Framework

Instead of approximating the value function, we can parametrize the policy itself.

Parametrized Stationary Policies

We define a policy \( \tilde{\mu}(r) \) with components \( \tilde{\mu}(i, r) \), where \( r \) is a parameter vector.

Goal: Optimize some measure of performance with respect to \( r \).

Example: Supply Chain

Flow: Production Center → Delay → Retail Storage → Demand.

Natural Parametrization: When the inventory drops below \( r_1 \), order an amount \( r_2 \).
Optimize over \( r = (r_1, r_2) \).
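
As a concrete sketch of optimization in policy space, the snippet below simulates this ordering rule and searches over \( r = (r_1, r_2) \) by brute force. The Poisson demand, the one-period delivery delay, and the holding/shortage/ordering costs are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def policy(inventory, r):
    """Order r[1] units whenever the inventory drops below r[0]; otherwise order nothing."""
    return r[1] if inventory < r[0] else 0.0

def average_cost(r, horizon=2000, holding=1.0, shortage=5.0, fixed_order=2.0):
    """Average per-stage cost of the threshold policy, estimated by simulation."""
    rng = np.random.default_rng(0)      # common random numbers so different r's are compared fairly
    inventory, pending, total = 0.0, 0.0, 0.0
    for _ in range(horizon):
        inventory += pending            # order placed last period arrives (assumed one-period delay)
        u = policy(inventory, r)
        pending = u
        inventory -= rng.poisson(3)     # assumed demand distribution
        total += (fixed_order * (u > 0)
                  + holding * max(inventory, 0.0)
                  + shortage * max(-inventory, 0.0))
    return total / horizon

# Optimize over r = (r1, r2) directly in policy space, e.g. by grid search
# (any gradient-free search method would do just as well).
grid = [(r1, r2) for r1 in range(0, 11) for r2 in range(1, 16)]
best = min(grid, key=average_cost)
print("best (r1, r2):", best, " cost:", average_cost(best))
```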

🔄

2. Indirect Parametrization

We can also parametrize policies through cost features.

Value-Based Parametrization

Let \( \tilde{J}(i, r) \) be a cost approximation (e.g., linear in features). Define the policy by one-step lookahead: at each state, minimize the right-hand side of Bellman's equation with \( \tilde{J} \) in place of the optimal cost:

\[ \tilde{\mu}(i, r) \in \arg \min_{u \in U(i)} \sum_{j=1}^{n} p_{ij}(u) ( g(i, u, j) + \tilde{J}(j, r) ) \]

Useful when we know good features (like in Tetris) but want to optimize in policy space.
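
A minimal sketch of the induced policy \( \tilde{\mu}(i, r) \): with a linear approximation \( \tilde{J}(j, r) = r^\top \phi(j) \), the control at each state comes from the one-step lookahead minimization above. The small MDP, stage cost, and features below are made up for illustration.

```python
import numpy as np

n = 3
P = {                                       # P[u][i, j] = p_ij(u), for two controls (assumed MDP)
    0: np.array([[0.8, 0.2, 0.0],
                 [0.1, 0.8, 0.1],
                 [0.0, 0.2, 0.8]]),
    1: np.array([[0.5, 0.5, 0.0],
                 [0.0, 0.5, 0.5],
                 [0.0, 0.0, 1.0]]),
}

def g(i, u, j):
    """Assumed stage cost: pay 1 unless the next state is 2, plus a control cost."""
    return 1.0 * (j != 2) + 0.5 * u

def phi(j):
    """Assumed feature vector of state j."""
    return np.array([1.0, float(j)])

def mu_tilde(i, r, controls=(0, 1)):
    """Policy defined by the one-step lookahead minimization with J~(j, r) = r . phi(j)."""
    def q(u):
        return sum(P[u][i, j] * (g(i, u, j) + r @ phi(j)) for j in range(n))
    return min(controls, key=q)

r = np.array([0.0, -1.0])                   # parameter vector to be tuned in policy space
print([mu_tilde(i, r) for i in range(n)])   # the policy induced by this r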

🎓

3. Expert Supervised Training

Learning from a software or human expert (Cloning).

The Process

  1. Expert: Obtain "good" controls \( u^s \) at sample states \( i^s \), \( s = 1, \ldots, q \), from an expert.
  2. Dataset: Form pairs \( (i^s, u^s) \).
  3. Train: Find \( r \) to minimize the difference: \[ \min_{r} \sum_{s=1}^{q} \| u^s - \tilde{\mu}(i^s, r) \|^2 \]
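
For a linear-in-features policy \( \tilde{\mu}(i, r) = r^\top \phi(i) \), step 3 reduces to an ordinary least-squares fit. The sketch below illustrates this; the expert data and the features are placeholders rather than anything from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(i):
    """Assumed feature vector of state i: constant, state, state squared."""
    return np.array([1.0, i, i**2])

# Steps 1-2: expert state/control pairs (i^s, u^s); here generated from a
# hypothetical expert rule purely for illustration.
states = rng.uniform(0, 10, size=200)
expert_controls = np.maximum(8.0 - states, 0.0) + rng.normal(0, 0.1, size=200)

# Step 3: find r minimizing sum_s || u^s - mu~(i^s, r) ||^2 (linear least squares).
Phi = np.stack([phi(i) for i in states])          # q x (number of features)
r, *_ = np.linalg.lstsq(Phi, expert_controls, rcond=None)

mu_tilde = lambda i: phi(i) @ r                   # the cloned policy
print("r =", r, " mu~(3) =", mu_tilde(3.0))
```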

Applications

  • Games: Backgammon, Chess (initialization).
  • Policy Approximation: Approximating a rollout policy \( \hat{\mu} \) to make it faster to compute online.

📡

4. Unconventional Information Structures

Conventional DP assumes access to the full state (or belief state). What if that's not available?

When DP Breaks Down

  • Limited Memory: Controller "forgets" information.
  • Distributed Agents: Each agent acts on local information only.
  • Delayed Information: Agents receive data from others with a lag.

In these cases, Approximation in Policy Space is still applicable, whereas Value Space methods often fail, because they need knowledge of the exact (or belief) state to carry out the lookahead minimization.
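
For instance, a parametrized policy can be restricted to use only each agent's local information, and the parameters can still be tuned by simulating the overall cost. The toy two-agent service system below is an assumed setup, included only to illustrate that no global state, belief state, or value function is required.

```python
import numpy as np

def simulate(r, horizon=5000, seed=0):
    """Average cost when each agent applies a local threshold policy with parameter r[k]."""
    rng = np.random.default_rng(seed)
    q = np.zeros(2)                         # local queue lengths: agent k sees only q[k]
    total = 0.0
    for _ in range(horizon):
        u = (q > r).astype(float)           # each agent serves (1) or idles (0) based on its OWN queue
        arrivals = rng.poisson([0.4, 0.6])  # assumed arrival rates
        q = np.maximum(q + arrivals - u, 0.0)
        total += q.sum() + 0.2 * u.sum()    # assumed holding cost + service cost
    return total / horizon

# Optimize directly in policy space over the threshold pair r by random search.
candidates = [np.random.default_rng(k).uniform(0, 10, size=2) for k in range(50)]
best = min(candidates, key=simulate)
print("best thresholds:", best)
```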

📝

5. Test Your Knowledge

1. What is the main idea of Approximation in Policy Space?

2. In the supply chain example, what are the parameters (r1, r2)?

3. What is "Indirect Parametrization"?

4. What is "Expert Supervised Training"?

5. Why is Policy Space Approximation useful for Multiagent Systems?
