1. General Framework
Instead of approximating the value function, we can parametrize the policy itself.
Parametrized Stationary Policies
We define a policy \( \tilde{\mu}(r) \) with components \( \tilde{\mu}(i, r) \), where \( r \) is a parameter vector.
Goal: Optimize a measure of performance (e.g., the expected cost of \( \tilde{\mu}(r) \)) with respect to \( r \).
Example: Supply Chain
Natural Parametrization: When inventory drops below \( r_1 \), order amount \( r_2 \).
Optimize over \( r = (r_1, r_2) \).
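A minimal sketch of this threshold policy in Python (the names `inventory`, `r1`, `r2`, and the numbers in the example are illustrative, not from the lecture):

```python
def policy(inventory, r):
    """(r1, r2) threshold policy: if stock falls below r1, order r2 units."""
    r1, r2 = r
    return r2 if inventory < r1 else 0

# Example: with r = (10, 25), a state with 7 units on hand triggers an order of 25.
print(policy(7, (10, 25)))   # -> 25
print(policy(12, (10, 25)))  # -> 0
```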
2. Indirect Parametrization
We can also parametrize policies through cost features.
Value-Based Parametrization
Let \( \tilde{J}(i, r) \) be a cost approximation (e.g., linear in features). Define the policy by one-step lookahead minimization, with \( \tilde{J} \) in place of the true cost-to-go:
\[ \tilde{\mu}(i, r) \in \arg \min_{u \in U(i)} \sum_{j=1}^{n} p_{ij}(u) \big( g(i, u, j) + \tilde{J}(j, r) \big) \]
This is useful when we know good features (as in Tetris) but want to optimize in policy space.
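A minimal sketch of the induced one-step lookahead policy, assuming a linear feature architecture \( \tilde{J}(j, r) = \phi(j)^\top r \) and array layouts `p[u][i][j]`, `g[u][i][j]` (all names and shapes are illustrative assumptions):

```python
import numpy as np

def policy_from_cost_approx(i, r, U, p, g, phi):
    """One-step lookahead policy induced by J_tilde(j, r) = phi[j] @ r.

    U   : list of admissible controls at state i
    p   : p[u][i][j] transition probabilities
    g   : g[u][i][j] stage costs
    phi : phi[j] feature vector of state j
    """
    J_tilde = phi @ r                                # approximate cost-to-go for every state j
    q = [p[u][i] @ (g[u][i] + J_tilde) for u in U]   # expected one-step cost of each control
    return U[int(np.argmin(q))]                      # control attaining the minimum
```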
3. Expert Supervised Training
Learning from a software or human expert (Cloning).
The Process
- Expert: Obtain "good" controls \( u^s \) at states \( i^s \) from an expert.
- Dataset: Form pairs \( (i^s, u^s) \).
- Train: Find \( r \) to minimize the mismatch with the expert controls: \[ \min_{r} \sum_{s=1}^{q} \| u^s - \tilde{\mu}(i^s, r) \|^2 \]
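A minimal sketch of this training step, assuming (purely for illustration) a policy that is linear in state features, \( \tilde{\mu}(i, r) = \phi(i)^\top r \), so the least-squares problem has a closed-form solution:

```python
import numpy as np

def clone_expert(Phi, U_expert):
    """Fit r by least squares: min_r sum_s (u^s - phi(i^s) @ r)^2.

    Phi      : (q, d) matrix whose rows are the feature vectors phi(i^s)
    U_expert : (q,) array of expert controls u^s
    """
    r, *_ = np.linalg.lstsq(Phi, U_expert, rcond=None)
    return r

# Example with synthetic data standing in for expert demonstrations.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(100, 3))
U_expert = Phi @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
r = clone_expert(Phi, U_expert)
```

For a nonlinear parametrization (e.g., a neural network), the same objective would be minimized by gradient-based training instead of a closed-form solve.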
Applications
- Games: Backgammon, Chess (initialization).
- Policy Approximation: Approximating a rollout policy \( \hat{\mu} \) to make it faster to compute online.
4. Unconventional Information Structures
Conventional DP assumes access to the full state (or belief state). What if that's not available?
When DP Breaks Down
- Limited Memory: Controller "forgets" information.
- Distributed Agents: Each agent acts on local information only.
- Delayed Information: Agents receive data from others with a lag.
In these cases, approximation in policy space is still applicable, whereas approximation in value space often fails.