1. General Framework
Instead of approximating the value function, we can parametrize the policy itself.
Parametrized Stationary Policies
We define a policy \( \tilde{\mu}(r) \) with components \( \tilde{\mu}(i, r) \), where \( r \) is a parameter vector.
Goal: Optimize a measure of performance (e.g., the expected cost of \( \tilde{\mu}(r) \)) with respect to \( r \).
Example: Supply Chain
Natural Parametrization: When inventory drops below \( r_1 \), order amount \( r_2 \).
Optimize over \( r = (r_1, r_2) \).
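A minimal sketch of this threshold policy in Python (the names `inventory`, `r1`, `r2`, and the numbers in the example are illustrative, not from the lecture):

```python
def policy(inventory, r):
    """(r1, r2) threshold policy: if stock falls below r1, order r2 units."""
    r1, r2 = r
    return r2 if inventory < r1 else 0

# Example: with r = (10, 25), a state with 7 units on hand triggers an order of 25.
print(policy(7, (10, 25)))   # -> 25
print(policy(12, (10, 25)))  # -> 0
```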
2. Indirect Parametrization
We can also parametrize policies through cost features.
Value-Based Parametrization
Let \( \tilde{J}(i, r) \) be a cost approximation (e.g., linear in features). Define the policy by one-step lookahead minimization, with \( \tilde{J} \) in place of the true cost-to-go:
\[ \tilde{\mu}(i, r) \in \arg \min_{u \in U(i)} \sum_{j=1}^{n} p_{ij}(u) \big( g(i, u, j) + \tilde{J}(j, r) \big) \]
This is useful when we know good features (as in Tetris) but want to optimize in policy space.
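A minimal sketch of the induced one-step lookahead policy, assuming a linear feature architecture \( \tilde{J}(j, r) = \phi(j)^\top r \) and array layouts `p[u][i][j]`, `g[u][i][j]` (all names and shapes are illustrative assumptions):

```python
import numpy as np

def policy_from_cost_approx(i, r, U, p, g, phi):
    """One-step lookahead policy induced by J_tilde(j, r) = phi[j] @ r.

    U   : list of admissible controls at state i
    p   : p[u][i][j] transition probabilities
    g   : g[u][i][j] stage costs
    phi : phi[j] feature vector of state j
    """
    J_tilde = phi @ r                                # approximate cost-to-go for every state j
    q = [p[u][i] @ (g[u][i] + J_tilde) for u in U]   # expected one-step cost of each control
    return U[int(np.argmin(q))]                      # control attaining the minimum
```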
3. Expert Supervised Training
Learning from a software or human expert (Cloning).
The Process
- Expert: Obtain "good" controls \( u^s \) at states \( i^s \) from an expert.
- Dataset: Form pairs \( (i^s, u^s) \).
- Train: Find \( r \) to minimize the mismatch with the expert controls: \[ \min_{r} \sum_{s=1}^{q} \| u^s - \tilde{\mu}(i^s, r) \|^2 \]
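A minimal sketch of this training step, assuming (purely for illustration) a policy that is linear in state features, \( \tilde{\mu}(i, r) = \phi(i)^\top r \), so the least-squares problem has a closed-form solution:

```python
import numpy as np

def clone_expert(Phi, U_expert):
    """Fit r by least squares: min_r sum_s (u^s - phi(i^s) @ r)^2.

    Phi      : (q, d) matrix whose rows are the feature vectors phi(i^s)
    U_expert : (q,) array of expert controls u^s
    """
    r, *_ = np.linalg.lstsq(Phi, U_expert, rcond=None)
    return r

# Example with synthetic data standing in for expert demonstrations.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(100, 3))
U_expert = Phi @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
r = clone_expert(Phi, U_expert)
```

For a nonlinear parametrization (e.g., a neural network), the same objective would be minimized by gradient-based training instead of a closed-form solve.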
Applications
- Games: Backgammon, Chess (initialization).
- Policy Approximation: Approximating a rollout policy \( \hat{\mu} \) to make it faster to compute online.
4. Unconventional Information Structures
Conventional DP assumes access to the full state (or belief state). What if that's not available?
When DP Breaks Down
- Limited Memory: Controller "forgets" information.
- Distributed Agents: Each agent acts on local information only.
- Delayed Information: Agents receive data from others with a lag.
In these cases, approximation in policy space is still applicable, whereas approximation in value space often fails.