📈

Policy Gradient Methods

Gradient-based optimization for Deterministic and Stochastic Problems.

🎯

1. Deterministic Problems

We want to minimize \( J_{\tilde{\mu}(r)}(i_0) \) over \( r \) using the gradient method:

\[ r^{k+1} = r^k - \gamma^k \nabla J_{\tilde{\mu}(r^k)}(i_0) \]

The Challenge

The gradient \( \nabla J \) is often not explicitly available.

Solution: Approximate it by finite differences of cost function values.

Deterministic Case

Finite differences work well because the cost function evaluation is exact (no noise).
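
As a concrete sketch of this scheme (the quadratic cost, step size, and perturbation size below are illustrative placeholders, not from the lecture), central differences of exact cost values give a usable gradient for the update \( r^{k+1} = r^k - \gamma^k \nabla J \):

```python
import numpy as np

def fd_gradient(J, r, delta=1e-5):
    """Central finite-difference approximation of the gradient of J at r."""
    grad = np.zeros_like(r)
    for i in range(len(r)):
        e = np.zeros_like(r)
        e[i] = delta
        grad[i] = (J(r + e) - J(r - e)) / (2.0 * delta)
    return grad

# Placeholder deterministic cost: evaluations are exact, so differencing is reliable.
def J(r):
    return (r[0] - 1.0) ** 2 + 2.0 * (r[1] + 0.5) ** 2

r = np.array([3.0, 3.0])
gamma = 0.1                                   # constant step size for this sketch
for k in range(200):
    r = r - gamma * fd_gradient(J, r)         # r^{k+1} = r^k - gamma^k * grad J(r^k)

print(r)                                      # close to the minimizer (1.0, -0.5)
```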

Stochastic Case

Cost values are estimated by Monte Carlo simulation and are therefore noisy. Differencing noisy values leads to very poor gradient estimates, because dividing by a small perturbation amplifies the noise.
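
A quick numerical sketch of the failure mode (the noise level \( \sigma \) and perturbation \( \delta \) are illustrative assumptions): the finite-difference quotient divides zero-mean noise by \( 2\delta \), so its error is of order \( \sigma/\delta \).

```python
import numpy as np

rng = np.random.default_rng(0)

def J_noisy(r, sigma=0.1):
    """Monte Carlo-style cost estimate: true cost (r - 1)^2 plus zero-mean noise."""
    return (r - 1.0) ** 2 + sigma * rng.standard_normal()

r, delta = 3.0, 1e-3
exact_grad = 2.0 * (r - 1.0)                                    # = 4.0
fd_grad = (J_noisy(r + delta) - J_noisy(r - delta)) / (2.0 * delta)
print(exact_grad, fd_grad)   # the estimate is off by a term of order sigma / delta ~ 100
```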

🎲

2. Stochastic Problems

To handle stochastic problems, we take an unusual step: we convert the minimization of \( F(z) \) over \( z \) into a stochastic optimization problem over probability distributions.

The Transformation

Instead of \( \min_{z \in Z} F(z) \), we solve:

\[ \min_{p \in \mathcal{P}_Z} E_p \{ F(z) \} \]
  • \( z \): Random variable (e.g., state-control trajectory).
  • \( \mathcal{P}_Z \): Set of probability distributions over \( Z \).
  • \( p \): A generic distribution (the policy).
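
The two problems have the same optimal value: for any distribution \( p \), \( E_p\{F(z)\} \ge \min_{z \in Z} F(z) \), with equality for the point mass concentrated at a minimizer. A tiny sketch over a finite set \( Z \) (the cost values below are made up for illustration):

```python
import numpy as np

F = np.array([3.0, 1.0, 4.0, 1.5])        # cost F(z) on a finite set Z = {0, 1, 2, 3}

# Deterministic problem: minimize F over z.
z_star = int(np.argmin(F))
min_F = F[z_star]

# Reformulation: minimize E_p{F(z)} over distributions p on Z.
# Every p satisfies p @ F >= min_F; the point mass at z_star attains equality.
rng = np.random.default_rng(1)
for _ in range(5):
    p = rng.dirichlet(np.ones(len(F)))    # a random distribution over Z
    assert p @ F >= min_F - 1e-12

p_star = np.eye(len(F))[z_star]           # point mass at the minimizer
print(min_F, p_star @ F)                  # both equal 1.0
```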

Relation to DP

For this to apply to DP, we must enlarge the set of policies to include randomized policies.

Diagram: State i → (Randomized Policy) → Distribution over U → Control u

Note: In standard DP, optimization over randomized policies yields the same optimal cost as deterministic policies. However, this "smoothing" allows us to compute gradients more easily (as we will see).
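
As a preview of why this smoothing helps (the Gaussian parameterization and the likelihood-ratio estimator below are assumptions for illustration; the precise gradient formula is developed later), once \( p \) depends smoothly on a parameter, the gradient of \( E_p\{F(z)\} \) can be estimated from sampled values of \( F \) alone, with no differencing:

```python
import numpy as np

rng = np.random.default_rng(2)

def F(z):
    return (z - 1.0) ** 2                  # cost whose expectation we minimize

# Assumed smooth parameterization: p_theta = N(theta, sigma^2),
# so grad_theta log p_theta(z) = (z - theta) / sigma^2.
theta, sigma, step = 3.0, 0.5, 0.05
for k in range(2000):
    z = theta + sigma * rng.standard_normal(64)        # sample a batch z ~ p_theta
    grad_est = np.mean(F(z) * (z - theta) / sigma**2)  # sample-based gradient of E{F}
    theta -= step * grad_est

print(theta)                                           # drifts toward the minimizer z = 1
```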

📝

3. Test Your Knowledge

1. How do we typically approximate gradients in deterministic problems?

2. Why is finite differencing bad for stochastic problems?

3. What "unusual step" is taken for stochastic policy gradients?

4. What does 'z' represent in the DP context?

5. Does optimizing over randomized policies improve the optimal cost in standard DP?
