1. The Restricted Problem
We restrict attention to a parametrized subset of probability distributions \( p(z; r) \).
Objective
Minimize the expected cost over the parameter \( r \):
\[ \min_{r} E_{p(z;r)} \{ F(z) \} \]
We use a gradient method:
\[ r^{k+1} = r^k - \gamma^k \nabla \left( E_{p(z;r^k)} \{ F(z) \} \right) \]
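As a concrete illustration (an assumed example, not from the source): if \( z \) is scalar, \( p(z; r) \) is Gaussian with mean \( r \) and fixed variance \( \sigma^2 \), and \( F(z) = z^2 \), then
\[ E_{p(z;r)} \{ F(z) \} = r^2 + \sigma^2, \qquad \nabla_r E_{p(z;r)} \{ F(z) \} = 2r, \]
so the gradient iteration becomes \( r^{k+1} = r^k - 2 \gamma^k r^k \), which drives \( r \) to the minimizer \( r = 0 \).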
2. The Log-Likelihood Ratio Trick
How do we compute the gradient of an expectation? The key step is to interchange \( \nabla \) and \( E \), so that the gradient itself becomes an expectation we can estimate by sampling.
Derivation
\[ \nabla E \{F(z)\} = \sum_z \nabla p(z; r) F(z) = \sum_z p(z; r) \frac{\nabla p(z; r)}{p(z; r)} F(z) \]
Using the identity \( \nabla \log p(z; r) = \frac{\nabla p(z; r)}{p(z; r)} \):
\[ \nabla E \{F(z)\} = E \{ \nabla (\log p(z; r)) F(z) \} \]
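Continuing the Gaussian illustration (again an assumption, not from the source): \( \nabla \log p(z; r) = (z - r)/\sigma^2 \), so a single sample \( z \) yields the unbiased gradient estimate \( \frac{z - r}{\sigma^2} F(z) \).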
Sample-Based Algorithm
- Sample a trajectory \( z^k \) from \( p(z; r^k) \).
- Compute the gradient estimate: \( g^k = \nabla (\log p(z^k; r^k)) F(z^k) \).
- Update: \( r^{k+1} = r^k - \gamma^k g^k \).
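A minimal Python sketch of this loop, reusing the Gaussian example above (the distribution, cost, and step size are illustrative assumptions, not from the source):

```python
import numpy as np

rng = np.random.default_rng(0)

def F(z):
    # Illustrative cost; any function of the sampled trajectory works.
    return z ** 2

sigma = 1.0   # fixed standard deviation of p(z; r) = N(r, sigma^2)
r = 5.0       # initial parameter r^0

for k in range(2000):
    gamma = 0.01                       # step size gamma^k (constant here)
    z = rng.normal(r, sigma)           # sample z^k from p(z; r^k)
    grad_log_p = (z - r) / sigma**2    # gradient of log p(z; r) for the Gaussian
    g = grad_log_p * F(z)              # gradient estimate g^k
    r -= gamma * g                     # r^{k+1} = r^k - gamma^k g^k

print(r)  # approaches 0, the minimizer of E{F(z)} = r^2 + sigma^2
```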
3. Discounted Problems
Trajectory Probability
For a randomized policy \( p(u|i; r) \), the probability of a trajectory \( z \) is a product of policy terms \( p(u_m | i_m; r) \) and transition probabilities; since the transition probabilities do not depend on \( r \), the gradient of the log-probability reduces to a sum over the policy terms:
\[ \nabla \log p(z; r) = \sum_{m=0}^{\infty} \nabla \log p(u_m | i_m; r) \]
This allows us to compute gradients by summing terms over the visited states and controls.
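A sketch of how this sum is accumulated for a tabular softmax policy (the parametrization, the truncation to a finite trajectory, and all names here are illustrative assumptions):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def grad_log_trajectory(states, controls, r):
    """Accumulate the sum of grad log p(u_m | i_m; r) over a visited trajectory.

    r[i, u] holds one logit per (state i, control u) pair, so that
    p(u | i; r) = softmax(r[i])[u].
    """
    grad = np.zeros_like(r)
    for i, u in zip(states, controls):
        probs = softmax(r[i])
        grad[i, u] += 1.0   # gradient of log softmax: indicator ...
        grad[i] -= probs    # ... minus the probabilities, in row i
    return grad

# Example: 4 states, 3 controls, and a short visited trajectory.
rng = np.random.default_rng(0)
r = rng.normal(size=(4, 3))
g = grad_log_trajectory(states=[0, 2, 1], controls=[1, 0, 2], r=r)
```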
Issues
- High variance (noise) in the cost samples \( F(z^k) \).
- Need to balance exploration vs. exploitation.
Solution: Baselines
Subtract a baseline \( b \) from the cost to reduce variance without biasing the gradient:
\[ \nabla \log p(z; r) (F(z) - b) \]
The estimate remains unbiased because \( E \{ \nabla \log p(z; r) \} = \sum_z \nabla p(z; r) = \nabla 1 = 0 \), so the \( b \) term has zero mean.
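A minimal variant of the earlier Gaussian sketch with a running-average baseline (the averaging rate and other constants are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, r, b = 1.0, 5.0, 0.0   # b is the baseline estimate

for k in range(2000):
    gamma = 0.01
    z = rng.normal(r, sigma)
    cost = z ** 2
    g = (z - r) / sigma**2 * (cost - b)   # baseline-corrected gradient estimate
    r -= gamma * g
    b = 0.99 * b + 0.01 * cost            # slow running average of observed costs
```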
4. Cost Shaping
A technique to reduce noise and improve convergence by solving an equivalent "variational" problem.
The Transformation
Subtract a function \( V(x) \) from the optimal cost \( J^*(x) \) and modify the stage cost:
\[ \hat{g}(x,u,y) = g(x,u,y) + \alpha V(y) - V(x) \]
This preserves the optimal policy but changes the "shape" of the cost function, making it easier to approximate if \( V \approx J^* \).
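To see why the optimal policy is preserved, here is a one-line check (a standard argument, filled in here): if \( J^* \) satisfies Bellman's equation \( J^*(x) = \min_u E \{ g(x,u,y) + \alpha J^*(y) \} \), then subtracting \( V(x) \) from both sides gives
\[ J^*(x) - V(x) = \min_u E \left\{ \hat{g}(x,u,y) + \alpha \big( J^*(y) - V(y) \big) \right\} \]
so \( \hat{J} = J^* - V \) solves the shaped problem, and the minimizing \( u \) at each \( x \) is unchanged.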
5. Robustness & AlphaZero
Pure policy gradient methods have limitations: the trained policy is applied directly, with no online lookahead to correct its errors.
The Hybrid Approach
Instead of using the policy gradient policy directly for control, use it as a base policy for rollout or Monte Carlo Tree Search (MCTS).
This is the secret sauce behind AlphaZero!
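As a closing illustration, a minimal sketch of one-step rollout on top of a learned base policy (all interfaces here, controls, step, base_policy, are hypothetical names, and the simulator is assumed deterministic for brevity):

```python
import math

def rollout_control(x, controls, step, base_policy, horizon, alpha):
    """Pick the control minimizing one-step cost plus a simulated tail.

    controls(x)     -- available controls at state x        (assumed interface)
    step(x, u)      -- returns (stage cost, next state)     (assumed interface)
    base_policy(x)  -- the policy-gradient-trained policy used for the tail
    """
    best_u, best_q = None, math.inf
    for u in controls(x):
        cost, y = step(x, u)
        q, discount = cost, alpha
        for _ in range(horizon):          # follow the base policy from y
            c, y = step(y, base_policy(y))
            q += discount * c
            discount *= alpha
        if q < best_q:
            best_u, best_q = u, q
    return best_u
```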