1. The Restricted Problem
We restrict attention to a parametrized subset of probability distributions \( p(z; r) \).
Objective
Minimize the expected cost over the parameter \( r \):
\[ \min_{r} E_{p(z;r)} \{ F(z) \} \]
We use a gradient method:
\[ r^{k+1} = r^k - \gamma^k \nabla \left( E_{p(z;r^k)} \{ F(z) \} \right) \]
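As a concrete illustration (an assumed example, not from the source): if \( z \) is scalar, \( p(z; r) \) is Gaussian with mean \( r \) and fixed variance \( \sigma^2 \), and \( F(z) = z^2 \), then
\[ E_{p(z;r)} \{ F(z) \} = r^2 + \sigma^2, \qquad \nabla_r E_{p(z;r)} \{ F(z) \} = 2r, \]
so the gradient iteration becomes \( r^{k+1} = r^k - 2 \gamma^k r^k \), which drives \( r \) to the minimizer \( r = 0 \).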
2. The Log-Likelihood Ratio Trick
How do we compute the gradient of an expectation? The key step is to interchange \( \nabla \) and \( E \), so that the gradient itself becomes an expectation we can estimate by sampling.
Derivation
\[ \nabla E \{F(z)\} = \sum_z \nabla p(z; r) F(z) = \sum_z p(z; r) \frac{\nabla p(z; r)}{p(z; r)} F(z) \]
Using the identity \( \nabla \log p(z; r) = \frac{\nabla p(z; r)}{p(z; r)} \):
\[ \nabla E \{F(z)\} = E \{ \nabla (\log p(z; r)) F(z) \} \]
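Continuing the Gaussian illustration (again an assumption, not from the source): \( \nabla \log p(z; r) = (z - r)/\sigma^2 \), so a single sample \( z \) yields the unbiased gradient estimate \( \frac{z - r}{\sigma^2} F(z) \).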
Sample-Based Algorithm
- Sample a trajectory \( z^k \) from \( p(z; r^k) \).
- Compute the gradient estimate: \( g^k = \nabla (\log p(z^k; r^k)) F(z^k) \).
- Update: \( r^{k+1} = r^k - \gamma^k g^k \).
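A minimal Python sketch of this loop, reusing the Gaussian example above (the distribution, cost, and step size are illustrative assumptions, not from the source):

```python
import numpy as np

rng = np.random.default_rng(0)

def F(z):
    # Illustrative cost; any function of the sampled trajectory works.
    return z ** 2

sigma = 1.0   # fixed standard deviation of p(z; r) = N(r, sigma^2)
r = 5.0       # initial parameter r^0

for k in range(2000):
    gamma = 0.01                       # step size gamma^k (constant here)
    z = rng.normal(r, sigma)           # sample z^k from p(z; r^k)
    grad_log_p = (z - r) / sigma**2    # gradient of log p(z; r) for the Gaussian
    g = grad_log_p * F(z)              # gradient estimate g^k
    r -= gamma * g                     # r^{k+1} = r^k - gamma^k g^k

print(r)  # approaches 0, the minimizer of E{F(z)} = r^2 + sigma^2
```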
3. Discounted Problems
Trajectory Probability
For a randomized policy \( p(u|i; r) \), the probability of a trajectory \( z \) is a product of policy terms \( p(u_m | i_m; r) \) and transition probabilities; since the transition probabilities do not depend on \( r \), the gradient of the log-probability reduces to a sum over the policy terms:
\[ \nabla \log p(z; r) = \sum_{m=0}^{\infty} \nabla \log p(u_m | i_m; r) \]
This allows us to compute gradients by summing terms over the visited states and controls.
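A sketch of how this sum is accumulated for a tabular softmax policy (the parametrization, the truncation to a finite trajectory, and all names here are illustrative assumptions):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def grad_log_trajectory(states, controls, r):
    """Accumulate the sum of grad log p(u_m | i_m; r) over a visited trajectory.

    r[i, u] holds one logit per (state i, control u) pair, so that
    p(u | i; r) = softmax(r[i])[u].
    """
    grad = np.zeros_like(r)
    for i, u in zip(states, controls):
        probs = softmax(r[i])
        grad[i, u] += 1.0   # gradient of log softmax: indicator ...
        grad[i] -= probs    # ... minus the probabilities, in row i
    return grad

# Example: 4 states, 3 controls, and a short visited trajectory.
rng = np.random.default_rng(0)
r = rng.normal(size=(4, 3))
g = grad_log_trajectory(states=[0, 2, 1], controls=[1, 0, 2], r=r)
```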
Issues
- High variance (noise) in the cost samples \( F(z^k) \).
- Need to balance exploration vs. exploitation.
Solution: Baselines
Subtract a baseline \( b \) from the cost to reduce variance without biasing the gradient:
\[ \nabla \log p(z; r) (F(z) - b) \]
The estimate remains unbiased because \( E \{ \nabla \log p(z; r) \} = \sum_z \nabla p(z; r) = \nabla 1 = 0 \), so the \( b \) term has zero mean.
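A minimal variant of the earlier Gaussian sketch with a running-average baseline (the averaging rate and other constants are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, r, b = 1.0, 5.0, 0.0   # b is the baseline estimate

for k in range(2000):
    gamma = 0.01
    z = rng.normal(r, sigma)
    cost = z ** 2
    g = (z - r) / sigma**2 * (cost - b)   # baseline-corrected gradient estimate
    r -= gamma * g
    b = 0.99 * b + 0.01 * cost            # slow running average of observed costs
```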
4. Cost Shaping
A technique to reduce noise and improve convergence by solving an equivalent "variational" problem.
The Transformation
Subtract a function \( V(x) \) from the optimal cost \( J^*(x) \) and modify the stage cost:
\[ \hat{g}(x,u,y) = g(x,u,y) + \alpha V(y) - V(x) \]
This preserves the optimal policy but changes the "shape" of the cost function, making it easier to approximate if \( V \approx J^* \).
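To see why the optimal policy is preserved, here is a one-line check (a standard argument, filled in here): if \( J^* \) satisfies Bellman's equation \( J^*(x) = \min_u E \{ g(x,u,y) + \alpha J^*(y) \} \), then subtracting \( V(x) \) from both sides gives
\[ J^*(x) - V(x) = \min_u E \left\{ \hat{g}(x,u,y) + \alpha \big( J^*(y) - V(y) \big) \right\} \]
so \( \hat{J} = J^* - V \) solves the shaped problem, and the minimizing \( u \) at each \( x \) is unchanged.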
5. Robustness & AlphaZero
Pure policy gradient methods have limitations: the trained policy is applied directly, with no online lookahead to correct its errors.
The Hybrid Approach
Instead of using the policy gradient policy directly for control, use it as a base policy for rollout or Monte Carlo Tree Search (MCTS).
This is the secret sauce behind AlphaZero!
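As a closing illustration, a minimal sketch of one-step rollout on top of a learned base policy (all interfaces here, controls, step, base_policy, are hypothetical names, and the simulator is assumed deterministic for brevity):

```python
import math

def rollout_control(x, controls, step, base_policy, horizon, alpha):
    """Pick the control minimizing one-step cost plus a simulated tail.

    controls(x)     -- available controls at state x        (assumed interface)
    step(x, u)      -- returns (stage cost, next state)     (assumed interface)
    base_policy(x)  -- the policy-gradient-trained policy used for the tail
    """
    best_u, best_q = None, math.inf
    for u in controls(x):
        cost, y = step(x, u)
        q, discount = cost, alpha
        for _ in range(horizon):          # follow the base policy from y
            c, y = step(y, base_policy(y))
            q += discount * c
            discount *= alpha
        if q < best_q:
            best_u, best_q = u, q
    return best_u
```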