1. Infinite Horizon Stochastic DP Problem
Problem Statement
We aim to minimize the total cost over an infinite number of stages, given by:
\[ J_\pi(x_0) = \lim_{N \to \infty} E_{w_k, \: k=0,1,\dots} \left\{ \sum_{k=0}^{N-1} \alpha^k g(x_k, \mu_k(x_k), w_k) \right\} \]
where:
- \( J_\pi(x_0) \) is the cost associated with initial state \( x_0 \) and policy \( \pi = \{\mu_0, \mu_1, \dots\} \).
- \( \alpha \) is a discount factor in the interval \( (0, 1] \).
- System equation: \( x_{k+1} = f(x_k, u_k, w_k) \).
- The stochastic disturbances \( w_k \) have a probability distribution \( P(\cdot \mid x_k, u_k) \) that may depend on \( x_k \) and \( u_k \), but not on earlier disturbances.
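To make the cost definition concrete, here is a minimal Python sketch that estimates \( J_\pi(x_0) \) for a fixed stationary policy by truncating the infinite sum at a large \( N \). The 2-state model, the policy, and all numerical values are illustrative assumptions (not from the text), and the disturbance is folded into transition probabilities \( p(x' \mid x, u) \).

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.9                                 # discount factor (strictly less than 1 here)

# Illustrative 2-state, 2-control model (assumed for this sketch):
# P[u][x, x'] = probability of moving from x to x' under control u,
# g[x, u]     = expected stage cost at state x under control u.
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.9, 0.1]]])
g = np.array([[1.0, 2.0],
              [0.5, 3.0]])

def mu(x):                                  # a fixed stationary policy (assumed)
    return 0 if x == 0 else 1

def estimate_cost(x0, N=200, runs=1000):
    """Monte Carlo estimate of J_pi(x0), truncating the infinite sum at N stages."""
    total = 0.0
    for _ in range(runs):
        x, cost = x0, 0.0
        for k in range(N):
            u = mu(x)
            cost += alpha**k * g[x, u]
            x = rng.choice(2, p=P[u, x])    # next state drawn from P(. | x, u)
        total += cost
    return total / runs

print("J_pi(0) ≈", estimate_cost(0))
```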
2. Value Iteration (VI)
How do we solve this? We can think of it as the limit of finite horizon problems.
The Algorithm
Let \( J_N(x) \) denote the optimal cost of an \( N \)-stage problem. The Value Iteration algorithm generates this cost iteratively:
\[ J_{k+1}(x) = \min_{u \in U(x)} E_w \left\{ g(x, u, w) + \alpha J_k (f(x, u, w)) \right\}, \quad k = 0, 1, \dots \]
starting from \( J_0(x) = 0 \).
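A minimal sketch of the VI recursion for a finite-state, finite-control problem; the transition matrices and stage costs are illustrative assumptions, and the expectation over \( w \) reduces to a sum over next states.

```python
import numpy as np

alpha = 0.9
# Illustrative 2-state, 2-control model: P[u][x, x'] and g[x, u] (assumed values).
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.9, 0.1]]])
g = np.array([[1.0, 2.0],
              [0.5, 3.0]])

J = np.zeros(g.shape[0])                    # J_0(x) = 0
for k in range(1000):
    # Q[x, u] = E_w{ g(x, u, w) + alpha * J_k(f(x, u, w)) }
    Q = g + alpha * np.einsum('uxy,y->xu', P, J)
    J_next = Q.min(axis=1)                  # minimize over u in U(x)
    if np.max(np.abs(J_next - J)) < 1e-10:  # stop when the iterates have converged
        break
    J = J_next

print("J* ≈", J, "   greedy policy:", Q.argmin(axis=1))
```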
Optimal Cost \( J^* \) & Bellman Equation
The optimal infinite horizon cost is the limit \( J^*(x) = \lim_{N \to \infty} J_N(x) \). It satisfies the famous Bellman Equation:
\[ J^*(x) = \min_{u \in U(x)} E_w \left\{ g(x, u, w) + \alpha J^* (f(x, u, w)) \right\} \]
Remarks on the Bellman Equation
- It is a system of equations (one per state \( x \)).
- If \( \mu(x) \) attains the minimum for each \( x \), then the stationary policy \( \{\mu, \mu, \dots\} \) is optimal.
- The cost \( J_\mu \) of a stationary policy \( \mu \) satisfies: \[ J_\mu(x) = E_w \left\{ g(x, \mu(x), w) + \alpha J_\mu (f(x, \mu(x), w)) \right\} \]
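For a finite-state problem, the Bellman equation for a fixed stationary policy \( \mu \) is linear in \( J_\mu \), so it can be solved directly. A minimal sketch on the same illustrative 2-state model used above (all values are assumptions):

```python
import numpy as np

alpha = 0.9
# Same illustrative 2-state model as above (assumed values).
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.9, 0.1]]])
g = np.array([[1.0, 2.0],
              [0.5, 3.0]])
mu = np.array([0, 1])                       # a fixed stationary policy mu(x)

n = g.shape[0]
P_mu = P[mu, np.arange(n)]                  # transition matrix under mu
g_mu = g[np.arange(n), mu]                  # stage cost under mu

# Bellman equation for mu:  J_mu = g_mu + alpha * P_mu @ J_mu,
# i.e. the linear system   (I - alpha * P_mu) J_mu = g_mu
J_mu = np.linalg.solve(np.eye(n) - alpha * P_mu, g_mu)
print("J_mu =", J_mu)
```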
3. Policy Iteration (PI)
Policy Iteration forms the foundation for self-learning in RL.
1. Policy Evaluation
Compute cost \( J_\mu \) of current policy \( \mu \) by solving the Bellman equation for \( \mu \).
Can be done via Monte Carlo simulation (averaging over many trajectories).
2. Policy Improvement
Compute an improved policy \( \tilde{\mu} \) via one-step lookahead minimization, using \( J_\mu \) as the cost-to-go.
When this improvement step is performed online with a base policy, it is known as the Rollout Algorithm (a sketch of the full PI loop follows).
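A minimal sketch of the full PI loop on the illustrative 2-state model from above. Evaluation is done here by an exact linear solve, though as noted it could also be done by Monte Carlo simulation; all numerical values are assumptions.

```python
import numpy as np

alpha = 0.9
# Illustrative 2-state, 2-control model (assumed values).
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.9, 0.1]]])
g = np.array([[1.0, 2.0],
              [0.5, 3.0]])
n = g.shape[0]

def evaluate(mu):
    """Policy evaluation: solve the linear Bellman equation for mu exactly."""
    P_mu = P[mu, np.arange(n)]
    g_mu = g[np.arange(n), mu]
    return np.linalg.solve(np.eye(n) - alpha * P_mu, g_mu)

def improve(J):
    """Policy improvement: one-step lookahead minimization with cost-to-go J."""
    Q = g + alpha * np.einsum('uxy,y->xu', P, J)
    return Q.argmin(axis=1)

mu = np.zeros(n, dtype=int)                 # arbitrary starting policy
while True:
    J_mu = evaluate(mu)
    mu_new = improve(J_mu)
    if np.array_equal(mu_new, mu):          # no change: mu is optimal
        break
    mu = mu_new

print("optimal policy:", mu, "   cost:", J_mu)
```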
4. Approximation in Value Space
For large state spaces, we use an approximation \( \tilde{J} \) instead of \( J^* \). We then generate controls using Lookahead Minimization.
The Three Approximations
- (a) Terminal Cost Approximation \( \tilde{J} \): Replaces the optimal cost \( J^* \). In infinite horizon, we only need one such function (unlike \( N \) functions for finite horizon).
- (b) Expected Value Approximation: Replaces the \( E\{\cdot\} \) operation, for example by replacing stochastic disturbances with deterministic values (Certainty Equivalence). A sketch combining (a) and (b) appears after this list.
- (c) Minimization Approximation: Simplifies the \( \min_u \) operation, especially useful in multiagent systems where the control space is vast.
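Here is a minimal sketch of one-step lookahead with an approximate terminal cost \( \tilde{J} \), plus a certainty-equivalent variant that replaces the expectation by the most likely next state. The 2-state model and the values of \( \tilde{J} \) are illustrative assumptions.

```python
import numpy as np

alpha = 0.9
# Illustrative 2-state, 2-control model and terminal cost approximation (assumed).
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.9, 0.1]]])
g = np.array([[1.0, 2.0],
              [0.5, 3.0]])
J_tilde = np.array([5.0, 7.0])              # some off-line approximation of J*

def lookahead(x):
    """(a) + exact expectation: one-step lookahead with terminal cost J_tilde."""
    q = [g[x, u] + alpha * P[u, x] @ J_tilde for u in range(g.shape[1])]
    return int(np.argmin(q))

def lookahead_ce(x):
    """(b) Certainty equivalence: replace E{.} by the most likely next state."""
    q = [g[x, u] + alpha * J_tilde[P[u, x].argmax()] for u in range(g.shape[1])]
    return int(np.argmin(q))

print(lookahead(0), lookahead_ce(0))
```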
\( \ell \)-Step Lookahead
Motivation
Why look further ahead? By increasing \( \ell \), we might get better performance even with a less precise approximation \( \tilde{J} \).
Example: In AlphaZero (Chess/Go), a deep multistep lookahead is crucial for effective online play. A longer lookahead also tends to make the resulting policy more stable and less sensitive to errors in \( \tilde{J} \).
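A minimal sketch of \( \ell \)-step lookahead by brute-force recursion on the illustrative 2-state model (AlphaZero keeps this tractable with Monte Carlo tree search rather than exhaustive enumeration); all values are assumptions.

```python
import numpy as np

alpha = 0.9
# Illustrative 2-state, 2-control model and terminal cost approximation (assumed).
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.9, 0.1]]])
g = np.array([[1.0, 2.0],
              [0.5, 3.0]])
J_tilde = np.array([5.0, 7.0])

def q_value(x, u, ell):
    """Expected cost of applying u at x, acting optimally for ell-1 more steps,
    and terminating with J_tilde."""
    future = np.array([lookahead_cost(y, ell - 1) for y in range(len(J_tilde))])
    return g[x, u] + alpha * P[u, x] @ future

def lookahead_cost(x, ell):
    if ell == 0:
        return J_tilde[x]
    return min(q_value(x, u, ell) for u in range(g.shape[1]))

def ell_step_control(x, ell=3):
    """Control applied at x by ell-step lookahead minimization."""
    return min(range(g.shape[1]), key=lambda u: q_value(x, u, ell))

print(ell_step_control(0, ell=3))
```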
5. Constructing Approximations
Off-line Approximation
Compute \( \tilde{J} \) in advance (off-line), for example by solving a simpler version of the problem or by training on simulated data.
- Simplification: Reduce state space, remove constraints, or assume deterministic dynamics.
- Aggregation: Group states into clusters and solve the aggregate problem.
- Parametric Approximation: Train a neural network or other parametric architecture \( \tilde{J}(x, r) \), with parameter vector \( r \), using methods like \( Q \)-learning or TD methods (a minimal sketch follows this list).
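As one concrete instance of the off-line training mentioned above, here is a minimal tabular \( Q \)-learning sketch on the illustrative 2-state model; a neural-network version would replace the table with a parametric \( Q(x, u, r) \). The step size, exploration rate, and all model values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.9
# Illustrative 2-state, 2-control model (assumed values).
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.9, 0.1]]])
g = np.array([[1.0, 2.0],
              [0.5, 3.0]])
n, m = g.shape

Q = np.zeros((n, m))
x = 0
for t in range(50_000):
    # Epsilon-greedy exploration (assumed exploration scheme).
    u = rng.integers(m) if rng.random() < 0.1 else int(Q[x].argmin())
    y = rng.choice(n, p=P[u, x])            # simulated transition
    target = g[x, u] + alpha * Q[y].min()   # sampled Bellman target
    Q[x, u] += 0.05 * (target - Q[x, u])    # stochastic approximation update
    x = y

print("J_tilde(x) = min_u Q(x, u):", Q.min(axis=1))
```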
On-line Approximation
Compute values on-the-fly.
- Rollout: Run a "base policy" (heuristic) via simulation to estimate costs.
- MPC (Model Predictive Control): Solve a short-horizon optimization problem online and apply only the first control of the resulting sequence (see the sketch after this list).
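A minimal MPC sketch on an illustrative scalar system (the dynamics, cost, horizon, and discretized control set are all assumptions): at each step the disturbance is replaced by its nominal value, a short-horizon problem is solved by brute force, and only the first control is applied.

```python
import itertools
import numpy as np

# Illustrative scalar system x_{k+1} = x_k + u_k + w_k with quadratic stage cost
# (all of this model is assumed for the sketch).
U = np.linspace(-1.0, 1.0, 9)               # discretized control set
H = 3                                       # short lookahead horizon

def stage_cost(x, u):
    return x**2 + 0.1 * u**2

def mpc_control(x):
    """Brute-force the H-step deterministic problem starting at x."""
    best_u, best_cost = None, np.inf
    for seq in itertools.product(U, repeat=H):
        cost, y = 0.0, x
        for u in seq:
            cost += stage_cost(y, u)
            y = y + u                        # nominal (certainty-equivalent) dynamics
        cost += y**2                         # simple terminal cost
        if cost < best_cost:
            best_cost, best_u = cost, seq[0]
    return best_u

rng = np.random.default_rng(0)
x = 5.0
for k in range(10):
    x = x + mpc_control(x) + 0.1 * rng.standard_normal()   # true, noisy system
print("state after 10 MPC steps:", x)
```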
Feature Extraction
For parametric approximation, we often extract "features" from the state (e.g., height of columns in Tetris) and learn a function of these features.
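A minimal sketch of a Tetris-style feature extractor and a linear feature-based architecture \( \tilde{J}(x) = r'\phi(x) \). The specific features (column heights, a hole count, a constant) and the board encoding are illustrative assumptions, not the exact features used in published Tetris work.

```python
import numpy as np

def features(board):
    """Illustrative Tetris-style features: column heights, a hole count, and a
    constant term."""
    rows, cols = board.shape
    tops = [board[:, c].argmax() if board[:, c].any() else rows for c in range(cols)]
    heights = rows - np.array(tops)                       # height of each column
    holes = sum(int(((board[:, c] == 0) & (np.cumsum(board[:, c]) > 0)).sum())
                for c in range(cols))                     # empty cells under a block
    return np.concatenate([heights, [holes, 1.0]])

def J_tilde(board, r):
    """Linear feature-based architecture: J_tilde(x) = r' * phi(x)."""
    return float(r @ features(board))

board = np.zeros((20, 10), dtype=int)       # 20 rows x 10 columns, 1 = occupied
board[18:, :4] = 1                          # a small stack in the four left columns
r = np.ones(12)                             # weights: 10 heights + holes + constant
print(J_tilde(board, r))
```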
6. From Training to Play
Once we have trained a policy \( \mu \) or value function \( \tilde{J} \) off-line, how do we use it?
1. Given a Policy \( \mu \)
Use \( \tilde{J} = J_\mu \) as the terminal cost. Perform one-step or \( \ell \)-step lookahead (Rollout).
2. Given a Value Approx \( \tilde{J} \)
Use \( \tilde{J} \) directly as the terminal cost in lookahead minimization.
3. Truncated Rollout
Run rollout with the policy \( \mu \) for a limited number of steps, then use \( \tilde{J} \) to approximate the cost of the remaining steps (e.g., TD-Gammon).
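A minimal truncated-rollout sketch on the illustrative 2-state model: each candidate control is scored by one transition, followed by \( m \) steps of the base policy, followed by the discounted terminal approximation \( \tilde{J} \). The base policy, \( \tilde{J} \), and all numerical values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.9
# Illustrative 2-state, 2-control model, base policy, and J_tilde (all assumed).
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.9, 0.1]]])
g = np.array([[1.0, 2.0],
              [0.5, 3.0]])
J_tilde = np.array([5.0, 7.0])
base_policy = np.array([0, 0])              # heuristic base policy mu

def q_truncated(x, u, m=5, runs=200):
    """Average cost of: apply u, follow the base policy for m steps, then add
    the discounted terminal approximation J_tilde."""
    total = 0.0
    for _ in range(runs):
        cost, y = g[x, u], rng.choice(2, p=P[u, x])
        for k in range(1, m + 1):
            v = base_policy[y]
            cost += alpha**k * g[y, v]
            y = rng.choice(2, p=P[v, y])
        cost += alpha**(m + 1) * J_tilde[y]
        total += cost
    return total / runs

def truncated_rollout_control(x):
    return min(range(g.shape[1]), key=lambda u: q_truncated(x, u))

print(truncated_rollout_control(0))
```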
7. Test Your Understanding
1. Why do we use \( \ell \)-step lookahead instead of 1-step?
2. What is "Certainty Equivalence"?
🔍 Spot the Mistake!
Scenario: A student claims:
"Rollout is an off-line algorithm because it requires simulation."