Better Iterative Methods & Google

From Conjugate Gradients to Markov Chains & Monte Carlo

Part 1: Optimization

The Quadratic Form

We define a quadratic form $f(x)$ as:

$$ f(x) = \frac{1}{2} x^T A x - b^T x + c $$

Example: Given $A= \begin{bmatrix}3 & 2\\ 2& 6\end{bmatrix}, \quad b= \begin{bmatrix}2\\-8\end{bmatrix}, \quad c = 0$

% MATLAB Visualization
syms x y
X = [x; y];
A = [3 2; 2 6];
b = [2; -8];
f = (1/2) * transpose(X) * A * X - transpose(b) * X;
ezsurfc(f)
hold on;
ezplot('(1/2) * (2 - 3*x)');
ezplot('(1/6) * (-8 - 2*x)');

Gradient & Minima

The gradient $\nabla f(x)$ points uphill (in the direction of steepest increase). Differentiating $f$ with respect to each component $x_i$ yields:

$$ \nabla f = \frac{1}{2} A^T x + \frac{1}{2} A x - b $$

If $A$ is symmetric ($A^T = A$), this simplifies to:

$$ \nabla f = Ax - b $$

Thus, at the minimum where $\nabla f = 0$, we have $Ax = b$.

  • Positive definite $A$: concave up (bowl); unique minimum.
  • Indefinite $A$: saddle point (e.g., $x_1^2 - x_2^2$).
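
As a check on the running example, setting $\nabla f = Ax - b = 0$ gives a $2 \times 2$ linear system whose solution is the unique minimizer (the intersection of the two lines plotted with ezplot above):

$$ \begin{bmatrix}3 & 2\\ 2 & 6\end{bmatrix} x = \begin{bmatrix}2\\ -8\end{bmatrix} \quad\Rightarrow\quad x = \begin{bmatrix}2\\ -2\end{bmatrix} $$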

Steepest Descent

Steepest descent searches along the residual $r_k = b - Ax_k$, which is the negative gradient $-\nabla f(x_k)$. We want to find the step size $\alpha$ that minimizes $f(x_{k+1})$, where $x_{k+1} = x_k + \alpha r_k$.

$$ \frac{d}{d\alpha}f(x_{k+1}) = \nabla f(x_{k+1})^T \frac{d}{d\alpha}(x_k + \alpha r_k) = -r_{k+1}^T r_k = 0 $$

This implies we pick $\alpha$ such that the new residual is orthogonal to the current search direction. Substituting $r_{k+1} = b - Ax_{k+1} = r_k - \alpha A r_k$ and solving $r_{k+1}^T r_k = 0$ for $\alpha$ gives the optimal step size:

$$ \alpha = \frac{r_k^T r_k}{r_k^T A r_k} $$
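
A minimal MATLAB sketch of steepest descent on the running example (the initial guess, tolerance, and iteration cap are illustrative choices, not taken from the notes):

% Steepest descent for Ax = b with A symmetric positive definite
A = [3 2; 2 6]; b = [2; -8];        % running example
x = [0; 0];                         % illustrative initial guess
for k = 1:100                       % illustrative iteration cap
    r = b - A*x;                    % residual = -gradient of f
    if norm(r) < 1e-10, break; end
    alpha = (r'*r) / (r'*(A*r));    % optimal step along r
    x = x + alpha*r;                % move downhill
end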

Conjugate Gradients (CG)

Steepest descent can exhibit "staggered" (zig-zagging) convergence, repeatedly searching in nearly the same directions. CG improves on this by choosing search directions $s_k$ that are $A$-orthogonal (conjugate) to all previous directions. In exact arithmetic it converges in at most $n$ steps.

% Conjugate Gradient algorithm (assumes A symmetric positive definite;
% A, b, x0, and maxit are defined beforehand)
x = x0;                             % initial guess
r = b - A*x;                        % initial residual
s = r;                              % first search direction
for k = 0:maxit-1
    As    = A*s;
    alpha = (r'*r) / (s'*As);       % step size
    x     = x + alpha*s;            % update x
    rnew  = r - alpha*As;           % update residual
    beta  = (rnew'*rnew) / (r'*r);  % Gram-Schmidt-like correction
    s     = rnew + beta*s;          % new conjugate direction
    r     = rnew;
end

Convergence & Preconditioning

Convergence Bound

At each step, CG minimizes the $A$-norm of the error $e_k = x_k - x$ (where $x$ is the exact solution) over the space spanned by the search directions so far. The rate depends on the condition number $\kappa = \lambda_{max}/\lambda_{min}$:

$$ \|e_k\|_A \leq 2 \left( \frac{\sqrt{\kappa} -1}{\sqrt{\kappa} + 1} \right)^k \|e_0 \|_A $$
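
As an illustration with the running example $A = \begin{bmatrix}3 & 2\\ 2 & 6\end{bmatrix}$, whose eigenvalues are $\lambda_{max} = 7$ and $\lambda_{min} = 2$:

$$ \kappa = \frac{7}{2} = 3.5, \qquad \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \approx \frac{1.87 - 1}{1.87 + 1} \approx 0.30, $$

so the error bound decays like $(0.30)^k$.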

Preconditioning

We transform $Ax=b$ with a matrix $Q$ so $Q^{-1}A$ is closer to the identity. For CG, we generally use Symmetric Preconditioning:

$$ \left(Q^{-1/2} A Q^{-1/2}\right)\left(Q^{1/2} x\right) = Q^{-1/2} b $$
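
As a small illustration (MATLAB's built-in pcg implements preconditioned CG; the Jacobi/diagonal preconditioner and the tolerance below are just one simple choice, not taken from the notes):

% Preconditioned CG on the running example with a Jacobi preconditioner
A = [3 2; 2 6]; b = [2; -8];
Q = diag(diag(A));              % simple preconditioner: the diagonal of A
x = pcg(A, b, 1e-10, 100, Q)    % MATLAB's built-in preconditioned CG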

🧠 Quick Check (Win $10)

Why is Symmetric Preconditioning preferred for Conjugate Gradients?

Part 2: Google & Markov Chains

Randomly Walking with Google

Start at a webpage, randomly select a link, and repeat.

  • Dead End: Page with no outgoing links.
  • Cycle (Markov Chain): You end up where you began.
  • PageRank: The limiting probability that an infinitely dedicated random surfer visits a page.

Markov Chain: The Broken Machine

A machine has 3 states: Working, Temporarily Broken, and Permanently Broken.

Transition Matrix P

$$ P = \begin{pmatrix} 0.9 & 0.6 & 0 \\ 0.095 & 0.4 & 0 \\ 0.005 & 0 & 1 \end{pmatrix} $$

Question: If the machine is working on Day 1 ($v^{(0)} = [1, 0, 0]^T$), what is the probability it is still functional after a year (365 days)?
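
One step of the chain multiplies the current distribution by $P$, so after a single day the distribution is simply the first column of $P$:

$$ v^{(1)} = P v^{(0)} = \begin{pmatrix} 0.9 \\ 0.095 \\ 0.005 \end{pmatrix} $$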

We compute $P^{365}$:

$$ P^{365} \approx \begin{pmatrix} 0.1779 & 0.1792 & 0 \\ 0.0284 & 0.0286 & 0 \\ 0.7937 & 0.7922 & 1 \end{pmatrix} $$

The first column gives the state probabilities after 365 days when starting from State 1 (Working). The probability of being Permanently Broken (Row 3) is 0.7937, so the probability that the machine is still functional (Working or Temporarily Broken) is $0.1779 + 0.0284 \approx 0.206$.

State 3 is an absorbing state. The steady state solution is $v = [0, 0, 1]^T$.
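
A minimal MATLAB check of this computation (the matrix, starting vector, and horizon are taken from the example above):

% Evolve the broken-machine chain for one year
P = [0.9 0.6 0; 0.095 0.4 0; 0.005 0 1];   % column-stochastic transition matrix
v = [1; 0; 0];                             % Day 1: machine is Working
v365 = P^365 * v;                          % distribution after 365 transitions
p_functional = v365(1) + v365(2)           % P(Working or Temporarily Broken)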

The Google Matrix

Let $G$ be the $n \times n$ connectivity matrix, with $G_{ij} = 1$ if page $j$ links to page $i$. Let $c_j$ be the column sums (the out-degree of page $j$, i.e., the number of links leaving it). We define matrix $A$:

$$ A_{ij} = p \frac{G_{ij}}{c_j} + \delta $$

Here $p$ is the damping factor (Moler uses $p = 0.85$) and $\delta = (1-p)/n$ models the surfer jumping to a random page. The PageRank vector $x$ is the eigenvector of $A$ with eigenvalue 1, i.e., $x = Ax$. With $D = \operatorname{diag}(1/c_1, \dots, 1/c_n)$ and $e$ the vector of all ones, this can be written as the linear system (after which $x$ is rescaled so its entries sum to 1):

$$ (I - pGD)x = e $$

% MATLAB: Moler's Surfer
[U, G] = surfer('http://www.uiuc.edu', 20);
spy(G);            % view sparsity structure of the connectivity matrix
pagerank(U, G);
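
To make the linear system concrete, here is a tiny hypothetical four-page web (the link structure and variable names are illustrative, not from the notes):

% PageRank of a tiny hand-made web via (I - p*G*D) x = e
n = 4;
G = [0 0 1 0;                 % G(i,j) = 1 if page j links to page i
     1 0 0 1;
     1 1 0 0;
     0 1 1 0];
p = 0.85;                     % damping factor
c = sum(G, 1);                % out-degrees (column sums)
D = diag(1 ./ c);             % D(j,j) = 1/c(j)
e = ones(n, 1);
x = (eye(n) - p*G*D) \ e;     % solve the linear system
x = x / sum(x)                % normalize so the ranks sum to 1
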
Part 3: Integration & Statistics

Integration as Averaging

Integration can be thought of as averaging a function over a domain.

1. Average Daily Snowfall (1D)

Domain: Year (Time)

$$ \text{Average} = \frac{\int_{day \in year} s(day) \, d(day)}{\text{Length of Year}} $$

2. Average Snowfall in Illinois (2D)

Domain: Surface Area (Space)

$$ \text{Average} = \frac{\int_{loc \in IL} s(loc) \, d(loc)}{\text{Area of IL}} $$

3. Average Snowfall in Illinois over the Whole Year (3D)

Domain: Space $\times$ Time

$$ \text{Average} = \frac{\int_{day}\int_{loc} s(loc, day) \, d(loc) \, d(day)}{\text{Area of IL} \cdot \text{Length of Year}} $$
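
In discrete form, such an average is just the mean of the sampled values; a short sketch of example 1 with hypothetical daily data:

% Discrete version of example 1: average daily snowfall over a year
s = max(0, randn(365, 1));     % hypothetical daily snowfall values
avg_snowfall = sum(s) / 365    % same as mean(s)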

Discrete Random Variables

Expected Value ($E[x]$)

$$ E[x] = \sum_{j=1}^{n} x_j p_j $$

Example (Die Roll): $(1+2+3+4+5+6)/6 = 3.5$

Variance ($\sigma^2$)

$$ \sigma^2[x] = E[(x - E[x])^2] = E[x^2] - E[x]^2 $$

Measure of variation from the average.
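
Continuing the die-roll example:

$$ E[x^2] = \frac{1^2 + 2^2 + \cdots + 6^2}{6} = \frac{91}{6}, \qquad \sigma^2[x] = \frac{91}{6} - (3.5)^2 = \frac{35}{12} \approx 2.92 $$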

Estimation & Law of Large Numbers

We can estimate $E[x]$ by choosing $N$ random values and averaging them:

$$ E[x] \approx \frac{1}{N} \sum_{j=1}^{N} x_j $$
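
A short MATLAB illustration for the die roll (the sample size $N$ here is an arbitrary choice):

% Estimate E[x] for a fair die by averaging N random rolls
N = 1e5;                  % illustrative sample size
rolls = randi(6, N, 1);   % N independent rolls of a fair die
estimate = mean(rolls)    % should be close to 3.5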

Law of Large Numbers

As $N \to \infty$, the estimate converges to the correct value with probability 1.

$$ P(E[x] = \lim_{N\to\infty} \frac{1}{N} \sum_{i=1}^{N} x_i) = 1 $$

Continuous Extensions

For complex geometries (like maps), we extend this to integration against a density function $\rho(x)$; sampling the points $x_i$ from the density $\rho$ gives:

$$ E[g(x)] = \int_{x \in D} g(x)\rho(x) \, dA_x \approx \frac{1}{N} \sum_{i=1}^{N} g(x_i) $$
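
An illustrative Monte Carlo sketch (the region, integrand, and $N$ are choices made here, not taken from the notes): sample uniformly over the unit disk and average $g$:

% Monte Carlo average of g over a 2D region D (here: the unit disk)
N = 1e5;                                   % illustrative sample size
pts = 2*rand(N, 2) - 1;                    % uniform points in the bounding square
inD = sum(pts.^2, 2) <= 1;                 % keep the points that land in D
g = @(x, y) x.^2 + y.^2;                   % example integrand
avg_g = mean(g(pts(inD,1), pts(inD,2)))    % estimates E[g] over D (exact: 1/2)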

🧠 Final Challenge (Win $20)

Based on the Law of Large Numbers, how does the error of our estimate behave as we increase $N$?