โ˜ฏ๏ธ

Duality

Lagrange multipliers, geometric intuition, and the transformation from primal to dual.

๐Ÿ’ก

1. Motivation

The core idea of duality is to transform a difficult primal problem into a potentially easier dual problem. This concept has deep historical roots.

Joseph-Louis Lagrange

Introduced Lagrange multipliers for equality constrained problems, forming the basis of duality theory.

Werner Fenchel

Developed conjugate functions, providing a modern convex analysis perspective on duality.

๐Ÿ“

2. Geometric Intuition

Example Problem

Minimize \( f(x,y) = 8x^2 - 2y \) subject to \( x^2 + y^2 = 1 \).

Key Insight
  • The minimum occurs where a level curve of \( f \) just touches the constraint curve \( h(x,y) = x^2 + y^2 - 1 = 0 \).
  • At this touch point, their tangent lines are parallel.
  • Therefore, their normal vectors (gradients) must be parallel:
\( \nabla f(x,y) = \lambda \nabla h(x,y) \)

Here, \( \lambda \) is the Lagrange Multiplier.

Why Orthogonality?

Imagine moving along the constraint curve \( h(x,y)=0 \). The motion is along the tangent vector \( T \).

To change the value of \( f \), the motion must have a nonzero component along \( \nabla f \). So if \( \nabla f \cdot T \neq 0 \), we can still decrease \( f \) by moving along the curve in a suitable direction.

At an extremum (min or max), \( f \) cannot change to first order while we stay on the curve. Thus, the tangent direction \( T \) must be orthogonal to \( \nabla f \):

\[ \nabla f \cdot T = 0 \]

Conclusion

Since \( T \) is orthogonal to \( \nabla h \) (because \( h \) is constant along the curve) and now also to \( \nabla f \), the two gradients must be parallel (or anti-parallel). Equivalently, there is a scalar \( \lambda \) such that

\[ \nabla f + \lambda \nabla h = 0 \]
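
To make this concrete, here is a minimal SymPy sketch (SymPy assumed available) that applies the condition \( \nabla f + \lambda \nabla h = 0 \) to the example above, \( f(x,y) = 8x^2 - 2y \) with \( h(x,y) = x^2 + y^2 - 1 \):

```python
# A minimal SymPy sketch: apply the parallel-gradient condition to the
# example above (f = 8x^2 - 2y, h = x^2 + y^2 - 1).
import sympy as sp

x, y, lam = sp.symbols("x y lam", real=True)
f = 8 * x**2 - 2 * y
h = x**2 + y**2 - 1

# Stationarity (grad f + lam * grad h = 0) plus the constraint h = 0.
eqs = [sp.diff(f, x) + lam * sp.diff(h, x),
       sp.diff(f, y) + lam * sp.diff(h, y),
       h]
solutions = sp.solve(eqs, [x, y, lam], dict=True)

for s in solutions:
    print(s, " f =", f.subs(s))
# The candidate with the smallest objective value, here f = -2 at (0, 1)
# with lam = 1, is the constrained minimum.
```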
โš™๏ธ

3. The Lagrangian

Definition

For the problem minimize \( f(x) \) subject to \( h(x) = 0 \), we define the Lagrangian:

\[ \mathcal{L}(x, \lambda) = f(x) + \lambda h(x) \]

The condition \( \nabla \mathcal{L} = 0 \) encapsulates both the original constraint and the parallel gradient condition:

\[ \nabla_x \mathcal{L} = \nabla f + \lambda \nabla h = 0 \]
\[ \nabla_\lambda \mathcal{L} = h(x) = 0 \]
๐Ÿงฉ

4. Example: Non-convex Problem

Minimize \( f(x,y) = xy \)

Subject to \( h(x,y) = \frac{x^2}{8} + \frac{y^2}{2} - 1 = 0 \)

Lagrangian:

\[ \mathcal{L}(x,y,\lambda) = xy + \lambda(\frac{x^2}{8} + \frac{y^2}{2} - 1) \]

Setting \( \nabla \mathcal{L} = 0 \) gives the system:

1) \( y + \lambda x / 4 = 0 \)
2) \( x + \lambda y = 0 \)
3) \( x^2/8 + y^2/2 = 1 \)

Solving this system yields \( \lambda = \pm 2 \).

Extremal points: \( (2, -1) \) and \( (-2, 1) \) for \( \lambda = 2 \); \( (2, 1) \) and \( (-2, -1) \) for \( \lambda = -2 \).

Minimum value: \( -2 \), attained at \( (2, -1) \) and \( (-2, 1) \). (The other two points give the maximum value \( 2 \).)
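
The system above can also be checked mechanically; a short SymPy sketch (assuming SymPy is available):

```python
# A short SymPy check of the system (1)-(3) above.
import sympy as sp

x, y, lam = sp.symbols("x y lam", real=True)
L = x * y + lam * (x**2 / 8 + y**2 / 2 - 1)

# grad L = 0 in all three variables reproduces equations (1)-(3).
stationary = sp.solve([sp.diff(L, v) for v in (x, y, lam)],
                      [x, y, lam], dict=True)

for s in stationary:
    print(s, " f = xy =", (x * y).subs(s))
# Expected: lam = 2 at (2, -1) and (-2, 1) with xy = -2 (the minimum);
# lam = -2 at (2, 1) and (-2, -1) with xy = 2 (the maximum).
```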

๐Ÿ”—

5. Lagrange Dual Function

Definition

The Lagrange dual function \( g: \mathbb{R}^m \times \mathbb{R}^p \rightarrow \mathbb{R} \) is defined as the infimum of the Lagrangian over \( x \):

\[ g(\lambda, \nu) = \inf_{x \in \mathcal{D}} L(x, \lambda, \nu) = \inf_{x \in \mathcal{D}} \left( f_0(x) + \sum_{i=1}^m \lambda_i f_i(x) + \sum_{i=1}^p \nu_i h_i(x) \right) \]

Key Properties

  • Concavity: The dual function \( g(\lambda, \nu) \) is always concave, even if the original problem is not convex! (It is a pointwise infimum of functions that are affine in \( (\lambda, \nu) \).)
  • Lower Bound: For any \( \lambda \geq 0 \) and any \( \nu \), \( g(\lambda, \nu) \leq p^* \), where \( p^* \) denotes the optimal value of the primal problem.

Example: Least Squares

Minimize \( x^T x \) subject to \( Ax = b \).

Lagrangian: \( L(x, \nu) = x^T x + \nu^T (Ax - b) \).

Minimizing over \( x \): \( \nabla_x L = 2x + A^T \nu = 0 \implies x = -(1/2)A^T \nu \).

Substitute back to get \( g(\nu) = -(1/4)\nu^T A A^T \nu - b^T \nu \).
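
A quick NumPy sanity check of this example, using randomly generated data for \( A \) and \( b \) (variable names are ours): the dual optimum matches the primal least-norm optimum, and \( x^* \) is recovered from \( \nu^* \).

```python
# A numeric sanity check (NumPy) for the least-squares example above.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 6))   # 3 equality constraints, 6 variables
b = rng.standard_normal(3)

# Primal: minimum-norm solution of Ax = b.
x_star = A.T @ np.linalg.solve(A @ A.T, b)
p_star = x_star @ x_star

# Dual: maximize g(nu) = -(1/4) nu^T A A^T nu - b^T nu.
# Setting grad g = 0 gives nu* = -2 (A A^T)^{-1} b.
nu_star = -2 * np.linalg.solve(A @ A.T, b)
d_star = -0.25 * nu_star @ (A @ A.T) @ nu_star - b @ nu_star

print(p_star, d_star)                             # equal: strong duality
print(np.allclose(x_star, -0.5 * A.T @ nu_star))  # recover x* from nu*
```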

โš–๏ธ

6. Weak and Strong Duality

Weak Duality

Let \( d^* \) be the optimal value of the dual problem (maximize \( g(\lambda, \nu) \) subject to \( \lambda \geq 0 \)).

\[ d^* \leq p^* \]

This always holds. The difference \( p^* - d^* \) is the duality gap.
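
Weak duality has a one-line proof. For any primal-feasible \( \tilde{x} \) (so \( f_i(\tilde{x}) \leq 0 \) and \( h_i(\tilde{x}) = 0 \)) and any \( \lambda \geq 0 \),

\[ g(\lambda, \nu) \leq L(\tilde{x}, \lambda, \nu) = f_0(\tilde{x}) + \sum_{i=1}^m \lambda_i f_i(\tilde{x}) + \sum_{i=1}^p \nu_i h_i(\tilde{x}) \leq f_0(\tilde{x}) \]

Taking the infimum over all feasible \( \tilde{x} \) gives \( g(\lambda, \nu) \leq p^* \), and maximizing the left side gives \( d^* \leq p^* \).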

Strong Duality

When the gap is zero:

\[ d^* = p^* \]

Usually holds for convex problems under certain conditions (like Slater's).

Slater's Condition

If the primal problem is convex and there exists a strictly feasible point \( x \) (one with \( f_i(x) < 0 \) for every inequality constraint and \( h_i(x) = 0 \) for every equality constraint), then strong duality holds.

๐Ÿ”‘

7. KKT Conditions

For differentiable functions, if strong duality holds, the optimal primal and dual points must satisfy the Karush-Kuhn-Tucker (KKT) conditions:

1. Primal Feasibility: \( f_i(x^*) \leq 0, \quad h_i(x^*) = 0 \)
2. Dual Feasibility: \( \lambda^* \geq 0 \)
3. Complementary Slackness: \( \lambda_i^* f_i(x^*) = 0 \)
4. Stationarity: \( \nabla f_0(x^*) + \sum \lambda_i^* \nabla f_i(x^*) + \sum \nu_i^* \nabla h_i(x^*) = 0 \)

Theorem: For convex problems, the KKT conditions are also sufficient for optimality. If you find points satisfying them, you have found the solution!
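
As an illustration, here is a tiny Python sketch that checks all four conditions on a made-up one-dimensional problem, minimize \( x^2 \) subject to \( 1 - x \leq 0 \), whose solution is \( x^* = 1 \), \( \lambda^* = 2 \):

```python
# A tiny sketch (assumed example): verify the four KKT conditions for
# minimize x^2 subject to 1 - x <= 0, with solution x* = 1, lam* = 2.
x_star, lam_star = 1.0, 2.0

f1 = 1.0 - x_star         # inequality constraint value f_1(x*)
grad_f0 = 2.0 * x_star    # d/dx of x^2
grad_f1 = -1.0            # d/dx of (1 - x)

checks = {
    "primal feasibility": f1 <= 0,
    "dual feasibility": lam_star >= 0,
    "complementary slackness": abs(lam_star * f1) < 1e-9,
    "stationarity": abs(grad_f0 + lam_star * grad_f1) < 1e-9,
}
print(checks)  # all True; the problem is convex, so x* = 1 is optimal
```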

8. Solving Primal via Dual

If strong duality holds, we can solve the (often easier) dual problem to find \( \lambda^*, \nu^* \). Then, provided the Lagrangian has a unique minimizer over \( x \), we recover the primal solution:

\[ x^* = \text{argmin}_x L(x, \lambda^*, \nu^*) \]
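
For instance (a sketch assuming NumPy and SciPy are available): for minimize \( x^2 + y^2 \) subject to \( x + y = 1 \), minimizing the Lagrangian gives \( g(\nu) = -\nu^2/2 - \nu \), which is maximized at \( \nu^* = -1 \); minimizing \( L(\cdot, \nu^*) \) then recovers \( x^* = y^* = 1/2 \).

```python
# A small sketch: recover the primal solution from the dual optimum for
# minimize x^2 + y^2 subject to x + y = 1.
import numpy as np
from scipy.optimize import minimize

nu_star = -1.0  # argmax of the dual g(nu) = -nu^2/2 - nu

# Minimize the Lagrangian L(x, y, nu*) over the primal variables.
L = lambda z: z[0]**2 + z[1]**2 + nu_star * (z[0] + z[1] - 1)
res = minimize(L, x0=np.zeros(2))
print(res.x)  # -> [0.5, 0.5], which is feasible, hence primal optimal
```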
๐Ÿงช

9. Duality Examples

Example 1: Equality Constrained QP

Consider the problem:

\[ \begin{array}{ll} \text{minimize} & \frac{1}{2} x^T P x + q^T x + r \\ \text{subject to} & Ax = b \end{array} \]

where \( P \in S^n_+ \) (symmetric positive semidefinite). The KKT conditions are:

  • Primal Feasibility: \( Ax^* = b \)
  • Dual Feasibility: None (no inequality constraints)
  • Complementary Slackness: None
  • Stationarity: \( Px^* + q + A^T \nu^* = 0 \)
This leads to the KKT system: \[ \begin{bmatrix} P & A^T \\ A & 0 \end{bmatrix} \begin{bmatrix} x^* \\ \nu^* \end{bmatrix} = \begin{bmatrix} -q \\ b \end{bmatrix} \]
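
A minimal NumPy sketch that assembles and solves this block system for a small randomly generated QP (the data and names are illustrative):

```python
# Assemble and solve the KKT system for an equality-constrained QP.
import numpy as np

rng = np.random.default_rng(1)
n, m = 4, 2
M = rng.standard_normal((n, n))
P = M @ M.T + np.eye(n)            # symmetric positive definite
q = rng.standard_normal(n)
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# [[P, A^T], [A, 0]] [x; nu] = [-q; b]
KKT = np.block([[P, A.T], [A, np.zeros((m, m))]])
rhs = np.concatenate([-q, b])
sol = np.linalg.solve(KKT, rhs)
x_star, nu_star = sol[:n], sol[n:]

print(np.allclose(A @ x_star, b))                       # primal feasibility
print(np.allclose(P @ x_star + q + A.T @ nu_star, 0))   # stationarity
```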

Example 2: Water-Filling Problem

Maximize the total communication rate by allocating power \( x_i \) to channels with noise levels \( \alpha_i \), subject to a unit total power budget:

\[ \begin{array}{ll} \text{minimize} & -\sum_{i=1}^n \log(\alpha_i + x_i) \\ \text{subject to} & x \geq 0, \quad \mathbf{1}^T x = 1 \end{array} \]

KKT Conditions:

  • \( x^* \geq 0, \quad \sum x_i^* = 1 \)
  • \( \lambda^* \geq 0 \) (for \( -x \leq 0 \))
  • \( \lambda_i^* x_i^* = 0 \)
  • \( -1/(\alpha_i + x_i^*) - \lambda_i^* + \nu^* = 0 \)
Solution: \( x_i^* = \max(0, 1/\nu^* - \alpha_i) \), where \( \nu^* \) is chosen so the budget \( \sum_i x_i^* = 1 \) is met. This is called water-filling because we pour power into each channel up to a common level \( 1/\nu^* \).
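
Since \( \sum_i \max(0, w - \alpha_i) \) is nondecreasing in the water level \( w = 1/\nu^* \), that level can be found by bisection. A minimal NumPy sketch with made-up noise levels:

```python
# Water-filling by bisection: find the level w = 1/nu* such that
# sum_i max(0, w - alpha_i) = 1 (the alphas below are example data).
import numpy as np

alpha = np.array([0.8, 1.0, 1.5, 3.0])   # channel noise levels

def total_power(w):
    return np.maximum(0.0, w - alpha).sum()

lo, hi = alpha.min(), alpha.max() + 1.0  # total_power(lo) = 0 < 1 <= total_power(hi)
for _ in range(100):                     # bisection on the water level
    w = 0.5 * (lo + hi)
    lo, hi = (w, hi) if total_power(w) < 1.0 else (lo, w)

x = np.maximum(0.0, w - alpha)
print(w, x, x.sum())  # water level, optimal powers, total power = 1
```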
โœ๏ธ

10. Test Your Understanding

1. The Lagrange dual function \( g(\lambda, \nu) \) is always: concave, even when the primal problem is not convex.

2. Slater's condition guarantees: strong duality, \( d^* = p^* \), for convex problems.

3. Complementary slackness \( \lambda_i f_i(x) = 0 \) implies: whenever \( \lambda_i > 0 \) the \( i \)-th constraint is active (\( f_i(x) = 0 \)), and whenever \( f_i(x) < 0 \) we must have \( \lambda_i = 0 \).

๐Ÿ” Spot the Mistake!

Scenario 1: A student claims:

"If \( \nabla f(x) = 0 \), then \( x \) must be the solution to the constrained optimization problem minimize \( f(x) \) subject to \( h(x)=0 \)."

Answer: False. The gradient does not have to vanish at a constrained optimum; it only has to be parallel to \( \nabla h \).

Scenario 2: Another student claims:

"For the problem \(\min x^2 + y^2\) s.t. \(x+y=1\), the Lagrange multiplier \(\lambda\) must be positive."

Answer: False. Multipliers for equality constraints are unconstrained in sign (only inequality multipliers must satisfy \( \lambda \geq 0 \)); for this particular problem the multiplier is negative, as the sketch in Section 8 shows.
โ†

Previous Chapter

Convex Optimization

Next Chapter

Recommender Systems

โ†’