Optimization

Learning objectives

Set up a problem as an optimization problem
Use a method to find an approximation of the minimum
Identify challenges in optimization

Minima of a function

Consider a function $f : S \to \mathbb{R}$, with $S \subset \mathbb{R}^n$. The point $\boldsymbol{x}^* \in S$ is called the minimizer or minimum of $f$ if $f(\boldsymbol{x}^*) \leq f(\boldsymbol{x}) \, \forall \boldsymbol{x} \in S$.

For the rest of this topic we try to find the minimizer of a function, as one can easily find the maximizer of a function $f$ by finding the minimizer of $-f$.
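As a small illustration of this trick, the sketch below maximizes a hypothetical function $f(x) = -(x - 2)^2 + 3$ (chosen only for this example) by handing $-f$ to `scipy.optimize.minimize_scalar`.

```python
import scipy.optimize as opt

def f(x):
    # Hypothetical function to maximize; its maximum value 3 is attained at x = 2
    return -(x - 2.0)**2 + 3.0

def neg_f(x):
    # Minimizing -f is equivalent to maximizing f
    return -f(x)

result = opt.minimize_scalar(neg_f)
x_max = result.x      # maximizer of f
f_max = f(x_max)      # maximum value of f
print(x_max, f_max)
```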

Local vs. Global Minima

Consider a domain $S \subset \mathbb{R}^n$, and $f : S \to \mathbb{R}$.

Local Minima: $\boldsymbol{x}^*$ is a local minimum if $f(\boldsymbol{x}^*) \leq f(\boldsymbol{x})$ for all feasible $\boldsymbol{x}$ in some neighborhood of $\boldsymbol{x}^*$.
Global Minima: $\boldsymbol{x}^*$ is a global minimum if $f(\boldsymbol{x}^*) \leq f(\boldsymbol{x})$ for all $\boldsymbol{x} \in S$.

Note that it is easier to find the local minimum than the global minimum. Given a function, finding whether a global minimum exists over the domain is in itself a non-trivial problem. Hence, we will limit ourselves to finding the local minima of the function.

Criteria for 1-D Local Minima

In the case of 1-D optimization, we can tell if a point $x^* \in S$ is a local minimum by considering the values of the derivatives. Specifically,

(First-order) Necessary condition: $f'(x^*) = 0$
(Second-order) Sufficient condition: $f'(x^*) = 0$ and $f''(x^*) > 0$.

Example: Consider the function $f(x) = x^3 - 6x^2 + 9x - 6$. The first and second derivatives are $f'(x) = 3x^2 - 12x + 9$ and $f''(x) = 6x - 12$.

The critical points are tabulated as:

$x$	$f'(x)$	$f''(x)$	Characteristic
3	0	$6$	Local Minimum
1	0	$-6$	Local Maximum

Looking at the table, we see that $x = 3$ satisfies the sufficient condition for being a local minimum.
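The classification in this example can be checked numerically. The following is a minimal sketch, assuming NumPy's `poly1d` is used to represent the polynomial; the helper name `classify_critical_points` is hypothetical and introduced only for illustration.

```python
import numpy as np

# Coefficients of f(x) = x^3 - 6x^2 + 9x - 6, highest degree first
f = np.poly1d([1.0, -6.0, 9.0, -6.0])
df = f.deriv(1)    # f'(x)  = 3x^2 - 12x + 9
d2f = f.deriv(2)   # f''(x) = 6x - 12

def classify_critical_points():
    """Return (x, label) pairs for the real roots of f', using the sign of f''."""
    results = []
    for x in sorted(df.roots.real):
        curvature = d2f(x)
        if curvature > 0:
            label = "local minimum"
        elif curvature < 0:
            label = "local maximum"
        else:
            label = "inconclusive"   # second-order test says nothing when f'' = 0
        results.append((float(x), label))
    return results

for x, label in classify_critical_points():
    print(f"x = {x:.1f}: f''(x) = {d2f(x):.1f}, {label}")
```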

Criteria for N-D Local Minima

Extending the 1-D criteria to $n$ dimensions, we can tell if $\boldsymbol{x}^*$ is a local minimum by the following conditions:

Necessary condition: the gradient $\nabla f(\boldsymbol{x}^*) = \boldsymbol{0}$
Sufficient condition: the gradient $\nabla f(\boldsymbol{x}^*) = \boldsymbol{0}$ and the Hessian matrix ${\bf H}_f(\boldsymbol{x}^*)$ is positive definite.
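One way to test the sufficient condition in practice is to look at the eigenvalues of the Hessian at a point where the gradient vanishes. The sketch below assumes a symmetric Hessian and uses `numpy.linalg.eigvalsh`; the function name `classify_critical_point` and the example function are hypothetical.

```python
import numpy as np

def classify_critical_point(hessian_matrix):
    """Classify a point with zero gradient from its symmetric Hessian."""
    eigenvalues = np.linalg.eigvalsh(hessian_matrix)
    if np.all(eigenvalues > 0):
        return "local minimum"      # positive definite
    if np.all(eigenvalues < 0):
        return "local maximum"      # negative definite
    if np.any(eigenvalues > 0) and np.any(eigenvalues < 0):
        return "saddle point"       # indefinite
    return "inconclusive"           # semidefinite: the second-order test fails

# Hypothetical example: f(x, y) = x^2 + x*y + 2*y^2 has gradient
# (2x + y, x + 4y), which is zero at the origin, and a constant Hessian.
H = np.array([[2.0, 1.0],
              [1.0, 4.0]])
print(classify_critical_point(H))
```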

Definition of Gradient and Hessian Matrix

Given $f : \mathbb{R}^n \to \mathbb{R}$ we define the gradient function $\nabla f : \mathbb{R}^n \to \mathbb{R}^n$ as:

$$\nabla f(\boldsymbol{x}) = \begin{pmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{pmatrix}$$

Given $f : \mathbb{R}^n \to \mathbb{R}$ we define the Hessian matrix ${\bf H}_f : \mathbb{R}^n \to \mathbb{R}^{n \times n}$ as:

$${\bf H}_f(\boldsymbol{x}) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \ldots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \ldots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \ldots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}$$
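To make these definitions concrete, the sketch below writes out the gradient and Hessian of a hypothetical function $f(x_1, x_2) = x_1^2 x_2 + 3 x_2^2$ by hand and compares them with central finite differences. The helpers `fd_gradient` and `fd_hessian` are illustrative names, not part of any library.

```python
import numpy as np

def f(x):
    # Hypothetical objective: f(x1, x2) = x1^2 * x2 + 3 * x2^2
    return x[0]**2 * x[1] + 3.0 * x[1]**2

def gradient(x):
    # Vector of first partial derivatives
    return np.array([2.0 * x[0] * x[1],
                     x[0]**2 + 6.0 * x[1]])

def hessian(x):
    # Matrix of second partial derivatives (symmetric here)
    return np.array([[2.0 * x[1], 2.0 * x[0]],
                     [2.0 * x[0], 6.0]])

def fd_gradient(func, x, h=1e-6):
    """Central-difference approximation of the gradient of func at x."""
    grad = np.zeros(x.size)
    for i in range(x.size):
        step = np.zeros(x.size)
        step[i] = h
        grad[i] = (func(x + step) - func(x - step)) / (2.0 * h)
    return grad

def fd_hessian(grad_func, x, h=1e-6):
    """Central-difference approximation of the Hessian from a gradient function."""
    n = x.size
    H = np.zeros((n, n))
    for j in range(n):
        step = np.zeros(n)
        step[j] = h
        # Column j holds the derivative of the gradient with respect to x_j
        H[:, j] = (grad_func(x + step) - grad_func(x - step)) / (2.0 * h)
    return H

x = np.array([1.0, 2.0])
print(np.allclose(gradient(x), fd_gradient(f, x), atol=1e-5))
print(np.allclose(hessian(x), fd_hessian(gradient, x), atol=1e-5))
```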

Unimodal function

A 1-dimensional function $f : S \to \mathbb{R}$ is said to be unimodal if for all $x_1, x_2 \in S$, with $x_1 < x_2$ and $x^*$ as the minimizer:

If $x_2 < x^*$, then $f(x_1) > f(x_2)$
If $x^* < x_1$, then $f(x_1) < f(x_2)$

To simplify the discussion, we will assume our objective function is unimodal, as this guarantees a unique solution to the minimization problem.

Optimization Techniques in 1-D

Newton’s Method

We know that in order to find a local minimum we need to find the root of the derivative of the function. Inspired by Newton’s method for root-finding, we define our iterative scheme as follows: $x_{k+1} = x_k - \frac{f'(x_k)}{f''(x_k)}$. This is equivalent to using Newton’s method for root finding to solve $f'(x) = 0$, so the method converges quadratically, provided that $x_k$ is sufficiently close to the local minimum.

For Newton’s method for optimization in 1-D, we evaluate $f'(x_k)$ and $f''(x_k)$, so it requires 2 function evaluations per iteration.
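The iteration can be sketched in a few lines. The example below applies it to the function $f(x) = x^3 - 6x^2 + 9x - 6$ from the earlier example; the name `newton_1d`, the tolerance, and the iteration cap are assumptions made for this illustration.

```python
def newton_1d(df, d2f, x0, tol=1e-10, max_iter=50):
    """Newton's method for 1-D optimization: iterate x <- x - f'(x) / f''(x)."""
    x = x0
    for _ in range(max_iter):
        step = df(x) / d2f(x)   # one evaluation of f' and one of f''
        x = x - step
        if abs(step) < tol:
            break
    return x

def df(x):
    # f'(x) for f(x) = x^3 - 6x^2 + 9x - 6
    return 3.0 * x**2 - 12.0 * x + 9.0

def d2f(x):
    # f''(x) for the same function
    return 6.0 * x - 12.0

x_min = newton_1d(df, d2f, x0=4.0)
print(x_min)
```

Starting from $x_0 = 4$ the iterates approach the local minimum at $x = 3$. Starting from $x_0 = 0$ the same update approaches $x = 1$, the local maximum, because the iteration only looks for a root of $f'$; the sign of $f''$ still has to be checked.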

Golden Section Search

Inspired by bisection for root finding, we define an interval reduction method for finding the minimum of a function. As in bisection where we reduce the interval such that the reduced interval always contains the root, in Golden Section Search we reduce our interval such that it always has a unique minimizer in our domain.

Algorithm to find the minima of $f : [a, b] \to \mathbb{R}$:

Our goal is to shrink the interval $[a, b]$ using two interior points $x_1 < x_2$ such that:

If $f(x_1) > f(x_2)$ our new interval would be $(x_1, b)$
If $f(x_1) \leq f(x_2)$ our new interval would be $(a, x_2)$

We select $x_1, x_2$ as interior points of $[a, b]$ by choosing a $0 \leq \tau \leq 1$ and setting:

$$x_1 = a + (1 - \tau)(b - a)$$

$$x_2 = a + \tau (b - a)$$

The challenging part is to select a $\tau$ such that we ensure symmetry, i.e. after each iteration we reduce the interval by the same factor, which gives us the identity $\tau^2 = 1 - \tau$. Hence,

$$\tau = \frac{\sqrt{5} - 1}{2}.$$

As the interval gets reduced by a fixed factor each time, it can be observed that the method is linearly convergent. The number $\frac{\sqrt{5} - 1}{2}$ is the inverse of the “Golden-Ratio” and hence this algorithm is named Golden Section Search.

In golden section search, we do not need to evaluate any derivatives of $f(x)$. At each iteration we need $f(x_1)$ and $f(x_2)$, but one of $x_1$ or $x_2$ will be the same as the previous iteration, so it only requires 1 additional function evaluation per iteration (after the first iteration).
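A minimal sketch of the procedure follows, assuming $f$ is unimodal on $[a, b]$. The function name `golden_section_search`, the stopping tolerance, and the choice to return the interval midpoint are assumptions for this illustration. Note how each pass through the loop reuses one interior point and calls `f` only once.

```python
import math

def golden_section_search(f, a, b, tol=1e-6):
    """Shrink [a, b] around the minimizer of a unimodal f until b - a <= tol."""
    tau = (math.sqrt(5.0) - 1.0) / 2.0
    x1 = a + (1.0 - tau) * (b - a)
    x2 = a + tau * (b - a)
    f1, f2 = f(x1), f(x2)
    while (b - a) > tol:
        if f1 > f2:
            # Keep (x1, b): the old x2 becomes the new x1
            a = x1
            x1, f1 = x2, f2
            x2 = a + tau * (b - a)
            f2 = f(x2)              # the only new evaluation
        else:
            # Keep (a, x2): the old x1 becomes the new x2
            b = x2
            x2, f2 = x1, f1
            x1 = a + (1.0 - tau) * (b - a)
            f1 = f(x1)              # the only new evaluation
    return 0.5 * (a + b)

def f(x):
    # Example from the 1-D criteria section; unimodal on [2, 5]
    return x**3 - 6.0 * x**2 + 9.0 * x - 6.0

x_min = golden_section_search(f, 2.0, 5.0)
print(x_min)
```

The reuse works because $\tau^2 = 1 - \tau$: after shrinking to $(x_1, b)$, the point $a_{new} + (1 - \tau)(b - a_{new})$ coincides with the old $x_2$, and symmetrically in the other case.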

Optimization in N Dimensions

Steepest Descent

The negative of the gradient of a differentiable function $f : \mathbb{R}^n \to \mathbb{R}$ points downhill, i.e. towards points in the domain having lower values. This suggests moving in the direction of $-\nabla f$ while searching for the minimum until we reach the point where $\nabla f(\boldsymbol{x}) = \boldsymbol{0}$. Therefore, at a point $\boldsymbol{x}$ the direction $-\nabla f(\boldsymbol{x})$ is called the direction of steepest descent.

We know the direction we need to move to approach the minimum, but we still do not know the distance we need to move. If $\boldsymbol{x}_k$ was our earlier point, then we select the next guess by moving in the direction of the negative gradient:

$$\boldsymbol{x}_{k+1} = \boldsymbol{x}_k + \alpha \left( -\nabla f(\boldsymbol{x}_k) \right).$$

The next problem is to find $\alpha$, and we use the 1-dimensional optimization algorithms to find the required $\alpha$. Hence, the problem translates to:

$$\begin{aligned} \boldsymbol{s} &= -\nabla f(\boldsymbol{x}_k) \\ &\min_{\alpha} \left( f\left( \boldsymbol{x}_k + \alpha \boldsymbol{s} \right) \right) \end{aligned}$$

The steepest descent algorithm can be summed up in the following function:

import numpy as np
import numpy.linalg as la
import scipy.optimize as opt

def obj_func(alpha, x, s):
    # code for computing the objective function at (x + alpha*s)
    return f_of_x_plus_alpha_s

def gradient(x):
    # code for computing the gradient at x
    return grad_x

def steepest_descent(x_init):
    x_new = x_init
    # Random previous iterate so that the loop is entered at least once
    x_prev = np.random.randn(x_init.shape[0])
    # Stop when consecutive iterates are closer than 1e-6
    while la.norm(x_prev - x_new) > 1e-6:
        x_prev = x_new
        # Search direction: negative gradient
        s = -gradient(x_prev)
        # Line search: 1-D minimization of f(x_prev + alpha*s) over alpha
        alpha = opt.minimize_scalar(obj_func, args=(x_prev, s)).x
        x_new = x_prev + alpha*s

    return x_new

The steepest descent method converges linearly.
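As a runnable sketch of the same idea, assume a hypothetical quadratic objective $f(\boldsymbol{x}) = \frac{1}{2} \boldsymbol{x}^T A \boldsymbol{x} - \boldsymbol{b}^T \boldsymbol{x}$ with a symmetric positive definite $A$ chosen for illustration. Its minimizer solves $A \boldsymbol{x} = \boldsymbol{b}$, which gives something to compare against. The stopping rule on the gradient norm and the iteration cap are assumptions of this sketch, not part of the listing above.

```python
import numpy as np
import numpy.linalg as la
import scipy.optimize as opt

# Hypothetical problem data: A is symmetric positive definite
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])

def f(x):
    return 0.5 * x @ A @ x - b @ x

def gradient(x):
    return A @ x - b

def obj_func(alpha, x, s):
    # Objective restricted to the line x + alpha*s
    return f(x + alpha * s)

def steepest_descent(x_init, tol=1e-6, max_iter=1000):
    x = x_init.astype(float)
    for _ in range(max_iter):
        s = -gradient(x)            # steepest descent direction
        if la.norm(s) < tol:        # gradient is (nearly) zero: stop
            break
        alpha = opt.minimize_scalar(obj_func, args=(x, s)).x
        x = x + alpha * s
    return x

x_min = steepest_descent(np.array([5.0, -3.0]))
print(x_min, la.solve(A, b))
```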

Newton’s Method

Newton’s Method in $n$ dimensions is similar to Newton’s method for root finding in $n$ dimensions, except we replace the $n$-dimensional function by the gradient and the Jacobian matrix by the Hessian matrix. We can arrive at the result by considering the Taylor expansion of the function.

$$f(\boldsymbol{x} + \boldsymbol{s}) \approx f(\boldsymbol{x}) + \nabla f(\boldsymbol{x})^T \boldsymbol{s} + \frac{1}{2} \boldsymbol{s}^T {\bf H}_f(\boldsymbol{x}) \boldsymbol{s}$$

We minimize this quadratic approximation over the step $\boldsymbol{s}$ by setting its gradient with respect to $\boldsymbol{s}$ equal to $\boldsymbol{0}$. This gives the linear system:

$${\bf H}_f(\boldsymbol{x}) \boldsymbol{s} = -\nabla f(\boldsymbol{x}).$$

Newton’s method can be expressed as a Python function as follows:

import numpy as np
import numpy.linalg as la

def hessian(x):
    # Computes the Hessian matrix of the given objective function at x
    return hessian_matrix_at_x

def gradient(x):
    # Computes the gradient vector of the given objective function at x
    return gradient_vector_at_x

def newtons_method(x_init):
    x_new = x_init
    # Random previous iterate so that the loop is entered at least once
    x_prev = np.random.randn(x_init.shape[0])
    # Stop when consecutive iterates are closer than 1e-6
    while la.norm(x_prev - x_new) > 1e-6:
        x_prev = x_new
        # Newton step: solve H_f(x) s = -grad f(x)
        s = -la.solve(hessian(x_prev), gradient(x_prev))
        x_new = x_prev + s
    return x_new
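For a version that runs end to end, the sketch below assumes a hypothetical convex objective $f(x, y) = e^{x + y} + x^2 + y^2$, picked only because its gradient and Hessian are easy to write down. The stopping rule on the step size and the iteration cap are likewise assumptions of this example.

```python
import numpy as np
import numpy.linalg as la

def gradient(x):
    # Gradient of the hypothetical f(x, y) = exp(x + y) + x^2 + y^2
    e = np.exp(x[0] + x[1])
    return np.array([e + 2.0 * x[0], e + 2.0 * x[1]])

def hessian(x):
    # Hessian of the same function; positive definite for every x
    e = np.exp(x[0] + x[1])
    return np.array([[e + 2.0, e],
                     [e, e + 2.0]])

def newtons_method(x_init, tol=1e-10, max_iter=50):
    x = x_init.astype(float)
    for _ in range(max_iter):
        # Newton step: solve H_f(x) s = -grad f(x)
        s = la.solve(hessian(x), -gradient(x))
        x = x + s
        if la.norm(s) < tol:
            break
    return x

x_min = newtons_method(np.array([1.0, 1.0]))
print(x_min, la.norm(gradient(x_min)))
```

Solving the linear system with `la.solve` avoids forming the inverse of the Hessian explicitly.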

Review Questions

What are the necessary and sufficient conditions for a point to be a local minimum in one dimension?
What are the necessary and sufficient conditions for a point to be a local minimum in $n$ dimensions?
How do you classify extrema as minima, maxima, or saddle points?
What is the difference between a local and a global minimum?
What does it mean for a function to be unimodal?
What special attribute does a function need to have for golden section search to find a minimum?
Run one iteration of golden section search.
Calculate the gradient of a function (function has many inputs, one output).
Calculate the Jacobian of a function (function has many inputs, many outputs).
Calculate the Hessian of a function.
Find the search direction in steepest/gradient descent.
Why must you perform a line search each step of gradient descent?
Run one step of Newton’s method in one dimension.
Run one step of Newton’s method in $n$ dimensions.
When does Newton’s method fail to converge to a minimum?
What operations do you need to perform each iteration of golden section search?
What operations do you need to perform each iteration of Newton’s method in one dimension?
What operations do you need to perform each iteration of Newton’s method in $n$ dimensions?
What is the convergence rate of Newton’s method?

ChangeLog

2020-04-26 Mariana Silva mfsilva@illinois.edu: small text revision
2018-4-25 Adam Stewart adamjs5@illinois.edu: fixes missing parenthesis in newtons_method
2017-11-25 Adam Stewart adamjs5@illinois.edu: fixes missing partial in Hessian matrix
2017-11-20 Kaushik Kulkarni kgk2@illinois.edu: fixes table formatting
2017-11-20 Nate Bowman nlbowma2@illinois.edu: adds review questions
2017-11-20 Erin Carrier ecarrie2@illinois.edu: removes Gauss-Newton/LM, minor rewording and small changes throughout
2017-11-20 Kaushik Kulkarni kgk2@illinois.edu and Arun Lakshmanan lakshma2@illinois.edu: first full draft
2017-10-17 Luke Olson lukeo@illinois.edu: outline