There are many techniques (and improvements to them) for global optimization, i.e., finding the global minimum or maximum of some complex functional. We discuss only two: SA and GA. These two techniques are similar, naturally motivated, general-purpose procedures for optimization. SA and GAs work well on a variety of problems, require little problem-specific information, do not need gradient information, and generate new points in the search space probabilistically.
It should be clear that when we speak of minimization, the case of finding a maximum can also be treated, by taking either the reciprocal or the negative of the function of interest, whichever is more reasonable.
Global minimization problems can be classified as NP-complete (non-deterministic polynomial-time complete): no method is known that solves them in computer time bounded by a power of N, where N is the number of degrees of freedom. There are two main types of strategies:
The iterative-improvement method, such as steepest descent, which is like an extremely rapid quench in statistical mechanics.
Simulated Annealing allows us to apply ideas developed in the statistical mechanics of liquids (i.e., the Metropolis algorithm for sampling the Boltzmann probability distribution) to more general optimization problems. The seminal application of these ideas to classical systems was "Optimization by Simulated Annealing" by S. Kirkpatrick, C.D. Gelatt, Jr., and M.P. Vecchi, Science 220, 671-680 (1983), which discussed the thermodynamic connection as well as the optimization of electronic circuitry on chips. To reach the optimal (or ground) state, think of how nature achieves ground-state structures: by annealing slowly.
Hence, for use with systems other than thermodynamic ones, we require:
Both Molecular Dynamics and Monte Carlo Methods may be used to perform simulated annealing.
Molecular Dynamics:
Monte Carlo:
The Metropolis algorithm allows us to explore phase space and find the thermodynamic average of the "energy" at a given temperature. For optimization, what we actually want is the T = 0 global minimum (ground state), obtained by slowly annealing the "system". Importantly, the Metropolis algorithm allows uphill moves along with downhill moves: a change in energy $\Delta E$ is accepted with probability $p(\Delta E) = \min[1, e^{-\Delta E/kT}]$. Clearly, then, if T is very low, the Monte Carlo version of the Metropolis algorithm is analogous to the steepest-descent method (no uphill moves are allowed because they are exponentially damped out); however, the underlying trajectory is a random walk, which can tunnel through barriers.
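As a minimal sketch (in Python, with the Boltzmann constant set to 1; the function name is ours), the acceptance rule reads:

```python
import math
import random

def metropolis_accept(dE, T):
    """Metropolis rule: accept with probability min[1, exp(-dE/kT)], k = 1.

    Downhill moves (dE <= 0) are always accepted; uphill moves are
    accepted with a probability that is exponentially damped as T -> 0.
    """
    return dE <= 0 or random.random() < math.exp(-dE / T)
```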
Both methods have a simulation time to locate the global minimum that scales like $e^{E_{max}/E_{min}}$, where $E_{max}$ is the maximum value in the set of ALL minimum energy barriers separating basins on the E-surface, and $E_{min}$ is the difference between the global minimum energy and the lowest local minimum energy. Typically, $E_{max}/E_{min} \gg 1$. (See, e.g., Amara, Ma and Straub, DIMACS Series in Discrete Mathematics and Theoretical Computer Science, vol. 23 (1996), and Straub in New Developments in Theoretical Studies of Proteins, ed. by Elber (1996).)
How fast should one anneal? The answer is, of course, dependent upon the optimization problem. However, because Simulated Annealing has its origin in statistical mechanics, we know that as one nears a phase transition the specific heat, $C_V = dE/dT$, shows (possibly sharp) structure as the system cools. When the equivalent of $C_V$ in the optimization problem begins to grow, this is a signal (as it is in standard thermodynamic simulations) that we should anneal more slowly so as to sample phase space more properly, for we are approaching a possible change in the system trajectory (analogous to a phase transition). There can be critical slowing down and hysteresis. Hence, by monitoring the thermodynamically equivalent quantity that is important in the optimization problem (possibly something other than $C_V$, such as a susceptibility, diffusion, etc.), we may alter the annealing rate to make the search more efficient. This important technique helps avoid settling on a false "optimal" point. You will note below for the Traveling Salesman Problem that a fixed annealing schedule is maintained (and used only after "thermalizing"), so that no alteration to the annealing schedule is required. This is mainly because that optimization problem is combinatorial and, after exploring annealing rates for such a problem, one can determine a reasonable rate once and for all. The use of the analogy to thermodynamic simulation was discussed in the original Science article, ibid.
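As an illustration of this monitoring idea, here is a hedged sketch: $C_V$ is estimated from energy fluctuations, $C_V = (\langle E^2 \rangle - \langle E \rangle^2)/T^2$ with $k = 1$, and the cooling is slowed when $C_V$ grows. The threshold and cooling factors are arbitrary illustrative choices, not prescriptions.

```python
import statistics

def specific_heat(energies, T):
    """C_V = (<E^2> - <E>^2) / T^2 (k = 1), from energies sampled at fixed T."""
    mean_E = statistics.fmean(energies)
    mean_E2 = statistics.fmean(e * e for e in energies)
    return (mean_E2 - mean_E ** 2) / T ** 2

def next_temperature(T, energies, cv_threshold=10.0):
    """Cool slowly (factor 0.99) when C_V exceeds an (illustrative)
    threshold, signalling an approaching 'phase transition'; cool
    quickly (factor 0.95) otherwise."""
    fast, slow = 0.95, 0.99
    return T * (slow if specific_heat(energies, T) > cv_threshold else fast)
```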
While discrete parameter sets (the Ising model, combinatorial problems, etc.) occur frequently, equally important is minimization with respect to continuous parameters, as required in function fitting. For example, suppose one desires to fit a function to a sum of exponentials or Gaussians; the choice of decay constants is then an optimization problem. (Linear or quadratic problems have much more efficient methods.) Simulated Annealing applied to continuous global optimization problems in this context was addressed by Vanderbilt and Louie, J. Comput. Phys. 56, 259 (1984). The main difficulty in going from discrete to continuous applications of SA-Monte Carlo is that choosing random steps becomes more subtle. It is helpful to build some "intelligence" into the search; Vanderbilt and Louie used excursions of the random walk to explore the local topology and, thereby, choose the step size.
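A sketch of the continuous case follows. Note that it does not reproduce Vanderbilt and Louie's covariance-based step selection; instead it uses a simpler, commonly used stand-in, rescaling an isotropic step size to hold the acceptance ratio near 50%. All names and parameter values are illustrative.

```python
import math
import random

def anneal_continuous(f, x0, T0=1.0, cool=0.99, n_temps=200, trials=50):
    """SA for a 1-D continuous objective f: uniform random steps, with
    the step size rescaled at each temperature to keep the acceptance
    ratio near 1/2 (a crude stand-in for covariance-adapted steps)."""
    x, fx, T, step = x0, f(x0), T0, 1.0
    for _ in range(n_temps):
        accepted = 0
        for _ in range(trials):
            xt = x + step * random.uniform(-1.0, 1.0)
            dE = f(xt) - fx
            if dE <= 0 or random.random() < math.exp(-dE / T):
                x, fx, accepted = xt, fx + dE, accepted + 1
        step *= 1.1 if accepted > trials // 2 else 0.9  # steer toward ~50%
        T *= cool
    return x, fx
```

For example, `anneal_continuous(lambda x: (x * x - 1) ** 2 + 0.1 * x, 3.0)` should settle near the lower of the two wells of this double-well test function.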
Problem: A salesman travels to N cities with 2-D coordinates $(x_i, y_i)$, $i = 1, \ldots, N$, returning finally to the city of origin. Each city is to be visited once and only once, and the route taken is to be as short as possible.
Question: What is the shortest route?
This belongs to the class of optimization problems able to be addressed efficiently with simulated annealing and scales like $e^{aN}$, where $a$ is a constant.
See the Traveling Salesman Applet from Carleton University for an example simulation.
How does the annealing rate change the result (for better or worse)? We discuss the solution using a Genetic Algorithm below. Can you see an inversion take place?
Monte Carlo - Simulated Annealing Approach
The objective function (the "energy") is the total length of the closed tour,

$$E = \sum_{i=1}^{N} \sqrt{(x_{i+1} - x_i)^2 + (y_{i+1} - y_i)^2},$$

with the periodic condition city(N+1) = city(1) closing the route.
Finally, as discussed in class, this objective function can be chosen to suit the problem, and the code does not have to be altered. For example, as discussed in Numerical Recipes, one can add other energy penalties that modify the problem, such as making river crossings cost more. The objective function becomes

$$E = \sum_{i=1}^{N} \left[ \sqrt{(x_{i+1} - x_i)^2 + (y_{i+1} - y_i)^2} + K (s_{i+1} - s_i)^2 \right],$$

where $s_i = \pm 1$ for the left or right side of the river, and an extra penalty of 4K is incurred each time the salesman crosses the river.
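A minimal sketch of the Monte Carlo-SA approach for this objective (in Python; the segment-reversal move and fixed geometric cooling schedule are standard choices, and all parameter values here are illustrative):

```python
import math
import random

def tour_length(cities, tour):
    """Closed-tour length; the modulo index enforces city(N+1) = city(1)."""
    return sum(math.dist(cities[tour[i]], cities[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def anneal_tsp(cities, T0=10.0, cool=0.999, steps=100_000):
    """Metropolis SA on tours: propose reversing a random segment,
    accept with probability min[1, exp(-dE/T)]."""
    tour = list(range(len(cities)))
    random.shuffle(tour)
    E, T = tour_length(cities, tour), T0
    for _ in range(steps):
        i, j = sorted(random.sample(range(len(tour)), 2))
        trial = tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]  # reverse segment
        dE = tour_length(cities, trial) - E
        if dE <= 0 or random.random() < math.exp(-dE / T):
            tour, E = trial, E + dE
        T *= cool
    return tour, E
```

Here `cities` is a list of (x, y) pairs; the river-crossing penalty could be added to `tour_length` without touching the annealing loop, which is the point made above.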
Other Problems: You can now imagine how to adapt this to other problems. Many problems map onto the lattice Ising model that we studied previously, for example, the circuit optimization described in class and in the Science article. This technique has been used to address a wide range of engineering, mathematics, physical science, computer science, design, and gaming problems that fall into the class of optimization problems.
Circuit Optimization (The Ising Model, again!) Consider N circuits to be partitioned onto 2 chips. Propagating a signal across the chip boundary is slow; hence, the placement of elements on the two chips can be optimized. While putting all N circuits on 1 chip eliminates the problem, it leaves no room for additional circuits, or there is not enough room for the N needed. Define the connectivity $a_{ij}$ as the number of signals passing between circuits i and j, and let $\mu_i = \pm 1$ indicate which chip circuit i resides on (e.g., left or right). The number of signals crossing the boundary is then

$$N_c = \sum_{i>j} a_{ij}\,\frac{1 - \mu_i \mu_j}{2}.$$

Note that $\sum_i \mu_i$ is the difference in the number of circuits on the 2 chips. Hence, a penalty can be added for any imbalance in the number of circuits by $\lambda \left( \sum_i \mu_i \right)^2$. If z is the average number of circuits connected to a typical circuit, then $\lambda \sim z/2$ (equal weight to balancing the circuits). With such a penalty, the "cost" function is

$$f = \sum_{i>j} a_{ij}\,\frac{1 - \mu_i \mu_j}{2} + \lambda \left( \sum_i \mu_i \right)^2,$$

which is (again) the Lenz-Ising model.
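A short sketch of this cost function (function and variable names are ours):

```python
from itertools import combinations

def partition_cost(a, mu, lam):
    """f = sum_{i>j} a_ij (1 - mu_i mu_j)/2 + lam * (sum_i mu_i)^2.

    a[i][j]: number of signals between circuits i and j (symmetric);
    mu[i]:   +1 or -1, the chip on which circuit i resides;
    lam:     imbalance penalty weight (~ z/2)."""
    crossing = sum(a[i][j] * (1 - mu[i] * mu[j]) / 2
                   for i, j in combinations(range(len(mu)), 2))
    return crossing + lam * sum(mu) ** 2
```

This cost can be dropped directly into the Metropolis acceptance step above, with a "move" being a single spin flip $\mu_i \to -\mu_i$.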
GAs were introduced by J.H. Holland (1962) (see, more recently, Adaptation in Natural and Artificial Systems, Univ. of Michigan Press (1975)). A nice introduction to the concepts is contained in D.E. Goldberg's text Genetic Algorithms in Search, Optimization and Machine Learning (Addison-Wesley, 1989), although the coding is in Pascal. David Goldberg, a Professor of General Engineering at UIUC, initiated engineering applications by optimizing a gas pipeline distribution system with GA techniques. Some tutorials are available, such as Shaffer and Small, Analytical Chemistry 69, 236A (1997), or L. Davis, Handbook of Genetic Algorithms, chap. 4 (Van Nostrand, 1991).
The main GA terminology comes from the analogy with biological reproduction: Chromosomes are gene sequences, with genes being composed of alleles, $a_i$. Symbolically, a sequence of alleles defines the gene: $G = a_1 a_2 \cdots a_N$. Different allele sequences define different genes, and a population is defined as a collection of genes (a subset of the allowed genes, which themselves define the search space). For use below, we note that a binary representation of a gene (here only 5 bits) can be evaluated as $x = a_1 2^4 + a_2 2^3 + a_3 2^2 + a_4 2^1 + a_5 2^0$, i.e., with weights 16, 8, 4, 2, 1. So, for an N-bit gene, the sum is at most $2^N - 1$, and any gene with all bits zero has x = 0.
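In code, this evaluation is a one-liner; here is a small Python helper (the name `decode` is ours) matching the formula above:

```python
def decode(gene):
    """Unsigned-integer value of a bit-string gene, most significant
    allele first, e.g. '01101' -> 13."""
    return sum(int(a) * 2 ** k for k, a in enumerate(reversed(gene)))
```

Equivalently, Python's built-in `int(gene, 2)` gives the same value.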
Genetic Algorithms are named for their similarity to concepts in genetic reproduction. They require a population size, recombination methods, and a mutation rate (various types of mutation are allowed). Importantly, most operations in GAs are probabilistically controlled, i.e., they are randomized search methods. As Goldberg notes, "randomized search does not necessarily imply directionless search".
GAs differ from standard optimization and search procedures in that they:
1. work with a coding of the parameter set, not the parameters themselves;
2. search from a population of points, not a single point;
3. use objective function (payoff) information, not derivatives or other auxiliary knowledge;
4. use probabilistic transition rules, not deterministic ones.
N.B. Point-by-point searches can be dangerous, as they can locate false peaks in multimodal search spaces; working with large populations prevents this. Note also that population evolution in GAs requires a defined process (i.e., reproduction, mutation, and selection) besides, of course, the objective function. The detailed elements of the evolutionary process are best determined by the problem at hand, but all fall into the three general categories above.
An example optimization is found below. You may do a web search or browse the cited texts for applications in a multitude of fields. Just to name some areas: biology (of course!); game playing (hexapawn, a 3x3 chess board with only pawns, more difficult than you might imagine at first); the prisoner's dilemma (should prisoners cooperate, rat out one another, or remain silent, depending on the reward for each); pattern recognition; mechanical engineering (bridge component optimization); electrical engineering (circuit optimization); mathematical function optimization; social science (hunter-gatherer models); physical science (fitting potential surfaces, as in pseudo-potential calculations, or reaction-rate calculations). For your amusement: before GAs, the best algorithm for games of conflict was the so-called tit-for-tat algorithm, in which you do whatever your opponent did previously. It beat most very sophisticated approaches, that is, until GAs.
GA Advantages
GA Disadvantages
This is an example discussed at length in David Goldberg's text Genetic Algorithms in Search, Optimization and Machine Learning (Addison-Wesley, 1989): maximize $f(x) = x^2$, where x is the unsigned-integer value of a 5-bit gene (a string of five ON/OFF switches), so the optimum is x = 31 with f = 961. It utilizes all the classic GA techniques mentioned above and shows in detail the STEPS given above. Obviously, it is trivial to solve: we know the answer and could write down the entire population space. But what better example to learn from?
Let us follow our steps from above.
We must first go through REPRODUCTION. Below we generate an "initial population" of switch settings, encoded as genes, by random coin flips. Here we use 4 genes and require 20 coin flips, or 20 bits of information. The unsigned-integer value x of each gene is then calculated, and the "objective function" value is calculated from x. In order to perform a probabilistic but weighted search, we calculate each gene's "selection probability" from the population (these must sum to 1.0). Before performing a randomly seeded update of the population, we may estimate the "expected occurrence" of each "old" gene in the new gene pool. We then form a roulette wheel whose regions are weighted according to the "selection probability" column, spin the wheel, and obtain the "actual occurrences" given by chance (see the selection sketch after the table). It is important to realize that this weighted probabilistic search allows a smart but blind search of the gene (phase) space. Note that the last two columns, the expected and actual occurrences of the genes, conform to our expectations: the best genes get copied more, the average genes stay even, and the worst genes die out! Survival of the fittest.
| G-string number, i | Initial G population (random coin flips) | Unsigned integer value, x | Objective function, f(x) = x² | Selection probability | Expected occurrence | Actual occurrence from chance |
|---|---|---|---|---|---|---|
| 1 | 01101 | 13 | 169 | 0.14 | 0.58 | 1 |
| 2 | 11000 | 24 | 576 | 0.49 | 1.97 | 2 |
| 3 | 01000 | 8 | 64 | 0.05 | 0.22 | 0 |
| 4 | 10011 | 19 | 361 | 0.31 | 1.23 | 1 |
| Sum over column | | | 1170 | 1.00 | 4.00 | 4.00 |
| Avg of column | | | 293 | 0.25 | 1.00 | 1.00 |
| Max of column | | | 576 | 0.49 | 1.97 | 2.00 |
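A hedged sketch of the roulette-wheel (fitness-proportionate) selection that produces the "actual occurrence" column (the function name is ours):

```python
import random

def roulette_select(population, fitness, n):
    """Draw n genes with replacement, each weighted by its share of the
    total fitness -- the 'selection probability' column above."""
    return random.choices(population, weights=fitness, k=n)
```

For example, `roulette_select(['01101', '11000', '01000', '10011'], [169, 576, 64, 361], 4)` typically copies the best gene most often, in line with the expected-occurrence column.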
Next we must go through the MATING PROCESS. Notice below that old gene 1 occurs once, old gene 2 occurs twice, old gene 3 does not occur, and old gene 4 occurs once; this was produced by the spin of the roulette wheel above (chance). To read the table, note that once genes are mated at random, the crossover point along the gene (also chosen at random) must be the same for each mate. In the table, the ‖ indicates the crossover point, which is chosen randomly and appears in column 4. As a crossover example, after randomly choosing a crossover point along the gene (the 4th position below), we exchange genetic material:
CROSSOVER EXAMPLE

MATED OLD GENES          NEW GENES
  0110‖1    -------->    0110‖0
                |
                |  swap bits after
                |  the crossover point
                |
  1100‖0    -------->    1100‖1
Here we have just performed crossover with the tails of the genes, past the crossover point (a sketch in code follows the table below). However, you could certainly define other types (depending on the needs of the problem), such as sometimes swapping left with left, right with right, left with right, or right with left, making sure to conserve the length of the gene (at least in this case).
| New G-string, i | Mating pool after reproduction | Randomly selected mate | Crossover site (randomly selected) | New G population | Value of x | f(x) = x² |
|---|---|---|---|---|---|---|
| 1 | 0110‖1 | 2 | 4 | 01100 | 12 | 144 |
| 2 | 1100‖0 | 1 | 4 | 11001 | 25 | 625 |
| 3 | 11‖000 | 4 | 2 | 11011 | 27 | 729 |
| 4 | 10‖011 | 3 | 2 | 10000 | 16 | 256 |
| Sum over column | | | | | | 1754 |
| Avg over column | | | | | | 439 |
| Max over column | | | | | | 729 |
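A sketch of the single-point crossover used in the table (the function name is ours):

```python
import random

def crossover(g1, g2):
    """Swap the tails past a random cut: '0110|1' x '1100|0' -> '01100', '11001'."""
    cut = random.randrange(1, len(g1))  # cut strictly inside the string
    return g1[:cut] + g2[cut:], g2[:cut] + g1[cut:]
```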
We must now perform MUTATION. With a mutation probability of p = 0.001 per bit, the first generation does not get mutated: with only 20 bits, the expected number of mutations is 20 × 0.001 = 0.02, so none was included here. But mutation can certainly be very effective during the course of the algorithm. Also, be aware that mutation by inversion may have been effective. For example, if you have G = 10011 and you randomly choose the 2nd position as the break to start the inversion, reversing all remaining bits, you get G = 10110, which is much closer to the optimum than 10011.
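Sketches of the two mutation operators just described, bit-flip mutation and inversion (the names and the independent per-bit model are ours):

```python
import random

def mutate(gene, p=0.001):
    """Flip each bit independently with probability p; for 20 bits the
    expected number of flips is 20 * p = 0.02 per generation."""
    return ''.join('10'[int(b)] if random.random() < p else b for b in gene)

def invert(gene, start):
    """Mutation by inversion: reverse the tail from 0-based position
    `start` onward, e.g. invert('10011', 2) -> '10110'."""
    return gene[:start] + gene[start:][::-1]
```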
Notice that in one generation the average fitness went from 293 to 439, and the maximum fitness went from 576 to 729, where we know the exact answer is 961. Although random processes drive this improvement, one can see that it is no fluke: the best gene, G = 11000, gets copied twice, and when combined at random with the next-highest string (10011) and crossed, a more optimal gene is produced. The probabilistic rules allow a blind search, but the weighting adds intelligence to it. Imagine how effective this may become with a really large search space and a large gene pool.
Notice that if we had started out with a really lousy initial population, this algorithm might have taken a very long time to go anywhere; consider several G = 00000 entries, for example. In such cases, you can see that bit mutation could really help speed things up!
Finally, for this particular simple problem, it is important to realize how the objective function and encoding were chosen for this approach to be successful (as must be done for each problem). Because the electrical switch connections are in series, the signal will always be 0 unless switch 1 is ON. Therefore, switch 1 requires the largest weighting, and this is why it is the first allele in the gene. The arbitrary f(x) = x² was chosen to favor this sequencing of ON from switch 1 to 5. Basically, we know a lot about this problem, and it can be set up well.
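Putting the pieces together, here is a hedged end-to-end sketch of this GA, reusing the `decode`, `roulette_select`, `crossover`, and `mutate` helpers sketched above; the population size, generation count, and the tiny fitness floor are our illustrative choices:

```python
import random

def run_ga(n_bits=5, pop_size=4, generations=20):
    """Maximize f(x) = x^2 over n_bits-bit genes using reproduction,
    single-point crossover, and bit-flip mutation."""
    pop = [''.join(random.choice('01') for _ in range(n_bits))
           for _ in range(pop_size)]
    for _ in range(generations):
        # Tiny floor keeps the roulette wheel defined for an all-zero pool.
        fit = [decode(g) ** 2 + 1e-9 for g in pop]
        pool = roulette_select(pop, fit, pop_size)   # reproduction
        random.shuffle(pool)                         # random mating
        pop = [child for g1, g2 in zip(pool[::2], pool[1::2])
               for child in crossover(g1, g2)]
        pop = [mutate(g) for g in pop]
    return max(pop, key=decode)
```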
Actually, making f(x) = x^N with N > 10, instead of 2, makes this a much more difficult problem, as the function becomes a spike with a small probability of randomly generating the exact genetic solution. Clearly, a multimodal function can be searched very efficiently this way; however, if the exact answer is required, another approach is needed. Recently, hybrid schemes have been concocted which use both simulated annealing and genetic algorithms. Also, genetic algorithms have been developed which can produce a Boltzmann distribution in phase space, and spatially as well! See, for example, "Parallel recombinative simulated annealing: a genetic algorithm", Mahfoud and Goldberg, Parallel Computing 21, 1 (1995); or "Notes on Boltzmann Tournament Selection for Genetic Algorithms and Population-Oriented Simulated Annealing", Goldberg, Complex Systems 4, 445 (1990).
For other problems, the trick with GAs is the encoding of the population and determining a good objective function. You may, of course, add other penalties to the objective function, as was done in the traveling salesman problem where a phobia, that of river crossings, was given a penalty. For more details, see the discussion on the web link for this particular problem, or look in the literature.