-
Notifications
You must be signed in to change notification settings - Fork 11
/
Copy pathMinimization.tex
1182 lines (1077 loc) · 56.1 KB
/
Minimization.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
\ifx\wholebook\relax\else
\documentclass[twoside]{book}
\usepackage[active]{srcltx}
\usepackage[LY1]{fontenc}
\input{DhbDefinitions}
\begin{document}
\fi
\chapter{Optimization}
\label{ch:minimization} \vspace{1 ex}
\begin{flushright}
{\sl Cours vite au but, mais gare \`a la chute.}\footnote{Run fast
to the goal, but beware of the fall.}\\ Alexandre Soljenitsyne
\end{flushright}
\vspace{1 ex} An optimization problem is a numerical problem where
the solution is characterized by the largest or smallest value of
a numerical function depending on several parameters. Such
function is often called the goal function. Many kinds of problems
can be expressed into optimization, that is, finding the maximum
or the minimum of a goal function. This technique has been applied
to a wide variety of fields going from operation research to game
playing or artificial intelligence. In chapter \ref{ch:estimation}
for example, the solution of maximum likelihood or least square
fits was obtained by finding the maximum, respectively the minimum
of a function.
In fact generations of high energy physicists have used the
general purpose minimization program MINUIT\footnote{F.James, M.
Roos, {\sl MINUIT --- a system for function minimization and
analysis of the parameter errors and corrections}, Comput. Phys.
Commun., 10 (1975) 343-367.} written by Fred James\footnote{I take
this opportunity to thank Fred for the many useful discussions we
have had on the subject of minimization.} of CERN to perform least
square fits and maximum likelihood fits. To achieve generality,
MINUIT uses several strategies to reach a minimum. In this chapter
we shall discuss a few techniques and conclude with a program
quite similar in spirit to MINUIT. Our version, however, will not
have all the features offered by MINUIT.
If the goal function can be expressed with an analytical form, the
problem of optimization can be reduced into calculating the
derivatives of the goal function respective to all parameters, a
tedious but manageable job. In most cases, however, the goal
function cannot always be expressed analytically.
Figure \ref{fig:soptimizingclasses} shows the Smalltalk class diagram.
\begin{figure}
\centering\includegraphics[width=11cm]{Figures/Optimizing}
\caption{Smalltak classes used in optimization}
\label{fig:soptimizingclasses}
\end{figure}
\section{Introduction}
\label{sec:optimum} Let us state the problem is general term. Let
$f\left({\bf x}\right)$ be a function of a vector ${\bf x}$ of
dimension $n$. The $n$-dimensional space is called the search
space of the problem. Depending on the problem the space can be
continuous or not. In this section we shall assume that the space
is continuous.
If the function is derivable, the gradient of the function
respective to the vector ${\bf x}$ must vanish at the optimum.
Finding the optimum of the function can be replaced by the problem
of finding the vector ${\bf x}$ such that:
\begin{equation}
\label{eq:mincondition}
{d f\left({\bf x}\right) \over d{\bf x}} = 0.
\end{equation}
Unfortunately, the above equation is not a necessary condition for
an optimum. It can be either a maximum. a minimum or a saddle
point, that is a point where the function has a minimum in one
projection and a maximum in another projection. Furthermore, the
function may have several optima. Figure \ref{fig:localabsobulte}
shows an example of a function having two minima.
\begin{figure}
\centering\includegraphics[width=11cm]{Figures/MinimumAbsLoc}
\caption{Local and absolute optima} \label{fig:localabsobulte}
\end{figure}
Some problems require to find the absolute optimum of the
function. Thus, one must verify that the solution of
\ref{eq:mincondition} corresponds indeed to an optimum with the
expected properties. The reader can already see at this point that
searching for an optimum in the general case is a very difficult
task.
\noindent All optimization algorithms can be classified in two
broad categories:
\begin{itemize}
\item {\sl Greedy algorithms}: these algorithms are characterized by a
local search in the most promising direction. They are usually efficient
and quite good at finding local optima. Among greedy algorithms,
one must distinguish those requiring the evaluation of the
function's derivatives.
\item {\sl Random based algorithms}: these algorithms are using
a random approach. They are not efficient; however, they are good at
finding absolute optima. Simulated annealing \cite{Press} and
genetic algorithms\cite{BerLin} belong to this class.
\end{itemize}
The table \ref{tb:optimizingalgorithms} lists the properties of
the algorithms presented in this chapter.
\begin{table}[h]
\centering
\caption{Optimizing algorithms presented in this book}\label{tb:optimizingalgorithms}
\vspace{1 ex}\begin{tabular}{| l | c | c |} \hline
\hfil{\bf Name} & {\bf Category} & {\bf Derivatives} \\ \hline
Extended Newton & greedy & yes \\
Powell's hill climbing & greedy & no \\
Simplex & greedy & no \\
Genetic algorithm & random based & no \\ \hline
\end{tabular}
\end{table}
\section{Extended Newton algorithms}
Extended Newton algorithms are using a generalized version of
Newton's zero finding algorithm. These algorithms assume that the
function is continuous and has only one optimum in the region
where the search is initiated.
Let us expand the function $f\left({\bf x}\right)$ around a point
${\bf x}^{\left(0\right)}$ near the solution. We have in
components:
\begin{equation}
f\left({\bf x}\right) = f\left[{\bf x}^{\left(0\right)}\right] +\sum_j
\left.{\partial f\left({\bf x}\right) \over \partial x_j}\right|_{{\bf x}={\bf
x}^{\left(0\right)}}
\left[x_j-x^{\left(0\right)}_j\right].
\end{equation}
Using the expansion above into equation \ref{eq:mincondition}
yields:
\begin{equation}
\sum_j
\left.{\partial^2 f\left({\bf x}\right) \over \partial x_i\partial x_j}\right|_{{\bf x}={\bf
x}^{\left(0\right)}}
\left[x_j-x^{\left(0\right)}_j\right]
+ \left.{\partial f\left({\bf x}\right) \over \partial x_i}\right|_{{\bf x}={\bf
x}^{\left(0\right)}} =0.
\end{equation}
This equation can be written as a system of linear equations of
the form
\begin{equation}
{\bf M}\Delta = {\bf c},
\end{equation}
where $\Delta_j =x_j-x^{\left(0\right)}_j$. The components of the
matrix ${\bf M}$ --- called the Hessian matrix --- are given by:
\begin{equation}
m_{ij} = \left.{\partial^2 f\left({\bf x}\right) \over \partial x_i\partial x_j}\right|_{{\bf x}={\bf
x}^{\left(0\right)}},
\end{equation}
and the components of the vector ${\bf c}$ are given by:
\begin{equation}
c_i = -\left.{\partial f\left({\bf x}\right) \over \partial x_i}\right|_{{\bf x}={\bf
x}^{\left(0\right)}}.
\end{equation}
Like in section \ref{sec:lsfnonlin} one can iterate this process
by replacing ${\bf x}^{\left(0\right)}$ with ${\bf
x}^{\left(0\right)}+\Delta$. This process is actually equivalent
to the Newton zero finding method (\cf section \ref{sec:newton}).
The final solution is a minimum if the matrix ${\bf M}$ is
positive definite; else it is a maximum.
This technique is used by MINUIT in the vicinity of the goal
function's optimum. It is the region where the algorithm described
above works well. Far from the optimum, the risk of reaching a
point where the matrix ${\bf M}$ cannot be inverted is quite high
in general. In addition, the extended Newton algorithm requires
that the second order derivatives of the function can be computed
analytically; at least the first order derivatives must be
provided, otherwise the cost of computation at each step becomes
prohibitive. A concrete implementation of the technique is not
given here. The reader can find in this book all the necessary
tools to make such an implementation. It is left as a exercise for
the reader. In the rest of this chapter, we shall present other
methods which work without an analytical knowledge of the
function.
\section{Hill climbing algorithms}
Hill climbing is a generic term covering many algorithms trying to
reach an optimum by determining the optimum along successive
directions. The general algorithm is outlined below.
\begin{enumerate}
\item select an initial point ${\bf x}_0$ and a direction ${\bf v}$;
\item find ${\bf x}_1$, the optimum of the function along the selected
direction;
\item if convergence is attained, terminate the algorithm;
\item set ${\bf x}_0={\bf x}_1$, select a different direction and go back to step 2.
\end{enumerate}
The simplest of these algorithms simply follows each axis in turn
until a convergence is reached. More elaborate algorithms
exist\cite{Press}. One of them is described in section
\ref{sec:powell}.
Hill climbing algorithms can be applied to any continuous
function, especially when the function's derivatives are not
easily calculated. The core of the hill climbing algorithm is
finding the optimum along one direction. Let ${\bf v}$ be the
direction, then finding the optimum of the vector function
$f\left({\bf x}\right)$ along the direction ${\bf v}$ starting
from point ${\bf x}_0$ is equivalent to finding the optimum of the
one-variable function $g\left(\lambda\right)=f\left({\bf
x}_0+\lambda{\bf v}\right)$.
Therefore, in order to implement a hill climbing algorithm, we
first need to implement an algorithm able to find the optimum of a
one-variable function. This is the topic of the sections
\ref{sec:optonedim} and \ref{sec:bracket}. Before this, we need to
discuss the implementation details providing a common framework to
all classes discussed in the rest of this chapter.
\subsection{Optimizing --- General implementation}
\label{sec:goptonedim} At this point the reader may be a little
puzzled by the use of {\sl optimum} instead of speaking of minimum
or maximum. We shall now disclose a general implementation which
works both for finding a minimum or a maximum. This should not
come to a surprise since, in mathematics, a minimum or a maximum
are both very similar --- position where the derivative of a
function vanishes
--- and can be easily turned into each other --- e.g. by negating the
function.
To implement a general purpose optimizing framework, we introduce
two new classes: {\tt MinimizingPoint} and {\tt MaximizingPoint},
a subclass of {\tt MinimizingPoint}. These two classes
are used as \patstyle{Strategy} by the optimizing algorithms. The
class {\tt MinimizingPoint} has two instance variables
\begin{description}
\item[\tt value] the value of the goal function, that is
$g\left(\lambda\right)$ or $f\left({\bf x}\right)$;
\item[\tt position] the position at which the function has been
evaluated, that is $\lambda$ or ${\bf x}$.
\end{description}
The class {\tt MinimizingPoint} contains most of the methods. The
only method overloaded by the class {\tt MaximizingPoint} is the
method {\tt betterThan}, which tells whether an optimizing point
is better than another. The method {\tt betterThan} can be used in
all parts of the optimizing algorithms to find out which point is
the optimum so far. In algorithms working in multiple dimensions,
the method {\tt betterThan} is also used to sort the points from
the best to the worst.
A convenience instance creation method allows to create instances
for a given function with a given argument. The instance is then
initialized with the function's value evaluated at the argument.
Thus, all optimizing algorithms described here do not call the
goal function explicitly.
Otherwise the implementation of the one dimensional optimum search
uses the general framework of the iterative process. More
specifically it uses the class {\tt FunctionalIterator} described
in section \ref{sec:iterrel}.
A final remark concerns the method {\tt initializeIteration}. The
golden search algorithm assume that the 3 points $\lambda_0$,
$\lambda_1$ and $\lambda_2$ have been determined. What if they
have not been? In this case, the method {\tt initializeIteration}
uses the optimum bracket finder described in section
\ref{sec:bracket}
\subsection{Common optimizing classes --- Smalltalk implementation}
\label{sec:sgeneralOpt} \marginpar{Figure
\ref{fig:soptimizingclasses} with the boxes {\bf
FunctionOptimizer}, {\bf MinimizingPoint}, {\bf MaximizingPoint},
{\bf ProjectedOneVariableFunction} and {\bf
VectorProjectedFunction} grayed.} In Smalltalk we have two classes
of optimizing points: {\tt DhbMinimizingPoint} and its subclass
{\tt DhbMaximizingPoint}. These classes are shown in listing
\ref{ls:optimizerCommon}. The class {\tt DhbFunctionOptimizer} is
in charge of handling the management of the optimizing points.
This clas is shown in listing \ref{ls:optimizerAbstract}.
\noindent The class {\tt DhbMinimizingPoint} has the following
instance variables:
\begin{description}
\item[\tt position] contains the position at which the function
is evaluated; this instance variable is a number if the function
to optimize is a one variable function and an array or a vector
if the function to evaluate is a function of many variables;
\item[\tt value] contains the value of the function evaluated at the point's position;
\end{description}
Accessor methods corresponding to these variables are supplied. As
we noted in section \ref{sec:goptonedim}, the only method
redefined by the subclass {\tt DhbMaximizingPoint} is the method
{\tt betterThan:} used to decide whether a point is better than
another.
Optimizing points are created with the convenience method {\tt
vector:function:} which evaluates the function supplied as second
argument at the position supplied as the first argument.
\begin{listing} Smalltalk classes common to all optimizing classes \label{ls:optimizerCommon}
\input{Smalltalk/Minimization/DhbMinimizingPoint}
\input{Smalltalk/Minimization/DhbMaximizingPoint}
\end{listing}
The class {\tt DhbFunctionOptimizer} is in charge of handling the
optimizing points. it has the following instance variables:
\begin{description}
\item[\tt optimizingPointClass] is the class of the optimizing
points used as \patstyle{Strategy} by the optimizer; it is used
to create instances of points at a given position for a given
function;
\item[\tt bestPoints] contains a sorted collection of optimizing
points; the best point is the first and the worst point is the
last; all optimizers rely on the fact that sorting is done by this sorted collection.
\end{description}
The method {\tt addPointAt:} creates an optimizing point at the
position supplied as argument and adds this point to the
collection of best points. Since that collection is sorted, one is
always certain to find the best result in the first position. This
fact is used by the method {\tt finalizeIterations}, which
retrieves the result from the collection of best points.
Instances of the function optimizer are created with the two
convenience methods {\tt minimizingFuntion:} and {\tt
maximizingFuntion:} helping to define the type of optimum. An
additional convenience method, {\tt forOptimizer:} allows to
create a new optimizer with the same strategy --- that is, the
same class of optimizing points --- and the same function as the
optimizer supplied as argument. This method is used to create
optimizers used in intermediate steps.
Because finding an optimum cannot be determined numerically with
high precision \cite{Press} the class {\tt DhbFunctionOptimizer}
redefines the method {\tt defaultPrecision} to be 100 times the
default numerical precision.
\begin{listing} Smalltalk abstract class for all optimizing classes \label{ls:optimizerAbstract}
\input{Smalltalk/Minimization/DhbFunctionOptimizer}
\end{listing}
In order to find an optimum along a given direction, one needs to
construct an object able to transform a vector function into a one
variable function. The class {\tt DhbProjectedOneVariableFunction}
and its subclass {\tt DhbVectorProjectedFunction} provide this
functionality. They are shown in listing
\ref{ls:projectedfunctions}. The class {\tt
DhbProjectedOneVariableFunction} has the following instance
variables:
\begin{description}
\item[\tt function] the goal function $f\left({\bf x}\right)$;
\item[\tt argument] the vector argument of the goal function,
that is the vector ${\bf x}$;
\item[\tt index] the index of the axis along which the function
is projected.
\end{description}
The instance variables {\tt argument} and {\tt index} can be read
and modified using direct accessor methods. The goal function is
set only at creation time: the instance creation method {\tt
function:} take the goal function as argument. A convenience
method {\tt bumpIndex} allows to alter the index in circular
fashion\footnote{We do not give the implementation of the simplest
of the hill climbing algorithms using alternatively each axes of
the reference system. This implementation, which uses the method
{\tt bumpIndex}, is left as an exercise for the reader.}.
The class {\tt DhbVectorProjectedFunction} has no additional
variables. Instead it is reusing the instance variable {\tt index}
as the direction along which the function is evaluated. For
clarity, the accessor methods have been renamed {\tt direction},
{\tt direction:}, {\tt origin} and {\tt origin:}.
For both classes, the method {\tt argumentAt:} returns the
argument vector for the goal function, that is the vector ${\bf
x}$. The method {\tt value:} returns the value of the function
$g\left(\lambda\right)$ for the supplied argument $\lambda$.
\begin{listing} Smalltalk projected function classes \label{ls:projectedfunctions}
\input{Smalltalk/Minimization/DhbProjectedOneVariableFunction}
\input{Smalltalk/Minimization/DhbVectorProjectedFunction}
\end{listing}
\section{Optimizing in one dimension}
\label{sec:optonedim} To find the optimum of a one-variable
function, $g\left(\lambda\right)$, whose derivative is unknown,
the most robust algorithm is an algorithm similar to the bisection
algorithm described in section \ref{sec:bisection}.
Let us assume that we have found three points $\lambda_0$,
$\lambda_1$ and $\lambda_2$ such that
$\lambda_0<\lambda_1<\lambda_2$ and such that
$g\left(\lambda_1\right)$ is better than both
$g\left(\lambda_0\right)$ and $g\left(\lambda_2\right)$. If the
function $g$ is continuous over the interval
$\left[\lambda_0,\lambda_2\right]$, then we are certain that an
optimum of the function is located in the interval
$\left[\lambda_0,\lambda_2\right]$. As for the bisection
algorithm, we shall try to find a new triplet of values with
similar properties while reducing the size of the interval. A
point is picked in the largest of the two intervals
$\left[\lambda_0,\lambda_1\right]$ or
$\left[\lambda_1,\lambda_2\right]$ and is used to reduce the
initial interval.
If $\lambda_1-\lambda_0\leq\lambda_2-\lambda_1$ we compute
$\lambda_4 =\lambda_1 + \omega\left(\lambda_2-\lambda_1\right)$
where $\omega$ is the golden
section\footnote{$\omega={3-\sqrt{5}\over 2}\approx0.38197$} from
Pythagorean lore. Choosing $\omega$ instead of $1/2$ ensures that
successive intervals have the same relative scale. A complete
derivation of this argument can be found in \cite{Press}. If
$\lambda_4$ yields a better function value than $\lambda_1$, the
new triplet of point becomes $\lambda_1$, $\lambda_4$ and
$\lambda_2$; otherwise, the triplet becomes $\lambda_0$,
$\lambda_1$ and $\lambda_4$.
If we have $\lambda_1-\lambda_0>\lambda_2-\lambda_1$ we compute
$\lambda_4 =\lambda_1 + \omega\left(\lambda_0-\lambda_1\right)$.
Then the new triplets can be either $\lambda_0$, $\lambda_4$ and
$\lambda_1$, or $\lambda_4$, $\lambda_1$ and $\lambda_2$.
The reader can verify that the interval decreases steadily
although not as fast as in the case of bisection where the
interval is halved at each iteration. Since the algorithm is using
the golden section it is called golden section search.
By construction the golden section search algorithm makes sure
that the optimum is always located between the points $\lambda_0$
and $\lambda_2$. Thus, at each iteration, the quantity
$\lambda_2-\lambda_0$ give an estimate of the error on the
position of the optimum.
\subsection{Optimizing in one dimension --- Smalltalk implementation}
\marginpar{Figure \ref{fig:soptimizingclasses} with the box {\bf
OneVariableFunctionOptimizer} grayed.} Listing
\ref{ls:optimizerOneDim} shows the class {\tt
DhbOneVariableFunctionOptimizer} implementing the search for an
optimum of a one-variable function using the golden section
search. The following code example shows how to use this class to
find the maximum of the gamma distribution discussed in section
\ref{sec:gammadist}.
\begin{codeExample}
\begin{verbatim}
| distr finder maximum |
distr := DhbGammaDistribution shape: 2 scale: 5.
finder := DhbOneVariableFunctionOptimizer maximizingFunction: distr.
maximum := finder evaluate.
\end{verbatim}
\end{codeExample}
The first line after the declarations creates a new instance of a
gamma distribution with parameters $\alpha = 2$ and $\beta = 5$.
The next line creates an instance of the optimum finder. The
selector used to create the instance selects a search for a
maximum. The last line is the familiar statement to evaluate the
iterations --- that is, performing the search for the maximum ---
and to retrieve the result. Since no initial value was supplied
the search begins at a random location.
The class {\tt DhbOneVariableFunctionOptimizer} is a subclass of
the class {\tt FunctionOptimizer}. It does not need any additional
instance variables. The golden section is kept as a class variable
and is calculated at the first time it is needed.
At each iteration the method {\tt nextXValue} is used to compute
the next position at which the function is evaluated. This
corresponding new optimizing point is added to the collection of
best points. Then, the method {\tt indexOfOuterPoint} is used to
determine which point must be discarded: it is always the second
point on either side of the best point. The precision of the
result is estimated from the bracketing interval in the method
{\tt computePrecision}, using relative precision (of course!).
The method {\tt addPoint:} of the superclass can be used to supply
an initial bracketing interval. The method {\tt
computeInitialValues} first checks whether a valid bracketing
interval has been supplied into the collection of best points. If
this is not the case, a search for a bracketing interval is
conducted using the class {\tt DhbOptimizingBracketFinder}
described in section \ref{sec:sbracket}. The instance of the
bracket finder is created with the method {\tt forOptimizer:} so
that its strategy and goal function are taken over from the golden
section optimum finder.
\begin{listing} Smalltalk golden section optimum finder \label{ls:optimizerOneDim}
\input{Smalltalk/Minimization/DhbOneVariableFunctionOptimizer}
\end{listing}
\section{Bracketing the optimum in one dimension}
\label{sec:bracket} As we have seen in section \ref{sec:optonedim}
the golden section algorithm requires the knowledge of a
bracketing interval. This section describes a very simple
algorithm to obtain a bracketing interval with certainty if the
function is continuous and does indeed have an optimum of the
sought type.
The algorithm goes as follows. Take two points $\lambda_0$ and
$\lambda_1$. If they do not exist, pick up some random values
(random generators are described in section \ref{sec:random}). Let
us assume that $g\left(\lambda_1\right)$ is better than
$g\left(\lambda_0\right)$.
\begin{enumerate}
\item Let $\lambda_2=3\lambda_1-2\lambda_0$, that is, $\lambda_2$ is twice as far from $\lambda_1$ than
$\lambda_0$ and is located on the other side, toward the optimum.
\item If $g\left(\lambda_1\right)$ is better than
$g\left(\lambda_2\right)$ we have a bracketing interval; the
algorithm is stopped.
\item Otherwise, set $\lambda_0=\lambda_1$ and $\lambda_1=\lambda_2$ and go back to step 1.
\end{enumerate}
The reader can see that the interval
$\left[\lambda_0,\lambda_1\right]$ is increasing at each step.
Thus, if the function has no optimum of the sought type, the
algorithm will cause a floating overflow exception quite rapidly.
\noindent The implementation in each language have too little in
common. The common section is therefore omitted.
\subsection{Bracketing the optimum --- Smalltalk implementation}
\label{sec:sbracket} \marginpar{Figure
\ref{fig:soptimizingclasses} with the box {\bf
OptimizingBracketFinder} grayed.} Listing
\ref{ls:optimizerbracket} shows the Smalltalk code of the class
implementing the search algorithm for an optimizing bracket. The
class {\tt DhbOptimizingBracketFinder} is a subclass of class {\tt
DhbOneVariableFunctionOptimizer} from section
\ref{ls:optimizerOneDim}. This was a convenient, but not
necessary, choice to be able to reuse most of the management and
accessor methods. The methods pertaining to the algorithm are of
course quite different.
Example of use of the optimizing bracket finder can be found in
method {\tt computeInitialValues} of class {\tt
DhbOneVariableFunctionOptimizer} (\cf listing
\ref{ls:optimizerOneDim}).
The method {\tt setInitialPoints:} allows to use the collection of
best points of another optimizer inside the class. This breach to
the rule of hiding the implementation can be tolerated here
because the class {\tt DhbOptimizingBracketFinder} is used
exclusively with the class {\tt DhbOneVariableFunctionOptimizer}.
It allows the two class to use the same sorted collection of
optimizing points. If no initial point has been supplied, it is
obtained from a random generator.
The {\sl precision} calculated in the method {\tt
evaluateIteration} is a large number, which becomes negative as
soon as the condition to terminate the algorithm is met. Having a
negative precision causes an iterative process as defined in
chapter \ref{ch:iteration} to stop.
\begin{listing} Smalltalk optimum bracket finder \label{ls:optimizerbracket}
\input{Smalltalk/Minimization/DhbOptimizingBracketFinder}
\end{listing}
\section{Powell's algorithm}
\label{sec:powell}Powell's algorithm is one of many hill climbing
algorithms \cite{Press}. The idea underlying Powell's algorithm is
that once an optimum has been found in one direction, the chance
for the biggest improvement lies in the direction perpendicular to
that direction. A mathematical formulation of this sentence can be
found in \cite{Press} and references therein. Powell's algorithm
provides a way to keep track of the next best direction at each
iteration step.
\noindent The original steps of Powell's algorithm are as follow:
\begin{enumerate}
\item Let ${\bf x}_0$ the best point so far and initialize a
series of vectors ${\bf v}_1,\ldots,{\bf v}_n$ forming the system
of reference ($n$ is the
dimension of the vector ${\bf x}_0$); in
other words the components of the vector ${\bf v}_k$ are all
zero except for the $k\th$ components, which is one.
\item Set $k=1$.
\item Find the optimum of the goal function along the direction ${\bf
v}_k$ starting from point ${\bf x}_{k-1}$. Let ${\bf x}_k$ be
the position of that optimum.
\item Set $k=k+1$. If $k\leq n$, go back to step 3.
\item For $k=1,\ldots,n-1$, set ${\bf v}_k={\bf v}_{k-1}$.
\item Set ${\bf v}_n ={ {\bf x}_n-{\bf x}_0 \over \left| {\bf x}_n-{\bf x}_0\right|}$.
\item Find the optimum of the goal function along the direction ${\bf
v}_n$. Let ${\bf x}_{n+1}$ be the position of that optimum.
\item If $\left| {\bf x}_n-{\bf x}_0\right|$ is less than the
desired precision, terminate.
\item Otherwise, set ${\bf x}_0={\bf x}_{n+1}$ and go back to
step 1.
\end{enumerate}
There is actually two hill climbing algorithms within each other.
The progression obtained by the inner loop is taken as the
direction in which to continue the search.
Powell recommends to use this algorithm on goal functions having a
quadratic behaviour near the optimum. It is clear that this
algorithm cannot be used safely on any function. If the goal
function has narrow valleys, all directions ${\bf v}_1,\ldots,{\bf
v}_n$ will become colinear when the algorithm ends up in such a
valley. Thus, the search is likely to end up in a position where
no optimum is located. Press et al. \cite{Press} mention two
methods avoiding such problems: one method is quite complex and
the other slows down the convergence of the algorithm.
In spite of this {\it caveat}, we implement Powell's algorithm in
its original form. However, we recommend its use only in the
vicinity of the minimum. In section \ref{sec:multistrategy} we
show how other techniques can be utilized to read the vicinity of
the optimum, where Powell's algorithm can safely be used to make
the final determination of the optimum's position.
\subsection{Powell's algorithm --- General implementation}
Since the class implementing the vector projected function
$g\left(\lambda\right)$ described in section
\ref{sec:sgeneralOpt} keep the vector
${\bf x}_0$ and ${\bf v}$ in instance variables, there is no need
to allocate explicit storage for the vectors ${\bf
x}_1,\ldots,{\bf x}_n$ and ${\bf v}_1,\ldots,{\bf v}_n$. Instead,
the class implementing Powell's algorithm keep an array of vector
projected functions with the corresponding parameters. Then, the
manipulation of the vector ${\bf x}_1,\ldots,{\bf x}_n$ and ${\bf
v}_1,\ldots,{\bf v}_n$ is made directly on the projected function.
Since the origin of the projected function is always the starting
value, ${\bf x}_k$, the initial value for the search of the
optimum of the function $g\left(\lambda\right)$ is always 0.
The method {\tt initializeIterations} allocated a series of vector
projected functions starting with the axes of the reference
system. This method also creates an instance of a one dimensional
optimum finder kept in the instance variable, {\tt
unidimensionalFinder}. The goal function of the finder is
alternatively assigned to each of the projected functions.
We made a slight modification to Powell's algorithm. If the norm
of the vector ${\bf x}_n-{\bf x}_0$ at step 6 is smaller than the
desired precision, the directions are only rotated, the assignment
of step 6 is not done and the search of step 7 is omitted.
The precision computed at the end of each iterations is the
maximum of the relative change on all components between the
vectors ${\bf x}_n$ and ${\bf x}_0$.
\subsection{Powell's algorithm --- Smalltalk implementation}
\marginpar{Figure \ref{fig:soptimizingclasses} with the box {\bf
HillClimbingOptimizer} grayed.} Listing \ref{ls:optimizerpowell}
shows the implementation of Powell's algorithm in Smalltalk. The
following code example shows how to find the maximum of a vector
function
\begin{codeExample}
\label{ex:spowell}
\begin{verbatim}
| fBlock educatedGuess hillClimber result |
\end{verbatim}
{\tt fBlock :=<\sl the goal function\tt >}\hfil\break
{\tt educatedGuess :=<\sl a vector not too far from the optimum\tt >}
\begin{verbatim}
hillClimber := DhbHillClimbingOptimizer maximizingFunction: fBlock.
hillClimber initialValue: educatedGuess.
result := hillClimber evaluate.
\end{verbatim}
\end{codeExample}
The class {\tt DhbHillClimbingOptimizer} is a subclass of class
{\tt DhbFunctionOptimizer}. It has only one additional instance
variable, {\tt unidimensionalFinder}, to hold the one-dimensional
optimizer used to find an optimum of the goal function along a
given direction.
The method {\tt evaluateIteration} uses the method {\tt
inject:into:} to perform steps 2-4 of the algorithm. Similarly
step 5 of the algorithm is performed with a method {\tt
inject:into:} within the method {\tt shiftDirection}. This mode of
using the iterator method {\tt inject:into:} --- performing an
action involving two consecutive elements of an indexed collection
--- is somewhat unusual, but highly convenient\cite{Beck}. The method {\tt
minimizeDirection:} implements step 3 of the algorithm.
\begin{listing} Smalltalk implementation of Powell's algorithm
\label{ls:optimizerpowell}
\input{Smalltalk/Minimization/DhbHillClimbingOptimizer}
\end{listing}
\section{Simplex algorithm}
\label{sec:simplex} The simplex algorithm, invented by Nelder and
Mead, provides an efficient way to find a good approximation of
the optimum of a function starting from any place \cite{Press}.
The only trap into which the simplex algorithm can run into is a
local optimum. On the other hand, this algorithm does not converge
very well in the vicinity of the optimum. Thus, it must not be
used with the desired precision set to a very low value. Once the
optimum has been found with the simplex algorithm, other more
precise algorithms can then be used, such as the ones describes in
section \ref{sec:optimum} or \ref{sec:powell}. MINUIT uses a
combination of simplex and Newton algorithms. Our implementation
of general purpose optimizer uses a combination of simplex and
Powell algorithms.
A simplex in a $n$-dimensional space is a figure formed with $n+1$
summits. For example, a simplex in a 2-dimensional space is a
triangle; a simplex in a 3-dimensional space is a tetrahedron.
Let us now discuss the algorithm for finding the optimum of a
function with a simplex.
\begin{enumerate}
\item Pick up $n+1$ points in the search space and evaluate
the goal function at each of them. Let ${\bf A}$ be the summit
yielding the worst function's value.
\item if the size of the simplex is smaller than the desired
precision, terminate the algorithm.
\item Calculate ${\bf G}$, the center of gravity of the $n$ best points,
that is all points except ${\bf A}$.
\item Calculate the location of the symmetric point of ${\bf A}$
relative to ${\bf G}$: ${\bf A}^{\prime}=2{\bf G}-{\bf A}$.
\item If $f\left({\bf A}^{\prime}\right)$ is not the best value
found so far go to step 9.
\item Calculate the point ${\bf A}^{\prime\prime}=2{\bf A}^{\prime}-{\bf
G}$, that is a point twice as far from ${\bf G}$ as ${\bf
A}^{\prime}$.
\item If $f\left({\bf A}^{\prime\prime}\right)$ is a
better value than $f\left({\bf A}^{\prime}\right)$ build a new
simplex with the point ${\bf A}$ replaced by the point ${\bf
A}^{\prime\prime}$ and go to step 2.
\item Otherwise, build a new
simplex with the point ${\bf A}$ replaced by the point ${\bf
A}^{\prime}$ and go to step 2.
\item Calculate the point ${\bf B}= {\displaystyle\left({\bf G}+{\bf
A}\right)\over\displaystyle 2}$.
\item If $f\left({\bf B}\right)$ yields the best value found so far
build a new simplex with the point ${\bf A}$ replaced by the point ${\bf
A}^{\prime\prime}$ and go to step 2.
\item Otherwise build a new simplex obtained by dividing all
edges leading to the point yielding the best value by 2 and go
back to step 2.
\end{enumerate}
Figure \ref{fig:simplexsample} shows the meaning of the operations
involved in the algorithm in the 3 dimensional case.
\begin{figure}
\centering\includegraphics[width=10cm]{Figures/Simplex}
\caption{Operations of the simplex
algorithm}\label{fig:simplexsample}
\end{figure}
Step 6 makes the simplex grow into the direction where the
function is the best so far. Thus, the simplex becomes elongated
in the expected direction of the optimum. Because of its
geometrical shape, the next step is necessarily taken along
another direction, causing an exploration of the regions
surrounding the growth obtained at the preceding step. Over the
iterations, the shape of the simplex adapts itself to narrow
valleys where the hill climbing algorithms notoriously get into
trouble. Steps 9 and 11 ensures the convergence of the algorithm
when the optimum lies inside the simplex. In this mode the simplex
works very much like the golden section search or the bisection
algorithms.
Finding the initial points can be done in several ways. If a good
approximation of the region where the maximum might be located can
be obtained one uses that approximation as a start and generate
$n$ other points by finding the optimum of the function along each
axis. Otherwise, one can generate random points and select $n+1$
points yielding the best values to build the initial simplex. In
all cases, one must make sure that the initial simplex has a
non-vanishing size in all dimensions of the space. Otherwise the
algorithm will not reach the optimum.
\subsection{Simplex algorithm --- General implementation}
The class implementing the simplex algorithm belong to the
hierarchy of the iterative processes discussed in chapter
\ref{ch:iteration}. The method {\tt evaluateIteration} directly
implements the steps of the algorithm as described above. The
points ${\bf G}$, ${\bf A}^{\prime}$, ${\bf A}^{\prime\prime}$ and
${\bf B}$ are calculated using the vector operations described in
section \ref{sec:linearalgebra}.
The routine {\tt initializeIterations} assumes that an initial
value has been provided. It then finds the location of an optimum
of the goal function along each axis of the reference system
starting each time from the supplied initial value, unlike hill
climbing algorithms. Restarting from the initial value is
necessary to avoid creating a simplex with a zero volume. Such
mishaps can arise when the initial value is located on an axis of
symmetry of the goal function. This can happen quite frequently
with {\sl educated guesses}.
\subsection{Simplex algorithm --- Smalltalk implementation}
\marginpar{Figure \ref{fig:soptimizingclasses} with the box {\bf
SimplexOptimizer} grayed.} Listing \ref{ls:optimizersimplex} shows
the Smalltalk implementation of the simplex algorithm. The
following code example shows how to invoke the class to find the
minimum of a vector function.
\begin{codeExample}
\begin{verbatim}
| fBlock educatedGuess simplex result |
\end{verbatim}
{\tt fBlock :=<\sl the goal function\tt >}\hfil\break
{\tt educatedGuess :=<\sl a vector in the search space\tt >}
\begin{verbatim}
simplex := DhbSimplexOptimizer minimizingFunction: fBlock.
simplex initialValue: educatedGuess.
result := simplex evaluate.
\end{verbatim}
\end{codeExample}
Except for the line creating the instance of the simplex
optimizer, this code example is identical to the example of
Powell's hill climbing algorithm (code example \ref{ex:spowell}).
The class {\tt DhbSimplexOptimizer} is a subclass of class {\tt
DhbFunctionOptimizer}. In order to be able to use the iterator
methods efficiently, the worst point of the simplex, ${\bf A}$, is
held in a separate instance variable {\tt worstPoint}. As we do
not need to know the function's value $f\left({\bf A}\right)$, it
is kept as a vector. The remaining points of the simplex are kept
in the instance variable {\tt bestPoints} of the superclass. Since
this collection is sorted automatically when points are inserted
to it, there is no explicit sorting step.
\begin{listing} Smalltalk implementation of simplex algorithm
\label{ls:optimizersimplex}
\input{Smalltalk/Minimization/DhbSimplexOptimizer}
\end{listing}
\section{Genetic algorithm}
All optimizing algorithm discussed so far have one common flaw:
they all terminate when a local optimum is encountered. In most
problems, however, one wants to find the absolute optimum of the
function. This is especially true if the goal function represents
some economical merit.
\noindent One academic example is the maximization of the function
\begin{equation}
\label{eq:geneticCase}
f\left({\bf x}\right)={ \sin^2\left|{\bf x}\right| \over\left|{\bf
x}\right|^2}.
\end{equation}
This function has an absolute maximum at ${\bf x}=0$, but all
algorithms discussed so far will end up inside a ring
corresponding to $\left|{\bf x}\right|=n\pi/2$ where $n$ is any
positive odd integer.
In 1975 John Holland introduced a new type of algorithm --- dubbed
genetic algorithm --- because it tries to mimic the evolutionary
process identified as the cause for the diversity of living
species by Charles Darwin. In a genetic algorithm the elements of
the search space are considered as the chromosomes of individuals;
the goal function is considered as the measure of the fitness of
the individual to adapt itself to its
environment\cite{BerLin}\cite{Koza}. The iterations are aping (pun
intended) the Darwinian principle of survival and reproduction. At
each iteration, the fittest individuals survive and reproduce
themselves. To bring some variability to the algorithm mutation
and crossover of chromosomes are taken into account.
Mutation occurs when one gene of a chromosome is altered at
reproduction time. Crossover occurs when two chromosomes break
themselves and recombine with the piece coming from the other
chromosome. These processes are illustrated on figure
\ref{fig:crossover}.
\begin{figure}
\centering\includegraphics[width=11cm]{Figures/Crossover}
\caption{Mutation and crossover reproduction of
chromosomes}\label{fig:crossover}
\end{figure}
The point where the chromosomes are breaking is called the
crossover point. Which individual survives and reproduces itself,
when and where mutation occurs and when and where a crossover
happens is determined randomly. This is precisely the random
nature of the algorithm which gives it the ability to jump out of
a local optimum to look further for the absolute optimum.
\rubrique{Mapping the search space on chromosomes} To be able to
implement a genetic algorithm one must establish how to represent
the genes of a chromosome. At the smallest level the genes could
be the bits of the structure representing the chromosome. If the
search space of the goal function do cover the domain generated by
all possible permutations of the bits, this is a good approach.
However, this is not always a practical solution since some bit
combinations may be forbidden by the structure. For example, some
of the combinations of a 64 bit word do not correspond to a valid
floating point number.
In the case of the optimization of a vector function, the simplest
choice is to take the components of the vector as the genes.
Genetic algorithms are used quite often to adjust the parameters
of a neural network \cite{BerLin}. In this case, the chromosomes
are the coefficients of each neuron. Chromosomes can even be
computer subprograms in the case of genetic programming
\cite{Koza}. In this latter case, each individual is a computer
program trying to solve a given problem.
Figure \ref{fig:geneticFlow} shows a flow diagram of a general
genetic algorithm.
\begin{figure}
\centering\includegraphics[width=10cm]{Figures/GeneticFlow}
\caption{General purpose genetic algorithm}\label{fig:geneticFlow}
\end{figure}
The reproduction of the individual is taken literally: a copy of
the reproducing individual is copied into the next generation. The
important feature of a generic algorithm is that the building of
the next generation is a random process. To ensure the survival of
the fittest, the selection of the parents of the individuals of
the next generation is performed at random with uneven
probability: the fittest individuals have a larger probability of
being selected than the others. Mutation enables the algorithm to
create individuals having genes corresponding to unexplored
regions of the search space. Most of the times such mutants will
be discarded at the next iteration; but, in some cases, a mutation
may uncover a better candidate. In the case of the function of
equation \ref{eq:geneticCase}, this would correspond to jumping
from one ring to another ring closer to the function's maximum.
Finally the crossover operation mixes good genes in the hope of
building a better individual out of the properties of good
inidividuals. Like mutation the crossover operation gives a
stochastic behavior to the algorithm enabling it to explore
uncharted regions of the search space.
\note{Because of its stochastic nature a genetic algorithm is the
algorithm of choice when the goal function is expressed on
integers.}
\subsection{Genetic algorithm --- General implementation}
\label{sec:gengenetic} The left hand side of the diagram of figure
\ref{fig:geneticFlow} is quite similar to the flow diagram of an
iterative process (\cf figure \ref{fig:itercoarse} in chapter
\ref{ch:iteration}). Thus, the class implementing the genetic
algorithm is a subclass of the iterative process class discussed
in chapter \ref{ch:iteration}.
The {\sl genetic} nature of the algorithm is located in the right
hand side of the diagram of figure \ref{fig:geneticFlow}. As we
have mentioned before the implementation of the chromosomes is
highly problem dependent. All operations located in the top
portion of the mentioned area can be expressed in generic terms
without any knowledge of the chromosomic implementation. to handle
the lower part of the right hand side of the diagram of figure
\ref{fig:geneticFlow}, we shall implement a new object, the
chromosome manager.
One should also notice that the value of the function is not
needed when the next generation is build. Thus, the chromosome
manager does not need to have any knowledge of the goal function.
The goal function comes into play when transfering the next
generation to the {\sl mature} population, that is, the population
used for reproduction at the next iteration . At the maturity
stage, the value of the goal function is needed to identify the
fittest individuals. In our implementation, the next generation is
maintained by the chromosome manager whereas the population of
mature individuals is maintained by the object in charge of the
genetic algorithm which has the knowledge of the goal function.
\noindent The chromosome manager has the following instance
variables:
\begin{description}
\item[\tt populationSize] contains the size of the population;
one should pick up a large enough number to be able to cover the
search space efficiently: the larger the dimension of the space
search space, the larger must be the population size;
\item[\tt rateOfMutation] contains the probability of having a
mutation while reproducing;
\item[\tt rateOfCrossover] contains the probability of having a
crossover while reproducing.
\end{description}
All of these variables have getter and setter accessor methods. In
addition a convenience instance creation method is supplied to
create a chromosome manager with given values for all three
instance variables. The chromosome manager implements the
following methods:
\begin{description}
\item[\tt isFullyPopulated] to signal that a sufficient number of individuals
has been generated into the population;
\item[\tt process] to process a pair of individuals; this method
does the selection of the genetic operation and applies it;
individuals are processed by pair to always have a possibility
of crossover;
\item[\tt randomnizePopulation] to generate a random population;
\item[\tt reset] to create an empty population for the next
generation.
\end{description}
Finally the chromosome manager must also implement methods
performing each of the genetic operations: reproduction, mutation
and crossover. The Smalltalk implementation supplies methods that
returns a new individual.
The genetic optimizer is the object implementing the genetic
algorithm proper. It is a subclass of the iterative process class
described in chapter \ref{sec:iterrel}. In addition to the
handling of the iterations the genetic optimizer implements the
steps of the algorithm drawn on the top part of the right hand
side of the diagram of figure \ref{fig:geneticFlow}. It has one
instance variable containing the chromosome manager with which it
will interact. The instance creation method take three arguments:
the function to optimize, the optimizing strategy and the
chromosome manager.
The method {\tt initializeIteration} asks the chromosome manager
to supply a random population. The method {\tt evaluateIteration}
performs the loop of the right hand side of the diagram of figure
\ref{fig:geneticFlow}. It selects a pair of parents on which the
chromosome manager performs the genetic operation.
Selecting the genetic operation is performed with a random
generator. The values of the goal function are used as weights.
Let $f\left(p_i\right)$ be the value of the goal function for
individual $p_i$ and let $p_b$ and $p_w$ be respectively the
fittest and the lest fit individual found so far ($b$ stands for
best and $w$ stands for worst). One first computes the
unnormalized probability:
\begin{equation}
\tilde{P}_i={\displaystyle f\left(p_i\right) - f\left(p_w\right)
\over\displaystyle f\left(p_b\right) - f\left(p_w\right)}.
\end{equation}
This definition ensures that $\tilde{P}_i$ is always comprised
between 0 and 1 for any goal function. Then we can use the
discrete probability
\begin{equation}
\label{eq:geneticprob}