EVOLUTIONARY COMPUTING CMT563




         Antonia J. Jones




                                6 November 2005
Antonia J. Jones: 6 November 2005




UNIVERSITY OF WALES, CARDIFF
DEPARTMENT OF COMPUTER SCIENCE (COMSC)


         COURSE:                     M.Sc. CMT563
         MODULE:                     Evolutionary Computing
         LECTURER:                   Antonia J. Jones, COMSC
         DATED:                      Originally 15 January 1997
         LAST REVISED:               6 November 2005
         ACCESS:                     Lecturer (extn 5490, room N2.15).

Overhead slides are posted on:

                                    http://users.cs.cf.ac.uk:81/Antonia.J.Jones/

electronically as pdf Acrobat files. It is not normally necessary for students attending the course to print this file,
as complete sets of printed slides will be issued.

©2001 Antonia J. Jones. Permission is hereby granted to any web surfer for downloading, printing and use of this
material for personal study only. Copyright permission is explicitly withheld for modification, re-circulation or
publication by any other means, or commercial exploitation in any manner whatsoever, of this file or the material
therein.

Bibliography:

MAIN RECOMMENDATIONS

The recommended text for the course is:

    [Hertz 1991] J. Hertz, A. Krogh, and R. G. Palmer, Introduction to the Theory of Neural Computation.
    Addison-Wesley, 1991. ISBN 0-201-51560-1 (pbk).


A cheaper alternative is

Yoh-Han Pao, Adaptive pattern recognition and neural networks. Addison-Wesley, 1989. ISBN 0-201-12584-6.
Price (UK) £31.45.

A useful addition for the Mathematica labs is:

    Simulating Neural Networks with Mathematica. James A. Freeman. Addison-Wesley, 1994. ISBN
    0-201-56629-X.


These books cover most of the course, except the theory of genetic algorithms. The first is the recommended book
for the course because it gives excellent mathematical analyses of many of the models we shall discuss. The
second includes some interesting material on the application of Bayesian statistics and fuzzy logic to adaptive
pattern recognition; it is clearly written, and the emphasis is on computing rather than physiological models.



The principal sources of inspiration for work in neural and evolutionary computation are:

         • E. R. Kandel, J. H. Schwartz, and T. M. Jessell. Principles of Neural Science (Third Edition),
         Prentice-Hall Inc., 1991. ISBN 0-8385-8068-8.

         • J. D. Watson, Nancy H. Hopkins, J. W. Roberts, Joan A. Steitz, and A. M. Weiner. Molecular
         Biology of the Gene, Benjamin/Cummings Publishing Company Inc., 1988. ISBN 0-8053-9614-4.

When you see how big they are you will understand why! It is a sobering thought that most of the knowledge in
these tomes has been obtained in the last 20 years.

Although extensive references are provided with the course notes (these are also a useful source of information for
projects in Neural Computing), a definitive bibliography for the computing aspects of the subject is:

The 1989 Neuro-Computing Bibliography. Ed. Casimir C. Klimasauskas, MIT Press / Bradford Books. 1989. ISBN
0-262-11134-9.

Finally, the key papers up to 1988 can be found together in:

Neurocomputing: Foundations of Research. Ed. James A. Anderson and Edward Rosenfeld, MIT Press 1988.
ISBN 0-262-01097-6.

NETS - OTHER (HISTORICALLY) INTERESTING MATERIAL

Perceptrons, Marvin Minsky and Seymour Papert, MIT Press 1972. ISBN 0-262-63022-2 (was reprinted recently).

Neural Assemblies, G. Palm, Springer-Verlag, 1982.

Self-Organisation and Associative Memory, T. Kohonen, Springer-Verlag, 1984.

Parallel Models of Associative Memory, G. E. Hinton and J. A. Anderson, Lawrence Erlbaum, 1981.

Connectionist Models and Their Applications, Special Issue of Cognitive Science 9, 1985.

Computer, March 1988. Artificial Neural Systems, IEEE.

Neural Computing Architectures, Ed. I. Aleksander, Kogan Page, December, 1988.

Parallel Distributed Processing. Vol. I Foundations. Vol. II Psychological and Biological Models. David E.
Rumelhart et al., MIT Press / Bradford Books. 1986. ISBN 0-262-18123-1 (Set).

Explorations in Parallel Distributed Processing - A Handbook of Models, Programs, and Exercises. James L.
McClelland and David E. Rumelhart, MIT Press / Bradford Books. 1988. ISBN 0-262-63113-X. (Includes some
very useful software for an IBM PC; there is also a newer version with software for the Mac).

GENERAL

An Introduction to Cybernetics, W. Ross Ashby, John Wiley and Sons, 1964.

A classic text on cybernetics.

Vision: A computational investigation into the human representation and processing of visual information, David
Marr, W. H. Freeman and Company, 1982. ISBN 0-7167-1284-9.

One of the classic works in computational vision.

Artificial Intelligence, F. H. George, Gordon & Breach, 1985.

Useful textbook on AI.

GENETIC ALGORITHMS/ARTIFICIAL LIFE

Artificial Life, Ed. Christopher G. Langton, Addison-Wesley 1989. ISBN 0-201-09356-1 pbk.

A fascinating collection of essays from the first AL workshop at Los Alamos National Laboratory in 1987. The
book covers an enormous range of topics (genetics, self-replication, cellular automata, etc.) on this subject in a very
readable way but with great technical authority. There are innumerable figures, some forty colour plates and even
some simple programs to experiment with. All this leads to a book that is beautifully presented and compulsive
reading for anyone with a modest background in the field.

Synthetic systems that exhibit behaviour characteristic of living systems complement the traditional analysis of
living systems practised by the biological sciences. It is an approach to the study of life that would hardly be
feasible without the advent of the modern computer and may eventually lead to a theory of living systems which
is independent of the physical realisation of the organisms (carbon based, in this neck of the woods).

The primary goal of the first workshop was to collect different models and methodologies from scattered
publications and to present as many of these as possible in a uniform way. The distilled essence of the book is the
theme that Artificial Life involves the realisation of lifelike behaviour on the part of man-made systems consisting
of populations of semi-autonomous entities whose local interactions with one another are governed by a set of
simple rules. Such systems contain no rules for the behaviour of the population at the global level.

Adaptation in Natural and Artificial Systems, John H. Holland, University of Michigan Press, 1975.

The book that started Genetic Algorithms, a classic.

Genetic Algorithms and Simulated Annealing, Ed. Lawrence Davis, Pitman Publishing 1987. ISBN 0-273-08771-1
(UK), 0-934613-44-3 (US).

A collection of interesting papers on GA related subjects.

Genetic Algorithms in Search, Optimization, and Machine Learning, David E. Goldberg, Addison-Wesley, 1989.
ISBN 0-201-15767-5.

The first real text book on GAs.






                                                                     CONTENTS


I What is evolutionary computing? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
         Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
         A general framework for neural models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
         Hebbian learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
         The need for machine learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

II Genetic Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .     14
         Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   14
         The archetypal GA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .        14
         Design issues - what do you want the algorithm to do? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                          18
                  Rapid convergence to a global optimum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                         19
                  Produce a diverse population of near optimal solutions in different `niches' . . . . . . . . . . .                                        19
         * Results and methods related to the TSP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                   20
         Evolutionary Divide and Conquer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                  21
         Chapter references . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .       25

III Hopfield networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   29
         Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   29
         Hopfield nets and energy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .           29
         The outer product rule for assigning weights. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                    31
         Networks for combinatoric search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                 32
         Assignment of weights for the TSP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                  33
         * The Hopfield and Tank application to the TSP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                         36
         Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    37
         Chapter references . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .       37

IV The WISARD model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .           40
       Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .     40
       Wisard model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .         41
       WISARD - analysis of response . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                  43
       Comparison of storage requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                     44
       Chapter references . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .         45

V Feedforward networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .        46
        Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    46
        Backpropagation - mathematical background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                         46
                 The output layer calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                 47
                 The rule for adjusting weights in hidden layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                            48
        The conventional model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .             48
        Problems with backpropagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                 49
        The gamma test - a new technique . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                  50
        * Metabackpropagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .             53
        * Neural networks for adaptive control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                  53
        Chapter references . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .        58

* VI The chaotic frontier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .     59
        Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    59
        Mathematical background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .               59
        Chaos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   60



            Chaos in biology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .      61
            Controlling chaos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .       62
            The original OGY control law . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .              62
            Chaotic conventional neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                  64
            Controlling chaotic neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                 65
                     Control varying T in a particular layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                    67
                     Using small variations of the inputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                   67
            Time delayed feedback and a generic scheme for chaotic neural networks . . . . . . . . . . . . . . . . . . .                                      70
                     Example: Controlling the Hénon neural network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                              71
            Chapter references . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .      73

COURSEWORK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79


                                                                 LIST OF FIGURES


Figure 1-1 The stylised version of a standard connectionist neuron. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Figure 1-2 Information processing capability [From: Mind Children, Hans Moravec]. . . . . . . . . . . . . . . . . 12
Figure 1-3 Storage capacity. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Figure 2-1 Generic model for a genetic algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Figure 2-2 Standard genetic operators. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Figure 2-3 Premature convergence - no sharing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Figure 2-4 Solution to 50 City Problem using Karp's deterministic bisection method 1. . . . . . . . . . . . . . . . 22
Figure 2-5 The EDAC (top) and simple 2-Opt (bottom) time complexity (log scales). . . . . . . . . . . . . . . . . . 23
Figure 2-6 The 200 generation EDAC 5% excess solution for a 5000 city problem. . . . . . . . . . . . . . . . . . . . 24
Figure 2-7 EDACII mid-parent vs offspring correlation for 5000 cities [Valenzuela 1995]. . . . . . . . . . . . . . 25
Figure 2-8 EDACIII mid-parent vs offspring correlation for 5000 cities [Valenzuela 1995]. . . . . . . . . . . . . 25
Figure 3-1 Distance Connections. Each node (i, p) has inhibitory connections to the two adjacent columns whose
        weights reflect the cost of joining the three cities. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Figure 3-2 Exclusion connections. Each node (i, p) has inhibitory connections to all units in the same row and
        column. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Figure 4-1 Schematic of a 3-tuple recogniser. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
Figure 4-2 Continuous response of discriminators to the input word 'toothache' [From Neural Computing
        Architectures, Ed. I Aleksander]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Figure 4-3 A discriminator for centering on a bar [From Neural Computing, I. Aleksander and H. Morton].
          . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Figure 5-1 Solving the XOR problem with a hidden unit. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Figure 5-2 Feedforward network architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Figure 5-3 The previous layer calculation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Figure 5-4 The Water Tank Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Figure 5-5 Architecture for direct inverse neurocontrol. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Figure 5-6 Delta transformation of state differences: maps to the 2-4-2 network inputs for the Water Tank
        Problem. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Figure 5-7 Least squares fit to 200 data points with 20 nearest neighbours: Γ = 0.0332. . . . . . . . . . . . . . . . 56
Figure 5-8 Volume variation without adaptive training. 2-4-2 Network. MSE = 0.052. Linear Planner. . . . 57
Figure 5-9 Temperature variation without adaptive training. 2-4-2 Network MSE = 0.052. Linear Planner.
          . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Figure 5-10 The control signals generated by the 2-4-2 network without adaptive training. Linear Planner.
          . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Figure 6-1 Stable attractor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Figure 6-2 A chaotic time series. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Figure 6-3 The butterfly effect. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61


Figure 6-4 Intervals for which the variables are defined. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
Figure 6-5 Feedforward network as a dynamical system. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Figure 6-6 Chaotic attractor of Wang's neural network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Figure 6-7 The Ikeda strange attractor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Figure 6-8 Attractor for the chaotic 2-10-10-2 neural network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Figure 6-9 Bifurcation diagram x obtained by varying T in the output layer only. . . . . . . . . . . . . . . . . . . . . 68
Figure 6-10 Bifurcation diagram y obtained by varying T in the output layer only. . . . . . . . . . . . . . . . . . . . 68
Figure 6-11 Variations of x from initiation of control. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Figure 6-12 Variations of y from initiation of control. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Figure 6-13 Parameter changes during output layer control. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Figure 6-14 Bifurcation diagram for the output x(t+1) using an external variable added to the input x(t). . . 69
Figure 6-15 Bifurcation diagram for the output y(t+1) using an external variable added to the input x(t). . . 69
Figure 6-16 Variations of x from initiation of control. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Figure 6-17 Variations of y from initiation of control. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Figure 6-18 Parameter changes during input x control. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Figure 6-19 A general scheme for constructing a stimulus-response chaotic recurrent neural network: the chaotic
        "delayed" network is trained on suitable input-output data constructed from a chaotic time series; a
        delayed feedback control is applied to each input line; entry points for external stimulus are suggested,
        with a switch signal to activate the control module during external stimulation; signals on the delay lines
        or output can be observed at the "observation points". . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Figure 6-20 The control signal corresponding to the delayed feedback control shown in Figure 6-21. Note that
        the control signal becomes small. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Figure 6-21 Response signal on x(n-6) with control signal activated on x(n-6) using k = 0.441628, τ = 2 and
        without external stimulation after first 10 transient iterations. After n = 1000 iterations, the control is
        switched off. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Figure 6-22 Response signals on network output x(n), with control signal activated on x(n-6) using k = 0.441628,
        τ = 2 and with constant external stimulation sn added to x(n-6), where sn varies from -1.5 to 1.5 in steps
        of 0.1 at each 500 iterative steps (indicated by the change of hue of the plot points) after 20 initial
        transient steps. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
Figure 6-23 The control signal corresponding to the delayed feedback control shown in Figure 6-22. Note that
        the control signal becomes small even when the network is under changing external stimulation. . 72
Figure 6-24 Response signals on network output x(n), with control setup same as in Figure 6-22 but with
        Gaussian noise r added to external stimulation, i.e. sn + r, with σ = 0.05, at each iteration step. . . . . 73
Figure 6-25 The control signal corresponding to the delayed feedback control shown in Figure 6-24. . . . . 73
Figure 6-26 Response signals on network output x(n), with control experiment setup same as in Figure 6-22 but
        with Gaussian noise r added to external stimulation, i.e. sn + r, with σ = 0.15, at each iteration step.
          . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Figure 6-27 Response signals on network output x(n), with control experiment setup same as in Figure 6-22 but
        with Gaussian noise r added to external stimulation, i.e. sn + r, with σ = 0.3, at each iteration step.
          . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Standard genetic operators. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
Schematic of a 3-tuple recogniser. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

                                                            LIST OF ALGORITHMS

Algorithm 2-1 Archetypal genetic algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                     16
Algorithm 3-1 Hopfield network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .               31
Algorithm 5-1 The Gamma test. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .               52
Algorithm 5-2 Metabackpropagation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                  53
Algorithm 7-1 Generic GA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .            84
Algorithm 7-2 Generic Hopfield net. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .               86







                                                        I What is evolutionary computing?


         "
         dna tsap naht erutan enon-ro-lla na fo yldigir ssel hcum era hcihw seiroeht ot dael lliw siht fo llA
         .retcarahc ,lacitylana erom hcum dna ,lacirotanibmoc ssel hcum a fo eb lliw yehT .cigol lamrof tneserp
         ll i w cigol lamrof fo metsys wen siht taht eveileb su ekam ot snoitacidni suoremun era ereht ,tcaf nI
         si sihT .cigol htiw tsap eht ni deknil elttil neeb s a h h c i h w e n i l picsid rehtona ot resolc evom
         fo trap taht si ti dna ,nnamztloB morf deviecer saw ti mrof eht ni yliramirp ,scimanydomreht
         g n i r u s a e m d n a g n i t al u p i n a m o t s t c e p s a s t i f o e m o s n i t s e r a e n s e m o c h c i h w s c i s y h p l a c i t e r o e h t
                                                               ]403 .p ,5 .loV skroW detcelloC ,nnamueN nov[ ".noitamrofni



Introduction.

Evolutionary computing embraces models of computation inspired by living Nature. For example, evolution of
species by means of natural selection and the genetic operators of mutation, sexual reproduction and inversion can
be considered as a parallel search process. Perhaps we can tackle hard combinatoric search problems in computer
science by mimicking (in a very stylised form) the natural process of evolutionary search.
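As a concrete (if toy) illustration, the sketch below is a minimal generational genetic algorithm applied to the "one-max" problem (maximise the number of 1s in a bit string). This is not code from the course notes: the function name and every parameter value are arbitrary illustrative choices, and the operators are just fitness-proportional selection, one-point crossover and bitwise mutation.

```python
import random

def genetic_algorithm(fitness, n_bits=20, pop_size=40, generations=60,
                      p_cross=0.7, p_mut=0.01, seed=0):
    """Minimal generational GA: fitness-proportional (roulette-wheel)
    selection, one-point crossover and bitwise mutation."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(ind) for ind in pop]
        total = float(sum(scores))

        def select():
            # Spin the roulette wheel: individuals are chosen with
            # probability proportional to their fitness.
            r = rng.uniform(0.0, total)
            acc = 0.0
            for ind, s in zip(pop, scores):
                acc += s
                if acc >= r:
                    return ind
            return pop[-1]

        new_pop = []
        while len(new_pop) < pop_size:
            a, b = select()[:], select()[:]
            if rng.random() < p_cross:            # one-point crossover
                cut = rng.randrange(1, n_bits)
                a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
            for child in (a, b):
                for i in range(n_bits):           # bitwise mutation
                    if rng.random() < p_mut:
                        child[i] ^= 1
                new_pop.append(child)
        pop = new_pop[:pop_size]
    return max(pop, key=fitness)

best = genetic_algorithm(fitness=sum)   # one-max: fitness = number of 1s
print(sum(best))                        # close to the optimum of 20
```

The archetypal GA discussed in Chapter II follows this same generate-evaluate-select loop, differing mainly in representation and in the choice of operators.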

Evolution through natural selection drives the adaptation of whole species, but individual members of a species
can also adapt to a greater or lesser extent. The adaptation of individual behaviour on the basis of experience is
learning and stems from plasticity of the neural structures which convey and process information in animals.

Learning enables us to recognise previously encountered stimuli or experiences and modify our behaviour
accordingly. It facilitates prediction and control of the environment, both essential prerequisites to planning. All
of these are facets of what we loosely call intelligence.

Real-world intelligence is essentially a computational process. This is a contentious assertion known as "the strong
AI position". If it is true then the precise mechanism of computation (the hardware or neural wetware) ought to
be irrelevant to the actual principles of the computational process.

If this is indeed the case then the only obstacles to the construction of a truly intelligent artifact are our own
understanding of the computational processes involved and our technical capability to construct suitable and
sufficiently powerful computational devices.

A general framework for neural models.

Throughout this course we describe a number of neural models, each a variation on the connectionist paradigm
(often called Parallel Distributed Processing - PDP), which in turn is derived from networks of highly stylised
versions of the biological neuron.

It is useful to begin with an analysis of the various components of these models. There are seven major aspects of
a connectionist model:

         •   A set of processing units ui, each producing a scalar output xi(t) (1 ≤ i ≤ n).

         • A connectivity graph which determines the pattern of connections (links) from each unit to each of the
         other units in the network. We shall often suppose that each unit has n inputs, but there is no particular
         reason why all units should have the same number of inputs.



Although it is often convenient for theoretical discussions to consider fully interconnected networks, for very large
networks of either real or artificial neurons the relevant case is that of relatively sparse connectivity. The
connectivity graph then describes the fine topology of the network. This can be useful in practical applications:
for example, in speech recognition networks it is often helpful to have several copies of the same sub-net connected
to temporally distinct inputs. These sub-net copies act as a single feature detector and so can share their weights;
this effectively reduces the number of parameters needed to describe the full network and speeds up learning. It is
sufficient to be given a list of inputs and outputs for each node, since from these we can recover the connectivity
graph.

         !   A set of parameters pi1,...,pik, fixed in number, attached to each unit ui, which are adjusted during
         learning. Most commonly k = n and the parameters are weights wij (1 ≤ j ≤ n), where wij is often taken
         to be associated with the link from j to i, or in biological terms associated with the synaptic gap.

         !   An activation function for each unit, neti = neti(x1,...,xn;pi1,...,pik), which combines the inputs to ui into
         a scalar value. In the commonly used model neti = Σj wijxj.

It is important to realise that the basic principle of neural networks is that of simple (but arbitrary) computational
function at each node. Learning when it occurs can be considered as an adjustment of the parameters associated
with a node based on information locally available to the node. ‘Locally’ here means as specified by the
connectivity graph. This information often takes the form of a correlation between the firings of adjacent nodes,
but it could be a more sophisticated calculation. Thus we are really dealing with a very general class of parallel
algorithms. The concentration on the ‘weights associated with links’ model has arisen partly because of the
biological precedent, because of the extreme simplicity of the computational function of a node, and because this
special case has been shown to be of practical interest.
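This node-level computation is easy to make concrete. The following is a minimal Python sketch of the conventional model (the function names and example weights below are illustrative, not from the notes): a weighted-sum activation followed by a smooth sigmoidal output.

```python
import math

def net_input(weights, inputs):
    """Activation: the weighted sum neti = sum over j of wij * xj."""
    return sum(w * x for w, x in zip(weights, inputs))

def sigmoid(net):
    """Smooth sigmoidal output function f(net) = 1 / (1 + e^-net)."""
    return 1.0 / (1.0 + math.exp(-net))

# One unit with three inputs; the weights are hypothetical.
weights = [0.5, -1.0, 0.25]
inputs = [1.0, 0.0, 2.0]
net = net_input(weights, inputs)   # 0.5 + 0.0 + 0.5 = 1.0
output = sigmoid(net)              # about 0.731
```

Note that the unit's entire job is these two lines of arithmetic; all of the interest lies in how the weights are adjusted by a learning rule.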



       [Figure 1-1: inputs x1(t), x2(t), x3(t), ..., xn(t) feed the activation function neti = neti(x1,...,xn,pi1,...,pik);
       a sigmoidal output function xi = f(neti) then produces the unit's output, available on its output links at t+1.]

       Figure 1-1 The stylised version of a standard connectionist neuron.

         ! An output function xi = f(neti) which transforms the activation function into an output. In the earliest
         models f was a discontinuous step function. However, this poses analytical difficulties for learning
         algorithms so that often now f is a smooth sigmoidal shaped function. In some models f is allowed to vary



         from one unit to another, and we then write fi for f.

         ! A learning rule whereby the parameters associated with each processing unit are modified by
         experience.

         !   An environment within which the system must operate.

A set of processing units. Figure 1-1 illustrates a standard connectionist component. All of the processing of a
connectionist system is carried out by these units. There is no executive or overseer. There are only relatively
simple units, each doing its own relatively simple job. A unit's job is simply to receive input from other units and,
as a function of the input it receives and the current values of its internal parameters, to compute an output value
xi which it sends to the other units. This output is discrete in some models and continuous in others. When the
output is continuous it is often confined to [0,1] or [-1,1]. The system is inherently parallel in that many units carry
out their computations at the same time.

Within any system we are modelling, it is sometimes useful to characterize three types of units: input, output, and
hidden. The hidden units are those whose inputs and outputs are within the system we are modelling. They are not
‘visible’ to outside systems.

A connectivity graph. Each unit passes its output to other units along links. The graph of links represents the
connectivity of the network.

A set of parameters and an activation function. In the conventional model the parameters for unit i are assumed
to be weights wij associated with the link from unit j to unit i. If wij > 0 the link is said to be an excitatory link, if
wij = 0 unit j is effectively not connected to unit i, and if wij < 0 the link is said to be an inhibitory link. In this case
neti is calculated as
                                              neti = Σ (j=1 to n) wij xj                                            (1)



This is a linear function of the inputs and so neti is constant over hyperplanes in the n-dimensional space of inputs
to unit i.

In fact, if one is interested in generalising the computational function of a unit, it is often convenient to associate
the parameters (in the conventional case weights) with the unit, in which case one thinks of the links as passing
activation values and one is no longer constrained to have exactly n (the number of inputs) parameters per unit.
For example, one could have a unit which performed its distinction function by determining whether or not the
input vector lay within some ellipsoid. In this case there would be n parameters associated with the centre of the
ellipsoid and another n parameters associated with the axes. (In addition one could provide the ellipsoid with
rotations which would provide further parameters.) Now the activation function would look like
                                              neti = Σ (j=1 to n) Aij (xj − cij)^2                                  (2)



This is a simple example of a higher order network in which the function neti is not a linear function of the inputs.

An output function. The simplest possible output function f would be the identity function, i.e. just take xi = neti.
However, in this case with the activation function (1) the unit would be performing a totally linear function on the
inputs and, as it turns out, such nets are rather uninteresting.

In any event our unit is not yet making a distinction. In the discrete model the output function is usually





                                              xi = 1  if neti > θi
                                              xi = 0  if neti ≤ θi                                                  (3)

where θi is the threshold, a parameter associated with the unit. However, this creates discontinuities of the
derivatives and so we usually smooth the output function and write

                                              xi = f(neti)                                                          (4)
In the linear case f is some sort of sigmoidal function. For our ellipsoidal example Gaussian smoothing might be
suitable, i.e. f(x) = exp(-x2), so that the output is large (near one) when the input vector is near the centre of the
ellipsoid.
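The ellipsoidal unit with Gaussian smoothing can be sketched in the same way. The function names and parameter values below are hypothetical illustrations of activation (2) in the diagonal (unrotated) case:

```python
import math

def ellipsoid_net(a, c, x):
    """Higher-order activation (2): net = sum over j of a_j * (x_j - c_j)**2,
    i.e. a measure of how far input x lies from the ellipsoid centred at c."""
    return sum(aj * (xj - cj) ** 2 for aj, cj, xj in zip(a, c, x))

def gaussian(net):
    """Gaussian output f(net) = exp(-net): near one when x is near the centre."""
    return math.exp(-net)

a = [1.0, 4.0]   # axis parameters (hypothetical)
c = [0.5, 0.5]   # centre of the ellipsoid
on_centre = gaussian(ellipsoid_net(a, c, [0.5, 0.5]))   # exactly 1.0
far_away = gaussian(ellipsoid_net(a, c, [5.0, 5.0]))    # near 0
```

This makes the contrast with (1) visible: the weighted-sum unit is constant over hyperplanes, while this unit responds on ellipsoidal contours.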

Sometimes the output function is stochastic so that the output of the unit depends in a probabilistic fashion on neti.

For an individual unit the sequence of events in operational mode (not learning) is

           1. Combine inputs to produce activation neti(t).
           2. Compute value of output xi = f(neti).
           3. Place outputs, based on new activation level, on output links (available from t+1 onward).

Changing the processing or knowledge structure in a connectionist model involves modifying the patterns of
interconnections or parameters associated with each unit. This is accomplished by modifying pi1,...,pik (or the wij
in the usual model) through experience using a learning rule.

Virtually all learning rules are based on some variant of a Hebbian principle (discussed in the next section), which
is invariably derived mathematically through some form of gradient descent. An example is the Delta or
Widrow-Hoff rule, in which the modification of weights is proportional to the difference between the actual
activation achieved and the target activation provided by a teacher:

                                              Δwij = η (ti(t) − neti(t)) xj(t),

where η > 0 is constant. This is a generalization of the Perceptron learning rule and is all very well provided we
know the desired values of ti(t).
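A sketch of one Widrow-Hoff step in Python (the function name and learning rate below are illustrative):

```python
def delta_rule_update(weights, x, target, eta=0.1):
    """One Delta (Widrow-Hoff) step: wij += eta * (ti - neti) * xj."""
    net = sum(w * xj for w, xj in zip(weights, x))
    error = target - net
    return [w + eta * error * xj for w, xj in zip(weights, x)]

# Usage: repeated updates drive neti towards the target for this input.
w = [0.0, 0.0]
for _ in range(50):
    w = delta_rule_update(w, [1.0, 1.0], target=1.0)
```

Each step shrinks the error by a constant factor (here 1 − η·‖x‖² = 0.8), so the net input converges geometrically to the target.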

Hebbian learning.

Donald O. Hebb's book The Organization of Behavior (1949) is famous among neural modelers because it
contained the first explicit statement of the physiological learning rule for synaptic modification that has since
become known as the Hebb synapse:

           Hebb rule. When an axon of a cell A is near enough to excite a cell B and repeatedly or persistently takes
           part in firing it, some growth process or metabolic change takes place in one or both cells such that A's
           efficiency, as one of the cells firing B, is increased.

The physiological basis for this synaptic potentiation is now understood more clearly [Brown 1988]. Hebb's
introduction to the book also contains the first use of the word 'connectionism' in the context of neural modeling.
The Hebb rule is not a mathematical statement, though it is close to one. For example, Hebb does not discuss the
various possible ways inhibition might enter the picture, or the quantitative learning rule that is being followed.
This has meant that a number of quite different learning rules can legitimately be called 'Hebbian rules'. We shall
see later that nearly all such learning rules bear a close mathematical relationship to the idea of `gradient descent',
which roughly means that if we wish to move to the lowest point of some error surface a good heuristic is: we
should always tend to go `downhill'. However, for the present chapter we shall conceptualise the Hebb rule in terms
of autocorrelations, i.e. the internal correlations between each pair of components of the pattern vectors we wish



the system to memorise.

Hebb was keenly aware of the `distributed' nature of the representation he assumed the nervous system uses; that
to represent something assemblies of many cells are required and that an individual cell may be a participant
member of many representations at different times. He postulated the formation of cell assemblies representing
learned patterns of activity.
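Conceptualised as autocorrelations, a simple Hebbian rule accumulates products of component firings over the patterns to be memorised. The sketch below uses the outer-product form with zero diagonal (a common convention, not spelled out in the text above):

```python
def hebbian_weights(patterns):
    """Autocorrelation Hebbian rule: wij = sum over patterns of xi * xj,
    with no self-connections (wii = 0)."""
    n = len(patterns[0])
    W = [[0.0] * n for _ in range(n)]
    for x in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    W[i][j] += x[i] * x[j]
    return W

# Usage: one bipolar pattern; components that fire together get positive
# weights, anti-correlated components get negative ones.
W = hebbian_weights([[1, -1, 1]])
```

The resulting symmetric matrix is exactly the autocorrelation (between pairs of pattern components) referred to above.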



The need for machine learning.

Why do we need to discover how to get machines to learn? After all, is it not the case that the most practical
developments in Artificial Intelligence, such as Expert Systems, have emerged from the development of advanced
symbolic programming languages such as LISP or Prolog? Indeed, this is so. But there are convincing arguments
[Bock 1985] which suggest that the technique of simulating human skills using symbolic programs cannot hope,
in the long run, to satisfy the principal goals of AI. Mainly these centre around the time it would take to figure out
the rules and write the software. But first we should consider the evolution of hardware.

How can one measure the overall computational power of an information processing system? There are two obvious
aspects we should consider. Firstly, information storage capacity - a system cannot be very smart if it has little or
no memory. On the other hand, a system may have a vast memory but little or no capacity to manipulate
information; so a second essential measure is the number of binary operations per second. On these two scales
Figure 1-2 illustrates the information processing capability of some familiar biological and technological
information processing systems. In the case of the biological systems these estimates are based on connectionist
models and may be excessively conservative.

We consider each axis independently. As we saw earlier, research in neurophysiology has revealed that the brain
and central nervous system consists of about 10^11 individual parallel processors, called neurons. Each neuron has
roughly 10^4 synaptic connections and if we allow only 1 bit per synapse then each neuron is capable of storing
about 10^4 bits of information. The information capacity of the brain is thus about 10^15 bits. Much of this
information is probably redundant but using this figure as a conservative estimate let us consider when we might
expect to have high-speed memories of 10^15 bits.








      Figure 1-2 Information processing capability [From: Mind Children, Hans Moravec].

Figure 1-3 shows that the amount of high-speed
random access memory that may be conventionally
accessed by a large computer has increased by an order
of magnitude every six years. If we can trust this
simple extrapolation, in generation thirteen, AD
2024-30, the average high speed memory capacity of a
large computer will reach 10^15 bits.

Now consider the evolution of technological processing
power. Remarkably, this follows much the same trend.
Of course, the real trick is putting the two together to
achieve the desired result; it seems relatively unlikely
that we shall be in a position to accomplish this by
2024.

So much for the hardware. Now consider the software. Even adult human brains are not filled to capacity. So
we will assume that 10% of the total capacity, i.e. 10^14 bits, is the extent of the `software' base of an adult
human brain.

Figure 1-3 Storage capacity.

How long will it take to write the programs to fill 10^14 bits (production rules, knowledge bases etc.)? The
currently accepted rate of production of software, from conception through testing, de-bugging and documentation
to installation, is about one line of code per hour. Assuming, generously, that an average line of code contains
approximately 60 characters, or 500 bits, we discover that the project will require 100 million person years! We'll
never get anywhere by trying to program human intelligence into a machine.
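The arithmetic can be checked directly. The figure of 2000 working hours per person-year is an assumption needed to reproduce the quoted estimate; it is not stated in the text:

```python
# Bock's estimate, reproduced under the text's stated assumptions.
software_bits = 1e14            # assumed 'software' base of an adult brain
bits_per_line = 500             # ~60 characters per line of code
lines_needed = software_bits / bits_per_line        # 2e11 lines of code

hours_per_person_year = 2000    # assumption: one working person-year
person_years = lines_needed / hours_per_person_year
print(person_years)             # 1e8, i.e. 100 million person years
```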

What other options are available? One is direct transfer from the human brain to the machine. Considering
conventional transfer rates over a high speed bus this would take about 12 days. The only problem is: nobody has
the slightest idea how to build such a device.

What's left? In the biological world intelligence is acquired every day, therefore there must be another alternative.
Every day babies are born and in the course of time acquire a full spectrum of intelligence. How do they do it? The
answer, of course, is that they learn.

If we assume that the eyes, our major source of sensory input, receive information at the rate of about 250,000 bits
per second, we can fill the 1014 bits of our machine's memory capacity in about 20 years. Now storing sensory input
is not the same thing as developing intelligence, however this figure is in the right ball park. Maybe what we must
do is connect our machine brain to a large number of high-data-rate sensors, endow it with a comparatively simple
algorithm for self organization, provide it with a continuous and varied stream of stimuli and evaluations for its
responses, and let it learn.
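Again the figure can be checked. Assuming roughly 16 waking hours of sensory input per day (an assumption, not stated in the text) reproduces the "about 20 years" estimate:

```python
rate_bits_per_s = 250_000                   # visual input rate from the text
waking_seconds_per_year = 16 * 3600 * 365   # assumption: ~16 waking hours/day
years = 1e14 / (rate_bits_per_s * waking_seconds_per_year)
print(years)                                # roughly 19 years -- "about 20"
```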

This argument may seem cavalier in some aspects. The human brain is highly parallel and somewhat
inhomogeneous in its architecture. It does not clock at high serial speeds and does not access RAM to recall
information for processing. The storage capacity may be vastly greater than the 10^15 bits estimated by Sagan, since
each neuron is connected to as many as 10,000 others and the structure of these interconnections may also store
information. Indeed, although we do not know a great deal about the mechanisms of human memory, we do know
that it is multi-levelled with partial bio-chemical storage. However, none of this invalidates Bock's point that
programming can never be a substitute for learning.








                                              II Genetic Algorithms



Introduction.

The idea that the process of evolutionary search might be used as a model for hard combinatoric search algorithms
developed significantly in the mid-1960s. Evolutionary algorithms fall into the class of probabilistic heuristic
algorithms which one might use to attack NP-complete or NP-hard problems (see, for example [Horowitz 1978],
Chapters 11 and 12), such as the Travelling Salesman/person Problem (TSP). Of course, many of these problems
have significant applications in engineering hardware or software design and commercial optimisation problems,
but the underlying motivation for the study of evolutionary algorithms is principally to try to gain insight into the
evolutionary process itself.

Variously known as genetic algorithms, the phrase coined by the US school stemming from the work of John
Holland [Holland 1975], evolutionary programming, originally developed by L. J. Fogel, A. J. Owens and M. J.
Walsh, again in the US, and Evolutionsstrategie, as studied in Germany at around the same time by I. Rechenberg
and H-P. Schwefel [Schwefel 1965], the subject has exploded over the last 15 years. Curiously, the European and
US schools seemed largely unaware of each other's existence for quite some while.

Evolutionary algorithms have been applied to a variety of problems and offer intriguing possibilities for general
purpose adaptive search algorithms in artificial intelligence, especially, but not necessarily, for situations where
it is difficult or impossible to precisely model the external circumstances faced by the program. Search based on
evolutionary models had, of course, been tried before Holland's introduction of genetic algorithms. However, these
models were based on mutation and not notably successful. The principal difference of the more modern research
is an emphasis on the power of natural selection and the incorporation of a ‘crossover’ operator to mimic the effect
of sexual reproduction.

Two rather different types of theoretical analysis have developed for evolutionary algorithms: the classical
approach stemming from the original work of Mendel on heritability and the later statistical work of Galton and
Pearson at the end of the last century, and the Schema theory approach developed by Holland.

Mendel constructed a chance model of heritability involving what are now called genes. He conjectured the
existence of genes by pure reasoning - he never saw any. Galton and Pearson found striking statistical regularities
in heritability in large populations, for example, on average a son is halfway between his father's height and the
overall average height for sons. They also invented many of the statistical tools in use today such as the scatter
diagram, regression and correlation (see, for example, [Freedman 1991]). Around 1920 Fisher, Wright and
Haldane more or less simultaneously recognised the need to recast Darwinian theory as described by Galton and
Pearson in Mendelian terms. They succeeded in this task, and more recently Price's Covariance and Selection
Theorem [Price 1970], [Price 1972], an elaboration of these ideas, has provided a useful tool for algorithm analysis.

The archetypal GA.

In Nature each gene has several forms or alternatives - alleles - producing differences in the set of characteristics
associated with that gene, e.g. certain strains of garden pea have a single gene which determines blossom colour,
one allele causing the blossom to be white, the other pink. There are tens of thousands of genes in the
chromosomes of a typical vertebrate, each of which, on the available evidence, has several alleles. Hence the set
of chromosomes attained by taking all possible combinations of alleles contains on the order of 10 to the 3,000
structures for a typical vertebrate species. Even a very large population, say 10 billion individuals, contains only
a minuscule fraction of the possibilities.



A further complication is that alleles interact so that adaptation becomes primarily the search for co-adapted sets
of alleles. In the environment against which the organism is tested any individual exemplifies a large number of
possible `patterns of co-adapted alleles' or schema, as Holland calls them. In testing this individual we shall see
that all schema of which the individual is an instantiation are also tested. If the rules whereby genes are combined
have a tendency to generate new instances of above average schema then the resulting adaptive system has a high
degree of `intrinsic parallelism'1 which accelerates the evolutionary process. Considerations of this type offer an
explanation of how evolution can proceed at all. If a simple enumerative plan were employed and if 10 to the 12
structures could be tried every second it would take a time vastly exceeding the estimated age of the universe to
test 10 to the 100 structures.

The basic idea of an evolutionary algorithm is illustrated in Figure 2-1.




       [Figure 2-1: INITIALISE - create an initial population and evaluate the fitness of each member. Then loop:
       create children from the existing population using genetic operators and substitute them into the population,
       deleting an equivalent number (internal); evaluate the fitness of the children (external).]

                 Figure 2-1 Generic model for a genetic algorithm.


We seek to optimise members of a population of ‘structures’. These structures are encoded in some manner by a
‘gene string’. The population is then ‘evolved’ in a very stylised version of the evolutionary process.

We are given a set, A, of `structures' which we can think of, in the first instance, as being a set of strings of fixed
length l. The object of the adaptive search is to find a structure which performs well in terms of a measure of
performance v : A → ℝ+, where ℝ+ denotes the positive real numbers.


   1 The notion of 'intrinsic parallelism' will be discussed but it should be mentioned that it has nothing to do with parallelism in the sense
normally intended in computing.


The programmer must provide a representation for the structures to be optimised. In the terminology of genetic
algorithms a particular structure is called a phenotype and its representation as a string is called a chromosome
or genotype. Usually this representation consists of a fixed length string in which each component, or gene, may
take only a small range of values, or alleles. In this context `small' often means two, so that binary strings are used
for the genotypes.

There is nothing obligatory in taking a one-bit range for each allele but there are theoretical reasons to prefer
few-alleles-at-many-sites over many-alleles-at-few-sites (the arguments have been given by [Holland 1975] (p. 71)
and [Smith 1980] (p. 56), and supporting evidence for the correctness of these arguments has been presented by
[Schaffer 1984] (p. 107)).




                1. Randomly generate a population of M structures

                                                  S(0) = {s(1,0),...,s(M,0)}.

                2. For each new string s(i,t) in S(t), compute and save its measure of utility v(s(i,t)).

                3. For each s(i,t) in S(t) compute the selection probability defined by

                          p(i,t) = v(s(i,t)) / Σi v(s(i,t)).

                4. Generate a new population S(t+1) by selecting structures from S(t) via the selection
                probability distribution and applying the idealised genetic operators to the structures
                generated.

                5. Goto 2.


   Algorithm 2-1 Archetypal genetic algorithm.
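Algorithm 2-1 can be sketched in a few lines of Python. Everything below (the function name, population size, mutation rate and the one-max goal function in the usage line) is an illustrative choice, with single-point crossover and pointwise mutation standing in for the "idealised genetic operators" of step 4:

```python
import random

def archetypal_ga(v, length, M=20, generations=50, p_mut=0.01):
    """Sketch of Algorithm 2-1 on binary strings. v is the utility measure
    and must be positive, as in v : A -> R+."""
    # Step 1: randomly generate a population of M structures.
    pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(M)]
    for _ in range(generations):
        # Step 2: compute the utility of each string.
        fitness = [v(s) for s in pop]
        # Step 3: fitness-proportional selection probabilities.
        total = sum(fitness)
        probs = [f / total for f in fitness]

        def select():
            r, acc = random.random(), 0.0
            for s, p in zip(pop, probs):
                acc += p
                if r <= acc:
                    return s
            return pop[-1]

        # Step 4: generate S(t+1) by selection plus genetic operators.
        new_pop = []
        while len(new_pop) < M:
            a, b = select(), select()
            x = random.randint(1, length - 1)      # one crossover point
            child = a[:x] + b[x:]
            child = [1 - g if random.random() < p_mut else g for g in child]
            new_pop.append(child)
        pop = new_pop                              # step 5: goto 2
    return max(pop, key=v)

# Usage ("one-max"): v counts the 1s; the +1 keeps fitness strictly positive.
best = archetypal_ga(lambda s: 1 + sum(s), length=16)
```

Note that here each pair of parents contributes a single child, a simple control regime; many other replacement schemes fit the same skeleton.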

The function v provides a measure of ‘fitness’ for a given phenotype and (since the programmer must also supply
a mapping, say r, from the set of genotypes to the set of phenotypes) hence for a given genotype. Given a particular
genotype or string, the goal function provides a means for calculating the probability that the string will be selected
to contribute to the next generation. It should be noted that the composition function v(r) mapping genotypes to
fitness is invariably discontinuous; nevertheless genetic algorithms cope remarkably well with this difficulty.

The basis of Darwinian evolution is the idea of natural selection, i.e. population genetics tends to use the

         Selection Principle. The fitness of an individual is proportional to the probability that it will
         reproduce effectively.2

In genetic algorithm design we tend to apply this in the converse form: the probability that an individual will
reproduce is proportional to its fitness. ‘Fit’ strings, i.e. strings having larger goal function values, will be more
likely to be selected but all members of the population will have some chance to contribute.



    2 Obfuscation of the definition of ‘fitness’ occurs frequently in the classical literature. The reasons are not
difficult to understand. Both Darwin and Fisher found it hard to swallow that the lower classes bred more
prolifically and were therefore, by definition, ‘fitter’ than their ‘social superiors’. This confusion regarding ‘fitness’
still occurs in the GA literature for different reasons.


The box contains a sketch of the standard serial style genetic algorithm. Typically the evaluation of the goal
function for a particular phenotype, a process which strictly speaking is external to the genetic algorithm itself,
is the most time consuming aspect of the computation.

Given the mapping from genotype to phenotype, the goal function, and an initial random population the genetic
algorithm proceeds to create new members of the population (which progressively replace the old members) using
genetic operators, typically mutation, crossover and inversion, modelled on their biological analogs.

For the moment we represent strings as a1a2a3...al [ai = 1 or 0].

Using this notation we can describe the operators by which strings are combined to produce new strings. It is the
choice of these operators which produces a search strategy that exploits co-adapted sets of structural components
already discovered. Holland uses three such principal operators: Crossover, Mutation and Inversion (which we
shall not discuss in detail here).

                           CROSSOVER (two cut points)
                           Parent 1          1011 010011 10111
                           Parent 2          1100 111000 11010
                           Child 1           1100 010011 11010
                           Child 2           1011 111000 10111

                           MUTATION
                           110011100011010  ->  111011101011010

                           INVERSION
                           111111100011010  ->  110011111011010

       Figure 2-2 Standard genetic operators.

Crossover. In crossover one or more cut points are selected at random and the operation illustrated in Figure 2-2,
Figure 7-1 (where two cut points are employed) is used to create two children. A variety of control regimes are
possible, but a simple strategy might be `select one of the children at random to go into the next generation'.
Children tend to be `like' their parents, so that crossover can be considered as a focussing operator which exploits
knowledge already gained; its effects are quite quickly apparent.

Crossing over proceeds in three steps.


         a) Two structures a1...al and b1...bl are selected at random from the current population.

         b) A crossover point x, in the range 1 to l-1 is selected, again at random.

         c) Two new structures

                                                  a1a2...axbx+1bx+2...bl
                                                  b1b2...bxax+1ax+2...al
         are formed.

In modifying the pool of schema (discussed below), crossing over continually introduces new schema for trial
whilst testing extant schema in new contexts. It can be shown that each crossing over affects a great number of
schema.
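Steps (a)-(c) amount to a few lines of Python; `one_point_crossover` is an illustrative name, and the parents in the usage line are arbitrary:

```python
import random

def one_point_crossover(a, b):
    """Steps (a)-(c): choose a crossover point x in 1..l-1 at random,
    then exchange the tails of the two parent strings."""
    x = random.randint(1, len(a) - 1)
    return a[:x] + b[x:], b[:x] + a[x:]

# Usage: each child takes its head from one parent, its tail from the other.
c1, c2 = one_point_crossover("10110100", "11001110")
```

Whatever x is chosen, at every position the two children between them carry exactly the two parental alleles; only the pairing changes.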




There is large variation in the crossover operators which have been used by different experimenters. For example,
it is possible to cross at more than one point. The extreme case of this is where each allele is randomly selected
from one or other parent string with uniform probability - this is called uniform crossover. Although some writers
have argued in favour of uniform crossover, there would seem to be theoretical arguments against its use viz. if
evolution is the search for co-adapted sets of alleles then this search is likely to be severely undermined if many
cut points are used. In language we shall develop shortly: the probability of schema disruption when using uniform
crossover is much higher than when using one or two point crossover.

The design of the crossover operator is strongly influenced by the nature of the representation. For example, if the
problem is the TSP and the representation of a tour is a straightforward list of cities in the order in which they are
to be visited then a simple crossover operator will, in general, not produce a tour. In this case the options are:

         !   Change the representation.

         !   Modify the crossover operator.

or       !   Effect ‘genetic repair’ on non-tours which may result.

There is obviously much scope for experiment for any particular problem. The danger is that the resulting
algorithm may be so far removed from the canonical form that the correlation between parental and child fitness
may be small - in which case the whole justification for the method will have been lost.

Mutation. In mutation an allele is altered at each site with some fixed probability. Mutation disperses the
population throughout the search space and so might be considered as an information gathering or exploration
operator. Search by mutation is a slow process analogous to exhaustive search. Thus mutation is a ‘background’
operator, assuring that the crossover operator has a full range of alleles so that the adaptive plan is not trapped on
local optima.

Each structure a1a2...al in the population is operated upon as follows. Position x is modified, with probability p
independent of the other positions, so that the string is replaced by

                                                 a1a2...ax-1 z ax+1...al

where z is drawn at random from the possible allele values. If p is the probability of mutation at a single position
then the number of mutations h in a given string of length l follows a binomial distribution, which for small p is
well approximated by a Poisson distribution with parameter lp.
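The per-site operator can be sketched in a few lines of Python (illustrative, not from the notes):

```python
import random

def mutate(string, alleles, p):
    """Independently replace each position, with probability p, by an
    allele drawn uniformly at random from the allowed values (the draw
    may return the allele already present)."""
    return [random.choice(alleles) if random.random() < p else a
            for a in string]
```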

A simple demonstrator is given in the Mathematica program GA_Simple.nb. A more complicated GA using
Inversion is given in GA_Inversion.nb.

Design issues - what do you want the algorithm to do?

Now we have to ask just what it is we want of a genetic algorithm. There are several, sometimes mutually exclusive,
possibilities. For example:

         !   Rapid convergence to a global optimum.

         !   Produce a diverse population of near optimal solutions in different ‘niches’.

         !   Be adaptive in ‘real-time’ to changes in the goal function.

We shall deal with each of these in turn but first let us briefly consider the nature of the search space. If the space
is flat with just one spike then no algorithm short of exhaustive search will suffice. If the space is smooth and
unimodal then a conventional hill-climbing technique should be used.



Somewhere between these two extremes are problems in which the goal function is a highly non-linear multi-
modal function of the gene values - these are the problems of hard combinatoric search for which some style of
genetic algorithm may be appropriate.

Rapid convergence to a global optimum.

Of course this is rather simplistic. Holland's theory holds for large populations. However, in many AI applications
it is computationally infeasible to use large populations and this in turn leads to a problem commonly referred to
as Premature Convergence (to a sub-optimal solution) or Loss of Diversity in the literature of genetic algorithms.
When this occurs the population tends to become dominated by one relatively good solution and locked into a sub-
optimal region of the search space. For small populations the schema theorem is actually an explanation for
premature convergence (i.e. the failure of the algorithm) rather than a result which explains success.

Premature convergence is related to a phenomenon observed in Nature. Allelic frequencies may fluctuate purely
by chance about their mean from one generation to another; this is termed Random Genetic Drift. Its effect on the
gene pool in a large population is negligible, but in a small effectively interbreeding population, chance alteration
in Mendelian ratios can have a significant effect on gene frequencies and can lead to the fixation of one allele and
loss of another. For example, isolated communities within a given population have been found to have frequencies
for blood group alleles different from the population as a whole. Figure 2-3 illustrates this phenomenon with a
simple function optimisation genetic algorithm.
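Random genetic drift is easy to reproduce in simulation. A minimal sketch (illustrative Python, assuming a Wright-Fisher style model in which each generation's allele count is resampled from the previous generation's frequency):

```python
import random

def drift(freq, pop_size, generations):
    """Track the frequency of one allele in a finite population until
    it is fixed (freq 1.0), lost (freq 0.0), or time runs out."""
    for _ in range(generations):
        # each of the pop_size gene copies is drawn from the current pool
        count = sum(random.random() < freq for _ in range(pop_size))
        freq = count / pop_size
        if freq in (0.0, 1.0):      # fixation or loss
            break
    return freq
```

With pop_size = 10 the allele is typically fixed or lost within a few tens of generations; in a large population the frequency wanders only slowly about its starting value, which is the contrast described above.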

The inexperienced often tend to attempt to counteract
premature convergence by increasing the rate of mutation.
However, this is not a good idea.

         !  A high rate of mutation tends to devalue the
         role of crossover in building co-adapted sets of
         alleles and in essence pushes the algorithm in
         the direction of exhaustive search. Whilst some
         mutation is necessary a high rate of mutation is
         invariably counter-productive.

In trying to counteract premature convergence we are essentially trying to balance the exploitation of good
solutions found so far against the exploration which is required to find hitherto unknown promising regions of
the search space. It is worth observing that, in computational terms, any algorithm which often inserts copies of
strings into the current population is wasteful. This is true for the Traditional Genetic Algorithm (TGA)
outlined as 2, 7-1.

Figure 2-3 Premature convergence - no sharing.

Produce a diverse population of near optimal solutions in different `niches'.

The problem of premature convergence has been addressed by a number of authors using a diversity of techniques.
Many of the papers in [Davis 1987] contain discussions of precisely this point. The methods used to combat
premature convergence in TGAs are not necessarily appropriate to the parallel formulations of genetic algorithms
(PGAs) which we shall discuss shortly.

Cavicchio, in his doctoral dissertation, suggested a preselection mechanism as a means of promoting genotype
diversity. Preselection filters children generated, possibly picking the fittest, and replaces parent members of the
population with their offspring [Cavicchio 1970].




De Jong's crowding scheme is an elaboration of the preselection mechanism. In the crowding scheme, an offspring
replaces the most similar string from a randomly drawn subpopulation having size CF (the crowding factor) of the
current population. Thus a member of the population experiences a selection pressure in proportion to its similarity
to other members of the population [De Jong 1975]. Empirical determination of CF with a five function test bed
determined CF = 3 as optimal.
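The crowding replacement step can be sketched directly (illustrative Python, not from the notes; genotypes are taken to be equal-length strings compared by Hamming distance):

```python
import random

def hamming(a, b):
    """Number of loci at which two genotypes differ."""
    return sum(x != y for x, y in zip(a, b))

def crowding_insert(population, offspring, cf=3):
    """De Jong crowding: the offspring replaces the most similar member
    of a randomly drawn subpopulation of size cf (the crowding factor)."""
    candidates = random.sample(range(len(population)), cf)
    victim = min(candidates, key=lambda i: hamming(population[i], offspring))
    population[victim] = offspring
```

Because similar strings are the most likely to be displaced, dense clusters of near-identical genotypes feel the strongest replacement pressure, which is how the scheme preserves diversity.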

Booker implemented a sharing method in a classifier system environment which used the bucket brigade algorithm
[Booker 1982]. The idea here was that if related rules share payments then sub-populations of rules will form
naturally. However, it seems difficult to apply this mechanism to standard genetic algorithms. Schaffer has
extended the idea of sub-populations in his VEGA model in which each fitness element has its own sub-population
[Schaffer 1984].

A different approach to help maintain genotype diversity was introduced by Mauldin via his uniqueness operator
[Mauldin 1984]. The uniqueness operator helped to maintain diversity by incorporating a `censorship' operator
in which the insertion of an offspring into the population is possible only if the offspring is genotypically different
from all members of the population at a number of specified genotypical loci.

* Results and methods related to the TSP.

We digress briefly to give a little more detailed background material on the TSP. The question is often asked: if
one cannot exactly solve any very large TSP problem (except in special cases; at present `very large' means a
problem involving more than a thousand cities), how can one know how accurate a solution produced by a
probabilistic or heuristic algorithm actually is?

The best exact solution methods for the travelling salesman problem are capable of solving problems of several
hundred cities [Grötschel 1991], but unfortunately excessive amounts of computer time are used in the process and,
as N increases, any exact solution method rapidly becomes impractical. For large problems we therefore have no
way of knowing the exact solution, but in order to gauge the solution quality of any algorithm we need a reasonably
accurate estimate of the minimal tour length. This is usually provided in one of two ways.

For a uniform distribution of cities the classic work by Beardwood, Halton and Hammersley (BHH) [Beardwood
1959] obtains an asymptotic best possible upper bound for the minimum tour length for large N. Let {Xi}, 1 ≤ i < ∞,
be independent random variables uniformly distributed over the unit square, and let LN denote the length of the
shortest closed path which connects all the elements of {X1,...,XN}. In the case of the unit square they proved, for
example, that there is a constant c > 0 such that, with probability 1,

                                                 lim_{N→∞} LN N^(-1/2) = c                                   (1)

In general c depends on the geometry of the region considered.

One can use the estimate provided by the BHH theorem in the following form: the expected length LN* of a minimal
tour for an N-city problem, in which the cities are uniformly distributed in a square region of the Euclidean plane,
is given by

                                                 LN* ≈ c2 √(NR)                                              (2)

where R is the area of the square and the constant (for historical reasons known as Stein's constant - [Stein 1977])
c2 ≈ 0.70805 ± 0.00007 was recently estimated by Johnson, McGeoch and Rothberg [Johnson 1996].

A second possibility would be to use a problem specific estimate of the minimal tour length which gives a very
accurate estimate: the Held-Karp lower bound [Held 1970], [Held 1971]. Computing the Held-Karp lower bound
is an iterative process involving the evaluation of Minimal Spanning Trees for N-1 cities of the TSP followed by
Lagrangean relaxations, see [Valenzuela 1997].



If one seeks approximate solutions then various algorithms based on simple rule based heuristics (e.g. nearest
neighbour and greedy heuristics), or local search tour improvement heuristics (e.g. 2-Opt, 3-Opt and Lin-
Kernighan), can produce good quality solutions much faster than exact methods. A combinatorial local search
algorithm is built around a `combinatoric neighbourhood search' procedure, which given a tour, examines all tours
which are closely related to it and finds a shorter `neighbouring' tour, if one exists. Algorithms of this type are
discussed in [Papadimitriou 1982]. The definition of `closely related' varies with the details of the particular local
search heuristic.

The particularly successful combinatorial local search heuristic described by Lin and Kernighan [Lin 1973] defines
`neighbours' of a tour to be those tours which can be obtained from it by doing a limited number of interchanges
of tour edges with non-tour edges. The slickest local heuristic algorithms3, which on average tend to have
complexity O(n^α), for α > 2, can produce solutions with approximately 1-2% excess for 1000 cities in a few
minutes. However, for 10,000 cities the time escalates rapidly and one might expect that the solution quality also
degrades, see [Gorges-Schleuter 1990], p 101.
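The flavour of such tour improvement heuristics can be seen in a minimal 2-Opt sketch (illustrative Python, a first-improvement variant; far slower than the optimised implementations discussed above, which avoid recomputing whole tour lengths):

```python
import math

def tour_length(tour, pts):
    """Total Euclidean length of a closed tour over the given points."""
    return sum(math.dist(pts[tour[i]], pts[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def two_opt(tour, pts):
    """Repeatedly reverse a tour segment while doing so exchanges two
    tour edges for two shorter non-tour edges."""
    improved = True
    while improved:
        improved = False
        for i in range(len(tour) - 1):
            for j in range(i + 2, len(tour)):
                new = tour[:i + 1] + tour[i + 1:j + 1][::-1] + tour[j + 1:]
                if tour_length(new, pts) < tour_length(tour, pts) - 1e-12:
                    tour, improved = new, True
    return tour
```

Here the `neighbours' of a tour are exactly those reachable by one segment reversal, i.e. one exchange of two tour edges for two non-tour edges.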

An approximation scheme A is an algorithm which given problem instance I and ε > 0 returns a solution of length
A(I, ε) such that

                                           |A(I, ε) - Ln(I)| / Ln(I) ≤ ε                                     (3)

Such an approximation scheme is called a fully polynomial time approximation scheme if its run time is bounded
by a function that is polynomial in both the instance size and 1/ε. Unfortunately the following theorem holds, see
for example [Lawler 1985], p165-166.

Theorem. If P ≠ NP then there can be no fully polynomial time approximation scheme for the TSP, even if
instances are restricted to points in the plane under the Euclidean metric.

Although the possibility of a fully polynomial time approximation scheme is effectively ruled out, there remains
the possibility of an approximation scheme that, although it is not polynomial in 1/ε, does have a running time
which is polynomial in n for every fixed ε > 0. The Karp algorithms, based on cellular dissection, provide
`probabilistic' approximation schemes for the geometric TSP.

Theorem [Karp 1977]. For every ε > 0 there is an algorithm A(ε) such that A(ε) runs in time C(ε)n + O(nlogn)
and, with probability 1, A(ε) produces a tour of length not more than (1 + ε) times the length of a minimal tour.

The Karp-Steele algorithms [Steele 1986] can in principle converge in probability to near optimal tours very
rapidly. Cellular dissection is a form of divide and conquer. Karp's algorithms partition the region R into small
subregions, each containing about t cities. An exact or heuristic method is then applied to each subproblem and
the resulting sub-tours are finally patched together to yield a tour through all the cities.

Evolutionary Divide and Conquer.

Until recently the best genetic algorithms designed for TSP problems have used permutation crossovers for
example [Davis 1985], [Goldberg 1985], [Smith 1985], or edge recombination operators [Whitley 1989], and
required massive computing power to gain very good approximate solutions (often actually optimal) to problems
with a few hundred cities [Gorges-Schleuter 1990]. Gorges-Schleuter cleverly exploited the architecture of a
transputer bank to define a topology on the population and introduce local mating schemes which enabled her to
delay the onset of premature convergence. However, this improvement to the genetic algorithm is independent of



   3
     The most impressive results in this direction are due to David Johnson at AT&T Bell Laboratories - mostly reported in unpublished
Workshop presentations.


any limitations inherent in permutation crossovers. Eventually, for problems of more than around 1000 cities, all
such genetic algorithms tend to produce a flat graph of improvement against number of individuals tested, no
matter how long they are run.

Thus experience with genetic algorithms using permutation operators applied to the Geometric Travelling
Salesman Problem (TSP) suggests that these algorithms fail in two respects when applied to very large problems:
they scale rather poorly as the number of cities n increases, and the solution quality degrades rapidly as the problem
size increases much above 1000 cities. An interesting novel approach developed by Valenzuela and Jones
[Valenzuela 1994] which seeks to circumvent these problems is based on the idea of using the genetic algorithm
to explore the space of problem subdivisions, rather than the space of solutions itself.

This alternative method, for genetic algorithms applied to hard combinatoric search, can be described as
Evolutionary Divide and Conquer (EDAC), and the approach has potential for any search problem in which
knowledge of good solutions for subproblems can be exploited to improve the solution of the problem itself. As they
say
         ! Essentially we are suggesting that intrinsic parallelism is no substitute for divide and conquer in hard combinatoric search and
         we aim to have both. [Valenzuela 1994]

The goal was to develop a genetic algorithm capable of producing reasonable quality solutions for problems of
several thousand cities, and one which will scale well as the problem size n increases. `Scaling well' in this context
almost inevitably means a time complexity of O(n) or at worst O(nlogn). This is a fairly severe constraint: for
example, given a list of n city co-ordinates, the simple act of computing all possible edge lengths, an O(n^2)
operation, is excluded. Such an operation may be tolerable for n = 5000 but becomes intolerable for n = 100,000.

In the previous section we mentioned the Karp and Steele cellular dissection algorithms, and it is this technique
which is the basis of the Valenzuela-Jones EDAC genetic algorithms for the TSP.




          Figure 2-4 Solution to 50 City Problem using Karp's deterministic bisection method 1.





In practice a one-shot deterministic Karp algorithm yields
rather poor solutions, typically 30% excess (with simple
patching) when applied to 500 - 1000 city problems.
Nevertheless, the Karp technique is a good starting point
for exploring EDAC applied to the TSP. There are several
reasons. First, according to Karp's theorem there is some
probabilistic asymptotic guarantee of solution quality as
the problem size increases. Second, the time complexity is
about as good as one can hope for, namely O(nlogn). The
run time of a genetic algorithm based on exploring the
space of `Karp-like' solutions will be proportional to nlogn
multiplied by the number of times the Karp algorithm is
run, i.e. the number of individuals tested.

Karp's algorithm proceeds by partitioning the problem recursively from the top down. At each step the current
rectangle is bisected horizontally or vertically, according to a deterministic rule designed to keep the rectangle
perimeter minimal. This bisection proceeds until each subrectangle contains a preset maximum number of cities
t (typically t ≈ 10). Each small subproblem is then solved and the resulting subtours are patched together to
produce a solution to the original problem - see Figure 2-4.

Figure 2-5 The EDAC (top) and simple 2-Opt (bottom) time complexity (log scales).
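The recursive partitioning step can be sketched as follows (illustrative Python, a simplification: here each rectangle is cut at the median city perpendicular to its longer side, which keeps perimeters small; Karp's actual deterministic rule and the subtour patching step are more refined):

```python
def karp_partition(cities, t=10):
    """Recursively bisect the cities until each cell holds at most t of
    them; returns the list of city subsets, one per final subrectangle."""
    if len(cities) <= t:
        return [cities]
    xs = [c[0] for c in cities]
    ys = [c[1] for c in cities]
    # cut perpendicular to the longer side of the bounding rectangle
    axis = 0 if max(xs) - min(xs) >= max(ys) - min(ys) else 1
    ordered = sorted(cities, key=lambda c: c[axis])
    mid = len(ordered) // 2
    return karp_partition(ordered[:mid], t) + karp_partition(ordered[mid:], t)
```

Each returned subset would then be toured exactly (or heuristically) and the subtours patched together, as described above.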

In the EDAC algorithm the genotype is a p X p binary array in which a `1' or `0' indicates whether to cut
horizontally or vertically at the current bisection. If we maintain the subproblem size, t, and increase the number
of cities in the TSP, then a partition better than Karp's becomes progressively harder to find by randomly choosing
a horizontal or vertical bisection at each step. If the problem size is n ≈ 2^k t, where 2^k is the number of subsquares,
then the corresponding genotype requires at least n/t - 1 bits. The size of the partition space is 2 to the power p^2,
which for p = 80 (the value used for n = 5000) is approximately exp(4436). For n = 5000 the size of the permutation
search space, roughly estimated using Stirling's formula, is around exp(37586). Thus searching partition space is
easier than searching permutation space and this provides a third argument in favour of exploring this
representation of problem subdivision as a genotype. We know from Karp's theorem that the class of tours
produced by dissection and patching will have representatives very close to the optimum tour, so by restricting
attention to this smaller set one is not `throwing out the baby with the bath-water', i.e. the set may be smaller but
it nevertheless contains near optimal tours.
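The two search-space sizes quoted above are easy to check numerically (illustrative Python; `log_tour_space` counts the (n-1)!/2 distinct closed tours via the log-gamma function, so no huge integers are formed):

```python
import math

def log_partition_space(p):
    """Natural log of 2**(p*p), the number of p x p binary cut arrays."""
    return p * p * math.log(2)

def log_tour_space(n):
    """Natural log of (n-1)!/2, the number of distinct closed tours on
    n cities, computed exactly via lgamma(n) = ln((n-1)!)."""
    return math.lgamma(n) - math.log(2)
```

For p = 80 this gives ln(2^6400) ≈ 4436, and for n = 5000 a tour space of around exp(37580), in line with the figures in the text: the partition space, while still astronomically large, is vastly smaller than the permutation space.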

This approach contrasts sharply with the idea of `broadcast languages' mooted in Chapter 8 of [Holland 1975], in
which techniques for searching the space of representations for a genetic algorithm are discussed. In general the
space of representations is vastly larger than the search space of the problem itself, but we have seen with the TSP
that this space is already so huge that it is impractical to search in any comprehensive fashion for all except the
smallest problems. Hence, it seems unlikely that replacing the original search space by an even larger one will turn
out to be a productive approach.








  Figure 2-6 The 200 generation EDAC 5% excess solution for a 5000 city problem.


In any event even the EDAC algorithm requires clever recursive repair techniques to improve the accuracy when
subtours are patched together. Nevertheless, the algorithm scales well. Figure 2-5 compares the EDAC algorithms
with simple 2-Opt (which gives an accuracy of around 8% excess). This version of the EDAC algorithm produces
solutions at the 5% level, see Figure 2-6, but a later more elaborate variant reliably produces solutions with around
1% excess and has been tested on problem sizes of up to 10,000 cities.

This technique probably represents the best that can be done at the present time using genetic algorithms for the
TSP. It is not yet practical by comparison with iterated Lin-Kernighan (or even 2-Opt)4, but it scales well and may
eventually offer a viable technique for obtaining good solutions to TSP problems involving several hundred
thousand cities.

Parallel EDACII and EDACIII were both tested on a range of problems between 500 and 5000 cities. Parental pairs
were chosen from the initial random population and the mid-parent value of the tour lengths calculated and
recorded. Crossover and mutation were then applied to each selected parental pair and the tour length evaluated
for the resulting offspring. Pearson's correlation coefficient, rxy, was calculated in each experiment and significance
tests based on Fisher's transformation carried out in order to establish whether the resulting correlation coefficients
differed significantly from zero (i.e. no correlation). Scatter diagrams in Figure 2-7 and Figure 2-8 illustrate the
Price correlation for parallel EDACII and EDACIII on the 5000 city problem.
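The significance test used here is standard. A minimal sketch (illustrative Python, not from the notes): under H0: ρ = 0, Fisher's transformation z = atanh(r) is approximately normal with standard deviation 1/√(n - 3), so the statistic below is compared against standard normal critical values (e.g. |z| > 1.96 for a two-sided test at the 5% level).

```python
import math

def fisher_z_test(r, n):
    """Approximately standard-normal test statistic for H0: rho = 0,
    given a sample Pearson correlation r from n (x, y) pairs."""
    return math.atanh(r) * math.sqrt(n - 3)
```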

Although the genotype used in these experiments was a binary array it could more naturally (at the cost of
complication in the coding) be represented by a pair of binary trees, or a quadtree. The use of trees here would be
more in keeping with the recursive construction of the phenotype from the genotype, a process analogous to



   4
     For example, wildly extrapolating the figures gives the breakeven point with 2-Opt at around n = 422,800 requiring some 74 cpu days!
Of course, other things would collapse before then.


growth, and it is possible to produce a modified Schema theorem for the case of trees, where the genetic
information is encoded in the shape of the tree and information placed at leaf nodes.



Figure 2-7 EDACII mid-parent vs offspring correlation for 5000 cities [Valenzuela 1995].

Figure 2-8 EDACIII mid-parent vs offspring correlation for 5000 cities [Valenzuela 1995].

In Nature very complex phenotypical structures frequently give the appearance of having been constructed
recursively from the genotype. Examples of recursive algorithms which lead to very natural looking graphical
representations of natural living structures such as trees, plants, and so on, can be found in the work of
Lindenmayer [Lindenmayer 1971] on what are now called L-systems. These production systems are very similar
to the production rules which define various kinds of context sensitive or context free grammars. The combination
of tree structured genotypes, or recursive construction algorithms similar to production rules, combined with the
divide-and-conquer paradigm suggest a powerful computational technique for the compression of complex
phenotypical structures into useful genotypical structures. So much so that, as our understanding of exactly how
DNA encodes the phenotypical structure of individual biological organisms (particularly the neural systems of
mammals) progresses, it would be surprising to find that Nature has not employed some such technique.
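Parallel rewriting of this kind is simple to demonstrate. A minimal sketch (illustrative Python) of a context-free L-system, using Lindenmayer's classic algae rules A → AB, B → A:

```python
def lsystem(axiom, rules, steps):
    """Apply context-free production rules to every symbol in parallel,
    steps times; symbols with no rule are copied unchanged."""
    s = axiom
    for _ in range(steps):
        s = "".join(rules.get(ch, ch) for ch in s)
    return s
```

For example, `lsystem("A", {"A": "AB", "B": "A"}, 3)` yields "ABAAB", and the string lengths grow as the Fibonacci numbers: a very short genotype (the rule set) expands recursively into a large, structured phenotype, which is exactly the compression idea described above.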

                                                              Chapter references


[Altenberg 1987] L. Altenberg and M. W. Feldman. Selection, generalised transmission, and the evolution of
modifier genes. The reduction principle. Genetics 117:559-572.

[Altenberg 1994] L. Altenberg. The Evolution of Evolvability in Genetic Programming. Chapter 3 in Advances
in Genetic Programming, Ed. Kenneth E. Kinnear, Jr., MIT Press, 1994.

[Belew 1990] R. Belew, J. McInerney, and N. N. Schraudolph. Evolving networks: Using the genetic algorithm
with connectionist learning. CSE Technical Report CS90-174, University of California, San Diego, 1990.

[Booker 1982] L. B. Booker. Intelligent behaviour as an adaption to the task environment. Doctoral dissertation,
University of Michigan, 1982. Dissertation Abstracts International 43(2), 469B.

[Brandon 1990] R. N. Brandon. Adaptation and Environment, pages 83-84. Princeton University Press, 1990.

[Cavalli-Sforza 1976] L. L. Cavalli-Sforza and M. W. Feldman. Evolution of continuous variation: direct
approach through joint distribution of genotypes and phenotypes. Proceedings of the National Academy of Sciences
U.S.A., 73:1689-1692, 1976.




[Cavicchio 1970] D. J. Cavicchio. Adaptive search using simulated evolution. Doctoral dissertation, University
of Michigan (unpublished), 1970.

[Chalmers 1990] David J. Chalmers. The Evolution of Learning: An experiment in Genetic Connectionism.
Proceedings of the 1990 Connectionist Models Summer School, San Marco, CA. Morgan Kaufmann, 1990.

[Collins 1991] R. J. Collins and D. R. Jefferson. Selection in massively parallel genetic algorithms. Proceedings
of the fourth international conference on genetic algorithms. Morgan Kaufmann 1991.

[Davis 1987] Lawrence Davis, Editor. Genetic Algorithms and Simulated Annealing, Pitman Publishing, London.

[De Jong 1975] K. De Jong. An analysis of the behaviour of a class of genetic adaptive systems. Doctoral
dissertation, University of Michigan, 1975. Dissertation Abstracts International 36(10), 5140B.

[Freedman 1991] D. Freedman, R. Pisani, R. Purves and A. Adhikkari. Statistics, Second edition, W. W. Norton,
New York, 1991.


[Goldberg 1987] David E. Goldberg and Jon Richardson. Genetic Algorithms with Sharing for Multimodal
Function Optimization. Proc. Second Int. Conf. on Genetic Algorithms, pp. 41-49, MIT.

[Gorges-Schleuter 1990] Martina Gorges-Schleuter. Genetic Algorithms and Population Structures: A Massively
Parallel Algorithm. Ph.D. Thesis, University of Dortmund, August 1990.

[Grefenstette 1987] John J. Grefenstette. Incorporating Problem Specific Knowledge into Genetic Algorithms. In
Genetic Algorithms and Simulated Annealing, Ed. Lawrence Davis, Pitman Publishing, London.

[Holland 1975] John H. Holland. Adaptation in Natural and Artificial Systems. Ann Arbor: University of Michigan
Press.

[Horowitz 1978] E. Horowitz and S. Sahni. Fundamentals of Computer Algorithms. London, Pitman Publishing
Ltd.

[Johnson 1996] D. S. Johnson, L. A. McGeoch and E. E. Rothberg. Asymptotic experimental analysis for the Held-
Karp traveling salesman bound. Proceedings of the 1996 ACM-SIAM Symposium on Discrete Algorithms, to appear.

[Jones 1993] Antonia J. Jones. Genetic Algorithms and their Applications to the Design of Neural Networks,
Neural Computing & Applications, 1(1):32-45, 1993.

[Koza 1992] John L. Koza. Genetic Programming: On the Programming of Computers by Means of Natural
Selection. Bradford Books, MIT Press, 1992. ISBN 0-262-11170-5.

[Lindenmayer 1971] A. Lindenmayer. Developmental systems without cellular interaction, their languages and
grammars. J. Theoretical Biology 30, 455-484, 1971.

[Lyubich 1992] Y. I. Lyubich. Mathematical Structures in Population Genetics. Springer-Verlag, New York, pages
291-306. 1992.

[Manderick 1989] B. Manderick and P. Spiessens. Fine grained parallel genetic algorithms. Proceedings of the
third international conference on genetic algorithms. Morgan Kaufmann, 1989.



[Manderick 1991] Manderick, B. de Weger, M. and Spiessens, P. The genetic algorithm and the structure of the
fitness landscape. In R. K. Belew and L. B. Booker, Editors, Proceedings of the Fourth International Conference
on Genetic Algorithms, pages 143-150, San Mateo CA, Morgan Kaufmann.

[Mauldin 1984] M. L. Mauldin. Maintaining diversity in genetic search. National Conference on Artificial
Intelligence, 247-250, 1984.

[Macfarlane 1993] D. Macfarlane and Antonia J. Jones. Comparing networks with differing neural-node functions
using Transputer based genetic algorithms. Neural Computing & Applications, 1(4): 256-267, 1993.

[Menczer 1992] Menczer,F. and Parisi, D. Evidence of hyperplanes in the genetic learning of neural networks.
Biological Cybernetics 66(3):283-289.

[Miller 1989] G. Miller, P. Todd, and S. Hegde. Designing neural networks using genetic algorithms. In
Proceedings of the Third Conference on Genetic Algorithms and their Applications, San Mateo, CA. Morgan
Kaufmann, 1989.

[Muhlenbein 1988] H. Muhlenbein, M. Gorges-Schleuter, and O. Kramer. Evolution Algorithms in Combinatorial
Optimisation. Parallel Computing, 7, pp. 65-85.

[Price 1970] G. R. Price. Selection and covariance. Nature, 227:520-521.

[Price 1972] G. R. Price. Extension of covariance mathematics. Annals of Human Genetics 35:485-489.

[Salmon 1971] W. C. Salmon. Statistical Explanation and Statistical Relevance. University of Pittsburg Press,
Pittsburgh, 1971.

[Schaffer 1984] J. D. Schaffer. Some Experiments in Machine Learning Using Vector Evaluated Genetic
Algorithms. Ph.D. Thesis, Department of Electrical Engineering, Vanderbilt University, December 1984.

[Schwefel 1965] H-P Schwefel. Kybernetische Evolution als Strategie experimentellen Forschung in der
Strömungstechnik. Diploma thesis, Technical University of Berlin, 1965.

[Slatkin 1970] M. Slatkin. Selection and polygenic characters. Proceedings of the National Academy of Sciences
U.S.A. 66:87-93. 1970.

[Smith 1980] S. F. Smith. A Learning System Based on Genetic Adaptive Algorithms. Ph.D. Dissertation,
University of Pittsburg.

[Spiessens 1991] P. Spiessens and B. Manderick. A massively parallel genetic algorithm - implementation and
first analysis. Proceedings of the fourth international conference on genetic algorithms. Morgan Kaufmann 1991.

[Valenzuela 1994] Christine L. Valenzuela and Antonia J. Jones. Evolutionary Divide and Conquer (I): A novel
genetic approach to the TSP. Evolutionary Computation 1(4):313-333, 1994.

[Valenzuela 1995] Christine L. Valenzuela. Evolutionary Divide and Conquer: A novel genetic approach to the
TSP. Ph.D. Thesis, Department of Computing, Imperial College, London. 1995

[Valenzuela 1997] Christine L. Valenzuela and Antonia J. Jones. Estimating the Held-Karp lower bound for the
geometric TSP. To appear: European Journal of Operational Research, 1997.

[Whitley 1990] D. Whitley, T. Starkweather, and C. Bogart. Genetic algorithms and neural networks: Optimizing
connections and connectivity. Parallel Computing, forthcoming.



[Wilson 1990] Perceptron redux. Physica D, forthcoming.








                                                       III Hopfield networks.



Introduction.

As far back as 1954 Cragg and Temperley [Cragg 1954] had introduced the spin-neuron analogy using a
ferromagnetic model. They remarked

         "It remains to be considered whether biological analogues can be found for the concepts of temperature and interaction
         energy in a physical system."

In 1974 Little [Little 1974] introduced the temperature-noise analogy, but it was not until 1982 that John Hopfield
[Hopfield 1982], a physicist, made significant progress in the direction requested by Cragg and Temperley. In a
single short paragraph, he suggests one of the most important new techniques to have been proposed in neural
networks.

Hopfield nets and energy.

The standard approach to a neural network is to propose a learning rule, usually based on synaptic modification,
and then to show that a number of interesting effects arise from it. Hopfield starts by saying that:

         "The function of the nervous system is to develop a number of locally stable states in state space."

Other points in state space flow into the stable points, called attractors. In some other dynamic systems the
behaviour is much more complex, for example the system may orbit two or more points in state space in a non-
periodic way, see [Abraham 1985]. However, this turns out not to be the case for the Hopfield net.

The flow of the system towards a stable point allows a mechanism for correcting errors, since deviations from the
stable points disappear. The system can thus reconstruct missing information since the stable point will
appropriately complete missing parts of an incomplete initial state vector.

Each of n neurons has two states, like those of McCulloch and Pitts, xi = 0 (not firing) and xi = 1 (firing at
maximum rate). When neuron i has a connection made to it from neuron j, the strength of the connection is defined
as wij. Non-connected neurons have wij = 0. The instantaneous state of the system is specified by listing the n values
of xi, so is represented by a binary word of n bits. The state changes in time according to the following algorithm.
For each neuron i there is a fixed threshold θi. Each neuron readjusts its state randomly in time but with a mean
attempt rate µ, setting

    xi(t) = 1          if  Σ_{j≠i} wij xj(t-1)  >  θi
    xi(t) = xi(t-1)    if  Σ_{j≠i} wij xj(t-1)  =  θi                                (1)
    xi(t) = 0          if  Σ_{j≠i} wij xj(t-1)  <  θi



Thus, an element chosen at random asynchronously evaluates whether it is above or below threshold and readjusts
accordingly.

Although this model has superficial similarities to the Perceptron there are essential differences. Firstly,
Perceptrons were modelled chiefly with the neural connections in a `forward' direction and the analysis of such
networks with backward coupling proved intractable. All the interesting results of the Hopfield model arise as a
consequence of the strong backward coupling. Secondly, studies of perceptrons usually made a random net of
neurons deal with the external world, and did not ask the questions essential to finding the more abstract emergent
computational properties. Finally, perceptron modelling required synchronous neurons like a conventional digital
computer. Although synchrony of sorts must exist in biological nervous systems, for example the act of walking
involves the precise temporal coordination of both legs, or the Purkinje fibres which help control the heart, there
is certainly no global synchrony in the same sense that electronic hardware is clocked. Given the variations of
delays of nerve signal propagation, there would probably be no way to use global synchrony effectively. The fact
that computational properties can exist in spite of asynchrony has interesting implications for biological
computing.

Hopfield considers the special case wij = wji (all i, j), wii = 0 (all i) and defines a function
    E  =  - (1/2) Σ_{i≠j} wij xi xj  +  Σ_i θi xi                                    (2)


which is an analog to the physical energy of the system. A low rate of neural firing is approximated by assuming
that only one unit changes state at any given moment. Then, since wij = wji, the change ΔE due to Δxi is given by

    ΔE  =  - Δxi ( Σ_{j≠i} wij xj  -  θi )                                           (3)


Now consider the effect of the threshold rule (1). If the unit changes state at all then Δxi = ±1. If Δxi = +1 this means
the unit changes state from 0 to 1, hence by the threshold rule

    Σ_{j≠i} wij xj  >  θi                                                            (4)


in which case, by (3), ΔE < 0. Alternatively, Δxi = -1, the unit therefore changes state from 1 to 0, and

    Σ_{j≠i} wij xj  <  θi                                                            (5)


and again ΔE < 0. Thus if a unit changes state at all the energy must decrease. State changes will continue until
a locally least E is reached.5 The energy is playing the role of a Hamiltonian in the more general dynamic system
context.

For the Hopfield network individual state changes are deterministic. However, in more general models, such as
the Boltzmann machine, we can add a stochastic component to the node update rule which introduces a parameter
T called temperature. At T = 0 state changes are decided deterministically by the threshold rule. For T > 0, as T
increases the system becomes progressively less deterministic and more stochastic until, at high temperatures, any
individual node is in either state with probability ½. Thus we can regard the Hopfield network in operational mode
as the zero temperature case of the Boltzmann machine.
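As a sketch of how such a stochastic rule might look, the logistic form below is the standard Boltzmann-machine choice; the function name is illustrative and not from these notes:

```python
import math

def p_on(net, theta, T):
    """Probability that a unit adopts state 1, given net input `net`,
    threshold `theta` and temperature T.  At T = 0 this reduces to the
    deterministic threshold rule (at equality rule (1) keeps the previous
    state; 0.5 is used here as a neutral stand-in).  As T grows the
    probability tends to 1/2 whatever the input."""
    if T == 0:
        return 1.0 if net > theta else (0.5 if net == theta else 0.0)
    return 1.0 / (1.0 + math.exp(-(net - theta) / T))
```

For small T the unit is on with probability close to 1 whenever the net input exceeds the threshold, recovering the Hopfield network as the zero-temperature limit.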

Hopfield now makes the critical observation that

                                                "This case is isomorphic with an Ising model."

thereby allowing a deluge of physical theory describing spin-glass models to enter network modelling. This flood
of new participants has transformed the field of neural networks.

A spin glass is a magnetic alloy formed, for instance, by dilute solutions of manganese in copper or iron in gold.
These impurities interact with each other by means of conduction electrons and the couplings are either of the


    5
      This particular argument is valid only if the neural states change one at a time in some random order; which is approximated by a low
neural firing rate. However, Cohen and Grossberg have proved a theorem about a much wider class of networks which guarantees a similar
kind of stability for asynchronous operation.


ferromagnetic (wij > 0) or antiferromagnetic (wij < 0) type. The interest of these alloys comes from the fact that they
exhibit a wide variety of stable or meta-stable states. The dipoles interact via the couplings wij. In the simplest case,
a spin interacts only with its nearest neighbours, while the equivalent of the neural networks considered here,
requires infinite range interactions, where each spin is coupled to all others. In the Ising model, the Hamiltonian
of such a spin glass is proportional to

    Σ_{i≠j} wij xi xj                                                                (6)


the spins contributing to the total energy by pairwise interactions, and the system stabilizes at an equilibrium point
which is a minimum of the free energy, see [Binder 1986].


   Procedure Hopfield (Assumes weights are assigned)

             Randomise initial state x ∈ {0, 1}^n
             Repeat until updating every unit produces no change of state
                     Select unit i (1 ≤ i ≤ n) with uniform random probability
                     Update unit i according to (1)
             End


Algorithm 3-1 Hopfield network.
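Algorithm 3-1 translates directly into a few lines of code. The following is a minimal Python sketch (the function and parameter names are my own; the notes themselves supply a Mathematica version):

```python
import numpy as np

def hopfield_run(w, theta, x, rng=None, max_sweeps=100):
    """Asynchronous Hopfield dynamics: repeatedly pick units in random
    order and apply the threshold rule (1) until a full sweep over all
    units produces no change of state."""
    rng = np.random.default_rng(0) if rng is None else rng
    n = len(x)
    x = x.copy()
    for _ in range(max_sweeps):
        changed = False
        for i in rng.permutation(n):      # asynchronous: one unit at a time
            net = w[i] @ x                # w[i, i] == 0, so this sums over j != i
            new = 1 if net > theta[i] else (x[i] if net == theta[i] else 0)
            if new != x[i]:
                x[i] = new
                changed = True
        if not changed:                   # stable state reached
            break
    return x
```

Because each accepted change strictly decreases the energy (2), the loop always terminates in a fixed point for symmetric weights with zero diagonal.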



A number of computer simulations and some analysis led Hopfield to conclude that the number of `memories'
(point attractors) that could be stored by a network was about 0.15n, where n is the number of neurons in the
network, a figure quite precisely confirmed by later work. In an analytic tour de force [Amit 1987] it is shown
that the Hopfield model can be solved exactly, in the thermodynamic limit as n → ∞. A phase diagram
(temperature T, storage P/n) is obtained, where T is a measure of the noise level and P/n is the ratio of the number
of learnt patterns to the number of neurons [Crisanti 1986]. The main result is the existence of a sharp,
discontinuous phase transition at P/n = αc ≈ 0.14 (as T → 0). When P < αcn the retrieval is very good (0.97
correlation) but it drops suddenly for P > αcn.

Real neurons need not make synapses both i->j and j->i. We therefore ask if wij = wji is important. Without this
condition the probability of making errors is increased but the algorithm continues to generate stable minima. Why
should stable limit points or regions persist when wij is not equal to wji? If the algorithm at some time changes xi
from 0 to 1, or vice versa, the change of the energy can be split into two terms. The first is the change that would
apply to the symmetric model. The second is identically zero if wij is symmetric and is `stochastic' with mean zero
if wij and wji are randomly chosen. The algorithm in the non-symmetric case therefore changes E in time in a
fashion similar to the symmetric case but corresponding to a finite temperature, i.e. a lower signal-to-noise ratio.

In [Hopfield 1984] a more realistic neuron model is used in which an internal continuous variable stores the linear
sum of the excitation and inhibition weighted by the appropriate connection strength. The internal variable is
converted into an output activity by a sigmoidal non-linearity. As in the earlier paper he sets up an energy function
and shows that the evolution of the system in time, given the properties of the neurons, will be to decrease energy.
Thus the results found in the previous paper still largely hold. There is a brief, but clear, account of the Hopfield
model in [Farhat 1985] in which an optical implementation is described.

The outer product rule for assigning weights.

An early rule used for memory storage in associative memory models can also be used to store memories in the
Hopfield model. If the {0, 1} node outputs are mapped to xi ∈ {-1, +1} the rule assigns weights as follows. For
each pattern vector x, which we require to memorise, we consider the matrix

              x1                            x1x1  x1x2  ...  x1xn
              x2                            x2x1  x2x2  ...  x2xn
    x xT  =    .    (x1, x2, ..., xn)   =     .     .   ...    .                     (7)
              xn                            xnx1  xnx2  ...  xnxn

and then average these matrices over all pattern vectors (prototypes). At the time the explanation was that in this
way we can capture the average correlations between components of the pattern vectors and then use this
information, during the operation of the network, to recapture missing or corrupted components. Assuming that
we know the patterns required to be memorised the outer-product rule is a one-shot computation of the weights
and so perhaps should not qualify as a `learning rule' in the usual sense of progressive weight modification upon
exposure to experience.
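The one-shot computation is easily sketched in code; the following Python fragment (illustrative names, not from the notes) averages the outer products of ±1 prototype vectors and zeroes the diagonal so that wii = 0:

```python
import numpy as np

def outer_product_weights(patterns):
    """Outer-product (Hebbian) weight assignment: average the matrices
    x x^T of equation (7) over all +/-1 prototype vectors, then zero the
    diagonal to enforce w_ii = 0."""
    patterns = np.asarray(patterns, dtype=float)   # one +/-1 pattern per row
    m, n = patterns.shape
    w = patterns.T @ patterns / m                  # average of the outer products
    np.fill_diagonal(w, 0.0)
    return w
```

With a single stored pattern the pattern itself is a fixed point of the threshold dynamics, since each net input then has the same sign as the corresponding component.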

However, from the perspective of the Hopfield model there is another interpretation that can be given to this rule.
For a point x = (x1, ..., xn) to be a stable attractor (i.e. a memory) we require that it be a local energy minimum.
Suppose that patterns are presented sequentially and we wish to determine some rule which is intended to make
frequently presented patterns likely to be energy minima. If we suppose that x is given at some stage, then it has
an associated energy and, if we calculate

    ∂E/∂wij  =  - xi xj                                                              (8)


then by taking Δwij proportional to minus this gradient we obtain

    Δwij  =  η xi xj                                                                 (9)

where η > 0 is some small constant. In other words the averaging process of the outer product rule can now be
seen, in the context of progressive learning, as a form of gradient descent.

Unfortunately, however we assign the weights, as progressively more prototype memories are added the system
eventually reaches saturation (at around 0.15n). This happens because the number of local minima is not really
under our control: as more memories are added we obtain an exponentially increasing number of `spurious
memories', i.e. local minima that do not correspond to patterns we wish to memorise. For example, we have

Theorem [Tanaka 1980], [McEliece 1987]. If the synaptic matrix is symmetric with wii = 0 and if its elements are
independent Gaussian variables with zero mean and unit variance, then an asymptotic estimate for the number of
fixed points is given by

    NF  ≈  (1.0505) 2^(0.2874 n)                                                     (10)


A proof can be found in [Kamp 1990].
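For small n the theorem can be checked empirically by brute-force enumeration. A hypothetical sketch (the ±1 state convention and the strict sign condition for stability are my own choices):

```python
import numpy as np

def count_fixed_points(w):
    """Brute-force count of states x in {-1,+1}^n satisfying
    x_i * (w x)_i > 0 for every i, i.e. states stable under the
    zero-threshold asynchronous rule."""
    n = w.shape[0]
    count = 0
    for bits in range(2 ** n):
        x = np.array([1 if bits >> i & 1 else -1 for i in range(n)])
        if (x * (w @ x) > 0).all():
            count += 1
    return count

# A random symmetric Gaussian synaptic matrix with zero diagonal,
# as in the hypotheses of the theorem.
rng = np.random.default_rng(1)
n = 8
a = rng.standard_normal((n, n))
w = (a + a.T) / np.sqrt(2)        # symmetric, off-diagonal entries ~ N(0, 1)
np.fill_diagonal(w, 0.0)
nf = count_fixed_points(w)
```

For n = 8 the estimate (10) gives NF ≈ 1.0505 × 2^(0.2874×8) ≈ 5; individual random instances will of course scatter around the asymptotic value. Note that fixed points come in ± pairs, since negating a stable state leaves every product xi(wx)i unchanged.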

Networks for combinatoric search.

Apart from their applications to associative recall or pattern recognition, Hopfield networks can be applied to the
very different problem of combinatoric search. It should be made clear at the outset that, with the possible
exception of the Boltzmann machine, the application of neural networks to hard combinatoric search has not yet
yielded systems or algorithms which compare favourably with state-of-the-art probabilistic algorithms designed
for the specific problem, e.g. the TSP. However, this area is of considerable theoretical interest and may eventually
prove to be of practical interest.



We know that the dynamics of a Hopfield network cause it to relax into a local energy minimum. Given a specific
combinatoric search problem to be solved we are faced with two issues. First we have to design a representation
which relates network states to the objects in the search space. Second we have to arrange that low energy states
of the network correspond to good solutions.

To make matters specific we can consider the geometric TSP. Here the objects of search are tours and given a list
of N cities we can identify a tour as any permutation of this list. There are N! permutations and only ½(N-1)!
distinct tours, so we have already introduced some replication by simply identifying tours with permutations, but
this causes no serious problems of itself. The next step is to consider how tours might be represented as a state of
a network. Here, the generally used method is illustrated below.

For a 5-city problem {A,B,C,D,E}, if city A is in position 2 of the tour this is represented by the second neuron
from an array of five having an output of 1 and all others in the array having an output of 0, i.e. (0,1,0,0,0). The
global state of Table 2-2 represents the tour (C,A,E,B,D). Thus for N cities a total of n = N² neurons are required
to specify a complete tour.

Table 2-2 Example representation.

         1    2    3    4    5
    A    0    1    0    0    0
    B    0    0    0    1    0
    C    1    0    0    0    0
    D    0    0    0    0    1
    E    0    0    1    0    0

Clearly there is a 1-1 correspondence between valid tours and the set of all N×N permutation matrices, i.e.
matrices which have precisely one `1' in each row and column, all other entries being zero. For an N-city TSP,
there are N! states of such matrices which represent tours, and 2 to the power N² states in all. Now, this is not a
very satisfactory situation. We have replaced the original search space, of size order N!, by a space of much
greater size because
    log(N!)  ≈  N log N  -  N  +  (1/2) log(2πN)
                                                                                     (11)
    log(2^(N²))  =  N² log 2

where the approximation for N! is Stirling's formula. This is contrary to a guiding principle that wherever possible
in hard combinatoric search we should simplify the space searched (subject to the condition that good solutions
are rich in the smaller space) rather than make it larger. Nevertheless, the above representation has a certain
paradigmatic simplicity and will serve to illustrate the ideas.
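The permutation-matrix representation is easy to make concrete. A Python sketch (illustrative names, not from the notes), using indices A = 0, ..., E = 4 for the example of Table 2-2:

```python
import numpy as np

def tour_to_state(tour, n):
    """Permutation-matrix representation: unit (i, p) is on iff
    city i is visited at position p of the tour."""
    x = np.zeros((n, n), dtype=int)
    for p, city in enumerate(tour):
        x[city, p] = 1
    return x

def is_valid_tour_state(x):
    """A network state encodes a tour iff it has exactly one `1'
    in every row and in every column."""
    return bool((x.sum(axis=0) == 1).all() and (x.sum(axis=1) == 1).all())
```

The tour (C,A,E,B,D) is the list [2, 0, 4, 1, 3] in these indices; its state matrix is exactly Table 2-2, and switching on any extra unit breaks the permutation-matrix property.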

The next problem to be addressed is: how to assign the weights so that states with low energy correspond to short
tours. This `assignment problem' is dealt with in the next section.

Assignment of weights for the TSP.

Let dij denote the distance between city i and city j. Here we shall formulate the TSP as a 0-1 programming
problem, by defining it as a quadratic assignment problem [Garfinkel 1985]. The TSP can also be formulated as
a linear assignment problem [Aarts 1988] but, of course, then there are far more constraints. Using the n = N²
node state variables defined by

    xip  =  1, if the tour visits city i at the pth position                         (12)
    xip  =  0, otherwise


we can formulate the TSP as the following quadratic assignment problem:

         Minimise




    F(x)  =  Σ_{i,j,p,q=0}^{N-1} aijpq xip xjq                                       (13)



         subject to xip, xjq ∈ {0, 1} and

    Σ_{i=0}^{N-1} xip  =  1        (0 ≤ p ≤ N-1)
                                                                                     (14)
    Σ_{p=0}^{N-1} xip  =  1        (0 ≤ i ≤ N-1)



The first condition of (14) asserts there is just one `1' in every column, and the second condition places a similar
constraint on every row. The aijpq are defined by
    aijpq  =  dij, if q ≡ p ± 1 (mod N)
                                                                                     (15)
    aijpq  =  0, otherwise



Figure 3-1 Distance connections. Each node (i, p) has inhibitory connections to the two adjacent columns
(p - 1 and p + 1, mod N) whose weights reflect the cost of joining the three cities.

Figure 3-2 Exclusion connections. Each node (i, p) has inhibitory connections to all units in the same row
and column.


We wish to choose the weights so that minimising the objective function (13), subject to the constraints of (14),
corresponds to minimising the energy as defined in (2), (1).

Figure 3-1 and Figure 3-2 illustrate the connections made to each node of the network. These are divided into two
types. Distance connections, for which the weights are chosen so that, if the net is in a state which corresponds
to a tour, these weights will reflect the energy cost of joining the (p - 1)th city of the tour to the pth city, and the
pth city to the (p + 1)th. Note these connections wrap around (mod N). Exclusion connections, which inhibit two
units in the same row or column from being on at the same time. Exclusion connections are designed so as to
encourage the network to settle in a state which corresponds to a tour state. As all connections so far are inhibitory
we need to provide the network with some incentive to turn on any units at all. This can be done by manipulating
the thresholds. Intuitively we can see that some arrangement such as this might well have the desired effect, but
the next theorem shows exactly how to choose these weights so that it does.

We consider the following sets of parameters

    St  =  { θip : 0 ≤ i, p ≤ N-1 }
    Sd  =  { wipjq : i ≠ j and q ≡ p ± 1 (mod N) }                                   (16)
    Se  =  { wipjq : (i = j and p ≠ q) or (i ≠ j and p = q) }

Theorem (Aarts). Let the weights and thresholds be chosen so that

    ∀ θip ∈ St we have θip  <  -max{ dik + dil : k ≠ l, 0 ≤ k, l ≤ N-1 }
    ∀ wipjq ∈ Sd we have wipjq  =  -dij                                              (17)
    ∀ wipjq ∈ Se we have wipjq  <  min{ θip, θjq }

then
           (i) Feasibility. Valid tour states of the network exactly correspond to local minima of the energy function.

and        (ii) Ordering. The energy function is order-preserving with respect to tour length.

Proof. (i) Feasibility. Firstly, it is easy to check that a tour state is indeed a local minimum. We simply note the
effect of changing the state of some unit. Now suppose the state is not a tour state. We divide this into two possible
cases.

Case 1. Suppose the network state has more than a single `1' in some row or column.

Then at least one exclusion connection is activated. For definiteness suppose units (i, p) and (j, p) are both on and
that θip = min{θip, θjp} > wipjp. What is the effect of turning unit (i, p) off? Distance connections cause no
problems, for suppose that some unit (k, q), with q ≡ p ± 1 (mod N), in an adjacent column is also on. In this case
turning off (i, p) will remove a distance connection contribution dik to the energy, and so decrease it. For the
exclusion connection, if the weights are chosen according to (17) then turning off the connected unit (i, p), which
has lowest threshold, causes a change in energy

    ΔE  =  wipjp  -  θip  <  0                                                       (18)

and so again reduces the energy. Hence a network state with too many `1's in some row or column cannot
correspond to a local minimum.

Case 2. Now suppose that the network state has at most one `1' in every row and column (so that there are at most
N units on) and that at least one row or column has no unit on.

Suppose, for definiteness, that the pth column has no unit on. This means that in any row, the unit in the pth
column cannot be on. If every row contains some unit which is on and only N-1 columns are available, this means
that some column would have to contain two units which are on, which is contrary to hypothesis. Hence there must
be fewer than N units on, so that some unit in the pth column must be in a row which has no units on. Suppose
this is unit (i, p). Turning this unit on does not contribute to the energy via any exclusion connections (because
all other units in the same row or column are off). We next consider the contribution due to the distance
connections. In each of the adjacent columns (mod N) there is at most one unit on. Call these (k, p-1) and (l, p+1)
respectively, where it is understood that the indices p-1 and p+1 are taken (mod N). The choice of weights in (17)
ensures that the change of energy produced by turning unit (i, p) on is





    ΔE  =  - w(k,p-1)(i,p) x(k,p-1) x(i,p)  -  w(i,p)(l,p+1) x(i,p) x(l,p+1)  +  θip xip
                                                                                     (19)
        =  dik  +  dil  +  θip  <  0

Hence the initial state could not have been an energy minimum.

The two cases considered cover all possibilities for network states which are not tour states, and so we conclude
that the tour states exactly correspond to energy minima.

(ii) Ordering. To show that the energy function is order preserving we first observe that θip does not depend on p,
i.e. all θip for units in a given row are equal. Consequently, the contribution to the energy from the threshold terms
is the same for all tour states. The remaining terms contribute

    (1/2) Σ_{(i,p) ≠ (j,q)} dij xip xjq                                              (20)

where the sum runs over the distance connections, and this evaluates to the tour length. Hence energy preserves
the ordering of tour length. □

At first sight this theorem may seem paradoxical. In analysing the memory capacity of a Hopfield network, we
concluded earlier that only around 0.15n prototype vectors could be memorised before the onset of catastrophic
degradation in recall. Thus we were only able to control the network behaviour at a relatively small proportion of
an exponentially large (as in (10)) number of local minima. Yet, in the assignment of weights for the TSP network,
we have N! local minima exactly where we want them. How can this be? The answer lies in the following
observation. In analysing the effect of the outer-product rule for assigning weights in order to recall stored
memories, we assumed that there was no structure to the prototype vectors, i.e. that they were uncorrelated. In the
case of the TSP assignment there is a very definite structure in the set of states we are seeking to assign to minima:
the structure imposed by the ½N(N-1) distances between the cities of the original problem. The TSP weight
assignment respects this structure and that additional degree of control allows us to place N! states in exactly the
relationship needed to solve the problem.

A Mathematica implementation of the asynchronous Hopfield network is given in the Mathematica directory. The
assignment of weights for a specific TSP problem discussed here is also given there.
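The asynchronous dynamics are also easy to emulate in a few lines of conventional code. The following Python sketch (illustrative only, and not the Mathematica code just mentioned; the function names are ours) runs the asynchronous {0, 1} update rule until a full sweep produces no change, i.e. until the state is a fixed point:

```python
import random

def net_input(W, x, i):
    """Net input to unit i for weight matrix W and state vector x."""
    return sum(W[i][j] * x[j] for j in range(len(x)))

def run_to_fixed_point(x, W, theta, sweeps=100, seed=0):
    """Asynchronous Hopfield dynamics: repeatedly pick a unit at random and
    set x[i] = 1 iff its net input reaches the threshold theta[i].
    With symmetric W and zero diagonal the energy never increases, so the
    state settles into a local minimum (for the TSP weights, a tour state)."""
    rng = random.Random(seed)
    for _ in range(sweeps):
        before = list(x)
        for _ in range(len(x)):
            i = rng.randrange(len(x))
            x[i] = 1 if net_input(W, x, i) >= theta[i] else 0
        if x == before:   # a full sweep changed nothing: fixed point reached
            return x
    return x
```

For example, a two-unit network with mutually excitatory weights settles into one of its two stable states, [0, 0] or [1, 1], from any initial state.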

* The Hopfield and Tank application to the TSP.

Hopfield and Tank first proposed using the Hopfield model to solve the TSP in [Hopfield 1986]. Their choice of
energy function was motivated by similar considerations to those of the previous section, but was slightly different
in detail.

For a specific TSP problem we assume that the assignment of weights is made as described in the previous section.
Suppose we now initialise the corresponding Hopfield network to a random state. Then if the network is run we
should expect it to settle into a local energy minimum, and we have proved that this will correspond to some tour
state. However, it is unlikely to correspond to the optimal tour. Hopfield and Tank suggested an ingenious way of
overcoming this problem, which is very similar to the Boltzmann machine approach (not discussed in these notes).

Instead of the {0, 1} network Hopfield and Tank used a continuous model, 0 ≤ x_i ≤ 1, in which neurons possess
a sigmoidal activation function x_i = f(λ net_i), where net_i is the net input and λ is a positive constant which
represents the gain and is equivalent to varying the slope of the sigmoid. Hence f represents the input-output
characteristics of a non-linear amplifier with negligible response time. The discrete model represents the case
where λ → ∞. In their simulation λ is taken to be 50 (large but finite).

For 10 cities there are 181,440 possible tours. In the Hopfield and Tank simulations about 50% of the trials
produced one of the two shortest paths. They ask why the computation is so effective and provide the following



answer:

          "The solution to a TSP is a path and the decoding of the TSP's network final stable state to obtain this discrete decision or solution
          requires having the final x_ip values to be near 0 or 1. However, the actual analog computation occurs in the continuous domain
          0 ≤ x_ip ≤ 1. The decision-making process or computation consists of the smooth motion from an initial state in the interior of the space
          (where the notion of a `tour' is not even defined) to an ultimate stable point near enough to a corner of the continuous domain to be
          able to identify with that corner. It is as though the logical operations of a calculation could be given continuous values between `true'
          and `false', and evolve toward certainty only near the end of the calculation."

Naturally, with such an interesting approach to an NP-hard problem, others tried to repeat these results. It was
found by a number of researchers that the Hopfield-Tank algorithm, as originally formulated, was highly unstable:

          "Our simulations indicate that Hopfield and Tank were very fortunate in the limited number of TSP simulations they attempted. Even
          at the value N = 10 it transpires that their basic method is unreliable..." [Wilson 1988].

However, others later refined the energy function and modified the algorithm to perform more reliably. There are
also a number of related papers on `elastic net' methods for the TSP problem [Durbin 1987].

Conclusions.

One lesson we have learned from this chapter is that the performance of networks for hard combinatoric search
is critically dependent both on the mapping from the problem domain to the network, and on the details of the
network architecture and weight assignment which encode the constraints of the original problem. Appropriate
encoding of domain knowledge into the network architecture can profoundly enhance the overall performance. One
of the important general questions arising from these observations is how such encodings of basic constraints
might themselves be learnt, possibly at the genetic level.

In another direction entirely, we first discussed associative memories with an emphasis on direct methods of
encoding prototype patterns into the weights of a fully connected network. As our study of these networks
progressed the emphasis moved to a perspective more in accord with dynamic systems. This also mirrors the
historical development of the subject of artificial neural networks. Following Hopfield's papers prototype vectors,
or `memories', were associated with point attractors of the dynamics of the network. However, we can see that this
preoccupation is, in itself, just a special case of a much more general vista. Why should we restrict our attention
to the point attractors of the system? In modelling biological systems we are studying techniques for embedding
learned behaviour into the system. In other words we should really be studying the following question

          ! How can we sculpt the dynamic evolution of the system through time so as to induce
          behaviours characteristic of, or responsive to, a given external dynamic system?

If we loosely identify a `behaviour' with a trajectory through the state space of the network, then point attractors
are just one of the simplest characterisations of dynamic system behaviour. It would be much more interesting to
develop learning techniques for whole classes of trajectories, or even chaotic attractors. In other words to address
the issue of how to get a neural network to capture a model of a given dynamic system or Markov process for
example.

This is very much the line of reasoning suggested by Freeman's studies and simulations of biological neural
systems [Freeman 1991]. Phase portraits made from EEGs generated by computer models reflect the overall
activity of the olfactory system of a rabbit at rest and in response to a familiar scent (e.g. banana). The resemblance
of these portraits to irregularly shaped, but still structured, coils of wire reveals that brain activity in both
conditions is chaotic but that the response to the known stimulus is more ordered, more nearly periodic during
perception, than at rest.


                                                           Chapter references



[Aarts 1988] E. Aarts and J. H. M. Korst. Boltzmann machines for travelling salesman problems, European Journal
of Operational Research - in press.

[Abraham 1985] R. H. Abraham and C. D Shaw, Dynamics - The Geometry of Behaviour, Part 2: Chaotic
Behaviour, Arial Press, 1985.

[Amit 1987] D. J. Amit, H. Gutfreund, and H. Sompolinsky. Information storage in neural networks with low
levels of activity. Physical Review A, 35:2239-2303, 1987.

[Binder 1986] K. Binder and A. P. Young. Spin-Glasses - Experimental Facts, Theoretical Concepts and Open
Questions. Reviews of Modern Physics, 58(1):801-976, 1986.

[Burr 19XX]. D. J. Burr. An Improved Elastic Net Method for the Traveling Salesman Problem, ???

[Cragg 1954] B. G. Cragg and H. N. V. Temperley. Electroencephalog. Clin. Neurophys. 6:85, 1954.

[Crisanti 1986] A. Crisanti, D. J. Amit, and H. Gutfreund. Europhys. Letters 2:337, 1986.

[Durbin 1987] R. Durbin and D. Willshaw, An analogue approach to the travelling salesman problem using an
elastic net method, Nature 326, 16 April 1987.

[Farhat 1985] N. H. Farhat, et al. Optical implementation of the Hopfield model. Applied Optics, 24:1469-1475,
1985.

[Feller 1966] W. Feller. An introduction to probability theory and its applications, Vol. 1. J. Wiley & Sons, New
York, 1966.

[Freeman 1991] W. J. Freeman. The physiology of perception. Scientific American, pp 34-41, February 1991.

[Garfinkel 1985] R. S. Garfinkel. Motivation and modelling, in The Travelling Salesman Problem: a guided tour
of combinatoric optimisation. Eds. E. L. Lawler, J. K. Lenstra, A. H. G. Rinnooy Kan, and D. B. Shmoys. Wiley,
Chichester, 1985.

[Hertz 1991] J. Hertz, A. Krough, and R. G. Palmer, Introduction to the theory of neural computing.
Addison-Wesley, 1991. ISBN 0-201-51560-1 (pbk).

[Hopfield 1982] J. J. Hopfield, Neural networks and physical systems with emergent collective computational
abilities, Proceedings of the National Academy of Sciences 79: 2554-2558.

[Hopfield 1984] J. J. Hopfield, Neurons with graded response have collective computational properties like those
of two state neurons. Proceedings of the National Academy of Sciences 81: 3088-3092.

[Hopfield 1986] J. J. Hopfield and D. W. Tank. `Neural' computation of decisions in optimization problems.
Biological Cybernetics, 52:141-152., 1986.

[Kamp 1990] Y. Kamp and M. Hasler. Recursive Neural Networks for Associative Memory, John Wiley & Sons,
New York, 1990.

[Keeler 1988] J. D. Keeler. Cognit. Sci. 12: 299-329, 1988.

[Keeler 1989]. J. D. Keeler, E. E. Pichler and J. Ross. Noise in neural networks: Thresholds, hysteresis, and
neuromodulation of signal-to-noise. Proc. Nat. Acad. Sci. USA, 86: 1712-1716, March 1989.




[Little 1974] W. A. Little. Math. Biosci. 19:101, 1974.

[McEliece 1987] R. McEliece, E. Posner, E. Rodemich, and S. Venkatesh. The capacity of the Hopfield associative
memory. IEEE Transactions on Information Theory, IT-33:461-482, 1987.

[Tanaka 1980] F. Tanaka and S. Edwards. Analytic theory of the ground state properties of a spin glass: I. Ising
spin glass. J. Phys. F: Metal. Phys, 10:2769-2778, 1980.

[Wilson 1988] G. V. Wilson and G. S. Pawley, On the Stability of the Travelling Salesman Problem Algorithm
of Hopfield and Tank. Biological Cybernetics 58:63-70, 1988.






                                           IV The WISARD model.


Introduction.

WISARD (WIlkie, Stonham, Aleksander Recognition Device) is an implementation in hardware of the n-tuple
sampling technique first described in [Bledsoe 1959]. The scheme outlined in Figure 4-1, Figure 7-2 was first
proposed by Aleksander and Stonham in [Aleksander 1979].




Figure 4-1 Schematic of a 3-tuple recogniser.


The sample data to be recognized is stored as a 2-dimensional array (the 'retina') of binary elements. Depending
on the nature of the data this can be done in a variety of ways. For visual processing we can simply place a pre-
processed version of the image onto the retina. For temporal data in signal processing or speech recognition



successive samples in time can be stored in successive columns, and the value of the sample represented by a
coding of the binary elements in each column. The particular coding used is liable to depend on the application.
One of several possible codings is to represent a sample feature value by a 'bar' of binary 1's; the length of the bar
being proportional to the value of the sample feature.

WISARD model.

Random connections are made onto the elements of the array, n such connections being grouped together to form
an n-tuple which is used to address one random access memory (RAM) per discriminator. In this way a large
number of RAMs are grouped together to form a class discriminator whose output, or score, is the sum of all its
RAMs' outputs. This configuration is repeated to give one discriminator for each class of pattern to be recognized.
The RAMs implement logic functions which are set up during training; thus the method does not involve any
direct storage of pattern data.

A random map from array elements to n-tuples is preferable in theory, since a systematic mapping is more likely
to render the recogniser blind to distinct patterns having a systematic difference. Hard-wiring a random map in
a totally parallel system makes fabrication infeasible at high resolutions. In many applications systematic
differences in input patterns of the type liable to pose problems with a non-random mapping are unlikely to occur
since real data tends to be 'fuzzy' at the pixel level. However, the issue of randomly hardwiring individual RAMs
is somewhat academic since in most contexts a totally parallel system is not needed as its speed (independent of
the number of classes and of the order of the access time of a memory element) would far exceed data input rates.
At 512×512 resolution a semi-parallel structure is used where the mapping is 'soft' (i.e. achieved by
pseudo-random addressing with parallel shift registers) and the processing within discriminators is serial but the
discriminators themselves operate in parallel. Using memory elements with an access time of 10^-7 s, this
gives a minimum operating time of around 70 ms, which once again is independent of the number of classes.

The system is trained using samples of patterns from each class. A pattern is stored into the retina array and a
logical 1 is written into the RAMs of the discriminator associated with the class of this training pattern at the
locations addressed by the n-tuples. This is repeated, typically 25 to 50 times, for each class.

In recognition mode, the unknown pattern is stored in the array and the RAMs of every discriminator are put into
READ mode. The input pattern then stimulates the logic functions in the discriminator network and an overall
response is obtained by summing all the logical outputs. The pattern is then assigned to the class of the
discriminator producing the highest score.
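A serial emulation of a single discriminator takes only a few lines. The following Python sketch (class and function names are ours; a set of seen addresses per RAM stands in for the bit array) implements the training and scoring just described:

```python
import random

class Discriminator:
    """One WISARD class discriminator: p/n RAMs, each addressed by an n-tuple."""
    def __init__(self, tuples):
        self.tuples = tuples                 # list of n-tuples of retina indices
        self.rams = [set() for _ in tuples]  # addresses holding a logical 1

    def address(self, retina, t):
        """The address formed by reading the retina at the n-tuple's pixels."""
        return tuple(retina[i] for i in t)

    def train(self, retina):
        """Write a logical 1 at the addressed location of every RAM."""
        for ram, t in zip(self.rams, self.tuples):
            ram.add(self.address(retina, t))

    def score(self, retina):
        """Sum of RAM outputs: the number of n-tuples that fire."""
        return sum(self.address(retina, t) in ram
                   for ram, t in zip(self.rams, self.tuples))

def make_tuples(p, n, seed=0):
    """Random 1-1 mapping of p retina pixels into p/n n-tuples."""
    idx = list(range(p))
    random.Random(seed).shuffle(idx)
    return [tuple(idx[k:k + n]) for k in range(0, p, n)]
```

In use, one discriminator is built per class, each is trained on its own labelled samples, and an unknown pattern is assigned to the class whose discriminator returns the highest score.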

Where very high resolution image data is presented, as in visual imaging, this design lends itself to easy
implementation in massively parallel hardware. However, even with visual images, experience tends to suggest
that often a very good recognition performance can be obtained on relatively low resolution data. Hence in many
applications massively parallel hardware can be replaced by a fast serial processor and associated RAM, emulating
the design in micro-coded software. This was the approach used by Binstead and Stonham in optical character
recognition, with notable success. Such a system has the advantage of being able to make optimal use of available
memory in applications where the n-tuple size, or the number of discriminators, may be required to vary.

The advantages of the WISARD model for pattern recognition are:

         ! Implementation as a parallel, or serial, system in currently available hardware is inexpensive and
         simple.

         !   Given labelled samples of each recognition class, training times are extremely short.

         ! The time required by a trained system to classify an unknown pattern is very small and, in a parallel
         implementation, is independent of the number of classes.





The requirement for labelled samples of each class poses particular problems in speech recognition when dealing
with smaller units than whole words; the extraction of samples by acoustic and visual inspection is a labour
intensive and time consuming activity. It is here that paradigms such as Kohonen's topologising network, as
applied to speech by Tattershall, show particular promise. Of course, in such approaches there are other
compensating problems; principally, after the network has been trained and produced a dimensionally reduced
and feature-clustered map of the pattern space, it is necessary to interpret this map in terms of output symbols
useful to higher levels. One approach to this problem is to train an associative memory on the net output together
with the associated symbol.

Figure 4-2 Continuous response of discriminators to the input word 'toothache' [From Neural Computing
Architectures, Ed. I Aleksander].

Applications of n-tuple sampling in hardware have been rather sparse, the commercial version of WISARD as a
visual pattern recognition device able to operate at TV frame rates being one of the few to date - another is the
Optical Character Recogniser developed by Binstead and Stonham. However, one can envisage a multitude of
applications for such pattern recognition systems as their operation and advantages become more widely
understood.

The total amount of memory needed for the discriminator RAMs is easily calculated. Let n be the number of inputs,
C the number of classes, and assume a 1-1 mapping of a retina with p binary pixels. For convenience we assume
that n divides p. Then the number of 1-bit RAMs will be p/n and each RAM has 2^n bits of storage. So the number
of bits per discriminator is (p/n).2^n. Hence the total number of bits for C classes is

        C.(p/n).2^n                                                                                  (21)
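For instance, under equation (21) even a full-resolution retina is surprisingly cheap by modern standards. The figures below are illustrative choices, not the hardware WISARD's:

```python
def discriminator_bits(p, n, C):
    """Total storage, in bits, for C class discriminators on a p-pixel retina
    with a 1-1 mapping into n-tuples (n must divide p): C * (p/n) * 2**n."""
    assert p % n == 0
    return C * (p // n) * 2**n

# A 512x512 retina, 8-tuples, 10 classes: 10 * 32768 * 256 bits in total.
bits = discriminator_bits(512 * 512, 8, 10)
print(bits // (8 * 1024 * 1024), "megabytes")   # prints: 10 megabytes
```

Each discriminator here needs (p/n).2^n = 32768 × 256 bits, i.e. one megabyte, and the ten classes together need ten.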


Practical n-tuple pattern recognition systems have developed from the original implementation of the hardware
WISARD, which used regularly sized blocks of RAM storing only the discriminator states. As memory has become
cheaper and processors faster, for many applications such heavily constrained systems are no longer appropriate.
Algorithms can be implemented as serial emulations of parallel hardware and RAM can also be used to describe
a more flexible structure.

Typically the real-time system is preceded by a software simulation in which various parameters of the theoretical
model are optimized for the particular application. A design technique which is sufficiently general to cope with
a large class of such net-systems whilst at the same time preserving a high degree of computational efficiency is
described in [Binstead 1987]. In addition the structure produced has the property that it is easily mapped into
hardware to a level determined by the application requirements.

The rationale for believing that n-tuple techniques might be successfully applied to speech recognizers is briefly
outlined in [Tattershall 1984], where it is demonstrated that n-tuple recognisers can be designed so that in training
they derive an implicit map of the class conditional probabilities. Since the n-tuple scheme requires almost no
computation it appears to be an attractive way of implementing a Bayesian classifier. In a real time speech



recognition system the pre-processed input data can be slid across the retina and the system tuned to respond to
significant peaking of a class discriminator response, see Figure 4-2.

Work on using WISARD nets for speech recognition was started at the University of Brunel Pattern Recognition
Laboratory in 1983; some results of this work are reported in [Aleksander 1988a], Chapter 10. One novel feature
of this chapter is an account of the work of Jones and Valenzuela using Holland's Genetic Algorithm to breed
WISARD nets for the purpose of vowel detection.

WISARD - analysis of response.

Assume that there are a number of n-tuples whose address lines are randomly and uniformly connected to the
retina, and that the total area of the retina is 1. Suppose an n-tuple has been trained on patterns T_1, T_2, ..., T_l. Then
we shall evaluate the expected response to an unknown test pattern U. Let

        I_1 = U ∩ T_1,  I_2 = U ∩ T_2,  ...,  I_l = U ∩ T_l

        I_ij = U ∩ T_i ∩ T_j                        (for i ≠ j),
                                                                                                     (22)
        I_ijk = U ∩ T_i ∩ T_j ∩ T_k                 (i, j, k pairwise distinct),

        I_12,...,l = U ∩ T_1 ∩ T_2 ∩ ... ∩ T_l

Here I_i = U ∩ T_i is the set of all points on the retina for which the pixels in U and T_i take the same value, either
both '1' or both '0'. Let |U ∩ T_i| denote the area of this set intersection. Since the area of the retina is supposed to
be 1, it follows that |U ∩ T_i| is the probability that a single address line will give the same result when sampling
U as when sampling T_i. Since the n address lines are assumed uniformly distributed across the retina, the
probability that all n address lines give the same response when sampling U as when sampling T_i is just |U ∩ T_i|^n.
This is the probability that an n-tuple trained only on the pattern T_i will give the same response (i.e. fire) when
presented with the unknown pattern U.

For convenience, let

        p_i = |I_i|^n,  p_ij = |I_ij|^n,  ...,  p_12...l = |I_12...l|^n                              (23)

Suppose the n-tuple has been trained on two patterns T_i and T_j. Then the probability that an n-tuple trained only
on the patterns T_i and T_j will give the same response (i.e. fire) when presented with the unknown pattern U is

        p_i + p_j - p_ij                                                                             (24)

Now suppose the n-tuple has been trained on l patterns T_1, ..., T_l. By a well known combinatoric principle
(inclusion-exclusion), it follows that the probability that an n-tuple will give the same response (i.e. fire) when
presented with the unknown pattern U is

        p = Σ_i p_i - Σ_{i≠j} p_ij + Σ_{i,j,k distinct} p_ijk - ... + (-1)^(l+1) p_12...l            (25)
In the vernacular of n-tuple sampling theory this is called the nth power-law. Note that if U = T_i, i.e. U is equal to
one of the training patterns, then |I_i| = 1 and the remaining terms in the equation above sum to zero, hence p = 1 as
we might expect.
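Equation (25) is mechanical to evaluate. In the Python sketch below (the function name and the dictionary encoding of the overlap areas are ours) each non-empty subset S of trained patterns contributes its agreement area raised to the nth power, with alternating sign:

```python
from itertools import combinations

def fire_probability(areas, l, n):
    """Inclusion-exclusion estimate, as in equation (25), of the probability
    that an n-tuple trained on patterns T_1..T_l fires on an unknown U.

    areas[frozenset(S)] is the area of the retina on which U agrees
    simultaneously with every pattern indexed by the non-empty subset S
    of {0, ..., l-1}  (so areas[frozenset([i])] plays the role of |I_i|)."""
    p = 0.0
    for size in range(1, l + 1):
        sign = (-1) ** (size + 1)           # + for singles, - for pairs, ...
        for s in combinations(range(l), size):
            p += sign * areas[frozenset(s)] ** n
    return p
```

As a check, if U equals the first of two training patterns (agreement area 1 with it), the pairwise term cancels the second pattern's contribution and the result is 1, as the text argues.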





For example, suppose a discriminator is required to detect the horizontal position of a vertical bar and report on
the distance of the bar from the centre. The task is shown in Figure 4-3. T_1, the only training pattern, is seen to
be a bar of width 1/3 units, where we take the width and height of the retina to each be one unit. The test pattern
U could be a vertical bar of the same width as T_1 but anywhere that is wholly within the window. The distance of
the bar from the left hand edge is D, and the maximum value of D is 2/3. As D increases from 0 to 1/3 the overlap
between U and T_1 increases linearly, but the response R = |I_1|^n of the system for different n is governed by the
nth power law, see Figure 4-3.
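The geometry of this example is simple enough to compute directly. In the Python sketch below (names ours) the training bar T_1 is taken to be centred, i.e. with its left edge at D = 1/3, so that the agreeing area is 1 - 2|D - 1/3| while the bars overlap:

```python
def bar_response(D, n, width=1/3):
    """Expected response R = |I_1|**n when the training bar T_1 is centred
    (left edge at 1/3) and the test bar's left edge is at distance D from
    the left edge of a unit-square retina, 0 <= D <= 2/3.
    The bars disagree on a strip of width |D - 1/3| on each side of the
    overlap, so the agreeing area is 1 - 2*|D - 1/3| (floored at 1/3 once
    the bars are disjoint)."""
    offset = abs(D - width)
    agree = 1 - 2 * min(offset, width)
    return agree ** n
```

The response peaks at R = 1 when D = 1/3 (U = T_1) and falls away on either side; larger n sharpens the peak, which is exactly the nth power law at work.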

Comparison of storage requirements.

The functionality of the McCulloch and Pitts neuron model was a matter of interest in the mid 1960's and has
been discussed by [Muroga 1965]. With n inputs the 0/1 neuron can only perform linearly separable functions, i.e.
those that may be achieved by an (n-1)-dimensional hyperplane cutting the n-hypercube. We have seen that there
are about 2^(n^2) of these. Typical of the functions that such a device cannot perform are parity checking etc.

Figure 4-3 A discriminator for centering on a bar [From Neural Computing, I. Aleksander and H. Morton].

Suppose now we regard the n binary inputs as addressing memory in a RAM, as in the WISARD model discussed
earlier. There are 2^(2^n) logic functions

        f : (x_1, x_2, ..., x_n) → {0, 1}                                                            (26)

since there are 2^n possible inputs for each function and the function value at each of these inputs can be 0 or 1. The
functionality of the two models can be compared as follows.

With w-bit weights and zero threshold, a McCulloch and Pitts node requires wn bits of memory and can
perform at most 2^(wn) of the possible logic functions: a proportion of the total number possible which tends
exponentially to zero as n becomes large. In fact, there is no point in taking w very large for the discrete neuron,
since the number of hyperplane dichotomies of the 2^n vertices of the n-hypercube is fixed in terms of n (around
2^(n^2)); one needs just enough bits to make sure that all these hyperplane dichotomies are possible, and no more.
Although the proportion of possible logic functions which can be implemented by the McCulloch and Pitts neuron
is asymptotically zero as n tends to infinity, nevertheless it is this restricted functionality which gives the node the
capability of generalisation.
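A quick calculation makes the contrast vivid. The Python sketch below (name ours) returns the base-2 logarithm of the ratio 2^(wn) / 2^(2^n), an upper bound on the fraction of all Boolean functions of n inputs that a threshold unit with w-bit weights can realise; once 2^n exceeds wn the logarithm goes negative and the fraction collapses:

```python
def log2_fraction(w, n):
    """log2 of 2**(w*n) / 2**(2**n): an upper bound (in the exponent) on the
    fraction of the 2**(2**n) Boolean functions of n inputs realisable by a
    threshold unit with w-bit weights. Negative once 2**n > w*n."""
    return w * n - 2**n
```

With 8-bit weights the bound is still vacuous at n = 4 (the exponent is positive), but by n = 6 the fraction is at most 2^-16, and by n = 8 at most 2^-192.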

By comparison, a RAM with n address lines and 2^n bits of store can implement any of the possible logic functions
on n bits. However, the storage requirements rise exponentially with the number of inputs, and hence, in
assemblies of RAMs, with the level of interconnectivity. This is not the case with the McCulloch and Pitts
model, where storage increases linearly with n. In any event, detailed discussion of the relative merits of different
neural components frequently overlooks the fact that what is perhaps more relevant is the relative functionality



of large assemblies of such components.

In applications where the interconnectivity is low the idea of using a RAM offers obvious attractions. Two
problems present themselves. Firstly, what training algorithm should be used for assemblies of RAMs? Secondly,
how will such a system be capable of generalisation?

Subsequent work by Igor Aleksander's group at Imperial has resulted in a model known as the Probabilistic Logic
Node (PLN); such nodes are then cascaded into a pyramidal structure and combined with a simple multilayer
learning algorithm.

The PLN is a connectionist model introduced in [Aleksander 1988b] which is implementable as a RAM. Binary
inputs address a memory location; the node outputs the bit value stored there. The training algorithm sets location
contents according to a global error/correct signal; PLNs require no individual error information. A PLN can learn
any of the 2^(2^n) Boolean functions of its n inputs.

By this definition, it is straightforward to implement the PLN as a RAM with 2^n addressable memory locations.
In practice the output generally passes through a stochastic device before leaving the PLN. This randomizer allows
the PLN to exhibit non-deterministic properties; organic neurons also behave in a stochastic manner [Sejnowski
1981].
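A common formulation of the PLN combines the stored bit and the stochastic device into a three-valued store holding 0, 1 or 'u' (undefined), an undefined location emitting a random bit. The sketch below is ours, written under that assumption; the method names are illustrative:

```python
import random

class PLN:
    """Probabilistic Logic Node: a RAM with 2**n locations, each holding
    0, 1 or 'u' (undefined). An undefined location outputs 0 or 1 at
    random. Training acts on the most recently addressed location only,
    driven by a global correct/error signal."""
    def __init__(self, n, rng=random):
        self.store = ['u'] * (2 ** n)
        self.rng = rng
        self.last = None   # (address, output) of the most recent query

    def output(self, bits):
        """Address the store with the binary inputs and emit a bit."""
        addr = int(''.join(map(str, bits)), 2)
        value = self.store[addr]
        out = self.rng.randint(0, 1) if value == 'u' else value
        self.last = (addr, out)
        return out

    def reward(self):
        """Global 'correct' signal: freeze the output that was rewarded."""
        addr, out = self.last
        self.store[addr] = out

    def penalise(self):
        """Global 'error' signal: forget, and try again stochastically."""
        addr, _ = self.last
        self.store[addr] = 'u'
```

Note that, exactly as the text says, the node needs no individual error information: reward and penalise act only on the location it last addressed.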


                                              Chapter references


[Aleksander 1979] I. Aleksander and T.J. Stonham. A Guide to Pattern Recognition Using Random-Access
Memories. IEE Journal Computers and Digital Techniques, Vol. 2 (1), 29-40, 1979.

[Aleksander 1988a] A. Badii, M. J. Binstead, Antonia J. Jones, T. J. Stonham and Christine L. Valenzuela.
Applications of N-tuple Sampling and Genetic Algorithms to Speech Recognition. Neural Computing
Architectures, Chapter 10. Ed. I. Aleksander, Kogan Page, October 1988.

[Aleksander 1988b] I. Aleksander. Logical connectionist systems. In R. Eckmiller and Ch. v. d. Malsburg (Eds.)
Neural Computers (pp. 189-197). Springer-Verlag, Berlin, 1988.

[Bernstein 1981] J. Bernstein. Profiles: AI, Marvin Minsky. The New Yorker, December 14, 1981, pp 50-126.

[Binstead 1987] M. J. Binstead and Antonia J. Jones. A Design Technique for Dynamically Evolving N-tuple Nets.
IEE Proceedings, Vol. 134 Part E, No. 6, pp 265-269, November 1987.

[Bledsoe 1959] W. W. Bledsoe and I. Browning. Pattern Recognition and Reading by Machine. Proc. Eastern Joint
Computer Conf. Boston, Mass., 1959.






                                 V Feedforward networks and backpropagation.


Introduction.

In two papers describing the same model [Rumelhart 1986a], [Rumelhart 1986b], Rumelhart, Hinton and Williams
introduced a generalization of the Widrow-Hoff error correction rule called back propagation. The algorithm was
first described by Paul J. Werbos in his Harvard Ph.D. thesis [Werbos 1974] and also independently rediscovered
by [Parker 1985] and [Le Cun 1986]. The model assumes a discrete time system with synchronous update and with
each connection involving a unit delay.

Simple two-layer associative networks have no hidden units, they involve only input and output units. In these
cases there is no internal representation. As Minsky and Papert pointed out we need hidden units to provide the
possibility of recoding the input pattern into an internal representation because many problems simply cannot
otherwise be solved.

An example is the XOR problem mentioned earlier. Here the addition of a unit which detects the logical
conjunction of the inputs changes the similarity structure of the patterns sufficiently to allow the solution to be
learned, see Figure 5-1.

The problem addressed by simulated annealing
and backpropagation is the provision of a locally
computed learning rule which guarantees that an
internal representation adequate to solve the
problem will be found.

Backpropagation is based on local gradient descent. Output functions which step between 0 and 1, as the
activation of the neuron increases through a threshold value, are not differentiable and therefore provide no useful
gradient in weight space along which we can descend. We therefore consider semilinear activation functions. A
semilinear activation function f_i(net_i) is one in which the output of the unit is a non-decreasing and
differentiable function of the net input to the unit. In most cases f_i is independent of i and so we write f = f_i.

Figure 5-1 Solving the XOR problem with a hidden unit.
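The standard choice of semilinear activation is the logistic function, whose derivative is conveniently expressible through the output itself. A Python sketch (names ours):

```python
import math

def f(net):
    """The logistic function: a semilinear activation, non-decreasing and
    differentiable everywhere, unlike the hard 0/1 step."""
    return 1.0 / (1.0 + math.exp(-net))

def f_prime(net):
    """Its derivative: f'(net) = f(net) * (1 - f(net)), so the gradient
    can be computed from the unit's output alone."""
    y = f(net)
    return y * (1.0 - y)
```

This differentiability is precisely what the derivation below relies on: the factor f'(net_j) supplies a usable gradient at every point of weight space.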

Backpropagation - mathematical background.

The following functions, and hence all their partial derivatives, are assumed known.

         Error function.

                E(z_1, z_2, ..., z_n, t_1, t_2, ..., t_n)                                            (1)

         Here z_1, z_2, ..., z_n are the outputs from the output layer units and t_1, t_2, ..., t_n are the target outputs.








                        Figure 5-2 Feedforward network architecture.

          Activation function.

          Output layer:

                net_j = net_j(y_1, y_2, ..., y_m, p_j1, ..., p_jt)                                   (2)

Here y_1, y_2, ..., y_m are the outputs from the previous layer and p_j1, ..., p_jt are parameters associated with the jth node
of the output layer. Frequently t = t(m), i.e. the number of parameters associated with a node is a function of the
number of inputs.

          Previous layer:

                net_i = net_i(x_1, x_2, ..., x_l, p_i1, ..., p_is)                                   (3)

          Here x_1, x_2, ..., x_l are the outputs from the layer prior to the previous layer.

          Output function.

          Output layer:
                                                   zj   '    f(net j)               (1   # # j        n)              (4)
          Previous layer:
                                                yi      '    f(net i)               (1   # # i        m)              (5)


The output layer calculation.

For the output layer we have

                                Δpjz = -η ∂E/∂pjz                                                      (6)

where 1 ≤ z ≤ t, 1 ≤ j ≤ n, and η is the learning rate.

Hence

                                Δpjz = -η (∂E/∂netj)(∂netj/∂pjz) = η δj ∂netj/∂pjz                     (7)



where

                                δj = -∂E/∂netj = -(∂E/∂zj)(∂zj/∂netj) = -f′(netj) ∂E/∂zj               (8)

Equations (7) and (8) express the Δpjz in terms of known quantities.

The rule for adjusting weights in hidden layers.

In a similar way we can compute a rule which adjusts the weights in the previous layers.




                         Figure 5-3 The previous layer calculation.

We shall not go through the details of the derivation (which are quite straightforward) but the rule which emerges
is, for node i in this previous layer,

                                Δpiz = η δi ∂neti/∂piz                                                 (9)

where

                                δi = -f′(neti) ∂E/∂yi = f′(neti) Σj=1..n δj ∂netj/∂yi                  (10)

In (9) the partial derivative ∂neti/∂piz is known from (3). In (10) f′(neti) is known from (5), the δj were computed
in the previous step, and the last term is known from (2).

The conventional model.

The usual sum squared error is given by

                                E(z1, ..., zn, t1, ..., tn) = (1/2) Σj=1..n (zj - tj)²                  (11)

Hence, for 1 ≤ j ≤ n

                                ∂E/∂zj = zj - tj                                                       (12)




The linear activation function becomes

                                netj = Σi=1..m wji yi,      ∂netj/∂wji = yi
                                                                                                       (13)
                                neti = Σh=1..l wih xh,      ∂neti/∂wih = xh

where t = m and s = l. Thus (7) becomes

                                Δwjz = η δj yz                                                         (14)

and, using (12), (8) becomes

                                δj = -f′(netj) ∂E/∂zj = -f′(netj)(zj - tj)                             (15)

for an output layer unit.

Similarly (9) becomes

                                Δwiz = η δi ∂neti/∂wiz = η δi xz                                       (16)

where, from (10),

                                δi = f′(neti) Σj=1..n δj ∂netj/∂yi
                                   = f′(neti) Σj=1..n δj wji                                           (17)

for a hidden layer unit.

For this example of a linear activation function any sigmoidal function f is suitable. Frequently the function

                                zj = f(netj) = 1 / (1 + e^-(netj + θj))                                (18)

is used. Here θj is the threshold, or bias, of the unit. Conventionally the threshold is treated as just another weight
by creating a dummy unit which is always on (somewhat like ground in an electrical circuit). However, from the
present viewpoint the threshold can be considered as just another parameter associated with the unit, in which case
there is no need for the dummy unit.

Thus the implementation of back propagation involves a forward pass through the layers to estimate the error, and
then a backward pass modifying the synapses to decrease the error. Practical implementations are not difficult, but
without modification it is still rather slow, especially for systems with many layers. Still, it is at present the most
popular learning algorithm for multilayer networks. It is considerably faster than the Boltzmann machine.

The whole process is illustrated in the Mathematica file backprop.ma. The Mathematica program runs too slowly
to be of much practical use and is intended only to illustrate the process. A C-code implementation will run much
faster and is available in various implementations.
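The forward and backward passes can be sketched in a few lines of Python. This is a minimal illustration of equations (11)-(18) for a single hidden layer; the network sizes, learning rate, number of training cycles and training task (XOR) are arbitrary illustrative choices, not taken from the notes.

```python
import math
import random

# Minimal sketch of backpropagation for one hidden layer, following
# equations (11)-(18): sigmoid output function (18), sum-squared error (11),
# weight updates (14)-(17).  Sizes and the XOR task are illustrative.

random.seed(0)
f  = lambda net: 1.0 / (1.0 + math.exp(-net))   # sigmoid (18), bias folded into weights
df = lambda z: z * (1.0 - z)                    # f'(net) written in terms of z = f(net)

l, m, n, eta = 2, 4, 1, 0.5                     # inputs, hidden units, outputs, learning rate
W1 = [[random.uniform(-1, 1) for _ in range(l + 1)] for _ in range(m)]  # +1 bias weight
W2 = [[random.uniform(-1, 1) for _ in range(m + 1)] for _ in range(n)]

def forward(x):
    y = [f(sum(w * v for w, v in zip(row, x + [1.0]))) for row in W1]   # hidden outputs
    z = [f(sum(w * v for w, v in zip(row, y + [1.0]))) for row in W2]   # final outputs
    return y, z

def train_step(x, t):
    y, z = forward(x)
    dj = [df(zj) * (tj - zj) for zj, tj in zip(z, t)]        # (15): delta_j
    di = [df(yi) * sum(dj[j] * W2[j][i] for j in range(n))   # (17): delta_i
          for i, yi in enumerate(y)]
    for j in range(n):                                       # (14): output-layer update
        for i, v in enumerate(y + [1.0]):
            W2[j][i] += eta * dj[j] * v
    for i in range(m):                                       # (16): hidden-layer update
        for h, v in enumerate(x + [1.0]):
            W1[i][h] += eta * di[i] * v

data = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
        ([1.0, 0.0], [1.0]), ([1.0, 1.0], [0.0])]
sse = lambda: sum((forward(x)[1][0] - t[0]) ** 2 for x, t in data)

e0 = sse()
for _ in range(5000):
    for x, t in data:
        train_step(x, t)
print(e0, "->", sse())   # the error should fall substantially
```

With sigmoid units f′(net) = z(1 - z), which is why the derivative is written in terms of the unit's output rather than its net input.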

Problems with backpropagation.



Backpropagation is a very powerful tool for constructing non-linear models. However, there are a number of
problems which arise when one tries to use backpropagation in practice. These are:

         ! In many cases, given a particular training set, the minimum mean-squared error which one can expect
         on each output is unknown. Overtraining will result in memorisation, which will reduce the
         generalisation capability. Dealing with this problem can be very time consuming and involve trial and
         error.

         ! The optimal architecture for the network is unknown: too many units and the system will generalise
         poorly whilst taking a very long time to train; too few units and the system will not reach the optimal
         (unknown) mean-squared error. Dealing with this problem can be very time consuming and involve trial
         and error.

         ! Optimal values for the learning rate η > 0 (and the momentum term α > 0, if used) are unknown.
         Dealing with this problem can be very time consuming and involve trial and error.

One way of adjusting η > 0 dynamically as learning progresses is called the Bold Driver method. This adjusts the
learning rate according to the previous value of the error function: if the value has gone up the learning rate is
reduced proportionately; similarly the learning rate is increased if the value has gone down.
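The Bold Driver rule can be sketched as follows; the multiplicative factors (grow by 5%, halve on an error increase) are conventional illustrative choices, not values from the notes.

```python
# Sketch of the Bold Driver learning-rate schedule described above.
# The factors (1.05 up, 0.5 down) are illustrative conventions.

def bold_driver(eta, prev_error, new_error, up=1.05, down=0.5):
    """Return the adjusted learning rate given successive error values."""
    if new_error > prev_error:
        return eta * down     # error rose: cut the learning rate
    return eta * up           # error fell (or held): grow it gently

eta = 0.1
eta = bold_driver(eta, prev_error=1.00, new_error=0.90)   # error fell
eta = bold_driver(eta, prev_error=0.90, new_error=0.95)   # error rose
print(round(eta, 4))
```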

Until recently a successful application of backpropagation to a large problem was to some extent a matter of
patience and luck.

The gamma test - a new technique.

The Gamma test was developed by Aðalbjörn Stefánsson, N. Končar [Končar 1997], [Aðalbjörn Stefánsson 1997]
and myself and has emerged as an extremely useful tool to overcome the kinds of problems mentioned above.
It is a very simple technique which in many cases can be used to considerably simplify the design process of
constructing a smooth data model such as a neural network. The Gamma test is a data analysis routine that (in
an optimal implementation) runs in time O(MlogM) as M → ∞, where M is the number of sample data points, and
which aims to estimate the best Mean Squared Error (MSError) that can be achieved by any continuous or smooth
(bounded first partial derivatives) data model constructed using the data. A proof of the result under fairly general
hypotheses was finally given in [Evans 2002a] and [Evans 2002b].

Let a data sample be represented by

                                ((x1, ..., xm), y) = (x, y)                                            (19)

in which we think of the vector x = (x1, ..., xm) as the input, confined to a closed bounded set C ⊂ ℝ^m, and the scalar
y as the output. In the interests of simplicity the following explanation is presented for a single scalar output y. But
the same algorithm can be applied to the situation where y is a vector with very little extra complication or time
penalty. The Gamma test is designed to give a data-derived estimate for Var(r).

We focus on the case where samples are generated by a suitably smooth function (bounded first and second order
partial derivatives) f: C ⊂ ℝ^m → ℝ and

                                y = f(x1, ..., xm) + r                                                 (20)

where r represents an indeterminable part, which may be due to real noise or might be due to lack of functional
determination in the posited input/output relationship, i.e. an element of `one → many-ness' present in the data.

We make the following assumption

         ! Assumption A. We assume that training and testing data are different sample sets in which:
         (a) the training set inputs are non-sparse in input-space; (b) each output is determined from the

         inputs by a deterministic process which is the same for both training and test sets; (c) each
         output is subjected to statistical noise with finite variance whose distribution may be different
         for different outputs but which is the same in both training and test sets for corresponding
         outputs.

Suppose (x, y) is a data sample. Let (x′, y′) be a data sample such that |x′ - x| > 0 is minimal. Here |·| denotes
Euclidean distance and the minimum is taken over the set of all sample points different from (x, y). Thus x′ is the
nearest neighbour to x (in any ambiguous case we just pick one of the several equidistant points arbitrarily).

The Gamma test (or near neighbour technique) is based on the statistic

                                γ = (1/2M) Σi=1..M (y′(i) - y(i))²                                     (21)

where y′(i) is the y value corresponding to the first near neighbour of x(i).

It can be shown that γ → Var(r) in probability as the nearest neighbour distances approach zero. In a finite data
set we cannot have nearest neighbour distances arbitrarily small so the Gamma test is designed to estimate this
limit by means of a linear correlation.

Given data samples (x(i), y(i)), where x(i) = (x1(i), ..., xm(i)), 1 ≤ i ≤ M, let N[i, p] be the list of (equidistant) pth
nearest neighbours to x(i). We write

                δ(p) = (1/M) Σi=1..M (1/L(N[i, p])) Σj∈N[i, p] |x(j) - x(i)|²
                     = (1/M) Σi=1..M |x(N[i, p]) - x(i)|²                                              (22)

where L(N[i, p]) is the length of the list N[i, p]. Thus δ(p) is the mean square distance to the pth nearest neighbour.
Nearest neighbour lists for pth nearest neighbours (1 ≤ p ≤ pmax), typically we take pmax in the range 20-50, can be
found in O(MlogM) time using techniques developed by Bentley, see for example [Friedman 1977].

We also write

                γ(p) = (1/2M) Σi=1..M (1/L(N[i, p])) Σj∈N[i, p] (y(j) - y(i))²                         (23)

where the y observations are subject to statistical noise assumed independent of x and having bounded variance.6

Under reasonable conditions one can show that

                                γ ≈ Var(r) + Aδ + o(δ)    as M → ∞                                     (24)

where the convergence is in probability.

The Gamma test computes the mean-squared pth nearest neighbour distances δ(p) (1 ≤ p ≤ pmax, typically pmax ≈
10) and the corresponding γ(p). Next the (δ(p), γ(p)) regression line is computed and its vertical intercept is
returned as the gamma value Γ. Effectively this is the limit of γ as δ → 0, which in theory is Var(r).




    6
    The original version in [Aðalbjörn Stefánsson 1997] and [Končar 1997] used smoothed versions of δ(p) and
γ(p) which rolled off the significance of more distant near neighbours. Later experience showed that this
complication was largely unnecessary and the version of the software used here is implemented as described above.



          Procedure Gamma (or Near Neighbour Test) (data)
                     (* data is an array of points (x(i), y(i)), (1 ≤ i ≤ M), in
                     which x is a real vector of dimension m and y is a real scalar
                     *)

          For i = 1 to M (* compute x-nearest neighbour list for each data
          point. This can be done in O(MlogM) time using a kd-tree for
          example.*)
                For p = 1 to pmax
                      AppendTo[N(i, p)] all the elements t where x(t) is a pth
                      nearest neighbour to x(i).
                endfor p
          endfor i
          For p = 1 to pmax
                compute δ(p) as in (22)
                compute γ(p) as in (23)
          endfor p
          Perform least squares fit on coordinates (δ(p), γ(p)) (1 ≤ p ≤ pmax)
          obtaining (say) y = Ax + Γ
          Return (Γ, A)

Algorithm 5-1 The Gamma test.
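The algorithm can be sketched in plain Python as follows. For clarity this version finds neighbours by brute-force O(M²) search rather than the kd-tree mentioned above, breaks distance ties arbitrarily rather than averaging over equidistant lists, and tests on an illustrative function y = x1·x2 + noise; none of these choices come from the notes.

```python
import random

# Brute-force sketch of Algorithm 5-1 (the Gamma test).

def gamma_test(X, Y, pmax=10):
    """Return (Gamma, A): intercept and slope of the (delta(p), gamma(p)) fit."""
    M = len(X)
    dist2 = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    delta = [0.0] * pmax                      # mean-square pth NN distances, as in (22)
    gamma = [0.0] * pmax                      # corresponding gamma statistics, as in (23)
    for i in range(M):
        order = sorted((j for j in range(M) if j != i),
                       key=lambda j: dist2(X[i], X[j]))
        for p in range(pmax):
            j = order[p]
            delta[p] += dist2(X[i], X[j]) / M
            gamma[p] += (Y[j] - Y[i]) ** 2 / (2.0 * M)
    # least squares fit gamma = A*delta + Gamma; the intercept estimates Var(r)
    mx, my = sum(delta) / pmax, sum(gamma) / pmax
    A = (sum((d - mx) * (g - my) for d, g in zip(delta, gamma)) /
         sum((d - mx) ** 2 for d in delta))
    return my - A * mx, A

random.seed(1)
X = [[random.random(), random.random()] for _ in range(1000)]
Y = [x[0] * x[1] + random.gauss(0.0, 0.05) for x in X]   # Var(r) = 0.0025
G, A = gamma_test(X, Y)
print("Gamma =", G)    # should be close to 0.0025
```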
A technique which allows one to estimate Var(r), on the hypothesis of an underlying continuous or smooth model
f, is of considerable practical utility in applications such as control or time series modelling. The implication of
being able to estimate Var(r) in neural network modelling is that then one does not need to train the network (or
indeed any smooth data model) in order to predict the best possible performance with reasonable accuracy.

We have used the Gamma test:

         ! To find the minimal number of data samples required to produce a near optimal model. We do this by
         computing Γ for increasing M. The graph asymptotes to the true value of Var(r). When M is sufficiently
         large to ensure that Γ has stabilised close to the asymptote there is little advantage to be gained by
         increasing M.

         ! To automatically and rapidly (the near neighbour information can be used to speed up training by
         picking the data points with worst errors and performing backpropagation on a suitable subset of the
         weights) construct a near minimal neural network architecture and weights which best models the data
         [Končar 1997]. The Gamma test provides the criterion for ceasing training.


For estimating the number of `hills' required in an initial approximation for the network architecture we use the
heuristically based formula h = am, where

                                a ≈ (2/π)√(mA)                                                         (25)


         ! To determine the best embedding dimension and delay time for time series [Masayuki 1997].

         ! To determine the best set of inputs from a list of possible inputs for a neuro-controller [Končar 1997].


The last two applications illustrate what is perhaps the main utility of the Gamma test: on data sets which are not
excessively large the test is sufficiently fast to be run on a complete examination of all possible subsets of up to 20
inputs (for a larger number of inputs we use a Genetic Algorithm for which the fitness of a selection of inputs is
based on how small the Γ value is for the selection).

* Metabackpropagation.

This is an algorithm developed by N. Končar and myself which overcomes many of the disadvantages of simple
backpropagation. It is pivotally dependent on the Gamma test.

 Procedure Metabackpropagation
          1. Use the Gamma test to determine the optimal number of data vectors
          (input-output pairs) (M), the selection and number of inputs (m), the
          mean squared error (MSError) to which the network should be trained
          (Γ), and the first approximation to the neural architecture (using
          A).
          2. Create a feedforward network that has the number of hills (h)
          specified by A. A single hill is made of two fully connected hidden
          layers with 2m nodes in the first hidden layer and one node in the
          second hidden layer.
          3. Initialise each hill by doing a small number of Backpropagation
          training cycles on subsets of the data. Subsets are chosen from the
          near neighbour lists generated during the computation of the Gamma
          test results. To find these subsets, the neural network weights are
          first randomised and the points which give the largest errors are
          identified by feeding every input vector through the network. The
          near neighbour lists of the points which give the largest errors are
          the subsets that are chosen. Each hill is trained on its own
          exclusive subset. This will initialise the weights of each hill so
          that it is positioned in the right place.

          4. Perform Backpropagation training on the entire neural network
          architecture until either a specified number of cycles is exceeded or
          the target MSError is reached. Adjust the learning rate by the Bold
          Driver method.
          5. Increase the number of hills if the Gamma test MSError is not
          reached and then go back to 2.


Algorithm 5-2 Metabackpropagation.
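Step 2 of the procedure can be sketched as follows; the nested-list weight representation and the particular sizes are illustrative choices, not from the notes.

```python
import random

# Sketch of step 2 of Metabackpropagation: a network of h "hills", each
# hill being two fully connected hidden layers with 2m nodes in the first
# and one node in the second.  The representation is an illustrative choice.

random.seed(0)

def make_hill_network(m, h):
    """Return per-hill weights for m inputs and h hills (single output)."""
    rnd = lambda k: [random.uniform(-1, 1) for _ in range(k)]
    hills = []
    for _ in range(h):
        layer1 = [rnd(m + 1) for _ in range(2 * m)]  # 2m nodes, each with m inputs + bias
        layer2 = rnd(2 * m + 1)                      # 1 node over the 2m hill outputs + bias
        hills.append((layer1, layer2))
    return hills

net = make_hill_network(m=3, h=4)
n_weights = sum(len(w) for l1, l2 in net for w in l1 + [l2])
print(len(net), "hills,", n_weights, "weights")
```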

We have found that the application of the Gamma test and Metabackpropagation has transformed the production
of a nonlinear model using feedforward networks from a black art to an almost fully automated and fast process.

* Neural networks for adaptive control.

To illustrate our discussion of adaptive control we shall use a simple example.

         There are two tanks of water at temperatures Tcold < Thot respectively. The tanks are drained at
         rates c1(t) and c2(t) ltrs/sec into a third tank of volume Vmax (not used). The third tank drains at
         an (initially) constant rate r. The parameters c1 and c2 are considered as control variables and
         it is desired to determine a control strategy for c1 and c2 which will maintain a volume Vgoal at
         temperature Tgoal, where Tcold < Tgoal < Thot, in the third tank.








Figure 5-4 The Water Tank Problem: the cold tank (Tcold) and hot tank (Thot) drain through control valves
cont[[1]] and cont[[2]] into the target tank, which empties through a drain.


Let (V(t), T(t)) denote the volume and temperature of the target tank. The differential equations describing the
system are

                V dT/dt + T dV/dt = c1(t)Tcold + c2(t)Thot - rT
                                                                                                       (26)
                dV/dt = c1(t) + c2(t) - r

where c1 and c2 are the cold and hot valve settings and r is the rate of drain from the target tank.

Physical Assumptions.

         1. We assume that as water drains into the third tank mixing is instantaneous.
         2. We assume that there is no heat loss from the third tank, apart from that lost due to the outflow.
         3. We have assumed that the two feeder tanks contain water with a specific heat of unity. It is a simple
         matter to modify the equations to deal with the case of two inert liquids of specific heats s1 and s2
         respectively. If an endo- or exo-thermic reaction results from the mixing then this could also be taken
         account of in the model.
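A simple Euler integration of equations (26) illustrates the dynamics. Substituting dV/dt into the first equation gives dT/dt = (c1(Tcold - T) + c2(Thot - T))/V; the particular temperatures, rates and step size below are illustrative choices.

```python
# Euler-integration sketch of the tank equations (26).

def simulate(V, T, c1, c2, r, Tcold=10.0, Thot=90.0, dt=0.1, steps=1000):
    """Integrate (26) with constant valve settings c1, c2 and drain rate r."""
    for _ in range(steps):
        dV = c1 + c2 - r                              # dV/dt
        # from V dT/dt + T dV/dt = c1*Tcold + c2*Thot - r*T:
        dT = (c1 * (Tcold - T) + c2 * (Thot - T)) / V
        V += dt * dV
        T += dt * dT
    return V, T

# equal hot/cold inflow balancing the drain: volume holds, T relaxes towards 50
V, T = simulate(V=100.0, T=20.0, c1=1.0, c2=1.0, r=2.0)
print(round(V, 2), round(T, 2))
```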

We now proceed to construct an adaptive neurocontroller for the Water Tank Problem. The basic architecture we
shall use was developed by N. Končar and myself and applied to the Attitude Control Problem [Končar 1995]. It
is described in Figure 5-5.

The components of this system are:

         The Planner: Knowing the long term goal state (which could dynamically change) and the current state,
         together with maximal variations in one time step, the Planner sets the next desired state. A simple Linear
         Planner proceeds by taking a small segment of the straight line in state space between the current state
         and the goal. A wide range of other variations are also possible.

          The ∆ Mapping: This transforms the next desired state and the current state into an efficient
          representation of the difference between the two, which forms the inputs for the neural network.

          The Neural Network: Using the outputs from the ∆ mapping the neural network generates a suitable set
          of control signals for application at the current time step.





                                                       Input
  Long term                       Desired next          map
  goal state          Planner     state xG(k+1)                                      Control
                      unit                                             NN controller input u(k)
                                                           ∆
                                                                        (running)                         Dynamic       Actual next
   Current                                                                                                system        state x(k+1)
   state x(k)




         Training data from observed facts: Add the data ((x(k+1), x(k)), ( u(k))), (inputs, outputs),
         to the circular buffer of most recent (input,output) vector pairs. i.e. If the desired next state
         were x(k+1) and the current state is x(k) then the correct control input would be u(k).


                                                                    Input
                                                                     map
                         Next (desired) state       x(k+1)                            NN controller       Control
                                                                      ∆                (learning)
                        Current state               x(k)                                                   u(k)




          Training consists of performing backpropagation on the entire contents of the circular
          buffer of most recent (input, output) vector pairs.


Figure 5-5 Architecture for direct inverse neurocontrol.

The model is adaptive because whenever the network fails to place the next state within a prescribed error tolerance
of the next desired state, then the ((state, nextstate), (controls)) are added to the training buffer and the network
undergoes a further round of backpropagation training (it is assumed that this is done in hardware - the networks
under discussion usually have only a small number of nodes).

The main problem in specific applications centres around two issues:

          ! What input mapping ∆ is optimal for the specific application? This turns out to be quite critical. With
          the wrong mapping the controller may not work efficiently (or even at all). Ideally we seek to reduce
          equivalence classes of control inputs to a single representative, thus reducing the learning demands on
          the neural network.

          !    What is the optimal architecture for the neural network?

In fact both questions can be addressed by a combination of Metabackpropagation and the Gamma test.

The difference between the desired state (Vdes, Tdes) and the actual state (V, T) can be measured in various ways.
For example, with a time step between changes in control signals of delta = 1 in the simulator, if we just take all
four of the scalars as inputs then we obtain a Γ̄ of around 0.08 on each control output. Since (0.08)½ ≈ 0.28, this
implies an absolute error on the control outputs of the order of 28%, which is unlikely to be effective.

If we return to the original equations for the Water Tank we can determine that if at two states (V1, T1), (V2, T2)
we apply the same control inputs (c1, c2) then (dV/dt, dT/dt) will be the same at both states provided

                        V2T1 − V1T2  =  [(c1Tcold + c2Thot) / (c1 + c2)] (V2 − V1)                            (27)
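As a quick numerical check of equation (27), the sketch below assumes a perfectly mixed tank with cold inflow
c1 at temperature Tcold and hot inflow c2 at Thot; this is an assumed form of the Water Tank dynamics, not the
notes' exact simulator:

```python
# Assumed water-tank dynamics (a sketch, not the notes' exact simulator):
# perfectly mixed tank, cold inflow c1 at Tcold, hot inflow c2 at Thot.
def derivatives(V, T, c1, c2, Tcold=10.0, Thot=90.0):
    dV = c1 + c2                                    # volume grows with total inflow
    dT = (c1 * (Tcold - T) + c2 * (Thot - T)) / V   # mixing drives temperature
    return dV, dT

c1, c2 = 1.0, 3.0
V1, T1 = 100.0, 50.0
# Choose (V2, T2) to satisfy equation (27): V2*T1 - V1*T2 = K*(V2 - V1).
K = (c1 * 10.0 + c2 * 90.0) / (c1 + c2)
V2 = 120.0
T2 = (V2 * T1 - K * (V2 - V1)) / V1
# The two derivative pairs agree: the states are control equivalent.
print(derivatives(V1, T1, c1, c2), derivatives(V2, T2, c1, c2))
```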




Thus, under these circumstances, the two states (V1, T1), (V2, T2) are control equivalent, i.e. the application of
(c1, c2) in either of these states will (instantaneously) produce the same effect. This suggests the input mapping
described in Figure 5-6.



[Diagram: (Vdes, Tdes, V, T) pass through the ∆ transformation to give the two network inputs
VdesT − VTdes and Vdes − V.]

             Figure 5-6 Delta transformation of state differences: maps to the 2-4-2 network inputs
             for the Water Tank Problem.
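The Figure 5-6 transformation itself is a one-liner; the sketch below is only an illustration of the mapping,
with made-up argument values:

```python
# Sketch of the Figure 5-6 input map: the 4-dimensional state difference
# (Vdes, Tdes, V, T) is collapsed to the two 2-4-2 network inputs.
def delta_map(Vdes, Tdes, V, T):
    return (Vdes * T - V * Tdes, Vdes - V)

print(delta_map(2.0, 3.0, 1.0, 4.0))  # -> (5.0, 1.0)
```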

If we perform the Gamma test using these inputs we obtain (for c1 and similarly for c2) the regression line
illustrated in Figure 5-7. This may well not be the ideal mapping but we have some grounds for preferring it over
the naive choice. (The Attitude Control problem similarly involves some mathematical ingenuity in the
construction of the ∆ mapping.)

If we take the average for the pth (2 ≤ p ≤ 5) nearest neighbour then we estimate Γ̄ = lim Γ as 0.027, leading
to an absolute error on the outputs of the order of 16.4%. This represents the best MSError that a feedforward
neural network trained on the data can achieve.

[Plot: the Gamma regression line, Γ (capgamma) against δ (capdelta), over δ = 0.002-0.014 and Γ = 0-0.035.]

Figure 5-7 Least squares fit to 200 data points with 20 nearest neighbours: Γ̄ = 0.0332.

Although the question of building an adaptive ∆ mapping is obviously crucial, if we have some mathematical
understanding of the system we are seeking to control then the Gamma test is obviously very time saving.

Figure 5-8 - Figure 5-10 show the results of a quick trial of the neurocontroller (illustrated in the Mathematica
file nn-tank.ma). These results for the Water Tank Problem are preliminary examples and do not represent a final
design. We have not made the training adaptive in the example file. The training was done prior to the simulation
on 200 data points uniformly distributed in state-difference space. Performance for small state differences
(effectively performance near the goal state) could easily be improved by increasing the density of training data
near the origin of state-difference space. It might also be improved by modifications of the Planner module.
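For orientation, the idea behind the Gamma test (see [Evans 2002a] for the real algorithm and its proof) can be
caricatured as follows. This is only a rough sketch, not the published method: average squared input distances
(δ) and half squared output differences (γ) over the p-th nearest neighbours, then extrapolate the least-squares
line to δ = 0; the intercept estimates the noise variance on the output.

```python
import numpy as np

def gamma_test(X, y, p_max=10):
    """Rough sketch of the Gamma test: the regression-line intercept at
    delta = 0 estimates the variance of the noise on the output y."""
    n = len(X)
    # Pairwise squared input distances, self-distances excluded.
    D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(D, np.inf)
    order = np.argsort(D, axis=1)          # neighbour lists for each point
    deltas, gammas = [], []
    for p in range(p_max):
        nb = order[:, p]                   # index of each point's (p+1)-th neighbour
        deltas.append(D[np.arange(n), nb].mean())
        gammas.append(0.5 * ((y - y[nb]) ** 2).mean())
    slope, intercept = np.polyfit(deltas, gammas, 1)
    return intercept                       # the Gamma statistic

# Noisy quadratic: noise variance 0.01, so the intercept should be near 0.01.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 2))
y = (X ** 2).sum(1) + rng.normal(0, 0.1, 500)
print(gamma_test(X, y))
```

The square root of the intercept then bounds the best achievable absolute error, exactly as used in the 28% and
16.4% estimates above.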







[Plots: volume (ltrs) against time (secs). A close-up over t = 102-110 secs shows the volume settling between
roughly 96 and 104 ltrs; the full run over t = 0-200 secs spans roughly 0-140 ltrs.]

Figure 5-8 Volume variation without adaptive training. 2-4-2 Network. MSE = 0.052. Linear
Planner.




[Plots: temperature against time (secs). A close-up over t = 96-104 secs shows the temperature settling between
roughly 49 and 51; the full run over t = 0-200 secs spans roughly 20-70.]

Figure 5-9 Temperature variation without adaptive training. 2-4-2 Network. MSE = 0.052.
Linear Planner.




[Plots: valve setting (ltrs/sec) against time (secs). A close-up over t = 96-104 secs and the full run over
t = 0-200 secs, with settings in the range 0-4.]

Figure 5-10 The control signals generated by the 2-4-2 network without adaptive training.
Linear Planner.






                                             Chapter references

[Evans 2002a] D. Evans and Antonia J. Jones. A proof of the Gamma test. Proc. Roy. Soc. Series A 458(2027),
2759-2799, 2002.

[Evans 2002b] D. Evans, Antonia J. Jones, W. M. Schmidt. Asymptotic moments of near neighbour distance
distributions. Proc. Roy. Soc. Lond. Series A, 458(2028):2839-2849, 2002.

[Friedman 1977] J.H. Friedman, J.L. Bentley and R.A. Finkel. An algorithm for finding best matches in
logarithmic expected time. ACM Transactions on Mathematical Software, 3(3):200-226, 1977.

[Jones 2002] Antonia J. Jones, A.P.M. Tsui, and Ana G. Oliveira. Neural models of arbitrary chaotic systems:
construction and the role of time delayed feedback in control and synchronization. With html and pdf electronic
supplement. Complexity International, Volume 09, 2002. ISSN 1320-0682. Paper ID: tsui01, URL:
http://www.csu.edu.au/ci/vol09/tsui01/

[Končar 1997] N. Končar. Optimisation methodologies for direct inverse neurocontrol. Forthcoming Ph.D. thesis,
Department of Computing, 180 Queen's Gate, London, SW7 2BZ, U.K.

[Končar 1995] N. Končar and Antonia J. Jones. Adaptive real-time neural network attitude control of chaotic
satellite motion. Presented at Aerospace/Defense Sensing & Control and Dual-Use Photonics, SPIE (The
International Society for Optical Engineering and Photonics in Aerospace Engineering) International Symposium,
Orlando, Florida, April 17-21, 1995.

[Rumelhart 1986a] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Learning internal representations by error
propagation. Chapter 8, Parallel Distributed Processing, Vol. 1., M.I.T. Press.

[Rumelhart 1986b] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Learning representations by
back-propagating errors, Nature 323: 533-536.

[Adalbjörn Stefánsson 1997] Adalbjörn Stefánsson, N. Končar and Antonia J. Jones. A note on the Gamma test,
Neural Computing & Applications 5(3):131-133, 1997. ISSN 0941-0643. [Preprint]

[Tsui 2002] Alban P. Tsui, Antonia J. Jones and Ana Guedes de Oliveira. The construction of smooth models using
irregular embeddings determined by a Gamma test analysis. Neural Computing & Applications 10(4), 318-329,
April 2002. ISSN 0941-0643.

[Werbos 1974] P. J. Werbos, Beyond regression: new tools for prediction and analysis in the behavioral sciences.
Ph.D. thesis, Harvard University, Cambridge, MA.






                                                 * VI The chaotic frontier.


Introduction.

It is interesting to observe from a modern standpoint that Newtonian physics already contained the seeds of its own
destruction. Quite apart from the later quantum mechanical caveat of the Heisenberg Uncertainty Principle, the
classicists overlooked the 'computational cost' of making deterministic predictions an indeterminate time into the
future. The fact is that for many deterministic systems, the computational cost of making an accurate prediction
a substantial time into the future becomes prohibitive. Thus the classical view that if a system is deterministic then
its future behaviour could be predicted for all time contains a basic flaw. This flaw becomes particularly apparent
when we consider chaotic systems.

Chaos is the word we use to describe deterministic behaviour for which nevertheless, in view of the computational
cost, even if the initial conditions were known to an arbitrary degree of precision, the long term behaviour cannot
be accurately predicted. This is certainly the case with many natural systems for which in any case we cannot know
the initial conditions to an arbitrary degree of precision. A classic example, first considered by E. N. Lorenz, is
the weather [Lorenz 1963].

About a century ago, Henri Poincaré observed that the motion of three bodies under gravity can be extremely
complicated. His discovery was the first mathematical evidence of chaos. Since that time there have been many
observations of chaos both in mathematical models and natural systems. For many years the chaos observed in
the study of nonlinear dynamic systems was avoided because of its complexity; in practice such behaviour was often
ignored altogether, being interpreted either as completely unpredictable or ascribed to statistical noise. The theory of
nonlinear dynamics founded by Poincaré describes and classifies the behaviour of complex dynamical systems and
the manner in which they evolve through time. Such systems were extraordinarily difficult to study. The situation
changed dramatically with the invention of the modern computer. Scientists, especially mathematicians and
physicists, who had previously encountered chaos could pursue a more systematic study of the phenomenon using
the new tool.

The crucial importance of chaos is that it provides an alternative explanation for apparent randomness, one that
depends on neither noise nor complexity. Chaotic behaviour appears in systems that are essentially free from noise
and are also relatively simple, often with only a few degrees of freedom. Many natural systems exhibit chaos. In
the early years of the study, it was common to assume that such behaviour is unpredictable7 and therefore
uncontrollable. Since the late 1980's, a number of quite different techniques have been proposed for controlling
chaotic systems [Ott 1990], [Dracopoulos 1994], [Ogorzalek 1993].

Mathematical background.

A dth order autonomous continuous time system is defined as

                                     dxi/dt  =  gi(x1, . . . , xd, pi),    (1 ≤ i ≤ d)                          (1)

where t is the time, x = (x1, . . . , xd) is a vector in d-dimensional state space, i.e. the xi are the dynamic variables,
g = (g1, . . . , gd) is a vector field in the state space, i.e. the gi are functions of the xi, and the pi are the corresponding
vectors of the control parameters. As the vector field does not depend on time, the initial time may be taken as t0 = 0.




   7
     Unpredictable in the sense that, although completely deterministic the computational `cost' of an accurate
prediction rapidly becomes prohibitive as the prediction interval increases.


The Jacobian matrix J of an autonomous system described by d first-order differential equations is a d x d matrix
with elements defined as

                                           jl,m  =  ∂gl / ∂xm      (1 ≤ l, m ≤ d)                               (2)


If the determinant of the Jacobian matrix satisfies *det J* = 1 at all points the system is conservative. If the average
of *det J* < 1 then the system is dissipative. If the average of *det J* > 1, volume elements in state space expand
with time.

We distinguish between conservative and dissipative dynamic systems. In conservative systems volume elements
in the state space are conserved, whereas in a dissipative system the volume elements contract as the system
evolves. For dissipative systems, the effects of transients associated with initial conditions disappear in time. The
trajectory in state space will head for some final attracting region, or regions, which might be a point, curve, area,
and so on. Such an object is called the attractor for the system, since a number of distinct trajectories will be
attracted to this set of points in the state space. The properties of the attractor determine the long term dynamical
behaviour of the system.

The terms are best understood with examples. A well known nonlinear dissipative system is the Lorenz model
[Lorenz 1963], defined by the set of differential equations

                                              dx/dt  =  σ (y − x)
                                              dy/dt  =  x (R − z) − y                                           (3)
                                              dz/dt  =  x y − b z

This system has three degrees of freedom as there are three dynamic variables x, y and z. The control parameters
are σ, R and b. For R less than 1, all trajectories, no matter what their initial conditions, eventually end up
approaching the origin of the xyz state space. That is, for R < 1, all of the xyz space is the basin of attraction for
the attractor at the origin. Figure 6-1 illustrates a trajectory of the model with σ = 10.0, R = 0.5, b = 8/3, x = 1.0,
y = 2.0 and z = 3.0.

[Plot: the trajectory spiralling in towards the attractor at the origin of the xyz state space.]

Figure 6-1 Stable attractor.
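The R < 1 behaviour is easy to confirm numerically; the following is a minimal sketch using a fixed-step RK4
integrator (the step size and integration time are arbitrary choices):

```python
import numpy as np

# Minimal sketch: integrate the Lorenz equations (3) with fixed-step RK4
# for R = 0.5 < 1 and confirm the trajectory decays towards the origin,
# the attractor of Figure 6-1.
def lorenz(s, sigma=10.0, R=0.5, b=8.0 / 3.0):
    x, y, z = s
    return np.array([sigma * (y - x), x * (R - z) - y, x * y - b * z])

def rk4_step(f, s, h):
    k1 = f(s)
    k2 = f(s + 0.5 * h * k1)
    k3 = f(s + 0.5 * h * k2)
    k4 = f(s + h * k3)
    return s + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

s = np.array([1.0, 2.0, 3.0])   # the initial condition used in Figure 6-1
h = 0.01
for _ in range(5000):           # integrate for 50 time units
    s = rk4_step(lorenz, s, h)
print(np.linalg.norm(s))        # essentially zero: the origin attracts
```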

Chaos

There is currently great excitement and much speculation about chaos theory and its potential role in
understanding the world. A brief introduction to the history of the mathematical foundations of the subject can
be found in [Holmes 1990]. A chaotic system will remain apparently noisy regardless of how well experimental
conditions are controlled.

For the Duffing oscillator model, different values of d, f and ω exhibit completely different system behaviours. In
Figure 6-2, a time series of a periodic behaviour of the model for d = 0.15, f = 0.3 and ω = 1.0 is shown. In Figure
6-2, a typical time series of the chaotic behaviour of the model for d = 0.2, f = 36 and ω = 0.665 is illustrated. As
can be seen, the chaotic time series is more complicated in appearance, but there is a boundary within which the



system stays.

If a system displays divergence of nearby trajectories or sensitive dependence on initial conditions for some range
of its control parameter, then the long term behaviour of that system becomes essentially unpredictable8, i.e. the
long term future of a chaotic system is in practice indeterminable even though the system is theoretically
deterministic.

The effect of divergence of nearby trajectories on the behaviour of nonlinear systems is known as the butterfly
effect. The term was introduced by Lorenz based on the picturesque notion that if the atmosphere displays chaotic
behaviour with divergence of nearby trajectories, then even the flapping of a butterfly's wings would alter any long
term prediction of atmospheric dynamics.

This phenomenon is illustrated in Figure 6-3 for the x-coordinate of the Lorenz model with one trajectory starting
at x = 1.0, y = 2.0 and z = 3.0 in black, and another at x = 1.01, y = 2.01 and z = 3.01, in grey. Instead of
R = 0.5, we have used R = 28.0 as the model exhibits chaotic behaviour with this value.

For many nonlinear systems, we must integrate the equations step by step to find future behaviour. Any small
error in specifying the initial conditions will be magnified, leading to grossly different long term behaviour of the
system; therefore we cannot predict that long term behaviour in practice. Thus, chaotic behaviour is characterised
by the divergence of nearby trajectories in state space. As a function of time, the separation between two nearby
trajectories increases exponentially, at least for short times. (For short times because the trajectories stay within
some bounded region of the state space.)

[Plot: a chaotic time series x(t) over 0 ≤ t ≤ 600, bounded between roughly −4 and 4.]

Figure 6-2 A chaotic time series.
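The butterfly effect can be reproduced with the same kind of sketch: two Lorenz trajectories (equations (3) with
R = 28) started 0.01 apart in each coordinate, integrated with fixed-step RK4 (step size and horizon are arbitrary
choices):

```python
import numpy as np

# Sketch: two nearby Lorenz trajectories at R = 28 diverge until their
# separation saturates at the scale of the attractor.
def lorenz(s, sigma=10.0, R=28.0, b=8.0 / 3.0):
    x, y, z = s
    return np.array([sigma * (y - x), x * (R - z) - y, x * y - b * z])

def rk4_step(f, s, h):
    k1 = f(s)
    k2 = f(s + 0.5 * h * k1)
    k3 = f(s + 0.5 * h * k2)
    k4 = f(s + h * k3)
    return s + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

a = np.array([1.0, 2.0, 3.0])       # black trajectory in Figure 6-3
b2 = np.array([1.01, 2.01, 3.01])   # grey trajectory, perturbed by 0.01
h, max_sep = 0.01, 0.0
for _ in range(2500):               # 25 time units
    a = rk4_step(lorenz, a, h)
    b2 = rk4_step(lorenz, b2, h)
    max_sep = max(max_sep, float(np.linalg.norm(a - b2)))
print(max_sep)                      # tiny initial error grows to attractor scale
```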

In three or more dimensions, initially nearby trajectories can continue to diverge by wrapping over and under
each other. The crucial feature of state space with three or more dimensions which permits chaotic behaviour is
that trajectories remain within some bounded region by intertwining and wrapping around each other, without
intersecting and without repeating themselves exactly. The geometry created by such trajectories is strange. Such
attractors are thus called strange attractors [Ruelle 1980], i.e. if nearby trajectories on average diverge
exponentially then we say the attractor is strange or chaotic.

[Plot: x(t) for the two nearby trajectories over 0 ≤ t ≤ 20, with x between roughly −15 and 15.]

Figure 6-3 The butterfly effect.

Chaos in biology.

A relatively new model of brain function was first described by Freeman [Freeman 1991]. The idea is that
`thought' (in particular perception, prediction and control) consists of the flow (in the high dimensional state
space of vast assemblies of neurons) from one chaotic orbit to a periodic orbit. Freeman argues that chaos is
evident in the tendency of neural assemblies to shift abruptly from one complex activity pattern to a more stable
one in response to the smallest of inputs. This is a plausible model and if it stands the test of experiment and
scrutiny then chaos is an intrinsic feature of brain function.

Phase portraits made from EEGs (electroencephalographs) generated by computer models reflect the overall



   8
       In the sense previously discussed.


activity of the olfactory bulb of a rabbit at rest and in response to a familiar scent (e.g. banana). The resemblance
of these portraits to irregularly shaped, but still structured, coils of wire reveals that brain activity in both
conditions is chaotic but that the response to the known stimulus is more ordered, more nearly periodic during
perception, than at rest.

The heart also provides interesting examples of chaotic behaviour in biological systems. It is clear that the cardiac
waveform is nonlinear. There is also evidence that the cardiac cycle can usefully be described in terms of chaos.
A. Babloyantz and A. Destexhe [Babloyantz 1988] examined the ECGs (electrocardiographs) of four normal
human hearts, using qualitative and quantitative methods. With a variety of processing algorithms, such as power
spectrum, autocorrelation function, phase portrait, Poincaré section, Lyapunov exponent etc, they demonstrated
that the heart is not a perfect oscillator, but that cardiac activity stems from deterministic dynamics of a chaotic
nature. Numerous in-vivo and in-vitro experiments have investigated cardiac oscillatory activity and found
characteristic signatures of chaos [Choi 1983], [Chay 1985], [Geuvara 1981], [Goldberg 1984], [Keener 1981].

Controlling chaos.

The extreme sensitivity to initial conditions displayed by chaotic systems makes them unstable and unpredictable.
Yet the same sensitivity also makes them highly susceptible to control, provided that the chaotic system can be
analyzed and the analysis is then used to make small effective control interventions. By perturbing the system in
the right way, it is possible to encourage it to follow one of its many unstable but natural behaviours. In such
situations, it may be possible to use chaos to advantage, as chaotic systems, once under control, are very flexible.
Such systems can rapidly switch among many different behaviours. Incorporating chaos deliberately into practical
systems therefore offers the possibility of achieving greater flexibility in their performance.

In the context of chaos, control could mean a number of things. It could mean the elimination of multiple basins
of attraction, stabilisation of the fixed points or stabilisation of the unstable periodic orbits. Control of chaos is still
in its infancy but the potential it offers is enormous.

There are four main categories of chaos control methodologies. They are low energy, high energy, non-feedback
and feedback methods.

Low energy control methods require very small changes in the control parameter. In contrast, high energy control
methods require large changes. It is always desirable to have a control method of the low energy type, as in
physical systems the control parameters may be fixed or may be changeable by only a very small amount. When
large changes are required, a physical system may need to be redesigned, defeating the `control of chaos' concept,
as such an approach is closer to avoiding chaos.

In feedback methods, a control parameter is changed throughout the control. In non-feedback methods, a control
parameter is changed at the beginning of the control only, and left untouched during the control phase.

The original OGY control law.

Suppose p is some scalar control parameter which is to be varied at times ti, say p = pi over the interval (ti, ti+1),
see Figure 6-4. Suppose that the nominal value of p is p0. Our aim is to vary p by small amounts about p0 so as
to stabilise ξ(t) about a suitable control point ξF. For all i ≥ 1 let

[Diagram: over each interval (ti, ti+1) the parameter takes the constant value pi, and the state observed at time
ti is ξi.]

Figure 6-4 Intervals for which the variables are defined.





                          δp  =  p − p0        and        δξi+1(pi)  =  ξi+1(pi) − ξF(p0)                       (4)

Suppose that the iteration is described by the map ξi+1 = F(ξi, p). The locally linear behaviour of F in the vicinity
of a control point ξF is described by the dE x dE Jacobian matrix

                              J  =  DξF(ξ, p)   evaluated at   ξ = ξF,  p = p0                                  (5)



and in what follows we assume det J ≠ 0.

This yields the first order approximation

                                     δξi+1(pi)  ≈  J δξi(pi−1)  +  u δpi                                        (6)

where u is a vector which reflects the direction of the local gradient with respect to p.

Now suppose that an eigenvector of J, say eu, has a real eigenvalue whose absolute value is greater than 1. This
means that points ξi such that δξi lies in the direction of eu will, if p = p0 in the intervening time period, be such
that on the next iteration δξi+1 will lie further away from ξF(p0). We refer to this as an unstable direction. Stable
directions are characterised by eigenvectors of J whose eigenvalues have absolute value less than 1.

The basic idea of the OGY method is to choose p_i so as to eliminate the component of δξ_{i+1} in the unstable
direction(s). We are now almost ready to derive the appropriate control strategy. However, we first observe the
following lemma.

Lemma 6.1. Suppose the d × d matrix J has d linearly independent eigenvectors e_1, ..., e_d, with real eigenvalues
λ_1, ..., λ_d. Thus we assume the eigenvectors form a basis in ℝ^d. Construct the dual basis f_1, ..., f_d defined by

                                  e_i · f_j = 1 if i = j,    e_i · f_j = 0 if i ≠ j                   (7)

Then for any x ∈ ℝ^d

                                  f_u · Jx = λ_u (f_u · x)                                            (8)

Proof. Express x in terms of the eigenvectors, writing

                                  x = α_1 e_1 + α_2 e_2 + ... + α_d e_d                               (9)

where the α_i are suitable scalars depending on x. Thus from (7)

                                  f_u · x = α_u                                                      (10)

The effect of J on x is (from the definition of eigenvectors and eigenvalues)

                                  Jx = λ_1 α_1 e_1 + ... + λ_d α_d e_d                               (11)

Taking the inner product with f_u yields f_u · Jx = λ_u α_u, and the conclusion now follows from (10).  ∎
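The dual basis construction and the lemma are easy to check numerically. The sketch below uses an arbitrary illustrative 2 × 2 matrix (not one from the text); the dual basis vectors are obtained as the columns of the inverse-transpose of the eigenvector matrix.

```python
import numpy as np

# An arbitrary 2x2 matrix with two real eigenvalues, chosen for illustration only.
J = np.array([[1.8, 0.3],
              [0.2, 0.5]])

lam, E = np.linalg.eig(J)      # columns of E are the eigenvectors e_1, e_2
F = np.linalg.inv(E).T         # columns of F are the dual basis f_1, f_2

# Biorthogonality condition (7): e_i . f_j = 1 if i = j, else 0,
# since F^T E = E^{-1} E = I.
assert np.allclose(F.T @ E, np.eye(2))

# Lemma 6.1, eq. (8): f_u . (J x) = lambda_u (f_u . x) for arbitrary x
x = np.array([0.7, -1.3])
for u in range(2):
    fu = F[:, u]
    assert np.isclose(fu @ (J @ x), lam[u] * (fu @ x))
```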

We can now prove

Theorem 6.1 (OGY). The constraint f_u · δξ_{i+1} = 0 leads to the first order control law:

                                  δp_i ≈ −λ_u (f_u · δξ_i(p_{i−1})) / (f_u · u)                      (12)


where for ξ_i near ξ_F, the sensitivity vector u is defined as

              u = ∂[δξ_{i+1}(p_i) − J δξ_i(p_{i−1})]/∂p_i |_{p_0}
                = lim_{p_i → p_0} [δξ_{i+1}(p_i) − J δξ_i(p_{i−1})] / (p_i − p_0)                    (13)
Proof. Dotting (6) with f_u and using Lemma 6.1 we have

                                  f_u · δξ_{i+1}(p_i) ≈ λ_u (f_u · δξ_i(p_{i−1})) + (f_u · u) δp_i   (14)

Using the constraint f_u · δξ_{i+1} = 0 we obtain from (14)

                                  λ_u (f_u · δξ_i(p_{i−1})) + (f_u · u) δp_i ≈ 0                     (15)

which on solving for δp_i yields (12).  ∎
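One step of the control law (12) can be sketched as follows. The Jacobian J, sensitivity vector u and current deviation are illustrative numbers, not taken from any system in the text; the point is only that the chosen δp_i annihilates the unstable component of the next deviation.

```python
import numpy as np

# Linearised dynamics near the fixed point: d_xi_next = J d_xi + u dp  (eq. 6).
J = np.array([[1.9, 0.4],
              [0.1, 0.3]])           # illustrative Jacobian with one |lambda| > 1
u = np.array([0.5, -0.2])            # illustrative sensitivity vector
d_xi = np.array([0.02, -0.01])       # current deviation from the fixed point

lam, E = np.linalg.eig(J)
F = np.linalg.inv(E).T               # dual basis: columns f_j with e_i . f_j = delta_ij
iu = np.argmax(np.abs(lam))          # index of the unstable eigenvalue
fu, lam_u = F[:, iu], lam[iu]

# OGY control law (eq. 12): choose dp so that f_u . d_xi_next = 0
dp = -lam_u * (fu @ d_xi) / (fu @ u)

# One linearised step with the control applied
d_xi_next = J @ d_xi + u * dp
assert abs(fu @ d_xi_next) < 1e-12   # unstable component eliminated
```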

The following conditions are required to control a chaotic system with the original OGY method.

         ! Experimental time series of some scalar dependent variable x_t can be measured and a suitable
         embedding technique can be applied, or the mathematical model describing the system is
         available.
         ! The dynamics of the system can be represented as a low-dimensional surface of section and
         the system has at least two linearly independent real eigenvectors.
         ! There is a specific periodic orbit of the map which lies in the attractor, around which one
         wishes to stabilise, and the corresponding unstable periodic point can be located.
         ! A parameter p is available for external adjustment which can be used to slightly modify the
         system dynamics. Let the range in which p is allowed to vary be p_MIN < p < p_MAX. There is a
         maximum perturbation δp* in the parameter p by which it is acceptable to vary p from the
         nominal value p_0.
         ! The position of the periodic orbit is a function of p, but the local dynamics about it do not vary
         much with small changes in p.

Chaotic conventional neural networks.

It has been known since at least 1991 that conventional neural network models can exhibit chaotic behaviour.
Wang [Wang 1991] constructed a rather stylised simple 2-2 network with weights
                                  W = ( a   ka )
                                      ( b   kb )                                                     (16)

for a = −5, b = −25, k = −1, whose sigmoidal transfer function is

                                  σ(x) = 1 / (1 + e^{−x/T})                                          (17)

where T = 1/4, in which the outputs are fed back to the inputs as in Figure 6-5.
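A minimal sketch of iterating this network as a dynamical system follows. Feeding the two outputs straight back as the next inputs, as in Figure 6-5, is an assumption about the wiring; the chaoticity of the resulting orbit is not verified here.

```python
import numpy as np

# Wang's 2-2 feedback network, eqs. (16)-(17)
a, b, k, T = -5.0, -25.0, -1.0, 0.25
W = np.array([[a, k * a],
              [b, k * b]])

def sigma(x):
    # sigmoid with slope parameter T (eq. 17); clip the exponent to avoid overflow
    return 1.0 / (1.0 + np.exp(np.clip(-x / T, -500.0, 500.0)))

def step(v):
    return sigma(W @ v)            # outputs become the next inputs (Figure 6-5)

v = np.array([0.5, 0.6])           # start off the invariant diagonal x = y
for _ in range(1000):              # discard the transient
    v = step(v)

orbit = []
for _ in range(200):
    v = step(v)
    orbit.append(v)
orbit = np.array(orbit)            # points of the attractor, cf. Figure 6-6
```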




Wang proved that there exists period-doublings to chaos and strange attractors by using a homeomorphism from
the network to a known dynamical system having these properties. This formally established that artificial neural
networks can exhibit chaotic behaviour. At the same time [Welstead 1991] trained feedforward networks on the
Ikeda and Henon maps and then by feeding the outputs back into the inputs empirically produced neural networks
with chaotic attractors. Somewhat later further examples were given in [Dracopoulos 1993]9. None of these papers
considered the question of controlling the neural network behaviour.


Figure 6-5 Feedforward network as a dynamical system.
Figure 6-6 Chaotic attractor of Wang's neural network.

Controlling chaotic neural networks

The attractor of the Wang network is rather difficult to work with because the attractor is almost a curve, see
Figure 6-6. However, we were able to locate an unstable fixed point at (0.896853, 0.999980) which has a Jacobian
                                  J = ( −1.96322      2.08867    )
                                      ( −0.00755664   0.00893465 )                                   (18)

The awkward shape of the attractor is reflected in the fact that this Jacobian has very small determinant. Using
T as a control parameter with a nominal value of 1/4 we managed to stabilise this system about the fixed point.

In this section we consider neural networks with feedback whose dynamical behaviour is chaotic. This is a subject
of increasing interest for two reasons. First, because if Freeman's hypothesis is correct then despite their value in
practical applications (in for example pattern recognition) the idea that feedforward networks, regardless of the
training algorithm employed, are an accurate analogy to equivalent biological computations is seriously challenged.
Second, it is possible that by storing memories in (unstable) periodic behaviours, rather than at point attractors as
in the Hopfield model, the memory capacity of simple neural networks may be considerably enhanced.

We consider ways in which a small 2-10-10-2 network, trained on the Ikeda attractor, and whose outputs feed back
to the inputs can be controlled using small variations of various parameters or system variables. It needs to be said
that the control mechanisms employed are external to the neural network. This is in contrast to the hypothetical
biological process in which presumably some form of Hebbian learning causes frequently encountered sensory
input to be associated with a particular (unstable) periodic dynamical regime. The result being that when the
sensory input is re-encountered the neural system relaxes naturally onto a particular (unstable) periodic behaviour
which characterises the input, such switching of the dynamical behaviour being implicit to the neural structure
rather than being externally imposed. Nevertheless, we believe that such studies of how chaotic neural systems can
be encouraged to follow particular unstable periodic orbits are an interesting and probably necessary first step
in developing some understanding of how such behaviour might be made a natural characteristic of the neural



   9
       At that time the authors were unaware of the results presented in IJCNN-91.


system.

Other approaches are possible using different neural network models and different control techniques. For
example Babloyantz's group [Sepulchre 1993] [Lourenço 1994] [Babloyantz 1995] have controlled a network of
oscillators coupled to their four nearest neighbours using both the OGY method and a delayed feedback control
technique first suggested by Pyragas [Pyragas 1992]. Another example [Solé 1995] uses the GM technique
([Güemez 1993], [Matías 1994]) to control small (three or four neurons) fully connected neural networks whose
attractors are similar to a chaotic network first discussed by [Wang 1991].

Figure 6-7 The Ikeda strange attractor.
Figure 6-8 Attractor for the chaotic 2-10-10-2 neural network.

However, for the purposes of the present section we work with the Ikeda map [Hammel 1985] defined by

                                  g(z) = γ + R z exp[ i ( κ − α / (1 + |z|²) ) ]                     (19)

where z is a complex variable, of the form x + i y, and i² = −1. We can identify x + i y with the point (x, y) on the
complex plane, so that g can also be thought of as a mapping ℝ² → ℝ². The dynamical system is then defined by
z_{n+1} = g(z_n). For parameter values α = 5.5, γ = 0.85, κ = 0.4 and R = 0.9, this mapping has a strange attractor
illustrated in Figure 6-7. With only 4000 training pairs (re-scaled into the range [0, 1]) and a training MSE
of about 9.9×10⁻⁵, the network already produces an attractor (Figure 6-8) with features similar to the Ikeda
strange attractor shown in Figure 6-7.
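Generating the training data amounts to iterating (19) and rescaling; a sketch, with an arbitrary starting point and transient length:

```python
import numpy as np

# The Ikeda map (eq. 19) with the parameter values quoted in the text.
alpha, gamma, kappa, R = 5.5, 0.85, 0.4, 0.9

def ikeda(z):
    return gamma + R * z * np.exp(1j * (kappa - alpha / (1.0 + abs(z) ** 2)))

z = 0.1 + 0.1j                     # arbitrary initial condition
pts = []
for n in range(5000):
    z = ikeda(z)
    if n >= 1000:                  # discard the transient
        pts.append((z.real, z.imag))
pts = np.array(pts)                # 4000 attractor points, cf. Figure 6-7

# Re-scale into [0, 1], as described for preparing the training pairs
lo, hi = pts.min(axis=0), pts.max(axis=0)
scaled = (pts - lo) / (hi - lo)
```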

We use this network as the basis for the control experiments, the objective being to determine which parameters
or system variables are most effective in stabilising the system onto an unstable periodic attractor.

The OGY control method was applied to control the chaotic neural network described above. An unstable fixed
point ξ_F = (0.626870, 0.553256) was located by examining successive iterations of the system and was used as the
unstable periodic point to be stabilised. The Jacobian at this point was

                                  J = ( −1.26617    −1.03629 )
                                      ( −0.564996   −1.06779 )                                       (20)

with eigenvalues λ_s = −0.395399 and λ_u = −1.93857, stable eigenvector e_s = (0.7656, −0.643317) and unstable
eigenvector e_u = (−0.838887, −0.544306).
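Locating such a fixed point can be done by Newton's method on G(v) = F(v) − v, with the Jacobian estimated by finite differences. In the sketch below F is a stand-in map (the Hénon map), not the trained network, and the starting guess is illustrative.

```python
import numpy as np

def F(v):
    # stand-in 2-D map (Henon); for the network, F would be one feedback iteration
    x, y = v
    return np.array([1.0 - 1.4 * x * x + y, 0.3 * x])

def jacobian(F, v, h=1e-6):
    # central finite-difference estimate of the Jacobian of F at v
    n = len(v)
    J = np.empty((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = h
        J[:, j] = (F(v + e) - F(v - e)) / (2 * h)
    return J

v = np.array([0.5, 0.2])                   # initial guess near the attractor
for _ in range(50):                        # Newton iteration on F(v) - v = 0
    J = jacobian(F, v)
    v = v - np.linalg.solve(J - np.eye(2), F(v) - v)

assert np.allclose(F(v), v, atol=1e-10)    # v is a fixed point
eigvals = np.linalg.eigvals(jacobian(F, v))  # instability shows as |lambda| > 1
```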

Control varying T in a particular layer.

Initial attempts to control using the slope parameter T were not successful. The next attempts were made by varying
T only for neurons in a particular layer of the network, and here the OGY control method was more effective. It
seems that by varying T in only one particular layer the chaotic regions of the bifurcation diagrams become broader
(see Figure 6-9 and Figure 6-10) and so control becomes easier with small variations of T. The variations of T and
the controlled result are illustrated in Figure 6-11 to Figure 6-13.

Using small variations of the inputs.

The results of using an external signal feeding into one of the inputs as a control parameter whose nominal value
is set to zero were significantly more interesting. The bifurcation diagrams for x(t) are given in Figure 6-14 and
Figure 6-15. We use the same fixed point as before, so the Jacobian and associated eigenvectors and eigenvalues
remain unchanged.

Using an external signal feeding into input x (cf. Figure 6-5), the sensitivity vector ux = (-1.076260, -0.675875)
was approximated. After applying the OGY control for less than 25 time steps (the control variations are shown
in Figure 6-18) the system rapidly stabilized onto the unstable fixed point as illustrated in Figure 6-16 - Figure
6-17.

In these experiments, an improved technique due to [Otani 1996] was actually used to estimate the sensitivity
vectors u. In (13) the Jacobian is used to obtain a prediction of where the system would be at the next iteration if
no control were applied. However, in the case of a neural network this is unnecessary, since the network itself is
available as an exact model at every point. We can therefore obtain an exact prediction of the next system state by
simply iterating the network without control. This resulted in much more accurate estimates of the sensitivity
vectors, which in turn made control of the system using the OGY method much easier.
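The resulting estimate of u reduces to a one-sided finite difference of the map in the parameter. A sketch, with a stand-in map carrying an explicit parameter dependence (the function names and the map itself are illustrative, not the trained network):

```python
import numpy as np

def F(v, p):
    # stand-in 2-D map with parameter p added to the first component
    x, y = v
    return np.array([1.0 - 1.4 * x * x + y + p, 0.3 * x])

def sensitivity(F, v, p0, dp=1e-6):
    # u ~ d xi_{n+1} / dp at p0: compare one uncontrolled step at p0
    # with one step at a slightly perturbed parameter
    return (F(v, p0 + dp) - F(v, p0)) / dp

v = np.array([0.3, 0.2])
u = sensitivity(F, v, 0.0)      # here exactly (1, 0) by construction of F
```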

The OGY method can be applied to the control of conventional feedforward networks whose behaviour under
iterated feedback has been trained to be chaotic. Whilst the method is computationally expensive and, in its
original form, subject to a number of limitations (for example, inaccuracies in estimating the Jacobian or sensitivity
vectors can make control difficult if not impossible), we nevertheless see that stabilisation of unstable fixed points is
perfectly feasible. However, this relaxation onto a fixed point is achieved by a control external to the network itself
rather than as an implicit consequence of network function.

It is interesting to observe that control by variation of a global slope parameter is not easy to achieve but becomes
easier when the control variations are applied to a single layer rather than to the whole network. It is notable that
control becomes very much easier when the controlling parameter is a small signal applied to one of the inputs.
This may be closer to being a biological analogy than control of behaviour through global or selective slope control.






Figure 6-9 Bifurcation diagram x obtained by varying T in the output layer only.
Figure 6-10 Bifurcation diagram y obtained by varying T in the output layer only.




Figure 6-11 Variations of x from initiation of control.
Figure 6-12 Variations of y from initiation of control.
Figure 6-13 Parameter changes during output layer control.






Figure 6-14 Bifurcation diagram for the output x(t+1) using an external variable added to the input x(t).
Figure 6-15 Bifurcation diagram for the output y(t+1) using an external variable added to the input x(t).




Figure 6-16 Variations of x from initiation of control.
Figure 6-17 Variations of y from initiation of control.
Figure 6-18 Parameter changes during input x control.





Quite how easy it would be to extend such control to networks with many outputs being fed back to many inputs
remains to be determined. It also remains to be determined whether it is practical to control high dimensional
networks to follow unstable periodic orbits rather than fixed points. It is likely that more sophisticated variations
of the OGY technique or some completely different control method would be required to accomplish this goal.

Time delayed feedback and a generic scheme for chaotic neural networks.

Recently [Tsui 1999], [Oliveira 1998], [Tsui 2002], Ana Oliveira, Alban Tsui and I have discovered that it
is very easy to control and synchronize chaotic neural systems using time delayed feedback. Combined with the
Gamma test to select appropriate time-delays we can now easily achieve the following:

         ! Given an arbitrary smooth multidimensional chaotic system produce an iterated neural network which
         closely models the system.

         ! Using a simple technique of time delayed feedback cause the iterated chaotic neural network when
         presented with a stimulus to stabilize onto an unstable periodic orbit characteristic of the applied stimulus.
         Moreover, this response is quite stable in the presence of noise.

         ! Synchronize two identical copies of the network by transmitting the single output of one to the other.
         This provides an interesting model of other kinds of cortical activity and also has interesting applications
         in secure communications.

Although not a full account of our methods, these papers answer many of the questions mentioned. [Tsui 1999]
provides the first actual implementation of an artificial neural network which exhibits all of the properties
discussed in [Freeman 1991].

A generic scheme for such a stimulus-response recurrent network is shown in Figure 6-19. The single output of
the network feeds back into the inputs using delay buffers according to the embedding previously determined by
the Gamma test experiments. This embedding should contain enough information for predicting the next system
state.

For stabilization control a multiple (gain constant) of the delayed feedback is added to each neural network input
specified by the irregular embedding, based on the idea from Pyragas' delayed feedback control. The control
module is shown in Figure 6-19 and the control perturbation for the ith input at the nth iteration is
                                  k_i [ x_i(n − i − τ) − x_i(n − i) ]                                (21)

where k_i is a gain constant and τ is the delay time.

We imagine that the presence of an external stimulus excites (activates) the control circuitry, which is otherwise
inhibited. Thus to achieve a stabilised dynamical regime in response to a stimulus the control is switched on at the
same time as the external signal is fed into the input line xn. By varying the external signal in small steps and
holding the new setting fixed long enough for the system to stabilise we can observe the response of the network
to small changes in stimulus.

In the diagram, τ is the same for each control perturbation but, of course, we could set τ to be different on each
control line. External stimulus of the network can be applied to the controlled inputs as shown in the diagram. The
control module should switch on automatically and simultaneously whenever there is an external stimulation.
Variations of stimulation, such as on the control delayed feedback lines, may also be used.
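The essence of this Pyragas-style delayed feedback can be sketched on a scalar map. Here the logistic map stands in for a network output fed back to its input; the gain k = −0.6 and delay τ = 1 (targeting a fixed point) are hand-chosen illustrative values, not ones used in the experiments.

```python
import numpy as np

r, k, tau = 3.8, -0.6, 1

def f(x):
    return r * x * (1.0 - x)

x_star = 1.0 - 1.0 / r              # unstable fixed point: |f'(x_star)| = 1.8 > 1

def run(control, steps=300, x0=x_star + 0.01):
    x = [x0, f(x0)]                 # need tau past values before control starts
    for n in range(1, steps):
        # perturbation k (x(n - tau) - x(n)), cf. eq. (21); vanishes on the orbit
        perturb = k * (x[n - tau] - x[n]) if control else 0.0
        x.append(f(x[n]) + perturb)
    return np.array(x)

controlled = run(True)
free = run(False)
assert abs(controlled[-1] - x_star) < 1e-8   # feedback pins the orbit to x_star
```

Note the perturbation is proportional to the difference between delayed and current states, so it tends to zero once the target orbit is reached: the controlled system has the same fixed point as the free one.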





        Figure 6-19 A general scheme for constructing a stimulus-response chaotic recurrent neural network:
        the chaotic "delayed" network is trained on suitable input-output data constructed from a chaotic time
        series; a delayed feedback control is applied to each input line; entry points for external stimulus are
        suggested, with a switch signal to activate the control module during external stimulation; signals
        on the delay lines or output can be observed at the "observation points".

Example: Controlling the Hénon neural network

There follows an example of different responses of the Hénon neural system using different settings of the
controls and external stimulation. The response signals of the system can be observed at the output x(n) of the
feedforward neural network module or at the "observation points" on the delay lines x(n-1), ..., x(n-d), as indicated
in Figure 6-19. Due to the complexity of these neural systems, of course, not all possible settings are tried and
presented.


Figure 6-21 Response signal on x(n-6) with control signal activated on x(n-6) using k = 0.441628, τ = 2 and
without external stimulation, after the first 10 transient iterations. After n = 1000 iterations, the control is
switched off.

Figure 6-20 The control signal corresponding to the delayed feedback control shown in Figure 6-21. Note that the
control signal becomes small.







Figure 6-22 Response signals on network output x(n), with control signal activated on x(n-6) using k = 0.441628,
τ = 2 and with constant external stimulation sn added to x(n-6), where sn varies from -1.5 to 1.5 in steps of 0.1
at each 500 iterative steps (indicated by the change of hue of the plot points) after 20 initial transient steps.

Figure 6-23 The control signal corresponding to the delayed feedback control shown in Figure 6-22. Note that the
control signal becomes small even when the network is under changing external stimulation.


We use τ = 2 and k = 0.441628 for our control parameters on all the possible feedback control lines. The control
is applied to the delayed feedback line x(n-6). Without any external stimulation, and using only a single controlled
delayed feedback line, the network quickly produces a stabilised response, as shown in Figure 6-21, with the
corresponding control signal shown in Figure 6-20. Notice that the control signal is very small during the
stabilised behaviour. Under external stimulation of varying strength the network is still stabilised, but with a
variety of new periodic behaviours, as shown in Figure 6-22. The corresponding control signal remains small (see
Figure 6-23).
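This delayed feedback construction can be sketched in code. The following Python sketch is illustrative only, not
the original experiment: the trained feedforward network of Figure 6-19 is stood in for by the analytic Hénon map,
the gain k = 0.441628 and delay τ = 2 are taken from the text, and the state is clipped purely to keep the toy run
bounded (the trained network needs no such safeguard).

```python
import numpy as np

def controlled_henon(n_steps=1500, a=1.4, b=0.3,
                     k=0.441628, tau=2, n_on=10):
    """Iterate x(n+1) = 1 - a*x(n)^2 + b*x(n-1) (the Henon map,
    standing in for the trained network of Figure 6-19), perturbing
    the x(n) input line by u(n) = k*(x(n - tau) - x(n)) once the
    control module is switched on at step n_on."""
    x = [0.1, 0.1, 0.1]          # initial history, enough for tau = 2
    u = [0.0, 0.0]               # record of the control signal
    for n in range(2, n_steps):
        un = k * (x[n - tau] - x[n]) if n >= n_on else 0.0
        u.append(un)
        x_next = 1.0 - a * (x[n] + un) ** 2 + b * x[n - 1]
        # Clip only to keep this illustrative run bounded.
        x.append(float(np.clip(x_next, -10.0, 10.0)))
    return np.array(x), np.array(u)

xs, us = controlled_henon()
```

Plotting us against n gives a picture analogous to Figure 6-20; switching the control off after n = 1000, as in
Figure 6-21, amounts to forcing un = 0 beyond that step.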

For this system we then investigated the response of the system when the sensory input was perturbed by additive
Gaussian noise r with Mean[r] = 0 and standard deviation SD[r] = σ. Using the experimental setup as in Figure
6-22, the external stimulus was perturbed at each iteration step by adding Gaussian noise r with standard
deviation σ, i.e. giving an external stimulus sn+r. This experiment was repeated for different σ, where σ was
varied from σ = 0.05 to σ = 0.3, a high noise standard deviation with respect to the external stimulus strength,
which ranges from -1.5 to 1.5. The result for σ = 0.05 is shown in Figure 6-24 and Figure 6-25. Surprisingly, the
response signal stays almost the same, but the control signal is not small at all. The results for σ = 0.15 and
σ = 0.3 are in Figure 6-26 and Figure 6-27 respectively. As illustrated in these figures, the system dynamics
remain essentially unchanged, although, as one might expect, the response signal becomes progressively "blurred"
as the noise level increases. Similar results can be obtained for the other examples.
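The stimulus protocol just described can be sketched as follows. This is an illustrative reconstruction from the
figure captions, not the original code: sn steps from -1.5 to 1.5 in increments of 0.1, each level held for 500
iterations, with optional additive Gaussian noise of standard deviation σ.

```python
import numpy as np

def stimulus_schedule(steps_per_level=500, s_min=-1.5, s_max=1.5,
                      s_step=0.1, sigma=0.05, seed=0):
    """Build the external stimulus sequence sn + r: a staircase of
    constant levels (as in Figure 6-22) plus additive Gaussian noise r
    with Mean[r] = 0 and SD[r] = sigma (as in Figure 6-24)."""
    rng = np.random.default_rng(seed)
    levels = np.arange(s_min, s_max + s_step / 2, s_step)  # -1.5, -1.4, ..., 1.5
    s = np.repeat(levels, steps_per_level)   # hold each level for 500 steps
    r = rng.normal(0.0, sigma, size=s.size)  # Gaussian perturbation
    return s + r

noisy = stimulus_schedule()             # noisy staircase, sigma = 0.05
clean = stimulus_schedule(sigma=0.0)    # noise-free staircase of Figure 6-22
```

Feeding this sequence into the controlled input line reproduces the experimental protocol of Figures 6-22 and
6-24.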






Figure 6-24 Response signals on network output x(n), with control setup the same as in Figure 6-22 but with
Gaussian noise r added to the external stimulation, i.e. sn+r, with σ = 0.05, at each iteration step.

Figure 6-25 The control signal corresponding to the delayed feedback control shown in Figure 6-24.




Figure 6-26 Response signals on network output x(n), with control experiment setup the same as in Figure 6-22
but with Gaussian noise r added to the external stimulation, i.e. sn+r, with σ = 0.15, at each iteration step.

Figure 6-27 Response signals on network output x(n), with control experiment setup the same as in Figure 6-22
but with Gaussian noise r added to the external stimulation, i.e. sn+r, with σ = 0.3, at each iteration step.




                                                        Chapter references


[Abraham 1982] R. H. Abraham and C. D. Shaw, Dynamics - The geometry of behavior Part One : Periodic
Behavior, Aerial Press, California, 1982.

[Atkinson 1978] K. E. Atkinson, An introduction to numerical analysis, John Wiley & Sons, Canada, 1978.

[Auerbach 1987] Ditza Auerbach, Predrag Cvitanovic, Jean-Pierre Eckmann, Gemunu Gunaratne and Itamar
Procaccia. Exploring chaotic motion through periodic orbits. Physical Review Letters 58(23), 2387-2389, 1987.

[Auerbach 1992] D. Auerbach, C. Grebogi, E. Ott and J. A. Yorke. Controlling Chaos in High Dimensional
Systems, Physical Review Letters 69, 24, 3479-3482, 1992.




[Azevedo 1991] A. Azevedo and S. M. Rezende. Controlling Chaos in Spin-Wave Instabilities, Physical Review
Letters 66, 10, 1342-1345, 1991.

[Babloyantz 1988] A. Babloyantz and A. Destexhe. Is the normal heart a periodic oscillator? Biol. Cybern. 58,
203-211, 1988.

[Babloyantz 1995] A. Babloyantz, C. Lourenço, J.A. Sepulchre, Control of chaos in delay differential equations,
in a network of oscillators and in model cortex, Physica D 86, 274-283, 1995.

[Belmonte 1988] A. L. Belmonte, M. J. Vision, J. A. Glazier, G. H. Gunaratne and B. G. Kenny. Trajectory
scaling functions at the onset of chaos: experimental results, Physical Review Letters 61(5):539-542, 1988.

[Carroll 1992] T. L. Carroll, I. Triandaf, I. Schwartz and L. Pecora. Tracking unstable orbits in an experiment,
Physical Review A 46, 10, 6189-6192, 1992.

[Carroll 1993] Thomas L. Carroll and Louis M. Pecora. Using chaos to keep period-multiplied systems in phase,
Physical Review E 48, 4, 2426-2436, 1993.

[Choi 1983] M. Y. Choi and B. A. Huberman. Dynamic behaviour of nonlinear networks, The American Physical
Society, 28, 1204-1206, 1983.

[Chay 1985] T. R. Chay and J. Rinzel. Bursting, beating, and chaos in an excitable membrane model, J.
Biophysical Society, 47, 357-366, 1985.

[Crutchfield 1980] J. Crutchfield, D. Farmer, N. Packard, R. Shaw, G. Jones and R. J. Donnelly. Power spectral
analysis of a dynamical system, Physics Letters A 76, 1-4, 1980.

[Crutchfield 1981] J. Crutchfield, M. Nauenberg and J. Rudnick. Scaling for External Noise at the Onset of Chaos,
Physical Review Letters 46, 933-935, 1981.

[Cvitanovic 1989] P. Cvitanovic. Universality in Chaos : second edition, Adam Hilger, Bristol, 1989.

[Derrick 1993] W. R. Derrick in Chaos in chemistry and biochemistry, Ed R. J. Field and L. Györgyi, World
Scientific Publishing, 1993.

[Ditto 1990] W. L. Ditto, S. N. Rauseo and M. L. Spano. Experimental Control of Chaos, Physical Review Letters
65, 26, 3211-3214, 1990.

[Ditto 1993] W. L. Ditto and L. M. Pecora. Mastering Chaos, Scientific American, 62-68, August 1993.

[Dracopoulos 1993] D. C. Dracopoulos and Antonia J. Jones. Neuromodels of analytic Dynamic Systems. Neural
Computing & Applications, 1(4):268-279, 1993.

[Dracopoulos 1994] D. C. Dracopoulos and Antonia J. Jones. Neuro-Genetic Adaptive Attitude Control, Neural
Computing & Applications, 2(4):183-204, 1994.

[Dressler 1992] U. Dressler and G. Nitsche, Controlling chaos using time delay coordinates, Physical Review
Letters 68(1):1-4, 1992.

[Eckmann 1985] J. P. Eckmann and D. Ruelle. Ergodic theory of chaos and strange attractors, Rev. Modern
Physics 57, 3, 617-656, 1985.

[Freeman 1991] W.J. Freeman, The Physiology of Perception, Scientific American, 34-41, February 1991.

[Garfinkel 1992] A. Garfinkel, M. L. Spano, W. L. Ditto and J. N. Weiss. Controlling Cardiac Chaos, Science
257:1230-1235, 1992.

[Geuvara 1981] M. R. Geuvara, L. Glass and A. Shrier. Phase locking, period-doubling bifurcations, and irregular
dynamics in periodically stimulated cardiac cells, Science 214:1350-1352, 1981.

[Gills 1992] Z. Gills, C. Iwata, R. Roy, I. B. Schwartz and I. Triandaf. Tracking Unstable Steady States:
Extending the Stability Regime of a Multimode Laser System, Physical Review Letters 69(22):3169-3172, 1992.

[Gleick 1987] J. Gleick. CHAOS : Making a new science, Abacus book, London, 1987.

[Goldberg 1984] A. L. Goldberger, L. J. Findley, M. R. Blackburn and A. J. Mandell. Nonlinear dynamics in heart
failure: Implications of long-wavelengths cardiopulmonary oscillations, Amer. Heart J. 107:612-615, 1984.

[Grassberger 1983] P. Grassberger and I. Procaccia, Characterization of Strange Attractors, Physical Review
Letters 50(5):346-349, 1983.

[Grebogi 1988] C. Grebogi, E. Ott, J. A. Yorke. Unstable periodic orbits and the dimensions of multifractal
chaotic attractors, Physical Review A 37:1711-1723, 1988.

[Grebogi 1982] C. Grebogi, E. Ott, J. A. Yorke. Chaotic Attractors in Crisis, Physical Review Letters 48:1507-1510,
1982.

[Grebogi 1986] C. Grebogi, E. Ott, J. A. Yorke. Critical Exponent of Chaotic Transients in Nonlinear Dynamical
Systems, Physical Review Letters 57:1284-1287, 1986.

[Grebogi 1987] C. Grebogi, E. Ott, J. A. Yorke. Critical exponents for crisis-induced intermittency, Physical
Review A 36:5365-5380, 1987.

[Greenside 1982] H. S. Greenside, A. Wolf, J. Swift and T. Pignataro. Physical Review A 25:3453, 1982.

[Güémez 1993] J. Güémez and M.A. Matías, Control of chaos in unidimensional maps, Physics Letters A 181:29-
32, 1993.

[Gunaratne 1989] G. H. Gunaratne, P. S. Linsay and M. J. Vision. Chaos beyond Onset: A Comparison of Theory
and Experiment, Physical Review Letters 63(1):1-4, 1989.

[Hall 1992] N. Hall. The new scientist guide to chaos, Penguin books, London, 1992.

[Hammel 1985] S. Hammel, C. Jones, J. Moloney, Global Dynamical Behaviour of the Optical Field in a Ring
Cavity, J. Opt. Soc. Am. B 2(4):552-564, 1985.

[Hao 1990] B. Hao, Chaos II, World Scientific, 1990.

[Hayes 1993] S. Hayes, C. Grebogi and E. Ott. Communicating with Chaos, Physical Review Letters 70(20):3031-
3034, 1993.

[Hénon 1976] M. Hénon. A Two-dimensional Mapping with a Strange Attractor, Communications in Mathematical
Physics 50:69-77, 1976.


[Hilborn 1994] R. C. Hilborn. Chaos and Nonlinear Dynamics : An introduction for scientists and engineers,
Oxford University press, New York, 1994.

[Holmes 1990] P. Holmes. Poincaré, celestial mechanics, dynamical-systems theory and "chaos", Physics Reports
193(3):137-163, 1990.

[Hunt 1991] E. R. Hunt. Stabilizing High-Period Orbits in a Chaotic System: The Diode Resonator, Physical
Review Letters 67(15):1953-1955, 1991.

[Kaplan 1979] J. Kaplan, J. A. Yorke in Chaotic behavior of multidimensional difference equations, H. O. Peitgen
et al., Eds., "Springer Lecture, Notes in Mathematics", Springer-Verlag, Berlin, 730:204-227, 1979.

[Keener 1981] J. P. Keener. Chaotic cardiac dynamics, Lectures in Applied Mathematics 19:299-325, 1981.

[Kim 1992] J. H. Kim and J. Stringer, Applied Chaos, John Wiley & Sons, Canada, 1992.

[Lai 1994] Ying-Cheng Lai and Celso Grebogi. Synchronization of spatiotemporal chaotic systems by feedback
control, Physical Review E 50(3):1894-1899, 1994.

[Lai 1993] Y-C Lai, M. Ding and C. Grebogi. Controlling Hamiltonian chaos, Physical Review E 47(1):86-92,
1993.

[Lathrop 1989] Daniel P. Lathrop and Eric J. Kostelich. Characterization of an experimental strange attractor
by periodic orbits, Physical Review A 40(7):4028-4031, 1989.

[Lorenz 1963] E. N. Lorenz. Deterministic Nonperiodic Flow, J. Atmospheric Sciences 20:130-141, 1963.

[Lorenz 1993] E. N. Lorenz, The essence of chaos, University of Washington press, 1993.

[Lourenço 1994] C. Lourenço, A. Babloyantz, Control of Chaos in Networks with Delay: A Model for
Synchronization of Cortical Tissue, Neural Computation 6:1141-1154, 1994.

[Matías 1994] M.A. Matías and J. Güémez, Stabilization of Chaos by Proportional Pulses in the System Variables,
Physical Review Letters 72(10):1455-1458, 1994.

[Moss 1994] Frank Moss. Chaos under control, Nature 370:596-597, 1994.

[Ogorzalek 1993] M. J. Ogorzalek. Taming Chaos Part II: Control, IEEE Transactions on circuits and systems -
1: Fundamental theory and Applications. Volume 40(10):700-706, October 1993.

[Oliveira 1998] Ana Guedes de Oliveira and Antonia J. Jones. Synchronisation of chaotic maps by feedback control
and application to secure communications using chaotic neural networks, International Journal of Bifurcation
and Chaos, 8(11), November 1998.

[Otani 1997] M. Otani and Antonia J. Jones, Guiding chaotic orbits. Research Report, Department of Computer
Science, University of Wales, Cardiff, December 1997.

[Ott 1990] E. Ott, C. Grebogi and J.A. Yorke, Controlling Chaos, Physical Review Letters 64(11):1196-1199,
1990.

[Ott 1994] E. Ott, T Sauer and J. A. Yorke, Coping with Chaos, John Wiley & Sons, Canada, 1994.

[Parker 1989] T. S. Parker and L. O. Chua. Practical Numerical Algorithms for Chaotic Systems, Springer-Verlag,
New York, 1989.

[Parlitz 1985] U. Parlitz and W. Lauterborn. Superstructure in the bifurcation set of the Duffing equation, Physics
Letters A 107(8):351-355, 1985.

[Pearson 1986] C. E. Pearson, Numerical methods in engineering and science, Van Nostrand Reinhold Company
Inc, England, 1986.

[Peinke 1992] J. Peinke, J. Parisi, O. E. Rössler and R. Stoop, Encounter with Chaos : Self-Organized Hierarchical
Complexity in Semiconductor Experiments, Springer-Verlag, 1992.

[Petrov 1993] Valery Petrov, Vilmos Gaspar, Jonathan Masere and Kenneth Showalter. Controlling chaos in the
Belousov-Zhabotinsky reaction, Nature 361:240-243, 1993.

[Pfister 1992] G. Pfister, Th. Buzug and N. Enge. Characterization of experimental time series from Taylor-
Couette flow, Physica D 58:441-454, 1992.

[Provenzale 1992] A. Provenzale, L. A. Smith, R. Vio and G. Murante. Distinguishing between low-dimensional
dynamics and randomness in measured time series, Physica D 58:31-49, 1992.

[Pyragas 1992] K. Pyragas, Continuous control of chaos by self-controlling feedback, Physics Letters A 170:421-
428, 1992.

[Romeiras 1992] F. J. Romeiras, C. Grebogi, E. Ott and W. P. Dayawansa. Controlling chaotic dynamical systems,
Physica D 58:165-192, 1992.

[Rössler 1976] O. E. Rössler. An Equation for Continuous Chaos, Physics Letters A 57:397-398, 1976.

[Roy 1992] R. Roy, T. W. Murphy, T. D. Maier, Z. Gills and E. R. Hunt. Dynamical Control of a Chaotic
Laser: Experimental Stabilization of a Globally Coupled System, Physical Review Letters 68(9):1259-1262, 1992.

[Roy 1994] Rajarshi Roy and K. Scott Thornburg, Jr. Experimental Synchronization of Chaotic Lasers, Physical
Review Letters 72(13):2009-2012, 1994.

[Ruelle 1980] D. Ruelle. Strange Attractors, The Mathematical Intelligencer 2:126-137, 1980.

[Russell 1980] D. A. Russell, J. D. Hansen, and E. Ott. Dimensions of Strange Attractors, Physical Review Letters
45:1175-1178, 1980.

[Sano 1985] M. Sano and Y. Sawada. Measurement of the Lyapunov Spectrum from a Chaotic Time Series,
Physical Review Letters 55(10):1082-1085, 1985.

[Schiff 1994] S. J. Schiff, K. Jerger, D. H. Duong, T. Chang, M. L. Spano and W. L. Ditto, Controlling chaos in
the brain, Nature, 370:615-620, 1994.

[Schwartz 1994] I. B. Schwartz and I. Triandaf. Controlling unstable states in reaction-diffusion systems modeled
by time series, Physical Review E 50(4):2548-2552, 1994.

[Sepulchre 1993] J.A. Sepulchre and A. Babloyantz, Controlling chaos in network of oscillators, Physical Review
E 48(2):945-950, 1993.

[Shinbrot 1990] T. Shinbrot, E. Ott, C. Grebogi and J. A. Yorke. Using Chaos to Direct Trajectories to Targets,
Physical Review Letters 65(26):3215-3218, 1990.


[Shinbrot 1993] Troy Shinbrot, Celso Grebogi, Edward Ott and James A. Yorke. Using small perturbations to
control chaos, Nature 363:411-417, 1993.

[Shinbrot 1992a] T. Shinbrot, W. Ditto, C. Grebogi, E. Ott, M. Spano, and J. A. Yorke. Using the Sensitive
Dependence of Chaos (the "Butterfly Effect") to Direct Trajectories in an Experimental Chaotic System, Physical
Review Letters 68(19):2863-2866, 1992.

[Shinbrot 1992b] T. Shinbrot, C. Grebogi, E. Ott and J. A. Yorke. Using chaos to target stationary states of flows,
Physics Letters A 196:349-354, 1992.

[Singer 1991] J. Singer, Y-Z. Wang and H. H. Bau. Controlling a Chaotic System, Physical Review Letters
66(9):1123-1125, 1991.

[Smith 1986] W. A. Smith, Elementary Numerical Analysis, Prentice-Hall, England, 1986.

[Solé 1995] Ricard V. Solé, Liset Menéndez de la Prida, Controlling chaos in discrete neural networks, Physics
Letters A 199:65-69, 1995.

[Stewart 1989] I. Stewart, Does God Play Dice : The new mathematics of chaos, Penguin books, England, 1989.

[Stewart 1996] I. Stewart, From Here to Infinity : A guide to today's mathematics, Oxford University Press,
England, 1996.

[Thompson 1994] J. M. T. Thompson and S. R. Bishop, Nonlinearity and Chaos in Engineering Dynamics, John
Wiley & Sons, England, 1994.

[Tsui 1999] Alban P. Tsui and Antonia J. Jones. Periodic response to external stimulation of a chaotic neural
network with delayed feedback, International Journal of Bifurcation and Chaos, 9(4), April 1999.

[Tsui 2002] Alban P. Tsui, Antonia J. Jones and Ana Guedes de Oliveira. The construction of smooth models using
irregular embeddings determined by a Gamma test analysis, Neural Computing & Applications 10(4), 318-329,
April 2002. ISSN 0941-0643.

[Wang 1991] Xin Wang, Period-Doublings to Chaos in A Simple Neural Network, 1991 IEEE INNS International
Joint Conference on Neural Networks - Seattle, Vol II:333-339, 1991.

[Welstead 1991] Stephen T. Welstead, Multilayer Feedforward Networks Can Learn Strange Attractors, 1991
IEEE INNS International Joint Conference on Neural Networks - Seattle, Vol II:139-144, 1991.







                                                    COURSEWORK


            This work should be handed in three weeks before the Easter Break. It is suggested that you
            work on questions as the subject matter is covered in the course.


1.(a)       (i) Describe the genetic operators for crossover and mutation applied to genotypes consisting of fixed
            length strings.

            (ii) Give a detailed description of a typical genetic algorithm.

            (iii) Explain the different roles played by crossover and mutation in the process of genetic search.

(b) What are the main design problems in constructing a GA for a particular problem? What simple checks would
you suggest before running a full test of a new genetic algorithm (GA) to verify that it has some chance of
performing adequately?                                                                                    [4]

(c)         (i) "Darwin's theory of evolution is supposed to explain the diversity of species. If by definition two
            members of different species cannot interbreed then increasing specialization is the only plausible
            mechanism for species creation. Very specialized species are vulnerable to external changes and so in a
            dynamic environment might not be expected to survive in the long run. This would tend to decrease
            diversity rather than increase it." Discuss in two or three sentences.

            (ii) In the light of the above discussion identify features which might be present in natural evolution
            which are absent in genetic algorithms.


2(a). Let

$$I = \begin{pmatrix} i_{11} & i_{12} \\ i_{21} & i_{22} \end{pmatrix}$$

denote the input to a binary retina. Show that it is impossible to find a set of weights W = (wjk), wjk real numbers,
1 ≤ j, k ≤ 2, and a threshold θ so that the single linear function

$$P(I) = \begin{cases} 1 & \text{if } \sum_{j,k} w_{jk}\, i_{jk} > \theta \\[4pt] 0 & \text{if } \sum_{j,k} w_{jk}\, i_{jk} < \theta \end{cases}$$

can discriminate between horizontal and vertical lines.

(b) An alternative system is proposed in which a 2-tuple WISARD classifier with a 1-1 mapping is used.
Investigate the possible mappings and show that two-thirds of these lead to a system which discriminates perfectly
between horizontal and vertical lines.

(c) Suppose that for some given two-class problem on a 512 × 512 retina suitable weights can be found for the single
linear classifier system. Using 8 bits per weight, estimate the storage requirements for such a system. Allowing 10
micro sec per multiplication and 3 micro sec per addition, plus temporary storage overheads, calculate the
classification time. Would it be suitable for industrial application?

(d) Perform similar calculations, relating to storage and response time, for a WISARD 2-tuple system with a 1-1
mapping. Assume a conventional 16 bit architecture with an access time of 1 micro sec.

3(a). Briefly describe the main features of a Hopfield network: your discussion should include definitions for the
update rule and energy function.

(b) How are the weights of a Hopfield network usually assigned for a pattern recognition problem and what are
two limitations of this approach?

(c) It is proposed to apply some style of Hopfield network to the problem of finding a minimal tour for the
geometric TSP. Suggest and discuss one method of assigning network states to tours and briefly describe how the
weights of the network can be related to the distances between cities.

4(a).    (i) Why are hidden units necessary to solve general problems of function modelling using feedforward
         neural networks?

         (ii) What is the main difficulty in constructing a learning rule for feedforward networks with hidden
         units?

(b)      (i) Describe the backpropagation rule for learning using a feedforward network with one hidden layer.

                  Notation. The following functions, and hence all their partial derivatives, are assumed known.

                  Error function.
                                               E(z1, z2, ..., zn, t1, t2, ..., tn)                                    (24)

                  Here z1, z2, ..., zn are the outputs from the output layer units and t1, t2, ... ,tn are the target outputs.

                  Activation function.

                  Output layer:
                                       netj = netj(y1, y2, ..., ym, pj1, ..., pjt)                             (25)

                  Here y1, y2, ... ,ym are the outputs from the previous layer and pj1, ... , pjt are parameters
                  associated with the jth node of the output layer. Frequently t = t(m), i.e. the number of
                  parameters associated with a node is a function of the number of inputs.

                  Previous layer:
                                       neti = neti(x1, x2, ..., xl, pi1, ..., pis)                             (26)

                  Here x1, x2, ... ,xl are the outputs from the layer prior to the previous layer.

                  Output function.

                  Output layer:
                                           zj = f(netj)          (1 ≤ j ≤ n)                                 (27)




               Previous layer:
                                       yi = f(neti)          (1 ≤ i ≤ m)                                  (28)
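With this notation the gradients computed by backpropagation follow from the chain rule. As an illustrative
sketch (the abbreviation δj below is standard but not part of the notation above), for the output layer

$$\delta_j = \frac{\partial E}{\partial z_j}\, f'(\mathrm{net}_j), \qquad
\frac{\partial E}{\partial p_{jr}} = \delta_j\, \frac{\partial \mathrm{net}_j}{\partial p_{jr}}
\quad (1 \le j \le n,\; 1 \le r \le t),$$

while each node of the previous layer accumulates the deltas of the layer above:

$$\frac{\partial E}{\partial p_{ir}} =
\Bigl( \sum_{j=1}^{n} \delta_j\, \frac{\partial \mathrm{net}_j}{\partial y_i} \Bigr)
f'(\mathrm{net}_i)\, \frac{\partial \mathrm{net}_i}{\partial p_{ir}}
\quad (1 \le i \le m,\; 1 \le r \le s).$$

The weight update is then a gradient descent step on E using these partial derivatives.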


      (ii) What are the strengths and weaknesses of the backpropagation method?

(c)   The N-bit parity problem is this: given any vector x = (x1, x2, ..., xN), where xi ∈ {0, 1} (i.e. x is a vector
      of 0's and 1's), determine the parity of x. The parity of x is defined as 0 if the number of 1's in x is even
      and 1 if the number of 1's is odd.

      Construct a feedforward network to solve the N-bit parity problem.




                                                           81
CMT563


                                       Sample Examination Paper


                                          Time Allowed 2 hours


                                        Answer THREE Questions




1.(a) Give a high level pseudo-code description of a typical genetic algorithm.                        [5]

(b) Describe the genetic operators for crossover and mutation applied to genotypes consisting of
fixed length strings.                                                                        [5]

(c) Discuss three criteria by which one might evaluate the effectiveness of a heuristic search
algorithm, such as a genetic algorithm, when applied to a hard combinatoric search problem.
Briefly comment on each of these criteria.                                                 [5]

(d) What simple checks would you suggest before running a full test of a new genetic algorithm
(GA) to verify that the representation and crossover operator which have been selected have a
reasonable possibility of ensuring that the GA performs better than random search?          [5]


2(a). Describe the Hopfield model for a fully interconnected asynchronous network. Your
description should include a definition of the update rule for individual neurons and the method
for selecting which neuron to update at the next step.                                        [8]

(b) The energy of a Hopfield network with weights wij from the j th node to the i th node, with
wii = 0 for all i, wij = wji for all i, j, and thresholds θi, when the network is in state x = (x1, ..., xn)
∈ {0, 1}^n, where n is the number of neurons, is defined as

$$E = -\frac{1}{2} \sum_{\substack{i, j \\ i \neq j}} w_{ij}\, x_i x_j + \sum_i \theta_i\, x_i$$

Show that under the rules described in (a) the network will iterate to a state at which energy is
a local minimum and stay there.                                                              [8]

(c)     (i) If the Hopfield model is used as a self-associative memory how are the weights
        determined from the patterns?                                                   [2]

        (ii) What problems are encountered as more memories are added and what is the practical
        upper limit for memory storage with a given Hopfield network?
                                                                                            [2]



3(a). Describe the Wisard model for pattern recognition. State some advantages of this model
over other methods of pattern recognition.                                              [10]

(b) Explain how the storage requirements and response of a Wisard network alters as the size of
the n-tuple increases.                                                                      [5]

(c) Suppose that for some given two-class problem using a 512 × 512 retina a hardware Wisard
system is simulated using a conventional 16-bit serial architecture with a single 1 MHz processor.
Using 8 bits per weight estimate the storage requirements for such a system. Allowing 1 micro
sec for access, 10 micro sec for multiplication and 3 micro sec per addition and temporary storage
overheads, estimate the storage requirements and calculate the classification time. Would it be
suitable for industrial application?                                                           [5]

4(a). Briefly describe the backpropagation algorithm (detailed equations are not required but you
should explain what weight adjustments depend on at each step) and discuss its strengths and
weaknesses.                                                                                    [8]

(b) Suppose that the training data for a feedforward network is derived from a process which can
be modelled by a smooth function f from inputs to the single output y, and that in the data y is
subjected to measurement error r with mean zero. Identify the principal factors which will
determine the best Mean Squared Error that a trained network can achieve when tested on an
unseen set of data drawn from the same process.                                              [4]

(c) Briefly describe the Gamma test. Give an example of the type of problem when it would be
appropriate to use the Gamma test and an example where it would not be appropriate.      [8]






                                                     Solutions


1. (a) Give a high level pseudo-code description of a typical genetic algorithm                           [5]


             1. Randomly generate a population of M structures
                                     S(0) = {s(1,0),...,s(M,0)}.

             2. For each new string s(i,t) in S(t), compute and save its
             measure of utility v(s(i,t)).
             3. For each s(i,t) in S(t) compute the selection probability
             defined by
                      p(i,t) = v(s(i,t)) / Σi v(s(i,t)).
             4. Generate a new population S(t+1) by selecting structures from
             S(t) via the selection probability distribution and applying the
             idealised genetic operators to the structures generated.

             5. Goto 2.

   Algorithm 7-1 Generic GA
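Algorithm 7-1 can be sketched directly in Python (a minimal illustration; the OneMax fitness, string length, population size and rates are arbitrary choices, not part of the notes):

```python
import random

def generic_ga(fitness, l=20, M=30, p_mut=0.01, generations=50, seed=0):
    """Minimal GA following Algorithm 7-1: fitness-proportional
    selection, one-point crossover, per-site mutation."""
    rng = random.Random(seed)
    # 1. Randomly generate a population of M structures S(0).
    pop = [[rng.randint(0, 1) for _ in range(l)] for _ in range(M)]
    for _ in range(generations):
        v = [fitness(s) for s in pop]          # 2. utility v(s(i,t))
        total = sum(v)
        probs = [vi / total for vi in v]       # 3. p(i,t) = v(i)/sum v
        new_pop = []
        while len(new_pop) < M:                # 4. build S(t+1)
            a, b = rng.choices(pop, weights=probs, k=2)
            x = rng.randint(1, l - 1)          # crossover point
            child = a[:x] + b[x:]              # one of the two children
            child = [1 - c if rng.random() < p_mut else c for c in child]
            new_pop.append(child)
        pop = new_pop                          # 5. goto 2
    return max(pop, key=fitness)

best = generic_ga(sum)   # OneMax: fitness = number of 1's in the string
```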

(b) Describe the genetic operators for crossover and mutation applied to genotypes consisting of fixed length
strings.                             [5]

For the moment we represent strings as a1a2a3...al [ai = 1 or 0]. Using this notation we can describe the operators
by which strings are combined to produce new strings.

Crossover. In crossover one or more cut points are selected at random and the operation illustrated in the figure
(where two cut points are employed) is used to create two children. A variety of control regimes are possible, but
a simple strategy might be `select one of the children at random to go into the next generation'.

                         CROSSOVER (cut points shown as gaps)

                  Parent 1        1011 010011 10111
                  Parent 2        1100 111000 11010
                  Child 1         1100 010011 11010
                  Child 2         1011 111000 10111

                         MUTATION

                  110011100011010   ->   111011101011010

                         INVERSION

                  111111100011010   ->   110011111011010

                                 Standard genetic operators.

Crossing over proceeds in three steps:

         a) Two structures a1...al and b1...bl are selected at random
         from the current population.
         b) A crossover point x, in the range 1 to l-1 is selected, again
         at random.




         c) Two new structures                   a1a2...axbx+1bx+2...bl
                                                 b1b2...bxax+1ax+2...al

         are formed.

Children tend to be ‘like’ their parents, so that crossover can be considered as a focussing operator which exploits
knowledge already gained; its effects are quite quickly apparent.

Mutation. In mutation an allele is altered at each site with some fixed probability. Each structure a1a2...al in the
population is operated upon as follows. Position x is modified, with probability p independent of the other
positions, so that the string is replaced by

                                                 a1a2...ax-1 z ax+1...al

where z is drawn at random from the possible values. If p is the probability of mutation at a single position then
the number of mutations in a given string is binomially distributed and, for small p, is well approximated by a
Poisson distribution with parameter lp.

Mutation disperses the population throughout the search space and so might be considered as an information
gathering or exploration operator. Search by mutation is a slow process analogous to exhaustive search. Thus
mutation is a `background' operator, assuring that the crossover operator has a full range of alleles so that the
adaptive plan is not trapped on local optima.
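The two operators above can be written out as a short sketch (one-point crossover for simplicity):

```python
import random

def crossover(a, b, rng=random):
    """One-point crossover: cut both parents at a random point x
    (1 <= x <= l-1) and exchange the tails to form two children."""
    l = len(a)
    x = rng.randint(1, l - 1)
    return a[:x] + b[x:], b[:x] + a[x:]

def mutate(s, p, rng=random):
    """Flip each binary allele independently with probability p."""
    return [1 - a if rng.random() < p else a for a in s]

c1, c2 = crossover([1, 1, 1, 1, 1], [0, 0, 0, 0, 0], random.Random(3))
```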

(c) Discuss three criteria by which one might evaluate the effectiveness of a heuristic search algorithm, such as
a genetic algorithm, when applied to a hard combinatoric search problem. Briefly comment on each of these
criteria.                                                                                                     [5]

The three principal criteria are: solution quality; scaling in run time and memory for a given solution quality;
absolute run time should be acceptable. Solution quality is often hard to measure: for the TSP we might use the
Held-Karp lower bound. For hard combinatoric search a scaling of O(NlogN), where N is a measure of problem
size, is normally the best that can be achieved - anything worse than this results in unacceptable run times for large
problems. Acceptable absolute run time is a function of the commercial benefits and time available - for some early
VLSI layout designs super-computers were used with run times of several months. More usually a run time of
several days is the most that is acceptable.

(d) What simple checks would you suggest before running a full test of a new genetic algorithm (GA) to verify that
the representation and crossover operator which have been selected have a reasonable possibility of ensuring that
the GA performs better than random search.                                                                     [5]

One simple check is to run several thousand crossover events with randomly selected parents and record
child_fitness versus mean_parental_fitness: if the resulting scatter plot or correlation calculation reveals little or
no correlation between the two then the combination of representation and crossover is unlikely to produce a GA that
works any better than random search. On the other hand if there is a good correlation between child_fitness and
mean_parental_fitness then the mechanisms of evolutionary search have some chance of being effective.
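This check can be sketched as follows (illustrative only; OneMax fitness and random binary strings stand in for the real problem and representation):

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def crossover_check(fitness, population, trials=2000, seed=0):
    """Correlate child_fitness with mean_parental_fitness over many
    random one-point crossover events."""
    rng = random.Random(seed)
    l = len(population[0])
    mids, kids = [], []
    for _ in range(trials):
        a, b = rng.sample(population, 2)
        x = rng.randint(1, l - 1)
        child = a[:x] + b[x:]
        mids.append((fitness(a) + fitness(b)) / 2)
        kids.append(fitness(child))
    return pearson(mids, kids)

rng = random.Random(42)
pop = [[rng.randint(0, 1) for _ in range(30)] for _ in range(100)]
r = crossover_check(sum, pop)   # for OneMax the correlation is strong
```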


2. (a) Describe the Hopfield model for a fully interconnected asynchronous network. Your description should
include a definition of the update rule for individual neurons and the method for selecting which neuron to update
at the next step.                                                                                              [8]

Each of n neurons has two states, like those of McCulloch and Pitts, xi = 0 (not firing) and xi = 1 (firing at
maximum rate). When neuron i has a connection made to it from neuron j, the strength of the connection is defined



as wij. Non-connected neurons have wij = 0. The instantaneous state of the system is specified by listing the n values
of xi, so is represented by a binary word of n bits. The state changes in time according to the following algorithm.
For each neuron i there is a fixed threshold θi. Each neuron readjusts its state randomly in time but with a mean
attempt rate µ, setting

                    xi(t) = 1          if  Σj≠i wij xj(t-1)  >  θi
                    xi(t) = xi(t-1)    if  Σj≠i wij xj(t-1)  =  θi                                   (2)
                    xi(t) = 0          if  Σj≠i wij xj(t-1)  <  θi


Thus, an element chosen at random asynchronously evaluates whether it is above or below threshold and readjusts
accordingly. The procedure is described below.

 Procedure Hopfield (Assumes weights are assigned)
          Randomise initial state x ∈ {0, 1}^n
          Repeat until updating every unit produces no change of state:
                Select unit i (1 ≤ i ≤ n) with uniform random probability.
                Update unit i according to (2).
          End

Algorithm 7-2 Generic Hopfield net.
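The procedure can be sketched directly (an illustrative serial implementation; a shuffled sweep order stands in for the asynchronous random selection):

```python
import random

def hopfield_run(w, theta, x, seed=0):
    """Iterate the threshold rule (2) on units chosen in random order
    until a full pass over every unit produces no change of state."""
    rng = random.Random(seed)
    n = len(x)
    x = list(x)
    while True:
        changed = False
        order = list(range(n))
        rng.shuffle(order)                      # units updated one at a time
        for i in order:
            net = sum(w[i][j] * x[j] for j in range(n) if j != i)
            new = 1 if net > theta[i] else (0 if net < theta[i] else x[i])
            if new != x[i]:
                x[i], changed = new, True
        if not changed:
            return x

# two mutually excitatory units: the stable states are (0,0) and (1,1)
w = [[0, 1], [1, 0]]
theta = [0.5, 0.5]
```

Termination is guaranteed by the energy argument of part (b): with symmetric weights and zero diagonal each state change strictly decreases E.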

(b) The energy of a Hopfield network with weights wij from the j th node to the i th node, with wii = 0 for all i, wij
= wji for all i, j, and thresholds θi, when the network is in state x = (x1, ..., xn) ∈ {0, 1}^n, where n is the number of
neurons, is defined as

                              E  =  - (1/2) Σi,j (i≠j) wij xi xj  +  Σi θi xi                         (3)


Show that under the rules described in (a) the network will iterate to a state at which energy is a local minimum
and stay there.                                                                                                [8]

A low rate of neural firing is approximated by assuming that only one unit changes state at any given moment.
Then, since wij = wji, the change ΔE due to Δxi is given by

                              ΔE  =  - Δxi ( Σj≠i wij xj  -  θi )                                     (4)


Now consider the effect of the threshold rule (2). If the unit changes state at all then Δxi = ±1. If Δxi = 1 this means
the unit changes state from 0 to 1, hence by the threshold rule

                                      Σj≠i wij xj  >  θi                                              (5)

in which case, by (4), ΔE < 0. Alternatively, Δxi = -1, the unit therefore changes state from 1 to 0, and

                                      Σj≠i wij xj  <  θi                                              (6)

and again ΔE < 0. Thus if a unit changes state at all the energy must decrease. State changes will continue until
a locally least E is reached. The energy is playing the role of a Hamiltonian in the more general dynamical system
context.

(c)      (i) If the Hopfield model is used as a self-associative memory how are the weights determined from the
         patterns?                                                                                           [2]




An early rule used for memory storage in associative memory models can also be used to store memories in the
Hopfield model. If the {0, 1} node outputs are mapped to xi ∈ {-1, +1} the rule assigns weights as follows. For
each pattern vector x, which we require to memorise, we consider the matrix

                          x1                          x1x1  x1x2  ...  x1xn
                          x2                          x2x1  x2x2  ...  x2xn
              x xT   =    .    (x1, x2, ..., xn)  =    .     .    ...   .                             (7)
                          xn                          xnx1  xnx2  ...  xnxn

and then average these matrices over all pattern vectors (prototypes). We then set wii = 0 and note that the resulting
matrix is symmetric.

In this way we can capture the average correlations between components of the pattern vectors and then use this
information, during the operation of the network, to recapture missing or corrupted components.
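The averaged outer-product rule can be sketched as follows (illustrative; ±1 prototype patterns assumed):

```python
def hopfield_weights(patterns):
    """Average the outer products x x^T of equation (7) over the ±1
    prototype patterns, setting w_ii = 0.  The result is symmetric."""
    n = len(patterns[0])
    P = len(patterns)
    w = [[0.0] * n for _ in range(n)]
    for x in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += x[i] * x[j] / P
    return w

w = hopfield_weights([[1, -1, 1, -1], [1, 1, -1, -1]])
```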

         (ii) What problems are encountered as more memories are added and what is the practical upper limit for
         memory storage with a given Hopfield network?                                                       [2]

The main problem is that as the number of patterns P increases we find that an exponentially increasing number
of `spurious' local minima are introduced, i.e. minima not associated with a pattern. When P is approximately 0.15n,
where n is the number of nodes in the network, there is a dramatic degradation in the ability of the network to
recall noisy patterns.

3. (a) Describe the Wisard model for pattern recognition.




            Schematic of a 3-tuple recogniser.

The scheme outlined above was first proposed by Aleksander and Stonham in [Aleksander 1979]. The sample data



to be recognized is stored as a 2-dimensional array (the 'retina') of binary elements. Random connections are made
onto the elements of the array, N such connections being grouped together to form an N-tuple which is used to
address one random access memory (RAM) per discriminator. In this way a large number of RAM's are grouped
together to form a class discriminator whose output or score is the sum of all its RAM's outputs. This configuration
is repeated to give one discriminator for each class of pattern to be recognized. The RAMs implement logic
functions which are set up during training; thus the method does not involve any direct storage of pattern data.

The system is trained using samples of patterns from each class. A pattern is stored into the retina array and a
logical 1 is written into the RAM's of the discriminator associated with the class of this training pattern at the
locations addressed by the N-tuples. This is repeated many times, typically 25-50 times, for each class.

In recognition mode, the unknown pattern is stored in the array and the RAM's of every discriminator put into
READ mode. The input pattern then stimulates the logic functions in the discriminator network and an overall
response is obtained by summing all the logical outputs. The pattern is then assigned to the class of the
discriminator producing the highest score.

State some advantages of this model over other methods of pattern recognition.

Some advantages of the WISARD model for pattern recognition are:

         • Implementation as a parallel, or serial, system in currently available hardware is inexpensive and
         simple.

         • Given labelled samples of each recognition class, training times are extremely short.

         • The time required by a trained system to classify an unknown pattern is very small and, in a parallel
         implementation, is independent of the number of classes.

(b) Explain how the storage requirements and response of a Wisard network alters as the size of the n-tuple
increases.

The total amount of memory needed for the discriminator RAMs is easily calculated. Let n be the number of inputs,
C the number of classes, and assume a 1-1 mapping of a retina with p binary pixels. For convenience we assume
that n divides p. Then the number of 1-bit RAMs will be p/n and each RAM has 2^n bits of storage. So the number
of bits per discriminator is (p/n)·2^n. Hence the total number of bits for C classes is

                                           C · (p/n) · 2^n                                            (8)


The response of a discriminator becomes more sensitive to precise similarities of a pattern to patterns from the
corresponding training class as n increases.

(c) Suppose that for some given two-class problem using a 512 × 512 retina a hardware Wisard system is simulated
using a conventional 16-bit serial architecture with a single 1 MHz processor. Using 8 bits per weight estimate the
storage requirements for such a system. Allowing 1 micro sec for access, 10 micro sec for multiplication and 3
micro sec per addition and temporary storage overheads, estimate the storage requirements and calculate the
classification time. Would it be suitable for industrial application?                                           [5]

Storage requirements for n-tuple system:

                                     (No. of classes) × (Size of image) × 2^n / n

With two classes and n=2 this gives



                                      2 × 512 × 512 × 4/2 = 10^6 bits (approx).

Response time:

With a conventional 16-bit architecture, the computation time is mainly one of storage access (once per n-tuple
per pattern class). Taking 1 micro sec per access we have

                                      512 × 512 × 2 × 10^-6 / 2 = 1/4 sec (approx).

This is a bit slow for an industrial application (e.g. an assembly line).
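The arithmetic above can be checked with a short sketch (the helper names are hypothetical, introduced only for illustration):

```python
def wisard_storage_bits(classes, pixels, n):
    """Equation (8): C discriminators of (p/n) RAMs with 2^n bits each."""
    return classes * (pixels // n) * 2 ** n

def wisard_classify_secs(classes, pixels, n, access_secs=1e-6):
    """Serial response time: one RAM access per n-tuple per class."""
    return classes * (pixels // n) * access_secs

bits = wisard_storage_bits(2, 512 * 512, 2)    # about 10^6 bits
secs = wisard_classify_secs(2, 512 * 512, 2)   # about a quarter second
```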

4.(a) Briefly describe the backpropagation algorithm (detailed equations are not required but you should explain
what weight adjustments depend on at each step) and discuss its strengths and weaknesses.                    [6]

As is well known, there is no advantage in using several layers if the units have linear activation functions. Since
the delta rule is a modification of gradient descent we need to consider derivatives, and the activation functions
of linear threshold units are not differentiable (their derivative is infinite at the threshold and zero elsewhere). We
therefore consider semilinear activation functions. A semilinear activation function fi(neti) is one in which the
output of the unit is a non-decreasing and differentiable function of the net input to the unit.

Suppose the ith unit is an output unit. Let opi be the output produced by the ith unit when pattern p is presented
and tpi be the target output. In this case we set the error to be

                          δpi  =  (tpi - opi) f′i(netpi)          (for an output unit)

and the weight change for the weight associated with the jth input to the ith unit is then

                                      Δpwij  =  η δpi opj

where η > 0 is the learning rate.

The error signal for a hidden unit is determined recursively in terms of the error signals of the units to which it
directly connects and the weights of those connections, i.e.

                          δpi  =  f′i(netpi) Σk δpk wki           (for a hidden unit)

where k varies over those units to which the ith unit delivers outputs.

The weight change is then computed as before.

Thus the implementation of back propagation involves a forward pass through the layers to estimate the error, and
then a backward pass modifying the synapses to decrease the error. Practical implementations are not difficult, but
without modification it is still rather slow, especially for systems with many layers. Still, it is at present the most
popular learning algorithm for multilayer networks. It is considerably faster than the Boltzmann machine.
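The two delta rules can be sketched for a single hidden layer with sigmoid units (an illustrative implementation; the notes do not prescribe a particular activation function):

```python
import math

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

def backprop_step(x, t, w_hid, w_out, eta=0.5):
    """One backpropagation step: forward pass, then
         output unit:  d_i = (t_i - o_i) f'(net_i)
         hidden unit:  d_j = f'(net_j) * sum_k d_k w_kj
    with weight change dw_ij = eta * d_i * o_j.
    For the sigmoid, f'(net) = o (1 - o)."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hid]
    o = [sigmoid(sum(w * hj for w, hj in zip(row, h))) for row in w_out]
    d_out = [(ti - oi) * oi * (1.0 - oi) for ti, oi in zip(t, o)]
    d_hid = [hj * (1.0 - hj) * sum(d_out[k] * w_out[k][j]
                                   for k in range(len(o)))
             for j, hj in enumerate(h)]
    for i, di in enumerate(d_out):          # backward pass: output layer
        for j, hj in enumerate(h):
            w_out[i][j] += eta * di * hj
    for i, di in enumerate(d_hid):          # backward pass: hidden layer
        for j, xj in enumerate(x):
            w_hid[i][j] += eta * di * xj
    return o

w_hid = [[0.5, -0.5], [0.3, 0.8]]
w_out = [[0.4, -0.2]]
err = []
for _ in range(2):
    o = backprop_step([1.0, 0.0], [1.0], w_hid, w_out)
    err.append((1.0 - o[0]) ** 2)
```

After one weight update the squared error on the same pattern decreases, illustrating the gradient descent behaviour.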

(b) Suppose that the training data for a feedforward network is derived from a process which can be modelled by
a smooth function f from inputs to the single output y, and that in the data y is subjected to measurement error r with
mean zero. Identify the principal factors which will determine the best Mean Squared Error that a trained network
can achieve when tested on an unseen set of data drawn from the same process.                                       [4]

The function f can be regarded as a surface in m dimensions which we seek to approximate by a neural network
whose input -> output mapping we can call g. The training data gives a noisy model of the surface. It is plain that
there is no point in seeking to train the network beyond the stage where the Mean Squared Error (MSE) on the
training data is less than Var(r), the variance of r, since this corresponds to overfitting. This then will be the best



MSE possible.

The principal factors determining whether or not we can train g to have MSE ≈ Var(r) are:

         • The `bumpiness' of the surface defined by f. To accurately model a very bumpy surface obviously
         requires more data points.

         • The size of Var(r). The larger Var(r) becomes the less information is contained in any given data point.
         When Var(r) becomes comparable to the range of y very little information regarding f is retained in the
         training data.

         • The size of the training set. A very bumpy noisy surface will require a very large training set.

(c) Briefly describe the Gamma test. Give an example of the type of problem when it would be appropriate to use
the Gamma test and an example where it would not be appropriate.                                           [10]

[Any reasonable summary of the following]

Given data samples (x(i), y(i)), where x(i) = (x1(i), ..., xm(i)), 1 ≤ i ≤ M, let N[i, p] be the list of (equidistant) p th
nearest neighbours to x(i). We write

        δ(p)  =  (1/M) Σ(i=1..M) (1/L(N[i,p])) Σ(j ∈ N[i,p]) |x(j) - x(i)|²
              =  (1/M) Σ(i=1..M) |x(N[i,p]) - x(i)|²                                                  (12)




where L(N[i, p]) is the length of the list N[i, p]. Thus δ(p) is the mean square distance to the pth nearest neighbour.
Nearest neighbour lists for pth nearest neighbours (1 ≤ p ≤ pmax), typically we take pmax in the range 20-50, can be
found in O(MlogM) time using techniques developed by Bentley, see for example [Friedman 1977].

We also write

        γ(p)  =  (1/2M) Σ(i=1..M) (1/L(N[i,p])) Σ(j ∈ N[i,p]) (y(j) - y(i))²                          (13)



where the y observations are subject to statistical noise assumed independent of x and having bounded variance.10

Under reasonable conditions one can show that

        γ  →  Var(r) + A δ + o(δ)    as M → ∞                                                         (14)

where the convergence is in probability.

The Gamma Test computes the mean-squared pth nearest neighbour distances δ(p) (1 ≤ p ≤ pmax, typically pmax ≈
10) and the corresponding γ(p). Next the (δ(p), γ(p)) regression line is computed and the vertical intercept is
returned as the gamma value. Effectively this is the limit of γ as δ → 0, which in theory is Var(r).
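A brute-force sketch of this procedure (illustrative only: it uses O(M²) neighbour search rather than the O(M log M) kd-tree method, and ignores distance ties so each neighbour list has length 1):

```python
import math, random

def gamma_test(xs, ys, p_max=10):
    """Compute delta(p) and gamma(p) for p = 1..p_max as in (12), (13),
    then return the vertical intercept of the least-squares regression
    of gamma on delta: the estimate of Var(r)."""
    M = len(xs)

    def d2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    # for each point, all other points sorted by squared distance
    neigh = [sorted((d2(xs[i], xs[j]), j) for j in range(M) if j != i)
             for i in range(M)]
    deltas = [sum(neigh[i][p][0] for i in range(M)) / M
              for p in range(p_max)]
    gammas = [sum((ys[neigh[i][p][1]] - ys[i]) ** 2 for i in range(M))
              / (2.0 * M) for p in range(p_max)]
    # regression line gamma = intercept + slope * delta
    md, mg = sum(deltas) / p_max, sum(gammas) / p_max
    slope = (sum((d - md) * (g - mg) for d, g in zip(deltas, gammas))
             / sum((d - md) ** 2 for d in deltas))
    return mg - slope * md

# smooth 1-D function plus noise of variance 0.01 (illustrative data)
rng = random.Random(0)
xs = [[rng.random()] for _ in range(300)]
ys = [math.sin(3.0 * x[0]) + rng.gauss(0.0, 0.1) for x in xs]
var_estimate = gamma_test(xs, ys)   # should be close to 0.01
```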

A technique which allows one to estimate Var(r), on the hypothesis of an underlying continuous or smooth model
f, is of considerable practical utility in applications such as control or time series modelling. The implication of
being able to estimate Var(r) in neural network modelling is that then one does not need to train the network (or
indeed construct any smooth data model at all) in order to predict the best possible performance with reasonable
accuracy.

    10
      The original version in [Aðalbjörn Stefánsson 1997] and [Končar 1997] used smoothed versions of δ(p) and
γ(p) which rolled off the significance of more distant near neighbours. Later experience showed that this
complication was largely unnecessary and the version of the software used here is implemented as described above.

An appropriate problem type would be one in which the output is expected to be a smooth function of continuously
varying inputs.

An inappropriate problem type would be one in which many of the inputs took categorical values (e.g. 0/1).





O5-L3 Freight Transport Ops (International) V1.pdf
Yogi Goddess Pres Conference Studio Updates
school management -TNTEU- B.Ed., Semester II Unit 1.pptx
Ad

Evolutionary computing

  • 1. EVOLUTIONARY COMPUTING CMT563 Antonia J. Jones 6 November 2005
  • 2. Antonia J. Jones: 6 November 2005 UNIVERSITY OF WALES, CARDIFF DEPARTMENT OF COMPUTER SCIENCE (COMSC) COURSE: M.Sc. CMT563 MODULE: Evolutionary Computing LECTURER: Antonia J. Jones, COMSC DATED: Originally 15 January 1997 LAST REVISED: 6 November 2005 ACCESS: Lecturer (extn 5490, room N2.15). Overhead slides are posted on: http://users.cs.cf.ac.uk:81/Antonia.J.Jones/ electronically as PDF (Acrobat) files. It is not normally necessary for students attending the course to print this file, as complete sets of printed slides will be issued. ©2001 Antonia J. Jones. Permission is hereby granted to any web surfer for downloading, printing and use of this material for personal study only. Copyright permission is explicitly withheld for modification, re-circulation or publication by any other means, or commercial exploitation in any manner whatsoever, of this file or the material therein. Bibliography: MAIN RECOMMENDATIONS The recommended text for the course is: [Hertz 1991] J. Hertz, A. Krogh, and R. G. Palmer, Introduction to the theory of neural computing. Addison-Wesley, 1991. ISBN 0-201-51560-1 (pbk). A cheaper alternative is: Yoh-Han Pao, Adaptive pattern recognition and neural networks. Addison-Wesley, 1989. ISBN 0-201-12584-6. Price (UK) £31.45. A useful addition for the Mathematica labs is: Simulating Neural Networks with Mathematica. James A. Freeman. Addison-Wesley, 1994. ISBN 0-201-56629-X. These books cover most of the course, except any theory on genetic algorithms; the first is the recommended book for the course because it has excellent mathematical analyses of many of the models we shall discuss. The second includes some interesting material on the application of Bayesian statistics and fuzzy logic to adaptive pattern recognition. It is clearly written and the emphasis is on computing rather than physiological models.
  • 3. The principal sources of inspiration for work in neural and evolutionary computation are: ! E. R. Kandel, J. H. Schwartz, and T. M. Jessel. Principles of Neural Science (Third Edition), Prentice-Hall Inc., 1991. ISBN 0-8385-8068-8. ! J. D. Watson, Nancy H. Hopkins, J. W. Roberts, Joan A. Steitz, and A. M. Weiner. Molecular Biology of the Gene, Benjamin/Cummings Publishing Company Inc., 1988. ISBN 0-8053-9614-4. When you see how big they are you will understand why! It is a sobering thought that most of the knowledge in these tomes has been obtained in the last 20 years. Although extensive references are provided with the course notes (these are also a useful source of information for projects in Neural Computing), definitive bibliographies for computing aspects of the subject are: The 1989 Neuro-Computing Bibliography. Ed. Casimir C. Klimasauskas, MIT Press / Bradford Books, 1989. ISBN 0-262-11134-9. Finally, the key papers up to 1988 can be found together in: Neurocomputing: Foundations of Research. Ed. James A. Anderson and Edward Rosenfeld, MIT Press, 1988. ISBN 0-262-01097-6. NETS - OTHER (HISTORICALLY) INTERESTING MATERIAL Perceptrons, Marvin Minsky and Seymour Papert, MIT Press, 1972. ISBN 0-262-63022-2 (was reprinted recently). Neural Assemblies, G. Palm, Springer-Verlag, 1982. Self-Organisation and Associative Memory, T. Kohonen, Springer-Verlag, 1984. Parallel Models of Associative Memory, G. E. Hinton and J. A. Anderson, Lawrence Erlbaum, 1981. Connectionist Models and Their Applications, Special Issue of Cognitive Science 9, 1985. Computer, March 1988. Artificial Neural Systems, IEEE. Neural Computing Architectures, Ed. I. Aleksander, Kogan Page, December 1988. Parallel Distributed Processing. Vol. I Foundations. Vol. II Psychological and Biological Models. David E. Rumelhart et al., MIT Press / Bradford Books, 1986. ISBN 0-262-18123-1 (Set). Explorations in Parallel Distributed Processing - A Handbook of Models, Programs, and Exercises. James L. McClelland and David E. Rumelhart, MIT Press / Bradford Books, 1988. ISBN 0-262-63113-X. (Includes some very useful software for an IBM PC - there is also a newer version with software for the MAC). GENERAL An Introduction to Cybernetics, W. Ross-Ashby, John Wiley and Sons, 1964. A classic text on cybernetics. Vision: A computational investigation into the human representation and processing of visual information, David
  • 4. Marr, W. H. Freeman and Company, 1982. ISBN 0-7167-1284-9. One of the classic works in computational vision. Artificial Intelligence, F. H. George, Gordon & Breach, 1985. Useful textbook on AI. GENETIC ALGORITHMS/ARTIFICIAL LIFE Artificial Life, Ed. Christopher G. Langton, Addison-Wesley, 1989. ISBN 0-201-09356-1 pbk. A fascinating collection of essays from the first AL workshop at Los Alamos National Laboratory in 1987. The book covers an enormous range of topics (genetics, self-replication, cellular automata, etc.) on this subject in a very readable way but with great technical authority. There are innumerable figures, some forty colour plates and even some simple programs to experiment with. All this leads to a book that is beautifully presented and compulsive reading for anyone with a modest background in the field. Synthetic systems that exhibit behaviour characteristic of living systems complement the traditional analysis of living systems practised by the biological sciences. It is an approach to the study of life that would hardly be feasible without the advent of the modern computer and may eventually lead to a theory of living systems which is independent of the physical realisation of the organisms (carbon based, in this neck of the woods). The primary goal of the first workshop was to collect different models and methodologies from scattered publications and to present as many of these as possible in a uniform way. The distilled essence of the book is the theme that Artificial Life involves the realisation of lifelike behaviour on the part of man-made systems consisting of populations of semi-autonomous entities whose local interactions with one another are governed by a set of simple rules. Such systems contain no rules for the behaviour of the population at the global level. Adaptation in Natural and Artificial Systems, John H. Holland, University of Michigan Press, 1975. The book that started Genetic Algorithms, a classic. Genetic Algorithms and Simulated Annealing, Ed. Lawrence Davis, Pitman Publishing, 1987. ISBN 0-273-08771-1 (UK), 0-934613-44-3 (US). A collection of interesting papers on GA related subjects. Genetic Algorithms in Search, Optimization, and Machine Learning, David E. Goldberg, Addison-Wesley, 1989. ISBN 0-201-15767-5. The first real text book on GAs.
  • 5. CONTENTS
I What is evolutionary computing? ... 7
  Introduction ... 7
  A general framework for neural models ... 7
  Hebbian learning ... 10
  The need for machine learning ... 11
II Genetic Algorithms ... 14
  Introduction ... 14
  The archetypal GA ... 14
  Design issues - what do you want the algorithm to do? ... 18
    Rapid convergence to a global optimum ... 19
    Produce a diverse population of near optimal solutions in different `niches' ... 19
  * Results and methods related to the TSP ... 20
  Evolutionary Divide and Conquer ... 21
  Chapter references ... 25
III Hopfield networks ... 29
  Introduction ... 29
  Hopfield nets and energy ... 29
  The outer product rule for assigning weights ... 31
  Networks for combinatoric search ... 32
  Assignment of weights for the TSP ... 33
  * The Hopfield and Tank application to the TSP ... 36
  Conclusions ... 37
  Chapter references ... 37
IV The WISARD model ... 40
  Introduction ... 40
  Wisard model ... 41
  WISARD - analysis of response ... 43
  Comparison of storage requirements ... 44
  Chapter references ... 45
V Feedforward networks ... 46
  Introduction ... 46
  Backpropagation - mathematical background ... 46
    The output layer calculation ... 47
    The rule for adjusting weights in hidden layers ... 48
  The conventional model ... 48
  Problems with backpropagation ... 49
  The gamma test - a new technique ... 50
  * Metabackpropagation ... 53
  * Neural networks for adaptive control ... 53
  Chapter references ... 58
* VI The chaotic frontier ... 59
  Introduction ... 59
  Mathematical background ... 59
  Chaos ... 60
  • 6. Chaos in biology ... 61
  Controlling chaos ... 62
  The original OGY control law ... 62
  Chaotic conventional neural networks ... 64
  Controlling chaotic neural networks ... 65
    Control varying T in a particular layer ... 67
    Using small variations of the inputs ... 67
  Time delayed feedback and a generic scheme for chaotic neural networks ... 70
    Example: Controlling the Hénon neural network ... 71
  Chapter references ... 73
COURSEWORK ... 79
LIST OF FIGURES
Figure 1-1 The stylised version of a standard connectionist neuron. ... 8
Figure 1-2 Information processing capability [From: Mind Children, Hans Moravec]. ... 12
Figure 1-3 Storage capacity. ... 12
Figure 2-1 Generic model for a genetic algorithm. ... 15
Figure 2-2 Standard genetic operators. ... 17
Figure 2-3 Premature convergence - no sharing. ... 19
Figure 2-4 Solution to 50 City Problem using Karp's deterministic bisection method 1. ... 22
Figure 2-5 The EDAC (top) and simple 2-Opt (bottom) time complexity (log scales). ... 23
Figure 2-6 The 200 generation EDAC 5% excess solution for a 5000 city problem. ... 24
Figure 2-7 EDACII mid-parent vs offspring correlation for 5000 cities [Valenzuela 1995]. ... 25
Figure 2-8 EDACIII mid-parent vs offspring correlation for 5000 cities [Valenzuela 1995]. ... 25
Figure 3-1 Distance Connections. Each node (i, p) has inhibitory connections to the two adjacent columns whose weights reflect the cost of joining the three cities. ... 34
Figure 3-2 Exclusion connections. Each node (i, p) has inhibitory connections to all units in the same row and column. ... 34
Figure 4-1 Schematic of a 3-tuple recogniser. ... 40
Figure 4-2 Continuous response of discriminators to the input word 'toothache' [From Neural Computing Architectures, Ed. I Aleksander]. ... 42
Figure 4-3 A discriminator for centering on a bar [From Neural Computing, I. Aleksander and H. Morton]. ... 44
Figure 5-1 Solving the XOR problem with a hidden unit. ... 46
Figure 5-2 Feedforward network architecture. ... 47
Figure 5-3 The previous layer calculation. ... 48
Figure 5-4 The Water Tank Problem ... 54
Figure 5-5 Architecture for direct inverse neurocontrol. ... 55
Figure 5-6 Delta transformation of state differences: maps to the 2-4-2 network inputs for the Water Tank Problem. ... 56
Figure 5-7 Least squares fit to 200 data points with 20 nearest neighbours: = 0.0332. ... 56
Figure 5-8 Volume variation without adaptive training. 2-4-2 Network. MSE = 0.052. Linear Planner. ... 57
Figure 5-9 Temperature variation without adaptive training. 2-4-2 Network MSE = 0.052. Linear Planner. ... 57
Figure 5-10 The control signals generated by the 2-4-2 network without adaptive training. Linear Planner. ... 57
Figure 6-1 Stable attractor. ... 60
Figure 6-2 A chaotic time series. ... 61
Figure 6-3 The butterfly effect. ... 61
  • 7. Figure 6-4 Intervals for which the variables are defined. ... 62
Figure 6-5 Feedforward network as a dynamical system. ... 65
Figure 6-6 Chaotic attractor of Wang's neural network. ... 65
Figure 6-7 The Ikeda strange attractor. ... 66
Figure 6-8 Attractor for the chaotic 2-10-10-2 neural network. ... 66
Figure 6-9 Bifurcation diagram x obtained by varying T in the output layer only. ... 68
Figure 6-10 Bifurcation diagram y obtained by varying T in the output layer only. ... 68
Figure 6-11 Variations of x from initiation of control. ... 68
Figure 6-12 Variations of y from initiation of control. ... 68
Figure 6-13 Parameter changes during output layer control. ... 68
Figure 6-14 Bifurcation diagram for the output x(t+1) using an external variable added to the input x(t). ... 69
Figure 6-15 Bifurcation diagram for the output y(t+1) using an external variable added to the input x(t). ... 69
Figure 6-16 Variations of x from initiation of control. ... 69
Figure 6-17 Variations of y from initiation of control. ... 69
Figure 6-18 Parameter changes during input x control. ... 69
Figure 6-19 A general scheme for constructing a stimulus-response chaotic recurrent neural network: the chaotic "delayed" network is trained on suitable input-output data constructed from a chaotic time series; a delayed feedback control is applied to each input line; entry points for external stimulus are suggested, with a switch signal to activate the control module during external stimulation; signals on the delay lines or output can be observed at the "observation points". ... 71
Figure 6-20 The control signal corresponding to the delayed feedback control shown in Figure 6-21. Note that the control signal becomes small. ... 71
Figure 6-21 Response signal on x(n-6) with control signal activated on x(n-6) using k = 0.441628, J = 2, and without external stimulation after first 10 transient iterations. After n = 1000 iterations, the control is switched off. ... 71
Figure 6-22 Response signals on network output x(n), with control signal activated on x(n-6) using k = 0.441628, J = 2, and with constant external stimulation sn added to x(n-6), where sn varies from -1.5 to 1.5 in steps of 0.1 at each 500 iterative steps (indicated by the change of Hue of the plot points) after 20 initial transient steps. ... 72
Figure 6-23 The control signal corresponding to the delayed feedback control shown in Figure 6-22. Note that the control signal becomes small even when the network is under changing external stimulation. ... 72
Figure 6-24 Response signals on network output x(n), with control setup same as in Figure 6-22 but with Gaussian noise r added to external stimulation, i.e. sn+r, with F = 0.05, at each iteration step. ... 73
Figure 6-25 The control signal corresponding to the delayed feedback control shown in Figure 6-24. ... 73
Figure 6-26 Response signals on network output x(n), with control experiment setup same as in Figure 6-22 but with Gaussian noise r added to external stimulation, i.e. sn+r, with F = 0.15, at each iteration step. ... 73
Figure 6-27 Response signals on network output x(n), with control experiment setup same as in Figure 6-22 but with Gaussian noise r added to external stimulation, i.e. sn+r, with F = 0.3, at each iteration step. ... 73
Standard genetic operators. ... 84
Schematic of a 3-tuple recogniser. ... 87
LIST OF ALGORITHMS
Algorithm 2-1 Archetypal genetic algorithm. ... 16
Algorithm 3-1 Hopfield network. ... 31
Algorithm 5-1 The Gamma test. ... 52
Algorithm 5-2 Metabackpropagation. ... 53
Algorithm 7-1 Generic GA ... 84
Algorithm 7-2 Generic Hopfield net. ... 86
  • 8. I What is evolutionary computing? "All of this will lead to theories which are much less rigidly of an all-or-none nature than past and present formal logic. They will be of a much less combinatorial, and much more analytical, character. In fact, there are numerous indications to make us believe that this new system of formal logic will move closer to another discipline which has been little linked in the past with logic. This is thermodynamics, primarily in the form it was received from Boltzmann, and is that part of theoretical physics which comes nearest in some of its aspects to manipulating and measuring information." [von Neumann, Collected Works Vol. 5, p. 304] Introduction. Evolutionary computing embraces models of computation inspired by living Nature. For example, evolution of species by means of natural selection and the genetic operators of mutation, sexual reproduction and inversion can be considered as a parallel search process. Perhaps we can tackle hard combinatoric search problems in computer science by mimicking (in a very stylised form) the natural process of evolutionary search. Evolution through natural selection drives the adaptation of whole species, but individual members of a species can also adapt to a greater or lesser extent. The adaptation of individual behaviour on the basis of experience is learning and stems from plasticity of the neural structures which convey and process information in animals. Learning enables us to recognise previously encountered stimuli or experiences and modify our behaviour accordingly. It facilitates prediction and control of the environment, both essential prerequisites to planning. All of these are facets of what we loosely call intelligence. Real-world intelligence is essentially a computational process. This is a contentious assertion known as "the strong AI position". If it is true then the precise mechanism of computation (the hardware or neural wetware) ought to be irrelevant to the actual principles of the computational process. If this is indeed the case then the only obstacles to the construction of a truly intelligent artifact are our own understanding of the computational processes involved and our technical capability to construct suitable and sufficiently powerful computational devices. A general framework for neural models. Throughout this course we describe a number of neural models, each a variation on the connectionist paradigm (often called Parallel Distributed Processing - PDP), which in turn is derived from networks of highly stylised versions of the biological neuron. It is useful to begin with an analysis of the various components of these models. There are seven major aspects of a connectionist model: ! A set of processing units ui, each producing a scalar output xi(t) (1 ≤ i ≤ n). ! A connectivity graph which determines the pattern of connections (links) from each unit to each of the other units in the network. We shall often suppose that each unit has n inputs, but there is no particular reason why all units should have the same number of inputs.
  • 9. Antonia J. Jones: 6 November 2005 Although it is often convenient for theoretical discussions to consider fully interconnected networks, for very large networks of either real or artificial neurons the relevant case is that of relatively sparse connectivity. The connectivity graph then describes the fine topology of the network. This can be useful in practical applications, for example in speech recognition networks it is often helpful to have several copies of the same sub-net connected to temporally distinct inputs. These sub-net copies act as a feature detector and so can share their weights - this effectively reduces the number of parameters needed to describe the full network and speeds up learning. It is sufficient to be given a list of inputs and outputs for each node, for we then can recover the connectivity graph. ! A set of parameters pi1,...,pik, fixed in number, attached to each unit ui, which are adjusted during learning. Most commonly k = n and the parameters are weights wij (1 j n), where wij is often taken # # to be associated with the link from j to i, or in biological terms associated with the synaptic gap. ! An activation function for each unit, neti = neti(x1,...,xn;pi1,...,pik), which combines the inputs to ui into a scalar value. In the commonly used model neti = wijxj. ' It is important to realise that the basic principle of neural networks is that of simple (but arbitrary) computational function at each node. Learning when it occurs can be considered as an adjustment of the parameters associated with a node based on information locally available to the node. ‘Locally’ here means as specified by the connectivity graph. This information often takes the form of a correlation between the firings of adjacent nodes, but it could be a more sophisticated calculation. Thus we are really dealing with a very general class of parallel algorithms. 
The concentration on the 'weights associated with links' model has arisen partly because of the biological precedent, partly because of the extreme simplicity of the computational function of a node, and partly because this special case has been shown to be of practical interest.

[Figure 1-1: The stylised version of a standard connectionist neuron, showing inputs x_1(t),...,x_n(t), the activation net_i = net_i(x_1,...,x_n, p_i1,...,p_ik), a sigmoidal output function x_i = f(net_i), and the output x_i(t+1) passed along the unit's links.]

• An output function x_i = f(net_i) which transforms the activation into an output. In the earliest models f was a discontinuous step function. However, this poses analytical difficulties for learning algorithms, so that often now f is a smooth sigmoidal-shaped function. In some models f is allowed to vary
from one unit to another, and then we write f_i for f.

• A learning rule whereby the parameters associated with each processing unit are modified by experience.

• An environment within which the system must operate.

A set of processing units. Figure 1-1 illustrates a standard connectionist component. All of the processing of a connectionist system is carried out by these units. There is no executive or overseer. There are only relatively simple units, each doing its own relatively simple job. A unit's job is simply to receive input from other units and, as a function of the input it receives and the current values of its internal parameters, to compute an output value x_i which it sends to the other units. This output is discrete in some models and continuous in others. When the output is continuous it is often confined to [0,1] or [-1,1]. The system is inherently parallel in that many units carry out their computations at the same time. Within any system we are modelling it is sometimes useful to characterize three types of units: input, output, and hidden. The hidden units are those whose inputs and outputs are within the system we are modelling; they are not 'visible' to outside systems.

A connectivity graph. Each unit passes its output to other units along links. The graph of links represents the connectivity of the network.

A set of parameters and an activation function. In the conventional model the parameters for unit i are assumed to be weights w_ij associated with the link from unit j to unit i. If w_ij > 0 the link is said to be excitatory, if w_ij = 0 unit j is effectively not connected to unit i, and if w_ij < 0 the link is said to be inhibitory. In this case net_i is calculated as

    net_i = Σ_{j=1}^{n} w_ij x_j    (1)

This is a linear function of the inputs, and so net_i is constant over hyperplanes in the n-dimensional space of inputs to unit i.
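Equation (1) amounts to a one-line computation per unit. The sketch below is a Python illustration (the notes themselves use Mathematica; the function name is ours):

```python
# net_i = sum_j w_ij * x_j, the linear activation of equation (1).
def net(weights, inputs):
    """Weighted-sum activation of a single unit."""
    return sum(w * x for w, x in zip(weights, inputs))

# Example: one excitatory link (+1), one inhibitory link (-1), and one
# zero weight (that input is effectively disconnected from the unit).
print(net([1, -1, 0], [2, 3, 5]))   # 1*2 + (-1)*3 + 0*5 = -1
```

An excitatory weight raises net_i, an inhibitory one lowers it, and a zero weight removes that input's influence entirely, exactly as described above.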
In fact, if one is interested in generalising the computational function of a unit, it is often convenient to associate the parameters (in the conventional case weights) with the unit. In that case one thinks of the links as passing activation values, and one is no longer constrained to have exactly n (the number of inputs) parameters per unit. For example, one could have a unit which performed its distinction function by determining whether or not the input vector lay within some ellipsoid. In this case there would be n parameters associated with the centre of the ellipsoid and another n parameters associated with the axes. (In addition one could provide the ellipsoid with rotations, which would provide further parameters.) Now the activation function would look like

    net_i = Σ_{j=1}^{n} A_ij (x_j - c_ij)²    (2)

This is a simple example of a higher order network in which the function net_i is not a linear function of the inputs.

An output function. The simplest possible output function f would be the identity function, i.e. just take x_i = net_i. However, in this case with the activation function (1) the unit would be performing a totally linear function on the inputs and, as it turns out, such nets are rather uninteresting. In any event our unit is not yet making a distinction. In the discrete model the output function is usually
    x_i = 1 if net_i > θ_i
    x_i = 0 if net_i ≤ θ_i    (3)

where θ_i is the threshold, a parameter associated with the unit. However, this creates discontinuities of the derivatives, and so we usually smooth the output function and write

    x_i = f(net_i)    (4)

In the linear case f is some sort of sigmoidal function. For our ellipsoidal example Gaussian smoothing might be suitable, i.e. f(x) = exp(-x²), so that the output is large (near one) when the input vector is near the centre of the ellipsoid. Sometimes the output function is stochastic, so that the output of the unit depends in a probabilistic fashion on net_i.

For an individual unit the sequence of events in operational mode (not learning) is

1. Combine inputs to produce activation net_i(t).
2. Compute value of output x_i = f(net_i).
3. Place outputs, based on the new activation level, on output links (available from t+1 onward).

Changing the processing or knowledge structure in a connectionist model involves modifying the patterns of interconnections or the parameters associated with each unit. This is accomplished by modifying p_i1,...,p_ik (or the w_ij in the usual model) through experience using a learning rule. Virtually all learning rules are based on some variant of a Hebbian principle (discussed in the next section), which is invariably derived mathematically through some form of gradient descent. For example, the Delta or Widrow-Hoff rule: here the modification of weights is proportional to the difference between the actual activation achieved and the target activation provided by a teacher,

    Δw_ij = η (t_i(t) - net_i(t)) x_j(t)

where η > 0 is constant. This is a generalization of the Perceptron learning rule, and is all very well provided we know the desired values t_i(t).

Hebbian learning. Donald O.
Hebb's book The Organization of Behavior (1949) is famous among neural modelers because it contained the first explicit statement of the physiological learning rule for synaptic modification that has since become known as the Hebb synapse:

Hebb rule. When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased.

The physiological basis for this synaptic potentiation is now understood more clearly [Brown 1988]. Hebb's introduction to the book also contains the first use of the word 'connectionism' in the context of neural modeling. The Hebb rule is not a mathematical statement, though it is close to one. For example, Hebb does not discuss the various possible ways inhibition might enter the picture, or the quantitative learning rule that is being followed. This has meant that a number of quite different learning rules can legitimately be called 'Hebbian rules'. We shall see later that nearly all such learning rules bear a close mathematical relationship to the idea of 'gradient descent', which roughly means that if we wish to move to the lowest point of some error surface a good heuristic is: always tend to go 'downhill'. However, for the present chapter we shall conceptualise the Hebb rule in terms of autocorrelations, i.e. the internal correlations between each pair of components of the pattern vectors we wish
the system to memorise. Hebb was keenly aware of the 'distributed' nature of the representation he assumed the nervous system uses: that to represent something, assemblies of many cells are required, and that an individual cell may be a participating member of many representations at different times. He postulated the formation of cell assemblies representing learned patterns of activity.

The need for machine learning. Why do we need to discover how to get machines to learn? After all, is it not the case that the most practical developments in Artificial Intelligence, such as Expert Systems, have emerged from the development of advanced symbolic programming languages such as LISP or Prolog? Indeed, this is so. But there are convincing arguments [Bock 1985] which suggest that the technique of simulating human skills using symbolic programs cannot hope, in the long run, to satisfy the principal goals of AI. Mainly these centre around the time it would take to figure out the rules and write the software. But first we should consider the evolution of hardware.

How can one measure the overall computational power of an information processing system? There are two obvious aspects we should consider. Firstly, information storage capacity - a system cannot be very smart if it has little or no memory. On the other hand, a system may have a vast memory but little or no capacity to manipulate information; so a second essential measure is the number of binary operations per second. On these two scales Figure 1-2 illustrates the information processing capability of some familiar biological and technological information processing systems. In the case of the biological systems these estimates are based on connectionist models and may be excessively conservative. We consider each axis independently.
As we saw earlier, research in neurophysiology has revealed that the brain and central nervous system consist of about 10^11 individual parallel processors, called neurons. Each neuron has roughly 10^4 synaptic connections, and if we allow only 1 bit per synapse then each neuron is capable of storing about 10^4 bits of information. The information capacity of the brain is thus about 10^15 bits. Much of this information is probably redundant, but using this figure as a conservative estimate let us consider when we might expect to have high-speed memories of 10^15 bits.
[Figure 1-2: Information processing capability. From: Mind Children, Hans Moravec.]

Figure 1-3 shows that the amount of high-speed random access memory that may be conventionally accessed by a large computer has increased by an order of magnitude every six years. If we can trust this simple extrapolation, in generation thirteen, AD 2024-30, the average high-speed memory capacity of a large computer will reach 10^15 bits. Now consider the evolution of technological processing power. Remarkably, this follows much the same trend. Of course, the real trick is putting the two together to achieve the desired result, and it seems relatively unlikely that we shall be in a position to accomplish this by 2024.

[Figure 1-3: Storage capacity.]

So much for the hardware. Now consider the software. Even adult human brains are not filled to capacity, so we will assume that 10% of the total capacity, i.e. 10^14 bits, is the extent of the 'software' base of an adult human brain. How long will it take to write the
programs to fill 10^14 bits (production rules, knowledge bases etc.)? The currently accepted rate of production of software, from conception through testing, debugging and documentation to installation, is about one line of code per hour. Assuming, generously, that an average line of code contains approximately 60 characters, or 500 bits, we discover that the project will require 100 million person-years! We'll never get anywhere by trying to program human intelligence into a machine.

What other options are available? One is direct transfer from the human brain to the machine. Considering conventional transfer rates over a high-speed bus, this would take about 12 days. The only problem is: nobody has the slightest idea how to build such a device. What's left? In the biological world intelligence is acquired every day, therefore there must be another alternative. Every day babies are born and in the course of time acquire a full spectrum of intelligence. How do they do it? The answer, of course, is that they learn. If we assume that the eyes, our major source of sensory input, receive information at the rate of about 250,000 bits per second, we can fill the 10^14 bits of our machine's memory capacity in about 20 years. Now storing sensory input is not the same thing as developing intelligence, but this figure is in the right ball park. Maybe what we must do is connect our machine brain to a large number of high-data-rate sensors, endow it with a comparatively simple algorithm for self-organization, provide it with a continuous and varied stream of stimuli and evaluations for its responses, and let it learn.

This argument may seem cavalier in some respects. The human brain is highly parallel and somewhat inhomogeneous in its architecture. It does not clock at high serial speeds and does not access RAM to recall information for processing.
The storage capacity may be vastly greater than the 10^15 bits estimated by Sagan, since each neuron is connected to as many as 10,000 others and the structure of these interconnections may also store information. Indeed, although we do not know a great deal about the mechanisms of human memory, we do know that it is multi-levelled, with partial bio-chemical storage. However, none of this invalidates Bock's point that programming can never be a substitute for learning.
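The back-of-envelope arithmetic behind these estimates is easy to check. The sketch below uses two assumptions of our own that the notes do not state explicitly: roughly 2000 working hours per person-year, and roughly 16 waking hours per day for the sensory estimate:

```python
# Back-of-envelope checks for the figures quoted above.
SOFTWARE_BITS = 10**14          # assumed 'software' base of an adult brain
BITS_PER_LINE = 500             # ~60 characters per line of code

lines_of_code = SOFTWARE_BITS // BITS_PER_LINE      # 2 * 10**11 lines
person_years = lines_of_code // 2000                # at one line per hour
# -> 100 million person-years, as claimed

EYE_RATE = 250_000                                  # bits per second
seconds_awake_per_year = 16 * 3600 * 365            # waking hours only
years_to_fill = SOFTWARE_BITS / (EYE_RATE * seconds_awake_per_year)
# -> roughly 19 years, consistent with the 'about 20 years' in the text
```

Varying the assumed working hours or waking hours shifts the answers somewhat, but not the conclusion: programming the knowledge in by hand is hopeless, while learning it through the senses takes about a childhood.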
II Genetic Algorithms

Introduction. The idea that the process of evolutionary search might be used as a model for hard combinatoric search algorithms developed significantly in the mid-1960s. Evolutionary algorithms fall into the class of probabilistic heuristic algorithms which one might use to attack NP-complete or NP-hard problems (see, for example, [Horowitz 1978], Chapters 11 and 12), such as the Travelling Salesman/person Problem (TSP). Of course, many of these problems have significant applications in engineering hardware or software design and commercial optimisation problems, but the underlying motivation for the study of evolutionary algorithms is principally to try to gain insight into the evolutionary process itself.

Variously known as genetic algorithms (the phrase coined by the US school stemming from the work of John Holland [Holland 1975]), evolutionary programming (originally developed by L. J. Fogel, A. J. Owens and M. J. Walsh, again in the US), and Evolutionsstrategie (as studied in Germany at around the same time by I. Rechenberg and H-P. Schwefel [Schwefel 1965]), the subject has exploded over the last 15 years. Curiously, the European and US schools seemed largely unaware of each other's existence for quite some while. Evolutionary algorithms have been applied to a variety of problems and offer intriguing possibilities for general-purpose adaptive search algorithms in artificial intelligence, especially, but not necessarily, for situations where it is difficult or impossible to precisely model the external circumstances faced by the program.

Search based on evolutionary models had, of course, been tried before Holland's introduction of genetic algorithms. However, these models were based on mutation alone and were not notably successful.
The principal difference of the more modern research is an emphasis on the power of natural selection and the incorporation of a 'crossover' operator to mimic the effect of sexual reproduction.

Two rather different types of theoretical analysis have developed for evolutionary algorithms: the classical approach, stemming from the original work of Mendel on heritability and the later statistical work of Galton and Pearson at the end of the nineteenth century, and the schema theory approach developed by Holland. Mendel constructed a chance model of heritability involving what are now called genes. He conjectured the existence of genes by pure reasoning - he never saw any. Galton and Pearson found striking statistical regularities in heritability in large populations; for example, on average a son is halfway between his father's height and the overall average height for sons. They also invented many of the statistical tools in use today, such as the scatter diagram, regression and correlation (see, for example, [Freedman 1991]). Around 1920 Fisher, Wright and Haldane more or less simultaneously recognised the need to recast Darwinian theory as described by Galton and Pearson in Mendelian terms. They succeeded in this task, and more recently Price's Covariance and Selection Theorem [Price 1970], [Price 1972], an elaboration of these ideas, has provided a useful tool for algorithm analysis.

The archetypal GA. In Nature each gene has several forms or alternatives - alleles - producing differences in the set of characteristics associated with that gene; e.g. certain strains of garden pea have a single gene which determines blossom colour, one allele causing the blossom to be white, the other pink. There are tens of thousands of genes in the chromosomes of a typical vertebrate, each of which, on the available evidence, has several alleles.
Hence the set of chromosomes attained by taking all possible combinations of alleles contains on the order of 10^3000 structures for a typical vertebrate species. Even a very large population, say 10 billion individuals, contains only a minuscule fraction of the possibilities.
A further complication is that alleles interact, so that adaptation becomes primarily the search for co-adapted sets of alleles. In the environment against which the organism is tested, any individual exemplifies a large number of possible 'patterns of co-adapted alleles', or schema, as Holland calls them. In testing this individual we shall see that all schema of which the individual is an instantiation are also tested. If the rules whereby genes are combined have a tendency to generate new instances of above-average schema, then the resulting adaptive system has a high degree of 'intrinsic parallelism'¹ which accelerates the evolutionary process. Considerations of this type offer an explanation of how evolution can proceed at all. If a simple enumerative plan were employed, and if 10^12 structures could be tried every second, it would take a time vastly exceeding the estimated age of the universe to test 10^100 structures.

The basic idea of an evolutionary algorithm is illustrated in Figure 2-1.

[Figure 2-1: Generic model for a genetic algorithm. INITIALISE: create an initial population and evaluate the fitness of each member. Then repeat: create children from the existing population using genetic operators (internal); evaluate the fitness of the children (external); substitute the children into the population, deleting an equivalent number.]

We seek to optimise members of a population of 'structures'. These structures are encoded in some manner by a 'gene string'. The population is then 'evolved' in a very stylised version of the evolutionary process. We are given a set, A, of 'structures' which we can think of, in the first instance, as being a set of strings of fixed length l. The object of the adaptive search is to find a structure which performs well in terms of a measure of performance v : A → ℝ⁺, where ℝ⁺ denotes the positive real numbers.
¹ The notion of 'intrinsic parallelism' will be discussed, but it should be mentioned that it has nothing to do with parallelism in the sense normally intended in computing.
The programmer must provide a representation for the structures to be optimised. In the terminology of genetic algorithms a particular structure is called a phenotype and its representation as a string is called a chromosome or genotype. Usually this representation consists of a fixed-length string in which each component, or gene, may take only a small range of values, or alleles. In this context 'small' often means two, so that binary strings are used for the genotypes. There is nothing obligatory in taking a one-bit range for each allele, but there are theoretical reasons to prefer few-alleles-at-many-sites over many-alleles-at-few-sites (the arguments have been given by [Holland 1975] (p. 71) and [Smith 1980] (p. 56), and supporting evidence for the correctness of these arguments has been presented by [Schaffer 1984] (p. 107)).

1. Randomly generate a population of M structures S(0) = {s(1,0),...,s(M,0)}.
2. For each new string s(i,t) in S(t), compute and save its measure of utility v(s(i,t)).
3. For each s(i,t) in S(t) compute the selection probability defined by p(i,t) = v(s(i,t)) / Σ_i v(s(i,t)).
4. Generate a new population S(t+1) by selecting structures from S(t) via the selection probability distribution and applying the idealised genetic operators to the structures generated.
5. Goto 2.

Algorithm 2-1 Archetypal genetic algorithm.

The function v provides a measure of 'fitness' for a given phenotype and (since the programmer must also supply a mapping n from the set of genotypes to the set of phenotypes) hence for a given genotype. Given a particular genotype or string, the goal function provides a means for calculating the probability that the string will be selected to contribute to the next generation. It should be noted that the composition function v(n(·)) mapping genotypes to fitness is invariably discontinuous; nevertheless genetic algorithms cope remarkably well with this difficulty.
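Algorithm 2-1 can be sketched in a few dozen lines. The notes' own demonstrators (GA_Simple.nb, GA_Inversion.nb) are in Mathematica; the Python sketch below is ours, and the one-max fitness function (count of 1s, plus one to keep v strictly positive) and all parameter values are illustrative assumptions, not from the notes:

```python
import random

# A minimal sketch of Algorithm 2-1 on binary strings of length l.
def evolve(l=20, M=30, generations=50, p_mut=0.01, seed=0):
    rng = random.Random(seed)
    fitness = lambda s: sum(s) + 1          # v: strictly positive utility
    # Step 1: random initial population S(0).
    pop = [[rng.randint(0, 1) for _ in range(l)] for _ in range(M)]
    for _ in range(generations):
        # Steps 2-3: utilities and fitness-proportional selection.
        weights = [fitness(s) for s in pop]
        total = sum(weights)
        def select():                       # roulette-wheel selection
            r = rng.uniform(0, total)
            for s, w in zip(pop, weights):
                r -= w
                if r <= 0:
                    return s
            return pop[-1]
        # Step 4: new population via crossover and mutation.
        new_pop = []
        for _ in range(M):
            a, b = select(), select()
            x = rng.randint(1, l - 1)       # one-point crossover
            child = a[:x] + b[x:]
            child = [g ^ (rng.random() < p_mut) for g in child]
            new_pop.append(child)
        pop = new_pop                       # Step 5: go round again
    return max(pop, key=fitness)
```

On this toy goal function the population reliably drifts towards the all-ones string, illustrating how selection plus crossover exploits co-adapted alleles even though the algorithm never inspects the fitness function's structure.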
The basis of Darwinian evolution is the idea of natural selection, i.e. population genetics tends to use the Selection Principle: the fitness of an individual is proportional to the probability that it will reproduce effectively.² In genetic algorithm design we tend to apply this in the converse form: the probability that an individual will reproduce is proportional to its fitness. 'Fit' strings, i.e. strings having larger goal function values, will be more likely to be selected, but all members of the population will have some chance to contribute.

² Obfuscation of the definition of 'fitness' occurs frequently in the classical literature. The reasons are not difficult to understand. Both Darwin and Fisher found it hard to swallow that the lower classes bred more prolifically and were therefore, by definition, 'fitter' than their 'social superiors'. This confusion regarding 'fitness' still occurs in the GA literature for different reasons.
The box contains a sketch of the standard serial-style genetic algorithm. Typically the evaluation of the goal function for a particular phenotype, a process which strictly speaking is external to the genetic algorithm itself, is the most time-consuming aspect of the computation. Given the mapping from genotype to phenotype, the goal function, and an initial random population, the genetic algorithm proceeds to create new members of the population (which progressively replace the old members) using genetic operators, typically mutation, crossover and inversion, modelled on their biological analogues. For the moment we represent strings as a_1 a_2 a_3 ... a_l [a_i = 1 or 0]. Using this notation we can describe the operators by which strings are combined to produce new strings. It is the choice of these operators which produces a search strategy that exploits co-adapted sets of structural components already discovered. Holland uses three such principal operators: Crossover, Mutation and Inversion (which we shall not discuss in detail here).

[Figure 2-2: Standard genetic operators. CROSSOVER (two cut points): Parent 1 = 1011|010011|10111, Parent 2 = 1100|111000|11010; Child 1 = 1100|010011|11010, Child 2 = 1011|111000|10111. MUTATION: 110011100011010 becomes 111011101011010. INVERSION: 111111100011010 becomes 110011111011010.]

Crossover. In crossover one or more cut points are selected at random, and the operation illustrated in Figure 2-2 (see also Figure 7-1), where two cut points are employed, is used to create two children. A variety of control regimes are possible, but a simple strategy might be 'select one of the children at random to go into the next generation'. Children tend to be 'like' their parents, so that crossover can be considered as a focussing operator which exploits knowledge already gained; its effects are quite quickly apparent. Crossing over proceeds in three steps.

a) Two structures a_1...a_l and b_1...b_l are selected at random from the current population.
b) A crossover point x in the range 1 to l-1 is selected, again at random.

c) Two new structures

    a_1 a_2 ... a_x b_x+1 b_x+2 ... b_l
    b_1 b_2 ... b_x a_x+1 a_x+2 ... a_l

are formed.

In modifying the pool of schema (discussed below), crossing over continually introduces new schema for trial whilst testing extant schema in new contexts. It can be shown that each crossing over affects a great number of schema.
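Steps a)-c), together with the per-position mutation operator described in the Mutation subsection below, translate directly into code. This is a Python illustration of ours (the ten-bit example strings are our own, not the ones in Figure 2-2):

```python
import random

# One-point crossover, steps a)-c): cut both parents at position x
# (1 <= x <= l-1) and exchange the tails.
def crossover(a, b, x=None):
    if x is None:
        x = random.randint(1, len(a) - 1)
    return a[:x] + b[x:], b[:x] + a[x:]

# Per-position mutation on a binary string: each allele is replaced
# (here: flipped) independently with probability p.
def mutate(s, p=0.01):
    return [1 - g if random.random() < p else g for g in s]
```

For example, crossing "1011010011" and "1100111000" at x = 4 yields the children "1011111000" and "1100010011"; each child inherits a contiguous block of alleles from each parent, which is why crossover preserves short co-adapted groups far better than uniform crossover does.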
There is large variation in the crossover operators which have been used by different experimenters. For example, it is possible to cross at more than one point. The extreme case of this is where each allele is randomly selected from one or other parent string with uniform probability - this is called uniform crossover. Although some writers have argued in favour of uniform crossover, there would seem to be theoretical arguments against its use, viz. if evolution is the search for co-adapted sets of alleles then this search is likely to be severely undermined if many cut points are used. In language we shall develop shortly: the probability of schema disruption when using uniform crossover is much higher than when using one- or two-point crossover.

The design of the crossover operator is strongly influenced by the nature of the representation. For example, if the problem is the TSP and the representation of a tour is a straightforward list of cities in the order in which they are to be visited, then a simple crossover operator will, in general, not produce a tour. In this case the options are:

• Change the representation.
• Modify the crossover operator.
• Effect 'genetic repair' on the non-tours which may result.

There is obviously much scope for experiment for any particular problem. The danger is that the resulting algorithm may be so far removed from the canonical form that the correlation between parental and child fitness may be small - in which case the whole justification for the method will have been lost.

Mutation. In mutation an allele is altered at each site with some fixed probability. Mutation disperses the population throughout the search space and so might be considered as an information-gathering or exploration operator. Search by mutation alone is a slow process analogous to exhaustive search.
Thus mutation is a 'background' operator, assuring that the crossover operator has a full range of alleles so that the adaptive plan is not trapped on local optima. Each structure a_1 a_2 ... a_l in the population is operated upon as follows. Position x is modified, with probability p independent of the other positions, so that the string is replaced by

    a_1 a_2 ... a_x-1 z a_x+1 ... a_l

where z is drawn at random from the possible values. If p is the probability of mutation at a single position, then the number of mutations in a given string is binomially distributed and, for small p, is well approximated by a Poisson distribution with parameter lp. A simple demonstrator is given in the Mathematica program GA_Simple.nb. A more complicated GA using Inversion is given in GA_Inversion.nb.

Design issues - what do you want the algorithm to do? Now we have to ask just what it is we want of a genetic algorithm. There are several, sometimes mutually exclusive, possibilities. For example:

• Rapid convergence to a global optimum.
• Produce a diverse population of near-optimal solutions in different 'niches'.
• Be adaptive in 'real time' to changes in the goal function.

We shall deal with each of these in turn, but first let us briefly consider the nature of the search space. If the space is flat with just one spike then no algorithm short of exhaustive search will suffice. If the space is smooth and unimodal then a conventional hill-climbing technique should be used.
Somewhere between these two extremes are problems in which the goal function is a highly non-linear multi-modal function of the gene values - these are the problems of hard combinatoric search for which some style of genetic algorithm may be appropriate.

Rapid convergence to a global optimum. Of course this is rather simplistic. Holland's theory holds for large populations. However, in many AI applications it is computationally infeasible to use large populations, and this in turn leads to a problem commonly referred to in the genetic algorithms literature as Premature Convergence (to a sub-optimal solution) or Loss of Diversity. When this occurs the population tends to become dominated by one relatively good solution and locked into a sub-optimal region of the search space. For small populations the schema theorem is actually an explanation for premature convergence (i.e. the failure of the algorithm) rather than a result which explains success.

Premature convergence is related to a phenomenon observed in Nature. Allelic frequencies may fluctuate purely by chance about their mean from one generation to another; this is termed Random Genetic Drift. Its effect on the gene pool in a large population is negligible, but in a small effectively interbreeding population, chance alteration in Mendelian ratios can have a significant effect on gene frequencies and can lead to the fixation of one allele and the loss of another. For example, isolated communities within a given population have been found to have frequencies for blood group alleles different from the population as a whole. Figure 2-3 illustrates this phenomenon with a simple function optimisation genetic algorithm.

The inexperienced often attempt to counteract premature convergence by increasing the rate of mutation. However, this is not a good idea.
A high rate of mutation tends to devalue the role of crossover in building co-adapted sets of alleles and in essence pushes the algorithm in the direction of exhaustive search. Whilst some mutation is necessary, a high rate of mutation is invariably counter-productive. In trying to counteract premature convergence we are essentially trying to balance the exploitation of good solutions found so far against the exploration which is required to find hitherto unknown promising regions of the search space. It is worth observing that, in computational terms, any algorithm which often inserts copies of strings into the current population is wasteful. This is true for the Traditional Genetic Algorithm (TGA) outlined as Algorithm 2-1.

[Figure 2-3: Premature convergence - no sharing.]

Produce a diverse population of near-optimal solutions in different 'niches'. The problem of premature convergence has been addressed by a number of authors using a diversity of techniques. Many of the papers in [Davis 1987] contain discussions of precisely this point. The methods used to combat premature convergence in TGAs are not necessarily appropriate to the parallel formulations of genetic algorithms (PGAs) which we shall discuss shortly.

Cavicchio, in his doctoral dissertation, suggested a preselection mechanism as a means of promoting genotype diversity. Preselection filters the children generated, possibly picking the fittest, and replaces parent members of the population with their offspring [Cavicchio 1970].
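The Random Genetic Drift described above is easy to demonstrate by simulation. The sketch below is our own Python illustration (population size, generation count and seed are arbitrary choices): one biallelic locus, no selection at all, yet resampling noise alone drives the allele frequency to fixation in a small population.

```python
import random

# Drift at a single biallelic locus in a population of size M:
# each generation resamples M individuals from the current allele
# frequency; with no selection, the frequency random-walks until
# one allele is fixed and the other is lost.
def drift(M=20, generations=2000, seed=3):
    rng = random.Random(seed)
    freq = 0.5                      # initial frequency of allele '1'
    for _ in range(generations):
        ones = sum(rng.random() < freq for _ in range(M))
        freq = ones / M
        if freq in (0.0, 1.0):      # fixation: diversity is gone
            break
    return freq
```

With M = 20 fixation typically occurs within a few dozen generations, whereas for large M the frequency stays near 0.5 for a very long time, mirroring the point made above about small versus large interbreeding populations.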
De Jong's crowding scheme is an elaboration of the preselection mechanism. In the crowding scheme, an offspring replaces the most similar string from a randomly drawn subpopulation of size CF (the crowding factor) of the current population. Thus a member of the population experiences a selection pressure in proportion to its similarity to other members of the population [De Jong 1975]. Empirical determination of CF with a five-function test bed found CF = 3 to be optimal.

Booker implemented a sharing method in a classifier system environment which used the bucket brigade algorithm [Booker 1982]. The idea here was that if related rules share payments then sub-populations of rules will form naturally. However, it seems difficult to apply this mechanism to standard genetic algorithms. Schaffer has extended the idea of sub-populations in his VEGA model, in which each fitness element has its own sub-population [Schaffer 1984].

A different approach to help maintain genotype diversity was introduced by Mauldin via his uniqueness operator [Mauldin 1984]. The uniqueness operator helped to maintain diversity by incorporating a 'censorship' operator in which the insertion of an offspring into the population is possible only if the offspring is genotypically different from all members of the population at a number of specified genotypical loci.

Results and methods related to the TSP. We digress briefly to give a little more detailed background material on the TSP. The question is often asked: if one cannot exactly solve any very large TSP problem (except in special cases; at present 'very large' means a problem involving more than a thousand cities), how can one know how accurate a solution produced by a probabilistic or heuristic algorithm actually is?
The best exact solution methods for the travelling salesman problem are capable of solving problems of several hundred cities [Grötschel 1991], but unfortunately excessive amounts of computer time are used in the process and, as N increases, any exact solution method rapidly becomes impractical. For large problems we therefore have no way of knowing the exact solution, but in order to gauge the solution quality of any algorithm we need a reasonably accurate estimate of the minimal tour length. This is usually provided in one of two ways.

For a uniform distribution of cities the classic work by Beardwood, Halton and Hammersley (BHH) [Beardwood 1959] obtains an asymptotic best possible upper bound for the minimum tour length for large N. Let {X_i}, 1 ≤ i < ∞, be independent random variables uniformly distributed over the unit square, and let L_N denote the length of the shortest closed path which connects all the elements of {X_1,...,X_N}. In the case of the unit square they proved, for example, that there is a constant c > 0 such that, with probability 1,

    lim_{N→∞} L_N / N^(1/2) = c        (1)

In general c depends on the geometry of the region considered. One can use the estimate provided by the BHH theorem in the following form: the expected length L_N* of a minimal tour for an N-city problem, in which the cities are uniformly distributed in a square region of the Euclidean plane, is given by

    L_N* ≈ c2 √(NR)        (2)

where R is the area of the square and c2 is a constant (for historical reasons known as Stein's constant [Stein 1977]), recently estimated as c2 ≈ 0.70805 ± 0.00007 by Johnson, McGeoch and Rothberg [Johnson 1996].

A second possibility would be to use a problem specific estimate of the minimal tour length which gives a very accurate estimate: the Held-Karp lower bound [Held 1970], [Held 1971].
Computing the Held-Karp lower bound is an iterative process involving the evaluation of Minimal Spanning Trees for N-1 cities of the TSP followed by Lagrangean relaxations, see [Valenzuela 1997].
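The BHH estimate (2) is straightforward to apply in practice; a minimal sketch (the function name is ours):

```python
import math

def bhh_tour_estimate(n_cities, area, c2=0.70805):
    """Estimated minimal tour length L* ~ c2 * sqrt(N * R) for N cities
    uniformly distributed in a square region of area R (BHH theorem,
    with Stein's constant as estimated by Johnson, McGeoch and Rothberg)."""
    return c2 * math.sqrt(n_cities * area)

# e.g. 1000 cities in the unit square: bhh_tour_estimate(1000, 1.0) is about 22.39
```

Such an estimate is only asymptotic and only valid for uniformly distributed cities; for a specific instance the Held-Karp lower bound is far more accurate.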
If one seeks approximate solutions then various algorithms based on simple rule based heuristics (e.g. nearest neighbour and greedy heuristics), or local search tour improvement heuristics (e.g. 2-Opt, 3-Opt and Lin-Kernighan), can produce good quality solutions much faster than exact methods. A combinatorial local search algorithm is built around a `combinatoric neighbourhood search' procedure which, given a tour, examines all tours which are closely related to it and finds a shorter `neighbouring' tour, if one exists. Algorithms of this type are discussed in [Papadimitriou 1982]. The definition of `closely related' varies with the details of the particular local search heuristic. The particularly successful combinatorial local search heuristic described by Lin and Kernighan [Lin 1973] defines `neighbours' of a tour to be those tours which can be obtained from it by doing a limited number of interchanges of tour edges with non-tour edges. The slickest local heuristic algorithms³, which on average tend to have complexity O(n^α) for α > 2, can produce solutions with approximately 1-2% excess for 1000 cities in a few minutes. However, for 10,000 cities the time escalates rapidly and one might expect that the solution quality also degrades, see [Gorges-Schleuter 1990], p 101.

An approximation scheme A is an algorithm which, given a problem instance I and ε > 0, returns a solution of length A(I, ε) such that

    |A(I, ε) - L_min(I)| / L_min(I) ≤ ε        (3)

where L_min(I) denotes the minimal tour length for instance I. Such an approximation scheme is called a fully polynomial time approximation scheme if its run time is bounded by a function that is polynomial in both the instance size and 1/ε. Unfortunately the following theorem holds, see for example [Lawler 1985], p165-166.

Theorem. If P ≠ NP then there can be no fully polynomial time approximation scheme for the TSP, even if instances are restricted to points in the plane under the Euclidean metric.
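The 2-Opt tour improvement heuristic mentioned above can be sketched in a few lines; this is a plain textbook version, not any particular author's implementation:

```python
import math

def tour_length(tour, pts):
    """Length of a closed tour over the points pts, visiting in tour order."""
    return sum(math.dist(pts[tour[i]], pts[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def two_opt(tour, pts):
    """Simple 2-Opt: repeatedly reverse a tour segment whenever doing so
    replaces two tour edges by two shorter non-tour edges."""
    improved = True
    while improved:
        improved = False
        n = len(tour)
        for i in range(n - 1):
            for j in range(i + 2, n - (i == 0)):  # never break the same edge twice
                a, b = tour[i], tour[i + 1]
                c, d = tour[j], tour[(j + 1) % n]
                if (math.dist(pts[a], pts[c]) + math.dist(pts[b], pts[d])
                        < math.dist(pts[a], pts[b]) + math.dist(pts[c], pts[d])):
                    tour[i + 1:j + 1] = reversed(tour[i + 1:j + 1])
                    improved = True
    return tour
```

Each accepted move strictly shortens the tour, so the procedure terminates in a local minimum of the 2-interchange neighbourhood; 3-Opt and Lin-Kernighan enlarge this neighbourhood.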
Although the possibility of a fully polynomial time approximation scheme is effectively ruled out, there remains the possibility of an approximation scheme which, although not polynomial in 1/ε, does have a running time which is polynomial in n for every fixed ε > 0. The Karp algorithms, based on cellular dissection, provide `probabilistic' approximation schemes for the geometric TSP.

Theorem [Karp 1977]. For every ε > 0 there is an algorithm A(ε) such that A(ε) runs in time C(ε)n + O(nlogn) and, with probability 1, A(ε) produces a tour of length not more than (1 + ε) times the length of a minimal tour.

The Karp-Steele algorithms [Steele 1986] can in principle converge in probability to near optimal tours very rapidly. Cellular dissection is a form of divide and conquer. Karp's algorithms partition the region R into small subregions, each containing about t cities. An exact or heuristic method is then applied to each subproblem and the resulting sub-tours are finally patched together to yield a tour through all the cities.

³ The most impressive results in this direction are due to David Johnson at AT&T Bell Laboratories - mostly reported in unpublished Workshop presentations.

Evolutionary Divide and Conquer.

Until recently the best genetic algorithms designed for TSP problems have used permutation crossovers, for example [Davis 1985], [Goldberg 1985], [Smith 1985], or edge recombination operators [Whitley 1989], and required massive computing power to gain very good approximate solutions (often actually optimal) to problems with a few hundred cities [Gorges-Schleuter 1990]. Gorges-Schleuter cleverly exploited the architecture of a transputer bank to define a topology on the population and introduce local mating schemes which enabled her to delay the onset of premature convergence. However, this improvement to the genetic algorithm is independent of
any limitations inherent in permutation crossovers. Eventually, for problems of more than around 1000 cities, all such genetic algorithms tend to produce a flat graph of improvement against number of individuals tested, no matter how long they are run. Thus experience with genetic algorithms using permutation operators applied to the Geometric Travelling Salesman Problem (TSP) suggests that these algorithms fail in two respects when applied to very large problems: they scale rather poorly as the number of cities n increases, and the solution quality degrades rapidly as the problem size increases much above 1000 cities.

An interesting novel approach developed by Valenzuela and Jones [Valenzuela 1994], which seeks to circumvent these problems, is based on the idea of using the genetic algorithm to explore the space of problem subdivisions, rather than the space of solutions itself. This alternative method, for genetic algorithms applied to hard combinatoric search, can be described as Evolutionary Divide and Conquer (EDAC), and the approach has potential for any search problem in which knowledge of good solutions for subproblems can be exploited to improve the solution of the problem itself. As they say:

    Essentially we are suggesting that intrinsic parallelism is no substitute for divide and conquer in hard combinatoric search and we aim to have both. [Valenzuela 1994]

The goal was to develop a genetic algorithm capable of producing reasonable quality solutions for problems of several thousand cities, and one which will scale well as the problem size n increases. `Scaling well' in this context almost inevitably means a time complexity of O(n) or at worst O(nlogn). This is a fairly severe constraint: for example, given a list of n city co-ordinates, the simple act of computing all possible edge lengths, an O(n²) operation, is excluded. Such an operation may be tolerable for n = 5000 but becomes intolerable for n = 100,000.
In the previous section we mentioned the Karp and Steele cellular dissection algorithms, and it is this technique which is the basis of the Valenzuela-Jones EDAC genetic algorithms for the TSP.

Figure 2-4 Solution to 50 city problem using Karp's deterministic bisection method.
In practice a one-shot deterministic Karp algorithm yields rather poor solutions, typically 30% excess (with simple patching) when applied to 500 - 1000 city problems. Nevertheless, the Karp technique is a good starting point for exploring EDAC applied to the TSP. There are several reasons. First, according to Karp's theorem there is some probabilistic asymptotic guarantee of solution quality as the problem size increases. Second, the time complexity is about as good as one can hope for, namely O(nlogn). The run time of a genetic algorithm based on exploring the space of `Karp-like' solutions will be proportional to nlogn multiplied by the number of times the Karp algorithm is run, i.e. the number of individuals tested.

Karp's algorithm proceeds by partitioning the problem recursively from the top down. At each step the current rectangle is bisected horizontally or vertically, according to a deterministic rule designed to keep the rectangle perimeter minimal. This bisection proceeds until each subrectangle contains at most a preset number of cities t (typically t ≈ 10). Each small subproblem is then solved and the resulting subtours are patched together to produce a solution to the original problem - see Figure 2-4.

Figure 2-5 The EDAC (top) and simple 2-Opt (bottom) time complexity (log scales).

In the EDAC algorithm the genotype is a p × p binary array in which a `1' or `0' indicates whether to cut horizontally or vertically at the current bisection. If we maintain the subproblem size, t, and increase the number of cities in the TSP, then a partition better than Karp's becomes progressively harder to find by randomly choosing a horizontal or vertical bisection at each step. If the problem size is n ≈ 2^k t, where 2^k is the number of subsquares, then the corresponding genotype requires at least n/t - 1 bits.
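The recursive dissection can be sketched as follows. This is a simplified illustration under our own assumptions: Karp's actual rule chooses the cut so as to keep subrectangle perimeters small, which we approximate here by always cutting the longer side at the median city; in EDAC the horizontal/vertical choice would instead be read from the genotype bit for that bisection.

```python
def karp_partition(cities, t=10, rects=None, rect=None):
    """Karp-style recursive dissection (a sketch): bisect the current
    rectangle across its longer side until each subrectangle holds at
    most t cities; returns the resulting city groups."""
    if rect is None:
        xs = [c[0] for c in cities]; ys = [c[1] for c in cities]
        rect = (min(xs), min(ys), max(xs), max(ys))
    if rects is None:
        rects = []
    if len(cities) <= t:
        rects.append(cities)
        return rects
    x0, y0, x1, y1 = rect
    axis = 0 if (x1 - x0) >= (y1 - y0) else 1   # cut across the longer side
    cities = sorted(cities, key=lambda c: c[axis])
    m = len(cities) // 2
    cut = cities[m][axis]
    left_rect = (x0, y0, cut, y1) if axis == 0 else (x0, y0, x1, cut)
    right_rect = (cut, y0, x1, y1) if axis == 0 else (x0, cut, x1, y1)
    karp_partition(cities[:m], t, rects, left_rect)
    karp_partition(cities[m:], t, rects, right_rect)
    return rects
```

Each group would then be toured exactly (or heuristically) and the subtours patched together.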
The size of the partition space is 2^(p²), which for p = 80 (the value used for n = 5000) is approximately exp(4436). For n = 5000 the size of the permutation search space, roughly estimated using Stirling's formula, is around exp(37586). Thus searching partition space is easier than searching permutation space, and this provides a third argument in favour of exploring this representation of problem subdivision as a genotype. We know from Karp's theorem that the class of tours produced by dissection and patching will have representatives very close to the optimum tour, so by restricting attention to this smaller set one is not `throwing out the baby with the bath-water', i.e. the set may be smaller but it nevertheless contains near optimal tours.

This approach contrasts sharply with the idea of `broadcast languages' mooted in Chapter 8 of [Holland 1975], in which techniques for searching the space of representations for a genetic algorithm are discussed. In general the space of representations is vastly larger than the search space of the problem itself, but we have seen with the TSP that this space is already so huge that it is impractical to search in any comprehensive fashion for all except the smallest problems. Hence, it seems unlikely that replacing the original search space by an even larger one will turn out to be a productive approach.
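The two space sizes quoted above are easily checked; a small sketch (using the log-gamma function in place of Stirling's approximation):

```python
import math

# Natural-log sizes of the two search spaces for n = 5000, p = 80.
p, n = 80, 5000
ln_partition_space = p * p * math.log(2)              # |partition space| = 2**(p*p)
# Number of distinct tours is (n-1)!/2; take logs via lgamma: ln((n-1)!) = lgamma(n).
ln_permutation_space = math.lgamma(n) - math.log(2)
# ln_partition_space is about 4.4e3 and ln_permutation_space about 3.76e4,
# in line with the exp(4436) and exp(37586) figures quoted in the text.
```

The gap of some 33,000 in the exponent is what makes the partition-space representation attractive.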
Figure 2-6 The 200 generation EDAC 5% excess solution for a 5000 city problem.

In any event even the EDAC algorithm requires clever recursive repair techniques to improve the accuracy when subtours are patched together. Nevertheless, the algorithm scales well. Figure 2-5 compares the EDAC algorithms with simple 2-Opt (which gives an accuracy of around 8% excess). This version of the EDAC algorithm produces solutions at the 5% level, see Figure 2-6, but a later more elaborate variant reliably produces solutions with around 1% excess and has been tested on problem sizes of up to 10,000 cities. This technique probably represents the best that can be done at the present time using genetic algorithms for the TSP. It is not yet practical by comparison with iterated Lin-Kernighan (or even 2-Opt)⁴, but it scales well and may eventually offer a viable technique for obtaining good solutions to TSP problems involving several hundred thousand cities.

Parallel EDACII and EDACIII were both tested on a range of problems between 500 and 5000 cities. Parental pairs were chosen from the initial random population and the mid-parent value of the tour lengths calculated and recorded. Crossover and mutation were then applied to each selected parental pair and the tour length evaluated for the resulting offspring. Pearson's correlation coefficient, r_xy, was calculated in each experiment and significance tests based on Fisher's transformation carried out in order to establish whether the resulting correlation coefficients differed significantly from zero (i.e. no correlation). The scatter diagrams in Figure 2-7 and Figure 2-8 illustrate the Price correlation for parallel EDACII and EDACIII on the 5000 city problem.

Although the genotype used in these experiments was a binary array, it could more naturally (at the cost of complication in the coding) be represented by a pair of binary trees, or a quadtree.
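The mid-parent/offspring correlation test described above amounts to the following computation (a sketch; the function names are ours):

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient r_xy for paired samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def fisher_z(r, n):
    """Fisher's transformation: z = artanh(r) is approximately normal with
    standard error 1/sqrt(n-3), so artanh(r)*sqrt(n-3) can be compared
    against normal quantiles to test whether r differs from zero."""
    return math.atanh(r) * math.sqrt(n - 3)
```

Here xs would hold the mid-parent tour lengths and ys the corresponding offspring tour lengths; a value of fisher_z above 1.96 rejects the no-correlation hypothesis at the 5% level.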
⁴ For example, wildly extrapolating the figures gives the breakeven point with 2-Opt at around n = 422,800, requiring some 74 cpu days! Of course, other things would collapse before then.

The use of trees here would be more in keeping with the recursive construction of the phenotype from the genotype, a process analogous to
growth, and it is possible to produce a modified Schema theorem for the case of trees, where the genetic information is encoded in the shape of the tree and information placed at leaf nodes.

Figure 2-7 EDACII mid-parent vs offspring correlation for 5000 cities [Valenzuela 1995].
Figure 2-8 EDACIII mid-parent vs offspring correlation for 5000 cities [Valenzuela 1995].

In Nature very complex phenotypical structures frequently give the appearance of having been constructed recursively from the genotype. Examples of recursive algorithms which lead to very natural looking graphical representations of natural living structures such as trees, plants, and so on, can be found in the work of Lindenmayer [Lindenmayer 1971] on what are now called L-systems. These production systems are very similar to the production rules which define various kinds of context sensitive or context free grammars. The combination of tree structured genotypes, or recursive construction algorithms similar to production rules, with the divide-and-conquer paradigm suggests a powerful computational technique for the compression of complex phenotypical structures into useful genotypical structures. So much so that, as our understanding progresses of exactly how DNA encodes the phenotypical structure of individual biological organisms (particularly the neural systems of mammals), it would be surprising to find that Nature has not employed some such technique.

Chapter references

[Altenberg 1987] L. Altenberg and M. W. Feldman. Selection, generalised transmission, and the evolution of modifier genes. The reduction principle. Genetics 117:559-572.
[Altenberg 1994] L. Altenberg. The Evolution of Evolvability in Genetic Programming. Chapter 3 in Advances in Genetic Programming, Ed. Kenneth E.
Kinnear, Jr., MIT Press, 1994.
[Belew 1990] R. Belew, J. McInerney, and N. N. Schraudolph. Evolving networks: Using the genetic algorithm with connectionist learning. CSE Technical Report CS90-174, University of California, San Diego, 1990.
[Booker 1982] L. B. Booker. Intelligent behaviour as an adaption to the task environment. Doctoral dissertation, University of Michigan, 1982. Dissertation Abstracts International 43(2), 469B.
[Brandon 1990] R. N. Brandon. Adaptation and Environment, pages 83-84. Princeton University Press, 1990.
[Cavalli-Sforza 1976] L. L. Cavalli-Sforza and M. W. Feldman. Evolution of continuous variation: direct approach through joint distribution of genotypes and phenotypes. Proceedings of the National Academy of Sciences U.S.A., 73:1689-1692, 1976.
[Cavicchio 1970] D. J. Cavicchio. Adaptive search using simulated evolution. Doctoral dissertation, University of Michigan (unpublished), 1970.
[Chalmers 1990] David J. Chalmers. The Evolution of Learning: An experiment in Genetic Connectionism. Proceedings of the 1990 Connectionist Models Summer School, San Mateo, CA. Morgan Kaufmann, 1990.
[Collins 1991] R. J. Collins and D. R. Jefferson. Selection in massively parallel genetic algorithms. Proceedings of the Fourth International Conference on Genetic Algorithms. Morgan Kaufmann, 1991.
[Davis 1987] Lawrence Davis, Editor. Genetic Algorithms and Simulated Annealing. Pitman Publishing, London, 1987.
[De Jong 1975] K. De Jong. An analysis of the behaviour of a class of genetic adaptive systems. Doctoral dissertation, University of Michigan, 1975. Dissertation Abstracts International 36(10), 5140B.
[Freedman 1991] D. Freedman, R. Pisani, R. Purves and A. Adhikari. Statistics, Second edition, W. W. Norton, New York, 1991.
[Goldberg 1987] David E. Goldberg and Jon Richardson. Genetic Algorithms with Sharing for Multimodal Function Optimization. Proc. Second Int. Conf. on Genetic Algorithms, pp. 41-49, MIT.
[Gorges-Schleuter 1990] Martina Gorges-Schleuter. Genetic Algorithms and Population Structures: A Massively Parallel Algorithm. Ph.D. Thesis, Department of Computer Science, University of Dortmund, Germany, August 1990.
[Grefenstette 1987] John J. Grefenstette. Incorporating Problem Specific Knowledge into Genetic Algorithms. In Genetic Algorithms and Simulated Annealing, Ed. Lawrence Davis, Pitman Publishing, London.
[Holland 1975] John H. Holland. Adaptation in Natural and Artificial Systems. Ann Arbor: University of Michigan Press.
[Horowitz 1978] E. Horowitz and S. Sahni.
Fundamentals of Computer Algorithms. London, Pitman Publishing Ltd.
[Johnson 1996] D. S. Johnson, L. A. McGeoch and E. E. Rothberg. Asymptotic experimental analysis for the Held-Karp traveling salesman bound. Proceedings 1996 ACM-SIAM Symposium on Discrete Algorithms, to appear.
[Jones 1993] Antonia J. Jones. Genetic Algorithms and their Applications to the Design of Neural Networks. Neural Computing & Applications, 1(1):32-45, 1993.
[Koza 1992] John R. Koza. Genetic Programming: On the Programming of Computers by Means of Natural Selection. Bradford Books, MIT Press, 1992. ISBN 0-262-11170-5.
[Lindenmayer 1971] A. Lindenmayer. Developmental systems without cellular interaction, their languages and grammars. J. Theoretical Biology 30, 455-484, 1971.
[Lyubich 1992] Y. I. Lyubich. Mathematical Structures in Population Genetics. Springer-Verlag, New York, pages 291-306, 1992.
[Manderick 1989] B. Manderick and P. Spiessens. Fine grained parallel genetic algorithms. Proceedings of the Third International Conference on Genetic Algorithms. Morgan Kaufmann, 1989.
[Manderick 1991] B. Manderick, M. de Weger, and P. Spiessens. The genetic algorithm and the structure of the fitness landscape. In R. K. Belew and L. B. Booker, Editors, Proceedings of the Fourth International Conference on Genetic Algorithms, pages 143-150, San Mateo, CA, Morgan Kaufmann.
[Mauldin 1984] M. L. Mauldin. Maintaining diversity in genetic search. National Conference on Artificial Intelligence, 247-250, 1984.
[Macfarlane 1993] D. Macfarlane and Antonia J. Jones. Comparing networks with differing neural-node functions using Transputer based genetic algorithms. Neural Computing & Applications, 1(4):256-267, 1993.
[Menczer 1992] F. Menczer and D. Parisi. Evidence of hyperplanes in the genetic learning of neural networks. Biological Cybernetics 66(3):283-289.
[Miller 1989] G. Miller, P. Todd, and S. Hegde. Designing neural networks using genetic algorithms. In Proceedings of the Third Conference on Genetic Algorithms and their Applications, San Mateo, CA. Morgan Kaufmann, 1989.
[Muhlenbein 1988] H. Muhlenbein, M. Gorges-Schleuter, and O. Kramer. Evolution Algorithms in Combinatorial Optimisation. Parallel Computing, 7, pp. 65-85.
[Price 1970] G. R. Price. Selection and covariance. Nature, 227:520-521.
[Price 1972] G. R. Price. Extension of covariance mathematics. Annals of Human Genetics 35:485-489.
[Salmon 1971] W. C. Salmon. Statistical Explanation and Statistical Relevance. University of Pittsburgh Press, Pittsburgh, 1971.
[Schaffer 1984] J. D. Schaffer. Some Experiments in Machine Learning Using Vector Evaluated Genetic Algorithms. Ph.D. Thesis, Department of Electrical Engineering, Vanderbilt University, December 1984.
[Schwefel 1965] H-P Schwefel. Kybernetische Evolution als Strategie der experimentellen Forschung in der Strömungstechnik. Diploma thesis, Technical University of Berlin, 1965.
[Slatkin 1970] M. Slatkin. Selection and polygenic characters. Proceedings of the National Academy of Sciences U.S.A.
66:87-93, 1970.
[Smith 1980] S. F. Smith. A Learning System Based on Genetic Adaptive Algorithms. Ph.D. Dissertation, University of Pittsburgh.
[Spiessens 1991] P. Spiessens and B. Manderick. A massively parallel genetic algorithm - implementation and first analysis. Proceedings of the Fourth International Conference on Genetic Algorithms. Morgan Kaufmann, 1991.
[Valenzuela 1994] Christine L. Valenzuela and Antonia J. Jones. Evolutionary Divide and Conquer (I): A novel genetic approach to the TSP. Evolutionary Computation 1(4):313-333, 1994.
[Valenzuela 1995] Christine L. Valenzuela. Evolutionary Divide and Conquer: A novel genetic approach to the TSP. Ph.D. Thesis, Department of Computing, Imperial College, London, 1995.
[Valenzuela 1997] Christine L. Valenzuela and Antonia J. Jones. Estimating the Held-Karp lower bound for the geometric TSP. To appear: European Journal of Operational Research, 1997.
[Whitley 1990] D. Whitley, T. Starkweather, and C. Bogart. Genetic algorithms and neural networks: Optimizing connections and connectivity. Parallel Computing, forthcoming.
[Wilson 1990] Perceptron redux. Physica D, forthcoming.
III Hopfield networks.

Introduction.

As far back as 1954 Cragg and Temperley [Cragg 1954] had introduced the spin-neuron analogy using a ferromagnetic model. They remarked "It remains to be considered whether biological analogues can be found for the concepts of temperature and interaction energy in a physical system." In 1974 Little [Little 1974] introduced the temperature-noise analogy, but it was not until 1982 that John Hopfield [Hopfield 1982], a physicist, made significant progress in the direction requested by Cragg and Temperley. In a single short paragraph he suggests one of the most important new techniques to have been proposed in neural networks.

Hopfield nets and energy.

The standard approach to a neural network is to propose a learning rule, usually based on synaptic modification, and then to show that a number of interesting effects arise from it. Hopfield starts by saying that: "The function of the nervous system is to develop a number of locally stable states in state space." Other points in state space flow into the stable points, called attractors. In some other dynamic systems the behaviour is much more complex; for example, the system may orbit two or more points in state space in a non-periodic way, see [Abraham 1985]. However, this turns out not to be the case for the Hopfield net. The flow of the system towards a stable point provides a mechanism for correcting errors, since deviations from the stable points disappear. The system can thus reconstruct missing information, since the stable point will appropriately complete missing parts of an incomplete initial state vector.

Each of n neurons has two states, like those of McCulloch and Pitts: x_i = 0 (not firing) and x_i = 1 (firing at maximum rate). When neuron i has a connection made to it from neuron j, the strength of the connection is defined as w_ij. Non-connected neurons have w_ij = 0.
The instantaneous state of the system is specified by listing the n values of x_i, so it is represented by a binary word of n bits. The state changes in time according to the following algorithm. For each neuron i there is a fixed threshold θ_i. Each neuron readjusts its state randomly in time, but with a mean attempt rate µ, setting

    x_i(t) = 1             if Σ_{j≠i} w_ij x_j(t-1) > θ_i
    x_i(t) = x_i(t-1)      if Σ_{j≠i} w_ij x_j(t-1) = θ_i        (1)
    x_i(t) = 0             if Σ_{j≠i} w_ij x_j(t-1) < θ_i

Thus an element chosen at random asynchronously evaluates whether it is above or below threshold and readjusts accordingly.

Although this model has superficial similarities to the Perceptron there are essential differences. Firstly, Perceptrons were modelled chiefly with the neural connections in a `forward' direction and the analysis of such networks with backward coupling proved intractable. All the interesting results of the Hopfield model arise as a consequence of the strong backward coupling. Secondly, studies of perceptrons usually made a random net of
neurons deal with the external world, and did not ask the questions essential to finding the more abstract emergent computational properties. Finally, perceptron modelling required synchronous neurons, like a conventional digital computer. Although synchrony of sorts must exist in biological nervous systems (for example, the act of walking involves the precise temporal coordination of both legs, and the Purkinje fibres help control the heart), there is certainly no global synchrony in the same sense that electronic hardware is clocked. Given the variations in delays of nerve signal propagation, there would probably be no way to use global synchrony effectively. Computational properties which can exist in spite of asynchrony therefore have interesting implications for biological computing.

Hopfield considers the special case w_ij = w_ji (all i, j), w_ii = 0 (all i) and defines a function

    E = -(1/2) Σ_{i≠j} w_ij x_i x_j + Σ_i θ_i x_i        (2)

which is an analog to the physical energy of the system. A low rate of neural firing is approximated by assuming that only one unit changes state at any given moment. Then, since w_ij = w_ji, the change ΔE due to a change Δx_i is given by

    ΔE = -Δx_i ( Σ_{j≠i} w_ij x_j - θ_i )        (3)

Now consider the effect of the threshold rule (1). If the unit changes state at all then Δx_i = ±1. If Δx_i = 1 the unit changes state from 0 to 1, hence by the threshold rule

    Σ_{j≠i} w_ij x_j > θ_i        (4)

in which case, by (3), ΔE < 0. Alternatively, if Δx_i = -1 the unit changes state from 1 to 0, so that

    Σ_{j≠i} w_ij x_j < θ_i        (5)

and again ΔE < 0. Thus if a unit changes state at all the energy must decrease. State changes will continue until a locally least E is reached.⁵ The energy is playing the role of a Hamiltonian in the more general dynamic system context. For the Hopfield network individual state changes are deterministic.
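The threshold rule (1) and the energy-decrease argument above can be checked numerically; a minimal sketch (the function names are ours):

```python
import random

def energy(w, theta, x):
    """E = -1/2 * sum over i != j of w[i][j]*x_i*x_j + sum of theta_i*x_i (eq. 2)."""
    n = len(x)
    quad = sum(w[i][j] * x[i] * x[j] for i in range(n) for j in range(n) if i != j)
    return -0.5 * quad + sum(theta[i] * x[i] for i in range(n))

def update(w, theta, x, i):
    """Asynchronous threshold update of unit i (eq. 1; state kept on equality)."""
    s = sum(w[i][j] * x[j] for j in range(len(x)) if j != i)
    if s > theta[i]:
        x[i] = 1
    elif s < theta[i]:
        x[i] = 0
    return x
```

With symmetric weights and zero diagonal, repeated random asynchronous updates can never increase E, exactly as the derivation of (3)-(5) shows.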
⁵ This particular argument is valid only if the neural states change one at a time in some random order, which is approximated by a low neural firing rate. However, Cohen and Grossberg have proved a theorem about a much wider class of networks which guarantees a similar kind of stability for asynchronous operation.

However, in more general models, such as the Boltzmann machine, we can add a stochastic component to the node update rule which introduces a parameter T called temperature. At T = 0 state changes are decided deterministically by the threshold rule. For T > 0, as T increases the system becomes progressively less deterministic and more stochastic until, at high temperatures, any individual node is in either state with probability ½. Thus we can regard the Hopfield network in operational mode as the zero temperature case of the Boltzmann machine.

Hopfield now makes the critical observation that "This case is isomorphic with an Ising model.", thereby allowing a deluge of physical theory describing spin-glass models to enter network modelling. This flood of new participants has transformed the field of neural networks.

A spin glass is a magnetic alloy formed, for instance, by dilute solutions of manganese in copper or iron in gold. These impurities interact with each other by means of conduction electrons and the couplings are either of the
ferromagnetic (w_ij > 0) or antiferromagnetic (w_ij < 0) type. The interest of these alloys comes from the fact that they exhibit a wide variety of stable or meta-stable states. The dipoles interact via the couplings w_ij. In the simplest case a spin interacts only with its nearest neighbours, while the equivalent of the neural networks considered here requires infinite range interactions, where each spin is coupled to all the others. In the Ising model the Hamiltonian of such a spin glass is proportional to

    Σ_{i≠j} w_ij x_i x_j        (6)

the spins contributing to the total energy by pairwise interactions, and the system stabilizes at an equilibrium point which is a minimum of the free energy, see [Binder 1986].

Procedure Hopfield (assumes weights are assigned):
    Randomise initial state x ∈ {0, 1}^n
    Repeat until updating every unit produces no change of state:
        Select unit i (1 ≤ i ≤ n) with uniform random probability.
        Update unit i according to (1).

Algorithm 3-1 Hopfield network.

A number of computer simulations and some analysis led Hopfield to conclude that the number of `memories' (point attractors) that could be stored by a network was about 0.15n, where n is the number of neurons in the network, a figure quite precisely confirmed by later work. In an analytic tour de force [Amit 1987] it is shown that the Hopfield model can be solved exactly, in the thermodynamic limit as n → ∞. A phase diagram (temperature T, storage ratio P/n) is obtained, where T is a measure of the noise level and P/n is the ratio of the number of learnt patterns to the number of neurons [Crisanti 1986]. The main result is the existence of a sharp, discontinuous phase transition at P/n = A_o ≈ 0.14 (as T → 0). When P < A_o n the retrieval is very good (0.97 correlation) but it drops suddenly for P > A_o n.

Real neurons need not make synapses both i→j and j→i. We therefore ask if w_ij = w_ji is important.
Without this condition the probability of making errors is increased, but the algorithm continues to generate stable minima. Why should stable limit points or regions persist when w_ij is not equal to w_ji? If the algorithm at some time changes x_i from 0 to 1, or vice versa, the change of energy can be split into two terms. The first is the change that would apply in the symmetric model. The second is identically zero if w_ij is symmetric, and is `stochastic' with mean zero if w_ij and w_ji are randomly chosen. In the non-symmetric case the algorithm therefore changes E in time in a fashion similar to the symmetric case but corresponding to a finite temperature, i.e. a lower signal-to-noise ratio.

In [Hopfield 1984] a more realistic neuron model is used in which an internal continuous variable stores the linear sum of the excitation and inhibition weighted by the appropriate connection strengths. The internal variable is converted into an output activity by a sigmoidal non-linearity. As in the earlier paper he sets up an energy function and shows that the evolution of the system in time, given the properties of the neurons, will be to decrease energy. Thus the results found in the previous paper still largely hold. There is a brief, but clear, account of the Hopfield model in [Farhat 1985], in which an optical implementation is described.

The outer product rule for assigning weights.

An early rule used for memory storage in associative memory models can also be used to store memories in the
Hopfield model. If the {0, 1} node outputs are mapped to x_i ∈ {−1, +1} the rule assigns weights as follows. For each pattern vector x which we require to memorise, we consider the matrix

                                       [ x_1 x_1   x_1 x_2   ...   x_1 x_n ]
                                       [ x_2 x_1   x_2 x_2   ...   x_2 x_n ]
    x x^T,  x = (x_1, x_2, ..., x_n) = [    .         .      ...      .    ]        (7)
                                       [ x_n x_1   x_n x_2   ...   x_n x_n ]

and then average these matrices over all pattern vectors (prototypes). At the time the explanation was that in this way we can capture the average correlations between components of the pattern vectors and then use this information, during the operation of the network, to recapture missing or corrupted components.

Assuming that we know the patterns required to be memorised, the outer-product rule is a one-shot computation of the weights and so perhaps should not qualify as a `learning rule' in the usual sense of progressive weight modification upon exposure to experience. However, from the perspective of the Hopfield model there is another interpretation that can be given to this rule. For a point x = (x_1, ..., x_n) to be a stable attractor (i.e. a memory) we require that it be a local energy minimum. Suppose that patterns are presented sequentially and we wish to determine some rule which is intended to make frequently presented patterns likely to be energy minima. If we suppose that x is given at some stage, then it has an associated energy and, if we calculate

    ∂E/∂w_ij = −x_i x_j                                                     (8)

then by taking Δw_ij proportional to minus this gradient we obtain

    Δw_ij = η x_i x_j                                                       (9)

where η > 0 is some small constant. In other words the averaging process of the outer product rule can now be seen, in the context of progressive learning, as a form of gradient descent. Unfortunately, however we assign the weights, as progressively more prototype memories are added, the system eventually reaches saturation (at around 0.15n).
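The outer-product rule is easy to exercise numerically. A minimal sketch (Python, illustrative; it uses the ±1 convention into which the {0, 1} outputs are mapped) stores a few random prototypes, well below the 0.15n limit, and recovers one of them from a corrupted probe:

```python
import numpy as np

def outer_product_weights(patterns):
    # Average of the outer products (7) over the +/-1 prototypes; zero diagonal.
    w = sum(np.outer(p, p) for p in patterns) / len(patterns)
    np.fill_diagonal(w, 0.0)
    return w

def recall(w, x, sweeps=20):
    # Asynchronous threshold updates in the +/-1 convention.
    x = x.copy()
    for _ in range(sweeps):
        for i in range(len(x)):
            x[i] = 1.0 if w[i] @ x >= 0 else -1.0
    return x

rng = np.random.default_rng(1)
n, P = 100, 5                            # P = 5 prototypes, well below 0.15 * n
patterns = rng.choice([-1.0, 1.0], size=(P, n))
w = outer_product_weights(patterns)
probe = patterns[0].copy()
probe[:10] *= -1.0                       # corrupt 10 of the 100 components
recovered = recall(w, probe)             # should relax back to patterns[0]
```

With P pushed towards and beyond 0.15n the same experiment begins to fail, which is the saturation behaviour discussed in the text.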
This happens because the number of local minima is not really under our control: as more memories are added we obtain an exponentially increasing number of `spurious memories', i.e. local minima that do not correspond to patterns we wish to memorise. For example, we have

Theorem [Tanaka 1980], [McEliece 1987]. If the synaptic matrix is symmetric with w_ii = 0 and if its elements are independent Gaussian variables with zero mean and unit variance, then an asymptotic estimate for the number of fixed points is given by

    N_F ≈ (1.0505) 2^{0.2874 n}                                             (10)

A proof can be found in [Kamp 1990].

Networks for combinatoric search.

Apart from their applications to associative recall or pattern recognition, Hopfield networks can be applied to the very different problem of combinatoric search. It should be made clear at the outset that, with the possible exception of the Boltzmann machine, the application of neural networks to hard combinatoric search has not yet yielded systems or algorithms which compare favourably with state-of-the-art probabilistic algorithms designed for the specific problem, e.g. the TSP. However, this area is of considerable theoretical interest and may eventually prove to be of practical interest.
We know that the dynamics of a Hopfield network cause it to relax into a local energy minimum. Given a specific combinatoric search problem to be solved we are faced with two issues. First we have to design a representation which relates network states to the objects in the search space. Second we have to arrange that low energy states of the network correspond to good solutions.

To make matters specific we can consider the geometric TSP. Here the objects of search are tours and, given a list of N cities, we can identify a tour as any permutation of this list. There are N! permutations and only ½(N−1)! distinct tours, so we have already introduced some replication by simply identifying tours with permutations, but this causes no serious problems of itself. The next step is to consider how tours might be represented as a state of a network. Here, the generally used method is illustrated below. For a 5-city problem {A,B,C,D,E}, if city A is in position 2 of the tour this is represented by the second neuron from an array of five having an output of 1 and all others in the array having an output of 0, i.e. (0,1,0,0,0). The global state of Table 2-2 represents the tour (C,A,E,B,D). Thus for N cities a total of n = N² neurons are required to specify a complete tour.

Table 2-2 Example representation.

         1   2   3   4   5
    A    0   1   0   0   0
    B    0   0   0   1   0
    C    1   0   0   0   0
    D    0   0   0   0   1
    E    0   0   1   0   0

Clearly there is a 1-1 correspondence between valid tours and the set of all N×N permutation matrices, i.e. matrices which have precisely one `1' in each row and column, all other entries being zero. For an N-city TSP there are N! such matrices, which represent tours, out of 2^{N²} states in all. Now, this is not a very satisfactory situation. We have replaced the original search space, of size order N!, by a space of much greater size because

    log(N!) ≈ N log N − N + ½ log(2πN)
                                                                            (11)
    log(2^{N²}) = N² log 2

where the approximation for N!
is Stirling's formula. This is contrary to a guiding principle that wherever possible in hard combinatoric search we should simplify the space searched (subject to the condition that good solutions are rich in the smaller space) rather than make it larger. Nevertheless, the above representation has a certain paradigmatic simplicity and will serve to illustrate the ideas. The next problem to be addressed is: how to assign the weights so that states with low energy correspond to short tours. This `assignment problem' is dealt with in the next section.

Assignment of weights for the TSP.

Let d_ij denote the distance between city i and city j. Here we shall formulate the TSP as a 0-1 programming problem, by defining it as a quadratic assignment problem [Garfinkel 1985]. The TSP can also be formulated as a linear assignment problem [Aarts 1988] but, of course, then there are far more constraints. Using the n = N² node state variables defined by

    x_ip = 1, if the tour visits city i at the p-th position                (12)
           0, otherwise

we can formulate the TSP as the following quadratic assignment problem: Minimise
    F(x) = Σ_{i,j,p,q=0}^{N−1} a_ijpq x_ip x_jq                             (13)

subject to x_ip, x_jq ∈ {0, 1} and

    Σ_{i=0}^{N−1} x_ip = 1   (0 ≤ p ≤ N−1)
                                                                            (14)
    Σ_{p=0}^{N−1} x_ip = 1   (0 ≤ i ≤ N−1)

The first condition of (14) asserts there is just one `1' in every column, and the second condition places a similar constraint on every row. The a_ijpq are defined by

    a_ijpq = d_ij, if q ≡ p ± 1 (mod N)                                     (15)
             0,    otherwise

Figure 3-1 Distance connections. Each node (i, p) has inhibitory connections to the two adjacent columns, p − 1 (mod N) and p + 1 (mod N), whose weights reflect the cost of joining the three cities.

Figure 3-2 Exclusion connections. Each node (i, p) has inhibitory connections to all units in the same row and column.

We wish to choose the weights so that minimising the objective function (13), subject to the constraints of (14), corresponds to minimising the energy as defined in (2), (1). Figure 3-1 and Figure 3-2 illustrate the connections made to each node of the network. These are divided into two types. Distance connections, for which the weights are chosen so that, if the net is in a state which corresponds to a tour, these weights will reflect the energy cost of joining the (p − 1)th city of the tour to the pth city, and the pth city to the (p + 1)th. Note these connections wrap around (mod N). Exclusion connections, which inhibit two units in the same row or column from being on at the same time. Exclusion connections are designed so as to encourage the network to settle in a state which corresponds to a tour state. As all connections so far are inhibitory we need to provide the network with some incentive to turn on any units at all. This can be done by manipulating
the thresholds. Intuitively we can see that some arrangement such as this might well have the desired effect, but the next theorem shows exactly how to choose these weights so that it does. We consider the following sets of parameters

    S_t = { θ_ip : 0 ≤ i, p ≤ N − 1 }
    S_d = { w_ipjq : i ≠ j and q ≡ p ± 1 (mod N) }                          (16)
    S_e = { w_ipjq : (i = j and p ≠ q) or (i ≠ j and p = q) }

Theorem (Aarts). Let the weights and thresholds be chosen so that

    ∀ θ_ip ∈ S_t we have θ_ip < −max{ d_ik + d_il : k ≠ l, 0 ≤ k, l ≤ N − 1 }
    ∀ w_ipjq ∈ S_d we have w_ipjq = −d_ij                                   (17)
    ∀ w_ipjq ∈ S_e we have w_ipjq < min{ θ_ip, θ_jq }

then

(i) Feasibility. Valid tour states of the network exactly correspond to local minima of the energy function.

and

(ii) Ordering. The energy function is order-preserving with respect to tour length.

Proof. (i) Feasibility. Firstly, it is easy to check that a tour state is indeed a local minimum. We simply note the effect of changing the state of some unit. Now suppose the state is not a tour state. We divide this into two possible cases.

Case 1. Suppose the network state has more than a single `1' in some row or column. Then at least one exclusion connection is activated. For definiteness suppose units (i, p) and (j, p) are both on and that θ_ip = min{θ_ip, θ_jp} > w_ipjp. What is the effect of turning unit (i, p) off? Distance connections cause no problems, for suppose that some unit (k, q), with q ≡ p ± 1 (mod N), in an adjacent column is also on. In this case turning off (i, p) will remove a distance connection contribution d_ik to the energy, and so decrease it. For the exclusion connection, if the weights are chosen according to (17), then turning off the connected unit (i, p), which has lowest threshold, causes a change in energy

    ΔE = w_ipjp − θ_ip < 0                                                  (18)

and so again reduces the energy. Hence a network state with too many `1's in some row or column cannot correspond to a local minimum.

Case 2.
Now suppose that the network state has at most one `1' in every row and column (so that there are at most N units on) and that at least one row or column has no unit on. Suppose, for definiteness, that the pth column has no unit on. If every row contained some unit which is on then, with only N − 1 columns available, some column would have to contain two units which are on, which is contrary to hypothesis. Hence there must be fewer than N units on, so some row, say the ith, has no unit on; consider the unit (i, p). Turning this unit on does not contribute to the energy via any exclusion connections (because all other units in the same row or column are off). We next consider the contribution due to the distance connections. In each of the adjacent columns (mod N) there is at most one unit on. Call these (k, p−1) and (l, p+1) respectively, where it is understood that the indices p−1 and p+1 are taken (mod N). The choice of weights in (17) ensures that the change of energy produced by turning unit (i, p) on is
    ΔE = −w_{kp−1,ip} x_{kp−1} x_ip − w_{ip,lp+1} x_ip x_{lp+1} + θ_ip x_ip
       = d_ik + d_il + θ_ip < 0                                             (19)

Hence the initial state could not have been an energy minimum. The two cases considered cover all possibilities for network states which are not tour states, and so we conclude that the tour states exactly correspond to energy minima.

(ii) Ordering. To show that the energy function is order preserving we first observe that θ_ip does not depend on p, i.e. all θ_ip for units in a given row are equal. Consequently, the contribution to the energy from the threshold terms is the same for all tour states. The remaining terms contribute

    ½ Σ_{(i,p) ≠ (j,q)} d_ij x_ip x_jq                                      (20)

which evaluates to the tour length. Hence energy preserves the ordering of tour length. ∎

At first sight this theorem may seem paradoxical. In analysing the memory capacity of a Hopfield network, we concluded earlier that only around 0.15n prototype vectors could be memorised before the onset of catastrophic degradation in recall. Thus we were only able to control the network behaviour at a relatively small proportion of an exponentially large (as in (10)) number of local minima. Yet, in the assignment of weights for the TSP network, we have N! local minima exactly where we want them. How can this be? The answer lies in the following observation. In analysing the effect of the outer-product rule for assigning weights in order to recall stored memories, we assumed that there was no structure to the prototype vectors, i.e. that they were uncorrelated. In the case of the TSP assignment there is a very definite structure in the set of states we are seeking to assign to minima: the structure imposed by the ½N(N−1) distances between the cities of the original problem. The TSP weight assignment respects this structure and that additional degree of control allows us to place N! states in exactly the relationship needed to solve the problem.
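The weight assignment of the theorem can be checked numerically on a small instance. In the sketch below (Python; the instance and all names are illustrative) the energy of every tour state is computed from the distance and threshold terms, confirming the ordering property: tour-state energies differ from tour lengths only by the constant threshold contribution.

```python
import numpy as np
from itertools import permutations

def tsp_energy(d, x, theta):
    # E = -1/2 sum w_uv x_u x_v + sum theta_u x_u for the network of (13)-(17).
    # Units are indexed (i, p): city i in tour position p; theta is constant.
    N = d.shape[0]
    E = theta * x.sum()
    # distance connections: w_{ip,jq} = -d_ij for q = p +/- 1 (mod N), i != j
    for p in range(N):
        E += 0.5 * sum(d[i, j] * x[i, p] * (x[j, (p - 1) % N] + x[j, (p + 1) % N])
                       for i in range(N) for j in range(N) if i != j)
    # exclusion connections link units in the same row or column; for a tour
    # state exactly one unit per row and column is on, so they contribute 0.
    return E

rng = np.random.default_rng(0)
N = 4
pts = rng.random((N, 2))                                   # random planar cities
d = np.hypot(*(pts[:, None, :] - pts[None, :, :]).transpose(2, 0, 1))
theta = -(2 * d.max() + 1e-9)         # theta < -max(d_ik + d_il), as in (17)

def tour_state(perm):
    s = np.zeros((N, N))
    s[list(perm), range(N)] = 1       # city perm[p] occupies position p
    return s

def tour_length(perm):
    return sum(d[perm[p], perm[(p + 1) % N]] for p in range(N))

tours = list(permutations(range(N)))
E = [tsp_energy(d, tour_state(t), theta) for t in tours]
L = [tour_length(t) for t in tours]
# Ordering: E and L differ only by the constant N * theta over all tour states.
```

Minimising the energy over tour states is therefore exactly minimising tour length, which is the content of part (ii).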
A Mathematica implementation of the asynchronous Hopfield network is given in the Mathematica directory. The assignment of weights for a specific TSP problem discussed here is also given there.

The Hopfield and Tank application to the TSP.

Hopfield and Tank first proposed using the Hopfield model to solve the TSP in [Hopfield 1986]. Their choice of energy function was motivated by similar considerations to those of the previous section, but was slightly different in detail. For a specific TSP problem we assume that the assignment of weights is made as described in the previous section. Suppose we now initialise the corresponding Hopfield network to a random state. Then if the network is run we should expect it to settle into a local energy minimum, and we have proved that this will correspond to some tour state. However, it is unlikely to correspond to the optimal tour.

Hopfield and Tank suggested an ingenious way of overcoming this problem, which is very similar to the Boltzmann machine approach (not discussed in these notes). Instead of the {0, 1} network Hopfield and Tank used a continuous model, 0 ≤ x_i ≤ 1, in which neurons possess a sigmoidal activation function x_i = f(λ net_i), where net_i is the net input and λ is a positive constant which represents the gain and is equivalent to varying the slope of the sigmoidal. Hence f represents the input-output characteristics of a non-linear amplifier with negligible response time. The discrete model represents the case where λ → ∞. In their simulation λ is taken to be 50 (large but finite).

For 10 cities there are 181,440 possible tours. In the Hopfield and Tank simulations about 50% of the trials produced one of the two shortest paths. They ask why the computation is so effective and provide the following
answer:

    "The solution to a TSP is a path and the decoding of the TSP's network final stable state to obtain this discrete decision or solution requires having the final x_ip values to be near 0 or 1. However, the actual analog computation occurs in the continuous domain 0 ≤ x_ip ≤ 1. The decision-making process or computation consists of the smooth motion from an initial state in the interior of the space (where the notion of a `tour' is not even defined) to an ultimate stable point near enough to a corner of the continuous domain to be able to identify with that corner. It is as though the logical operations of a calculation could be given continuous values between `true' and `false', and evolve toward certainty only near the end of the calculation."

Naturally, with such an interesting approach to an NP-hard problem, others tried to repeat these results. It was found by a number of researchers that the Hopfield-Tank algorithm, as originally formulated, was highly unstable: "Our simulations indicate that Hopfield and Tank were very fortunate in the limited number of TSP simulations they attempted. Even at the value N = 10 it transpires that their basic method is unreliable..." [Wilson 1988]. However, others later refined the energy function and modified the algorithm to perform more reliably. There are also a number of related papers on `elastic net' methods for the TSP problem [Durbin 1987].

Conclusions.

One lesson we have learned from this chapter is that the performance of networks for hard combinatoric search is critically dependent both on the mapping from the problem domain to the network, and on the details of the network architecture and weight assignment which encode the constraints of the original problem. Appropriate encoding of domain knowledge into the network architecture can profoundly enhance the overall performance.
One of the important general questions which arises from these observations is: how can such encodings of basic constraints themselves be learnt, possibly at the genetic level?

In another direction entirely, we first discussed associative memories with an emphasis on direct methods of encoding prototype patterns into the weights of a fully connected network. As our study of these networks progressed the emphasis moved to a perspective more in accord with dynamic systems. This also mirrors the historical development of the subject of artificial neural networks. Following Hopfield's papers prototype vectors, or `memories', were associated with point attractors of the dynamics of the network. However, we can see that this preoccupation is, in itself, just a special case of a much more general vista. Why should we restrict our attention to the point attractors of the system? In modelling biological systems we are studying techniques for embedding learned behaviour into the system. In other words we should really be studying the following question:

! How can we sculpt the dynamic evolution of the system through time so as to induce behaviours characteristic of, or responsive to, a given external dynamic system?

If we loosely identify a `behaviour' with a trajectory through the state space of the network, then point attractors are just one of the simplest characterisations of dynamic system behaviour. It would be much more interesting to develop learning techniques for whole classes of trajectories, or even chaotic attractors; in other words, to address the issue of how to get a neural network to capture a model of a given dynamic system or Markov process, for example. This is very much the line of reasoning suggested by Freeman's studies and simulations of biological neural systems [Freeman 1991]. Phase portraits made from EEGs generated by computer models reflect the overall activity of the olfactory system of a rabbit at rest and in response to a familiar scent (e.g.
banana). The resemblance of these portraits to irregularly shaped, but still structured, coils of wire reveals that brain activity in both conditions is chaotic but that the response to the known stimulus is more ordered, more nearly periodic during perception, than at rest.

Chapter references
[Aarts 1988] E. Aarts and J. H. M. Korst. Boltzmann machines for travelling salesman problems. European Journal of Operational Research, in press.

[Abraham 1985] R. H. Abraham and C. D. Shaw. Dynamics - The Geometry of Behaviour, Part 2: Chaotic Behaviour. Arial Press, 1985.

[Amit 1987] D. J. Amit, H. Gutfreund, and H. Sompolinsky. Information storage in neural networks with low levels of activity. Physical Review A, 35:2239-2303, 1987.

[Binder 1986] K. Binder and A. P. Young. Spin-Glasses - Experimental Facts, Theoretical Concepts and Open Questions. Reviews of Modern Physics, 58(1):801-976, 1986.

[Burr 19XX] D. J. Burr. An Improved Elastic Net Method for the Traveling Salesman Problem, ???

[Cragg 1954] B. G. Cragg and H. N. V. Temperley. Electroencephalog. Clin. Neurophys. 6:85, 1954.

[Crisanti 1986] A. Crisanti, D. J. Amit, and H. Gutfreund. Europhys. Letters 2:337, 1986.

[Durbin 1987] R. Durbin and D. Willshaw. An analogue approach to the travelling salesman problem using an elastic net method. Nature 326, 16 April 1987.

[Farhat 1985] N. H. Farhat, et al. Optical implementation of the Hopfield model. Applied Optics, 24:1469-1475, 1985.

[Feller 1966] W. Feller. An introduction to probability theory and its applications, Vol. 1. J. Wiley & Sons, New York, 1966.

[Freeman 1991] W. J. Freeman. The physiology of perception. Scientific American, pp 34-41, February 1991.

[Garfinkel 1985] R. S. Garfinkel. Motivation and modelling, in The Travelling Salesman Problem: a guided tour of combinatoric optimisation. Eds. E. L. Lawler, J. K. Lenstra, A. H. G. Rinnooy Kan, and D. B. Shmoys. Wiley, Chichester, 1985.

[Hertz 1991] J. Hertz, A. Krough, and R. G. Palmer. Introduction to the theory of neural computing. Addison-Wesley, 1991. ISBN 0-201-51560-1 (pbk).

[Hopfield 1982] J. J.
Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences 79:2554-2558, 1982.

[Hopfield 1984] J. J. Hopfield. Neurons with graded response have collective computational properties like those of two state neurons. Proceedings of the National Academy of Sciences 81:3088-3092, 1984.

[Hopfield 1986] J. J. Hopfield and D. W. Tank. `Neural' computation of decisions in optimization problems. Biological Cybernetics, 52:141-152, 1986.

[Kamp 1990] Y. Kamp and M. Hasler. Recursive Neural Networks for Associative Memory. John Wiley & Sons, New York, 1990.

[Keeler 1988] J. D. Keeler. Cognit. Sci. 12:299-329, 1988.

[Keeler 1989] J. D. Keeler, E. E. Pichler and J. Ross. Noise in neural networks: Thresholds, hysteresis, and neuromodulation of signal-to-noise. Proc. Nat. Acad. Sci. USA, 86:1712-1716, March 1989.
[Little 1974] W. A. Little. Math. Biosci. 19:101, 1974.

[McEliece 1987] R. McEliece, E. Posner, E. Rodemich, and S. Venkatesh. The capacity of the Hopfield associative memory. IEEE Transactions on Information Theory, IT-33:461-482, 1987.

[Tanaka 1980] F. Tanaka and S. Edwards. Analytic theory of the ground state properties of a spin glass: I. Ising spin glass. J. Phys. F: Metal Phys., 10:2769-2778, 1980.

[Wilson 1988] G. V. Wilson and G. S. Pawley. On the Stability of the Travelling Salesman Problem Algorithm of Hopfield and Tank. Biological Cybernetics 58:63-70, 1988.
IV The WISARD model.

Introduction.

WISARD (WIlkie, Stonham, Aleksander Recognition Device) is an implementation in hardware of the n-tuple sampling technique first described in [Bledsoe 1959]. The scheme outlined in Figure 4-1, Figure 7-2 was first proposed by Aleksander and Stonham in [Aleksander 1979].

Figure 4-1 Schematic of a 3-tuple recogniser.

The sample data to be recognized is stored as a 2-dimensional array (the 'retina') of binary elements. Depending on the nature of the data this can be done in a variety of ways. For visual processing we can simply place a pre-processed version of the image onto the retina. For temporal data in signal processing or speech recognition
successive samples in time can be stored in successive columns, and the value of the sample represented by a coding of the binary elements in each column. The particular coding used is liable to depend on the application. One of several possible codings is to represent a sample feature value by a 'bar' of binary 1's, the length of the bar being proportional to the value of the sample feature.

Wisard model.

Random connections are made onto the elements of the array, n such connections being grouped together to form an n-tuple which is used to address one random access memory (RAM) per discriminator. In this way a large number of RAMs are grouped together to form a class discriminator whose output or score is the sum of all its RAMs' outputs. This configuration is repeated to give one discriminator for each class of pattern to be recognized. The RAMs implement logic functions which are set up during training; thus the method does not involve any direct storage of pattern data.

A random map from array elements to n-tuples is preferable in theory, since a systematic mapping is more likely to render the recogniser blind to distinct patterns having a systematic difference. Hard-wiring a random map in a totally parallel system makes fabrication infeasible at high resolutions. In many applications systematic differences in input patterns of the type liable to pose problems with a non-random mapping are unlikely to occur, since real data tends to be 'fuzzy' at the pixel level. However, the issue of randomly hardwiring individual RAMs is somewhat academic since in most contexts a totally parallel system is not needed, as its speed (independent of the number of classes and of the order of the access time of a memory element) would far exceed data input rates. At 512×512 resolution a semi-parallel structure is used where the mapping is 'soft' (i.e.
achieved by pseudo-random addressing with parallel shift registers) and the processing within discriminators is serial, but the discriminators themselves operate in parallel. Using memory elements with an access time of 10^-7 s, this gives a minimum operating time of around 70 ms, which once again is independent of the number of classes.

The system is trained using samples of patterns from each class. A pattern is stored into the retina array and a logical 1 is written into the RAMs of the discriminator associated with the class of this training pattern at the locations addressed by the n-tuples. This is repeated many times, typically 25-50 times, for each class. In recognition mode, the unknown pattern is stored in the array and the RAMs of every discriminator are put into READ mode. The input pattern then stimulates the logic functions in the discriminator network and an overall response is obtained by summing all the logical outputs. The pattern is then assigned to the class of the discriminator producing the highest score.

Where very high resolution image data is presented, as in visual imaging, this design lends itself to easy implementation in massively parallel hardware. However, even with visual images, experience tends to suggest that often a very good recognition performance can be obtained on relatively low resolution data. Hence in many applications massively parallel hardware can be replaced by a fast serial processor and associated RAM, emulating the design in micro-coded software. This was the approach used by Binstead and Stonham in Optical Character Recognition, with notable success. Such a system has the advantage of being able to make optimal use of available memory in applications where the n-tuple size, or the number of discriminators, may be required to vary.

The advantages of the WISARD model for pattern recognition are:

! Implementation as a parallel, or serial, system in currently available hardware is inexpensive and simple.

!
Given labelled samples of each recognition class, training times are extremely short.

! The time required by a trained system to classify an unknown pattern is very small and, in a parallel implementation, is independent of the number of classes.
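The training and recognition procedure just described can be emulated serially in a few lines. The sketch below (Python) is illustrative only: the retina size, tuple size, classes and noise model are invented for the example, not taken from WISARD itself.

```python
import numpy as np

class Discriminator:
    """One class discriminator: p/n one-bit RAMs, each addressed by an n-tuple."""
    def __init__(self, mapping):
        self.mapping = mapping              # shared random map: retina -> tuples
        self.rams = [set() for _ in range(len(mapping))]

    def addresses(self, retina):
        bits = retina.ravel()
        # each n-tuple of pixel values forms the address of its RAM
        return [int("".join(str(b) for b in bits[t]), 2) for t in self.mapping]

    def train(self, retina):
        for ram, a in zip(self.rams, self.addresses(retina)):
            ram.add(a)                      # write a logical 1 at this address

    def score(self, retina):
        return sum(a in ram for ram, a in zip(self.rams, self.addresses(retina)))

rng = np.random.default_rng(0)
p, n = 64, 4                                # 8x8 retina, 4-tuples
mapping = rng.permutation(p).reshape(p // n, n)

# two toy classes: left half bright vs right half bright, with pixel noise
def sample(cls):
    img = np.zeros((8, 8), dtype=int)
    img[:, :4] = 1 - cls
    img[:, 4:] = cls
    flip = rng.random((8, 8)) < 0.05
    return np.where(flip, 1 - img, img)

discs = [Discriminator(mapping) for _ in range(2)]
for _ in range(30):                         # typically 25-50 training passes
    for c in (0, 1):
        discs[c].train(sample(c))

u = sample(0)                               # an unknown pattern of class 0
scores = [d.score(u) for d in discs]        # assign to the highest-scoring class
```

Note that, exactly as the text says, nothing resembling the training images is stored: each RAM merely records which tuple addresses have been seen for its class.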
The requirement for labelled samples of each class poses particular problems in speech recognition when dealing with smaller units than whole words; the extraction of samples by acoustic and visual inspection is a labour intensive and time consuming activity. It is here that paradigms such as Kohonen's topologising network, as applied to speech by Tattershall, show particular promise. Of course, in such approaches there are other compensating problems; principally, after the network has been trained and produced a dimensionally reduced and feature-clustered map of the pattern space, it is necessary to interpret this map in terms of output symbols useful to higher levels. One approach to this problem is to train an associative memory on the net output together with the associated symbol.

Figure 4-2 Continuous response of discriminators to the input word 'toothache' [From Neural Computing Architectures, Ed. I. Aleksander].

Applications of n-tuple sampling in hardware have been rather sparse, the commercial version of WISARD as a visual pattern recognition device able to operate at TV frame rates being one of the few to date - another is the Optical Character Recogniser developed by Binstead and Stonham. However, one can envisage a multitude of applications for such pattern recognition systems as their operation and advantages become more widely understood.

The total amount of memory needed for the discriminator RAMs is easily calculated. Let n be the number of inputs, C the number of classes, and assume a 1-1 mapping of a retina with p binary pixels. For convenience we assume that n divides p. Then the number of 1-bit RAMs will be p/n and each RAM has 2^n bits of storage. So the number of bits per discriminator is (p/n)·2^n. Hence the total number of bits for C classes is

    (p/n) 2^n C                                                             (21)

Practical n-tuple pattern recognition systems have developed from the original implementation of the hardware WISARD, which used regularly sized blocks of RAM storing only the discriminator states. As memory has become cheaper and processors faster, for many applications such heavily constrained systems are no longer appropriate. Algorithms can be implemented as serial emulations of parallel hardware and RAM can also be used to describe a more flexible structure. Typically the real-time system is preceded by a software simulation in which various parameters of the theoretical model are optimized for the particular application. A design technique which is sufficiently general to cope with a large class of such net-systems whilst at the same time preserving a high degree of computational efficiency is described in [Binstead 1987]. In addition the structure produced has the property that it is easily mapped into hardware to a level determined by the application requirements.

The rationale for believing that n-tuple techniques might be successfully applied to speech recognizers is briefly outlined in [Tattershall 1984], where it is demonstrated that n-tuple recognisers can be designed so that in training they derive an implicit map of the class conditional probabilities. Since the n-tuple scheme requires almost no computation it appears to be an attractive way of implementing a Bayesian classifier. In a real time speech
recognition system the pre-processed input data can be slid across the retina and the system tuned to respond to significant peaking of a class discriminator response, see Figure 4-2.

Work on using WISARD nets for speech recognition was started at the University of Brunel Pattern Recognition Laboratory in 1983; some results of this work are reported in [Aleksander 1988a], Chapter 10. One novel feature of this chapter is an account of the work of Jones and Valenzuela using Holland's Genetic Algorithm to breed WISARD nets for the purpose of vowel detection.

WISARD - analysis of response.

Assume that there are a number of n-tuples whose address lines are randomly and uniformly connected to the retina, and that the total area of the retina is 1. Suppose an n-tuple has been trained on patterns T_1, T_2, ..., T_l. Then we shall evaluate the expected response to an unknown test pattern U. Let

    I_1 = U ∩ T_1,  I_2 = U ∩ T_2,  ...,  I_l = U ∩ T_l
    I_ij = U ∩ T_i ∩ T_j   (for i ≠ j),
    I_ijk = U ∩ T_i ∩ T_j ∩ T_k   (i, j, k pairwise distinct),              (22)
    ...
    I_12...l = U ∩ T_1 ∩ T_2 ∩ ... ∩ T_l

Here I_i = U ∩ T_i is the set of all points on the retina for which the pixels in U and T_i take the same value, either both '1' or both '0'. Let |U ∩ T_i| denote the area of this set intersection. Since the area of the retina is supposed to be 1, it follows that |U ∩ T_i| is the probability that a single address line will give the same result when sampling U as when sampling T_i. Since the n address lines are assumed uniformly distributed across the retina, the probability that all n address lines give the same response when sampling U as when sampling T_i is just |U ∩ T_i|^n. This is the probability that an n-tuple trained only on the pattern T_i will give the same response (i.e. fire) when presented with the unknown pattern U. For convenience, let

    p_i = |I_i|^n,  p_ij = |I_ij|^n,  ...,  p_12...l = |I_12...l|^n         (23)

Suppose the n-tuple has been trained on two patterns T_i and T_j.
Then the probability that an n-tuple trained only on the patterns T_i and T_j will give the same response (i.e. fire) when presented with the unknown pattern U is

    p_i + p_j − p_ij.                                                         (24)

Now suppose the n-tuple has been trained on l patterns T_1, ..., T_l. By a well known combinatorial principle (inclusion-exclusion), it follows that the probability that the n-tuple will give the same response (i.e. fire) when presented with the unknown pattern U is

    p = Σ_i p_i − Σ_{i<j} p_ij + Σ_{i<j<k} p_ijk − ... + (−1)^{l−1} p_12...l.  (25)

In the vernacular of n-tuple sampling theory this is called the nth power-law. Note that if U = T_i, i.e. U is equal to one of the training patterns, then |I_i| = 1 and the remaining terms in the equation above sum to zero, hence p = 1 as we might expect.
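The nth power law for a single training pattern is easy to check by simulation. The sketch below (plain Python; the retina size, tuple count, agreement rate and seed are illustrative assumptions, not values from the text) builds a RAM-style n-tuple discriminator, trains it on one pattern T, and compares its response to U against |U ∩ T|^n:

```python
import random

def make_tuple_map(retina_size, n, n_tuples, rng):
    # each n-tuple samples n retina positions uniformly at random
    return [[rng.randrange(retina_size) for _ in range(n)] for _ in range(n_tuples)]

def address(pattern, taps):
    # the n sampled pixels form the RAM address for this n-tuple
    return tuple(pattern[t] for t in taps)

def train(tuple_map, pattern):
    # each n-tuple stores a 1 at the address produced by the training pattern
    return [{address(pattern, taps)} for taps in tuple_map]

def response(tuple_map, memory, pattern):
    # fraction of n-tuples whose addressed location was set during training
    fired = sum(address(pattern, taps) in mem
                for taps, mem in zip(tuple_map, memory))
    return fired / len(tuple_map)

rng = random.Random(0)
retina, n = 1000, 4
T = [rng.randrange(2) for _ in range(retina)]
U = [b if rng.random() < 0.8 else 1 - b for b in T]  # U agrees with T on ~80% of pixels

tm = make_tuple_map(retina, n, 5000, rng)
mem = train(tm, T)
overlap = sum(u == t for u, t in zip(U, T)) / retina
print(response(tm, mem, U), overlap ** n)  # empirical response ≈ |U ∩ T|^n
```

Presenting the training pattern itself always gives a response of 1, and the response to U follows the overlap raised to the nth power, matching (23).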
For example, suppose a discriminator is required to detect the horizontal position of a vertical bar and report on the distance of the bar from the centre. The task is shown in the figure below. T1, the only training pattern, is seen to be a bar of width 1/3 units, where we take the width and height of the retina to each be one unit. The test pattern U could be a vertical bar of the same width as T1 placed anywhere wholly within the window. The distance of the bar from the left hand edge is D, and the maximum value of D is 2/3. As D increases from 0 to 1/3 the overlap between U and T1 increases linearly, but the response R = |I_1|^n of the system for different n is governed by the nth power law, see Figure 4-3 (A discriminator for centering on a bar [From Neural Computing, I. Aleksander and H. Morton]).

Comparison of storage requirements.

The functionality of the McCulloch and Pitts neuron model was a matter of interest in the mid 1960's and has been discussed by [Muroga 1965]. With n inputs the 0/1 neuron can only perform linearly separable functions, i.e. those that may be achieved by an (n−1)-dimensional hyperplane of the n-hypercube. We have seen that there are about 2^(n²) of these. Typical of the functions that such a device cannot perform are parity checking etc.

Suppose now we regard the n binary inputs as addressing memory in a RAM, as in the WISARD model discussed earlier. There are 2^(2^n) logic functions

    f : {0, 1}^n → {0, 1}                                                     (26)

since there are 2^n possible inputs for each function and the function value at each of these inputs can be 0 or 1.

The functionality of the two models can be compared as follows. With w-bit weights and zero threshold, a McCulloch and Pitts node of this type requires wn bits of memory and can perform at most 2^(wn) of the possible logic functions: a proportion of the total number possible which tends exponentially to zero as n becomes large.
In fact, there is no point in taking w very large for the discrete neuron, since the number of hyperplane dichotomies of the 2^n vertices of the n-hypercube is fixed in terms of n (around 2^(n²)); one needs just enough bits to make sure that all these hyperplane dichotomies are possible, and no more. Although the proportion of possible logic functions which can be implemented by the McCulloch and Pitts neuron is asymptotically zero as n tends to infinity, it is nevertheless this restricted functionality which gives the node the capability of generalisation.

By comparison, a RAM with n address lines and 2^n bits of store can implement any of the possible logic functions on n bits. However, the storage requirement rises exponentially with the number of inputs, and hence so does the storage of assemblies of RAMs with the same level of interconnectivity. This is not the case with the McCulloch and Pitts model, where storage increases linearly with n. In any event, detailed discussion of the relative merits of different neural components frequently overlooks the fact that what is perhaps more relevant is the relative functionality
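The crossover between the two counts can be made concrete in a few lines of Python (the weight precision w = 8 bits is an arbitrary illustrative choice, not a figure from the notes):

```python
# Compare the 2^(2^n) Boolean functions a RAM node can realise with the
# at most 2^(wn) functions available to a threshold node with w-bit weights.
w = 8  # assumed weight precision (bits), for illustration only
for n in (3, 4, 5, 6):
    ram_functions = 2 ** (2 ** n)  # a RAM node realises every truth table on n inputs
    neuron_bound = 2 ** (w * n)    # a threshold node: at most one function per weight vector
    # compare exponents: wn grows linearly, 2^n exponentially
    print(n, w * n, 2 ** n, neuron_bound < ram_functions)
```

For small n the weight-vector bound 2^(wn) exceeds 2^(2^n), but once 2^n overtakes wn the RAM count dominates, and the threshold node's share of the possible functions collapses exponentially, as stated in the text.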
of large assemblies of such components. In applications where the interconnectivity is low the idea of using a RAM offers obvious attractions. Two problems present themselves. Firstly, what training algorithm should be used for assemblies of RAMs? Secondly, how will such a system be capable of generalisation?

Subsequent work by Igor Aleksander's group at Imperial has resulted in a model known as the Probabilistic Logic Node (PLN); such nodes are cascaded into a pyramidal structure and combined with a simple multilayer learning algorithm. The PLN is a connectionist model introduced in [Aleksander 1988b] which is implementable as a RAM. Binary inputs address a memory location; the node outputs the bit value stored there. The training algorithm sets location contents according to a global error/correct signal; PLNs require no individual error information. A PLN can learn any of the 2^(2^n) Boolean functions of its n inputs. By this definition, it is straightforward to implement the PLN as a RAM with 2^n addressable memory locations. In practice the output generally passes through a stochastic device before leaving the PLN. This randomizer allows the PLN to exhibit non-deterministic properties; organic neurons also behave in a stochastic manner [Sejnowski 1981].

Chapter references

[Aleksander 1979] I. Aleksander and T. J. Stonham. A Guide to Pattern Recognition Using Random-Access Memories. IEE Journal Computers and Digital Techniques, Vol. 2 (1), 29-40, 1979.

[Aleksander 1988a] A. Badii, M. J. Binstead, Antonia J. Jones, T. J. Stonham and Christine L. Valenzuela. Applications of N-tuple Sampling and Genetic Algorithms to Speech Recognition. Neural Computing Architectures, Chapter 10. Ed. I. Aleksander, Kogan Page, October 1988.

[Aleksander 1988b] I. Aleksander. Logical connectionist systems. In R. Eckmiller and Ch. v. d. Malsburg (Eds.), Neural Computers (pp. 189-197). Springer-Verlag, Berlin, 1988.

[Bernstein 1981] J.
Bernstein. Profiles: AI, Marvin Minsky. The New Yorker, December 14, 1981, pp 50-126.

[Binstead 1987] M. J. Binstead and Antonia J. Jones. A Design Technique for Dynamically Evolving N-tuple Nets. IEE Proceedings, Vol. 134 Part E, No. 6, pp 265-269, November 1987.

[Bledsoe 1959] W. W. Bledsoe and I. Browning. Pattern Recognition and Reading by Machine. Proc. Eastern Joint Computer Conf., Boston, Mass., 1959.
V Feedforward networks and backpropagation.

Introduction.

In two papers describing the same model [Rumelhart 1986a], [Rumelhart 1986b] Rumelhart, Hinton and Williams introduced a generalization of the Widrow-Hoff error correction rule called back propagation. The algorithm was first described by Paul J. Werbos in his Harvard Ph.D. thesis [Werbos 1974] and was independently rediscovered by [Parker 1985] and [Le Cun 1986]. The model assumes a discrete time system with synchronous update and with each connection involving a unit delay.

Simple two-layer associative networks have no hidden units; they involve only input and output units. In these cases there is no internal representation. As Minsky and Papert pointed out, we need hidden units to provide the possibility of recoding the input pattern into an internal representation, because many problems simply cannot otherwise be solved. An example is the XOR problem mentioned earlier. Here the addition of a unit which detects the logical conjunction of the inputs changes the similarity structure of the patterns sufficiently to allow the solution to be learned, see Figure 5-1 (Solving the XOR problem with a hidden unit).

The problem addressed by simulated annealing and backpropagation is the provision of a locally computed learning rule which guarantees that an internal representation adequate to solve the problem will be found. Backpropagation is based on local gradient descent, and output functions which step between 0 and 1 as the activation of the neuron increases through a threshold value are not differentiable and therefore provide no useful gradient in weight space along which we can descend. We therefore consider semilinear activation functions. A semilinear activation function f_i(net_i) is one in which the output of the unit is a non-decreasing and differentiable function of the net input to the unit.
In most cases f_i is independent of i and so we write f = f_i.

Backpropagation - mathematical background.

The following functions, and hence all their partial derivatives, are assumed known.

Error function.

    E(z_1, z_2, ..., z_n, t_1, t_2, ..., t_n)                                 (1)

Here z_1, z_2, ..., z_n are the outputs from the output layer units and t_1, t_2, ..., t_n are the target outputs.
Figure 5-2 Feedforward network architecture.

Activation function.

Output layer:

    net_j = net_j(y_1, y_2, ..., y_m, p_j1, ..., p_jt)                        (2)

Here y_1, y_2, ..., y_m are the outputs from the previous layer and p_j1, ..., p_jt are parameters associated with the jth node of the output layer. Frequently t = t(m), i.e. the number of parameters associated with a node is a function of the number of inputs.

Previous layer:

    net_i = net_i(x_1, x_2, ..., x_l, p_i1, ..., p_is)                        (3)

Here x_1, x_2, ..., x_l are the outputs from the layer prior to the previous layer.

Output function.

Output layer:

    z_j = f(net_j)   (1 ≤ j ≤ n)                                              (4)

Previous layer:

    y_i = f(net_i)   (1 ≤ i ≤ m)                                              (5)

The output layer calculation.

For the output layer we have

    Δp_jz = −η ∂E/∂p_jz                                                       (6)

where 1 ≤ z ≤ t, 1 ≤ j ≤ n, and η is the learning rate. Hence

    Δp_jz = −η (∂E/∂net_j)(∂net_j/∂p_jz) = η δ_j ∂net_j/∂p_jz                 (7)
where

    δ_j = −∂E/∂net_j = −(∂E/∂z_j)(∂z_j/∂net_j) = −f′(net_j) ∂E/∂z_j           (8)

Equations (7) and (8) express the Δp_jz in terms of known quantities.

The rule for adjusting weights in hidden layers.

In a similar way we can compute a rule which adjusts the weights in the previous layers (Figure 5-3, the previous layer calculation). We shall not go through the details of the derivation (which are quite straightforward), but the rule which emerges is, for node i in this previous layer,

    Δp_iz = η δ_i ∂net_i/∂p_iz                                                (9)

where

    δ_i = −f′(net_i) ∂E/∂y_i = f′(net_i) Σ_{j=1}^{n} δ_j ∂net_j/∂y_i          (10)

In (9) the partial derivative ∂net_i/∂p_iz is known from (3). In (10) f′(net_i) is known from (5), the δ_j were computed in the previous step, and the last term is known from (2).

The conventional model.

The usual sum squared error is given by

    E(z_1, ..., z_n, t_1, ..., t_n) = (1/2) Σ_{j=1}^{n} (z_j − t_j)²          (11)

Hence, for 1 ≤ j ≤ n,

    ∂E/∂z_j = z_j − t_j                                                       (12)
The linear activation function becomes

    net_j = Σ_{i=1}^{m} w_ji y_i,      ∂net_j/∂w_ji = y_i
                                                                              (13)
    net_i = Σ_{h=1}^{l} w_ih x_h,      ∂net_i/∂w_ih = x_h

where t = m and s = l. Thus (7) becomes

    Δw_jz = η δ_j y_z                                                         (14)

and, using (12), (8) becomes

    δ_j = −f′(net_j) ∂E/∂z_j = −f′(net_j)(z_j − t_j)                          (15)

for an output layer unit. Similarly (9) becomes

    Δw_iz = η δ_i ∂net_i/∂w_iz = η δ_i x_z                                    (16)

where, from (10),

    δ_i = f′(net_i) Σ_{j=1}^{n} δ_j ∂net_j/∂y_i = f′(net_i) Σ_{j=1}^{n} δ_j w_ji   (17)

for a hidden layer unit.

For this example of a linear activation function any sigmoidal function f is suitable. Frequently the function

    z_j = f(net_j) = 1 / (1 + e^{−(net_j + θ_j)})                             (18)

is used. Here θ_j is the threshold, or bias, of the unit. Conventionally the threshold is treated as just another weight by creating a dummy unit which is always on (somewhat like ground in an electrical circuit). However, from the present viewpoint the threshold can be considered as just another parameter associated with the unit, in which case there is no need for the dummy unit.

Thus the implementation of back propagation involves a forward pass through the layers to estimate the error, and then a backward pass modifying the synapses to decrease the error. Practical implementations are not difficult, but without modification the algorithm is still rather slow, especially for systems with many layers. Still, it is at present the most popular learning algorithm for multilayer networks. It is considerably faster than the Boltzmann machine. The whole process is illustrated in the Mathematica file backprop.ma. The Mathematica program runs too slowly to be of much practical use and is intended only to illustrate the process. A C-code implementation will run much faster and is available in various implementations.

Problems with backpropagation.
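Equations (13)-(18) translate almost line for line into code. The sketch below (plain Python rather than the course's Mathematica; the layer sizes, learning rate, epoch count and seed are illustrative assumptions) trains a small one-hidden-layer network on XOR, using the sigmoid of (18) with the threshold kept as an extra per-unit parameter:

```python
import math, random

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

rng = random.Random(1)
n_in, n_hid, n_out = 2, 3, 1
eta = 0.5  # learning rate (assumed value)

# w[i][h]: input->hidden weights, v[j][i]: hidden->output weights,
# theta_*: per-unit thresholds as in (18)
w = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hid)]
theta_h = [rng.uniform(-1, 1) for _ in range(n_hid)]
v = [[rng.uniform(-1, 1) for _ in range(n_hid)] for _ in range(n_out)]
theta_o = [rng.uniform(-1, 1) for _ in range(n_out)]

data = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]

def forward(x):
    y = [sigmoid(sum(w[i][h] * x[h] for h in range(n_in)) + theta_h[i])
         for i in range(n_hid)]
    z = [sigmoid(sum(v[j][i] * y[i] for i in range(n_hid)) + theta_o[j])
         for j in range(n_out)]
    return y, z

for epoch in range(20000):
    for x, t in data:
        y, z = forward(x)
        # output deltas, eq (15): delta_j = -f'(net_j)(z_j - t_j), f' = z(1-z)
        dz = [z[j] * (1 - z[j]) * (t[j] - z[j]) for j in range(n_out)]
        # hidden deltas, eq (17): delta_i = f'(net_i) sum_j delta_j v_ji
        dy = [y[i] * (1 - y[i]) * sum(dz[j] * v[j][i] for j in range(n_out))
              for i in range(n_hid)]
        # weight updates, eqs (14) and (16); thresholds updated as extra parameters
        for j in range(n_out):
            for i in range(n_hid):
                v[j][i] += eta * dz[j] * y[i]
            theta_o[j] += eta * dz[j]
        for i in range(n_hid):
            for h in range(n_in):
                w[i][h] += eta * dy[i] * x[h]
            theta_h[i] += eta * dy[i]

err = sum((forward(x)[1][0] - t[0]) ** 2 for x, t in data)
print(err)  # sum squared error after training
```

The forward pass computes (13) and (18); the backward pass applies (14)-(17). As the text notes, convergence depends on the initial weights and learning rate, and unmodified backpropagation can be slow.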
Backpropagation is a very powerful tool for constructing non-linear models. However, a number of problems arise when one tries to use backpropagation in practice. These are:

!   In many cases, given a particular training set, the minimum mean-squared error which one can expect on each output is unknown. Overtraining will result in memorisation, which will reduce the generalisation capability. Dealing with this problem can be very time consuming and involve trial and error.

!   The optimal architecture for the network is unknown: too many units and the system will generalise poorly whilst taking a very long time to train; too few units and the system will not reach the optimal (unknown) mean-squared error. Dealing with this problem can be very time consuming and involve trial and error.

!   Optimal values for the learning rate η > 0 (and the momentum term α > 0, if used) are unknown. Dealing with this problem can be very time consuming and involve trial and error.

One way of adjusting η > 0 dynamically as learning progresses is called the Bold Driver method. This adjusts the learning rate according to the previous value of the error function: if the value has gone up, the learning rate is reduced proportionately; similarly, the learning rate is increased if the value has gone down. Until recently a successful application of backpropagation to a large problem was to some extent a matter of patience and luck.

The Gamma test - a new technique.

The Gamma test was developed by Aðalbjörn Stefánsson, N. Končar [Končar 1997], [Aðalbjörn Stefánsson 1997] and myself, and has emerged as an extremely useful tool to overcome the kinds of problems mentioned above. It is a very simple technique which in many cases can be used to considerably simplify the design process of constructing a smooth data model such as a neural network.
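The Bold Driver rule can be sketched in a few lines (the growth and decay factors below are common illustrative choices, not values given in the notes):

```python
def bold_driver(eta, prev_error, new_error, grow=1.05, shrink=0.5):
    """Adjust the learning rate from the change in the error function:
    shrink it when the error rose, grow it when the error fell."""
    return eta * shrink if new_error > prev_error else eta * grow

eta = 0.1
errors = [1.00, 0.80, 0.85, 0.60]  # hypothetical error values, one per epoch
for prev, new in zip(errors, errors[1:]):
    eta = bold_driver(eta, prev, new)
    print(round(eta, 4))
```

A small growth factor with a large shrink factor is the usual asymmetry: cautious acceleration while the error is falling, sharp braking after an overshoot.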
The Gamma test is a data analysis routine that (in an optimal implementation) runs in time O(M log M) as M → ∞, where M is the number of sample data points, and which aims to estimate the best mean squared error (MSError) that can be achieved by any continuous or smooth (bounded first partial derivatives) data model constructed using the data. A proof of the result under fairly general hypotheses was finally given in [Evans 2002a] and [Evans 2002b].

Let a data sample be represented by

    ((x_1, ..., x_m), y) = (x, y)                                             (19)

in which we think of the vector x = (x_1, ..., x_m) as the input, confined to a closed bounded set C ⊂ ℝ^m, and the scalar y as the output. In the interests of simplicity the following explanation is presented for a single scalar output y, but the same algorithm can be applied to the situation where y is a vector with very little extra complication or time penalty.

The Gamma test is designed to give a data-derived estimate for Var(r). We focus on the case where samples are generated by a suitably smooth function (bounded first and second order partial derivatives) f : C → ℝ and

    y = f(x_1, ..., x_m) + r                                                  (20)

where r represents an indeterminable part, which may be due to real noise or might be due to lack of functional determination in the posited input/output relationship, i.e. an element of `one -> many-ness' present in the data. We make the following assumption.

!   Assumption A. We assume that training and testing data are different sample sets in which: (a) the training set inputs are non-sparse in input-space; (b) each output is determined from the
inputs by a deterministic process which is the same for both training and test sets; (c) each output is subjected to statistical noise with finite variance whose distribution may be different for different outputs but which is the same in both training and test sets for corresponding outputs.

Suppose (x, y) is a data sample. Let (x′, y′) be a data sample such that |x′ − x| > 0 is minimal. Here |·| denotes Euclidean distance and the minimum is taken over the set of all sample points different from (x, y). Thus x′ is the nearest neighbour to x (in any ambiguous case we just pick one of the several equidistant points arbitrarily).

The Gamma test (or near neighbour technique) is based on the statistic

    γ = (1/2M) Σ_{i=1}^{M} (y′(i) − y(i))²                                    (21)

where y′(i) is the y value corresponding to the first near neighbour of x(i). It can be shown that γ → Var(r) in probability as the nearest neighbour distances approach zero. In a finite data set we cannot have nearest neighbour distances arbitrarily small, so the Gamma test is designed to estimate this limit by means of a linear correlation.

Given data samples (x(i), y(i)), where x(i) = (x_1(i), ..., x_m(i)), 1 ≤ i ≤ M, let N[i, p] be the list of (equidistant) pth nearest neighbours to x(i). We write

    δ(p) = (1/M) Σ_{i=1}^{M} (1/L(N[i, p])) Σ_{j ∈ N[i, p]} |x(j) − x(i)|²    (22)

where L(N[i, p]) is the length of the list N[i, p]. Thus δ(p) is the mean square distance to the pth nearest neighbour. Nearest neighbour lists for pth nearest neighbours (1 ≤ p ≤ pmax), where typically we take pmax in the range 20-50, can be found in O(M log M) time using techniques developed by Bentley, see for example [Friedman 1977]. We also write

    γ(p) = (1/2M) Σ_{i=1}^{M} (1/L(N[i, p])) Σ_{j ∈ N[i, p]} (y(j) − y(i))²   (23)

where the y observations are subject to statistical noise assumed independent of x and having bounded variance.⁶ Under reasonable conditions one can show that

    γ(p) ≈ Var(r) + A δ(p) + o(δ(p))   as M → ∞                               (24)

where the convergence is in probability. The Gamma test computes the mean-squared pth nearest neighbour distances δ(p) (1 ≤ p ≤ pmax, typically pmax ≈ 10) and the corresponding γ(p). Next the (δ(p), γ(p)) regression line is computed and the vertical intercept Γ̄ is returned as the gamma value. Effectively this is the limit of γ as δ → 0, which in theory is Var(r).

⁶ The original version in [Aðalbjörn Stefánsson 1997] and [Končar 1997] used smoothed versions of δ(p) and γ(p) which rolled off the significance of more distant near neighbours. Later experience showed that this complication was largely unnecessary and the version of the software used here is implemented as described above.
Procedure Gamma (or Near Neighbour Test) (data)
(* data is an array of points (x(i), y(i)), 1 ≤ i ≤ M, in which x is a real vector of dimension m and y is a real scalar *)

    For i = 1 to M
        (* compute the x-nearest neighbour lists for each data point; this can be
           done in O(M log M) time using a kd-tree, for example *)
        For p = 1 to pmax
            AppendTo[N(i, p)] all the elements t where x(t) is a pth nearest neighbour to x(i)
        endfor p
    endfor i
    For p = 1 to pmax
        compute δ(p) as in (22)
        compute γ(p) as in (23)
    endfor p
    Perform a least squares fit on the coordinates (δ(p), γ(p)) (1 ≤ p ≤ pmax), obtaining (say) y = Ax + Γ̄
    Return (Γ̄, A)

Algorithm 5-1 The Gamma test.

A technique which allows one to estimate Var(r), on the hypothesis of an underlying continuous or smooth model f, is of considerable practical utility in applications such as control or time series modelling. The implication of being able to estimate Var(r) in neural network modelling is that one does not need to train the network (or indeed any smooth data model) in order to predict the best possible performance with reasonable accuracy. We have used the Gamma test:

!   To find the minimal number of data samples required to produce a near optimal model. We do this by computing Γ̄ for increasing M. The graph asymptotes to the true value of Var(r). When M is sufficiently large to ensure that Γ̄ has stabilised close to the asymptote there is little advantage to be gained by increasing M.

!   To automatically and rapidly construct a near minimal neural network architecture and weights which best model the data [Končar 1997] (the near neighbour information can be used to speed up training by picking the data points with worst errors and performing backpropagation on a suitable subset of the weights). The Gamma test provides the criterion for ceasing training.
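A minimal implementation of Algorithm 5-1 is sketched below. It uses a brute-force O(M²) neighbour search rather than a kd-tree, ignores the equidistant-neighbour subtlety (harmless for continuous data), and tests itself on a hypothetical noisy function whose true noise variance 0.01 is an assumption of the example, not a figure from the notes:

```python
import math, random

def gamma_test(X, Y, pmax=10):
    """Return (gamma_bar, A): intercept and slope of the least squares
    regression of gamma(p) on delta(p), eqs (22)-(24)."""
    M = len(X)
    # brute-force sorted squared distances to every other point
    dists = [sorted((sum((xa - xb) ** 2 for xa, xb in zip(X[i], X[j])), j)
                    for j in range(M) if j != i)
             for i in range(M)]
    deltas, gammas = [], []
    for p in range(1, pmax + 1):
        d2 = g2 = 0.0
        for i in range(M):
            dist2, j = dists[i][p - 1]   # pth nearest neighbour of x(i)
            d2 += dist2                  # |x(N[i,p]) - x(i)|^2
            g2 += (Y[j] - Y[i]) ** 2
        deltas.append(d2 / M)            # eq (22)
        gammas.append(g2 / (2 * M))      # eq (23)
    # least squares fit gamma = A * delta + gamma_bar
    n = len(deltas)
    mx, my = sum(deltas) / n, sum(gammas) / n
    A = (sum((d - mx) * (g - my) for d, g in zip(deltas, gammas)) /
         sum((d - mx) ** 2 for d in deltas))
    return my - A * mx, A

rng = random.Random(0)
X = [(rng.uniform(0, 1), rng.uniform(0, 1)) for _ in range(1000)]
Y = [math.sin(3 * x1) + x2 + rng.gauss(0, 0.1) for x1, x2 in X]  # Var(r) = 0.01
gbar, A = gamma_test(X, Y)
print(gbar, A)  # gbar should be close to the noise variance 0.01
```

No model is trained, yet the intercept already estimates the best MSError any smooth model of these inputs could achieve, which is precisely the practical point made in the text.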
For estimating the number of `hills' required in an initial approximation for the network architecture we use the heuristically based formula

    h = am²,  where a ≈ 2A/π.                                                 (25)

!   To determine the best embedding dimension and delay time for time series [Masayuki 1997].

!   To determine the best set of inputs from a list of possible inputs for a neuro-controller [Končar 1997].

The last two applications illustrate what is perhaps the main utility of the Gamma test: on data sets which are not excessively large the test is sufficiently fast to be run on a complete examination of all possible subsets of up to 20 inputs (for a larger number of inputs we use a Genetic Algorithm for which the fitness of a selection of inputs is
based on how small the Γ̄ value is for the selection).

Metabackpropagation.

This is an algorithm developed by N. Končar and myself which overcomes many of the disadvantages of simple backpropagation. It is pivotally dependent on the Gamma test.

Procedure Metabackpropagation

1. Use the Gamma test to determine the optimal number of data vectors (input-output pairs) (M), the selection and number of inputs (m), the mean squared error (MSError) to which the network should be trained (Γ̄), and the first approximation to the neural architecture (using A).

2. Create a feedforward network that has the number of hills (h) specified by A. A single hill is made of two fully connected hidden layers with 2m nodes in the first hidden layer and one node in the second hidden layer.

3. Initialise each hill by doing a small number of backpropagation training cycles on subsets of the data. Subsets are chosen from the near neighbour lists generated during the computation of the Gamma test results. To find these subsets, the neural network weights are first randomised and the points which give the largest errors are identified by feeding every input vector through the network. The near neighbour lists of the points which give the largest errors are the subsets that are chosen. Each hill is trained on its own exclusive subset. This will initialise the weights of each hill so that it is positioned in the right place.

4. Perform backpropagation training on the entire neural network architecture until either a specified number of cycles is exceeded or the target MSError is reached. Adjust the learning rate by the Bold Driver method.

5. Increase the number of hills if the Gamma test MSError is not reached, and then go back to 2.

Algorithm 5-2 Metabackpropagation.
We have found that the application of the Gamma test and Metabackpropagation has transformed the production of a nonlinear model using feedforward networks from a black art to an almost fully automated and fast process.

Neural networks for adaptive control.

To illustrate our discussion of adaptive control we shall use a simple example. There are two tanks of water at temperatures Tcold < Thot respectively. The tanks are drained at rates c1(t) and c2(t) ltrs/sec into a third tank of maximum volume Vmax (not used). The third tank drains at an (initially) constant rate r. The parameters c1 and c2 are considered as control variables, and it is desired to determine a control strategy for c1 and c2 which will maintain a volume Vgoal at temperature Tgoal, where Tcold < Tgoal < Thot, in the third tank.
Figure 5-4 The Water Tank Problem.

Let (V(t), T(t)) denote the volume and temperature of the target tank. The differential equations describing the system are

    V dT/dt + T dV/dt = c1(t) Tcold + c2(t) Thot − rT
                                                                              (26)
    dV/dt = c1(t) + c2(t) − r

where c1 and c2 are the cold and hot valve settings and r is the rate of drain from the target tank.

Physical Assumptions.

1. We assume that as water drains into the third tank mixing is instantaneous.

2. We assume that there is no heat loss from the third tank, apart from that lost due to the outflow.

3. We have assumed that the two feeder tanks contain water with a specific heat of unity. It is a simple matter to modify the equations to deal with the case of two inert liquids of specific heats s1 and s2 respectively. If an endo- or exo-thermic reaction results from the mixing then this could also be taken account of in the model.

We now proceed to construct an adaptive neurocontroller for the Water Tank Problem. The basic architecture we shall use was developed by N. Končar and myself and applied to the Attitude Control Problem [Končar 1995]. It is described in Figure 5-5. The components of this system are:

The Planner: Knowing the long term goal state (which could dynamically change) and the current state, together with maximal variations in one time step, the Planner sets the next desired state. A simple Linear Planner proceeds by taking a small segment of the straight line in state space between the current state and the goal. A wide range of other variations are also possible.

The Δ Mapping: This transforms the next desired state and the current state into an efficient representation of the difference between the two, providing inputs for the neural network.

The Neural Network: Using the outputs from the Δ mapping, the neural network generates a suitable set of control signals for application at the current time step.
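Equations (26) are straightforward to integrate numerically. The forward-Euler sketch below (all tank parameters, initial conditions and valve settings are illustrative assumptions, not values from the notes) shows that with inflow matching the drain the volume holds steady while the temperature relaxes towards the flow-weighted mean of Tcold and Thot:

```python
T_cold, T_hot, r = 20.0, 80.0, 1.0  # assumed temperatures (deg C) and drain rate (ltrs/sec)

def step(V, T, c1, c2, dt=0.01):
    """One forward-Euler step of equations (26)."""
    dV = c1 + c2 - r
    # V dT/dt + T dV/dt = c1*T_cold + c2*T_hot - r*T, so substituting dV/dt:
    dT = (c1 * (T_cold - T) + c2 * (T_hot - T)) / V
    return V + dV * dt, T + dT * dt

V, T = 100.0, 30.0
c1, c2 = 0.5, 0.5  # equal hot and cold inflow, together matching the drain r
for _ in range(100000):  # 1000 seconds of simulated time
    V, T = step(V, T, c1, c2)
print(V, T)  # V stays at 100; T approaches (c1*T_cold + c2*T_hot)/(c1 + c2) = 50
```

Note that the drain rate r cancels out of the temperature equation: only the valve settings move T, which is what makes (c1, c2) a sensible pair of control variables for Tgoal.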
Figure 5-5 Architecture for direct inverse neurocontrol. The Planner maps the long term goal state and the current state x(k) to the desired next state xG(k+1); the Δ input map converts (desired next state, current state) into the controller inputs; the running NN controller outputs the control u(k) applied to the dynamic system, which produces the actual next state x(k+1). Training data come from observed facts: if the desired next state were x(k+1) and the current state is x(k), then the correct control input would be u(k), so the pair ((x(k+1), x(k)), (u(k))) is added to a circular buffer of the most recent (input, output) vector pairs. Training consists of performing backpropagation on the entire contents of this circular buffer.

The model is adaptive because whenever the network fails to place the next state within a prescribed error tolerance of the next desired state, the ((state, next state), (controls)) data are added to the training buffer and the network undergoes a further round of backpropagation training (it is assumed that this is done in hardware - the networks under discussion usually have only a small number of nodes). The main problem in specific applications centres around two issues:

!   What Δ input mapping is optimal for the specific application? This turns out to be quite critical. With the wrong mapping the controller may not work efficiently (or even at all). Ideally we seek to reduce equivalence classes of control inputs to a single representative, thus reducing the learning demands on the neural network.

!   What is the optimal architecture for the neural network?

In fact both questions can be addressed by a combination of Metabackpropagation and the Gamma test. The difference between the desired state (Vdes, Tdes) and the actual state (V, T) can be measured in various ways.
For example, with a time step between changes in control signals of delta = 1 in the simulator, if we just take all four of the scalars as inputs then we obtain a Γ̄ of around 0.08 on each control output. Since √0.08 ≈ 0.28, this implies an absolute error on the control outputs of the order of 28%, which is unlikely to be effective.

If we return to the original equations for the Water Tank we can determine that if at two states (V1, T1), (V2, T2) we apply the same control inputs (c1, c2) then (dV/dt, dT/dt) will be the same at both states provided

    ((c1 Tcold + c2 Thot) / (c1 + c2)) (V2 − V1) = V2 T1 − V1 T2              (27)
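Condition (27) can be verified numerically against the dynamics of (26); the particular temperatures, controls and states below are arbitrary test values chosen for illustration:

```python
T_cold, T_hot, r = 20.0, 80.0, 1.0  # assumed plant constants
c1, c2 = 0.6, 0.4                   # one fixed control setting

def dT_dt(V, T):
    # from (26): V dT/dt = c1*T_cold + c2*T_hot - r*T - T*(c1 + c2 - r)
    return (c1 * (T_cold - T) + c2 * (T_hot - T)) / V

# pick (V1, T1) and V2, then choose T2 so that condition (27) holds
V1, T1, V2 = 100.0, 40.0, 150.0
S = (c1 * T_cold + c2 * T_hot) / (c1 + c2)  # flow-weighted inflow temperature
T2 = (V2 * T1 - S * (V2 - V1)) / V1         # solve (27) for T2
print(dT_dt(V1, T1), dT_dt(V2, T2))         # equal rates: control equivalent states
```

(dV/dt depends only on the controls and r, so it is automatically equal at the two states; (27) is exactly the extra condition needed to equalise dT/dt.)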
Thus, under these circumstances, the two states (V1, T1), (V2, T2) are control equivalent, i.e. the application of (c1, c2) in either of these states will (instantaneously) produce the same effect. This suggests the input mapping described in Figure 5-6 (Delta transformation of state differences: Δ maps (Vdes, Tdes, V, T) to the inputs VdesT − VTdes and Vdes − V of the 2-4-2 network for the Water Tank Problem).

If we perform the Gamma test using these inputs we obtain (for c1, and similarly for c2) the regression line illustrated in Figure 5-7 (least squares fit to 200 data points with 20 nearest neighbours: Γ̄ = 0.0332). This may well not be the ideal mapping, but we have some grounds for preferring it over the naive choice. (The Attitude Control problem similarly involves some mathematical ingenuity in the construction of the Δ mapping.)

If we take the average for the pth (2 ≤ p ≤ 5) nearest neighbours then we estimate Γ̄ = lim γ as δ → 0 as 0.027, leading to an absolute error on the outputs of the order of √0.027 ≈ 16.4%. This represents the best MSError that a feedforward neural network trained on the data can achieve.

Although the question of building an adaptive Δ mapping is obviously crucial, if we have some mathematical understanding of the system we are seeking to control then the Gamma test is obviously very time saving.

Figure 5-8 - Figure 5-10 show the results of a quick trial of the neurocontroller (illustrated in the Mathematica file nn-tank.ma). These results for the Water Tank Problem are preliminary examples and do not represent a final design. We have not made the training adaptive in the example file. The training was done prior to the simulation on 200 data points uniformly distributed in state-difference space.
Performance for small state differences (effectively performance near the goal state) could easily be improved by increasing the density of training data near the origin of state-difference space. It might also be improved by modifications of the Planner module.
Figure 5-8 Volume variation without adaptive training. 2-4-2 Network, MSE = 0.052, Linear Planner.

Figure 5-9 Temperature variation without adaptive training. 2-4-2 Network, MSE = 0.052, Linear Planner.

Figure 5-10 The control signals generated by the 2-4-2 network without adaptive training. Linear Planner.
Chapter references

[Evans 2002a] D. Evans and Antonia J. Jones. A proof of the Gamma test. Proc. Roy. Soc. Lond. Series A 458(2027):2759-2799, 2002.

[Evans 2002b] D. Evans, Antonia J. Jones and W. M. Schmidt. Asymptotic moments of near neighbour distance distributions. Proc. Roy. Soc. Lond. Series A 458(2028):2839-2849, 2002.

[Friedman 1977] J. H. Friedman, J. L. Bentley and R. A. Finkel. An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software, 3(3):200-226, 1977.

[Jones 2002] Antonia J. Jones, A. P. M. Tsui and Ana G. Oliveira. Neural models of arbitrary chaotic systems: construction and the role of time delayed feedback in control and synchronization. With html and pdf electronic supplement. Complexity International, Volume 09, 2002. ISSN 1320-0682. Paper ID: tsui01, URL: http://www.csu.edu.au/ci/vol09/tsui01/

[Končar 1997] N. Končar. Optimisation methodologies for direct inverse neurocontrol. Forthcoming Ph.D. thesis, Department of Computing, 180 Queen's Gate, London, SW7 2BZ, U.K.

[Končar 1995] N. Končar and Antonia J. Jones. Adaptive real-time neural network attitude control of chaotic satellite motion. Presented at Aerospace/Defense Sensing & Control and Dual-Use Photonics, SPIE (The International Society for Optical Engineering and Photonics in Aerospace Engineering) International Symposium, Orlando, Florida, April 17-21, 1995.

[Rumelhart 1986a] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. Chapter 8, Parallel Distributed Processing, Vol. 1, M.I.T. Press.

[Rumelhart 1986b] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature 323:533-536.

[Aðalbjörn Stefánsson 1997] Aðalbjörn Stefánsson, N. Končar and Antonia J. Jones. A note on the Gamma test. Neural Computing & Applications 5(3):131-133, 1997. ISSN 0-941-0643.
[Preprint] [Tsui 2002] Alban P. Tsui, Antonia J. Jones and Ana Guedes de Oliveira. The construction of smooth models using irregular embeddings determined by a Gamma test analysis. Neural Computing & Applications 10(4), 318-329, April 2002. ISSN 0-941-0643. [Werbos 1974] P. J. Werbos, Beyond regression: new tools for prediction and analysis in the behavioral sciences. Ph.D. thesis, Harvard University, Cambridge, MA. 58
* VI The chaotic frontier.

Introduction.

It is interesting to observe from a modern standpoint that Newtonian physics already contained the seeds of its own destruction. Quite apart from the later quantum mechanical caveat of the Heisenberg Uncertainty Principle, the classicists overlooked the 'computational cost' of making deterministic predictions an indeterminate time into the future. The fact is that for many deterministic systems, the computational cost of making an accurate prediction a substantial time into the future becomes prohibitive. Thus the classical view, that if a system is deterministic then its future behaviour can be predicted for all time, contains a basic flaw. This flaw becomes particularly apparent when we consider chaotic systems. Chaos is the word we use to describe deterministic behaviour whose long term evolution cannot, in view of this computational cost, be accurately predicted even if the initial conditions were known to an arbitrary degree of precision. This is certainly the case with many natural systems, for which in any case we cannot know the initial conditions to an arbitrary degree of precision. A classic example, first considered by E. N. Lorenz, is the weather [Lorenz 1963].

About a century ago, Henri Poincaré observed that the motion of three bodies under gravity can be extremely complicated. His discovery was the first mathematical evidence of chaos. Since that time there have been many observations of chaos both in mathematical models and natural systems. For many years chaos observed through the study of nonlinear dynamic systems was avoided, due to its complexity. In practice such behaviour was often totally ignored, being interpreted as either completely unpredictable or ascribed to statistical noise.
The theory of nonlinear dynamics founded by Poincaré describes and classifies the behaviour of complex dynamical systems and the manner in which they evolve through time. Such systems were extraordinarily difficult to study. The situation changed dramatically with the invention of the modern computer. Scientists, especially mathematicians and physicists, who had previously encountered chaos could pursue a more systematic study of the phenomenon using the new tool.

The crucial importance of chaos is that it provides an alternative explanation for apparent randomness, one that depends on neither noise nor complexity. Chaotic behaviour appears in systems that are essentially free from noise and are also relatively simple, often with only a few degrees of freedom. Many natural systems exhibit chaos. In the early years of the study, it was common to assume that such behaviour is unpredictable^7 and therefore uncontrollable. Since the late 1980's, a number of quite different techniques have been proposed for controlling chaotic systems [Ott 1990], [Dracopoulos 1994], [Ogorzalek 1993].

Mathematical background.

A dth order autonomous continuous time system is defined as

    dx_i/dt = g_i(x_1, . . . , x_d, p_i)   (1 ≤ i ≤ d)        (1)

where t is the time, x = (x_1, . . . , x_d) is a vector in the d-dimensional state space, i.e. the x_i are the dynamic variables, g = (g_1, . . . , g_d) is a vector field in the state space, i.e. the g_i are functions of the x_i, and the p_i are the corresponding vectors of control parameters. As the vector field does not depend on time, the initial time may be taken as t_0 = 0.

^7 Unpredictable in the sense that, although completely deterministic, the computational `cost' of an accurate prediction rapidly becomes prohibitive as the prediction interval increases.
The Jacobian matrix J of an autonomous system described by d first-order differential equations is a d × d matrix with elements defined as

    j_{l,m} = ∂g_l/∂x_m   (1 ≤ l, m ≤ d)        (2)

If the determinant of the Jacobian matrix, det J, is one at all points then the system is conservative. If the average of |det J| < 1 then the system is dissipative. For the average of |det J| > 1, volumes in state space expand with time.

We distinguish between conservative and dissipative dynamic systems. In conservative systems volume elements in the state space are conserved, whereas in a dissipative system the volume elements contract as the system evolves. For dissipative systems, the effects of transients associated with initial conditions disappear in time. The trajectory in state space will head for some final attracting region, or regions, which might be a point, curve, area, and so on. Such an object is called the attractor for the system, since a number of distinct trajectories will be attracted to this set of points in the state space. The properties of the attractor determine the long term dynamical behaviour of the system.

The terms are best understood with examples. A well known nonlinear dissipative system is the Lorenz model [Lorenz 1963], defined by the set of differential equations

    dx/dt = σ(y − x)
    dy/dt = x(R − z) − y        (3)
    dz/dt = xy − bz

This system has three degrees of freedom as there are three dynamic variables x, y and z. The control parameters are σ, R and b. For R less than 1, all trajectories, no matter what their initial conditions, eventually end up approaching the origin of the xyz state space. That is, for R < 1, all of the xyz space is the basin of attraction for the attractor at the origin. Figure 6-1 illustrates a trajectory of the model with σ = 10.0, R = 0.5, b = 8/3, x = 1.0, y = 2.0 and z = 3.0.
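As a quick numerical check of the R < 1 regime described above, the Lorenz equations (3) can be integrated in a few lines. This is an illustrative sketch (simple Euler stepping with an arbitrarily chosen step size; it is not part of the lecture material):

```python
import numpy as np

def lorenz_step(state, sigma=10.0, R=0.5, b=8.0/3.0, dt=0.001):
    # One Euler step of the Lorenz equations (3):
    # dx/dt = sigma*(y - x), dy/dt = x*(R - z) - y, dz/dt = x*y - b*z
    x, y, z = state
    return np.array([x + dt*sigma*(y - x),
                     y + dt*(x*(R - z) - y),
                     z + dt*(x*y - b*z)])

state = np.array([1.0, 2.0, 3.0])   # the initial condition used for Figure 6-1
for _ in range(50000):              # integrate up to t = 50
    state = lorenz_step(state)
print(np.linalg.norm(state))        # near zero: for R < 1 the origin attracts all trajectories
```

The same loop with R raised above 1 no longer collapses to the origin, which is the transition the text goes on to discuss.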
Figure 6-1 Stable attractor.

Chaos

There is currently great excitement and much speculation about chaos theory and its potential role in understanding the world. A brief introduction to the history of the mathematical foundations of the subject can be found in [Holmes 1990]. A chaotic system will remain apparently noisy regardless of how well experimental conditions are controlled. For the Duffing oscillator model, different values of d, f and ω produce completely different system behaviours: a periodic regime occurs, for example, at d = 0.15, f = 0.3 and ω = 1.0, while Figure 6-2 illustrates a typical time series of the chaotic behaviour of the model for d = 0.2, f = 36 and ω = 0.665. As can be seen, the chaotic time series is more complicated in appearance, but there is a boundary within which the
system stays.

If a system displays divergence of nearby trajectories, or sensitive dependence on initial conditions, for some range of its control parameter, then the long term behaviour of that system becomes essentially unpredictable^8, i.e. the long term future of a chaotic system is in practice indeterminable even though the system is theoretically deterministic. The effect of divergence of nearby trajectories on the behaviour of nonlinear systems is known as the butterfly effect. The term was introduced by Lorenz based on the picturesque notion that if the atmosphere displays chaotic behaviour with divergence of nearby trajectories, then even the flapping of a butterfly's wings would alter any long term prediction of atmospheric dynamics. This phenomenon is illustrated in Figure 6-3 for the x-coordinate of the Lorenz model, with one trajectory starting at x = 1.0, y = 2.0 and z = 3.0 in black, and another at x = 1.01, y = 2.01 and z = 3.01 in grey. Instead of R = 0.5, we have used R = 28.0, as the model exhibits chaotic behaviour with this value.

For many nonlinear systems, we must integrate the equations step by step to find future behaviour. Any small error in specifying the initial conditions will be magnified, leading to grossly different long term behaviour of the system; therefore we cannot predict that long term behaviour in practice. Thus, chaotic behaviour is characterised by the divergence of nearby trajectories in state space. As a function of time, the separation between two nearby trajectories increases exponentially, at least for short times. (Only for short times, because the trajectories stay within some bounded region of the state space.)

Figure 6-2 A chaotic time series.

In three or more dimensions, initially nearby trajectories can continue to diverge by wrapping over and under each other.
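The divergence of nearby trajectories is easy to demonstrate numerically. The sketch below repeats the Figure 6-3 experiment with the two initial conditions quoted in the text (Euler integration at R = 28; the step size and iteration count are illustrative choices, not the lecture's):

```python
import numpy as np

def lorenz_step(s, sigma=10.0, R=28.0, b=8.0/3.0, dt=0.001):
    # One Euler step of the Lorenz equations (3) in the chaotic regime R = 28
    x, y, z = s
    return np.array([x + dt*sigma*(y - x),
                     y + dt*(x*(R - z) - y),
                     z + dt*(x*y - b*z)])

traj1 = np.array([1.0, 2.0, 3.0])
traj2 = np.array([1.01, 2.01, 3.01])   # a tiny initial difference (~0.017 apart)
max_sep = 0.0
for _ in range(10000):                 # integrate up to t = 10
    traj1 = lorenz_step(traj1)
    traj2 = lorenz_step(traj2)
    max_sep = max(max_sep, np.linalg.norm(traj1 - traj2))
print(max_sep)   # grows by orders of magnitude before saturating at the attractor size
```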
The crucial feature of state space with three or more dimensions which permits chaotic behaviour is that trajectories remain within some bounded region by intertwining and wrapping around each other, without intersecting and without repeating themselves exactly. The geometry created by such trajectories is strange. Such attractors are thus called strange attractors [Ruelle 1980]; i.e. if nearby trajectories on average diverge exponentially then we say the attractor is strange or chaotic.

Figure 6-3 The butterfly effect.

Chaos in biology.

A relatively new model of brain function was first described by Freeman [Freeman 1991]. The idea is that `thought' (in particular perception, prediction and control) consists of the flow (in the high dimensional state space of vast assemblies of neurons) from one chaotic orbit to a periodic orbit. Freeman argues that chaos is evident in the tendency of neural assemblies to shift abruptly from one complex activity pattern to a more stable one in response to the smallest of inputs. This is a plausible model, and if it stands the test of experiment and scrutiny then chaos is an intrinsic feature of brain function. Phase portraits made from EEGs (electroencephalographs) generated by computer models reflect the overall

^8 In the sense previously discussed.
activity of the olfactory bulb of a rabbit at rest and in response to a familiar scent (e.g. banana). The resemblance of these portraits to irregularly shaped, but still structured, coils of wire reveals that brain activity in both conditions is chaotic, but that the response to the known stimulus is more ordered, more nearly periodic during perception, than at rest.

The heart also provides interesting examples of chaotic behaviour in biological systems. It is clear that the cardiac waveform is nonlinear. There is also evidence that the cardiac cycle can usefully be described in terms of chaos. A. Babloyantz and A. Destexhe [Babloyantz 1988] examined the ECGs (electrocardiographs) of four normal human hearts, using qualitative and quantitative methods. With a variety of processing algorithms, such as the power spectrum, autocorrelation function, phase portrait, Poincaré section, Lyapunov exponent etc., they demonstrated that the heart is not a perfect oscillator, but that cardiac activity stems from deterministic dynamics of a chaotic nature. Numerous in-vivo and in-vitro experiments have investigated cardiac oscillatory activity and found characteristic signatures of chaos [Choi 1983], [Chay 1985], [Geuvara 1981], [Goldberg 1984], [Keener 1981].

Controlling chaos.

The extreme sensitivity to initial conditions displayed by chaotic systems makes them unstable and unpredictable. Yet the same sensitivity also makes them highly susceptible to control, provided that the chaotic system can be analysed and the analysis is then used to make small effective control interventions. By perturbing the system in the right way, it is possible to encourage it to follow one of its many unstable but natural behaviours. In such situations, it may be possible to use chaos to advantage, as chaotic systems, once under control, are very flexible. Such systems can rapidly switch among many different behaviours.
Incorporating chaos deliberately into practical systems therefore offers the possibility of achieving greater flexibility in their performance. In the context of chaos, control could mean a number of things. It could mean the elimination of multiple basins of attraction, stabilisation of the fixed points, or stabilisation of the unstable periodic orbits. Control of chaos is still in its infancy but the potential it offers is enormous.

There are four main categories of chaos control methodologies: low energy, high energy, non-feedback and feedback methods. Low energy control methods require very small changes in the control parameter. In contrast, high energy control methods require large changes. It is always desirable to have a control method of the low energy type, as in physical systems the control parameter may be fixed or may be changeable by only a very small amount. When large changes are required, a physical system may need to be redesigned, defeating the `control of chaos' concept, as such an approach is closer to avoiding chaos. In feedback methods, a control parameter is changed during the control. In non-feedback methods, a control parameter is changed at the beginning of the control only, and left untouched during the control phase.

The original OGY control law.

Suppose p is some scalar control parameter which is to be varied at times t_i, say p = p_i over the interval (t_i, t_{i+1}), see Figure 6-4. Suppose that the nominal value of p is p_0. Our aim is to vary p by small amounts about p_0 so as to stabilise ξ(t) about a suitable control point ξ_F. For all i ≥ 1 let

Figure 6-4 Intervals for which the variables are defined.
    δp = p − p_0   and   δξ_{i+1}(p_i) = ξ_{i+1}(p_i) − ξ_F(p_0)        (4)

Suppose that the iteration is described by the map ξ_{i+1} = F(ξ_i, p). The locally linear behaviour of F in the vicinity of a control point ξ_F is described by the d × d Jacobian matrix

    J = D_ξ F(ξ, p), evaluated at ξ = ξ_F, p = p_0        (5)

and in what follows we assume det J ≠ 0. This yields the first order approximation

    δξ_{i+1}(p_i) ≈ J δξ_i(p_{i−1}) + u δp_i        (6)

where u is a vector which reflects the direction of the local gradient with respect to p.

Now suppose that an eigenvector of J, say e_u, has a real eigenvalue whose absolute value is greater than 1. This means that points ξ_i for which δξ_i lies in the direction of e_u will, if p = p_0 in the intervening time period, be such that on the next iteration δξ_{i+1} lies further away from ξ_F(p_0). We refer to this as an unstable direction. Stable directions are characterised by eigenvectors of J whose eigenvalues have absolute value less than 1.

The basic idea of the OGY method is to choose p_i so as to eliminate the component of δξ_{i+1} in the unstable direction(s). We are now almost ready to derive the appropriate control strategy. However, we first observe the following lemma.

Lemma 6.1. Suppose the d × d matrix J has d linearly independent eigenvectors e_1, ..., e_d, with real eigenvalues λ_1, ..., λ_d. Thus we assume the eigenvectors form a basis in ℝ^d. Construct the dual basis f_1, ..., f_d defined by

    e_i · f_j = 1 if i = j,  0 if i ≠ j        (7)

Then for any x in ℝ^d

    f_u · Jx = λ_u f_u · x        (8)

Proof. Express x in terms of the eigenvectors, writing

    x = α_1 e_1 + α_2 e_2 + ... + α_d e_d        (9)

where the α_i are suitable scalars depending on x. Thus from (7)

    f_u · x = α_u        (10)

The effect of J on x is (from the definition of eigenvectors and eigenvalues)

    Jx = λ_1 α_1 e_1 + ... + λ_d α_d e_d        (11)

Taking the inner product with f_u yields f_u · Jx = λ_u α_u, and the conclusion now follows from (10). ∎

We can now prove

Theorem 6.1 (OGY). The constraint f_u · δξ_{i+1} = 0 leads to the first order control law:
    δp_i ≈ −λ_u (f_u · δξ_i(p_{i−1})) / (f_u · u)        (12)

where for ξ_i near ξ_F the sensitivity vector u is defined as

    u = ∂/∂p_i [ ξ_{i+1}(p_i) − J ξ_i(p_{i−1}) ] at p_0
      = lim_{p_i → p_0} [ δξ_{i+1}(p_i) − J δξ_i(p_{i−1}) ] / (p_i − p_0)        (13)

Proof. Dotting (6) with f_u and using Lemma 6.1 we have

    f_u · δξ_{i+1}(p_i) ≈ λ_u f_u · δξ_i(p_{i−1}) + (f_u · u) δp_i        (14)

Using the constraint f_u · δξ_{i+1} = 0 we obtain from (14)

    λ_u f_u · δξ_i(p_{i−1}) + (f_u · u) δp_i ≈ 0        (15)

which on solving for δp_i yields (12). ∎

The following conditions are required to control a chaotic system with the original OGY method.

• An experimental time series of some scalar dependent variable x_t can be measured and a suitable embedding technique can be applied, or the mathematical model describing the system is available.

• The dynamics of the system can be represented on a low-dimensional surface of section, and the system has at least two linearly independent real eigenvectors.

• There is a specific periodic orbit of the map which lies in the attractor, around which one wishes to stabilise, and the corresponding unstable periodic point can be located.

• A parameter p is available for external adjustment which can be used to slightly modify the system dynamics. Let the range in which p is allowed to vary be p_MIN < p < p_MAX. There is a maximum perturbation δp* in the parameter p by which it is acceptable to vary p from the nominal value p_0.

• The position of the periodic orbit is a function of p, but the local dynamics about it do not vary much with small changes in p.

Chaotic conventional neural networks.

It has been known since at least 1991 that conventional neural network models can exhibit chaotic behaviour. Wang [Wang 1991] constructed a rather stylised simple 2-2 network with weights

    W = [ a   ka ]
        [ b   kb ]        (16)

for a = −5, b = −25, k = −1, whose sigmoidal transfer function is

    σ(x) = 1 / (1 + e^{−x/T})        (17)

where T = 1/4, and in which the outputs are fed back to the inputs as in Figure 6-5.
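The machinery above, the dual basis of Lemma 6.1 and the control law (12), can be exercised on any low-dimensional chaotic map. The sketch below applies it to the standard Hénon map, chosen purely for illustration (it is not the network studied in this chapter), with the map parameter a playing the role of the adjustable parameter p:

```python
import numpy as np

# OGY control (12) of the Henon map x' = 1 - a*x^2 + y, y' = b*x,
# using a as the adjustable parameter. All values are illustrative.
b = 0.3
a0 = 1.4                                              # nominal parameter value p_0
xf = (-(1 - b) + np.sqrt((1 - b)**2 + 4*a0)) / (2*a0) # unstable fixed point x-coordinate
fixed = np.array([xf, b*xf])

J = np.array([[-2*a0*xf, 1.0],
              [b,        0.0]])            # Jacobian at the fixed point (eq. (5))
lam, E = np.linalg.eig(J)                  # real eigenvalues here; columns of E are e_1, e_2
u_idx = int(np.argmax(np.abs(lam)))        # the unstable direction: |lambda_u| > 1
fu = np.linalg.inv(E)[u_idx]               # dual-basis vector f_u of Lemma 6.1
u = np.array([-xf**2, 0.0])                # sensitivity dF/da at the fixed point (eq. (13))

state = fixed + np.array([0.01, -0.01])    # start near the orbit to be stabilised
for _ in range(50):
    delta = state - fixed
    da = -lam[u_idx] * (fu @ delta) / (fu @ u)   # control law (12)
    da = float(np.clip(da, -0.1, 0.1))           # small perturbations only
    a = a0 + da
    state = np.array([1 - a*state[0]**2 + state[1], b*state[0]])

print(np.linalg.norm(state - fixed))       # the unstable fixed point is now stabilised
```

Note how the dual basis is obtained simply as a row of the inverse eigenvector matrix, since inv(E) E = I is exactly condition (7).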
Wang proved that there exist period-doublings to chaos and strange attractors, by using a homeomorphism from the network to a known dynamical system having these properties. This formally established that artificial neural networks can exhibit chaotic behaviour. At the same time [Welstead 1991] trained feedforward networks on the Ikeda and Hénon maps and then, by feeding the outputs back into the inputs, empirically produced neural networks with chaotic attractors. Somewhat later further examples were given in [Dracopoulos 1993]^9. None of these papers considered the question of controlling the neural network behaviour.

Figure 6-5 Feedforward network as a dynamical system. Figure 6-6 Chaotic attractor of Wang's neural network.

Controlling chaotic neural networks

The attractor of the Wang network is rather difficult to work with because the attractor is almost a curve, see Figure 6-6. However, we were able to locate an unstable fixed point at (0.896853, 0.999980) which has Jacobian

    J = [ −1.96322      2.08867    ]
        [ −0.00755664   0.00893465 ]        (18)

The awkward shape of the attractor is reflected in the fact that this Jacobian has a very small determinant. Using T as a control parameter with a nominal value of 1/4 we managed to stabilise this system about the fixed point.

In this section we consider neural networks with feedback whose dynamical behaviour is chaotic. This is a subject of increasing interest for two reasons. First, because if Freeman's hypothesis is correct then, despite their value in practical applications (for example in pattern recognition), the idea that feedforward networks, regardless of the training algorithm employed, are an accurate analogy to equivalent biological computations is seriously challenged.
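The Wang network (16)-(17) is simple enough to iterate directly; the sketch below feeds the two outputs back as the next inputs and collects attractor points (the starting point is an arbitrary illustrative choice, and the attractor obtained should resemble Figure 6-6):

```python
import numpy as np

# Wang's 2-2 network, equations (16)-(17), iterated as a dynamical system:
# outputs are fed back to the inputs at each step (Figure 6-5).
a, b, k, T = -5.0, -25.0, -1.0, 0.25
W = np.array([[a, k*a],
              [b, k*b]])

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x / T))   # sigmoid (17) with slope parameter T

state = np.array([0.5, 0.6])
pts = []
for n in range(2000):
    state = sigma(W @ state)              # one pass through the network
    if n >= 500:                          # discard the transient
        pts.append(state)
pts = np.array(pts)
print(pts[:, 0].min(), pts[:, 0].max())   # x wanders inside (0, 1) without settling down
```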
Second, it is possible that by storing memories in (unstable) periodic behaviours, rather than at point attractors as in the Hopfield model, the memory capacity of simple neural networks may be considerably enhanced.

We consider ways in which a small 2-10-10-2 network, trained on the Ikeda attractor, and whose outputs feed back to the inputs, can be controlled using small variations of various parameters or system variables. It needs to be said that the control mechanisms employed are external to the neural network. This is in contrast to the hypothetical biological process, in which presumably some form of Hebbian learning causes frequently encountered sensory input to be associated with a particular (unstable) periodic dynamical regime. The result is that when the sensory input is re-encountered the neural system relaxes naturally onto a particular (unstable) periodic behaviour which characterises the input, such switching of the dynamical behaviour being implicit to the neural structure rather than externally imposed. Nevertheless, we believe that such studies of how chaotic neural systems can be encouraged to follow particular unstable periodic orbits are an interesting and probably necessary first step in developing some understanding of how such behaviour might be made a natural characteristic of the neural

^9 At that time the authors were unaware of the results presented in IJCNN-91.
system. Other approaches are possible using different neural network models and different control techniques. For example, Babloyantz's group [Sepulchre 1993], [Lourenço 1994], [Babloyantz 1995] have controlled a network of oscillators coupled to their four nearest neighbours, using both the OGY method and a delayed feedback control technique first suggested by Pyragas [Pyragas 1992]. Another example [Solé 1995] uses the GM technique ([Güemez 1993], [Matías 1994]) to control small (three or four neuron) fully connected neural networks whose attractors are similar to that of the chaotic network first discussed by [Wang 1991].

Figure 6-7 The Ikeda strange attractor. Figure 6-8 Attractor for the chaotic 2-10-10-2 neural network.

However, for the purposes of the present section we work with the Ikeda map [Hammel 1985] defined by

    g(z) = γ + R z exp[ i(κ − α/(1 + |z|²)) ]        (19)

where z is a complex variable of the form x + iy, and i² = −1. We can identify x + iy with the point (x, y) on the complex plane, so that g can also be thought of as a mapping of ℝ² → ℝ². The dynamical system is then defined by z_{n+1} = g(z_n). For parameter values α = 5.5, γ = 0.85, κ = 0.4 and R = 0.9, this mapping has a strange attractor, illustrated in Figure 6-7. With only 4000 training pairs (re-scaled into the range [0, 1]) and a training MSE of about 9.9×10⁻⁵, the network already produces an attractor (Figure 6-8) with features similar to the Ikeda map strange attractor shown in Figure 6-7.

We use this network as the basis for the control experiments, the objective being to determine which parameters or system variables are most effective in stabilising the system onto an unstable periodic orbit. The OGY control method was applied to control the chaotic neural network described above.
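The Ikeda map (19) itself takes only a few lines to iterate; the sketch below generates attractor points with the parameter values quoted above (the neural model trained on this data is not reproduced here, and the starting point is an illustrative choice):

```python
import numpy as np

# The Ikeda map (19): z -> gamma + R*z*exp(i*(kappa - alpha/(1 + |z|^2)))
alpha, gamma, kappa, R = 5.5, 0.85, 0.4, 0.9

def ikeda(z):
    return gamma + R * z * np.exp(1j * (kappa - alpha / (1 + abs(z)**2)))

z = 0.1 + 0.1j
orbit = []
for n in range(3000):
    z = ikeda(z)
    if n >= 500:                  # discard the transient
        orbit.append(z)
orbit = np.array(orbit)
print(orbit.real.min(), orbit.real.max())   # the points trace out the strange attractor
```

Note that |g(z)| ≤ γ + R|z|, so with R < 1 the orbit is trapped inside the disc |z| ≤ γ/(1 − R), which is why the attractor is bounded.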
An unstable fixed point ξ_F = (0.626870, 0.553256) was located by examining successive iterations of the system and was used as the unstable periodic point to be stabilised. The Jacobian at this point was

    J = [ −1.26617    −1.03629 ]
        [ −0.564996   −1.06779 ]        (20)

with eigenvalues λ_s = −0.395399 and λ_u = −1.93857, stable eigenvector e_s = (0.7656, −0.643317) and unstable
eigenvector e_u = (−0.838887, −0.544306).

Control varying T in a particular layer.

Initial attempts to control using the slope parameter T were not successful. The next attempts were made by varying T only for neurons in a particular layer of the network, and here the OGY control method was more effective. It seems that by varying T in only one particular layer the chaotic regions of the bifurcation diagrams become broader (see Figure 6-9 and Figure 6-10), and so control becomes easier with small variations of T. The variations of T and the controlled result are illustrated in Figure 6-13 - Figure 6-12.

Using small variations of the inputs.

The results of using an external signal feeding into one of the inputs as a control parameter, whose nominal value is set to zero, were significantly more interesting. The bifurcation diagrams for x(t) are given in Figure 6-14 and Figure 6-15. We use the same fixed point as before, so the Jacobian and associated eigenvectors and eigenvalues remain unchanged. Using an external signal feeding into input x (cf. Figure 6-5), the sensitivity vector u_x = (−1.076260, −0.675875) was approximated. After applying the OGY control for less than 25 time steps (the control variations are shown in Figure 6-18), the system rapidly stabilised onto the unstable fixed point, as illustrated in Figure 6-16 - Figure 6-17.

In these experiments, an improved technique due to [Otani 1996] was actually used to estimate the sensitivity vectors u. In (13) the Jacobian is used to obtain a prediction of where the system would be at the next iteration if no control were applied. However, in the case of a neural network this is unnecessary, since the neural network is its own model at every point. We can therefore obtain an exact prediction of the next system state by simply iterating the network without control.
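The sensitivity vector defined in (13) can be approximated numerically by perturbing the control parameter and differencing. The sketch below does this for the Hénon map with an additive external signal p on the x input, a hypothetical stand-in for the network's input-signal control, not the network itself:

```python
import numpy as np

# Finite-difference estimate of the sensitivity vector u = dF/dp at a fixed point,
# for the Henon map with an external signal p added to the x input.
# (Illustrative stand-in; the lecture estimates u for the trained 2-10-10-2 network.)
b = 0.3
a = 1.4

def henon(state, p=0.0):
    x, y = state
    x = x + p                        # external signal added to the input, nominal p = 0
    return np.array([1 - a*x*x + y, b*x])

xf = (-(1 - b) + np.sqrt((1 - b)**2 + 4*a)) / (2*a)
fixed = np.array([xf, b*xf])         # the unstable fixed point

eps = 1e-6
u = (henon(fixed, eps) - henon(fixed, -eps)) / (2*eps)   # central difference in p
print(u)                             # analytically u = (-2*a*xf, b)
```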
This resulted in much more accurate estimations of the sensitivity vectors, which in turn made control of the system using the OGY method much easier.

The OGY method can be applied to the control of conventional feedforward networks whose behaviour under iterated feedback has been trained to be chaotic. Whilst the method is computationally expensive and, in its original form, subject to a number of limitations (for example, inaccuracies in estimating the Jacobian or sensitivity vectors can make control difficult if not impossible), we nevertheless see that stabilisation of unstable fixed points is perfectly feasible. However, this relaxation onto a fixed point is achieved by a control external to the network itself rather than as an implicit consequence of network function.

It is interesting to observe that control by variation of a global slope parameter is not easy to achieve, but becomes easier when the control variations are applied to a single layer rather than to the whole network. It is notable that control becomes very much easier when the controlling parameter is a small signal applied to one of the inputs. This may be closer to a biological analogy than control of behaviour through global or selective slope variation.
Figure 6-9 Bifurcation diagram of x obtained by varying T in the output layer only. Figure 6-10 Bifurcation diagram of y obtained by varying T in the output layer only.
Figure 6-11 Variations of x from initiation of control. Figure 6-12 Variations of y from initiation of control.
Figure 6-13 Parameter changes during output layer control.
[Plot data omitted.]
Figure 6-14 Bifurcation diagram for the output x(t+1) using an external variable added to the input x(t). Figure 6-15 Bifurcation diagram for the output y(t+1) using an external variable added to the input x(t).
Figure 6-16 Variations of x from initiation of control. Figure 6-17 Variations of y from initiation of control.
Figure 6-18 Parameter changes during input x control.
[Plot data omitted.]
Quite how easy it would be to extend such control to networks with many outputs fed back to many inputs remains to be determined. It also remains to be determined whether it is practical to control high dimensional networks to follow unstable periodic orbits rather than fixed points. It is likely that more sophisticated variations of the OGY technique, or some completely different control method, would be required to accomplish this goal.

Time delayed feedback and a generic scheme for chaotic neural networks.

Recently [Tsui 1999], [Oliveira 1998], [Tsui 2002], Ana Oliveira, Alban Tsui and myself have discovered that it is very easy to control and synchronise chaotic neural systems using time delayed feedback. Combined with the Gamma test to select appropriate time delays, we can now easily achieve the following:

• Given an arbitrary smooth multidimensional chaotic system, produce an iterated neural network which closely models the system.

• Using a simple technique of time delayed feedback, cause the iterated chaotic neural network, when presented with a stimulus, to stabilise onto an unstable periodic orbit characteristic of the applied stimulus. Moreover, this response is quite stable in the presence of noise.

• Synchronise two identical copies of the network by transmitting the single output of one to the other. This provides an interesting model of other kinds of cortical activity and also has interesting applications in secure communications.

Although not a full account of our methods, these papers answer many of the questions mentioned. [Tsui 1999] provides the first actual implementation of an artificial neural network which exhibits all of the properties discussed in [Freeman 1991]. A generic scheme for such a stimulus-response recurrent network is shown in Figure 6-19.

The single output of the network feeds back into the inputs using delay buffers, according to the embedding previously determined by the Gamma test experiments. This embedding should contain enough information for predicting the next system state. For stabilisation control, a multiple (gain constant) of the delayed feedback is added to each neural network input specified by the irregular embedding, based on the idea of Pyragas' delayed feedback control. The control module is shown in Figure 6-19, and the control perturbation for the ith input at the nth iteration is

    k_i [ x(n − i − τ) − x(n − i) ]        (21)

where k_i is a gain constant and τ is the delay time.

We imagine that the presence of an external stimulus excites (activates) the control circuitry, which is otherwise inhibited. Thus to achieve a stabilised dynamical regime in response to a stimulus, the control is switched on at the same time as the external signal is fed into the input line. By varying the external signal in small steps, and holding each new setting fixed long enough for the system to stabilise, we can observe the response of the network to small changes in stimulus.

In the diagram, τ is the same for each control perturbation, but of course we could set τ to be different on each control line. External stimulus of the network can be applied to the controlled inputs as shown in the diagram. The control module should switch on automatically and simultaneously whenever there is an external stimulation. Variations of stimulation, such as on the control delayed feedback lines, may also be used.
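The delayed feedback perturbation (21) can be demonstrated on any chaotic map. The sketch below uses the logistic map at r = 3.8 as a stand-in for the trained network (all parameter values here are illustrative, not those of the lecture's experiments), with τ = 1 so that the unstable fixed point is the orbit being stabilised:

```python
import numpy as np

# Pyragas-style delayed feedback, as in (21): the perturbation k*(x(n - tau) - x(n))
# is added to the input. Demonstrated on the chaotic logistic map r = 3.8.
r, k, tau = 3.8, 0.4, 1
f = lambda x: r * x * (1 - x)
xstar = 1 - 1/r                          # unstable fixed point: |f'(x*)| = r - 2 = 1.8 > 1

hist = [xstar + 0.012, xstar + 0.01]     # start near the orbit to be stabilised
for n in range(200):
    x = hist[-1]
    perturbation = k * (hist[-1 - tau] - x)   # eq. (21); vanishes once stabilised
    hist.append(f(x + perturbation))

print(abs(hist[-1] - xstar))             # ~0: the unstable fixed point is now stable
print(abs(k * (hist[-2] - hist[-1])))    # the control signal has become tiny (non-invasive)
```

The key design point, matching the text, is that the perturbation compares the signal with its own past, so once the target orbit is reached the control signal vanishes of its own accord.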
The single output of the network feeds back into the inputs using delay buffers according to the embedding previously determined by the Gamma test experiments. This embedding should contain enough information for predicting the next system state. For stabilization control a multiple (gain constant) of the delayed feedback is added to each neural network input specified by the irregular embedding, based on the idea from Pyragas' delayed feedback control. The control module is shown in Figure 6-19 and the control perturbation for the ith input at the nth iteration is k i x i(n i & & τ ) x i(n i) & & (21) where ki is a gain constant and is the delay time. J We imagine that the presence of an external stimulus excites (activates) the control circuitry, which is otherwise inhibited. Thus to achieve a stabilised dynamical regime in response to a stimulus the control is switched on at the same time as the external signal is fed into the input line xn. By varying the external signal in small steps and holding the new setting fixed long enough for the system to stabilise we can observe the response of the network to small changes in stimulus. In the diagram, is the same for each control perturbation but of course, we could set to be different on each J J control line. External stimulus of the network can be applied to the controlled inputs as shown in the diagram. The control module should switch on automatically and simultaneously whenever there is an external stimulation. Variations of stimulation, such as on the control delayed feedback lines may also be used. 70
  • 72. Antonia J. Jones: 6 November 2005 iterative feedback Hidden layer delay d + x(n-d) + kd (x(n-d-t) - x(n-d)) observation Controlled neural inputs points External with no external stimulus stimulus x(n) delay 3 + x(n-3) + k3 (x(n-3-t) - x(n-3)) delay 2 + x(n-2) + k2 (x(n-2-t) - x(n-2)) Feedforward delay 1 + x(n-1) + k1 (x(n-1-t) - x(n-1)) neural network Switch signal Control feedback k1 (x(n-1-t) - x(n-1)) for each k2 (x(n-2-t) - x(n-2)) Time delayed delayed line control feedback Keys k3 (x(n-3-t) - x(n-3)) observation module point kd (x(n-d-t) - x(n-d)) Figure 6-19 A general scheme for constructing a stimulus-response chaotic recurrent neural network: the chaotic "delayed" network is trained on suitable input-output data constructed from a chaotic time series; a delayed feedback control is applied to each input line; entry points for external stimulus are suggested, with a switch signal to activate the control module during external stimulation; signals on the delay lines or output can be observed at the "observation points". Example: Controlling the Hénon neural network There follows an example of different responses of the for the Hénon neural system using different settings of controls and external stimulation. The response signals of the system can be observed at the output x(n) of the feedforward neural network module or the "observation points" on the delay lines x(n-1) , x(n-d), as indicated Y in Figure 6-19. Due to the complexity of these neural systems, of course, not all possible settings are tried and presented. s x 1.8 0.4 1.6 0.2 1.4 n 1.2 200 400 600 800 1000 1200 1400 n -0.2 200 400 600 800 1000 1200 1400 0.8 -0.4 0.6 -0.6 Figure 6-21 Response signal on x(n-6) with control Figure 6-20 The control signal corresponding to the signal activated on x(n-6) using k = 0.441628, = 2 J delayed feedback control shown in Figure 6-21. Note and without external stimulation after first 10 that the control signal becomes small. transient iterations. 
After n = 1000 iterations, the control is switched off.
Figure 6-22 Response signals on network output x(n), with control signal activated on x(n-6) using k = 0.441628, τ = 2 and with constant external stimulation sn added to x(n-6), where sn varies from -1.5 to 1.5 in steps of 0.1 at each 500 iterative steps (indicated by the change of hue of the plot points) after 20 initial transient steps.

Figure 6-23 The control signal corresponding to the delayed feedback control shown in Figure 6-22. Note that the control signal becomes small even when the network is under changing external stimulation.

We use τ = 2 and k = 0.441628 as our control parameters on all the possible delayed feedback control lines. The control is applied to the delayed feedback line x(n-6). Without any external stimulation, and using only a single control delayed feedback, the network quickly produces a stabilised response as shown in Figure 6-21, with the corresponding control signal shown in Figure 6-20. Notice that the control signal is very small during the stabilised behaviour. Under external stimulation of varying strength the network is still stabilised, but with a variety of new periodic behaviours as shown in Figure 6-22. The corresponding control signal is still small (see Figure 6-23).

For this system we then investigated the response when the sensory input was perturbed by additive Gaussian noise r with Mean[r] = 0 and standard deviation SD[r] = σ. Using the experimental setup of Figure 6-22, the external stimulus was perturbed at each iteration step by adding Gaussian noise r with standard deviation σ, i.e. giving an external stimulus sn + r. This experiment was repeated for different σ, where σ was varied from σ = 0.05 to σ = 0.3, a high noise standard deviation with respect to the external stimulus strength, which ranges from -1.5 to 1.5.
The result for σ = 0.05 is shown in Figure 6-24 and Figure 6-25. Surprisingly, the response signal stays almost the same, but the control signal is not small at all. The results for σ = 0.15 and σ = 0.3 are shown in Figure 6-26 and Figure 6-27 respectively. As illustrated in these figures, the system dynamics remain essentially unchanged, although as one might expect the response signal becomes progressively "blurred" as the noise level increases. Similar results can be obtained for the other examples.
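The stepped-stimulus schedule used in these experiments, together with its noisy variant, can be sketched as a small helper. The transient length, step size and noise handling follow the description above; the function name and signature are of course ours:

```python
import random

def stimulus_schedule(n, step_len=500, s0=-1.5, ds=0.1, sigma=0.0, transient=20):
    """Stepped external stimulus: after the initial transient, s_n starts at
    s0 and increases by ds every step_len iterations; sigma > 0 adds the
    Gaussian perturbation r used in the noise experiments (SD[r] = sigma)."""
    if n < transient:
        return 0.0
    s = s0 + ds * ((n - transient) // step_len)
    if sigma > 0.0:
        s += random.gauss(0.0, sigma)   # additive noise r on the sensory input
    return s
```

Feeding `stimulus_schedule(n)` into the controlled input line at each iteration reproduces the stepped-stimulus protocol; setting `sigma` to 0.05, 0.15 or 0.3 reproduces the noise experiments.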
Figure 6-24 Response signals on network output x(n), with the control setup the same as in Figure 6-22 but with Gaussian noise r added to the external stimulation, i.e. sn + r, with σ = 0.05, at each iteration step.

Figure 6-25 The control signal corresponding to the delayed feedback control shown in Figure 6-24.

Figure 6-26 Response signals on network output x(n), with the control experiment setup the same as in Figure 6-22 but with Gaussian noise r added to the external stimulation, i.e. sn + r, with σ = 0.15, at each iteration step.

Figure 6-27 Response signals on network output x(n), with the control experiment setup the same as in Figure 6-22 but with Gaussian noise r added to the external stimulation, i.e. sn + r, with σ = 0.3, at each iteration step.

Chapter references

[Abraham 1982] R. H. Abraham and C. D. Shaw, Dynamics - The Geometry of Behavior, Part One: Periodic Behavior, Aerial Press, California, 1982.

[Atkinson 1978] K. E. Atkinson, An Introduction to Numerical Analysis, John Wiley & Sons, Canada, 1978.

[Auerbach 1987] D. Auerbach, P. Cvitanovic, J.-P. Eckmann, G. Gunaratne and I. Procaccia. Exploring chaotic motion through periodic orbits, Physical Review Letters 58(23):2387-2389, 1987.

[Auerbach 1992] D. Auerbach, C. Grebogi, E. Ott and J. A. Yorke. Controlling chaos in high dimensional systems, Physical Review Letters 69(24):3479-3482, 1992.
[Azevedo 1991] A. Azevedo and S. M. Rezende. Controlling chaos in spin-wave instabilities, Physical Review Letters 66(10):1342-1345, 1991.

[Babloyantz 1988] A. Babloyantz and A. Destexhe. Is the normal heart a periodic oscillator? Biol. Cybern. 58:203-211, 1988.

[Babloyantz 1995] A. Babloyantz, C. Lourenço and J. A. Sepulchre. Control of chaos in delay differential equations, in a network of oscillators and in model cortex, Physica D 86:274-283, 1995.

[Belmonte 1988] A. L. Belmonte, M. J. Vinson, J. A. Glazier, G. H. Gunaratne and B. G. Kenny. Trajectory scaling functions at the onset of chaos: experimental results, Physical Review Letters 61(5):539-542, 1988.

[Carroll 1992] T. L. Carroll, I. Triandaf, I. Schwartz and L. Pecora. Tracking unstable orbits in an experiment, Physical Review A 46(10):6189-6192, 1992.

[Carroll 1993] T. L. Carroll and L. M. Pecora. Using chaos to keep period-multiplied systems in phase, Physical Review E 48(4):2426-2436, 1993.

[Choi 1983] M. Y. Choi and B. A. Huberman. Dynamic behaviour of nonlinear networks, Physical Review A 28:1204-1206, 1983.

[Chay 1985] T. R. Chay and J. Rinzel. Bursting, beating, and chaos in an excitable membrane model, Biophysical J. 47:357-366, 1985.

[Crutchfield 1980] J. Crutchfield, D. Farmer, N. Packard, R. Shaw, G. Jones and R. J. Donnelly. Power spectral analysis of a dynamical system, Physics Letters A 76:1-4, 1980.

[Crutchfield 1981] J. Crutchfield, M. Nauenberg and J. Rudnick. Scaling for external noise at the onset of chaos, Physical Review Letters 46:933-935, 1981.

[Cvitanovic 1989] P. Cvitanovic. Universality in Chaos, second edition, Adam Hilger, Bristol, 1989.

[Derrick 1993] W. R.
Derrick, in Chaos in Chemistry and Biochemistry, Eds. R. J. Field and L. Györgyi, World Scientific Publishing, 1993.

[Ditto 1990] W. L. Ditto, S. N. Rauseo and M. L. Spano. Experimental control of chaos, Physical Review Letters 65(26):3211-3214, 1990.

[Ditto 1993] W. L. Ditto and L. M. Pecora. Mastering chaos, Scientific American, 62-68, August 1993.

[Dracopoulos 1993] D. C. Dracopoulos and Antonia J. Jones. Neuromodels of analytic dynamic systems, Neural Computing & Applications 1(4):268-279, 1993.

[Dracopoulos 1994] D. C. Dracopoulos and Antonia J. Jones. Neuro-genetic adaptive attitude control, Neural Computing & Applications 2(4):183-204, 1994.

[Dressler 1992] U. Dressler and G. Nitsche. Controlling chaos using time delay coordinates, Physical Review Letters 68(1):1-4, 1992.

[Eckmann 1985] J. P. Eckmann and D. Ruelle. Ergodic theory of chaos and strange attractors, Rev. Modern
Physics 57(3):617-656, 1985.

[Freeman 1991] W. J. Freeman. The physiology of perception, Scientific American, 34-41, February 1991.

[Garfinkel 1992] A. Garfinkel, M. L. Spano, W. L. Ditto and J. N. Weiss. Controlling cardiac chaos, Science 257:1230-1235, 1992.

[Guevara 1981] M. R. Guevara, L. Glass and A. Shrier. Phase locking, period-doubling bifurcations, and irregular dynamics in periodically stimulated cardiac cells, Science 214:1350-1352, 1981.

[Gills 1992] Z. Gills, C. Iwata, R. Roy, I. B. Schwartz and I. Triandaf. Tracking unstable steady states: extending the stability regime of a multimode laser system, Physical Review Letters 69(22):3169-3172, 1992.

[Gleick 1987] J. Gleick. Chaos: Making a New Science, Abacus, London, 1987.

[Goldberg 1984] A. L. Goldberger, L. J. Findley, M. R. Blackburn and A. J. Mandell. Nonlinear dynamics in heart failure: implications of long-wavelength cardiopulmonary oscillations, Amer. Heart J. 107:612-615, 1984.

[Grassberger 1983] P. Grassberger and I. Procaccia. Characterization of strange attractors, Physical Review Letters 50(5):346-349, 1983.

[Grebogi 1988] C. Grebogi, E. Ott and J. A. Yorke. Unstable periodic orbits and the dimensions of multifractal chaotic attractors, Physical Review A 37:1711-1723, 1988.

[Grebogi 1982] C. Grebogi, E. Ott and J. A. Yorke. Chaotic attractors in crisis, Physical Review Letters 48:1507-1510, 1982.

[Grebogi 1986] C. Grebogi, E. Ott and J. A. Yorke. Critical exponent of chaotic transients in nonlinear dynamical systems, Physical Review Letters 57:1284-1287, 1986.

[Grebogi 1987] C. Grebogi, E. Ott and J. A. Yorke. Critical exponents for crisis-induced intermittency, Physical Review A 36:5365-5380, 1987.

[Greenside 1982] H. S. Greenside, A. Wolf, J. Swift and T. Pignataro. Physical Review A 25:3453, 1982.

[Güémez 1993] J. Güémez and M. A. Matías. Control of chaos in unidimensional maps, Physics Letters A 181:29-32, 1993.

[Gunaratne 1989] G. H. Gunaratne, P. S.
Linsay and M. J. Vinson. Chaos beyond onset: a comparison of theory and experiment, Physical Review Letters 63(1):1-4, 1989.

[Hall 1992] N. Hall. The New Scientist Guide to Chaos, Penguin Books, London, 1992.

[Hammel 1985] S. Hammel, C. Jones and J. Moloney. Global dynamical behaviour of the optical field in a ring cavity, J. Opt. Soc. Am. B 2(4):552-564, 1985.

[Hao 1990] B. Hao. Chaos II, World Scientific, 1990.

[Hayes 1993] S. Hayes, C. Grebogi and E. Ott. Communicating with chaos, Physical Review Letters 70(20):3031-3034, 1993.

[Hénon 1976] M. Hénon. A two-dimensional mapping with a strange attractor, Communications in Mathematical Physics 50:69-77, 1976.
[Hilborn 1994] R. C. Hilborn. Chaos and Nonlinear Dynamics: An Introduction for Scientists and Engineers, Oxford University Press, New York, 1994.

[Holmes 1990] P. Holmes. Poincaré, celestial mechanics, dynamical-systems theory and "chaos", Physics Reports 193(3):137-163, 1990.

[Hunt 1991] E. R. Hunt. Stabilizing high-period orbits in a chaotic system: the diode resonator, Physical Review Letters 67(15):1953-1955, 1991.

[Kaplan 1979] J. Kaplan and J. A. Yorke, in Chaotic Behavior of Multidimensional Difference Equations, H. O. Peitgen et al., Eds., Springer Lecture Notes in Mathematics 730:204-227, Springer-Verlag, Berlin, 1979.

[Keener 1981] J. P. Keener. Chaotic cardiac dynamics, Lectures in Applied Mathematics 19:299-325, 1981.

[Kim 1992] J. H. Kim and J. Stringer. Applied Chaos, John Wiley & Sons, Canada, 1992.

[Lai 1994] Y.-C. Lai and C. Grebogi. Synchronization of spatiotemporal chaotic systems by feedback control, Physical Review E 50(3):1894-1899, 1994.

[Lai 1993] Y.-C. Lai, M. Ding and C. Grebogi. Controlling Hamiltonian chaos, Physical Review E 47(1):86-92, 1993.

[Lathrop 1989] D. P. Lathrop and E. J. Kostelich. Characterization of an experimental strange attractor by periodic orbits, Physical Review A 40(7):4028-4031, 1989.

[Lorenz 1963] E. N. Lorenz. Deterministic nonperiodic flow, J. Atmospheric Sciences 20:130-141, 1963.

[Lorenz 1993] E. N. Lorenz. The Essence of Chaos, University of Washington Press, 1993.

[Lourenço 1994] C. Lourenço and A. Babloyantz. Control of chaos in networks with delay: a model for synchronization of cortical tissue, Neural Computation 6:1141-1154, 1994.

[Matías 1994] M. A. Matías and J. Güémez. Stabilization of chaos by proportional pulses in the system variables, Physical Review Letters 72(10):1455-1458, 1994.

[Moss 1994] F. Moss. Chaos under control, Nature 370:596-597, 1994.

[Ogorzalek 1993] M. J. Ogorzalek.
Taming chaos - Part II: control, IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications 40(10):700-706, October 1993.

[Oliveira 1998] A. Guedes de Oliveira and Antonia J. Jones. Synchronisation of chaotic maps by feedback control and application to secure communications using chaotic neural networks, International Journal of Bifurcation and Chaos 8(11), November 1998.

[Otani 1997] M. Otani and Antonia J. Jones. Guiding chaotic orbits, Research Report, Department of Computer Science, University of Wales, Cardiff, December 1997.

[Ott 1990] E. Ott, C. Grebogi and J. A. Yorke. Controlling chaos, Physical Review Letters 64(11):1196-1199, 1990.

[Ott 1994] E. Ott, T. Sauer and J. A. Yorke. Coping with Chaos, John Wiley & Sons, Canada, 1994.

[Parker 1989] T. S. Parker and L. O. Chua. Practical Numerical Algorithms for Chaotic Systems, Springer-Verlag,
New York, 1989.

[Parlitz 1985] U. Parlitz and W. Lauterborn. Superstructure in the bifurcation set of the Duffing equation, Physics Letters A 107(8):351-355, 1985.

[Pearson 1986] C. E. Pearson. Numerical Methods in Engineering and Science, Van Nostrand Reinhold, England, 1986.

[Peinke 1992] J. Peinke, J. Parisi, O. E. Rössler and R. Stoop. Encounter with Chaos: Self-Organized Hierarchical Complexity in Semiconductor Experiments, Springer-Verlag, 1992.

[Petrov 1993] V. Petrov, V. Gaspar, J. Masere and K. Showalter. Controlling chaos in the Belousov-Zhabotinsky reaction, Nature 361:240-243, 1993.

[Pfister 1992] G. Pfister, Th. Buzug and N. Enge. Characterization of experimental time series from Taylor-Couette flow, Physica D 58:441-454, 1992.

[Provenzale 1992] A. Provenzale, L. A. Smith, R. Vio and G. Murante. Distinguishing between low-dimensional dynamics and randomness in measured time series, Physica D 58:31-49, 1992.

[Pyragas 1992] K. Pyragas. Continuous control of chaos by self-controlling feedback, Physics Letters A 170:421-428, 1992.

[Romeiras 1992] F. J. Romeiras, C. Grebogi, E. Ott and W. P. Dayawansa. Controlling chaotic dynamical systems, Physica D 58:165-192, 1992.

[Rössler 1976] O. E. Rössler. An equation for continuous chaos, Physics Letters A 57:397-398, 1976.

[Roy 1992] R. Roy, T. W. Murphy, T. D. Maier, Z. Gills and E. R. Hunt. Dynamical control of a chaotic laser: experimental stabilization of a globally coupled system, Physical Review Letters 68(9):1259-1262, 1992.

[Roy 1994] R. Roy and K. S. Thornburg, Jr. Experimental synchronization of chaotic lasers, Physical Review Letters 72(13):2009-2012, 1994.

[Ruelle 1980] D. Ruelle. Strange attractors, The Mathematical Intelligencer 2:126-137, 1980.

[Russell 1980] D. A. Russell, J. D. Hansen and E. Ott. Dimensions of strange attractors, Physical Review Letters 45:1175-1178, 1980.

[Sano 1985] M. Sano and Y. Sawada.
Measurement of the Lyapunov spectrum from a chaotic time series, Physical Review Letters 55(10):1082-1085, 1985.

[Schiff 1994] S. J. Schiff, K. Jerger, D. H. Duong, T. Chang, M. L. Spano and W. L. Ditto. Controlling chaos in the brain, Nature 370:615-620, 1994.

[Schwartz 1994] I. B. Schwartz and I. Triandaf. Controlling unstable states in reaction-diffusion systems modeled by time series, Physical Review E 50(4):2548-2552, 1994.

[Sepulchre 1993] J. A. Sepulchre and A. Babloyantz. Controlling chaos in a network of oscillators, Physical Review E 48(2):945-950, 1993.

[Shinbrot 1990] T. Shinbrot, E. Ott, C. Grebogi and J. A. Yorke. Using chaos to direct trajectories to targets, Physical Review Letters 65(26):3215-3218, 1990.
[Shinbrot 1993] T. Shinbrot, C. Grebogi, E. Ott and J. A. Yorke. Using small perturbations to control chaos, Nature 363:411-417, 1993.

[Shinbrot 1992a] T. Shinbrot, W. Ditto, C. Grebogi, E. Ott, M. Spano and J. A. Yorke. Using the sensitive dependence of chaos (the "butterfly effect") to direct trajectories in an experimental chaotic system, Physical Review Letters 68(19):2863-2866, 1992.

[Shinbrot 1992b] T. Shinbrot, C. Grebogi, E. Ott and J. A. Yorke. Using chaos to target stationary states of flows, Physics Letters A 196:349-354, 1992.

[Singer 1991] J. Singer, Y-Z. Wang and H. H. Bau. Controlling a chaotic system, Physical Review Letters 66(9):1123-1125, 1991.

[Smith 1986] W. A. Smith. Elementary Numerical Analysis, Prentice-Hall, England, 1986.

[Solé 1995] R. V. Solé and L. Menéndez de la Prida. Controlling chaos in discrete neural networks, Physics Letters A 199:65-69, 1995.

[Stewart 1989] I. Stewart. Does God Play Dice? The New Mathematics of Chaos, Penguin Books, England, 1989.

[Stewart 1996] I. Stewart. From Here to Infinity: A Guide to Today's Mathematics, Oxford University Press, England, 1996.

[Thompson 1994] J. M. T. Thompson and S. R. Bishop. Nonlinearity and Chaos in Engineering Dynamics, John Wiley & Sons, England, 1994.

[Tsui 1999] A. P. Tsui and Antonia J. Jones. Periodic response to external stimulation of a chaotic neural network with delayed feedback, International Journal of Bifurcation and Chaos 9(4), April 1999.

[Tsui 2002] A. P. Tsui, Antonia J. Jones and A. Guedes de Oliveira. The construction of smooth models using irregular embeddings determined by a Gamma test analysis, Neural Computing & Applications 10(4):318-329, April 2002. ISSN 0941-0643.

[Wang 1991] X. Wang. Period-doublings to chaos in a simple neural network, 1991 IEEE INNS International Joint Conference on Neural Networks - Seattle, Vol II:333-339, 1991.

[Welstead 1991] S. T.
Welstead. Multilayer feedforward networks can learn strange attractors, 1991 IEEE INNS International Joint Conference on Neural Networks - Seattle, Vol II:139-144, 1991.
COURSEWORK

This work should be handed in three weeks before the Easter break. It is suggested that you work on questions as the subject matter is covered in the course.

1.(a) (i) Describe the genetic operators for crossover and mutation applied to genotypes consisting of fixed length strings.

(ii) Give a detailed description of a typical genetic algorithm.

(iii) Explain the different roles played by crossover and mutation in the process of genetic search.

(b) What are the main design problems in constructing a GA for a particular problem? What simple checks would you suggest before running a full test of a new genetic algorithm (GA) to verify that it has some chance of performing adequately? [4]

(c) (i) "Darwin's theory of evolution is supposed to explain the diversity of species. If by definition two members of different species cannot interbreed then increasing specialization is the only plausible mechanism for species creation. Very specialized species are vulnerable to external changes and so in a dynamic environment might not be expected to survive in the long run. This would tend to decrease diversity rather than increase it." Discuss in two or three sentences.

(ii) In the light of the above discussion identify features which might be present in natural evolution which are absent in genetic algorithms.

2(a). Let

    I = | i11  i12 |
        | i21  i22 |

denote the input to a binary retina. Show that it is impossible to find a set of weights W = (wjk), with the wjk real numbers, 1 ≤ j, k ≤ 2, and a threshold θ, such that the single linear function

    P(I) = 1 if Σj,k wjk ijk > θ
    P(I) = 0 if Σj,k wjk ijk < θ

can discriminate between horizontal and vertical lines.

(b) An alternative system is proposed in which a 2-tuple WISARD classifier with a 1-1 mapping is used. Investigate the possible mappings and show that two-thirds of these lead to a system which discriminates perfectly between horizontal and vertical lines.
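The impossibility in 2(a) rests on a simple symmetry: the two horizontal-line patterns and the two vertical-line patterns use each retina pixel exactly once per class, so any linear functional takes the same summed value over both classes, and no threshold can put one class strictly above it and the other strictly below. The following sketch (the row-major pattern encodings are the obvious ones) checks this numerically for random weights:

```python
import random

HORIZONTAL = [(1, 1, 0, 0), (0, 0, 1, 1)]   # top row, bottom row
VERTICAL   = [(1, 0, 1, 0), (0, 1, 0, 1)]   # left column, right column

def activation(w, pattern):
    """Linear part of P(I): sum over the retina of w_jk * i_jk."""
    return sum(wi * pi for wi, pi in zip(w, pattern))

random.seed(0)
for _ in range(1000):
    w = [random.uniform(-1, 1) for _ in range(4)]
    s_h = sum(activation(w, p) for p in HORIZONTAL)
    s_v = sum(activation(w, p) for p in VERTICAL)
    # both sums equal w1 + w2 + w3 + w4, so no threshold strictly
    # separates the two classes
    assert abs(s_h - s_v) < 1e-12
```

This is the same obstruction as in the XOR problem: the class means coincide, so the classes are not linearly separable.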
(c) Suppose that for some given two-class problem on a 512 × 512 retina suitable weights can be found for the single linear classifier system. Using 8 bits per weight, estimate the storage requirements for such a system. Allowing 10 microseconds per multiplication and 3 microseconds per addition and temporary storage overheads, calculate the
classification time. Would it be suitable for industrial application?

(d) Perform similar calculations, relating to storage and response time, for a WISARD 2-tuple system with a 1-1 mapping. Assume a conventional 16-bit architecture with an access time of 1 microsecond.

3(a). Briefly describe the main features of a Hopfield network: your discussion should include definitions for the update rule and energy function.

(b) How are the weights of a Hopfield network usually assigned for a pattern recognition problem and what are two limitations of this approach?

(c) It is proposed to apply some style of Hopfield network to the problem of finding a minimal tour for the geometric TSP. Suggest and discuss one method of assigning network states to tours and briefly describe how the weights of the network can be related to the distances between cities.

4(a). (i) Why are hidden units necessary to solve general problems of function modelling using feedforward neural networks?

(ii) What is the main difficulty in constructing a learning rule for feedforward networks with hidden units?

(b) (i) Describe the backpropagation rule for learning using a feedforward network with one hidden layer.

Notation. The following functions, and hence all their partial derivatives, are assumed known.

Error function.

    E(z1, z2, ..., zn, t1, t2, ..., tn)     (24)

Here z1, z2, ..., zn are the outputs from the output layer units and t1, t2, ..., tn are the target outputs.

Activation function. Output layer:

    netj = netj(y1, y2, ..., ym, pj1, ..., pjt)     (25)

Here y1, y2, ..., ym are the outputs from the previous layer and pj1, ..., pjt are parameters associated with the jth node of the output layer. Frequently t = t(m), i.e. the number of parameters associated with a node is a function of the number of inputs.

Previous layer:

    neti = neti(x1, x2, ..., xl, pi1, ..., pis)     (26)

Here x1, x2, ..., xl are the outputs from the layer prior to the previous layer.
Output function. Output layer:

    zj = f(netj)  (1 ≤ j ≤ n)     (27)
Previous layer:

    yi = f(neti)  (1 ≤ i ≤ m)     (28)

(ii) What are the strengths and weaknesses of the backpropagation method?

(c) The N-bit parity problem is this: given any vector x = (x1, x2, ..., xN), where xi ∈ {0, 1} (i.e. x is a vector of 0's and 1's), determine the parity of x. The parity of x is defined as 0 if the number of 1's in x is even and 1 if the number of 1's is odd. Construct a feedforward network to solve the N-bit parity problem.
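One classical hand-built solution to the parity question (a sketch of one possible construction, not the only one) uses N hidden threshold units that count the number of active inputs, with alternating ±1 output weights:

```python
from itertools import product

def step(v):
    """Hard threshold activation."""
    return 1 if v > 0 else 0

def parity_net(x):
    """Feedforward net: hidden unit j fires iff at least j inputs are 1
    (all input weights 1, threshold j - 0.5); output weights alternate
    +1, -1, +1, ... with output threshold 0.5."""
    s = sum(x)                      # common net input to every hidden unit
    hidden = [step(s - (j - 0.5)) for j in range(1, len(x) + 1)]
    net = sum((-1) ** j * h for j, h in enumerate(hidden))
    return step(net - 0.5)

# exhaustive check for N = 4
assert all(parity_net(x) == sum(x) % 2 for x in product((0, 1), repeat=4))
```

With s inputs on, exactly the first s hidden units fire, so the output unit's net input is the alternating sum 1 - 1 + 1 - ... with s terms, which is 1 when s is odd and 0 when s is even.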
CMT563 Sample Examination Paper

Time Allowed: 2 hours. Answer THREE questions.

1.(a) Give a high level pseudo-code description of a typical genetic algorithm. [5]

(b) Describe the genetic operators for crossover and mutation applied to genotypes consisting of fixed length strings. [5]

(c) Discuss three criteria by which one might evaluate the effectiveness of a heuristic search algorithm, such as a genetic algorithm, when applied to a hard combinatoric search problem. Briefly comment on each of these criteria. [5]

(d) What simple checks would you suggest before running a full test of a new genetic algorithm (GA) to verify that the representation and crossover operator which have been selected have a reasonable possibility of ensuring that the GA performs better than random search. [5]

2(a). Describe the Hopfield model for a fully interconnected asynchronous network. Your description should include a definition of the update rule for individual neurons and the method for selecting which neuron to update at the next step. [8]

(b) The energy of a Hopfield network with weights wij from the jth node to the ith node, with wii = 0 for all i, wij = wji for all i, j, and thresholds θi, when the network is in state x = (x1, ..., xn) ∈ {0, 1}^n, where n is the number of neurons, is defined as

    E = -(1/2) Σi≠j wij xi xj + Σi θi xi

Show that under the rules described in (a) the network will iterate to a state at which the energy is a local minimum and stay there. [8]

(c) (i) If the Hopfield model is used as a self-associative memory how are the weights determined from the patterns? [2]

(ii) What problems are encountered as more memories are added and what is the practical upper limit for memory storage with a given Hopfield network? [2]
3(a). Describe the Wisard model for pattern recognition. State some advantages of this model over other methods of pattern recognition. [10]

(b) Explain how the storage requirements and response of a Wisard network alter as the size of the n-tuple increases. [5]

(c) Suppose that for some given two-class problem using a 512 × 512 retina a hardware Wisard system is simulated using a conventional 16-bit serial architecture with a single 1 MHz processor. Allowing 1 microsecond for access, 10 microseconds per multiplication and 3 microseconds per addition and temporary storage overheads, estimate the storage requirements and calculate the classification time. Would it be suitable for industrial application? [5]

4(a). Briefly describe the backpropagation algorithm (detailed equations are not required but you should explain what weight adjustments depend on at each step) and discuss its strengths and weaknesses. [8]

(b) Suppose that the training data for a feedforward network is derived from a process which can be modelled by a smooth function f from inputs to the single output y, and that in the data y is subjected to measurement error r with mean zero. Identify the principal factors which will determine the best Mean Squared Error that a trained network can achieve when tested on an unseen set of data drawn from the same process. [4]

(c) Briefly describe the Gamma test. Give an example of the type of problem where it would be appropriate to use the Gamma test and an example where it would not be appropriate. [8]
Solutions

1. (a) Give a high level pseudo-code description of a typical genetic algorithm. [5]

1. Randomly generate a population of M structures S(0) = {s(1,0), ..., s(M,0)}.
2. For each new string s(i,t) in S(t), compute and save its measure of utility v(s(i,t)).
3. For each s(i,t) in S(t) compute the selection probability defined by p(i,t) = v(s(i,t)) / Σi v(s(i,t)).
4. Generate a new population S(t+1) by selecting structures from S(t) via the selection probability distribution and applying the idealised genetic operators to the structures generated.
5. Goto 2.

Algorithm 7-1 Generic GA

(b) Describe the genetic operators for crossover and mutation applied to genotypes consisting of fixed length strings. [5]

For the moment we represent strings as a1a2a3...al [ai = 1 or 0]. Using this notation we can describe the operators by the cut points at which strings are combined to produce new strings.

Crossover. In crossover one or more cut points are selected at random and the operation illustrated below (where two cut points are employed) is used to create two children:

    Parent 1   1011 010011 10111
    Parent 2   1100 111000 11010
    Child 1    1100 010011 11010
    Child 2    1011 111000 10111

    Mutation:   110011100011010 -> 111011101011010
    Inversion:  111111100011010 -> 110011111011010

    Standard genetic operators.

A variety of control regimes are possible, but a simple strategy might be `select one of the children at random to go into the next generation'. Crossing over proceeds in three steps:

a) Two structures a1...al and b1...bl are selected at random from the current population.

b) A crossover point x, in the range 1 to l-1, is selected, again at random.
  • 86. CMT563 c) Two new structures a1a2...axbx+1bx+2...bl b1b2...bxax+1ax+2...al are formed. Children tend to be ‘like’ their parents, so that crossover can be considered as a focussing operator which exploits knowledge already gained, its effects are quite quickly apparent. Mutation. In mutation an allele is altered at each site with some fixed probability. Each structure a1a2...al in the population is operated upon as follows. Position x is modified, with probability p independent of the other positions, so that the string is replaced by a1a2...ax-1 z ax+1...al where z is drawn at random from the possible values. If p is the probability of mutation at a single position then the probability of h mutations in a given string is determined by a Poisson distribution with parameter p. Mutation disperses the population throughout the search space and so might be considered as an information gathering or exploration operator. Search by mutation is a slow process analogous to exhaustive search. Thus mutation is a `background' operator, assuring that the crossover operator has a full range of alleles so that the adaptive plan is not trapped on local optima. (c) Discuss three criteria by which one might evaluate the effectiveness of a heuristic search algorithm, such as a genetic algorithm, when applied to a hard combinatoric search problem. Briefly comment on each of these criteria. [5] The three principal criteria are: solution quality; scaling in run time and memory for a given solution quality; absolute run time should be acceptable. Solution quality is often hard to measure: for the TSP we might use the Held-Karp lower bound. For hard combinatoric search a scaling of O(NlogN), where N is a measure of problem size, is normally the best that can be achieved - anything worse than this results in unacceptable run times for large problems. 
Acceptable absolute run time is a function of the commercial benefits and time available - for some early VLSI layout designs supercomputers were used with run times of several months. More usually a run time of several days is the most that is acceptable.

(d) What simple checks would you suggest before running a full test of a new genetic algorithm (GA) to verify that the representation and crossover operator which have been selected have a reasonable possibility of ensuring that the GA performs better than random search. [5]

One simple check is to run several thousand crossover events with randomly selected parents and record child_fitness versus mean_parental_fitness: if the resulting point plot/correlation calculation reveals little or no correlation between the two, then the combination of representation and crossover is unlikely to produce a GA that works any better than random search. On the other hand, if there is a good correlation between child_fitness and mean_parental_fitness, then the mechanisms of evolutionary search have some chance of being effective.

2. (a) Describe the Hopfield model for a fully interconnected asynchronous network. Your description should include a definition of the update rule for individual neurons and the method for selecting which neuron to update at the next step. [8]

Each of n neurons has two states, like those of McCulloch and Pitts: xi = 0 (not firing) and xi = 1 (firing at maximum rate). When neuron i has a connection made to it from neuron j, the strength of the connection is defined
as wij. Non-connected neurons have wij = 0. The instantaneous state of the system is specified by listing the n values of xi, so it is represented by a binary word of n bits.

The state changes in time according to the following algorithm. For each neuron i there is a fixed threshold θi. Each neuron readjusts its state randomly in time, but with a mean attempt rate µ, setting

    xi(t) = 1          if Σj≠i wij xj(t-1) > θi
    xi(t) = xi(t-1)    if Σj≠i wij xj(t-1) = θi     (2)
    xi(t) = 0          if Σj≠i wij xj(t-1) < θi

Thus, an element chosen at random asynchronously evaluates whether it is above or below threshold and readjusts accordingly. The procedure is described below.

Procedure Hopfield (assumes weights are assigned)
    Randomise initial state x ∈ {0, 1}^n.
    Repeat until updating every unit produces no change of state:
        Select unit i (1 ≤ i ≤ n) with uniform random probability.
        Update unit i according to (2).
    End

Algorithm 7-2 Generic Hopfield net.

(b) The energy of a Hopfield network with weights wij from the jth node to the ith node, with wii = 0 for all i, wij = wji for all i, j, and thresholds θi, when the network is in state x = (x1, ..., xn) ∈ {0, 1}^n, where n is the number of neurons, is defined as

    E = -(1/2) Σi≠j wij xi xj + Σi θi xi     (3)

Show that under the rules described in (a) the network will iterate to a state at which the energy is a local minimum and stay there. [8]

A low rate of neural firing is approximated by assuming that only one unit changes state at any given moment. Then, since wij = wji, the change ΔE due to Δxi is given by

    ΔE = -Δxi ( Σj≠i wij xj - θi )     (4)

Now consider the effect of the threshold rule (2). If the unit changes state at all then Δxi = ±1. If Δxi = 1, the unit changes state from 0 to 1, hence by the threshold rule

    Σj≠i wij xj > θi     (5)

in which case, by (4), ΔE < 0. Alternatively, Δxi = -1, the unit therefore changes state from 1 to 0, and

    Σj≠i wij xj < θi     (6)

and again ΔE < 0.
Thus if a unit changes state at all the energy must decrease. State changes will continue until a locally least E is reached, and the network then stays in that state. The energy is playing the role of a Hamiltonian in the more general dynamical systems context.

(c) (i) If the Hopfield model is used as an auto-associative memory, how are the weights determined from the patterns? [2]
An early rule used for memory storage in associative memory models can also be used to store memories in the Hopfield model. If the {0, 1} node outputs are mapped to x_i ∈ {-1, +1}, the rule assigns the weights as follows. For each pattern vector x which we require to memorise, we consider the matrix

    x x^T = (x_1, x_2, ..., x_n)^T (x_1, x_2, ..., x_n) =

        | x_1 x_1   x_1 x_2   ...   x_1 x_n |
        | x_2 x_1   x_2 x_2   ...   x_2 x_n |
        |    .         .      ...      .    |        (7)
        | x_n x_1   x_n x_2   ...   x_n x_n |

and then average these matrices over all pattern vectors (prototypes). We then set w_ii = 0 and note that the resulting matrix is symmetric. In this way we can capture the average correlations between components of the pattern vectors and then use this information, during the operation of the network, to recapture missing or corrupted components.

(ii) What problems are encountered as more memories are added and what is the practical upper limit for memory storage with a given Hopfield network? [2]

The main problem is that as the number of patterns P increases we find that an exponentially increasing number of `spurious' local minima are introduced, i.e. minima not associated with a stored pattern. When P is approximately 0.15n, where n is the number of nodes in the network, there is a dramatic degradation in the ability of the network to recall noisy patterns.

3. (a) Describe the Wisard model for pattern recognition.

[Figure: Schematic of a 3-tuple recogniser]

The scheme outlined above was first proposed by Aleksander and Stonham in [Aleksander 1979]. The sample data
to be recognized is stored as a 2-dimensional array (the 'retina') of binary elements. Random connections are made onto the elements of the array, N such connections being grouped together to form an N-tuple which is used to address one random access memory (RAM) per discriminator. In this way a large number of RAMs are grouped together to form a class discriminator whose output, or score, is the sum of all its RAMs' outputs. This configuration is repeated to give one discriminator for each class of pattern to be recognized. The RAMs implement logic functions which are set up during training; thus the method does not involve any direct storage of pattern data.

The system is trained using samples of patterns from each class. A pattern is stored into the retina array and a logical 1 is written into the RAMs of the discriminator associated with the class of this training pattern at the locations addressed by the N-tuples. This is repeated many times, typically 25-50 times, for each class.

In recognition mode, the unknown pattern is stored in the array and the RAMs of every discriminator are put into READ mode. The input pattern then stimulates the logic functions in the discriminator network and an overall response is obtained by summing all the logical outputs. The pattern is then assigned to the class of the discriminator producing the highest score.

State some advantages of this model over other methods of pattern recognition.

Some advantages of the WISARD model for pattern recognition are:

•   Implementation as a parallel, or serial, system in currently available hardware is inexpensive and simple.

•   Given labelled samples of each recognition class, training times are extremely short.

•   The time required by a trained system to classify an unknown pattern is very small and, in a parallel implementation, is independent of the number of classes.
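The training and recognition procedure just described can be sketched in a few lines (an illustrative simulation only: the retina is flattened to a 1-D list, sets stand in for 1-bit RAMs, and the class and function names are invented for this example):

```python
import random

class Discriminator:
    """One class discriminator: p/n RAMs, each addressed by an n-tuple of pixels."""

    def __init__(self, tuples):
        self.tuples = tuples                 # fixed random grouping of pixel indices
        self.rams = [set() for _ in tuples]  # each set holds addresses written with 1

    def address(self, image, tup):
        return tuple(image[k] for k in tup)  # the n-tuple of pixel values is the address

    def train(self, image):
        for ram, tup in zip(self.rams, self.tuples):
            ram.add(self.address(image, tup))  # write a logical 1 at this address

    def score(self, image):
        # READ mode: sum of the logical outputs of all RAMs
        return sum(self.address(image, tup) in ram
                   for ram, tup in zip(self.rams, self.tuples))

def make_tuples(p, n, seed=0):
    """Randomly group p pixel indices into p/n disjoint n-tuples."""
    idx = list(range(p))
    random.Random(seed).shuffle(idx)
    return [tuple(idx[k:k + n]) for k in range(0, p, n)]
```

In use, one Discriminator is built per class (all sharing the same retina size but with their own RAMs), each is trained on labelled samples of its class, and an unknown pattern is assigned to the class whose discriminator produces the highest score.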
(b) Explain how the storage requirements and response of a Wisard network alter as the size of the n-tuple increases.

The total amount of memory needed for the discriminator RAMs is easily calculated. Let n be the number of inputs to each RAM (the tuple size), C the number of classes, and assume a 1-1 mapping of a retina with p binary pixels. For convenience we assume that n divides p. Then the number of 1-bit RAMs per discriminator will be p/n and each RAM has 2^n bits of storage. So the number of bits per discriminator is (p/n)·2^n. Hence the total number of bits for C classes is

    (p/n)·2^n·C        (8)

The response of a discriminator becomes more sensitive to precise similarities of a pattern to patterns from the corresponding training class as n increases.

(c) Suppose that for some given two-class problem using a 512 × 512 retina a hardware Wisard system is simulated using a conventional 16-bit serial architecture with a single 1 MHz processor. Using 8 bits per weight, estimate the storage requirements for such a system. Allowing 1 micro sec for access, 10 micro sec for multiplication and 3 micro sec per addition and temporary storage overheads, estimate the storage requirements and calculate the classification time. Would it be suitable for industrial application? [5]

Storage requirements for the n-tuple system:

    (No. of classes) × (Size of image) × 2^n/n

With two classes and n = 2 this gives
    2 × 512 × 512 × 2^2/2 = 10^6 bits (approx).

Response time: With a conventional 16-bit architecture, the computation time is mainly one of storage access (once per n-tuple per pattern class). Taking 1 micro sec per access we have

    512 × 512 × 2 × 10^-6/2 = 1/4 sec (approx).

This is a bit slow for an industrial application (e.g. an assembly line).

4. (a) Briefly describe the backpropagation algorithm (detailed equations are not required but you should explain what weight adjustments depend on at each step) and discuss its strengths and weaknesses. [6]

As is well known, there is no advantage in using several layers if the units have linear activation functions. Since the delta rule is a modification of gradient descent we need to consider derivatives, and the activation functions of linear threshold units are not differentiable (their derivative is infinite at the threshold and zero elsewhere). We therefore consider semilinear activation functions. A semilinear activation function f_i(net_i) is one in which the output of the unit is a non-decreasing and differentiable function of the net input to the unit.

Suppose the ith unit is an output unit. Let o_pi be the output produced by the ith unit when pattern p is presented and t_pi be the target output. In this case we set the error to be

    δ_pi = (t_pi - o_pi) f_i'(net_pi)        (for an output unit)

and the weight change for the weight associated with the jth input to the ith unit is then

    Δ_p w_ij = η δ_pi o_pj

where η > 0 is the learning rate. The error signal for a hidden unit is determined recursively in terms of the error signals of the units to which it directly connects and the weights of those connections, i.e.

    δ_pi = f_i'(net_pi) Σ_k δ_pk w_ki        (for a hidden unit)

where k ranges over those units to which the ith unit delivers outputs. The weight change is then computed as before.
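The two δ rules above can be sketched for a single-hidden-layer network of sigmoid units, for which f'(net) = o(1 - o). This is a minimal sketch; the weight layout and function names are assumptions of the example, not part of the notes:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(W1, W2, x):
    """One hidden layer, one sigmoid output; returns hidden activations and output."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    o = sigmoid(sum(w * hi for w, hi in zip(W2, h)))
    return h, o

def backprop_deltas(W2, h, o, t):
    """Output delta (t - o) f'(net); hidden deltas folded back through W2."""
    d_out = (t - o) * o * (1 - o)               # f'(net) = o(1 - o) for the sigmoid
    d_hidden = [hi * (1 - hi) * d_out * w for hi, w in zip(h, W2)]
    return d_out, d_hidden

def grads(W1, W2, x, t):
    """Weight changes delta_i * o_j (eta = 1); this is the negative gradient
    of the error 0.5 (t - o)^2, i.e. the descent direction."""
    h, o = forward(W1, W2, x)
    d_out, d_hid = backprop_deltas(W2, h, o, t)
    g2 = [d_out * hi for hi in h]               # output-layer weight changes
    g1 = [[d * xi for xi in x] for d in d_hid]  # hidden-layer weight changes
    return g1, g2
```

A useful sanity check on any such implementation is to compare the backpropagated weight change against a finite-difference estimate of the error gradient: the two should agree to within the finite-difference error.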
Thus the implementation of back propagation involves a forward pass through the layers to estimate the error, and then a backward pass modifying the synapses to decrease the error. Practical implementations are not difficult, but without modification the algorithm is still rather slow, especially for systems with many layers. Still, it is at present the most popular learning algorithm for multilayer networks, and it is considerably faster than the Boltzmann machine.

(b) Suppose that the training data for a feedforward network is derived from a process which can be modelled by a smooth function f from inputs to the single output y, and that in the data y is subjected to measurement error r with mean zero. Identify the principal factors which will determine the best Mean Squared Error that a trained network can achieve when tested on an unseen set of data drawn from the same process. [4]

The function f can be regarded as a surface in m dimensions which we seek to approximate by a neural network whose input-to-output mapping we can call g. The training data gives a noisy model of the surface. It is plain that there is no point in seeking to train the network beyond the stage where the Mean Squared Error (MSE) on the training data is less than Var(r), the variance of r, since this corresponds to overfitting. This then will be the best
  • 91. CMT563 MSE possible. The principal factors determining whether or we can train g to have MSE . Var(r) are: ! The `bumpiness' of the surface defined by f. To accurately model a very bumpy surface obviously requires more data points. ! The size of Var(r). The larger Var(r) becomes the less information is contained in any given data point. When Var(r) becomes comparable to the range of y very little information regarding f is retained in the training data. ! The size of the training set. A very bumpy noisy surface will require a very large training set. (c) Briefly describe the Gamma test. Give an example of the type of problem when it would be appropriate to use the Gamma test and an example where it would not be appropriate. [10] [Any reasonable summary of the following] Given data samples (x(i), y(i)), where x(i) = (x1(i), ..., xm(i)), 1 i # # M, let N[i, p] be the list of (equidistant) p th nearest neighbours to x(i). We write M M 1 1 1 δ (p) ' j j x(j) & x(i) 2 ' j x(N[i, p]) & x(i) 2 (12) M L(N[i, p]) M i ' 1 j 0 N[i, p] i ' 1 where L(N[i, p]) is the length of the list N[i, p]. Thus (p) is the mean square distance to the pth nearest neighbour. * Nearest neighbour lists for pth nearest neighbours (1 p pmax), typically we take pmax in the range 20-50, can be # # found in O(MlogM) time using techniques developed by Bentley, see for example [Friedman 1977]. We also write M 1 1 γ (p) ' j j (y(j) & y(i))2 (13) 2M i ' 1 L(N[i,p]) j 0 N[i,p] where the y observations are subject to statistical noise assumed independent of x and having bounded variance.10 Under reasonable conditions one can show that Var(r) γ . % A δ % o( ) δ as M 46 (14) where the convergence is in probability. The Gamma Test computes the mean-squared pth nearest neighbour distances (p) (1 p pmax, typically pmax * # # . 10) and the corresponding (p). Next the ( (p), (p)) regression line is computed and the vertical intercept is ( * ( returned as the gamma value. 
Effectively this is the limit of γ(p) as δ(p) → 0, which in theory is Var(r).

A technique which allows one to estimate Var(r), on the hypothesis of an underlying continuous or smooth model f, is of considerable practical utility in applications such as control or time series modelling. The implication of

[10] The original version in [Aðalbjörn Stefánsson 1997] and [Končar 1997] used smoothed versions of δ(p) and γ(p) which rolled off the significance of more distant near neighbours. Later experience showed that this complication was largely unnecessary and the version of the software used here is implemented as described above.
being able to estimate Var(r) in neural network modelling is that one does not then need to train the network (or indeed construct any smooth data model at all) in order to predict the best possible performance with reasonable accuracy.

An appropriate problem type would be one in which the output is expected to be a smooth function of continuously varying inputs. An inappropriate problem type would be one in which many of the inputs take categorical values (e.g. 0/1).
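The computation in (c) can be sketched as follows. This is an illustrative simplification only: it uses a single pth nearest neighbour per point (ignoring the equidistant lists N[i, p]) and a brute-force O(M²) neighbour search rather than the O(M log M) kd-tree method cited above; the function name is invented for the example:

```python
def gamma_test(X, y, pmax=10):
    """Sketch of the Gamma test: delta(p), gamma(p) and the regression intercept."""
    M = len(X)

    def d2(a, b):  # squared Euclidean distance
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    # For each point, all other points sorted by squared distance.
    order = [sorted((d2(X[i], X[j]), j) for j in range(M) if j != i)
             for i in range(M)]

    deltas, gammas = [], []
    for p in range(1, pmax + 1):
        # delta(p): mean squared distance to the pth nearest neighbour, as in (12)
        deltas.append(sum(order[i][p - 1][0] for i in range(M)) / M)
        # gamma(p): half mean squared difference of the y values, as in (13)
        gammas.append(sum((y[order[i][p - 1][1]] - y[i]) ** 2
                          for i in range(M)) / (2 * M))

    # Least-squares regression of gamma on delta; the vertical intercept
    # is returned as the gamma value (the estimate of Var(r)).
    k = len(deltas)
    mx, my = sum(deltas) / k, sum(gammas) / k
    slope = (sum((d - mx) * (g - my) for d, g in zip(deltas, gammas))
             / sum((d - mx) ** 2 for d in deltas))
    return my - slope * mx
```

On noise-free data from a smooth function the returned intercept should be close to zero, while adding noise of variance Var(r) to the outputs moves the intercept towards Var(r), which is what (14) predicts.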