EVOLUTIONARY COMPUTING CMT563




         Antonia J. Jones




                                6 November 2005
Antonia J. Jones: 6 November 2005




UNIVERSITY OF WALES, CARDIFF
DEPARTMENT OF COMPUTER SCIENCE (COMSC)


         COURSE:                     M.Sc. CMT563
         MODULE:                     Evolutionary Computing
         LECTURER:                   Antonia J. Jones, COMSC
         DATED:                      Originally 15 January 1997
         LAST REVISED:               6 November 2005
         ACCESS:                     Lecturer (extn 5490, room N2.15).

Overhead slides are posted on:

                                    http://users.cs.cf.ac.uk:81/Antonia.J.Jones/

electronically as pdf Acrobat files. It is not normally necessary for students attending the course to print this file,
as complete sets of printed slides will be issued.

©2001 Antonia J. Jones. Permission is hereby granted to any web surfer for downloading, printing and use of this
material for personal study only. Copyright permission is explicitly withheld for modification, re-circulation or
publication by any other means, or commercial exploitation in any manner whatsoever, of this file or the material
therein.

Bibliography:

MAIN RECOMMENDATIONS

The recommended text for the course is:

    [Hertz 1991] J. Hertz, A. Krogh, and R. G. Palmer, Introduction to the Theory of Neural Computation.
    Addison-Wesley, 1991. ISBN 0-201-51560-1 (pbk).


A cheaper alternative is

Yoh-Han Pao, Adaptive pattern recognition and neural networks. Addison-Wesley, 1989. ISBN 0-201-12584-6.
Price (UK) £31.45.

A useful addition for the Mathematica labs is:

    Simulating Neural Networks with Mathematica. James A. Freeman. Addison-Wesley, 1994. ISBN
    0-201-56629-X.


These books cover most of the course, except the theory of genetic algorithms. The first is the recommended book
for the course because it gives excellent mathematical analyses of many of the models we shall discuss. The
second includes some interesting material on the application of Bayesian statistics and fuzzy logic to adaptive
pattern recognition; it is clearly written, and the emphasis is on computing rather than physiological models.



The principal sources of inspiration for work in neural and evolutionary computation are:

         • E. R. Kandel, J. H. Schwartz, and T. M. Jessell. Principles of Neural Science (Third Edition),
         Prentice-Hall Inc., 1991. ISBN 0-8385-8068-8.

         • J. D. Watson, Nancy H. Hopkins, J. W. Roberts, Joan A. Steitz, and A. M. Weiner. Molecular
         Biology of the Gene, Benjamin/Cummings Publishing Company Inc., 1988. ISBN 0-8053-9614-4.

When you see how big they are you will understand why! It is a sobering thought that most of the knowledge in
these tomes has been obtained in the last 20 years.

Although extensive references are provided with the course notes (these are also a useful source of information for
projects in Neural Computing), a definitive bibliography for the computing aspects of the subject is:

The 1989 Neuro-Computing Bibliography. Ed. Casimir C. Klimasauskas, MIT Press / Bradford Books. 1989. ISBN
0-262-11134-9.

Finally, the key papers up to 1988 can be found together in:

Neurocomputing: Foundations of Research. Ed. James A. Anderson and Edward Rosenfeld, MIT Press 1988.
ISBN 0-262-01097-6.

NETS - OTHER (HISTORICALLY) INTERESTING MATERIAL

Perceptrons, Marvin Minsky and Seymour Papert, MIT Press 1972. ISBN 0-262-63022-2 (was reprinted recently).

Neural Assemblies, G. Palm, Springer-Verlag, 1982.

Self-Organisation and Associative Memory, T. Kohonen, Springer-Verlag, 1984.

Parallel Models of Associative Memory, G. E. Hinton and J. A. Anderson, Lawrence Erlbaum, 1981.

Connectionist Models and Their Applications, Special Issue of Cognitive Science 9, 1985.

Computer, March 1988. Artificial Neural Systems, IEEE.

Neural Computing Architectures, Ed. I. Aleksander, Kogan Page, December, 1988.

Parallel Distributed Processing. Vol. I Foundations. Vol. II Psychological and Biological Models. David E.
Rumelhart et al., MIT Press / Bradford Books. 1986. ISBN 0-262-18123-1 (Set).

Explorations in Parallel Distributed Processing - A Handbook of Models, Programs, and Exercises. James L.
McClelland and David E. Rumelhart, MIT Press / Bradford Books. 1988. ISBN 0-262-63113-X. (Includes some
very useful software for an IBM PC; there is also a newer version with software for the Mac).

GENERAL

An Introduction to Cybernetics, W. Ross Ashby, John Wiley and Sons, 1964.

A classic text on cybernetics.

Vision: A computational investigation into the human representation and processing of visual information, David
Marr, W. H. Freeman and Company, 1982. ISBN 0-7167-1284-9.

One of the classic works in computational vision.

Artificial Intelligence, F. H. George, Gordon & Breach, 1985.

Useful textbook on AI.

GENETIC ALGORITHMS/ARTIFICIAL LIFE

Artificial Life, Ed. Christopher G. Langton, Addison-Wesley 1989. ISBN 0-201-09356-1 pbk.

A fascinating collection of essays from the first AL workshop at Los Alamos National Laboratory in 1987. The
book covers an enormous range of topics (genetics, self-replication, cellular automata, etc.) on this subject in a very
readable way but with great technical authority. There are innumerable figures, some forty colour plates and even
some simple programs to experiment with. All this leads to a book that is beautifully presented and compulsive
reading for anyone with a modest background in the field.

Synthetic systems that exhibit behaviour characteristic of living systems complement the traditional analysis of
living systems practised by the biological sciences. It is an approach to the study of life that would hardly be
feasible without the advent of the modern computer and may eventually lead to a theory of living systems which
is independent of the physical realisation of the organisms (carbon based, in this neck of the woods).

The primary goal of the first workshop was to collect different models and methodologies from scattered
publications and to present as many of these as possible in a uniform way. The distilled essence of the book is the
theme that Artificial Life involves the realisation of lifelike behaviour on the part of man-made systems consisting
of populations of semi-autonomous entities whose local interactions with one another are governed by a set of
simple rules. Such systems contain no rules for the behaviour of the population at the global level.

Adaptation in Natural and Artificial Systems, John H. Holland, University of Michigan Press, 1975.

The book that started Genetic Algorithms, a classic.

Genetic Algorithms and Simulated Annealing, Ed. Lawrence Davis, Pitman Publishing 1987. ISBN 0-273-08771-1
(UK), 0-934613-44-3 (US).

A collection of interesting papers on GA related subjects.

Genetic Algorithms in Search, Optimization, and Machine Learning, David E. Goldberg, Addison-Wesley, 1989.
ISBN 0-201-15767-5.

The first real text book on GAs.






                                                                     CONTENTS


I What is evolutionary computing? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
         Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
         A general framework for neural models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
         Hebbian learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
         The need for machine learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

II Genetic Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .     14
         Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   14
         The archetypal GA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .        14
         Design issues - what do you want the algorithm to do? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                          18
                  Rapid convergence to a global optimum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                         19
                  Produce a diverse population of near optimal solutions in different `niches' . . . . . . . . . . .                                        19
         * Results and methods related to the TSP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                   20
         Evolutionary Divide and Conquer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                  21
         Chapter references . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .       25

III Hopfield networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   29
         Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   29
         Hopfield nets and energy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .           29
         The outer product rule for assigning weights. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                    31
         Networks for combinatoric search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                 32
         Assignment of weights for the TSP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                  33
         * The Hopfield and Tank application to the TSP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                         36
         Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    37
         Chapter references . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .       37

IV The WISARD model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .           40
       Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .     40
       Wisard model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .         41
       WISARD - analysis of response . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                  43
       Comparison of storage requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                     44
       Chapter references . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .         45

V Feedforward networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .        46
        Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    46
        Backpropagation - mathematical background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                         46
                 The output layer calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                 47
                 The rule for adjusting weights in hidden layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                            48
        The conventional model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .             48
        Problems with backpropagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                 49
        The gamma test - a new technique . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                  50
        * Metabackpropagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .             53
        * Neural networks for adaptive control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                  53
        Chapter references . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .        58

* VI The chaotic frontier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .     59
        Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    59
        Mathematical background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .               59
        Chaos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   60



            Chaos in biology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .      61
            Controlling chaos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .       62
            The original OGY control law . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .              62
            Chaotic conventional neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                  64
            Controlling chaotic neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                 65
                     Control varying T in a particular layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                    67
                     Using small variations of the inputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                   67
            Time delayed feedback and a generic scheme for chaotic neural networks . . . . . . . . . . . . . . . . . . .                                      70
                     Example: Controlling the Hénon neural network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                              71
            Chapter references . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .      73

COURSEWORK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79


                                                                 LIST OF FIGURES


Figure 1-1 The stylised version of a standard connectionist neuron. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Figure 1-2 Information processing capability [From: Mind Children, Hans Moravec]. . . . . . . . . . . . . . . . . 12
Figure 1-3 Storage capacity. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Figure 2-1 Generic model for a genetic algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Figure 2-2 Standard genetic operators. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Figure 2-3 Premature convergence - no sharing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Figure 2-4 Solution to 50 City Problem using Karp's deterministic bisection method 1. . . . . . . . . . . . . . . . 22
Figure 2-5 The EDAC (top) and simple 2-Opt (bottom) time complexity (log scales). . . . . . . . . . . . . . . . . . 23
Figure 2-6 The 200 generation EDAC 5% excess solution for a 5000 city problem. . . . . . . . . . . . . . . . . . . . 24
Figure 2-7 EDACII mid-parent vs offspring correlation for 5000 cities [Valenzuela 1995]. . . . . . . . . . . . . . 25
Figure 2-8 EDACIII mid-parent vs offspring correlation for 5000 cities [Valenzuela 1995]. . . . . . . . . . . . . 25
Figure 3-1 Distance Connections. Each node (i, p) has inhibitory connections to the two adjacent columns whose
        weights reflect the cost of joining the three cities. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Figure 3-2 Exclusion connections. Each node (i, p) has inhibitory connections to all units in the same row and
        column. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Figure 4-1 Schematic of a 3-tuple recogniser. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
Figure 4-2 Continuous response of discriminators to the input word 'toothache' [From Neural Computing
        Architectures, Ed. I Aleksander]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Figure 4-3 A discriminator for centering on a bar [From Neural Computing, I. Aleksander and H. Morton].
          . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Figure 5-1 Solving the XOR problem with a hidden unit. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Figure 5-2 Feedforward network architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Figure 5-3 The previous layer calculation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Figure 5-4 The Water Tank Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Figure 5-5 Architecture for direct inverse neurocontrol. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Figure 5-6 Delta transformation of state differences: maps to the 2-4-2 network inputs for the Water Tank
        Problem. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Figure 5-7 Least squares fit to 200 data points with 20 nearest neighbours: Γ = 0.0332. . . . . . . . . . . . . . . . 56
Figure 5-8 Volume variation without adaptive training. 2-4-2 Network. MSE = 0.052. Linear Planner. . . . 57
Figure 5-9 Temperature variation without adaptive training. 2-4-2 Network MSE = 0.052. Linear Planner.
          . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Figure 5-10 The control signals generated by the 2-4-2 network without adaptive training. Linear Planner.
          . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Figure 6-1 Stable attractor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Figure 6-2 A chaotic time series. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Figure 6-3 The butterfly effect. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61


Figure 6-4 Intervals for which the variables are defined. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
Figure 6-5 Feedforward network as a dynamical system. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Figure 6-6 Chaotic attractor of Wang's neural network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Figure 6-7 The Ikeda strange attractor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Figure 6-8 Attractor for the chaotic 2-10-10-2 neural network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Figure 6-9 Bifurcation diagram x obtained by varying T in the output layer only. . . . . . . . . . . . . . . . . . . . . 68
Figure 6-10 Bifurcation diagram y obtained by varying T in the output layer only. . . . . . . . . . . . . . . . . . . . 68
Figure 6-11 Variations of x from initiation of control. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Figure 6-12 Variations of y from initiation of control. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Figure 6-13 Parameter changes during output layer control. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Figure 6-14 Bifurcation diagram for the output x(t+1) using an external variable added to the input x(t). . . 69
Figure 6-15 Bifurcation diagram for the output y(t+1) using an external variable added to the input x(t). . . 69
Figure 6-16 Variations of x from initiation of control. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Figure 6-17 Variations of y from initiation of control. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Figure 6-18 Parameter changes during input x control. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Figure 6-19 A general scheme for constructing a stimulus-response chaotic recurrent neural network: the chaotic
        "delayed" network is trained on suitable input-output data constructed from a chaotic time series; a
        delayed feedback control is applied to each input line; entry points for external stimulus are suggested,
        with a switch signal to activate the control module during external stimulation; signals on the delay lines
        or output can be observed at the "observation points". . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Figure 6-20 The control signal corresponding to the delayed feedback control shown in Figure 6-21. Note that
        the control signal becomes small. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Figure 6-21 Response signal on x(n-6) with control signal activated on x(n-6) using k = 0.441628, τ = 2 and
        without external stimulation after first 10 transient iterations. After n = 1000 iterations, the control is
        switched off. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Figure 6-22 Response signals on network output x(n), with control signal activated on x(n-6) using k = 0.441628,
        τ = 2 and with constant external stimulation sn added to x(n-6), where sn varies from -1.5 to 1.5 in steps
        of 0.1 at each 500 iterative steps (indicated by the change of hue of the plot points) after 20 initial
        transient steps. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
Figure 6-23 The control signal corresponding to the delayed feedback control shown in Figure 6-22. Note that
        the control signal becomes small even when the network is under changing external stimulation. . 72
Figure 6-24 Response signals on network output x(n), with control setup same as in Figure 6-22 but with
        Gaussian noise r added to external stimulation, i.e. sn + r, with σ = 0.05, at each iteration step. . . . . 73
Figure 6-25 The control signal corresponding to the delayed feedback control shown in Figure 6-24. . . . . 73
Figure 6-26 Response signals on network output x(n), with control experiment setup same as in Figure 6-22 but
        with Gaussian noise r added to external stimulation, i.e. sn + r, with σ = 0.15, at each iteration step.
          . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Figure 6-27 Response signals on network output x(n), with control experiment setup same as in Figure 6-22 but
        with Gaussian noise r added to external stimulation, i.e. sn + r, with σ = 0.3, at each iteration step.
          . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Standard genetic operators. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
Schematic of a 3-tuple recogniser. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

                                                            LIST OF ALGORITHMS

Algorithm 2-1 Archetypal genetic algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                     16
Algorithm 3-1 Hopfield network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .               31
Algorithm 5-1 The Gamma test. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .               52
Algorithm 5-2 Metabackpropagation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                  53
Algorithm 7-1 Generic GA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .            84
Algorithm 7-2 Generic Hopfield net. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .               86







                                                        I What is evolutionary computing?


         "
         dna tsap naht erutan enon-ro-lla na fo yldigir ssel hcum era hcihw seiroeht ot dael lliw siht fo llA
         .retcarahc ,lacitylana erom hcum dna ,lacirotanibmoc ssel hcum a fo eb lliw yehT .cigol lamrof tneserp
         ll i w cigol lamrof fo metsys wen siht taht eveileb su ekam ot snoitacidni suoremun era ereht ,tcaf nI
         si sihT .cigol htiw tsap eht ni deknil elttil neeb s a h h c i h w e n i l picsid rehtona ot resolc evom
         fo trap taht si ti dna ,nnamztloB morf deviecer saw ti mrof eht ni yliramirp ,scimanydomreht
         g n i r u s a e m d n a g n i t al u p i n a m o t s t c e p s a s t i f o e m o s n i t s e r a e n s e m o c h c i h w s c i s y h p l a c i t e r o e h t
                                                               ]403 .p ,5 .loV skroW detcelloC ,nnamueN nov[ ".noitamrofni



Introduction.

Evolutionary computing embraces models of computation inspired by living Nature. For example, evolution of
species by means of natural selection and the genetic operators of mutation, sexual reproduction and inversion can
be considered as a parallel search process. Perhaps we can tackle hard combinatoric search problems in computer
science by mimicking (in a very stylised form) the natural process of evolutionary search.
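As a concrete (if toy) illustration, the sketch below is a minimal generational genetic algorithm applied to the "one-max" problem (maximise the number of 1s in a bit string). This is not code from the course notes: the function name and every parameter value are arbitrary illustrative choices, and the operators are just fitness-proportional selection, one-point crossover and bitwise mutation.

```python
import random

def genetic_algorithm(fitness, n_bits=20, pop_size=40, generations=60,
                      p_cross=0.7, p_mut=0.01, seed=0):
    """Minimal generational GA: fitness-proportional (roulette-wheel)
    selection, one-point crossover and bitwise mutation."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(ind) for ind in pop]
        total = float(sum(scores))

        def select():
            # Spin the roulette wheel: individuals are chosen with
            # probability proportional to their fitness.
            r = rng.uniform(0.0, total)
            acc = 0.0
            for ind, s in zip(pop, scores):
                acc += s
                if acc >= r:
                    return ind
            return pop[-1]

        new_pop = []
        while len(new_pop) < pop_size:
            a, b = select()[:], select()[:]
            if rng.random() < p_cross:            # one-point crossover
                cut = rng.randrange(1, n_bits)
                a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
            for child in (a, b):
                for i in range(n_bits):           # bitwise mutation
                    if rng.random() < p_mut:
                        child[i] ^= 1
                new_pop.append(child)
        pop = new_pop[:pop_size]
    return max(pop, key=fitness)

best = genetic_algorithm(fitness=sum)   # one-max: fitness = number of 1s
print(sum(best))                        # close to the optimum of 20
```

The archetypal GA discussed in Chapter II follows this same generate-evaluate-select loop, differing mainly in representation and in the choice of operators.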

Evolution through natural selection drives the adaptation of whole species, but individual members of a species
can also adapt to a greater or lesser extent. The adaptation of individual behaviour on the basis of experience is
learning and stems from plasticity of the neural structures which convey and process information in animals.

Learning enables us to recognise previously encountered stimuli or experiences and modify our behaviour
accordingly. It facilitates prediction and control of the environment, both essential prerequisites to planning. All
of these are facets of what we loosely call intelligence.

Real-world intelligence is essentially a computational process. This is a contentious assertion known as "the strong
AI position". If it is true then the precise mechanism of computation (the hardware or neural wetware) ought to
be irrelevant to the actual principles of the computational process.

If this is indeed the case then the only obstacles to the construction of a truly intelligent artifact are our own
understanding of the computational processes involved and our technical capability to construct suitable and
sufficiently powerful computational devices.

A general framework for neural models.

Throughout this course we describe a number of neural models, each a variation on the connectionist paradigm
(often called Parallel Distributed Processing - PDP), which in turn is derived from networks of highly stylised
versions of the biological neuron.

It is useful to begin with an analysis of the various components of these models. There are seven major aspects of
a connectionist model:

         •   A set of processing units ui, each producing a scalar output xi(t) (1 ≤ i ≤ n).

         • A connectivity graph which determines the pattern of connections (links) from each unit to each of the
         other units in the network. We shall often suppose that each unit has n inputs, but there is no particular
         reason why all units should have the same number of inputs.



Although it is often convenient for theoretical discussions to consider fully interconnected networks, for very large
networks of either real or artificial neurons the relevant case is that of relatively sparse connectivity. The
connectivity graph then describes the fine topology of the network. This can be useful in practical applications:
for example, in speech recognition networks it is often helpful to have several copies of the same sub-net connected
to temporally distinct inputs. These sub-net copies act as a single feature detector and so can share their weights;
this effectively reduces the number of parameters needed to describe the full network and speeds up learning. It is
sufficient to be given a list of inputs and outputs for each node, since from these we can recover the connectivity
graph.

         !   A set of parameters pi1,...,pik, fixed in number, attached to each unit ui, which are adjusted during
         learning. Most commonly k = n and the parameters are weights wij (1 ≤ j ≤ n), where wij is often taken
         to be associated with the link from j to i, or in biological terms associated with the synaptic gap.

         !   An activation function for each unit, neti = neti(x1,...,xn;pi1,...,pik), which combines the inputs to ui into
         a scalar value. In the commonly used model neti = Σj wijxj.

It is important to realise that the basic principle of neural networks is that of simple (but arbitrary) computational
function at each node. Learning when it occurs can be considered as an adjustment of the parameters associated
with a node based on information locally available to the node. ‘Locally’ here means as specified by the
connectivity graph. This information often takes the form of a correlation between the firings of adjacent nodes,
but it could be a more sophisticated calculation. Thus we are really dealing with a very general class of parallel
algorithms. The concentration on the ‘weights associated with links’ model has arisen partly because of the
biological precedent, because of the extreme simplicity of the computational function of a node, and because this
special case has been shown to be of practical interest.
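This node-level computation is easy to make concrete. The following is a minimal Python sketch of the conventional model (the function names and example weights below are illustrative, not from the notes): a weighted-sum activation followed by a smooth sigmoidal output.

```python
import math

def net_input(weights, inputs):
    """Activation: the weighted sum neti = sum over j of wij * xj."""
    return sum(w * x for w, x in zip(weights, inputs))

def sigmoid(net):
    """Smooth sigmoidal output function f(net) = 1 / (1 + e^-net)."""
    return 1.0 / (1.0 + math.exp(-net))

# One unit with three inputs; the weights are hypothetical.
weights = [0.5, -1.0, 0.25]
inputs = [1.0, 0.0, 2.0]
net = net_input(weights, inputs)   # 0.5 + 0.0 + 0.5 = 1.0
output = sigmoid(net)              # about 0.731
```

Note that the unit's entire job is these two lines of arithmetic; all of the interest lies in how the weights are adjusted by a learning rule.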



       [Figure 1-1: inputs x1(t), x2(t), x3(t), ..., xn(t) feed the activation function neti = neti(x1,...,xn,pi1,...,pik);
       a sigmoidal output function xi = f(neti) then produces the unit's output, available on its output links at t+1.]

       Figure 1-1 The stylised version of a standard connectionist neuron.

         ! An output function xi = f(neti) which transforms the activation function into an output. In the earliest
         models f was a discontinuous step function. However, this poses analytical difficulties for learning
         algorithms so that often now f is a smooth sigmoidal shaped function. In some models f is allowed to vary



         from one unit to another, and we then write fi for f.

         ! A learning rule whereby the parameters associated with each processing unit are modified by
         experience.

         !   An environment within which the system must operate.

A set of processing units. Figure 1-1 illustrates a standard connectionist component. All of the processing of a
connectionist system is carried out by these units. There is no executive or overseer. There are only relatively
simple units, each doing its own relatively simple job. A unit's job is simply to receive input from other units and,
as a function of the input it receives and the current values of its internal parameters, to compute an output value
xi which it sends to the other units. This output is discrete in some models and continuous in others. When the
output is continuous it is often confined to [0,1] or [-1,1]. The system is inherently parallel in that many units carry
out their computations at the same time.

Within any system we are modelling, it is sometimes useful to characterize three types of units: input, output, and
hidden. The hidden units are those whose inputs and outputs are within the system we are modelling. They are not
‘visible’ to outside systems.

A connectivity graph. Each unit passes its output to other units along links. The graph of links represents the
connectivity of the network.

A set of parameters and an activation function. In the conventional model the parameters for unit i are assumed
to be weights wij associated with the link from unit j to unit i. If wij > 0 the link is said to be an excitatory link, if
wij = 0 unit j is effectively not connected to unit i, and if wij < 0 the link is said to be an inhibitory link. In this case
neti is calculated as
                                              neti = Σ (j=1 to n) wij xj                                            (1)



This is a linear function of the inputs and so neti is constant over hyperplanes in the n-dimensional space of inputs
to unit i.

In fact, if one is interested in generalising the computational function of a unit, it is often convenient to associate
the parameters (in the conventional case weights) with the unit, in which case one thinks of the links as passing
activation values and one is no longer constrained to have exactly n (the number of inputs) parameters per unit.
For example, one could have a unit which performed its distinction function by determining whether or not the
input vector lay within some ellipsoid. In this case there would be n parameters associated with the centre of the
ellipsoid and another n parameters associated with the axes. (In addition one could provide the ellipsoid with
rotations which would provide further parameters.) Now the activation function would look like
                                              neti = Σ (j=1 to n) Aij (xj − cij)^2                                  (2)



This is a simple example of a higher order network in which the function neti is not a linear function of the inputs.

An output function. The simplest possible output function f would be the identity function, i.e. just take xi = neti.
However, in this case with the activation function (1) the unit would be performing a totally linear function on the
inputs and, as it turns out, such nets are rather uninteresting.

In any event our unit is not yet making a distinction. In the discrete model the output function is usually





                                              xi = 1  if neti > θi
                                              xi = 0  if neti ≤ θi                                                  (3)

where θi is the threshold, a parameter associated with the unit. However, this creates discontinuities of the
derivatives and so we usually smooth the output function and write

                                              xi = f(neti)                                                          (4)
In the linear case f is some sort of sigmoidal function. For our ellipsoidal example Gaussian smoothing might be
suitable, i.e. f(x) = exp(-x2), so that the output is large (near one) when the input vector is near the centre of the
ellipsoid.
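The ellipsoidal unit with Gaussian smoothing can be sketched in the same way. The function names and parameter values below are hypothetical illustrations of activation (2) in the diagonal (unrotated) case:

```python
import math

def ellipsoid_net(a, c, x):
    """Higher-order activation (2): net = sum over j of a_j * (x_j - c_j)**2,
    i.e. a measure of how far input x lies from the ellipsoid centred at c."""
    return sum(aj * (xj - cj) ** 2 for aj, cj, xj in zip(a, c, x))

def gaussian(net):
    """Gaussian output f(net) = exp(-net): near one when x is near the centre."""
    return math.exp(-net)

a = [1.0, 4.0]   # axis parameters (hypothetical)
c = [0.5, 0.5]   # centre of the ellipsoid
on_centre = gaussian(ellipsoid_net(a, c, [0.5, 0.5]))   # exactly 1.0
far_away = gaussian(ellipsoid_net(a, c, [5.0, 5.0]))    # near 0
```

This makes the contrast with (1) visible: the weighted-sum unit is constant over hyperplanes, while this unit responds on ellipsoidal contours.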

Sometimes the output function is stochastic so that the output of the unit depends in a probabilistic fashion on neti.

For an individual unit the sequence of events in operational mode (not learning) is

           1. Combine inputs to produce activation neti(t).
           2. Compute value of output xi = f(neti).
           3. Place outputs, based on new activation level, on output links (available from t+1 onward).

Changing the processing or knowledge structure in a connectionist model involves modifying the patterns of
interconnections or parameters associated with each unit. This is accomplished by modifying pi1,...,pik (or the wij
in the usual model) through experience using a learning rule.

Virtually all learning rules are based on some variant of a Hebbian principle (discussed in the next section), which
is invariably derived mathematically through some form of gradient descent. An example is the Delta or
Widrow-Hoff rule, in which the modification of weights is proportional to the difference between the actual
activation achieved and the target activation provided by a teacher:

                                              Δwij = η (ti(t) − neti(t)) xj(t),

where η > 0 is constant. This is a generalization of the Perceptron learning rule and is all very well provided we
know the desired values of ti(t).
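A sketch of one Widrow-Hoff step in Python (the function name and learning rate below are illustrative):

```python
def delta_rule_update(weights, x, target, eta=0.1):
    """One Delta (Widrow-Hoff) step: wij += eta * (ti - neti) * xj."""
    net = sum(w * xj for w, xj in zip(weights, x))
    error = target - net
    return [w + eta * error * xj for w, xj in zip(weights, x)]

# Usage: repeated updates drive neti towards the target for this input.
w = [0.0, 0.0]
for _ in range(50):
    w = delta_rule_update(w, [1.0, 1.0], target=1.0)
```

Each step shrinks the error by a constant factor (here 1 − η·‖x‖² = 0.8), so the net input converges geometrically to the target.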

Hebbian learning.

Donald O. Hebb's book The Organization of Behavior (1949) is famous among neural modelers because it
contained the first explicit statement of the physiological learning rule for synaptic modification that has since
become known as the Hebb synapse:

           Hebb rule. When an axon of a cell A is near enough to excite a cell B and repeatedly or persistently takes
           part in firing it, some growth process or metabolic change takes place in one or both cells such that A's
           efficiency, as one of the cells firing B, is increased.

The physiological basis for this synaptic potentiation is now understood more clearly [Brown 1988]. Hebb's
introduction to the book also contains the first use of the word 'connectionism' in the context of neural modeling.
The Hebb rule is not a mathematical statement, though it is close to one. For example, Hebb does not discuss the
various possible ways inhibition might enter the picture, or the quantitative learning rule that is being followed.
This has meant that a number of quite different learning rules can legitimately be called 'Hebbian rules'. We shall
see later that nearly all such learning rules bear a close mathematical relationship to the idea of `gradient descent',
which roughly means that if we wish to move to the lowest point of some error surface a good heuristic is: we
should always tend to go `downhill'. However, for the present chapter we shall conceptualise the Hebb rule in terms
of autocorrelations, i.e. the internal correlations between each pair of components of the pattern vectors we wish



the system to memorise.

Hebb was keenly aware of the `distributed' nature of the representation he assumed the nervous system uses; that
to represent something assemblies of many cells are required and that an individual cell may be a participant
member of many representations at different times. He postulated the formation of cell assemblies representing
learned patterns of activity.
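Conceptualised as autocorrelations, a simple Hebbian rule accumulates products of component firings over the patterns to be memorised. The sketch below uses the outer-product form with zero diagonal (a common convention, not spelled out in the text above):

```python
def hebbian_weights(patterns):
    """Autocorrelation Hebbian rule: wij = sum over patterns of xi * xj,
    with no self-connections (wii = 0)."""
    n = len(patterns[0])
    W = [[0.0] * n for _ in range(n)]
    for x in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    W[i][j] += x[i] * x[j]
    return W

# Usage: one bipolar pattern; components that fire together get positive
# weights, anti-correlated components get negative ones.
W = hebbian_weights([[1, -1, 1]])
```

The resulting symmetric matrix is exactly the autocorrelation (between pairs of pattern components) referred to above.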



The need for machine learning.

Why do we need to discover how to get machines to learn? After all, is it not the case that the most practical
developments in Artificial Intelligence, such as Expert Systems, have emerged from the development of advanced
symbolic programming languages such as LISP or Prolog? Indeed, this is so. But there are convincing arguments
[Bock 1985] which suggest that the technique of simulating human skills using symbolic programs cannot hope,
in the long run, to satisfy the principal goals of AI. Mainly these centre around the time it would take to figure out
the rules and write the software. But first we should consider the evolution of hardware.

How can one measure the overall computational power of an information processing system? There are two obvious
aspects we should consider. Firstly, information storage capacity - a system cannot be very smart if it has little or
no memory. On the other hand, a system may have a vast memory but little or no capacity to manipulate
information; so a second essential measure is the number of binary operations per second. On these two scales
Figure 1-2 illustrates the information processing capability of some familiar biological and technological
information processing systems. In the case of the biological systems these estimates are based on connectionist
models and may be excessively conservative.

We consider each axis independently. As we saw earlier, research in neurophysiology has revealed that the brain
and central nervous system consists of about 10^11 individual parallel processors, called neurons. Each neuron has
roughly 10^4 synaptic connections and if we allow only 1 bit per synapse then each neuron is capable of storing
about 10^4 bits of information. The information capacity of the brain is thus about 10^15 bits. Much of this
information is probably redundant but using this figure as a conservative estimate let us consider when we might
expect to have high-speed memories of 10^15 bits.








      Figure 1-2 Information processing capability [From: Mind Children, Hans Moravec].

Figure 1-3 shows that the amount of high-speed
random access memory that may be conventionally
accessed by a large computer has increased by an order
of magnitude every six years. If we can trust this
simple extrapolation, in generation thirteen, AD
2024-30, the average high speed memory capacity of a
large computer will reach 10^15 bits.

Now consider the evolution of technological processing
power. Remarkably, this follows much the same trend.
Of course, the real trick is putting the two together to
achieve the desired result; it seems relatively unlikely
that we shall be in a position to accomplish this by
2024.

So much for the hardware. Now consider the software. Even adult human brains are not filled to capacity. So
we will assume that 10% of the total capacity, i.e. 10^14 bits, is the extent of the `software' base of an adult
human brain.

Figure 1-3 Storage capacity.

How long will it take to write the programs to fill 10^14 bits (production rules, knowledge bases etc.)? The
currently accepted rate of production of software, from conception through testing, de-bugging and documentation
to installation, is about one line of code per hour. Assuming, generously, that an average line of code contains
approximately 60 characters, or 500 bits, we discover that the project will require 100 million person years! We'll
never get anywhere by trying to program human intelligence into a machine.
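The arithmetic can be checked directly. The figure of 2000 working hours per person-year is an assumption needed to reproduce the quoted estimate; it is not stated in the text:

```python
# Bock's estimate, reproduced under the text's stated assumptions.
software_bits = 1e14            # assumed 'software' base of an adult brain
bits_per_line = 500             # ~60 characters per line of code
lines_needed = software_bits / bits_per_line        # 2e11 lines of code

hours_per_person_year = 2000    # assumption: one working person-year
person_years = lines_needed / hours_per_person_year
print(person_years)             # 1e8, i.e. 100 million person years
```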

What other options are available? One is direct transfer from the human brain to the machine. Considering
conventional transfer rates over a high speed bus this would take about 12 days. The only problem is: nobody has
the slightest idea how to build such a device.

What's left? In the biological world intelligence is acquired every day, therefore there must be another alternative.
Every day babies are born and in the course of time acquire a full spectrum of intelligence. How do they do it? The
answer, of course, is that they learn.

If we assume that the eyes, our major source of sensory input, receive information at the rate of about 250,000 bits
per second, we can fill the 1014 bits of our machine's memory capacity in about 20 years. Now storing sensory input
is not the same thing as developing intelligence, however this figure is in the right ball park. Maybe what we must
do is connect our machine brain to a large number of high-data-rate sensors, endow it with a comparatively simple
algorithm for self organization, provide it with a continuous and varied stream of stimuli and evaluations for its
responses, and let it learn.
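Again the figure can be checked. Assuming roughly 16 waking hours of sensory input per day (an assumption, not stated in the text) reproduces the "about 20 years" estimate:

```python
rate_bits_per_s = 250_000                   # visual input rate from the text
waking_seconds_per_year = 16 * 3600 * 365   # assumption: ~16 waking hours/day
years = 1e14 / (rate_bits_per_s * waking_seconds_per_year)
print(years)                                # roughly 19 years -- "about 20"
```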

This argument may seem cavalier in some aspects. The human brain is highly parallel and somewhat
inhomogeneous in its architecture. It does not clock at high serial speeds and does not access RAM to recall
information for processing. The storage capacity may be vastly greater than the 10^15 bits estimated by Sagan, since
each neuron is connected to as many as 10,000 others and the structure of these interconnections may also store
information. Indeed, although we do not know a great deal about the mechanisms of human memory, we do know
that it is multi-levelled with partial bio-chemical storage. However, none of this invalidates Bock's point that
programming can never be a substitute for learning.








                                              II Genetic Algorithms



Introduction.

The idea that the process of evolutionary search might be used as a model for hard combinatoric search algorithms
developed significantly in the mid-1960s. Evolutionary algorithms fall into the class of probabilistic heuristic
algorithms which one might use to attack NP-complete or NP-hard problems (see, for example [Horowitz 1978],
Chapters 11 and 12), such as the Travelling Salesman/person Problem (TSP). Of course, many of these problems
have significant applications in engineering hardware or software design and commercial optimisation problems,
but the underlying motivation for the study of evolutionary algorithms is principally to try to gain insight into the
evolutionary process itself.

Variously known as genetic algorithms, the phrase coined by the US school stemming from the work of John
Holland [Holland 1975], evolutionary programming, originally developed by L. J. Fogel, A. J. Owens and M. J.
Walsh, again in the US, and Evolutionsstrategie, as studied in Germany at around the same time by I. Rechenberg
and H-P. Schwefel [Schwefel 1965], the subject has exploded over the last 15 years. Curiously, the European and
US schools seemed largely unaware of each other's existence for quite some while.

Evolutionary algorithms have been applied to a variety of problems and offer intriguing possibilities for general
purpose adaptive search algorithms in artificial intelligence, especially, but not necessarily, for situations where
it is difficult or impossible to precisely model the external circumstances faced by the program. Search based on
evolutionary models had, of course, been tried before Holland's introduction of genetic algorithms. However, these
models were based on mutation and not notably successful. The principal difference of the more modern research
is an emphasis on the power of natural selection and the incorporation of a ‘crossover’ operator to mimic the effect
of sexual reproduction.

Two rather different types of theoretical analysis have developed for evolutionary algorithms: the classical
approach stemming from the original work of Mendel on heritability and the later statistical work of Galton and
Pearson at the end of the last century, and the Schema theory approach developed by Holland.

Mendel constructed a chance model of heritability involving what are now called genes. He conjectured the
existence of genes by pure reasoning - he never saw any. Galton and Pearson found striking statistical regularities
in heritability in large populations, for example, on average a son is halfway between his father's height and the
overall average height for sons. They also invented many of the statistical tools in use today such as the scatter
diagram, regression and correlation (see, for example, [Freedman 1991]). Around 1920 Fisher, Wright and
Haldane more or less simultaneously recognised the need to recast Darwinian theory as described by Galton and
Pearson in Mendelian terms. They succeeded in this task, and more recently Price's Covariance and Selection
Theorem [Price 1970], [Price 1972], an elaboration of these ideas, has provided a useful tool for algorithm analysis.

The archetypal GA.

In Nature each gene has several forms or alternatives - alleles - producing differences in the set of characteristics
associated with that gene, e.g. certain strains of garden pea have a single gene which determines blossom colour,
one allele causing the blossom to be white, the other pink. There are tens of thousands of genes in the
chromosomes of a typical vertebrate, each of which, on the available evidence, has several alleles. Hence the set
of chromosomes attained by taking all possible combinations of alleles contains on the order of 10 to the 3,000
structures for a typical vertebrate species. Even a very large population, say 10 billion individuals, contains only
a minuscule fraction of the possibilities.



A further complication is that alleles interact so that adaptation becomes primarily the search for co-adapted sets
of alleles. In the environment against which the organism is tested any individual exemplifies a large number of
possible `patterns of co-adapted alleles' or schema, as Holland calls them. In testing this individual we shall see
that all schema of which the individual is an instantiation are also tested. If the rules whereby genes are combined
have a tendency to generate new instances of above average schema then the resulting adaptive system has a high
degree of `intrinsic parallelism'1 which accelerates the evolutionary process. Considerations of this type offer an
explanation of how evolution can proceed at all. If a simple enumerative plan were employed and if 10 to the 12
structures could be tried every second it would take a time vastly exceeding the estimated age of the universe to
test 10 to the 100 structures.

The basic idea of an evolutionary algorithm is illustrated in Figure 2-1.




       [Figure 2-1: INITIALISE - create an initial population and evaluate the fitness of each member. Then loop:
       create children from the existing population using genetic operators and substitute them into the population,
       deleting an equivalent number (internal); evaluate the fitness of the children (external).]

                 Figure 2-1 Generic model for a genetic algorithm.


We seek to optimise members of a population of ‘structures’. These structures are encoded in some manner by a
‘gene string’. The population is then ‘evolved’ in a very stylised version of the evolutionary process.

We are given a set, A, of `structures' which we can think of, in the first instance, as being a set of strings of fixed
length l. The object of the adaptive search is to find a structure which performs well in terms of a measure of
performance v : A → ℝ+, where ℝ+ denotes the positive real numbers.


   1 The notion of 'intrinsic parallelism' will be discussed but it should be mentioned that it has nothing to do with parallelism in the sense
normally intended in computing.


The programmer must provide a representation for the structures to be optimised. In the terminology of genetic
algorithms a particular structure is called a phenotype and its representation as a string is called a chromosome
or genotype. Usually this representation consists of a fixed length string in which each component, or gene, may
take only a small range of values, or alleles. In this context `small' often means two, so that binary strings are used
for the genotypes.

There is nothing obligatory in taking a one-bit range for each allele but there are theoretical reasons to prefer
few-alleles-at-many-sites over many-alleles-at-few-sites (the arguments have been given by [Holland 1975] (p. 71)
and [Smith 1980] (p. 56), and supporting evidence for the correctness of these arguments has been presented by
[Schaffer 1984] (p. 107)).




                1. Randomly generate a population of M structures

                                                  S(0) = {s(1,0),...,s(M,0)}.

                2. For each new string s(i,t) in S(t), compute and save its measure of utility v(s(i,t)).

                3. For each s(i,t) in S(t) compute the selection probability defined by

                          p(i,t) = v(s(i,t)) / Σi v(s(i,t)).

                4. Generate a new population S(t+1) by selecting structures from S(t) via the selection
                probability distribution and applying the idealised genetic operators to the structures
                generated.

                5. Goto 2.


   Algorithm 2-1 Archetypal genetic algorithm.
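Algorithm 2-1 can be sketched in a few lines of Python. Everything below (the function name, population size, mutation rate and the one-max goal function in the usage line) is an illustrative choice, with single-point crossover and pointwise mutation standing in for the "idealised genetic operators" of step 4:

```python
import random

def archetypal_ga(v, length, M=20, generations=50, p_mut=0.01):
    """Sketch of Algorithm 2-1 on binary strings. v is the utility measure
    and must be positive, as in v : A -> R+."""
    # Step 1: randomly generate a population of M structures.
    pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(M)]
    for _ in range(generations):
        # Step 2: compute the utility of each string.
        fitness = [v(s) for s in pop]
        # Step 3: fitness-proportional selection probabilities.
        total = sum(fitness)
        probs = [f / total for f in fitness]

        def select():
            r, acc = random.random(), 0.0
            for s, p in zip(pop, probs):
                acc += p
                if r <= acc:
                    return s
            return pop[-1]

        # Step 4: generate S(t+1) by selection plus genetic operators.
        new_pop = []
        while len(new_pop) < M:
            a, b = select(), select()
            x = random.randint(1, length - 1)      # one crossover point
            child = a[:x] + b[x:]
            child = [1 - g if random.random() < p_mut else g for g in child]
            new_pop.append(child)
        pop = new_pop                              # step 5: goto 2
    return max(pop, key=v)

# Usage ("one-max"): v counts the 1s; the +1 keeps fitness strictly positive.
best = archetypal_ga(lambda s: 1 + sum(s), length=16)
```

Note that here each pair of parents contributes a single child, a simple control regime; many other replacement schemes fit the same skeleton.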

The function v provides a measure of ‘fitness’ for a given phenotype and (since the programmer must also supply
a mapping, say r, from the set of genotypes to the set of phenotypes) hence for a given genotype. Given a particular
genotype or string, the goal function provides a means for calculating the probability that the string will be selected
to contribute to the next generation. It should be noted that the composition function v(r) mapping genotypes to
fitness is invariably discontinuous; nevertheless genetic algorithms cope remarkably well with this difficulty.

The basis of Darwinian evolution is the idea of natural selection, i.e. population genetics tends to use the

         Selection Principle. The fitness of an individual is proportional to the probability that it will
         reproduce effectively.2

In genetic algorithm design we tend to apply this in the converse form: the probability that an individual will
reproduce is proportional to its fitness. ‘Fit’ strings, i.e. strings having larger goal function values, will be more
likely to be selected but all members of the population will have some chance to contribute.



    2 Obfuscation of the definition of ‘fitness’ occurs frequently in the classical literature. The reasons are not
difficult to understand. Both Darwin and Fisher found it hard to swallow that the lower classes bred more
prolifically and were therefore, by definition, ‘fitter’ than their ‘social superiors’. This confusion regarding ‘fitness’
still occurs in the GA literature for different reasons.


The box contains a sketch of the standard serial style genetic algorithm. Typically the evaluation of the goal
function for a particular phenotype, a process which strictly speaking is external to the genetic algorithm itself,
is the most time consuming aspect of the computation.

Given the mapping from genotype to phenotype, the goal function, and an initial random population the genetic
algorithm proceeds to create new members of the population (which progressively replace the old members) using
genetic operators, typically mutation, crossover and inversion, modelled on their biological analogs.

For the moment we represent strings as a1a2a3...al [ai = 1 or 0].

Using this notation we can describe the operators by which strings are combined to produce new strings. It is the
choice of these operators which produces a search strategy that exploits co-adapted sets of structural components
already discovered. Holland uses three such principal operators: Crossover, Mutation and Inversion (which we
shall not discuss in detail here).

                           CROSSOVER (two cut points)
                           Parent 1          1011 010011 10111
                           Parent 2          1100 111000 11010
                           Child 1           1100 010011 11010
                           Child 2           1011 111000 10111

                           MUTATION
                           110011100011010  ->  111011101011010

                           INVERSION
                           111111100011010  ->  110011111011010

       Figure 2-2 Standard genetic operators.

Crossover. In crossover one or more cut points are selected at random and the operation illustrated in Figure 2-2,
Figure 7-1 (where two cut points are employed) is used to create two children. A variety of control regimes are
possible, but a simple strategy might be `select one of the children at random to go into the next generation'.
Children tend to be `like' their parents, so that crossover can be considered as a focussing operator which exploits
knowledge already gained; its effects are quite quickly apparent.

Crossing over proceeds in three steps.


         a) Two structures a1...al and b1...bl are selected at random from the current population.

         b) A crossover point x, in the range 1 to l-1 is selected, again at random.

         c) Two new structures

                                                  a1a2...axbx+1bx+2...bl
                                                  b1b2...bxax+1ax+2...al
         are formed.

In modifying the pool of schema (discussed below), crossing over continually introduces new schema for trial
whilst testing extant schema in new contexts. It can be shown that each crossing over affects a great number of
schema.
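Steps (a)-(c) amount to a few lines of Python; `one_point_crossover` is an illustrative name, and the parents in the usage line are arbitrary:

```python
import random

def one_point_crossover(a, b):
    """Steps (a)-(c): choose a crossover point x in 1..l-1 at random,
    then exchange the tails of the two parent strings."""
    x = random.randint(1, len(a) - 1)
    return a[:x] + b[x:], b[:x] + a[x:]

# Usage: each child takes its head from one parent, its tail from the other.
c1, c2 = one_point_crossover("10110100", "11001110")
```

Whatever x is chosen, at every position the two children between them carry exactly the two parental alleles; only the pairing changes.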




There is large variation in the crossover operators which have been used by different experimenters. For example,
it is possible to cross at more than one point. The extreme case of this is where each allele is randomly selected
from one or other parent string with uniform probability - this is called uniform crossover. Although some writers
have argued in favour of uniform crossover, there would seem to be theoretical arguments against its use viz. if
evolution is the search for co-adapted sets of alleles then this search is likely to be severely undermined if many
cut points are used. In language we shall develop shortly: the probability of schema disruption when using uniform
crossover is much higher than when using one or two point crossover.

The design of the crossover operator is strongly influenced by the nature of the representation. For example, if the
problem is the TSP and the representation of a tour is a straightforward list of cities in the order in which they are
to be visited then a simple crossover operator will, in general, not produce a tour. In this case the options are:

         !   Change the representation.

         !   Modify the crossover operator.

or       !   Effect ‘genetic repair’ on non-tours which may result.

There is obviously much scope for experiment for any particular problem. The danger is that the resulting
algorithm may be so far removed from the canonical form that the correlation between parental and child fitness
may be small - in which case the whole justification for the method will have been lost.

Mutation. In mutation an allele is altered at each site with some fixed probability. Mutation disperses the
population throughout the search space and so might be considered as an information gathering or exploration
operator. Search by mutation is a slow process analogous to exhaustive search. Thus mutation is a ‘background’
operator, assuring that the crossover operator has a full range of alleles so that the adaptive plan is not trapped on
local optima.

Each structure a1a2...al in the population is operated upon as follows. Position x is modified, with probability p
independent of the other positions, so that the string is replaced by

                                                 a1a2...ax-1 z ax+1...al

where z is drawn at random from the possible allele values. If p is the probability of mutation at a single position
then the number of mutations h in a given string of length l follows a binomial distribution, which for small p is
well approximated by a Poisson distribution with parameter lp.
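The per-site operator can be sketched in a few lines of Python (illustrative, not from the notes):

```python
import random

def mutate(string, alleles, p):
    """Independently replace each position, with probability p, by an
    allele drawn uniformly at random from the allowed values (the draw
    may return the allele already present)."""
    return [random.choice(alleles) if random.random() < p else a
            for a in string]
```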

A simple demonstrator is given in the Mathematica program GA_Simple.nb. A more complicated GA using
Inversion is given in GA_Inversion.nb.

Design issues - what do you want the algorithm to do?

Now we have to ask just what it is we want of a genetic algorithm. There are several, sometimes mutually exclusive,
possibilities. For example:

         !   Rapid convergence to a global optimum.

         !   Produce a diverse population of near optimal solutions in different ‘niches’.

         !   Be adaptive in ‘real-time’ to changes in the goal function.

We shall deal with each of these in turn but first let us briefly consider the nature of the search space. If the space
is flat with just one spike then no algorithm short of exhaustive search will suffice. If the space is smooth and
unimodal then a conventional hill-climbing technique should be used.



Somewhere between these two extremes are problems in which the goal function is a highly non-linear multi-
modal function of the gene values - these are the problems of hard combinatoric search for which some style of
genetic algorithm may be appropriate.

Rapid convergence to a global optimum.

Of course this is rather simplistic. Holland's theory holds for large populations. However, in many AI applications
it is computationally infeasible to use large populations and this in turn leads to a problem commonly referred to
as Premature Convergence (to a sub-optimal solution) or Loss of Diversity in the literature of genetic algorithms.
When this occurs the population tends to become dominated by one relatively good solution and locked into a sub-
optimal region of the search space. For small populations the schema theorem is actually an explanation for
premature convergence (i.e. the failure of the algorithm) rather than a result which explains success.

Premature convergence is related to a phenomenon observed in Nature. Allelic frequencies may fluctuate purely
by chance about their mean from one generation to another; this is termed Random Genetic Drift. Its effect on the
gene pool in a large population is negligible, but in a small effectively interbreeding population, chance alteration
in Mendelian ratios can have a significant effect on gene frequencies and can lead to the fixation of one allele and
loss of another. For example, isolated communities within a given population have been found to have frequencies
for blood group alleles different from the population as a whole. Figure 2-3 illustrates this phenomenon with a
simple function optimisation genetic algorithm.
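Random genetic drift is easy to reproduce in simulation. A minimal sketch (illustrative Python, assuming a Wright-Fisher style model in which each generation's allele count is resampled from the previous generation's frequency):

```python
import random

def drift(freq, pop_size, generations):
    """Track the frequency of one allele in a finite population until
    it is fixed (freq 1.0), lost (freq 0.0), or time runs out."""
    for _ in range(generations):
        # each of the pop_size gene copies is drawn from the current pool
        count = sum(random.random() < freq for _ in range(pop_size))
        freq = count / pop_size
        if freq in (0.0, 1.0):      # fixation or loss
            break
    return freq
```

With pop_size = 10 the allele is typically fixed or lost within a few tens of generations; in a large population the frequency wanders only slowly about its starting value, which is the contrast described above.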

The inexperienced often tend to attempt to counteract
premature convergence by increasing the rate of mutation.
However, this is not a good idea.

         !  A high rate of mutation tends to devalue the
         role of crossover in building co-adapted sets of
         alleles and in essence pushes the algorithm in
         the direction of exhaustive search. Whilst some
         mutation is necessary a high rate of mutation is
         invariably counter-productive.

In trying to counteract premature convergence we are essentially trying to balance the exploitation of good
solutions found so far against the exploration which is required to find hitherto unknown promising regions of
the search space. It is worth observing that, in computational terms, any algorithm which often inserts copies of
strings into the current population is wasteful. This is true for the Traditional Genetic Algorithm (TGA)
outlined as 2, 7-1.

Figure 2-3 Premature convergence - no sharing.

Produce a diverse population of near optimal solutions in different `niches'.

The problem of premature convergence has been addressed by a number of authors using a diversity of techniques.
Many of the papers in [Davis 1987] contain discussions of precisely this point. The methods used to combat
premature convergence in TGAs are not necessarily appropriate to the parallel formulations of genetic algorithms
(PGAs) which we shall discuss shortly.

Cavicchio, in his doctoral dissertation, suggested a preselection mechanism as a means of promoting genotype
diversity. Preselection filters children generated, possibly picking the fittest, and replaces parent members of the
population with their offspring [Cavicchio 1970].




De Jong's crowding scheme is an elaboration of the preselection mechanism. In the crowding scheme, an offspring
replaces the most similar string from a randomly drawn subpopulation having size CF (the crowding factor) of the
current population. Thus a member of the population experiences a selection pressure in proportion to its similarity
to other members of the population [De Jong 1975]. Empirical determination of CF with a five function test bed
determined CF = 3 as optimal.
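The crowding replacement step can be sketched directly (illustrative Python, not from the notes; genotypes are taken to be equal-length strings compared by Hamming distance):

```python
import random

def hamming(a, b):
    """Number of loci at which two genotypes differ."""
    return sum(x != y for x, y in zip(a, b))

def crowding_insert(population, offspring, cf=3):
    """De Jong crowding: the offspring replaces the most similar member
    of a randomly drawn subpopulation of size cf (the crowding factor)."""
    candidates = random.sample(range(len(population)), cf)
    victim = min(candidates, key=lambda i: hamming(population[i], offspring))
    population[victim] = offspring
```

Because similar strings are the most likely to be displaced, dense clusters of near-identical genotypes feel the strongest replacement pressure, which is how the scheme preserves diversity.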

Booker implemented a sharing method in a classifier system environment which used the bucket brigade algorithm
[Booker 1982]. The idea here was that if related rules share payments then sub-populations of rules will form
naturally. However, it seems difficult to apply this mechanism to standard genetic algorithms. Schaffer has
extended the idea of sub-populations in his VEGA model in which each fitness element has its own sub-population
[Schaffer 1984].

A different approach to help maintain genotype diversity was introduced by Mauldin via his uniqueness operator
[Mauldin 1984]. The uniqueness operator helped to maintain diversity by incorporating a `censorship' operator
in which the insertion of an offspring into the population is possible only if the offspring is genotypically different
from all members of the population at a number of specified genotypical loci.

* Results and methods related to the TSP.

We digress briefly to give a little more detailed background material on the TSP. The question is often asked: if
one cannot exactly solve any very large TSP problem (except in special cases; at present `very large' means a
problem involving more than a thousand cities), how can one know how accurate a solution produced by a
probabilistic or heuristic algorithm actually is?

The best exact solution methods for the travelling salesman problem are capable of solving problems of several
hundred cities [Grötschel 1991], but unfortunately excessive amounts of computer time are used in the process and,
as N increases, any exact solution method rapidly becomes impractical. For large problems we therefore have no
way of knowing the exact solution, but in order to gauge the solution quality of any algorithm we need a reasonably
accurate estimate of the minimal tour length. This is usually provided in one of two ways.

For a uniform distribution of cities the classic work by Beardwood, Halton and Hammersley (BHH) [Beardwood
1959] obtains an asymptotic best possible upper bound for the minimum tour length for large N. Let {Xi}, 1 ≤ i < ∞,
be independent random variables uniformly distributed over the unit square, and let LN denote the length of the
shortest closed path which connects all the elements of {X1,...,XN}. In the case of the unit square they proved, for
example, that there is a constant c > 0 such that, with probability 1,

                                                 lim_{N→∞} LN N^(-1/2) = c                                   (1)

In general c depends on the geometry of the region considered.

One can use the estimate provided by the BHH theorem in the following form: the expected length LN* of a minimal
tour for an N-city problem, in which the cities are uniformly distributed in a square region of the Euclidean plane,
is given by

                                                 LN* ≈ c2 √(NR)                                              (2)

where R is the area of the square and the constant (for historical reasons known as Stein's constant - [Stein 1977])
c2 ≈ 0.70805 ± 0.00007 was recently estimated by Johnson, McGeoch and Rothberg [Johnson 1996].

A second possibility would be to use a problem specific estimate of the minimal tour length which gives a very
accurate estimate: the Held-Karp lower bound [Held 1970], [Held 1971]. Computing the Held-Karp lower bound
is an iterative process involving the evaluation of Minimal Spanning Trees for N-1 cities of the TSP followed by
Lagrangean relaxations, see [Valenzuela 1997].



If one seeks approximate solutions then various algorithms based on simple rule based heuristics (e.g. nearest
neighbour and greedy heuristics), or local search tour improvement heuristics (e.g. 2-Opt, 3-Opt and Lin-
Kernighan), can produce good quality solutions much faster than exact methods. A combinatorial local search
algorithm is built around a `combinatoric neighbourhood search' procedure, which given a tour, examines all tours
which are closely related to it and finds a shorter `neighbouring' tour, if one exists. Algorithms of this type are
discussed in [Papadimitriou 1982]. The definition of `closely related' varies with the details of the particular local
search heuristic.

The particularly successful combinatorial local search heuristic described by Lin and Kernighan [Lin 1973] defines
`neighbours' of a tour to be those tours which can be obtained from it by doing a limited number of interchanges
of tour edges with non-tour edges. The slickest local heuristic algorithms3, which on average tend to have
complexity O(n^α), for α > 2, can produce solutions with approximately 1-2% excess for 1000 cities in a few
minutes. However, for 10,000 cities the time escalates rapidly and one might expect that the solution quality also
degrades, see [Gorges-Schleuter 1990], p 101.
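The flavour of such tour improvement heuristics can be seen in a minimal 2-Opt sketch (illustrative Python, a first-improvement variant; far slower than the optimised implementations discussed above, which avoid recomputing whole tour lengths):

```python
import math

def tour_length(tour, pts):
    """Total Euclidean length of a closed tour over the given points."""
    return sum(math.dist(pts[tour[i]], pts[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def two_opt(tour, pts):
    """Repeatedly reverse a tour segment while doing so exchanges two
    tour edges for two shorter non-tour edges."""
    improved = True
    while improved:
        improved = False
        for i in range(len(tour) - 1):
            for j in range(i + 2, len(tour)):
                new = tour[:i + 1] + tour[i + 1:j + 1][::-1] + tour[j + 1:]
                if tour_length(new, pts) < tour_length(tour, pts) - 1e-12:
                    tour, improved = new, True
    return tour
```

Here the `neighbours' of a tour are exactly those reachable by one segment reversal, i.e. one exchange of two tour edges for two non-tour edges.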

An approximation scheme A is an algorithm which given problem instance I and ε > 0 returns a solution of length
A(I, ε) such that

                                           |A(I, ε) - Ln(I)| / Ln(I) ≤ ε                                     (3)

Such an approximation scheme is called a fully polynomial time approximation scheme if its run time is bounded
by a function that is polynomial in both the instance size and 1/ε. Unfortunately the following theorem holds, see
for example [Lawler 1985], p165-166.

Theorem. If P ≠ NP then there can be no fully polynomial time approximation scheme for the TSP, even if
instances are restricted to points in the plane under the Euclidean metric.

Although the possibility of a fully polynomial time approximation scheme is effectively ruled out, there remains
the possibility of an approximation scheme that, although it is not polynomial in 1/ε, does have a running time
which is polynomial in n for every fixed ε > 0. The Karp algorithms, based on cellular dissection, provide
`probabilistic' approximation schemes for the geometric TSP.

Theorem [Karp 1977]. For every ε > 0 there is an algorithm A(ε) such that A(ε) runs in time C(ε)n + O(nlogn)
and, with probability 1, A(ε) produces a tour of length not more than (1 + ε) times the length of a minimal tour.

The Karp-Steele algorithms [Steele 1986] can in principle converge in probability to near optimal tours very
rapidly. Cellular dissection is a form of divide and conquer. Karp's algorithms partition the region R into small
subregions, each containing about t cities. An exact or heuristic method is then applied to each subproblem and
the resulting sub-tours are finally patched together to yield a tour through all the cities.

Evolutionary Divide and Conquer.

Until recently the best genetic algorithms designed for TSP problems have used permutation crossovers for
example [Davis 1985], [Goldberg 1985], [Smith 1985], or edge recombination operators [Whitley 1989], and
required massive computing power to gain very good approximate solutions (often actually optimal) to problems
with a few hundred cities [Gorges-Schleuter 1990]. Gorges-Schleuter cleverly exploited the architecture of a
transputer bank to define a topology on the population and introduce local mating schemes which enabled her to
delay the onset of premature convergence. However, this improvement to the genetic algorithm is independent of



   3
     The most impressive results in this direction are due to David Johnson at AT&T Bell Laboratories - mostly reported in unpublished
Workshop presentations.


any limitations inherent in permutation crossovers. Eventually, for problems of more than around 1000 cities, all
such genetic algorithms tend to produce a flat graph of improvement against number of individuals tested, no
matter how long they are run.

Thus experience with genetic algorithms using permutation operators applied to the Geometric Travelling
Salesman Problem (TSP) suggests that these algorithms fail in two respects when applied to very large problems:
they scale rather poorly as the number of cities n increases, and the solution quality degrades rapidly as the problem
size increases much above 1000 cities. An interesting novel approach developed by Valenzuela and Jones
[Valenzuela 1994] which seeks to circumvent these problems is based on the idea of using the genetic algorithm
to explore the space of problem subdivisions, rather than the space of solutions itself.

This alternative method, for genetic algorithms applied to hard combinatoric search, can be described as
Evolutionary Divide and Conquer (EDAC), and the approach has potential for any search problem in which
knowledge of good solutions for subproblems can be exploited to improve the solution of the problem itself. As they
say
         ! Essentially we are suggesting that intrinsic parallelism is no substitute for divide and conquer in hard combinatoric search and
         we aim to have both. [Valenzuela 1994]

The goal was to develop a genetic algorithm capable of producing reasonable quality solutions for problems of
several thousand cities, and one which will scale well as the problem size n increases. `Scaling well' in this context
almost inevitably means a time complexity of O(n) or at worst O(nlogn). This is a fairly severe constraint: for
example, given a list of n city co-ordinates, the simple act of computing all possible edge lengths, an O(n^2)
operation, is excluded. Such an operation may be tolerable for n = 5000 but becomes intolerable for n = 100,000.

In the previous section we mentioned the Karp and Steele cellular dissection algorithms, and it is this technique
which is the basis of the Valenzuela-Jones EDAC genetic algorithms for the TSP.




          Figure 2-4 Solution to 50 City Problem using Karp's deterministic bisection method 1.





In practice a one-shot deterministic Karp algorithm yields
rather poor solutions, typically 30% excess (with simple
patching) when applied to 500 - 1000 city problems.
Nevertheless, the Karp technique is a good starting point
for exploring EDAC applied to the TSP. There are several
reasons. First, according to Karp's theorem there is some
probabilistic asymptotic guarantee of solution quality as
the problem size increases. Second, the time complexity is
about as good as one can hope for, namely O(nlogn). The
run time of a genetic algorithm based on exploring the
space of `Karp-like' solutions will be proportional to nlogn
multiplied by the number of times the Karp algorithm is
run, i.e. the number of individuals tested.

Karp's algorithm proceeds by partitioning the problem recursively from the top down. At each step the current
rectangle is bisected horizontally or vertically, according to a deterministic rule designed to keep the rectangle
perimeter minimal. This bisection proceeds until each subrectangle contains a preset maximum number of cities
t (typically t ≈ 10). Each small subproblem is then solved and the resulting subtours are patched together to
produce a solution to the original problem - see Figure 2-4.

Figure 2-5 The EDAC (top) and simple 2-Opt (bottom) time complexity (log scales).
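The recursive partitioning step can be sketched as follows (illustrative Python, a simplification: here each rectangle is cut at the median city perpendicular to its longer side, which keeps perimeters small; Karp's actual deterministic rule and the subtour patching step are more refined):

```python
def karp_partition(cities, t=10):
    """Recursively bisect the cities until each cell holds at most t of
    them; returns the list of city subsets, one per final subrectangle."""
    if len(cities) <= t:
        return [cities]
    xs = [c[0] for c in cities]
    ys = [c[1] for c in cities]
    # cut perpendicular to the longer side of the bounding rectangle
    axis = 0 if max(xs) - min(xs) >= max(ys) - min(ys) else 1
    ordered = sorted(cities, key=lambda c: c[axis])
    mid = len(ordered) // 2
    return karp_partition(ordered[:mid], t) + karp_partition(ordered[mid:], t)
```

Each returned subset would then be toured exactly (or heuristically) and the subtours patched together, as described above.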

In the EDAC algorithm the genotype is a p X p binary array in which a `1' or `0' indicates whether to cut
horizontally or vertically at the current bisection. If we maintain the subproblem size, t, and increase the number
of cities in the TSP, then a partition better than Karp's becomes progressively harder to find by randomly choosing
a horizontal or vertical bisection at each step. If the problem size is n ≈ 2^k t, where 2^k is the number of subsquares,
then the corresponding genotype requires at least n/t - 1 bits. The size of the partition space is 2 to the power p^2,
which for p = 80 (the value used for n = 5000) is approximately exp(4436). For n = 5000 the size of the permutation
search space, roughly estimated using Stirling's formula, is around exp(37586). Thus searching partition space is
easier than searching permutation space and this provides a third argument in favour of exploring this
representation of problem subdivision as a genotype. We know from Karp's theorem that the class of tours
produced by dissection and patching will have representatives very close to the optimum tour, so by restricting
attention to this smaller set one is not `throwing out the baby with the bath-water', i.e. the set may be smaller but
it nevertheless contains near optimal tours.
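The two search-space sizes quoted above are easy to check numerically (illustrative Python; `log_tour_space` counts the (n-1)!/2 distinct closed tours via the log-gamma function, so no huge integers are formed):

```python
import math

def log_partition_space(p):
    """Natural log of 2**(p*p), the number of p x p binary cut arrays."""
    return p * p * math.log(2)

def log_tour_space(n):
    """Natural log of (n-1)!/2, the number of distinct closed tours on
    n cities, computed exactly via lgamma(n) = ln((n-1)!)."""
    return math.lgamma(n) - math.log(2)
```

For p = 80 this gives ln(2^6400) ≈ 4436, and for n = 5000 a tour space of around exp(37580), in line with the figures in the text: the partition space, while still astronomically large, is vastly smaller than the permutation space.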

This approach contrasts sharply with the idea of `broadcast languages' mooted in Chapter 8 of [Holland 1975], in
which techniques for searching the space of representations for a genetic algorithm are discussed. In general the
space of representations is vastly larger than the search space of the problem itself, but we have seen with the TSP
that this space is already so huge that it is impractical to search in any comprehensive fashion for all except the
smallest problems. Hence, it seems unlikely that replacing the original search space by an even larger one will turn
out to be a productive approach.








  Figure 2-6 The 200 generation EDAC 5% excess solution for a 5000 city problem.


In any event even the EDAC algorithm requires clever recursive repair techniques to improve the accuracy when
subtours are patched together. Nevertheless, the algorithm scales well. Figure 2-5 compares the EDAC algorithms
with simple 2-Opt (which gives an accuracy of around 8% excess). This version of the EDAC algorithm produces
solutions at the 5% level, see Figure 2-6, but a later more elaborate variant reliably produces solutions with around
1% excess and has been tested on problem sizes of up to 10,000 cities.

This technique probably represents the best that can be done at the present time using genetic algorithms for the
TSP. It is not yet practical by comparison with iterated Lin-Kernighan (or even 2-Opt)4, but it scales well and may
eventually offer a viable technique for obtaining good solutions to TSP problems involving several hundred
thousand cities.

Parallel EDACII and EDACIII were both tested on a range of problems between 500 and 5000 cities. Parental pairs
were chosen from the initial random population and the mid-parent value of the tour lengths calculated and
recorded. Crossover and mutation were then applied to each selected parental pair and the tour length evaluated
for the resulting offspring. Pearson's correlation coefficient, rxy, was calculated in each experiment and significance
tests based on Fisher's transformation carried out in order to establish whether the resulting correlation coefficients
differed significantly from zero (i.e. no correlation). Scatter diagrams in Figure 2-7 and Figure 2-8 illustrate the
Price correlation for parallel EDACII and EDACIII on the 5000 city problem.
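The significance test used here is standard. A minimal sketch (illustrative Python, not from the notes): under H0: ρ = 0, Fisher's transformation z = atanh(r) is approximately normal with standard deviation 1/√(n - 3), so the statistic below is compared against standard normal critical values (e.g. |z| > 1.96 for a two-sided test at the 5% level).

```python
import math

def fisher_z_test(r, n):
    """Approximately standard-normal test statistic for H0: rho = 0,
    given a sample Pearson correlation r from n (x, y) pairs."""
    return math.atanh(r) * math.sqrt(n - 3)
```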

Although the genotype used in these experiments was a binary array it could more naturally (at the cost of
complication in the coding) be represented by a pair of binary trees, or a quadtree. The use of trees here would be
more in keeping with the recursive construction of the phenotype from the genotype, a process analogous to



   4
     For example, wildly extrapolating the figures gives the breakeven point with 2-Opt at around n = 422,800 requiring some 74 cpu days!
Of course, other things would collapse before then.


growth, and it is possible to produce a modified Schema theorem for the case of trees, where the genetic
information is encoded in the shape of the tree and information placed at leaf nodes.



Figure 2-7 EDACII mid-parent vs offspring correlation for 5000 cities [Valenzuela 1995].

Figure 2-8 EDACIII mid-parent vs offspring correlation for 5000 cities [Valenzuela 1995].

In Nature very complex phenotypical structures frequently give the appearance of having been constructed
recursively from the genotype. Examples of recursive algorithms which lead to very natural looking graphical
representations of natural living structures such as trees, plants, and so on, can be found in the work of
Lindenmayer [Lindenmayer 1971] on what are now called L-systems. These production systems are very similar
to the production rules which define various kinds of context sensitive or context free grammars. The combination
of tree structured genotypes, or recursive construction algorithms similar to production rules, combined with the
divide-and-conquer paradigm suggest a powerful computational technique for the compression of complex
phenotypical structures into useful genotypical structures. So much so that, as our understanding of exactly how
DNA encodes the phenotypical structure of individual biological organisms (particularly the neural systems of
mammals) progresses, it would be surprising to find that Nature has not employed some such technique.
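Parallel rewriting of this kind is simple to demonstrate. A minimal sketch (illustrative Python) of a context-free L-system, using Lindenmayer's classic algae rules A → AB, B → A:

```python
def lsystem(axiom, rules, steps):
    """Apply context-free production rules to every symbol in parallel,
    steps times; symbols with no rule are copied unchanged."""
    s = axiom
    for _ in range(steps):
        s = "".join(rules.get(ch, ch) for ch in s)
    return s
```

For example, `lsystem("A", {"A": "AB", "B": "A"}, 3)` yields "ABAAB", and the string lengths grow as the Fibonacci numbers: a very short genotype (the rule set) expands recursively into a large, structured phenotype, which is exactly the compression idea described above.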

                                                              Chapter references


[Altenberg 1987] L. Altenberg and M. W. Feldman. Selection, generalised transmission, and the evolution of
modifier genes. The reduction principle. Genetics 117:559-572.

[Altenberg 1994] L. Altenberg. The Evolution of Evolvability in Genetic Programming. Chapter 3 in Advances
in Genetic Programming, Ed. Kenneth E. Kinnear, Jr., MIT Press, 1994.

[Belew 1990] R. Belew, J. McInerney, and N. N. Schraudolph. Evolving networks: Using the genetic algorithm
with connectionist learning. CSE Technical Report CS90-174, University of California, San Diego, 1990.

[Booker 1982] L. B. Booker. Intelligent behaviour as an adaption to the task environment. Doctoral dissertation,
University of Michigan, 1982. Dissertation Abstracts International 43(2), 469B.

[Brandon 1990] R. N. Brandon. Adaptation and Environment, pages 83-84. Princeton University Press, 1990.

[Cavalli-Sforza 1976] L. L. Cavalli-Sforza and M. W. Feldman. Evolution of continuous variation: direct
approach through joint distribution of genotypes and phenotypes. Proceedings of the National Academy of Sciences
U.S.A., 73:1689-1692, 1976.




[Cavicchio 1970] D. J. Cavicchio. Adaptive search using simulated evolution. Doctoral dissertation, University
of Michigan (unpublished), 1970.

[Chalmers 1990] David J. Chalmers. The Evolution of Learning: An experiment in Genetic Connectionism.
Proceedings of the 1990 Connectionist Models Summer School, San Marco, CA. Morgan Kaufmann, 1990.

[Collins 1991] R. J. Collins and D. R. Jefferson. Selection in massively parallel genetic algorithms. Proceedings
of the fourth international conference on genetic algorithms. Morgan Kaufmann 1991.

[Davis 1987] Lawrence Davis, Editor. Genetic Algorithms and Simulated Annealing, Pitman Publishing, London.

[De Jong 1975] K. De Jong. An analysis of the behaviour of a class of genetic adaptive systems. Doctoral
dissertation, University of Michigan, 1975. Dissertation Abstracts International 36(10), 5140B.

[Freedman 1991] D. Freedman, R. Pisani, R. Purves and A. Adhikkari. Statistics, Second edition, W. W. Norton,
New York, 1991.


[Goldberg 1987] David E. Goldberg and Jon Richardson. Genetic Algorithms with Sharing for Multimodal
Function Optimization. Proc. Second Int. Conf. on Genetic Algorithms, pp. 41-49, MIT.

[Gorges-Schleuter 1990] Martina Gorges-Schleuter. Genetic Algorithms and Population Structures: A Massively
Parallel Algorithm. Ph.D. Thesis, University of Dortmund, August 1990.

[Grefenstette 1987] John J. Grefenstette. Incorporating Problem Specific Knowledge into Genetic Algorithms. In
Genetic Algorithms and Simulated Annealing, Ed. Lawrence Davis, Pitman Publishing, London.

[Holland 1975] John H. Holland. Adaptation in Natural and Artificial Systems. Ann Arbor: University of Michigan
Press.

[Horowitz 1978] E. Horowitz and S. Sahni. Fundamentals of Computer Algorithms. London, Pitman Publishing
Ltd.

[Johnson 1996] D. S. Johnson, L. A. McGeoch and E. E. Rothberg. Asymptotic experimental analysis for the Held-
Karp traveling salesman bound. Proceedings of the 1996 ACM-SIAM Symposium on Discrete Algorithms, to appear.

[Jones 1993] Antonia J. Jones. Genetic Algorithms and their Applications to the Design of Neural Networks,
Neural Computing & Applications, 1(1):32-45, 1993.

[Koza 1992] John L. Koza. Genetic Programming: On the Programming of Computers by Means of Natural
Selection. Bradford Books, MIT Press, 1992. ISBN 0-262-11170-5.

[Lindenmayer 1971] A. Lindenmayer. Developmental systems without cellular interaction, their languages and
grammars. J. Theoretical Biology 30, 455-484, 1971.

[Lyubich 1992] Y. I. Lyubich. Mathematical Structures in Population Genetics. Springer-Verlag, New York, pages
291-306. 1992.

[Manderick 1989] B. Manderick and P. Spiessens. Fine grained parallel genetic algorithms. Proceedings of the
third international conference on genetic algorithms. Morgan Kaufmann, 1989.



[Manderick 1991] Manderick, B. de Weger, M. and Spiessens, P. The genetic algorithm and the structure of the
fitness landscape. In R. K. Belew and L. B. Booker, Editors, Proceedings of the Fourth International Conference
on Genetic Algorithms, pages 143-150, San Mateo CA, Morgan Kaufmann.

[Mauldin 1984] M. L. Mauldin. Maintaining diversity in genetic search. National Conference on Artificial
Intelligence, 247-250, 1984.

[Macfarlane 1993] D. Macfarlane and Antonia J. Jones. Comparing networks with differing neural-node functions
using Transputer based genetic algorithms. Neural Computing & Applications, 1(4): 256-267, 1993.

[Menczer 1992] Menczer,F. and Parisi, D. Evidence of hyperplanes in the genetic learning of neural networks.
Biological Cybernetics 66(3):283-289.

[Miller 1989] G. Miller, P. Todd, and S. Hegde. Designing neural networks using genetic algorithms. In
Proceedings of the Third Conference on Genetic Algorithms and their Applications, San Mateo, CA. Morgan
Kaufmann, 1989.

[Muhlenbein 1988] H. Muhlenbein, M. Gorges-Schleuter, and O. Kramer. Evolution Algorithms in Combinatorial
Optimisation. Parallel Computing, 7, pp. 65-85.

[Price 1970] G. R. Price. Selection and covariance. Nature, 227:520-521.

[Price 1972] G. R. Price. Extension of covariance mathematics. Annals of Human Genetics 35:485-489.

[Salmon 1971] W. C. Salmon. Statistical Explanation and Statistical Relevance. University of Pittsburg Press,
Pittsburgh, 1971.

[Schaffer 1984] J. D. Schaffer. Some Experiments in Machine Learning Using Vector Evaluated Genetic
Algorithms. Ph.D. Thesis, Department of Electrical Engineering, Vanderbilt University, December 1984.

[Schwefel 1965] H-P Schwefel. Kybernetische Evolution als Strategie experimentellen Forschung in der
Strömungstechnik. Diploma thesis, Technical University of Berlin, 1965.

[Slatkin 1970] M. Slatkin. Selection and polygenic characters. Proceedings of the National Academy of Sciences
U.S.A. 66:87-93. 1970.

[Smith 1980] S. F. Smith. A Learning System Based on Genetic Adaptive Algorithms. Ph.D. Dissertation,
University of Pittsburg.

[Spiessens 1991] P. Spiessens and B. Manderick. A massively parallel genetic algorithm - implementation and
first analysis. Proceedings of the fourth international conference on genetic algorithms. Morgan Kaufmann 1991.

[Valenzuela 1994] Christine L. Valenzuela and Antonia J. Jones. Evolutionary Divide and Conquer (I): A novel
genetic approach to the TSP. Evolutionary Computation 1(4):313-333, 1994.

[Valenzuela 1995] Christine L. Valenzuela. Evolutionary Divide and Conquer: A novel genetic approach to the
TSP. Ph.D. Thesis, Department of Computing, Imperial College, London. 1995

[Valenzuela 1997] Christine L. Valenzuela and Antonia J. Jones. Estimating the Held-Karp lower bound for the
geometric TSP. To appear: European Journal of Operational Research, 1997.

[Whitley 1990] D. Whitley, T. Starkweather, and C. Bogart. Genetic algorithms and neural networks: Optimizing
connections and connectivity. Parallel Computing, forthcoming.



[Wilson 1990] Perceptron redux. Physica D, forthcoming.








                                                       III Hopfield networks.



Introduction.

As far back as 1954 Cragg and Temperley [Cragg 1954] had introduced the spin-neuron analogy using a
ferromagnetic model. They remarked

         "It remains to be considered whether biological analogues can be found for the concepts of temperature and interaction
         energy in a physical system."

In 1974 Little [Little 1974] introduced the temperature-noise analogy, but it was not until 1982 that John Hopfield
[Hopfield 1982], a physicist, made significant progress in the direction requested by Cragg and Temperley. In a
single short paragraph, he suggests one of the most important new techniques to have been proposed in neural
networks.

Hopfield nets and energy.

The standard approach to a neural network is to propose a learning rule, usually based on synaptic modification,
and then to show that a number of interesting effects arise from it. Hopfield starts by saying that:

         "The function of the nervous system is to develop a number of locally stable states in state space."

Other points in state space flow into the stable points, called attractors. In some other dynamic systems the
behaviour is much more complex, for example the system may orbit two or more points in state space in a non-
periodic way, see [Abraham 1985]. However, this turns out not to be the case for the Hopfield net.

The flow of the system towards a stable point allows a mechanism for correcting errors, since deviations from the
stable points disappear. The system can thus reconstruct missing information since the stable point will
appropriately complete missing parts of an incomplete initial state vector.

Each of n neurons has two states, like those of McCulloch and Pitts, xi = 0 (not firing) and xi = 1 (firing at
maximum rate). When neuron i has a connection made to it from neuron j, the strength of the connection is defined
as wij. Non-connected neurons have wij = 0. The instantaneous state of the system is specified by listing the n values
of xi, so is represented by a binary word of n bits. The state changes in time according to the following algorithm.
For each neuron i there is a fixed threshold θi. Each neuron readjusts its state randomly in time but with a mean
attempt rate µ, setting

    xi(t) = 1          if  Σ_{j≠i} wij xj(t-1)  >  θi
    xi(t) = xi(t-1)    if  Σ_{j≠i} wij xj(t-1)  =  θi                                (1)
    xi(t) = 0          if  Σ_{j≠i} wij xj(t-1)  <  θi



Thus, an element chosen at random asynchronously evaluates whether it is above or below threshold and readjusts
accordingly.

Although this model has superficial similarities to the Perceptron there are essential differences. Firstly,
Perceptrons were modelled chiefly with the neural connections in a `forward' direction and the analysis of such
networks with backward coupling proved intractable. All the interesting results of the Hopfield model arise as a
consequence of the strong backward coupling. Secondly, studies of perceptrons usually made a random net of
neurons deal with the external world, and did not ask the questions essential to finding the more abstract emergent
computational properties. Finally, perceptron modelling required synchronous neurons like a conventional digital
computer. Although synchrony of sorts must exist in biological nervous systems, for example the act of walking
involves the precise temporal coordination of both legs, or the Purkinje fibres which help control the heart, there
is certainly no global synchrony in the same sense that electronic hardware is clocked. Given the variations of
delays of nerve signal propagation, there would probably be no way to use global synchrony effectively. The fact
that computational properties can exist in spite of asynchrony has interesting implications for biological
computing.

Hopfield considers the special case wij = wji (all i, j), wii = 0 (all i) and defines a function
    E  =  - (1/2) Σ_{i≠j} wij xi xj  +  Σ_i θi xi                                    (2)


which is an analog to the physical energy of the system. A low rate of neural firing is approximated by assuming
that only one unit changes state at any given moment. Then, since wij = wji, the change ΔE due to Δxi is given by

    ΔE  =  - Δxi ( Σ_{j≠i} wij xj  -  θi )                                           (3)


Now consider the effect of the threshold rule (1). If the unit changes state at all then Δxi = ±1. If Δxi = +1 this means
the unit changes state from 0 to 1, hence by the threshold rule

    Σ_{j≠i} wij xj  >  θi                                                            (4)


in which case, by (3), ΔE < 0. Alternatively, Δxi = -1, the unit therefore changes state from 1 to 0, and

    Σ_{j≠i} wij xj  <  θi                                                            (5)


and again ΔE < 0. Thus if a unit changes state at all the energy must decrease. State changes will continue until
a locally least E is reached.5 The energy is playing the role of a Hamiltonian in the more general dynamic system
context.

For the Hopfield network individual state changes are deterministic. However, in more general models, such as
the Boltzmann machine, we can add a stochastic component to the node update rule which introduces a parameter
T called temperature. At T = 0 state changes are decided deterministically by the threshold rule. For T > 0, as T
increases the system becomes progressively less deterministic and more stochastic until, at high temperatures, any
individual node is in either state with probability ½. Thus we can regard the Hopfield network in operational mode
as the zero temperature case of the Boltzmann machine.
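As a sketch of how such a stochastic rule might look, the logistic form below is the standard Boltzmann-machine choice; the function name is illustrative and not from these notes:

```python
import math

def p_on(net, theta, T):
    """Probability that a unit adopts state 1, given net input `net`,
    threshold `theta` and temperature T.  At T = 0 this reduces to the
    deterministic threshold rule (at equality rule (1) keeps the previous
    state; 0.5 is used here as a neutral stand-in).  As T grows the
    probability tends to 1/2 whatever the input."""
    if T == 0:
        return 1.0 if net > theta else (0.5 if net == theta else 0.0)
    return 1.0 / (1.0 + math.exp(-(net - theta) / T))
```

For small T the unit is on with probability close to 1 whenever the net input exceeds the threshold, recovering the Hopfield network as the zero-temperature limit.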

Hopfield now makes the critical observation that

                                                "This case is isomorphic with an Ising model."

thereby allowing a deluge of physical theory describing spin-glass models to enter network modelling. This flood
of new participants has transformed the field of neural networks.

A spin glass is a magnetic alloy formed, for instance, by dilute solutions of manganese in copper or iron in gold.
These impurities interact with each other by means of conduction electrons and the couplings are either of the


    5
      This particular argument is valid only if the neural states change one at a time in some random order; which is approximated by a low
neural firing rate. However, Cohen and Grossberg have proved a theorem about a much wider class of networks which guarantees a similar
kind of stability for asynchronous operation.


ferromagnetic (wij > 0) or antiferromagnetic (wij < 0) type. The interest of these alloys comes from the fact that they
exhibit a wide variety of stable or meta-stable states. The dipoles interact via the couplings wij. In the simplest case,
a spin interacts only with its nearest neighbours, while the equivalent of the neural networks considered here,
requires infinite range interactions, where each spin is coupled to all others. In the Ising model, the Hamiltonian
of such a spin glass is proportional to

    Σ_{i≠j} wij xi xj                                                                (6)


the spins contributing to the total energy by pairwise interactions, and the system stabilizes at an equilibrium point
which is a minimum of the free energy, see [Binder 1986].


   Procedure Hopfield (Assumes weights are assigned)

             Randomise initial state x ∈ {0, 1}^n
             Repeat until updating every unit produces no change of state
                     Select unit i (1 ≤ i ≤ n) with uniform random probability
                     Update unit i according to (1)
             End


Algorithm 3-1 Hopfield network.
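Algorithm 3-1 translates directly into a few lines of code. The following is a minimal Python sketch (the function and parameter names are my own; the notes themselves supply a Mathematica version):

```python
import numpy as np

def hopfield_run(w, theta, x, rng=None, max_sweeps=100):
    """Asynchronous Hopfield dynamics: repeatedly pick units in random
    order and apply the threshold rule (1) until a full sweep over all
    units produces no change of state."""
    rng = np.random.default_rng(0) if rng is None else rng
    n = len(x)
    x = x.copy()
    for _ in range(max_sweeps):
        changed = False
        for i in rng.permutation(n):      # asynchronous: one unit at a time
            net = w[i] @ x                # w[i, i] == 0, so this sums over j != i
            new = 1 if net > theta[i] else (x[i] if net == theta[i] else 0)
            if new != x[i]:
                x[i] = new
                changed = True
        if not changed:                   # stable state reached
            break
    return x
```

Because each accepted change strictly decreases the energy (2), the loop always terminates in a fixed point for symmetric weights with zero diagonal.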



A number of computer simulations and some analysis led Hopfield to conclude that the number of `memories'
(point attractors) that could be stored by a network was about 0.15n, where n is the number of neurons in the
network, a figure quite precisely confirmed by later work. In an analytic tour de force [Amit 1987] it is shown
that the Hopfield model can be solved exactly, in the thermodynamic limit as n → ∞. A phase diagram
(temperature T, storage P/n) is obtained, where T is a measure of the noise level and P/n is the ratio of the number
of learnt patterns to the number of neurons [Crisanti 1986]. The main result is the existence of a sharp,
discontinuous phase transition at P/n = αc ≈ 0.14 (as T → 0). When P < αcn the retrieval is very good (0.97
correlation) but it drops suddenly for P > αcn.

Real neurons need not make synapses both i->j and j->i. We therefore ask if wij = wji is important. Without this
condition the probability of making errors is increased but the algorithm continues to generate stable minima. Why
should stable limit points or regions persist when wij is not equal to wji? If the algorithm at some time changes xi
from 0 to 1, or vice versa, the change of the energy can be split into two terms. The first is the change that would
apply to the symmetric model. The second is identically zero if wij is symmetric and is `stochastic' with mean zero
if wij and wji are randomly chosen. The algorithm in the non-symmetric case therefore changes E in time in a
fashion similar to the symmetric case but corresponding to a finite temperature, i.e. a lower signal-to-noise ratio.

In [Hopfield 1984] a more realistic neuron model is used in which an internal continuous variable stores the linear
sum of the excitation and inhibition weighted by the appropriate connection strength. The internal variable is
converted into an output activity by a sigmoidal non-linearity. As in the earlier paper he sets up an energy function
and shows that the evolution of the system in time, given the properties of the neurons, will be to decrease energy.
Thus the results found in the previous paper still largely hold. There is a brief, but clear, account of the Hopfield
model in [Farhat 1985] in which an optical implementation is described.

The outer product rule for assigning weights.

An early rule used for memory storage in associative memory models can also be used to store memories in the
Hopfield model. If the {0, 1} node outputs are mapped to xi ∈ {-1, +1} the rule assigns weights as follows. For
each pattern vector x, which we require to memorise, we consider the matrix

              x1                            x1x1  x1x2  ...  x1xn
              x2                            x2x1  x2x2  ...  x2xn
    x xT  =    .    (x1, x2, ..., xn)   =     .     .   ...    .                     (7)
              xn                            xnx1  xnx2  ...  xnxn

and then average these matrices over all pattern vectors (prototypes). At the time the explanation was that in this
way we can capture the average correlations between components of the pattern vectors and then use this
information, during the operation of the network, to recapture missing or corrupted components. Assuming that
we know the patterns required to be memorised the outer-product rule is a one-shot computation of the weights
and so perhaps should not qualify as a `learning rule' in the usual sense of progressive weight modification upon
exposure to experience.
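The one-shot computation is easily sketched in code; the following Python fragment (illustrative names, not from the notes) averages the outer products of ±1 prototype vectors and zeroes the diagonal so that wii = 0:

```python
import numpy as np

def outer_product_weights(patterns):
    """Outer-product (Hebbian) weight assignment: average the matrices
    x x^T of equation (7) over all +/-1 prototype vectors, then zero the
    diagonal to enforce w_ii = 0."""
    patterns = np.asarray(patterns, dtype=float)   # one +/-1 pattern per row
    m, n = patterns.shape
    w = patterns.T @ patterns / m                  # average of the outer products
    np.fill_diagonal(w, 0.0)
    return w
```

With a single stored pattern the pattern itself is a fixed point of the threshold dynamics, since each net input then has the same sign as the corresponding component.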

However, from the perspective of the Hopfield model there is another interpretation that can be given to this rule.
For a point x = (x1, ..., xn) to be a stable attractor (i.e. a memory) we require that it be a local energy minimum.
Suppose that patterns are presented sequentially and we wish to determine some rule which is intended to make
frequently presented patterns likely to be energy minima. If we suppose that x is given at some stage, then it has
an associated energy and, if we calculate

    ∂E/∂wij  =  - xi xj                                                              (8)


then by taking Δwij proportional to minus this gradient we obtain

    Δwij  =  η xi xj                                                                 (9)

where η > 0 is some small constant. In other words the averaging process of the outer product rule can now be
seen, in the context of progressive learning, as a form of gradient descent.

Unfortunately, however we assign the weights, as progressively more prototype memories are added the system
eventually reaches saturation (at around 0.15n). This happens because the number of local minima is not really
under our control: as more memories are added we obtain an exponentially increasing number of `spurious
memories', i.e. local minima that do not correspond to patterns we wish to memorise. For example, we have

Theorem [Tanaka 1980], [McEliece 1987]. If the synaptic matrix is symmetric with wii = 0 and if its elements are
independent Gaussian variables with zero mean and unit variance, then an asymptotic estimate for the number of
fixed points is given by

    NF  ≈  (1.0505) 2^(0.2874 n)                                                     (10)


A proof can be found in [Kamp 1990].
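For small n the theorem can be checked empirically by brute-force enumeration. A hypothetical sketch (the ±1 state convention and the strict sign condition for stability are my own choices):

```python
import numpy as np

def count_fixed_points(w):
    """Brute-force count of states x in {-1,+1}^n satisfying
    x_i * (w x)_i > 0 for every i, i.e. states stable under the
    zero-threshold asynchronous rule."""
    n = w.shape[0]
    count = 0
    for bits in range(2 ** n):
        x = np.array([1 if bits >> i & 1 else -1 for i in range(n)])
        if (x * (w @ x) > 0).all():
            count += 1
    return count

# A random symmetric Gaussian synaptic matrix with zero diagonal,
# as in the hypotheses of the theorem.
rng = np.random.default_rng(1)
n = 8
a = rng.standard_normal((n, n))
w = (a + a.T) / np.sqrt(2)        # symmetric, off-diagonal entries ~ N(0, 1)
np.fill_diagonal(w, 0.0)
nf = count_fixed_points(w)
```

For n = 8 the estimate (10) gives NF ≈ 1.0505 × 2^(0.2874×8) ≈ 5; individual random instances will of course scatter around the asymptotic value. Note that fixed points come in ± pairs, since negating a stable state leaves every product xi(wx)i unchanged.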

Networks for combinatoric search.

Apart from their applications to associative recall or pattern recognition, Hopfield networks can be applied to the
very different problem of combinatoric search. It should be made clear at the outset that, with the possible
exception of the Boltzmann machine, the application of neural networks to hard combinatoric search has not yet
yielded systems or algorithms which compare favourably with state-of-the-art probabilistic algorithms designed
for the specific problem, e.g. the TSP. However, this area is of considerable theoretical interest and may eventually
prove to be of practical interest.



We know that the dynamics of a Hopfield network cause it to relax into a local energy minimum. Given a specific
combinatoric search problem to be solved we are faced with two issues. First we have to design a representation
which relates network states to the objects in the search space. Second we have to arrange that low energy states
of the network correspond to good solutions.

To make matters specific we can consider the geometric TSP. Here the objects of search are tours and given a list
of N cities we can identify a tour as any permutation of this list. There are N! permutations and only ½(N-1)!
distinct tours, so we have already introduced some replication by simply identifying tours with permutations, but
this causes no serious problems of itself. The next step is to consider how tours might be represented as a state of
a network. Here, the generally used method is illustrated below.

For a 5-city problem {A,B,C,D,E}, if city A is in position 2 of the tour this is represented by the second neuron
from an array of five having an output of 1 and all others in the array having an output of 0, i.e. (0,1,0,0,0). The
global state of Table 2-2 represents the tour (C,A,E,B,D). Thus for N cities a total of n = N² neurons are required
to specify a complete tour.

Table 2-2 Example representation.

         1    2    3    4    5
    A    0    1    0    0    0
    B    0    0    0    1    0
    C    1    0    0    0    0
    D    0    0    0    0    1
    E    0    0    1    0    0

Clearly there is a 1-1 correspondence between valid tours and the set of all N×N permutation matrices, i.e.
matrices which have precisely one `1' in each row and column, all other entries being zero. For an N-city TSP,
there are N! states of such matrices which represent tours, and 2 to the power N² states in all. Now, this is not a
very satisfactory situation. We have replaced the original search space, of size order N!, by a space of much
greater size because
    log(N!)  ≈  N log N  -  N  +  (1/2) log(2πN)
                                                                                     (11)
    log(2^(N²))  =  N² log 2

where the approximation for N! is Stirling's formula. This is contrary to a guiding principle that wherever possible
in hard combinatoric search we should simplify the space searched (subject to the condition that good solutions
are rich in the smaller space) rather than make it larger. Nevertheless, the above representation has a certain
paradigmatic simplicity and will serve to illustrate the ideas.
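The permutation-matrix representation is easy to make concrete. A Python sketch (illustrative names, not from the notes), using indices A = 0, ..., E = 4 for the example of Table 2-2:

```python
import numpy as np

def tour_to_state(tour, n):
    """Permutation-matrix representation: unit (i, p) is on iff
    city i is visited at position p of the tour."""
    x = np.zeros((n, n), dtype=int)
    for p, city in enumerate(tour):
        x[city, p] = 1
    return x

def is_valid_tour_state(x):
    """A network state encodes a tour iff it has exactly one `1'
    in every row and in every column."""
    return bool((x.sum(axis=0) == 1).all() and (x.sum(axis=1) == 1).all())
```

The tour (C,A,E,B,D) is the list [2, 0, 4, 1, 3] in these indices; its state matrix is exactly Table 2-2, and switching on any extra unit breaks the permutation-matrix property.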

The next problem to be addressed is: how to assign the weights so that states with low energy correspond to short
tours. This `assignment problem' is dealt with in the next section.

Assignment of weights for the TSP.

Let dij denote the distance between city i and city j. Here we shall formulate the TSP as a 0-1 programming
problem, by defining it as a quadratic assignment problem [Garfinkel 1985]. The TSP can also be formulated as
a linear assignment problem [Aarts 1988] but, of course, then there are far more constraints. Using the n = N²
node state variables defined by

    xip  =  1, if the tour visits city i at the pth position                         (12)
    xip  =  0, otherwise


we can formulate the TSP as the following quadratic assignment problem:

         Minimise




    F(x)  =  Σ_{i,j,p,q=0}^{N-1} aijpq xip xjq                                       (13)



         subject to xip, xjq ∈ {0, 1} and

    Σ_{i=0}^{N-1} xip  =  1        (0 ≤ p ≤ N-1)
                                                                                     (14)
    Σ_{p=0}^{N-1} xip  =  1        (0 ≤ i ≤ N-1)



The first condition of (14) asserts there is just one `1' in every column, and the second condition places a similar
constraint on every row. The aijpq are defined by
    aijpq  =  dij, if q ≡ p ± 1 (mod N)
                                                                                     (15)
    aijpq  =  0, otherwise



Figure 3-1 Distance connections. Each node (i, p) has inhibitory connections to the two adjacent columns
(p - 1 and p + 1, mod N) whose weights reflect the cost of joining the three cities.

Figure 3-2 Exclusion connections. Each node (i, p) has inhibitory connections to all units in the same row
and column.


We wish to choose the weights so that minimising the objective function (13), subject to the constraints of (14),
corresponds to minimising the energy as defined in (2), (1).

Figure 3-1 and Figure 3-2 illustrate the connections made to each node of the network. These are divided into two
types. Distance connections, for which the weights are chosen so that, if the net is in a state which corresponds
to a tour, these weights will reflect the energy cost of joining the (p - 1)th city of the tour to the pth city, and the
pth city to the (p + 1)th. Note these connections wrap around (mod N). Exclusion connections, which inhibit two
units in the same row or column from being on at the same time. Exclusion connections are designed so as to
encourage the network to settle in a state which corresponds to a tour state. As all connections so far are inhibitory
we need to provide the network with some incentive to turn on any units at all. This can be done by manipulating
the thresholds. Intuitively we can see that some arrangement such as this might well have the desired effect, but
the next theorem shows exactly how to choose these weights so that it does.

We consider the following sets of parameters

    St  =  { θip : 0 ≤ i, p ≤ N-1 }
    Sd  =  { wipjq : i ≠ j and q ≡ p ± 1 (mod N) }                                   (16)
    Se  =  { wipjq : (i = j and p ≠ q) or (i ≠ j and p = q) }

Theorem (Aarts). Let the weights and thresholds be chosen so that

    ∀ θip ∈ St we have θip  <  -max{ dik + dil : k ≠ l, 0 ≤ k, l ≤ N-1 }
    ∀ wipjq ∈ Sd we have wipjq  =  -dij                                              (17)
    ∀ wipjq ∈ Se we have wipjq  <  min{ θip, θjq }

then
           (i) Feasibility. Valid tour states of the network exactly correspond to local minima of the energy function.

and        (ii) Ordering. The energy function is order-preserving with respect to tour length.

Proof. (i) Feasibility. Firstly, it is easy to check that a tour state is indeed a local minimum. We simply note the
effect of changing the state of some unit. Now suppose the state is not a tour state. We divide this into two possible
cases.

Case 1. Suppose the network state has more than a single `1' in some row or column.

Then at least one exclusion connection is activated. For definiteness suppose units (i, p) and (j, p) are both on and
that θip = min{θip, θjp} > wipjp. What is the effect of turning unit (i, p) off? Distance connections cause no
problems, for suppose that some unit (k, q), with q ≡ p ± 1 (mod N), in an adjacent column is also on. In this case
turning off (i, p) will remove a distance connection contribution dik to the energy, and so decrease it. For the
exclusion connection, if the weights are chosen according to (17) then turning off the connected unit (i, p), which
has lowest threshold, causes a change in energy

    ΔE  =  wipjp  -  θip  <  0                                                       (18)

and so again reduces the energy. Hence a network state with too many `1's in some row or column cannot
correspond to a local minimum.

Case 2. Now suppose that the network state has at most one `1' in every row and column (so that there are at most
N units on) and that at least one row or column has no unit on.

Suppose, for definiteness, that the pth column has no unit on. This means that in any row, the unit in the pth
column cannot be on. If every row contains some unit which is on and only N-1 columns are available, this means
that some column would have to contain two units which are on, which is contrary to hypothesis. Hence there must
be fewer than N units on, so that some unit in the pth column must be in a row which has no units on. Suppose
this is unit (i, p). Turning this unit on does not contribute to the energy via any exclusion connections (because
all other units in the same row or column are off). We next consider the contribution due to the distance
connections. In each of the adjacent columns (mod N) there is at most one unit on. Call these (k, p-1) and (l, p+1)
respectively, where it is understood that the indices p-1 and p+1 are taken (mod N). The choice of weights in (17)
ensures that the change of energy produced by turning unit (i, p) on is





    ΔE  =  - w(k,p-1)(i,p) x(k,p-1) x(i,p)  -  w(i,p)(l,p+1) x(i,p) x(l,p+1)  +  θip xip
                                                                                     (19)
        =  dik  +  dil  +  θip  <  0

Hence the initial state could not have been an energy minimum.

The two cases considered cover all possibilities for network states which are not tour states, and so we conclude
that the tour states exactly correspond to energy minima.

(ii) Ordering. To show that the energy function is order preserving we first observe that θip does not depend on p,
i.e. all θip for units in a given row are equal. Consequently, the contribution to the energy from the threshold terms
is the same for all tour states. The remaining terms contribute

    (1/2) Σ_{(i,p) ≠ (j,q)} dij xip xjq                                              (20)

where the sum runs over the distance connections, and this evaluates to the tour length. Hence energy preserves
the ordering of tour length. □

At first sight this theorem may seem paradoxical. In analysing the memory capacity of a Hopfield network, we
concluded earlier that only around 0.15n prototype vectors could be memorised before the onset of catastrophic
degradation in recall. Thus we were only able to control the network behaviour at a relatively small proportion of
an exponentially large (as in (10)) number of local minima. Yet, in the assignment of weights for the TSP network,
we have N! local minima exactly where we want them. How can this be? The answer lies in the following
observation. In analysing the effect of the outer-product rule for assigning weights in order to recall stored
memories, we assumed that there was no structure to the prototype vectors, i.e. that they were uncorrelated. In the
case of the TSP assignment there is a very definite structure in the set of states we are seeking to assign to minima:
the structure imposed by the ½N(N-1) distances between the cities of the original problem. The TSP weight
assignment respects this structure and that additional degree of control allows us to place N! states in exactly the
relationship needed to solve the problem.

A Mathematica implementation of the asynchronous Hopfield network is given in the Mathematica directory. The
assignment of weights for a specific TSP problem discussed here is also given there.
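The asynchronous dynamics are also easy to emulate in a few lines of conventional code. The following Python sketch (illustrative only, and not the Mathematica code just mentioned; the function names are ours) runs the asynchronous {0, 1} update rule until a full sweep produces no change, i.e. until the state is a fixed point:

```python
import random

def net_input(W, x, i):
    """Net input to unit i for weight matrix W and state vector x."""
    return sum(W[i][j] * x[j] for j in range(len(x)))

def run_to_fixed_point(x, W, theta, sweeps=100, seed=0):
    """Asynchronous Hopfield dynamics: repeatedly pick a unit at random and
    set x[i] = 1 iff its net input reaches the threshold theta[i].
    With symmetric W and zero diagonal the energy never increases, so the
    state settles into a local minimum (for the TSP weights, a tour state)."""
    rng = random.Random(seed)
    for _ in range(sweeps):
        before = list(x)
        for _ in range(len(x)):
            i = rng.randrange(len(x))
            x[i] = 1 if net_input(W, x, i) >= theta[i] else 0
        if x == before:   # a full sweep changed nothing: fixed point reached
            return x
    return x
```

For example, a two-unit network with mutually excitatory weights settles into one of its two stable states, [0, 0] or [1, 1], from any initial state.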

* The Hopfield and Tank application to the TSP.

Hopfield and Tank first proposed using the Hopfield model to solve the TSP in [Hopfield 1986]. Their choice of
energy function was motivated by similar considerations to those of the previous section, but was slightly different
in detail.

For a specific TSP problem we assume that the assignment of weights is made as described in the previous section.
Suppose we now initialise the corresponding Hopfield network to a random state. Then if the network is run we
should expect it to settle into a local energy minimum, and we have proved that this will correspond to some tour
state. However, it is unlikely to correspond to the optimal tour. Hopfield and Tank suggested an ingenious way of
overcoming this problem, which is very similar to the Boltzmann machine approach (not discussed in these notes).

Instead of the {0, 1} network Hopfield and Tank used a continuous model, 0 ≤ x_i ≤ 1, in which neurons possess
a sigmoidal activation function x_i = f(λ net_i), where net_i is the net input and λ is a positive constant which
represents the gain and is equivalent to varying the slope of the sigmoid. Hence f represents the input-output
characteristics of a non-linear amplifier with negligible response time. The discrete model represents the case
where λ → ∞. In their simulation λ is taken to be 50 (large but finite).

For 10 cities there are 181,440 possible tours. In the Hopfield and Tank simulations about 50% of the trials
produced one of the two shortest paths. They ask why the computation is so effective and provide the following



answer:

          "The solution to a TSP is a path and the decoding of the TSP's network final stable state to obtain this discrete decision or solution
          requires having the final x_ip values to be near 0 or 1. However, the actual analog computation occurs in the continuous domain
          0 ≤ x_ip ≤ 1. The decision-making process or computation consists of the smooth motion from an initial state in the interior of the space
          (where the notion of a `tour' is not even defined) to an ultimate stable point near enough to a corner of the continuous domain to be
          able to identify with that corner. It is as though the logical operations of a calculation could be given continuous values between `true'
          and `false', and evolve toward certainty only near the end of the calculation."

Naturally, with such an interesting approach to an NP-hard problem, others tried to repeat these results. It was
found by a number of researchers that the Hopfield-Tank algorithm, as originally formulated, was highly unstable:

          "Our simulations indicate that Hopfield and Tank were very fortunate in the limited number of TSP simulations they attempted. Even
          at the value N = 10 it transpires that their basic method is unreliable..." [Wilson 1988].

However, others later refined the energy function and modified the algorithm to perform more reliably. There are
also a number of related papers on `elastic net' methods for the TSP problem [Durbin 1987].

Conclusions.

One lesson we have learned from this chapter is that the performance of networks for hard combinatoric search
is critically dependent both on the mapping from the problem domain to the network, and on the details of the
network architecture and weight assignment which encode the constraints of the original problem. Appropriate
encoding of domain knowledge into the network architecture can profoundly enhance the overall performance. One
of the important general questions arising from these observations is how such encodings of basic constraints
might themselves be learnt, possibly at the genetic level.

In another direction entirely, we first discussed associative memories with an emphasis on direct methods of
encoding prototype patterns into the weights of a fully connected network. As our study of these networks
progressed the emphasis moved to a perspective more in accord with dynamic systems. This also mirrors the
historical development of the subject of artificial neural networks. Following Hopfield's papers prototype vectors,
or `memories', were associated with point attractors of the dynamics of the network. However, we can see that this
preoccupation is, in itself, just a special case of a much more general vista. Why should we restrict our attention
to the point attractors of the system? In modelling biological systems we are studying techniques for embedding
learned behaviour into the system. In other words we should really be studying the following question

          ! How can we sculpt the dynamic evolution of the system through time so as to induce
          behaviours characteristic of, or responsive to, a given external dynamic system?

If we loosely identify a `behaviour' with a trajectory through the state space of the network, then point attractors
are just one of the simplest characterisations of dynamic system behaviour. It would be much more interesting to
develop learning techniques for whole classes of trajectories, or even chaotic attractors. In other words to address
the issue of how to get a neural network to capture a model of a given dynamic system or Markov process for
example.

This is very much the line of reasoning suggested by Freeman's studies and simulations of biological neural
systems [Freeman 1991]. Phase portraits made from EEGs generated by computer models reflect the overall
activity of the olfactory system of a rabbit at rest and in response to a familiar scent (e.g. banana). The resemblance
of these portraits to irregularly shaped, but still structured, coils of wire reveals that brain activity in both
conditions is chaotic but that the response to the known stimulus is more ordered, more nearly periodic during
perception, than at rest.


                                                           Chapter references



[Aarts 1988] E. Aarts and J. H. M. Korst. Boltzmann machines for travelling salesman problems, European Journal
of Operational Research - in press.

[Abraham 1985] R. H. Abraham and C. D Shaw, Dynamics - The Geometry of Behaviour, Part 2: Chaotic
Behaviour, Arial Press, 1985.

[Amit 1987] D. J. Amit, H. Gutfreund, and H. Sompolinsky. Information storage in neural networks with low
levels of activity. Physical Review A, 35:2239-2303, 1987.

[Binder 1986] K. Binder and A. P. Young. Spin-Glasses - Experimental Facts, Theoretical Concepts and Open
Questions. Reviews of Modern Physics, 58(1):801-976, 1986.

[Burr 19XX]. D. J. Burr. An Improved Elastic Net Method for the Traveling Salesman Problem, ???

[Cragg 1954] B. G. Cragg and H. N. V. Temperley. Electroencephalog. Clin. Neurophys. 6:85, 1954.

[Crisanti 1986] A. Crisanti, D. J. Amit, and H. Gutfreund. Europhys. Letters 2:337, 1986.

[Durbin 1987] R. Durbin and D. Willshaw, An analogue approach to the travelling salesman problem using an
elastic net method, Nature 326, 16 April 1987.

[Farhat 1985] N. H. Farhat, et al. Optical implementation of the Hopfield model. Applied Optics, 24:1469-1475,
1985.

[Feller 1966] W. Feller. An introduction to probability theory and its applications, Vol. 1. J. Wiley & Sons, New
York, 1966.

[Freeman 1991] W. J. Freeman. The physiology of perception. Scientific American, pp 34-41, February 1991.

[Garfinkel 1985] R. S. Garfinkel. Motivation and modelling, in The Travelling Salesman Problem: a guided tour
of combinatoric optimisation. Eds. E. L. Lawler, J. K. Lenstra, A. H. G. Rinnooy Kan, and D. B. Shmoys. Wiley,
Chichester, 1985.

[Hertz 1991] J. Hertz, A. Krough, and R. G. Palmer, Introduction to the theory of neural computing.
Addison-Wesley, 1991. ISBN 0-201-51560-1 (pbk).

[Hopfield 1982] J. J. Hopfield, Neural networks and physical systems with emergent collective computational
abilities, Proceedings of the National Academy of Sciences 79: 2554-2558.

[Hopfield 1984] J. J. Hopfield, Neurons with graded response have collective computational properties like those
of two state neurons. Proceedings of the National Academy of Sciences 81: 3088-3092.

[Hopfield 1986] J. J. Hopfield and D. W. Tank. `Neural' computation of decisions in optimization problems.
Biological Cybernetics, 52:141-152., 1986.

[Kamp 1990] Y. Kamp and M. Hasler. Recursive Neural Networks for Associative Memory, John Wiley & Sons,
New York, 1990.

[Keeler 1988] J. D. Keeler. Cognit. Sci. 12: 299-329, 1988.

[Keeler 1989]. J. D. Keeler, E. E. Pichler and J. Ross. Noise in neural networks: Thresholds, hysteresis, and
neuromodulation of signal-to-noise. Proc. Nat. Acad. Sci. USA, 86: 1712-1716, March 1989.




[Little 1974] W. A. Little. Math. Biosci. 19:101, 1974.

[McEliece 1987] R. McEliece, E. Posner, E. Rodemich, and S. Venkatesh. The capacity of the Hopfield associative
memory. IEEE Transactions on Information Theory, IT-33:461-482, 1987.

[Tanaka 1980] F. Tanaka and S. Edwards. Analytic theory of the ground state properties of a spin glass: I. Ising
spin glass. J. Phys. F: Metal. Phys, 10:2769-2778, 1980.

[Wilson 1988] G. V. Wilson and G. S. Pawley, On the Stability of the Travelling Salesman Problem Algorithm
of Hopfield and Tank. Biological Cybernetics 58:63-70, 1988.






                                           IV The WISARD model.


Introduction.

WISARD (WIlkie, Stonham, Aleksander Recognition Device) is an implementation in hardware of the n-tuple
sampling technique first described in [Bledsoe 1959]. The scheme outlined in Figure 4-1, Figure 7-2 was first
proposed by Aleksander and Stonham in [Aleksander 1979].




Figure 4-1 Schematic of a 3-tuple recogniser.


The sample data to be recognized is stored as a 2-dimensional array (the 'retina') of binary elements. Depending
on the nature of the data this can be done in a variety of ways. For visual processing we can simply place a pre-
processed version of the image onto the retina. For temporal data in signal processing or speech recognition



successive samples in time can be stored in successive columns, and the value of the sample represented by a
coding of the binary elements in each column. The particular coding used is liable to depend on the application.
One of several possible codings is to represent a sample feature value by a 'bar' of binary 1's; the length of the bar
being proportional to the value of the sample feature.

WISARD model.

Random connections are made onto the elements of the array, n such connections being grouped together to form
an n-tuple which is used to address one random access memory (RAM) per discriminator. In this way a large
number of RAMs are grouped together to form a class discriminator whose output, or score, is the sum of all its
RAMs' outputs. This configuration is repeated to give one discriminator for each class of pattern to be recognized.
The RAMs implement logic functions which are set up during training; thus the method does not involve any
direct storage of pattern data.

A random map from array elements to n-tuples is preferable in theory, since a systematic mapping is more likely
to render the recogniser blind to distinct patterns having a systematic difference. Hard-wiring a random map in
a totally parallel system makes fabrication infeasible at high resolutions. In many applications systematic
differences in input patterns of the type liable to pose problems with a non-random mapping are unlikely to occur
since real data tends to be 'fuzzy' at the pixel level. However, the issue of randomly hardwiring individual RAMs
is somewhat academic since in most contexts a totally parallel system is not needed as its speed (independent of
the number of classes and of the order of the access time of a memory element) would far exceed data input rates.
At 512×512 resolution a semi-parallel structure is used where the mapping is 'soft' (i.e. achieved by
pseudo-random addressing with parallel shift registers) and the processing within discriminators is serial but the
discriminators themselves operate in parallel. Using memory elements with an access time of 10^-7 s, this
gives a minimum operating time of around 70 ms, which once again is independent of the number of classes.

The system is trained using samples of patterns from each class. A pattern is stored into the retina array and a
logical 1 is written into the RAMs of the discriminator associated with the class of this training pattern at the
locations addressed by the n-tuples. This is repeated, typically 25 to 50 times, for each class.

In recognition mode, the unknown pattern is stored in the array and the RAMs of every discriminator are put into
READ mode. The input pattern then stimulates the logic functions in the discriminator network and an overall
response is obtained by summing all the logical outputs. The pattern is then assigned to the class of the
discriminator producing the highest score.
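A serial emulation of a single discriminator takes only a few lines. The following Python sketch (class and function names are ours; a set of seen addresses per RAM stands in for the bit array) implements the training and scoring just described:

```python
import random

class Discriminator:
    """One WISARD class discriminator: p/n RAMs, each addressed by an n-tuple."""
    def __init__(self, tuples):
        self.tuples = tuples                 # list of n-tuples of retina indices
        self.rams = [set() for _ in tuples]  # addresses holding a logical 1

    def address(self, retina, t):
        """The address formed by reading the retina at the n-tuple's pixels."""
        return tuple(retina[i] for i in t)

    def train(self, retina):
        """Write a logical 1 at the addressed location of every RAM."""
        for ram, t in zip(self.rams, self.tuples):
            ram.add(self.address(retina, t))

    def score(self, retina):
        """Sum of RAM outputs: the number of n-tuples that fire."""
        return sum(self.address(retina, t) in ram
                   for ram, t in zip(self.rams, self.tuples))

def make_tuples(p, n, seed=0):
    """Random 1-1 mapping of p retina pixels into p/n n-tuples."""
    idx = list(range(p))
    random.Random(seed).shuffle(idx)
    return [tuple(idx[k:k + n]) for k in range(0, p, n)]
```

In use, one discriminator is built per class, each is trained on its own labelled samples, and an unknown pattern is assigned to the class whose discriminator returns the highest score.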

Where very high resolution image data is presented, as in visual imaging, this design lends itself to easy
implementation in massively parallel hardware. However, even with visual images, experience tends to suggest
that often a very good recognition performance can be obtained on relatively low resolution data. Hence in many
applications massively parallel hardware can be replaced by a fast serial processor and associated RAM, emulating
the design in micro-coded software. This was the approach used by Binstead and Stonham in optical character
recognition, with notable success. Such a system has the advantage of being able to make optimal use of available
memory in applications where the n-tuple size, or the number of discriminators, may be required to vary.

The advantages of the WISARD model for pattern recognition are:

         ! Implementation as a parallel, or serial, system in currently available hardware is inexpensive and
         simple.

         !   Given labelled samples of each recognition class, training times are extremely short.

         ! The time required by a trained system to classify an unknown pattern is very small and, in a parallel
         implementation, is independent of the number of classes.





The requirement for labelled samples of each class poses particular problems in speech recognition when dealing
with smaller units than whole words; the extraction of samples by acoustic and visual inspection is a labour
intensive and time consuming activity. It is here that paradigms such as Kohonen's topologising network, as
applied to speech by Tattershall, show particular promise. Of course, in such approaches there are other
compensating problems; principally, after the network has been trained and produced a dimensionally reduced
and feature-clustered map of the pattern space, it is necessary to interpret this map in terms of output symbols
useful to higher levels. One approach to this problem is to train an associative memory on the net output together
with the associated symbol.

Figure 4-2 Continuous response of discriminators to the input word 'toothache' [From Neural Computing
Architectures, Ed. I Aleksander].

Applications of n-tuple sampling in hardware have been rather sparse, the commercial version of WISARD as a
visual pattern recognition device able to operate at TV frame rates being one of the few to date - another is the
Optical Character Recogniser developed by Binstead and Stonham. However, one can envisage a multitude of
applications for such pattern recognition systems as their operation and advantages become more widely
understood.

The total amount of memory needed for the discriminator RAMs is easily calculated. Let n be the number of inputs,
C the number of classes, and assume a 1-1 mapping of a retina with p binary pixels. For convenience we assume
that n divides p. Then the number of 1-bit RAMs will be p/n and each RAM has 2^n bits of storage. So the number
of bits per discriminator is (p/n).2^n. Hence the total number of bits for C classes is

        C.(p/n).2^n                                                                                  (21)
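For instance, under equation (21) even a full-resolution retina is surprisingly cheap by modern standards. The figures below are illustrative choices, not the hardware WISARD's:

```python
def discriminator_bits(p, n, C):
    """Total storage, in bits, for C class discriminators on a p-pixel retina
    with a 1-1 mapping into n-tuples (n must divide p): C * (p/n) * 2**n."""
    assert p % n == 0
    return C * (p // n) * 2**n

# A 512x512 retina, 8-tuples, 10 classes: 10 * 32768 * 256 bits in total.
bits = discriminator_bits(512 * 512, 8, 10)
print(bits // (8 * 1024 * 1024), "megabytes")   # prints: 10 megabytes
```

Each discriminator here needs (p/n).2^n = 32768 × 256 bits, i.e. one megabyte, and the ten classes together need ten.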


Practical n-tuple pattern recognition systems have developed from the original implementation of the hardware
WISARD, which used regularly sized blocks of RAM storing only the discriminator states. As memory has become
cheaper and processors faster, for many applications such heavily constrained systems are no longer appropriate.
Algorithms can be implemented as serial emulations of parallel hardware and RAM can also be used to describe
a more flexible structure.

Typically the real-time system is preceded by a software simulation in which various parameters of the theoretical
model are optimized for the particular application. A design technique which is sufficiently general to cope with
a large class of such net-systems whilst at the same time preserving a high degree of computational efficiency is
described in [Binstead 1987]. In addition the structure produced has the property that it is easily mapped into
hardware to a level determined by the application requirements.

The rationale for believing that n-tuple techniques might be successfully applied to speech recognizers is briefly
outlined in [Tattershall 1984], where it is demonstrated that n-tuple recognisers can be designed so that in training
they derive an implicit map of the class conditional probabilities. Since the n-tuple scheme requires almost no
computation it appears to be an attractive way of implementing a Bayesian classifier. In a real time speech



recognition system the pre-processed input data can be slid across the retina and the system tuned to respond to
significant peaking of a class discriminator response, see Figure 4-2.

Work on using WISARD nets for speech recognition was started at the University of Brunel Pattern Recognition
Laboratory in 1983; some results of this work are reported in [Aleksander 1988a], Chapter 10. One novel feature
of this chapter is an account of the work of Jones and Valenzuela using Holland's Genetic Algorithm to breed
WISARD nets for the purpose of vowel detection.

WISARD - analysis of response.

Assume that there are a number of n-tuples whose address lines are randomly and uniformly connected to the
retina, and that the total area of the retina is 1. Suppose an n-tuple has been trained on patterns T_1, T_2, ..., T_l. Then
we shall evaluate the expected response to an unknown test pattern U. Let

        I_1 = U ∩ T_1,  I_2 = U ∩ T_2,  ...,  I_l = U ∩ T_l

        I_ij = U ∩ T_i ∩ T_j                        (for i ≠ j),
                                                                                                     (22)
        I_ijk = U ∩ T_i ∩ T_j ∩ T_k                 (i, j, k pairwise distinct),

        I_12,...,l = U ∩ T_1 ∩ T_2 ∩ ... ∩ T_l

Here I_i = U ∩ T_i is the set of all points on the retina for which the pixels in U and T_i take the same value, either
both '1' or both '0'. Let |U ∩ T_i| denote the area of this set intersection. Since the area of the retina is supposed to
be 1, it follows that |U ∩ T_i| is the probability that a single address line will give the same result when sampling
U as when sampling T_i. Since the n address lines are assumed uniformly distributed across the retina, the
probability that all n address lines give the same response when sampling U as when sampling T_i is just |U ∩ T_i|^n.
This is the probability that an n-tuple trained only on the pattern T_i will give the same response (i.e. fire) when
presented with the unknown pattern U.

For convenience, let

        p_i = |I_i|^n,  p_ij = |I_ij|^n,  ...,  p_12...l = |I_12...l|^n                              (23)

Suppose the n-tuple has been trained on two patterns T_i and T_j. Then the probability that an n-tuple trained only
on the patterns T_i and T_j will give the same response (i.e. fire) when presented with the unknown pattern U is

        p_i + p_j - p_ij                                                                             (24)

Now suppose the n-tuple has been trained on l patterns T_1, ..., T_l. By a well known combinatoric principle
(inclusion-exclusion), it follows that the probability that an n-tuple will give the same response (i.e. fire) when
presented with the unknown pattern U is

        p = Σ_i p_i - Σ_{i≠j} p_ij + Σ_{i,j,k distinct} p_ijk - ... + (-1)^(l+1) p_12...l            (25)
In the vernacular of n-tuple sampling theory this is called the nth power-law. Note that if U = T_i, i.e. U is equal to
one of the training patterns, then |I_i| = 1 and the remaining terms in the equation above sum to zero, hence p = 1 as
we might expect.
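Equation (25) is mechanical to evaluate. In the Python sketch below (the function name and the dictionary encoding of the overlap areas are ours) each non-empty subset S of trained patterns contributes its agreement area raised to the nth power, with alternating sign:

```python
from itertools import combinations

def fire_probability(areas, l, n):
    """Inclusion-exclusion estimate, as in equation (25), of the probability
    that an n-tuple trained on patterns T_1..T_l fires on an unknown U.

    areas[frozenset(S)] is the area of the retina on which U agrees
    simultaneously with every pattern indexed by the non-empty subset S
    of {0, ..., l-1}  (so areas[frozenset([i])] plays the role of |I_i|)."""
    p = 0.0
    for size in range(1, l + 1):
        sign = (-1) ** (size + 1)           # + for singles, - for pairs, ...
        for s in combinations(range(l), size):
            p += sign * areas[frozenset(s)] ** n
    return p
```

As a check, if U equals the first of two training patterns (agreement area 1 with it), the pairwise term cancels the second pattern's contribution and the result is 1, as the text argues.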





For example, suppose a discriminator is required to detect the horizontal position of a vertical bar and report on
the distance of the bar from the centre. The task is shown in Figure 4-3. T_1, the only training pattern, is seen to
be a bar of width 1/3 units, where we take the width and height of the retina to each be one unit. The test pattern
U could be a vertical bar of the same width as T_1 but anywhere that is wholly within the window. The distance of
the bar from the left hand edge is D, and the maximum value of D is 2/3. As D increases from 0 to 1/3 the overlap
between U and T_1 increases linearly, but the response R = |I_1|^n of the system for different n is governed by the
nth power law, see Figure 4-3.
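The geometry of this example is simple enough to compute directly. In the Python sketch below (names ours) the training bar T_1 is taken to be centred, i.e. with its left edge at D = 1/3, so that the agreeing area is 1 - 2|D - 1/3| while the bars overlap:

```python
def bar_response(D, n, width=1/3):
    """Expected response R = |I_1|**n when the training bar T_1 is centred
    (left edge at 1/3) and the test bar's left edge is at distance D from
    the left edge of a unit-square retina, 0 <= D <= 2/3.
    The bars disagree on a strip of width |D - 1/3| on each side of the
    overlap, so the agreeing area is 1 - 2*|D - 1/3| (floored at 1/3 once
    the bars are disjoint)."""
    offset = abs(D - width)
    agree = 1 - 2 * min(offset, width)
    return agree ** n
```

The response peaks at R = 1 when D = 1/3 (U = T_1) and falls away on either side; larger n sharpens the peak, which is exactly the nth power law at work.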

Comparison of storage requirements.

The functionality of the McCulloch and Pitts neuron model was a matter of interest in the mid 1960's and has
been discussed by [Muroga 1965]. With n inputs the 0/1 neuron can only perform linearly separable functions, i.e.
those that may be achieved by an (n-1)-dimensional hyperplane cutting the n-hypercube. We have seen that there
are about 2^(n^2) of these. Typical of the functions that such a device cannot perform are parity checking etc.

Figure 4-3 A discriminator for centering on a bar [From Neural Computing, I. Aleksander and H. Morton].

Suppose now we regard the n binary inputs as addressing memory in a RAM, as in the WISARD model discussed
earlier. There are 2^(2^n) logic functions

        f : (x_1, x_2, ..., x_n) → {0, 1}                                                            (26)

since there are 2^n possible inputs for each function and the function value at each of these inputs can be 0 or 1. The
functionality of the two models can be compared as follows.

With w-bit weights and zero threshold, a McCulloch and Pitts node requires wn bits of memory and can
perform at most 2^(wn) of the possible logic functions: a proportion of the total number possible which tends
exponentially to zero as n becomes large. In fact, there is no point in taking w very large for the discrete neuron,
since the number of hyperplane dichotomies of the 2^n vertices of the n-hypercube is fixed in terms of n (around
2^(n^2)); one needs just enough bits to make sure that all these hyperplane dichotomies are possible, and no more.
Although the proportion of possible logic functions which can be implemented by the McCulloch and Pitts neuron
is asymptotically zero as n tends to infinity, nevertheless it is this restricted functionality which gives the node the
capability of generalisation.
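A quick calculation makes the contrast vivid. The Python sketch below (name ours) returns the base-2 logarithm of the ratio 2^(wn) / 2^(2^n), an upper bound on the fraction of all Boolean functions of n inputs that a threshold unit with w-bit weights can realise; once 2^n exceeds wn the logarithm goes negative and the fraction collapses:

```python
def log2_fraction(w, n):
    """log2 of 2**(w*n) / 2**(2**n): an upper bound (in the exponent) on the
    fraction of the 2**(2**n) Boolean functions of n inputs realisable by a
    threshold unit with w-bit weights. Negative once 2**n > w*n."""
    return w * n - 2**n
```

With 8-bit weights the bound is still vacuous at n = 4 (the exponent is positive), but by n = 6 the fraction is at most 2^-16, and by n = 8 at most 2^-192.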

By comparison, a RAM with n address lines and 2^n bits of store can implement any of the possible logic functions
on n bits. However, the storage requirements rise exponentially with the number of inputs, and hence, in
assemblies of RAMs, with the level of interconnectivity. This is not the case with the McCulloch and Pitts
model, where storage increases linearly with n. In any event, detailed discussion of the relative merits of different
neural components frequently overlooks the fact that what is perhaps more relevant is the relative functionality



of large assemblies of such components.

In applications where the interconnectivity is low the idea of using a RAM offers obvious attractions. Two
problems present themselves. Firstly, what training algorithm should be used for assemblies of RAMs? Secondly,
how will such a system be capable of generalisation?

Subsequent work by Igor Aleksander's group at Imperial has resulted in a model known as the Probabilistic Logic
Node (PLN); such nodes are then cascaded into a pyramidal structure and combined with a simple multilayer
learning algorithm.

The PLN is a connectionist model introduced in [Aleksander 1988b] which is implementable as a RAM. Binary
inputs address a memory location; the node outputs the bit value stored there. The training algorithm sets location
contents according to a global error/correct signal; PLNs require no individual error information. A PLN can learn
any of the 2^(2^n) Boolean functions of its n inputs.

By this definition, it is straightforward to implement the PLN as a RAM with 2^n addressable memory locations.
In practice the output generally passes through a stochastic device before leaving the PLN. This randomizer allows
the PLN to exhibit non-deterministic properties; organic neurons also behave in a stochastic manner [Sejnowski
1981].
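A common formulation of the PLN combines the stored bit and the stochastic device into a three-valued store holding 0, 1 or 'u' (undefined), an undefined location emitting a random bit. The sketch below is ours, written under that assumption; the method names are illustrative:

```python
import random

class PLN:
    """Probabilistic Logic Node: a RAM with 2**n locations, each holding
    0, 1 or 'u' (undefined). An undefined location outputs 0 or 1 at
    random. Training acts on the most recently addressed location only,
    driven by a global correct/error signal."""
    def __init__(self, n, rng=random):
        self.store = ['u'] * (2 ** n)
        self.rng = rng
        self.last = None   # (address, output) of the most recent query

    def output(self, bits):
        """Address the store with the binary inputs and emit a bit."""
        addr = int(''.join(map(str, bits)), 2)
        value = self.store[addr]
        out = self.rng.randint(0, 1) if value == 'u' else value
        self.last = (addr, out)
        return out

    def reward(self):
        """Global 'correct' signal: freeze the output that was rewarded."""
        addr, out = self.last
        self.store[addr] = out

    def penalise(self):
        """Global 'error' signal: forget, and try again stochastically."""
        addr, _ = self.last
        self.store[addr] = 'u'
```

Note that, exactly as the text says, the node needs no individual error information: reward and penalise act only on the location it last addressed.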


                                              Chapter references


[Aleksander 1979] I. Aleksander and T.J. Stonham. A Guide to Pattern Recognition Using Random-Access
Memories. IEE Journal Computers and Digital Techniques, Vol. 2 (1), 29-40, 1979.

[Aleksander 1988a] A. Badii, M. J. Binstead, Antonia J. Jones, T. J. Stonham and Christine L. Valenzuela.
Applications of N-tuple Sampling and Genetic Algorithms to Speech Recognition. Neural Computing
Architectures, Chapter 10. Ed. I. Aleksander, Kogan Page, October 1988.

[Aleksander 1988b] I. Aleksander. Logical connectionist systems. In R. Eckmiller and Ch. v. d. Malsburg (Eds.)
Neural Computers (pp. 189-197). Springer-Verlag, Berlin, 1988.

[Bernstein 1981] J. Bernstein. Profiles: AI, Marvin Minsky. The New Yorker, December 14, 1981, pp 50-126.

[Binstead 1987] M. J. Binstead and Antonia J. Jones. A Design Technique for Dynamically Evolving N-tuple Nets.
IEE Proceedings, Vol. 134 Part E, No. 6, pp 265-269, November 1987.

[Bledsoe 1959] W. W. Bledsoe and I. Browning. Pattern Recognition and Reading by Machine. Proc. Eastern Joint
Computer Conf. Boston, Mass., 1959.






                                 V Feedforward networks and backpropagation.


Introduction.

In two papers describing the same model [Rumelhart 1986a], [Rumelhart 1986b], Rumelhart, Hinton and Williams
introduced a generalization of the Widrow-Hoff error correction rule called back propagation. The algorithm was
first described by Paul J. Werbos in his Harvard Ph.D. thesis [Werbos 1974] and also independently rediscovered
by [Parker 1985] and [Le Cun 1986]. The model assumes a discrete time system with synchronous update and with
each connection involving a unit delay.

Simple two-layer associative networks have no hidden units, they involve only input and output units. In these
cases there is no internal representation. As Minsky and Papert pointed out we need hidden units to provide the
possibility of recoding the input pattern into an internal representation because many problems simply cannot
otherwise be solved.

An example is the XOR problem mentioned earlier. Here the addition of a unit which detects the logical
conjunction of the inputs changes the similarity structure of the patterns sufficiently to allow the solution to be
learned, see Figure 5-1.

The problem addressed by simulated annealing
and backpropagation is the provision of a locally
computed learning rule which guarantees that an
internal representation adequate to solve the
problem will be found.

Backpropagation is based on local gradient descent. Output functions which step between 0 and 1, as the
activation of the neuron increases through a threshold value, are not differentiable and therefore provide no useful
gradient in weight space along which we can descend. We therefore consider semilinear activation functions. A
semilinear activation function f_i(net_i) is one in which the output of the unit is a non-decreasing and
differentiable function of the net input to the unit. In most cases f_i is independent of i and so we write f = f_i.

Figure 5-1 Solving the XOR problem with a hidden unit.
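The standard choice of semilinear activation is the logistic function, whose derivative is conveniently expressible through the output itself. A Python sketch (names ours):

```python
import math

def f(net):
    """The logistic function: a semilinear activation, non-decreasing and
    differentiable everywhere, unlike the hard 0/1 step."""
    return 1.0 / (1.0 + math.exp(-net))

def f_prime(net):
    """Its derivative: f'(net) = f(net) * (1 - f(net)), so the gradient
    can be computed from the unit's output alone."""
    y = f(net)
    return y * (1.0 - y)
```

This differentiability is precisely what the derivation below relies on: the factor f'(net_j) supplies a usable gradient at every point of weight space.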

Backpropagation - mathematical background.

The following functions, and hence all their partial derivatives, are assumed known.

         Error function.

                E(z_1, z_2, ..., z_n, t_1, t_2, ..., t_n)                                            (1)

         Here z_1, z_2, ..., z_n are the outputs from the output layer units and t_1, t_2, ..., t_n are the target outputs.








                        Figure 5-2 Feedforward network architecture.

          Activation function.

          Output layer:

                net_j = net_j(y_1, y_2, ..., y_m, p_j1, ..., p_jt)                                   (2)

Here y_1, y_2, ..., y_m are the outputs from the previous layer and p_j1, ..., p_jt are parameters associated with the jth node
of the output layer. Frequently t = t(m), i.e. the number of parameters associated with a node is a function of the
number of inputs.

          Previous layer:

                net_i = net_i(x_1, x_2, ..., x_l, p_i1, ..., p_is)                                   (3)

          Here x_1, x_2, ..., x_l are the outputs from the layer prior to the previous layer.

          Output function.

          Output layer:
                                                   zj   '    f(net j)               (1   # # j        n)              (4)
          Previous layer:
                                                yi      '    f(net i)               (1   # # i        m)              (5)


The output layer calculation.

For the output layer we have

                                Δpjz = -η ∂E/∂pjz                                                      (6)

where 1 ≤ z ≤ t, 1 ≤ j ≤ n, and η is the learning rate.

Hence

                                Δpjz = -η (∂E/∂netj)(∂netj/∂pjz) = η δj ∂netj/∂pjz                     (7)



where

                                δj = -∂E/∂netj = -(∂E/∂zj)(∂zj/∂netj) = -f′(netj) ∂E/∂zj               (8)

Equations (7) and (8) express the Δpjz in terms of known quantities.

The rule for adjusting weights in hidden layers.

In a similar way we can compute a rule which adjusts the weights in the previous layers.




                         Figure 5-3 The previous layer calculation.

We shall not go through the details of the derivation (which are quite straightforward) but the rule which emerges
is, for node i in this previous layer,

                                Δpiz = η δi ∂neti/∂piz                                                 (9)

where

                                δi = -f′(neti) ∂E/∂yi = f′(neti) Σj=1..n δj ∂netj/∂yi                  (10)

In (9) the partial derivative ∂neti/∂piz is known from (3). In (10) f′(neti) is known from (5), the δj were computed
in the previous step, and the last term is known from (2).

The conventional model.

The usual sum squared error is given by

                                E(z1, ..., zn, t1, ..., tn) = (1/2) Σj=1..n (zj - tj)²                  (11)

Hence, for 1 ≤ j ≤ n

                                ∂E/∂zj = zj - tj                                                       (12)




The linear activation function becomes

                                netj = Σi=1..m wji yi,      ∂netj/∂wji = yi
                                                                                                       (13)
                                neti = Σh=1..l wih xh,      ∂neti/∂wih = xh

where t = m and s = l. Thus (7) becomes

                                Δwjz = η δj yz                                                         (14)

and, using (12), (8) becomes

                                δj = -f′(netj) ∂E/∂zj = -f′(netj)(zj - tj)                             (15)

for an output layer unit.

Similarly (9) becomes

                                Δwiz = η δi ∂neti/∂wiz = η δi xz                                       (16)

where, from (10),

                                δi = f′(neti) Σj=1..n δj ∂netj/∂yi
                                   = f′(neti) Σj=1..n δj wji                                           (17)

for a hidden layer unit.

For this example of a linear activation function any sigmoidal function f is suitable. Frequently the function

                                zj = f(netj) = 1 / (1 + e^-(netj + θj))                                (18)

is used. Here θj is the threshold, or bias, of the unit. Conventionally the threshold is treated as just another weight
by creating a dummy unit which is always on (somewhat like ground in an electrical circuit). However, from the
present viewpoint the threshold can be considered as just another parameter associated with the unit, in which case
there is no need for the dummy unit.

Thus the implementation of back propagation involves a forward pass through the layers to estimate the error, and
then a backward pass modifying the synapses to decrease the error. Practical implementations are not difficult, but
without modification it is still rather slow, especially for systems with many layers. Still, it is at present the most
popular learning algorithm for multilayer networks. It is considerably faster than the Boltzmann machine.

The whole process is illustrated in the Mathematica file backprop.ma. The Mathematica program runs too slowly
to be of much practical use and is intended only to illustrate the process. A C-code implementation will run much
faster and is available in various implementations.
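The forward and backward passes can be sketched in a few lines of Python. This is a minimal illustration of equations (11)-(18) for a single hidden layer; the network sizes, learning rate, number of training cycles and training task (XOR) are arbitrary illustrative choices, not taken from the notes.

```python
import math
import random

# Minimal sketch of backpropagation for one hidden layer, following
# equations (11)-(18): sigmoid output function (18), sum-squared error (11),
# weight updates (14)-(17).  Sizes and the XOR task are illustrative.

random.seed(0)
f  = lambda net: 1.0 / (1.0 + math.exp(-net))   # sigmoid (18), bias folded into weights
df = lambda z: z * (1.0 - z)                    # f'(net) written in terms of z = f(net)

l, m, n, eta = 2, 4, 1, 0.5                     # inputs, hidden units, outputs, learning rate
W1 = [[random.uniform(-1, 1) for _ in range(l + 1)] for _ in range(m)]  # +1 bias weight
W2 = [[random.uniform(-1, 1) for _ in range(m + 1)] for _ in range(n)]

def forward(x):
    y = [f(sum(w * v for w, v in zip(row, x + [1.0]))) for row in W1]   # hidden outputs
    z = [f(sum(w * v for w, v in zip(row, y + [1.0]))) for row in W2]   # final outputs
    return y, z

def train_step(x, t):
    y, z = forward(x)
    dj = [df(zj) * (tj - zj) for zj, tj in zip(z, t)]        # (15): delta_j
    di = [df(yi) * sum(dj[j] * W2[j][i] for j in range(n))   # (17): delta_i
          for i, yi in enumerate(y)]
    for j in range(n):                                       # (14): output-layer update
        for i, v in enumerate(y + [1.0]):
            W2[j][i] += eta * dj[j] * v
    for i in range(m):                                       # (16): hidden-layer update
        for h, v in enumerate(x + [1.0]):
            W1[i][h] += eta * di[i] * v

data = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
        ([1.0, 0.0], [1.0]), ([1.0, 1.0], [0.0])]
sse = lambda: sum((forward(x)[1][0] - t[0]) ** 2 for x, t in data)

e0 = sse()
for _ in range(5000):
    for x, t in data:
        train_step(x, t)
print(e0, "->", sse())   # the error should fall substantially
```

With sigmoid units f′(net) = z(1 - z), which is why the derivative is written in terms of the unit's output rather than its net input.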

Problems with backpropagation.



Backpropagation is a very powerful tool for constructing non-linear models. However, there are a number of
problems which arise when one tries to use backpropagation in practice. These are:

         ! In many cases, given a particular training set, the minimum mean-squared error which one can expect
         on each output is unknown. Overtraining will result in memorisation, which will reduce the
         generalisation capability. Dealing with this problem can be very time consuming and involve trial and
         error.

         ! The optimal architecture for the network is unknown: too many units and the system will generalise
         poorly whilst taking a very long time to train; too few units and the system will not reach the optimal
         (unknown) mean-squared error. Dealing with this problem can be very time consuming and involve trial
         and error.

         ! Optimal values for the learning rate η > 0 (and the momentum term α > 0, if used) are unknown.
         Dealing with this problem can be very time consuming and involve trial and error.

One way of adjusting η > 0 dynamically as learning progresses is called the Bold Driver method. This adjusts the
learning rate according to the previous value of the error function: if the value has gone up the learning rate is
reduced proportionately; similarly the learning rate is increased if the value has gone down.
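The Bold Driver rule can be sketched as follows; the multiplicative factors (grow by 5%, halve on an error increase) are conventional illustrative choices, not values from the notes.

```python
# Sketch of the Bold Driver learning-rate schedule described above.
# The factors (1.05 up, 0.5 down) are illustrative conventions.

def bold_driver(eta, prev_error, new_error, up=1.05, down=0.5):
    """Return the adjusted learning rate given successive error values."""
    if new_error > prev_error:
        return eta * down     # error rose: cut the learning rate
    return eta * up           # error fell (or held): grow it gently

eta = 0.1
eta = bold_driver(eta, prev_error=1.00, new_error=0.90)   # error fell
eta = bold_driver(eta, prev_error=0.90, new_error=0.95)   # error rose
print(round(eta, 4))
```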

Until recently a successful application of backpropagation to a large problem was to some extent a matter of
patience and luck.

The gamma test - a new technique.

The Gamma test was developed by Aðalbjörn Stefánsson, N. Končar [Končar 1997], [Aðalbjörn Stefánsson 1997]
and myself and has emerged as an extremely useful tool to overcome the kinds of problems mentioned above.
It is a very simple technique which in many cases can be used to considerably simplify the design process of
constructing a smooth data model such as a neural network. The Gamma test is a data analysis routine that (in
an optimal implementation) runs in time O(MlogM) as M → ∞, where M is the number of sample data points, and
which aims to estimate the best Mean Squared Error (MSError) that can be achieved by any continuous or smooth
(bounded first partial derivatives) data model constructed using the data. A proof of the result under fairly general
hypotheses was finally given in [Evans 2002a] and [Evans 2002b].

Let a data sample be represented by

                                ((x1, ..., xm), y) = (x, y)                                            (19)

in which we think of the vector x = (x1, ..., xm) as the input, confined to a closed bounded set C ⊂ ℝ^m, and the scalar
y as the output. In the interests of simplicity the following explanation is presented for a single scalar output y. But
the same algorithm can be applied to the situation where y is a vector with very little extra complication or time
penalty. The Gamma test is designed to give a data-derived estimate for Var(r).

We focus on the case where samples are generated by a suitably smooth function (bounded first and second order
partial derivatives) f: C ⊂ ℝ^m → ℝ and

                                y = f(x1, ..., xm) + r                                                 (20)

where r represents an indeterminable part, which may be due to real noise or might be due to lack of functional
determination in the posited input/output relationship, i.e. an element of `one → many-ness' present in the data.

We make the following assumption

         ! Assumption A. We assume that training and testing data are different sample sets in which:
         (a) the training set inputs are non-sparse in input-space; (b) each output is determined from the

         inputs by a deterministic process which is the same for both training and test sets; (c) each
         output is subjected to statistical noise with finite variance whose distribution may be different
         for different outputs but which is the same in both training and test sets for corresponding
         outputs.

Suppose (x, y) is a data sample. Let (x′, y′) be a data sample such that |x′ - x| > 0 is minimal. Here |·| denotes
Euclidean distance and the minimum is taken over the set of all sample points different from (x, y). Thus x′ is the
nearest neighbour to x (in any ambiguous case we just pick one of the several equidistant points arbitrarily).

The Gamma test (or near neighbour technique) is based on the statistic

                                γ = (1/2M) Σi=1..M (y′(i) - y(i))²                                     (21)

where y′(i) is the y value corresponding to the first near neighbour of x(i).

It can be shown that γ → Var(r) in probability as the nearest neighbour distances approach zero. In a finite data
set we cannot have nearest neighbour distances arbitrarily small so the Gamma test is designed to estimate this
limit by means of a linear correlation.

Given data samples (x(i), y(i)), where x(i) = (x1(i), ..., xm(i)), 1 ≤ i ≤ M, let N[i, p] be the list of (equidistant) pth
nearest neighbours to x(i). We write

                δ(p) = (1/M) Σi=1..M (1/L(N[i, p])) Σj∈N[i, p] |x(j) - x(i)|²
                     = (1/M) Σi=1..M |x(N[i, p]) - x(i)|²                                              (22)

where L(N[i, p]) is the length of the list N[i, p]. Thus δ(p) is the mean square distance to the pth nearest neighbour.
Nearest neighbour lists for pth nearest neighbours (1 ≤ p ≤ pmax), typically we take pmax in the range 20-50, can be
found in O(MlogM) time using techniques developed by Bentley, see for example [Friedman 1977].

We also write

                γ(p) = (1/2M) Σi=1..M (1/L(N[i, p])) Σj∈N[i, p] (y(j) - y(i))²                         (23)

where the y observations are subject to statistical noise assumed independent of x and having bounded variance.6

Under reasonable conditions one can show that

                                γ ≈ Var(r) + Aδ + o(δ)    as M → ∞                                     (24)

where the convergence is in probability.

The Gamma test computes the mean-squared pth nearest neighbour distances δ(p) (1 ≤ p ≤ pmax, typically pmax ≈
10) and the corresponding γ(p). Next the (δ(p), γ(p)) regression line is computed and its vertical intercept is
returned as the gamma value Γ. Effectively this is the limit of γ as δ → 0, which in theory is Var(r).




    6
    The original version in [Aðalbjörn Stefánsson 1997] and [Končar 1997] used smoothed versions of δ(p) and
γ(p) which rolled off the significance of more distant near neighbours. Later experience showed that this
complication was largely unnecessary and the version of the software used here is implemented as described above.



          Procedure Gamma (or Near Neighbour Test) (data)
                     (* data is an array of points (x(i), y(i)), (1 ≤ i ≤ M), in
                     which x is a real vector of dimension m and y is a real scalar
                     *)

          For i = 1 to M (* compute x-nearest neighbour list for each data
          point. This can be done in O(MlogM) time using a kd-tree for
          example.*)
                For p = 1 to pmax
                      AppendTo[N(i, p)] all the elements t where x(t) is a pth
                      nearest neighbour to x(i).
                endfor p
          endfor i
          For p = 1 to pmax
                compute δ(p) as in (22)
                compute γ(p) as in (23)
          endfor p
          Perform least squares fit on coordinates (δ(p), γ(p)) (1 ≤ p ≤ pmax)
          obtaining (say) y = Ax + Γ
          Return (Γ, A)

Algorithm 5-1 The Gamma test.
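The algorithm can be sketched in plain Python as follows. For clarity this version finds neighbours by brute-force O(M²) search rather than the kd-tree mentioned above, breaks distance ties arbitrarily rather than averaging over equidistant lists, and tests on an illustrative function y = x1·x2 + noise; none of these choices come from the notes.

```python
import random

# Brute-force sketch of Algorithm 5-1 (the Gamma test).

def gamma_test(X, Y, pmax=10):
    """Return (Gamma, A): intercept and slope of the (delta(p), gamma(p)) fit."""
    M = len(X)
    dist2 = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    delta = [0.0] * pmax                      # mean-square pth NN distances, as in (22)
    gamma = [0.0] * pmax                      # corresponding gamma statistics, as in (23)
    for i in range(M):
        order = sorted((j for j in range(M) if j != i),
                       key=lambda j: dist2(X[i], X[j]))
        for p in range(pmax):
            j = order[p]
            delta[p] += dist2(X[i], X[j]) / M
            gamma[p] += (Y[j] - Y[i]) ** 2 / (2.0 * M)
    # least squares fit gamma = A*delta + Gamma; the intercept estimates Var(r)
    mx, my = sum(delta) / pmax, sum(gamma) / pmax
    A = (sum((d - mx) * (g - my) for d, g in zip(delta, gamma)) /
         sum((d - mx) ** 2 for d in delta))
    return my - A * mx, A

random.seed(1)
X = [[random.random(), random.random()] for _ in range(1000)]
Y = [x[0] * x[1] + random.gauss(0.0, 0.05) for x in X]   # Var(r) = 0.0025
G, A = gamma_test(X, Y)
print("Gamma =", G)    # should be close to 0.0025
```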
A technique which allows one to estimate Var(r), on the hypothesis of an underlying continuous or smooth model
f, is of considerable practical utility in applications such as control or time series modelling. The implication of
being able to estimate Var(r) in neural network modelling is that then one does not need to train the network (or
indeed any smooth data model) in order to predict the best possible performance with reasonable accuracy.

We have used the Gamma test:

         ! To find the minimal number of data samples required to produce a near optimal model. We do this by
         computing Γ for increasing M. The graph asymptotes to the true value of Var(r). When M is sufficiently
         large to ensure that Γ has stabilised close to the asymptote there is little advantage to be gained by
         increasing M.

         ! To automatically and rapidly (the near neighbour information can be used to speed up training by
         picking the data points with worst errors and performing backpropagation on a suitable subset of the
         weights) construct a near minimal neural network architecture and weights which best models the data
         [Končar 1997]. The Gamma test provides the criterion for ceasing training.


For estimating the number of `hills' required in an initial approximation for the network architecture we use the
heuristically based formula h = am, where

                                a ≈ (2/π)√(mA)                                                         (25)


         ! To determine the best embedding dimension and delay time for time series [Masayuki 1997].

         ! To determine the best set of inputs from a list of possible inputs for a neuro-controller [Končar 1997].


The last two applications illustrate what is perhaps the main utility of the Gamma test: on data sets which are not
excessively large the test is sufficiently fast to be run on a complete examination of all possible subsets of up to 20
inputs (for a larger number of inputs we use a Genetic Algorithm for which the fitness of a selection of inputs is
based on how small the Γ value is for the selection).

* Metabackpropagation.

This is an algorithm developed by N. Končar and myself which overcomes many of the disadvantages of simple
backpropagation. It is pivotally dependent on the Gamma test.

 Procedure Metabackpropagation
          1. Use the Gamma test to determine the optimal number of data vectors
          (input-output pairs) (M), the selection and number of inputs (m), the
          mean squared error (MSError) to which the network should be trained
          (Γ), and the first approximation to the neural architecture (using
          A).
          2. Create a feedforward network that has the number of hills (h)
          specified by A. A single hill is made of two fully connected hidden
          layers with 2m nodes in the first hidden layer and one node in the
          second hidden layer.
          3. Initialise each hill by doing a small number of Backpropagation
          training cycles on subsets of the data. Subsets are chosen from the
          near neighbour lists generated during the computation of the Gamma
          test results. To find these subsets, the neural network weights are
          first randomised and the points which give the largest errors are
          identified by feeding every input vector through the network. The
          near neighbour lists of the points which give the largest errors are
          the subsets that are chosen. Each hill is trained on its own
          exclusive subset. This will initialise the weights of each hill so
          that it is positioned in the right place.

          4. Perform Backpropagation training on the entire neural network
          architecture until either a specified number of cycles is exceeded or
          the target MSError is reached. Adjust the learning rate by the Bold
          Driver method.
          5. Increase the number of hills if the Gamma test MSError is not
          reached and then go back to 2.


Algorithm 5-2 Metabackpropagation.
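Step 2 of the procedure can be sketched as follows; the nested-list weight representation and the particular sizes are illustrative choices, not from the notes.

```python
import random

# Sketch of step 2 of Metabackpropagation: a network of h "hills", each
# hill being two fully connected hidden layers with 2m nodes in the first
# and one node in the second.  The representation is an illustrative choice.

random.seed(0)

def make_hill_network(m, h):
    """Return per-hill weights for m inputs and h hills (single output)."""
    rnd = lambda k: [random.uniform(-1, 1) for _ in range(k)]
    hills = []
    for _ in range(h):
        layer1 = [rnd(m + 1) for _ in range(2 * m)]  # 2m nodes, each with m inputs + bias
        layer2 = rnd(2 * m + 1)                      # 1 node over the 2m hill outputs + bias
        hills.append((layer1, layer2))
    return hills

net = make_hill_network(m=3, h=4)
n_weights = sum(len(w) for l1, l2 in net for w in l1 + [l2])
print(len(net), "hills,", n_weights, "weights")
```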

We have found that the application of the Gamma test and Metabackpropagation has transformed the production
of a nonlinear model using feedforward networks from a black art to an almost fully automated and fast process.

* Neural networks for adaptive control.

To illustrate our discussion of adaptive control we shall use a simple example.

         There are two tanks of water at temperatures Tcold < Thot respectively. The tanks are drained at
         rates c1(t) and c2(t) ltrs/sec into a third tank of volume Vmax (not used). The third tank drains at
         an (initially) constant rate r. The parameters c1 and c2 are considered as control variables and
         it is desired to determine a control strategy for c1 and c2 which will maintain a volume Vgoal at
         temperature Tgoal, where Tcold < Tgoal < Thot, in the third tank.








Figure 5-4 The Water Tank Problem: the cold tank (Tcold) and hot tank (Thot) drain through control valves
cont[[1]] and cont[[2]] into the target tank, which empties through a drain.


Let (V(t), T(t)) denote the volume and temperature of the target tank. The differential equations describing the
system are

                V dT/dt + T dV/dt = c1(t)Tcold + c2(t)Thot - rT
                                                                                                       (26)
                dV/dt = c1(t) + c2(t) - r

where c1 and c2 are the cold and hot valve settings and r is the rate of drain from the target tank.

Physical Assumptions.

         1. We assume that as water drains into the third tank mixing is instantaneous.
         2. We assume that there is no heat loss from the third tank, apart from that lost due to the outflow.
         3. We have assumed that the two feeder tanks contain water with a specific heat of unity. It is a simple
         matter to modify the equations to deal with the case of two inert liquids of specific heats s1 and s2
         respectively. If an endo- or exo-thermic reaction results from the mixing then this could also be taken
         account of in the model.
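A simple Euler integration of equations (26) illustrates the dynamics. Substituting dV/dt into the first equation gives dT/dt = (c1(Tcold - T) + c2(Thot - T))/V; the particular temperatures, rates and step size below are illustrative choices.

```python
# Euler-integration sketch of the tank equations (26).

def simulate(V, T, c1, c2, r, Tcold=10.0, Thot=90.0, dt=0.1, steps=1000):
    """Integrate (26) with constant valve settings c1, c2 and drain rate r."""
    for _ in range(steps):
        dV = c1 + c2 - r                              # dV/dt
        # from V dT/dt + T dV/dt = c1*Tcold + c2*Thot - r*T:
        dT = (c1 * (Tcold - T) + c2 * (Thot - T)) / V
        V += dt * dV
        T += dt * dT
    return V, T

# equal hot/cold inflow balancing the drain: volume holds, T relaxes towards 50
V, T = simulate(V=100.0, T=20.0, c1=1.0, c2=1.0, r=2.0)
print(round(V, 2), round(T, 2))
```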

We now proceed to construct an adaptive neurocontroller for the Water Tank Problem. The basic architecture we
shall use was developed by N. Končar and myself and applied to the Attitude Control Problem [Končar 1995]. It
is described in Figure 5-5.

The components of this system are:

         The Planner: Knowing the long term goal state (which could dynamically change) and the current state,
         together with maximal variations in one time step, the Planner sets the next desired state. A simple Linear
         Planner proceeds by taking a small segment of the straight line in state space between the current state
         and the goal. A wide range of other variations are also possible.

          The ∆ Mapping: This transforms the next desired state and the current state into an efficient
          representation of the difference between the two, which forms the inputs for the neural network.

          The Neural Network: Using the outputs from the ∆ mapping the neural network generates a suitable set
          of control signals for application at the current time step.





                                                       Input
  Long term                       Desired next          map
  goal state          Planner     state xG(k+1)                                      Control
                      unit                                             NN controller input u(k)
                                                           ∆
                                                                        (running)                         Dynamic       Actual next
   Current                                                                                                system        state x(k+1)
   state x(k)




         Training data from observed facts: Add the data ((x(k+1), x(k)), ( u(k))), (inputs, outputs),
         to the circular buffer of most recent (input,output) vector pairs. i.e. If the desired next state
         were x(k+1) and the current state is x(k) then the correct control input would be u(k).


                                                                    Input
                                                                     map
                         Next (desired) state       x(k+1)                            NN controller       Control
                                                                      ∆                (learning)
                        Current state               x(k)                                                   u(k)




          Training consists of performing backpropagation on the entire contents of the circular
          buffer of most recent (input, output) vector pairs.


Figure 5-5 Architecture for direct inverse neurocontrol.

The model is adaptive because whenever the network fails to place the next state within a prescribed error tolerance
of the next desired state, then the ((state, nextstate), (controls)) are added to the training buffer and the network
undergoes a further round of backpropagation training (it is assumed that this is done in hardware - the networks
under discussion usually have only a small number of nodes).

The main problem in specific applications centres around two issues:

          ! What input mapping ∆ is optimal for the specific application? This turns out to be quite critical. With
          the wrong mapping the controller may not work efficiently (or even at all). Ideally we seek to reduce
          equivalence classes of control inputs to a single representative, thus reducing the learning demands on
          the neural network.

          !    What is the optimal architecture for the neural network?

In fact both questions can be addressed by a combination of Metabackpropagation and the Gamma test.

The difference between the desired state (Vdes, Tdes) and the actual state (V, T) can be measured in various ways.
For example, with a time step between changes in control signals of delta = 1 in the simulator, if we just take all
four of the scalars as inputs then we obtain a Γ̄ of around 0.08 on each control output. Since (0.08)½ ≈ 0.28, this
implies an absolute error on the control outputs of the order of 28%, which is unlikely to be effective.

If we return to the original equations for the Water Tank we can determine that if at two states (V1, T1), (V2, T2)
we apply the same control inputs (c1, c2) then (dV/dt, dT/dt) will be the same at both states provided

                        V2T1 − V1T2  =  [(c1Tcold + c2Thot) / (c1 + c2)] (V2 − V1)                            (27)
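As a quick numerical check of equation (27), the sketch below assumes a perfectly mixed tank with cold inflow
c1 at temperature Tcold and hot inflow c2 at Thot; this is an assumed form of the Water Tank dynamics, not the
notes' exact simulator:

```python
# Assumed water-tank dynamics (a sketch, not the notes' exact simulator):
# perfectly mixed tank, cold inflow c1 at Tcold, hot inflow c2 at Thot.
def derivatives(V, T, c1, c2, Tcold=10.0, Thot=90.0):
    dV = c1 + c2                                    # volume grows with total inflow
    dT = (c1 * (Tcold - T) + c2 * (Thot - T)) / V   # mixing drives temperature
    return dV, dT

c1, c2 = 1.0, 3.0
V1, T1 = 100.0, 50.0
# Choose (V2, T2) to satisfy equation (27): V2*T1 - V1*T2 = K*(V2 - V1).
K = (c1 * 10.0 + c2 * 90.0) / (c1 + c2)
V2 = 120.0
T2 = (V2 * T1 - K * (V2 - V1)) / V1
# The two derivative pairs agree: the states are control equivalent.
print(derivatives(V1, T1, c1, c2), derivatives(V2, T2, c1, c2))
```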




Thus, under these circumstances, the two states (V1, T1), (V2, T2) are control equivalent, i.e. the application of
(c1, c2) in either of these states will (instantaneously) produce the same effect. This suggests the input mapping
described in Figure 5-6.



[Diagram: (Vdes, Tdes, V, T) pass through the ∆ transformation to give the two network inputs
VdesT − VTdes and Vdes − V.]

             Figure 5-6 Delta transformation of state differences: maps to the 2-4-2 network inputs
             for the Water Tank Problem.
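The Figure 5-6 transformation itself is a one-liner; the sketch below is only an illustration of the mapping,
with made-up argument values:

```python
# Sketch of the Figure 5-6 input map: the 4-dimensional state difference
# (Vdes, Tdes, V, T) is collapsed to the two 2-4-2 network inputs.
def delta_map(Vdes, Tdes, V, T):
    return (Vdes * T - V * Tdes, Vdes - V)

print(delta_map(2.0, 3.0, 1.0, 4.0))  # -> (5.0, 1.0)
```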

If we perform the Gamma test using these inputs we obtain (for c1 and similarly for c2) the regression line
illustrated in Figure 5-7. This may well not be the ideal mapping but we have some grounds for preferring it over
the naive choice. (The Attitude Control problem similarly involves some mathematical ingenuity in the
construction of the ∆ mapping.)

If we take the average for the pth (2 ≤ p ≤ 5) nearest neighbour then we estimate Γ̄ = lim Γ as 0.027, leading
to an absolute error on the outputs of the order of 16.4%. This represents the best MSError that a feedforward
neural network trained on the data can achieve.

[Plot: the Gamma regression line, Γ (capgamma) against δ (capdelta), over δ = 0.002-0.014 and Γ = 0-0.035.]

Figure 5-7 Least squares fit to 200 data points with 20 nearest neighbours: Γ̄ = 0.0332.

Although the question of building an adaptive ∆ mapping is obviously crucial, if we have some mathematical
understanding of the system we are seeking to control then the Gamma test is obviously very time saving.

Figure 5-8 - Figure 5-10 show the results of a quick trial of the neurocontroller (illustrated in the Mathematica
file nn-tank.ma). These results for the Water Tank Problem are preliminary examples and do not represent a final
design. We have not made the training adaptive in the example file. The training was done prior to the simulation
on 200 data points uniformly distributed in state-difference space. Performance for small state differences
(effectively performance near the goal state) could easily be improved by increasing the density of training data
near the origin of state-difference space. It might also be improved by modifications of the Planner module.
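For orientation, the idea behind the Gamma test (see [Evans 2002a] for the real algorithm and its proof) can be
caricatured as follows. This is only a rough sketch, not the published method: average squared input distances
(δ) and half squared output differences (γ) over the p-th nearest neighbours, then extrapolate the least-squares
line to δ = 0; the intercept estimates the noise variance on the output.

```python
import numpy as np

def gamma_test(X, y, p_max=10):
    """Rough sketch of the Gamma test: the regression-line intercept at
    delta = 0 estimates the variance of the noise on the output y."""
    n = len(X)
    # Pairwise squared input distances, self-distances excluded.
    D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(D, np.inf)
    order = np.argsort(D, axis=1)          # neighbour lists for each point
    deltas, gammas = [], []
    for p in range(p_max):
        nb = order[:, p]                   # index of each point's (p+1)-th neighbour
        deltas.append(D[np.arange(n), nb].mean())
        gammas.append(0.5 * ((y - y[nb]) ** 2).mean())
    slope, intercept = np.polyfit(deltas, gammas, 1)
    return intercept                       # the Gamma statistic

# Noisy quadratic: noise variance 0.01, so the intercept should be near 0.01.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 2))
y = (X ** 2).sum(1) + rng.normal(0, 0.1, 500)
print(gamma_test(X, y))
```

The square root of the intercept then bounds the best achievable absolute error, exactly as used in the 28% and
16.4% estimates above.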







[Plots: volume (ltrs) against time (secs). A close-up over t = 102-110 secs shows the volume settling between
roughly 96 and 104 ltrs; the full run over t = 0-200 secs spans roughly 0-140 ltrs.]

Figure 5-8 Volume variation without adaptive training. 2-4-2 Network. MSE = 0.052. Linear
Planner.




[Plots: temperature against time (secs). A close-up over t = 96-104 secs shows the temperature settling between
roughly 49 and 51; the full run over t = 0-200 secs spans roughly 20-70.]

Figure 5-9 Temperature variation without adaptive training. 2-4-2 Network. MSE = 0.052.
Linear Planner.




[Plots: valve setting (ltrs/sec) against time (secs). A close-up over t = 96-104 secs and the full run over
t = 0-200 secs, with settings in the range 0-4.]

Figure 5-10 The control signals generated by the 2-4-2 network without adaptive training.
Linear Planner.






                                             Chapter references

[Evans 2002a] D. Evans and Antonia J. Jones. A proof of the Gamma test. Proc. Roy. Soc. Series A 458(2027),
2759-2799, 2002.

[Evans 2002b] D. Evans, Antonia J. Jones, W. M. Schmidt. Asymptotic moments of near neighbour distance
distributions. Proc. Roy. Soc. Lond. Series A, 458(2028):2839-2849, 2002.

[Friedman 1977] J.H. Friedman, J.L. Bentley and R.A. Finkel. An algorithm for finding best matches in
logarithmic expected time. ACM Transactions on Mathematical Software, 3(3):200-226, 1977.

[Jones 2002] Antonia J. Jones, A.P.M. Tsui, and Ana G. Oliveira. Neural models of arbitrary chaotic systems:
construction and the role of time delayed feedback in control and synchronization. With html and pdf electronic
supplement. Complexity International, Volume 09, 2002. ISSN 1320-0682. Paper ID: tsui01, URL:
http://www.csu.edu.au/ci/vol09/tsui01/

[Končar 1997] N. Končar. Optimisation methodologies for direct inverse neurocontrol. Forthcoming Ph.D. thesis,
Department of Computing, 180 Queen's Gate, London, SW7 2BZ, U.K.

[Končar 1995] N. Končar and Antonia J. Jones. Adaptive real-time neural network attitude control of chaotic
satellite motion. Presented at Aerospace/Defense Sensing & Control and Dual-Use Photonics, SPIE (The
International Society for Optical Engineering and Photonics in Aerospace Engineering) International Symposium,
Orlando, Florida, April 17-21, 1995.

[Rumelhart 1986a] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Learning internal representations by error
propagation. Chapter 8, Parallel Distributed Processing, Vol. 1., M.I.T. Press.

[Rumelhart 1986b] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Learning representations by
back-propagating errors, Nature 323: 533-536.

[Adalbjörn Stefánsson 1997] Adalbjörn Stefánsson, N. Končar and Antonia J. Jones. A note on the Gamma test,
Neural Computing & Applications 5(3):131-133, 1997. ISSN 0941-0643. [Preprint]

[Tsui 2002] Alban P. Tsui, Antonia J. Jones and Ana Guedes de Oliveira. The construction of smooth models using
irregular embeddings determined by a Gamma test analysis. Neural Computing & Applications 10(4), 318-329,
April 2002. ISSN 0941-0643.

[Werbos 1974] P. J. Werbos, Beyond regression: new tools for prediction and analysis in the behavioral sciences.
Ph.D. thesis, Harvard University, Cambridge, MA.






                                                 * VI The chaotic frontier.


Introduction.

It is interesting to observe from a modern standpoint that Newtonian physics already contained the seeds of its own
destruction. Quite apart from the later quantum mechanical caveat of the Heisenberg Uncertainty Principle, the
classicists overlooked the 'computational cost' of making deterministic predictions an indeterminate time into the
future. The fact is that for many deterministic systems, the computational cost of making an accurate prediction
a substantial time into the future becomes prohibitive. Thus the classical view that if a system is deterministic then
its future behaviour could be predicted for all time contains a basic flaw. This flaw becomes particularly apparent
when we consider chaotic systems.

Chaos is the word we use to describe deterministic behaviour for which nevertheless, in view of the computational
cost, even if the initial conditions were known to an arbitrary degree of precision, the long term behaviour cannot
be accurately predicted. This is certainly the case with many natural systems for which in any case we cannot know
the initial conditions to an arbitrary degree of precision. A classic example, first considered by E. N. Lorenz, is
the weather [Lorenz 1963].

About a century ago, Henri Poincaré observed that the motion of three bodies under gravity can be extremely
complicated. His discovery was the first mathematical evidence of chaos. Since that time there have been many
observations of chaos both in mathematical models and natural systems. For many years the chaos observed in
the study of nonlinear dynamic systems was avoided because of its complexity; in practice such behaviour was often
ignored altogether, being interpreted either as completely unpredictable or ascribed to statistical noise. The theory of
nonlinear dynamics founded by Poincaré describes and classifies the behaviour of complex dynamical systems and
the manner in which they evolve through time. Such systems were extraordinarily difficult to study. The situation
changed dramatically with the invention of the modern computer. Scientists, especially mathematicians and
physicists, who had previously encountered chaos could pursue a more systematic study of the phenomenon using
the new tool.

The crucial importance of chaos is that it provides an alternative explanation for apparent randomness, one that
depends on neither noise nor complexity. Chaotic behaviour appears in systems that are essentially free from noise
and are also relatively simple, often with only a few degrees of freedom. Many natural systems exhibit chaos. In
the early years of the study, it was common to assume that such behaviour is unpredictable7 and therefore
uncontrollable. Since the late 1980's, a number of quite different techniques have been proposed for controlling
chaotic systems [Ott 1990], [Dracopoulos 1994], [Ogorzalek 1993].

Mathematical background.

A dth order autonomous continuous time system is defined as

                                     dxi/dt  =  gi(x1, . . . , xd, pi),    (1 ≤ i ≤ d)                          (1)

where t is the time, x = (x1, . . . , xd) is a vector in d-dimensional state space, i.e. the xi are the dynamic variables,
g = (g1, . . . , gd) is a vector field in the state space, i.e. the gi are functions of the xi, and the pi are the corresponding
vectors of the control parameters. As the vector field does not depend on time, the initial time may be taken as t0 = 0.




   7
     Unpredictable in the sense that, although completely deterministic the computational `cost' of an accurate
prediction rapidly becomes prohibitive as the prediction interval increases.


The Jacobian matrix J of an autonomous system described by d first-order differential equations is a d x d matrix
with elements defined as

                                           jl,m  =  ∂gl / ∂xm      (1 ≤ l, m ≤ d)                               (2)


If the determinant of the Jacobian matrix satisfies *det J* = 1 at all points the system is conservative. If the average
of *det J* < 1 then the system is dissipative. If the average of *det J* > 1, volume elements in state space expand
with time.

We distinguish between conservative and dissipative dynamic systems. In conservative systems volume elements
in the state space are conserved, whereas in a dissipative system the volume elements contract as the system
evolves. For dissipative systems, the effects of transients associated with initial conditions disappear in time. The
trajectory in state space will head for some final attracting region, or regions, which might be a point, curve, area,
and so on. Such an object is called the attractor for the system, since a number of distinct trajectories will be
attracted to this set of points in the state space. The properties of the attractor determine the long term dynamical
behaviour of the system.

The terms are best understood with examples. A well known nonlinear dissipative system is the Lorenz model
[Lorenz 1963], defined by the set of differential equations

                                              dx/dt  =  σ (y − x)
                                              dy/dt  =  x (R − z) − y                                           (3)
                                              dz/dt  =  x y − b z

This system has three degrees of freedom as there are three dynamic variables x, y and z. The control parameters
are σ, R and b. For R less than 1, all trajectories, no matter what their initial conditions, eventually end up
approaching the origin of the xyz state space. That is, for R < 1, all of the xyz space is the basin of attraction for
the attractor at the origin. Figure 6-1 illustrates a trajectory of the model with σ = 10.0, R = 0.5, b = 8/3, x = 1.0,
y = 2.0 and z = 3.0.

[Plot: the trajectory spiralling in towards the attractor at the origin of the xyz state space.]

Figure 6-1 Stable attractor.
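The R < 1 behaviour is easy to confirm numerically; the following is a minimal sketch using a fixed-step RK4
integrator (the step size and integration time are arbitrary choices):

```python
import numpy as np

# Minimal sketch: integrate the Lorenz equations (3) with fixed-step RK4
# for R = 0.5 < 1 and confirm the trajectory decays towards the origin,
# the attractor of Figure 6-1.
def lorenz(s, sigma=10.0, R=0.5, b=8.0 / 3.0):
    x, y, z = s
    return np.array([sigma * (y - x), x * (R - z) - y, x * y - b * z])

def rk4_step(f, s, h):
    k1 = f(s)
    k2 = f(s + 0.5 * h * k1)
    k3 = f(s + 0.5 * h * k2)
    k4 = f(s + h * k3)
    return s + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

s = np.array([1.0, 2.0, 3.0])   # the initial condition used in Figure 6-1
h = 0.01
for _ in range(5000):           # integrate for 50 time units
    s = rk4_step(lorenz, s, h)
print(np.linalg.norm(s))        # essentially zero: the origin attracts
```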

Chaos

There is currently great excitement and much speculation about chaos theory and its potential role in
understanding the world. A brief introduction to the history of the mathematical foundations of the subject can
be found in [Holmes 1990]. A chaotic system will remain apparently noisy regardless of how well experimental
conditions are controlled.

For the Duffing oscillator model, different values of d, f and ω exhibit completely different system behaviours. In
Figure 6-2, a time series of a periodic behaviour of the model for d = 0.15, f = 0.3 and ω = 1.0 is shown. In Figure
6-2, a typical time series of the chaotic behaviour of the model for d = 0.2, f = 36 and ω = 0.665 is illustrated. As
can be seen, the chaotic time series is more complicated in appearance, but there is a boundary within which the



system stays.

If a system displays divergence of nearby trajectories or sensitive dependence on initial conditions for some range
of its control parameter, then the long term behaviour of that system becomes essentially unpredictable8, i.e. the
long term future of a chaotic system is in practice indeterminable even though the system is theoretically
deterministic.

The effect of divergence of nearby trajectories on the behaviour of nonlinear systems is known as the butterfly
effect. The term was introduced by Lorenz based on the picturesque notion that if the atmosphere displays chaotic
behaviour with divergence of nearby trajectories, then even the flapping of a butterfly's wings would alter any long
term prediction of atmospheric dynamics.

This phenomenon is illustrated in Figure 6-3 for the x-coordinate of the Lorenz model with one trajectory starting
at x = 1.0, y = 2.0 and z = 3.0 in black, and another at x = 1.01, y = 2.01 and z = 3.01, in grey. Instead of
R = 0.5, we have used R = 28.0 as the model exhibits chaotic behaviour with this value.

For many nonlinear systems, we must integrate the equations step by step to find future behaviour. Any small
error in specifying the initial conditions will be magnified, leading to grossly different long term behaviour of the
system; therefore we cannot predict that long term behaviour in practice. Thus, chaotic behaviour is characterised
by the divergence of nearby trajectories in state space. As a function of time, the separation between two nearby
trajectories increases exponentially, at least for short times. (For short times because the trajectories stay within
some bounded region of the state space.)

[Plot: a chaotic time series x(t) over 0 ≤ t ≤ 600, bounded between roughly −4 and 4.]

Figure 6-2 A chaotic time series.
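The butterfly effect can be reproduced with the same kind of sketch: two Lorenz trajectories (equations (3) with
R = 28) started 0.01 apart in each coordinate, integrated with fixed-step RK4 (step size and horizon are arbitrary
choices):

```python
import numpy as np

# Sketch: two nearby Lorenz trajectories at R = 28 diverge until their
# separation saturates at the scale of the attractor.
def lorenz(s, sigma=10.0, R=28.0, b=8.0 / 3.0):
    x, y, z = s
    return np.array([sigma * (y - x), x * (R - z) - y, x * y - b * z])

def rk4_step(f, s, h):
    k1 = f(s)
    k2 = f(s + 0.5 * h * k1)
    k3 = f(s + 0.5 * h * k2)
    k4 = f(s + h * k3)
    return s + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

a = np.array([1.0, 2.0, 3.0])       # black trajectory in Figure 6-3
b2 = np.array([1.01, 2.01, 3.01])   # grey trajectory, perturbed by 0.01
h, max_sep = 0.01, 0.0
for _ in range(2500):               # 25 time units
    a = rk4_step(lorenz, a, h)
    b2 = rk4_step(lorenz, b2, h)
    max_sep = max(max_sep, float(np.linalg.norm(a - b2)))
print(max_sep)                      # tiny initial error grows to attractor scale
```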

In three or more dimensions, initially nearby trajectories can continue to diverge by wrapping over and under
each other. The crucial feature of state space with three or more dimensions which permits chaotic behaviour is
that trajectories remain within some bounded region by intertwining and wrapping around each other, without
intersecting and without repeating themselves exactly. The geometry created by such trajectories is strange. Such
attractors are thus called strange attractors [Ruelle 1980], i.e. if nearby trajectories on average diverge
exponentially then we say the attractor is strange or chaotic.

[Plot: x(t) for the two nearby trajectories over 0 ≤ t ≤ 20, with x between roughly −15 and 15.]

Figure 6-3 The butterfly effect.

Chaos in biology.

A relatively new model of brain function was first described by Freeman [Freeman 1991]. The idea is that
`thought' (in particular perception, prediction and control) consists of the flow (in the high dimensional state
space of vast assemblies of neurons) from one chaotic orbit to a periodic orbit. Freeman argues that chaos is
evident in the tendency of neural assemblies to shift abruptly from one complex activity pattern to a more stable
one in response to the smallest of inputs. This is a plausible model and if it stands the test of experiment and
scrutiny then chaos is an intrinsic feature of brain function.

Phase portraits made from EEGs (electroencephalographs) generated by computer models reflect the overall



   8
       In the sense previously discussed.


activity of the olfactory bulb of a rabbit at rest and in response to a familiar scent (e.g. banana). The resemblance
of these portraits to irregularly shaped, but still structured, coils of wire reveals that brain activity in both
conditions is chaotic but that the response to the known stimulus is more ordered, more nearly periodic during
perception, than at rest.

The heart also provides interesting examples of chaotic behaviour in biological systems. It is clear that the cardiac
waveform is nonlinear. There is also evidence that the cardiac cycle can usefully be described in terms of chaos.
A. Babloyantz and A. Destexhe [Babloyantz 1988] examined the ECGs (electrocardiographs) of four normal
human hearts, using qualitative and quantitative methods. With a variety of processing algorithms, such as power
spectrum, autocorrelation function, phase portrait, Poincaré section, Lyapunov exponent etc, they demonstrated
that the heart is not a perfect oscillator, but that cardiac activity stems from deterministic dynamics of a chaotic
nature. Numerous in-vivo and in-vitro experiments have investigated cardiac oscillatory activity and found
characteristic signatures of chaos [Choi 1983], [Chay 1985], [Geuvara 1981], [Goldberg 1984], [Keener 1981].

Controlling chaos.

The extreme sensitivity to initial conditions displayed by chaotic systems makes them unstable and unpredictable.
Yet the same sensitivity also makes them highly susceptible to control, provided that the chaotic system can be
analyzed and the analysis is then used to make small effective control interventions. By perturbing the system in
the right way, it is possible to encourage it to follow one of its many unstable but natural behaviours. In such
situations, it may be possible to use chaos to advantage, as chaotic systems, once under control, are very flexible.
Such systems can rapidly switch among many different behaviours. Incorporating chaos deliberately into practical
systems therefore offers the possibility of achieving greater flexibility in their performance.

In the context of chaos, control could mean a number of things. It could mean the elimination of multiple basins
of attraction, stabilisation of the fixed points or stabilisation of the unstable periodic orbits. Control of chaos is still
in its infancy but the potential it offers is enormous.

There are four main categories of chaos control methodologies. They are low energy, high energy, non-feedback
and feedback methods.

Low energy control methods require very small changes in the control parameter. In contrast, high energy control
methods require large changes. It is always desirable to have a control method of the low energy type, as in
physical systems the control parameters may be fixed or may be changeable by only a very small amount. When
large changes are required, a physical system may need to be redesigned, defeating the `control of chaos' concept,
as such an approach is closer to avoiding chaos.

In feedback methods, a control parameter is changed throughout the control. In non-feedback methods, a control
parameter is changed at the beginning of the control only, and left untouched during the control phase.

The original OGY control law.

Suppose p is some scalar control parameter which is to be varied at times ti, say p = pi over the interval (ti, ti+1),
see Figure 6-4. Suppose that the nominal value of p is p0. Our aim is to vary p by small amounts about p0 so as
to stabilise ξ(t) about a suitable control point ξF. For all i ≥ 1 let

[Diagram: over each interval (ti, ti+1) the parameter takes the constant value pi, and the state observed at time
ti is ξi.]

Figure 6-4 Intervals for which the variables are defined.





                          δp  =  p − p0        and        δξi+1(pi)  =  ξi+1(pi) − ξF(p0)                       (4)

Suppose that the iteration is described by the map ξi+1 = F(ξi, p). The locally linear behaviour of F in the vicinity
of a control point ξF is described by the dE x dE Jacobian matrix

                              J  =  DξF(ξ, p)   evaluated at   ξ = ξF,  p = p0                                  (5)



and in what follows we assume det J ≠ 0.

This yields the first order approximation

                                     δξi+1(pi)  ≈  J δξi(pi−1)  +  u δpi                                        (6)

where u is a vector which reflects the direction of the local gradient with respect to p.

Now suppose that an eigenvector of J, say eu, has a real eigenvalue whose absolute value is greater than 1. This
means that points ξi such that δξi lies in the direction of eu will, if p = p0 in the intervening time period, be such
that on the next iteration δξi+1 will lie further away from ξF(p0). We refer to this as an unstable direction. Stable
directions are characterised by eigenvectors of J whose eigenvalues have absolute value less than 1.

The basic idea of the OGY method is to choose p_i so as to eliminate the component of δξ_{i+1} in the unstable
direction(s). We are now almost ready to derive the appropriate control strategy. However, we first observe the
following lemma.

Lemma 6.1. Suppose the d × d matrix J has d linearly independent eigenvectors e_1, ..., e_d, with real eigenvalues
λ_1, ..., λ_d. Thus we assume the eigenvectors form a basis in ℝ^d. Construct the dual basis f_1, ..., f_d defined by

                                  e_i · f_j = 1 if i = j,    e_i · f_j = 0 if i ≠ j                   (7)

Then for any x ∈ ℝ^d

                                  f_u · Jx = λ_u (f_u · x)                                            (8)

Proof. Express x in terms of the eigenvectors, writing

                                  x = α_1 e_1 + α_2 e_2 + ... + α_d e_d                               (9)

where the α_i are suitable scalars depending on x. Thus from (7)

                                  f_u · x = α_u                                                      (10)

The effect of J on x is (from the definition of eigenvectors and eigenvalues)

                                  Jx = λ_1 α_1 e_1 + ... + λ_d α_d e_d                               (11)

Taking the inner product with f_u yields f_u · Jx = λ_u α_u, and the conclusion now follows from (10).  ∎
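The dual basis construction and the lemma are easy to check numerically. The sketch below uses an arbitrary illustrative 2 × 2 matrix (not one from the text); the dual basis vectors are obtained as the columns of the inverse-transpose of the eigenvector matrix.

```python
import numpy as np

# An arbitrary 2x2 matrix with two real eigenvalues, chosen for illustration only.
J = np.array([[1.8, 0.3],
              [0.2, 0.5]])

lam, E = np.linalg.eig(J)      # columns of E are the eigenvectors e_1, e_2
F = np.linalg.inv(E).T         # columns of F are the dual basis f_1, f_2

# Biorthogonality condition (7): e_i . f_j = 1 if i = j, else 0,
# since F^T E = E^{-1} E = I.
assert np.allclose(F.T @ E, np.eye(2))

# Lemma 6.1, eq. (8): f_u . (J x) = lambda_u (f_u . x) for arbitrary x
x = np.array([0.7, -1.3])
for u in range(2):
    fu = F[:, u]
    assert np.isclose(fu @ (J @ x), lam[u] * (fu @ x))
```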

We can now prove

Theorem 6.1 (OGY). The constraint f_u · δξ_{i+1} = 0 leads to the first order control law:

                                  δp_i ≈ −λ_u (f_u · δξ_i(p_{i−1})) / (f_u · u)                      (12)


where for ξ_i near ξ_F, the sensitivity vector u is defined as

              u = ∂[δξ_{i+1}(p_i) − J δξ_i(p_{i−1})]/∂p_i |_{p_0}
                = lim_{p_i → p_0} [δξ_{i+1}(p_i) − J δξ_i(p_{i−1})] / (p_i − p_0)                    (13)
Proof. Dotting (6) with f_u and using Lemma 6.1 we have

                                  f_u · δξ_{i+1}(p_i) ≈ λ_u (f_u · δξ_i(p_{i−1})) + (f_u · u) δp_i   (14)

Using the constraint f_u · δξ_{i+1} = 0 we obtain from (14)

                                  λ_u (f_u · δξ_i(p_{i−1})) + (f_u · u) δp_i ≈ 0                     (15)

which on solving for δp_i yields (12).  ∎
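One step of the control law (12) can be sketched as follows. The Jacobian J, sensitivity vector u and current deviation are illustrative numbers, not taken from any system in the text; the point is only that the chosen δp_i annihilates the unstable component of the next deviation.

```python
import numpy as np

# Linearised dynamics near the fixed point: d_xi_next = J d_xi + u dp  (eq. 6).
J = np.array([[1.9, 0.4],
              [0.1, 0.3]])           # illustrative Jacobian with one |lambda| > 1
u = np.array([0.5, -0.2])            # illustrative sensitivity vector
d_xi = np.array([0.02, -0.01])       # current deviation from the fixed point

lam, E = np.linalg.eig(J)
F = np.linalg.inv(E).T               # dual basis: columns f_j with e_i . f_j = delta_ij
iu = np.argmax(np.abs(lam))          # index of the unstable eigenvalue
fu, lam_u = F[:, iu], lam[iu]

# OGY control law (eq. 12): choose dp so that f_u . d_xi_next = 0
dp = -lam_u * (fu @ d_xi) / (fu @ u)

# One linearised step with the control applied
d_xi_next = J @ d_xi + u * dp
assert abs(fu @ d_xi_next) < 1e-12   # unstable component eliminated
```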

The following conditions are required to control a chaotic system with the original OGY method.

         ! Experimental time series of some scalar dependent variable x_t can be measured and a suitable
         embedding technique can be applied, or the mathematical model describing the system is
         available.
         ! The dynamics of the system can be represented as a low-dimensional surface of section and
         the system has at least two linearly independent real eigenvectors.
         ! There is a specific periodic orbit of the map which lies in the attractor, around which one
         wishes to stabilise, and the corresponding unstable periodic point can be located.
         ! A parameter p is available for external adjustment which can be used to slightly modify the
         system dynamics. Let the range in which p is allowed to vary be p_MIN < p < p_MAX. There is a
         maximum perturbation δp* in the parameter p by which it is acceptable to vary p from the
         nominal value p_0.
         ! The position of the periodic orbit is a function of p, but the local dynamics about it do not vary
         much with small changes in p.

Chaotic conventional neural networks.

It has been known since at least 1991 that conventional neural network models can exhibit chaotic behaviour.
Wang [Wang 1991] constructed a rather stylised simple 2-2 network with weights
                                  W = ( a   ka )
                                      ( b   kb )                                                     (16)

for a = −5, b = −25, k = −1, whose sigmoidal transfer function is

                                  σ(x) = 1 / (1 + e^{−x/T})                                          (17)

where T = 1/4, in which the outputs are fed back to the inputs as in Figure 6-5.
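A minimal sketch of iterating this network as a dynamical system follows. Feeding the two outputs straight back as the next inputs, as in Figure 6-5, is an assumption about the wiring; the chaoticity of the resulting orbit is not verified here.

```python
import numpy as np

# Wang's 2-2 feedback network, eqs. (16)-(17)
a, b, k, T = -5.0, -25.0, -1.0, 0.25
W = np.array([[a, k * a],
              [b, k * b]])

def sigma(x):
    # sigmoid with slope parameter T (eq. 17); clip the exponent to avoid overflow
    return 1.0 / (1.0 + np.exp(np.clip(-x / T, -500.0, 500.0)))

def step(v):
    return sigma(W @ v)            # outputs become the next inputs (Figure 6-5)

v = np.array([0.5, 0.6])           # start off the invariant diagonal x = y
for _ in range(1000):              # discard the transient
    v = step(v)

orbit = []
for _ in range(200):
    v = step(v)
    orbit.append(v)
orbit = np.array(orbit)            # points of the attractor, cf. Figure 6-6
```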




Wang proved that there exists period-doublings to chaos and strange attractors by using a homeomorphism from
the network to a known dynamical system having these properties. This formally established that artificial neural
networks can exhibit chaotic behaviour. At the same time [Welstead 1991] trained feedforward networks on the
Ikeda and Henon maps and then by feeding the outputs back into the inputs empirically produced neural networks
with chaotic attractors. Somewhat later further examples were given in [Dracopoulos 1993]9. None of these papers
considered the question of controlling the neural network behaviour.


Figure 6-5 Feedforward network as a dynamical system.
Figure 6-6 Chaotic attractor of Wang's neural network.

Controlling chaotic neural networks

The attractor of the Wang network is rather difficult to work with because the attractor is almost a curve, see
Figure 6-6. However, we were able to locate an unstable fixed point at (0.896853, 0.999980) which has a Jacobian
                                  J = ( −1.96322      2.08867    )
                                      ( −0.00755664   0.00893465 )                                   (18)

The awkward shape of the attractor is reflected in the fact that this Jacobian has very small determinant. Using
T as a control parameter with a nominal value of 1/4 we managed to stabilise this system about the fixed point.

In this section we consider neural networks with feedback whose dynamical behaviour is chaotic. This is a subject
of increasing interest for two reasons. First, because if Freeman's hypothesis is correct then despite their value in
practical applications (in for example pattern recognition) the idea that feedforward networks, regardless of the
training algorithm employed, are an accurate analogy to equivalent biological computations is seriously challenged.
Second, it is possible that by storing memories in (unstable) periodic behaviours, rather than at point attractors as
in the Hopfield model, the memory capacity of simple neural networks may be considerably enhanced.

We consider ways in which a small 2-10-10-2 network, trained on the Ikeda attractor, and whose outputs feed back
to the inputs can be controlled using small variations of various parameters or system variables. It needs to be said
that the control mechanisms employed are external to the neural network. This is in contrast to the hypothetical
biological process in which presumably some form of Hebbian learning causes frequently encountered sensory
input to be associated with a particular (unstable) periodic dynamical regime. The result being that when the
sensory input is re-encountered the neural system relaxes naturally onto a particular (unstable) periodic behaviour
which characterises the input, such switching of the dynamical behaviour being implicit to the neural structure
rather than being externally imposed. Nevertheless, we believe that such studies of how chaotic neural systems can
be encouraged to follow particular unstable periodic orbits are an interesting and probably necessary first step
in developing some understanding of how such behaviour might be made a natural characteristic of the neural



   9
       At that time the authors were unaware of the results presented in IJCNN-91.


system.

Other approaches are possible using different neural network models and different control techniques. For
example Babloyantz's group [Sepulchre 1993] [Lourenço 1994] [Babloyantz 1995] have controlled a network of
oscillators coupled to their four nearest neighbours using both the OGY method and a delayed feedback control
technique first suggested by Pyragas [Pyragas 1992]. Another example [Solé 1995] uses the GM technique
([Güemez 1993], [Matías 1994]) to control small (three or four neurons) fully connected neural networks whose
attractors are similar to a chaotic network first discussed by [Wang 1991].

Figure 6-7 The Ikeda strange attractor.
Figure 6-8 Attractor for the chaotic 2-10-10-2 neural network.

However, for the purposes of the present section we work with the Ikeda map [Hammel 1985] defined by

                                  g(z) = γ + R z exp[ i ( κ − α / (1 + |z|²) ) ]                     (19)

where z is a complex variable, of the form x + i y, and i² = −1. We can identify x + i y with the point (x, y) on the
complex plane, so that g can also be thought of as a mapping ℝ² → ℝ². The dynamical system is then defined by
z_{n+1} = g(z_n). For parameter values α = 5.5, γ = 0.85, κ = 0.4 and R = 0.9, this mapping has a strange attractor
illustrated in Figure 6-7. With only 4000 training pairs (re-scaled into the range [0, 1]) and a training MSE
of about 9.9×10⁻⁵, the network already produces an attractor (Figure 6-8) with features similar to the Ikeda
strange attractor shown in Figure 6-7.
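Generating the training data amounts to iterating (19) and rescaling; a sketch, with an arbitrary starting point and transient length:

```python
import numpy as np

# The Ikeda map (eq. 19) with the parameter values quoted in the text.
alpha, gamma, kappa, R = 5.5, 0.85, 0.4, 0.9

def ikeda(z):
    return gamma + R * z * np.exp(1j * (kappa - alpha / (1.0 + abs(z) ** 2)))

z = 0.1 + 0.1j                     # arbitrary initial condition
pts = []
for n in range(5000):
    z = ikeda(z)
    if n >= 1000:                  # discard the transient
        pts.append((z.real, z.imag))
pts = np.array(pts)                # 4000 attractor points, cf. Figure 6-7

# Re-scale into [0, 1], as described for preparing the training pairs
lo, hi = pts.min(axis=0), pts.max(axis=0)
scaled = (pts - lo) / (hi - lo)
```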

We use this network as the basis for the control experiments, the objective being to determine which parameters
or system variables are most effective in stabilising the system onto an unstable periodic attractor.

The OGY control method was applied to control the chaotic neural network described above. An unstable fixed
point ξ_F = (0.626870, 0.553256) was located by examining successive iterations of the system and was used as the
unstable periodic point to be stabilised. The Jacobian at this point was

                                  J = ( −1.26617    −1.03629 )
                                      ( −0.564996   −1.06779 )                                       (20)

with eigenvalues λ_s = −0.395399 and λ_u = −1.93857, stable eigenvector e_s = (0.7656, −0.643317) and unstable
eigenvector e_u = (−0.838887, −0.544306).
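Locating such a fixed point can be done by Newton's method on G(v) = F(v) − v, with the Jacobian estimated by finite differences. In the sketch below F is a stand-in map (the Hénon map), not the trained network, and the starting guess is illustrative.

```python
import numpy as np

def F(v):
    # stand-in 2-D map (Henon); for the network, F would be one feedback iteration
    x, y = v
    return np.array([1.0 - 1.4 * x * x + y, 0.3 * x])

def jacobian(F, v, h=1e-6):
    # central finite-difference estimate of the Jacobian of F at v
    n = len(v)
    J = np.empty((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = h
        J[:, j] = (F(v + e) - F(v - e)) / (2 * h)
    return J

v = np.array([0.5, 0.2])                   # initial guess near the attractor
for _ in range(50):                        # Newton iteration on F(v) - v = 0
    J = jacobian(F, v)
    v = v - np.linalg.solve(J - np.eye(2), F(v) - v)

assert np.allclose(F(v), v, atol=1e-10)    # v is a fixed point
eigvals = np.linalg.eigvals(jacobian(F, v))  # instability shows as |lambda| > 1
```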

Control varying T in a particular layer.

Initial attempts to control using the slope parameter T were not successful. The next attempts were made by varying
T only for neurons in a particular layer of the network, and here the OGY control method was more effective. It
seems that by varying T in only one particular layer the chaotic regions of the bifurcation diagrams become broader
(see Figure 6-9 and Figure 6-10) and so control becomes easier with small variations of T. The variations of T and
the controlled result are illustrated in Figure 6-11 to Figure 6-13.

Using small variations of the inputs.

The results of using an external signal feeding into one of the inputs as a control parameter whose nominal value
is set to zero were significantly more interesting. The bifurcation diagrams for x(t) are given in Figure 6-14 and
Figure 6-15. We use the same fixed point as before, so the Jacobian and associated eigenvectors and eigenvalues
remain unchanged.

Using an external signal feeding into input x (cf. Figure 6-5), the sensitivity vector ux = (-1.076260, -0.675875)
was approximated. After applying the OGY control for less than 25 time steps (the control variations are shown
in Figure 6-18) the system rapidly stabilized onto the unstable fixed point as illustrated in Figure 6-16 - Figure
6-17.

In these experiments, an improved technique due to [Otani 1996] was actually used to estimate the sensitivity
vectors u. In (13) the Jacobian is used to obtain a prediction of where the system would be at the next iteration if
no control were applied. However, in the case of a neural network this is unnecessary, since the network itself is
available as an exact model at every point. We can therefore obtain an exact prediction of the next system state by
simply iterating the network without control. This resulted in much more accurate estimates of the sensitivity
vectors, which in turn made control of the system using the OGY method much easier.
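The resulting estimate of u reduces to a one-sided finite difference of the map in the parameter. A sketch, with a stand-in map carrying an explicit parameter dependence (the function names and the map itself are illustrative, not the trained network):

```python
import numpy as np

def F(v, p):
    # stand-in 2-D map with parameter p added to the first component
    x, y = v
    return np.array([1.0 - 1.4 * x * x + y + p, 0.3 * x])

def sensitivity(F, v, p0, dp=1e-6):
    # u ~ d xi_{n+1} / dp at p0: compare one uncontrolled step at p0
    # with one step at a slightly perturbed parameter
    return (F(v, p0 + dp) - F(v, p0)) / dp

v = np.array([0.3, 0.2])
u = sensitivity(F, v, 0.0)      # here exactly (1, 0) by construction of F
```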

The OGY method can be applied to the control of conventional feedforward networks whose behaviour under
iterated feedback has been trained to be chaotic. Whilst the method is computationally expensive and, in its
original form, subject to a number of limitations (for example, inaccuracies in estimating the Jacobian or sensitivity
vectors can make control difficult if not impossible), we nevertheless see that stabilisation of unstable fixed points is
perfectly feasible. However, this relaxation onto a fixed point is achieved by a control external to the network itself
rather than as an implicit consequence of network function.

It is interesting to observe that control by variation of a global slope parameter is not easy to achieve but becomes
easier when the control variations are applied to a single layer rather than to the whole network. It is notable that
control becomes very much easier when the controlling parameter is a small signal applied to one of the inputs.
This may be closer to being a biological analogy than control of behaviour through global or selective slope control.






Figure 6-9 Bifurcation diagram x obtained by varying T in the output layer only.
Figure 6-10 Bifurcation diagram y obtained by varying T in the output layer only.




Figure 6-11 Variations of x from initiation of control.
Figure 6-12 Variations of y from initiation of control.
Figure 6-13 Parameter changes during output layer control.






Figure 6-14 Bifurcation diagram for the output x(t+1) using an external variable added to the input x(t).
Figure 6-15 Bifurcation diagram for the output y(t+1) using an external variable added to the input x(t).




Figure 6-16 Variations of x from initiation of control.
Figure 6-17 Variations of y from initiation of control.
Figure 6-18 Parameter changes during input x control.





Quite how easy it would be to extend such control to networks with many outputs being fed back to many inputs
remains to be determined. It also remains to be determined whether it is practical to control high dimensional
networks to follow unstable periodic orbits rather than fixed points. It is likely that more sophisticated variations
of the OGY technique or some completely different control method would be required to accomplish this goal.

Time delayed feedback and a generic scheme for chaotic neural networks.

Recently [Tsui 1999], [Oliveira 1998], [Tsui 2002], Ana Oliveira, Alban Tsui and I have discovered that it
is very easy to control and synchronize chaotic neural systems using time delayed feedback. Combined with the
Gamma test to select appropriate time-delays we can now easily achieve the following:

         ! Given an arbitrary smooth multidimensional chaotic system produce an iterated neural network which
         closely models the system.

         ! Using a simple technique of time delayed feedback cause the iterated chaotic neural network when
         presented with a stimulus to stabilize onto an unstable periodic orbit characteristic of the applied stimulus.
         Moreover, this response is quite stable in the presence of noise.

         ! Synchronize two identical copies of the network by transmitting the single output of one to the other.
         This provides an interesting model of other kinds of cortical activity and also has interesting applications
         in secure communications.

Although not a full account of our methods, these papers answer many of the questions mentioned. [Tsui 1999]
provides the first actual implementation of an artificial neural network which exhibits all of the properties
discussed in [Freeman 1991].

A generic scheme for such a stimulus-response recurrent network is shown in Figure 6-19. The single output of
the network feeds back into the inputs using delay buffers according to the embedding previously determined by
the Gamma test experiments. This embedding should contain enough information for predicting the next system
state.

For stabilization control a multiple (gain constant) of the delayed feedback is added to each neural network input
specified by the irregular embedding, based on the idea from Pyragas' delayed feedback control. The control
module is shown in Figure 6-19 and the control perturbation for the ith input at the nth iteration is
                                  k_i [ x_i(n − i − τ) − x_i(n − i) ]                                (21)

where k_i is a gain constant and τ is the delay time.

We imagine that the presence of an external stimulus excites (activates) the control circuitry, which is otherwise
inhibited. Thus to achieve a stabilised dynamical regime in response to a stimulus the control is switched on at the
same time as the external signal is fed into the input line xn. By varying the external signal in small steps and
holding the new setting fixed long enough for the system to stabilise we can observe the response of the network
to small changes in stimulus.

In the diagram, τ is the same for each control perturbation but, of course, we could set τ to be different on each
control line. External stimulus of the network can be applied to the controlled inputs as shown in the diagram. The
control module should switch on automatically and simultaneously whenever there is an external stimulation.
Variations of stimulation, such as on the control delayed feedback lines, may also be used.
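The essence of this Pyragas-style delayed feedback can be sketched on a scalar map. Here the logistic map stands in for a network output fed back to its input; the gain k = −0.6 and delay τ = 1 (targeting a fixed point) are hand-chosen illustrative values, not ones used in the experiments.

```python
import numpy as np

r, k, tau = 3.8, -0.6, 1

def f(x):
    return r * x * (1.0 - x)

x_star = 1.0 - 1.0 / r              # unstable fixed point: |f'(x_star)| = 1.8 > 1

def run(control, steps=300, x0=x_star + 0.01):
    x = [x0, f(x0)]                 # need tau past values before control starts
    for n in range(1, steps):
        # perturbation k (x(n - tau) - x(n)), cf. eq. (21); vanishes on the orbit
        perturb = k * (x[n - tau] - x[n]) if control else 0.0
        x.append(f(x[n]) + perturb)
    return np.array(x)

controlled = run(True)
free = run(False)
assert abs(controlled[-1] - x_star) < 1e-8   # feedback pins the orbit to x_star
```

Note the perturbation is proportional to the difference between delayed and current states, so it tends to zero once the target orbit is reached: the controlled system has the same fixed point as the free one.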





        Figure 6-19 A general scheme for constructing a stimulus-response chaotic recurrent neural network:
        the chaotic "delayed" network is trained on suitable input-output data constructed from a chaotic time
        series; a delayed feedback control is applied to each input line; entry points for external stimulus are
        suggested, with a switch signal to activate the control module during external stimulation; signals
        on the delay lines or output can be observed at the "observation points".

Example: Controlling the Hénon neural network

There follows an example of different responses of the Hénon neural system using different settings of the
controls and external stimulation. The response signals of the system can be observed at the output x(n) of the
feedforward neural network module or at the "observation points" on the delay lines x(n-1), ..., x(n-d), as indicated
in Figure 6-19. Due to the complexity of these neural systems, of course, not all possible settings are tried and
presented.


Figure 6-21 Response signal on x(n-6) with control signal activated on x(n-6) using k = 0.441628, τ = 2 and
without external stimulation, after the first 10 transient iterations. After n = 1000 iterations, the control is
switched off.

Figure 6-20 The control signal corresponding to the delayed feedback control shown in Figure 6-21. Note that the
control signal becomes small.







Figure 6-22 Response signals on network output x(n), with control signal activated on x(n-6) using k = 0.441628,
τ = 2 and with constant external stimulation sn added to x(n-6), where sn varies from -1.5 to 1.5 in steps of 0.1
at each 500 iterative steps (indicated by the change of hue of the plot points) after 20 initial transient steps.

Figure 6-23 The control signal corresponding to the delayed feedback control shown in Figure 6-22. Note that the
control signal becomes small even when the network is under changing external stimulation.


We use τ = 2 and k = 0.441628 for our control parameters on all the possible feedback control lines. The control
is applied to the delayed feedback line x(n-6). Without any external stimulation, and using only a single controlled
delayed feedback line, the network quickly produces a stabilised response, as shown in Figure 6-21, with the
corresponding control signal shown in Figure 6-20. Notice that the control signal is very small during the
stabilised behaviour. Under external stimulation of varying strength the network is still stabilised, but with a
variety of new periodic behaviours, as shown in Figure 6-22. The corresponding control signal remains small (see
Figure 6-23).
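This delayed feedback construction can be sketched in code. The following Python sketch is illustrative only, not
the original experiment: the trained feedforward network of Figure 6-19 is stood in for by the analytic Hénon map,
the gain k = 0.441628 and delay τ = 2 are taken from the text, and the state is clipped purely to keep the toy run
bounded (the trained network needs no such safeguard).

```python
import numpy as np

def controlled_henon(n_steps=1500, a=1.4, b=0.3,
                     k=0.441628, tau=2, n_on=10):
    """Iterate x(n+1) = 1 - a*x(n)^2 + b*x(n-1) (the Henon map,
    standing in for the trained network of Figure 6-19), perturbing
    the x(n) input line by u(n) = k*(x(n - tau) - x(n)) once the
    control module is switched on at step n_on."""
    x = [0.1, 0.1, 0.1]          # initial history, enough for tau = 2
    u = [0.0, 0.0]               # record of the control signal
    for n in range(2, n_steps):
        un = k * (x[n - tau] - x[n]) if n >= n_on else 0.0
        u.append(un)
        x_next = 1.0 - a * (x[n] + un) ** 2 + b * x[n - 1]
        # Clip only to keep this illustrative run bounded.
        x.append(float(np.clip(x_next, -10.0, 10.0)))
    return np.array(x), np.array(u)

xs, us = controlled_henon()
```

Plotting us against n gives a picture analogous to Figure 6-20; switching the control off after n = 1000, as in
Figure 6-21, amounts to forcing un = 0 beyond that step.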

For this system we then investigated the response of the system when the sensory input was perturbed by additive
Gaussian noise r with Mean[r] = 0 and standard deviation SD[r] = σ. Using the experimental setup as in Figure
6-22, the external stimulus was perturbed at each iteration step by adding Gaussian noise r with standard
deviation σ, i.e. giving an external stimulus sn+r. This experiment was repeated for different σ, where σ was
varied from σ = 0.05 to σ = 0.3, a high noise standard deviation with respect to the external stimulus strength,
which ranges from -1.5 to 1.5. The result for σ = 0.05 is shown in Figure 6-24 and Figure 6-25. Surprisingly, the
response signal stays almost the same, but the control signal is not small at all. The results for σ = 0.15 and
σ = 0.3 are in Figure 6-26 and Figure 6-27 respectively. As illustrated in these figures, the system dynamics
remain essentially unchanged, although, as one might expect, the response signal becomes progressively "blurred"
as the noise level increases. Similar results can be obtained for the other examples.
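The stimulus protocol just described can be sketched as follows. This is an illustrative reconstruction from the
figure captions, not the original code: sn steps from -1.5 to 1.5 in increments of 0.1, each level held for 500
iterations, with optional additive Gaussian noise of standard deviation σ.

```python
import numpy as np

def stimulus_schedule(steps_per_level=500, s_min=-1.5, s_max=1.5,
                      s_step=0.1, sigma=0.05, seed=0):
    """Build the external stimulus sequence sn + r: a staircase of
    constant levels (as in Figure 6-22) plus additive Gaussian noise r
    with Mean[r] = 0 and SD[r] = sigma (as in Figure 6-24)."""
    rng = np.random.default_rng(seed)
    levels = np.arange(s_min, s_max + s_step / 2, s_step)  # -1.5, -1.4, ..., 1.5
    s = np.repeat(levels, steps_per_level)   # hold each level for 500 steps
    r = rng.normal(0.0, sigma, size=s.size)  # Gaussian perturbation
    return s + r

noisy = stimulus_schedule()             # noisy staircase, sigma = 0.05
clean = stimulus_schedule(sigma=0.0)    # noise-free staircase of Figure 6-22
```

Feeding this sequence into the controlled input line reproduces the experimental protocol of Figures 6-22 and
6-24.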






Figure 6-24 Response signals on network output x(n), with control setup the same as in Figure 6-22 but with
Gaussian noise r added to the external stimulation, i.e. sn+r, with σ = 0.05, at each iteration step.

Figure 6-25 The control signal corresponding to the delayed feedback control shown in Figure 6-24.




Figure 6-26 Response signals on network output x(n), with control experiment setup the same as in Figure 6-22
but with Gaussian noise r added to the external stimulation, i.e. sn+r, with σ = 0.15, at each iteration step.

Figure 6-27 Response signals on network output x(n), with control experiment setup the same as in Figure 6-22
but with Gaussian noise r added to the external stimulation, i.e. sn+r, with σ = 0.3, at each iteration step.




                                                        Chapter references


[Abraham 1982] R. H. Abraham and C. D. Shaw, Dynamics - The geometry of behavior Part One : Periodic
Behavior, Aerial Press, California, 1982.

[Atkinson 1978] K. E. Atkinson, An introduction to numerical analysis, John Wiley & Sons, Canada, 1978.

[Auerbach 1987] Ditza Auerbach, Predrag Cvitanovic, Jean-Pierre Eckmann, Gemunu Gunaratne and Itamar
Procaccia. Exploring chaotic motion through periodic orbits. Physical Review Letters 58(23), 2387-2389, 1987.

[Auerbach 1992] D. Auerbach, C. Grebogi, E. Ott and J. A. Yorke. Controlling Chaos in High Dimensional
Systems, Physical Review Letters 69, 24, 3479-3482, 1992.




[Azevedo 1991] A. Azevedo and S. M. Rezende. Controlling Chaos in Spin-Wave Instabilities, Physical Review
Letters 66, 10, 1342-1345, 1991.

[Babloyantz 1988] A. Babloyantz and A. Destexhe. Is the normal heart a periodic oscillator? Biol. Cybern. 58,
203-211, 1988.

[Babloyantz 1995] A. Babloyantz, C. Lourenço, J.A. Sepulchre, Control of chaos in delay differential equations,
in a network of oscillators and in model cortex, Physica D 86, 274-283, 1995.

[Belmonte 1988] A. L. Belmonte, M. J. Vision, J. A. Glazier, G. H. Gunaratne and B. G. Kenny. Trajectory
scaling functions at the onset of chaos: experimental results, Physical Review Letters 61(5):539-542, 1988.

[Carroll 1992] T. L. Carroll, I. Triandaf, I. Schwartz and L. Pecora. Tracking unstable orbits in an experiment,
Physical Review A 46, 10, 6189-6192, 1992.

[Carroll 1993] Thomas L. Carroll and Louis M. Pecora. Using chaos to keep period-multiplied systems in phase,
Physical Review E 48, 4, 2426-2436, 1993.

[Choi 1983] M. Y. Choi and B. A. Huberman. Dynamic behaviour of nonlinear networks, The American Physical
Society, 28, 1204-1206, 1983.

[Chay 1985] T. R. Chay and J. Rinzel. Bursting, beating, and chaos in an excitable membrane model, J.
Biophysical Society, 47, 357-366, 1985.

[Crutchfield 1980] J. Crutchfield, D. Farmer, N. Packard, R. Shaw, G. Jones and R. J. Donnelly. Power spectral
analysis of a dynamical system, Physics Letters A 76, 1-4, 1980.

[Crutchfield 1981] J. Crutchfield, M. Nauenberg and J. Rudnick. Scaling for External Noise at the Onset of Chaos,
Physical Review Letters 46, 933-935, 1981.

[Cvitanovic 1989] P. Cvitanovic. Universality in Chaos : second edition, Adam Hilger, Bristol, 1989.

[Derrick 1993] W. R. Derrick in Chaos in chemistry and biochemistry, Ed R. J. Field and L. Györgyi, World
Scientific Publishing, 1993.

[Ditto 1990] W. L. Ditto, S. N. Rauseo and M. L. Spano. Experimental Control of Chaos, Physical Review Letters
65, 26, 3211-3214, 1990.

[Ditto 1993] W. L. Ditto and L. M. Pecora. Mastering Chaos, Scientific American, 62-68, August 1993.

[Dracopoulos 1993] D. C. Dracopoulos and Antonia J. Jones. Neuromodels of analytic Dynamic Systems. Neural
Computing & Applications, 1(4):268-279, 1993.

[Dracopoulos 1994] D. C. Dracopoulos and Antonia J. Jones. Neuro-Genetic Adaptive Attitude Control, Neural
Computing & Applications, 2(4):183-204, 1994.

[Dressler 1992] U. Dressler and G. Nitsche, Controlling chaos using time delay coordinates, Physical Review
Letters 68(1):1-4, 1992.

[Eckmann 1985] J. P. Eckmann and D. Ruelle. Ergodic theory of chaos and strange attractors, Rev. Modern
Physics 57, 3, 617-656, 1985.

[Freeman 1991] W.J. Freeman, The Physiology of Perception, Scientific American, 34-41, February 1991.

[Garfinkel 1992] A. Garfinkel, M. L. Spano, W. L. Ditto and J. N. Weiss. Controlling Cardiac Chaos, Science
257:1230-1235, 1992.

[Geuvara 1981] M. R. Geuvara, L. Glass and A. Shrier. Phase locking, period-doubling bifurcations, and irregular
dynamics in periodically stimulated cardiac cells, Science 214:1350-1352, 1981.

[Gills 1992] Z. Gills, C. Iwata, R. Roy, I. B. Schwartz and I. Triandaf. Tracking Unstable Steady States:
Extending the Stability Regime of a Multimode Laser System, Physical Review Letters 69(22):3169-3172, 1992.

[Gleick 1987] J. Gleick. CHAOS : Making a new science, Abacus book, London, 1987.

[Goldberg 1984] A. L. Goldberger, L. J. Findley, M. R. Blackburn and A. J. Mandell. Nonlinear dynamics in heart
failure: Implications of long-wavelengths cardiopulmonary oscillations, Amer. Heart J. 107:612-615, 1984.

[Grassberger 1983] P. Grassberger and I. Procaccia, Characterization of Strange Attractors, Physical Review
Letters 50(5):346-349, 1983.

[Grebogi 1988] C. Grebogi, E. Ott, J. A. Yorke. Unstable periodic orbits and the dimensions of multifractal
chaotic attractors, Physical Review A 37:1711-1723, 1988.

[Grebogi 1982] C. Grebogi, E. Ott, J. A. Yorke. Chaotic Attractors in Crisis, Physical Review Letters 48:1507-1510,
1982.

[Grebogi 1986] C. Grebogi, E. Ott, J. A. Yorke. Critical Exponent of Chaotic Transients in Nonlinear Dynamical
Systems, Physical Review Letters 57:1284-1287, 1986.

[Grebogi 1987] C. Grebogi, E. Ott, J. A. Yorke. Critical exponents for crisis-induced intermittency, Physical
Review A 36:5365-5380, 1987.

[Greenside 1982] H. S. Greenside, A. Wolf, J. Swift and T. Pignataro. Physical Review A 25:3453, 1982.

[Güémez 1993] J. Güémez and M.A. Matías, Control of chaos in unidimensional maps, Physics Letters A 181:29-
32, 1993.

[Gunaratne 1989] G. H. Gunaratne, P. S. Linsay and M. J. Vision. Chaos beyond Onset: A Comparison of Theory
and Experiment, Physical Review Letters 63(1):1-4, 1989.

[Hall 1992] N. Hall. The new scientist guide to chaos, Penguin books, London, 1992.

[Hammel 1985] S. Hammel, C. Jones, J. Moloney, Global Dynamical Behaviour of the Optical Field in a Ring
Cavity, J. Opt. Soc. Am. B 2(4):552-564, 1985.

[Hao 1990] B. Hao, Chaos II, World Scientific, 1990.

[Hayes 1993] S. Hayes, C. Grebogi and E. Ott. Communicating with Chaos, Physical Review Letters 70(20):3031-
3034, 1993.

[Hénon 1976] M. Hénon. A Two-dimensional Mapping with a Strange Attractor, Communications in Mathematical
Physics 50:69-77, 1976.


[Hilborn 1994] R. C. Hilborn. Chaos and Nonlinear Dynamics : An introduction for scientists and engineers,
Oxford University press, New York, 1994.

[Holmes 1990] P. Holmes. Poincaré, celestial mechanics, dynamical-systems theory and "chaos", Physics Reports
193(3):137-163, 1990.

[Hunt 1991] E. R. Hunt. Stabilizing High-Period Orbits in a Chaotic System: The Diode Resonator, Physical
Review Letters 67(15):1953-1955, 1991.

[Kaplan 1979] J. Kaplan, J. A. Yorke in Chaotic behavior of multidimensional difference equations, H. O. Peitgen
et al., Eds., "Springer Lecture, Notes in Mathematics", Springer-Verlag, Berlin, 730:204-227, 1979.

[Keener 1981] J. P. Keener. Chaotic cardiac dynamics, Lectures in Applied Mathematics 19:299-325, 1981.

[Kim 1992] J. H. Kim and J. Stringer, Applied Chaos, John Wiley & Sons, Canada, 1992.

[Lai 1994] Ying-Cheng Lai and Celso Grebogi. Synchronization of spatiotemporal chaotic systems by feedback
control, Physical Review E 50(3):1894-1899, 1994.

[Lai 1993] Y-C Lai, M. Ding and C. Grebogi. Controlling Hamiltonian chaos, Physical Review E 47(1):86-92,
1993.

[Lathrop 1989] Daniel P. Lathrop and Eric J. Kostelich. Characterization of an experimental strange attractor
by periodic orbits, Physical Review A 40(7):4028-4031, 1989.

[Lorenz 1963] E. N. Lorenz. Deterministic Nonperiodic Flow, J. Atmospheric Sciences 20:130-141, 1963.

[Lorenz 1993] E. N. Lorenz, The essence of chaos, University of Washington press, 1993.

[Lourenço 1994] C. Lourenço, A. Babloyantz, Control of Chaos in Networks with Delay: A Model for
Synchronization of Cortical Tissue, Neural Computation 6:1141-1154, 1994.

[Matías 1994] M.A. Matías and J. Güémez, Stabilization of Chaos by Proportional Pulses in the System Variables,
Physical Review Letters 72(10):1455-1458, 1994.

[Moss 1994] Frank Moss. Chaos under control, Nature 370:596-597, 1994.

[Ogorzalek 1993] M. J. Ogorzalek. Taming Chaos Part II: Control, IEEE Transactions on circuits and systems -
1: Fundamental theory and Applications. Volume 40(10):700-706, October 1993.

[Oliveira 1998] Ana Guedes de Oliveira and Antonia J. Jones. Synchronisation of chaotic maps by feedback control
and application to secure communications using chaotic neural networks, International Journal of Bifurcation
and Chaos, 8(11), November 1998.

[Otani 1997] M. Otani and Antonia J. Jones, Guiding chaotic orbits. Research Report, Department of Computer
Science, University of Wales, Cardiff, December 1997.

[Ott 1990] E. Ott, C. Grebogi and J.A. Yorke, Controlling Chaos, Physical Review Letters 64(11):1196-1199,
1990.

[Ott 1994] E. Ott, T Sauer and J. A. Yorke, Coping with Chaos, John Wiley & Sons, Canada, 1994.

[Parker 1989] T. S. Parker and L. O. Chua. Practical Numerical Algorithms for Chaotic Systems, Springer-Verlag,
New York, 1989.

[Parlitz 1985] U. Parlitz and W. Lauterborn. Superstructure in the bifurcation set of the Duffing equation, Physics
Letters A 107(8):351-355, 1985.

[Pearson 1986] C. E. Pearson, Numerical methods in engineering and science, Van Nostrand Reinhold Company
Inc, England, 1986.

[Peinke 1992] J. Peinke, J. Parisi, O. E. Rössler and R. Stoop, Encounter with Chaos : Self-Organized Hierarchical
Complexity in Semiconductor Experiments, Springer-Verlag, 1992.

[Petrov 1993] Valery Petrov, Vilmos Gaspar, Jonathan Masere and Kenneth Showalter. Controlling chaos in the
Belousov-Zhabotinsky reaction, Nature 361:240-243, 1993.

[Pfister 1992] G. Pfister, Th. Buzug and N. Enge. Characterization of experimental time series from Taylor-
Couette flow, Physica D 58:441-454, 1992.

[Provenzale 1992] A. Provenzale, L. A. Smith, R. Vio and G. Murante. Distinguishing between low-dimensional
dynamics and randomness in measured time series, Physica D 58:31-49, 1992.

[Pyragas 1992] K. Pyragas, Continuous control of chaos by self-controlling feedback, Physics Letters A 170:421-
428, 1992.

[Romeiras 1992] F. J. Romeiras, C. Grebogi, E. Ott and W. P. Dayawansa. Controlling chaotic dynamical systems,
Physica D 58:165-192, 1992.

[Rössler 1976] O. E. Rössler. An Equation for Continuous Chaos, Physics Letters A 57:397-398, 1976.

[Roy 1992] R. Roy, T. W. Murphy, T. D. Maier, Z. Gills and E. R. Hunt. Dynamical Control of a Chaotic
Laser: Experimental Stabilization of a Globally Coupled System, Physical Review Letters 68(9):1259-1262, 1992.

[Roy 1994] Rajarshi Roy and K. Scott Thornburg, Jr. Experimental Synchronization of Chaotic Lasers, Physical
Review Letters 72(13):2009-2012, 1994.

[Ruelle 1980] D. Ruelle. Strange Attractors, The Mathematical Intelligencer 2:126-137, 1980.

[Russell 1980] D. A. Russell, J. D. Hansen, and E. Ott. Dimensions of Strange Attractors, Physical Review Letters
45:1175-1178, 1980.

[Sano 1985] M. Sano and Y. Sawada. Measurement of the Lyapunov Spectrum from a Chaotic Time Series,
Physical Review Letters 55(10):1082-1085, 1985.

[Schiff 1994] S. J. Schiff, K. Jerger, D. H. Duong, T. Chang, M. L. Spano and W. L. Ditto, Controlling chaos in
the brain, Nature, 370:615-620, 1994.

[Schwartz 1994] I. B. Schwartz and I. Triandaf. Controlling unstable states in reaction-diffusion systems modeled
by time series, Physical Review E 50(4):2548-2552, 1994.

[Sepulchre 1993] J.A. Sepulchre and A. Babloyantz, Controlling chaos in network of oscillators, Physical Review
E 48(2):945-950, 1993.

[Shinbrot 1990] T. Shinbrot, E. Ott, C. Grebogi and J. A. Yorke. Using Chaos to Direct Trajectories to Targets,
Physical Review Letters 65(26):3215-3218, 1990.


[Shinbrot 1993] Troy Shinbrot, Celso Grebogi, Edward Ott and James A. Yorke. Using small perturbations to
control chaos, Nature 363:411-417, 1993.

[Shinbrot 1992a] T. Shinbrot, W. Ditto, C. Grebogi, E. Ott, M. Spano, and J. A. Yorke. Using the Sensitive
Dependence of Chaos (the "Butterfly Effect") to Direct Trajectories in an Experimental Chaotic System, Physical
Review Letters 68(19):2863-2866, 1992.

[Shinbrot 1992b] T. Shinbrot, C. Grebogi, E. Ott and J. A. Yorke. Using chaos to target stationary states of flows,
Physics Letters A 196:349-354, 1992.

[Singer 1991] J. Singer, Y-Z. Wang and H. H. Bau. Controlling a Chaotic System, Physical Review Letters
66(9):1123-1125, 1991.

[Smith 1986] W. A. Smith, Elementary Numerical Analysis, Prentice-Hall, England, 1986.

[Solé 1995] Ricard V. Solé, Liset Menéndez de la Prida, Controlling chaos in discrete neural networks, Physics
Letters A 199:65-69, 1995.

[Stewart 1989] I. Stewart, Does God Play Dice : The new mathematics of chaos, Penguin books, England, 1989.

[Stewart 1996] I. Stewart, From Here to Infinity : A guide to today's mathematics, Oxford University Press,
England, 1996.

[Thompson 1994] J. M. T. Thompson and S. R. Bishop, Nonlinearity and Chaos in Engineering Dynamics, John
Wiley & Sons, England, 1994.

[Tsui 1999] Alban P. Tsui and Antonia J. Jones. Periodic response to external stimulation of a chaotic neural
network with delayed feedback, International Journal of Bifurcation and Chaos, 9(4), April 1999.

[Tsui 2002] Alban P. Tsui, Antonia J. Jones and Ana Guedes de Oliveira. The construction of smooth models using
irregular embeddings determined by a Gamma test analysis, Neural Computing & Applications 10(4), 318-329,
April 2002. ISSN 0941-0643.

[Wang 1991] Xin Wang, Period-Doublings to Chaos in A Simple Neural Network, 1991 IEEE INNS International
Joint Conference on Neural Networks - Seattle, Vol II:333-339, 1991.

[Welstead 1991] Stephen T. Welstead, Multilayer Feedforward Networks Can Learn Strange Attractors, 1991
IEEE INNS International Joint Conference on Neural Networks - Seattle, Vol II:139-144, 1991.







                                                    COURSEWORK


            This work should be handed in three weeks before the Easter Break. It is suggested that you
            work on questions as the subject matter is covered in the course.


1.(a)       (i) Describe the genetic operators for crossover and mutation applied to genotypes consisting of fixed
            length strings.

            (ii) Give a detailed description of a typical genetic algorithm.

            (iii) Explain the different roles played by crossover and mutation in the process of genetic search.

(b) What are the main design problems in constructing a GA for a particular problem? What simple checks would
you suggest before running a full test of a new genetic algorithm (GA) to verify that it has some chance of
performing adequately?                                                                                    [4]

(c)         (i) "Darwin's theory of evolution is supposed to explain the diversity of species. If by definition two
            members of different species cannot interbreed then increasing specialization is the only plausible
            mechanism for species creation. Very specialized species are vulnerable to external changes and so in a
            dynamic environment might not be expected to survive in the long run. This would tend to decrease
            diversity rather than increase it." Discuss in two or three sentences.

            (ii) In the light of the above discussion identify features which might be present in natural evolution
            which are absent in genetic algorithms.


2(a). Let

$$I = \begin{pmatrix} i_{11} & i_{12} \\ i_{21} & i_{22} \end{pmatrix}$$

denote the input to a binary retina. Show that it is impossible to find a set of weights W = (wjk), wjk real numbers,
1 ≤ j, k ≤ 2, and a threshold θ so that the single linear function

$$P(I) = \begin{cases} 1 & \text{if } \sum_{j,k} w_{jk}\, i_{jk} > \theta \\[4pt] 0 & \text{if } \sum_{j,k} w_{jk}\, i_{jk} < \theta \end{cases}$$

can discriminate between horizontal and vertical lines.

(b) An alternative system is proposed in which a 2-tuple WISARD classifier with a 1-1 mapping is used.
Investigate the possible mappings and show that two-thirds of these lead to a system which discriminates perfectly
between horizontal and vertical lines.

(c) Suppose that for some given two-class problem on a 512 × 512 retina suitable weights can be found for the single
linear classifier system. Using 8 bits per weight, estimate the storage requirements for such a system. Allowing 10
micro sec per multiplication and 3 micro sec per addition, plus temporary storage overheads, calculate the
classification time. Would it be suitable for industrial application?

(d) Perform similar calculations, relating to storage and response time, for a WISARD 2-tuple system with a 1-1
mapping. Assume a conventional 16 bit architecture with an access time of 1 micro sec.

3(a). Briefly describe the main features of a Hopfield network: your discussion should include definitions for the
update rule and energy function.

(b) How are the weights of a Hopfield network usually assigned for a pattern recognition problem and what are
two limitations of this approach?

(c) It is proposed to apply some style of Hopfield network to the problem of finding a minimal tour for the
geometric TSP. Suggest and discuss one method of assigning network states to tours and briefly describe how the
weights of the network can be related to the distances between cities.

4(a).    (i) Why are hidden units necessary to solve general problems of function modelling using feedforward
         neural networks?

         (ii) What is the main difficulty in constructing a learning rule for feedforward networks with hidden
         units?

(b)      (i) Describe the backpropagation rule for learning using a feedforward network with one hidden layer.

                  Notation. The following functions, and hence all their partial derivatives, are assumed known.

                  Error function.
                                               E(z1, z2, ..., zn, t1, t2, ..., tn)                                    (24)

                  Here z1, z2, ..., zn are the outputs from the output layer units and t1, t2, ... ,tn are the target outputs.

                  Activation function.

                  Output layer:
                                       netj = netj(y1, y2, ..., ym, pj1, ..., pjt)                             (25)

                  Here y1, y2, ... ,ym are the outputs from the previous layer and pj1, ... , pjt are parameters
                  associated with the jth node of the output layer. Frequently t = t(m), i.e. the number of
                  parameters associated with a node is a function of the number of inputs.

                  Previous layer:
                                       neti = neti(x1, x2, ..., xl, pi1, ..., pis)                             (26)

                  Here x1, x2, ... ,xl are the outputs from the layer prior to the previous layer.

                  Output function.

                  Output layer:
                                           zj = f(netj)          (1 ≤ j ≤ n)                                 (27)




               Previous layer:
                                       yi = f(neti)          (1 ≤ i ≤ m)                                  (28)
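With this notation the gradients computed by backpropagation follow from the chain rule. As an illustrative
sketch (the abbreviation δj below is standard but not part of the notation above), for the output layer

$$\delta_j = \frac{\partial E}{\partial z_j}\, f'(\mathrm{net}_j), \qquad
\frac{\partial E}{\partial p_{jr}} = \delta_j\, \frac{\partial \mathrm{net}_j}{\partial p_{jr}}
\quad (1 \le j \le n,\; 1 \le r \le t),$$

while each node of the previous layer accumulates the deltas of the layer above:

$$\frac{\partial E}{\partial p_{ir}} =
\Bigl( \sum_{j=1}^{n} \delta_j\, \frac{\partial \mathrm{net}_j}{\partial y_i} \Bigr)
f'(\mathrm{net}_i)\, \frac{\partial \mathrm{net}_i}{\partial p_{ir}}
\quad (1 \le i \le m,\; 1 \le r \le s).$$

The weight update is then a gradient descent step on E using these partial derivatives.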


      (ii) What are the strengths and weaknesses of the backpropagation method?

(c)   The N-bit parity problem is this: given any vector x = (x1, x2, ..., xN), where xi ∈ {0, 1} (i.e. x is a vector
      of 0's and 1's), determine the parity of x. The parity of x is defined as 0 if the number of 1's in x is even
      and 1 if the number of 1's is odd.

      Construct a feedforward network to solve the N-bit parity problem.




                                                           81
CMT563


                                       Sample Examination Paper


                                          Time Allowed 2 hours


                                        Answer THREE Questions




1.(a) Give a high level pseudo-code description of a typical genetic algorithm.                        [5]

(b) Describe the genetic operators for crossover and mutation applied to genotypes consisting of
fixed length strings.                                                                        [5]

(c) Discuss three criteria by which one might evaluate the effectiveness of a heuristic search
algorithm, such as a genetic algorithm, when applied to a hard combinatoric search problem.
Briefly comment on each of these criteria.                                                 [5]

(d) What simple checks would you suggest before running a full test of a new genetic algorithm
(GA) to verify that the representation and crossover operator which have been selected have a
reasonable possibility of ensuring that the GA performs better than random search?          [5]


2(a). Describe the Hopfield model for a fully interconnected asynchronous network. Your
description should include a definition of the update rule for individual neurons and the method
for selecting which neuron to update at the next step.                                        [8]

(b) The energy of a Hopfield network with weights wij from the j th node to the i th node, with
wii = 0 for all i, wij = wji for all i, j, and thresholds θi, when the network is in state x = (x1, ..., xn)
∈ {0, 1}^n, where n is the number of neurons, is defined as

$$E = -\frac{1}{2} \sum_{\substack{i, j \\ i \neq j}} w_{ij}\, x_i x_j + \sum_i \theta_i\, x_i$$

Show that under the rules described in (a) the network will iterate to a state at which energy is
a local minimum and stay there.                                                              [8]

(c)     (i) If the Hopfield model is used as a self-associative memory how are the weights
        determined from the patterns?                                                   [2]

        (ii) What problems are encountered as more memories are added and what is the practical
        upper limit for memory storage with a given Hopfield network?
                                                                                            [2]



3(a). Describe the Wisard model for pattern recognition. State some advantages of this model
over other methods of pattern recognition.                                              [10]

(b) Explain how the storage requirements and response of a Wisard network alters as the size of
the n-tuple increases.                                                                      [5]

(c) Suppose that for some given two-class problem using a 512 × 512 retina a hardware Wisard
system is simulated using a conventional 16-bit serial architecture with a single 1 MHz processor.
Using 8 bits per weight estimate the storage requirements for such a system. Allowing 1 micro
sec for access, 10 micro sec for multiplication and 3 micro sec per addition and temporary storage
overheads, estimate the storage requirements and calculate the classification time. Would it be
suitable for industrial application?                                                           [5]

4(a). Briefly describe the backpropagation algorithm (detailed equations are not required but you
should explain what weight adjustments depend on at each step) and discuss its strengths and
weaknesses.                                                                                    [8]

(b) Suppose that the training data for a feedforward network is derived from a process which can
be modelled by a smooth function f from inputs to the single output y, and that in the data y is
subjected to measurement error r with mean zero. Identify the principal factors which will
determine the best Mean Squared Error that a trained network can achieve when tested on an
unseen set of data drawn from the same process.                                              [4]

(c) Briefly describe the Gamma test. Give an example of the type of problem when it would be
appropriate to use the Gamma test and an example where it would not be appropriate.      [8]






                                                     Solutions


1. (a) Give a high level pseudo-code description of a typical genetic algorithm                           [5]


             1. Randomly generate a population of M structures
                                     S(0) = {s(1,0),...,s(M,0)}.

             2. For each new string s(i,t) in S(t), compute and save its
             measure of utility v(s(i,t)).
             3. For each s(i,t) in S(t) compute the selection probability
             defined by
                      p(i,t) = v(s(i,t)) / Σi v(s(i,t)).
             4. Generate a new population S(t+1) by selecting structures from
             S(t) via the selection probability distribution and applying the
             idealised genetic operators to the structures generated.

             5. Goto 2.

   Algorithm 7-1 Generic GA
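Algorithm 7-1 can be sketched directly in Python (a minimal illustration; the OneMax fitness, string length, population size and rates are arbitrary choices, not part of the notes):

```python
import random

def generic_ga(fitness, l=20, M=30, p_mut=0.01, generations=50, seed=0):
    """Minimal GA following Algorithm 7-1: fitness-proportional
    selection, one-point crossover, per-site mutation."""
    rng = random.Random(seed)
    # 1. Randomly generate a population of M structures S(0).
    pop = [[rng.randint(0, 1) for _ in range(l)] for _ in range(M)]
    for _ in range(generations):
        v = [fitness(s) for s in pop]          # 2. utility v(s(i,t))
        total = sum(v)
        probs = [vi / total for vi in v]       # 3. p(i,t) = v(i)/sum v
        new_pop = []
        while len(new_pop) < M:                # 4. build S(t+1)
            a, b = rng.choices(pop, weights=probs, k=2)
            x = rng.randint(1, l - 1)          # crossover point
            child = a[:x] + b[x:]              # one of the two children
            child = [1 - c if rng.random() < p_mut else c for c in child]
            new_pop.append(child)
        pop = new_pop                          # 5. goto 2
    return max(pop, key=fitness)

best = generic_ga(sum)   # OneMax: fitness = number of 1's in the string
```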

(b) Describe the genetic operators for crossover and mutation applied to genotypes consisting of fixed length
strings.                             [5]

For the moment we represent strings as a1a2a3...al [ai = 1 or 0]. Using this notation we can describe the operators
by which strings are combined to produce new strings.

Crossover. In crossover one or more cut points are selected at random and the operation illustrated in the figure
(where two cut points are employed) is used to create two children. A variety of control regimes are possible, but
a simple strategy might be `select one of the children at random to go into the next generation'.

                         CROSSOVER (cut points shown as gaps)

                  Parent 1        1011 010011 10111
                  Parent 2        1100 111000 11010
                  Child 1         1100 010011 11010
                  Child 2         1011 111000 10111

                         MUTATION

                  110011100011010   ->   111011101011010

                         INVERSION

                  111111100011010   ->   110011111011010

                                 Standard genetic operators.

Crossing over proceeds in three steps:

         a) Two structures a1...al and b1...bl are selected at random
         from the current population.
         b) A crossover point x, in the range 1 to l-1 is selected, again
         at random.




         c) Two new structures                   a1a2...axbx+1bx+2...bl
                                                 b1b2...bxax+1ax+2...al

         are formed.

Children tend to be ‘like’ their parents, so that crossover can be considered as a focussing operator which exploits
knowledge already gained; its effects are quite quickly apparent.

Mutation. In mutation an allele is altered at each site with some fixed probability. Each structure a1a2...al in the
population is operated upon as follows. Position x is modified, with probability p independent of the other
positions, so that the string is replaced by

                                                 a1a2...ax-1 z ax+1...al

where z is drawn at random from the possible values. If p is the probability of mutation at a single position then
the number of mutations in a given string is binomially distributed and, for small p, is well approximated by a
Poisson distribution with parameter lp.

Mutation disperses the population throughout the search space and so might be considered as an information
gathering or exploration operator. Search by mutation is a slow process analogous to exhaustive search. Thus
mutation is a `background' operator, assuring that the crossover operator has a full range of alleles so that the
adaptive plan is not trapped on local optima.
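The two operators above can be written out as a short sketch (one-point crossover for simplicity):

```python
import random

def crossover(a, b, rng=random):
    """One-point crossover: cut both parents at a random point x
    (1 <= x <= l-1) and exchange the tails to form two children."""
    l = len(a)
    x = rng.randint(1, l - 1)
    return a[:x] + b[x:], b[:x] + a[x:]

def mutate(s, p, rng=random):
    """Flip each binary allele independently with probability p."""
    return [1 - a if rng.random() < p else a for a in s]

c1, c2 = crossover([1, 1, 1, 1, 1], [0, 0, 0, 0, 0], random.Random(3))
```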

(c) Discuss three criteria by which one might evaluate the effectiveness of a heuristic search algorithm, such as
a genetic algorithm, when applied to a hard combinatoric search problem. Briefly comment on each of these
criteria.                                                                                                     [5]

The three principal criteria are: solution quality; scaling in run time and memory for a given solution quality;
absolute run time should be acceptable. Solution quality is often hard to measure: for the TSP we might use the
Held-Karp lower bound. For hard combinatoric search a scaling of O(NlogN), where N is a measure of problem
size, is normally the best that can be achieved - anything worse than this results in unacceptable run times for large
problems. Acceptable absolute run time is a function of the commercial benefits and time available - for some early
VLSI layout designs super-computers were used with run times of several months. More usually a run time of
several days is the most that is acceptable.

(d) What simple checks would you suggest before running a full test of a new genetic algorithm (GA) to verify that
the representation and crossover operator which have been selected have a reasonable possibility of ensuring that
the GA performs better than random search.                                                                     [5]

One simple check is to run several thousand crossover events with randomly selected parents and record
child_fitness versus mean_parental_fitness: if the resulting scatter plot or correlation calculation reveals little or
no correlation between the two then the combination of representation and crossover is unlikely to produce a GA that
works any better than random search. On the other hand if there is a good correlation between child_fitness and
mean_parental_fitness then the mechanisms of evolutionary search have some chance of being effective.
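This check can be sketched as follows (illustrative only; OneMax fitness and random binary strings stand in for the real problem and representation):

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def crossover_check(fitness, population, trials=2000, seed=0):
    """Correlate child_fitness with mean_parental_fitness over many
    random one-point crossover events."""
    rng = random.Random(seed)
    l = len(population[0])
    mids, kids = [], []
    for _ in range(trials):
        a, b = rng.sample(population, 2)
        x = rng.randint(1, l - 1)
        child = a[:x] + b[x:]
        mids.append((fitness(a) + fitness(b)) / 2)
        kids.append(fitness(child))
    return pearson(mids, kids)

rng = random.Random(42)
pop = [[rng.randint(0, 1) for _ in range(30)] for _ in range(100)]
r = crossover_check(sum, pop)   # for OneMax the correlation is strong
```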


2. (a) Describe the Hopfield model for a fully interconnected asynchronous network. Your description should
include a definition of the update rule for individual neurons and the method for selecting which neuron to update
at the next step.                                                                                              [8]

Each of n neurons has two states, like those of McCulloch and Pitts, xi = 0 (not firing) and xi = 1 (firing at
maximum rate). When neuron i has a connection made to it from neuron j, the strength of the connection is defined



as wij. Non-connected neurons have wij = 0. The instantaneous state of the system is specified by listing the n values
of xi, so is represented by a binary word of n bits. The state changes in time according to the following algorithm.
For each neuron i there is a fixed threshold θi. Each neuron readjusts its state randomly in time but with a mean
attempt rate µ, setting

                    xi(t) = 1          if  Σj≠i wij xj(t-1)  >  θi
                    xi(t) = xi(t-1)    if  Σj≠i wij xj(t-1)  =  θi                                   (2)
                    xi(t) = 0          if  Σj≠i wij xj(t-1)  <  θi


Thus, an element chosen at random asynchronously evaluates whether it is above or below threshold and readjusts
accordingly. The procedure is described below.

 Procedure Hopfield (Assumes weights are assigned)
          Randomise initial state x ∈ {0, 1}^n
          Repeat until updating every unit produces no change of state:
                Select unit i (1 ≤ i ≤ n) with uniform random probability.
                Update unit i according to (2).
          End

Algorithm 7-2 Generic Hopfield net.
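The procedure can be sketched directly (an illustrative serial implementation; a shuffled sweep order stands in for the asynchronous random selection):

```python
import random

def hopfield_run(w, theta, x, seed=0):
    """Iterate the threshold rule (2) on units chosen in random order
    until a full pass over every unit produces no change of state."""
    rng = random.Random(seed)
    n = len(x)
    x = list(x)
    while True:
        changed = False
        order = list(range(n))
        rng.shuffle(order)                      # units updated one at a time
        for i in order:
            net = sum(w[i][j] * x[j] for j in range(n) if j != i)
            new = 1 if net > theta[i] else (0 if net < theta[i] else x[i])
            if new != x[i]:
                x[i], changed = new, True
        if not changed:
            return x

# two mutually excitatory units: the stable states are (0,0) and (1,1)
w = [[0, 1], [1, 0]]
theta = [0.5, 0.5]
```

Termination is guaranteed by the energy argument of part (b): with symmetric weights and zero diagonal each state change strictly decreases E.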

(b) The energy of a Hopfield network with weights wij from the j th node to the i th node, with wii = 0 for all i, wij
= wji for all i, j, and thresholds θi, when the network is in state x = (x1, ..., xn) ∈ {0, 1}^n, where n is the number of
neurons, is defined as

                              E  =  - (1/2) Σi,j (i≠j) wij xi xj  +  Σi θi xi                         (3)


Show that under the rules described in (a) the network will iterate to a state at which energy is a local minimum
and stay there.                                                                                                [8]

A low rate of neural firing is approximated by assuming that only one unit changes state at any given moment.
Then, since wij = wji, the change ΔE due to Δxi is given by

                              ΔE  =  - Δxi ( Σj≠i wij xj  -  θi )                                     (4)


Now consider the effect of the threshold rule (2). If the unit changes state at all then Δxi = ±1. If Δxi = 1 this means
the unit changes state from 0 to 1, hence by the threshold rule

                                      Σj≠i wij xj  >  θi                                              (5)

in which case, by (4), ΔE < 0. Alternatively, Δxi = -1, the unit therefore changes state from 1 to 0, and

                                      Σj≠i wij xj  <  θi                                              (6)

and again ΔE < 0. Thus if a unit changes state at all the energy must decrease. State changes will continue until
a locally least E is reached. The energy is playing the role of a Hamiltonian in the more general dynamical system
context.

(c)      (i) If the Hopfield model is used as a self-associative memory how are the weights determined from the
         patterns?                                                                                           [2]




An early rule used for memory storage in associative memory models can also be used to store memories in the
Hopfield model. If the {0, 1} node outputs are mapped to xi ∈ {-1, +1} the rule assigns weights as follows. For
each pattern vector x, which we require to memorise, we consider the matrix

                          x1                          x1x1  x1x2  ...  x1xn
                          x2                          x2x1  x2x2  ...  x2xn
              x xT   =    .    (x1, x2, ..., xn)  =    .     .    ...   .                             (7)
                          xn                          xnx1  xnx2  ...  xnxn

and then average these matrices over all pattern vectors (prototypes). We then set wii = 0 and note that the resulting
matrix is symmetric.

In this way we can capture the average correlations between components of the pattern vectors and then use this
information, during the operation of the network, to recapture missing or corrupted components.
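The averaged outer-product rule can be sketched as follows (illustrative; ±1 prototype patterns assumed):

```python
def hopfield_weights(patterns):
    """Average the outer products x x^T of equation (7) over the ±1
    prototype patterns, setting w_ii = 0.  The result is symmetric."""
    n = len(patterns[0])
    P = len(patterns)
    w = [[0.0] * n for _ in range(n)]
    for x in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += x[i] * x[j] / P
    return w

w = hopfield_weights([[1, -1, 1, -1], [1, 1, -1, -1]])
```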

         (ii) What problems are encountered as more memories are added and what is the practical upper limit for
         memory storage with a given Hopfield network?                                                       [2]

The main problem is that as the number of patterns P increases we find that an exponentially increasing number
of `spurious' local minima are introduced, i.e. minima not associated with a pattern. When P is approximately 0.15n,
where n is the number of nodes in the network, there is a dramatic degradation in the ability of the network to
recall noisy patterns.

3. (a) Describe the Wisard model for pattern recognition.




            Schematic of a 3-tuple recogniser.

The scheme outlined above was first proposed by Aleksander and Stonham in [Aleksander 1979]. The sample data



to be recognized is stored as a 2-dimensional array (the 'retina') of binary elements. Random connections are made
onto the elements of the array, N such connections being grouped together to form an N-tuple which is used to
address one random access memory (RAM) per discriminator. In this way a large number of RAM's are grouped
together to form a class discriminator whose output or score is the sum of all its RAM's outputs. This configuration
is repeated to give one discriminator for each class of pattern to be recognized. The RAMs implement logic
functions which are set up during training; thus the method does not involve any direct storage of pattern data.

The system is trained using samples of patterns from each class. A pattern is stored into the retina array and a
logical 1 is written into the RAM's of the discriminator associated with the class of this training pattern at the
locations addressed by the N-tuples. This is repeated many times, typically 25-50 times, for each class.

In recognition mode, the unknown pattern is stored in the array and the RAM's of every discriminator put into
READ mode. The input pattern then stimulates the logic functions in the discriminator network and an overall
response is obtained by summing all the logical outputs. The pattern is then assigned to the class of the
discriminator producing the highest score.

State some advantages of this model over other methods of pattern recognition.

Some advantages of the WISARD model for pattern recognition are:

         • Implementation as a parallel, or serial, system in currently available hardware is inexpensive and
         simple.

         • Given labelled samples of each recognition class, training times are extremely short.

         • The time required by a trained system to classify an unknown pattern is very small and, in a parallel
         implementation, is independent of the number of classes.

(b) Explain how the storage requirements and response of a Wisard network alters as the size of the n-tuple
increases.

The total amount of memory needed for the discriminator RAMs is easily calculated. Let n be the number of inputs,
C the number of classes, and assume a 1-1 mapping of a retina with p binary pixels. For convenience we assume
that n divides p. Then the number of 1-bit RAMs will be p/n and each RAM has 2^n bits of storage. So the number
of bits per discriminator is (p/n)·2^n. Hence the total number of bits for C classes is

                                           C · (p/n) · 2^n                                            (8)


The response of a discriminator becomes more sensitive to precise similarities of a pattern to patterns from the
corresponding training class as n increases.

(c) Suppose that for some given two-class problem using a 512 × 512 retina a hardware Wisard system is simulated
using a conventional 16-bit serial architecture with a single 1 MHz processor. Using 8 bits per weight estimate the
storage requirements for such a system. Allowing 1 micro sec for access, 10 micro sec for multiplication and 3
micro sec per addition and temporary storage overheads, estimate the storage requirements and calculate the
classification time. Would it be suitable for industrial application?                                           [5]

Storage requirements for n-tuple system:

                                     (No. of classes) × (Size of image) × 2^n / n

With two classes and n=2 this gives



                                      2 × 512 × 512 × 4/2 = 10^6 bits (approx).

Response time:

With a conventional 16-bit architecture, the computation time is mainly one of storage access (once per n-tuple
per pattern class). Taking 1 micro sec per access we have

                                      512 × 512 × 2 × 10^-6 / 2 = 1/4 sec (approx).

This is a bit slow for an industrial application (e.g. an assembly line).
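The arithmetic above can be checked with a short sketch (the helper names are hypothetical, introduced only for illustration):

```python
def wisard_storage_bits(classes, pixels, n):
    """Equation (8): C discriminators of (p/n) RAMs with 2^n bits each."""
    return classes * (pixels // n) * 2 ** n

def wisard_classify_secs(classes, pixels, n, access_secs=1e-6):
    """Serial response time: one RAM access per n-tuple per class."""
    return classes * (pixels // n) * access_secs

bits = wisard_storage_bits(2, 512 * 512, 2)    # about 10^6 bits
secs = wisard_classify_secs(2, 512 * 512, 2)   # about a quarter second
```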

4.(a) Briefly describe the backpropagation algorithm (detailed equations are not required but you should explain
what weight adjustments depend on at each step) and discuss its strengths and weaknesses.                    [6]

As is well known, there is no advantage in using several layers if the units have linear activation functions. Since
the delta rule is a modification of gradient descent we need to consider derivatives, and the activation functions
of linear threshold units are not differentiable (their derivative is infinite at the threshold and zero elsewhere). We
therefore consider semilinear activation functions. A semilinear activation function fi(neti) is one in which the
output of the unit is a non-decreasing and differentiable function of the net input to the unit.

Suppose the ith unit is an output unit. Let opi be the output produced by the ith unit when pattern p is presented
and tpi be the target output. In this case we set the error to be

                          δpi  =  (tpi - opi) f′i(netpi)          (for an output unit)

and the weight change for the weight associated with the jth input to the ith unit is then

                                      Δpwij  =  η δpi opj

where η > 0 is the learning rate.

The error signal for a hidden unit is determined recursively in terms of the error signals of the units to which it
directly connects and the weights of those connections, i.e.

                          δpi  =  f′i(netpi) Σk δpk wki           (for a hidden unit)

where k varies over those units to which the ith unit delivers outputs.

The weight change is then computed as before.

Thus the implementation of back propagation involves a forward pass through the layers to estimate the error, and
then a backward pass modifying the synapses to decrease the error. Practical implementations are not difficult, but
without modification it is still rather slow, especially for systems with many layers. Still, it is at present the most
popular learning algorithm for multilayer networks. It is considerably faster than the Boltzmann machine.
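The two delta rules can be sketched for a single hidden layer with sigmoid units (an illustrative implementation; the notes do not prescribe a particular activation function):

```python
import math

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

def backprop_step(x, t, w_hid, w_out, eta=0.5):
    """One backpropagation step: forward pass, then
         output unit:  d_i = (t_i - o_i) f'(net_i)
         hidden unit:  d_j = f'(net_j) * sum_k d_k w_kj
    with weight change dw_ij = eta * d_i * o_j.
    For the sigmoid, f'(net) = o (1 - o)."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hid]
    o = [sigmoid(sum(w * hj for w, hj in zip(row, h))) for row in w_out]
    d_out = [(ti - oi) * oi * (1.0 - oi) for ti, oi in zip(t, o)]
    d_hid = [hj * (1.0 - hj) * sum(d_out[k] * w_out[k][j]
                                   for k in range(len(o)))
             for j, hj in enumerate(h)]
    for i, di in enumerate(d_out):          # backward pass: output layer
        for j, hj in enumerate(h):
            w_out[i][j] += eta * di * hj
    for i, di in enumerate(d_hid):          # backward pass: hidden layer
        for j, xj in enumerate(x):
            w_hid[i][j] += eta * di * xj
    return o

w_hid = [[0.5, -0.5], [0.3, 0.8]]
w_out = [[0.4, -0.2]]
err = []
for _ in range(2):
    o = backprop_step([1.0, 0.0], [1.0], w_hid, w_out)
    err.append((1.0 - o[0]) ** 2)
```

After one weight update the squared error on the same pattern decreases, illustrating the gradient descent behaviour.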

(b) Suppose that the training data for a feedforward network is derived from a process which can be modelled by
a smooth function f from inputs to the single output y, and that in the data y is subjected to measurement error r with
mean zero. Identify the principal factors which will determine the best Mean Squared Error that a trained network
can achieve when tested on an unseen set of data drawn from the same process.                                       [4]

The function f can be regarded as a surface in m dimensions which we seek to approximate by a neural network
whose input -> output mapping we can call g. The training data gives a noisy model of the surface. It is plain that
there is no point in seeking to train the network beyond the stage where the Mean Squared Error (MSE) on the
training data is less than Var(r), the variance of r, since this corresponds to overfitting. This then will be the best



MSE possible.

The principal factors determining whether or not we can train g to have MSE ≈ Var(r) are:

         • The `bumpiness' of the surface defined by f. To accurately model a very bumpy surface obviously
         requires more data points.

         • The size of Var(r). The larger Var(r) becomes the less information is contained in any given data point.
         When Var(r) becomes comparable to the range of y very little information regarding f is retained in the
         training data.

         • The size of the training set. A very bumpy noisy surface will require a very large training set.

(c) Briefly describe the Gamma test. Give an example of the type of problem when it would be appropriate to use
the Gamma test and an example where it would not be appropriate.                                           [10]

[Any reasonable summary of the following]

Given data samples (x(i), y(i)), where x(i) = (x1(i), ..., xm(i)), 1 ≤ i ≤ M, let N[i, p] be the list of (equidistant) p th
nearest neighbours to x(i). We write

        δ(p)  =  (1/M) Σ(i=1..M) (1/L(N[i,p])) Σ(j ∈ N[i,p]) |x(j) - x(i)|²
              =  (1/M) Σ(i=1..M) |x(N[i,p]) - x(i)|²                                                  (12)




where L(N[i, p]) is the length of the list N[i, p]. Thus δ(p) is the mean square distance to the pth nearest neighbour.
Nearest neighbour lists for pth nearest neighbours (1 ≤ p ≤ pmax), typically we take pmax in the range 20-50, can be
found in O(MlogM) time using techniques developed by Bentley, see for example [Friedman 1977].

We also write

        γ(p)  =  (1/2M) Σ(i=1..M) (1/L(N[i,p])) Σ(j ∈ N[i,p]) (y(j) - y(i))²                          (13)



where the y observations are subject to statistical noise assumed independent of x and having bounded variance.10

Under reasonable conditions one can show that

        γ  →  Var(r) + A δ + o(δ)    as M → ∞                                                         (14)

where the convergence is in probability.

The Gamma Test computes the mean-squared pth nearest neighbour distances δ(p) (1 ≤ p ≤ pmax, typically pmax ≈
10) and the corresponding γ(p). Next the (δ(p), γ(p)) regression line is computed and the vertical intercept is
returned as the gamma value. Effectively this is the limit of γ as δ → 0, which in theory is Var(r).
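A brute-force sketch of this procedure (illustrative only: it uses O(M²) neighbour search rather than the O(M log M) kd-tree method, and ignores distance ties so each neighbour list has length 1):

```python
import math, random

def gamma_test(xs, ys, p_max=10):
    """Compute delta(p) and gamma(p) for p = 1..p_max as in (12), (13),
    then return the vertical intercept of the least-squares regression
    of gamma on delta: the estimate of Var(r)."""
    M = len(xs)

    def d2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    # for each point, all other points sorted by squared distance
    neigh = [sorted((d2(xs[i], xs[j]), j) for j in range(M) if j != i)
             for i in range(M)]
    deltas = [sum(neigh[i][p][0] for i in range(M)) / M
              for p in range(p_max)]
    gammas = [sum((ys[neigh[i][p][1]] - ys[i]) ** 2 for i in range(M))
              / (2.0 * M) for p in range(p_max)]
    # regression line gamma = intercept + slope * delta
    md, mg = sum(deltas) / p_max, sum(gammas) / p_max
    slope = (sum((d - md) * (g - mg) for d, g in zip(deltas, gammas))
             / sum((d - md) ** 2 for d in deltas))
    return mg - slope * md

# smooth 1-D function plus noise of variance 0.01 (illustrative data)
rng = random.Random(0)
xs = [[rng.random()] for _ in range(300)]
ys = [math.sin(3.0 * x[0]) + rng.gauss(0.0, 0.1) for x in xs]
var_estimate = gamma_test(xs, ys)   # should be close to 0.01
```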

A technique which allows one to estimate Var(r), on the hypothesis of an underlying continuous or smooth model
f, is of considerable practical utility in applications such as control or time series modelling. The implication of
being able to estimate Var(r) in neural network modelling is that then one does not need to train the network (or
indeed construct any smooth data model at all) in order to predict the best possible performance with reasonable
accuracy.

    10
      The original version in [Aðalbjörn Stefánsson 1997] and [Končar 1997] used smoothed versions of δ(p) and
γ(p) which rolled off the significance of more distant near neighbours. Later experience showed that this
complication was largely unnecessary and the version of the software used here is implemented as described above.

An appropriate problem type would be one in which the output is expected to be a smooth function of continuously
varying inputs.

An inappropriate problem type would be one in which many of the inputs took categorical values (e.g. 0/1).





O5-L3 Freight Transport Ops (International) V1.pdf
Yogi Goddess Pres Conference Studio Updates
school management -TNTEU- B.Ed., Semester II Unit 1.pptx
Ad

Evolutionary computing

  • 1. EVOLUTIONARY COMPUTING CMT563 Antonia J. Jones 6 November 2005
  • 2. Antonia J. Jones: 6 November 2005 UNIVERSITY OF WALES, CARDIFF DEPARTMENT OF COMPUTER SCIENCE (COMSC) COURSE: M.Sc. CMT563 MODULE: Evolutionary Computing LECTURER: Antonia J. Jones, COMSC DATED: Originally 15 January 1997 LAST REVISED: 6 November 2005 ACCESS: Lecturer (extn 5490, room N2.15). Overhead slides are posted on: http://users.cs.cf.ac.uk:81/Antonia.J.Jones/ electronically as PDF (Acrobat) files. It is not normally necessary for students attending the course to print this file, as complete sets of printed slides will be issued. ©2001 Antonia J. Jones. Permission is hereby granted to any web surfer for downloading, printing and use of this material for personal study only. Copyright permission is explicitly withheld for modification, re-circulation or publication by any other means, or commercial exploitation in any manner whatsoever, of this file or the material therein. Bibliography: MAIN RECOMMENDATIONS The recommended text for the course is: [Hertz 1991] J. Hertz, A. Krogh, and R. G. Palmer, Introduction to the theory of neural computing. Addison-Wesley, 1991. ISBN 0-201-51560-1 (pbk). A cheaper alternative is: Yoh-Han Pao, Adaptive pattern recognition and neural networks. Addison-Wesley, 1989. ISBN 0-201-12584-6. Price (UK) £31.45. A useful addition for the Mathematica labs is: Simulating Neural Networks with Mathematica. James A. Freeman. Addison-Wesley, 1994. ISBN 0-201-56629-X. These books cover most of the course, except any theory on genetic algorithms; the first is the recommended book for the course because it has excellent mathematical analyses of many of the models we shall discuss. The second includes some interesting material on the application of Bayesian statistics and fuzzy logic to adaptive pattern recognition. It is clearly written and the emphasis is on computing rather than physiological models.
  • 3. The principal sources of inspiration for work in neural and evolutionary computation are: ! E. R. Kandel, J. H. Schwartz, and T. M. Jessel. Principles of Neural Science (Third Edition), Prentice-Hall Inc., 1991. ISBN 0-8385-8068-8. ! J. D. Watson, Nancy H. Hopkins, J. W. Roberts, Joan A. Steitz, and A. M. Weiner. Molecular Biology of the Gene, Benjamin/Cummings Publishing Company Inc., 1988. ISBN 0-8053-9614-4. When you see how big they are you will understand why! It is a sobering thought that most of the knowledge in these tomes has been obtained in the last 20 years. Although extensive references are provided with the course notes (these are also a useful source of information for projects in Neural Computing), definitive bibliographies for computing aspects of the subject are: The 1989 Neuro-Computing Bibliography. Ed. Casimir C. Klimasauskas, MIT Press / Bradford Books, 1989. ISBN 0-262-11134-9. Finally, the key papers up to 1988 can be found together in: Neurocomputing: Foundations of Research. Ed. James A. Anderson and Edward Rosenfeld, MIT Press, 1988. ISBN 0-262-01097-6. NETS - OTHER (HISTORICALLY) INTERESTING MATERIAL Perceptrons, Marvin Minsky and Seymour Papert, MIT Press, 1972. ISBN 0-262-63022-2 (was reprinted recently). Neural Assemblies, G. Palm, Springer-Verlag, 1982. Self-Organisation and Associative Memory, T. Kohonen, Springer-Verlag, 1984. Parallel Models of Associative Memory, G. E. Hinton and J. A. Anderson, Lawrence Erlbaum, 1981. Connectionist Models and Their Applications, Special Issue of Cognitive Science 9, 1985. Computer, March 1988. Artificial Neural Systems, IEEE. Neural Computing Architectures, Ed. I. Aleksander, Kogan Page, December 1988. Parallel Distributed Processing. Vol. I Foundations. Vol. II Psychological and Biological Models. David E. Rumelhart et al., MIT Press / Bradford Books, 1986. ISBN 0-262-18123-1 (Set). Explorations in Parallel Distributed Processing - A Handbook of Models, Programs, and Exercises. James L. McClelland and David E. Rumelhart, MIT Press / Bradford Books, 1988. ISBN 0-262-63113-X. (Includes some very useful software for an IBM PC - there is also a newer version with software for the MAC). GENERAL An Introduction to Cybernetics, W. Ross-Ashby, John Wiley and Sons, 1964. A classic text on cybernetics. Vision: A computational investigation into the human representation and processing of visual information, David
  • 4. Marr, W. H. Freeman and Company, 1982. ISBN 0-7167-1284-9. One of the classic works in computational vision. Artificial Intelligence, F. H. George, Gordon & Breach, 1985. Useful textbook on AI. GENETIC ALGORITHMS/ARTIFICIAL LIFE Artificial Life, Ed. Christopher G. Langton, Addison-Wesley, 1989. ISBN 0-201-09356-1 pbk. A fascinating collection of essays from the first AL workshop at Los Alamos National Laboratory in 1987. The book covers an enormous range of topics (genetics, self-replication, cellular automata, etc.) on this subject in a very readable way but with great technical authority. There are innumerable figures, some forty colour plates and even some simple programs to experiment with. All this leads to a book that is beautifully presented and compulsive reading for anyone with a modest background in the field. Synthetic systems that exhibit behaviour characteristic of living systems complement the traditional analysis of living systems practised by the biological sciences. It is an approach to the study of life that would hardly be feasible without the advent of the modern computer and may eventually lead to a theory of living systems which is independent of the physical realisation of the organisms (carbon based, in this neck of the woods). The primary goal of the first workshop was to collect different models and methodologies from scattered publications and to present as many of these as possible in a uniform way. The distilled essence of the book is the theme that Artificial Life involves the realisation of lifelike behaviour on the part of man-made systems consisting of populations of semi-autonomous entities whose local interactions with one another are governed by a set of simple rules. Such systems contain no rules for the behaviour of the population at the global level. Adaptation in Natural and Artificial Systems, John H. Holland, University of Michigan Press, 1975. The book that started Genetic Algorithms, a classic. Genetic Algorithms and Simulated Annealing, Ed. Lawrence Davis, Pitman Publishing, 1987. ISBN 0-273-08771-1 (UK), 0-934613-44-3 (US). A collection of interesting papers on GA related subjects. Genetic Algorithms in Search, Optimization, and Machine Learning, David E. Goldberg, Addison-Wesley, 1989. ISBN 0-201-15767-5. The first real text book on GAs.
  • 5. CONTENTS
I What is evolutionary computing? ... 7
  Introduction ... 7
  A general framework for neural models ... 7
  Hebbian learning ... 10
  The need for machine learning ... 11
II Genetic Algorithms ... 14
  Introduction ... 14
  The archetypal GA ... 14
  Design issues - what do you want the algorithm to do? ... 18
    Rapid convergence to a global optimum ... 19
    Produce a diverse population of near optimal solutions in different `niches' ... 19
  * Results and methods related to the TSP ... 20
  Evolutionary Divide and Conquer ... 21
  Chapter references ... 25
III Hopfield networks ... 29
  Introduction ... 29
  Hopfield nets and energy ... 29
  The outer product rule for assigning weights ... 31
  Networks for combinatoric search ... 32
  Assignment of weights for the TSP ... 33
  * The Hopfield and Tank application to the TSP ... 36
  Conclusions ... 37
  Chapter references ... 37
IV The WISARD model ... 40
  Introduction ... 40
  Wisard model ... 41
  WISARD - analysis of response ... 43
  Comparison of storage requirements ... 44
  Chapter references ... 45
V Feedforward networks ... 46
  Introduction ... 46
  Backpropagation - mathematical background ... 46
    The output layer calculation ... 47
    The rule for adjusting weights in hidden layers ... 48
  The conventional model ... 48
  Problems with backpropagation ... 49
  The gamma test - a new technique ... 50
  * Metabackpropagation ... 53
  * Neural networks for adaptive control ... 53
  Chapter references ... 58
* VI The chaotic frontier ... 59
  Introduction ... 59
  Mathematical background ... 59
  Chaos ... 60
  • 6. Chaos in biology ... 61
  Controlling chaos ... 62
  The original OGY control law ... 62
  Chaotic conventional neural networks ... 64
  Controlling chaotic neural networks ... 65
    Control varying T in a particular layer ... 67
    Using small variations of the inputs ... 67
  Time delayed feedback and a generic scheme for chaotic neural networks ... 70
    Example: Controlling the Hénon neural network ... 71
  Chapter references ... 73
COURSEWORK ... 79
LIST OF FIGURES
Figure 1-1 The stylised version of a standard connectionist neuron. ... 8
Figure 1-2 Information processing capability [From: Mind Children, Hans Moravec]. ... 12
Figure 1-3 Storage capacity. ... 12
Figure 2-1 Generic model for a genetic algorithm. ... 15
Figure 2-2 Standard genetic operators. ... 17
Figure 2-3 Premature convergence - no sharing. ... 19
Figure 2-4 Solution to 50 City Problem using Karp's deterministic bisection method 1. ... 22
Figure 2-5 The EDAC (top) and simple 2-Opt (bottom) time complexity (log scales). ... 23
Figure 2-6 The 200 generation EDAC 5% excess solution for a 5000 city problem. ... 24
Figure 2-7 EDACII mid-parent vs offspring correlation for 5000 cities [Valenzuela 1995]. ... 25
Figure 2-8 EDACIII mid-parent vs offspring correlation for 5000 cities [Valenzuela 1995]. ... 25
Figure 3-1 Distance Connections. Each node (i, p) has inhibitory connections to the two adjacent columns whose weights reflect the cost of joining the three cities. ... 34
Figure 3-2 Exclusion connections. Each node (i, p) has inhibitory connections to all units in the same row and column. ... 34
Figure 4-1 Schematic of a 3-tuple recogniser. ... 40
Figure 4-2 Continuous response of discriminators to the input word 'toothache' [From Neural Computing Architectures, Ed. I Aleksander]. ... 42
Figure 4-3 A discriminator for centering on a bar [From Neural Computing, I. Aleksander and H. Morton]. ... 44
Figure 5-1 Solving the XOR problem with a hidden unit. ... 46
Figure 5-2 Feedforward network architecture. ... 47
Figure 5-3 The previous layer calculation. ... 48
Figure 5-4 The Water Tank Problem ... 54
Figure 5-5 Architecture for direct inverse neurocontrol. ... 55
Figure 5-6 Delta transformation of state differences: maps to the 2-4-2 network inputs for the Water Tank Problem. ... 56
Figure 5-7 Least squares fit to 200 data points with 20 nearest neighbours: = 0.0332. ... 56
Figure 5-8 Volume variation without adaptive training. 2-4-2 Network. MSE = 0.052. Linear Planner. ... 57
Figure 5-9 Temperature variation without adaptive training. 2-4-2 Network MSE = 0.052. Linear Planner. ... 57
Figure 5-10 The control signals generated by the 2-4-2 network without adaptive training. Linear Planner. ... 57
Figure 6-1 Stable attractor. ... 60
Figure 6-2 A chaotic time series. ... 61
Figure 6-3 The butterfly effect. ... 61
  • 7. Figure 6-4 Intervals for which the variables are defined. ... 62
Figure 6-5 Feedforward network as a dynamical system. ... 65
Figure 6-6 Chaotic attractor of Wang's neural network. ... 65
Figure 6-7 The Ikeda strange attractor. ... 66
Figure 6-8 Attractor for the chaotic 2-10-10-2 neural network. ... 66
Figure 6-9 Bifurcation diagram x obtained by varying T in the output layer only. ... 68
Figure 6-10 Bifurcation diagram y obtained by varying T in the output layer only. ... 68
Figure 6-11 Variations of x from initiation of control. ... 68
Figure 6-12 Variations of y from initiation of control. ... 68
Figure 6-13 Parameter changes during output layer control. ... 68
Figure 6-14 Bifurcation diagram for the output x(t+1) using an external variable added to the input x(t). ... 69
Figure 6-15 Bifurcation diagram for the output y(t+1) using an external variable added to the input x(t). ... 69
Figure 6-16 Variations of x from initiation of control. ... 69
Figure 6-17 Variations of y from initiation of control. ... 69
Figure 6-18 Parameter changes during input x control. ... 69
Figure 6-19 A general scheme for constructing a stimulus-response chaotic recurrent neural network: the chaotic "delayed" network is trained on suitable input-output data constructed from a chaotic time series; a delayed feedback control is applied to each input line; entry points for external stimulus are suggested, with a switch signal to activate the control module during external stimulation; signals on the delay lines or output can be observed at the "observation points". ... 71
Figure 6-20 The control signal corresponding to the delayed feedback control shown in Figure 6-21. Note that the control signal becomes small. ... 71
Figure 6-21 Response signal on x(n-6) with control signal activated on x(n-6) using k = 0.441628, J = 2, and without external stimulation after first 10 transient iterations. After n = 1000 iterations, the control is switched off. ... 71
Figure 6-22 Response signals on network output x(n), with control signal activated on x(n-6) using k = 0.441628, J = 2, and with constant external stimulation sn added to x(n-6), where sn varies from -1.5 to 1.5 in steps of 0.1 at each 500 iterative steps (indicated by the change of Hue of the plot points) after 20 initial transient steps. ... 72
Figure 6-23 The control signal corresponding to the delayed feedback control shown in Figure 6-22. Note that the control signal becomes small even when the network is under changing external stimulation. ... 72
Figure 6-24 Response signals on network output x(n), with control setup same as in Figure 6-22 but with Gaussian noise r added to external stimulation, i.e. sn+r, with F = 0.05, at each iteration step. ... 73
Figure 6-25 The control signal corresponding to the delayed feedback control shown in Figure 6-24. ... 73
Figure 6-26 Response signals on network output x(n), with control experiment setup same as in Figure 6-22 but with Gaussian noise r added to external stimulation, i.e. sn+r, with F = 0.15, at each iteration step. ... 73
Figure 6-27 Response signals on network output x(n), with control experiment setup same as in Figure 6-22 but with Gaussian noise r added to external stimulation, i.e. sn+r, with F = 0.3, at each iteration step. ... 73
Standard genetic operators. ... 84
Schematic of a 3-tuple recogniser. ... 87
LIST OF ALGORITHMS
Algorithm 2-1 Archetypal genetic algorithm. ... 16
Algorithm 3-1 Hopfield network. ... 31
Algorithm 5-1 The Gamma test. ... 52
Algorithm 5-2 Metabackpropagation. ... 53
Algorithm 7-1 Generic GA ... 84
Algorithm 7-2 Generic Hopfield net. ... 86
  • 8. I What is evolutionary computing? "All of this will lead to theories which are much less rigidly of an all-or-none nature than past and present formal logic. They will be of a much less combinatorial, and much more analytical, character. In fact, there are numerous indications to make us believe that this new system of formal logic will move closer to another discipline which has been little linked in the past with logic. This is thermodynamics, primarily in the form it was received from Boltzmann, and is that part of theoretical physics which comes nearest in some of its aspects to manipulating and measuring information." [von Neumann, Collected Works Vol. 5, p. 304] Introduction. Evolutionary computing embraces models of computation inspired by living Nature. For example, evolution of species by means of natural selection and the genetic operators of mutation, sexual reproduction and inversion can be considered as a parallel search process. Perhaps we can tackle hard combinatoric search problems in computer science by mimicking (in a very stylised form) the natural process of evolutionary search. Evolution through natural selection drives the adaptation of whole species, but individual members of a species can also adapt to a greater or lesser extent. The adaptation of individual behaviour on the basis of experience is learning and stems from plasticity of the neural structures which convey and process information in animals. Learning enables us to recognise previously encountered stimuli or experiences and modify our behaviour accordingly. It facilitates prediction and control of the environment, both essential prerequisites to planning. All of these are facets of what we loosely call intelligence. Real-world intelligence is essentially a computational process. This is a contentious assertion known as "the strong AI position". If it is true then the precise mechanism of computation (the hardware or neural wetware) ought to be irrelevant to the actual principles of the computational process. If this is indeed the case then the only obstacles to the construction of a truly intelligent artifact are our own understanding of the computational processes involved and our technical capability to construct suitable and sufficiently powerful computational devices. A general framework for neural models. Throughout this course we describe a number of neural models, each a variation on the connectionist paradigm (often called Parallel Distributed Processing - PDP), which in turn is derived from networks of highly stylised versions of the biological neuron. It is useful to begin with an analysis of the various components of these models. There are seven major aspects of a connectionist model: ! A set of processing units ui, each producing a scalar output xi(t) (1 ≤ i ≤ n). ! A connectivity graph which determines the pattern of connections (links) from each unit to each of the other units in the network. We shall often suppose that each unit has n inputs, but there is no particular reason why all units should have the same number of inputs.
  • 9. Antonia J. Jones: 6 November 2005 Although it is often convenient for theoretical discussions to consider fully interconnected networks, for very large networks of either real or artificial neurons the relevant case is that of relatively sparse connectivity. The connectivity graph then describes the fine topology of the network. This can be useful in practical applications, for example in speech recognition networks it is often helpful to have several copies of the same sub-net connected to temporally distinct inputs. These sub-net copies act as a feature detector and so can share their weights - this effectively reduces the number of parameters needed to describe the full network and speeds up learning. It is sufficient to be given a list of inputs and outputs for each node, for we then can recover the connectivity graph. ! A set of parameters pi1,...,pik, fixed in number, attached to each unit ui, which are adjusted during learning. Most commonly k = n and the parameters are weights wij (1 j n), where wij is often taken # # to be associated with the link from j to i, or in biological terms associated with the synaptic gap. ! An activation function for each unit, neti = neti(x1,...,xn;pi1,...,pik), which combines the inputs to ui into a scalar value. In the commonly used model neti = wijxj. ' It is important to realise that the basic principle of neural networks is that of simple (but arbitrary) computational function at each node. Learning when it occurs can be considered as an adjustment of the parameters associated with a node based on information locally available to the node. ‘Locally’ here means as specified by the connectivity graph. This information often takes the form of a correlation between the firings of adjacent nodes, but it could be a more sophisticated calculation. Thus we are really dealing with a very general class of parallel algorithms. 
The concentration on the 'weights associated with links' model has arisen partly because of the biological precedent, partly because of the extreme simplicity of the computational function of a node, and partly because this special case has been shown to be of practical interest.

[Figure 1-1: The stylised version of a standard connectionist neuron, showing inputs x_1(t),...,x_n(t), the activation net_i = net_i(x_1,...,x_n, p_i1,...,p_ik), a sigmoidal output function x_i = f(net_i), and the output x_i(t+1) passed along the unit's links.]

• An output function x_i = f(net_i) which transforms the activation into an output. In the earliest models f was a discontinuous step function. However, this poses analytical difficulties for learning algorithms, so that often now f is a smooth sigmoidal-shaped function. In some models f is allowed to vary
from one unit to another, and then we write f_i for f.

• A learning rule whereby the parameters associated with each processing unit are modified by experience.

• An environment within which the system must operate.

A set of processing units. Figure 1-1 illustrates a standard connectionist component. All of the processing of a connectionist system is carried out by these units. There is no executive or overseer. There are only relatively simple units, each doing its own relatively simple job. A unit's job is simply to receive input from other units and, as a function of the input it receives and the current values of its internal parameters, to compute an output value x_i which it sends to the other units. This output is discrete in some models and continuous in others. When the output is continuous it is often confined to [0,1] or [-1,1]. The system is inherently parallel in that many units carry out their computations at the same time. Within any system we are modelling it is sometimes useful to characterize three types of units: input, output, and hidden. The hidden units are those whose inputs and outputs are within the system we are modelling; they are not 'visible' to outside systems.

A connectivity graph. Each unit passes its output to other units along links. The graph of links represents the connectivity of the network.

A set of parameters and an activation function. In the conventional model the parameters for unit i are assumed to be weights w_ij associated with the link from unit j to unit i. If w_ij > 0 the link is said to be excitatory, if w_ij = 0 unit j is effectively not connected to unit i, and if w_ij < 0 the link is said to be inhibitory. In this case net_i is calculated as

    net_i = Σ_{j=1}^{n} w_ij x_j    (1)

This is a linear function of the inputs, and so net_i is constant over hyperplanes in the n-dimensional space of inputs to unit i.
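Equation (1) amounts to a one-line computation per unit. The sketch below is a Python illustration (the notes themselves use Mathematica; the function name is ours):

```python
# net_i = sum_j w_ij * x_j, the linear activation of equation (1).
def net(weights, inputs):
    """Weighted-sum activation of a single unit."""
    return sum(w * x for w, x in zip(weights, inputs))

# Example: one excitatory link (+1), one inhibitory link (-1), and one
# zero weight (that input is effectively disconnected from the unit).
print(net([1, -1, 0], [2, 3, 5]))   # 1*2 + (-1)*3 + 0*5 = -1
```

An excitatory weight raises net_i, an inhibitory one lowers it, and a zero weight removes that input's influence entirely, exactly as described above.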
In fact, if one is interested in generalising the computational function of a unit, it is often convenient to associate the parameters (in the conventional case weights) with the unit. In that case one thinks of the links as passing activation values, and one is no longer constrained to have exactly n (the number of inputs) parameters per unit. For example, one could have a unit which performed its distinction function by determining whether or not the input vector lay within some ellipsoid. In this case there would be n parameters associated with the centre of the ellipsoid and another n parameters associated with the axes. (In addition one could provide the ellipsoid with rotations, which would provide further parameters.) Now the activation function would look like

    net_i = Σ_{j=1}^{n} A_ij (x_j - c_ij)²    (2)

This is a simple example of a higher order network in which the function net_i is not a linear function of the inputs.

An output function. The simplest possible output function f would be the identity function, i.e. just take x_i = net_i. However, in this case with the activation function (1) the unit would be performing a totally linear function on the inputs and, as it turns out, such nets are rather uninteresting. In any event our unit is not yet making a distinction. In the discrete model the output function is usually
    x_i = 1 if net_i > θ_i
    x_i = 0 if net_i ≤ θ_i    (3)

where θ_i is the threshold, a parameter associated with the unit. However, this creates discontinuities of the derivatives, and so we usually smooth the output function and write

    x_i = f(net_i)    (4)

In the linear case f is some sort of sigmoidal function. For our ellipsoidal example Gaussian smoothing might be suitable, i.e. f(x) = exp(-x²), so that the output is large (near one) when the input vector is near the centre of the ellipsoid. Sometimes the output function is stochastic, so that the output of the unit depends in a probabilistic fashion on net_i.

For an individual unit the sequence of events in operational mode (not learning) is

1. Combine inputs to produce activation net_i(t).
2. Compute value of output x_i = f(net_i).
3. Place outputs, based on the new activation level, on output links (available from t+1 onward).

Changing the processing or knowledge structure in a connectionist model involves modifying the patterns of interconnections or the parameters associated with each unit. This is accomplished by modifying p_i1,...,p_ik (or the w_ij in the usual model) through experience using a learning rule. Virtually all learning rules are based on some variant of a Hebbian principle (discussed in the next section), which is invariably derived mathematically through some form of gradient descent. For example, the Delta or Widrow-Hoff rule: here the modification of weights is proportional to the difference between the actual activation achieved and the target activation provided by a teacher,

    Δw_ij = η (t_i(t) - net_i(t)) x_j(t)

where η > 0 is constant. This is a generalization of the Perceptron learning rule, and is all very well provided we know the desired values t_i(t).

Hebbian learning. Donald O.
Hebb's book The Organization of Behavior (1949) is famous among neural modelers because it contained the first explicit statement of the physiological learning rule for synaptic modification that has since become known as the Hebb synapse:

Hebb rule. When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased.

The physiological basis for this synaptic potentiation is now understood more clearly [Brown 1988]. Hebb's introduction to the book also contains the first use of the word 'connectionism' in the context of neural modeling. The Hebb rule is not a mathematical statement, though it is close to one. For example, Hebb does not discuss the various possible ways inhibition might enter the picture, or the quantitative learning rule that is being followed. This has meant that a number of quite different learning rules can legitimately be called 'Hebbian rules'. We shall see later that nearly all such learning rules bear a close mathematical relationship to the idea of 'gradient descent', which roughly means that if we wish to move to the lowest point of some error surface a good heuristic is: always tend to go 'downhill'. However, for the present chapter we shall conceptualise the Hebb rule in terms of autocorrelations, i.e. the internal correlations between each pair of components of the pattern vectors we wish
the system to memorise. Hebb was keenly aware of the 'distributed' nature of the representation he assumed the nervous system uses: that to represent something, assemblies of many cells are required, and that an individual cell may be a participating member of many representations at different times. He postulated the formation of cell assemblies representing learned patterns of activity.

The need for machine learning. Why do we need to discover how to get machines to learn? After all, is it not the case that the most practical developments in Artificial Intelligence, such as Expert Systems, have emerged from the development of advanced symbolic programming languages such as LISP or Prolog? Indeed, this is so. But there are convincing arguments [Bock 1985] which suggest that the technique of simulating human skills using symbolic programs cannot hope, in the long run, to satisfy the principal goals of AI. Mainly these centre around the time it would take to figure out the rules and write the software. But first we should consider the evolution of hardware.

How can one measure the overall computational power of an information processing system? There are two obvious aspects we should consider. Firstly, information storage capacity - a system cannot be very smart if it has little or no memory. On the other hand, a system may have a vast memory but little or no capacity to manipulate information; so a second essential measure is the number of binary operations per second. On these two scales Figure 1-2 illustrates the information processing capability of some familiar biological and technological information processing systems. In the case of the biological systems these estimates are based on connectionist models and may be excessively conservative. We consider each axis independently.
As we saw earlier, research in neurophysiology has revealed that the brain and central nervous system consist of about 10^11 individual parallel processors, called neurons. Each neuron has roughly 10^4 synaptic connections, and if we allow only 1 bit per synapse then each neuron is capable of storing about 10^4 bits of information. The information capacity of the brain is thus about 10^15 bits. Much of this information is probably redundant, but using this figure as a conservative estimate let us consider when we might expect to have high-speed memories of 10^15 bits.
[Figure 1-2: Information processing capability. From: Mind Children, Hans Moravec.]

Figure 1-3 shows that the amount of high-speed random access memory that may be conventionally accessed by a large computer has increased by an order of magnitude every six years. If we can trust this simple extrapolation, in generation thirteen, AD 2024-30, the average high-speed memory capacity of a large computer will reach 10^15 bits. Now consider the evolution of technological processing power. Remarkably, this follows much the same trend. Of course, the real trick is putting the two together to achieve the desired result, and it seems relatively unlikely that we shall be in a position to accomplish this by 2024.

[Figure 1-3: Storage capacity.]

So much for the hardware. Now consider the software. Even adult human brains are not filled to capacity, so we will assume that 10% of the total capacity, i.e. 10^14 bits, is the extent of the 'software' base of an adult human brain. How long will it take to write the
programs to fill 10^14 bits (production rules, knowledge bases etc.)? The currently accepted rate of production of software, from conception through testing, debugging and documentation to installation, is about one line of code per hour. Assuming, generously, that an average line of code contains approximately 60 characters, or 500 bits, we discover that the project will require 100 million person-years! We'll never get anywhere by trying to program human intelligence into a machine.

What other options are available? One is direct transfer from the human brain to the machine. Considering conventional transfer rates over a high-speed bus, this would take about 12 days. The only problem is: nobody has the slightest idea how to build such a device. What's left? In the biological world intelligence is acquired every day, therefore there must be another alternative. Every day babies are born and in the course of time acquire a full spectrum of intelligence. How do they do it? The answer, of course, is that they learn. If we assume that the eyes, our major source of sensory input, receive information at the rate of about 250,000 bits per second, we can fill the 10^14 bits of our machine's memory capacity in about 20 years. Now storing sensory input is not the same thing as developing intelligence, but this figure is in the right ball park. Maybe what we must do is connect our machine brain to a large number of high-data-rate sensors, endow it with a comparatively simple algorithm for self-organization, provide it with a continuous and varied stream of stimuli and evaluations for its responses, and let it learn.

This argument may seem cavalier in some respects. The human brain is highly parallel and somewhat inhomogeneous in its architecture. It does not clock at high serial speeds and does not access RAM to recall information for processing.
The storage capacity may be vastly greater than the 10^15 bits estimated by Sagan, since each neuron is connected to as many as 10,000 others and the structure of these interconnections may also store information. Indeed, although we do not know a great deal about the mechanisms of human memory, we do know that it is multi-levelled, with partial bio-chemical storage. However, none of this invalidates Bock's point that programming can never be a substitute for learning.
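The back-of-envelope arithmetic behind these estimates is easy to check. The sketch below uses two assumptions of our own that the notes do not state explicitly: roughly 2000 working hours per person-year, and roughly 16 waking hours per day for the sensory estimate:

```python
# Back-of-envelope checks for the figures quoted above.
SOFTWARE_BITS = 10**14          # assumed 'software' base of an adult brain
BITS_PER_LINE = 500             # ~60 characters per line of code

lines_of_code = SOFTWARE_BITS // BITS_PER_LINE      # 2 * 10**11 lines
person_years = lines_of_code // 2000                # at one line per hour
# -> 100 million person-years, as claimed

EYE_RATE = 250_000                                  # bits per second
seconds_awake_per_year = 16 * 3600 * 365            # waking hours only
years_to_fill = SOFTWARE_BITS / (EYE_RATE * seconds_awake_per_year)
# -> roughly 19 years, consistent with the 'about 20 years' in the text
```

Varying the assumed working hours or waking hours shifts the answers somewhat, but not the conclusion: programming the knowledge in by hand is hopeless, while learning it through the senses takes about a childhood.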
II Genetic Algorithms

Introduction. The idea that the process of evolutionary search might be used as a model for hard combinatoric search algorithms developed significantly in the mid-1960s. Evolutionary algorithms fall into the class of probabilistic heuristic algorithms which one might use to attack NP-complete or NP-hard problems (see, for example, [Horowitz 1978], Chapters 11 and 12), such as the Travelling Salesman/person Problem (TSP). Of course, many of these problems have significant applications in engineering hardware or software design and commercial optimisation problems, but the underlying motivation for the study of evolutionary algorithms is principally to try to gain insight into the evolutionary process itself.

Variously known as genetic algorithms (the phrase coined by the US school stemming from the work of John Holland [Holland 1975]), evolutionary programming (originally developed by L. J. Fogel, A. J. Owens and M. J. Walsh, again in the US), and Evolutionsstrategie (as studied in Germany at around the same time by I. Rechenberg and H-P. Schwefel [Schwefel 1965]), the subject has exploded over the last 15 years. Curiously, the European and US schools seemed largely unaware of each other's existence for quite some while. Evolutionary algorithms have been applied to a variety of problems and offer intriguing possibilities for general-purpose adaptive search algorithms in artificial intelligence, especially, but not necessarily, for situations where it is difficult or impossible to precisely model the external circumstances faced by the program.

Search based on evolutionary models had, of course, been tried before Holland's introduction of genetic algorithms. However, these models were based on mutation alone and were not notably successful.
The principal difference of the more modern research is an emphasis on the power of natural selection and the incorporation of a 'crossover' operator to mimic the effect of sexual reproduction.

Two rather different types of theoretical analysis have developed for evolutionary algorithms: the classical approach, stemming from the original work of Mendel on heritability and the later statistical work of Galton and Pearson at the end of the nineteenth century, and the schema theory approach developed by Holland. Mendel constructed a chance model of heritability involving what are now called genes. He conjectured the existence of genes by pure reasoning - he never saw any. Galton and Pearson found striking statistical regularities in heritability in large populations; for example, on average a son is halfway between his father's height and the overall average height for sons. They also invented many of the statistical tools in use today, such as the scatter diagram, regression and correlation (see, for example, [Freedman 1991]). Around 1920 Fisher, Wright and Haldane more or less simultaneously recognised the need to recast Darwinian theory as described by Galton and Pearson in Mendelian terms. They succeeded in this task, and more recently Price's Covariance and Selection Theorem [Price 1970], [Price 1972], an elaboration of these ideas, has provided a useful tool for algorithm analysis.

The archetypal GA. In Nature each gene has several forms or alternatives - alleles - producing differences in the set of characteristics associated with that gene; e.g. certain strains of garden pea have a single gene which determines blossom colour, one allele causing the blossom to be white, the other pink. There are tens of thousands of genes in the chromosomes of a typical vertebrate, each of which, on the available evidence, has several alleles.
Hence the set of chromosomes attained by taking all possible combinations of alleles contains on the order of 10^3000 structures for a typical vertebrate species. Even a very large population, say 10 billion individuals, contains only a minuscule fraction of the possibilities.
A further complication is that alleles interact, so that adaptation becomes primarily the search for co-adapted sets of alleles. In the environment against which the organism is tested, any individual exemplifies a large number of possible 'patterns of co-adapted alleles', or schema, as Holland calls them. In testing this individual we shall see that all schema of which the individual is an instantiation are also tested. If the rules whereby genes are combined have a tendency to generate new instances of above-average schema, then the resulting adaptive system has a high degree of 'intrinsic parallelism'¹ which accelerates the evolutionary process. Considerations of this type offer an explanation of how evolution can proceed at all. If a simple enumerative plan were employed, and if 10^12 structures could be tried every second, it would take a time vastly exceeding the estimated age of the universe to test 10^100 structures.

The basic idea of an evolutionary algorithm is illustrated in Figure 2-1.

[Figure 2-1: Generic model for a genetic algorithm. INITIALISE: create an initial population and evaluate the fitness of each member. Then repeat: create children from the existing population using genetic operators (internal); evaluate the fitness of the children (external); substitute the children into the population, deleting an equivalent number.]

We seek to optimise members of a population of 'structures'. These structures are encoded in some manner by a 'gene string'. The population is then 'evolved' in a very stylised version of the evolutionary process. We are given a set, A, of 'structures' which we can think of, in the first instance, as being a set of strings of fixed length l. The object of the adaptive search is to find a structure which performs well in terms of a measure of performance v : A → ℝ⁺, where ℝ⁺ denotes the positive real numbers.
¹ The notion of 'intrinsic parallelism' will be discussed, but it should be mentioned that it has nothing to do with parallelism in the sense normally intended in computing.
The programmer must provide a representation for the structures to be optimised. In the terminology of genetic algorithms a particular structure is called a phenotype and its representation as a string is called a chromosome or genotype. Usually this representation consists of a fixed-length string in which each component, or gene, may take only a small range of values, or alleles. In this context 'small' often means two, so that binary strings are used for the genotypes. There is nothing obligatory in taking a one-bit range for each allele, but there are theoretical reasons to prefer few-alleles-at-many-sites over many-alleles-at-few-sites (the arguments have been given by [Holland 1975] (p. 71) and [Smith 1980] (p. 56), and supporting evidence for the correctness of these arguments has been presented by [Schaffer 1984] (p. 107)).

1. Randomly generate a population of M structures S(0) = {s(1,0),...,s(M,0)}.
2. For each new string s(i,t) in S(t), compute and save its measure of utility v(s(i,t)).
3. For each s(i,t) in S(t) compute the selection probability defined by p(i,t) = v(s(i,t)) / Σ_i v(s(i,t)).
4. Generate a new population S(t+1) by selecting structures from S(t) via the selection probability distribution and applying the idealised genetic operators to the structures generated.
5. Goto 2.

Algorithm 2-1 Archetypal genetic algorithm.

The function v provides a measure of 'fitness' for a given phenotype and (since the programmer must also supply a mapping n from the set of genotypes to the set of phenotypes) hence for a given genotype. Given a particular genotype or string, the goal function provides a means for calculating the probability that the string will be selected to contribute to the next generation. It should be noted that the composition function v(n(·)) mapping genotypes to fitness is invariably discontinuous; nevertheless genetic algorithms cope remarkably well with this difficulty.
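Algorithm 2-1 can be sketched in a few dozen lines. The notes' own demonstrators (GA_Simple.nb, GA_Inversion.nb) are in Mathematica; the Python sketch below is ours, and the one-max fitness function (count of 1s, plus one to keep v strictly positive) and all parameter values are illustrative assumptions, not from the notes:

```python
import random

# A minimal sketch of Algorithm 2-1 on binary strings of length l.
def evolve(l=20, M=30, generations=50, p_mut=0.01, seed=0):
    rng = random.Random(seed)
    fitness = lambda s: sum(s) + 1          # v: strictly positive utility
    # Step 1: random initial population S(0).
    pop = [[rng.randint(0, 1) for _ in range(l)] for _ in range(M)]
    for _ in range(generations):
        # Steps 2-3: utilities and fitness-proportional selection.
        weights = [fitness(s) for s in pop]
        total = sum(weights)
        def select():                       # roulette-wheel selection
            r = rng.uniform(0, total)
            for s, w in zip(pop, weights):
                r -= w
                if r <= 0:
                    return s
            return pop[-1]
        # Step 4: new population via crossover and mutation.
        new_pop = []
        for _ in range(M):
            a, b = select(), select()
            x = rng.randint(1, l - 1)       # one-point crossover
            child = a[:x] + b[x:]
            child = [g ^ (rng.random() < p_mut) for g in child]
            new_pop.append(child)
        pop = new_pop                       # Step 5: go round again
    return max(pop, key=fitness)
```

On this toy goal function the population reliably drifts towards the all-ones string, illustrating how selection plus crossover exploits co-adapted alleles even though the algorithm never inspects the fitness function's structure.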
The basis of Darwinian evolution is the idea of natural selection, i.e. population genetics tends to use the Selection Principle: the fitness of an individual is proportional to the probability that it will reproduce effectively.² In genetic algorithm design we tend to apply this in the converse form: the probability that an individual will reproduce is proportional to its fitness. 'Fit' strings, i.e. strings having larger goal function values, will be more likely to be selected, but all members of the population will have some chance to contribute.

² Obfuscation of the definition of 'fitness' occurs frequently in the classical literature. The reasons are not difficult to understand. Both Darwin and Fisher found it hard to swallow that the lower classes bred more prolifically and were therefore, by definition, 'fitter' than their 'social superiors'. This confusion regarding 'fitness' still occurs in the GA literature for different reasons.
The box contains a sketch of the standard serial-style genetic algorithm. Typically the evaluation of the goal function for a particular phenotype, a process which strictly speaking is external to the genetic algorithm itself, is the most time-consuming aspect of the computation. Given the mapping from genotype to phenotype, the goal function, and an initial random population, the genetic algorithm proceeds to create new members of the population (which progressively replace the old members) using genetic operators, typically mutation, crossover and inversion, modelled on their biological analogues. For the moment we represent strings as a_1 a_2 a_3 ... a_l [a_i = 1 or 0]. Using this notation we can describe the operators by which strings are combined to produce new strings. It is the choice of these operators which produces a search strategy that exploits co-adapted sets of structural components already discovered. Holland uses three such principal operators: Crossover, Mutation and Inversion (which we shall not discuss in detail here).

[Figure 2-2: Standard genetic operators. CROSSOVER (two cut points): Parent 1 = 1011|010011|10111, Parent 2 = 1100|111000|11010; Child 1 = 1100|010011|11010, Child 2 = 1011|111000|10111. MUTATION: 110011100011010 becomes 111011101011010. INVERSION: 111111100011010 becomes 110011111011010.]

Crossover. In crossover one or more cut points are selected at random, and the operation illustrated in Figure 2-2 (see also Figure 7-1), where two cut points are employed, is used to create two children. A variety of control regimes are possible, but a simple strategy might be 'select one of the children at random to go into the next generation'. Children tend to be 'like' their parents, so that crossover can be considered as a focussing operator which exploits knowledge already gained; its effects are quite quickly apparent. Crossing over proceeds in three steps.

a) Two structures a_1...a_l and b_1...b_l are selected at random from the current population.
b) A crossover point x in the range 1 to l-1 is selected, again at random.

c) Two new structures

    a_1 a_2 ... a_x b_x+1 b_x+2 ... b_l
    b_1 b_2 ... b_x a_x+1 a_x+2 ... a_l

are formed.

In modifying the pool of schema (discussed below), crossing over continually introduces new schema for trial whilst testing extant schema in new contexts. It can be shown that each crossing over affects a great number of schema.
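Steps a)-c), together with the per-position mutation operator described in the Mutation subsection below, translate directly into code. This is a Python illustration of ours (the ten-bit example strings are our own, not the ones in Figure 2-2):

```python
import random

# One-point crossover, steps a)-c): cut both parents at position x
# (1 <= x <= l-1) and exchange the tails.
def crossover(a, b, x=None):
    if x is None:
        x = random.randint(1, len(a) - 1)
    return a[:x] + b[x:], b[:x] + a[x:]

# Per-position mutation on a binary string: each allele is replaced
# (here: flipped) independently with probability p.
def mutate(s, p=0.01):
    return [1 - g if random.random() < p else g for g in s]
```

For example, crossing "1011010011" and "1100111000" at x = 4 yields the children "1011111000" and "1100010011"; each child inherits a contiguous block of alleles from each parent, which is why crossover preserves short co-adapted groups far better than uniform crossover does.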
There is large variation in the crossover operators which have been used by different experimenters. For example, it is possible to cross at more than one point. The extreme case of this is where each allele is randomly selected from one or other parent string with uniform probability - this is called uniform crossover. Although some writers have argued in favour of uniform crossover, there would seem to be theoretical arguments against its use, viz. if evolution is the search for co-adapted sets of alleles then this search is likely to be severely undermined if many cut points are used. In language we shall develop shortly: the probability of schema disruption when using uniform crossover is much higher than when using one- or two-point crossover.

The design of the crossover operator is strongly influenced by the nature of the representation. For example, if the problem is the TSP and the representation of a tour is a straightforward list of cities in the order in which they are to be visited, then a simple crossover operator will, in general, not produce a tour. In this case the options are:

• Change the representation.
• Modify the crossover operator.
• Effect 'genetic repair' on the non-tours which may result.

There is obviously much scope for experiment for any particular problem. The danger is that the resulting algorithm may be so far removed from the canonical form that the correlation between parental and child fitness may be small - in which case the whole justification for the method will have been lost.

Mutation. In mutation an allele is altered at each site with some fixed probability. Mutation disperses the population throughout the search space and so might be considered as an information-gathering or exploration operator. Search by mutation alone is a slow process analogous to exhaustive search.
Thus mutation is a 'background' operator, assuring that the crossover operator has a full range of alleles so that the adaptive plan is not trapped on local optima. Each structure a_1 a_2 ... a_l in the population is operated upon as follows. Position x is modified, with probability p independent of the other positions, so that the string is replaced by

    a_1 a_2 ... a_x-1 z a_x+1 ... a_l

where z is drawn at random from the possible values. If p is the probability of mutation at a single position, then the number of mutations in a given string is binomially distributed and, for small p, is well approximated by a Poisson distribution with parameter lp. A simple demonstrator is given in the Mathematica program GA_Simple.nb. A more complicated GA using Inversion is given in GA_Inversion.nb.

Design issues - what do you want the algorithm to do? Now we have to ask just what it is we want of a genetic algorithm. There are several, sometimes mutually exclusive, possibilities. For example:

• Rapid convergence to a global optimum.
• Produce a diverse population of near-optimal solutions in different 'niches'.
• Be adaptive in 'real time' to changes in the goal function.

We shall deal with each of these in turn, but first let us briefly consider the nature of the search space. If the space is flat with just one spike then no algorithm short of exhaustive search will suffice. If the space is smooth and unimodal then a conventional hill-climbing technique should be used.
Somewhere between these two extremes are problems in which the goal function is a highly non-linear multi-modal function of the gene values - these are the problems of hard combinatoric search for which some style of genetic algorithm may be appropriate.

Rapid convergence to a global optimum. Of course this is rather simplistic. Holland's theory holds for large populations. However, in many AI applications it is computationally infeasible to use large populations, and this in turn leads to a problem commonly referred to in the genetic algorithms literature as Premature Convergence (to a sub-optimal solution) or Loss of Diversity. When this occurs the population tends to become dominated by one relatively good solution and locked into a sub-optimal region of the search space. For small populations the schema theorem is actually an explanation for premature convergence (i.e. the failure of the algorithm) rather than a result which explains success.

Premature convergence is related to a phenomenon observed in Nature. Allelic frequencies may fluctuate purely by chance about their mean from one generation to another; this is termed Random Genetic Drift. Its effect on the gene pool in a large population is negligible, but in a small effectively interbreeding population, chance alteration in Mendelian ratios can have a significant effect on gene frequencies and can lead to the fixation of one allele and the loss of another. For example, isolated communities within a given population have been found to have frequencies for blood group alleles different from the population as a whole. Figure 2-3 illustrates this phenomenon with a simple function optimisation genetic algorithm.

The inexperienced often attempt to counteract premature convergence by increasing the rate of mutation. However, this is not a good idea.
A high rate of mutation tends to devalue the role of crossover in building co-adapted sets of alleles and in essence pushes the algorithm in the direction of exhaustive search. Whilst some mutation is necessary, a high rate of mutation is invariably counter-productive. In trying to counteract premature convergence we are essentially trying to balance the exploitation of good solutions found so far against the exploration which is required to find hitherto unknown promising regions of the search space. It is worth observing that, in computational terms, any algorithm which often inserts copies of strings into the current population is wasteful. This is true for the Traditional Genetic Algorithm (TGA) outlined as Algorithm 2-1.

[Figure 2-3: Premature convergence - no sharing.]

Produce a diverse population of near-optimal solutions in different 'niches'. The problem of premature convergence has been addressed by a number of authors using a diversity of techniques. Many of the papers in [Davis 1987] contain discussions of precisely this point. The methods used to combat premature convergence in TGAs are not necessarily appropriate to the parallel formulations of genetic algorithms (PGAs) which we shall discuss shortly.

Cavicchio, in his doctoral dissertation, suggested a preselection mechanism as a means of promoting genotype diversity. Preselection filters the children generated, possibly picking the fittest, and replaces parent members of the population with their offspring [Cavicchio 1970].
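The Random Genetic Drift described above is easy to demonstrate by simulation. The sketch below is our own Python illustration (population size, generation count and seed are arbitrary choices): one biallelic locus, no selection at all, yet resampling noise alone drives the allele frequency to fixation in a small population.

```python
import random

# Drift at a single biallelic locus in a population of size M:
# each generation resamples M individuals from the current allele
# frequency; with no selection, the frequency random-walks until
# one allele is fixed and the other is lost.
def drift(M=20, generations=2000, seed=3):
    rng = random.Random(seed)
    freq = 0.5                      # initial frequency of allele '1'
    for _ in range(generations):
        ones = sum(rng.random() < freq for _ in range(M))
        freq = ones / M
        if freq in (0.0, 1.0):      # fixation: diversity is gone
            break
    return freq
```

With M = 20 fixation typically occurs within a few dozen generations, whereas for large M the frequency stays near 0.5 for a very long time, mirroring the point made above about small versus large interbreeding populations.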
De Jong's crowding scheme is an elaboration of the preselection mechanism. In the crowding scheme, an offspring replaces the most similar string from a randomly drawn subpopulation of size CF (the crowding factor) of the current population. Thus a member of the population experiences a selection pressure in proportion to its similarity to other members of the population [De Jong 1975]. Empirical determination of CF with a five-function test bed found CF = 3 to be optimal.

Booker implemented a sharing method in a classifier system environment which used the bucket brigade algorithm [Booker 1982]. The idea here was that if related rules share payments then sub-populations of rules will form naturally. However, it seems difficult to apply this mechanism to standard genetic algorithms. Schaffer has extended the idea of sub-populations in his VEGA model, in which each fitness element has its own sub-population [Schaffer 1984].

A different approach to help maintain genotype diversity was introduced by Mauldin via his uniqueness operator [Mauldin 1984]. The uniqueness operator helped to maintain diversity by incorporating a 'censorship' operator in which the insertion of an offspring into the population is possible only if the offspring is genotypically different from all members of the population at a number of specified genotypical loci.

Results and methods related to the TSP. We digress briefly to give a little more detailed background material on the TSP. The question is often asked: if one cannot exactly solve any very large TSP problem (except in special cases; at present 'very large' means a problem involving more than a thousand cities), how can one know how accurate a solution produced by a probabilistic or heuristic algorithm actually is?
The best exact solution methods for the travelling salesman problem are capable of solving problems of several hundred cities [Grötschel 1991], but unfortunately excessive amounts of computer time are used in the process and, as N increases, any exact solution method rapidly becomes impractical. For large problems we therefore have no way of knowing the exact solution, but in order to gauge the solution quality of any algorithm we need a reasonably accurate estimate of the minimal tour length. This is usually provided in one of two ways.

For a uniform distribution of cities the classic work by Beardwood, Halton and Hammersley (BHH) [Beardwood 1959] obtains an asymptotic best possible upper bound for the minimum tour length for large N. Let {X_i}, 1 ≤ i < ∞, be independent random variables uniformly distributed over the unit square, and let L_N denote the length of the shortest closed path which connects all the elements of {X_1,...,X_N}. In the case of the unit square they proved, for example, that there is a constant c > 0 such that, with probability 1,

    lim_{N→∞} L_N / N^(1/2) = c        (1)

In general c depends on the geometry of the region considered. One can use the estimate provided by the BHH theorem in the following form: the expected length L_N* of a minimal tour for an N-city problem, in which the cities are uniformly distributed in a square region of the Euclidean plane, is given by

    L_N* ≈ c2 √(NR)        (2)

where R is the area of the square and c2 is a constant (for historical reasons known as Stein's constant [Stein 1977]), recently estimated as c2 ≈ 0.70805 ± 0.00007 by Johnson, McGeoch and Rothberg [Johnson 1996].

A second possibility would be to use a problem specific estimate of the minimal tour length which gives a very accurate estimate: the Held-Karp lower bound [Held 1970], [Held 1971].
Computing the Held-Karp lower bound is an iterative process involving the evaluation of Minimal Spanning Trees for N-1 cities of the TSP followed by Lagrangean relaxations, see [Valenzuela 1997].
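The BHH estimate (2) is straightforward to apply in practice; a minimal sketch (the function name is ours):

```python
import math

def bhh_tour_estimate(n_cities, area, c2=0.70805):
    """Estimated minimal tour length L* ~ c2 * sqrt(N * R) for N cities
    uniformly distributed in a square region of area R (BHH theorem,
    with Stein's constant as estimated by Johnson, McGeoch and Rothberg)."""
    return c2 * math.sqrt(n_cities * area)

# e.g. 1000 cities in the unit square: bhh_tour_estimate(1000, 1.0) is about 22.39
```

Such an estimate is only asymptotic and only valid for uniformly distributed cities; for a specific instance the Held-Karp lower bound is far more accurate.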
If one seeks approximate solutions then various algorithms based on simple rule based heuristics (e.g. nearest neighbour and greedy heuristics), or local search tour improvement heuristics (e.g. 2-Opt, 3-Opt and Lin-Kernighan), can produce good quality solutions much faster than exact methods. A combinatorial local search algorithm is built around a `combinatoric neighbourhood search' procedure which, given a tour, examines all tours which are closely related to it and finds a shorter `neighbouring' tour, if one exists. Algorithms of this type are discussed in [Papadimitriou 1982]. The definition of `closely related' varies with the details of the particular local search heuristic. The particularly successful combinatorial local search heuristic described by Lin and Kernighan [Lin 1973] defines `neighbours' of a tour to be those tours which can be obtained from it by doing a limited number of interchanges of tour edges with non-tour edges. The slickest local heuristic algorithms³, which on average tend to have complexity O(n^α) for α > 2, can produce solutions with approximately 1-2% excess for 1000 cities in a few minutes. However, for 10,000 cities the time escalates rapidly and one might expect that the solution quality also degrades, see [Gorges-Schleuter 1990], p 101.

An approximation scheme A is an algorithm which, given a problem instance I and ε > 0, returns a solution of length A(I, ε) such that

    |A(I, ε) - L_min(I)| / L_min(I) ≤ ε        (3)

where L_min(I) denotes the minimal tour length for instance I. Such an approximation scheme is called a fully polynomial time approximation scheme if its run time is bounded by a function that is polynomial in both the instance size and 1/ε. Unfortunately the following theorem holds, see for example [Lawler 1985], p165-166.

Theorem. If P ≠ NP then there can be no fully polynomial time approximation scheme for the TSP, even if instances are restricted to points in the plane under the Euclidean metric.
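The 2-Opt tour improvement heuristic mentioned above can be sketched in a few lines; this is a plain textbook version, not any particular author's implementation:

```python
import math

def tour_length(tour, pts):
    """Length of a closed tour over the points pts, visiting in tour order."""
    return sum(math.dist(pts[tour[i]], pts[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def two_opt(tour, pts):
    """Simple 2-Opt: repeatedly reverse a tour segment whenever doing so
    replaces two tour edges by two shorter non-tour edges."""
    improved = True
    while improved:
        improved = False
        n = len(tour)
        for i in range(n - 1):
            for j in range(i + 2, n - (i == 0)):  # never break the same edge twice
                a, b = tour[i], tour[i + 1]
                c, d = tour[j], tour[(j + 1) % n]
                if (math.dist(pts[a], pts[c]) + math.dist(pts[b], pts[d])
                        < math.dist(pts[a], pts[b]) + math.dist(pts[c], pts[d])):
                    tour[i + 1:j + 1] = reversed(tour[i + 1:j + 1])
                    improved = True
    return tour
```

Each accepted move strictly shortens the tour, so the procedure terminates in a local minimum of the 2-interchange neighbourhood; 3-Opt and Lin-Kernighan enlarge this neighbourhood.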
Although the possibility of a fully polynomial time approximation scheme is effectively ruled out, there remains the possibility of an approximation scheme which, although not polynomial in 1/ε, does have a running time which is polynomial in n for every fixed ε > 0. The Karp algorithms, based on cellular dissection, provide `probabilistic' approximation schemes for the geometric TSP.

Theorem [Karp 1977]. For every ε > 0 there is an algorithm A(ε) such that A(ε) runs in time C(ε)n + O(nlogn) and, with probability 1, A(ε) produces a tour of length not more than (1 + ε) times the length of a minimal tour.

The Karp-Steele algorithms [Steele 1986] can in principle converge in probability to near optimal tours very rapidly. Cellular dissection is a form of divide and conquer. Karp's algorithms partition the region R into small subregions, each containing about t cities. An exact or heuristic method is then applied to each subproblem and the resulting sub-tours are finally patched together to yield a tour through all the cities.

³ The most impressive results in this direction are due to David Johnson at AT&T Bell Laboratories - mostly reported in unpublished Workshop presentations.

Evolutionary Divide and Conquer.

Until recently the best genetic algorithms designed for TSP problems have used permutation crossovers, for example [Davis 1985], [Goldberg 1985], [Smith 1985], or edge recombination operators [Whitley 1989], and required massive computing power to gain very good approximate solutions (often actually optimal) to problems with a few hundred cities [Gorges-Schleuter 1990]. Gorges-Schleuter cleverly exploited the architecture of a transputer bank to define a topology on the population and introduce local mating schemes which enabled her to delay the onset of premature convergence. However, this improvement to the genetic algorithm is independent of
any limitations inherent in permutation crossovers. Eventually, for problems of more than around 1000 cities, all such genetic algorithms tend to produce a flat graph of improvement against number of individuals tested, no matter how long they are run. Thus experience with genetic algorithms using permutation operators applied to the Geometric Travelling Salesman Problem (TSP) suggests that these algorithms fail in two respects when applied to very large problems: they scale rather poorly as the number of cities n increases, and the solution quality degrades rapidly as the problem size increases much above 1000 cities.

An interesting novel approach developed by Valenzuela and Jones [Valenzuela 1994], which seeks to circumvent these problems, is based on the idea of using the genetic algorithm to explore the space of problem subdivisions, rather than the space of solutions itself. This alternative method, for genetic algorithms applied to hard combinatoric search, can be described as Evolutionary Divide and Conquer (EDAC), and the approach has potential for any search problem in which knowledge of good solutions for subproblems can be exploited to improve the solution of the problem itself. As they say:

    Essentially we are suggesting that intrinsic parallelism is no substitute for divide and conquer in hard combinatoric search and we aim to have both. [Valenzuela 1994]

The goal was to develop a genetic algorithm capable of producing reasonable quality solutions for problems of several thousand cities, and one which will scale well as the problem size n increases. `Scaling well' in this context almost inevitably means a time complexity of O(n) or at worst O(nlogn). This is a fairly severe constraint: for example, given a list of n city co-ordinates, the simple act of computing all possible edge lengths, an O(n²) operation, is excluded. Such an operation may be tolerable for n = 5000 but becomes intolerable for n = 100,000.
In the previous section we mentioned the Karp and Steele cellular dissection algorithms, and it is this technique which is the basis of the Valenzuela-Jones EDAC genetic algorithms for the TSP.

Figure 2-4 Solution to 50 city problem using Karp's deterministic bisection method.
In practice a one-shot deterministic Karp algorithm yields rather poor solutions, typically 30% excess (with simple patching) when applied to 500 - 1000 city problems. Nevertheless, the Karp technique is a good starting point for exploring EDAC applied to the TSP. There are several reasons. First, according to Karp's theorem there is some probabilistic asymptotic guarantee of solution quality as the problem size increases. Second, the time complexity is about as good as one can hope for, namely O(nlogn). The run time of a genetic algorithm based on exploring the space of `Karp-like' solutions will be proportional to nlogn multiplied by the number of times the Karp algorithm is run, i.e. the number of individuals tested.

Karp's algorithm proceeds by partitioning the problem recursively from the top down. At each step the current rectangle is bisected horizontally or vertically, according to a deterministic rule designed to keep the rectangle perimeter minimal. This bisection proceeds until each subrectangle contains at most a preset number of cities t (typically t ≈ 10). Each small subproblem is then solved and the resulting subtours are patched together to produce a solution to the original problem - see Figure 2-4.

Figure 2-5 The EDAC (top) and simple 2-Opt (bottom) time complexity (log scales).

In the EDAC algorithm the genotype is a p × p binary array in which a `1' or `0' indicates whether to cut horizontally or vertically at the current bisection. If we maintain the subproblem size, t, and increase the number of cities in the TSP, then a partition better than Karp's becomes progressively harder to find by randomly choosing a horizontal or vertical bisection at each step. If the problem size is n ≈ 2^k t, where 2^k is the number of subsquares, then the corresponding genotype requires at least n/t - 1 bits.
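The recursive dissection can be sketched as follows. This is a simplified illustration under our own assumptions: Karp's actual rule chooses the cut so as to keep subrectangle perimeters small, which we approximate here by always cutting the longer side at the median city; in EDAC the horizontal/vertical choice would instead be read from the genotype bit for that bisection.

```python
def karp_partition(cities, t=10, rects=None, rect=None):
    """Karp-style recursive dissection (a sketch): bisect the current
    rectangle across its longer side until each subrectangle holds at
    most t cities; returns the resulting city groups."""
    if rect is None:
        xs = [c[0] for c in cities]; ys = [c[1] for c in cities]
        rect = (min(xs), min(ys), max(xs), max(ys))
    if rects is None:
        rects = []
    if len(cities) <= t:
        rects.append(cities)
        return rects
    x0, y0, x1, y1 = rect
    axis = 0 if (x1 - x0) >= (y1 - y0) else 1   # cut across the longer side
    cities = sorted(cities, key=lambda c: c[axis])
    m = len(cities) // 2
    cut = cities[m][axis]
    left_rect = (x0, y0, cut, y1) if axis == 0 else (x0, y0, x1, cut)
    right_rect = (cut, y0, x1, y1) if axis == 0 else (x0, cut, x1, y1)
    karp_partition(cities[:m], t, rects, left_rect)
    karp_partition(cities[m:], t, rects, right_rect)
    return rects
```

Each group would then be toured exactly (or heuristically) and the subtours patched together.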
The size of the partition space is 2^(p²), which for p = 80 (the value used for n = 5000) is approximately exp(4436). For n = 5000 the size of the permutation search space, roughly estimated using Stirling's formula, is around exp(37586). Thus searching partition space is easier than searching permutation space, and this provides a third argument in favour of exploring this representation of problem subdivision as a genotype. We know from Karp's theorem that the class of tours produced by dissection and patching will have representatives very close to the optimum tour, so by restricting attention to this smaller set one is not `throwing out the baby with the bath-water', i.e. the set may be smaller but it nevertheless contains near optimal tours.

This approach contrasts sharply with the idea of `broadcast languages' mooted in Chapter 8 of [Holland 1975], in which techniques for searching the space of representations for a genetic algorithm are discussed. In general the space of representations is vastly larger than the search space of the problem itself, but we have seen with the TSP that this space is already so huge that it is impractical to search in any comprehensive fashion for all except the smallest problems. Hence, it seems unlikely that replacing the original search space by an even larger one will turn out to be a productive approach.
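The two space sizes quoted above are easily checked; a small sketch (using the log-gamma function in place of Stirling's approximation):

```python
import math

# Natural-log sizes of the two search spaces for n = 5000, p = 80.
p, n = 80, 5000
ln_partition_space = p * p * math.log(2)              # |partition space| = 2**(p*p)
# Number of distinct tours is (n-1)!/2; take logs via lgamma: ln((n-1)!) = lgamma(n).
ln_permutation_space = math.lgamma(n) - math.log(2)
# ln_partition_space is about 4.4e3 and ln_permutation_space about 3.76e4,
# in line with the exp(4436) and exp(37586) figures quoted in the text.
```

The gap of some 33,000 in the exponent is what makes the partition-space representation attractive.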
Figure 2-6 The 200 generation EDAC 5% excess solution for a 5000 city problem.

In any event even the EDAC algorithm requires clever recursive repair techniques to improve the accuracy when subtours are patched together. Nevertheless, the algorithm scales well. Figure 2-5 compares the EDAC algorithms with simple 2-Opt (which gives an accuracy of around 8% excess). This version of the EDAC algorithm produces solutions at the 5% level, see Figure 2-6, but a later more elaborate variant reliably produces solutions with around 1% excess and has been tested on problem sizes of up to 10,000 cities. This technique probably represents the best that can be done at the present time using genetic algorithms for the TSP. It is not yet practical by comparison with iterated Lin-Kernighan (or even 2-Opt)⁴, but it scales well and may eventually offer a viable technique for obtaining good solutions to TSP problems involving several hundred thousand cities.

Parallel EDACII and EDACIII were both tested on a range of problems between 500 and 5000 cities. Parental pairs were chosen from the initial random population and the mid-parent value of the tour lengths calculated and recorded. Crossover and mutation were then applied to each selected parental pair and the tour length evaluated for the resulting offspring. Pearson's correlation coefficient, r_xy, was calculated in each experiment and significance tests based on Fisher's transformation carried out in order to establish whether the resulting correlation coefficients differed significantly from zero (i.e. no correlation). The scatter diagrams in Figure 2-7 and Figure 2-8 illustrate the Price correlation for parallel EDACII and EDACIII on the 5000 city problem.

Although the genotype used in these experiments was a binary array, it could more naturally (at the cost of complication in the coding) be represented by a pair of binary trees, or a quadtree.
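The mid-parent/offspring correlation test described above amounts to the following computation (a sketch; the function names are ours):

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient r_xy for paired samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def fisher_z(r, n):
    """Fisher's transformation: z = artanh(r) is approximately normal with
    standard error 1/sqrt(n-3), so artanh(r)*sqrt(n-3) can be compared
    against normal quantiles to test whether r differs from zero."""
    return math.atanh(r) * math.sqrt(n - 3)
```

Here xs would hold the mid-parent tour lengths and ys the corresponding offspring tour lengths; a value of fisher_z above 1.96 rejects the no-correlation hypothesis at the 5% level.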
⁴ For example, wildly extrapolating the figures gives the breakeven point with 2-Opt at around n = 422,800, requiring some 74 cpu days! Of course, other things would collapse before then.

The use of trees here would be more in keeping with the recursive construction of the phenotype from the genotype, a process analogous to
growth, and it is possible to produce a modified Schema theorem for the case of trees, where the genetic information is encoded in the shape of the tree and information placed at leaf nodes.

Figure 2-7 EDACII mid-parent vs offspring correlation for 5000 cities [Valenzuela 1995].
Figure 2-8 EDACIII mid-parent vs offspring correlation for 5000 cities [Valenzuela 1995].

In Nature very complex phenotypical structures frequently give the appearance of having been constructed recursively from the genotype. Examples of recursive algorithms which lead to very natural looking graphical representations of natural living structures such as trees, plants, and so on, can be found in the work of Lindenmayer [Lindenmayer 1971] on what are now called L-systems. These production systems are very similar to the production rules which define various kinds of context sensitive or context free grammars. The combination of tree structured genotypes, or recursive construction algorithms similar to production rules, with the divide-and-conquer paradigm suggests a powerful computational technique for the compression of complex phenotypical structures into useful genotypical structures. So much so that, as our understanding progresses of exactly how DNA encodes the phenotypical structure of individual biological organisms (particularly the neural systems of mammals), it would be surprising to find that Nature has not employed some such technique.

Chapter references

[Altenberg 1987] L. Altenberg and M. W. Feldman. Selection, generalised transmission, and the evolution of modifier genes. The reduction principle. Genetics 117:559-572.
[Altenberg 1994] L. Altenberg. The Evolution of Evolvability in Genetic Programming. Chapter 3 in Advances in Genetic Programming, Ed. Kenneth E.
Kinnear, Jr., MIT Press, 1994.
[Belew 1990] R. Belew, J. McInerney, and N. N. Schraudolph. Evolving networks: Using the genetic algorithm with connectionist learning. CSE Technical Report CS90-174, University of California, San Diego, 1990.
[Booker 1982] L. B. Booker. Intelligent behaviour as an adaption to the task environment. Doctoral dissertation, University of Michigan, 1982. Dissertation Abstracts International 43(2), 469B.
[Brandon 1990] R. N. Brandon. Adaptation and Environment, pages 83-84. Princeton University Press, 1990.
[Cavalli-Sforza 1976] L. L. Cavalli-Sforza and M. W. Feldman. Evolution of continuous variation: direct approach through joint distribution of genotypes and phenotypes. Proceedings of the National Academy of Sciences U.S.A., 73:1689-1692, 1976.
[Cavicchio 1970] D. J. Cavicchio. Adaptive search using simulated evolution. Doctoral dissertation, University of Michigan (unpublished), 1970.
[Chalmers 1990] David J. Chalmers. The Evolution of Learning: An experiment in Genetic Connectionism. Proceedings of the 1990 Connectionist Models Summer School, San Mateo, CA. Morgan Kaufmann, 1990.
[Collins 1991] R. J. Collins and D. R. Jefferson. Selection in massively parallel genetic algorithms. Proceedings of the Fourth International Conference on Genetic Algorithms. Morgan Kaufmann, 1991.
[Davis 1987] Lawrence Davis, Editor. Genetic Algorithms and Simulated Annealing. Pitman Publishing, London, 1987.
[De Jong 1975] K. De Jong. An analysis of the behaviour of a class of genetic adaptive systems. Doctoral dissertation, University of Michigan, 1975. Dissertation Abstracts International 36(10), 5140B.
[Freedman 1991] D. Freedman, R. Pisani, R. Purves and A. Adhikari. Statistics, Second edition, W. W. Norton, New York, 1991.
[Goldberg 1987] David E. Goldberg and Jon Richardson. Genetic Algorithms with Sharing for Multimodal Function Optimization. Proc. Second Int. Conf. on Genetic Algorithms, pp. 41-49, MIT.
[Gorges-Schleuter 1990] Martina Gorges-Schleuter. Genetic Algorithms and Population Structures: A Massively Parallel Algorithm. Ph.D. Thesis, Department of Computer Science, University of Dortmund, Germany, August 1990.
[Grefenstette 1987] John J. Grefenstette. Incorporating Problem Specific Knowledge into Genetic Algorithms. In Genetic Algorithms and Simulated Annealing, Ed. Lawrence Davis, Pitman Publishing, London.
[Holland 1975] John H. Holland. Adaptation in Natural and Artificial Systems. Ann Arbor: University of Michigan Press.
[Horowitz 1978] E. Horowitz and S. Sahni.
Fundamentals of Computer Algorithms. London, Pitman Publishing Ltd.
[Johnson 1996] D. S. Johnson, L. A. McGeoch and E. E. Rothberg. Asymptotic experimental analysis for the Held-Karp traveling salesman bound. Proceedings 1996 ACM-SIAM Symposium on Discrete Algorithms, to appear.
[Jones 1993] Antonia J. Jones. Genetic Algorithms and their Applications to the Design of Neural Networks. Neural Computing & Applications, 1(1):32-45, 1993.
[Koza 1992] John R. Koza. Genetic Programming: On the Programming of Computers by Means of Natural Selection. Bradford Books, MIT Press, 1992. ISBN 0-262-11170-5.
[Lindenmayer 1971] A. Lindenmayer. Developmental systems without cellular interaction, their languages and grammars. J. Theoretical Biology 30, 455-484, 1971.
[Lyubich 1992] Y. I. Lyubich. Mathematical Structures in Population Genetics. Springer-Verlag, New York, pages 291-306, 1992.
[Manderick 1989] B. Manderick and P. Spiessens. Fine grained parallel genetic algorithms. Proceedings of the Third International Conference on Genetic Algorithms. Morgan Kaufmann, 1989.
[Manderick 1991] B. Manderick, M. de Weger, and P. Spiessens. The genetic algorithm and the structure of the fitness landscape. In R. K. Belew and L. B. Booker, Editors, Proceedings of the Fourth International Conference on Genetic Algorithms, pages 143-150, San Mateo, CA, Morgan Kaufmann.
[Mauldin 1984] M. L. Mauldin. Maintaining diversity in genetic search. National Conference on Artificial Intelligence, 247-250, 1984.
[Macfarlane 1993] D. Macfarlane and Antonia J. Jones. Comparing networks with differing neural-node functions using Transputer based genetic algorithms. Neural Computing & Applications, 1(4):256-267, 1993.
[Menczer 1992] F. Menczer and D. Parisi. Evidence of hyperplanes in the genetic learning of neural networks. Biological Cybernetics 66(3):283-289.
[Miller 1989] G. Miller, P. Todd, and S. Hegde. Designing neural networks using genetic algorithms. In Proceedings of the Third Conference on Genetic Algorithms and their Applications, San Mateo, CA. Morgan Kaufmann, 1989.
[Muhlenbein 1988] H. Muhlenbein, M. Gorges-Schleuter, and O. Kramer. Evolution Algorithms in Combinatorial Optimisation. Parallel Computing, 7, pp. 65-85.
[Price 1970] G. R. Price. Selection and covariance. Nature, 227:520-521.
[Price 1972] G. R. Price. Extension of covariance mathematics. Annals of Human Genetics 35:485-489.
[Salmon 1971] W. C. Salmon. Statistical Explanation and Statistical Relevance. University of Pittsburgh Press, Pittsburgh, 1971.
[Schaffer 1984] J. D. Schaffer. Some Experiments in Machine Learning Using Vector Evaluated Genetic Algorithms. Ph.D. Thesis, Department of Electrical Engineering, Vanderbilt University, December 1984.
[Schwefel 1965] H-P Schwefel. Kybernetische Evolution als Strategie der experimentellen Forschung in der Strömungstechnik. Diploma thesis, Technical University of Berlin, 1965.
[Slatkin 1970] M. Slatkin. Selection and polygenic characters. Proceedings of the National Academy of Sciences U.S.A.
66:87-93, 1970.
[Smith 1980] S. F. Smith. A Learning System Based on Genetic Adaptive Algorithms. Ph.D. Dissertation, University of Pittsburgh.
[Spiessens 1991] P. Spiessens and B. Manderick. A massively parallel genetic algorithm - implementation and first analysis. Proceedings of the Fourth International Conference on Genetic Algorithms. Morgan Kaufmann, 1991.
[Valenzuela 1994] Christine L. Valenzuela and Antonia J. Jones. Evolutionary Divide and Conquer (I): A novel genetic approach to the TSP. Evolutionary Computation 1(4):313-333, 1994.
[Valenzuela 1995] Christine L. Valenzuela. Evolutionary Divide and Conquer: A novel genetic approach to the TSP. Ph.D. Thesis, Department of Computing, Imperial College, London, 1995.
[Valenzuela 1997] Christine L. Valenzuela and Antonia J. Jones. Estimating the Held-Karp lower bound for the geometric TSP. To appear: European Journal of Operational Research, 1997.
[Whitley 1990] D. Whitley, T. Starkweather, and C. Bogart. Genetic algorithms and neural networks: Optimizing connections and connectivity. Parallel Computing, forthcoming.
[Wilson 1990] Perceptron redux. Physica D, forthcoming.
III Hopfield networks.

Introduction.

As far back as 1954 Cragg and Temperley [Cragg 1954] had introduced the spin-neuron analogy using a ferromagnetic model. They remarked "It remains to be considered whether biological analogues can be found for the concepts of temperature and interaction energy in a physical system." In 1974 Little [Little 1974] introduced the temperature-noise analogy, but it was not until 1982 that John Hopfield [Hopfield 1982], a physicist, made significant progress in the direction requested by Cragg and Temperley. In a single short paragraph he suggests one of the most important new techniques to have been proposed in neural networks.

Hopfield nets and energy.

The standard approach to a neural network is to propose a learning rule, usually based on synaptic modification, and then to show that a number of interesting effects arise from it. Hopfield starts by saying that: "The function of the nervous system is to develop a number of locally stable states in state space." Other points in state space flow into the stable points, called attractors. In some other dynamic systems the behaviour is much more complex; for example, the system may orbit two or more points in state space in a non-periodic way, see [Abraham 1985]. However, this turns out not to be the case for the Hopfield net. The flow of the system towards a stable point provides a mechanism for correcting errors, since deviations from the stable points disappear. The system can thus reconstruct missing information, since the stable point will appropriately complete missing parts of an incomplete initial state vector.

Each of n neurons has two states, like those of McCulloch and Pitts: x_i = 0 (not firing) and x_i = 1 (firing at maximum rate). When neuron i has a connection made to it from neuron j, the strength of the connection is defined as w_ij. Non-connected neurons have w_ij = 0.
The instantaneous state of the system is specified by listing the n values of x_i, so it is represented by a binary word of n bits. The state changes in time according to the following algorithm. For each neuron i there is a fixed threshold θ_i. Each neuron readjusts its state randomly in time, but with a mean attempt rate µ, setting

    x_i(t) = 1             if Σ_{j≠i} w_ij x_j(t-1) > θ_i
    x_i(t) = x_i(t-1)      if Σ_{j≠i} w_ij x_j(t-1) = θ_i        (1)
    x_i(t) = 0             if Σ_{j≠i} w_ij x_j(t-1) < θ_i

Thus an element chosen at random asynchronously evaluates whether it is above or below threshold and readjusts accordingly.

Although this model has superficial similarities to the Perceptron there are essential differences. Firstly, Perceptrons were modelled chiefly with the neural connections in a `forward' direction and the analysis of such networks with backward coupling proved intractable. All the interesting results of the Hopfield model arise as a consequence of the strong backward coupling. Secondly, studies of perceptrons usually made a random net of
neurons deal with the external world, and did not ask the questions essential to finding the more abstract emergent computational properties. Finally, perceptron modelling required synchronous neurons, like a conventional digital computer. Although synchrony of sorts must exist in biological nervous systems (for example, the act of walking involves the precise temporal coordination of both legs, and the Purkinje fibres help control the heart), there is certainly no global synchrony in the same sense that electronic hardware is clocked. Given the variations in delays of nerve signal propagation, there would probably be no way to use global synchrony effectively. Computational properties which can exist in spite of asynchrony therefore have interesting implications for biological computing.

Hopfield considers the special case w_ij = w_ji (all i, j), w_ii = 0 (all i) and defines a function

    E = -(1/2) Σ_{i≠j} w_ij x_i x_j + Σ_i θ_i x_i        (2)

which is an analog to the physical energy of the system. A low rate of neural firing is approximated by assuming that only one unit changes state at any given moment. Then, since w_ij = w_ji, the change ΔE due to a change Δx_i is given by

    ΔE = -Δx_i ( Σ_{j≠i} w_ij x_j - θ_i )        (3)

Now consider the effect of the threshold rule (1). If the unit changes state at all then Δx_i = ±1. If Δx_i = 1 the unit changes state from 0 to 1, hence by the threshold rule

    Σ_{j≠i} w_ij x_j > θ_i        (4)

in which case, by (3), ΔE < 0. Alternatively, if Δx_i = -1 the unit changes state from 1 to 0, so that

    Σ_{j≠i} w_ij x_j < θ_i        (5)

and again ΔE < 0. Thus if a unit changes state at all the energy must decrease. State changes will continue until a locally least E is reached.⁵ The energy is playing the role of a Hamiltonian in the more general dynamic system context. For the Hopfield network individual state changes are deterministic.
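The threshold rule (1) and the energy-decrease argument above can be checked numerically; a minimal sketch (the function names are ours):

```python
import random

def energy(w, theta, x):
    """E = -1/2 * sum over i != j of w[i][j]*x_i*x_j + sum of theta_i*x_i (eq. 2)."""
    n = len(x)
    quad = sum(w[i][j] * x[i] * x[j] for i in range(n) for j in range(n) if i != j)
    return -0.5 * quad + sum(theta[i] * x[i] for i in range(n))

def update(w, theta, x, i):
    """Asynchronous threshold update of unit i (eq. 1; state kept on equality)."""
    s = sum(w[i][j] * x[j] for j in range(len(x)) if j != i)
    if s > theta[i]:
        x[i] = 1
    elif s < theta[i]:
        x[i] = 0
    return x
```

With symmetric weights and zero diagonal, repeated random asynchronous updates can never increase E, exactly as the derivation of (3)-(5) shows.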
⁵ This particular argument is valid only if the neural states change one at a time in some random order, which is approximated by a low neural firing rate. However, Cohen and Grossberg have proved a theorem about a much wider class of networks which guarantees a similar kind of stability for asynchronous operation.

However, in more general models, such as the Boltzmann machine, we can add a stochastic component to the node update rule which introduces a parameter T called temperature. At T = 0 state changes are decided deterministically by the threshold rule. For T > 0, as T increases the system becomes progressively less deterministic and more stochastic until, at high temperatures, any individual node is in either state with probability ½. Thus we can regard the Hopfield network in operational mode as the zero temperature case of the Boltzmann machine.

Hopfield now makes the critical observation that "This case is isomorphic with an Ising model.", thereby allowing a deluge of physical theory describing spin-glass models to enter network modelling. This flood of new participants has transformed the field of neural networks.

A spin glass is a magnetic alloy formed, for instance, by dilute solutions of manganese in copper or iron in gold. These impurities interact with each other by means of conduction electrons and the couplings are either of the
ferromagnetic (w_ij > 0) or antiferromagnetic (w_ij < 0) type. The interest of these alloys comes from the fact that they exhibit a wide variety of stable or meta-stable states. The dipoles interact via the couplings w_ij. In the simplest case a spin interacts only with its nearest neighbours, while the equivalent of the neural networks considered here requires infinite range interactions, where each spin is coupled to all the others. In the Ising model the Hamiltonian of such a spin glass is proportional to

    Σ_{i≠j} w_ij x_i x_j        (6)

the spins contributing to the total energy by pairwise interactions, and the system stabilizes at an equilibrium point which is a minimum of the free energy, see [Binder 1986].

Procedure Hopfield (assumes weights are assigned):
    Randomise initial state x ∈ {0, 1}^n
    Repeat until updating every unit produces no change of state:
        Select unit i (1 ≤ i ≤ n) with uniform random probability.
        Update unit i according to (1).

Algorithm 3-1 Hopfield network.

A number of computer simulations and some analysis led Hopfield to conclude that the number of `memories' (point attractors) that could be stored by a network was about 0.15n, where n is the number of neurons in the network, a figure quite precisely confirmed by later work. In an analytic tour de force [Amit 1987] it is shown that the Hopfield model can be solved exactly, in the thermodynamic limit as n → ∞. A phase diagram (temperature T, storage ratio P/n) is obtained, where T is a measure of the noise level and P/n is the ratio of the number of learnt patterns to the number of neurons [Crisanti 1986]. The main result is the existence of a sharp, discontinuous phase transition at P/n = A_o ≈ 0.14 (as T → 0). When P < A_o n the retrieval is very good (0.97 correlation) but it drops suddenly for P > A_o n.

Real neurons need not make synapses both i→j and j→i. We therefore ask if w_ij = w_ji is important.
Without this condition the probability of making errors is increased, but the algorithm continues to generate stable minima. Why should stable limit points or regions persist when w_ij is not equal to w_ji? If the algorithm at some time changes x_i from 0 to 1, or vice versa, the change of energy can be split into two terms. The first is the change that would apply in the symmetric model. The second is identically zero if w_ij is symmetric, and is `stochastic' with mean zero if w_ij and w_ji are randomly chosen. In the non-symmetric case the algorithm therefore changes E in time in a fashion similar to the symmetric case but corresponding to a finite temperature, i.e. a lower signal-to-noise ratio.

In [Hopfield 1984] a more realistic neuron model is used in which an internal continuous variable stores the linear sum of the excitation and inhibition weighted by the appropriate connection strengths. The internal variable is converted into an output activity by a sigmoidal non-linearity. As in the earlier paper he sets up an energy function and shows that the evolution of the system in time, given the properties of the neurons, will be to decrease energy. Thus the results found in the previous paper still largely hold. There is a brief, but clear, account of the Hopfield model in [Farhat 1985], in which an optical implementation is described.

The outer product rule for assigning weights.

An early rule used for memory storage in associative memory models can also be used to store memories in the
Hopfield model. If the {0, 1} node outputs are mapped to x_i ∈ {−1, +1} the rule assigns weights as follows. For each pattern vector x which we require to memorise, we consider the matrix

                                       [ x_1 x_1   x_1 x_2   ...   x_1 x_n ]
                                       [ x_2 x_1   x_2 x_2   ...   x_2 x_n ]
    x x^T,  x = (x_1, x_2, ..., x_n) = [    .         .      ...      .    ]        (7)
                                       [ x_n x_1   x_n x_2   ...   x_n x_n ]

and then average these matrices over all pattern vectors (prototypes). At the time the explanation was that in this way we can capture the average correlations between components of the pattern vectors and then use this information, during the operation of the network, to recapture missing or corrupted components.

Assuming that we know the patterns required to be memorised, the outer-product rule is a one-shot computation of the weights and so perhaps should not qualify as a `learning rule' in the usual sense of progressive weight modification upon exposure to experience. However, from the perspective of the Hopfield model there is another interpretation that can be given to this rule. For a point x = (x_1, ..., x_n) to be a stable attractor (i.e. a memory) we require that it be a local energy minimum. Suppose that patterns are presented sequentially and we wish to determine some rule which is intended to make frequently presented patterns likely to be energy minima. If we suppose that x is given at some stage, then it has an associated energy and, if we calculate

    ∂E/∂w_ij = −x_i x_j                                                     (8)

then by taking Δw_ij proportional to minus this gradient we obtain

    Δw_ij = η x_i x_j                                                       (9)

where η > 0 is some small constant. In other words the averaging process of the outer product rule can now be seen, in the context of progressive learning, as a form of gradient descent. Unfortunately, however we assign the weights, as progressively more prototype memories are added, the system eventually reaches saturation (at around 0.15n).
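The outer-product rule is easy to exercise numerically. A minimal sketch (Python, illustrative; it uses the ±1 convention into which the {0, 1} outputs are mapped) stores a few random prototypes, well below the 0.15n limit, and recovers one of them from a corrupted probe:

```python
import numpy as np

def outer_product_weights(patterns):
    # Average of the outer products (7) over the +/-1 prototypes; zero diagonal.
    w = sum(np.outer(p, p) for p in patterns) / len(patterns)
    np.fill_diagonal(w, 0.0)
    return w

def recall(w, x, sweeps=20):
    # Asynchronous threshold updates in the +/-1 convention.
    x = x.copy()
    for _ in range(sweeps):
        for i in range(len(x)):
            x[i] = 1.0 if w[i] @ x >= 0 else -1.0
    return x

rng = np.random.default_rng(1)
n, P = 100, 5                            # P = 5 prototypes, well below 0.15 * n
patterns = rng.choice([-1.0, 1.0], size=(P, n))
w = outer_product_weights(patterns)
probe = patterns[0].copy()
probe[:10] *= -1.0                       # corrupt 10 of the 100 components
recovered = recall(w, probe)             # should relax back to patterns[0]
```

With P pushed towards and beyond 0.15n the same experiment begins to fail, which is the saturation behaviour discussed in the text.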
This happens because the number of local minima is not really under our control: as more memories are added we obtain an exponentially increasing number of `spurious memories', i.e. local minima that do not correspond to patterns we wish to memorise. For example, we have

Theorem [Tanaka 1980], [McEliece 1987]. If the synaptic matrix is symmetric with w_ii = 0 and if its elements are independent Gaussian variables with zero mean and unit variance, then an asymptotic estimate for the number of fixed points is given by

    N_F ≈ (1.0505) 2^{0.2874 n}                                             (10)

A proof can be found in [Kamp 1990].

Networks for combinatoric search.

Apart from their applications to associative recall or pattern recognition, Hopfield networks can be applied to the very different problem of combinatoric search. It should be made clear at the outset that, with the possible exception of the Boltzmann machine, the application of neural networks to hard combinatoric search has not yet yielded systems or algorithms which compare favourably with state-of-the-art probabilistic algorithms designed for the specific problem, e.g. the TSP. However, this area is of considerable theoretical interest and may eventually prove to be of practical interest.
We know that the dynamics of a Hopfield network cause it to relax into a local energy minimum. Given a specific combinatoric search problem to be solved we are faced with two issues. First we have to design a representation which relates network states to the objects in the search space. Second we have to arrange that low energy states of the network correspond to good solutions.

To make matters specific we can consider the geometric TSP. Here the objects of search are tours and, given a list of N cities, we can identify a tour as any permutation of this list. There are N! permutations and only ½(N−1)! distinct tours, so we have already introduced some replication by simply identifying tours with permutations, but this causes no serious problems of itself. The next step is to consider how tours might be represented as a state of a network. Here, the generally used method is illustrated below. For a 5-city problem {A,B,C,D,E}, if city A is in position 2 of the tour this is represented by the second neuron from an array of five having an output of 1 and all others in the array having an output of 0, i.e. (0,1,0,0,0). The global state of Table 2-2 represents the tour (C,A,E,B,D). Thus for N cities a total of n = N² neurons are required to specify a complete tour.

Table 2-2 Example representation.

         1   2   3   4   5
    A    0   1   0   0   0
    B    0   0   0   1   0
    C    1   0   0   0   0
    D    0   0   0   0   1
    E    0   0   1   0   0

Clearly there is a 1-1 correspondence between valid tours and the set of all N×N permutation matrices, i.e. matrices which have precisely one `1' in each row and column, all other entries being zero. For an N-city TSP there are N! such matrices, which represent tours, out of 2^{N²} states in all. Now, this is not a very satisfactory situation. We have replaced the original search space, of size order N!, by a space of much greater size because

    log(N!) ≈ N log N − N + ½ log(2πN)
                                                                            (11)
    log(2^{N²}) = N² log 2

where the approximation for N!
is Stirling's formula. This is contrary to a guiding principle that wherever possible in hard combinatoric search we should simplify the space searched (subject to the condition that good solutions are rich in the smaller space) rather than make it larger. Nevertheless, the above representation has a certain paradigmatic simplicity and will serve to illustrate the ideas. The next problem to be addressed is: how to assign the weights so that states with low energy correspond to short tours. This `assignment problem' is dealt with in the next section.

Assignment of weights for the TSP.

Let d_ij denote the distance between city i and city j. Here we shall formulate the TSP as a 0-1 programming problem, by defining it as a quadratic assignment problem [Garfinkel 1985]. The TSP can also be formulated as a linear assignment problem [Aarts 1988] but, of course, then there are far more constraints. Using the n = N² node state variables defined by

    x_ip = 1, if the tour visits city i at the p-th position                (12)
           0, otherwise

we can formulate the TSP as the following quadratic assignment problem: Minimise
    F(x) = Σ_{i,j,p,q=0}^{N−1} a_ijpq x_ip x_jq                             (13)

subject to x_ip, x_jq ∈ {0, 1} and

    Σ_{i=0}^{N−1} x_ip = 1   (0 ≤ p ≤ N−1)
                                                                            (14)
    Σ_{p=0}^{N−1} x_ip = 1   (0 ≤ i ≤ N−1)

The first condition of (14) asserts there is just one `1' in every column, and the second condition places a similar constraint on every row. The a_ijpq are defined by

    a_ijpq = d_ij, if q ≡ p ± 1 (mod N)                                     (15)
             0,    otherwise

Figure 3-1 Distance connections. Each node (i, p) has inhibitory connections to the two adjacent columns, p − 1 (mod N) and p + 1 (mod N), whose weights reflect the cost of joining the three cities.

Figure 3-2 Exclusion connections. Each node (i, p) has inhibitory connections to all units in the same row and column.

We wish to choose the weights so that minimising the objective function (13), subject to the constraints of (14), corresponds to minimising the energy as defined in (2), (1). Figure 3-1 and Figure 3-2 illustrate the connections made to each node of the network. These are divided into two types. Distance connections, for which the weights are chosen so that, if the net is in a state which corresponds to a tour, these weights will reflect the energy cost of joining the (p − 1)th city of the tour to the pth city, and the pth city to the (p + 1)th. Note these connections wrap around (mod N). Exclusion connections, which inhibit two units in the same row or column from being on at the same time. Exclusion connections are designed so as to encourage the network to settle in a state which corresponds to a tour state. As all connections so far are inhibitory we need to provide the network with some incentive to turn on any units at all. This can be done by manipulating
the thresholds. Intuitively we can see that some arrangement such as this might well have the desired effect, but the next theorem shows exactly how to choose these weights so that it does. We consider the following sets of parameters

    S_t = { θ_ip : 0 ≤ i, p ≤ N − 1 }
    S_d = { w_ipjq : i ≠ j and q ≡ p ± 1 (mod N) }                          (16)
    S_e = { w_ipjq : (i = j and p ≠ q) or (i ≠ j and p = q) }

Theorem (Aarts). Let the weights and thresholds be chosen so that

    ∀ θ_ip ∈ S_t we have θ_ip < −max{ d_ik + d_il : k ≠ l, 0 ≤ k, l ≤ N − 1 }
    ∀ w_ipjq ∈ S_d we have w_ipjq = −d_ij                                   (17)
    ∀ w_ipjq ∈ S_e we have w_ipjq < min{ θ_ip, θ_jq }

then

(i) Feasibility. Valid tour states of the network exactly correspond to local minima of the energy function.

and

(ii) Ordering. The energy function is order-preserving with respect to tour length.

Proof. (i) Feasibility. Firstly, it is easy to check that a tour state is indeed a local minimum. We simply note the effect of changing the state of some unit. Now suppose the state is not a tour state. We divide this into two possible cases.

Case 1. Suppose the network state has more than a single `1' in some row or column. Then at least one exclusion connection is activated. For definiteness suppose units (i, p) and (j, p) are both on and that θ_ip = min{θ_ip, θ_jp} > w_ipjp. What is the effect of turning unit (i, p) off? Distance connections cause no problems, for suppose that some unit (k, q), with q ≡ p ± 1 (mod N), in an adjacent column is also on. In this case turning off (i, p) will remove a distance connection contribution d_ik to the energy, and so decrease it. For the exclusion connection, if the weights are chosen according to (17), then turning off the connected unit (i, p), which has lowest threshold, causes a change in energy

    ΔE = w_ipjp − θ_ip < 0                                                  (18)

and so again reduces the energy. Hence a network state with too many `1's in some row or column cannot correspond to a local minimum.

Case 2.
Now suppose that the network state has at most one `1' in every row and column (so that there are at most N units on) and that at least one row or column has no unit on. Suppose, for definiteness, that the pth column has no unit on. If every row contained some unit which is on then, with only N − 1 columns available, some column would have to contain two units which are on, which is contrary to hypothesis. Hence there must be fewer than N units on, so some row, say the ith, has no unit on; consider the unit (i, p). Turning this unit on does not contribute to the energy via any exclusion connections (because all other units in the same row or column are off). We next consider the contribution due to the distance connections. In each of the adjacent columns (mod N) there is at most one unit on. Call these (k, p−1) and (l, p+1) respectively, where it is understood that the indices p−1 and p+1 are taken (mod N). The choice of weights in (17) ensures that the change of energy produced by turning unit (i, p) on is
    ΔE = −w_{kp−1,ip} x_{kp−1} x_ip − w_{ip,lp+1} x_ip x_{lp+1} + θ_ip x_ip
       = d_ik + d_il + θ_ip < 0                                             (19)

Hence the initial state could not have been an energy minimum. The two cases considered cover all possibilities for network states which are not tour states, and so we conclude that the tour states exactly correspond to energy minima.

(ii) Ordering. To show that the energy function is order preserving we first observe that θ_ip does not depend on p, i.e. all θ_ip for units in a given row are equal. Consequently, the contribution to the energy from the threshold terms is the same for all tour states. The remaining terms contribute

    ½ Σ_{(i,p) ≠ (j,q)} d_ij x_ip x_jq                                      (20)

which evaluates to the tour length. Hence energy preserves the ordering of tour length. ∎

At first sight this theorem may seem paradoxical. In analysing the memory capacity of a Hopfield network, we concluded earlier that only around 0.15n prototype vectors could be memorised before the onset of catastrophic degradation in recall. Thus we were only able to control the network behaviour at a relatively small proportion of an exponentially large (as in (10)) number of local minima. Yet, in the assignment of weights for the TSP network, we have N! local minima exactly where we want them. How can this be? The answer lies in the following observation. In analysing the effect of the outer-product rule for assigning weights in order to recall stored memories, we assumed that there was no structure to the prototype vectors, i.e. that they were uncorrelated. In the case of the TSP assignment there is a very definite structure in the set of states we are seeking to assign to minima: the structure imposed by the ½N(N−1) distances between the cities of the original problem. The TSP weight assignment respects this structure and that additional degree of control allows us to place N! states in exactly the relationship needed to solve the problem.
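The weight assignment of the theorem can be checked numerically on a small instance. In the sketch below (Python; the instance and all names are illustrative) the energy of every tour state is computed from the distance and threshold terms, confirming the ordering property: tour-state energies differ from tour lengths only by the constant threshold contribution.

```python
import numpy as np
from itertools import permutations

def tsp_energy(d, x, theta):
    # E = -1/2 sum w_uv x_u x_v + sum theta_u x_u for the network of (13)-(17).
    # Units are indexed (i, p): city i in tour position p; theta is constant.
    N = d.shape[0]
    E = theta * x.sum()
    # distance connections: w_{ip,jq} = -d_ij for q = p +/- 1 (mod N), i != j
    for p in range(N):
        E += 0.5 * sum(d[i, j] * x[i, p] * (x[j, (p - 1) % N] + x[j, (p + 1) % N])
                       for i in range(N) for j in range(N) if i != j)
    # exclusion connections link units in the same row or column; for a tour
    # state exactly one unit per row and column is on, so they contribute 0.
    return E

rng = np.random.default_rng(0)
N = 4
pts = rng.random((N, 2))                                   # random planar cities
d = np.hypot(*(pts[:, None, :] - pts[None, :, :]).transpose(2, 0, 1))
theta = -(2 * d.max() + 1e-9)         # theta < -max(d_ik + d_il), as in (17)

def tour_state(perm):
    s = np.zeros((N, N))
    s[list(perm), range(N)] = 1       # city perm[p] occupies position p
    return s

def tour_length(perm):
    return sum(d[perm[p], perm[(p + 1) % N]] for p in range(N))

tours = list(permutations(range(N)))
E = [tsp_energy(d, tour_state(t), theta) for t in tours]
L = [tour_length(t) for t in tours]
# Ordering: E and L differ only by the constant N * theta over all tour states.
```

Minimising the energy over tour states is therefore exactly minimising tour length, which is the content of part (ii).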
A Mathematica implementation of the asynchronous Hopfield network is given in the Mathematica directory. The assignment of weights for a specific TSP problem discussed here is also given there.

The Hopfield and Tank application to the TSP.

Hopfield and Tank first proposed using the Hopfield model to solve the TSP in [Hopfield 1986]. Their choice of energy function was motivated by similar considerations to those of the previous section, but was slightly different in detail. For a specific TSP problem we assume that the assignment of weights is made as described in the previous section. Suppose we now initialise the corresponding Hopfield network to a random state. Then if the network is run we should expect it to settle into a local energy minimum, and we have proved that this will correspond to some tour state. However, it is unlikely to correspond to the optimal tour.

Hopfield and Tank suggested an ingenious way of overcoming this problem, which is very similar to the Boltzmann machine approach (not discussed in these notes). Instead of the {0, 1} network Hopfield and Tank used a continuous model, 0 ≤ x_i ≤ 1, in which neurons possess a sigmoidal activation function x_i = f(λ net_i), where net_i is the net input and λ is a positive constant which represents the gain and is equivalent to varying the slope of the sigmoidal. Hence f represents the input-output characteristics of a non-linear amplifier with negligible response time. The discrete model represents the case where λ → ∞. In their simulation λ is taken to be 50 (large but finite).

For 10 cities there are 181,440 possible tours. In the Hopfield and Tank simulations about 50% of the trials produced one of the two shortest paths. They ask why the computation is so effective and provide the following
answer:

    "The solution to a TSP is a path and the decoding of the TSP's network final stable state to obtain this discrete decision or solution requires having the final x_ip values to be near 0 or 1. However, the actual analog computation occurs in the continuous domain 0 ≤ x_ip ≤ 1. The decision-making process or computation consists of the smooth motion from an initial state in the interior of the space (where the notion of a `tour' is not even defined) to an ultimate stable point near enough to a corner of the continuous domain to be able to identify with that corner. It is as though the logical operations of a calculation could be given continuous values between `true' and `false', and evolve toward certainty only near the end of the calculation."

Naturally, with such an interesting approach to an NP-hard problem, others tried to repeat these results. It was found by a number of researchers that the Hopfield-Tank algorithm, as originally formulated, was highly unstable: "Our simulations indicate that Hopfield and Tank were very fortunate in the limited number of TSP simulations they attempted. Even at the value N = 10 it transpires that their basic method is unreliable..." [Wilson 1988]. However, others later refined the energy function and modified the algorithm to perform more reliably. There are also a number of related papers on `elastic net' methods for the TSP problem [Durbin 1987].

Conclusions.

One lesson we have learned from this chapter is that the performance of networks for hard combinatoric search is critically dependent both on the mapping from the problem domain to the network, and on the details of the network architecture and weight assignment which encode the constraints of the original problem. Appropriate encoding of domain knowledge into the network architecture can profoundly enhance the overall performance.
One of the important general questions which arises from these observations is: how can such encodings of basic constraints themselves be learnt, possibly at the genetic level?

In another direction entirely, we first discussed associative memories with an emphasis on direct methods of encoding prototype patterns into the weights of a fully connected network. As our study of these networks progressed the emphasis moved to a perspective more in accord with dynamic systems. This also mirrors the historical development of the subject of artificial neural networks. Following Hopfield's papers prototype vectors, or `memories', were associated with point attractors of the dynamics of the network. However, we can see that this preoccupation is, in itself, just a special case of a much more general vista. Why should we restrict our attention to the point attractors of the system? In modelling biological systems we are studying techniques for embedding learned behaviour into the system. In other words we should really be studying the following question:

! How can we sculpt the dynamic evolution of the system through time so as to induce behaviours characteristic of, or responsive to, a given external dynamic system?

If we loosely identify a `behaviour' with a trajectory through the state space of the network, then point attractors are just one of the simplest characterisations of dynamic system behaviour. It would be much more interesting to develop learning techniques for whole classes of trajectories, or even chaotic attractors; in other words, to address the issue of how to get a neural network to capture a model of a given dynamic system or Markov process, for example. This is very much the line of reasoning suggested by Freeman's studies and simulations of biological neural systems [Freeman 1991]. Phase portraits made from EEGs generated by computer models reflect the overall activity of the olfactory system of a rabbit at rest and in response to a familiar scent (e.g.
banana). The resemblance of these portraits to irregularly shaped, but still structured, coils of wire reveals that brain activity in both conditions is chaotic but that the response to the known stimulus is more ordered, more nearly periodic during perception, than at rest.

Chapter references
[Aarts 1988] E. Aarts and J. H. M. Korst. Boltzmann machines for travelling salesman problems. European Journal of Operational Research, in press.

[Abraham 1985] R. H. Abraham and C. D. Shaw. Dynamics - The Geometry of Behaviour, Part 2: Chaotic Behaviour. Arial Press, 1985.

[Amit 1987] D. J. Amit, H. Gutfreund, and H. Sompolinsky. Information storage in neural networks with low levels of activity. Physical Review A, 35:2239-2303, 1987.

[Binder 1986] K. Binder and A. P. Young. Spin-Glasses - Experimental Facts, Theoretical Concepts and Open Questions. Reviews of Modern Physics, 58(1):801-976, 1986.

[Burr 19XX] D. J. Burr. An Improved Elastic Net Method for the Traveling Salesman Problem, ???

[Cragg 1954] B. G. Cragg and H. N. V. Temperley. Electroencephalog. Clin. Neurophys. 6:85, 1954.

[Crisanti 1986] A. Crisanti, D. J. Amit, and H. Gutfreund. Europhys. Letters 2:337, 1986.

[Durbin 1987] R. Durbin and D. Willshaw. An analogue approach to the travelling salesman problem using an elastic net method. Nature 326, 16 April 1987.

[Farhat 1985] N. H. Farhat, et al. Optical implementation of the Hopfield model. Applied Optics, 24:1469-1475, 1985.

[Feller 1966] W. Feller. An introduction to probability theory and its applications, Vol. 1. J. Wiley & Sons, New York, 1966.

[Freeman 1991] W. J. Freeman. The physiology of perception. Scientific American, pp 34-41, February 1991.

[Garfinkel 1985] R. S. Garfinkel. Motivation and modelling, in The Travelling Salesman Problem: a guided tour of combinatoric optimisation. Eds. E. L. Lawler, J. K. Lenstra, A. H. G. Rinnooy Kan, and D. B. Shmoys. Wiley, Chichester, 1985.

[Hertz 1991] J. Hertz, A. Krough, and R. G. Palmer. Introduction to the theory of neural computing. Addison-Wesley, 1991. ISBN 0-201-51560-1 (pbk).

[Hopfield 1982] J. J.
Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences 79:2554-2558, 1982.

[Hopfield 1984] J. J. Hopfield. Neurons with graded response have collective computational properties like those of two state neurons. Proceedings of the National Academy of Sciences 81:3088-3092, 1984.

[Hopfield 1986] J. J. Hopfield and D. W. Tank. `Neural' computation of decisions in optimization problems. Biological Cybernetics, 52:141-152, 1986.

[Kamp 1990] Y. Kamp and M. Hasler. Recursive Neural Networks for Associative Memory. John Wiley & Sons, New York, 1990.

[Keeler 1988] J. D. Keeler. Cognit. Sci. 12:299-329, 1988.

[Keeler 1989] J. D. Keeler, E. E. Pichler and J. Ross. Noise in neural networks: Thresholds, hysteresis, and neuromodulation of signal-to-noise. Proc. Nat. Acad. Sci. USA, 86:1712-1716, March 1989.
[Little 1974] W. A. Little. Math. Biosci. 19:101, 1974.

[McEliece 1987] R. McEliece, E. Posner, E. Rodemich, and S. Venkatesh. The capacity of the Hopfield associative memory. IEEE Transactions on Information Theory, IT-33:461-482, 1987.

[Tanaka 1980] F. Tanaka and S. Edwards. Analytic theory of the ground state properties of a spin glass: I. Ising spin glass. J. Phys. F: Metal Phys., 10:2769-2778, 1980.

[Wilson 1988] G. V. Wilson and G. S. Pawley. On the Stability of the Travelling Salesman Problem Algorithm of Hopfield and Tank. Biological Cybernetics 58:63-70, 1988.
IV The WISARD model.

Introduction.

WISARD (WIlkie, Stonham, Aleksander Recognition Device) is an implementation in hardware of the n-tuple sampling technique first described in [Bledsoe 1959]. The scheme outlined in Figure 4-1, Figure 7-2 was first proposed by Aleksander and Stonham in [Aleksander 1979].

Figure 4-1 Schematic of a 3-tuple recogniser.

The sample data to be recognized is stored as a 2-dimensional array (the 'retina') of binary elements. Depending on the nature of the data this can be done in a variety of ways. For visual processing we can simply place a pre-processed version of the image onto the retina. For temporal data in signal processing or speech recognition
successive samples in time can be stored in successive columns, and the value of the sample represented by a coding of the binary elements in each column. The particular coding used is liable to depend on the application. One of several possible codings is to represent a sample feature value by a 'bar' of binary 1's, the length of the bar being proportional to the value of the sample feature.

Wisard model.

Random connections are made onto the elements of the array, n such connections being grouped together to form an n-tuple which is used to address one random access memory (RAM) per discriminator. In this way a large number of RAMs are grouped together to form a class discriminator whose output or score is the sum of all its RAMs' outputs. This configuration is repeated to give one discriminator for each class of pattern to be recognized. The RAMs implement logic functions which are set up during training; thus the method does not involve any direct storage of pattern data.

A random map from array elements to n-tuples is preferable in theory, since a systematic mapping is more likely to render the recogniser blind to distinct patterns having a systematic difference. Hard-wiring a random map in a totally parallel system makes fabrication infeasible at high resolutions. In many applications systematic differences in input patterns of the type liable to pose problems with a non-random mapping are unlikely to occur, since real data tends to be 'fuzzy' at the pixel level. However, the issue of randomly hardwiring individual RAMs is somewhat academic since in most contexts a totally parallel system is not needed, as its speed (independent of the number of classes and of the order of the access time of a memory element) would far exceed data input rates. At 512×512 resolution a semi-parallel structure is used where the mapping is 'soft' (i.e.
achieved by pseudo-random addressing with parallel shift registers) and the processing within discriminators is serial, but the discriminators themselves operate in parallel. Using memory elements with an access time of 10^-7 s, this gives a minimum operating time of around 70 ms, which once again is independent of the number of classes.

The system is trained using samples of patterns from each class. A pattern is stored into the retina array and a logical 1 is written into the RAMs of the discriminator associated with the class of this training pattern at the locations addressed by the n-tuples. This is repeated many times, typically 25-50 times, for each class. In recognition mode, the unknown pattern is stored in the array and the RAMs of every discriminator are put into READ mode. The input pattern then stimulates the logic functions in the discriminator network and an overall response is obtained by summing all the logical outputs. The pattern is then assigned to the class of the discriminator producing the highest score.

Where very high resolution image data is presented, as in visual imaging, this design lends itself to easy implementation in massively parallel hardware. However, even with visual images, experience tends to suggest that often a very good recognition performance can be obtained on relatively low resolution data. Hence in many applications massively parallel hardware can be replaced by a fast serial processor and associated RAM, emulating the design in micro-coded software. This was the approach used by Binstead and Stonham in Optical Character Recognition, with notable success. Such a system has the advantage of being able to make optimal use of available memory in applications where the n-tuple size, or the number of discriminators, may be required to vary.

The advantages of the WISARD model for pattern recognition are:

! Implementation as a parallel, or serial, system in currently available hardware is inexpensive and simple.

!
Given labelled samples of each recognition class, training times are extremely short.

! The time required by a trained system to classify an unknown pattern is very small and, in a parallel implementation, is independent of the number of classes.
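The training and recognition procedure just described can be emulated serially in a few lines. The sketch below (Python) is illustrative only: the retina size, tuple size, classes and noise model are invented for the example, not taken from WISARD itself.

```python
import numpy as np

class Discriminator:
    """One class discriminator: p/n one-bit RAMs, each addressed by an n-tuple."""
    def __init__(self, mapping):
        self.mapping = mapping              # shared random map: retina -> tuples
        self.rams = [set() for _ in range(len(mapping))]

    def addresses(self, retina):
        bits = retina.ravel()
        # each n-tuple of pixel values forms the address of its RAM
        return [int("".join(str(b) for b in bits[t]), 2) for t in self.mapping]

    def train(self, retina):
        for ram, a in zip(self.rams, self.addresses(retina)):
            ram.add(a)                      # write a logical 1 at this address

    def score(self, retina):
        return sum(a in ram for ram, a in zip(self.rams, self.addresses(retina)))

rng = np.random.default_rng(0)
p, n = 64, 4                                # 8x8 retina, 4-tuples
mapping = rng.permutation(p).reshape(p // n, n)

# two toy classes: left half bright vs right half bright, with pixel noise
def sample(cls):
    img = np.zeros((8, 8), dtype=int)
    img[:, :4] = 1 - cls
    img[:, 4:] = cls
    flip = rng.random((8, 8)) < 0.05
    return np.where(flip, 1 - img, img)

discs = [Discriminator(mapping) for _ in range(2)]
for _ in range(30):                         # typically 25-50 training passes
    for c in (0, 1):
        discs[c].train(sample(c))

u = sample(0)                               # an unknown pattern of class 0
scores = [d.score(u) for d in discs]        # assign to the highest-scoring class
```

Note that, exactly as the text says, nothing resembling the training images is stored: each RAM merely records which tuple addresses have been seen for its class.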
The requirement for labelled samples of each class poses particular problems in speech recognition when dealing with smaller units than whole words; the extraction of samples by acoustic and visual inspection is a labour intensive and time consuming activity. It is here that paradigms such as Kohonen's topologising network, as applied to speech by Tattershall, show particular promise. Of course, in such approaches there are other compensating problems; principally, after the network has been trained and produced a dimensionally reduced and feature-clustered map of the pattern space, it is necessary to interpret this map in terms of output symbols useful to higher levels. One approach to this problem is to train an associative memory on the net output together with the associated symbol.

Figure 4-2 Continuous response of discriminators to the input word 'toothache' [From Neural Computing Architectures, Ed. I. Aleksander].

Applications of n-tuple sampling in hardware have been rather sparse, the commercial version of WISARD as a visual pattern recognition device able to operate at TV frame rates being one of the few to date - another is the Optical Character Recogniser developed by Binstead and Stonham. However, one can envisage a multitude of applications for such pattern recognition systems as their operation and advantages become more widely understood.

The total amount of memory needed for the discriminator RAMs is easily calculated. Let n be the number of inputs, C the number of classes, and assume a 1-1 mapping of a retina with p binary pixels. For convenience we assume that n divides p. Then the number of 1-bit RAMs will be p/n and each RAM has 2^n bits of storage. So the number of bits per discriminator is (p/n)·2^n. Hence the total number of bits for C classes is

    (p/n) 2^n C                                                             (21)

Practical n-tuple pattern recognition systems have developed from the original implementation of the hardware WISARD, which used regularly sized blocks of RAM storing only the discriminator states. As memory has become cheaper and processors faster, for many applications such heavily constrained systems are no longer appropriate. Algorithms can be implemented as serial emulations of parallel hardware and RAM can also be used to describe a more flexible structure. Typically the real-time system is preceded by a software simulation in which various parameters of the theoretical model are optimized for the particular application. A design technique which is sufficiently general to cope with a large class of such net-systems whilst at the same time preserving a high degree of computational efficiency is described in [Binstead 1987]. In addition the structure produced has the property that it is easily mapped into hardware to a level determined by the application requirements.

The rationale for believing that n-tuple techniques might be successfully applied to speech recognizers is briefly outlined in [Tattershall 1984], where it is demonstrated that n-tuple recognisers can be designed so that in training they derive an implicit map of the class conditional probabilities. Since the n-tuple scheme requires almost no computation it appears to be an attractive way of implementing a Bayesian classifier. In a real time speech
recognition system the pre-processed input data can be slid across the retina and the system tuned to respond to significant peaking of a class discriminator response, see Figure 4-2.

Work on using WISARD nets for speech recognition was started at the University of Brunel Pattern Recognition Laboratory in 1983; some results of this work are reported in [Aleksander 1988a], Chapter 10. One novel feature of this chapter is an account of the work of Jones and Valenzuela using Holland's Genetic Algorithm to breed WISARD nets for the purpose of vowel detection.

WISARD - analysis of response.

Assume that there are a number of n-tuples whose address lines are randomly and uniformly connected to the retina, and that the total area of the retina is 1. Suppose an n-tuple has been trained on patterns T_1, T_2, ..., T_l. Then we shall evaluate the expected response to an unknown test pattern U. Let

    I_1 = U ∩ T_1,  I_2 = U ∩ T_2,  ...,  I_l = U ∩ T_l
    I_ij = U ∩ T_i ∩ T_j   (for i ≠ j),
    I_ijk = U ∩ T_i ∩ T_j ∩ T_k   (i, j, k pairwise distinct),              (22)
    ...
    I_12...l = U ∩ T_1 ∩ T_2 ∩ ... ∩ T_l

Here I_i = U ∩ T_i is the set of all points on the retina for which the pixels in U and T_i take the same value, either both '1' or both '0'. Let |U ∩ T_i| denote the area of this set intersection. Since the area of the retina is supposed to be 1, it follows that |U ∩ T_i| is the probability that a single address line will give the same result when sampling U as when sampling T_i. Since the n address lines are assumed uniformly distributed across the retina, the probability that all n address lines give the same response when sampling U as when sampling T_i is just |U ∩ T_i|^n. This is the probability that an n-tuple trained only on the pattern T_i will give the same response (i.e. fire) when presented with the unknown pattern U. For convenience, let

    p_i = |I_i|^n,  p_ij = |I_ij|^n,  ...,  p_12...l = |I_12...l|^n         (23)

Suppose the n-tuple has been trained on two patterns T_i and T_j.
Then the probability that an n-tuple trained only on the patterns T_i and T_j will give the same response (i.e. fire) when presented with the unknown pattern U is

    p_i + p_j − p_ij.                                                         (24)

Now suppose the n-tuple has been trained on l patterns T_1, ..., T_l. By a well known combinatorial principle (inclusion-exclusion), it follows that the probability that the n-tuple will give the same response (i.e. fire) when presented with the unknown pattern U is

    p = Σ_i p_i − Σ_{i<j} p_ij + Σ_{i<j<k} p_ijk − ... + (−1)^{l−1} p_12...l.  (25)

In the vernacular of n-tuple sampling theory this is called the nth power-law. Note that if U = T_i, i.e. U is equal to one of the training patterns, then |I_i| = 1 and the remaining terms in the equation above sum to zero, hence p = 1 as we might expect.
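The nth power law for a single training pattern is easy to check by simulation. The sketch below (plain Python; the retina size, tuple count, agreement rate and seed are illustrative assumptions, not values from the text) builds a RAM-style n-tuple discriminator, trains it on one pattern T, and compares its response to U against |U ∩ T|^n:

```python
import random

def make_tuple_map(retina_size, n, n_tuples, rng):
    # each n-tuple samples n retina positions uniformly at random
    return [[rng.randrange(retina_size) for _ in range(n)] for _ in range(n_tuples)]

def address(pattern, taps):
    # the n sampled pixels form the RAM address for this n-tuple
    return tuple(pattern[t] for t in taps)

def train(tuple_map, pattern):
    # each n-tuple stores a 1 at the address produced by the training pattern
    return [{address(pattern, taps)} for taps in tuple_map]

def response(tuple_map, memory, pattern):
    # fraction of n-tuples whose addressed location was set during training
    fired = sum(address(pattern, taps) in mem
                for taps, mem in zip(tuple_map, memory))
    return fired / len(tuple_map)

rng = random.Random(0)
retina, n = 1000, 4
T = [rng.randrange(2) for _ in range(retina)]
U = [b if rng.random() < 0.8 else 1 - b for b in T]  # U agrees with T on ~80% of pixels

tm = make_tuple_map(retina, n, 5000, rng)
mem = train(tm, T)
overlap = sum(u == t for u, t in zip(U, T)) / retina
print(response(tm, mem, U), overlap ** n)  # empirical response ≈ |U ∩ T|^n
```

Presenting the training pattern itself always gives a response of 1, and the response to U follows the overlap raised to the nth power, matching (23).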
For example, suppose a discriminator is required to detect the horizontal position of a vertical bar and report on the distance of the bar from the centre. The task is shown in the figure below. T1, the only training pattern, is seen to be a bar of width 1/3 units, where we take the width and height of the retina to each be one unit. The test pattern U could be a vertical bar of the same width as T1 placed anywhere wholly within the window. The distance of the bar from the left hand edge is D, and the maximum value of D is 2/3. As D increases from 0 to 1/3 the overlap between U and T1 increases linearly, but the response R = |I_1|^n of the system for different n is governed by the nth power law, see Figure 4-3 (A discriminator for centering on a bar [From Neural Computing, I. Aleksander and H. Morton]).

Comparison of storage requirements.

The functionality of the McCulloch and Pitts neuron model was a matter of interest in the mid 1960's and has been discussed by [Muroga 1965]. With n inputs the 0/1 neuron can only perform linearly separable functions, i.e. those that may be achieved by an (n−1)-dimensional hyperplane of the n-hypercube. We have seen that there are about 2^(n²) of these. Typical of the functions that such a device cannot perform are parity checking etc.

Suppose now we regard the n binary inputs as addressing memory in a RAM, as in the WISARD model discussed earlier. There are 2^(2^n) logic functions

    f : {0, 1}^n → {0, 1}                                                     (26)

since there are 2^n possible inputs for each function and the function value at each of these inputs can be 0 or 1.

The functionality of the two models can be compared as follows. With w-bit weights and zero threshold, a McCulloch and Pitts node of this type requires wn bits of memory and can perform at most 2^(wn) of the possible logic functions: a proportion of the total number possible which tends exponentially to zero as n becomes large.
In fact, there is no point in taking w very large for the discrete neuron, since the number of hyperplane dichotomies of the 2^n vertices of the n-hypercube is fixed in terms of n (around 2^(n²)); one needs just enough bits to make sure that all these hyperplane dichotomies are possible, and no more. Although the proportion of possible logic functions which can be implemented by the McCulloch and Pitts neuron is asymptotically zero as n tends to infinity, it is nevertheless this restricted functionality which gives the node the capability of generalisation.

By comparison, a RAM with n address lines and 2^n bits of store can implement any of the possible logic functions on n bits. However, the storage requirement rises exponentially with the number of inputs, and hence so does the storage of assemblies of RAMs with the same level of interconnectivity. This is not the case with the McCulloch and Pitts model, where storage increases linearly with n. In any event, detailed discussion of the relative merits of different neural components frequently overlooks the fact that what is perhaps more relevant is the relative functionality
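The crossover between the two counts can be made concrete in a few lines of Python (the weight precision w = 8 bits is an arbitrary illustrative choice, not a figure from the notes):

```python
# Compare the 2^(2^n) Boolean functions a RAM node can realise with the
# at most 2^(wn) functions available to a threshold node with w-bit weights.
w = 8  # assumed weight precision (bits), for illustration only
for n in (3, 4, 5, 6):
    ram_functions = 2 ** (2 ** n)  # a RAM node realises every truth table on n inputs
    neuron_bound = 2 ** (w * n)    # a threshold node: at most one function per weight vector
    # compare exponents: wn grows linearly, 2^n exponentially
    print(n, w * n, 2 ** n, neuron_bound < ram_functions)
```

For small n the weight-vector bound 2^(wn) exceeds 2^(2^n), but once 2^n overtakes wn the RAM count dominates, and the threshold node's share of the possible functions collapses exponentially, as stated in the text.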
of large assemblies of such components. In applications where the interconnectivity is low the idea of using a RAM offers obvious attractions. Two problems present themselves. Firstly, what training algorithm should be used for assemblies of RAMs? Secondly, how will such a system be capable of generalisation?

Subsequent work by Igor Aleksander's group at Imperial has resulted in a model known as the Probabilistic Logic Node (PLN); such nodes are cascaded into a pyramidal structure and combined with a simple multilayer learning algorithm. The PLN is a connectionist model introduced in [Aleksander 1988b] which is implementable as a RAM. Binary inputs address a memory location; the node outputs the bit value stored there. The training algorithm sets location contents according to a global error/correct signal; PLNs require no individual error information. A PLN can learn any of the 2^(2^n) Boolean functions of its n inputs. By this definition, it is straightforward to implement the PLN as a RAM with 2^n addressable memory locations. In practice the output generally passes through a stochastic device before leaving the PLN. This randomizer allows the PLN to exhibit non-deterministic properties; organic neurons also behave in a stochastic manner [Sejnowski 1981].

Chapter references

[Aleksander 1979] I. Aleksander and T. J. Stonham. A Guide to Pattern Recognition Using Random-Access Memories. IEE Journal Computers and Digital Techniques, Vol. 2 (1), 29-40, 1979.

[Aleksander 1988a] A. Badii, M. J. Binstead, Antonia J. Jones, T. J. Stonham and Christine L. Valenzuela. Applications of N-tuple Sampling and Genetic Algorithms to Speech Recognition. Neural Computing Architectures, Chapter 10. Ed. I. Aleksander, Kogan Page, October 1988.

[Aleksander 1988b] I. Aleksander. Logical connectionist systems. In R. Eckmiller and Ch. v. d. Malsburg (Eds.), Neural Computers (pp. 189-197). Springer-Verlag, Berlin, 1988.

[Bernstein 1981] J.
Bernstein. Profiles: AI, Marvin Minsky. The New Yorker, December 14, 1981, pp 50-126.

[Binstead 1987] M. J. Binstead and Antonia J. Jones. A Design Technique for Dynamically Evolving N-tuple Nets. IEE Proceedings, Vol. 134 Part E, No. 6, pp 265-269, November 1987.

[Bledsoe 1959] W. W. Bledsoe and I. Browning. Pattern Recognition and Reading by Machine. Proc. Eastern Joint Computer Conf., Boston, Mass., 1959.
V Feedforward networks and backpropagation.

Introduction.

In two papers describing the same model [Rumelhart 1986a], [Rumelhart 1986b] Rumelhart, Hinton and Williams introduced a generalization of the Widrow-Hoff error correction rule called back propagation. The algorithm was first described by Paul J. Werbos in his Harvard Ph.D. thesis [Werbos 1974] and was independently rediscovered by [Parker 1985] and [Le Cun 1986]. The model assumes a discrete time system with synchronous update and with each connection involving a unit delay.

Simple two-layer associative networks have no hidden units; they involve only input and output units. In these cases there is no internal representation. As Minsky and Papert pointed out, we need hidden units to provide the possibility of recoding the input pattern into an internal representation, because many problems simply cannot otherwise be solved. An example is the XOR problem mentioned earlier. Here the addition of a unit which detects the logical conjunction of the inputs changes the similarity structure of the patterns sufficiently to allow the solution to be learned, see Figure 5-1 (Solving the XOR problem with a hidden unit).

The problem addressed by simulated annealing and backpropagation is the provision of a locally computed learning rule which guarantees that an internal representation adequate to solve the problem will be found. Backpropagation is based on local gradient descent, and output functions which step between 0 and 1 as the activation of the neuron increases through a threshold value are not differentiable and therefore provide no useful gradient in weight space along which we can descend. We therefore consider semilinear activation functions. A semilinear activation function f_i(net_i) is one in which the output of the unit is a non-decreasing and differentiable function of the net input to the unit.
In most cases f_i is independent of i and so we write f = f_i.

Backpropagation - mathematical background.

The following functions, and hence all their partial derivatives, are assumed known.

Error function.

    E(z_1, z_2, ..., z_n, t_1, t_2, ..., t_n)                                 (1)

Here z_1, z_2, ..., z_n are the outputs from the output layer units and t_1, t_2, ..., t_n are the target outputs.
Figure 5-2 Feedforward network architecture.

Activation function.

Output layer:

    net_j = net_j(y_1, y_2, ..., y_m, p_j1, ..., p_jt)                        (2)

Here y_1, y_2, ..., y_m are the outputs from the previous layer and p_j1, ..., p_jt are parameters associated with the jth node of the output layer. Frequently t = t(m), i.e. the number of parameters associated with a node is a function of the number of inputs.

Previous layer:

    net_i = net_i(x_1, x_2, ..., x_l, p_i1, ..., p_is)                        (3)

Here x_1, x_2, ..., x_l are the outputs from the layer prior to the previous layer.

Output function.

Output layer:

    z_j = f(net_j)   (1 ≤ j ≤ n)                                              (4)

Previous layer:

    y_i = f(net_i)   (1 ≤ i ≤ m)                                              (5)

The output layer calculation.

For the output layer we have

    Δp_jz = −η ∂E/∂p_jz                                                       (6)

where 1 ≤ z ≤ t, 1 ≤ j ≤ n, and η is the learning rate. Hence

    Δp_jz = −η (∂E/∂net_j)(∂net_j/∂p_jz) = η δ_j ∂net_j/∂p_jz                 (7)
where

    δ_j = −∂E/∂net_j = −(∂E/∂z_j)(∂z_j/∂net_j) = −f′(net_j) ∂E/∂z_j           (8)

Equations (7) and (8) express the Δp_jz in terms of known quantities.

The rule for adjusting weights in hidden layers.

In a similar way we can compute a rule which adjusts the weights in the previous layers (Figure 5-3, the previous layer calculation). We shall not go through the details of the derivation (which are quite straightforward), but the rule which emerges is, for node i in this previous layer,

    Δp_iz = η δ_i ∂net_i/∂p_iz                                                (9)

where

    δ_i = −f′(net_i) ∂E/∂y_i = f′(net_i) Σ_{j=1}^{n} δ_j ∂net_j/∂y_i          (10)

In (9) the partial derivative ∂net_i/∂p_iz is known from (3). In (10) f′(net_i) is known from (5), the δ_j were computed in the previous step, and the last term is known from (2).

The conventional model.

The usual sum squared error is given by

    E(z_1, ..., z_n, t_1, ..., t_n) = (1/2) Σ_{j=1}^{n} (z_j − t_j)²          (11)

Hence, for 1 ≤ j ≤ n,

    ∂E/∂z_j = z_j − t_j                                                       (12)
The linear activation function becomes

    net_j = Σ_{i=1}^{m} w_ji y_i,      ∂net_j/∂w_ji = y_i
                                                                              (13)
    net_i = Σ_{h=1}^{l} w_ih x_h,      ∂net_i/∂w_ih = x_h

where t = m and s = l. Thus (7) becomes

    Δw_jz = η δ_j y_z                                                         (14)

and, using (12), (8) becomes

    δ_j = −f′(net_j) ∂E/∂z_j = −f′(net_j)(z_j − t_j)                          (15)

for an output layer unit. Similarly (9) becomes

    Δw_iz = η δ_i ∂net_i/∂w_iz = η δ_i x_z                                    (16)

where, from (10),

    δ_i = f′(net_i) Σ_{j=1}^{n} δ_j ∂net_j/∂y_i = f′(net_i) Σ_{j=1}^{n} δ_j w_ji   (17)

for a hidden layer unit.

For this example of a linear activation function any sigmoidal function f is suitable. Frequently the function

    z_j = f(net_j) = 1 / (1 + e^{−(net_j + θ_j)})                             (18)

is used. Here θ_j is the threshold, or bias, of the unit. Conventionally the threshold is treated as just another weight by creating a dummy unit which is always on (somewhat like ground in an electrical circuit). However, from the present viewpoint the threshold can be considered as just another parameter associated with the unit, in which case there is no need for the dummy unit.

Thus the implementation of back propagation involves a forward pass through the layers to estimate the error, and then a backward pass modifying the synapses to decrease the error. Practical implementations are not difficult, but without modification the algorithm is still rather slow, especially for systems with many layers. Still, it is at present the most popular learning algorithm for multilayer networks. It is considerably faster than the Boltzmann machine. The whole process is illustrated in the Mathematica file backprop.ma. The Mathematica program runs too slowly to be of much practical use and is intended only to illustrate the process. A C-code implementation will run much faster and is available in various implementations.

Problems with backpropagation.
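Equations (13)-(18) translate almost line for line into code. The sketch below (plain Python rather than the course's Mathematica; the layer sizes, learning rate, epoch count and seed are illustrative assumptions) trains a small one-hidden-layer network on XOR, using the sigmoid of (18) with the threshold kept as an extra per-unit parameter:

```python
import math, random

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

rng = random.Random(1)
n_in, n_hid, n_out = 2, 3, 1
eta = 0.5  # learning rate (assumed value)

# w[i][h]: input->hidden weights, v[j][i]: hidden->output weights,
# theta_*: per-unit thresholds as in (18)
w = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hid)]
theta_h = [rng.uniform(-1, 1) for _ in range(n_hid)]
v = [[rng.uniform(-1, 1) for _ in range(n_hid)] for _ in range(n_out)]
theta_o = [rng.uniform(-1, 1) for _ in range(n_out)]

data = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]

def forward(x):
    y = [sigmoid(sum(w[i][h] * x[h] for h in range(n_in)) + theta_h[i])
         for i in range(n_hid)]
    z = [sigmoid(sum(v[j][i] * y[i] for i in range(n_hid)) + theta_o[j])
         for j in range(n_out)]
    return y, z

for epoch in range(20000):
    for x, t in data:
        y, z = forward(x)
        # output deltas, eq (15): delta_j = -f'(net_j)(z_j - t_j), f' = z(1-z)
        dz = [z[j] * (1 - z[j]) * (t[j] - z[j]) for j in range(n_out)]
        # hidden deltas, eq (17): delta_i = f'(net_i) sum_j delta_j v_ji
        dy = [y[i] * (1 - y[i]) * sum(dz[j] * v[j][i] for j in range(n_out))
              for i in range(n_hid)]
        # weight updates, eqs (14) and (16); thresholds updated as extra parameters
        for j in range(n_out):
            for i in range(n_hid):
                v[j][i] += eta * dz[j] * y[i]
            theta_o[j] += eta * dz[j]
        for i in range(n_hid):
            for h in range(n_in):
                w[i][h] += eta * dy[i] * x[h]
            theta_h[i] += eta * dy[i]

err = sum((forward(x)[1][0] - t[0]) ** 2 for x, t in data)
print(err)  # sum squared error after training
```

The forward pass computes (13) and (18); the backward pass applies (14)-(17). As the text notes, convergence depends on the initial weights and learning rate, and unmodified backpropagation can be slow.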
Backpropagation is a very powerful tool for constructing non-linear models. However, a number of problems arise when one tries to use backpropagation in practice. These are:

!   In many cases, given a particular training set, the minimum mean-squared error which one can expect on each output is unknown. Overtraining will result in memorisation, which will reduce the generalisation capability. Dealing with this problem can be very time consuming and involve trial and error.

!   The optimal architecture for the network is unknown: too many units and the system will generalise poorly whilst taking a very long time to train; too few units and the system will not reach the optimal (unknown) mean-squared error. Dealing with this problem can be very time consuming and involve trial and error.

!   Optimal values for the learning rate η > 0 (and the momentum term α > 0, if used) are unknown. Dealing with this problem can be very time consuming and involve trial and error.

One way of adjusting η > 0 dynamically as learning progresses is called the Bold Driver method. This adjusts the learning rate according to the previous value of the error function: if the value has gone up, the learning rate is reduced proportionately; similarly, the learning rate is increased if the value has gone down. Until recently a successful application of backpropagation to a large problem was to some extent a matter of patience and luck.

The Gamma test - a new technique.

The Gamma test was developed by Aðalbjörn Stefánsson, N. Končar [Končar 1997], [Aðalbjörn Stefánsson 1997] and myself, and has emerged as an extremely useful tool to overcome the kinds of problems mentioned above. It is a very simple technique which in many cases can be used to considerably simplify the design process of constructing a smooth data model such as a neural network.
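The Bold Driver rule can be sketched in a few lines (the growth and decay factors below are common illustrative choices, not values given in the notes):

```python
def bold_driver(eta, prev_error, new_error, grow=1.05, shrink=0.5):
    """Adjust the learning rate from the change in the error function:
    shrink it when the error rose, grow it when the error fell."""
    return eta * shrink if new_error > prev_error else eta * grow

eta = 0.1
errors = [1.00, 0.80, 0.85, 0.60]  # hypothetical error values, one per epoch
for prev, new in zip(errors, errors[1:]):
    eta = bold_driver(eta, prev, new)
    print(round(eta, 4))
```

A small growth factor with a large shrink factor is the usual asymmetry: cautious acceleration while the error is falling, sharp braking after an overshoot.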
The Gamma test is a data analysis routine that (in an optimal implementation) runs in time O(M log M) as M → ∞, where M is the number of sample data points, and which aims to estimate the best mean squared error (MSError) that can be achieved by any continuous or smooth (bounded first partial derivatives) data model constructed using the data. A proof of the result under fairly general hypotheses was finally given in [Evans 2002a] and [Evans 2002b].

Let a data sample be represented by

    ((x_1, ..., x_m), y) = (x, y)                                             (19)

in which we think of the vector x = (x_1, ..., x_m) as the input, confined to a closed bounded set C ⊂ ℝ^m, and the scalar y as the output. In the interests of simplicity the following explanation is presented for a single scalar output y, but the same algorithm can be applied to the situation where y is a vector with very little extra complication or time penalty.

The Gamma test is designed to give a data-derived estimate for Var(r). We focus on the case where samples are generated by a suitably smooth function (bounded first and second order partial derivatives) f : C → ℝ and

    y = f(x_1, ..., x_m) + r                                                  (20)

where r represents an indeterminable part, which may be due to real noise or might be due to lack of functional determination in the posited input/output relationship, i.e. an element of `one -> many-ness' present in the data. We make the following assumption.

!   Assumption A. We assume that training and testing data are different sample sets in which: (a) the training set inputs are non-sparse in input-space; (b) each output is determined from the
inputs by a deterministic process which is the same for both training and test sets; (c) each output is subjected to statistical noise with finite variance whose distribution may be different for different outputs but which is the same in both training and test sets for corresponding outputs.

Suppose (x, y) is a data sample. Let (x′, y′) be a data sample such that |x′ − x| > 0 is minimal. Here |·| denotes Euclidean distance and the minimum is taken over the set of all sample points different from (x, y). Thus x′ is the nearest neighbour to x (in any ambiguous case we just pick one of the several equidistant points arbitrarily).

The Gamma test (or near neighbour technique) is based on the statistic

    γ = (1/2M) Σ_{i=1}^{M} (y′(i) − y(i))²                                    (21)

where y′(i) is the y value corresponding to the first near neighbour of x(i). It can be shown that γ → Var(r) in probability as the nearest neighbour distances approach zero. In a finite data set we cannot have nearest neighbour distances arbitrarily small, so the Gamma test is designed to estimate this limit by means of a linear correlation.

Given data samples (x(i), y(i)), where x(i) = (x_1(i), ..., x_m(i)), 1 ≤ i ≤ M, let N[i, p] be the list of (equidistant) pth nearest neighbours to x(i). We write

    δ(p) = (1/M) Σ_{i=1}^{M} (1/L(N[i, p])) Σ_{j ∈ N[i, p]} |x(j) − x(i)|²    (22)

where L(N[i, p]) is the length of the list N[i, p]. Thus δ(p) is the mean square distance to the pth nearest neighbour. Nearest neighbour lists for pth nearest neighbours (1 ≤ p ≤ pmax), where typically we take pmax in the range 20-50, can be found in O(M log M) time using techniques developed by Bentley, see for example [Friedman 1977]. We also write

    γ(p) = (1/2M) Σ_{i=1}^{M} (1/L(N[i, p])) Σ_{j ∈ N[i, p]} (y(j) − y(i))²   (23)

where the y observations are subject to statistical noise assumed independent of x and having bounded variance.⁶ Under reasonable conditions one can show that

    γ(p) ≈ Var(r) + A δ(p) + o(δ(p))   as M → ∞                               (24)

where the convergence is in probability. The Gamma test computes the mean-squared pth nearest neighbour distances δ(p) (1 ≤ p ≤ pmax, typically pmax ≈ 10) and the corresponding γ(p). Next the (δ(p), γ(p)) regression line is computed and the vertical intercept Γ̄ is returned as the gamma value. Effectively this is the limit of γ as δ → 0, which in theory is Var(r).

⁶ The original version in [Aðalbjörn Stefánsson 1997] and [Končar 1997] used smoothed versions of δ(p) and γ(p) which rolled off the significance of more distant near neighbours. Later experience showed that this complication was largely unnecessary and the version of the software used here is implemented as described above.
Procedure Gamma (or Near Neighbour Test) (data)
(* data is an array of points (x(i), y(i)), 1 ≤ i ≤ M, in which x is a real vector of dimension m and y is a real scalar *)

    For i = 1 to M
        (* compute the x-nearest neighbour lists for each data point; this can be
           done in O(M log M) time using a kd-tree, for example *)
        For p = 1 to pmax
            AppendTo[N(i, p)] all the elements t where x(t) is a pth nearest neighbour to x(i)
        endfor p
    endfor i
    For p = 1 to pmax
        compute δ(p) as in (22)
        compute γ(p) as in (23)
    endfor p
    Perform a least squares fit on the coordinates (δ(p), γ(p)) (1 ≤ p ≤ pmax), obtaining (say) y = Ax + Γ̄
    Return (Γ̄, A)

Algorithm 5-1 The Gamma test.

A technique which allows one to estimate Var(r), on the hypothesis of an underlying continuous or smooth model f, is of considerable practical utility in applications such as control or time series modelling. The implication of being able to estimate Var(r) in neural network modelling is that one does not need to train the network (or indeed any smooth data model) in order to predict the best possible performance with reasonable accuracy. We have used the Gamma test:

!   To find the minimal number of data samples required to produce a near optimal model. We do this by computing Γ̄ for increasing M. The graph asymptotes to the true value of Var(r). When M is sufficiently large to ensure that Γ̄ has stabilised close to the asymptote there is little advantage to be gained by increasing M.

!   To automatically and rapidly construct a near minimal neural network architecture and weights which best model the data [Končar 1997] (the near neighbour information can be used to speed up training by picking the data points with worst errors and performing backpropagation on a suitable subset of the weights). The Gamma test provides the criterion for ceasing training.
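A minimal implementation of Algorithm 5-1 is sketched below. It uses a brute-force O(M²) neighbour search rather than a kd-tree, ignores the equidistant-neighbour subtlety (harmless for continuous data), and tests itself on a hypothetical noisy function whose true noise variance 0.01 is an assumption of the example, not a figure from the notes:

```python
import math, random

def gamma_test(X, Y, pmax=10):
    """Return (gamma_bar, A): intercept and slope of the least squares
    regression of gamma(p) on delta(p), eqs (22)-(24)."""
    M = len(X)
    # brute-force sorted squared distances to every other point
    dists = [sorted((sum((xa - xb) ** 2 for xa, xb in zip(X[i], X[j])), j)
                    for j in range(M) if j != i)
             for i in range(M)]
    deltas, gammas = [], []
    for p in range(1, pmax + 1):
        d2 = g2 = 0.0
        for i in range(M):
            dist2, j = dists[i][p - 1]   # pth nearest neighbour of x(i)
            d2 += dist2                  # |x(N[i,p]) - x(i)|^2
            g2 += (Y[j] - Y[i]) ** 2
        deltas.append(d2 / M)            # eq (22)
        gammas.append(g2 / (2 * M))      # eq (23)
    # least squares fit gamma = A * delta + gamma_bar
    n = len(deltas)
    mx, my = sum(deltas) / n, sum(gammas) / n
    A = (sum((d - mx) * (g - my) for d, g in zip(deltas, gammas)) /
         sum((d - mx) ** 2 for d in deltas))
    return my - A * mx, A

rng = random.Random(0)
X = [(rng.uniform(0, 1), rng.uniform(0, 1)) for _ in range(1000)]
Y = [math.sin(3 * x1) + x2 + rng.gauss(0, 0.1) for x1, x2 in X]  # Var(r) = 0.01
gbar, A = gamma_test(X, Y)
print(gbar, A)  # gbar should be close to the noise variance 0.01
```

No model is trained, yet the intercept already estimates the best MSError any smooth model of these inputs could achieve, which is precisely the practical point made in the text.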
For estimating the number of `hills' required in an initial approximation for the network architecture we use the heuristically based formula

    h = am²,  where a ≈ 2A/π.                                                 (25)

!   To determine the best embedding dimension and delay time for time series [Masayuki 1997].

!   To determine the best set of inputs from a list of possible inputs for a neuro-controller [Končar 1997].

The last two applications illustrate what is perhaps the main utility of the Gamma test: on data sets which are not excessively large the test is sufficiently fast to be run on a complete examination of all possible subsets of up to 20 inputs (for a larger number of inputs we use a Genetic Algorithm for which the fitness of a selection of inputs is
based on how small the Γ̄ value is for the selection).

Metabackpropagation.

This is an algorithm developed by N. Končar and myself which overcomes many of the disadvantages of simple backpropagation. It is pivotally dependent on the Gamma test.

Procedure Metabackpropagation

1. Use the Gamma test to determine the optimal number of data vectors (input-output pairs) (M), the selection and number of inputs (m), the mean squared error (MSError) to which the network should be trained (Γ̄), and the first approximation to the neural architecture (using A).

2. Create a feedforward network that has the number of hills (h) specified by A. A single hill is made of two fully connected hidden layers with 2m nodes in the first hidden layer and one node in the second hidden layer.

3. Initialise each hill by doing a small number of backpropagation training cycles on subsets of the data. Subsets are chosen from the near neighbour lists generated during the computation of the Gamma test results. To find these subsets, the neural network weights are first randomised and the points which give the largest errors are identified by feeding every input vector through the network. The near neighbour lists of the points which give the largest errors are the subsets that are chosen. Each hill is trained on its own exclusive subset. This will initialise the weights of each hill so that it is positioned in the right place.

4. Perform backpropagation training on the entire neural network architecture until either a specified number of cycles is exceeded or the target MSError is reached. Adjust the learning rate by the Bold Driver method.

5. Increase the number of hills if the Gamma test MSError is not reached, and then go back to 2.

Algorithm 5-2 Metabackpropagation.
We have found that the application of the Gamma test and Metabackpropagation has transformed the production of a nonlinear model using feedforward networks from a black art to an almost fully automated and fast process.

Neural networks for adaptive control.

To illustrate our discussion of adaptive control we shall use a simple example. There are two tanks of water at temperatures Tcold < Thot respectively. The tanks are drained at rates c1(t) and c2(t) ltrs/sec into a third tank of maximum volume Vmax (not used). The third tank drains at an (initially) constant rate r. The parameters c1 and c2 are considered as control variables, and it is desired to determine a control strategy for c1 and c2 which will maintain a volume Vgoal at temperature Tgoal, where Tcold < Tgoal < Thot, in the third tank.
Figure 5-4 The Water Tank Problem.

Let (V(t), T(t)) denote the volume and temperature of the target tank. The differential equations describing the system are

    V dT/dt + T dV/dt = c1(t) Tcold + c2(t) Thot − rT
                                                                              (26)
    dV/dt = c1(t) + c2(t) − r

where c1 and c2 are the cold and hot valve settings and r is the rate of drain from the target tank.

Physical Assumptions.

1. We assume that as water drains into the third tank mixing is instantaneous.

2. We assume that there is no heat loss from the third tank, apart from that lost due to the outflow.

3. We have assumed that the two feeder tanks contain water with a specific heat of unity. It is a simple matter to modify the equations to deal with the case of two inert liquids of specific heats s1 and s2 respectively. If an endo- or exo-thermic reaction results from the mixing then this could also be taken account of in the model.

We now proceed to construct an adaptive neurocontroller for the Water Tank Problem. The basic architecture we shall use was developed by N. Končar and myself and applied to the Attitude Control Problem [Končar 1995]. It is described in Figure 5-5. The components of this system are:

The Planner: Knowing the long term goal state (which could dynamically change) and the current state, together with maximal variations in one time step, the Planner sets the next desired state. A simple Linear Planner proceeds by taking a small segment of the straight line in state space between the current state and the goal. A wide range of other variations are also possible.

The Δ Mapping: This transforms the next desired state and the current state into an efficient representation of the difference between the two, providing inputs for the neural network.

The Neural Network: Using the outputs from the Δ mapping, the neural network generates a suitable set of control signals for application at the current time step.
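Equations (26) are straightforward to integrate numerically. The forward-Euler sketch below (all tank parameters, initial conditions and valve settings are illustrative assumptions, not values from the notes) shows that with inflow matching the drain the volume holds steady while the temperature relaxes towards the flow-weighted mean of Tcold and Thot:

```python
T_cold, T_hot, r = 20.0, 80.0, 1.0  # assumed temperatures (deg C) and drain rate (ltrs/sec)

def step(V, T, c1, c2, dt=0.01):
    """One forward-Euler step of equations (26)."""
    dV = c1 + c2 - r
    # V dT/dt + T dV/dt = c1*T_cold + c2*T_hot - r*T, so substituting dV/dt:
    dT = (c1 * (T_cold - T) + c2 * (T_hot - T)) / V
    return V + dV * dt, T + dT * dt

V, T = 100.0, 30.0
c1, c2 = 0.5, 0.5  # equal hot and cold inflow, together matching the drain r
for _ in range(100000):  # 1000 seconds of simulated time
    V, T = step(V, T, c1, c2)
print(V, T)  # V stays at 100; T approaches (c1*T_cold + c2*T_hot)/(c1 + c2) = 50
```

Note that the drain rate r cancels out of the temperature equation: only the valve settings move T, which is what makes (c1, c2) a sensible pair of control variables for Tgoal.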
Figure 5-5 Architecture for direct inverse neurocontrol. The Planner maps the long term goal state and the current state x(k) to the desired next state xG(k+1); the Δ input map converts (desired next state, current state) into the controller inputs; the running NN controller outputs the control u(k) applied to the dynamic system, which produces the actual next state x(k+1). Training data come from observed facts: if the desired next state were x(k+1) and the current state is x(k), then the correct control input would be u(k), so the pair ((x(k+1), x(k)), (u(k))) is added to a circular buffer of the most recent (input, output) vector pairs. Training consists of performing backpropagation on the entire contents of this circular buffer.

The model is adaptive because whenever the network fails to place the next state within a prescribed error tolerance of the next desired state, the ((state, next state), (controls)) data are added to the training buffer and the network undergoes a further round of backpropagation training (it is assumed that this is done in hardware - the networks under discussion usually have only a small number of nodes). The main problem in specific applications centres around two issues:

!   What Δ input mapping is optimal for the specific application? This turns out to be quite critical. With the wrong mapping the controller may not work efficiently (or even at all). Ideally we seek to reduce equivalence classes of control inputs to a single representative, thus reducing the learning demands on the neural network.

!   What is the optimal architecture for the neural network?

In fact both questions can be addressed by a combination of Metabackpropagation and the Gamma test. The difference between the desired state (Vdes, Tdes) and the actual state (V, T) can be measured in various ways.
For example, with a time step between changes in control signals of delta = 1 in the simulator, if we just take all four of the scalars as inputs then we obtain a Γ̄ of around 0.08 on each control output. Since √0.08 ≈ 0.28, this implies an absolute error on the control outputs of the order of 28%, which is unlikely to be effective.

If we return to the original equations for the Water Tank we can determine that if at two states (V1, T1), (V2, T2) we apply the same control inputs (c1, c2) then (dV/dt, dT/dt) will be the same at both states provided

    ((c1 Tcold + c2 Thot) / (c1 + c2)) (V2 − V1) = V2 T1 − V1 T2              (27)
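Condition (27) can be verified numerically against the dynamics of (26); the particular temperatures, controls and states below are arbitrary test values chosen for illustration:

```python
T_cold, T_hot, r = 20.0, 80.0, 1.0  # assumed plant constants
c1, c2 = 0.6, 0.4                   # one fixed control setting

def dT_dt(V, T):
    # from (26): V dT/dt = c1*T_cold + c2*T_hot - r*T - T*(c1 + c2 - r)
    return (c1 * (T_cold - T) + c2 * (T_hot - T)) / V

# pick (V1, T1) and V2, then choose T2 so that condition (27) holds
V1, T1, V2 = 100.0, 40.0, 150.0
S = (c1 * T_cold + c2 * T_hot) / (c1 + c2)  # flow-weighted inflow temperature
T2 = (V2 * T1 - S * (V2 - V1)) / V1         # solve (27) for T2
print(dT_dt(V1, T1), dT_dt(V2, T2))         # equal rates: control equivalent states
```

(dV/dt depends only on the controls and r, so it is automatically equal at the two states; (27) is exactly the extra condition needed to equalise dT/dt.)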
Thus, under these circumstances, the two states (V1, T1), (V2, T2) are control equivalent, i.e. the application of (c1, c2) in either of these states will (instantaneously) produce the same effect. This suggests the input mapping described in Figure 5-6 (Delta transformation of state differences: Δ maps (Vdes, Tdes, V, T) to the inputs VdesT − VTdes and Vdes − V of the 2-4-2 network for the Water Tank Problem).

If we perform the Gamma test using these inputs we obtain (for c1, and similarly for c2) the regression line illustrated in Figure 5-7 (least squares fit to 200 data points with 20 nearest neighbours: Γ̄ = 0.0332). This may well not be the ideal mapping, but we have some grounds for preferring it over the naive choice. (The Attitude Control problem similarly involves some mathematical ingenuity in the construction of the Δ mapping.)

If we take the average for the pth (2 ≤ p ≤ 5) nearest neighbours then we estimate Γ̄ = lim γ as δ → 0 as 0.027, leading to an absolute error on the outputs of the order of √0.027 ≈ 16.4%. This represents the best MSError that a feedforward neural network trained on the data can achieve.

Although the question of building an adaptive Δ mapping is obviously crucial, if we have some mathematical understanding of the system we are seeking to control then the Gamma test is obviously very time saving.

Figure 5-8 - Figure 5-10 show the results of a quick trial of the neurocontroller (illustrated in the Mathematica file nn-tank.ma). These results for the Water Tank Problem are preliminary examples and do not represent a final design. We have not made the training adaptive in the example file. The training was done prior to the simulation on 200 data points uniformly distributed in state-difference space.
Performance for small state differences (effectively performance near the goal state) could easily be improved by increasing the density of training data near the origin of state-difference space. It might also be improved by modifications of the Planner module.
Figure 5-8 Volume variation without adaptive training. 2-4-2 Network, MSE = 0.052, Linear Planner.

Figure 5-9 Temperature variation without adaptive training. 2-4-2 Network, MSE = 0.052, Linear Planner.

Figure 5-10 The control signals generated by the 2-4-2 network without adaptive training. Linear Planner.
Chapter references

[Evans 2002a] D. Evans and Antonia J. Jones. A proof of the Gamma test. Proc. Roy. Soc. Lond. Series A 458(2027):2759-2799, 2002.

[Evans 2002b] D. Evans, Antonia J. Jones and W. M. Schmidt. Asymptotic moments of near neighbour distance distributions. Proc. Roy. Soc. Lond. Series A 458(2028):2839-2849, 2002.

[Friedman 1977] J. H. Friedman, J. L. Bentley and R. A. Finkel. An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software, 3(3):200-226, 1977.

[Jones 2002] Antonia J. Jones, A. P. M. Tsui and Ana G. Oliveira. Neural models of arbitrary chaotic systems: construction and the role of time delayed feedback in control and synchronization. With html and pdf electronic supplement. Complexity International, Volume 09, 2002. ISSN 1320-0682. Paper ID: tsui01, URL: http://www.csu.edu.au/ci/vol09/tsui01/

[Končar 1997] N. Končar. Optimisation methodologies for direct inverse neurocontrol. Forthcoming Ph.D. thesis, Department of Computing, 180 Queen's Gate, London, SW7 2BZ, U.K.

[Končar 1995] N. Končar and Antonia J. Jones. Adaptive real-time neural network attitude control of chaotic satellite motion. Presented at Aerospace/Defense Sensing & Control and Dual-Use Photonics, SPIE (The International Society for Optical Engineering and Photonics in Aerospace Engineering) International Symposium, Orlando, Florida, April 17-21, 1995.

[Rumelhart 1986a] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. Chapter 8, Parallel Distributed Processing, Vol. 1, M.I.T. Press.

[Rumelhart 1986b] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature 323:533-536.

[Aðalbjörn Stefánsson 1997] Aðalbjörn Stefánsson, N. Končar and Antonia J. Jones. A note on the Gamma test. Neural Computing & Applications 5(3):131-133, 1997. ISSN 0-941-0643.
[Preprint] [Tsui 2002] Alban P. Tsui, Antonia J. Jones and Ana Guedes de Oliveira. The construction of smooth models using irregular embeddings determined by a Gamma test analysis. Neural Computing & Applications 10(4), 318-329, April 2002. ISSN 0-941-0643. [Werbos 1974] P. J. Werbos, Beyond regression: new tools for prediction and analysis in the behavioral sciences. Ph.D. thesis, Harvard University, Cambridge, MA. 58
* VI The chaotic frontier.

Introduction.

It is interesting to observe from a modern standpoint that Newtonian physics already contained the seeds of its own destruction. Quite apart from the later quantum mechanical caveat of the Heisenberg Uncertainty Principle, the classicists overlooked the 'computational cost' of making deterministic predictions an indeterminate time into the future. The fact is that for many deterministic systems, the computational cost of making an accurate prediction a substantial time into the future becomes prohibitive. Thus the classical view, that if a system is deterministic then its future behaviour can be predicted for all time, contains a basic flaw. This flaw becomes particularly apparent when we consider chaotic systems. Chaos is the word we use to describe deterministic behaviour whose long term evolution cannot, in view of this computational cost, be accurately predicted even if the initial conditions were known to an arbitrary degree of precision. This is certainly the case with many natural systems, for which in any case we cannot know the initial conditions to an arbitrary degree of precision. A classic example, first considered by E. N. Lorenz, is the weather [Lorenz 1963].

About a century ago, Henri Poincaré observed that the motion of three bodies under gravity can be extremely complicated. His discovery was the first mathematical evidence of chaos. Since that time there have been many observations of chaos both in mathematical models and natural systems. For many years chaos observed through the study of nonlinear dynamic systems was avoided, due to its complexity. In practice such behaviour was often totally ignored, being interpreted as either completely unpredictable or ascribed to statistical noise.
The theory of nonlinear dynamics founded by Poincaré describes and classifies the behaviour of complex dynamical systems and the manner in which they evolve through time. Such systems were extraordinarily difficult to study. The situation changed dramatically with the invention of the modern computer. Scientists, especially mathematicians and physicists, who had previously encountered chaos could pursue a more systematic study of the phenomenon using the new tool.

The crucial importance of chaos is that it provides an alternative explanation for apparent randomness, one that depends on neither noise nor complexity. Chaotic behaviour appears in systems that are essentially free from noise and are also relatively simple, often with only a few degrees of freedom. Many natural systems exhibit chaos. In the early years of the study, it was common to assume that such behaviour is unpredictable^7 and therefore uncontrollable. Since the late 1980's, a number of quite different techniques have been proposed for controlling chaotic systems [Ott 1990], [Dracopoulos 1994], [Ogorzalek 1993].

Mathematical background.

A dth order autonomous continuous time system is defined as

    dx_i/dt = g_i(x_1, . . . , x_d, p_i)   (1 ≤ i ≤ d)        (1)

where t is the time, x = (x_1, . . . , x_d) is a vector in the d-dimensional state space, i.e. the x_i are the dynamic variables, g = (g_1, . . . , g_d) is a vector field in the state space, i.e. the g_i are functions of the x_i, and the p_i are the corresponding vectors of control parameters. As the vector field does not depend on time, the initial time may be taken as t_0 = 0.

^7 Unpredictable in the sense that, although completely deterministic, the computational `cost' of an accurate prediction rapidly becomes prohibitive as the prediction interval increases.
The Jacobian matrix J of an autonomous system described by d first-order differential equations is a d × d matrix with elements defined as

    j_{l,m} = ∂g_l/∂x_m   (1 ≤ l, m ≤ d)        (2)

If the determinant of the Jacobian matrix, det J, is one at all points then the system is conservative. If the average of |det J| < 1 then the system is dissipative. For the average of |det J| > 1, volumes in state space expand with time.

We distinguish between conservative and dissipative dynamic systems. In conservative systems volume elements in the state space are conserved, whereas in a dissipative system the volume elements contract as the system evolves. For dissipative systems, the effects of transients associated with initial conditions disappear in time. The trajectory in state space will head for some final attracting region, or regions, which might be a point, curve, area, and so on. Such an object is called the attractor for the system, since a number of distinct trajectories will be attracted to this set of points in the state space. The properties of the attractor determine the long term dynamical behaviour of the system.

The terms are best understood with examples. A well known nonlinear dissipative system is the Lorenz model [Lorenz 1963], defined by the set of differential equations

    dx/dt = σ(y − x)
    dy/dt = x(R − z) − y        (3)
    dz/dt = xy − bz

This system has three degrees of freedom as there are three dynamic variables x, y and z. The control parameters are σ, R and b. For R less than 1, all trajectories, no matter what their initial conditions, eventually end up approaching the origin of the xyz state space. That is, for R < 1, all of the xyz space is the basin of attraction for the attractor at the origin. Figure 6-1 illustrates a trajectory of the model with σ = 10.0, R = 0.5, b = 8/3, x = 1.0, y = 2.0 and z = 3.0.
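As a quick numerical check of the R < 1 regime described above, the Lorenz equations (3) can be integrated in a few lines. This is an illustrative sketch (simple Euler stepping with an arbitrarily chosen step size; it is not part of the lecture material):

```python
import numpy as np

def lorenz_step(state, sigma=10.0, R=0.5, b=8.0/3.0, dt=0.001):
    # One Euler step of the Lorenz equations (3):
    # dx/dt = sigma*(y - x), dy/dt = x*(R - z) - y, dz/dt = x*y - b*z
    x, y, z = state
    return np.array([x + dt*sigma*(y - x),
                     y + dt*(x*(R - z) - y),
                     z + dt*(x*y - b*z)])

state = np.array([1.0, 2.0, 3.0])   # the initial condition used for Figure 6-1
for _ in range(50000):              # integrate up to t = 50
    state = lorenz_step(state)
print(np.linalg.norm(state))        # near zero: for R < 1 the origin attracts all trajectories
```

The same loop with R raised above 1 no longer collapses to the origin, which is the transition the text goes on to discuss.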
Figure 6-1 Stable attractor.

Chaos

There is currently great excitement and much speculation about chaos theory and its potential role in understanding the world. A brief introduction to the history of the mathematical foundations of the subject can be found in [Holmes 1990]. A chaotic system will remain apparently noisy regardless of how well experimental conditions are controlled. For the Duffing oscillator model, different values of d, f and ω produce completely different system behaviours: a periodic regime occurs, for example, at d = 0.15, f = 0.3 and ω = 1.0, while Figure 6-2 illustrates a typical time series of the chaotic behaviour of the model for d = 0.2, f = 36 and ω = 0.665. As can be seen, the chaotic time series is more complicated in appearance, but there is a boundary within which the
system stays.

If a system displays divergence of nearby trajectories, or sensitive dependence on initial conditions, for some range of its control parameter, then the long term behaviour of that system becomes essentially unpredictable^8, i.e. the long term future of a chaotic system is in practice indeterminable even though the system is theoretically deterministic. The effect of divergence of nearby trajectories on the behaviour of nonlinear systems is known as the butterfly effect. The term was introduced by Lorenz based on the picturesque notion that if the atmosphere displays chaotic behaviour with divergence of nearby trajectories, then even the flapping of a butterfly's wings would alter any long term prediction of atmospheric dynamics. This phenomenon is illustrated in Figure 6-3 for the x-coordinate of the Lorenz model, with one trajectory starting at x = 1.0, y = 2.0 and z = 3.0 in black, and another at x = 1.01, y = 2.01 and z = 3.01 in grey. Instead of R = 0.5, we have used R = 28.0, as the model exhibits chaotic behaviour with this value.

For many nonlinear systems, we must integrate the equations step by step to find future behaviour. Any small error in specifying the initial conditions will be magnified, leading to grossly different long term behaviour of the system; therefore we cannot predict that long term behaviour in practice. Thus, chaotic behaviour is characterised by the divergence of nearby trajectories in state space. As a function of time, the separation between two nearby trajectories increases exponentially, at least for short times. (Only for short times, because the trajectories stay within some bounded region of the state space.)

Figure 6-2 A chaotic time series.

In three or more dimensions, initially nearby trajectories can continue to diverge by wrapping over and under each other.
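The divergence of nearby trajectories is easy to demonstrate numerically. The sketch below repeats the Figure 6-3 experiment with the two initial conditions quoted in the text (Euler integration at R = 28; the step size and iteration count are illustrative choices, not the lecture's):

```python
import numpy as np

def lorenz_step(s, sigma=10.0, R=28.0, b=8.0/3.0, dt=0.001):
    # One Euler step of the Lorenz equations (3) in the chaotic regime R = 28
    x, y, z = s
    return np.array([x + dt*sigma*(y - x),
                     y + dt*(x*(R - z) - y),
                     z + dt*(x*y - b*z)])

traj1 = np.array([1.0, 2.0, 3.0])
traj2 = np.array([1.01, 2.01, 3.01])   # a tiny initial difference (~0.017 apart)
max_sep = 0.0
for _ in range(10000):                 # integrate up to t = 10
    traj1 = lorenz_step(traj1)
    traj2 = lorenz_step(traj2)
    max_sep = max(max_sep, np.linalg.norm(traj1 - traj2))
print(max_sep)   # grows by orders of magnitude before saturating at the attractor size
```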
The crucial feature of state space with three or more dimensions which permits chaotic behaviour is that trajectories remain within some bounded region by intertwining and wrapping around each other, without intersecting and without repeating themselves exactly. The geometry created by such trajectories is strange. Such attractors are thus called strange attractors [Ruelle 1980]; i.e. if nearby trajectories on average diverge exponentially then we say the attractor is strange or chaotic.

Figure 6-3 The butterfly effect.

Chaos in biology.

A relatively new model of brain function was first described by Freeman [Freeman 1991]. The idea is that `thought' (in particular perception, prediction and control) consists of the flow (in the high dimensional state space of vast assemblies of neurons) from one chaotic orbit to a periodic orbit. Freeman argues that chaos is evident in the tendency of neural assemblies to shift abruptly from one complex activity pattern to a more stable one in response to the smallest of inputs. This is a plausible model, and if it stands the test of experiment and scrutiny then chaos is an intrinsic feature of brain function. Phase portraits made from EEGs (electroencephalographs) generated by computer models reflect the overall

^8 In the sense previously discussed.
activity of the olfactory bulb of a rabbit at rest and in response to a familiar scent (e.g. banana). The resemblance of these portraits to irregularly shaped, but still structured, coils of wire reveals that brain activity in both conditions is chaotic, but that the response to the known stimulus is more ordered, more nearly periodic during perception, than at rest.

The heart also provides interesting examples of chaotic behaviour in biological systems. It is clear that the cardiac waveform is nonlinear. There is also evidence that the cardiac cycle can usefully be described in terms of chaos. A. Babloyantz and A. Destexhe [Babloyantz 1988] examined the ECGs (electrocardiographs) of four normal human hearts, using qualitative and quantitative methods. With a variety of processing algorithms, such as the power spectrum, autocorrelation function, phase portrait, Poincaré section, Lyapunov exponent etc., they demonstrated that the heart is not a perfect oscillator, but that cardiac activity stems from deterministic dynamics of a chaotic nature. Numerous in-vivo and in-vitro experiments have investigated cardiac oscillatory activity and found characteristic signatures of chaos [Choi 1983], [Chay 1985], [Geuvara 1981], [Goldberg 1984], [Keener 1981].

Controlling chaos.

The extreme sensitivity to initial conditions displayed by chaotic systems makes them unstable and unpredictable. Yet the same sensitivity also makes them highly susceptible to control, provided that the chaotic system can be analysed and the analysis is then used to make small effective control interventions. By perturbing the system in the right way, it is possible to encourage it to follow one of its many unstable but natural behaviours. In such situations, it may be possible to use chaos to advantage, as chaotic systems, once under control, are very flexible. Such systems can rapidly switch among many different behaviours.
Incorporating chaos deliberately into practical systems therefore offers the possibility of achieving greater flexibility in their performance. In the context of chaos, control could mean a number of things. It could mean the elimination of multiple basins of attraction, stabilisation of the fixed points, or stabilisation of the unstable periodic orbits. Control of chaos is still in its infancy but the potential it offers is enormous.

There are four main categories of chaos control methodologies: low energy, high energy, non-feedback and feedback methods. Low energy control methods require very small changes in the control parameter. In contrast, high energy control methods require large changes. It is always desirable to have a control method of the low energy type, as in physical systems the control parameter may be fixed or may be changeable by only a very small amount. When large changes are required, a physical system may need to be redesigned, defeating the `control of chaos' concept, as such an approach is closer to avoiding chaos. In feedback methods, a control parameter is changed during the control. In non-feedback methods, a control parameter is changed at the beginning of the control only, and left untouched during the control phase.

The original OGY control law.

Suppose p is some scalar control parameter which is to be varied at times t_i, say p = p_i over the interval (t_i, t_{i+1}), see Figure 6-4. Suppose that the nominal value of p is p_0. Our aim is to vary p by small amounts about p_0 so as to stabilise ξ(t) about a suitable control point ξ_F. For all i ≥ 1 let

Figure 6-4 Intervals for which the variables are defined.
    δp = p − p_0   and   δξ_{i+1}(p_i) = ξ_{i+1}(p_i) − ξ_F(p_0)        (4)

Suppose that the iteration is described by the map ξ_{i+1} = F(ξ_i, p). The locally linear behaviour of F in the vicinity of a control point ξ_F is described by the d × d Jacobian matrix

    J = D_ξ F(ξ, p), evaluated at ξ = ξ_F, p = p_0        (5)

and in what follows we assume det J ≠ 0. This yields the first order approximation

    δξ_{i+1}(p_i) ≈ J δξ_i(p_{i−1}) + u δp_i        (6)

where u is a vector which reflects the direction of the local gradient with respect to p.

Now suppose that an eigenvector of J, say e_u, has a real eigenvalue whose absolute value is greater than 1. This means that points ξ_i for which δξ_i lies in the direction of e_u will, if p = p_0 in the intervening time period, be such that on the next iteration δξ_{i+1} lies further away from ξ_F(p_0). We refer to this as an unstable direction. Stable directions are characterised by eigenvectors of J whose eigenvalues have absolute value less than 1.

The basic idea of the OGY method is to choose p_i so as to eliminate the component of δξ_{i+1} in the unstable direction(s). We are now almost ready to derive the appropriate control strategy. However, we first observe the following lemma.

Lemma 6.1. Suppose the d × d matrix J has d linearly independent eigenvectors e_1, ..., e_d, with real eigenvalues λ_1, ..., λ_d. Thus we assume the eigenvectors form a basis in ℝ^d. Construct the dual basis f_1, ..., f_d defined by

    e_i · f_j = 1 if i = j,  0 if i ≠ j        (7)

Then for any x in ℝ^d

    f_u · Jx = λ_u f_u · x        (8)

Proof. Express x in terms of the eigenvectors, writing

    x = α_1 e_1 + α_2 e_2 + ... + α_d e_d        (9)

where the α_i are suitable scalars depending on x. Thus from (7)

    f_u · x = α_u        (10)

The effect of J on x is (from the definition of eigenvectors and eigenvalues)

    Jx = λ_1 α_1 e_1 + ... + λ_d α_d e_d        (11)

Taking the inner product with f_u yields f_u · Jx = λ_u α_u, and the conclusion now follows from (10). ∎

We can now prove

Theorem 6.1 (OGY). The constraint f_u · δξ_{i+1} = 0 leads to the first order control law:
    δp_i ≈ −λ_u (f_u · δξ_i(p_{i−1})) / (f_u · u)        (12)

where for ξ_i near ξ_F the sensitivity vector u is defined as

    u = ∂/∂p_i [ ξ_{i+1}(p_i) − J ξ_i(p_{i−1}) ] at p_0
      = lim_{p_i → p_0} [ δξ_{i+1}(p_i) − J δξ_i(p_{i−1}) ] / (p_i − p_0)        (13)

Proof. Dotting (6) with f_u and using Lemma 6.1 we have

    f_u · δξ_{i+1}(p_i) ≈ λ_u f_u · δξ_i(p_{i−1}) + (f_u · u) δp_i        (14)

Using the constraint f_u · δξ_{i+1} = 0 we obtain from (14)

    λ_u f_u · δξ_i(p_{i−1}) + (f_u · u) δp_i ≈ 0        (15)

which on solving for δp_i yields (12). ∎

The following conditions are required to control a chaotic system with the original OGY method.

• An experimental time series of some scalar dependent variable x_t can be measured and a suitable embedding technique can be applied, or the mathematical model describing the system is available.

• The dynamics of the system can be represented on a low-dimensional surface of section, and the system has at least two linearly independent real eigenvectors.

• There is a specific periodic orbit of the map which lies in the attractor, around which one wishes to stabilise, and the corresponding unstable periodic point can be located.

• A parameter p is available for external adjustment which can be used to slightly modify the system dynamics. Let the range in which p is allowed to vary be p_MIN < p < p_MAX. There is a maximum perturbation δp* in the parameter p by which it is acceptable to vary p from the nominal value p_0.

• The position of the periodic orbit is a function of p, but the local dynamics about it do not vary much with small changes in p.

Chaotic conventional neural networks.

It has been known since at least 1991 that conventional neural network models can exhibit chaotic behaviour. Wang [Wang 1991] constructed a rather stylised simple 2-2 network with weights

    W = [ a   ka ]
        [ b   kb ]        (16)

for a = −5, b = −25, k = −1, whose sigmoidal transfer function is

    σ(x) = 1 / (1 + e^{−x/T})        (17)

where T = 1/4, and in which the outputs are fed back to the inputs as in Figure 6-5.
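The machinery above, the dual basis of Lemma 6.1 and the control law (12), can be exercised on any low-dimensional chaotic map. The sketch below applies it to the standard Hénon map, chosen purely for illustration (it is not the network studied in this chapter), with the map parameter a playing the role of the adjustable parameter p:

```python
import numpy as np

# OGY control (12) of the Henon map x' = 1 - a*x^2 + y, y' = b*x,
# using a as the adjustable parameter. All values are illustrative.
b = 0.3
a0 = 1.4                                              # nominal parameter value p_0
xf = (-(1 - b) + np.sqrt((1 - b)**2 + 4*a0)) / (2*a0) # unstable fixed point x-coordinate
fixed = np.array([xf, b*xf])

J = np.array([[-2*a0*xf, 1.0],
              [b,        0.0]])            # Jacobian at the fixed point (eq. (5))
lam, E = np.linalg.eig(J)                  # real eigenvalues here; columns of E are e_1, e_2
u_idx = int(np.argmax(np.abs(lam)))        # the unstable direction: |lambda_u| > 1
fu = np.linalg.inv(E)[u_idx]               # dual-basis vector f_u of Lemma 6.1
u = np.array([-xf**2, 0.0])                # sensitivity dF/da at the fixed point (eq. (13))

state = fixed + np.array([0.01, -0.01])    # start near the orbit to be stabilised
for _ in range(50):
    delta = state - fixed
    da = -lam[u_idx] * (fu @ delta) / (fu @ u)   # control law (12)
    da = float(np.clip(da, -0.1, 0.1))           # small perturbations only
    a = a0 + da
    state = np.array([1 - a*state[0]**2 + state[1], b*state[0]])

print(np.linalg.norm(state - fixed))       # the unstable fixed point is now stabilised
```

Note how the dual basis is obtained simply as a row of the inverse eigenvector matrix, since inv(E) E = I is exactly condition (7).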
Wang proved that there exist period-doublings to chaos and strange attractors, by using a homeomorphism from the network to a known dynamical system having these properties. This formally established that artificial neural networks can exhibit chaotic behaviour. At the same time [Welstead 1991] trained feedforward networks on the Ikeda and Hénon maps and then, by feeding the outputs back into the inputs, empirically produced neural networks with chaotic attractors. Somewhat later further examples were given in [Dracopoulos 1993]^9. None of these papers considered the question of controlling the neural network behaviour.

Figure 6-5 Feedforward network as a dynamical system. Figure 6-6 Chaotic attractor of Wang's neural network.

Controlling chaotic neural networks

The attractor of the Wang network is rather difficult to work with because the attractor is almost a curve, see Figure 6-6. However, we were able to locate an unstable fixed point at (0.896853, 0.999980) which has Jacobian

    J = [ −1.96322      2.08867    ]
        [ −0.00755664   0.00893465 ]        (18)

The awkward shape of the attractor is reflected in the fact that this Jacobian has a very small determinant. Using T as a control parameter with a nominal value of 1/4 we managed to stabilise this system about the fixed point.

In this section we consider neural networks with feedback whose dynamical behaviour is chaotic. This is a subject of increasing interest for two reasons. First, because if Freeman's hypothesis is correct then, despite their value in practical applications (for example in pattern recognition), the idea that feedforward networks, regardless of the training algorithm employed, are an accurate analogy to equivalent biological computations is seriously challenged.
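The Wang network (16)-(17) is simple enough to iterate directly; the sketch below feeds the two outputs back as the next inputs and collects attractor points (the starting point is an arbitrary illustrative choice, and the attractor obtained should resemble Figure 6-6):

```python
import numpy as np

# Wang's 2-2 network, equations (16)-(17), iterated as a dynamical system:
# outputs are fed back to the inputs at each step (Figure 6-5).
a, b, k, T = -5.0, -25.0, -1.0, 0.25
W = np.array([[a, k*a],
              [b, k*b]])

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x / T))   # sigmoid (17) with slope parameter T

state = np.array([0.5, 0.6])
pts = []
for n in range(2000):
    state = sigma(W @ state)              # one pass through the network
    if n >= 500:                          # discard the transient
        pts.append(state)
pts = np.array(pts)
print(pts[:, 0].min(), pts[:, 0].max())   # x wanders inside (0, 1) without settling down
```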
Second, it is possible that by storing memories in (unstable) periodic behaviours, rather than at point attractors as in the Hopfield model, the memory capacity of simple neural networks may be considerably enhanced.

We consider ways in which a small 2-10-10-2 network, trained on the Ikeda attractor, and whose outputs feed back to the inputs, can be controlled using small variations of various parameters or system variables. It needs to be said that the control mechanisms employed are external to the neural network. This is in contrast to the hypothetical biological process, in which presumably some form of Hebbian learning causes frequently encountered sensory input to be associated with a particular (unstable) periodic dynamical regime. The result is that when the sensory input is re-encountered the neural system relaxes naturally onto a particular (unstable) periodic behaviour which characterises the input, such switching of the dynamical behaviour being implicit to the neural structure rather than externally imposed. Nevertheless, we believe that such studies of how chaotic neural systems can be encouraged to follow particular unstable periodic orbits are an interesting and probably necessary first step in developing some understanding of how such behaviour might be made a natural characteristic of the neural

^9 At that time the authors were unaware of the results presented in IJCNN-91.
system. Other approaches are possible using different neural network models and different control techniques. For example, Babloyantz's group [Sepulchre 1993], [Lourenço 1994], [Babloyantz 1995] have controlled a network of oscillators coupled to their four nearest neighbours, using both the OGY method and a delayed feedback control technique first suggested by Pyragas [Pyragas 1992]. Another example [Solé 1995] uses the GM technique ([Güemez 1993], [Matías 1994]) to control small (three or four neuron) fully connected neural networks whose attractors are similar to that of the chaotic network first discussed by [Wang 1991].

Figure 6-7 The Ikeda strange attractor. Figure 6-8 Attractor for the chaotic 2-10-10-2 neural network.

However, for the purposes of the present section we work with the Ikeda map [Hammel 1985] defined by

    g(z) = γ + R z exp[ i(κ − α/(1 + |z|²)) ]        (19)

where z is a complex variable of the form x + iy, and i² = −1. We can identify x + iy with the point (x, y) on the complex plane, so that g can also be thought of as a mapping of ℝ² → ℝ². The dynamical system is then defined by z_{n+1} = g(z_n). For parameter values α = 5.5, γ = 0.85, κ = 0.4 and R = 0.9, this mapping has a strange attractor, illustrated in Figure 6-7. With only 4000 training pairs (re-scaled into the range [0, 1]) and a training MSE of about 9.9×10⁻⁵, the network already produces an attractor (Figure 6-8) with features similar to the Ikeda map strange attractor shown in Figure 6-7.

We use this network as the basis for the control experiments, the objective being to determine which parameters or system variables are most effective in stabilising the system onto an unstable periodic orbit. The OGY control method was applied to control the chaotic neural network described above.
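The Ikeda map (19) itself takes only a few lines to iterate; the sketch below generates attractor points with the parameter values quoted above (the neural model trained on this data is not reproduced here, and the starting point is an illustrative choice):

```python
import numpy as np

# The Ikeda map (19): z -> gamma + R*z*exp(i*(kappa - alpha/(1 + |z|^2)))
alpha, gamma, kappa, R = 5.5, 0.85, 0.4, 0.9

def ikeda(z):
    return gamma + R * z * np.exp(1j * (kappa - alpha / (1 + abs(z)**2)))

z = 0.1 + 0.1j
orbit = []
for n in range(3000):
    z = ikeda(z)
    if n >= 500:                  # discard the transient
        orbit.append(z)
orbit = np.array(orbit)
print(orbit.real.min(), orbit.real.max())   # the points trace out the strange attractor
```

Note that |g(z)| ≤ γ + R|z|, so with R < 1 the orbit is trapped inside the disc |z| ≤ γ/(1 − R), which is why the attractor is bounded.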
An unstable fixed point ξ_F = (0.626870, 0.553256) was located by examining successive iterations of the system and was used as the unstable periodic point to be stabilised. The Jacobian at this point was

    J = [ −1.26617    −1.03629 ]
        [ −0.564996   −1.06779 ]        (20)

with eigenvalues λ_s = −0.395399 and λ_u = −1.93857, stable eigenvector e_s = (0.7656, −0.643317) and unstable
eigenvector e_u = (−0.838887, −0.544306).

Control varying T in a particular layer.

Initial attempts to control using the slope parameter T were not successful. The next attempts were made by varying T only for neurons in a particular layer of the network, and here the OGY control method was more effective. It seems that by varying T in only one particular layer the chaotic regions of the bifurcation diagrams become broader (see Figure 6-9 and Figure 6-10), and so control becomes easier with small variations of T. The variations of T and the controlled result are illustrated in Figure 6-13 - Figure 6-12.

Using small variations of the inputs.

The results of using an external signal feeding into one of the inputs as a control parameter, whose nominal value is set to zero, were significantly more interesting. The bifurcation diagrams for x(t) are given in Figure 6-14 and Figure 6-15. We use the same fixed point as before, so the Jacobian and associated eigenvectors and eigenvalues remain unchanged. Using an external signal feeding into input x (cf. Figure 6-5), the sensitivity vector u_x = (−1.076260, −0.675875) was approximated. After applying the OGY control for less than 25 time steps (the control variations are shown in Figure 6-18), the system rapidly stabilised onto the unstable fixed point, as illustrated in Figure 6-16 - Figure 6-17.

In these experiments, an improved technique due to [Otani 1996] was actually used to estimate the sensitivity vectors u. In (13) the Jacobian is used to obtain a prediction of where the system would be at the next iteration if no control were applied. However, in the case of a neural network this is unnecessary, since the neural network is its own model at every point. We can therefore obtain an exact prediction of the next system state by simply iterating the network without control.
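The sensitivity vector defined in (13) can be approximated numerically by perturbing the control parameter and differencing. The sketch below does this for the Hénon map with an additive external signal p on the x input, a hypothetical stand-in for the network's input-signal control, not the network itself:

```python
import numpy as np

# Finite-difference estimate of the sensitivity vector u = dF/dp at a fixed point,
# for the Henon map with an external signal p added to the x input.
# (Illustrative stand-in; the lecture estimates u for the trained 2-10-10-2 network.)
b = 0.3
a = 1.4

def henon(state, p=0.0):
    x, y = state
    x = x + p                        # external signal added to the input, nominal p = 0
    return np.array([1 - a*x*x + y, b*x])

xf = (-(1 - b) + np.sqrt((1 - b)**2 + 4*a)) / (2*a)
fixed = np.array([xf, b*xf])         # the unstable fixed point

eps = 1e-6
u = (henon(fixed, eps) - henon(fixed, -eps)) / (2*eps)   # central difference in p
print(u)                             # analytically u = (-2*a*xf, b)
```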
This resulted in much more accurate estimations of the sensitivity vectors, which in turn made control of the system using the OGY method much easier.

The OGY method can be applied to the control of conventional feedforward networks whose behaviour under iterated feedback has been trained to be chaotic. Whilst the method is computationally expensive and, in its original form, subject to a number of limitations (for example, inaccuracies in estimating the Jacobian or sensitivity vectors can make control difficult if not impossible), we nevertheless see that stabilisation of unstable fixed points is perfectly feasible. However, this relaxation onto a fixed point is achieved by a control external to the network itself rather than as an implicit consequence of network function.

It is interesting to observe that control by variation of a global slope parameter is not easy to achieve, but becomes easier when the control variations are applied to a single layer rather than to the whole network. It is notable that control becomes very much easier when the controlling parameter is a small signal applied to one of the inputs. This may be closer to a biological analogy than control of behaviour through global or selective slope variation.
Figure 6-9 Bifurcation diagram of x obtained by varying T in the output layer only. Figure 6-10 Bifurcation diagram of y obtained by varying T in the output layer only.
Figure 6-11 Variations of x from initiation of control. Figure 6-12 Variations of y from initiation of control.
Figure 6-13 Parameter changes during output layer control.
[Plot data omitted.]
Figure 6-14 Bifurcation diagram for the output x(t+1) using an external variable added to the input x(t). Figure 6-15 Bifurcation diagram for the output y(t+1) using an external variable added to the input x(t).
Figure 6-16 Variations of x from initiation of control. Figure 6-17 Variations of y from initiation of control.
Figure 6-18 Parameter changes during input x control.
[Plot data omitted.]
Quite how easy it would be to extend such control to networks with many outputs fed back to many inputs remains to be determined. It also remains to be determined whether it is practical to control high dimensional networks to follow unstable periodic orbits rather than fixed points. It is likely that more sophisticated variations of the OGY technique, or some completely different control method, would be required to accomplish this goal.

Time delayed feedback and a generic scheme for chaotic neural networks.

Recently [Tsui 1999], [Oliveira 1998], [Tsui 2002], Ana Oliveira, Alban Tsui and myself have discovered that it is very easy to control and synchronise chaotic neural systems using time delayed feedback. Combined with the Gamma test to select appropriate time delays, we can now easily achieve the following:

• Given an arbitrary smooth multidimensional chaotic system, produce an iterated neural network which closely models the system.

• Using a simple technique of time delayed feedback, cause the iterated chaotic neural network, when presented with a stimulus, to stabilise onto an unstable periodic orbit characteristic of the applied stimulus. Moreover, this response is quite stable in the presence of noise.

• Synchronise two identical copies of the network by transmitting the single output of one to the other. This provides an interesting model of other kinds of cortical activity and also has interesting applications in secure communications.

Although not a full account of our methods, these papers answer many of the questions mentioned. [Tsui 1999] provides the first actual implementation of an artificial neural network which exhibits all of the properties discussed in [Freeman 1991]. A generic scheme for such a stimulus-response recurrent network is shown in Figure 6-19.

The single output of the network feeds back into the inputs using delay buffers, according to the embedding previously determined by the Gamma test experiments. This embedding should contain enough information for predicting the next system state. For stabilisation control, a multiple (gain constant) of the delayed feedback is added to each neural network input specified by the irregular embedding, based on the idea of Pyragas' delayed feedback control. The control module is shown in Figure 6-19, and the control perturbation for the ith input at the nth iteration is

    k_i [ x(n − i − τ) − x(n − i) ]        (21)

where k_i is a gain constant and τ is the delay time.

We imagine that the presence of an external stimulus excites (activates) the control circuitry, which is otherwise inhibited. Thus to achieve a stabilised dynamical regime in response to a stimulus, the control is switched on at the same time as the external signal is fed into the input line. By varying the external signal in small steps, and holding each new setting fixed long enough for the system to stabilise, we can observe the response of the network to small changes in stimulus.

In the diagram, τ is the same for each control perturbation, but of course we could set τ to be different on each control line. External stimulus of the network can be applied to the controlled inputs as shown in the diagram. The control module should switch on automatically and simultaneously whenever there is an external stimulation. Variations of stimulation, such as on the control delayed feedback lines, may also be used.
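The delayed feedback perturbation (21) can be demonstrated on any chaotic map. The sketch below uses the logistic map at r = 3.8 as a stand-in for the trained network (all parameter values here are illustrative, not those of the lecture's experiments), with τ = 1 so that the unstable fixed point is the orbit being stabilised:

```python
import numpy as np

# Pyragas-style delayed feedback, as in (21): the perturbation k*(x(n - tau) - x(n))
# is added to the input. Demonstrated on the chaotic logistic map r = 3.8.
r, k, tau = 3.8, 0.4, 1
f = lambda x: r * x * (1 - x)
xstar = 1 - 1/r                          # unstable fixed point: |f'(x*)| = r - 2 = 1.8 > 1

hist = [xstar + 0.012, xstar + 0.01]     # start near the orbit to be stabilised
for n in range(200):
    x = hist[-1]
    perturbation = k * (hist[-1 - tau] - x)   # eq. (21); vanishes once stabilised
    hist.append(f(x + perturbation))

print(abs(hist[-1] - xstar))             # ~0: the unstable fixed point is now stable
print(abs(k * (hist[-2] - hist[-1])))    # the control signal has become tiny (non-invasive)
```

The key design point, matching the text, is that the perturbation compares the signal with its own past, so once the target orbit is reached the control signal vanishes of its own accord.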
The single output of the network feeds back into the inputs using delay buffers according to the embedding previously determined by the Gamma test experiments. This embedding should contain enough information for predicting the next system state. For stabilization control a multiple (gain constant) of the delayed feedback is added to each neural network input specified by the irregular embedding, based on the idea from Pyragas' delayed feedback control. The control module is shown in Figure 6-19 and the control perturbation for the ith input at the nth iteration is k i x i(n i & & τ ) x i(n i) & & (21) where ki is a gain constant and is the delay time. J We imagine that the presence of an external stimulus excites (activates) the control circuitry, which is otherwise inhibited. Thus to achieve a stabilised dynamical regime in response to a stimulus the control is switched on at the same time as the external signal is fed into the input line xn. By varying the external signal in small steps and holding the new setting fixed long enough for the system to stabilise we can observe the response of the network to small changes in stimulus. In the diagram, is the same for each control perturbation but of course, we could set to be different on each J J control line. External stimulus of the network can be applied to the controlled inputs as shown in the diagram. The control module should switch on automatically and simultaneously whenever there is an external stimulation. Variations of stimulation, such as on the control delayed feedback lines may also be used. 70
  • 72. Antonia J. Jones: 6 November 2005 iterative feedback Hidden layer delay d + x(n-d) + kd (x(n-d-t) - x(n-d)) observation Controlled neural inputs points External with no external stimulus stimulus x(n) delay 3 + x(n-3) + k3 (x(n-3-t) - x(n-3)) delay 2 + x(n-2) + k2 (x(n-2-t) - x(n-2)) Feedforward delay 1 + x(n-1) + k1 (x(n-1-t) - x(n-1)) neural network Switch signal Control feedback k1 (x(n-1-t) - x(n-1)) for each k2 (x(n-2-t) - x(n-2)) Time delayed delayed line control feedback Keys k3 (x(n-3-t) - x(n-3)) observation module point kd (x(n-d-t) - x(n-d)) Figure 6-19 A general scheme for constructing a stimulus-response chaotic recurrent neural network: the chaotic "delayed" network is trained on suitable input-output data constructed from a chaotic time series; a delayed feedback control is applied to each input line; entry points for external stimulus are suggested, with a switch signal to activate the control module during external stimulation; signals on the delay lines or output can be observed at the "observation points". Example: Controlling the Hénon neural network There follows an example of different responses of the for the Hénon neural system using different settings of controls and external stimulation. The response signals of the system can be observed at the output x(n) of the feedforward neural network module or the "observation points" on the delay lines x(n-1) , x(n-d), as indicated Y in Figure 6-19. Due to the complexity of these neural systems, of course, not all possible settings are tried and presented. s x 1.8 0.4 1.6 0.2 1.4 n 1.2 200 400 600 800 1000 1200 1400 n -0.2 200 400 600 800 1000 1200 1400 0.8 -0.4 0.6 -0.6 Figure 6-21 Response signal on x(n-6) with control Figure 6-20 The control signal corresponding to the signal activated on x(n-6) using k = 0.441628, = 2 J delayed feedback control shown in Figure 6-21. Note and without external stimulation after first 10 that the control signal becomes small. transient iterations. 
After n = 1000 iterations, the control is switched off.
Figure 6-22 Response signals on network output x(n), with control signal activated on x(n-6) using k = 0.441628, τ = 2 and with constant external stimulation sn added to x(n-6), where sn varies from -1.5 to 1.5 in steps of 0.1 at each 500 iterative steps (indicated by the change of hue of the plot points) after 20 initial transient steps.

Figure 6-23 The control signal corresponding to the delayed feedback control shown in Figure 6-22. Note that the control signal becomes small even when the network is under changing external stimulation.

We use τ = 2 and k = 0.441628 as our control parameters on all the possible delayed feedback control lines. The control is applied to the delayed feedback line x(n-6). Without any external stimulation, and using only a single control delayed feedback, the network quickly produces a stabilised response as shown in Figure 6-21, with the corresponding control signal shown in Figure 6-20. Notice that the control signal is very small during the stabilised behaviour. Under external stimulation of varying strength the network is still stabilised, but with a variety of new periodic behaviours as shown in Figure 6-22. The corresponding control signal is still small (see Figure 6-23).

For this system we then investigated the response when the sensory input was perturbed by additive Gaussian noise r with Mean[r] = 0 and standard deviation SD[r] = σ. Using the experimental setup of Figure 6-22, the external stimulus was perturbed at each iteration step by adding Gaussian noise r with standard deviation σ, i.e. giving an external stimulus sn + r. This experiment was repeated for different σ, where σ was varied from σ = 0.05 to σ = 0.3, a high noise standard deviation with respect to the external stimulus strength, which ranges from -1.5 to 1.5.
The result for σ = 0.05 is shown in Figure 6-24 and Figure 6-25. Surprisingly, the response signal stays almost the same, but the control signal is not small at all. The results for σ = 0.15 and σ = 0.3 are shown in Figure 6-26 and Figure 6-27 respectively. As illustrated in these figures, the system dynamics remain essentially unchanged, although as one might expect the response signal becomes progressively "blurred" as the noise level increases. Similar results can be obtained for the other examples.
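The stepped-stimulus schedule used in these experiments, together with its noisy variant, can be sketched as a small helper. The transient length, step size and noise handling follow the description above; the function name and signature are of course ours:

```python
import random

def stimulus_schedule(n, step_len=500, s0=-1.5, ds=0.1, sigma=0.0, transient=20):
    """Stepped external stimulus: after the initial transient, s_n starts at
    s0 and increases by ds every step_len iterations; sigma > 0 adds the
    Gaussian perturbation r used in the noise experiments (SD[r] = sigma)."""
    if n < transient:
        return 0.0
    s = s0 + ds * ((n - transient) // step_len)
    if sigma > 0.0:
        s += random.gauss(0.0, sigma)   # additive noise r on the sensory input
    return s
```

Feeding `stimulus_schedule(n)` into the controlled input line at each iteration reproduces the stepped-stimulus protocol; setting `sigma` to 0.05, 0.15 or 0.3 reproduces the noise experiments.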
Figure 6-24 Response signals on network output x(n), with the control setup the same as in Figure 6-22 but with Gaussian noise r added to the external stimulation, i.e. sn + r, with σ = 0.05, at each iteration step.

Figure 6-25 The control signal corresponding to the delayed feedback control shown in Figure 6-24.

Figure 6-26 Response signals on network output x(n), with the control experiment setup the same as in Figure 6-22 but with Gaussian noise r added to the external stimulation, i.e. sn + r, with σ = 0.15, at each iteration step.

Figure 6-27 Response signals on network output x(n), with the control experiment setup the same as in Figure 6-22 but with Gaussian noise r added to the external stimulation, i.e. sn + r, with σ = 0.3, at each iteration step.

Chapter references

[Abraham 1982] R. H. Abraham and C. D. Shaw, Dynamics - The Geometry of Behavior, Part One: Periodic Behavior, Aerial Press, California, 1982.

[Atkinson 1978] K. E. Atkinson, An Introduction to Numerical Analysis, John Wiley & Sons, Canada, 1978.

[Auerbach 1987] D. Auerbach, P. Cvitanovic, J.-P. Eckmann, G. Gunaratne and I. Procaccia. Exploring chaotic motion through periodic orbits, Physical Review Letters 58(23):2387-2389, 1987.

[Auerbach 1992] D. Auerbach, C. Grebogi, E. Ott and J. A. Yorke. Controlling chaos in high dimensional systems, Physical Review Letters 69(24):3479-3482, 1992.
[Azevedo 1991] A. Azevedo and S. M. Rezende. Controlling chaos in spin-wave instabilities, Physical Review Letters 66(10):1342-1345, 1991.

[Babloyantz 1988] A. Babloyantz and A. Destexhe. Is the normal heart a periodic oscillator? Biol. Cybern. 58:203-211, 1988.

[Babloyantz 1995] A. Babloyantz, C. Lourenço and J. A. Sepulchre. Control of chaos in delay differential equations, in a network of oscillators and in model cortex, Physica D 86:274-283, 1995.

[Belmonte 1988] A. L. Belmonte, M. J. Vinson, J. A. Glazier, G. H. Gunaratne and B. G. Kenny. Trajectory scaling functions at the onset of chaos: experimental results, Physical Review Letters 61(5):539-542, 1988.

[Carroll 1992] T. L. Carroll, I. Triandaf, I. Schwartz and L. Pecora. Tracking unstable orbits in an experiment, Physical Review A 46(10):6189-6192, 1992.

[Carroll 1993] T. L. Carroll and L. M. Pecora. Using chaos to keep period-multiplied systems in phase, Physical Review E 48(4):2426-2436, 1993.

[Choi 1983] M. Y. Choi and B. A. Huberman. Dynamic behaviour of nonlinear networks, Physical Review A 28:1204-1206, 1983.

[Chay 1985] T. R. Chay and J. Rinzel. Bursting, beating, and chaos in an excitable membrane model, Biophysical J. 47:357-366, 1985.

[Crutchfield 1980] J. Crutchfield, D. Farmer, N. Packard, R. Shaw, G. Jones and R. J. Donnelly. Power spectral analysis of a dynamical system, Physics Letters A 76:1-4, 1980.

[Crutchfield 1981] J. Crutchfield, M. Nauenberg and J. Rudnick. Scaling for external noise at the onset of chaos, Physical Review Letters 46:933-935, 1981.

[Cvitanovic 1989] P. Cvitanovic. Universality in Chaos, second edition, Adam Hilger, Bristol, 1989.

[Derrick 1993] W. R.
Derrick, in Chaos in Chemistry and Biochemistry, Eds. R. J. Field and L. Györgyi, World Scientific Publishing, 1993.

[Ditto 1990] W. L. Ditto, S. N. Rauseo and M. L. Spano. Experimental control of chaos, Physical Review Letters 65(26):3211-3214, 1990.

[Ditto 1993] W. L. Ditto and L. M. Pecora. Mastering chaos, Scientific American, 62-68, August 1993.

[Dracopoulos 1993] D. C. Dracopoulos and Antonia J. Jones. Neuromodels of analytic dynamic systems, Neural Computing & Applications 1(4):268-279, 1993.

[Dracopoulos 1994] D. C. Dracopoulos and Antonia J. Jones. Neuro-genetic adaptive attitude control, Neural Computing & Applications 2(4):183-204, 1994.

[Dressler 1992] U. Dressler and G. Nitsche. Controlling chaos using time delay coordinates, Physical Review Letters 68(1):1-4, 1992.

[Eckmann 1985] J. P. Eckmann and D. Ruelle. Ergodic theory of chaos and strange attractors, Rev. Modern
Physics 57(3):617-656, 1985.

[Freeman 1991] W. J. Freeman. The physiology of perception, Scientific American, 34-41, February 1991.

[Garfinkel 1992] A. Garfinkel, M. L. Spano, W. L. Ditto and J. N. Weiss. Controlling cardiac chaos, Science 257:1230-1235, 1992.

[Guevara 1981] M. R. Guevara, L. Glass and A. Shrier. Phase locking, period-doubling bifurcations, and irregular dynamics in periodically stimulated cardiac cells, Science 214:1350-1352, 1981.

[Gills 1992] Z. Gills, C. Iwata, R. Roy, I. B. Schwartz and I. Triandaf. Tracking unstable steady states: extending the stability regime of a multimode laser system, Physical Review Letters 69(22):3169-3172, 1992.

[Gleick 1987] J. Gleick. Chaos: Making a New Science, Abacus, London, 1987.

[Goldberg 1984] A. L. Goldberger, L. J. Findley, M. R. Blackburn and A. J. Mandell. Nonlinear dynamics in heart failure: implications of long-wavelength cardiopulmonary oscillations, Amer. Heart J. 107:612-615, 1984.

[Grassberger 1983] P. Grassberger and I. Procaccia. Characterization of strange attractors, Physical Review Letters 50(5):346-349, 1983.

[Grebogi 1988] C. Grebogi, E. Ott and J. A. Yorke. Unstable periodic orbits and the dimensions of multifractal chaotic attractors, Physical Review A 37:1711-1723, 1988.

[Grebogi 1982] C. Grebogi, E. Ott and J. A. Yorke. Chaotic attractors in crisis, Physical Review Letters 48:1507-1510, 1982.

[Grebogi 1986] C. Grebogi, E. Ott and J. A. Yorke. Critical exponent of chaotic transients in nonlinear dynamical systems, Physical Review Letters 57:1284-1287, 1986.

[Grebogi 1987] C. Grebogi, E. Ott and J. A. Yorke. Critical exponents for crisis-induced intermittency, Physical Review A 36:5365-5380, 1987.

[Greenside 1982] H. S. Greenside, A. Wolf, J. Swift and T. Pignataro. Physical Review A 25:3453, 1982.

[Güémez 1993] J. Güémez and M. A. Matías. Control of chaos in unidimensional maps, Physics Letters A 181:29-32, 1993.

[Gunaratne 1989] G. H. Gunaratne, P. S.
Linsay and M. J. Vinson. Chaos beyond onset: a comparison of theory and experiment, Physical Review Letters 63(1):1-4, 1989.

[Hall 1992] N. Hall. The New Scientist Guide to Chaos, Penguin Books, London, 1992.

[Hammel 1985] S. Hammel, C. Jones and J. Moloney. Global dynamical behaviour of the optical field in a ring cavity, J. Opt. Soc. Am. B 2(4):552-564, 1985.

[Hao 1990] B. Hao. Chaos II, World Scientific, 1990.

[Hayes 1993] S. Hayes, C. Grebogi and E. Ott. Communicating with chaos, Physical Review Letters 70(20):3031-3034, 1993.

[Hénon 1976] M. Hénon. A two-dimensional mapping with a strange attractor, Communications in Mathematical Physics 50:69-77, 1976.
[Hilborn 1994] R. C. Hilborn. Chaos and Nonlinear Dynamics: An Introduction for Scientists and Engineers, Oxford University Press, New York, 1994.

[Holmes 1990] P. Holmes. Poincaré, celestial mechanics, dynamical-systems theory and "chaos", Physics Reports 193(3):137-163, 1990.

[Hunt 1991] E. R. Hunt. Stabilizing high-period orbits in a chaotic system: the diode resonator, Physical Review Letters 67(15):1953-1955, 1991.

[Kaplan 1979] J. Kaplan and J. A. Yorke, in Chaotic Behavior of Multidimensional Difference Equations, H. O. Peitgen et al., Eds., Springer Lecture Notes in Mathematics 730:204-227, Springer-Verlag, Berlin, 1979.

[Keener 1981] J. P. Keener. Chaotic cardiac dynamics, Lectures in Applied Mathematics 19:299-325, 1981.

[Kim 1992] J. H. Kim and J. Stringer. Applied Chaos, John Wiley & Sons, Canada, 1992.

[Lai 1994] Y.-C. Lai and C. Grebogi. Synchronization of spatiotemporal chaotic systems by feedback control, Physical Review E 50(3):1894-1899, 1994.

[Lai 1993] Y.-C. Lai, M. Ding and C. Grebogi. Controlling Hamiltonian chaos, Physical Review E 47(1):86-92, 1993.

[Lathrop 1989] D. P. Lathrop and E. J. Kostelich. Characterization of an experimental strange attractor by periodic orbits, Physical Review A 40(7):4028-4031, 1989.

[Lorenz 1963] E. N. Lorenz. Deterministic nonperiodic flow, J. Atmospheric Sciences 20:130-141, 1963.

[Lorenz 1993] E. N. Lorenz. The Essence of Chaos, University of Washington Press, 1993.

[Lourenço 1994] C. Lourenço and A. Babloyantz. Control of chaos in networks with delay: a model for synchronization of cortical tissue, Neural Computation 6:1141-1154, 1994.

[Matías 1994] M. A. Matías and J. Güémez. Stabilization of chaos by proportional pulses in the system variables, Physical Review Letters 72(10):1455-1458, 1994.

[Moss 1994] F. Moss. Chaos under control, Nature 370:596-597, 1994.

[Ogorzalek 1993] M. J. Ogorzalek.
Taming chaos - Part II: control, IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications 40(10):700-706, October 1993.

[Oliveira 1998] A. Guedes de Oliveira and Antonia J. Jones. Synchronisation of chaotic maps by feedback control and application to secure communications using chaotic neural networks, International Journal of Bifurcation and Chaos 8(11), November 1998.

[Otani 1997] M. Otani and Antonia J. Jones. Guiding chaotic orbits, Research Report, Department of Computer Science, University of Wales, Cardiff, December 1997.

[Ott 1990] E. Ott, C. Grebogi and J. A. Yorke. Controlling chaos, Physical Review Letters 64(11):1196-1199, 1990.

[Ott 1994] E. Ott, T. Sauer and J. A. Yorke. Coping with Chaos, John Wiley & Sons, Canada, 1994.

[Parker 1989] T. S. Parker and L. O. Chua. Practical Numerical Algorithms for Chaotic Systems, Springer-Verlag,
New York, 1989.

[Parlitz 1985] U. Parlitz and W. Lauterborn. Superstructure in the bifurcation set of the Duffing equation, Physics Letters A 107(8):351-355, 1985.

[Pearson 1986] C. E. Pearson. Numerical Methods in Engineering and Science, Van Nostrand Reinhold, England, 1986.

[Peinke 1992] J. Peinke, J. Parisi, O. E. Rössler and R. Stoop. Encounter with Chaos: Self-Organized Hierarchical Complexity in Semiconductor Experiments, Springer-Verlag, 1992.

[Petrov 1993] V. Petrov, V. Gaspar, J. Masere and K. Showalter. Controlling chaos in the Belousov-Zhabotinsky reaction, Nature 361:240-243, 1993.

[Pfister 1992] G. Pfister, Th. Buzug and N. Enge. Characterization of experimental time series from Taylor-Couette flow, Physica D 58:441-454, 1992.

[Provenzale 1992] A. Provenzale, L. A. Smith, R. Vio and G. Murante. Distinguishing between low-dimensional dynamics and randomness in measured time series, Physica D 58:31-49, 1992.

[Pyragas 1992] K. Pyragas. Continuous control of chaos by self-controlling feedback, Physics Letters A 170:421-428, 1992.

[Romeiras 1992] F. J. Romeiras, C. Grebogi, E. Ott and W. P. Dayawansa. Controlling chaotic dynamical systems, Physica D 58:165-192, 1992.

[Rössler 1976] O. E. Rössler. An equation for continuous chaos, Physics Letters A 57:397-398, 1976.

[Roy 1992] R. Roy, T. W. Murphy, T. D. Maier, Z. Gills and E. R. Hunt. Dynamical control of a chaotic laser: experimental stabilization of a globally coupled system, Physical Review Letters 68(9):1259-1262, 1992.

[Roy 1994] R. Roy and K. S. Thornburg, Jr. Experimental synchronization of chaotic lasers, Physical Review Letters 72(13):2009-2012, 1994.

[Ruelle 1980] D. Ruelle. Strange attractors, The Mathematical Intelligencer 2:126-137, 1980.

[Russell 1980] D. A. Russell, J. D. Hansen and E. Ott. Dimensions of strange attractors, Physical Review Letters 45:1175-1178, 1980.

[Sano 1985] M. Sano and Y. Sawada.
Measurement of the Lyapunov spectrum from a chaotic time series, Physical Review Letters 55(10):1082-1085, 1985.

[Schiff 1994] S. J. Schiff, K. Jerger, D. H. Duong, T. Chang, M. L. Spano and W. L. Ditto. Controlling chaos in the brain, Nature 370:615-620, 1994.

[Schwartz 1994] I. B. Schwartz and I. Triandaf. Controlling unstable states in reaction-diffusion systems modeled by time series, Physical Review E 50(4):2548-2552, 1994.

[Sepulchre 1993] J. A. Sepulchre and A. Babloyantz. Controlling chaos in a network of oscillators, Physical Review E 48(2):945-950, 1993.

[Shinbrot 1990] T. Shinbrot, E. Ott, C. Grebogi and J. A. Yorke. Using chaos to direct trajectories to targets, Physical Review Letters 65(26):3215-3218, 1990.
[Shinbrot 1993] T. Shinbrot, C. Grebogi, E. Ott and J. A. Yorke. Using small perturbations to control chaos, Nature 363:411-417, 1993.

[Shinbrot 1992a] T. Shinbrot, W. Ditto, C. Grebogi, E. Ott, M. Spano and J. A. Yorke. Using the sensitive dependence of chaos (the "butterfly effect") to direct trajectories in an experimental chaotic system, Physical Review Letters 68(19):2863-2866, 1992.

[Shinbrot 1992b] T. Shinbrot, C. Grebogi, E. Ott and J. A. Yorke. Using chaos to target stationary states of flows, Physics Letters A 196:349-354, 1992.

[Singer 1991] J. Singer, Y-Z. Wang and H. H. Bau. Controlling a chaotic system, Physical Review Letters 66(9):1123-1125, 1991.

[Smith 1986] W. A. Smith. Elementary Numerical Analysis, Prentice-Hall, England, 1986.

[Solé 1995] R. V. Solé and L. Menéndez de la Prida. Controlling chaos in discrete neural networks, Physics Letters A 199:65-69, 1995.

[Stewart 1989] I. Stewart. Does God Play Dice? The New Mathematics of Chaos, Penguin Books, England, 1989.

[Stewart 1996] I. Stewart. From Here to Infinity: A Guide to Today's Mathematics, Oxford University Press, England, 1996.

[Thompson 1994] J. M. T. Thompson and S. R. Bishop. Nonlinearity and Chaos in Engineering Dynamics, John Wiley & Sons, England, 1994.

[Tsui 1999] A. P. Tsui and Antonia J. Jones. Periodic response to external stimulation of a chaotic neural network with delayed feedback, International Journal of Bifurcation and Chaos 9(4), April 1999.

[Tsui 2002] A. P. Tsui, Antonia J. Jones and A. Guedes de Oliveira. The construction of smooth models using irregular embeddings determined by a Gamma test analysis, Neural Computing & Applications 10(4):318-329, April 2002. ISSN 0941-0643.

[Wang 1991] X. Wang. Period-doublings to chaos in a simple neural network, 1991 IEEE INNS International Joint Conference on Neural Networks - Seattle, Vol II:333-339, 1991.

[Welstead 1991] S. T.
Welstead. Multilayer feedforward networks can learn strange attractors, 1991 IEEE INNS International Joint Conference on Neural Networks - Seattle, Vol II:139-144, 1991.
COURSEWORK

This work should be handed in three weeks before the Easter break. It is suggested that you work on questions as the subject matter is covered in the course.

1.(a) (i) Describe the genetic operators for crossover and mutation applied to genotypes consisting of fixed length strings.

(ii) Give a detailed description of a typical genetic algorithm.

(iii) Explain the different roles played by crossover and mutation in the process of genetic search.

(b) What are the main design problems in constructing a GA for a particular problem? What simple checks would you suggest before running a full test of a new genetic algorithm (GA) to verify that it has some chance of performing adequately? [4]

(c) (i) "Darwin's theory of evolution is supposed to explain the diversity of species. If by definition two members of different species cannot interbreed then increasing specialization is the only plausible mechanism for species creation. Very specialized species are vulnerable to external changes and so in a dynamic environment might not be expected to survive in the long run. This would tend to decrease diversity rather than increase it." Discuss in two or three sentences.

(ii) In the light of the above discussion identify features which might be present in natural evolution which are absent in genetic algorithms.

2(a). Let

    I = | i11  i12 |
        | i21  i22 |

denote the input to a binary retina. Show that it is impossible to find a set of weights W = (wjk), with the wjk real numbers, 1 ≤ j, k ≤ 2, and a threshold θ, such that the single linear function

    P(I) = 1 if Σj,k wjk ijk > θ
    P(I) = 0 if Σj,k wjk ijk < θ

can discriminate between horizontal and vertical lines.

(b) An alternative system is proposed in which a 2-tuple WISARD classifier with a 1-1 mapping is used. Investigate the possible mappings and show that two-thirds of these lead to a system which discriminates perfectly between horizontal and vertical lines.
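The impossibility in 2(a) rests on a simple symmetry: the two horizontal-line patterns and the two vertical-line patterns use each retina pixel exactly once per class, so any linear functional takes the same summed value over both classes, and no threshold can put one class strictly above it and the other strictly below. The following sketch (the row-major pattern encodings are the obvious ones) checks this numerically for random weights:

```python
import random

HORIZONTAL = [(1, 1, 0, 0), (0, 0, 1, 1)]   # top row, bottom row
VERTICAL   = [(1, 0, 1, 0), (0, 1, 0, 1)]   # left column, right column

def activation(w, pattern):
    """Linear part of P(I): sum over the retina of w_jk * i_jk."""
    return sum(wi * pi for wi, pi in zip(w, pattern))

random.seed(0)
for _ in range(1000):
    w = [random.uniform(-1, 1) for _ in range(4)]
    s_h = sum(activation(w, p) for p in HORIZONTAL)
    s_v = sum(activation(w, p) for p in VERTICAL)
    # both sums equal w1 + w2 + w3 + w4, so no threshold strictly
    # separates the two classes
    assert abs(s_h - s_v) < 1e-12
```

This is the same obstruction as in the XOR problem: the class means coincide, so the classes are not linearly separable.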
(c) Suppose that for some given two-class problem on a 512 × 512 retina suitable weights can be found for the single linear classifier system. Using 8 bits per weight, estimate the storage requirements for such a system. Allowing 10 microseconds per multiplication and 3 microseconds per addition and temporary storage overheads, calculate the
classification time. Would it be suitable for industrial application?

(d) Perform similar calculations, relating to storage and response time, for a WISARD 2-tuple system with a 1-1 mapping. Assume a conventional 16-bit architecture with an access time of 1 microsecond.

3(a). Briefly describe the main features of a Hopfield network: your discussion should include definitions for the update rule and energy function.

(b) How are the weights of a Hopfield network usually assigned for a pattern recognition problem and what are two limitations of this approach?

(c) It is proposed to apply some style of Hopfield network to the problem of finding a minimal tour for the geometric TSP. Suggest and discuss one method of assigning network states to tours and briefly describe how the weights of the network can be related to the distances between cities.

4(a). (i) Why are hidden units necessary to solve general problems of function modelling using feedforward neural networks?

(ii) What is the main difficulty in constructing a learning rule for feedforward networks with hidden units?

(b) (i) Describe the backpropagation rule for learning using a feedforward network with one hidden layer.

Notation. The following functions, and hence all their partial derivatives, are assumed known.

Error function.

    E(z1, z2, ..., zn, t1, t2, ..., tn)     (24)

Here z1, z2, ..., zn are the outputs from the output layer units and t1, t2, ..., tn are the target outputs.

Activation function. Output layer:

    netj = netj(y1, y2, ..., ym, pj1, ..., pjt)     (25)

Here y1, y2, ..., ym are the outputs from the previous layer and pj1, ..., pjt are parameters associated with the jth node of the output layer. Frequently t = t(m), i.e. the number of parameters associated with a node is a function of the number of inputs.

Previous layer:

    neti = neti(x1, x2, ..., xl, pi1, ..., pis)     (26)

Here x1, x2, ..., xl are the outputs from the layer prior to the previous layer.
Output function. Output layer:

    zj = f(netj)  (1 ≤ j ≤ n)     (27)
Previous layer:

    yi = f(neti)  (1 ≤ i ≤ m)     (28)

(ii) What are the strengths and weaknesses of the backpropagation method?

(c) The N-bit parity problem is this: given any vector x = (x1, x2, ..., xN), where xi ∈ {0, 1} (i.e. x is a vector of 0's and 1's), determine the parity of x. The parity of x is defined as 0 if the number of 1's in x is even and 1 if the number of 1's is odd. Construct a feedforward network to solve the N-bit parity problem.
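One classical hand-built solution to the parity question (a sketch of one possible construction, not the only one) uses N hidden threshold units that count the number of active inputs, with alternating ±1 output weights:

```python
from itertools import product

def step(v):
    """Hard threshold activation."""
    return 1 if v > 0 else 0

def parity_net(x):
    """Feedforward net: hidden unit j fires iff at least j inputs are 1
    (all input weights 1, threshold j - 0.5); output weights alternate
    +1, -1, +1, ... with output threshold 0.5."""
    s = sum(x)                      # common net input to every hidden unit
    hidden = [step(s - (j - 0.5)) for j in range(1, len(x) + 1)]
    net = sum((-1) ** j * h for j, h in enumerate(hidden))
    return step(net - 0.5)

# exhaustive check for N = 4
assert all(parity_net(x) == sum(x) % 2 for x in product((0, 1), repeat=4))
```

With s inputs on, exactly the first s hidden units fire, so the output unit's net input is the alternating sum 1 - 1 + 1 - ... with s terms, which is 1 when s is odd and 0 when s is even.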
CMT563 Sample Examination Paper

Time Allowed: 2 hours. Answer THREE questions.

1.(a) Give a high level pseudo-code description of a typical genetic algorithm. [5]

(b) Describe the genetic operators for crossover and mutation applied to genotypes consisting of fixed length strings. [5]

(c) Discuss three criteria by which one might evaluate the effectiveness of a heuristic search algorithm, such as a genetic algorithm, when applied to a hard combinatoric search problem. Briefly comment on each of these criteria. [5]

(d) What simple checks would you suggest before running a full test of a new genetic algorithm (GA) to verify that the representation and crossover operator which have been selected have a reasonable possibility of ensuring that the GA performs better than random search. [5]

2(a). Describe the Hopfield model for a fully interconnected asynchronous network. Your description should include a definition of the update rule for individual neurons and the method for selecting which neuron to update at the next step. [8]

(b) The energy of a Hopfield network with weights wij from the jth node to the ith node, with wii = 0 for all i, wij = wji for all i, j, and thresholds θi, when the network is in state x = (x1, ..., xn) ∈ {0, 1}^n, where n is the number of neurons, is defined as

    E = -(1/2) Σi≠j wij xi xj + Σi θi xi

Show that under the rules described in (a) the network will iterate to a state at which the energy is a local minimum and stay there. [8]

(c) (i) If the Hopfield model is used as a self-associative memory how are the weights determined from the patterns? [2]

(ii) What problems are encountered as more memories are added and what is the practical upper limit for memory storage with a given Hopfield network? [2]
3(a). Describe the Wisard model for pattern recognition. State some advantages of this model over other methods of pattern recognition. [10]

(b) Explain how the storage requirements and response of a Wisard network alter as the size of the n-tuple increases. [5]

(c) Suppose that for some given two-class problem using a 512 × 512 retina a hardware Wisard system is simulated using a conventional 16-bit serial architecture with a single 1 MHz processor. Allowing 1 microsecond for access, 10 microseconds per multiplication and 3 microseconds per addition and temporary storage overheads, estimate the storage requirements and calculate the classification time. Would it be suitable for industrial application? [5]

4(a). Briefly describe the backpropagation algorithm (detailed equations are not required but you should explain what weight adjustments depend on at each step) and discuss its strengths and weaknesses. [8]

(b) Suppose that the training data for a feedforward network is derived from a process which can be modelled by a smooth function f from inputs to the single output y, and that in the data y is subjected to measurement error r with mean zero. Identify the principal factors which will determine the best Mean Squared Error that a trained network can achieve when tested on an unseen set of data drawn from the same process. [4]

(c) Briefly describe the Gamma test. Give an example of the type of problem where it would be appropriate to use the Gamma test and an example where it would not be appropriate. [8]
Solutions

1. (a) Give a high level pseudo-code description of a typical genetic algorithm. [5]

1. Randomly generate a population of M structures S(0) = {s(1,0), ..., s(M,0)}.
2. For each new string s(i,t) in S(t), compute and save its measure of utility v(s(i,t)).
3. For each s(i,t) in S(t) compute the selection probability defined by p(i,t) = v(s(i,t)) / Σi v(s(i,t)).
4. Generate a new population S(t+1) by selecting structures from S(t) via the selection probability distribution and applying the idealised genetic operators to the structures generated.
5. Goto 2.

Algorithm 7-1 Generic GA

(b) Describe the genetic operators for crossover and mutation applied to genotypes consisting of fixed length strings. [5]

For the moment we represent strings as a1a2a3...al [ai = 1 or 0]. Using this notation we can describe the operators by the cut points at which strings are combined to produce new strings.

Crossover. In crossover one or more cut points are selected at random and the operation illustrated below (where two cut points are employed) is used to create two children:

    Parent 1   1011 010011 10111
    Parent 2   1100 111000 11010
    Child 1    1100 010011 11010
    Child 2    1011 111000 10111

    Mutation:   110011100011010 -> 111011101011010
    Inversion:  111111100011010 -> 110011111011010

    Standard genetic operators.

A variety of control regimes are possible, but a simple strategy might be `select one of the children at random to go into the next generation'. Crossing over proceeds in three steps:

a) Two structures a1...al and b1...bl are selected at random from the current population.

b) A crossover point x, in the range 1 to l-1, is selected, again at random.
  • 86. CMT563 c) Two new structures a1a2...axbx+1bx+2...bl b1b2...bxax+1ax+2...al are formed. Children tend to be ‘like’ their parents, so that crossover can be considered as a focussing operator which exploits knowledge already gained, its effects are quite quickly apparent. Mutation. In mutation an allele is altered at each site with some fixed probability. Each structure a1a2...al in the population is operated upon as follows. Position x is modified, with probability p independent of the other positions, so that the string is replaced by a1a2...ax-1 z ax+1...al where z is drawn at random from the possible values. If p is the probability of mutation at a single position then the probability of h mutations in a given string is determined by a Poisson distribution with parameter p. Mutation disperses the population throughout the search space and so might be considered as an information gathering or exploration operator. Search by mutation is a slow process analogous to exhaustive search. Thus mutation is a `background' operator, assuring that the crossover operator has a full range of alleles so that the adaptive plan is not trapped on local optima. (c) Discuss three criteria by which one might evaluate the effectiveness of a heuristic search algorithm, such as a genetic algorithm, when applied to a hard combinatoric search problem. Briefly comment on each of these criteria. [5] The three principal criteria are: solution quality; scaling in run time and memory for a given solution quality; absolute run time should be acceptable. Solution quality is often hard to measure: for the TSP we might use the Held-Karp lower bound. For hard combinatoric search a scaling of O(NlogN), where N is a measure of problem size, is normally the best that can be achieved - anything worse than this results in unacceptable run times for large problems. 
Acceptable absolute run time is a function of the commercial benefits and time available - for some early VLSI layout designs supercomputers were used with run times of several months. More usually a run time of several days is the most that is acceptable.

(d) What simple checks would you suggest before running a full test of a new genetic algorithm (GA) to verify that the representation and crossover operator which have been selected have a reasonable possibility of ensuring that the GA performs better than random search. [5]

One simple check is to run several thousand crossover events with randomly selected parents and record child_fitness versus mean_parental_fitness: if the resulting point plot/correlation calculation reveals little or no correlation between the two, then the combination of representation and crossover is unlikely to produce a GA that works any better than random search. On the other hand, if there is a good correlation between child_fitness and mean_parental_fitness, then the mechanisms of evolutionary search have some chance of being effective.

2. (a) Describe the Hopfield model for a fully interconnected asynchronous network. Your description should include a definition of the update rule for individual neurons and the method for selecting which neuron to update at the next step. [8]

Each of n neurons has two states, like those of McCulloch and Pitts: xi = 0 (not firing) and xi = 1 (firing at maximum rate). When neuron i has a connection made to it from neuron j, the strength of the connection is defined
as wij. Non-connected neurons have wij = 0. The instantaneous state of the system is specified by listing the n values of xi, so it is represented by a binary word of n bits.

The state changes in time according to the following algorithm. For each neuron i there is a fixed threshold θi. Each neuron readjusts its state randomly in time, but with a mean attempt rate µ, setting

    xi(t) = 1          if Σj≠i wij xj(t-1) > θi
    xi(t) = xi(t-1)    if Σj≠i wij xj(t-1) = θi     (2)
    xi(t) = 0          if Σj≠i wij xj(t-1) < θi

Thus, an element chosen at random asynchronously evaluates whether it is above or below threshold and readjusts accordingly. The procedure is described below.

Procedure Hopfield (assumes weights are assigned)
    Randomise initial state x ∈ {0, 1}^n.
    Repeat until updating every unit produces no change of state:
        Select unit i (1 ≤ i ≤ n) with uniform random probability.
        Update unit i according to (2).
    End

Algorithm 7-2 Generic Hopfield net.

(b) The energy of a Hopfield network with weights wij from the jth node to the ith node, with wii = 0 for all i, wij = wji for all i, j, and thresholds θi, when the network is in state x = (x1, ..., xn) ∈ {0, 1}^n, where n is the number of neurons, is defined as

    E = -(1/2) Σi≠j wij xi xj + Σi θi xi     (3)

Show that under the rules described in (a) the network will iterate to a state at which the energy is a local minimum and stay there. [8]

A low rate of neural firing is approximated by assuming that only one unit changes state at any given moment. Then, since wij = wji, the change ΔE due to Δxi is given by

    ΔE = -Δxi ( Σj≠i wij xj - θi )     (4)

Now consider the effect of the threshold rule (2). If the unit changes state at all then Δxi = ±1. If Δxi = 1, the unit changes state from 0 to 1, hence by the threshold rule

    Σj≠i wij xj > θi     (5)

in which case, by (4), ΔE < 0. Alternatively, Δxi = -1, the unit therefore changes state from 1 to 0, and

    Σj≠i wij xj < θi     (6)

and again ΔE < 0.
Thus if a unit changes state at all the energy must decrease. State changes will continue until a locally least E is reached, and the network then stays in that state. The energy is playing the role of a Hamiltonian in the more general dynamical systems context.

(c) (i) If the Hopfield model is used as an auto-associative memory, how are the weights determined from the patterns? [2]
An early rule used for memory storage in associative memory models can also be used to store memories in the Hopfield model. If the {0, 1} node outputs are mapped to x_i ∈ {-1, +1}, the rule assigns the weights as follows. For each pattern vector x which we require to memorise, we consider the matrix

    x x^T = (x_1, x_2, ..., x_n)^T (x_1, x_2, ..., x_n) =

        | x_1 x_1   x_1 x_2   ...   x_1 x_n |
        | x_2 x_1   x_2 x_2   ...   x_2 x_n |
        |    .         .      ...      .    |        (7)
        | x_n x_1   x_n x_2   ...   x_n x_n |

and then average these matrices over all pattern vectors (prototypes). We then set w_ii = 0 and note that the resulting matrix is symmetric. In this way we can capture the average correlations between components of the pattern vectors and then use this information, during the operation of the network, to recapture missing or corrupted components.

(ii) What problems are encountered as more memories are added and what is the practical upper limit for memory storage with a given Hopfield network? [2]

The main problem is that as the number of patterns P increases we find that an exponentially increasing number of `spurious' local minima are introduced, i.e. minima not associated with a stored pattern. When P is approximately 0.15n, where n is the number of nodes in the network, there is a dramatic degradation in the ability of the network to recall noisy patterns.

3. (a) Describe the Wisard model for pattern recognition.

[Figure: Schematic of a 3-tuple recogniser]

The scheme outlined above was first proposed by Aleksander and Stonham in [Aleksander 1979]. The sample data
to be recognized is stored as a 2-dimensional array (the 'retina') of binary elements. Random connections are made onto the elements of the array, N such connections being grouped together to form an N-tuple which is used to address one random access memory (RAM) per discriminator. In this way a large number of RAMs are grouped together to form a class discriminator whose output, or score, is the sum of all its RAMs' outputs. This configuration is repeated to give one discriminator for each class of pattern to be recognized. The RAMs implement logic functions which are set up during training; thus the method does not involve any direct storage of pattern data.

The system is trained using samples of patterns from each class. A pattern is stored into the retina array and a logical 1 is written into the RAMs of the discriminator associated with the class of this training pattern at the locations addressed by the N-tuples. This is repeated many times, typically 25-50 times, for each class.

In recognition mode, the unknown pattern is stored in the array and the RAMs of every discriminator are put into READ mode. The input pattern then stimulates the logic functions in the discriminator network and an overall response is obtained by summing all the logical outputs. The pattern is then assigned to the class of the discriminator producing the highest score.

State some advantages of this model over other methods of pattern recognition.

Some advantages of the WISARD model for pattern recognition are:

•   Implementation as a parallel, or serial, system in currently available hardware is inexpensive and simple.

•   Given labelled samples of each recognition class, training times are extremely short.

•   The time required by a trained system to classify an unknown pattern is very small and, in a parallel implementation, is independent of the number of classes.
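The training and recognition procedure just described can be sketched in a few lines (an illustrative simulation only: the retina is flattened to a 1-D list, sets stand in for 1-bit RAMs, and the class and function names are invented for this example):

```python
import random

class Discriminator:
    """One class discriminator: p/n RAMs, each addressed by an n-tuple of pixels."""

    def __init__(self, tuples):
        self.tuples = tuples                 # fixed random grouping of pixel indices
        self.rams = [set() for _ in tuples]  # each set holds addresses written with 1

    def address(self, image, tup):
        return tuple(image[k] for k in tup)  # the n-tuple of pixel values is the address

    def train(self, image):
        for ram, tup in zip(self.rams, self.tuples):
            ram.add(self.address(image, tup))  # write a logical 1 at this address

    def score(self, image):
        # READ mode: sum of the logical outputs of all RAMs
        return sum(self.address(image, tup) in ram
                   for ram, tup in zip(self.rams, self.tuples))

def make_tuples(p, n, seed=0):
    """Randomly group p pixel indices into p/n disjoint n-tuples."""
    idx = list(range(p))
    random.Random(seed).shuffle(idx)
    return [tuple(idx[k:k + n]) for k in range(0, p, n)]
```

In use, one Discriminator is built per class (all sharing the same retina size but with their own RAMs), each is trained on labelled samples of its class, and an unknown pattern is assigned to the class whose discriminator produces the highest score.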
(b) Explain how the storage requirements and response of a Wisard network alter as the size of the n-tuple increases.

The total amount of memory needed for the discriminator RAMs is easily calculated. Let n be the number of inputs to each RAM (the tuple size), C the number of classes, and assume a 1-1 mapping of a retina with p binary pixels. For convenience we assume that n divides p. Then the number of 1-bit RAMs per discriminator will be p/n and each RAM has 2^n bits of storage. So the number of bits per discriminator is (p/n)·2^n. Hence the total number of bits for C classes is

    (p/n)·2^n·C        (8)

The response of a discriminator becomes more sensitive to precise similarities of a pattern to patterns from the corresponding training class as n increases.

(c) Suppose that for some given two-class problem using a 512 × 512 retina a hardware Wisard system is simulated using a conventional 16-bit serial architecture with a single 1 MHz processor. Using 8 bits per weight, estimate the storage requirements for such a system. Allowing 1 micro sec for access, 10 micro sec for multiplication and 3 micro sec per addition and temporary storage overheads, estimate the storage requirements and calculate the classification time. Would it be suitable for industrial application? [5]

Storage requirements for the n-tuple system:

    (No. of classes) × (Size of image) × 2^n/n

With two classes and n = 2 this gives
    2 × 512 × 512 × 2^2/2 = 10^6 bits (approx).

Response time: With a conventional 16-bit architecture, the computation time is mainly one of storage access (once per n-tuple per pattern class). Taking 1 micro sec per access we have

    512 × 512 × 2 × 10^-6/2 = 1/4 sec (approx).

This is a bit slow for an industrial application (e.g. an assembly line).

4. (a) Briefly describe the backpropagation algorithm (detailed equations are not required but you should explain what weight adjustments depend on at each step) and discuss its strengths and weaknesses. [6]

As is well known, there is no advantage in using several layers if the units have linear activation functions. Since the delta rule is a modification of gradient descent we need to consider derivatives, and the activation functions of linear threshold units are not differentiable (their derivative is infinite at the threshold and zero elsewhere). We therefore consider semilinear activation functions. A semilinear activation function f_i(net_i) is one in which the output of the unit is a non-decreasing and differentiable function of the net input to the unit.

Suppose the ith unit is an output unit. Let o_pi be the output produced by the ith unit when pattern p is presented and t_pi be the target output. In this case we set the error to be

    δ_pi = (t_pi - o_pi) f_i'(net_pi)        (for an output unit)

and the weight change for the weight associated with the jth input to the ith unit is then

    Δ_p w_ij = η δ_pi o_pj

where η > 0 is the learning rate. The error signal for a hidden unit is determined recursively in terms of the error signals of the units to which it directly connects and the weights of those connections, i.e.

    δ_pi = f_i'(net_pi) Σ_k δ_pk w_ki        (for a hidden unit)

where k ranges over those units to which the ith unit delivers outputs. The weight change is then computed as before.
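The two δ rules above can be sketched for a single-hidden-layer network of sigmoid units, for which f'(net) = o(1 - o). This is a minimal sketch; the weight layout and function names are assumptions of the example, not part of the notes:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(W1, W2, x):
    """One hidden layer, one sigmoid output; returns hidden activations and output."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    o = sigmoid(sum(w * hi for w, hi in zip(W2, h)))
    return h, o

def backprop_deltas(W2, h, o, t):
    """Output delta (t - o) f'(net); hidden deltas folded back through W2."""
    d_out = (t - o) * o * (1 - o)               # f'(net) = o(1 - o) for the sigmoid
    d_hidden = [hi * (1 - hi) * d_out * w for hi, w in zip(h, W2)]
    return d_out, d_hidden

def grads(W1, W2, x, t):
    """Weight changes delta_i * o_j (eta = 1); this is the negative gradient
    of the error 0.5 (t - o)^2, i.e. the descent direction."""
    h, o = forward(W1, W2, x)
    d_out, d_hid = backprop_deltas(W2, h, o, t)
    g2 = [d_out * hi for hi in h]               # output-layer weight changes
    g1 = [[d * xi for xi in x] for d in d_hid]  # hidden-layer weight changes
    return g1, g2
```

A useful sanity check on any such implementation is to compare the backpropagated weight change against a finite-difference estimate of the error gradient: the two should agree to within the finite-difference error.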
Thus the implementation of back propagation involves a forward pass through the layers to estimate the error, and then a backward pass modifying the synapses to decrease the error. Practical implementations are not difficult, but without modification the algorithm is still rather slow, especially for systems with many layers. Still, it is at present the most popular learning algorithm for multilayer networks, and it is considerably faster than the Boltzmann machine.

(b) Suppose that the training data for a feedforward network is derived from a process which can be modelled by a smooth function f from inputs to the single output y, and that in the data y is subjected to measurement error r with mean zero. Identify the principal factors which will determine the best Mean Squared Error that a trained network can achieve when tested on an unseen set of data drawn from the same process. [4]

The function f can be regarded as a surface in m dimensions which we seek to approximate by a neural network whose input-to-output mapping we can call g. The training data gives a noisy model of the surface. It is plain that there is no point in seeking to train the network beyond the stage where the Mean Squared Error (MSE) on the training data is less than Var(r), the variance of r, since this corresponds to overfitting. This then will be the best
  • 91. CMT563 MSE possible. The principal factors determining whether or we can train g to have MSE . Var(r) are: ! The `bumpiness' of the surface defined by f. To accurately model a very bumpy surface obviously requires more data points. ! The size of Var(r). The larger Var(r) becomes the less information is contained in any given data point. When Var(r) becomes comparable to the range of y very little information regarding f is retained in the training data. ! The size of the training set. A very bumpy noisy surface will require a very large training set. (c) Briefly describe the Gamma test. Give an example of the type of problem when it would be appropriate to use the Gamma test and an example where it would not be appropriate. [10] [Any reasonable summary of the following] Given data samples (x(i), y(i)), where x(i) = (x1(i), ..., xm(i)), 1 i # # M, let N[i, p] be the list of (equidistant) p th nearest neighbours to x(i). We write M M 1 1 1 δ (p) ' j j x(j) & x(i) 2 ' j x(N[i, p]) & x(i) 2 (12) M L(N[i, p]) M i ' 1 j 0 N[i, p] i ' 1 where L(N[i, p]) is the length of the list N[i, p]. Thus (p) is the mean square distance to the pth nearest neighbour. * Nearest neighbour lists for pth nearest neighbours (1 p pmax), typically we take pmax in the range 20-50, can be # # found in O(MlogM) time using techniques developed by Bentley, see for example [Friedman 1977]. We also write M 1 1 γ (p) ' j j (y(j) & y(i))2 (13) 2M i ' 1 L(N[i,p]) j 0 N[i,p] where the y observations are subject to statistical noise assumed independent of x and having bounded variance.10 Under reasonable conditions one can show that Var(r) γ . % A δ % o( ) δ as M 46 (14) where the convergence is in probability. The Gamma Test computes the mean-squared pth nearest neighbour distances (p) (1 p pmax, typically pmax * # # . 10) and the corresponding (p). Next the ( (p), (p)) regression line is computed and the vertical intercept is ( * ( returned as the gamma value. 
Effectively this is the limit of γ(p) as δ(p) → 0, which in theory is Var(r).

A technique which allows one to estimate Var(r), on the hypothesis of an underlying continuous or smooth model f, is of considerable practical utility in applications such as control or time series modelling. The implication of

[10] The original version in [Aðalbjörn Stefánsson 1997] and [Končar 1997] used smoothed versions of δ(p) and γ(p) which rolled off the significance of more distant near neighbours. Later experience showed that this complication was largely unnecessary and the version of the software used here is implemented as described above.
being able to estimate Var(r) in neural network modelling is that one does not then need to train the network (or indeed construct any smooth data model at all) in order to predict the best possible performance with reasonable accuracy.

An appropriate problem type would be one in which the output is expected to be a smooth function of continuously varying inputs. An inappropriate problem type would be one in which many of the inputs take categorical values (e.g. 0/1).
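The computation in (c) can be sketched as follows. This is an illustrative simplification only: it uses a single pth nearest neighbour per point (ignoring the equidistant lists N[i, p]) and a brute-force O(M²) neighbour search rather than the O(M log M) kd-tree method cited above; the function name is invented for the example:

```python
def gamma_test(X, y, pmax=10):
    """Sketch of the Gamma test: delta(p), gamma(p) and the regression intercept."""
    M = len(X)

    def d2(a, b):  # squared Euclidean distance
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    # For each point, all other points sorted by squared distance.
    order = [sorted((d2(X[i], X[j]), j) for j in range(M) if j != i)
             for i in range(M)]

    deltas, gammas = [], []
    for p in range(1, pmax + 1):
        # delta(p): mean squared distance to the pth nearest neighbour, as in (12)
        deltas.append(sum(order[i][p - 1][0] for i in range(M)) / M)
        # gamma(p): half mean squared difference of the y values, as in (13)
        gammas.append(sum((y[order[i][p - 1][1]] - y[i]) ** 2
                          for i in range(M)) / (2 * M))

    # Least-squares regression of gamma on delta; the vertical intercept
    # is returned as the gamma value (the estimate of Var(r)).
    k = len(deltas)
    mx, my = sum(deltas) / k, sum(gammas) / k
    slope = (sum((d - mx) * (g - my) for d, g in zip(deltas, gammas))
             / sum((d - mx) ** 2 for d in deltas))
    return my - slope * mx
```

On noise-free data from a smooth function the returned intercept should be close to zero, while adding noise of variance Var(r) to the outputs moves the intercept towards Var(r), which is what (14) predicts.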