Process Metallurgy 12
EXPLORATORY ANALYSIS OF METALLURGICAL
PROCESS DATA WITH NEURAL NETWORKS AND
RELATED METHODS
Process Metallurgy
Advisory Editors: A.W. Ashbrook and G.M. Ritcey
1
G.M. RITCEY and A.W. ASHBROOK
Solvent Extraction: Principles and Applications to Process Metallurgy,
Part I and Part II
2
P.A. WRIGHT
Extractive Metallurgy of Tin (Second, completely revised edition)
3
I.H. WARREN (Editor)
Application of Polarization Measurements in the Control of Metal Deposition
4
R.W. LAWRENCE, R.M.R. BRANION and H.G. EBNER (Editors)
Fundamental and Applied Biohydrometallurgy
5
A.E. TORMA and I.H. GUNDILER (Editors)
Precious and Rare Metal Technologies
6
G.M. RITCEY
Tailings Management
7
T. SEKINE
Solvent Extraction 1990
8
C.K. GUPTA and N. KRISHNAMURTHY
Extractive Metallurgy of Vanadium
9
R. AMILS and A. BALLESTER (Editors)
Biohydrometallurgy and the Environment Toward the Mining of the 21st Century
Part A: Bioleaching, Microbiology
Part B: Molecular Biology, Biosorption, Bioremediation
10
P. BALÁŽ
Extractive Metallurgy of Activated Minerals
11
V.S.T. CIMINELLI and O. GARCIA Jr. (Editors)
Biohydrometallurgy: Fundamentals, Technology and Sustainable Development
Part A: Bioleaching, Microbiology and Molecular Biology
Part B: Biosorption and Bioremediation
Process Metallurgy 12
EXPLORATORY ANALYSIS OF
METALLURGICAL PROCESS DATA WITH
NEURAL NETWORKS AND
RELATED METHODS
C. Aldrich
University of Stellenbosch, South Africa
2002
ELSEVIER
Amsterdam • Boston • London • New York • Oxford • Paris • San Diego • San Francisco • Singapore • Sydney • Tokyo
ELSEVIER SCIENCE B.V.
Sara Burgerhartstraat 25
P.O. Box 211, 1000 AE Amsterdam, The Netherlands
© 2002 Elsevier Science B.V. All rights reserved.
This work is protected under copyright by Elsevier Science, and the following terms and conditions apply to its use:
Photocopying
Single photocopies of single chapters may be made for personal use as allowed by national copyright laws. Permission of the Publisher and
payment of a fee is required for all other photocopying, including multiple or systematic copying, copying for advertising or promotional purposes,
resale, and all forms of document delivery. Special rates are available for educational institutions that wish to make photocopies for non-profit
educational classroom use.
Permissions may be sought directly from Elsevier Science Global Rights Department, PO Box 800, Oxford OX5 1DX, UK; phone: (+44) 1865
843830, fax: (+44) 1865 853333, e-mail: permissions@elsevier.co.uk. You may also contact Global Rights directly through Elsevier's home page
(http://www.elsevier.nl), by selecting 'Obtaining Permissions'.
In the USA, users may clear permissions and make payments through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA
01923, USA; phone: (+1) (978) 7508400, fax: (+1) (978) 7504744, and in the UK through the Copyright Licensing Agency Rapid Clearance
Service (CLARCS), 90 Tottenham Court Road, London W1P 0LP, UK; phone: (+44) 207 631 5555; fax: (+44) 207 631 5500. Other countries may
have a local reprographic rights agency for payments.
Derivative Works
Tables of contents may be reproduced for internal circulation, but permission of Elsevier Science is required for external resale or distribution of
such material.
Permission of the Publisher is required for all other derivative works, including compilations and translations.
Electronic Storage or Usage
Permission of the Publisher is required to store or use electronically any material contained in this work, including any chapter or part of a chapter.
Except as outlined above, no part of this work may be reproduced, stored in a retrieval system or transmitted in any form or by any means,
electronic, mechanical, photocopying, recording or otherwise, without prior written permission of the Publisher.
Address permissions requests to: Elsevier Science Rights & Permissions Department, at the mail, fax and e-mail addresses noted above.
Notice
No responsibility is assumed by the Publisher for any injury and/or damage to persons or property as a matter of products liability, negligence or
otherwise, or from any use or operation of any methods, products, instructions or ideas contained in the material herein. Because of rapid advances
in the medical sciences, in particular, independent verification of diagnoses and drug dosages should be made.
First edition 2002
Library of Congress Cataloging in Publication Data
Exploratory analysis of metallurgical process data with neural networks and related
methods / edited by C. Aldrich. -- 1st ed.
p. cm. -- (Process metallurgy ; 12)
ISBN 0-444-50312-9
1. Metallurgy. 2. Metallurgical research. I. Aldrich, C. II. Series.
TN673 .E86 2002
669'.07'2--dc21
British Library Cataloguing in Publication Data
Exploratory analysis of metallurgical process data with
neural networks and related methods. - (Process metallurgy
;12)
1.Metallurgy - Data processing 2.Neural networks (Computer
science)
I. Aldrich, C.
669'.0285
2002016356
ISBN: 0 444 50312 9
The paper used in this publication meets the requirements of ANSI/NISO Z39.48-1992 (Permanence of Paper).
Printed in The Netherlands.
Preface
This book is concerned with the analysis and interpretation of multivariate measurements
commonly found in the mineral and metallurgical industries, with the emphasis on the use of
neural networks. Methods of multivariate analysis deal with reasonably large numbers of
measurements (i.e. variables) made on each entity in one or more samples simultaneously
(Dillon and Goldstein, 1984). In this respect, multivariate techniques differ from univariate or
bivariate techniques, in that they focus on the covariances or correlations of three or more
variables, instead of the means and variances of single variables, or the pairwise relationship
between two variables.
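To make the distinction concrete, the short Python sketch below (not taken from the book; the variable names and numbers are purely hypothetical) contrasts the univariate summaries of three process measurements with their covariance and correlation matrices, which is the kind of structure on which multivariate methods operate.

```python
# A minimal sketch (not from the book): three hypothetical process measurements
# made on the same 100 samples. Univariate analysis looks at each column's mean
# and variance; multivariate analysis works with the covariance/correlation
# structure of all the variables considered together.
import numpy as np

rng = np.random.default_rng(0)

feed_rate = rng.normal(50.0, 5.0, 100)                       # hypothetical t/h
temperature = 20.0 + 0.8 * feed_rate + rng.normal(0.0, 2.0, 100)
recovery = 120.0 - 0.5 * temperature + rng.normal(0.0, 3.0, 100)

X = np.column_stack([feed_rate, temperature, recovery])      # samples x variables

# Univariate view: means and variances of the single variables
print("means:     ", X.mean(axis=0))
print("variances: ", X.var(axis=0, ddof=1))

# Multivariate view: covariances and correlations of all three variables jointly
print("covariance matrix:\n", np.cov(X, rowvar=False))
print("correlation matrix:\n", np.corrcoef(X, rowvar=False))
```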
Neural networks can be seen as a legitimate part of statistics that fits snugly in the niche
between parametric and non-parametric methods. They are non-parametric, since they
generally do not require the specification of explicit process models, but are not quite as
unstructured as some statistical methods in that they adhere to a general class of models. In
this context, neural networks have been used to extend, rather than replace, regression models,
principal component analysis (Kramer, 1991, 1992), principal curves (Dong and McAvoy,
1996), partial least squares methods (Qin and McAvoy, 1992), as well as the visualization of
process data in several major ways, to name but a few. In addition, the argument that neural
networks are really highly parallelized neurocomputers or hardware devices and should
therefore be distinguished from statistical or other pattern recognition algorithms is not entirely
convincing. In the vast majority of cases neural networks are simulated on single processor
machines. There is no reason why other methods cannot also be simulated or executed in a
similar way (and indeed they are).
Since the book is aimed primarily at the practising metallurgist or process engineer, a
considerable part of it is of necessity devoted to the basic theory, which is introduced as
briefly as possible within the large scope of the field. Also, although the book focuses on
neural networks, they cannot be divorced from their statistical framework and the author has
gone to considerable lengths to discuss this. For example, at least a basic understanding of
fundamental statistical modelling (linear models) is necessary to appreciate the issues involved
in the development of nonlinear models. The book is therefore a blend of the basic theory
and some of the most recent advances in the practical application of neural networks.
Naturally, this preface would not be complete without an expression of my gratitude to the
many people who have been involved in the writing of this book in one way or another: my
graduate students, who have performed many of the experiments described in this book;
Juliana Steyl, who has helped in the final stages of the preparation of the book; and last but not
least, Annemarie and Melissa, who have had to bear with me, despite the wholly
underestimated time and effort required to finish this book.
Chris Aldrich
Stellenbosch
November 2001
Table of Contents
CHAPTER 1
INTRODUCTION TO NEURAL NETWORKS ..................................................................................................... 1
1.1. BACKGROUND ....................................................................................................................................... 1
1.2. ARTIFICIAL NEURAL NETWORKS FROM AN ENGINEERING PERSPECTIVE ........................... 2
1.3. BRIEF HISTORY OF NEURAL NETWORKS ....................................................................................... 5
1.4. STRUCTURES OF NEURAL NETWORKS ........................................................................................... 6
1.4.1. Models of single neurons .............................................................................................................7
1.4.2. Models of neural network structures ........................................................................................... 8
1.5. TRAINING RULES ..................................................................................................................................9
1.5.1. Supervised training .................................................................................................................... 11
a) Perceptron learning rule................................................................................................... 11
b) Delta and generalized delta rules ..................................................................................... 12
c) Widrow-Hoff learning rule................................................................................................ 15
d) Correlation learning rule .................................................................................................. 16
1.5.2. Unsupervised training ................................................................................................................ 16
a) Hebbian and anti-Hebbian learning rule.......................................................................... 16
b) Winner-takes-all rule ........................................................................................................ 17
c) Outstar learning rule ........................................................................................................ 18
1.6. NEURAL NETWORK MODELS ........................................................................................................... 19
1.6.1. Multilayer Perceptrons .............................................................................................................. 19
a) Basic structure .................................................................................................................. 19
b) Backpropagation algorithm ............................................................................................. 20
1.6.2. Kohonen self-organizing mapping (SOM) neural networks ...................................................... 21
a) Summary of the SOM algorithm (Kohonen) ...................................................................... 22
b) Properties of the SOM algorithm ...................................................................................... 23
1.6.3. Generative topographic maps ....................................................................................................23
1.6.4. Learning vector quantization neural networks .......................................................................... 25
1.6.5. Probabilistic neural networks ....................................................................................................26
1.6.6. Radial basis function neural networks .......................................................................................29
1.6.7. Adaptive resonance theory neural networks .............................................................................. 34
a) Network architecture ........................................................................................................35
b) Training of the network..................................................................................................... 36
1.6.8. Support vector machines ...........................................................................................................38
a) Linearly separable patterns .............................................................................................. 38
b) Nonlinearly separable patterns ......................................................................................... 39
c) Building support vector machines for pattern recognition (classification) problems ....... 41
d) Location of the optimal hyperplane .................................................................................. 42
e) Inner-product kernels .......................................................................................................43
f) Building support vector machines for nonlinear regression problems ............................. 43
1.7. NEURAL NETWORKS AND STATISTICAL MODELS .....................................................................45
1.8. APPLICATIONS IN THE PROCESS INDUSTRIES ............................................................................48
CHAPTER 2
TRAINING OF NEURAL NETWORKS ............................................................................................................. 50
2.1. GRADIENT DESCENT METHODS .....................................................................................................50
2.2. CONJUGATE GRADIENTS .................................................................................................................. 52
2.3. NEWTON'S METHOD AND QUASI-NEWTON METHOD ............................................................... 54
2.4. LEVENBERG-MARQUARDT ALGORITHM ..................................................................................... 56
2.5. STOCHASTIC METHODS ....................................................................................................................57
2.5.1. Simulated annealing ..................................................................................................................57
2.5.2. Genetic algorithms ....................................................................................................................59
2.6 REGULARIZATION AND PRUNING OF NEURAL NETWORK MODELS .................................... 62
2.6.1 Weight decay .............................................................................................................................62
2.6.2. Removal of weights ...................................................................................................................63
2.6.3 Approximate smoother ..............................................................................................................64
2.7 PRUNING ALGORITHMS FOR NEURAL NETWORKS ................................................................... 64
2.7.1 Hessian-based network pruning ................................................................................................64
2.7.2 Optimal brain damage and optimal brain surgeon algorithms .................................. 64
2.8. CONSTRUCTIVE ALGORITHMS FOR NEURAL NETWORKS ...................................................... 65
2.8.1 State space search ...................................................................................................................... 65
a) Initial state and goal state ................................................................................................ 65
b) Search strategy ................................................................................................................. 66
c) Generalized search spaces ............................................................................................... 67
2.8.2. Training algorithms ................................................................................................................... 68
a) Dynamic node creation (DNC) ......................................................................................... 68
b) Projection pursuit regression (PPR) ................................................................................ 69
c) Cascade correlation method ............................................................................................. 70
d) Resource allocating neural networks (RAN) .................................................................... 71
e) Group method of data handling ........................................................................................ 72
CHAPTER 3
LATENT VARIABLE METHODS .................................................................................................................. 74
3.1. BASICS OF LATENT STRUCTURE ANALYSIS ............................................................................... 74
3.2. PRINCIPAL COMPONENT ANALYSIS .............................................................................................. 75
3.2.1. Mathematical perspective ......................................................................................................... 76
3.2.2. Statistics associated with principal component analysis models ............................................... 77
3.2.3. Practical considerations regarding principal component analysis ............................................. 79
3.2.4. Interpretation of principal components ..................................................................................... 81
3.2.5. Simple examples of the application of principal component analysis ....................................... 82
a) Example 1 ......................................................................................................................... 82
b) Example 2 ......................................................................................................................... 86
3.3. NONLINEAR APPROACHES TO LATENT VARIABLE EXTRACTION ........................................ 89
3.4. PRINCIPAL COMPONENT ANALYSIS WITH NEURAL NETWORKS .......................................... 90
3.5. EXAMPLE 2: FEATURE EXTRACTION FROM DIGITISED IMAGES OF INDUSTRIAL
FLOTATION FROTHS WITH AUTOASSOCIATIVE NEURAL NETWORKS ................................ 92
3.6. ALTERNATIVE APPROACHES TO NONLINEAR PRINCIPAL COMPONENT ANALYSIS ........ 95
3.6.1. Principal curves and surfaces .................................................................................................... 95
a) Initialization ..................................................................................................................... 96
b) Projection ......................................................................................................................... 96
c) Expectation ....................................................................................................................... 97
3.6.2. Local principal component analysis .......................................................................................... 97
3.6.3. Kernel principal component analysis ........................................................................................ 98
3.7. EXAMPLE 1: LOW-DIMENSIONAL RECONSTRUCTION OF DATA WITH NONLINEAR
PRINCIPAL COMPONENT METHODS .............................................................................................. 99
3.8. PARTIAL LEAST SQUARES (PLS) MODELS .................................................................................. 100
3.9. MULTIVARIATE STATISTICAL PROCESS CONTROL ................................................................. 102
3.9.1. Multivariate Shewhart charts ................................................................................... 103
3.9.2. Multivariate CUSUM charts ................................................................................................... 104
3.9.3. Multivariate EWMA charts ..................................................................................................... 104
3.9.4. Multivariate statistical process control based on principal components .................................. 105
a) Principal component models........................................................................................... 105
b) Partial least squares models ........................................................................................... 108
c) Multidimensional NOC volume, VNOC ........................................................................... 109
3.9.5. Methodology for process monitoring ...................................................................................... 109
CHAPTER 4
REGRESSION MODELS ................................................................................................................................... 112
4.1. THEORETICAL BACKGROUND TO MODEL DEVELOPMENT ................................................... 113
4.1.1. Estimation of model parameters .............................................................................................. 113
4.1.2. Assumptions of regression models with best linear unbiased estimators (BLUE) .................. 114
4.2. REGRESSION AND CORRELATION ................................................................................................ 114
4.2.1 Multiple correlation coefficient, R2 and adjusted R2............................................................... 114
4.2.2 Adjustment of R2..................................................................................................................... 115
4.2.3 Analysis of residuals ............................................................................................................... 116
4.2.4 Confidence intervals of individual model coefficients ............................................................ 116
4.2.5. Joint confidence regions on model coefficients ...................................................................... 117
4.2.6 Confidence interval on the mean response .............................................................................. 117
4.2.7 Confidence interval on individual predicted responses ........................................................... 117
4.2.8. Prediction intervals for neural networks .................................................................................. 117
4.3. MULTICOLLINEARITY ..................................................................................................................... 119
4.3.1. Historic approaches to the detection of multicollinearity ........................................................ 119
a) Methods based on the correlation coefficient matrix ...................................................... 119
b) Characteristics of regression coefficients ....................................................................... 119
c) Eigenstructure of the crossproduct or correlation matrices ........................................... 119
4.3.2. Recent approaches to the detection of multicollinearity .......................................................... 120
a) Condition indices ............................................................................................................ 120
b) Decomposition of regression coefficient variance .......................................................... 120
4.3.3. Remedies for multicollinearity ................................................................................ 121
4.3.4. Examples ................................................................................................................. 122
4.3.5. Multicollinearity and neural networks .................................................................... 123
4.4. OUTLIERS AND INFLUENTIAL OBSERVATIONS ........................................................................ 124
4.4.1. Identification of influential observations ................................................................ 124
4.4.2. Illustrative case study: Consumption of an additive in a leach plant ...................... 126
4.5. ROBUST REGRESSION MODELS .................................................................................................... 130
4.6. DUMMY VARIABLE REGRESSION ................................................................................................ 132
4.7. RIDGE REGRESSION ......................................................................................................................... 134
4.8. CONTINUUM REGRESSION ............................................................................................................. 137
4.9. CASE STUDY: CALIBRATION OF AN ON-LINE DIAGNOSTIC MONITORING SYSTEM
FOR COMMINUTION IN A LABORATORY-SCALE BALL MILL ............................................... 138
4.9.1. Experimental setup .................................................................................................. 138
4.9.2. Experimental procedure .......................................................................................... 139
4.9.3. Processing of acoustic signals ................................................................................. 141
4.9.4. Results and discussion ............................................................................................ 142
4.10. NONLINEAR REGRESSION MODELS ............................................................................................. 146
4.10.1. Regression splines ................................................................................................... 146
4.10.2. Alternating Conditional Expectation (ACE) ........................................................... 149
4.10.3. Additive models based on variance stabilizing transformation (AVAS) ................ 149
4.10.4. Projection pursuit regression (PPR) ........................................................................ 150
4.10.5. Multivariate Adaptive Regression Splines (MARS) ............................................... 151
4.10.6. Classification and regression trees .......................................................................... 153
a) Binary decision trees ...................................................................................... 153
b) Regression trees .............................................................................................. 157
4.10.7. Genetic programming models ................................................................................. 159
4.11. CASE STUDY 1: MODELLING OF A SIMPLE BIMODAL FUNCTION ........................................ 160
a) Multiple linear regression (MLR) ................................................................... 162
b) Alternating conditional expectations (ACE) and additive models based on variance
stabilizing transformation (AVAS) ................................................................. 162
c) Multilayer perceptron (MLP) ......................................................................... 162
d) Multivariate adaptive regression splines (MARS) .......................................... 164
e) Regression tree (CART) .................................................................................. 164
f) Projection pursuit regression (PPR) ............................................................... 165
g) Genetic programming (GP) ............................................................................ 165
4.12. NONLINEAR MODELLING OF CONSUMPTION OF AN ADDITIVE IN A GOLD LEACH
PLANT .................................................................................................................................................. 167
CHAPTER 5
TOPOGRAPHICAL MAPPINGS WITH NEURAL NETWORKS .................................................................... 172
5.1. BACKGROUND ................................................................................................................................... 172
5.2. OBJECTIVE FUNCTIONS FOR TOPOGRAPHIC MAPS ................................................................. 174
5.3. MULTIDIMENSIONAL SCALING ..................................................................................................... 177
5.3.1. Metric scaling .......................................................................................................................... 177
5.3.2. Nonmetric scaling and ALSCAL ............................................................................................. 177
5.4. SAMMON PROJECTIONS .................................................................................................................. 178
5.5. EXAMPLE 1: ARTIFICIALLY GENERATED AND BENCHMARK DATA SETS ......................... 179
5.5.1. Mapping with neural networks ................................................................................................ 180
5.5.2. Evolutionary programming ............................................................................................... 181
5.6. EXAMPLE 2: VISUALIZATION OF FLOTATION DATA FROM A BASE METAL
FLOTATION PLANT ........................................................................................................................... 183
5.7. EXAMPLE 3: MONITORING OF A FROTH FLOTATION PLANT ................................................. 188
5.8. EXAMPLE 4: ANALYSIS OF THE LIBERATION OF GOLD WITH MULTI-
DIMENSIONALLY SCALED MAPS .................................................................................................. 191
5.8.1. Experimental data .................................................................................................................... 191
a) St Helena and Unisel gold ores.......................................................................................191
b) Beatrix gold ore ............................................................................................................... 192
c) Kinross and Leslie gold ores........................................................................................... 192
d) Barberton gold ore .......................................................................................................... 192
e) Western Deep Level, Free State Geduld and Harmony gold ores................................... 192
5.8.2. Milled and unmilled ores ......................................................................................................... 192
5.9. EXAMPLE 5. MONITORING OF METALLURGICAL FURNACES BY USE OF
TOPOGRAPHIC PROCESS MAPS ..................................................................................................... 195
CHAPTER 6
CLUSTER ANALYSIS ....................................................................................................................................... 199
6.1. SIMILARITY MEASURES .................................................................................................................. 199
6.1.1. Distance-type measures ........................................................................................................... 200
6.1.2. Matching type measures .......................................................................................................... 202
6.1.3. Contextual and conceptual similarity measures ....................................................................... 203
6.2. GROUPING OF DATA ........................................................................................................................ 204
6.3. HIERARCHICAL CLUSTER ANALYSIS .......................................................................................... 206
6.3.1. Single link or nearest neighbour method ................................................................. 206
6.3.2. Complete link or furthest neighbour method .......................................................... 208
6.4. OPTIMAL PARTITIONING (K-MEANS CLUSTERING) ................................................................ 209
6.5. SIMPLE EXAMPLES OF HIERARCHICAL AND K-MEANS CLUSTER ANALYSIS .................. 209
6.6. CLUSTERING OF LARGE DATA SETS ........................................................................................... 213
6.7. APPLICATION OF CLUSTER ANALYSIS IN PROCESS ENGINEERING .................................... 214
6.8. CLUSTER ANALYSIS WITH NEURAL NETWORKS ..................................................................... 215
6.8.1. Cluster analysis with autoassociative neural networks ........................................... 216
6.8.2. Example 1: Sn-Ge-Cd-Cu-Fe-bearing samples from Barquilla deposit in Spain .... 216
6.8.3. Example 2: Chromitite ores from the Bushveld Igneous Complex ......................... 218
6.8.4. Example 3: Data from an industrial flotation plant ................................................. 221
6.8.5. Iris data set ............................................................................................................... 225
6.8.6. Cluster analysis with ART neural networks ............................................................ 226
CHAPTER 7
EXTRACTION OF RULES FROM DATA WITH NEURAL NETWORKS .................................................... 228
7.1. BACKGROUND .................................................................................................................................. 228
7.1.1. Decompositional methods ....................................................................................................... 228
7.1.2. Pedagogical methods ............................................................................................................... 228
7.1.3. Eclectic methods ..................................................................................................................... 229
7.2. NEUROFUZZY MODELING OF CHEMICAL PROCESS SYSTEMS WITH ELLIPSOIDAL
RADIAL BASIS FUNCTION NEURAL NETWORKS AND GENETIC ALGORITHMS ............... 229
7.2.1. Radial basis function networks and fuzzy systems ................................................................. 229
7.2.2. Development of hidden layers ................................................................................................ 230
7.2.3. Post-processing of membership functions ............................................................................... 231
7.2.4. Case study: Induced aeration in liquids in agitated vessels ..................................................... 231
7.3. EXTRACTION OF RULES WITH THE ARTIFICIAL NEURAL NETWORK DECISION
TREE (ANN-DT) ALGORITHM ......................................................................................................... 235
7.3.1. Induction of rules from sampled points in the feature space ................................................... 236
7.3.2. Interpolation of correlated data ............................................................................................... 237
7.3.3. Selection of attribute and threshold for splitting ..................................................................... 238
a) Gain ratio criterion ........................................................................................................ 238
b) Analysis of attribute significance .................................................................................... 239
c) Stopping criteria and pruning ......................................................................................... 241
7.3.4. Illustrative examples ............................................................................................................... 242
a) Characterization of gas-liquid flow patterns .................................................................. 242
b) Solidification of ZnCl2..................................................................................................... 243
7.3.5. Performance of the ANN-DT algorithm.................................................................................. 245
7.4. THE COMBINATORIAL RULE ASSEMBLER (CORA) ALGORITHM .......................................... 249
7.4.1. Construction of fuzzy rules with the growing neural gas algorithm........................................ 249
7.4.2. Assembly of rule antecedents with the reactive tabu search algorithm ................................... 250
7.4.3. Membership function merging and rule reduction .................................................................. 251
7.4.4. Calculation of a fuzzy rule consequent and solution fitness .................................................... 251
7.4.5. Optimal-size models ................................................................................................................ 252
7.4.6. Fuzzy rule set reduction .......................................................................................................... 254
7.4.7. Rule model output prediction surface smoothing .................................................................... 254
7.4.8. Overlapping of fuzzy rules in the attribute space .................................................................... 255
7.4.9. Performance of the CORA algorithm ...................................................................................... 256
a) Sin-Cos data.................................................................................................................... 256
b) The Slug Flow Data Set .................................................................................................. 257
c) Radial basis function neural networks (GNG-RBF, KM-RBF) ....................................... 257
d) Rule-induction algorithm (BEXA)................................................................................... 258
7.5. SUMMARY .......................................................................................................................................... 259
CHAPTER 8
INTRODUCTION TO THE MODELLING OF DYNAMIC SYSTEMS .......................................................... 262
8.1. BACKGROUND ................................................................................................................................... 262
8.2. DELAY COORDINATES .................................................................................................................... 264
8.3. LAG OR DELAY TIME ....................................................................................................................... 265
8.3.1. Average Mutual Information (AMI) ....................................................................................... 266
8.3.2. Average Cross Mutual Information (AXMI) ........................................................................... 267
8.4. EMBEDDING DIMENSION ................................................................................................................ 268
8.4.1. False nearest neighbours ......................................................................................................... 269
8.4.2. False nearest strands ................................................................................................................ 269
8.5. CHARACTERIZATION OF ATTRACTORS ...................................................................................... 270
8.5.1. Correlation dimension and correlation entropy ....................................................................... 270
8.5.2. Other invariants ....................................................................................................... 272
a) Generalized dimensions and entropies ........................................................... 272
b) Lyapunov exponents ....................................................................................... 273
8.6. DETECTION OF NONLINEARITIES ................................................................................................ 275
8.6.1. Surrogate data methods ........................................................................................... 275
a) Pivotal test statistics ....................................................................................... 276
b) Classes of hypotheses ..................................................................................... 276
8.6.2. Example: Generation of surrogate data ................................................................... 277
a) Generating index-shuffled surrogates (Type 0) .............................................. 277
b) Generating phase-shuffled surrogates (Type 1) ............................................. 277
c) Generating amplitude adjusted Fourier transform surrogates (Type 2) ........ 279
8.7. SINGULAR SPECTRUM ANALYSIS ................................................................................................ 280
8.8. RECURSIVE PREDICTION ................................................................................................................ 282
CHAPTER 9
CASE STUDIES: DYNAMIC SYSTEMS ANALYSIS AND MODELLING .................................................. 285
9.1. EFFECT OF NOISE ON PERIODIC TIME SERIES .......................................................................... 285
9.2.
9.3.
AUTOCATALYSIS IN A CONTINUOUS STIRRED TANK REACTOR ......................................... 287
9.2.1. Multi-layer perceptron network model ................................................................................... 288
9.2.2. Pseudo-linear radial basis function model .............................................................................. 290
EFFECT OF MEASUREMENT AND DYNAMIC NOISE ON THE IDENTIFICATION OF
AN AUTOCATALYTIC PROCESS .................................................................................................... 293
9.4. IDENTIFICATION OF AN INDUSTRIAL PLATINUM FLOTATION PLANT BY USE OF
SINGULAR SPECTRUM ANALYSIS AND DELAY COORDINATES ........................................... 295
9.5. IDENTIFICATION OF A HYDROMETALLURGICAL PROCESS CIRCUIT ................................. 296
CHAPTER 10
EMBEDDING OF MULTIVARIATE DYNAMIC PROCESS SYSTEMS ....................................................... 299
10.1. EMBEDDING OF MULTIVARIATE OBSERVATIONS .................................................................. 299
10.2. MULTIDIMENSIONAL EMBEDDING METHODOLOGY .............................................................. 299
10.2.1 Optimal embedding of individual components ....................................................................... 300
10.2.2 Optimal projection of initial embedding ................................................................................. 301
a) Optimal projection by singular spectrum analysis ......................................................... 301
b) Optimal projection by linear independent component analysis ...................................... 301
c) Selection of a suitable model structure ........................................................................... 302
10.3 APPLICATION OF THE EMBEDDING METHOD ........................................................................... 303
10.4 MODELLING OF NOx FORMATION ................................................................................................ 305
CHAPTER 11
FROM EXPLORATORY DATA ANALYSIS TO DECISION SUPPORT AND PROCESS CONTROL ....... 313
11.1. BACKGROUND ................................................................................................................................... 313
11.2. ANATOMY OF A KNOWLEDGE-BASED SYSTEM ....................................................................... 313
11.2.1. Knowledge-base ...................................................................................................................... 314
11.2.2. Inference engine and search strategies ................................................................................... 314
11.2.3. Monotonic and non-monotonic reasoning ............................................................................... 316
11.3. DEVELOPMENT OF A DECISION SUPPORT SYSTEM FOR THE DIAGNOSIS OF
CORROSION PROBLEMS .................................................................................................................. 317
11.3.1. Expert System ......................................................................................................................... 317
11.3.2. Examples ................................................................................................................................. 318
a) Example 1: Corrosion of construction materials ............................................................ 318
b) Example 2: Miscellaneous metal corrosion .................................................................... 318
c) Example 3: Seawater corrosion of stainless steels ......................................................... 318
11.3.3. Experiments and results ........................................................................................................... 319
11.4. ADVANCED PROCESS CONTROL WITH NEURAL NETWORKS ............................................... 320
11.4.1 Predictive neurocontrol schemes ............................................................................................. 321
11.4.2. Inverse model-based neurocontrol .......................................................................................... 322
11.4.3. Adaptive neurocontrol systems ............................................................................................... 322
11.5. SYMBIOTIC ADAPTIVE NEURO-EVOLUTION (SANE) ............................................................... 322
11.6. CASE STUDY: NEUROCONTROL OF A BALL MILL GRINDING CIRCUIT ............................... 324
11.7. NEUROCONTROLLER DEVELOPMENT AND PERFORMANCE ................................................. 328
11.7.1. SANE implementation ............................................................................................................ 328
11.7.2. Set point changes ..................................................................................................................... 329
11.7.3. Particle size disturbances in the feed ....................................................................................... 332
11.8. CONCLUSIONS ................................................................................................................................... 332
REFERENCES .................................................................................................................................................... 333
INDEX ................................................................................................................................................................ 366
APPENDIX: DATA FILES ................................................................................................................................ 370
Chapter 1
Introduction to Neural Networks
1.1. BACKGROUND
The technological progress of humanity throughout the ages can be summarized as a perpetual
cycle of observing nature, interpreting these observations and, once the system or phenomenon
being observed is understood sufficiently well, designing interventions to modify or redesign the system. Clearly man
has made spectacular progress in all four areas. Our understanding of nature is reaching new
depths at an ever-increasing pace, while we only need to look around us to appreciate the role
of engineering and technology in every day life. The growth of each stage in the cycle
depends on the previous stages; for example, heavier-than-air flight only became possible
when the laws of physics governing flight were understood sufficiently well. The same
applies to many of the recent advances in biotechnology, which are contingent upon detailed
knowledge of the human genome, etc.
Figure 1.1. The cycle of technological progress: observation of nature (systems), interpretation (science), design (information) and intervention.
However, in recent years, the advent of the computer has upset the balance between the
elements of the cycle of technological progress portrayed in Figure 1.1. For example,
although measurements of process variables on metallurgical plants have been logged for
decades, it is only relatively recently, with the large-scale availability of inexpensive computing
facilities, that the large historic databases of plant behaviour have become established. These
databases can contain tens of thousands of variables and hundreds of thousands or millions of
observations, constituting a rich repository of data, detailing the historic behaviour of a plant.
For example, in the production of ferrochrome in a submerged arc furnace, the specific energy
consumption, metal production, percentage of valuable metal lost to the slag, etc., may
depend on hundreds of other process variables, such as the composition and particulate state
of the feed, as well as the electrical configuration of the furnace. These variables are likely to
interact in a complex way to influence the quality of the product and the cost of production.
Derivation of an empirical model of such a system is unlikely to be successful if at least a
sizeable subset of the explanatory variables is not considered simultaneously.
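As a rough illustration of this point, the following sketch (synthetic data and hypothetical variable names, not taken from the book or from a real furnace) compares one-variable-at-a-time regressions with a model fitted on several explanatory variables simultaneously; when the variables interact, only the joint model recovers most of the variance.

```python
# A hedged sketch with synthetic data: when a response depends jointly on several
# interacting variables, single-variable models explain little of the variance,
# whereas a model fitted on the variables simultaneously does much better.
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Hypothetical furnace variables (not real plant data)
basicity = rng.normal(1.2, 0.1, n)       # slag basicity of the feed
coke = rng.normal(0.25, 0.03, n)         # coke fraction in the feed
current = rng.normal(120.0, 10.0, n)     # electrode current, kA

# Synthetic specific energy consumption with an interaction term
sec = 3.5 + 2.0 * basicity - 4.0 * coke + 0.01 * current * basicity \
      + rng.normal(0.0, 0.1, n)

def r_squared(X, y):
    """Ordinary least squares with an intercept; returns R^2 of the fit."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ beta
    return 1.0 - residuals.var() / y.var()

for name, x in [("basicity", basicity), ("coke", coke), ("current", current)]:
    print(f"{name:8s} alone: R^2 = {r_squared(x.reshape(-1, 1), sec):.2f}")

X_all = np.column_stack([basicity, coke, current])
print(f"all three jointly: R^2 = {r_squared(X_all, sec):.2f}")
```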
Unfortunately, this proliferation of plant data does not always lead to a concomitant increase
in knowledge or insight into process dynamics or plant operation. In fact, on many
metallurgical plants personnel have probably experienced a net loss in understanding of the complexities
of the behaviour of the plant, owing to increased turnover, rationalisation, etc. This has
resulted in younger, less experienced plant operators sometimes having to cope with the
unpredictable dynamics of nonlinear process systems. To aggravate the situation, a steady
increase in the demand for high-quality products at lower cost, owing to global competition,
as well as environmental and legislative constraints, requires substantial improvements in
process control.
In addition, automated capture of data has not only led to large data sets, but also data sets
that can contain many more variables than observations. One such example pertains to
spectroscopic data, where observations comprise a function, rather than a few discrete values.
The data are obtained by exposing a chemical sample to an energy source, and recording the
resulting absorbance as a continuous trace over a range of wavelengths. Such a trace is
consequently digitised at appropriate intervals (wavelengths) with the digitised values
forming a set of variables. Pyrolysis mass spectroscopy, near infrared spectroscopy and
infrared spectroscopy yield approximately 200, 700 and 1700 such variables respectively for each
chemical sample (Krzanowski and Marriot, 1994). In these cases the number of variables
usually exceeds the number of samples by far. Similar problems are encountered with the
measurement of acoustic signals, such as may be the case in on-line monitoring of process
equipment (Zeng and Forssberg, 1992), potentiometric measurements to monitor corrosion, or
image analysis, where each pixel in the image would represent a variable. In the latter case,
high-resolution two-dimensional images can yield in excess of a million variables.
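The sketch below (with made-up sample and wavelength counts, not data from the book) illustrates the resulting 'short and wide' data matrix: with far more wavelength variables than samples, the sample covariance matrix is rank-deficient, which is one reason dimensionality reduction methods such as those discussed in Chapter 3 are needed before conventional modelling can proceed.

```python
# A minimal sketch of the p >> n situation described above, with made-up numbers.
import numpy as np

rng = np.random.default_rng(2)

n_samples = 30        # chemical samples actually measured
n_wavelengths = 700   # digitised absorbance values per spectrum (e.g. NIR)

# Rows are samples, columns are absorbances at successive wavelengths
X = rng.normal(size=(n_samples, n_wavelengths))

# The 700 x 700 sample covariance matrix can have rank at most n_samples - 1,
# so it cannot be inverted and classical regression on all variables fails.
C = np.cov(X, rowvar=False)
print("data matrix shape:      ", X.shape)
print("covariance matrix shape:", C.shape)
print("rank of covariance:     ", np.linalg.matrix_rank(C))   # 29
```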
It is therefore not surprising that exploratory data analysis, multivariate analysis or data
mining is seen as a key enabling technology, and it is the topic of this book. Many of the
techniques for the efficient exploration of data have been around for decades. However, it is only
now, with the growing availability of processing power, that these techniques have become
sophisticated instruments in the hands of metallurgical engineers and analysts.
Artificial neural networks represent a class of tools that can facilitate the exploration of large
systems in ways not previously possible. These methods have seen explosive growth in the
last decade and are still being developed at a breath-taking pace. In many ways neural
networks can be viewed as nonlinear approaches to multivariate statistical methods, not bound
by assumptions of normality or linearity. Although neural networks have originated outside
the field of statistics, and have even been seen as an alternative to statistical methods in some
circles, there are signs that this viewpoint is making way for an appreciation of the way in
which neural networks complement classical statistics.
1.2. ARTIFICIAL NEURAL NETWORKS FROM AN ENGINEERING PERSPECTIVE
By the end of World War II several groups of scientists in the United States and England were
working on what is now known as a computer. Although Alan Turing (1912-1954), the princi-
pal British scientist at the time, suggested the use of logical operators (such as OR, AND,
NOT, etc.) as a basis for fundamental instructions to these machines, the majority of inves-
tigators favoured the use of numeric operators (+, -, <, etc.). It was only with the shifting
emphasis on methods to allow computers to behave more like humans that the approach
advocated by Turing began to attract new interest. This entire research effort and its
commercial repercussions are known as artificial intelligence (AI), and comprise many aspi-
rations, ranging from the design of machines to do various things considered to be intelligent,
to machines which could provide insight into the mental faculties of man.
Although different workers in the field have different goals, all seek to design machines that
can solve problems. In order to achieve this goal, two basic strategies can be pursued. The
first strategy or top-down approach has been developed productively for several decades and
entails the reduction of large complex systems to small manipulable units. These techniques
encompass heuristic programming, goal-based reasoning, parsing and causal analysis and are
efficient systematic search procedures, capable of the manipulation and rearrangement of ele-
ments of complex systems or the supervision or management of the interaction between sub-
systems interacting in intricate ways. The disadvantages of symbolic logic systems such as
these are their inflexibility and restricted operation, which limits them to very narrow domains
of knowledge.
Bottom-up strategies (i.e. connectionist procedures) endeavour to build systems with as little
architecture as possible. These systems start off with simple elements (such as simplified mo-
dels, small computer programs, elementary principles, etc.) and move towards more complex
systems by connecting these units to produce large-scale phenomena. As a consequence, these
systems are very versatile and capable of the representation of uncertain approximate relations
between elements or the solution of problems involving large numbers of weak interactions
(such as found in pattern recognition and knowledge retrieval problems). On the other hand,
connectionist systems cannot reason well and are not capable of symbolic manipulation and
logic analyses.
With the exception of basic arithmetic, it is quite obvious that the human brain is superior to a
digital computer at many tasks. Consider for example the processing of visual information. A
one-year old baby is much better and faster at recognizing objects and faces than the most ad-
vanced supercomputer.
Historically the cost of computing has been directly related to the energy consumed in the
process, and not the computation itself. Computational systems are limited by the system
overheads required to supply the energy and to get rid of the heat, i.e. the boxes, the heaters,
the fans, the connectors, the circuit boards, and the other superstructure that is required to
make the system work. Technology development has therefore always been in the direction of
the lowest energy consumed per unit computation. For example, an ordinary wristwatch today
does far more computation than the ENIAC ~did in the 1940s.
From this perspective it is interesting to consider the capability of biological systems in com-
putation. Contrary to the myth that nervous systems are slow and inefficient, we can still not
match the capabilities of the simplest insects, let alone handle tasks routinely performed by
humans, despite a 10 000 000 fold increase in computational capability over the last few
decades. The best silicon technology we can envision today will dissipate in the order of 10⁻⁹ J of
energy per operation at the chip level, and will consume approximately 100 to 1000 times
more energy at the box level.
¹ Electronic Numerical Integrator And Computer, the first general purpose electronic computer, built in the Moore
School of Electrical Engineering of the University of Pennsylvania during 1943-1946.
The human brain, by comparison, has approximately 10¹⁵ synapses, each of which receives a
nerve pulse roughly 10 times per second, so that the brain accomplishes approximately 10¹⁶
complex operations per second. All this is accomplished while dissipating only a few watts
(i.e. an energy dissipation of about 10⁻¹⁶ J per operation), making the brain
roughly 10 000 000 times more efficient than the best supercomputer today. From a different
perspective, it is estimated that state-of-the-art digital technology would require about 10 MW
of power to process information at the rate at which the human brain is capable. It is therefore
clear that much can be gained from emulation of biological computational systems in the
quest for more advanced computing systems.
Apart from its computational efficiency, the human brain also excels in many other respects.
It differs significantly from current digital hardware with regard to its bottom-level elemen-
tary functions, representation of information, as well as top-level organizing principles. More
specifically,
• It does not have to be programmed, but is flexible and can easily adjust to its environment
by learning.
• It is robust and fault tolerant. An estimated 10 000 nerve cells in the brain die daily without
perceptibly affecting its performance.
• It can deal with information of various kinds, be it fuzzy, probabilistic, noisy or inconsistent.
• It is highly parallel, small and compact.
Neural computation can therefore be seen as an alternative to the usual one based on a pro-
grammed instruction sequence introduced by von Neumann. Neural networks are useful ma-
thematical techniques inspired by the study of the human brain. Although the brain² is a very
complex organ that is still largely an enigma, despite considerable advances in the neuro-
sciences, it is clear that it operates in a massively parallel mode. Artificial systems inspired by
the basic architecture of the brain emerged under various names, such as connectionist sys-
tems, artificial neural networks, parallel distributed processing and Boltzmann machines.
These systems differ in many subtle ways, but share the same general principles. Unlike tradi-
tional expert systems, where knowledge is stored explicitly in a database or as a set of rules or
heuristics, neural networks generate their own implicit rules by learning from examples. Items
of knowledge are furthermore distributed across the network and reasonable responses are
obtained when the network is presented with incomplete, noisy or previously unseen inputs.
From the perspective of cognitive modelling of process systems know-how, these pattern
recognition and generalization capabilities of neural networks are much more attractive than
the symbol manipulation methodology of expert systems, especially as far as complex, ill-de-
fined systems are concerned.
Many parallels can be drawn between the development of knowledge-based systems and that
of neural networks. Both had suffered from an overzealous approach in the early stages of
their development. In the mid-1980s for example, a common perception had temporarily
made its way into the process engineering community that knowledge-based systems had
² The typical human brain contains between 10¹⁰ and 10¹¹ neurons, each of which can be connected to as many as
10 000 other neurons. The neuron, which is the basic processing unit in the brain, is very slow, with a switching
time in the order of milliseconds, and it has to operate in parallel to achieve the performance observed.
failed to live up to expectations. Like their rule-based counterparts, neural networks are also
sometimes seen as 'solutions looking for problems'. Although the application of neural
networks in the process engineering industry has not matured yet, there is every reason to
believe that like other computational methods it will also find a solid niche in this field. A
closer look at the historic development of neural networks will underpin the analogous paths
of these two branches of artificial intelligence.
1.3. BRIEF HISTORY OF NEURAL NETWORKS
The modern era of neural networks had its inception in the 1940s, when the paper of
McCulloch (a psychiatrist and neuroanatomist) and Pitts (a mathematician) on the modelling
of neurons appeared (McCulloch and Pitts, 1943). The McCulloch-Pitts model contained all
the necessary elements to perform logic operations, but implementation was not feasible with
the bulky vacuum tubes prevalent at the time. Although this model never became technically
significant, it laid the foundation for future developments.
Donald Hebb's book The Organization of Behaviour first appeared in the late 1940s (Hebb,
1949), as well as his proposed learning scheme for updating a neuron's connections, presently
referred to as the Hebbian learning rule. During the 1950s the first neurocomputers that could
adapt their connections automatically were built and tested (Minsky, 1954). Minsky and
Edmonds constructed an analog synthetic brain at Harvard in 1951, to test Hebb's learning
theory. Referred to as the Snark, the device consisted of 300 vacuum tubes and 40 variable
resistors, which represented the weights of the network. The Snark could be trained to run a
maze.
The interest sparked by these ideas was further buoyed when Frank Rosenblatt invented his
Mark I Perceptron in 1958. The perceptron was the world's first practical neurocomputer,
used for the recognition of characters mounted on an illuminated board. It was built by
Rosenblatt, Wightman and Martin in 1957 at the Cornell Aeronautics Laboratory and was
sponsored by the US Office of Naval Research. A 20 x 20 array of cadmium sulphide photo-
sensors provided the input to the neural network. An 8 x 8 array of servomotor driven poten-
tiometers constituted the adjustable weights of the neural network.
This was followed by Widrow's ADALINE (ADAptive LINEar combiner) in 1960, as well as
the introduction of a powerful new learning rule called the Widrow-Hoff learning rule
developed by Bernard Widrow and Marcian Hoff (Widrow and Hoff, 1960). The rule
minimized the summed square error during training associated with the classification of
patterns. The ADALINE network and its MADALINE (Multiple ADALINEs) extension were
applied to weather forecasting, adaptive controls and pattern recognition.
Although some success was achieved in this early period, the machine learning theorems were
too limited at the time to support application to more complicated problems. This, as well as
the lack of adequate computational facilities resulted in stagnation of the research in the
neural network field or cybernetics, as it was known at the time. The early development of
neural networks came to a dramatic end when Minsky and Papert (1969) showed that the
capabilities of the linear networks studied at the time were severely limited. These revelations
caused a virtually total cessation in the availability of research funding and many talented
researchers left the field permanently.
This growth has in part been fomented by improvements in very large scale integration
(VLSI) technology, as well as the efforts of a small number of investigators who had conti-
nued to work during the 1970s and early 1980s, despite a lack of funds and public interest.
For example, in Japan, Shun-Ichi Amari (1972, 1977) pursued the investigation of neural
networks with threshold elements and the mathematical theory of neural networks. His com-
patriot, Kunihiko Fukushima developed a class of neural networks known as neocognitrons
(Fukushima, 1980). The neocognitrons were biologically inspired models for visual pattern
recognition, that emulated retinal images, and processed them by use of two-dimensional
layers of neurons. In Finland, Teuvo Kohonen (1977, 1982, 1984, 1988, 1990, 1995)
developed unsupervised neural networks capable of feature mapping into regular arrays of
neurons, while James A. Anderson conducted research into associative memories (Anderson
et al., 1977). Stephen Grossberg and Gail Carpenter introduced a number of neural archi-
tectures and theories, while developing the theory of adaptive resonance theory (ART) neural
networks (Grossberg, 1976, 1982; Carpenter and Grossberg, 1990).
However, the initial interest in neural networks was only revived again in the early 1980s, and
since then the field of neural networks has seen phenomenal growth, passing from a research
curiosity to commercial fruition. Several seminal publications saw the light from 1982 to
1986. John J. Hopfield initiated the new renaissance of neural networks through the intro-
duction of a recurrent neural network for associative memories (Hopfield, 1982, 1984).
Further revitalisation of the field occurred with the publication of James McClelland and
David Rumelhart's (Rumelhart and McClelland, 1986) two volumes on distributed parallel
processing. In this publication, the earlier barriers that had led to the slump in the mainstream
neural network development in the 1960s were circumvented. Much of this work concerning
the training of neural networks with multiple layers had actually been discovered and
rediscovered earlier by Dreyfus (1962), Bryson (Bryson and Ho, 1969), Kelley (1969), as well
as Paul Werbos (Werbos, 1974), but it went largely unnoticed at the time. Following this
revival, the neural network business soared from an approximately $7 million industry in
1987 to an estimated $120 million industry in 1990.
1.4. STRUCTURES OF NEURAL NETWORKS
Although much of the development of neural networks has been inspired by biological neural
mechanisms, the link between artificial neural networks and their biological neural systems is
rather tenuous. Biological organisms are simply not understood sufficiently to allow any mea-
ningful emulation. As a result, artificial neural networks are better interpreted as a class of
mathematical algorithms (since a network can essentially be regarded as a graphical notation
for a large class of algorithms), as opposed to synthetic networks capable of competing with
their biological equivalents. In general terms neural networks are therefore simply computers
or computational structures consisting of large numbers of primitive process units connected
on a massively parallel scale. These units, nodes or artificial neurons are relatively simple
devices by themselves, and it is only through the collective behaviour of these nodes that
neural networks can realize their powerful ability to form generalized representations of
complex relationships and data structures. A basic understanding of the structure and
functioning of a typical neural network node is therefore necessary for a better understanding
of the capabilities and limitations of neural networks. The basic model of an artificial neuron
that will be used throughout this text is consequently presented in more detail below.
1.4.1. Models of single neurons
Each node model consists of a processing element with a set of input connections, as well as a
single output connection, as illustrated in Figure 1.1. Each of these connections is
characterized by a numerical value or weight, which is an indication of the strength of the
connection. The flow of information through the node is unidirectional, as indicated by the
arrows in this figure.
Figure 1.1. Model of a single neuron, with inputs x_1 to x_m and weights w_1 to w_m.
The output of the neuron can be expressed as follows.
z = f(Σ_{i=1}^{m} w_i x_i), or
z = f(w^T x) (1.1)
where w is the weight vector of the neural node, defined as
w = [w_1, w_2, w_3, ... w_m]^T
and x is the input vector, defined as
x = [x_1, x_2, x_3, ... x_m]^T
Like all other vectors in the text, these vectors are column vectors, of which the superscript T
denotes the transposition.
The function f(w^T x) is referred to as the activation function of the node, defined on the set of
activation values, i.e. the scalar products of the weight and input vectors of the node.
The argument of the activation function is sometimes referred to as the potential of the node,
in analogy to the membrane potentials of biological neurons.
An additional input can be defined for some neurons, i.e. x_0, with associated weight w_0. This
input is referred to as a bias and has a fixed value of -1. Like the other weights w_1, w_2, w_3, ...
w_m, the bias weight is also adaptable. The use of a bias input value is sometimes necessary to
enable neural networks to form accurate representations of process trends, by offsetting the
output of the neural network. Although the above model is used commonly in the application
of neural networks, some classes of neural networks can have different definitions of potential
(i.e. other than w^T x). Also, in some neural networks, nodes can perform different functions during diffe-
rent stages of the application of the network.
Figure 1.2. Sigmoidal activation functions, (a) unipolar, z = 1/[1 + exp(-w^T x)], and (b) bipolar,
z = [1 - exp(-w^T x)]/[1 + exp(-w^T x)].
Sigmoidal activation functions are used widely in neural network applications, that is bipolar
sigmoidal activation functions (with φ = w^T x and λ > 0)
f(φ) = 2/(1 + e^(-λφ)) - 1 (1.2)
and their hard-limiting equivalent (bipolar sign functions, or bipolar binary functions)
f(φ) = sgn(φ) = +1, if φ > 0 (1.3)
f(φ) = sgn(φ) = -1, if φ < 0 (1.4)
and unipolar sigmoidal activation functions
f(φ) = 1/(1 + e^(-λφ)) (1.5)
with their hard-limiting version (unipolar sign functions, or unipolar binary functions) the
same as for the bipolar activation functions (equations 1.3-1.4). The parameter λ is propor-
tional to the gain of the neuron, and determines the steepness of the continuous activation
function. These functions are depicted graphically in Figure 1.2. For obvious reasons, the
sign function is also called a bipolar binary function.
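As an informal illustration of equations 1.1, 1.2 and 1.5, the following Python sketch computes the output of a single neuron for the unipolar and bipolar sigmoidal activation functions; the code and its variable names are illustrative assumptions, not part of the original text.

```python
import numpy as np

def unipolar_sigmoid(phi, lam=1.0):
    # Equation 1.5: f(phi) = 1 / (1 + exp(-lambda * phi))
    return 1.0 / (1.0 + np.exp(-lam * phi))

def bipolar_sigmoid(phi, lam=1.0):
    # Equation 1.2: f(phi) = 2 / (1 + exp(-lambda * phi)) - 1
    return 2.0 / (1.0 + np.exp(-lam * phi)) - 1.0

def neuron_output(w, x, activation=unipolar_sigmoid):
    # Equation 1.1: z = f(w^T x), with the potential phi = w^T x
    phi = np.dot(w, x)
    return activation(phi)

w = np.array([1.0, 0.0, -1.0])   # example weight vector
x = np.array([1.0, 2.0, 3.0])    # example input vector
print(neuron_output(w, x))                        # unipolar output for phi = -2
print(neuron_output(w, x, bipolar_sigmoid))       # bipolar output for the same potential
```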
1.4.2. Models of neural network structures
Neural networks consist of interconnections of nodes, such as the ones described above.
These processing nodes are usually divided into disjoint subsets or layers, in which all the
nodes have similar computational characteristics. A distinction is made between input, hidden
and output layers, depending on their relation to the information environment of the neural
network. The nodes in a particular layer are linked to other nodes in successive layers by
means of the weighted connections discussed above. An elementary feedforward neural
network can therefore be considered as a structure with n neurons or nodes receiving inputs
(x), and m nodes producing outputs (z), that is
x = [x_1, x_2, x_3, ... x_n]^T
z = [z_1, z_2, z_3, ... z_m]^T
The potential for a particular node in this single layer feedforward neural network is similar to
that for the neuron model, except that a double index i,j notation is used to describe the
destination (first subscript) and source (second subscript) nodes of the weights. The activation
value or argument of the i'th neuron in the network is therefore
φ_i = Σ_{j=1}^{n} w_ij x_j, for i = 1, 2, ... m (1.6)
This value is subsequently transformed by the activation function of the i'th output node of
the network
z_i = f(w_i^T x), for i = 1, 2, ... m (1.7)
Neural networks with multiple layers are simply formed by cascading the single-layer
networks represented by equations 1.6-1.7. Once the structure of the network (number of
layers, number of nodes per layer, types of nodes, etc.) is fixed, the parameters (weights) of
the network have to be determined. This is done by training (optimization) of the weight
matrix of the neural network. Feedforward neural networks, like the one discussed above,
learn by repeatedly attempting to match sets of input data to corresponding sets of output data
or target values (a process called supervised learning). The optimized weights constitute a
distributed internal representation of the relationship(s) between the inputs and the outputs of
the neural network.
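A minimal sketch of equations 1.6-1.7, and of the cascading of single-layer networks into a multilayer structure, might look as follows in Python; tanh is used here as a convenient bipolar sigmoid, and all names and values are illustrative assumptions.

```python
import numpy as np

def layer_response(W, x, f=np.tanh):
    # Equations 1.6-1.7: phi_i = sum_j w_ij x_j and z_i = f(phi_i),
    # written compactly for a whole layer as z = f(W x).
    return f(W @ x)

rng = np.random.default_rng(0)
x = np.array([0.5, -1.0, 2.0])          # n = 3 inputs
V = rng.normal(scale=0.1, size=(4, 3))  # hidden layer: 4 nodes, 3 inputs
W = rng.normal(scale=0.1, size=(2, 4))  # output layer: m = 2 nodes

y = layer_response(V, x)                # hidden layer response
z = layer_response(W, y)                # network output (cascaded single-layer networks)
print(z)
```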
Learning typically occurs by means of algorithms designed to minimize the mean square error
between the desired and the actual output of the network through incremental modification of
the weight matrix of the network. In feedforward neural networks, information is propagated
back through the network during the learning process, in order to update the weights. As a
result, these neural networks are also known as back propagation neural networks. Training of
the neural network is terminated when the network has learnt to generalize the underlying
trends or relationships exemplified by the data. Generalization implies that the neural network
can interpolate sensibly at points not contained in its training set, as indicated in Figure 1.3.
The ability of the neural network to do so is typically assessed by means of cross-validation,
where the performance of the network is evaluated against a novel set of test data, not used
during training. Modes of training other than the basic approach mentioned above are possible
as well, depending on the function and structure of the neural network. These algorithms are
considered in more detail below.
1.5. TRAINING RULES
As was mentioned above, neural networks are essentially data driven devices that can form
internal distributed representations of complex relationships. These relationships are typically
specified implicitly by means of examples, i.e. input-output data. Without loss of generality, a
modelling problem (the behaviour of a process, plant unit operation, etc.) can be treated as
follows, if the behaviour of the process is characterized by data of the following form
Y = [ y_1,1  y_1,2  ...  y_1,p
      y_2,1  y_2,2  ...  y_2,p
      ...
      y_n,1  y_n,2  ...  y_n,p ]  ∈ ℝ^(n×p) (1.8)

X = [ x_1,1  x_1,2  ...  x_1,m
      x_2,1  x_2,2  ...  x_2,m
      ...
      x_n,1  x_n,2  ...  x_n,m ]  ∈ ℝ^(n×m) (1.9)

where the y_i,k (i = 1, 2, ... p) represent p variables dependent on m causal or independent variables
x_j,k (j = 1, 2, ... m), based on n observations (k = 1, 2, ... n). The variables y_i,k are usually para-
meters which provide a measure of the performance of the plant, while the x_j,k variables are
the plant parameters on which these performance variables are known to depend.
Figure 1.3. Overfitting of data (broken line), compared with generalization (solid line) by a neural
network. The solid and empty circles indicate training and test data respectively.
The problem is then to relate the matrix Y to some set of functions Y = f(X) of matrix X, in
order to predict Y from X. The main advantage of modelling techniques based on the use of
neural networks, is that a priori assumptions with regard to the functional relationship
between X and Y are not required. The network learns this relationship instead, on the basis
of examples of related x-y vector pairs or exemplars and forms an internal distributed implicit
model representation of the process.
In supervised training the weights of neural network nodes can be modified, based on the
inputs received by the node, its response, as well as the response of a supervisor, or target
value. In unsupervised learning, a target value to guide learning is not avail-
able. This can be expressed by a generalized learning rule (Amari, 1990) where the weight
vector of a node increases proportional to the product of the input vector x, and a learning
signal r.
The learning signal r is generally a function of the weight vector w_i ∈ ℝ^m, the input x ∈ ℝ^m
and, where applicable, a target signal d_i ∈ ℝ, that is
r = r(w_i, x, d_i) (1.12)
The weight vector at time t is incremented by
Δw_i(t) = β r[w_i(t), x(t), d_i(t)] x(t) (1.13)
The parameter β determines the learning rate, so that the weight vector is updated at discrete
time steps as follows
w_i(t+1) = w_i(t) + β r[w_i(t), x(t), d_i(t)] x(t) (1.14)
Different learning rules can be distinguished on the basis of their different learning signals, as
considered in more detail below.
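The generalized rule of equations 1.12-1.14 can be sketched as a single update function in which the learning signal r is supplied as an argument; the example plugs in the perceptron signal of equation 1.15, discussed next. This is an illustrative sketch only, with assumed names.

```python
import numpy as np

def weight_update(w, x, r_signal, beta, d=None):
    # Equation 1.14: w(t+1) = w(t) + beta * r(w, x, d) * x
    r = r_signal(w, x, d)
    return w + beta * r * x

# Perceptron learning signal of equation 1.15: r = d - sgn(w^T x).
# Note that np.sign(0) = 0, whereas the text's sgn is defined as +/-1 only.
perceptron_signal = lambda w, x, d: d - np.sign(w @ x)

w = np.array([1.0, 0.0, -1.0])
x = np.array([1.0, 2.0, 3.0])
w = weight_update(w, x, perceptron_signal, beta=0.1, d=1.0)
print(w)   # [1.2, 0.4, -0.4], as in the worked example below
```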
1.5.1. Supervised training
a) Perceptron learning rule
The perceptron learning rule is characterized by a learning signal that is the difference
between the desired or target response of the neuron, and the neuron's actual response.
r = d_i - z_i (1.15)
The adjustment of the weights in this supervisory procedure takes place as follows
Δw_i = β[d_i - sgn(w_i^T x)]x (1.16)
The perceptron rule pertains to binary node outputs only, and weights are only adjusted if
there is a difference between the actual and target response of the neural network. The
weights of the network can assume any initial values. The method is elucidated by the exam-
ple below.
Assume the set of training vectors to be x_1 = [1, 2, 3 | 1], x_2 = [-1, -2, -1 | -1] and x_3 = [-3, -1,
0 | -1], where the value after the vertical bar denotes the target output. The learning rate is assumed
to be β = 0.1, and the initial weight vector is arbitrarily assumed to be w(0) = [1, 0, -1]^T. As before,
the input or training vectors are presented to the network sequentially.
Step 1: For the first input x_1, with desired output d_1, the activation of the node is
w(0)^T x_1 = [1, 0, -1][1, 2, 3]^T = -2
and the output of the node is z_1 = sgn(-2) = -1.
Since z_1 is not equal to the target d_1 = 1, the weights of the network have to be adjusted.
w(1) = w(0) + β[d_1 - sgn(w(0)^T x_1)]x_1
w(1) = [1, 0, -1]^T + 0.1[1 - (-1)][1, 2, 3]^T = [1, 0, -1]^T + [0.2, 0.4, 0.6]^T
w(1) = [1.2, 0.4, -0.4]^T
Step 2: For the second input x_2, with desired output d_2, the activation of the node is
w(1)^T x_2 = [1.2, 0.4, -0.4][-1, -2, -1]^T = -1.6
and the output of the node is z_2 = sgn(-1.6) = -1.
Since z_2 is equal to the target d_2 = -1, adjustment of the weights of the network is not required,
so that
w(2) = [1.2, 0.4, -0.4]^T
Step 3: For the third input x_3, with desired output d_3, the activation of the node is
w(2)^T x_3 = [1.2, 0.4, -0.4][-3, -1, 0]^T = -4
and the output of the node is z_3 = sgn(-4) = -1.
Since z_3 is equal to the target d_3 = -1, adjustment of the weights of the network is again not
required, so that
w(3) = [1.2, 0.4, -0.4]^T
Training of the network can therefore be terminated, since the network (node) has learnt to
reproduce all the targeted outputs correctly.
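The three training steps above can be reproduced with a short Python sketch (sgn(0) = +1 is assumed here; the code itself is an illustration, not taken from the text):

```python
import numpy as np

def sgn(a):
    # bipolar binary (sign) function of equations 1.3-1.4; sgn(0) = +1 assumed
    return 1.0 if a >= 0 else -1.0

X = [np.array([1., 2., 3.]), np.array([-1., -2., -1.]), np.array([-3., -1., 0.])]
d = [1.0, -1.0, -1.0]          # target outputs
beta = 0.1
w = np.array([1., 0., -1.])    # initial weight vector w(0)

for x, target in zip(X, d):
    z = sgn(w @ x)                       # node output
    w = w + beta * (target - z) * x      # perceptron rule, equation 1.16
    print(w)                             # [1.2, 0.4, -0.4] after step 1,
                                         # unchanged in steps 2 and 3
```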
b) Delta and generalized delta rules
The delta learning rule pertains to nodes with continuous activation functions only, and the
learning signal of the rule is defined as
r = [d_i - f(w_i^T x)]f'(w_i^T x) (1.17)
The term f'(w_i^T x) is the derivative of the activation function f(w_i^T x). The rule can be derived
from the least squared error of the difference between the actual output of the network and the
desired output, in the form
E = ½(d_i - z_i)² = ½[d_i - f(w_i^T x)]² (1.18)
Calculation of the gradient vector of the squared error in equation (1.18), with respect to w_i, gives
∇E = -[d_i - f(w_i^T x)]f'(w_i^T x)x (1.19)
The adjustment of the weights in this supervisory procedure takes place as follows
Δw_i = -β∇E (1.20)
or
Δw_i = β[d_i - f(w_i^T x)]f'(w_i^T x)x (1.21)
For a single weight, this is equivalent to
Δw_ij = β[d_i - f(w_i^T x)]f'(w_i^T x)x_j, for j = 0, 1, 2, ... n (1.22)
As an example, assume the set of training vectors to be x_1 = [1, 2, 3 | 1], x_2 = [-1, -2, -1 | 0]
and x_3 = [-3, -1, 0 | 0]. The learning rate is assumed to be β = 0.1, and the initial weight vector
is arbitrarily assumed to be w(0) = [1, 0, -1]^T. The conditions are the same as for the perceptron
rule considered above, except that the target values are different. As before, the input or
training vectors are presented to the network sequentially. The node is assumed to have a con-
tinuous unipolar activation function of the form f(x) = 1/[1 + e^(-x)]. In this case both the value
of the activation of the neuron, as well as the derivative of the activation, have to be computed.
One of the reasons for using sigmoidal activation functions is that their derivatives can be
calculated easily. For a continuous unipolar sigmoidal activation function, the derivative is
d/dx[f(x)] = d/dx [1 + exp(-x)]^(-1)
= (-1)[1 + exp(-x)]^(-2) exp(-x)(-1)
= [1 + exp(-x)]^(-2) exp(-x)
= [f(x)]²[1 + exp(-x) - 1]
= [f(x)]²[1/f(x) - 1]
= [f(x)]²{[1 - f(x)]/f(x)}
= f(x)[1 - f(x)]
The initial activation of the node is w(0)^T x_1 = -2. The output of the node is z(0) = 1/[1 +
exp(-w(0)^T x_1)] = 1/[1 + e²] = 0.119 (which differs from the desired output d_1 = 1). The
derivative of the output node is z'(0) = 0.119(1 - 0.119) = 0.1048, and
w(1) = w(0) + β[d_1 - f(w(0)^T x_1)]f'(w(0)^T x_1)x_1
w(1) = [1, 0, -1]^T + 0.1[1 - 0.119]0.1048[1, 2, 3]^T
w(1) = [1, 0, -1]^T + 0.0092[1, 2, 3]^T = [1.0092, 0.0185, -0.9723]^T
This adjustment in the weight vector of the neuron results in a smaller error on the first exem-
plar, reducing it from 0.879 to approximately 0.867.
For step 2:
The activation of the node is w(1)^T x_2 = [1.0092, 0.0185, -0.9723][-1, -2, -1]^T = -0.074
The output of the node is z(1) = 1/[1 + exp(-w(1)^T x_2)] = 1/[1 + e^(0.074)] = 0.4815 (which
differs from the desired output d_2 = 0).
The derivative of the output node is z'(1) = 0.4815(1 - 0.4815) = 0.2497, and
w(2) = w(1) + β[d_2 - f(w(1)^T x_2)]f'(w(1)^T x_2)x_2
w(2) = [1.0092, 0.0185, -0.9723]^T + 0.1[0 - 0.4815]0.2497[-1, -2, -1]^T
w(2) = [1.0092, 0.0185, -0.9723]^T + 0.0120[1, 2, 1]^T = [1.0212, 0.0425, -0.9603]^T
This adjustment in the weight vector of the node results in a smaller error on the second
exemplar, reducing it from approximately 0.482 to 0.464.
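A corresponding Python sketch of these delta-rule updates (equation 1.21) is given below; the printed values assume the corrected step 2 above, and the code itself is illustrative only.

```python
import numpy as np

f = lambda a: 1.0 / (1.0 + np.exp(-a))        # unipolar sigmoid, equation 1.5
df = lambda a: f(a) * (1.0 - f(a))            # its derivative, f(x)[1 - f(x)]

X = [np.array([1., 2., 3.]), np.array([-1., -2., -1.]), np.array([-3., -1., 0.])]
d = [1.0, 0.0, 0.0]
beta = 0.1
w = np.array([1., 0., -1.])                   # w(0)

for x, target in zip(X, d):
    phi = w @ x                                        # activation (potential)
    w = w + beta * (target - f(phi)) * df(phi) * x     # delta rule, equation 1.21
    print(w)   # step 1: ~[1.0092, 0.0185, -0.9723]; step 2: ~[1.0212, 0.0425, -0.9603]
```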
The delta rule requires a small learning rate (approximately 0.001 < β < 0.1), since the weight
vector is moved in the negative error gradient direction in the weight space. Since the error is
only reduced with small increments at a time, the set of training data has to be presented to the
neural network repeatedly, in order to reduce the output errors of the network satisfactorily.
The best value for the learning rate β depends on the error surface, i.e. a plot of E versus w_ji
(which is rarely known beforehand). If the surface is relatively smooth, a larger learning rate
will speed convergence. On the other hand, if the error surface changes relatively rapidly, a
smaller learning rate would clearly be desirable. As a general rule of thumb, the largest learn-
ing rate not causing oscillation should be used.
Note that with this training scheme, if the desired output of the j'th unit is less than the actual
output of the neural network, the weight w_kj connecting input unit k with output unit j is
decreased (for a positive input). This does not take into account the response of the network to other training pat-
terns. Moreover, zero inputs do not result in any adjustment, not even for non-zero unit errors.
A simple method of increasing the learning rate without risking instability is to modify the
delta rule through inclusion of a momentum term (α > 0), that is
Δw_ji(t) = αΔw_ji(t-1) + βδ_j(t)y_i(t) (1.23)
Equation (1.23) is known as the generalized delta rule, since it includes the delta rule (α = 0).
The inclusion of a momentum term has the following benefits.
• When the partial derivative ∂E(k)/∂w_ji(k) has the same algebraic sign in consecutive
iterations, the weighted sum Δw_ji(t) grows, resulting in large adjustments to the weight
w_ji(k). This tends to result in accelerated descent in steady downhill directions.
• When the partial derivative ∂E(k)/∂w_ji(k) has alternating algebraic signs in consecutive
iterations, the weighted sum Δw_ji(t) is reduced, resulting in small adjustments to the
weight w_ji(k). The inclusion of the momentum term therefore has a stabilizing effect in
directions that tend to produce oscillation.
• In addition, the momentum term can have the advantage of preventing the learning pro-
cess from getting trapped in shallow local minima on the error surface.
When all the network weights are adjusted for the k'th exemplar (i.e. for all i and j) as indi-
cated above, it is referred to as per sample training or pattern training. An alternative is to
train per epoch, by accumulating weight changes prior to adjustment, i.e. Δw'_ji = Σ_k Δ_k w_ji,
summed over the exemplars presented. The weights in the network are thus only adjusted after
each presentation of all the exemplars in the training base, or a subset (epoch) of these exemplars.
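As a rough sketch of the generalized delta rule of equation 1.23 for a single linear layer, with illustrative names and values (per-epoch training would accumulate the changes before applying them, as noted in the comment):

```python
import numpy as np

def generalized_delta_step(W, x, y_target, beta=0.1, alpha=0.8, dW_prev=None):
    # One per-sample step of the generalized delta rule for a single linear layer:
    # dW(t) = alpha * dW(t-1) + beta * delta * y^T   (equation 1.23)
    y = W @ x                                  # layer response (identity activation assumed)
    delta = y_target - y                       # error term for this simple case
    dW = beta * np.outer(delta, x)
    if dW_prev is not None:
        dW += alpha * dW_prev                  # momentum term
    return W + dW, dW

# Per-epoch (batch) training would instead accumulate the dW terms over all
# exemplars in the epoch and apply the summed change once.
W = np.zeros((1, 3))
dW = None
for x, t in [(np.array([1., 2., 3.]), np.array([1.])),
             (np.array([-1., -2., -1.]), np.array([0.]))]:
    W, dW = generalized_delta_step(W, x, t, dW_prev=dW)
print(W)
```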
The supervised training of back propagation neural networks may be viewed as a global
identification problem, which requires the minimization of a cost function. The cost function
(E) can be defined in terms of the discrepancies between the outputs of the neural network
and the desired or target output values. More specifically, the cost function can be expressed
in terms of the weight matrix of the network (which has an otherwise fixed configuration).
The purpose of training is to adjust these free parameters (weights) to enable the outputs of
the neural network to match the target outputs more closely.
The standard back propagation algorithm modifies a particular parameter based on an
instantaneous estimate of the gradient (∂E/∂w_i) of the cost function with respect to the para-
meter (weight). This is an efficient method of training, although it uses a minimum amount of
information in the process. As a consequence, the use of the algorithm becomes impractical
with large networks (which require excessively long training times). The problem can be
alleviated by making better use of available information during training, e.g. by incorporating
training heuristics into the algorithm.
A wide variety of approaches to the optimization of the weight matrices of neural networks
have been documented to date. In practice, gradient descent methods, such as the generalized
delta rule have proved to be very popular, but other methods are also being used to
compensate for the disadvantages of these methods (chiefly their susceptibility towards
entrapment in local minima). These methods include second-order gradient descent methods, such
as conjugate gradients, Newton and Levenberg-Marquardt methods (Reklaitis et al., 1983), as
well as genetic algorithms, among others, as discussed in Chapter 2.
c) Widrow-Hoff learning rule
Like the perceptron and delta rules, the Widrow-Hoff rule applies to the supervised training of
neural networks. The Widrow-Hoff rule does not depend on the activation function of the
node, since it minimizes the squared error between the target value and the activation of the
node, that is
r = d_i - w_i^T x (1.24)
and
Δw_i = β[d_i - w_i^T x]x (1.25)
which is equivalent to
Δw_ij = β[d_i - w_i^T x]x_j, for j = 0, 1, 2, ... n (1.26)
for the adjustment of a single weight. This rule is clearly a special case of the delta rule,
where the activation function is the identity function, f(w_i^T x) = w_i^T x, and f'(w_i^T x) = 1. As is
the case with the delta rule, the weights of the neural network are also initialized to
arbitrary (small) values.
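A one-line Python sketch of the Widrow-Hoff update of equation 1.25, with illustrative values:

```python
import numpy as np

def widrow_hoff_step(w, x, d, beta=0.05):
    # Equation 1.25: dw = beta * (d - w^T x) * x  (identity activation)
    return w + beta * (d - w @ x) * x

w = np.zeros(3)
w = widrow_hoff_step(w, np.array([1., 2., 3.]), d=1.0)
print(w)   # [0.05, 0.1, 0.15]
```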
d) Correlation learning rule
The correlation rule is obtained by substituting r = d_i in the general learning rule (equation
1.12), so that
Δw_i = βd_i x, or (1.27)
Δw_ij = βd_i x_j, for j = 0, 1, 2, ... n (1.28)
The increment in the weight vector is directly proportional to the product of the target value
of a particular exemplar, and the exemplar (input) itself. The rule is similar to the Hebbian
rule discussed below (with a binary activation function and z_i = d_i), except that it is a super-
vised learning rule. Similar to Hebbian learning, the initial weight vector should also be zero.
1.5.2. Unsupervised training
a) Hebbian and anti-Hebbian learning rule
For the Hebbian learning rule, the weight vector is initialized to small random values. Subse-
quent change in the weight vector is related to the product of the input and outputs of the
neural network. The learning signal is simply the output of the network, and therefore depen-
dent on the activation of the nodes, as well as the activation functions of the nodes, only.
Since the Hebbian learning rule is not dependent on a target value, it is an unsupervised
learning rule, i.e.
r = f(w_i^T x) (1.29)
Δw_i = βf(w_i^T x)x, or (1.30)
w_i(t+1) = w_i(t) + βf[w_i(t)^T x(t)]x(t) (1.31)
As can be seen from equation 1.31, the weights of a node are increased when the correlation
between the input and the output of the node is positive, and decreased when this correlation
is negative. Also, the output is progressively strengthened for each presentation of the input.
Frequent exemplars will therefore tend to have a larger influence on the node's weight vector
and will eventually produce the largest output.
A simple example demonstrates the dynamics of the Hebbian learning rule for a single node,
with a bipolar binary activation function. Since there is only one weight vector, the subscript
of the weight vector has been dropped.
Let w(0) = [0.1, -0.1, 0]^T, and the two input vectors available for training be x_1 = [1, -1, 1]^T and
x_2 = [1, -0.5, -2]^T. A unity value for the learning parameter β is assumed, while the input
vectors are presented sequentially to the neural network.
Step 1: Calculation of the activation and output of the neural network, based on input of the first
exemplar.
w(0)^T x_1 = [0.1, -0.1, 0][1, -1, 1]^T = 0.2
z(0) = sgn[w(0)^T x_1] = +1
Step 2: Modification of weights based on input of the first exemplar.
w(1) = w(0) + sgn[w(0)^T x_1]x_1 = [0.1, -0.1, 0]^T + (+1)[1, -1, 1]^T = [1.1, -1.1, 1]^T
Step 3: Calculation of the activation and output of the neural network, based on input of the second
exemplar.
w(1)^T x_2 = [1.1, -1.1, 1][1, -0.5, -2]^T = -0.35
z(1) = sgn[w(1)^T x_2] = -1
Step 4: Modification of weights based on input of the second exemplar.
w(2) = w(1) + sgn[w(1)^T x_2]x_2 = [1.1, -1.1, 1]^T + (-1)[1, -0.5, -2]^T = [0.1, -0.6, 3]^T
The procedure can be repeated with additional presentations of the input vectors, until the
weights of the neural network have stabilized, or until some other termination criterion has
been satisfied. From the above example it can be seen that learning with a discrete activation
function f(w^T x) = ±1, and learning rate β = 1, results in addition or subtraction of the entire
input vector to and from the weight vector respectively. Note that with continuous activation
functions, such as unipolar or bipolar sigmoids, the change in the weight vector is some frac-
tion of the input vector instead.
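The two Hebbian steps above can be reproduced as follows (sgn(0) = +1 assumed; the code is an illustration only):

```python
import numpy as np

sgn = lambda a: 1.0 if a >= 0 else -1.0   # bipolar binary activation

w = np.array([0.1, -0.1, 0.0])            # w(0)
for x in [np.array([1., -1., 1.]), np.array([1., -0.5, -2.])]:
    z = sgn(w @ x)                        # node output, equation 1.29
    w = w + z * x                         # Hebbian update with beta = 1, equation 1.31
    print(w)   # [1.1, -1.1, 1.0], then [0.1, -0.6, 3.0]
```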
b) Winner-takes-all rule
Like the outstar rule, the winner-takes-all rule can be explained best by considering an
ensemble of neurons, arranged in a layer or some regular geometry. No output data are
required to train the neural network (i.e. unsupervised training). Each node in the ensemble
measures the (Euclidean) distance of its weights to the input values (exemplars) presented to
the nodes.
For example, if the input data consist of m-dimensional vectors of the form x = [x_1, x_2, ...
x_m]^T, then each node will have m weight values, which can be denoted by w_i = [w_i1, w_i2, ...
w_im]^T. The Euclidean distance D_i = ||x - w_i|| between the input vector and the weight vector of
each node is then computed and the winner is determined by the
minimum Euclidean distance. For normalized weight vectors this is equivalent to
w_p^T x = max_i(w_i^T x) (1.32)
The weights of the winning node (assumed to be the p'th node), as well as its neighbouring
nodes (constituting the adaptation zone associated with the winning node), are subsequently
adjusted in order to move the weights closer to the input vector, as follows
Δw_p = α(x - w_p), or
Δw_ip = α(x_i - w_ip,old) (1.33)
where α is an appropriate learning coefficient which decreases with time (typically starting at
0.4 and decreasing to 0.1 or lower).
The adjustment of the weights of the nodes in the immediate vicinity of the winning node is
instrumental in the preservation of the order of the input space and amounts to an order
preserving projection of the input space onto the ensemble of nodes (typically a one- or two-
dimensional layer). As a result similar inputs are mapped to similar regions in the output
space, while dissimilar inputs are mapped to different regions in the output space. The
winner-takes-all rule is especially important in the multivariate analysis of data, and will be
considered in more detail at a later stage.
Figure 1.4. One-dimensional array of p competitive nodes (z_1, z_2, ... z_p), each receiving m inputs (x_1,
x_2, ... x_m), showing the winning node (z_k) surrounded by the neighbouring nodes (z_k-2, z_k-1, z_k+1 and z_k+2)
in its neighbourhood (broken lines).
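A minimal sketch of the winner-takes-all update of equation 1.33, with the neighbourhood adjustment omitted for brevity (all names and values are illustrative assumptions):

```python
import numpy as np

def winner_takes_all_step(W, x, alpha=0.4):
    # W has one row of weights per node; the winner is the row closest to x.
    distances = np.linalg.norm(W - x, axis=1)
    p = int(np.argmin(distances))          # index of the winning node
    W[p] += alpha * (x - W[p])             # equation 1.33: move the winner towards x
    return p

rng = np.random.default_rng(1)
W = rng.normal(size=(5, 3))                # 5 competitive nodes, 3 inputs
x = np.array([0.2, -0.4, 1.0])
print(winner_takes_all_step(W, x))
```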
c) Outstar learning rule
The outstar rule is best explained in the context of an ensemble of nodes, arranged in a layer.
The rule applies to supervised learning, but is also designed to allow the network to extract
statistical features from the inputs and outputs of the neural network. Weight adjustments are
computed as follows
Δw_j = β(d - w_j) (1.34)
or, in terms of the adjustments of the individual weights of the nodes, for i = 1, 2, ... m
Δw_ij = β(d_i - w_ij)
In contrast with the previous rules, the weight vector that is updated fans out of the j'th node.
The learning rate β is typically a small positive parameter that decreases as learning pro-
gresses.
The above rules are not the only ones used to train neural networks, but they are widely
used, although their application does not guarantee convergence of the neural network being
trained. More detailed discussions are available in other sources, such as Zurada (1992),
Bishop (1995) and Haykin (1999).
1.6. NEURAL NETWORK MODELS
1.6.1. Multilayer Perceptrons
Since multilayer perceptron neural networks are by far the most popular (and simple) neural
networks in use in the process engineering industries, they are considered first.
a) Basic structure
As mentioned previously, in multilayer perceptron neural networks the processing nodes are
usually divided into disjoint subsets or layers, in which all the nodes have similar compu-
tational characteristics.
Figure 1.5. Structure of a typical multilayer perceptron neural network, with inputs (x_1, x_2), a
sigmoidal hidden layer and a linear output (y).
A distinction is made between input, hidden and output layers depending on their relation to
the information environment of the network. The nodes in a particular layer are linked to
other nodes in successive layers by means of artificial synapses or weighted connections (ad-
justable numeric values), as shown in Figure 1.5. These weights form the crux of the model,
in that they define a distributed internal relationship between the input and output activations
of the neural network.
The development of neural network models thus consists of first determining the overall
structure of the neural network (number of layers, number of nodes per layer, types of nodes,
etc.). Once the structure of the network is fixed, the parameters (weights) of the network have
to be determined. Unlike the case with a single node, a network of nodes requires that the
output error of the network be apportioned to each node in the network.
b) Back propagation algorithm
The back propagation algorithm can be summarized as follows, for a network with a single
hidden layer with q nodes and an output layer with p nodes, without loss of generality.
Before training, the learning rate coefficient (η) and the maximum error (E_max) are specified
by the user.
i) Initialize the weight matrices of the hidden layer (V) and the output layer (W) to
small random values.
ii) Compute the response of the hidden layer (y) and the output layer (z) of the neural
network, when presented with an exemplar of the form [x|t], where x ∈ ℝ^m and t ∈ ℝ^p, i.e.
y = Φ(Vx)
z = Φ(Wy)
iii) Compute the error, E_new = E_old + ½||z - t||².
iv) Calculate the error terms δ_z and δ_y, with δ_z ∈ ℝ^p and δ_y ∈ ℝ^q, i.e.
δ_zk = ½(t_k - z_k)(1 - z_k²), for k = 1, 2, ... p
δ_yj = ½(1 - y_j²)Σ_{k=1}^{p} δ_zk w_kj, for j = 1, 2, ... q
v) Adjust the weights of the output layer
W_new = W_old + ηδ_z y^T, or
w_kj,new = w_kj,old + ηδ_zk y_j, for k = 1, 2, ... p and j = 1, 2, ... q
vi) Adjust the weights of the hidden layer
V_new = V_old + ηδ_y x^T, or
v_ji,new = v_ji,old + ηδ_yj x_i, for j = 1, 2, ... q and i = 1, 2, ... m
vii) If there are more exemplars in the training set, go to step ii).
viii) If E < E_max, stop; otherwise reset E and go to step ii).
The back propagation algorithm is highly efficient, as the number of operations required to
evaluate the derivatives of the error function scales with O(N_w), for a sufficiently large
number of weights N_w. Since the number of weights is usually much larger than the number
of nodes in the neural network, most of the computational effort in the forward cycle during
training is devoted to the evaluation of the weighted sums (one multiplication and one
addition per connection) in order to determine the activations of the nodes, while the
evaluation of the activation functions can be seen as a small computational overhead (Bishop,
1995). Normally, the numerical evaluation of each derivative would be O(N_w), so that the
evaluation of all the derivatives would scale as O(N_w²). However, with the back propagation
algorithm, evaluation for each exemplar scales as O(N_w) only. This is a crucial gain in
efficiency, since training can involve a considerable computational effort.
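The algorithm in steps i) to viii) can be sketched compactly in Python for bipolar sigmoidal nodes; the network sizes, data and learning rate below are illustrative assumptions, not taken from the text.

```python
import numpy as np

phi  = lambda a: 2.0 / (1.0 + np.exp(-a)) - 1.0   # bipolar sigmoid, equation 1.2
dphi = lambda f: 0.5 * (1.0 - f ** 2)             # its derivative, in terms of the output

def backprop_epoch(V, W, X, T, eta=0.1):
    """One presentation of all exemplars (steps ii-vii), returning the summed error."""
    E = 0.0
    for x, t in zip(X, T):
        y = phi(V @ x)                            # hidden layer response
        z = phi(W @ y)                            # output layer response
        E += 0.5 * np.sum((z - t) ** 2)           # step iii
        dz = (t - z) * dphi(z)                    # step iv: output error terms
        dy = dphi(y) * (W.T @ dz)                 # step iv: hidden error terms
        W += eta * np.outer(dz, y)                # step v
        V += eta * np.outer(dy, x)                # step vi
    return E

rng = np.random.default_rng(0)
V = rng.normal(scale=0.1, size=(4, 3))            # q = 4 hidden nodes, m = 3 inputs
W = rng.normal(scale=0.1, size=(2, 4))            # p = 2 output nodes
X = [np.array([0.5, -1.0, 2.0]), np.array([1.0, 0.0, -1.0])]
T = [np.array([1.0, -1.0]), np.array([-1.0, 1.0])]
for _ in range(100):
    E = backprop_epoch(V, W, X, T)
print(E)
```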
1.6.2. Kohonen self-organizing mapping (SOM) neural networks
Self-organizing neural networks or self-organizing (feature) maps (SOM) are systems that
typically create two- or three-dimensional feature maps of input data in such a way that order
is preserved. This characteristic makes them useful for cluster analysis and the visualization
of topologies and hierarchical structures of higher-dimensional input spaces.
Figure 1.6. A self-organizing mapping (Kohonen) neural network, with an input layer (x_1, x_2, ... x_m)
fully connected to a competitive (Kohonen) layer; the output layer is not shown.
Self-organizing systems are based on competitive learning, that is the outputs of the network
compete among themselves to be activated or fired, so that only one node can win at any
given time. These nodes are known as winner-takes-all nodes.
Self-organizing or Kohonen networks do not require output data to train the network (i.e.
training is unsupervised). Such a network typically consists of an input layer, which is fully
connected to one or more two-dimensional Kohonen layers, as shown in Figure 1.6. Each
node in the Kohonen layer measures the (Euclidean) distance of its weights to the input values
(exemplars) fed to the layer. For example, if the input data consist of m-dimensional vectors
of the form x = {x_1, x_2, ... x_m}, then each Kohonen node will have m weight values, which can
be denoted by w_i = {w_i1, w_i2, ... w_im}. The Euclidean distances D_i = ||x - w_i|| between the
input vectors and the weights of the network are then computed for each of the Kohonen
nodes and the winner is determined by the minimum Euclidean distance.
The weights of the winning node, as well as its neighbouring nodes, which constitute the
adaptation zone associated with the winning node are subsequently adjusted in order to move
the weights closer to the input vector.
The adjustment of the weights of the nodes in the immediate vicinity of the winning node is
instrumental in the preservation of the order of the input space and amounts to an order
preserving projection of the input space onto the (typically) two-dimensional Kohonen layer.
As a result similar inputs are mapped to similar regions in the output space, while dissimilar
inputs are mapped to different regions in the output space.
One of the problems that has to be dealt with in the training of self-organizing neural
networks is the non-participation of neurons in the training process. This pro-
blem can be alleviated by modulation of the selection of (a) winning nodes or (b) learning
rates through frequency sensitivity.
Frequency sensitivity entails a history-sensitive threshold in which the level of activation of
the node is proportional to the amount by which the activation exceeds the threshold. This
threshold is constantly adjusted, so that the thresholds of losing neurons are decreased, and
those of winning neurons are increased. In this way output nodes which do not win suffi-
ciently frequently, become increasingly sensitive. Conversely, if nodes win too often, they
become increasingly insensitive. Eventually this enables all neurons to be involved in the
learning process. Training of Kohonen self-organised neural networks can be summarized as
follows.
a) Summary of the SOM algorithm (Kohonen)
Initialization: Select small random values for the initial weight vectors w_j(0), so that the w_j(0)
are different for j = 1, 2, ... p, where p is the number of neurons in the lattice.
i. Sampling: Draw a sample x from the input distribution with a certain probability.
ii. Similarity matching: Find the winning neuron I(x) at time t, using the minimum
Euclidean distance (or other) criterion:
I(x) = arg min_j ||x(t) - w_j||, for j = 1, 2, ... p (1.35)
iii. Updating: Modify the synaptic weights of the neurons in the lattice as follows:
w_j(t+1) = w_j(t) + η(t)[x(t) - w_j(t)], for j ∈ Λ_I(x)(t) (1.36)
w_j(t+1) = w_j(t), otherwise,
where η(t) is a time-variant learning rate parameter and Λ_I(x)(t) is the neighbourhood function
centred around the winning neuron I(x), all of which are varied dynamically. These para-
meters are often allowed to decay exponentially, for example η(t) = η_0 e^(-λt).
For example, for Gaussian-type neighbourhood functions the modification of the synaptic
weight vector w_j of the j'th neuron at a lateral distance d_ji from the winning neuron I(x) is
w_j(t+1) = w_j(t) + η(t)h_j,I(x)(t)[x(t) - w_j(t)] (1.37)
iv. Continuation: Continue with step i (sampling) until the feature map has stabilized.
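A short Python sketch of this training loop for a one-dimensional lattice, with a Gaussian neighbourhood function and exponentially decaying parameters; all settings and names are illustrative assumptions.

```python
import numpy as np

def train_som(data, p=20, n_iter=2000, eta0=0.4, sigma0=5.0, tau=500.0, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(p, data.shape[1]))     # p lattice neurons
    lattice = np.arange(p)
    for t in range(n_iter):
        x = data[rng.integers(len(data))]                  # i. sampling
        winner = np.argmin(np.linalg.norm(W - x, axis=1))  # ii. similarity matching
        eta = eta0 * np.exp(-t / tau)                      # decaying learning rate
        sigma = sigma0 * np.exp(-t / tau)                  # shrinking neighbourhood
        h = np.exp(-((lattice - winner) ** 2) / (2 * sigma ** 2))
        W += eta * h[:, None] * (x - W)                    # iii. updating (equation 1.37)
    return W

data = np.random.default_rng(1).uniform(-1, 1, size=(500, 2))
W = train_som(data)
print(W[:3])
```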
b) Properties of the SOM algorithm
The feature map displays important statistical properties of the input.
• Approximation of the input space: The self-organised feature map Φ, represented by the set
of synaptic weight vectors {w_j | j = 1, 2, ... p} in the output space A, provides a good
approximation of the input space X.
• Topological ordering: The feature map Φ is topologically ordered, in that the spatial loca-
tion of a neuron in the lattice corresponds to a particular domain of input patterns.
• Density matching: Variations in the distribution of the input are reflected in the feature
map, in that regions in the input space X from which samples are drawn with a higher
probability than other regions are mapped with better resolution and onto larger domains
of the output space A than samples drawn with a lower probability.
If f(x) denotes the multidimensional probability density function of the input vector x, then
the probability density function integrated over the entire input space should equal unity, that
is
∫ f(x)dx = 1
Let m(x) denote the magnification factor of the SOFM, defined as the number of neurons
associated with a small volume dx of the input space X. Then the magnification factor,
integrated over the entire input space, should equal the total number of neurons in the network
lattice, that is
∫ m(x)dx = p (1.38)
In order to match the input density exactly in the feature map,
m(x) ∝ f(x)
In other words, if a particular region in the input space occurs more often, that region is
mapped onto a larger region of neurons in the network lattice. For 2D- and higher-dimensional maps the
magnification factor m(x) is generally not expressible as a simple function of the probability
density function f(x) of the input vector x. In fact, such a relationship can only be derived for
1D-maps, and even then the magnification factor is usually not proportional to the probability
density function.
Generally speaking, the map tends to underrepresent regions of high input density and
conversely, to overrepresent regions of low input density. It is for this reason that heuristics
(such as the conscience mechanism) are sometimes included in the SOM algorithm, i.e. to
force a more exact density matching between the magnification factor of the map and the
probability density function of the input. Alternatively, an information-theoretic approach can
be used to construct the feature map.
1.6.3. Generative topographic maps
Generative topographic maps (Bishop et al., 1997) are density models of data based on the use
of a constrained mixture of Gaussians in the data space in which the model parameters (W
and β) are determined by maximum likelihood using the expectation-maximization algorithm.
Generative topographic maps are defined by specifying a set of points {x_i} in a latent space,
together with a set of basis functions {φ_j(x)}. A constrained mixture of Gaussians is defined
by adaptive parameters W and β, with centres Wφ(x_i) and common covariance β⁻¹I.
As a latent variable model, a generative topographic map represents a distribution p(t) of data
in an m-dimensional space, t = (t_1, t_2, ... t_m), in terms of a set of p latent variables x = (x_1, x_2,
... x_p). The mapping between points in the p-dimensional latent space and the m-dimensional
data space is represented by a function y(x,W), as indicated in Figure 1.7, for p = 2 and m = 3.
The matrix of parameters determines the mapping (which represents the weights and biases in
the case of a neural network model).
Figure 1.7. A manifold M embedded in data space is defined by the function y(x,W), given by the
image of the latent variable space P under the mapping x → y (shown for a 2D latent space embedded
in a 3D data space).
If a probability distribution p(x) is defined on the latent variable space P, this will induce a
corresponding distribution p(y|W) in the data space M. If p < m, then the distribution in the
data space will be confined to a p-dimensional manifold and would therefore be singular.
However, since this manifold will only approximate the actual distribution of the data in the
data space, it is appropriate to include a noise model for the t vector. The distribution of t
can be represented by spherical Gaussians³ centred on y(x,W), with a common inverse
variance β, i.e.
p(t|x, W, β) = (β/2π)^(m/2) exp{-(β/2)||y(x,W) - t||²} (1.39)
The distribution in the t space (M), for a given matrix of parameters W, is obtained by inte-
gration over the x space (P), that is
p(t|W, β) = ∫ p(t|x, W, β)p(x) dx (1.40)
For a given data set of size n, the parameter matrix W and the inverse variance β can be
determined by use of maximum likelihood. In practice, the log likelihood is maximized, i.e.
L(W, β) = ln Π_{k=1}^{n} p(t_k|W, β) (1.41)
³ Of course, other models for p(t|x) may also be appropriate.
Unlike the SOM algorithm, the GTM algorithm defines an explicit probability density given
by the mixture distribution (Bishop et al., 1998). Consequently, there is a well-defined
objective function given by the log likelihood and convergence to a local maximum of the
objective function is guaranteed by use of the expectation maximization (EM) algorithm
(Dempster et al., 1977). In contrast, the SOM algorithm does not have an explicit cost
function. Moreover, conditions under which self-organisation occurs in SOM neural networks
are not quantified and in practice it is necessary to validate the spatial ordering of trained
SOM models.
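As an informal illustration of equations 1.39-1.41, the sketch below evaluates the GTM log likelihood for a regular grid of latent points, so that p(x) is a sum of equally weighted delta functions and the integral of equation 1.40 becomes a finite mixture; the matrix layout (Φ of shape K×J, W of shape J×m) and all names are assumptions for illustration only.

```python
import numpy as np

def gtm_log_likelihood(T, W, beta, Phi):
    """T: (n, m) data; Phi: (K, J) basis function values at K latent grid points;
    W: (J, m) parameter matrix; beta: common inverse variance."""
    K, m = Phi.shape[0], T.shape[1]
    Y = Phi @ W                                            # mixture centres y_i = W phi(x_i)
    d2 = ((T[:, None, :] - Y[None, :, :]) ** 2).sum(-1)    # squared distances, shape (n, K)
    log_comp = 0.5 * m * np.log(beta / (2 * np.pi)) - 0.5 * beta * d2   # equation 1.39
    # equation 1.40 with p(x) uniform over the K grid points; sum the log over the data set
    return np.sum(np.logaddexp.reduce(log_comp, axis=1) - np.log(K))

rng = np.random.default_rng(0)
T = rng.normal(size=(50, 3))          # n = 50 points in a 3-D data space
Phi = rng.normal(size=(16, 5))        # 16 latent grid points, 5 basis functions
W = rng.normal(scale=0.1, size=(5, 3))
print(gtm_log_likelihood(T, W, beta=1.0, Phi=Phi))
```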
1.6.4. Learning vector quantization neural networks
Vector quantization for data compression is an important application of competitive learning,
and is used for both the storage and transmission of speech and image data.
Figure 1.8. A learning vector quantization neural network with two classes (an input layer x1, ..., xm, a
self-organizing layer consisting of q nodes for each class, and a classification layer with outputs c1 and c2).
The essential concept on which learning vector quantization networks (see Figure 1.8) are
based, is that a set of vectors can be distributed across a space in such a way that their spatial
distribution corresponds with the probability distribution of a set of training data. The idea is
to categorise the set of input vectors into c classes C = c1, c2, ... cc, each class of which is
characterized by a class or prototype vector. Each of the original input vectors is subsequently
represented by the class of which it is a member, which allows for high ratios of data
compression. The components of the vectors usually have continuous values, and instead of
storing or transmitting the prototype vectors, only their indices need to be handled, once a set
of prototype vectors or a codebook has been defined.
The class of a particular input can be found by locating its nearest prototype vector, using an
ordinary (Euclidean) metric, i.e. ‖x − wi*‖ ≤ ‖x − wi‖ for all i. This divides the vector space into a
so-called Voronoi (or Dirichlet) tessellation, as indicated in Figure 1.9. Learning vector
quantization networks differ from supervised neural networks, in that they construct their own
representations of categories among input data.
A learning vector quantization network contains an input layer, a Kohonen layer which per-
forms the classification based on the previously learned features of the various classes, and an
output layer, as shown in Figure 1.8. The input layer is comprised of one node for each fea-
ture or input parameter of the various classes, while the output layer contains one node for
each class. Although the number of classes is predefined, the categories (q) assigned to these
classes are not. During training the Euclidean distance (di) between a training exemplar (x)
and the weight vector (wi) of each node is computed. That is
di = ‖wi − x‖ = [Σj=1..m (wij − xj)²]^1/2    (1.55)
Figure 1.9. An example of Voronoi tessellation of data in two dimensions.
The node with the smallest distance is declared the winner. If this winning node and the training
vector share the same class, the winning node is moved towards the training vector, otherwise it
is moved away, or repulsed, i.e.

wp = wp + η(xp − wp), if the winning node is in the correct class    (1.56)
wp = wp − η(xp − wp), if the winning node is not in the correct class    (1.57)
As a consequence, the nodes assigned to a class migrate to a region associated with their
class. In the classification mode, the distance of the input vector to each node is determined
and the vector is assigned to the class of the winning node.
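As an illustration of this training rule, the sketch below (Python; the prototype initialization, learning coefficient and decay schedule are assumptions of the example, not part of the original description) finds the winning node by the Euclidean distance of equation (1.55) and then applies the attraction and repulsion updates of equations (1.56) and (1.57).

```python
import numpy as np

def train_lvq1(X, y, prototypes, proto_labels, eta=0.1, epochs=20, decay=0.95):
    """LVQ training following eqs. (1.55)-(1.57).

    X            : (n, m) training exemplars
    y            : (n,) class labels of the exemplars
    prototypes   : (q, m) initial prototype (codebook) vectors
    proto_labels : (q,) class label assigned to each prototype
    """
    W = np.asarray(prototypes, dtype=float).copy()
    labels = np.asarray(proto_labels)
    for _ in range(epochs):
        for x, cls in zip(X, y):
            d = np.linalg.norm(W - x, axis=1)    # Euclidean distances, eq. (1.55)
            i = np.argmin(d)                     # winning node
            if labels[i] == cls:
                W[i] += eta * (x - W[i])         # attract, eq. (1.56)
            else:
                W[i] -= eta * (x - W[i])         # repel, eq. (1.57)
        eta *= decay                             # learning coefficient reduced with time
    return W

def classify_lvq(X, prototypes, proto_labels):
    """Classification mode: assign each input to the class of its nearest prototype."""
    labels = np.asarray(proto_labels)
    d = np.linalg.norm(np.asarray(X)[:, None, :] - prototypes[None, :, :], axis=2)
    return labels[np.argmin(d, axis=1)]
```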
1.6.5. Probabilistic neural networks
Many methods of pattern classification and feature evaluation presuppose complete
knowledge of the class conditional probability density functions p(x|cj), with j = 1, 2, ... m. In
practice the actual probability structure of the classes is usually unknown and the only
information available to the analyst is a set of exemplars with known class memberships.
It is therefore necessary to infer the unknown probability density functions from the data. One
way to accomplish this is to make use of kernels. For example, given a particular exemplar xi
for a class cj, we can assert that p(x|cj) assumes a non-zero value at the point xi. Moreover,
assuming p(x|cj) to be continuous, it can be inferred that p(x|cj) will assume non-zero values in
the immediate vicinity of observation xi. The information about p(x|cj) gained by observing xi
can be represented by a function K(x,xi) centred at xi. This function (known as a kernel
function) attains a maximum value at xi and decreases monotonically as the distance from xi
increases, as indicated in Figure 1.10(a).
Figure 1.10. (a) A one-dimensional Gaussian kernel function, and (b) estimation of the probability
density function pest(x|cj) of class cj by summing the contributions of the kernels centred on the
exemplars for the class.
With a set of n exemplars for a given class cj, p(x|cj) can be estimated by calculating the
average of the contributions of all the exemplars, i.e.

pest(x|cj) = Σi=1..n K(x,xi)/n    (1.58)
As indicated in Figure 1.10(b), exemplars close to each other give rise to larger values of
pest(x|cj) than exemplars situated further apart.
Clearly, the contributions of the kernels to pest(x|cj) also depend on their range of influence. If
this is very small, the estimate of the probability density function will be spiky, while too
large a range of influence will miss local variations in p(x|cj). Intuition dictates that as the
number of exemplars increases, the influence of K(x,xi) should decrease progressively, and
conversely, if few exemplars are available, the influence of K(x,xi) should be large to smooth
out sampling effects. The kernel function reflecting this arrangement should have the form
K(x,xi) = ρ⁻ᵐ h[d(x,xi)/ρ]    (1.59)

where ρ is a parameter of the estimator that depends on the sample size and satisfies

limn→∞ ρ = 0    (1.60)

d(x,xi) is a suitable metric and h(.) is a function attaining a maximum at d(x,xi) = 0 and
decreasing monotonically as d(x,xi) increases. Provided that h(.) is a non-negative function,
the only constraint which h(.) has to satisfy is

∫ K(x,xi) dx = 1    (1.61)
Conditions (1.60) and (1.61) guarantee that pest(x|cj) is a density function providing an
unbiased and consistent estimate of p(x|cj). The most important kernels are hyperspheric kernels
(Figure 1.11(a) and equation 1.62), hypercubic kernels (Figure 1.11(b) and equation 1.63) and
Gaussian kernels (Figure 1.11(c) and equation 1.64).

K(x,xi) = 1/v, if dE(x,xi) ≤ ρ
        = 0,   if dE(x,xi) > ρ    (1.62)
where dE(x,xi) = [(x − xi)ᵀ(x − xi)]^1/2 is the Euclidean distance metric and v is the volume of a
hypersphere with radius ρ.
K(x,xi) = (2ρ)⁻ᵐ, if dT(x,xi) ≤ ρ
        = 0,      if dT(x,xi) > ρ    (1.63)

where dT(x,xi) = maxj |(x − xi)j| is the Chebyshev distance metric. Unlike the Gaussian
estimator, hypercubic kernels are easy to calculate.
K(x,xi) = [(2πρ²)ᵐ |Q|]^(-1/2) exp[-dQ(x,xi)/(2ρ²)]    (1.64)

where dQ(x,xi) = (x − xi)ᵀQ(x − xi) is a quadratic distance and Q is a positive definite scaling
matrix, typically the sampling covariance matrix Sj of class cj.
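A minimal sketch of the estimator of equation (1.58) is given below, using the Gaussian kernel of equation (1.64) with the scaling matrix Q taken as the identity for simplicity; the function name and the default range of influence ρ are illustrative assumptions.

```python
import numpy as np

def parzen_density(x, exemplars, rho=0.5):
    """Kernel estimate p_est(x|c_j) of eq. (1.58), with the Gaussian kernel of
    eq. (1.64) and Q = I.

    x         : (m,) point at which the density is evaluated
    exemplars : (n, m) exemplars of a single class c_j
    rho       : range of influence (smoothing parameter) of the kernels
    """
    exemplars = np.asarray(exemplars, dtype=float)
    n, m = exemplars.shape
    d2 = ((exemplars - x) ** 2).sum(axis=1)        # quadratic distances d_Q with Q = I
    norm = (2 * np.pi * rho ** 2) ** (-m / 2)      # normalisation so each kernel integrates to 1
    K = norm * np.exp(-d2 / (2 * rho ** 2))        # kernel contributions K(x, x_i)
    return K.mean()                                # average over the n exemplars
```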
Probabilistic neural networks (see Figure 1.12) are based on the use of Bayesian classification
methods and as such provide a powerful general framework for pattern classification problems.
These networks use exemplars to develop distribution functions as outlined above,
which in turn are used to estimate the likelihood of a feature vector belonging to a particular
category.
Figure 1.11. (a) A hyperspheric, (b) hypercubic and (c) Gaussian kernel.
These estimates can be modified by the incorporation of a priori information. Suppose a
classification problem consists of p different classes c1, c2, ... cp, and that the data on which
the classification process is based can be represented by a feature vector with m dimensions
x = [x1, x2, ... xm]ᵀ

If F(x) = [F1(x), F2(x), ... Fp(x)] is the set of probability density functions of the class
populations and A = [a1, a2, ... ap] is the set of a priori probabilities that a feature vector belongs
to a particular class, then the Bayes classifier compares the p values a1·F1(x), a2·F2(x), ... ap·Fp(x)
and determines the class with the highest value.
Figure 1.12. Structure of a probabilistic neural network (input layer, normalization layer, pattern layer,
summation layer and output layer).
Before this decision rule (in which the multivariate class probability density functions are
evaluated, weighted and compared) can be implemented, the probability density functions
have to be constructed. Parzen estimation is a non-parametric method of doing so, in which no
assumption is made with regard to the nature of the distributions of these functions, that is

Fk(x) = (B/mk) Σj exp[-(x − xkj)ᵀ(x − xkj)/(2σ²)]    (1.65)

where B = 1/((2π)^(p/2) σᵖ).
The Parzen estimator is constructed from the n training data points available. As explained above,
the exponential terms or Parzen kernels are small multivariate Gaussian curves that are added
together and smoothed (the B-term). As shown in Figure 1.12, the neural network version of this
Bayesian classifier consists of an input layer, a normalizing layer (which normalizes the
feature vector x, so that xᵀx = 1), a pattern or exemplar layer, which represents the Parzen
kernels, a summation layer in which the kernels are summed, and a competitive output layer.
The weights associated with the nodes in the output (class) layer constitute the a priori
probabilities ak (k = 1, 2, ... p) of the occurrence of the classes, and usually assume equal values
unless specified otherwise.
Probabilistic neural networks are also useful for pattern recognition and classification pro-
blems, especially where the probabilities of some events or classes are known in advance,
since these can be incorporated directly into the network.
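By way of illustration, the decision rule can be sketched as follows: the pattern layer evaluates one Parzen kernel per exemplar (equation 1.65, written here with the feature dimension m in the smoothing constant), the summation layer averages the kernels for each class, and the output layer selects the class with the largest prior-weighted value ak·Fk(x). The normalization layer is omitted for brevity, and all names and default values are assumptions of the example rather than part of the original network.

```python
import numpy as np

def pnn_classify(x, class_exemplars, priors=None, sigma=0.5):
    """Probabilistic neural network decision rule (Bayes classifier with Parzen estimates).

    x               : (m,) feature vector to classify
    class_exemplars : list of (n_k, m) arrays, one per class c_k (the pattern layer)
    priors          : a priori probabilities a_k (uniform if None)
    sigma           : common smoothing parameter of the Parzen kernels
    """
    x = np.asarray(x, dtype=float)
    p = len(class_exemplars)
    m = x.shape[0]
    priors = np.full(p, 1.0 / p) if priors is None else np.asarray(priors, dtype=float)
    B = 1.0 / ((2 * np.pi) ** (m / 2) * sigma ** m)        # smoothing constant (B-term)

    scores = []
    for a_k, X_k in zip(priors, class_exemplars):
        d2 = ((np.asarray(X_k) - x) ** 2).sum(axis=1)      # (x - x_kj)^T (x - x_kj)
        F_k = B * np.exp(-d2 / (2 * sigma ** 2)).mean()    # summation layer: Parzen estimate F_k(x)
        scores.append(a_k * F_k)                           # weight by the a priori probability
    return int(np.argmax(scores))                          # output layer: class with the highest value
```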
1.6.6. Radial basis function neural networks
It can be shown that in solving problems concerning nonlinearly separable patterns, there is
practical benefit to be gained in mapping the input space into a new space of sufficiently high
dimension. This nonlinear mapping in effect turns a nonlinearly separable problem into a
linearly separable one. The idea is illustrated in Figure 1.13, where two interlocked two-
dimensional patterns are easily separated by mapping them to three dimensions where they
can be separated by a flat plane. In the same way it is possible to turn a difficult nonlinear
approximation problem into an easier linear approximation problem.
Figure 1.13. Linear separation of two nonlinearly separable classes (class 1 and class 2), after mapping
to a higher dimension.
Consider therefore without loss of generality a feedforward neural network with an input layer
with p input nodes, a single hidden layer and an output layer with one node. This network is
designed to perform a nonlinear mapping from the input space to the hidden space, and a
linear mapping from the hidden space to the output space.
Figure 1.14. Structure of a radial basis function neural network (input nodes x1, ..., xm, a radial basis
function hidden layer and a linear output node y).
Overall the network represents a mapping from the m-dimensional input space to the one-
dimensional output space, s: ℜᵐ → ℜ¹, and the map s can be thought of as a hypersurface
F ⊂ ℜᵐ⁺¹, in the same way as we think of the elementary map s: ℜ¹ → ℜ¹, where s(x) = x²,
as a parabola drawn in ℜ²-space. The curve F is a multidimensional plot of the output as
a function of the input. In practice the surface F is unknown, but exemplified by a set of
training data (input-output pairs).
As a consequence, training constitutes a fitting procedure for the hypersurface F, based on the
input-output examples presented to the neural network. This is followed by a generalization
phase, which is equivalent to multivariable interpolation between the data points, with inter-
polation performed along the estimated constrained hypersurface (Powell, 1985).
In a strict sense the interpolation problem can be formulated as follows. Given a set of n
different observations on m variables {xi ∈ ℜᵐ | i = 1, 2, ... n} and a corresponding set of n
real numbers {zi ∈ ℜ¹ | i = 1, 2, ... n}, find a function F: ℜᵐ → ℜ¹ that complies with the
interpolation condition F(xi) = zi, for i = 1, 2, ... n. Note that in the strict sense specified, the
interpolation surface is forced to pass through all the training data points.
Techniques based on radial basis functions are based on the selection of a function F of the
following form

F(x) = Σi=1..n wi φ(‖x − xi‖)    (1.66)

where {φ(‖x − xi‖) | i = 1, 2, ... n} is a set of n arbitrary functions, known as radial basis
functions. ‖·‖ denotes a norm that is usually Euclidean. The known data points xi typically form
the centres of the radial basis functions. Examples of such functions are multiquadrics, φ(r) =
(r² + c²)^1/2, inverse multiquadrics, φ(r) = (r² + c²)^(-1/2), Gaussian functions φ(r) = exp{-r²/(2σ²)}
and thin-plate splines φ(r) = (r/σ)² log(r/σ), where c and σ are positive constants and r ∈ ℜ.
By use of the interpolation condition F(xi) = zi and equation (1.66), a set of simultaneous
linear equations for the unknown coefficients or weights (wi) of the expansion can be
obtained:

φ11 φ12 ... φ1n   w1     z1
φ21 φ22 ... φ2n   w2  =  z2
 .    .  ...  .    .      .
φn1 φn2 ... φnn   wn     zn

where φij = φ(‖xi − xj‖), for i, j = 1, 2, ... n. Moreover, the n × 1 vectors w = [w1, w2, ... wn]ᵀ
and z = [z1, z2, ... zn]ᵀ represent the linear weight vector and target or desired response vector
respectively. With Φ = {φij | i, j = 1, 2, ... n} the n × n interpolation matrix, Φw = z represents a
more compact form of the set of simultaneous linear equations.
For a certain class of radial basis functions, such as inverse multiquadrics (equation 1.67) and
Gaussian functions (equation 1.68), the n × n matrix Φ is positive definite.

φ(r) = (r² + c²)^(-1/2)    (1.67)

for c > 0 and r ≥ 0, and

φ(r) = exp{-r²/(2σ²)}    (1.68)

for σ > 0 and r ≥ 0.
If all the data points are distinct, and the matrix Φ is positive definite, then the weight vector
can be obtained from w = Φ⁻¹z. If the matrix is arbitrarily close to singular, perturbation of the
matrix can help to solve for w. These radial basis functions are used for interpolation, where
the number of basis functions is equal to the number of data points.
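The following sketch illustrates this exact interpolation scheme with Gaussian basis functions (equations 1.66 and 1.68): the interpolation matrix Φ is built with the data points as centres and the weights are obtained by solving Φw = z, with an optional small diagonal perturbation for matrices that are close to singular. Names and default parameter values are illustrative assumptions.

```python
import numpy as np

def rbf_interpolate(X, z, sigma=1.0, ridge=0.0):
    """Solve Phi w = z for the weights of eq. (1.66), using Gaussian basis
    functions (eq. 1.68) centred on the data points themselves.

    X     : (n, m) data points x_i (also the RBF centres)
    z     : (n,) target values z_i
    sigma : width of the Gaussian basis functions
    ridge : small perturbation added to the diagonal if Phi is nearly singular
    """
    X = np.asarray(X, dtype=float)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)   # ||x_i - x_j||^2
    Phi = np.exp(-d2 / (2 * sigma ** 2))                      # interpolation matrix
    return np.linalg.solve(Phi + ridge * np.eye(len(X)), z)   # w = Phi^-1 z

def rbf_evaluate(Xq, X, w, sigma=1.0):
    """Evaluate F(x) = sum_i w_i phi(||x - x_i||) at the query points Xq."""
    d2 = ((np.asarray(Xq)[:, None, :] - np.asarray(X)[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * sigma ** 2)) @ w
```

Because every data point contributes one basis function, the resulting interpolant passes exactly through the training data, which is what distinguishes this scheme from the radial basis function networks discussed below.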
Although the theory of radial basis function neural networks is intimately linked with that of
radial basis functions themselves (a main field of study in numerical analysis), there are some
differences. For example, with radial basis function neural networks, the number of basis
functions need not be equal to the number of data points and is typically much less. Moreover,
the centres of the radial basis functions need not coincide with the data themselves and the
widths of the basis functions also do not need to be the same. The determination of suitable
centres and widths for the basis functions is usually part of the training process of the net-
work. Finally, bias values are typically included in the linear sum associated with the output
layer to compensate for the difference between the average value of the targets and the
average value of the basis functions over the data set (Bishop, 1995).
In its most basic form, the construction of a radial basis function neural network involves
three different types of layers. These networks typically consist of input layers, hidden
(pattern) layers, as well as output layers, as shown in Figure 1.14. The input nodes (one for
each input variable) merely distribute the input values to the hidden nodes (one for each
exemplar in the training set) and are not weighted. In the case of multivariate Gaussian
functions⁴, the hidden node activation functions can be described by

zij(xj, αi, βi) = exp(-‖αi − xj‖²/βi²)    (1.69)

where xj = {x1, x2, ... xm}j is the j'th input vector of dimension m presented to the network, and
zij(xj, αi, βi) is the activation of the i'th node in the hidden layer in response to the j'th input
vector xj. m+1 parameters are associated with each node, viz. αi = {α1, α2, ... αm}i, as well as
βi, a distance scaling parameter which determines the distance in the input space over which
the node will have a significant influence.
The parameters αi and βi function in much the same way as the mean and standard deviation
in a normal distribution. The closer the input vector is to the pattern of a hidden unit (i.e. the
smaller the distance between these vectors), the stronger the activity of the unit. The hidden
layer can thus be considered to be a density function for the input space and can be used to
derive a measure of the probability that a new input vector is part of the same distribution as
the training vectors. Note that the training of the hidden units is unsupervised, i.e. the pattern
layer representation is constructed solely by self-organisation.
Whereas the αi vectors are typically found by vector quantization, the βi parameters are
usually determined in an ad hoc manner, such as the mean distance to the first k nearest αi
centres. Once the self-organizing phase of training is complete, the output layer can be trained
using standard least mean square error techniques.
Each hidden unit of a radial basis function network can be seen as having its own receptive
field, which is used to cover the input space. The output weights leading from the hidden units
⁴ The use of Gaussian radial basis functions is particularly attractive in neural networks, since these are the only
functions that are factorizable, and can thus be constructed from 1- and 2-dimensional radial basis functions.
to the output nodes subsequently allow a smooth fit to the desired function. Radial basis
function neural networks can be used for classification, pattern recognition and process
modelling, and can model local data more accurately than multilayer perceptron neural
networks. They perform less well as far as representation of the global properties of the data
is concerned.
The classical approach to training of radial basis function neural networks consists of
unsupervised training of the hidden layer, followed by supervised training of the output layer,
which can be summarized as follows.

i) Estimation of cluster centres in the hidden layer
• Start with a random set of cluster centres c = {c1, c2, ... ck}.
• Read the r'th input vector xr.
• Modify the closest cluster centre (the learning coefficient η is usually reduced with time):

  ck(new) = ck(old) + η(xr − ck(old))    (1.70)

• Terminate after a fixed number of iterations, or when η = 0.

ii) Estimation of the width of the activation functions
The width of the transfer function of each of the Gaussian kernels or receptive fields is
based on a P nearest neighbour heuristic:

  σk = {(1/P) Σp ‖ck − ckp‖²}^1/2    (1.71)

where ckp represents the p'th nearest neighbour of the k'th cluster centre ck.

iii) Training of the output layer
The output layer is trained by minimization of a least squares criterion and is equivalent to
parameter estimation in linear regression, i.e. it does not involve a lengthy process, since there
is only one linear (output) layer.
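A minimal sketch of this three-phase procedure is given below, assuming Gaussian hidden nodes as in equation (1.69), the competitive centre update of equation (1.70), the P nearest neighbour width heuristic of equation (1.71), and a least squares fit of the output layer with a bias term; the number of centres, the learning schedule and the other defaults are assumptions of the example, not prescribed by the text.

```python
import numpy as np

def train_rbf_network(X, y, k=10, eta=0.5, epochs=30, P=2, seed=0):
    """Classical three-phase training of an RBF network (steps i-iii above)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, m = X.shape

    # i) Cluster centres: competitive update of the closest centre, eq. (1.70)
    centres = X[rng.choice(n, size=k, replace=False)].copy()
    for _ in range(epochs):
        for x_r in X[rng.permutation(n)]:
            j = np.argmin(((centres - x_r) ** 2).sum(axis=1))
            centres[j] += eta * (x_r - centres[j])
        eta *= 0.9                                    # learning coefficient reduced with time

    # ii) Widths: P nearest neighbour heuristic, eq. (1.71)
    dc = np.sqrt(((centres[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2))
    widths = np.array([np.sqrt((np.sort(dc[j])[1:P + 1] ** 2).mean()) for j in range(k)])

    # iii) Output layer: linear least squares fit of the output weights (plus a bias)
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    H = np.exp(-d2 / widths ** 2)                     # hidden activations, eq. (1.69)
    H = np.hstack([H, np.ones((n, 1))])               # bias column
    w, *_ = np.linalg.lstsq(H, np.asarray(y, dtype=float), rcond=None)
    return centres, widths, w

def rbf_predict(Xq, centres, widths, w):
    """Evaluate the trained network at the query points Xq."""
    d2 = ((np.asarray(Xq, dtype=float)[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    H = np.exp(-d2 / widths ** 2)
    return np.hstack([H, np.ones((len(H), 1))]) @ w
```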
In summary, when compared with multilayer perceptrons:
• Radial basis function neural networks have single hidden layers, whereas multilayer
perceptrons can have more than one hidden layer. It can be shown that radial basis function
neural networks require only one hidden layer to fit an arbitrary function (as opposed
to the maximum of two required by multilayer perceptrons). This means that training is
considerably faster in radial basis function networks.
• In contrast with radial basis function neural networks, a common neuron model can be
used for all the nodes in a multilayer perceptron. In radial basis function networks the
hidden layer neurons differ markedly from those in multilayer perceptrons.
Random documents with unrelated
content Scribd suggests to you:
do careva tachta i devleta. zum Thron des Kaisers und zum
Reichpalast.
Tude bješe jedan mjesec
dana,
Daselbst verweilten sie
wohl einen Monat,
tursku svakut raspustiše vojsku, entliessen allwärts hin das Heer
der Türken,
mrtvu caru hator napraviše, bezeigten ihre Lieb dem toten
Kaiser,
795 mrtvu caru sultan Sulejmanu.
* * *
dem toten Kaiser Sultan
Suleimân.
* * *
Ondar veli sultan Ibrahime: Dann spricht das Wort der
Sultan Ibrahîm:
— »Ljuboviću ot
Hercegovine!
— »O Ljubović aus
meinem Herzoglande!
eto tebi sva Hercegovina, da nimm das ganze Herzogland
entgegen;
da ti ništa uzimati ne ću, ich werde nichts von dir an
Steuer nehmen,
800 malo vakta, dvanajs godin dana, nicht einen weissen Heller noch
Denar,
ni bijele pare ni dinara!« nur kurze Zeit hindurch, zwölf
volle Jahr!«
Onda veli gazi
Rustambegu:
Darauf zu Rustanbeg, dem
Glaubenstreiter:
— »A gazijo ot Sarajva
ravnog!
— »O Glaubenhort vom
ebnen Sarajevo!
Hajde jadan šeher Sarajevu, zieh’ heimwärts, Ärmster, in die
Stadt Sarajvo;
805 džamiju si novu načinio du hast ein neues Gotteshaus
erbaut,
a ja ću platit a ti si poharčio. doch ich bezahle, was du
ausgegeben.
A što džamija ima u
Sarajvu
Soviel als in Sarajevo
Gotteshäuser,
a svakoj ćeš biti nadzordžija dir sei die Oberaufsicht über
jedes,
od vakufa u šeher Sarajvu, vom Kirchengut der Stadt von
Sarajevo,
810 da ti ništa uzimati nejma dass keine Steuern du entrichten
magst
Stambul gradu bijelome wohl nach Istambol in die
weisse Stadt,
doklegod je turska Bosna
ravna!«
so lang in Türkenhand das
Bosnaland!«
A beśjedi Ćuprilić veziru: Und spricht zu Vezier
Köprülü gewendet:
— »A veziru moja lalo
prava,
— »Ja, Vezier, o du mein
getreuer Lala,
815 što ću tebi pekšeš učiniti?« mit was für Gabe soll ich dich
bedenken?«
Ondar veli Ćuprilić vezire: Darauf erwiedert Vezier
Köprülü:
— »Što ti hoćeš dragi
sultan Ibrahime?«
— »Was willst du, liebster
Sultan Ibrahîm?«
Ondar veli sultan Ibrahime: Darauf entgegnet Sultan
Ibrahîm:
— »Hajde pravo Bosni
kalovitoj
— »Zieh graden Wegs ins
lehmige Bosnaland
820 i Travniku gradu bijelome, und in die weissgetünchte Stadt
von Travnik,
ondi budi na Bosni valija!« dort sei im Bosnaland mein
Landesvogt!«
Ode vezir bijelu Travniku Zum weissen Travnik
wandert hin der Vezier,
a gazija šeher Sarajevu der Glaubenstreiter in die Stadt
Sarajevo
a Ljubović na Hercegovinu, und heim ins Herzogland Beg
Ljubović;
825 osta care na devletu svome. in seinem Reichpalast der Kaiser
blieb.
Erläuterungen.
Dieses Lied sang mir am 26. April 1885 der serbisierte tatarische Zigeuner
A l i j a C i g o in Pazarići in Bosnien vor.
Zu V. 1. Die Volküberlieferung knüpft auch hier, wie sonst öfter, an den
ruhmreichen Namen Suleimân II. an (1520–1566), unter dessen Regierung
die türkische Machtentwicklung ihren Höhepunkt erreicht hatte. Sultan
Ibrahîms Vorgänger auf dem Throne war Murad IV. und Nachfolger
Mohammed IV.
V. 23. Lalen und Ridžalen. — Lala türk. Diener. Als Lehnwort auch bei den
Bulgaren, Polen und Russen. Im serbischen nur für »Kaiserlicher Diener,«
so z. B. (der Sultan spricht):
lalo moja, muhur sahibija, O du mein Diener, du mein
Siegelwahrer,
što mi zemlje i gradove čuvaš! der du mir Städte und die Länder
hütest!
oder:
Divan čini care u Stambolu Divân beruft der Kaiser ein in
Stambol
za tri petka i tri ponediljka; dreimal je Freitags und dreimal je
Montags;
svu gospodu sebi pokupio, berief zu sich die Herren allzumal,
okupio paše i vezire: berief die Paschen und Vezieren ein:
— Lale moje, paše i veziri! — O meine Lalen, Paschen
und Veziere!
Ridžal arab. türk. Reisiger, übertragen: hoher Würdenträger zum redžal;
albanesisch: ridžal, Advokat, griech. rhitzali. Vergl. S. 249 zu V. 211.
V. 25 u. 33. Neun erwählte Frauen. — »Von den Frauen des Sultans
Ibrahîm führten sieben den Titel Chasseki, d. i. der innigsten
Günstlinginnen, bis zuletzt die achte, die berühmte Telli, d. i. die Drahtige,
ihm gar als Gemahlin vor allen angetraut ward. Eine andere hiess
Ssadschbaghli, d. i. die mit den aufgebundenen Haaren. Jede dieser sieben
innigsten Günstlinginnen hatte ihren Hofstaat, ihre Kiaja, die Einkünfte
eines Sandschaks als Pantoffelgeld, jede hatte einen vergoldeten mit
Edelsteinen besetzten Wagen, Nachen und Reitzeug. Ausser den
Sultaninnen Günstlinginnen hatte er Sklavinnen Günstlinginnen, derer zwei
berühmteste die Schekerpara, d. i. Zuckerstück und Schekerbuli, d. i.
Zuckerbulle hiess; jene ward verheiratet, diese aber stand zu hoch in der
Gunst, um je verheiratet zu werden. Die Sultaninnen Günstlinginnen
erhielten Statthalterschaften zu ihrem Pantoffelgeld, die Schützlinginnen
Sklavinnen hatten sich die höchsten Staatämter vorbehalten.« J. von
Hammer-Purgstall, Geschichte des Osmanischen Reiches, Pressburg, 1835,
V, S. 255 f.
Zu V. 37. Kafirhänden. Im Text: u kaurina, türk. gjaur, gjavir, aus dem
arab. ci Kafir, pers. gebr, der Ungläubige.
V. 38–39. Die Nennung dieser drei Städtenamen, sowie späterhin
Temešvars, hier eine dichterische Freiheit. Die Eroberung von Erlau (Egra)
und Kanísza bilden Glanzpunkte der Regierung Mohammed III. (verstorben
22. Dezember 1603). Über Erlau vergl. Franz Salamon, Ungarn im Zeitalter
der Türkenherrschaft (deutsch v. Gustav Turány), Leipzig, 1887, S. 125 f.
u. besonders S. 138 ff. — Die Einnahme Ofens erfolgte im J. 1541.
Soliman, der 1526 Ofen nicht besetzen wollte, nimmt es 1541 endgültig in
seine Hand. Im J. 1543 setzte er seine Eroberungen fort. Der Sultan nahm
zuerst die Burgen Valpó, Siklós und Fünfkirchen, darauf Stuhlweissenburg
und Gran. Bis 1547 gehörte den Türken Peterwardein, Požega, Valpó,
Essegg, Fünfkirchen, Siklós, Szegszárd, Ofen, Pest, Stuhlweissenburg,
Simontornya, Višegrad, Gran, Waitzen, Neograd und Hatvan; jenseits der
Teiss nur das einzige Szegedin, das sich im Winter 1542 freiwillig ergeben
und als türkischer Besitz isoliert dastand. — Semendria (Szendrö)
versuchten die Türken im J. 1437 einzunehmen, um sich den wichtigen an
der Donau gelegenen Schlüssel des Morava-Tales zu sichern, aber das
ungarische Heer unter Pongraz Szentmiklósi errang einen glänzenden Sieg
über sie. Als sich 1459 die Festung Semendria an Mohammed II. ergab,
gelangten zugleich zahlreichere kleinere Festungen in seine Gewalt. Serbien
wurde zum Sandžak, und der Türke siedelte an Stelle der massenhaft in die
Sklaverei geschleppten Einwohner, Osmanen in die Städte und führte
daselbst seine Verwaltung ein. 1466 als König Mathias beschäftigt war in
Oberungarn einige Aufrührer zur Ruhe zu bringen, lässt ein türkischer
Pascha seine Truppen in Serbien einrücken und nimmt durch
Überrumpelung die Festung Semendria. Vergl. Dr. Wilhelm Fraknói,
Mathias Corvinus, König von Ungarn, Freib. i. Br., 1891, S. 70 ff.
Zu V. 59. »Verschliess dich in den festen Käfig.« — »Als nach Murads
Verscheiden der Hofbedienten Schar mit Freudengeschrei an die Türe des
Käfigs, d. i. des Prinzengemaches drang, um den neuen Herrn
glückwünschend auf den Thron zu ziehen, verrammelte Ibrahîm die Tür,
aus Furcht, dass dies nur List des noch atmenden Tyrannen Murad sei, um
ihn, den einzigen überlebenden Bruder so sicher ins Grab voraus zu
schicken. Mit ehrfurchtvoller Gewalt wurde die Tür erbrochen, und noch
immer weigerte sich Ibrahîm der Freudenkunde Glauben beizumessen, bis
die Sultanin-Mutter Kösem (eine Griechin) selber ihn von des Sultans Tod
versicherte und ihre Versicherung durch den vor die Tür des Käfigs
gebrachten Leichnam bestätigte. Da begab sich erst Ibrahîm aus dem Käfig
in den Thronsaal, empfing die Huldigung der Veziere, Reichsäulen, Ulema
und Aga, trug dann mit den Vezieren des Bruders Leiche selbst bis ans Tor
des Serai und ward hierauf nach altem Herkommen osmanischer
Thronbesitznahme zu Ejub feierlich umgürtet.« Bei J. v. Hammer, a. a. O.,
V, S. 215 f., unter Berufung auf Rycauts Continuation of Knolles II, p. 50.
Die neu eröffnete otomanische Pforte t. 458.
Zu V. 75 ff. Am neunten Tage nach der Thronbesteigung fand die
Umgürtung des Säbels in der Moschee Ejub in den durch das Gesetzbuch
des Zeremoniels vorgeschriebenen Formen des Aufzuges und der
Feierlichkeiten statt. Mit Sonnenaufgang versammelten sich alle Klassen
der Staatbeamten im ersten Hofe des Serai. Die ausführliche Schilderung
siehe bei Hammer, a. a. O., IV2
, S. 499–550.
Zu V. 76. »Goldne Mütze.« — Im Texte tadža. Sultan Bajezid I. (gestorb.
1403) trug als Turban weder die Goldhaube (uskuf) der ersten sechs
Sultane, noch den vom siebenten angenommenen runden Kopfbund der
Ulema (urf) sondern nahm den hohen, zylinderförmigen, mit Musselin
umwundenen an, der sofort unter dem Namen Mudževese (tadža) der Hof-
und Staatturban geblieben.
Zu V. 80 f. Die ersten Säulen des Reiches und Stützen des Divans sind die
Veziere, d. h. die Lastträger. Es gab ihrer unter Ibrahîm schon vier. Die
Vierzahl gibt als eine dem Morgenländer beliebte und heilige Grundzahl
den Teilunggrund der ersten Staatämter ab. Vier Säulen stützen das Zelt,
vier Engel sind nach dem Koran die Träger des Thrones, vier Winde
regieren die Regionen der Luft nach den vier Kardinalpunkten des Himmels
usw. Aus diesem Grunde setzte Sultan Mohammed der Eroberer, vier
Säulen oder Stützen des Reiches (erkiani devlet) fest in den Vezieren, in den
Kadiaskeren, in den Defterdaren und in den Nišandži, die zugleich die vier
Säulen des Divans, d. h. des Staatrates sind. Anfangs war nur ein Vezier,
dann zwei, dann drei unter den ersten Sultanen; der Eroberer erhob ihre
Zahl auf vier, deren erster und allen übrigen an Macht und Rang bei weitem
vorhergehende, der Grossvezier wurde, der unumschränkte
Bevollmächtigte, das sichtbare Ebenbild des Sultans, sein vollgewaltiger
Stellvertreter, der oberste Vorsteher aller Zweige der Staatverwaltung, der
Mittelpunkt und der Hebel der ganzen Regierung.
Zu V. 81. »Siegelhüter« (muhur sahibija). — Der Kanun des Siegels (nach
Sultan Mohammed II.) überträgt dem Grossvezier darüber die Obhut, als
das Symbol der höchsten Vollmacht; in der Überreichung des Siegels liegt
auch die Verleihung der höchsten Würde des Reiches. Der Grossvezir darf
sich (abgesehen von der Versiegelung der Schatzkammer, die, beiläufig
bemerkt, nur in Gegenwart der Defterdare geöffnet werden kann) dieses
Siegels nur zur Besieglung der Vorträge bedienen, und da alle Vorträge
durch die Hand des Grossveziers gehen müssen, und niemand als er das
Recht hat, an den Sultan schriftlich zu berichten, so sieht der letztere kein
anderes Siegel als sein eigenes oder etwas das der fremden Monarchen,
wenn deren Gesandte ihre Beglaubigungschreiben in feierlicher Audienz
überreichen.
Zu V. 82. »Pascha Seidi.« — Über Achmed Sidi, Köprülüs Schwager, die
Geissel Siebenbürgens, Pascha von Neuhäusel, vergl. Hammer, a. a. O., VI,
S. 272. Ein Seid wird in den Epen moslimischer Guslaren häufig auch als
Heiliger genannt und gerühmt. In einem Guslarenliede heisst es:
efendija muhur sahibija [Erschienen war] Efendi Siegelhüter
sa svojijem pašom Seidijom, zugleich mit ihm sein Pascha Seïdi,
što je paša na četeres paša. der Obrist Pascha über vierzig
Paschen.
Zu V. 93 ff. Im J. 1656 war Mohammed mit dem wunden Halse
Grossvezier.
»Am 10. September 1656 fand ein Divan statt. Der Sultan sagte zum
Grossvezier: ‘Ich will selbst in den Krieg ziehen, du musst durchaus für die
nötige Rüstung sorgen!’ Der hilflose Greis faltete die Hände, als ob er die
ganze Versammlung um Hilfe anflehte und sagte: ‘Glorreichster, gnädigster
Padischah, Gott gebe euch langes Leben und lange Regierung! bei der
herrschenden Verwirrung und dem Mangel an Kriegzucht ist es schwer,
Krieg zu führen; zur Möglichkeit der nötigen Rüstungen ist von Seite des
Reichschatzes eine Hilfe von zwanzigtausend Beuteln notwendig!’ Der
Sultan schwieg zornig und hob die Versammlung auf.« Hammer, V, 461.
»Schon bei der ersten Unzufriedenheit nach der Einnahme von Tenedos und
Lemnos hatten sich der Chasnedar der Valide, Solak Mohammed, der
Lehrer des Serai, Mohammed Efendi, der vorige Reis Efendi Schamisade
und der Baumeister Kasim, welcher schon ein paarmal den alten Köprülü
zum Grossvezier in Vorschlag gebracht, insgeheim verbündet, diesem das
Reichsiegel zu verschaffen. Der Grossvezier hatte ihn auf seiner Reise von
Syrien nach Konstantinopel zu Eskischehr wohl empfangen und nach
Konstantinopel mitgenommen, wo er sich dermalen ruhig verhielt; sobald
er aber durch den Silihdar des Sultans Wind von dem Vorschlage erhalten,
ernannte er Köprülü zum Pascha von Tripolis, und befahl ihm sogleich
aufzubrechen. Der Kiaja, ins Vertrauen der Freunde Köprülüs gezogen,
suchte vergebens den Reisebefehl zu verzögern. Da die Sache noch nicht
reif zum Schlag war, brachten die Freunde Köprülüs durch die Valide sehr
geschickt die Ernennung des Silihdars zum Statthalter von Damaskus und
die Einberufung des dortigen Veziers Chasseki Mohammed zuwegen,
wodurch das allgemeine Gerede entstand, dass dieser zum Grossvezier
bestimmt sei und die Aufmerksamkeit des Grossveziers von Köprülü
abgelenkt ward. Der Silihdar, der Patron des Grossveziers beim Sultan war
entfernt, aber noch stand den Freunden Köprülüs ein anderer mächtiger
Feind, der Janičarenaga im Wege. Sobald dieser abgesetzt und an seine
Stelle der Stallmeister Sohrab, ein Freund der Freunde Köprülüs ernannt
war, erklärte sich dieser gegen ihn, dass er einige Punkte der Valide
vorzutragen, nach deren Zusage er die Last der Regierung auf seine
Schultern zu nehmen bereit sei. Noch am selben Nachmittage wurde
Köprülü heimlich vom Kislaraga zur Valide eingeführt, und antwortete auf
ihre Frage, ob er sich den ihm bestimmten Dienst als Grossvezier zu
versehen nicht fürchte, mit dem Begehren folgender vier Punkte: erstens,
dass jeder seiner Vorschläge genehmigt werde; zweitens, dass er in der
Verleihung der Ämter freie Hand und auf die Fürbitte von niemand zu
achten habe: die Schwächen entständen aus Fürsprachen; drittens, dass kein
Vezier und kein Grosser, kein Vertrauter, sei es durch Einfluss von
Geldmacht oder geschenktem Vertrauen, seinem Ansehen eingreife;
viertens, dass keine Verschwärzung seiner Person angehört werde; würden
diese vier Punkte zugesagt, werde er mit Gottes Hilfe und dem Segen der
Valide die Vezirschaft übernehmen. Die Valide war zufrieden und beschwur
ihre Zusage dreimal mit: ‘Bei Gott dem Allerhöchsten!’Am folgenden Tage
(15. September 1656), zwei Stunden vor dem Freitaggebete, wurden der
Grossvezier und Köprülü ins Serai geladen. Dem Grossvezier wurde nach
einigen Vorwürfen über den Mangel seiner Verwaltung das Siegel
abgenommen und er dem Bostandžibaschi zur Haft überlassen, dann
Köprülü in den Thronsaal berufen. Der Sultan wiederholte die vier
versprochenen Punkte, einen nach dem andern und sagte: ‘Unter diesen
Bedingnissen mache ich dich zu meinem unumschränkten Vezier; ich werde
sehen, wie du dienst; meine besten Wünsche sind mit dir!’ Köprülü küsste
die Erde und dankte; grosse Tränen rollten den Silberbart herunter; der
Hofastronom hatte als den glücklichsten Zeitpunkt der Verleihung das
Mittaggebet vom Freitage bestimmt, eben ertönte von den Minareten der
Ausruf: ‘Gott ist gross!’ Hammer, a. a. O., V, S. 462, 2. Aufl.
Zu V. 170. Dem abgesetzten Grossvezier Mohammed mit dem wunden
Halse, dem neunzigjährigen Greise, wurde nach Einziehung seiner Güter,
das nach dem Ausspruche des Sultans verwirkte Leben auf Köprülüs
Fürbitte geschenkt und ihm zur Fristung des schwachen Restes seines
Lebens die Statthalterschaft von Kanisza verliehen. Hammer, V, S. 467.
Zu V. 360 ff. Ganz erfunden ist diese Episode nicht. Hammer berichtet
Bd. V, S. 467 ff.:
»Acht Tage, nachdem Köprülü das Reichsiegel erhalten, Freitag den
22. September 1656, versammelten sich in der Moschee S. Mohammeds die
fanatischen Anhänger Kasisades, die strengen Orthodoxen, welche unter
dem alten Köprülü, den sie für einen ohnmächtigen Greis hielten, ihrer
Verfolgungwut wider die Soffi und Derwische, Walzer- und Flötenspieler,
um so freieren Lauf zu geben hofften. Sie beratschlagten in der Moschee
und fassten den Entschluss, alle Klöster der Derwische mit fliegenden
Haaren und kronenförmigen Kopfbinden von Grund aus zu zerstören, sie
zur Erneuerung des Glaubenbekenntnisses zu zwingen, die sich dess
weigerten zu töten usw. In der Nacht war die ganze Stadt in Bewegung; die
Studenten der verschiedenen Kollegien, welchen orthodoxe Rektoren und
Professoren vorstanden, bewaffneten sich mit Prügeln und Messern und
fingen schon an die Gegner zu bedrohen. Sobald der Grossvezier hiervon
Kunde erhalten, sandte er an die Prediger Scheiche, welche die Anstifter der
Unruhen zur Ruhe bewegen sollten; da aber dies nicht fruchtete, erstattete
er Vortrag an den Sultan über die Notwendigkeit ihrer Vernichtung. Die
sogleich dem Vortrag gemässe allerhöchste Entschliessung des Todurteils
wurde von Köprülü in Verbannung gemildert.«
Zu V. 371. »Der alte Achmedaga.« — In der türkischen Geschichte heisst er
Achmedpascha Heberpascha, d. h. der in tausend Stücke Zerrissene
(Hammer, III, S. 930). Nach Hammer, Bd. III, S. 930, fiel tatsächlich ein
Grossvezier des Namens Achmedpascha durch Henkerhand am Vorabende
der Thronstürzung Sultan Ibrahîms. Es war am Abend des 7. August 1648.
Kaum hatte der abgesetzte Grossvezier Achmedpascha einzuschlafen
versucht, als er mit der Botschaft geweckt ward, er möge sich aufmachen,
die aufrührerischen Truppen verlangten ihn und er, der Grossvezier möge
als Mittler versöhnend dazwischen treten. Als er die Stiege
hinuntergekommen, griff ihm jemand unter die Arme. Er sah sich um, wer
es sei und sah vor sich Kara Ali, den Henker, den er so oft gebraucht. »Ei,
ungläubiger Hurensohn!« redete er ihn an. »Ei, gnädiger Herr!« erwiderte
der Henker, ihm lächelnd die Brust küssend; unter die Linke
Achmedpaschas griff Hamal Ali, des Henkers Gehilfe. Sie führten ihn zum
Stadttor, dort zog der Henker seine rote Haube vom Kopfe und steckte sie
in seinen Gürtel, nahm Achmedpascha seinen Kopfbund ab, warf ihm den
Strick um den Hals und zog ihn mit seinem Gehilfen zusammen, ohne dass
der Unglückliche etwas anderes als: »Ei, du Hurensohn!« vorbringen
konnte. Der ausgezogene Leichnam wurde auf ein Pferd geladen und auf
des neuen Grossveziers Sofi Mohammed Befehl hin auf den Hippodrom
geworfen.
Zu V. 392. »K r e u z e u n d M a r i e n . « — Kreuzchen und
Marienmedaillen, wie Christen solche zu jener Zeit im Haare trugen. Eine
anschauliche Beschreibung gibt uns eine Stelle in einem noch ungedruckten
Guslarenliede meiner Sammlung. Halil, der Falke, ist entschlossen, an
einem Wettrennen im christlichen Gebiete teilzunehmen, um den
ausgesetzten Preis, ein Mädchen von gefeierter Schönheit, davonzutragen.
Seine Schwägerin, Mustapha Hasenschartes Gemahlin, hilft ihm bei der
Verkleidung zu einem christlichen Ritter, wie folgt:
ondar mu je sa glave fesić oborila Vom Haupte sie warf ihm das
Fezlein herab
i rasturi mu turu ot perčina und löste den Bund des Zopfes ihm
auf
i prepati češalj od fildiša und griff nach dem Kamm aus
Elfenbein
te mu raščešlja turu ot perčina und kämmte den Bund des Zopfes
ihm auf
a oplete sedam pletenica und flocht ihm in sieben Flechten
das Haar
a uplete mu sedam medunjica und flocht ihm sieben Medaillen
hinein
a uplete mu križe i maiže Und flocht ihm Kreuze hinein und
Marien
a uplete mu krste četvrtake und flocht ihm hinein quadratige
Kreuze
a šavku mu podiže na glavu und stülpte den Helm ihm auf das
Haupt,
a pokovata grošom i tal’jerom der beschlagen mit Groschen und
Talerstücken,
a potkićena zolotom bijelom. der geschmückt mit weissen
Münzen war.
So wie hier Halil als Christ auftritt, so ist der als Moslim verkappte Christ
eine stehende Figur des Guslarenliedes. Christ und Moslim sind in der
angenommenen Rolle einander wert und würdig. — In der von der
chrowotischen Akademie in Agram herausgegebenen ‘Religion der
Chrowoten und Serben’ figurieren die edlen Raubmörder Gebrüder
Mustapha und Alil als Minos und Rhadamanthys der Urchrowoten. Wie
glücklich sind doch diese unsterblichen Akademiker des
Chrowotenvölkleins zu preisen, die in Zeiten naturwissenschaftlicher
Forschungen und gewaltigster technischer Fortschritte keine anderen
Sorgen haben als unerhörte Götter zu erfinden und eine neue Religion zu
stiften!
Zu V. 477. Ljubović, der berühmteste moslimische Held des Herzogtums,
eine stehende Figur der Guslarenlieder beider Konfessionen. Mustapha
Hasenscharte schreibt einmal ein Aufgebot aus. Der Brief zu Händen des
Freundes Šarić:
O turčine Šarić Mahmudaga! O [Bruder] Türke Šarić
Mahmudaga!
Eto tebi knjige našarane! Da kommt zu dir ein Schreiben
zierlich fein!
Pokupi mi od Mostara turke, Von Mostar biet mir auf die
Türkenmannen,
ne ostavi bega Ljubovića lass nicht zurück den Beg, den
Ljubović,
sa široka polja Nevesinja, vom weitgestreckten
Nevesinjgefilde;
jer brež njega vojevanja nejma. denn ohne seiner gibt es keinen
Feldzug.
Als Jüngling meldete sich Beg Ljubović einmal bei Sil Osmanbeg, dem
Pascha von Essegg, als freiwilliger Kundschafter, um durchs feindliche
Belagerungheer durchzudringen und dem Pascha von Ofen Nachricht von
der Bedrängnis der Stadt Essegg zu überbringen. Sil Osmanbeg umarmt und
küsst ihn und schlägt ihm mit der flachen Hand auf die Schulter:
Haj aferim beže Ljuboviću! Hei traun, fürwahr, mein Beg, du
Ljubović!
vuk od vuka, hajduk od hajduka Vom Wolf ein Wolf, vom Räuber
stammt ein Räuber,
a vazda je soko ot sokola; doch stets entspross ein Falke nur
dem Falken;
vazda su se sokolovi legli und immer fand sich vor die
Falkenbrut
u odžaku bega Ljubovića! am heimischen Herd der Begen
Ljubović!
Zu V. 678. Die Schilderung naturgetreu. Auf meinen Reisen zog ich es
mitunter vor, in eine Rossdecke eingehüllt unterm freien Himmel selbst zu
Winterzeit zu übernachten, als im Schmutz und Ungeziefer und Gestank
einer bosnischen Bauernhütte. Auch meine Aufzeichnungen machte ich
meist im Freien im Hofraume oder an der Strasse sitzend. Ich fragte den
Bauer Mujo Šeferović aus Šepak, einen recht tüchtigen Guslaren, ob er
wohl ein eigenes Heim besitze. Darauf er: imam nešto malo kuće, krovnjak
(ich besitze ein klein Stückchen Haus, eine Bedachung). Neugierig, wie ich
schon bin, ging ich zu ihm ins Gebirge hinauf, um mir seine Behausung
anzuschauen, eigentlich in der Hoffnung, bei ihm meinen Hunger zu stillen.
Ein hohes, mit verfaultem Stroh bedecktes Dach, und statt der Wände aus
Stein oder Ziegeln ein mit Lehm beschmiertes Reisergeflechte! Brot und
Fleisch fehlte im lieblichen Heime. Durch meinen Besuch fühlte er sich und
seine Familie aufs äusserste geehrt und geschmeichelt. Die Hausfrau, die
nicht zum Vorschein kam, sandte mir mit ihrem Söhnchen einen
Bohnenkäse und eingesäuerte Paprika heraus. Als Getränk Kaffeeabsud und
Honigwasser.
Zu V. 707 ff. Das seltsame Mädchen ist als die Sreća, d. h. fortuna Köprülüs
aufzufassen. Vergl. meine Studie, Sreća. Glück und Schicksal im
Volkglauben der Südslaven, Wien, 1887.
Zu V. 821. Zwei Köprülü waren Veziere (Vali) zu Bosnien: Köprülüzade
Numan, der Sohn des Grossveziers 1126 (1714) und Köprülüzade Hadži
Mehmed 1161 (1748); zum zweitenmal derselbe 1179 (1765). Der erste
Köprülü war natürlich nie bosnischer Gouverneur, nur der Guslar erhebt ihn
zu dieser nach seinen bäuerlichen Begriffen ausserordentlichen Ehren- und
Würdenstellung.
Exploratory Analysis Of Metallurgical Process Data With Neural Networks And Related Methods 1st Edition C Aldrich Eds
Die Russen vor Wien.
Die Belagerung von Wien durch die Türken im Jahre 1683 hat auch die
serbischen und bulgarischen Dichter im Volke, die ihre Lieder mit Gefiedel
auf Guslen begleiten, zur dichterischen Schilderung des weltgeschichtlichen
Ereignisses begeistert. Die älteren gedruckten Sammlungen Guslarenlieder
bieten so manches Stück dar, das jene Niederlage der Türken vor Wien bald
kürzer bald ausführlicher, mehr oder minder in treuer Anlehnung an den
tatsächlichen Verlauf des grossen Geschehnisses darstellt. Eine eigentlich
dichterische Auffassung der Tragweite des für das gesamte Abendland
unendlich bedeutsamen und folgenreichen Sieges des Christentums über
den Halbmond fehlt den Guslaren und den Liedern. Selbst der nach
volktümlicher Weise dichtende dalmatische Franziskaner A n d r i j a
K a č i ć M i o š i ć ist in seiner Besingung (im Jahre 1756) des grossen
Völkerkampfes im Grunde genommen aus seiner Schablone des
versifizierten prosaischen Berichtes nicht herausgetreten. Der Sieg der
vereinigten christlichen Mächte hat eben beim Südslaven mehr den
Verstand als das Herz und das Gemüt, diese wahren Quellen der
Begeisterung, ergriffen. Der Südslave, namentlich der christliche Serbe in
Bosnien, im Herzogland und weiter südlich, war in diesem entscheidenden
Kampfe zwischen Orient und Okzident mehr ein müssiger Zuschauer
gewesen, dem der Sieg nicht unmittelbar zu Trost und Schutz verholfen.
Der moslimische Guslar aber schweigt über diesen Kriegzug der Türken.
Sollte er etwa die Erinnerung an die furchtbare Niederschmetterung seiner
Glaubengenossen frisch im Gedächtnis der Nachwelt erhalten wollen? Sein
Mund verstummte angesichts des über den Sultan »die Sonne des Ostens«
hereingebrochenen unheilschwangeren Unsals. Der christliche Guslar in
Bosnien und dem Herzögischen musste wieder dagegen bedachtsam seine
Schadenfreude vor den Herren des Landes, den Moslimen, verbergen. Das
Ereignis wurde immer seltener und seltener besungen, bis die Nachrichten
darüber schon nach hundertundfünfzig Jahren in eine märchenhafte Sage
ausklangen, die nur noch die Hauptsache, den Entsatz von Wien und die
gänzliche Niederwerfung der Türkenherrschaft im Ungarlande festhält, fast
alles Beiwerk aber der Dichtung entnimmt. Die geschichtliche Wahrheit tritt
zurück, überwuchert vom üppig aufgeschossenen Lianengeranke
ungebundener Phantasie.
Von dieser Art ist unser Guslarenlied.
Man erfährt daraus an geschichtlichen Tatsachen bloss, dass einmal die
Stadt Wien an der Donau von einer gewaltigen Türkenmacht belagert und
fast eingenommen worden sei und dass sich der ‘Kaiser von Wien’, sein
Name wird nicht genannt, durch auswärtige Hilfe, einem aus Norden
kommenden Heere, aus der Not befreit hat und dass die Türken eine
gründliche Niederlage erfahren. Vom Grafen R ü d i g e r v o n
S t a r h e m b e r g , vom Polenkönig J o h a n n S o b i e s k i , vom Herzog
K a r l v o n L o t h r i n g e n , vom Fürsten von Wa l d e c k und den
Kurfürsten von B a y e r n und S a c h s e n , die alle am Befreiungkampfe
rühmlichst Anteil genommen, von allen diesen weiss der Guslar nichts.
Dafür erzählt er uns ein Märchen, das in einzelnen Zügen eine auffällige
Verwandtschaft mit der Fabel des Liedes vom E n d e K ö n i g
B o n a p a r t e s 1 aufweist.
Gleich dem ‘König Alexius-Nikolaus’ von Russland in jenem Liede, verlegt
sich in diesem der ‘Wiener Kaiser’, voll Ergebung in die Schicksalfügung,
aufs Weinen. Dem einen wie dem anderen muss der Eidam Hilfe bringen.
Des Russenkönigs Eidam ist der Tataren Chan, der seine 100 000 Mann
gegen Bonaparte stellt, des Wiener Kaisers Schwiegersohn ist der greise
Vater des Russenkaisers Michael, J o h a n n e s M o s k a u e r (Mojsković
Jovan), der mit 948 000 Mann zum Entsatz Wiens heranrückt. Davon sind,
genau betrachtet, nur 900 000 Mann Russen, der Rest Hilftruppen
sagenhafter Lehenfürsten oder Bundgenossen, der ‘s c h w a r z e n
K ö n i g i n ’ (einer der serbischen Sagenwelt auch sonst vertrauten Gestalt),
des Königs L e n d e r (vielleicht steckt dahinter ursprünglich der Name
L o r r a i n e ?) und des Königs Š p a n j u r , des Spaniers. Trotz dieser
ungeheueren Heermacht vermag ‘Kaiser Michael’ ebensowenig als ‘König
Nikolaus’ gegen den Feind etwas auszurichten. Beidemal hilft zum Siege
das gleiche himmlische Wunder, ein strömender Regen. König Bonapartes
Heer vor Petersburg gerät bis zum Hals in Wasser und erfriert stehenden
Fusses, dem türkischen Heere vor Wien verdirbt im Regen alle Munition.
Es geschieht aber noch ein grösseres Wunder, dasselbe, das auch die
Erscheinung in Macbeth Akt IV, Sz. I, anzeigt:
»Macbeth geht nicht unter, bis der Wald
von Birnam zu Dunsinans Höhen wallt
und dich bekämpft.«
In unserem Liede bedient sich Kaiser Michael auf den Rat seines Vaters hin
einer noch durchdachteren Krieglist, indem er sein Heer auch mit
Leinwandwänden umgibt; darauf rufen die Türken beim Anblick des
wandelnden Kahlenberges und der weissen Wände verzweifelt aus:
’da dringen vor aus dem verfluchten Russland,
da dringen gen uns Berge vor und Burgen!’
Wie in die hundert anderer Märchenmotive ist auch dieses vom wandelnden
Wald ein Gemeingut a l l e r Völker und auch den Arabern geläufig2. Es ist
leicht möglich, dass gerade dieser Zug durch Shakespeares Werke
allgemeinere Verbreitung gewonnen hat. Shakespeare ist bekannter als man
glauben mag. Speziell sein Kaufmann von Venedig ist zum internationalen
geistigen Eigentum selbst der untersten Volkschichten geworden. Aus
Bosnien und Slavonien haben wir von der Geschichte schon mehrere
gedruckte Varianten. Der Vermittler für die Bosnier war in erster Reihe, wie
sich dies bei einem nahezu literaturlosen Volke von selbst versteht, die
mündliche Überlieferung. Auf diesem Wege sind die Bosnier auch mit
M a i s t r e P i e r r e P a t h e l i n bekannt geworden.3
Das Guslarenlied, das wir hier mitteilen, singt der betagte Guslar M a r k o
R a j i l i ć , ein Orthodoxer, im Dorfe P o d v i d a č a in Bosnien. Er sagt, er
habe es in dieser Fassung vor beiläufig vierzig Jahren (1845) von dem nun
längst verstorbenen Bauer Ta n a s i j a (Athanasius) T r k u l j a aus
O v a n j s k a gelernt oder übernommen. Einen Namen oder Titel gab der
Guslar selber dem Liede nicht.
Vojsku kupi care Tatarine Der Zar Tatar der sammelt
eine Heermacht
tri godine, da ćesar ne znade drei Jahre lang, nichts weiss
davon der Kaiser,
a četiri, da i ćesar znade. vier Jahre lang, es weiss davon
der Kaiser;
Vojsku kupi sedam godin dana. wohl sammelt er ein Heer durch
sieben Jahre.
5 Kad je care vojsku sakupio Nachdem der Zar das
grosse Heer gesammelt,
okreće je Beču bijelome da lässt er’s gegen’s weisse
Wien marschieren
u proljeće kat se zopca sije. im Lenze, wann der Landmann
Hafer aussät.
Kada jesu stražnji prolazili, Der letzte Zug vom langen
Zug des Heeres
tu su zopcu konjma naticali. der konnt’ mit reifer Frucht die
Pferde füttern.
10 Kolko j, braćo, polje ispod
Beća,
O Brüder, das Gefild vor
Wien ist mächtig,
ne more ga gavran pregrktiti kein Rabe kann das Marchfeld
überkrächzen,
ja kamo li prekasati vuci. geschweige Wölf’ ohn Rasten
übertraben.
Sve to polje pritisnuli
Turci;
Dies ganze Feld bedeckten
Türkenhorden;
konj do konja, Turčin do
Turčina,
hier Ross an Ross, hier Türk
gedrängt an Türken;
15 sve barjaci kao i oblaci, wie Wolkenflocken flattern
zahllos Fahnen.
bojna koplja ka i gora crna. Von Kriegerspeeren starrt
es wie ein Urwald.
Pa su Turci na Beč udarili, Da griffen Wien die
Türkenscharen an.
polu Beča jesu osvojili Halb Wien ist schon vom
Türkentum erobert
do jabuke i do zlatne ruke bis zu dem Apfel und dem
goldnen Arme
20 i do svetog groba Stefanova, und bis zur heiligen
Stefangrabesstelle
do lijepe Despotove crkve, und bis zur hehren kaiserlichen
Kirche.
u Ružicu crkvu ulazili. Sie brachen ein auch in die
Rosenkirche.
Gje su bile crkve i oltari Wo ehdem Kirchen stunden und
Altäre,
ongje jesu džamije munare; dort stehn Moscheen jetzt mit
Minareten.
25 sa munare turski odža viče Es schreit vom Minaret der
türkische Hodža;
pa se š njime turadija diče. sein rühmen sich die rohen
Türkenhorden.
To dotuži u Beču ćesaru Letzt ward des Leids zuviel
in Wien dem Kaiser,
pa on cvili a suze proljeva. er brach in Tränen aus und
jammerklagte.
Njemu veli sluga Petrenija: Da sprach zu ihm der Diener
Petrenija:
30 — Svjetla diko u Beču
ćesare,
— O heller Stolz und
Glanz, du Wiener Kaiser!
što ti cviliš a suze proljevaš? was soll das Flennen, was das
Zährenfliessen?
Već ti uzmi divit i kalema, ergreif vielmehr die Tinte und
das Schreibrohr,
list artije knjige bez jazije; ein Blatt Papier noch rein und
unbeschrieben,
knjigu piši na svome koljenu und schreib ein Schreiben wohl
auf deinem Knie
35 pa je šalji u kletu Rusiju und schick es ab in das
verfluchte Russland
a na ruke Mojsković Jovanu. zu Handen jenes Mojsković
Johannes.
Otlen će te mio Bog
pomoći,
Von dorten wird der liebe
Gott dir helfen,
žarko će te ogrijati sunce, wird dich die heisse Sonne mild
erwärmen,
otalen će tebi indat doći! von dorten wird zu Teil dir Hilfe
werden.
40 Kat to čuo u Beču ćesare, Als dies der Kaiser wohl zu
Wien vernommen,
on uzima divit i kalema, so griff er nach der Tinte und
dem Schreibrohr
list artije knjige bez jazije und einem Blatt Papier noch
unbeschrieben;
knjigu piše na svome koljenu er schrieb den Schreibebrief auf
seinem Knie
pa je daje slugi Petreniji: und gab ihn hin dem Diener
Petrenija:
45 — Na ti slugo list knjige
bijele
— Da nimm o Diener hin
das weisse Schreiben
pa je nosi u kletu Rusiju und trag es fort in das verfluchte
Russland
a na ruke Mojsković Jovanu zu Handen jenes Mojsković
Johannes,
a Jovanu mome prijatelju. ja, Herrn Johannes, meines
nächsten Freundes.
Njesam njemo svoje ćeri dao Drum gab ich ihm die
Tochter mein nicht hin,
50 što je Jovan meni mio bio, weil mir das Herrchen zu
Gesicht gestanden,
već sam njemu svoju ćerku dao, nur darum gab ich ihm mein
Töchterlein,
da mi Jovan u muci pomaže. damit er Hilf’ in schwerer Not
mir biete.
Nosi knjigu štogogj brže možeš, So trag den Brief so rasch
als Füsse tragen,
ti ne žali áta ni dukata. schon’ deinen Zelter nicht noch
Goldzechine!
55 Uze djete list knjige bijele Es nahm das Kind an sich
das weisse Schreiben
pa je metnu u džepove svoje und tauchte’s tief hinab in seine
Taschen
pa posjede dobra konja svoga und setzte sich auf seinen guten
Zelter
pa otisnu u kletu Rusiju. und schob dann ab in das
verfluchte Russland.
Dok je djete u Rusiju došlo Es wechselte an vierzigmal
die Zelter
60 četr’est je promjenilo áta. das Kind, bevor’s in Russland
angekommen.
Kad je djete u Rusiju došlo Sobald das Kind in Russland
angekommen,
knjigu daje Mojsković Jovanu. so gab’s den Brief an Mojsković
Johannes.
Knjigu štije Mojsković
Jovane,
Es liest den Brief Herr
Mojsković Johannes,
knjigu štije, grozne suze lije er liest den Brief, zerfliesst in
grause Tränen,
65 niza svoje prebijelo lice die Tränen fliessen übers weisse
Antlitz
i nis svoju prebijelu bradu. und übern schneeig weissen Bart
hernieder.
Viš njeg stoji care Mijailo Zu Häupten steht ihm
Kaiser Mihajilo;
pa govori care Mijailo: da nimmt das Wort der Kaiser
Mihajilo:
Welcome to our website – the perfect destination for book lovers and
knowledge seekers. We believe that every book holds a new world,
offering opportunities for learning, discovery, and personal growth.
That’s why we are dedicated to bringing you a diverse collection of
books, ranging from classic literature and specialized publications to
self-development guides and children's books.
More than just a book-buying platform, we strive to be a bridge
connecting you with timeless cultural and intellectual values. With an
elegant, user-friendly interface and a smart search system, you can
quickly find the books that best suit your interests. Additionally,
our special promotions and home delivery services help you save time
and fully enjoy the joy of reading.
Join us on a journey of knowledge exploration, passion nurturing, and
personal growth every day!
ebookbell.com

Weitere ähnliche Inhalte

PDF
Computational Structural Biology Methods and Applications 1st Edition Torsten...
PDF
Metals for Biomedical Devices 2nd Edition Mitsuo Niinomi
PDF
Integrated Microfabricated Biodevices 1st Edition Andras Guttman
PDF
Nanomagnetism Ultrathin Films Multilayers and Nanostructures 1st Edition D.L....
PDF
Clinical Applications of Artificial Neural Networks 1st Edition Richard Dybowski
PDF
Bioinformatics for Omics Data Methods and Protocols 1st Edition Maria V. Schn...
PDF
Characterization of Flow Particles and Interfaces 1st Edition Jinghai Li (Eds.)
PDF
Solar Energy Storage 1st Edition Bent Sorensen
Computational Structural Biology Methods and Applications 1st Edition Torsten...
Metals for Biomedical Devices 2nd Edition Mitsuo Niinomi
Integrated Microfabricated Biodevices 1st Edition Andras Guttman
Nanomagnetism Ultrathin Films Multilayers and Nanostructures 1st Edition D.L....
Clinical Applications of Artificial Neural Networks 1st Edition Richard Dybowski
Bioinformatics for Omics Data Methods and Protocols 1st Edition Maria V. Schn...
Characterization of Flow Particles and Interfaces 1st Edition Jinghai Li (Eds.)
Solar Energy Storage 1st Edition Bent Sorensen

Similar to Exploratory Analysis Of Metallurgical Process Data With Neural Networks And Related Methods 1st Edition C Aldrich Eds (20)

Clustering in Bioinformatics and Drug Discovery 1st Edition John David Maccui...
Microchip Methods in Diagnostics 1st Edition Ursula Bilitewski (Auth.)
Microchip Methods in Diagnostics 1st Edition Ursula Bilitewski (Auth.)
Detection Technologies For Chemical Warfare Agents And Toxic Vapors 1st Editi...
Multivariate Approximation and Applications 1st Edition N. Dyn
Nanomaterial Interfaces in Biology Methods and Protocols 1st Edition Daniel F...
Biological Computation 1st Edition Ehud Lamm
Dehydrogenation Reactions With 3d Metals 1st Edition Basker Sundararaju
Laser Systems Part 2 1st Edition R Ifflnder Auth W Schulz
Download full ebook of Rubber Curing Systems Datta Rabin N instant download pdf
Modes Of Cooperative Effects In Dinuclear Complexes Philippe Kalck
Fundamentals Of Measurement And Signal Analysis Lingsong He Bo Feng
Green chemistry for environmental sustainability 1st Edition Sanjay K. Sharma
Advances In Applied Microbiology 81 Geoffrey M Gadd And Sima Sariaslani Eds
Download full ebook of Advances In Microfluidics Ryan T Kelly instant downloa...
Fundamentals Of Structural Analysis 3rd Edition Kenneth Leet
Cooperative and Cognitive Satellite Systems 1st Edition Symeon Chatzinotas
Frontiers of Engineering Reports on Leading Edge Engineering from the 2009 Sy...
Monoclonal Antibodies Methods and Protocols 1st Edition Eugene Mechetner (Auth.)
Data Mining In Time Series Databases Mark Last Abraham Kandel

Exploratory Analysis Of Metallurgical Process Data With Neural Networks And Related Methods 1st Edition C Aldrich Eds

  • 6. Process Metallurgy 12 EXPLORATORY ANALYSIS OF METALLURGICAL PROCESS DATA WITH NEURAL NETWORKS AND RELATED METHODS
  • 7. Process Metallurgy Advisory Editors: A.W. Ashbrook and G.M. Ritcey 1 G.M. RITCEY and A.W. ASHBROOK Solvent Extraction: Principles and Applications to Process Metallurgy, Part I and Part II 2 P.A. WRIGHT Extractive Metallurgy of Tin (Second, completely revised edition) 3 I.H. WARREN (Editor) Application of Polarization Measurements in the Control of Metal Deposition 4 R.W. LAWRENCE, R.M.R. BRANION and H.G. EBNER (Editors) Fundamental and Applied Biohydrometallurgy 5 A.E. TORMA and I.H. GUNDILER (Editors) Precious and Rare Metal Technologies 6 G.M. RITCEY Tailings Management 7 T. SEKINE Solvent Extraction 1990 8 C.K. GUPTA and N. KRISHNAMURTHY Extractive Metallurgy of Vanadium 9 R. AMILS and A. BALLESTER (Editors) Biohydrometallurgy and the Environment Toward the Mining of the 21st Century Part A: Bioleaching, Microbiology Part B: Molecular Biology, Biosorption, Bioremediation 10 P. BALÁŽ Extractive Metallurgy of Activated Minerals 11 V.S.T. CIMINELLI and O. GARCIA Jr. (Editors) Biohydrometallurgy: Fundamentals, Technology and Sustainable Development Part A: Bioleaching, Microbiology and Molecular Biology Part B: Biosorption and Bioremediation
  • 8. Process Metallurgy 12 EXPLORATORY ANALYSIS OF METALLURGICAL PROCESS DATA WITH NEURAL NETWORKS AND RELATED METHODS C. Aldrich University of Stellenbosch, South Africa 2002 ELSEVIER Amsterdam • Boston • London • New York • Oxford • Paris • San Diego • San Francisco • Singapore • Sydney • Tokyo
  • 9. ELSEVIER SCIENCE B.V. Sara Burgerhartstraat 25 P.O. Box 211, 1000 AE Amsterdam, The Netherlands 9 2002 Elsevier Science B.V. All rights reserved. This work is protected under copyright by Elsevier Science, and the following terms and conditions apply to its use: Photocopying Single photocopies of single chapters may be made for personal use as allowed by national copyright laws. Permission of the Publisher and payment of a fee is required for all other photocopying, including multiple or systematic copying, copying for advertising or promotional purposes, resale, and all forms of document delivery. Special rates are available for educational institutions that wish to make photocopies for non-profit educational classroom use. Permissions may be sought directly from Elsevier Science Global Rights Department, PO Box 800, Oxford OX5 1DX, UK; phone: (+44) 1865 843830, fax: (+44) 1865 853333, e-mail: permissions@elsevier.co.uk.You may also contact Global Rights directly through Elsevier's home page (http://www.elsevier.nl),by selecting 'Obtaining Permissions'. In the USA, users may clear permissions and make payments through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA; phone: (+1) (978) 7508400, fax: (+1) (978) 7504744, and in the UK through the Copyright Licensing Agency Rapid Clearance Service (CLARCS), 90 Tottenham Court Road, London W 1P 0LP, UK; phone: (+44) 207 631 5555; fax: (+44) 207 631 5500. Other countries may have a local reprographlc rights agency for payments. Derivative Works Tables of contents may be reproduced for internal circulation, but permission of Elsevier Science is required for external resale or distribution of such material. Permission of the Publisher is required for all other derivative works, including compilations and translations. Electronic Storage or Usage Permission of the Publisher is required to store or use electronicallyany material contained in this work, including any chapter or part of a chapter. Except as outlined above, no part of this work may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without prior written permtssion of the Publisher. Address permissions requests to: Elsevier Science Rights & Permissions Department, at the mail, fax and e-mail addresses noted above. Notice No responsibility is assumed by the Publisher for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions or ideas contained in the material herein. Because of rapid advances m the medical sciences, in particular, independentverification of diagnoses and drug dosages should be made. First edition 2002 Library of Congress Cataloging in Publication Data Exploratory analysis of metallurgical process data with neural networks and related methods / edited by C. Aldrich.-- I st ed. p. cm. -- (Process metallurgy ; 12) ISBN 0-444-50312-9 1. Metallurgy. 2. Metallurgical research. I. Aldrich, C. II. Series. TN673 .E86 2002 669'.07'2--dc21 British Library Cataloguing in Publication Data Exploratory analysis of metallurgical process data with neural networks and related methods. - (Process metallurgy ;12) 1.Metallurgy - Data processing 2.Neural networks (Computer science) I .Aldrich, C. 669'. 
0285 2002016356 ISBN: 0 444 50312 9 ∞ The paper used in this publication meets the requirements of ANSI/NISO Z39.48-1992 (Permanence of Paper). Printed in The Netherlands.
  • 10. Preface This book is concerned with the analysis and interpretation of multivariate measurements commonly found in the mineral and metallurgical industries, with the emphasis on the use of neural networks. Methods of multivariate analysis deal with reasonably large numbers of measurements (i.e. variables) made on each entity in one or more samples simultaneously (Dillon and Goldstein, 1984). In this respect, multivariate techniques differ from univariate or bivariate techniques, in that they focus on the covariances or correlations of three or more variables, instead of the means and variances of single variables, or the pairwise relationship between two variables. Neural networks can be seen as a legitimate part of statistics that fits snugly in the niche between parametric and non-parametric methods. They are non-parametric, since they generally do not require the specification of explicit process models, but are not quite as unstructured as some statistical methods, in that they adhere to a general class of models. In this context, neural networks have been used to extend, rather than replace, regression models, principal component analysis (Kramer, 1991, 1992), principal curves (Dong and McAvoy, 1996) and partial least squares methods (Qin and McAvoy, 1992), as well as the visualization of process data, to name but a few. In addition, the argument that neural networks are really highly parallelized neurocomputers or hardware devices and should therefore be distinguished from statistical or other pattern recognition algorithms is not entirely convincing. In the vast majority of cases neural networks are simulated on single-processor machines, and there is no reason why other methods cannot also be simulated or executed in a similar way (and indeed are). Since the book is aimed at the practising metallurgist or process engineer in the first place, a considerable part of it is of necessity devoted to the basic theory, which is introduced as briefly as possible within the large scope of the field. Also, although the book focuses on neural networks, they cannot be divorced from their statistical framework, and the author has gone to considerable lengths to discuss this. For example, at least a basic understanding of fundamental statistical modelling (linear models) is necessary to appreciate the issues involved in the development of nonlinear models. The book is therefore a blend of the basic theory and some of the most recent advances in the practical application of neural networks. Naturally, this preface would not be complete without expressing my gratitude to the many people who have been involved in the writing of this book in one way or another: my graduate students, who have performed many of the experiments described in this book; Juliana Steyl, who has helped in the final stages of the preparation of the book; and last but not least, Annemarie and Melissa, who have had to bear with me, despite the wholly underestimated time and effort required to finish this book. Chris Aldrich Stellenbosch November 2001
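The distinction the preface draws between univariate, bivariate and multivariate analysis can be made concrete with a small numerical sketch. The snippet below is not taken from the book: it uses a hypothetical three-variable set of plant measurements (the variable names and values are invented for illustration) and contrasts the univariate means and variances with the covariance and correlation matrices that multivariate methods work with.

```python
import numpy as np

# Hypothetical plant measurements (invented for illustration only):
# feed rate [t/h], reagent dosage [g/t] and recovery [%] for 8 shifts.
X = np.array([
    [210.0, 45.0, 86.1],
    [198.0, 41.0, 84.3],
    [223.0, 48.0, 87.9],
    [205.0, 44.0, 85.6],
    [231.0, 50.0, 88.4],
    [189.0, 39.0, 83.2],
    [215.0, 46.0, 86.8],
    [202.0, 43.0, 85.0],
])

# Univariate view: one mean and one variance per variable.
print("means:     ", X.mean(axis=0))
print("variances: ", X.var(axis=0, ddof=1))

# Multivariate view: the joint second-order structure of all variables.
print("covariance matrix:\n", np.cov(X, rowvar=False))
print("correlation matrix:\n", np.corrcoef(X, rowvar=False))
```

The off-diagonal entries of the correlation matrix are exactly the quantities that the multivariate methods discussed in the book (principal components, partial least squares, and their neural network extensions) exploit, and that univariate summaries discard.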
  • 12. vii Table of Contents CHAPTER 1 INTRODUCTION TO NEURAL NETWORKS ..................................................................................................... 1 1.1. BACKGROUND ....................................................................................................................................... 1 1.2. ARTIFICIAL NEURAL NETWORKS FROM AN ENGINEERING PERSPECTIVE ........................... 2 1.3. BRIEF HISTORY OF NEURAL NETWORKS ....................................................................................... 5 1.4. STRUCTURES OF NEURAL NETWORKS ........................................................................................... 6 1.4.1. Models of single neurons .............................................................................................................7 1.4.2. Models of neural network structures ........................................................................................... 8 1.5. TRAINING RULES ..................................................................................................................................9 1.5.1. Supervised training .................................................................................................................... 11 a) Perceptron learning rule................................................................................................... 11 b) Delta and generalized delta rules ..................................................................................... 12 c) Widrow-Hoff learning rule................................................................................................ 15 d) Correlation learning rule .................................................................................................. 16 1.5.2. Unsupervised training ................................................................................................................ 16 a) Hebbian and anti-Hebbian learning rule.......................................................................... 16 b) Winner-takes-all rule ........................................................................................................ 17 c) Outstar learning rule ........................................................................................................ 18 1.6. NEURAL NETWORK MODELS ........................................................................................................... 19 1.6.1. Multilayer Perceptrons .............................................................................................................. 19 a) Basic structure .................................................................................................................. 19 b) Backpropagation algorithm ............................................................................................. 20 1.6.2. Kohonen self-organizing mapping (SOM) neural networks ...................................................... 21
  • 13. viii Table of Contents a) Summary of the SOM algorithm (Kohonen) ...................................................................... 22 b) Properties of the SOM algorithm ...................................................................................... 23 1.6.3. Generative topographic maps ....................................................................................................23 1.6.4. Learning vector quantization neural networks .......................................................................... 25 1.6.5. Probabilistic neural networks ....................................................................................................26 1.6.6. Radial basis function neural networks .......................................................................................29 1.6.7. Adaptive resonance theory neural networks .............................................................................. 34 a) Network architecture ........................................................................................................35 b) Training of the network..................................................................................................... 36 1.6.8. Support vector machines ...........................................................................................................38 a) Linearly separable patterns .............................................................................................. 38 b) Nonlinearly separable patterns ......................................................................................... 39 c) Building support vector machines for pattern recognition (classification) problems ....... 41 d) Location of the optimal hyperplane .................................................................................. 42 e) Inner-product kernels .......................................................................................................43 J) Building support vector machines for nonlinear regression problems ............................. 43 1.7. NEURAL NETWORKS AND STATISTICAL MODELS .....................................................................45 1.8. APPLICATIONS IN THE PROCESS INDUSTRIES ............................................................................48 CHAPTER 2 TRAINING OF NEURAL NETWORKS ............................................................................................................. 50 2.1. GRADIENT DESCENT METHODS .....................................................................................................50 2.2. CONJUGATE GRADIENTS .................................................................................................................. 52 2.3. NEWTON'S METHOD AND QUASI-NEWTON METHOD ............................................................... 54 2.4. LEVENBERG-MARQUARDT ALGORITHM ..................................................................................... 56 2.5. STOCHASTIC METHODS ....................................................................................................................57 2.5.1. Simulated annealing ..................................................................................................................57 2.5.2. Genetic algorithms ....................................................................................................................59 2.6 REGULARIZATION AND PRUNING OF NEURAL NETWORK MODELS .................................... 
62 2.6.1 Weight decay .............................................................................................................................62 2.6.2. Removal of weights ...................................................................................................................63 2.6.3 Approximate smoother ..............................................................................................................64 2.7 PRUNING ALGORITHMS FOR NEURAL NETWORKS ................................................................... 64 2.7.1 Hessian-based network pruning ................................................................................................64
  • 14. Table of Contents ix 2.8. 2.7.2 Optimal brain damage and optimal brain surgeon algorithms .................................................. 64 CONSTRUCTIVE ALGORITHMS FOR NEURAL NETWORKS ...................................................... 65 2.8.1 State space search ...................................................................................................................... 65 a) Initial state and goal state ................................................................................................ 65 b) Search strategy ................................................................................................................. 66 c) Generalized search spaces ............................................................................................... 67 2.8.2. Training algorithms ................................................................................................................... 68 a) Dynamic node creation (DNC) ......................................................................................... 68 b) Projection pursuit regression (PPR) ................................................................................ 69 c) Cascade correlation method ............................................................................................. 70 d) Resource allocating neural networks (RAN) .................................................................... 71 e) Group method of data handling ........................................................................................ 72 LATENT VARIABLE METHODS ............................................................................................................... J....... 74 3.1. BASICS OF LATENT STRUCTURE ANALYSIS ............................................................................... 74 3.2. PRINCIPAL COMPONENT ANALYSIS .............................................................................................. 75 3.2.1. Mathematical perspective ......................................................................................................... 76 3.2.2. Statistics associated with principal component analysis models ............................................... 77 3.2.3. Practical considerations regarding principal component analysis ............................................. 79 3.2.4. Interpretation of principal components ..................................................................................... 81 3.2.5. Simple examples of the application of principal component analysis ....................................... 82 a) Example 1 ......................................................................................................................... 82 b) Example 2 ......................................................................................................................... 86 3.3. NONLINEAR APPROACHES TO LATENT VARIABLE EXTRACTION ........................................ 89 3.4. PRINCIPAL COMPONENT ANALYSIS WITH NEURAL NETWORKS .......................................... 90 3.5. EXAMPLE 2: FEATURE EXTRACTION FROM DIGITISED IMAGES OF INDUSTRIAL FLOTATION FROTHS WITH AUTOASSOCIATIVE NEURAL NETWORKS ................................ 92 3.6. ALTERNATIVE APPROACHES TO NONLINEAR PRINCIPAL COMPONENT ANALYSIS ........ 95 3.6.1. Principal curves and surfaces .................................................................................................... 
95 a) Initialization ..................................................................................................................... 96 b) Projection ......................................................................................................................... 96 c) Expectation ....................................................................................................................... 97 3.6.2. Local principal component analysis .......................................................................................... 97 3.6.3. Kernel principal component analysis ........................................................................................ 98 CHAPTER 3
  • 15. x Table of Contents 3.7. EXAMPLE 1: LOW-DIMENSIONAL RECONSTRUCTION OF DATA WITH NONLINEAR PRINCIPAL COMPONENT METHODS .............................................................................................. 99 3.8. PARTIAL LEAST SQUARES (PLS) MODELS .................................................................................. 100 3.9. MULTIVARIATE STATISTICAL PROCESS CONTROL ................................................................. 102 3.9.1. Multivariate Schewhart charts ................................................................................................. 103 3.9.2. Multivariate CUSUM charts ................................................................................................... 104 3.9.3. Multivariate EWMA charts ..................................................................................................... 104 3.9.4. Multivariate statistical process control based on principal components .................................. 105 a) Principal component models........................................................................................... 105 b) Partial least squares models ........................................................................................... 108 c) Multidimensional NOC volume, VNoc............................................................................. 109 3.9.5. Methodology for process monitoring ...................................................................................... 109 CHAPTER 4 REGRESSION MODELS ................................................................................................................................... 112 4.1. THEORETICAL BACKGROUND TO MODEL DEVELOPMENT ................................................... 113 4.1.1. Estimation of model parameters .............................................................................................. 113 4.1.2. Assumptions of regression models with best linear unbiased estimators (BLUE) .................. 114 4.2. REGRESSION AND CORRELATION ................................................................................................ 114 4.2.1 Multiple correlation coefficient, R2 and adjusted R2............................................................... 114 4.2.2 Adjustment of R2..................................................................................................................... 115 4.2.3 Analysis of residuals ............................................................................................................... 116 4.2.4 Confidence intervals of individual model coefficients ............................................................ 116 4.2.5. Joint confidence regions on model coefficients ...................................................................... 117 4.2.6 Confidence interval on the mean response .............................................................................. 117 4.2.7 Confidence interval on individual predicted responses ........................................................... 117 4.2.8. Prediction intervals for neural networks .................................................................................. 117 4.3. MULTICOLLINEARITY ..................................................................................................................... 119 4.3.1. Historic approaches to the detection of multicollinearity ........................................................ 
119 a) Methods based on the correlation coefficient matrix ...................................................... 119 b) Characteristics of regression coefficients ....................................................................... 119 c) Eigenstructure of the crossproduct or correlation matrices ........................................... 119 4.3.2. Recent approaches to the detection of multicollinearity .......................................................... 120 a) Condition indices ............................................................................................................ 120 b) Decomposition of regression coefficient variance .......................................................... 120
  • 16. Table of Contents xi 4.4. 4.5. 4.6. 4.7. 4.8. 4.9. 4.10. 4.11. 4.3.3. Remedies for multicollinearity ................................................................................................ 121 4.3.4. Examples ................................................................................................................................. 122 4.3.5. Multicollinearity and neural networks .................................................................................... 123 OUTLIERS AND INFLUENTIAL OBSERVATIONS ........................................................................ 124 4.4.1. Identification of influential observations ................................................................................ 124 4.4.2. Illustrative case study: Consumption of an additive in a leach plant ...................................... 126 ROBUST REGRESSION MODELS .................................................................................................... 130 DUMMY VARIABLE REGRESSION ................................................................................................ 132 RIDGE REGRESSION ......................................................................................................................... 134 CONTINUUM REGRESSION ............................................................................................................. 137 CASE STUDY: CALIBRATION OF AN ON-LINE DIAGNOSTIC MONITORING SYSTEM FOR COMMINUTION IN A LABORATORY-SCALE BALL MILL ............................................... 138 4.9.1. Experimental setup .................................................................................................................. 138 4.9.2. Experimental procedure .......................................................................................................... 139 4.9.3. Processing of acoustic signals ................................................................................................. 141 4.9.4. Results and discussion ............................................................................................................ 142 NONLINEAR REGRESSION MODELS ............................................................................................. 146 4.10.1. Regression splines ................................................................................................................... 146 4.10.2. Alternating Conditional Expectation (ACE) ........................................................................... 149 4.10.3. Additive models based on variance stabilizing transformation (AVAS) ................................ 149 4.10.4. Projection pursuit regression (PPR) ........................................................................................ 150 4.10.5. Multivariate Adaptive Regression Splines (MARS) ............................................................... 151 4.10.6. Classification and regression trees .......................................................................................... 153 a) Binary decision trees ...................................................................................................... 153 b) Regression trees.............................................................................................................. 157 4.10.7. Genetic programming models ................................................................................................. 159 CASE STUDY 1: MODELLING OF A SIMPLE BIMODAL FUNCTION ........................................ 
160 a) Multiple linear regression (MLR)................................................................................... 162 b) Alternating conditional expectations (ACE) and additive models based on variance stabilizing transformation (AVAS)................................................................................. 162 c) Multilayerperceptron (MLP) ......................................................................................... 162 d) Multivariate adaptive regression splines (MARS).......................................................... 164 e) Regression tree (CART).................................................................................................. 164 J) Projection pursuit regression (PPR) .............................................................................. 165
  • 17. xii Table of Contents g) Genetic programming (GP)............................................................................................. 165 4.12. NONLINEAR MODELLING OF CONSUMPTION OF AN ADDITIVE IN A GOLD LEACH PLANT .................................................................................................................................................. 167 CHAPTER 5 TOPOGRAPHICAL MAPPINGS WITH NEURAL NETWORKS .................................................................... 172 5.1. BACKGROUND ................................................................................................................................... 172 5.2. OBJECTIVE FUNCTIONS FOR TOPOGRAPHIC MAPS ................................................................. 174 5.3. MULTIDIMENSIONAL SCALING ..................................................................................................... 177 5.3.1. Metric scaling .......................................................................................................................... 177 5.3.2. Nonmetric scaling and ALSCAL ............................................................................................. 177 5.4. SAMMON PROJECTIONS .................................................................................................................. 178 5.5. EXAMPLE 1: ARTIFICIALLY GENERATED AND BENCHMARK DATA SETS ......................... 179 5.5.1. Mapping with neural networks ................................................................................................ 180 5.5.2. Evolutionary programming ............................................................................................... 181 5.6. EXAMPLE 2: VISUALIZATION OF FLOTATION DATA FROM A BASE METAL FLOTATION PLANT ........................................................................................................................... 183 5.7. EXAMPLE 3: MONITORING OF A FROTH FLOTATION PLANT ................................................. 188 5.8. EXAMPLE 4: ANALYSIS OF THE LIBERATION OF GOLD WITH MULTI- DIMENSIONALLY SCALED MAPS .................................................................................................. 191 5.8.1. Experimental data .................................................................................................................... 191 a) St Helena and Unisel gold ores.......................................................................................191 b) Beatrix gold ore ............................................................................................................... 192 c) Kinross and Leslie gold ores........................................................................................... 192 d) Barberton gold ore .......................................................................................................... 192 e) Western Deep Level, Free State Geduld and Harmony gold ores................................... 192 5.8.2. Milled and unmilled ores ......................................................................................................... 192 5.9. EXAMPLE 5. MONITORING OF METALLURGICAL FURNACES BY USE OF TOPOGRAPHIC PROCESS MAPS ..................................................................................................... 195 CHAPTER 6 CLUSTER ANALYSIS ....................................................................................................................................... 199 6.1. 
SIMILARITY MEASURES .................................................................................................................. 199 6.1.1. Distance-type measures ........................................................................................................... 200 6.1.2. Matching type measures .......................................................................................................... 202 6.1.3. Contextual and conceptual similarity measures ....................................................................... 203
  • 18. Table of Contents xiii 6.2. 6.3. 6.4. 6.5. 6.6. 6.7. 6.8. 6.8.1. 6.8.2. 6.8.3. 6.8.4. 6.8.5. 6.8.6. CHAPTER 7 GROUPING OF DATA ........................................................................................................................ 204 HIERARCHICAL CLUSTER ANALYSIS .......................................................................................... 206 6.3.1. Single link or nearest neighbour method ................................................................................. 206 6.3.2. Complete link or furthest neighbour method .......................................................................... 208 OPTIMAL PARTITIONING (K-MEANS CLUSTERING) ................................................................ 209 SIMPLE EXAMPLES OF HIERARCHICAL AND K-MEANS CLUSTER ANALYSIS .................. 209 CLUSTERING OF LARGE DATA SETS ........................................................................................... 213 APPLICATION OF CLUSTER ANALYSIS IN PROCESS ENGINEERING .................................... 214 CLUSTER ANALYSIS WITH NEURAL NETWORKS ..................................................................... 215 Cluster analysis with autoassociative neural networks ........................................................... 216 Example 1: Sn-Ge-Cd-Cu-Fe-bearing samples from Barquilla deposit in Spain ................... 216 Example 2: Chromitite ores from the Bushveld Igneous Complex ......................................... 218 Example 3: Data from an industrial flotation plant ................................................................. 221 Iris Data set........................................................................................................................ 225 Cluster analysis with ART neural networks ............................................................................ 226 EXTRACTION OF RULES FROM DATA WITH NEURAL NETWORKS .................................................... 228 7.1. BACKGROUND .................................................................................................................................. 228 7.1.1. Decompositional methods ....................................................................................................... 228 7.1.2. Pedagogical methods ............................................................................................................... 228 7.1.3. Eclectic methods ..................................................................................................................... 229 7.2. NEUROFUZZY MODELING OF CHEMICAL PROCESS SYSTEMS WITH ELLIPSOIDAL RADIAL BASIS FUNCTION NEURAL NETWORKS AND GENETIC ALGORITHMS ............... 229 7.2.1. Radial basis function networks and fuzzy systems ................................................................. 229 7.2.2. Development of hidden layers ................................................................................................ 230 7.2.3. Post-processing of membership functions ............................................................................... 231 7.2.4. Case study: Induced aeration in liquids in agitated vessels ..................................................... 231 7.3. EXTRACTION OF RULES WITH THE ARTIFICIAL NEURAL NETWORK DECISION TREE (ANN-DT) ALGORITHM ......................................................................................................... 235 7.3.1. 
Induction of rules from sampled points in the feature space ................................................... 236 7.3.2. Interpolation of correlated data ............................................................................................... 237 7.3.3. Selection of attribute and threshold for splitting ..................................................................... 238 a) Gain ratio criterion ........................................................................................................ 238
  • 19. xiv Table of Contents b) Analysis of attribute significance .................................................................................... 239 c) Stopping criteria and pruning ......................................................................................... 241 7.3.4. Illustrative examples ............................................................................................................... 242 a) Characterization of gas-liquid flow patterns .................................................................. 242 b) Solidification of ZnCl2..................................................................................................... 243 7.3.5. Performance of the ANN-DT algorithm.................................................................................. 245 7.4. THE COMBINATORIAL RULE ASSEMBLER (CORA) ALGORITHM .......................................... 249 7.4.1. Construction of fuzzy rules with the growing neural gas algorithm........................................ 249 7.4.2. Assembly of rule antecedents with the reactive tabu search algorithm ................................... 250 7.4.3. Membership function merging and rule reduction .................................................................. 251 7.4.4. Calculation of a fuzzy rule consequent and solution fitness .................................................... 251 7.4.5. Optimal-size models ................................................................................................................ 252 7.4.6. Fuzzy rule set reduction .......................................................................................................... 254 7.4.7. Rule model output prediction surface smoothing .................................................................... 254 7.4.8. Overlapping of fuzzy rules in the attribute space .................................................................... 255 7.4.9. Performance of the CORA algorithm ...................................................................................... 256 a) Sin-Cos data.................................................................................................................... 256 b) The Slug Flow Data Set .................................................................................................. 257 c) Radial basis function neural networks (GNG-RBF, KM-RBF) ....................................... 257 d) Rule-induction algorithm (BEXA)................................................................................... 258 7.5. SUMMARY .......................................................................................................................................... 259 CHAPTER 8 INTRODUCTION TO THE MODELLING OF DYNAMIC SYSTEMS .......................................................... 262 8.1. BACKGROUND ................................................................................................................................... 262 8.2. DELAY COORDINATES .................................................................................................................... 264 8.3. LAG OR DELAY TIME ....................................................................................................................... 265 8.3.1. Average Mutual Information (AMI) ....................................................................................... 266 8.3.2. Average Cross Mutual Information (AXMI) ........................................................................... 267 8.4. 
EMBEDDING DIMENSION ................................................................................................................ 268 8.4.1. False nearest neighbours ......................................................................................................... 269 8.4.2. False nearest strands ................................................................................................................ 269 8.5. CHARACTERIZATION OF ATTRACTORS ...................................................................................... 270 8.5.1. Correlation dimension and correlation entropy ....................................................................... 270
  • 20. Table of Contents xv 8.6. 8.7. 8.5.2 Other invariants ....................................................................................................................... 272 a) Generalized dimensions and entropies ........................................................................... 272 b) Lyapunov exponents ....................................................................................................... 273 DETECTION OF NONLINEARITIES ................................................................................................ 275 8.6.1. Surrogate data methods ........................................................................................................... 275 a) Pivotal test statistics ....................................................................................................... 276 b) Classes of hypotheses ..................................................................................................... 276 8.6.2. Example: Generation of surrogate data ................................................................................... 277 a) Generating index-shuffled surrogates (Type O).............................................................. 277 b) Generating phase-shuffled surrogates (Type 1) ............................................................. 277 c) Generating amplitude adjusted Fourier transform surrogates (Type 2) ........................ 279 SINGULAR SPECTRUM ANALYSIS ................................................................................................ 280 8.8. RECURSIVE PREDICTION ................................................................................................................ 282 CHAPTER 9 CASE STUDIES: DYNAMIC SYSTEMS ANALYSIS AND MODELLING .................................................. 285 9.1. EFFECT OF NOISE ON PERIODIC TIME SERIES .......................................................................... 285 9.2. 9.3. AUTOCATALYSIS IN A CONTINUOUS STIRRED TANK REACTOR ......................................... 287 9.2.1. Multi-layer perceptron network model ................................................................................... 288 9.2.2. Pseudo-linear radial basis function model .............................................................................. 290 EFFECT OF MEASUREMENT AND DYNAMIC NOISE ON THE IDENTIFICATION OF AN AUTOCATALYTIC PROCESS .................................................................................................... 293 9.4. IDENTIFICATION OF AN INDUSTRIAL PLATINUM FLOTATION PLANT BY USE OF SINGULAR SPECTRUM ANALYSIS AND DELAY COORDINATES ........................................... 295 9.5. IDENTIFICATION OF A HYDROMETALLURGICAL PROCESS CIRCUIT ................................. 296 CHAPTER 10 EMBEDDING OF MULTIVARIATE DYNAMIC PROCESS SYSTEMS ....................................................... 299 10.1. EMBEDDING OF MULTIVARIATE OBSERVATIONS .................................................................. 299 10.2. MULTIDIMENSIONAL EMBEDDING METHODOLOGY .............................................................. 299 10.2.1 Optimal embedding of individual components ....................................................................... 300 10.2.2 Optimal projection of initial embedding ................................................................................. 301 a) Optimal projection by singular spectrum analysis ......................................................... 
301 b) Optimal projection by linear independent component analysis ...................................... 301 c) Selection of a suitable model structure ........................................................................... 302
  • 21. xvi Table of Contents 10.3 APPLICATION OF THE EMBEDDING METHOD ........................................................................... 303 10.4 MODELLING OF NOx -FORMATION ............................................................................................... 305 CHAPTER 11 FROM EXPLORATORY DATA ANALYSIS TO DECISION SUPPORT AND PROCESS CONTROL ....... 313 11.1. BACKGROUND ................................................................................................................................... 313 11.2. ANATOMY OF A KNOWLEDGE-BASED SYSTEM ....................................................................... 313 11.2.1. Knowledge-base ...................................................................................................................... 314 11.2.2. Inference engine and search strategies ................................................................................... 314 11.2.3. Monotonic and non-monotonic reasoning ............................................................................... 316 11.3. DEVELOPMENT OF A DECISION SUPPORT SYSTEM FOR THE DIAGNOSIS OF CORROSION PROBLEMS .................................................................................................................. 317 11.3.1. Expert System ......................................................................................................................... 317 11.3.2. Examples ................................................................................................................................. 318 a) Example 1: Corrosion of construction materials ............................................................ 318 b) Example 2: Miscellaneous metal corrosion .................................................................... 318 c) Example 3: Seawater corrosion of stainless steels ......................................................... 318 11.3.3. Experiments and results ........................................................................................................... 319 11.4. ADVANCED PROCESS CONTROL WITH NEURAL NETWORKS ............................................... 320 11.4.1 Predictive neurocontrol schemes ............................................................................................. 321 11.4.2. Inverse model-based neurocontrol .......................................................................................... 322 11.4.3. Adaptive neurocontrol systems ............................................................................................... 322 11.5. SYMBIOTIC ADAPTIVE NEURO-EVOLUTION (SANE) ............................................................... 322 11.6. CASE STUDY: NEUROCONTROL OF A BALL MILL GRINDING CIRCUIT ............................... 324 11.7. NEUROCONTROLLER DEVELOPMENT AND PERFORMANCE ................................................. 328 11.7.1. SANE implementation ............................................................................................................ 328 11.7.2. Set point changes ..................................................................................................................... 329 11.7.3. Particle size disturbances in the feed ....................................................................................... 332 11.8. CONCLUSIONS ................................................................................................................................... 
332 REFERENCES .................................................................................................................................................... 333 INDEX ................................................................................................................................................................ 366 APPENDIX: DATA FILES ................................................................................................................................ 370
  • 22. Chapter 1 Introduction to Neural Networks 1.1. BACKGROUND The technological progress of humanity throughout the ages can be summarized as a perpetual cycle of observing nature, interpreting these observations until the system or phenomenon being observed is understood sufficiently well to modify or redesign the system. Clearly man has made spectacular progress in all four areas. Our understanding of nature is reaching new depths at an ever-increasing pace, while we only need to look around us to appreciate the role of engineering and technology in everyday life. The growth of each stage in the cycle depends on the previous stages; for example, heavier-than-air flight only became possible when the laws of physics governing flight were understood sufficiently well. The same applies to many of the recent advances in biotechnology, which are contingent upon detailed knowledge of the human genome, etc. [Figure 1.1. The cycle of technological progress: observation of natural systems, interpretation (science), and design and intervention.] However, in recent years, the advent of the computer has upset the balance between the elements of the cycle of technological progress portrayed in Figure 1.1. For example, although measurements of process variables on metallurgical plants have been logged for decades, it is only relatively recently, with the large-scale availability of inexpensive computing facilities, that the large historic databases of plant behaviour have become established. These databases can contain tens of thousands of variables and hundreds of thousands or millions of observations, constituting a rich repository of data, detailing the historic behaviour of a plant. For example, in the production of ferrochrome in a submerged arc furnace, the specific energy consumption, metal production, percentage of valuable metal lost to the slag, etc., may depend on hundreds of other process variables, such as the composition and particulate state of the feed, as well as the electrical configuration of the furnace. These variables are likely to interact in a complex way to influence the quality of the product and the cost of production. Derivation of an empirical model of such a system is unlikely to be successful if at least a sizeable subset of the explanatory variables is not considered simultaneously. Unfortunately, this proliferation of plant data does not always lead to a concomitant increase in knowledge or insight into process dynamics or plant operation. In fact, on many metallurgical plants personnel have probably experienced a net loss in understanding of the complexities
  • 23. 2 Introduction to Neural Networks of the behaviour of the plant, owing to increased turnover, rationalisation, etc. This has resulted in younger, less experienced plant operators sometimes having to cope with the unpredictable dynamics of nonlinear process systems. To aggravate the situation, a steady increase in the demand for high quality products at a lower cost, owing to global competition, as well as environmental and legislative constraints, require substantial improvements in process control. In addition, automated capture of data has not only led to large data sets, but also data sets that can contain many more variables than observations. One such example pertains to spectroscopic data, where observations comprise a function, rather than a few discrete values. The data are obtained by exposing a chemical sample to an energy source, and recording the resulting absorbance as a continuous trace over a range of wavelengths. Such a trace is consequently digitised at appropriate intervals (wavelengths) with the digitised values forming a set of variables. Pyrolysis mass spectroscopy yields, near infrared spectroscopy and infrared spectroscopy yield approximately 200, 700 and 1700 such variables for each chemical sample (Krzanowski and Marriot, 1994). In these cases the number of variables usually exceed the number of samples by far. Similar problems are encountered with the measurement of acoustic signals, such as may be the case in on-line monitoring of process equipment (Zeng and Forssberg, 1992), potentiometric measurements to monitor corrosion, or image analysis, where each pixel in the image would represent a variable. In the latter case, high-resolution two-dimensional images can yield in excess of a million variables. It is therefore not surprising that exploratory data analysis, multivariate analysis or data mining, is seen as a key enabling technology, and the topic of this book. Many of the techni- ques for the efficient exploration of data have been around for decades. However, it is only now with the growing availability of processing power that these techniques have become sophisticated instruments in the hands of metallurgical engineers and analysts. Artificial neural networks represent a class of tools that can facilitate the exploration of large systems in ways not previously possible. These methods have seen explosive growth in the last decade and are still being developed at a breath-taking pace. In many ways neural networks can be viewed as nonlinear approaches to multivariate statistical methods, not bound by assumptions of normality or linearity. Although neural networks have originated outside the field of statistics, and have even been seen as an alternative to statistical methods in some circles, there are signs that this viewpoint is making way for an appreciation of the way in which neural networks complement classical statistics. 1.2. ARTIFICIAL NEURAL NETWORKS FROM AN ENGINEERING PERSPEC- TIVE By the end of World War II several groups of scientists of the United States and England were working on what is now known as a computer. Although Alan Turing (1912-1954), the princi- pal British scientist at the time, suggested the use of logical operators (such as OR, AND, NOT, etc.) as a basis for fundamental instructions to these machines, the majority of inves- tigators favoured the use of numeric operators (+, -, <, etc.). 
It was only with the shifting emphasis on methods to allow computers to behave more like humans that the approach advocated by Turing began to attract new interest. This entire research effort and its commercial repercussions are known as artificial intelligence (AI), and comprise many aspirations,
ranging from the design of machines to do various things considered to be intelligent, to machines which could provide insight into the mental faculties of man. Although different workers in the field have different goals, all seek to design machines that can solve problems. In order to achieve this goal, two basic strategies can be pursued.

The first strategy or top-down approach has been developed productively for several decades and entails the reduction of large complex systems to small manipulable units. These techniques encompass heuristic programming, goal-based reasoning, parsing and causal analysis, and are efficient systematic search procedures, capable of the manipulation and rearrangement of elements of complex systems or the supervision or management of the interaction between subsystems interacting in intricate ways. The disadvantages of symbolic logic systems such as these are their inflexibility and restricted operation, which limits them to very narrow domains of knowledge.

Bottom-up strategies (i.e. connectionist procedures) endeavour to build systems with as little architecture as possible. These systems start off with simple elements (such as simplified models, small computer programs, elementary principles, etc.) and move towards more complex systems by connecting these units to produce large-scale phenomena. As a consequence, these systems are very versatile and capable of the representation of uncertain, approximate relations between elements or the solution of problems involving large numbers of weak interactions (such as found in pattern recognition and knowledge retrieval problems). On the other hand, connectionist systems cannot reason well and are not capable of symbolic manipulation and logic analyses.

With the exception of basic arithmetic, it is quite obvious that the human brain is superior to a digital computer at many tasks. Consider for example the processing of visual information: a one-year old baby is much better and faster at recognizing objects and faces than the most advanced supercomputer.

Historically the cost of computing has been directly related to the energy consumed in the process, and not the computation itself. Computational systems are limited by the system overheads required to supply the energy and to get rid of the heat, i.e. the boxes, the heaters, the fans, the connectors, the circuit boards and the other superstructure that is required to make the system work. Technology development has therefore always been in the direction of the lowest energy consumed per unit computation. For example, an ordinary wristwatch today does far more computation than the ENIAC (1) did in the 1940s.

From this perspective it is interesting to consider the capability of biological systems in computation. Contrary to the myth that nervous systems are slow and inefficient, we can still not match the capabilities of the simplest insects, let alone handle tasks routinely performed by humans, despite a 10 000 000-fold increase in computational capability over the last few decades. The silicon technology we can envision today will dissipate in the order of 10^-9 J of energy per operation on the chip level, and will consume approximately 100 to 1000 times more energy on a box level.

(1) Electronic Numerical Integrator And Computer, the first general purpose electronic computer, built in the Moore School of Electrical Engineering of the University of Pennsylvania during 1943-1946.
Compare this to the human brain which, with approximately 10^15 synapses, each of which receives a nerve pulse roughly 10 times per second, accomplishes approximately 10^16 complex operations per second. All this is accomplished while dissipating only a few watts (i.e. an energy dissipation of about 10^-16 J per operation), making the brain roughly 10 000 000 times more efficient than the best supercomputer today. From a different perspective, it is estimated that state-of-the-art digital technology would require about 10 MW of power to process information at the rate at which the human brain is capable. It is therefore clear that much can be gained from emulation of biological computational systems in the quest for more advanced computing systems.

Apart from its computational efficiency, the human brain also excels in many other respects. It differs significantly from current digital hardware with regard to its bottom-level elementary functions, representation of information, as well as top-level organizing principles. More specifically,

• It does not have to be programmed, but is flexible and can easily adjust to its environment by learning.
• It is robust and fault tolerant. An estimated 10 000 nerve cells in the brain die daily without perceptibly affecting its performance.
• It can deal with information of various kinds, be it fuzzy, probabilistic, noisy or inconsistent.
• It is highly parallel, small and compact.

Neural computation can therefore be seen as an alternative to the usual one based on a programmed instruction sequence introduced by von Neumann. Neural networks are useful mathematical techniques inspired by the study of the human brain. Although the brain (2) is a very complex organ that is still largely an enigma, despite considerable advances in the neurosciences, it is clear that it operates in a massively parallel mode. Artificial systems inspired by the basic architecture of the brain emerged under various names, such as connectionist systems, artificial neural networks, parallel distributed processing and Boltzmann machines. These systems differ in many subtle ways, but share the same general principles. Unlike traditional expert systems, where knowledge is stored explicitly in a database or as a set of rules or heuristics, neural networks generate their own implicit rules by learning from examples. Items of knowledge are furthermore distributed across the network, and reasonable responses are obtained when the network is presented with incomplete, noisy or previously unseen inputs. From the perspective of cognitive modelling of process systems know-how, these pattern recognition and generalization capabilities of neural networks are much more attractive than the symbol manipulation methodology of expert systems, especially as far as complex, ill-defined systems are concerned.

Many parallels can be drawn between the development of knowledge-based systems and that of neural networks. Both had suffered from an overzealous approach in the early stages of their development. In the mid-1980s, for example, a common perception had temporarily made its way into the process engineering community that knowledge-based systems had

(2) The typical human brain contains between 10^10 and 10^11 neurons, each of which can be connected to as many as 10 000 other neurons.
The neuron, which is the basic processing unit in the brain, is very slow, with a switch time in the order of milliseconds and it has to operate in parallel to achieve the performance observed.
  • 26. Artificial Neural Networksfrom an Engineering Perspective 5 failed to live up to expectations. Like their rule-based counterparts, neural networks are also sometimes seen as 'solutions looking for problems'. Although the application of neural networks in the process engineering industry has not matured yet, there is every reason to believe that like other computational methods it will also find a solid niche in this field. A closer look at the historic development of neural networks will underpin the analogous paths of these two branches of artificial intelligence. 1.3. BRIEF HISTORY OF NEURAL NETWORKS The modern era of neural networks had its inception in the 1940s, when the paper of McCulloch (a psychiatrist and neuroanatomist) and Pitts (a mathematician) on the modelling of neurons appeared (McCulloch and Pitts, 1943). The McCulloch-Pitts model contained all the necessary elements to perform logic operations, but implementation was not feasible with the bulky vacuum tubes prevalent at the time. Although this model never became technically significant, it laid the foundation for future developments. Donald Hebb's book The Organization of Behaviour first appeared in the late 1940s (Hebb, 1949), as well as his proposed learning scheme for updating a neuron's connections, presently referred to as the Hebbian learning rule. During the 1950s the first neurocomputers that could adapt their connections automatically were built and tested (Minsky, 1954). Minsky and Edmond constructed an analog synthetic brain at Harvard in 1951, to test Hebb's learning theory. Referred to as the Snark, the device consisted of 300 vacuum tubes and 40 variable resistors, which represented the weights of the network. The Snark could be trained to run a maze. The interest sparked by these ideas was further buoyed when Frank Rosenblatt invented his Mark I Perceptron in 1958. The perceptron was the world's first practical neurocomputer, used for the recognition of characters mounted on an illuminated board. It was built by Rosenblatt, Wightman and Martin in 1957 at the Cornell Aeronautics Laboratory and was sponsored by the US Office of Naval Research. A 20 x 20 array of cadmium sulphide photo- sensors provided the input to the neural network. An 8 x 8 array of servomotor driven poten- tiometers constituted the adjustable weights of the neural network. This was followed by Widrow's ADALINE (ADAptive LINEar combiner) in 1960, as well as the introduction of a powerful new learning rule called the Widrow-Hoff learning rule developed by Bernard Widrow and Marcian Hoff (Widrow and Hoff, 1960). The rule minimized the summed square error during training associated with the classification of patterns. The ADALINE network and its MADALINE (Multiple ADALINEs) extension were applied to weather forecasting, adaptive controls and pattern recognition. Although some success was achieved in this early period, the machine learning theorems were too limited at the time to support application to more complicated problems. This, as well as the lack of adequate computational facilities resulted in stagnation of the research in the neural network field or cybernetics, as it was known at the time. The early development of neural networks came to a dramatic end when Minsky and Papert (1969) showed that the capabilities of the linear networks studied at the time were severely limited. These revelations caused a virtually total cessation in the availability of research funding and many talented researchers left the field permanently.
  • 27. 6 Introduction to Neural Networks This growth has in part been fomented by improvements in very large scale integration (VLSI) technology, as well as the efforts of a small number of investigators who had conti- nued to work during the 1970s and early 1980s, despite a lack of funds and public interest. For example, in Japan, Sun-Ichi Amari (1972, 1977) pursued the investigation of neural networks with threshold elements and the mathematical theory of neural networks. His com- patriot, Kunihiko Fukushima developed a class of neural networks known as neocognitrons (Fukushima, 1980). The neocognitrons were biologically inspired models for visual pattern recognition, that emulated retinal images, and processed them by use of two-dimensional layers of neurons. In Finland, Teuvo Kohonen (1977, 1982, 1984, 1988, 1990, 1995) developed unsupervised neural networks capable of feature mapping into regular arrays of neurons, while James A. Anderson conducted research into associative memories (Anderson et al., 1977). Stephen Grossberg and Gail Carpenter introduced a number of neural archi- tectures and theories, while developing the theory of adaptive resonance theory (ART) neural networks (Grossberg, 1976, 1982; Carpenter and Grossberg, 1990). However, the initial interest in neural networks was only revived again in the early 1980s, and since then the field of neural networks has seen phenomenal growth, passing from a research curiosity to commercial fruition. Several seminal publications saw the light from 1982 to 1986. John J. Hopfield initiated the new renaissance of neural networks through the intro- duction of a recurrent neural network for associative memories (Hopfield, 1982, 1984). Further revitalisation of the field occurred with the publication of James McClelland and David Rumelhart's (Rumelhart and McClelland, 1986) two volumes on distributed parallel processing. In this publication, the earlier barriers that had led to the slump in the mainstream neural network development in the 1960s were circumvented. Much of this work concerning the training of neural networks with multiple layers had actually been discovered and rediscovered earlier by Dreyfus (1962), Bryson (Bryson and Ho, 1969), Kelley (1969), as well as Paul Werbos (Werbos, 1974), but it went largely unnoticed at the time. However, by the mid-1980s, neural network business has soared from an approximately $7 million industry in 1987, to an estimated $120 million industry in 1990. 1.4. STRUCTURES OF NEURAL NETWORKS Although much of the development of neural networks has been inspired by biological neural mechanisms, the link between artificial neural networks and their biological neural systems is rather tenuous. Biological organisms are simply not understood sufficiently to allow any mea- ningful emulation. As a result, artificial neural networks are better interpreted as a class of mathematical algorithms (since a network can essentially be regarded as a graphical notation for a large class of algorithms), as opposed to synthetic networks capable of competing with their biological equivalents. In general terms neural networks are therefore simply computers or computational structures consisting of large numbers of primitive process units connected on a massively parallel scale. 
These units, nodes or artificial neurons are relatively simple devices by themselves, and it is only through the collective behaviour of these nodes that neural networks can realize their powerful ability to form generalized representations of complex relationships and data structures. A basic understanding of the structure and functioning of a typical neural network node is therefore necessary for a better understanding of the capabilities and limitations of neural networks. The basic model of an artificial neuron that will be used throughout this text is consequently presented in more detail below.
1.4.1. Models of single neurons

Each node model consists of a processing element with a set of input connections, as well as a single output connection, as illustrated in Figure 1.1. Each of these connections is characterized by a numerical value or weight, which is an indication of the strength of the connection. The flow of information through the node is unidirectional, as indicated by the arrows in this figure.

Figure 1.1. Model of a single neuron, with inputs x1, x2, ... xm, weights w1, w2, ... wm and output z.

The output of the neuron can be expressed as follows:

z = f(Σi=1..m wi xi), or z = f(w^T x) (1.1)

where w is the weight vector of the neural node, defined as

w = [w1, w2, w3, ... wm]^T

and x is the input vector, defined as

x = [x1, x2, x3, ... xm]^T

Like all other vectors in the text, these vectors are column vectors, of which the superscript T denotes transposition. The function f(w^T x) is referred to as the activation function of the node, defined on the set of activation values, which are the scalar product of the weight and input vectors of the node. The argument of the activation function is sometimes referred to as the potential of the node, in analogy to the membrane potentials of biological neurons.

An additional input can be defined for some neurons, i.e. x0, with associated weight w0. This input is referred to as a bias and has a fixed value of -1. Like the other weights w1, w2, w3, ... wm, the bias weight is also adaptable. The use of a bias input value is sometimes necessary to enable neural networks to form accurate representations of process trends, by offsetting the output of the neural network. Although the above model is used commonly in the application of neural networks, some classes of neural networks can have different definitions of potential
(≠ w^T x). Also, in some neural networks, nodes can perform different functions during different stages of the application of the network.

Figure 1.2. Sigmoidal activation functions, (a) unipolar, z = 1/[1 + exp(-w^T x)] and (b) bipolar, that is z = [1 - exp(-w^T x)]/[1 + exp(-w^T x)].

Sigmoidal activation functions are used widely in neural network applications, that is bipolar sigmoidal activation functions (with φ = w^T x and λ > 0)

f(φ) = 2/(1 + e^(-λφ)) - 1 (1.2)

and their hard-limiting equivalent (bipolar sign functions, or bipolar binary functions)

f(φ) = sgn(φ) = +1, if φ > 0 (1.3)
f(φ) = sgn(φ) = -1, if φ < 0 (1.4)

and unipolar sigmoidal activation functions

f(φ) = 1/(1 + e^(-λφ)) (1.5)

with their hard-limiting version (unipolar sign functions, or unipolar binary functions) the same as for the bipolar activation functions (equations 1.3-1.4). The parameter λ is proportional to the gain of the neuron, and determines the steepness of the continuous activation function. These functions are depicted graphically in Figure 1.2. For obvious reasons, the sign function is also called a bipolar binary function.

1.4.2. Models of neural network structures

Neural networks consist of interconnections of nodes, such as the ones described above. These processing nodes are usually divided into disjoint subsets or layers, in which all the nodes have similar computational characteristics. A distinction is made between input, hidden and output layers, depending on their relation to the information environment of the neural network. The nodes in a particular layer are linked to other nodes in successive layers by means of the weighted connections discussed above. An elementary feedforward neural network can therefore be considered as a structure with n neurons or nodes receiving inputs (x), and m nodes producing outputs (z), that is
x = [x1, x2, x3, ... xn]^T
z = [z1, z2, z3, ... zm]^T

The potential for a particular node in this single layer feedforward neural network is similar to that for the neuron model, except that a double index i,j notation is used to describe the destination (first subscript) and source (second subscript) nodes of the weights. The activation value or argument of the i'th neuron in the network is therefore

φi = Σj=1..n wij xj, for i = 1, 2, ... m (1.6)

This value is subsequently transformed by the activation function of the i'th output node of the network

zi = f(wi^T x), for i = 1, 2, 3, ... m (1.7)

Neural networks with multiple layers are simply formed by cascading the single-layer networks represented by equations 1.6-1.7. Once the structure of the network (number of layers, number of nodes per layer, types of nodes, etc.) is fixed, the parameters (weights) of the network have to be determined. This is done by training (optimization) of the weight matrix of the neural network.

Feedforward neural networks, like the one discussed above, learn by repeatedly attempting to match sets of input data to corresponding sets of output data or target values (a process called supervised learning). The optimized weights constitute a distributed internal representation of the relationship(s) between the inputs and the outputs of the neural network. Learning typically occurs by means of algorithms designed to minimize the mean square error between the desired and the actual output of the network through incremental modification of the weight matrix of the network. In feedforward neural networks, information is propagated back through the network during the learning process, in order to update the weights. As a result, these neural networks are also known as back propagation neural networks.

Training of the neural network is terminated when the network has learnt to generalize the underlying trends or relationships exemplified by the data. Generalization implies that the neural network can interpolate sensibly at points not contained in its training set, as indicated in Figure 1.3. The ability of the neural network to do so is typically assessed by means of cross-validation, where the performance of the network is evaluated against a novel set of test data, not used during training. Modes of training other than the basic approach mentioned above are possible as well, depending on the function and structure of the neural network. These algorithms are considered in more detail below.

1.5. TRAINING RULES

As was mentioned above, neural networks are essentially data driven devices that can form internal distributed representations of complex relationships. These relationships are typically specified implicitly by means of examples, i.e. input-output data. Without loss of generality, a modelling problem (the behaviour of a process, plant unit operation, etc.) can be treated as follows, if the behaviour of the process is characterized by data of the following form
Y = [ y1,1  y1,2  ...  y1,p ]
    [ y2,1  y2,2  ...  y2,p ]
    [  ...   ...   ...  ... ]
    [ yn,1  yn,2  ...  yn,p ]  ∈ ℝ^(n×p) (1.8)

X = [ x1,1  x1,2  ...  x1,m ]
    [ x2,1  x2,2  ...  x2,m ]
    [  ...   ...   ...  ... ]
    [ xn,1  xn,2  ...  xn,m ]  ∈ ℝ^(n×m) (1.9)

where yi,k (i = 1, 2, ... p) represent p variables dependent on m causal or independent variables xj,k (j = 1, 2, ... m), based on n observations (k = 1, 2, ... n). The variables yi,k are usually parameters which provide a measure of the performance of the plant, while the xj,k variables are the plant parameters on which these performance variables are known to depend.

Figure 1.3. Overfitting of data (broken line), compared with generalization (solid line) by a neural network. The solid and empty circles indicate training and test data respectively.

The problem is then to relate the matrix Y to some set of functions Y = f(X) of matrix X, in order to predict Y from X. The main advantage of modelling techniques based on the use of neural networks is that a priori assumptions with regard to the functional relationship between X and Y are not required. The network learns this relationship instead, on the basis of examples of related x-y vector pairs or exemplars, and forms an internal distributed implicit model representation of the process.

In supervised training the weights of neural network nodes can be modified, based on the inputs received by the node, its response, as well as the response of a supervisor, or target value. In unsupervised learning, the response of a target value to guide learning is not available. This can be expressed by a generalized learning rule (Amari, 1990) where the weight
vector of a node increases proportionally to the product of the input vector x and a learning signal r. The learning signal r is generally a function of the weight vector wi ∈ ℝ^m, the input x ∈ ℝ^m and a target signal di ∈ ℝ, where applicable, that is

r = r(wi, x, di) (1.12)

The weight vector at time t is incremented by

Δwi(t) = β r[wi(t), x(t), di(t)] x(t) (1.13)

The parameter β determines the learning rate, so that the weight vector is updated at discrete time steps as follows

wi(t+1) = wi(t) + β r[wi(t), x(t), di(t)] x(t) (1.14)

Different learning rules can be distinguished on the basis of their different learning signals, as considered in more detail below.

1.5.1. Supervised training

a) Perceptron learning rule

The perceptron learning rule is characterized by a learning signal that is the difference between the desired or target response of the neuron and the neuron's actual response.

r = di - zi (1.15)

The adjustment of the weights in this supervisory procedure takes place as follows

Δwi = β[di - sgn(wi^T x)]x (1.16)

The perceptron rule pertains to binary node outputs only, and weights are only adjusted if there is a difference between the actual and target response of the neural network. The weights of the network can assume any initial values. The method is elucidated by the example below.

Assume the set of training vectors to be x1 = [1, 2, 3 | 1], x2 = [-1, -2, -1 | -1] and x3 = [-3, -1, 0 | -1], where the value following the vertical bar is the target output of each exemplar. The learning rate is assumed to be β = 0.1, and the initial weight vector is arbitrarily assumed to be w(0) = [1, 0, -1]. As before, the input or training vectors are presented to the network sequentially.

Step 1: For the first input x1, with desired output d1, the activation of the node is

w(0)^T x1 = [1, 0, -1][1, 2, 3]^T = -2

and the output of the node is z1 = sgn(-2) = -1. Since z1 is not equal to the target d1 = 1, the weights of the network have to be adjusted.
w(1) = w(0) + β[d1 - sgn(w(0)^T x1)]x1
w(1) = [1, 0, -1]^T + 0.1[1 - (-1)][1, 2, 3]^T = [1, 0, -1]^T + [0.2, 0.4, 0.6]^T
w(1) = [1.2, 0.4, -0.4]^T

Step 2: For the second input x2, with desired output d2, the activation of the node is

w(1)^T x2 = [1.2, 0.4, -0.4][-1, -2, -1]^T = -1.6

and the output of the node is z2 = sgn(-1.6) = -1. Since z2 is equal to the target d2 = -1, adjustment of the weights of the network is not required, so that

w(2) = [1.2, 0.4, -0.4]^T

Step 3: For the third input x3, with desired output d3, the activation of the node is

w(2)^T x3 = [1.2, 0.4, -0.4][-3, -1, 0]^T = -4

and the output of the node is z3 = sgn(-4) = -1. Since z3 is equal to the target d3 = -1, adjustment of the weights of the network is again not required, so that

w(3) = [1.2, 0.4, -0.4]^T

Training of the network can therefore be terminated, since the network (node) has learnt to reproduce all the targeted outputs correctly.

b) Delta and generalized delta rules

The delta learning rule pertains to nodes with continuous activation functions only, and the learning signal of the rule is defined as

r = [di - f(wi^T x)]f'(wi^T x) (1.17)

The term f'(wi^T x) is the derivative of the activation function f(wi^T x). The rule can be derived from the least squared error of the difference between the actual output of the network and the desired output, in the form

E = 1/2 (di - zi)^2 = 1/2 [di - f(wi^T x)]^2 (1.18)

Calculation of the gradient vector of the squared error in this equation, with regard to wi, gives

∇E = -[di - f(wi^T x)]f'(wi^T x)x (1.19)

The adjustment of the weights in this supervisory procedure takes place as follows

Δwi = -β∇E (1.20)
or

Δwi = β[di - f(wi^T x)]f'(wi^T x)x (1.21)

For a single weight, this is equivalent to

Δwij = β[di - f(wi^T x)]f'(wi^T x)xj, for j = 0, 1, 2, ... N (1.22)

As an example, assume the set of training vectors to be x1 = [1, 2, 3 | 1], x2 = [-1, -2, -1 | 0] and x3 = [-3, -1, 0 | 0]. The learning rate is assumed to be β = 0.1, and the initial weight vector is arbitrarily assumed to be w(0) = [1, 0, -1]. The conditions are the same as for the perceptron rule considered above, except that the target vectors are different. As before, the input or training vectors are presented to the network sequentially. The node is assumed to have a continuous unipolar activation function of the form f(x) = 1/[1 + e^(-x)]. In this case both the value of the activation of the neuron, as well as the derivative of the activation, have to be computed. One of the reasons for using sigmoidal activation functions is that their derivatives can be calculated easily. For a continuous unipolar sigmoidal activation function, the derivative is

d/dx[f(x)] = d/dx[1 + exp(-x)]^(-1)
= (-1)[1 + exp(-x)]^(-2) exp(-x)(-1)
= [1 + exp(-x)]^(-2) exp(-x)
= [f(x)]^2 [1 + exp(-x) - 1]
= [f(x)]^2 [1/f(x) - 1]
= [f(x)]^2 {[1 - f(x)]/f(x)}
= f(x)[1 - f(x)]

The initial activation of the node is w(0)^T x1 = -2. The output of the node is z(0) = 1/[1 + exp(-w(0)^T x1)] = 1/[1 + e^2] = 0.119 (which differs from the desired output d1 = 1). The derivative of the output node is z'(0) = 0.119(1 - 0.119) = 0.1048, and

w(1) = w(0) + β[d1 - f(w(0)^T x1)]f'(w(0)^T x1)x1
w(1) = [1, 0, -1]^T + 0.1(1 - 0.119)(0.1048)[1, 2, 3]^T
w(1) = [1, 0, -1]^T + 0.0092[1, 2, 3]^T = [1.0092, 0.0185, -0.9723]^T

This adjustment in the weight vector of the neuron results in a smaller error on the first exemplar, reducing it from 0.879 to approximately 0.867.

For step 2, the activation of the node is

w(1)^T x2 = [1.0092, 0.0185, -0.9723][-1, -2, -1]^T = -0.074

The output of the node is z(1) = 1/[1 + exp(-w(1)^T x2)] = 1/[1 + e^0.074] = 0.4815 (which differs from the desired output d2 = 0). The derivative of the output node is z'(1) = 0.4815(1 - 0.4815) = 0.2497, and

w(2) = w(1) + β[d2 - f(w(1)^T x2)]f'(w(1)^T x2)x2
w(2) = [1.0092, 0.0185, -0.9723]^T + 0.1(0 - 0.4815)(0.2497)[-1, -2, -1]^T
w(2) = [1.0092, 0.0185, -0.9723]^T + 0.0120[1, 2, 1]^T = [1.0212, 0.0425, -0.9603]^T

This adjustment in the weight vector of the node results in a smaller error on the second exemplar, reducing it from 0.482 to approximately 0.464.

The delta rule requires a small learning rate (approximately 0.001 < β < 0.1), since the weight vector is moved in the negative error gradient direction in the weight space. Since the error is only reduced by small increments at a time, the set of training data has to be presented to the neural network repeatedly, in order to reduce the output errors of the network satisfactorily. The best value for the learning rate β depends on the error surface, i.e. a plot of E versus wji (which is rarely known beforehand). If the surface is relatively smooth, a larger learning rate will speed convergence. On the other hand, if the error surface changes relatively rapidly, a smaller learning rate would clearly be desirable. As a general rule of thumb, the largest learning rate not causing oscillation should be used.

Note that with this training scheme, if the desired output of the j'th unit is less than the actual output of the neural network, the weight wkj connecting input unit k with output unit j is decreased (for a positive input). This does not take into account the response of the network to other training patterns. Moreover, zero inputs do not result in any adjustment, not even for non-zero unit errors.

A simple method of increasing the learning rate without risking instability is to modify the delta rule through the inclusion of a momentum term (α > 0), that is

Δwji(t) = αΔwji(t-1) + βδj(t)yi(t) (1.23)

Equation (1.23) is known as the generalized delta rule, since it includes the delta rule (α = 0). The inclusion of a momentum term has the following benefits.
• When the partial derivative ∂E(k)/∂wji(k) has the same algebraic sign in consecutive iterations, the weighted sum Δwji(t) grows, resulting in large adjustments to the weight wji(k). This tends to result in accelerated descent in steady downhill directions.
• When the partial derivative ∂E(k)/∂wji(k) has alternating algebraic signs in consecutive iterations, the weighted sum Δwji(t) is reduced, resulting in small adjustments to the weight wji(k). The inclusion of the momentum term therefore has a stabilizing effect in directions that tend to produce oscillation.
• In addition, the momentum term can have the advantage of preventing the learning process from getting trapped in shallow local minima on the error surface.

When all the network weights are adjusted for the k'th exemplar (i.e. for all i and j) as indicated above, it is referred to as per sample training or pattern training. An alternative is to train per epoch, by accumulating weight changes prior to adjustment, i.e. Δw'ji = Σk=1..n Δkwji. The weights in the network are thus only adjusted after each presentation of all the exemplars in the training base, or a subset (epoch) of these exemplars.

The supervised training of back propagation neural networks may be viewed as a global identification problem, which requires the minimization of a cost function. The cost function (E) can be defined in terms of the discrepancies between the outputs of the neural network and the desired or target output values. More specifically, the cost function can be expressed in terms of the weight matrix of the network (which has an otherwise fixed configuration). The purpose of training is to adjust these free parameters (weights) to enable the outputs of the neural network to match the target outputs more closely.

The standard back propagation algorithm modifies a particular parameter based on an instantaneous estimate of the gradient (∂E/∂wi) of the cost function with respect to the parameter (weight). This is an efficient method of training, although it uses a minimum amount of information in the process. As a consequence, the use of the algorithm becomes impractical with large networks (which require excessively long training times). The problem can be alleviated by making better use of available information during training, e.g. by incorporating training heuristics into the algorithm.

A wide variety of approaches to the optimization of the weight matrices of neural networks have been documented to date. In practice, gradient descent methods, such as the generalized delta rule, have proved to be very popular, but other methods are also being used to compensate for the disadvantages of these methods (chiefly their susceptibility towards entrapment in local minima). These methods include 2nd order gradient descent methods, such as conjugate gradients, Newton and Levenberg-Marquardt methods (Reklaitis et al., 1983), as well as genetic algorithms, among others, as discussed in Chapter 2.

c) Widrow-Hoff learning rule

Like the perceptron and delta rules, the Widrow-Hoff rule applies to the supervised training of neural networks. The Widrow-Hoff rule does not depend on the activation function of the node, since it minimizes the squared error between the target value and the activation of the node, that is
r = di - wi^T x (1.24)

and

Δwi = β[di - wi^T x]x (1.25)

which is equivalent to

Δwij = β[di - wi^T x]xj, for j = 0, 1, 2, ... n (1.26)

for the adjustment of a single weight. This rule is clearly a special case of the delta rule, where the activation function is the identity function, f(wi^T x) = wi^T x, and f'(wi^T x) = 1. As is the case with the delta rule, the weights of the neural network are also initialized to arbitrary (small) values.

d) Correlation learning rule

The correlation rule is obtained by substituting r = di in the general learning rule (equation 1.12), so that

Δwi = βdix, or (1.27)
Δwij = βdixj, for j = 0, 1, 2, ... n (1.28)

The increment in the weight vector is directly proportional to the product of the target value of a particular exemplar and the exemplar (input) itself. The rule is similar to the Hebbian rule discussed below (with a binary activation function and zi = di), except that it is a supervised learning rule. Similar to Hebbian learning, the initial weight vector should also be zero.

1.5.2. Unsupervised training

a) Hebbian and anti-Hebbian learning rule

For the Hebbian learning rule, the weight vector is initialized to small random values. Subsequent change in the weight vector is related to the product of the input and outputs of the neural network. The learning signal is simply the output of the network, and therefore dependent on the activation of the nodes, as well as the activation functions of the nodes, only. Since the Hebbian learning rule is not dependent on a target value, it is an unsupervised learning rule, i.e.

r = f(wi^T x) (1.29)
Δwi = βf(wi^T x)x, or (1.30)
wi(t+1) = wi(t) + βf[wi(t)^T x(t)]x(t) (1.31)

As can be seen from equation 1.31, the weights of a node are increased when the correlation between the input and the output of the node is positive, and decreased when this correlation is negative. Also, the output is progressively strengthened for each presentation of the input.
Frequent exemplars will therefore tend to have a larger influence on the node's weight vector and will eventually produce the largest output. A simple example demonstrates the dynamics of the Hebbian learning rule for a single node with a bipolar binary activation function. Since there is only one weight vector, the subscript of the weight vector has been dropped. Let w(0) = [0.1, -0.1, 0]^T, and the two input vectors available for training be x1 = [1, -1, 1]^T and x2 = [1, -0.5, -2]^T. A unity value for the learning parameter β is assumed, while the input vectors are presented sequentially to the neural network.

Step 1: Calculation of the activation and output of the neural network, based on input of the first exemplar.

w(0)^T x1 = [0.1, -0.1, 0][1, -1, 1]^T = 0.2
z(0) = sgn[w(0)^T x1] = +1

Step 2: Modification of the weights based on input of the first exemplar.

w(1) = w(0) + sgn[w(0)^T x1]x1 = [0.1, -0.1, 0]^T + (+1)[1, -1, 1]^T = [1.1, -1.1, 1]^T

Step 3: Calculation of the activation and output of the neural network, based on input of the second exemplar.

w(1)^T x2 = [1.1, -1.1, 1][1, -0.5, -2]^T = -0.35
z(1) = sgn[w(1)^T x2] = -1

Step 4: Modification of the weights based on input of the second exemplar.

w(2) = w(1) + sgn[w(1)^T x2]x2 = [1.1, -1.1, 1]^T + (-1)[1, -0.5, -2]^T = [0.1, -0.6, 3]^T

The procedure can be repeated with additional presentation of the input vectors, until the weights of the neural network have stabilized, or until some other termination criterion has been satisfied. From the above example it can be seen that learning with a discrete activation function f(w^T x) = ±1, and learning rate β = 1, results in addition or subtraction of the entire input vector to and from the weight vector respectively. Note that with continuous activation functions, such as unipolar or bipolar sigmoids, the change in the weight vector is some fraction of the input vector instead.

b) Winner-takes-all rule

Like the outstar rule, the winner-takes-all rule can be explained best by considering an ensemble of neurons, arranged in a layer or some regular geometry. No output data are required to train the neural network (i.e. training is unsupervised). Each node in the ensemble measures the (Euclidean) distance of its weights to the input values (exemplars) presented to the nodes.
For example, if the input data consist of m-dimensional vectors of the form x = [x1, x2, ... xm]^T, then each node will have m weight values, which can be denoted by wi = [wi1, wi2, ... wim]. The Euclidean distance Di = ||x - wi|| between the input vectors and the weight vectors of the neural network is then computed for each of the nodes and the winner is determined by the minimum Euclidean distance. This is equivalent to

wp^T x = max i=1, 2, ... p (wi^T x) (1.32)

The weights of the winning node (assumed to be the p'th node), as well as its neighbouring nodes (constituting the adaptation zone associated with the winning node) are subsequently adjusted in order to move the weights closer to the input vector, as follows

Δwp = α(x - wp), or Δwip = α(xi - wip,old) (1.33)

where α is an appropriate learning coefficient which decreases with time (typically starting at 0.4 and decreasing to 0.1 or lower). The adjustment of the weights of the nodes in the immediate vicinity of the winning node is instrumental in the preservation of the order of the input space and amounts to an order preserving projection of the input space onto the ensemble of nodes (typically a one- or two-dimensional layer). As a result similar inputs are mapped to similar regions in the output space, while dissimilar inputs are mapped to different regions in the output space. The winner-takes-all rule is especially important in the multivariate analysis of data, and will be considered in more detail at a later stage.

Figure 1.4. One-dimensional array of p competitive nodes (z1, z2, ... zp), each receiving m inputs (x1, x2, ... xm), showing the winning node (zk), surrounded by neighbouring nodes (zk-2, zk-1, zk+1 and zk+2) in its neighbourhood (broken lines).

c) Outstar learning rule

The outstar rule is best explained in the context of an ensemble of nodes, arranged in a layer. The rule applies to supervised learning, but is also designed to allow the network to extract
statistical features from the inputs and outputs of the neural network. Weight adjustments are computed as follows

Δwj = β(d - wj) (1.34)

or in terms of the adjustments of the individual weights of the nodes, for i = 1, 2, ... m

Δwij = β(di - wij) (1.34)

In contrast with the previous rules, the weight vector that is updated fans out of the j'th node. The learning rate β is typically a small positive parameter that decreases as learning progresses.

Although the above rules are not the only ones used to train neural networks, they are widely used, although their application does not guarantee convergence of the neural network being trained. More detailed discussions are available in other sources, such as Zurada (1992), Bishop (1995) and Haykin (1999).

1.6. NEURAL NETWORK MODELS

1.6.1. Multilayer Perceptrons

Since multilayer perceptron neural networks are by far the most popular (and simple) neural networks in use in the process engineering industries, they are considered first.

a) Basic structure

As mentioned previously, in multilayer perceptron neural networks the processing nodes are usually divided into disjoint subsets or layers, in which all the nodes have similar computational characteristics.

Figure 1.5. Structure of a typical multilayer perceptron neural network, with inputs x1, x2, ... feeding a sigmoidal hidden layer, which in turn feeds a linear output node producing the output y.

A distinction is made between input, hidden and output layers depending on their relation to the information environment of the network. The nodes in a particular layer are linked to other nodes in successive layers by means of artificial synapses or weighted connections (adjustable numeric values), as shown in Figure 1.5. These weights form the crux of the model,
in that they define a distributed internal relationship between the input and output activations of the neural network. The development of neural network models thus consists of first determining the overall structure of the neural network (number of layers, number of nodes per layer, types of nodes, etc.). Once the structure of the network is fixed, the parameters (weights) of the network have to be determined. Unlike the case with a single node, a network of nodes requires that the output error of the network be apportioned to each node in the network.

b) Back propagation algorithm

The back propagation algorithm can be summarized as follows, for a network with a single hidden layer with q nodes and an output layer with p nodes, without loss of generality. Before training, the learning rate coefficient (η) and the maximum error (Emax) are specified by the user.

i) Initialize the weight matrices of the hidden layer (V) and the output layer (W) to small random values.

ii) Compute the response of the hidden layer (y) and the output layer (z) of the neural network, when presented with an exemplar of the form [x|t], where x ∈ ℝ^m and t ∈ ℝ^p, i.e.

y = Γ(Vx)
z = Γ(Wy)

where Γ(·) denotes element-wise application of the activation function.

iii) Compute the error, Enew = Eold + ½ Σk (zk - tk)^2, for k = 1, 2, ... p.

iv) Calculate the error terms δz and δy, with δz ∈ ℝ^p and δy ∈ ℝ^q, i.e.

δzk = ½(tk - zk)(1 - zk^2), for k = 1, 2, ... p
δyj = ½(1 - yj^2) Σk=1..p δzk wkj, for j = 1, 2, ... q

v) Adjust the weights of the output layer

Wnew = Wold + η δz y^T (or wkj,new = wkj,old + η δzk yj, for k = 1, 2, ... p and j = 1, 2, ... q)

vi) Adjust the weights of the hidden layer

Vnew = Vold + η δy x^T (or vji,new = vji,old + η δyj xi, for j = 1, 2, ... q and i = 1, 2, ... m)

vii) If there are more exemplars in the training set, go to step ii).

viii) If E < Emax, stop, otherwise go to step ii).
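To make the procedure concrete, a minimal computational sketch of steps i) to viii) is given below (in Python). It assumes bipolar sigmoidal nodes in both layers, so that f'(a) = ½[1 - f(a)^2], per-sample updating and bias inputs fixed at -1; the data set, learning rate, number of hidden nodes and stopping criteria are merely illustrative choices and are not prescribed by the algorithm above.

import numpy as np

def f(a):                                        # bipolar sigmoid; f'(a) = 0.5*(1 - f(a)**2)
    return 2.0 / (1.0 + np.exp(-a)) - 1.0

def train_backprop(X, T, q=4, eta=0.5, E_max=0.01, max_sweeps=20000, seed=0):
    rng = np.random.default_rng(seed)
    n, m = X.shape
    p = T.shape[1]
    V = rng.uniform(-0.5, 0.5, size=(q, m + 1))  # i) hidden weights (last column: bias input -1)
    W = rng.uniform(-0.5, 0.5, size=(p, q + 1))  # i) output weights (last column: bias input -1)
    for _ in range(max_sweeps):
        E = 0.0
        for x, t in zip(X, T):
            xa = np.append(x, -1.0)              # augment the input with the bias value of -1
            y = f(V @ xa)                        # ii) hidden layer response
            ya = np.append(y, -1.0)
            z = f(W @ ya)                        # ii) output layer response
            E += 0.5 * np.sum((z - t) ** 2)      # iii) accumulate the error
            dz = 0.5 * (t - z) * (1.0 - z ** 2)  # iv) output error terms
            dy = 0.5 * (1.0 - y ** 2) * (W[:, :q].T @ dz)   # iv) hidden error terms
            W += eta * np.outer(dz, ya)          # v) adjust output weights
            V += eta * np.outer(dy, xa)          # vi) adjust hidden weights
        if E < E_max:                            # viii) stop when the error is small enough
            break
    return V, W

# Illustrative use (not from the text): an XOR-like mapping of two bipolar inputs.
X = np.array([[-1., -1.], [-1., 1.], [1., -1.], [1., 1.]])
T = np.array([[-1.], [1.], [1.], [-1.]])
V, W = train_backprop(X, T)
for x, t in zip(X, T):
    z = f(W @ np.append(f(V @ np.append(x, -1.)), -1.))
    print(x, t, z.round(2))                      # the outputs should approach the targets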
The back propagation algorithm is highly efficient, as the number of operations required to evaluate the derivatives of the error function scales with O(Nw), for a sufficiently large number of weights Nw. Since the number of weights is usually much larger than the number of nodes in the neural network, most of the computational effort in the forward cycle during training is devoted to the evaluation of the weighted sums (one multiplication and one addition per connection) in order to determine the activations of the nodes, while the evaluation of the activation functions can be seen as a small computational overhead (Bishop, 1995). Normally, numerical evaluation of the derivatives would each be O(Nw), so that the evaluation of the derivatives would scale to O(Nw^2). However, with the back propagation algorithm, evaluation for each exemplar scales to O(Nw) only. This is a crucial gain in efficiency, since training can involve a considerable computational effort.

1.6.2. Kohonen self-organizing mapping (SOM) neural networks

Self-organizing neural networks or self-organizing (feature) maps (SOM) are systems that typically create two- or three-dimensional feature maps of input data in such a way that order is preserved. This characteristic makes them useful for cluster analysis and the visualization of topologies and hierarchical structures of higher-dimensional input spaces.

Figure 1.6. A self-organizing mapping neural network (Kohonen), with an input layer (x1, x2, ... xm) fully connected to a competitive layer; the output layer is not shown.

Self-organizing systems are based on competitive learning, that is, the outputs of the network compete among themselves to be activated or fired, so that only one node can win at any given time. These nodes are known as winner-takes-all nodes. Self-organizing or Kohonen networks do not require output data to train the network (i.e. training is unsupervised). Such a network typically consists of an input layer, which is fully connected to one or more two-dimensional Kohonen layers, as shown in Figure 1.6.

Each node in the Kohonen layer measures the (Euclidean) distance of its weights to the input values (exemplars) fed to the layer. For example, if the input data consist of m-dimensional vectors of the form x = {x1, x2, ... xm}, then each Kohonen node will have m weight values, which can be denoted by wi = {wi1, wi2, ... wim}. The Euclidean distances Di = ||x - wi|| between the input vectors and the weights of the network are then computed for each of the Kohonen nodes and the winner is determined by the minimum Euclidean distance.
The weights of the winning node, as well as its neighbouring nodes, which constitute the adaptation zone associated with the winning node, are subsequently adjusted in order to move the weights closer to the input vector. The adjustment of the weights of the nodes in the immediate vicinity of the winning node is instrumental in the preservation of the order of the input space and amounts to an order preserving projection of the input space onto the (typically) two-dimensional Kohonen layer. As a result similar inputs are mapped to similar regions in the output space, while dissimilar inputs are mapped to different regions in the output space.

One of the problems that have to be dealt with as far as the training of self-organizing neural networks is concerned, is the non-participation of neurons in the training process. This problem can be alleviated by modulation of the selection of (a) winning nodes or (b) learning rates through frequency sensitivity. Frequency sensitivity entails a history-sensitive threshold in which the level of activation of the node is proportional to the amount by which the activation exceeds the threshold. This threshold is constantly adjusted, so that the thresholds of losing neurons are decreased, and those of winning neurons are increased. In this way output nodes which do not win sufficiently frequently become increasingly sensitive. Conversely, if nodes win too often, they become increasingly insensitive. Eventually this enables all neurons to be involved in the learning process.

Training of Kohonen self-organised neural networks can be summarized as follows.

a) Summary of the SOM algorithm (Kohonen)

i. Initialization: Select small random values for the initial weight vectors wj(0), so that the wj(0) are different for j = 1, 2, ... p, where p is the number of neurons in the lattice.

ii. Sampling: Draw a sample x from the input distribution with a certain probability.

iii. Similarity matching: Find the winning neuron I(x) at time t, using the minimum distance Euclidean or other criterion:

I(x) = arg minj ||x(t) - wj||, for j = 1, 2, ... p (1.35)

iv. Updating: Modify the synaptic weights of the neurons in the lattice as follows:

wj(t+1) = wj(t) + η(t)[x(t) - wj(t)], for j ∈ ΛI(x)(t) (1.36)
wj(t+1) = wj(t), otherwise,

where η(t) is a time-variant learning rate parameter and ΛI(x)(t) is the neighbourhood function centred around the winning neuron I(x), all of which are varied dynamically. These parameters are often allowed to decay exponentially, for example η(t) = η0 exp(-t/τ). For example, for Gaussian-type neighbourhood functions the modification of the synaptic weight vector wj of the j'th neuron at a lateral distance dji from the winning neuron I(x) is

wj(t+1) = wj(t) + η(t)hj,I(x)(t)[x(t) - wj(t)] (1.37)

v. Continuation: Continue with step ii until the feature map has stabilized.
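A minimal computational sketch of steps i to v is given below (in Python), for a one-dimensional lattice of p nodes with a Gaussian neighbourhood function and exponentially decaying learning rate and neighbourhood width. The lattice size, decay constants and example data are illustrative assumptions only and are not prescribed by the algorithm above.

import numpy as np

def train_som(X, p=20, n_steps=5000, eta0=0.4, sigma0=5.0, tau=1000.0, seed=0):
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    W = rng.uniform(-0.1, 0.1, size=(p, m))          # i.  small random, distinct initial weights
    lattice = np.arange(p, dtype=float)              # node coordinates on the 1-D lattice
    for t in range(n_steps):
        eta = eta0 * np.exp(-t / tau)                # decaying learning rate eta(t)
        sigma = sigma0 * np.exp(-t / tau)            # decaying neighbourhood width
        x = X[rng.integers(len(X))]                  # ii. sampling
        winner = int(np.argmin(np.linalg.norm(x - W, axis=1)))   # iii. similarity matching
        d = lattice - lattice[winner]                # lateral distances d_{j,I(x)}
        h = np.exp(-(d ** 2) / (2.0 * sigma ** 2))   # Gaussian neighbourhood h_{j,I(x)}(t)
        W += eta * h[:, None] * (x - W)              # iv. updating
    return W

# Illustrative use: map 2-D points lying on a ring onto the 1-D lattice.
theta = np.random.default_rng(1).uniform(0, 2 * np.pi, 500)
X = np.column_stack([np.cos(theta), np.sin(theta)])
W = train_som(X)
print(W[:5].round(2))   # neighbouring nodes should end up with similar weight vectors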
b) Properties of the SOM algorithm

The feature map displays important statistical properties of the input.

• Approximation of the input space: The self-organised feature map, represented by the set of synaptic weight vectors {wj | j = 1, 2, ... p} in the output space, provides a good approximation of the input space X.
• Topological ordering: The feature map is topologically ordered in that the spatial location of a neuron in the lattice corresponds to a particular domain of input patterns.
• Density matching: Variations in the distribution of the input are reflected in the feature map, in that regions in the input space X from which samples are drawn with a higher probability than other regions are mapped with better resolution and onto larger domains of the output space than samples drawn with a lower probability.

If f(x) denotes the multidimensional probability density function of the input vector x, then the probability density function integrated over the entire input space should equal unity, that is

∫ f(x) dx = 1

Let m(x) denote the magnification factor of the SOFM, defined as the number of neurons associated with a small volume dx of the input space X. Then the magnification factor, integrated over the entire input space, should equal the total number of neurons in the network lattice, that is

∫ m(x) dx = p (1.38)

In order to match the input density exactly in the feature map

m(x) ∝ f(x)

In other words, if a particular region in the input space occurs more often, that region is mapped onto a larger region of neurons in the network lattice.

For 2D- and higher-dimensional maps the magnification factor m(x) is generally not expressible as a simple function of the probability density function f(x) of the input vector x. In fact, such a relationship can only be derived for 1D-maps, and even then the magnification factor is usually not proportional to the probability density function. Generally speaking, the map tends to underrepresent regions of high input density and, conversely, to overrepresent regions of low input density. It is for this reason that heuristics (such as the conscience mechanism) are sometimes included in the SOM algorithm, i.e. to force a more exact density matching between the magnification factor of the map and the probability density function of the input. Alternatively, an information-theoretic approach can be used to construct the feature map.

1.6.3. Generative topographic maps

Generative topographic maps (Bishop et al., 1997) are density models of data based on the use of a constrained mixture of Gaussians in the data space, in which the model parameters (W and β) are determined by maximum likelihood using the expectation-maximization algorithm.
Generative topographic maps are defined by specifying a set of points {xi} in a latent space, together with a set of basis functions {φj(x)}. A constrained mixture of Gaussians is defined by adaptive parameters W and β, with centres Wφ(xi) and common covariance β^(-1) I. As a latent variable model, a generative topographic map represents a distribution p(t) of data in an m-dimensional space, t = (t1, t2, ... tm), in terms of a set of p latent variables x = (x1, x2, ... xp). The mapping between points in the p-dimensional latent space and the m-dimensional data space is represented by a function y(x,W), as indicated in Figure 1.7, for p = 2 and m = 3. The matrix of parameters determines the mapping (which represents the weights and biases in the case of a neural network model).

Figure 1.7. A manifold M embedded in the (three-dimensional) data space is defined by the function y(x,W), given by the image of the (two-dimensional) latent variable space P under the mapping x → y.

If a probability distribution p(x) is defined on the latent variable space P, this will induce a corresponding distribution p(y|W) in the data space M. If p < m, then the distribution in the data space will be confined to a p-dimensional manifold and would therefore be singular. However, since this manifold will only approximate the actual distribution of the data in the data space, it is appropriate to include a noise model with the t vector. The distribution of t can be represented by spherical Gaussians (3) centred on y(x,W), with a common inverse variance β, i.e.

p(t | x, W, β) = (β/2π)^(m/2) exp{-(β/2) ||y(x,W) - t||^2} (1.39)

The distribution in the t space (M), for a given matrix of parameters W, is obtained by integration over the x space (P), that is

p(t | W, β) = ∫ p(t | x, W, β) p(x) dx (1.40)

For a given data set of size n, the parameter matrix W and the inverse variance β can be determined by use of maximum likelihood. In practice, the log likelihood is maximized, i.e.

L(W, β) = ln Πk=1..n p(tk | W, β) (1.41)

(3) Of course, other models for p(t|x) may also be appropriate.
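By way of illustration, the following sketch (in Python) evaluates the log likelihood of equation (1.41) for a given W and β. It assumes, as in the standard formulation of Bishop et al. (1997) but not spelled out in the text above, that p(x) places equal weight on a finite grid of K latent points and that the basis functions are Gaussian radial basis functions with an appended bias term; the grid, basis widths, data and parameter values are illustrative assumptions only.

import numpy as np

def rbf_basis(latent_pts, centres, width):
    # Gaussian RBF basis functions phi_j(x) evaluated at each latent grid point, plus a bias column.
    d2 = ((latent_pts[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
    phi = np.exp(-d2 / (2.0 * width ** 2))
    return np.hstack([phi, np.ones((len(latent_pts), 1))])

def gtm_log_likelihood(T, W, beta, Phi):
    K = Phi.shape[0]                          # number of latent grid points x_i
    m = T.shape[1]                            # dimension of the data space
    Y = Phi @ W                               # mixture centres y(x_i, W); W stored transposed here
    d2 = ((T[:, None, :] - Y[None, :, :]) ** 2).sum(-1)          # squared distances ||y - t_k||^2
    log_comp = 0.5 * m * np.log(beta / (2 * np.pi)) - 0.5 * beta * d2   # ln p(t_k | x_i, W, beta)
    a = log_comp.max(axis=1, keepdims=True)   # stable log-sum-exp over the K mixture components
    log_p = a[:, 0] + np.log(np.exp(log_comp - a).sum(axis=1)) - np.log(K)
    return float(log_p.sum())                 # equation (1.41)

# Illustrative use with a 2-D latent grid mapped into a 3-D data space.
rng = np.random.default_rng(0)
grid = np.array([[i, j] for i in np.linspace(-1, 1, 5) for j in np.linspace(-1, 1, 5)])
Phi = rbf_basis(grid, grid[::3], width=0.5)   # a coarser subset of the grid as RBF centres
W = rng.normal(scale=0.1, size=(Phi.shape[1], 3))
T = rng.normal(size=(100, 3))                 # stand-in for observed data t_k
print(gtm_log_likelihood(T, W, beta=1.0, Phi=Phi))

In the EM algorithm referred to above, this quantity is the objective that increases monotonically as W and β are re-estimated at each iteration.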
Unlike the SOM algorithm, the GTM algorithm defines an explicit probability density given by the mixture distribution (Bishop et al., 1998). Consequently, there is a well-defined objective function given by the log likelihood, and convergence to a local maximum of the objective function is guaranteed by use of the expectation maximization (EM) algorithm (Dempster et al., 1977). In contrast, the SOM algorithm does not have an explicit cost function. Moreover, conditions under which self-organisation occurs in SOM neural networks are not quantified and in practice it is necessary to validate the spatial ordering of trained SOM models.

1.6.4. Learning vector quantization neural networks

Vector quantization for data compression is an important application of competitive learning, and is used for both the storage and transmission of speech and image data.

Figure 1.8. A learning vector quantization neural network with two classes (c1 and c2), consisting of an input layer (x1, x2, ... xm), a self-organizing layer with q nodes for each class, and a classification layer.

The essential concept on which learning vector quantization networks (see Figure 1.8) are based is that a set of vectors can be distributed across a space in such a way that their spatial distribution corresponds with the probability distribution of a set of training data. The idea is to categorise the set of input vectors into c classes C = {c1, c2, ... cc}, each class of which is characterized by a class or prototype vector. Each of the original input vectors is subsequently represented by the class of which it is a member, which allows for high ratios of data compression. The components of the vectors usually have continuous values, and instead of storing or transmitting the prototype vectors, only their indices need to be handled, once a set of prototype vectors or a codebook has been defined.

The class of a particular input can be found by locating its nearest prototype vector, using an ordinary (Euclidean) metric, i.e. the prototype wi* for which ||x - wi*|| ≤ ||x - wi|| for all i. This divides the vector space into a so-called Voronoi (or Dirichlet) tessellation, as indicated in Figure 1.9. Learning vector quantization networks differ from supervised neural networks, in that they construct their own representations of categories among input data.

A learning vector quantization network contains an input layer, a Kohonen layer which performs the classification based on the previously learned features of the various classes, and an output layer, as shown in Figure 1.8. The input layer is comprised of one node for each feature or input parameter of the various classes, while the output layer contains one node for each class. Although the number of classes is predefined, the categories (q) assigned to these
Although the number of classes is predefined, the categories (q) assigned to these classes are not. During training the Euclidean distance (di) between a training exemplar (x) and the weight vector (wi) of each node is computed, that is

di = ||wi − x|| = [Σj (wij − xj)²]^(1/2)   (1.55)

Figure 1.9. An example of Voronoi tessellation of data in two dimensions.

If the winning node and the training vector share the same class, the winning node is moved towards the training vector, otherwise it is moved away, or repulsed, i.e.

wp = wp + η(x − wp), if the winning node is in the correct class   (1.56)
wp = wp − η(x − wp), if the winning node is not in the correct class   (1.57)

As a consequence, the nodes assigned to a class migrate to a region associated with their class. In the classification mode, the distance of the input vector to each node is determined and the vector is assigned to the class of the winning node.

1.6.5. Probabilistic neural networks

Many methods of pattern classification and feature evaluation presuppose complete knowledge of the class conditional probability density functions p(x|cj), with j = 1, 2, ..., m. In practice the actual probability structure of the classes is usually unknown and the only information available to the analyst is a set of exemplars with known class memberships. It is therefore necessary to infer the unknown probability density functions from the data. One way to accomplish this is to make use of kernels. For example, given a particular exemplar xi for a class cj, we can assert that p(x|cj) assumes a non-zero value at the point xi. Moreover, assuming p(x|cj) to be continuous, it can be inferred that p(x|cj) will assume non-zero values in the immediate vicinity of observation xi. The information about p(x|cj) gained by observing xi can be represented by a function K(x,xi) centred at xi. This function (known as a kernel function) attains a maximum value at xi and decreases monotonically as the distance from xi increases, as indicated in Figure 1.10(a).
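Before turning to kernel estimation, the update rules (1.56)–(1.57) can be summarized in a brief training-loop sketch. This is an illustrative LVQ1-style outline, not the book's implementation; the learning-rate symbol η, its decay schedule and the function signature are assumptions.

    import numpy as np

    def train_lvq(X, y, prototypes, proto_labels, eta=0.1, epochs=20):
        W = prototypes.astype(float).copy()
        for epoch in range(epochs):
            for x, cls in zip(X, y):
                d = np.linalg.norm(W - x, axis=1)
                p = int(np.argmin(d))                # winning node, equation (1.55)
                if proto_labels[p] == cls:
                    W[p] += eta * (x - W[p])         # attract, equation (1.56)
                else:
                    W[p] -= eta * (x - W[p])         # repulse, equation (1.57)
            eta *= 0.9                               # reduce the learning coefficient with time
        return W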
Figure 1.10. (a) A one-dimensional Gaussian kernel function, and (b) estimation of the probability density function pest(x|cj) of class cj by summing the contributions of the kernels centred on the exemplars x1, x2, ..., x6 for the class.

With a set of n exemplars for a given class cj, p(x|cj) can be estimated by calculating the average of all the contributions of all the exemplars, i.e.

pest(x|cj) = (1/n) Σ(i=1..n) K(x, xi)   (1.58)

As indicated in Figure 1.10(b), exemplars close to each other give rise to larger values of pest(x|cj) than exemplars situated further apart. Clearly, the contributions of the kernels to pest(x|cj) also depend on their range of influence. If this is very small, the estimate of the probability density function will be spiky, while too large a range of influence will miss local variations in p(x|cj). Intuition dictates that as the number of exemplars increases, the influence of K(x,xi) should decrease progressively, and conversely, if few exemplars are available, the influence of K(x,xi) should be large to smooth out sampling effects. A kernel function reflecting this arrangement should have the form

K(x,xi) = ρ⁻ᵐ h[d(x,xi)/ρ]   (1.59)

where ρ is a parameter of the estimator that depends on the sample size and satisfies

lim(n→∞) ρ = 0   (1.60)

d(x,xi) is a suitable metric and h(·) is a function attaining a maximum at d(x,xi) = 0 and decreasing monotonically as d(x,xi) increases. Provided that h(·) is a non-negative function, the only constraint which h(·) has to satisfy is

∫ K(x,xi) dx = 1   (1.61)

Conditions (1.60) and (1.61) guarantee that pest(x|cj) is a density function providing an unbiased and consistent estimate of p(x|cj).
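A small Python sketch of the Parzen estimate (1.58), using one-dimensional Gaussian kernels in the spirit of Figure 1.10, may help here. The bandwidth ρ (written rho), the exemplar values and the function name are illustrative assumptions.

    import numpy as np

    def parzen_density(x, exemplars, rho=0.5):
        """Estimate p_est(x|c_j) from the n exemplars of one class (1-D case)."""
        n = len(exemplars)
        u = (x - exemplars) / rho
        kernels = np.exp(-0.5 * u ** 2) / (rho * np.sqrt(2 * np.pi))   # Gaussian kernels K(x, x_i)
        return kernels.sum() / n          # average of the kernel contributions, equation (1.58)

    exemplars = np.array([1.0, 1.2, 1.4, 3.0, 5.0, 5.1])   # cf. Figure 1.10(b)
    print(parzen_density(1.2, exemplars), parzen_density(4.0, exemplars))

The estimate is larger near the cluster of exemplars around 1.2 than in the sparsely populated region around 4.0, which mirrors the behaviour described for Figure 1.10(b).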
The most important kernels are hyperspheric kernels (Figure 1.11(a) and equation 1.62), hypercubic kernels (Figure 1.11(b) and equation 1.63) and Gaussian kernels (Figure 1.11(c) and equation 1.64).

K(x,xi) = 1/v, if dE(x,xi) < ρ
        = 0,   if dE(x,xi) > ρ   (1.62)

where dE(x,xi) = [(x − xi)ᵀ(x − xi)]^(1/2) is the Euclidean distance metric and v is the volume of a hypersphere with radius ρ.

K(x,xi) = (2ρ)⁻ᵐ, if dT(x,xi) < ρ
        = 0,      if dT(x,xi) > ρ   (1.63)

where dT(x,xi) = maxj |(x − xi)j| is the Chebyshev distance metric. Unlike the Gaussian estimator, hypercubic kernels are easy to calculate.

K(x,xi) = [ρ^(2m)(2π)^m |Q|]^(−1/2) exp[−dQ(x,xi)/(2ρ²)]   (1.64)

where dQ(x,xi) = (x − xi)ᵀQ(x − xi) is a quadratic distance and Q is a positive definite scaling matrix, typically the sampling covariance matrix Sj of class cj.

Figure 1.11. (a) A hyperspheric, (b) a hypercubic and (c) a Gaussian kernel.

Probabilistic neural networks (see Figure 1.12) are based on the use of Bayesian classification methods and as such provide a powerful general framework for pattern classification problems. These networks use exemplars to develop distribution functions as outlined above, which in turn are used to estimate the likelihood of a feature vector belonging to a particular category. These estimates can be modified by the incorporation of a priori information.
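The three kernels of equations (1.62)–(1.64) can be sketched in Python as below. This is only an illustration of the reconstructed formulas: the unit-hypersphere volume expression, the default identity scaling matrix Q and the function names are assumptions.

    import numpy as np
    from math import gamma, pi

    def hyperspheric_kernel(x, xi, rho):
        m = len(x)
        v = (pi ** (m / 2) / gamma(m / 2 + 1)) * rho ** m       # volume of the hypersphere of radius rho
        return 1.0 / v if np.linalg.norm(x - xi) < rho else 0.0  # equation (1.62)

    def hypercubic_kernel(x, xi, rho):
        m = len(x)
        return (2 * rho) ** -m if np.max(np.abs(x - xi)) < rho else 0.0   # equation (1.63)

    def gaussian_kernel(x, xi, rho, Q=None):
        m = len(x)
        Q = np.eye(m) if Q is None else Q
        dQ = (x - xi) @ Q @ (x - xi)                             # quadratic distance
        norm = (rho ** (2 * m) * (2 * pi) ** m * np.linalg.det(Q)) ** -0.5
        return norm * np.exp(-dQ / (2 * rho ** 2))               # equation (1.64)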
Suppose a classification problem consists of p different classes c1, c2, ..., cp, and that the data on which the classification process is based can be represented by a feature vector with m dimensions, x = [x1, x2, ..., xm]ᵀ. If F(x) = [F1(x), F2(x), ..., Fp(x)] is the set of probability density functions of the class populations and A = [a1, a2, ..., ap] is the set of a priori probabilities that a feature vector belongs to a particular class, then the Bayes classifier compares the p values a1·F1(x), a2·F2(x), ..., ap·Fp(x) and determines the class with the highest value.

Figure 1.12. Structure of a probabilistic neural network, consisting of an input layer (x1, x2, ..., xm), a normalization layer, a pattern layer, a summation layer (z1, z2, ..., zp, with a bias node) and an output layer.

Before this decision rule (in which the multivariate class probability density functions are evaluated, weighted and compared) can be implemented, the probability density functions have to be constructed. Parzen estimation is a non-parametric method of doing so, in which no assumption is made with regard to the nature of the distributions of these functions, that is

Fk(x) = (B/mk) Σj exp[−(x − xkj)ᵀ(x − xkj)/(2σ²)], where B = 1/((2π)^(p/2) σ^p)   (1.65)

The Parzen estimator is constructed from the n training data points available. As explained above, the exponential terms or Parzen kernels are small multivariate Gaussian curves that are added together and smoothed (B term). As shown in Figure 1.12, the neural network version of this Bayesian classifier consists of an input layer, a normalizing layer (which normalizes the feature vector x, so that xᵀx = 1), a pattern or exemplar layer, which represents the Parzen kernels, a summation layer in which the kernels are summed, and a competitive output layer. The weights associated with the nodes in the output (class) layer constitute the a priori probabilities ak (k = 1, 2, ..., p) of the occurrence of the classes, and usually assume equal values unless specified otherwise.

Probabilistic neural networks are also useful for pattern recognition and classification problems, especially where the probabilities of some events or classes are known in advance, since these can be incorporated directly into the network.
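The decision rule described above, with Parzen class densities of the form (1.65) weighted by the priors ak, could be sketched as follows. This is a hedged outline rather than a definitive PNN implementation; the exemplar arrays, the smoothing parameter σ and the function name are assumptions, and the common constant B is omitted because it cancels in the comparison.

    import numpy as np

    def pnn_classify(x, class_exemplars, priors, sigma=0.3):
        """class_exemplars: list of (m_k, m) arrays, one array of exemplars per class."""
        scores = []
        for a_k, Xk in zip(priors, class_exemplars):
            d2 = ((Xk - x) ** 2).sum(axis=1)             # pattern layer: squared distances to exemplars
            Fk = np.exp(-d2 / (2 * sigma ** 2)).mean()   # summation layer: Parzen estimate, eq. (1.65) up to B
            scores.append(a_k * Fk)                      # weight by the a priori probability a_k
        return int(np.argmax(scores))                    # competitive output layer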
1.6.6. Radial basis function neural networks

It can be shown that in solving problems concerning nonlinearly separable patterns, there is practical benefit to be gained in mapping the input space into a new space of sufficiently high dimension. This nonlinear mapping in effect turns a nonlinearly separable problem into a linearly separable one. The idea is illustrated in Figure 1.13, where two interlocked two-dimensional patterns are easily separated by mapping them to three dimensions, where they can be separated by a flat plane. In the same way it is possible to turn a difficult nonlinear approximation problem into an easier linear approximation problem.

Figure 1.13. Linear separation of two nonlinearly separable classes (class 1 and class 2), after mapping from two dimensions (x1, x2) to a higher dimension (x1, x2, x3).
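A toy illustration of this idea, in the spirit of Figure 1.13 but not the book's own example, is given below: two classes that no straight line can separate in two dimensions become separable by a flat plane once a third, nonlinear feature is added. The lifting x3 = x1² + x2² and the ring-shaped data are assumptions chosen for simplicity.

    import numpy as np

    rng = np.random.default_rng(0)
    angles = rng.uniform(0, 2 * np.pi, 200)
    inner = np.c_[0.5 * np.cos(angles), 0.5 * np.sin(angles)]   # class 1: small ring
    outer = np.c_[2.0 * np.cos(angles), 2.0 * np.sin(angles)]   # class 2: large ring

    def lift(X):
        return np.c_[X, (X ** 2).sum(axis=1)]    # add the nonlinear feature x3 = x1^2 + x2^2

    # In the lifted space the flat plane x3 = 1 separates the two classes perfectly.
    print((lift(inner)[:, 2] < 1).all(), (lift(outer)[:, 2] > 1).all())   # True True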
Consider therefore, without loss of generality, a feedforward neural network with an input layer with p input nodes, a single hidden layer and an output layer with one node. This network is designed to perform a nonlinear mapping from the input space to the hidden space, and a linear mapping from the hidden space to the output space.

Figure 1.14. Structure of a radial basis function neural network, with inputs x1, x2, ..., xm, a radial basis function hidden layer and a linear output node y.

Overall the network represents a mapping from the m-dimensional input space to the one-dimensional output space, s: ℝᵐ → ℝ¹, and the map s can be thought of as a hypersurface Γ ⊂ ℝᵐ⁺¹, in the same way as we think of the elementary map s: ℝ¹ → ℝ¹, where s(x) = x², as a parabola drawn in ℝ²-space. The surface Γ is a multidimensional plot of the output as a function of the input. In practice the surface Γ is unknown, but is exemplified by a set of training data (input-output pairs).

As a consequence, training constitutes a fitting procedure for the hypersurface Γ, based on the input-output examples presented to the neural network. This is followed by a generalization phase, which is equivalent to multivariable interpolation between the data points, with interpolation performed along the estimated constrained hypersurface (Powell, 1985).

In a strict sense the interpolation problem can be formulated as follows. Given a set of n different observations on m variables {xi ∈ ℝᵐ | i = 1, 2, ..., n} and a corresponding set of n real numbers {zi ∈ ℝ¹ | i = 1, 2, ..., n}, find a function F: ℝᵐ → ℝ¹ that complies with the interpolation condition F(xi) = zi, for i = 1, 2, ..., n. Note that in the strict sense specified, the interpolation surface is forced to pass through all the training data points. Techniques based on radial basis functions are based on the selection of a function F of the following form

F(x) = Σ(i=1..n) wi φ(||x − xi||)   (1.66)

where {φ(||x − xi||) | i = 1, 2, ..., n} is a set of n arbitrary functions, known as radial basis functions, and ||·|| denotes a norm that is usually Euclidean. The known data points xi typically form the centres of the radial basis functions. Examples of such functions are multiquadrics, φ(r) = (r² + c²)^(1/2), inverse multiquadrics, φ(r) = (r² + c²)^(−1/2), Gaussian functions, φ(r) = exp{−r²/(2σ²)}, and thin-plate splines, φ(r) = (r/σ)² log(r/σ), where c and σ are positive constants and r ∈ ℝ.

By use of the interpolation condition F(xi) = zi and equation (1.66), a set of simultaneous linear equations for the unknown coefficients or weights (wi) of the expansion can be obtained:

| φ11 φ12 ... φ1n | | w1 |   | z1 |
| φ21 φ22 ... φ2n | | w2 | = | z2 |
| ...             | | .. |   | .. |
| φn1 φn2 ... φnn | | wn |   | zn |

where φij = φ(||xi − xj||), for i, j = 1, 2, ..., n. Moreover, the n × 1 vectors w = [w1, w2, ..., wn]ᵀ and z = [z1, z2, ..., zn]ᵀ represent the linear weight vector and the target or desired response vector respectively. With Φ = {φij | i, j = 1, 2, ..., n} the n × n interpolation matrix, Φw = z represents a more compact form of the set of simultaneous linear equations.

For a certain class of radial basis functions, such as inverse multiquadrics (equation 1.67) and Gaussian functions (equation 1.68), the n × n matrix Φ is positive definite.

φ(r) = (r² + c²)^(−1/2)   (1.67)

for c > 0 and r ≥ 0, and

φ(r) = exp{−r²/(2σ²)}   (1.68)

for σ > 0 and r ≥ 0.
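A minimal sketch of strict interpolation with Gaussian basis functions (1.68) is given below: the matrix Φ is assembled and the linear system Φw = z is solved for the weights of equation (1.66). The data, the width σ and the function names are illustrative assumptions.

    import numpy as np

    def rbf_interpolate(X, z, sigma=1.0):
        """X: (n, m) data points, used as centres; z: (n,) targets.  Returns the weights w."""
        d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # ||x_i - x_j||
        Phi = np.exp(-d ** 2 / (2 * sigma ** 2))                    # interpolation matrix
        return np.linalg.solve(Phi, z)                              # w from Phi w = z

    def rbf_predict(x, X, w, sigma=1.0):
        phi = np.exp(-np.linalg.norm(X - x, axis=1) ** 2 / (2 * sigma ** 2))
        return phi @ w                                              # F(x), equation (1.66)

    X = np.array([[0.0], [1.0], [2.0], [3.0]])
    z = np.sin(X[:, 0])
    w = rbf_interpolate(X, z)
    print(rbf_predict(np.array([1.5]), X, w))   # interpolated value near sin(1.5)

Because the number of basis functions equals the number of data points, the fitted surface passes through every training point, which is exactly the strict interpolation setting described above.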
If all the data points are distinct and the matrix Φ is positive definite, then the weight vector can be obtained from w = Φ⁻¹z. If the matrix is arbitrarily close to singular, perturbation of the matrix can help to solve for w. These radial basis functions are used for interpolation, where the number of basis functions is equal to the number of data points.

Although the theory of radial basis function neural networks is intimately linked with that of radial basis functions themselves (a main field of study in numerical analysis), there are some differences. For example, with radial basis function neural networks, the number of basis functions need not be equal to the number of data points and is typically much less. Moreover, the centres of the radial basis functions need not coincide with the data themselves, and the widths of the basis functions also do not need to be the same. The determination of suitable centres and widths for the basis functions is usually part of the training process of the network. Finally, bias values are typically included in the linear sum associated with the output layer to compensate for the difference between the average value of the targets and the average value of the basis functions over the data set (Bishop, 1995).

In its most basic form, the construction of a radial basis function neural network involves three different types of layers. These networks typically consist of input layers, hidden (pattern) layers, as well as output layers, as shown in Figure 1.14. The input nodes (one for each input variable) merely distribute the input values to the hidden nodes (one for each exemplar in the training set) and are not weighted. In the case of multivariate Gaussian functions⁴, the hidden node activation functions can be described by

zi,j(xj, αi, βi) = exp(−||αi − xj||²/βi²)   (1.69)

where xj = {x1, x2, ..., xm}j is the j'th input vector of dimension m presented to the network and zi,j(xj, αi, βi) is the activation of the i'th node in the hidden layer in response to the j'th input vector xj. m+1 parameters are associated with each node, viz. αi = {α1, α2, ..., αm}i, as well as βi, a distance scaling parameter which determines the distance in the input space over which the node will have a significant influence. The parameters αi and βi function in much the same way as the mean and standard deviation in a normal distribution. The closer the input vector is to the pattern of a hidden unit (i.e. the smaller the distance between these vectors), the stronger the activity of the unit. The hidden layer can thus be considered to be a density function for the input space and can be used to derive a measure of the probability that a new input vector is part of the same distribution as the training vectors.

⁴ The use of Gaussian radial basis functions is particularly attractive in neural networks, since these are the only functions that are factorizable, and can thus be constructed from 1- and 2-dimensional radial basis functions.

Note that the training of the hidden units is unsupervised, i.e. the pattern layer representation is constructed solely by self-organisation. Whereas the αi vectors are typically found by vector quantization, the βi parameters are usually determined in an ad hoc manner, such as the mean distance to the first k nearest αi centres. Once the self-organizing phase of training is complete, the output layer can be trained using standard least mean square error techniques. Each hidden unit of a radial basis function network can be seen as having its own receptive field, which is used to cover the input space.
The output weights leading from the hidden units
to the output nodes subsequently allow a smooth fit to the desired function. Radial basis function neural networks can be used for classification, pattern recognition and process modelling, and can model local data more accurately than multilayer perceptron neural networks. They perform less well as far as representation of the global properties of the data is concerned.

The classical approach to training of radial basis function neural networks consists of unsupervised training of the hidden layer, followed by supervised training of the output layer, which can be summarized as follows (a minimal sketch of these three stages is given after the summary below).

i) Estimation of cluster centres in the hidden layer
• Start with a random set of cluster centres c = {c1, c2, ..., ck}.
• Read the r'th input vector xr.
• Modify the closest cluster centre (the learning coefficient η is usually reduced with time):

ck_new = ck_old + η(xr − ck_old)   (1.70)

• Terminate after a fixed number of iterations, or when η = 0.

ii) Estimation of the width of the activation functions
The width of the transfer functions of each of the Gaussian kernels or receptive fields is based on a P nearest neighbour heuristic,

σk = {(1/P) Σp ||ck − ckp||²}^(1/2)   (1.71)

where ckp represents the p'th nearest neighbour of the k'th cluster ck.

iii) Training of the output layer
The output layer is trained by minimization of a least squares criterion and is equivalent to parameter estimation in linear regression, i.e. it does not involve a lengthy process, since there is only one linear (output) layer.

In summary, when compared with multilayer perceptrons:
• Radial basis function neural networks have single hidden layers, whereas multilayer perceptrons can have more than one hidden layer. It can be shown that radial basis function neural networks require only one hidden layer to fit an arbitrary function (as opposed to the maximum of two required by multilayer perceptrons). This means that training is considerably faster in radial basis function networks.
• In contrast with radial basis function neural networks, a common neuron model can be used for all the nodes in a multilayer perceptron. In radial basis function networks the hidden layer neurons differ markedly from those in multilayer perceptrons.
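The sketch referred to above follows here: a rough Python outline of the three training stages, with competitive centre updates as in (1.70), widths from the P nearest neighbouring centres as in (1.71), and least-squares estimation of the output weights. The data, the learning-rate schedule, the values of k and P, and the function name are all illustrative assumptions rather than the book's implementation.

    import numpy as np

    def train_rbf(X, z, k=5, P=2, eta=0.2, iterations=200, rng=np.random.default_rng(0)):
        n, m = X.shape
        centres = X[rng.choice(n, k, replace=False)].astype(float)

        # i) competitive estimation of the cluster centres, equation (1.70)
        for it in range(iterations):
            xr = X[rng.integers(n)]
            j = np.argmin(np.linalg.norm(centres - xr, axis=1))
            centres[j] += eta * (xr - centres[j])
            eta *= 0.99                               # reduce the learning coefficient with time

        # ii) widths from the P nearest neighbouring centres, equation (1.71)
        d = np.linalg.norm(centres[:, None, :] - centres[None, :, :], axis=2)
        widths = np.array([np.sqrt((np.sort(row)[1:P + 1] ** 2).mean()) for row in d])

        # iii) least-squares training of the linear output layer
        H = np.exp(-np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2) ** 2
                   / widths ** 2)                     # hidden activations, equation (1.69)
        w, *_ = np.linalg.lstsq(H, z, rcond=None)
        return centres, widths, w

    X = np.random.default_rng(1).uniform(0, 3, (40, 1))
    z = np.sin(X[:, 0])
    centres, widths, w = train_rbf(X, z)

Only the last stage involves supervised learning, and since it amounts to a single linear least-squares problem it is fast, which underlies the training-speed comparison with multilayer perceptrons made in the summary above.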