Optimising the Widths of Radial Basis Functions

Mark Orr
mark@cns.ed.ac.uk
Centre for Cognitive Science, Edinburgh University
2, Buccleuch Street, Edinburgh EH8 9LW, Scotland, UK

Abstract

In the context of regression analysis with penalised linear models (such as RBF networks) certain model selection criteria can be differentiated to yield a re-estimation formula for the regularisation parameter, such that an initial guess can be iteratively improved until a local minimum of the criterion is reached. In this paper we discuss some enhancements of this general approach, including improved computational efficiency, detection of the global minimum and simultaneous optimisation of the basis function widths. The benefits of these improvements are demonstrated on a practical problem.

1 Introduction

Consider a radial basis function (RBF) network with centres at {c_j}, weights {w_j} and radial functions

\[ h_j(\mathbf{x}) = \exp\!\left( -\frac{(\mathbf{x} - \mathbf{c}_j)^\top (\mathbf{x} - \mathbf{c}_j)}{r^2} \right) \]

(j = 1, ..., m), all having the same width r. The centres are fixed but the weights and width are adaptable. The response of the network to an input x is

\[ f(\mathbf{x}) = \sum_{j=1}^{m} w_j\, h_j(\mathbf{x}). \]

Suppose that this network is trained on a regression data set {x_i, y_i} (i = 1, ..., p) by minimising the penalised sum-squared-error cost function

\[ C(\mathbf{w}) = \mathbf{e}^\top \mathbf{e} + \lambda\, \mathbf{w}^\top \mathbf{w}, \]

where w is the m-dimensional weight vector and e is the p-dimensional error vector, e_i = y_i - f(x_i). The second term penalises large weights and is designed to avoid overfit should the unregularised model be too complex for the data. The size of the penalty is controlled by λ, the regularisation parameter, which, like w and r, is free to adapt to the training set. Given λ, the value of the weight vector which minimises the cost is

\[ \hat{\mathbf{w}} = \mathbf{A}^{-1} \mathbf{H}^\top \mathbf{y}, \]

where H, with H_{ij} = h_j(x_i), is the design matrix and contains the responses of the m centres to the p inputs of the training set, A = H^T H + λ I_m, and I_m is the m-dimensional identity matrix.

The adjustable parameters in the model are the m weights w_j, the basis function width r and the regularisation parameter λ. The fixed parameters are the centre positions c_j and their number m. Below we will assume that the inputs of the training set are used as the fixed centres, in which case m = p and c_j = x_j, but our results apply equally to other choices of fixed centres.
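As a concrete illustration of the model and the regularised solution just described, here is a minimal sketch in Python (ours, not the paper's code; the names design_matrix, fit_weights, X, y and lam are all illustrative choices):

```python
# Gaussian basis functions of a single width r, centres fixed at the
# training inputs, and the regularised weights w = A^{-1} H' y.
import numpy as np

def design_matrix(X, centres, r):
    """H[i, j] = h_j(x_i) = exp(-||x_i - c_j||^2 / r^2)."""
    sq_dists = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / r ** 2)

def fit_weights(H, y, lam):
    """Minimise e'e + lam w'w:  w = (H'H + lam I_m)^{-1} H' y."""
    A = H.T @ H + lam * np.eye(H.shape[1])
    return np.linalg.solve(A, H.T @ y)   # solve rather than invert A

# Usage on the toy problem of section 2 (p = m = 60, r = 0.2):
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(60, 1))
y = 0.8 * np.sin(6 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(60)
H = design_matrix(X, X, r=0.2)           # centres coincide with inputs
w_hat = fit_weights(H, y, lam=1e-2)      # lam chosen arbitrarily here
```

Each new value of λ requires another O(m^3) linear solve here, which is exactly the cost that section 2 removes.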
Various model selection criteria, such as generalised cross-validation (GCV) [2] or the marginal likelihood of the data (the "evidence") [3], can be differentiated and set equal to zero to yield a re-estimation formula for the regularisation parameter. For example, in a previous paper [5] we derived the following formula from GCV:

\[ \lambda = \frac{\eta\, \hat{\mathbf{e}}^\top \hat{\mathbf{e}}}{(p - \gamma)\, \hat{\mathbf{w}}^\top \mathbf{A}^{-1} \hat{\mathbf{w}}} \tag{1} \]

where ê = y - Hŵ, γ = m - λ tr(A^{-1}) (the effective number of parameters [4]) and η = tr(A^{-1} - λA^{-2}).
However, there are problems with simply trying to iterate equation (1) to convergence. Firstly, depending on the initial guess, a non-optimal local minimum may be found and, secondly, inversion of A is liable to become numerically unstable if λ gravitates towards very small values. Furthermore, the value of the width parameter r remains fixed. In the next section we describe a computationally efficient and numerically stable method of iterating (1) which is fast enough that the global minimum and the optimal value of r can be found by explicit search.
2 Efficient Computation

Continually recomputing the inverse of A each time the value of λ changes requires of order m^3 floating point operations per iteration and is vulnerable to numerical instability if λ becomes very small. A more efficient and stable method involves initially computing the eigenvalues {λ_i} and eigenvectors {u_i} of HH^T, and {z_i}, the projections of the data onto the eigenvectors (z_i = y^T u_i). Thereafter, the four terms appearing in (1) can be computed efficiently (with cost only linear in p) by

\[ \hat{\mathbf{e}}^\top \hat{\mathbf{e}} = \sum_{i=1}^{p} \frac{\lambda^2 z_i^2}{(\lambda_i + \lambda)^2} \tag{2} \]

\[ \hat{\mathbf{w}}^\top \mathbf{A}^{-1} \hat{\mathbf{w}} = \sum_{i=1}^{p} \frac{\lambda_i z_i^2}{(\lambda_i + \lambda)^3} \tag{3} \]

\[ \eta = \sum_{i=1}^{p} \frac{\lambda_i}{(\lambda_i + \lambda)^2} \tag{4} \]

\[ p - \gamma = \sum_{i=1}^{p} \frac{\lambda}{\lambda_i + \lambda} \tag{5} \]

Note that if p > m then the last p - m eigenvalues (assuming they are ordered from largest to smallest) are zero. However, as remarked earlier, if we have one centre for each training set input then p = m, and the cost of calculating the eigenvalues and eigenvectors of the p × p matrix HH^T is of the same order as inverting the m × m matrix A. Therefore, unless (1) converges almost immediately, it is much more efficient to calculate the eigensystem once and then use (2-5) than to invert A on each iteration.
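In code, equations (2)-(5) amount to a handful of vectorised sums once the eigensystem is in hand. A minimal sketch (illustrative names; the eigenvalues, λ_i in the text, are called mu here to avoid clashing with the regularisation parameter lam):

```python
import numpy as np

def eigensystem(H, y):
    """Eigenvalues mu_i of H H' and projections z_i = y' u_i (computed once)."""
    mu, U = np.linalg.eigh(H @ H.T)   # H H' is symmetric, so eigh applies
    return mu, U.T @ y

def gcv_terms(mu, z, lam):
    """The four terms of (1) for a given lambda, via (2)-(5); cost O(p)."""
    d = mu + lam
    ee      = np.sum(lam ** 2 * z ** 2 / d ** 2)   # (2)  e'e
    wAw     = np.sum(mu * z ** 2 / d ** 3)         # (3)  w'A^{-1}w
    eta     = np.sum(mu / d ** 2)                  # (4)  eta
    p_gamma = np.sum(lam / d)                      # (5)  p - gamma
    return ee, wAw, eta, p_gamma
```

One re-estimation step of (1) is then just eta * ee / (p_gamma * wAw).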
Once the eigensystem has been established, GCV,

\[ \mathrm{GCV} = \frac{p\, \hat{\mathbf{e}}^\top \hat{\mathbf{e}}}{(p - \gamma)^2}, \]

can also be cheaply calculated for any given λ using (2,5). Thus it is feasible to evaluate GCV for a number of trial values of λ, searching for local minima, refine those that are found by iterating (1) to convergence (using (2-5), of course) and finally select from the local minima the one with the smallest GCV. Assuming a wide and dense enough range of trial values is employed, this procedure will find the global minimum.

We now demonstrate this method on a toy problem consisting of p = 60 samples taken from the function y = 0.8 sin(6πx) at random points in the range 0 < x < 1 and corrupted by Gaussian noise of standard deviation 0.1. The data was modelled by an RBF network with m = 60 centres coincident with the input points and basis functions of fixed width r = 0.2. Figure 1 shows the variation of GCV with λ for one particular realisation of this problem.

[Figure 1: Local (diamonds) and global (star) minima of GCV. The plot shows log(GCV) against log(λ) over the range -16 to 2.]

An array of 50 trial values of λ, evenly spaced between log λ = -16 and log λ = 2, was used to find rough positions for the local minima, and equation (1) was then iterated for each one found. Convergence was assumed once changes in GCV from one iteration to the next had dipped below a threshold of 1 part in a million. In the example problem three local minima were detected (see figure 1) and the one with the lowest GCV corresponded to λ ≈ 2. Searching for the minima and refining the candidate solutions took up only 0.6% of the total computation time; the rest was accounted for by the calculation of eigenvalues and eigenvectors. Notice that if we had simply started with a single guess for λ and iterated equation (1) to find the solution, any initial guess below about 10^{-4} would have led to a sub-optimal solution.

Occasionally the value of λ re-estimated from equation (1) bounces back and forward between two values on each side of a local minimum, and then either takes a long time to pass through this bistable state before finally converging or does not converge at all. To solve this problem we devised the following heuristic. Suppose the sequence of re-estimated values is λ_1, ..., λ_{k-2}, λ_{k-1}, λ_k, with λ_k being the current value. Then if

\[ |\lambda_k - \lambda_{k-1}| > |\lambda_k - \lambda_{k-2}|, \]

replace λ_k by the geometric mean of λ_{k-1} and λ_{k-2} before proceeding to the next iteration.
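Putting the pieces together, the whole λ optimisation (coarse grid, refinement by iterating (1), geometric-mean damping) might look as follows. This is our illustrative sketch, not the paper's code; in particular it assumes base-10 logarithms for the trial grid, which the paper does not state.

```python
import numpy as np

def optimise_lambda(mu, z, p, log_grid=np.linspace(-16, 2, 50), tol=1e-6):
    """Global minimum of GCV over lambda, given the eigensystem (mu, z)."""
    def terms(lam):                     # equations (2)-(5), as sketched above
        d = mu + lam
        return (np.sum(lam**2 * z**2 / d**2), np.sum(mu * z**2 / d**3),
                np.sum(mu / d**2), np.sum(lam / d))

    def gcv(lam):
        ee, _, _, p_gamma = terms(lam)
        return p * ee / p_gamma**2

    grid = 10.0 ** log_grid             # trial values (base 10 assumed)
    g = np.array([gcv(l) for l in grid])
    # interior grid points that beat both neighbours = rough local minima
    seeds = grid[1:-1][(g[1:-1] < g[:-2]) & (g[1:-1] < g[2:])]

    best_lam, best_gcv = grid[np.argmin(g)], g.min()   # fallback
    for lam0 in seeds:
        hist = [lam0]
        for _ in range(200):            # iterate equation (1)
            ee, wAw, eta, p_gamma = terms(hist[-1])
            hist.append(eta * ee / (p_gamma * wAw))
            if len(hist) >= 3 and (abs(hist[-1] - hist[-2])
                                   > abs(hist[-1] - hist[-3])):
                # oscillation: geometric mean of the previous two values
                hist[-1] = np.sqrt(hist[-2] * hist[-3])
            if abs(gcv(hist[-1]) - gcv(hist[-2])) <= tol * gcv(hist[-2]):
                break                   # "1 part in a million" threshold
        if gcv(hist[-1]) < best_gcv:
            best_lam, best_gcv = hist[-1], gcv(hist[-1])
    return best_lam, best_gcv
```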
3 Optimising the Width

When GCV is differentiated with respect to the regularisation parameter and set equal to zero, the resulting equation can be manipulated so that λ alone appears on the left hand side, enabling the equation to be used as a re-estimation formula. Unfortunately the same trick does not work with r because, after setting the derivative of GCV with respect to r to zero, the terms explicitly involving r cancel, so r cannot be isolated and a re-estimation formula is impossible. The same applies to other model selection criteria such as maximum likelihood of the data [6].

When there is only one width parameter, as we assume here, it is feasible to tackle the problem of choosing an optimal value by experimenting with a number of trial values and selecting the one most favoured by the model selection criterion. The range of trial values used will be problem specific and could be determined by the likely maximum and minimum scales involved in the particular problem. The number of trial values between these limits will depend on the size of the problem (p) and the available computing resources, since for each trial value an eigensystem computation (with cost proportional to p^3) will be necessary.
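A sketch of this outer search over trial widths, reusing the helper functions from the earlier sketches (again illustrative, not the paper's code):

```python
import numpy as np

# Assumes design_matrix, eigensystem and optimise_lambda from the
# earlier sketches are in scope.
def optimise_width(X, y, trial_widths):
    """One O(p^3) eigensystem per trial r; cheap lambda search inside."""
    best_gcv, best_r, best_lam = np.inf, None, None
    for r in trial_widths:
        H = design_matrix(X, X, r)        # centres = training inputs
        mu, z = eigensystem(H, y)
        lam, g = optimise_lambda(mu, z, p=len(y))
        if g < best_gcv:
            best_gcv, best_r, best_lam = g, r, lam
    return best_r, best_lam, best_gcv

# e.g. the 50 trial widths between 0.1 and 1.0 used for the toy problem:
# r_opt, lam_opt, gcv_min = optimise_width(X, y, np.linspace(0.1, 1.0, 50))
```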
[Figure 2: Tracking the global minimum with respect to λ as r changes. The plot shows GCV against r for r between 0 and 1.]

Figure 2 illustrates this, using the toy problem described earlier. It shows the value of GCV at the global minimum over λ for 50 trial values of r between 0.1 and 1.0. The value of r = 0.2, which we used earlier, appears to have been a little on the small side. The optimal value is close to 0.45.

As r changes, the location (λ) and height (GCV) of the local minima (see figure 1) change smoothly. Usually this means the location of the global minimum also changes smoothly with r, but there are particular values of the width where the identity of the local minimum with the smallest GCV switches, causing an abrupt change in location (but not height) of the global minimum. This explains the discontinuous changes of slope in the curve of figure 2. Local minima can also be created or destroyed as r changes, so discontinuous changes in value are also possible.

Of course, the ultimate arbiter of generalisation performance is not the value of a model selection criterion (such as GCV) on a particular realisation of the problem but the error on an independent test set averaged over multiple realisations. We perform such a test in the next section.

4 Results

For a thorough test of the method we turn to a more realistic problem stemming from Friedman's MARS paper [1] and later used to compare RBFs and MARS [5]. The problem involves the prediction of impedance Z and phase φ from the four parameters (resistance, frequency, inductance and capacitance) of an electrical circuit. Training sets of three different sizes (100, 200, 400) and with a signal-to-noise ratio of about 3:1 were replicated 100 times each. The input components were normalised to have unit variance and zero mean for each replication. The learning method, as described above, was applied using a set of 10 trial values of r between 1 and 10. Generalisation performance was estimated by the scaled sum of squared errors over two independent test sets (one for Z and one for φ) of size 5000 and uncorrupted by noise. This is the same experimental set up as in the previous papers [1, 5] from which further details can be obtained.
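For orientation (background we are adding here, not detail from the paper): the two outputs in Friedman's circuit example are the impedance and phase of a series RLC circuit, which in standard form are

```latex
% Standard series-circuit formulas; see [1] for the exact experimental setup.
Z(R,\omega,L,C) = \sqrt{R^2 + \left(\omega L - \frac{1}{\omega C}\right)^{2}},
\qquad
\varphi(R,\omega,L,C) = \arctan\!\left(\frac{\omega L - 1/(\omega C)}{R}\right),
```

where R is the resistance, ω the angular frequency, L the inductance and C the capacitance.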
              Z             φ
      p    NEW   OLD    NEW   OLD
     100   0.34  0.45   0.27  0.26
     200   0.19  0.26   0.18  0.20
     400   0.14  0.14   0.13  0.16

Table 1: Average generalisation errors for the new method, which optimises the width r, and an older method which does not.

Table 1 summarises the results. The left hand column gives the training set size. Two sets of results, one for Z and one for φ, are given. The figures quoted are the average (over 100 replications) of the scaled sum of squared prediction errors.
Apart from the method described above, which involves optimisation of r, the average errors of an older RBF algorithm, regularised forward selection (RFS), are also quoted (taken from [5]). The main differences to the method described here are that RFS uses a fixed value of r and creates a parsimonious network. The latter has a relatively small effect on generalisation performance.

RFS is clearly inferior to the new method for the Z problem and marginally worse for φ. We think the optimisation of r for each training set explains the superior performance of the new method, and the lack of such optimisation is a partial explanation for the poor performance of RFS compared to MARS [5]. The fixed value of r used for RFS was 3.5, but the average optimal values determined by the new method were 8.7 for Z and 2.8 for φ. Thus it looks as if the fixed value used for RFS was an underestimate in the case of Z (where the new algorithm considerably improved the results) but about right for φ (where the new method made less of an impact).

[Figure 3: Z as a function of L and C (surface plot, with L and C each running from -2 to 2).]

Note that while r = 8.7 may sound rather large, especially in view of the normalised input components, such large basis function widths do not necessarily imply a lack of structure in the fitted function, as might be assumed. Figure 3 plots Z (impedance) against C (capacitance) and L (inductance) for fixed values of the other two components (resistance and frequency). This function was fitted to one of the p = 200 training sets for which the algorithm had found an optimal basis function width of r = 10. The function still exhibits considerable structure over the ranges of L and C even though they are less than half the size of r.

5 Conclusions

We have described a new computational method for re-estimating the regularisation parameter of an RBF network based on generalised cross-validation (GCV). It utilises an eigensystem related to the design matrix of the regression problem and is more efficient and more stable than methods which involve a direct matrix inverse at each iteration. We have extended the algorithm to optimise the basis function width simply by testing a number of trial values and selecting the one associated with the smallest value of GCV.

We tested the method on a practical problem involving 4 input dimensions and a few hundred training examples. Our method, which can adapt the width of the basis functions but not their number, was found to have better prediction performance than a similar RBF network which can adapt the number of functions but is stuck with the same fixed width.

The new method, with its head-on approaches to finding the global minimum with respect to the regularisation parameter and to optimising the basis function width, does not scale up well for multiple regularisation parameters or multiple widths. Additionally, there is a limit on how many training examples and basis functions can be handled due to the computational cost of calculating the eigensystem. It is best suited to problems involving a single regularisation parameter, a single basis function width and about 1000 (or fewer) training set examples.

References

[1] J. Friedman. Multivariate adaptive regression splines (with discussion). Annals of Statistics, 19:1–141, 1991.
[2] G. Golub, M. Heath, and G. Wahba. Generalised cross-validation as a method for choosing a good ridge parameter. Technometrics, 21(2):215–223, 1979.
[3] D. MacKay. Bayesian interpolation. Neural Computation, 4(3):415–447, 1992.
[4] J. Moody. The effective number of parameters: An analysis of generalisation and regularisation in nonlinear learning systems. In J. Moody, S. Hanson, and R. Lippmann, editors, Neural Information Processing Systems 4, pages 847–854. Morgan Kaufmann, San Mateo CA, 1992.
[5] M. Orr. Regularisation in the selection of radial basis function centres. Neural Computation, 7(3):606–623, 1995.
[6] M. Orr. An EM algorithm for regularised radial basis function networks. In International Conference on Neural Networks and Brain, Beijing, China, October 1998.
