Optimising the Widths of Radial Basis Functions

Mark Orr
mark@cns.ed.ac.uk
Centre for Cognitive Science, Edinburgh University
2, Buccleuch Street, Edinburgh EH8 9LW, Scotland, UK

Abstract

In the context of regression analysis with penalised linear models (such as RBF networks) certain model selection criteria can be differentiated to yield a re-estimation formula for the regularisation parameter, such that an initial guess can be iteratively improved until a local minimum of the criterion is reached. In this paper we discuss some enhancements of this general approach, including improved computational efficiency, detection of the global minimum and simultaneous optimisation of the basis function widths. The benefits of these improvements are demonstrated on a practical problem.

1 Introduction

Consider a radial basis function (RBF) network with centres at {c_j}, weights {w_j} and radial functions

\[ h_j(\mathbf{x}) = \exp\!\left( -\frac{(\mathbf{x} - \mathbf{c}_j)^\top (\mathbf{x} - \mathbf{c}_j)}{r^2} \right) \]

(j = 1, ..., m), all having the same width r. The centres are fixed but the weights and width are adaptable. The response of the network to an input x is

\[ f(\mathbf{x}) = \sum_{j=1}^{m} w_j\, h_j(\mathbf{x}). \]

Suppose that this network is trained on a regression data set {x_i, y_i} (i = 1, ..., p) by minimising the penalised sum-squared-error cost function

\[ C(\mathbf{w}) = \mathbf{e}^\top \mathbf{e} + \lambda\, \mathbf{w}^\top \mathbf{w}, \]

where w is the m-dimensional weight vector and e is the p-dimensional error vector, e_i = y_i - f(x_i). The second term penalises large weights and is designed to avoid overfit should the unregularised model be too complex for the data. The size of the penalty is controlled by λ, the regularisation parameter, which, like w and r, is free to adapt to the training set. Given λ, the value of the weight vector which minimises the cost is

\[ \hat{\mathbf{w}} = \mathbf{A}^{-1} \mathbf{H}^\top \mathbf{y}, \]

where H, with H_{ij} = h_j(x_i), is the design matrix and contains the responses of the m centres to the p inputs of the training set, A = H^T H + λ I_m, and I_m is the m-dimensional identity matrix.

The adjustable parameters in the model are the m weights w_j, the basis function width r and the regularisation parameter λ. The fixed parameters are the centre positions c_j and their number m. Below we will assume that the inputs of the training set are used as the fixed centres, in which case m = p and c_j = x_j, but our results apply equally to other choices of fixed centres.
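As a concrete illustration of the model and the regularised solution just described, here is a minimal sketch in Python (ours, not the paper's code; the names design_matrix, fit_weights, X, y and lam are all illustrative choices):

```python
# Gaussian basis functions of a single width r, centres fixed at the
# training inputs, and the regularised weights w = A^{-1} H' y.
import numpy as np

def design_matrix(X, centres, r):
    """H[i, j] = h_j(x_i) = exp(-||x_i - c_j||^2 / r^2)."""
    sq_dists = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / r ** 2)

def fit_weights(H, y, lam):
    """Minimise e'e + lam w'w:  w = (H'H + lam I_m)^{-1} H' y."""
    A = H.T @ H + lam * np.eye(H.shape[1])
    return np.linalg.solve(A, H.T @ y)   # solve rather than invert A

# Usage on the toy problem of section 2 (p = m = 60, r = 0.2):
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(60, 1))
y = 0.8 * np.sin(6 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(60)
H = design_matrix(X, X, r=0.2)           # centres coincide with inputs
w_hat = fit_weights(H, y, lam=1e-2)      # lam chosen arbitrarily here
```

Each new value of λ requires another O(m^3) linear solve here, which is exactly the cost that section 2 removes.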
Various model selection criteria, such as generalised cross-validation (GCV) [2] or the marginal likelihood of the data (the "evidence") [3], can be differentiated and set equal to zero to yield a re-estimation formula for the regularisation parameter. For example, in a previous paper [5] we derived the following formula from GCV:

\[ \lambda = \frac{\eta\, \hat{\mathbf{e}}^\top \hat{\mathbf{e}}}{(p - \gamma)\, \hat{\mathbf{w}}^\top \mathbf{A}^{-1} \hat{\mathbf{w}}} \tag{1} \]

where ê = y - Hŵ, γ = m - λ tr(A^{-1}) (the effective number of parameters [4]) and η = tr(A^{-1} - λA^{-2}).
However, there are problems with simply trying to iterate equation (1) to convergence. Firstly, depending on the initial guess, a non-optimal local minimum may be found and, secondly, inversion of A is liable to become numerically unstable if λ gravitates towards very small values. Furthermore, the value of the width parameter r remains fixed. In the next section we describe a computationally efficient and numerically stable method of iterating (1) which is fast enough that the global minimum and the optimal value of r can be found by explicit search.
2 Efficient Computation

Continually recomputing the inverse of A each time the value of λ changes requires of order m^3 floating point operations per iteration and is vulnerable to numerical instability if λ becomes very small. A more efficient and stable method involves initially computing the eigenvalues {λ_i} and eigenvectors {u_i} of HH^T, and {z_i}, the projections of the data onto the eigenvectors (z_i = y^T u_i). Thereafter, the four terms appearing in (1) can be computed efficiently (with cost only linear in p) by

\[ \hat{\mathbf{e}}^\top \hat{\mathbf{e}} = \sum_{i=1}^{p} \frac{\lambda^2 z_i^2}{(\lambda_i + \lambda)^2} \tag{2} \]

\[ \hat{\mathbf{w}}^\top \mathbf{A}^{-1} \hat{\mathbf{w}} = \sum_{i=1}^{p} \frac{\lambda_i z_i^2}{(\lambda_i + \lambda)^3} \tag{3} \]

\[ \eta = \sum_{i=1}^{p} \frac{\lambda_i}{(\lambda_i + \lambda)^2} \tag{4} \]

\[ p - \gamma = \sum_{i=1}^{p} \frac{\lambda}{\lambda_i + \lambda} \tag{5} \]

Note that if p > m then the last p - m eigenvalues (assuming they are ordered from largest to smallest) are zero. However, as remarked earlier, if we have one centre for each training set input then p = m, and the cost of calculating the eigenvalues and eigenvectors of the p × p matrix HH^T is of the same order as inverting the m × m matrix A. Therefore, unless (1) converges almost immediately, it is much more efficient to calculate the eigensystem once and then use (2-5) than to invert A on each iteration.
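In code, equations (2)-(5) amount to a handful of vectorised sums once the eigensystem is in hand. A minimal sketch (illustrative names; the eigenvalues, λ_i in the text, are called mu here to avoid clashing with the regularisation parameter lam):

```python
import numpy as np

def eigensystem(H, y):
    """Eigenvalues mu_i of H H' and projections z_i = y' u_i (computed once)."""
    mu, U = np.linalg.eigh(H @ H.T)   # H H' is symmetric, so eigh applies
    return mu, U.T @ y

def gcv_terms(mu, z, lam):
    """The four terms of (1) for a given lambda, via (2)-(5); cost O(p)."""
    d = mu + lam
    ee      = np.sum(lam ** 2 * z ** 2 / d ** 2)   # (2)  e'e
    wAw     = np.sum(mu * z ** 2 / d ** 3)         # (3)  w'A^{-1}w
    eta     = np.sum(mu / d ** 2)                  # (4)  eta
    p_gamma = np.sum(lam / d)                      # (5)  p - gamma
    return ee, wAw, eta, p_gamma
```

One re-estimation step of (1) is then just eta * ee / (p_gamma * wAw).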
Once the eigensystem has been established, GCV,

\[ \mathrm{GCV} = \frac{p\, \hat{\mathbf{e}}^\top \hat{\mathbf{e}}}{(p - \gamma)^2}, \]

can also be cheaply calculated for any given λ using (2,5). Thus it is feasible to evaluate GCV for a number of trial values of λ, searching for local minima, refine those that are found by iterating (1) to convergence (using (2-5), of course) and finally select from the local minima the one with the smallest GCV. Assuming a wide and dense enough range of trial values is employed, this procedure will find the global minimum.

We now demonstrate this method on a toy problem consisting of p = 60 samples taken from the function y = 0.8 sin(6πx) at random points in the range 0 < x < 1 and corrupted by Gaussian noise of standard deviation 0.1. The data was modelled by an RBF network with m = 60 centres coincident with the input points and basis functions of fixed width r = 0.2. Figure 1 shows the variation of GCV with λ for one particular realisation of this problem.

[Figure 1: Local (diamonds) and global (star) minima of GCV. The plot shows log(GCV) against log(λ) over the range -16 to 2.]

An array of 50 trial values of λ, evenly spaced between log λ = -16 and log λ = 2, was used to find rough positions for the local minima, and equation (1) was then iterated for each one found. Convergence was assumed once changes in GCV from one iteration to the next had dipped below a threshold of 1 part in a million. In the example problem three local minima were detected (see figure 1) and the one with the lowest GCV corresponded to λ ≈ 2. Searching for the minima and refining the candidate solutions took up only 0.6% of the total computation time; the rest was accounted for by the calculation of eigenvalues and eigenvectors. Notice that if we had simply started with a single guess for λ and iterated equation (1) to find the solution, any initial guess below about 10^{-4} would have led to a sub-optimal solution.

Occasionally the value of λ re-estimated from equation (1) bounces back and forward between two values on each side of a local minimum, and then either takes a long time to pass through this bistable state before finally converging or does not converge at all. To solve this problem we devised the following heuristic. Suppose the sequence of re-estimated values is λ_1, ..., λ_{k-2}, λ_{k-1}, λ_k, with λ_k being the current value. Then if

\[ |\lambda_k - \lambda_{k-1}| > |\lambda_k - \lambda_{k-2}|, \]

replace λ_k by the geometric mean of λ_{k-1} and λ_{k-2} before proceeding to the next iteration.
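Putting the pieces together, the whole λ optimisation (coarse grid, refinement by iterating (1), geometric-mean damping) might look as follows. This is our illustrative sketch, not the paper's code; in particular it assumes base-10 logarithms for the trial grid, which the paper does not state.

```python
import numpy as np

def optimise_lambda(mu, z, p, log_grid=np.linspace(-16, 2, 50), tol=1e-6):
    """Global minimum of GCV over lambda, given the eigensystem (mu, z)."""
    def terms(lam):                     # equations (2)-(5), as sketched above
        d = mu + lam
        return (np.sum(lam**2 * z**2 / d**2), np.sum(mu * z**2 / d**3),
                np.sum(mu / d**2), np.sum(lam / d))

    def gcv(lam):
        ee, _, _, p_gamma = terms(lam)
        return p * ee / p_gamma**2

    grid = 10.0 ** log_grid             # trial values (base 10 assumed)
    g = np.array([gcv(l) for l in grid])
    # interior grid points that beat both neighbours = rough local minima
    seeds = grid[1:-1][(g[1:-1] < g[:-2]) & (g[1:-1] < g[2:])]

    best_lam, best_gcv = grid[np.argmin(g)], g.min()   # fallback
    for lam0 in seeds:
        hist = [lam0]
        for _ in range(200):            # iterate equation (1)
            ee, wAw, eta, p_gamma = terms(hist[-1])
            hist.append(eta * ee / (p_gamma * wAw))
            if len(hist) >= 3 and (abs(hist[-1] - hist[-2])
                                   > abs(hist[-1] - hist[-3])):
                # oscillation: geometric mean of the previous two values
                hist[-1] = np.sqrt(hist[-2] * hist[-3])
            if abs(gcv(hist[-1]) - gcv(hist[-2])) <= tol * gcv(hist[-2]):
                break                   # "1 part in a million" threshold
        if gcv(hist[-1]) < best_gcv:
            best_lam, best_gcv = hist[-1], gcv(hist[-1])
    return best_lam, best_gcv
```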
3 Optimising the Width

When GCV is differentiated with respect to the regularisation parameter and set equal to zero, the resulting equation can be manipulated so that λ alone appears on the left hand side, enabling the equation to be used as a re-estimation formula. Unfortunately the same trick does not work with r because, after setting the derivative of GCV with respect to r to zero, the terms explicitly involving r cancel, so r cannot be isolated and a re-estimation formula is impossible. The same applies to other model selection criteria such as maximum likelihood of the data [6].

When there is only one width parameter, as we assume here, it is feasible to tackle the problem of choosing an optimal value by experimenting with a number of trial values and selecting the one most favoured by the model selection criterion. The range of trial values used will be problem specific and could be determined by the likely maximum and minimum scales involved in the particular problem. The number of trial values between these limits will depend on the size of the problem (p) and the available computing resources, since for each trial value an eigensystem computation (with cost proportional to p^3) will be necessary.
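A sketch of this outer search over trial widths, reusing the helper functions from the earlier sketches (again illustrative, not the paper's code):

```python
import numpy as np

# Assumes design_matrix, eigensystem and optimise_lambda from the
# earlier sketches are in scope.
def optimise_width(X, y, trial_widths):
    """One O(p^3) eigensystem per trial r; cheap lambda search inside."""
    best_gcv, best_r, best_lam = np.inf, None, None
    for r in trial_widths:
        H = design_matrix(X, X, r)        # centres = training inputs
        mu, z = eigensystem(H, y)
        lam, g = optimise_lambda(mu, z, p=len(y))
        if g < best_gcv:
            best_gcv, best_r, best_lam = g, r, lam
    return best_r, best_lam, best_gcv

# e.g. the 50 trial widths between 0.1 and 1.0 used for the toy problem:
# r_opt, lam_opt, gcv_min = optimise_width(X, y, np.linspace(0.1, 1.0, 50))
```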
[Figure 2: Tracking the global minimum with respect to λ as r changes. The plot shows GCV against r for r between 0 and 1.]

Figure 2 illustrates this, using the toy problem described earlier. It shows the value of GCV at the global minimum over λ for 50 trial values of r between 0.1 and 1.0. The value of r = 0.2, which we used earlier, appears to have been a little on the small side. The optimal value is close to 0.45.

As r changes, the location (λ) and height (GCV) of the local minima (see figure 1) change smoothly. Usually this means the location of the global minimum also changes smoothly with r, but there are particular values of the width where the identity of the local minimum with the smallest GCV switches, causing an abrupt change in location (but not height) of the global minimum. This explains the discontinuous changes of slope in the curve of figure 2. Local minima can also be created or destroyed as r changes, so discontinuous changes in value are also possible.

Of course, the ultimate arbiter of generalisation performance is not the value of a model selection criterion (such as GCV) on a particular realisation of the problem but the error on an independent test set averaged over multiple realisations. We perform such a test in the next section.

4 Results

For a thorough test of the method we turn to a more realistic problem stemming from Friedman's MARS paper [1] and later used to compare RBFs and MARS [5]. The problem involves the prediction of impedance Z and phase φ from the four parameters (resistance, frequency, inductance and capacitance) of an electrical circuit. Training sets of three different sizes (100, 200, 400) and with a signal-to-noise ratio of about 3:1 were replicated 100 times each. The input components were normalised to have unit variance and zero mean for each replication. The learning method, as described above, was applied using a set of 10 trial values of r between 1 and 10. Generalisation performance was estimated by the scaled sum of squared errors over two independent test sets (one for Z and one for φ) of size 5000 and uncorrupted by noise. This is the same experimental set up as in the previous papers [1, 5] from which further details can be obtained.
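For orientation (background we are adding here, not detail from the paper): the two outputs in Friedman's circuit example are the impedance and phase of a series RLC circuit, which in standard form are

```latex
% Standard series-circuit formulas; see [1] for the exact experimental setup.
Z(R,\omega,L,C) = \sqrt{R^2 + \left(\omega L - \frac{1}{\omega C}\right)^{2}},
\qquad
\varphi(R,\omega,L,C) = \arctan\!\left(\frac{\omega L - 1/(\omega C)}{R}\right),
```

where R is the resistance, ω the angular frequency, L the inductance and C the capacitance.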
              Z             φ
      p    NEW   OLD    NEW   OLD
     100   0.34  0.45   0.27  0.26
     200   0.19  0.26   0.18  0.20
     400   0.14  0.14   0.13  0.16

Table 1: Average generalisation errors for the new method, which optimises the width r, and an older method which does not.

Table 1 summarises the results. The left hand column gives the training set size. Two sets of results, one for Z and one for φ, are given. The figures quoted are the average (over 100 replications) of the scaled sum of squared prediction errors.
Apart from the method described above, which involves optimisation of r, the average errors of an older RBF algorithm, regularised forward selection (RFS), are also quoted (taken from [5]). The main differences to the method described here are that RFS uses a fixed value of r and creates a parsimonious network. The latter has a relatively small effect on generalisation performance.

RFS is clearly inferior to the new method for the Z problem and marginally worse for φ. We think the optimisation of r for each training set explains the superior performance of the new method, and the lack of such optimisation is a partial explanation for the poor performance of RFS compared to MARS [5]. The fixed value of r used for RFS was 3.5, but the average optimal values determined by the new method were 8.7 for Z and 2.8 for φ. Thus it looks as if the fixed value used for RFS was an underestimate in the case of Z (where the new algorithm considerably improved the results) but about right for φ (where the new method made less of an impact).

[Figure 3: Z as a function of L and C (surface plot, with L and C each running from -2 to 2).]

Note that while r = 8.7 may sound rather large, especially in view of the normalised input components, such large basis function widths do not necessarily imply a lack of structure in the fitted function, as might be assumed. Figure 3 plots Z (impedance) against C (capacitance) and L (inductance) for fixed values of the other two components (resistance and frequency). This function was fitted to one of the p = 200 training sets for which the algorithm had found an optimal basis function width of r = 10. The function still exhibits considerable structure over the ranges of L and C even though they are less than half the size of r.

5 Conclusions

We have described a new computational method for re-estimating the regularisation parameter of an RBF network based on generalised cross-validation (GCV). It utilises an eigensystem related to the design matrix of the regression problem and is more efficient and more stable than methods which involve a direct matrix inverse at each iteration. We have extended the algorithm to optimise the basis function width simply by testing a number of trial values and selecting the one associated with the smallest value of GCV.

We tested the method on a practical problem involving 4 input dimensions and a few hundred training examples. Our method, which can adapt the width of the basis functions but not their number, was found to have better prediction performance than a similar RBF network which can adapt the number of functions but is stuck with the same fixed width.

The new method, with its head-on approaches to finding the global minimum with respect to the regularisation parameter and to optimising the basis function width, does not scale up well for multiple regularisation parameters or multiple widths. Additionally, there is a limit on how many training examples and basis functions can be handled due to the computational cost of calculating the eigensystem. It is best suited to problems involving a single regularisation parameter, a single basis function width and about 1000 (or fewer) training set examples.

References

[1] J. Friedman. Multivariate adaptive regression splines (with discussion). Annals of Statistics, 19:1–141, 1991.
[2] G. Golub, M. Heath, and G. Wahba. Generalised cross-validation as a method for choosing a good ridge parameter. Technometrics, 21(2):215–223, 1979.
[3] D. MacKay. Bayesian interpolation. Neural Computation, 4(3):415–447, 1992.
[4] J. Moody. The effective number of parameters: An analysis of generalisation and regularisation in nonlinear learning systems. In J. Moody, S. Hanson, and R. Lippmann, editors, Neural Information Processing Systems 4, pages 847–854. Morgan Kaufmann, San Mateo CA, 1992.
[5] M. Orr. Regularisation in the selection of radial basis function centres. Neural Computation, 7(3):606–623, 1995.
[6] M. Orr. An EM algorithm for regularised radial basis function networks. In International Conference on Neural Networks and Brain, Beijing, China, October 1998.
