FPGA Implementations
of Neural Networks
Edited by
AMOS R. OMONDI
Flinders University, Adelaide,
SA, Australia
and
JAGATH C. RAJAPAKSE
Nanyang Technological University,
Singapore
Contents
Preface ix
1
FPGA Neurocomputers 1
Amos R. Omondi, Jagath C. Rajapakse and Mariusz Bajger
1.1. Introduction 1
1.2. Review of neural-network basics 3
1.3. ASIC vs. FPGA neurocomputers 9
1.4. Parallelism in neural networks 12
1.5. Xilinx Virtex-4 FPGA 13
1.6. Arithmetic 15
1.7. Activation-function implementation: unipolar sigmoid 21
1.8. Performance evaluation 32
1.9. Conclusions 34
References 34
2
Arithmetic precision for implementing BP networks on FPGA: A case study 37
Medhat Moussa and Shawki Areibi and Kristian Nichols
2.1. Introduction 37
2.2. Background 39
2.3. Architecture design and implementation 43
2.4. Experiments using logical-XOR problem 48
2.5. Results and discussion 50
2.6. Conclusions 55
References 56
3
FPNA: Concepts and properties 63
Bernard Girau
3.1. Introduction 63
3.2. Choosing FPGAs 65
3.3. FPNAs, FPNNs 71
3.4. Correctness 86
3.5. Underparameterized convolutions by FPNNs 88
3.6. Conclusions 96
References 97
4
FPNA: Applications and implementations 103
Bernard Girau
4.1. Summary of Chapter 3 104
4.2. Towards simplified architectures: symmetric boolean functions by
FPNAs 105
4.3. Benchmark applications 109
4.4. Other applications 113
4.5. General FPGA implementation 116
4.6. Synchronous FPNNs 120
4.7. Implementations of synchronous FPNNs 124
4.8. Implementation performances 130
4.9. Conclusions 133
References 134
5
Back-Propagation Algorithm Achieving 5 GOPS on the Virtex-E 137
Kolin Paul and Sanjay Rajopadhye
5.1. Introduction 138
5.2. Problem specification 139
5.3. Systolic implementation of matrix-vector multiply 141
5.4. Pipelined back-propagation architecture 142
5.5. Implementation 144
5.6. MMAlpha design environment 147
5.7. Architecture derivation 149
5.8. Hardware generation 155
5.9. Performance evaluation 157
5.10. Related work 159
5.11. Conclusion 160
Appendix 161
References 163
6
FPGA Implementation of Very Large Associative Memories 167
Dan Hammerstrom, Changjian Gao, Shaojuan Zhu, Mike Butts
6.1. Introduction 167
6.2. Associative memory 168
6.3. PC Performance Evaluation 179
6.4. FPGA Implementation 184
6.5. Performance comparisons 190
6.6. Summary and conclusions 192
References 193
7
FPGA Implementations of Neocognitrons 197
Alessandro Noriaki Ide and José Hiroki Saito
7.1. Introduction 197
7.2. Neocognitron 198
7.3. Alternative neocognitron 201
7.4. Reconfigurable computer 205
7.5. Reconfigurable orthogonal memory multiprocessor 206
7.6. Alternative neocognitron hardware implementation 209
7.7. Performance analysis 215
7.8. Applications 218
7.9. Conclusions 221
References 222
8
Self Organizing Feature Map for Color Quantization on FPGA 225
Chip-Hong Chang, Menon Shibu and Rui Xiao
8.1. Introduction 225
8.2. Algorithmic adjustment 228
8.3. Architecture 231
8.4. Implementation 235
8.5. Experimental results 239
8.6. Conclusions 242
References 242
9
Implementation of Self-Organizing Feature Maps in Reconfigurable
Hardware
247
Mario Porrmann, Ulf Witkowski, and Ulrich Rückert
9.1. Introduction 247
9.2. Using reconfigurable hardware for neural networks 248
9.3. The dynamically reconfigurable rapid prototyping system
RAPTOR2000 250
9.4. Implementing self-organizing feature maps on RAPTOR2000 252
9.5. Conclusions 267
References 267
10
FPGA Implementation of a Fully and Partially Connected MLP 271
Antonio Cañas, Eva M. Ortigosa, Eduardo Ros and Pilar M. Ortigosa
10.1. Introduction 271
10.2. MLP/XMLP and speech recognition 273
10.3. Activation functions and discretization problem 276
10.4. Hardware implementations of MLP 284
10.5. Hardware implementations of XMLP 291
10.6. Conclusions 293
Acknowledgments 294
References 295
11
FPGA Implementation of Non-Linear Predictors 297
Rafael Gadea-Gironés and Agustín Ramírez-Agundis
11.1. Introduction 298
11.2. Pipeline and back-propagation algorithm 299
11.3. Synthesis and FPGAs 304
11.4. Implementation on FPGA 313
11.5. Conclusions 319
References 321
12
The REMAP reconfigurable architecture: a retrospective 325
Lars Bengtsson, Arne Linde, Tomas Nordström, Bertil Svensson,
and Mikael Taveniku
12.1. Introduction 326
12.2. Target Application Area 327
12.3. REMAP-β – design and implementation 335
12.4. Neural networks mapped on REMAP-β 346
12.5. REMAP-γ architecture 353
12.6. Discussion 354
12.7. Conclusions 357
Acknowledgments 357
References 357
Preface
During the 1980s and early 1990s there was significant work in the design
and implementation of hardware neurocomputers. Nevertheless, most of these
efforts may be judged to have been unsuccessful: at no time have hardware
neurocomputers been in wide use. This lack of success may be largely
attributed to the fact that earlier work was almost entirely aimed at developing
custom neurocomputers, based on ASIC technology, but for such niche ar-
eas this technology was never sufficiently developed or competitive enough to
justify large-scale adoption. On the other hand, gate-arrays of the period men-
tioned were never large enough nor fast enough for serious artificial-neural-
network (ANN) applications. But technology has now improved: the capacity
and performance of current FPGAs are such that they present a much more
realistic alternative. Consequently neurocomputers based on FPGAs are now
a much more practical proposition than they have been in the past. This book
summarizes some work towards this goal and consists of 12 papers that were
selected, after review, from a number of submissions. The book is nominally
divided into three parts: Chapters 1 through 4 deal with foundational issues;
Chapters 5 through 11 deal with a variety of implementations; and Chapter
12 looks at the lessons learned from a large-scale project and also reconsiders
design issues in light of current and future technology.
Chapter 1 reviews the basics of artificial-neural-network theory, discusses
various aspects of the hardware implementation of neural networks (in both
ASIC and FPGA technologies, with a focus on special features of artificial
neural networks), and concludes with a brief note on performance-evaluation.
Special points are the exploitation of the parallelism inherent in neural net-
works and the appropriate implementation of arithmetic functions, especially
the sigmoid function. With respect to the sigmoid function, the chapter in-
cludes a significant contribution.
Certain sequences of arithmetic operations form the core of neural-network
computations, and the second chapter deals with a foundational issue: how
to determine the numerical precision format that allows an optimum tradeoff
between precision and implementation (cost and performance). Standard sin-
gle or double precision floating-point representations minimize quantization
errors while requiring significant hardware resources. Less precise fixed-point
representation may require less hardware resources but add quantization errors
that may prevent learning from taking place, especially in regression problems.
Chapter 2 examines this issue and reports on a recent experiment where we im-
plemented a multi-layer perceptron on an FPGA using both fixed and floating
point precision.
A basic problem in all forms of parallel computing is how best to map ap-
plications onto hardware. In the case of FPGAs the difficulty is aggravated
by the relatively rigid interconnection structures of the basic computing cells.
Chapters 3 and 4 consider this problem: an appropriate theoretical and prac-
tical framework to reconcile simple hardware topologies with complex neural
architectures is discussed. The basic concept is that of Field Programmable
Neural Arrays (FPNA) that lead to powerful neural architectures that are easy
to map onto FPGAs, by means of a simplified topology and an original data
exchange scheme. Chapter 3 gives the basic definition and results of the theo-
retical framework. And Chapter 4 shows how FPNAs lead to powerful neural
architectures that are easy to map onto digital hardware. Applications and im-
plementations are described, focusing on a class of synchronous FPNNs.
Chapter 5 presents a systolic architecture for the complete back-propagation
algorithm. This is the first such implementation of the back-propagation
algorithm that completely parallelizes the entire computation of the learning
phase. The array has been implemented on an Annapolis FPGA-based
coprocessor, and it achieves very favorable performance, in the range of
5 GOPS. The proposed new design targets Virtex boards. A description is
given of the process of automatically deriving these high-performance
architectures using the systolic array design tool MMAlpha. This tool
facilitates system specification: it makes it easy to specify the system in a
very high-level language (Alpha) and also allows design exploration, to obtain
architectures whose performance is comparable to that obtained using
hand-optimized VHDL code.
Associative networks have a number of properties, including a rapid, com-
pute efficient best-match and intrinsic fault tolerance, that make them ideal for
many applications. However, large networks can be slow to emulate because
of their storage and bandwidth requirements. Chapter 6 presents a simple but
effective model of association and then discusses a performance analysis of the
implementation of this model on a single high-end PC workstation, a PC cluster,
and FPGA hardware.
Chapter 7 describes the implementation of an artificial neural network in a
reconfigurable parallel computer architecture using FPGAs, named Reconfig-
urable Orthogonal Memory Multiprocessor (REOMP), which uses p² memory
modules connected to p reconfigurable processors, in row access mode and
column access mode. REOMP is considered as an alternative model of the
neural network neocognitron. The chapter consists of a description of the
REOMP architecture, a case study of alternative neocognitron mapping, and a
performance analysis of systems consisting of 1 to 64
processors.
Chapter 8 presents an efficient architecture of Kohonen Self-Organizing
Feature Map (SOFM) based on a new Frequency Adaptive Learning (FAL)
algorithm which efficiently replaces the neighborhood adaptation function of
the conventional SOFM. The proposed SOFM architecture is prototyped on
Xilinx Virtex FPGA using the prototyping environment provided by XESS.
A robust functional verification environment is developed for rapid prototype
development. Various experimental results are given for the quantization of a
512 × 512-pixel color image.
Chapter 9 consists of another discussion of an implementation of SOFMs
in reconfigurable hardware. Based on the universal rapid prototyping system,
RAPTOR2000, a hardware accelerator for self-organizing feature maps has
been developed. Using Xilinx Virtex-E FPGAs, RAPTOR2000 is capable of
emulating hardware implementations with a complexity of more than 15 mil-
lion system gates. RAPTOR2000 is linked to its host – a standard personal
computer or workstation – via the PCI bus. A speed-up of up to 190 is achieved
with five FPGA modules on the RAPTOR2000 system compared to a software
implementation on a state of the art personal computer for typical applications
of SOFMs.
Chapter 10 presents several hardware implementations of a standard Multi-
Layer Perceptron (MLP) and a modified version called eXtended Multi-Layer
Perceptron (XMLP). This extended version is an MLP-like feed-forward net-
work with two-dimensional layers and configurable connection pathways. The
discussion includes a description of hardware implementations that have been
developed and tested on an FPGA prototyping board, together with system spec-
ifications at two different abstraction levels: register transfer level (VHDL)
and a higher, algorithmic-like level (Handel-C), as well as the exploitation of
varying degrees of parallelism. The main test-bed application is speech recog-
nition.
Chapter 11 describes the implementation of a systolic array for a non-linear
predictor for image and video compression. The implementation is based on a
multilayer perceptron with a hardware-friendly learning algorithm. It is shown
that even with relatively modest FPGA devices, the architecture attains the
speeds necessary for real-time training in video applications, enabling more
typical applications to be added to the image-compression processing.
The final chapter consists of a retrospective look at the REMAP project,
which concerned the design, implementation, and use of large-scale
parallel architectures for neural-network applications. The chapter gives an
overview of the computational requirements found in algorithms in general and
motivates the use of regular processor arrays for the efficient execution of such
algorithms. The architecture, following the SIMD principle (Single Instruc-
tion stream, Multiple Data streams), is described, as well as the mapping of
some important and representative ANN algorithms. Implemented in FPGA,
the system served as an architecture laboratory. Variations of the architecture
are discussed, as well as scalability of fully synchronous SIMD architectures.
The design principles of a VLSI-implemented successor of REMAP-β are de-
scribed, and the paper concludes with a discussion of how the more powerful
FPGA circuits of today could be used in a similar architecture.
AMOS R. OMONDI AND JAGATH C. RAJAPAKSE
1
FPGA Neurocomputers
Amos R. Omondi, Jagath C. Rajapakse and Mariusz Bajger
1.1 Introduction
During the 1980s and early 1990s there was significant work in the design and
implementation of hardware neurocomputers. Nevertheless, most of those
efforts may be judged to have been unsuccessful: at no time have hardware
neurocomputers been in wide use; indeed, the entire field was largely
moribund by the end of the
1990s. This lack of success may be largely attributed to the fact that earlier
work was almost entirely based on ASIC technology but was never sufficiently
developed or competitive enough to justify large-scale adoption; gate-arrays
of the period mentioned were never large enough nor fast enough for serious
neural-network applications.1 Nevertheless, the current literature shows that
ASIC neurocomputers appear to be making some sort of a comeback [1, 2, 3];
we shall argue below that these efforts are destined to fail for exactly the same
reasons that earlier ones did. On the other hand, the capacity and performance
of current FPGAs are such that they present a much more realistic alternative.
We shall in what follows give more detailed arguments to support these claims.
The chapter is organized as follows. Section 2 is a review of the fundamen-
tals of neural networks; still, it is expected that most readers of the book will al-
ready be familiar with these. Section 3 briefly contrasts ASIC-neurocomputers
with FPGA-neurocomputers, with the aim of presenting a clear case for the
latter; more significant aspects of this argument will be found in [18]. One
of the most repeated arguments for implementing neural networks in hardware
is the parallelism that the underlying models possess. Section 4 is a short sec-
tion that reviews this. In Section 5 we briefly describe the realization of a
state-of-the-art FPGA device. The objective there is to be able to put into a
concrete context certain following discussions and to be able to give grounded
discussions of what can or cannot be achieved with current FPGAs. Section
6 deals with certain aspects of computer arithmetic that are relevant to neural-
network implementations. Much of this is straightforward, and our main aim
is to highlight certain subtle aspects. Section 7 nominally deals with activa-
tion functions, but is actually mostly devoted to the sigmoid function. There
are two main reasons for this choice: first, the chapter contains a significant
contribution to the implementation of elementary or near-elementary activa-
tion functions, the nature of which contribution is not limited to the sigmoid
function; second, the sigmoid function is the most important activation func-
tion for neural networks. In Section 8, we very briefly address an important
issue — performance evaluation. Our goal here is simple and can be stated
quite succinctly: as far as performance-evaluation goes, neurocomputer archi-
tecture continues to languish in the “Dark Ages”, and this needs to change. A
final section summarises the main points made in the chapter and also serves as a
brief introduction to subsequent chapters in the book.
1Unless otherwise indicated, we shall use neural network to mean artificial neural network.
1.2 Review of neural-network basics
The human brain, which consists of approximately 100 billion neurons that
are connected by about 100 trillion connections, forms the most complex object
known in the universe. Brain functions such as sensory information process-
ing and cognition are the results of emergent computations carried out by this
massive neural network. Artificial neural networks are computational models
that are inspired by the principles of computations performed by the biolog-
ical neural networks of the brain. Neural networks possess many attractive
characteristics that may ultimately surpass some of the limitations in classical
computational systems. The processing in the brain is mainly parallel and dis-
tributed: the information is stored in connections, mostly in myelin layers
of axons of neurons, and, hence, distributed over the network and processed in
a large number of neurons in parallel. The brain is adaptive from its birth to its
complete death and learns from exemplars as they arise in the external world.
Neural networks have the ability to learn the rules describing training data and,
from previously learnt information, respond to novel patterns. Neural networks
are fault-tolerant, in the sense that the loss of a few neurons or connections does
not significantly affect their behavior, as the information processing involves
a large number of neurons and connections. Artificial neural networks have
found applications in many domains — for example, signal processing, image
analysis, medical diagnosis systems, and financial forecasting.
The roles of neural networks in the afore-mentioned applications fall
broadly into two classes: pattern recognition and functional approximation.
The fundamental objective of pattern recognition is to provide a meaningful
categorization of input patterns. In functional approximation, given a set of
patterns, the network finds a smooth function that approximates the actual
mapping between the input and output.
A vast majority of neural networks are still implemented in software on
sequential machines. Although this is not necessarily always a severe limita-
tion, there is much to be gained from directly implementing neural networks
in hardware, especially if such implementation exploits the parallelism inher-
ent in the neural networks but without undue costs. In what follows, we shall
describe a few neural network models — multi-layer perceptrons, Kohonen’s
self-organizing feature map, and associative memory networks — whose im-
plementations on FPGA are discussed in the other chapters of the book.
1.2.1 Artificial neuron
An artificial neuron forms the basic unit of artificial neural networks. The
basic elements of an artificial neuron are (1) a set of input nodes, indexed by,
say, 1, 2, . . . , I, that receive the corresponding input signal or pattern vector,
say x = (x1, x2, . . . xI)T; (2) a set of synaptic connections whose strengths are
represented by a set of weights, here denoted by w = (w1, w2, . . . wI)T; and
(3) an activation function Φ that relates the total synaptic input to the output
(activation) of the neuron. The main components of an artificial neuron are
illustrated in Figure 1.
Figure 1: The basic components of an artificial neuron
The total synaptic input, u, to the neuron is given by the inner product of the
input and weight vectors:
u = \sum_{i=1}^{I} w_i x_i      (1.1)
where we assume that the threshold of the activation is incorporated in the
weight vector. The output activation, y, is given by
y = Φ(u) (1.2)
where Φ denotes the activation function of the neuron. Consequently, the com-
putation of the inner-products is one of the most important arithmetic opera-
tions to be carried out for a hardware implementation of a neural network. This
means not just the individual multiplications and additions, but also the alterna-
tion of successive multiplications and additions — in other words, a sequence
of multiply-add (also commonly known as multiply-accumulate or MAC) op-
erations. We shall see that current FPGA devices are particularly well-suited
to such computations.
The total synaptic input is transformed to the output via the non-linear acti-
vation function. Commonly employed activation functions for neurons are
the threshold activation function (unit step function or hard limiter):

Φ(u) = \begin{cases} 1.0, & \text{when } u \geq 0, \\ 0.0, & \text{otherwise;} \end{cases}

the ramp activation function:2

Φ(u) = \max\{0.0, \min\{1.0, u + 0.5\}\};

the sigmoidal activation function, where the unipolar sigmoid function is

Φ(u) = \frac{a}{1 + \exp(−bu)}

and the bipolar sigmoid is

Φ(u) = a \, \frac{1 − \exp(−bu)}{1 + \exp(−bu)},

where a and b represent, respectively, real constants: the gain (or amplitude)
and the slope of the transfer function.
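For concreteness, the following is a minimal Python/NumPy sketch of these
activation functions; the function names and the default values a = b = 1 are
illustrative choices, not taken from the text.

    import numpy as np

    def threshold(u):
        # Unit step / hard limiter: 1.0 when u >= 0, and 0.0 otherwise.
        return np.where(u >= 0.0, 1.0, 0.0)

    def ramp(u):
        # Ramp with unit slope: clamps u + 0.5 into the interval [0, 1].
        return np.maximum(0.0, np.minimum(1.0, u + 0.5))

    def unipolar_sigmoid(u, a=1.0, b=1.0):
        # a is the gain (amplitude) and b the slope of the transfer function.
        return a / (1.0 + np.exp(-b * u))

    def bipolar_sigmoid(u, a=1.0, b=1.0):
        return a * (1.0 - np.exp(-b * u)) / (1.0 + np.exp(-b * u))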
The second most important arithmetic operation required for neural networks
is the computation of such activation functions. We shall see below that the
structure of FPGAs limits the ways in which these operations can be carried
out at reasonable cost, but current FPGAs are also equipped to enable high-
speed implementations of these functions if the right choices are made.
A neuron with a threshold activation function is usually referred to as the
discrete perceptron, and with a continuous activation function, usually a sig-
moidal function, such a neuron is referred to as continuous perceptron. The
sigmoidal is the most pervasive and biologically plausible activation function.
Neural networks attain their operating characteristics through learning or
training. During training, the weights (or strengths) of connections are gradu-
ally adjusted in either supervised or unsupervised manner. In supervised learn-
ing, for each training input pattern, the network is presented with the desired
output (or a teacher), whereas in unsupervised learning, for each training input
pattern, the network adjusts the weights without knowing the correct target.
The network self-organizes to classify similar input patterns into clusters in
unsupervised learning. The learning of a continuous perceptron is by adjust-
ment (using a gradient-descent procedure) of the weight vector, through the
minimization of some error function, usually the square-error between the de-
sired output and the output of the neuron. The resultant learning is known as
2In general, the slope of the ramp may be other than unity.
delta learning: the new weight-vector, w_{new}, after presentation of an input
x and a desired output d, is given by

w_{new} = w_{old} + αδx,

where w_{old} refers to the weight vector before the presentation of the input and
the error term, δ, is (d − y)Φ′(u), where y is as defined in Equation 1.2 and
Φ′ is the first derivative of Φ. The constant α, where 0 < α ≤ 1, denotes the
learning factor. Given a set of training data, Γ = {(x_i, d_i); i = 1, . . . , n}, the
complete procedure of training a continuous perceptron is as follows:
begin: /* training a continuous perceptron */
    Initialize weights w_{new}
    Repeat
        For each pattern (x_i, d_i) do
            w_{old} = w_{new}
            w_{new} = w_{old} + αδx_i
    until convergence
end
The weights of the perceptron are initialized to random values, and the conver-
gence of the above algorithm is assumed to have been achieved when no more
significant changes occur in the weight vector.
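As a sketch only, the training procedure above can be expressed in Python/NumPy
as follows; the unipolar sigmoid, the learning factor, and the convergence test are
illustrative assumptions rather than prescriptions from the text.

    import numpy as np

    def sigmoid(u, a=1.0, b=1.0):
        return a / (1.0 + np.exp(-b * u))

    def sigmoid_deriv(u, a=1.0, b=1.0):
        # First derivative of the unipolar sigmoid.
        y = sigmoid(u, a, b)
        return b * y * (a - y) / a

    def train_continuous_perceptron(X, d, alpha=0.1, epochs=1000, tol=1e-6):
        # X: n-by-I matrix of input patterns (threshold folded into the weights);
        # d: n-vector of desired outputs.
        rng = np.random.default_rng(0)
        w = rng.uniform(-0.5, 0.5, X.shape[1])        # random initial weights
        for _ in range(epochs):
            w_before = w.copy()
            for x_i, d_i in zip(X, d):
                u = np.dot(w, x_i)                    # total synaptic input
                delta = (d_i - sigmoid(u)) * sigmoid_deriv(u)
                w = w + alpha * delta * x_i           # w_new = w_old + alpha*delta*x
            if np.linalg.norm(w - w_before) < tol:    # no significant change
                break
        return w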
1.2.2 Multi-layer perceptron
The multi-layer perceptron (MLP) is a feedforward neural network consist-
ing of an input layer of nodes, followed by two or more layers of perceptrons,
the last of which is the output layer. The layers between the input layer and
output layer are referred to as hidden layers. MLPs have been applied success-
fully to many complex real-world problems consisting of non-linear decision
boundaries. Three-layer MLPs have been sufficient for most of these applica-
tions. In what follows, we will briefly describe the architecture and learning of
an L-layer MLP.
Let the 0-th layer and the L-th layer represent the input and output layers,
respectively, and let w^{l+1}_{kj} denote the synaptic weight connected to the
k-th neuron of the (l+1)-th layer from the j-th neuron of the l-th layer. If the
number of perceptrons in the l-th layer is N_l, then we shall let
W_l = \{w^l_{kj}\}_{N_l \times N_{l−1}} denote the matrix of weights connecting to
the l-th layer. The vector of synaptic inputs to the l-th layer,
u_l = (u^l_1, u^l_2, . . . , u^l_{N_l})^T, is given by

u_l = W_l y_{l−1},

where y_{l−1} = (y^{l−1}_1, y^{l−1}_2, . . . , y^{l−1}_{N_{l−1}})^T denotes the vector
of outputs at the (l−1)-th layer. The generalized delta learning-rule for the
l-th layer is, for perceptrons,
given by

W_l^{new} = W_l^{old} + α δ_l y_{l−1}^T,

where the vector of error terms at the l-th layer, δ_l^T = (δ^l_1, δ^l_2, . . . , δ^l_{N_l}),
is given by

δ^l_j = \begin{cases} 2\,Φ′^l_j(u^l_j)\,(d_j − o_j), & \text{when } l = L, \\ Φ′^l_j(u^l_j) \sum_{k=1}^{N_{l+1}} δ^{l+1}_k w^{l+1}_{kj}, & \text{otherwise,} \end{cases}

where o_j and d_j denote the network and desired outputs of the j-th output
neuron, respectively, and Φ^l_j and u^l_j denote the activation function and total
synaptic input of the j-th neuron at the l-th layer, respectively. During training,
the activities propagate forward for an input pattern; the error terms of a
particular layer are computed by using the error terms in the next layer and,
hence, move in the backward direction. So, the training of an MLP is referred
to as the error back-propagation algorithm. For the rest of this chapter, we
shall generally focus on MLP networks with backpropagation, this being, arguably, the
most-implemented type of artificial neural networks.
Figure 2: Architecture of a 3-layer MLP network
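A minimal Python/NumPy sketch of one training step of the generalized delta rule
is given below. The unipolar sigmoid (with a = b = 1), the single-pattern update,
and the helper names are illustrative assumptions; the factor 2 at the output layer
follows the expression above.

    import numpy as np

    def phi(u):                          # unipolar sigmoid with a = b = 1
        return 1.0 / (1.0 + np.exp(-u))

    def phi_prime(u):
        y = phi(u)
        return y * (1.0 - y)

    def backprop_step(W, x, d, alpha=0.1):
        # W: list of weight matrices; W[l] has shape (N_{l+1}, N_l), so W[0]
        # connects the input layer to the first hidden layer.
        ys, us = [x], []
        for Wl in W:                     # forward pass
            u = Wl @ ys[-1]
            us.append(u)
            ys.append(phi(u))
        L = len(W)
        deltas = [None] * L
        deltas[L - 1] = 2.0 * phi_prime(us[L - 1]) * (d - ys[-1])   # output layer
        for l in range(L - 2, -1, -1):   # error terms move backwards
            deltas[l] = phi_prime(us[l]) * (W[l + 1].T @ deltas[l + 1])
        # W_l_new = W_l_old + alpha * delta_l * y_{l-1}^T
        return [Wl + alpha * np.outer(deltas[l], ys[l]) for l, Wl in enumerate(W)]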
1.2.3 Self-organizing feature maps
Neurons in the cortex of the human brain are organized into layers of neu-
rons. These neurons not only have bottom-up and top-down connections, but
also have lateral connections. A neuron in a layer excites its closest neigh-
bors via lateral connections but inhibits the distant neighbors. Lateral inter-
actions allow neighbors to partially learn the information learned by a winner
(formally defined below), so that after learning the neighbors respond to pat-
terns similar to those of the winner. This results in topological ordering of
formed clusters. The self-organizing feature map (SOFM) is a two-layer self-
organizing network which is capable of learning input patterns in a topolog-
ically ordered manner at the output layer. The most significant concept in a
learning SOFM is that of learning within a neighbourhood around a winning
neuron. Therefore not only the weights of the winner but also those of the
neighbors of the winner change.
The winning neuron, m, for an input pattern x is chosen according to the
total synaptic input:
m = \arg\max_j \; w_j^T x,

where w_j denotes the weight-vector corresponding to the j-th output neuron.
w_m^T x determines the neuron with the shortest Euclidean distance between its
weight vector and the input vector when the input patterns are normalized to
unity before training.
Let Nm(t) denote a set of indices corresponding to the neighbourhood size
of the current winner m at the training time or iteration t. The radius of Nm is
decreased as the training progresses; that is, N_m(t_1) ⊃ N_m(t_2) ⊃ N_m(t_3) ⊃ . . . ,
where t_1 < t_2 < t_3 < . . . . The radius N_m(t = 0) can be very large at the
beginning of learning because it is needed for initial global ordering of weights,
but near the end of training, the neighbourhood may involve no neighbouring
neurons other than the winning one. The weights associated with the winner
and its neighbouring neurons are updated by
∆wj = α(j, t) (x − wj) for all j ∈ Nm(t),
where the positive learning factor depends on both the training time and the
size of the neighbourhood. For example, a commonly used neighbourhood
function is the Gaussian function
α(N_m(t), t) = α(t) \exp\left( − \frac{‖r_j − r_m‖^2}{2σ^2(t)} \right),
where r_m and r_j denote the positions of the winning neuron m and of the
neighbouring neurons j, respectively. α(t) is usually reduced at a
rate that is inversely proportional to t. The type of training described above
is known as Kohonen’s algorithm (for SOFMs). The weights generated by the
above algorithm are arranged spatially in an ordering that is related to the
features of the trained patterns. Therefore, the algorithm produces topology-
preserving maps. After learning, each input causes a localized response whose
position on the output layer reflects the dominant features of the input.
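The following Python/NumPy sketch performs one Kohonen update. The decay
schedules for α(t) and σ(t), and the use of a Gaussian that extends over all neurons
rather than a hard cut-off at N_m(t), are illustrative simplifications.

    import numpy as np

    def sofm_step(W, pos, x, t, alpha0=0.5, sigma0=3.0, tau=1000.0):
        # W:   (num_neurons, dim) weight matrix, rows roughly unit-length;
        # pos: (num_neurons, 2) grid positions r_j of the output neurons;
        # x:   unit-length input pattern; t: training iteration.
        m = int(np.argmax(W @ x))                    # winner: largest w_j . x
        alpha_t = alpha0 / (1.0 + t / tau)           # learning factor decays with t
        sigma_t = sigma0 / (1.0 + t / tau)           # neighbourhood radius shrinks
        dist2 = np.sum((pos - pos[m]) ** 2, axis=1)  # ||r_j - r_m||^2
        h = alpha_t * np.exp(-dist2 / (2.0 * sigma_t ** 2))
        W = W + h[:, None] * (x - W)                 # delta w_j = alpha(j,t)(x - w_j)
        return W, m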
1.2.4 Associative-memory networks
Associative memory networks are two-layer networks in which weights
are determined in order to store a set of pattern associations, say,
{(s1, t1), (s2, t2), . . . (sk, tk), . . . (sn, tn)}, where input pattern sk is associ-
ated with output pattern tk. These networks not only learn to produce asso-
ciative patterns, but also are able to recall the desired response patterns when
a given pattern is similar to the stored pattern. Therefore they are referred
to as content-addressable memories. For each association vector (sk, tk), if
sk = tk, the network is referred to as auto-associative; otherwise it is hetero-
associative. The networks often provide input-output descriptions of the asso-
ciative memory through a linear transformation (then known as linear associa-
tive memory). The neurons in these networks have linear activation functions.
If the linearity constant is unity, then the output layer activation is given by
y = Wx,
where W denotes the weight matrix connecting the input and output layers.
These networks learn using the Hebb rule; the weight matrix to learn all the
associations is given by the batch learning rule:
W = \sum_{k=1}^{n} t_k s_k^T .
If the stored patterns are orthogonal to each other, then it can be shown that
the network recalls the stored associations perfectly. Otherwise, the recalled
patterns are distorted by cross-talk between patterns.
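A minimal Python/NumPy sketch of a linear associative memory built with the
batch Hebb rule is shown below; the three-dimensional example patterns are
illustrative and, being orthonormal, are recalled exactly.

    import numpy as np

    def store(pairs):
        # Batch Hebb rule: W = sum over k of t_k s_k^T.
        return sum(np.outer(t, s) for s, t in pairs)

    def recall(W, s):
        # Linear activation with unity constant: y = W s.
        return W @ s

    s1, s2 = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
    t1, t2 = np.array([1.0, -1.0]), np.array([-1.0, 1.0])
    W = store([(s1, t1), (s2, t2)])
    assert np.allclose(recall(W, s1), t1) and np.allclose(recall(W, s2), t2)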
1.3 ASIC vs. FPGA neurocomputers
By far, the most often-stated reason for the development of custom (i.e.
ASIC) neurocomputers is that conventional (i.e. sequential) general-purpose
processors do not fully exploit the parallelism inherent in neural-network mod-
els and that highly parallel architectures are required for that. That is true as
far as it goes, which is not very far, since it is mistaken on two counts [18]:
The first is that it confuses the final goal, which is high performance — not
merely parallelism — with artifacts of the basic model. The strong focus on
parallelism can be justified only when high performance is attained at a rea-
sonable cost. The second is that such claims ignore the fact that conventional
microprocessors, as well as other types of processors with a substantial user-
base, improve at a much faster rate than (low-use) special-purpose ones, which
implies that the performance (relative to cost or otherwise) of ASIC neurocom-
puters will always lag behind that of mass-produced devices – even on special
applications. As an example of this misdirection of effort, consider the latest
in ASIC neurocomputers, as exemplified by, say, [3]. It is claimed that “with
relatively few neurons, this ANN-dedicated hardware chip [Neuricam Totem]
outperformed the other two implementations [a Pentium-based PC and a Texas
Instruments DSP]”. The actual results as presented and analysed are typical
of the poor benchmarking that afflicts the neural-network area. We shall have
more to say below on that point, but even if one accepts the claims as given,
some remarks can be made immediately. The strongest performance-claim
made in [3], for example, is that the Totem neurochip outperformed, by a fac-
tor of about 3, a PC (with a 400-MHz Pentium II processor, 128 Mbytes of
main memory, and the neural networks implemented in Matlab). Two points
are pertinent here:
In late 2001/early 2002, the latest Pentiums had clock rates that were
more than 3 times that of the Pentium II above, as well as much more memory
(cache, main, etc.).
The PC implementation was done on top of a substantial software base,
instead of a direct low-level implementation, thus raising issues of “best-
effort” with respect to the competitor machines.
A comparison of the NeuriCam Totems and Intel Pentiums in the years 2002
and 2004 will show that the large basic differences have only got larger, primarily
because, with the much larger user-base, the Intel (x86) processors continue to
improve rapidly, whereas little is ever heard about the neurocomputers as
PCs go from one generation to another.
So, where then do FPGAs fit in? It is evident that in general FPGAs can-
not match ASIC processors in performance, and in this regard FPGAs have
always lagged behind conventional microprocessors. Nevertheless, if one
considers FPGA structures as an alternative to software on, say, a general-
purpose processor, then it is possible that FPGAs may be able to deliver better
cost:performance ratios on given applications.3 Moreover, the capacity for
reconfiguration means that the same device may be extended to a range of applications, e.g.
several different types of neural networks. Thus the main advantage of the
FPGA is that it may offer a better cost:performance ratio than either custom
3Note that the issue is cost:performance and not just performance
ASIC neurocomputers or state-of-the-art general-purpose processors, and with
more flexibility than the former. A comparison of the NeuriCam Totem, Intel
Pentiums, and FPGAs will also show improvements that favour the FPGAs,
as a consequence of relatively rapid changes in
density and speed.
It is important to note here two critical points in relation to custom (ASIC)
neurocomputers versus the FPGA structures that may be used to implement a
variety of artificial neural networks. The first is that if one aims to realize a cus-
tom neurocomputer that has a significant amount of flexibility, then one ends
up with a structure that resembles an FPGA — that is, a small number of differ-
ent types of functional units that can be configured in different ways, according to
the neural network to be implemented — but which nonetheless does not have
the same flexibility. (A particular aspect to note here is that the large variety of
neural networks — usually geared towards different applications — gives rise to
a requirement for flexibility, in the form of either programmability or recon-
figurability.) The second point is that raw hardware-performance alone does
not constitute the entirety of a typical computing structure: software is also
required; but the development of software for custom neurocomputers will,
because of the limited user-base, always lag behind that of the more widely
used FPGAs. A final drawback of the custom-neurocomputer approach is that
most designs and implementations tend to concentrate on just the high paral-
lelism of the neural networks and generally ignore the implications of Am-
dahl’s Law, which states that ultimately the speed-up will be limited by any
serial or lowly-parallel processing involved. (One rare exception is [8].)4 Thus
non-neural and other serial parts of processing tend to be given short shrift.
Further, even where parallelism can be exploited, most neurocomputer designs
seem to take little account of the fact that the degrees of useful parallelism
will vary according to particular applications. (If parallelism is the main is-
sue, then all this would suggest that the ideal building block for an appropri-
ate parallel-processor machine is one that is less susceptible to these factors,
and this argues for a relatively large-grain high-performance processor, used in
smaller numbers, that can nevertheless exploit some of the parallelism inherent
in neural networks [18].)
All of the above can be summed up quite succinctly: despite all the claims
that have been made and are still being made, to date there has not been a
custom neurocomputer that, on artificial neural-network problems (or, for that
matter, on any other type of problem), has outperformed the best conventional
computer of its time. Moreover, there is little chance of that happening. The
4Although not quite successful as a neurocomputer, this machine managed to survive longer than most
neurocomputers — because the flexibility inherent in its design meant that it could also be useful for non-
neural applications.
promise of FPGAs is that they offer, in essence, the ability to realize “semi-
custom” machines for neural networks; and, with continuing developments in
technology, they thus offer the best hope for changing the situation, as far as
possibly outperforming (relative to cost) conventional processors.
1.4 Parallelism in neural networks
Neural networks exhibit several types of parallelism, and a careful examina-
tion of these is required in order to both determine the most suitable hardware
structures as well as the best mappings from the neural-network structures onto
given hardware structures. For example, parallelism can be of the SIMD type
or of the MIMD type, bit-parallel or word-parallel, and so forth [5]. In general,
the only categorical statement that can be made is that, except for networks of a
trivial size, fully parallel implementation in hardware is not feasible — virtual
parallelism is necessary, and this, in turn, implies some sequential processing.
In the context of FPGAs, it might appear that reconfiguration is a silver bullet,
but this is not so: the benefits of dynamic reconfigurability must be evaluated
relative to the costs (especially in time) of reconfiguration. Nevertheless, there
is little doubt that FPGAs are more promising than ASIC neurocomputers. The
specific types of parallelism are as follows.
Training parallelism: Different training sessions can be run in parallel,
e.g. on SIMD or MIMD processors. The level of parallelism at this level
is usually medium (i.e. in the hundreds), and hence can be nearly fully
mapped onto current large FPGAs.
Layer parallelism: In a multilayer network, different layers can be
processed in parallel. Parallelism at this level is typically low (in the
tens), and therefore of limited value, but it can still be exploited through
pipelining.
Node parallelism: This level, which corresponds to individual neurons, is
perhaps the most important level of parallelism, in that if fully exploited,
then parallelism at all of the above higher levels is also fully exploited.
But that may not be possible, since the number of neurons can be as
high as in the millions. Nevertheless, node parallelism matches FPGAs
very well, since a typical FPGA basically consists of a large number of
“cells” that can operate in parallel and, as we shall see below, onto which
neurons can readily be mapped.
Weight parallelism: In the computation of an output
y = Φ\left( \sum_{i=1}^{n} w_i x_i \right),
where xi is an input and wi is a weight, the products xiwi can all be
computed in parallel, and the sum of these products can also be com-
puted with high parallelism (e.g. by using an adder-tree of logarithmic
depth).
Bit-level parallelism: At the implementation level, a wide variety of par-
allelism is available, depending on the design of individual functional
units. For example, bit-serial, serial-parallel, word-parallel, etc.
From the above, three things are evident in the context of an implementation.
First, the parallelism available at the different levels varies enormously. Sec-
ond, different types of parallelism may be traded off against others, depending
on the desired cost:performance ratio (where for an FPGA cost may be mea-
sured in, say, the number of CLBs etc.); for example, the slow speed of a
single functional unit may be balanced by having many such units operating
concurrently. And third, not all types of parallelism are suitable for FPGA
implementation: for example, the required routing-interconnections may be
problematic, or the exploitation of bit-level parallelism may be constrained by
the design of the device, or bit-level parallelism may simply not be appropriate,
and so forth. In the Xilinx Virtex-4, for example, we shall see that it is possible
to carry out many neural-network computations without using much of what is
usually taken as FPGA fabric.5
1.5 Xilinx Virtex-4 FPGA
In this section, we shall briefly give the details of a current FPGA device,
the Xilinx Virtex-4, that is typical of state-of-the-art FPGA devices. We shall
below use this device in several running examples, as these are easiest under-
stood in the context of a concrete device. The Virtex-4 is actually a family of
devices with many common features but varying in speed, logic-capacity, etc..
The Virtex-4 consists of an array of up to 192-by-116 tiles (in generic FPGA
terms, configurable logic blocks or CLBs), up to 1392 Kb of Distributed-RAM,
up to 9936 Kb of Block-RAM (arranged in 18-Kb blocks), up to 2 PowerPC 405
processors, up to 512 XtremeDSP slices for arithmetic, input/output blocks,
and so forth.6
A tile is made up of four slices that together consist of eight function-
generators (configured as 4-bit lookup tables capable of realizing any four-
input boolean function), eight flip-flops, two fast carry-chains, 64 bits of
Distributed-RAM, and 64-bits of shift register. There are two types of slices:
5The definition here of FPGA fabric is, of course, subjective, and this reflects a need to deal with changes in
FPGA realization. But the fundamental point remains valid: bit-level parallelism is not ideal for the given
computations and the device in question.
6Not all the stated maxima occur in any one device of the family.
SLICEM, which consists of logic, distributed RAM, and shift registers, and
SLICEL, which consists of logic only. Figure 3 shows the basic elements of a
tile.
Figure 3: DSP48 tile of Xilinx Virtex-4
Blocks of the Block-RAM are true dual-ported and reconfigurable to various
widths and depths (from 16K×1 to 512×36); this memory lies outside the
slices. Distributed RAM is located inside the slices and is nominally single-
port but can be configured for dual-port operation. The PowerPC processor
core is of 32-bit Harvard architecture, implemented as a 5-stage pipeline. The
significance of this last unit is in relation to the comment above on the serial
parts of even highly parallel applications — one cannot live by parallelism
alone. The maximum clock rate for all of the units above is 500 MHz.
Arithmetic functions in the Virtex-4 fall into one of two main categories:
arithmetic within a tile and arithmetic within a collection of slices. All the
slices together make up what is called the XtremeDSP [22]. DSP48 slices
are optimized for multiply, add, and multiply-add operations. There are 512
DSP48 slices in the largest Virtex-4 device. Each slice has the organization
shown in Figure 3 and consists primarily of an 18-bit×18-bit multiplier, a 48-
bit adder/subtractor, multiplexers, registers, and so forth. Given the importance
of inner-product computations, it is the XtremeDSP that is here most crucial for
neural-network applications. With 512 DSP48 slices operating at a peak rate of
500 MHz, a maximum performance of 256 Giga-MACs (multiply-accumulate
operations) per second is possible. Observe that this is well beyond anything
that has so far been offered by way of a custom neurocomputer.
1.6 Arithmetic
There are several aspects of computer arithmetic that need to be consid-
ered in the design of neurocomputers; these include data representation, inner-
product computation, implementation of activation functions, storage and up-
date of weights, and the nature of learning algorithms. Input/output, although
not an arithmetic problem, is also important to ensure that arithmetic units can
be supplied with inputs (and results sent out) at appropriate rates. Of these,
the most important are the inner-product and the activation functions. Indeed,
the latter is sufficiently significant and of such complexity that we shall devote
to it an entirely separate section. In what follows, we shall discuss the others,
with a special emphasis on inner-products. Activation functions, which here
is restricted to the sigmoid (although the relevant techniques are not) are suf-
ficiently complex that we have relegated them to seperate section: given the
ease with which multiplication and addition can be implemented, unless suffi-
cient care is taken, it is the activation function that will be the limiting factor
in performance.
Data representation: There is not much to be said here, especially since exist-
ing devices restrict the choice; nevertheless, such restrictions are not absolute,
and there is, in any case, room to reflect on alternatives to what may be on
offer. The standard representations are generally based on two’s complement.
We do, however, wish to highlight the role that residue number systems (RNS)
can play.
It is well-known that RNS, because of its carry-free properties, is particu-
larly good for multiplication and addition [23]; and we have noted that inner-
product is particularly important here. So there is a natural fit, it seems. Now,
to date RNS have not been particularly successful, primarily because of the
difficulties in converting between RNS representations and conventional ones.
What must be borne in mind, however, is the old adage that computing is about
insight, not numbers; what that means in this context is that the issue of con-
version need come up only if it is absolutely necessary. Consider, for example,
a neural network that is used for classification. The final result for each input is
binary: either a classification is correct or it is not. So, the representation used
in the computations is a side-issue: conversion need not be carried out as long
as an appropriate output can be obtained. (The same remark, of course, applies
to many other problems and not just neural networks.) As for the constraints
of off-the-shelf FPGA devices, two things may be observed: first, FPGA cells
typically perform operations on small slices (say, 4-bit or 8-bit) that are per-
fectly adequate for RNS digit-slice arithmetic; and, second, below the level
of digit-slices, RNS arithmetic will in any case be realized in a conventional
notation.
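As a concrete illustration of carry-free residue arithmetic, the Python sketch below
computes an inner product digit-wise in each modulus channel; the moduli
(255, 256, 257) and the small weight and input vectors are illustrative choices only.

    MODULI = (255, 256, 257)     # pairwise coprime; dynamic range = 16,776,960

    def to_rns(x):
        return tuple(x % m for m in MODULI)

    def rns_mac(acc, w, x):
        # One multiply-accumulate, performed independently in each channel;
        # no carries propagate between channels.
        return tuple((a + (w % m) * (x % m)) % m for a, m in zip(acc, MODULI))

    weights, inputs = [1200, -3, 77], [14, 250, 9]
    acc = to_rns(0)
    for w, x in zip(weights, inputs):
        acc = rns_mac(acc, w, x)
    # The residues agree with those of the conventionally computed inner product.
    assert acc == to_rns(sum(w * x for w, x in zip(weights, inputs)))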
Figure 4: XtremeDSP chain-configuration for an inner-product
The other issue that is significant for representation is the precision used.
There have now been sufficient studies (e.g. [17]) that have established 16
bits for weights and 8 bits for activation-function inputs as good enough. With
this knowledge, the critical aspect then is when, due to considerations of per-
formance or cost, lower precision must be used. Then a careful process of
numerical analysis is needed.
Figure 5: XtremeDSP tree-configuration for an inner-product
Sum-of-products computations: There are several ways to implement this, de-
pending on the number of datasets. If there is just one dataset, then the opera-
tion is \sum_{i=1}^{N} w_i X_i, where w_i is a weight and X_i is an input. (In general, this
is the matrix-vector computation expressed by Equation 1.1.) In such a case,
with a device such as the Xilinx Virtex-4, there are several possible implemen-
tations, of which we now give a few sketches. If N is small enough, then two
direct implementations consist of either a chain (Figure 4) or a tree (Figure 5)
of DSP48 slices. Evidently, the trade-off is one of latency versus efficient use
of device logic: with a tree the use of tile logic is quite uneven and less efficient
than with a chain. If N is large, then an obvious way to proceed is to use a
combination of these two approaches. That is, partition the computation into
several pieces, use a chain for each such piece, and then combine in a tree the
results of these chains, or the other way around. But there are other possible
approaches: for example, instead of using chains, one DSP48 slice could be
used (with a feedback loop) to compute the result of each nominal chain, with
all such results then combined in a chain or a tree. Of course, the latency will
now be much higher.
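The hybrid scheme just described can be sketched functionally in Python as
follows; the chain length of 8 is an arbitrary illustrative choice, and the code models
only the arithmetic, not the DSP48 slices or their latencies.

    def chained_macs(weights, xs):
        # A chain of MAC stages: each stage adds one product to the running
        # sum, as in the cascade of Figure 4.
        acc = 0
        for w, x in zip(weights, xs):
            acc += w * x
        return acc

    def tree_sum(values):
        # Logarithmic-depth pairwise reduction, as in the adder tree of Figure 5.
        while len(values) > 1:
            values = [values[i] + values[i + 1] if i + 1 < len(values) else values[i]
                      for i in range(0, len(values), 2)]
        return values[0]

    def partitioned_inner_product(weights, xs, chain_len=8):
        # Split a long inner product into short chains, then combine the chain
        # outputs in a tree.
        pieces = [chained_macs(weights[i:i + chain_len], xs[i:i + chain_len])
                  for i in range(0, len(weights), chain_len)]
        return tree_sum(pieces)

    assert partitioned_inner_product(list(range(20)), list(range(20))) == \
           sum(i * i for i in range(20))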
With multiple datasets, any of the above approaches can be used, although
some are better than others — for example, tree structures are more amenable
to pipelining. But there is now an additional issue: how to get data in and
out at the appropriate rates. If the network is sufficiently large, then most of
the inputs to the arithmetic units will be stored outside the device, and the
number of device pins available for input/output becomes a major issue. In
this case, the organization of input/output is critical. So, in general, one needs
to consider both large datasets as well as multiple data sets. The following
discussions cover both aspects.
Storage and update of weights, input/output: For our purposes, Distributed-
RAM is too small to hold most of the data that is to be processed, and therefore,
in general Block-RAM will be used. Both weights and input values are stored
in a single block and simultaneously read out (as the RAM is dual-ported).
Of course, for very small networks, it may be practical to use the Distributed-
RAM, especially to store the weights; but we will in general assume networks
of arbitrary size. (A more practical use for Distributed-RAM is the storage of
constants used to implement activation functions.) Note that the disparity (dis-
cussed below) between the rate of inner-product computations and activation-
function computations means that there is more Distributed-RAM available for
this purpose than appears at first glance. For large networks, even the Block-
RAM may not be sufficient, and data has to be periodically loaded into and
retrieved from the FPGA device. Given pin-limitations, careful consideration
must be given to how this is done.
Let us suppose that we have multiple datasets and that each of these is very
large. Then, the matrix-vector product of Equation 1.1, that is,
u = Wy
becomes a matrix-matrix product,
U = WY,
where each column of Y is associated with one input dataset. The most
common method used for matrix-matrix multiplication is the inner-product
method; that is, each element of the output matrix is directly generated as an
inner-product of two vectors of the input matrices. Once the basic method has
been selected, the data must be processed — in particular, for large datasets,
this includes bringing data into, and retrieving data from, the FPGA — exactly
as indicated above. This is, of course, true for other methods as well.
Whether or not the inner-product method, which is a highly sequential
method, is satisfactory depends a great deal on the basic processor microar-
chitecture, and there are at least two alternatives that should always be consid-
ered: the outer-product and the middle-product methods.7 Consider a typical
“naive” sequential implementation of matrix multiplication. The inner-product
method would be encoded as three nested loops, the innermost of which com-
putes the inner-product of a vector of one of the input matrices and a vector of
the other input matrix:
for i := 1 to n do
    for j := 1 to n do
        for k := 1 to n do
            U[i,j] := U[i,j] + W[i,k]*Y[k,j];
(where we assume that the elements of U[i, j] have all been initialized to zero.)
Let us call this the ijk-method8, based on the ordering of the index-changes.
The only parallelism here is in the multiplication of individual matrix elements
and, to a much lesser extent (assuming the tree method is used instead of the
chain method), in the tree-summation of these products. That is, for n×n ma-
trices, the required n² inner-products are computed one at a time. The middle-
product method is obtained by interchanging two of the loops so as to yield
the jki-method. Now more parallelism is exposed, since n inner-products can
be computed concurrently; this is the middle-product method. And the outer-
product method is the kij-method. Here all parallelism is now exposed: all n²
inner products can be computed concurrently. Nevertheless, it should be noted
that no one method may be categorically said to be better than another — it all
depends on the architecture, etc.
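The three orderings can be compared side by side with the short Python sketch
below; it is purely illustrative, showing only that every ordering produces the same
product U = WY, and it does not model the differing amounts of concurrency.

    import numpy as np

    def matmul(W, Y, order="ijk"):
        # The ijk-, jki- and kij-methods differ only in the nesting of the index
        # loops; they expose 1, n and n*n concurrent MACs per step, respectively.
        n = W.shape[0]
        U = np.zeros((n, n))
        orders = {
            "ijk": [(i, j, k) for i in range(n) for j in range(n) for k in range(n)],
            "jki": [(i, j, k) for j in range(n) for k in range(n) for i in range(n)],
            "kij": [(i, j, k) for k in range(n) for i in range(n) for j in range(n)],
        }
        for i, j, k in orders[order]:
            U[i, j] += W[i, k] * Y[k, j]
        return U

    W = np.array([[1.0, 2.0], [3.0, 4.0]])
    Y = np.array([[5.0, 6.0], [7.0, 8.0]])
    assert all(np.allclose(matmul(W, Y, m), W @ Y) for m in ("ijk", "jki", "kij"))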
To put some meat to the bones above, let us consider a concrete example —
the case of 2 × 2 matrices. Further, let us assume that the multiply-accumulate
(MAC) operations are carried out within the device but that all data has to be
brought into the device. Then the process with each of the three methods is
shown in Table 1. (The horizontal lines delineate groups of actions that may
take place concurrently; that is, within a method, actions separated by a line
must be performed sequentially.)
A somewhat rough way to compare the three methods is to measure the ratio,
M : I, of the number of MACs carried out per data value brought into the
array. This measure clearly ranks the three methods in the order one would
expect; also note that by this measure the kij-method is completely efficient
(M : I = 1): every data value brought in is involved in a MAC. Nevertheless,
it is not entirely satisfactory: for example, it shows the kij-method to be
better than the jki-method only by a factor that is smaller than what our intuition
7The reader who is familiar with compiler technology will readily recognise these as vectorization (paral-
lelization) by loop-interchange.
8We have chosen this terminology to make it convenient to also include methods that have not yet been
“named”.
would lead us to expect. But if we now take another measure, the ratio of
M : I to the number, S, of MAC-steps (that must be carried out sequentially),
then the difference is apparent.
Lastly, we come to the main reason for our classification (by index-ordering)
of the various methods. First, it is evident that any ordering will work just
as well, as far as the production of correct results goes. Second, if the data
values are all of the same precision, then it is sufficient to consider just the
three methods above. Nevertheless, in this case dataflow is also important, and
it is easy to establish, for example, that where the jki-method requires (at each
input step) one weight and two input values, there is an ordering of indices that
requires two weights and one input value. Thus if weights are of higher precision,
the latter method may be better.
Inner-Product (ijk-method):
Input: W1,1, W1,2, Y1,1, Y2,1; MAC: t1 = t1 + W1,1 ∗ Y1,1; MAC: t1 = t1 + W1,2 ∗ Y2,1
Input: W1,1, W1,2, Y1,2, Y2,2; MAC: t2 = t2 + W1,1 ∗ Y1,2; MAC: t2 = t2 + W1,2 ∗ Y2,2
Input: W2,1, W2,2, Y1,1, Y2,1; MAC: t3 = t3 + W2,1 ∗ Y1,1; MAC: t3 = t3 + W2,2 ∗ Y2,1
Input: W2,1, W2,2, Y1,2, Y2,2; MAC: t4 = t4 + W2,1 ∗ Y1,2; MAC: t4 = t4 + W2,2 ∗ Y2,2
M : I = 0.5; (M : I)/S = 0.125 (S = 8)

Middle-Product (jki-method):
Input: W1,1, Y1,1, Y1,2; MAC: t1 = t1 + W1,1 ∗ Y1,1; MAC: t2 = t2 + W1,1 ∗ Y1,2
Input: W1,2, Y2,1, Y2,2; MAC: t1 = t1 + W1,2 ∗ Y2,1; MAC: t2 = t2 + W1,2 ∗ Y2,2
Input: W2,1, Y1,1, Y1,2; MAC: t3 = t3 + W2,1 ∗ Y1,1; MAC: t4 = t4 + W2,1 ∗ Y1,2
Input: W2,2, Y2,1, Y2,2; MAC: t3 = t3 + W2,2 ∗ Y2,1; MAC: t4 = t4 + W2,2 ∗ Y2,2
M : I = 0.667; (M : I)/S = 0.167 (S = 4)

Outer-Product (kij-method):
Input: W1,1, W2,1, Y1,1, Y1,2; MAC: t1 = t1 + W1,1 ∗ Y1,1; MAC: t2 = t2 + W1,1 ∗ Y1,2; MAC: t3 = t3 + W2,1 ∗ Y1,1; MAC: t4 = t4 + W2,1 ∗ Y1,2
Input: W1,2, W2,2, Y2,1, Y2,2; MAC: t1 = t1 + W1,2 ∗ Y2,1; MAC: t2 = t2 + W1,2 ∗ Y2,2; MAC: t3 = t3 + W2,2 ∗ Y2,1; MAC: t4 = t4 + W2,2 ∗ Y2,2
M : I = 1.0; (M : I)/S = 0.5 (S = 2)

Table 1: Matrix multiplication by three standard methods
Learning and other algorithms: The typical learning algorithm is usually
chosen on the basis of how quickly it leads to convergence (in most cases, on a
software platform). For hardware, this is not necessarily the best criterion:
algorithms need to be selected on the basis of how easily they can be implemented
in hardware and what the costs and performance of such implementations are.
Similar considerations should apply to other algorithms as well.
1.7 Activation-function implementation: unipolar
sigmoid
For neural networks, the implementation of activation functions is one of the two
most important arithmetic design-issues. Many techniques exist for evaluating
such elementary or nearly-elementary functions: polynomial approximations,
CORDIC algorithms, rational approximations, table-driven methods, and so
forth [4, 11]. For hardware implementation, accuracy, performance and cost
are all important. The latter two mean that many of the better techniques that
have been developed in numerical analysis (and which are easily implemented
in software) are not suitable for hardware implementation. CORDIC is perhaps
the most studied technique for hardware implementation, but it is (relatively)
rarely implemented: its advantage is that the same hardware can be used for
several functions, but the resulting performance is usually rather poor. High-
order polynomial approximations can give low-error implementations, but are
generally not suitable for hardware implementation, because of the number of
arithmetic operations (multiplications and additions) that must be performed
for each value; either much hardware must be used, or performance be com-
promised. And a similar remark applies to pure table-driven methods, unless
the tables are quite small: large tables will be both slow and costly. The prac-
tical implication of these constraints is as indicated above: the best techniques
from standard numerical analysis are of dubious worth.
Given trends in technology, it is apparent that at present the best technique
for hardware function-evaluation is a combination of low-order polynomials
and small look-up tables. This is the case for both ASIC and FPGA technolo-
gies, and especially for the latter, in which current devices are equipped with
substantial amounts of memory, spread through the device, as well as many
arithmetic units (notably multipliers and adders).9 The combination of low-order
polynomials (primarily linear ones) with small look-up tables is not new — the
main challenges have always been those of how to choose the best interpolation
points and how to ensure that the look-up tables remain small. Low-order
interpolation has two main advantages. The first is that exactly the same hardware structures
can be used to realize different functions, since only polynomial coefficients
(i.e. the contents of look-up tables) need be changed; such efficient reuse is not
possible with the other techniques mentioned above. The second is that it is
well-matched to current FPGA devices, which come with built-in multipliers,
adders, and memory.
The next subsection outlines the basics of our approach to linear interpola-
tion; the one after that discusses implementation issues; and the final subsec-
tion goes into the details of the underlying theory.
9This is validated by a recent study of FPGA implementations of various techniques [16].
1.7.1 Basic concepts
On the basis of the considerations above, we have chosen piecewise linear
interpolation for the approximation of the sigmoid function.
For most functions, interpolation with uniformly-sized intervals (i.e.
uniformly-spaced abscissae) is not ideal; in the case of the sigmoid, it is evident
that, ideally, more intervals should be used as the magnitude of the argument
increases. Nevertheless, for hardware implementation, the need to quickly
map arguments onto the appropriate intervals dictates the use of such inter-
vals. With this choice and linear interpolation, the critical issue then becomes
that of what function-value to associate with each interval. The most common
choice is to arbitrarily select the value at the midpoint of the interval — that
is, if x ∈ [L, U], then f̃(x) = f(L/2 + U/2) — or to choose a value that
minimizes absolute errors. 10 Neither is particularly good. As we shall show,
even with a fixed number of intervals, the best function-value for an interval is
generally not the midpoint. And, depending on the “curvature” of the function
at hand, relative error may be more critical than absolute error. For example,
for the sigmoid function, f(x) = 1/(1 + e^(−x)), we have a function whose graph
is symmetric about the point (0, 1/2), but the relative error grows more rapidly
on one side of the y-axis than the other, and on both sides the growth depends on the interval.
Thus, the effect of a given value of absolute error is not constant or even linear.
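The following small C sketch (an added illustration, not from the original text; the two intervals are chosen arbitrarily) makes the point numerically: it applies the crude mid-point-value approximation to the sigmoid on a pair of equal-width intervals that mirror each other about the y-axis. The worst absolute errors come out identical, but the worst relative errors differ considerably, because the function values on the negative side are much smaller.

    #include <math.h>
    #include <stdio.h>

    static double sigmoid(double x) { return 1.0 / (1.0 + exp(-x)); }

    /* Scan [lo, hi] and record the worst absolute and relative error of the
       crude approximation f(x) ~ f(midpoint). */
    static void scan(double lo, double hi, double *abs_err, double *rel_err) {
        double fm = sigmoid(0.5 * (lo + hi));
        *abs_err = 0.0; *rel_err = 0.0;
        for (int i = 0; i <= 1000; i++) {
            double x  = lo + (hi - lo) * i / 1000.0;
            double ae = fabs(sigmoid(x) - fm);
            double re = ae / sigmoid(x);
            if (ae > *abs_err) *abs_err = ae;
            if (re > *rel_err) *rel_err = re;
        }
    }

    int main(void) {
        double a_abs, a_rel, b_abs, b_rel;
        scan( 4.0,  4.5, &a_abs, &a_rel);   /* interval on the positive side */
        scan(-4.5, -4.0, &b_abs, &b_rel);   /* its mirror image              */
        printf("[ 4.0, 4.5]: max abs err %.3e, max rel err %.3e\n", a_abs, a_rel);
        printf("[-4.5,-4.0]: max abs err %.3e, max rel err %.3e\n", b_abs, b_rel);
        return 0;
    }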
The general approach we take is as follows. Let I = [L, U] be a real interval
with L < U, and let f : I → R be a function to be approximated (where R
denotes the set of real numbers). Suppose that f̃ : I → R is a linear function
— that is, f̃(x) = c1 + c2x, for some constants c1 and c2 — that approximates
f. Our objective is to investigate the relative-error function

\varepsilon(x) = \frac{f(x) - \tilde{f}(x)}{f(x)}, \qquad x \in I, \qquad (Err)

and to find c1 and c2 such that ε(x) is small. One way to obtain reasonably
good values for c1, c2 is to impose the conditions

\tilde{f}(L) = f(L), \qquad \tilde{f}(U) = f(U), \qquad (C)

and then compute values for c1 and c2. But a much better result can be obtained
using the “improved” condition

|\varepsilon(L)| = |\varepsilon(U)| = |\varepsilon(x_{stat})|, \qquad (IC)

where xstat (stationary point) is the value of x for which ε(x) has a local
extremum. An example of the use of this technique to approximate reciprocals
10Following [12], we use absolute error to refer to the difference between the exact value and its approxi-
mation; that is, it is not the absolute value of that difference.
can be found in [4, 10] for the approximation of divisor reciprocals and square-
root reciprocals. It is worth noting, however, that in [10], ε(x) is taken to be
the absolute-error function. This choice simplifies the application of (IC), but,
given the curvature of these functions, it is not as good as the relative-error
function above. We will show, in Section 7.3, that (IC) can be used successfully
for the sigmoid function, despite the fact that finding the exact value for xstat
may not be possible. We show that, compared with the results from using the
condition (C), the improved condition (IC) yields a massive 50% reduction in
the magnitude of the relative error. We shall also give the analytical formulae
for the constants c1 and c2. The general technique can be extended to other
functions, with equal or less ease [13], but we shall here consider only
the sigmoid function, which is probably the most important one for neural
networks.
Figure 6: Hardware organization for piecewise linear interpolation
1.7.2 Implementation
It is well-known that use of many interpolation points generally results in
better approximations. That is, subdividing a given interval into several subin-
tervals and keeping to a minimum the error on each of the subintervals im-
proves accuracy of approximation for the given function as a whole. Since for
computer-hardware implementations it is convenient that the number of data
points be a power of two, we will assume that the interval I is divided into 2k
intervals: L, L + ∆
2k , L + ∆
2k , L + 2∆
2k , . . . , L + 2k−1∆
2k , U , where ∆ =
U − L. Then, given an argument, x, the interval into which it falls can read-
ily be located by using, as an address, the k most significant bits of the binary
representation of x. The basic hardware implementation therefore has the high-
level organization shown in Figure 6. The two memories hold the constants c1
and c2 for each interval.
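As a purely behavioural illustration (added here; not part of the original text, and with the interval count and coefficient choice picked arbitrarily), the following C sketch mimics the organization of Figure 6: the argument is mapped to one of 2^k uniform intervals, two small tables play the role of the memories holding c1 and c2, and a single multiply-add produces the result. The tables are filled using the simple endpoint condition (C) as a stand-in; improved coefficients would be computed off-line in the same way and simply loaded instead.

    #include <math.h>
    #include <stdio.h>

    #define K 4                       /* number of index bits: 2^K intervals  */
    #define NINTERVALS (1 << K)

    /* Tables playing the role of the two memories of Figure 6; in hardware these
       would be Block-RAM or Distributed-RAM contents computed off-line. */
    static double c1_tab[NINTERVALS], c2_tab[NINTERVALS];

    static double sigmoid(double x) { return 1.0 / (1.0 + exp(-x)); }

    static void build_tables(double L, double U) {
        double delta = (U - L) / NINTERVALS;
        for (int i = 0; i < NINTERVALS; i++) {
            double a = L + i * delta, b = a + delta;
            double c2 = (sigmoid(b) - sigmoid(a)) / (b - a);   /* slope     */
            c1_tab[i] = sigmoid(a) - c2 * a;                   /* intercept */
            c2_tab[i] = c2;
        }
    }

    /* The "datapath": interval index from the scaled argument, then one
       multiply-add, c1 + c2*x. */
    static double sigmoid_pwl(double x, double L, double U) {
        int idx = (int)((x - L) / (U - L) * NINTERVALS);
        if (idx < 0) idx = 0;
        if (idx >= NINTERVALS) idx = NINTERVALS - 1;
        return c1_tab[idx] + c2_tab[idx] * x;
    }

    int main(void) {
        const double L = 0.5, U = 1.0;
        build_tables(L, U);
        for (double x = L; x <= U; x += 0.05)
            printf("x=%5.2f  approx=%.6f  exact=%.6f\n",
                   x, sigmoid_pwl(x, L, U), sigmoid(x));
        return 0;
    }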
Figure 7: High-performance hardware organization for function evaluation
Figure 6 is only here to be indicative of a “naive” implementation, although
it is quite realistic for some current FPGAs. For a high-speed implementa-
tion, the actual structure may differ in several ways. Consider for example the
multiplier-adder pair. Taken individually, the adder must be a carry-propagate
adder (CPA); and the multiplier, if it is of high performance, will consist of an
array of carry-save adders (CSAs) with a final CPA to assimilate the partial-
sum/partial-carry (PC/PS) output of the CSAs. But the multiplier-CPA may be
replaced with two CSAs, to yield much higher performance. Therefore, in a
high speed implementation the actual structure would have the form shown in
Figure 7.
Nevertheless, for FPGAs, the built-in structure will impose some con-
straints, and the actual implementation will generally be device-dependent. For
example, for a device such as the Xilinx Virtex-4, the design of Figure 6 may
be implemented more or less exactly as given: the DSP48 slice provides the
multiply-add function, and the constants, c1 and c2, are stored in Block-RAM.
They could also be stored in Distributed-RAM, as it is unlikely that there will
be many of them. Several slices would be required to store the constants at the
required precision, but this is not necessarily problematic: observe that each
instance of activation-function computation corresponds to several MACs (bit
slices).
All of the above is fairly straightforward, but there is one point that needs a
particular mention: Equations 1.1 and 1.2 taken together imply that there is an
inevitable disparity between the rate of inner-product (MACs) computations
and activation-function computations. In custom design, this would not cause
particular concern: both the design and the placement of the relevant hard-
ware units can be chosen so as to optimize cost, performance, etc. But with
FPGAs, this luxury does not exist: the mapping of a network to a device, the
routing requirements to get an inner-product value to the correct place for the
activation-function computation, the need to balance the disparate rates ... all
these mean that the best implementation will be anything but straightforward.
1.7.3 Theory
We shall illustrate our results with detailed numerical data obtained for a
fixed number of intervals. All numerical computations were carried out in
the computer algebra system MAPLE [24] for the interval11 I = [0.5, 1] and
k = 4; that is, I was divided into the 16 intervals:
[1/2, 17/32], [17/32, 9/16], [9/16, 19/32], [19/32, 5/8], . . . , [31/32, 1].
We have used MAPLE to perform many of the complex symbolic computations.
Floating-point calculations in MAPLE are carried out in finite precision, with
11Note that evaluation on any other interval can be transformed into evaluation on the interval [0.5, 1].
intermediate results rounded to a precision that is specified by the MAPLE con-
stant Digits. This constant controls the number of digits that MAPLE uses for
calculations. Thus, generally, the higher the Digits value, the higher the accuracy
of the obtainable results, with roundoff errors as small as possible. (This,
however, cannot be fully controlled in the case of complex algebraic expressions.)
We set the Digits value to 20 for numerical computations. Numerical results will
be presented using standard (decimal) scientific notation.
Applying condition (C) of Section 7.1 to the sigmoid function, we get

c_1 + c_2 L = \frac{1}{1 + e^{-L}}, \qquad c_1 + c_2 U = \frac{1}{1 + e^{-U}}.
For simplicity, we will use θ to denote the expression

\theta = U e^{L} - L e^{U} - L e^{(L+U)} + U e^{(L+U)}.

Then the solution of the above system may be expressed as

c_1 = \frac{\theta}{\theta + U - L + U e^{U} - L e^{L}}, \qquad c_2 = \frac{e^{U} - e^{L}}{\theta + U - L + U e^{U} - L e^{L}},

and the approximation function f̃(x) = c1 + c2x takes the form

\tilde{f}(x) = \frac{\theta + x\left(e^{U} - e^{L}\right)}{\theta + U - L + U e^{U} - L e^{L}}, \qquad x \in I.
The relative error is now
\varepsilon(x) = \frac{U - L + U e^{U} - L e^{L} - \theta e^{-x} - x e^{U} + x e^{L} - x e^{(U-x)} + x e^{(L-x)}}{\theta + U - L + U e^{U} - L e^{L}}, \qquad x \in I.
Figure 8: Error in piecewise-linear approximation of the sigmoid
Figure 8 shows the results for the 16-interval case. As the graphs show, the
amplitude of the error attains a maximum in each of the sixteen intervals. To
ensure that this is in fact so on any interval, we investigate the derivatives of the
error function.
The first derivative of the error function is
\varepsilon'(x) = \frac{e^{L} - e^{U} + e^{(L-x)} - e^{(U-x)} + \theta e^{-x} + x\left(e^{(U-x)} - e^{(L-x)}\right)}{\theta + (U - L) + U e^{U} - L e^{L}}, \qquad x \in I.

A closer look at the formula for the derivative, followed by simple algebraic
computations, reveals that the equation ε′(x) = 0 is reducible to the equation

A e^{x} = B + C x, \qquad \text{for some constants } A, B, C.
The solution of this equation can be expressed in terms of the Lambert W function, which has
been extensively studied in the literature; and many algorithms are known for
the computation of its values.12 Since the Lambert W function cannot be ana-
lytically expressed in terms of elementary functions, we leave the solution of
12The reader interested in a recent study of the Lambert W function is referred to [9].
our equation in the form
x_{stat} = \frac{\left(e^{L} - e^{U}\right)\,\mathrm{LambertW}\!\left(-e^{\,U - \frac{\theta + (e^{U} - e^{L})(U - 1)}{e^{U} - e^{L}}}\right) - \theta + e^{U} - e^{L}}{e^{U} - e^{L}},
where LambertW is the MAPLE notation for the Lambert W function. There is
no straightforward way to extend our results to an arbitrary interval I. So, for
the rest of this section we will focus on the 16-interval case, where, with the
help of MAPLE, we may accurately ensure validity of our findings. It should
nevertheless be noted that since this choice of intervals was quite arbitrary
(within the domain of the investigated function), the generality of our results
is in no way invalidated. Figure 9 shows plots of the first derivative of the
relative-error function on sixteen intervals, confirming that there exists a local
maximum on each interval for this function.
From Figure 9, one can infer that on each interval the stationary point occurs
somewhere near the mid-point of the interval. This is indeed the case, and the
standard Newton-Raphson method requires only a few iterations to yield a rea-
sonably accurate approximation to this stationary value. (To have full control
over the procedure we decided not to use MAPLE’s built-in approximation
method for Lambert W function values.) For the 16-interval case, setting the
tolerance to 10−17 and starting at the mid-point of each interval, the required
level of accuracy is attained after only three iterations. For the stationary points
thus found, the magnitude of the maximum error is
\varepsilon_{max} = 1.5139953883 \times 10^{-5} \qquad (1.3)
which corresponds to 0.3 on the “magnified” graph of Figure 9.
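As a small illustration of the procedure just described (added here; not the authors' MAPLE code), the C sketch below carries out the Newton-Raphson search on the first of the sixteen intervals under condition (C). Writing the relative error as ε(x) = 1 − (c1 + c2·x)(1 + e^(−x)), its first two derivatives have simple closed forms, so the iteration x ← x − ε′(x)/ε″(x), started at the mid-point, settles on the stationary point within a few steps.

    #include <math.h>
    #include <stdio.h>

    /* Relative error of the linear approximation c1 + c2*x to the sigmoid,
       eps(x) = 1 - (c1 + c2*x)*(1 + exp(-x)), and its first two derivatives. */
    static double eps1(double x, double c1, double c2) {          /* eps'(x)  */
        return -c2 * (1.0 + exp(-x)) + (c1 + c2 * x) * exp(-x);
    }
    static double eps2(double x, double c1, double c2) {          /* eps''(x) */
        return (2.0 * c2 - c1 - c2 * x) * exp(-x);
    }

    int main(void) {
        /* First of the sixteen intervals of the text: [1/2, 17/32]. */
        const double L = 0.5, U = 17.0 / 32.0;
        double fL = 1.0 / (1.0 + exp(-L)), fU = 1.0 / (1.0 + exp(-U));

        /* Coefficients from the endpoint condition (C). */
        double c2 = (fU - fL) / (U - L);
        double c1 = fL - c2 * L;

        /* Newton-Raphson on eps'(x) = 0, started at the mid-point. */
        double x = 0.5 * (L + U);
        for (int i = 0; i < 10; i++) {
            double step = eps1(x, c1, c2) / eps2(x, c1, c2);
            x -= step;
            if (fabs(step) < 1e-15) break;
        }
        printf("stationary point x = %.12f, eps(x) = %.10e\n",
               x, 1.0 - (c1 + c2 * x) * (1.0 + exp(-x)));
        return 0;
    }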
We next apply the improved condition (IC) to this approximation. By (Err),
we have
\varepsilon(x) = 1 - c_1 - c_1 e^{-x} - c_2 x - c_2 x e^{-x} \qquad (1.4)

hence

\varepsilon(L) = 1 - c_1 - c_1 e^{-L} - c_2 L - c_2 L e^{-L} \qquad (1.5)

\varepsilon(U) = 1 - c_1 - c_1 e^{-U} - c_2 U - c_2 U e^{-U}. \qquad (1.6)
From Equations (1.5) and (1.6) we get an equation that we can solve for c2:
c_2 = \frac{c_1\left(e^{-L} - e^{-U}\right)}{U + U e^{-U} - L - L e^{-L}} \qquad (1.7)
Substituting for c2 in Equation (1.4) yields the final formula for the relative
error
\varepsilon(x) = 1 - c_1 - c_1 e^{-x} + \frac{c_1\left(e^{-U} - e^{-L}\right)\left(x + x e^{-x}\right)}{U + U e^{-U} - L - L e^{-L}}, \qquad x \in I.
To study the error magnitude we investigate its first derivative for x ∈ I:
\varepsilon'(x) = c_1\, \frac{U e^{-x} + U e^{(-x-U)} - L e^{-x} - L e^{(-x-L)} + e^{-U} - e^{-L} + e^{(-x-U)} - e^{(-x-L)} - x e^{(-x-U)} + x e^{(-x-L)}}{U + U e^{-U} - L - L e^{-L}}, \qquad x \in I.
We may assume without loss of generality that c1 is positive: the graph of
the sigmoid function indicates that this is a valid assumption. For simplicity,
let us assume that c1 = 1 and see how the first derivative behaves on the
sixteen intervals. Figure 10 shows the graphs of the first derivative of the new
error function. From these plots we see that within each interval the derivative
changes sign at a unique stationary point. Finding an exact analytical formula
for the values of these points is not possible, because, as above, the equation
ε′(x) = 0 reduces to a Lambert-type equation. So, once again, we apply
the Newton-Raphson method to get some reasonably accurate estimate values.
Starting the iteration from the mid-point, we obtain a good approximation after
just a few iterations.
Figure 9: Plots of first derivative of sigmoid error-function
Figure 10: Plots of first derivative of improved sigmoid error-function
It is critical to note that although the Newton-Raphson procedure is easily
(and frequently) implemented in hardware [4], in this case a software imple-
mentation is sufficient. The procedure is required only to obtain c1 and c2, and
once these have been obtained off-line and stored in memory, the procedure is
not relevant.
Let xa denote the approximate value at which the error has a local ex-
tremum. Then, by the final formula for the relative error, we have

\varepsilon(x_a) = 1 - c_1 - c_1 e^{-x_a} + \frac{c_1\left(e^{-U} - e^{-L}\right)\left(x_a + x_a e^{-x_a}\right)}{U + U e^{-U} - L - L e^{-L}}.
Since, by condition (IC), we must have ε(xa) = −ε(L), we end up with one
equation with one variable c1. Solving this equation gives us the required sec-
ond parameter for our approximation function f̃(x) = c1 + c2x. We omit
tedious (but elementary) algebraic calculations, presenting only the final for-
mula:

c_1 = -2\left(U + U e^{-U} - L - L e^{-L}\right) \Big/ \Big( -2U - 2U e^{-U} + 2L + 2L e^{-L} - e^{-x_a} U \left(1 + e^{-U}\right) + e^{-x_a} L \left(1 + e^{-L}\right) + x_a e^{-U} - x_a e^{-L} + x_a e^{-x_a - U} - x_a e^{-x_a - L} - U e^{-L} - U e^{-L-U} + L e^{-U} + L e^{-L-U} \Big),
which, by Equation (1.7), yields the final formula for c2. Finally, we substitute
the c1, c2 values into the relative-error formula, expressed by Equation (1.4). (Note
that xa must be replaced by the corresponding approximate value from the
Newton-Raphson procedure.) We do not present the final analytical formula
for the error as it is quite complex and of little interest by itself. Figure 11
shows the results finally obtained. The magnitude of the maximum relative
error is now
\varepsilon_{max} = 7.5700342463 \times 10^{-6},

which, compared with (1.3), is a reduction of 50.00038%. This concludes the
exposition.
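To tie the steps of this section together, the following C sketch (an added illustration on a single interval; not the authors' code) reproduces the procedure numerically: it fixes the ratio c2/c1 from Equation (1.7), locates the stationary point xa by Newton-Raphson (its location does not depend on c1, since c1 is only a positive scale factor), solves ε(xa) = −ε(L) for c1, and compares the resulting maximum relative error with that obtained under condition (C).

    #include <math.h>
    #include <stdio.h>

    /* With the ratio r = c2/c1 fixed, eps(x) = 1 - c1*A(x), where
       A(x) = (1 + r*x)*(1 + exp(-x)); A1 and A2 are its derivatives. */
    static double A (double x, double r) { return (1.0 + r * x) * (1.0 + exp(-x)); }
    static double A1(double x, double r) { return r * (1.0 + exp(-x)) - (1.0 + r * x) * exp(-x); }
    static double A2(double x, double r) { return (1.0 + r * x - 2.0 * r) * exp(-x); }

    int main(void) {
        const double L = 0.5, U = 17.0 / 32.0;      /* first of the 16 intervals */
        double fL = 1.0 / (1.0 + exp(-L)), fU = 1.0 / (1.0 + exp(-U));

        /* Condition (C): interpolate the endpoints exactly. */
        double c2C = (fU - fL) / (U - L), c1C = fL - c2C * L;

        /* Ratio c2/c1 imposed by Equation (1.7); it coincides with c2C/c1C, so
           both linear approximations share the same stationary point xa. */
        double r = (exp(-L) - exp(-U)) / (U + U * exp(-U) - L - L * exp(-L));

        /* Newton-Raphson on A'(x) = 0 (equivalently eps'(x) = 0), from the mid-point. */
        double xa = 0.5 * (L + U);
        for (int i = 0; i < 10; i++) xa -= A1(xa, r) / A2(xa, r);

        /* Improved condition (IC): eps(xa) = -eps(L)  =>  c1 = 2/(A(xa) + A(L)). */
        double c1 = 2.0 / (A(xa, r) + A(L, r));

        double errC  = fabs(1.0 - (c1C + c2C * xa) * (1.0 + exp(-xa)));  /* max |eps| under (C)  */
        double errIC = fabs(1.0 - c1 * A(L, r));                         /* max |eps| under (IC) */
        printf("max |eps|  (C): %.10e\n", errC);
        printf("max |eps| (IC): %.10e  (~50%% smaller)\n", errIC);
        return 0;
    }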
Figure 11: Plots of improved sigmoid error-function
1.8 Performance evaluation
Having outlined above the promise of realizing neural networks in hard-
ware, we now come to an especially delicate point — that of actually showing
that “semi-custom” (i.e. FPGA) or custom (i.e. ASIC) neurocomputers can
actually deliver what is promised. In this respect, the neural-network commu-
nity has not especially distinguished itself, which, in turn, explains the dearth
of practical neurocomputers, despite the many years of research, the development
of countless prototypes, and so forth. (This point can be readily appreciated by
comparing the status of performance-evaluation for neurocomputers with that for
conventional computers.) We will not here aim to solve
the underlying problems or even suggest specific concrete solutions — either
being an enormous task that is outside the scope of this work — but it is our
objective to sharply highlight them and to indicate general directions for their
solution.
At the very least there are two issues that must be considered for a proper
evaluation of performance: the metrics used and what benchmarks are used to
obtain the measurements. (Both need to be such that they can, in the main,
be easily agreed on and understood by most users.) The neural-network area
is sorely lacking in both. The most commonly used metrics are connections-
per-second (CPS), which is essentially the rate at which neuron multiply-add
operations are carried out, and connection-updates-per-second (CUPS), which
is essentially the rate at which updates of weights are carried out. Speedup is
also sometimes used but is of even less value than CPS and CUPS; and even
worse is to rely on just time-complexity, e.g. [1]. There are three main problems
with these metrics:
MCPS and MCUPS are similar to, and suffer from the same drawbacks
as, MIPS (Millions of Instructions Per Second) and MFLOPS (Mil-
lions of Floating-point Operations Per Second), which were long used for
general-purpose computers and are now largely discredited as useful mea-
sures.
MCPS and MCUPS cannot always be meaningfully used with all types
of networks; for example, they are of little worth with radial-basis-
function networks.
A large value of MCPS or MCUPS does not necessarily mean better
performance (i.e. in the sense of a network being faster) when applied
to different algorithms. In other words, interpreting the numbers is not
straightforward.
As is the case for general-purpose computers, there is little doubt that ulti-
mately the best measure of performance is actual execution-time [19].
As an example of the untidy state of affairs that currently exists, consider,
for example [3], which has already been partially discussed above. In that
work, execution-time is one of the main metrics used, which is acceptable as
far as it goes. But things quickly become problematic. First, although some
neural-network applications have been chosen for use as benchmarks, no ba-
sis is given for why and how they have been chosen. Second, it is not quite the
case that exactly the same programs are being used for benchmarks: the au-
thors compare a Reactive Tabu Search learning algorithm (on their machine)
against Back-Propagation learning (on two other machines). Third, given the
variety and non-uniform use of various metrics, what the target is is far from
clear: is it, say, the proportion of patterns that are correctly classified? If
so, then that should be fixed and measurements then made of the other metrics
(error, execution-time, etc.) and the results then used to compare the different
machines. The same remark applies to, say, fixing error limits and then mea-
suring execution time, or number of iterations/epochs, or patterns-classified,
and so forth. As it is, the authors allow all parameters to simultaneously vary
in a manner that practically renders meaningless the use of execution-time as
a reasonable metric. Fourth, details of algorithm-implementations on the two
other machines are not given, which begs the question of whether they were the
best possible. Contrast this with the standards of SPEC [20], in which “own-
ers” (usually the manufacturers) of given machines run (under highly regulated
conditions) specific programs, under tight constraints, and then publicly report
the results, thus ensuring best-effort implementations.
In summary, the area of performance-evaluation — in particular the choice
of performance metrics and the selection of benchmarks — is one that needs to be
addressed urgently for neurocomputers. Neurocomputers (whether in ASIC or
FPGA) will not achieve widespread use unless potential users can be convinced
of their worth; given their history so far, it should now be clear that merely
extolling their virtues is insufficient.
1.9 Conclusions
This chapter covers a broad range of topics, many of which are discussed
in more detail in following chapters. In the case of arithmetic, we have high-
lighted inner-product and activation-function computations. We have advanced
the use of piecewise linear interpolation, but the story need not end there: al-
though interpolations of degree higher than three do not appear to be ap-
propriate for high-speed hardware implementations, there may be some profit
in the search for second-order ones that are well-suited to FPGAs. Chapter 2
discussed further aspects of arithmetic.
We have discussed the main types of parallelism that are to be found in
neural networks, but little of that discussion has addressed the matter of map-
pings. With very large networks, the mapping from network to FPGA is a
major factor in performance. Chapters 3 and 4 are devoted to a suitable the-
oretical framework for the derivation of such mappings. Chapter 5 also deals
with mappings but is limited to back-propagation in the context of an actual
device; Chapter 6 is similar in that it is limited to associative memories.
Chapters 7 through 11 cover the FPGA-implementation of neural networks
for several specific applications. The last chapter is a retrospective: it discusses
various lessons learned from the realization of a custom neurocomputer and
projects these to current FPGAs.
References
[1] U. Ruckert. 2002. ULSI architectures for artificial neural networks. IEEE
Micro (May–June): 10–19.
[2] J. Buddefeld and K. E. Grosspietsch. 2002. Intelligent-memory architec-
tures for artificial neural networks. IEEE Micro (May–June): 32–40.
[3] G. Danese, F. Leoporati, and S. Ramat. 2002. A parallel neural processor
for real-time applications. IEEE Micro (May–June): 20–31.
[4] A. R. Omondi. Computer Arithmetic Systems: Algorithms, Architecture,
and Implementations. Prentice-Hall, UK, 1994.
[5] T. Nordstrom and B. Svensson. 1991. Using and designing massively par-
allel computers for artificial neural networks. Journal of Parallel and Dis-
tributed Computing, 14:260–285.
[6] Y. Hirai. 1993. Hardware implementations of neural networks in Japan.
Neurocomputing, 5:3–16.
[7] N. Sundarajan and P. Satchandran. 1998. Parallel Architectures for Arti-
ficial Neural Networks. IEE Press, California.
[8] D. Hammerstrom. 1991. A highly parallel digital architecture for neural
network simulation. In: J.D. Delgado-Frias and W.R. Moore, Eds., VLSI
for Artificial Intelligence and Neural Networks, Plenum Press.
[9] R. M. Corless, G. H. Gonnet, D. E. G. Hare, D. J. Jeffrey, and D. E.
Knuth. 1996. On the Lambert W Function. Advances in Computational
Mathematics, 12:329–359.
[10] M. Ito, N. Takagi, and S. Yajima. 1997. Efficient initial approximation for
multiplicative division and square-root by a multiplication with operand
modification. IEEE Transactions on Computers, 46(4):495–498.
[11] J. M. Muller. 1997. Elementary Functions: Algorithms and Implementa-
tion. Birkhauser, Boston, USA.
[12] S. M. Pizer and V. L. Wallace. 1983. To Compute Numerically. Little,
Brown, and Co., Boston, USA.
[13] M. Bajger and A. R. Omondi. 2005. Low-cost, high-speed implemen-
tations of square-root, exponential and sigmoidal function-evaluations.
Submitted for publication.
[14] S. Vassiliadis, M. Zhang, and J. G. Delgado-Frias. 2000. Elementary
function generators for neural network emulators. IEEE Transactions on
Neural Networks, 11(6):1438–1449.
[15] K. Basterretxea, J. M. Tarela, and I. del Campo. 2004. Approximation
of sigmoid function and the derivative for hardware implementation of
artificial neurons. IEEE Proceedings — Circuits, Devices, and Systems,
151(1):18–24.
[16] O. Mencer and W. Luk. 2004. Parameterized high throughput function
evaluation for FPGAs. Journal of VLSI Signal Processing, 36:17–25.
[17] J. L. Holt and J. N. Hwang. 1993. Finite-precision error analysis of neural
network hardware implementations. IEEE Transactions on Com-
puters, 42(3):280–290.
[18] A. R. Omondi. 2000. Neurocomputers: a dead end? International Journal
of Neural Systems, 10(6):475–481.
[19] J. L. Hennessy and D. A. Patterson. 2002. Computer Architecture: A
Quantitative Approach. Morgan Kaufmann.
[20] SPEC. Standard Performance Evaluation Corporation. (www.spec.org)
[21] Xilinx. 2004. Virtex-4 User Guide.
[22] Xilinx. 2004. XtremeDSP Design Considerations: User Guide.
[23] A. P. Preethy, D. Radhakrishnan, A. R. Omondi. Mar 2001. A high-
performance residue-number-system multiply-accumulate unit. In: 11th
ACM Great Lakes Symposium on VLSI (Purdue, Indiana, USA), pp 145–
149.
[24] Waterloo Maple Inc. Maple 8 Programming Guide, 2002.
for each training session especially with large networks. Network topology is
an important factor in the network’s ability to generalize after training is com-
pleted. A larger than needed network may over-fit the training data and result
in poor generalization on testing data, while a smaller than needed network
may not have the computational capacity to approximate the target function.
Furthermore, in applications where online training is required, training time
is often a critical parameter. Thus it is quite desirable to speed up training.
This allows for reasonable experimentation with various network topologies
and the ability to use BP networks in online applications.
Since Neural Networks in general are inherently parallel architectures [55],
there have been several earlier attempts to build custom ASIC-based boards
that include multiple parallel processing units such as the NI1000. However,
these boards suffered from several limitations such as the ability to run only
specific algorithms and limitations on the size of a network. Recently, much
work has focused on implementing artificial neural networks on reconfigurable
computing platforms. Reconfigurable computing is a means of increasing the
processing density (i.e. greater performance per unit of silicon area) above and
beyond that provided by general-purpose computing platforms. Field Program-
mable Gate Arrays (FPGAs) are a medium that can be used for reconfigurable
computing and offer flexibility in design like software but with performance
speeds closer to Application Specific Integrated Circuits (ASICs).
However, there are certain design tradeoffs which must be dealt with in or-
der to implement Neural Networks on FPGAs. One major tradeoff is area vs.
precision. The problem is how to balance between the need for numeric preci-
sion, which is important for network accuracy and speed of convergence, and
the cost of more logic areas (i.e. FPGA resources) associated with increased
precision. Standard-precision floating-point would be the ideal numeric repre-
sentation to use because it offers the greatest amount of precision (i.e. minimal
quantization error) and matches the representation used in simulating Neural
Networks on general purpose microprocessors. However, due to the limited
resources available on an FPGA, standard floating-point may not be as feasible
compared to more area-efficient numeric representations, such as 16 or 32 bit
fixed-point.
This chapter explores this design trade-off by testing an implementation of
an MLP-BP network on an FPGA using both floating-point and fixed-point
representations. The network is trained to learn the XOR problem. The study’s
goal is to provide experimental data regarding what resources are required for
both formats using current FPGA design tools and technologies. This chap-
ter is organized as follows: In Section 2.2, background material on the area
vs precision range trade-off is presented as well as an overview of the back-
propagation algorithm and FPGA architectures. Section 2.3 provides details
about the architecture design used to implement a BP network on FPGA. In
Section 2.4 the XOR problem is presented. Finally, validation of the proposed
implementations and benchmarked results of floating-point and fixed-point
arithmetic functions implemented on an FPGA are discussed in Section 2.5.
2.2 Background
One way to help achieve the density advantage of reconfigurable comput-
ing over general-purpose computing is to make the most efficient use of the
hardware area available. In terms of an optimal range-precision vs area trade-
off, this can be achieved by determining the minimum allowable precision and
minimum allowable range, where the criterion is to minimize hardware area
usage without sacrificing quality of performance. These two concepts com-
bined can also be referred to as the minimum allowable range-precision.
2.2.1 Range-Precision vs. Area Trade-off
A reduction in precision usually introduces additional quantization error into the system.
Determining the minimum allowable precision is actually a question of de-
termining the maximum amount of uncertainty (i.e. quantization error due to
limited precision) an application can withstand before performance begins to
degrade. It is often dependent upon the algorithm used and the application at
hand.
For MLP using the BP algorithm, Holt and Baker [41] showed using simu-
lations and theoretical analysis that 16-bit fixed-point (1 bit sign, 3 bit left and
12 bit right of the radix point) was the minimum allowable range-precision
for the back-propagation algorithm assuming that both input and output were
normalized between [0,1] and a sigmoid transfer function was used.
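As a software-level illustration of what this representation means (an added sketch, not part of the study by Holt and Baker), the following C fragment converts values to and from a 16-bit fixed-point format with 1 sign bit, 3 bits to the left and 12 bits to the right of the radix point; the rounding in the conversion and the rescaling in the multiply are where the quantization error discussed in this section enters.

    #include <stdint.h>
    #include <stdio.h>
    #include <math.h>

    /* 16-bit signed fixed-point with 12 fractional bits (1 sign, 3 integer,
       12 fraction), i.e. a resolution of 2^-12 and a range of about [-8, 8). */
    typedef int16_t fix16_t;
    #define FRAC_BITS 12
    #define FIX_ONE   (1 << FRAC_BITS)

    static fix16_t to_fix(double x)  { return (fix16_t)lrint(x * FIX_ONE); }
    static double  to_dbl(fix16_t x) { return (double)x / FIX_ONE; }

    /* Fixed-point multiply: 16x16 -> 32-bit product, then rescale.  An arithmetic
       right shift is assumed for negative products, and the result is assumed to
       stay within the format's range. */
    static fix16_t fix_mul(fix16_t a, fix16_t b) {
        return (fix16_t)(((int32_t)a * (int32_t)b) >> FRAC_BITS);
    }

    int main(void) {
        double w = 0.37, y = 0.82;                 /* a weight and an input */
        fix16_t wq = to_fix(w), yq = to_fix(y);
        double p = to_dbl(fix_mul(wq, yq));
        printf("exact product      : %.8f\n", w * y);
        printf("fixed-point product: %.8f (quantization error %.2e)\n",
               p, fabs(w * y - p));
        return 0;
    }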
Ligon III et al. [45] have also shown the density advantage of fixed-point
over floating-point for older generation Xilinx 4020E FPGAs, by showing that
the space/time requirements for 32-bit fixed-point adders and multipliers were
less than those of their 32-bit floating-point equivalents.
Other efforts focused on developing a complete reconfigurable archi-
tecture for implementing MLP. Eldredge [3] successfully implemented the
back-propagation algorithm using a custom platform he built out of Xilinx
XC3090 FPGAs, called the Run-Time Reconfiguration Artificial Neural Net-
work (RRANN). He showed that the RRANN architecture could learn how
to approximate centroids of fuzzy sets. Heavily influenced by Eldredge’s
RRANN architecture, Beuchat et al. [13] developed an FPGA platform, called
RENCO — a REconfigurable Network COmputer. As its name implies, RENCO
contains four Altera FLEX 10K130 FPGAs that can be reconfigured and mon-
itored over any LAN (i.e. Internet or other) via an on-board 10Base-T inter-
face. RENCO’s intended application was hand-written character recognition.
Ferrucci and Martin [14, 15] built a custom platform, called Adaptive Connec-
tionist Model Emulator (ACME) which consists of multiple Xilinx XC4010
FPGAs. They validated ACME by successfully carrying out a 3-input, 3-
hidden unit, 1-output network used to learn the 2-input XOR problem. Skr-
bek’s FPGA platform [26], called the ECX card, could also implement Radial
Basis Function (RBF) neural networks, and was validated using pattern recog-
nition applications such as parity problem, digit recognition, inside-outside
test, and sonar signal recognition.
Since the size of an FPGA-based MLP-BP is proportional to the multiplier
used, it is clear that, given an FPGA’s finite resources, a 32-bit signed
(2’s complement) fixed-point representation will allow larger [54] ANNs to
be implemented than could be accommodated when using a 32-bit IEEE
floating-point representation (a 32-bit floating-point multiplier can be
implemented on a Xilinx Virtex-II or Spartan-3 FPGA using four of the
dedicated multiplier blocks plus CLB resources). However, while 32-bit
fixed-point representation allows a higher processor density, the quantization
error of 32-bit floating-point representation is negligible. Validating an
architecture on an FPGA using 32-bit floating-point arithmetic might also be
easier than with fixed-point arithmetic, since a software version of the
architecture can be run on a Personal Computer with 32-bit floating-point
arithmetic. As such, its use is justifiable if the relative loss in processing
density is negligible in comparison.
FPGA architectures and related development tools have become increas-
ingly sophisticated in more recent years, including improvements in the
space/time optimization of arithmetic circuit designs. As such, the objective
of this study is to determine the feasibility of floating-point arithmetic in im-
plementing MLP-BP using today’s FPGA design tools and technologies. Both
floating-point and fixed-point precision are considered for implementation and
are classified as amplitude-based digital numeric representations. Other nu-
meric representations, such as digital frequency-based [42] and analog were
not considered because they promote the use of low precision, which is often
found to be inadequate for minimum allowable range-precision.
2.2.2 Overview of Back-propagation Algorithm
It is helpful before proceeding to discuss architecture design to give a brief
review of MLP and the error Back-propagation algorithm. The general struc-
ture of a Multi-layer perceptron (MLP) neural network is shown in Figure 2.1,
where layers are numbered 0 to M, and neurons are numbered 1 to N.
A MLP using the back-propagation algorithm has five steps of execution:
(1) Initialization
The following parameters must be initialized before training starts: (i)
w_{kj}^{(s)}(n) is defined as the synaptic weight that corresponds to the connection
from neuron unit j in the (s − 1)th layer to neuron unit k in the sth layer. This weight
56. hand a gold-headed cane, those present looked towards him with
considerable deference.
“Well, squire,” said Sam Tilden, “you met with a misfortun’ last
night.”
“Yes,” said the squire, deliberately; “there was a clean sweep of
the old house. There isn’t much left of it.”
“Have you any idea who sot it on fire?” queried the old man.
“No,” said the squire. “I came in to see if any one here could
throw any light upon it.”
There was one present who could have thrown some light upon it,
and if Squire Turner had chanced to look behind the counter he
might have noticed a peculiar expression in the eyes of Harry
Raymond, who was watching him fixedly. The fact is, Harry was very
much perplexed in his mind in regard to the occurrence. Why a
gentleman should steal out of his house in disguise at the dead of
night to set fire to his own property was a question which was
invested with not a little mystery. But before the conversation was
finished he began to understand it better.
“It must have been sot afire,” continued Sam Tilden, positively.
“There wasn’t nobody livin’ in it.”
“No; it had been empty for several months.”
“You haint got no suspicions, I s’pose?”
“Why, no,” said the squire, slowly. “I suppose it must have been
somebody that had a grudge against me, and took this way to
gratify it. But who it may be I haven’t an idea.”
“I reckon it was insured?” said Sam, interrogatively.
“Yes,” said the squire, cautiously; “it was insured.”
“I said it must be,” said one, who had spoken at an earlier stage in
the conversation. “I knew squire, you was too keerful a man to
neglect it.”
57. “It was insured when it came into my hands,” said Squire Turner;
“and I have merely kept up the payments.”
“What was the figure?”
“I really can’t be quite certain till I have looked at the policy,” said
the squire. “I’ve got all my houses insured, and I can’t, without
looking, tell exactly how much there is on each.”
“That’s the advantage of owning only one house,” said Doctor
Lamson, as he stepped in for a moment. “I’m not liable to make a
mistake about my insurance. In what company was your house
insured, Squire Turner?”
“In the Phœnix Mutual, I believe. By the way, Mr. Porter, you may
send up a barrel of flour to my house. I believe we are nearly out.”
“All right, squire. It shall go up in the course of the day.”
“Good-morning, gentlemen,” said the squire, walking out of the
store.
“I guess the squire won’t lose a cent,” said Sam Tilden, after he
went out. “It’s likely the insurance money will pay him handsome if
the policy was took out years ago. I shouldn’t wonder if he’s glad the
old house is gone. It was awfully out of repair.”
“Very likely you’re right,” said John Gaylord. “I’d rather have the
money than the house, for my part.”
For the first time a light came to Harry’s mind. He felt that he
understood the whole matter now. Squire Turner didn’t want the
house, which would require considerable outlay to make it habitable,
and he did want the money for which it was insured. As the shortest
way to secure this, he had himself set the house on fire. Now, no
doubt, he meant to come upon the company for the amount of
insurance money. To Harry’s mind this looked like a swindle, like
obtaining money by false pretences. Yet here was Squire Turner, the
richest man in the village, occupying a very prominent—indeed the
most prominent—position in town, who was actually going to carry
out this fraud. Nobody except he knew that the squire was himself
58. the incendiary. What ought he to do about it? Should he allow the
insurance company to be swindled?
“Do you think Squire Turner will collect his insurance money, Mr.
Gaylord?” he asked, of the chief clerk.
“Do I think so? Of course he will. He’d be a fool if he didn’t.”
“But people seem to think that the house wasn’t worth as much as
the sum it was insured for.”
“Very likely not; but it was when it was insured, and as the
payments have been kept up regular, the insurance company can’t
complain as I see.”
“Suppose the man that set the house on fire should be caught?”
“He’d be tried, and put in prison.”
This gave Harry something new to think of. The idea of Squire
Turner’s being put in prison was certainly a strange and startling
one. Probably it made a difference as long as he owned the house
himself. Still, if he claimed the insurance money, that again made a
difference. Harry felt puzzled again, and in thinking over the matter
he made several ludicrous mistakes, among others asking a boy who
came in for some molasses how many yards he would have, which
led to a mirthful explosion from the young customer, who looked
upon it as a brilliant joke.
Not knowing what to do, Harry did nothing. Two days afterwards
our hero saw the following placard posted up on the outside of the
store, on the left-hand side of the door;—
“One Hundred Dollars Reward!—For information that
will lead to the discovery of the incendiary or
incendiaries who set fire to the old Jackson farm-
house, belonging to the subscriber, which was
consumed on the evening of the 11th inst.
“Elihu Turner.”
59. Harry read this placard with interest.
“I could claim that reward,” he said to himself; “but would Squire
Turner think my information worth paying for?”
60. CHAPTER XI.
HARRY MAKES A CALL ON BUSINESS.
A few days later Harry heard that Squire Turner had made a
formal claim upon the Phœnix Mutual Insurance Company for two
thousand dollars, the amount of his policy. On hearing this, he no
longer hesitated as to his duty. He resolved to call upon the squire,
and acquaint him with his information upon the subject. Accordingly,
one afternoon, he went up to Mr. Porter, and asked for two hours’
time.
“What for?” queried the store-keeper.
“I want to call on Squire Turner. I have a little business with him.”
The store-keeper naturally supposed that the business related to
the affairs of Harry’s mother, and gave permission, as business was
generally slack about that time in the afternoon, but requested Harry
to be back by half-past three.
When Harry got started on his way to the residence of the squire,
he began to feel that his errand was rather a delicate one. He, a
mere boy, was about to intimate to a gentleman of high social
position that he was a rascal,—that was the plain English of it,—and
was conspiring to defraud an insurance company out of a
considerable sum of money. It was rather a bold undertaking for a
boy of fifteen. Perhaps Squire Turner might be so incensed as to kick
him out of the house. Harry was a stout boy, but still of course he
had not the strength to cope with a tall man like the squire. Had he
been a timid boy, he would have shrunk from the encounter. But
Harry was not timid. On the contrary, he was physically and morally
brave, as anybody who knew him would readily testify.
61. “I’ll take the risk,” he said to himself, firmly. “I don’t think Squire
Turner will think it best to attack me.”
He marched manfully up the front steps, and rang the bell. His
summons was answered by a servant.
“Is the squire in?” he asked.
“Yes,” was the reply; and the girl indicated the door of the “office.”
Harry knocked.
“Come in,” said the squire, in his usual grating voice.
Harry did go in.
Squire Turner was seated at his desk. He had a paper before him,
which Harry rightly guessed was the fire insurance policy. The squire
had been examining it with considerable complacency. Two thousand
dollars was a large sum even to him, and certainly a very handsome
consideration for the old Jackson farm-house, which with the land
around it he had got, by the foreclosure of a mortgage, at a decided
bargain. How the company had ever been induced to grant so large
a sum on such a house, even in its better days, was a wonder; but
insurance companies sometimes make mistakes as well as private
individuals, and this appeared to be one of them.
62. “Very well, you can state your business.”
For two thousand dollars, or a little more, the squire had been
thinking he could build a nice modern house, which would make the
farm salable at a considerably higher figure than before. This was a
very pleasant prospect, of course, and the harsh lines in the squire’s
face were smoothed out to a certain extent as he thought of it.
When he turned, at the opening of the door, and saw who his
visitor was, he naturally concluded that Harry had come about the
land warrant.
“I haven’t heard anything more about your mother’s Western
land,” he said. “When I do I will let you know.”
“Thank you,” said Harry; “but that is not what I have come about.”
“Very well,” said the squire, a little surprised; “you can state your
business.”
At this moment James Turner came in hastily.
“Father, I want a dollar,” he said.
63. “What for?”
“To buy a bat and ball.”
“Wait a minute or two. I am busy.”
James looked at Harry superciliously, as if to imply that his
business could not be of any particular importance, and took a seat.
“You may state your business,” said the squire.
“I beg your pardon,” said Harry, looking towards James, “but my
business is private.”
“Perhaps he wants to complain of me,” thought James, “about the
eggs. If he does he won’t make much.”
“I am not aware of any business between us,” said the squire,
with dignity, “which is of too private a nature to discuss before my
son. I will, however, stretch a point, to oblige you, and request him
to leave the room.”
“It isn’t on my account, but on yours,” said our hero, bluntly, “that
I wish to speak privately.”
Squire Turner looked at Harry in cold displeasure not unmingled
with surprise, at what he felt to be a liberty.
“That’s a strange remark,” he said. “However, James, you may
leave the room. Here is the money.”
“You have offered a reward, Squire Turner, for information about
the fire the other evening,” said Harry, when they were alone,
thinking it best to plunge into the subject at once.
“Yes, a hundred dollars’ reward,” said the squire. “Do you know
anything about it?”
“I do,” said Harry, promptly.
Squire Turner was taken by surprise. What could Harry know
about the fire and its origin? He himself knew all about it; but of
course that knowledge was locked up in his own breast. In offering
the reward he felt sure that it would not be claimed, and, under the
64. circumstances, he felt that it was well to offer it. It would impress
the fire company favorably, as showing his determination to ferret
out the secret incendiary, and therefore he had forwarded a handbill
containing a copy of his offer to the office of the Phœnix Mutual,
together with his claim for the amount of insurance money.
Harry’s prompt answer led to a suspicion in the squire’s mind that
our hero was trying to get the reward on false pretences.
“The money will only be given for positive information leading to
the discovery of the incendiary,” he said, coldly.
“I can give you such information,” said Harry, with the same
promptness as before.
“Perhaps,” said the squire, with a sneer, “you can tell who set the
house on fire.”
“I can,” said Harry, distinctly.
“Who did it?” asked the squire, beginning to feel nervous.
“Squire Turner,” said our hero, feeling that the crisis had come,
“you have asked me the question, and of course you wish me to
answer it truly.”
“Of course,” muttered the squire, whose nervousness increased.
“Then,” said Harry, firmly, “you set the house on fire yourself!”
The words were like a thunderbolt. The squire started to his feet,
his face livid with fear, and then purple with excitement.
“How dare you say such a scandalous thing?” he exclaimed.
“Because you expect me to tell the truth,” said Harry. “If you will
listen, I will tell you how I came to know.”
Hereupon he gave an account, in as few words as possible, of his
midnight visit to the house of Doctor Lamson, of his passing near
the house, and identifying the squire in the act of setting fire to
some shavings. Squire Turner listened, evidently in a state of
65. nervous excitement, fidgeting about in a manner which indicated his
mental disturbance. When Harry had finished, he spoke.
“This is the most impudent fabrication I ever heard. You mean to
charge that I—a rich man, and, if I say it myself, universally
respected—actually set fire to my own house at the dead of night!”
“I do,” said Harry, firmly.
“I have a great mind to kick you out of my house,” said the squire,
violently.
“I don’t think you will do it, Squire Turner,” said Harry, who did not
show a trace of alarm.
“Why not?”
“Because I have told the truth, and you know it,” said our hero,
“and if I told it outside, people might believe it.”
“What would your word weigh against mine?” said the squire, but
his tone was more confident than his feeling.
“I never told a lie, as everybody in the village will testify,” said
Harry, proudly. “Of course it is an object for you to deny it.”
The squire began to see that the overbearing policy was not
exactly the one to pursue in this case. Harry was not to be
frightened easily, and this he realized. Besides, there were other
reasons why he did not wish to fall out with our hero. Accordingly he
thought proper to change his tone.
“My young friend,” he said, with a very significant change of tone
and manner, “you are certainly under a very strange delusion. I
should be angry, but I am rather disposed to be amused. You would
only be laughed at if you should spread abroad such a ridiculous
tale.”
“It’s true,” persisted Harry.
“Consider a moment,” said Squire Turner, with commendable
patience, “the nature of your charge. It is rather absurd that I
66. should set fire to my own building,—isn’t it, now? What possible
object could I have in so doing?”
“The insurance,” briefly answered Harry.
“Yes,” said Squire Turner, slowly; “the house was insured, to be
sure, but they don’t insure to the full value.”
“Everybody says that the house was insured for more than its full
value.”
“Quite a mistake. I would rather have the house than the money.
In fact, it was quite a disappointment having the house burnt down.”
“I don’t know about that,” said Harry, sturdily. “All I know is, that I
saw you setting the house on fire with my own eyes.”
Perspiration began to come out on the squire’s brow. He had
never anticipated such an obstacle to the carrying out his plans, and
it did seem a little provoking when everything had seemed so
favorable hitherto. He would like to have pitched our hero out of the
window, or kicked him out of the house; but neither course seemed
quite expedient. So, though boiling over with inward wrath and
vexation, he forced himself to be conciliatory.
“I have no doubt you think you are right,” he said; “but in the
evening one is easily deceived about faces. I was fast asleep at the
time, and, indeed, I knew nothing of the fire till my house-keeper
came and knocked at my door when it was nearly over.”
This was partly true; but the squire didn’t say that it was just after
he had crept stealthily into the house.
“Still, as I am a friend of your family, and interested in your
welfare,” he continued, “I don’t mind giving you the hundred dollars,
not, of course, as a reward, but to help you along. Of course it is on
condition that you say nothing of this ridiculous story. It would only
involve you in trouble. Come up to-morrow and I’ll give you the
money.”
67. “Squire Turner,” said Harry, promptly, “I cannot accept your
proposition, or money.”
“Why not?”
“Because my story, whether ridiculous or not, is true. I don’t care
for the reward; I didn’t come up here to get it.”
“What did you come for?”
“I came to prevent your coming upon the insurance company for
that money. If you will promise not to ask for the money, I will never
say a word about how the fire came about.”
“I can’t promise that,” said the squire; “but before claiming the
insurance I will let you know. In the mean time you had better keep
the story to yourself.”
“I will,” said Harry, and, rising, he left the room, leaving the squire
in a very uncomfortable and unsatisfactory state of mind.
68. CHAPTER XII.
HARTLEY BRANDON.
When the squire was left alone, he began rather ruefully to think
over the unexpected turn which affairs had taken. If he had disliked
Harry before, he hated him now. He felt that the sturdy
determination of our young hero was likely to place him in a very
unpleasant dilemma. If he should not collect the insurance money,
the house would be a total loss, and this would be very provoking. If
he should collect it, he had every reason to believe that Harry would
keep his word; and, as he was a boy of truth, many would no doubt
believe him, and the insurance company would be sure to stir in the
matter. There was another consideration. If he guiltily let the matter
pass, and failed to make his claim, or recalled it,—for it was already
made,—it would excite a great deal of surprise, and perhaps
suspicion, and thus again he would be disagreeably situated. There
seemed to be only a choice of difficulties, as the squire realized. He
fervently wished now that he had never burnt the house down. But
it was done and could not be undone.
“I wish the young rascal was out of the way,” he muttered to
himself.
He wished it the more because Harry stood in the way of another
plan which he had in view, namely, marrying Mrs. Raymond, in case
the Western property proved as valuable as he anticipated. He had
an instinctive feeling that our hero would not fancy him for a step-
father, and would exert all his influence over his mother to prevent
her accepting him, even if she might otherwise be willing.
“Plague take the young whelp!” muttered the squire. “I wish he
was in Nova Zembla, or somewhere else, where he would never
come back.”
69. His uncomfortable reflections were here broken in upon by the
entrance of the servant.
“There’s a man at the door wants to see you, Squire Turner.”
“Who is it?”
“It’s a stranger.”
“Well, tell him to come in.”
The invitation was duly given, and directly there entered a tall
man, very seedy in his appearance, with a repulsive aspect, who
looked as if the world and he had not been on good terms for some
time. He was probably about the same age as Squire Turner,—that
is, fifty,—but looked still older, probably in consequence of the life he
had led.
Squire Turner looked at the intruder in surprise.
“How do you do, Squire Turner?” said the stranger, familiarly.
“You have the advantage of me,” said the squire, coldly.
“Yet you used to know me well,” was the reply, as the visitor sat
down uninvited.
“I don’t know you now. Who are you?” demanded Squire Turner,
who didn’t feel it necessary to use much ceremony with a man so
evidently under the frowns of fortune.
“I am your cousin, Hartley Brandon.”
Squire Turner started.
“Hartley Brandon!” he repeated, in amazement. “I thought you
were dead years ago.”
“And wished it, no doubt,” said the other, with a laugh. “Confess
now you are not very glad to see me.”
“I am not very glad to see you, as you are sharp enough to
guess,” said the squire, with a sneer. “You are not a relative to be
proud of.”
“True enough,” said the other. “I see you are not afraid of hurting
my feelings. However, I’ve had so many hard rubs that my feelings
have got worn off, if I ever had any.”
“What is your object in coming down here, for I suppose you have
an object?”
“Suppose I say that it is for the sake of seeing about the only
relative I have in the world. There’s something in that, you know.”
“Not in this case. We may be cousins, but we are not friends, and
never will be.”
“Come, that’s frank,—true, too, I dare say,” said Hartley Brandon,
who didn’t appear by any means disturbed at the coldness of the
squire. “Well, as you say, it wasn’t that. Blood’s thicker than water,
they say, but there are plenty of people I like better than you, who
are my cousin.”
“That is a matter of perfect indifference to me,” said the squire,
coldly. “I don’t want to know what your object is not, but what it is.”
“I am rather seedy, as you see.”
“So it appears.”
“This shabby suit, with half a dollar, constitutes all my worldly
possessions.”
“Supposing it to be so, what is that to me?”
“Can’t you help me a little?”
The squire’s mouth tightened, as it always did when there was an
attack on his purse-strings. He seldom gave away money, unless he
thought it would help him in some way, and he felt even more than
usually unwilling to do so at a time when, owing to Harry’s obduracy,
he was threatened with a serious loss. No poorer time could have
been selected by his cousin for his application than this.
“I can do nothing for you,” he said, coldly.
“I don’t mean you to give me money,” said Brandon. “I only want
an advance of thirty or forty dollars, which I will faithfully repay you
with interest.”
Squire Turner laughed scornfully.
“What security can you offer?” he asked.
“None at all, except my word.”
“That isn’t satisfactory.”
“I thought you’d say so; but listen, and I will tell you how the
matter stands. First, I suppose you would like to know how I have
been employed for the last twenty odd years.”
“You may tell or not, just as you like. I feel no particular interest in
the matter.”
“I have followed the sea,—I see you are surprised; but this is the
way it happened. Twenty-five years since, I found myself high and
dry in New York, with no resources, and nobody to look to for help.
In my distress I fell in with a sailor, who treated me kindly, and
proposed to me to adopt his profession. It was not particularly to my
taste, and I knew it was rather late in life to begin; but I had no
other resource, and I allowed myself to be persuaded. I had a hard
time of it at first, as you may suppose, but after a while I became
acquainted with my duties, and turned out a very fair sailor. Being
possessed of a better education than belongs to the generality of
seamen, I found myself able to rise. On the second voyage, I
shipped third mate. Then I rose to second mate; finally to first mate.
I might have become captain if I had been a little more steady, but a
fondness for drink stood in the way of my advancement.”
“So you have been a sailor for twenty-five years.”
“Yes.”
“It was no doubt the best thing you could do. You don’t think of
giving it up?”
“No.”
“Then I don’t see what I can do for you.”
“I’ve a chance to sail as mate next week in the ship Sea Eagle
bound for China.”
“Why don’t you go, then?”
“Because there’s a trifle in the way. I owe twenty-five dollars in
New York, and if I don’t pay it up square the party’ll put a spoke in
my wheel, and prevent my getting the situation.”
“So you want me to advance you the necessary money?”
“Yes, I’ll pay you back at the end of the voyage.”
“Do you know the captain under whom you are to sail?” asked the
squire, thoughtfully.
“Yes, a little.”
“What sort of a man is he?”
“Oh, an average sort of a man,—rather a Tartar, so I hear from
some who have sailed under him. He likes his ease, and leaves the
vessel pretty much in the hands of his first officer.”
A train of reflection had been started in the squire’s mind by the
communication of his kinsman. He wanted to be rid of Harry
Raymond. Why could he not arrange with Hartley Brandon to
smuggle him off to sea, where he would be out of the way of
interfering with his plans? It might be difficult to manage, but no
doubt some way would suggest itself. As for Brandon, there was no
fear of his refusing. He was not troubled with scruples, and a small
sum of money would buy his co-operation.
Then, again, the sea was a treacherous element. Accidents were
frequent. Should Harry once embark on its smooth but fickle
expanse, he might never come back again, or, if he did, it might be
to find him, the squire, his mother’s second husband, and the
relationship would seal his lips from disclosing the secret of which he
had become possessed.
All these thoughts passed through the squire’s mind much more
quickly than I have been able to state them. The plan which has
been briefly sketched seemed the only way out of the labyrinth in
which he had become involved, and he resolved to make a trial of it.
“Well, will you help me?” asked Brandon, growing impatient of his
kinsman’s silence.
“I will,” answered the squire, “upon conditions.”
“Name them,” said Brandon, brightening up.
CHAPTER XIII.
A LETTER FROM NEW YORK.
It is unnecessary to detail the conversation which took place
between Squire Turner and Hartley Brandon, since the nature of it
may be guessed from the events which followed. As might be
expected, Brandon was by no means squeamish, and made no
objection to what was proposed. Indeed, he made an occasional
suggestion which was adopted by his kinsman. The squire did not, of
course, think it politic to reveal the real causes of his hostility to
Harry, nor of the reasons which he had for desiring that the boy
should be out of the way.
He was too cautious a man for this, and moreover had too little
confidence in Brandon, whom he regarded as an unprincipled fellow,
being in this opinion not far from right. He merely said that he had
reasons for wishing Harry out of the way, and expressed his
willingness, should matters turn out satisfactorily, not only to make
Hartley a present advance of fifty dollars, but to pay him over a
further sum of five hundred when the affair was over, besides what
might be needed for preliminary expenses.
To the shiftless vagabond, who had been tossing about the ocean
for a quarter of a century, five hundred dollars was a large sum,
though we may consider it a trifling compensation for an act of
villany. So he readily promised the squire his co-operation.
“It is best that you should leave Vernon at once,” said the squire,
when the arrangements between them were concluded.
“Why?” asked Brandon, rather disappointed, for he fully expected
to be the squire’s guest till the next day.
“Because it won’t do for you to be seen by the boy. He would
recognize you when you meet in the city, and this might lead him to
suspect something wrong.”
“What do you want me to do?”
“I will have my horse harnessed to the carryall, and will take you
over to the Wrexham station, where you can take the cars for the
city.”
“What time do the cars start?”
“In a couple of hours. We have no time to lose.”
“Have you got anything eatable in the house? I’m almost
famished. Haven’t eaten anything since early this morning.”
“I will look to that. Stay here, or rather I will lead the way
upstairs. Some one might come in. How will some beefsteak suit you?”
“Just the thing. Only let there be plenty of it. I’ve got a famous
appetite.”
Brandon was conducted upstairs to a back room on the second
floor, where the squire suggested that he might as well fill up a
portion of the time till lunch by brushing his clothes, and performing
ablutions which appeared to be needful. He then went downstairs to
give the necessary directions to Mrs. Murray.
“Broil some beefsteak and plenty of it,” said the squire. “You may
boil two or three eggs also, and send up a loaf of bread and some
butter.”
“Where shall I set the table?” asked Mrs. Murray.
“Never mind about a table. You can carry all up on a waiter to the
back chamber when ready.”
Seeing that the house-keeper looked surprised, he added, in
rather an embarrassed way:—
“The fact is, the man was a school-mate of mine, who hasn’t
turned out very well. Out of pity, I am going to help him a little, but
don’t care about his being seen in my house.”
This seemed plausible enough, particularly when Mrs. Murray saw
Brandon, who certainly looked very much like one who had not
turned out very well. The rapid manner in which the abundant meal
melted away under his vigorous attacks was certainly a tribute to the
culinary skill of the house-keeper, who was led to form a more
favorable estimate of the shabby stranger in consequence.
In a little more than half an hour Squire Turner was on his way to
Wrexham, Brandon occupying a back seat. They reached the depot
ten minutes before the train arrived, so that there was ample time to
buy a ticket.
So the train was set in motion that was to lead to important
changes in the life of our young hero. These it shall be our task
gradually to unfold, and set on record.
Four days passed quietly. The villagers had ceased to talk of the
fire, as another exciting occurrence had succeeded. Deacon Watson
had been thrown out of his carriage and broken his leg, and the
details of this accident were still fresh in the mouths of all.
Harry pursued the even tenor of his way in his new position, trying
to make himself as useful as possible, and succeeding to the
satisfaction of his employer. Always prompt, always reliable, Mr.
Porter felt that in spite of his youth he fully filled the place of Alfred
Harper, whose temporary loss he now regarded with equanimity.
Harry was weighing some sugar for a customer one afternoon
when John Gaylord, who had just got through sorting the mail, said
to him, “Here’s a letter for your mother, mailed at New York.”
“Let me see it,” said Harry, who felt some curiosity as to who
might have written to his mother, for her correspondence was very
limited.
He took the letter in his hand, and looked at the direction. It was
in a dashing business-hand, quite unknown to him, and revealed
nothing.
“I will take it home when I go to supper,” he said.
“Has your mother got friends in New York?” asked Gaylord.
“Not that I know of. I don’t recognize the handwriting.”
“Maybe it’s a lawyer’s letter, informing her of a legacy,” said the
senior clerk, jocosely.
“Very probable,” said Harry, smiling.
It was already the hour when he usually returned for supper.
Accordingly he put on his cap and went out of the store. Being a
little curious as to the contents of the letter, he hastened his steps,
and entered the house out of breath.
“You’re a little early,” said his mother. “Supper isn’t quite ready.”
“I hurried, because a letter came by this afternoon’s mail. It’s
mailed at New York.”
“New York!” repeated Mrs. Raymond, in surprise. “Who can it be
from?”
“I don’t know. Haven’t you any friends there?”
“Not that I know of. Harry, you may take up the tea and toast,
while I am reading the letter.”
She tore open the envelope, and first, as was natural, turned to
the bottom of the second page, and read the name appended to the
letter.
“Lemuel Fairchild!” she repeated, thoughtfully. “I don’t recall the
name.”
“Read it aloud, mother,” said Harry.
She complied with his request.
This is the way the letter read:—
“No. — Nassau Street, Room 7.
New York, Nov. 7, 18—.
“Dear Madam:—Though personally a stranger to you, I
knew your husband well, and have heard with the
deepest regret of his sad fate. We had not met for
years, but I have always cherished a warm regard for
him, though on account of the absorption of my time
by important business I have not been able to keep up
a correspondence with him. But, without further
preface, I will come to my object in writing.
“If I remember rightly, you have a son who must
now be a boy of sixteen or thereabouts. No doubt you
are anxious to get him into some kind of employment.
In the country I am aware desirable opportunities are
rare, and I presume you are at a loss how to secure
him one. Now, I am desirous of taking a boy, and
training him in my own business. Having no one in
view, it has occurred to me that it might be a pleasant
arrangement for you as well as for me, if I should take
your son. I may add that I am a commission merchant,
doing a large business. Can you send him up at once?
As to wages, I will give him twelve dollars a week at
first. He will not earn half that, but I shall feel that, in
overpaying him, I shall be assisting the widow and son
of my old friend.
“Yours very truly,
“Lemuel Fairchild.
“If you accept my proposal, I should like to see your
son at my office some time Monday.”
Mrs. Raymond looked at Harry in perplexity, after finishing the
letter.
“Lemuel Fairchild!” she repeated. “It is strange I never heard your
father speak of him.”
“Perhaps he may have done so, and you do not recall the name.”
“It may be so,” said Mrs. Raymond, slowly, “but I do not think so.”
“At any rate,” said Harry, “it’s a splendid offer. Think of earning
twelve dollars a week, to begin with, in New York!”
“Yes, it’s a good offer, but how can I spare you?” said his mother,
sorrowfully. “It will be very lonely without you. Don’t you think you
had better remain in Mr. Porter’s store?”
“That will only be for a few weeks, you know, mother. Alfred
Harper will be getting well before long, and then I shall be out of a
situation. I think we had better say yes.”
Harry’s ambition was fired by the prospect of a place in the city.
Like many another country boy he had the most splendid visions of
what city life was. By the side of a position in a city office his present
situation looked mean and contemptible. Even had the pay been the
same, he would have preferred New York to Vernon; but the fact
that the salary offered in the city was just double was an additional
inducement. Why, John Gaylord, Mr. Porter’s chief salesman, though
already twenty-five years of age, and with several years’ experience
as clerk, received just that, and no more. That Harry should be
offered the same salary at fifteen was indeed a compliment.
“I expect board is higher in the city,” said Mrs. Raymond.
“Yes, I suppose it is; but next year I shall probably have my pay
raised. Who knows but I may get into the firm some day,” said Harry,
glowing with enthusiasm, “and make money hand over hand? Then I
can take a nice house in the city, and you and Katy can come up and
live with me. Won’t that be nice?”
Mrs. Raymond confessed that it would be nice. Still she did not
like to let Harry go. But he gradually won her to his side, and she
admitted that there was something in his arguments. So, before he