Unit III NEURAL NETWORKS
Adaptive Networks – Feed Forward Networks
Topics
• Machine Learning using Neural Network
• Adaptive Networks – Feed Forward Networks
• Supervised Learning Neural Networks, Radial
Basis Function Networks
• Reinforcement Learning
• Unsupervised Learning Neural Networks
• Adaptive Resonance Architectures
• Advances in Neural Networks
ARTIFICIAL NEURAL NET
The figure shows a simple artificial neural net with two input
neurons (X1, X2) and one output neuron (Y). The interconnection weights are given by W1 and W2.
Source: “Principles of Soft Computing, 2nd Edition” by S.N. Sivanandam & S.N. Deepa, Copyright © 2011 Wiley India Pvt. Ltd. All rights reserved.
ASSOCIATION OF BIOLOGICAL NET WITH
ARTIFICIAL NET
PROCESSING OF AN ARTIFICIAL NET
The neuron is the basic information processing unit of a NN. It consists of:
1. A set of links, describing the neuron inputs, with weights W1, W2, …, Wm.
2. An adder function (linear combiner) for computing the weighted sum of the inputs (real numbers):
   u = Σ_{j=1..m} Wj Xj
3. An activation function for limiting the amplitude of the neuron output:
   y = φ(u + b)
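As a concrete illustration of the adder and activation steps above (and of the bias b introduced on the next slide), here is a minimal Python sketch of a single neuron; the input values, weights, bias, and the choice of a binary sigmoid activation are illustrative assumptions rather than values taken from the slides.

```python
import math

def neuron_output(x, w, b):
    """Single artificial neuron: weighted sum (adder) plus bias, then activation."""
    u = sum(wj * xj for wj, xj in zip(w, x))   # u = sum_j Wj * Xj
    return 1.0 / (1.0 + math.exp(-(u + b)))    # y = phi(u + b), binary sigmoid

# Two inputs (X1, X2) with weights (W1, W2), as in the figure above (values assumed)
print(neuron_output(x=[0.5, -1.0], w=[0.8, 0.2], b=0.1))
```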
BIAS OF AN ARTIFICIAL NEURON
The bias value is added to the weighted sum ∑ wi xi so that the net input can be shifted relative to the origin:
Yin = ∑ wi xi + b, where b is the bias.
(Figure: the lines x1 − x2 = −1, x1 − x2 = 0, and x1 − x2 = 1 in the (x1, x2) plane, showing how the bias shifts the decision boundary away from the origin.)
MULTI LAYER ARTIFICIAL NEURAL NET
INPUT: records without a class attribute, with normalized attribute values.
INPUT VECTOR: X = { x1, x2, …, xn} where n is the
number of (non-class) attributes.
INPUT LAYER: there are as many nodes as non-
class attributes, i.e. as the length of the input
vector.
HIDDEN LAYER: the number of nodes in the hidden layer and the number of hidden layers depend on the implementation.
OPERATION OF A NEURAL NET
(Figure: a neuron forms the weighted sum of the input vector x = (x0, x1, …, xn) with the weight vector w = (w0j, w1j, …, wnj), where x0 supplies the bias, and passes the sum through the activation function f to produce the output y.)
WEIGHT AND BIAS UPDATION
Per Sample Updating
• Updating weights and biases after the presentation
of each sample.
Per Training Set Updating (Epoch or Iteration)
• Weight and bias increments could be accumulated
in variables and the weights and biases updated
after all the samples of the training set have been
presented.
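The two schedules can be sketched as follows; the parameter list, gradient function, and learning rate are hypothetical placeholders, since the slides do not fix a particular model.

```python
def train_per_sample(params, samples, grad, lr=0.1, epochs=10):
    """Per-sample updating: parameters change after every (input, target) pair."""
    for _ in range(epochs):
        for x, d in samples:
            g = grad(params, x, d)                              # gradient for this sample
            params = [p - lr * gi for p, gi in zip(params, g)]  # immediate update
    return params

def train_per_epoch(params, samples, grad, lr=0.1, epochs=10):
    """Per-training-set updating: increments accumulate; one update per epoch."""
    for _ in range(epochs):
        acc = [0.0] * len(params)
        for x, d in samples:
            acc = [a + gi for a, gi in zip(acc, grad(params, x, d))]  # accumulate increments
        params = [p - lr * a for p, a in zip(params, acc)]            # apply once per epoch
    return params
```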
STOPPING CONDITION
• All changes in weights (wij) in the previous epoch are below some threshold, or
• The percentage of samples misclassified in the previous epoch is below some threshold, or
• A pre-specified number of epochs has expired.
• In practice, several hundred thousand epochs may be required before the weights converge.
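A sketch of how the three tests might be combined in code; the threshold values are arbitrary placeholders, not values from the slides.

```python
def should_stop(weight_changes, misclassified_fraction, epoch,
                w_tol=1e-4, err_tol=0.02, max_epochs=100_000):
    """Return True when any of the three stopping conditions holds."""
    return (max(abs(dw) for dw in weight_changes) < w_tol   # all weight changes below threshold
            or misclassified_fraction < err_tol             # misclassification rate below threshold
            or epoch >= max_epochs)                         # epoch budget exhausted
```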
BUILDING BLOCKS OF ARTIFICIAL NEURAL NET
• Network Architecture (Connection between Neurons)
• Setting the Weights (Training)
• Activation Function
LAYER PROPERTIES
• Input Layer: Each input unit may be designated by an attribute value possessed by the instance.
• Hidden Layer: Not directly observable; provides the nonlinearities for the network.
• Output Layer: Encodes the possible output values (e.g., class labels).
TRAINING PROCESS
• Supervised Training - Providing the network with a series of sample inputs and comparing the output with the expected responses.
• Unsupervised Training - Similar input vectors are assigned to the same output unit.
• Reinforcement Training - The right answer is not provided, but an indication of whether the output is ‘right’ or ‘wrong’ is provided.
ACTIVATION FUNCTION
• ACTIVATION LEVEL – DISCRETE OR CONTINUOUS
• HARD LIMIT FUNCTION (DISCRETE)
  ▫ Binary activation function
  ▫ Bipolar activation function
  ▫ Identity function
• SIGMOIDAL ACTIVATION FUNCTION (CONTINUOUS)
  ▫ Binary sigmoidal activation function
  ▫ Bipolar sigmoidal activation function
ACTIVATION FUNCTION
Activation functions:
(A) Identity
(B) Binary step
(C) Bipolar step
(D) Binary sigmoidal
(E) Bipolar sigmoidal
(F) Ramp
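These functions can be written out as follows; the steepness parameter lam (λ) and the ramp breakpoints at 0 and 1 follow the usual conventions and are assumptions here rather than definitions given on the slide.

```python
import math

def identity(x):
    return x

def binary_step(x):
    return 1 if x >= 0 else 0          # hard limit, outputs in {0, 1}

def bipolar_step(x):
    return 1 if x >= 0 else -1         # hard limit, outputs in {-1, 1}

def binary_sigmoid(x, lam=1.0):
    return 1.0 / (1.0 + math.exp(-lam * x))         # continuous, range (0, 1)

def bipolar_sigmoid(x, lam=1.0):
    return 2.0 / (1.0 + math.exp(-lam * x)) - 1.0   # continuous, range (-1, 1)

def ramp(x):
    return 0.0 if x < 0 else (x if x <= 1 else 1.0)  # identity clipped to [0, 1]
```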
CONSTRUCTING ANN
• Determine the network properties:
  ▫ Network topology
  ▫ Types of connectivity
  ▫ Order of connections
  ▫ Weight range
• Determine the node properties:
  ▫ Activation range
• Determine the system dynamics:
  ▫ Weight initialization scheme
  ▫ Activation-calculating formula
  ▫ Learning rule
PROBLEM SOLVING
• Select a suitable NN model based on the nature of the problem.
• Construct a NN according to the characteristics of the application domain.
• Train the neural network with the learning procedure of the selected model.
• Use the trained network for making inferences or solving problems.
NEURAL NETWORKS
• A neural network learns by adjusting the weights so as to correctly classify the training data and hence, after the testing phase, to classify unknown data.
• A neural network needs a long time for training.
• A neural network has a high tolerance to noisy and incomplete data.
SALIENT FEATURES OF ANN
• Adaptive learning
• Self-organization
• Real-time operation
• Fault tolerance via redundant information coding
• Massive parallelism
• Learning and generalizing ability
• Distributed representation
Adaptive Networks
• An adaptive network is a network structure
consisting of a number of nodes connected through
directional links.
• Each node represents a process unit, and the links
between nodes specify the causal relationship
between the connected nodes.
• All or some of the nodes are adaptive, which means their outputs depend on modifiable parameters.
• The learning rule specifies how these parameters
should be updated to minimize a prescribed error
measure
▫ It is a mathematical expression that measures the
discrepancy between the network's actual output and a
desired output.
Adaptive Networks
• Basic learning rule of the adaptive network is the
well-known steepest descent method, in which
the gradient vector is derived by successive
invocations of the chain rule.
• When applied to a multilayer feedforward neural network, this gradient-based procedure is known as the backpropagation learning rule.
(Figure: a feedforward adaptive network.)
Adaptive Networks
• An adaptive network is a network structure
whose overall input-output behavior is
determined by a collection of modifiable
parameters.
• The configuration of an adaptive network
▫ set of nodes connected by directed links where
each node performs a static node function on its
incoming signals to generate a single node output
and each link specifies the direction of signal flow
from one node to another.
Adaptive Networks
• An adaptive network is heterogeneous and each
node may have a specific node function different
from the others.
• Links in an adaptive network are merely used to
specify the propagation direction of node outputs;
generally there are no weights or parameters
associated with links.
Adaptive Networks
• The parameters of an adaptive network are distributed
into its nodes, so each node has a local parameter set.
• The union of these local parameter sets is the
network's overall parameter set.
• If a node's parameter set is not empty, then its node
function depends on the parameter values; we use a
square to represent this kind of adaptive node.
• If a node has an empty parameter set, then its
function is fixed; we use a circle to denote this type of
fixed node.
• Each adaptive node can be decomposed into a fixed
node plus one or several parameter nodes.
(Figure: the parameter-sharing problem.)
Adaptive networks - Classification
• Feedforward
• Recurrent
Layered representation of the feed-forward
adaptive network
• No links between nodes in the same layer, and
outputs of nodes in a specific layer are always
connected to nodes in succeeding layers.
• This representation is usually preferred because
of its modularity, in that nodes in the same layer
have the same functionality or generate the same
level of abstraction about input vectors.
Topological ordering representation
• Labels the nodes in an ordered sequence 1,2,3,..., such
that there are no links from node i to node j whenever i
>= j.
• This representation is less modular than the layer
representation, but it facilitates the formulation of
learning rules, as will be detailed in the next section.
• Special case of layered representation (one node per
layer)
Feed-forward adaptive network
• Conceptually, a feedforward adaptive network is actually a
static mapping between its input and output spaces; this
mapping may be either a simple linear relationship or a
highly nonlinear one, depending on the network structure
(node arrangement and connections, and so on) and the
functionality for each node.
• Here our aim is to construct a network for achieving a
desired nonlinear mapping that is regulated by a data set
consisting of desired input-output pairs of a target system
to be modeled.
• This data set is usually called the training data set, and the
procedures we follow in adjusting the parameters to
improve the network's performance are often referred to
as the learning rules or adaptation algorithms.
Feed-forward adaptive network
• Usually a network's performance is measured as
the discrepancy between the desired output and
the network's output under the same input
conditions.
• This discrepancy is called the error measure and it
can assume different forms for different
applications.
• Generally speaking, a learning rule is derived by
applying a specific optimization technique to a
given error measure.
Examples of adaptive networks
• An adaptive network with a single linear node
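The accompanying figure is not shown here; in the usual version of this example the single adaptive node computes an affine function of its two inputs, with a1, a2, and a3 as its modifiable parameters:

$$x_3 = f_3(x_1, x_2; a_1, a_2, a_3) = a_1 x_1 + a_2 x_2 + a_3$$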
Examples of adaptive networks
• Perceptron network
• We can form an equivalent network with a single node whose function is the composition of f3 and f4; the resulting node is the building block of the classical perceptron.
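Assuming the same linear node f3 as above and a step function for f4 (the classical perceptron construction), the two node functions are:

$$x_3 = a_1 x_1 + a_2 x_2 + a_3, \qquad x_4 = f_4(x_3) = \begin{cases} 1 & \text{if } x_3 \ge 0, \\ 0 & \text{otherwise.} \end{cases}$$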
Examples of adaptive networks
• Since the step function is discontinuous at one point
and flat at all the other points, it is not suitable for
derivative-based learning procedures.
• One way to get around this difficulty is to use the
sigmoidal function as a squashing function that has
values between 0 and 1:
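The squashing function meant here is the binary sigmoid:

$$f_4(x_3) = \frac{1}{1 + e^{-x_3}}$$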
• This is a continuous and differentiable approximation to the step function. The composition of f3 and this differentiable f4 is the building block for the multilayer perceptron in the following example.
(Figure: a multilayer perceptron.)
BACKPROPAGATION FOR FEEDFORWARD
NETWORKS
• The central part of this learning rule concerns how
to recursively obtain a gradient vector in which
each element is defined as the derivative of an
error measure with respect to a parameter.
• This is done by means of the chain rule, a basic
formula for differentiating composite functions.
• The procedure of finding a gradient vector in a
network structure is generally referred to as
backpropagation because the gradient vector is
calculated in the direction opposite to the flow of
the output of each node.
BACKPROPAGATION FOR FEEDFORWARD
NETWORKS
• Once the gradient is obtained, a number of
derivative-based optimization and regression
techniques are available for updating the
parameters.
• In particular, if we use the gradient vector in a
simple steepest descent method, the resulting
learning paradigm is often referred to as the
backpropagation learning rule.
BACKPROPAGATION FOR FEEDFORWARD
NETWORKS
• Suppose that a given feedforward adaptive network in the layered representation has L layers and layer l (l = 0, 1, ..., L; l = 0 represents the input layer) has N(l) nodes.
• Then the output and function of node i [i = 1, ..., N(l)] in layer l can be represented as xl,i and fl,i, respectively.
• We assume that there are no jumping links (that is, links
connecting nonconsecutive layers).
• Since the output of a node depends on the incoming
signals and the parameter set of the node, we have the
following general expression for the node function fl,i
α, β, γ etc. are the parameters of this node.
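With the parameters written explicitly, this general expression takes the form:

$$x_{l,i} = f_{l,i}\big(x_{l-1,1}, \ldots, x_{l-1,N(l-1)};\ \alpha, \beta, \gamma, \ldots\big)$$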
BACKPROPAGATION FOR FEEDFORWARD
NETWORKS
• An error measure for the pth (1 ≤ p ≤ P) entry of the training data, where dk is the kth component of the desired output and xL,k is the kth component of the actual output.
• Minimize the overall error measure, which is defined
as
• We assume that Ep depends on the output nodes only; this is not a universal definition of the error measure.
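With the sum-of-squared-errors measure (the usual choice, assumed here), the per-pattern and overall error measures read:

$$E_p = \sum_{k=1}^{N(L)} \big(d_k - x_{L,k}\big)^2, \qquad E = \sum_{p=1}^{P} E_p$$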
BACKPROPAGATION FOR FEEDFORWARD
NETWORKS
• To use steepest descent to minimize the error
measure, first we have to obtain the gradient
vector.
• Before calculating the gradient vector, we should
observe the following causal relationships:
BACKPROPAGATION FOR FEEDFORWARD
NETWORKS
• A small change in a parameter α will affect the output
of the node containing α; this in turn will affect the
output of the final layer and thus the error measure.
• Therefore, the basic concept in calculating the gradient
vector is to pass a form of derivative information
starting from the output layer and going backward
layer by layer until the input layer is reached.
• We define the error signal ϵl,i as the derivative of the
error measure Ep with respect to the output of node i in
layer l, taking both direct and indirect paths into
consideration.
BACKPROPAGATION FOR FEEDFORWARD
NETWORKS
• The ordered derivative takes into consideration
both the direct and indirect paths that lead to the
causal relationship.
• The error signal for the ith output node (at layer L) can be calculated directly:
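In symbols, using the ordered derivative ∂⁺ to account for both direct and indirect paths, the definition and its output-layer special case are:

$$\epsilon_{l,i} = \frac{\partial^{+} E_p}{\partial x_{l,i}}, \qquad \epsilon_{L,i} = \frac{\partial^{+} E_p}{\partial x_{L,i}} = \frac{\partial E_p}{\partial x_{L,i}}$$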
BACKPROPAGATION FOR FEEDFORWARD
NETWORKS
• For the internal node at the ith position of layer l,
the error signal can be derived by the chain rule:
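In the usual notation, this chain-rule expression is:

$$\epsilon_{l,i} = \frac{\partial^{+} E_p}{\partial x_{l,i}} = \sum_{m=1}^{N(l+1)} \frac{\partial f_{l+1,m}}{\partial x_{l,i}}\,\epsilon_{l+1,m}$$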
• The error signal of an internal node at layer l can
be expressed as a linear combination of the error
signal of the nodes at layer l + 1.
BACKPROPAGATION FOR FEEDFORWARD
NETWORKS
• Therefore, for any l and i, we can find ϵl,i
▫ Find error signals at the output layer,
▫ Repeat iteratively to reach the desired layer.
• This procedure is called backpropagation since
the error signals are obtained sequentially from
the output layer back to the input layer.
• The gradient vector is defined as the derivative of
the error measure with respect to each parameter,
so we have to apply the chain rule again to find
the gradient vector.
BACKPROPAGATION FOR FEEDFORWARD
NETWORKS
• If α is a parameter of the ith node at layer l, we
have
• More general form:
where S is the set of nodes containing α as a parameter, and x* and f* are the output and function, respectively, of a generic node in S.
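Written out in the usual notation, the two relations above are:

$$\frac{\partial^{+} E_p}{\partial \alpha} = \epsilon_{l,i}\,\frac{\partial f_{l,i}}{\partial \alpha}, \qquad \frac{\partial^{+} E_p}{\partial \alpha} = \sum_{x^{*} \in S} \frac{\partial^{+} E_p}{\partial x^{*}}\,\frac{\partial f^{*}}{\partial \alpha}$$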
BACKPROPAGATION FOR FEEDFORWARD
NETWORKS
• The derivative of the overall error measure E with
respect to α is
• Accordingly, for simple steepest descent without
line minimization, the update formula for the
generic parameter α is
• η is the learning rate
k is the step size, the length of each transition along the gradient direction in the
parameter space.
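Written out, the overall gradient and the update formula referred to above take the usual steepest-descent form; normalizing η by the length of the gradient vector (so that k is the step size) is the convention assumed here:

$$\frac{\partial^{+} E}{\partial \alpha} = \sum_{p=1}^{P} \frac{\partial^{+} E_p}{\partial \alpha}, \qquad \Delta\alpha = -\eta\,\frac{\partial^{+} E}{\partial \alpha}, \qquad \eta = \frac{k}{\sqrt{\sum_{\alpha} \left(\frac{\partial E}{\partial \alpha}\right)^{2}}}$$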
BACKPROPAGATION FOR FEEDFORWARD
NETWORKS
• When an n-node feedforward network is
represented in its topological order, we can
envision the error measure Ep as the output of an
additional node with index n + 1, whose node
function fn+1 can be defined on the outputs of any
nodes with smaller index;
• Therefore, Ep may depend directly on the output of any node.
• Applying the chain rule again, we have the
following concise formula for calculating the
error signal
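In this topological-order notation, the concise formula takes the form:

$$\epsilon_i = \frac{\partial^{+} E_p}{\partial x_i} = \frac{\partial f_{n+1}}{\partial x_i} + \sum_{j=i+1}^{n} \frac{\partial f_j}{\partial x_i}\,\epsilon_j$$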
BACKPROPAGATION FOR FEEDFORWARD
NETWORKS
• Here the first term shows the direct effect of xi on Ep via the direct path from node i to node n + 1, and each product term in the summation indicates the indirect effect of xi on Ep.
• Once we find the error signal for each node, then
the gradient vector for the parameters is derived
as before.
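To tie the error-signal recursion and the gradient formulas together, here is a minimal NumPy sketch for a fully connected network of sigmoid nodes with the squared-error measure; the architecture, initialization, and learning rate are illustrative assumptions, and the delta arrays below are the error signals folded together with the sigmoid derivative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(weights, x, d, lr=0.1):
    """One per-pattern steepest-descent update for a layered sigmoid network.

    weights: list of (fan_in + 1, fan_out) arrays; the last row of each holds the bias.
    """
    # Forward pass: store the output vector of every layer (the x_{l,i})
    outputs = [x]
    for W in weights:
        outputs.append(sigmoid(np.append(outputs[-1], 1.0) @ W))

    # Error signal at the output layer for Ep = sum_k (d_k - x_{L,k})^2,
    # combined with the derivative of the sigmoid output nodes
    delta = -2.0 * (d - outputs[-1]) * outputs[-1] * (1.0 - outputs[-1])

    # Backward pass: propagate error signals layer by layer and collect gradients
    new_weights = []
    for W, out_prev in zip(reversed(weights), reversed(outputs[:-1])):
        grad = np.outer(np.append(out_prev, 1.0), delta)        # dEp/dW for this layer
        delta = (W[:-1] @ delta) * out_prev * (1.0 - out_prev)   # error signal one layer back
        new_weights.append(W - lr * grad)                        # steepest-descent update
    return list(reversed(new_weights)), outputs[-1]

# Example: 2 inputs -> 4 hidden nodes -> 1 output (weight shapes include the bias row)
rng = np.random.default_rng(0)
weights = [0.1 * rng.standard_normal((3, 4)), 0.1 * rng.standard_normal((5, 1))]
weights, y = backprop_step(weights, np.array([0.5, -1.0]), np.array([1.0]))
```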
BACKPROPAGATION FOR FEEDFORWARD
NETWORKS
• Another systematic way to calculate the error
signals is through the representation of the error-
propagation network (or sensitivity model),
which is obtained from the original adaptive
network by reversing the links and supplying the
error signals at the output layer as inputs to the
new network.
BACKPROPAGATION FOR FEEDFORWARD
NETWORKS
• There are two types of learning paradigms available to suit the needs of various applications.
• Off-line learning (or batch learning)
▫ The update formula for parameter α is based on Equation
(8.8) and the update action takes place only after the whole
training data set has been presented—that is, only after each
epoch or sweep.
• On-line learning (or pattern-by-pattern learning)
▫ The parameters are updated immediately after each input-
output pair has been presented, and the update formula is
based on Equation (8.6).
• In practice, it is possible to combine these two learning modes and update the parameters after k training data entries have been presented, where k is between 1 and P; k is sometimes referred to as the epoch size.
EXTENDED BACKPROPAGATION FOR RECURRENT
NETWORKS
Because it has the directional loops 3-4-5, 3-4-6-5, and 6 (a self-loop), the network shown in the accompanying figure is a typical recurrent network.
EXTENDED BACKPROPAGATION FOR RECURRENT
NETWORKS
• Two operating modes through which the
network may satisfy the node functions
▫ Synchronous operation
 Back propagation Through Time (BPTT)
 Real-Time Recurrent Learning (RTRL)
▫ Continuous operation
SYNCHRONOUS OPERATION
• If a network is operated synchronously, all nodes
change their outputs simultaneously according
to a global clock signal and there is a time delay
associated with each link.
• This synchronization is reflected by adding the
time t as an argument to the output of each node
(assuming there is a unit time delay associated
with each link):
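The equation has the general form below, where x_{j1}, …, x_{jm} denote the node outputs feeding node i (the exact notation is an assumption):

$$x_i(t) = f_i\big(x_{j_1}(t-1), \ldots, x_{j_m}(t-1);\ \text{parameters of node } i\big)$$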
CONTINUOUS OPERATION
• In a network that is operated continuously, all
nodes continuously change their outputs until
Equation (8.13) is satisfied.
• This operating mode is of particular interest for
analog circuit implementations, where a certain
kind of dynamical evolution rule is imposed on
the network.
Backpropagation Through Time (BPTT)
• Identifying a set of parameters that will make
the output of a node (or several nodes) follow a
given trajectory (or trajectories) in the discrete
time domain.
• This problem of tracking or trajectory following
is usually solved by using a method called
unfolding of time to transform a recurrent
network into a feedforward one, as long as the
time t does not exceed a reasonable maximum T.
Backpropagation Through Time (BPTT)
• Note that the error signals of a parameter node come
from nodes located at layers across different time
instants;
• Thus, the backpropagation procedure (and the
corresponding steepest descent) for this kind of
unfolded network is often called backpropagation
through time (BPTT).
• Disadvantages of BPTT
▫ Requires extensive computing resources when the
sequence length T is large
▫ The duplication of nodes makes both memory
requirements and simulation time proportional to T.
Real-Time Recurrent Learning (RTRL)
• For long sequences or sequences of unknown
length, real-time recurrent learning (RTRL) is
employed to perform on-line learning — that is,
to update parameters while the network is
running rather than at the end of the presented
sequences.
Real-Time Recurrent Learning (RTRL)
• To save computation and memory requirements, a sensible choice is to minimize Ei at each time step instead of trying to minimize E at the end of the sequence.
• To achieve this, we need to calculate ∂⁺E/∂α recursively at each time step i.