Single-Layer Perceptron
       Classifiers


     Berlin Chen, 2002
Outline
• Foundations of trainable decision-making
  networks to be formulated
  – Input space to output space (classification space)
• Focus on the classification of linearly separable
  classes of patterns
  – Linear discriminant functions and simple correction
    rules
  – Continuous error function minimization
• Explanation and justification of perceptron and
  delta training rules

                                                            2
Classification Model, Features,
          and Decision Regions
• A pattern is the quantitative description of an
  object, event, or phenomenon
   – Spatial patterns: weather maps, fingerprints …
   – Temporal patterns: speech signals …


• Pattern classification/recognition
   – Assign the input data (a physical object, event, or
     phenomenon) to one of the pre-specified classes
     (categories)
   – Discriminate the input data within object population
     via the search for invariant attributes among
     members of the population                              3
Classification Model, Features,
             and Decision Regions (cont.)
   • The block diagram of the recognition and
     classification system



[Figure: block diagram of the recognition/classification system, a feature
extractor (dimension reduction) followed by a classifier; a neural network
can implement both the classification and the feature extraction]
                                                      4
Classification Model, Features,
      and Decision Regions (cont.)
• More about Feature Extraction
  – Compressed data extracted from the input patterns that
    still possesses the salient information
  – E.g.
      • Speech vowel sounds analyzed with a 16-channel filterbank
        yield 16-dimensional spectral vectors, which can be further
        transformed into two dimensions
          – Tone height (high-low) and retraction (front-back)

      • Input patterns are thus projected onto and reduced to a
        lower-dimensional space



                                                                       5
Classification Model, Features,
      and Decision Regions (cont.)
• More about Feature Extraction
[Figure: feature extraction as a projection, a rotated coordinate system
(x′, y′) superimposed on the original (x, y) input space]
                                                6
Classification Model, Features,
       and Decision Regions (cont.)
• Two simple ways to generate the pattern vectors for
  cases of spatial and temporal objects to be classified




• A pattern classifier maps input patterns (vectors) in En
  space into numbers (E1) which specify the class membership:
                 i0(x) = j ,   j = 1, 2, ..., R
                                                             7
Classification Model, Features,
       and Decision Regions (cont.)
• Classification described in geometric terms



                                                The decision surfaces here
                                                are curved lines


                                     i0(x) = j ,  for all x ∈ Xj ,   j = 1, 2, ..., R




   – Decision regions
   – Decision surfaces: generally, the decision surfaces for n-
     dimensional patterns may be (n-1)-dimensional hyper-surfaces                           8
Discriminant Functions
  • The classifier determines the membership of x in a category by
    comparing R discriminant functions g1(x), g2(x), ..., gR(x)
      – x lies within the region Xk when gk(x) has the largest value:
        i0(x) = k   if   gk(x) > gj(x)   for all j = 1, 2, ..., R ,  j ≠ k
[Figure: discriminant-function classifier, inputs x1, ..., xn feed R
discriminators g1, ..., gR whose outputs g1(x), ..., gR(x) go to a maximum
selector; the training set x1, x2, ..., xP satisfies P >> n, and the
classifier is assumed to have already been designed]
                                                                                     9
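
A minimal Python sketch (not part of the original slides) of this maximum-selector scheme: evaluate the R given discriminant functions and pick the class with the largest value. The helper name `classify` is illustrative; the two discriminants below reuse the Solution 1 functions of Example 3.1 worked out on the following slides.

```python
import numpy as np

def classify(x, discriminants):
    """Assign x to the class whose discriminant value is largest.

    `discriminants` is a list of R callables g_i(x); the classifier is
    assumed to have already been designed, i.e. the g_i are known.
    """
    values = [g(x) for g in discriminants]
    return int(np.argmax(values)) + 1          # classes are numbered 1..R

# Discriminants from Solution 1 of Example 3.1 (g = g1 - g2 = -2*x1 + x2 + 2):
g1 = lambda x: -2 * x[0] + x[1] + 3
g2 = lambda x: 2 * x[0] - x[1] - 1
print(classify(np.array([0.0, 0.0]), [g1, g2]))   # g1 = 3 > g2 = -1  ->  class 1
```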
Discriminant Functions (cont.)

• Example 3.1   Decision surface equation:
                      g(x) = g1(x) − g2(x) = −2 x1 + x2 + 2
                      g(x) > 0 : class 1
                      g(x) < 0 : class 2


                                      The decision surface does
                                        not uniquely specify the
                                         discriminant functions


                                A classifier that assigns patterns to one of
                                two classes or categories is called a
                                “dichotomizer” (from the Greek for “two”
                                and “cut”)                                    10
Discriminant Functions (cont.)




                                 11
Discriminant Functions (cont.)
Solution 1:
  (x − 0, y + 2, g1 − 1) · (2, −1, 1) = 0  ⇒  2x − y − 2 + g1 − 1 = 0
  ⇒  g1 = −2x + y + 3 ,   i.e.  g1(x) = [−2  1] [x1  x2]ᵗ + 3
  (x − 0, y + 2, g2 − 1) · (−2, 1, 1) = 0  ⇒  −2x + y + 2 + g2 − 1 = 0
  ⇒  g2 = 2x − y − 1 ,    i.e.  g2(x) = [2  −1] [x1  x2]ᵗ − 1
  g = g1 − g2 = 0  ⇒  −4x + 2y + 4 = 0  ⇒  −2x + y + 2 = 0

Solution 2:
  (x − 0, y + 2, g1 − 1) · (2, −1, 2) = 0  ⇒  2x − y − 2 + 2g1 − 2 = 0
  ⇒  g1 = −x + (1/2)y + 2
  (x − 0, y + 2, g2 − 1) · (−2, 1, 2) = 0  ⇒  −2x + y + 2 + 2g2 − 2 = 0
  ⇒  g2 = x − (1/2)y
  g = g1 − g2 = 0  ⇒  −2x + y + 2 = 0

[Figure: the discriminant planes g1 and g2 in (x, y, g) space, with their
normal vectors and the points (1,0,0), (0,−2,1), (0,−2,0) marked]

Both solutions give the same decision surface: an infinite number of
discriminant functions will yield correct classification                      12
Discriminant Functions (cont.)
Multi-class




Two-class




              g( x) = g1 ( x) − g2 ( x)      g ( x ) > 0 : class 1
                                             g ( x ) < 0 : class 2
              subtraction                 Sign examination           13
Discriminant Functions (cont.)




                     The design of the discriminator
                     for this case is not
                     straightforward: the discriminant
                     functions may turn out to be
                     nonlinear functions of x1 and x2




                                                   14
Bayes’ Decision Theory
• Decision-making based on both the posterior knowledge
  obtained from specific observation data and prior
  knowledge of the categories
   – Prior class probabilities P(ωi), ∀ class i
   – Class-conditional probabilities P(x|ωi), ∀ class i

     k = arg max_i P(ωi|x) = arg max_i [ P(x|ωi) P(ωi) / P(x) ]
                           = arg max_i [ P(x|ωi) P(ωi) / Σ_j P(x|ωj) P(ωj) ]

     ⇒  k = arg max_i P(ωi|x) = arg max_i P(x|ωi) P(ωi)




                                                                                                15
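
A small sketch of the minimum-error-rate (MAP) rule above. The priors and the one-dimensional Gaussian class-conditional densities used here are hypothetical stand-ins, not values from the slides; the function names are illustrative.

```python
import numpy as np

def gauss_pdf(x, mean, std):
    """Univariate Gaussian density, used as a stand-in class-conditional P(x|w_i)."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# Hypothetical priors and class-conditional densities:
priors = [0.4, 0.6]                                   # P(w_1), P(w_2)
likelihoods = [lambda x: gauss_pdf(x, 0.0, 1.0),      # P(x | w_1)
               lambda x: gauss_pdf(x, 2.0, 1.0)]      # P(x | w_2)

def map_decision(x):
    """Minimum-error-rate rule: choose the class maximizing P(x|w_i) P(w_i);
    the evidence P(x) is the same for every class and can be dropped."""
    scores = [p(x) * prior for p, prior in zip(likelihoods, priors)]
    return int(np.argmax(scores)) + 1                 # classes numbered 1..R

print(map_decision(0.3), map_decision(1.8))           # -> 1 2
```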
Bayes’ Decision Theory (cont.)
• Bayes’ decision rule designed to minimize the
  overall risk involved in making decision
  – The expected loss (conditional risk) when making
    decision δ i
      R(δi|x) = Σ_j l(δi|ωj, x) P(ωj|x) ,  where  l(δi|ωj, x) = 0 if i = j ,
                                                                 1 if i ≠ j
              = Σ_{j≠i} P(ωj|x)
              = 1 − P(ωi|x)

       • The overall risk (Bayes’ risk)
         R = ∫_{−∞}^{∞} R(δ(x)|x) p(x) dx ,   δ(x): the selected decision for a sample x

   – Minimize the overall risk (classification error) by computing the
     conditional risks and selecting the decision δi for which the
     conditional risk R(δi|x) is minimum, i.e., P(ωi|x) is maximum
     (minimum-error-rate decision rule)                                        16
Bayes’ Decision Theory (cont.)
   • Two-class pattern classification
      g1(x) = P(ω1|x) ≅ P(x|ω1) P(ω1) ,     g2(x) = P(ω2|x) ≅ P(x|ω2) P(ω2)

  Bayes’ classifier:  decide ω1 if P(x|ω1) P(ω1) > P(x|ω2) P(ω2), otherwise ω2

  Likelihood ratio / log-likelihood ratio:
      l(x) = P(x|ω1) / P(x|ω2)  ≷  P(ω2) / P(ω1)              (ω1 if > , ω2 if <)
      log l(x) = log P(x|ω1) − log P(x|ω2)  ≷  log P(ω2) − log P(ω1)

  Classification error:
      p(error) = P(x ∈ R1, ω2) + P(x ∈ R2, ω1)
               = P(x ∈ R1|ω2) P(ω2) + P(x ∈ R2|ω1) P(ω1)
               = ∫_{R1} P(x|ω2) P(ω2) dx + ∫_{R2} P(x|ω1) P(ω1) dx




                                                                                                                    17
Bayes’ Decision Theory (cont.)
• When the environment is multivariate Gaussian,
  the Bayes’ classifier reduces to a linear classifier
    – The same form taken by the perceptron
    – But the linear nature of the perceptron is not
      contingent on the assumption of Gaussianity

         P(x|ω) = 1 / ( (2π)^{n/2} |Σ|^{1/2} ) · exp( −(1/2) (x − µ)ᵗ Σ⁻¹ (x − µ) )

 Assumptions:
         Class ω1 :  E[X] = µ1 ,   E[(X − µ1)(X − µ1)ᵗ] = Σ
         Class ω2 :  E[X] = µ2 ,   E[(X − µ2)(X − µ2)ᵗ] = Σ
         P(ω1) = P(ω2) = 1/2
                                                                                18
Bayes’ Decision Theory (cont.)

• When the environment is Gaussian, the Bayes’
  classifier reduces to a linear classifier (cont.)
      log l(x) = log P(x|ω1) − log P(x|ω2)
               = −(1/2)(x − µ1)ᵗ Σ⁻¹ (x − µ1) + (1/2)(x − µ2)ᵗ Σ⁻¹ (x − µ2)
               = (µ1 − µ2)ᵗ Σ⁻¹ x + (1/2)( µ2ᵗ Σ⁻¹ µ2 − µ1ᵗ Σ⁻¹ µ1 )
               = wx + b

      ∴ log l(x) = wx + b  ≷  0      (decide ω1 if > 0, ω2 if < 0)

                                                                                19
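
A sketch of this reduction, assuming known class statistics: build w = Σ⁻¹(µ1 − µ2) and b = ½(µ2ᵗΣ⁻¹µ2 − µ1ᵗΣ⁻¹µ1) and classify by the sign of wᵗx + b (equal priors, as on the slide). The function name and the numerical class statistics are illustrative assumptions.

```python
import numpy as np

def gaussian_linear_classifier(mu1, mu2, sigma):
    """Bayes' classifier for two equal-covariance Gaussian classes with
    equal priors, reduced to the linear form log l(x) = w.x + b."""
    sigma_inv = np.linalg.inv(sigma)
    w = sigma_inv @ (mu1 - mu2)
    b = 0.5 * (mu2 @ sigma_inv @ mu2 - mu1 @ sigma_inv @ mu1)
    return w, b

# Hypothetical class statistics, just to exercise the formula:
mu1 = np.array([1.0, 1.0])
mu2 = np.array([-1.0, -1.0])
sigma = np.eye(2)

w, b = gaussian_linear_classifier(mu1, mu2, sigma)
x = np.array([0.5, 0.2])
print("class 1" if w @ x + b > 0 else "class 2")   # positive log-likelihood ratio -> class 1
```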
Bayes’ Decision Theory (cont.)

• Multi-class pattern classification




                                       20
Linear Machine and Minimum Distance
             Classification
• Find the linear-form discriminant function for two-
  class classification when the class prototypes are
  known

• Example 3.1: Select the decision hyperplane that
  contains the midpoint of the line segment connecting
  the center points of the two classes




                                                        21
Linear Machine and Minimum Distance
        Classification (cont.)
The dichotomizer’s discriminant function g(x):

     g(x) = (x1 − x2)ᵗ ( x − (x1 + x2)/2 ) = 0
  ⇒  (x1 − x2)ᵗ x + (1/2)( ‖x2‖² − ‖x1‖² ) = 0

  Taken as  [ wᵗ  w_{n+1} ] [ x ; 1 ] = 0 ,  where the augmented input
  pattern is y = [ x ; 1 ]  and

     w = x1 − x2
     w_{n+1} = (1/2)( ‖x2‖² − ‖x1‖² )

  [Figure: the two class centers x1, x2 and the decision hyperplane passing
  through their midpoint (x1 + x2)/2]

        It is a simple minimum-distance classifier.
                                                                                22
Linear Machine and Minimum Distance
        Classification (cont.)
• The linear-form discriminant functions for multi-
  class classification
   – There are up to R(R-1)/2 decision hyperplanes for R
     pairwise separable classes
                                       Some classes may not be contiguous

 [Figure: two scatter plots of three pattern classes (marked o, x, Δ); in the
 right-hand plot the classes are pairwise separable but not all contiguous]
                                                                           23
Linear Machine and Minimum Distance
        Classification (cont.)
• Linear machine or minimum-distance classifier
  – Assume the class prototypes are known for all classes
          • Euclidean distance between input pattern x and the center of
            class i, xi :
                ‖x − xi‖ = √( (x − xi)ᵗ (x − xi) )

          • Minimizing  ‖x − xi‖² = xᵗx − 2 xiᵗx + xiᵗxi  is equivalent to
            maximizing  xiᵗx − (1/2) xiᵗxi      (xᵗx is the same for all classes)

   – Set the discriminant function for each class i to be:
                gi(x) = xiᵗx − (1/2) xiᵗxi ,   i.e.   gi(x) = wiᵗ y
     with the augmented pattern y = [ x ; 1 ] and augmented weights
                wi = [ xi ; w_{i,n+1} ] ,    w_{i,n+1} = −(1/2) xiᵗxi
                                                                                24
Linear Machine and Minimum Distance
        Classification (cont.)



                          This approach is also called
                           correlation classification




                            A 1 is appended as the (n+1)-th component
                            of the input pattern (the augmented pattern y):
                                 gi(x) = xiᵗx − (1/2) xiᵗxi ,    gi(x) = wiᵗ y


                                                                   25
Linear Machine and Minimum Distance
        Classification (cont.)
• Example 3.2
          10                     2              -5     
  w1   =     2 , w       =     − 5 , w       =     5   
                       2                     3
          − 52              − 14 . 5            − 25   
                                                       


  g 1 ( x ) = 10 x 1 + 2 x 2 − 52
  g   (x ) =
       2          2 x 1 − 5 x 2 − 14 . 5
  g 3 (x ) =      − 5 x 1 + 5 x 2 − 25
                                                                              S 12
                                                                S 13
   S 12 : 8 x1 + 7 x 2 − 37 . 5 = 0
   S 13 : − 15 x1 + 3 x 2 + 27 = 0
   S 23 : − 7 x1 + 10 x 2 − 10 . 5 = 0


                            1
           gi ( x) = xit x − xit xi
                            2
                                                                       S 23          26
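
As a quick check of Example 3.2, the sketch below recomputes the pairwise decision surfaces S_ij as g_i(x) − g_j(x) = 0 from the given augmented weight vectors (S13 and S23 come out with all signs flipped relative to the slide, which describes the same lines).

```python
import numpy as np

# Augmented weight vectors from Example 3.2: g_i(x) = w_i[0]*x1 + w_i[1]*x2 + w_i[2]
w = np.array([[10.0,  2.0, -52.0],
              [ 2.0, -5.0, -14.5],
              [-5.0,  5.0, -25.0]])

# Pairwise decision surfaces S_ij: g_i(x) - g_j(x) = 0
for i, j in [(0, 1), (0, 2), (1, 2)]:
    a, b, c = w[i] - w[j]
    print(f"S{i+1}{j+1}: {a:g} x1 + {b:g} x2 + {c:g} = 0")
# S12: 8 x1 + 7 x2 + -37.5 = 0
# S13: 15 x1 + -3 x2 + -27 = 0    (same line as -15 x1 + 3 x2 + 27 = 0)
# S23: 7 x1 + -10 x2 + 10.5 = 0   (same line as -7 x1 + 10 x2 - 10.5 = 0)
```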
Linear Machine and Minimum Distance
        Classification (cont.)
• If R linear discriminant functions exist for a set of
  patterns such that

     g i (x ) > g   j   (x )   for x ∈ Class i,
     i = 1 , 2 ,..., R,j = 1 , 2 ,..., R , i ≠ j

   – The classes are linearly separable




                                                          27
Linear Machine and Minimum Distance
        Classification (cont.)




                                      28
Linear Machine and Minimum Distance
            Classification (cont.)
(a) 2x1 − x2 + 2 = 0 : the decision surface is a line (patterns in E2)
(b) 2x1 − x2 + 2 = 0 : the decision surface is a plane (patterns in E3)
(c) x1 = [2, 5]ᵗ ,  x2 = [−1, −3]ᵗ
    ⇒ the decision surface of the minimum-distance classifier is
      (x1 − x2)ᵗ x + (1/2)( ‖x2‖² − ‖x1‖² ) = 0
      3 x1 + 8 x2 − 19/2 = 0
(d) [Figure: three sketches of the decision surfaces of (a)–(c); the line of
    (c) crosses the x1-axis at (19/6, 0) and the x2-axis at (0, 19/16), and
    the sketches also mark the points (−1, 0), (0, 2), (0, 0)]
                                                                                29
Linear Machine and Minimum Distance
        Classification (cont.)
•   Examples 3.1 and 3.2 have shown that the
    coefficients (weights) of the linear
    discriminant functions can be determined if
    the a priori information about the sets of
    patterns and their class membership is
    known




                                                  30
Linear Machine and Minimum Distance
        Classification (cont.)
• The example of linearly non-separable patterns




                                                   31
Linear Machine and Minimum Distance
             Classification (cont.)

[Figure: a two-layer network of threshold logic units for the linearly
non-separable patterns (1, 1), (−1, 1), (1, −1), (−1, −1). TLU #1 and TLU #2
implement the decision lines −x1 − x2 + 1 = 0 and x1 + x2 + 1 = 0 in the
input space; their outputs o1, o2 feed an output TLU implementing
o1 + o2 − 1 = 0 in the (o1, o2) image space]
                                                                                32
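
A sketch of this layered classifier, reconstructed from the decision lines visible on the slide (−x1 − x2 + 1 = 0, x1 + x2 + 1 = 0, and o1 + o2 − 1 = 0 in the image space); the exact weight signs and class labelling are assumptions.

```python
def tlu(net):
    """Bipolar threshold logic unit: +1 for net > 0, -1 otherwise."""
    return 1 if net > 0 else -1

def layered_classifier(x1, x2):
    # Hidden TLUs implement the two decision lines shown on the slide.
    o1 = tlu(-x1 - x2 + 1)     # line -x1 - x2 + 1 = 0
    o2 = tlu( x1 + x2 + 1)     # line  x1 + x2 + 1 = 0
    # Output TLU combines the two images: line o1 + o2 - 1 = 0 in (o1, o2) space.
    return tlu(o1 + o2 - 1)

for p in [(1, 1), (-1, -1), (1, -1), (-1, 1)]:
    print(p, layered_classifier(*p))
# (1, 1) and (-1, -1) map to -1; (1, -1) and (-1, 1) map to +1,
# so the layered network separates a pattern set that no single TLU can.
```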
Discrete Perceptron Training Algorithm
         - Geometrical Representations
   • Examine neural network classifiers that derive/train
     their weights based on an error-correction scheme




                   Class 1:   wt y > 0
 g(y) = wt y
                   Class 2:   wt y < 0
Augmented
input pattern
                Vector Representations
                in the Weight Space                      33
Discrete Perceptron Training Algorithm
         - Geometrical Representations (cont.)
   • Devise an analytic approach based on the
     geometrical representations
       – E.g. the decision surface for the training pattern y1

[Figure: the decision plane wᵗ y1 = 0 in weight space, shown for y1 in
Class 1 and for y1 in Class 2]

      ∇w ( wᵗ y1 ) = y1          Gradient (the direction of steepest increase)

      If y1 is in Class 1 (wᵗ y1 should be > 0):   w′ = w¹ + c y1
      If y1 is in Class 2 (wᵗ y1 should be < 0):   w′ = w¹ − c y1

      c (> 0) is the correction increment, which controls the size of the
      adjustment; it is two times the learning constant introduced before.
                                                                                34
Discrete Perceptron Training Algorithm
    - Geometrical Representations (cont.)
                       Weight adjustments for the three
                       augmented training patterns y1,
                       y2, y3, shown in the weight
                       space
                              y1 ∈ C 1
                              y2 ∈ C1
                              y3 ∈ C    2


                      - Weights in the shaded region
                        are the solutions
                      - The three lines labeled are
                        fixed during training
Weight Space                                           35
Discrete Perceptron Training Algorithm
- Geometrical Representations (cont.)
• More about the correction increment c
  – If c is not merely a constant, but is related to the current
    training pattern:
                                   How to select the correction increment
                                   based on the distance between w¹ and the
                                   corrected weight vector w′

      p = w¹ᵗ y / ‖y‖          (the signed distance from w¹ to the plane wᵗ y = 0)

      Require ( w¹ ± c y )ᵗ y = 0
      ⇒  c = ∓ w¹ᵗ y / ( yᵗ y ) = | w¹ᵗ y | / ‖y‖²     (sign chosen so that c > 0)

      ⇒  c y = ( | w¹ᵗ y | / ‖y‖² ) y
                                                                            36
Discrete Perceptron Training Algorithm
    - Geometrical Representations (cont.)
• For fixed correction rule with c=constant, the
  correction of weights is always the same fixed
  portion of the current training vector
   – The weight can be initialized at any value
            w′ = w ± c y ,   or equivalently   w′ = w + Δw   with
            Δw = c [ d − sgn( wᵗ y ) ] y

• For the dynamic correction rule, c depends on the distance
  from the weight vector to the decision surface in the
  weight space:
                                               c y = ( w¹ᵗ y / ‖y‖² ) y
   – The initial weight should be different from 0
                                                                    37
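
A minimal sketch of fixed-correction training: the update is applied exactly when a pattern falls on the wrong side of (or on) the current decision surface, which is equivalent to Δw = (c/2)[d − sgn(wᵗy)]y on errors. The training data below are hypothetical separable augmented patterns, not taken from the slides.

```python
import numpy as np

def train_discrete_perceptron(ys, ds, c=1.0, max_epochs=100):
    """Fixed-correction-increment training of a single discrete perceptron.

    ys: augmented training patterns (rows), ds: bipolar targets (+1 / -1).
    A misclassified pattern (including w.y == 0, treated as an error) triggers
    w' = w + c*d*y, i.e. the rule delta_w = (c/2)[d - sgn(w.y)] y of the slides.
    """
    w = np.zeros(ys.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for y, d in zip(ys, ds):
            if d * (w @ y) <= 0:       # wrong side of (or on) the decision surface
                w = w + c * d * y
                errors += 1
        if errors == 0:                # one full error-free pass: training done
            break
    return w

# Hypothetical separable data: 1-D inputs augmented with a constant 1 component.
ys = np.array([[1.0, 1.0], [-0.5, 1.0], [3.0, 1.0], [-2.0, 1.0]])
ds = np.array([1.0, -1.0, 1.0, -1.0])
w = train_discrete_perceptron(ys, ds)
print(w, np.sign(ys @ w) == ds)        # learned weights reproduce all targets
```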
Discrete Perceptron Training Algorithm
- Geometrical Representations (cont.)
• Dynamic correction rule, with c dependent on the distance
  from the weight vector to the decision surface in the
  weight space:

                                    c = λ ( w¹ᵗ y / ‖y‖² )

                                    c y = λ ( w¹ᵗ y / ‖y‖ ) ( y / ‖y‖ )




                                                          38
Discrete Perceptron Training Algorithm
- Geometrical Representations (cont.)
• Example 3.3


                            1          − 0.5 y 1 ∈ C 1
                       y1 =       y2 =      
                            1           1 y2 ∈ C 2

                            3            2 y     3   ∈ C   1
                       y3 =        y4 =  
                            1           − 1  y   4   ∈ C   2



                       ∆w k =
                                   c
                                   2
                                     [         (          )]
                                     d k − sgn w kt y j y j

                         What if w kt y j = 0 ?
                         -> interpreted as a mistake
                         and followed by a correlation
                                                                   39
Continuous Perceptron
             Training Algorithm
• Replace the TLU (Threshold Logic Unit) with the
  sigmoid activation function for two reasons:
  – Gain finer control over the training procedure
  – Obtain differentiable characteristics that enable
    computation of the error gradient



                                         ŵ = w − η ∇E(w)
                              (η: the learning constant,  ∇E(w): the error gradient)




                                                                            40
Continuous Perceptron
         Training Algorithm (cont.)
• The new weight vector is obtained by moving in the
  direction of the negative gradient along the
  multidimensional error surface




                                                 41
Continuous Perceptron
            Training Algorithm (cont.)
• Define the error as the squared difference
  between the desired output and the actual
  output
             E = (1/2) (d − o)²

     or      E = (1/2) [ d − f( wᵗ y ) ]²  =  (1/2) [ d − f(net) ]²

             ∇E(w) = (1/2) ∇ ( [ d − f(net) ]² )

             ∇E(w) = [ ∂E/∂w1 , ∂E/∂w2 , … , ∂E/∂w_{n+1} ]ᵗ
                   = −(d − o) f′(net) [ ∂(net)/∂w1 , ∂(net)/∂w2 , … , ∂(net)/∂w_{n+1} ]ᵗ
                   = −(d − o) f′(net) y
                                                                                                42
Continuous Perceptron
                   Training Algorithm (cont.)
• Bipolar continuous activation function
      f(net) = 2 / ( 1 + exp(−λ·net) ) − 1

      f′(net) = 2 λ exp(−λ·net) / [ 1 + exp(−λ·net) ]²
              = (λ/2) { 1 − [ f(net) ]² } = (λ/2) ( 1 − o² )

      ŵ = w + (1/2) η λ (d − o) ( 1 − o² ) y

• Unipolar continuous activation function
      f(net) = 1 / ( 1 + exp(−λ·net) )

      f′(net) = λ exp(−λ·net) / [ 1 + exp(−λ·net) ]²
              = λ f(net) [ 1 − f(net) ] = λ o (1 − o)

      ŵ = w + η λ (d − o) o (1 − o) y
                                                                                                                  43
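
A sketch of delta-rule training with the bipolar sigmoid, following the update ŵ = w + ½ηλ(d − o)(1 − o²)y derived above; the learning constants and the (separable) training patterns are hypothetical choices.

```python
import numpy as np

def f_bipolar(net, lam=1.0):
    """Bipolar continuous activation: 2 / (1 + exp(-lam*net)) - 1."""
    return 2.0 / (1.0 + np.exp(-lam * net)) - 1.0

def train_continuous_perceptron(ys, ds, eta=0.5, lam=1.0, epochs=200):
    """Delta-rule training with the bipolar sigmoid:
    w' = w + (1/2) * eta * lam * (d - o) * (1 - o**2) * y."""
    w = np.zeros(ys.shape[1])
    for _ in range(epochs):
        for y, d in zip(ys, ds):
            o = f_bipolar(w @ y, lam)
            w = w + 0.5 * eta * lam * (d - o) * (1.0 - o**2) * y
    return w

# Hypothetical separable data (inputs augmented with a constant 1 component):
ys = np.array([[1.0, 1.0], [-0.5, 1.0], [3.0, 1.0], [-2.0, 1.0]])
ds = np.array([1.0, -1.0, 1.0, -1.0])
w = train_continuous_perceptron(ys, ds)
print(w, f_bipolar(ys @ w))            # outputs move toward the bipolar targets
```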
Continuous Perceptron
        Training Algorithm (cont.)
• Example 3.3
      f(net) = 2 / ( 1 + exp(−net) ) − 1       (bipolar activation, λ = 1)

      y1 = [ 1   1 ]ᵗ ,   y2 = [ −0.5   1 ]ᵗ ,   y3 = [ 3   1 ]ᵗ ,   y4 = [ 2   −1 ]ᵗ
                                                              44
Continuous Perceptron
        Training Algorithm (cont.)
• Example 3.3   Total error surface   Trajectories started from four
                                         arbitrary initial weights




                                                                       45
Continuous Perceptron
         Training Algorithm (cont.)
• Treat the last fixed component of input pattern
  vector as the neuron activation threshold




                                                    46
Continuous Perceptron
         Training Algorithm (cont.)
• R-category linear classifier using R discrete
  bipolar perceptrons
  – Goal: the i-th TLU responds with +1 to patterns of class i and
    all other TLUs respond with −1

                                 ŵi = wi + (1/2) c (di − oi) y

                                 di = 1 ,  dj = −1  for j = 1, 2, ..., R ,  j ≠ i

                                          (a “local representation” of the class)




                                                                           47
Continuous Perceptron
        Training Algorithm (cont.)
• Example 3.5




                                     48
Continuous Perceptron
         Training Algorithm (cont.)
• R-category linear classifier using R continuous
  bipolar perceptrons


                            ŵi = wi + (1/2) η λ (di − oi) ( 1 − oi² ) y ,
                            for i = 1, 2, ..., R

                            di = 1 ,  dj = −1  for j = 1, 2, ..., R ,  j ≠ i




                                                                            49
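
A sketch of the R-category continuous-perceptron rule above, with one weight vector per class and bipolar local targets; the patterns, learning constants, and epoch count are hypothetical.

```python
import numpy as np

def f_bipolar(net, lam=1.0):
    return 2.0 / (1.0 + np.exp(-lam * net)) - 1.0

def train_r_category(ys, labels, R, eta=0.5, lam=1.0, epochs=500):
    """One continuous bipolar perceptron per class ("local representation"):
    d_i = +1 for the correct class, -1 otherwise, and each weight vector is
    updated by w_i' = w_i + (1/2)*eta*lam*(d_i - o_i)*(1 - o_i**2)*y."""
    W = np.zeros((R, ys.shape[1]))                 # one weight row per category
    for _ in range(epochs):
        for y, label in zip(ys, labels):
            d = -np.ones(R)
            d[label] = 1.0                         # local (one-hot, bipolar) target
            o = f_bipolar(W @ y, lam)
            W += 0.5 * eta * lam * ((d - o) * (1.0 - o**2))[:, None] * y
    return W

# Hypothetical 2-D patterns, augmented with a constant 1 component:
ys = np.array([[2.0, 1.0, 1.0], [-1.0, 2.0, 1.0], [0.0, -2.0, 1.0]])
labels = np.array([0, 1, 2])                       # class index of each pattern
W = train_r_category(ys, labels, R=3)
print(np.argmax(W @ ys.T, axis=0))                 # recovers [0, 1, 2]
```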
Continuous Perceptron
         Training Algorithm (cont.)
• Error function dependent on the difference
  vector d-o




                                               50
Bayes’ Classifier vs. Perceptron

• The perceptron operates on the premise that the patterns to be
  classified are linearly separable (otherwise the training algorithm
  will oscillate), while Bayes’ classifier assumes that the (Gaussian)
  distributions of the two classes do overlap each other
• The perceptron is nonparametric while the Bayes’ classifier is
  parametric (its derivation is contingent on the assumption of the
  underlying distributions)
• The perceptron is simple and adaptive, and needs little storage,
  while the Bayes’ classifier could be made adaptive but at the
  expense of increased storage and more complex computations

                                                                  51
Homework

• P3.5, P3.7, P3.9, P3.22




                             52
