Single-Layer Perceptron
       Classifiers


     Berlin Chen, 2002
Outline
• Foundations of trainable decision-making
  networks to be formulated
  – Input space to output space (classification space)
• Focus on the classification of linearly separable
  classes of patterns
  – Linear discriminant functions and simple correction
    rules
  – Continuous error function minimization
• Explanation and justification of perceptron and
  delta training rules

                                                            2
Classification Model, Features,
          and Decision Regions
• A pattern is the quantitative description of an
  object, event, or phenomenon
   – Spatial patterns: weather maps, fingerprints …
   – Temporal patterns: speech signals …


• Pattern classification/recognition
   – Assign the input data (a physical object, event, or
     phenomenon) to one of the pre-specified classes
     (categories)
   – Discriminate the input data within object population
     via the search for invariant attributes among
     members of the population                              3
Classification Model, Features,
             and Decision Regions (cont.)
   • The block diagram of the recognition and
     classification system



[Figure: block diagram of the recognition/classification system, a feature
extractor (dimension reduction) followed by a classifier; a neural network
can implement both the classification and the feature extraction]
                                                      4
Classification Model, Features,
      and Decision Regions (cont.)
• More about Feature Extraction
  – Compressed data extracted from the input patterns that
    still possesses the salient information
  – E.g.
      • Speech vowel sounds analyzed with a 16-channel filterbank
        yield 16-dimensional spectral vectors, which can be further
        transformed into two dimensions
          – Tone height (high-low) and retraction (front-back)

      • Input patterns are thus projected onto and reduced to a
        lower-dimensional space



                                                                       5
Classification Model, Features,
      and Decision Regions (cont.)
• More about Feature Extraction
[Figure: feature extraction as a projection, a rotated coordinate system
(x′, y′) superimposed on the original (x, y) input space]
                                                6
Classification Model, Features,
       and Decision Regions (cont.)
• Two simple ways to generate the pattern vectors for
  cases of spatial and temporal objects to be classified




• A pattern classifier maps input patterns (vectors) in En
  space into numbers (E1) which specify the class membership:
                 i0(x) = j ,   j = 1, 2, ..., R
                                                             7
Classification Model, Features,
       and Decision Regions (cont.)
• Classification described in geometric terms



                                                The decision surfaces here
                                                are curved lines


                                     i0(x) = j ,  for all x ∈ Xj ,   j = 1, 2, ..., R




   – Decision regions
   – Decision surfaces: generally, the decision surfaces for n-
     dimensional patterns may be (n-1)-dimensional hyper-surfaces                           8
Discriminant Functions
  • The classifier determines the membership of x in a category by
    comparing R discriminant functions g1(x), g2(x), ..., gR(x)
      – x lies within the region Xk when gk(x) has the largest value:
        i0(x) = k   if   gk(x) > gj(x)   for all j = 1, 2, ..., R ,  j ≠ k
[Figure: discriminant-function classifier, inputs x1, ..., xn feed R
discriminators g1, ..., gR whose outputs g1(x), ..., gR(x) go to a maximum
selector; the training set x1, x2, ..., xP satisfies P >> n, and the
classifier is assumed to have already been designed]
                                                                                     9
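
A minimal Python sketch (not part of the original slides) of this maximum-selector scheme: evaluate the R given discriminant functions and pick the class with the largest value. The helper name `classify` is illustrative; the two discriminants below reuse the Solution 1 functions of Example 3.1 worked out on the following slides.

```python
import numpy as np

def classify(x, discriminants):
    """Assign x to the class whose discriminant value is largest.

    `discriminants` is a list of R callables g_i(x); the classifier is
    assumed to have already been designed, i.e. the g_i are known.
    """
    values = [g(x) for g in discriminants]
    return int(np.argmax(values)) + 1          # classes are numbered 1..R

# Discriminants from Solution 1 of Example 3.1 (g = g1 - g2 = -2*x1 + x2 + 2):
g1 = lambda x: -2 * x[0] + x[1] + 3
g2 = lambda x: 2 * x[0] - x[1] - 1
print(classify(np.array([0.0, 0.0]), [g1, g2]))   # g1 = 3 > g2 = -1  ->  class 1
```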
Discriminant Functions (cont.)

• Example 3.1   Decision surface equation:
                      g(x) = g1(x) − g2(x) = −2 x1 + x2 + 2
                      g(x) > 0 : class 1
                      g(x) < 0 : class 2


                                      The decision surface does
                                        not uniquely specify the
                                         discriminant functions


                                A classifier that assigns patterns to one of
                                two classes or categories is called a
                                “dichotomizer” (from the Greek for “two”
                                and “cut”)                                    10
Discriminant Functions (cont.)




                                 11
Discriminant Functions (cont.)
Solution 1:
  (x − 0, y + 2, g1 − 1) · (2, −1, 1) = 0  ⇒  2x − y − 2 + g1 − 1 = 0
  ⇒  g1 = −2x + y + 3 ,   i.e.  g1(x) = [−2  1] [x1  x2]ᵗ + 3
  (x − 0, y + 2, g2 − 1) · (−2, 1, 1) = 0  ⇒  −2x + y + 2 + g2 − 1 = 0
  ⇒  g2 = 2x − y − 1 ,    i.e.  g2(x) = [2  −1] [x1  x2]ᵗ − 1
  g = g1 − g2 = 0  ⇒  −4x + 2y + 4 = 0  ⇒  −2x + y + 2 = 0

Solution 2:
  (x − 0, y + 2, g1 − 1) · (2, −1, 2) = 0  ⇒  2x − y − 2 + 2g1 − 2 = 0
  ⇒  g1 = −x + (1/2)y + 2
  (x − 0, y + 2, g2 − 1) · (−2, 1, 2) = 0  ⇒  −2x + y + 2 + 2g2 − 2 = 0
  ⇒  g2 = x − (1/2)y
  g = g1 − g2 = 0  ⇒  −2x + y + 2 = 0

[Figure: the discriminant planes g1 and g2 in (x, y, g) space, with their
normal vectors and the points (1,0,0), (0,−2,1), (0,−2,0) marked]

Both solutions give the same decision surface: an infinite number of
discriminant functions will yield correct classification                      12
Discriminant Functions (cont.)
Multi-class




Two-class




              g( x) = g1 ( x) − g2 ( x)      g ( x ) > 0 : class 1
                                             g ( x ) < 0 : class 2
              subtraction                 Sign examination           13
Discriminant Functions (cont.)




                     The design of the discriminator
                     for this case is not
                     straightforward: the discriminant
                     functions may turn out to be
                     nonlinear functions of x1 and x2




                                                   14
Bayes’ Decision Theory
• Decision-making based on both the posterior knowledge
  obtained from specific observation data and prior
  knowledge of the categories
   – Prior class probabilities P(ωi), ∀ class i
   – Class-conditional probabilities P(x|ωi), ∀ class i

     k = arg max_i P(ωi|x) = arg max_i [ P(x|ωi) P(ωi) / P(x) ]
                           = arg max_i [ P(x|ωi) P(ωi) / Σ_j P(x|ωj) P(ωj) ]

     ⇒  k = arg max_i P(ωi|x) = arg max_i P(x|ωi) P(ωi)




                                                                                                15
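
A small sketch of the minimum-error-rate (MAP) rule above. The priors and the one-dimensional Gaussian class-conditional densities used here are hypothetical stand-ins, not values from the slides; the function names are illustrative.

```python
import numpy as np

def gauss_pdf(x, mean, std):
    """Univariate Gaussian density, used as a stand-in class-conditional P(x|w_i)."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# Hypothetical priors and class-conditional densities:
priors = [0.4, 0.6]                                   # P(w_1), P(w_2)
likelihoods = [lambda x: gauss_pdf(x, 0.0, 1.0),      # P(x | w_1)
               lambda x: gauss_pdf(x, 2.0, 1.0)]      # P(x | w_2)

def map_decision(x):
    """Minimum-error-rate rule: choose the class maximizing P(x|w_i) P(w_i);
    the evidence P(x) is the same for every class and can be dropped."""
    scores = [p(x) * prior for p, prior in zip(likelihoods, priors)]
    return int(np.argmax(scores)) + 1                 # classes numbered 1..R

print(map_decision(0.3), map_decision(1.8))           # -> 1 2
```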
Bayes’ Decision Theory (cont.)
• Bayes’ decision rule designed to minimize the
  overall risk involved in making decision
  – The expected loss (conditional risk) when making
    decision δ i
      R(δi|x) = Σ_j l(δi|ωj, x) P(ωj|x) ,  where  l(δi|ωj, x) = 0 if i = j ,
                                                                 1 if i ≠ j
              = Σ_{j≠i} P(ωj|x)
              = 1 − P(ωi|x)

       • The overall risk (Bayes’ risk)
         R = ∫_{−∞}^{∞} R(δ(x)|x) p(x) dx ,   δ(x): the selected decision for a sample x

   – Minimize the overall risk (classification error) by computing the
     conditional risks and selecting the decision δi for which the
     conditional risk R(δi|x) is minimum, i.e., P(ωi|x) is maximum
     (minimum-error-rate decision rule)                                        16
Bayes’ Decision Theory (cont.)
   • Two-class pattern classification
      g1(x) = P(ω1|x) ≅ P(x|ω1) P(ω1) ,     g2(x) = P(ω2|x) ≅ P(x|ω2) P(ω2)

  Bayes’ classifier:  decide ω1 if P(x|ω1) P(ω1) > P(x|ω2) P(ω2), otherwise ω2

  Likelihood ratio / log-likelihood ratio:
      l(x) = P(x|ω1) / P(x|ω2)  ≷  P(ω2) / P(ω1)              (ω1 if > , ω2 if <)
      log l(x) = log P(x|ω1) − log P(x|ω2)  ≷  log P(ω2) − log P(ω1)

  Classification error:
      p(error) = P(x ∈ R1, ω2) + P(x ∈ R2, ω1)
               = P(x ∈ R1|ω2) P(ω2) + P(x ∈ R2|ω1) P(ω1)
               = ∫_{R1} P(x|ω2) P(ω2) dx + ∫_{R2} P(x|ω1) P(ω1) dx




                                                                                                                    17
Bayes’ Decision Theory (cont.)
• When the environment is multivariate Gaussian,
  the Bayes’ classifier reduces to a linear classifier
    – The same form taken by the perceptron
    – But the linear nature of the perceptron is not
      contingent on the assumption of Gaussianity

         P(x|ω) = 1 / ( (2π)^{n/2} |Σ|^{1/2} ) · exp( −(1/2) (x − µ)ᵗ Σ⁻¹ (x − µ) )

 Assumptions:
         Class ω1 :  E[X] = µ1 ,   E[(X − µ1)(X − µ1)ᵗ] = Σ
         Class ω2 :  E[X] = µ2 ,   E[(X − µ2)(X − µ2)ᵗ] = Σ
         P(ω1) = P(ω2) = 1/2
                                                                                18
Bayes’ Decision Theory (cont.)

• When the environment is Gaussian, the Bayes’
  classifier reduces to a linear classifier (cont.)
      log l(x) = log P(x|ω1) − log P(x|ω2)
               = −(1/2)(x − µ1)ᵗ Σ⁻¹ (x − µ1) + (1/2)(x − µ2)ᵗ Σ⁻¹ (x − µ2)
               = (µ1 − µ2)ᵗ Σ⁻¹ x + (1/2)( µ2ᵗ Σ⁻¹ µ2 − µ1ᵗ Σ⁻¹ µ1 )
               = wx + b

      ∴ log l(x) = wx + b  ≷  0      (decide ω1 if > 0, ω2 if < 0)

                                                                                19
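
A sketch of this reduction, assuming known class statistics: build w = Σ⁻¹(µ1 − µ2) and b = ½(µ2ᵗΣ⁻¹µ2 − µ1ᵗΣ⁻¹µ1) and classify by the sign of wᵗx + b (equal priors, as on the slide). The function name and the numerical class statistics are illustrative assumptions.

```python
import numpy as np

def gaussian_linear_classifier(mu1, mu2, sigma):
    """Bayes' classifier for two equal-covariance Gaussian classes with
    equal priors, reduced to the linear form log l(x) = w.x + b."""
    sigma_inv = np.linalg.inv(sigma)
    w = sigma_inv @ (mu1 - mu2)
    b = 0.5 * (mu2 @ sigma_inv @ mu2 - mu1 @ sigma_inv @ mu1)
    return w, b

# Hypothetical class statistics, just to exercise the formula:
mu1 = np.array([1.0, 1.0])
mu2 = np.array([-1.0, -1.0])
sigma = np.eye(2)

w, b = gaussian_linear_classifier(mu1, mu2, sigma)
x = np.array([0.5, 0.2])
print("class 1" if w @ x + b > 0 else "class 2")   # positive log-likelihood ratio -> class 1
```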
Bayes’ Decision Theory (cont.)

• Multi-class pattern classification




                                       20
Linear Machine and Minimum Distance
             Classification
• Find the linear-form discriminant function for two-
  class classification when the class prototypes are
  known

• Example 3.1: Select the decision hyperplane that
  contains the midpoint of the line segment connecting
  the center points of the two classes




                                                        21
Linear Machine and Minimum Distance
        Classification (cont.)
The dichotomizer’s discriminant function g(x):

     g(x) = (x1 − x2)ᵗ ( x − (x1 + x2)/2 ) = 0
  ⇒  (x1 − x2)ᵗ x + (1/2)( ‖x2‖² − ‖x1‖² ) = 0

  Taken as  [ wᵗ  w_{n+1} ] [ x ; 1 ] = 0 ,  where the augmented input
  pattern is y = [ x ; 1 ]  and

     w = x1 − x2
     w_{n+1} = (1/2)( ‖x2‖² − ‖x1‖² )

  [Figure: the two class centers x1, x2 and the decision hyperplane passing
  through their midpoint (x1 + x2)/2]

        It is a simple minimum-distance classifier.
                                                                                22
Linear Machine and Minimum Distance
        Classification (cont.)
• The linear-form discriminant functions for multi-
  class classification
   – There are up to R(R-1)/2 decision hyperplanes for R
     pairwise separable classes
                                       Some classes may not be contiguous

 [Figure: two scatter plots of three pattern classes (marked o, x, Δ); in the
 right-hand plot the classes are pairwise separable but not all contiguous]
                                                                           23
Linear Machine and Minimum Distance
        Classification (cont.)
• Linear machine or minimum-distance classifier
  – Assume the class prototypes are known for all classes
          • Euclidean distance between input pattern x and the center of
            class i, xi :
                ‖x − xi‖ = √( (x − xi)ᵗ (x − xi) )

          • Minimizing  ‖x − xi‖² = xᵗx − 2 xiᵗx + xiᵗxi  is equivalent to
            maximizing  xiᵗx − (1/2) xiᵗxi      (xᵗx is the same for all classes)

   – Set the discriminant function for each class i to be:
                gi(x) = xiᵗx − (1/2) xiᵗxi ,   i.e.   gi(x) = wiᵗ y
     with the augmented pattern y = [ x ; 1 ] and augmented weights
                wi = [ xi ; w_{i,n+1} ] ,    w_{i,n+1} = −(1/2) xiᵗxi
                                                                                24
Linear Machine and Minimum Distance
        Classification (cont.)



                          This approach is also called
                           correlation classification




                            A 1 is appended as the (n+1)-th component
                            of the input pattern (the augmented pattern y):
                                 gi(x) = xiᵗx − (1/2) xiᵗxi ,    gi(x) = wiᵗ y


                                                                   25
Linear Machine and Minimum Distance
        Classification (cont.)
• Example 3.2
          10                     2              -5     
  w1   =     2 , w       =     − 5 , w       =     5   
                       2                     3
          − 52              − 14 . 5            − 25   
                                                       


  g 1 ( x ) = 10 x 1 + 2 x 2 − 52
  g   (x ) =
       2          2 x 1 − 5 x 2 − 14 . 5
  g 3 (x ) =      − 5 x 1 + 5 x 2 − 25
                                                                              S 12
                                                                S 13
   S 12 : 8 x1 + 7 x 2 − 37 . 5 = 0
   S 13 : − 15 x1 + 3 x 2 + 27 = 0
   S 23 : − 7 x1 + 10 x 2 − 10 . 5 = 0


                            1
           gi ( x) = xit x − xit xi
                            2
                                                                       S 23          26
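
As a quick check of Example 3.2, the sketch below recomputes the pairwise decision surfaces S_ij as g_i(x) − g_j(x) = 0 from the given augmented weight vectors (S13 and S23 come out with all signs flipped relative to the slide, which describes the same lines).

```python
import numpy as np

# Augmented weight vectors from Example 3.2: g_i(x) = w_i[0]*x1 + w_i[1]*x2 + w_i[2]
w = np.array([[10.0,  2.0, -52.0],
              [ 2.0, -5.0, -14.5],
              [-5.0,  5.0, -25.0]])

# Pairwise decision surfaces S_ij: g_i(x) - g_j(x) = 0
for i, j in [(0, 1), (0, 2), (1, 2)]:
    a, b, c = w[i] - w[j]
    print(f"S{i+1}{j+1}: {a:g} x1 + {b:g} x2 + {c:g} = 0")
# S12: 8 x1 + 7 x2 + -37.5 = 0
# S13: 15 x1 + -3 x2 + -27 = 0    (same line as -15 x1 + 3 x2 + 27 = 0)
# S23: 7 x1 + -10 x2 + 10.5 = 0   (same line as -7 x1 + 10 x2 - 10.5 = 0)
```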
Linear Machine and Minimum Distance
        Classification (cont.)
• If R linear discriminant functions exist for a set of
  patterns such that

     g i (x ) > g   j   (x )   for x ∈ Class i,
     i = 1 , 2 ,..., R,j = 1 , 2 ,..., R , i ≠ j

   – The classes are linearly separable




                                                          27
Linear Machine and Minimum Distance
        Classification (cont.)




                                      28
Linear Machine and Minimum Distance
            Classification (cont.)
(a) 2x1 − x2 + 2 = 0 : the decision surface is a line (patterns in E2)
(b) 2x1 − x2 + 2 = 0 : the decision surface is a plane (patterns in E3)
(c) x1 = [2, 5]ᵗ ,  x2 = [−1, −3]ᵗ
    ⇒ the decision surface of the minimum-distance classifier is
      (x1 − x2)ᵗ x + (1/2)( ‖x2‖² − ‖x1‖² ) = 0
      3 x1 + 8 x2 − 19/2 = 0
(d) [Figure: three sketches of the decision surfaces of (a)–(c); the line of
    (c) crosses the x1-axis at (19/6, 0) and the x2-axis at (0, 19/16), and
    the sketches also mark the points (−1, 0), (0, 2), (0, 0)]
                                                                                29
Linear Machine and Minimum Distance
        Classification (cont.)
•   Examples 3.1 and 3.2 have shown that the
    coefficients (weights) of the linear
    discriminant functions can be determined if
    the a priori information about the sets of
    patterns and their class membership is
    known




                                                  30
Linear Machine and Minimum Distance
        Classification (cont.)
• The example of linearly non-separable patterns




                                                   31
Linear Machine and Minimum Distance
             Classification (cont.)

[Figure: a two-layer network of threshold logic units for the linearly
non-separable patterns (1, 1), (−1, 1), (1, −1), (−1, −1). TLU #1 and TLU #2
implement the decision lines −x1 − x2 + 1 = 0 and x1 + x2 + 1 = 0 in the
input space; their outputs o1, o2 feed an output TLU implementing
o1 + o2 − 1 = 0 in the (o1, o2) image space]
                                                                                32
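
A sketch of this layered classifier, reconstructed from the decision lines visible on the slide (−x1 − x2 + 1 = 0, x1 + x2 + 1 = 0, and o1 + o2 − 1 = 0 in the image space); the exact weight signs and class labelling are assumptions.

```python
def tlu(net):
    """Bipolar threshold logic unit: +1 for net > 0, -1 otherwise."""
    return 1 if net > 0 else -1

def layered_classifier(x1, x2):
    # Hidden TLUs implement the two decision lines shown on the slide.
    o1 = tlu(-x1 - x2 + 1)     # line -x1 - x2 + 1 = 0
    o2 = tlu( x1 + x2 + 1)     # line  x1 + x2 + 1 = 0
    # Output TLU combines the two images: line o1 + o2 - 1 = 0 in (o1, o2) space.
    return tlu(o1 + o2 - 1)

for p in [(1, 1), (-1, -1), (1, -1), (-1, 1)]:
    print(p, layered_classifier(*p))
# (1, 1) and (-1, -1) map to -1; (1, -1) and (-1, 1) map to +1,
# so the layered network separates a pattern set that no single TLU can.
```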
Discrete Perceptron Training Algorithm
         - Geometrical Representations
   • Examine neural network classifiers that derive/train
     their weights based on an error-correction scheme




                   Class 1:   wt y > 0
 g(y) = wt y
                   Class 2:   wt y < 0
Augmented
input pattern
                Vector Representations
                in the Weight Space                      33
Discrete Perceptron Training Algorithm
         - Geometrical Representations (cont.)
   • Devise an analytic approach based on the
     geometrical representations
       – E.g. the decision surface for the training pattern y1

[Figure: the decision plane wᵗ y1 = 0 in weight space, shown for y1 in
Class 1 and for y1 in Class 2]

      ∇w ( wᵗ y1 ) = y1          Gradient (the direction of steepest increase)

      If y1 is in Class 1 (wᵗ y1 should be > 0):   w′ = w¹ + c y1
      If y1 is in Class 2 (wᵗ y1 should be < 0):   w′ = w¹ − c y1

      c (> 0) is the correction increment, which controls the size of the
      adjustment; it is two times the learning constant introduced before.
                                                                                34
Discrete Perceptron Training Algorithm
    - Geometrical Representations (cont.)
                       Weight adjustments for the three
                       augmented training patterns y1,
                       y2, y3, shown in the weight
                       space
                              y1 ∈ C 1
                              y2 ∈ C1
                              y3 ∈ C    2


                      - Weights in the shaded region
                        are the solutions
                      - The three lines labeled are
                        fixed during training
Weight Space                                           35
Discrete Perceptron Training Algorithm
- Geometrical Representations (cont.)
• More about the correction increment c
  – If c is not merely a constant, but is related to the current
    training pattern:
                                   How to select the correction increment
                                   based on the distance between w¹ and the
                                   corrected weight vector w′

      p = w¹ᵗ y / ‖y‖          (the signed distance from w¹ to the plane wᵗ y = 0)

      Require ( w¹ ± c y )ᵗ y = 0
      ⇒  c = ∓ w¹ᵗ y / ( yᵗ y ) = | w¹ᵗ y | / ‖y‖²     (sign chosen so that c > 0)

      ⇒  c y = ( | w¹ᵗ y | / ‖y‖² ) y
                                                                            36
Discrete Perceptron Training Algorithm
    - Geometrical Representations (cont.)
• For fixed correction rule with c=constant, the
  correction of weights is always the same fixed
  portion of the current training vector
   – The weight can be initialized at any value
            w′ = w ± c y ,   or equivalently   w′ = w + Δw   with
            Δw = c [ d − sgn( wᵗ y ) ] y

• For the dynamic correction rule, c depends on the distance
  from the weight vector to the decision surface in the
  weight space:
                                               c y = ( w¹ᵗ y / ‖y‖² ) y
   – The initial weight should be different from 0
                                                                    37
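
A minimal sketch of fixed-correction training: the update is applied exactly when a pattern falls on the wrong side of (or on) the current decision surface, which is equivalent to Δw = (c/2)[d − sgn(wᵗy)]y on errors. The training data below are hypothetical separable augmented patterns, not taken from the slides.

```python
import numpy as np

def train_discrete_perceptron(ys, ds, c=1.0, max_epochs=100):
    """Fixed-correction-increment training of a single discrete perceptron.

    ys: augmented training patterns (rows), ds: bipolar targets (+1 / -1).
    A misclassified pattern (including w.y == 0, treated as an error) triggers
    w' = w + c*d*y, i.e. the rule delta_w = (c/2)[d - sgn(w.y)] y of the slides.
    """
    w = np.zeros(ys.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for y, d in zip(ys, ds):
            if d * (w @ y) <= 0:       # wrong side of (or on) the decision surface
                w = w + c * d * y
                errors += 1
        if errors == 0:                # one full error-free pass: training done
            break
    return w

# Hypothetical separable data: 1-D inputs augmented with a constant 1 component.
ys = np.array([[1.0, 1.0], [-0.5, 1.0], [3.0, 1.0], [-2.0, 1.0]])
ds = np.array([1.0, -1.0, 1.0, -1.0])
w = train_discrete_perceptron(ys, ds)
print(w, np.sign(ys @ w) == ds)        # learned weights reproduce all targets
```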
Discrete Perceptron Training Algorithm
- Geometrical Representations (cont.)
• Dynamic correction rule, with c dependent on the distance
  from the weight vector to the decision surface in the
  weight space:

                                    c = λ ( w¹ᵗ y / ‖y‖² )

                                    c y = λ ( w¹ᵗ y / ‖y‖ ) ( y / ‖y‖ )




                                                          38
Discrete Perceptron Training Algorithm
- Geometrical Representations (cont.)
• Example 3.3


                            1          − 0.5 y 1 ∈ C 1
                       y1 =       y2 =      
                            1           1 y2 ∈ C 2

                            3            2 y     3   ∈ C   1
                       y3 =        y4 =  
                            1           − 1  y   4   ∈ C   2



                       ∆w k =
                                   c
                                   2
                                     [         (          )]
                                     d k − sgn w kt y j y j

                         What if w kt y j = 0 ?
                         -> interpreted as a mistake
                         and followed by a correlation
                                                                   39
Continuous Perceptron
             Training Algorithm
• Replace the TLU (Threshold Logic Unit) with the
  sigmoid activation function for two reasons:
  – Gain finer control over the training procedure
  – Obtain differentiable characteristics that enable
    computation of the error gradient



                                         ŵ = w − η ∇E(w)
                              (η: the learning constant,  ∇E(w): the error gradient)




                                                                            40
Continuous Perceptron
         Training Algorithm (cont.)
• The new weight vector is obtained by moving in the
  direction of the negative gradient along the
  multidimensional error surface




                                                 41
Continuous Perceptron
            Training Algorithm (cont.)
• Define the error as the squared difference
  between the desired output and the actual
  output
             E = (1/2) (d − o)²

     or      E = (1/2) [ d − f( wᵗ y ) ]²  =  (1/2) [ d − f(net) ]²

             ∇E(w) = (1/2) ∇ ( [ d − f(net) ]² )

             ∇E(w) = [ ∂E/∂w1 , ∂E/∂w2 , … , ∂E/∂w_{n+1} ]ᵗ
                   = −(d − o) f′(net) [ ∂(net)/∂w1 , ∂(net)/∂w2 , … , ∂(net)/∂w_{n+1} ]ᵗ
                   = −(d − o) f′(net) y
                                                                                                42
Continuous Perceptron
                   Training Algorithm (cont.)
• Bipolar continuous activation function
      f(net) = 2 / ( 1 + exp(−λ·net) ) − 1

      f′(net) = 2 λ exp(−λ·net) / [ 1 + exp(−λ·net) ]²
              = (λ/2) { 1 − [ f(net) ]² } = (λ/2) ( 1 − o² )

      ŵ = w + (1/2) η λ (d − o) ( 1 − o² ) y

• Unipolar continuous activation function
      f(net) = 1 / ( 1 + exp(−λ·net) )

      f′(net) = λ exp(−λ·net) / [ 1 + exp(−λ·net) ]²
              = λ f(net) [ 1 − f(net) ] = λ o (1 − o)

      ŵ = w + η λ (d − o) o (1 − o) y
                                                                                                                  43
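
A sketch of delta-rule training with the bipolar sigmoid, following the update ŵ = w + ½ηλ(d − o)(1 − o²)y derived above; the learning constants and the (separable) training patterns are hypothetical choices.

```python
import numpy as np

def f_bipolar(net, lam=1.0):
    """Bipolar continuous activation: 2 / (1 + exp(-lam*net)) - 1."""
    return 2.0 / (1.0 + np.exp(-lam * net)) - 1.0

def train_continuous_perceptron(ys, ds, eta=0.5, lam=1.0, epochs=200):
    """Delta-rule training with the bipolar sigmoid:
    w' = w + (1/2) * eta * lam * (d - o) * (1 - o**2) * y."""
    w = np.zeros(ys.shape[1])
    for _ in range(epochs):
        for y, d in zip(ys, ds):
            o = f_bipolar(w @ y, lam)
            w = w + 0.5 * eta * lam * (d - o) * (1.0 - o**2) * y
    return w

# Hypothetical separable data (inputs augmented with a constant 1 component):
ys = np.array([[1.0, 1.0], [-0.5, 1.0], [3.0, 1.0], [-2.0, 1.0]])
ds = np.array([1.0, -1.0, 1.0, -1.0])
w = train_continuous_perceptron(ys, ds)
print(w, f_bipolar(ys @ w))            # outputs move toward the bipolar targets
```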
Continuous Perceptron
        Training Algorithm (cont.)
• Example 3.3
      f(net) = 2 / ( 1 + exp(−net) ) − 1       (bipolar activation, λ = 1)

      y1 = [ 1   1 ]ᵗ ,   y2 = [ −0.5   1 ]ᵗ ,   y3 = [ 3   1 ]ᵗ ,   y4 = [ 2   −1 ]ᵗ
                                                              44
Continuous Perceptron
        Training Algorithm (cont.)
• Example 3.3   Total error surface   Trajectories started from four
                                         arbitrary initial weights




                                                                       45
Continuous Perceptron
         Training Algorithm (cont.)
• Treat the last fixed component of input pattern
  vector as the neuron activation threshold




                                                    46
Continuous Perceptron
         Training Algorithm (cont.)
• R-category linear classifier using R discrete
  bipolar perceptrons
  – Goal: the i-th TLU responds with +1 to patterns of class i and
    all other TLUs respond with −1

                                 ŵi = wi + (1/2) c (di − oi) y

                                 di = 1 ,  dj = −1  for j = 1, 2, ..., R ,  j ≠ i

                                          (a “local representation” of the class)




                                                                           47
Continuous Perceptron
        Training Algorithm (cont.)
• Example 3.5




                                     48
Continuous Perceptron
         Training Algorithm (cont.)
• R-category linear classifier using R continuous
  bipolar perceptrons


                            ŵi = wi + (1/2) η λ (di − oi) ( 1 − oi² ) y ,
                            for i = 1, 2, ..., R

                            di = 1 ,  dj = −1  for j = 1, 2, ..., R ,  j ≠ i




                                                                            49
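
A sketch of the R-category continuous-perceptron rule above, with one weight vector per class and bipolar local targets; the patterns, learning constants, and epoch count are hypothetical.

```python
import numpy as np

def f_bipolar(net, lam=1.0):
    return 2.0 / (1.0 + np.exp(-lam * net)) - 1.0

def train_r_category(ys, labels, R, eta=0.5, lam=1.0, epochs=500):
    """One continuous bipolar perceptron per class ("local representation"):
    d_i = +1 for the correct class, -1 otherwise, and each weight vector is
    updated by w_i' = w_i + (1/2)*eta*lam*(d_i - o_i)*(1 - o_i**2)*y."""
    W = np.zeros((R, ys.shape[1]))                 # one weight row per category
    for _ in range(epochs):
        for y, label in zip(ys, labels):
            d = -np.ones(R)
            d[label] = 1.0                         # local (one-hot, bipolar) target
            o = f_bipolar(W @ y, lam)
            W += 0.5 * eta * lam * ((d - o) * (1.0 - o**2))[:, None] * y
    return W

# Hypothetical 2-D patterns, augmented with a constant 1 component:
ys = np.array([[2.0, 1.0, 1.0], [-1.0, 2.0, 1.0], [0.0, -2.0, 1.0]])
labels = np.array([0, 1, 2])                       # class index of each pattern
W = train_r_category(ys, labels, R=3)
print(np.argmax(W @ ys.T, axis=0))                 # recovers [0, 1, 2]
```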
Continuous Perceptron
         Training Algorithm (cont.)
• Error function dependent on the difference
  vector d-o




                                               50
Bayes’ Classifier vs. Perceptron

• The perceptron operates on the premise that the patterns to be
  classified are linearly separable (otherwise the training algorithm
  will oscillate), while Bayes’ classifier assumes that the (Gaussian)
  distributions of the two classes do overlap each other
• The perceptron is nonparametric while the Bayes’ classifier is
  parametric (its derivation is contingent on the assumption of the
  underlying distributions)
• The perceptron is simple and adaptive, and needs little storage,
  while the Bayes’ classifier could be made adaptive but at the
  expense of increased storage and more complex computations

                                                                  51
Homework

• P3.5, P3.7, P3.9, P3.22




                             52
