Machine Learning: Generative and Discriminative Models
Sargur N. Srihari
srihari@cedar.buffalo.edu
Machine Learning Course: http://www.cedar.buffalo.edu/~srihari/CSE574/index.html
Outline of Presentation
1. What is Machine Learning?
   ML applications, ML as Search
2. Generative and Discriminative Taxonomy
3. Generative-Discriminative Pairs
   Classifiers: Naïve Bayes and Logistic Regression
   Sequential Data: HMMs and CRFs
4. Performance Comparison in Sequential Applications
   NLP: Table extraction, POS tagging, Shallow parsing,
   Handwritten word recognition, Document analysis
5. Advantages, disadvantages
6. Summary
7. References
1. Machine Learning
• Programming computers to use example data or past experience
• Well-Posed Learning Problems
  – A computer program is said to learn from experience E
    with respect to a class of tasks T and performance measure P,
    if its performance at tasks T, as measured by P,
    improves with experience E.
Problems Too Difficult To Program by Hand
• Learning to drive an autonomous vehicle
  – Train computer-controlled vehicles to steer correctly
  – Drive at 70 mph for 90 miles on public highways
  – Associate steering commands with image sequences

Task T: driving on a public, 4-lane highway using vision sensors
Performance measure P: average distance traveled before an error
                       (as judged by a human overseer)
Training experience E: a sequence of images and steering commands recorded
                       while observing a human driver
Example Problem: Handwritten Digit Recognition
• Handcrafted rules will result in a large number of rules and exceptions
• Better to have a machine that learns from a large training set
Figure: wide variability of the same numeral
Other Applications of Machine Learning
• Recognizing spoken words
  – Speaker-specific strategies for recognizing phonemes and words from speech
  – Neural networks and methods for learning HMMs can be customized to individual
    speakers, vocabularies and microphone characteristics
• Search engines
  – Information extraction from text
• Data mining
  – Mining very large databases to learn general regularities implicit in the data
  – Classifying celestial objects from image data
    • Decision tree for objects in a sky survey: 3 terabytes


ML as Searching Hypotheses Space
• Very large space of possible hypotheses to fit:
  – observed data and
  – any prior knowledge held by the observer

  Method            Hypothesis Space
  Concept Learning  Boolean Expressions
  Decision Trees    All Possible Trees
  Neural Networks   Weight Space
ML Methodologies are Increasingly Statistical
• Rule-based expert systems are being replaced by probabilistic generative models
• Example: autonomous agents in AI
  – ELIZA: natural-language rules to emulate a therapy session
  – Manual specification of models and theories is increasingly difficult
• Greater availability of data and computational power drives the migration away
  from rule-based, manually specified models to probabilistic, data-driven models
The Statistical ML Approach
1. Data Collection
   Large sample of data of how humans perform the task
2. Model Selection
   Settle on a parametric statistical model of the process
3. Parameter Estimation
   Calculate parameter values by inspecting the data

Using the learned model, perform:
4. Search
   Find the optimal solution to the given problem
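As a rough sketch of these four steps on synthetic data (the toy 1-D Gaussian setup and all numbers below are illustrative assumptions, not from the slides):

    import numpy as np

    # 1. Data collection: synthetic 1-D samples standing in for human-labeled data
    rng = np.random.default_rng(0)
    x0 = rng.normal(loc=-1.0, scale=1.0, size=200)   # class 0
    x1 = rng.normal(loc=+1.5, scale=1.0, size=200)   # class 1

    # 2. Model selection: one Gaussian per class, shared variance, equal priors
    # 3. Parameter estimation: maximum-likelihood estimates computed from the data
    mu0, mu1 = x0.mean(), x1.mean()
    var = np.concatenate([x0 - mu0, x1 - mu1]).var()

    def log_gauss(x, mu, var):
        return -0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var)

    # 4. Search: for a new input, pick the class with the higher posterior
    def predict(x):
        return int(log_gauss(x, mu1, var) > log_gauss(x, mu0, var))

    print(predict(0.9), predict(-2.0))   # most likely 1 and 0 respectively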
2. Generative and Discriminative Models: An Analogy
• The task is to determine the language that someone is speaking
• Generative approach:
  – learn each language and then determine which language the speech belongs to
• Discriminative approach:
  – learn only the linguistic differences, without learning any language – a much easier task!
                   Taxonomy of ML Models
• Generative Methods
     – Model class-conditional pdfs and prior probabilities
     – “Generative” since sampling can generate synthetic data points
     – Popular models
           • Gaussians, Naïve Bayes, Mixtures of multinomials
           • Mixtures of Gaussians, Mixtures of experts, Hidden Markov Models (HMM)
           • Sigmoidal belief networks, Bayesian networks, Markov random fields
• Discriminative Methods
     –   Directly estimate posterior probabilities
     –   No attempt to model underlying probability distributions
     –   Focus computational resources on given task– better performance
     –   Popular models
           • Logistic regression, SVMs
           • Traditional neural networks, Nearest neighbor
           • Conditional Random Fields (CRF)
Generative Models (graphical)
Figure: three example graphical models – a mixture model in which a parent node
selects between components, the Quick Medical Reference-DT network for diagnosing
diseases from symptoms, and a Markov Random Field.
Successes of Generative Methods
• NLP
  – Traditional rule-based or Boolean logic systems (e.g., DIALOG and LexisNexis)
    are giving way to statistical approaches (Markov models and stochastic
    context-free grammars)
• Medical Diagnosis
  – The QMR knowledge base, initially a heuristic expert system for reasoning about
    diseases and symptoms, has been augmented with a decision-theoretic formulation
• Genomics and Bioinformatics
  – Sequences represented as generative HMMs
Discriminative Classifier: SVM
Figure: a nonlinear decision boundary in the original space (x1, x2) becomes a
linear boundary in the higher-dimensional space (x1, x2, x1x2).
Support Vector Machines
• Support vectors are the training patterns nearest to the hyperplane, at distance b
  from it (three support vectors are shown as solid dots in the figure)
• The SVM finds the hyperplane with maximum distance from the nearest training patterns
• For a full description of SVMs see
  http://www.cedar.buffalo.edu/~srihari/CSE555/SVMs.pdf
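A small sketch of the feature-map idea from the previous slide: points labeled by the sign of x1·x2 are not linearly separable in (x1, x2) but become linearly separable after the map (x1, x2) → (x1, x2, x1·x2). The data and the hand-picked weight vector are illustrative assumptions, not part of the slides.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.uniform(-1, 1, size=(200, 2))    # points in the original 2-D space
    y = np.sign(X[:, 0] * X[:, 1])           # XOR-like labels: not linearly separable in (x1, x2)

    # Feature map from the slide: (x1, x2) -> (x1, x2, x1*x2)
    Phi = np.column_stack([X[:, 0], X[:, 1], X[:, 0] * X[:, 1]])

    # In the mapped space the linear rule sign(w . phi) with w = (0, 0, 1) separates the data
    w = np.array([0.0, 0.0, 1.0])
    pred = np.sign(Phi @ w)
    print("accuracy with a linear boundary in the mapped space:", (pred == y).mean())  # should be 1.0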


        3. Generative-Discriminative Pairs

     • Naïve Bayes and Logistic Regression form
       a generative-discriminative pair for
       classification
     • Their relationship mirrors that between
       HMMs and linear-chain CRFs for
       sequential data


Graphical Model Relationship
Figure: the Naïve Bayes classifier (class y with features x1..xM), modeling p(y,x),
and the Hidden Markov Model (states y1..yN with observations x1..xN), modeling p(Y,X),
are the GENERATIVE models; moving along the SEQUENCE axis takes Naïve Bayes to the HMM.
CONDITIONING each generative model yields its DISCRIMINATIVE counterpart:
Logistic Regression, modeling p(y|x), and the Conditional Random Field, modeling p(Y|X).
Generative Classifier: Bayes
• Given variables x = (x1,..,xM) and class variable y
• The joint pdf is p(x, y)
  – Called a generative model since we can generate more samples artificially
• Given the full joint pdf we can
  – Marginalize: p(y) = Σ_x p(x, y)
  – Condition: p(y | x) = p(x, y) / p(x)
  – By conditioning the joint pdf we form a classifier
• Computational problem:
  – If x is binary then we need 2^M values per class
  – With M = 10 and two classes that is 2048 probabilities to estimate; if roughly 100
    samples are needed to estimate each probability, the data requirement is enormous
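A toy sketch of marginalizing and conditioning a small joint table (the numbers are made-up illustrations):

    import numpy as np

    # Joint p(x, y) for binary x (rows) and binary y (columns); entries are illustrative
    p_xy = np.array([[0.30, 0.10],
                     [0.20, 0.40]])

    p_y = p_xy.sum(axis=0)                 # marginalize: p(y) = sum_x p(x, y)
    p_x = p_xy.sum(axis=1)                 # p(x) = sum_y p(x, y)
    p_y_given_x = p_xy / p_x[:, None]      # condition: p(y | x) = p(x, y) / p(x)

    print(p_y)            # [0.5 0.5]
    print(p_y_given_x)    # each row sums to 1; row x is the classifier p(y | x)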
Naïve Bayes Classifier
• Goal is to predict a single class variable y given a vector of features x = (x1,..,xM)
• Assume that once the class label is known the features are independent
• The joint probability model has the form
      p(y, x) = p(y) ∏_{m=1}^{M} p(x_m | y)
  – Need to estimate only M conditional probabilities
• A factor graph is obtained by defining the factors ψ(y) = p(y), ψ_m(y, x_m) = p(x_m | y)
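A minimal sketch of this factorization with binary features; the class prior and the conditional probabilities below are illustrative assumptions, not estimated from real data.

    import numpy as np

    # Naive Bayes with binary features: p(y, x) = p(y) * prod_m p(x_m | y)
    p_y = np.array([0.6, 0.4])                      # p(y) for classes 0 and 1
    p_x_given_y = np.array([[0.8, 0.1, 0.3],        # p(x_m = 1 | y = 0) for M = 3 features
                            [0.2, 0.7, 0.9]])       # p(x_m = 1 | y = 1)

    def joint(y, x):
        """p(y, x) under the naive Bayes factorization."""
        px = np.where(x == 1, p_x_given_y[y], 1.0 - p_x_given_y[y])
        return p_y[y] * px.prod()

    x = np.array([1, 0, 1])
    scores = np.array([joint(0, x), joint(1, x)])
    posterior = scores / scores.sum()               # condition to get p(y | x)
    print(posterior)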
Discriminative Classifier: Logistic Regression
• Feature vector x
• Two-class classification: class variable y has values C1 and C2
• The posterior probability p(C1|x) is written as
      p(C1|x) = f(x) = σ(wᵀx), where σ(a) = 1 / (1 + exp(−a)) is the logistic sigmoid
• It is known as logistic regression in statistics
  – Although it is a model for classification rather than for regression

Properties of the logistic sigmoid:
A. Symmetry: σ(−a) = 1 − σ(a)
B. Inverse: a = ln(σ / (1 − σ)), known as the logit, also known as the log odds
   since it is the ratio ln[p(C1|x)/p(C2|x)]
C. Derivative: dσ/da = σ(1 − σ)
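A short sketch of the sigmoid, its logit inverse, and the two-class prediction rule; the weight vector and input are illustrative values, not learned parameters.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def logit(s):
        return np.log(s / (1.0 - s))     # inverse of the sigmoid: the log odds

    # Two-class logistic regression prediction p(C1 | x) = sigmoid(w . x)
    w = np.array([0.5, -1.2, 2.0])
    x = np.array([1.0, 0.3, 0.8])
    p_c1 = sigmoid(w @ x)
    print(p_c1, logit(p_c1))             # logit recovers the activation a = w . x

    # Properties from the slide
    a = 1.7
    assert np.isclose(sigmoid(-a), 1 - sigmoid(a))                    # symmetry
    assert np.isclose(sigmoid(a) * (1 - sigmoid(a)),                  # derivative
                      (sigmoid(a + 1e-6) - sigmoid(a)) / 1e-6, atol=1e-4)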
Logistic Regression versus Generative Bayes Classifier
• The posterior probability of the class variable is
      p(C1|x) = p(x|C1) p(C1) / [ p(x|C1) p(C1) + p(x|C2) p(C2) ]
              = σ(a),  where  a = ln [ p(x|C1) p(C1) / ( p(x|C2) p(C2) ) ]
• In a generative model we estimate the class-conditionals (which are used to determine a)
• In the discriminative approach we directly estimate a as a linear function of x, i.e., a = wᵀx
Logistic Regression Parameters
• For an M-dimensional feature space, logistic regression has M parameters w = (w1,..,wM)
• By contrast, the generative approach
  – fitting Gaussian class-conditional densities results in 2M parameters for the means,
    M(M+1)/2 parameters for the shared covariance matrix, and one for the class prior p(C1)
  – which can be reduced to O(M) parameters by assuming independence via Naïve Bayes
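For example, with M = 10 features the Gaussian generative model needs 2M = 20 mean parameters, M(M+1)/2 = 55 shared-covariance parameters and 1 prior, 76 parameters in total, whereas logistic regression needs only the 10 weights.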
Multi-class Logistic Regression
• Case of K > 2 classes:
      p(Ck|x) = p(x|Ck) p(Ck) / Σ_j p(x|Cj) p(Cj)
              = exp(a_k) / Σ_j exp(a_j)
• Known as the normalized exponential, where a_k = ln p(x|Ck) p(Ck)
• The normalized exponential is also known as softmax since if a_k >> a_j for all j ≠ k
  then p(Ck|x) ≈ 1 and p(Cj|x) ≈ 0
• In logistic regression we assume the activations are given by a_k = w_kᵀx
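A small sketch of the softmax (normalized exponential) computation; the weights and input below are illustrative.

    import numpy as np

    def softmax(a):
        a = a - a.max()                       # subtract the max for numerical stability
        e = np.exp(a)
        return e / e.sum()

    # Activations a_k = w_k . x for K = 3 classes
    W = np.array([[0.2, -0.5], [1.0, 0.3], [-0.7, 0.8]])
    x = np.array([1.5, -0.2])
    p = softmax(W @ x)
    print(p, p.sum())                         # class posteriors p(C_k | x), summing to 1

    # "Softmax" behaviour: if one activation dominates, its posterior approaches 1
    print(softmax(np.array([10.0, 0.0, 0.0])))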
Graphical Model for Logistic Regression
• Multiclass logistic regression can be written as
      p(y|x) = (1/Z(x)) exp{ λ_y + Σ_{j=1}^{K} λ_{y,j} x_j },
      where  Z(x) = Σ_y exp{ λ_y + Σ_{j=1}^{K} λ_{y,j} x_j }
• Rather than using one weight per class we can define feature functions that are
  nonzero only for a single class:
      p(y|x) = (1/Z(x)) exp{ Σ_{k=1}^{K} λ_k f_k(y, x) }
• This notation mirrors the usual notation for CRFs
4. Sequence Models
• Classifiers predict only a single class variable
• Graphical models are best suited to modeling many interdependent variables
• Given a sequence of observations X = {x_n}, n = 1,..,N
• and an underlying sequence of states Y = {y_n}, n = 1,..,N
Generative Model: HMM
• X is the observed data sequence to be labeled;
  Y is the random variable over label sequences
• An HMM is a distribution that models p(Y, X)
• The joint distribution is
      p(Y, X) = ∏_{n=1}^{N} p(y_n | y_{n-1}) p(x_n | y_n)
• The highly structured network (a chain y_1 → y_2 → .. → y_N, with each x_n attached
  to its state y_n) encodes the conditional independences:
  – given the current state, past states are independent of future states
  – each observation is conditionally independent of everything else given its state
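A toy sketch of this joint distribution for a 2-state HMM with 3 observation symbols; all probabilities below are illustrative, not estimated from data.

    import numpy as np

    pi = np.array([0.6, 0.4])                       # initial distribution p(y_1)
    A = np.array([[0.7, 0.3],                       # transitions p(y_n | y_{n-1})
                  [0.2, 0.8]])
    B = np.array([[0.5, 0.4, 0.1],                  # emissions p(x_n | y_n)
                  [0.1, 0.3, 0.6]])

    def joint(states, obs):
        """p(Y, X) = prod_n p(y_n | y_{n-1}) p(x_n | y_n), with p(y_1 | y_0) := p(y_1)."""
        p = pi[states[0]] * B[states[0], obs[0]]
        for n in range(1, len(states)):
            p *= A[states[n - 1], states[n]] * B[states[n], obs[n]]
        return p

    print(joint([0, 0, 1], [0, 1, 2]))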
Discriminative Model for Sequential Data
• The CRF models the conditional distribution p(Y|X)
• A CRF is a random field globally conditioned on the observation X
  (a chain y_1 – y_2 – .. – y_N, all conditioned on X)
• The conditional distribution p(Y|X) that follows from the joint distribution p(Y, X)
  can be rewritten as a Markov Random Field
Markov Random Field (MRF)
• Also called an undirected graphical model
• The joint distribution of a set of variables x is defined by an undirected graph as
      p(x) = (1/Z) ∏_C ψ_C(x_C)
  where C is a maximal clique (a set of nodes each connected to every other node),
  x_C is the set of variables in that clique,
  ψ_C is a potential function (or local or compatibility function)
  such that ψ_C(x_C) > 0, typically ψ_C(x_C) = exp{−E(x_C)}, and
      Z = Σ_x ∏_C ψ_C(x_C)
  is the partition function for normalization
• "Model" refers to a family of distributions and "Field" refers to a specific one
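A minimal sketch of these definitions for a 3-node chain MRF with pairwise cliques; the energy function is an illustrative choice.

    import itertools
    import numpy as np

    # Pairwise MRF over three binary variables x1 - x2 - x3; the cliques are (x1,x2) and (x2,x3).
    def psi(a, b):
        E = 0.0 if a == b else 1.0            # lower energy (higher potential) when neighbours agree
        return np.exp(-E)

    def unnormalized(x):
        return psi(x[0], x[1]) * psi(x[1], x[2])

    # Partition function Z: sum of the clique products over all joint configurations
    Z = sum(unnormalized(x) for x in itertools.product([0, 1], repeat=3))

    def p(x):
        return unnormalized(x) / Z

    print(p((0, 0, 0)), sum(p(x) for x in itertools.product([0, 1], repeat=3)))  # second value is 1.0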
MRF with Input-Output Variables
• X is a set of input variables that are observed
  – An element of X is denoted x
• Y is a set of output variables that we predict
  – An element of Y is denoted y
• A ranges over subsets of X ∪ Y
  – Elements of A that are in A ∩ X are denoted x_A
  – Elements of A that are in A ∩ Y are denoted y_A
• Then the undirected graphical model has the form
      p(x, y) = (1/Z) ∏_A Ψ_A(x_A, y_A),  where  Z = Σ_{x,y} ∏_A Ψ_A(x_A, y_A)
MRF Local Function
• Assume each local function has the form
      Ψ_A(x_A, y_A) = exp{ Σ_m θ_{Am} f_{Am}(x_A, y_A) }
  where θ_A is a parameter vector, the f_A are feature functions,
  and m = 1,..,M indexes the features
From HMM to CRF
• In an HMM:
      p(Y, X) = ∏_{n=1}^{N} p(y_n | y_{n-1}) p(x_n | y_n)
• This can be rewritten as
      p(Y, X) = (1/Z) exp{ Σ_n Σ_{i,j∈S} λ_ij 1{y_n = i} 1{y_{n-1} = j}
                          + Σ_n Σ_{i∈S} Σ_{o∈O} μ_oi 1{y_n = i} 1{x_n = o} }
  where the indicator function 1{x = x'} takes value 1 when x = x' and 0 otherwise,
  and θ = {λ_ij, μ_oi} are the parameters of the distribution
• Further rewritten using feature functions of the form f_m(y_n, y_{n-1}, x_n):
      p(Y, X) = (1/Z) exp{ Σ_{m=1}^{M} λ_m f_m(y_n, y_{n-1}, x_n) }
  We need one feature for each state transition (i, j),
      f_ij(y, y', x) = 1{y = i} 1{y' = j},
  and one for each state–observation pair (i, o),
      f_io(y, y', x) = 1{y = i} 1{x = o}
• Which gives us
      p(Y | X) = p(Y, X) / Σ_{Y'} p(Y', X)
               = exp{ Σ_{m=1}^{M} λ_m f_m(y_n, y_{n-1}, x_n) }
                 / Σ_{Y'} exp{ Σ_{m=1}^{M} λ_m f_m(y'_n, y'_{n-1}, x_n) }
• Note that Z cancels out
CRF Definition
• A linear-chain CRF is a distribution p(Y|X) that takes the form
      p(Y | X) = (1/Z(X)) exp{ Σ_{m=1}^{M} λ_m f_m(y_n, y_{n-1}, x_n) }
• where Z(X) is an instance-specific normalization function
      Z(X) = Σ_Y exp{ Σ_{m=1}^{M} λ_m f_m(y_n, y_{n-1}, x_n) }
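A brute-force sketch of a linear-chain CRF with HMM-style indicator features (practical implementations compute Z(X) with the forward algorithm rather than by enumeration); the label set, observation alphabet, and weights below are illustrative assumptions.

    import itertools
    import numpy as np

    S, O = [0, 1], [0, 1, 2]                          # 2 labels, 3 observation symbols
    lam_trans = np.array([[1.0, -0.5], [-0.5, 1.0]])  # weights for transition features f_ij
    lam_emit = np.array([[0.8, 0.1, -0.6],            # weights for label-observation features f_io
                         [-0.6, 0.2, 0.9]])

    def score(Y, X):
        """sum_n [ lam_trans[y_{n-1}, y_n] + lam_emit[y_n, x_n] ] (no transition term at n = 0)."""
        s = lam_emit[Y[0], X[0]]
        for n in range(1, len(Y)):
            s += lam_trans[Y[n - 1], Y[n]] + lam_emit[Y[n], X[n]]
        return s

    def p_Y_given_X(Y, X):
        Z = sum(np.exp(score(Yp, X)) for Yp in itertools.product(S, repeat=len(X)))
        return np.exp(score(Y, X)) / Z                # Z(X) is the instance-specific normalizer

    X = [0, 2, 1]
    print(p_Y_given_X([0, 1, 1], X))
    print(sum(p_Y_given_X(list(Y), X) for Y in itertools.product(S, repeat=3)))  # 1.0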
Functional Models
GENERATIVE
  Naïve Bayes Classifier:    p(y, x) = p(y) ∏_{m=1}^{M} p(x_m | y)
  Hidden Markov Model:       p(Y, X) = ∏_{n=1}^{N} p(y_n | y_{n-1}) p(x_n | y_n)
DISCRIMINATIVE
  Logistic Regression:       p(y | x) = exp{ Σ_{m=1}^{M} λ_m f_m(y, x) }
                                        / Σ_{y'} exp{ Σ_{m=1}^{M} λ_m f_m(y', x) }
  Conditional Random Field:  p(Y | X) = exp{ Σ_{m=1}^{M} λ_m f_m(y_n, y_{n-1}, x_n) }
                                        / Σ_{Y'} exp{ Σ_{m=1}^{M} λ_m f_m(y'_n, y'_{n-1}, x_n) }
NLP: Part-Of-Speech Tagging
For a sequence of words w = {w1, w2, .., wn}, find syntactic labels s for each word:
  w = The quick brown fox    jumped over the lazy dog
  s = DET ADJ   ADJ   NOUN-S VERB-P PREP DET ADJ  NOUN-S

The baseline is already 90%:
  • tag every word with its most frequent tag
  • tag unknown words as nouns

Per-word error rates for POS tagging on the Penn Treebank:
  Model   Error
  HMM     5.69%
  CRF     5.55%
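A sketch of that 90% baseline (most-frequent-tag, with unknown words tagged as nouns); the tiny training corpus below is purely illustrative.

    from collections import Counter, defaultdict

    train = [("the", "DET"), ("quick", "ADJ"), ("fox", "NOUN-S"), ("the", "DET"),
             ("jumped", "VERB-P"), ("over", "PREP"), ("the", "DET"), ("dog", "NOUN-S")]

    # Count how often each tag occurs for each word, then keep the most frequent tag
    counts = defaultdict(Counter)
    for word, tag in train:
        counts[word][tag] += 1
    most_frequent = {w: c.most_common(1)[0][0] for w, c in counts.items()}

    def baseline_tag(words):
        return [most_frequent.get(w, "NOUN-S") for w in words]   # unknown words become nouns

    print(baseline_tag(["the", "lazy", "dog"]))   # ['DET', 'NOUN-S', 'NOUN-S']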
Table Extraction
Task: label each line of a text document with whether it is part of a table and,
if so, its role in the table.

Finding tables and extracting their information is a necessary component of data
mining, question-answering and IR tasks.

  HMM     CRF
  89.7%   99.9%
Shallow Parsing
• A precursor to full parsing or information extraction
  – Identifies the non-recursive cores of various phrase types in text (e.g., NP chunks)
• Input: the words in a sentence, annotated automatically with POS tags
• Task: label each word with a label indicating whether the
  word is outside a chunk (O), starts a chunk (B), or continues a chunk (I)

CRFs beat all reported single-model NP chunking results on the standard
evaluation dataset
Handwritten Word Recognition
Given a word image and a lexicon, find the most probable lexical entry.

Algorithm outline
  • Oversegment the image: segment combinations are potential characters
  • Let y = a word in the lexicon, s = a grouping of segments,
    x = the input word image features
  • Find the word in the lexicon and the segment grouping that maximize P(y, s | x)

CRF Model
      P(y | x, θ) = exp{ψ(y, x; θ)} / Σ_{y'} exp{ψ(y', x; θ)}
      ψ(y, x; θ) = Σ_{j=1}^{m} ( A(j, y_j, x; θ^s) + Σ_{(j,k)∈E} I(j, k, y_j, y_k, x; θ^t) )
  where y_i ∈ {a-z, A-Z, 0-9} and θ are the model parameters.

Association potential (state term):
      A(j, y_j, x; θ^s) = Σ_i f_i^s(j, y_j, x) · θ_ij^s
Interaction potential:
      I(j, k, y_j, y_k, x; θ^t) = Σ_i f_i^t(j, k, y_j, y_k, x) · θ_ijk^t

Figure: precision versus word recognition rank, comparing the CRF with a
segment-based dynamic programming (Segment-DP) approach.
Document Analysis (labeling regions): error rates
                         CRF       Neural Network   Naive Bayes
  Machine-printed text   1.64%     2.35%            11.54%
  Handwritten text       5.19%     20.90%           25.04%
  Noise                  10.20%    15.00%           12.23%
  Total                  4.25%     7.04%            12.58%
5. Advantages of CRFs over Other Models
• Compared with generative models
  – CRFs relax the assumption that the observed data are conditionally independent
    given the labels
  – They can contain arbitrary feature functions
    • Each feature function can use the entire input data sequence: the probability of
      a label at an observed data segment may depend on any past or future data segments
• Compared with other discriminative models
  – CRFs avoid the limitation of other discriminative Markov models, which are biased
    towards states with few successor states
  – A single exponential model is used for the joint probability of the entire
    sequence of labels given the observed sequence
  – In the other models, each factor depends only on the previous label and not on
    future labels: P(y | x) is a product of factors, one for each label
Disadvantages of Discriminative Classifiers
• Lack the elegance of the generative approach
  – Priors, structure, uncertainty
• Rely on alternative notions of penalty functions, regularization and kernel functions
• Feel like black boxes
  – Relationships between variables are not explicit and visualizable

Bridging Generative and Discriminative
• Can the performance of SVMs be combined elegantly with flexible Bayesian statistics?
• Maximum Entropy Discrimination marries both methods
  – Solve over a distribution of parameters (a distribution over solutions)

6. Summary
• Machine learning algorithms have great practical value in a variety of application domains
  – A well-defined learning problem requires a well-specified task, performance metric,
    and source of experience
• Generative and discriminative methods are two broad approaches:
  – the former model the joint distribution, the latter directly solve the classification problem
• Generative and discriminative method pairs
  – Naïve Bayes and Logistic Regression are the corresponding pair for classification
  – HMM and CRF are the corresponding pair for sequential data
• CRFs perform better in language-related tasks
• Generative models are more elegant and have explanatory power
7. References
1. T. Mitchell, Machine Learning, McGraw-Hill, 1997
2. C. Bishop, Pattern Recognition and Machine Learning, Springer, 2006
3. T. Jebara, Machine Learning: Discriminative and Generative, Kluwer, 2004
4. R.O. Duda, P.E. Hart and D. Stork, Pattern Classification, 2nd Ed., Wiley, 2002
5. C. Sutton and A. McCallum, An Introduction to Conditional Random Fields for
   Relational Learning
6. S. Shetty, H. Srinivasan and S.N. Srihari, Handwritten Word Recognition using CRFs,
   ICDAR 2007
7. S. Shetty, H. Srinivasan and S.N. Srihari, Segmentation and Labeling of Documents
   using CRFs, SPIE-DRR 2007