Introduction to Machine Learning Aristotelis Tsirigos email: tsirigos@cs.nyu.edu  Dennis Shasha - Advanced Database Systems NYU Computer Science
What is Machine Learning? Principles, methods, and algorithms for making predictions from past experience. Not mere memorization, but the capability to generalize to novel situations based on past experience. Where no theoretical model exists to explain the data, machine learning can be employed to offer such a model.
Learning models According to how active or passive the learner is: Statistical learning model: no control over the observations; they are presented at random in an independent, identically distributed (i.i.d.) fashion. Online model: an external source presents the observations to the learner in a query form. Query model: the learner queries an external “expert” source.
Types of learning problems A very rough categorization of learning problems: Unsupervised learning Clustering, density estimation, feature selection Supervised learning Classification, regression Reinforcement learning Feedback, games
Outline Learning methods Bayesian learning Nearest neighbor Decision trees Linear classifiers Ensemble methods (bagging & boosting) Testing the learner Learner evaluation Practical issues Resources
Bayesian learning - Introduction Given are: observed data D = {d_1, d_2, …, d_n} and a hypothesis space H. In the Bayesian setup we want to find the hypothesis that best fits the data in a probabilistic manner. In general, this is computationally intractable without any assumptions about the data.
Bayesian learning - Elaboration First transformation using Bayes rule, then simplification, as written out below. Notice that the optimal choice also depends on the a priori probability P(h) of hypothesis h.
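One standard way to write the rule this slide derives (a reconstruction of the usual Bayes-rule argument, not the slide's exact notation): the chosen hypothesis is the maximum a posteriori (MAP) one, and P(D) can be dropped because it does not depend on h:

\[ h_{MAP} \;=\; \arg\max_{h \in H} P(h \mid D) \;=\; \arg\max_{h \in H} \frac{P(D \mid h)\,P(h)}{P(D)} \;=\; \arg\max_{h \in H} P(D \mid h)\,P(h). \]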
Bayesian learning - Independence We can further simplify by assuming that the data in D are independently drawn from the underlying distribution; under this assumption P(D|h) factors into the product of the P(d_i|h). How do we estimate P(d_i|h) in practice?
Bayesian learning - Analysis For any hypothesis h in H we assume a distribution P(d|h) for any point d in the input space. If d is a point in an m-dimensional space, then the distribution is in fact P(d^(1), d^(2), …, d^(m) | h). Problems: complex optimization problem; suboptimal techniques can be used if the distributions are differentiable. In general, there are too many parameters to estimate, therefore a lot of data is needed for reliable estimation.
Bayesian learning - Analysis Need to further analyze the distribution P(d|h): we can assume features are independent (Naïve Bayes), or build Bayesian Networks where dependencies of the features are explicitly modeled. Still we have to somehow learn the distributions: model them as parametrized distributions (e.g. Gaussians) and estimate the parameters using standard greedy techniques (e.g. Expectation Maximization).
Bayesian learning - Summary Makes use of prior knowledge of: The likelihood of alternative hypotheses and The probability of observing data given a specific hypothesis The goal is to determine the most probable hypothesis given a series of observations The Naïve Bayes method has been found useful in practical applications (e.g. text classification) If the naïve assumption is not appropriate, there is a generic algorithm (EM) that can be used to find a hypothesis that is locally optimal
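As an illustration of the Naïve Bayes method mentioned above, here is a minimal sketch for categorical features; the function names, the Laplace smoothing constant alpha, and the use of log-probabilities are my own choices rather than anything prescribed in the slides.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(X, y, alpha=1.0):
    """Estimate P(class) and, per feature, P(value | class) with Laplace smoothing."""
    n = len(y)
    class_counts = Counter(y)
    # counts[c][j][v] = number of times feature j takes value v within class c
    counts = {c: defaultdict(Counter) for c in class_counts}
    for xi, yi in zip(X, y):
        for j, v in enumerate(xi):
            counts[yi][j][v] += 1
    # observed values of each feature (used for the smoothing denominator)
    values = [set(xi[j] for xi in X) for j in range(len(X[0]))]
    priors = {c: class_counts[c] / n for c in class_counts}
    return priors, counts, values, class_counts, alpha

def predict_naive_bayes(model, x):
    """Return argmax_c  log P(c) + sum_j log P(x_j | c)  (the naive independence assumption)."""
    priors, counts, values, class_counts, alpha = model
    best_class, best_score = None, -math.inf
    for c in priors:
        score = math.log(priors[c])
        for j, v in enumerate(x):
            num = counts[c][j][v] + alpha
            den = class_counts[c] + alpha * len(values[j])
            score += math.log(num / den)
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```

A call like predict_naive_bayes(train_naive_bayes(rows, labels), new_row) then implements the decision rule sketched on the previous slides.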
Nearest Neighbor - Introduction Belongs to the class of instance-based learners: the learner does not make any global prediction of the target function, it only predicts locally for a given point (lazy learner). Idea: given a query instance x, look at past observations in D that are “close” to x in order to determine x’s class y. Issues: How do we define distance? How do we define the notion of “neighborhood”?
Nearest Neighbor - Details Classify a new instance x according to its neighborhood N(x). The neighborhood can be defined in different ways: constant radius, or the k nearest neighbors. Weights are a function of distance: w_i = w(d(x, x_i)). Classification rule: a weighted vote over the labels y_i of the points x_i in N(x), using the weights w_i.
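A minimal sketch of this rule, assuming a Euclidean distance and inverse-distance weights w(d) = 1/(d + eps); both are illustrative choices, since the slide leaves the metric and the weighting function open.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=5, eps=1e-9):
    """Distance-weighted vote among the k nearest neighbors of a single query point x."""
    dists = np.linalg.norm(X_train - x, axis=1)      # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]                  # indices of the k closest points, i.e. N(x)
    weights = 1.0 / (dists[nearest] + eps)           # w_i = w(d(x, x_i)), here inverse distance
    votes = {}
    for i, w in zip(nearest, weights):
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + w
    return max(votes, key=votes.get)                 # label with the largest weighted vote
```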
Nearest Neighbor - Summary Classify new instances according to their closest points Control accuracy in three ways: Distance metric Definition of neighborhood Weight assignment These parameters must be tuned depending on the problem: Is there noise in the data? Outliers? What is a “natural” distance for the data?
Decision Trees - Introduction Suppose the data is categorical. Observation: distance cannot be defined in a natural way, so we need a learner that operates directly on the attribute values. Example data set (target attribute: Elected):

Economy | Popularity | Gas prices | War casualties | Elected
BAD     | HIGH       | LOW        | NO             | YES
BAD     | HIGH       | LOW        | YES            | NO
BAD     | LOW        | HIGH       | NO             | NO
BAD     | LOW        | LOW        | NO             | NO
GOOD    | LOW        | LOW        | YES            | YES
GOOD    | LOW        | HIGH       | NO             | NO
GOOD    | HIGH       | LOW        | NO             | YES
GOOD    | HIGH       | LOW        | YES            | YES
GOOD    | HIGH       | HIGH       | NO             | YES
GOOD    | HIGH       | HIGH       | YES            | YES
Decision Trees - The model  Idea: a decision tree that “explains” the data. Observation: in general, there is no unique tree to represent the data, and in some nodes the decision is not strongly supported by the data. [Tree diagram: the root splits on Economy (GOOD/BAD); further splits use Popularity, War casualties, and Gas prices; each leaf shows the YES/NO counts of Elected examples reaching it (YES=4 NO=0, YES=1 NO=0, YES=0 NO=1, YES=1 NO=0, YES=0 NO=1, YES=0 NO=2).]
Decision Trees - Training Build the tree from top to bottom choosing one attribute at a time How do we make the choice? Idea: Choose the most “informative” attribute first Having no other information, which attribute allows us to classify correctly most of the time? This can be quantified using the Information Gain metric: Based on Entropy = Randomness Measures the reduction in uncertainty about the target value given the value of one of the attributes, therefore it tells us how informative that attribute is
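A short sketch of the information-gain computation described here, written for categorical rows like those in the table above; the function names are mine.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """H(Y) = -sum_c p_c log2 p_c, the 'randomness' of the label distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """IG(Y; A) = H(Y) - sum_v P(A = v) * H(Y | A = v)."""
    n = len(labels)
    groups = defaultdict(list)
    for row, y in zip(rows, labels):
        groups[row[attribute]].append(y)       # split the labels by the attribute's value
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder
```

At each node the attribute with the largest information gain is chosen as the split.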
Decision Trees - Overfitting Problems: the solution is greedy, therefore suboptimal; the optimal solution is infeasible due to time constraints. Is “optimal” really optimal? What if observations are corrupted by noise? We are really interested in the true error, not the training error: overfitting occurs in the presence of noise. Occam’s razor: prefer simpler solutions. Apply pruning to eliminate nodes with low statistical support.
Decision Trees - Pruning Get rid of nodes with low support. [Tree diagram: the same tree as on the previous slide, with the nodes that have little supporting data removed.] The pruned tree does not fully explain the data, but we hope that it will generalize better on unseen instances…
Decision Trees - Summary Advantages: handles categorical data; easy to interpret in a simple rule format. Disadvantages: hard to accommodate numerical data; suboptimal solution; bias towards simple trees.
Linear Classifiers - Introduction There are infinitely many hyperplanes f(x) = 0 that can separate the positive from the negative examples! Is there an optimal one to choose? Decision function: f(x) = ⟨w, x⟩ + b. Predicted label: y = sign(f(x)).
Linear Classifiers - Margins Make sure the hyperplane leaves enough room for future points! Margins: for a training point x_i define its margin γ_i; for the classifier f, the margin is the worst (smallest) of all the γ_i.
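In one common convention (the geometric margin; some texts use the unnormalized functional margin y_i f(x_i) instead), the two quantities just referred to are:

\[ \gamma_i \;=\; \frac{y_i\left(\langle \mathbf{w}, \mathbf{x}_i\rangle + b\right)}{\lVert \mathbf{w}\rVert}, \qquad \gamma(f) \;=\; \min_i \gamma_i. \]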
Linear Classifiers - Optimization Now the only thing we have to do is find the f with the maximum possible margin! This is a quadratic optimization problem, and the resulting classifier is known as the Support Vector Machine (SVM). The optimal w* yields the maximum margin γ*, and the optimal hyperplane can be written as an expansion over the training points (the support vectors), as sketched below.
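A standard way to write this quadratic program and its solution (the usual hard-margin SVM formulation; the α_i are the Lagrange multipliers, nonzero only for the support vectors):

\[ \min_{\mathbf{w},\,b}\ \tfrac{1}{2}\lVert \mathbf{w}\rVert^2 \quad \text{s.t.}\quad y_i\left(\langle \mathbf{w}, \mathbf{x}_i\rangle + b\right) \ge 1, \quad i = 1,\dots,n, \]

\[ \gamma^* = \frac{1}{\lVert \mathbf{w}^*\rVert}, \qquad f(\mathbf{x}) = \sum_i \alpha_i^*\, y_i\, \langle \mathbf{x}_i, \mathbf{x}\rangle + b^*. \]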
Linear Classifiers - Problems What if data is noisy or, worse, linearly inseparable? Solution 1: Allow for outliers in the data when data is noisy Solution 2: Increase dimensionality by creating composite features if the target function is nonlinear Solution 3: Do both 1 and 2
Linear Classifiers - Outliers Impose softer restrictions on the margin distribution to accept outliers. Now our classifier is more flexible and more powerful, but there are more parameters to estimate.
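The “softer restrictions” are usually implemented with slack variables ξ_i and a cost parameter C that trades margin against training errors (the standard soft-margin formulation, given here for illustration):

\[ \min_{\mathbf{w},\,b,\,\boldsymbol{\xi}}\ \tfrac{1}{2}\lVert \mathbf{w}\rVert^2 + C\sum_{i=1}^{n} \xi_i \quad \text{s.t.}\quad y_i\left(\langle \mathbf{w}, \mathbf{x}_i\rangle + b\right) \ge 1 - \xi_i, \quad \xi_i \ge 0. \]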
Linear Classifiers - Nonlinearity (!) Combine input features to form more complex ones: Initial space x = (x_1, x_2, …, x_m). Induced space Φ(x) = (x_1², x_2², …, x_m², √2·x_1x_2, √2·x_1x_3, …, √2·x_{m-1}x_m). The inner product can now be written as ⟨Φ(x)·Φ(y)⟩ = ⟨x·y⟩². Kernels: the above product is denoted K(x,y) = ⟨Φ(x)·Φ(y)⟩ and is called a kernel. Kernels induce nonlinear feature spaces based on the initial feature space. There is a huge collection of kernels, for vectors, trees, strings, graphs, time series, … Linear separation in the composite feature space implies a nonlinear separation in the initial space!
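A quick numerical check of the identity ⟨Φ(x), Φ(y)⟩ = ⟨x, y⟩² for the degree-2 map above (a sketch; the ordering of the induced features is arbitrary):

```python
import numpy as np
from itertools import combinations

def phi(x):
    """Explicit degree-2 feature map: squared features plus sqrt(2)-scaled cross terms."""
    cross = [np.sqrt(2) * x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.concatenate([x ** 2, cross])

x = np.array([1.0, 2.0, 3.0])
y = np.array([0.5, -1.0, 2.0])
print(np.dot(phi(x), phi(y)))   # inner product computed in the induced space
print(np.dot(x, y) ** 2)        # kernel K(x, y) = <x, y>^2 computed in the input space; same value
```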
Linear Classifiers - Summary Provides a generic solution to the learning problem We just have to solve an easy optimization problem Parametrized by the induced feature space and the noise parameters There exist theoretical bounds on their performance
Ensembles - Introduction Motivation: finding just one classifier is “too risky”. Idea: combine a group of classifiers into the final learner. Intuition: each classifier is associated with some risk of wrong predictions on future data; instead of investing in just one risky classifier, we can distribute the decision across many classifiers, thus effectively reducing the overall risk.
Ensembles - Bagging Main idea: from the observations D, T subsets D_1, …, D_T are drawn at random. For each D_i train a “base” classifier f_i (e.g. a decision tree). Finally, combine the T classifiers into one classifier f by taking a majority vote. Observations: we need enough observations to get partitions that approximately respect the i.i.d. condition (|D| >> T). How do we decide on the base classifier?
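A minimal bagging sketch, using bootstrap resampling (one common way to draw the random subsets D_1, …, D_T) and scikit-learn decision trees as the base classifier; both choices are illustrative assumptions, and non-negative integer class labels are assumed for the vote.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, T=25, seed=0):
    """Train T base classifiers, each on a bootstrap sample of (X, y)."""
    rng = np.random.RandomState(seed)
    models = []
    for _ in range(T):
        idx = rng.randint(0, len(X), size=len(X))       # sample |D| points with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Plain majority vote over the base classifiers."""
    votes = np.stack([m.predict(X) for m in models])    # shape (T, n_samples)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```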
Ensembles - Boosting Main idea: run through a number of iterations. At each iteration t, a “weak” classifier f_t is trained on a weighted version of the training data (initially all weights are equal). Each point’s weight is updated so that examples with poor margin with respect to f_t are assigned a higher weight, in an attempt to “boost” them in the next iterations. The classifier itself is assigned a weight α_t according to its training error. Finally, combine all classifiers into a weighted majority vote.
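A sketch of this weighting scheme in its best-known form, AdaBoost with decision stumps; the exact α_t formula and the exponential weight update below are the AdaBoost choices and are assumptions on my part, not quoted from the slide. Labels are assumed to be in {-1, +1}.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    """AdaBoost-style training; y must take values in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                      # initially all weights are equal
    models, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)
        if err == 0 or err >= 0.5:               # perfect, or no better than chance: stop
            break
        alpha = 0.5 * np.log((1 - err) / err)    # classifier weight from its training error
        w *= np.exp(-alpha * y * pred)           # poorly-margined points get higher weight
        w /= w.sum()
        models.append(stump)
        alphas.append(alpha)
    return models, alphas

def adaboost_predict(models, alphas, X):
    """Weighted majority vote: the sign of the alpha-weighted sum of base predictions."""
    return np.sign(sum(a * m.predict(X) for a, m in zip(alphas, models)))
```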
Bagging vs. Boosting Two distinct ways to apply the diversification idea:

              | Bagging                    | Boosting
Training data | Partition before training  | Adaptive data weighting
Base learner  | Complex                    | Simple
Effect        | Risk minimization          | Margin maximization
Testing the learner How do we estimate the learner’s performance? Create test sets from the original observations. Test set: partition into training and test sets and use the error on the test set as an estimate of the true error. Leave-one-out: remove one point and train on the rest, then report the error on this point; do this for all points and report the mean error. k-fold Cross Validation: randomly partition the data set into k non-overlapping sets, choose one set at a time for testing and train on the rest, and report the mean error.
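A k-fold cross-validation sketch; make_model is assumed to return any estimator with scikit-learn-style fit/predict methods.

```python
import numpy as np

def k_fold_error(make_model, X, y, k=10, seed=0):
    """Mean test error over k random, non-overlapping folds."""
    idx = np.random.RandomState(seed).permutation(len(X))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = make_model().fit(X[train], y[train])                 # train on the other k-1 folds
        errors.append(np.mean(model.predict(X[test]) != y[test]))    # error on the held-out fold
    return float(np.mean(errors))
```

Leave-one-out is the special case k = len(X).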
Learner evaluation - PAC learning Probably Approximately Correct (PAC) learning of a target class C using hypothesis space H: if for all target functions f in C and for all 0 < ε, δ < 1/2, with probability at least (1 − δ) we can learn a hypothesis h in H that approximates f with an error of at most ε. Generalization (or true) error: we really care about the error on unseen data. Statistical learning theory gives us the tools to express the true error (and its confidence) in terms of: the empirical (or training) error, the confidence 1 − δ of the true error, the number of training examples, and the complexity of the classes C and/or H.
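Written out, the PAC requirement says that with probability at least 1 − δ (over the random sample) the learned hypothesis has true error at most ε:

\[ \Pr\big[\ \mathrm{err}_{\mathcal{D}}(h) \le \varepsilon\ \big] \;\ge\; 1 - \delta, \qquad \mathrm{err}_{\mathcal{D}}(h) = \Pr_{\mathbf{x} \sim \mathcal{D}}\big[\, h(\mathbf{x}) \ne f(\mathbf{x}) \,\big]. \]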
Learner evaluation - VC dimension The VC dimension of a hypothesis space H measures its power to interpret the observations Infinite hypothesis space size does not necessarily imply infinite VC dimension! Bad news:  If we allow a hypothesis space with infinite VC dimension, learning is  impossible  (requires infinite number of observations) Good news: For the class of large margin linear classifiers the following error bound can be proven:
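A representative bound of this type (constants and logarithmic factors vary between texts such as Cristianini & Shawe-Taylor, so only the qualitative shape is given here): if the n training points lie in a ball of radius R and are separated with margin γ, then with probability at least 1 − δ,

\[ \mathrm{err}(f) \;\le\; \tilde{O}\!\left( \frac{1}{n}\left( \frac{R^2}{\gamma^2} + \ln\frac{1}{\delta} \right) \right). \]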
Practical issues Machine learning is driven by data, so a good learner must be data-dependent in all aspects: Hypothesis space Prior knowledge Feature selection and composition Distance/similarity measures Outliers/Noise  Never forget that learners  and  training algorithms must be efficient in time and space with respect to: The feature space dimensionality The training set size The hypothesis space size
Conclusions Machine learning is mostly art and a little bit of science! For each problem at hand a different classifier will be the optimal one This simply means that the solution  must be data-dependent : Select an “appropriate” family of classifiers (e.g. Decision Trees) Choose the right representation for the data in the feature space Tune available parameters of your favorite classifier to reflect the “nature” of the data Many practical applications, especially when there is no good theory available to model the data
Resources Books: T. Mitchell, Machine Learning; N. Cristianini & J. Shawe-Taylor, An Introduction to Support Vector Machines; V. Kecman, Learning and Soft Computing; R. Duda, P. Hart & D. Stork, Pattern Classification. Online tutorials: A. Moore, http://www-2.cs.cmu.edu/~awm/tutorials/ Software: WEKA: http://www.cs.waikato.ac.nz/~ml/weka/ SVMlight: http://svmlight.joachims.org/
