Introduction to Machine Learning Aristotelis Tsirigos email: tsirigos@cs.nyu.edu  Dennis Shasha - Advanced Database Systems NYU Computer Science
What is Machine Learning? Principles, methods, and algorithms for making predictions from past experience. Not mere memorization, but the capability to generalize to novel situations based on past experience. Where no theoretical model exists to explain the data, machine learning can be employed to offer such a model.
Learning models According to how active or passive the learner is: Statistical learning model: no control over the observations; they are presented at random in an independent, identically distributed (i.i.d.) fashion. Online model: an external source presents the observations to the learner in a query form. Query model: the learner queries an external “expert” source.
Types of learning problems A very rough categorization of learning problems: Unsupervised learning Clustering, density estimation, feature selection Supervised learning Classification, regression Reinforcement learning Feedback, games
Outline Learning methods Bayesian learning Nearest neighbor Decision trees Linear classifiers Ensemble methods (bagging & boosting) Testing the learner Learner evaluation Practical issues Resources
Bayesian learning - Introduction Given are: observed data D = {d_1, d_2, …, d_n} and a hypothesis space H. In the Bayesian setup we want to find the hypothesis that best fits the data in a probabilistic manner. In general, this is computationally intractable without any assumptions about the data.
Bayesian learning - Elaboration First transformation using Bayes rule, then simplification, as written out below. Notice that the optimal choice also depends on the a priori probability P(h) of hypothesis h.
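One standard way to write the rule this slide derives (a reconstruction of the usual Bayes-rule argument, not the slide's exact notation): the chosen hypothesis is the maximum a posteriori (MAP) one, and P(D) can be dropped because it does not depend on h:

\[ h_{MAP} \;=\; \arg\max_{h \in H} P(h \mid D) \;=\; \arg\max_{h \in H} \frac{P(D \mid h)\,P(h)}{P(D)} \;=\; \arg\max_{h \in H} P(D \mid h)\,P(h). \]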
Bayesian learning - Independence We can further simplify by assuming that the data in D are independently drawn from the underlying distribution; under this assumption P(D|h) factors into the product of the P(d_i|h). How do we estimate P(d_i|h) in practice?
Bayesian learning - Analysis For any hypothesis h in H we assume a distribution P(d|h) for any point d in the input space. If d is a point in an m-dimensional space, then the distribution is in fact P(d^(1), d^(2), …, d^(m) | h). Problems: complex optimization problem; suboptimal techniques can be used if the distributions are differentiable. In general, there are too many parameters to estimate, therefore a lot of data is needed for reliable estimation.
Bayesian learning - Analysis Need to further analyze the distribution P(d|h): we can assume features are independent (Naïve Bayes), or build Bayesian Networks where dependencies of the features are explicitly modeled. Still we have to somehow learn the distributions: model them as parametrized distributions (e.g. Gaussians) and estimate the parameters using standard greedy techniques (e.g. Expectation Maximization).
Bayesian learning - Summary Makes use of prior knowledge of: The likelihood of alternative hypotheses and The probability of observing data given a specific hypothesis The goal is to determine the most probable hypothesis given a series of observations The Naïve Bayes method has been found useful in practical applications (e.g. text classification) If the naïve assumption is not appropriate, there is a generic algorithm (EM) that can be used to find a hypothesis that is locally optimal
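As an illustration of the Naïve Bayes method mentioned above, here is a minimal sketch for categorical features; the function names, the Laplace smoothing constant alpha, and the use of log-probabilities are my own choices rather than anything prescribed in the slides.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(X, y, alpha=1.0):
    """Estimate P(class) and, per feature, P(value | class) with Laplace smoothing."""
    n = len(y)
    class_counts = Counter(y)
    # counts[c][j][v] = number of times feature j takes value v within class c
    counts = {c: defaultdict(Counter) for c in class_counts}
    for xi, yi in zip(X, y):
        for j, v in enumerate(xi):
            counts[yi][j][v] += 1
    # observed values of each feature (used for the smoothing denominator)
    values = [set(xi[j] for xi in X) for j in range(len(X[0]))]
    priors = {c: class_counts[c] / n for c in class_counts}
    return priors, counts, values, class_counts, alpha

def predict_naive_bayes(model, x):
    """Return argmax_c  log P(c) + sum_j log P(x_j | c)  (the naive independence assumption)."""
    priors, counts, values, class_counts, alpha = model
    best_class, best_score = None, -math.inf
    for c in priors:
        score = math.log(priors[c])
        for j, v in enumerate(x):
            num = counts[c][j][v] + alpha
            den = class_counts[c] + alpha * len(values[j])
            score += math.log(num / den)
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```

A call like predict_naive_bayes(train_naive_bayes(rows, labels), new_row) then implements the decision rule sketched on the previous slides.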
Nearest Neighbor - Introduction Belongs to the class of instance-based learners: the learner does not make any global prediction of the target function, it only predicts locally for a given point (lazy learner). Idea: given a query instance x, look at past observations in D that are “close” to x in order to determine x’s class y. Issues: How do we define distance? How do we define the notion of “neighborhood”?
Nearest Neighbor - Details Classify a new instance x according to its neighborhood N(x). The neighborhood can be defined in different ways: constant radius, or the k nearest neighbors. Weights are a function of distance: w_i = w(d(x, x_i)). Classification rule: a weighted vote over the labels y_i of the points x_i in N(x), using the weights w_i.
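A minimal sketch of this rule, assuming a Euclidean distance and inverse-distance weights w(d) = 1/(d + eps); both are illustrative choices, since the slide leaves the metric and the weighting function open.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=5, eps=1e-9):
    """Distance-weighted vote among the k nearest neighbors of a single query point x."""
    dists = np.linalg.norm(X_train - x, axis=1)      # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]                  # indices of the k closest points, i.e. N(x)
    weights = 1.0 / (dists[nearest] + eps)           # w_i = w(d(x, x_i)), here inverse distance
    votes = {}
    for i, w in zip(nearest, weights):
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + w
    return max(votes, key=votes.get)                 # label with the largest weighted vote
```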
Nearest Neighbor - Summary Classify new instances according to their closest points Control accuracy in three ways: Distance metric Definition of neighborhood Weight assignment These parameters must be tuned depending on the problem: Is there noise in the data? Outliers? What is a “natural” distance for the data?
Decision Trees - Introduction Suppose the data is categorical. Observation: distance cannot be defined in a natural way, so we need a learner that operates directly on the attribute values. Example data set (target attribute: Elected):

Economy | Popularity | Gas prices | War casualties | Elected
BAD     | HIGH       | LOW        | NO             | YES
BAD     | HIGH       | LOW        | YES            | NO
BAD     | LOW        | HIGH       | NO             | NO
BAD     | LOW        | LOW        | NO             | NO
GOOD    | LOW        | LOW        | YES            | YES
GOOD    | LOW        | HIGH       | NO             | NO
GOOD    | HIGH       | LOW        | NO             | YES
GOOD    | HIGH       | LOW        | YES            | YES
GOOD    | HIGH       | HIGH       | NO             | YES
GOOD    | HIGH       | HIGH       | YES            | YES
Decision Trees - The model  Idea: a decision tree that “explains” the data. Observation: in general, there is no unique tree to represent the data, and in some nodes the decision is not strongly supported by the data. [Tree diagram: the root splits on Economy (GOOD/BAD); further splits use Popularity, War casualties, and Gas prices; each leaf shows the YES/NO counts of Elected examples reaching it (YES=4 NO=0, YES=1 NO=0, YES=0 NO=1, YES=1 NO=0, YES=0 NO=1, YES=0 NO=2).]
Decision Trees - Training Build the tree from top to bottom choosing one attribute at a time How do we make the choice? Idea: Choose the most “informative” attribute first Having no other information, which attribute allows us to classify correctly most of the time? This can be quantified using the Information Gain metric: Based on Entropy = Randomness Measures the reduction in uncertainty about the target value given the value of one of the attributes, therefore it tells us how informative that attribute is
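A short sketch of the information-gain computation described here, written for categorical rows like those in the table above; the function names are mine.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """H(Y) = -sum_c p_c log2 p_c, the 'randomness' of the label distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """IG(Y; A) = H(Y) - sum_v P(A = v) * H(Y | A = v)."""
    n = len(labels)
    groups = defaultdict(list)
    for row, y in zip(rows, labels):
        groups[row[attribute]].append(y)       # split the labels by the attribute's value
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder
```

At each node the attribute with the largest information gain is chosen as the split.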
Decision Trees - Overfitting Problems: the solution is greedy, therefore suboptimal; the optimal solution is infeasible due to time constraints. Is “optimal” really optimal? What if observations are corrupted by noise? We are really interested in the true error, not the training error: overfitting occurs in the presence of noise. Occam’s razor: prefer simpler solutions. Apply pruning to eliminate nodes with low statistical support.
Decision Trees - Pruning Get rid of nodes with low support. [Tree diagram: the same tree as on the previous slide, with the nodes that have little supporting data removed.] The pruned tree does not fully explain the data, but we hope that it will generalize better on unseen instances…
Decision Trees - Summary Advantages: handles categorical data; easy to interpret in a simple rule format. Disadvantages: hard to accommodate numerical data; suboptimal solution; bias towards simple trees.
Linear Classifiers - Introduction There are infinitely many hyperplanes f(x) = 0 that can separate the positive from the negative examples! Is there an optimal one to choose? Decision function: f(x) = ⟨w, x⟩ + b. Predicted label: y = sign(f(x)).
Linear Classifiers - Margins Make sure the hyperplane leaves enough room for future points! Margins: for a training point x_i define its margin γ_i; for the classifier f, the margin is the worst (smallest) of all the γ_i.
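In one common convention (the geometric margin; some texts use the unnormalized functional margin y_i f(x_i) instead), the two quantities just referred to are:

\[ \gamma_i \;=\; \frac{y_i\left(\langle \mathbf{w}, \mathbf{x}_i\rangle + b\right)}{\lVert \mathbf{w}\rVert}, \qquad \gamma(f) \;=\; \min_i \gamma_i. \]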
Linear Classifiers - Optimization Now the only thing we have to do is find the f with the maximum possible margin! This is a quadratic optimization problem, and the resulting classifier is known as the Support Vector Machine (SVM). The optimal w* yields the maximum margin γ*, and the optimal hyperplane can be written as an expansion over the training points (the support vectors), as sketched below.
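A standard way to write this quadratic program and its solution (the usual hard-margin SVM formulation; the α_i are the Lagrange multipliers, nonzero only for the support vectors):

\[ \min_{\mathbf{w},\,b}\ \tfrac{1}{2}\lVert \mathbf{w}\rVert^2 \quad \text{s.t.}\quad y_i\left(\langle \mathbf{w}, \mathbf{x}_i\rangle + b\right) \ge 1, \quad i = 1,\dots,n, \]

\[ \gamma^* = \frac{1}{\lVert \mathbf{w}^*\rVert}, \qquad f(\mathbf{x}) = \sum_i \alpha_i^*\, y_i\, \langle \mathbf{x}_i, \mathbf{x}\rangle + b^*. \]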
Linear Classifiers - Problems What if data is noisy or, worse, linearly inseparable? Solution 1: Allow for outliers in the data when data is noisy Solution 2: Increase dimensionality by creating composite features if the target function is nonlinear Solution 3: Do both 1 and 2
Linear Classifiers - Outliers Impose softer restrictions on the margin distribution to accept outliers. Now our classifier is more flexible and more powerful, but there are more parameters to estimate.
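The “softer restrictions” are usually implemented with slack variables ξ_i and a cost parameter C that trades margin against training errors (the standard soft-margin formulation, given here for illustration):

\[ \min_{\mathbf{w},\,b,\,\boldsymbol{\xi}}\ \tfrac{1}{2}\lVert \mathbf{w}\rVert^2 + C\sum_{i=1}^{n} \xi_i \quad \text{s.t.}\quad y_i\left(\langle \mathbf{w}, \mathbf{x}_i\rangle + b\right) \ge 1 - \xi_i, \quad \xi_i \ge 0. \]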
Linear Classifiers - Nonlinearity (!) Combine input features to form more complex ones: Initial space x = (x_1, x_2, …, x_m). Induced space Φ(x) = (x_1², x_2², …, x_m², √2·x_1x_2, √2·x_1x_3, …, √2·x_{m-1}x_m). The inner product can now be written as ⟨Φ(x)·Φ(y)⟩ = ⟨x·y⟩². Kernels: the above product is denoted K(x,y) = ⟨Φ(x)·Φ(y)⟩ and is called a kernel. Kernels induce nonlinear feature spaces based on the initial feature space. There is a huge collection of kernels, for vectors, trees, strings, graphs, time series, … Linear separation in the composite feature space implies a nonlinear separation in the initial space!
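A quick numerical check of the identity ⟨Φ(x), Φ(y)⟩ = ⟨x, y⟩² for the degree-2 map above (a sketch; the ordering of the induced features is arbitrary):

```python
import numpy as np
from itertools import combinations

def phi(x):
    """Explicit degree-2 feature map: squared features plus sqrt(2)-scaled cross terms."""
    cross = [np.sqrt(2) * x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.concatenate([x ** 2, cross])

x = np.array([1.0, 2.0, 3.0])
y = np.array([0.5, -1.0, 2.0])
print(np.dot(phi(x), phi(y)))   # inner product computed in the induced space
print(np.dot(x, y) ** 2)        # kernel K(x, y) = <x, y>^2 computed in the input space; same value
```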
Linear Classifiers - Summary Provides a generic solution to the learning problem We just have to solve an easy optimization problem Parametrized by the induced feature space and the noise parameters There exist theoretical bounds on their performance
Ensembles - Introduction Motivation: finding just one classifier is “too risky”. Idea: combine a group of classifiers into the final learner. Intuition: each classifier is associated with some risk of wrong predictions on future data; instead of investing in just one risky classifier, we can distribute the decision across many classifiers, thus effectively reducing the overall risk.
Ensembles - Bagging Main idea: from the observations D, T subsets D_1, …, D_T are drawn at random. For each D_i train a “base” classifier f_i (e.g. a decision tree). Finally, combine the T classifiers into one classifier f by taking a majority vote. Observations: we need enough observations to get partitions that approximately respect the i.i.d. condition (|D| >> T). How do we decide on the base classifier?
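A minimal bagging sketch, using bootstrap resampling (one common way to draw the random subsets D_1, …, D_T) and scikit-learn decision trees as the base classifier; both choices are illustrative assumptions, and non-negative integer class labels are assumed for the vote.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, T=25, seed=0):
    """Train T base classifiers, each on a bootstrap sample of (X, y)."""
    rng = np.random.RandomState(seed)
    models = []
    for _ in range(T):
        idx = rng.randint(0, len(X), size=len(X))       # sample |D| points with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Plain majority vote over the base classifiers."""
    votes = np.stack([m.predict(X) for m in models])    # shape (T, n_samples)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```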
Ensembles - Boosting Main idea: run through a number of iterations. At each iteration t, a “weak” classifier f_t is trained on a weighted version of the training data (initially all weights are equal). Each point’s weight is updated so that examples with poor margin with respect to f_t are assigned a higher weight, in an attempt to “boost” them in the next iterations. The classifier itself is assigned a weight α_t according to its training error. Finally, combine all classifiers into a weighted majority vote.
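A sketch of this weighting scheme in its best-known form, AdaBoost with decision stumps; the exact α_t formula and the exponential weight update below are the AdaBoost choices and are assumptions on my part, not quoted from the slide. Labels are assumed to be in {-1, +1}.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    """AdaBoost-style training; y must take values in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                      # initially all weights are equal
    models, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)
        if err == 0 or err >= 0.5:               # perfect, or no better than chance: stop
            break
        alpha = 0.5 * np.log((1 - err) / err)    # classifier weight from its training error
        w *= np.exp(-alpha * y * pred)           # poorly-margined points get higher weight
        w /= w.sum()
        models.append(stump)
        alphas.append(alpha)
    return models, alphas

def adaboost_predict(models, alphas, X):
    """Weighted majority vote: the sign of the alpha-weighted sum of base predictions."""
    return np.sign(sum(a * m.predict(X) for a, m in zip(alphas, models)))
```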
Bagging vs. Boosting Two distinct ways to apply the diversification idea:

              | Bagging                    | Boosting
Training data | Partition before training  | Adaptive data weighting
Base learner  | Complex                    | Simple
Effect        | Risk minimization          | Margin maximization
Testing the learner How do we estimate the learner’s performance? Create test sets from the original observations. Test set: partition into training and test sets and use the error on the test set as an estimate of the true error. Leave-one-out: remove one point and train on the rest, then report the error on this point; do this for all points and report the mean error. k-fold Cross Validation: randomly partition the data set into k non-overlapping sets, choose one set at a time for testing and train on the rest, and report the mean error.
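A k-fold cross-validation sketch; make_model is assumed to return any estimator with scikit-learn-style fit/predict methods.

```python
import numpy as np

def k_fold_error(make_model, X, y, k=10, seed=0):
    """Mean test error over k random, non-overlapping folds."""
    idx = np.random.RandomState(seed).permutation(len(X))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = make_model().fit(X[train], y[train])                 # train on the other k-1 folds
        errors.append(np.mean(model.predict(X[test]) != y[test]))    # error on the held-out fold
    return float(np.mean(errors))
```

Leave-one-out is the special case k = len(X).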
Learner evaluation - PAC learning Probably Approximately Correct (PAC) learning of a target class C using hypothesis space H: if for all target functions f in C and for all 0 < ε, δ < 1/2, with probability at least (1 − δ) we can learn a hypothesis h in H that approximates f with an error of at most ε. Generalization (or true) error: we really care about the error on unseen data. Statistical learning theory gives us the tools to express the true error (and its confidence) in terms of: the empirical (or training) error, the confidence 1 − δ of the true error, the number of training examples, and the complexity of the classes C and/or H.
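Written out, the PAC requirement says that with probability at least 1 − δ (over the random sample) the learned hypothesis has true error at most ε:

\[ \Pr\big[\ \mathrm{err}_{\mathcal{D}}(h) \le \varepsilon\ \big] \;\ge\; 1 - \delta, \qquad \mathrm{err}_{\mathcal{D}}(h) = \Pr_{\mathbf{x} \sim \mathcal{D}}\big[\, h(\mathbf{x}) \ne f(\mathbf{x}) \,\big]. \]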
Learner evaluation - VC dimension The VC dimension of a hypothesis space H measures its power to interpret the observations Infinite hypothesis space size does not necessarily imply infinite VC dimension! Bad news:  If we allow a hypothesis space with infinite VC dimension, learning is  impossible  (requires infinite number of observations) Good news: For the class of large margin linear classifiers the following error bound can be proven:
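A representative bound of this type (constants and logarithmic factors vary between texts such as Cristianini & Shawe-Taylor, so only the qualitative shape is given here): if the n training points lie in a ball of radius R and are separated with margin γ, then with probability at least 1 − δ,

\[ \mathrm{err}(f) \;\le\; \tilde{O}\!\left( \frac{1}{n}\left( \frac{R^2}{\gamma^2} + \ln\frac{1}{\delta} \right) \right). \]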
Practical issues Machine learning is driven by data, so a good learner must be data-dependent in all aspects: Hypothesis space Prior knowledge Feature selection and composition Distance/similarity measures Outliers/Noise  Never forget that learners  and  training algorithms must be efficient in time and space with respect to: The feature space dimensionality The training set size The hypothesis space size
Conclusions Machine learning is mostly art and a little bit of science! For each problem at hand a different classifier will be the optimal one This simply means that the solution  must be data-dependent : Select an “appropriate” family of classifiers (e.g. Decision Trees) Choose the right representation for the data in the feature space Tune available parameters of your favorite classifier to reflect the “nature” of the data Many practical applications, especially when there is no good theory available to model the data
Resources Books: T. Mitchell, Machine Learning; N. Cristianini & J. Shawe-Taylor, An Introduction to Support Vector Machines; V. Kecman, Learning and Soft Computing; R. Duda, P. Hart & D. Stork, Pattern Classification. Online tutorials: A. Moore, http://www-2.cs.cmu.edu/~awm/tutorials/ Software: WEKA: http://www.cs.waikato.ac.nz/~ml/weka/ SVMlight: http://svmlight.joachims.org/
