Classification: Basic Concepts
Khalid Elshafie (abolkog@dblab.cbnu.ac.kr)
Database / Bioinformatics Lab, Chungbuk National University, Korea
December 12, 2009
Outline
Introduction
Introduction (1/4)
Classification: Definition
Given a collection of records (the training set), where each record contains a set of attributes and one of the attributes is the class, find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
A test set is used to determine the accuracy of the model.
Input: attribute set → Classification model → Output: class label.
Introduction (2/4)
Classification is a two-step process:
1. Learning step: the training data are analyzed by a classification algorithm and a model (classifier) is learned.
2. Classification step: test data are used to estimate the accuracy of the learned classifier.
Usually the given data set is divided into training and test sets.
Introduction (3/4)
Examples of classification:
- Predicting tumor cells as benign or malignant.
- Classifying credit card transactions as legitimate or fraudulent.
- Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil.
- Categorizing news stories as finance, weather, entertainment, sports, etc.
Introduction (4/4)
Classification techniques:
- Decision tree based methods.
- Rule based methods.
- Neural networks.
- Naïve Bayes and Bayesian belief networks.
- Support vector machines.
General Approach to Solving a Classification Problem
General Approach To Solving a Classification Problem (1/2)
General approach for building a classification model.
General Approach To Solving a Classification Problem (2/2)
Performance evaluation.
Evaluating the performance of a classification model is based on the counts of test records correctly and incorrectly predicted by the model, summarized in a confusion matrix.
Although a confusion matrix provides the information needed to determine how well a classification model performs, summarizing this information with a single number (such as accuracy) makes it more convenient to compare the performance of different models.
Confusion matrix for a 2-class problem.
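For concreteness, a minimal sketch of this workflow, using scikit-learn purely for illustration (the slides do not prescribe a library); the data set, the 1/3 test split, and the model settings are assumptions made here:

```python
# Learn a classifier on a training set, then estimate its accuracy on a held-out
# test set via the confusion matrix and a single-number accuracy summary.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

X, y = load_breast_cancer(return_X_y=True)            # stand-in attribute set / class labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=1)

model = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)   # learning step
y_pred = model.predict(X_test)                                          # classification step

print(confusion_matrix(y_test, y_pred))   # counts of correct / incorrect predictions per class
print(accuracy_score(y_test, y_pred))     # single-number summary for comparing models
```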
Decision Tree Induction
Decision Tree Induction (1/15)
What is a decision tree?
A decision tree is a flowchart-like tree structure. Each internal node (non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label.
Example tree: the root node tests Refund (Yes → leaf NO; No → internal node MarSt); MarSt tests marital status (Married → leaf NO; Single, Divorced → internal node TaxInc); TaxInc tests taxable income (< 80K → leaf NO; > 80K → leaf YES).
Decision Tree Induction (2/15)
How to build a decision tree?
Let Dt be the set of training records that reach a node t.
General procedure:
- If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt.
- If Dt is an empty set, then t is a leaf node labeled by the default class yd.
- If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset.
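A minimal sketch of this recursive procedure, assuming categorical attributes, records stored as dictionaries with a "class" key, and a placeholder rule for choosing the splitting attribute; a real implementation would pick the attribute using the selection measures described later:

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class Node:
    label: Optional[str] = None        # class label if this node is a leaf
    attribute: Optional[str] = None    # attribute tested if this node is internal
    children: Dict[str, "Node"] = field(default_factory=dict)

def majority_class(records):
    return Counter(r["class"] for r in records).most_common(1)[0][0]

def build_tree(records, attributes, default_class):
    if not records:                               # empty Dt -> leaf with default class yd
        return Node(label=default_class)
    classes = {r["class"] for r in records}
    if len(classes) == 1:                         # all records share class yt -> leaf yt
        return Node(label=classes.pop())
    if not attributes:                            # no tests left -> majority-class leaf
        return Node(label=majority_class(records))
    attr = attributes[0]                          # placeholder choice of splitting attribute;
    node = Node(attribute=attr)                   # a real implementation would pick the best
    default = majority_class(records)             # split using an attribute selection measure
    for value in {r[attr] for r in records}:
        subset = [r for r in records if r[attr] == value]
        node.children[value] = build_tree(subset, attributes[1:], default)
    return node
```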
Decision Tree Induction (3/15)
How to build a decision tree?
Tree induction follows a greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
Tree induction issues:
- How to split the records: how to specify the attribute test condition, and how to determine the best split.
- When to stop splitting.
Decision Tree Induction (4/15)
How to specify the test condition?
- Depends on the attribute type: nominal, ordinal, or continuous.
- Depends on the number of ways to split: 2-way split or multi-way split.
Decision Tree Induction (5/15)
Splitting based on nominal attributes.
- Multi-way split: use as many partitions as there are distinct values, e.g., CarType → {Family}, {Sports}, {Luxury}.
- Binary split: divide the values into two subsets, e.g., CarType → {Sports, Luxury} vs. {Family}, or {Family, Luxury} vs. {Sports}.
Decision Tree Induction (6/15)
Splitting based on ordinal attributes.
- Multi-way split: use as many partitions as there are distinct values, e.g., Size → {Small}, {Medium}, {Large}.
- Binary split: divide the values into two subsets, as long as the grouping does not violate the order property of the attribute, e.g., Size → {Small, Medium} vs. {Large}, or {Small} vs. {Medium, Large}.
Decision Tree Induction (7/15)
Splitting based on continuous attributes.
- Multi-way split: must consider all possible ranges of the continuous values; one approach is discretization.
- Binary split: the test condition can be expressed as a comparison test, (A < v) or (A ≥ v).
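A small sketch of how candidate thresholds v for such a binary split can be enumerated, assuming the common midpoint convention; the slides do not fix a specific rule, and the values below are illustrative:

```python
values = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]        # e.g., annual income in K
sorted_vals = sorted(set(values))
candidates = [(a + b) / 2 for a, b in zip(sorted_vals, sorted_vals[1:])]
print(candidates)   # each candidate v defines the test (A < v) versus (A >= v)
```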
Decision Tree Induction (8/15)
How to determine the best split?
Attribute selection measure: a heuristic for selecting the splitting criterion that best separates a given data set, e.g., information gain or gain ratio.
Decision Tree Induction (9/15)
Information gain.
- Used by the ID3 algorithm as its attribute selection measure: select the attribute with the highest information gain.
- Info(D): the expected information (entropy) needed to classify a tuple in D.
- Info_A(D): the information needed to classify D after using attribute A to split D into v partitions.
- Gain(A): the information gained by branching on attribute A.
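Written out, with p_i the proportion of tuples in D belonging to class C_i and D_1, ..., D_v the partitions induced by attribute A:

\mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)
\mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|}\,\mathrm{Info}(D_j)
\mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D)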
Decision Tree Induction (10/15)
Information gain example.
The training set has 14 records: 9 records of class "Yes" and 5 records of class "No". Compute Info(D), then Info_A(D) and Gain(A) for each candidate attribute; similar computations are carried out for the remaining attributes.
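For this 9/5 class split, the entropy works out to:

\mathrm{Info}(D) = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.940 \text{ bits}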
Decision Tree Induction (11/15)
Information gain.
The attribute age is selected for the split, with branches youth, middle-aged, and senior; the middle-aged partition contains only records of class "Yes" and becomes a leaf.
Decision Tree Induction (12/15)
Gain ratio.
The information gain measure is biased towards attributes with a large number of values. C4.5 (a successor of ID3) uses the gain ratio, a normalization of information gain, to overcome this problem.
Example:
For attribute income:
Gain(Income) = 0.029, SplitInfo_income(D) = 0.926.
Therefore, GainRatio(Income) = 0.029 / 0.926 = 0.031.
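A minimal sketch of this computation, with GainRatio(A) = Gain(A) / SplitInfo_A(D); the toy table, column names, and values below are illustrative, not the slides' data set:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_and_ratio(rows, attr, target="class"):
    labels = [r[target] for r in rows]
    total = entropy(labels)                       # Info(D)
    cond, split_info = 0.0, 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        w = len(subset) / len(rows)
        cond += w * entropy(subset)               # Info_A(D)
        split_info -= w * math.log2(w)            # SplitInfo_A(D)
    gain = total - cond
    return gain, (gain / split_info if split_info > 0 else 0.0)

rows = [
    {"income": "high", "class": "no"},   {"income": "high", "class": "no"},
    {"income": "medium", "class": "yes"}, {"income": "medium", "class": "yes"},
    {"income": "low", "class": "yes"},   {"income": "low", "class": "no"},
]
print(gain_and_ratio(rows, "income"))   # (information gain, gain ratio)
```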
Decision Tree Induction (14/15)
Comparing attribute selection measures:
- Information gain: biased towards multi-valued attributes.
- Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others.
Decision Tree Induction (15/15)
Decision tree induction.
Advantages: inexpensive to construct; easy to interpret for small-sized trees; extremely fast at classifying unknown records.
Disadvantages: the learned decision tree can be suboptimal (e.g., it may overfit the training data).
Model Overfitting
Model Overfitting (1/5)
Types of errors committed by a classification model:
- Training error: the number of misclassification errors committed on the training records.
- Generalization error: the expected error of the model on previously unseen records.
A good model must have low training error as well as low generalization error. A model that fits the training data too well can have a poorer generalization error than a model with a higher training error; this is overfitting.
Model Overfitting (2/5)
Reasons for overfitting: the presence of noise in the data set. A tree grown to fit noisy training records ends up misclassifying test records in those regions.
Model Overfitting (3/5)
Reasons for overfitting: lack of representative samples in the training set, which likewise leads to misclassified test records.
Model Overfitting (4/5)
Handling overfitting: pre-pruning (early stopping rule).
Stop the algorithm before the tree becomes fully grown.
Typical stopping conditions for a node:
- Stop if all instances belong to the same class.
- Stop if all the attribute values are the same.
More restrictive conditions:
- Stop if the number of instances is less than some user-specified threshold.
- Stop if the class distribution of the instances is independent of the available features (e.g., using a chi-squared test).
- Stop if expanding the current node does not improve impurity measures (e.g., Gini index or information gain).
Model Overfitting (5/5)
Handling overfitting: post-pruning.
- Grow the decision tree to its entirety.
- Trim the nodes of the decision tree in a bottom-up fashion: if the generalization error improves after trimming, replace the sub-tree by a leaf node whose class label is determined from the majority class of the instances in the sub-tree.
In practice, post-pruning is preferable, since early pruning can stop too early.
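As an illustration only, both strategies can be sketched with scikit-learn's DecisionTreeClassifier; the parameter values below are assumptions, not recommendations from the slides, and scikit-learn implements post-pruning as cost-complexity pruning via ccp_alpha rather than the generalization-error rule described above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Pre-pruning: stop growing early via thresholds on depth, node size, and impurity gain.
pre = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10,
                             min_impurity_decrease=0.001, random_state=1)
pre.fit(X_train, y_train)

# Post-pruning: grow the tree fully, then prune bottom-up (cost-complexity pruning).
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X_train, y_train)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]          # one candidate pruning level
post = DecisionTreeClassifier(ccp_alpha=alpha, random_state=1).fit(X_train, y_train)

print(pre.score(X_test, y_test), post.score(X_test, y_test))
```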
Performance Evaluation
Performance Evaluation (1/3)
Holdout method (training-and-testing):
- Randomly partition the available examples into two independent sets, e.g., a training set (2/3) and a test set (1/3).
- The training set is used to develop one tree; the test set is used to check its accuracy.
- Used for data sets with a large number of samples.
Performance Evaluation (2/3)
Cross-validation (k-fold cross-validation):
- Divide the data set into k subsamples; use k-1 subsamples as training data and one subsample as test data, rotating the held-out subsample.
- Used for data sets of moderate size.
- 10-fold cross-validation (90% training and 10% test in each fold, yielding ten different trees) is the standard and most popular technique for estimating a classifier's accuracy.
Performance Evaluation (3/3)
Bootstrapping:
- Based on sampling with replacement: the initial data set of N samples is sampled N times, with replacement, to form another set of N samples for training.
- Since some samples in this new set are repeated, some samples from the initial data set will not appear in the training set; these samples form the test set.
- Used for small data sets.
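A minimal sketch of the three estimation schemes, using scikit-learn and NumPy for illustration; the data set and split sizes are assumptions:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
clf = DecisionTreeClassifier(random_state=1)

# Holdout: one random train/test partition (here 2/3 vs 1/3).
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=1)
holdout_acc = clf.fit(X_tr, y_tr).score(X_te, y_te)

# 10-fold cross-validation: ten trees, each tested on the held-out 10%.
cv_acc = cross_val_score(clf, X, y, cv=10).mean()

# Bootstrap: draw N records with replacement for training; records never drawn
# (the out-of-bag records) form the test set.
rng = np.random.default_rng(1)
idx = rng.integers(0, len(X), size=len(X))
oob = np.setdiff1d(np.arange(len(X)), idx)
boot_acc = clf.fit(X[idx], y[idx]).score(X[oob], y[oob])

print(holdout_acc, cv_acc, boot_acc)
```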
Summary
Summary
Applying the model to test data: start from the root of the tree and, at each internal node, follow the branch that matches the record's attribute value (Refund, then MarSt, then TaxInc) until a leaf is reached; the leaf's class label is the prediction. For the test record shown, the model assigns Cheat = "No".
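The walk-through above corresponds to a few nested conditionals. A small sketch, with the node labels (Refund, MarSt, TaxInc) taken from the slides and the test-record values assumed for illustration:

```python
def classify_cheat(record):
    # Root node: Refund test.
    if record["Refund"] == "Yes":
        return "No"
    # Refund = No: marital status test.
    if record["MarSt"] == "Married":
        return "No"
    # Single or Divorced: taxable income test.
    return "No" if record["TaxInc"] < 80_000 else "Yes"

# Assumed test record matching the walk-through: Refund = No, MarSt = Married.
print(classify_cheat({"Refund": "No", "MarSt": "Married", "TaxInc": 80_000}))  # -> "No"
```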

Summary
- Classification is one of the most important techniques in data mining, with many applications in the real world.
- Decision trees are a powerful classification technique and are easy to understand.
- Strengths: easy to understand; fast at classifying records.
- Weaknesses: suffer from overfitting; large tree sizes can cause memory-handling issues.
- Handling overfitting: pruning.
- Evaluation methods: holdout, cross-validation, bootstrapping.
Thank you! Any comments & questions?