Classification: Basic Concepts
Khalid Elshafie (abolkog@dblab.cbnu.ac.kr)
Database / Bioinformatics Lab, Chungbuk National University, Korea
December 12, 2009
Outline
Introduction
Introduction (1/4)
Classification: Definition
Given a collection of records (the training set), where each record contains a set of attributes and one of the attributes is the class, find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
A test set is used to determine the accuracy of the model.
Input: attribute set → Classification model → Output: class label.
Introduction (2/4)
Classification is a two-step process:
1. Learning step: the training data are analyzed by a classification algorithm and a model (classifier) is learned.
2. Classification step: test data are used to estimate the accuracy of the learned classifier.
Usually the given data set is divided into training and test sets.
Introduction (3/4)
Examples of classification:
- Predicting tumor cells as benign or malignant.
- Classifying credit card transactions as legitimate or fraudulent.
- Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil.
- Categorizing news stories as finance, weather, entertainment, sports, etc.
Introduction (4/4)
Classification techniques:
- Decision tree based methods.
- Rule based methods.
- Neural networks.
- Naïve Bayes and Bayesian belief networks.
- Support vector machines.
General Approach to Solving a Classification Problem
General Approach To Solving a Classification Problem (1/2)
General approach for building a classification model.
General Approach To Solving a Classification Problem (2/2)
Performance evaluation.
Evaluating the performance of a classification model is based on the counts of test records correctly and incorrectly predicted by the model, summarized in a confusion matrix.
Although a confusion matrix provides the information needed to determine how well a classification model performs, summarizing this information with a single number (such as accuracy) makes it more convenient to compare the performance of different models.
Confusion matrix for a 2-class problem.
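For concreteness, a minimal sketch of this workflow, using scikit-learn purely for illustration (the slides do not prescribe a library); the data set, the 1/3 test split, and the model settings are assumptions made here:

```python
# Learn a classifier on a training set, then estimate its accuracy on a held-out
# test set via the confusion matrix and a single-number accuracy summary.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

X, y = load_breast_cancer(return_X_y=True)            # stand-in attribute set / class labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=1)

model = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)   # learning step
y_pred = model.predict(X_test)                                          # classification step

print(confusion_matrix(y_test, y_pred))   # counts of correct / incorrect predictions per class
print(accuracy_score(y_test, y_pred))     # single-number summary for comparing models
```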
Decision Tree Induction
Decision Tree Induction (1/15)
What is a decision tree?
A decision tree is a flowchart-like tree structure. Each internal node (non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label.
Example tree: the root node tests Refund (Yes → leaf NO; No → internal node MarSt); MarSt tests marital status (Married → leaf NO; Single, Divorced → internal node TaxInc); TaxInc tests taxable income (< 80K → leaf NO; > 80K → leaf YES).
Decision Tree Induction (2/15)
How to build a decision tree?
Let Dt be the set of training records that reach a node t.
General procedure:
- If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt.
- If Dt is an empty set, then t is a leaf node labeled by the default class yd.
- If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset.
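A minimal sketch of this recursive procedure, assuming categorical attributes, records stored as dictionaries with a "class" key, and a placeholder rule for choosing the splitting attribute; a real implementation would pick the attribute using the selection measures described later:

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class Node:
    label: Optional[str] = None        # class label if this node is a leaf
    attribute: Optional[str] = None    # attribute tested if this node is internal
    children: Dict[str, "Node"] = field(default_factory=dict)

def majority_class(records):
    return Counter(r["class"] for r in records).most_common(1)[0][0]

def build_tree(records, attributes, default_class):
    if not records:                               # empty Dt -> leaf with default class yd
        return Node(label=default_class)
    classes = {r["class"] for r in records}
    if len(classes) == 1:                         # all records share class yt -> leaf yt
        return Node(label=classes.pop())
    if not attributes:                            # no tests left -> majority-class leaf
        return Node(label=majority_class(records))
    attr = attributes[0]                          # placeholder choice of splitting attribute;
    node = Node(attribute=attr)                   # a real implementation would pick the best
    default = majority_class(records)             # split using an attribute selection measure
    for value in {r[attr] for r in records}:
        subset = [r for r in records if r[attr] == value]
        node.children[value] = build_tree(subset, attributes[1:], default)
    return node
```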
Decision Tree Induction (3/15)
How to build a decision tree?
Tree induction follows a greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
Tree induction issues:
- How to split the records: how to specify the attribute test condition, and how to determine the best split.
- When to stop splitting.
Decision Tree Induction (4/15)
How to specify the test condition?
- Depends on the attribute type: nominal, ordinal, or continuous.
- Depends on the number of ways to split: 2-way split or multi-way split.
Decision Tree Induction (5/15)
Splitting based on nominal attributes.
- Multi-way split: use as many partitions as there are distinct values, e.g., CarType → {Family}, {Sports}, {Luxury}.
- Binary split: divide the values into two subsets, e.g., CarType → {Sports, Luxury} vs. {Family}, or {Family, Luxury} vs. {Sports}.
Decision Tree Induction (6/15)
Splitting based on ordinal attributes.
- Multi-way split: use as many partitions as there are distinct values, e.g., Size → {Small}, {Medium}, {Large}.
- Binary split: divide the values into two subsets, as long as the grouping does not violate the order property of the attribute, e.g., Size → {Small, Medium} vs. {Large}, or {Small} vs. {Medium, Large}.
Decision Tree Induction (7/15)
Splitting based on continuous attributes.
- Multi-way split: must consider all possible ranges of the continuous values; one approach is discretization.
- Binary split: the test condition can be expressed as a comparison test, (A < v) or (A ≥ v).
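A small sketch of how candidate thresholds v for such a binary split can be enumerated, assuming the common midpoint convention; the slides do not fix a specific rule, and the values below are illustrative:

```python
values = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]        # e.g., annual income in K
sorted_vals = sorted(set(values))
candidates = [(a + b) / 2 for a, b in zip(sorted_vals, sorted_vals[1:])]
print(candidates)   # each candidate v defines the test (A < v) versus (A >= v)
```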
Decision Tree Induction (8/15)
How to determine the best split?
Attribute selection measure: a heuristic for selecting the splitting criterion that best separates a given data set, e.g., information gain or gain ratio.
Decision Tree Induction (9/15)
Information gain.
- Used by the ID3 algorithm as its attribute selection measure: select the attribute with the highest information gain.
- Info(D): the expected information (entropy) needed to classify a tuple in D.
- Info_A(D): the information needed to classify D after using attribute A to split D into v partitions.
- Gain(A): the information gained by branching on attribute A.
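Written out, with p_i the proportion of tuples in D belonging to class C_i and D_1, ..., D_v the partitions induced by attribute A:

\mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)
\mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|}\,\mathrm{Info}(D_j)
\mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D)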
Decision Tree Induction (10/15)
Information gain example.
The training set has 14 records: 9 records of class "Yes" and 5 records of class "No". Compute Info(D), then Info_A(D) and Gain(A) for each candidate attribute; similar computations are carried out for the remaining attributes.
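For this 9/5 class split, the entropy works out to:

\mathrm{Info}(D) = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.940 \text{ bits}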
Decision Tree Induction (11/15)
Information gain.
The attribute age is selected for the split, with branches youth, middle-aged, and senior; the middle-aged partition contains only records of class "Yes" and becomes a leaf.
Decision Tree Induction (12/15)
Gain ratio.
The information gain measure is biased towards attributes with a large number of values. C4.5 (a successor of ID3) uses the gain ratio, a normalization of information gain, to overcome this problem.
Example:
For attribute income:
Gain(Income) = 0.029, SplitInfo_income(D) = 0.926.
Therefore, GainRatio(Income) = 0.029 / 0.926 = 0.031.
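A minimal sketch of this computation, with GainRatio(A) = Gain(A) / SplitInfo_A(D); the toy table, column names, and values below are illustrative, not the slides' data set:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_and_ratio(rows, attr, target="class"):
    labels = [r[target] for r in rows]
    total = entropy(labels)                       # Info(D)
    cond, split_info = 0.0, 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        w = len(subset) / len(rows)
        cond += w * entropy(subset)               # Info_A(D)
        split_info -= w * math.log2(w)            # SplitInfo_A(D)
    gain = total - cond
    return gain, (gain / split_info if split_info > 0 else 0.0)

rows = [
    {"income": "high", "class": "no"},   {"income": "high", "class": "no"},
    {"income": "medium", "class": "yes"}, {"income": "medium", "class": "yes"},
    {"income": "low", "class": "yes"},   {"income": "low", "class": "no"},
]
print(gain_and_ratio(rows, "income"))   # (information gain, gain ratio)
```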
Decision Tree Induction (14/15)
Comparing attribute selection measures:
- Information gain: biased towards multi-valued attributes.
- Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others.
Decision Tree Induction (15/15)
Decision tree induction.
Advantages: inexpensive to construct; easy to interpret for small-sized trees; extremely fast at classifying unknown records.
Disadvantages: the learned decision tree can be suboptimal (e.g., it may overfit the training data).
Model Overfitting
Model Overfitting (1/5)
Types of errors committed by a classification model:
- Training error: the number of misclassification errors committed on the training records.
- Generalization error: the expected error of the model on previously unseen records.
A good model must have low training error as well as low generalization error. A model that fits the training data too well can have a poorer generalization error than a model with a higher training error; this is overfitting.
Model Overfitting (2/5)
Reasons for overfitting: the presence of noise in the data set. A tree grown to fit noisy training records ends up misclassifying test records in those regions.
Model Overfitting (3/5)
Reasons for overfitting: lack of representative samples in the training set, which likewise leads to misclassified test records.
Model Overfitting (4/5)
Handling overfitting: pre-pruning (early stopping rule).
Stop the algorithm before the tree becomes fully grown.
Typical stopping conditions for a node:
- Stop if all instances belong to the same class.
- Stop if all the attribute values are the same.
More restrictive conditions:
- Stop if the number of instances is less than some user-specified threshold.
- Stop if the class distribution of the instances is independent of the available features (e.g., using a chi-squared test).
- Stop if expanding the current node does not improve impurity measures (e.g., Gini index or information gain).
Model Overfitting (5/5)
Handling overfitting: post-pruning.
- Grow the decision tree to its entirety.
- Trim the nodes of the decision tree in a bottom-up fashion: if the generalization error improves after trimming, replace the sub-tree by a leaf node whose class label is determined from the majority class of the instances in the sub-tree.
In practice, post-pruning is preferable, since early pruning can stop too early.
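As an illustration only, both strategies can be sketched with scikit-learn's DecisionTreeClassifier; the parameter values below are assumptions, not recommendations from the slides, and scikit-learn implements post-pruning as cost-complexity pruning via ccp_alpha rather than the generalization-error rule described above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Pre-pruning: stop growing early via thresholds on depth, node size, and impurity gain.
pre = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10,
                             min_impurity_decrease=0.001, random_state=1)
pre.fit(X_train, y_train)

# Post-pruning: grow the tree fully, then prune bottom-up (cost-complexity pruning).
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X_train, y_train)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]          # one candidate pruning level
post = DecisionTreeClassifier(ccp_alpha=alpha, random_state=1).fit(X_train, y_train)

print(pre.score(X_test, y_test), post.score(X_test, y_test))
```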
Performance Evaluation
Performance Evaluation (1/3)
Holdout method (training-and-testing):
- Randomly partition the available examples into two independent sets, e.g., a training set (2/3) and a test set (1/3).
- The training set is used to develop one tree; the test set is used to check its accuracy.
- Used for data sets with a large number of samples.
Performance Evaluation (2/3)
Cross-validation (k-fold cross-validation):
- Divide the data set into k subsamples; use k-1 subsamples as training data and one subsample as test data, rotating the held-out subsample.
- Used for data sets of moderate size.
- 10-fold cross-validation (90% training and 10% test in each fold, yielding ten different trees) is the standard and most popular technique for estimating a classifier's accuracy.
Performance Evaluation (3/3)
Bootstrapping:
- Based on sampling with replacement: the initial data set of N samples is sampled N times, with replacement, to form another set of N samples for training.
- Since some samples in this new set are repeated, some samples from the initial data set will not appear in the training set; these samples form the test set.
- Used for small data sets.
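A minimal sketch of the three estimation schemes, using scikit-learn and NumPy for illustration; the data set and split sizes are assumptions:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
clf = DecisionTreeClassifier(random_state=1)

# Holdout: one random train/test partition (here 2/3 vs 1/3).
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=1)
holdout_acc = clf.fit(X_tr, y_tr).score(X_te, y_te)

# 10-fold cross-validation: ten trees, each tested on the held-out 10%.
cv_acc = cross_val_score(clf, X, y, cv=10).mean()

# Bootstrap: draw N records with replacement for training; records never drawn
# (the out-of-bag records) form the test set.
rng = np.random.default_rng(1)
idx = rng.integers(0, len(X), size=len(X))
oob = np.setdiff1d(np.arange(len(X)), idx)
boot_acc = clf.fit(X[idx], y[idx]).score(X[oob], y[oob])

print(holdout_acc, cv_acc, boot_acc)
```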
Summary
Summary
Applying the model to test data: start from the root of the tree and, at each internal node, follow the branch that matches the record's attribute value (Refund, then MarSt, then TaxInc) until a leaf is reached; the leaf's class label is the prediction. For the test record shown, the model assigns Cheat = "No".
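The walk-through above corresponds to a few nested conditionals. A small sketch, with the node labels (Refund, MarSt, TaxInc) taken from the slides and the test-record values assumed for illustration:

```python
def classify_cheat(record):
    # Root node: Refund test.
    if record["Refund"] == "Yes":
        return "No"
    # Refund = No: marital status test.
    if record["MarSt"] == "Married":
        return "No"
    # Single or Divorced: taxable income test.
    return "No" if record["TaxInc"] < 80_000 else "Yes"

# Assumed test record matching the walk-through: Refund = No, MarSt = Married.
print(classify_cheat({"Refund": "No", "MarSt": "Married", "TaxInc": 80_000}))  # -> "No"
```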

Summary
- Classification is one of the most important techniques in data mining, with many applications in the real world.
- Decision trees are a powerful classification technique and are easy to understand.
- Strengths: easy to understand; fast at classifying records.
- Weaknesses: suffer from overfitting; large tree sizes can cause memory-handling issues.
- Handling overfitting: pruning.
- Evaluation methods: holdout, cross-validation, bootstrapping.
Thank you! Any comments & questions?