Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation
Classification: Definition
Given a collection of records (the training set), where each record contains a set of attributes, one of which is the class.
Find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
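The setting above can be sketched in a few lines of Python: a training set of records, each a set of attributes plus a class, and a model that is just a function from the other attributes to the class. The records and the hand-written rule below are hypothetical illustrations, not the output of an induction algorithm.

```python
# Hypothetical training set: each record has attributes ("refund",
# "income") plus a class attribute ("cheat").
training_set = [
    {"refund": "yes", "income": 125, "cheat": "no"},
    {"refund": "no",  "income": 100, "cheat": "yes"},
    {"refund": "no",  "income": 70,  "cheat": "no"},
    {"refund": "yes", "income": 120, "cheat": "no"},
]

def model(record):
    """The class attribute expressed as a function of the other attributes."""
    if record["refund"] == "yes":
        return "no"
    return "yes" if record["income"] >= 80 else "no"

# The model agrees with every training record...
assert all(model(r) == r["cheat"] for r in training_set)
# ...and assigns a class to a previously unseen record.
print(model({"refund": "no", "income": 95}))  # -> yes
```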
Examples of Classification Tasks
- Predicting tumor cells as benign or malignant
- Classifying credit card transactions as legitimate or fraudulent
- Classifying secondary structures of proteins as alpha-helix, beta-sheet, or random coil
- Categorizing news stories as finance, weather, entertainment, sports, etc.
Classification Techniques
- Decision tree based methods
- Rule-based methods
- Memory based reasoning
- Neural networks
- Naïve Bayes and Bayesian belief networks
- Support vector machines
Decision Tree Induction
Many algorithms:
- Hunt's Algorithm (one of the earliest)
- CART
- ID3, C4.5
- SLIQ, SPRINT
Tree Induction
Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
Issues:
- Determine how to split the records: how to specify the attribute test condition? How to determine the best split?
- Determine when to stop splitting
How to Specify the Test Condition?
- Depends on attribute type: nominal, ordinal, continuous
- Depends on the number of ways to split: 2-way split, multi-way split
Splitting Based on Nominal Attributes
Multi-way split: use as many partitions as there are distinct values (e.g., CarType split into Family, Sports, and Luxury).
Contd.
Binary split: divides the values into two subsets; need to find the optimal partitioning (e.g., CarType split into {Sports, Luxury} vs. {Family}).
Splitting Based on Continuous Attributes
Different ways of handling:
- Discretization to form an ordinal categorical attribute
  - Static: discretize once at the beginning
  - Dynamic: ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering
Contd.
- Binary decision: (A < v) or (A ≥ v)
  - Consider all possible splits and find the best cut
  - Can be more compute intensive
How to Determine the Best Split
Greedy approach: nodes with a homogeneous class distribution are preferred.
Need a measure of node impurity.
Measures of Node Impurity
- Gini index
- Entropy
- Misclassification error
Measure of Impurity: GINI
Gini index at a given node t: GINI(t) = 1 - Σ_j [p(j|t)]², where p(j|t) is the relative frequency of class j at node t.
- Maximum (1 - 1/nc) when records are equally distributed among all nc classes, implying the least interesting information
- Minimum (0.0) when all records belong to one class, implying the most interesting information
Splitting Based on GINI
Used in CART, SLIQ, SPRINT.
When a node p is split into k partitions (children), the quality of the split is computed as
GINI_split = Σ_{i=1..k} (n_i / n) GINI(i),
where n_i = number of records at child i and n = number of records at node p.
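Both measures can be sketched directly from per-class record counts: GINI(t) = 1 - Σ_j p(j|t)² for a single node, and the weighted sum Σ_i (n_i/n)·GINI(i) for the quality of a split. The counts below are illustrative:

```python
def gini(counts):
    """GINI(t) = 1 - sum_j p(j|t)^2, from per-class record counts at node t."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """Quality of a k-way split: sum over children i of (n_i / n) * GINI(i)."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

print(gini([5, 5]))    # equal distribution over 2 classes -> 0.5 (= 1 - 1/nc)
print(gini([10, 0]))   # pure node -> 0.0
print(gini_split([[5, 1], [2, 4]]))  # weighted Gini of a binary split
```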
Binary Attributes: Computing GINI Index
Splits into two partitions. Effect of weighting partitions: larger and purer partitions are sought.
Categorical Attributes: Computing Gini Index
For each distinct value, gather counts for each class in the dataset, then use the count matrix to make decisions: two-way split (find the best partition of values) or multi-way split.
Continuous Attributes: Computing Gini Index
- Use binary decisions based on one value
- Several choices for the splitting value: the number of possible splitting values equals the number of distinct values
- Each splitting value v has a count matrix associated with it: class counts in each of the partitions, A < v and A ≥ v
- Simple method to choose the best v: for each v, scan the database to gather the count matrix and compute its Gini index. Computationally inefficient! Repetition of work.
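The repeated database scans can be avoided by sorting the records on the attribute once and updating the class counts incrementally as the candidate cut point moves, as efficient implementations typically do. A sketch, with hypothetical (value, label) pairs:

```python
def best_gini_cut(records):
    """records: list of (attribute_value, class_label) pairs.
    Sort once, then slide the cut point, updating the count matrix
    incrementally instead of rescanning the database for every v."""
    records = sorted(records)
    labels = {lab for _, lab in records}
    right = {lab: 0 for lab in labels}
    for _, lab in records:
        right[lab] += 1
    left = {lab: 0 for lab in labels}
    n = len(records)

    def gini(counts, total):
        return 1.0 - sum((c / total) ** 2 for c in counts.values()) if total else 0.0

    best_v, best_g = None, float("inf")
    for i in range(n - 1):
        v, lab = records[i]
        left[lab] += 1
        right[lab] -= 1
        if records[i + 1][0] == v:  # only cut between distinct values
            continue
        nl, nr = i + 1, n - i - 1
        g = nl / n * gini(left, nl) + nr / n * gini(right, nr)
        if g < best_g:
            best_v, best_g = (v + records[i + 1][0]) / 2, g
    return best_v, best_g

# A perfectly separable example: the best cut falls between 70 and 85.
print(best_gini_cut([(60, "no"), (70, "no"), (85, "yes"), (95, "yes")]))
```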
Measure of Impurity: Entropy
Entropy at a given node t: Entropy(t) = -Σ_j p(j|t) log2 p(j|t), where p(j|t) is the relative frequency of class j at node t.
Measures the homogeneity of a node.
- Maximum (log nc) when records are equally distributed among all classes, implying the least information
- Minimum (0.0) when all records belong to one class, implying the most information
Splitting Based on Entropy
When a parent node p is split into k partitions, GAIN_split = Entropy(p) - Σ_{i=1..k} (n_i / n) Entropy(i), where n_i is the number of records in partition i.
Classification error at a node t: Error(t) = 1 - max_i p(i|t).
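Entropy(t) = -Σ_j p(j|t) log2 p(j|t), the classification error Error(t) = 1 - max_i p(i|t), and the information gain of a split can all be computed from per-class counts; a sketch with illustrative numbers:

```python
import math

def entropy(counts):
    """Entropy(t) = -sum_j p(j|t) * log2 p(j|t), with 0 * log 0 taken as 0."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)

def classification_error(counts):
    """Error(t) = 1 - max_i p(i|t)."""
    n = sum(counts)
    return 1.0 - max(counts) / n

def information_gain(parent, children):
    """GAIN = Entropy(parent) - sum_i (n_i / n) * Entropy(child_i)."""
    n = sum(parent)
    return entropy(parent) - sum(sum(c) / n * entropy(c) for c in children)

print(entropy([5, 5]))               # maximum log2(nc) = 1.0 for two classes
print(classification_error([9, 1]))  # 0.1
print(information_gain([5, 5], [[5, 1], [0, 4]]))
```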
Stopping Criteria for Tree Induction
- Stop expanding a node when all the records belong to the same class
- Stop expanding a node when all the records have similar attribute values
- Early termination (to be discussed later)
Decision Tree Based Classification
Advantages:
- Inexpensive to construct
- Extremely fast at classifying unknown records
- Easy to interpret for small-sized trees
- Accuracy is comparable to other classification techniques for many simple data sets
Practical Issues of Classification
- Underfitting and overfitting
- Missing values
- Costs of classification
Notes on Overfitting
- Overfitting results in decision trees that are more complex than necessary
- Training error no longer provides a good estimate of how well the tree will perform on previously unseen records
- Need new ways of estimating errors
How to Address Overfitting
Pre-pruning: stop the algorithm before it becomes a fully-grown tree.
Typical stopping conditions for a node:
- Stop if all instances belong to the same class
- Stop if all the attribute values are the same
More restrictive conditions:
- Stop if the number of instances is less than some user-specified threshold
- Stop if the class distribution of instances is independent of the available features (e.g., using the χ² test)
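The χ² stopping condition can be sketched for a binary feature and a binary class: compute the statistic from the feature-vs-class count table and compare it against 3.841, the 0.05 critical value for one degree of freedom. The count table below is hypothetical:

```python
def chi_square_2x2(table):
    """Chi-square statistic for a 2x2 count table (feature value x class)."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    return stat

CRITICAL_05_DF1 = 3.841  # 0.05 critical value, 1 degree of freedom
table = [[20, 5], [6, 19]]  # hypothetical class counts per feature value
if chi_square_2x2(table) <= CRITICAL_05_DF1:
    print("class independent of feature: stop splitting")
else:
    print("class depends on feature: keep splitting")
```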
How to Address Overfitting…
Post-pruning:
- Grow the decision tree to its entirety
- Trim the nodes of the decision tree in a bottom-up fashion
- If generalization error improves after trimming, replace the sub-tree with a leaf node; the class label of the leaf node is determined from the majority class of instances in the sub-tree
- Can use MDL (Minimum Description Length) for post-pruning
Other Issues
- Data fragmentation
- Search strategy
- Expressiveness
- Tree replication
Data Fragmentation
The number of instances gets smaller as you traverse down the tree; the number of instances at the leaf nodes could be too small to make any statistically significant decision.
Search Strategy
- Finding an optimal decision tree is NP-hard
- The algorithm presented so far uses a greedy, top-down, recursive partitioning strategy to induce a reasonable solution
- Other strategies? Bottom-up, bi-directional
Expressiveness
- Decision trees provide an expressive representation for learning discrete-valued functions, but they do not generalize well to certain types of Boolean functions
- Not expressive enough for modeling continuous variables, particularly when the test condition involves only a single attribute at a time
Tree Replication
The same subtree appears in multiple branches.
Model Evaluation
- Metrics for performance evaluation: how to evaluate the performance of a model?
- Methods for performance evaluation: how to obtain reliable estimates?
- Methods for model comparison: how to compare the relative performance among competing models?
Metrics for Performance Evaluation
Focus on the predictive capability of a model, rather than on how fast it classifies or builds models, scalability, etc.
It is determined using a confusion matrix and a cost matrix.
Methods for Performance Evaluation
How to obtain a reliable estimate of performance? The performance of a model may depend on factors other than the learning algorithm:
- Class distribution
- Cost of misclassification
- Size of the training and test sets
Methods of Estimation
- Holdout: reserve 2/3 for training and 1/3 for testing
- Random subsampling: repeated holdout
- Cross-validation: partition data into k disjoint subsets; k-fold: train on k-1 partitions, test on the remaining one; leave-one-out: k = n
- Stratified sampling: oversampling vs. undersampling
- Bootstrap: sampling with replacement
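The cross-validation scheme above (partition the data into k disjoint subsets, train on k-1, test on the remaining one) can be sketched as follows; the data here is a hypothetical list of records:

```python
import random

def k_fold_splits(records, k, seed=0):
    """Partition records into k disjoint folds; each fold serves as the
    test set exactly once while the remaining folds form the training set."""
    records = records[:]
    random.Random(seed).shuffle(records)
    folds = [records[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, test

# Leave-one-out is the special case k = len(records).
data = list(range(10))
for train, test in k_fold_splits(data, k=5):
    print(len(train), len(test))  # 8 2, five times
```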
Methods for Model Comparison: ROC
- Developed in the 1950s for signal detection theory, to analyze noisy signals
- Characterizes the trade-off between positive hits and false alarms
- The ROC curve plots the TP rate (on the y-axis) against the FP rate (on the x-axis)
- The performance of each classifier is represented as a point on the ROC curve; changing the algorithm's threshold, the sample distribution, or the cost matrix changes the location of the point
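A single ROC point can be computed from scored predictions: the TP rate on the y-axis and the FP rate on the x-axis at one threshold; moving the threshold moves the point along the curve. The scores and labels below are hypothetical:

```python
def roc_point(predictions, threshold):
    """One classifier operating point: (TP rate, FP rate) at a threshold.
    predictions: list of (score, true_label) with labels 1 (positive) / 0."""
    tp = fp = pos = neg = 0
    for score, label in predictions:
        if label == 1:
            pos += 1
            tp += score >= threshold
        else:
            neg += 1
            fp += score >= threshold
    return tp / pos, fp / neg  # y-axis value, x-axis value

preds = [(0.9, 1), (0.8, 1), (0.7, 0), (0.4, 1), (0.3, 0), (0.1, 0)]
print(roc_point(preds, threshold=0.5))  # lowering the threshold raises both rates
```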
Test of Significance
Given two models:
- Model M1: accuracy = 85%, tested on 30 instances
- Model M2: accuracy = 75%, tested on 5000 instances
Can we say M1 is better than M2? How much confidence can we place in the accuracies of M1 and M2? Can the difference in performance be explained as the result of random fluctuations in the test set?
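One common way to quantify that confidence (an assumption here, since the slide does not name a specific test) is the normal approximation to the binomial: treat accuracy as a proportion and form the interval acc ± z·sqrt(acc·(1-acc)/N):

```python
import math

def accuracy_interval(acc, n, z=1.96):
    """Approximate 95% confidence interval for an accuracy measured
    on n test instances (normal approximation to the binomial)."""
    half = z * math.sqrt(acc * (1 - acc) / n)
    return acc - half, acc + half

print(accuracy_interval(0.85, 30))    # M1: wide interval, small test set
print(accuracy_interval(0.75, 5000))  # M2: narrow interval, large test set
```

M1's interval is far wider than M2's and contains 0.75, so the apparent 10-point advantage may well be random fluctuation.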
Conclusion
Decision tree induction, the algorithm for decision tree induction, model overfitting, and evaluating the performance of a classifier were studied in detail.

Visit more self help tutorials
Pick a tutorial of your choice and browse through it at your own pace. The tutorials section is free, self-guiding, and will not involve any additional support. Visit us at www.dataminingtools.net