Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation
Classification: Definition
Given a collection of records (the training set), where each record contains a set of attributes, one of which is the class.
Find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
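The setting above can be sketched in a few lines of Python: a training set of records, each a set of attributes plus a class, and a model that is just a function from the other attributes to the class. The records and the hand-written rule below are hypothetical illustrations, not the output of an induction algorithm.

```python
# Hypothetical training set: each record has attributes ("refund",
# "income") plus a class attribute ("cheat").
training_set = [
    {"refund": "yes", "income": 125, "cheat": "no"},
    {"refund": "no",  "income": 100, "cheat": "yes"},
    {"refund": "no",  "income": 70,  "cheat": "no"},
    {"refund": "yes", "income": 120, "cheat": "no"},
]

def model(record):
    """The class attribute expressed as a function of the other attributes."""
    if record["refund"] == "yes":
        return "no"
    return "yes" if record["income"] >= 80 else "no"

# The model agrees with every training record...
assert all(model(r) == r["cheat"] for r in training_set)
# ...and assigns a class to a previously unseen record.
print(model({"refund": "no", "income": 95}))  # -> yes
```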
Examples of Classification Tasks
- Predicting tumor cells as benign or malignant
- Classifying credit card transactions as legitimate or fraudulent
- Classifying secondary structures of proteins as alpha-helix, beta-sheet, or random coil
- Categorizing news stories as finance, weather, entertainment, sports, etc.
Classification Techniques
- Decision tree based methods
- Rule-based methods
- Memory based reasoning
- Neural networks
- Naïve Bayes and Bayesian belief networks
- Support vector machines
Decision Tree Induction
Many algorithms:
- Hunt's Algorithm (one of the earliest)
- CART
- ID3, C4.5
- SLIQ, SPRINT
Tree Induction
Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
Issues:
- Determine how to split the records: how to specify the attribute test condition? How to determine the best split?
- Determine when to stop splitting
How to Specify the Test Condition?
- Depends on attribute type: nominal, ordinal, continuous
- Depends on the number of ways to split: 2-way split, multi-way split
Splitting Based on Nominal Attributes
Multi-way split: use as many partitions as there are distinct values (e.g., CarType split into Family, Sports, and Luxury).
Contd.
Binary split: divides the values into two subsets; need to find the optimal partitioning (e.g., CarType split into {Sports, Luxury} vs. {Family}).
Splitting Based on Continuous Attributes
Different ways of handling:
- Discretization to form an ordinal categorical attribute
  - Static: discretize once at the beginning
  - Dynamic: ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering
Contd.
- Binary decision: (A < v) or (A ≥ v)
  - Consider all possible splits and find the best cut
  - Can be more compute intensive
How to Determine the Best Split
Greedy approach: nodes with a homogeneous class distribution are preferred.
Need a measure of node impurity.
Measures of Node Impurity
- Gini index
- Entropy
- Misclassification error
Measure of Impurity: GINI
Gini index at a given node t: GINI(t) = 1 - Σ_j [p(j|t)]², where p(j|t) is the relative frequency of class j at node t.
- Maximum (1 - 1/nc) when records are equally distributed among all nc classes, implying the least interesting information
- Minimum (0.0) when all records belong to one class, implying the most interesting information
Splitting Based on GINI
Used in CART, SLIQ, SPRINT.
When a node p is split into k partitions (children), the quality of the split is computed as
GINI_split = Σ_{i=1..k} (n_i / n) GINI(i),
where n_i = number of records at child i and n = number of records at node p.
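Both measures can be sketched directly from per-class record counts: GINI(t) = 1 - Σ_j p(j|t)² for a single node, and the weighted sum Σ_i (n_i/n)·GINI(i) for the quality of a split. The counts below are illustrative:

```python
def gini(counts):
    """GINI(t) = 1 - sum_j p(j|t)^2, from per-class record counts at node t."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """Quality of a k-way split: sum over children i of (n_i / n) * GINI(i)."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

print(gini([5, 5]))    # equal distribution over 2 classes -> 0.5 (= 1 - 1/nc)
print(gini([10, 0]))   # pure node -> 0.0
print(gini_split([[5, 1], [2, 4]]))  # weighted Gini of a binary split
```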
Binary Attributes: Computing GINI Index
Splits into two partitions. Effect of weighting partitions: larger and purer partitions are sought.
Categorical Attributes: Computing Gini Index
For each distinct value, gather counts for each class in the dataset, then use the count matrix to make decisions: two-way split (find the best partition of values) or multi-way split.
Continuous Attributes: Computing Gini Index
- Use binary decisions based on one value
- Several choices for the splitting value: the number of possible splitting values equals the number of distinct values
- Each splitting value v has a count matrix associated with it: class counts in each of the partitions, A < v and A ≥ v
- Simple method to choose the best v: for each v, scan the database to gather the count matrix and compute its Gini index. Computationally inefficient! Repetition of work.
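The repeated database scans can be avoided by sorting the records on the attribute once and updating the class counts incrementally as the candidate cut point moves, as efficient implementations typically do. A sketch, with hypothetical (value, label) pairs:

```python
def best_gini_cut(records):
    """records: list of (attribute_value, class_label) pairs.
    Sort once, then slide the cut point, updating the count matrix
    incrementally instead of rescanning the database for every v."""
    records = sorted(records)
    labels = {lab for _, lab in records}
    right = {lab: 0 for lab in labels}
    for _, lab in records:
        right[lab] += 1
    left = {lab: 0 for lab in labels}
    n = len(records)

    def gini(counts, total):
        return 1.0 - sum((c / total) ** 2 for c in counts.values()) if total else 0.0

    best_v, best_g = None, float("inf")
    for i in range(n - 1):
        v, lab = records[i]
        left[lab] += 1
        right[lab] -= 1
        if records[i + 1][0] == v:  # only cut between distinct values
            continue
        nl, nr = i + 1, n - i - 1
        g = nl / n * gini(left, nl) + nr / n * gini(right, nr)
        if g < best_g:
            best_v, best_g = (v + records[i + 1][0]) / 2, g
    return best_v, best_g

# A perfectly separable example: the best cut falls between 70 and 85.
print(best_gini_cut([(60, "no"), (70, "no"), (85, "yes"), (95, "yes")]))
```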
Measure of Impurity: Entropy
Entropy at a given node t: Entropy(t) = -Σ_j p(j|t) log2 p(j|t), where p(j|t) is the relative frequency of class j at node t.
Measures the homogeneity of a node.
- Maximum (log nc) when records are equally distributed among all classes, implying the least information
- Minimum (0.0) when all records belong to one class, implying the most information
Splitting Based on Entropy
When a parent node p is split into k partitions, GAIN_split = Entropy(p) - Σ_{i=1..k} (n_i / n) Entropy(i), where n_i is the number of records in partition i.
Classification error at a node t: Error(t) = 1 - max_i p(i|t).
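Entropy(t) = -Σ_j p(j|t) log2 p(j|t), the classification error Error(t) = 1 - max_i p(i|t), and the information gain of a split can all be computed from per-class counts; a sketch with illustrative numbers:

```python
import math

def entropy(counts):
    """Entropy(t) = -sum_j p(j|t) * log2 p(j|t), with 0 * log 0 taken as 0."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)

def classification_error(counts):
    """Error(t) = 1 - max_i p(i|t)."""
    n = sum(counts)
    return 1.0 - max(counts) / n

def information_gain(parent, children):
    """GAIN = Entropy(parent) - sum_i (n_i / n) * Entropy(child_i)."""
    n = sum(parent)
    return entropy(parent) - sum(sum(c) / n * entropy(c) for c in children)

print(entropy([5, 5]))               # maximum log2(nc) = 1.0 for two classes
print(classification_error([9, 1]))  # 0.1
print(information_gain([5, 5], [[5, 1], [0, 4]]))
```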
Stopping Criteria for Tree Induction
- Stop expanding a node when all the records belong to the same class
- Stop expanding a node when all the records have similar attribute values
- Early termination (to be discussed later)
Decision Tree Based Classification
Advantages:
- Inexpensive to construct
- Extremely fast at classifying unknown records
- Easy to interpret for small-sized trees
- Accuracy is comparable to other classification techniques for many simple data sets
Practical Issues of Classification
- Underfitting and overfitting
- Missing values
- Costs of classification
Notes on Overfitting
- Overfitting results in decision trees that are more complex than necessary
- Training error no longer provides a good estimate of how well the tree will perform on previously unseen records
- Need new ways of estimating errors
How to Address Overfitting
Pre-pruning: stop the algorithm before it becomes a fully-grown tree.
Typical stopping conditions for a node:
- Stop if all instances belong to the same class
- Stop if all the attribute values are the same
More restrictive conditions:
- Stop if the number of instances is less than some user-specified threshold
- Stop if the class distribution of instances is independent of the available features (e.g., using the χ² test)
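The χ² stopping condition can be sketched for a binary feature and a binary class: compute the statistic from the feature-vs-class count table and compare it against 3.841, the 0.05 critical value for one degree of freedom. The count table below is hypothetical:

```python
def chi_square_2x2(table):
    """Chi-square statistic for a 2x2 count table (feature value x class)."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    return stat

CRITICAL_05_DF1 = 3.841  # 0.05 critical value, 1 degree of freedom
table = [[20, 5], [6, 19]]  # hypothetical class counts per feature value
if chi_square_2x2(table) <= CRITICAL_05_DF1:
    print("class independent of feature: stop splitting")
else:
    print("class depends on feature: keep splitting")
```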
How to Address Overfitting…
Post-pruning:
- Grow the decision tree to its entirety
- Trim the nodes of the decision tree in a bottom-up fashion
- If generalization error improves after trimming, replace the sub-tree with a leaf node; the class label of the leaf node is determined from the majority class of instances in the sub-tree
- Can use MDL (Minimum Description Length) for post-pruning
Other Issues
- Data fragmentation
- Search strategy
- Expressiveness
- Tree replication
Data Fragmentation
The number of instances gets smaller as you traverse down the tree; the number of instances at the leaf nodes could be too small to make any statistically significant decision.
Search Strategy
- Finding an optimal decision tree is NP-hard
- The algorithm presented so far uses a greedy, top-down, recursive partitioning strategy to induce a reasonable solution
- Other strategies? Bottom-up, bi-directional
Expressiveness
- Decision trees provide an expressive representation for learning discrete-valued functions, but they do not generalize well to certain types of Boolean functions
- Not expressive enough for modeling continuous variables, particularly when the test condition involves only a single attribute at a time
Tree Replication
The same subtree appears in multiple branches.
Model Evaluation
- Metrics for performance evaluation: how to evaluate the performance of a model?
- Methods for performance evaluation: how to obtain reliable estimates?
- Methods for model comparison: how to compare the relative performance among competing models?
Metrics for Performance Evaluation
Focus on the predictive capability of a model, rather than on how fast it classifies or builds models, scalability, etc.
It is determined using a confusion matrix and a cost matrix.
Methods for Performance Evaluation
How to obtain a reliable estimate of performance? The performance of a model may depend on factors other than the learning algorithm:
- Class distribution
- Cost of misclassification
- Size of the training and test sets
Methods of Estimation
- Holdout: reserve 2/3 for training and 1/3 for testing
- Random subsampling: repeated holdout
- Cross-validation: partition data into k disjoint subsets; k-fold: train on k-1 partitions, test on the remaining one; leave-one-out: k = n
- Stratified sampling: oversampling vs. undersampling
- Bootstrap: sampling with replacement
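The cross-validation scheme above (partition the data into k disjoint subsets, train on k-1, test on the remaining one) can be sketched as follows; the data here is a hypothetical list of records:

```python
import random

def k_fold_splits(records, k, seed=0):
    """Partition records into k disjoint folds; each fold serves as the
    test set exactly once while the remaining folds form the training set."""
    records = records[:]
    random.Random(seed).shuffle(records)
    folds = [records[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, test

# Leave-one-out is the special case k = len(records).
data = list(range(10))
for train, test in k_fold_splits(data, k=5):
    print(len(train), len(test))  # 8 2, five times
```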
Methods for Model Comparison: ROC
- Developed in the 1950s for signal detection theory, to analyze noisy signals
- Characterizes the trade-off between positive hits and false alarms
- The ROC curve plots the TP rate (on the y-axis) against the FP rate (on the x-axis)
- The performance of each classifier is represented as a point on the ROC curve; changing the algorithm's threshold, the sample distribution, or the cost matrix changes the location of the point
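A single ROC point can be computed from scored predictions: the TP rate on the y-axis and the FP rate on the x-axis at one threshold; moving the threshold moves the point along the curve. The scores and labels below are hypothetical:

```python
def roc_point(predictions, threshold):
    """One classifier operating point: (TP rate, FP rate) at a threshold.
    predictions: list of (score, true_label) with labels 1 (positive) / 0."""
    tp = fp = pos = neg = 0
    for score, label in predictions:
        if label == 1:
            pos += 1
            tp += score >= threshold
        else:
            neg += 1
            fp += score >= threshold
    return tp / pos, fp / neg  # y-axis value, x-axis value

preds = [(0.9, 1), (0.8, 1), (0.7, 0), (0.4, 1), (0.3, 0), (0.1, 0)]
print(roc_point(preds, threshold=0.5))  # lowering the threshold raises both rates
```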
Test of Significance
Given two models:
- Model M1: accuracy = 85%, tested on 30 instances
- Model M2: accuracy = 75%, tested on 5000 instances
Can we say M1 is better than M2? How much confidence can we place in the accuracies of M1 and M2? Can the difference in performance be explained as the result of random fluctuations in the test set?
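One common way to quantify that confidence (an assumption here, since the slide does not name a specific test) is the normal approximation to the binomial: treat accuracy as a proportion and form the interval acc ± z·sqrt(acc·(1-acc)/N):

```python
import math

def accuracy_interval(acc, n, z=1.96):
    """Approximate 95% confidence interval for an accuracy measured
    on n test instances (normal approximation to the binomial)."""
    half = z * math.sqrt(acc * (1 - acc) / n)
    return acc - half, acc + half

print(accuracy_interval(0.85, 30))    # M1: wide interval, small test set
print(accuracy_interval(0.75, 5000))  # M2: narrow interval, large test set
```

M1's interval is far wider than M2's and contains 0.75, so the apparent 10-point advantage may well be random fluctuation.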
Conclusion
Decision tree induction, the algorithm for decision tree induction, model overfitting, and evaluating the performance of a classifier were studied in detail.

Visit more self help tutorials
Pick a tutorial of your choice and browse through it at your own pace. The tutorials section is free, self-guiding, and will not involve any additional support. Visit us at www.dataminingtools.net