On Cascading Small Decision Trees Julià Minguillón Combinatorics and Digital Communications Group (CCD) Autonomous University of Barcelona (UAB) Barcelona, Spain http://www.tesisenxarxa.net/TESIS_UAB/AVAILABLE/TDX-1209102-150635/jma1de1.pdf
Table of contents Introduction Decision trees Combining classifiers Experimental results Theoretical issues Conclusions Further research References
Introduction Main goal:  to build simple and fast classifiers for data mining Partial goals: To reduce both training and exploitation costs To increase classification accuracy To permit partial classification Several classification systems could be used: decision trees, neural networks, support vector machines, nearest neighbour classifier, etc.
Decision trees Introduced by Quinlan in 1983 and developed by Breiman et al. in 1984 Decision trees reproduce the way humans take decisions: a path of questions is followed from the input sample to the output label Decision trees are based on recursive partitioning of the input space, trying to separate elements from different classes Supervised training → labeled data is used for training
Why decision trees? Natural handling of data of mixed types Handling of missing values Robustness to outliers in input space Insensitive to monotone transformations Computational scalability Ability to deal with irrelevant inputs Interpretability
Growing decision trees (binary)

T = (data set)   /* initially the tree is a single leaf */
while stoppingCriterion(T) is false
    select t from T maximising selectionCriterion(t)
    split t = (t_L, t_R) maximising splittingCriterion(t, t_L, t_R)
    replace t in T with (t_L, t_R)
end
prune back T using the BFOS algorithm
choose T' minimising classification error on (data set')
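As an illustration (not the author's implementation), the loop above can be sketched in Python with axis-aligned splits and entropy impurity; the names entropy, best_split and grow_tree are made up here, integer class labels 0..K-1 are assumed, and BFOS pruning is omitted.

```python
import numpy as np

def entropy(y):
    """Impurity of a node holding integer class labels y."""
    p = np.bincount(y) / len(y)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def best_split(X, y):
    """Best orthogonal split of a node: (feature, threshold, impurity decrease)."""
    best = (None, None, 0.0)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= thr
            gain = entropy(y) - left.mean() * entropy(y[left]) \
                              - (~left).mean() * entropy(y[~left])
            if gain > best[2]:
                best = (j, thr, gain)
    return best

def grow_tree(X, y, max_leaves=8):
    """Greedy best-first growing: repeatedly split the most promising leaf."""
    leaves, splits = [np.arange(len(y))], []      # a leaf = indices of its samples
    while len(leaves) < max_leaves:
        scored = [best_split(X[idx], y[idx]) for idx in leaves]
        i = max(range(len(leaves)), key=lambda k: scored[k][2])   # selection criterion
        j, thr, gain = scored[i]
        if j is None or gain <= 0:                # stopping criterion
            break
        idx = leaves.pop(i)
        go_left = X[idx, j] <= thr
        leaves += [idx[go_left], idx[~go_left]]   # replace t with (t_L, t_R)
        splits.append((j, thr))
    return leaves, splits
```

grow_tree returns, for each leaf, the indices of the training samples that reach it, together with the list of splits that were applied.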
Growing algorithm parameters The computed decision tree is determined by: Stopping criterion Node selection criterion Splitting criterion Labelling rule If a perfect decision tree is built and then it is pruned back, both the stopping and the node selection criteria become irrelevant
Splitting criterion Measures the gain of a split for a given criterion Usually related to the concept of impurity Classification performance may be very sensitive to such criterion Entropy and R-norm criteria yield the best results on average, Bayes error criterion the worst Different kinds of splits: Orthogonal hyperplanes: fast, interpretable, poor performance General hyperplanes: expensive, partially interpretable Distance based (spherical trees): expensive, allow clustering
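For concreteness, a small sketch (not from the thesis) of three common impurity measures on a vector of class probabilities; the gain of a split is the parent impurity minus the size-weighted impurity of the children. The R-norm criterion is left out because its exact form is not given on this slide.

```python
import numpy as np

def entropy_impurity(p):
    p = np.asarray(p, float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def gini_impurity(p):
    p = np.asarray(p, float)
    return float(1.0 - (p ** 2).sum())

def bayes_error_impurity(p):
    return float(1.0 - np.max(p))

# A pure node scores 0 under all three; an even split scores highest.
for p in ([0.5, 0.5], [0.9, 0.1], [1.0, 0.0]):
    print(p, entropy_impurity(p), gini_impurity(p), bayes_error_impurity(p))
```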
Labelling rule Each leaf t is labelled in order to minimise the expected misclassification cost: l(t) = arg min_j { r(t) = Σ_{k=0..K-1} C(j,k) p(k|t) } Different classification costs C(j,k) are allowed A priori class probabilities may be included Margin is defined as 1 - 2r(t), or also as max_k { p(k|t) } - 2nd max_k { p(k|t) }
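A direct transcription of this rule (illustrative; the function name and the 0/1 cost matrix in the usage line are assumptions):

```python
import numpy as np

def label_leaf(p, C):
    """Cost-minimising label for a leaf.
    p[k]   : estimated class probabilities p(k|t)
    C[j,k] : cost of deciding class j when the true class is k
    Returns l(t), r(t) and both margin definitions."""
    p, C = np.asarray(p, float), np.asarray(C, float)
    risks = C @ p                         # r_j(t) = sum_k C(j,k) p(k|t)
    j = int(np.argmin(risks))
    r = float(risks[j])
    top2 = np.sort(p)[-2:]                # two largest estimated probabilities
    return j, r, 1.0 - 2.0 * r, float(top2[1] - top2[0])

# With 0/1 costs this reduces to the usual majority rule:
print(label_leaf([0.7, 0.2, 0.1], 1 - np.eye(3)))
```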
Problems Repetition, replication and fragmentation Poor performance for large data dimensionality or large number of classes Orthogonal splits may lead to poor classification performance due to poor internal decision functions Overfitting may occur for large decision trees Training is very expensive for large data sets Decision trees are unstable classifiers
Progressive decision trees Goal:  to overcome some problems related to the use of classical decision trees Basic idea:  to break the classification problem into a sequence of partial classification problems, from easier to more difficult Only small decision trees are used: Avoid overfitting Reduce both training and exploitation costs Permit partial classification Detect possible outliers Decision trees become decision graphs
Growing progressive decision trees Build a complete decision tree of depth d Prune it using the BFOS algorithm Relabel it using the new labelling rule: a leaf is labelled as mixed if its margin is not large enough (below a given threshold) Join all regions labelled as mixed Start again using only the mixed regions
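A rough sketch of one stage (an illustration under assumptions, not the thesis code): scikit-learn's cost-complexity pruning stands in for BFOS, integer class labels are assumed, and the margin threshold is a made-up value.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

MIXED = -1                                     # pseudo-label for the mixed class

def progressive_stage(X, y, depth=3, margin_threshold=0.6):
    """Grow a small pruned tree, relabel low-margin leaves as mixed,
    and return the samples falling into mixed regions for the next stage."""
    tree = DecisionTreeClassifier(max_depth=depth, ccp_alpha=1e-3).fit(X, y)
    proba = tree.predict_proba(X)
    top2 = np.sort(proba, axis=1)[:, -2:]
    margin = top2[:, 1] - top2[:, 0]           # max p(k|x) - 2nd max p(k|x)
    labels = np.where(margin >= margin_threshold, tree.predict(X), MIXED)
    mixed = labels == MIXED
    return tree, labels, X[mixed], y[mixed]
```

Iterating progressive_stage on the returned mixed samples yields the cascade of small trees described above.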
Example (I) [figure: first stage, regions labelled 0, 1 and M (mixed)]
Example (II) [figure: second stage on the mixed regions, regions labelled 0, 1 and M]
Example (III) [figure: resulting partition, regions labelled 0, 1 and M]
Combining classifiers Basic idea:  instead of building a complex classifier, build several simple classifiers and combine them into a more complex one Several paradigms: Voting: bagging, boosting, randomising Stacking Cascading Why do they work?  Because different classifiers make different kinds of mistakes Different classifiers are built by using different training sets
Cascading generalization Developed by Gama et al. in 2000 Basic idea:  simple classifiers are sequentially ensembled carrying over information from one classifier to the next in the sequence Three types of cascading ensembles: Type A: no additional info, mixed class Type B: additional info, no mixed class Type C: additional info, mixed class
Type A progressive decision trees No additional info is carried from one stage to the next, but only samples labelled as mixed are passed down: [diagram: D → T → Y; mixed samples form D' for the next stage]
Type B progressive decision trees Additional info (estimated class probabilities and margin) is computed for each sample, and all samples are passed down: [diagram: D → T → Y; augmented data D' for the next stage]
Type C progressive decision trees Additional info is computed for each sample, and only samples labelled as mixed are passed down: [diagram: D → T → Y; mixed samples form D' for the next stage]
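The three variants can be sketched with off-the-shelf trees (an illustration, not the thesis code; depths and the margin threshold are arbitrary). Type B appends the first tree's estimated class probabilities as extra features and passes everything down; types A and C only pass down the low-margin (mixed) samples.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_cascade(X, y, cascade_type="B", depth=3, margin_threshold=0.6):
    t1 = DecisionTreeClassifier(max_depth=depth).fit(X, y)
    p1 = t1.predict_proba(X)
    top2 = np.sort(p1, axis=1)[:, -2:]
    mixed = (top2[:, 1] - top2[:, 0]) < margin_threshold   # low-margin samples

    if cascade_type == "A":        # no extra info, only mixed samples go down
        X2, y2 = X[mixed], y[mixed]
    elif cascade_type == "B":      # extra info, all samples go down
        X2, y2 = np.hstack([X, p1]), y
    else:                          # type C: extra info, only mixed samples go down
        X2, y2 = np.hstack([X, p1])[mixed], y[mixed]

    t2 = DecisionTreeClassifier(max_depth=depth).fit(X2, y2)
    return t1, t2
```

At exploitation time, types A and C label a sample with the first tree when its margin is large enough and pass it down otherwise, while type B always applies both trees, using the augmented features in the second one.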
Experimental results Four different projects: Document layout recognition, Hyperspectral imaging and Brain tumour classification (real projects), plus the UCI collection → evaluation Basic tools for evaluation: N-fold cross-validation, bootstrapping, bias-variance decomposition
Document layout recognition (I) Goal:  adaptive compression for an automated document storage system using lossy/lossless JPEG standard Four classes: background (removed), text (OCR), line drawings (lossless) and images (lossy) Documents are 8.5” x 11.7” at 150 dpi Target block size: 8 x 8 pixels (JPEG standard) Minguillón, J. et al.,  Progressive classification scheme for document layout recognition , Proc. of the SPIE, Denver, CO, USA, v. 3816:241-250, 1999
Document layout recognition (II) Classical approach: a single decision tree with a block size of 8 x 8 pixels

Size     Num. Blocks        |T|    R      d_max   Error
8 x 8    211200 / 211200    721    8.56   38      0.078
Document layout recognition (III) Progressive approach: four block sizes (64 x 64, 32 x 32, 16 x 16 and 8 x 8)

Size       Num. Blocks        |T|   R      d_max   Error
64 x 64    3360 / 3360        6     2.77   4       0.089
32 x 32    7856 / 13440       14    4.17   6       0.047
16 x 16    21052 / 53760      11    3.72   6       0.042
8 x 8      27892 / 215040     18    4.73   8       0.065
Hyperspectral imaging (I) Image size is 710 x 4558 pixels x 14 bands (available ground truth data is only 400 x 2400) Ground truth data presents some artifacts due to low resolution: around 10% mislabelled 19 classes including asphalt, water, rocks, soil and several vegetation types Goal:  to build a classification system and to identify the most important bands for each class, but also to detect possible outliers in the training set Minguillón, J. et al.,  Adaptive lossy compression and classification of hyperspectral images , Proc. of remote sensing VI, Barcelona, Spain, v. 4170:214-225, 2000
Hyperspectral imaging (II) Classical approach:

Tree   |T|    R      P_T     Error
T1     836    9.83   1.0     0.163

Using the new labelling rule:

Tree   |T|    R      P_T     Error
T2     650    9.60   0.722   0.092
Hyperspectral imaging (III) Progressive approach:

Tree   |T|    R      P_T     Error
T3     44     4.84   0.706   0.094
T3A    9      3.02   0.523   0.056
T3B    8      2.14   0.383   0.199
Brain tumour classification (I) Goal:  to build a classification system for helping clinicians to identify brain tumour types Too many classes and too few samples: a hierarchical structure partially reproducing the WHO tree has been created Different classifiers (LDA,  k -NN, decision trees) are combined using a mixture of cascading and voting schemes Minguillón, J. et al.,  Classifier combination for in vivo magnetic resonance spectra of brain tumours , Proc. of Multiple Classifier Systems, Cagliari, Italy, LNCS 2364
Brain tumour classification (II) Each classification stage combines k-NN, LDA and a decision tree through a voting scheme [diagram: X → {k-NN, LDA, DT} → V → Y] Decision trees use LDA class distances as additional information "Unknown" means the classifiers disagree
Brain tumour classification (III) [figure: hierarchical classification tree with per-node accuracies, including Normal 100%, Tumour 99.5%, Benign 92.1%, Malignant 94.9%, Grade II 82.6%, Grade III 0%, Grade IV 94.7%, Astro 94.1%, Oligo 100%, Secondary 91.4%, Primary 81.8%; leaf groupings MN+SCH+HB, ASTII+OD, GLB+LYM+PNET+MET]
UCI collection Goal:  exhaustive testing of progressive decision trees 20 data sets were chosen: No categorical variables No missing values Large range of number of samples, data dimension and number of classes Available at http://kdd.ics.uci.edu
Experiments setup N-fold cross-validation with N=3 For each training set, 25 bootstrap replicates are generated (subsampling with replacement) Each experiment is repeated 5 times and performance results are averaged Bias-variance decomposition is computed for each repetition and then averaged
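The protocol can be reproduced along these lines (a sketch under the stated setup; the data, the base classifier and its depth are placeholders):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.utils import resample
from sklearn.tree import DecisionTreeClassifier

def evaluate(X, y, n_folds=3, n_replicates=25, n_repeats=5, seed=0):
    """3-fold CV, 25 bootstrap replicates per training fold, repeated 5 times."""
    rng = np.random.RandomState(seed)
    errors = []
    for rep in range(n_repeats):
        cv = KFold(n_splits=n_folds, shuffle=True, random_state=seed + rep)
        for train, test in cv.split(X):
            for _ in range(n_replicates):
                Xb, yb = resample(X[train], y[train],            # bootstrap replicate
                                  random_state=rng.randint(10 ** 6))
                clf = DecisionTreeClassifier(max_depth=5).fit(Xb, yb)
                errors.append(np.mean(clf.predict(X[test]) != y[test]))
    return float(np.mean(errors))
```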
Bias-variance decomposition Several approaches exist; Domingos (2000) is followed here First classifiers in a cascading ensemble should have moderate bias and low variance: small (but not too small) decision trees Last classifiers should have small bias and moderate variance: large (but not too large) decision trees Only classifiers that differ in their bias-variance behaviour should be ensembled: the number of decision trees should be small
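For 0-1 loss, Domingos' decomposition can be estimated from the per-replicate predictions produced by the setup above (a sketch ignoring the noise term, integer labels assumed): the main prediction is the most frequent label across replicates, bias is the loss of that main prediction, and variance is the average disagreement of individual replicates with it.

```python
import numpy as np

def bias_variance_01(preds, y_true):
    """preds: (n_replicates, n_samples) integer predictions; y_true: (n_samples,)."""
    preds = np.asarray(preds)
    # main prediction: most frequent label per test sample across the replicates
    main = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, preds)
    bias = float(np.mean(main != np.asarray(y_true)))        # loss of the main prediction
    variance = float(np.mean(preds != main))                  # average deviation from it
    return bias, variance
```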
Empirical evaluation summary (I) Bias usually predominates over variance on most data sets → decision trees outperform the k-NN classifier Bias decreases fast when the decision tree has enough leaves Variance shows an unpredictable behaviour, depending on the intrinsic characteristics of each data set
Empirical evaluation summary (II) Type B progressive decision trees usually outperform classical decision trees, mainly due to bias reduction. Two or three small decision trees are enough Type A progressive decision trees do not outperform classical decision trees in general, but variance is reduced (classifiers are smaller and thus more stable) Type C experiments are still running...
Theoretical issues Decision trees are convex combinations of internal node decision functions: T_j(x) = Σ_{i=1..|T_j|} p_ij α_ij h_ij(x) Cascading is a convex combination of t decision trees: T(x) = Σ_{j=1..t} q_j T_j(x) Type A: the first decision tree is the most important Type B: the last decision tree is the most important Type C: not applicable
Error generalization bounds Convex combinations may be studied under the margin paradigm defined by Schapire et al. Generalization error depends on the tree structure and on the VC dimension of the internal node functions Unbalanced trees are preferable Unbalanced classifiers are preferable Modest goal:  to see that the current theory related to classifier combination does not deny progressive decision trees
Conclusions Progressive decision trees generalise classical decision trees and the cascading paradigm Cascading is very useful for large data sets with a large number of classes → hierarchical structure Preliminary experiments with type C progressive decision trees look promising… Experiments with real data sets show that it is possible to improve classification accuracy and reduce both classification and exploitation costs at the same time Fine tuning is absolutely necessary!...
Further research The R-norm splitting criterion may be used to build adaptive decision trees Better error generalisation bounds are needed A complete and specific theoretical framework for the cascading paradigm must be developed Parameters (the margin threshold, d and t) are currently set empirically; better justifications are needed New applications (huge data sets): Web mining DNA interpretation
Selected references
Breiman, L. et al., Classification and Regression Trees, Wadsworth International Group, 1984
Gama, J. et al., Cascade Generalization, Machine Learning 41(3):315-343, 2000
Domingos, P., A unified bias-variance decomposition and its applications, Proc. of the 17th Int. Conf. on Machine Learning, Stanford, CA, USA, 231-238, 2000
Schapire, R.E. et al., Boosting the margin: a new explanation for the effectiveness of voting methods, Annals of Statistics 26(5):1651-1686, 1998