On Cascading Small Decision Trees Julià Minguillón Combinatorics and Digital Communications Group (CCD) Autonomous University of Barcelona (UAB) Barcelona, Spain http://www.tesisenxarxa.net/TESIS_UAB/AVAILABLE/TDX-1209102-150635/jma1de1.pdf
Table of contents Introduction Decision trees Combining classifiers Experimental results Theoretical issues Conclusions Further research References
Introduction Main goal:  to build simple and fast classifiers for data mining Partial goals: To reduce both training and exploitation costs To increase classification accuracy To permit partial classification Several classification systems could be used: decision trees, neural networks, support vector machines, nearest neighbour classifier, etc.
Decision trees Introduced by Quinlan in 1983 and developed by Breiman et al. in 1984 Decision trees reproduce the way humans take decisions: a path of questions is followed from the input sample to the output label Decision trees are based on recursive partitioning of the input space, trying to separate elements from different classes Supervised training → labeled data is used for training
Why decision trees? Natural handling of data of mixed types Handling of missing values Robustness to outliers in input space Insensitive to monotone transformations Computational scalability Ability to deal with irrelevant inputs Interpretability
Growing decision trees (binary)

T = (data set)   /* initially the tree is a single leaf */
while stoppingCriterion(T) is false
    select t from T maximising selectionCriterion(t)
    split t = (t_L, t_R) maximising splittingCriterion(t, t_L, t_R)
    replace t in T with (t_L, t_R)
end
prune back T using the BFOS algorithm
choose T' minimising classification error on (data set')
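As an illustration (not the author's implementation), the loop above can be sketched in Python with axis-aligned splits and entropy impurity; the names entropy, best_split and grow_tree are made up here, integer class labels 0..K-1 are assumed, and BFOS pruning is omitted.

```python
import numpy as np

def entropy(y):
    """Impurity of a node holding integer class labels y."""
    p = np.bincount(y) / len(y)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def best_split(X, y):
    """Best orthogonal split of a node: (feature, threshold, impurity decrease)."""
    best = (None, None, 0.0)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= thr
            gain = entropy(y) - left.mean() * entropy(y[left]) \
                              - (~left).mean() * entropy(y[~left])
            if gain > best[2]:
                best = (j, thr, gain)
    return best

def grow_tree(X, y, max_leaves=8):
    """Greedy best-first growing: repeatedly split the most promising leaf."""
    leaves, splits = [np.arange(len(y))], []      # a leaf = indices of its samples
    while len(leaves) < max_leaves:
        scored = [best_split(X[idx], y[idx]) for idx in leaves]
        i = max(range(len(leaves)), key=lambda k: scored[k][2])   # selection criterion
        j, thr, gain = scored[i]
        if j is None or gain <= 0:                # stopping criterion
            break
        idx = leaves.pop(i)
        go_left = X[idx, j] <= thr
        leaves += [idx[go_left], idx[~go_left]]   # replace t with (t_L, t_R)
        splits.append((j, thr))
    return leaves, splits
```

grow_tree returns, for each leaf, the indices of the training samples that reach it, together with the list of splits that were applied.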
Growing algorithm parameters The computed decision tree is determined by: Stopping criterion Node selection criterion Splitting criterion Labelling rule If a perfect decision tree is built and then it is pruned back, both the stopping and the node selection criteria become irrelevant
Splitting criterion Measures the gain of a split for a given criterion Usually related to the concept of impurity Classification performance may be very sensitive to such criterion Entropy and R-norm criteria yield the best results on average, Bayes error criterion the worst Different kinds of splits: Orthogonal hyperplanes: fast, interpretable, poor performance General hyperplanes: expensive, partially interpretable Distance based (spherical trees): expensive, allow clustering
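For concreteness, a small sketch (not from the thesis) of three common impurity measures on a vector of class probabilities; the gain of a split is the parent impurity minus the size-weighted impurity of the children. The R-norm criterion is left out because its exact form is not given on this slide.

```python
import numpy as np

def entropy_impurity(p):
    p = np.asarray(p, float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def gini_impurity(p):
    p = np.asarray(p, float)
    return float(1.0 - (p ** 2).sum())

def bayes_error_impurity(p):
    return float(1.0 - np.max(p))

# A pure node scores 0 under all three; an even split scores highest.
for p in ([0.5, 0.5], [0.9, 0.1], [1.0, 0.0]):
    print(p, entropy_impurity(p), gini_impurity(p), bayes_error_impurity(p))
```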
Labelling rule Each leaf t is labelled in order to minimise the expected misclassification cost: l(t) = arg min_j { r(t) = Σ_{k=0..K-1} C(j,k) p(k|t) } Different classification costs C(j,k) are allowed A priori class probabilities may be included Margin is defined as 1 - 2r(t), or also as max_k { p(k|t) } - 2nd max_k { p(k|t) }
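A direct transcription of this rule (illustrative; the function name and the 0/1 cost matrix in the usage line are assumptions):

```python
import numpy as np

def label_leaf(p, C):
    """Cost-minimising label for a leaf.
    p[k]   : estimated class probabilities p(k|t)
    C[j,k] : cost of deciding class j when the true class is k
    Returns l(t), r(t) and both margin definitions."""
    p, C = np.asarray(p, float), np.asarray(C, float)
    risks = C @ p                         # r_j(t) = sum_k C(j,k) p(k|t)
    j = int(np.argmin(risks))
    r = float(risks[j])
    top2 = np.sort(p)[-2:]                # two largest estimated probabilities
    return j, r, 1.0 - 2.0 * r, float(top2[1] - top2[0])

# With 0/1 costs this reduces to the usual majority rule:
print(label_leaf([0.7, 0.2, 0.1], 1 - np.eye(3)))
```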
Problems Repetition, replication and fragmentation Poor performance for large data dimensionality or large number of classes Orthogonal splits may lead to poor classification performance due to poor internal decision functions Overfitting may occur for large decision trees Training is very expensive for large data sets Decision trees are unstable classifiers
Progressive decision trees Goal:  to overcome some problems related to the use of classical decision trees Basic idea:  to break the classification problem into a sequence of partial classification problems, from easier to more difficult Only small decision trees are used: Avoid overfitting Reduce both training and exploitation costs Permit partial classification Detect possible outliers Decision trees become decision graphs
Growing progressive decision trees Build a complete decision tree of depth d Prune it using the BFOS algorithm Relabel it using the new labelling rule: a leaf is labelled as mixed if its margin is not large enough (below a given threshold) Join all regions labelled as mixed Start again using only the mixed regions
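A rough sketch of one stage (an illustration under assumptions, not the thesis code): scikit-learn's cost-complexity pruning stands in for BFOS, integer class labels are assumed, and the margin threshold is a made-up value.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

MIXED = -1                                     # pseudo-label for the mixed class

def progressive_stage(X, y, depth=3, margin_threshold=0.6):
    """Grow a small pruned tree, relabel low-margin leaves as mixed,
    and return the samples falling into mixed regions for the next stage."""
    tree = DecisionTreeClassifier(max_depth=depth, ccp_alpha=1e-3).fit(X, y)
    proba = tree.predict_proba(X)
    top2 = np.sort(proba, axis=1)[:, -2:]
    margin = top2[:, 1] - top2[:, 0]           # max p(k|x) - 2nd max p(k|x)
    labels = np.where(margin >= margin_threshold, tree.predict(X), MIXED)
    mixed = labels == MIXED
    return tree, labels, X[mixed], y[mixed]
```

Iterating progressive_stage on the returned mixed samples yields the cascade of small trees described above.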
Example (I) [figure: first stage, regions labelled 0, 1 and M (mixed)]
Example (II) [figure: second stage on the mixed regions, regions labelled 0, 1 and M]
Example (III) [figure: resulting partition, regions labelled 0, 1 and M]
Combining classifiers Basic idea:  instead of building a complex classifier, build several simple classifiers and combine them into a more complex one Several paradigms: Voting: bagging, boosting, randomising Stacking Cascading Why do they work?  Because different classifiers make different kinds of mistakes Different classifiers are built by using different training sets
Cascading generalization Developed by Gama et al. in 2000 Basic idea:  simple classifiers are sequentially ensembled carrying over information from one classifier to the next in the sequence Three types of cascading ensembles: Type A: no additional info, mixed class Type B: additional info, no mixed class Type C: additional info, mixed class
Type A progressive decision trees No additional info is carried from one stage to the next, but only samples labelled as mixed are passed down: [diagram: D → T → Y; mixed samples form D' for the next stage]
Type B progressive decision trees Additional info (estimated class probabilities and margin) is computed for each sample, and all samples are passed down: [diagram: D → T → Y; augmented data D' for the next stage]
Type C progressive decision trees Additional info is computed for each sample, and only samples labelled as mixed are passed down: [diagram: D → T → Y; mixed samples form D' for the next stage]
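The three variants can be sketched with off-the-shelf trees (an illustration, not the thesis code; depths and the margin threshold are arbitrary). Type B appends the first tree's estimated class probabilities as extra features and passes everything down; types A and C only pass down the low-margin (mixed) samples.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_cascade(X, y, cascade_type="B", depth=3, margin_threshold=0.6):
    t1 = DecisionTreeClassifier(max_depth=depth).fit(X, y)
    p1 = t1.predict_proba(X)
    top2 = np.sort(p1, axis=1)[:, -2:]
    mixed = (top2[:, 1] - top2[:, 0]) < margin_threshold   # low-margin samples

    if cascade_type == "A":        # no extra info, only mixed samples go down
        X2, y2 = X[mixed], y[mixed]
    elif cascade_type == "B":      # extra info, all samples go down
        X2, y2 = np.hstack([X, p1]), y
    else:                          # type C: extra info, only mixed samples go down
        X2, y2 = np.hstack([X, p1])[mixed], y[mixed]

    t2 = DecisionTreeClassifier(max_depth=depth).fit(X2, y2)
    return t1, t2
```

At exploitation time, types A and C label a sample with the first tree when its margin is large enough and pass it down otherwise, while type B always applies both trees, using the augmented features in the second one.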
Experimental results Four different projects: Document layout recognition, Hyperspectral imaging and Brain tumour classification (real projects), plus the UCI collection → evaluation Basic tools for evaluation: N-fold cross-validation, bootstrapping, bias-variance decomposition
Document layout recognition (I) Goal:  adaptive compression for an automated document storage system using lossy/lossless JPEG standard Four classes: background (removed), text (OCR), line drawings (lossless) and images (lossy) Documents are 8.5” x 11.7” at 150 dpi Target block size: 8 x 8 pixels (JPEG standard) Minguillón, J. et al.,  Progressive classification scheme for document layout recognition , Proc. of the SPIE, Denver, CO, USA, v. 3816:241-250, 1999
Document layout recognition (II) Classical approach: a single decision tree with a block size of 8 x 8 pixels

Size     Num. Blocks        |T|    R      d_max   Error
8 x 8    211200 / 211200    721    8.56   38      0.078
Document layout recognition (III) Progressive approach: four block sizes (64 x 64, 32 x 32, 16 x 16 and 8 x 8)

Size       Num. Blocks        |T|   R      d_max   Error
64 x 64    3360 / 3360        6     2.77   4       0.089
32 x 32    7856 / 13440       14    4.17   6       0.047
16 x 16    21052 / 53760      11    3.72   6       0.042
8 x 8      27892 / 215040     18    4.73   8       0.065
Hyperspectral imaging (I) Image size is 710 x 4558 pixels x 14 bands (available ground truth data is only 400 x 2400) Ground truth data presents some artifacts due to low resolution: around 10% mislabelled 19 classes including asphalt, water, rocks, soil and several vegetation types Goal:  to build a classification system and to identify the most important bands for each class, but also to detect possible outliers in the training set Minguillón, J. et al.,  Adaptive lossy compression and classification of hyperspectral images , Proc. of remote sensing VI, Barcelona, Spain, v. 4170:214-225, 2000
Hyperspectral imaging (II) Classical approach:

Tree   |T|    R      P_T     Error
T1     836    9.83   1.0     0.163

Using the new labelling rule:

Tree   |T|    R      P_T     Error
T2     650    9.60   0.722   0.092
Hyperspectral imaging (III) Progressive approach:

Tree   |T|    R      P_T     Error
T3     44     4.84   0.706   0.094
T3A    9      3.02   0.523   0.056
T3B    8      2.14   0.383   0.199
Brain tumour classification (I) Goal:  to build a classification system for helping clinicians to identify brain tumour types Too many classes and too few samples: a hierarchical structure partially reproducing the WHO tree has been created Different classifiers (LDA,  k -NN, decision trees) are combined using a mixture of cascading and voting schemes Minguillón, J. et al.,  Classifier combination for in vivo magnetic resonance spectra of brain tumours , Proc. of Multiple Classifier Systems, Cagliari, Italy, LNCS 2364
Brain tumour classification (II) Each classification stage combines k-NN, LDA and a decision tree through a voting scheme [diagram: X → {k-NN, LDA, DT} → V → Y] Decision trees use LDA class distances as additional information "Unknown" means the classifiers disagree
Brain tumour classification (III) [figure: hierarchical classification tree with per-node accuracies, including Normal 100%, Tumour 99.5%, Benign 92.1%, Malignant 94.9%, Grade II 82.6%, Grade III 0%, Grade IV 94.7%, Astro 94.1%, Oligo 100%, Secondary 91.4%, Primary 81.8%; leaf groupings MN+SCH+HB, ASTII+OD, GLB+LYM+PNET+MET]
UCI collection Goal:  exhaustive testing of progressive decision trees 20 data sets were chosen: No categorical variables No missing values Large range of number of samples, data dimension and number of classes Available at http://kdd.ics.uci.edu
Experiments setup N-fold cross-validation with N=3 For each training set, 25 bootstrap replicates are generated (subsampling with replacement) Each experiment is repeated 5 times and performance results are averaged Bias-variance decomposition is computed for each repetition and then averaged
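The protocol can be reproduced along these lines (a sketch under the stated setup; the data, the base classifier and its depth are placeholders):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.utils import resample
from sklearn.tree import DecisionTreeClassifier

def evaluate(X, y, n_folds=3, n_replicates=25, n_repeats=5, seed=0):
    """3-fold CV, 25 bootstrap replicates per training fold, repeated 5 times."""
    rng = np.random.RandomState(seed)
    errors = []
    for rep in range(n_repeats):
        cv = KFold(n_splits=n_folds, shuffle=True, random_state=seed + rep)
        for train, test in cv.split(X):
            for _ in range(n_replicates):
                Xb, yb = resample(X[train], y[train],            # bootstrap replicate
                                  random_state=rng.randint(10 ** 6))
                clf = DecisionTreeClassifier(max_depth=5).fit(Xb, yb)
                errors.append(np.mean(clf.predict(X[test]) != y[test]))
    return float(np.mean(errors))
```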
Bias-variance decomposition Several approaches exist; Domingos (2000) is followed here First classifiers in a cascading ensemble should have moderate bias and low variance: small (but not too small) decision trees Last classifiers should have small bias and moderate variance: large (but not too large) decision trees Only classifiers that differ in their bias-variance behaviour should be ensembled: the number of decision trees should be small
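For 0-1 loss, Domingos' decomposition can be estimated from the per-replicate predictions produced by the setup above (a sketch ignoring the noise term, integer labels assumed): the main prediction is the most frequent label across replicates, bias is the loss of that main prediction, and variance is the average disagreement of individual replicates with it.

```python
import numpy as np

def bias_variance_01(preds, y_true):
    """preds: (n_replicates, n_samples) integer predictions; y_true: (n_samples,)."""
    preds = np.asarray(preds)
    # main prediction: most frequent label per test sample across the replicates
    main = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, preds)
    bias = float(np.mean(main != np.asarray(y_true)))        # loss of the main prediction
    variance = float(np.mean(preds != main))                  # average deviation from it
    return bias, variance
```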
Empirical evaluation summary (I) Bias usually predominates over variance on most data sets → decision trees outperform the k-NN classifier Bias decreases fast when the decision tree has enough leaves Variance shows an unpredictable behaviour, depending on the intrinsic characteristics of each data set
Empirical evaluation summary (II) Type B progressive decision trees usually outperform classical decision trees, mainly due to bias reduction. Two or three small decision trees are enough Type A progressive decision trees do not outperform classical decision trees in general, but variance is reduced (classifiers are smaller and thus more stable) Type C experiments are still running...
Theoretical issues Decision trees are convex combinations of internal node decision functions: T_j(x) = Σ_{i=1..|T_j|} p_ij α_ij h_ij(x) Cascading is a convex combination of t decision trees: T(x) = Σ_{j=1..t} q_j T_j(x) Type A: the first decision tree is the most important Type B: the last decision tree is the most important Type C: not applicable
Error generalization bounds Convex combinations may be studied under the margin paradigm defined by Schapire et al. Generalization error depends on the tree structure and on the VC dimension of the internal node functions Unbalanced trees are preferable Unbalanced classifiers are preferable Modest goal:  to see that the current theory related to classifier combination does not deny progressive decision trees
Conclusions Progressive decision trees generalise classical decision trees and the cascading paradigm Cascading is very useful for large data sets with a large number of classes → hierarchical structure Preliminary experiments with type C progressive decision trees look promising… Experiments with real data sets show that it is possible to improve classification accuracy and reduce both classification and exploitation costs at the same time Fine tuning is absolutely necessary!...
Further research The R-norm splitting criterion may be used to build adaptive decision trees Better error generalisation bounds are needed A complete and specific theoretical framework for the cascading paradigm must be developed Parameters (the margin threshold, d and t) are currently set empirically; better justifications are needed New applications (huge data sets): Web mining DNA interpretation
Selected references
Breiman, L. et al., Classification and Regression Trees, Wadsworth International Group, 1984
Gama, J. et al., Cascade Generalization, Machine Learning 41(3):315-343, 2000
Domingos, P., A unified bias-variance decomposition and its applications, Proc. of the 17th Int. Conf. on Machine Learning, Stanford, CA, USA, 231-238, 2000
Schapire, R.E. et al., Boosting the margin: a new explanation for the effectiveness of voting methods, Annals of Statistics 26(5):1651-1686, 1998