Decision Trees
The Machine Learning “Magic” Unveiled
Luca Zavarella
Sponsors
Who am I
Data Science Microsoft Professional Program
Microsoft SQL Server BI MCTS & MCITP
Working with SQL Server since 2007
Mentor & Technical Director @
Email: lzavarella@solidq.com
Twitter: @lucazav
LinkedIn: http://it.linkedin.com/in/lucazavarella
Agenda
• The Classification Problem
• What’s a Decision Tree
• Entropy
• Information Gain
• Gini Gain
• Tree Induction
• Overfitting and Pruning
• Model Evaluation
• Conclusions
Let’s start
The Classification Problem
Classifying something according to shared qualities or characteristics
• Detecting spam email messages based upon the message header and content
• Categorizing cells as malignant or benign based upon the results of Magnetic
Resonance Imaging
• Determining whether a banking transaction is fraudulent or not
In a formal way
• Classification is the task of inferring a target function f (the classification model)
that maps each attribute set x to one of the predefined class labels y
[Diagram: the classification model maps an input attribute set X (the predictors x) to an output class label y (the target)]
Classification To Predict A Class Label
• Once the classification model has been defined, the function f is available to map a
new item set of predictors to a class label
• You can predict the target variable for an unknown item set
[Diagram: the defined classification model maps a never-before-seen attribute set X (input) to the class label to be predicted (output)]
How to Build a Classification Model
[Diagram: the entire data set is split into a training set and a test set]
• Induction: a learning algorithm (decision trees, neural networks, support vector machines, …) learns the classification model from the training set
• Deduction: the model is applied to the test set to score new instances, and the predictions are evaluated

Training set:
Id | x1     | x2  | x3   | y
1  | Blue   | No  | 125K | Yes
2  | Red    | No  | 30K  | No
3  | Green  | Yes | 25K  | No
4  | Blue   | Yes | 110K | Yes
5  | Yellow | No  | 78K  | Yes
6  | Brown  | No  | 50K  | No
…  | …      | …   | …    | …

Test set:
Id  | x1    | x2  | x3  | y   | y scored
106 | Green | Yes | 43K | No  | ??
107 | Blue  | Yes | 35K | No  | ??
108 | Red   | Yes | 80K | Yes | ??
109 | Red   | No  | 70K | No  | ??
What is a Decision (Classification) Tree
• It’s a tree-shaped diagram used to break down complex problems
• Each leaf node is assigned a class label
• Non-terminal nodes (root + internal ones) contain attribute test conditions
to separate records having different characteristics
• Each branch is the result of a split and a possible outcome of a test
[Diagram: an example tree built from the wine-quality sample below, labeling the root node, the internal/decision nodes, the leaf nodes, a sub-tree, a splitting point, and a branch]

volatile acidity | free sulfur dioxide | alcohol | quality2
0.27 | 45 | 8.8  | High
0.3  | 14 | 9.5  | High
0.22 | 28 | 11   | High
0.27 | 11 | 12   | High
0.23 | 17 | 9.7  | High
0.18 | 16 | 10.8 | High
0.16 | 48 | 12.4 | High
0.42 | 41 | 9.7  | High
0.17 | 28 | 11.4 | High
0.48 | 30 | 9.6  | Low
0.66 | 29 | 12.8 | Low
0.34 | 17 | 11.3 | Low
0.31 | 34 | 9.5  | High
0.66 | 29 | 12.8 | Low
0.31 | 19 | 11   | High
Binary vs Multi-branch Trees
• Multi-branch
• Two or more branches leave each non-terminal node
• These branches cover all the outcomes of the test
• Exactly one branch enters each non-root node
• Binary
• Two branches leave each non-terminal node
• These two branches cover all the outcomes of the test
• Exactly one branch enters each non-root node
How to Make “Strategic” Splits
• Which attribute is the best one to segment the
instances?
• The one that generates groups that are as homogeneous as
possible with respect to the target variable
[Figure: three candidate groupings of instances with respect to the target variable: one pure, two impure]
• We need a purity measure
Purity Measure: Entropy
• Suppose we draw a ball from each of the three boxes below: how much knowledge do we have about its color?
1. We’ll know for sure the ball coming out is red
▪ High knowledge (for sure the ball will be red)
2. We know with 83% certainty that the ball is red; and 17% certainty it is blue
▪ Gives us some knowledge
3. We know with 50% certainty that the ball is red; and 50% certainty it is blue
▪ Gives us the least amount of knowledge
• Entropy is in some way the opposite of knowledge
• It’s a measure of disorder; it’s the degree of disorganization in a system
• How mixed (impure) the group is with respect to the target variable
High knowledge → low entropy; medium knowledge → medium entropy; low knowledge → high entropy
How to Calculate the Entropy
• We saw that knowledge is related in some way to the probability of a specific ball configuration
• How can we get the “opposite” of a probability (defined in the
range [0, 1])?
• The log function does the job! ☺
• ∀ x ∈ (0, 1] → log(x) ≤ 0
• The multi-class entropy formula is
Entropy(S) = $-\sum_{i=1}^{n} p_i \log_2 p_i$   (Shannon)
• pi is the probability that an object from the i-th class appears in the training set S
• The max value for Entropy is log2(num of classes)
• 2 classes → max = 1; 4 classes → max = 2; 8 classes → max = 3
• Using this formula for our boxes we have
• Entropy first box = - 1 log2(1) = 0
• Entropy second box = - 5/6 log2(5/6) - 1/6 log2(1/6) = 0.65
• Entropy third box = - 3/6 log2(3/6) - 3/6 log2(3/6) = - log2(1/2) = 1
[Plot: y = log2(x) for x in (0, 1], showing that log2(p) ≤ 0 for any probability p]
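• As a quick sanity check, here is a minimal R sketch of the Shannon entropy formula applied to the three boxes above (the helper name entropy is ours, not from any package):

```r
# Shannon entropy of a vector of class labels
entropy <- function(labels) {
  p <- table(labels) / length(labels)   # class proportions
  p <- p[p > 0]                         # convention: 0 * log2(0) = 0
  -sum(p * log2(p))
}

entropy(rep("red", 6))                        # box 1: 0 (pure)
entropy(c(rep("red", 5), "blue"))             # box 2: ~0.65
entropy(c(rep("red", 3), rep("blue", 3)))     # box 3: 1 (maximum for 2 classes)
```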
From Entropy to Information Gain
• How informative is an attribute with respect to our target?
• An attribute segments a set of instances into several subsets
• Entropy tells us how impure an individual subset is
• Information Gain (IG) measures how much an attribute decreases entropy over the
whole segmentation it creates
• The greater the IG, the more relevant an attribute is
$IG(A, S) = Entropy(S) - \sum_{j=1}^{k} p_j \cdot Entropy(S_j)$
• IG is a function of the parent (S) and
all the children sets (Sj):
• k is the number of the values of the attribute x
• pj is the probability that an instance from the training set S has attribute x with value cj
[Diagram: the parent set S is split by attribute x into child sets S1 (x = c1), S2 (x = c2), and S3 (x = c3); the entropy of each set is computed with respect to the target attribute]
Let’s Calculate the Information Gain
Entire population
(18 instances= 14 red + 4 blue)
Attribute = {c1,c2,c3}
Parent entropy = −14/18 · log2(14/18) − 4/18 · log2(4/18) = 0.764
Entropy1 = 0
Child 1
(6 instances = 6 red)
Child 2
(6 instances = 3 red + 3 blue)
Entropy2 = 1
Child 3
(6 instances = 5 red + 1 blue)
Entropy3 = −5/6 · log2(5/6) − 1/6 · log2(1/6) = 0.65
Attribute = c1
P(c1) = 6/18 = 0.33
Attribute = c2
P(c2) = 6/18 = 0.33
Attribute = c3
P(c3) = 6/18 = 0.33
IG = Parent entropy − P(c1) · Entropy1 − P(c2) · Entropy2 − P(c3) · Entropy3
= 0.764 − 0.33 · 0 − 0.33 · 1 − 0.33 · 0.65 = 0.219
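• The same calculation can be scripted in R; a minimal sketch is below, reusing the entropy() helper defined earlier (the info_gain name and the toy balls data frame are ours). With exact 1/3 weights the result is ≈ 0.214; the slide rounds each weight to 0.33 and gets 0.219.

```r
# Information Gain = parent entropy minus the weighted child entropies
info_gain <- function(data, attribute, target) {
  parent <- entropy(data[[target]])
  groups <- split(data[[target]], data[[attribute]])
  child  <- sum(sapply(groups, function(g) length(g) / nrow(data) * entropy(g)))
  parent - child
}

# The 18-ball example: six balls for each attribute value c1, c2, c3
balls <- data.frame(
  attr  = rep(c("c1", "c2", "c3"), each = 6),
  color = c(rep("red", 6),                     # c1: 6 red
            rep(c("red", "blue"), each = 3),   # c2: 3 red + 3 blue
            rep("red", 5), "blue")             # c3: 5 red + 1 blue
)
info_gain(balls, "attr", "color")              # ~0.214
```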
Information Gain for Numeric Attributes
• Not all the attributes are categorical. They may be numeric
• Each numeric value is a possible split point
• E.g. temperature ≤ 18; temperature > 18
• Split points can be placed between values or directly at values
• The solution is
• Evaluate Information Gain for every possible split point
• Choose the split point with the greatest IG
• The IG found is the one for the whole attribute
• Computationally more demanding
• Sort all the values
• Linear scan of all the values, updating the IG at each candidate split point
• Choose the split point with the best IG (see the sketch below)
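• A minimal sketch of that scan in R, reusing the entropy() helper from above (function and variable names are ours):

```r
# Best binary split of a numeric attribute x with respect to class labels y
best_numeric_split <- function(x, y) {
  ord <- order(x)
  x <- x[ord]; y <- y[ord]                          # sort once
  cuts <- unique((head(x, -1) + tail(x, -1)) / 2)   # midpoints between sorted values
  parent <- entropy(y)
  gains <- sapply(cuts, function(t) {
    left  <- y[x <= t]
    right <- y[x >  t]
    parent - length(left)  / length(y) * entropy(left) -
             length(right) / length(y) * entropy(right)
  })
  list(split_point = cuts[which.max(gains)], gain = max(gains))
}
```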
Attribute Selection and Information Gain
• For a dataset described by attributes and a target variable, Information Gain lets us
find the most informative attribute with respect to the target
• The attributes can be ranked by how informative they are
• The size of the data can be reduced by keeping only the most informative attributes
rowname attr_importance
alcohol 0.084971929
density 0.054802674
citric.acid 0.033315593
chlorides 0.03302314
volatile.acidity 0.026740526
total.sulfur.dioxide 0.025815848
free.sulfur.dioxide 0.019255954
residual.sugar 0.009900215
sulphates 0.008109667
pH 0.006643071
fixed.acidity 0.003609548
DEMO 1
Attribute Selection with Information Gain in R
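• A hedged sketch of what Demo 1 might look like: the attr_importance table above matches the output shape of FSelector::information.gain(); the data frame name wine and the target quality2 are assumptions taken from the earlier slides.

```r
library(FSelector)

# Rank every attribute by Information Gain with respect to the target
weights <- information.gain(quality2 ~ ., data = wine)
weights[order(-weights$attr_importance), , drop = FALSE]

# Keep only the k most informative attributes to reduce the data set
top_attrs  <- cutoff.k(weights, k = 5)
wine_small <- wine[, c(top_attrs, "quality2")]
```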
Tree Induction with Information Gain
• A divide-and-conquer (divide et impera) approach takes place
1. Apply attribute selection to the whole dataset
• The attribute with the highest IG is chosen
2. Subgroups are created
3. For each subgroup apply the attribute selection again
4. …and so on recursively
5. Stop
1. When the nodes are pure
2. When there are no more variables to split on
3. Just before 1 or 2 would happen, to avoid too many branches (overfitting)
• This is the basis of the ID3 algorithm (a minimal sketch follows below)
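• A minimal recursive sketch of this induction loop in R, for categorical attributes only, reusing the info_gain() helper defined earlier (the id3 function is ours and ignores pre-pruning):

```r
id3 <- function(data, target, attributes) {
  labels <- data[[target]]
  # Stop 1: the node is pure -> leaf labeled with that class
  if (length(unique(labels)) == 1) return(as.character(labels[1]))
  # Stop 2: no attributes left to split on -> leaf labeled with the majority class
  if (length(attributes) == 0) return(names(which.max(table(labels))))
  # Pick the attribute with the highest Information Gain
  gains <- sapply(attributes, function(a) info_gain(data, a, target))
  best  <- attributes[which.max(gains)]
  # Recurse on each subgroup induced by the chosen attribute
  children <- lapply(split(data, data[[best]], drop = TRUE),
                     function(s) id3(s, target, setdiff(attributes, best)))
  list(split_on = best, children = children)
}
```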
Highly-branching Attributes in ID3
• Attributes with a large number of values are a problem
• E.g. an ID or the Day attribute
• Many values → small subsets → purer subsets
• Information gain is biased toward choosing attributes with a large
number of values
• What are the issues?
• Data is fragmented into too many small sets
• Attribute chosen is non-optimal for prediction
• Tree is hard to read
• Numeric attributes have this problem too
• What’s the solution?
• Convert them into discrete intervals (discretization)
• Determining ideal intervals is not easy
Tree Induction with Gini Gain
• Instead of Entropy, the impurity measure can be expressed by the Gini Impurity
• pi is the probability that an object from the i-th class appears in the training set S
• The max value for Gini Impurity is 1 − 1/(num of classes), so it always stays below 1
• 2 classes → max = 0.5; 4 classes → max = 0.75; 8 classes → max = 0.875
• Gini Gain is the alternative to Information Gain for selecting an attribute A
• Sj is the partition of S induced by the attribute A
• k is the number of the values of the attribute A
• pj is the probability that an instance from the training set S has attribute A with value cj
• The tree induction is similar to the one that uses IG, but now it uses GG
• This is the basis of the CART algorithm
$Gini(S) = 1 - \sum_{i=1}^{n} p_i^2$
$GiniGain(A, S) = Gini(S) - Gini(A, S) = Gini(S) - \sum_{j=1}^{k} p_j \cdot Gini(S_j)$
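• The two formulas translate directly into R, mirroring the entropy-based helpers from before (gini and gini_gain are our names):

```r
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

gini_gain <- function(data, attribute, target) {
  groups <- split(data[[target]], data[[attribute]])
  gini(data[[target]]) -
    sum(sapply(groups, function(g) length(g) / nrow(data) * gini(g)))
}

gini_gain(balls, "attr", "color")   # the 18-ball example again: ~0.086
```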
Information Gain and Gini Gain Comparison
• Many impurity measures (Entropy,
Gini Impurity) are quite consistent
with each other
• The choice of impurity measure has
little effect on the performance of
decision tree induction algorithms
• Information Gain
• It favors smaller partitions with many
distinct values
• Gini Gain
• It favors larger partitions
• Split attributes may change
Misclassification Error = $1 - \max_i p_i$
ID3 Algorithm Evolution
• ID3 stands for Iterative Dichotomiser 3
• Introduced in 1986 by Ross Quinlan
• The algorithm is too simplistic
• It doesn’t allow numeric attributes
• It doesn’t allow missing values
• It’s successor  C4.5
• It is free
• It solves all the gaps in ID3
• C4.5 successor  C5.0
• Free GPL Edition stuck at release 2.07
• C50 package in R
• Commercial Edition (now at release 2.11)
• Multithreading
• Memory optimized
• Other new features (different costs per error types, …)
CART Algorithm
• CART stands for Classification and Regression Trees
• Introduced in 1984 by Breiman et al.
• Missing and numeric values are handled
• It’s free
• The R package is rpart (Recursive PARTitioning)
C5.0 vs CART Implementation Details
• C5.0
• It produces a binary or multi-branch tree
• It uses Information Gain to induce the tree
• Missing values are assigned according to the distribution
of values in the attribute
• CART
• The resulting models are only binary trees
• It uses Gini Gain to induce the tree
• It handles missing values with surrogate splits that
approximate the outcome of the test
DEMO 2
Tree Induction with C5.0 and CART in R
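• A hedged sketch of Demo 2, fitting both families of trees on the wine data (the wine data frame and the factor target quality2 are assumed names from the earlier slides):

```r
library(C50)    # C5.0: entropy-based, binary or multi-branch splits
library(rpart)  # CART: Gini-based, binary splits only

c50_fit  <- C5.0(quality2 ~ ., data = wine)                     # quality2 must be a factor
cart_fit <- rpart(quality2 ~ ., data = wine, method = "class")

summary(c50_fit)                  # tree, rules and attribute usage
plot(cart_fit); text(cart_fit)    # quick plot of the CART tree
```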
Overfitting and Underfitting
• The models seen in the demo explain all of the training data
• Trees are really complex
• They reduce the training set error
• Will they generalize well to new data?
• No! They are too tightly fitted to the training data (overfitting)
• The test set error increases
• How to avoid overfitting in decision trees?
• Pre-pruning
• Stop growing the tree earlier, before it perfectly classifies the training set
• It is not easy to precisely estimate when to stop growing the tree
• It can lead to underfitting: important structural information may not be captured
• Post-pruning
• Allows the tree to perfectly classify the training set, then prune the tree
• Usually more successful than pre-pruning
Is my Model Good or Bad?
• A way to present the prediction results of a
classifier is the confusion matrix
• It makes explicit how one class is being confused
for another
• We need some metrics to evaluate the goodness of the model
• Accuracy: the number of correct predictions made divided by the total number of
predictions made
• (TP + TN) / (TP + FP + FN + TN)
• Not a good metric for imbalanced classes → the Accuracy Paradox
• Precision: the number of true positive predictions divided by the total number of
positive predicted class values
• TP / (TP + FP)
• High precision relates to low False Positive rate
• Recall: the number of true positive predictions divided by the number of actual
positive class values
• TP / (TP + FN)
• Low recall indicates many False Negatives
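• The three metrics can be read straight off a 2x2 confusion matrix in base R; this sketch assumes factor vectors predicted and actual with the High/Low labels used in the wine example, High being the positive class:

```r
cm <- table(Predicted = predicted, Actual = actual)

TP <- cm["High", "High"]; FP <- cm["High", "Low"]
FN <- cm["Low",  "High"]; TN <- cm["Low",  "Low"]

accuracy  <- (TP + TN) / sum(cm)
precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)
```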
DEMO 3
Model Evaluation and Pruning in R
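• A hedged sketch of Demo 3, post-pruning the CART model from the previous sketch with rpart's complexity-parameter table and scoring it on a held-out set (cart_fit, wine_test and quality2 are assumed names):

```r
printcp(cart_fit)   # cross-validated error for each complexity parameter (cp)

# Prune back to the cp with the lowest cross-validated error
best_cp <- cart_fit$cptable[which.min(cart_fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(cart_fit, cp = best_cp)

# Evaluate the pruned tree on the test set
pred <- predict(pruned, newdata = wine_test, type = "class")
table(Predicted = pred, Actual = wine_test$quality2)
```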
Target Variable is Numeric? Regression Tree
• Two approaches
• Discretization of the target variable
• Then use a classification learning algorithm
• Adapt the classification algorithm to regression data
• Split trying to minimize the standard deviation in each subset Sj
• Use the Standard Deviation Reduction to induce the tree
$SDR(A, S) = SD(S) - SD(A, S) = SD(S) - \sum_{j=1}^{k} p_j \cdot SD(S_j)$
• Evaluation metrics will change for Regressions
• RMSE, MAE, etc.
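• A small R sketch of the Standard Deviation Reduction for one candidate attribute, in the same style as the gain helpers above (sdr is our name; rpart already grows regression trees when method = "anova" and the target is numeric):

```r
sdr <- function(data, attribute, target) {
  sd0 <- function(v) if (length(v) < 2) 0 else sd(v)   # single-row subsets contribute 0
  groups <- split(data[[target]], data[[attribute]])
  sd0(data[[target]]) -
    sum(sapply(groups, function(g) length(g) / nrow(data) * sd0(g)))
}

# With rpart, a numeric target is enough to get a regression tree, e.g.:
# reg_fit <- rpart(quality ~ ., data = wine, method = "anova")
```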
Conclusions
Technical Advantages of Decision Trees
• Little data cleaning required
• Robust to outliers
• Missing values handling
• They implicitly perform feature selection
• Nonlinear relationships are handled
• Classification and regression models are handled
Advantages of Decision Trees for Business
• Easy to understand
• Non-statistical background! ☺
• Important insights can be generated
• Important variables for a decision are automatically
emphasized through the process of developing the
tree
• They show the order in which decisions must be made
References
• Tree Based Modeling from Scratch (https://goo.gl/INo6bC)
• Shannon Entropy, Information Gain, and Picking Balls from Buckets (https://goo.gl/8JeJJm)
• Data Science for Business (https://goo.gl/8bSm26)
• How to Better Evaluate the Goodness-of-Fit of Regressions (https://goo.gl/K2dGnD)
• Wine Quality Data Set (https://goo.gl/akGXCm)