Decision Trees
The Machine Learning “Magic” Unveiled
Luca Zavarella
Sponsors
Who am I
Data Science Microsoft Professional Program
Microsoft SQL Server BI MCTS & MCITP
Working with SQL Server since 2007
Mentor & Technical Director @
Email: lzavarella@solidq.com
Twitter: @lucazav
LinkedIn: http://it.linkedin.com/in/lucazavarella
Agenda
• The Classification Problem
• What’s a Decision Tree
• Entropy
• Information Gain
• Gini Gain
• Tree Induction
• Overfitting and Pruning
• Model Evaluation
• Conclusions
Let’s start
The Classification Problem
Classifying something according to shared qualities or characteristics
• Detecting spam email messages based upon the message header and content
• Categorizing cells as malignant or benign based upon the results of Magnetic
Resonance Imaging
• Determining whether a banking transaction is fraudulent or not
In a formal way
• Classification is the task of inferring a target function f (the classification model)
that maps each attribute set x to one of the predefined class labels y
[Diagram: the classification model maps an input attribute set X (the predictors x) to an output class label y (the target)]
Classification To Predict A Class Label
• Once the classification model has been defined, the function f is available to map a
new item set of predictors to a class label
• You can predict the target variable for an unknown item set
[Diagram: the defined classification model maps a never-before-seen attribute set X (input) to the class label to be predicted (output)]
How to Build a Classification Model
[Diagram: the entire data set is split into a training set and a test set]
• Induction: a learning algorithm (decision trees, neural networks, support vector machines, …) learns the classification model from the training set
• Deduction: the model is applied to the test set to score new instances, and the predictions are evaluated

Training set:
Id | x1     | x2  | x3   | y
1  | Blue   | No  | 125K | Yes
2  | Red    | No  | 30K  | No
3  | Green  | Yes | 25K  | No
4  | Blue   | Yes | 110K | Yes
5  | Yellow | No  | 78K  | Yes
6  | Brown  | No  | 50K  | No
…  | …      | …   | …    | …

Test set:
Id  | x1    | x2  | x3  | y   | y scored
106 | Green | Yes | 43K | No  | ??
107 | Blue  | Yes | 35K | No  | ??
108 | Red   | Yes | 80K | Yes | ??
109 | Red   | No  | 70K | No  | ??
What is a Decision (Classification) Tree
• It’s a tree-shaped diagram used to break down complex problems
• Each leaf node is assigned a class label
• Non-terminal nodes (root + internal ones) contain attribute test conditions
to separate records having different characteristics
• Each branch is the result of a split and a possible outcome of a test
[Diagram: an example tree built from the wine-quality sample below, labeling the root node, the internal/decision nodes, the leaf nodes, a sub-tree, a splitting point, and a branch]

volatile acidity | free sulfur dioxide | alcohol | quality2
0.27 | 45 | 8.8  | High
0.3  | 14 | 9.5  | High
0.22 | 28 | 11   | High
0.27 | 11 | 12   | High
0.23 | 17 | 9.7  | High
0.18 | 16 | 10.8 | High
0.16 | 48 | 12.4 | High
0.42 | 41 | 9.7  | High
0.17 | 28 | 11.4 | High
0.48 | 30 | 9.6  | Low
0.66 | 29 | 12.8 | Low
0.34 | 17 | 11.3 | Low
0.31 | 34 | 9.5  | High
0.66 | 29 | 12.8 | Low
0.31 | 19 | 11   | High
Binary vs Multi-branch Trees
• Multi-branch
• Two or more branches leave each non-terminal node
• These branches cover all the outcomes of the test
• Exactly one branch enters each non-root node
• Binary
• Two branches leave each non-terminal node
• These two branches cover all the outcomes of the test
• Exactly one branch enters each non-root node
How to Make “Strategic” Splits
• Which attribute is the best one to segment the
instances?
• The one that generates groups that are as homogeneous as
possible with respect to the target variable
[Figure: three candidate groupings of instances with respect to the target variable: one pure, two impure]
• We need a purity measure
Purity Measure: Entropy
• Suppose we draw a ball from each of the three boxes below: how much knowledge do we have about its color?
1. We’ll know for sure the ball coming out is red
▪ High knowledge (for sure the ball will be red)
2. We know with 83% certainty that the ball is red; and 17% certainty it is blue
▪ Gives us some knowledge
3. We know with 50% certainty that the ball is red; and 50% certainty it is blue
▪ Gives us the least amount of knowledge
• Entropy is in some way the opposite of knowledge
• It’s a measure of disorder; it’s the degree of disorganization in a system
• How mixed (impure) the group is with respect to the target variable
High knowledge → low entropy; medium knowledge → medium entropy; low knowledge → high entropy
How to Calculate the Entropy
• We saw that knowledge is related in some way to the probability of a specific ball configuration
• How can we get the “opposite” of a probability (defined in the
range [0, 1])?
• The log function does the job! ☺
• ∀ x ∈ (0, 1] → log(x) ≤ 0
• The multi-class entropy formula is
Entropy(S) = $-\sum_{i=1}^{n} p_i \log_2 p_i$   (Shannon)
• pi is the probability that an object from the i-th class appears in the training set S
• The max value for Entropy is log2(num of classes)
• 2 classes → max = 1; 4 classes → max = 2; 8 classes → max = 3
• Using this formula for our boxes we have
• Entropy first box = - 1 log2(1) = 0
• Entropy second box = - 5/6 log2(5/6) - 1/6 log2(1/6) = 0.65
• Entropy third box = - 3/6 log2(3/6) - 3/6 log2(3/6) = - log2(1/2) = 1
[Plot: y = log2(x) for x in (0, 1], showing that log2(p) ≤ 0 for any probability p]
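• As a quick sanity check, here is a minimal R sketch of the Shannon entropy formula applied to the three boxes above (the helper name entropy is ours, not from any package):

```r
# Shannon entropy of a vector of class labels
entropy <- function(labels) {
  p <- table(labels) / length(labels)   # class proportions
  p <- p[p > 0]                         # convention: 0 * log2(0) = 0
  -sum(p * log2(p))
}

entropy(rep("red", 6))                        # box 1: 0 (pure)
entropy(c(rep("red", 5), "blue"))             # box 2: ~0.65
entropy(c(rep("red", 3), rep("blue", 3)))     # box 3: 1 (maximum for 2 classes)
```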
From Entropy to Information Gain
• How informative is an attribute with respect to our target?
• An attribute segments a set of instances into several subsets
• Entropy tells us how impure an individual subset is
• Information Gain (IG) measures how much an attribute decreases entropy over the
whole segmentation it creates
• The greater the IG, the more relevant an attribute is
$IG(A, S) = Entropy(S) - \sum_{j=1}^{k} p_j \cdot Entropy(S_j)$
• IG is a function of the parent (S) and
all the children sets (Sj):
• k is the number of the values of the attribute x
• pj is the probability that an instance from the training set S has attribute x with value cj
[Diagram: the parent set S is split by attribute x into child sets S1 (x = c1), S2 (x = c2), and S3 (x = c3); the entropy of each set is computed with respect to the target attribute]
Let’s Calculate the Information Gain
Entire population
(18 instances= 14 red + 4 blue)
Attribute = {c1,c2,c3}
Parent entropy = −14/18 · log2(14/18) − 4/18 · log2(4/18) = 0.764
Entropy1 = 0
Child 1
(6 instances = 6 red)
Child 2
(6 instances = 3 red + 3 blue)
Entropy2 = 1
Child 3
(6 instances = 5 red + 1 blue)
Entropy3 = −5/6 · log2(5/6) − 1/6 · log2(1/6) = 0.65
Attribute = c1
P(c1) = 6/18 = 0.33
Attribute = c2
P(c2) = 6/18 = 0.33
Attribute = c3
P(c3) = 6/18 = 0.33
IG = Parent entropy − P(c1) · Entropy1 − P(c2) · Entropy2 − P(c3) · Entropy3
= 0.764 − 0.33 · 0 − 0.33 · 1 − 0.33 · 0.65 = 0.219
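• The same calculation can be scripted in R; a minimal sketch is below, reusing the entropy() helper defined earlier (the info_gain name and the toy balls data frame are ours). With exact 1/3 weights the result is ≈ 0.214; the slide rounds each weight to 0.33 and gets 0.219.

```r
# Information Gain = parent entropy minus the weighted child entropies
info_gain <- function(data, attribute, target) {
  parent <- entropy(data[[target]])
  groups <- split(data[[target]], data[[attribute]])
  child  <- sum(sapply(groups, function(g) length(g) / nrow(data) * entropy(g)))
  parent - child
}

# The 18-ball example: six balls for each attribute value c1, c2, c3
balls <- data.frame(
  attr  = rep(c("c1", "c2", "c3"), each = 6),
  color = c(rep("red", 6),                     # c1: 6 red
            rep(c("red", "blue"), each = 3),   # c2: 3 red + 3 blue
            rep("red", 5), "blue")             # c3: 5 red + 1 blue
)
info_gain(balls, "attr", "color")              # ~0.214
```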
Information Gain for Numeric Attributes
• Not all the attributes are categorical. They may be numeric
• Each numeric value is a possible split point
• E.g. temperature ≤ 18; temperature > 18
• Split points can be placed between values or directly at values
• The solution is
• Evaluate Information Gain for every possible split point
• Choose the split point with the greatest IG
• The IG found is the one for the whole attribute
• Computationally more demanding
• Sort all the values
• Linear scan of all the values, updating the IG at each candidate split point
• Choose the split point with the best IG (see the sketch below)
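• A minimal sketch of that scan in R, reusing the entropy() helper from above (function and variable names are ours):

```r
# Best binary split of a numeric attribute x with respect to class labels y
best_numeric_split <- function(x, y) {
  ord <- order(x)
  x <- x[ord]; y <- y[ord]                          # sort once
  cuts <- unique((head(x, -1) + tail(x, -1)) / 2)   # midpoints between sorted values
  parent <- entropy(y)
  gains <- sapply(cuts, function(t) {
    left  <- y[x <= t]
    right <- y[x >  t]
    parent - length(left)  / length(y) * entropy(left) -
             length(right) / length(y) * entropy(right)
  })
  list(split_point = cuts[which.max(gains)], gain = max(gains))
}
```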
Attribute Selection and Information Gain
• For a dataset described by attributes and a target variable, Information Gain lets us
find the most informative attribute with respect to the target
• The attributes can be ranked by how informative they are
• The size of the data can be reduced by keeping only the most informative attributes
rowname attr_importance
alcohol 0.084971929
density 0.054802674
citric.acid 0.033315593
chlorides 0.03302314
volatile.acidity 0.026740526
total.sulfur.dioxide 0.025815848
free.sulfur.dioxide 0.019255954
residual.sugar 0.009900215
sulphates 0.008109667
pH 0.006643071
fixed.acidity 0.003609548
DEMO 1
Attribute Selection with Information Gain in R
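• A hedged sketch of what Demo 1 might look like: the attr_importance table above matches the output shape of FSelector::information.gain(); the data frame name wine and the target quality2 are assumptions taken from the earlier slides.

```r
library(FSelector)

# Rank every attribute by Information Gain with respect to the target
weights <- information.gain(quality2 ~ ., data = wine)
weights[order(-weights$attr_importance), , drop = FALSE]

# Keep only the k most informative attributes to reduce the data set
top_attrs  <- cutoff.k(weights, k = 5)
wine_small <- wine[, c(top_attrs, "quality2")]
```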
Tree Induction with Information Gain
• A divide-and-conquer (divide et impera) approach takes place
1. Apply attribute selection to the whole dataset
• The attribute with the highest IG is chosen
2. Subgroups are created
3. For each subgroup apply the attribute selection again
4. …and so on recursively
5. Stop
1. When the nodes are pure
2. When there are no more variables to split on
3. Just before 1 or 2 would happen, to avoid too many branches (overfitting)
• This is the basis of the ID3 algorithm (a minimal sketch follows below)
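• A minimal recursive sketch of this induction loop in R, for categorical attributes only, reusing the info_gain() helper defined earlier (the id3 function is ours and ignores pre-pruning):

```r
id3 <- function(data, target, attributes) {
  labels <- data[[target]]
  # Stop 1: the node is pure -> leaf labeled with that class
  if (length(unique(labels)) == 1) return(as.character(labels[1]))
  # Stop 2: no attributes left to split on -> leaf labeled with the majority class
  if (length(attributes) == 0) return(names(which.max(table(labels))))
  # Pick the attribute with the highest Information Gain
  gains <- sapply(attributes, function(a) info_gain(data, a, target))
  best  <- attributes[which.max(gains)]
  # Recurse on each subgroup induced by the chosen attribute
  children <- lapply(split(data, data[[best]], drop = TRUE),
                     function(s) id3(s, target, setdiff(attributes, best)))
  list(split_on = best, children = children)
}
```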
Highly-branching Attributes in ID3
• Attributes with a large number of values are a problem
• E.g. an ID or the Day attribute
• Many values → small subsets → purer subsets
• Information gain is biased toward choosing attributes with a large
number of values
• What are the issues?
• Data is fragmented into too many small sets
• Attribute chosen is non-optimal for prediction
• Tree is hard to read
• Numeric attributes have this problem too
• What’s the solution?
• Convert them into discrete intervals (discretization)
• Determining ideal intervals is not easy
Tree Induction with Gini Gain
• Instead of Entropy, the impurity measure can be expressed by the Gini Impurity
• pi is the probability that an object from the i-th class appears in the training set S
• The max value for Gini Impurity is 1 − 1/(num of classes), so it always stays below 1
• 2 classes → max = 0.5; 4 classes → max = 0.75; 8 classes → max = 0.875
• Gini Gain is the alternative to Information Gain for selecting an attribute A
• Sj is the partition of S induced by the attribute A
• k is the number of the values of the attribute A
• pj is the probability that an instance from the training set S has attribute A with value cj
• The tree induction is similar to the one that uses IG, but now it uses GG
• This is the basis of the CART algorithm
$Gini(S) = 1 - \sum_{i=1}^{n} p_i^2$
$GiniGain(A, S) = Gini(S) - Gini(A, S) = Gini(S) - \sum_{j=1}^{k} p_j \cdot Gini(S_j)$
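• The two formulas translate directly into R, mirroring the entropy-based helpers from before (gini and gini_gain are our names):

```r
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

gini_gain <- function(data, attribute, target) {
  groups <- split(data[[target]], data[[attribute]])
  gini(data[[target]]) -
    sum(sapply(groups, function(g) length(g) / nrow(data) * gini(g)))
}

gini_gain(balls, "attr", "color")   # the 18-ball example again: ~0.086
```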
Information Gain and Gini Gain Comparison
• Many impurity measures (Entropy,
Gini Impurity) are quite consistent
with each other
• The choice of impurity measure has
little effect on the performance of
decision tree induction algorithms
• Information Gain
• It favors smaller partitions with many
distinct values
• Gini Gain
• It favors larger partitions
• Split attributes may change
Misclassification Error = $1 - \max_i p_i$
ID3 Algorithm Evolution
• ID3 stands for Iterative Dichotomiser 3
• Introduced in 1986 by Ross Quinlan
• The algorithm is too simplistic
• It doesn’t allow numeric attributes
• It doesn’t allow missing values
• It’s successor  C4.5
• It is free
• It solves all the gaps in ID3
• C4.5 successor  C5.0
• Free GPL Edition stuck at release 2.07
• C50 package in R
• Commercial Edition (now at release 2.11)
• Multithreading
• Memory optimized
• Other new features (different costs per error types, …)
CART Algorithm
• CART stands for Classification and Regression Trees
• Introduced in 1984 by Breiman et al.
• Missing and numeric values are handled
• It’s free
• The R package is rpart (Recursive PARTitioning)
C5.0 vs CART Implementation Details
• C5.0
• It produces a binary or multi-branch tree
• It uses Information Gain to induce the tree
• Missing values are assigned according to the distribution
of values in the attribute
• CART
• The resulting models are only binary trees
• It uses Gini Gain to induce the tree
• It handles missing values with surrogate splits that
approximate the outcome of the test
DEMO 2
Tree Induction with C5.0 and CART in R
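• A hedged sketch of Demo 2, fitting both families of trees on the wine data (the wine data frame and the factor target quality2 are assumed names from the earlier slides):

```r
library(C50)    # C5.0: entropy-based, binary or multi-branch splits
library(rpart)  # CART: Gini-based, binary splits only

c50_fit  <- C5.0(quality2 ~ ., data = wine)                     # quality2 must be a factor
cart_fit <- rpart(quality2 ~ ., data = wine, method = "class")

summary(c50_fit)                  # tree, rules and attribute usage
plot(cart_fit); text(cart_fit)    # quick plot of the CART tree
```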
Overfitting and Underfitting
• The models seen in the demo explain all of the training data
• Trees are really complex
• They reduce the training set error
• Will they generalize well to new data?
• No! They are too tightly fitted to the training data (overfitting)
• The test set error increases
• How to avoid overfitting in decision trees?
• Pre-pruning
• Stop growing the tree earlier, before it perfectly classifies the training set
• It is not easy to precisely estimate when to stop growing the tree
• It can lead to underfitting: important structural information may not be captured
• Post-pruning
• Allows the tree to perfectly classify the training set, then prune the tree
• Usually more successful than pre-pruning
Is my Model Good or Bad?
• A way to present the prediction results of a
classifier is the confusion matrix
• It makes explicit how one class is being confused
for another
• We need some metrics to evaluate the goodness of the model
• Accuracy: the number of correct predictions made divided by the total number of
predictions made
• (TP + TN) / (TP + FP + FN + TN)
• Not a good metric for imbalanced classes → the Accuracy Paradox
• Precision: the number of true positive predictions divided by the total number of
positive predicted class values
• TP / (TP + FP)
• High precision relates to low False Positive rate
• Recall: the number of true positive predictions divided by the number of actual
positive class values
• TP / (TP + FN)
• Low recall indicates many False Negatives
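• The three metrics can be read straight off a 2x2 confusion matrix in base R; this sketch assumes factor vectors predicted and actual with the High/Low labels used in the wine example, High being the positive class:

```r
cm <- table(Predicted = predicted, Actual = actual)

TP <- cm["High", "High"]; FP <- cm["High", "Low"]
FN <- cm["Low",  "High"]; TN <- cm["Low",  "Low"]

accuracy  <- (TP + TN) / sum(cm)
precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)
```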
DEMO 3
Model Evaluation and Pruning in R
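• A hedged sketch of Demo 3, post-pruning the CART model from the previous sketch with rpart's complexity-parameter table and scoring it on a held-out set (cart_fit, wine_test and quality2 are assumed names):

```r
printcp(cart_fit)   # cross-validated error for each complexity parameter (cp)

# Prune back to the cp with the lowest cross-validated error
best_cp <- cart_fit$cptable[which.min(cart_fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(cart_fit, cp = best_cp)

# Evaluate the pruned tree on the test set
pred <- predict(pruned, newdata = wine_test, type = "class")
table(Predicted = pred, Actual = wine_test$quality2)
```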
Target Variable is Numeric? Regression Tree
• Two approaches
• Discretization of the target variable
• Then use a classification learning algorithm
• Adapt the classification algorithm to regression data
• Split trying to minimize the standard deviation in each subset Sj
• Use the Standard Deviation Reduction to induce the tree
$SDR(A, S) = SD(S) - SD(A, S) = SD(S) - \sum_{j=1}^{k} p_j \cdot SD(S_j)$
• Evaluation metrics will change for Regressions
• RMSE, MAE, etc.
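• A small R sketch of the Standard Deviation Reduction for one candidate attribute, in the same style as the gain helpers above (sdr is our name; rpart already grows regression trees when method = "anova" and the target is numeric):

```r
sdr <- function(data, attribute, target) {
  sd0 <- function(v) if (length(v) < 2) 0 else sd(v)   # single-row subsets contribute 0
  groups <- split(data[[target]], data[[attribute]])
  sd0(data[[target]]) -
    sum(sapply(groups, function(g) length(g) / nrow(data) * sd0(g)))
}

# With rpart, a numeric target is enough to get a regression tree, e.g.:
# reg_fit <- rpart(quality ~ ., data = wine, method = "anova")
```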
Conclusions
Technical Advantages of Decision Trees
• Little data cleaning required
• Robust to outliers
• Missing values handling
• They implicitly perform feature selection
• Nonlinear relationships are handled
• Classification and regression models are handled
Advantages of Decision Trees for Business
• Easy to understand
• Non-statistical background! ☺
• Important insights can be generated
• Important variables for a decision are automatically
emphasized through the process of developing the
tree
• They show the order in which decisions must be made
References
• Tree Based Modeling from Scratch (https://goo.gl/INo6bC)
• Shannon Entropy, Information Gain, and Picking Balls from Buckets (https://goo.gl/8JeJJm)
• Data Science for Business (https://goo.gl/8bSm26)
• How to Better Evaluate the Goodness-of-Fit of Regressions (https://goo.gl/K2dGnD)
• Wine Quality Data Set (https://goo.gl/akGXCm)