BAS 250
Lesson 5: Decision Trees
• Explain what decision trees are, how they are used, and the
benefits of using them
• Describe the best format for data in order to perform predictive
decision tree mining
• Interpret a visual tree’s nodes and leaves
• Explain the use of different algorithms in order to increase the
granularity of the tree’s detail
This Week’s Learning Objectives
 What is a Decision Tree
 Sample Decision Trees
 How to Construct a Decision Tree
 Problems with Decision Trees
 Summary
Overview
• Decision trees are excellent predictive models when the target attribute is categorical in
nature and when the data set is of mixed data types
• Compared to more numerically-based approaches, decision trees are better at handling attributes that
have missing or inconsistent values; decision trees will work around
such data and still generate usable results
• Decision trees are made of nodes and leaves to represent the best predictor attributes in
a data set
• Decision trees tell the user what is predicted, how confident that prediction can be, and
how we arrived at said prediction
Overview
An example of a Decision Tree developed in RapidMiner
Decision Trees
• Nodes are circular or oval shapes that represent
attributes which serve as good predictors for the label
attribute
• Leaves are end points that demonstrate the
distribution of categories from the label attribute that
follow the branch of the tree to the point of that leaf
Decision Trees
An example of meta data for playing golf based on a decision tree
Decision Trees
 An inductive learning task
o Use particular facts to make more generalized conclusions
 A predictive model based on a branching series of
Boolean tests
o These smaller Boolean tests are less complex than a one-
stage classifier
 Let’s look at a sample decision tree…
What is a Decision Tree?
Predicting Commute Time
[Decision tree diagram: the root node "Leave At" branches on 8 AM, 9 AM, and 10 AM. The 8 AM branch leads to a Long leaf; the 9 AM branch leads to an "Accident?" node (No → Medium, Yes → Long); the 10 AM branch leads to a "Stall?" node (No → Short, Yes → Long).]
If we leave at 10 AM and
there are no cars stalled
on the road, what will our
commute time be?
 In this decision tree, we made a series of Boolean
decisions and followed the corresponding branch
o Did we leave at 10 AM?
o Did a car stall on the road?
o Is there an accident on the road?
 By answering each of these yes/no questions, we
then came to a conclusion on how long our commute
might take
Inductive Learning
We did not have to represent this tree graphically
We could have represented it as a set of rules.
However, this may be much harder to read…
Decision Trees as Rules
if hour == 8am
    commute time = long
else if hour == 9am
    if accident == yes
        commute time = long
    else
        commute time = medium
else if hour == 10am
    if stall == yes
        commute time = long
    else
        commute time = short
Decision Tree as a Rule Set
• Notice that not all attributes
have to be used in each
path of the decision tree.
• As we will see, all attributes
may not even appear in the
tree.
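For reference, the same rule set can be written as a small function. This is a sketch in Python (an assumption for illustration; the course itself builds trees in RapidMiner), and the name predict_commute is purely made up for this example:

def predict_commute(hour, accident="no", stall="no"):
    # Follow the rule set above from the root of the tree down to a leaf
    if hour == "8am":
        return "long"
    elif hour == "9am":
        return "long" if accident == "yes" else "medium"
    elif hour == "10am":
        return "long" if stall == "yes" else "short"
    return None  # an hour outside the three discrete values the tree knows about

print(predict_commute("10am", stall="no"))  # -> short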
1. We first make a list of attributes that we can measure
 These attributes (for now) must be discrete
2. We then choose a target attribute that we want to predict
3. Then create an experience table that lists what we have
seen in the past
How to Create a Decision Tree
Example   Hour   Weather   Accident   Stall   Commute (Target)
D1 8 AM Sunny No No Long
D2 8 AM Cloudy No Yes Long
D3 10 AM Sunny No No Short
D4 9 AM Rainy Yes No Long
D5 9 AM Sunny Yes Yes Long
D6 10 AM Sunny No No Short
D7 10 AM Cloudy No No Short
D8 9 AM Rainy No No Medium
D9 9 AM Sunny Yes No Long
D10 10 AM Cloudy Yes Yes Long
D11 10 AM Rainy No No Short
D12 8 AM Cloudy Yes No Long
D13 9 AM Sunny No No Medium
Sample Experience Table
The previous experience table had 4 attributes:
1. Hour
2. Weather
3. Accident
4. Stall
But the decision tree only showed 3 attributes:
1. Hour
2. Accident
3. Stall
Why?
Choosing Attributes
 Methods for selecting attributes show that weather is
not a discriminating attribute
 We use the principle of Occam’s Razor: Given a
number of competing hypotheses, the simplest one
is preferable
Choosing Attributes
 The basic structure of creating a decision tree is
the same for most decision tree algorithms
 The difference lies in how we select the attributes
for the tree
 We will focus on the ID3 algorithm developed by
Ross Quinlan in 1975
Choosing Attributes
 The basic idea behind any decision tree algorithm is as
follows:
o Choose the best attribute(s) to split the remaining instances and make
that attribute a decision node
o Repeat this process recursively for each child
o Stop when:
 All the instances have the same target attribute value
 There are no more attributes
 There are no more instances
Decision Tree Algorithms
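A minimal recursive sketch of this generic procedure, assuming Python (not part of the slides); the attribute selection here is just a placeholder, since ID3's entropy-based choice is covered on the following slides:

from collections import Counter

def build_tree(rows, attributes, target):
    # Stop: there are no more instances
    if not rows:
        return None
    labels = [r[target] for r in rows]
    # Stop: all the instances have the same target attribute value
    if len(set(labels)) == 1:
        return labels[0]
    # Stop: there are no more attributes; predict the majority label
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Choose an attribute to split on and make it a decision node
    # (placeholder: first remaining attribute; ID3 would choose by entropy)
    best = attributes[0]
    node = {best: {}}
    for value in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == value]
        remaining = [a for a in attributes if a != best]
        node[best][value] = build_tree(subset, remaining, target)
    return node

rows = [{"Hour": "8 AM", "Commute": "Long"}, {"Hour": "10 AM", "Commute": "Short"}]
print(build_tree(rows, ["Hour"], "Commute"))  # e.g. {'Hour': {'8 AM': 'Long', '10 AM': 'Short'}}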
Original decision tree
Identifying the Best Attributes
[The original commute-time decision tree, repeated: "Leave At" at the root, with 8 AM → Long, 9 AM → "Accident?" (No → Medium, Yes → Long), and 10 AM → "Stall?" (No → Short, Yes → Long).]
How did we know to split on Leave At, and then on Stall and
Accident, and not on Weather?
 To determine the best attribute, we look at the
ID3 heuristic
 ID3 splits attributes based on their entropy.
Entropy is a measure of disorder (uncertainty) in the data…
ID3 Heuristic
 Entropy is minimized when all values of the target
attribute are the same
o If we know that commute time will always be short, then entropy = 0
 Entropy is maximized when there is an equal chance
of all values for the target attribute (i.e. the result is
random)
o If commute time = short in 3 instances, medium in 3 instances and long
in 3 instances, entropy is maximized
Entropy
 Calculation of entropy
o Entropy(S) = -∑(i = 1 to l) |Si|/|S| * log2(|Si|/|S|)
 S = set of examples
 Si = subset of S with value vi under the target attribute
 l = size of the range of the target attribute
Entropy
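A quick sketch of this calculation in Python (an assumption for illustration; the slides themselves use RapidMiner), using the label counts from the commute-time experience table (7 Long, 4 Short, 2 Medium):

from math import log2

def entropy(counts):
    # Entropy of a target attribute, given the count of each of its values
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

print(entropy([7, 4, 2]))  # commute-time target: about 1.4196
print(entropy([3, 3, 3]))  # equal chance of each value: log2(3), about 1.585 (maximized)
# If every instance had the same value, e.g. entropy([13, 0, 0]), the result is 0 (minimized)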
 ID3 splits on attributes with the lowest entropy
 We calculate the entropy for all values of an attribute
as the weighted sum of subset entropies as follows:
o ∑(i = 1 to k) |Si|/|S| Entropy(Si), where k is the size of the
range of the attribute we are testing
 We can also measure information gain (which is highest when
the weighted subset entropy is lowest) as follows:
o Entropy(S) - ∑(i = 1 to k) |Si|/|S| Entropy(Si)
ID3
Attribute Expected Entropy Information Gain
Hour 0.6511 0.768449
Weather 1.28884 0.130719
Accident 0.92307 0.496479
Stall 1.17071 0.248842
ID3
Given our commute time sample set, we can calculate
the entropy of each attribute at the root node
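The numbers in this table can be reproduced with a short script, assuming Python for illustration; the row values come straight from the experience table, and helper names such as expected_entropy are made up for this sketch:

from collections import Counter
from math import log2

# Rows D1-D13 from the experience table: (Hour, Weather, Accident, Stall, Commute)
rows = [
    ("8 AM", "Sunny", "No", "No", "Long"),    ("8 AM", "Cloudy", "No", "Yes", "Long"),
    ("10 AM", "Sunny", "No", "No", "Short"),  ("9 AM", "Rainy", "Yes", "No", "Long"),
    ("9 AM", "Sunny", "Yes", "Yes", "Long"),  ("10 AM", "Sunny", "No", "No", "Short"),
    ("10 AM", "Cloudy", "No", "No", "Short"), ("9 AM", "Rainy", "No", "No", "Medium"),
    ("9 AM", "Sunny", "Yes", "No", "Long"),   ("10 AM", "Cloudy", "Yes", "Yes", "Long"),
    ("10 AM", "Rainy", "No", "No", "Short"),  ("8 AM", "Cloudy", "Yes", "No", "Long"),
    ("9 AM", "Sunny", "No", "No", "Medium"),
]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def expected_entropy(rows, idx):
    # Weighted sum of subset entropies for the attribute in column idx
    groups = {}
    for r in rows:
        groups.setdefault(r[idx], []).append(r[-1])
    return sum(len(g) / len(rows) * entropy(g) for g in groups.values())

base = entropy([r[-1] for r in rows])  # entropy at the root, about 1.4196
for idx, name in enumerate(["Hour", "Weather", "Accident", "Stall"]):
    exp = expected_entropy(rows, idx)
    print(f"{name:10s} expected entropy {exp:.5f}   information gain {base - exp:.6f}")

Hour has the lowest expected entropy (highest information gain), which is why the sample tree splits on Leave At at the root.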
 There is another technique for reducing the
number of attributes used in a tree – pruning
 Two types of pruning:
o Pre-pruning (forward pruning)
o Post-pruning (backward pruning)
Pruning Trees
 In prepruning, we decide during the building process
when to stop adding attributes (possibly based on their
information gain)
 However, this may be problematic – Why?
o Sometimes attributes individually do not contribute much to a
decision, but combined, they may have a significant impact
Prepruning
 Postpruning waits until the full decision tree
has been built and then prunes attributes from it
 Two techniques:
o Subtree Replacement
o Subtree Raising
Postpruning
Entire subtree is replaced by a single leaf node
Subtree Replacement
[Diagram: a tree rooted at A; A's child B has children C, 4, and 5; node C has leaves 1, 2, and 3.]
• Node 6 replaced
the subtree
• Generalizes tree
a little more, but
may increase
accuracy
Subtree Replacement
[Diagram: after replacement, subtree C and its leaves are collapsed into a single new leaf 6, so B's children are now 6, 4, and 5.]
Entire subtree is raised onto another node
Subtree Raising
[Diagram: the same tree as before: root A, whose child B has children C, 4, and 5; node C has leaves 1, 2, and 3.]
Entire subtree is raised onto another node
We will NOT be using Subtree Raising in this course!
Subtree Raising
[Diagram: after raising, subtree C (with leaves 1, 2, and 3) replaces B as A's child; leaves 4 and 5 are dropped.]
 ID3 is not optimal
o Uses expected entropy reduction, not actual reduction
 Must use discrete (or discretized) attributes
o What if we left for work at 9:30 AM?
o We could break down the attributes into smaller
values…
Problems with ID3
If we broke down leave time to the minute, we
might get something like this:
Problems with ID3
[Diagram: one branch per individual leave time (8:02 AM, 8:03 AM, 9:05 AM, 9:07 AM, 9:09 AM, 10:02 AM), each ending in its own single leaf labeled Long, Medium, or Short.]
Since entropy is very low for each branch, we have n branches
with n leaves. This would not be helpful for predictive modeling.
 We can use a technique known as discretization
 We choose cut points, such as 9 AM, for splitting
continuous attributes
 These cut points generally lie in a subset of boundary
points, such that a boundary point is where two adjacent
instances in a sorted list have different target attribute
values
Problems with ID3
Consider the attribute commute time
Problems with ID3
8:00 (L), 8:02 (L), 8:07 (M), 9:00 (S), 9:20 (S), 9:25 (S), 10:00 (S), 10:02 (M)
When we split at these cut points, the entropy of each
branch increases slightly, but we avoid a decision tree
with as many leaves as distinct values
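A small sketch of locating those boundary points for the sorted list above, assuming Python for illustration (a candidate cut point falls between two adjacent instances whose target values differ):

times = ["8:00", "8:02", "8:07", "9:00", "9:20", "9:25", "10:00", "10:02"]
labels = ["L", "L", "M", "S", "S", "S", "S", "M"]

# A boundary lies between two adjacent instances with different target values
boundaries = [(times[i], times[i + 1])
              for i in range(len(times) - 1)
              if labels[i] != labels[i + 1]]
print(boundaries)  # [('8:02', '8:07'), ('8:07', '9:00'), ('10:00', '10:02')]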
 While decision trees classify quickly, the time for
building a tree may be higher than for other types of
classifiers
 Decision trees suffer from a problem of errors
propagating throughout a tree
 A very serious problem as the number of classes
increases
Problems with Decision Trees
 Since decision trees work by a series of local
decisions, what happens when one of these
local decisions is wrong?
o Every decision from that point on may be wrong
o We may never return to the correct path of the
tree
Error Propagation
 Decision trees can be used to help predict the
future
 The trees are easy to understand
 Decision trees work more efficiently with discrete
attributes
 The trees may suffer from error propagation
Summary
“This workforce solution was funded by a grant awarded by the U.S. Department of Labor’s
Employment and Training Administration. The solution was created by the grantee and does not
necessarily reflect the official position of the U.S. Department of Labor. The Department of Labor
makes no guarantees, warranties, or assurances of any kind, express or implied, with respect to such
information, including any information on linked sites and including, but not limited to, accuracy of the
information or its completeness, timeliness, usefulness, adequacy, continued availability, or ownership.”
Except where otherwise stated, this work by Wake Technical Community College Building Capacity in
Business Analytics, a Department of Labor, TAACCCT funded project, is licensed under the Creative
Commons Attribution 4.0 International License. To view a copy of this license, visit
http://creativecommons.org/licenses/by/4.0/
Copyright Information