Decision Trees
Varun Jain
● DECISION TREES
● BUILDING DECISION TREES
● SPLITTING METRICS
● PREVENTING OVERFITTING
● STRENGTHS AND LIMITATIONS
STEPS
DECISION TREE CLASSIFIERS
Q: What is a decision tree classifier?
A: A non-parametric, hierarchical classification technique.
Non-parametric: no fixed number of parameters and no assumptions about the data distribution.
Hierarchical: a sequence of questions (test conditions) that yields a class label when applied to any record.
Q: How is a decision tree represented?
A: Using a configuration of nodes and edges.
Nodes represent questions (test conditions)
Edges are the answers to these questions.
DECISION TREE CLASSIFIERS
source: http://www-users.cs.umn.edu/~kumar/dmbook/ch4.pdf
EXAMPLE DATA
EXAMPLE – DECISION TREE
source: http://www-users.cs.umn.edu/~kumar/dmbook/ch4.pdf
TYPES OF NODES
Top node of the tree: root node.
• 0 incoming edges, 2+ outgoing edges.
An internal node:
• 1 incoming edge, 2+ outgoing edges.
• Represents a test condition on the features (an if-statement).
A leaf node:
• 1 incoming edge, 0 outgoing edges.
• Corresponds to a decision on the class label.
NOTE
Internal nodes
represent test
conditions which
partition the records
at that node.
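To make the if-statement view concrete, here is a minimal sketch of a small decision tree written as plain nested conditionals in Python. The feature names and the threshold are hypothetical, loosely modeled on the loan/cheat example in the Tan, Steinbach & Kumar chapter linked above.

```python
def classify(record):
    """A hypothetical hand-written decision tree: each if-statement is an internal
    node (test condition) and each return is a leaf node (class label)."""
    if record["refund"] == "yes":                  # root node test
        return "no cheat"                          # leaf
    if record["marital_status"] == "married":      # internal node
        return "no cheat"                          # leaf
    if record["taxable_income"] < 80_000:          # internal node (continuous test)
        return "no cheat"                          # leaf
    return "cheat"                                 # leaf


print(classify({"refund": "no", "marital_status": "single", "taxable_income": 95_000}))
# -> cheat
```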
BUILDING A DECISION TREE
Q: How do we build a decision tree?
A: Evaluate all possible decision trees (e.g., all permutations of features) for
a given dataset?
No! The number of candidate trees grows combinatorially, so exhaustive search is far too complex and impractical.
BUILDING
DECISION TREES
BUILDING A DECISION TREE
This is a greedy recursive algorithm
that leads to a local optimum.
greedy – the algorithm makes the locally
optimal decision at each step.
recursive – splits task into subtasks,
solves each the same way
local optimum – solution for a given
neighborhood of points
BUILDING A DECISION TREE
Build a decision tree by recursively
splitting records into smaller &
smaller subsets, or splits.
The splitting decision is made at
each node according to some metric
representing purity.
A partition is 100% pure when all
of its records belong to a single
class.
Binary classification problem with classes X and Y. Given the set of records Dt at node t:
1. If all records in Dt belong to a single class (X or Y): t is a leaf node labeled with that class.
2. If Dt contains a mix of classes: split the records into child nodes based on the value of some feature(s).
   ● t becomes an internal node whose outgoing edges correspond to the possible values of the chosen splitting feature(s).
   ● Each outgoing edge terminates in a child node.
   ● A record d is assigned to a child node based on the value of the splitting feature(s) for d.
3. Recursively apply steps 1 & 2 to each child node (a code sketch of this procedure follows below).
BUILDING A DECISION TREE
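Below is a minimal Python sketch of the recursive procedure above, assuming records are dicts with a "label" key and purely categorical features. The greedy choice uses information gain (defined in the splitting-metrics section); this is illustrative only, not a production implementation.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_feature(records, features):
    """Greedy step: pick the categorical feature with the highest information gain."""
    parent = entropy([r["label"] for r in records])
    def gain(f):
        weighted = 0.0
        for value in {r[f] for r in records}:
            subset = [r["label"] for r in records if r[f] == value]
            weighted += len(subset) / len(records) * entropy(subset)
        return parent - weighted
    return max(features, key=gain)

def build_tree(records, features, depth=0, max_depth=5):
    """Recursively split records into purer and purer subsets (Hunt-style sketch)."""
    labels = [r["label"] for r in records]
    # Leaf cases: pure node, no features left to split on, or depth limit reached.
    if len(set(labels)) == 1 or not features or depth == max_depth:
        return {"leaf": True, "class": Counter(labels).most_common(1)[0][0]}
    f = best_feature(records, features)                       # locally optimal split
    children = {value: build_tree([r for r in records if r[f] == value],
                                  [g for g in features if g != f],
                                  depth + 1, max_depth)
                for value in {r[f] for r in records}}          # one edge per value
    return {"leaf": False, "feature": f, "children": children}

# Tiny hypothetical dataset.
data = [{"outlook": "sunny", "windy": "no",  "label": "play"},
        {"outlook": "rain",  "windy": "yes", "label": "stay"},
        {"outlook": "sunny", "windy": "yes", "label": "play"}]
print(build_tree(data, ["outlook", "windy"]))
```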
Q: How do we know when to stop, i.e., what counts as a leaf
node?
A: Naively, when all nodes are 100% pure or contain identical records.
• Not practical in reality.
• Other options:
  • Specify a maximum tree depth
  • Specify a node impurity threshold
BUILDING A DECISION TREE
CREATING SPLITS
Q: How do we split the training records?
Test conditions can create binary splits:
Q: How do we split the training records?
Alternatively, we can create multiway splits:
CREATING SPLITS
CREATING SPLITS
For continuous features, we can use either method:
NOTE
There are optimizations
that can improve the
naïve quadratic
complexity of determining
the optimum split point
for continuous attributes.
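As a concrete (hedged) sketch of the basic approach: candidate split points for a continuous feature are usually taken at the midpoints between consecutive sorted values, and the one giving the purest children is kept. The values and labels below are made up for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels (same helper as in the earlier sketch)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Best binary split 'value <= t' for a continuous feature.
    Naive scan over candidate midpoints; sorting once and updating class counts
    incrementally is the usual optimization mentioned in the note above."""
    pairs = sorted(zip(values, labels))
    best_t, best_score = None, float("inf")
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue                              # no boundary between equal values
        t = (v1 + v2) / 2                         # candidate split point (midpoint)
        left  = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Hypothetical income values with binary labels: a perfectly pure split exists at 80.
print(best_threshold([60, 70, 75, 85, 90, 95], ["no", "no", "no", "yes", "yes", "yes"]))
# -> (80.0, 0.0)
```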
Q: How do we determine the best split?
A: Recall that no split is necessary if all records at a node belong to the same class.
Thus we want each step to create the split with the highest possible purity
(the most class-homogeneous splits).
We need a metric for purity to optimize!
CREATING SPLITS
SPLITTING
METRICS
TYPES OF DECISION TREES
•ID3 (precursor to C4.5)
•C4.5
•CART (Classification and Regression Trees)
•Others…
These differ in splitting metric, stopping criterion, pruning strategy, etc.
SPLITTING METRICS
ID3 (Iterative Dichotomiser 3)
● ID3 is a straightforward decision tree learning algorithm developed by Ross
Quinlan.
● Applicable only when the attributes (features) describing the data examples are
categorical and the examples belong to pre-defined, clearly distinguishable (i.e.,
well-defined) classes.
● ID3 is an iterative greedy algorithm which starts with the root node and
eventually builds the entire tree.
● ID3 uses Entropy and Information Gain to construct a decision tree.
SPLITTING METRICS
Classification and Regression Trees (CART)
● The CART algorithm is a popular decision tree learning algorithm
introduced by Breiman et al.
● Unlike ID3, the learned tree can be used for both multiclass classification and
regression, depending on the type of the dependent variable.
● The tree-growing process consists of recursive binary splitting of nodes.
ENTROPY
a) Entropy using the frequency table of one attribute (the class). b) Entropy using the frequency table of two attributes (the class and a splitting attribute).
Entropy gives a measure of impurity in a node. In the tree-building process, two important decisions have to be made: what is the best split, and which variable is the best one to split a node on.
source: http://www.saedsayad.com/decision_tree.htm
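The two formulas the slide refers to, reconstructed here to match the cited source (S is the set of records at the node, c the number of classes, p_i the proportion of class i, and S_v the subset of S with value v of attribute A):

```latex
% (a) entropy from the frequency table of one attribute (the class)
E(S) = -\sum_{i=1}^{c} p_i \log_2 p_i

% (b) entropy from the frequency table of two attributes (class vs. splitting attribute A)
E(S, A) = \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} \, E(S_v)
```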
Impurity measures put us on the right track, but on their own they are
not enough to tell us how our split will do.
Q: Why is this true?
A: We still need to look at impurity before & after the split.
SPLITTING METRICS
ENTROPY
Information Gain
The information gain is based on the decrease in entropy after a dataset is split on an attribute.
Constructing a decision tree is all about finding the attribute that returns the highest information gain (i.e.,
the most homogeneous branches).
The information gain criterion helps in making these decisions: child nodes are created using the value(s) of an
independent variable.
We need the entropy of the parent node and of the child nodes to calculate the information gain due to the
split.
The variable with the highest information gain is selected for the split.
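In symbols (a reconstruction consistent with the entropy definitions above), the information gain of splitting the records S at a node on attribute A is the parent entropy minus the weighted entropy of the children:

```latex
\mathrm{Gain}(S, A) = E(S) - E(S, A)
                    = E(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} \, E(S_v)
```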
SPLITTING METRICS
CART
Gini impurity would favor the split in scenario B (0.1666) over the split in
scenario A (0.125), and B is indeed the "purer" split.
Dp, Dleft, and Dright are the datasets of the parent node and of the left and
right child nodes.
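Reconstructed formulas for the Gini impurity and the CART splitting criterion referred to above, where p_i is the proportion of class i at a node and Np, Nleft, Nright are the record counts of Dp, Dleft, Dright (CART chooses the split that maximizes the impurity decrease ΔI):

```latex
I_G(D) = 1 - \sum_{i=1}^{c} p_i^2

\Delta I = I_G(D_p)
         - \frac{N_{\text{left}}}{N_p} \, I_G(D_{\text{left}})
         - \frac{N_{\text{right}}}{N_p} \, I_G(D_{\text{right}})
```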
CONSISTENCY OF SPLITTING METRICS
Note that, for a binary problem, each measure
reaches its maximum at a class probability of 0.5
and its minimum at 0 and 1.
NOTE
Despite consistency,
different measures
may create different
splits.
SPLITTING METRICS
Some measures of impurity at node t over classes i include:
SPLITTING METRICS
Splitting with Information Gain and Entropy
● Favors splits on attributes with many unique values (i.e., many small partitions).
● Weights each class probability by the log (base 2) of that probability.
● A smaller child entropy is better: it makes the gain (the difference from the parent node's
entropy) larger.
● Information gain is the entropy of the parent node minus the weighted entropy of the child nodes.
● Entropy is calculated as – [ P(class1)*log2(P(class1)) + P(class2)*log2(P(class2)) + … +
P(classN)*log2(P(classN)) ]
Splitting with Information Gain and Gini Index
● Favors larger partitions.
● Uses the squared proportion of each class.
● For a perfectly classified (pure) node, the Gini index is zero.
● For evenly distributed classes, it is 1 – (1 / # classes).
● You want a variable split that has a low Gini index.
● The Gini index is calculated as 1 – ( P(class1)^2 + P(class2)^2 + … + P(classN)^2 )
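A quick numeric check of the two formulas above (note the leading minus sign on the entropy), using hypothetical class proportions for a binary node:

```python
import math

def entropy(probs):
    """Entropy = -sum(p * log2(p)); pure classes (p = 0) contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def gini(probs):
    """Gini index = 1 - sum(p^2)."""
    return 1 - sum(p * p for p in probs)

for probs in [(1.0, 0.0), (0.9, 0.1), (0.5, 0.5)]:
    print(probs, round(entropy(probs), 3), round(gini(probs), 3))
# (1.0, 0.0) 0.0 0.0      pure node: both measures are zero
# (0.9, 0.1) 0.469 0.18
# (0.5, 0.5) 1.0 0.5      evenly distributed: entropy = 1, Gini = 1 - 1/2
```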
PREVENTING
OVERFITTING
PREVENTING OVERFITTING
We can use a function of the (information) gain, called the gain ratio, to
explicitly penalize splits with a high number of outcomes
(where p(vi) refers to the proportion of records at node v that take the i-th
outcome of the split):
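A reconstruction of the gain ratio as used in C4.5 (the split information is the entropy of the split outcomes themselves, so splits with many small branches are penalized):

```latex
\mathrm{SplitInfo}(v) = -\sum_{i=1}^{k} p(v_i) \log_2 p(v_i)

\mathrm{GainRatio}(v) = \frac{\mathrm{Gain}(v)}{\mathrm{SplitInfo}(v)}
```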
NOTE
This is a form of
regularization!
In addition to determining splits, we also need a stopping criterion to tell us
when we’re done.
For example: stop when all records belong to the same class, or when all
records have identical features.
This is correct in principle, but will likely overfit.
PREVENTING OVERFITTING
PRE-PRUNING
One possibility: pre-pruning
● Set a minimum threshold on the gain, and stop when no split clears this threshold.
● Set a maximum tree depth.
This prevents overfitting, but is difficult to calibrate in practice (stopping too
early can leave the tree underfit, i.e., biased).
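For example, scikit-learn's DecisionTreeClassifier exposes pre-pruning controls such as max_depth, min_samples_split, and min_impurity_decrease. A hedged sketch follows; the parameter values are arbitrary and would normally be tuned by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: cap the depth and require a minimum impurity decrease for any split.
clf = DecisionTreeClassifier(max_depth=3,
                             min_impurity_decrease=0.01,
                             random_state=0)
clf.fit(X_train, y_train)
print("depth:", clf.get_depth(), "test accuracy:", clf.score(X_test, y_test))
```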
Alternatively: post-pruning
• Build the full tree and perform pruning as a post-processing step.
To prune a tree:
• Examine the nodes from the bottom up.
• Simplify pieces of the tree (according to some criteria).
PREVENTING OVERFITTING
POST-PRUNING
Complicated subtrees can be replaced either with a single node, or with a
simpler (child) subtree.
The first approach is called subtree replacement, and the second is subtree
raising.
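As one concrete option, scikit-learn implements a form of post-pruning (minimal cost-complexity pruning) through the ccp_alpha parameter. A hedged sketch: the candidate alphas come from cost_complexity_pruning_path on the fully grown tree, and in practice the best alpha would be chosen by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Compute the effective alphas along the pruning path of the fully grown tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    print(f"alpha={alpha:.4f}  leaves={pruned.get_n_leaves()}  "
          f"test accuracy={pruned.score(X_test, y_test):.3f}")
```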
STRENGTHS AND
LIMITATIONS
STRENGTHS AND WEAKNESSES OF DECISION TREES
Strengths:
• Simple to interpret
• Little feature preprocessing or scaling required
• (Mostly) agnostic to feature data type
• Mostly robust/scalable

Drawbacks:
• The greedy algorithm finds only a locally optimal solution
• Unstable (i.e., high variance)
• Prone to overfitting