Decision Trees
Varun Jain
● DECISION TREES
● BUILDING DECISION TREES
● SPLITTING METRICS
● PREVENTING OVERFITTING
● STRENGTHS AND LIMITATIONS
STEPS
DECISION TREE CLASSIFIERS
Q: What is a decision tree classifier?
A: A non-parametric, hierarchical classification technique.
Non-parametric: no fixed number of parameters and no assumptions about the data distribution.
Hierarchical: a sequence of questions (test conditions) that yields a class label when applied to any record.
Q: How is a decision tree represented?
A: Using a configuration of nodes and edges.
Nodes represent questions (test conditions)
Edges are the answers to these questions.
DECISION TREE CLASSIFIERS
source: http://www-users.cs.umn.edu/~kumar/dmbook/ch4.pdf
EXAMPLE DATA
EXAMPLE – DECISION TREE
source: http://www-users.cs.umn.edu/~kumar/dmbook/ch4.pdf
TYPES OF NODES
Top node of the tree: root node.
• 0 incoming edges, 2+ outgoing edges.
An internal node:
• 1 incoming edge, 2+ outgoing edges.
• Represents a test condition on the features (an if-statement).
A leaf node:
• 1 incoming edge, 0 outgoing edges.
• Corresponds to a decision on the class label.
NOTE
Internal nodes
represent test
conditions which
partition the records
at that node.
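To make the if-statement view concrete, here is a minimal sketch of a small decision tree written as plain nested conditionals in Python. The feature names and the threshold are hypothetical, loosely modeled on the loan/cheat example in the Tan, Steinbach & Kumar chapter linked above.

```python
def classify(record):
    """A hypothetical hand-written decision tree: each if-statement is an internal
    node (test condition) and each return is a leaf node (class label)."""
    if record["refund"] == "yes":                  # root node test
        return "no cheat"                          # leaf
    if record["marital_status"] == "married":      # internal node
        return "no cheat"                          # leaf
    if record["taxable_income"] < 80_000:          # internal node (continuous test)
        return "no cheat"                          # leaf
    return "cheat"                                 # leaf


print(classify({"refund": "no", "marital_status": "single", "taxable_income": 95_000}))
# -> cheat
```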
BUILDING A DECISION TREE
Q: How do we build a decision tree?
A: Evaluate all possible decision trees (e.g., all permutations of features) for
a given dataset?
No! The number of candidate trees grows combinatorially, so exhaustive search is far too complex and impractical.
BUILDING
DECISION TREES
BUILDING A DECISION TREE
This is a greedy recursive algorithm
that leads to a local optimum.
greedy – the algorithm makes the locally
optimal decision at each step.
recursive – splits task into subtasks,
solves each the same way
local optimum – solution for a given
neighborhood of points
BUILDING A DECISION TREE
Build a decision tree by recursively
splitting records into smaller &
smaller subsets, or splits.
The splitting decision is made at
each node according to some metric
representing purity.
A partition is 100% pure when all
of its records belong to a single
class.
Binary classification problem with classes X and Y. Given the set of records Dt at node t:
1. If all records in Dt belong to a single class (X or Y): t is a leaf node labeled with that class.
2. If Dt contains a mix of classes: split the records into child nodes based on the value of some feature(s).
   ● t becomes an internal node whose outgoing edges correspond to the possible values of the chosen splitting feature(s).
   ● Each outgoing edge terminates in a child node.
   ● A record d is assigned to a child node based on the value of the splitting feature(s) for d.
3. Recursively apply steps 1 & 2 to each child node (a code sketch of this procedure follows below).
BUILDING A DECISION TREE
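Below is a minimal Python sketch of the recursive procedure above, assuming records are dicts with a "label" key and purely categorical features. The greedy choice uses information gain (defined in the splitting-metrics section); this is illustrative only, not a production implementation.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_feature(records, features):
    """Greedy step: pick the categorical feature with the highest information gain."""
    parent = entropy([r["label"] for r in records])
    def gain(f):
        weighted = 0.0
        for value in {r[f] for r in records}:
            subset = [r["label"] for r in records if r[f] == value]
            weighted += len(subset) / len(records) * entropy(subset)
        return parent - weighted
    return max(features, key=gain)

def build_tree(records, features, depth=0, max_depth=5):
    """Recursively split records into purer and purer subsets (Hunt-style sketch)."""
    labels = [r["label"] for r in records]
    # Leaf cases: pure node, no features left to split on, or depth limit reached.
    if len(set(labels)) == 1 or not features or depth == max_depth:
        return {"leaf": True, "class": Counter(labels).most_common(1)[0][0]}
    f = best_feature(records, features)                       # locally optimal split
    children = {value: build_tree([r for r in records if r[f] == value],
                                  [g for g in features if g != f],
                                  depth + 1, max_depth)
                for value in {r[f] for r in records}}          # one edge per value
    return {"leaf": False, "feature": f, "children": children}

# Tiny hypothetical dataset.
data = [{"outlook": "sunny", "windy": "no",  "label": "play"},
        {"outlook": "rain",  "windy": "yes", "label": "stay"},
        {"outlook": "sunny", "windy": "yes", "label": "play"}]
print(build_tree(data, ["outlook", "windy"]))
```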
Q: How do we know when to stop, i.e., what counts as a leaf
node?
A: Naively, when all nodes are 100% pure or contain identical records.
• Not practical in reality.
• Other options:
  • Specify a maximum tree depth
  • Specify a node impurity threshold
BUILDING A DECISION TREE
CREATING SPLITS
Q: How do we split the training records?
Test conditions can create binary splits:
Q: How do we split the training records?
Alternatively, we can create multiway splits:
CREATING SPLITS
CREATING SPLITS
For continuous features, we can use either method:
NOTE
There are optimizations
that can improve the
naïve quadratic
complexity of determining
the optimum split point
for continuous attributes.
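As a concrete (hedged) sketch of the basic approach: candidate split points for a continuous feature are usually taken at the midpoints between consecutive sorted values, and the one giving the purest children is kept. The values and labels below are made up for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels (same helper as in the earlier sketch)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Best binary split 'value <= t' for a continuous feature.
    Naive scan over candidate midpoints; sorting once and updating class counts
    incrementally is the usual optimization mentioned in the note above."""
    pairs = sorted(zip(values, labels))
    best_t, best_score = None, float("inf")
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue                              # no boundary between equal values
        t = (v1 + v2) / 2                         # candidate split point (midpoint)
        left  = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Hypothetical income values with binary labels: a perfectly pure split exists at 80.
print(best_threshold([60, 70, 75, 85, 90, 95], ["no", "no", "no", "yes", "yes", "yes"]))
# -> (80.0, 0.0)
```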
Q: How do we determine the best split?
A: Recall that no split is necessary if all records at a node belong to the same class.
Thus we want each step to create the split with the highest possible purity
(the most class-homogeneous splits).
We need a metric for purity to optimize!
CREATING SPLITS
SPLITTING
METRICS
TYPES OF DECISION TREES
•ID3 (precursor to C4.5)
•C4.5
•CART (Classification and Regression Trees)
•Others…
These differ in splitting metric, stopping criterion, pruning strategy, etc.
SPLITTING METRICS
ID3 (Iterative Dichotomiser 3)
● ID3 is a straightforward decision tree learning algorithm developed by Ross
Quinlan.
● Applicable only when the attributes (features) describing the data examples are
categorical and the examples belong to pre-defined, clearly distinguishable (i.e.,
well-defined) classes.
● ID3 is an iterative greedy algorithm which starts with the root node and
eventually builds the entire tree.
● ID3 uses Entropy and Information Gain to construct a decision tree.
SPLITTING METRICS
Classification and Regression Trees (CART)
● The CART algorithm is a popular decision tree learning algorithm
introduced by Breiman et al.
● Unlike ID3, the learned tree can be used for both multiclass classification and
regression, depending on the type of the dependent variable.
● The tree-growing process consists of recursive binary splitting of nodes.
ENTROPY
a) Entropy using the frequency table of one attribute (the class). b) Entropy using the frequency table of two attributes (the class and a splitting attribute).
Entropy gives a measure of impurity in a node. In the tree-building process, two important decisions have to be made: what is the best split, and which variable is the best one to split a node on.
source: http://www.saedsayad.com/decision_tree.htm
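The two formulas the slide refers to, reconstructed here to match the cited source (S is the set of records at the node, c the number of classes, p_i the proportion of class i, and S_v the subset of S with value v of attribute A):

```latex
% (a) entropy from the frequency table of one attribute (the class)
E(S) = -\sum_{i=1}^{c} p_i \log_2 p_i

% (b) entropy from the frequency table of two attributes (class vs. splitting attribute A)
E(S, A) = \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} \, E(S_v)
```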
Impurity measures put us on the right track, but on their own they are
not enough to tell us how our split will do.
Q: Why is this true?
A: We still need to look at impurity before & after the split.
SPLITTING METRICS
ENTROPY
Information Gain
The information gain is based on the decrease in entropy after a dataset is split on an attribute.
Constructing a decision tree is all about finding the attribute that returns the highest information gain (i.e.,
the most homogeneous branches).
The information gain criterion helps in making these decisions: child nodes are created using the value(s) of an
independent variable.
We need the entropy of the parent node and of the child nodes to calculate the information gain due to the
split.
The variable with the highest information gain is selected for the split.
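In symbols (a reconstruction consistent with the entropy definitions above), the information gain of splitting the records S at a node on attribute A is the parent entropy minus the weighted entropy of the children:

```latex
\mathrm{Gain}(S, A) = E(S) - E(S, A)
                    = E(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} \, E(S_v)
```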
SPLITTING METRICS
CART
Gini impurity would favor the split in scenario B (0.1666) over the split in
scenario A (0.125), and B is indeed the "purer" split.
Dp, Dleft, and Dright are the datasets of the parent node and of the left and
right child nodes.
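Reconstructed formulas for the Gini impurity and the CART splitting criterion referred to above, where p_i is the proportion of class i at a node and Np, Nleft, Nright are the record counts of Dp, Dleft, Dright (CART chooses the split that maximizes the impurity decrease ΔI):

```latex
I_G(D) = 1 - \sum_{i=1}^{c} p_i^2

\Delta I = I_G(D_p)
         - \frac{N_{\text{left}}}{N_p} \, I_G(D_{\text{left}})
         - \frac{N_{\text{right}}}{N_p} \, I_G(D_{\text{right}})
```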
CONSISTENCY OF SPLITTING METRICS
Note that, for a binary problem, each measure
reaches its maximum at a class probability of 0.5
and its minimum at 0 and 1.
NOTE
Despite consistency,
different measures
may create different
splits.
SPLITTING METRICS
Some measures of impurity at node t over classes i include:
SPLITTING METRICS
Splitting with Information Gain and Entropy
● Favors splits on attributes with many unique values (i.e., many small partitions).
● Weights each class probability by the log (base 2) of that probability.
● A smaller child entropy is better: it makes the gain (the difference from the parent node's
entropy) larger.
● Information gain is the entropy of the parent node minus the weighted entropy of the child nodes.
● Entropy is calculated as – [ P(class1)*log2(P(class1)) + P(class2)*log2(P(class2)) + … +
P(classN)*log2(P(classN)) ]
Splitting with Information Gain and Gini Index
● Favors larger partitions.
● Uses the squared proportion of each class.
● For a perfectly classified (pure) node, the Gini index is zero.
● For evenly distributed classes, it is 1 – (1 / # classes).
● You want a variable split that has a low Gini index.
● The Gini index is calculated as 1 – ( P(class1)^2 + P(class2)^2 + … + P(classN)^2 )
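A quick numeric check of the two formulas above (note the leading minus sign on the entropy), using hypothetical class proportions for a binary node:

```python
import math

def entropy(probs):
    """Entropy = -sum(p * log2(p)); pure classes (p = 0) contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def gini(probs):
    """Gini index = 1 - sum(p^2)."""
    return 1 - sum(p * p for p in probs)

for probs in [(1.0, 0.0), (0.9, 0.1), (0.5, 0.5)]:
    print(probs, round(entropy(probs), 3), round(gini(probs), 3))
# (1.0, 0.0) 0.0 0.0      pure node: both measures are zero
# (0.9, 0.1) 0.469 0.18
# (0.5, 0.5) 1.0 0.5      evenly distributed: entropy = 1, Gini = 1 - 1/2
```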
PREVENTING
OVERFITTING
PREVENTING OVERFITTING
We can use a function of the (information) gain, called the gain ratio, to
explicitly penalize splits with a high number of outcomes
(where p(vi) refers to the proportion of records at node v that take the i-th
outcome of the split):
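A reconstruction of the gain ratio as used in C4.5 (the split information is the entropy of the split outcomes themselves, so splits with many small branches are penalized):

```latex
\mathrm{SplitInfo}(v) = -\sum_{i=1}^{k} p(v_i) \log_2 p(v_i)

\mathrm{GainRatio}(v) = \frac{\mathrm{Gain}(v)}{\mathrm{SplitInfo}(v)}
```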
NOTE
This is a form of
regularization!
In addition to determining splits, we also need a stopping criterion to tell us
when we’re done.
For example: stop when all records belong to the same class, or when all
records have identical features.
This is correct in principle, but will likely overfit.
PREVENTING OVERFITTING
PRE-PRUNING
One possibility: pre-pruning
● Set a minimum threshold on the gain, and stop when no split clears this threshold.
● Set a maximum tree depth.
This prevents overfitting, but is difficult to calibrate in practice (stopping too
early can leave the tree underfit, i.e., biased).
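For example, scikit-learn's DecisionTreeClassifier exposes pre-pruning controls such as max_depth, min_samples_split, and min_impurity_decrease. A hedged sketch follows; the parameter values are arbitrary and would normally be tuned by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: cap the depth and require a minimum impurity decrease for any split.
clf = DecisionTreeClassifier(max_depth=3,
                             min_impurity_decrease=0.01,
                             random_state=0)
clf.fit(X_train, y_train)
print("depth:", clf.get_depth(), "test accuracy:", clf.score(X_test, y_test))
```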
Alternatively: post-pruning
• Build the full tree and perform pruning as a post-processing step.
To prune a tree:
• Examine the nodes from the bottom up.
• Simplify pieces of the tree (according to some criteria).
PREVENTING OVERFITTING
POST-PRUNING
Complicated subtrees can be replaced either with a single node, or with a
simpler (child) subtree.
The first approach is called subtree replacement, and the second is subtree
raising.
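As one concrete option, scikit-learn implements a form of post-pruning (minimal cost-complexity pruning) through the ccp_alpha parameter. A hedged sketch: the candidate alphas come from cost_complexity_pruning_path on the fully grown tree, and in practice the best alpha would be chosen by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Compute the effective alphas along the pruning path of the fully grown tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    print(f"alpha={alpha:.4f}  leaves={pruned.get_n_leaves()}  "
          f"test accuracy={pruned.score(X_test, y_test):.3f}")
```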
STRENGTHS AND
LIMITATIONS
STRENGTHS AND WEAKNESSES OF DECISION TREES
Strengths:
• Simple to interpret
• Little feature preprocessing or scaling required
• (Mostly) agnostic to feature data type
• Mostly robust/scalable

Drawbacks:
• The greedy algorithm finds only a locally optimal solution
• Unstable (i.e., high variance)
• Prone to overfitting