Unit-IV: Machine Learning
Decision Tree Induction: Non-metric Methods
Indian Institute of Information Technology
Sri City, Chittoor
Decision Tree Induction
• Decision tree induction is the learning of decision trees from
class-labeled training tuples
• It is a flowchart-like tree structure, where
– each internal node (non-leaf node) denotes a test on an attribute,
– each branch represents an outcome of the test, and
– each leaf node (or terminal node) holds a class label
– The topmost node in a tree is the root node
How are decision trees used for classification?
• Given a tuple, X, for which the associated class label is unknown
• The attribute values of the tuple are tested against the decision tree
• A path is traced from the root to a leaf node, which holds the class
prediction for that tuple
• Decision trees can easily be converted to classification rules
Decision Tree Induction: Training Dataset
[Figure: 14 class-labeled training tuples (9 of class yes, 5 of class no) described by the attributes age, income, student, and credit_rating, with class label buys_computer]
Output: A Decision Tree for “buys_computer”
age?
  ├─ <=30: student?
  │    ├─ no  → buys_computer = no
  │    └─ yes → buys_computer = yes
  ├─ 31..40 → buys_computer = yes
  └─ >40: credit rating?
       ├─ fair      → buys_computer = yes
       └─ excellent → buys_computer = no
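To make the flowchart concrete, here is a minimal sketch (not part of the original slides) that encodes the tree above as nested dictionaries and classifies a tuple by tracing a path from the root to a leaf; the representation and the classify helper are assumptions for illustration only.

```python
# A minimal sketch (assumed representation): the buys_computer tree as nested dicts.
# Internal nodes are ("attribute", {outcome: subtree}); leaves are class labels.
tree = ("age", {
    "<=30":   ("student", {"no": "no", "yes": "yes"}),
    "31..40": "yes",
    ">40":    ("credit_rating", {"fair": "yes", "excellent": "no"}),
})

def classify(node, x):
    """Trace a path from the root to a leaf for tuple x (a dict of attribute values)."""
    while isinstance(node, tuple):          # internal node: test an attribute
        attribute, branches = node
        node = branches[x[attribute]]       # follow the branch for this outcome
    return node                             # leaf node: class label

# Example: a young customer who is a student.
print(classify(tree, {"age": "<=30", "student": "yes"}))  # -> "yes"
```

A tuple that reaches the 31..40 branch is classified without testing any further attribute, which is exactly the path-tracing behaviour described above.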
Decision Tree Induction
Why are decision tree classifiers so popular?
• The construction of decision tree classifiers does not require any domain
knowledge or parameter setting, and therefore is appropriate for exploratory
knowledge discovery
• Decision trees can handle multidimensional data
• Their representation of acquired knowledge in tree form is intuitive and
generally easy to assimilate by humans
• The learning and classification steps of decision tree induction are simple
and fast
• In general, decision tree classifiers have good accuracy
• Decision tree induction has been applied in areas such as
– medicine, manufacturing or production, financial analysis, astronomy, and molecular biology
• Decision trees are the basis of several commercial rule induction systems
• However, successful use may depend on the data at hand
Decision Tree Induction
• Most algorithms for decision tree induction follow a top-down approach,
which starts with a training set of tuples and their associated class labels
• The training set is recursively partitioned into smaller subsets as the tree is
being built
• Algorithm: Generate a decision tree from the training tuples of data
partition, D
• Input:
– Data partition, D, which is a set of training tuples and their associated
class labels;
– Attribute list, the set of candidate attributes;
– Attribute selection method, a procedure to determine the splitting
criterion that “best” partitions the data tuples into individual classes
This criterion consists of a splitting attribute and, possibly, either a
split-point or splitting subset
• Output: A decision tree
Decision Tree Algorithm: Strategy
• The algorithm is called with three parameters: D, attribute list, and
Attribute selection method
• D is the data partition; initially, it is the complete set of training tuples and
their associated class labels
• The parameter attribute list is a list of attributes describing the tuples
• Attribute selection method specifies a heuristic procedure for selecting the
attribute that “best” discriminates the given tuples according to class
• This procedure employs an attribute selection measure such as information
gain or the Gini index
• Whether the tree is strictly binary is generally driven by the attribute
selection measure
• Some attribute selection measures, such as the Gini index, enforce the
resulting tree to be binary
• Others, like information gain, do not, thereby allowing multiway splits (i.e.,
two or more branches to be grown from a node)
• The tree starts as a single node, N, representing the training tuples in D
• If the tuples in D are all of the same class, then node N becomes a leaf
and is labeled with that class
• Otherwise, the algorithm calls Attribute selection method to determine
the splitting criterion
• The splitting criterion tells us which attribute to test at node N by
determining the “best” way to separate or partition the tuples in D into
individual classes
• The splitting criterion also tells us which branches to grow from node N
with respect to the outcomes of the chosen test
• More specifically, the splitting criterion indicates the splitting attribute
and may also indicate either a split-point or a splitting subset
• The splitting criterion is determined so that, ideally, the resulting
partitions at each branch are as “pure” as possible
• A partition is pure if all the tuples in it belong to the same class
• In other words, if we split up the tuples in D according to the mutually
exclusive outcomes of the splitting criterion, we hope for the resulting
partitions to be as pure as possible
• The node N is labeled with the splitting criterion, which serves as a test
at the node
• A branch is grown from node N for each of the outcomes of the splitting
criterion
• The tuples in D are partitioned accordingly based on the splitting
attribute type
• The algorithm uses the same process recursively to form a decision tree
for the tuples at each resulting partition, Dj , of D
• The resulting decision tree is returned
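The recursive strategy above can be summarized in a short sketch; the function and parameter names (generate_tree, best_split, majority_class) are illustrative assumptions, the attribute-selection measure is left as a plug-in (the Attribute selection method parameter), and the tree uses the same nested-dictionary representation as the classification sketch shown earlier.

```python
# Sketch of the top-down induction strategy described above (names are illustrative).
from collections import Counter

def majority_class(D):
    """Return the most common class label among the tuples in D (majority voting)."""
    return Counter(label for _, label in D).most_common(1)[0][0]

def generate_tree(D, attribute_list, best_split):
    """D: list of (attributes_dict, class_label); best_split: attribute-selection method."""
    labels = {label for _, label in D}
    if len(labels) == 1:                      # all tuples in D have the same class
        return labels.pop()                   # leaf labeled with that class
    if not attribute_list:                    # no attributes remain: majority voting
        return majority_class(D)
    A = best_split(D, attribute_list)         # splitting attribute (e.g., max info gain)
    branches = {}
    for aj in {x[A] for x, _ in D}:           # one branch per known value of A
        Dj = [(x, y) for x, y in D if x[A] == aj]
        remaining = [a for a in attribute_list if a != A]   # discrete A: drop from list
        branches[aj] = generate_tree(Dj, remaining, best_split)
    return (A, branches)
```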
• Let A be the splitting attribute. A has v distinct values, {a1, a2, …, av}, based on the training data
Case 1:
• A is discrete-valued: In this case, the outcomes of the test at node N
correspond directly to the known values of A
• A branch is created for each known value, aj , of A and labeled with that
value
• Partition Dj is the subset of class-labeled tuples in D having value aj of
A
• Because all the tuples in a given partition have the same value for A, A
need not be considered in any future partitioning of the tuples
• Therefore, it is removed from attribute list
Case 2:
• A is continuous-valued: In this case, the test at node N has two possible
outcomes, corresponding to the conditions A <= split point and A >
split point, respectively,
• where split point is the split-point returned by Attribute selection
method as part of the splitting criterion
• Two branches are grown from N and labeled according to the previous
outcomes
• The tuples are partitioned such that D1 holds the subset of class-labeled
tuples in D for which A<=split point, while D2 holds the rest
Case 3:
• A is discrete-valued and a binary tree must be produced: The test at
node N is of the form “A ∈ SA?,” where SA is the splitting subset for A,
returned by Attribute selection method as part of the splitting criterion
• It is a subset of the known values of A
• If a given tuple has value aj of A and aj ∈ SA, then the test at node N is satisfied
• Two branches are grown from N
Decision Tree Algorithm Time Complexity
• Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning – majority
voting is employed for classifying the leaf
– There are no samples left
• The computational complexity of the algorithm, given training set D, is
O(n × |D| × log(|D|)),
• where n is the number of attributes describing the tuples in D and |D| is
the number of training tuples in D
• This means that the computational cost of growing a tree grows at most as
n × |D| × log(|D|) with the number of training tuples
Attribute Selection Measures
• An attribute selection measure is a heuristic for selecting the splitting
criterion that
– “best” separates a given data partition, D, of class-labeled training
tuples into individual classes
• If we were to split D into smaller partitions according to the outcomes
of the splitting criterion, ideally each partition would be pure
• Attribute selection measures are also known as splitting rules because
– they determine how the tuples at a given node are to be split
• The attribute selection measure provides a ranking for each attribute
describing the given training tuples
• The attribute having the best score for the measure is chosen as
– the splitting attribute for the given tuples
• Three popular attribute selection measures are information gain, gain ratio, and the Gini index
• The notation used here is as follows
– Let D, the data partition, be a training set of class-labeled tuples
– Suppose the class label attribute has m distinct values defining m distinct classes, Ci (for i = 1, …, m)
– Let Ci,D be the set of tuples of class Ci in D
– Let |D| and |Ci,D| denote the number of tuples in D and Ci,D, respectively
Information Gain:
• Let node N represent or hold the tuples of partition D
• The attribute with the highest information gain is chosen as the splitting
attribute for node N
• This approach minimizes the expected number of tests needed to classify a given tuple and guarantees that a simple tree is found
Information Gain:
• The expected information needed to classify a tuple in D is given by
Info(D) = − Σ_{i=1..m} p_i log2(p_i)
• where p_i is the nonzero probability that an arbitrary tuple in D belongs to class Ci and is estimated by |Ci,D|/|D|
• A log function to the base 2 is used because the information is encoded in bits
• Info(D) is just the average amount of information needed to identify the class label of a tuple in D
• Info(D) is also known as the entropy of D
• Suppose we want to partition the tuples in D on some attribute A having v distinct values, {a1, a2, …, av}, as observed from the training data
• If A is discrete-valued, these values correspond directly to the v outcomes of a test on A
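As a small, assumed illustration (not from the slides), Info(D) can be computed directly from the class counts:

```python
from math import log2

def info(class_counts):
    """Info(D) = -sum(p_i * log2(p_i)) over the classes with nonzero probability."""
    total = sum(class_counts)
    return -sum((c / total) * log2(c / total) for c in class_counts if c > 0)

# The buys_computer data has 9 'yes' and 5 'no' tuples:
print(round(info([9, 5]), 3))   # -> 0.94 bits
```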
Information Gain:
• Attribute A can be used to split D into v partitions or subsets, {D1, D2, …, Dv},
– where Dj contains those tuples in D that have outcome aj of A
• These partitions would correspond to the branches grown from node N
• Ideally, we would like this partitioning to produce an exact classification of the tuples
• How much more information would we still need (after the partitioning) to arrive at an exact classification? This amount is measured by
Info_A(D) = Σ_{j=1..v} (|Dj|/|D|) × Info(Dj)
• The term |Dj|/|D| acts as the weight of the jth partition
• Info_A(D) is the expected information required to classify a tuple from D based on the partitioning by A
Information Gain:
• The smaller the expected information (still) required, the greater the
purity of the partitions
• Information gain is defined as the difference between the original
information requirement (i.e., based on just the proportion of classes)
and the new requirement (i.e., obtained after partitioning on A)
• That is, Gain(A) = Info(D) − Info_A(D)
• Gain(A) tells us how much would be gained by branching on A
• The attribute A with the highest information gain, Gain(A), is chosen as
the splitting attribute at node N
• We want to partition on the attribute A that would do the “best
classification,” so that the amount of information still required to finish
classifying the tuples is minimal
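A minimal sketch of Gain(A), again with assumed helper names, computed from per-branch class counts; the usage line reproduces the age split of the buys_computer data (9 yes/5 no overall; 2/3, 4/0, and 3/2 per branch), which is worked out on the next slides:

```python
from math import log2

def info(counts):
    """Entropy of a partition from its per-class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def gain(D_counts, partition_counts):
    """Gain(A) = Info(D) - Info_A(D); partition_counts is a list of per-branch class counts."""
    n = sum(D_counts)
    info_A = sum(sum(Dj) / n * info(Dj) for Dj in partition_counts)
    return info(D_counts) - info_A

# Splitting the 14 tuples (9 yes / 5 no) on age: youth 2/3, middle_aged 4/0, senior 3/2.
print(round(gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))   # -> 0.247 (the slides round to 0.246 bits)
```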
Induction of a decision tree using information gain:
• The class label attribute, buys_computer, has two distinct values (namely, {yes, no}); therefore, there are two distinct classes (m = 2)
• Let class C1 correspond to yes and class C2 correspond to no
• There are nine tuples of class yes and five tuples of class no
• A (root) node N is created for the tuples in D
• To find the splitting criterion for these tuples, we must compute the information gain of each attribute
Induction of a decision tree using information gain:
• The expected information needed to classify a tuple in D is Info(D) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940 bits
• Next, we need to compute the expected information requirement for each
attribute
• Let’s start with the attribute age
• We need to look at the distribution of yes and no tuples for each category
of age
• For the age category “youth,” there are two yes tuples and three no
tuples
• For the category “middle aged,” there are four yes tuples and zero no
tuples
• For the category “senior,” there are three yes tuples and two no tuples
Induction of a decision tree using information gain:
• The expected information needed to classify a tuple in D if the tuples are partitioned according to age is
Info_age(D) = (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971 = 0.694 bits
(0.971 bits is the entropy of a branch with a 2/3 or 3/2 class split; the middle_aged branch is pure)
• Hence, the gain in information from such a partitioning would be
Gain(age) = Info(D) − Info_age(D) = 0.940 − 0.694 = 0.246 bits
• Similarly, we can compute Gain(income)= 0.029 bits, Gain(student) =0.151
bits, and Gain(credit rating)= 0.048 bits
• Because age has the highest information gain among the attributes, it is
selected as the splitting attribute
• Node N is labeled with age, and branches are grown for each of the
attribute’s values
• The tuples are then partitioned accordingly, as shown in the figure
• Notice that the tuples falling into the partition for age = middle_aged all belong to the same class
• Because they all belong to class “yes,” a leaf is therefore created at the end of that branch and labeled “yes”
How can we compute the information gain of an attribute that is continuous-valued?
• Suppose, an attribute A that is continuous-valued, rather than
discrete-valued
• For such a scenario, we must determine the “best” split-point for A,
where the split-point is a threshold on A
• We first sort the values of A in increasing order
• Typically, the midpoint between each pair of adjacent values is
considered as a possible split-point
• Therefore, given v values of A, then v-1 possible splits are evaluated
• The midpoint between the values ai and ai+1 of A is (ai + ai+1) / 2
• If the values of A are sorted in advance, then determining the best split
for A requires only one pass through the values
• For each possible split-point for A, we evaluate Info_A(D),
– where the number of partitions is two, that is, v = 2 (or j = 1, 2)
• The point with the minimum expected information requirement for A is selected as the split-point for A
• D1 is the set of tuples in D satisfying A ≤ split_point, and
• D2 is the set of tuples in D satisfying A > split_point
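A sketch of this one-pass split-point search; the toy values and the best_split_point helper are assumptions for illustration:

```python
from math import log2

def info(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def best_split_point(values, labels):
    """Return (split_point, Info_A(D)) minimizing the expected information requirement."""
    pairs = sorted(zip(values, labels))                    # sort A's values once, in advance
    n = len(pairs)
    classes = sorted(set(labels))
    best = None
    for i in range(n - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue                                       # identical values give no new midpoint
        split = (pairs[i][0] + pairs[i + 1][0]) / 2        # midpoint of adjacent values
        left  = [y for x, y in pairs if x <= split]        # D1: A <= split_point
        right = [y for x, y in pairs if x > split]         # D2: A > split_point
        info_A = sum(len(part) / n * info([part.count(c) for c in classes])
                     for part in (left, right))
        if best is None or info_A < best[1]:
            best = (split, info_A)
    return best

# Toy continuous attribute with its class labels (values are made up for illustration).
print(best_split_point([25, 32, 41, 38, 22, 47], ["no", "yes", "yes", "yes", "no", "no"]))
# -> (28.5, ~0.54): the midpoint 28.5 minimizes Info_A(D) for these toy values
```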
Attribute Selection Measures: Gain Ratio
Gain Ratio:
• The information gain measure is biased toward tests with many outcomes
• That is, it prefers to select attributes having a large number of values
• For example, consider an attribute that acts as a unique identifier such as
product_ID
• A split on product_ID would result in a large number of partitions (as
many as there are values), each one containing just one tuple
• Because each partition is pure, the information required to classify data set D based on this partitioning would be Info_product_ID(D) = 0
• Therefore, the information gained by partitioning on this attribute is
maximal
• Clearly, such a partitioning is useless for classification
• C4.5, a successor of ID3, uses an extension to information gain known as
gain ratio, which attempts to overcome this bias
• It applies a kind of normalization to information gain using a “split information” value defined analogously with Info(D) as
SplitInfo_A(D) = − Σ_{j=1..v} (|Dj|/|D|) × log2(|Dj|/|D|)
• This value represents the potential information generated by splitting the
training data set, D, into v partitions,
– corresponding to the v outcomes of a test on attribute A
• Note that, for each outcome, it considers the number of tuples having that
outcome with respect to the total number of tuples in D
• It differs from information gain, which measures the information with
respect to classification that is acquired based on the same partitioning
• The gain ratio is defined as GainRatio(A) = Gain(A) / SplitInfo_A(D)
• The attribute with the maximum gain ratio is selected as the splitting
attribute
• Note, however, that as the split information approaches 0, the ratio becomes unstable
• A constraint is added to avoid this, whereby the information gain of the
test selected must be large
– at least as great as the average gain over all tests examined
Computation of gain ratio for the attribute income:
• A test on income splits the data into three partitions, namely low,
medium, and high, containing four, six, and four tuples, respectively
• To compute the gain ratio of income, we first compute
SplitInfo_income(D) = −(4/14) log2(4/14) − (6/14) log2(6/14) − (4/14) log2(4/14) = 1.557
• We have Gain(income) = 0.029
• Therefore, GainRatio(income) = 0.029/1.557 = 0.019
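A short check of the computation above (the 4/6/4 partition sizes and Gain(income) = 0.029 are taken from the slide; the helper name is assumed):

```python
from math import log2

def split_info(partition_sizes):
    """SplitInfo_A(D) = -sum(|Dj|/|D| * log2(|Dj|/|D|))."""
    total = sum(partition_sizes)
    return -sum((s / total) * log2(s / total) for s in partition_sizes if s > 0)

gain_income = 0.029                               # from the slides
si = split_info([4, 6, 4])                        # low, medium, high partitions
print(round(si, 3), round(gain_income / si, 3))   # -> 1.557 0.019
```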
Attribute Selection Measures: Gini Index
• The Gini index is used in CART
• The Gini index measures the impurity of D, a data partition or set of training tuples, as
Gini(D) = 1 − Σ_{i=1..m} p_i²
• where p_i is the probability that a tuple in D belongs to class Ci and is estimated by |Ci,D|/|D|
• The sum is computed over m classes
• The Gini index considers a binary split for each attribute
• Let’s first consider the case where A is a discrete-valued attribute having v distinct values, {a1, a2, …, av}, occurring in D
• To determine the best binary split on A, we examine all the possible
subsets that can be formed using known values of A
• Each subset, SA, can be considered as a binary test for attribute A of the form “A ∈ SA?”
• Given a tuple, this test is satisfied if the value of A for the tuple is among
the values listed in SA
• If A has v possible values, then there are 2^v possible subsets
• For example, if income has three possible values, namely {low, medium,
high}, then the possible subsets are
– {low, medium, high},
– {low, Medium}, {low, high}, {medium, high},
– {low}, {medium}, {high}, and {}
• We exclude the full set, {low, medium, high}, and the empty set from consideration since, conceptually, they do not represent a split
• Therefore, there are 2^v − 2 possible ways to form two partitions of the data, D, based on a binary split on A
• When considering a binary split, we compute a weighted sum of the
impurity of each resulting partition
• For example, if a binary split on A partitions D into D1 and D2, the Gini index of D given that partitioning is
Gini_A(D) = (|D1|/|D|) × Gini(D1) + (|D2|/|D|) × Gini(D2)
• For each attribute, each of the possible binary splits is considered
• For a discrete-valued attribute, the subset that gives the minimum Gini
index for that attribute is selected as its splitting subset
• For continuous-valued attributes, each possible split-point must be
considered
• The strategy is similar to that described earlier for information gain,
where the midpoint between each pair of (sorted) adjacent values is taken
as a possible split-point
• The point giving the minimum Gini index for a given (continuous-valued)
attribute is taken as the split-point of that attribute
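A sketch, with assumed helper names, of how the candidate subsets of a discrete-valued attribute could be enumerated and each binary split scored by its weighted Gini index:

```python
from itertools import combinations

def gini(counts):
    """Gini(D) = 1 - sum(p_i^2) from per-class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts) if total else 0.0

def gini_of_binary_split(D, attribute, subset, classes):
    """Weighted Gini of splitting D (list of (attrs_dict, label)) on the test 'attribute in subset'."""
    D1 = [y for x, y in D if x[attribute] in subset]
    D2 = [y for x, y in D if x[attribute] not in subset]
    n = len(D)
    return sum(len(part) / n * gini([part.count(c) for c in classes])
               for part in (D1, D2) if part)

def candidate_subsets(values):
    """All non-empty proper subsets of the attribute's values (2^v - 2 of them)."""
    values = sorted(values)
    for r in range(1, len(values)):
        for subset in combinations(values, r):
            yield set(subset)

# Example use (toy data): score the split "income in {low, medium}".
D = [({"income": v}, y) for v, y in
     [("low", "yes"), ("medium", "no"), ("high", "yes"), ("high", "no")]]
print(gini_of_binary_split(D, "income", {"low", "medium"}, ["yes", "no"]))  # -> 0.5
```

Each subset and its complement describe the same two-way partition, so in practice only half of the 2^v − 2 candidates need to be scored.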
• Recall that for a possible split-point of A,
– D1 is the set of tuples in D satisfying A ≤ split_point, and
– D2 is the set of tuples in D satisfying A > split_point
• The reduction in impurity that would be incurred by a binary split on a discrete- or continuous-valued attribute A is
ΔGini(A) = Gini(D) − Gini_A(D)
• The attribute that maximizes the reduction in impurity (or, equivalently,
has the minimum Gini index) is selected as the splitting attribute
• This attribute and either its splitting subset (for a discrete-valued splitting
attribute) or split-point (for a continuous-valued splitting attribute)
together form the splitting criterion
Induction of a decision tree using the Gini index:
• Let D be the training data, where there are nine tuples belonging to
– the class buys_computer = yes and
– the remaining five tuples belong to the class buys_computer = no
• A (root) node N is created for the tuples in D
• The Gini index computed for the impurity of D is Gini(D) = 1 − (9/14)² − (5/14)² = 0.459
• To find the splitting criterion for the tuples in D, we need to compute the
Gini index for each attribute
• Let’s start with the attribute income and consider each of the possible
splitting subsets
• Consider the subset {low, medium}
• This would result in 10 tuples in partition D1 satisfying the condition
“income ∈ {low, medium}”
• The remaining four tuples of D would be assigned to partition D2
• The Gini index value computed based on this partitioning is
Gini_income∈{low,medium}(D) = (10/14) × Gini(D1) + (4/14) × Gini(D2) = 0.443
• Similarly, the Gini index values for splits on the remaining subsets are
0.458 (for the subsets {low, high} and {medium}) and 0.450 (for the
subsets {medium, high} and {low})
• Therefore, the best binary split for attribute income is on {low, medium}
(or {high}) because it minimizes the Gini index
• Evaluating age, we obtain {youth, senior} (or {middle_aged}) as the best
split for age with a Gini index of 0.357;
• The attributes student and credit_rating are both binary, with Gini index
values of 0.367 and 0.429, respectively
• The attribute age and splitting subset {youth, senior} therefore give the
minimum Gini index overall, with a reduction in impurity of 0.459-0.357
=0.102
• The binary split “age ∈ {youth, senior}?” results in the maximum reduction in impurity of the tuples in D and is returned as the splitting criterion
• Node N is labeled with the criterion, two branches are grown from it, and
the tuples are partitioned accordingly
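A numeric check of the example; the per-value class counts (low: 3 yes/1 no, medium: 4 yes/2 no, high: 2 yes/2 no) are assumed from the standard 14-tuple dataset these slides follow, since the table itself is not reproduced here:

```python
def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# Class counts (yes, no) per income value, assumed from the standard 14-tuple example:
#   low: (3, 1)   medium: (4, 2)   high: (2, 2)
D1 = (3 + 4, 1 + 2)     # income in {low, medium}: 10 tuples, 7 yes / 3 no
D2 = (2, 2)             # income in {high}: 4 tuples, 2 yes / 2 no
gini_split = 10 / 14 * gini(D1) + 4 / 14 * gini(D2)
print(round(gini((9, 5)), 3), round(gini_split, 3))   # -> 0.459 0.443
```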
Extracting Classification Rules from Trees
• Represent the knowledge in the form of IF-THEN rules
• One rule is created for each path from the root to a leaf
• Each attribute-value pair along a path forms a conjunction
• The leaf node holds the class prediction
• Rules are easier for humans to understand
• Example
IF age = “<=30” AND student = “no” THEN buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
IF age = “31…40” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “no”
IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “yes”
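A small sketch that mechanically extracts such rules from a tree, assuming the nested-dictionary representation used in the earlier classification sketch:

```python
def extract_rules(node, conditions=()):
    """Yield one IF-THEN rule per root-to-leaf path of a nested-dict tree."""
    if not isinstance(node, tuple):                      # leaf: emit the accumulated rule
        antecedent = " AND ".join(f'{a} = "{v}"' for a, v in conditions) or "TRUE"
        yield f'IF {antecedent} THEN buys_computer = "{node}"'
        return
    attribute, branches = node
    for value, child in branches.items():                # one conjunct per branch taken
        yield from extract_rules(child, conditions + ((attribute, value),))

tree = ("age", {
    "<=30":   ("student", {"no": "no", "yes": "yes"}),
    "31..40": "yes",
    ">40":    ("credit_rating", {"fair": "yes", "excellent": "no"}),
})
for rule in extract_rules(tree):
    print(rule)
```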
Avoid Overfitting in Classification
• The generated tree may overfit the training data
– Too many branches, some may reflect anomalies due to
noise or outliers
– The result is poor accuracy for unseen samples
• Two approaches to avoid overfitting
– Prepruning: Halt tree construction early—do not split a
node if this would result in the goodness measure falling
below a threshold
• Difficult to choose an appropriate threshold
– Postpruning: Remove branches from a “fully grown”
tree—get a sequence of progressively pruned trees
• Use a set of data different from the training data to
decide which is the “best pruned tree”
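As a hedged, library-based illustration of postpruning, the sketch below uses scikit-learn’s minimal cost-complexity pruning, which grows a full tree and then evaluates a sequence of progressively pruned trees on data not used for training; the Iris data stands in for a real training set, and this particular method is an assumption, not the pruning procedure the slides prescribe.

```python
# Postpruning sketch using scikit-learn's cost-complexity pruning (assumed method;
# the slides describe postpruning generically, not this specific library call).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)                      # stand-in training data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Alphas along the pruning path correspond to a sequence of progressively pruned trees.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Choose the pruned tree that performs best on data not used for training.
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in path.ccp_alphas),
    key=lambda clf: clf.score(X_val, y_val),
)
print(best.get_n_leaves(), best.ccp_alpha)
```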
Classification in Large Databases
• Classification—a classical problem extensively studied by
statisticians and machine learning researchers
• Scalability: Classifying data sets with millions of examples and
hundreds of attributes with reasonable speed
• Why decision tree induction in classification?
– relatively faster learning speed (than other classification
methods)
– convertible to simple and easy to understand classification
rules
– can use SQL queries for accessing databases
– comparable classification accuracy with other methods
Scalable Decision Tree Induction Methods in Data Mining Studies
• SLIQ (EDBT’96 — Mehta et al.)
– builds an index for each attribute and only class list and the
current attribute list reside in memory
• SPRINT (VLDB’96 — J. Shafer et al.)
– constructs an attribute list data structure
• PUBLIC (VLDB’98 — Rastogi & Shim)
– integrates tree splitting and tree pruning: stop growing the
tree earlier
• RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)
– separates the scalability aspects from the criteria that
determine the quality of the tree
– builds an AVC-list (attribute, value, class label)
Drawbacks
• The splits discussed so far are axis-parallel: each test involves only a single attribute
• For continuous-valued attributes, cut-points can be found
– Such attributes can also be discretized (as CART does)