Unit-IV: Machine Learning
Decision Tree Induction: Non-metric Methods
Indian Institute of Information Technology
Sri City, Chittoor
Decision Tree Induction
• Decision tree induction is the learning of decision trees from
class-labeled training tuples
• It is a flowchart-like tree structure, where
– each internal node (non-leaf node) denotes a test on an attribute,
– each branch represents an outcome of the test, and
– each leaf node (or terminal node) holds a class label
– The topmost node in a tree is the root node
How are decision trees used for classification?
• Given a tuple, X, for which the associated class label is unknown
• The attribute values of the tuple are tested against the decision tree
• A path is traced from the root to a leaf node, which holds the class
prediction for that tuple
• Decision trees can easily be converted to classification rules
Decision Tree Induction: Training Dataset
[Figure: 14 class-labeled training tuples (9 of class yes, 5 of class no) described by the attributes age, income, student, and credit_rating, with class label buys_computer]
Output: A Decision Tree for “buys_computer”
age?
  ├─ <=30: student?
  │    ├─ no  → buys_computer = no
  │    └─ yes → buys_computer = yes
  ├─ 31..40 → buys_computer = yes
  └─ >40: credit rating?
       ├─ fair      → buys_computer = yes
       └─ excellent → buys_computer = no
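To make the flowchart concrete, here is a minimal sketch (not part of the original slides) that encodes the tree above as nested dictionaries and classifies a tuple by tracing a path from the root to a leaf; the representation and the classify helper are assumptions for illustration only.

```python
# A minimal sketch (assumed representation): the buys_computer tree as nested dicts.
# Internal nodes are ("attribute", {outcome: subtree}); leaves are class labels.
tree = ("age", {
    "<=30":   ("student", {"no": "no", "yes": "yes"}),
    "31..40": "yes",
    ">40":    ("credit_rating", {"fair": "yes", "excellent": "no"}),
})

def classify(node, x):
    """Trace a path from the root to a leaf for tuple x (a dict of attribute values)."""
    while isinstance(node, tuple):          # internal node: test an attribute
        attribute, branches = node
        node = branches[x[attribute]]       # follow the branch for this outcome
    return node                             # leaf node: class label

# Example: a young customer who is a student.
print(classify(tree, {"age": "<=30", "student": "yes"}))  # -> "yes"
```

A tuple that reaches the 31..40 branch is classified without testing any further attribute, which is exactly the path-tracing behaviour described above.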
Decision Tree Induction
Why are decision tree classifiers so popular?
• The construction of decision tree classifiers does not require any domain
knowledge or parameter setting, and therefore is appropriate for exploratory
knowledge discovery
• Decision trees can handle multidimensional data
• Their representation of acquired knowledge in tree form is intuitive and
generally easy to assimilate by humans
• The learning and classification steps of decision tree induction are simple
and fast
• In general, decision tree classifiers have good accuracy
• Decision tree induction has been applied in areas such as
– medicine, manufacturing or production, financial analysis, astronomy, and molecular biology
• Decision trees are the basis of several commercial rule induction systems
• However, successful use may depend on the data at hand
Decision Tree Induction
• Most algorithms for decision tree induction follow a top-down approach,
which starts with a training set of tuples and their associated class labels
• The training set is recursively partitioned into smaller subsets as the tree is
being built
• Algorithm: Generate a decision tree from the training tuples of data
partition, D
• Input:
– Data partition, D, which is a set of training tuples and their associated
class labels;
– Attribute list, the set of candidate attributes;
– Attribute selection method, a procedure to determine the splitting
criterion that “best” partitions the data tuples into individual classes
This criterion consists of a splitting attribute and, possibly, either a
split-point or splitting subset
• Output: A decision tree
Decision Tree Algorithm: Strategy
• The algorithm is called with three parameters: D, attribute list, and
Attribute selection method
• D is the data partition; initially, it is the complete set of training tuples and
their associated class labels
• The parameter attribute list is a list of attributes describing the tuples
• Attribute selection method specifies a heuristic procedure for selecting the
attribute that “best” discriminates the given tuples according to class
• This procedure employs an attribute selection measure such as information
gain or the Gini index
• Whether the tree is strictly binary is generally driven by the attribute
selection measure
• Some attribute selection measures, such as the Gini index, enforce the
resulting tree to be binary
• Others, like information gain, do not, thereby allowing multiway splits (i.e.,
two or more branches to be grown from a node)
• The tree starts as a single node, N, representing the training tuples in D
• If the tuples in D are all of the same class, then node N becomes a leaf
and is labeled with that class
• Otherwise, the algorithm calls Attribute selection method to determine
the splitting criterion
• The splitting criterion tells us which attribute to test at node N by
determining the “best” way to separate or partition the tuples in D into
individual classes
• The splitting criterion also tells us which branches to grow from node N
with respect to the outcomes of the chosen test
• More specifically, the splitting criterion indicates the splitting attribute
and may also indicate either a split-point or a splitting subset
• The splitting criterion is determined so that, ideally, the resulting
partitions at each branch are as “pure” as possible
• A partition is pure if all the tuples in it belong to the same class
• In other words, if we split up the tuples in D according to the mutually
exclusive outcomes of the splitting criterion, we hope for the resulting
partitions to be as pure as possible
• The node N is labeled with the splitting criterion, which serves as a test
at the node
• A branch is grown from node N for each of the outcomes of the splitting
criterion
• The tuples in D are partitioned accordingly based on the splitting
attribute type
• The algorithm uses the same process recursively to form a decision tree
for the tuples at each resulting partition, Dj , of D
• The resulting decision tree is returned
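The recursive strategy above can be summarized in a short sketch; the function and parameter names (generate_tree, best_split, majority_class) are illustrative assumptions, the attribute-selection measure is left as a plug-in (the Attribute selection method parameter), and the tree uses the same nested-dictionary representation as the classification sketch shown earlier.

```python
# Sketch of the top-down induction strategy described above (names are illustrative).
from collections import Counter

def majority_class(D):
    """Return the most common class label among the tuples in D (majority voting)."""
    return Counter(label for _, label in D).most_common(1)[0][0]

def generate_tree(D, attribute_list, best_split):
    """D: list of (attributes_dict, class_label); best_split: attribute-selection method."""
    labels = {label for _, label in D}
    if len(labels) == 1:                      # all tuples in D have the same class
        return labels.pop()                   # leaf labeled with that class
    if not attribute_list:                    # no attributes remain: majority voting
        return majority_class(D)
    A = best_split(D, attribute_list)         # splitting attribute (e.g., max info gain)
    branches = {}
    for aj in {x[A] for x, _ in D}:           # one branch per known value of A
        Dj = [(x, y) for x, y in D if x[A] == aj]
        remaining = [a for a in attribute_list if a != A]   # discrete A: drop from list
        branches[aj] = generate_tree(Dj, remaining, best_split)
    return (A, branches)
```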
• Let A be the splitting attribute. A has v distinct values, {a1, a2, …, av}, based on the training data
Case 1:
• A is discrete-valued: In this case, the outcomes of the test at node N
correspond directly to the known values of A
• A branch is created for each known value, aj , of A and labeled with that
value
• Partition Dj is the subset of class-labeled tuples in D having value aj of
A
• Because all the tuples in a given partition have the same value for A, A
need not be considered in any future partitioning of the tuples
• Therefore, it is removed from attribute list
Case 2:
• A is continuous-valued: In this case, the test at node N has two possible
outcomes, corresponding to the conditions A <= split point and A >
split point, respectively,
• where split point is the split-point returned by Attribute selection
method as part of the splitting criterion
• Two branches are grown from N and labeled according to the previous
outcomes
• The tuples are partitioned such that D1 holds the subset of class-labeled
tuples in D for which A<=split point, while D2 holds the rest
Case 3:
• A is discrete-valued and a binary tree must be produced: The test at
node N is of the form “A ∈ SA?,” where SA is the splitting subset for A,
returned by Attribute selection method as part of the splitting criterion
• It is a subset of the known values of A
• If a given tuple has value aj of A and aj ∈ SA, then the test at node N is satisfied
• Two branches are grown from N
Decision Tree Algorithm Time Complexity
• Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning – majority
voting is employed for classifying the leaf
– There are no samples left
• The computational complexity of the algorithm, given training set D, is
O(n × |D| × log(|D|)),
• where n is the number of attributes describing the tuples in D and |D| is
the number of training tuples in D
• This means that the computational cost of growing a tree grows at most as
n × |D| × log(|D|) with the number of training tuples
Attribute Selection Measures
• An attribute selection measure is a heuristic for selecting the splitting
criterion that
– “best” separates a given data partition, D, of class-labeled training
tuples into individual classes
• If we were to split D into smaller partitions according to the outcomes
of the splitting criterion, ideally each partition would be pure
• Attribute selection measures are also known as splitting rules because
– they determine how the tuples at a given node are to be split
• The attribute selection measure provides a ranking for each attribute
describing the given training tuples
• The attribute having the best score for the measure is chosen as
– the splitting attribute for the given tuples
• Three popular attribute selection measures are information gain, gain ratio, and the Gini index
• The notation used here is as follows
– Let D, the data partition, be a training set of class-labeled tuples
– Suppose the class label attribute has m distinct values defining m distinct classes, Ci (for i = 1, …, m)
– Let Ci,D be the set of tuples of class Ci in D
– Let |D| and |Ci,D| denote the number of tuples in D and Ci,D, respectively
Information Gain:
• Let node N represent or hold the tuples of partition D
• The attribute with the highest information gain is chosen as the splitting
attribute for node N
• This approach minimizes the expected number of tests needed to classify a given tuple and guarantees that a simple tree is found
Information Gain:
• The expected information needed to classify a tuple in D is given by
Info(D) = − Σ_{i=1..m} p_i log2(p_i)
• where p_i is the nonzero probability that an arbitrary tuple in D belongs to class Ci and is estimated by |Ci,D|/|D|
• A log function to the base 2 is used because the information is encoded in bits
• Info(D) is just the average amount of information needed to identify the class label of a tuple in D
• Info(D) is also known as the entropy of D
• Suppose we want to partition the tuples in D on some attribute A having v distinct values, {a1, a2, …, av}, as observed from the training data
• If A is discrete-valued, these values correspond directly to the v outcomes of a test on A
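As a small, assumed illustration (not from the slides), Info(D) can be computed directly from the class counts:

```python
from math import log2

def info(class_counts):
    """Info(D) = -sum(p_i * log2(p_i)) over the classes with nonzero probability."""
    total = sum(class_counts)
    return -sum((c / total) * log2(c / total) for c in class_counts if c > 0)

# The buys_computer data has 9 'yes' and 5 'no' tuples:
print(round(info([9, 5]), 3))   # -> 0.94 bits
```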
Information Gain:
• Attribute A can be used to split D into v partitions or subsets, {D1, D2, …, Dv},
– where Dj contains those tuples in D that have outcome aj of A
• These partitions would correspond to the branches grown from node N
• Ideally, we would like this partitioning to produce an exact classification of the tuples
• How much more information would we still need (after the partitioning) to arrive at an exact classification? This amount is measured by
Info_A(D) = Σ_{j=1..v} (|Dj|/|D|) × Info(Dj)
• The term |Dj|/|D| acts as the weight of the jth partition
• Info_A(D) is the expected information required to classify a tuple from D based on the partitioning by A
Information Gain:
• The smaller the expected information (still) required, the greater the
purity of the partitions
• Information gain is defined as the difference between the original
information requirement (i.e., based on just the proportion of classes)
and the new requirement (i.e., obtained after partitioning on A)
• That is, Gain(A) = Info(D) − Info_A(D)
• Gain(A) tells us how much would be gained by branching on A
• The attribute A with the highest information gain, Gain(A), is chosen as
the splitting attribute at node N
• We want to partition on the attribute A that would do the “best
classification,” so that the amount of information still required to finish
classifying the tuples is minimal
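A minimal sketch of Gain(A), again with assumed helper names, computed from per-branch class counts; the usage line reproduces the age split of the buys_computer data (9 yes/5 no overall; 2/3, 4/0, and 3/2 per branch), which is worked out on the next slides:

```python
from math import log2

def info(counts):
    """Entropy of a partition from its per-class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def gain(D_counts, partition_counts):
    """Gain(A) = Info(D) - Info_A(D); partition_counts is a list of per-branch class counts."""
    n = sum(D_counts)
    info_A = sum(sum(Dj) / n * info(Dj) for Dj in partition_counts)
    return info(D_counts) - info_A

# Splitting the 14 tuples (9 yes / 5 no) on age: youth 2/3, middle_aged 4/0, senior 3/2.
print(round(gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))   # -> 0.247 (the slides round to 0.246 bits)
```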
Induction of a decision tree using information gain:
• The class label attribute, buys_computer, has two distinct values (namely, {yes, no}); therefore, there are two distinct classes (m = 2)
• Let class C1 correspond to yes and class C2 correspond to no
• There are nine tuples of class yes and five tuples of class no
• A (root) node N is created for the tuples in D
• To find the splitting criterion for these tuples, we must compute the information gain of each attribute
Induction of a decision tree using information gain:
• The expected information needed to classify a tuple in D is Info(D) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940 bits
• Next, we need to compute the expected information requirement for each
attribute
• Let’s start with the attribute age
• We need to look at the distribution of yes and no tuples for each category
of age
• For the age category “youth,” there are two yes tuples and three no
tuples
• For the category “middle aged,” there are four yes tuples and zero no
tuples
• For the category “senior,” there are three yes tuples and two no tuples
Induction of a decision tree using information gain:
• The expected information needed to classify a tuple in D if the tuples are partitioned according to age is
Info_age(D) = (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971 = 0.694 bits
(0.971 bits is the entropy of a branch with a 2/3 or 3/2 class split; the middle_aged branch is pure)
• Hence, the gain in information from such a partitioning would be
Gain(age) = Info(D) − Info_age(D) = 0.940 − 0.694 = 0.246 bits
• Similarly, we can compute Gain(income)= 0.029 bits, Gain(student) =0.151
bits, and Gain(credit rating)= 0.048 bits
• Because age has the highest information gain among the attributes, it is
selected as the splitting attribute
• Node N is labeled with age, and branches are grown for each of the
attribute’s values
• The tuples are then partitioned accordingly, as shown in the figure
• Notice that the tuples falling into the partition for age = middle_aged all belong to the same class
• Because they all belong to class “yes,” a leaf is therefore created at the end of that branch and labeled “yes”
How can we compute the information gain of an attribute that is continuous-valued?
• Suppose, an attribute A that is continuous-valued, rather than
discrete-valued
• For such a scenario, we must determine the “best” split-point for A,
where the split-point is a threshold on A
• We first sort the values of A in increasing order
• Typically, the midpoint between each pair of adjacent values is
considered as a possible split-point
• Therefore, given v values of A, then v-1 possible splits are evaluated
• The midpoint between the values ai and ai+1 of A is (ai + ai+1) / 2
• If the values of A are sorted in advance, then determining the best split
for A requires only one pass through the values
• For each possible split-point for A, we evaluate Info_A(D),
– where the number of partitions is two, that is, v = 2 (or j = 1, 2)
• The point with the minimum expected information requirement for A is selected as the split-point for A
• D1 is the set of tuples in D satisfying A ≤ split_point, and
• D2 is the set of tuples in D satisfying A > split_point
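A sketch of this one-pass split-point search; the toy values and the best_split_point helper are assumptions for illustration:

```python
from math import log2

def info(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def best_split_point(values, labels):
    """Return (split_point, Info_A(D)) minimizing the expected information requirement."""
    pairs = sorted(zip(values, labels))                    # sort A's values once, in advance
    n = len(pairs)
    classes = sorted(set(labels))
    best = None
    for i in range(n - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue                                       # identical values give no new midpoint
        split = (pairs[i][0] + pairs[i + 1][0]) / 2        # midpoint of adjacent values
        left  = [y for x, y in pairs if x <= split]        # D1: A <= split_point
        right = [y for x, y in pairs if x > split]         # D2: A > split_point
        info_A = sum(len(part) / n * info([part.count(c) for c in classes])
                     for part in (left, right))
        if best is None or info_A < best[1]:
            best = (split, info_A)
    return best

# Toy continuous attribute with its class labels (values are made up for illustration).
print(best_split_point([25, 32, 41, 38, 22, 47], ["no", "yes", "yes", "yes", "no", "no"]))
# -> (28.5, ~0.54): the midpoint 28.5 minimizes Info_A(D) for these toy values
```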
Attribute Selection Measures: Gain Ratio
Gain Ratio:
• The information gain measure is biased toward tests with many outcomes
• That is, it prefers to select attributes having a large number of values
• For example, consider an attribute that acts as a unique identifier such as
product_ID
• A split on product_ID would result in a large number of partitions (as
many as there are values), each one containing just one tuple
• Because each partition is pure, the information required to classify data set D based on this partitioning would be Info_product_ID(D) = 0
• Therefore, the information gained by partitioning on this attribute is
maximal
• Clearly, such a partitioning is useless for classification
• C4.5, a successor of ID3, uses an extension to information gain known as
gain ratio, which attempts to overcome this bias
• It applies a kind of normalization to information gain using a “split information” value defined analogously with Info(D) as
SplitInfo_A(D) = − Σ_{j=1..v} (|Dj|/|D|) × log2(|Dj|/|D|)
• This value represents the potential information generated by splitting the
training data set, D, into v partitions,
– corresponding to the v outcomes of a test on attribute A
• Note that, for each outcome, it considers the number of tuples having that
outcome with respect to the total number of tuples in D
• It differs from information gain, which measures the information with
respect to classification that is acquired based on the same partitioning
• The gain ratio is defined as GainRatio(A) = Gain(A) / SplitInfo_A(D)
• The attribute with the maximum gain ratio is selected as the splitting
attribute
• Note, however, that as the split information approaches 0, the ratio becomes unstable
• A constraint is added to avoid this, whereby the information gain of the
test selected must be large
– at least as great as the average gain over all tests examined
Computation of gain ratio for the attribute income:
• A test on income splits the data into three partitions, namely low,
medium, and high, containing four, six, and four tuples, respectively
• To compute the gain ratio of income, we first compute
SplitInfo_income(D) = −(4/14) log2(4/14) − (6/14) log2(6/14) − (4/14) log2(4/14) = 1.557
• We have Gain(income) = 0.029
• Therefore, GainRatio(income) = 0.029/1.557 = 0.019
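A short check of the computation above (the 4/6/4 partition sizes and Gain(income) = 0.029 are taken from the slide; the helper name is assumed):

```python
from math import log2

def split_info(partition_sizes):
    """SplitInfo_A(D) = -sum(|Dj|/|D| * log2(|Dj|/|D|))."""
    total = sum(partition_sizes)
    return -sum((s / total) * log2(s / total) for s in partition_sizes if s > 0)

gain_income = 0.029                               # from the slides
si = split_info([4, 6, 4])                        # low, medium, high partitions
print(round(si, 3), round(gain_income / si, 3))   # -> 1.557 0.019
```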
Attribute Selection Measures: Gini Index
• The Gini index is used in CART
• The Gini index measures the impurity of D, a data partition or set of training tuples, as
Gini(D) = 1 − Σ_{i=1..m} p_i²
• where p_i is the probability that a tuple in D belongs to class Ci and is estimated by |Ci,D|/|D|
• The sum is computed over m classes
• The Gini index considers a binary split for each attribute
• Let’s first consider the case where A is a discrete-valued attribute having v distinct values, {a1, a2, …, av}, occurring in D
• To determine the best binary split on A, we examine all the possible
subsets that can be formed using known values of A
• Each subset, SA, can be considered as a binary test for attribute A of the form “A ∈ SA?”
• Given a tuple, this test is satisfied if the value of A for the tuple is among
the values listed in SA
• If A has v possible values, then there are 2^v possible subsets
• For example, if income has three possible values, namely {low, medium,
high}, then the possible subsets are
– {low, medium, high},
– {low, Medium}, {low, high}, {medium, high},
– {low}, {medium}, {high}, and {}
• We exclude the full set, {low, medium, high}, and the empty set from consideration since, conceptually, they do not represent a split
• Therefore, there are 2^v − 2 possible ways to form two partitions of the data, D, based on a binary split on A
• When considering a binary split, we compute a weighted sum of the
impurity of each resulting partition
• For example, if a binary split on A partitions D into D1 and D2, the Gini index of D given that partitioning is
Gini_A(D) = (|D1|/|D|) × Gini(D1) + (|D2|/|D|) × Gini(D2)
• For each attribute, each of the possible binary splits is considered
• For a discrete-valued attribute, the subset that gives the minimum Gini
index for that attribute is selected as its splitting subset
• For continuous-valued attributes, each possible split-point must be
considered
• The strategy is similar to that described earlier for information gain,
where the midpoint between each pair of (sorted) adjacent values is taken
as a possible split-point
• The point giving the minimum Gini index for a given (continuous-valued)
attribute is taken as the split-point of that attribute
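A sketch, with assumed helper names, of how the candidate subsets of a discrete-valued attribute could be enumerated and each binary split scored by its weighted Gini index:

```python
from itertools import combinations

def gini(counts):
    """Gini(D) = 1 - sum(p_i^2) from per-class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts) if total else 0.0

def gini_of_binary_split(D, attribute, subset, classes):
    """Weighted Gini of splitting D (list of (attrs_dict, label)) on the test 'attribute in subset'."""
    D1 = [y for x, y in D if x[attribute] in subset]
    D2 = [y for x, y in D if x[attribute] not in subset]
    n = len(D)
    return sum(len(part) / n * gini([part.count(c) for c in classes])
               for part in (D1, D2) if part)

def candidate_subsets(values):
    """All non-empty proper subsets of the attribute's values (2^v - 2 of them)."""
    values = sorted(values)
    for r in range(1, len(values)):
        for subset in combinations(values, r):
            yield set(subset)

# Example use (toy data): score the split "income in {low, medium}".
D = [({"income": v}, y) for v, y in
     [("low", "yes"), ("medium", "no"), ("high", "yes"), ("high", "no")]]
print(gini_of_binary_split(D, "income", {"low", "medium"}, ["yes", "no"]))  # -> 0.5
```

Each subset and its complement describe the same two-way partition, so in practice only half of the 2^v − 2 candidates need to be scored.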
• Recall that for a possible split-point of A,
– D1 is the set of tuples in D satisfying A ≤ split_point, and
– D2 is the set of tuples in D satisfying A > split_point
• The reduction in impurity that would be incurred by a binary split on a discrete- or continuous-valued attribute A is
ΔGini(A) = Gini(D) − Gini_A(D)
• The attribute that maximizes the reduction in impurity (or, equivalently,
has the minimum Gini index) is selected as the splitting attribute
• This attribute and either its splitting subset (for a discrete-valued splitting
attribute) or split-point (for a continuous-valued splitting attribute)
together form the splitting criterion
Induction of a decision tree using the Gini index:
• Let D be the training data, where there are nine tuples belonging to
– the class buys_computer = yes and
– the remaining five tuples belong to the class buys_computer = no
• A (root) node N is created for the tuples in D
• The Gini index computed for the impurity of D is Gini(D) = 1 − (9/14)² − (5/14)² = 0.459
• To find the splitting criterion for the tuples in D, we need to compute the
Gini index for each attribute
• Let’s start with the attribute income and consider each of the possible
splitting subsets
• Consider the subset {low, medium}
• This would result in 10 tuples in partition D1 satisfying the condition
“income ∈ {low, medium}”
• The remaining four tuples of D would be assigned to partition D2
• The Gini index value computed based on this partitioning is
Gini_income∈{low,medium}(D) = (10/14) × Gini(D1) + (4/14) × Gini(D2) = 0.443
• Similarly, the Gini index values for splits on the remaining subsets are
0.458 (for the subsets {low, high} and {medium}) and 0.450 (for the
subsets {medium, high} and {low})
• Therefore, the best binary split for attribute income is on {low, medium}
(or {high}) because it minimizes the Gini index
• Evaluating age, we obtain {youth, senior} (or {middle_aged}) as the best
split for age with a Gini index of 0.357;
• The attributes student and credit_rating are both binary, with Gini index
values of 0.367 and 0.429, respectively
• The attribute age and splitting subset {youth, senior} therefore give the
minimum Gini index overall, with a reduction in impurity of 0.459-0.357
=0.102
• The binary split “age ∈ {youth, senior}?” results in the maximum reduction in impurity of the tuples in D and is returned as the splitting criterion
• Node N is labeled with the criterion, two branches are grown from it, and
the tuples are partitioned accordingly
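A numeric check of the example; the per-value class counts (low: 3 yes/1 no, medium: 4 yes/2 no, high: 2 yes/2 no) are assumed from the standard 14-tuple dataset these slides follow, since the table itself is not reproduced here:

```python
def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# Class counts (yes, no) per income value, assumed from the standard 14-tuple example:
#   low: (3, 1)   medium: (4, 2)   high: (2, 2)
D1 = (3 + 4, 1 + 2)     # income in {low, medium}: 10 tuples, 7 yes / 3 no
D2 = (2, 2)             # income in {high}: 4 tuples, 2 yes / 2 no
gini_split = 10 / 14 * gini(D1) + 4 / 14 * gini(D2)
print(round(gini((9, 5)), 3), round(gini_split, 3))   # -> 0.459 0.443
```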
Extracting Classification Rules from Trees
• Represent the knowledge in the form of IF-THEN rules
• One rule is created for each path from the root to a leaf
• Each attribute-value pair along a path forms a conjunction
• The leaf node holds the class prediction
• Rules are easier for humans to understand
• Example
IF age = “<=30” AND student = “no” THEN buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
IF age = “31…40” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “no”
IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “yes”
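A small sketch that mechanically extracts such rules from a tree, assuming the nested-dictionary representation used in the earlier classification sketch:

```python
def extract_rules(node, conditions=()):
    """Yield one IF-THEN rule per root-to-leaf path of a nested-dict tree."""
    if not isinstance(node, tuple):                      # leaf: emit the accumulated rule
        antecedent = " AND ".join(f'{a} = "{v}"' for a, v in conditions) or "TRUE"
        yield f'IF {antecedent} THEN buys_computer = "{node}"'
        return
    attribute, branches = node
    for value, child in branches.items():                # one conjunct per branch taken
        yield from extract_rules(child, conditions + ((attribute, value),))

tree = ("age", {
    "<=30":   ("student", {"no": "no", "yes": "yes"}),
    "31..40": "yes",
    ">40":    ("credit_rating", {"fair": "yes", "excellent": "no"}),
})
for rule in extract_rules(tree):
    print(rule)
```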
Avoid Overfitting in Classification
• The generated tree may overfit the training data
– Too many branches, some may reflect anomalies due to
noise or outliers
– The result is poor accuracy for unseen samples
• Two approaches to avoid overfitting
– Prepruning: Halt tree construction early—do not split a
node if this would result in the goodness measure falling
below a threshold
• Difficult to choose an appropriate threshold
– Postpruning: Remove branches from a “fully grown”
tree—get a sequence of progressively pruned trees
• Use a set of data different from the training data to
decide which is the “best pruned tree”
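As a hedged, library-based illustration of postpruning, the sketch below uses scikit-learn’s minimal cost-complexity pruning, which grows a full tree and then evaluates a sequence of progressively pruned trees on data not used for training; the Iris data stands in for a real training set, and this particular method is an assumption, not the pruning procedure the slides prescribe.

```python
# Postpruning sketch using scikit-learn's cost-complexity pruning (assumed method;
# the slides describe postpruning generically, not this specific library call).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)                      # stand-in training data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Alphas along the pruning path correspond to a sequence of progressively pruned trees.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Choose the pruned tree that performs best on data not used for training.
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in path.ccp_alphas),
    key=lambda clf: clf.score(X_val, y_val),
)
print(best.get_n_leaves(), best.ccp_alpha)
```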
Classification in Large Databases
• Classification—a classical problem extensively studied by
statisticians and machine learning researchers
• Scalability: Classifying data sets with millions of examples and
hundreds of attributes with reasonable speed
• Why decision tree induction in classification?
– relatively faster learning speed (than other classification
methods)
– convertible to simple and easy to understand classification
rules
– can use SQL queries for accessing databases
– comparable classification accuracy with other methods
Scalable Decision Tree Induction Methods in Data Mining Studies
• SLIQ (EDBT’96 — Mehta et al.)
– builds an index for each attribute and only class list and the
current attribute list reside in memory
• SPRINT (VLDB’96 — J. Shafer et al.)
– constructs an attribute list data structure
• PUBLIC (VLDB’98 — Rastogi & Shim)
– integrates tree splitting and tree pruning: stop growing the
tree earlier
• RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)
– separates the scalability aspects from the criteria that
determine the quality of the tree
– builds an AVC-list (attribute, value, class label)
Drawbacks
• The splits discussed so far are axis-parallel: each test involves only a single attribute
• For continuous-valued attributes, cut-points can be found
– Such attributes can also be discretized (as CART does)