Machine Learning
Lecture 04
Decision Tree Learning
Dr. Rao Muhammad Adeel Nawab
Dr. Rao Muhammad Adeel Nawab 2
How to Work
Power of Dua
Dr. Rao Muhammad Adeel Nawab 3
Dua – Take Help from Allah before starting any task
Dr. Rao Muhammad Adeel Nawab 4
Course Focus
Mainly get EXCELLENCE in two things
1. Become a great human being
2. Become a great Machine Learning Engineer
To become a great human being
Get sincere with yourself
When you get sincere with yourself, your ‫ﺧﻠﻭﺕ‬ (private self) and ‫ﺟﻠﻭﺕ‬ (public self) are the
same
Dr. Rao Muhammad Adeel Nawab 5
Lecture Outline
What are Decision Trees?
What problems are appropriate for Decision Trees?
The Basic Decision Tree Learning Algorithm: ID3
Entropy and Information Gain
Inductive Bias in Decision Tree Learning
Refinements to Basic Decision Tree Learning
Reading:
Chapter 3 of Mitchell
Sections 4.3 and 6.1 of Witten and Frank
Dr. Rao Muhammad Adeel Nawab 6
What are Decision Trees?
Decision tree learning is a method for approximating
discrete-valued target functions, in which the learned
function is represented by a decision tree.
Learned trees can also be re-represented as sets of if-then
rules to improve human readability.
One of the most popular inductive inference algorithms
Successfully applied to a broad range of tasks.
Dr. Rao Muhammad Adeel Nawab 7
What are Decision Trees?
Decision trees are trees which classify instances by testing
at each node some attribute of the instance.
Testing starts at the root node and proceeds downwards
to a leaf node, which indicates the classification of the
instance.
Each branch leading out of a node corresponds to a value
of the attribute being tested at that node.
Dr. Rao Muhammad Adeel Nawab 8
Decision Tree for PlayTennis
Outlook
Sunny Overcast Rain
Humidity
High Normal
Wind
Strong Weak
No Yes
Yes
Yes
No
Dr. Rao Muhammad Adeel Nawab 9
Decision Tree for PlayTennis
Outlook
Sunny Overcast Rain
Humidity
High Normal
No Yes
Each internal node tests an attribute
Each branch corresponds to an
attribute value node
Each leaf node assigns a classification
Dr. Rao Muhammad Adeel Nawab 10
No
Decision Tree for PlayTennis
Outlook
Sunny Overcast Rain
Humidity
High Normal
Wind
Strong Weak
No Yes
Yes
Yes
No
Outlook Temperature Humidity Wind PlayTennis
Sunny Hot High Weak ?
Dr. Rao Muhammad Adeel Nawab 11
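To make the root-to-leaf walk concrete, here is a minimal Python sketch (not part of the original slides): the PlayTennis tree is encoded as a nested dictionary of the assumed form {attribute: {value: subtree-or-label}}, and a small classify function follows the branch matching each attribute value until it reaches a leaf. The names play_tennis_tree and classify are illustrative assumptions.

# Hypothetical nested-dict encoding of the PlayTennis tree shown above
play_tennis_tree = {
    'Outlook': {
        'Sunny':    {'Humidity': {'High': 'No', 'Normal': 'Yes'}},
        'Overcast': 'Yes',
        'Rain':     {'Wind': {'Strong': 'No', 'Weak': 'Yes'}},
    }
}

def classify(tree, example):
    # Walk downwards from the root until a leaf (a class label) is reached
    while isinstance(tree, dict):
        attribute = next(iter(tree))                 # attribute tested at this node
        tree = tree[attribute][example[attribute]]   # follow the branch for the example's value
    return tree

print(classify(play_tennis_tree,
               {'Outlook': 'Sunny', 'Temperature': 'Hot',
                'Humidity': 'High', 'Wind': 'Weak'}))  # prints 'No', as on the slide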
Decision Tree for Conjunction
Outlook
Sunny Overcast Rain
Wind
Strong Weak
No Yes
No
Outlook=Sunny ∧ Wind=Weak
No
Dr. Rao Muhammad Adeel Nawab 12
Decision Tree for Disjunction
Outlook
Sunny Overcast Rain
Yes
Outlook=Sunny ∨ Wind=Weak
Wind
Strong Weak
No Yes
Wind
Strong Weak
No Yes
Dr. Rao Muhammad Adeel Nawab 13
Decision Tree for XOR
Outlook
Sunny Overcast Rain
Wind
Strong Weak
Yes No
Outlook=Sunny XOR
Wind=Weak
Wind
Strong Weak
No Yes
Wind
Strong Weak
No Yes
Dr. Rao Muhammad Adeel Nawab 14
Decision Tree
Outlook
Sunny Overcast Rain
Humidity
High Normal
Wind
Strong Weak
No Yes
Yes
Yes
No
Decision trees represent disjunctions of conjunctions:
(Outlook=Sunny ∧ Humidity=Normal) ∨ (Outlook=Overcast) ∨ (Outlook=Rain ∧ Wind=Weak)
Dr. Rao Muhammad Adeel Nawab 15
Decision Tree for PlayTennis
Outlook
Sunny Overcast Rain
Humidity
High Normal
Wind
Strong Weak
No Yes
Yes
Yes
No
Dr. Rao Muhammad Adeel Nawab 16
A decision tree to classify
days as appropriate for
playing tennis might look
like:
〈Outlook = Sunny, Temp = Hot, Humidity = High, Wind = Strong〉 → No
What are Decision Trees?
Note that
each path through a decision tree forms a conjunction of
attribute tests
the tree as a whole forms a disjunction of such paths; i.e. a
disjunction of conjunctions of attribute tests
Preceding example could be re-expressed as:
(Outlook = Sunny ∧ Humidity = Normal)
∨ (Outlook = Overcast)
∨ (Outlook = Rain ∧ Wind = Weak)
Dr. Rao Muhammad Adeel Nawab 17
∧ = AND
∨ = OR
What are Decision Trees? (cont)
As a complex rule, such a decision tree could be coded by
hand.
However, the challenge for machine learning is to propose
algorithms for learning decision trees from examples.
Dr. Rao Muhammad Adeel Nawab 18
What Problems are Appropriate for Decision Trees?
There are several varieties of decision tree learning, but in
general decision tree learning is best for problems where:
Instances describable by attribute–value pairs
usually nominal (categorical/enumerated/discrete) attributes
with small number of discrete values, but can be numeric
(ordinal/continuous).
Dr. Rao Muhammad Adeel Nawab 19
What Problems are Appropriate for Decision Trees?
Target function is discrete valued
in PlayTennis example target function is Boolean
easy to extend to target functions with > 2 output values
harder, but possible, to extend to numeric target functions
Disjunctive hypothesis may be required
easy for decision trees to learn disjunctive concepts (note such
concepts were outside the hypothesis space of the Candidate-
Elimination algorithm)
Dr. Rao Muhammad Adeel Nawab 20
What Problems are Appropriate for Decision Trees?
Possibly noisy/incomplete training data
robust to errors in classification of training examples and errors
in attribute values describing these examples
Can be trained on examples where for some instances
some attribute values are unknown/missing.
Dr. Rao Muhammad Adeel Nawab 21
Sample Applications of Decision Trees?
Decision trees have been used for:
(see http://www.rulequest.com/see5-examples.html)
Predicting Magnetic Properties of Crystals
Profiling Higher-Priced Houses in Boston
Detecting Advertisements on the Web
Controlling a Production Process
Diagnosing Hypothyroidism
Assessing Credit Risk
Such problems, in which the task is to classify examples into
one of a discrete set of possible categories, are often referred to
as classification problems.
Dr. Rao Muhammad Adeel Nawab 22
Sample Applications of Decision Trees? (cont)
Assessing Credit Risk
Sample Applications of Decision Trees? (cont)
From 490 cases like this, split 44%/56% between
accept/reject, See5 derived twelve rules.
On a further 200 unseen cases, these rules give a
classification accuracy of 83%
Dr. Rao Muhammad Adeel Nawab 24
ID3 Algorithm
Dr. Rao Muhammad Adeel Nawab 25
ID3 Algorithm
ID3 learns decision trees by constructing them top-down,
beginning with the question:
which attribute should be tested at the root of the tree?
Each instance attribute is evaluated using a statistical test
to determine how well it alone classifies the training
examples.
Dr. Rao Muhammad Adeel Nawab 26
ID3 Algorithm
The best attribute is selected and used as the test at the
root node of the tree.
A descendant of the root node is then created for each
possible value of this attribute, and the training examples
are sorted to the appropriate descendant node (i.e., down
the branch corresponding to the example's value for this
attribute).
Dr. Rao Muhammad Adeel Nawab 27
ID3 Algorithm
The entire process is then repeated using the training
examples associated with each descendant node to select
the best attribute to test at that point in the tree.
This process continues for each new leaf node until either
of two conditions is met:
every attribute has already been included along this path
through the tree, or
the training examples associated with this leaf node all have
the same target attribute value (i.e., their entropy is zero).
Dr. Rao Muhammad Adeel Nawab 28
ID3 Algorithm
This forms a greedy search for an acceptable decision tree,
in which the algorithm never backtracks to reconsider
earlier choices.
Dr. Rao Muhammad Adeel Nawab 29
The Basic Decision Tree Learning Algorithm:ID3(Cont.)
ID3 algorithm:
ID3(Examples, Target_Attribute, Attributes)
Create Root node for the tree
If all examples +ve, return 1-node tree Root with label=+
If all examples -ve, return 1-node tree Root with label=-
If Attributes=[], return 1-node tree Root with label=most
common value of Target_Attribute in Examples
Otherwise
Dr. Rao Muhammad Adeel Nawab 30
The Basic Decision Tree Learning Algorithm:ID3
Begin
A ← attribute in Attributes that best classifies Examples
The decision attribute for Root ← A
For each possible value vi of A
Add a new branch below Root for test A = vi
Let Examplesvi = subset of Examples with value vi for A
If Examplesvi = []
Then below this new branch add leaf node with label=most common value of
Target_Attribute in Examples
Else below this new branch add subtree
ID3(Examplesvi, Target_Attribute, Attributes –{A})
End
Return Root
Dr. Rao Muhammad Adeel Nawab 31
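The pseudocode above can be translated into a short Python sketch. This is a minimal illustration under stated assumptions, not a production implementation: examples are dictionaries of attribute values, the returned tree uses the nested-dict encoding sketched earlier, and branches are created only for attribute values that occur in the current example set (the full algorithm would also add majority-class leaves for values never seen at this node). The entropy and information gain helpers anticipate the next sections.

import math
from collections import Counter

def entropy(examples, target):
    counts = Counter(e[target] for e in examples)
    n = len(examples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(examples, attribute, target):
    n = len(examples)
    remainder = 0.0
    for value in set(e[attribute] for e in examples):
        subset = [e for e in examples if e[attribute] == value]
        remainder += len(subset) / n * entropy(subset, target)
    return entropy(examples, target) - remainder

def id3(examples, target, attributes):
    # Returns a nested dict {attribute: {value: subtree-or-label}} or a bare class label
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:                 # all examples share one class: leaf
        return labels[0]
    if not attributes:                        # no attributes left: majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: pick the attribute with the highest information gain (never reconsidered)
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    remaining = [a for a in attributes if a != best]
    tree = {best: {}}
    for value in set(e[best] for e in examples):
        subset = [e for e in examples if e[best] == value]
        tree[best][value] = id3(subset, target, remaining)
    return tree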
Which Attribute is the Best Classifier?
In the ID3 algorithm, choosing which attribute to test at the
next node is a crucial step.
Would like to choose that attribute which does best at
separating training examples according to their target
classification.
An attribute which separates training examples into two sets
each of which contains positive/negative examples of the target
attribute in the same ratio as the initial set of examples has not
helped us progress towards a classification.
Dr. Rao Muhammad Adeel Nawab 32
Which Attribute is the Best Classifier?
Suppose we have 14 training examples, 9 +ve and 5 -ve, of days on which
tennis is played.
For each day we have information about the attributes humidity and wind,
as below.
Which attribute is the best classifier?
Dr. Rao Muhammad Adeel Nawab 33
Entropy and Information Gain
A useful measure for picking the best classifier attribute
is information gain.
Information gain measures how well a given attribute
separates training examples with respect to their target
classification.
Information gain is defined in terms of entropy as used in
information theory.
Dr. Rao Muhammad Adeel Nawab 34
Entropy and Information Gain(Cont.)
S is a sample of training examples
p+ is the proportion of positive
examples
p- is the proportion of negative
examples
Entropy measures the impurity of S
Entropy(S) = -p+ log2 p+ - p- log2 p-
Or
Entropy(S) = −p⊕ log2 p⊕ − p⊖ log2 p⊖
Dr. Rao Muhammad Adeel Nawab 35
Entropy
For our previous example (14 examples, 9 positive, 5
negative):
Entropy([9+,5−]) = −p⊕ log2 p⊕ − p⊖ log2 p⊖
= −(9/14) log2(9/14) − (5/14) log2(5/14)
= 0.940
Dr. Rao Muhammad Adeel Nawab 36
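As a quick check, the same number can be reproduced with a few lines of Python (an illustrative sketch; the convention 0 · log2 0 = 0 is handled by skipping zero proportions):

import math

def entropy(*proportions):
    # Entropy given the class proportions of a sample; 0 * log2(0) is treated as 0
    return -sum(p * math.log2(p) for p in proportions if p > 0)

print(round(entropy(9/14, 5/14), 3))   # 0.94, matching the calculation above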
Entropy Cont…
Think of Entropy(S) as expected number of bits needed to
encode class (⊕ or ⊖) of randomly drawn member of S (under
the optimal, shortest-length code)
For Example
If p⊕ = 1 (all instances are positive) then no message need be sent
(receiver knows example will be positive) and Entropy = 0 (“pure
sample”)
If p⊕ = .5 then 1 bit need be sent to indicate whether instance
negative or positive and Entropy = 1
If p⊕ = .8 then less than 1 bit need be sent on average – assign
shorter codes to collections of positive examples and longer ones
to negative ones
Dr. Rao Muhammad Adeel Nawab 37
Entropy Cont…
Why?
Information theory: an optimal length code assigns −log2 p bits to a
message having probability p.
So, the expected number of bits needed to encode the class (⊕ or ⊖)
of a random member of S is:
p⊕(−log2 p⊕) + p⊖(−log2 p⊖)
Entropy(S) ≡ −p⊕ log2 p⊕ − p⊖ log2 p⊖
Dr. Rao Muhammad Adeel Nawab 38
Information Gain
Entropy gives a measure of purity/impurity of a set of
examples.
Define information gain as the expected reduction in
entropy resulting from partitioning a set of examples on
the basis of an attribute.
Formally, given a set of examples S and attribute A:
Gain(S, A) ≡ Entropy(S) − Σv∈Values(A) (|Sv| / |S|) Entropy(Sv)
Dr. Rao Muhammad Adeel Nawab 39
Information Gain
where
Values(A) is the set of values attribute A can take on
Sv is the subset of S for which A has value v
First term in Gain(S,A) is entropy of original set; second
term is expected entropy after partitioning on A = sum of
entropies of each subset Sv weighted by ratio of Sv in S.
Dr. Rao Muhammad Adeel Nawab 40
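A direct transcription of this definition into Python might look as follows (a sketch, assuming the same example-as-dictionary representation used in the ID3 sketch above):

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attribute, target):
    # Gain(S, A) = Entropy(S) - sum over values v of A of |Sv|/|S| * Entropy(Sv)
    labels = [e[target] for e in examples]
    remainder = 0.0
    for value in set(e[attribute] for e in examples):
        subset_labels = [e[target] for e in examples if e[attribute] == value]
        remainder += len(subset_labels) / len(examples) * entropy(subset_labels)
    return entropy(labels) - remainder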
Information Gain Cont….
Dr. Rao Muhammad Adeel Nawab 41
Information Gain Cont….
Dr. Rao Muhammad Adeel Nawab 42
Training Examples
Day Outlook Temp Humidity Wind Play Tennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Weak Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Strong Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
Dr. Rao Muhammad Adeel Nawab 43
First step: which attribute to test at the root?
Which attribute should be tested at the root?
Gain(S, Outlook) = 0.246
Gain(S, Humidity) = 0.151
Gain(S, Wind) = 0.048
Gain(S, Temperature) = 0.029
Outlook provides the best prediction for the target
Lets grow the tree:
add to the tree a successor for each possible value of Outlook
partition the training samples according to the value of Outlook
Dr. Rao Muhammad Adeel Nawab 44
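The four gain values can be reproduced from the training table above with a short, self-contained script (a sketch; the third decimal places differ very slightly from the slide because of rounding):

import math
from collections import Counter

columns = ['Outlook', 'Temp', 'Humidity', 'Wind', 'PlayTennis']
rows = [('Sunny', 'Hot', 'High', 'Weak', 'No'),        ('Sunny', 'Hot', 'High', 'Strong', 'No'),
        ('Overcast', 'Hot', 'High', 'Weak', 'Yes'),    ('Rain', 'Mild', 'High', 'Weak', 'Yes'),
        ('Rain', 'Cool', 'Normal', 'Weak', 'Yes'),     ('Rain', 'Cool', 'Normal', 'Strong', 'No'),
        ('Overcast', 'Cool', 'Normal', 'Weak', 'Yes'), ('Sunny', 'Mild', 'High', 'Weak', 'No'),
        ('Sunny', 'Cool', 'Normal', 'Weak', 'Yes'),    ('Rain', 'Mild', 'Normal', 'Strong', 'Yes'),
        ('Sunny', 'Mild', 'Normal', 'Strong', 'Yes'),  ('Overcast', 'Mild', 'High', 'Strong', 'Yes'),
        ('Overcast', 'Hot', 'Normal', 'Weak', 'Yes'),  ('Rain', 'Mild', 'High', 'Strong', 'No')]
examples = [dict(zip(columns, r)) for r in rows]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(examples, attribute, target='PlayTennis'):
    labels = [e[target] for e in examples]
    remainder = sum(len(sub) / len(examples) * entropy(sub)
                    for sub in ([e[target] for e in examples if e[attribute] == v]
                                for v in set(e[attribute] for e in examples)))
    return entropy(labels) - remainder

for attribute in ['Outlook', 'Humidity', 'Wind', 'Temp']:
    print(attribute, round(gain(examples, attribute), 3))
# Outlook 0.247, Humidity 0.152, Wind 0.048, Temp 0.029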
After first step
Outlook
Sunny Overcast Rain
Yes
[D1,D2,…,D14]
[9+,5-]
Ssunny=[D1, D2, D8, D9, D11]
[2+,3-]
? ?
[D3, D7, D12, D13]
[4+,0-]
[D4, D5, D6, D10, D14]
[3+,2-]
Which attribute should be tested here?
Ssunny ={D1,D2,D8,D9,D11}
Gain(Ssunny, Humidity) = 0.970 − (3/5)0.0 − (2/5)0.0 = 0.970
Gain(Ssunny, Temp.) = 0.970 − (2/5)0.0 − (2/5)1.0 − (1/5)0.0 = 0.570
Gain(Ssunny, Wind) = 0.970 − (2/5)1.0 − (3/5)0.918 = 0.019
45
Second step
Working on Outlook=Sunny node:
Gain(SSunny, Humidity) = 0.970 − 3/5 × 0.0 − 2/5 × 0.0 = 0.970
Gain(SSunny, Wind) = 0.970 − 2/5 × 1.0 − 3/5 × 0.918 = 0.019
Gain(SSunny, Temp) = 0.970 − 2/5 × 0.0 − 2/5 × 1.0 − 1/5 × 0.0 = 0.570
Humidity provides the best prediction for the target
Lets grow the tree:
add to the tree a successor for each possible value of Humidity
partition the training samples according to the value of Humidity
Dr. Rao Muhammad Adeel Nawab 46
Second and Third Steps
Outlook
Sunny Overcast Rain
Humidity
High Normal
Wind
Strong Weak
No Yes
Yes
Yes
No
[D3,D7,D12,D13]
[D8,D9,D11] [D6,D14]
[D1,D2] [D4,D5,D10]
Dr. Rao Muhammad Adeel Nawab
47
Final tree for S is:
Hypothesis Space Search by ID3
[Figure: ID3's simple-to-complex search through partially constructed decision trees, adding one attribute test (A1, A2, A3, A4) at a time to fit the +/− labelled training examples.]
Dr. Rao Muhammad Adeel Nawab 48
Hypothesis Space Search by ID3
ID3 searches a space of hypotheses (set of possible
decision trees) for one fitting the training data.
Search is simple-to-complex, hill-climbing search guided
by the information gain evaluation function.
Hypothesis space of ID3 is complete space of finite,
discrete-valued functions w.r.t available attributes
contrast with incomplete hypothesis spaces, such as
conjunctive hypothesis space
Dr. Rao Muhammad Adeel Nawab 49
Hypothesis Space Search by ID3
ID3 maintains only one hypothesis at any time, instead of,
e.g., all hypotheses consistent with training examples seen
so far
contrast with CANDIDATE-ELIMINATION
means can’t determine how many alternative decision trees
are consistent with data
means can’t ask questions to resolve competing alternatives
Dr. Rao Muhammad Adeel Nawab 50
Hypothesis Space Search by ID3
ID3 performs no backtracking – once an attribute is
selected for testing at a given node, this choice is never
reconsidered.
so, susceptible to converging to locally optimal rather than
globally optimal solutions
Dr. Rao Muhammad Adeel Nawab 51
Hypothesis Space Search by ID3
Uses all training examples at each step to make
statistically-based decision about how to refine current
hypothesis
contrast with CANDIDATE-ELIMINATION or FIND-S – make
decisions incrementally based on single training examples
using statistically-based properties of all examples
(information gain) means technique is robust in the face of
errors in individual examples.
Dr. Rao Muhammad Adeel Nawab 52
Inductive Bias in Decision Tree
Learning
Dr. Rao Muhammad Adeel Nawab 53
Inductive Bias in Decision Tree Learning
Inductive bias: the set of assumptions needed, in addition to the
training data, to deductively justify the learner's classifications
Given a set of training examples, there may be many decision trees
consistent with them. The inductive bias of ID3 is shown by which of
these trees it chooses
ID3’s search strategy (simple-to-complex, hill climbing)
selects shorter trees over longer ones
selects trees that place attributes with highest Information
Gain closest to root
Dr. Rao Muhammad Adeel Nawab 54
Inductive Bias in Decision Tree Learning
Inductive bias of ID3
Shorter trees are preferred over longer trees.
Trees that place high information gain attributes close to the root
are preferred to those that do not.
Note that one could produce a decision tree learning algorithm
with the simpler bias of always preferring a shorter tree.
How does inductive bias of ID3 compare to that of version
space CANDIDATE-ELIMINATION algorithm?
ID3 incompletely searches a complete hypothesis space
CANDIDATE-ELIMINATION completely searches an incomplete
hypothesis space
Dr. Rao Muhammad Adeel Nawab 55
Inductive Bias in Decision Tree Learning
Can be put differently by saying
inductive bias of ID3 follows from its search strategy (preference
bias or search bias)
inductive bias of CANDIDATE-ELIMINATION follows from its
definition of its search space (restriction bias or language bias).
Note that preference bias only affects the order in which
hypotheses are investigated; restriction bias affects which
hypotheses are investigated.
Generally better to choose algorithm with preference bias rather
than restriction bias – with restriction bias target function may not
be contained in hypothesis space.
Dr. Rao Muhammad Adeel Nawab 56
Inductive Bias in Decision Tree Learning
Note that some algorithms may combine preference and
restriction biases – e.g. checker’s learning program
linear weighted function of fixed set of board features
introduces restriction bias (non-linear potential target
functions excluded)
least mean square parameter tuning introduces preference
bias into search through space of parameter values
Dr. Rao Muhammad Adeel Nawab 57
Inductive Bias in Decision Tree Learning
Is ID3’s inductive bias sound? Why prefer shorter
hypotheses/trees?
One response :
“Occam’s Razor” – prefer simplest hypothesis that fits the data.
This is a general assumption that many natural scientists make.
Dr. Rao Muhammad Adeel Nawab 58
Occam’s Razor
Why prefer short hypotheses?
Argument in favor:
Fewer short hypotheses than long hypotheses
A short hypothesis that fits the data is unlikely to be a coincidence
A long hypothesis that fits the data might be a coincidence
Argument opposed:
There are many ways to define small sets of hypotheses
E.g. All trees with a prime number of nodes that use attributes
beginning with ”Z”
What is so special about small sets based on size of hypothesis?
Dr. Rao Muhammad Adeel Nawab 59
Issues in Decision Tree Learning
Practical issues in learning decision trees include
determining how deeply to grow the decision tree
handling continuous attributes
choosing an appropriate attribute selection measure
handling training data with missing attribute values
handling attributes with differing costs and
improving computational efficiency
Dr. Rao Muhammad Adeel Nawab 60
Refinements to Basic Decision Tree
Learning
Dr. Rao Muhammad Adeel Nawab 61
Refinements to Basic Decision Tree Learning:
Overfitting Training Data + Tree Pruning
In case of
noise in the data or
number of training examples is too small to produce a
representative sample of the true target function
The simple ID3 algorithm can produce trees that overfit
the training examples.
Dr. Rao Muhammad Adeel Nawab 62
Refinements to Basic Decision Tree Learning:
Overfitting Training Data + Tree Pruning Cont….
Suppose in addition to the 14 examples for PlayTennis we
get a 15th example whose target classification is wrong:
‹Sunny, Hot , Normal, Strong, PlayTennis = No›
Dr. Rao Muhammad Adeel Nawab 63
Refinements to Basic Decision Tree Learning:
Outlook
Sunny Overcast Rain
Humidity
High Normal
Wind
Strong Weak
No Yes
Yes
Yes
No
Dr. Rao Muhammad Adeel Nawab 64
What impact will this have on our earlier tree?
Refinements to Basic Decision Tree Learning
Since we previously had the correct example:
‹Sunny, Cool , Normal, Weak, PlayTennis = Yes›
‹Sunny, Mild , Normal, Strong, PlayTennis = Yes›
Tree will be elaborated below right branch of Humidity
Result will be tree that performs well on (errorful) training
examples, but less well on new unseen instances
Dr. Rao Muhammad Adeel Nawab 65
Refinements to Basic Decision Tree Learning
The addition of this incorrect example will now cause ID3 to
construct a more complex tree.
The new example will be sorted into the second leaf node from
the left in the learned tree, along with the previous positive
examples D9 and D11.
Because the new example is labeled as a negative example, ID3 will
search for further refinements to the tree below this node.
Result will be a tree which performs well on the (errorful) training
examples but less well on new unseen instances.
Dr. Rao Muhammad Adeel Nawab 66
Refinements: Overfitting Training Data
Adapting to noisy training data is one type of overfitting.
Overfitting can also occur when the number of training
examples is too small to be representative of the true target
function
coincidental regularities may be picked up during training
More precisely:
Definition: Given a hypothesis space H, a hypothesis h ∈ H
overfits the training data if there is another hypothesis h′ ∈ H
such that h has smaller error than h′ over the training data, but
h′ has a smaller error over the entire distribution of instances.
Dr. Rao Muhammad Adeel Nawab 67
Refinements: Overfitting Training Data
Overfitting is a real problem for decision tree learning –
10% - 25% decrease in accuracy over a range of tasks in
one empirical study
Overfitting a problem for many other machine learning
methods too
Dr. Rao Muhammad Adeel Nawab 68
Refinements: Overfitting Training Data (Example)
Example of ID3 learning which medical patients have a form of diabetes:
Accuracy of tree over training examples increases monotonically as the tree grows (to be expected)
Accuracy of tree over independent test examples increases till about 25 nodes, then decreases
Dr. Rao Muhammad Adeel Nawab 69
Refinements: Avoiding Overfitting
How can overfitting be avoided?
Two general approaches:
stop growing tree before perfectly fitting training data
e.g. when data split is not statistically significant
grow full tree, then prune afterwards
In practice, second approach has been more successful
Dr. Rao Muhammad Adeel Nawab 70
Refinements: Avoiding Overfitting
For either approach, how can optimal final tree size be
decided?
use a set of examples distinct from training examples to
evaluate quality of tree; or
use all data for training but apply statistical test to decide
whether expanding/pruning a given node is likely to improve
performance over whole instance distribution; or
measure complexity of encoding training examples + decision
tree and stop growing tree when this size is minimized –
minimum description length principle
Dr. Rao Muhammad Adeel Nawab 71
Refinements: Avoiding Overfitting
First approach most common – called training and
validation set approach.
Divide available instances into
training set – commonly 2/3 of data
validation set – commonly 1/3 of data
Hope is that random errors and coincidental regularities
learned from training set will not be present in validation
set
Dr. Rao Muhammad Adeel Nawab 72
Refinements: Reduced Error Pruning
Assumes data split into training and validation sets.
Proceed as follows:
Train decision tree on training set
Do until further pruning is harmful:
for each decision node evaluate impact on validation set
of removing that node and those below it
remove node that most improves accuracy on validation set
Dr. Rao Muhammad Adeel Nawab 73
Refinements: Reduced Error Pruning
How is impact of removing a node evaluated?
When a decision node is removed the subtree rooted at it
is replaced with a leaf node whose classification is the
most common classification of examples beneath the
decision node
Dr. Rao Muhammad Adeel Nawab 74
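The sketch below illustrates reduced error pruning on the nested-dict trees used earlier. It is a simplified illustration under assumptions (helper names, tie handling), not the exact procedure of any particular system: every decision node is tried as a pruning candidate, its subtree is replaced by the majority class of the training examples sorted to that node, and the candidate kept is the one that most improves validation accuracy; pruning stops when every removal would hurt.

import copy
from collections import Counter

def classify(tree, example):
    while isinstance(tree, dict):
        attribute = next(iter(tree))
        tree = tree[attribute].get(example[attribute])
    return tree

def accuracy(tree, examples, target):
    return sum(classify(tree, e) == e[target] for e in examples) / len(examples)

def node_paths(tree, path=()):
    # Yield the (attribute, value) path leading to every decision node in the tree
    if isinstance(tree, dict):
        yield path
        attribute = next(iter(tree))
        for value, subtree in tree[attribute].items():
            yield from node_paths(subtree, path + ((attribute, value),))

def replaced(tree, path, leaf):
    # Return a copy of the tree with the node reached via `path` replaced by `leaf`
    if not path:
        return leaf
    new_tree = copy.deepcopy(tree)
    node = new_tree
    for attribute, value in path[:-1]:
        node = node[attribute][value]
    attribute, value = path[-1]
    node[attribute][value] = leaf
    return new_tree

def majority_below(train, path, target):
    # Most common class among the training examples sorted down `path`
    subset = [e for e in train if all(e[a] == v for a, v in path)] or train
    return Counter(e[target] for e in subset).most_common(1)[0][0]

def reduced_error_prune(tree, train, valid, target):
    while True:
        best, best_acc = None, accuracy(tree, valid, target)
        for path in node_paths(tree):
            candidate = replaced(tree, path, majority_below(train, path, target))
            acc = accuracy(candidate, valid, target)
            if acc >= best_acc:   # never accept a prune that hurts validation accuracy
                best, best_acc = candidate, acc
        if best is None:
            return tree           # every further prune would be harmful
        tree = best

Accepting ties (>=) means pruning continues as long as validation accuracy does not drop, which is one reasonable reading of "until further pruning is harmful"; a stricter variant would require a strict improvement.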
Refinements: Reduced Error Pruning (cont…)
To assess value of reduced error pruning, split data into 3
distinct sets:
1. training examples for the original tree
2. validation examples for guiding tree pruning
3. test examples to provide an estimate over future unseen
examples
Dr. Rao Muhammad Adeel Nawab 75
Refinements: Reduced Error Pruning (cont.)
On previous example, reduced error pruning produces this effect:
Drawback: holding data back for a validation set reduces data available for
training
Dr. Rao Muhammad Adeel Nawab 76
Refinements: Rule Post-Pruning
Perhaps most frequently used method (e.g.,C4.5)
Proceed as follows:
1. Convert tree to equivalent set of rules
2. Prune each rule independently of others
3. Sort final rules into desired sequence for use
Convert tree to rules by making the conjunction of
decision nodes along each branch the antecedent of a rule
and each leaf the consequent
Dr. Rao Muhammad Adeel Nawab 77
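On the nested-dict representation used earlier, step 1 (converting the tree into rules) is a simple traversal that collects the attribute tests along each root-to-leaf path. The sketch below is illustrative; the function name tree_to_rules and the printed rule format are assumptions, not anything prescribed by C4.5.

def tree_to_rules(tree, conditions=()):
    # Yield (conditions, classification) pairs, one rule per root-to-leaf path
    if not isinstance(tree, dict):
        yield conditions, tree
        return
    attribute = next(iter(tree))
    for value, subtree in tree[attribute].items():
        yield from tree_to_rules(subtree, conditions + ((attribute, value),))

play_tennis_tree = {
    'Outlook': {
        'Sunny':    {'Humidity': {'High': 'No', 'Normal': 'Yes'}},
        'Overcast': 'Yes',
        'Rain':     {'Wind': {'Strong': 'No', 'Weak': 'Yes'}},
    }
}

for conditions, label in tree_to_rules(play_tennis_tree):
    antecedent = ' AND '.join(f'{a} = {v}' for a, v in conditions)
    print(f'IF {antecedent} THEN PlayTennis = {label}')
# e.g. IF Outlook = Sunny AND Humidity = High THEN PlayTennis = No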
Refinements: Rule Post-Pruning
Dr. Rao Muhammad Adeel Nawab 78
Refinements: Rule Post-Pruning (cont)
To prune rules, remove any precondition (= conjunct in
antecedent) of a rule whose removal does not worsen rule
accuracy
Can estimate rule accuracy
by using a separate validation set
by using the training data, but assuming a statistically-based
pessimistic estimate of rule accuracy (C4.5)
Dr. Rao Muhammad Adeel Nawab 79
Refinements: Rule Post-Pruning (cont)
Three advantages of converting trees to rules before
pruning:
1. converting to rules allows distinguishing different contexts in
which rules are used – treat each path through tree differently
contrast: removing a decision node removes all paths beneath
it
2. removes distinction between testing nodes near root and
those near leaves – avoids need to rearrange tree should
higher nodes be removed
3. rules often easier for people to understand
Dr. Rao Muhammad Adeel Nawab 80
Refinements: Continuous-valued Attributes
Initial definition of ID3 restricted to discrete-valued
target attributes
decision node attributes
Can overcome second limitation by dynamically defining
new discrete-valued attributes that partition a continuous
attribute value into a set of discrete intervals
Dr. Rao Muhammad Adeel Nawab 81
Refinements: Continuous-valued Attributes
So, for a continuous attribute A, dynamically create a new
Boolean attribute Ac that is true if A > c and false otherwise
How do we pick c? → Pick the c that maximises information
gain
Dr. Rao Muhammad Adeel Nawab 82
Refinements: Continuous-valued Attributes
E.g. suppose for PlayTennis example we want Temperature to be a
continuous attribute
Temperature: 40 48 60 72 80 90
PlayTennis: No No Yes Yes Yes No
Sort by temperature and identify candidate thresholds midway
between adjacent points where the target attribute changes
((48+60)/2 = 54 and (80+90)/2 = 85)
Compute the information gain for Temperature>54 and Temperature>85
and select the higher (Temperature>54)
Can be extended to split continuous attribute into
> 2 intervals
Dr. Rao Muhammad Adeel Nawab 83
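The threshold search on the Temperature example can be sketched in a few lines of Python (illustrative code; it recovers the two candidate thresholds 54 and 85 and picks Temperature > 54, as on the slide):

import math
from collections import Counter

temps  = [40, 48, 60, 72, 80, 90]
labels = ['No', 'No', 'Yes', 'Yes', 'Yes', 'No']

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_for_threshold(values, labels, c):
    # Information gain of the derived Boolean attribute (value > c)
    below = [l for v, l in zip(values, labels) if v <= c]
    above = [l for v, l in zip(values, labels) if v > c]
    remainder = (len(below) * entropy(below) + len(above) * entropy(above)) / len(labels)
    return entropy(labels) - remainder

# Candidate thresholds lie midway between adjacent values where the class changes
pairs = sorted(zip(temps, labels))
candidates = [(v1 + v2) / 2 for (v1, l1), (v2, l2) in zip(pairs, pairs[1:]) if l1 != l2]
best = max(candidates, key=lambda c: gain_for_threshold(temps, labels, c))
print(candidates, best)   # [54.0, 85.0] 54.0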
Refinements: Alternative Attribute Selection
Measures
Information gain measure favours attributes with many
values over those with few values.
E.g. if we add a Date attribute to the PlayTennis example it
will have a distinct value for each day and will have the
highest information gain.
this is because date perfectly predicts the target attribute for
all training examples
result is a tree of depth 1 that perfectly classifies training
examples but fails on all other data
Dr. Rao Muhammad Adeel Nawab 84
Refinements: Alternative Attribute Selection
Measures
Can avoid this by using other attribute selection measures.
One alternative is gain ratio
Dr. Rao Muhammad Adeel Nawab 85
Refinements: Alternative Attribute Selection
Measures
Gain Ratio(S, A) = Gain(S, A) / Split Information(S, A)
where Split Information(S, A) = −Σi=1..c (|Si|/|S|) log2(|Si|/|S|)
and Si is the subset of S for which the c-valued attribute A has
value vi
(Note: Split Information is the entropy of S w.r.t the values of A)
Has effect of penalizing attributes with many, uniformly
distributed values
Experiments with variants of this and other attribute
selection measures have been carried out and are
reported in the machine learning literature
Dr. Rao Muhammad Adeel Nawab 86
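A sketch of the gain ratio measure, under the same example-as-dictionary assumptions as the earlier sketches; Split Information is simply the entropy of S with respect to the values of A, so attributes that split the data into many small, uniform pieces are penalised by a large denominator:

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(examples, attribute, target):
    labels = [e[target] for e in examples]
    remainder = sum(len(sub) / len(examples) * entropy(sub)
                    for sub in ([e[target] for e in examples if e[attribute] == v]
                                for v in set(e[attribute] for e in examples)))
    return entropy(labels) - remainder

def split_information(examples, attribute):
    # Entropy of S with respect to the values of A (not the target classification)
    return entropy([e[attribute] for e in examples])

def gain_ratio(examples, attribute, target):
    si = split_information(examples, attribute)
    return gain(examples, attribute, target) / si if si > 0 else 0.0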
Refinements: Missing/Unknown Attribute
Values
What if a training example x is missing value for attribute A?
Several alternatives have been explored.
At decision node n where Gain(S,A) is computed
assign most common value of A among other examples sorted to
node n or
assign most common value of A among other examples at n with
same target attribute value as x or
assign probability pi to each possible value vi of A, estimated from
observed frequencies of values of A among examples sorted to node n
Dr. Rao Muhammad Adeel Nawab 87
Refinements: Missing/Unknown Attribute Values
Assign fraction pi of x distributed down each branch in
tree below n (this technique is used in C4.5)
Last technique can be used to classify new examples with
missing attributes (i.e. after learning) in same fashion
Dr. Rao Muhammad Adeel Nawab 88
Refinements: Attributes with Differing Costs
Different attributes may have different costs associated
with acquiring their values
E.g.
in medical diagnosis, different tests, such as blood tests, brain
scans, have different costs
in robotics, positioning a sensing device on a robot so as to take
differing measurements requires differing amounts of time (=
cost)
Dr. Rao Muhammad Adeel Nawab 89
Refinements: Attributes with Differing Costs
How to learn a consistent tree with low expected cost?
Various approaches have been explored in which the
attribute selection measure is modified to include a cost
term. (E.g.)
Dr. Rao Muhammad Adeel Nawab 90
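As one concrete possibility, the sketch below divides the squared information gain by the attribute's measurement cost, a variant reported in the decision tree literature; treat the exact formula, the cost dictionary and the helper names as illustrative assumptions rather than a prescribed method.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attribute, target):
    labels = [e[target] for e in examples]
    remainder = sum(len(sub) / len(examples) * entropy(sub)
                    for sub in ([e[target] for e in examples if e[attribute] == v]
                                for v in set(e[attribute] for e in examples)))
    return entropy(labels) - remainder

def cost_sensitive_score(examples, attribute, target, cost):
    # Squared gain divided by the cost of measuring the attribute (one variant in the literature)
    return information_gain(examples, attribute, target) ** 2 / cost[attribute]

def best_attribute(examples, attributes, target, cost):
    # cost is a hypothetical mapping, e.g. {'BloodTest': 5.0, 'Temperature': 1.0}
    return max(attributes, key=lambda a: cost_sensitive_score(examples, a, target, cost))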
Summary
Decision trees classify instances. Testing starts at the root and
proceeds downwards:
Non-leaf nodes test one attribute of the instance and the attribute
value determines which branch is followed.
Leaf nodes are instance classifications.
Decision trees are appropriate for problems where:
instances are describable by attribute–value pairs (typically, but
not necessarily, nominal);
target function is discrete valued (typically, but not necessarily);
disjunctive hypotheses may be required;
training data may be noisy/incomplete.
Dr. Rao Muhammad Adeel Nawab 91
Summary (cont….)
Various algorithms have been proposed to learn decision trees
– ID3 is the classic. ID3:
recursively grows tree from the root picking at each point attribute
which maximises information gain with respect to the training
examples sorted to the current node
recursion stops when all examples down a branch fall into a single
class or all attributes have been tested
ID3 carries out incomplete search of complete hypothesis space
– contrast with CANDIDATE-ELIMINATION which carries out a
complete search of an incomplete hypothesis space.
Dr. Rao Muhammad Adeel Nawab 92
Summary (cont…)
Decision trees exhibit an inductive bias which prefers
shorter trees with high information gain attributes closer
to the root (at least where information gain is used as the
attribute selection criterion, as in ID3)
ID3 searches a complete hypothesis space for discrete-
valued functions, but searches the space incompletely,
using the information gain heuristic
Dr. Rao Muhammad Adeel Nawab 93
Summary (cont…)
Overfitting the training data is an important issue in
decision tree learning.
Noise or coincidental regularities due to small samples
may mean that while growing a tree beyond a certain size
improves its performance on the training data, it worsens
its performance on unseen instances
Overfitting can be addressed by post-pruning the decision tree
in a variety of ways
Dr. Rao Muhammad Adeel Nawab 94
Summary (cont…)
Various other refinements of the basic ID3 algorithm
address issues such as:
handling real-valued attributes
handling training/test instances with missing attribute values
using attribute selection measures other than information gain
allowing costs to be associated with attributes
Dr. Rao Muhammad Adeel Nawab 95
How To Become a Great Human
Being
Dr. Rao Muhammad Adeel Nawab 96
Balanced Life is Ideal Life
Get Excellence in five things
1. Health
2. Spirituality
3. Work
4. Friend
5. Family
A Journey from BEGINNER to EXCELLENCE
You must have a combination of these five things with different
variations. However, the aggregate will be the same.
Dr. Rao Muhammad Adeel Nawab 97
Excellence
1. Health
I can run (or brisk walk) 5 kilometers in one go
I take 7-9 hours sleep per night (TIP: Go to bed at 10pm)
I take 3 meals of balanced diet daily
2. Spirituality
Dr. Rao Muhammad Adeel Nawab 98
Excellence
3. Work
Become an authority in your field
For example - Dr. Abdul Qadeer Khan Sb is an authority in research
4. Friend
Have a DADDU YAR in life to drain out on daily basis
5. Family
1. Take Duas of Parents and elders by doing their ‫ﺧﺩﻣﺕ‬ (service) and ‫ﺍﺩﺏ‬ (respect)
2. Your wife/husband should be your best friend
3. Be humble and kind to kids, subordinates and poor people
Dr. Rao Muhammad Adeel Nawab 99
Dr. Rao Muhammad Adeel Nawab 100
It is a state of complete
1. physical
2. mental
3. social
wellbeing, and not merely the absence of disease or infirmity.
Definition by World Health Organization (WHO)
Dr. Rao Muhammad Adeel Nawab 101
CHANGE is never a matter of ABILITY; it is always a matter of
MOTIVATION
Man (mind) + Tan (body) ⟶ Both need good quality food to remain healthy
Focus on OUTCOMES not ACTIVITIES
Dr. Rao Muhammad Adeel Nawab 101
Motivation for Physical Health
Daily running and exercise
Dr. Rao Muhammad Adeel Nawab 102
Motivation for my students and
friends
Technology is the biggest addiction after drugs
Trend vs Comfort
Control vs Quit
How to Spare Time for Health and Fitness
1. Get ADEQUATE Sleep
For adults - 7 to 9 hours regular sleep per night.
Research showed that “Amount of Sleep” is an important indicator
of Health and Well Being.
Go to bed for sleep between 9:00 pm to 10:00 pm
Make a Schedule with a particular focus on 3 things
2. Eat a HEALTHY diet
Healthy diet contains mostly fruits and vegetables and includes little to
no processed food and sweetened beverages
The China Study
i. ‫ے‬
ii. ‫ں‬
iii.
Make a Schedule with a particular focus on 3 things
3. Exercise REGULARLY
Exercise is any bodily activity that enhances or maintains physical
fitness and overall health and wellness.
I am 55 years old and I can run (or brisk walk) five kilometers in one go
(Prof. Roger Moore, University of Sheffield, UK)
At least have brisk walk of 30 to 60 minutes daily
Make a Schedule with a particular focus on 3 things
No Pain No Gain
Key to Success
More Related Content

PPTX
module_3_1.pptx
PPTX
module_3_1.pptx
PPTX
Decision Tree Learning: Decision tree representation, Appropriate problems fo...
PPTX
Machine Learning, Decision Tree Learning module_2_ppt.pptx
PDF
Machine Learning using python module_2_ppt.pdf
PPT
Decision Trees.ppt
PDF
Aiml ajsjdjcjcjcjfjfjModule4_Pashrt1-1.pdf
PPTX
Lect9 Decision tree
module_3_1.pptx
module_3_1.pptx
Decision Tree Learning: Decision tree representation, Appropriate problems fo...
Machine Learning, Decision Tree Learning module_2_ppt.pptx
Machine Learning using python module_2_ppt.pdf
Decision Trees.ppt
Aiml ajsjdjcjcjcjfjfjModule4_Pashrt1-1.pdf
Lect9 Decision tree

Similar to Microsoft PowerPoint - Lec 04 - Decision Tree Learning.pdf (20)

PPTX
Decision Trees Learning in Machine Learning
PDF
Decision Tree-ID3,C4.5,CART,Regression Tree
PPTX
83 learningdecisiontree
PPTX
ML_Unit_1_Part_C
PPTX
3. Tree Models in machine learning
PDF
Decision treeDecision treeDecision treeDecision tree
PPTX
BAS 250 Lecture 5
PPT
Software-Praktikum SoSe 2005 Lehrstuhl fuer Maschinelles ...
PDF
lec02-DecisionTreed. Checking primality of an integer n .pdf
PPT
Decision tree Using Machine Learning.ppt
PPT
Storey_DecisionTrees explain ml algo.ppt
PDF
Decision trees
PPTX
Decision tree algorithm in Machine Learning
PPTX
DecisionTree.pptx for btech cse student
PPTX
Decision tree
PPT
Slide3.ppt
PDF
7 decision tree
PDF
Decision tree lecture 3
PDF
CSA 3702 machine learning module 2
PDF
2MLChapter2DecisionTrees23EN UC Coimbra PT
Decision Trees Learning in Machine Learning
Decision Tree-ID3,C4.5,CART,Regression Tree
83 learningdecisiontree
ML_Unit_1_Part_C
3. Tree Models in machine learning
Decision treeDecision treeDecision treeDecision tree
BAS 250 Lecture 5
Software-Praktikum SoSe 2005 Lehrstuhl fuer Maschinelles ...
lec02-DecisionTreed. Checking primality of an integer n .pdf
Decision tree Using Machine Learning.ppt
Storey_DecisionTrees explain ml algo.ppt
Decision trees
Decision tree algorithm in Machine Learning
DecisionTree.pptx for btech cse student
Decision tree
Slide3.ppt
7 decision tree
Decision tree lecture 3
CSA 3702 machine learning module 2
2MLChapter2DecisionTrees23EN UC Coimbra PT
Ad

More from ZainabShahzad9 (18)

PPTX
Data Science-entropy machine learning.pptx
PPT
software quality Assurance-lecture26.ppt
PPT
software quality Assurance-lecture23.ppt
PPTX
Naive bayes algorithm machine learning.pptx
PPT
Compiler Construction - CS606 Power Point Slides Lecture 13.ppt
PDF
lecture8-final.pdf ( analysis and design of algorithm)
PDF
maxflow.4up.pdf for the Maximam flow to solve using flord fulkerson algorithm
PDF
Chache memory ( chapter number 4 ) by William stalling
PDF
Lecture number 5 Theory.pdf(machine learning)
PDF
Lec 3.pdf
PDF
Lec-1.pdf
PPTX
Presentation1.pptx
PPTX
Presentation2-2.pptx
PPT
Lesson 20.ppt
PPTX
OS 7.pptx
PPTX
OS 6.pptx
DOCX
111803154 - Assignment 5 Normalisation.docx
PPTX
Project Presentation.pptx
Data Science-entropy machine learning.pptx
software quality Assurance-lecture26.ppt
software quality Assurance-lecture23.ppt
Naive bayes algorithm machine learning.pptx
Compiler Construction - CS606 Power Point Slides Lecture 13.ppt
lecture8-final.pdf ( analysis and design of algorithm)
maxflow.4up.pdf for the Maximam flow to solve using flord fulkerson algorithm
Chache memory ( chapter number 4 ) by William stalling
Lecture number 5 Theory.pdf(machine learning)
Lec 3.pdf
Lec-1.pdf
Presentation1.pptx
Presentation2-2.pptx
Lesson 20.ppt
OS 7.pptx
OS 6.pptx
111803154 - Assignment 5 Normalisation.docx
Project Presentation.pptx
Ad

Recently uploaded (20)

PDF
STATICS OF THE RIGID BODIES Hibbelers.pdf
PDF
The Lost Whites of Pakistan by Jahanzaib Mughal.pdf
PPTX
Pharmacology of Heart Failure /Pharmacotherapy of CHF
PDF
RMMM.pdf make it easy to upload and study
PPTX
master seminar digital applications in india
PDF
Supply Chain Operations Speaking Notes -ICLT Program
PDF
2.FourierTransform-ShortQuestionswithAnswers.pdf
PDF
Origin of periodic table-Mendeleev’s Periodic-Modern Periodic table
PDF
FourierSeries-QuestionsWithAnswers(Part-A).pdf
PDF
Anesthesia in Laparoscopic Surgery in India
PDF
Microbial disease of the cardiovascular and lymphatic systems
PDF
O7-L3 Supply Chain Operations - ICLT Program
PPTX
Institutional Correction lecture only . . .
PPTX
Introduction_to_Human_Anatomy_and_Physiology_for_B.Pharm.pptx
PDF
VCE English Exam - Section C Student Revision Booklet
PDF
Business Ethics Teaching Materials for college
PPTX
Microbial diseases, their pathogenesis and prophylaxis
PPTX
BOWEL ELIMINATION FACTORS AFFECTING AND TYPES
PPTX
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
PPTX
school management -TNTEU- B.Ed., Semester II Unit 1.pptx
STATICS OF THE RIGID BODIES Hibbelers.pdf
The Lost Whites of Pakistan by Jahanzaib Mughal.pdf
Pharmacology of Heart Failure /Pharmacotherapy of CHF
RMMM.pdf make it easy to upload and study
master seminar digital applications in india
Supply Chain Operations Speaking Notes -ICLT Program
2.FourierTransform-ShortQuestionswithAnswers.pdf
Origin of periodic table-Mendeleev’s Periodic-Modern Periodic table
FourierSeries-QuestionsWithAnswers(Part-A).pdf
Anesthesia in Laparoscopic Surgery in India
Microbial disease of the cardiovascular and lymphatic systems
O7-L3 Supply Chain Operations - ICLT Program
Institutional Correction lecture only . . .
Introduction_to_Human_Anatomy_and_Physiology_for_B.Pharm.pptx
VCE English Exam - Section C Student Revision Booklet
Business Ethics Teaching Materials for college
Microbial diseases, their pathogenesis and prophylaxis
BOWEL ELIMINATION FACTORS AFFECTING AND TYPES
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
school management -TNTEU- B.Ed., Semester II Unit 1.pptx

Microsoft PowerPoint - Lec 04 - Decision Tree Learning.pdf

  • 1. Machine Learning Lecture 04 Decision Tree Learning Dr. Rao Muhammad Adeel Nawab
  • 2. Dr. Rao Muhammad Adeel Nawab 2 How to Work
  • 3. Power of Dua Dr. Rao Muhammad Adeel Nawab 3
  • 4. Dua – Take Help from Allah before starting any task Dr. Rao Muhammad Adeel Nawab 4
  • 5. Course Focus Mainly get EXCELLENCE in two things 1. Become a great human being 2. Become a great Machine Learning Engineer To become a great human being Get sincere with yourself When you get sincere with yourself your ‫ﺧﻠﻭﺕ‬ and ‫ﺟﻠﻭﺕ‬ is the same Dr. Rao Muhammad Adeel Nawab 5
  • 6. Lecture Outline What are Decision Trees? What problems are appropriate for Decision Trees? The Basic Decision Tree Learning Algorithm: ID3 Entropy and Information Gain Inductive Bias in Decision Tree Learning Refinements to Basic Decision Tree Learning Reading: Chapter 3 of Mitchell Sections 4.3 and 6.1 of Wittena and Frank Dr. Rao Muhammad Adeel Nawab 6
  • 7. What are Decision Trees? Decision tree learning is a method for approximating discrete-valued target functions, in which the learned function is represented by a decision tree. Learned trees can also be re-represented as sets of if-then rules to improve human readability. Most popular of inductive inference algorithms Successfully applied to a broad range of tasks.. Dr. Rao Muhammad Adeel Nawab 7
  • 8. What are Decision Trees? Decision trees are trees which classify instances by testing at each node some attribute of the instance. Testing starts at the root node and proceeds downwards to a leaf node, which indicates the classification of the instance. Each branch leading out of a node corresponds to a value of the attribute being tested at that node. Dr. Rao Muhammad Adeel Nawab 8
  • 9. Decision Tree for PlayTennis Outlook Sunny Overcast Rain Humidity High Normal Wind Strong Weak No Yes Yes Yes No Dr. Rao Muhammad Adeel Nawab 9
  • 10. Decision Tree for PlayTennis Outlook Sunny Overcast Rain Humidity High Normal No Yes Each internal node tests an attribute Each branch corresponds to an attribute value node Each leaf node assigns a classification Dr. Rao Muhammad Adeel Nawab 10
  • 11. No Decision Tree for PlayTennis Outlook Sunny Overcast Rain Humidity High Normal Wind Strong Weak No Yes Yes Yes No Outlook Temperature Humidity Wind PlayTennis Sunny Hot High Weak ? Dr. Rao Muhammad Adeel Nawab 11
  • 12. Decision Tree for Conjunction Outlook Sunny Overcast Rain Wind Strong Weak No Yes No Outlook=Sunny ∧ ∧ ∧ ∧ Wind=Weak No Dr. Rao Muhammad Adeel Nawab 12
  • 13. Decision Tree for Disjunction Outlook Sunny Overcast Rain Yes Outlook=Sunny ∨ ∨ ∨ ∨ Wind=Weak Wind Strong Weak No Yes Wind Strong Weak No Yes Dr. Rao Muhammad Adeel Nawab 13
  • 14. Decision Tree for XOR Outlook Sunny Overcast Rain Wind Strong Weak Yes No Outlook=Sunny XOR Wind=Weak Wind Strong Weak No Yes Wind Strong Weak No Yes Dr. Rao Muhammad Adeel Nawab 14
  • 15. Decision Tree Outlook Sunny Overcast Rain Humidity High Normal Wind Strong Weak No Yes Yes Yes No decision trees represent disjunctions of conjunctions (Outlook=Sunny ∧ ∧ ∧ ∧ Humidity=Normal) ∨ ∨ ∨ ∨ (Outlook=Overcast) ∨ ∨ ∨ ∨ (Outlook=Rain ∧ ∧ ∧ ∧ Wind=Weak) Dr. Rao Muhammad Adeel Nawab 15
  • 16. Decision Tree for PlayTennis Outlook Sunny Overcast Rain Humidity High Normal Wind Strong Weak No Yes Yes Yes No Dr. Rao Muhammad Adeel Nawab 16 A decision tree to classify days as appropriate for playing tennis might look like: 〈 〈 〈 〈Outlook = Sunny, Temp = Hot, Humidity = High, Wind = Strong〉 〉 〉 〉 No
  • 17. What are Decision Trees? Note that each path through a decision tree forms a conjunction of attribute tests the tree as a whole forms a disjunction of such paths; i.e. a disjunction of conjunctions of attribute tests Preceding example could be re-expressed as: (Out look = Sunny ∧ ∧ ∧ ∧ Humidity = Normal) ∨ ∨ ∨ ∨ (Out look = Overcast) ∨ ∨ ∨ ∨ (Out look = Rain ∧ ∧ ∧ ∧ Wind =Weak) Dr. Rao Muhammad Adeel Nawab 17 ∧ ∧ ∧ ∧ = AND V = OR
  • 18. What are Decision Trees? (cont) As a complex rule, such a decision tree could be coded by hand. However, the challenge for machine learning is to propose algorithms for learning decision trees from examples. Dr. Rao Muhammad Adeel Nawab 18
  • 19. What Problems are Appropriate for Decision Trees? There are several varieties of decision tree learning, but in general decision tree learning is best for problems where: Instances describable by attribute–value pairs usually nominal (categorical/enumerated/discrete) attributes with small number of discrete values, but can be numeric (ordinal/continuous). Dr. Rao Muhammad Adeel Nawab 19
  • 20. What Problems are Appropriate for Decision Trees? Target function is discrete valued in PlayTennis example target function is Boolean easy to extend to target functions with > 2 output values harder, but possible, to extend to numeric target functions Disjunctive hypothesis may be required easy for decision trees to learn disjunctive concepts (note such concepts were outside the hypothesis space of the Candidate- Elimination algorithm) Dr. Rao Muhammad Adeel Nawab 20
  • 21. What Problems are Appropriate for Decision Trees? Possibly noisy/incomplete training data robust to errors in classification of training examples and errors in attribute values describing these examples Can be trained on examples where for some instances some attribute values are unknown/missing. Dr. Rao Muhammad Adeel Nawab 21
  • 22. Sample Applications of Decision Trees? Decision trees have been used for: (see http://www.rulequest.com/see5-examples.html) Predicting Magnetic Properties of Crystals Profiling Higher-Priced Houses in Boston Detecting Advertisements on the Web Controlling a Production Process Diagnosing Hypothyroidism Assessing Credit Risk Such problems, in which the task is to classify examples into one of a discrete set of possible categories, are often referred to as classification problems. Dr. Rao Muhammad Adeel Nawab 22
  • 23. Sample Applications of Decision Trees? (cont) Sample Applications of Decision Trees? (cont) Sample Applications of Decision Trees? (cont) Sample Applications of Decision Trees? (cont) Assessing Credit Risk
  • 24. Sample Applications of Decision Trees? (cont) From 490 cases like this, split 44%/56% between accept/reject, See5 derived twelve rules. On a further 200 unseen cases, these rules give a classification accuracy of 83% Dr. Rao Muhammad Adeel Nawab 24
  • 25. ID3 Algorithm Dr. Rao Muhammad Adeel Nawab 25
  • 26. ID3 Algorithm ID3, learns decision trees by constructing them top- down, beginning with the question which attribute should be tested at the root of the tree? Each instance attribute is evaluated using a statistical test to determine how well it alone classifies the training examples. Dr. Rao Muhammad Adeel Nawab 26
  • 27. ID3 Algorithm The best attribute is selected and used as the test at the root node of the tree. A descendant of the root node is then created for each possible value of this attribute, and the training examples are sorted to the appropriate descendant node (i.e., down the branch corresponding to the example's value for this attribute). Dr. Rao Muhammad Adeel Nawab 27
  • 28. ID3 Algorithm The entire process is then repeated using the training examples associated with each descendant node to select the best attribute to test at that point in the tree. This process continues for each new leaf node until either of two conditions is met: every attribute has already been included along this path through the tree, or the training examples associated with this leaf node all have the same target attribute value (i.e., their entropy is zero). Dr. Rao Muhammad Adeel Nawab 28
  • 29. ID3 Algorithm This forms a greedy search for an acceptable decision tree, in which the algorithm never backtracks to reconsider earlier choices. Dr. Rao Muhammad Adeel Nawab 29
  • 30. The Basic Decision Tree Learning Algorithm:ID3(Cont.) ID3 algorithm: ID3(Example, Target_Attribute, Attribute) Create Root node for the tree If all examples +ve, return 1-node tree Root with label=+ If all examples -ve, return 1-node tree Root with label=- If Attributes=[], return 1-node tree Root with label=most common value of Target_Attribute in Examples Otherwise Dr. Rao Muhammad Adeel Nawab 30
  • 31. The Basic Decision Tree Learning Algorithm:ID3 Begin A ← attribute in Attributes that best classifies Examples The decision attribute for Root ← A For each possible value vi of A Add a new branch below Root for test A = vi Let Examplesvi = subset of Examples with value vi for A If Examplesvi = [] Then below this new branch add leaf node with label=most common value of Target_Attribute in Examples Else below this new branch add subtree ID3(Examplesvi, Target_Attribute, Attributes –{A}) End Return Root Dr. Rao Muhammad Adeel Nawab 31
  • 32. Which Attribute is the Best Classifier? In the ID3 algorithm, choosing which attribute to test at the next node is a crucial step. Would like to choose that attribute which does best at separating training examples according to their target classification. An attribute which separates training examples into two sets each of which contains positive/negative examples of the target attribute in the same ratio as the initial set of examples has not helped us progress towards a classification. Dr. Rao Muhammad Adeel Nawab 32
  • 33. Which Attribute is the Best Classifier? Suppose we have 14 training examples, 9 +ve and 5 -ve, of days on which tennis is played. For each day we have information about the attributes humidity and wind, as below. Which attribute is the best classifier? Dr. Rao Muhammad Adeel Nawab 33
  • 34. Entropy and Information Gain A useful measure of for picking the best classifier attribute is information gain. Information gain measures how well a given attribute separates training examples with respect to their target classification. Information gain is defined in terms of entropy as used in information theory. Dr. Rao Muhammad Adeel Nawab 34
  • 35. Entropy and Information Gain(Cont.) S is a sample of training examples p+ is the proportion of positive examples p- is the proportion of negative examples Entropy measures the impurity of S Entropy(S) = -p+ log2 p+ - p- log2 p- Or Entropy(S) = −p ⊕ ⊕ ⊕ ⊕ log2 p ⊕ ⊕ ⊕ ⊕ − p ⊖ ⊖ ⊖ ⊖ log2 p ⊖ ⊖ ⊖ ⊖ Dr. Rao Muhammad Adeel Nawab 35 p ⊕ ⊕ ⊕ ⊕
  • 36. Entropy For our previous example (14 examples, 9 positive, 5 negative): Entropy([9+,5−]) = −p⊕ ⊕ ⊕ ⊕ log2 p⊕ ⊕ ⊕ ⊕− p⊖ ⊖ ⊖ ⊖ log2 = −(9/14)log2(9/14)−(5/14)log2(5/14) = .940 Dr. Rao Muhammad Adeel Nawab 36
  • 37. Entropy Cont… Think of Entropy(S) as expected number of bits needed to encode class (⊕ ⊕ ⊕ ⊕ or ⊖ ⊖ ⊖ ⊖) of randomly drawn member of S (under the optimal, shortest-length code) For Example If p⊕ ⊕ ⊕ ⊕ = 1 (all instances are positive) then no message need be sent (receiver knows example will be positive) and Entropy = 0 (“pure sample”) If p⊕ ⊕ ⊕ ⊕ = .5 then 1 bit need be sent to indicate whether instance negative or positive and Entropy = 1 If p⊕ ⊕ ⊕ ⊕ = .8 then less than 1 bit need be sent on average – assign shorter codes to collections of positive examples and longer ones to negative ones Dr. Rao Muhammad Adeel Nawab 37
  • 38. Entropy Cont… Why? Information theory: optimal length code assigns −log2p bits to. message having probability p. So, expected number of bits needed to encode class (⊕ ⊕ ⊕ ⊕ or ⊖ ⊖ ⊖ ⊖) of random member of S: p⊕ ⊕ ⊕ ⊕(−log2 p⊕ ⊕ ⊕ ⊕)+ p⊖ ⊖ ⊖ ⊖(−log2 p⊖ ⊖ ⊖ ⊖) Entropy(S) ≡ −p⊕ ⊕ ⊕ ⊕ log2p⊕ ⊕ ⊕ ⊕− p⊖ ⊖ ⊖ ⊖ log2p⊖ ⊖ ⊖ ⊖ Dr. Rao Muhammad Adeel Nawab 38
  • 39. Information Gain Entropy gives a measure of purity/impurity of a set of examples. Define information gain as the expected reduction in entropy resulting from partitioning a set of examples on the basis of an attribute. Formally, given a set of examples S and attribute A: Dr. Rao Muhammad Adeel Nawab 39
  • 40. Information Gain where Values(A) is the set of values attribute A can take on Sv is the subset of S for which A has value v First term in Gain(S,A) is entropy of original set; second term is expected entropy after partitioning on A = sum of entropies of each subset Sv weighted by ratio of Sv in S. Dr. Rao Muhammad Adeel Nawab 40
  • 41. Information Gain Cont…. Dr. Rao Muhammad Adeel Nawab 41
  • 42. Information Gain Cont…. Dr. Rao Muhammad Adeel Nawab 42
  • 43. Training Examples Day Outlook Temp Humidity Wind Play Tennis D1 Sunny Hot High Weak No D2 Sunny Hot High Strong No D3 Overcast Hot High Weak Yes D4 Rain Mild High Weak Yes D5 Rain Cool Normal Weak Yes D6 Rain Cool Normal Strong No D7 Overcast Cool Normal Weak Yes D8 Sunny Mild High Weak No D9 Sunny Cold Normal Weak Yes D10 Rain Mild Normal Strong Yes D11 Sunny Mild Normal Strong Yes D12 Overcast Mild High Strong Yes D13 Overcast Hot Normal Weak Yes D14 Rain Mild High Strong No Dr. Rao Muhammad Adeel Nawab 43
  • 44. First step: which attribute to test at the root? Which attribute should be tested at the root? Gain(S, Outlook) = 0.246 Gain(S, Humidity) = 0.151 Gain(S, Wind) = 0.084 Gain(S, Temperature) = 0.029 Outlook provides the best prediction for the target Lets grow the tree: add to the tree a successor for each possible value of Outlook partition the training samples according to the value of Outlook Dr. Rao Muhammad Adeel Nawab 44
  • 45. After first step Outlook Sunny Overcast Rain Yes [D1,D2,…,D14] [9+,5-] Ssunny=[D1, D2, D8, D9, D11] [2+,3-] ? ? [D3, D7, D12, D13] [4+,0-] [D4, D5, D6, D10, D14] [3+,2-] Which attribute should be tested here? Ssunny ={D1,D2,D8,D9,D11} Gain(Ssunny , Humidity)=0.970-(3/5)0.0 – 2/5(0.0) = 0.970 Gain(Ssunny , Temp.)=0.970-(2/5)0.0 –2/5(1.0)-(1/5)0.0 = 0.570 Gain(Ssunny , Wind)=0.970= -(2/5)1.0 – 3/5(0.918) = 0.019 45
  • 46. Second step Working on Outlook=Sunny node: Gain(SSunny, Humidity) = 0.970 − − − − 3/5 × × × × 0.0 − − − − 2/5 × × × × 0.0 = 0.970 Gain(SSunny, Wind) = 0.970 − − − − 2/5 × × × × 1.0 − − − − 3.5 × × × × 0.918 = 0 .019 Gain(SSunny, Temp) = 0.970 − − − − 2/5 × × × × 0.0 − − − − 2/5 × × × × 1.0 − − − − 1/5 × × × × 0.0 =0.570 Humidity provides the best prediction for the target Lets grow the tree: add to the tree a successor for each possible value of Humidity partition the training samples according to the value of Humidity Dr. Rao Muhammad Adeel Nawab 46
  • 47. Second and Third Steps Outlook Sunny Overcast Rain Humidity High Normal Wind Strong Weak No Yes Yes Yes No [D3,D7,D12,D13] [D8,D9,D11] [D6,D14] [D1,D2] [D4,D5,D10] Dr. Rao Muhammad Adeel Nawab 47 Final tree for S is:
  • 48. Hypothesis Space Search by ID3 + - + + - + A1 - - + + - + A2 + - - + - + A2 - A4 + - A2 - A3 - + Dr. Rao Muhammad Adeel Nawab 48
  • 49. Hypothesis Space Search by ID3 ID3 searches a space of hypotheses (set of possible decision trees) for one fitting the training data. Search is simple-to-complex, hill-climbing search guided by the information gain evaluation function. Hypothesis space of ID3 is complete space of finite, discrete-valued functions w.r.t available attributes contrast with incomplete hypothesis spaces, such as conjunctive hypothesis space Dr. Rao Muhammad Adeel Nawab 49
  • 50. Hypothesis Space Search by ID3 ID3 maintains only one hypothesis at any time, instead of, e.g., all hypotheses consistent with training examples seen so far contrast with CANDIDATE-ELIMINATION means can’t determine how many alternative decision trees are consistent with data means can’t ask questions to resolve competing alternatives Dr. Rao Muhammad Adeel Nawab 50
  • 51. Hypothesis Space Search by ID3 ID3 performs no backtracking – once an attribute is selected for testing at a given node, this choice is never reconsidered. so, susceptible to converging to locally optimal rather than globally optimal solutions Dr. Rao Muhammad Adeel Nawab 51
  • 52. Hypothesis Space Search by ID3 Uses all training examples at each step to make statistically-based decision about how to refine current hypothesis contrast with CANDIDATE-ELIMINATION or FIND-S – make decisions incrementally based on single training examples using statistically-based properties of all examples (information gain) means technique is robust in the face of errors in individual examples. Dr. Rao Muhammad Adeel Nawab 52
  • 53. Inductive Bias in Decision Tree Learning Dr. Rao Muhammad Adeel Nawab 53
  • 54. Inductive Bias in Decision Tree Learning Inductive bias: set of assumptions needed in addition to training data to justify deductively learner’s classification Given a set of training examples, there may be many decision trees consistent with them Inductive bias of ID3 is shown by which of these trees it chooses ID3’s search strategy (simple-to-complex, hill climbing) selects shorter trees over longer ones selects trees that place attributes with highest Information Gain closest to root Dr. Rao Muhammad Adeel Nawab 54
  • 55. Inductive Bias in Decision Tree Learning Inductive bias of ID3 Shorter trees are preferred over longer trees. Trees that place high information gain attributes close to the root are preferred to those that do not. Note that one could produce a decision tree learning algorithm with the simpler bias of always preferring a shorter tree. How does inductive bias of ID3 compare to that of version space CANDIDATE-ELIMINATION algorithm? ID3 incompletely searches a complete hypothesis space CANDIDATE-ELIMINATION completely searches an incomplete hypothesis space Dr. Rao Muhammad Adeel Nawab 55
• 56. Inductive Bias in Decision Tree Learning This can be put differently by saying the inductive bias of ID3 follows from its search strategy (preference bias or search bias), while the inductive bias of CANDIDATE-ELIMINATION follows from the definition of its search space (restriction bias or language bias). Note that a preference bias only affects the order in which hypotheses are investigated; a restriction bias affects which hypotheses are investigated at all. It is generally better to choose an algorithm with a preference bias rather than a restriction bias: with a restriction bias the target function may not be contained in the hypothesis space. Dr. Rao Muhammad Adeel Nawab 56
• 57. Inductive Bias in Decision Tree Learning Note that some algorithms may combine preference and restriction biases, e.g. the checkers learning program: a linear weighted function of a fixed set of board features introduces a restriction bias (non-linear potential target functions are excluded), while least mean squares parameter tuning introduces a preference bias into the search through the space of parameter values Dr. Rao Muhammad Adeel Nawab 57
• 58. Inductive Bias in Decision Tree Learning Is ID3's inductive bias sound? Why prefer shorter hypotheses/trees? One response: "Occam's Razor" – prefer the simplest hypothesis that fits the data. This is a general assumption that many natural scientists make. Dr. Rao Muhammad Adeel Nawab 58
  • 59. Occam’s Razor Why prefer short hypotheses? Argument in favor: Fewer short hypotheses than long hypotheses A short hypothesis that fits the data is unlikely to be a coincidence A long hypothesis that fits the data might be a coincidence Argument opposed: There are many ways to define small sets of hypotheses E.g. All trees with a prime number of nodes that use attributes beginning with ”Z” What is so special about small sets based on size of hypothesis? Dr. Rao Muhammad Adeel Nawab 59
  • 60. Issues in Decision Tree Learning Practical issues in learning decision trees include determining how deeply to grow the decision tree handling continuous attributes choosing an appropriate attribute selection measure handling training data with missing attribute values handling attributes with differing costs and improving computational efficiency Dr. Rao Muhammad Adeel Nawab 60
  • 61. Refinements to Basic Decision Tree Learning Dr. Rao Muhammad Adeel Nawab 61
• 62. Refinements to Basic Decision Tree Learning: Overfitting Training Data + Tree Pruning If there is noise in the data, or the number of training examples is too small to produce a representative sample of the true target function, the simple ID3 algorithm can produce trees that overfit the training examples. Dr. Rao Muhammad Adeel Nawab 62
• 63. Refinements to Basic Decision Tree Learning: Overfitting Training Data + Tree Pruning (cont.) Suppose that, in addition to the 14 examples for PlayTennis, we get a 15th example whose target classification is wrong: ‹Sunny, Hot, Normal, Strong, PlayTennis = No› Dr. Rao Muhammad Adeel Nawab 63
• 64. Refinements to Basic Decision Tree Learning: Outlook Sunny Overcast Rain Humidity High Normal Wind Strong Weak No Yes Yes Yes No Dr. Rao Muhammad Adeel Nawab 64 What impact will this have on our earlier tree?
• 65. Refinements to Basic Decision Tree Learning Since we previously had the correct examples: ‹Sunny, Cool, Normal, Weak, PlayTennis = Yes› ‹Sunny, Mild, Normal, Strong, PlayTennis = Yes› the tree will be elaborated below the Normal branch of Humidity. The result will be a tree that performs well on the (errorful) training examples, but less well on new unseen instances Dr. Rao Muhammad Adeel Nawab 65
• 66. Refinements to Basic Decision Tree Learning The addition of this incorrect example will now cause ID3 to construct a more complex tree. The new example will be sorted into the second leaf node from the left in the learned tree, along with the previous positive examples D9 and D11. Because the new example is labelled as negative, ID3 will search for further refinements to the tree below this node. The result will be a tree that performs well on the errorful training examples but less well on new unseen instances. Dr. Rao Muhammad Adeel Nawab 66
• 67. Refinements: Overfitting Training Data Adapting to noisy training data is one type of overfitting. Overfitting can also occur when the number of training examples is too small to be representative of the true target function: coincidental regularities may be picked up during training. More precisely: Definition: Given a hypothesis space H, a hypothesis h ∈ H overfits the training data if there is another hypothesis h′ ∈ H such that h has smaller error than h′ over the training data, but h′ has a smaller error over the entire distribution of instances. Dr. Rao Muhammad Adeel Nawab 67
• 68. Refinements: Overfitting Training Data Overfitting is a real problem for decision tree learning: one empirical study found a 10%–25% decrease in accuracy over a range of tasks. Overfitting is a problem for many other machine learning methods too Dr. Rao Muhammad Adeel Nawab 68
• 69. Refinements: Overfitting Training Data (Example) Example of ID3 learning which medical patients have a form of diabetes: the accuracy of the tree over the training examples increases monotonically as the tree grows (to be expected), while the accuracy of the tree over independent test examples increases until the tree has about 25 nodes, then decreases Dr. Rao Muhammad Adeel Nawab 69
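The shape of this curve can be reproduced with a few lines of scikit-learn, assuming that library is available; the bundled breast-cancer dataset is used here purely as a stand-in for the medical data mentioned above, and tree size is controlled by the number of leaves rather than nodes.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

for leaves in (2, 5, 10, 25, 50, 100):
    model = DecisionTreeClassifier(criterion="entropy", max_leaf_nodes=leaves, random_state=0)
    model.fit(X_train, y_train)
    print(leaves,
          round(model.score(X_train, y_train), 3),  # keeps rising as the tree grows
          round(model.score(X_test, y_test), 3))    # typically levels off, then drops

Training accuracy rises monotonically as more leaves are allowed, while test accuracy typically levels off and then declines, which is the overfitting pattern described above.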
  • 70. Refinements: Avoiding Overfitting How can overfitting be avoided? Two general approaches: stop growing tree before perfectly fitting training data e.g. when data split is not statistically significant grow full tree, then prune afterwards In practice, second approach has been more successful Dr. Rao Muhammad Adeel Nawab 70
  • 71. Refinements: Avoiding Overfitting For either approach, how can optimal final tree size be decided? use a set of examples distinct from training examples to evaluate quality of tree; or use all data for training but apply statistical test to decide whether expanding/pruning a given node is likely to improve performance over whole instance distribution; or measure complexity of encoding training examples + decision tree and stop growing tree when this size is minimized – minimum description length principle Dr. Rao Muhammad Adeel Nawab 71
  • 72. Refinements: Avoiding Overfitting First approach most common – called training and validation set approach. Divide available instances into training set – commonly 2/3 of data validation set – commonly 1/3 of data Hope is that random errors and coincidental regularities learned from training set will not be present in validation set Dr. Rao Muhammad Adeel Nawab 72
  • 73. Refinements: Reduced Error Pruning Assumes data split into training and validation sets. Proceed as follows: Train decision tree on training set Do until further pruning is harmful: for each decision node evaluate impact on validation set of removing that node and those below it remove node that most improves accuracy on validation set Dr. Rao Muhammad Adeel Nawab 73
  • 74. Refinements: Reduced Error Pruning How is impact of removing a node evaluated? When a decision node is removed the subtree rooted at it is replaced with a leaf node whose classification is the most common classification of examples beneath the decision node Dr. Rao Muhammad Adeel Nawab 74
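A minimal sketch of this procedure is shown below, assuming trees are nested dicts of the form {attribute: {value: subtree-or-label}} as in the earlier ID3 sketch. The helpers classify, accuracy, majority_label and reduced_error_prune are hypothetical names; this bottom-up variant replaces a subtree with a majority-class leaf whenever doing so does not hurt validation accuracy, which approximates the remove-the-most-beneficial-node loop described above.

from collections import Counter

def classify(tree, x, default="No"):
    # Walk a nested-dict tree {attribute: {value: subtree-or-label}}.
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr].get(x[attr], default)
    return tree

def accuracy(tree, examples):
    return sum(classify(tree, x) == y for x, y in examples) / len(examples)

def majority_label(examples):
    return Counter(y for _, y in examples).most_common(1)[0][0]

def reduced_error_prune(tree, train, validation):
    # Prune the children first, then consider replacing this decision node with a
    # leaf labelled with the most common class of the training examples sorted to it,
    # keeping the replacement only if validation accuracy does not drop.
    if not isinstance(tree, dict):
        return tree
    attr = next(iter(tree))
    for v in tree[attr]:
        subset = [(x, y) for x, y in train if x[attr] == v]
        if subset:
            tree[attr][v] = reduced_error_prune(tree[attr][v], subset, validation)
    leaf = majority_label(train)
    if accuracy(leaf, validation) >= accuracy(tree, validation):
        return leaf
    return tree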
• 75. Refinements: Reduced Error Pruning (cont…) To assess the value of reduced error pruning, split the data into 3 distinct sets: 1. training examples for building the original tree 2. validation examples for guiding tree pruning 3. test examples to provide an estimate of accuracy over future unseen examples Dr. Rao Muhammad Adeel Nawab 75
• 76. Refinements: Reduced Error Pruning (cont.) On the previous (diabetes) example, reduced error pruning removes the overfitted portions of the tree and restores accuracy on the independent test examples. Drawback: holding data back for a validation set reduces the data available for training Dr. Rao Muhammad Adeel Nawab 76
• 77. Refinements: Rule Post-Pruning Perhaps the most frequently used method (e.g., C4.5) Proceed as follows: 1. Convert tree to equivalent set of rules 2. Prune each rule independently of others 3. Sort final rules into desired sequence for use Convert tree to rules by making the conjunction of decision nodes along each branch the antecedent of a rule and each leaf the consequent Dr. Rao Muhammad Adeel Nawab 77
  • 78. Refinements: Rule Post-Pruning Dr. Rao Muhammad Adeel Nawab 78
• 79. Refinements: Rule Post-Pruning (cont) To prune rules, remove any precondition (= conjunct in the antecedent) of a rule whose removal does not worsen rule accuracy Rule accuracy can be estimated either by using a separate validation set, or by using the training data with a statistically-based pessimistic estimate of rule accuracy (C4.5) Dr. Rao Muhammad Adeel Nawab 79
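A minimal sketch of rule extraction and pruning over the same nested-dict trees; tree_to_rules, rule_accuracy and prune_rule are hypothetical helper names, and rule accuracy is estimated here on a held-out validation set rather than with C4.5's pessimistic estimate.

def tree_to_rules(tree, conditions=()):
    # One rule per root-to-leaf path: (tuple of (attribute, value) tests, leaf label).
    if not isinstance(tree, dict):
        return [(conditions, tree)]
    attr = next(iter(tree))
    rules = []
    for value, subtree in tree[attr].items():
        rules += tree_to_rules(subtree, conditions + ((attr, value),))
    return rules

def rule_accuracy(conditions, label, examples):
    covered = [(x, y) for x, y in examples if all(x[a] == v for a, v in conditions)]
    if not covered:
        return 0.0
    return sum(y == label for _, y in covered) / len(covered)

def prune_rule(conditions, label, validation):
    # Greedily drop any precondition whose removal does not lower estimated accuracy.
    conditions = list(conditions)
    improved = True
    while improved and conditions:
        improved = False
        base = rule_accuracy(conditions, label, validation)
        for c in list(conditions):
            rest = [d for d in conditions if d != c]
            if rule_accuracy(rest, label, validation) >= base:
                conditions, improved = rest, True
                break
    return tuple(conditions), label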
  • 80. Refinements: Rule Post-Pruning (cont) Three advantages of converting trees to rules before pruning: 1. converting to rules allows distinguishing different contexts in which rules are used – treat each path through tree differently contrast: removing a decision node removes all paths beneath it 2. removes distinction between testing nodes near root and those near leaves – avoids need to rearrange tree should higher nodes be removed 3. rules often easier for people to understand Dr. Rao Muhammad Adeel Nawab 80
• 81. Refinements: Continuous-valued Attributes The initial definition of ID3 is restricted to discrete-valued (1) target attributes and (2) decision node attributes. We can overcome the second limitation by dynamically defining new discrete-valued attributes that partition a continuous attribute's value into a set of discrete intervals Dr. Rao Muhammad Adeel Nawab 81
• 82. Refinements: Continuous-valued Attributes So, for a continuous attribute A, dynamically create a new Boolean attribute Ac that is true if A > c and false otherwise. How do we pick c? → Pick the c that maximises information gain Dr. Rao Muhammad Adeel Nawab 82
• 83. Refinements: Continuous-valued Attributes E.g. suppose for the PlayTennis example we want Temperature to be a continuous attribute Temperature: 40 48 60 72 80 90 PlayTennis: No No Yes Yes Yes No Sort by temperature and identify candidate thresholds midway between points where the target attribute changes: (48+60)/2 = 54 and (80+90)/2 = 85 Compute the information gain for Temperature>54 and Temperature>85 and select the higher (Temperature>54) Can be extended to split a continuous attribute into more than 2 intervals Dr. Rao Muhammad Adeel Nawab 83
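The threshold search for this Temperature example can be sketched as follows; best_threshold is a hypothetical helper introduced here, and only two-way splits are considered.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    # Candidate thresholds lie midway between adjacent sorted values where the
    # class label changes; keep the one giving the largest information gain.
    pairs = sorted(zip(values, labels))
    candidates = [(pairs[i][0] + pairs[i + 1][0]) / 2
                  for i in range(len(pairs) - 1)
                  if pairs[i][1] != pairs[i + 1][1]]
    base = entropy(labels)
    def gain(c):
        below = [y for v, y in pairs if v <= c]
        above = [y for v, y in pairs if v > c]
        return base - (len(below) * entropy(below) + len(above) * entropy(above)) / len(pairs)
    return max(candidates, key=gain)

temperature = [40, 48, 60, 72, 80, 90]
play = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(best_threshold(temperature, play))  # prints 54.0, i.e. the attribute Temperature > 54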
• 84. Refinements: Alternative Attribute Selection Measures The information gain measure favours attributes with many values over those with few values. E.g. if we add a Date attribute to the PlayTennis example it will have a distinct value for each day and will have the highest information gain. This is because Date perfectly predicts the target attribute for all training examples. The result is a tree of depth 1 that perfectly classifies the training examples but fails on all other data Dr. Rao Muhammad Adeel Nawab 84
  • 85. Refinements: Alternative Attribute Selection Measures Can avoid this by using other attribute selection measures. One alternative is gain ratio Dr. Rao Muhammad Adeel Nawab 85
• 86. Refinements: Alternative Attribute Selection Measures GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A), where SplitInformation(S, A) = − Σ i=1..c (|Si| / |S|) log2(|Si| / |S|) and Si is the subset of S for which the c-valued attribute A has value vi (Note: Split Information is the entropy of S w.r.t. the values of A) This has the effect of penalizing attributes with many, uniformly distributed values Experiments with variants of this and other attribute selection measures have been carried out and are reported in the machine learning literature Dr. Rao Muhammad Adeel Nawab 86
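A minimal sketch of the two quantities, continuing the (feature-dict, label) representation of the earlier sketches; split_information and gain_ratio are hypothetical helper names, and the gain helper is passed in rather than redefined.

import math
from collections import Counter

def split_information(examples, attr):
    # Entropy of S with respect to the values of attribute attr (not the class label);
    # examples are (feature_dict, label) pairs as in the earlier sketches.
    n = len(examples)
    counts = Counter(x[attr] for x, _ in examples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def gain_ratio(examples, attr, info_gain):
    # info_gain is the Gain(S, A) helper from the earlier sketch, passed in here.
    si = split_information(examples, attr)
    return info_gain(examples, attr) / si if si > 0 else 0.0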
• 87. Refinements: Missing/Unknown Attribute Values What if a training example x is missing the value of attribute A? Several alternatives have been explored. At the decision node n where Gain(S, A) is computed: assign the most common value of A among the other examples sorted to node n; or assign the most common value of A among the other examples at n with the same target attribute value as x; or assign a probability pi to each possible value vi of A, estimated from the observed frequencies of the values of A among the examples sorted to node n Dr. Rao Muhammad Adeel Nawab 87
• 88. Refinements: Missing/Unknown Attribute Values Then assign a fraction pi of example x down each branch of the tree below n (this technique is used in C4.5) The last technique can also be used to classify new examples with missing attributes (i.e. after learning) in the same fashion Dr. Rao Muhammad Adeel Nawab 88
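The fractional-example idea can be sketched as follows; value_distribution and distribute are hypothetical helpers, and None is used here to mark a missing attribute value.

from collections import Counter

def value_distribution(examples, attr):
    # Observed frequencies of attr among the examples sorted to this node
    # (missing values, marked None, are ignored when estimating the p_i).
    counts = Counter(x[attr] for x, _ in examples if x.get(attr) is not None)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}

def distribute(example, weight, attr, distribution):
    # Send a weighted example down the branches below a test on attr: a known value
    # keeps all its weight on one branch; a missing value is split fractionally,
    # weight * p_i per branch, in the style of C4.5.
    x, y = example
    if x.get(attr) is not None:
        return [(x[attr], (x, y), weight)]
    return [(v, (x, y), weight * p) for v, p in distribution.items()]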
• 89. Refinements: Attributes with Differing Costs Different attributes may have different costs associated with acquiring their values E.g. in medical diagnosis, different tests, such as blood tests and brain scans, have different costs; in robotics, positioning a sensing device on a robot so as to take different measurements requires differing amounts of time (= cost) Dr. Rao Muhammad Adeel Nawab 89
• 90. Refinements: Attributes with Differing Costs How can we learn a consistent tree with low expected cost? Various approaches have been explored in which the attribute selection measure is modified to include a cost term, e.g. replacing Gain(S, A) with Gain(S, A)² / Cost(A) (Tan and Schlimmer) or with (2^Gain(S, A) − 1) / (Cost(A) + 1)^w, where w ∈ [0, 1] controls the weight given to cost (Nunez) Dr. Rao Muhammad Adeel Nawab 90
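For instance, the first measure above can be written as a small selection function (a sketch; cost is assumed to be a dict mapping attribute names to measurement costs, and info_gain is the gain helper from the earlier sketches).

def cost_sensitive_gain(examples, attr, cost, info_gain):
    # Tan and Schlimmer's measure: squared information gain divided by the cost of
    # measuring the attribute, so cheap, informative attributes are preferred.
    return info_gain(examples, attr) ** 2 / cost[attr]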
  • 91. Summary Decision trees classify instances. Testing starts at the root and proceeds downwards: Non-leaf nodes test one attribute of the instance and the attribute value determines which branch is followed. Leaf nodes are instance classifications. Decision trees are appropriate for problems where: instances are describable by attribute–value pairs (typically, but not necessarily, nominal); target function is discrete valued (typically, but not necessarily); disjunctive hypotheses may be required; training data may be noisy/incomplete. Dr. Rao Muhammad Adeel Nawab 91
• 92. Summary (cont….) Various algorithms have been proposed to learn decision trees – ID3 is the classic. ID3: recursively grows the tree from the root, picking at each point the attribute which maximises information gain with respect to the training examples sorted to the current node; the recursion stops when all examples down a branch fall into a single class or all attributes have been tested ID3 carries out an incomplete search of a complete hypothesis space – contrast with CANDIDATE-ELIMINATION which carries out a complete search of an incomplete hypothesis space. Dr. Rao Muhammad Adeel Nawab 92
• 93. Summary (cont…) Decision trees exhibit an inductive bias which prefers shorter trees with high information gain attributes closer to the root (at least where information gain is used as the attribute selection criterion, as in ID3) ID3 searches a complete hypothesis space for discrete-valued functions, but searches the space incompletely, using the information gain heuristic Dr. Rao Muhammad Adeel Nawab 93
• 94. Summary (cont…) Overfitting the training data is an important issue in decision tree learning. Noise or coincidental regularities due to small samples may mean that while growing a tree beyond a certain size improves its performance on the training data, it worsens its performance on unseen instances Overfitting can be addressed by post-pruning the decision tree in a variety of ways Dr. Rao Muhammad Adeel Nawab 94
  • 95. Summary (cont…) Various other refinements of the basic ID3 algorithm address issues such as: handling real-valued attributes handling training/test instances with missing attribute values using attribute selection measures other than information gain allowing costs to be associated with attributes Dr. Rao Muhammad Adeel Nawab 95
  • 96. How To Become a Great Human Being Dr. Rao Muhammad Adeel Nawab 96
• 97. Balanced Life is Ideal Life Get Excellence in five things 1. Health 2. Spirituality 3. Work 4. Friend 5. Family A Journey from BEGINNER to EXCELLENCE You must have a combination of the five things with different variations; however, the aggregate will be the same. Dr. Rao Muhammad Adeel Nawab 97
  • 98. Excellence 1. Health I can run (or brisk walk) 5 kilometers in one go I take 7-9 hours sleep per night (TIP: Go to bed at 10pm) I take 3 meals of balanced diet daily 2. Spirituality Dr. Rao Muhammad Adeel Nawab 98
• 99. Excellence 3. Work Become an authority in your field For example - Dr. Abdul Qadeer Khan Sb is an authority in research 4. Friend Have a DADDU YAR in life to drain out to on a daily basis 5. Family 1. Take the Duas of parents and elders by doing their khidmat (service) and adab (respect) 2. Your wife/husband should be your best friend 3. Be humble and kind to kids, subordinates and poor people Dr. Rao Muhammad Adeel Nawab 99
  • 100. Dr. Rao Muhammad Adeel Nawab 100 It is a state of complete 1. physical 2. mental 3. social wellbeing, and not merely the absence of disease or infirmity. Definition by World Health Organization (WHO)
• 101. Dr. Rao Muhammad Adeel Nawab 101 CHANGE is never a matter of ABILITY it is always a matter of MOTIVATION Man (mind) + Tan (body) ⟶ both need good quality food to remain healthy Focus on OUTCOMES not ACTIVITIES Motivation for Physical Health
• 102. Daily running and exercise Dr. Rao Muhammad Adeel Nawab 102 Motivation for my students and friends
  • 103. Technology is the biggest addiction after drugs Trend vs Comfort Control vs Quit How to Spare Time for Health and Fitness
• 104. 1. Get ADEQUATE Sleep For adults: 7 to 9 hours of regular sleep per night. Research has shown that the amount of sleep is an important indicator of health and well-being. Go to bed for sleep between 9:00 pm and 10:00 pm Make a Schedule with a particular focus on 3 things
• 105. 2. Eat a HEALTHY diet A healthy diet contains mostly fruits and vegetables and includes little to no processed food and sweetened beverages (see The China Study) Make a Schedule with a particular focus on 3 things
  • 106. 3. Exercise REGULARLY Exercise is any bodily activity that enhances or maintains physical fitness and overall health and wellness. I am 55 years old and I can run (or brisk walk) five kilometers in one go (Prof. Roger Moore, University of Sheffield, UK) At least have brisk walk of 30 to 60 minutes daily Make a Schedule with a particular focus on 3 things
  • 107. No Pain No Gain Key to Success