Decision Tree (ID3)
Xueping Peng
Xueping.peng@uts.edu.au
Outline
 What is a decision tree
 How to Use a Decision Tree
 How to Generate a Decision Tree
 Sum Up and Some Drawbacks
What is a decision tree (1/3)
 A decision tree is a hierarchical tree structure used to assign class labels
based on a series of questions (or rules) about the attributes of the data.
 The attributes can be variables of any type: binary, nominal, ordinal, or
quantitative.
 The class must be qualitative (categorical, binary, or ordinal).
 In short, given data records with their attributes and classes, a decision
tree produces a sequence of rules (or series of questions) that can be used
to recognize the class.
What is a decision tree (2/3)
 Example data: four attributes and one class (Transportation Mode)

Gender   Car Ownership   Travel Cost ($)/km   Income Level   Transportation Mode
Male     0               Cheap                Low            Bus
Male     1               Cheap                Medium         Bus
Female   1               Cheap                Medium         Train
Female   0               Cheap                Low            Bus
Male     1               Cheap                Medium         Bus
Female   0               Standard             Medium         Train
Female   1               Standard             Medium         Train
Female   1               Expensive            High           Car
Male     2               Expensive            Medium         Car
Female   2               Expensive            High           Car
What is a decision tree (3/3)
How to Use a Decision Tree
 Test data

Person Name   Gender   Car Ownership   Travel Cost ($)/km   Income Level   Transportation Mode
Alex          Male     1               Standard             High           ?
Buddy         Male     0               Cheap                Medium         ?
Cherry        Female   1               Cheap                High           ?

 What transportation mode would Alex, Buddy and Cherry use? (See the sketch
below, which traces each record down the tree generated in the rest of the deck.)
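 The answers can be read off by walking each record down the tree grown in the
following slides. The sketch below encodes that tree as nested conditionals; note
that the Cheap/Female branch has a gain tie between Car Ownership and Income Level,
so splitting on Car Ownership there is an assumption, and income_level is not
needed on these paths.

```python
def predict(gender, car_ownership, travel_cost, income_level):
    """Walk one record down the decision tree grown later in the deck."""
    if travel_cost == "Expensive":
        return "Car"
    if travel_cost == "Standard":
        return "Train"
    # Travel Cost = Cheap: split on Gender, then (assumed) on Car Ownership
    if gender == "Male":
        return "Bus"
    return "Train" if car_ownership >= 1 else "Bus"

print(predict("Male",   1, "Standard", "High"))    # Alex   -> Train
print(predict("Male",   0, "Cheap",    "Medium"))  # Buddy  -> Bus
print(predict("Female", 1, "Cheap",    "High"))    # Cherry -> Train
```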
How to Generate a Decision Tree(1/13)
 Description of ID3: grow the tree top-down, at each node choosing the attribute
with the highest information gain to split on (a sketch of the recursion follows below)
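 The original slide presents the procedure as a figure; below is a minimal Python
sketch of the recursion, under the assumption that rows are tuples or dicts indexed
by attribute (the helper names entropy, information_gain and id3 are illustrative,
not from the slides).

```python
import math
from collections import Counter

def entropy(labels):
    """H(S) = -sum_j p_j log2 p_j over the class labels in S."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Expected reduction in entropy from partitioning on attribute `attr`."""
    gain = entropy(labels)
    for value in {row[attr] for row in rows}:
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        gain -= (len(subset) / len(labels)) * entropy(subset)
    return gain

def id3(rows, labels, attributes):
    """Grow the tree top-down: pick the highest-gain attribute, split, recurse."""
    if len(set(labels)) == 1:                 # pure node: return its class
        return labels[0]
    if not attributes:                        # nothing left to split on: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    branches = {}
    for value in {row[best] for row in rows}:
        keep = [i for i, row in enumerate(rows) if row[best] == value]
        branches[value] = id3([rows[i] for i in keep], [labels[i] for i in keep],
                              [a for a in attributes if a != best])
    return {best: branches}
```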
How to Generate a Decision Tree(2/13)
 Which is the best choice?
 We have 29 positive examples and 35 negative ones
 Should attribute A1 or attribute A2 be used to split at this node?
How to Generate a Decision Tree(3/13)
 Use entropy to measure the degree of impurity of a node
 Entropy: H(S) = – Σj pj log2 pj, where pj is the probability of class j in S
How to Generate a Decision Tree(4/13)
 What does Entropy mean?
 Entropy is the minimum number of bits needed to encode the
classification of a randomly drawn member of S.
 P+ = 1: the receiver knows the class, no message is sent, Entropy = 0.
 P+ = 0.5: one bit is needed.
 An optimal-length code assigns –log2 p bits to a message with probability p.
 The idea is to assign shorter codes to the more probable messages
and longer codes to the less likely ones.
 Thus, the expected number of bits to encode + or – for a random
member of S is:
 H(S) = p+ (–log2 p+) + p– (–log2 p–)
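 As a quick check of these values, a minimal sketch (binary_entropy is an
illustrative helper name, not from the slides):

```python
import math

def binary_entropy(p_pos):
    """H(S) = p+ * (-log2 p+) + p- * (-log2 p-), treating 0 * log2 0 as 0."""
    h = 0.0
    for p in (p_pos, 1.0 - p_pos):
        if p > 0:
            h -= p * math.log2(p)
    return h

print(binary_entropy(1.0))      # 0.0   -> class is certain, no bits needed
print(binary_entropy(0.5))      # 1.0   -> one bit needed
print(binary_entropy(29 / 64))  # ~0.993, the H(S) used on the next slides
```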
How to Generate a Decision Tree(5/13)
 Information Gain
 Measures the expected reduction in entropy caused by partitioning
the examples according to the given attribute
 IG(S|A): the number of bits saved when encoding the target value of
an arbitrary member of S, knowing the value of attribute A.
 Expected reduction in entropy caused by knowing the value of A
 IG(S|A) = H(S) – Σj Prob(A=vj) H(S | A=vj)
How to Generate a Decision Tree(6/13)
 Which is the best choice?
 We have 29 positive examples and 35 negative ones
 Should attribute A1 or attribute A2 be used to split at this node?
IG(A1) = 0.993 – 26/64 * 0.70 – 38/64 * 0.74 ≈ 0.27
IG(A2) = 0.993 – 51/64 * 0.93 – 13/64 * 0.61 ≈ 0.13
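 The sketch below recomputes both gains from the quantities quoted above, assuming
the rounded subset entropies from the slide and taking the A1 weights as 26/64 and
38/64 so that they sum to 1:

```python
import math

def entropy(pos, neg):
    """Entropy of a node with `pos` positive and `neg` negative examples."""
    total = pos + neg
    h = 0.0
    for n in (pos, neg):
        if n:
            h -= (n / total) * math.log2(n / total)
    return h

def gain(parent_entropy, weighted_children):
    """IG(S|A) = H(S) - sum_j Prob(A=v_j) * H(S | A=v_j)."""
    return parent_entropy - sum(w * h for w, h in weighted_children)

h_s = entropy(29, 35)                                 # ~0.993
print(gain(h_s, [(26/64, 0.70), (38/64, 0.74)]))      # ~0.27 for A1
print(gain(h_s, [(51/64, 0.93), (13/64, 0.61)]))      # ~0.13 for A2
```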
How to Generate a Decision Tree(7/13)
 Specific Conditional Entropy H(Y|X=v)
 Y is class, X is attribute and v is value of X
 H(Y |X=v) = The entropy of Y among only those records in which X
has value v
 H(Class | Travel Cost = Cheap) =
–0.8 * log2 0.8 – 0.2 * log2 0.2 = 0.722
 H(Class | Travel Cost = Expensive) =
–1 * log2 1 = 0
 H(Class | Travel Cost = Standard) =
–1 * log2 1 = 0
How to Generate a Decision Tree(8/13)
 Conditional Entropy H(Y|X)
 H(Y|X) = the average specific conditional entropy of Y
= Σj Prob(X=vj) H(Y | X = vj)
 e.g. H(Class|Travel Cost) =
prob(Travel Cost=Cheap) * H(Class|Travel Cost=Cheap) +
prob(Travel Cost=Expensive) * H(Class|Travel Cost=Expensive) +
prob(Travel Cost=Standard) * H(Class|Travel Cost=Standard)
= 0.5 * 0.722 + 0.3 * 0 + 0.2 * 0 = 0.361
How to Generate a Decision Tree(9/13)
 Information Gain IG(Y|X)
 IG(Y|X) = H(Y) - H(Y | X)
 e.g.
 H(Class) = – 0.4 log2 (0.4) – 0.3 log2 (0.3) – 0.3 log2 (0.3) = 1.571
 IG(Class|Travel Cost) = H(Class) – H(Class|Travel Cost)
= 1.571 – 0.361 = 1.210
 Results of the first iteration (the Travel Cost figures are recomputed in the sketch below):

Gain   Gender   Car Ownership   Travel Cost ($)/km   Income Level
IG     0.125    0.534           1.210                0.695
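 The Travel Cost figures can be recomputed directly from the training table; a
minimal sketch follows, assuming a tuple encoding of the table with Travel Cost at
index 2 and the class last:

```python
import math
from collections import Counter

# (Gender, Car Ownership, Travel Cost, Income Level, Transportation Mode)
data = [
    ("Male",   0, "Cheap",     "Low",    "Bus"),
    ("Male",   1, "Cheap",     "Medium", "Bus"),
    ("Female", 1, "Cheap",     "Medium", "Train"),
    ("Female", 0, "Cheap",     "Low",    "Bus"),
    ("Male",   1, "Cheap",     "Medium", "Bus"),
    ("Female", 0, "Standard",  "Medium", "Train"),
    ("Female", 1, "Standard",  "Medium", "Train"),
    ("Female", 1, "Expensive", "High",   "Car"),
    ("Male",   2, "Expensive", "Medium", "Car"),
    ("Female", 2, "Expensive", "High",   "Car"),
]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

classes = [row[-1] for row in data]
h_class = entropy(classes)                                    # ~1.571

# H(Class | Travel Cost) = sum over values v of P(v) * H(Class | Travel Cost = v)
h_cond = 0.0
for value in {row[2] for row in data}:
    subset = [row[-1] for row in data if row[2] == value]
    h_cond += (len(subset) / len(data)) * entropy(subset)     # Cheap: 0.722, others: 0

print(round(h_class, 3), round(h_cond, 3), round(h_class - h_cond, 3))  # 1.571 0.361 1.21
```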
How to Generate a Decision Tree(10/13)
 Root node: Travel Cost ($)/km, the attribute with the highest information gain
 Split the records into the Cheap, Standard and Expensive branches
How to Generate a Decision Tree(11/13)
 Second iteration: recurse on the Cheap branch, which is still impure (4 Bus, 1 Train)
How to Generate a Decision Tree(12/13)
 Results of the second iteration (on the Cheap branch; recomputed in the sketch below)
 Split node: Gender has the highest gain
 Update the decision tree

Gain   Gender   Car Ownership   Income Level
IG     0.322    0.171           0.171
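 These gains can be verified on the five Travel Cost = Cheap rows; a small sketch,
assuming the same tuple encoding as before but without the Travel Cost column:

```python
import math
from collections import Counter

# The Travel Cost = Cheap rows: (Gender, Car Ownership, Income Level, Transportation Mode)
cheap = [
    ("Male",   0, "Low",    "Bus"),
    ("Male",   1, "Medium", "Bus"),
    ("Female", 1, "Medium", "Train"),
    ("Female", 0, "Low",    "Bus"),
    ("Male",   1, "Medium", "Bus"),
]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, attr_index):
    labels = [r[-1] for r in rows]
    g = entropy(labels)                        # 0.722 for this 4-Bus / 1-Train node
    for value in {r[attr_index] for r in rows}:
        subset = [r[-1] for r in rows if r[attr_index] == value]
        g -= (len(subset) / len(rows)) * entropy(subset)
    return g

for name, idx in [("Gender", 0), ("Car Ownership", 1), ("Income Level", 2)]:
    print(name, round(gain(cheap, idx), 3))    # 0.322, 0.171, 0.171
```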
How to Generate a Decision Tree(13/13)
 Third iteration: split the remaining impure node (the Cheap, Female branch)
 Update the decision tree
To Sum Up
 ID3 is a strong system that
 Uses hill-climbing search based on the information gain measure
to search through the space of decision trees
 Outputs a single hypothesis
 Never backtracks. It converges to locally optimal solutions
 Uses all training examples at each step, contrary to methods that
make decisions incrementally
 Uses statistical properties of all examples: the search is less
sensitive to errors in individual training examples
Some Drawbacks
 It can only deal with nominal data
 It may not be robust in the presence of noise
 It is not able to deal with noisy data sets
References
 Tutorial on Decision Tree,
http://people.revoledu.com/kardi/tutorial/DecisionTree/index.html
 Information Gain,
http://www.autonlab.org/tutorials/infogain11.pdf
 http://www.slideshare.net/aorriols/lecture5-c45