7/4/2008 1
Decision Tree Approach in Data Mining
What is data mining ?
The process of extracting previously unknown
and potentially useful information from large
databases.
Several data mining approaches are in common use:
Association Rules
Decision Tree
Neural Network Algorithm
7/4/2008 2
Decision Tree Induction
A decision tree is a flow-chart-like tree
structure, where each internal node
denotes a test on an attribute, each
branch represents an outcome of the test,
and leaf nodes represent classes or class
distribution.
7/4/2008 3
Data Mining Approach - Decision Tree
• a model that is both predictive and
descriptive
• can help identify which factors to
consider and how each factor is
associated with a business decision
• most commonly used for classification
(predicting which group a case belongs to)
• several decision tree induction
algorithms exist, e.g. C4.5, CART, CAL5, ID3,
etc.
7/4/2008 4
Algorithm for building Decision
Trees
Decision trees are a popular structure for
supervised learning. They are
constructed using attributes best able to
differentiate the concepts to be learned.
A decision tree is built by initially
selecting a subset of instances from a
training set. This subset is then used by
the algorithm to construct a decision
tree. The remaining training-set
instances are then used to test the accuracy of the
constructed tree.
7/4/2008 5
If the decision tree classifies the
instances correctly, the procedure
terminates. If an instance is
incorrectly classified, the instance
is added to the selected subset of
training instances and a new tree is
constructed. This process
continues until a tree that correctly
classifies all non-selected instances is
created, or the decision tree is built
from the entire training set.
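A minimal sketch of this windowing loop, under some assumptions the slide leaves open: train_tree() and predict() are hypothetical stand-ins for an actual decision-tree learner, and each case is a dict with "attributes" and "class" keys.

```python
import random

def window_training(training_set, initial_size, train_tree, predict):
    """Windowing: train on a sampled window, add misclassified cases, repeat."""
    window = random.sample(training_set, initial_size)      # initial subset of instances
    rest = [case for case in training_set if case not in window]
    while rest:
        tree = train_tree(window)                           # build a tree from the window
        errors = [case for case in rest
                  if predict(tree, case["attributes"]) != case["class"]]
        if not errors:                                      # remaining cases all classified correctly
            return tree
        window.extend(errors)                               # grow the window with the errors
        rest = [case for case in rest if case not in errors]
    return train_tree(window)                               # window now holds the entire training set
```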
7/4/2008 6
Entropy
(a) shows, for a probability p ranging from 0 to 1, the surprise of an event: log(1/p)
(b) shows the contribution of an event that occurs with probability p: p log(1/p)
(c) shows the expected value over both outcomes (occurs + does not occur):
 p log(1/p) + (1-p) log (1/(1-p))
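For reference, a small sketch (the function names are ours) that evaluates the three quantities (a)-(c) above for a few values of p:

```python
import math

def surprise(p):                 # (a): log(1/p)
    return math.log2(1 / p)

def weighted_surprise(p):        # (b): p * log(1/p)
    return p * math.log2(1 / p)

def binary_entropy(p):           # (c): p*log(1/p) + (1-p)*log(1/(1-p))
    return p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))

for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"p={p}:  {surprise(p):.3f}  {weighted_surprise(p):.3f}  {binary_entropy(p):.3f}")
```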
7/4/2008 7
Training Process
Block diagram of the training process (labels from the original figure):
  Data Preparation Stage: Sample Data Set → Training Set / Testing Set
    (Windowing Process)
  Tree Building Stage: Training Set → Construct Decision Tree & Ruleset
    Process → Trained Classifier
  Prediction Stage: Testing Set + Trained Classifier → Prediction Process →
    Results
7/4/2008 8
Basic algorithm for inducing a decision
tree
• Algorithm: Generate_decision_tree. Generate a
decision tree from the given training data.
• Input: The training samples, represented by
discrete-valued attributes; the set of candidate
attributes, attribute-list;
• Output: A decision tree
7/4/2008 9
Begin
  Partition (S)
    If (all records in S are of the same class, or only 1 record is found in S)
      then return;
    For each attribute Ai do
      evaluate splits on attribute Ai;
    Use the best split found to partition S into S1 and S2, growing a tree
      with the two subtrees Partition (S1) and Partition (S2);
    Repeat the partitioning for Partition (S1) and Partition (S2) until the
      tree-growing stop criteria are met;
End;
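A rough, runnable Python rendering of Partition(S), under assumptions the slide leaves open: records are dicts with a "class" key, splits are binary tests on discrete attribute values, and "best split" is taken to mean highest information gain.

```python
import math
from collections import Counter

def info(records):
    """Expected information (entropy) of the class distribution in records."""
    counts = Counter(r["class"] for r in records)
    n = len(records)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def partition(records, attributes):
    # stop: pure node or a single record
    if len(records) <= 1 or len({r["class"] for r in records}) == 1:
        return {"leaf": Counter(r["class"] for r in records)}
    best = None
    for attr in attributes:                       # evaluate splits on each attribute
        for value in {r[attr] for r in records}:
            s1 = [r for r in records if r[attr] == value]
            s2 = [r for r in records if r[attr] != value]
            if not s1 or not s2:
                continue
            gain = info(records) - (len(s1) * info(s1) + len(s2) * info(s2)) / len(records)
            if best is None or gain > best[0]:
                best = (gain, attr, value, s1, s2)
    if best is None:                              # no useful split left
        return {"leaf": Counter(r["class"] for r in records)}
    _, attr, value, s1, s2 = best
    return {"test": (attr, value),
            "left": partition(s1, attributes),    # Partition (S1)
            "right": partition(s2, attributes)}   # Partition (S2)
```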
7/4/2008 10
Information Gain
The difference between the information needed for
correct classification before and after the split.
For example, before the split there are 4 possible
outcomes, which take 2 bits of information to
represent. After splitting on attribute A, each of the
two resulting branches carries two outcomes, which
take only 1 bit to represent. Thus, choosing
attribute A yields an information gain of one bit.
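Written out, and assuming the four outcomes are equally likely (which the slide implies but does not state), the arithmetic is:

```latex
\[
\mathrm{Info}_{\mathrm{before}} = \log_2 4 = 2 \text{ bits}, \qquad
\mathrm{Info}_{\mathrm{after\ split\ on\ } A} = \log_2 2 = 1 \text{ bit}
\]
\[
\mathrm{Gain}(A) = \mathrm{Info}_{\mathrm{before}} - \mathrm{Info}_{\mathrm{after}} = 2 - 1 = 1 \text{ bit}
\]
```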
7/4/2008 11
Classification Rule Generation
• Generate Rules
– rewrite the tree to a collection of rules, one for each tree leaf
– e.g. Rule 1: IF ‘outlook = rain’ AND ‘windy = false’ THEN
‘play’
• Simplifying Rules
– delete any irrelevant rule condition without affecting its
accuracy
– e.g. Rule R: IF r1 AND r2 AND r3 THEN class1
– Condition: if Error Rate (R-), i.e. R without r1, < Error Rate (R) =>
delete rule condition r1
– Resultant Rule R-: IF r2 AND r3 THEN class1
• Ranking Rules
– order the rules according to their error rate (see the sketch below)
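A small sketch of the simplification and ranking steps above, under assumed representations: a rule is a (conditions, class) pair with conditions as (attribute, value) tuples, and the error estimate here is a plain resubstitution rate rather than C4.5's pessimistic one.

```python
def error_rate(conditions, cls, records):
    """Fraction of covered records whose class differs from cls."""
    covered = [r for r in records if all(r[a] == v for a, v in conditions)]
    if not covered:
        return 1.0          # arbitrary worst case when nothing is covered
    wrong = sum(1 for r in covered if r["class"] != cls)
    return wrong / len(covered)

def simplify(rule, records):
    """Drop a condition whenever the rule without it has a strictly lower error
    rate (the condition stated on the slide)."""
    conditions, cls = rule
    changed = True
    while changed and len(conditions) > 1:
        changed = False
        for cond in list(conditions):
            shorter = [c for c in conditions if c != cond]
            if error_rate(shorter, cls, records) < error_rate(conditions, cls, records):
                conditions = shorter
                changed = True
                break
    return conditions, cls

def rank(rules, records):
    """Order rules by error rate, lowest first."""
    return sorted(rules, key=lambda r: error_rate(r[0], r[1], records))
```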
7/4/2008 12
Decision Tree Rules
Because rules are often more appealing than trees,
variations of the basic tree-to-rule mapping
must be provided. Most variations focus
on simplifying and/or eliminating existing
rules.
7/4/2008 13
Example of simplifying rules of credit card
Income Range   Life Insurance Promotion   Credit Card Insurance   Sex      Age
40-50k         no                         no                      Male     45
30-40k         yes                        no                      Female   40
40-50k         no                         no                      Male     42
30-40k         yes                        yes                     Male     43
50-60k         yes                        no                      Female   38
20-30k         no                         no                      Female   55
30-40k         yes                        yes                     Male     35
20-30k         no                         no                      Male     27
30-40k         no                         no                      Male     43
30-40k         yes                        no                      Female   41
40-50k         yes                        no                      Female   43
20-30k         yes                        no                      Male     29
50-60k         yes                        no                      Female   39
40-50k         no                         no                      Male     55
20-30k         yes                        yes                     Female   19
14/4/2008 14
A rule created by following one path of the tree is:
Case 1:
If Age <= 43 & Sex = Male & Credit Card Insurance = No
Then Life Insurance Promotion = No
The conditions of this rule cover 4 of the 15 instances; 3 of the 4 covered
cases are classified correctly, giving 75% accuracy.
Case 2:
If Sex = Male & Credit Card Insurance = No
Then Life Insurance Promotion = No
The conditions of this simplified rule cover 6 instances, 5 of which are
classified correctly, giving 83.3% accuracy.
Therefore, the simplified rule is both more general and more accurate
than the original rule.
7/4/2008 15
C4.5 Tree Induction Algorithm
• Involves two phases of decision tree
construction
– growing tree phase
– pruning tree phase
• Growing Tree Phase
– a top-down approach that repeatedly
grows the tree; it is a specialization process
• Pruning Tree Phase
– a bottom-up approach that removes sub-
trees by replacing them with leaves; it is a
generalization process
7/4/2008 16
Expected information before splitting
Let S be a set consisting of s data samples. Suppose
the class label attribute has m distinct values
defining m distinct classes Ci, for i = 1, ..., m. Let Si be
the number of samples of S in class Ci. The
expected information needed to classify a given
sample is given by:

Info(S) = - Σ (i=1..m) (Si / S) log2 (Si / S)

Note that a log function to base 2 is used since the
information is encoded in bits.
7/4/2008 17
Expected information after splitting
Let attribute A have v distinct values {a1, a2, ..., av},
and let A be used to split S into v subsets {S1, ..., Sv},
where Sj contains those samples of S that
have value aj for A. After the split, these
subsets correspond to the branches grown
from the node for S.

InfoA(S) = Σ (j=1..v) ((S1j + ... + Smj) / S) · Info(Sj)

Gain (A) = Info (S) - InfoA(S)
7/4/2008 18
C4.5 Algorithm - Growing Tree Phase
Let S = any set of training cases
Let |S| = the number of cases in set S
Let Freq (Ci, S) = the number of cases in S that belong to
class Ci
Info(S) = average amount of information needed to
identify the class of a case in S
Infox(S) = expected information to identify the class of a
case in S after partitioning S with the test on attribute
X
Gain (X) = information gained by partitioning S
according to the test on attribute X
7/4/2008 19
C4.5 Algorithm - Growing Tree Phase
Flow of the growing phase: from the data mining set, find the splitting
attribute and its threshold value, split the tree, and repeat until the
tree-growing termination condition is met.

The decisive attribute for each split is selected by information gain (ratio):

Info(S) = - Σ (i=1..m) (Si / S) log2 (Si / S)

InfoX(S) = Σ (j=1..v) ((S1j + ... + Smj) / S) · Info(Sj)

Gain (X) = Info (S) - InfoX (S)
7/4/2008 20
C4.5 Algorithm - Growing Tree Phase
Let S be the training set (14 weather cases: 9 Play, 5 Don't Play).

Info (S) = -(9/14) log2 (9/14) - (5/14) log2 (5/14) = 0.41 + 0.53 = 0.94
  (where log2 (9/14) = log (9/14) / log 2)

InfoOutlook(S) = (5/14) ( -(2/5) log2 (2/5) - (3/5) log2 (3/5) )
               + (4/14) ( -(4/4) log2 (4/4) - (0/4) log2 (0/4) )
               + (5/14) ( -(3/5) log2 (3/5) - (2/5) log2 (2/5) )
               = 0.694

Gain (Outlook) = 0.94 - 0.694 = 0.246

Similarly, the computed information Gain (Windy)
 = Info(S) - InfoWindy(S) = 0.94 - 0.892 = 0.048

Thus the decision tree splits on attribute Outlook, which has the
higher information gain:

        Root
          |
       Outlook
      /    |    \
  Sunny Overcast Rain
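A quick check of these numbers; this short sketch simply re-evaluates the Info(S) formula above from the class counts quoted on the slide (9/5 overall; 2/3, 4/0 and 3/2 per Outlook branch).

```python
import math

def info(counts):
    """Info(S) for a list of per-class counts."""
    n = sum(counts)
    return sum(-(c / n) * math.log2(c / n) for c in counts if c)

info_s = info([9, 5])                         # 9 Play vs 5 Don't Play  -> 0.940
info_outlook = (5 / 14) * info([2, 3]) \
             + (4 / 14) * info([4, 0]) \
             + (5 / 14) * info([3, 2])        # Sunny, Overcast, Rain   -> 0.694
print(round(info_s, 3), round(info_outlook, 3), round(info_s - info_outlook, 3))
# 0.94 0.694 0.247   (the slide's 0.246 comes from subtracting the rounded values)
```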
7/4/2008 21
After first splitting
After the first split on Outlook, the cases falling into each branch are
(showing their Windy value and class):

Sunny branch:    Windy = TRUE, TRUE, FALSE, FALSE, FALSE
                 Class = Play, Don't Play, Don't Play, Don't Play, Play
Overcast branch: Windy = TRUE, FALSE, TRUE, FALSE
                 Class = Play, Play, Play, Play
Rain branch:     Windy = TRUE, TRUE, FALSE, FALSE, FALSE
                 Class = Don't Play, Don't Play, Play, Play, Play

        Root
          |
       Outlook
      /    |    \
  Sunny Overcast Rain
7/4/2008 22
Decision Tree after the grow tree phase
[Diagram] The root splits on Outlook. The Overcast branch is a pure Play
leaf (100%); the Sunny and Rain branches each split further on Windy,
ending in Play / Don't Play leaves (the original diagram annotates two of
these leaves with 40% and 60%).
7/4/2008 23
7/4/2008 24
Continuous-valued data
If the input sample data contains an attribute that is
continuous-valued rather than discrete-valued (for
example, a person's Age), we must determine the
"best" split-point for that attribute.
One simple choice is to take an average of the
continuous values; a sketch of choosing a split point
by information gain follows.
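A minimal sketch of one common choice (our own illustration, not the exact C4.5 procedure): candidate thresholds are the midpoints between consecutive sorted values, and the threshold with the highest information gain is kept.

```python
import math
from collections import Counter

def info(classes):
    n = len(classes)
    return sum(-c / n * math.log2(c / n) for c in Counter(classes).values())

def best_split_point(values, classes):
    """Pick the threshold (midpoint between consecutive sorted values) with
    the highest information gain for a continuous attribute such as Age."""
    pairs = sorted(zip(values, classes))
    base, n = info(classes), len(classes)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [c for v, c in pairs if v <= t]
        right = [c for v, c in pairs if v > t]
        gain = base - (len(left) * info(left) + len(right) * info(right)) / n
        if best is None or gain > best[0]:
            best = (gain, t)
    return best   # (gain, threshold)

# small made-up example of ages and a yes/no class
print(best_split_point([15, 23, 20, 18, 40, 33], ["Yes", "No", "No", "No", "Yes", "Yes"]))
# -> roughly (0.46, 28.0)
```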
7/4/2008 25
C4.5 Algorithm - Pruning Tree Phase
( Error-Based Pruning Algorithm )

Starting from the bottom-most sub-trees:
  compute the error rate of the original sub-tree (E1);
  compute the error rate of the sub-tree replaced by a leaf (E2);
  if E2 < E1, replace the sub-tree with the leaf;
  repeat until the whole tree has been processed.

U25%(E, N) = predicted error rate
           = (the number of misclassified test cases / the total number of test cases) * 100%
where E is the no. of error cases in the class and
N is the no. of cases in the class.
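A small sketch of the E1-versus-E2 comparison, using the simple misclassification rate quoted above as the error estimate (C4.5 itself uses the U25% upper confidence bound); the function names and case representation are ours.

```python
from collections import Counter

def predicted_error_rate(misclassified, total):
    """Predicted error rate as quoted above: misclassified cases / total cases."""
    return misclassified / total if total else 0.0

def should_prune(subtree_predictions, actual_classes):
    """Compare the sub-tree error (E1) with the error of a single replacement
    leaf labelled with the majority class (E2); prune when E2 < E1."""
    total = len(actual_classes)
    e1 = predicted_error_rate(
        sum(1 for p, a in zip(subtree_predictions, actual_classes) if p != a), total)
    majority = Counter(actual_classes).most_common(1)[0][0]
    e2 = predicted_error_rate(
        sum(1 for a in actual_classes if a != majority), total)
    return e2 < e1, majority

# a sub-tree that misclassifies 2 of the 10 cases reaching it, while a single
# "Play" leaf would misclassify only 1 -> prune
print(should_prune(["Play"] * 7 + ["Don't Play"] * 3,
                   ["Play"] * 9 + ["Don't Play"] * 1))   # (True, 'Play')
```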
7/4/2008 26
Case study of predicting student enrolment by
decision tree
• Enrolment Relational schema
Attribute Data type
ID Number
Class Varchar
Sex Varchar
Fin_Support Varchar
Emp_Code Varchar
Job_Code Varchar
Income Varchar
Qualification Varchar
Marital_Status Varchar
7/4/2008 27
Student Enrolment Analysis
– deduce the influencing factors associated with student course
enrolment
– Three selected courses’ enrolment data is sampled:
Computer Science, English Studies and Real Estate
Management
– with 100 training records and 274 testing records
– prediction result
– Generate Classification Rules
– Decision tree - Classification Rule
– Students Enrolment: 41 Computer Science, 46 English
Studies and 13 Real Estate Management
7/4/2008 28
Growing Tree Phase
The C4.5 tree induction algorithm computes the gain ratio of every candidate
data attribute.
Note: Emp_Code shows the highest information gain, and is therefore placed
at the top of the decision tree.
7/4/2008 29
Growing Tree Phase Decision Tree
[Diagram: decision tree after the growing phase]
The root splits on Employment with twelve branches: Manufacturing,
Social Work, Tourism/Hotel, Trading, Property, Construction, Education,
Engineering, Fin/Accounting, Government, Info. Technology and Others.
Several branches split further on Qualification, Job, Sex, Fin_Support or
Income, and each leaf is labelled with the predicted course and its purity,
e.g. Property [Real Estate Management = 100%] and Social Work
[Computer Science = 100%].
7/4/2008 30
Growing Tree Phase classification rules
-Root
-Emp_Code = Manufacturing (English Studies = 67%)
-Quali = Form 4 Form 5 (English studies = 100%)
-Quali = Form 6 or equi. (English studies = 100%)
-Quali = First degree (Computer science = 100%)
-Quali = Master degree (computer science = 100%)
-Emp_Code = Social work (computer science = 100%)
-Emp_Code = Tourism, Hotel (English studies = 67%)
-Emp_Code = Trading (English studies = 75%)
-Emp_Code = Property (Real estate = 100%)
-Emp_Code = Construction (Real estate = 56%)
-Emp_Code = Education (computer science = 73%)
-Emp_Code = Engineering (Real estate = 60%)
-Emp_Code = Fin/Accounting (computer science = 54%)
-Emp_Code = Government (computer science = 50%)
-Emp_Code = Info. Tech. (computer science = 50%)
-Emp_code = Others (English studies= 82%)
7/4/2008 31
Pruned Decision Tree
Given: error rate of the pruned sub-tree Emp_Code = "Manufacturing" = 3.34

Non-pruned sub-tree:
  Condition                        Error rate
  Emp_Code = "Manufacturing"       0.75
   - Quali = Form 4 and 5          1.11
   - Quali = Form 6                0.75
   - Quali = First Degree          0.75
  Total                            3.36

Note: the sub-tree is pruned, since the pruning error rate 3.34 < the
no-pruning error rate 3.36.
7/4/2008 32
Prune Tree Phase Decision Tree
[Diagram: decision tree after the pruning phase]
The root still splits on Employment. Most branches have been collapsed into
single leaves (for example, Property is a pure Real Estate Management leaf),
while the remaining branches keep one further split on Sex, Job or Income;
each leaf is labelled with the predicted course and its purity (70%-100%).
7/4/2008 33
Prune Tree Phase classification Rules
No. Rule Class
1 IF Emp_Code = “Government” AND Income = “$250,000 - $299,999” Real Estate Mgt
2 IF Emp_Code = “Tourism, Hotel” English Studies
3 IF Emp_Code = “Education” Computer Science
4 IF Emp_Code = “Others” English Studies
5 IF Emp_Code = “Government” AND Income = “$150,000 - $199,999” English Studies
6 IF Emp_Code = “Construction” AND Job_Code = “Professional, Technical” Real Estate Mgt
7 IF Emp_Code = “Manufacturing” English Studies
8 IF Emp_Code = “Trading” AND Sex = “Female” English Studies
9 IF Emp_Code = “Construction” AND Job_Code = “Executive” Real Estate Mgt
10 IF Emp_Code = “Engineering” AND Job_Code = “Sales” Computer Science
11 IF Emp_Code = “Engineering” AND Job_Code = “Professional, Technical” Real Estate Mgt
12 IF Emp_Code = “Government” AND Income = “$800,000 - $999,999” Real Estate Mgt
13 IF Emp_Code = “Info. Technology” AND Sex = “Female” English Studies
14 IF Emp_Code = “Info. Technology” AND Sex = “Male” Computer Science
15 IF Emp_Code = “Social Work” Computer Science
16 IF Emp_Code = “Fin/Accounting” Computer Science
17 IF Emp_Code = “Trading” AND Sex = “Male” Computer Science
18 IF Emp_Code = “Construction” AND Job_Code = “Clerical” English Studies
7/4/2008 34
Simplify classification rules by deleting
unnecessary conditions
A condition is deleted when the increase in the pessimistic error rate due to
its disappearance is minimal; in this example, if the condition disappears, the
error rate becomes 0.338.
7/4/2008 35
Simplified Classification Rules
No. Rule Class
1 IF Emp_Code = “Government” AND Income = “$250,000 - $299,999” Real Estate Mgt
2 IF Emp_Code = “Tourism, Hotel” English Studies
3 IF Emp_Code = “Education” Computer Science
4 IF Emp_Code = “Others” English Studies
5 IF Emp_Code = “Manufacturing” English Studies
6 IF Emp_Code = “Trading” AND Sex = “Female” English Studies
7 IF Emp_Code = “Construction” AND Job_Code = “Executive” Real Estate Mgt
8 IF Job_Code = “Sales” Computer Science
9 IF Emp_Code = “Engineering” AND Job_Code = “Professional, Technical” Real Estate Mgt
10 IF Emp_Code = “Info. Technology” AND Sex = “Female” English Studies
11 IF Emp_Code = “Info. Technology” AND Sex = “Male” Computer Science
12 IF Emp_Code = “Social Work” Computer Science
13 IF Emp_Code = “Fin/Accounting” Computer Science
14 IF Emp_Code = “Trading” AND Sex = “Male” Computer Science
15 IF Job_Code = “Clerical” English Studies
16 IF Emp_Code = “Property” Real Estate
17 IF Emp_Code = “Government” AND Income = “$200,000 - $249,999” English Studies
7/4/2008 36
Ranking Rules
After simplifying the classification rule set, the
remaining step is to rank the rules according to
their prediction reliability percentage, defined as
(1 - misclassified cases / total cases covered by the rule) * 100%
For example, the rule
If Employment = "Trading" and Sex = "Female"
then class = "English Studies"
covers 6 cases with 0 misclassified cases. It therefore
has a reliability percentage of 100% and is ranked
first in the rule set.
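A short sketch of this ranking step; the second rule's counts are purely illustrative, only the Trading & Female figures (6 covered, 0 misclassified) come from the text above.

```python
def reliability(misclassified, covered):
    """Prediction reliability percentage: (1 - misclassified / covered) * 100."""
    return (1 - misclassified / covered) * 100 if covered else 0.0

# (misclassified, covered) counts per rule
rules = {
    "Trading & Female -> English Studies": (0, 6),   # from the example above
    "Some other rule": (1, 5),                        # illustrative counts only
}
ranked = sorted(rules.items(), key=lambda kv: reliability(*kv[1]), reverse=True)
for name, (wrong, covered) in ranked:
    print(f"{reliability(wrong, covered):5.1f}%  {name}")
```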
7/4/2008 37
Success rate ranked classification rules
No. Rule Class
1 IF Emp_Code = “Trading” AND Sex = “Female” English Studies
2 IF Emp_Code = “Construction” AND Job_Code = “Executive” Real Estate Mgt
3 IF Emp_Code = “Info. Technology” AND Sex = “Male” Computer Science
4 IF Emp_Code = “Social Work” Computer Science
5 IF Emp_Code = “Government” AND Income = “$250,000 - $299,999” Real Estate Mgt
6 IF Emp_Code = “Government” AND Income = “$200,000 - $249,999” English Studies
7 IF Emp_Code = “Trading” AND Sex = “Male” Computer Science
8 IF Emp_Code = “Property” Real Estate
9 IF Job_Code = “Sales” Computer Science
10 IF Emp_Code = “Others” English Studies
11 IF Emp_Code = “Info. Technology” AND Sex = “Female” English Studies
12 IF Emp_Code = “Engineering” AND Job_Code = “Professional, Technical” Real Estate Mgt
13 IF Emp_Code = “Education” Computer Science
14 IF Emp_Code = “Manufacturing” English Studies
15 IF Emp_Code = “Tourism, Hotel” English Studies
16 IF Job_Code = “Clerical” English Studies
17 IF Emp_Code = “Fin/Accounting” Computer Science
7/4/2008 38
Data Prediction Stage
Classifier                  No. of misclassified cases   Error rate (%)
Pruned Decision Tree        81                           30.7%
Classification Rule Set     90                           32.8%

Both prediction results are reasonably good. The prediction error rate
obtained is about 30%, which means that nearly 70% of unseen test cases
are predicted correctly.
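A sketch of how the ranked rule set could be applied to unseen cases and scored as in the table above: the first rule whose conditions all hold assigns the class. The rule representation and the default class for uncovered cases are our assumptions; the slides do not specify them.

```python
def classify(record, ranked_rules, default="English Studies"):
    """Apply the ranked rule set: the first matching rule assigns the class;
    fall back to a default class (an assumption) if no rule fires."""
    for conditions, cls in ranked_rules:
        if all(record.get(attr) == value for attr, value in conditions):
            return cls
    return default

def error_rate(test_set, ranked_rules):
    """Percentage of test records whose predicted class differs from the actual one."""
    wrong = sum(1 for r in test_set if classify(r, ranked_rules) != r["Class"])
    return 100.0 * wrong / len(test_set)

# e.g. the top-ranked rule from the previous slide
rules = [([("Emp_Code", "Trading"), ("Sex", "Female")], "English Studies")]
record = {"Emp_Code": "Trading", "Sex": "Female", "Class": "English Studies"}
print(classify(record, rules))          # -> English Studies
print(error_rate([record], rules))      # -> 0.0
```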
7/4/2008 39
Summary
• “Employment Industry” is the most
significant factor affecting a student's
enrolment
• The Decision Tree classifier gives the
better prediction result
• The windowing mechanism improves
prediction accuracy
7/4/2008 40
Reading Assignment
“Data Mining: Concepts and Techniques”
2nd edition, by Han and Kamber, Morgan
Kaufmann publishers, 2007, Chapter 6, pp.
291-309.
7/4/2008 41
Lecture Review Question 11
(i) Explain the term “Information Gain” in
Decision Tree.
(ii) What is the termination condition of Growing
tree phase?
(iii) Given a decision tree, which option do you
prefer for pruning the resulting rules, and why?
(a) Converting the decision tree to rules and then
pruning the resulting rules.
(b) Pruning the decision tree and then converting
the pruned tree to rules.
7/4/2008 42
CS5483 tutorial question 11
Apply the C4.5 algorithm to construct a decision tree after the first split for the purchase
records in the following data, after dividing the tuples into two groups according to "age": one
group is less than 25, and the other is greater than or equal to 25. Show all the steps and
calculations for the construction.
Location Customer Sex Age Purchase records
Asia Male 15 Yes
Asia Female 23 No
America Female 20 No
Europe Male 18 No
Europe Female 10 No
Asia Female 40 Yes
Europe Male 33 Yes
Asia Male 24 Yes
America Male 25 Yes
Asia Female 27 Yes
America Female 15 Yes
Europe Male 19 No
Europe Female 33 No
Asia Female 35 No
Europe Male 14 Yes
Asia Male 29 Yes
America Male 30 No