Data mining
Assignment week 3




BARRY KOLLEE

10349863
Exercise 1: Decision Trees and Logical Forms
Imagine a scenario where you want to decide whether to provide a loan. Given
the following logical formula in disjunctive normal form, draw the corresponding
decision tree.


       ( age > 25 = no  ^ lives_by_himself = no )
     ∨ ( age > 25 = yes ^ employed = yes )
     ∨ ( age > 25 = yes ^ employed = no ^ in_education = yes )
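The tree is easiest to read as nested conditionals: branch on age > 25 first, then on employed (or lives_by_himself), and finally on in_education. Below is a minimal sketch of that tree in Python; only the attribute names and the yes/no outcomes come from the formula above, while the dict-based record format and the example applicants are illustrative assumptions.

# A minimal sketch of the decision tree described by the DNF formula above.
# The dict-based record format and the example applicants are assumptions;
# the attribute names and outcomes are taken from the formula.

def provide_loan(applicant):
    """Return True if the loan should be provided according to the tree."""
    if applicant["age"] > 25:
        if applicant["employed"]:
            return True                          # age > 25 = yes ^ employed = yes
        return applicant["in_education"]         # age > 25 = yes ^ employed = no ^ in_education = yes
    # age > 25 = no: grant the loan only if the applicant does not live alone
    return not applicant["lives_by_himself"]     # age > 25 = no ^ lives_by_himself = no

print(provide_loan({"age": 30, "employed": True,  "in_education": False, "lives_by_himself": True}))   # True
print(provide_loan({"age": 22, "employed": False, "in_education": True,  "lives_by_himself": True}))   # False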




Exercise 2: Information Gain and Attribute Selection

Given the following training data:

Which attribute (i.e., a1 or a2) has the higher information
gain when chosen as the first branching in a decision tree?
Explain this first intuitively (you don't need a calculator for
this) and then explain it by giving the respective
information gains (you can use a calculator for this).



Observation:

I think that a1 has the higher information gain, because a2 splits the data into subsets with exactly the same class distribution as the whole set, so it tells us nothing new. I conclude this from the following:

        •   50% of all instances have a ‘+’ class and the other 50% have a ‘-’ class.
        •   50% of the instances with a2 = true have a ‘+’ class and 50% have a ‘-’ class.
        •   50% of the instances with a2 = false have a ‘+’ class and 50% have a ‘-’ class.
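For reference, the calculations below use the standard definitions: the entropy of a set S with class proportions p+ and p- is

       H(S) = -(p+)log2(p+) - (p-)log2(p-)

and the information gain of an attribute A is the entropy of S minus the weighted entropy of the subsets Sv that A creates:

       Gain(A) = H(S) - Σv (|Sv| / |S|) · H(Sv)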

The entropy of each subset created by attribute a1, and their weighted average, are:


       H(a1, true) = -(1/3)log2(1/3) – (2/3)log2(2/3) = 0.9183
       H(a1, false) = -(1/3)log2(1/3) – (2/3)log2(2/3) = 0.9183
       H(a1)        = 0.5 * 0.9183 + 0.5 * 0.9183      = 0.9183


So the information gain of ‘a1’ is:


       Gain(a1) = 1 – 0.9183 = 0.0817



The entropy of each subset created by attribute a2, and their weighted average, are:


       H(a2, true) = -(1/2)log2(1/2) – (1/2)log2(1/2) = 1
       H(a2, false) = -(1/2)log2(1/2) – (1/2)log2(1/2) = 1
       H(a2)       = 0.5 * 1 + 0.5 * 1                = 1


So the information gain of ‘a2’ is:


       Gain(a2) = 1 – 1 = 0
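These figures can be checked with a short script. The training table itself is not reproduced in this document, so the branch proportions below are simply the ones used in the calculation above (each attribute value covers half of the instances):

import math

def entropy(probs):
    """Shannon entropy (base 2) of a class distribution given as probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def info_gain(parent_probs, branches):
    """Parent entropy minus the weighted entropy of the branches.

    `branches` is a list of (weight, class_probabilities) pairs, where each
    weight is the fraction of instances that falls into that branch.
    """
    return entropy(parent_probs) - sum(w * entropy(p) for w, p in branches)

# Proportions taken from the calculation above (assumed to match the table):
print(info_gain([0.5, 0.5], [(0.5, [1/3, 2/3]), (0.5, [1/3, 2/3])]))   # a1: ~0.0817
print(info_gain([0.5, 0.5], [(0.5, [0.5, 0.5]), (0.5, [0.5, 0.5])]))   # a2: 0.0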




Exercise 3: Overfitting

Give a simple example of a decision tree and a data set where the decision tree
overfits the data. Show explicitly (see the definition of overfitting) that the
decision tree in your example overfits.

My example of overfitting is the training table and decision tree below. Each animal has a unique identifier (animalnumber). As long as we only look at the training set, a tree that branches on animalnumber can simply look up is_in_cage for every animal and therefore classifies the training data perfectly. However, when the zoo receives new animals, such a tree has no branch for their unseen animalnumber and cannot predict whether they should be in a cage: it fits the training data far better than it fits new data, which is exactly the definition of overfitting.



Example of a (training) dataset from the zoo:

                         animalnumber       number_of_visits        is_in_cage
                           animal_1               200                   No
                           animal_2               400                  Yes
                           animal_3               100                   No
                           animal_n                ..                   ..

The corresponding decision tree:


       ( number_of_visits > 140 = yes → is_in_cage = yes )
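A minimal sketch of the overfitting argument, assuming only the three animals from the table above plus a hypothetical new arrival (animal_4). A "tree" with one leaf per animalnumber reaches zero error on the training set but cannot produce a prediction for any animal it has never seen:

# Training data from the table above.
train = {"animal_1": "No", "animal_2": "Yes", "animal_3": "No"}

def memorising_tree(animalnumber):
    """One leaf per training animal: zero training error, no generalisation."""
    return train[animalnumber]              # raises KeyError for unseen animals

print(memorising_tree("animal_2"))          # 'Yes' -- every training case is classified correctly
try:
    memorising_tree("animal_4")             # hypothetical new arrival
except KeyError:
    print("no prediction possible for an unseen animalnumber")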




