Data mining
Assignment week 4




BARRY KOLLEE

10349863
Exercise 1: Pruning
1. Which problem do we try to address when using pruning?

“Overfitting and lack of generalization beyond training data, i.e. models that describe the training data
(too) well, but do not model the principles and characteristics underlying the data.”

Schematically, pruning merges a subtree of the decision tree back into a single node. The difference is depicted in the two schemas below:

[Figure not reproduced in this copy: the decision tree before pruning, and the same tree after the subtree has been collapsed into one node.]
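
As a minimal structural sketch (the Node class and its dictionary keys are assumptions, not taken from the assignment), pruning collapses a whole subtree into one leaf that predicts the majority class of the training instances reaching it:

```python
from collections import Counter

class Node:
    def __init__(self, children=None, instances=None):
        self.children = children or {}    # attribute value -> child Node
        self.instances = instances or []  # training instances reaching this node
        self.label = None                 # set when the node becomes a leaf

def prune_to_leaf(node):
    """Collapse the subtree under `node` into a single majority-class leaf."""
    labels = [inst['class'] for inst in node.instances]
    node.children = {}                                  # drop the subtree
    node.label = Counter(labels).most_common(1)[0][0]   # majority class
    return node
```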




2. Describe the purpose of separating the data into training, development, and test data.

“Training data is used to build the model, and test data to test it. The training data by itself cannot measure to what extent the model will perform on (i.e. generalize to) unseen data. Test data measures this, but we should not use the test data to directly inform our model construction. For this purpose a third set is used: the development data set, which behaves like the test set, but whose feedback can be used to change the model.”

We create the training set to fit the classifier that we apply to the data. Generally, the more data we train on, the more accurate the resulting model tends to be.

The other two sets are used to evaluate the performance of the classifier. The development set is used to compare the accuracy of different configurations of our classifier; it is called the development set because we repeatedly evaluate classification performance while developing the model.

In the end we have a model that performs well on the development data. To estimate how well this model will deal with genuinely new data, we evaluate it once on the test set.
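
As a minimal sketch (assuming scikit-learn; the data below is a synthetic stand-in), the three-way split can be made as follows:

```python
# Split a data set into 60% training, 20% development and 20% test data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)  # stand-in data

# Carve off the test set first; it is touched only once, at the very end.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)

# Split the remainder into training and development data (0.25 of 80% = 20%).
X_train, X_dev, y_train, y_dev = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)

# Fit on the training set, compare configurations on the development set,
# and report the final accuracy on the test set.
```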






Exercise 2: Information Gain and Attributes with Many Values

Information gain is defined as:

       Gain(S, A) = H(S) − Σ (v ∈ Values(A)) (|S_v| / |S|) · H(S_v)

According to this definition, information gain favors attributes with many values. Why? Give an example.


We use a training set with (as shown in the table):

        •        n instances
        •        k ordinary attributes A1 … Ak, plus an attribute A* that takes a distinct value Vi for every instance


                                A1              …                 Ak                   A*                class
            1                   T               …                 Black                V1                 C1
            2                   T               …                 White                V2                 C2
            …                   …               …                 …                    …                  …
            n                   F               …                 Black                Vn                 Cn

Because A* has a distinct value for every instance, splitting on A* puts each instance into its own subset. Every subset S_vi(A*) therefore contains exactly one instance, which is either a '+' or a '−' example. We note this as follows:

       S_vi(A*) = [1+, 0−]   or   [0+, 1−]



Each of these singleton subsets is pure, so its entropy (uncertainty) is 0, using the convention 0 · log2 0 = 0:

       H([1+, 0−]) = −(1/1 · log2(1/1) + 0/1 · log2(0/1)) = 0

       H([0+, 1−]) = −(0/1 · log2(0/1) + 1/1 · log2(1/1)) = 0


To calculate the information gain we apply the definition:

       Gain(S, A*) = H(S) − Σ_i (|S_vi(A*)| / |S|) · H(S_vi(A*))
                   = H(S) − Σ_i (1/n) · 0
                   = H(S)


Every subset's entropy is 0, so nothing is subtracted from H(S): the information gain of A* equals H(S), the maximum possible value. An attribute with a distinct value per instance (such as an ID number) is therefore always preferred by this criterion, even though it does not generalize at all.
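
The effect can be checked numerically. Below is a minimal sketch (the data and names are made up, not part of the assignment) comparing an ordinary binary attribute against an ID-like attribute on the same labels:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy H of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Gain(S, A) = H(S) - sum over v of |S_v|/|S| * H(S_v)."""
    n = len(labels)
    subsets = {}
    for v, y in zip(values, labels):
        subsets.setdefault(v, []).append(y)
    return entropy(labels) - sum(len(s) / n * entropy(s) for s in subsets.values())

labels = ['+', '+', '-', '-']            # H(S) = 1.0
a1     = ['T', 'T', 'T', 'F']            # an ordinary, imperfect attribute
a_star = ['V1', 'V2', 'V3', 'V4']        # a distinct value per instance (ID-like)

print(information_gain(a1, labels))      # ~0.31
print(information_gain(a_star, labels))  # 1.0 -- the maximum, H(S)
```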




Exercise 3: Missing Attribute Values
Consider the following set of training instances; instance 2 has a missing value for attribute a1. [Table not reproduced in this copy: a1 and a2 are true/false attributes, the class is '+' or '−', and instance 2's a1 value is marked '?'.]

Apply at least two different strategies for dealing with missing attribute values and show how they work in this concrete example.

Example 1:

We can predict the missing true/false value of attribute a1 by looking at the values of a2. Within a2 there is an equal chance (50%) of a 'true' and of a 'false' value, and we could assume the same balance for a1. Following this reasoning, the missing question mark could be filled in as 'false'.

Example 2:

We can also take the class attribute into account. For a2 we can state the following:
   •    When a2 is 'true', there is a 100% chance of class '+'.
   •    When a2 is 'false', there is a 50% chance of class '+'.

Following this reasoning, we should write down the value 'true' at the question mark.

Example 3:

Now we look only at attribute a1 itself and use the distribution of its known values. Instead of committing to a single value, we can fill in the missing value probabilistically:

       P(true) = 2/3
       P(false) = 1/3
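
These strategies can be sketched in code. The table below is hypothetical but consistent with the probabilities computed above (the original table image is not reproduced in this copy):

```python
# Two imputation strategies for the missing a1 value of instance 2.
rows = [
    {'a1': True,  'a2': True,  'cls': '+'},
    {'a1': None,  'a2': True,  'cls': '+'},   # instance 2: a1 is missing
    {'a1': True,  'a2': False, 'cls': '+'},
    {'a1': False, 'a2': False, 'cls': '-'},
]

known = [r['a1'] for r in rows if r['a1'] is not None]
p_true = known.count(True) / len(known)            # 2/3, as computed above

# Strategy A: fill in the most common known value of a1.
rows[1]['a1'] = p_true >= 0.5                      # -> True

# Strategy B: fractional instances -- split instance 2 into weighted copies
# so that a learner counts it partly as 'true' and partly as 'false'.
fractional = [
    {'a1': True,  'a2': True, 'cls': '+', 'weight': p_true},      # 2/3
    {'a1': False, 'a2': True, 'cls': '+', 'weight': 1 - p_true},  # 1/3
]
```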







Exercise 4: Regression Trees

1. What are the stopping conditions for decision trees predicting discrete
classes?

       1.   All instances under a node have the same label.
       2.   All attributes have been used along the branch.
       3.   There are no instances under the node.


Because the possible outcomes are predefined, discrete labels, such as 'yes' or 'no' in the weather example from the lecture, every leaf can commit to exactly one of them, and the conditions above tell us when to stop splitting and create such a leaf.
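
A minimal sketch of these checks (the instance representation is an assumption, not from the lecture):

```python
def should_stop(instances, remaining_attributes):
    """Return True if no further split should be made under this node."""
    if not instances:                             # 3. no instances under the node
        return True
    if len({inst['class'] for inst in instances}) == 1:
        return True                               # 1. all instances share one label
    if not remaining_attributes:                  # 2. all attributes used on branch
        return True
    return False
```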




2. Why and how do the stopping conditions have to be changed for decision
trees that predict numerical values (e.g., regression trees)?

1. Measure the standard deviation of the numeric target values of all instances under a node; if it falls below a predefined threshold, we stop.
2. and 3. as before.

Instead of a discrete value like 'yes' or 'no', the tree now predicts a number that can lie anywhere within a range, e.g. a particular temperature in degrees instead of 'hot' or 'warm'. Real-valued targets under a node are rarely exactly identical, so condition 1 cannot demand a single shared label; we stop instead once the values are nearly constant (low standard deviation). With this adaptation we can still place effective stopping conditions in our decision tree.
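
A sketch of the adapted first condition (the threshold value is an assumption; conditions 2 and 3 are unchanged):

```python
from statistics import pstdev

def should_stop_regression(targets, remaining_attributes, min_std=0.1):
    """Stop when the numeric targets under a node are (nearly) constant."""
    if not targets or not remaining_attributes:   # conditions 3 and 2, as before
        return True
    return pstdev(targets) < min_std              # condition 1: low spread

# Temperatures under a node that vary by only a tenth of a degree:
print(should_stop_regression([21.0, 21.1, 20.9], ['humidity']))  # True -> leaf
```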



