DBM630: Data Mining and Data Warehousing
MS.IT. Rangsit University, Semester 2/2011

Lecture 8: Classification and Prediction Evaluation
by Kritsada Sriphaew (sriphaew.k AT gmail.com)
Topics
- Train, Test and Validation sets
- Evaluation on Large data
  - Unbalanced data
- Evaluation on Small data
  - Cross validation
  - Bootstrap
- Comparing data mining schemes
  - Significance test
  - Lift Chart / ROC curve
- Numeric Prediction Evaluation
Evaluation in Classification Tasks
- How predictive is the model we learned?
- Error on the training data is not a good indicator of performance on future data.
  - Q: Why?
  - A: Because new data will probably not be exactly the same as the training data!
- Overfitting (fitting the training data too precisely) usually leads to poor results on new data.
Model Selection and Bias-Variance Tradeoff
- Typical behavior of the test and training error, as model complexity is varied.
[Figure: prediction error vs. model complexity. The training-sample error falls steadily as complexity grows, while the test-sample error falls and then rises again. The low-complexity end corresponds to high bias / low variance; the high-complexity end to low bias / high variance.]
Classifier error rate
- Natural performance measure for classification problems: error rate
  - Success: instance's class is predicted correctly
  - Error: instance's class is predicted incorrectly
  - Error rate: proportion of errors made over the whole set of instances
- Resubstitution error: error rate on the training data (too optimistic!)
- Generalization error: error rate on test data
- Example: the same 2-point gain reads very differently as accuracy vs. error:

                  Accuracy                  Error rate (1 - Accuracy)
    Result A      85%                       15%
    Result B      87%                       13%
    Difference    2% improvement            2% error reduction
    Relative      2.35% improvement rate    13.3% error reduction rate
Evaluation on LARGE data
- If many (thousands of) examples are available, including several hundred examples from each class, then how can we evaluate our classifier model?
- A simple evaluation is sufficient:
  - For example, randomly split the data into training and test sets (usually 2/3 for training, 1/3 for testing).
- Build a classifier using the training set and evaluate it using the test set.
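As a concrete illustration, a minimal Python sketch of this split-and-evaluate procedure might look as follows (the synthetic dataset and the choice of a decision tree are placeholder assumptions, not part of the lecture):

```python
# Minimal sketch: simple holdout evaluation on a large dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data standing in for a large labeled dataset.
X, y = make_classification(n_samples=3000, n_features=20, random_state=42)

# Randomly split: 2/3 for training, 1/3 for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=42)

model = DecisionTreeClassifier().fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```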
Classification Step 1: Split data into train and test sets
[Diagram: historical data with known results ("THE PAST", instances labeled + or -) is divided into a training set and a testing set.]
Classification Step 2: Build a model on a training set
[Diagram: the training set is fed into a model builder; the testing set is held aside.]
Classification Step 3: Evaluate on test set (and maybe re-train)
[Diagram: the model built on the training set predicts Y/N labels for the testing set; the predictions (+/-) are compared against the known results, and the feedback can be used to re-train the model.]
A note on parameter tuning
- It is important that the test data is not used in any way to create the classifier.
- Some learning schemes operate in two stages:
  - Stage 1: build the basic structure
  - Stage 2: optimize parameter settings
- The test data cannot be used for parameter tuning!
- The proper procedure uses three sets: training data, validation data, and test data.
  - Validation data is used to optimize parameters.
Classification: Train, Validation, Test split
[Diagram: the training set feeds a model builder; predictions on the validation set are evaluated to tune the parameters; the tuned final model is then evaluated once on the final test set.]
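A minimal sketch of this three-way procedure, assuming a hypothetical tunable parameter (tree depth) and synthetic placeholder data; the test set is touched only once, at the end:

```python
# Minimal sketch: train/validation/test split with parameter tuning.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=3000, random_state=0)

# Carve off the final test set first, then split the rest into train/validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

# Tune the parameter on the validation set only.
best_depth, best_acc = None, -1.0
for depth in [2, 4, 8, 16]:
    acc = DecisionTreeClassifier(max_depth=depth).fit(X_train, y_train).score(X_val, y_val)
    if acc > best_acc:
        best_depth, best_acc = depth, acc

# Re-train on train+validation with the chosen parameter; evaluate once on the test set.
final = DecisionTreeClassifier(max_depth=best_depth).fit(X_rest, y_rest)
print("Final test accuracy:", final.score(X_test, y_test))
```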
Unbalanced data
- Sometimes, classes have very unequal frequency:
  - Accommodation prediction: 97% stay, 3% don't stay
  - Medical diagnosis: 90% healthy, 10% disease
  - eCommerce: 99% don't buy, 1% buy
  - Security: >99.99% of Americans are not terrorists
- A similar situation arises with multiple classes.
- A majority-class classifier can be 97% correct, but useless.
- Solution: with two classes, a good approach is to build BALANCED train and test sets, and train the model on a balanced set (see the sketch below):
  - Randomly select the desired number of minority-class instances.
  - Add an equal number of randomly selected majority-class instances.
  - That is, we ignore the effect of the number of instances for each class.
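One way this balancing might be done in numpy (the class proportions and feature matrix below are invented for illustration):

```python
# Minimal sketch: build a BALANCED set by downsampling the majority class.
import numpy as np

rng = np.random.default_rng(0)
y = np.array([0] * 970 + [1] * 30)        # 97% majority (0), 3% minority (1)
X = rng.normal(size=(len(y), 5))          # placeholder features

minority_idx = np.where(y == 1)[0]
majority_idx = np.where(y == 0)[0]

# Take all minority instances and an equal number of random majority instances.
sampled_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
balanced_idx = rng.permutation(np.concatenate([minority_idx, sampled_majority]))

X_bal, y_bal = X[balanced_idx], y[balanced_idx]
print("Balanced class counts:", np.bincount(y_bal))   # 30 and 30
```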
Evaluation on SMALL data
- The holdout method reserves a certain amount for testing and uses the remainder for training.
  - Usually: one third for testing, the rest for training.
- For "unbalanced" datasets, samples might not be representative:
  - Few or no instances of some classes.
- Stratified sample: an advanced version of balancing the data.
  - Make sure that each class is represented with approximately equal proportions in both subsets.
- What if we have a small data set?
  - The chosen 2/3 for training may not be representative.
  - The chosen 1/3 for testing may not be representative.
Repeated Holdout Method
- The holdout estimate can be made more reliable by repeating the process with different subsamples:
  - In each iteration, a certain proportion is randomly selected for training (possibly with stratification).
  - The error rates from the different iterations are averaged to yield an overall error rate.
- This is called the repeated holdout method.
- Still not optimal: the different test sets overlap.
- Can we prevent overlapping?
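A minimal sketch of repeated (stratified) holdout, again on placeholder data; only the random seed changes between iterations:

```python
# Minimal sketch: repeated holdout, averaging the error over 10 random splits.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

errors = []
for seed in range(10):                       # 10 different subsamples
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=1/3, stratify=y, random_state=seed)
    acc = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)
    errors.append(1 - acc)

print("Repeated-holdout error estimate:", np.mean(errors))
```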
Cross-validation
- Cross-validation avoids overlapping test sets:
  - First step: the data is split into k subsets of equal size.
  - Second step: each subset in turn is used for testing and the remainder for training.
- This is called k-fold cross-validation.
- Often the subsets are stratified before the cross-validation is performed.
- The error estimates are averaged to yield an overall error estimate.
Cross-validation Example
- Break up the data into groups of the same size (possibly with stratification).
- Hold aside one group for testing and use the rest to build the model.
[Diagram: one block of the partitioned data serves as the test set, the remaining blocks as the training set.]
- Repeat with a different group as the test data until every group has been used for testing once.
More on cross-validation
- Standard method for evaluation: stratified ten-fold cross-validation.
- Why ten? Extensive experiments have shown that this is the best choice to get an accurate estimate.
- Stratification reduces the estimate's variance.
- Even better: repeated stratified cross-validation.
  - E.g., ten-fold cross-validation is repeated ten times and the results are averaged (reduces the variance).
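A minimal sketch of stratified ten-fold cross-validation with scikit-learn (placeholder data and classifier, as before):

```python
# Minimal sketch: stratified ten-fold cross-validation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

# Average the per-fold accuracies into an overall error estimate.
print("Cross-validated error estimate: %.3f" % (1 - np.mean(scores)))
```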
Leave-One-Out cross-validation
- Leave-One-Out: remove one instance for testing and use the others for training.
- A particular form of cross-validation:
  - Set the number of folds equal to the number of training instances.
  - I.e., for n training instances, build the classifier n times.
- Makes the best use of the data.
- Involves no random subsampling.
- Very computationally expensive.
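In scikit-learn this is just a different fold iterator; a minimal sketch (kept deliberately small, since n models are built):

```python
# Minimal sketch: leave-one-out cross-validation on a small dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=50, random_state=0)

# One fold per instance: the classifier is trained 50 times.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=LeaveOneOut())
print("LOO error estimate:", 1 - np.mean(scores))
```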
Leave-One-Out-CV and stratification
- Disadvantage of Leave-One-Out-CV: stratification is not possible.
  - It guarantees a non-stratified sample because there is only one instance in the test set!
- Extreme example: a random dataset split equally into two classes.
  - The best inducer predicts the majority class: 50% accuracy on fresh data.
  - The Leave-One-Out-CV estimate is 100% error! (Each held-out instance belongs to whichever class is now the minority in the training set, so it is always misclassified.)
*The bootstrap
- CV uses sampling without replacement:
  - The same instance, once selected, cannot be selected again for a particular training/test set.
- The bootstrap uses sampling with replacement to form the training set:
  - Sample a dataset of n instances n times with replacement to form a new dataset of n instances.
  - Use this data as the training set.
  - Use the instances from the original dataset that do not occur in the new training set for testing.
Evaluating the Accuracy of a Classifier or Predictor
- Bootstrap method:
  - The training tuples are sampled uniformly with replacement.
  - Each time a tuple is selected, it is equally likely to be selected again and re-added to the training set.
  - There are several bootstrap methods; the most commonly used is the .632 bootstrap, which works as follows:
    - Given a data set of d tuples, the data set is sampled d times, with replacement, resulting in a bootstrap training set of d samples.
    - It is very likely that some of the original data tuples will occur more than once in this sample.
    - The data tuples that did not make it into the training set end up forming the test set.
    - If we try this out several times, on average 63.2% of the original data tuples will end up in the bootstrap sample, and the remaining 36.8% will form the test set.
*The 0.632 bootstrap
- Also called the 0.632 bootstrap:
  - For n instances, a particular instance has a probability of 1 - 1/n of not being picked in one draw.
  - Thus its probability of ending up in the test data (never being picked in n draws) is:

      (1 - 1/n)^n ≈ e^(-1) ≈ 0.368

  - This means the training data will contain approximately 63.2% of the instances.
- Note: e is an irrational constant approximately equal to 2.718281828.
*Estimating error with the bootstrap
- The error estimate on the test data will be very pessimistic:
  - The model is trained on just ~63% of the instances.
- Therefore, combine it with the resubstitution error:

      err = 0.632 × err_test_instances + 0.368 × err_training_instances

- The resubstitution error gets less weight than the error on the test data.
- Repeat the process several times with different replacement samples and average the results (a minimal sketch of one round follows).
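A sketch of a single round of the 0.632 bootstrap, assuming placeholder data; in practice this loop would be repeated with different samples and the estimates averaged:

```python
# Minimal sketch: one round of the 0.632 bootstrap error estimate.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
n = len(y)
rng = np.random.default_rng(0)

# Sample n instances WITH replacement for training;
# instances never drawn form the test set (~36.8% of the data).
train_idx = rng.integers(0, n, size=n)
test_idx = np.setdiff1d(np.arange(n), train_idx)

model = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
err_test = 1 - model.score(X[test_idx], y[test_idx])
err_train = 1 - model.score(X[train_idx], y[train_idx])  # resubstitution error

err = 0.632 * err_test + 0.368 * err_train
print("0.632 bootstrap error estimate:", err)
```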
*More on the bootstrap
- Probably the best way of estimating performance for very small datasets.
- However, it has some problems:
  - Consider the random dataset from above.
  - A perfect memorizer will achieve 0% resubstitution error and ~50% error on the test data.
  - Bootstrap estimate for this classifier:

      err = 0.632 × 50% + 0.368 × 0% = 31.6%

  - True expected error: 50%
Evaluating Two-class Classification (Lift Chart vs. ROC Curve)
- Information Retrieval or Search Engine:
  - An application that finds a set of related documents given a set of keywords.
  - Hard decision vs. soft decision; here we focus on the soft decision:
    - Multiclass
    - Class probability (ranking)
- Class-by-class evaluation.
- Example: promotional mailout.
  - Situation 1: classifier predicts that 0.1% of all households will respond.
  - Situation 2: classifier predicts that 0.4% of the 100,000 most promising households will respond.
Confusion Matrix (Two-class)
- Also called a contingency table:

                              Actual Class
                              Yes               No
    Predicted     Yes         True Positive     False Positive
    Class         No          False Negative    True Negative
Measures in Information Retrieval
- precision: percentage of retrieved documents that are relevant.

      precision = TP / (TP + FP)

- recall: percentage of relevant documents that are returned.

      recall = TP / (TP + FN)

- F-measure: the combined measure of recall and precision.

      F-measure = (2 × recall × precision) / (recall + precision)

- Precision/recall curves have a hyperbolic shape.
- Summary measures: average precision at 20%, 50% and 80% recall (three-point average recall).
Measures in Two-Class Classification
For the positive class:

    precision = TP / (TP + FP)
    recall    = TP / (TP + FN)
    accuracy  = (TP + TN) / (TP + TN + FP + FN)

For the negative class:

    precision = TN / (TN + FN)
    recall    = TN / (TN + FP)
    accuracy  = (TP + TN) / (TP + TN + FP + FN)

Related rates:

    FP Rate = FP / (TN + FP)
    TP Rate = TP / (TP + FN) = recall

- Usually, we focus only on the positive class ("True" or "Yes" cases); therefore, only the precision and recall of the positive class are used for performance comparison.
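These measures are direct arithmetic on the confusion-matrix counts; a minimal sketch, using the counts from the example on the next slide:

```python
# Minimal sketch: two-class measures from confusion-matrix counts.
TP, FP, FN, TN = 6954, 46, 412, 2588   # from the buys_computer example below

precision = TP / (TP + FP)
recall    = TP / (TP + FN)             # also the TP rate
f_measure = 2 * recall * precision / (recall + precision)
accuracy  = (TP + TN) / (TP + TN + FP + FN)
fp_rate   = FP / (TN + FP)

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"F={f_measure:.3f} accuracy={accuracy:.3f} FP rate={fp_rate:.3f}")
```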
Confusion matrix: Example

    Predicted \ Actual       buys_computer = yes   buys_computer = no   Total
    buys_computer = yes      6,954                 46                   7,000
    buys_computer = no       412                   2,588                3,000
    Total                    7,366                 2,634                10,000

- FN = 412: tuples of class buys_computer = yes that were labeled by the classifier as buys_computer = no.
- FP = 46: tuples of class buys_computer = no that were labeled by the classifier as buys_computer = yes.
Cumulative Gains Chart / Lift Chart / ROC Curve
- These are visual aids for measuring model performance.
- Cumulative gains measures the effectiveness of a predictive model by its TP rate (% of true responses captured).
- Lift measures the effectiveness of a predictive model as the ratio between the results obtained with and without the model.
- Cumulative gains and lift charts consist of a lift curve and a baseline. The greater the area between the lift curve and the baseline, the better the model.
- ROC measures the effectiveness of a predictive model by plotting the TP rate (% positive responses) against the FP rate (% negative responses). The greater the area under the ROC curve, the better the model.
Generating Charts
- Instances are sorted according to their predicted probability of being a true positive. [Example table not included in the extraction.]
Cumulative Gains Chart
- The x-axis shows the percentage of samples.
- The y-axis shows the percentage of positive responses (the true positive rate), as a percentage of the total possible positive responses.
- Baseline (overall response rate): if we sample X% of the data, we will receive X% of the total positive responses.
- Lift curve: using the predictions of the response model, calculate the percentage of positive responses for each percentage of customers contacted, and map these points to create the curve.
A Sample Cumulative Gains Chart
[Figure: cumulative gains chart. x-axis: percentage of samples (10% to 100%); y-axis: TP rate / % positive responses (0 to 100%). The model's curve lies above the diagonal baseline, where %positive_responses = %sample_size.]
Lift Chart
- The x-axis shows the percentage of samples.
- The y-axis shows the ratio of true positives obtained with the model to true positives obtained without the model.
- To plot the chart: calculate the points on the lift curve by determining the ratio between the result predicted by the model and the result using no model.
- Example: when contacting 10% of customers, using no model we should get 10% of the responders, while using the given model we should get 30% of the responders. The y-value of the lift curve at 10% is 30 / 10 = 3.
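Both charts come from the same sorted ranking; a minimal sketch of computing the gains and lift points decile by decile (the scores and responses below are randomly generated placeholders):

```python
# Minimal sketch: cumulative gains and lift points from ranked predictions.
import numpy as np

rng = np.random.default_rng(0)
scores = rng.random(1000)                          # hypothetical model scores
actual = (rng.random(1000) < scores).astype(int)   # hypothetical responses

order = np.argsort(-scores)                        # sort descending by score
sorted_actual = actual[order]
total_pos = sorted_actual.sum()

for pct in range(10, 101, 10):
    k = int(len(actual) * pct / 100)
    gain = sorted_actual[:k].sum() / total_pos     # % of all positives captured
    lift = gain / (pct / 100)                      # ratio vs. using no model
    print(f"{pct:3d}% of samples: gains={gain:6.1%}  lift={lift:.2f}")
```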
A Sample Lift Chart
[Figure: lift chart. x-axis: percentage of samples (10% to 100%); y-axis: lift. The model's curve starts high and decays toward the baseline of lift = 1, the result of using no model.]
ROC Curve
- ROC curves are similar to lift charts.
  - "ROC" stands for "receiver operating characteristic".
  - Used in signal detection to show the tradeoff between hit rate and false alarm rate over a noisy channel.
- Differences:
  - The y-axis shows the percentage of true positives in the sample.
  - The x-axis shows the percentage of false positives in the sample (rather than the sample size).
A Sample ROC Curve
[Figure: ROC curve. y-axis: true positive rate; x-axis: false positive rate; diagonal baseline: TP Rate = FP Rate. Annotations from the mailout example: 1,000 total responses, a marked point at 400 responses, and 1,000,000 - 1,000 mailouts spanning the x-axis.]
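A minimal sketch of tracing a ROC curve: rank the instances by model score, then accumulate the TP and FP rates (the scores and labels below are invented for illustration):

```python
# Minimal sketch: ROC points from ranked predictions.
import numpy as np

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1])
actual = np.array([1,   1,   0,   1,   0,    1,   0,   0,   1,   0  ])

order = np.argsort(-scores)          # descending by score
ranked = actual[order]

# Cumulative TP rate and FP rate after each ranked instance.
tpr = np.cumsum(ranked) / ranked.sum()
fpr = np.cumsum(1 - ranked) / (1 - ranked).sum()
print("TPR:", np.round(tpr, 2))
print("FPR:", np.round(fpr, 2))
```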
Extending to Multiple-Class Classification

                             PREDICTED CLASS
                      C1        C2        ...   Cn        Sum   Recall
    ACTUAL    C1      n11       n12       ...   n1n       R1    n11/R1
    CLASS     C2      n21       n22       ...   n2n       R2    n22/R2
              ...
              Cn      nn1       nn2       ...   nnn       Rn    nnn/Rn
              Sum     P1        P2        ...   Pn        T
              Precision n11/P1  n22/P2    ...   nnn/Pn    (n11 + n22 + ... + nnn)/T
Measures in Multiple-Class Classification

    precision(Ci) = nii / Pi
    recall(Ci)    = nii / Ri
    accuracy      = (n11 + n22 + ... + nnn) / T
Numeric Prediction Evaluation
- Same strategies: independent test set, cross-validation, significance tests, etc.
- Difference: the error measures (generalization error).
- Actual target values: a1, a2, ..., an
- Predicted target values: p1, p2, ..., pn
- Most popular measure: mean-squared error

      ((p1 - a1)^2 + (p2 - a2)^2 + ... + (pn - an)^2) / n

- Easy to manipulate mathematically.
Other Measures
- The root mean-squared (rms) error:

      sqrt(((p1 - a1)^2 + (p2 - a2)^2 + ... + (pn - an)^2) / n)

- The mean absolute error is less sensitive to outliers than the mean-squared error:

      (|p1 - a1| + |p2 - a2| + ... + |pn - an|) / n

- Sometimes relative error values are more appropriate (e.g., 10% for an error of 50 when predicting 500).
Improvement on the Mean
- Often we want to know how much the scheme improves on simply predicting the average.
- The relative squared error is (where ā is the average of the actual values):

      ((p1 - a1)^2 + (p2 - a2)^2 + ... + (pn - an)^2) / ((ā - a1)^2 + (ā - a2)^2 + ... + (ā - an)^2)

- The relative absolute error is:

      (|p1 - a1| + |p2 - a2| + ... + |pn - an|) / (|ā - a1| + |ā - a2| + ... + |ā - an|)
The Correlation Coefficient
- Measures the statistical correlation between the predicted values and the actual values:

      correlation = S_PA / (S_P × S_A)

  where

      S_PA = Σ_i (p_i - p̄)(a_i - ā) / (n - 1)
      S_P  = sqrt(Σ_i (p_i - p̄)^2 / (n - 1))
      S_A  = sqrt(Σ_i (a_i - ā)^2 / (n - 1))

- Scale independent, between -1 and +1.
- Good performance leads to large values!
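All of these numeric-prediction measures are one-liners in numpy; a minimal sketch on hypothetical actual/predicted vectors:

```python
# Minimal sketch: numeric prediction measures.
import numpy as np

a = np.array([500.0, 300.0, 700.0, 450.0])   # actual target values
p = np.array([550.0, 280.0, 690.0, 500.0])   # predicted target values

mse  = np.mean((p - a) ** 2)
rmse = np.sqrt(mse)
mae  = np.mean(np.abs(p - a))

rse = np.sum((p - a) ** 2) / np.sum((a - a.mean()) ** 2)    # relative squared error
rae = np.sum(np.abs(p - a)) / np.sum(np.abs(a - a.mean()))  # relative absolute error

corr = np.corrcoef(p, a)[0, 1]               # correlation coefficient

print(f"MSE={mse:.1f} RMSE={rmse:.1f} MAE={mae:.1f} "
      f"RSE={rse:.3f} RAE={rae:.3f} corr={corr:.3f}")
```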
Practice 1
- A company wants to do a mail marketing campaign. It costs the company $1 for each item mailed. It has information on 100,000 customers. Create a cumulative gains chart and a lift chart from the following data. [Data table not included in the extraction.]
Results
[Figure: the resulting cumulative gains and lift charts for Practice 1.]
Practice 2
- Using the response model P(x) = Salary(x)/1000 + Age(x) for customer x and the data table shown below, construct the cumulative gains chart, lift chart and ROC curve. Ties in ranking should be broken arbitrarily by assigning the higher rank to whoever appears first in the table.

    Customer Name   Salary   Age   Actual Response   P(x)   Rank
    A               10000    39    N
    B               50000    21    Y
    C               65000    25    Y
    D               62000    30    Y
    E               67000    19    Y
    F               69000    48    N
    G               65000    12    Y
    H               64000    51    N
    I               71000    65    Y
    J               73000    42    N
Solution (Practice 2): scores and ranks.

    Customer   Salary   Age   Actual Response   P(x)   Rank
    A          10000    39    N                  49    10
    B          50000    21    Y                  71     9
    C          65000    25    Y                  90     6
    D          62000    30    Y                  92     5
    E          67000    19    Y                  86     7
    F          69000    48    N                 117     2
    G          65000    12    Y                  77     8
    H          64000    51    N                 115     3
    I          71000    65    Y                 136     1
    J          73000    42    N                 115     4

Ordered by P(x):

    Customer   Actual Response   P(x)   Rank
    I          Y                 136     1
    F          N                 117     2
    H          N                 115     3
    J          N                 115     4
    D          Y                  92     5
    C          Y                  90     6
    E          Y                  86     7
    G          Y                  77     8
    B          Y                  71     9
    A          N                  49    10

[Figure: the resulting cumulative gains chart (% positive responses vs. % of samples, with baseline) and lift chart (lift vs. % of samples, with baseline).]
Counting TP and FP down the ranking for the ROC curve:

    Customer   Actual Response   #Sampled   TP   FP   TPR        FPR
    I          Y                  1          1    0    16.67%     0.00%
    F          N                  2          1    1    16.67%    25.00%
    H          N                  3          1    2    16.67%    50.00%
    J          N                  4          1    3    16.67%    75.00%
    C          Y                  5          2    3    33.33%    75.00%
    E          Y                  6          3    3    50.00%    75.00%
    D          Y                  7          4    3    66.67%    75.00%
    G          Y                  8          5    3    83.33%    75.00%
    B          Y                  9          6    3   100.00%    75.00%
    A          N                 10          6    4   100.00%   100.00%

[Figure: the resulting ROC curve (TPR vs. FPR).]
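The whole Practice 2 computation can be reproduced in a few lines; a minimal sketch (the tie between H and J is broken by table order, as the exercise specifies, which Python's stable sort gives us for free):

```python
# Minimal sketch: score, rank and trace TPR/FPR for the Practice 2 data.
names  = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"]
salary = [10000, 50000, 65000, 62000, 67000, 69000, 65000, 64000, 71000, 73000]
age    = [39, 21, 25, 30, 19, 48, 12, 51, 65, 42]
actual = ["N", "Y", "Y", "Y", "Y", "N", "Y", "N", "Y", "N"]

# Response model: P(x) = Salary(x)/1000 + Age(x).
px = [s / 1000 + a for s, a in zip(salary, age)]

# Sort descending by P(x); the stable sort keeps ties in table order,
# so the earlier customer gets the higher rank.
ranked = sorted(zip(names, px, actual), key=lambda t: -t[1])

P = actual.count("Y")
N = actual.count("N")
tp = fp = 0
for rank, (name, score, resp) in enumerate(ranked, start=1):
    tp += (resp == "Y")
    fp += (resp == "N")
    print(f"{rank:2d}. {name}  P(x)={score:5.0f}  TPR={tp/P:7.2%}  FPR={fp/N:7.2%}")
```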
*Predicting performance
- Assume the estimated error rate is 25%. How close is this to the true error rate?
  - Depends on the amount of test data.
- Prediction is just like tossing a biased (!) coin:
  - "Head" is a "success", "tail" is an "error".
- In statistics, a succession of independent events like this is called a Bernoulli process.
- Statistical theory provides us with confidence intervals for the true underlying proportion!
*Confidence intervals
- We can say: p lies within a certain specified interval with a certain specified confidence.
- Example: S = 750 successes in N = 1000 trials.
  - Estimated success rate: 75%.
  - How close is this to the true success rate p?
  - Answer: with 80% confidence, p ∈ [73.2%, 76.7%].
- Another example: S = 75 and N = 100.
  - Estimated success rate: 75%.
  - With 80% confidence, p ∈ [69.1%, 80.1%].
*Mean and variance (also Mod 7)
- Mean and variance for a Bernoulli trial: p and p(1 - p).
- Expected success rate: f = S/N.
- Mean and variance for f: p and p(1 - p)/N.
- For large enough N, f follows a normal distribution.
- The c% confidence interval [-z ≤ X ≤ z] for a random variable with 0 mean is given by:

      Pr[-z ≤ X ≤ z] = c

- With a symmetric distribution:

      Pr[-z ≤ X ≤ z] = 1 - 2 × Pr[X ≥ z]
*Confidence limits
- Confidence limits for the normal distribution with 0 mean and a variance of 1:

      Pr[X ≥ z]    z
      0.1%         3.09
      0.5%         2.58
      1%           2.33
      5%           1.65
      10%          1.28
      20%          0.84
      40%          0.25

- Thus:

      Pr[-1.65 ≤ X ≤ 1.65] = 90%

- To use this, we have to reduce our random variable f to have 0 mean and unit variance.
*Transforming f
- Transformed value for f (i.e., subtract the mean and divide by the standard deviation):

      (f - p) / sqrt(p(1 - p)/N)

- Resulting equation:

      Pr[-z ≤ (f - p)/sqrt(p(1 - p)/N) ≤ z] = c

- Solving for p:

      p = ( f + z^2/(2N) ± z × sqrt(f/N - f^2/N + z^2/(4N^2)) ) / ( 1 + z^2/N )
*Examples
- f = 75%, N = 1000, c = 80% (so that z = 1.28): p ∈ [0.732, 0.767]
- f = 75%, N = 100, c = 80% (so that z = 1.28): p ∈ [0.691, 0.801]
- Note that the normal-distribution assumption is only valid for large N (i.e., N > 100).
- f = 75%, N = 10, c = 80% (so that z = 1.28): p ∈ [0.549, 0.881] (should be taken with a grain of salt)
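The examples above can be checked directly from the solved formula; a minimal sketch:

```python
# Minimal sketch: confidence interval for the true success rate p,
# using the formula solved for p on the previous slide.
from math import sqrt

def confidence_interval(f, n, z):
    centre = f + z * z / (2 * n)
    spread = z * sqrt(f / n - f * f / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom

print(confidence_interval(0.75, 1000, 1.28))   # ~ (0.732, 0.767)
print(confidence_interval(0.75, 100, 1.28))    # ~ (0.691, 0.801)
print(confidence_interval(0.75, 10, 1.28))     # ~ (0.549, 0.881)
```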

More Related Content

PDF
Lecture 9: Machine Learning in Practice (2)
PPTX
Presentation on supervised learning
PDF
Lecture 8: Machine Learning in Practice (1)
PDF
Lecture 3: Basic Concepts of Machine Learning - Induction & Evaluation
PPTX
Machine Learning - Splitting Datasets
PDF
Performance Evaluation for Classifiers tutorial
PPTX
Supervised learning
Lecture 9: Machine Learning in Practice (2)
Presentation on supervised learning
Lecture 8: Machine Learning in Practice (1)
Lecture 3: Basic Concepts of Machine Learning - Induction & Evaluation
Machine Learning - Splitting Datasets
Performance Evaluation for Classifiers tutorial
Supervised learning

What's hot (20)

PDF
Cross-validation Tutorial: What, how and which?
PDF
Cross validation
PPTX
K-Folds Cross Validation Method
PPTX
Machine Learning - Accuracy and Confusion Matrix
PPTX
Machine Learning
PDF
Lecture 2: Preliminaries (Understanding and Preprocessing data)
PDF
Lecture7 cross validation
PPTX
Feature Selection in Machine Learning
PPTX
Ml1 introduction to-supervised_learning_and_k_nearest_neighbors
PPTX
Machine Learning Unit 2 Semester 3 MSc IT Part 2 Mumbai University
PDF
An introduction to variable and feature selection
PPTX
Machine learning - session 3
PPTX
Lecture 6: Ensemble Methods
PDF
Module 5: Decision Trees
PPTX
Machine learning
PDF
Racing for unbalanced methods selection
PDF
Classification Based Machine Learning Algorithms
PPT
MachineLearning.ppt
PDF
Aaa ped-14-Ensemble Learning: About Ensemble Learning
Cross-validation Tutorial: What, how and which?
Cross validation
K-Folds Cross Validation Method
Machine Learning - Accuracy and Confusion Matrix
Machine Learning
Lecture 2: Preliminaries (Understanding and Preprocessing data)
Lecture7 cross validation
Feature Selection in Machine Learning
Ml1 introduction to-supervised_learning_and_k_nearest_neighbors
Machine Learning Unit 2 Semester 3 MSc IT Part 2 Mumbai University
An introduction to variable and feature selection
Machine learning - session 3
Lecture 6: Ensemble Methods
Module 5: Decision Trees
Machine learning
Racing for unbalanced methods selection
Classification Based Machine Learning Algorithms
MachineLearning.ppt
Aaa ped-14-Ensemble Learning: About Ensemble Learning
Ad

Viewers also liked (20)

PPTX
Statistics in the age of data science, issues you can not ignore
PDF
An Introduction to NLP4L
PPTX
Comparative Study of Granger Causality Algorithm for Gene Regulatory Network
PDF
PDF
PDF
PDF
PDF
Introduction to Data Warehousing
PPT
Datawarehouse and OLAP
PDF
Dbm630_lecture02-03
PDF
PDF
Cross-Validation
PDF
PPT
Data Mining and Data Warehousing
PDF
L2. Evaluating Machine Learning Algorithms I
PPTX
Apache kylin 2.0: from classic olap to real-time data warehouse
PPTX
Design cube in Apache Kylin
PPT
Datacube
PPTX
Apache Kylin’s Performance Boost from Apache HBase
Statistics in the age of data science, issues you can not ignore
An Introduction to NLP4L
Comparative Study of Granger Causality Algorithm for Gene Regulatory Network
Introduction to Data Warehousing
Datawarehouse and OLAP
Dbm630_lecture02-03
Cross-Validation
Data Mining and Data Warehousing
L2. Evaluating Machine Learning Algorithms I
Apache kylin 2.0: from classic olap to real-time data warehouse
Design cube in Apache Kylin
Datacube
Apache Kylin’s Performance Boost from Apache HBase
Ad

Similar to Dbm630 lecture08 (20)

PDF
Barga Data Science lecture 10
PPTX
Classification in the database system.pptx
PPT
Presentation
PDF
introducatio to ml introducatio to ml introducatio to ml
PPTX
Week 11 Model Evalaution Model Evaluation
PDF
06-00-ACA-Evaluation.pdf
PPT
Mining the LET Performance in Generating Prediction Models for OTDSS
PDF
Automated Testing and Safety Analysis of Deep Neural Networks
PPT
Overfitting and-tbl
PDF
Data mining chapter04and5-best
PPTX
Lect8 Classification & prediction
PPTX
Statistical Learning and Model Selection (1).pptx
PPT
Slides ppt
PDF
05-00-ACA-Data-Intro.pdf
PPT
3 DM Classification HFCS kilometres .ppt
PDF
Using machine learning in anti money laundering part 2
PDF
Optimization Technique for Feature Selection and Classification Using Support...
PDF
Testing and Deployment - Full Stack Deep Learning
PDF
Data Analytics For Beginners | Introduction To Data Analytics | Data Analytic...
PPTX
UNIT-II-Machine-Learning.pptx Machine Learning Different AI Models
Barga Data Science lecture 10
Classification in the database system.pptx
Presentation
introducatio to ml introducatio to ml introducatio to ml
Week 11 Model Evalaution Model Evaluation
06-00-ACA-Evaluation.pdf
Mining the LET Performance in Generating Prediction Models for OTDSS
Automated Testing and Safety Analysis of Deep Neural Networks
Overfitting and-tbl
Data mining chapter04and5-best
Lect8 Classification & prediction
Statistical Learning and Model Selection (1).pptx
Slides ppt
05-00-ACA-Data-Intro.pdf
3 DM Classification HFCS kilometres .ppt
Using machine learning in anti money laundering part 2
Optimization Technique for Feature Selection and Classification Using Support...
Testing and Deployment - Full Stack Deep Learning
Data Analytics For Beginners | Introduction To Data Analytics | Data Analytic...
UNIT-II-Machine-Learning.pptx Machine Learning Different AI Models

More from Tokyo Institute of Technology (11)

PDF
Lecture 4 online and offline business model generation
PDF
Lecture 4: Brand Creation
PDF
Lecture3 ExperientialMarketing
PDF
Lecture3 Tools and Content Creation
PDF
Lecture2: Innovation Workshop
PDF
Lecture0: introduction Online Marketing
PDF
Lecture2: Marketing and Social Media
PDF
Lecture1: E-Commerce Business Model
PDF
Lecture0: Introduction Social Commerce
PDF
DOC
Coursesyllabus_dbm630
Lecture 4 online and offline business model generation
Lecture 4: Brand Creation
Lecture3 ExperientialMarketing
Lecture3 Tools and Content Creation
Lecture2: Innovation Workshop
Lecture0: introduction Online Marketing
Lecture2: Marketing and Social Media
Lecture1: E-Commerce Business Model
Lecture0: Introduction Social Commerce
Coursesyllabus_dbm630

Recently uploaded (20)

PDF
Reach Out and Touch Someone: Haptics and Empathic Computing
DOCX
The AUB Centre for AI in Media Proposal.docx
PDF
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
PDF
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
PDF
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
PPTX
MYSQL Presentation for SQL database connectivity
PDF
Empathic Computing: Creating Shared Understanding
PDF
Diabetes mellitus diagnosis method based random forest with bat algorithm
PPTX
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
PDF
Building Integrated photovoltaic BIPV_UPV.pdf
PDF
Network Security Unit 5.pdf for BCA BBA.
PDF
Chapter 3 Spatial Domain Image Processing.pdf
PPTX
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
PPT
“AI and Expert System Decision Support & Business Intelligence Systems”
PPTX
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
PDF
Spectral efficient network and resource selection model in 5G networks
PDF
NewMind AI Weekly Chronicles - August'25 Week I
PDF
Encapsulation_ Review paper, used for researhc scholars
PPT
Teaching material agriculture food technology
Reach Out and Touch Someone: Haptics and Empathic Computing
The AUB Centre for AI in Media Proposal.docx
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
MYSQL Presentation for SQL database connectivity
Empathic Computing: Creating Shared Understanding
Diabetes mellitus diagnosis method based random forest with bat algorithm
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
Digital-Transformation-Roadmap-for-Companies.pptx
Building Integrated photovoltaic BIPV_UPV.pdf
Network Security Unit 5.pdf for BCA BBA.
Chapter 3 Spatial Domain Image Processing.pdf
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
“AI and Expert System Decision Support & Business Intelligence Systems”
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
Spectral efficient network and resource selection model in 5G networks
NewMind AI Weekly Chronicles - August'25 Week I
Encapsulation_ Review paper, used for researhc scholars
Teaching material agriculture food technology

Dbm630 lecture08

  • 1. DBM630: Data Mining and Data Warehousing MS.IT. Rangsit University Semester 2/2011 Lecture 8 Classification and Prediction Evaluation by Kritsada Sriphaew (sriphaew.k AT gmail.com) 1
  • 2. Topics  Train, Test and Validation sets  Evaluation on Large data  Unbalanced data  Evaluation on Small data  Cross validation  Bootstrap  Comparing data mining schemes  Significance test  Lift Chart / ROC curve  Numeric Prediction Evaluation 2 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 3. Evaluation in Classification Tasks  How predictive is the model we learned?  Error on the training data is not a good indicator of performance on future data  Q: Why?  A: Because new data will probably not be exactly the same as the training data!  Overfitting – fitting the training data too precisely - usually leads to poor results on new data 3 Classification and Prediction: Evaluation
  • 4. Model Selection and Bias-Variance Tradeoff  Typical behavior of the test and training error, as model complexity is varied. High Bias Low Bias Low Variance High Variance Prediction Error Test Sample Training Sample Low Model Complexity High 4 Classification and Prediction: Evaluation
  • 5. Classifier error rate  Natural performance measure for classification problems: error rate  Success: instance’s class is predicted correctly  Error: instance’s class is predicted incorrectly  Error rate: proportion of errors made over the whole set of instances  Resubstitution error: error rate on training data (too optimistic way!) Accuracy Error rate (1-Accuracy)  Generalization error: error rate 15 % data 13 % 85 % 87 % on test 2% improvement 2% error reduction 2.35% improvement rate 13.3% error reduction rate 5 Classification and Prediction: Evaluation
  • 6. Evaluation on LARGE data  If many (thousands) of examples are available, including several hundred examples from each class, then how can we evaluate our classifier model?  A simple evaluation is sufficient  For example, randomly split data into training and test sets (usually 2/3 for train, 1/3 for test)  Build a classifier using the train set and evaluate it using the test set. 6 Classification and Prediction: Evaluation
  • 7. Classification Step 1: Split data into train and test sets THE PAST Results Known + + Training set - - + Data Testing set 7 Classification and Prediction: Evaluation
  • 8. Classification Step 2: Build a model on a training set THE PAST Results Known + + Training set - - + Data Model Builder Testing set 8 Classification and Prediction: Evaluation
  • 9. Classification Step 3: Evaluate on test set (and may be re-train) THE PAST Results Known + + Training set - - + Data Model Builder feedback Predictions + Y N - + Testing set - 9 Classification and Prediction: Evaluation
  • 10. A note on parameter tuning  It is important that the test data is not used in any way to create the classifier  Some learning schemes operate in two stages:  Stage 1: builds the basic structure  Stage 2: optimizes parameter settings  The test data can’t be used for parameter tuning!  Proper procedure uses three sets: training data, validation data, and test data  Validation data is used to optimize parameters 10 Classification and Prediction: Evaluation
  • 11. Classification: Train, Validation, Test split Results Known + Training set Model + - - Builder + Data Evaluate Model Builder Predictions + - Y N + Validation set - + - Final Evaluation + Final Test Set Final Model - 11 Classification and Prediction: Evaluation
  • 12. Unbalanced data  Sometimes, classes have very unequal frequency  Accommodation prediction: 97% stay, 3% don’t stay  medical diagnosis: 90% healthy, 10% disease  eCommerce: 99% don’t buy, 1% buy  Security: >99.99% of Americans are not terrorists  Similar situation with multiple classes  Majority class classifier can be 97% correct, but useless  Solution: With two classes, a good approach is to build BALANCED train and test sets, and train model on a balanced set  randomly select desired number of minority class instances  add equal number of randomly selected majority class  That is, we ignore the effect of the number of instances for each class. 12 Classification and Prediction: Evaluation
  • 13. Evaluation on SMALL data  The holdout method reserves a certain amount for testing and uses the remainder for training  Usually: one third for testing, the rest for training  For “unbalanced” datasets, samples might not be representative  Few or none instances of some classes  Stratified sample: advanced version of balancing the data  Make sure that each class is represented with approximately equal proportions in both subsets  What if we have a small data set?  The chosen 2/3 for training may not be representative.  The chosen 1/3 for testing may not be representative. 13 Classification and Prediction: Evaluation
  • 14. Repeated Holdout Method  Holdout estimate can be made more reliable by repeating the process with different subsamples  In each iteration, a certain proportion is randomly selected for training (possibly with stratification)  The error rates on the different iterations are averaged to yield an overall error rate  This is called the repeated holdout method  Still not optimum: the different test sets overlap.  Can we prevent overlapping? 14 Classification and Prediction: Evaluation
  • 15. Cross-validation  Cross-validation avoids overlapping test sets  First step: data is split into k subsets of equal size  Second step: each subset in turn is used for testing and the remainder for training  This is called k-fold cross-validation  Often the subsets are stratified before the cross- validation is performed  The error estimates are averaged to yield an overall error estimate 15 Classification and Prediction: Evaluation
  • 16. Cross-validation Example  Break up data into groups of the same size (possibly with stratification)  Hold aside one group for testing and use the rest to build model Test Train  Repeat by another test data until 16 Classification and Prediction: Evaluation
  • 17. More on cross-validation  Standard method for evaluation: stratified ten-fold cross-validation  Why ten? Extensive experiments have shown that this is the best choice to get an accurate estimate  Stratification reduces the estimate’s variance  Even better: repeated stratified cross-validation  E.g. ten-fold cross-validation is repeated ten times and results are averaged (reduces the variance) 17 Classification and Prediction: Evaluation
  • 18. Leave-One-Out cross-validation  Leave-One-Out: Remove one instance for testing and the other for training a particular form of cross-validation:  Set the number of folds equal to the number of training instances  i.e., for n training instances, build classifier n times  Makes best use of the data  Involves no random subsampling  Very computationally expensive 18 Classification and Prediction: Evaluation
  • 19. Leave-One-Out-CV and stratification  Disadvantage of Leave-One-Out-CV: stratification is not possible  It guarantees a non-stratified sample because there is only one instance in the test set!  Extreme example: random dataset split equally into two classes  Best inducer predicts majority class  50% accuracy on fresh data  Leave-One-Out-CV estimate is 100% error! 19 Classification and Prediction: Evaluation
  • 20. *The bootstrap  CV uses sampling without replacement  The same instance, once selected, can not be selected again for a particular training/test set  The bootstrap uses sampling with replacement to form the training set  Sample a dataset of n instances n times with replacement to form a new dataset of n instances  Use this data as the training set  Use the instances from the original dataset that do not occur in the new training set for testing 20 Classification and Prediction: Evaluation
  • 21. Evaluating the Accuracy of a Classifier or Predictor  Bootstrap method  The training tuples are sampled uniformly with replacement  Each time a tuple is selected, it is equally likely to be selected again and readded to the training set  There are several bootstrap method – the commonly used one is .632 bootstrap which works as follows  Given a data set of d tuples  The data set is sampled d times, with replacement, resulting bootstrap sample of training set of d samples  It is very likely that some of the original data tuples will occur more than once in this sample  The data tuples that did not make it into the training set end up forming the test set  Suppose we try this out several times – on average 63.2% of original data tuple will end up in the bootstrap, and the remaining 36.8% will form the test set 21 Classification and Prediction: Evaluation
  • 22. *The 0.632 bootstrap  Also called the 0.632 bootstrap  For n instances, a particular instance has a probability of 1–1/n of not being picked  Thus its probability of ending up in the test data is: n  1  1    e 1  0.368  n  This means the training data will contain approximately 63.2% of the instances Note: e is an irrational constant approximately equal to 2.718281828 22 Classification and Prediction: Evaluation
  • 23. *Estimating error with the bootstrap  The error estimate on the test data will be very pessimistic  Trained on just ~63% of the instances  Therefore, combine it with the re-substitution error: err  0.632  errortest_instances  0.368  errortraining_instances  The re-substitution error gets less weight than the error on the test data  Repeat process several times with different replacement samples; average the results 23 Classification and Prediction: Evaluation
  • 24. *More on the bootstrap  Probably the best way of estimating performance for very small datasets  However, it has some problems  Consider the random dataset from above  A perfect memorizer will achieve 0% resubstitution error and ~50% error on test data  Bootstrap estimate for this classifier: err  0.632  50%  0.368  0%  31.6%  True expected error: 50% 24 Classification and Prediction: Evaluation
  • 25. Evaluating Two-class Classification (Lift Chart vs. ROC Curve)  Information Retrieval or Search Engine  An application to find a set of related documents given a set of keywords.  Hard Decision vs. Soft Decision  Focus on soft decision  Multiclass  Class probability (Ranking)  Class by class evaluation  Example: promotional mailout  Situation 1: classifier predicts that 0.1% of all households will respond  Situation 2: classifier predicts that 0.4% of the 100000 most promising households will respond 27 Classification and Prediction: Evaluation
  • 26. Confusion Matrix (Two-class)  Also called contingency table Actual Class Yes No True False Yes Predicted Positive Positive Class False True No Negative Negative 28 Classification and Prediction: Evaluation
  • 27. Measures in Information Retrieval  precision: Percentage of retrieved documents that are relevant. TP  recall: Percentage of relevant precision  documents that are returned. TP  FP F-measure: The combination TP  recall  measure of recall and precision TP  FN  Precision/recall curves have 2  recall  precision F  measure  hyperbolic shape recall  precision  Summary measures: average precision at 20%, 50% and 80% recall (three-point average recall) 29 Classification and Prediction: Evaluation
  • 28. Measures in Two-Class Classification For Positive Class For Negative Class TP TN precision  precision  TP  FP TN  FN TP TN recall  recall  TP  FN TN  FP TP  TN TP  TN accuracy  accuracy  TP  TN  FP  FN TP  TN  FP  FN  Usually, we focus only positive class FP (“True” cases or “Yes” cases), FP Rate  TN  FP therefore, only precision and recall TP of positive class are used for TP Rate   recall performance comparison TP  FN 30 Classification and Prediction: Evaluation
  • 29. Confusion matrix: Example Actual Buys_computer Buys_computer Total Predict = yes = no Buys_computer=yes 6,954 46 7,000 Buys_computer=no 412 2,588 3,000 Total 7,366 2,634 10,000 No. of tuple of class buys_computer=yes that were labeled by a classifier as class buys_computer=no: FN No. of tuple of class buys_computer=no that were labeled by a classifier as class buys_computer=yes: FP 31
  • 30. Cumulative Gains Chart/Lift Chart/ ROC curve  They are visual aids for measuring model performance  Cumulative Gains is a measure of the effectiveness of predictive model on TP Rate (%true responses)  Lift is a measure of the effectiveness of a predictive model calculated as the ratio between the results obtained with and without the predictive model  Cumulative gains and lift charts consist of a lift curve and a baseline. The greater the area between the lift curve and the baseline, the better the model.  ROC is a measure of the effectiveness of a predictive model on %positive response against %negative response TP Rate against FP Rate. The greater the area under ROC curve, the better the model. 32 Classification and Prediction: Evaluation
• 31. Generating Charts
  Instances are sorted according to their predicted probability of being a true positive.
  33 Classification and Prediction: Evaluation
• 32. Cumulative Gains Chart
  The x-axis shows the percentage of samples contacted.
  The y-axis shows the percentage of positive responses (the true positive rate), as a percentage of the total possible positive responses.
  Baseline (overall response rate): if we sample X% of the data, then we will receive X% of the total positive responses.
  Lift curve: using the predictions of the response model, calculate the percentage of positive responses for each percentage of customers contacted, and map these points to create the curve.
  34 Classification and Prediction: Evaluation
• 33. A Sample Cumulative Gains Chart
  [Figure: gains curve above the diagonal; y-axis = % positive responses (TP rate), x-axis = % of samples from 10% to 100%; baseline: %positive_responses = %sample_size]
  35 Classification and Prediction: Evaluation
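One plausible way to compute the points of such a chart, assuming binary labels with 1 marking a positive response (a sketch, not the deck's own code):

```python
import numpy as np

def cumulative_gains(scores, labels):
    """x = fraction of instances contacted, ranked by predicted score;
    y = fraction of all positive responses captured at that point."""
    order = np.argsort(scores)[::-1]            # highest score first
    hits = (np.asarray(labels)[order] == 1)
    x = np.arange(1, len(hits) + 1) / len(hits)
    y = np.cumsum(hits) / hits.sum()
    return x, y                                 # baseline is y = x
```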
• 34. Lift Chart
  The x-axis shows the percentage of samples contacted.
  The y-axis shows the ratio of true positives obtained with the model to true positives obtained without it.
  To plot the chart: calculate the points on the lift curve by determining the ratio between the result predicted by our model and the result using no model.
  Example: when contacting 10% of customers, using no model we should reach 10% of the responders, while using the given model we should reach 30% of the responders. The y-value of the lift curve at 10% is therefore 30 / 10 = 3.
  36 Classification and Prediction: Evaluation
• 35. A Sample Lift Chart
  [Figure: lift curve; y-axis = lift, x-axis = % of samples from 10% to 100%; baseline: lift = %sample_size]
  37 Classification and Prediction: Evaluation
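Continuing the sketch above, the lift curve is just the gains curve divided by the baseline; a value of 3 at 10% of samples corresponds to the 30/10 example on the previous slide. This assumes `cumulative_gains` from the gains-chart sketch.

```python
def lift_curve(scores, labels):
    """Lift at each sample fraction: how many times more responders the
    model finds than random mailing would at the same fraction."""
    x, gains = cumulative_gains(scores, labels)
    return x, gains / x                         # no-model lift is 1.0
```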
• 36. ROC Curve
  ROC curves are similar to lift charts.
  "ROC" stands for "receiver operating characteristic", used in signal detection to show the tradeoff between hit rate and false alarm rate over a noisy channel.
  Differences:
    The y-axis shows the percentage of true positives in the sample.
    The x-axis shows the percentage of false positives in the sample (rather than the sample size).
  38 Classification and Prediction: Evaluation
• 37. A Sample ROC Curve
  [Figure: ROC curve for the mailout example; annotations: 1000 responds, 400 responds, 1000000−1000 mailouts; y-axis = true positive rate, x-axis = false positive rate; baseline: TP rate = FP rate]
  39 Classification and Prediction: Evaluation
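A matching sketch for ROC points, sweeping the decision threshold down the ranked list (one point per instance, as in the practice table later in these slides):

```python
import numpy as np

def roc_points(scores, labels):
    """TP rate and FP rate after each instance in the ranking."""
    order = np.argsort(scores)[::-1]            # highest score first
    pos = (np.asarray(labels)[order] == 1)
    tpr = np.cumsum(pos) / pos.sum()            # TP / (TP + FN)
    fpr = np.cumsum(~pos) / (~pos).sum()        # FP / (FP + TN)
    return fpr, tpr                             # baseline: tpr = fpr
```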
• 38. Extending to Multiple-Class Classification

                        PREDICTED CLASS
                        C1        C2        ...  Cn        Sum   Recall
  ACTUAL     C1         n11       n12       ...  n1n       R1    n11/R1
  CLASS      C2         n21       n22       ...  n2n       R2    n22/R2
             ...
             Cn         nn1       nn2       ...  nnn       Rn    nnn/Rn
  Sum                   P1        P2        ...  Pn        T
  Precision             n11/P1    n22/P2    ...  nnn/Pn    (n11 + n22 + ... + nnn)/T

  40 Classification and Prediction: Evaluation
• 39. Measures in Multiple-Class Classification
      precision(Ci) = nii / Pi
      recall(Ci)    = nii / Ri
      accuracy      = (n11 + n22 + ... + nnn) / T
  41 Classification and Prediction: Evaluation
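These per-class formulas translate directly to code. A sketch, assuming cm[i][j] counts instances of actual class i predicted as class j, as in the matrix above:

```python
import numpy as np

def per_class_metrics(cm):
    cm = np.asarray(cm, dtype=float)
    diag = np.diag(cm)                  # n11, n22, ..., nnn
    precision = diag / cm.sum(axis=0)   # column sums are P1..Pn
    recall    = diag / cm.sum(axis=1)   # row sums are R1..Rn
    accuracy  = diag.sum() / cm.sum()   # (n11 + ... + nnn) / T
    return precision, recall, accuracy
```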
• 40. Numeric Prediction Evaluation
  Same strategies: independent test set, cross-validation, significance tests, etc.
  Difference: the error measures (generalization error) change.
  Actual target values: a1, a2, ..., an
  Predicted target values: p1, p2, ..., pn
  The most popular measure, the mean-squared error, is easy to manipulate mathematically:
      MSE = ((p1 − a1)² + (p2 − a2)² + ... + (pn − an)²) / n
  42 Classification and Prediction: Evaluation
• 41. Other Measures
  The root mean-squared (RMS) error:
      RMSE = sqrt( ((p1 − a1)² + (p2 − a2)² + ... + (pn − an)²) / n )
  The mean absolute error is less sensitive to outliers than the mean-squared error:
      MAE = (|p1 − a1| + |p2 − a2| + ... + |pn − an|) / n
  Sometimes relative error values are more appropriate (e.g. 10% for an error of 50 when predicting 500).
  43 Classification and Prediction: Evaluation
• 42. Improvement on the Mean
  Often we want to know how much the scheme improves on simply predicting the average.
  The relative squared error (where ā is the average of the actual values):
      RSE = ((p1 − a1)² + ... + (pn − an)²) / ((ā − a1)² + ... + (ā − an)²)
  The relative absolute error:
      RAE = (|p1 − a1| + ... + |pn − an|) / (|ā − a1| + ... + |ā − an|)
  44 Classification and Prediction: Evaluation
• 43. The Correlation Coefficient
  Measures the statistical correlation between the predicted values and the actual values:
      correlation = S_PA / sqrt(S_P × S_A)
  where
      S_PA = Σ (pi − p̄)(ai − ā) / (n − 1)
      S_P  = Σ (pi − p̄)² / (n − 1)
      S_A  = Σ (ai − ā)² / (n − 1)
  Scale independent, between −1 and +1. Good performance leads to large values!
  45 Classification and Prediction: Evaluation
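All of the numeric-prediction measures above fit in a few lines; a sketch with our own function name:

```python
import numpy as np

def numeric_prediction_measures(p, a):
    """Error measures for predictions p against actual values a."""
    p, a = np.asarray(p, float), np.asarray(a, float)
    mse  = np.mean((p - a) ** 2)
    rmse = np.sqrt(mse)
    mae  = np.mean(np.abs(p - a))
    rse  = np.sum((p - a) ** 2) / np.sum((a.mean() - a) ** 2)
    rae  = np.sum(np.abs(p - a)) / np.sum(np.abs(a.mean() - a))
    corr = np.corrcoef(p, a)[0, 1]     # sample correlation coefficient
    return {"MSE": mse, "RMSE": rmse, "MAE": mae,
            "RSE": rse, "RAE": rae, "correlation": corr}
```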
  • 44. Practice 1  A company wants to do a mail marketing campaign. It costs the company $1 for each item mailed. They have information on 100,000 customers. Create a cumulative gains and a lift chart from the following data. 46 Classification and Prediction: Evaluation
  • 45. Results 47 Classification and Prediction: Evaluation
• 46. Practice 2
  Using the response model P(x) = Salary(x)/1000 + Age(x) for customer x and the data table shown below, construct the cumulative gains and lift charts and the ROC curve. Ties in ranking should be arbitrarily broken by assigning the higher rank to whoever appears first in the table.

  Customer Name   Salary   Age   Actual Response   P(x)   Rank
  A               10000    39    N
  B               50000    21    Y
  C               65000    25    Y
  D               62000    30    Y
  E               67000    19    Y
  F               69000    48    N
  G               65000    12    Y
  H               64000    51    N
  I               71000    65    Y
  J               73000    42    N

  48 Classification and Prediction: Evaluation
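One way to fill in the P(x) and Rank columns; a sketch using pandas, which is our choice of tool, not the deck's:

```python
import pandas as pd

df = pd.DataFrame({
    "Customer": list("ABCDEFGHIJ"),
    "Salary": [10000, 50000, 65000, 62000, 67000,
               69000, 65000, 64000, 71000, 73000],
    "Age":    [39, 21, 25, 30, 19, 48, 12, 51, 65, 42],
    "Actual": list("NYYYYNYNYN"),
})
df["P(x)"] = df["Salary"] / 1000 + df["Age"]
# method="first" breaks the P(x)=115 tie between H and J in table
# order, matching the tie-breaking rule stated above
df["Rank"] = df["P(x)"].rank(method="first", ascending=False).astype(int)
```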
• 47.

  Customer Name   Salary   Age   Actual Response   P(x)   Rank
  A               10000    39    N                 49     10
  B               50000    21    Y                 71     9
  C               65000    25    Y                 90     6
  D               62000    30    Y                 92     5
  E               67000    19    Y                 86     7
  F               69000    48    N                 117    2
  G               65000    12    Y                 77     8
  H               64000    51    N                 115    3
  I               71000    65    Y                 136    1
  J               73000    42    N                 115    4

  Ordered by P(x):

  Customer Name   Actual Response   P(x)   Rank
  I               Y                 136    1
  F               N                 117    2
  H               N                 115    3
  J               N                 115    4
  D               Y                 92     5
  C               Y                 90     6
  E               Y                 86     7
  G               Y                 77     8
  B               Y                 71     9
  A               N                 49     10

  [Figures: cumulative gains chart (% positive responses vs. % of samples, with baseline) and lift chart (lift vs. % of samples, with baseline), both plotted from 10% to 100% of samples]
  49
• 48.

  Customer Name   Actual Response   #Sample   TP   FP   TPR       FPR
  I               Y                 1         1    0    16.67%    0.00%
  F               N                 2         1    1    16.67%    25.00%
  H               N                 3         1    2    16.67%    50.00%
  J               N                 4         1    3    16.67%    75.00%
  C               Y                 5         2    3    33.33%    75.00%
  E               Y                 6         3    3    50.00%    75.00%
  D               Y                 7         4    3    66.67%    75.00%
  G               Y                 8         5    3    83.33%    75.00%
  B               Y                 9         6    3    100.00%   75.00%
  A               N                 10        6    4    100.00%   100.00%

  [Figure: ROC curve plotting TPR against FPR for the ranking above]
  50
• 49. *Predicting performance
  Assume the estimated error rate is 25%. How close is this to the true error rate?
  It depends on the amount of test data.
  Prediction is just like tossing a biased (!) coin: "head" is a "success", "tail" is an "error".
  In statistics, a succession of independent events like this is called a Bernoulli process.
  Statistical theory provides us with confidence intervals for the true underlying proportion!
  51 Classification and Prediction: Evaluation
• 50. *Confidence intervals
  We can say: p lies within a certain specified interval with a certain specified confidence.
  Example: S = 750 successes in N = 1000 trials
    Estimated success rate: 75%. How close is this to the true success rate p?
    Answer: with 80% confidence, p ∈ [73.2%, 76.7%]
  Another example: S = 75 successes in N = 100 trials
    Estimated success rate: 75%
    With 80% confidence, p ∈ [69.1%, 80.1%]
  52 Classification and Prediction: Evaluation
• 51. *Mean and variance (also Mod 7)
  Mean and variance for a Bernoulli trial: p and p(1 − p).
  Expected success rate: f = S/N.
  Mean and variance for f: p and p(1 − p)/N.
  For large enough N, f follows a normal distribution.
  The c% confidence interval [−z ≤ X ≤ z] for a random variable X with 0 mean is given by:
      Pr[−z ≤ X ≤ z] = c
  With a symmetric distribution:
      Pr[−z ≤ X ≤ z] = 1 − 2 × Pr[X ≥ z]
  53 Classification and Prediction: Evaluation
• 52. *Confidence limits
  Confidence limits for the normal distribution with 0 mean and a variance of 1:

      Pr[X ≥ z]   z
      0.1%        3.09
      0.5%        2.58
      1%          2.33
      5%          1.65
      10%         1.28
      20%         0.84
      40%         0.25

  Thus: Pr[−1.65 ≤ X ≤ 1.65] = 90%
  To use this, we have to reduce our random variable f to have 0 mean and unit variance.
  54 Classification and Prediction: Evaluation
• 53. *Transforming f
  Transformed value for f (subtract the mean and divide by the standard deviation):
      (f − p) / sqrt(p(1 − p)/N)
  Resulting equation:
      Pr[ −z ≤ (f − p)/sqrt(p(1 − p)/N) ≤ z ] = c
  Solving for p:
      p = ( f + z²/2N ± z × sqrt(f/N − f²/N + z²/4N²) ) / (1 + z²/N)
  55 Classification and Prediction: Evaluation
• 54. *Examples
  f = 75%, N = 1000, c = 80% (so that z = 1.28): p ∈ [0.732, 0.767]
  f = 75%, N = 100, c = 80% (so that z = 1.28): p ∈ [0.691, 0.801]
  Note that the normal distribution assumption is only valid for large N (i.e. N > 100).
  f = 75%, N = 10, c = 80% (so that z = 1.28): p ∈ [0.549, 0.881] (should be taken with a grain of salt)
  56 Classification and Prediction: Evaluation
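The interval formula from the previous slide is straightforward to evaluate; the sketch below (function name ours) reproduces the numbers above:

```python
from math import sqrt

def confidence_interval(f, n, z):
    """Interval for the true success rate p, given observed success rate f
    over n trials and the z-value for the chosen confidence level."""
    center = f + z * z / (2 * n)
    spread = z * sqrt(f / n - f * f / n + z * z / (4 * n * n))
    denom  = 1 + z * z / n
    return (center - spread) / denom, (center + spread) / denom

print(confidence_interval(0.75, 1000, 1.28))  # ≈ (0.732, 0.767)
print(confidence_interval(0.75, 100,  1.28))  # ≈ (0.691, 0.801)
print(confidence_interval(0.75, 10,   1.28))  # ≈ (0.549, 0.881)
```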