DBM630: Data Mining and Data Warehousing
MS.IT. Rangsit University, Semester 2/2011

Lecture 8: Classification and Prediction Evaluation
by Kritsada Sriphaew (sriphaew.k AT gmail.com)
Topics
- Train, Test and Validation sets
- Evaluation on Large data
  - Unbalanced data
- Evaluation on Small data
  - Cross validation
  - Bootstrap
- Comparing data mining schemes
  - Significance test
  - Lift Chart / ROC curve
- Numeric Prediction Evaluation
Evaluation in Classification Tasks
- How predictive is the model we learned?
- Error on the training data is not a good indicator of performance on future data.
  - Q: Why?
  - A: Because new data will probably not be exactly the same as the training data!
- Overfitting (fitting the training data too precisely) usually leads to poor results on new data.
Model Selection and Bias-Variance Tradeoff
- Typical behavior of the test and training error, as model complexity is varied.
[Figure: prediction error vs. model complexity. The training-sample error falls steadily as complexity grows, while the test-sample error falls and then rises again. The low-complexity end corresponds to high bias / low variance; the high-complexity end to low bias / high variance.]
Classifier error rate
- Natural performance measure for classification problems: error rate
  - Success: instance's class is predicted correctly
  - Error: instance's class is predicted incorrectly
  - Error rate: proportion of errors made over the whole set of instances
- Resubstitution error: error rate on the training data (too optimistic!)
- Generalization error: error rate on test data
- Example: the same 2-point gain reads very differently as accuracy vs. error:

                  Accuracy                  Error rate (1 - Accuracy)
    Result A      85%                       15%
    Result B      87%                       13%
    Difference    2% improvement            2% error reduction
    Relative      2.35% improvement rate    13.3% error reduction rate
Evaluation on LARGE data
- If many (thousands of) examples are available, including several hundred examples from each class, then how can we evaluate our classifier model?
- A simple evaluation is sufficient:
  - For example, randomly split the data into training and test sets (usually 2/3 for training, 1/3 for testing).
- Build a classifier using the training set and evaluate it using the test set.
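As a concrete illustration, a minimal Python sketch of this split-and-evaluate procedure might look as follows (the synthetic dataset and the choice of a decision tree are placeholder assumptions, not part of the lecture):

```python
# Minimal sketch: simple holdout evaluation on a large dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data standing in for a large labeled dataset.
X, y = make_classification(n_samples=3000, n_features=20, random_state=42)

# Randomly split: 2/3 for training, 1/3 for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=42)

model = DecisionTreeClassifier().fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```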
Classification Step 1: Split data into train and test sets
[Diagram: historical data with known results ("THE PAST", instances labeled + or -) is divided into a training set and a testing set.]
Classification Step 2: Build a model on a training set
[Diagram: the training set is fed into a model builder; the testing set is held aside.]
Classification Step 3: Evaluate on test set (and maybe re-train)
[Diagram: the model built on the training set predicts Y/N labels for the testing set; the predictions (+/-) are compared against the known results, and the feedback can be used to re-train the model.]
A note on parameter tuning
- It is important that the test data is not used in any way to create the classifier.
- Some learning schemes operate in two stages:
  - Stage 1: build the basic structure
  - Stage 2: optimize parameter settings
- The test data cannot be used for parameter tuning!
- The proper procedure uses three sets: training data, validation data, and test data.
  - Validation data is used to optimize parameters.
Classification: Train, Validation, Test split
[Diagram: the training set feeds a model builder; predictions on the validation set are evaluated to tune the parameters; the tuned final model is then evaluated once on the final test set.]
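A minimal sketch of this three-way procedure, assuming a hypothetical tunable parameter (tree depth) and synthetic placeholder data; the test set is touched only once, at the end:

```python
# Minimal sketch: train/validation/test split with parameter tuning.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=3000, random_state=0)

# Carve off the final test set first, then split the rest into train/validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

# Tune the parameter on the validation set only.
best_depth, best_acc = None, -1.0
for depth in [2, 4, 8, 16]:
    acc = DecisionTreeClassifier(max_depth=depth).fit(X_train, y_train).score(X_val, y_val)
    if acc > best_acc:
        best_depth, best_acc = depth, acc

# Re-train on train+validation with the chosen parameter; evaluate once on the test set.
final = DecisionTreeClassifier(max_depth=best_depth).fit(X_rest, y_rest)
print("Final test accuracy:", final.score(X_test, y_test))
```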
Unbalanced data
- Sometimes, classes have very unequal frequency:
  - Accommodation prediction: 97% stay, 3% don't stay
  - Medical diagnosis: 90% healthy, 10% disease
  - eCommerce: 99% don't buy, 1% buy
  - Security: >99.99% of Americans are not terrorists
- A similar situation arises with multiple classes.
- A majority-class classifier can be 97% correct, but useless.
- Solution: with two classes, a good approach is to build BALANCED train and test sets, and train the model on a balanced set (see the sketch below):
  - Randomly select the desired number of minority-class instances.
  - Add an equal number of randomly selected majority-class instances.
  - That is, we ignore the effect of the number of instances for each class.
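One way this balancing might be done in numpy (the class proportions and feature matrix below are invented for illustration):

```python
# Minimal sketch: build a BALANCED set by downsampling the majority class.
import numpy as np

rng = np.random.default_rng(0)
y = np.array([0] * 970 + [1] * 30)        # 97% majority (0), 3% minority (1)
X = rng.normal(size=(len(y), 5))          # placeholder features

minority_idx = np.where(y == 1)[0]
majority_idx = np.where(y == 0)[0]

# Take all minority instances and an equal number of random majority instances.
sampled_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
balanced_idx = rng.permutation(np.concatenate([minority_idx, sampled_majority]))

X_bal, y_bal = X[balanced_idx], y[balanced_idx]
print("Balanced class counts:", np.bincount(y_bal))   # 30 and 30
```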
Evaluation on SMALL data
- The holdout method reserves a certain amount for testing and uses the remainder for training.
  - Usually: one third for testing, the rest for training.
- For "unbalanced" datasets, samples might not be representative:
  - Few or no instances of some classes.
- Stratified sample: an advanced version of balancing the data.
  - Make sure that each class is represented with approximately equal proportions in both subsets.
- What if we have a small data set?
  - The chosen 2/3 for training may not be representative.
  - The chosen 1/3 for testing may not be representative.
Repeated Holdout Method
- The holdout estimate can be made more reliable by repeating the process with different subsamples:
  - In each iteration, a certain proportion is randomly selected for training (possibly with stratification).
  - The error rates from the different iterations are averaged to yield an overall error rate.
- This is called the repeated holdout method.
- Still not optimal: the different test sets overlap.
- Can we prevent overlapping?
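A minimal sketch of repeated (stratified) holdout, again on placeholder data; only the random seed changes between iterations:

```python
# Minimal sketch: repeated holdout, averaging the error over 10 random splits.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

errors = []
for seed in range(10):                       # 10 different subsamples
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=1/3, stratify=y, random_state=seed)
    acc = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)
    errors.append(1 - acc)

print("Repeated-holdout error estimate:", np.mean(errors))
```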
Cross-validation
- Cross-validation avoids overlapping test sets:
  - First step: the data is split into k subsets of equal size.
  - Second step: each subset in turn is used for testing and the remainder for training.
- This is called k-fold cross-validation.
- Often the subsets are stratified before the cross-validation is performed.
- The error estimates are averaged to yield an overall error estimate.
Cross-validation Example
- Break up the data into groups of the same size (possibly with stratification).
- Hold aside one group for testing and use the rest to build the model.
[Diagram: one block of the partitioned data serves as the test set, the remaining blocks as the training set.]
- Repeat with a different group as the test data until every group has been used for testing once.
More on cross-validation
- Standard method for evaluation: stratified ten-fold cross-validation.
- Why ten? Extensive experiments have shown that this is the best choice to get an accurate estimate.
- Stratification reduces the estimate's variance.
- Even better: repeated stratified cross-validation.
  - E.g., ten-fold cross-validation is repeated ten times and the results are averaged (reduces the variance).
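A minimal sketch of stratified ten-fold cross-validation with scikit-learn (placeholder data and classifier, as before):

```python
# Minimal sketch: stratified ten-fold cross-validation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

# Average the per-fold accuracies into an overall error estimate.
print("Cross-validated error estimate: %.3f" % (1 - np.mean(scores)))
```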
Leave-One-Out cross-validation
- Leave-One-Out: remove one instance for testing and use the others for training.
- A particular form of cross-validation:
  - Set the number of folds equal to the number of training instances.
  - I.e., for n training instances, build the classifier n times.
- Makes the best use of the data.
- Involves no random subsampling.
- Very computationally expensive.
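In scikit-learn this is just a different fold iterator; a minimal sketch (kept deliberately small, since n models are built):

```python
# Minimal sketch: leave-one-out cross-validation on a small dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=50, random_state=0)

# One fold per instance: the classifier is trained 50 times.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=LeaveOneOut())
print("LOO error estimate:", 1 - np.mean(scores))
```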
Leave-One-Out-CV and stratification
- Disadvantage of Leave-One-Out-CV: stratification is not possible.
  - It guarantees a non-stratified sample because there is only one instance in the test set!
- Extreme example: a random dataset split equally into two classes.
  - The best inducer predicts the majority class: 50% accuracy on fresh data.
  - The Leave-One-Out-CV estimate is 100% error! (Each held-out instance belongs to whichever class is now the minority in the training set, so it is always misclassified.)
*The bootstrap
- CV uses sampling without replacement:
  - The same instance, once selected, cannot be selected again for a particular training/test set.
- The bootstrap uses sampling with replacement to form the training set:
  - Sample a dataset of n instances n times with replacement to form a new dataset of n instances.
  - Use this data as the training set.
  - Use the instances from the original dataset that do not occur in the new training set for testing.
Evaluating the Accuracy of a Classifier or Predictor
- Bootstrap method:
  - The training tuples are sampled uniformly with replacement.
  - Each time a tuple is selected, it is equally likely to be selected again and re-added to the training set.
  - There are several bootstrap methods; the most commonly used is the .632 bootstrap, which works as follows:
    - Given a data set of d tuples, the data set is sampled d times, with replacement, resulting in a bootstrap training set of d samples.
    - It is very likely that some of the original data tuples will occur more than once in this sample.
    - The data tuples that did not make it into the training set end up forming the test set.
    - If we try this out several times, on average 63.2% of the original data tuples will end up in the bootstrap sample, and the remaining 36.8% will form the test set.
*The 0.632 bootstrap
- Also called the 0.632 bootstrap:
  - For n instances, a particular instance has a probability of 1 - 1/n of not being picked in one draw.
  - Thus its probability of ending up in the test data (never being picked in n draws) is:

      (1 - 1/n)^n ≈ e^(-1) ≈ 0.368

  - This means the training data will contain approximately 63.2% of the instances.
- Note: e is an irrational constant approximately equal to 2.718281828.
*Estimating error with the bootstrap
- The error estimate on the test data will be very pessimistic:
  - The model is trained on just ~63% of the instances.
- Therefore, combine it with the resubstitution error:

      err = 0.632 × err_test_instances + 0.368 × err_training_instances

- The resubstitution error gets less weight than the error on the test data.
- Repeat the process several times with different replacement samples and average the results (a minimal sketch of one round follows).
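A sketch of a single round of the 0.632 bootstrap, assuming placeholder data; in practice this loop would be repeated with different samples and the estimates averaged:

```python
# Minimal sketch: one round of the 0.632 bootstrap error estimate.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
n = len(y)
rng = np.random.default_rng(0)

# Sample n instances WITH replacement for training;
# instances never drawn form the test set (~36.8% of the data).
train_idx = rng.integers(0, n, size=n)
test_idx = np.setdiff1d(np.arange(n), train_idx)

model = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
err_test = 1 - model.score(X[test_idx], y[test_idx])
err_train = 1 - model.score(X[train_idx], y[train_idx])  # resubstitution error

err = 0.632 * err_test + 0.368 * err_train
print("0.632 bootstrap error estimate:", err)
```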
*More on the bootstrap
- Probably the best way of estimating performance for very small datasets.
- However, it has some problems:
  - Consider the random dataset from above.
  - A perfect memorizer will achieve 0% resubstitution error and ~50% error on the test data.
  - Bootstrap estimate for this classifier:

      err = 0.632 × 50% + 0.368 × 0% = 31.6%

  - True expected error: 50%
Evaluating Two-class Classification (Lift Chart vs. ROC Curve)
- Information Retrieval or Search Engine:
  - An application that finds a set of related documents given a set of keywords.
  - Hard decision vs. soft decision; here we focus on the soft decision:
    - Multiclass
    - Class probability (ranking)
- Class-by-class evaluation.
- Example: promotional mailout.
  - Situation 1: classifier predicts that 0.1% of all households will respond.
  - Situation 2: classifier predicts that 0.4% of the 100,000 most promising households will respond.
Confusion Matrix (Two-class)
- Also called a contingency table:

                              Actual Class
                              Yes               No
    Predicted     Yes         True Positive     False Positive
    Class         No          False Negative    True Negative
Measures in Information Retrieval
- precision: percentage of retrieved documents that are relevant.

      precision = TP / (TP + FP)

- recall: percentage of relevant documents that are returned.

      recall = TP / (TP + FN)

- F-measure: the combined measure of recall and precision.

      F-measure = (2 × recall × precision) / (recall + precision)

- Precision/recall curves have a hyperbolic shape.
- Summary measures: average precision at 20%, 50% and 80% recall (three-point average recall).
Measures in Two-Class Classification
For the positive class:

    precision = TP / (TP + FP)
    recall    = TP / (TP + FN)
    accuracy  = (TP + TN) / (TP + TN + FP + FN)

For the negative class:

    precision = TN / (TN + FN)
    recall    = TN / (TN + FP)
    accuracy  = (TP + TN) / (TP + TN + FP + FN)

Related rates:

    FP Rate = FP / (TN + FP)
    TP Rate = TP / (TP + FN) = recall

- Usually, we focus only on the positive class ("True" or "Yes" cases); therefore, only the precision and recall of the positive class are used for performance comparison.
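These measures are direct arithmetic on the confusion-matrix counts; a minimal sketch, using the counts from the example on the next slide:

```python
# Minimal sketch: two-class measures from confusion-matrix counts.
TP, FP, FN, TN = 6954, 46, 412, 2588   # from the buys_computer example below

precision = TP / (TP + FP)
recall    = TP / (TP + FN)             # also the TP rate
f_measure = 2 * recall * precision / (recall + precision)
accuracy  = (TP + TN) / (TP + TN + FP + FN)
fp_rate   = FP / (TN + FP)

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"F={f_measure:.3f} accuracy={accuracy:.3f} FP rate={fp_rate:.3f}")
```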
Confusion matrix: Example

    Predicted \ Actual       buys_computer = yes   buys_computer = no   Total
    buys_computer = yes      6,954                 46                   7,000
    buys_computer = no       412                   2,588                3,000
    Total                    7,366                 2,634                10,000

- FN = 412: tuples of class buys_computer = yes that were labeled by the classifier as buys_computer = no.
- FP = 46: tuples of class buys_computer = no that were labeled by the classifier as buys_computer = yes.
Cumulative Gains Chart / Lift Chart / ROC Curve
- These are visual aids for measuring model performance.
- Cumulative gains measures the effectiveness of a predictive model by its TP rate (% of true responses captured).
- Lift measures the effectiveness of a predictive model as the ratio between the results obtained with and without the model.
- Cumulative gains and lift charts consist of a lift curve and a baseline. The greater the area between the lift curve and the baseline, the better the model.
- ROC measures the effectiveness of a predictive model by plotting the TP rate (% positive responses) against the FP rate (% negative responses). The greater the area under the ROC curve, the better the model.
Generating Charts
- Instances are sorted according to their predicted probability of being a true positive. [Example table not included in the extraction.]
Cumulative Gains Chart
- The x-axis shows the percentage of samples.
- The y-axis shows the percentage of positive responses (the true positive rate), as a percentage of the total possible positive responses.
- Baseline (overall response rate): if we sample X% of the data, we will receive X% of the total positive responses.
- Lift curve: using the predictions of the response model, calculate the percentage of positive responses for each percentage of customers contacted, and map these points to create the curve.
A Sample Cumulative Gains Chart
[Figure: cumulative gains chart. x-axis: percentage of samples (10% to 100%); y-axis: TP rate / % positive responses (0 to 100%). The model's curve lies above the diagonal baseline, where %positive_responses = %sample_size.]
Lift Chart
- The x-axis shows the percentage of samples.
- The y-axis shows the ratio of true positives obtained with the model to true positives obtained without the model.
- To plot the chart: calculate the points on the lift curve by determining the ratio between the result predicted by the model and the result using no model.
- Example: when contacting 10% of customers, using no model we should get 10% of the responders, while using the given model we should get 30% of the responders. The y-value of the lift curve at 10% is 30 / 10 = 3.
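Both charts come from the same sorted ranking; a minimal sketch of computing the gains and lift points decile by decile (the scores and responses below are randomly generated placeholders):

```python
# Minimal sketch: cumulative gains and lift points from ranked predictions.
import numpy as np

rng = np.random.default_rng(0)
scores = rng.random(1000)                          # hypothetical model scores
actual = (rng.random(1000) < scores).astype(int)   # hypothetical responses

order = np.argsort(-scores)                        # sort descending by score
sorted_actual = actual[order]
total_pos = sorted_actual.sum()

for pct in range(10, 101, 10):
    k = int(len(actual) * pct / 100)
    gain = sorted_actual[:k].sum() / total_pos     # % of all positives captured
    lift = gain / (pct / 100)                      # ratio vs. using no model
    print(f"{pct:3d}% of samples: gains={gain:6.1%}  lift={lift:.2f}")
```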
A Sample Lift Chart
[Figure: lift chart. x-axis: percentage of samples (10% to 100%); y-axis: lift. The model's curve starts high and decays toward the baseline of lift = 1, the result of using no model.]
ROC Curve
- ROC curves are similar to lift charts.
  - "ROC" stands for "receiver operating characteristic".
  - Used in signal detection to show the tradeoff between hit rate and false alarm rate over a noisy channel.
- Differences:
  - The y-axis shows the percentage of true positives in the sample.
  - The x-axis shows the percentage of false positives in the sample (rather than the sample size).
A Sample ROC Curve
[Figure: ROC curve. y-axis: true positive rate; x-axis: false positive rate; diagonal baseline: TP Rate = FP Rate. Annotations from the mailout example: 1,000 total responses, a marked point at 400 responses, and 1,000,000 - 1,000 mailouts spanning the x-axis.]
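A minimal sketch of tracing a ROC curve: rank the instances by model score, then accumulate the TP and FP rates (the scores and labels below are invented for illustration):

```python
# Minimal sketch: ROC points from ranked predictions.
import numpy as np

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1])
actual = np.array([1,   1,   0,   1,   0,    1,   0,   0,   1,   0  ])

order = np.argsort(-scores)          # descending by score
ranked = actual[order]

# Cumulative TP rate and FP rate after each ranked instance.
tpr = np.cumsum(ranked) / ranked.sum()
fpr = np.cumsum(1 - ranked) / (1 - ranked).sum()
print("TPR:", np.round(tpr, 2))
print("FPR:", np.round(fpr, 2))
```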
Extending to Multiple-Class Classification

                             PREDICTED CLASS
                      C1        C2        ...   Cn        Sum   Recall
    ACTUAL    C1      n11       n12       ...   n1n       R1    n11/R1
    CLASS     C2      n21       n22       ...   n2n       R2    n22/R2
              ...
              Cn      nn1       nn2       ...   nnn       Rn    nnn/Rn
              Sum     P1        P2        ...   Pn        T
              Precision n11/P1  n22/P2    ...   nnn/Pn    (n11 + n22 + ... + nnn)/T
Measures in Multiple-Class Classification

    precision(Ci) = nii / Pi
    recall(Ci)    = nii / Ri
    accuracy      = (n11 + n22 + ... + nnn) / T
Numeric Prediction Evaluation
- Same strategies: independent test set, cross-validation, significance tests, etc.
- Difference: the error measures (generalization error).
- Actual target values: a1, a2, ..., an
- Predicted target values: p1, p2, ..., pn
- Most popular measure: mean-squared error

      ((p1 - a1)^2 + (p2 - a2)^2 + ... + (pn - an)^2) / n

- Easy to manipulate mathematically.
Other Measures
- The root mean-squared (rms) error:

      sqrt(((p1 - a1)^2 + (p2 - a2)^2 + ... + (pn - an)^2) / n)

- The mean absolute error is less sensitive to outliers than the mean-squared error:

      (|p1 - a1| + |p2 - a2| + ... + |pn - an|) / n

- Sometimes relative error values are more appropriate (e.g., 10% for an error of 50 when predicting 500).
Improvement on the Mean
- Often we want to know how much the scheme improves on simply predicting the average.
- The relative squared error is (where ā is the average of the actual values):

      ((p1 - a1)^2 + (p2 - a2)^2 + ... + (pn - an)^2) / ((ā - a1)^2 + (ā - a2)^2 + ... + (ā - an)^2)

- The relative absolute error is:

      (|p1 - a1| + |p2 - a2| + ... + |pn - an|) / (|ā - a1| + |ā - a2| + ... + |ā - an|)
The Correlation Coefficient
- Measures the statistical correlation between the predicted values and the actual values:

      correlation = S_PA / (S_P × S_A)

  where

      S_PA = Σ_i (p_i - p̄)(a_i - ā) / (n - 1)
      S_P  = sqrt(Σ_i (p_i - p̄)^2 / (n - 1))
      S_A  = sqrt(Σ_i (a_i - ā)^2 / (n - 1))

- Scale independent, between -1 and +1.
- Good performance leads to large values!
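All of these numeric-prediction measures are one-liners in numpy; a minimal sketch on hypothetical actual/predicted vectors:

```python
# Minimal sketch: numeric prediction measures.
import numpy as np

a = np.array([500.0, 300.0, 700.0, 450.0])   # actual target values
p = np.array([550.0, 280.0, 690.0, 500.0])   # predicted target values

mse  = np.mean((p - a) ** 2)
rmse = np.sqrt(mse)
mae  = np.mean(np.abs(p - a))

rse = np.sum((p - a) ** 2) / np.sum((a - a.mean()) ** 2)    # relative squared error
rae = np.sum(np.abs(p - a)) / np.sum(np.abs(a - a.mean()))  # relative absolute error

corr = np.corrcoef(p, a)[0, 1]               # correlation coefficient

print(f"MSE={mse:.1f} RMSE={rmse:.1f} MAE={mae:.1f} "
      f"RSE={rse:.3f} RAE={rae:.3f} corr={corr:.3f}")
```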
Practice 1
- A company wants to do a mail marketing campaign. It costs the company $1 for each item mailed. It has information on 100,000 customers. Create a cumulative gains chart and a lift chart from the following data. [Data table not included in the extraction.]
Results
[Figure: the resulting cumulative gains and lift charts for Practice 1.]
Practice 2
- Using the response model P(x) = Salary(x)/1000 + Age(x) for customer x and the data table shown below, construct the cumulative gains chart, lift chart and ROC curve. Ties in ranking should be broken arbitrarily by assigning the higher rank to whoever appears first in the table.

    Customer Name   Salary   Age   Actual Response   P(x)   Rank
    A               10000    39    N
    B               50000    21    Y
    C               65000    25    Y
    D               62000    30    Y
    E               67000    19    Y
    F               69000    48    N
    G               65000    12    Y
    H               64000    51    N
    I               71000    65    Y
    J               73000    42    N
Solution (Practice 2): scores and ranks.

    Customer   Salary   Age   Actual Response   P(x)   Rank
    A          10000    39    N                  49    10
    B          50000    21    Y                  71     9
    C          65000    25    Y                  90     6
    D          62000    30    Y                  92     5
    E          67000    19    Y                  86     7
    F          69000    48    N                 117     2
    G          65000    12    Y                  77     8
    H          64000    51    N                 115     3
    I          71000    65    Y                 136     1
    J          73000    42    N                 115     4

Ordered by P(x):

    Customer   Actual Response   P(x)   Rank
    I          Y                 136     1
    F          N                 117     2
    H          N                 115     3
    J          N                 115     4
    D          Y                  92     5
    C          Y                  90     6
    E          Y                  86     7
    G          Y                  77     8
    B          Y                  71     9
    A          N                  49    10

[Figure: the resulting cumulative gains chart (% positive responses vs. % of samples, with baseline) and lift chart (lift vs. % of samples, with baseline).]
Counting TP and FP down the ranking for the ROC curve:

    Customer   Actual Response   #Sampled   TP   FP   TPR        FPR
    I          Y                  1          1    0    16.67%     0.00%
    F          N                  2          1    1    16.67%    25.00%
    H          N                  3          1    2    16.67%    50.00%
    J          N                  4          1    3    16.67%    75.00%
    C          Y                  5          2    3    33.33%    75.00%
    E          Y                  6          3    3    50.00%    75.00%
    D          Y                  7          4    3    66.67%    75.00%
    G          Y                  8          5    3    83.33%    75.00%
    B          Y                  9          6    3   100.00%    75.00%
    A          N                 10          6    4   100.00%   100.00%

[Figure: the resulting ROC curve (TPR vs. FPR).]
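The whole Practice 2 computation can be reproduced in a few lines; a minimal sketch (the tie between H and J is broken by table order, as the exercise specifies, which Python's stable sort gives us for free):

```python
# Minimal sketch: score, rank and trace TPR/FPR for the Practice 2 data.
names  = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"]
salary = [10000, 50000, 65000, 62000, 67000, 69000, 65000, 64000, 71000, 73000]
age    = [39, 21, 25, 30, 19, 48, 12, 51, 65, 42]
actual = ["N", "Y", "Y", "Y", "Y", "N", "Y", "N", "Y", "N"]

# Response model: P(x) = Salary(x)/1000 + Age(x).
px = [s / 1000 + a for s, a in zip(salary, age)]

# Sort descending by P(x); the stable sort keeps ties in table order,
# so the earlier customer gets the higher rank.
ranked = sorted(zip(names, px, actual), key=lambda t: -t[1])

P = actual.count("Y")
N = actual.count("N")
tp = fp = 0
for rank, (name, score, resp) in enumerate(ranked, start=1):
    tp += (resp == "Y")
    fp += (resp == "N")
    print(f"{rank:2d}. {name}  P(x)={score:5.0f}  TPR={tp/P:7.2%}  FPR={fp/N:7.2%}")
```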
*Predicting performance
- Assume the estimated error rate is 25%. How close is this to the true error rate?
  - Depends on the amount of test data.
- Prediction is just like tossing a biased (!) coin:
  - "Head" is a "success", "tail" is an "error".
- In statistics, a succession of independent events like this is called a Bernoulli process.
- Statistical theory provides us with confidence intervals for the true underlying proportion!
*Confidence intervals
- We can say: p lies within a certain specified interval with a certain specified confidence.
- Example: S = 750 successes in N = 1000 trials.
  - Estimated success rate: 75%.
  - How close is this to the true success rate p?
  - Answer: with 80% confidence, p ∈ [73.2%, 76.7%].
- Another example: S = 75 and N = 100.
  - Estimated success rate: 75%.
  - With 80% confidence, p ∈ [69.1%, 80.1%].
*Mean and variance (also Mod 7)
- Mean and variance for a Bernoulli trial: p and p(1 - p).
- Expected success rate: f = S/N.
- Mean and variance for f: p and p(1 - p)/N.
- For large enough N, f follows a normal distribution.
- The c% confidence interval [-z ≤ X ≤ z] for a random variable with 0 mean is given by:

      Pr[-z ≤ X ≤ z] = c

- With a symmetric distribution:

      Pr[-z ≤ X ≤ z] = 1 - 2 × Pr[X ≥ z]
*Confidence limits
- Confidence limits for the normal distribution with 0 mean and a variance of 1:

      Pr[X ≥ z]    z
      0.1%         3.09
      0.5%         2.58
      1%           2.33
      5%           1.65
      10%          1.28
      20%          0.84
      40%          0.25

- Thus:

      Pr[-1.65 ≤ X ≤ 1.65] = 90%

- To use this, we have to reduce our random variable f to have 0 mean and unit variance.
*Transforming f
- Transformed value for f (i.e., subtract the mean and divide by the standard deviation):

      (f - p) / sqrt(p(1 - p)/N)

- Resulting equation:

      Pr[-z ≤ (f - p)/sqrt(p(1 - p)/N) ≤ z] = c

- Solving for p:

      p = ( f + z^2/(2N) ± z × sqrt(f/N - f^2/N + z^2/(4N^2)) ) / ( 1 + z^2/N )
*Examples
- f = 75%, N = 1000, c = 80% (so that z = 1.28): p ∈ [0.732, 0.767]
- f = 75%, N = 100, c = 80% (so that z = 1.28): p ∈ [0.691, 0.801]
- Note that the normal-distribution assumption is only valid for large N (i.e., N > 100).
- f = 75%, N = 10, c = 80% (so that z = 1.28): p ∈ [0.549, 0.881] (should be taken with a grain of salt)
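The examples above can be checked directly from the solved formula; a minimal sketch:

```python
# Minimal sketch: confidence interval for the true success rate p,
# using the formula solved for p on the previous slide.
from math import sqrt

def confidence_interval(f, n, z):
    centre = f + z * z / (2 * n)
    spread = z * sqrt(f / n - f * f / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom

print(confidence_interval(0.75, 1000, 1.28))   # ~ (0.732, 0.767)
print(confidence_interval(0.75, 100, 1.28))    # ~ (0.691, 0.801)
print(confidence_interval(0.75, 10, 1.28))     # ~ (0.549, 0.881)
```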

More Related Content

PDF
Lecture 9: Machine Learning in Practice (2)
PPTX
Presentation on supervised learning
PDF
Lecture 8: Machine Learning in Practice (1)
PDF
Lecture 3: Basic Concepts of Machine Learning - Induction & Evaluation
PPTX
Machine Learning - Splitting Datasets
PDF
Performance Evaluation for Classifiers tutorial
PPTX
Supervised learning
Lecture 9: Machine Learning in Practice (2)
Presentation on supervised learning
Lecture 8: Machine Learning in Practice (1)
Lecture 3: Basic Concepts of Machine Learning - Induction & Evaluation
Machine Learning - Splitting Datasets
Performance Evaluation for Classifiers tutorial
Supervised learning

What's hot (20)

PDF
Cross-validation Tutorial: What, how and which?
PDF
Cross validation
PPTX
K-Folds Cross Validation Method
PPTX
Machine Learning - Accuracy and Confusion Matrix
PPTX
Machine Learning
PDF
Lecture 2: Preliminaries (Understanding and Preprocessing data)
PDF
Lecture7 cross validation
PPTX
Feature Selection in Machine Learning
PPTX
Ml1 introduction to-supervised_learning_and_k_nearest_neighbors
PPTX
Machine Learning Unit 2 Semester 3 MSc IT Part 2 Mumbai University
PDF
An introduction to variable and feature selection
PPTX
Machine learning - session 3
PPTX
Lecture 6: Ensemble Methods
PDF
Module 5: Decision Trees
PPTX
Machine learning
PDF
Racing for unbalanced methods selection
PDF
Classification Based Machine Learning Algorithms
PPT
MachineLearning.ppt
PDF
Aaa ped-14-Ensemble Learning: About Ensemble Learning
Cross-validation Tutorial: What, how and which?
Cross validation
K-Folds Cross Validation Method
Machine Learning - Accuracy and Confusion Matrix
Machine Learning
Lecture 2: Preliminaries (Understanding and Preprocessing data)
Lecture7 cross validation
Feature Selection in Machine Learning
Ml1 introduction to-supervised_learning_and_k_nearest_neighbors
Machine Learning Unit 2 Semester 3 MSc IT Part 2 Mumbai University
An introduction to variable and feature selection
Machine learning - session 3
Lecture 6: Ensemble Methods
Module 5: Decision Trees
Machine learning
Racing for unbalanced methods selection
Classification Based Machine Learning Algorithms
MachineLearning.ppt
Aaa ped-14-Ensemble Learning: About Ensemble Learning
Ad

Viewers also liked (20)

PPTX
Statistics in the age of data science, issues you can not ignore
PDF
An Introduction to NLP4L
PPTX
Comparative Study of Granger Causality Algorithm for Gene Regulatory Network
PDF
PDF
PDF
PDF
PDF
Introduction to Data Warehousing
PPT
Datawarehouse and OLAP
PDF
Dbm630_lecture02-03
PDF
PDF
Cross-Validation
PDF
PPT
Data Mining and Data Warehousing
PDF
L2. Evaluating Machine Learning Algorithms I
PPTX
Apache kylin 2.0: from classic olap to real-time data warehouse
PPTX
Design cube in Apache Kylin
PPT
Datacube
PPTX
Apache Kylin’s Performance Boost from Apache HBase
Statistics in the age of data science, issues you can not ignore
An Introduction to NLP4L
Comparative Study of Granger Causality Algorithm for Gene Regulatory Network
Introduction to Data Warehousing
Datawarehouse and OLAP
Dbm630_lecture02-03
Cross-Validation
Data Mining and Data Warehousing
L2. Evaluating Machine Learning Algorithms I
Apache kylin 2.0: from classic olap to real-time data warehouse
Design cube in Apache Kylin
Datacube
Apache Kylin’s Performance Boost from Apache HBase
Ad

Similar to Dbm630 lecture08 (20)

PDF
Barga Data Science lecture 10
PPTX
Classification in the database system.pptx
PPT
Presentation
PDF
introducatio to ml introducatio to ml introducatio to ml
PPTX
Week 11 Model Evalaution Model Evaluation
PDF
06-00-ACA-Evaluation.pdf
PPT
Mining the LET Performance in Generating Prediction Models for OTDSS
PDF
Automated Testing and Safety Analysis of Deep Neural Networks
PPT
Overfitting and-tbl
PDF
Data mining chapter04and5-best
PPTX
Lect8 Classification & prediction
PPTX
Statistical Learning and Model Selection (1).pptx
PPT
Slides ppt
PDF
05-00-ACA-Data-Intro.pdf
PPT
3 DM Classification HFCS kilometres .ppt
PDF
Using machine learning in anti money laundering part 2
PDF
Optimization Technique for Feature Selection and Classification Using Support...
PDF
Testing and Deployment - Full Stack Deep Learning
PDF
Data Analytics For Beginners | Introduction To Data Analytics | Data Analytic...
PPTX
UNIT-II-Machine-Learning.pptx Machine Learning Different AI Models
Barga Data Science lecture 10
Classification in the database system.pptx
Presentation
introducatio to ml introducatio to ml introducatio to ml
Week 11 Model Evalaution Model Evaluation
06-00-ACA-Evaluation.pdf
Mining the LET Performance in Generating Prediction Models for OTDSS
Automated Testing and Safety Analysis of Deep Neural Networks
Overfitting and-tbl
Data mining chapter04and5-best
Lect8 Classification & prediction
Statistical Learning and Model Selection (1).pptx
Slides ppt
05-00-ACA-Data-Intro.pdf
3 DM Classification HFCS kilometres .ppt
Using machine learning in anti money laundering part 2
Optimization Technique for Feature Selection and Classification Using Support...
Testing and Deployment - Full Stack Deep Learning
Data Analytics For Beginners | Introduction To Data Analytics | Data Analytic...
UNIT-II-Machine-Learning.pptx Machine Learning Different AI Models

More from Tokyo Institute of Technology (11)

PDF
Lecture 4 online and offline business model generation
PDF
Lecture 4: Brand Creation
PDF
Lecture3 ExperientialMarketing
PDF
Lecture3 Tools and Content Creation
PDF
Lecture2: Innovation Workshop
PDF
Lecture0: introduction Online Marketing
PDF
Lecture2: Marketing and Social Media
PDF
Lecture1: E-Commerce Business Model
PDF
Lecture0: Introduction Social Commerce
PDF
DOC
Coursesyllabus_dbm630
Lecture 4 online and offline business model generation
Lecture 4: Brand Creation
Lecture3 ExperientialMarketing
Lecture3 Tools and Content Creation
Lecture2: Innovation Workshop
Lecture0: introduction Online Marketing
Lecture2: Marketing and Social Media
Lecture1: E-Commerce Business Model
Lecture0: Introduction Social Commerce
Coursesyllabus_dbm630

Recently uploaded (20)

PDF
Reach Out and Touch Someone: Haptics and Empathic Computing
DOCX
The AUB Centre for AI in Media Proposal.docx
PDF
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
PDF
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
PDF
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
PPTX
MYSQL Presentation for SQL database connectivity
PDF
Empathic Computing: Creating Shared Understanding
PDF
Diabetes mellitus diagnosis method based random forest with bat algorithm
PPTX
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
PDF
Building Integrated photovoltaic BIPV_UPV.pdf
PDF
Network Security Unit 5.pdf for BCA BBA.
PDF
Chapter 3 Spatial Domain Image Processing.pdf
PPTX
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
PPT
“AI and Expert System Decision Support & Business Intelligence Systems”
PPTX
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
PDF
Spectral efficient network and resource selection model in 5G networks
PDF
NewMind AI Weekly Chronicles - August'25 Week I
PDF
Encapsulation_ Review paper, used for researhc scholars
PPT
Teaching material agriculture food technology
Reach Out and Touch Someone: Haptics and Empathic Computing
The AUB Centre for AI in Media Proposal.docx
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
MYSQL Presentation for SQL database connectivity
Empathic Computing: Creating Shared Understanding
Diabetes mellitus diagnosis method based random forest with bat algorithm
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
Digital-Transformation-Roadmap-for-Companies.pptx
Building Integrated photovoltaic BIPV_UPV.pdf
Network Security Unit 5.pdf for BCA BBA.
Chapter 3 Spatial Domain Image Processing.pdf
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
“AI and Expert System Decision Support & Business Intelligence Systems”
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
Spectral efficient network and resource selection model in 5G networks
NewMind AI Weekly Chronicles - August'25 Week I
Encapsulation_ Review paper, used for researhc scholars
Teaching material agriculture food technology

Dbm630 lecture08

  • 1. DBM630: Data Mining and Data Warehousing MS.IT. Rangsit University Semester 2/2011 Lecture 8 Classification and Prediction Evaluation by Kritsada Sriphaew (sriphaew.k AT gmail.com) 1
  • 2. Topics  Train, Test and Validation sets  Evaluation on Large data  Unbalanced data  Evaluation on Small data  Cross validation  Bootstrap  Comparing data mining schemes  Significance test  Lift Chart / ROC curve  Numeric Prediction Evaluation 2 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 3. Evaluation in Classification Tasks  How predictive is the model we learned?  Error on the training data is not a good indicator of performance on future data  Q: Why?  A: Because new data will probably not be exactly the same as the training data!  Overfitting – fitting the training data too precisely - usually leads to poor results on new data 3 Classification and Prediction: Evaluation
  • 4. Model Selection and Bias-Variance Tradeoff  Typical behavior of the test and training error, as model complexity is varied. High Bias Low Bias Low Variance High Variance Prediction Error Test Sample Training Sample Low Model Complexity High 4 Classification and Prediction: Evaluation
  • 5. Classifier error rate  Natural performance measure for classification problems: error rate  Success: instance’s class is predicted correctly  Error: instance’s class is predicted incorrectly  Error rate: proportion of errors made over the whole set of instances  Resubstitution error: error rate on training data (too optimistic way!) Accuracy Error rate (1-Accuracy)  Generalization error: error rate 15 % data 13 % 85 % 87 % on test 2% improvement 2% error reduction 2.35% improvement rate 13.3% error reduction rate 5 Classification and Prediction: Evaluation
  • 6. Evaluation on LARGE data  If many (thousands) of examples are available, including several hundred examples from each class, then how can we evaluate our classifier model?  A simple evaluation is sufficient  For example, randomly split data into training and test sets (usually 2/3 for train, 1/3 for test)  Build a classifier using the train set and evaluate it using the test set. 6 Classification and Prediction: Evaluation
  • 7. Classification Step 1: Split data into train and test sets THE PAST Results Known + + Training set - - + Data Testing set 7 Classification and Prediction: Evaluation
  • 8. Classification Step 2: Build a model on a training set THE PAST Results Known + + Training set - - + Data Model Builder Testing set 8 Classification and Prediction: Evaluation
  • 9. Classification Step 3: Evaluate on test set (and may be re-train) THE PAST Results Known + + Training set - - + Data Model Builder feedback Predictions + Y N - + Testing set - 9 Classification and Prediction: Evaluation
  • 10. A note on parameter tuning  It is important that the test data is not used in any way to create the classifier  Some learning schemes operate in two stages:  Stage 1: builds the basic structure  Stage 2: optimizes parameter settings  The test data can’t be used for parameter tuning!  Proper procedure uses three sets: training data, validation data, and test data  Validation data is used to optimize parameters 10 Classification and Prediction: Evaluation
  • 11. Classification: Train, Validation, Test split Results Known + Training set Model + - - Builder + Data Evaluate Model Builder Predictions + - Y N + Validation set - + - Final Evaluation + Final Test Set Final Model - 11 Classification and Prediction: Evaluation
  • 12. Unbalanced data  Sometimes, classes have very unequal frequency  Accommodation prediction: 97% stay, 3% don’t stay  medical diagnosis: 90% healthy, 10% disease  eCommerce: 99% don’t buy, 1% buy  Security: >99.99% of Americans are not terrorists  Similar situation with multiple classes  Majority class classifier can be 97% correct, but useless  Solution: With two classes, a good approach is to build BALANCED train and test sets, and train model on a balanced set  randomly select desired number of minority class instances  add equal number of randomly selected majority class  That is, we ignore the effect of the number of instances for each class. 12 Classification and Prediction: Evaluation
  • 13. Evaluation on SMALL data  The holdout method reserves a certain amount for testing and uses the remainder for training  Usually: one third for testing, the rest for training  For “unbalanced” datasets, samples might not be representative  Few or none instances of some classes  Stratified sample: advanced version of balancing the data  Make sure that each class is represented with approximately equal proportions in both subsets  What if we have a small data set?  The chosen 2/3 for training may not be representative.  The chosen 1/3 for testing may not be representative. 13 Classification and Prediction: Evaluation
  • 14. Repeated Holdout Method  Holdout estimate can be made more reliable by repeating the process with different subsamples  In each iteration, a certain proportion is randomly selected for training (possibly with stratification)  The error rates on the different iterations are averaged to yield an overall error rate  This is called the repeated holdout method  Still not optimum: the different test sets overlap.  Can we prevent overlapping? 14 Classification and Prediction: Evaluation
  • 15. Cross-validation  Cross-validation avoids overlapping test sets  First step: data is split into k subsets of equal size  Second step: each subset in turn is used for testing and the remainder for training  This is called k-fold cross-validation  Often the subsets are stratified before the cross- validation is performed  The error estimates are averaged to yield an overall error estimate 15 Classification and Prediction: Evaluation
  • 16. Cross-validation Example  Break up data into groups of the same size (possibly with stratification)  Hold aside one group for testing and use the rest to build model Test Train  Repeat by another test data until 16 Classification and Prediction: Evaluation
  • 17. More on cross-validation  Standard method for evaluation: stratified ten-fold cross-validation  Why ten? Extensive experiments have shown that this is the best choice to get an accurate estimate  Stratification reduces the estimate’s variance  Even better: repeated stratified cross-validation  E.g. ten-fold cross-validation is repeated ten times and results are averaged (reduces the variance) 17 Classification and Prediction: Evaluation
  • 18. Leave-One-Out cross-validation  Leave-One-Out: Remove one instance for testing and the other for training a particular form of cross-validation:  Set the number of folds equal to the number of training instances  i.e., for n training instances, build classifier n times  Makes best use of the data  Involves no random subsampling  Very computationally expensive 18 Classification and Prediction: Evaluation
  • 19. Leave-One-Out-CV and stratification  Disadvantage of Leave-One-Out-CV: stratification is not possible  It guarantees a non-stratified sample because there is only one instance in the test set!  Extreme example: random dataset split equally into two classes  Best inducer predicts majority class  50% accuracy on fresh data  Leave-One-Out-CV estimate is 100% error! 19 Classification and Prediction: Evaluation
  • 20. *The bootstrap  CV uses sampling without replacement  The same instance, once selected, can not be selected again for a particular training/test set  The bootstrap uses sampling with replacement to form the training set  Sample a dataset of n instances n times with replacement to form a new dataset of n instances  Use this data as the training set  Use the instances from the original dataset that do not occur in the new training set for testing 20 Classification and Prediction: Evaluation
  • 21. Evaluating the Accuracy of a Classifier or Predictor  Bootstrap method  The training tuples are sampled uniformly with replacement  Each time a tuple is selected, it is equally likely to be selected again and readded to the training set  There are several bootstrap method – the commonly used one is .632 bootstrap which works as follows  Given a data set of d tuples  The data set is sampled d times, with replacement, resulting bootstrap sample of training set of d samples  It is very likely that some of the original data tuples will occur more than once in this sample  The data tuples that did not make it into the training set end up forming the test set  Suppose we try this out several times – on average 63.2% of original data tuple will end up in the bootstrap, and the remaining 36.8% will form the test set 21 Classification and Prediction: Evaluation
  • 22. *The 0.632 bootstrap  Also called the 0.632 bootstrap  For n instances, a particular instance has a probability of 1–1/n of not being picked  Thus its probability of ending up in the test data is: n  1  1    e 1  0.368  n  This means the training data will contain approximately 63.2% of the instances Note: e is an irrational constant approximately equal to 2.718281828 22 Classification and Prediction: Evaluation
  • 23. *Estimating error with the bootstrap  The error estimate on the test data will be very pessimistic  Trained on just ~63% of the instances  Therefore, combine it with the re-substitution error: err  0.632  errortest_instances  0.368  errortraining_instances  The re-substitution error gets less weight than the error on the test data  Repeat process several times with different replacement samples; average the results 23 Classification and Prediction: Evaluation
  • 24. *More on the bootstrap  Probably the best way of estimating performance for very small datasets  However, it has some problems  Consider the random dataset from above  A perfect memorizer will achieve 0% resubstitution error and ~50% error on test data  Bootstrap estimate for this classifier: err  0.632  50%  0.368  0%  31.6%  True expected error: 50% 24 Classification and Prediction: Evaluation
  • 25. Evaluating Two-class Classification (Lift Chart vs. ROC Curve)  Information Retrieval or Search Engine  An application to find a set of related documents given a set of keywords.  Hard Decision vs. Soft Decision  Focus on soft decision  Multiclass  Class probability (Ranking)  Class by class evaluation  Example: promotional mailout  Situation 1: classifier predicts that 0.1% of all households will respond  Situation 2: classifier predicts that 0.4% of the 100000 most promising households will respond 27 Classification and Prediction: Evaluation
  • 26. Confusion Matrix (Two-class)  Also called contingency table Actual Class Yes No True False Yes Predicted Positive Positive Class False True No Negative Negative 28 Classification and Prediction: Evaluation
  • 27. Measures in Information Retrieval  precision: Percentage of retrieved documents that are relevant. TP  recall: Percentage of relevant precision  documents that are returned. TP  FP F-measure: The combination TP  recall  measure of recall and precision TP  FN  Precision/recall curves have 2  recall  precision F  measure  hyperbolic shape recall  precision  Summary measures: average precision at 20%, 50% and 80% recall (three-point average recall) 29 Classification and Prediction: Evaluation
  • 28. Measures in Two-Class Classification For Positive Class For Negative Class TP TN precision  precision  TP  FP TN  FN TP TN recall  recall  TP  FN TN  FP TP  TN TP  TN accuracy  accuracy  TP  TN  FP  FN TP  TN  FP  FN  Usually, we focus only positive class FP (“True” cases or “Yes” cases), FP Rate  TN  FP therefore, only precision and recall TP of positive class are used for TP Rate   recall performance comparison TP  FN 30 Classification and Prediction: Evaluation
  • 29. Confusion matrix: Example Actual Buys_computer Buys_computer Total Predict = yes = no Buys_computer=yes 6,954 46 7,000 Buys_computer=no 412 2,588 3,000 Total 7,366 2,634 10,000 No. of tuple of class buys_computer=yes that were labeled by a classifier as class buys_computer=no: FN No. of tuple of class buys_computer=no that were labeled by a classifier as class buys_computer=yes: FP 31
  • 30. Cumulative Gains Chart/Lift Chart/ ROC curve  They are visual aids for measuring model performance  Cumulative Gains is a measure of the effectiveness of predictive model on TP Rate (%true responses)  Lift is a measure of the effectiveness of a predictive model calculated as the ratio between the results obtained with and without the predictive model  Cumulative gains and lift charts consist of a lift curve and a baseline. The greater the area between the lift curve and the baseline, the better the model.  ROC is a measure of the effectiveness of a predictive model on %positive response against %negative response TP Rate against FP Rate. The greater the area under ROC curve, the better the model. 32 Classification and Prediction: Evaluation
• 31. Generating Charts
  Instances are sorted according to their predicted probability of being a true positive.
  33 Classification and Prediction: Evaluation
• 32. Cumulative Gains Chart
  The x-axis shows the percentage of samples contacted.
  The y-axis shows the percentage of positive responses (the true positive rate), as a percentage of the total possible positive responses.
  Baseline (overall response rate): if we sample X% of the data, then we will receive X% of the total positive responses.
  Lift curve: using the predictions of the response model, calculate the percentage of positive responses for each percentage of customers contacted, and map these points to create the curve.
  34 Classification and Prediction: Evaluation
• 33. A Sample Cumulative Gains Chart
  [Figure: gains curve above the diagonal; y-axis = % positive responses (TP rate), x-axis = % of samples from 10% to 100%; baseline: %positive_responses = %sample_size]
  35 Classification and Prediction: Evaluation
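One plausible way to compute the points of such a chart, assuming binary labels with 1 marking a positive response (a sketch, not the deck's own code):

```python
import numpy as np

def cumulative_gains(scores, labels):
    """x = fraction of instances contacted, ranked by predicted score;
    y = fraction of all positive responses captured at that point."""
    order = np.argsort(scores)[::-1]            # highest score first
    hits = (np.asarray(labels)[order] == 1)
    x = np.arange(1, len(hits) + 1) / len(hits)
    y = np.cumsum(hits) / hits.sum()
    return x, y                                 # baseline is y = x
```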
• 34. Lift Chart
  The x-axis shows the percentage of samples contacted.
  The y-axis shows the ratio of true positives obtained with the model to true positives obtained without it.
  To plot the chart: calculate the points on the lift curve by determining the ratio between the result predicted by our model and the result using no model.
  Example: when contacting 10% of customers, using no model we should reach 10% of the responders, while using the given model we should reach 30% of the responders. The y-value of the lift curve at 10% is therefore 30 / 10 = 3.
  36 Classification and Prediction: Evaluation
• 35. A Sample Lift Chart
  [Figure: lift curve; y-axis = lift, x-axis = % of samples from 10% to 100%; baseline: lift = %sample_size]
  37 Classification and Prediction: Evaluation
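Continuing the sketch above, the lift curve is just the gains curve divided by the baseline; a value of 3 at 10% of samples corresponds to the 30/10 example on the previous slide. This assumes `cumulative_gains` from the gains-chart sketch.

```python
def lift_curve(scores, labels):
    """Lift at each sample fraction: how many times more responders the
    model finds than random mailing would at the same fraction."""
    x, gains = cumulative_gains(scores, labels)
    return x, gains / x                         # no-model lift is 1.0
```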
• 36. ROC Curve
  ROC curves are similar to lift charts.
  "ROC" stands for "receiver operating characteristic", used in signal detection to show the tradeoff between hit rate and false alarm rate over a noisy channel.
  Differences:
    The y-axis shows the percentage of true positives in the sample.
    The x-axis shows the percentage of false positives in the sample (rather than the sample size).
  38 Classification and Prediction: Evaluation
• 37. A Sample ROC Curve
  [Figure: ROC curve for the mailout example; annotations: 1000 responds, 400 responds, 1000000−1000 mailouts; y-axis = true positive rate, x-axis = false positive rate; baseline: TP rate = FP rate]
  39 Classification and Prediction: Evaluation
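A matching sketch for ROC points, sweeping the decision threshold down the ranked list (one point per instance, as in the practice table later in these slides):

```python
import numpy as np

def roc_points(scores, labels):
    """TP rate and FP rate after each instance in the ranking."""
    order = np.argsort(scores)[::-1]            # highest score first
    pos = (np.asarray(labels)[order] == 1)
    tpr = np.cumsum(pos) / pos.sum()            # TP / (TP + FN)
    fpr = np.cumsum(~pos) / (~pos).sum()        # FP / (FP + TN)
    return fpr, tpr                             # baseline: tpr = fpr
```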
• 38. Extending to Multiple-Class Classification

                        PREDICTED CLASS
                        C1        C2        ...  Cn        Sum   Recall
  ACTUAL     C1         n11       n12       ...  n1n       R1    n11/R1
  CLASS      C2         n21       n22       ...  n2n       R2    n22/R2
             ...
             Cn         nn1       nn2       ...  nnn       Rn    nnn/Rn
  Sum                   P1        P2        ...  Pn        T
  Precision             n11/P1    n22/P2    ...  nnn/Pn    (n11 + n22 + ... + nnn)/T

  40 Classification and Prediction: Evaluation
• 39. Measures in Multiple-Class Classification
      precision(Ci) = nii / Pi
      recall(Ci)    = nii / Ri
      accuracy      = (n11 + n22 + ... + nnn) / T
  41 Classification and Prediction: Evaluation
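These per-class formulas translate directly to code. A sketch, assuming cm[i][j] counts instances of actual class i predicted as class j, as in the matrix above:

```python
import numpy as np

def per_class_metrics(cm):
    cm = np.asarray(cm, dtype=float)
    diag = np.diag(cm)                  # n11, n22, ..., nnn
    precision = diag / cm.sum(axis=0)   # column sums are P1..Pn
    recall    = diag / cm.sum(axis=1)   # row sums are R1..Rn
    accuracy  = diag.sum() / cm.sum()   # (n11 + ... + nnn) / T
    return precision, recall, accuracy
```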
• 40. Numeric Prediction Evaluation
  Same strategies: independent test set, cross-validation, significance tests, etc.
  Difference: the error measures (generalization error) change.
  Actual target values: a1, a2, ..., an
  Predicted target values: p1, p2, ..., pn
  The most popular measure, the mean-squared error, is easy to manipulate mathematically:
      MSE = ((p1 − a1)² + (p2 − a2)² + ... + (pn − an)²) / n
  42 Classification and Prediction: Evaluation
• 41. Other Measures
  The root mean-squared (RMS) error:
      RMSE = sqrt( ((p1 − a1)² + (p2 − a2)² + ... + (pn − an)²) / n )
  The mean absolute error is less sensitive to outliers than the mean-squared error:
      MAE = (|p1 − a1| + |p2 − a2| + ... + |pn − an|) / n
  Sometimes relative error values are more appropriate (e.g. 10% for an error of 50 when predicting 500).
  43 Classification and Prediction: Evaluation
• 42. Improvement on the Mean
  Often we want to know how much the scheme improves on simply predicting the average.
  The relative squared error (where ā is the average of the actual values):
      RSE = ((p1 − a1)² + ... + (pn − an)²) / ((ā − a1)² + ... + (ā − an)²)
  The relative absolute error:
      RAE = (|p1 − a1| + ... + |pn − an|) / (|ā − a1| + ... + |ā − an|)
  44 Classification and Prediction: Evaluation
• 43. The Correlation Coefficient
  Measures the statistical correlation between the predicted values and the actual values:
      correlation = S_PA / sqrt(S_P × S_A)
  where
      S_PA = Σ (pi − p̄)(ai − ā) / (n − 1)
      S_P  = Σ (pi − p̄)² / (n − 1)
      S_A  = Σ (ai − ā)² / (n − 1)
  Scale independent, between −1 and +1. Good performance leads to large values!
  45 Classification and Prediction: Evaluation
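All of the numeric-prediction measures above fit in a few lines; a sketch with our own function name:

```python
import numpy as np

def numeric_prediction_measures(p, a):
    """Error measures for predictions p against actual values a."""
    p, a = np.asarray(p, float), np.asarray(a, float)
    mse  = np.mean((p - a) ** 2)
    rmse = np.sqrt(mse)
    mae  = np.mean(np.abs(p - a))
    rse  = np.sum((p - a) ** 2) / np.sum((a.mean() - a) ** 2)
    rae  = np.sum(np.abs(p - a)) / np.sum(np.abs(a.mean() - a))
    corr = np.corrcoef(p, a)[0, 1]     # sample correlation coefficient
    return {"MSE": mse, "RMSE": rmse, "MAE": mae,
            "RSE": rse, "RAE": rae, "correlation": corr}
```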
  • 44. Practice 1  A company wants to do a mail marketing campaign. It costs the company $1 for each item mailed. They have information on 100,000 customers. Create a cumulative gains and a lift chart from the following data. 46 Classification and Prediction: Evaluation
  • 45. Results 47 Classification and Prediction: Evaluation
• 46. Practice 2
  Using the response model P(x) = Salary(x)/1000 + Age(x) for customer x and the data table shown below, construct the cumulative gains and lift charts and the ROC curve. Ties in ranking should be arbitrarily broken by assigning the higher rank to whoever appears first in the table.

  Customer Name   Salary   Age   Actual Response   P(x)   Rank
  A               10000    39    N
  B               50000    21    Y
  C               65000    25    Y
  D               62000    30    Y
  E               67000    19    Y
  F               69000    48    N
  G               65000    12    Y
  H               64000    51    N
  I               71000    65    Y
  J               73000    42    N

  48 Classification and Prediction: Evaluation
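One way to fill in the P(x) and Rank columns; a sketch using pandas, which is our choice of tool, not the deck's:

```python
import pandas as pd

df = pd.DataFrame({
    "Customer": list("ABCDEFGHIJ"),
    "Salary": [10000, 50000, 65000, 62000, 67000,
               69000, 65000, 64000, 71000, 73000],
    "Age":    [39, 21, 25, 30, 19, 48, 12, 51, 65, 42],
    "Actual": list("NYYYYNYNYN"),
})
df["P(x)"] = df["Salary"] / 1000 + df["Age"]
# method="first" breaks the P(x)=115 tie between H and J in table
# order, matching the tie-breaking rule stated above
df["Rank"] = df["P(x)"].rank(method="first", ascending=False).astype(int)
```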
• 47.

  Customer Name   Salary   Age   Actual Response   P(x)   Rank
  A               10000    39    N                 49     10
  B               50000    21    Y                 71     9
  C               65000    25    Y                 90     6
  D               62000    30    Y                 92     5
  E               67000    19    Y                 86     7
  F               69000    48    N                 117    2
  G               65000    12    Y                 77     8
  H               64000    51    N                 115    3
  I               71000    65    Y                 136    1
  J               73000    42    N                 115    4

  Ordered by P(x):

  Customer Name   Actual Response   P(x)   Rank
  I               Y                 136    1
  F               N                 117    2
  H               N                 115    3
  J               N                 115    4
  D               Y                 92     5
  C               Y                 90     6
  E               Y                 86     7
  G               Y                 77     8
  B               Y                 71     9
  A               N                 49     10

  [Figures: cumulative gains chart (% positive responses vs. % of samples, with baseline) and lift chart (lift vs. % of samples, with baseline), both plotted from 10% to 100% of samples]
  49
• 48.

  Customer Name   Actual Response   #Sample   TP   FP   TPR       FPR
  I               Y                 1         1    0    16.67%    0.00%
  F               N                 2         1    1    16.67%    25.00%
  H               N                 3         1    2    16.67%    50.00%
  J               N                 4         1    3    16.67%    75.00%
  C               Y                 5         2    3    33.33%    75.00%
  E               Y                 6         3    3    50.00%    75.00%
  D               Y                 7         4    3    66.67%    75.00%
  G               Y                 8         5    3    83.33%    75.00%
  B               Y                 9         6    3    100.00%   75.00%
  A               N                 10        6    4    100.00%   100.00%

  [Figure: ROC curve plotting TPR against FPR for the ranking above]
  50
• 49. *Predicting performance
  Assume the estimated error rate is 25%. How close is this to the true error rate?
  It depends on the amount of test data.
  Prediction is just like tossing a biased (!) coin: "head" is a "success", "tail" is an "error".
  In statistics, a succession of independent events like this is called a Bernoulli process.
  Statistical theory provides us with confidence intervals for the true underlying proportion!
  51 Classification and Prediction: Evaluation
• 50. *Confidence intervals
  We can say: p lies within a certain specified interval with a certain specified confidence.
  Example: S = 750 successes in N = 1000 trials
    Estimated success rate: 75%. How close is this to the true success rate p?
    Answer: with 80% confidence, p ∈ [73.2%, 76.7%]
  Another example: S = 75 successes in N = 100 trials
    Estimated success rate: 75%
    With 80% confidence, p ∈ [69.1%, 80.1%]
  52 Classification and Prediction: Evaluation
• 51. *Mean and variance (also Mod 7)
  Mean and variance for a Bernoulli trial: p and p(1 − p).
  Expected success rate: f = S/N.
  Mean and variance for f: p and p(1 − p)/N.
  For large enough N, f follows a normal distribution.
  The c% confidence interval [−z ≤ X ≤ z] for a random variable X with 0 mean is given by:
      Pr[−z ≤ X ≤ z] = c
  With a symmetric distribution:
      Pr[−z ≤ X ≤ z] = 1 − 2 × Pr[X ≥ z]
  53 Classification and Prediction: Evaluation
• 52. *Confidence limits
  Confidence limits for the normal distribution with 0 mean and a variance of 1:

      Pr[X ≥ z]   z
      0.1%        3.09
      0.5%        2.58
      1%          2.33
      5%          1.65
      10%         1.28
      20%         0.84
      40%         0.25

  Thus: Pr[−1.65 ≤ X ≤ 1.65] = 90%
  To use this, we have to reduce our random variable f to have 0 mean and unit variance.
  54 Classification and Prediction: Evaluation
• 53. *Transforming f
  Transformed value for f (subtract the mean and divide by the standard deviation):
      (f − p) / sqrt(p(1 − p)/N)
  Resulting equation:
      Pr[ −z ≤ (f − p)/sqrt(p(1 − p)/N) ≤ z ] = c
  Solving for p:
      p = ( f + z²/2N ± z × sqrt(f/N − f²/N + z²/4N²) ) / (1 + z²/N)
  55 Classification and Prediction: Evaluation
• 54. *Examples
  f = 75%, N = 1000, c = 80% (so that z = 1.28): p ∈ [0.732, 0.767]
  f = 75%, N = 100, c = 80% (so that z = 1.28): p ∈ [0.691, 0.801]
  Note that the normal distribution assumption is only valid for large N (i.e. N > 100).
  f = 75%, N = 10, c = 80% (so that z = 1.28): p ∈ [0.549, 0.881] (should be taken with a grain of salt)
  56 Classification and Prediction: Evaluation
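The interval formula from the previous slide is straightforward to evaluate; the sketch below (function name ours) reproduces the numbers above:

```python
from math import sqrt

def confidence_interval(f, n, z):
    """Interval for the true success rate p, given observed success rate f
    over n trials and the z-value for the chosen confidence level."""
    center = f + z * z / (2 * n)
    spread = z * sqrt(f / n - f * f / n + z * z / (4 * n * n))
    denom  = 1 + z * z / n
    return (center - spread) / denom, (center + spread) / denom

print(confidence_interval(0.75, 1000, 1.28))  # ≈ (0.732, 0.767)
print(confidence_interval(0.75, 100,  1.28))  # ≈ (0.691, 0.801)
print(confidence_interval(0.75, 10,   1.28))  # ≈ (0.549, 0.881)
```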