Learning from Examples: Standard Methodology for Evaluation
1) Start with a dataset of labeled examples
2) Randomly partition it into N groups
3a) N times, combine N-1 groups into a train set
3b) Provide the train set to the learning system
3c) Measure accuracy on the "left out" group (the test set)
This is called N-fold cross validation (typically N = 10).
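The loop above is easy to express in code. Here is a minimal Python sketch (an illustration, not from the slides); `train_fn` and `accuracy_fn` are placeholders for whatever learning system and accuracy measure you are using:

```python
import random

def n_fold_cross_validation(examples, train_fn, accuracy_fn, n=10, seed=0):
    """Estimate future accuracy by N-fold cross validation."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)        # randomly permute the labeled examples
    folds = [examples[i::n] for i in range(n)]   # partition into N groups
    scores = []
    for i in range(n):
        test_set = folds[i]                      # the "left out" group
        train_set = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = train_fn(train_set)              # provide the train set to the learning system
        scores.append(accuracy_fn(model, test_set))   # measure accuracy on the test set
    return sum(scores) / n                       # average accuracy over the N folds
```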
Using Tuning Sets
Often, an ML system has to choose when to stop learning, select among alternative answers, etc. One wants the model that produces the highest accuracy on future examples ("overfitting avoidance"). It is a "cheat" to look at the test set while still learning.
Better method:
- Set aside part of the training set
- Measure performance on this "tuning" data to estimate future performance for a given set of parameters
- Use the best parameter settings, then train with all training data (except the test set) to estimate future performance on new examples
Experimental Methodology: A Pictorial Overview
[Diagram: a collection of classified examples is split into training examples and testing examples; the training examples are further split into a train' set and a tune set; the LEARNER generates candidate solutions on the train' set and selects the best using the tune set; the resulting classifier is scored on the testing examples to give the expected accuracy on future examples.]
Statistical techniques such as 10-fold cross validation and t-tests are used to get meaningful results.
Proper Experimental Methodology Can Have a Huge Impact!
A 2002 paper in Nature (a major, major journal) needed to be corrected due to "training on the testing set."
Original report: 95% accuracy (5% error rate)
Corrected report (which still is buggy): 73% accuracy (27% error rate)
Error rate increased over 400%!!!
Parameter Setting
Notice that each train/test fold may get different parameter settings! That's fine (and proper). I.e., a "parameterless" algorithm internally sets parameters for each data set it gets.
Using Multiple Tuning Sets
Using a single tuning set can be an unreliable predictor, plus some data is "wasted." Hence, often the following is done:
1) For each possible set of parameters,
   a) divide the training data into train' and tune sets, using N-fold cross validation
   b) score this set of parameter values by its average tune-set accuracy
2) Use the best set of parameter settings and all (train' + tune) examples
3) Apply the resulting model to the test set
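A rough Python sketch of this procedure (an illustration, not part of the slides); `parameter_grid`, `train_fn`, and `accuracy_fn` are assumed placeholder names:

```python
import random

def choose_parameters_and_train(training_data, parameter_grid, train_fn, accuracy_fn, n=10):
    """Score each parameter setting by its average tune-set accuracy over N splits,
    then retrain on all of the training data (train' + tune) with the best setting."""
    data = list(training_data)
    random.shuffle(data)
    folds = [data[i::n] for i in range(n)]

    def tune_score(params):
        scores = []
        for i in range(n):
            tune_set = folds[i]
            train_prime = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
            model = train_fn(train_prime, params)
            scores.append(accuracy_fn(model, tune_set))
        return sum(scores) / n

    best_params = max(parameter_grid, key=tune_score)
    return train_fn(data, best_params)   # this final model is what gets applied to the test set
```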
Tuning a Parameter - Sample Usage
Step 1: Try various values for k (e.g., # of hidden units); use 10 train/tune splits for each k
Step 2: Pick the best value for k (e.g., k = 2), then train using all training data
Step 3: Measure accuracy on the test set
[Figure: tune-set accuracy (averaged over the 10 runs) plotted for each candidate k, e.g., values of 92%, 97%, and 80% for different k between 0 and 100.]
What to Do for the FIELDED System?
Do not use any test sets; instead only use tuning sets to determine good parameters. Test sets are used to estimate future performance. You can report this estimate to your "customer," then use all the data to retrain a "product" to give them.
What's Wrong with This?
1) Do a cross-validation study to set parameters
2) Do another cross-validation study, using the best parameters, to estimate future accuracy
How will this relate to the "true" future accuracy? It is likely to be an overestimate.
What about:
1) Do a proper train/tune/test experiment
2) Improve your algorithm; goto 1
(Machine Learning's "dirty little" secret!)
Why Not Learn After Each Test Example?
In "production mode," this would make sense (assuming one received the correct label). In "experiments," we wish to estimate the probability we'll label the next example correctly, and we need several samples to estimate it accurately.
Choosing a Good N for CV (from the Weiss & Kulikowski textbook)
# of Examples      Method
< 50               Instead, use Bootstrapping (B. Efron); see "bagging" later in cs760
50 < ex's < 100    Leave-one-out ("jackknife"): N = size of data set (leave out one example each time)
> 100              10-fold cross validation (CV), also useful for t-tests
Recap: N-fold Cross Validation
Can be used to 1) estimate future accuracy (via test sets) and 2) choose parameter settings (via tuning sets).
Method:
1) Randomly permute the examples
2) Divide them into N bins
3) Train on N-1 bins, measure performance on the bin "left out"
4) Compute the average accuracy on the held-out sets
[Figure: the examples divided into Fold 1 through Fold 5.]
Confusion Matrices - a Useful Way to Report TESTSET Errors
Useful for the NETtalk testbed, the task of pronouncing written words.
Scatter Plots - Compare Two Algo's on Many Datasets
[Scatter plot: Algo A's error rate on one axis vs. Algo B's error rate on the other; each dot is the error rate of the two algo's on ONE dataset.]
Statistical Analysis of  Sampling Effects Assume we get  e  errors on  N   test set examples What can we say about the accuracy of our estimate of the true (future) error rate? We’ll assume test set/future examples  independently drawn  (iid assumption) Can give probability our true error rate is in some range – error bars
The Binomial Distribution Distribution over the number of successes in  a fixed number  n  of independent trials (with same probability of success  p  in each)
Using the Binomial Let each test case (test data point) be a trial, and let a success be an incorrect prediction Maximum likelihood  estimate  of probability  p  of success is fraction of predictions wrong Can exactly compute probability that error rate estimate  p  is off by more than some amount, say 0.025, in either direction For large N, this computation’s expensive
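For illustration (the numbers here are chosen for the example, not taken from the slides), the exact computation can be done with scipy's binomial distribution: for a hypothetical true error rate p and N test cases, sum the probability mass of error counts within p ± 0.025.

```python
from scipy.stats import binom

N = 200      # number of test examples (assumed for illustration)
p = 0.20     # hypothetical true error rate
lo, hi = (p - 0.025) * N, (p + 0.025) * N           # error counts corresponding to p +/- 0.025
prob_within = binom.cdf(hi, N, p) - binom.cdf(lo - 1, N, p)
print("P(estimate off by more than 0.025) =", 1 - prob_within)
```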
Central Limit Theorem
Roughly, for large enough N, all distributions look Gaussian when summing/averaging N values. Surprisingly, N = 30 is large enough (in most cases at least); see pg 132 of the textbook.
[Figure: histogram of the average of Y over N trials, repeated many times, on a 0-to-1 axis.]
Confidence Intervals
As You Already Learned in "Stat 101"
If we estimate μ (mean error rate) and σ (std dev), we can say our ML algo's error rate is μ ± Z_M σ.
Z_M: the value you look up in a table of N(0,1) for the desired confidence; e.g., for 95% confidence it's 1.96.
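A small sketch of that interval for an error rate, using the usual binomial standard-deviation estimate (the example numbers are illustrative only):

```python
import math

def error_rate_interval(errors, n_test, z=1.96):
    """Normal-approximation confidence interval for the true error rate
    (reasonable when n_test is large, say >= 30)."""
    p_hat = errors / n_test                          # observed error rate (the mean)
    sigma = math.sqrt(p_hat * (1 - p_hat) / n_test)  # estimated std dev of the estimate
    return p_hat - z * sigma, p_hat + z * sigma

print(error_rate_interval(errors=27, n_test=100))    # roughly (0.18, 0.36)
```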
The Remaining Details
Alg 1 vs. Alg 2 Alg 1 has accuracy 80%, Alg 2 82% Is this difference significant? Depends on how many test cases these estimates are based on The test we do depends on how we arrived at these estimates
Leave-One-Out: Sign Test
Suppose we ran leave-one-out cross-validation on a data set of 100 cases. Divide the cases into (1) Alg 1 won, (2) Alg 2 won, (3) ties (both wrong or both right); throw out the ties. Suppose 10 ties and 50 wins for Alg 1. Ask: under the null hypothesis Binomial(90, 0.5), what is the probability of 50 or more (or, symmetrically, 40 or fewer) successes?
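A sketch of that sign-test computation with the slide's numbers (90 non-tied cases, 50 wins for Alg 1):

```python
from scipy.stats import binom

n, wins_alg1 = 90, 50                 # 100 cases minus 10 ties; Alg 1 wins 50 of the rest
# Two-sided sign test: probability of a split at least this lopsided under Binomial(n, 0.5)
p_value = binom.sf(wins_alg1 - 1, n, 0.5) + binom.cdf(n - wins_alg1, n, 0.5)
print("p-value =", p_value)           # well above 0.05, so no significant difference here
```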
What about 10-fold?
It is difficult to get significance from a sign test of only 10 cases. We're throwing out the numbers (accuracy estimates) for each fold, and just asking which is larger. Use the numbers: the t-test is designed to test for a difference of means.
Paired Student t-tests
Given: 10 training/test sets, 2 ML algorithms, and the results of the 2 ML algo's on the 10 test sets.
Determine: Which algorithm is better on this problem? Is the difference statistically significant?
Paired Student t-Tests (cont.)
Example accuracies on the test sets:
  Algorithm 1:  80%  50  75  …  99
  Algorithm 2:  79   49  74  …  98
  δ_i:          +1   +1  +1  …  +1
Algorithm 1's mean is better, but the two standard deviations will clearly overlap. Yet Algorithm 1 is always better than Algorithm 2.
The Random Variable in the t-Test
Consider the random variable δ_i = (Algo A's error on test set i) − (Algo B's error on test set i).
Notice we're "factoring out" test-set difficulty by looking at relative performance. In general, one tries to explain variance in results across experiments; here we're saying that Variance = f(Problem difficulty) + g(Algorithm strength).
More on the Paired t-Test
Our NULL HYPOTHESIS is that the two ML algorithms have equivalent average accuracies, i.e., the differences (in the scores) are due to "random fluctuations" about a mean of zero. We compute the probability that the observed δ arose from the null hypothesis. If this probability is low, we reject the null hypothesis and say that the two algo's appear different. 'Low' is usually taken as prob ≤ 0.05.
The Null Hypothesis Graphically (View #1)
Assume zero mean and use the sample's variance (sample = experiment). Put ½(1 − M) probability mass in each tail (i.e., M inside); typically M = 0.95. Does our measured δ lie in one of the two tails? If so, reject the null hypothesis, since it is unlikely we'd get such a δ by chance.
[Figure: P(δ) curve centered at zero, with arrows marking the two rejection tails.]
View #2 - The Confidence Interval for δ
Use the sample's mean and variance. Is zero inside the M% of probability mass? If NOT, reject the null hypothesis.
[Figure: P(δ) curve centered at the sample mean, with the M% interval marked.]
The t-test Confidence Interval
Given: δ_1, …, δ_N, where each δ_i is measured on a test set of at least 30* examples (so the "Central Limit Theorem" applies for the individual measurements).
Compute: the confidence interval, Δ, at the M% level for the mean difference.
See if the interval contains ZERO. If not, we can reject the NULL HYPOTHESIS (i.e., that algorithms A & B perform equivalently).
* Hence if N is the typical 10, our dataset must have ≥ 300 examples.
The t-Test Calculation
Compute the mean and the sample variance of the differences:
  mean of the differences:  δ̄  = (1/N) Σ δ_i
  sample variance:          s² = (1/(N−1)) Σ (δ_i − δ̄)²
Look up the t value t_{M,N−1} for N folds and the M confidence level. "N−1" is called the degrees of freedom; as N → ∞, t_{M,N−1} and Z_M become equivalent (see Table 5.6 in Mitchell). We don't know an analytical expression for the variance, so we need to estimate it on the data.
The t-test Calculation (cont.) - Using View #2 (you get the same result using View #1)
Calculate Δ = t_{M,N−1} · sqrt(s² / N).
The interval δ̄ ± Δ contains 0 if |δ̄| ≤ Δ.
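Putting the last three slides together, a minimal paired t-test sketch (using only the four test-set accuracies shown on the earlier example slide, purely as an illustration):

```python
import math
from scipy.stats import t as t_dist

def paired_t_interval(acc_a, acc_b, confidence=0.95):
    """Mean difference and its M% confidence interval over paired per-fold accuracies.
    If the interval excludes zero, reject the null hypothesis of equivalence."""
    deltas = [a - b for a, b in zip(acc_a, acc_b)]
    n = len(deltas)
    mean = sum(deltas) / n
    sample_var = sum((d - mean) ** 2 for d in deltas) / (n - 1)
    t_val = t_dist.ppf(1 - (1 - confidence) / 2, df=n - 1)   # two-sided t value, N-1 dof
    delta = t_val * math.sqrt(sample_var / n)
    return mean, (mean - delta, mean + delta)

# Every fold differs by +0.01, so the sample variance is ~0 and the interval
# collapses onto the mean -- a degenerate but intuitive picture of why
# consistent small wins can still be significant.
print(paired_t_interval([0.80, 0.50, 0.75, 0.99], [0.79, 0.49, 0.74, 0.98]))
```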
Some Jargon: P-values (Uses View #1)
P-value = the probability of getting one's results or greater, given the NULL HYPOTHESIS. (We usually want P ≤ 0.05 to be confident that a difference is statistically significant.)
[Figure: the null-hypothesis distribution, with the P-value shown as the tail area beyond the observed result.]
From Wikipedia (http://en.wikipedia.org/wiki/P-value)
The p-value of an observed value X_observed of some random variable X is the probability that, given that the null hypothesis is true, X will assume a value as or more unfavorable to the null hypothesis as the observed value X_observed. "More unfavorable to the null hypothesis" can in some cases mean greater than, in some cases less than, and in some cases further away from a specified center.
"Accepting" the Null Hypothesis
Note: even if the p-value is high, we cannot assume the null hypothesis is true. E.g., if we flip a coin twice and get one head, can we statistically infer the coin is fair? Vs. if we flip a coin 100 times and observe only 10 heads, we can statistically infer the coin is unfair, because that is very unlikely to happen with a fair coin. How would we show a coin is fair?
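The two coin examples can be checked directly with the binomial distribution (a quick illustration, not from the slides):

```python
from scipy.stats import binom

# Two-sided probability of a result at least as extreme, assuming a fair coin:
p_two_flips = 2 * binom.cdf(1, 2, 0.5) - binom.pmf(1, 2, 0.5)   # 1 head in 2 flips
p_hundred   = 2 * binom.cdf(10, 100, 0.5)                       # 10 heads in 100 flips
print(p_two_flips)   # 1.0 -- tells us nothing either way about fairness
print(p_hundred)     # vanishingly small -- strong evidence the coin is unfair
```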
More on the t-Distribution
We typically don't have enough folds to assume the central-limit theorem (i.e., N < 30), so we need to use the t distribution. It's wider (and hence shorter) than the Gaussian (Z) distribution, since PDFs integrate to 1; hence our confidence intervals will be wider. Fortunately, t-tables exist.
[Figure: a Gaussian and a t distribution overlaid; there is a different t curve for each N.]
Some Assumptions Underlying our Calculations
General: the Central Limit Theorem applies (i.e., ≥ 30 measurements averaged).
ML-specific:
- #errors/#tests accurately estimates p, the probability of error on one example; this is used in the formula for σ, which characterizes expected future deviations about the mean (p)
- We are using an independent sample of the space of possible instances: representative of future examples, with individual examples drawn iid
- For paired t-tests, the learned classifier is (approximately) the same for each fold ("stability"), since we are combining results across folds
Stability
Stability = how much the model an algorithm learns changes due to minor perturbations of the training set. Paired t-test assumptions are a better match to stable algorithms. Example: k-NN; the higher the k, the more stable.
More on the Paired t-Test Assumptions
Ideally we would train on one data set and then do a 10-fold paired t-test:
  What we should do:   train,  test1 … test10
  What we usually do:  train1 test1 … train10 test10
However, there is usually not enough data to do the ideal. If we treat the train data as part of each paired experiment, then we violate the independence assumptions: each train set overlaps 90% with every other train set. In the ideal setup, the learned model does not vary while we're measuring its performance.
The Great Debate (or one of them, at least)
Should you use a one-tailed or a two-tailed t-test? A two-tailed test asks the question: are algorithms A and B statistically different? A one-tailed test asks the question: is algorithm A statistically better than algorithm B?
One vs. Two-Tailed, Graphically
[Figure: a probability density P(x) with shaded rejection regions (2.5% tails); a one-tailed test uses a single tail, while a two-tailed test splits the rejection mass across both tails.]
The Great Debate (More)
Which of these tests should you use when comparing your new algorithm to a state-of-the-art algorithm? You should use two-tailed, because by using it you are saying there is a chance I am better and a chance I am worse. One-tailed is saying I know my algorithm is no worse, and therefore you are allowed a larger margin of error. See http://www.psychstat.missouristate.edu/introbook/sbk25m.htm. By being more confident, it is easier to show significance!
Two-Sided vs. One-Sided
You need to think very carefully about the question you are asking. Examples:
- Are we within x of the true error rate? [Figure: measured mean with bounds mean − x and mean + x]
- How confident are we that ML System A's accuracy is at least 85%?
- Is ML algorithm A no more accurate than algorithm B? [considering the difference A − B]
- Are ML algorithms A and B equivalently accurate? [considering the difference A − B]
Contingency Tables (counts of occurrences)

                      True Answer
                       +                    −
  Algorithm   +    n(1,1) [true pos]    n(1,0) [false pos]
  Answer      −    n(0,1) [false neg]   n(0,0) [true neg]
TPR and FPR
True Positive Rate (TPR) = n(1,1) / ( n(1,1) + n(0,1) ) = correctly categorized +'s / total positives ≈ P(algo outputs + | + is correct)
False Positive Rate (FPR) = n(1,0) / ( n(1,0) + n(0,0) ) = incorrectly categorized −'s / total negatives ≈ P(algo outputs + | − is correct)
Can similarly define the False Negative Rate and True Negative Rate.
See http://en.wikipedia.org/wiki/Type_I_and_type_II_errors
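A quick sketch in Python, using illustrative counts (the same counts reappear on the dominance-proof slide near the end):

```python
def tpr_fpr(n00, n01, n10, n11):
    """TPR and FPR from contingency counts:
    n11 = true pos, n01 = false neg, n10 = false pos, n00 = true neg."""
    tpr = n11 / (n11 + n01)   # P(algo outputs + | + is correct)
    fpr = n10 / (n10 + n00)   # P(algo outputs + | - is correct)
    return tpr, fpr

print(tpr_fpr(n00=900, n01=25, n10=100, n11=75))   # (0.75, 0.1)
```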
ROC Curves
ROC: Receiver Operating Characteristics. Started during radar research in WWII. Judging algorithms on accuracy alone may not be good enough when getting a positive wrong costs more than getting a negative wrong (or vice versa). E.g., medical tests for serious diseases; e.g., a movie-recommender (à la Netflix) system.
ROC Curves Graphically
[Figure: ROC space with the true-positive rate, P(alg outputs + | + is correct), on the vertical axis and the false-positive rate, P(alg outputs + | − is correct), on the horizontal axis, both running from 0 to 1.0; the ideal spot is the upper-left corner; curves for Alg 1 and Alg 2 are shown.]
Different algorithms can work better in different parts of ROC space. This depends on the cost of false + vs. false −.
Creating an ROC Curve - the Standard Approach You need an ML algorithm that outputs NUMERIC results such as prob(example is +) You can use  ensembles  (later) to get this from a model that only provides Boolean outputs Eg, have 100 models vote & count votes
Algo for Creating ROC Curves (one possibility; use it on HW2)
Step 1: Sort predictions on the test set
Step 2: Locate a threshold between examples with opposite categories
Step 3: Compute TPR & FPR for each threshold of Step 2
Step 4: Connect the dots
Plotting ROC Curves - Example
ML Algo Output (sorted)   Correct Category
  Ex 9    .99                 +
  Ex 7    .98                 +
  Ex 1    .72                 −
  Ex 2    .70                 +
  Ex 6    .65                 +
  Ex 10   .51                 −
  Ex 3    .39                 −
  Ex 5    .24                 +
  Ex 4    .11                 −
  Ex 8    .01                 −
Thresholds between opposite-category neighbors give the points:
  TPR=2/5, FPR=0/5;  TPR=2/5, FPR=1/5;  TPR=4/5, FPR=1/5;
  TPR=4/5, FPR=3/5;  TPR=5/5, FPR=3/5;  TPR=5/5, FPR=5/5
[Plot these points in ROC space: P(alg outputs + | + is correct) vs. P(alg outputs + | − is correct).]
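A minimal Python sketch of the four-step algorithm, checked against the worked example above (the `roc_points` function and its data list are illustrative, not from the slides):

```python
def roc_points(scored_examples):
    """scored_examples: (score, is_positive) pairs, e.g. the model's estimated
    prob(example is +) together with the true label."""
    ranked = sorted(scored_examples, key=lambda se: se[0], reverse=True)   # Step 1: sort
    n_pos = sum(1 for _, pos in ranked if pos)
    n_neg = len(ranked) - n_pos
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for i, (score, pos) in enumerate(ranked):
        tp += pos
        fp += not pos
        # Step 2: place a threshold wherever the next example has the opposite category
        if i + 1 == len(ranked) or ranked[i + 1][1] != pos:
            points.append((fp / n_neg, tp / n_pos))   # Step 3: (FPR, TPR) at this threshold
    return points                                     # Step 4: connect these dots

data = [(.99, True), (.98, True), (.72, False), (.70, True), (.65, True),
        (.51, False), (.39, False), (.24, True), (.11, False), (.01, False)]
print(roc_points(data))
# [(0.0, 0.0), (0.0, 0.4), (0.2, 0.4), (0.2, 0.8), (0.6, 0.8), (0.6, 1.0), (1.0, 1.0)]
```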
ROC’s and Many Models ( not  in the ensemble sense) It is not necessary that we learn  one  model and then threshold its output to produce an ROC curve You could learn  different models  for  different regions  of ROC space Eg, see Goadrich, Oliphant, & Shavlik  ILP ’04 and MLJ ‘06
Area Under ROC Curve
A common metric for experiments is to numerically integrate the ROC curve.
[Figure: ROC curve with true positives vs. false positives, both axes from 0 to 1.0, and the area under the curve shaded.]
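Numerical integration here is just the trapezoid rule over the (FPR, TPR) points; a tiny sketch using the points from the example two slides back:

```python
def auc(points):
    """Trapezoid-rule area under a curve given as (x, y) points, e.g. (FPR, TPR)."""
    pts = sorted(points)
    return sum((x2 - x1) * (y1 + y2) / 2.0
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

print(auc([(0.0, 0.0), (0.0, 0.4), (0.2, 0.4), (0.2, 0.8),
           (0.6, 0.8), (0.6, 1.0), (1.0, 1.0)]))   # 0.80 for the earlier example
```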
Asymmetric Error Costs
Assume that cost(FP) != cost(FN). You would like to pick a threshold that minimizes
  E(total cost) = cost(FP) × prob(FP) × (# of −) + cost(FN) × prob(FN) × (# of +)
You could also have (maybe negative) costs for TP and TN (assumed zero above).
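A small sketch of threshold selection by expected cost (the example scores and the costs below are made up for illustration):

```python
def best_threshold(scored_examples, cost_fp, cost_fn):
    """Pick the score threshold that minimizes total expected cost,
    assuming cost(TP) = cost(TN) = 0 as on the slide."""
    thresholds = [float("inf")] + sorted({s for s, _ in scored_examples}, reverse=True)
    best = None
    for thr in thresholds:
        fp = sum(1 for s, pos in scored_examples if s >= thr and not pos)
        fn = sum(1 for s, pos in scored_examples if s < thr and pos)
        cost = cost_fp * fp + cost_fn * fn
        if best is None or cost < best[0]:
            best = (cost, thr)
    return best                                  # (expected cost, threshold)

data = [(.99, True), (.98, True), (.72, False), (.70, True), (.65, True),
        (.51, False), (.39, False), (.24, True), (.11, False), (.01, False)]
print(best_threshold(data, cost_fp=1, cost_fn=5))   # (3, 0.24): missing a + costs 5x a false alarm
```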
ROC’s & Skewed Data One strength of ROC curves is that they are a good way to deal with skewed data  (|+| >> |-|) since the axes are fractions (rates) independent of the # of examples You must be careful though! Low FPR * (many negative ex)    = sizable number of FP Possibly more than # of TP
Precision vs. Recall (think about search engines)
Precision = (# of relevant items retrieved) / (total # of items retrieved) = n(1,1) / ( n(1,1) + n(1,0) ) ≈ P(is pos | called pos)
Recall = (# of relevant items retrieved) / (# of relevant items that exist) = n(1,1) / ( n(1,1) + n(0,1) ) = TPR ≈ P(called pos | is pos)
Notice that n(0,0) is not used in either formula; therefore you get no credit for filtering out irrelevant items.
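A two-line sketch using the same illustrative counts as the earlier TPR/FPR example:

```python
def precision_recall(n00, n01, n10, n11):
    precision = n11 / (n11 + n10)   # P(is pos | called pos)
    recall    = n11 / (n11 + n01)   # P(called pos | is pos) = TPR; n(0,0) never appears
    return precision, recall

print(precision_recall(n00=900, n01=25, n10=100, n11=75))   # (~0.43, 0.75)
```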
ROC vs. Recall-Precision
You can get very different visual results on the same data. The reason for this is that there may be lots of − ex's (e.g., you might need to include 100 neg's to get 1 more pos).
[Figure: the same classifier shown as an ROC curve, P(+ | +) vs. P(+ | −), and as a Recall-Precision curve.]
Recall-Precision Curves
You cannot simply connect the dots in Recall-Precision curves (it is OK to do so in ROC's). See Goadrich, Oliphant, & Shavlik, ILP '04 or MLJ '06.
[Figure: a Recall-Precision plot where naive linear interpolation between two points is marked as incorrect.]
Interpolating in PR Space Would like to interpolate correctly, then remove points that lie below interpolation Analogous to convex hull in ROC space Can you do it efficiently? Yes – convert to ROC space, take convex hull, convert back to PR space (Davis & Goadrich, ICML-06)
The Relationship between  Precision-Recall and ROC Curves Jesse Davis & Mark Goadrich Department of Computer Sciences University of Wisconsin
Four Questions about  PR space and ROC space Q1: If a curve  dominates  in one space    will it dominate in the other? Q2: What is the  “best”  PR curve? Q3: How do you  interpolate  in PR  space? Q4: Does  optimizing  AUC in one space    optimize it in the other space?
Definition: Dominance
Definition: Area Under the Curve (AUC)
[Figures: the area under a curve plotted as Precision vs. Recall, and as TPR vs. FPR.]
How do we evaluate ML algorithms? Common evaluation metrics ROC curves   [Provost et al ’98] PR curves   [Raghavan ’89, Manning & Schutze ’99] Cost curves  [Drummond and Holte ‘00, ’04] If the class distribution is highly skewed, we believe PR curves preferable to ROC curves
Two Highly Skewed Domains
Is an abnormality on a mammogram benign or malignant? Do these two identities refer to the same person?
Diagnosing Breast Cancer [Real Data: Davis et al. IJCAI 2005]
Diagnosing Breast Cancer [Real Data: Davis et al. IJCAI 2005]
Predicting Aliases [Synthetic data:  Davis et al. ICIA 2005]
Predicting Aliases [Synthetic data:  Davis et al. ICIA 2005]
A1: Dominance Theorem For a fixed number of positive and negative examples, one curve dominates another curve in ROC space if and only if the first curve dominates the second curve in PR space
Q2: What is the  “best”  PR curve?  The “best” curve in ROC space for a set of points is the convex hull  [Provost et al ’98] It is achievable It maximizes AUC  Q: Does an analog to convex hull    exist in PR space? A2: Yes! We call it the  Achievable PR Curve
Convex Hull
Convex Hull
A2: Achievable Curve
A2: Achievable Curve
Constructing the Achievable Curve Given: Set of PR points, fixed number positive    and negative examples Translate PR points to ROC points Construct convex hull in ROC space Convert the curve into PR space Corollary:  By dominance theorem, the curve in PR space  dominates all other legal PR curves you could construct with the given points
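A rough sketch of that three-step construction (illustrative code, not the authors' implementation; it assumes every given PR point has nonzero precision):

```python
def _cross(o, a, b):
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def achievable_pr_curve(pr_points, n_pos, n_neg):
    """Translate PR points to ROC space, take the ROC convex hull,
    and convert the hull back into PR space."""
    roc = [(0.0, 0.0), (1.0, 1.0)]
    for recall, prec in pr_points:          # PR -> ROC:  recall = TPR,  FP = TP*(1-prec)/prec
        tp = recall * n_pos
        fp = tp * (1.0 - prec) / prec
        roc.append((fp / n_neg, recall))    # (FPR, TPR)

    hull = []                               # upper convex hull, scanning left to right
    for p in sorted(set(roc)):
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)

    curve = []                              # ROC hull -> achievable PR curve
    for fpr, tpr in hull:
        tp, fp = tpr * n_pos, fpr * n_neg
        if tp + fp > 0:
            curve.append((tpr, tp / (tp + fp)))   # (recall, precision)
    return curve
```

Going through ROC space is what makes the result legal: taking a hull directly in PR space would not, in general, be achievable.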
Q3: Interpolation
Interpolation in ROC space is easy: a linear connection between the points.
[Figure: ROC space (TPR vs. FPR) with points A and B joined by a straight line.]
Linear Interpolation Not Achievable in PR Space
Precision interpolation is counterintuitive [Goadrich, et al., ILP 2004].
Example counts (1,000 positive and 9,000 negative examples):
    TP     FP     TP Rate  FP Rate  Recall  Prec
    500    500    0.50     0.06     0.50    0.50
    750    4750   0.75     0.53     0.75    0.14
   1000    9000   1.00     1.00     1.00    0.10
[Figures: the corresponding PR and ROC curves.]
Example Interpolation
A dataset with 20 positive and 2000 negative examples. Two PR points: A (TP = 5, FP = 5, so Recall = 0.25, Prec = 0.5) and B (TP = 10, FP = 30, so Recall = 0.5, Prec = 0.25).
Q: For each extra TP covered, how many FPs do you cover?
A: (FP_B − FP_A) / (TP_B − TP_A) = 25 / 5 = 5
Stepping the TP count from A toward B one at a time, and adding FPs at that rate, gives the correctly interpolated points:
    TP   FP   Recall  Prec
    5    5    0.25    0.5
    6    10   0.3     0.375
    7    15   0.35    0.318
    8    20   0.4     0.286
    9    25   0.45    0.265
   10    30   0.5     0.25
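A sketch of that stepping procedure in Python, reproducing the numbers above:

```python
def pr_interpolation(tp_a, fp_a, tp_b, fp_b, n_pos):
    """PR-space interpolation between points A and B: step the TP count one unit
    at a time, adding FPs at the constant rate (FP_B - FP_A)/(TP_B - TP_A)."""
    fp_per_tp = (fp_b - fp_a) / (tp_b - tp_a)
    points = []
    for extra in range(tp_b - tp_a + 1):
        tp = tp_a + extra
        fp = fp_a + extra * fp_per_tp
        points.append((tp / n_pos, tp / (tp + fp)))   # (recall, precision)
    return points

# A = (TP 5, FP 5), B = (TP 10, FP 30), with 20 positive examples:
for recall, prec in pr_interpolation(5, 5, 10, 30, n_pos=20):
    print(round(recall, 2), round(prec, 3))
# 0.25 0.5 | 0.3 0.375 | 0.35 0.318 | 0.4 0.286 | 0.45 0.265 | 0.5 0.25
```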
Optimizing AUC Interest in learning algorithms that optimize Area Under the Curve (AUC) [ Ferri et al. 2002, Cortes and Mohri 2003, Joachims 2005,    Prati and Flach 2005, Yan et al. 2003, Herschtal and Raskutti 2004 ] Q: Does an algorithm that optimizes    AUC-ROC also optimize AUC-PR? A: No.  Can easily construct counterexample
Back to Q1 A2, A3 and A4 relied on A1 Now let’s prove A1…
Dominance Theorem For a fixed number of positive and negative examples, one curve dominates another curve in ROC space if and only if the first curve dominates the second curve in Precision-Recall space
For Fixed N, P, and TPR: FPR ⟷ Precision (Not =)
Example contingency table (P = 100 positives, N = 1000 negatives):

                    True Answer
                     +      −
  Algorithm  +      75     100
  Answer     −      25     900

Here TPR = 75/100 = 0.75 and FPR = 100/1000 = 0.10, while Precision = 75/175 ≈ 0.43.
Conclusions about  PR and ROC Curves A curve dominates in one space iff it dominates in the other space Exists analog to convex hull in PR space,   which we call the  achievable PR curve Linear interpolation not achievable in PR space Optimizing AUC in one space does not optimize AUC in the other space