Evaluation Metrics of Multi-label & Multi-class Classification
- Sridhar Nomula
Introduction
Most classification problems associate a single class to each
example or instance. However, there are many classification
tasks where each instance can be associated with one or more
classes. This group of problems represents an area known as
multi-label classification.
The performance of multi-label classifiers cannot be assessed with exactly the
same definitions used for single-label classifiers. Precision, recall,
F-measure, ROC – many of the evaluation metrics you are familiar with from
multi-class classification do not readily translate to multi-label, because
they fail to capture the case of a predicted label set being only partially
correct.
To capture the notion of partial correctness, one can use metrics that fall
into two categories: example-based and label-based. In example-based
evaluation, the difference between the predicted and actual label sets is
measured for each example and then averaged over all examples in the test set.
In label-based evaluation, each label is evaluated first (across all examples
where it shows up) and the results are then averaged over all labels.
Multi-class classification
• A classification task with more than two
classes, where each instance belongs to
exactly one class.
• E.g., Classify a set of images of fruits which
may be oranges, apples, or pears. Multiclass
classification makes the assumption that
each sample is assigned to one and only one
label: a fruit can be either an apple or a pear
but not both at the same time.
Multi-label classification
• Multi-label classification is the task of predicting labels
from two or more categories, where each instance can belong
to more than one class (each sample is assigned a set of
target labels).
• This can be thought of as predicting properties of a
data point that are not mutually exclusive, such
as the topics that are relevant for a document. A
text might be about any of religion, politics,
finance or education at the same time, or none
of these.
Multi-label classification: Challenges
• Highly imbalanced dataset – each label may occur a different
number of times, and each document carries a different number of
labels.
• Documents of different lengths – for text classification,
most ML algorithms require inputs of equal length.
• Many metrics to choose from.
Categories in Metrics
In multi-label tasks, the results can be partially correct or partially wrong, and the metrics need to capture
this notion of partial correctness. The performance metrics of multi-label classifiers can be categorized as
label-based and example-based.
Label based metrics:
These are calculated separately for each of the labels and then averaged over all labels, without taking
into account any relation between the labels. Each label is evaluated first (across all examples) and the
results are then averaged. It is important to note that any such label-based method fails to address the
correlation among the different classes. Examples include one-error, average precision, etc.
Example based metrics:
The metrics are computed in a “per data point” manner. They are calculated for each example and
then averaged across the test set. Examples include accuracy, Hamming loss, etc.
Understanding the Metrics
Precision
Precision attempts to answer the following question:
What proportion of positive identifications was actually correct?
(In measurement terms, precision is the stability of a measurement when repeated many times, i.e. whether
the measurement is consistent with other measurements. For a classifier, it is a measure of exactness.)
Recall
Recall attempts to answer the following question:
What proportion of actual positives was identified correctly?
It is also called Sensitivity or the True Positive Rate (TPR).
Accuracy
Accuracy attempts to answer the following question:
What fraction of predictions did our model get right?
It is the proportion of correct results that a classifier achieved.
Classification accuracy alone cannot be trusted to select a well-performing model when a class
imbalance exists.
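As a quick reference, these three questions reduce to ratios over the confusion-matrix counts. A minimal sketch in Python, using hypothetical TP/FP/FN/TN counts chosen purely for illustration:

# Hypothetical confusion-matrix counts for a single binary label (illustrative only).
tp, fp, fn, tn = 30, 10, 5, 55

precision = tp / (tp + fp)                    # of everything predicted positive, how much was correct?
recall = tp / (tp + fn)                       # of everything actually positive, how much did we find?
accuracy = (tp + tn) / (tp + fp + fn + tn)    # fraction of all predictions that were correct

print(f"precision={precision:.2f}, recall={recall:.2f}, accuracy={accuracy:.2f}")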
Example based Metrics
Classification
• Subset Accuracy
• Hamming loss
• Accuracy
• Precision
• Average Precision
• Recall
• F1 score
Ranking
• One error
• Coverage
• Ranking loss
In example-based evaluation, the difference between the predicted
and actual label sets is measured for each example, and then
averaged over all examples in the test set.
Example-based metrics are specifically built for the multi-label domain.
Example based – Precision
• Out of the categories predicted, how many of them are true categories.
• Precision = |Y ∩ Z| / |Z|
• Y = true labels; Z = predicted labels
• The ratio of how much of what was predicted is correct.
• The numerator counts how many labels the predicted vector
has in common with the ground truth.
Example based – Recall
• Out of the total true categories, how
many of them were predicted.
• Recall = |Y ∩ Z| / |Y|
• Finally, it is very important to note that
there is an inverse relationship
between precision and recall, and
that these metrics depend on the
model score threshold that you set.
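A minimal sketch of the two example-based formulas above, using plain Python sets for one instance's true labels Y and predicted labels Z (the values mirror the first row of the experiment shown later in the deck):

# One instance: ground-truth label set Y and predicted label set Z.
Y = {"DE", "HO"}   # true labels
Z = {"DE", "OT"}   # predicted labels

precision = len(Y & Z) / len(Z)   # |Y ∩ Z| / |Z| -> 1/2 = 0.5
recall = len(Y & Z) / len(Y)      # |Y ∩ Z| / |Y| -> 1/2 = 0.5
print(precision, recall)

For a whole test set, these values are computed per example and then averaged.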
Example based – F1 score
• The F1 measure is a single measure
obtained by combining the two
evaluation measures precision and
recall.
• It is used to trade off precision
against recall.
Example based – Accuracy (Jaccard Index)
JACCARD INDEX – often called multi-label ACCURACY
• Measures partial correctness.
• Accuracy for each instance is defined as the
proportion of correctly predicted labels to
the total number of labels for that instance,
|Y ∩ Z| / |Y ∪ Z|. Overall accuracy is the
average across all instances.
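A short sketch of example-based (Jaccard) accuracy, computed by hand and via scikit-learn's jaccard_score with average='samples'; the two indicator matrices here are illustrative:

import numpy as np
from sklearn.metrics import jaccard_score

# Rows = instances, columns = labels (illustrative values).
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 1]])
y_pred = np.array([[1, 0, 0, 1],
                   [0, 1, 0, 1]])

# Per-instance Jaccard accuracy: |Y ∩ Z| / |Y ∪ Z|, then averaged over instances.
per_instance = [(t & p).sum() / (t | p).sum() for t, p in zip(y_true, y_pred)]
print(per_instance)                                       # [0.333..., 1.0]
print(jaccard_score(y_true, y_pred, average='samples'))   # mean of the above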
Example based – Subset Accuracy & Exact Match Ratio
• The Exact Match Ratio ignores partially correct predictions and treats
such examples as incorrect (very strict).
• In multi-label classification, the zero-one loss corresponds to the
subset zero-one loss:
Zero-One loss = 1 − subset accuracy
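In scikit-learn terms, accuracy_score on multi-label indicator arrays is exactly subset accuracy (the Exact Match Ratio), and zero_one_loss is its complement. A brief sketch, using the first three rows of the experiment that appears later in the deck:

import numpy as np
from sklearn.metrics import accuracy_score, zero_one_loss

y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 1],
                   [1, 0, 0, 1]])
y_pred = np.array([[1, 0, 0, 1],   # partially correct -> counts as wrong for subset accuracy
                   [0, 1, 0, 1],   # exact match
                   [1, 0, 0, 1]])  # exact match

print(accuracy_score(y_true, y_pred))   # subset accuracy = 2/3
print(zero_one_loss(y_true, y_pred))    # 1 - subset accuracy = 1/3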
Example based – Hamming loss
• Hamming loss is the average fraction
of incorrect labels.
Or
• Hamming loss measures the number
of times an (instance, label) pair is
misclassified.
• Note that Hamming loss is a loss
function, so the perfect score is 0.
• A lower Hamming loss indicates better
classification performance.
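A small sketch of Hamming loss, computed by hand and with scikit-learn; the arrays reuse the gold-standard and predicted matrices from the experiment slide later in the deck (label order DE, LT, HO, OT):

import numpy as np
from sklearn.metrics import hamming_loss

y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 1],
                   [1, 0, 0, 1],
                   [0, 1, 1, 0],
                   [1, 0, 0, 0]])
y_pred = np.array([[1, 0, 0, 1],
                   [0, 1, 0, 1],
                   [1, 0, 0, 1],
                   [0, 1, 0, 0],
                   [1, 0, 0, 1]])

# Fraction of (instance, label) cells that disagree: 4 wrong cells / (5 x 4) = 0.2
print((y_true != y_pred).mean())
print(hamming_loss(y_true, y_pred))   # same value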
Hamming Loss
• Hamming loss and subset 0/1 loss
cannot, in general, be optimized at
the same time.
• Hamming loss can in principle be
minimized without taking label
dependence into account.
• For 0/1 loss, label dependence
must be taken into account.
• It is usually not possible to
minimize both at the same time!
• For general evaluation, use
multiple and contrasting
evaluation measures!
Additional Metrics – Log Loss (Cross-Entropy)
• Log loss, also called logistic regression loss or
cross-entropy loss, is defined on probability
estimates.
• It is commonly used in (multinomial) logistic
regression and neural networks, as well as in
some variants of expectation-maximization,
and can be used to evaluate the probability
outputs (predict_proba) of a classifier instead
of its discrete predictions.
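A brief sketch of log loss evaluated on probability estimates. For the multi-label case, one common convention (an assumption here, not the only option) is to average the binary log loss over the labels:

import numpy as np
from sklearn.metrics import log_loss

# Binary ground truth and predicted probabilities for 3 labels over 4 samples (illustrative values).
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_prob = np.array([[0.9, 0.2, 0.7],
                   [0.1, 0.8, 0.3],
                   [0.8, 0.6, 0.2],
                   [0.3, 0.1, 0.9]])

# Per-label binary log loss, then averaged across labels.
per_label = [log_loss(y_true[:, j], y_prob[:, j]) for j in range(y_true.shape[1])]
print(np.mean(per_label))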
Example based – Average Precision
Average Precision (AP) is computed for each class; the mean Average Precision (mAP)
is the average of the APs over all classes. There are two common ways to measure
the interpolated average precision: 11-point interpolation and interpolating all points.
11-point interpolation
• For a given task and class, the precision/recall curve is computed from a
method's ranked output. The AP summarizes the shape of the
precision/recall curve, and is defined as the mean of the interpolated precision
at a set of eleven equally spaced recall levels [0, 0.1, ..., 1]:
AP = (1/11) * sum over r in {0, 0.1, ..., 1} of p_interp(r),
with p_interp(r) = max over r' >= r of p(r'),
where p(r') is the measured precision at recall r'.
• Instead of using the precision observed at every point, the AP is
obtained by interpolating the precision only at the 11 recall levels r,
taking the maximum precision whose recall value is greater than or equal to r.
In practice, AP is the precision averaged
across all recall values between 0 and 1,
i.e. the area under the precision/recall curve.
The integral is closely approximated
by a sum over the precisions at every
possible threshold value, multiplied by
the change in recall:
AP = sum over k of p(r_k) * Δr(k),
where p(r_k) is the measured precision at the k-th recall value and Δr(k) is the change in recall from the previous threshold.
Example based – Average Precision contd.
Interpolating all points
• Instead of interpolating only at the 11 equally spaced points,
you can interpolate through all points, so that
p_interp(r) = max over r' >= r of p(r').
• In this case, instead of using the precision observed at only a few
points, the AP is obtained by interpolating the precision
at each recall level r, taking the maximum precision whose recall
value is greater than or equal to r. This way we estimate the
area under the curve.
• A good way to characterize the performance of a classifier is
to look at how precision and recall change as you change
the threshold.
• To calculate the AP for a specific class (say “DE”), the
precision-recall curve is computed from the model's output
by varying the model score threshold.
By computing precision and recall at every position in the
ranked sequence of documents, one can plot a precision-recall
curve: precision p(r) as a function of recall r.
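A compact sketch of the all-points interpolation described above, for a single class: build the precision/recall curve from scores, replace each precision by the maximum precision at any recall to its right (i.e. at recall >= r), and sum interpolated precision times the change in recall. Labels and scores are illustrative:

import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

# Ground truth for one class (say "DE") and the model's scores for that class.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_score = np.array([0.95, 0.90, 0.85, 0.80, 0.70, 0.60, 0.55, 0.40, 0.30, 0.10])

precision, recall, _ = precision_recall_curve(y_true, y_score)

# precision_recall_curve returns recall in decreasing order, so a running maximum
# gives p_interp(r) = max precision over all points with recall >= r.
p_interp = np.maximum.accumulate(precision)

# AP ~ sum of interpolated precision times the change in recall (area under the step curve).
ap = np.sum(np.abs(np.diff(recall)) * p_interp[:-1])
print(ap)
print(average_precision_score(y_true, y_score))   # sklearn's non-interpolated variant, for comparison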
Example-Based – Experiment
Label order: DE, LT, HO, OT.

Gold standard | Predicted | |Y| | |Z| | |Y∩Z| | |Y∪Z| | Precision | Recall | F1   | Accuracy
1 0 1 0       | 1 0 0 1   |  2  |  2  |   1   |   3   |   0.5     |  0.5   | 50%  | 0.333
0 1 0 1       | 0 1 0 1   |  2  |  2  |   2   |   2   |   1       |  1     | 100% | 1
1 0 0 1       | 1 0 0 1   |  2  |  2  |   2   |   2   |   1       |  1     | 100% | 1
0 1 1 0       | 0 1 0 0   |  2  |  1  |   1   |   2   |   1       |  0.5   | 67%  | 0.5
1 0 0 0       | 1 0 0 1   |  1  |  2  |   1   |   2   |   0.5     |  1     | 67%  | 0.5
Totals        |           |  9  |  9  |   7   |  11   |   0.778   |  0.778 | 0.78 | 0.636

Hamming loss = 4 incorrect (instance, label) cells / (4 labels × 5 records) = 0.2

Subset Accuracy = (No. of exactly correctly classified samples) / (Total number of samples) = 2/5 = 0.4
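The table above can be reproduced with a few lines of NumPy; a sketch using the same gold-standard (Y) and predicted (Z) matrices, label order DE, LT, HO, OT:

import numpy as np

Y = np.array([[1, 0, 1, 0],   # gold standard
              [0, 1, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0],
              [1, 0, 0, 0]])
Z = np.array([[1, 0, 0, 1],   # predicted
              [0, 1, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 0, 0],
              [1, 0, 0, 1]])

inter = (Y & Z).sum(axis=1)        # |Y ∩ Z| per example -> [1, 2, 2, 1, 1]
union = (Y | Z).sum(axis=1)        # |Y ∪ Z| per example -> [3, 2, 2, 2, 2]
print(inter / Z.sum(axis=1))       # per-example precision: [0.5, 1, 1, 1, 0.5]
print(inter / Y.sum(axis=1))       # per-example recall:    [0.5, 1, 1, 0.5, 1]
print(inter.sum() / Z.sum(), inter.sum() / Y.sum())   # 7/9 = 0.778 aggregates
print(inter.sum() / union.sum())                      # 7/11 = 0.636 accuracy
print((Y != Z).mean())                                # Hamming loss = 0.2
print((Y == Z).all(axis=1).mean())                    # subset accuracy = 2/5 = 0.4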
Example based – Ranking Metrics
• Coverage error:
The coverage_error function computes the average number of labels that have to be included in the final
prediction so that all true labels are predicted. This is useful if you want to know how many top-scored
labels you have to predict on average without missing any true one. The best value of this metric is thus
the average number of true labels per sample in y_true. (Coverage: the average depth needed to cover all
true labels.)
• One error:
The fraction of examples whose top-ranked label is not in the set of true labels.
• Ranking loss:
The average fraction of label pairs that are incorrectly ordered, i.e. true labels that receive a lower score
than false labels, weighted by the inverse of the number of ordered pairs of false and true labels. The best
performance is achieved with a ranking loss of zero.
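These ranking metrics are computed on continuous scores rather than 0/1 predictions. A short sketch with hypothetical scores; because these scores rank every true label above every false one, coverage equals the average number of true labels (2), ranking loss is 0 and LRAP is 1:

import numpy as np
from sklearn.metrics import (coverage_error, label_ranking_loss,
                             label_ranking_average_precision_score)

y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 1],
                   [1, 0, 0, 1]])
# Hypothetical model scores (e.g., predict_proba output), not thresholded labels.
y_score = np.array([[0.9, 0.1, 0.6, 0.4],
                    [0.2, 0.8, 0.1, 0.7],
                    [0.7, 0.3, 0.2, 0.9]])

print(coverage_error(y_true, y_score))                         # average depth needed to cover all true labels
print(label_ranking_loss(y_true, y_score))                     # 0 is best
print(label_ranking_average_precision_score(y_true, y_score))  # 1 is best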
Label Based Metrics
• Label-based metrics are an extended form of the
evaluation measures used in the single-label
classification domain.
• In micro-averaging, you sum up the
individual true positives, false positives, and
false negatives over the different labels and
then compute the metric from these aggregated
counts. The micro-averaged F1-score is simply
the harmonic mean of the micro-averaged
precision and recall.
• Macro-averaging is straightforward: we just
take the unweighted average of the precision and
recall computed on each label separately.
Here TPj, FPj, TNj and FNj denote the true positive, false positive, true negative and false negative counts, respectively, for label j alone.
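A sketch of the two averaging modes with scikit-learn's precision_recall_fscore_support, applied to the exercise data used later in the deck (zero_division=0 is an assumption for handling labels with no positive predictions):

import numpy as np
from sklearn.metrics import precision_recall_fscore_support

y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 1],
                   [1, 0, 0, 1],
                   [0, 1, 1, 0],
                   [1, 0, 0, 0]])
y_pred = np.array([[1, 0, 0, 1],
                   [0, 1, 0, 1],
                   [1, 0, 0, 1],
                   [0, 1, 0, 0],
                   [1, 0, 0, 1]])

# Macro: compute precision/recall/F1 per label, then take their unweighted mean.
print(precision_recall_fscore_support(y_true, y_pred, average='macro', zero_division=0))
# Micro: pool TP/FP/FN over all labels first, then compute the metrics once.
print(precision_recall_fscore_support(y_true, y_pred, average='micro', zero_division=0))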
Label based – Area Under the Curve (macro and micro)
• An ROC curve (receiver operating characteristic
curve) is a graph showing the performance of a
classification model at all classification thresholds.
• An ROC curve plots TPR vs. FPR at different
classification thresholds.
• Lowering the classification threshold classifies
more items as positive, thus increasing both False
Positives and True Positives. The following figure
shows a typical ROC curve.
• ROC shows you how many correct positive
classifications can be gained as you allow for more
and more false positives.
• AUC is based on the relative predictions, so any
transformation of the predictions that preserves
the relative ranking has no effect on AUC. This is
clearly not the case for other metrics such as
squared error, log loss, or prediction bias.
True Positive Rate (TPR) is a synonym for recall and is
therefore defined as follows:
TPR = TP / (TP + FN)
False Positive Rate (FPR) is defined as follows:
FPR = FP / (FP + TN)
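For multi-label data, roc_auc_score accepts the indicator matrix of true labels together with per-label scores and supports both averaging modes. A brief sketch with hypothetical scores:

import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
# Hypothetical per-label scores (e.g., predict_proba), not thresholded predictions.
y_score = np.array([[0.8, 0.3, 0.7],
                    [0.2, 0.9, 0.4],
                    [0.7, 0.6, 0.2],
                    [0.1, 0.4, 0.8]])

print(roc_auc_score(y_true, y_score, average='macro'))   # mean of the per-label AUCs
print(roc_auc_score(y_true, y_score, average='micro'))   # AUC over the pooled (flattened) decisions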
Label based – Exercise
Gold standard vs. Predicted (label order DE, LT, HO, OT):

Gold standard | Predicted
1 0 1 0       | 1 0 0 1
0 1 0 1       | 0 1 0 1
1 0 0 1       | 1 0 0 1
0 1 1 0       | 0 1 0 0
1 0 0 0       | 1 0 0 1

Confusion matrix layout per label (rows = Actual, columns = Predicted):
            Pred 1   Pred 0
Actual 1      TP       FN
Actual 0      FP       TN

Worked example for DE: tallying the DE column row by row gives TP = 3, FN = 0, FP = 0, TN = 2, i.e.
            Pred 1   Pred 0
Actual 1      3        0
Actual 0      0        2

The confusion matrices for LT, HO and OT are built the same way from their respective columns.
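scikit-learn can produce all four per-label confusion matrices in one call; a sketch on the same gold/predicted table (note that sklearn lays each 2x2 matrix out as [[TN, FP], [FN, TP]]):

import numpy as np
from sklearn.metrics import multilabel_confusion_matrix

labels = ["DE", "LT", "HO", "OT"]
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 1],
                   [1, 0, 0, 1],
                   [0, 1, 1, 0],
                   [1, 0, 0, 0]])
y_pred = np.array([[1, 0, 0, 1],
                   [0, 1, 0, 1],
                   [1, 0, 0, 1],
                   [0, 1, 0, 0],
                   [1, 0, 0, 1]])

for name, cm in zip(labels, multilabel_confusion_matrix(y_true, y_pred)):
    (tn, fp), (fn, tp) = cm
    print(name, dict(TP=tp, FN=fn, FP=fp, TN=tn))
# For DE this yields TP=3, FN=0, FP=0, TN=2, matching the worked matrix above.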
Label based – Exercise contd.
Gold standard vs. Predicted (label order DE, LT, HO, OT), same data as above:

Gold standard | Predicted
1 0 1 0       | 1 0 0 1
0 1 0 1       | 0 1 0 1
1 0 0 1       | 1 0 0 1
0 1 1 0       | 0 1 0 0
1 0 0 0       | 1 0 0 1

Per-label results (HO has no positive predictions, so its precision is taken as 0):

Label      | Precision                 | Recall | F1 score
DE         | 100%                      | 100%   | 100%
LT         | 100%                      | 100%   | 100%
HO         | 0%                        | 0%     | 0%
OT         | 50%                       | 100%   | 67%
Macro avg  | (100+100+0+50)/4 = 62.5%  | 75%    | 67%
Micro avg  | 7/9 = 78%                 | 78%    | 78%

micro: calculate metrics globally by counting the total number of true
positives, false positives and false negatives over all labels.
macro: calculate metrics for each label independently and take their
unweighted mean. This does not take label imbalance into account.

Micro precision = Sum(TP over DE, LT, HO, OT) / (Sum(TP over DE, LT, HO, OT) + Sum(FP over DE, LT, HO, OT))
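The micro and macro aggregates can be checked directly from the summed per-label counts; a sketch on the same data (zero_division=0 is an assumption for the HO label, which has no positive predictions):

import numpy as np
from sklearn.metrics import multilabel_confusion_matrix, f1_score

y_true = np.array([[1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 0, 1], [0, 1, 1, 0], [1, 0, 0, 0]])
y_pred = np.array([[1, 0, 0, 1], [0, 1, 0, 1], [1, 0, 0, 1], [0, 1, 0, 0], [1, 0, 0, 1]])

cms = multilabel_confusion_matrix(y_true, y_pred)       # one [[TN, FP], [FN, TP]] block per label
tp, fp, fn = cms[:, 1, 1], cms[:, 0, 1], cms[:, 1, 0]   # per-label counts

print(tp.sum() / (tp.sum() + fp.sum()))                 # micro precision = Sum(TP) / (Sum(TP) + Sum(FP)) = 7/9
print(tp.sum() / (tp.sum() + fn.sum()))                 # micro recall = 7/9
print(f1_score(y_true, y_pred, average='micro'))                    # 0.778
print(f1_score(y_true, y_pred, average='macro', zero_division=0))   # unweighted mean of per-label F1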
Appendix
Need of Average Precision (AP)
• Because F1 only evaluates the model's performance at a specific threshold, people have developed
metrics such as ROC (not covered here) and mAP to evaluate performance over all possible
thresholds.
• In a typical data set there are many classes and their distribution is non-uniform (some classes
have far more examples than others), so a simple accuracy-based metric will introduce biases.
• It is also important to assess the risk of misclassifications. Thus, there is a need to associate a
“confidence score” (model score) with each bounding box detected and to assess the model at
various levels of confidence.
• To address these needs, Average Precision (AP) was introduced. The AP score takes
the average value of the precision across all recall values; rather than comparing whole curves, it is
often useful to have a single number that characterizes the performance of a classifier.
• By interpolating all points, the Average Precision (AP) can be interpreted as an
approximated AUC of the Precision x Recall curve. The intention is to reduce the impact
of the wiggles in the curve.
Coding Guide
from sklearn import metrics

# Classification metrics: computed on 0/1 label indicators (y_pred).
results['subset_accuracy'] = metrics.accuracy_score(y_test, y_pred)
results['hamming_loss'] = metrics.hamming_loss(y_test, y_pred)
results['zero_one_loss'] = metrics.zero_one_loss(y_test, y_pred)  # zero-one loss = 1 - subset accuracy

# Ranking metrics: computed on continuous scores (e.g., predict_proba output), not thresholded predictions.
results['coverage'] = metrics.coverage_error(y_test, y_score)
results['avg_precision'] = metrics.average_precision_score(y_test, y_score)
results['ranking_loss'] = metrics.label_ranking_loss(y_test, y_score)  # best performance is a ranking loss of zero
# Label ranking average precision (LRAP) averages over the samples the answer to the following question:
# for each ground-truth label, what fraction of higher-ranked labels were true labels?
results['LRAP'] = metrics.label_ranking_average_precision_score(y_test, y_score)
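For context, a hedged sketch of how the y_test, y_pred and y_score inputs above might be produced; the one-vs-rest logistic regression and the synthetic data are placeholders, not the deck's actual pipeline:

from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic multi-label data purely for illustration.
X, y = make_multilabel_classification(n_samples=200, n_classes=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_train, y_train)
y_pred = clf.predict(X_test)          # 0/1 label indicators -> classification metrics
y_score = clf.predict_proba(X_test)   # per-label scores     -> ranking metrics
results = {}                          # dictionary used in the snippet above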
Accuracy Not Enough?
• When we use accuracy, we assign equal cost to false positives (the
test incorrectly reports a positive result) and false negatives (the
test incorrectly reports a negative result). When the data set is
imbalanced – say it has 99% of instances in one class and only 1% in
the other – there is an easy way to lower the cost: predict that every
instance belongs to the majority class, get an accuracy of 99%, and go
home early.
The problem starts when the actual costs that we assign to each type of
error are not equal. If we deal with a rare but fatal disease, the cost of
failing to diagnose a sick person is much higher than
the cost of sending a healthy person for more tests.
Intersection over Union (figure)
Accuracy & Precision: trueness and repeatability – high repeatability does not guarantee a true value (figure)
Calculating the total area, we have the AP (figure)
Calculating the interpolation performed at all points (figure)
By interpolating all points, the Average Precision (AP) can be interpreted as an approximated AUC of the Precision x Recall curve. The intention is to reduce the impact of the wiggles in the curve.
Minimum Metrics for Multilabel evaluation (figure)
References
• https://hong.xmu.edu.cn/__local/9/53/7F/E93CE99DE745EB85A5C5B65F7F3_96AF80B1_10634E.pdf?e=.pdf
• AP: https://sanchom.wordpress.com/tag/average-precision/
• AP (2): https://github.com/rafaelpadilla/Object-Detection-Metrics
• Accuracy not enough: https://medium.com/@abhimicro3/why-classification-accuracy-is-not-enough-9134241c0352
Thank you,
Sridhar Nomula
Editor's Notes
  • #2: Metrics to judge the success of a model.
  • #3: https://peerj.com/articles/3095/
  • #5: How to solve text multi-label classification problems? We can consider two possible approaches: a classic ML solution using the sklearn or scikit-multilearn libraries, or one based on deep learning algorithms.
  • #7: Multi-label classification problems must be assessed using different performance measures than single-label classification problems. Two of the most common performance metrics are Hamming loss and Jaccard similarity.
  • #9: https://stackoverflow.com/questions/9004172/precision-recall-for-multiclass-multilabel-classification — Multi-label text categorization algorithms may not produce the best performance because classifiers tend to be weighed down by the majority of the data and ignore the minority; bagging and adaptive boosting algorithms are one approach. The result is evaluated with four evaluation metrics: Hamming loss, subset accuracy, example-based accuracy and micro-averaged F-measure.
  • #10: How many of the predicted true labels are actually in the ground truth; precision is the ratio of true detections to the total number of objects the classifier predicted.
  • #11: Recall is the ratio of true detections to the total number of objects in the data set.
  • #12: Precision and recall vary with the strictness of your classifier's threshold. There are several ways to summarize the precision-recall curve with a single number called average precision.
  • #14: Zero-one loss: by default, the function returns the percentage of imperfectly predicted subsets.
  • #18: AP: https://sanchom.wordpress.com/tag/average-precision/ — The average_precision_score function computes the average precision (AP) from prediction scores. This score corresponds to the area under the precision-recall curve.
  • #19: For systems that return a ranked sequence of documents, it is desirable to also consider the order in which the returned documents are presented. We use interpolated average precision, which uses the maximum precision observed across all cutoffs with higher recall.
  • #20: Subset accuracy is a strict measurement. Subset accuracy and zero-one loss refer to the same notion of strict accuracy.
  • #22: In label-based evaluation, each label is evaluated first (across all examples where it shows up) and then averaged over all labels.
  • #23: AUC is desirable for two reasons: it is scale-invariant (it measures how well predictions are ranked, rather than their absolute values) and it is classification-threshold-invariant (it measures the quality of the model's predictions irrespective of what classification threshold is chosen).
  • #27: Further, there are variations on where to take the samples when computing the interpolated average precision. Some take samples at a fixed 11 points from 0 to 1: {0, 0.1, 0.2, …, 0.9, 1.0}; this is called 11-point interpolated average precision. Others sample at every point where the recall changes.