Evaluation Metrics of Multi-label & Multi-class Classification
- Sridhar Nomula
Introduction
Most classification problems associate a single class to each
example or instance. However, there are many classification
tasks where each instance can be associated with one or more
classes. This group of problems represents an area known as
multi-label classification.
The performance of multi-label classifiers cannot be assessed with exactly the
same definitions used for single-label classifiers. Precision, recall,
F-measure, ROC – many of the evaluation metrics you are familiar with from
multi-class classification do not readily translate to multi-label, because
they fail to capture the case of a predicted label set being only partially
correct.
To capture the notion of partial correctness, one can use metrics that fall
into two categories: example-based and label-based. In example-based
evaluation, the difference between the predicted and actual label sets is
measured for each example and then averaged over all examples in the test set.
In label-based evaluation, each label is evaluated first (across all examples
where it shows up) and the results are then averaged over all labels.
Multi-class classification
• A classification task with more than two
classes, where each instance belongs to
exactly one class.
• E.g., Classify a set of images of fruits which
may be oranges, apples, or pears. Multiclass
classification makes the assumption that
each sample is assigned to one and only one
label: a fruit can be either an apple or a pear
but not both at the same time.
Multi-label classification
• Multi-label classification is the task of predicting labels
from two or more categories, where each instance can belong
to more than one class (each sample is assigned a set of
target labels).
• This can be thought of as predicting properties of a
data point that are not mutually exclusive, such
as the topics that are relevant for a document. A
text might be about any of religion, politics,
finance or education at the same time, or none
of these.
Multi-label classification: Challenges
• Highly imbalanced dataset – each label may occur a different
number of times, and each document carries a different number of
labels.
• Documents of different lengths – for text classification,
most ML algorithms require inputs of equal length.
• Many metrics to choose from.
Categories in Metrics
In multi-label tasks, the results can be partially correct or partially wrong, and the metrics need to capture
this notion of partial correctness. The performance metrics of multi-label classifiers can be categorized as
label-based and example-based.
Label based metrics:
These are calculated separately for each of the labels and then averaged over all labels, without taking
into account any relation between the labels. Each label is evaluated first (across all examples) and the
results are then averaged. It is important to note that any such label-based method fails to address the
correlation among the different classes. Examples include one-error, average precision, etc.
Example based metrics:
The metrics are computed in a “per data point” manner. They are calculated for each example and
then averaged across the test set. Examples include accuracy, Hamming loss, etc.
Understanding the Metrics
Precision
Precision attempts to answer the following question:
What proportion of positive identifications was actually correct?
(In measurement terms, precision is the stability of a measurement when repeated many times, i.e. whether
the measurement is consistent with other measurements. For a classifier, it is a measure of exactness.)
Recall
Recall attempts to answer the following question:
What proportion of actual positives was identified correctly?
It is also called Sensitivity or the True Positive Rate (TPR).
Accuracy
Accuracy attempts to answer the following question:
What fraction of predictions did our model get right?
It is the proportion of correct results that a classifier achieved.
Classification accuracy alone cannot be trusted to select a well-performing model when a class
imbalance exists.
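As a quick reference, these three questions reduce to ratios over the confusion-matrix counts. A minimal sketch in Python, using hypothetical TP/FP/FN/TN counts chosen purely for illustration:

# Hypothetical confusion-matrix counts for a single binary label (illustrative only).
tp, fp, fn, tn = 30, 10, 5, 55

precision = tp / (tp + fp)                    # of everything predicted positive, how much was correct?
recall = tp / (tp + fn)                       # of everything actually positive, how much did we find?
accuracy = (tp + tn) / (tp + fp + fn + tn)    # fraction of all predictions that were correct

print(f"precision={precision:.2f}, recall={recall:.2f}, accuracy={accuracy:.2f}")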
Example based Metrics
Classification
• Subset Accuracy
• Hamming loss
• Accuracy
• Precision
• Average Precision
• Recall
• F1 score
Ranking
• One error
• Coverage
• Ranking loss
In example-based evaluation, the difference between the predicted
and actual label sets is measured for each example, and then
averaged over all examples in the test set.
Example-based metrics are specifically built for the multi-label domain.
Example based – Precision
• Out of the categories predicted, how many of them are true categories.
• Precision = |Y ∩ Z| / |Z|
• Y = true labels; Z = predicted labels
• The ratio of how much of what was predicted is correct.
• The numerator counts how many labels the predicted vector
has in common with the ground truth.
Example based – Recall
• Out of the total true categories, how
many of them were predicted.
• Recall = |Y ∩ Z| / |Y|
• Finally, it is very important to note that
there is an inverse relationship
between precision and recall, and
that these metrics depend on the
model score threshold that you set.
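A minimal sketch of the two example-based formulas above, using plain Python sets for one instance's true labels Y and predicted labels Z (the values mirror the first row of the experiment shown later in the deck):

# One instance: ground-truth label set Y and predicted label set Z.
Y = {"DE", "HO"}   # true labels
Z = {"DE", "OT"}   # predicted labels

precision = len(Y & Z) / len(Z)   # |Y ∩ Z| / |Z| -> 1/2 = 0.5
recall = len(Y & Z) / len(Y)      # |Y ∩ Z| / |Y| -> 1/2 = 0.5
print(precision, recall)

For a whole test set, these values are computed per example and then averaged.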
Example based – F1 score
• The F1 measure is a single measure
obtained by combining the two
evaluation measures precision and
recall.
• It is used to trade off precision
against recall.
Example based – Accuracy (Jaccard Index)
JACCARD INDEX – often called multi-label ACCURACY
• Measures partial correctness.
• Accuracy for each instance is defined as the
proportion of correctly predicted labels to
the total number of labels for that instance,
|Y ∩ Z| / |Y ∪ Z|. Overall accuracy is the
average across all instances.
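A short sketch of example-based (Jaccard) accuracy, computed by hand and via scikit-learn's jaccard_score with average='samples'; the two indicator matrices here are illustrative:

import numpy as np
from sklearn.metrics import jaccard_score

# Rows = instances, columns = labels (illustrative values).
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 1]])
y_pred = np.array([[1, 0, 0, 1],
                   [0, 1, 0, 1]])

# Per-instance Jaccard accuracy: |Y ∩ Z| / |Y ∪ Z|, then averaged over instances.
per_instance = [(t & p).sum() / (t | p).sum() for t, p in zip(y_true, y_pred)]
print(per_instance)                                       # [0.333..., 1.0]
print(jaccard_score(y_true, y_pred, average='samples'))   # mean of the above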
Example based – Subset Accuracy & Exact Match Ratio
• The Exact Match Ratio ignores partially correct predictions and treats
such examples as incorrect (very strict).
• In multi-label classification, the zero-one loss corresponds to the
subset zero-one loss:
Zero-One loss = 1 − subset accuracy
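In scikit-learn terms, accuracy_score on multi-label indicator arrays is exactly subset accuracy (the Exact Match Ratio), and zero_one_loss is its complement. A brief sketch, using the first three rows of the experiment that appears later in the deck:

import numpy as np
from sklearn.metrics import accuracy_score, zero_one_loss

y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 1],
                   [1, 0, 0, 1]])
y_pred = np.array([[1, 0, 0, 1],   # partially correct -> counts as wrong for subset accuracy
                   [0, 1, 0, 1],   # exact match
                   [1, 0, 0, 1]])  # exact match

print(accuracy_score(y_true, y_pred))   # subset accuracy = 2/3
print(zero_one_loss(y_true, y_pred))    # 1 - subset accuracy = 1/3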
Example based – Hamming loss
• Hamming loss is the average fraction
of incorrect labels.
Or
• Hamming loss measures the number
of times an (instance, label) pair is
misclassified.
• Note that Hamming loss is a loss
function, so the perfect score is 0.
• A lower Hamming loss indicates better
classification performance.
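A small sketch of Hamming loss, computed by hand and with scikit-learn; the arrays reuse the gold-standard and predicted matrices from the experiment slide later in the deck (label order DE, LT, HO, OT):

import numpy as np
from sklearn.metrics import hamming_loss

y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 1],
                   [1, 0, 0, 1],
                   [0, 1, 1, 0],
                   [1, 0, 0, 0]])
y_pred = np.array([[1, 0, 0, 1],
                   [0, 1, 0, 1],
                   [1, 0, 0, 1],
                   [0, 1, 0, 0],
                   [1, 0, 0, 1]])

# Fraction of (instance, label) cells that disagree: 4 wrong cells / (5 x 4) = 0.2
print((y_true != y_pred).mean())
print(hamming_loss(y_true, y_pred))   # same value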
Hamming Loss
• Hamming loss and subset 0/1 loss
cannot, in general, be optimized at
the same time.
• Hamming loss can in principle be
minimized without taking label
dependence into account.
• For 0/1 loss, label dependence
must be taken into account.
• It is usually not possible to
minimize both at the same time!
• For general evaluation, use
multiple and contrasting
evaluation measures!
Additional Metrics – Log Loss (Cross-Entropy)
• Log loss, also called logistic regression loss or
cross-entropy loss, is defined on probability
estimates.
• It is commonly used in (multinomial) logistic
regression and neural networks, as well as in
some variants of expectation-maximization,
and can be used to evaluate the probability
outputs (predict_proba) of a classifier instead
of its discrete predictions.
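A brief sketch of log loss evaluated on probability estimates. For the multi-label case, one common convention (an assumption here, not the only option) is to average the binary log loss over the labels:

import numpy as np
from sklearn.metrics import log_loss

# Binary ground truth and predicted probabilities for 3 labels over 4 samples (illustrative values).
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_prob = np.array([[0.9, 0.2, 0.7],
                   [0.1, 0.8, 0.3],
                   [0.8, 0.6, 0.2],
                   [0.3, 0.1, 0.9]])

# Per-label binary log loss, then averaged across labels.
per_label = [log_loss(y_true[:, j], y_prob[:, j]) for j in range(y_true.shape[1])]
print(np.mean(per_label))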
Example based – Average Precision
Average Precision (AP) is computed for each class; the mean Average Precision (mAP)
is the average of the APs over all classes. There are two common ways to measure
the interpolated average precision: 11-point interpolation and interpolating all points.
11-point interpolation
• For a given task and class, the precision/recall curve is computed from a
method's ranked output. The AP summarizes the shape of the
precision/recall curve, and is defined as the mean of the interpolated precision
at a set of eleven equally spaced recall levels [0, 0.1, ..., 1]:
AP = (1/11) * sum over r in {0, 0.1, ..., 1} of p_interp(r),
with p_interp(r) = max over r' >= r of p(r'),
where p(r') is the measured precision at recall r'.
• Instead of using the precision observed at every point, the AP is
obtained by interpolating the precision only at the 11 recall levels r,
taking the maximum precision whose recall value is greater than or equal to r.
In practice, AP is the precision averaged
across all recall values between 0 and 1,
i.e. the area under the precision/recall curve.
The integral is closely approximated
by a sum over the precisions at every
possible threshold value, multiplied by
the change in recall:
AP = sum over k of p(r_k) * Δr(k),
where p(r_k) is the measured precision at the k-th recall value and Δr(k) is the change in recall from the previous threshold.
Example based – Average Precision contd.
Interpolating all points
• Instead of interpolating only at the 11 equally spaced points,
you can interpolate through all points, so that
p_interp(r) = max over r' >= r of p(r').
• In this case, instead of using the precision observed at only a few
points, the AP is obtained by interpolating the precision
at each recall level r, taking the maximum precision whose recall
value is greater than or equal to r. This way we estimate the
area under the curve.
• A good way to characterize the performance of a classifier is
to look at how precision and recall change as you change
the threshold.
• To calculate the AP for a specific class (say “DE”), the
precision-recall curve is computed from the model's output
by varying the model score threshold.
By computing precision and recall at every position in the
ranked sequence of documents, one can plot a precision-recall
curve: precision p(r) as a function of recall r.
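A compact sketch of the all-points interpolation described above, for a single class: build the precision/recall curve from scores, replace each precision by the maximum precision at any recall to its right (i.e. at recall >= r), and sum interpolated precision times the change in recall. Labels and scores are illustrative:

import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

# Ground truth for one class (say "DE") and the model's scores for that class.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_score = np.array([0.95, 0.90, 0.85, 0.80, 0.70, 0.60, 0.55, 0.40, 0.30, 0.10])

precision, recall, _ = precision_recall_curve(y_true, y_score)

# precision_recall_curve returns recall in decreasing order, so a running maximum
# gives p_interp(r) = max precision over all points with recall >= r.
p_interp = np.maximum.accumulate(precision)

# AP ~ sum of interpolated precision times the change in recall (area under the step curve).
ap = np.sum(np.abs(np.diff(recall)) * p_interp[:-1])
print(ap)
print(average_precision_score(y_true, y_score))   # sklearn's non-interpolated variant, for comparison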
Example-Based – Experiment
Label order: DE, LT, HO, OT.

Gold standard | Predicted | |Y| | |Z| | |Y∩Z| | |Y∪Z| | Precision | Recall | F1   | Accuracy
1 0 1 0       | 1 0 0 1   |  2  |  2  |   1   |   3   |   0.5     |  0.5   | 50%  | 0.333
0 1 0 1       | 0 1 0 1   |  2  |  2  |   2   |   2   |   1       |  1     | 100% | 1
1 0 0 1       | 1 0 0 1   |  2  |  2  |   2   |   2   |   1       |  1     | 100% | 1
0 1 1 0       | 0 1 0 0   |  2  |  1  |   1   |   2   |   1       |  0.5   | 67%  | 0.5
1 0 0 0       | 1 0 0 1   |  1  |  2  |   1   |   2   |   0.5     |  1     | 67%  | 0.5
Totals        |           |  9  |  9  |   7   |  11   |   0.778   |  0.778 | 0.78 | 0.636

Hamming loss = 4 incorrect (instance, label) cells / (4 labels × 5 records) = 0.2

Subset Accuracy = (No. of exactly correctly classified samples) / (Total number of samples) = 2/5 = 0.4
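The table above can be reproduced with a few lines of NumPy; a sketch using the same gold-standard (Y) and predicted (Z) matrices, label order DE, LT, HO, OT:

import numpy as np

Y = np.array([[1, 0, 1, 0],   # gold standard
              [0, 1, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0],
              [1, 0, 0, 0]])
Z = np.array([[1, 0, 0, 1],   # predicted
              [0, 1, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 0, 0],
              [1, 0, 0, 1]])

inter = (Y & Z).sum(axis=1)        # |Y ∩ Z| per example -> [1, 2, 2, 1, 1]
union = (Y | Z).sum(axis=1)        # |Y ∪ Z| per example -> [3, 2, 2, 2, 2]
print(inter / Z.sum(axis=1))       # per-example precision: [0.5, 1, 1, 1, 0.5]
print(inter / Y.sum(axis=1))       # per-example recall:    [0.5, 1, 1, 0.5, 1]
print(inter.sum() / Z.sum(), inter.sum() / Y.sum())   # 7/9 = 0.778 aggregates
print(inter.sum() / union.sum())                      # 7/11 = 0.636 accuracy
print((Y != Z).mean())                                # Hamming loss = 0.2
print((Y == Z).all(axis=1).mean())                    # subset accuracy = 2/5 = 0.4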
Example based – Ranking Metrics
• Coverage error:
The coverage_error function computes the average number of labels that have to be included in the final
prediction so that all true labels are predicted. This is useful if you want to know how many top-scored
labels you have to predict on average without missing any true one. The best value of this metric is thus
the average number of true labels per sample in y_true. (Coverage: the average depth needed to cover all
true labels.)
• One error:
The fraction of examples whose top-ranked label is not in the set of true labels.
• Ranking loss:
The average fraction of label pairs that are incorrectly ordered, i.e. true labels that receive a lower score
than false labels, weighted by the inverse of the number of ordered pairs of false and true labels. The best
performance is achieved with a ranking loss of zero.
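These ranking metrics are computed on continuous scores rather than 0/1 predictions. A short sketch with hypothetical scores; because these scores rank every true label above every false one, coverage equals the average number of true labels (2), ranking loss is 0 and LRAP is 1:

import numpy as np
from sklearn.metrics import (coverage_error, label_ranking_loss,
                             label_ranking_average_precision_score)

y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 1],
                   [1, 0, 0, 1]])
# Hypothetical model scores (e.g., predict_proba output), not thresholded labels.
y_score = np.array([[0.9, 0.1, 0.6, 0.4],
                    [0.2, 0.8, 0.1, 0.7],
                    [0.7, 0.3, 0.2, 0.9]])

print(coverage_error(y_true, y_score))                         # average depth needed to cover all true labels
print(label_ranking_loss(y_true, y_score))                     # 0 is best
print(label_ranking_average_precision_score(y_true, y_score))  # 1 is best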
Label Based Metrics
• Label-based metrics are an extended form of the
evaluation measures used in the single-label
classification domain.
• In micro-averaging, you sum up the
individual true positives, false positives, and
false negatives over the different labels and
then compute the metric from these aggregated
counts. The micro-averaged F1-score is simply
the harmonic mean of the micro-averaged
precision and recall.
• Macro-averaging is straightforward: we just
take the unweighted average of the precision and
recall computed on each label separately.
Here TPj, FPj, TNj and FNj denote the true positive, false positive, true negative and false negative counts, respectively, for label j alone.
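A sketch of the two averaging modes with scikit-learn's precision_recall_fscore_support, applied to the exercise data used later in the deck (zero_division=0 is an assumption for handling labels with no positive predictions):

import numpy as np
from sklearn.metrics import precision_recall_fscore_support

y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 1],
                   [1, 0, 0, 1],
                   [0, 1, 1, 0],
                   [1, 0, 0, 0]])
y_pred = np.array([[1, 0, 0, 1],
                   [0, 1, 0, 1],
                   [1, 0, 0, 1],
                   [0, 1, 0, 0],
                   [1, 0, 0, 1]])

# Macro: compute precision/recall/F1 per label, then take their unweighted mean.
print(precision_recall_fscore_support(y_true, y_pred, average='macro', zero_division=0))
# Micro: pool TP/FP/FN over all labels first, then compute the metrics once.
print(precision_recall_fscore_support(y_true, y_pred, average='micro', zero_division=0))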
Label based – Area Under the Curve (macro and micro)
• An ROC curve (receiver operating characteristic
curve) is a graph showing the performance of a
classification model at all classification thresholds.
• An ROC curve plots TPR vs. FPR at different
classification thresholds.
• Lowering the classification threshold classifies
more items as positive, thus increasing both False
Positives and True Positives. The following figure
shows a typical ROC curve.
• ROC shows you how many correct positive
classifications can be gained as you allow for more
and more false positives.
• AUC is based on the relative predictions, so any
transformation of the predictions that preserves
the relative ranking has no effect on AUC. This is
clearly not the case for other metrics such as
squared error, log loss, or prediction bias.
True Positive Rate (TPR) is a synonym for recall and is
therefore defined as follows:
TPR = TP / (TP + FN)
False Positive Rate (FPR) is defined as follows:
FPR = FP / (FP + TN)
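For multi-label data, roc_auc_score accepts the indicator matrix of true labels together with per-label scores and supports both averaging modes. A brief sketch with hypothetical scores:

import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
# Hypothetical per-label scores (e.g., predict_proba), not thresholded predictions.
y_score = np.array([[0.8, 0.3, 0.7],
                    [0.2, 0.9, 0.4],
                    [0.7, 0.6, 0.2],
                    [0.1, 0.4, 0.8]])

print(roc_auc_score(y_true, y_score, average='macro'))   # mean of the per-label AUCs
print(roc_auc_score(y_true, y_score, average='micro'))   # AUC over the pooled (flattened) decisions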
Label based – Exercise
Gold standard vs. Predicted (label order DE, LT, HO, OT):

Gold standard | Predicted
1 0 1 0       | 1 0 0 1
0 1 0 1       | 0 1 0 1
1 0 0 1       | 1 0 0 1
0 1 1 0       | 0 1 0 0
1 0 0 0       | 1 0 0 1

Confusion matrix layout per label (rows = Actual, columns = Predicted):
            Pred 1   Pred 0
Actual 1      TP       FN
Actual 0      FP       TN

Worked example for DE: tallying the DE column row by row gives TP = 3, FN = 0, FP = 0, TN = 2, i.e.
            Pred 1   Pred 0
Actual 1      3        0
Actual 0      0        2

The confusion matrices for LT, HO and OT are built the same way from their respective columns.
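scikit-learn can produce all four per-label confusion matrices in one call; a sketch on the same gold/predicted table (note that sklearn lays each 2x2 matrix out as [[TN, FP], [FN, TP]]):

import numpy as np
from sklearn.metrics import multilabel_confusion_matrix

labels = ["DE", "LT", "HO", "OT"]
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 1],
                   [1, 0, 0, 1],
                   [0, 1, 1, 0],
                   [1, 0, 0, 0]])
y_pred = np.array([[1, 0, 0, 1],
                   [0, 1, 0, 1],
                   [1, 0, 0, 1],
                   [0, 1, 0, 0],
                   [1, 0, 0, 1]])

for name, cm in zip(labels, multilabel_confusion_matrix(y_true, y_pred)):
    (tn, fp), (fn, tp) = cm
    print(name, dict(TP=tp, FN=fn, FP=fp, TN=tn))
# For DE this yields TP=3, FN=0, FP=0, TN=2, matching the worked matrix above.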
Label based – Exercise contd.
Gold standard vs. Predicted (label order DE, LT, HO, OT), same data as above:

Gold standard | Predicted
1 0 1 0       | 1 0 0 1
0 1 0 1       | 0 1 0 1
1 0 0 1       | 1 0 0 1
0 1 1 0       | 0 1 0 0
1 0 0 0       | 1 0 0 1

Per-label results (HO has no positive predictions, so its precision is taken as 0):

Label      | Precision                 | Recall | F1 score
DE         | 100%                      | 100%   | 100%
LT         | 100%                      | 100%   | 100%
HO         | 0%                        | 0%     | 0%
OT         | 50%                       | 100%   | 67%
Macro avg  | (100+100+0+50)/4 = 62.5%  | 75%    | 67%
Micro avg  | 7/9 = 78%                 | 78%    | 78%

micro: calculate metrics globally by counting the total number of true
positives, false positives and false negatives over all labels.
macro: calculate metrics for each label independently and take their
unweighted mean. This does not take label imbalance into account.

Micro precision = Sum(TP over DE, LT, HO, OT) / (Sum(TP over DE, LT, HO, OT) + Sum(FP over DE, LT, HO, OT))
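The micro and macro aggregates can be checked directly from the summed per-label counts; a sketch on the same data (zero_division=0 is an assumption for the HO label, which has no positive predictions):

import numpy as np
from sklearn.metrics import multilabel_confusion_matrix, f1_score

y_true = np.array([[1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 0, 1], [0, 1, 1, 0], [1, 0, 0, 0]])
y_pred = np.array([[1, 0, 0, 1], [0, 1, 0, 1], [1, 0, 0, 1], [0, 1, 0, 0], [1, 0, 0, 1]])

cms = multilabel_confusion_matrix(y_true, y_pred)       # one [[TN, FP], [FN, TP]] block per label
tp, fp, fn = cms[:, 1, 1], cms[:, 0, 1], cms[:, 1, 0]   # per-label counts

print(tp.sum() / (tp.sum() + fp.sum()))                 # micro precision = Sum(TP) / (Sum(TP) + Sum(FP)) = 7/9
print(tp.sum() / (tp.sum() + fn.sum()))                 # micro recall = 7/9
print(f1_score(y_true, y_pred, average='micro'))                    # 0.778
print(f1_score(y_true, y_pred, average='macro', zero_division=0))   # unweighted mean of per-label F1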
Appendix
Need of Average Precision (AP)
• Because F1 only evaluates the model's performance at a specific threshold, people have developed
metrics such as ROC (not covered here) and mAP to evaluate performance over all possible
thresholds.
• In a typical data set there are many classes and their distribution is non-uniform (some classes
have far more examples than others), so a simple accuracy-based metric will introduce biases.
• It is also important to assess the risk of misclassifications. Thus, there is a need to associate a
“confidence score” (model score) with each bounding box detected and to assess the model at
various levels of confidence.
• To address these needs, Average Precision (AP) was introduced. The AP score takes
the average value of the precision across all recall values; rather than comparing whole curves, it is
often useful to have a single number that characterizes the performance of a classifier.
• By interpolating all points, the Average Precision (AP) can be interpreted as an
approximated AUC of the Precision x Recall curve. The intention is to reduce the impact
of the wiggles in the curve.
Coding Guide
from sklearn import metrics

# Classification metrics: computed on 0/1 label indicators (y_pred).
results['subset_accuracy'] = metrics.accuracy_score(y_test, y_pred)
results['hamming_loss'] = metrics.hamming_loss(y_test, y_pred)
results['zero_one_loss'] = metrics.zero_one_loss(y_test, y_pred)  # zero-one loss = 1 - subset accuracy

# Ranking metrics: computed on continuous scores (e.g., predict_proba output), not thresholded predictions.
results['coverage'] = metrics.coverage_error(y_test, y_score)
results['avg_precision'] = metrics.average_precision_score(y_test, y_score)
results['ranking_loss'] = metrics.label_ranking_loss(y_test, y_score)  # best performance is a ranking loss of zero
# Label ranking average precision (LRAP) averages over the samples the answer to the following question:
# for each ground-truth label, what fraction of higher-ranked labels were true labels?
results['LRAP'] = metrics.label_ranking_average_precision_score(y_test, y_score)
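For context, a hedged sketch of how the y_test, y_pred and y_score inputs above might be produced; the one-vs-rest logistic regression and the synthetic data are placeholders, not the deck's actual pipeline:

from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic multi-label data purely for illustration.
X, y = make_multilabel_classification(n_samples=200, n_classes=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_train, y_train)
y_pred = clf.predict(X_test)          # 0/1 label indicators -> classification metrics
y_score = clf.predict_proba(X_test)   # per-label scores     -> ranking metrics
results = {}                          # dictionary used in the snippet above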
Accuracy Not Enough?
• When we use accuracy, we assign equal cost to false positives (the
test incorrectly reports a positive result) and false negatives (the
test incorrectly reports a negative result). When the data set is
imbalanced – say it has 99% of instances in one class and only 1% in
the other – there is an easy way to lower the cost: predict that every
instance belongs to the majority class, get an accuracy of 99%, and go
home early.
The problem starts when the actual costs that we assign to each type of
error are not equal. If we deal with a rare but fatal disease, the cost of
failing to diagnose a sick person is much higher than
the cost of sending a healthy person for more tests.
Intersection over Union (figure)
Accuracy & Precision: trueness and repeatability – high repeatability does not guarantee a true value (figure)
Calculating the total area, we have the AP (figure)
Calculating the interpolation performed at all points (figure)
By interpolating all points, the Average Precision (AP) can be interpreted as an approximated AUC of the Precision x Recall curve. The intention is to reduce the impact of the wiggles in the curve.
Minimum Metrics for Multilabel evaluation (figure)
References
• https://hong.xmu.edu.cn/__local/9/53/7F/E93CE99DE745EB85A5C5B65F7F3_96AF80B1_10634E.pdf?e=.pdf
• AP: https://sanchom.wordpress.com/tag/average-precision/
• AP (2): https://github.com/rafaelpadilla/Object-Detection-Metrics
• Accuracy not enough: https://medium.com/@abhimicro3/why-classification-accuracy-is-not-enough-9134241c0352
Thank you,
Sridhar Nomula
Editor's Notes
  • #2: Metrics to judge the success of a model.
  • #3: https://peerj.com/articles/3095/
  • #5: How to solve text multi-label classification problems? We can consider two possible approaches: a classic ML solution using the sklearn or scikit-multilearn libraries, or one based on deep learning algorithms.
  • #7: Multi-label classification problems must be assessed using different performance measures than single-label classification problems. Two of the most common performance metrics are Hamming loss and Jaccard similarity.
  • #9: https://stackoverflow.com/questions/9004172/precision-recall-for-multiclass-multilabel-classification — Multi-label text categorization algorithms may not produce the best performance because classifiers tend to be weighed down by the majority of the data and ignore the minority; bagging and adaptive boosting algorithms are one approach. The result is evaluated with four evaluation metrics: Hamming loss, subset accuracy, example-based accuracy and micro-averaged F-measure.
  • #10: How many of the predicted true labels are actually in the ground truth; precision is the ratio of true detections to the total number of objects the classifier predicted.
  • #11: Recall is the ratio of true detections to the total number of objects in the data set.
  • #12: Precision and recall vary with the strictness of your classifier's threshold. There are several ways to summarize the precision-recall curve with a single number called average precision.
  • #14: Zero-one loss: by default, the function returns the percentage of imperfectly predicted subsets.
  • #18: AP: https://sanchom.wordpress.com/tag/average-precision/ — The average_precision_score function computes the average precision (AP) from prediction scores. This score corresponds to the area under the precision-recall curve.
  • #19: For systems that return a ranked sequence of documents, it is desirable to also consider the order in which the returned documents are presented. We use interpolated average precision, which uses the maximum precision observed across all cutoffs with higher recall.
  • #20: Subset accuracy is a strict measurement. Subset accuracy and zero-one loss refer to the same notion of strict accuracy.
  • #22: In label-based evaluation, each label is evaluated first (across all examples where it shows up) and then averaged over all labels.
  • #23: AUC is desirable for two reasons: it is scale-invariant (it measures how well predictions are ranked, rather than their absolute values) and it is classification-threshold-invariant (it measures the quality of the model's predictions irrespective of what classification threshold is chosen).
  • #27: Further, there are variations on where to take the samples when computing the interpolated average precision. Some take samples at a fixed 11 points from 0 to 1: {0, 0.1, 0.2, …, 0.9, 1.0}; this is called 11-point interpolated average precision. Others sample at every point where the recall changes.