International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.5, No.2, March 2015
DOI : 10.5121/ijdkp.2015.5201
A REVIEW ON EVALUATION METRICS FOR
DATA CLASSIFICATION EVALUATIONS
Hossin, M.(1) and Sulaiman, M.N.(2)
(1) Faculty of Computer Science & Information Technology, Universiti Malaysia Sarawak, 94300 Kota Samarahan, Sarawak, Malaysia
(2) Faculty of Computer Science & Information Technology, Universiti Putra Malaysia, 43400 UPM Serdang, Selangor, Malaysia
ABSTRACT
The evaluation metric plays a critical role in achieving the optimal classifier during classification training. Thus, selecting a suitable evaluation metric is key to discriminating and obtaining the optimal classifier. This paper systematically reviews evaluation metrics that are specifically designed as discriminators for optimizing generative classifiers. Generally, many generative classifiers employ accuracy as the measure for discriminating the optimal solution during classification training. However, accuracy has several weaknesses: it is less distinctive, less discriminable, less informative and biased towards the majority class. This paper also briefly discusses other metrics that are specifically designed for discriminating the optimal solution, together with their shortcomings. Finally, this paper suggests five important aspects that must be taken into consideration in constructing a new discriminator metric.
KEYWORDS
Evaluation Metric, Accuracy, Optimized Classifier, Data Classification Evaluation
1. INTRODUCTION
In data classification problems, data can be divided into commercial data, texts, DNA sequences and images. This paper takes commercial data as the focus of discussion. Furthermore, data classification can be divided into binary, multiclass and multi-labelled classification [33]. This paper is aimed at binary and multiclass classification and focuses on the evaluation metrics used to assess the effectiveness of classifiers. In general, an evaluation metric is a measurement tool that quantifies the performance of a classifier. Different metrics evaluate different characteristics of the classifier induced by the classification algorithm.
From the literature, evaluation metrics can be categorized into three types: threshold, probability and ranking metrics [2]. Each type evaluates the classifier with a different aim. Furthermore, all of these types are scalar methods in which the entire performance is presented as a single score, which makes comparison and analysis easier although it can mask subtle details of classifier behaviour. In practice, threshold and ranking metrics are the most common metrics used by researchers to measure the performance of classifiers. In most cases, these metrics are employed in three different evaluation applications [23].
First, evaluation metrics are used to evaluate the generalization ability of the trained classifier. In this case, the evaluation metric measures and summarizes the quality of the trained classifier when it is tested with unseen data. Accuracy (or error rate) is one of the most common metrics used in practice to evaluate the generalization ability of classifiers. Through accuracy, the trained classifier is measured by its total correctness, that is, the number of instances it predicts correctly when tested with unseen data.
Second, evaluation metrics are employed as evaluators for model selection. In this case, the task of the evaluation metric is to determine, among different types of trained classifiers, the one expected to give the best future performance (the optimal model) when tested with unseen data. Third, evaluation metrics are employed as discriminators to discriminate and select the optimal (best) solution among all solutions generated during classification training. For example, the accuracy metric can be used to score every solution produced by a particular classification algorithm and select the best one. Only this best solution, which is believed to be the optimal model, is then tested with unseen data.
For the first and second applications, almost all threshold, probability and ranking metrics can be applied to evaluate the performance and effectiveness of classifiers. Conversely, only a few metrics can be employed as discriminators to discriminate and select the optimal solution during classification training. This paper emphasizes the third application of evaluation metrics for Prototype Selection (PS) classifiers, and only the evaluation metrics relevant to this application are discussed.
In general, a PS classifier is a generative type of classification algorithm that aims to generate a classifier model by applying a sampling technique, and then uses the generated model to achieve the highest possible classification accuracy on unseen data. Most PS classifiers, such as the Monte Carlo Sampling (MCS) algorithm [32], genetic algorithms [22], evolutionary algorithms [10] and tabu search [38], were developed based on statistical, nature-inspired or optimization methods, or combinations of these. Basically, all of these classification algorithms apply a stochastic or heuristic search to locate the optimal solution (a set of prototypes) by transforming the searching and discriminating processes into an optimization problem. In other words, these algorithms begin by constructing a fixed number of prototypes (a candidate solution). Every produced solution is then evaluated in order to determine the optimal solution that best represents the training data and, at the same time, achieves good generalization ability on unseen data. In order to search for and discriminate the optimal solution within a large space of solutions, the selection of a proper metric is crucial. Without a proper and suitable evaluation metric, a particular PS classifier may show poor generalization ability when tested with unseen data.
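To make the role of the discriminator concrete, the following Python sketch shows a Monte Carlo-style prototype selection loop in which the evaluation metric is a pluggable scoring function. This is not the implementation of any of the cited algorithms; the nearest-prototype (1-NN) classification rule, the number of prototypes, the number of trials and the default accuracy scorer are illustrative assumptions.

```python
import random

def nearest_prototype_predict(prototypes, labels, x):
    """Assign x the label of its closest prototype (1-NN rule, squared Euclidean distance)."""
    distances = [sum((a - b) ** 2 for a, b in zip(p, x)) for p in prototypes]
    return labels[distances.index(min(distances))]

def accuracy_score(y_true, y_pred):
    """Default discriminator: fraction of correct predictions."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def monte_carlo_prototype_selection(X, y, n_prototypes=5, n_trials=100,
                                    score_fn=accuracy_score, seed=0):
    """Randomly sample candidate prototype sets and keep the one that the
    discriminator metric (score_fn) ranks highest on the training data."""
    rng = random.Random(seed)
    best_score, best_idx = float("-inf"), None
    for _ in range(n_trials):
        idx = rng.sample(range(len(X)), n_prototypes)        # candidate solution
        protos, labels = [X[i] for i in idx], [y[i] for i in idx]
        preds = [nearest_prototype_predict(protos, labels, x) for x in X]
        score = score_fn(y, preds)                            # discriminate the solution
        if score > best_score:
            best_score, best_idx = score, idx                 # keep the best solution so far
    return best_idx, best_score
```

Any of the metrics reviewed in Section 2 could be plugged in as score_fn; the choice of this discriminator is precisely what determines which candidate solution is kept.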
Typically, most PS classifiers employ accuracy, or the error rate (1 - accuracy), to discriminate and select the best (optimal) solution. However, using accuracy as the benchmark measurement has a number of limitations. The studies in [30, 37] demonstrated that the simplicity of accuracy can lead to suboptimal solutions, especially when dealing with imbalanced class distributions. Furthermore, accuracy also produces poorly discriminating values for selecting a better solution when building an optimized classifier [17].
The purpose of this paper is to review and analyse the evaluation metrics that have been specifically designed for optimizing PS classifiers. The paper begins with a review of common threshold-type metrics and of other metrics that are specifically used as discriminators for selecting the optimal solution for PS classifiers, together with their limitations in this role. The paper then recommends several important aspects for constructing a new discriminator metric for PS classifiers, and ends with conclusions.
2. REVIEW OF DISCRIMINATOR METRICS
In a typical data classification problem, evaluation metrics are employed at two stages: the training stage (learning process) and the testing stage. In the training stage, the evaluation metric is used to optimize the classification algorithm; in other words, it is employed as the discriminator that selects the optimal solution, the one expected to give the most accurate predictions in future evaluations of the classifier. Meanwhile, in the testing stage, the evaluation metric is used as an evaluator to measure the effectiveness of the produced classifier when tested with unseen data.

As mentioned earlier, the interest of this paper is the use of evaluation metrics for discriminating and selecting the optimal solution in order to build optimized PS classifiers. Previous studies have proposed various evaluation metrics for assessing the quality of classifiers with different aims. In contrast, little effort has been dedicated to studying and constructing metrics that are specifically designed to discriminate the optimal solution during classification training, especially for Prototype Selection classifiers. On top of that, most previous studies focused on binary classification problems as their main study domain [8]. Therefore, given these limited resources, this paper attempts to give the best possible review of the related studies, as discussed below.
2.1. Threshold Types of Discriminator Metrics
For binary classification problems, the discrimination of the best (optimal) solution during classification training can be defined based on the confusion matrix shown in Table 1. The rows of the table represent the predicted class, while the columns represent the actual class. In this confusion matrix, tp and tn denote the numbers of positive and negative instances that are correctly classified, while fp and fn denote the numbers of misclassified negative and positive instances, respectively. From Table 1, several commonly used metrics can be derived, as shown in Table 2, each evaluating the performance of a classifier with a different focus. To handle multiclass problems, a few of the metrics listed in Table 2 have been extended for multi-class classification evaluation (see the averaged metrics at the end of the table).
Table 1. Confusion Matrix for Binary Classification and the Corresponding Array Representation used
in this Study
                          Actual Positive Class   Actual Negative Class
Predicted Positive Class  True positive (tp)      False positive (fp)
Predicted Negative Class  False negative (fn)     True negative (tn)
As shown in previous studies [3, 11, 16, 30], accuracy is the evaluation metric most used in practice, for both binary and multi-class classification problems. With accuracy, the quality of a produced solution is evaluated as the percentage of correct predictions over the total number of instances. Its complement is the error rate, which evaluates the produced solution by its percentage of incorrect predictions. Both metrics are commonly used by researchers to discriminate and select the optimal solution.
The advantages of accuracy (or error rate) are that it is easy to compute, has low complexity, is applicable to multi-class and multi-label problems, gives an easy-to-use score, and is easy for humans to understand. However, as pointed out by many studies, the accuracy metric has limitations in the evaluation and discrimination processes. One of its main limitations is that it produces less distinctive and less discriminable values [18, 29]; consequently, accuracy has less discriminating power for selecting and determining the optimal classifier. In addition, accuracy is weak in terms of informativeness [25, 36] and is biased against minority class instances [3, 9, 16, 30, 37].
Table 2. Threshold Metrics for Classification Evaluations

Metric | Formula | Evaluation Focus
Accuracy (acc) | (tp + tn) / (tp + fp + tn + fn) | Measures the ratio of correct predictions over the total number of instances evaluated.
Error Rate (err) | (fp + fn) / (tp + fp + tn + fn) | Measures the ratio of incorrect predictions over the total number of instances evaluated.
Sensitivity (sn) | tp / (tp + fn) | Measures the fraction of positive patterns that are correctly classified.
Specificity (sp) | tn / (tn + fp) | Measures the fraction of negative patterns that are correctly classified.
Precision (p) | tp / (tp + fp) | Measures the fraction of patterns predicted as positive that are truly positive.
Recall (r) | tp / (tp + fn) | Measures the fraction of positive patterns that are correctly classified (equivalent to sensitivity).
F-Measure (FM) | (2 * p * r) / (p + r) | The harmonic mean of precision and recall.
Geometric-mean (GM) | sqrt(sn * sp) | Maximizes the tp rate and the tn rate while keeping both rates relatively balanced.
Averaged Accuracy | (1/l) * Σ_{i=1}^{l} (tp_i + tn_i) / (tp_i + fn_i + fp_i + tn_i) | The average effectiveness over all classes.
Averaged Error Rate | (1/l) * Σ_{i=1}^{l} (fp_i + fn_i) / (tp_i + fn_i + fp_i + tn_i) | The average error rate over all classes.
Averaged Precision (p_M) | (1/l) * Σ_{i=1}^{l} tp_i / (tp_i + fp_i) | The average of the per-class precisions.
Averaged Recall (r_M) | (1/l) * Σ_{i=1}^{l} tp_i / (tp_i + fn_i) | The average of the per-class recalls.
Averaged F-Measure | (2 * p_M * r_M) / (p_M + r_M) | The F-measure computed from the macro-averaged precision and recall.

Note: l - number of classes; C_i - the i-th class of data; tp_i - true positives for C_i; fp_i - false positives for C_i; fn_i - false negatives for C_i; tn_i - true negatives for C_i; the subscript M denotes macro-averaging.
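For illustration, the binary metrics in Table 2 can be computed directly from the four confusion-matrix counts. This is a minimal sketch (not code from the paper); the function names and the guarded division for empty denominators are assumptions.

```python
from math import sqrt

def _ratio(num, den):
    """Guarded division: returns 0.0 when the denominator is zero."""
    return num / den if den else 0.0

def threshold_metrics(tp, fp, tn, fn):
    """Binary threshold metrics of Table 2, computed from confusion-matrix counts."""
    total = tp + fp + tn + fn
    sn = _ratio(tp, tp + fn)          # sensitivity / recall
    sp = _ratio(tn, tn + fp)          # specificity
    p = _ratio(tp, tp + fp)           # precision
    return {
        "acc": _ratio(tp + tn, total),
        "err": _ratio(fp + fn, total),
        "sn": sn, "sp": sp, "p": p, "r": sn,
        "FM": _ratio(2 * p * sn, p + sn),   # harmonic mean of precision and recall
        "GM": sqrt(sn * sp),                # geometric mean of the tp and tn rates
    }

# Example: 5 positives and 95 negatives, with one positive misclassified
print(threshold_metrics(tp=4, fp=0, tn=95, fn=1))
```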
Besides accuracy, the FM and GM have also been reported as good discriminators, performing better than accuracy in optimizing classifiers for binary classification problems [20]. To the best of our knowledge, however, no previous work has employed the FM or GM to discriminate and select the optimal solution for multiclass classification problems.
In contrast, the remaining metrics in Table 2 are unsuitable for discriminating and selecting the optimal solution because each evaluates only a single aspect (either the positive or the negative class). For discriminating and selecting the optimal solution during classification training, a meaningful trade-off between classes is essential to ensure that every class is represented by its own prototype(s). This trade-off becomes even more crucial when imbalanced class data are used: the selected solution turns out to be useless if none of the minority class instances can be correctly predicted by the chosen prototypes, or if none are selected as prototypes in the first place (e.g. when a random prototype selection method is used) during classification training.
2.2. Mean Square Error (MSE)
Supervised Learning Vector Quantization (LVQ) [21] is one of the Prototype Selection classifiers. During the learning process, supervised LVQ uses the MSE to evaluate its performance during classification training. In general, the MSE measures the difference between the predicted and desired outputs; a smaller MSE value indicates a better-trained supervised LVQ. The MSE is defined as
MSE = (1/n) * Σ_{j=1}^{n} (P_j - A_j)^2    (1)
where P_j is the predicted value of instance j, A_j is the real target value of instance j, and n is the total number of instances. Throughout the learning process of LVQ, the solution with the minimum MSE score is used as the final model (best solution).
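A minimal sketch of equation (1), assuming the predicted and target values are available as parallel sequences:

```python
def mean_square_error(predicted, actual):
    """Equation (1): average squared difference between predicted and target values."""
    n = len(predicted)
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n

# During LVQ-style training, the candidate solution with the smallest MSE would be kept.
print(mean_square_error([1, 0, 1, 1], [1, 1, 1, 0]))  # 0.5
```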
Similar to accuracy, the main limitation of the MSE is that it does not provide trade-off information between classes, which may lead the discrimination process to select a sub-optimal solution. Moreover, this metric is highly dependent on the weight initialization process. In extremely imbalanced class problems, if the initial weights are not properly selected (e.g. no initial weight represents the minority class data), the discrimination process may end up with a sub-optimal solution that lacks information about the minority class, even though the MSE value is minimized (under-fitting or over-fitting).
2.3. Area under the ROC Curve (AUC)
AUC is one of the popular ranking-type metrics. In [13, 17, 31] the AUC was used to construct optimized learning models, and it has also been used for comparing learning algorithms [28, 29]. Unlike threshold and probability metrics, the AUC value reflects the overall ranking performance of a classifier. For a two-class problem [13], the AUC value can be calculated as
AUC = (S_p - n_p(n_p + 1)/2) / (n_p * n_n)    (2)
where S_p is the sum of the ranks of the positive examples, while n_p and n_n denote the numbers of positive and negative examples, respectively. The AUC has been shown, theoretically and empirically, to be better than the accuracy metric [17] both for evaluating classifier performance and for discriminating an optimal solution during classification training.

Although the AUC performs excellently in evaluation and discrimination, its computational cost is high, especially when discriminating a large volume of generated solutions for multiclass problems. Computing the AUC for multiclass problems has time complexity O(|C| n log n) for the Provost and Domingos AUC model [28] and O(|C|^2 n log n) for the Hand and Till AUC model [13].
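Equation (2) can be evaluated from the ranks of the scored examples. The following is a minimal sketch for the two-class case; handling of tied scores (e.g. by averaging ranks) is omitted for brevity.

```python
def auc_two_class(scores, labels):
    """Equation (2): AUC from the rank-sum of the positive examples.
    scores: classifier scores (higher means more positive); labels: 1 = positive, 0 = negative."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = {idx: r + 1 for r, idx in enumerate(order)}          # ranks start at 1
    n_p = sum(labels)
    n_n = len(labels) - n_p
    s_p = sum(ranks[i] for i, lab in enumerate(labels) if lab == 1)
    return (s_p - n_p * (n_p + 1) / 2) / (n_p * n_n)

print(auc_two_class([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # 1.0 (perfect ranking)
```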
2.4. Hybrid Discriminator Metrics
Optimized Precision (OP) is a hybrid threshold metric that has been proposed as a discriminator for building an optimized heuristic classifier [30]. This metric is a combination of the accuracy, sensitivity and specificity metrics; sensitivity and specificity are used to stabilize and optimize the accuracy when dealing with imbalanced two-class problems. The OP metric is defined as
OP = acc - |sp - sn| / (sp + sn)    (3)
where acc is the accuracy score, while sp and sn denote the specificity and sensitivity scores, respectively. In [30], the OP metric was able to discriminate and select better solutions and to increase the classification performance of ensemble learners and Multi-Classifier Systems on a human DNA sequence dataset.
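A minimal sketch of equation (3), computing OP directly from confusion-matrix counts; the example values illustrate how a solution that ignores the minority class is penalized even though its accuracy is high.

```python
def optimized_precision(tp, fp, tn, fn):
    """Optimized Precision (OP) of equation (3): accuracy penalized by the
    imbalance between sensitivity and specificity."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    return acc - abs(sp - sn) / (sp + sn)

print(optimized_precision(tp=0, fp=0, tn=95, fn=5))   # 0.95 - 1.00 = -0.05
print(optimized_precision(tp=5, fp=5, tn=90, fn=0))   # 0.95 - 0.03 ≈  0.92
```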
Optimized Accuracy with Recall and Precision (OARP) is another hybrid threshold metric that was specifically designed as a discriminator for training the Monte Carlo Sampling (MCS) classifier during classification training. There are two versions of OARP: Optimized Accuracy with Extended Recall-Precision version 1 (OAERP1) [15] and Optimized Accuracy with Extended Recall-Precision version 2 (OAERP2) [14, 16].

In general, both hybrid metrics combine accuracy with extended recall (er) and extended precision (ep). The difference between them is their Relationship Index (RI): in OAERP1, the RI is calculated based on the correlation from [34], while OAERP2 adopts the correlation given by [24]. OAERP1 is formulated as follows:
OAERP1 = acc - RI_1    (4)

where

RI_1 = (|ep_1 + ep_2| - |er_1 + er_2|) / (|ep_1 + ep_2| + |er_1 + er_2|)    (5)

Meanwhile, OAERP2 is formulated as follows:

OAERP2 = acc - RI_2    (6)

where

RI_2 = ( |(ep_1 - er_2) / (ep_1 + er_2)| - |(ep_2 - er_1) / (ep_2 + er_1)| ) / 2    (7)
In both RI formulas, ep and er represent the extended precision and extended recall, respectively, and the subscripts 1 and 2 denote class 1 (the positive class) and class 2 (the negative class).
As shown in [14, 15, 16], the RI value of both metrics can be zero, in which case the OAERP1 and OAERP2 scores reduce to the accuracy score. The OAERP1 and OAERP2 scores can also be negative; to avoid negative scores, the RI value is rescaled using a decimal scaling method. Both metrics have demonstrated better performance than accuracy in terms of the distinctiveness and discriminability of the produced values. However, OAERP2 shows better discriminating power in choosing an optimal solution and is able to build a better-trained MCS classifier [14, 16].
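A minimal sketch of equations (4)-(7). The extended precision (ep) and extended recall (er) values are taken as inputs here, since their definitions come from the cited works [14, 15, 16], and the decimal-scaling step used to avoid negative scores is omitted.

```python
def oaerp1(acc, ep1, ep2, er1, er2):
    """OAERP1 of equations (4)-(5): accuracy adjusted by the relationship index RI1."""
    ri1 = (abs(ep1 + ep2) - abs(er1 + er2)) / (abs(ep1 + ep2) + abs(er1 + er2))
    return acc - ri1

def oaerp2(acc, ep1, ep2, er1, er2):
    """OAERP2 of equations (6)-(7): accuracy adjusted by the relationship index RI2."""
    ri2 = (abs((ep1 - er2) / (ep1 + er2)) - abs((ep2 - er1) / (ep2 + er1))) / 2
    return acc - ri2
```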
The main limitation of these three hybrid metrics is that they are restricted to discriminating and evaluating binary classification problems. Real-world data are not limited to two-class problems; many datasets comprise more than two classes. To the best of our knowledge, no previous work has modified these metrics for evaluating multiclass data, so their effectiveness for multiclass classification problems remains an open question.
2.5. Other Metrics
Apart from the metrics discussed above, graphical-based metrics that improve on accuracy have been proposed for evaluating classifier performance. As reported in [27], these metrics are able to depict the trade-offs between different evaluation perspectives, allowing a richer analysis of results. However, although they are better than accuracy or the error rate, their graphical output prevents metrics such as the Receiver Operating Characteristic (ROC) curve [4], the Bayesian Receiver Operating Characteristic (B-ROC) [1], the precision-recall curve [5], cost curves [6], and lift charts and calibration plots [35] from being employed as discriminators.

Besides these, a few other metrics have been specifically designed for particular classification algorithms. Information gain and entropy are two probability-type metrics that are used to evaluate the utility of data attributes when building optimized decision tree classifiers [26]. Because of this specific purpose, these metrics are unsuitable for discriminating and selecting the optimal solution. To the best of our knowledge, no previous work has exploited them to discriminate a bulk of generated solutions during the classification training of PS classifiers.
3. IMPORTANT FACTORS IN CONSTRUCTING NEW METRICS
Through this review, this paper has identified several factors that may help researchers in designing and constructing a new metric, or in choosing a suitable metric, for discriminating the optimal solution of PS classification algorithms. These important factors are briefly described below.
1. Issue on multiclass problem
Many current metrics were originally developed for, and are applicable to, binary classification problems with different evaluation tasks. This is the major limitation that prevents many good metrics from being widely used as discriminators of the optimal solution. In reality, the available data are not limited to two-class problems; most data involve more than two classes. For example, student grades can be categorized into A, B, C, D, E and F. Therefore, the future development of a new metric, or the choice of a suitable metric, should take this issue into consideration.
2. Less complexity and less computational cost
Since data nowadays often involve many classes, the use of a particular metric becomes more complex as the number of classes to be evaluated increases. As a consequence, it incurs a high computational cost and affects the speed of classification training. For this reason, most researchers simply apply accuracy or the error rate to discriminate their produced solutions. Although accuracy and the error rate are less complex and easy-to-use scores, the PS classifiers still require a long learning or training process. Therefore, the biggest challenge in developing a new metric is to design and construct a metric with low complexity and low computational cost that is still comprehensive enough to discriminate an optimal solution from a bulk of generated solutions.
3. Distinctiveness and discriminability
The low distinctiveness and discriminability of the produced values is another drawback of the accuracy metric [15]. As a result, the discrimination and search process for an optimal solution can easily become trapped at local optima (plateaus). Any discriminator must avoid this drawback. In other words, future metrics must be able to produce distinctive and discriminable values for better searching and discrimination of the optimal solution in a huge solution space. The details of this effect are discussed with Table 3 and Table 5.
4. Informativeness
Another drawback of many current metrics is that they provide no trade-off information between classes [24, 35]. For example, accuracy, the most popular metric, cannot discriminate between good and bad (informative and non-informative) solutions, especially when two or more solutions have equivalent scores or when the scores contradict the intuitive ranking, as shown in Table 3 and Table 4, respectively.

In Table 3, the accuracy metric cannot distinguish which solution is better because the produced values are neither distinctive nor discriminable. Intuitively, solution a2 is better than a1 since a2 correctly predicts all minority class members, whereas a1 predicts none of them correctly and is therefore a poor solution. In Table 4, on the other hand, the accuracy metric concludes that solution a1 is better than a2 by score comparison. However, a1 is a poor solution in which none of the minority class members is correctly predicted; intuitively, solution a2 gives the better result, even though its score is lower, because it correctly predicts all minority class members.

Both examples show that informativeness is an essential feature of any metric for discriminating the informative and optimal solution, as the short sketch after Table 4 illustrates numerically.
Table 3. Informativeness Analysis for Binary Classification Problem using
Imbalanced Class Distribution (5:95) with Two Equivalent Solutions
sol tp fp tn fn correct accuracy
a1 0 0 95 5 95 0.9500
a2 5 5 90 0 95 0.9500
Table 4. Informativeness Analysis for Binary Classification Problem using
Imbalanced Class Distribution (5:95) with Two Contradictory Solutions
sol tp fp tn fn correct accuracy
a1 0 0 95 5 95 0.9500
a2 5 6 89 0 94 0.9400
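The effect in Tables 3 and 4 can be reproduced numerically: accuracy ties (Table 3) or prefers the weaker solution (Table 4), whereas a class-sensitive metric such as the F-measure separates the solutions. A minimal sketch using the counts from the two tables:

```python
def acc_and_fm(tp, fp, tn, fn):
    """Accuracy and F-measure for one candidate solution."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    fm = 2 * p * r / (p + r) if (p + r) else 0.0
    return acc, fm

# Table 3: accuracy ties at 0.95, but FM exposes that a1 predicts no minority instance correctly.
print(acc_and_fm(tp=0, fp=0, tn=95, fn=5))   # (0.95, 0.00)
print(acc_and_fm(tp=5, fp=5, tn=90, fn=0))   # (0.95, 0.67)

# Table 4: accuracy prefers a1 (0.95 > 0.94) although a1 misses the whole minority class.
print(acc_and_fm(tp=0, fp=0, tn=95, fn=5))   # (0.95, 0.00)
print(acc_and_fm(tp=5, fp=6, tn=89, fn=0))   # (0.94, 0.63)
```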
5. Favour towards the minority class
According to [11, 12, 19], the popular accuracy metric is strongly affected by the proportion of the majority class and gives little weight to the minority class. Hence, it is important to employ an evaluation metric that favours the minority class over the majority class; in other words, the more minority class instances are correctly predicted, the better the solution, especially for extremely imbalanced class problems. For the example in Table 5, any good metric or discriminator should rank the solutions as a6 → a5 → a4 → a3 → a2 → a1. Intuitively, solution a6 is the most informative solution and the one most in favour of the minority class, while a1 is the poorest solution since it predicts no minority class member correctly. Based on the accuracy scores, none of these solutions can be ranked as suggested because all of them obtain the same score. Furthermore, Table 5 also shows that the accuracy metric produces less distinctive and less discriminable scores, which can cause a search for an optimal solution guided by accuracy to become easily trapped at local optima (see the sketch after Table 5).
Table 5. Favors towards Minority Class Analysis for Binary Classification Problem
using Extremely Imbalanced Class Distribution (5:9995)
sol tp fp tn fn correct
a1 0 0 9995 5 9995
a2 1 1 9994 4 9995
a3 2 2 9993 3 9995
a4 3 3 9992 2 9995
a5 4 4 9991 1 9995
a6 5 5 9990 0 9995
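Using the counts in Table 5, accuracy gives all six solutions the same score (0.9995) and therefore cannot produce the suggested ranking, whereas a minority-sensitive score such as the geometric mean of sensitivity and specificity ranks a6 first and a1 last. A minimal sketch:

```python
from math import sqrt

solutions = {                                  # Table 5 counts: (tp, fp, tn, fn)
    "a1": (0, 0, 9995, 5), "a2": (1, 1, 9994, 4), "a3": (2, 2, 9993, 3),
    "a4": (3, 3, 9992, 2), "a5": (4, 4, 9991, 1), "a6": (5, 5, 9990, 0),
}

for name, (tp, fp, tn, fn) in solutions.items():
    acc = (tp + tn) / (tp + fp + tn + fn)
    sn = tp / (tp + fn) if (tp + fn) else 0.0
    sp = tn / (tn + fp) if (tn + fp) else 0.0
    print(name, round(acc, 4), round(sqrt(sn * sp), 4))   # accuracy is 0.9995 for every solution
```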
4. CONCLUSIONS
The selection of suitable metric for discriminating the optimal solution in order to obtain an
optimized classifier is a crucial step. The proper selection of metric will ensure that the
classification training of generative type classifier is optimal. This paper hopes that with the
reviews of some metrics for discriminating the optimal solution will sensitize the data mining
researchers to the issue and encourage the researchers to think carefully, prior to select and apply
the suitable metric for optimizing the classification training. Besides, this paper also suggested
several important aspects in constructing a better metric for discriminating the optimal solution
for generative type of classification algorithms.
ACKNOWLEDGEMENTS
This work has been funded by the Ministry of Education, Malaysia under the SLAI scholarship. We also thank Universiti Malaysia Sarawak and the reviewers for their support and helpful comments.
REFERENCES
[1] A.A. Cardenas and J.S. Baras, “B-ROC curves for the assessment of classifiers over imbalanced data
sets”, in Proc. of the 21st National Conference on Artificial Intelligence Vol. 2, 2006, pp. 1581-1584
[2] R. Caruana and A. Niculescu-Mizil, “Data mining in metric space: an empirical analysis of supervised
learning performance criteria”, in Proc. of the 10th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining (KDD '04), New York, NY, USA, ACM 2004, pp. 69-78.
[3] N.V. Chawla, N. Japkowicz and A. Kolcz, “Editorial: Special issue on learning from imbalanced data
sets”, SIGKDD Explorations, 6 (2004) 1-6.
[4] T. Fawcett, “An Introduction to ROC Analysis”, Pattern Recognition Letters, 27 (2006) 861-874.
[5] J. Davis and M. Goadrich, “The relationship between precision-recall and ROC curves”, in Proc. of
the 23rd International Conference on Machine Learning, 2006, pp. 233-240.
[6] C. Drummond, R.C. Holte, “Cost curves: An Improved method for visualizing classifier
performance”, Mach. Learn. 65 (2006) 95-130.
[7] P.A. Flach, “The Geometry of ROC Space: understanding Machine Learning Metrics through
ROC Isometrics”, in T. Fawcett and N. Mishra (Eds.) Proc. of the 20th Int. Conference on Machine
Learning (ICML 2003), Washington, DC, USA, AAAI Press, 2003, pp. 194-201.
[8] V. Garcia, R.A. Mollineda and J.S. Sanchez, “A bias correction function for classification
performance assessment in two-class imbalanced problems”, Knowledge-Based Systems, 59(2014)
66-74.
[9] S. Garcia and F. Herrera, “Evolutionary training set selection to optimize C4.5 in imbalance
problems”, in Proc. of 8th Int. Conference on Hybrid Intelligent Systems (HIS 2008), Washington,
DC, USA, IEEE Computer Society, 2008, pp.567-572.
[10] N. Garcia-Pedrajas, J. A. Romero del Castillo and D. Ortiz-Boyer, “A cooperative coevolutionary
algorithm for instance selection for instance-based learning”. Machine Learning (2010), 78 (2010)
381-420.
[11] Q. Gu, L. Zhu and Z. Cai, “Evaluation Measures of the Classification Performance of Imbalanced
Datasets”, in Z. Cai et al. (Eds.) ISICA 2009, CCIS 51. Berlin, Heidelberg: Springer-Verlag, 2009,
pp. 461-471.
[12] S. Han, B. Yuan and W. Liu, “Rare Class Mining: Progress and Prospect”, in Proc. of Chinese
Conference on Pattern Recognition (CCPR 2009), 2009, pp. 1-5
[13] D. J. Hand and R. J. Till, “A simple generalization of the area under the ROC curve to multiple class
classification problems”, Machine Learning, 45 (2001) 171-186.
[14] M. Hossin, M. N. Sulaiman, A. Mustapha, and N. Mustapha, “A Novel Performance Metric for
Building an Optimized Classifier”, Journal of Computer Science, 7(4) (2011) 582-509.
[15] M. Hossin, M. N. Sulaiman, A. Mustapha, N. Mustapha and R. W. Rahmat, “ OAERP: a Better
Measure than Accuracy in Discriminating a Better Solution for Stochastic Classification Training”,
Journal of Artificial Intelligence, 4(3) (2011) 187-196.
[16] M. Hossin, M. N. Sulaiman, A. Mustapha, N. Mustapha and R. W. Rahmat, “A Hybrid Evaluation
Metric for Optimizing Classifier”, in Data Mining and Optimization (DMO), 2011 3rd Conference on,
2011, pp. 165-170.
[17] J. Huang and C. X. Ling, “Using AUC and accuracy in evaluating learning algorithms”, IEEE
Transactions on Knowledge Data Engineering, 17 (2005) 299-310.
[18] J. Huang and C. X. Ling, “Constructing new and better evaluation measures for machine learning”, in
R. Sangal, H. Mehta and R. K. Bagga (Eds.) Proc. of the 20th International Joint Conference on
Artificial Intelligence (IJCAI 2007), San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.,
2007, pp.859-864.
[19] N. Japkowicz, “Assessment metrics for imbalanced learning”, in Imbalanced Learning: Foundations,
Algorithms, and Applications, Wiley IEEE Press, 2013, pp. 187-210
[20] M. V. Joshi, “On evaluating performance of classifiers for rare classes”, in Proceedings of the 2002
IEEE Int. Conference on Data Mining (ICDM 2002), Washington, D. C., USA: IEEE
Computer Society, 2002, pp. 641-644.
[21] T. Kohonen, Self-Organizing Maps, 3rd ed., Berlin Heidelberg: Springer-Verlag, 2001.
[22] L. I. Kuncheva and J. C. Bezdek, “Nearest Prototype Classification: Clustering, Genetic Algorithms,
or Random Search?” IEEE Transactions on Systems, Man, and Cybernetics-Part C: Application and
Reviews, 28(1) (1998) 160-164.
[23] N. Lavesson, and P. Davidsson, “Generic Methods for Multi-Criteria Evaluation”, in Proc. of the
Siam Int. Conference on Data Mining, Atlanta, Georgia, USA: SIAM Press, 2008, pp. 541-546.
[24] P. Lingras, and C. J. Butz, “Precision and recall in rough support vector machines”, in Proc. of the
2007 IEEE Int. Conference on Granular Computing (GRC 2007), Washington, DC, USA: IEEE
Computer Society, 2007, pp.654-654.
[25] D. J. C. MacKay, Information, Theory, Inference and Learning Algorithms. Cambridge, UK:
Cambridge University Press, 2003.
[26] T. M. Mitchell, Machine Learning, USA: McGraw-Hill, 1997.
[27] R. Prati, G. Batista, and M. Monard, “A survey on graphical methods for classification predictive
performance evaluation”, IEEE Trans. Knowl. Data Eng. 23(2011) 1601-1618.
[28] F. Provost, and P. Domingos, “Tree induction for probability-based ranking”. Machine Learning, 52
(2003) 199-215.
[29] A. Rakotomamonjy, “Optimizing area under ROC with SVMs”, in J. Hernandez-Orallo, C. Ferri, N.
Lachiche and P. A. Flach (Eds.) 1st Int. Workshop on ROC Analysis in Artificial Intelligence
(ROCAI 2004), Valencia, Spain, 2004, pp. 71-80.
[30] R. Ranawana, and V. Palade, “Optimized precision-A new measure for classifier performance
evaluation”, in Proc. of the IEEE World Congress on Evolutionary Computation (CEC 2006), 2006,
pp. 2254-2261.
[31] S. Rosset, “Model selection via AUC”, in C. E. Brodley (Ed.) Proc. of the 21st Int. Conference on
Machine Learning (ICML 2004), New York, NY, USA: ACM, 2004, pp. 89.
[32] D.B. Skalak, “Prototype and feature selection by sampling and random mutation hill climbing
algorithm”, in W. W. Cohen and H. Hirsh (Eds.) Proc. of the 11th Int. Conference on Machine
Learning (ICML 1994), New Brunswick, NJ, USA: Morgan Kaufmann, 1994, pp.293-301.
[33] M. Sokolova and G. Lapalme, “A systematic analysis of performance measures for classification
tasks”, Information Processing and Management, 45(2009) 427-437.
[34] P. N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Boston, USA: Pearson Addison
Wesley, 2006.
[35] M. Vuk and T. Curk, “ROC curve, lift chart and calibration plot”, Metodološki zvezki, 3(1) (2006)
89-108.
[36] H. Wallach, “Evaluation metrics for hard classifiers”. Technical Report. (Ed.: Wallach, 2006)
http://www.inference.phy.cam.ac.uk/hmw26/papers
[37] S. W. Wilson, “Mining oblique data with XCS”, in P. L. Lanzi, W. Stolzmann and S. W. Wilson
(Eds.) Advances in Learning Classifier Systems: Third Int. Workshop (IWLCS 2000), Berlin,
Heidelberg: Springer-Verlag, 2001, pp. 283-290.
[38] H. Zhang and G. Sun, “Optimal reference subset selection for nearest neighbor classification by tabu
search”, Pattern Recognition, 35(7) (2002) 1481-1490.
AUTHORS
Mohammad b. Hossin is a senior lecturer at Universiti Malaysia Sarawak. He received his B.IT (Hons) from Universiti Utara Malaysia (UUM) in 2000 and his M.Sc. in Artificial Intelligence from the University of Essex, UK, in 2003. In 2012, he received his Ph.D. in Intelligent Computing from Universiti Putra Malaysia (UPM). His main research interests include data mining, decision support systems, optimization using nature-inspired algorithms and e-learning.
Md Nasir Sulaiman received his Bachelor of Science with Education, majoring in Mathematics, from Universiti Pertanian Malaysia in 1983. He received a Master in Computing from the University of Bradford, U.K., in 1986 and a PhD in Computer Science from Loughborough University, U.K., in 1994. He has been a lecturer at Universiti Putra Malaysia since 1986 and was promoted to Associate Professor in 2002. His research interests are intelligent computing and smart homes.
More Related Content

PDF
Brm unit iv - cheet sheet
PPTX
Factor analysis
PPTX
Mba2216 week 11 data analysis part 03 appendix
PDF
Classification and decision tree classifier machine learning
PPT
Abdm4064 week 11 data analysis
PPTX
RESEARCH METHODOLOGY- PROCESSING OF DATA
PPTX
Mba2216 week 11 data analysis part 01
PDF
PRIORITIZING THE BANKING SERVICE QUALITY OF DIFFERENT BRANCHES USING FACTOR A...
Brm unit iv - cheet sheet
Factor analysis
Mba2216 week 11 data analysis part 03 appendix
Classification and decision tree classifier machine learning
Abdm4064 week 11 data analysis
RESEARCH METHODOLOGY- PROCESSING OF DATA
Mba2216 week 11 data analysis part 01
PRIORITIZING THE BANKING SERVICE QUALITY OF DIFFERENT BRANCHES USING FACTOR A...

What's hot (20)

PPSX
A Beginner’s Guide to Factor Analysis: Focusing on Exploratory Factor Analysis
PDF
30 14 jun17 3may 7620 7789-1-sm(edit)new
PPTX
Statistics design
PPTX
Research methodology
PPT
Multivariate Analysis Techniques
DOCX
Data Reduction and Classification for Lumosity Data
DOCX
Quantitative data analysis
PPT
Research Methodology: Questionnaire, Sampling, Data Preparation
PPTX
Analysis of data (pratik)
PPTX
Confirmatory factor analysis (cfa)
PPTX
Statistical analysis using spss
PPT
Factor analysis
PDF
Exploratory Factor Analysis Assignment 2013-2014(1)
ODP
Exploratory factor analysis
PPTX
Marketing Experimentation - Part I
PPTX
Sampling Techniques, Data Collection and tabulation in the field of Social Sc...
PDF
BTech Pattern Recognition Notes
PPTX
Statistical tools in research
PPTX
Multivariate
PPT
Quantitative analysis using SPSS
A Beginner’s Guide to Factor Analysis: Focusing on Exploratory Factor Analysis
30 14 jun17 3may 7620 7789-1-sm(edit)new
Statistics design
Research methodology
Multivariate Analysis Techniques
Data Reduction and Classification for Lumosity Data
Quantitative data analysis
Research Methodology: Questionnaire, Sampling, Data Preparation
Analysis of data (pratik)
Confirmatory factor analysis (cfa)
Statistical analysis using spss
Factor analysis
Exploratory Factor Analysis Assignment 2013-2014(1)
Exploratory factor analysis
Marketing Experimentation - Part I
Sampling Techniques, Data Collection and tabulation in the field of Social Sc...
BTech Pattern Recognition Notes
Statistical tools in research
Multivariate
Quantitative analysis using SPSS
Ad

Viewers also liked (20)

PDF
Content based indexing of music
PDF
Evaluation of rule extraction algorithms
PDF
A comparative study on term weighting methods for automated telugu text categ...
PDF
Incremental learning from unbalanced data with concept class, concept drift a...
PDF
A statistical model for gist generation a case study on hindi news article
PDF
A fuzzy logic based on sentiment
PDF
Literature review of attribute level and
PDF
A NEAR-DUPLICATE DETECTION ALGORITHM TO FACILITATE DOCUMENT CLUSTERING
PDF
THE EVALUATION OF FACTORS INFLUENCING SAFETY PERFORMANCE: A CASE IN AN INDUST...
PDF
Enhancing the labelling technique of
PDF
Enhancement techniques for data warehouse staging area
PDF
Effective data mining for proper
PDF
Comparison between riss and dcharm for mining gene expression data
PDF
Application of data mining tools for
PDF
Recommendation system using bloom filter in mapreduce
PDF
EFFECTIVE ARABIC STEMMER BASED HYBRID APPROACH FOR ARABIC TEXT CATEGORIZATION
PDF
INTEGRATED ASSOCIATIVE CLASSIFICATION AND NEURAL NETWORK MODEL ENHANCED BY US...
PDF
Relative parameter quantification in data
PDF
WEB-BASED DATA MINING TOOLS : PERFORMING FEEDBACK ANALYSIS AND ASSOCIATION RU...
PDF
A novel algorithm for mining closed sequential patterns
Content based indexing of music
Evaluation of rule extraction algorithms
A comparative study on term weighting methods for automated telugu text categ...
Incremental learning from unbalanced data with concept class, concept drift a...
A statistical model for gist generation a case study on hindi news article
A fuzzy logic based on sentiment
Literature review of attribute level and
A NEAR-DUPLICATE DETECTION ALGORITHM TO FACILITATE DOCUMENT CLUSTERING
THE EVALUATION OF FACTORS INFLUENCING SAFETY PERFORMANCE: A CASE IN AN INDUST...
Enhancing the labelling technique of
Enhancement techniques for data warehouse staging area
Effective data mining for proper
Comparison between riss and dcharm for mining gene expression data
Application of data mining tools for
Recommendation system using bloom filter in mapreduce
EFFECTIVE ARABIC STEMMER BASED HYBRID APPROACH FOR ARABIC TEXT CATEGORIZATION
INTEGRATED ASSOCIATIVE CLASSIFICATION AND NEURAL NETWORK MODEL ENHANCED BY US...
Relative parameter quantification in data
WEB-BASED DATA MINING TOOLS : PERFORMING FEEDBACK ANALYSIS AND ASSOCIATION RU...
A novel algorithm for mining closed sequential patterns
Ad

Similar to A review on evaluation metrics for (20)

PDF
A NOVEL PERFORMANCE MEASURE FOR MACHINE LEARNING CLASSIFICATION
PDF
A Novel Performance Measure for Machine Learning Classification
PDF
A Novel Performance Measure For Machine Learning Classification
PDF
A novel hybrid feature selection approach
PPTX
Systematic review and meta analysis
PPTX
Data Extraction & Synthesis by zahida kusor
PDF
Software Cost Estimation Using Clustering and Ranking Scheme
PPTX
2. Data_analysis_and_Research_Methodology.pptx
DOC
Research proposal
PDF
Multi Criteria Decision Making Methodology on Selection of a Student for All ...
PDF
Classification By Clustering Based On Adjusted Cluster
PDF
International Journal of Fuzzy Logic Systems (IJFLS)
PDF
OWA BASED MAGDM TECHNIQUE IN EVALUATING DIAGNOSTIC LABORATORY UNDER FUZZY ENV...
PDF
International Journal of Fuzzy Logic Systems (IJFLS)
PDF
Feature selection for classification
PDF
Impact of Perceived Fairness on Performance Appraisal System for Academic Sta...
PDF
Manuscript dss
PDF
Data Science Interview Questions PDF By ScholarHat
PDF
Discriminate analysis
A NOVEL PERFORMANCE MEASURE FOR MACHINE LEARNING CLASSIFICATION
A Novel Performance Measure for Machine Learning Classification
A Novel Performance Measure For Machine Learning Classification
A novel hybrid feature selection approach
Systematic review and meta analysis
Data Extraction & Synthesis by zahida kusor
Software Cost Estimation Using Clustering and Ranking Scheme
2. Data_analysis_and_Research_Methodology.pptx
Research proposal
Multi Criteria Decision Making Methodology on Selection of a Student for All ...
Classification By Clustering Based On Adjusted Cluster
International Journal of Fuzzy Logic Systems (IJFLS)
OWA BASED MAGDM TECHNIQUE IN EVALUATING DIAGNOSTIC LABORATORY UNDER FUZZY ENV...
International Journal of Fuzzy Logic Systems (IJFLS)
Feature selection for classification
Impact of Perceived Fairness on Performance Appraisal System for Academic Sta...
Manuscript dss
Data Science Interview Questions PDF By ScholarHat
Discriminate analysis

Recently uploaded (20)

PPTX
Information Storage and Retrieval Techniques Unit III
PDF
BIO-INSPIRED ARCHITECTURE FOR PARSIMONIOUS CONVERSATIONAL INTELLIGENCE : THE ...
PDF
EXPLORING LEARNING ENGAGEMENT FACTORS INFLUENCING BEHAVIORAL, COGNITIVE, AND ...
PPTX
6ME3A-Unit-II-Sensors and Actuators_Handouts.pptx
PPTX
Fundamentals of Mechanical Engineering.pptx
PPTX
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
PDF
86236642-Electric-Loco-Shed.pdf jfkduklg
PDF
Level 2 – IBM Data and AI Fundamentals (1)_v1.1.PDF
PDF
Visual Aids for Exploratory Data Analysis.pdf
PDF
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
PDF
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
PDF
PREDICTION OF DIABETES FROM ELECTRONIC HEALTH RECORDS
PPTX
introduction to high performance computing
PPTX
Safety Seminar civil to be ensured for safe working.
PDF
A SYSTEMATIC REVIEW OF APPLICATIONS IN FRAUD DETECTION
PDF
COURSE DESCRIPTOR OF SURVEYING R24 SYLLABUS
PPTX
UNIT 4 Total Quality Management .pptx
PDF
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks
PDF
null (2) bgfbg bfgb bfgb fbfg bfbgf b.pdf
PDF
Categorization of Factors Affecting Classification Algorithms Selection
Information Storage and Retrieval Techniques Unit III
BIO-INSPIRED ARCHITECTURE FOR PARSIMONIOUS CONVERSATIONAL INTELLIGENCE : THE ...
EXPLORING LEARNING ENGAGEMENT FACTORS INFLUENCING BEHAVIORAL, COGNITIVE, AND ...
6ME3A-Unit-II-Sensors and Actuators_Handouts.pptx
Fundamentals of Mechanical Engineering.pptx
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
86236642-Electric-Loco-Shed.pdf jfkduklg
Level 2 – IBM Data and AI Fundamentals (1)_v1.1.PDF
Visual Aids for Exploratory Data Analysis.pdf
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
PREDICTION OF DIABETES FROM ELECTRONIC HEALTH RECORDS
introduction to high performance computing
Safety Seminar civil to be ensured for safe working.
A SYSTEMATIC REVIEW OF APPLICATIONS IN FRAUD DETECTION
COURSE DESCRIPTOR OF SURVEYING R24 SYLLABUS
UNIT 4 Total Quality Management .pptx
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks
null (2) bgfbg bfgb bfgb fbfg bfbgf b.pdf
Categorization of Factors Affecting Classification Algorithms Selection

A review on evaluation metrics for

  • 1. International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.5, No.2, March 2015 DOI : 10.5121/ijdkp.2015.5201 1 A REVIEW ON EVALUATION METRICS FOR DATA CLASSIFICATION EVALUATIONS Hossin, M.1 and Sulaiman, M.N.2 1 Faculty of Computer Science & Information Technology, Universiti Malaysia Sarawak, 94300 Kota Samarahan, Sarawak, Malaysia 2 Faculty of Computer Science & Information Technology, Universiti Putra Malaysia, 43400 UPM Serdang, Selangor, Malaysia ABSTRACT Evaluation metric plays a critical role in achieving the optimal classifier during the classification training. Thus, a selection of suitable evaluation metric is an important key for discriminating and obtaining the optimal classifier. This paper systematically reviewed the related evaluation metrics that are specifically designed as a discriminator for optimizing generative classifier. Generally, many generative classifiers employ accuracy as a measure to discriminate the optimal solution during the classification training. However, the accuracy has several weaknesses which are less distinctiveness, less discriminability, less informativeness and bias to majority class data. This paper also briefly discusses other metrics that are specifically designed for discriminating the optimal solution. The shortcomings of these alternative metrics are also discussed. Finally, this paper suggests five important aspects that must be taken into consideration in constructing a new discriminator metric. KEYWORDS Evaluation Metric, Accuracy, Optimized Classifier, Data Classification Evaluation 1. INTRODUCTION In data classification problems, data can be divided into commercial data, texts, DNAs and images. This paper emphasizes on commercial data as the focus of discussion. Furthermore, data classification can be divided into binary, multiclass and multi-labelled classification [33]. In this paper, the study is aimed on binary and multiclass classification which focuses on the evaluation metrics for evaluating the effectiveness of classifiers. In general, the evaluation metric can be described as the measurement tool that measures the performance of classifier. Different metrics evaluate different characteristics of the classifier induced by the classification algorithm. From the literature, the evaluation metric can be categorized into three types, which are threshold, probability and ranking metric [2]. Each of these types of metrics evaluates the classifier with different aims. Furthermore, all these types of metrics are scalar group method where the entire performance is presented using a single score value. Thus, it makes easier to do the comparison and analysis although it could mask subtle details of their behaviours. In practice, the threshold and ranking metric were the most common metrics used by researchers to measure the performance of classifiers. In most cases, these types of metrics can be employed into three different evaluation applications [23]. First, the evaluation metrics were used to evaluate the generalization ability of the trained classifier. In this case, the evaluation metric is used to measure and summarize the quality of trained classifier when tested with the unseen data. Accuracy or error rate is one of the most
  • 2. International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.5, No.2, March 2015 2 common metrics in practice used by many researchers to evaluate the generalization ability of classifiers. Through accuracy, the trained classifier is measured based on total correctness which refers to the total of instances that are correctly predicted by the trained classifier when tested with the unseen data. Second, the evaluation metrics were employed as an evaluator for model selection. In this case, the evaluation metric task is to determine the best classifier among different types of trained classifiers which focus on the best future performance (optimal model) when tested with unseen data. Third, the evaluation metrics were employed as a discriminator to discriminate and select the optimal solution (best solution) among all generated solutions during the classification training. For example, the accuracy metric is employed to discriminate every single solution and select the best solution that produced by a particular classification algorithm. Only the best solution which is believed the optimal model will be tested with the unseen data. For the first and second application of evaluation metrics, almost all types of threshold, probability and ranking metrics could be applied to evaluate the performance and effectiveness of classifiers. Conversely, only few types of metrics could be employed as a discriminator to discriminate and select the optimal solution during the classification training. This paper emphasizes on the third application of evaluation metrics for Prototype Selection (PS) classifiers. Only relevant evaluation metrics which are associated with the latter application of metrics are discussed. In general, PS classifier is a generative type of classification algorithms that aim to generate a classifier model by applying sampling technique and simultaneously used the generated model to achieve the highest possible classification accuracy when dealing with the unseen data. Most of PS classifiers such as Monte Carlo Sampling (MCS) algorithm [32] genetic algorithm [22], evolutionary algorithm [10], and tabu search [38] were developed based on statistics, nature- inspired, optimization methods or combination of these methods. Basically, all of these classification algorithms applied stochastic or heuristic search to locate the optimal solution (a set of prototypes) by transforming the searching and discriminating processes into optimization problem. In other words, these algorithms begin with constructing and searching a fixed number of prototypes (candidate solution). Then, every produced solution will be evaluated in order to determine the optimal solution that best represents the training data and simultaneously aim to achieve better generalization ability when dealing with the unseen data. In order to search and discriminate the optimal solution from the large space of solutions, the selection of proper metric is crucial in discriminating a bulk of generated solutions. Without a proper and suitable evaluation metric, a particular PS classifier may obtain poor generalization ability when tested with the unseen data. Typically, most of the PS classifiers employ the accuracy or the error rate (1-accuracy) to discriminate and to select the best (optimal) solution. However, using the accuracy as a benchmark measurement has a number of limitations. 
In [30, 37], they have demonstrated that the simplicity of this accuracy could lead to the suboptimal solutions especially when dealing with imbalanced class distribution. Furthermore, the accuracy also exhibits poor discriminating values to discriminate better solution in order to build an optimized classifier [17]. The purpose of this paper is to review and analyse all related evaluation metrics that were specifically designed for optimizing the PS classifiers. This paper begins with thorough reviews on commonly threshold type metrics and other metrics that are specifically used as a discriminator for discriminating the optimal solution for PS classifier. This section also discusses the limitations of these metrics as a discriminator in discriminating the optimal solution. This paper also recommends several important aspects in constructing a new discriminator metric for PS classifier. Finally, this paper ends with conclusions.
  • 3. International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.5, No.2, March 2015 3 2. REVIEW OF DISCRIMINATOR METRICS In a typical data classification problem, the evaluation metric has been employed into two stages, which are training stage (learning process) and testing stage. In training stage, the evaluation metric was used to optimize the classification algorithm. In other words, the evaluation metric was employed as the discriminator to discriminate and to select the optimal solution which can produce a more accurate prediction of future evaluation of a particular classifier. Meanwhile, in the testing stage, the evaluation metric was used as the evaluator to measure the effectiveness of produced classifier when tested with the unseen data. As mentioned earlier, the interest of this paper is to review the use of evaluation metrics in discriminating and selecting the optimal solution in order to build optimized PS classifiers. In previous studies, there were various types of evaluation metrics that can be used to evaluate the quality of classifiers with different aims. On the contrary, there were little efforts have been dedicated to study and construct the metrics that are specifically designed to discriminate the optimal solution during the data classification training especially for Prototype Selection classifiers. On top of that, most of the previous studies were focused on binary classification problems as their main study domain [8]. Therefore, due to limited resources, this paper attempts to give the best reviews on the related studies as discussed below. 2.1. Threshold Types of Discriminator Metrics For binary classification problems, the discrimination evaluation of the best (optimal) solution during the classification training can be defined based on confusion matrix as shown in Table 1. The row of the table represents the predicted class, while the column represents the actual class. From this confusion matrix, tp and tn denote the number of positive and negative instances that are correctly classified. Meanwhile, fp and fn denote the number of misclassified negative and positive instances, respectively. From Table 1, several commonly used metrics can be generated as shown in Table 2 to evaluate the performance of classifier with different focuses of evaluations. Due to multiclass problems, few of metrics listed in Table 2 have been extended for multi-class classification evaluations (see the last four metrics). Table 1. Confusion Matrix for Binary Classification and the Corresponding Array Representation used in this Study Actual Positive Class Actual Negative Class Predicted Positive Class True positive (tp) False negative (fn) Predicted Negative Class False positive (fp) True negative (tn) As shown in the previous studies [3, 11, 16, 30], the accuracy is the most used evaluation metric in practice either for binary or multi-class classification problems. Through accuracy the quality of produced solution is evaluated based on percentage of correct predictions over total instances. The complement metric of accuracy is error rate which evaluates the produced solution by its percentage of incorrect predictions. Both of these metrics were used commonly by researchers in practice to discriminate and select the optimal solution. The advantages of accuracy or error rate are, this metric is easy to compute with less complexity; applicable for multi-class and multi-label problems; easy-to-use scoring; and easy to understand by human. 
As pointed out by many studies, the accuracy metric has limitations in both the evaluation and the discrimination process. One of its main limitations is that it produces values that are not sufficiently distinctive or discriminable [18, 29], which reduces its discriminating power when selecting and determining the optimal classifier.
In addition, accuracy is also weak in terms of informativeness [25, 36] and shows little favour towards minority class instances [3, 9, 16, 30, 37].

Table 2. Threshold Metrics for Classification Evaluations

Metric                 Formula                                                       Evaluation Focus
Accuracy (acc)         (tp + tn) / (tp + fp + tn + fn)                               The ratio of correct predictions over the total number of instances evaluated.
Error Rate (err)       (fp + fn) / (tp + fp + tn + fn)                               The ratio of incorrect predictions over the total number of instances evaluated.
Sensitivity (sn)       tp / (tp + fn)                                                The fraction of positive patterns that are correctly classified.
Specificity (sp)       tn / (tn + fp)                                                The fraction of negative patterns that are correctly classified.
Precision (p)          tp / (tp + fp)                                                The fraction of patterns predicted as positive that are actually positive.
Recall (r)             tp / (tp + fn)                                                The fraction of positive patterns that are correctly classified.
F-Measure (FM)         (2 * p * r) / (p + r)                                         The harmonic mean of precision and recall.
Geometric-mean (GM)    sqrt(sn * sp)                                                 Maximizes the tp rate and tn rate while keeping both rates relatively balanced.
Averaged Accuracy      (1/l) * Σ_{i=1..l} (tp_i + tn_i) / (tp_i + fn_i + fp_i + tn_i)   The average effectiveness over all classes.
Averaged Error Rate    (1/l) * Σ_{i=1..l} (fp_i + fn_i) / (tp_i + fn_i + fp_i + tn_i)   The average error rate over all classes.
Averaged Precision     (1/l) * Σ_{i=1..l} tp_i / (tp_i + fp_i)                       The average of the per-class precision.
Averaged Recall        (1/l) * Σ_{i=1..l} tp_i / (tp_i + fn_i)                       The average of the per-class recall.
Averaged F-Measure     (2 * p_M * r_M) / (p_M + r_M)                                 The average of the per-class F-measure.

Note: C_i - each class of data; l - number of classes; tp_i - true positives for C_i; fp_i - false positives for C_i; fn_i - false negatives for C_i; tn_i - true negatives for C_i; M - macro-averaging.

Besides accuracy, the FM and GM have also been reported as good discriminators that perform better than accuracy in optimizing a classifier for binary classification problems [20]. To the best of our knowledge, no previous work has employed FM or GM to discriminate and select the optimal solution for multiclass classification problems. In contrast, the remaining metrics in Table 2 are unsuitable for discriminating and selecting the optimal solution because each of them evaluates only a single class (either the positive or the negative class). For discriminating and selecting the optimal solution during the classification training, a meaningful trade-off between classes is essential to ensure that every class is represented by its representative prototype(s). This trade-off becomes even more crucial when imbalanced class data are used. The best selected solution turns out to be useless if none of the minority class instances is correctly predicted by the chosen representative prototype(s), or if none of them is selected as a representative (e.g. when a random prototype selection method is used) during the classification training.
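The following hedged sketch computes the Table 2 metrics for a single solution from its confusion-matrix counts, including the FM and GM scores discussed above and a macro-averaged accuracy for the multiclass case. The function names and the zero-division guards are assumptions made for this illustration.

```python
# Illustrative computation of the Table 2 threshold metrics from (tp, fp, fn, tn);
# the guards against empty classes are an assumption, not part of the original table.
from math import sqrt

def threshold_metrics(tp, fp, fn, tn):
    total = tp + fp + fn + tn
    sn = tp / (tp + fn) if tp + fn else 0.0      # sensitivity / recall
    sp = tn / (tn + fp) if tn + fp else 0.0      # specificity
    p = tp / (tp + fp) if tp + fp else 0.0       # precision
    fm = 2 * p * sn / (p + sn) if p + sn else 0.0
    return {
        "acc": (tp + tn) / total,
        "err": (fp + fn) / total,
        "sn": sn, "sp": sp, "precision": p,
        "FM": fm,
        "GM": sqrt(sn * sp),
    }

def averaged_accuracy(per_class_counts):
    """Macro-averaged accuracy over l classes, given (tp, fp, fn, tn) per class."""
    l = len(per_class_counts)
    return sum((tp + tn) / (tp + fp + fn + tn)
               for tp, fp, fn, tn in per_class_counts) / l

if __name__ == "__main__":
    # A solution on an imbalanced 5:95 distribution that predicts every minority
    # instance correctly at the cost of five majority instances.
    print(threshold_metrics(tp=5, fp=5, fn=0, tn=90))
```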
2.2. Mean Square Error (MSE)

Supervised Learning Vector Quantization (LVQ) [21] is one of the Prototype Selection classifiers. During its learning process, supervised LVQ uses MSE to evaluate the performance of candidate solutions during the classification training. In general, the MSE measures the difference between the predicted and the desired outputs, and a smaller MSE value indicates a better trained supervised LVQ. The MSE is defined as:

MSE = \frac{1}{n} \sum_{j=1}^{n} (P_j - A_j)^2    (1)

where P_j is the predicted value of instance j, A_j is the real target value of instance j and n is the total number of instances. Through the learning process of LVQ, the solution with the minimum MSE score is used as the final model (best solution).

Similar to accuracy, the main limitation of MSE is that it does not provide trade-off information between the classes, which may lead the discrimination process to select a sub-optimal solution. Moreover, this metric is highly dependent on the weight initialization process. In an extremely imbalanced class problem, if the initial weights are not properly selected (e.g. no initial weight represents the minority class data), the discrimination process may end up with a sub-optimal solution because of the lack of minority class information, even though the MSE value is minimized (under-fitting or over-fitting).

2.3. Area under the ROC Curve (AUC)

AUC is one of the popular ranking-type metrics. In [13, 17, 31] the AUC was used to construct an optimized learning model and also to compare learning algorithms [28, 29]. Unlike the threshold and probability metrics, the AUC value reflects the overall ranking performance of a classifier. For a two-class problem [13], the AUC value can be calculated as:

AUC = \frac{S_p - n_p(n_p + 1)/2}{n_p \, n_n}    (2)

where S_p is the sum of the ranks of all positive examples, while n_p and n_n denote the number of positive and negative examples respectively.

The AUC has been shown, both theoretically and empirically, to be better than the accuracy metric [17] for evaluating classifier performance and for discriminating an optimal solution during the classification training. Although the AUC performs well for evaluation and discrimination, its computational cost is high, especially when discriminating a large volume of generated solutions for multiclass problems. To compute the AUC for multiclass problems, the time complexity is O(|C| n log n) for the Provost and Domingos AUC model [28] and O(|C|^2 n log n) for the Hand and Till AUC model [13].
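The sketch below illustrates the two optimization criteria discussed above: the MSE of Eq. (1) and a rank-based two-class AUC in the spirit of Eq. (2). Variable names follow the text; the handling of tied scores by average ranks is a standard assumption that the reviewed paper does not spell out.

```python
# Hedged sketch of MSE and a rank-sum (Mann-Whitney style) two-class AUC.

def mse(predicted, actual):
    """Mean square error between predicted and desired outputs."""
    n = len(predicted)
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n

def auc_two_class(scores, labels, positive=1):
    """AUC from the rank sum of positive examples (average ranks for ties)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1            # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    s_p = sum(r for r, y in zip(ranks, labels) if y == positive)
    n_p = sum(1 for y in labels if y == positive)
    n_n = len(labels) - n_p
    return (s_p - n_p * (n_p + 1) / 2) / (n_p * n_n)

if __name__ == "__main__":
    print(mse([0.9, 0.2, 0.4], [1, 0, 1]))                     # ~0.1367
    print(auc_two_class([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))   # 0.75
```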
2.4. Hybrid Discriminator Metrics

Optimized Precision (OP) is a hybrid threshold metric that has been proposed as a discriminator for building an optimized heuristic classifier [30]. This metric is a combination of the accuracy, sensitivity and specificity metrics, where sensitivity and specificity are used to stabilize and optimize the accuracy performance when dealing with imbalanced two-class problems. The OP metric is defined as:

OP = acc - \frac{|sp - sn|}{sp + sn}    (3)

where acc is the accuracy score, while sp and sn denote the specificity and sensitivity scores respectively. In [30], the OP metric was able to discriminate and select a better solution and to increase the classification performance of ensemble learners and Multi-Classifier Systems on a Human DNA Sequences dataset.

Optimized accuracy with recall and precision (OARP) is another hybrid threshold metric that is specifically designed as a discriminator to train the Monte Carlo Sampling (MCS) classifier during the classification training. There are two variants of OARP: the Optimized Accuracy with Extended Recall-Precision version 1 (OAERP1) [15] and version 2 (OAERP2) [14, 16]. In general, both hybrid metrics combine accuracy with extended recall (er) and extended precision (ep). The difference between the two is their Relationship Index (RI): in OAERP1 the RI is calculated based on the correlation from [34], while OAERP2 adopts the correlation given by [24]. The OAERP1 is formulated as follows:

OAERP1 = acc - RI_1    (4)

where

RI_1 = \frac{|ep_1 + ep_2| - |er_1 + er_2|}{|ep_1 + ep_2| + |er_1 + er_2|}    (5)

Meanwhile, the OAERP2 is formulated as follows:

OAERP2 = acc - RI_2    (6)

where

RI_2 = \frac{1}{2}\left( \left|\frac{ep_1 - er_2}{ep_1 + er_2}\right| - \left|\frac{ep_2 - er_1}{ep_2 + er_1}\right| \right)    (7)

In both RI formulas, ep and er represent extended precision and extended recall respectively, and the subscripts denote class 1 (positive class) and class 2 (negative class). As shown in [14, 15, 16], the RI value of both metrics may be zero, in which case the OAERP1 and OAERP2 scores reduce to the accuracy score. The OAERP1 and OAERP2 scores can also be negative; to avoid negative scores, the RI score is rescaled using a decimal scaling method. Both metrics have demonstrated better performance than accuracy in terms of the distinctiveness and discriminability of the produced values. However, OAERP2 shows better discriminating power in choosing an optimal solution and is able to build a better-trained MCS classifier [14, 16].
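A small sketch of the Optimized Precision discriminator of Eq. (3), computed directly from confusion-matrix counts, is given below; the function name is an assumption made for this illustration. The usage example shows how OP separates two solutions that accuracy scores identically on an imbalanced 5:95 distribution.

```python
# Illustrative Optimized Precision (OP) of Eq. (3): accuracy penalized by the
# normalized gap between specificity and sensitivity.

def optimized_precision(tp, fp, fn, tn):
    acc = (tp + tn) / (tp + fp + fn + tn)
    sn = tp / (tp + fn)          # sensitivity
    sp = tn / (tn + fp)          # specificity
    return acc - abs(sp - sn) / (sp + sn)

if __name__ == "__main__":
    # Two solutions with equal accuracy (0.95) on a 5:95 class distribution.
    print(optimized_precision(tp=0, fp=0, fn=5, tn=95))   # -0.05: heavily penalized (sn = 0)
    print(optimized_precision(tp=5, fp=5, fn=0, tn=90))   # ~0.92: close to its accuracy
```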
The main limitation of these three hybrid metrics is that they are restricted to discriminating and evaluating binary classification problems. Real-world datasets are not limited to two-class problems; many comprise more than two classes. To the best of our knowledge, no previous work has modified these metrics for evaluating multiclass data, so their effectiveness for multiclass classification problems remains questionable.

2.5. Other Metrics

Besides the abovementioned metrics, graphical-based metrics that perform better than accuracy have also been proposed for evaluating classifiers. As reported in [27], these metrics are able to depict the trade-offs between different evaluation perspectives, which allows a richer analysis of the results. However, their graphical output prevents metrics such as the Receiver Operating Characteristic (ROC) curve [4], the Bayesian Receiver Operating Characteristic (B-ROC) [1], the Precision-Recall curve [5], cost curves [6], and lift charts and calibration plots [35] from being employed directly as discriminators.

In addition, a few other metrics were designed specifically for a particular classification algorithm. Information gain and entropy are two probability-type metrics used to evaluate the utility of data attributes when building optimized decision tree classifiers [26]. Because of this specific purpose, they are unsuitable for discriminating and selecting the optimal solution. To the best of our knowledge, no previous work has exploited these metrics to discriminate among a bulk of generated solutions during the classification training of PS classifiers.

3. IMPORTANT FACTORS IN CONSTRUCTING NEW METRICS

Through this review, the paper has identified several factors that may help researchers in designing and constructing a new metric, or in choosing a suitable metric, for discriminating the optimal solution of PS classification algorithms. These important factors are briefly described below.

1. Issue on multiclass problems

Many current metrics were originally developed for binary classification problems with different evaluation tasks. This is the major limitation that prevents many good metrics from being widely used as discriminators for the optimal solution. In reality, the available data are not limited to two-class problems; most datasets involve more than two classes. For example, student grades can be categorized into A, B, C, D, E and F. Therefore, the development of a new metric, or the choice of a suitable one, should take this issue into consideration.

2. Less complexity and less computational cost

Since modern data often involve multiple classes, the use of a particular metric becomes more complex as the number of classes to be evaluated increases. As a consequence, the computational cost rises and the classification training speed is affected. For this reason, most researchers simply apply accuracy or error rate to discriminate among their produced solutions. Although accuracy and error rate are less complex and easy-to-use scores, the PS classifiers still require a long learning or training process.
Therefore, the biggest challenge for the future development of a new metric is to design and construct a metric that is less complex, has a low computational cost, and is still able to discriminate an optimal solution from a bulk of generated solutions.
3. Distinctiveness and discriminability

Producing values that lack distinctiveness and discriminability is another drawback of the accuracy metric [15]. As a result, the discrimination and search process for an optimal solution can easily become trapped at local optima (plateaus). Any discriminator must avoid this drawback. In other words, future metrics must produce distinctive and discriminable values so that the optimal solution can be searched for and discriminated in a huge solution space. The details of this effect on discriminating the optimal solution are shown in Table 3 and Table 5.

4. Informativeness

Another drawback of many current metrics is that they carry no trade-off information between classes [24, 35]. For example, the most popular metric, accuracy, cannot discriminate between good and bad (informative and non-informative) solutions, especially when two or more solutions are equivalent or even contradictory, as shown in Table 3 and Table 4 respectively. In Table 3, the accuracy metric cannot distinguish which solution is better because the produced values are neither distinctive nor discriminable. Intuitively, solution a2 is better than a1, since a2 correctly predicts all minority class members, whereas a1 predicts none of them, which makes a1 a poor solution. In Table 4, on the other hand, the accuracy metric concludes by score comparison that solution a1 is better than a2. However, a1 is a poor solution in which none of the minority class members is correctly predicted; intuitively, a2 is the better solution even though its score is lower, because it correctly predicts all minority class members. Both examples show that informativeness is an essential feature of any metric for discriminating the informative and optimal solution (a short numeric sketch following Table 4 illustrates this point).

Table 3. Informativeness Analysis for Binary Classification Problem using Imbalanced Class Distribution (5:95) with Two Equivalent Solutions

sol   tp   fp   tn   fn   total correct   accuracy
a1     0    0   95    5              95     0.9500
a2     5    5   90    0              95     0.9500

Table 4. Informativeness Analysis for Binary Classification Problem using Imbalanced Class Distribution (5:95) with Two Contradictory Solutions

sol   tp   fp   tn   fn   total correct   accuracy
a1     0    0   95    5              95     0.9500
a2     5    6   89    0              94     0.9400
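The following hedged sketch evaluates the Table 3 and Table 4 solutions to make the informativeness argument concrete: accuracy ties the Table 3 solutions and prefers the poorer Table 4 solution, while a class-sensitive score such as the geometric mean of the two class rates reflects the intuitive ranking. Function names are assumptions made for this example.

```python
# Illustration of the informativeness issue using the Table 3 and Table 4 solutions.
from math import sqrt

def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)

def g_mean(tp, fp, tn, fn):
    sn = tp / (tp + fn)   # true positive rate
    sp = tn / (tn + fp)   # true negative rate
    return sqrt(sn * sp)

solutions = {
    "Table 3, a1": (0, 0, 95, 5),
    "Table 3, a2": (5, 5, 90, 0),
    "Table 4, a1": (0, 0, 95, 5),
    "Table 4, a2": (5, 6, 89, 0),
}
for name, (tp, fp, tn, fn) in solutions.items():
    print(f"{name}: acc={accuracy(tp, fp, tn, fn):.4f}  GM={g_mean(tp, fp, tn, fn):.4f}")
# Table 3: both solutions score acc=0.9500, but GM is 0.0000 for a1 and ~0.9733 for a2.
# Table 4: acc prefers a1 (0.9500 > 0.9400), while GM prefers a2 (~0.9679 > 0.0000).
```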
5. Favour towards the minority class

According to [11, 12, 19], the most popular metric, accuracy, is greatly affected by the proportion of the majority class and gives little weight to the minority class. Hence, it is important to employ an evaluation metric that favours the minority class over the majority class. In other words, the more minority class instances are correctly predicted, the better the solution, especially for extremely imbalanced class problems. For the example shown in Table 5, any good metric or discriminator should rank the solutions as follows: a6 → a5 → a4 → a3 → a2 → a1. Intuitively, solution a6 is the most informative solution and the one most in favour of the minority class, while a1 is the poorest solution since it correctly predicts none of the minority class members. Based on the accuracy scores, however, none of these solutions can be ranked as suggested because all of them obtain the same score. Furthermore, Table 5 also shows that the accuracy metric produces scores that are neither distinctive nor discriminable, which can cause a search guided by accuracy to become trapped at local optima while looking for an optimal solution.

Table 5. Favour towards Minority Class Analysis for Binary Classification Problem using Extremely Imbalanced Class Distribution (5:9995)

sol   tp   fp     tn   fn   total correct   accuracy
a1     0    0   9995    5            9995     0.9995
a2     1    1   9994    4            9995     0.9995
a3     2    2   9993    3            9995     0.9995
a4     3    3   9992    2            9995     0.9995
a5     4    4   9991    1            9995     0.9995
a6     5    5   9990    0            9995     0.9995

4. CONCLUSIONS

The selection of a suitable metric for discriminating the optimal solution in order to obtain an optimized classifier is a crucial step. Proper selection of the metric ensures that the classification training of a generative-type classifier is optimal. This paper hopes that the review of metrics for discriminating the optimal solution will sensitize data mining researchers to the issue and encourage them to think carefully before selecting and applying a suitable metric for optimizing the classification training. In addition, this paper has suggested several important aspects for constructing a better metric for discriminating the optimal solution for generative types of classification algorithms.

ACKNOWLEDGEMENTS

This work has been funded by the Ministry of Education, Malaysia under the SLAI scholarship. We also thank Universiti Malaysia Sarawak and the reviewers for their support and helpful comments.

REFERENCES

[1] A.A. Cardenas and J.S. Baras, "B-ROC curves for the assessment of classifiers over imbalanced data sets", in Proc. of the 21st National Conference on Artificial Intelligence, Vol. 2, 2006, pp. 1581-1584.
[2] R. Caruana and A. Niculescu-Mizil, "Data mining in metric space: an empirical analysis of supervised learning performance criteria", in Proc. of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '04), New York, NY, USA: ACM, 2004, pp. 69-78.
[3] N.V. Chawla, N. Japkowicz and A. Kolcz, "Editorial: Special issue on learning from imbalanced data sets", SIGKDD Explorations, 6 (2004) 1-6.
[4] T. Fawcett, "An introduction to ROC analysis", Pattern Recognition Letters, 27 (2006) 861-874.
[5] J. Davis and M. Goadrich, "The relationship between precision-recall and ROC curves", in Proc. of the 23rd International Conference on Machine Learning, 2006, pp. 233-240.
[6] C. Drummond and R.C. Holte, "Cost curves: An improved method for visualizing classifier performance", Machine Learning, 65 (2006) 95-130.
[7] P.A. Flach, "The geometry of ROC space: understanding machine learning metrics through ROC isometrics", in T. Fawcett and N. Mishra (Eds.) Proc. of the 20th Int. Conference on Machine Learning (ICML 2003), Washington, DC, USA: AAAI Press, 2003, pp. 194-201.
[8] V. Garcia, R.A. Mollineda and J.S. Sanchez, "A bias correction function for classification performance assessment in two-class imbalanced problems", Knowledge-Based Systems, 59 (2014) 66-74.
[9] S. Garcia and F. Herrera, "Evolutionary training set selection to optimize C4.5 in imbalance problems", in Proc. of the 8th Int. Conference on Hybrid Intelligent Systems (HIS 2008), Washington, DC, USA: IEEE Computer Society, 2008, pp. 567-572.
[10] N. Garcia-Pedrajas, J.A. Romero del Castillo and D. Ortiz-Boyer, "A cooperative coevolutionary algorithm for instance selection for instance-based learning", Machine Learning, 78 (2010) 381-420.
[11] Q. Gu, L. Zhu and Z. Cai, "Evaluation measures of the classification performance of imbalanced datasets", in Z. Cai et al. (Eds.) ISICA 2009, CCIS 51, Berlin, Heidelberg: Springer-Verlag, 2009, pp. 461-471.
[12] S. Han, B. Yuan and W. Liu, "Rare class mining: Progress and prospect", in Proc. of the Chinese Conference on Pattern Recognition (CCPR 2009), 2009, pp. 1-5.
[13] D.J. Hand and R.J. Till, "A simple generalization of the area under the ROC curve to multiple class classification problems", Machine Learning, 45 (2001) 171-186.
[14] M. Hossin, M.N. Sulaiman, A. Mustapha and N. Mustapha, "A novel performance metric for building an optimized classifier", Journal of Computer Science, 7(4) (2011) 582-590.
[15] M. Hossin, M.N. Sulaiman, A. Mustapha, N. Mustapha and R.W. Rahmat, "OAERP: a better measure than accuracy in discriminating a better solution for stochastic classification training", Journal of Artificial Intelligence, 4(3) (2011) 187-196.
[16] M. Hossin, M.N. Sulaiman, A. Mustapha, N. Mustapha and R.W. Rahmat, "A hybrid evaluation metric for optimizing classifier", in Data Mining and Optimization (DMO), 2011 3rd Conference on, 2011, pp. 165-170.
[17] J. Huang and C.X. Ling, "Using AUC and accuracy in evaluating learning algorithms", IEEE Transactions on Knowledge and Data Engineering, 17 (2005) 299-310.
[18] J. Huang and C.X. Ling, "Constructing new and better evaluation measures for machine learning", in R. Sangal, H. Mehta and R.K. Bagga (Eds.) Proc. of the 20th International Joint Conference on Artificial Intelligence (IJCAI 2007), San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2007, pp. 859-864.
[19] N. Japkowicz, "Assessment metrics for imbalanced learning", in Imbalanced Learning: Foundations, Algorithms, and Applications, Wiley-IEEE Press, 2013, pp. 187-210.
[20] M.V. Joshi, "On evaluating performance of classifiers for rare classes", in Proc. of the 2002 IEEE Int. Conference on Data Mining (ICDM 2002), Washington, DC, USA: IEEE Computer Society, 2002, pp. 641-644.
[21] T. Kohonen, Self-Organizing Maps, 3rd ed., Berlin, Heidelberg: Springer-Verlag, 2001.
[22] L.I. Kuncheva and J.C. Bezdek, "Nearest prototype classification: Clustering, genetic algorithms, or random search?", IEEE Transactions on Systems, Man, and Cybernetics-Part C: Applications and Reviews, 28(1) (1998) 160-164.
[23] N. Lavesson and P. Davidsson, "Generic methods for multi-criteria evaluation", in Proc. of the SIAM Int. Conference on Data Mining, Atlanta, Georgia, USA: SIAM Press, 2008, pp. 541-546.
[24] P. Lingras and C.J. Butz, "Precision and recall in rough support vector machines", in Proc. of the 2007 IEEE Int. Conference on Granular Computing (GRC 2007), Washington, DC, USA: IEEE Computer Society, 2007, pp. 654-654.
[25] D.J.C. MacKay, Information Theory, Inference and Learning Algorithms, Cambridge, UK: Cambridge University Press, 2003.
[26] T.M. Mitchell, Machine Learning, USA: McGraw-Hill, 1997.
[27] R.
Prati, G. Batista and M. Monard, "A survey on graphical methods for classification predictive performance evaluation", IEEE Transactions on Knowledge and Data Engineering, 23 (2011) 1601-1618.
[28] F. Provost and P. Domingos, "Tree induction for probability-based ranking", Machine Learning, 52 (2003) 199-215.
[29] A. Rakotomamonjy, "Optimizing area under ROC with SVMs", in J. Hernandez-Orallo, C. Ferri, N. Lachiche and P.A. Flach (Eds.) 1st Int. Workshop on ROC Analysis in Artificial Intelligence (ROCAI 2004), Valencia, Spain, 2004, pp. 71-80.
[30] R. Ranawana and V. Palade, "Optimized Precision - a new measure for classifier performance evaluation", in Proc. of the IEEE World Congress on Evolutionary Computation (CEC 2006), 2006, pp. 2254-2261.
[31] S. Rosset, "Model selection via the AUC", in C.E. Brodley (Ed.) Proc. of the 21st Int. Conference on Machine Learning (ICML 2004), New York, NY, USA: ACM, 2004, p. 89.
[32] D.B. Skalak, "Prototype and feature selection by sampling and random mutation hill climbing algorithms", in W.W. Cohen and H. Hirsh (Eds.) Proc. of the 11th Int. Conference on Machine Learning (ICML 1994), New Brunswick, NJ, USA: Morgan Kaufmann, 1994, pp. 293-301.
[33] M. Sokolova and G. Lapalme, "A systematic analysis of performance measures for classification tasks", Information Processing and Management, 45 (2009) 427-437.
[34] P.N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Boston, USA: Pearson Addison Wesley, 2006.
[35] M. Vuk and T. Curk, "ROC curve, lift chart and calibration plot", Metodološki zvezki, 3(1) (2006) 89-108.
[36] H. Wallach, "Evaluation metrics for hard classifiers", Technical Report, 2006. http://www.inference.phy.cam.ac.uk/hmw26/papers
[37] S.W. Wilson, "Mining oblique data with XCS", in P.L. Lanzi, W. Stolzmann and S.W. Wilson (Eds.) Advances in Learning Classifier Systems: Third Int. Workshop (IWLCS 2000), Berlin, Heidelberg: Springer-Verlag, 2001, pp. 283-290.
[38] H. Zhang and G. Sun, "Optimal reference subset selection for nearest neighbor classification by tabu search", Pattern Recognition, 35(7) (2002) 1481-1490.

AUTHORS

Mohammad b. Hossin is a senior lecturer at Universiti Malaysia Sarawak. He received his B.IT (Hons) from Universiti Utara Malaysia (UUM) in 2000 and his M.Sc. in Artificial Intelligence from the University of Essex, UK, in 2003. In 2012, he received his Ph.D. in Intelligent Computing from Universiti Putra Malaysia (UPM). His main research interests include data mining, decision support systems, optimization using nature-inspired algorithms and e-learning.

Md Nasir Sulaiman received his Bachelor of Science with Education, majoring in Mathematics, from Universiti Pertanian Malaysia in 1983. He received a Master in Computing from the University of Bradford, UK, in 1986 and a PhD in Computer Science from Loughborough University, UK, in 1994. He has been a lecturer at Universiti Putra Malaysia since 1986 and was promoted to Associate Professor in 2002. His research interests are Intelligent Computing and Smart Home.