Relationships between diversity of classification ensembles and single class

Relationships between Diversity of Classification Ensembles and Single-Class Performance Measures
Abstract
In class imbalance learning problems, how to better recognize examples from the minority class is the key
focus, since it is usually more important and expensive than the majority class. Quite a few ensemble solutions
have been proposed in the literature with varying degrees of success. It is generally believed that diversity in an
ensemble could help to improve the performance of class imbalance learning. However, no study has actually
investigated diversity in depth in terms of its definitions and effects in the context of class imbalance learning. It
is unclear whether diversity will have a similar or different impact on the performance of minority and majority
classes. In this paper, we aim to gain a deeper understanding of if and when ensemble diversity has a positive
impact on the classification of imbalanced data sets. First, we explain when and why diversity measured by the
Q-statistic can bring improved overall accuracy, based on two classification patterns proposed by Kuncheva et al.
We define and give insights into good and bad patterns in imbalanced scenarios. Then, the pattern analysis is
extended to single-class performance measures, including recall, precision, and F-measure, which are widely
used in class imbalance learning. Six different situations of diversity’s impact on these measures are obtained
through theoretical analysis. Finally, to further understand how diversity affects the single-class performance
and overall performance in class imbalance problems, we carry out extensive experimental studies on both
artificial data sets and real-world benchmarks with highly skewed class distributions. We find strong
correlations between diversity and discussed performance measures. Diversity shows a positive impact on the
minority class in general. It is also beneficial to the overall performance in terms of AUC and G-mean.
Existing System
In a typical imbalanced data set with two classes, one class is heavily under-represented compared to the other
class, which contains a relatively large number of examples. Class imbalance exists pervasively in many real-world
applications, such as medical diagnosis, fraud detection, risk management, and text classification. Rare cases in
these domains suffer from higher misclassification costs than common cases. It is a promising research area that
has been drawing more and more attention in data mining and machine learning, since many standard machine
learning algorithms have been reported to be less effective when dealing with this kind of problem. The
fundamental issue to be resolved is that they tend to ignore or overfit the minority class. Hence, great research
efforts have been made on the development of a good learning model that can predict rare cases more accurately
and so lower the total risk. The difference among individual learners is interpreted as “diversity” in ensemble
learning. It has been shown, both theoretically and empirically, to be one of the main reasons for the success of
ensembles. To date, existing studies have discussed the relationship between diversity and overall
accuracy. In class imbalance cases, however, overall accuracy is not an appropriate measure and is less meaningful:
a classifier that labels every example as the majority class reaches 99 percent accuracy on a 99:1 data set
while recognizing no rare case at all.
Disadvantages
If diversity is shown to be beneficial in imbalanced scenarios, it will suggest an alternative way of
handling class imbalance problems by considering diversity explicitly in the learning process.
We explain why diversity is not always beneficial to the overall performance.
Two arguments are proposed accordingly for the minority and majority classes of a class imbalance
problem, respectively.
Proposed System
There is no agreed definition for diversity. Quite a few pairwise and nonpairwise diversity
measures were proposed in the literature, such as the Q-statistic, the double-fault measure, entropy, and
generalized diversity. The attractive features of ensembles have led to a variety of ensemble methods proposed
to handle imbalanced data sets at the data and algorithm levels. At the data level, sampling strategies are
integrated into the training of each ensemble member. For instance, Li’s BEV and Chan and Stolfo’s
combining model were proposed based on the idea of Bagging by undersampling the majority class
examples and combining them with all the minority class examples to form balanced training subsets.
SMOTEBoost and DataBoost-IM were designed to alter the imbalanced distribution based on Boosting.
[...] the classification characteristics of class imbalance learning into account. We first give some insight into
the class imbalance problem from the view of base learning algorithms, such as decision trees and neural
networks. With skewed class distributions and different misclassification costs, the classification
difficulty shows up mainly as overfitting to the minority class and overgeneralization to the majority
class, because the small class contributes less to the classifier.
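The balanced-subset idea described above (undersample the majority class and combine the sample with all minority class examples) can be sketched as follows. This is a generic illustration only, not the exact procedure of BEV or of Chan and Stolfo's combining model; the function name `balanced_subsets`, its parameters, and the toy label vector are assumptions made for this example.

```python
import random

def balanced_subsets(labels, n_subsets, minority_label=1, seed=0):
    """Build index lists for training ensemble members.

    Each subset keeps every minority example and adds an equally sized
    random sample of majority examples (drawn without replacement).
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    k = min(len(minority), len(majority))
    subsets = []
    for _ in range(n_subsets):
        sampled = rng.sample(majority, k)  # undersample the majority class
        subsets.append(sorted(minority + sampled))
    return subsets

# Toy data (hypothetical): 5 minority examples and 95 majority examples.
labels = [1] * 5 + [0] * 95
subsets = balanced_subsets(labels, n_subsets=3)
for s in subsets:
    print(len(s), sum(labels[i] for i in s))  # subset size, minority count
```

Each index list would then be used to train one ensemble member, so every member sees all rare cases but a different slice of the majority class.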
Advantages
In the classification context, diversity is loosely described as “making errors on different examples”. Clearly, a
set of identical classifiers does not bring any advantages.
In an ensemble composed of many such classifiers, each classifier tends to label most of the data as the
majority class.
Artificial data sets and highly imbalanced real-world benchmarks are included in our experiments.
We proceed with correlation analysis and present corresponding decision boundary plots. We also
provide some insight into diversity and performance measures at different levels of ensemble size.
Modules
1. Diversity And Overall Accuracy
2. Correlation Analysis
3. Impact of Ensemble Size
4. Imbalanced Data
5. Single-Class Performance
6. Overall Performance

Module Description
Diversity And Overall Accuracy
A classification pattern refers to the voting combinations of the individual classifiers that an ensemble
can have. The accuracy is given by the majority voting method of combining classifier decisions. First, two
extreme patterns are defined, which present different effects of diversity. It is shown that diversity is not always
beneficial to the generalization performance. The reason is then explained in a general pattern. According to the
features of the patterns, we relate them to the classification of each class of a class imbalance problem, and
propose two arguments for the minority and majority classes, respectively.
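To make the link between voting patterns, majority-vote accuracy, and diversity concrete, the sketch below computes the pairwise Q-statistic, Q = (N11*N00 - N01*N10) / (N11*N00 + N01*N10), averaged over all classifier pairs, where N11 counts examples both classifiers get right, N00 examples both get wrong, and N10/N01 examples only one gets right. Lower Q means higher diversity. The three voting tables are toy values chosen for illustration (three classifiers, ten examples, each classifier 60 percent accurate); the function names are assumptions, and the code assumes a two-class problem with an odd ensemble size.

```python
from itertools import combinations

def q_statistic(correct_a, correct_b):
    """Pairwise Q-statistic from two 0/1 'was this example classified correctly' lists."""
    n11 = n00 = n10 = n01 = 0
    for a, b in zip(correct_a, correct_b):
        if a and b:
            n11 += 1
        elif not a and not b:
            n00 += 1
        elif a:
            n10 += 1
        else:
            n01 += 1
    denom = n11 * n00 + n01 * n10
    if denom == 0:
        return 0.0  # assumption: report 0 when Q is undefined
    return (n11 * n00 - n01 * n10) / denom

def average_q(correct_matrix):
    """Mean Q over all pairs of classifiers (one row per classifier)."""
    pairs = list(combinations(correct_matrix, 2))
    return sum(q_statistic(a, b) for a, b in pairs) / len(pairs)

def majority_vote_accuracy(correct_matrix):
    """Fraction of examples on which more than half of the classifiers are correct."""
    n_clf = len(correct_matrix)
    n_ex = len(correct_matrix[0])
    hits = sum(1 for j in range(n_ex)
               if sum(row[j] for row in correct_matrix) > n_clf / 2)
    return hits / n_ex

# Correct votes are spread so that most examples get exactly two of them.
success = [[1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
           [1, 1, 1, 0, 0, 0, 1, 1, 1, 0],
           [0, 0, 0, 1, 1, 1, 1, 1, 1, 0]]
# Correct votes are wasted: either unanimous or a lone correct vote.
failure = [[1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
           [1, 1, 1, 1, 0, 0, 1, 1, 0, 0],
           [1, 1, 1, 1, 0, 0, 0, 0, 1, 1]]
# Three copies of the same classifier: no diversity at all.
identical = [[1, 1, 1, 1, 1, 1, 0, 0, 0, 0]] * 3

for name, m in [("success", success), ("failure", failure), ("identical", identical)]:
    print(name, round(average_q(m), 3), majority_vote_accuracy(m))
# success   -0.5    0.9
# failure    0.333  0.4
# identical  1.0    0.6
```

The failure table is more diverse than the identical ensemble (Q of about 0.33 against 1.0) yet less accurate under majority voting (0.4 against 0.6), which is one way to see why diversity is not always beneficial.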
Correlation Analysis
The Spearman correlation coefficient is a nonparametric measure of statistical dependence between two
variables and is insensitive to how the measures are scaled. We compute the correlation coefficients of the
single-class performance measures and the overall accuracy in two sampling ranges of r. The coefficients between
per-class and overall diversity on the three data sets are positive,
which shows that ensemble diversity for each class has the same changing tendency as the overall diversity,
regardless of whether the data set is balanced. On one hand, it guarantees that increasing the classification
diversity over the whole data set can increase diversity over each class. On the other hand, it confirms that the
diversity measure Q-statistic is not sensitive to imbalanced distributions.
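As a minimal sketch of the correlation step, the Spearman coefficient can be computed as the Pearson correlation of the ranks of the two variables. The helper names and the Q and recall values below are made up for illustration and are not results from the study.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Plain Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical measurements: one (Q, minority recall) pair per trained ensemble.
q_values = [0.9, 0.7, 0.5, 0.3, 0.1]
recall = [0.40, 0.45, 0.55, 0.70, 0.72]

rho = spearman(q_values, recall)
rho_rescaled = spearman(q_values, [100 * r for r in recall])
print(round(rho, 6), round(rho_rescaled, 6))  # -1.0 -1.0
```

Because only ranks enter the computation, rescaling recall to percentages leaves the coefficient unchanged, which is the scale insensitivity mentioned above.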
Impact of Ensemble Size
Because the ensemble size is important to the application of an ensemble, we look into how diversity and the
other performance measures change at different levels of ensemble size on the three artificial data sets. The
measures are affected by the ensemble size and by the differences among the training data with different imbalance
degrees. Instead of keeping a constant size of 15 classifiers for an ensemble model, we adjust the number of
decision trees from 5 to 955 in steps of 50. The sampling rate for training is set to a moderate value of 100
percent.
Imbalanced Data
We have studied the impact of diversity on single-class performance in depth through artificial data sets. Now we ask
whether the results are applicable to real-world domains. In this section, we report the correlation results for the
same research question on fifteen highly imbalanced real-world benchmarks. The data information is
summarized.
Single-Class Performance

The single-class performance should be our focus. For the minority class, recall has a very strong
negative correlation with Q in all cases; precision has a very strong positive correlation with Q in 12 out of 15
cases; the coefficients of F-measure do not show a consistent relationship, where 6 cases present positive
correlations and 5 cases present negative correlations. The observation suggests that more minority-class
examples are identified with some loss of precision by increasing diversity.
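The three single-class measures can be computed directly from the predictions for one target class, as in the sketch below. It assumes the balanced F-measure (F1); the function name and the two prediction vectors are hypothetical and only mimic the trade-off described above.

```python
def single_class_measures(y_true, y_pred, target):
    """Return (recall, precision, F-measure) for the class `target`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    if precision + recall == 0:
        return recall, precision, 0.0
    f_measure = 2 * precision * recall / (precision + recall)  # assumption: F1
    return recall, precision, f_measure

# Toy data: 10 minority examples (label 1) followed by 90 majority examples (label 0).
y_true = [1] * 10 + [0] * 90
# A cautious ensemble: finds 4 minority examples, raises no false alarms.
cautious = [1] * 4 + [0] * 6 + [0] * 90
# A bolder ensemble: finds 8 minority examples at the cost of 8 false alarms.
bolder = [1] * 8 + [0] * 2 + [1] * 8 + [0] * 82

for name, pred in [("cautious", cautious), ("bolder", bolder)]:
    r, p, f = single_class_measures(y_true, pred, target=1)
    print(name, round(r, 3), round(p, 3), round(f, 3))
# cautious 0.4 1.0 0.571
# bolder 0.8 0.5 0.615
```

In this toy case recall doubles while precision halves, and the F-measure moves only slightly, which illustrates why F-measure need not show a consistent relationship with diversity.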
Overall Performance
As we have explained, accuracy is not a good overall performance measure for class imbalance problems,
since it is strongly biased toward the majority class. Although the single-class measures we have discussed so far
better reflect the performance for one class, it is still necessary to evaluate how well a classifier can
balance the performance between classes. G-mean and AUC are better choices.
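A short sketch of the two overall measures follows. G-mean is taken here as the geometric mean of the recalls of the two classes, and AUC is computed by pairwise comparison as the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half. The function names and toy inputs are assumptions for illustration.

```python
import math

def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of the recall on the positive and the negative class."""
    pos = [p for t, p in zip(y_true, y_pred) if t == positive]
    neg = [p for t, p in zip(y_true, y_pred) if t != positive]
    tpr = sum(1 for p in pos if p == positive) / len(pos)
    tnr = sum(1 for p in neg if p != positive) / len(neg)
    return math.sqrt(tpr * tnr)

def auc(y_true, scores, positive=1):
    """Area under the ROC curve via pairwise score comparison."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data: 10 minority examples (label 1) and 90 majority examples (label 0).
y_true = [1] * 10 + [0] * 90
all_majority = [0] * 100  # a classifier that ignores the minority class

accuracy = sum(1 for t, p in zip(y_true, all_majority) if t == p) / len(y_true)
print(accuracy, g_mean(y_true, all_majority))  # 0.9 0.0

# Hypothetical scores for five examples (higher means "more likely minority").
small_true = [1, 1, 0, 0, 0]
small_scores = [0.9, 0.4, 0.5, 0.3, 0.1]
print(round(auc(small_true, small_scores), 3))  # 0.833
```

The all-majority classifier scores 0.9 on accuracy but 0 on G-mean, which shows how G-mean penalizes a model that sacrifices one class entirely.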

FLOW CHART
[Flow chart not recoverable from the extracted text. Surviving box labels: Imbalance Learning; Class Imbalance Learning; Minority Class; Diversity of Classification Ensembles; Single-Class Performance Measures.]

CONCLUSIONS
This paper studied the relationships between ensemble diversity and performance measures for class imbalance
learning, aiming at the following questions: What is the impact of diversity on single-class performance? Does
diversity have a positive effect on the classification of the minority/majority class? We chose the Q-statistic
as the diversity measure
and considered three single-class performance measures including recall, precision, and F-measure. The
relationship with overall performance was also discussed empirically by examining G-mean and AUC for a
complete understanding. To answer the first question, we gave some mathematical links between Q-statistic and
the single-class measures. This part of work is based on Kuncheva et al.’s pattern analysis. We extended it to
the single-class context under specific classification patterns of ensemble and explained why we expect
diversity to have different impacts on minority and majority classes in class imbalance scenarios. Six possible
situations of how the single-class measures behave with respect to the Q-statistic are obtained. For the second
question, we verified the measure behaviors empirically on a set of artificial and real-world imbalanced data
sets. We examined the impact of diversity on each class through correlation analysis. Strong correlations are
found. We show the positive effect of diversity in recognizing minority class examples and balancing recall
against precision of the minority class. Diversity degrades the classification performance of the majority class
in terms of recall and F-measure on real-world data sets. It is beneficial to the overall performance in terms of
G-mean and AUC. Significant and consistent correlations found in this paper encourage us to take this step
further. We would like to explore in the future if and to what degree the existing class imbalance learning
methods can lead to improved diversity and contribute to the classification performance. We are interested in
the development of novel ensemble learning algorithms for class imbalance learning that can make best use of
our diversity analysis here, so that the importance of the minority class can be better considered. It is also
important in the future to consider class imbalance problems with more than two classes.