Relationships between Diversity of Classification Ensembles and Single-Class Performance Measures
Abstract
In class imbalance learning problems, how to better recognize examples from the minority class is the key
focus, since it is usually more important and expensive than the majority class. Quite a few ensemble solutions
have been proposed in the literature with varying degrees of success. It is generally believed that diversity in an
ensemble could help to improve the performance of class imbalance learning. However, no study has actually
investigated diversity in depth in terms of its definitions and effects in the context of class imbalance learning. It
is unclear whether diversity will have a similar or different impact on the performance of minority and majority
classes. In this paper, we aim to gain a deeper understanding of if and when ensemble diversity has a positive
impact on the classification of imbalanced data sets. First, we explain when and why diversity measured by the
Q-statistic can bring improved overall accuracy based on two classification patterns proposed by Kuncheva et al.
We define and give insights into good and bad patterns in imbalanced scenarios. Then, the pattern analysis is
extended to single-class performance measures, including recall, precision, and F-measure, which are widely
used in class imbalance learning. Six different situations of diversity’s impact on these measures are obtained
through theoretical analysis. Finally, to further understand how diversity affects the single-class performance
and overall performance in class imbalance problems, we carry out extensive experimental studies on both
artificial data sets and real-world benchmarks with highly skewed class distributions. We find strong
correlations between diversity and the discussed performance measures. Diversity shows a positive impact on the
minority class in general. It is also beneficial to the overall performance in terms of AUC and G-mean.
GLOBALSOFT TECHNOLOGIES
IEEE PROJECTS & SOFTWARE DEVELOPMENTS
IEEE FINAL YEAR PROJECTS|IEEE ENGINEERING PROJECTS|IEEE STUDENTS PROJECTS|IEEE
BULK PROJECTS|BE/BTECH/ME/MTECH/MS/MCA PROJECTS|CSE/IT/ECE/EEE PROJECTS
CELL: +91 98495 39085, +91 99662 35788, +91 98495 57908, +91 97014 40401
Visit: www.finalyearprojects.org Mail to: ieeefinalsemprojects@gmail.com
Existing System
In a typical imbalanced data set with two classes, one class is heavily under-represented compared to the other
class, which contains a relatively large number of examples. Class imbalance pervasively exists in many real-world
applications, such as medical diagnosis, fraud detection, risk management, and text classification. Rare cases in
these domains suffer from higher misclassification costs than common cases. Class imbalance learning is a promising
research area that has been drawing more and more attention in data mining and machine learning, since many standard
machine learning algorithms have been reported to be less effective when dealing with this kind of problem. The
fundamental issue to be resolved is that they tend to ignore or overfit the minority class. Hence, great research
efforts have been made on the development of a good learning model that can predict rare cases more accurately
to lower the total risk. The difference among individual learners is interpreted as “diversity” in ensemble
learning. Diversity has been shown to be one of the main reasons for the success of ensembles from both theoretical
and empirical aspects. To date, existing studies have discussed the relationship between diversity and overall
accuracy. In class imbalance cases, however, the overall accuracy is not an appropriate measure and is less meaningful.
Disadvantages
If diversity is shown to be beneficial in imbalanced scenarios, it will suggest an alternative way of
handling class imbalance problems by considering diversity explicitly in the learning process.
We explain why diversity is not always beneficial to the overall performance.
Two arguments are proposed accordingly for the minority and majority classes of a class imbalance
problem, respectively.
Proposed System
There is no agreed definition for diversity. Quite a few pairwise and non-pairwise diversity
measures have been proposed in the literature, such as the Q-statistic, the double-fault measure, entropy, and
generalized diversity. The attractive features of ensembles have led to a variety of ensemble methods proposed to
handle imbalanced data sets at the data and algorithm levels. At the data level, sampling strategies are
integrated into the training of each ensemble member. For instance, Li’s BEV and Chan and Stolfo’s
combining model were proposed based on the idea of Bagging: they undersample the majority class
examples and combine them with all the minority class examples to form balanced training subsets.
SMOTEBoost and DataBoost-IM were designed to alter the imbalanced distribution based on Boosting.
The proposed analysis takes the classification characteristics of class imbalance learning into account. We first
give some insight into the class imbalance problem from the view of base learning algorithms, such as decision trees
and neural networks. With skewed class distributions and different misclassification costs, the classification
difficulty is mainly reflected in overfitting to the minority class and overgeneralization to the majority
class, because the small class contributes less to the classifier.
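The idea of forming balanced training subsets by undersampling the majority class can be sketched as follows. This is a generic illustration under stated assumptions, not the exact BEV or Chan and Stolfo procedure: the function name `balanced_subsets`, the random draw without replacement, and the toy data are all hypothetical.

```python
import random

def balanced_subsets(minority, majority, n_subsets, seed=0):
    """Build n_subsets training sets, each holding all minority examples plus an
    equally sized random sample (without replacement) of majority examples."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    subsets = []
    for _ in range(n_subsets):
        sampled_majority = rng.sample(majority, len(minority))
        subsets.append(list(minority) + sampled_majority)
    return subsets

# Toy data (made up): 5 minority examples labeled 1, 50 majority examples labeled 0.
minority = [(i, 1) for i in range(5)]
majority = [(i, 0) for i in range(5, 55)]

subsets = balanced_subsets(minority, majority, n_subsets=3)
for s in subsets:
    n_minority = sum(1 for _, label in s if label == 1)
    print(len(s), n_minority)  # size of the subset and its minority count
```

Each subset would then train one ensemble member, so every member sees a balanced class distribution while the ensemble as a whole still covers a larger share of the majority class.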
Advantages
In the classification context, diversity is loosely described as “making errors on different examples”. Clearly, a
set of identical classifiers does not bring any advantages.
In an ensemble composed of many such classifiers, each classifier tends to label most of the data as the
majority class.
Artificial data sets and highly imbalanced real-world benchmarks are included in our experiments.
We proceed with correlation analysis and present corresponding decision boundary plots. We also
provide some insight into diversity and performance measures at different levels of ensemble size.
Module
1. Diversity And Overall Accuracy
2. Correlation Analysis
3. Impact of Ensemble Size
4. Imbalanced Data
5. Single-Class Performance
6. Overall Performance
Module Description
Diversity And Overall Accuracy
A classification pattern refers to the voting combinations of the individual classifiers that an ensemble
can have. The accuracy is given by the majority voting method of combining classifier decisions. First, two
extreme patterns are defined, which present different effects of diversity. It is shown that diversity is not always
beneficial to the generalization performance. The reason is then explained in a general pattern. According to the
features of the patterns, we relate them to the classification of each class of a class imbalance problem, and
propose two arguments for the minority and majority classes, respectively.
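As a minimal sketch of the quantities involved, the following Python computes the majority-vote accuracy and the pairwise-averaged Q-statistic from per-classifier correctness records. The "oracle output" layout (1 = correct, 0 = wrong), the function names, and the toy voting patterns are illustrative assumptions; pairs for which Q is undefined are simply skipped here.

```python
from itertools import combinations

def q_statistic(a, b):
    """Q-statistic of two classifiers from oracle outputs (1 = correct, 0 = wrong)."""
    n11 = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)  # both correct
    n00 = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)  # both wrong
    n10 = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)  # only first correct
    n01 = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)  # only second correct
    denom = n11 * n00 + n01 * n10
    if denom == 0:
        return None  # Q is undefined for this pair
    return (n11 * n00 - n01 * n10) / denom

def average_q(oracle):
    """Mean Q over all classifier pairs; lower values mean higher diversity."""
    values = [q_statistic(a, b) for a, b in combinations(oracle, 2)]
    values = [v for v in values if v is not None]
    return sum(values) / len(values)

def majority_vote_accuracy(oracle):
    """Fraction of examples on which more than half of the classifiers are correct."""
    n_classifiers = len(oracle)
    n_examples = len(oracle[0])
    hits = sum(1 for j in range(n_examples)
               if sum(row[j] for row in oracle) > n_classifiers / 2)
    return hits / n_examples

# Toy patterns: three classifiers, ten examples, each classifier 60% accurate.
diverse = [
    [1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 1, 1, 1, 1, 1, 1, 0],
]
identical = [[1, 1, 1, 1, 1, 1, 0, 0, 0, 0]] * 3

print(average_q(diverse), majority_vote_accuracy(diverse))      # -0.5 0.9
print(average_q(identical), majority_vote_accuracy(identical))  # 1.0 0.6
```

In this toy case the diverse ensemble votes far more accurately than any single member, while the identical ensemble gains nothing. Diversity can also spread errors so that the majority vote is wrong more often, which is why it is not always beneficial.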
Correlation Analysis
The Spearman correlation coefficient is a nonparametric measure of statistical dependence between two
variables, and it is insensitive to how the measures are scaled. We compute the correlation coefficients of the
single-class performance measures and the overall accuracy in two sampling ranges of r. The correlation coefficients
on the three data sets are positive, which shows that ensemble diversity for each class has the same changing
tendency as the overall diversity, regardless of whether the data set is balanced. On one hand, it guarantees that
increasing the classification diversity over the whole data set can increase diversity over each class. On the other
hand, it confirms that the diversity measure Q-statistic is not sensitive to imbalanced distributions.
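For readers who want to see the rank-based computation, here is a small self-contained sketch of the Spearman coefficient (Pearson correlation applied to average ranks). The function names and the numbers in `q_values` and `recall_values` are made up for illustration and are not results from the study.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical measurements from five ensembles (illustrative values only).
q_values = [0.9, 0.7, 0.5, 0.2, -0.1]
recall_values = [0.30, 0.42, 0.55, 0.61, 0.80]

rho = spearman(q_values, recall_values)
rho_rescaled = spearman(q_values, [100 * r for r in recall_values])
print(round(rho, 6), round(rho_rescaled, 6))  # -1.0 -1.0
```

Because only ranks enter the calculation, rescaling either measure leaves the coefficient unchanged, which is the scale insensitivity mentioned above.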
Impact of Ensemble Size
Because the ensemble size is important to the application of an ensemble, we look into how diversity and the
other performance measures change at different levels of ensemble size on the three artificial data sets. The
measures are affected by the ensemble size and by the differences among the training data with different imbalance
degrees. Instead of keeping a constant size of 15 classifiers for an ensemble model, we adjust the number of
decision trees from 5 to 955 in steps of 50. The sampling rate for training is set to a moderate value of 100
percent.
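The size grid described above can be written down directly. The sketch below also shows one possible reading of the sampling rate, namely the size of a bootstrap sample relative to the training set; that interpretation, the helper `bootstrap_sample`, and the placeholder `training_data` are assumptions made for illustration.

```python
import random

# Ensemble sizes from 5 to 955 in steps of 50: 5, 55, 105, ..., 955.
ensemble_sizes = list(range(5, 956, 50))

def bootstrap_sample(data, rate, rng):
    """Draw round(rate * len(data)) examples with replacement (assumed meaning of the rate)."""
    k = round(rate * len(data))
    return rng.choices(data, k=k)

rng = random.Random(0)            # fixed seed only to make the sketch repeatable
training_data = list(range(200))  # hypothetical stand-in for a training set
sample = bootstrap_sample(training_data, rate=1.0, rng=rng)

print(len(ensemble_sizes), ensemble_sizes[0], ensemble_sizes[-1])  # 20 5 955
print(len(sample))  # 200, the same size as the training set at a 100 percent rate
```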
Imbalanced Data
So far, we have studied the impact of diversity on single-class performance in depth through artificial data sets.
Now we ask whether the results are applicable to real-world domains. In this section, we report the correlation results for the
same research question on fifteen highly imbalanced real-world benchmarks. The data information is
summarized.
Single-Class Performance
The single-class performance should be our focus. For the minority class, recall has a very strong
negative correlation with Q in all cases; precision has a very strong positive correlation with Q in 12 out of 15
cases; the coefficients of F-measure do not show a consistent relationship, where 6 cases present positive
correlations and 5 cases present negative correlations. Since a smaller Q indicates greater diversity, the
observation suggests that increasing diversity identifies more minority-class examples at some loss of precision.
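The three single-class measures can be computed for either class from true and predicted labels. This is a minimal sketch; the function name, the label coding (1 = minority, 0 = majority), and the toy predictions are illustrative assumptions.

```python
def single_class_measures(y_true, y_pred, target):
    """Return (recall, precision, F-measure) for the class labeled `target`."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == target and p == target)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != target and p == target)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == target and p != target)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    if precision + recall == 0:
        return recall, precision, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return recall, precision, f_measure

# Toy data (made up): 4 minority examples followed by 16 majority examples.
y_true = [1] * 4 + [0] * 16
# The classifier finds 3 of the 4 minority examples and raises 3 false alarms.
y_pred = [1, 1, 1, 0] + [1, 1, 1] + [0] * 13

minority_scores = single_class_measures(y_true, y_pred, target=1)
majority_scores = single_class_measures(y_true, y_pred, target=0)
print(minority_scores)  # recall 0.75, precision 0.5, F-measure about 0.6
print(majority_scores)  # recall 0.8125, precision about 0.929
```

The toy numbers show the trade-off described above: labeling more examples as minority raises minority recall but costs minority precision and majority recall.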
Overall Performance
As we have explained, accuracy is not a good overall performance measure for class imbalance problems,
because it is strongly biased toward the majority class. Although the single-class measures we have discussed so far
better reflect the performance for one class, it is still necessary to evaluate how well a classifier can
balance the performance between classes. G-mean and AUC are better choices.
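To illustrate why G-mean and AUC are preferred, the sketch below computes both next to plain accuracy. It assumes the usual definitions: G-mean as the geometric mean of the two class recalls, and AUC as the probability that a randomly chosen minority example is scored above a randomly chosen majority example, with ties counting one half. Function names, label coding (1 = minority, 0 = majority), and all numbers are made up for illustration.

```python
import math

def class_recall(y_true, y_pred, target):
    """Recall of the class labeled `target`."""
    total = sum(1 for t in y_true if t == target)
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == target and p == target)
    return hits / total

def g_mean(y_true, y_pred):
    """Geometric mean of minority recall and majority recall."""
    return math.sqrt(class_recall(y_true, y_pred, 1) * class_recall(y_true, y_pred, 0))

def auc(y_true, scores):
    """AUC by pairwise comparison of minority scores against majority scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data (made up): 2 minority examples followed by 18 majority examples.
y_true = [1, 1] + [0] * 18

# A trivial classifier that labels everything as the majority class.
all_majority = [0] * 20
accuracy = sum(1 for t, p in zip(y_true, all_majority) if t == p) / len(y_true)
print(accuracy, g_mean(y_true, all_majority))  # 0.9 0.0

# Hypothetical scores for membership in the minority class.
scores = [0.8, 0.4] + [0.1] * 16 + [0.5, 0.6]
print(round(auc(y_true, scores), 4))  # 0.9444
```

The trivial classifier reaches 90 percent accuracy while recognizing no minority example at all, and its G-mean of zero exposes that failure.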
FLOW CHART
[Flow chart image not recoverable from the text. Surviving box labels: Imbalance Learning; Class Imbalance Learning; Minority Class; Ensemble Could; Classification Of Imbalanced; Diversity of Classification Ensembles; Single-Class Performance Measures.]
CONCLUSIONS
This paper studied the relationships between ensemble diversity and performance measures for class imbalance
learning, aiming at the following questions: What is the impact of diversity on single-class performance? Does diversity have a
positive effect on the classification of minority/majority class? We chose Q-statistic as the diversity measure
and considered three single-class performance measures including recall, precision, and F-measure. The
relationship with overall performance was also discussed empirically by examining G-mean and AUC for a
complete understanding. To answer the first question, we gave some mathematical links between Q-statistic and
the single-class measures. This part of work is based on Kuncheva et al.’s pattern analysis. We extended it to
the single-class context under specific classification patterns of ensemble and explained why we expect
diversity to have different impacts on minority and majority classes in class imbalance scenarios. Six possible
behaviors of the single-class measures with respect to the Q-statistic are obtained. For the second
question, we verified the measure behaviors empirically on a set of artificial and real-world imbalanced data
sets. We examined the impact of diversity on each class through correlation analysis. Strong correlations are
found. We show the positive effect of diversity in recognizing minority class examples and balancing recall
against precision of the minority class. Diversity degrades the classification performance of the majority class in
terms of recall and F-measure on real-world data sets. Diversity is beneficial to the overall performance in terms of
G-mean and AUC. Significant and consistent correlations found in this paper encourage us to take this step
further. We would like to explore in the future if and to what degree the existing class imbalance learning
methods can lead to improved diversity and contribute to the classification performance. We are interested in
the development of novel ensemble learning algorithms for class imbalance learning that can make best use of
our diversity analysis here, so that the importance of the minority class can be better considered. It is also
important in the future to consider class imbalance problems with more than two classes.
REFERENCES
[1] R.M. Valdovinos and J.S. Sanchez, “Class-Dependant Resampling for Medical Applications,” Proc. Fourth
Int’l Conf. Machine Learning and Applications (ICMLA ’05), pp. 351-356, 2005.
[2] T. Fawcett and F. Provost, “Adaptive Fraud Detection,” Data Mining and Knowledge Discovery, vol. 1, no.
3, pp. 291-316, 1997.
[3] K.J. Ezawa, M. Singh, and S.W. Norton, “Learning Goal Oriented Bayesian Networks for
Telecommunications Risk Management,” Proc. 13th Int’l Conf. Machine Learning, pp. 139-147, 1996.
[4] C. Cardie and N. Howe, “Improving Minority Class Prediction Using Case-Specific Feature Weights,” Proc.
14th Int’l Conf. Machine Learning, pp. 57-65, 1997.
[5] G.M. Weiss, “Mining with Rarity: A Unifying Framework,” ACM SIGKDD Explorations Newsletter, vol.
6, no. 1, pp. 7-19, 2004.
[6] S. Visa and A. Ralescu, “Issues in Mining Imbalanced Data Sets - A Review Paper,” Proc. 16th Midwest
Artificial Intelligence and Cognitive Science Conf., pp. 67-73, 2005.
[7] N. Japkowicz and S. Stephen, “The Class Imbalance Problem: A Systematic Study,” Intelligent Data
Analysis, vol. 6, no. 5, pp. 429-449, 2002.
[8] C. Li, “Classifying Imbalanced Data Using a Bagging Ensemble Variation,” Proc. 45th Ann. Southeast
Regional Conf. (ACM-SE 45), pp. 203-208, 2007.
[9] X.-Y. Liu, J. Wu, and Z.-H. Zhou, “Exploratory Undersampling for Class Imbalance Learning,” IEEE
Trans. Systems, Man, and Cybernetics, vol. 39, no. 2, pp. 539-550, Apr. 2009.

Relationships between diversity of classification ensembles and single class

  • 1. Relationships between Diversity of Classification Ensembles and Single-Class Performance Measures Abstract In class imbalance learning problems, how to better recognize examples from the minority class is the key focus, since it is usually more important and expensive than the majority class. Quite a few ensemble solutions have been proposed in the literature with varying degrees of success. It is generally believed that diversity in an ensemble could help to improve the performance of class imbalance learning. However, no study has actually investigated diversity in depth in terms of its definitions and effects in the context of class imbalance learning. It is unclear whether diversity will have a similar or different impact on the performance of minority and majority classes. In this paper, we aim to gain a deeper understanding of if and when ensemble diversity has a positive impact on the classification of imbalanced data sets. First, we explain when and why diversity measured by Q- statistic can bring improved overall accuracy based on two classification patterns proposed by Kuncheva et al. We define and give insights into good and bad patterns in imbalanced scenarios. Then, the pattern analysis is extended to single-class performance measures, including recall, precision, and Fmeasure, which are widely used in class imbalance learning. Six different situations of diversity’s impact on these measures are obtained through theoretical analysis. Finally, to further understand how diversity affects the single class performance and overall performance in class imbalance problems, we carry out extensive experimental studies on both artificial data sets and real-world benchmarks with highly skewed class distributions. 
We find strong GLOBALSOFT TECHNOLOGIES IEEE PROJECTS & SOFTWARE DEVELOPMENTS IEEE FINAL YEAR PROJECTS|IEEE ENGINEERING PROJECTS|IEEE STUDENTS PROJECTS|IEEE BULK PROJECTS|BE/BTECH/ME/MTECH/MS/MCA PROJECTS|CSE/IT/ECE/EEE PROJECTS CELL: +91 98495 39085, +91 99662 35788, +91 98495 57908, +91 97014 40401 Visit: www.finalyearprojects.org Mail to:ieeefinalsemprojects@gmail.com
  • 2. correlations between diversity and discussed performance measures. Diversity shows a positive impact on the minority class in general. It is also beneficial to the overall performance in terms of AUC and G-mean. Exixting System A typical imbalanced data set with two classes, one class is heavily under-represented compared to the other class that contains a relatively large number of examples. Class imbalance pervasively exists in many realworld applications, such as medical diagnosis fraud detection risk management text classification etc. Rare cases in these domains suffer from higher misclassification costs than common cases. It is a promising research area that has been drawing more and more attention in data mining and machine learning, since many standard machine learning algorithms have been reported to be less effective when dealing with this kind of problems. The fundamental issue to be resolved is that they tend to ignore or overfit the minority class. Hence, great research efforts have been made on the development of a good learning model that can predict rare cases more accurately to lower down the total risk. The difference of individual learners is interpreted as “diversity” in ensemble learning. It has been proved to be one of the main reasons for the success of ensembles from both theoretical and empirical aspects. To date, existing studies have discussed the relationship between diversity and overall accuracy. In class imbalance cases, however, the overall accuracy is not appropriate and less meaningful. Disadvantages If diversity is shown to be beneficial in imbalanced scenarios, it will suggest an alternative way of handling class imbalance problems by considering diversity explicitly in the learning process. explain why diversity is not always beneficial to the overall performance. Two arguments are proposed accordingly for the minority and majority classes of a class imbalance problem, respectively. 
Proposed System There is no agreed definition for diversity. Quite a few pairwise and nonpairwise diversity measures were proposed in the literature such as Q-statistic double-default measure entropy generalized diversity These attractive features lead to a variety of ensemble methods proposed to handle
  • 3. imbalanced data sets from the data and algorithm levels. the data level, sampling strategies are integrated into the training of each ensemble member. For instance, Li’s BEV and Chan and Stolfo combining model were proposed based on the idea of Bagging by undersampling the majority class examples and combining them with all the minority class examples to form balanced training subsets. SMOTEBoost and DataBoost-IM were designed to alter the imbalanced distribution based on Boosting. the classification characteristics of class imbalance learning into account. We first give some insight into the class imbalance problem from the view of base learning algorithms, such as decision trees and neural networks. Skewed class distributions and different misclassification costs make the classification difficulty mainly reflect in the overfitting to the minority class and the overgeneralization to the majority class, because the small class has less contribution to the classifier. Advantages The classification context, it is loosely described as “making errors on different examples” . Clearly, a set of identical classifiers does not bring any advantages. Ensemble composed of many of such classifiers, each classifier tends to label most of the data as the majority class. Artificial data sets and highly imbalanced real-world benchmarks are included in our experiments. The proceed with correlation analysis and present corresponding decision boundary plots. We also provide some insight intodiversity and performance measures at different levels of ensemble size. Module 1. Diversity And Overall Accuracy 2. Correlation Analysis 3. Impact of Ensemble Size 4. Imbalanced Data 5. Single-Class Performance 6. Overall Performance
  • 4. Module Description Diversity And Overall Accuracy A classification pattern refers to the voting combinations of the individual classifiers that an ensemble can have. The accuracy is given by the majority voting method of combining classifier decisions. First, two extreme patterns are defined, which present different effects of diversity. It is shown that diversity is not always beneficial to the generalization performance. The reason is then explained in a general pattern. According to the features of the patterns, we relate them to the classification of each class of a class imbalance problem, and propose two arguments for the minority and majority classes, respectively. Correlation Analysis The Spearman correlation coefficient is a nonparametric measure of statistical dependence between two variables, and insensitive to how the measures are scaled. the correlation coefficients of the singleclass performance measures and the overall accuracy in two sampling ranges of r. the three data sets are positive, which shows that ensemble diversity for each class has the same changing tendency as the overall diversity, regardless of whether the data set is balanced. On one hand, it guarantees that increasing the classification diversity over the whole data set can increase diversity over each class. On the other hand, it confirms that the diversity measure Q-statistic is not sensitive to imbalanced distributions. Impact of Ensemble Size The ensemble size is important to the application of an ensemble, we look into how diversity and the other performance measures change at different levels of ensemble size on the three artificial data sets. the measures are affected by the ensemble size and the differences among the training data with different imbalance degrees. Instead of keeping the constant size of 15 classifiers for an ensemble model, we adjust the number of decision trees from 5 to 955 with interval 50. 
In the ensemble-size experiments, the sampling rate for training is set to a moderate value of 100 percent.

Imbalanced Data

So far we have examined the impact of diversity on single-class performance in depth through artificial data sets. Now we ask whether the results are applicable to real-world domains. In this section, we report the correlation results for the same research question on fifteen highly imbalanced real-world benchmarks. The data information is summarized.

Single-Class Performance
  • 5. The single-class performance should be our focus. For the minority class, recall has a very strong negative correlation with Q in all cases; precision has a very strong positive correlation with Q in 12 out of 15 cases; the coefficients of F-measure do not show a consistent relationship, with 6 cases presenting positive correlations and 5 cases presenting negative correlations. This observation suggests that increasing diversity identifies more minority-class examples at some loss of precision.

Overall Performance

As we have explained, accuracy is not a good overall performance measure for class imbalance problems, because it is strongly biased toward the majority class. Although the single-class measures we have discussed so far better reflect the performance for one class, it is still necessary to evaluate how well a classifier can balance the performance between classes. G-mean and AUC are better choices.
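The measures named in these two modules can be computed directly from true and predicted labels. The sketch below uses the usual definitions of recall, precision, and F-measure for a chosen class, and takes G-mean as the geometric mean of the per-class recalls. The function names and the convention of returning 0.0 when a ratio is undefined are assumptions of this example; AUC is omitted because it needs classifier scores rather than hard labels.

```python
def single_class_measures(y_true, y_pred, positive):
    """Recall, precision, and F-measure for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Assumption: undefined ratios (zero denominators) are reported as 0.0.
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return recall, precision, f_measure


def g_mean(y_true, y_pred, classes):
    """Geometric mean of the recalls of all listed classes."""
    product = 1.0
    for label in classes:
        recall, _, _ = single_class_measures(y_true, y_pred, label)
        product *= recall
    return product ** (1.0 / len(classes))
```

Because G-mean multiplies the recalls, a classifier that labels everything as the majority class gets a G-mean of 0 even though its accuracy is high, which is why it suits imbalanced problems better than accuracy.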
  • 6. FLOW CHART (figure not reproduced; recoverable box labels: Class Imbalance Learning, Minority Class, Classification of Imbalanced, Diversity of Classification Ensembles, Single-Class Performance Measures)
  • 7. CONCLUSIONS

This paper studied the relationships between ensemble diversity and performance measures for class imbalance learning, aiming at the following questions: what is the impact of diversity on single-class performance? Does diversity have a positive effect on the classification of the minority/majority class? We chose Q-statistic as the diversity measure and considered three single-class performance measures: recall, precision, and F-measure. The relationship with overall performance was also discussed empirically by examining G-mean and AUC for a complete understanding.

To answer the first question, we gave some mathematical links between Q-statistic and the single-class measures. This part of the work is based on Kuncheva et al.’s pattern analysis. We extended it to the single-class context under specific classification patterns of an ensemble and explained why we expect diversity to have different impacts on minority and majority classes in class imbalance scenarios. Six possible situations for the behavior of the single-class measures with respect to Q-statistic are obtained.

For the second question, we verified the measure behaviors empirically on a set of artificial and real-world imbalanced data sets. We examined the impact of diversity on each class through correlation analysis. Strong correlations are found. We show the positive effect of diversity in recognizing minority class examples and balancing recall against precision of the minority class. Diversity degrades the classification performance of the majority class in terms of recall and F-measure on real-world data sets. Diversity is beneficial to the overall performance in terms of G-mean and AUC.

The significant and consistent correlations found in this paper encourage us to take this step further. We would like to explore in the future if and to what degree the existing class imbalance learning methods can lead to improved diversity and contribute to the classification performance.
We are interested in the development of novel ensemble learning algorithms for class imbalance learning that can make the best use of our diversity analysis here, so that the importance of the minority class can be better considered. It is also important in the future to consider class imbalance problems with more than two classes.

REFERENCES

[1] R.M. Valdovinos and J.S. Sanchez, “Class-Dependant Resampling for Medical Applications,” Proc. Fourth Int’l Conf. Machine Learning and Applications (ICMLA ’05), pp. 351-356, 2005.
[2] T. Fawcett and F. Provost, “Adaptive Fraud Detection,” Data Mining and Knowledge Discovery, vol. 1, no. 3, pp. 291-316, 1997.
  • 8. [3] K.J. Ezawa, M. Singh, and S.W. Norton, “Learning Goal Oriented Bayesian Networks for Telecommunications Risk Management,” Proc. 13th Int’l Conf. Machine Learning, pp. 139-147, 1996.
[4] C. Cardie and N. Howe, “Improving Minority Class Prediction Using Case-Specific Feature Weights,” Proc. 14th Int’l Conf. Machine Learning, pp. 57-65, 1997.
[5] G.M. Weiss, “Mining with Rarity: A Unifying Framework,” ACM SIGKDD Explorations Newsletters, vol. 6, no. 1, pp. 7-19, 2004.
[6] S. Visa and A. Ralescu, “Issues in Mining Imbalanced Data Sets - A Review Paper,” Proc. 16th Midwest Artificial Intelligence and Cognitive Science Conf., pp. 67-73, 2005.
[7] N. Japkowicz and S. Stephen, “The Class Imbalance Problem: A Systematic Study,” Intelligent Data Analysis, vol. 6, no. 5, pp. 429-449, 2002.
[8] C. Li, “Classifying Imbalanced Data Using a Bagging Ensemble Variation,” Proc. 45th Ann. Southeast Regional Conf. (ACM-SE 45), pp. 203-208, 2007.
[9] X.-Y. Liu, J. Wu, and Z.-H. Zhou, “Exploratory Undersampling for Class Imbalance Learning,” IEEE Trans. Systems, Man, and Cybernetics, vol. 39, no. 2, pp. 539-550, Apr. 2009.