SlideShare a Scribd company logo
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 06 Issue: 04 | Apr 2019 www.irjet.net p-ISSN: 2395-0072
© 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 3708
Predicting Customers Churn in Telecom Industry using Centroid
Oversampling method and KNN classifier
Pragya Joshi
Department of Computer Engineering
Shri G.S. Institute of Tech. and Science
23, Sir Visvesvaraya Marg, Indore (MP)
Email: joshi.pragya@gmail.com
Abstract— Subscriber retention is a key issue for several
service-based organizations mainly in telecom industry, where
detecting customer’s behaviour is one of the main interest.
Customer churn occurs when an existing subscriber stops doing
business or ends the relationship with a company. Since the
proportion of churned customers are less as compared to the
whole subscriber base, it leads to imbalanced data distribution.
To improve the classification performance of imbalanced data
learning, a novel over-sampling method, Centroid Oversampling
Technique, based on centroid of three nearest neighbor points, is
proposed. It generates a set of representative data points to
broaden the decision regions of small class spaces. The
representative centroids are regarded as synthetic examples in
order to resolve the imbalanced class problem. This approach
addresses the problem of synthetic minority oversampling
techniques, which overlaps the reflection of majority classes on
training examples. Our comprehensive experimental results show
that Centroid Oversampling Technique can achieve better
performance than renowned multi-class oversampling methods.
Index-terms - Class Imbalance, Oversampling Techniques,
Centroid method, K Nearest Neighbor.
I. INTRODUCTION
Techniques for handling class-imbalance problem can be
broadly categorized into two sets basis their evaluation pro-
cedure, Oversampling and Undersampling. In oversampling,
some of the data points of minority class are replicated to
generate synthetic samples to form balanced dataset. Under-
sampling reduces the number of observations from majority
class to balance the dataset, but removing observations might
cause the training data to miss the significant information
pertaining to majority class.
Best known oversampling techniques are SMOTE, ADASYN,
MTDF. Synthetic Minority Oversampling Technique SMOTE[3]
is considered as the most popular technique for handling CIP,
as It produces new minority data points based on k-nearest
neighbor(kNN), giving positive observations.
A major concern of SMOTE[3] is that the class with frequent
samples dominate the neighboring data points of a test instance
in-spite of distance measurements which leads to sub-optimal
classification performance on the minority class. In this
proposed approach, a modified oversampling technique is
introduced which uses the concept of generating synthetic data
samples by calculating centroid of three nearest points (kNN).
The main objective of this approach is
Surendra Gupta
Department of Computer Engineering
Shri G.S. Institute of Tech. and Science
23, Sir Visvesvaraya Marg, Indore (MP)
Email: sgupta@sgsits.ac.in
to correct the inherent bias to majority class on any distance
measurement. So, the proposed work is aimed at modifying
the existing oversampling technique to enhance the
accuracy in predicting customer churn so that special tariffs
can be assigned to retain them.
The main aim of this approach is to minimise the miss-
classified errors while retaining a suitably high accuracy in
representing the original features. Many machine learning
algorithms yield low performance while dealing with large
imbalanced datasets. Imbalanced classes can cause different
classification approaches to face difficulties in learning,
which can results in poor performance.
Therefore, in this paper, a technique has been proposed i.e.
centroid oversampling technique. The basic idea of this section
is to balance the class distribution using centroid method and
kNN approach. In addition, we reduce the number of features
and remove the not related, unnecessary or noisy information.
Besides, this improves the presentation of information
prediction with speeding up the processing algorithm. To
improve the prediction accuracy, use PCA algorithm for
dimension reduction.
This paper is ordered as follows. In section II, an overview of
work already done in this area with their advantages and
disadvantages is discussed. In the section, III the proposed
approach is presented. Section IV presents the details of
experimental analysis of the result. Conclusion and future work
is discussed in section V.
II. RELATED WORK
Substantial work has been carried out by earlier re-searchers
in the field of classification techniques for Imbalanced datasets,
handling miss-classified errors and related areas. Some of the
most relevant techniques are discussed in this section.
SMOTE[3], Synthetic Minority Over-sampling method, is an
oversampling method which uses the concept of creating new
minority class samples by incorporating several minority class
cases that lie close to each other. It is based on KNN approach, it
chooses K nearest neighbors, merge them and generates new
synthetic samples. The algorithm uses the feature vectors and
its nearest neighbors, calculates the distance among these
vectors. The variance is then multiplied by random number
among (0, 1) and added back to feature.
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 06 Issue: 04 | Apr 2019 www.irjet.net p-ISSN: 2395-0072
© 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 3709
SMOTE algorithm is a pioneer algorithm and countless other
algorithms are derivative of SMOTE.
SMOTE[3] is an oversampling method which yields “synthetic”
data points basis distance measure between two given data points.
Oversampling is done using each minority class samples, generating
new data examples which connects the nearest neighbors of that
class samples. Neighbors are chosen randomly basis the ratio of
oversampling being done.
The key most objective of oversampling techniques is to
strengthen the minority class. The motive behind this algorithm is
quite simple. Oversampling causes overfitting, and because of
recurrent data instances, the decision boundary gets constricted. So,
in SMOTE[3] paper, it has been shown that these newly constructed
data samples are not exact replicas, but new points and thus it
relaxes the decision boundary, thereby helping the algorithm to
estimate the theory more precisely.
However, some advantages and disadvantages of SMOTE are
listed below.
Advantages: 1.Overcomes over-fitting by generating synthetic
examples rather than recurrent data points. 2. Information about all
the data instances remain intact. 3. Implementation and
Interpretation is quite simple and understandable.
Disadvantages: 1. overlapping of neighboring classes can
generate additional noise. Since SMOTE does not realise
neighbouring data samples can be from different classes also. 2. Not
recommended for high dimensional data, so SMOTE is not
practically used for high dimensional data sets.
ADASYN[4] (Adaptive Synthetic) is an oversampling technique in
which class weights are distributed for numerous minority data
samples, which are being used based on their struggle level in
learning. Most of the synthetic samples are generated for classes
which are difficult to train as compared to other class samples
which can be easily learned. It is an improved version of Smote.
After creating those samples, it adds a random weight to the points
thus making it more realistic. In other words instead of all the
sample being linearly correlated to the parent they have a little
more variance in them i.e they are bit scattered.
ADASYN[4] advances learning w.r.t. data distribution in below
ways:
(1) Decreasing the bias lead by the imbalanced class.(2)
Attentively moving the decision boundary of classification algorithm
near the most problematic examples.
Based on previous research work, it has been observed that
SMOTE[3] is one of the promising approach which is mostly used for
creating new rules in classification problems. This approach can be
widely used in the applications of knowledge discovery, data
mining, and Telecom domain. But researches can be done in the
context of testing this method using bulky databases and to compare
this approach with different existing approaches.
III. PROPOSED METHOD
The Proposed approach is based on Centroid oversampling technique
to generate synthetic samples by using kNN algorithm to identify the
nearest neighbors and compute centroid of these nearest data
points in the space.
Existing SMOTE[3] technique has a limitation while working with
imbalanced dataset,i.e. the classes having frequent data examples
overrules neighbors of testing instances despite of distance metrics.
The proposed approach is aimed to resolve this issue by
generating centroid between the nearest data points rather than on
the outlying areas. In this way, synthetic samples are uniformly
distributed among the original data points which reduces the
probability of selecting outlier in the data space.
The most important benefit of this proposed approach is its
capability of correcting internal bias to majority classes of present
kNN algorithm. So, this approach is aimed to produce better data
samples than the existing techniques, and hence better classification
results can be achieved.
Architecture of the projected system is illustrated in figure
I.
Fig. 1: Architecture Diagram of Proposed Approach
A. Data Pre-processing
The system takes customer churn related dataset as an input. It
reads the dataset, and preprocess the categorical or non-numerical
attributes and missing values so as to make the data appropriate for
further processing.
B. Data Modelling and Validation
Apply 9-fold cross-validation to calculate the learning abilities of
classification model on different data sets. A statistical method
called Cross-validation is being used to estimate the learning skills
of various models. It is mainly applied in machine learning
techniques to evaluate a model for predictive learning and
modelling issues.
Apply 9- fold cross-validation to build a model on 75% of original
data and reserve 25% on testing purpose.
Original file gets divided into testing and training sub-files and so
that further task can be applied on each set comprising of test and
train dataset.
C. Data Oversampling
Centroid method is used to oversample the dataset. After first
iteration of 9-fold cross-validation, oversampling is applied on 75%
of original dataset, i.e. testing set. Detailed algorithm is explained
here:
 Centroid Oversampling method is based on calculating the median
of a triangle or three points.
 Select three nearest neighbor points from the given dataset.
 Calculate the median of these three points.
 The output should be a centroid.
 Continue with the next three neighbors and generate centroid in the
same way.
By generating centroids of nearest points, original dataset is
oversampled by 100%. The oversampled dataset is then passed to
the classification module to train the classification model.
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 06 Issue: 04 | Apr 2019 www.irjet.net p-ISSN: 2395-0072
© 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 3710
Algorithm 1 Centroid Oversampling Algorithm
Input: P={p1,p2....pn} Training positive examples.
n: Number of minority examples.
N: No. of synthetic data points to be generated for each positive data
point.
k: 3
Output: Set of artificial instances.
Procedure:
1: Preprocess dataset to change the categorical attributes into
numerical one and traverse the dataset (each tuple) to preprocess
the missing values.
2: for i = 1 to min
3: Find the two nearest neighbors of pi and store their indexes in
an array.
4: end for
5: for m in indices
6: Generate centroid of three points corresponding to their index
values.
7: newSample=(a[m]+a[m+1]+a[m+2])/3
8: m=m+1
9: Add new samples one by one to form Synthetic array.
10: Synthetic newSample
11: end for
D. Classification Module
1) Load the Testing dataset generated after cross fold validation
on which oversampling is required to be done.
2) Use KNN (K-Nearest Neighbor) for classification.
KNN algorithms have been recognized as one of the top most
powerful data mining algorithms for their capability of creating
simple yet powerful classifiers. The k neighbors nearby a test
instances are usually called prototypes.
Salient features of kNN are:
 Most Powerful Data mining algorithm.
 It is exceptionally easy to design and implement.
 Since the algorithm needs no training afore creation of
predictions, new data can be further added seamlessly.
 Only two parameters are essential to implement KNN i.e. the
value of K and the distance function (e.g. Euclidean or Manhattan
etc.)
 Capability of constructing simple but dominant classifiers.
 KNN can be implemented for both classification and
regression predictive problems.
 It is lazy learning algorithm and thus, needs no training
preceding for constructing real time predictions.
This marks the KNN algorithm quicker than rest of the algorithms,
which require training e.g SVM, linear regression, etc.
Classification module takes resultant dataset as an input. Dataset
is segregated into two sets: training set and testing set for applying
classification. Classifier is trained using training set and using
testing set its accuracy is computed.
E. Performance Evaluation Module
There are many parameters available which can be used to check
performance of classification. The parameters used in the project to
check classification performance are :
Accuracy : Ratio of accurate prediction to the total no of prediction
in a model.
Recall : It is considered as a measure of completeness. It tells about
how many positive class samples are correctly classified.
Recall : Recall = sensitivity
Precision : It is a measure of exactness. It can be defined as, from the
samples labelled as positive, how many actually belongs to positive
class.
F-measure : Harmonic mean between recall and preci-sion.
Fig. 2: Flow Diagram of Proposed Approach
IV. TESTING AND RESULTS
Proposed approach is implemented using Python. It is
evaluated on three datasets, mainly on telecom churn datasets
which are taken from UCI repository and Kaggle.
A comparison is made on proposed model by applying KNN
classifier on the original data and data generated by the
proposed centroid oversampling technique.
TABLE I: THE COMPARISON OF CLASSIFICATION
ACCURACIES FOR CUSTOMER-CHURN DATASET
K=3
Parameters KNN SMOTE+KNN Centroid+KNN
Accuracy 0.92 0.91 0.94
Recall 0.75 0.91 0.95
Precision 0.87 0.84 0.82
F-measure 0.65 0.91 0.89
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 06 Issue: 04 | Apr 2019 www.irjet.net p-ISSN: 2395-0072
© 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 3711
K=5
Parameters KNN SMOTE+KNN Centroid+KNN
Accuracy 0.90 0.87 0.91
Recall 0.75 0.88 0.90
Precision 0.84 0.79 0.77
F-measure 0.53 0.87 0.83
K=7
Parameters KNN SMOTE+KNN Centroid+KNN
Accuracy 0.9 0.86 0.90
Recall 0.67 0.87 0.88
Precision 0.85 0.78 0.76
F-measure 0.49 0.86 0.80
TABLE III: THE COMPARISON OF CLASSIFI-
CATION ACCURACIES FOR CUSTOMER-CHURN-
MODELLING DATASET
K=3
Parameters KNN SMOTE+KNN Centroid+KNN
Accuracy 0.84 0.87 0.87
Recall 0.67 0.85 0.87
Precision 0.70 0.83 0.79
F-measure 0.50 0.89 0.82
K=5
Parameters KNN SMOTE+KNN Centroid+KNN
Accuracy 0.81 0.81 0.82
Recall 0.59 0.79 0.80
Precision 0.63 0.77 0.73
F-measure 0.32 0.84 0.74
K=7
Parameters KNN SMOTE+KNN Centroid+KNN
Accuracy 0.80 0.79 0.80
Recall 0.55 0.76 0.78
Precision 0.59 0.75 0.69
F-measure 0.21 0.83 0.71
VI. VISUALIZATION OF THE RESULT
Visualization graph has been plotted on the density
distribution of available datasets where the data points of
various features are plotted before and after applying the
centroid oversampling method on churn.csv dataset.
TABLE II: THE COMPARISON OF CLASSIFI-
CATION ACCURACIES FOR TELCO-CUSTOMER-
CHURN DATASET
K=3
Parameters KNN SMOTE+KNN Centroid+KNN
Accuracy 0.92 0.88 0.94
Recall 0.75 0.84 0.95
Precision 0.87 0.86 0.82
F-measure 0.65 0.91 0.89
K=5
Parameters KNN SMOTE+KNN Centroid+KNN
Accuracy 0.79 0.83 0.81
Recall 0.67 0.77 0.82
Precision 0.69 0.81 0.76
F-measure 0.52 0.88 0.79
K=7
Parameters KNN SMOTE+KNN Centroid+KNN
Accuracy 0.77 0.80 0.78
Recall 0.63 0.74 0.79
Precision 0.63 0.78 0.71
F-measure 0.43 0.86 0.76
V. STATISTICAL ANALYSIS OF THE RESULT
From the table I,II and III, it has been observed that the
classification accuracy for Centroid oversampling method is
best as compared to SMOTE as well as KNN for the value of
K=3.
Fig. 3: Density distribution graph of Churn dataset before and
after applying centroid oversampling method
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 06 Issue: 04 | Apr 2019 www.irjet.net p-ISSN: 2395-0072
© 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 3712
Fig. 4: Data Statics before and after applying Centroid
Oversampling Technique
VII. CONCLUSION
The outcome of the project is focused on finding the customer
churn who are going to discontinue the telecom services.
Determining customer’s nature empower companies to boost
support of the customers and boost the gross performance of the
industry. Expected result shows the increase in classification
accuracy of proposed model as compared to the base model. Once
the best oversampling method is identified for a particular dataset,
the same can be used to enhance the classifier accuracy. The
proposed system shows the better performance for different
datasets than the existing oversampling techniques.
VIII. FUTURE WORK
Accuracy of many classification algorithms depend on the
behavior of the dataset being analysed. For imbalanced
datasets, main focus areas are the performance measures and
density distribution of data. The proposed approach is based on
binary classification of data, so further research can be done with
muti-class classification also. Future scope can be in the direction to
implement the technique on high dimensional data and to combine
the centroid oversampling method with different classifiers to
enhance the classification parameters. This method can be
efficiently used in the areas of image processing and other data
mining approaches.
REFERENCES
[1] Adnan Amin , Sajid Anwar , Awais Adnan , Muhammad Nawaz , New-ton
Howard , Junaid Qadir , (senior Member, Ieee), Ahmad Hawalah4 , And Amir
Hussain , (Senior Member, IEEE), ”Comparing Oversam-pling Techniques to
Handle the Class Imbalance Problem”,October 26, 2016.
[2] J. Burez and D. Van den Poel, ”Handling class imbalance in customer
churn prediction”, Expert Syst. Appl.vol. 36, no. 3, pp. 4626–4636, 2009.
[3] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer,
”SMOTE:Synthetic minority over-sampling technique,” J. Artif. Intell. Res., vol.
16, no. 1, pp. 321–357, 2002.
[4] H. He, Y. Bai, E. A. Garcia, and S. Li, ”ADASYN: Adaptive synthetic
sampling approach for imbalanced learning”. IEEE Int.Joint Conf. Neural
Network,Jun. 2008,pp. 1322–1328
[5] Lian Yan,Richard H. Wolniewicz and Robert Dodier, ”Predicting
Customer Behavior in Telecommunications”. IEEE Intelligent Systems
Volume: 19 , Issue: 2 , Mar-Apr 2004
[6] Jianping Gou, Zhang Yi ,Lan Du, Taisong Xiong ”A Local Mean-Based
k-Nearest Centroid Neighbor Classifier” The Computer Journal archive
Volume 55 Issue 9, September 2012 Pages 1058-1071
[7] Han, E., Karypis, G.: ”Centroid-based document classification.”In:
˙
Zighed, D.A., Komorowski, J., Zytkow, J.M. (eds.) PKDD 2000. LNCS (LNAI),
vol. 1910, pp. 116–123. Springer, Heidelberg (2000)
[8] Kiran Dahiya,Surbhi Bhatia ”Customer churn analysis in telecom
industry” IEEE-International Conference on Reliability, Infocom Tech-
nologies and Optimization (ICRITO) (Trends and Future Directions) 17
December 2015
[9] V.B.Surya Prasatha,, Haneen Arafat Abu Alfeilatb, Omar Lasass-
mehb,Ahmad B.A.Hassanat, ”Distance and Similarity Measures Effect on the
Performance of K-Nearest Neighbor Classifier- A Review”, 2017
[10] Liu, Z.G.; Pan, Q.; Dezert, J. ” A new belief-based K-nearest neighbor
classification method. Pattern Recognition” 2013, 46, 834–844
[11] Wei Liu and Sanjay Chawla,”Class Confidence Weighted kNN Algo-
rithms for Imbalanced Data Sets”, PAKDD 2011 LNCS (LNAI), vol 6635, pp
345-356. Springer, Heidelberg (2011)
[12] Asuncion, A., Newman, D.: ”UCI Machine Learning Repository”
[13] Demsar,ˇ J.: ”Statistical comparisons of classifiers over multiple data
sets. The Journal of Machine Learning Research 7”, 1–30 (2006)

More Related Content

PDF
A novel ensemble modeling for intrusion detection system
PDF
Fuzzy logic applications for data acquisition systems of practical measurement
PDF
direct marketing in banking using data mining
PDF
Handling Imbalanced Data: SMOTE vs. Random Undersampling
PDF
A new model for iris data set classification based on linear support vector m...
PDF
COMPARATIVE ANALYSIS OF MINUTIAE BASED FINGERPRINT MATCHING ALGORITHMS
PDF
PDF
A Novel Feature Selection with Annealing For Computer Vision And Big Data Lea...
A novel ensemble modeling for intrusion detection system
Fuzzy logic applications for data acquisition systems of practical measurement
direct marketing in banking using data mining
Handling Imbalanced Data: SMOTE vs. Random Undersampling
A new model for iris data set classification based on linear support vector m...
COMPARATIVE ANALYSIS OF MINUTIAE BASED FINGERPRINT MATCHING ALGORITHMS
A Novel Feature Selection with Annealing For Computer Vision And Big Data Lea...

What's hot (20)

PDF
Accounting for variance in machine learning benchmarks
PDF
When deep learners change their mind learning dynamics for active learning
PDF
Enactment Ranking of Supervised Algorithms Dependence of Data Splitting Algor...
PDF
A chi-square-SVM based pedagogical rule extraction method for microarray data...
PDF
A Simple Segmentation Approach for Unconstrained Cursive Handwritten Words in...
PDF
F017533540
PDF
A1802050102
PDF
Survey on semi supervised classification methods and
PDF
Survey on semi supervised classification methods and feature selection
PDF
A Formal Machine Learning or Multi Objective Decision Making System for Deter...
PDF
M43016571
PDF
A MIXTURE MODEL OF HUBNESS AND PCA FOR DETECTION OF PROJECTED OUTLIERS
PDF
Unsupervised Feature Selection Based on the Distribution of Features Attribut...
PDF
K-Medoids Clustering Using Partitioning Around Medoids for Performing Face Re...
PDF
Multi Label Spatial Semi Supervised Classification using Spatial Associative ...
PDF
IRJET - Enhanced Movie Recommendation Engine using Content Filtering, Collabo...
PDF
Possibility fuzzy c means clustering for expression invariant face recognition
PDF
IRJET- Expert Independent Bayesian Data Fusion and Decision Making Model for ...
PPTX
A survey on methods and applications of meta-learning with GNNs
Accounting for variance in machine learning benchmarks
When deep learners change their mind learning dynamics for active learning
Enactment Ranking of Supervised Algorithms Dependence of Data Splitting Algor...
A chi-square-SVM based pedagogical rule extraction method for microarray data...
A Simple Segmentation Approach for Unconstrained Cursive Handwritten Words in...
F017533540
A1802050102
Survey on semi supervised classification methods and
Survey on semi supervised classification methods and feature selection
A Formal Machine Learning or Multi Objective Decision Making System for Deter...
M43016571
A MIXTURE MODEL OF HUBNESS AND PCA FOR DETECTION OF PROJECTED OUTLIERS
Unsupervised Feature Selection Based on the Distribution of Features Attribut...
K-Medoids Clustering Using Partitioning Around Medoids for Performing Face Re...
Multi Label Spatial Semi Supervised Classification using Spatial Associative ...
IRJET - Enhanced Movie Recommendation Engine using Content Filtering, Collabo...
Possibility fuzzy c means clustering for expression invariant face recognition
IRJET- Expert Independent Bayesian Data Fusion and Decision Making Model for ...
A survey on methods and applications of meta-learning with GNNs
Ad

Similar to IRJET- Predicting Customers Churn in Telecom Industry using Centroid Oversampling Method and KNN Classifier (20)

PDF
Intrusion Detection System using K-Means Clustering and SMOTE
PDF
ENACTMENT RANKING OF SUPERVISED ALGORITHMS DEPENDENCE OF DATA SPLITTING ALGOR...
PDF
AN IMPROVED CTGAN FOR DATA PROCESSING METHOD OF IMBALANCED DISK FAILURE
PDF
An advance extended binomial GLMBoost ensemble method with synthetic minorit...
PDF
A Threshold fuzzy entropy based feature selection method applied in various b...
PDF
Analysis of Common Supervised Learning Algorithms Through Application
PDF
ANALYSIS OF COMMON SUPERVISED LEARNING ALGORITHMS THROUGH APPLICATION
PDF
SAMPLING BASED APPROACHES TO HANDLE IMBALANCES IN NETWORK TRAFFIC DATASET FOR...
PDF
Analysis of Common Supervised Learning Algorithms Through Application
PDF
Survey on Feature Selection and Dimensionality Reduction Techniques
PPT
Neural networks for the prediction and forecasting of water resources variables
PDF
EMPIRICAL APPLICATION OF SIMULATED ANNEALING USING OBJECT-ORIENTED METRICS TO...
PPTX
Deep learning(UNIT 3) BY Ms SURBHI SAROHA
PDF
IRJET- Missing Data Imputation by Evidence Chain
PDF
IRJET - An User Friendly Interface for Data Preprocessing and Visualizati...
PDF
Detection of Outliers in Large Dataset using Distributed Approach
PDF
The International Journal of Engineering and Science (The IJES)
PDF
Oversampling technique in student performance classification from engineering...
PDF
A NOVEL SCHEME FOR ACCURATE REMAINING USEFUL LIFE PREDICTION FOR INDUSTRIAL I...
PDF
A NOVEL SCHEME FOR ACCURATE REMAINING USEFUL LIFE PREDICTION FOR INDUSTRIAL I...
Intrusion Detection System using K-Means Clustering and SMOTE
ENACTMENT RANKING OF SUPERVISED ALGORITHMS DEPENDENCE OF DATA SPLITTING ALGOR...
AN IMPROVED CTGAN FOR DATA PROCESSING METHOD OF IMBALANCED DISK FAILURE
An advance extended binomial GLMBoost ensemble method with synthetic minorit...
A Threshold fuzzy entropy based feature selection method applied in various b...
Analysis of Common Supervised Learning Algorithms Through Application
ANALYSIS OF COMMON SUPERVISED LEARNING ALGORITHMS THROUGH APPLICATION
SAMPLING BASED APPROACHES TO HANDLE IMBALANCES IN NETWORK TRAFFIC DATASET FOR...
Analysis of Common Supervised Learning Algorithms Through Application
Survey on Feature Selection and Dimensionality Reduction Techniques
Neural networks for the prediction and forecasting of water resources variables
EMPIRICAL APPLICATION OF SIMULATED ANNEALING USING OBJECT-ORIENTED METRICS TO...
Deep learning(UNIT 3) BY Ms SURBHI SAROHA
IRJET- Missing Data Imputation by Evidence Chain
IRJET - An User Friendly Interface for Data Preprocessing and Visualizati...
Detection of Outliers in Large Dataset using Distributed Approach
The International Journal of Engineering and Science (The IJES)
Oversampling technique in student performance classification from engineering...
A NOVEL SCHEME FOR ACCURATE REMAINING USEFUL LIFE PREDICTION FOR INDUSTRIAL I...
A NOVEL SCHEME FOR ACCURATE REMAINING USEFUL LIFE PREDICTION FOR INDUSTRIAL I...
Ad

More from IRJET Journal (20)

PDF
Enhanced heart disease prediction using SKNDGR ensemble Machine Learning Model
PDF
Utilizing Biomedical Waste for Sustainable Brick Manufacturing: A Novel Appro...
PDF
Kiona – A Smart Society Automation Project
PDF
DESIGN AND DEVELOPMENT OF BATTERY THERMAL MANAGEMENT SYSTEM USING PHASE CHANG...
PDF
Invest in Innovation: Empowering Ideas through Blockchain Based Crowdfunding
PDF
SPACE WATCH YOUR REAL-TIME SPACE INFORMATION HUB
PDF
A Review on Influence of Fluid Viscous Damper on The Behaviour of Multi-store...
PDF
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
PDF
Explainable AI(XAI) using LIME and Disease Detection in Mango Leaf by Transfe...
PDF
BRAIN TUMOUR DETECTION AND CLASSIFICATION
PDF
The Project Manager as an ambassador of the contract. The case of NEC4 ECC co...
PDF
"Enhanced Heat Transfer Performance in Shell and Tube Heat Exchangers: A CFD ...
PDF
Advancements in CFD Analysis of Shell and Tube Heat Exchangers with Nanofluid...
PDF
Breast Cancer Detection using Computer Vision
PDF
Auto-Charging E-Vehicle with its battery Management.
PDF
Analysis of high energy charge particle in the Heliosphere
PDF
A Novel System for Recommending Agricultural Crops Using Machine Learning App...
PDF
Auto-Charging E-Vehicle with its battery Management.
PDF
Analysis of high energy charge particle in the Heliosphere
PDF
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
Enhanced heart disease prediction using SKNDGR ensemble Machine Learning Model
Utilizing Biomedical Waste for Sustainable Brick Manufacturing: A Novel Appro...
Kiona – A Smart Society Automation Project
DESIGN AND DEVELOPMENT OF BATTERY THERMAL MANAGEMENT SYSTEM USING PHASE CHANG...
Invest in Innovation: Empowering Ideas through Blockchain Based Crowdfunding
SPACE WATCH YOUR REAL-TIME SPACE INFORMATION HUB
A Review on Influence of Fluid Viscous Damper on The Behaviour of Multi-store...
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
Explainable AI(XAI) using LIME and Disease Detection in Mango Leaf by Transfe...
BRAIN TUMOUR DETECTION AND CLASSIFICATION
The Project Manager as an ambassador of the contract. The case of NEC4 ECC co...
"Enhanced Heat Transfer Performance in Shell and Tube Heat Exchangers: A CFD ...
Advancements in CFD Analysis of Shell and Tube Heat Exchangers with Nanofluid...
Breast Cancer Detection using Computer Vision
Auto-Charging E-Vehicle with its battery Management.
Analysis of high energy charge particle in the Heliosphere
A Novel System for Recommending Agricultural Crops Using Machine Learning App...
Auto-Charging E-Vehicle with its battery Management.
Analysis of high energy charge particle in the Heliosphere
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...

Recently uploaded (20)

PPTX
Geodesy 1.pptx...............................................
PPTX
Sustainable Sites - Green Building Construction
PDF
Digital Logic Computer Design lecture notes
PPTX
Lecture Notes Electrical Wiring System Components
PPT
Project quality management in manufacturing
PPTX
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
PDF
Operating System & Kernel Study Guide-1 - converted.pdf
PDF
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
PPTX
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx
DOCX
573137875-Attendance-Management-System-original
PPTX
Foundation to blockchain - A guide to Blockchain Tech
PDF
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
PPTX
additive manufacturing of ss316l using mig welding
PPTX
KTU 2019 -S7-MCN 401 MODULE 2-VINAY.pptx
PDF
composite construction of structures.pdf
PPTX
web development for engineering and engineering
PDF
Automation-in-Manufacturing-Chapter-Introduction.pdf
PPTX
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
PPTX
CYBER-CRIMES AND SECURITY A guide to understanding
PPTX
Welding lecture in detail for understanding
Geodesy 1.pptx...............................................
Sustainable Sites - Green Building Construction
Digital Logic Computer Design lecture notes
Lecture Notes Electrical Wiring System Components
Project quality management in manufacturing
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
Operating System & Kernel Study Guide-1 - converted.pdf
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx
573137875-Attendance-Management-System-original
Foundation to blockchain - A guide to Blockchain Tech
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
additive manufacturing of ss316l using mig welding
KTU 2019 -S7-MCN 401 MODULE 2-VINAY.pptx
composite construction of structures.pdf
web development for engineering and engineering
Automation-in-Manufacturing-Chapter-Introduction.pdf
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
CYBER-CRIMES AND SECURITY A guide to understanding
Welding lecture in detail for understanding

IRJET- Predicting Customers Churn in Telecom Industry using Centroid Oversampling Method and KNN Classifier

  • 1. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 06 Issue: 04 | Apr 2019 www.irjet.net p-ISSN: 2395-0072 © 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 3708 Predicting Customers Churn in Telecom Industry using Centroid Oversampling method and KNN classifier Pragya Joshi Department of Computer Engineering Shri G.S. Institute of Tech. and Science 23, Sir Visvesvaraya Marg, Indore (MP) Email: joshi.pragya@gmail.com Abstract— Subscriber retention is a key issue for several service-based organizations mainly in telecom industry, where detecting customer’s behaviour is one of the main interest. Customer churn occurs when an existing subscriber stops doing business or ends the relationship with a company. Since the proportion of churned customers are less as compared to the whole subscriber base, it leads to imbalanced data distribution. To improve the classification performance of imbalanced data learning, a novel over-sampling method, Centroid Oversampling Technique, based on centroid of three nearest neighbor points, is proposed. It generates a set of representative data points to broaden the decision regions of small class spaces. The representative centroids are regarded as synthetic examples in order to resolve the imbalanced class problem. This approach addresses the problem of synthetic minority oversampling techniques, which overlaps the reflection of majority classes on training examples. Our comprehensive experimental results show that Centroid Oversampling Technique can achieve better performance than renowned multi-class oversampling methods. Index-terms - Class Imbalance, Oversampling Techniques, Centroid method, K Nearest Neighbor. I. INTRODUCTION Techniques for handling class-imbalance problem can be broadly categorized into two sets basis their evaluation pro- cedure, Oversampling and Undersampling. In oversampling, some of the data points of minority class are replicated to generate synthetic samples to form balanced dataset. Under- sampling reduces the number of observations from majority class to balance the dataset, but removing observations might cause the training data to miss the significant information pertaining to majority class. Best known oversampling techniques are SMOTE, ADASYN, MTDF. Synthetic Minority Oversampling Technique SMOTE[3] is considered as the most popular technique for handling CIP, as It produces new minority data points based on k-nearest neighbor(kNN), giving positive observations. A major concern of SMOTE[3] is that the class with frequent samples dominate the neighboring data points of a test instance in-spite of distance measurements which leads to sub-optimal classification performance on the minority class. In this proposed approach, a modified oversampling technique is introduced which uses the concept of generating synthetic data samples by calculating centroid of three nearest points (kNN). The main objective of this approach is Surendra Gupta Department of Computer Engineering Shri G.S. Institute of Tech. and Science 23, Sir Visvesvaraya Marg, Indore (MP) Email: sgupta@sgsits.ac.in to correct the inherent bias to majority class on any distance measurement. So, the proposed work is aimed at modifying the existing oversampling technique to enhance the accuracy in predicting customer churn so that special tariffs can be assigned to retain them. The main aim of this approach is to minimise the miss- classified errors while retaining a suitably high accuracy in representing the original features. Many machine learning algorithms yield low performance while dealing with large imbalanced datasets. Imbalanced classes can cause different classification approaches to face difficulties in learning, which can results in poor performance. Therefore, in this paper, a technique has been proposed i.e. centroid oversampling technique. The basic idea of this section is to balance the class distribution using centroid method and kNN approach. In addition, we reduce the number of features and remove the not related, unnecessary or noisy information. Besides, this improves the presentation of information prediction with speeding up the processing algorithm. To improve the prediction accuracy, use PCA algorithm for dimension reduction. This paper is ordered as follows. In section II, an overview of work already done in this area with their advantages and disadvantages is discussed. In the section, III the proposed approach is presented. Section IV presents the details of experimental analysis of the result. Conclusion and future work is discussed in section V. II. RELATED WORK Substantial work has been carried out by earlier re-searchers in the field of classification techniques for Imbalanced datasets, handling miss-classified errors and related areas. Some of the most relevant techniques are discussed in this section. SMOTE[3], Synthetic Minority Over-sampling method, is an oversampling method which uses the concept of creating new minority class samples by incorporating several minority class cases that lie close to each other. It is based on KNN approach, it chooses K nearest neighbors, merge them and generates new synthetic samples. The algorithm uses the feature vectors and its nearest neighbors, calculates the distance among these vectors. The variance is then multiplied by random number among (0, 1) and added back to feature.
  • 2. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 06 Issue: 04 | Apr 2019 www.irjet.net p-ISSN: 2395-0072 © 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 3709 SMOTE algorithm is a pioneer algorithm and countless other algorithms are derivative of SMOTE. SMOTE[3] is an oversampling method which yields “synthetic” data points basis distance measure between two given data points. Oversampling is done using each minority class samples, generating new data examples which connects the nearest neighbors of that class samples. Neighbors are chosen randomly basis the ratio of oversampling being done. The key most objective of oversampling techniques is to strengthen the minority class. The motive behind this algorithm is quite simple. Oversampling causes overfitting, and because of recurrent data instances, the decision boundary gets constricted. So, in SMOTE[3] paper, it has been shown that these newly constructed data samples are not exact replicas, but new points and thus it relaxes the decision boundary, thereby helping the algorithm to estimate the theory more precisely. However, some advantages and disadvantages of SMOTE are listed below. Advantages: 1.Overcomes over-fitting by generating synthetic examples rather than recurrent data points. 2. Information about all the data instances remain intact. 3. Implementation and Interpretation is quite simple and understandable. Disadvantages: 1. overlapping of neighboring classes can generate additional noise. Since SMOTE does not realise neighbouring data samples can be from different classes also. 2. Not recommended for high dimensional data, so SMOTE is not practically used for high dimensional data sets. ADASYN[4] (Adaptive Synthetic) is an oversampling technique in which class weights are distributed for numerous minority data samples, which are being used based on their struggle level in learning. Most of the synthetic samples are generated for classes which are difficult to train as compared to other class samples which can be easily learned. It is an improved version of Smote. After creating those samples, it adds a random weight to the points thus making it more realistic. In other words instead of all the sample being linearly correlated to the parent they have a little more variance in them i.e they are bit scattered. ADASYN[4] advances learning w.r.t. data distribution in below ways: (1) Decreasing the bias lead by the imbalanced class.(2) Attentively moving the decision boundary of classification algorithm near the most problematic examples. Based on previous research work, it has been observed that SMOTE[3] is one of the promising approach which is mostly used for creating new rules in classification problems. This approach can be widely used in the applications of knowledge discovery, data mining, and Telecom domain. But researches can be done in the context of testing this method using bulky databases and to compare this approach with different existing approaches. III. PROPOSED METHOD The Proposed approach is based on Centroid oversampling technique to generate synthetic samples by using kNN algorithm to identify the nearest neighbors and compute centroid of these nearest data points in the space. Existing SMOTE[3] technique has a limitation while working with imbalanced dataset,i.e. the classes having frequent data examples overrules neighbors of testing instances despite of distance metrics. The proposed approach is aimed to resolve this issue by generating centroid between the nearest data points rather than on the outlying areas. In this way, synthetic samples are uniformly distributed among the original data points which reduces the probability of selecting outlier in the data space. The most important benefit of this proposed approach is its capability of correcting internal bias to majority classes of present kNN algorithm. So, this approach is aimed to produce better data samples than the existing techniques, and hence better classification results can be achieved. Architecture of the projected system is illustrated in figure I. Fig. 1: Architecture Diagram of Proposed Approach A. Data Pre-processing The system takes customer churn related dataset as an input. It reads the dataset, and preprocess the categorical or non-numerical attributes and missing values so as to make the data appropriate for further processing. B. Data Modelling and Validation Apply 9-fold cross-validation to calculate the learning abilities of classification model on different data sets. A statistical method called Cross-validation is being used to estimate the learning skills of various models. It is mainly applied in machine learning techniques to evaluate a model for predictive learning and modelling issues. Apply 9- fold cross-validation to build a model on 75% of original data and reserve 25% on testing purpose. Original file gets divided into testing and training sub-files and so that further task can be applied on each set comprising of test and train dataset. C. Data Oversampling Centroid method is used to oversample the dataset. After first iteration of 9-fold cross-validation, oversampling is applied on 75% of original dataset, i.e. testing set. Detailed algorithm is explained here:  Centroid Oversampling method is based on calculating the median of a triangle or three points.  Select three nearest neighbor points from the given dataset.  Calculate the median of these three points.  The output should be a centroid.  Continue with the next three neighbors and generate centroid in the same way. By generating centroids of nearest points, original dataset is oversampled by 100%. The oversampled dataset is then passed to the classification module to train the classification model.
  • 3. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 06 Issue: 04 | Apr 2019 www.irjet.net p-ISSN: 2395-0072 © 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 3710 Algorithm 1 Centroid Oversampling Algorithm Input: P={p1,p2....pn} Training positive examples. n: Number of minority examples. N: No. of synthetic data points to be generated for each positive data point. k: 3 Output: Set of artificial instances. Procedure: 1: Preprocess dataset to change the categorical attributes into numerical one and traverse the dataset (each tuple) to preprocess the missing values. 2: for i = 1 to min 3: Find the two nearest neighbors of pi and store their indexes in an array. 4: end for 5: for m in indices 6: Generate centroid of three points corresponding to their index values. 7: newSample=(a[m]+a[m+1]+a[m+2])/3 8: m=m+1 9: Add new samples one by one to form Synthetic array. 10: Synthetic newSample 11: end for D. Classification Module 1) Load the Testing dataset generated after cross fold validation on which oversampling is required to be done. 2) Use KNN (K-Nearest Neighbor) for classification. KNN algorithms have been recognized as one of the top most powerful data mining algorithms for their capability of creating simple yet powerful classifiers. The k neighbors nearby a test instances are usually called prototypes. Salient features of kNN are:  Most Powerful Data mining algorithm.  It is exceptionally easy to design and implement.  Since the algorithm needs no training afore creation of predictions, new data can be further added seamlessly.  Only two parameters are essential to implement KNN i.e. the value of K and the distance function (e.g. Euclidean or Manhattan etc.)  Capability of constructing simple but dominant classifiers.  KNN can be implemented for both classification and regression predictive problems.  It is lazy learning algorithm and thus, needs no training preceding for constructing real time predictions. This marks the KNN algorithm quicker than rest of the algorithms, which require training e.g SVM, linear regression, etc. Classification module takes resultant dataset as an input. Dataset is segregated into two sets: training set and testing set for applying classification. Classifier is trained using training set and using testing set its accuracy is computed. E. Performance Evaluation Module There are many parameters available which can be used to check performance of classification. The parameters used in the project to check classification performance are : Accuracy : Ratio of accurate prediction to the total no of prediction in a model. Recall : It is considered as a measure of completeness. It tells about how many positive class samples are correctly classified. Recall : Recall = sensitivity Precision : It is a measure of exactness. It can be defined as, from the samples labelled as positive, how many actually belongs to positive class. F-measure : Harmonic mean between recall and preci-sion. Fig. 2: Flow Diagram of Proposed Approach IV. TESTING AND RESULTS Proposed approach is implemented using Python. It is evaluated on three datasets, mainly on telecom churn datasets which are taken from UCI repository and Kaggle. A comparison is made on proposed model by applying KNN classifier on the original data and data generated by the proposed centroid oversampling technique. TABLE I: THE COMPARISON OF CLASSIFICATION ACCURACIES FOR CUSTOMER-CHURN DATASET K=3 Parameters KNN SMOTE+KNN Centroid+KNN Accuracy 0.92 0.91 0.94 Recall 0.75 0.91 0.95 Precision 0.87 0.84 0.82 F-measure 0.65 0.91 0.89
  • 4. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 06 Issue: 04 | Apr 2019 www.irjet.net p-ISSN: 2395-0072 © 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 3711 K=5 Parameters KNN SMOTE+KNN Centroid+KNN Accuracy 0.90 0.87 0.91 Recall 0.75 0.88 0.90 Precision 0.84 0.79 0.77 F-measure 0.53 0.87 0.83 K=7 Parameters KNN SMOTE+KNN Centroid+KNN Accuracy 0.9 0.86 0.90 Recall 0.67 0.87 0.88 Precision 0.85 0.78 0.76 F-measure 0.49 0.86 0.80 TABLE III: THE COMPARISON OF CLASSIFI- CATION ACCURACIES FOR CUSTOMER-CHURN- MODELLING DATASET K=3 Parameters KNN SMOTE+KNN Centroid+KNN Accuracy 0.84 0.87 0.87 Recall 0.67 0.85 0.87 Precision 0.70 0.83 0.79 F-measure 0.50 0.89 0.82 K=5 Parameters KNN SMOTE+KNN Centroid+KNN Accuracy 0.81 0.81 0.82 Recall 0.59 0.79 0.80 Precision 0.63 0.77 0.73 F-measure 0.32 0.84 0.74 K=7 Parameters KNN SMOTE+KNN Centroid+KNN Accuracy 0.80 0.79 0.80 Recall 0.55 0.76 0.78 Precision 0.59 0.75 0.69 F-measure 0.21 0.83 0.71 VI. VISUALIZATION OF THE RESULT Visualization graph has been plotted on the density distribution of available datasets where the data points of various features are plotted before and after applying the centroid oversampling method on churn.csv dataset. TABLE II: THE COMPARISON OF CLASSIFI- CATION ACCURACIES FOR TELCO-CUSTOMER- CHURN DATASET K=3 Parameters KNN SMOTE+KNN Centroid+KNN Accuracy 0.92 0.88 0.94 Recall 0.75 0.84 0.95 Precision 0.87 0.86 0.82 F-measure 0.65 0.91 0.89 K=5 Parameters KNN SMOTE+KNN Centroid+KNN Accuracy 0.79 0.83 0.81 Recall 0.67 0.77 0.82 Precision 0.69 0.81 0.76 F-measure 0.52 0.88 0.79 K=7 Parameters KNN SMOTE+KNN Centroid+KNN Accuracy 0.77 0.80 0.78 Recall 0.63 0.74 0.79 Precision 0.63 0.78 0.71 F-measure 0.43 0.86 0.76 V. STATISTICAL ANALYSIS OF THE RESULT From the table I,II and III, it has been observed that the classification accuracy for Centroid oversampling method is best as compared to SMOTE as well as KNN for the value of K=3. Fig. 3: Density distribution graph of Churn dataset before and after applying centroid oversampling method
  • 5. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 06 Issue: 04 | Apr 2019 www.irjet.net p-ISSN: 2395-0072 © 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 3712 Fig. 4: Data Statics before and after applying Centroid Oversampling Technique VII. CONCLUSION The outcome of the project is focused on finding the customer churn who are going to discontinue the telecom services. Determining customer’s nature empower companies to boost support of the customers and boost the gross performance of the industry. Expected result shows the increase in classification accuracy of proposed model as compared to the base model. Once the best oversampling method is identified for a particular dataset, the same can be used to enhance the classifier accuracy. The proposed system shows the better performance for different datasets than the existing oversampling techniques. VIII. FUTURE WORK Accuracy of many classification algorithms depend on the behavior of the dataset being analysed. For imbalanced datasets, main focus areas are the performance measures and density distribution of data. The proposed approach is based on binary classification of data, so further research can be done with muti-class classification also. Future scope can be in the direction to implement the technique on high dimensional data and to combine the centroid oversampling method with different classifiers to enhance the classification parameters. This method can be efficiently used in the areas of image processing and other data mining approaches. REFERENCES [1] Adnan Amin , Sajid Anwar , Awais Adnan , Muhammad Nawaz , New-ton Howard , Junaid Qadir , (senior Member, Ieee), Ahmad Hawalah4 , And Amir Hussain , (Senior Member, IEEE), ”Comparing Oversam-pling Techniques to Handle the Class Imbalance Problem”,October 26, 2016. [2] J. Burez and D. Van den Poel, ”Handling class imbalance in customer churn prediction”, Expert Syst. Appl.vol. 36, no. 3, pp. 4626–4636, 2009. [3] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, ”SMOTE:Synthetic minority over-sampling technique,” J. Artif. Intell. Res., vol. 16, no. 1, pp. 321–357, 2002. [4] H. He, Y. Bai, E. A. Garcia, and S. Li, ”ADASYN: Adaptive synthetic sampling approach for imbalanced learning”. IEEE Int.Joint Conf. Neural Network,Jun. 2008,pp. 1322–1328 [5] Lian Yan,Richard H. Wolniewicz and Robert Dodier, ”Predicting Customer Behavior in Telecommunications”. IEEE Intelligent Systems Volume: 19 , Issue: 2 , Mar-Apr 2004 [6] Jianping Gou, Zhang Yi ,Lan Du, Taisong Xiong ”A Local Mean-Based k-Nearest Centroid Neighbor Classifier” The Computer Journal archive Volume 55 Issue 9, September 2012 Pages 1058-1071 [7] Han, E., Karypis, G.: ”Centroid-based document classification.”In: ˙ Zighed, D.A., Komorowski, J., Zytkow, J.M. (eds.) PKDD 2000. LNCS (LNAI), vol. 1910, pp. 116–123. Springer, Heidelberg (2000) [8] Kiran Dahiya,Surbhi Bhatia ”Customer churn analysis in telecom industry” IEEE-International Conference on Reliability, Infocom Tech- nologies and Optimization (ICRITO) (Trends and Future Directions) 17 December 2015 [9] V.B.Surya Prasatha,, Haneen Arafat Abu Alfeilatb, Omar Lasass- mehb,Ahmad B.A.Hassanat, ”Distance and Similarity Measures Effect on the Performance of K-Nearest Neighbor Classifier- A Review”, 2017 [10] Liu, Z.G.; Pan, Q.; Dezert, J. ” A new belief-based K-nearest neighbor classification method. Pattern Recognition” 2013, 46, 834–844 [11] Wei Liu and Sanjay Chawla,”Class Confidence Weighted kNN Algo- rithms for Imbalanced Data Sets”, PAKDD 2011 LNCS (LNAI), vol 6635, pp 345-356. Springer, Heidelberg (2011) [12] Asuncion, A., Newman, D.: ”UCI Machine Learning Repository” [13] Demsar,ˇ J.: ”Statistical comparisons of classifiers over multiple data sets. The Journal of Machine Learning Research 7”, 1–30 (2006)