International Journal of Computer Sciences and Engineering    Open Access
Review Paper Volume-3, Issue-8 E-ISSN: 2347-2693
A Comparative Study of Spam Detection in Social Networks Using
Bayesian Classifier and Correlation Based Feature Subset Selection
Sanjeev Dhawan1, Kulvinder Singh2 and Meena Devi3*
1,2 Faculty of Computer Science & Engineering, University Institute of Engineering and Technology, Kurukshetra University, Kurukshetra-136119, Haryana, India
3* Research Scholar, Dept. of Computer Engineering, University Institute of Engineering and Technology, Kurukshetra University, Kurukshetra-136119, Haryana, India
Received: Jul/09/2015    Revised: Jul/22/2015    Accepted: Aug/20/2015    Published: Aug/30/2015
Abstract— This article gives an overview of some of the most popular machine learning methods (Naïve Bayesian classifier, Naïve Bayesian with k-fold cross validation, Naïve Bayesian with info gain, Bayesian classification and Bayesian net with correlation based feature subset selection) and of their applicability to the problem of spam filtering. Brief descriptions of the algorithms are presented, which are meant to be understandable by a reader not previously familiar with them. Classification and clustering techniques in data mining are useful for a wide variety of real-time applications dealing with large amounts of data. Some of the application areas of data mining are text classification, medical diagnosis, intrusion detection systems, etc. The Naive Bayesian Classifier technique is based on the Bayesian theorem and is particularly suited when the dimensionality of the inputs is high. Despite its simplicity, Naive Bayesian can often outperform more sophisticated classification methods. The approach is called "naïve" because it assumes independence between the attribute values. Naïve Bayesian classification can be viewed as both a descriptive and a predictive type of algorithm: the probabilities are descriptive, and they are used to predict the class membership of unseen data.
Keywords— Bayesian Classifier, Feature Subset Selection, Naïve Bayesian Classifier, Correlation Based FSS, Info Gain, K-Cross Validation, Spam, Non-Spam
I. INTRODUCTION
Classification techniques analyze and categorize the data
into known classes. Each data sample is labeled with a
known class label. Clustering is a process of grouping objects into a set of clusters such that similar objects are members of the same cluster and dissimilar objects belong to different clusters [1]. In classification the classes
are pre-defined. Training sample data are used to create a
model, where each training sample is assigned a predefined
label. Data mining involves the use of sophisticated data
analysis tools to discover previously unknown, valid
patterns and relationships in large data sets. These tools can include statistical models, mathematical algorithms and machine learning methods. Beyond collecting and managing data, data mining also includes analysis and prediction. In this paper we try to understand the logic
behind Bayesian classification. The Naive Bayesian
Classifier technique is based on the Bayesian theorem and is
particularly suited when the dimensionality of the inputs is
high. Despite its simplicity, Naive Bayesian can often
outperform more sophisticated classification methods.
II. NAÏVE BAYESIAN CLASSIFIER
The Naive Bayesian classifier is a straightforward and
frequently used method for supervised learning. It provides
a flexible way for dealing with any number of attributes or
classes, and is based on probability theory. It is the
asymptotically fastest learning algorithm that examines all
its training input. It has been demonstrated to perform
surprisingly well in a very wide variety of problems in spite
of the simplistic nature of the model. Furthermore, small
amounts of bad data, or ‘‘noise,’’ do not perturb the results
by much.[2] However, as mentioned above, the central
assumption in Naive Bayesian classification is that given a
particular class membership, the probabilities of particular
attributes having particular values are independent of each
other. However, this assumption is often violated in reality.
For example, in demographic data, many attributes have obvious dependencies, such as age and income, yet relaxing the independence assumption to model them is computationally problematic. The effect of the assumption is best illustrated by redundant attributes: if we posit two independent features, and a third which is redundant (i.e., perfectly correlated) with the first, the first attribute will have twice as much influence on the prediction as the second, a weight not reflected in reality. The inflated strength of the first attribute increases the possibility of unwanted bias in the classification. Even with this independence assumption, Naive Bayesian classification still works well in practice.
However, some researchers have shown that although
irrelevant features should theoretically not hurt the accuracy
of Naive Bayesian, they do degrade performance in
practice. This paper illustrates that if those redundant or
irrelevant attributes are eliminated, the performance of
Naïve Bayesian Classifier can significantly increase.
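To make the classifier concrete, the sketch below is a minimal multinomial Naïve Bayesian text classifier written from scratch in Python. The toy training messages, the whitespace tokenizer and the Laplace smoothing are illustrative assumptions and are not taken from the study's dataset or implementation.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Crude lowercase/whitespace tokenizer; real spam filters use richer features.
    return text.lower().split()

def train_naive_bayes(samples):
    """samples: list of (text, label). Returns class priors, per-class word counts and the vocabulary."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)   # label -> Counter of word frequencies
    vocab = set()
    for text, label in samples:
        class_counts[label] += 1
        for w in tokenize(text):
            word_counts[label][w] += 1
            vocab.add(w)
    priors = {c: n / len(samples) for c, n in class_counts.items()}
    return priors, word_counts, vocab

def predict(text, priors, word_counts, vocab):
    """Pick the class with the highest log-posterior, assuming attribute independence."""
    scores = {}
    for c in priors:
        total = sum(word_counts[c].values())
        log_post = math.log(priors[c])
        for w in tokenize(text):
            # Laplace smoothing so unseen words do not zero out the product.
            log_post += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        scores[c] = log_post
    return max(scores, key=scores.get)

# Hypothetical toy corpus, purely for illustration.
train = [
    ("win a free prize now", "spam"),
    ("cheap offer click now", "spam"),
    ("meeting schedule for monday", "ham"),
    ("project report attached", "ham"),
]
priors, counts, vocab = train_naive_bayes(train)
print(predict("free prize offer", priors, counts, vocab))   # expected: spam
```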
International Journal of Computer Sciences and Engineering Vol.-3, Issue -8, pp(97-100) Aug 2015 E-ISSN: 2347-2693
© 2015, IJCSE All Rights Reserved 98
III. NAÏVE BAYESIAN K-CROSS VALIDATION
For k-fold cross-validation, the data is split into k groups (e.g. 10). One of those groups is then selected, and the model (built from the remaining training data) is used to predict the labels of this testing group. Once the model has been built and cross-validated, it can be used to predict data that do not yet have labels [5]. Cross-validation is used to prevent overfitting. In each round of k-fold cross-validation, exactly one of the k groups is held out for testing and the rest are used for training. For example, with 100 samples split into groups 1-10, 11-20, ..., 91-100, one would first train on groups 11-100 and predict the test group 1-10, then repeat with 1-10 and 21-100 as the training set and 11-20 as the testing group, and so forth. The results are typically averaged at the end.
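The sketch below illustrates this procedure with k = 10 over 100 hypothetical labeled samples. The majority-class evaluator is only a stand-in; in the study a Naïve Bayesian model would be trained and scored on each split.

```python
import random
from collections import Counter

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and distribute them into k roughly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def majority_class_accuracy(train_set, test_set):
    # Stand-in evaluator: always predicts the most common training label.
    majority = Counter(label for _, label in train_set).most_common(1)[0][0]
    correct = sum(1 for _, label in test_set if label == majority)
    return correct / len(test_set)

def cross_validate(samples, k=10):
    folds = k_fold_indices(len(samples), k)
    scores = []
    for i in range(k):
        test_idx = set(folds[i])
        train_set = [s for j, s in enumerate(samples) if j not in test_idx]
        test_set = [s for j, s in enumerate(samples) if j in test_idx]
        scores.append(majority_class_accuracy(train_set, test_set))
    return sum(scores) / len(scores)   # fold scores are averaged at the end

# 100 hypothetical labeled messages, roughly 60% ham and 40% spam.
data = [("msg %d" % i, "spam" if i % 5 < 2 else "ham") for i in range(100)]
print(cross_validate(data, k=10))
```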
IV. NAÏVE BAYESIAN INFO GAIN
The information gain of a given attribute X with respect to the class attribute Y is the reduction in uncertainty about the value of Y when we know the value of X [3]. The uncertainty about the value of Y is measured by its entropy, H(Y). The uncertainty about the value of Y when we know the value of X is given by the conditional entropy of Y given X, H(Y|X), as shown below:
IG = H(Y) − H(Y|X) = H(X) − H(X|Y)
IG is a symmetrical measure [11]. The information gained
about Y after observing X is equal to the information gained
about X after observing Y.
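A minimal sketch of this computation, estimating H(Y), H(Y|X) and IG from frequencies in a small hypothetical dataset (a binary "contains the word free" attribute against spam/ham labels):

```python
import math
from collections import Counter

def entropy(values):
    """H(Y) = -sum p(y) log2 p(y), estimated from value frequencies."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def conditional_entropy(xs, ys):
    """H(Y|X) = sum over x of p(x) * H(Y | X = x)."""
    n = len(xs)
    h = 0.0
    for x_val, count in Counter(xs).items():
        subset = [y for x, y in zip(xs, ys) if x == x_val]
        h += (count / n) * entropy(subset)
    return h

def information_gain(xs, ys):
    # IG = H(Y) - H(Y|X); by symmetry it also equals H(X) - H(X|Y).
    return entropy(ys) - conditional_entropy(xs, ys)

# Hypothetical attribute values and class labels, for illustration only.
X = [1, 1, 1, 0, 0, 0, 0, 1]
Y = ["spam", "spam", "ham", "ham", "ham", "ham", "spam", "spam"]
print(round(information_gain(X, Y), 3))
```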
V. BAYESIAN CLASSIFIER
The Bayesian classifier is a simple but effective learning
algorithm which can be used to classify the incoming
messages into several classes (ω1, ω2…ωn). In fact, it is
capable of much more than just that. The Bayesian classifier
is used in document classification, voice recognition and
even in facial recognition [9]. It is a simple probabilistic
classifier (mathematical mapping system) which requires
the following:
1. The prior probability that a given event belongs to a
specific class
2. The likelihood function of a given feature set
describing a class P(x|ω1)
Once these data are available, the classifier divides the sample space into disjoint regions (R1, R2, …, Rn). When there are only two classes (in our case: spam and not-spam), the classifier also provides a decision function δ(x) such that
δ(x) = ω1 if x ∈ R1
δ(x) = ω2 if x ∈ R2
Initially, the classifier needs to be trained on labeled
features to allow it to build up the likelihood functions and
the prior probabilities. After the classifier is put to work, as
it comes across newer values for the features, it
automatically adjusts the likelihood functions and the
decision boundaries appropriately.
Bayesian theorem provides a way of calculating the
posterior probability, P(c | x), from P(c), P(x), and P(x | c).
Naive Bayesian classifier assumes that the effect of the
value of a predictor (x) on a given class (c) is independent
of the values of other predictors. This assumption is called
class conditional independence.
P(c | x) = P(x | c) × P(c) / P(x)
For a feature vector X = (x1, x2, …, xn), the class conditional independence assumption gives
P(c | X) ∝ P(x1 | c) × P(x2 | c) × … × P(xn | c) × P(c)
where:
• P (c | x) is the posterior probability of class (target)
given predictor (attribute).
• P(c) is the prior probability of class.
• P (x | c) is the likelihood which is the probability
of predictor given class.
• P(x) is the prior probability of predictor.
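To make the formula concrete, the sketch below computes the (unnormalized) class posteriors for a single message under the independence assumption. The priors and per-word likelihoods are made-up illustrative values, not estimates from the study's data.

```python
# Hypothetical priors and per-word likelihoods P(word | class), purely illustrative.
priors = {"spam": 0.4, "ham": 0.6}
likelihood = {
    "spam": {"free": 0.30, "prize": 0.20, "meeting": 0.01},
    "ham":  {"free": 0.02, "prize": 0.01, "meeting": 0.25},
}

def posterior_scores(words, priors, likelihood):
    """Unnormalized P(c|X) = P(x1|c) * ... * P(xn|c) * P(c); P(x) cancels when comparing classes."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for w in words:
            score *= likelihood[c].get(w, 1e-6)   # tiny floor for words unseen in this class
        scores[c] = score
    return scores

scores = posterior_scores(["free", "prize"], priors, likelihood)
total = sum(scores.values())
for c, s in scores.items():
    print(c, round(s / total, 4))   # normalized posteriors; "spam" dominates for these words
```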
VI. CORRELATION BASED FSS
The CFS algorithm relies on a heuristic for evaluating the worth or merit of a subset of features. This heuristic takes into account the usefulness of individual features for predicting the class label along with the level of inter-correlation among them. The hypothesis on which the heuristic is based can be stated as follows:
Good feature subsets contain features highly correlated with
(predictive of) the class, yet uncorrelated with (not
predictive of) each other.
Features are relevant if their values vary systematically with
category membership. In other words, a feature is useful if
it is correlated with or predictive of the class; otherwise it is
irrelevant. Empirical evidence from the feature selection
literature shows that, along with irrelevant features,
redundant information should be eliminated as well [6].
A feature is said to be redundant if one or more of the other
features are highly correlated with it. The above definitions of relevance and redundancy lead to the idea that the best features for a given classification are those that are highly correlated with one of the classes and have an insignificant correlation with the rest of the features in the set.
If the correlation between each of the components in a test
and the outside variable is known, and the inter-correlation
between each pair of components is given, then the
correlation between a composite consisting of the summed
components and the outside variable can be predicted from
rzc = k · rzi / √(k + k(k − 1) · rii)    (1)
Where
rzc = correlation between the summed components and the
outside variable.
k = number of components (features).
rzi = average of the correlations between the components
and the outside variable.
rii = average inter-correlation between components.
Equation (1) is Pearson's correlation coefficient, where all the variables have been standardized. The numerator can be thought of as an indication of how predictive of the class a group of features is; the denominator, of how much redundancy there is among them [7]. Thus, Equation (1) shows that the correlation between a
composite and an outside variable is a function of the
number of component variables in the composite and the
magnitude of the inter-correlations among them, together
with the magnitude of the correlations between the
components and the outside variable. Some conclusions can
be extracted from (1):
• The higher the correlations between the components
and the outside variable, the higher the correlation
between the composite and the outside variable.
• As the number of components in the composite
increases, the correlation between the composite and
the outside variable increases.
• The lower the inter-correlation among the components,
the higher the correlation between the composite and
the outside variable.
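A minimal sketch of the merit heuristic in Equation (1): given pairwise Pearson correlations, it scores a candidate subset by the ratio of the average feature-class correlation to the average feature-feature inter-correlation. The greedy forward search and the toy data are assumptions made here for illustration, not necessarily the exact search strategy used in the study.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def merit(subset, features, labels):
    """Equation (1): rzc = k*rzi / sqrt(k + k(k-1)*rii), using absolute correlations."""
    k = len(subset)
    rzi = sum(abs(pearson(features[f], labels)) for f in subset) / k
    if k == 1:
        return rzi
    pairs = list(combinations(subset, 2))
    rii = sum(abs(pearson(features[a], features[b])) for a, b in pairs) / len(pairs)
    return (k * rzi) / math.sqrt(k + k * (k - 1) * rii)

def cfs_forward_search(features, labels):
    """Greedily add the feature that most improves the subset merit; stop when merit no longer rises."""
    remaining, selected, best = set(features), [], 0.0
    while remaining:
        cand, score = max(((f, merit(selected + [f], features, labels)) for f in remaining),
                          key=lambda t: t[1])
        if score <= best:
            break
        selected.append(cand)
        remaining.remove(cand)
        best = score
    return selected, best

# Hypothetical data: f1 tracks the class, f2 is nearly a copy of f1, f3 is noise.
labels = [0, 0, 1, 1, 0, 1, 1, 0]
features = {
    "f1": [0.1, 0.2, 0.9, 0.8, 0.15, 0.85, 0.95, 0.05],
    "f2": [0.12, 0.18, 0.88, 0.82, 0.2, 0.8, 0.9, 0.1],
    "f3": [0.5, 0.4, 0.6, 0.3, 0.7, 0.2, 0.8, 0.1],
}
print(cfs_forward_search(features, labels))   # prints the selected subset and its merit score
```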
VII. CLASSIFICATION RESULTS
Classifier              TP Rate   FP Rate   Precision   Recall
Naïve Bayes             0.793     0.152     0.842       0.793
Naïve Bayes (20 Folds)  0.692     0.046     0.959       0.692
NB Info Gain FSS        0.8       0.196     0.808       0.8
Bayes Net               0.9       0.123     0.9         0.9
Bayes Net + CFS         0.924     0.096     0.925       0.924

Table 1: Comparison of performance of various algorithms
Table 1 compares the performance of the various algorithms. It shows that Bayesian Net with Correlation Based Feature Subset Selection performs best among these algorithms with respect to TP rate, FP rate, precision and recall.
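For reference, the sketch below shows how the four columns of Table 1 are computed from raw prediction counts. The counts are made up for illustration and do not reproduce the table's values; tools such as WEKA may additionally report class-weighted averages of these metrics.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classification definitions of the metrics reported in Table 1."""
    tp_rate = tp / (tp + fn)          # true positive rate, identical to recall
    fp_rate = fp / (fp + tn)          # false positive rate
    precision = tp / (tp + fp)
    return {"TP Rate": tp_rate, "FP Rate": fp_rate,
            "Precision": precision, "Recall": tp_rate}

# Hypothetical counts on a 1000-message test set (spam = positive class).
print(classification_metrics(tp=370, fp=30, tn=570, fn=30))
```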
VIII. CONCLUSION AND FUTURE SCOPE
Feature subset selection (FSS) plays a vital role in the fields of data mining and machine learning. A good FSS algorithm can efficiently remove irrelevant and redundant features and take feature interaction into account. This improves the understanding of the data and also enhances the performance of a learner by improving its generalization capacity and the interpretability of the learned model. An alternative approach is to employ a classifier on a corpus of e-mail messages from many users and a collective dataset.
In this work, we have worked on improving spam detection based on feature subset selection over a spam data set. Feature subset selection methods such as Info Gain attribute selection and correlation based attribute selection can be viewed as the main enhancement to Naïve Bayesian/probabilistic methods. We have analyzed the probabilistic spam filters and attained more than 92% success in filtering spam.
However, many issues still remain open. For example, the system deals only with content that has been translated to plain text or HTML. Since some spam is sent with most of the message embedded in an image, it would be worth looking at ways in which images and other attachments could be examined by the system. These could include algorithms which extract text from the attachment, or more complex analysis of the information contained within the attachment.
We can also work on a technique to recognize web spam by detecting boosting pages rather than the spam pages themselves: starting from a small set of spam seed pages to obtain boosting pages, and then identifying web spam pages using those boosting pages. We can also work with a larger dataset, and the system should be tested over a longer period than the one-year window available in the public domain.
ACKNOWLEDGEMENT
I would like to acknowledge Dr. Sanjeev Dhawan, Assistant
Professor, University Institute of Engineering and
Technology (U.I.E.T), Kurukshetra University, Kurukshetra
for introducing the present topic and for his inspiring
guidance, valuable suggestions and support throughout the
work.
REFERENCES
[1] Rushdi Shams and Robert Mercer, "Classifying Spam Emails using Text and Readability Features," IEEE 13th International Conference on Data Mining (ICDM), 2013, pp. 657-666.
[2] Chotirat Ann Ratanamahatana and Dimitrios Gunopulos, "Feature Selection for the Naive Bayesian Classifier Using Decision Trees," Applied Artificial Intelligence, Volume 17, 2003, pp. 475-487.
[3] Mehdi Naseriparsa, Amir-Masoud Bidgoli and Touraj Varaee, "A Hybrid Feature Selection Method to Improve Performance of a Group of Classification Algorithms," International Journal of Computer Applications (0975-8887), Volume 69, No. 17, May 2013.
[4] Aakriti Aggarwal and Ankur Gupta, "Detection of DDoS Attack Using UCLA Dataset on Different Classifiers," International Journal of Computer Science and Engineering, Volume 3, Issue 8, August 2015, pp. 33-37.
[5] Ioannis Kanaris, Konstantinos Kanaris, Ioannis Houvardas and Efstathios Stamatatos, "Words vs. Character N-Grams for Anti-Spam Filtering," International Journal on Artificial Intelligence Tools, 2006, pp. 1-20.
[6] Mehdi Naseriparsa, Amir-Masoud Bidgoli and Touraj Varaee, "A Hybrid Feature Selection Method to Improve Performance of a Group of Classification Algorithms," International Journal of Computer Applications (0975-8887), Volume 69, Issue 17, May 2013.
[7] Sanjeev Dhawan and Meena Devi, "Spam Detection in Social Networks Using Correlation Based Feature Subset Selection," International Journal of Computer Applications Technology and Research, Volume 4, Issue 8, August 2015, pp. 629-632.
[8] Dipali Bhosale and Roshani Ade, "Feature Selection based Classification using Naive Bayesian, J48 and Support Vector Machine," International Journal of Computer Applications (0975-8887), Volume 99, No. 16, August 2014.
[9] Anjana Kumari, "Study on Naive Bayesian Classifier and its relation to Information Gain," International Journal on Recent and Innovation Trends in Computing and Communication, Volume 2, Issue 3, March 2014, pp. 601-603.
AUTHORS PROFILE
Meena Devi received her Bachelor of Technology degree in Computer Science and Engineering with first division in 2013 and is currently pursuing her Master of Technology degree in Computer Engineering from Kurukshetra University, Kurukshetra. Her areas of interest are WEKA and Java.