SlideShare a Scribd company logo
IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308
__________________________________________________________________________________________
Volume: 03 Issue: 03 | Mar-2014, Available @ http://www.ijret.org 31
A MULTI-CLASSIFIER PREDICTION MODEL FOR PHISHING
DETECTION
Sarju S1
, Riju Thomas2
, Emilin Shyni C3
1, 2
PG Scholar, 3
Associate Professor, Department of Computer Science and Engineering, KCG College of Technology,
Tamilnadu, India
Abstract
Phishing is the technique of stealing personal and sensitive information from an email client by sending emails that impersonates like
the ones from some trustworthy organizations. Phishing mails are a specific type of spam mails; however the effects of them are much
more terrible than alternate sorts. Mostly the phishing attackers aim the clients of the financial organizations, so its detection needs
high priority. Lots of research activities are done to detect the phished emails, in the proposed methodology a multi-classifier
prediction model is introduced for detecting phished emails. Our contention is that solitary classifier prediction might not be
satisfactory to urge the clearest picture of the phishing email; multi-classifier prediction has accuracy 99.8% with an FP rate of 0.8%.
Keywords: Phishing email detection, Machine learning techniques, Multi-classifier prediction model, Majority voting
----------------------------------------------------------------------***------------------------------------------------------------------------
1. INTRODUCTION
Nowadays the emails turn into one of the generally utilized
communication medium within the globe. Because of the fame
of emails, the attackers utilized it to snatch the client data.
Phishing messages are a particular kind of spam mail, which is
used to take the individual and fiscal data from the email
clients. Generally the attackers send an email that looks like
legitimate messages from some reputed organizations, which
lead clients to phishing sites. Phishing sites always have a user
entry form, when he enters his data like user names,
passwords and credit card details which in turn utilized by the
attackers to do some deceitful exercises. As per the latest
report issued by APWG [1] (Anti-Phishing Working Group),
amount of phishing attack evolved and burgeoned in
overabundance of 20% in 2013. Latest Trends Report of
APWG quotes that the overall number of distinctive phishing
internet sites rose to 143,353 throughout the July-September
that is over the past quarter's 119,101.
Heaps of researches are carried out to detect the phished mails,
in which the machine learning techniques are most prominent
one due the higher precision of detection. The classification
algorithms developed in the learning techniques are utilized
for anticipating the class of the given mail, but each
classification algorithms have their own particular blemishes.
In the proposed technique we implemented a multi-
classification method which fuses three most accurate
classifiers for foreseeing the class label. A prediction model is
constructed with the classifiers which incorporate Support
Vector Machine (SVM), J48 and Instance Base 1, majority
voting algorithm is used to settle on the last choice in regards
to the class name of the given email. The dataset holds 5260
publically accessible email corpus for train the prediction
model and 500 messages are utilized as the test mails.
2. BACKGROUND
Recently, a variety of anti-phishing techniques are introduced,
out of which machine learning technique based approaches are
most prominent one. Features are extracted from the html part
and body part of the email is used by the machine learning
algorithms to predict the possibility of the phishing.
Chandrasekaran et.al [2] proposed phishing email detection
based on the structural properties of the phished mails. A total
of 25 structural features are extracted and classification of the
mails is done using the Support vector Machine algorithm.
Data set only contains 200 mails from the public phishing mail
corpus, so the results might not be accurate. Sarju et al.[3]
utilized the structural properties to detect the spam emails and
used Naïve Bayes, Adaboost and Random Forest are used to
measure the accuracy. From the above works it identified that
the structural properties can be used to discriminate the
phished emails from ham mails.
A number of other novel features are also useful to identify
phished emails. Bergholz et.al [4] analyzed different
properties of the email to detect the phished ones. The content
of the emails are evaluated for constructing the feature set and
is used for phishing detection. Random Forest and SVM are
used to measure the accuracy of the detection.
Abu-Nimeh et al.[5] compared machine learning techniques in
phishing detection, they used six different classifiers. A total
of 43 features are extracted from the dataset of 2889 phishing
and ham emails. They found that Random Forest outperforms
IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308
__________________________________________________________________________________________
Volume: 03 Issue: 03 | Mar-2014, Available @ http://www.ijret.org 32
all other classifiers used in their methodology. Miyamoto et al.
[6] analyzed nine machine learning techniques for detection of
phishing sites; they found that Random Forest and SVM
outperforms all the other classifiers.
3. PROPOSED METHOD
In the proposed method structural features, content based
features and element features of emails are analyzed and
extracted for training the prediction model as shown in the
Figure 1, Fi represents the feature set extracted from the mails
and C1 and C2 the corresponding class labels.
Fig 1 - Multi-Classifier Prediction system for Phishing email
detection
3.1 Feature Extraction & Analysis
Feature Extraction and analysis phase, extracts the different
features that plays important role in detecting the phished
emails. In this work, we extracted the structural features,
content based features and element features from the emails.
Emails are available in the Multipart Internet Mail Extension
(MIME) format, an HTML parser is used to parse the mail and
a HTML is tree constructed. Structural features are extracted
from the HTML tree and are multi part count, non ASCII
characters, non-text content and content labeling. The
multipart feature divides the email into different parts. For
example an email with an attachment has two parts, text
content and attachment. An email in the MIME format
contains a header which shows the character encoding used in
that mail. This can be used to identify whether any non ASCII
characters used in that mail. Table 1 shows the content label
types used in this paper.
Table 1 - Email Content Types
Element features includes the web technologies used in the
email. In this work, we extracted element features of type
Boolean which indicates whether HTML, JavaScript, VB
Script, XHTML, and CSS used in the mail. Finally the content
based features available in the email are extracted. Totally 42
features are extracted from the email corpus and give it as a
training set to the prediction model.
3.2 Prediction Model
The extracted features from the feature extraction and analysis
stage are used in the multi-classifier prediction model. The
prediction model is built using the machine learning
algorithms which includes J48, SVM and IB1, each one is
capable of classifying the mails into phished or ham mails.
The accuracy of the prediction can be improved by combining
the classifiers. The final decision regarding the category of the
mail is done through the majority voting algorithm. Different
research works are done to combining classifiers [7-9].
Majority Voting is the mostly used way for combining
classifiers, which count the votes for each class that are
predicted by the classifiers and majority class is selected. The
new confidence fi x for class i is calculated as
fi x = I(maxj pji x = j)j (1)
in which I() is the identifier function: I(x) =1 if x is true else
I(x) will be zero.
When a new mail comes the prediction model identifies its
category based in the training test given.
4. EXPERIMENTAL EVALUATION
The dataset holds 5260 publically accessible email corpus of
phished and ham mails for train the prediction model and 500
messages are utilized as the test mails. The performance of the
prediction model is analyzed using different measures like
True Positive, False Positive, Accuracy and Receiver Operator
Characteristics curve.
File Extension MIMEType Description
.txt text/plain Plain text
.htm text/html Styled text in HTML format
.jpg image/jpeg Picture in JPEGformat
.gif image/gif Picture in GIF format
.wav audio/x-wave Sound in WAVE format
.mp3 audio/mpeg Music in MP3 format
.mpg video/mpeg Video in MPEGformat
.zip application/zip Compressed file in PK-ZIP format
IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308
__________________________________________________________________________________________
Volume: 03 Issue: 03 | Mar-2014, Available @ http://www.ijret.org 33
The information about actual and predicted classifications
done by machine learning systems is represented in the form
of a confusion matrix [10] as shown in the figure 2 and
accuracy is measured based on entries.
Fig 2 - Confusion Matrix
Accuracy =
TP + TN
TN + FN + FP + TP
(2)
If the mail is ham and it is classified as ham, it is counted as a
True Positive (TP); if it is classified as phish, it is counted as a
False Negative (FN). If the mail is phished and it is classified
as phish, it is counted as a True Negative (TN); if it is
classified as ham, it is counted as a False Positive (FP).
The Figure 3 compares the FP rate obtained when the
classifiers used independently and also combined using
majority voting. It is shown that the multi-classification using
majority voting outperforms individual classifier performance
with an FP rate of 0.8%
.
Fig 3 - Phishing Prediction FP Rate
The accuracy of the prediction model is evaluated using the
equation 2. From the results, it is clearly understood that the
accuracy of the multi-classification is higher than the
classifiers applied individually. Multi-classification with
majority voting gains an accuracy of 99.80% and is shown in
the Figure 4.
Fig 4 - Phishing Prediction Accuracy Comparison
Another performance measure used to evaluate our proposed
system is ROC [11] curve. In an ROC graph X axis plotted
with FP rate and TP on Y axis. Figure 5a and 5b shows the
ROC measures for the prediction model, by analyzing the
graphs it is identified that the multi-classification prediction
model has improved results compared to the independent
classifier prediction model.
(a)
(b)
Fig 5 – ROC Curves (a) when classifiers used independently ;
(b) Multi- classification using Majority Voting
Phish Ham
Phish TN FN
Ham FP TP
Predicted
Actual
IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308
__________________________________________________________________________________________
Volume: 03 Issue: 03 | Mar-2014, Available @ http://www.ijret.org 34
5. CONCLUSIONS
Our argument is that single classifier would not be adequate to
urge the clearest image of the phishing email detection
accuracy. The experimental results shown that proposed using
multi-classifier prediction model outperforms the individual
classifier based prediction models in many aspects. It
preserves an accuracy of 99.8% with an FP rate of 0.8%. The
ROC measure shows that the multi-classification with
majority voting gives almost an ideal curve compared to
independent classification algorithms.
In the future work, we are planning to incorporate the topic
modeling features to the feature set generation stage, because
it has the capability to overcome the novel techniques used by
the attackers.
REFERENCES
[1] APWG, Anti phishing working group.
http://wwww.antiphishing.org (2013). Accessed 31
September 2013
[2] Chandrasekaran M, Narayanan M and Upadhyaya S.:
Phishing email detection based on structural
properties. In: Proceedings of 9th Annual NYS Cyber
Security Conference, pp. 2-8 (2006).
[3] Sarju S, Riju Thomas and Emilin Shyni C.: Spam
Email Detection using Structural Features.
International Journal of Computer Applications
89(3):pp.38-41 (2014).
[4] A Bergholz, JH Chang, F Reichartz, and S Strobel.:
Improved Phishing Detection using Model-Based
Features. In Proceedings of the Conference on Email
and Anti-Spam (CEAS), Mountain View, CA,(2008).
[5] Saeed Abu-Nimeh , Dario Nappa , Xinlei Wang , Suku
Nair.: A comparison of machine learning techniques
for phishing detection, In. Proceedings of the anti-
phishing working groups 2nd annual eCrime
researchers summit, pp.60-69, (2007).
[6] Daisuke Miyamoto , Hiroaki Hazeyama and Youki
Kadobayashi.: An evaluation of machine learning-
based methods for detection of phishing sites,
Proceedings of the 15th international conference on
Advances in neuro-information processing, (2008).
[7] Geman D and Geman S.: Stochastic relaxation, Gibbs
distribution, and the Bayesian restoration of images.
IEEE Trans. on Pattern Analysis and Machine
Intelligence, PAMI-6(6), pp.721–741, (1984).
[8] Jain AK, Zhong Y and Lakshmanan S.: Object
Matching Using Deformable Templates. IEEE Trans.
Pattern Analysis and Machine Intelligence 18(3),
pp.267–278, (1996)
[9] Kumazawa I.: Shape extraction by cellular Hough
transform, Technical report of IEICE, PRMU, pp.96–
105, (1996)
[10] Provost F, Fawcett T and Kohavi R.: The case against
accuracy estimation for comparing induction
algorithms. In. Proceedings. ICML-98.pp. 445–453,
(1998)
[11] T. Fawcett, An Introduction to ROC Analysis, Pattern
Recognition Letters 27(8), pp. 861-874, (2006).
[12] WEKA. http://www.cs.waikato.ac.nz/ml/weka/ (2013):
Accessed 31 November 2013
[13] SpamAssassin, http://spamassassin.apache.org (2013)
: Accessed 31 November 2013
[14] Phishingcorpus http://monkey.org (2013), Accessed 31
November 2013.
[15] Apache James. (2013) Mime4J Parser.
http://james.apache.org/mime4j: Accessed 14
October 2013.

More Related Content

PDF
IRJET- Suspicious Email Detection System
PDF
Ijigsp v6-n2-6
PDF
20120140506007
PDF
Obfuscated computer virus detection using machine learning algorithm
PDF
AN INTELLIGENT CLASSIFICATION MODEL FOR PHISHING EMAIL DETECTION
PDF
Thesis presentation
PDF
The International Journal of Engineering and Science (The IJES)
PDF
Aj35198205
IRJET- Suspicious Email Detection System
Ijigsp v6-n2-6
20120140506007
Obfuscated computer virus detection using machine learning algorithm
AN INTELLIGENT CLASSIFICATION MODEL FOR PHISHING EMAIL DETECTION
Thesis presentation
The International Journal of Engineering and Science (The IJES)
Aj35198205

What's hot (13)

PDF
The Detection of Suspicious Email Based on Decision Tree ...
PDF
AN EMPIRICAL ANALYSIS OF EMAIL FORENSICS TOOLS
PDF
Analysis of an image spam in email based on content analysis
PDF
SPAM FILTERING SECURITY EVALUATION FRAMEWORK USING SVM, LR AND MILR
PDF
Multibiometric Secure Index Value Code Generation for Authentication and Retr...
DOCX
Abstract
PDF
IRJET - Sentiment Analysis of Posts and Comments of OSN
PDF
Rule based messege filtering and blacklist management for online social network
PDF
IRJET- Effective Countering of Communal Hatred During Disaster Events in Soci...
PDF
A COMPARATIVE ANALYSIS OF DIFFERENT FEATURE SET ON THE PERFORMANCE OF DIFFERE...
PDF
The Proposed Development of Prototype with Secret Messages Model in Whatsapp ...
PDF
Differential evolution detection models for SMS spam
PDF
IRJET- Twitter Sentimental Analysis for Predicting Election Result using ...
The Detection of Suspicious Email Based on Decision Tree ...
AN EMPIRICAL ANALYSIS OF EMAIL FORENSICS TOOLS
Analysis of an image spam in email based on content analysis
SPAM FILTERING SECURITY EVALUATION FRAMEWORK USING SVM, LR AND MILR
Multibiometric Secure Index Value Code Generation for Authentication and Retr...
Abstract
IRJET - Sentiment Analysis of Posts and Comments of OSN
Rule based messege filtering and blacklist management for online social network
IRJET- Effective Countering of Communal Hatred During Disaster Events in Soci...
A COMPARATIVE ANALYSIS OF DIFFERENT FEATURE SET ON THE PERFORMANCE OF DIFFERE...
The Proposed Development of Prototype with Secret Messages Model in Whatsapp ...
Differential evolution detection models for SMS spam
IRJET- Twitter Sentimental Analysis for Predicting Election Result using ...
Ad

Similar to A multi classifier prediction model for phishing detection (20)

PDF
E-Mail Spam Detection Using Supportive Vector Machine
PDF
EMAIL SPAM DETECTION USING HYBRID ALGORITHM
PDF
An Approach for Malicious Spam Detection in Email with Comparison of Differen...
PDF
Email Spam Detection Using Machine Learning
PDF
Study of Various Techniques to Filter Spam Emails
PDF
IRJET- Analysis and Detection of E-Mail Phishing using Pyspark
PDF
A Survey on Spam Filtering Methods and Mapreduce with SVM
PDF
OPTIMIZING HYPERPARAMETERS FOR ENHANCED EMAIL CLASSIFICATION AND FORENSIC ANA...
PDF
Detection of Spam in Emails using Machine Learning
PDF
Improved spambase dataset prediction using svm rbf kernel with adaptive boost
PDF
SPAM FILTERING SECURITY EVALUATION FRAMEWORK USING SVM, LR AND MILR
PDF
SPAM FILTERING SECURITY EVALUATION FRAMEWORK USING SVM, LR AND MILR
PDF
SPAM FILTERING SECURITY EVALUATION FRAMEWORK USING SVM, LR AND MILR
PDF
SPAM FILTERING SECURITY EVALUATION FRAMEWORK USING SVM, LR AND MILR
PDF
SPAM FILTERING SECURITY EVALUATION FRAMEWORK USING SVM, LR AND MILR
PDF
SPAM FILTERING SECURITY EVALUATION FRAMEWORK USING SVM, LR AND MILR
PDF
SPAM FILTERING SECURITY EVALUATION FRAMEWORK USING SVM, LR AND MILR
PDF
IRJET- Review on the Simple Text Messages Classification
PDF
A LITERATURE REVIEW ON PHISHING EMAIL DETECTION USING DATA MINING
PDF
Enhancing spam detection using Harris Hawks optimization algorithm
E-Mail Spam Detection Using Supportive Vector Machine
EMAIL SPAM DETECTION USING HYBRID ALGORITHM
An Approach for Malicious Spam Detection in Email with Comparison of Differen...
Email Spam Detection Using Machine Learning
Study of Various Techniques to Filter Spam Emails
IRJET- Analysis and Detection of E-Mail Phishing using Pyspark
A Survey on Spam Filtering Methods and Mapreduce with SVM
OPTIMIZING HYPERPARAMETERS FOR ENHANCED EMAIL CLASSIFICATION AND FORENSIC ANA...
Detection of Spam in Emails using Machine Learning
Improved spambase dataset prediction using svm rbf kernel with adaptive boost
SPAM FILTERING SECURITY EVALUATION FRAMEWORK USING SVM, LR AND MILR
SPAM FILTERING SECURITY EVALUATION FRAMEWORK USING SVM, LR AND MILR
SPAM FILTERING SECURITY EVALUATION FRAMEWORK USING SVM, LR AND MILR
SPAM FILTERING SECURITY EVALUATION FRAMEWORK USING SVM, LR AND MILR
SPAM FILTERING SECURITY EVALUATION FRAMEWORK USING SVM, LR AND MILR
SPAM FILTERING SECURITY EVALUATION FRAMEWORK USING SVM, LR AND MILR
SPAM FILTERING SECURITY EVALUATION FRAMEWORK USING SVM, LR AND MILR
IRJET- Review on the Simple Text Messages Classification
A LITERATURE REVIEW ON PHISHING EMAIL DETECTION USING DATA MINING
Enhancing spam detection using Harris Hawks optimization algorithm
Ad

More from eSAT Journals (20)

PDF
Mechanical properties of hybrid fiber reinforced concrete for pavements
PDF
Material management in construction – a case study
PDF
Managing drought short term strategies in semi arid regions a case study
PDF
Life cycle cost analysis of overlay for an urban road in bangalore
PDF
Laboratory studies of dense bituminous mixes ii with reclaimed asphalt materials
PDF
Laboratory investigation of expansive soil stabilized with natural inorganic ...
PDF
Influence of reinforcement on the behavior of hollow concrete block masonry p...
PDF
Influence of compaction energy on soil stabilized with chemical stabilizer
PDF
Geographical information system (gis) for water resources management
PDF
Forest type mapping of bidar forest division, karnataka using geoinformatics ...
PDF
Factors influencing compressive strength of geopolymer concrete
PDF
Experimental investigation on circular hollow steel columns in filled with li...
PDF
Experimental behavior of circular hsscfrc filled steel tubular columns under ...
PDF
Evaluation of punching shear in flat slabs
PDF
Evaluation of performance of intake tower dam for recent earthquake in india
PDF
Evaluation of operational efficiency of urban road network using travel time ...
PDF
Estimation of surface runoff in nallur amanikere watershed using scs cn method
PDF
Estimation of morphometric parameters and runoff using rs & gis techniques
PDF
Effect of variation of plastic hinge length on the results of non linear anal...
PDF
Effect of use of recycled materials on indirect tensile strength of asphalt c...
Mechanical properties of hybrid fiber reinforced concrete for pavements
Material management in construction – a case study
Managing drought short term strategies in semi arid regions a case study
Life cycle cost analysis of overlay for an urban road in bangalore
Laboratory studies of dense bituminous mixes ii with reclaimed asphalt materials
Laboratory investigation of expansive soil stabilized with natural inorganic ...
Influence of reinforcement on the behavior of hollow concrete block masonry p...
Influence of compaction energy on soil stabilized with chemical stabilizer
Geographical information system (gis) for water resources management
Forest type mapping of bidar forest division, karnataka using geoinformatics ...
Factors influencing compressive strength of geopolymer concrete
Experimental investigation on circular hollow steel columns in filled with li...
Experimental behavior of circular hsscfrc filled steel tubular columns under ...
Evaluation of punching shear in flat slabs
Evaluation of performance of intake tower dam for recent earthquake in india
Evaluation of operational efficiency of urban road network using travel time ...
Estimation of surface runoff in nallur amanikere watershed using scs cn method
Estimation of morphometric parameters and runoff using rs & gis techniques
Effect of variation of plastic hinge length on the results of non linear anal...
Effect of use of recycled materials on indirect tensile strength of asphalt c...

Recently uploaded (20)

PPTX
Lecture Notes Electrical Wiring System Components
PPTX
Recipes for Real Time Voice AI WebRTC, SLMs and Open Source Software.pptx
PDF
TFEC-4-2020-Design-Guide-for-Timber-Roof-Trusses.pdf
PPTX
Geodesy 1.pptx...............................................
PPTX
IOT PPTs Week 10 Lecture Material.pptx of NPTEL Smart Cities contd
PPTX
OOP with Java - Java Introduction (Basics)
PDF
Embodied AI: Ushering in the Next Era of Intelligent Systems
PDF
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
PPTX
Welding lecture in detail for understanding
PDF
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
DOCX
573137875-Attendance-Management-System-original
PPT
CRASH COURSE IN ALTERNATIVE PLUMBING CLASS
PPTX
Construction Project Organization Group 2.pptx
PPTX
Sustainable Sites - Green Building Construction
PPTX
web development for engineering and engineering
PPTX
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...
PPTX
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
PPTX
M Tech Sem 1 Civil Engineering Environmental Sciences.pptx
PDF
R24 SURVEYING LAB MANUAL for civil enggi
PDF
Well-logging-methods_new................
Lecture Notes Electrical Wiring System Components
Recipes for Real Time Voice AI WebRTC, SLMs and Open Source Software.pptx
TFEC-4-2020-Design-Guide-for-Timber-Roof-Trusses.pdf
Geodesy 1.pptx...............................................
IOT PPTs Week 10 Lecture Material.pptx of NPTEL Smart Cities contd
OOP with Java - Java Introduction (Basics)
Embodied AI: Ushering in the Next Era of Intelligent Systems
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
Welding lecture in detail for understanding
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
573137875-Attendance-Management-System-original
CRASH COURSE IN ALTERNATIVE PLUMBING CLASS
Construction Project Organization Group 2.pptx
Sustainable Sites - Green Building Construction
web development for engineering and engineering
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
M Tech Sem 1 Civil Engineering Environmental Sciences.pptx
R24 SURVEYING LAB MANUAL for civil enggi
Well-logging-methods_new................

A multi classifier prediction model for phishing detection

  • 1. IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308 __________________________________________________________________________________________ Volume: 03 Issue: 03 | Mar-2014, Available @ http://www.ijret.org 31 A MULTI-CLASSIFIER PREDICTION MODEL FOR PHISHING DETECTION Sarju S1 , Riju Thomas2 , Emilin Shyni C3 1, 2 PG Scholar, 3 Associate Professor, Department of Computer Science and Engineering, KCG College of Technology, Tamilnadu, India Abstract Phishing is the technique of stealing personal and sensitive information from an email client by sending emails that impersonates like the ones from some trustworthy organizations. Phishing mails are a specific type of spam mails; however the effects of them are much more terrible than alternate sorts. Mostly the phishing attackers aim the clients of the financial organizations, so its detection needs high priority. Lots of research activities are done to detect the phished emails, in the proposed methodology a multi-classifier prediction model is introduced for detecting phished emails. Our contention is that solitary classifier prediction might not be satisfactory to urge the clearest picture of the phishing email; multi-classifier prediction has accuracy 99.8% with an FP rate of 0.8%. Keywords: Phishing email detection, Machine learning techniques, Multi-classifier prediction model, Majority voting ----------------------------------------------------------------------***------------------------------------------------------------------------ 1. INTRODUCTION Nowadays the emails turn into one of the generally utilized communication medium within the globe. Because of the fame of emails, the attackers utilized it to snatch the client data. Phishing messages are a particular kind of spam mail, which is used to take the individual and fiscal data from the email clients. Generally the attackers send an email that looks like legitimate messages from some reputed organizations, which lead clients to phishing sites. Phishing sites always have a user entry form, when he enters his data like user names, passwords and credit card details which in turn utilized by the attackers to do some deceitful exercises. As per the latest report issued by APWG [1] (Anti-Phishing Working Group), amount of phishing attack evolved and burgeoned in overabundance of 20% in 2013. Latest Trends Report of APWG quotes that the overall number of distinctive phishing internet sites rose to 143,353 throughout the July-September that is over the past quarter's 119,101. Heaps of researches are carried out to detect the phished mails, in which the machine learning techniques are most prominent one due the higher precision of detection. The classification algorithms developed in the learning techniques are utilized for anticipating the class of the given mail, but each classification algorithms have their own particular blemishes. In the proposed technique we implemented a multi- classification method which fuses three most accurate classifiers for foreseeing the class label. A prediction model is constructed with the classifiers which incorporate Support Vector Machine (SVM), J48 and Instance Base 1, majority voting algorithm is used to settle on the last choice in regards to the class name of the given email. The dataset holds 5260 publically accessible email corpus for train the prediction model and 500 messages are utilized as the test mails. 2. BACKGROUND Recently, a variety of anti-phishing techniques are introduced, out of which machine learning technique based approaches are most prominent one. Features are extracted from the html part and body part of the email is used by the machine learning algorithms to predict the possibility of the phishing. Chandrasekaran et.al [2] proposed phishing email detection based on the structural properties of the phished mails. A total of 25 structural features are extracted and classification of the mails is done using the Support vector Machine algorithm. Data set only contains 200 mails from the public phishing mail corpus, so the results might not be accurate. Sarju et al.[3] utilized the structural properties to detect the spam emails and used Naïve Bayes, Adaboost and Random Forest are used to measure the accuracy. From the above works it identified that the structural properties can be used to discriminate the phished emails from ham mails. A number of other novel features are also useful to identify phished emails. Bergholz et.al [4] analyzed different properties of the email to detect the phished ones. The content of the emails are evaluated for constructing the feature set and is used for phishing detection. Random Forest and SVM are used to measure the accuracy of the detection. Abu-Nimeh et al.[5] compared machine learning techniques in phishing detection, they used six different classifiers. A total of 43 features are extracted from the dataset of 2889 phishing and ham emails. They found that Random Forest outperforms
  • 2. IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308 __________________________________________________________________________________________ Volume: 03 Issue: 03 | Mar-2014, Available @ http://www.ijret.org 32 all other classifiers used in their methodology. Miyamoto et al. [6] analyzed nine machine learning techniques for detection of phishing sites; they found that Random Forest and SVM outperforms all the other classifiers. 3. PROPOSED METHOD In the proposed method structural features, content based features and element features of emails are analyzed and extracted for training the prediction model as shown in the Figure 1, Fi represents the feature set extracted from the mails and C1 and C2 the corresponding class labels. Fig 1 - Multi-Classifier Prediction system for Phishing email detection 3.1 Feature Extraction & Analysis Feature Extraction and analysis phase, extracts the different features that plays important role in detecting the phished emails. In this work, we extracted the structural features, content based features and element features from the emails. Emails are available in the Multipart Internet Mail Extension (MIME) format, an HTML parser is used to parse the mail and a HTML is tree constructed. Structural features are extracted from the HTML tree and are multi part count, non ASCII characters, non-text content and content labeling. The multipart feature divides the email into different parts. For example an email with an attachment has two parts, text content and attachment. An email in the MIME format contains a header which shows the character encoding used in that mail. This can be used to identify whether any non ASCII characters used in that mail. Table 1 shows the content label types used in this paper. Table 1 - Email Content Types Element features includes the web technologies used in the email. In this work, we extracted element features of type Boolean which indicates whether HTML, JavaScript, VB Script, XHTML, and CSS used in the mail. Finally the content based features available in the email are extracted. Totally 42 features are extracted from the email corpus and give it as a training set to the prediction model. 3.2 Prediction Model The extracted features from the feature extraction and analysis stage are used in the multi-classifier prediction model. The prediction model is built using the machine learning algorithms which includes J48, SVM and IB1, each one is capable of classifying the mails into phished or ham mails. The accuracy of the prediction can be improved by combining the classifiers. The final decision regarding the category of the mail is done through the majority voting algorithm. Different research works are done to combining classifiers [7-9]. Majority Voting is the mostly used way for combining classifiers, which count the votes for each class that are predicted by the classifiers and majority class is selected. The new confidence fi x for class i is calculated as fi x = I(maxj pji x = j)j (1) in which I() is the identifier function: I(x) =1 if x is true else I(x) will be zero. When a new mail comes the prediction model identifies its category based in the training test given. 4. EXPERIMENTAL EVALUATION The dataset holds 5260 publically accessible email corpus of phished and ham mails for train the prediction model and 500 messages are utilized as the test mails. The performance of the prediction model is analyzed using different measures like True Positive, False Positive, Accuracy and Receiver Operator Characteristics curve. File Extension MIMEType Description .txt text/plain Plain text .htm text/html Styled text in HTML format .jpg image/jpeg Picture in JPEGformat .gif image/gif Picture in GIF format .wav audio/x-wave Sound in WAVE format .mp3 audio/mpeg Music in MP3 format .mpg video/mpeg Video in MPEGformat .zip application/zip Compressed file in PK-ZIP format
  • 3. IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308 __________________________________________________________________________________________ Volume: 03 Issue: 03 | Mar-2014, Available @ http://www.ijret.org 33 The information about actual and predicted classifications done by machine learning systems is represented in the form of a confusion matrix [10] as shown in the figure 2 and accuracy is measured based on entries. Fig 2 - Confusion Matrix Accuracy = TP + TN TN + FN + FP + TP (2) If the mail is ham and it is classified as ham, it is counted as a True Positive (TP); if it is classified as phish, it is counted as a False Negative (FN). If the mail is phished and it is classified as phish, it is counted as a True Negative (TN); if it is classified as ham, it is counted as a False Positive (FP). The Figure 3 compares the FP rate obtained when the classifiers used independently and also combined using majority voting. It is shown that the multi-classification using majority voting outperforms individual classifier performance with an FP rate of 0.8% . Fig 3 - Phishing Prediction FP Rate The accuracy of the prediction model is evaluated using the equation 2. From the results, it is clearly understood that the accuracy of the multi-classification is higher than the classifiers applied individually. Multi-classification with majority voting gains an accuracy of 99.80% and is shown in the Figure 4. Fig 4 - Phishing Prediction Accuracy Comparison Another performance measure used to evaluate our proposed system is ROC [11] curve. In an ROC graph X axis plotted with FP rate and TP on Y axis. Figure 5a and 5b shows the ROC measures for the prediction model, by analyzing the graphs it is identified that the multi-classification prediction model has improved results compared to the independent classifier prediction model. (a) (b) Fig 5 – ROC Curves (a) when classifiers used independently ; (b) Multi- classification using Majority Voting Phish Ham Phish TN FN Ham FP TP Predicted Actual
  • 4. IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308 __________________________________________________________________________________________ Volume: 03 Issue: 03 | Mar-2014, Available @ http://www.ijret.org 34 5. CONCLUSIONS Our argument is that single classifier would not be adequate to urge the clearest image of the phishing email detection accuracy. The experimental results shown that proposed using multi-classifier prediction model outperforms the individual classifier based prediction models in many aspects. It preserves an accuracy of 99.8% with an FP rate of 0.8%. The ROC measure shows that the multi-classification with majority voting gives almost an ideal curve compared to independent classification algorithms. In the future work, we are planning to incorporate the topic modeling features to the feature set generation stage, because it has the capability to overcome the novel techniques used by the attackers. REFERENCES [1] APWG, Anti phishing working group. http://wwww.antiphishing.org (2013). Accessed 31 September 2013 [2] Chandrasekaran M, Narayanan M and Upadhyaya S.: Phishing email detection based on structural properties. In: Proceedings of 9th Annual NYS Cyber Security Conference, pp. 2-8 (2006). [3] Sarju S, Riju Thomas and Emilin Shyni C.: Spam Email Detection using Structural Features. International Journal of Computer Applications 89(3):pp.38-41 (2014). [4] A Bergholz, JH Chang, F Reichartz, and S Strobel.: Improved Phishing Detection using Model-Based Features. In Proceedings of the Conference on Email and Anti-Spam (CEAS), Mountain View, CA,(2008). [5] Saeed Abu-Nimeh , Dario Nappa , Xinlei Wang , Suku Nair.: A comparison of machine learning techniques for phishing detection, In. Proceedings of the anti- phishing working groups 2nd annual eCrime researchers summit, pp.60-69, (2007). [6] Daisuke Miyamoto , Hiroaki Hazeyama and Youki Kadobayashi.: An evaluation of machine learning- based methods for detection of phishing sites, Proceedings of the 15th international conference on Advances in neuro-information processing, (2008). [7] Geman D and Geman S.: Stochastic relaxation, Gibbs distribution, and the Bayesian restoration of images. IEEE Trans. on Pattern Analysis and Machine Intelligence, PAMI-6(6), pp.721–741, (1984). [8] Jain AK, Zhong Y and Lakshmanan S.: Object Matching Using Deformable Templates. IEEE Trans. Pattern Analysis and Machine Intelligence 18(3), pp.267–278, (1996) [9] Kumazawa I.: Shape extraction by cellular Hough transform, Technical report of IEICE, PRMU, pp.96– 105, (1996) [10] Provost F, Fawcett T and Kohavi R.: The case against accuracy estimation for comparing induction algorithms. In. Proceedings. ICML-98.pp. 445–453, (1998) [11] T. Fawcett, An Introduction to ROC Analysis, Pattern Recognition Letters 27(8), pp. 861-874, (2006). [12] WEKA. http://www.cs.waikato.ac.nz/ml/weka/ (2013): Accessed 31 November 2013 [13] SpamAssassin, http://spamassassin.apache.org (2013) : Accessed 31 November 2013 [14] Phishingcorpus http://monkey.org (2013), Accessed 31 November 2013. [15] Apache James. (2013) Mime4J Parser. http://james.apache.org/mime4j: Accessed 14 October 2013.