International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 05 Issue: 04 | Apr-2018 www.irjet.net p-ISSN: 2395-0072
© 2018, IRJET | Impact Factor value: 6.171 | ISO 9001:2008 Certified Journal | Page 984
ANALYSIS AND DETECTION OF E-MAIL PHISHING USING PYSPARK
1RISHIKESH B H, 2SHREEHARI A S, 3SRIHARSHA G S, 4SUNIL HUDAGE, 5REKHA K S
5Department of Computer Science and Engineering
THE NATIONAL INSTITUTE OF ENGINEERING (NIE), Mananthavady Road, Mysuru – 570008, Karnataka, India
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - Phishing is an attempt to steal personal
information by using spoofed e-mails and fraudulent web
sites to trick people into giving it up. Phishing e-mails
often carry malware links and are aimed entirely at
obtaining sensitive and valuable information. Phishing has
become increasingly sophisticated, and attacks can bypass
the filters set by anti-phishing techniques. Its impact
ranges from denial of access to e-mail to substantial
financial loss, resulting in a loss of the public's trust in
the internet. We provide a robust method to detect phishing
e-mails, which applies cross-validation techniques. The
method combines Text Analysis and Link Analysis as
countermeasures against phishing. Educational materials
reduce users' tendency to enter information into phishing
webpages.
Key Words: Naïve Bayes, PySpark, Big Data, Link
Analysis, Machine Learning, VirusTotal.
1. INTRODUCTION
Security is a key aspect of information and communication
technology. Information security is of great value to
individuals as well as to the corporate sector. Companies
and organizations need to protect their customers' and
employees' information related to business plans, financial
outcomes, product information, and the like [1]. Phishing is
one of the luring techniques used by phishers with the
intention of exploiting the personal details of unsuspecting
users. A phishing website is a mock website that looks
similar in appearance to a legitimate one but leads to a
different destination. Unsuspecting users submit their data
thinking that these websites belong to trusted financial
institutions. Big data refers to extremely large data sets
that can be analysed computationally to expose patterns
associated with human interaction [2]. The main purpose here
is to detect whether the e-mails a user receives are
legitimate or not. The goals of our paper and of our system
are: (a) to provide security and (b) accuracy in detection.
Recently, the Government of India issued an alert on the
spread of the Locky ransomware, which is being distributed
through e-mail phishing. There are three fundamental
attributes of e-mail security: Confidentiality, Integrity
and Availability [3].
2. SOFTWARE DESCRIPTION
2.1 PySpark
The Spark Python API (PySpark) exposes the Spark
programming model to Python. The Apache Spark community
released PySpark to support Python on Spark. Using PySpark,
you can work with Resilient Distributed Datasets (RDDs) from
the Python programming language.
At a high level, every Spark application consists of a driver
program that runs the user's main function and executes
various parallel operations on a cluster. The main
abstraction is the RDD; a second abstraction in Spark is
shared variables, which can be used in parallel
operations.
2.2 Natural Language Toolkit (NLTK)
The Natural Language Toolkit, or more commonly NLTK, is a
suite of libraries and programs for symbolic and statistical
natural language processing (NLP) for English, written in the
Python programming language. NLTK includes graphical
demonstrations and sample data.
NLTK is intended to support research and teaching in NLP or
closely related areas, including empirical linguistics,
cognitive science, artificial intelligence, information
retrieval, and machine learning. NLTK has been used
successfully as a teaching tool, as an individual study tool,
and as a platform for prototyping and building research
systems.
2.3 VirusTotal
The VirusTotal API lets you upload and scan files or URLs,
access finished scan reports and post automatic comments
without using the website interface. In other words, it
allows you to build simple scripts to access the
information generated by VirusTotal.
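As a rough illustration of the kind of script this enables, the sketch below prepares a URL-report request and interprets a report. It assumes the current v3 REST interface of VirusTotal (the paper does not state which API version was used); the function names, the `is_flagged` rule and its `min_engines` parameter are hypothetical, and no request is actually sent.

```python
import base64
import urllib.request

# Assumption: the public v3 REST root; the paper does not name an API version.
VT_API_ROOT = "https://www.virustotal.com/api/v3"


def vt_url_id(url):
    """Return the v3 URL identifier: URL-safe base64 of the URL without '=' padding."""
    return base64.urlsafe_b64encode(url.encode("utf-8")).decode("ascii").rstrip("=")


def build_url_report_request(url, api_key):
    """Prepare (but do not send) a GET request for the stored report on `url`."""
    return urllib.request.Request(
        f"{VT_API_ROOT}/urls/{vt_url_id(url)}",
        headers={"x-apikey": api_key},
        method="GET",
    )


def is_flagged(report, min_engines=1):
    """Hypothetical rule: flag a URL when at least `min_engines` engines
    rate it malicious or suspicious in the report's analysis statistics."""
    stats = report["data"]["attributes"]["last_analysis_stats"]
    return stats.get("malicious", 0) + stats.get("suspicious", 0) >= min_engines
```

To run it against the live service, the prepared request would be passed to `urllib.request.urlopen` with a valid API key and the JSON body decoded before calling `is_flagged`.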
2.4 Naïve Bayes
The Naive Bayes classifier is designed for use when features
are independent of one another within each class, but it
appears to work well in practice even when that
independence assumption is not valid.
It classifies data in two steps: (a) Using the training
samples, the method estimates the parameters of a
probability distribution, assuming features are
conditionally independent given the class. (b) For any
unseen test sample, the method computes the posterior
probability of that sample belonging to each class.
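The two steps above can be sketched in plain Python (not PySpark) as a word-count Naïve Bayes model. Laplace smoothing, the whitespace tokenizer and the four toy e-mails are assumptions made for illustration; they are not details reported in this paper.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training set of (text, label) pairs.
TOY_EMAILS = [
    ("verify your account password now", "spam"),
    ("urgent click link to verify account", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]


def train(samples):
    """Step (a): estimate class counts and per-class word counts."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in samples:
        words = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab


def posterior(text, model):
    """Step (b): posterior probability of each class for an unseen text."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    log_scores = {}
    for label, n in class_counts.items():
        # Laplace (add-one) smoothing over the training vocabulary.
        denom = sum(word_counts[label].values()) + len(vocab)
        score = math.log(n / total)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)
        log_scores[label] = score
    # Normalise the log scores into probabilities that sum to 1.
    top = max(log_scores.values())
    weights = {label: math.exp(s - top) for label, s in log_scores.items()}
    norm = sum(weights.values())
    return {label: w / norm for label, w in weights.items()}


model = train(TOY_EMAILS)
scores = posterior("please verify your password", model)
```

Working in log space avoids numerical underflow when many word probabilities are multiplied together.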
3. LITERATURE SURVEY
Tarnnum et al. [4] conducted two studies to observe
the security threats for big data: (a) the first study was
carried out on the Enron e-mail dataset (which contains about half a
million e-mails) to investigate the security challenges of big
data in the field of e-mail communication; and (b) the second
study was carried out on 35 undergraduate students to
observe how phishing e-mails generated on the basis of users'
intention or behavior may break the security system.
Liping Ma et al. [5] presented a method to build a robust
classifier to detect phishing e-mails using hybrid features,
with features selected by information gain. They used
10-fold cross-validation to build an initial classifier that
performs well. The experiment also analyses the quality of
each feature using information gain, and the best feature
set is selected after a recursive learning process. The
experimental results show that the selected features
perform as well as the original features.
Sa'id Abdullah Al-Saaidah [6] discusses and compares various
classification algorithms, such as Naïve-Bayes, Decision
Tree (DT), Logistic Regression, Classification and
Regression Trees and Sequential Minimal Optimization (SMO).
The experiment was executed using the WEKA tool on a dataset
of 4800 e-mails, 2400 phishing e-mails and 2400 legitimate
e-mails, represented by 47 features of the e-mail
structure.
4. SYSTEM ANALYSIS
4.1 Existing System
Anti-Phishing using Machine Learning: This software is a
conventional "Network Security Filter that stops you from
visiting suspicious websites with a Twist". The software is
never updated. The existing system only detects spam e-mails
and puts them into the spam folder. This is done by noticing
a domain that constantly sends spam messages and
blacklisting that sender.
Phishing Domain Detection with Machine Learning: Uniform
Resource Locators (URLs), which are created to address web
pages, are examined; this is time-inefficient, and phishers
are intelligent enough to bypass the barriers.
4.2 Proposed System
PySpark is used to implement the Naïve-Bayes algorithm, which
is faster in execution than a conventional machine
learning model. Link analysis is done by extracting the web
page and checking whether the link contains any malicious or
phishing content. To improve the accuracy of link
analysis, we use the VirusTotal API to detect phishing
sites.
5. METHODOLOGY
The two main operations performed are Text Classification
and Link Analysis. Text Classification includes data cleaning,
data preprocessing, the bag-of-words model and the Naïve-Bayes
classifier. Link Analysis includes page extraction and use of
the VirusTotal API. The steps involved are:
a) Gathering the datasets: we collected 48000 e-mails
from different available datasets.
b) Data cleaning: the e-mails gathered in the above
step are cleaned and converted into TSV files.
c) Data preprocessing: before text classification, we
create a bag-of-words model and apply a count vectorizer
to map the words to count vectors.
d) Using the Naïve-Bayes algorithm, we classify whether the
texts are spam or ham.
e) We extract the links present in the e-mails and apply Link
Analysis, which comprises 2 steps:
1) Extract the contents of the web pages and check whether any
form is present and whether it asks for personal information.
2) The VirusTotal API is used to check for the presence of
phishing links.
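Step e) can be sketched in plain Python as follows: links are pulled out of the e-mail body with a regular expression, and a downloaded page is scanned for forms that ask for sensitive input. The URL pattern, the `SENSITIVE_HINTS` list and all function and class names are hypothetical choices made for illustration, and fetching the page itself is left out.

```python
import re
from html.parser import HTMLParser

# Hypothetical pattern: http(s) URLs up to whitespace, quotes or angle brackets.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")
# Hypothetical hints that an input field asks for personal information.
SENSITIVE_HINTS = ("password", "passwd", "ssn", "card", "cvv", "account")


def extract_links(email_body):
    """Return the URLs found in an e-mail body, without trailing punctuation."""
    return [url.rstrip(".,;:!?)") for url in URL_PATTERN.findall(email_body)]


class FormInspector(HTMLParser):
    """Record whether a page has a form and which inputs look sensitive."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.has_form = False
        self.sensitive_fields = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "form":
            self.in_form = True
            self.has_form = True
        elif tag == "input" and self.in_form:
            name = (attributes.get("name") or "").lower()
            kind = (attributes.get("type") or "").lower()
            if kind == "password" or any(hint in name for hint in SENSITIVE_HINTS):
                self.sensitive_fields.append(name or kind)

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


def asks_for_personal_info(page_html):
    """True if the page contains a form with at least one sensitive input."""
    inspector = FormInspector()
    inspector.feed(page_html)
    return bool(inspector.sensitive_fields)
```

Each link returned by `extract_links` would then be fetched and passed to `asks_for_personal_info`, and separately submitted to the VirusTotal check of step 2.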
5.1 SYSTEM DESIGN
The phishing attack and detection system is broken down into
sub-modules: web portal, database, personal
information, graphical statistics and the e-mail phishing
detection system.
Fig. 1 Sub-modules of Phishing Detection System
The e-mail phishing detection system is broken down into
several processes, such as link analysis and the Naïve-Bayes
classifier, and the resulting probability value is compared
with the threshold probability value.
Fig. 2 E-mail Phishing Detection System
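The final decision step can be sketched as a simple rule. How the link-analysis verdict and the classifier probability are combined is an assumption here (either signal alone marks the e-mail), and the threshold of 0.5 is a hypothetical value, since the value used is not reported.

```python
# Hypothetical threshold; the value used in the system is not reported.
PHISHING_THRESHOLD = 0.5


def classify_email(spam_probability, link_flagged, threshold=PHISHING_THRESHOLD):
    """Label an e-mail from the classifier probability and the link verdict.

    Assumption: a flagged link or a probability at or above the threshold
    is enough on its own to label the e-mail as phishing.
    """
    if link_flagged or spam_probability >= threshold:
        return "phishing"
    return "legitimate"
```

Lowering the threshold catches more phishing e-mails at the cost of more false alarms, so in practice it would be tuned on held-out data.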
6. IMPLEMENTATION
6.1 Naïve-Bayes Algorithm
It is a classification technique based on Bayes' Theorem with
an assumption of independence among predictors. In simple
terms, a Naive Bayes classifier assumes that the presence of
a particular feature in a class is unrelated to the presence of
any other feature. For example, a fruit may be considered to
be an apple if it is red, round, and about 3 inches in diameter.
Even if these features depend on each other or upon the
existence of the other features, all of these properties
independently contribute to the probability that this fruit is
an apple, and that is why it is known as 'Naive'.
The Naive Bayes model is easy to build and particularly useful
for very large data sets. Along with its simplicity, Naive Bayes
is known to perform competitively with, and sometimes
outperform, more sophisticated classification methods. Bayes'
theorem provides a way of calculating the posterior
probability P(c|x) from P(c), P(x) and P(x|c). Look
at the equation below [9]:
$$P(c|x) = \frac{P(x|c)\,P(c)}{P(x)}$$ ….[9]
Above,
• P(c|x) is the posterior probability of the class (c, target)
given the predictor (x, attributes).
• P(c) is the prior probability of the class.
• P(x|c) is the likelihood, which is the probability of the
predictor given the class.
• P(x) is the prior probability of the predictor.
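A small worked example may make the four quantities concrete. The numbers below are hypothetical and chosen only for illustration: c is the class "spam" and x is the event that a given word appears in the e-mail.

```python
def bayes_posterior(prior_c, likelihood_x_given_c, evidence_x):
    """Bayes' theorem: P(c|x) = P(x|c) * P(c) / P(x)."""
    return likelihood_x_given_c * prior_c / evidence_x


# Hypothetical values, not measured on the paper's dataset.
p_spam = 0.4              # P(c): prior probability of spam
p_word_given_spam = 0.5   # P(x|c): the word appears in half of the spam
p_word_given_ham = 0.05   # the word appears in 5% of legitimate e-mails

# P(x) by the law of total probability over the two classes.
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# P(c|x): probability that an e-mail containing the word is spam.
p_spam_given_word = bayes_posterior(p_spam, p_word_given_spam, p_word)
```

With these values P(x) is 0.23 and the posterior is 0.2 / 0.23, about 0.87, so seeing the word raises the spam probability from 0.4 to roughly 0.87.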
7. CONCLUSIONS AND FUTURE ENHANCEMENTS
Phishing is a form of criminal conduct that poses increasing
threats to consumers, financial institutions, and commercial
enterprises. Because phishing shows no sign of abating, and
indeed is likely to continue in newer and more sophisticated
forms, law enforcement, other government agencies, and the
private sector will need to cooperate more
closely than ever in their efforts to combat phishing, through
improved public education, prevention, authentication, and
national and cross-border enforcement efforts.
The use of Big Data analytics to detect phishing e-mails was
developed in response to the increased threat posed by
malicious e-mails that closely resemble legitimate ones. This
methodology not only helps detect phishing messages but
also makes it easier to detect them even if
they closely mimic legitimate ones. This would,
however, not be possible without knowledge of big data and
prior knowledge of current threats. Future enhancements
include detecting download links to files and improving the
accuracy of detection.
REFERENCES
[1] Every company needs to have a security program
(2008) [Online]. Available:
https://www.appliedtrust.com/resources/security/everycompany-needs-to-have-a-security-program
[2] Oxford, "Big data," in Oxford Dictionary, Oxford
University Press, 2016. [Online]. Available:
http://www.oxforddictionaries.com/definition/English/bigdata
[3] P. Cocca, "Email security threats," SANS Institute, USA,
Rep. Version 1.4b Option 1, pp. 1-16, Sept. 20, 2004.
[4] Tarnnum Zaki, Md. Sami Uddin, Md. Mahedi Hasan,
"Security Threats for Big Data", IEEE-2017.
[5] Liping Ma, Paul Watters, Simon Brown, "Detecting
Phishing E-mails Using Hybrid Features", workshop on
Ubiquitous, Autonomic and Trusted Computing.
[6] Sa'id Abdullah Al-Saaidah, "Detecting Phishing E-mails
Using Machine Learning", MEU-2017.
[7] J. Crowe. (2016). Phishing by the numbers: Must-know
phishing statistics 2016 [Online]. Available:
https://blog.barkly.com/phishingstatistics-2016
[8] New EDRM Enron email dataset (n.d.) [Online].
Available:
http://spamassassin.apache.org/old/publiccorpus/
[9] https://www.analyticsvidhya.com/blog/2017/09/naive-bayes-explained/
[10] Ron Zacharski, "A Programmer's Guide to Data Mining",
a book which gives a detailed description of the Naïve-Bayes
algorithm and unstructured text.
[11] Steven Bird, Ewan Klein and Edward Loper, "Natural
Language Processing with Python", a book on the NLTK
toolkit.

IRJET- Analysis and Detection of E-Mail Phishing using Pyspark

  • 1. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 05 Issue: 04 | Apr-2018 www.irjet.net p-ISSN: 2395-0072 © 2018, IRJET | Impact Factor value: 6.171 | ISO 9001:2008 Certified Journal | Page 984 ANALYSIS AND DETECTION OF E-MAIL PHISHING USING PYSPARK 1RISHIKESH B H, 2SHREEHARI A S, 3SRIHARSHA G S,4SUNIL HUDAGE, 5REKHA K S 5Department of Computer Science and Engineering THE NATIONAL INSTITUTE OF ENGINEERING(NIE), Mananthavady Road, Mysuru – 570008, Karnataka, India ---------------------------------------------------------------------***--------------------------------------------------------------------- Abstract - Phishing is an act that attempts to steal information, personal data by using spoofed emails and fraudulent web sites to trick people into giving up personal information. Phishing E-mails involve malware links and is totally committed to obtain sensitive & valuable information. Phishing has become more and more complicated and sophisticated and attack can bypass the filter set by anti- phishing techniques. Phishing impact ranges from denial of access to e-mail to substantial financial loss, resulting loss of public's trust in internet. We provide robust method to detect phishing E-mails which performs some cross-validations techniques. The method includes Text Analysis, Link Analysis to encounter phishing countermeasures. Educational materials reduced user’s tendency to enter information into phishing webpages. Key Words: Naïve Bayes, PySpark, Big Data, Link Analysis, Machine Learning, Virus Total. 1. INTRODUCTION Security is a key aspect in the field of information and communication technology. Informationsecuritybearsgreat value to personal as well as corporate sectors. different companies and organizations need to protect their customers and employee’s information related to business plans, financial outcomes, product information, and the like [1]. 
Phishing is one of the luring techniquesusedbyphishing artist in the intention of exploiting the personal details of unsuspected users. Phishing website is a mock website that lookssimilar in appearance but different in destination. The unsuspected users post their data thinking that these websites come from trusted financial institutions. Big data refers to an enormous amount of dataset that is able to expose patterns associated with human interaction through computational analysis [2]. The main purpose here is to detect the e-mails which user receives is legitimate or not. The goals of our paper as well as system are: (a) To provide security (b) Accuracy in detection. Recently, Govt of India issued alert on spread of Locky Ransomware which is being spread throughe-mailphishing. There are three fundamental attributes of email security – Confidentiality, Integrity and Availability [3]. 2. SOFTWARE DESCRIPTION 2.1 PySpark The Spark Python API (PySpark) exposes the Spark programming model to Python. To support Python with Spark, Apache Spark community released a tool, PySpark. Using PySpark, you can work with RDDs in Python programming language also. At a high level, every Spark application consists of a driver program that runs the user’s main function and executes variousparallel operations on a cluster.Asecondabstraction in Spark is shared variables that can be used in parallel operations. 2.2 Natural Language Toolkit (NLTK) The Natural Language Toolkit, or more commonly NLTK,isa suite of libraries and programs for symbolic and statistical natural language processing (NLP) for English writteninthe Python programming language. NLTK includes graphical demonstrations and sample data. NLTK is intended to support research and teachinginNLPor closely related areas, including empirical linguistics, cognitive science, artificial intelligence, information retrieval, and machine learning. 
NLTK has been used successfully as a teaching tool, as an individual study tool, and as a platform for prototyping and building research systems. 2.3 Virus Total The Virus Total API lets you upload and scan files or URLs, accessfinished scan reports and make automatic comments without the need of using the website interface. In other words, it allows you to build simple scripts to access the information generated by Virus Total. 2.4 Naïve Bayes The Naive Bayes classifier is designed for use when features are independent of one another within each class, but it appears to work well in practice even when that independence assumption is not valid. It classifies data in two steps (a) Using the training samples, the method estimates the parameters of a probability distribution, assuming features are conditionally independent given the class. (b) For any unseen testsample, the method computes the posterior probability of that sample belonging to each class. 3. LITERATURE SURVEY Tarnnum et al., [4] have conducted two studies to observe the security threats for big data: (a) first study was carried out on Enron email dataset (that contains about half a
  • 2. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 05 Issue: 04 | Apr-2018 www.irjet.net p-ISSN: 2395-0072 © 2018, IRJET | Impact Factor value: 6.171 | ISO 9001:2008 Certified Journal | Page 985 million of emails) to investigate the securitychallengesofbig data in the field of email communication; and (b) second study was carried out on 35 undergraduate students to observe how phishing email generation based on users’ intention or behavior may break the security system Liping Ma et al., [5] has presented an approach to detect phishing e-mails using hybrid features and have presenteda method to build a robust classifier to detect phishing emails using hybrid features and to select features using information gain, experimented on 10 cross-validations to build an initial classifier which performs well. The experiment also analyses the quality of each feature using information gain and best feature set is selected after a recursive learning process. Experimental result shows the selected features perform as well as the original features. Sa’id Abdullah Al-Saaidah., [6], through this research, varied classification algorithms are discussed and compared, such as; Naïve-Bayes, Decision Tree (DT), Logistic Regression, Classification and Regression Trees and Sequential Minimal Optimization (SMO). The experiment was executed using WEKA Tool on a dataset of 4800 Email, 2400phishingemails and 2400 legitimate emails represented the 47 features of the email structure 4.SYSTEM ANALYSIS 4.1 Existing System Anti - Phishing using Machine Learning: This software is the normal "Network Security Filter that stopsyou from visiting suspicious websites with a Twist". Their software will never update. The existing system only detects spam e-mails and put into the spam folder. This is done by noticing a domain constantly sending spam messagesandblacklistsuchsender. 
Phishing Domain Detection with Machine Learning: This approach examines the Uniform Resource Locator (URL) used to address web pages. It is time-inefficient, and phishers are clever enough to bypass its barriers.
4.2 Proposed System
We use PySpark to implement the Naïve-Bayes algorithm; because Spark distributes the computation, it typically executes faster on large datasets than a conventional single-machine machine learning model. Link analysis is performed by extracting the web page behind each link and checking it for malicious or phishing content. To improve the accuracy of link analysis, we use the VirusTotal API to detect phishing sites.
5. METHODOLOGY
The two main operations performed are Text Classification and Link Analysis. Text Classification includes data cleaning, data preprocessing, the bag-of-words model and the Naïve-Bayes classifier. Link Analysis includes page extraction and use of the VirusTotal API. The steps involved are:
a) Gathering the datasets: we collected 48000 e-mails from different available datasets.
b) Data cleaning: the e-mails gathered in the previous step are cleaned and converted into TSV files.
c) Data preprocessing: before text classification, we create a bag-of-words model and apply a count vectorizer to turn the words into numeric term counts.
d) Classification: using the Naïve-Bayes algorithm, we classify each text as spam or ham.
e) Link analysis: we extract the links present in the e-mails and apply Link Analysis, which comprises two steps:
1) Extract the contents of the web page and check whether any form is present and whether it asks for personal information.
2) Use the VirusTotal API to check for the presence of phishing links.
5.1 SYSTEM DESIGN
The phishing attack and detection system is broken down into sub-modules: web portal, database, personal information, graphical statistics and the e-mail phishing detection system.
Fig. 1 Sub-modules of Phishing Detection System
The e-mail phishing detection system is broken down into several processes, namely link analysis and the Naïve-Bayes classifier; the resulting probability value is compared with the threshold probability value.
Fig. 2 E-mail Phishing Detection System
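The text classification and link analysis steps of the methodology can be sketched in plain Python. This is only an illustrative stand-in, not the PySpark implementation described in this paper: it uses the standard library rather than Spark, and it assumes a multinomial Naïve-Bayes model with Laplace smoothing, which the paper does not specify. The names `train_nb`, `spam_probability`, `extract_links` and `FormFinder`, the toy e-mails, the password-field check and the 0.5 threshold are all hypothetical. The VirusTotal lookup is left out because it needs an API key and network access.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train_nb(samples):
    """Build bag-of-words counts per class from (text, label) pairs."""
    word_counts = {}        # label -> Counter of word frequencies
    doc_counts = Counter()  # label -> number of training e-mails
    for text, label in samples:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab


def spam_probability(model, text, positive="spam"):
    """Return P(positive | text) via Bayes' theorem (Laplace smoothing assumed)."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(doc_counts[label] / total_docs)  # log prior P(c)
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        log_scores[label] = score
    # Normalise over the classes; the shared evidence term P(x) cancels out.
    top = max(log_scores.values())
    weights = {label: math.exp(s - top) for label, s in log_scores.items()}
    return weights[positive] / sum(weights.values())


def extract_links(text):
    """Extract http(s) links from the body of an e-mail."""
    return re.findall(r"https?://[^\s<>\"']+", text)


class FormFinder(HTMLParser):
    """Flag pages containing a form with a password field (a hypothetical
    proxy for 'asks for personal information')."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.asks_for_password = False

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.in_form = True
        elif tag == "input" and self.in_form:
            if (dict(attrs).get("type") or "").lower() == "password":
                self.asks_for_password = True

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


if __name__ == "__main__":
    training = [  # toy data, not the 48000-e-mail dataset
        ("verify your account password now", "spam"),
        ("urgent click the link to claim your prize", "spam"),
        ("meeting notes attached for tomorrow", "ham"),
        ("lunch tomorrow with the project team", "ham"),
    ]
    model = train_nb(training)
    email = "Urgent: verify your password at http://example.test/login"
    threshold = 0.5  # hypothetical threshold probability
    verdict = "phishing" if spam_probability(model, email) > threshold else "legitimate"
    print(verdict, extract_links(email))
```

In a fuller pipeline, each extracted link would be fetched, its HTML passed to `FormFinder.feed`, and the URL additionally submitted to VirusTotal before the final verdict is made.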
6. IMPLEMENTATION
6.1 Naïve-Bayes Algorithm
It is a classification technique based on Bayes' Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or upon the existence of the other features, all of these properties are treated as contributing independently to the probability that this fruit is an apple, which is why the method is known as 'Naive'.
The Naive Bayes model is easy to build and particularly useful for very large data sets. Despite its simplicity, Naive Bayes can perform competitively with far more sophisticated classification methods on some tasks.
Bayes' theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x) and P(x|c) [9]:
$$P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)}$$
where
• P(c|x) is the posterior probability of the class (c, target) given the predictor (x, attributes).
• P(c) is the prior probability of the class.
• P(x|c) is the likelihood, which is the probability of the predictor given the class.
• P(x) is the prior probability of the predictor.
7. CONCLUSIONS AND FUTURE ENHANCEMENTS
Phishing is a form of criminal conduct that poses increasing threats to consumers, financial institutions, and commercial enterprises.
Because phishing shows no sign of abating, and indeed is likely to continue in newer and more sophisticated forms, law enforcement, other government agencies, and the private sector will need to cooperate more closely than ever in their efforts to combat phishing, through improved public education, prevention, authentication, and national and cross-border enforcement efforts.
The use of Big Data analytics to detect phishing e-mails was developed in response to the increased threat posed by malicious e-mails that closely resemble legitimate ones. This methodology not only helps detect phishing messages but also makes it easier to detect them even if they more closely mimic legitimate ones. This would not be possible, however, without knowledge of big data and of current threats. Future enhancements include detection of download links to files and improving the accuracy of detection.
REFERENCES
[1] Every company needs to have a security program (2008) [Online]. Available: https://www.appliedtrust.com/resources/security/everycompany-needs-to-have-a-security-program
[2] Oxford, "Big data," in Oxford Dictionary, Oxford University Press, 2016. [Online]. Available: http://www.oxforddictionaries.com/definition/English/bigdata
[3] P. Cocca, "Email security threats," SANS Institute, USA, Rep. Version 1.4b Option 1, pp. 1-16, Sept. 20, 2004.
[4] Tarnnum Zaki, Md. Sami Uddin, Md. Mahedi Hasan, "Security Threats for Big Data", IEEE-2017.
[5] Liping Ma, Paul Watters, Simon Brown, "Detecting Phishing E-mails Using Hybrid Features", workshop on Ubiquitous, Autonomic and Trusted Computing.
[6] Sa'id Abdullah Al-Saaidah, "Detecting Phishing E-mails Using Machine Learning", MEU-2017.
[7] J. Crowe. (2016). Phishing by the numbers: Must-know phishing statistics 2016 [Online]. Available: https://blog.barkly.com/phishingstatistics-2016
[8] New EDRM Enron email dataset (n.d.) [Online]. Available: http://spamassassin.apache.org/old/publiccorpus/
[9] https://www.analyticsvidhya.com/blog/2017/09/naive-bayes-explained/
[10] Ron Zacharski, "A Programmer's Guide to Data Mining", a book which gives a detailed description of the Naïve-Bayes algorithm and unstructured text.
[11] Steven Bird, Ewan Klein and Edward Loper, "Natural Language Processing with Python", a book on the NLTK toolkit.