Real Time classification of
malicious URLs
Daniyar Mukhanov, Chandan Gowda
Introduction
- Malicious software in Online Social Networks (OSN)
- Malicious web sites are a top-3 threat to enterprise security
- Koobface virus: an anagram of the word “Facebook”
Koobface
Twitter
Cyber criminals can piggyback on events to share malicious URLs
Aim of paper
Develop a real-time machine classification system to distinguish between malicious
and benign URLs within seconds of the URL being clicked
Train several machine classification models on data gathered during two large
sporting events:
- Super Bowl
- Cricket World Cup
Related Work
- Malware propagation and Social networks
- Classifying malicious web pages
Malware propagation and Social networks
- Low degree of connections is not an obstacle
- Highly clustered networks slow propagation
- Large-scale events are ideal for spreading malware
Classifying malicious webpages
Previous work used static analysis of scripts embedded within a web page
Static code analysis to detect evasive malware
Honeypots to interact with malicious content and anti-virus to analyse the
malicious content
Static code vs. run-time analysis
Data collection
American Super Bowl: training data
Cricket World Cup: test data
- #superbowlXLIX: 122,542 URL-containing tweets
- #CWC15: 7,961 URL-containing tweets
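The counts above refer to tweets that contain at least one URL. A minimal sketch of such a filter follows, assuming the tweets are already available as plain text strings. The helper name `extract_urls`, the sample tweets and the simplified regular expression are illustrative assumptions, not the authors' collection method.

```python
import re
from typing import List

# Simplified pattern for http(s) links (an assumption; real tweets
# usually carry shortened links).
URL_PATTERN = re.compile(r"https?://\S+")


def extract_urls(tweet_text: str) -> List[str]:
    """Return every http(s) URL found in a tweet's text."""
    return URL_PATTERN.findall(tweet_text)


# Hypothetical sample tweets, for illustration only.
tweets = [
    "Great catch! #superbowlXLIX",
    "Watch the replay here http://t.co/abc123 #superbowlXLIX",
    "Live scores https://t.co/xyz789 and http://t.co/def456 #CWC15",
]

# Keep only the tweets that contain at least one URL.
url_tweets = [t for t in tweets if extract_urls(t)]
```

Each extracted URL would then be passed on for annotation as malicious or benign.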
Identifying malicious URLs
- Client-side honeypot system
- Low interaction honeypots and high interaction honeypots
- The Capture HPC toolkit
- Each URL is visited for 5 minutes
Architecture for suspicious URL annotation
- Capture HPC operates in a VM
- Users can specify their own omission or inclusion rules
Sampling and Feature Identification
• Data was collected from Twitter with the help of Tweepy.
• Data from one event is used to train a classifier and data from another event is
used to test the model’s generalizability.
• The Super Bowl training data contained 1,000 malicious and 1,000 benign URLs.
• The Cricket World Cup testing data contained 891 malicious and 1,100 benign URLs.
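Drawing an equal number of URLs per class, as in the Super Bowl training set, can be sketched as below. The pool of URLs is synthetic, and the function name `balanced_sample` and its parameters are assumptions made for illustration; the slides do not say how the authors drew their sample.

```python
import random
from typing import Dict, List, Tuple


def balanced_sample(
    urls_by_label: Dict[str, List[str]], n_per_class: int, seed: int = 0
) -> List[Tuple[str, str]]:
    """Draw n_per_class URLs from each label without replacement."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    sample: List[Tuple[str, str]] = []
    for label in sorted(urls_by_label):
        chosen = rng.sample(urls_by_label[label], n_per_class)
        sample.extend((url, label) for url in chosen)
    return sample


# Synthetic pool standing in for annotated Super Bowl URLs (hypothetical).
pool = {
    "malicious": [f"http://example.test/m/{i}" for i in range(1500)],
    "benign": [f"http://example.test/b/{i}" for i in range(3000)],
}

training_set = balanced_sample(pool, n_per_class=1000)
```

The test set from the second event is kept entirely separate, so the evaluation measures how well the model generalizes to an unseen event.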
Sampling and Feature Identification
- 80% of URLs from the Cricket World Cup were found to be malicious
Metrics:
- CPU
- Connection established
- Port Number
- Process ID
- Remote IP
- Network Interface
- Bytes sent/received
Baseline Model Selection
Data modelling activity is intended for:
• Extracting features from machine activity that help predict malicious behaviour during
an interaction with a URL
• Connecting machine activity to malicious behaviour
• Generative vs. discriminative models
• The acquired data can include logs of machine activity even during an idle system state.
• Hence the logs are likely to contain noise as well as malicious behaviour.
Statistics for Training and Test Datasets
● High variance in mean recorded values
for CPU usage, bytes/packets
sent/received and ports used.
● But the standard deviation is very similar for
both data sets.
Baseline Model Selection
• The datasets contain a well-balanced number of malicious and benign samples, but the
logged machine activity itself is largely benign.
• This could have an impact on the effectiveness of a discriminative classifier.
• Decision boundaries must be identified where the inputs may not be linearly separable.
• So in this case, a generative model suits better.
Choosing classifiers
Generative Models
1. Bayesian Classifier
2. Naïve Bayesian Classifier
Discriminative Models
1. J48 Decision Tree
2. Multi-Layer Perceptron (MLP)
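To make the generative side of this list concrete, here is a minimal Gaussian naive Bayes sketch in plain Python. It models how each class generates feature values and then picks the class with the highest posterior. It is an illustration under stated assumptions, not the authors' implementation: the class name, the two features (CPU percentage and bytes received) and all numbers are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


class GaussianNaiveBayes:
    """Tiny naive Bayes: features are assumed independent given the class."""

    def fit(
        self, rows: Sequence[Sequence[float]], labels: Sequence[str]
    ) -> "GaussianNaiveBayes":
        grouped: Dict[str, List[Sequence[float]]] = defaultdict(list)
        for row, label in zip(rows, labels):
            grouped[label].append(row)
        self.priors: Dict[str, float] = {}
        self.stats: Dict[str, List[Tuple[float, float]]] = {}
        for label, members in grouped.items():
            self.priors[label] = len(members) / len(rows)
            per_feature = []
            for column in zip(*members):
                mu = sum(column) / len(column)
                var = sum((x - mu) ** 2 for x in column) / len(column)
                per_feature.append((mu, var + 1e-9))  # floor avoids zero variance
            self.stats[label] = per_feature
        return self

    def predict(self, row: Sequence[float]) -> str:
        def log_posterior(label: str) -> float:
            # log prior plus the sum of per-feature Gaussian log-likelihoods
            total = math.log(self.priors[label])
            for x, (mu, var) in zip(row, self.stats[label]):
                total += -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
            return total

        return max(self.priors, key=log_posterior)


# Hypothetical machine-activity rows: [cpu_percent, bytes_received].
rows = [[5, 200], [7, 300], [6, 250], [40, 9000], [55, 12000], [48, 10000]]
labels = ["benign"] * 3 + ["malicious"] * 3
model = GaussianNaiveBayes().fit(rows, labels)
```

A discriminative model such as a decision tree or an MLP would instead learn the boundary between the two classes directly, without modelling how the feature values are distributed within each class.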
Baseline Model Results: Generative Models
The low error rates at t=60 in the Bayesian model during the training phase suggest:
1. The features that we’re using to build the models are predictive of malicious activities
2. Malicious activities are occurring within the first 60 seconds of interaction.
3. There are conditional dependencies between variables.
Baseline Model Results: Discriminative Models
• MLP has a precision of 0.720 at t=30, only slightly below its optimum level, which demonstrates the model’s ability to
reduce false positives early on.
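Precision is the fraction of URLs flagged as malicious that really are malicious, which is why a high value means few false positives. A short sketch of the calculation follows; the counts are hypothetical and chosen only so that the result matches the 0.720 quoted above.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of flagged items that are truly malicious: TP / (TP + FP)."""
    flagged = true_positives + false_positives
    if flagged == 0:
        return 0.0  # nothing was flagged, so precision is reported as 0 here
    return true_positives / flagged


# Hypothetical counts: 72 correct alerts and 28 false alarms.
example_precision = precision(true_positives=72, false_positives=28)
```

Fewer false alarms raise the value towards 1.0.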
Classifier Performance over time
● This chart shows correctly classified
instances incrementally over time.
● Discriminative models outperform generative
models.
● This suggests that certain malicious activities
are linearly separable from benign behaviour.
● The Naïve Bayesian model fails to perform
well.
● The MLP model outperformed the rest of the
classifiers.
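The quantity plotted in such a chart, the share of correctly classified instances at each elapsed time t, can be computed as sketched below. The function name `accuracy_by_time` and the labelled predictions are hypothetical and do not reproduce the paper's results.

```python
from typing import Dict, List, Tuple


def accuracy_by_time(
    predictions_by_t: Dict[int, List[Tuple[str, str]]]
) -> Dict[int, float]:
    """Map each elapsed time t (seconds) to the fraction of correct predictions."""
    return {
        t: sum(predicted == actual for predicted, actual in pairs) / len(pairs)
        for t, pairs in sorted(predictions_by_t.items())
    }


# Hypothetical (predicted, actual) pairs at two checkpoints.
predictions_by_t = {
    30: [
        ("malicious", "malicious"),
        ("benign", "malicious"),
        ("benign", "benign"),
        ("malicious", "benign"),
    ],
    60: [
        ("malicious", "malicious"),
        ("malicious", "malicious"),
        ("benign", "benign"),
        ("malicious", "benign"),
    ],
}

accuracy = accuracy_by_time(predictions_by_t)
```

Repeating the calculation for each classifier gives one curve per model, which is how the models can be compared at each point in time.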
Model Analysis
● The MLP produced 9 hidden nodes; the table
shows the weightings given to each
class (Benign/Malicious).
● Node 9 stands out with a higher weight
for malicious behaviour.
NODE WEIGHTS BY CLASS
Model Analysis
● Node 9 holds the highest value for the bytes received
variable.
● Compare it with Node 3 for bytes sent/received
and packets sent/received.
● This is an interesting finding, as Node 9
is associated with malicious links.
● The most important discovery is the connection
attribute, which is weighted highly for Node 1.
● Remote IP and bytes sent then also
rise sharply, which is suggestive of an attack.
MLP ANALYSIS
Sampled learning
Correctly classified instances with sampled
training data
Conclusion
- The endpoint of a URL is not clear from a tweet
- The MLP model performed best on unseen data (72%)
- The Bayesian approach performed best in the early stages of interaction (66%)
- Twitter recently introduced new policies to protect users from harm.

Editor's Notes

  • #9: A snapshot of the memory, executables and registry of the honeypot computer is recorded before crawling a site. After visiting the site, the state of memory, executables, and registry is recorded and compared to the previous snapshot. The changes are analyzed to determine if the visited site installed any malware onto the client honeypot computer.
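The before/after comparison described in this note can be sketched as a set difference over recorded state. This is an illustration only, not Capture HPC's implementation: the `Snapshot` structure, the helper names and the sample values are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Snapshot:
    """Hypothetical record of honeypot state at one moment."""

    processes: Set[str] = field(default_factory=set)
    files: Set[str] = field(default_factory=set)
    registry_keys: Set[str] = field(default_factory=set)


def diff_snapshots(before: Snapshot, after: Snapshot) -> Dict[str, Set[str]]:
    """Return items present after the visit that were absent before it."""
    return {
        "processes": after.processes - before.processes,
        "files": after.files - before.files,
        "registry_keys": after.registry_keys - before.registry_keys,
    }


def has_changes(changes: Dict[str, Set[str]]) -> bool:
    """True if the visit added any process, file or registry key."""
    return any(changes.values())


# Hypothetical state before and after visiting a URL.
before = Snapshot(processes={"explorer.exe"}, files={"C:/Users/a/notes.txt"})
after = Snapshot(
    processes={"explorer.exe", "dropper.exe"},
    files={"C:/Users/a/notes.txt"},
)
changes = diff_snapshots(before, after)
```

A real system would also need to account for removed or modified items and for changes the user has chosen to exclude or include by rule.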