© Copyright 2015 Pivotal. All rights reserved.
Data Science Driven Malware Detection
Malicious Domain Association
Anirudh Kondaveeti, PhD
Principal Data Scientist
Project Goal
• Goal: Find domains that have time- and user-based co-occurrence relationships, to aid the detection of coordinated network attacks.
• Example: Domain A is a watering hole. It redirects users to an exploit kit at Domain B within a short time window.
– B is relatively unknown: visiting B is a low-frequency (low-support) event.
– B is almost always reached by redirection from A: the conditional probability (confidence) of an initial visit to A, given that B is visited later on, is high.
[Figure: attack flow. User visits watering hole domain A → A redirects to domain B → domain B hosts the exploit kit → user machine compromised.]
Data Sources & Preprocessing
• Historical Proxy Logs
– Information about “who is accessing which website at what time”
– Approx. 3 months of data with billions of connection records
• Local Domain White List
– List of non-malicious websites
• Preprocessing
– Host name normalization (anirudh.facebook.com -> facebook.com)
– Filter invalid host names (e.g. www.facebook,ca)
– Identify “unpopular” domains (e.g. www.francelegal.com)
– User-specific sessionization
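The first two preprocessing steps can be sketched in Python as follows. This is only an illustration: the validity regex and the "keep the last two labels" rule are assumptions made here, not the rules used in the project, and the two-label rule would mis-handle suffixes such as co.uk (a production version would likely consult a public suffix list).

```python
import re

# Assumed validity rule: dot-separated labels of letters, digits and hyphens,
# ending in an alphabetic top-level label. Hypothetical, not from the slides.
_HOST_RE = re.compile(r"^(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,}$")


def is_valid_host(host: str) -> bool:
    """Return True if the host name matches the assumed validity rule."""
    return bool(_HOST_RE.match(host.strip().lower()))


def normalize_host(host: str) -> str:
    """Naively reduce a host name to its last two labels."""
    labels = host.strip().lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


normalized = normalize_host("anirudh.facebook.com")
invalid_ok = is_valid_host("www.facebook,ca")      # comma makes it invalid
valid_ok = is_valid_host("www.francelegal.com")
```

Filtering invalid hosts before normalization avoids collapsing malformed names into plausible-looking domains.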
User-Specific Sessionization
• Each user’s proxy logs are sessionized so that two consecutive connections in the same session occur within a user-specified time window (e.g. 60s).
• Sequential patterns are derived from sessionized data.
Connection Time      Domain               Session ID
2015-07-03 12:41:08  googlevideo.com      1
2015-07-03 12:41:09  twitter.com          1
2015-07-03 12:41:12  youtube.com          1
2015-07-03 12:41:14  doubleclick.net      1
2015-07-03 12:41:15  google.com           1
2015-07-03 12:41:15  googleanalytics.com  1
2015-07-03 12:41:28  youtube.com          1
2015-07-03 12:59:23  facebook.com         2
2015-07-03 12:59:24  yahoo.com            2
The connection at 12:59:23 is more than 60s after the previous one, so it starts a new session.
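A minimal Python sketch of the gap-based sessionization described on this slide, applied to the example rows. The function name `sessionize` and its `gap_seconds` parameter are hypothetical; the original implementation ran in SQL on Greenplum and is not shown in the deck.

```python
from datetime import datetime, timedelta


def sessionize(rows, gap_seconds=60):
    """Assign session IDs to one user's (timestamp, domain) rows.

    A new session starts whenever the gap to the previous connection
    exceeds gap_seconds.
    """
    gap = timedelta(seconds=gap_seconds)
    out, session_id, prev_time = [], 0, None
    for ts, domain in sorted(rows, key=lambda r: r[0]):
        if prev_time is None or ts - prev_time > gap:
            session_id += 1
        out.append((ts, domain, session_id))
        prev_time = ts
    return out


def _t(text):
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")


rows = [
    (_t("2015-07-03 12:41:08"), "googlevideo.com"),
    (_t("2015-07-03 12:41:09"), "twitter.com"),
    (_t("2015-07-03 12:41:12"), "youtube.com"),
    (_t("2015-07-03 12:41:14"), "doubleclick.net"),
    (_t("2015-07-03 12:41:15"), "google.com"),
    (_t("2015-07-03 12:41:15"), "googleanalytics.com"),
    (_t("2015-07-03 12:41:28"), "youtube.com"),
    (_t("2015-07-03 12:59:23"), "facebook.com"),
    (_t("2015-07-03 12:59:24"), "yahoo.com"),
]
session_ids = [sid for _, _, sid in sessionize(rows)]
```

In SQL the same logic is typically expressed with a window function that compares each row's timestamp to the previous one per user, followed by a running sum of the "new session" flags.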
Modeling Approaches
• Sequential Pattern Mining
– Find time-ordered co-occurrence relationships between multiple domains.
– Output low-frequency, high-confidence sequences of domains: [{Domain1},{Domain2, Domain3},…] => [DomainN].
• Graph Mining
– Build a “social network” graph between domains by creating edges between pairs of domains that are associated with high confidence.
– Use graph-based algorithms to find fully and partially connected subgraphs.
• The two approaches can be used in conjunction to complement each other.
Modeling Framework Design Considerations
• Operational feasibility
– Incremental data processing and modeling on incoming new data, e.g. on a weekly basis, to distribute workload over time.
– Results are updated to incorporate new model outputs.
• Computational tractability
– Implement most of the modeling frameworks in plain SQL, and design efficient window functions to achieve better runtime performance.
– Explicit PL/R routine parallelization to leverage the Massively Parallel Processing architecture of the Greenplum database.
An Incremental Modeling Framework
Initial Run:
• Input: initial proxy logs & domain whitelist
• Preprocessing (host normalization & validation, data filtering, sessionization) → preprocessed proxy logs
• Model execution (sequential pattern mining, graph mining) → model-specific results
Update:
• Input: new proxy logs & (possibly) updated domain whitelist
• Preprocessing (host normalization & validation, data filtering, sessionization) → preprocessed new proxy logs
• Model update (sequential pattern mining, graph mining) → updated model-specific results
Modeling Approaches
Sequential Pattern Mining
Model Execution: Sequential Pattern Mining
• Create time-ordered domain sequences from sessionized data
• Given a list of targeted domains (e.g. rare domains), select the subset of sequences containing those domains
• Find high-confidence, low-support sequential patterns of targeted domains in parallel
Sequence Creation
• Each sequence contains the domains in a session by the same user.
• Domains are ordered by connection time.
• Sequences for the example table below:
– Sequence 1: [{googlevideo.com}, {twitter.com}, {youtube.com}, {doubleclick.net}, {google.com}, {googleanalytics.com}]
– Sequence 2: [{facebook.com}, {yahoo.com}]
Connection Time      Domain               Session ID
2015-01-06 14:41:08  googlevideo.com      1
2015-01-06 14:41:09  twitter.com          1
2015-01-06 14:41:12  youtube.com          1
2015-01-06 14:41:14  doubleclick.net      1
2015-01-06 14:41:15  google.com           1
2015-01-06 14:41:15  googleanalytics.com  1
2015-01-06 14:59:23  facebook.com         2
2015-01-06 14:59:24  yahoo.com            2
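As a rough illustration, the sequences listed above can be derived from the sessionized rows as follows. The helper `build_sequences` is a hypothetical name, and treating every connection as its own single-domain itemset is an assumption that matches the slide's example.

```python
from collections import defaultdict


def build_sequences(rows):
    """Group one user's (connection_time, domain, session_id) rows into
    time-ordered sequences of single-domain itemsets, keyed by session ID."""
    sequences = defaultdict(list)
    # ISO-formatted timestamps sort correctly as strings; ties fall back to domain.
    for _, domain, session_id in sorted(rows):
        sequences[session_id].append({domain})
    return dict(sequences)


rows = [
    ("2015-01-06 14:41:08", "googlevideo.com", 1),
    ("2015-01-06 14:41:09", "twitter.com", 1),
    ("2015-01-06 14:41:12", "youtube.com", 1),
    ("2015-01-06 14:41:14", "doubleclick.net", 1),
    ("2015-01-06 14:41:15", "google.com", 1),
    ("2015-01-06 14:41:15", "googleanalytics.com", 1),
    ("2015-01-06 14:59:23", "facebook.com", 2),
    ("2015-01-06 14:59:24", "yahoo.com", 2),
]
sequences = build_sequences(rows)
```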
Sequence Statistics
• sup: The support of a pattern P is the fraction of sequences in which P occurs
– sup({a,e}) = 2/10
• conf: The confidence of a rule X => Y is the proportion of sequences containing X that also contain Y
– conf({a => e}) = sup({a,e})/sup({a}) = 2/5
• #users: The number of distinct users for which a pattern P occurs
– #users({a}) = 1
• sup and #users follow the monotone property, i.e. extending a pattern can never increase either value:
– {a,e} ⊇ {a}
– sup({a,e}) ≤ sup({a})
– #users({a,e}) ≤ #users({a})
[Figure: example data, 10 sequences from a single user.]
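The three statistics can be computed with a few lines of Python. The ten toy sequences below are invented to reproduce the slide's numbers (the original example figure is not available), and patterns are treated here as ordered lists of single domains, which is a simplification of the itemset notation used on the slide.

```python
def contains(sequence, pattern):
    """True if pattern occurs in sequence as an ordered,
    not necessarily contiguous, subsequence."""
    it = iter(sequence)
    return all(item in it for item in pattern)


def support(data, pattern):
    """Fraction of sequences that contain the pattern."""
    return sum(contains(seq, pattern) for _, seq in data) / len(data)


def confidence(data, lhs, rhs):
    """sup(lhs followed by rhs) / sup(lhs)."""
    return support(data, lhs + rhs) / support(data, lhs)


def num_users(data, pattern):
    """Number of distinct users with at least one matching sequence."""
    return len({user for user, seq in data if contains(seq, pattern)})


# Hypothetical data: 10 sequences from a single user "u1".
data = [("u1", list(s)) for s in
        ["abe", "ae", "ab", "ac", "ad", "bc", "cd", "bd", "ce", "de"]]

sup_ae = support(data, ["a", "e"])          # 2 of 10 sequences
conf_a_e = confidence(data, ["a"], ["e"])   # 2 of the 5 sequences containing a
users_a = num_users(data, ["a"])
```

The monotone property is what makes pruning possible: once a pattern falls below a minimum #users threshold, none of its extensions can exceed it.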
Sequential Pattern Mining (SPM) in Parallel
• Developed a scalable algorithm in Greenplum database (GPDB) to identify low-support, high-confidence patterns that occur in at least a minimum number of user sequences.
• High-confidence patterns relating to a given set of domains are obtained in parallel: i.e., SPM runs independently on different subsets of sequences for different domains.
Pseudo code:
SELECT a_targeted_domain,
       sequential_pattern_mining(min_support, min_confidence, min_num_users)
FROM input_table
Workflow:
• Find domain A with small support (or a known bad domain)
• Subset sequences from data containing A
• Find sequential patterns of A with high confidence
• Repeat for all A in parallel on separate GPDB nodes
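The per-domain independence of this workflow can be imitated in Python with a thread pool standing in for GPDB nodes. This is a simplified sketch under stated assumptions: `mine_for_domain` is a hypothetical helper, it only mines rules with a single-domain left-hand side, and it omits the minimum support and minimum user thresholds.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def contains(sequence, pattern):
    """True if pattern is an ordered subsequence of sequence."""
    it = iter(sequence)
    return all(item in it for item in pattern)


def mine_for_domain(target, sequences, min_confidence=0.5):
    """Mine rules [x] => [target] using only sequences that contain target."""
    subset = [s for s in sequences if target in s]
    lhs_count, rule_count = Counter(), Counter()
    for seq in subset:
        for x in set(seq) - {target}:
            lhs_count[x] += 1
            if contains(seq, [x, target]):
                rule_count[x] += 1
    rules = {}
    for x, n in lhs_count.items():
        conf = rule_count[x] / n
        if conf >= min_confidence:
            rules[x] = conf
    return target, rules


# Hypothetical toy sequences of domains (single letters for brevity).
sequences = [["a", "b"], ["a", "b"], ["c", "a", "b"], ["c", "d"], ["e", "d"]]
targets = ["b", "d"]

# Each target is mined independently, so the tasks can run in parallel.
with ThreadPoolExecutor() as pool:
    results = dict(pool.map(lambda t: mine_for_domain(t, sequences), targets))
```

Note that the rule c => d gets confidence 1.0 here even though c also appears in a sequence without d; confidence is measured only inside the subset containing the target.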
Relative Confidence to Adjust Ranking of Patterns
• For each domain of interest, SPM is run only on the subset of sequences containing that domain. This may cause some sequential patterns to have artificially high confidence.
• Recall: confidence(X => Y) := support(<X,Y>) / support(X) = |<X,Y>| / |X|. Here |X|, the number of sequences in the subset that contain the left-hand-side pattern, may not reflect the popularity of X in the full dataset.
• We define relative confidence as: relative_confidence(X => Y) := |<X,Y>| / |X_i|_fullset, where |X_i|_fullset is the number of sequences in the full dataset that contain the left-hand-side pattern.
• Relative confidence favors patterns whose left-hand side contains less popular domains (see the example table below).
Domain            Pattern                                                  Supp   Conf  Rel Conf
revenueindia.net  <{google.com},{facebook.com}> => <{revenueindia.net}>    0.079  0.75  0.0001
revenueindia.net  <{google.com},{fileshare.com}> => <{revenueindia.net}>   0.071  0.75  0.067
revenueindia.net  <{fileshare.com},{redworm.com}> => <{revenueindia.net}>  0.030  1.00  0.51
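The difference between the two measures can be shown on a small invented dataset. In this sketch `rule_confidences` is a hypothetical helper, the domain names are single letters, and the left-hand side is assumed to occur at least once in the subset (otherwise the first ratio would divide by zero).

```python
def contains(sequence, pattern):
    """True if pattern is an ordered subsequence of sequence."""
    it = iter(sequence)
    return all(item in it for item in pattern)


def rule_confidences(sequences, lhs, target):
    """Return (confidence, relative_confidence) for the rule lhs => [target].

    Confidence divides by the LHS count inside the subset of sequences that
    contain the target; relative confidence divides by the LHS count in the
    full dataset.
    """
    subset = [s for s in sequences if target in s]
    both = sum(contains(s, lhs + [target]) for s in subset)
    lhs_in_subset = sum(contains(s, lhs) for s in subset)
    lhs_in_full = sum(contains(s, lhs) for s in sequences)
    return both / lhs_in_subset, both / lhs_in_full


# Hypothetical data: "g" is popular (8 sequences), "f" is rare (2 sequences),
# and both precede the target "t" in exactly 2 sequences.
sequences = [["g", "t"], ["g", "t"], ["f", "t"], ["f", "t"]] + [["g", "x"]] * 6

popular = rule_confidences(sequences, ["g"], "t")   # (1.0, 0.25)
rare = rule_confidences(sequences, ["f"], "t")      # (1.0, 1.0)
```

Both rules look equally strong by plain confidence, while relative confidence ranks the rule with the rare left-hand side first.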
Model Update: Sequential Pattern Mining
• The model update module for sequential pattern mining follows a workflow similar to that of its model execution module.
• The one additional step is to merge the new results obtained from the incoming data with the existing set of patterns, including updating the rule quality metrics (support, confidence, etc.).
Workflow:
• Create time-ordered domain sequences from new sessionized data
• Given a list of targeted domains (e.g. rare domains), select the subset of sequences containing those domains
• Find high-confidence, low-support sequential patterns of targeted domains in parallel
• Merge new results with the existing set of patterns
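One way such a merge could work is to store raw counts per batch rather than ratios, add the counts, and recompute the metrics afterwards. The deck does not describe the actual merge logic, so the state layout and the `merge` and `metrics` helpers below are assumptions for illustration only.

```python
from collections import Counter


def merge(old, new):
    """Add the per-batch counts of two result sets."""
    return {
        "n_sequences": old["n_sequences"] + new["n_sequences"],
        "lhs": old["lhs"] + new["lhs"],      # sequences containing each LHS
        "rule": old["rule"] + new["rule"],   # sequences containing LHS => RHS
    }


def metrics(state, rule):
    """Recompute (support, confidence) of a rule from merged counts."""
    lhs, _ = rule
    n_rule = state["rule"][rule]
    return n_rule / state["n_sequences"], n_rule / state["lhs"][lhs]


# Hypothetical counts for the rule <a> => <b> in two batches.
rule = (("a",), "b")
existing = {"n_sequences": 100, "lhs": Counter({("a",): 10}), "rule": Counter({rule: 8})}
incoming = {"n_sequences": 50, "lhs": Counter({("a",): 5}), "rule": Counter({rule: 1})}

merged = merge(existing, incoming)
sup, conf = metrics(merged, rule)   # 9/150 and 9/15
```

A caveat of this approach is that patterns pruned by the thresholds in an earlier batch have no stored counts, so their merged metrics can only be approximated.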
Modeling Approaches
Graph Mining
Model Execution: Graph Mining
• Construct “baskets” of domains (co-occurrence domains) by running a sliding window of a certain time interval through the data
• Find high-confidence, low-support pairwise association rules of the form Domain 1 => Domain 2
• Create a social network of domains
• Find partially and fully connected sub-graphs
Construction of “Baskets”
• Domains visited by a user in a certain time window form a “basket”, analogous to items purchased in a single transaction in market basket analysis.
• The time interval for the sliding window (a 60s window is used in the implementation) can be tuned.
• A basket contains the distinct domains in a sliding window. Example for the table below:
Basket 1 = {googlevideo.com, twitter.com, youtube.com, doubleclick.net, google.com}
Connection Time      Domain
2015-01-06 14:41:00  googlevideo.com
2015-01-06 14:41:09  twitter.com
2015-01-06 14:41:12  youtube.com
2015-01-06 14:41:14  doubleclick.net
2015-01-06 14:42:00  google.com
2015-01-06 14:42:05  googleanalytics.com
2015-01-06 14:42:08  pivotal.io
2015-01-06 14:59:23  facebook.com
2015-01-06 14:59:24  yahoo.com
[Figure: brackets labeled 1 and 2 mark successive sliding-window baskets over these rows.]
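A small Python sketch of basket construction on the example rows. Two details are assumptions, since the slide does not specify them: a window is started at every connection, and the 60s bound is inclusive (which is what reproduces Basket 1 above).

```python
from datetime import datetime, timedelta


def build_baskets(rows, window_seconds=60):
    """Return one basket per connection: the distinct domains seen within
    window_seconds after that connection (bounds inclusive)."""
    rows = sorted(rows)
    window = timedelta(seconds=window_seconds)
    baskets = []
    for i, (start, _) in enumerate(rows):
        baskets.append({d for ts, d in rows[i:] if ts - start <= window})
    return baskets


def _t(text):
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")


rows = [
    (_t("2015-01-06 14:41:00"), "googlevideo.com"),
    (_t("2015-01-06 14:41:09"), "twitter.com"),
    (_t("2015-01-06 14:41:12"), "youtube.com"),
    (_t("2015-01-06 14:41:14"), "doubleclick.net"),
    (_t("2015-01-06 14:42:00"), "google.com"),
    (_t("2015-01-06 14:42:05"), "googleanalytics.com"),
    (_t("2015-01-06 14:42:08"), "pivotal.io"),
    (_t("2015-01-06 14:59:23"), "facebook.com"),
    (_t("2015-01-06 14:59:24"), "yahoo.com"),
]
baskets = build_baskets(rows)
```

The nested scan is quadratic in the worst case; it is written for clarity rather than for logs with billions of records.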
Pairwise Association Rule Mining
• Given domain-to-basket assignments, pairwise association rule mining mainly involves evaluation of:
– Co-occurrence frequency: the number of times two domains fall in a common basket.
– Conditional probability: the probability of seeing domain 2 given that domain 1 is present.
• Pairwise rule mining is implemented in plain SQL in a scalable fashion.
Domain A    Domain B            #{A,B}  #A   #B  P(A|B)    P(B|A)    #A to B  #B to A  #AB Same Time  Max(# User Names/M)  # Date  Min Date    Max Date
pivotal.io  montecarlo.com      10      560  10  1.000000  0.017857  9        0        1              1                    1       2015-02-26  2015-02-26
pivotal.io  bigbangtheory.com   25      560  26  0.961538  0.044643  21       4        0              2                    1       2015-02-23  2015-02-23
pivotal.io  sciencefiction.com  78      560  97  0.804124  0.139286  61       15       2              4                    8       2015-01-23  2015-02-17
High-confidence (>0.5) associations involving multiple users over several days (e.g. the rules highlighted on the original slide) are generally more interesting.
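The two core quantities, co-occurrence frequency and conditional probability, can be sketched in Python as below. The original was implemented in plain SQL; the function `pairwise_rules`, its output field names, and the toy baskets are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def pairwise_rules(baskets):
    """Count single-domain and pairwise basket occurrences and derive
    P(A|B) = #{A,B} / #B and P(B|A) = #{A,B} / #A for each pair."""
    single, pair = Counter(), Counter()
    for basket in baskets:
        single.update(basket)
        # Sorting gives each unordered pair one canonical (a, b) key.
        pair.update(combinations(sorted(basket), 2))
    rules = {}
    for (a, b), n_ab in pair.items():
        rules[(a, b)] = {
            "n_ab": n_ab,
            "n_a": single[a],
            "n_b": single[b],
            "p_a_given_b": n_ab / single[b],
            "p_b_given_a": n_ab / single[a],
        }
    return rules


# Hypothetical baskets of distinct domains.
baskets = [{"x.com", "y.com"}, {"x.com", "y.com"}, {"x.com"}, {"x.com", "z.com"}]
rules = pairwise_rules(baskets)
xy = rules[("x.com", "y.com")]
```

The same ratios reproduce the table above, e.g. P(B|A) = 10/560 ≈ 0.017857 for the first row.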
Exploring Interactions between Domains
• To explore the interactions between domains, we build an undirected correlation graph using the discovered pairwise domain association rules.
• Each node in the graph is a domain. An edge connects two domains if their co-occurrence confidence is higher than a threshold (e.g. 0.2).
• The example figure shows the tightly connected “social network” of a particular domain.
• Partially and fully connected networks indicate possible watering hole or botnet attacks.
• Question: How to quantify the connectivity of a network?
[Figure: example correlation graph over the domains abc.com, xyz.com, hga.com and hebf.com. Nodes denote domains; edge weights denote confidence (values shown: 0.25, 0.37, 0.71, 0.52, 0.1, 0.6, 0.1).]
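Graph construction from the pairwise rules might look like the following sketch. The slide does not say which of the two directional confidences defines the undirected edge, so taking the larger of P(A|B) and P(B|A) is an assumption here. The first three rules use the values from the pairwise rule table; the fourth is a made-up weak rule.

```python
from collections import defaultdict


def build_graph(rules, threshold=0.2):
    """Build an undirected weighted graph as a dict of dicts.

    rules: iterable of (domain_a, domain_b, p_a_given_b, p_b_given_a).
    """
    graph = defaultdict(dict)
    for a, b, p_a_given_b, p_b_given_a in rules:
        weight = max(p_a_given_b, p_b_given_a)  # assumed edge weight
        if weight > threshold:
            graph[a][b] = weight
            graph[b][a] = weight
    return dict(graph)


rules = [
    ("pivotal.io", "montecarlo.com", 1.000000, 0.017857),
    ("pivotal.io", "bigbangtheory.com", 0.961538, 0.044643),
    ("pivotal.io", "sciencefiction.com", 0.804124, 0.139286),
    ("pivotal.io", "example.org", 0.10, 0.05),  # hypothetical, below threshold
]
graph = build_graph(rules)
```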
OddBall Metrics for Graph Anomaly Detection
• We take the OddBall approach* to quantify the connectivity of each domain’s network:
– Identify each domain’s one-step neighborhood (also called its ego-net).
– Extract two graph features from the ego-net:
▪ N: Number of neighbors
▪ E: Number of edges in the ego-net
• The number of neighbors and the number of edges follow a power law: E ∝ N^α, 1 ≤ α ≤ 2
• Use log(E)/log(N) to approximate the slope. log(E)/log(N) > 1 indicates some degree of connectivity among neighbors.
• The higher the ratio, the higher the degree of connectivity (given the same number of neighbors). Generally, an OddBall ratio of >1.5 is more interesting.
• One can additionally compute the clique percentage, the ratio between E and the number of edges needed to form a clique, E / [(N^2 + N)/2], to measure network connectivity.
* OddBall: Spotting Anomalies in Weighted Graphs, Leman Akoglu et al., PAKDD, Hyderabad, India, June 2010.
Picture Source: ICDM’12 tutorial on graph anomaly detection
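The two ego-net features and the derived scores can be computed as in this sketch. The graph is assumed to be a symmetric dict mapping each node to the set of its neighbors, and the function names are illustrative. E counts edges among the ego and its neighbors, which is what makes (N^2 + N)/2 the clique size.

```python
import math
from itertools import combinations


def egonet_features(graph, ego):
    """Return (N, E): number of neighbors and number of edges in the ego-net
    (the ego, its neighbors, and all edges among them)."""
    neighbors = graph[ego]
    members = neighbors | {ego}
    edges = sum(1 for u, v in combinations(members, 2) if v in graph[u])
    return len(neighbors), edges


def oddball_ratio(n, e):
    """log(E)/log(N); undefined for N = 1 because log(1) = 0."""
    return math.log(e) / math.log(n)


def clique_percentage(n, e):
    """E divided by the edge count of a clique on N + 1 nodes."""
    return e / ((n * n + n) / 2)


# A fully connected 5-node graph and a star with the same ego degree.
nodes = ["a.com", "b.com", "c.com", "d.com", "e.com"]
clique = {u: {v for v in nodes if v != u} for u in nodes}
star = {"hub": {"n1", "n2", "n3", "n4"},
        "n1": {"hub"}, "n2": {"hub"}, "n3": {"hub"}, "n4": {"hub"}}

n_c, e_c = egonet_features(clique, "a.com")
n_s, e_s = egonet_features(star, "hub")
```

For the clique this gives N = 4 and E = 10, so the ratio rounds to 1.66 with a clique percentage of 100%; the star gives a ratio of 1.0, the lower end of the power-law range.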
Sample Domains with Highly Connected Networks
The highlighted domain (a.com, clique percent 100%) has a fully connected network: a clique!
Domain  # Neighbors  Neighbors                                                        # Edges  log(E)/log(N)  Clique Percent  # User Names
a.com   4            {b.com, c.com, d.com, e.com}                                     10       1.66           100%            6
s.com   7            {a.com, b.com, c.com, d.com, e.com, f.com}                       27       1.69           96%             9
r.com   9            {a.com, b.com, c.com, d.com, e.com, f.com, g.com, h.com, i.com}  43       1.71           96%             7
abc.ru  9            {a.com, b.com, c.com, d.com, e.com, f.com, g.com, h.com, i.com}  42       1.70           93%             11
[Figure: ego-net of a.com with neighbors b.com, c.com, d.com and e.com.]
Detecting Isolated Clusters
• Given the domain correlation graph, one can also identify isolated groups of domains that only interact with domains in the same group, but not with others (a botnet-like structure).
• This can be formulated as the task of finding connected components (CCs) in a graph.
• The example below shows that malicious sites tend to exist in small CCs.
[Figure: sample connected component containing qre.com, jekc.com, fbc.com, abc.com, ghk.com and bcd.com; the legend marks a known malicious site.]
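Finding connected components is a standard graph traversal. The breadth-first sketch below assumes the graph is a symmetric dict in which every node appears as a key; the node names and edges are invented, since the edges of the figure are not recoverable.

```python
from collections import deque


def connected_components(graph):
    """Return the connected components of an undirected graph as sets."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components


# Hypothetical graph: one small isolated cluster and one larger group.
graph = {
    "m1.example": {"m2.example"},
    "m2.example": {"m1.example", "m3.example"},
    "m3.example": {"m2.example"},
    "p1.example": {"p2.example", "p3.example", "p4.example"},
    "p2.example": {"p1.example"},
    "p3.example": {"p1.example"},
    "p4.example": {"p1.example", "p5.example"},
    "p5.example": {"p4.example"},
}
components = connected_components(graph)
small = [c for c in components if len(c) <= 3]
```

Filtering components by size, as in the last line, is one simple way to surface the small isolated clusters the slide refers to.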
Operationalization and Outlook
Operationalization Vision
Run Algorithms
• Owned by Data Engineer/Data Scientist
• Incrementally (e.g. weekly) update models using new batches of data, e.g. as a cron job
Inspect Anomalies
• Owned by security team
• Ideally, model outputs are provided via interactive web dashboards
Evaluate Model Outputs
• Feedback on model performance from security team
• Opportunities for refinement and ideas for new models
Refine Algorithms
• Owned by Data Scientist
• Refine algorithms
Load New Data
• Owned by Data Engineer
• Load new data
BUILT FOR THE SPEED OF BUSINESS

More Related Content

PPTX
2016 Cybersecurity Analytics State of the Union
PDF
Data science workshop
PPTX
Just the sketch: advanced streaming analytics in Apache Metron
PDF
First in Class: Optimizing the Data Lake for Tighter Integration
PDF
Deep Learning in Security—An Empirical Example in User and Entity Behavior An...
PDF
Fit For Purpose: Preventing a Big Data Letdown
PDF
Empower Splunk and other SIEMs with the Databricks Lakehouse for Cybersecurity
PPTX
Random Decision Forests at Scale
2016 Cybersecurity Analytics State of the Union
Data science workshop
Just the sketch: advanced streaming analytics in Apache Metron
First in Class: Optimizing the Data Lake for Tighter Integration
Deep Learning in Security—An Empirical Example in User and Entity Behavior An...
Fit For Purpose: Preventing a Big Data Letdown
Empower Splunk and other SIEMs with the Databricks Lakehouse for Cybersecurity
Random Decision Forests at Scale

What's hot (20)

PPTX
Is your big data journey stalling? Take the Leap with Capgemini and Cloudera
PDF
How to Apply Machine Learning with R, H20, Apache Spark MLlib or PMML to Real...
PDF
Python for Data Science - TDC 2015
PDF
Hortonworks Hybrid Cloud - Putting you back in control of your data
PDF
Lessons from building a stream-first metadata platform | Shirshanka Das, Stealth
PPTX
Preparing for the Cybersecurity Renaissance
PPTX
Building a Real-Time Security Application Using Log Data and Machine Learning...
PPTX
Real-time Streaming Analytics for Enterprises based on Apache Storm - Impetus...
PPTX
H2O World - Clustering & Feature Extraction on Text - Seth Redmore
PPT
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
PDF
Open Source Data Management for Industry 4.0
PPT
Big Data Real Time Analytics - A Facebook Case Study
PDF
Data Science Crash Course
PPTX
Applying Noisy Knowledge Graphs to Real Problems
PPTX
Shikha fdp 62_14july2017
PDF
Agile Big Data Analytics Development: An Architecture-Centric Approach
PDF
Threat Detection and Response at Scale with Dominique Brezinski
PPTX
Perspectives on Ethical Big Data Governance
PDF
H2O for Medicine and Intro to H2O in Python
PDF
LaGatta and de Garrigues - Splunk for Data Science - .conf2014
Is your big data journey stalling? Take the Leap with Capgemini and Cloudera
How to Apply Machine Learning with R, H20, Apache Spark MLlib or PMML to Real...
Python for Data Science - TDC 2015
Hortonworks Hybrid Cloud - Putting you back in control of your data
Lessons from building a stream-first metadata platform | Shirshanka Das, Stealth
Preparing for the Cybersecurity Renaissance
Building a Real-Time Security Application Using Log Data and Machine Learning...
Real-time Streaming Analytics for Enterprises based on Apache Storm - Impetus...
H2O World - Clustering & Feature Extraction on Text - Seth Redmore
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Open Source Data Management for Industry 4.0
Big Data Real Time Analytics - A Facebook Case Study
Data Science Crash Course
Applying Noisy Knowledge Graphs to Real Problems
Shikha fdp 62_14july2017
Agile Big Data Analytics Development: An Architecture-Centric Approach
Threat Detection and Response at Scale with Dominique Brezinski
Perspectives on Ethical Big Data Governance
H2O for Medicine and Intro to H2O in Python
LaGatta and de Garrigues - Splunk for Data Science - .conf2014
Ad

Viewers also liked (20)

PDF
[FAST CAMPUS] 1강 data science overview
PDF
Intro to Data Science for Non-Data Scientists
PDF
Pivotal Digital Transformation Forum: Accelerate Time to Market with Business...
PDF
Pivotal Digital Transformation Forum: Becoming a Data Driven Enterprise
PDF
저성장 시대 데이터 경제만이 살길이다
PDF
Pivotal Digital Transformation Forum: Data Science
PDF
What Is the Future of Data Sharing?
PDF
Data Science - Part XIV - Genetic Algorithms
PDF
Data Science - Part XI - Text Analytics
PDF
Data Science - Part X - Time Series Forecasting
PDF
Data Science - Part XIII - Hidden Markov Models
PDF
Data Science - Part XVII - Deep Learning & Image Processing
PDF
To Serve and Protect: Making Sense of Hadoop Security
PPTX
MATATABI: Cyber Threat Analysis and Defense Platform using Huge Amount of Dat...
PPTX
Balancing Mobile UX & Security: An API Management Perspective Presentation fr...
PPTX
Data Security and Privacy by Contract: Hacking Us All Into Business Associate...
PDF
State of Application Security Vol. 4
PDF
Senzations’15: Secure Internet of Things
PDF
frog IoT Big Design IoT World Congress 2015
PDF
IoT and BD Introduction
[FAST CAMPUS] 1강 data science overview
Intro to Data Science for Non-Data Scientists
Pivotal Digital Transformation Forum: Accelerate Time to Market with Business...
Pivotal Digital Transformation Forum: Becoming a Data Driven Enterprise
저성장 시대 데이터 경제만이 살길이다
Pivotal Digital Transformation Forum: Data Science
What Is the Future of Data Sharing?
Data Science - Part XIV - Genetic Algorithms
Data Science - Part XI - Text Analytics
Data Science - Part X - Time Series Forecasting
Data Science - Part XIII - Hidden Markov Models
Data Science - Part XVII - Deep Learning & Image Processing
To Serve and Protect: Making Sense of Hadoop Security
MATATABI: Cyber Threat Analysis and Defense Platform using Huge Amount of Dat...
Balancing Mobile UX & Security: An API Management Perspective Presentation fr...
Data Security and Privacy by Contract: Hacking Us All Into Business Associate...
State of Application Security Vol. 4
Senzations’15: Secure Internet of Things
frog IoT Big Design IoT World Congress 2015
IoT and BD Introduction
Ad

Similar to Data Science Driven Malware Detection (20)

PPTX
Social Νetworks Data Mining
PDF
Efficient Way to Identify User Aware Rare Sequential Patterns in Document Str...
PDF
Detection of Behavior using Machine Learning
PPTX
Big data and machine learning / Gil Chamiel
PPT
Temporal data mining
PDF
Analysis of Time Series Data & Pattern Sequencing
PDF
S-CUBE LP: Mining Lifecycle Event Logs for Enhancing SBAs
PPTX
Data mining
PPTX
Data mining
PDF
A Novel Framework on Web Usage Mining
PPTX
Classification
PDF
WebSite Visit Forecasting Using Data Mining Techniques
PDF
Big data Mining Using Very-Large-Scale Data Processing Platforms
PDF
Using Data Science for Cybersecurity
PPTX
Introduction Data Science.pptx
PDF
IRJET- Enhancing Prediction of User Behavior on the Basic of Web Logs
PDF
Dunham - Data Mining.pdf
PDF
Dunham - Data Mining.pdf
PPTX
Best practices machine learning final
PDF
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial
Social Νetworks Data Mining
Efficient Way to Identify User Aware Rare Sequential Patterns in Document Str...
Detection of Behavior using Machine Learning
Big data and machine learning / Gil Chamiel
Temporal data mining
Analysis of Time Series Data & Pattern Sequencing
S-CUBE LP: Mining Lifecycle Event Logs for Enhancing SBAs
Data mining
Data mining
A Novel Framework on Web Usage Mining
Classification
WebSite Visit Forecasting Using Data Mining Techniques
Big data Mining Using Very-Large-Scale Data Processing Platforms
Using Data Science for Cybersecurity
Introduction Data Science.pptx
IRJET- Enhancing Prediction of User Behavior on the Basic of Web Logs
Dunham - Data Mining.pdf
Dunham - Data Mining.pdf
Best practices machine learning final
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial

More from VMware Tanzu (20)

PDF
Spring into AI presented by Dan Vega 5/14
PDF
What AI Means For Your Product Strategy And What To Do About It
PDF
Make the Right Thing the Obvious Thing at Cardinal Health 2023
PPTX
Enhancing DevEx and Simplifying Operations at Scale
PDF
Spring Update | July 2023
PPTX
Platforms, Platform Engineering, & Platform as a Product
PPTX
Building Cloud Ready Apps
PDF
Spring Boot 3 And Beyond
PDF
Spring Cloud Gateway - SpringOne Tour 2023 Charles Schwab.pdf
PDF
Simplify and Scale Enterprise Apps in the Cloud | Boston 2023
PDF
Simplify and Scale Enterprise Apps in the Cloud | Seattle 2023
PPTX
tanzu_developer_connect.pptx
PDF
Tanzu Virtual Developer Connect Workshop - French
PDF
Tanzu Developer Connect Workshop - English
PDF
Virtual Developer Connect Workshop - English
PDF
Tanzu Developer Connect - French
PDF
Simplify and Scale Enterprise Apps in the Cloud | Dallas 2023
PDF
SpringOne Tour: Deliver 15-Factor Applications on Kubernetes with Spring Boot
PDF
SpringOne Tour: The Influential Software Engineer
PDF
SpringOne Tour: Domain-Driven Design: Theory vs Practice
Spring into AI presented by Dan Vega 5/14
What AI Means For Your Product Strategy And What To Do About It
Make the Right Thing the Obvious Thing at Cardinal Health 2023
Enhancing DevEx and Simplifying Operations at Scale
Spring Update | July 2023
Platforms, Platform Engineering, & Platform as a Product
Building Cloud Ready Apps
Spring Boot 3 And Beyond
Spring Cloud Gateway - SpringOne Tour 2023 Charles Schwab.pdf
Simplify and Scale Enterprise Apps in the Cloud | Boston 2023
Simplify and Scale Enterprise Apps in the Cloud | Seattle 2023
tanzu_developer_connect.pptx
Tanzu Virtual Developer Connect Workshop - French
Tanzu Developer Connect Workshop - English
Virtual Developer Connect Workshop - English
Tanzu Developer Connect - French
Simplify and Scale Enterprise Apps in the Cloud | Dallas 2023
SpringOne Tour: Deliver 15-Factor Applications on Kubernetes with Spring Boot
SpringOne Tour: The Influential Software Engineer
SpringOne Tour: Domain-Driven Design: Theory vs Practice

Recently uploaded (20)

PPTX
The THESIS FINAL-DEFENSE-PRESENTATION.pptx
PDF
168300704-gasification-ppt.pdfhghhhsjsjhsuxush
PPTX
Introduction to machine learning and Linear Models
PPTX
Qualitative Qantitative and Mixed Methods.pptx
PDF
Clinical guidelines as a resource for EBP(1).pdf
PPTX
DISORDERS OF THE LIVER, GALLBLADDER AND PANCREASE (1).pptx
PPT
Quality review (1)_presentation of this 21
PPTX
IBA_Chapter_11_Slides_Final_Accessible.pptx
PPTX
ALIMENTARY AND BILIARY CONDITIONS 3-1.pptx
PPTX
Data_Analytics_and_PowerBI_Presentation.pptx
PPTX
Microsoft-Fabric-Unifying-Analytics-for-the-Modern-Enterprise Solution.pptx
PDF
Business Analytics and business intelligence.pdf
PDF
Mega Projects Data Mega Projects Data
PPTX
Introduction-to-Cloud-ComputingFinal.pptx
PPTX
Business Acumen Training GuidePresentation.pptx
PPTX
MODULE 8 - DISASTER risk PREPAREDNESS.pptx
PPTX
STUDY DESIGN details- Lt Col Maksud (21).pptx
PDF
BF and FI - Blockchain, fintech and Financial Innovation Lesson 2.pdf
PPTX
Business Ppt On Nestle.pptx huunnnhhgfvu
The THESIS FINAL-DEFENSE-PRESENTATION.pptx
168300704-gasification-ppt.pdfhghhhsjsjhsuxush
Introduction to machine learning and Linear Models
Qualitative Qantitative and Mixed Methods.pptx
Clinical guidelines as a resource for EBP(1).pdf
DISORDERS OF THE LIVER, GALLBLADDER AND PANCREASE (1).pptx
Quality review (1)_presentation of this 21
IBA_Chapter_11_Slides_Final_Accessible.pptx
ALIMENTARY AND BILIARY CONDITIONS 3-1.pptx
Data_Analytics_and_PowerBI_Presentation.pptx
Microsoft-Fabric-Unifying-Analytics-for-the-Modern-Enterprise Solution.pptx
Business Analytics and business intelligence.pdf
Mega Projects Data Mega Projects Data
Introduction-to-Cloud-ComputingFinal.pptx
Business Acumen Training GuidePresentation.pptx
MODULE 8 - DISASTER risk PREPAREDNESS.pptx
STUDY DESIGN details- Lt Col Maksud (21).pptx
BF and FI - Blockchain, fintech and Financial Innovation Lesson 2.pdf
Business Ppt On Nestle.pptx huunnnhhgfvu

Data Science Driven Malware Detection

  • 1. 1© Copyright 2015 Pivotal. All rights reserved. 1 Data Science Driven Malware Detection Malicious Domain Association Anirudh Kondaveeti, PhD Principal Data Scientist
  • 2. 2© Copyright 2015 Pivotal. All rights reserved. Project Goal  Goal: Find domains that have time and user based co-occurrence relationships to aid the detection of coordinated network attacks.  Example: Domain A is a watering hole. It redirects users to an exploit kit at Domain B within a short time window. – B is relatively unknown: Visiting B is a low frequency (support) event. – B is almost always redirected from A: The conditional probability (confidence) of an initial visit to A is high given B is visited later on. User visits watering hole domain A Domain B hosts exploit kit Watering hole domain A redirects to domain B User machine compromised
  • 3. 3© Copyright 2015 Pivotal. All rights reserved. Data Sources & Preprocessing  Historical Proxy Logs – Information about “who is accessing which website at what time” – Approx. 3 months of data with billions of connection records  Local Domain White List – List of non-malicious websites  Preprocessing Host Name Normalization (anirudh.facebook.com -> facebook.com) Filter Invalid Host Names ( www.facebook,ca) Identify “unpopular” domains ( www.francelegal.com) User Specific Sessionization
  • 4. 4© Copyright 2015 Pivotal. All rights reserved. User-Specific Sessionization  Each user’s proxy logs are sessionized so that two consecutive connections in the same session occur within a user-specified time window (e.g. 60s).  Sequential patterns are derived from sessionized data. Connection Time Domain Session ID 2015-07-03 12:41:08 googlevideo.com 1 2015-07-03 12:41:09 twitter.com 1 2015-07-03 12:41:12 youtube.com 1 2015-07-03 12:41:14 doubleclick.net 1 2015-07-03 12:41:15 google.com 1 2015-07-03 12:41:15 googleanalytics.com 1 2015-07-03 12:41:28 youtube.com 1 2015-07-03 12:59:23 facebook.com 2 2015-07-03 12:59:24 yahoo.com 2 >60s apart, start a new session
  • 5. 5© Copyright 2015 Pivotal. All rights reserved. Modeling Approaches  Sequential Pattern Mining – Find time-ordered co-occurrence relationships between multiple domains. – Output low frequency, high confidence sequences of domains: [{Domain1},{Domain2, Domain3},…] => [DomainN].  Graph Mining – Build a “social network” graph between domains by creating edges between pairs of domains that are associated with high confidence – Use graph based algorithms to find fully and partially connected subgraphs  Two approaches can be used in conjunction to compliment each other.
  • 6. 6© Copyright 2015 Pivotal. All rights reserved. Modeling Framework Design Considerations  Operational feasibility – Incremental data processing and modeling on incoming new data, e.g. on a weekly basis, to distribute workload over time. – Results are updated to incorporate new model outputs.  Computational tractability – Implement most of the modeling frameworks in plain SQL, and design efficient Window functions to achieve better runtime performance. – Explicit PL/R routine parallelization to leverage the Massively Parallel Processing architecture of the Greenplum database.
  • 7. 7© Copyright 2015 Pivotal. All rights reserved. An Incremental Modeling Framework Initial Proxy Logs & Domain Whitelist Preprocessed Proxy Logs • Host normalization & validation • Data filtering • Sessionization Model-Specific Results Model Execution: • Sequential Pattern Mining • Graph Mining New Proxy Logs & (Possibly) Updated Domain Whitelist Preprocessed New Proxy Logs • Host normalization & validation • Data filtering • Sessionization Updated Model- Specific Results Initial Run Update Model Update: • Sequential Pattern Mining • Graph Mining
  • 8. 8© Copyright 2015 Pivotal. All rights reserved. Modeling Approaches Sequential Pattern Mining
  • 9. 9© Copyright 2015 Pivotal. All rights reserved. Model Execution: Sequential Pattern Mining Create time-ordered domain sequences from sessionized data Given a list of targeted domains (e.g. rare domains), select subset of sequences containing those domains Find high confidence, low support sequential patterns of targeted domains in parallel
  • 10. 10© Copyright 2015 Pivotal. All rights reserved. Sequence Creation  Each sequence contains domains in a session by the same user.  Domains are ordered by connection time.  Sequence for example on the right – Sequence 1 : [ {googlevideo.com}, {twitter.com}, {youtube.com}, {doubleclick.net}, {google.com}, {googleanalytics.com} ] – Sequence 2: [{facebook.com}, {yahoo.com}] Connection Time Domain Session ID 2015-01-06 14:41:08 googlevideo.com 1 2015-01-06 14:41:09 twitter.com 1 2015-01-06 14:41:12 youtube.com 1 2015-01-06 14:41:14 doubleclick.net 1 2015-01-06 14:41:15 google.com 1 2015-01-06 14:41:15 googleanalytics.com 1 2015-01-06 14:59:23 facebook.com 2 2015-01-06 14:59:24 yahoo.com 2
  • 11. 11© Copyright 2015 Pivotal. All rights reserved. Sequence Statistics  sup: Support of a pattern P is the ratio of sequences in which a pattern occurs – sup({a,e}) = 2/10  conf: Confidence of a rule X => Y is proportion of transactions containing X that also contain Y – conf({a => e}) = sup({a,e})/sup({a}) = 2/5  #users: Number of distinct users for which a pattern P occurs – #users({a}) = 1  sup and #users follow monotone property i.e. – {a,e} {a} – sup({a,e}) ≤ sup({a}) – #users({a,e}) ≤ #users({a}) 10 sequences from a single user
  • 12. 12© Copyright 2015 Pivotal. All rights reserved. Sequential Pattern Mining (SPM) in Parallel  Developed a scalable algorithm in Greenplum database (GPDB) to identify patterns with low support and high confidence patterns occurring in a minimum number of user sequences.  High confidence patterns relating to a given set of domains are obtained in parallel: i.e., SPM runs independently on different subsets of sequences for different domains. SELECT a_targeted_domain, sequential_pattern_mining(min_support, min_confidence, min_num_users) FROM input_table Pseudo code: Find domain A with small support (or known bad domain) Subset sequences from data containing A Find sequential patterns of A with high confidence Repeat for all A in parallel on separate GPDB node
  • 13. 13© Copyright 2015 Pivotal. All rights reserved. Relative Confidence to Adjust Ranking of Patterns  For each domain of interest, SPM is run only on the subset of sequences containing that domain. This may cause some sequential patterns to have artificially high confidence.  Recall: confidence(X=>Y):=support(<X,Y>)/support(X)=|<X,Y>|/|X|. |X|, the number of sequences in the subset that contain the left hand side pattern, may not reflect the popularity of X in the full dataset.  We define relative confidence as: relative_confidence(X=>Y):=|<X,Y>|/|Xi|fullset where|Xi|fullset is the number of sequences in the full dataset that contain the left hand pattern.  Relative confidence favors the pattern whose left hand side contains less popular domains (see the highlighted example below). Relative confidence favors unpopular left hand side pattern Domain Pattern Supp Conf Rel Conf revenueindia. net <{google.com},{facebook.com}> => <{revenueindia.net}> 0.079 0.75 0.0001 revenueindia. net <{google.com}, {fileshare.com}> => <{revenueindia.net}> 0.071 0.75 0.067 revenueindia. net <{fileshare.com},{redworm.com}> => <{revenueindia.net}> 0.030 1.00 0.51
  • 14. Model Update: Sequential Pattern Mining
      - The model update module for sequential pattern mining follows a workflow similar to that of the model execution module.
      - The one additional step is to merge the results obtained from the incoming data with the existing set of patterns, including updating the rule quality metrics (support, confidence, etc.).
      Workflow:
          1. Create time-ordered domain sequences from the new sessionized data.
          2. Given a list of targeted domains (e.g. rare domains), select the subset of sequences containing those domains.
          3. Find high-confidence, low-support sequential patterns of the targeted domains in parallel.
          4. Merge the new results with the existing set of patterns.
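One way the merge step could work is to store raw counts per pattern and recompute the metrics after adding the counts from the new batch. The slides do not say how the merge is implemented, so the sketch below is purely an assumption; the field names `n_xy` and `n_x` and both functions are hypothetical.

```python
def merge_patterns(existing, new):
    """Add per-pattern counts from a new batch to the stored counts.

    Each value is a dict with hypothetical fields:
      n_xy: sequences containing the whole rule <X,Y>
      n_x:  sequences containing the left-hand side X
    """
    merged = {pattern: dict(counts) for pattern, counts in existing.items()}
    for pattern, counts in new.items():
        slot = merged.setdefault(pattern, {"n_xy": 0, "n_x": 0})
        slot["n_xy"] += counts["n_xy"]
        slot["n_x"] += counts["n_x"]
    return merged

def metrics(counts, n_sequences):
    """Return (support, confidence) recomputed from merged counts."""
    return counts["n_xy"] / n_sequences, counts["n_xy"] / counts["n_x"]
```

Keeping counts rather than ratios means support and confidence stay exact after any number of incremental batches, at the cost of also tracking the total number of sequences seen.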
  • 15. Modeling Approaches: Graph Mining
  • 16. Model Execution: Graph Mining
      1. Construct "baskets" of co-occurring domains by running a sliding window of a certain time interval through the data.
      2. Find high-confidence, low-support pairwise association rules of the form Domain 1 => Domain 2.
      3. Create a social network of domains.
      4. Find partially and fully connected sub-graphs.
  • 17. Construction of "Baskets"
      - Domains visited by a user in a certain time window form a "basket", analogous to the items purchased in a single transaction in market basket analysis.
      - The time interval for the sliding window (a 60s window is used in the implementation) can be tuned.
      - A basket contains the distinct domains in a sliding window. Example, using the table below: Basket 1 = {googlevideo.com, twitter.com, youtube.com, doubleclick.net, google.com}
      | Connection Time     | Domain              |
      | 2015-01-06 14:41:00 | googlevideo.com     |
      | 2015-01-06 14:41:09 | twitter.com         |
      | 2015-01-06 14:41:12 | youtube.com         |
      | 2015-01-06 14:41:14 | doubleclick.net     |
      | 2015-01-06 14:42:00 | google.com          |
      | 2015-01-06 14:42:05 | googleanalytics.com |
      | 2015-01-06 14:42:08 | pivotal.io          |
      | 2015-01-06 14:59:23 | facebook.com        |
      | 2015-01-06 14:59:24 | yahoo.com           |
      [Figure labels: baskets 1 and 2 marked against the table rows]
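A minimal sketch of basket construction for a single user follows. It assumes one basket per connection, containing the distinct domains seen within 60 seconds of that connection (inclusive); the slides do not spell out the exact window semantics, so this choice and the function name `build_baskets` are assumptions. The data is the example table from this slide.

```python
from datetime import datetime, timedelta

def build_baskets(connections, window_seconds=60):
    """Return one basket (set of distinct domains) per window start."""
    conns = sorted(connections)
    window = timedelta(seconds=window_seconds)
    baskets = []
    for i, (start, _) in enumerate(conns):
        basket = set()
        for t, domain in conns[i:]:
            if t - start > window:
                break  # connections are sorted, so later ones are outside too
            basket.add(domain)
        baskets.append(basket)
    return baskets

rows = [
    ("2015-01-06 14:41:00", "googlevideo.com"),
    ("2015-01-06 14:41:09", "twitter.com"),
    ("2015-01-06 14:41:12", "youtube.com"),
    ("2015-01-06 14:41:14", "doubleclick.net"),
    ("2015-01-06 14:42:00", "google.com"),
    ("2015-01-06 14:42:05", "googleanalytics.com"),
    ("2015-01-06 14:42:08", "pivotal.io"),
    ("2015-01-06 14:59:23", "facebook.com"),
    ("2015-01-06 14:59:24", "yahoo.com"),
]
connections = [(datetime.strptime(t, "%Y-%m-%d %H:%M:%S"), d) for t, d in rows]
baskets = build_baskets(connections)
```

Under these assumptions the first basket equals Basket 1 from the slide. In practice the same function would be applied separately to each user's connections.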
  • 18. Pairwise Association Rule Mining
      - Given domain-to-basket assignments, pairwise association rule mining mainly involves evaluating:
          - Co-occurrence frequency: the number of times two domains fall in a common basket.
          - Conditional probability: the probability of seeing domain 2 given that domain 1 is present.
      - Pairwise rule mining is implemented in plain SQL in a scalable fashion.
      | Domain A   | Domain B           | # {A,B} | # A | # B | P(A|B)   | P(B|A)   | # A to B | # B to A | # AB Same Time | Max(# User Names / M) | # Date | Min Date   | Max Date   |
      | pivotal.io | montecarlo.com     | 10      | 560 | 10  | 1.000000 | 0.017857 | 9        | 0        | 1              | 1                     | 1      | 2015-02-26 | 2015-02-26 |
      | pivotal.io | bigbangtheory.com  | 25      | 560 | 26  | 0.961538 | 0.044643 | 21       | 4        | 0              | 2                     | 1      | 2015-02-23 | 2015-02-23 |
      | pivotal.io | sciencefiction.com | 78      | 560 | 97  | 0.804124 | 0.139286 | 61       | 15       | 2              | 4                     | 8      | 2015-01-23 | 2015-02-17 |
      High-confidence (>0.5) associations involving multiple users over several days (e.g. the rules highlighted on the slide) are generally more interesting.
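The slides implement this step in plain SQL; the Python sketch below only illustrates the two quantities being computed and is not the authors' code. The function name `pairwise_rules`, the output field names, and the toy baskets are all hypothetical.

```python
from collections import Counter
from itertools import combinations

def pairwise_rules(baskets):
    """Count co-occurrences and conditional probabilities for domain pairs."""
    single = Counter()  # number of baskets containing each domain
    pair = Counter()    # number of baskets containing both domains
    for basket in baskets:
        single.update(basket)
        pair.update(combinations(sorted(basket), 2))
    rules = {}
    for (a, b), n_ab in pair.items():
        rules[(a, b)] = {
            "n_ab": n_ab,
            "n_a": single[a],
            "n_b": single[b],
            "p_a_given_b": n_ab / single[b],
            "p_b_given_a": n_ab / single[a],
        }
    return rules

# Toy baskets (assumption): x co-occurs with y in 2 of its 4 baskets.
toy_baskets = [{"x", "y"}, {"x", "y"}, {"x"}, {"x", "z"}]
rules = pairwise_rules(toy_baskets)
```

The same arithmetic reproduces the first table row on the slide: with # {A,B} = 10, # A = 560 and # B = 10, P(A|B) = 10/10 = 1.0 and P(B|A) = 10/560, which rounds to 0.017857.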
  • 19. Exploring Interactions between Domains
      - To explore the interactions between domains, we build an undirected correlation graph from the discovered pairwise domain association rules.
      - Each node in the graph is a domain. An edge connects two domains if their co-occurrence confidence is higher than a threshold (e.g. 0.2).
      - The example figure shows the tightly connected "social network" of a particular domain.
      - Partially and fully connected networks indicate possible watering hole or botnet attacks.
      - Question: how can the connectivity of a network be quantified?
      [Figure: example correlation graph with nodes abc.com, xyz.com, hga.com and hebf.com. Each node denotes a domain; the weight of each edge denotes the confidence.]
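Turning pairwise rules into an undirected graph could look like the sketch below. The slides do not say which of the two conditional probabilities serves as the edge confidence, so taking the larger of P(A|B) and P(B|A) is an assumption here, as are the function name and the second, hypothetical rule.

```python
def build_correlation_graph(rules, threshold=0.2):
    """Build an undirected weighted graph as a dict of dicts.

    rules: iterable of (domain_a, domain_b, p_a_given_b, p_b_given_a).
    Assumption: edge weight = max of the two conditional probabilities.
    """
    graph = {}
    for a, b, p_a_given_b, p_b_given_a in rules:
        weight = max(p_a_given_b, p_b_given_a)
        if weight > threshold:
            graph.setdefault(a, {})[b] = weight
            graph.setdefault(b, {})[a] = weight
    return graph

rules = [
    # Values taken from the pairwise rule table in this deck.
    ("pivotal.io", "montecarlo.com", 1.000000, 0.017857),
    # Hypothetical weak rule that falls below the threshold.
    ("x.example", "y.example", 0.10, 0.15),
]
graph = build_correlation_graph(rules)
```

Only the first rule produces an edge; the weak pair is dropped, so neither of its domains appears in the graph.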
  • 20. OddBall Metrics for Graph Anomaly Detection
      - We take the OddBall approach* to quantify the connectivity of each domain's network:
          - Identify each domain's one-step neighborhood (also called its ego-net).
          - Extract two graph features from the ego-net:
              - N: number of neighbors
              - E: number of edges in the ego-net
      - The number of neighbors and the number of edges follow a power law: E ∝ N^α, 1 ≤ α ≤ 2
      - Use log(E)/log(N) to approximate the slope. log(E)/log(N) > 1 indicates some degree of connectivity among neighbors.
      - The higher the ratio, the higher the degree of connectivity (given the same number of neighbors). Generally an OddBall ratio of >1.5 is more interesting.
      - One can additionally compute the clique percentage, the ratio between E and the number of edges needed to form a clique, E/[(N^2+N)/2], to measure network connectivity.
      * OddBall: Spotting Anomalies in Weighted Graphs, Leman Akoglu et al., PAKDD, Hyderabad, India, June 2010.
      Picture source: ICDM'12 tutorial on graph anomaly detection
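The two ego-net features and the derived ratios can be computed with plain Python sets. This is a sketch under the definitions on this slide, not the OddBall reference implementation; the function name `egonet_metrics` and the edge-list input format are assumptions.

```python
import math
from itertools import combinations

def egonet_metrics(edges, ego):
    """Return (N, E, log(E)/log(N), clique percentage) for one domain."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    neighbors = adj.get(ego, set())
    n = len(neighbors)
    nodes = neighbors | {ego}
    # E counts every edge whose two endpoints both lie in the ego-net.
    e = sum(1 for u, v in combinations(sorted(nodes), 2) if v in adj[u])
    ratio = math.log(e) / math.log(n) if n > 1 and e > 0 else float("nan")
    clique_pct = e / ((n * n + n) / 2) if n > 0 else float("nan")
    return n, e, ratio, clique_pct

# A fully connected 5-node graph: the ego has 4 neighbors and 10 edges.
domains = ["a.com", "b.com", "c.com", "d.com", "e.com"]
clique_edges = list(combinations(domains, 2))
n, e, ratio, clique_pct = egonet_metrics(clique_edges, "a.com")
```

For the clique above this gives N = 4, E = 10, a ratio of about 1.66 and a clique percentage of 100%. A star-shaped ego-net with the same 4 neighbors would instead have E = 4, a ratio of 1.0 and a clique percentage of 40%.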
  • 21. Sample Domains with Highly Connected Networks
      The highlighted domain (a.com, clique percentage 100%) has a fully connected network: a clique!
      | Domain | # Neighbors | Neighbors                                                         | # Edges | log(E)/log(N) | Clique Percent | # User Names |
      | a.com  | 4           | {b.com, c.com, d.com, e.com}                                      | 10      | 1.66          | 100%           | 6            |
      | s.com  | 7           | {a.com, b.com, c.com, d.com, e.com, f.com}                        | 27      | 1.69          | 96%            | 9            |
      | r.com  | 9           | {a.com, b.com, c.com, d.com, e.com, f.com, g.com, h.com, i.com}   | 43      | 1.71          | 96%            | 7            |
      | abc.ru | 9           | {a.com, b.com, c.com, d.com, e.com, f.com, g.com, h.com, i.com}   | 42      | 1.70          | 93%            | 11           |
      [Figure: network of a.com, b.com, c.com, d.com and e.com]
  • 22. Detecting Isolated Clusters
      - Given the domain correlation graph, one can also identify isolated groups of domains that interact only with domains in the same group and not with others (a botnet-like structure).
      - This can be formulated as the task of finding connected components (CCs) in a graph.
      - The example below shows that malicious sites tend to exist in small CCs.
      [Figure: sample connected component containing qre.com, jekc.com, fbc.com, abc.com, ghk.com and bcd.com, with one node marked as a known malicious site]
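Finding connected components needs only a breadth-first search over the correlation graph. The sketch below uses a hypothetical edge list with placeholder `.example` domains; it is a generic illustration of the technique rather than the implementation used in the project.

```python
from collections import deque

def connected_components(edges):
    """Return the connected components of an undirected graph as sets."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen = set()
    components = []
    for start in adj:
        if start in seen:
            continue
        component = set()
        queue = deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbor in adj[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components

# Hypothetical graph: one larger group and one small isolated group.
edges = [
    ("a.example", "b.example"),
    ("b.example", "c.example"),
    ("c.example", "d.example"),
    ("x.example", "y.example"),
]
components = sorted(connected_components(edges), key=len)
```

Sorting the components by size puts the small isolated groups first, which is where the slide suggests malicious sites tend to appear.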
  • 23. Operationalization and Outlook
  • 24. Operationalization Vision
      - Run Algorithms
          - Owned by Data Engineer/Data Scientist
          - Incrementally (e.g. weekly) update models using new batches of data, e.g. as a cron job
      - Inspect Anomalies
          - Owned by security team
          - Ideally, model outputs are provided via interactive web dashboards
      - Evaluate Model Outputs
          - Feedback on model performance from the security team
          - Opportunities for refinement and ideas for new models
      - Refine Algorithms
          - Owned by Data Scientist
          - Refine algorithms
      - Load New Data
          - Owned by Data Engineer
          - Load new data