Data Science Driven Malware Detection

© Copyright 2015 Pivotal. All rights reserved.
Malicious Domain Association
Anirudh Kondaveeti, PhD
Principal Data Scientist

Project Goal
 Goal: Find domains that have time and user based co-occurrence
relationships to aid the detection of coordinated network attacks.
 Example: Domain A is a watering hole. It redirects users to an exploit kit at
Domain B within a short time window.
– B is relatively unknown: visiting B is a low-frequency (low-support) event.
– Visits to B almost always follow a visit to A: the conditional probability (confidence) of an earlier visit to A, given that B is visited later on, is high.
[Figure: attack flow. User visits watering hole domain A -> watering hole domain A redirects to domain B -> domain B hosts exploit kit -> user machine compromised.]

Data Sources & Preprocessing
 Historical Proxy Logs
– Information about “who is accessing which website at what time”
– Approx. 3 months of data with billions of connection records
 Local Domain White List
– List of non-malicious websites
 Preprocessing
– Host name normalization (anirudh.facebook.com -> facebook.com)
– Filter invalid host names (e.g. www.facebook,ca)
– Identify “unpopular” domains (e.g. www.francelegal.com)
– User-specific sessionization
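
The first two preprocessing steps can be sketched in Python as follows. This is a minimal illustration, not the deck's implementation: the function names are hypothetical, the validity rule is an assumed syntactic check, and the normalization naively keeps the last two labels (a real pipeline would need a public suffix list to handle suffixes such as "co.uk").

```python
import re

# Assumption: a valid host name is a series of dot-separated labels made of
# letters, digits, and hyphens, ending in an alphabetic top-level label.
_LABEL = r"[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?"
_HOST_RE = re.compile(rf"^(?:{_LABEL}\.)+[a-z]{{2,63}}$")


def is_valid_host(host: str) -> bool:
    """Return True if the host name looks syntactically valid."""
    return bool(_HOST_RE.match(host.strip().lower()))


def normalize_host(host: str) -> str:
    """Naively reduce a host name to its last two labels.

    Assumption: multi-part public suffixes (e.g. "co.uk") are ignored here.
    """
    labels = host.strip().lower().split(".")
    return ".".join(labels[-2:])


print(normalize_host("anirudh.facebook.com"))  # facebook.com
print(is_valid_host("www.facebook,ca"))        # False (comma is not allowed)
```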

User-Specific Sessionization
 Each user’s proxy logs are sessionized so that two consecutive connections
in the same session occur within a user-specified time window (e.g. 60s).
 Sequential patterns are derived from sessionized data.
Connection Time       Domain                Session ID
2015-07-03 12:41:08   googlevideo.com       1
2015-07-03 12:41:09   twitter.com           1
2015-07-03 12:41:12   youtube.com           1
2015-07-03 12:41:14   doubleclick.net       1
2015-07-03 12:41:15   google.com            1
2015-07-03 12:41:15   googleanalytics.com   1
2015-07-03 12:41:28   youtube.com           1
2015-07-03 12:59:23   facebook.com          2    (>60s after the previous connection: start a new session)
2015-07-03 12:59:24   yahoo.com             2
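
The sessionization rule can be sketched as below. The deck describes a SQL implementation; this Python version only illustrates the gap rule, with a hypothetical function name and a subset of the rows from the table above.

```python
from datetime import datetime, timedelta


def sessionize(events, gap_seconds=60):
    """Assign session IDs to one user's (timestamp, domain) events.

    A new session starts whenever the gap to the previous connection
    exceeds gap_seconds.
    """
    gap = timedelta(seconds=gap_seconds)
    session_id = 0
    previous = None
    out = []
    for ts, domain in sorted(events):
        if previous is None or ts - previous > gap:
            session_id += 1
        out.append((ts, domain, session_id))
        previous = ts
    return out


def _t(text):
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")


log = [
    (_t("2015-07-03 12:41:08"), "googlevideo.com"),
    (_t("2015-07-03 12:41:15"), "google.com"),
    (_t("2015-07-03 12:41:28"), "youtube.com"),
    (_t("2015-07-03 12:59:23"), "facebook.com"),
    (_t("2015-07-03 12:59:24"), "yahoo.com"),
]
sessions = sessionize(log)
```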

Modeling Approaches
 Sequential Pattern Mining
– Find time-ordered co-occurrence relationships between multiple domains.
– Output low frequency, high confidence sequences of domains:
[{Domain1},{Domain2, Domain3},…] => [DomainN].
 Graph Mining
– Build a “social network” graph between domains by creating edges
between pairs of domains that are associated with high confidence
– Use graph based algorithms to find fully and partially connected
subgraphs
 The two approaches can be used in conjunction to complement each other.

Modeling Framework Design Considerations
 Operational feasibility
– Incremental data processing and modeling on incoming new data, e.g. on a weekly
basis, to distribute workload over time.
– Results are updated to incorporate new model outputs.
 Computational tractability
– Implement most of the modeling frameworks in plain SQL, and design efficient window functions to achieve better runtime performance.
– Explicit PL/R routine parallelization to leverage the Massively Parallel Processing
architecture of the Greenplum database.

An Incremental Modeling Framework
Initial Run:
• Input: Initial Proxy Logs & Domain Whitelist
• Preprocessing (host normalization & validation, data filtering, sessionization) -> Preprocessed Proxy Logs
• Model Execution (Sequential Pattern Mining, Graph Mining) -> Model-Specific Results
Update:
• Input: New Proxy Logs & (Possibly) Updated Domain Whitelist
• Preprocessing (host normalization & validation, data filtering, sessionization) -> Preprocessed New Proxy Logs
• Model Update (Sequential Pattern Mining, Graph Mining) -> Updated Model-Specific Results

Modeling Approaches
Sequential Pattern Mining

Model Execution: Sequential Pattern Mining
 Create time-ordered domain sequences from sessionized data
 Given a list of targeted domains (e.g. rare domains), select the subset of sequences containing those domains
 Find high confidence, low support sequential patterns of targeted domains in parallel

Sequence Creation
 Each sequence contains domains in a session
by the same user.
 Domains are ordered by connection time.
 Sequences for the example log below:
– Sequence 1: [ {googlevideo.com}, {twitter.com}, {youtube.com}, {doubleclick.net}, {google.com}, {googleanalytics.com} ]
– Sequence 2: [ {facebook.com}, {yahoo.com} ]
Connection Time       Domain                Session ID
2015-01-06 14:41:08   googlevideo.com       1
2015-01-06 14:41:09   twitter.com           1
2015-01-06 14:41:12   youtube.com           1
2015-01-06 14:41:14   doubleclick.net       1
2015-01-06 14:41:15   google.com            1
2015-01-06 14:41:15   googleanalytics.com   1
2015-01-06 14:59:23   facebook.com          2
2015-01-06 14:59:24   yahoo.com             2
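
Sequence creation from sessionized rows might look like the sketch below. The row layout, the user label "u1", and the function name are assumptions for illustration; as in the example above, every connection becomes its own single-domain itemset.

```python
from collections import defaultdict


def build_sequences(rows):
    """Group (user, session_id, timestamp, domain) rows into sequences.

    Each sequence is a list of single-domain itemsets ordered by time.
    Timestamps are ISO-style strings, so plain sorting orders them by time.
    """
    grouped = defaultdict(list)
    for user, session_id, ts, domain in rows:
        grouped[(user, session_id)].append((ts, domain))
    return {
        key: [{domain} for _, domain in sorted(visits)]
        for key, visits in grouped.items()
    }


rows = [
    ("u1", 1, "2015-01-06 14:41:08", "googlevideo.com"),
    ("u1", 1, "2015-01-06 14:41:09", "twitter.com"),
    ("u1", 2, "2015-01-06 14:59:24", "yahoo.com"),
    ("u1", 2, "2015-01-06 14:59:23", "facebook.com"),
]
sequences = build_sequences(rows)
```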

Sequence Statistics
 sup: Support of a pattern P is the fraction of sequences in which P occurs
– sup({a,e}) = 2/10
 conf: Confidence of a rule X => Y is the proportion of sequences (transactions) containing X that also contain Y
– conf({a => e}) = sup({a,e})/sup({a}) = (2/10)/(5/10) = 2/5
 #users: Number of distinct users for which a pattern P occurs
– #users({a}) = 1
 sup and #users follow a monotone property: extending a pattern can never increase either value, i.e.
– {a,e} ⊇ {a}
– sup({a,e}) ≤ sup({a})
– #users({a,e}) ≤ #users({a})
(Example values refer to a figure showing 10 sequences from a single user.)
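
The three statistics can be sketched as follows. The ten sequences here are hypothetical, constructed only to reproduce the values quoted above (they are not the sequences from the original figure), and the sketch assumes a pattern is an ordered list of itemsets that must occur in that order.

```python
def contains(sequence, pattern):
    """True if the itemsets of pattern occur, in order, within sequence."""
    pos = 0
    for itemset in pattern:
        while pos < len(sequence) and not itemset <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def support(sequences, pattern):
    """Fraction of (user, sequence) pairs whose sequence contains pattern."""
    return sum(contains(seq, pattern) for _, seq in sequences) / len(sequences)


def confidence(sequences, lhs, rhs):
    """sup(lhs followed by rhs) / sup(lhs)."""
    return support(sequences, lhs + rhs) / support(sequences, lhs)


def num_users(sequences, pattern):
    """Number of distinct users with at least one matching sequence."""
    return len({user for user, seq in sequences if contains(seq, pattern)})


# Hypothetical data: 10 sequences from a single user "u1".
data = [
    ("u1", [{"a"}, {"e"}]),
    ("u1", [{"a"}, {"b"}, {"e"}]),
    ("u1", [{"a"}, {"b"}]),
    ("u1", [{"a"}]),
    ("u1", [{"c"}, {"a"}]),
] + [("u1", [{"b"}, {"c"}])] * 5
```

With this data, the pattern a followed by e has support 2/10, the rule a => e has confidence 2/5, and #users({a}) is 1.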

Sequential Pattern Mining (SPM) in Parallel
 Developed a scalable algorithm in Greenplum database (GPDB) to identify patterns with low support and high confidence that occur in at least a minimum number of users' sequences.
 High confidence patterns relating to a given set of domains are obtained in parallel, i.e., SPM runs independently on different subsets of sequences for different domains.
Pseudo code:
    SELECT a_targeted_domain,
           sequential_pattern_mining(min_support, min_confidence, min_num_users)
    FROM input_table
Workflow:
• Find a domain A with small support (or a known bad domain)
• Subset the sequences from the data containing A
• Find sequential patterns of A with high confidence
• Repeat for all A in parallel on separate GPDB nodes
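
The per-domain workflow can be illustrated with a much simpler stand-in for the deck's algorithm. The sketch below only mines single-domain rules of the form x => A, treats sequences as plain domain lists, and uses Python threads in place of GPDB nodes; all names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def mine_for_domain(target, sequences, min_confidence):
    """Find rules x => target using only the sequences that contain target."""
    subset = [seq for seq in sequences if target in seq]
    lhs_count, rule_count = {}, {}
    for seq in subset:
        last = len(seq) - 1 - seq[::-1].index(target)  # last visit to target
        for x in set(seq) - {target}:
            lhs_count[x] = lhs_count.get(x, 0) + 1
        for x in set(seq[:last]) - {target}:  # domains seen before target
            rule_count[x] = rule_count.get(x, 0) + 1
    return {
        x: rule_count[x] / lhs_count[x]
        for x in rule_count
        if rule_count[x] / lhs_count[x] >= min_confidence
    }


def mine_all(targets, sequences, min_confidence=0.5):
    """Run the per-domain mining independently for every target domain."""
    with ThreadPoolExecutor() as pool:
        results = pool.map(
            lambda t: mine_for_domain(t, sequences, min_confidence), targets
        )
        return dict(zip(targets, results))


toy = [["a", "b"], ["a", "b"], ["c", "b"], ["a", "c"]]
rules = mine_all(["b"], toy)
```

Because each target only sees its own subset, the jobs share no state and can be distributed freely. In the toy data, a => b gets confidence 1.0 even though a also occurs in a sequence without b, which is the effect relative confidence is meant to correct.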

Relative Confidence to Adjust Ranking of Patterns
 For each domain of interest, SPM is run only on the subset of sequences containing that domain. This
may cause some sequential patterns to have artificially high confidence.
 Recall: confidence(X => Y) := support(<X,Y>) / support(X) = |<X,Y>| / |X|. Here |X|, the number of sequences in the subset that contain the left hand side pattern, may not reflect the popularity of X in the full dataset.
 We define relative confidence as: relative_confidence(X => Y) := |<X,Y>| / |X|_fullset, where |X|_fullset is the number of sequences in the full dataset that contain the left hand side pattern.
 Relative confidence favors the pattern whose left hand side contains less popular domains (see the
highlighted example below).
Domain             Pattern                                                  Supp    Conf   Rel Conf
revenueindia.net   <{google.com},{facebook.com}> => <{revenueindia.net}>    0.079   0.75   0.0001
revenueindia.net   <{google.com}, {fileshare.com}> =>
revenueindia.net   <{fileshare.com},{redworm.com}> =>
(Figure annotation: relative confidence favors an unpopular left hand side pattern.)
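
The difference between the two measures can be sketched for single-domain rules x => y. The data and the ".example" domain names below are hypothetical, and the function assumes x occurs in at least one sequence that contains y.

```python
def occurs_before(seq, x, y):
    """True if x appears before the last occurrence of y in seq."""
    if x not in seq or y not in seq:
        return False
    return seq.index(x) < len(seq) - 1 - seq[::-1].index(y)


def confidences(sequences, x, y):
    """Return (confidence, relative_confidence) for the rule x => y."""
    subset = [s for s in sequences if y in s]   # what the per-domain SPM sees
    both = sum(occurs_before(s, x, y) for s in subset)
    x_in_subset = sum(x in s for s in subset)   # |X| within the subset
    x_in_full = sum(x in s for s in sequences)  # |X|_fullset
    return both / x_in_subset, both / x_in_full


full = (
    [["popular.example", "rare.example"]]
    + [["popular.example", "news.example"]] * 9
    + [["fileshare.example", "rare.example"]]
)
popular = confidences(full, "popular.example", "rare.example")
unpopular = confidences(full, "fileshare.example", "rare.example")
```

Both rules have confidence 1.0 within the subset, but the rule with the popular left hand side drops to a relative confidence of 0.1, while the rule with the unpopular left hand side stays at 1.0.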

Model Update: Sequential Pattern Mining
 The model update module for sequential pattern mining follows the same workflow as the model execution module, applied to the incoming new data.
 The one additional step is to merge the new results with the existing set of patterns, which includes updating the rule quality metrics (support, confidence, etc.).
 Create time-ordered domain sequences from new sessionized data
 Given a list of targeted domains (e.g. rare domains), select the subset of sequences containing those domains
 Find high confidence, low support sequential patterns of targeted domains in parallel
 Merge new results with the existing set of patterns
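
One way to make the merge step straightforward is to store raw counts rather than ratios, so that metrics can be recomputed after each batch. The sketch below is an assumption about how this could be done, not the deck's implementation; it also assumes each batch reports counts for every tracked rule and that batches contain disjoint sequences.

```python
from dataclasses import dataclass, field


@dataclass
class RuleStats:
    """Raw counts for one rule, kept so ratios can be recomputed."""
    rule_count: int = 0   # sequences containing the whole rule
    lhs_count: int = 0    # sequences containing the left hand side
    users: set = field(default_factory=set)

    def merge(self, other):
        # Counts add up; users are unioned because the same user may
        # appear in both batches and must be counted only once.
        return RuleStats(
            self.rule_count + other.rule_count,
            self.lhs_count + other.lhs_count,
            self.users | other.users,
        )

    @property
    def confidence(self):
        return self.rule_count / self.lhs_count

    def support(self, total_sequences):
        return self.rule_count / total_sequences


def merge_results(existing, new):
    """Merge two {rule: RuleStats} dictionaries."""
    merged = dict(existing)
    for rule, stats in new.items():
        merged[rule] = merged[rule].merge(stats) if rule in merged else stats
    return merged


old = {"r": RuleStats(3, 4, {"u1", "u2"})}
batch = {"r": RuleStats(1, 4, {"u2", "u3"}), "s": RuleStats(1, 1, {"u9"})}
merged = merge_results(old, batch)
```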

Modeling Approaches
Graph Mining

Model Execution: Graph Mining
 Construct “baskets” of domains (co-occurrence domains) by running a sliding window of a certain time interval through the data
 Find high confidence, low support pairwise association rules of the form Domain 1 => Domain 2
 Create a social network of domains
 Find partially and fully connected sub-graphs

Construction of “Baskets”
 Domains visited by a user in a certain
time window form a “basket”, analogous
to items purchased in a single
transaction as in market basket analysis.
 The time interval for the sliding window
(60s window used in the implementation)
can be tuned.
 A basket contains distinct domains in a
sliding window:
Example (log below):
Basket 1 = {googlevideo.com, twitter.com, youtube.com, doubleclick.net, google.com}
Connection Time       Domain
2015-01-06 14:41:00   googlevideo.com
2015-01-06 14:41:09   twitter.com
2015-01-06 14:41:12   youtube.com
2015-01-06 14:41:14   doubleclick.net
2015-01-06 14:42:00   google.com
2015-01-06 14:42:05   googleanalytics.com
2015-01-06 14:42:08   pivotal.io
2015-01-06 14:59:23   facebook.com
2015-01-06 14:59:24   yahoo.com
(In the original figure, brackets labeled 1 and 2 mark example windows.)
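
Basket construction can be sketched as below. The window semantics are an assumption: one window is started at every connection and covers that timestamp plus 60 seconds, inclusive. The deck does not state how the window advances.

```python
from datetime import datetime, timedelta


def build_baskets(events, window_seconds=60):
    """Return one basket per event for a single user.

    events: time-sorted list of (timestamp, domain).
    Each basket holds the distinct domains seen in [t, t + window].
    """
    window = timedelta(seconds=window_seconds)
    baskets = []
    for i, (start, _) in enumerate(events):
        basket = set()
        for ts, domain in events[i:]:
            if ts - start > window:
                break
            basket.add(domain)
        baskets.append(basket)
    return baskets


def _t(text):
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")


log = [
    (_t("2015-01-06 14:41:00"), "googlevideo.com"),
    (_t("2015-01-06 14:41:09"), "twitter.com"),
    (_t("2015-01-06 14:41:12"), "youtube.com"),
    (_t("2015-01-06 14:41:14"), "doubleclick.net"),
    (_t("2015-01-06 14:42:00"), "google.com"),
    (_t("2015-01-06 14:42:05"), "googleanalytics.com"),
]
baskets = build_baskets(log)
```

Under this assumption the first window reproduces Basket 1 from the example: google.com at 14:42:00 is exactly 60 seconds after the start and is included, while googleanalytics.com at 14:42:05 is not.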

Pairwise Association Rule Mining
 Given domain-to-basket assignments, pairwise association rule mining mainly
involves evaluation of:
– Co-occurrence frequency: the number of times two domains fall in a common basket.
– Conditional probability: probability of seeing domain 2 given domain 1 is present.
 Pairwise rule mining is implemented in plain SQL in a scalable fashion.
Domain A    Domain B            # {A,B}  # A  # B  P(A|B)    P(B|A)    # A to B  # B to A  # AB Same Time  Max(# User Names/M)  # Date  Min Date    Max Date
pivotal.io  montecarlo.com      10       560  10   1.000000  0.017857  9         0         1               1                    1       2015-02-26  2015-02-26
pivotal.io  bigbangtheory.com   25       560  26   0.961538  0.044643  21        4         0               2                    1       2015-02-23  2015-02-23
pivotal.io  sciencefiction.com  78       560  97   0.804124  0.139286  61        15        2               4                    8       2015-01-23  2015-02-17
High confidence (>0.5) associations involving multiple users over several days (e.g. highlighted rules) are generally more interesting.
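
The two core quantities, co-occurrence frequency and conditional probability, can be sketched as below. The deck implements this step in plain SQL; the Python version and its function name are only illustrative, and the baskets are toy data.

```python
from collections import Counter
from itertools import combinations


def pairwise_rules(baskets):
    """Return {(a, b): (n_ab, n_a, n_b, P(a|b), P(b|a))} for co-occurring pairs.

    n_ab counts baskets containing both domains; P(a|b) = n_ab / n_b.
    """
    single = Counter()
    pair = Counter()
    for basket in baskets:
        domains = sorted(set(basket))
        single.update(domains)
        pair.update(combinations(domains, 2))
    return {
        (a, b): (n_ab, single[a], single[b], n_ab / single[b], n_ab / single[a])
        for (a, b), n_ab in pair.items()
    }


toy_baskets = [{"a", "b"}, {"a", "b"}, {"a"}, {"a", "c"}]
pair_stats = pairwise_rules(toy_baskets)
```

The same arithmetic reproduces the first table row: with # {A,B} = 10, # A = 560 and # B = 10, P(A|B) = 10/10 = 1.0 and P(B|A) = 10/560 ≈ 0.017857.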

Exploring Interactions between Domains
 To explore the interactions between domains, we build an
undirected correlation graph using the discovered pairwise
domain association rules.
 Each node in the graph is a domain. An edge connects two
domains if their co-occurrence confidence is higher than a
threshold (e.g. 0.2).
 The example figure shows the tightly connected “social network” of a particular domain.
 Partially and fully connected networks indicate possible watering-hole or botnet attacks.
 Question: How to quantify the connectivity of a network?
[Figure: example domain graph with nodes abc.com, xyz.com, hga.com, and hebf.com. Each node denotes a domain; the weight of each edge denotes the confidence (weights shown: 0.25, 0.37, 0.71, 0.52, 0.1, 0.6, 0.1).]
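
Building the undirected graph from pairwise rules can be sketched as below. The sketch assumes one confidence value per unordered pair has already been chosen (for example the larger of the two directional confidences, which the deck does not specify); the rule values and ".example" names are hypothetical.

```python
def build_graph(rules, threshold=0.2):
    """Build an undirected weighted graph from pairwise confidences.

    rules: {(domain_a, domain_b): confidence}
    Returns {domain: {neighbor: confidence}} with edges above the threshold.
    """
    graph = {}
    for (a, b), conf in rules.items():
        if conf > threshold:
            graph.setdefault(a, {})[b] = conf
            graph.setdefault(b, {})[a] = conf
    return graph


example_rules = {
    ("a.example", "b.example"): 0.71,
    ("a.example", "c.example"): 0.10,  # below threshold, no edge
    ("b.example", "d.example"): 0.37,
}
graph = build_graph(example_rules)
```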

OddBall Metrics for Graph Anomaly Detection
 We take the OddBall approach* to quantify the connectivity of each domain’s network:
– Identify each domain’s one-step neighborhood (also called ego-net).
– Extract two graph features from the ego-net:
▪ N: Number of neighbors
▪ E: Number of edges in the ego-net
 The number of neighbors and the number of edges follow a power law: E ∝ N^α, 1 ≤ α ≤ 2
* OddBall: Spotting Anomalies in Weighted Graphs, Leman Akoglu et al., PAKDD, Hyderabad, India, June 2010.
Picture Source: ICDM’12 tutorial
on graph anomaly detection
• Use log(E)/log(N) to approximate the slope. log(E)/log(N) > 1 indicates some degree of connectivity among neighbors.
• The higher the ratio, the higher the degree of connectivity (given the same number of neighbors). Generally an OddBall ratio of >1.5 is more interesting.
• One can additionally compute the clique percentage to measure network connectivity: the ratio between E and the number of edges needed to form a clique, E/[(N^2+N)/2]. The ego-net has N+1 nodes (the domain plus its N neighbors), so a clique needs (N+1)N/2 = (N^2+N)/2 edges.
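
Both metrics can be computed from an adjacency structure as sketched below. The function name is hypothetical, and the sketch assumes an undirected graph without self-loops in which the domain has at least one neighbor.

```python
import math


def egonet_metrics(graph, domain):
    """Return (N, E, log(E)/log(N), clique percentage) for a domain's ego-net.

    graph: {node: iterable of neighbors}, undirected.
    """
    neighbors = set(graph[domain])
    nodes = neighbors | {domain}
    # Collect each undirected edge inside the ego-net exactly once.
    edges = {
        frozenset((u, v))
        for u in nodes
        for v in graph.get(u, ())
        if v in nodes
    }
    n, e = len(neighbors), len(edges)
    ratio = math.log(e) / math.log(n) if n > 1 else float("nan")
    clique_pct = e / ((n * n + n) / 2)
    return n, e, ratio, clique_pct


# A fully connected network of five domains (a clique).
names = ["a.com", "b.com", "c.com", "d.com", "e.com"]
clique = {u: {v for v in names if v != u} for u in names}
n, e, ratio, clique_pct = egonet_metrics(clique, "a.com")
```

For the five-node clique this gives N = 4, E = 10, a ratio of about 1.66, and a clique percentage of 100%.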

Sample Domains with Highly Connected Networks
Domain   # Neighbors  Neighbors                                                          # Edges  log(E)/log(N)  Clique Percent  # User Names
a.com    4            {b.com, c.com, d.com, e.com}                                       10       1.66           100%            6
s.com    7            {a.com, b.com, c.com, d.com, e.com, f.com}                         27       1.69           96%             9
r.com    9            {a.com, b.com, c.com, d.com, e.com, f.com, g.com, h.com, i.com}    43       1.71           96%             7
abc.ru   9            {a.com, b.com, c.com, d.com, e.com, f.com, g.com, h.com, i.com}    42       1.70           93%             11
The highlighted domain has a fully connected network, a clique!
[Figure: network with nodes a.com, b.com, c.com, d.com, and e.com.]

Detecting Isolated Clusters
 Given the domain correlation graph, one can also identify isolated groups of domains that
only interact with domains in the same group, but not others (a bot-net like structure).
 This can be formulated as the task of finding connected components (CCs) in a graph.
 The example below shows that malicious sites tend to exist in small CCs.
[Figure: Sample Connected Component with nodes qre.com, jekc.com, fbc.com, abc.com, ghk.com, and bcd.com; one node is marked as a known malicious site.]
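
Finding connected components can be sketched with a breadth-first search over the domain graph. This is a generic illustration with hypothetical names and toy data, not the deck's implementation.

```python
from collections import deque


def connected_components(graph):
    """Return the connected components of an undirected graph.

    graph: {node: iterable of neighbors}
    """
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbor in graph.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components


toy_graph = {
    "a": {"b"}, "b": {"a", "c"}, "c": {"b"},
    "x": {"y"}, "y": {"x"},
}
components = sorted(connected_components(toy_graph), key=len)
```

Small components such as {x, y} above are the isolated groups of interest; sorting by size puts them first for inspection.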

Operationalization and
Outlook

Operationalization Vision
Cycle: Load New Data -> Run Algorithms -> Inspect Anomalies -> Evaluate Model Outputs -> Refine Algorithms -> Load New Data
 Load New Data
• Owned by Data Engineer
• Load new data
 Run Algorithms
• Owned by Data Engineer/Data Scientist
• Incrementally (e.g. weekly) update models using new batches of data, e.g. as a Cron job
 Inspect Anomalies
• Owned by security team
• Ideally model outputs provided via interactive web dashboards
 Evaluate Model Outputs
• Feedback on model performance from security team
• Opportunities for refinement and ideas for new models
 Refine Algorithms
• Owned by Data Scientist
• Refine algorithms

BUILT FOR THE SPEED OF BUSINESS
