Situational Awareness, Botnet and Malware Detection in the Modern Era - Davide Papini - Codemotion Milan 2016

SITUATIONAL AWARENESS, BOTNET
AND MALWARE DETECTION
IN THE MODERN ERA
Machine Learning Enabled Advanced Security
CodeMotion Milan 2016
Davide Papini

ABOUT ME
Research & Innovation @Elettronica S.p.A.
Postdoc @ISG Royal Holloway, UK on ML
applied to cyber situational awareness.
M.Sc. Telecommunication Engineering
@Politecnico di Milano:
→ Erasmus @Danmarks Tekniske Universitet
→ Master Thesis on "Anomaly Based
Wireless Intrusion Detection Systems"
Ph.D. @Danmarks Tekniske Universitet:
→ "Attacker Modeling in Ubiquitous
Computing Systems"
→ External stay at COSIC, KU Leuven

WHAT THIS TALK IS ABOUT
Topics:
Applications of ML in Cybersecurity research.
Successful research: botnets, DGAs, early malware
detection.
ML traps.
Evaluation metrics.
NOT about:
New ML algorithms.
Showing one specific Security-ML based application.
Wearing you out with math.

MOTIVATIONAL SLIDE
Researchers took control of the botnet for 10 days: 180,000 infections,
over 70GB of data recorded.
Torpig intercepts and records keystroke information at a
low level, targeting a wide variety of applications and
websites.
It steals financial and personal information, login
credentials for social networks, etc.
Torpig periodically uploads any new data that it has
captured to a central server.
The researchers were able to infiltrate the botnet by
registering one of the domains on the list of candidates
that infected machines try to contact.

SOME STATISTICS
https://www.mcafee.com/us/resources/reports/rp-quarterly-threats-sep-2016.pdf
450,000 new malware samples per day.
20,000 of them are mobile malware.
Includes: ransomware, botnets, rootkits, trojans …

NEED A GAME CHANGER
Modern malware/intrusions are difficult to detect/block:
Code obfuscation, polymorphism and packing.
Malware written ad hoc for specific targets.
AVs are mainly signature-based.
URL blacklists cannot be updated fast enough.
Local changes are often too small/subtle to be detected.
Logs contain a lot of noise (≃ 90%).
Need for intelligent approaches:
Adapt to unforeseen "events".
Learn from data, i.e. extract behaviours, NOT signatures.
Leverage global knowledge.
Can be quasi-real-time.

Machine learning has been applied to many fields in security:
Botnet detection and classification
Mobile application analysis
Spam detection and campaigns analysis
Situational awareness through network traffic analysis
Malware download detection
and many more...
Also in many flavours:
Supervised
Unsupervised
combinations of those

BOTNETS
Situational awareness: knowledge of the health status of a
network (e.g. malware infections, intrusions and data
exfiltration).
Botnet: a network of bots (drones), i.e. programs installed
on the machines of unwitting Internet users and receiving
commands from a bot controller.

BOTNETS C&C CHANNEL
Bots connect to C&C Server in three ways:
Hard coded IP:
Bot → 1.2.3.4
Hard coded domain:
Bot → badguy.ru → 1.2.3.4
Automatically Generated Domains:
→ Bot cycles through time-dependent domains.
→ Domain names are generated using a Domain Generation
Algorithm.
→ The botmaster needs to register only one of those domains.
jhhfghf7.tk faukiijjj25.tk pvgvy.tk
cvq.com epu.org bwn.org
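To make the time-dependent rendezvous concrete, here is a minimal sketch of a toy DGA. The hashing scheme, the `seed` value, the `.tk` suffix and the name `toy_dga` are assumptions made for illustration only; they do not reproduce the algorithm of any real botnet. The point is that bot and botmaster derive the same list from the date, so the botmaster only has to register one entry.

```python
import hashlib
from datetime import date


def toy_dga(day: date, seed: str = "example-seed", count: int = 5) -> list[str]:
    """Derive `count` pseudo-random domain names from a date and a shared seed."""
    domains = []
    for index in range(count):
        # Hash the shared seed, the date and a counter: same inputs, same output.
        material = f"{seed}|{day.isoformat()}|{index}".encode()
        digest = hashlib.sha256(material).hexdigest()
        # Map the first 10 hex digits to letters to get a 10-character label.
        label = "".join(chr(ord("a") + int(ch, 16)) for ch in digest[:10])
        domains.append(label + ".tk")
    return domains


today_list = toy_dga(date(2016, 11, 25))
tomorrow_list = toy_dga(date(2016, 11, 26))
print(today_list)
```

Anyone who knows the seed and the scheme can compute the list for a given day in advance, which is also what makes registering a domain ahead of the botmaster possible.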

[Figure courtesy of E. Colombo - Cerberus]
Sinkholing: if someone else has already registered the domain,
the botmaster loses control of the botnet!

PHOENIX AND CERBERUS
Developed at Polimi and ISG@RHUL
System that relies on Machine Learning to identify DGAs:
Leverage known malicious and benign domain names to
build a classifier:
→ Distinguish Human Generated Domains from Automatically
Generated Domains (AGDs).
→ Identifies the DGA used: botnets might share the same
DGA.
Use unsupervised learning to identify new DGAs.
Traffic comes from a national authoritative DNS server.
S. Schiavoni et al., Phoenix: DGA-Based Botnet Tracking and Intelligence. In Detection of
Intrusions and Malware, and Vulnerability Assessment (DIMVA) 2014.
E. Colombo, Cerberus: Detection and Characterization of Automatically-Generated
Malicious Domains. Master Thesis, Politecnico di Milano 2014.
[Figure: Cerberus pipeline. Stages: Bootstrap, Filtering, Detection. Components: DNS Stream, Filtering, Classifier, Time Detective, Phoenix Clusters, Malicious Domains, Suspicious Domains.]
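As a rough illustration of how a domain name can be turned into numbers that a classifier can separate, the sketch below computes three simple lexical features. The features (length, character entropy, vowel ratio) and the names `shannon_entropy`, `vowel_ratio` and `domain_features` are assumptions chosen for brevity; they are not the features used by Phoenix or Cerberus. The benign example `codemotion.com` is likewise only an illustrative choice.

```python
import math
from collections import Counter


def shannon_entropy(label: str) -> float:
    """Shannon entropy (bits per character) of the characters in `label`."""
    if not label:
        return 0.0
    counts = Counter(label)
    total = len(label)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def vowel_ratio(label: str) -> float:
    """Fraction of alphabetic characters that are vowels (a, e, i, o, u)."""
    letters = [ch for ch in label.lower() if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ch in "aeiou" for ch in letters) / len(letters)


def domain_features(domain: str) -> dict[str, float]:
    """Features of the leftmost label, e.g. 'pvgvy' for 'pvgvy.tk'."""
    label = domain.split(".")[0]
    return {
        "length": float(len(label)),
        "entropy": shannon_entropy(label),
        "vowel_ratio": vowel_ratio(label),
    }


for name in ["codemotion.com", "jhhfghf7.tk", "pvgvy.tk"]:
    print(name, domain_features(name))
```

Feature vectors like these could be fed to a supervised classifier trained on known benign and malicious names, or to a clustering algorithm when no labels are available.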

CERBERUS FINDINGS
187 malicious domains detected and labeled
3,576 suspicious domains collected
47 clusters of DGA-generated domains discovered
319 new domains detected in the next 24 hours

MASTINO: REALTIME MALWARE DETECTION
Developed at Trend Micro and presented at Defcon London 2016
System for advanced realtime malware detection:
Leverages global knowledge on download events
Distinguishes malware from goodware
Based on statistical evidence and graph analysis:
Tripartite graph: URLs, Files, Machines
Intrinsic features e.g.
→ file: size, obfuscated, signed;
→ url: FQD, e2LD, query path
→ machine: malware download history, processes
Behaviour-based features:
→ Consider reputation of neighboring nodes
→ Help to classify unknown nodes
Huge work on feature engineering!
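The idea of a behaviour-based feature can be sketched as follows: link the URL, file and machine of each download event, then describe an unknown node by the reputation of its labeled neighbors. The events, the labels and the function `neighbor_reputation` are hypothetical and only illustrate the principle; Mastino's actual features and graph construction are not shown in these slides.

```python
from collections import defaultdict

# Hypothetical download events: (machine, url, file_hash).
events = [
    ("m1", "http://a.example/x.exe", "f1"),
    ("m2", "http://a.example/x.exe", "f1"),
    ("m2", "http://b.example/y.exe", "f2"),
    ("m3", "http://b.example/y.exe", "f3"),
]

# Known labels for some nodes: 1.0 = malicious, 0.0 = benign.
known_reputation = {"f1": 1.0, "m3": 0.0}

# Build an undirected graph linking the three node types of each event.
neighbors = defaultdict(set)
for machine, url, file_hash in events:
    neighbors[machine].update({url, file_hash})
    neighbors[url].update({machine, file_hash})
    neighbors[file_hash].update({machine, url})


def neighbor_reputation(node):
    """Mean reputation of the labeled neighbors of `node`, or None if none are labeled."""
    scores = [known_reputation[n] for n in neighbors[node] if n in known_reputation]
    return sum(scores) / len(scores) if scores else None


print(neighbor_reputation("http://a.example/x.exe"))
print(neighbor_reputation("f2"))
```

A value like this would be one column among many in the feature vector, next to the intrinsic features listed above.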

MASTINO SYSTEM OVERVIEW
[Figure: system overview, courtesy of M. Balduzzi - Trend Micro. Copyright 2016 Trend Micro Inc.]

MASTINO TRAINING AND DETECTION
[Figure: training and detection, courtesy of M. Balduzzi - Trend Micro]

MASTINO RESULTS
Mastino evaluation:
On testing dataset: 95.8% TP, 0.5% FP
Early detection experiment, deployed in the wild for 6
months:
→ Detected 84% of future malware
→ Verified later through VirusTotal
Detection time ≃ 0.16s!
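For readers less familiar with these metrics, the short sketch below shows how TP and FP rates are derived from confusion counts. The counts are hypothetical and were chosen only so that the rates match the percentages quoted above; they are not Mastino's evaluation data.

```python
def rates(tp: int, fn: int, fp: int, tn: int) -> tuple[float, float]:
    """Return (true positive rate, false positive rate) from confusion counts."""
    tpr = tp / (tp + fn)  # share of malware that is detected
    fpr = fp / (fp + tn)  # share of goodware that is wrongly flagged
    return tpr, fpr


# Hypothetical counts chosen only to reproduce the percentages on the slide.
tpr, fpr = rates(tp=958, fn=42, fp=5, tn=995)
print(f"TPR = {tpr:.1%}, FPR = {fpr:.1%}")
```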

ISSUES
Traditional ML was developed for "natural" objects:
Natural Language Processing.
Image analysis, e.g. picture text search.
Classification of plants and animals.
Economic laws.
Metrics like ROC, FP, FN work very well in these cases;
however, the cyber world is not natural:
Things change abruptly, e.g. updates, new malware, new
technologies.
There is no clear evolutionary law.
Change is deterministic and unpredictable.
Behaviours change/slide over time.

ML TRAPS
Machine learning often seen as a black-box panacea:
Little is understood.
Results with high accuracy are accepted without questioning their quality.
However:
Overfitting: if training and testing are not done carefully.
Validity of results: a system that works on paper may not
work in the field.
Datasets: Variety vs Chronology
Need for novel metrics!
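One way to read "Variety vs Chronology" is that a random train/test split lets the model see samples from the future, while a deployed system never does. The sketch below, with made-up dates and labels and a hypothetical helper `chronological_split`, trains only on samples first seen before a cutoff and tests on everything after it.

```python
from datetime import date

# Hypothetical samples: (first_seen, label). Dates and labels are made up.
samples = [
    (date(2016, 1, 10), "benign"),
    (date(2016, 2, 3), "malware"),
    (date(2016, 3, 21), "benign"),
    (date(2016, 5, 8), "malware"),
    (date(2016, 7, 15), "malware"),
    (date(2016, 9, 30), "benign"),
]


def chronological_split(items, cutoff):
    """Train on samples seen before `cutoff`, test on samples seen on or after it."""
    train = [s for s in items if s[0] < cutoff]
    test = [s for s in items if s[0] >= cutoff]
    return train, test


train, test = chronological_split(samples, cutoff=date(2016, 5, 1))
print(len(train), "training samples,", len(test), "test samples")
```

Results obtained this way are usually less flattering than those of a shuffled split, but closer to what an in-the-wild deployment would see.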

CONFORMAL EVALUATOR
Library developed at the Information Security Group at Royal
Holloway:
Evaluates algorithms in terms of confidence and credibility.
Its core is the Non-Conformity measure, elicited directly from the
algorithm, which in essence tells the difference between a
sample and a set of samples.
Builds decision and alpha assessments to evaluate the
algorithm.
R. Jordaney, Z. Wang, D. Papini, I. Nouretdinov and L. Cavallaro, Misleading Metrics:
On Evaluating Machine Learning for Malware with Confidence, Technical Report 2016-1,
Royal Holloway University of London.
[Figure: Conformal Evaluator Overview. Boxes: Training and Testing Dataset; Similarity Based Classification/Clustering Algorithm; Non-Conformity Measure; Conformal Evaluator; Alpha Assessment; Decision Assessment.]
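The sketch below shows how credibility and confidence can be derived from non-conformity scores, following the usual conformal prediction definitions: credibility is the p-value of the chosen label, confidence is one minus the highest p-value among the other labels. This is not the library's API, which the slides do not show; `p_value`, `assess` and all scores are hypothetical, and the p-value is simplified in that it does not count the test sample itself.

```python
def p_value(test_score: float, class_scores: list[float]) -> float:
    """Fraction of a class's non-conformity scores at least as large as the test score."""
    at_least = sum(1 for s in class_scores if s >= test_score)
    return at_least / len(class_scores)


def assess(test_scores: dict[str, float],
           reference_scores: dict[str, list[float]],
           predicted: str) -> tuple[float, float]:
    """Return (credibility, confidence) for the label chosen by the algorithm."""
    p = {label: p_value(test_scores[label], reference_scores[label])
         for label in reference_scores}
    credibility = p[predicted]
    # Confidence: one minus the best p-value among the labels that were not chosen.
    others = [value for label, value in p.items() if label != predicted]
    confidence = 1.0 - max(others) if others else 1.0
    return credibility, confidence


# Hypothetical non-conformity scores (e.g. distance to a class centroid).
reference = {
    "malicious": [0.1, 0.2, 0.3, 0.4, 0.9],
    "benign": [0.2, 0.3, 0.5, 0.6, 0.7],
}
# Non-conformity of one new sample with respect to each class.
sample = {"malicious": 0.25, "benign": 0.65}

credibility, confidence = assess(sample, reference, predicted="malicious")
print(credibility, confidence)
```

A decision with high credibility and high confidence fits its chosen class well and fits the alternatives poorly; low values on either side suggest the raw prediction should not be trusted blindly.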

CE: EXAMPLE 1
System for Botnet detection and classification
[Figure: Decision Assessment. Average algorithm credibility and average algorithm confidence (scale 0.0 to 1.0) for bifrose, sasfis, blackenergy, banbra and pushdo.
Bar labels, average algorithm correct choice: 0.86 0.27 0.29 0.31 0.9 0.2 0.84 0.18 0.95 0.15
Bar labels, average algorithm incorrect choice: 0.42 0.53 0.58 0.17 0.68 0.29 0.62 0.29 0.73 0.12]
[Figure: Alpha Assessment. P-values (scale 0.0 to 1.0) of bifrose, sasfis, blackenergy, banbra and pushdo, plotted for each family's samples.]
Although the algorithm has reasonably good results on paper,
CE shows the quality of the results is not good!
We ran experiments on another dataset to
confirm, and the classifier got worse.

CE: EXAMPLE 2
Mobile App classification: Malware vs Goodware
[Figure: Decision Assessment. Average algorithm credibility and confidence (scale 0.0 to 1.0) for correct and incorrect choices.]
[Figure: Alpha Assessment. P-values (scale 0.0 to 1.0) of MALICIOUS and BENIGN, plotted for MALICIOUS's and BENIGN's samples.]

FINAL REMARKS
Getting your hands in the game, what you need:
You need to study a bit of ML
You need a problem
You need data
You need good metrics
In the wild analysis is a plus
You need tools:
→ We did everything in Python: NumPy, SciPy
→ ML libraries: scikit-learn, shogun-toolbox.org

FINAL REMARKS
Machine Learning is great for Cyber Security!
Thanks for listening:
Questions?
