Synthetic
Examples
For Applying
Supervised Machine Learning
In Cyber Defense
Samuel Crisanto
Naveed Ahmad
Office 365 Security
The Context
• Not customer-facing security
• Not a typical IT setup with lots of chaos
• Intrusion Detection in Data Center
• Same software running on hundreds of
thousands of machines doing similar things
[Chart: Alerts Per Day, 8/25–9/02]
+ We Catch Attacks
~440 Billion Events Per Day
Filter
Rules
Service Normal
Hacker’s Activity
Auto-Learn
Catching Bad Actors
[Charts: Detections per Day vs. Anomaly Rule, 8/25–9/2]
• Can’t learn what we haven’t seen
• But there is value in auto-learning what we have seen
• With a world-class Pen Test team you can auto-learn a lot
Supervised ML – Limitations & Opportunities
[Charts: Alerts Per Day; Detections per Day vs. Anomaly Rule (repeated from earlier slides)]
Transforming a Human Problem into an ML Problem
P – Process
U – User
SG – Security Group
RD – Remote Destination
RK – Registry Key
D – Detection
Entity with one or more associated
detections
Entity without an
associated detection
Detection
P2 Launched Process
SG1
RD1
Established Connection
D1
P1 Launched Process
P3
Launched Process
Created User
P3
Added user to Security Group
D3
U1
D2
Launched Process
P4
Registry Key Added
D4
RK1
High-Privileged
Injected Process
powershell.exe
Connection
to C2 Server
Local Security Group
net.exe
net.exe
Local User
Registry Auto-Start Key
regedit.exe
D1
D3
D2
D4
Human Triage
Process-Tree Boundary
Machine Boundary
User-Session Boundary
Benign Detection Noise
Malicious Detection
Detection Correlation
Feature Extractor
Feature Vector
Feature Extractor
Feature Vector
Malicious Example | Benign Example
<Features>
  <Feature Type="Numeric" Signal="Detection1" Operation="Count" Field="ProcessName" />
  <Feature Type="Numeric" Signal="Detection2" Operation="Max" Field="Score" />
  <Feature Type="Numeric" Signal="Detection3" Operation="MaxSum" Field="Bytes,IP,Port" />
  ...
</Features>
Extracting Features
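A feature extractor driven by a config like the one above can be sketched as follows. This is a minimal sketch: the dict-based detection records, the field values, and supporting only the `Count` and `Max` operations are my assumptions (the deck's `MaxSum` is omitted).

```python
# One entry per <Feature> element: (signal, operation, field)
FEATURE_CONFIG = [
    ("Detection1", "Count", "ProcessName"),
    ("Detection2", "Max", "Score"),
]

def extract_features(detections, config=FEATURE_CONFIG):
    """Turn raw detections (a list of dicts) into a fixed-length feature vector."""
    vector = []
    for signal, op, field in config:
        rows = [d for d in detections if d["Signal"] == signal]
        if op == "Count":
            # Number of distinct field values among matching detections
            vector.append(float(len({r[field] for r in rows})))
        elif op == "Max":
            # Largest field value among matching detections (0.0 if none)
            vector.append(max((r[field] for r in rows), default=0.0))
    return vector

detections = [
    {"Signal": "Detection1", "ProcessName": "net.exe"},
    {"Signal": "Detection1", "ProcessName": "powershell.exe"},
    {"Signal": "Detection2", "Score": 0.95},
]
print(extract_features(detections))  # [2.0, 0.95]
```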
Training Set
[3.0, 2.0, 0.95, 3.0, 5455.0, 6345.0, 2.0, 0.73, … ] – Malicious
[0.5, 0.7, 0.95, 1.0, 1151.0, 1312.0, 1.0, 0.43, … ] – Benign
[0.2, 0.9, 0.23, 1.5, 1252.0, 2113.0, 0.9, 0.31, … ] – Benign
[2.3, 1.8, 0.89, 2.3, 4995.0, 5545.0, 1.9, 0.85, … ] – Malicious
…
Known Examples
New Activity
[1.3, 2.1, 0.29, 1.3, 2791.0, 1595.0, 2.9, 0.95, … ] Malicious/Benign?
Random Forest
Machine Learning
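The deck feeds these vectors to a Random Forest. As a dependency-free sketch of the same train-on-known-examples / classify-new-activity flow, here is a 1-nearest-neighbour stand-in; the truncated four-element vectors come from the slide, but the classifier itself is my substitution, not the deck's model.

```python
import math

# Known examples: feature vectors labeled Malicious / Benign (from the slide)
TRAINING_SET = [
    ([3.0, 2.0, 0.95, 3.0], "Malicious"),
    ([0.5, 0.7, 0.95, 1.0], "Benign"),
    ([0.2, 0.9, 0.23, 1.5], "Benign"),
    ([2.3, 1.8, 0.89, 2.3], "Malicious"),
]

def classify(vector, training=TRAINING_SET):
    """Label new activity with the label of its nearest known example
    (Euclidean distance); a Random Forest plays this role in the deck."""
    return min(training, key=lambda ex: math.dist(vector, ex[0]))[1]

print(classify([1.3, 2.1, 0.29, 1.3]))  # Malicious (nearest known example)
```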
The Challenge
In Applying
Supervised ML for
Defending
a Diverse Service
1. Not enough successful malicious examples in
all the diverse populations
→ For training a good model
2. Not enough benign examples in targeted
subpopulations
→ For accurately testing and validating the model
Scarcity of Successful Attack Examples
• 438 compromised machine examples
Attack
Examples
Confirmed Attacks
Risky OCE Activity
Attack Automation
• Nearly half a million machines in our service, and growing
• For a given time window, 156K show benign anomalous behavior
[Chart: count of Benign vs. Malicious examples]
Scarcity of Successful Attack Examples
For all the Diverse Populations We Protect
• Model Performance
• 156K Benign, 438 Malicious
• Area under ROC Curve = 0.971
• Area under PR Curve = 0.999
• Awesome model or Overfit?
Detections
Solution
Crafting synthetic attack examples from past
attacks for all the diverse populations we want to
protect
Extract Attack Signals
M1
M2
M1
M2
B1
B2
B3
M1B1
M1B2
M1B3
M2B1
M2B2
M2B3
Overlaid with Attack Signals
Anomaly Scoring
Extract Attack Signals
Sampled Benign
Cartesian Bootstrapping
Sample
Cartesian Bootstrapping - Distribution
Bootstrapped malicious examples are more representative of the diverse target population
Detections
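The M×B overlay in the diagram can be sketched as a Cartesian product over sets of detections. The machine IDs and detection names here are illustrative, not from the deck's data.

```python
from itertools import product

# Attack signals extracted from known-compromised machines (M) and
# sampled benign machine baselines (B)
malicious_signals = {"M1": {"D1", "D2"}, "M2": {"D3"}}
benign_baselines = {"B1": {"D5"}, "B2": {"D6"}, "B3": {"D5", "D7"}}

def cartesian_bootstrap(malicious, benign):
    """Overlay every attack's signals onto every benign baseline:
    |M| x |B| synthetic malicious examples (M1B1, M1B2, ...)."""
    return {
        m_id + b_id: m_sig | b_sig
        for (m_id, m_sig), (b_id, b_sig) in product(malicious.items(), benign.items())
    }

synthetic = cartesian_bootstrap(malicious_signals, benign_baselines)
print(sorted(synthetic))      # ['M1B1', 'M1B2', 'M1B3', 'M2B1', 'M2B2', 'M2B3']
print(synthetic["M1B3"])      # attack signals D1, D2 plus background D5, D7
```

With 438 real attacks and a benign sample, this product is how 438 malicious examples grow into the 84K used later in the deck.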
• Model Performance - Before
• 156K Benign, 438 Malicious Examples
• Area under ROC Curve = 0.971
• Area under PR Curve = 0.999
• Overfit model performing poorly in production
• Model Performance - After
• 156K Benign, 84K Malicious Examples
• Area under ROC Curve = 0.851
• Area under PR Curve = 0.848
• Balanced model performing well in production
Cartesian Bootstrapping - Results
Malicious Activities Benign Activities
Synthetic
Attack Examples
438 156K
84K
Measuring model performance on targeted small
subpopulations
The Next
Challenge
Test on a
subpopulation
Learn on all
data
The model is trained on the entire dataset.
How does it perform on a small, targeted subpopulation?
Measuring Performance
Test on a
subpopulation
Learn on all
data
Does the model capture the behavior of smaller subpopulations?
Do we perform well on smaller services and smaller roles?
Each service has machines in different roles
Role
Count
Machine count per role
in Exchange
Services (obfuscated, truncated)
Count
Machine count per
service
Variance in services and roles
Role
Detection types per role in Exchange
Services
(obfuscated,
truncated)
Detection types per service
Variance in services and roles
Detection Type count (obfuscated) Detection Type count (obfuscated)
Images adapted from work by Walber
https://commons.wikimedia.org/wiki/File:Precisionrecall.svg
[CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0)],
from Wikimedia Commons
Ignore noise (precision)
Detect cyberattacks (recall)
Measuring success
A typical small subpopulation will have a low number of benign
machines and no examples of malicious machines
Building a dataset to validate the model
We need to borrow malicious examples from other services.
Validating on this dataset is problematic.
Building a dataset to validate the model
If we classify everything as malicious,
Precision = 22/27 ≈ 0.815
Recall = 22/22 = 1
Building a dataset to validate the model
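The slide's numbers follow directly from the precision/recall definitions: with 22 borrowed malicious examples, 5 benign machines, and everything labeled malicious, there are 22 true positives, 5 false positives, and no false negatives.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP); Recall = TP/(TP+FN)."""
    return tp / (tp + fp), tp / (tp + fn)

# All 27 machines classified malicious: 22 true positives, 5 false positives,
# and no malicious machine is missed.
precision, recall = precision_recall(tp=22, fp=5, fn=0)
print(round(precision, 3), recall)  # 0.815 1.0
```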
Instead, start by synthesizing benign examples.
Benign examples should equal malicious examples.
Building a dataset to validate the model
Use oversampling to generate synthetic benign machines.
These vary in some ways, but not others
Building a dataset to validate the model
Modify malicious examples by joining them with benign examples
Union of detections -> Synthetic malicious machine
(D1, D2) U (D1, D3) -> (D1, D2, D3)
Building a dataset to validate the model
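The union rule on the slide is just a set union over detection types:

```python
def synthesize_malicious(attack_detections, benign_detections):
    """Overlay a real attack's detections on a benign machine's background
    noise: the union is one synthetic malicious machine."""
    return attack_detections | benign_detections

# The slide's example: (D1, D2) U (D1, D3) -> (D1, D2, D3)
print(sorted(synthesize_malicious({"D1", "D2"}, {"D1", "D3"})))  # ['D1', 'D2', 'D3']
```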
Rare benign machines skew the data, and are amplified by resampling
Building a dataset to validate the model
Sample from the frequency
tables
Discard
outliers
What detections occur together,
and what are their counts?
Discard
outliers
How many types of detections
occur on a machine?
Craft synthetic benign machines
without representing outliers
Normalized
Sampling
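The two frequency tables plus outlier handling described above can be sketched like this. The percentile cutoff, the per-machine detection sets, and all names are illustrative assumptions; the deck does not give the exact procedure.

```python
import random
from collections import Counter

def normalized_sample(machines, n, outlier_pct=0.95, seed=0):
    """Craft synthetic benign machines from frequency tables, discarding
    outlier machines instead of letting resampling amplify them.

    machines: machine id -> set of detection types observed on it.
    """
    # Table 1: how many detection types occur on a machine? Drop the extreme tail.
    sizes = sorted(len(dets) for dets in machines.values())
    cutoff = sizes[int(outlier_pct * (len(sizes) - 1))]
    kept_sizes = [s for s in sizes if s <= cutoff]
    # Table 2: which detections occur, and how often, on non-outlier machines?
    freq = Counter(d for dets in machines.values()
                   if len(dets) <= cutoff for d in dets)
    types, weights = zip(*freq.items())
    rng = random.Random(seed)
    # Sample a machine size, then sample that many detections by frequency.
    return [set(rng.choices(types, weights=weights, k=rng.choice(kept_sizes)))
            for _ in range(n)]

machines = {
    "m1": {"D1"}, "m2": {"D1", "D2"}, "m3": {"D1"}, "m4": {"D2"},
    "m5": {"D1", "D2", "D3", "D4", "D7"},  # rare, noisy outlier machine
}
synthetic = normalized_sample(machines, n=3)
print(synthetic)  # only D1/D2 appear; the outlier's D3/D4/D7 are discarded
```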
Detection types per service
over a 10-day period
Normalized
Sampling
Detection type counts (obfuscated)
Services
(obfuscated,
truncated)
Services
(obfuscated,
truncated)
Number of detections
40,000??
Detections per machine
over a 10-day period
Normalized
Sampling
Detection Type Detection Count
Detection 1 33
Detection 2 16
Detection 3 17
Detection 4 5
Detection 6 3
Detection 7 37744
Detection 8 7
Detection 9 2
Detection 10 2304
… (etc) … (etc)
Detections on that machine over
the last 10 days
What’s going
on?
• Balance the classes by generating benign examples
via simple oversampling or normalized sampling
• Modify malicious examples to reflect service background noise
In Summary
Takeaways
• Pen Test + Supervised ML effectively spots
known attacks against a diverse set of services
• Use Cartesian Bootstrapping to train the model
with diverse examples
• Use Normalized Sampling to validate the model
for targeted subpopulations

BlueHat v18 || Crafting synthetic attack examples from past cyber-attacks for applying supervised machine learning in cyber defense.
