Malware classification meets crowd-sourcing
John Park, HP Security Research
October, 2015
Overview
1. Malware Classification
2. Crowd-Sourcing Logistics
3. What Crowd-Sourcing Is Good For
4. What Comes After Crowd-Sourcing
What is “Malware Classification”?
“Malware Classification” = “What a Malware Analyst Does.”
[Diagram: an Unknown File goes to the Malware Analysts, who label it Good or Bad.]
[Diagram: an Unknown File goes to the Malware Analysts, who label it Good, Bad, or Blah.]
[Diagram: an Unknown File goes to the Malware Analysts, who label it Good, Bad, Blah, or Shady, using WhiteBox Analysis, BlackBox Analysis, and Prior Knowledge.]
[Diagram: VirusTotal appears alongside WhiteBox Analysis, BlackBox Analysis, and Prior Knowledge.]
[Diagram: VirusTotal is folded into the analysis pipeline itself, next to WhiteBox Analysis, BlackBox Analysis, and Prior Knowledge.]
Malware Classification
“Super-System”
Malware Analyst Experts
Data Sharing
Automation IT Infrastructure
Algorithms
Malware Analyst Experts: makes the final call.
Data Sharing: trusted parties; layered access.
Automation IT Infrastructure: handles large numbers of files, fast.
Algorithms: this is where Crowd-Sourcing is useful; cross-field collaboration.
An easy way to crowd-source algorithms is to hold
a “Data Science Competition”.
Early Days: Netflix Prize Competition (2006-2009)
1. The user ratings data is provided.
2. Netflix's own algorithm can guess within about a 1-star range (Root Mean Square Error = 0.95).
3. If you can improve the accuracy by 10% (RMSE = 0.85), you get $1 million.
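For reference, RMSE here is just the square root of the mean squared prediction error between actual and predicted star ratings. A minimal sketch in Python with made-up ratings:

```python
import math

# Hypothetical actual vs. predicted star ratings for a handful of (user, movie) pairs.
actual    = [4.0, 3.0, 5.0, 2.0, 4.0]
predicted = [3.5, 3.0, 4.0, 3.0, 4.5]

# Root Mean Square Error: the square root of the mean squared prediction error.
rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))
print(f"RMSE = {rmse:.2f}")
```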
[Diagram: movies M1-M3 and users U1-U3 as two rows of a bipartite graph, with each rating drawn as an edge between a user and a movie.]
Microsoft Malware Classification Challenge
Data:
- 21,741 malware files
- 9 different malware families
- Hex dump, without the PE header
- Assembly file, generated with IDA
Objective:
- Classify each file into the right family.
Kaggle: like the Netflix Prize Competition,
with a shorter cycle,
with smaller prizes,
with diverse problems.
Time Frame
Competition begins: the dataset + objectives are posted.
Forum, Month 1: problem clarifying; early birds take a stab at it.
Forum, Month 2: benchmark code is shared; inflow of “benchmarkers”.
Forum, Month 3: last-minute leap-frogging; complaints and moans about the dataset.
Competition ends.
Forum, aftermath: the secret sauce is shared; the top 10 teams usually use the same approach.
Every competition is a little different, like snowflakes.
1. Security is a popular topic
2. Large dataset (~400 GB)
3. Unfamiliar data structure
Typical Data Format
[Diagram: the Training Set and the Test Set are tables of n entries by m features plus an answer column; in the Test Set, the answer column is the value you must guess.]
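As a made-up illustration of that layout (the feature values and names are invented):

```python
# A made-up example of the layout: training set = features + answer, test set = features only.
train_rows = [
    # [m features ...],          answer
    ([3, 920, 0.23, 0.40], "Bot_X"),
    ([2, 3200, 0.15, 0.24], "Hacktool"),
    ([4, 252, 0.55, 0.54], "Known Good"),
]
test_rows = [
    [3, 885, 0.34, 0.98],   # the answer column is missing: this is the value to guess
    [5, 6413, 0.78, 0.45],
]

X_train = [features for features, answer in train_rows]
y_train = [answer for _, answer in train_rows]
X_test = test_rows
print(len(X_train), "training entries,", len(X_test), "test entries")
```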
Data Science, explained by “Communication Theory”
[Diagram: Person A's brain encodes a message into words spoken by Person A (encoding using the language); words heard by Person B are decoded by Person B's brain (decoding the language); room noise, wire noise, and EM interference sit on the channel in between.]
Re-arrange it the Data Science way
[Diagram: Person A's brain → words spoken by Person A → words heard by Person B → Person B's brain, overlaid on the typical data format (Training Set and Test Set, n entries by m features plus answer). The “signal” flows this way: from the Training Set toward the Test Set's answer column.]
Typical Malware Analyst's Knowledge

Network Access | Startup Hooks | Installs Rootkit | Pings to C2 server | Includes string "hacktool" | Digitally Signed by CA | Answer
Yes | Yes | No | Yes | No | No | Bot_X
Yes | No | Yes | No | Yes | No | Hacktool
Yes | Yes | No | No | No | Yes | Known Good
No | Yes | Yes | Yes | No | No | KeyLogger
Yes | Yes | Yes | Yes | No | No | Backdoor

The feature columns come from BlackBox and WhiteBox Analysis; the Answer column comes from Prior Knowledge.
Malware Analyst's Classification Method

Network Access | Startup Hooks | Installs Rootkit | Pings to C2 server | Includes string "hacktool" | Digitally Signed by CA | Answer
Yes | Yes | No | Yes | No | No | Bot_X
Yes | No | Yes | No | Yes | No | Hacktool
Yes | Yes | No | No | No | Yes | Known Good
No | Yes | Yes | Yes | No | No | KeyLogger
Yes | Yes | Yes | Yes | No | No | Backdoor
Yes | Yes | Yes | Yes | Yes | No | Unknown
Yes | Yes | No | Yes | No | Yes | Unknown

BlackBox and WhiteBox Analysis supply the features; Prior Knowledge supplies the answers. Classify the unknown into the known.
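A minimal sketch of that "classify the unknown into the known" step: score each known row by how many yes/no features disagree with the unknown, and pick the closest. The encoding follows the table above; this is only an illustration, not the author's actual tool.

```python
# Hypothetical yes/no feature rows, following the table above:
# (network access, startup hooks, installs rootkit, pings C2, "hacktool" string, signed by CA)
known = {
    "Bot_X":      (1, 1, 0, 1, 0, 0),
    "Hacktool":   (1, 0, 1, 0, 1, 0),
    "Known Good": (1, 1, 0, 0, 0, 1),
    "KeyLogger":  (0, 1, 1, 1, 0, 0),
    "Backdoor":   (1, 1, 1, 1, 0, 0),
}

def classify(unknown_row):
    # Pick the known family whose features disagree with the unknown in the fewest positions.
    def mismatches(label):
        return sum(a != b for a, b in zip(known[label], unknown_row))
    return min(known, key=mismatches)

print(classify((1, 1, 1, 1, 1, 0)))  # closest to Backdoor (one mismatch)
```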
How Machine Learning Works (lots of tiny bits)

Number of Sections | File Size | Entropy (percentile) | Ratio of 1 bits to 0 bits | Includes string "ackto" | Includes string "cktoo" | Answer
3 | 920k | 0.23 | 0.40 | 0 | 0 | Bot_X
2 | 3200k | 0.15 | 0.24 | 1 | 1 | Hacktool
4 | 252k | 0.55 | 0.54 | 0 | 1 | Known Good
3 | 1283k | 0.23 | 0.23 | 0 | 1 | KeyLogger
8 | 884k | 0.44 | 0.21 | 0 | 0 | Backdoor
3 | 885k | 0.34 | 0.98 | 1 | 0 | Unknown
5 | 6413k | 0.78 | 0.45 | 0 | 1 | Unknown

Feature Extraction → Training Set
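A minimal sketch of extracting that kind of "lots of tiny bits" row from raw bytes; the feature set mirrors the table above and the helper name is invented:

```python
import math
from collections import Counter

def tiny_features(data: bytes) -> list:
    """Turn raw bytes into a row of small, individually meaningless features."""
    counts = Counter(data)
    total = len(data) or 1
    # Shannon entropy of the byte distribution, in bits per byte.
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    one_bits = sum(bin(b).count("1") for b in data)
    return [
        len(data),                 # file size
        entropy,                   # byte entropy
        one_bits / (8 * total),    # ratio of 1 bits to all bits
        int(b"ackto" in data),     # contains substring "ackto"
        int(b"cktoo" in data),     # contains substring "cktoo"
    ]

print(tiny_features(b"this is a hacktool sample" * 10))
```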
1st Place Winner Solution
http://blog.kaggle.com/2015/05/26/microsoft-malware-winners-interview-1st-place-no-to-overfitting/
Mental Visualization of Feature Extraction
[Diagram: each feature (opcode n-gram, segment count, virtual size, file size) is a “projection/snapshot” of the file.]
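A minimal sketch of the opcode n-gram projection, assuming the opcode mnemonics have already been pulled out of the disassembly:

```python
from collections import Counter

def opcode_ngrams(opcodes, n=3):
    """Count every length-n window of the opcode sequence."""
    return Counter(tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1))

# A hypothetical opcode stream pulled out of a disassembly listing.
ops = ["push", "mov", "xor", "call", "pop", "ret", "push", "mov", "xor", "call"]
for gram, count in opcode_ngrams(ops).most_common(3):
    print(gram, count)
```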
There are many algorithms,
but these 3 methods will suit your needs:
- Decision trees: XGBoost (most cases)
- Neural nets: Deep Learning (vision + audio)
- Hyperplanes: SVM (small data + scientific data)
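A sketch of the decision-tree option, assuming the xgboost package and its scikit-learn-style XGBClassifier wrapper are available; the feature rows and labels here are made up:

```python
import numpy as np
import xgboost as xgb  # assumes the xgboost package is installed

# Made-up feature rows (section count, file size, entropy, 1-bit ratio) and integer family labels.
X = np.array([[3, 920, 0.23, 0.40],
              [2, 3200, 0.15, 0.24],
              [4, 252, 0.55, 0.54],
              [3, 1283, 0.23, 0.23],
              [8, 884, 0.44, 0.21]])
y = np.array([0, 1, 2, 3, 4])

# Gradient-boosted decision trees; tiny settings because this is a toy example.
model = xgb.XGBClassifier(n_estimators=50, max_depth=3, learning_rate=0.1)
model.fit(X, y)

unknown = np.array([[3, 885, 0.34, 0.98]])
print(model.predict(unknown))
```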
So, plug the winning solution
into the production system,
and call it a day??

“It's like a finger pointing at the moon.
Do not concentrate on the finger,
or you will miss all of the heavenly glory.”
- Bruce Lee
We have learned that these things work well:
- Opcode n-grams (short sequences of actions)
- XGBoost (decision trees, case by case)
Nilsimsa Hash
unsupervised + supervised learning
many similarities to Deep Learning
Nilsimsa Hash
Pipeline: N-gram → Counting Bloom Filter → Normalize → Hamming Distance
Stages: Feature Extraction → Intermediate Info Storage → Importance Selector → Compare
Nilsimsa Hash: N-gram (Feature Extraction)
[Diagram: a sliding window moves over the Opcode Sequence from BOF to EOF, emitting n-grams.]
Nilsimsa Hash: Bloom Filter (Intermediate Info Storage)
Bloom Filter: 0 0 1 0 0 1 0 1   (buckets 0x0000 ... 0x0111)
A good balance of Space, Time, and Allowable Error.
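A minimal Bloom filter sketch (bucket count and hash construction are illustrative): an item that was inserted is always reported as present, and an item that was never inserted may occasionally be reported as present too.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_buckets=8, num_hashes=2):
        self.bits = [0] * num_buckets
        self.num_hashes = num_hashes

    def _positions(self, item: bytes):
        # Derive k bucket positions from a salted hash of the item.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(bytes([salt]) + item).digest()
            yield int.from_bytes(digest[:4], "big") % len(self.bits)

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: bytes) -> bool:
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add(b"mov eax, ebx")
print(bf.might_contain(b"mov eax, ebx"))   # True: inserted items are always found
print(bf.might_contain(b"jmp short $+2"))  # usually False, but a false positive is possible
```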
Nilsimsa Hash: Counting Bloom Filter (Intermediate Info Storage)
Bloom Filter:          0 0 1 0 0 1 0 1
Counting Bloom Filter: 0 0 4 0 0 7 0 1   (buckets 0x0000 ... 0x0111)
Nilsimsa Hash: Normalize (Importance Selector)
Counting Bloom Filter:       0 0 4 0 0 7 0 1   (buckets 0x0000 ... 0x0111)
Normalized:                  0 0 1 0 0 1 0 0   (1 if above avg, 0 if below avg)
Bloom Filter (for comparison): 0 0 1 0 0 1 0 1
Nilsimsa Hash: Hamming Distance (Compare)
Known File 7238: 0 1 0 0 0 1 0 1   (buckets 0x0000 ... 0x0111)
Unknown File:    0 0 1 0 0 1 0 0
Hamming distance of 3.
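Putting the four stages together, a minimal Nilsimsa-flavoured sketch (not the real Nilsimsa constants or trigram selection): count n-grams into a small bucket array, threshold each bucket against the mean, and compare two digests by Hamming distance.

```python
import hashlib

NUM_BUCKETS = 64

def digest(opcodes, n=3):
    """N-gram -> counting buckets -> normalize against the mean -> bit digest."""
    buckets = [0] * NUM_BUCKETS
    for i in range(len(opcodes) - n + 1):
        gram = " ".join(opcodes[i:i + n]).encode()
        h = int.from_bytes(hashlib.md5(gram).digest()[:4], "big")
        buckets[h % NUM_BUCKETS] += 1                       # counting Bloom filter stage
    mean = sum(buckets) / NUM_BUCKETS
    return [1 if count > mean else 0 for count in buckets]  # 1 if above average

def hamming(a, b):
    """Count the positions where two digests differ."""
    return sum(x != y for x, y in zip(a, b))

known   = digest(["push", "mov", "xor", "call", "pop", "ret"] * 20)
unknown = digest(["push", "mov", "xor", "call", "pop", "ret"] * 19 + ["nop", "nop"])
print("Hamming distance:", hamming(known, unknown))
```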
How to improve the Nilsimsa hash
(or: geo-political tension)
1. minimize collisions
2. prevent abuse
3. prepare for the unknowns
Improvements to Nilsimsa Hash (TLSH, by Trend Micro)
Pipeline: N-gram → Counting Bloom Filter → Normalize → Hamming Distance
Counting Bloom Filter: 0 0 4 0 0 7 0 1   (buckets 0x0000 ... 0x0111)
Normalized (Nilsimsa): 0 0 1 0 0 1 0 0
TLSH:                  0 0 2 0 0 3 0 1   (3 if in the 1st quarter, 2 if 2nd quarter, 1 if 3rd quarter, 0 if 4th quarter)
Source: http://www.academia.edu/7833902/TLSH_-A_Locality_Sensitive_Hash
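A sketch of the TLSH-style refinement: each bucket gets a 2-bit code based on which quartile of the bucket counts it falls in, instead of a single above/below-average bit. The exact quartile convention in real TLSH differs in detail, so the example array here is illustrative rather than the slide's numbers.

```python
def quartile_codes(buckets):
    """Give each bucket a 2-bit code: 3 for the top quartile of counts, 0 for the bottom."""
    ordered = sorted(buckets)
    n = len(ordered)
    q1, q2, q3 = ordered[n // 4], ordered[n // 2], ordered[3 * n // 4]
    def code(count):
        if count > q3:
            return 3
        if count > q2:
            return 2
        if count > q1:
            return 1
        return 0
    return [code(count) for count in buckets]

# Illustrative bucket counts; real TLSH chooses its quartile points slightly differently.
print(quartile_codes([12, 0, 3, 40, 7, 1, 25, 5]))
```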
Power Law / Long-Tail
[Figure: power-law / long-tail distribution.]
Improvements to Nilsimsa Hash (Width)
Pipeline: N-gram → Counting Bloom Filter → Normalize → Hamming Distance
CBF:        0 0 4 0 0 7 0 1                       (buckets 0x0000 ... 0x0111)
Longer CBF: 0 0 4 0 0 4 0 1 0 1 0 1 0 0 0 1       (buckets 0x0000 ... 0x0111, 0x1000 ... 0x1111)
HyperLogLog
Hash (baseball)   = 0x0100101110111010
Hash (basketball) = 0x0001010110110011
Hash (blahblah)   = 0x1010111011011010
Hash (rocket)     = 0x0000100101011101
The length of the longest run of leading '0's estimates how many distinct elements have been seen, and therefore how large the CBF needs to be, using only O(1) memory.
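A crude sketch of that estimator, closer to the single-register Flajolet-Martin trick than to full HyperLogLog (which averages many registers): hash each item, track the longest run of leading zero bits, and use 2^r as the rough distinct count.

```python
import hashlib

def leading_zeros(value: int, bits: int = 32) -> int:
    """Count leading zero bits in a fixed-width hash value."""
    return bits - value.bit_length()

def estimate_distinct(items) -> int:
    max_zeros = 0
    for item in items:
        h = int.from_bytes(hashlib.md5(item.encode()).digest()[:4], "big")
        max_zeros = max(max_zeros, leading_zeros(h))
    # If the longest run of leading zeros is r, roughly 2**r distinct items were hashed.
    return 2 ** max_zeros

print(estimate_distinct(["baseball", "basketball", "blahblah", "rocket"]))
```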
Attack: Hash Shaping
Pipeline: N-gram → Counting Bloom Filter → Normalize → Hamming Distance
Cleanfile CBF: 812 12 283 7023 204 34 13 99   (buckets 0x0000 ... 0x0111)
Malicious CBF: 540 23 140  400 170 20 30 80
The high-count bucket is the most important bucket.
If hash(xor – xor – xor – xor) happens to fall into this bucket,
the hash can be shaped by inserting 6500x “XOR EAX, EAX”.
Defend: Resist Hash Shaping, by Dual Hash
Cleanfile CBF, using hash1:  812 12 283 7023 204 34 13 99
Malicious CBF, using hash1:  540 23 140 400 170 20 30 80
hash1(xor – xor – xor – xor) falls into this bucket.
Cleanfile CBF, using hash2:  1032 1094 12 913 375 15 70 5023
These are the most important buckets.
hash2(xor – xor – xor – xor) falls into this bucket. Doh!
Dual Hash:                   1844 1106 295 7936 579 49 83 5122
This bucket should stay low.
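A minimal sketch of the dual-hash idea (bucket count and hash derivation are illustrative): every n-gram is inserted under two independent hash functions, so an attacker's padding n-gram inflates two buckets at once and has to look plausible under both views.

```python
import hashlib

NUM_BUCKETS = 8

def bucket(gram: bytes, salt: int) -> int:
    # Two "independent" hash functions, derived by salting the same digest.
    digest = hashlib.sha256(bytes([salt]) + gram).digest()
    return int.from_bytes(digest[:4], "big") % NUM_BUCKETS

def dual_hash_cbf(grams):
    counts = [0] * NUM_BUCKETS
    for gram in grams:
        counts[bucket(gram, 1)] += 1  # hash1 bucket
        counts[bucket(gram, 2)] += 1  # hash2 bucket
    return counts

# Padding a file with 6500x "xor eax, eax" inflates the two buckets this n-gram
# maps to under hash1 and hash2; both have to look plausible to shape the digest.
padding = [b"xor|xor|xor|xor"] * 6500
print(dual_hash_cbf(padding))
```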
Pipeline recap: N-gram → CBF → Normalize → Hamming Distance
(Feature Extraction → Intermediate Info Storage → Importance Selector → Compare)
PE file → Opcode N-gram → CBF → Normalize → Hamming Distance
Use IDA to convert the binary into assembly.
Prepare for the unknown
PE file → Opcode N-gram → CBF → Normalize → Hamming Distance
? Unknown Architecture
Prepare for the unknown
PE file → Opcode N-gram → CBF → Normalize → Hamming Distance
Unknown Architecture → Select-gram* → CBF → Normalize → Hamming Distance
Select-gram*: n-gram of only the top 10% most frequent elements.
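A minimal sketch of the select-gram idea under the stated assumption: treat only the top 10% most frequent byte values as "popular opcodes" and build n-grams from those alone. The function name is made up for this sketch; no disassembler is involved.

```python
from collections import Counter

def select_grams(data: bytes, n: int = 3, keep_ratio: float = 0.10):
    """N-grams over only the most frequent byte values (an ad-hoc stand-in for opcodes)."""
    freq = Counter(data)
    keep_count = max(1, int(len(freq) * keep_ratio))
    popular = {b for b, _ in freq.most_common(keep_count)}   # top ~10% of distinct byte values
    filtered = [b for b in data if b in popular]             # drop everything else, keep order
    return Counter(tuple(filtered[i:i + n]) for i in range(len(filtered) - n + 1))

# Works on raw bytes from any architecture; no disassembler required.
sample = bytes(range(40)) * 5 + bytes([0x55, 0x8B, 0xEC]) * 200
print(select_grams(sample).most_common(3))
```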
This is the last slide.
It is all about “Communication Theory.”
[Diagram: Person A's brain → words spoken by Person A → words heard by Person B → Person B's brain.]
Thank you
hp.com/go/hpsr
Editor's Notes
  • #2: Thanks for coming. It's late in the afternoon, so I will try to make this presentation interesting and worthwhile for your time. Let's jump right in. In this session, I would like to talk about "malware classification and crowd-sourcing".
  • #3: What malware classification is; the logistics of crowd-sourcing; what crowd-sourcing is good for; and the REAL focus of this talk, what comes after crowd-sourcing. [It starts 20,000 ft up in the air, then zooms down to ground level, to the bit level. I don't like people who just drop "machine learning" as "something smart", so I like to get into the nitty-gritty detail. In a way, it is a Trojan horse: I will come out easy and set the expectation that this is a fluff talk. A still-useful fluff talk.]
  • #5: I love a circular definition. Concept = actual process.
  • #8: Shady = legal, because our user has freedom of choice, but it is in bad taste. (The Apple ecosystem has the notion of "good taste"; the Windows + Android ecosystems are starting to have the notion of "good taste".) <CLICK> So, how does a malware analyst actually make that decision?
  • #9: And we have this thing called "VirusTotal". On the surface it is just a multi-vendor platform, but it is a means of communicating with the rest of the community. It is like asking this room, "have you seen this file? If so, I don't want to duplicate the work." There are already more than enough files to look at.
  • #10: We start to put VT in the pipeline, to detect and to prioritize (if multiple vendors detect it, work on it first). Then we make 3-vendor rules. <CLICK> Inside the circle is what is becoming the "uber-system". So, how do you make this "uber-system"?
  • #11: Subliminal HPE teal-box messaging. The 4 components of the malware classification "system": largely, there are 4 main components in this picture; the next slide describes them in detail.
  • #12: Malware Analysts: malware classification is not simple, as shown before, so we already need a human to make the final call; a human is always the final assurance. There are always edge cases, and the human is central. Data Sharing is "threat intel sharing": VirusTotal is one way, ThreatCentral is another; there are many platforms. Automation IT infrastructure: to process large numbers of files, we need huge server farms (see Vanja's Friday talk). Algorithms are the missing piece, and this is where Data Science comes in. It is also cross-field: algorithms are universal.
  • #13: What is good about "crowd-sourcing", or a "competition", is that you can tap into a large number of smart people from diverse fields.
  • #14: Netflix was the first "internet-scale" competition. For an internet-scale competition, the problem has to be REALLY SIMPLE to explain; these are people who are not familiar with the industry, so it is best if it can be explained within 3 bullet points. It's simple, but not easy: it took the "internet" 3 years to achieve this goal, and the winning team came up with some crazy algorithm.
  • #15: The way you approach this problem is something called "Collaborative Filtering". The simplest visualization: every rating is a line, and you construct this "2-row structure" (users on one row, movies on the other). Once that is done, imagine pouring water into User1; the edges act as conduits, and whichever movie collects the most "water" is the matching movie. It works surprisingly well!
  • #16: 7 min mark
  • #17: Even though this was the first competition from our industry, many companies already do this on Kaggle. The Netflix competition dragged out too long; 3 years was too long, and we don't have that long an attention span. They also found that it is not the prize that draws the crowd; it is ego that drives people, and the job offers. With a shorter cycle and smaller winnings, they went with fast iteration and improved the competition format. It is like having 200 research teams, all working on different methods; it is tremendous "brain collecting". ZDI is one method of crowd-sourcing research.
  • #18: Now it has a large user base. It is its own world, like security people have their own world. Depending on how interesting the topic is, or how easy the competition is, it draws on average 200 to 1000 teams. Most of the registered are non-players, or just want to watch. But the top 10% are very good.
  • #19: Month 3 leap-frogging: once you have put 200 hours into it and other people do better, you get "angry" and try everything. The best thing about these competitions is insight sharing: the insights from these diverse, short-cycle competitions "nurture" the next competition. You learn that some method works well for one dataset, then you apply a similar method to a different set.
  • #20: Every competition is a little different. It is similar to malware analysis: you need to quickly understand the core of the problem set. Security has been a popular topic for a while; on average it draws 2x the usual participation, and this is good for the security industry. This was one of the largest datasets; most datasets are 5 MB to 200 MB, if not image files. You can't do it on a MacBook Air; with data this large, the first pass needs to be a single pass. For most of the crowd, the PE file structure is foreign (it took us a while to get familiar with it).
  • #21: Data scientists like to view things as a "matrix". This is the standard table format: if you give data in this format, they are happy; if the data is in some other format, they can't do anything until it is transformed into this format. N entries, M features, answers. You are given 2 sets of data: the training set is the set with the answer, the test set is the set missing the answer. And your objective is to… <CLICK> The objective is to guess this value. Now, most people get stuck on what to do with the data at this point. They try to learn the new frameworks that come out every month and blindly plug in the data. So, let me offer you an easy framework of thought.
  • #22: I was looking into the old works of Shannon, Turing, Weaver, and Wiener, whom I consider the "founding fathers" of computers. How they visualized data back then is still the best way to explain it. I like "communication theory" or "information theory", because it is technical as well as philosophical. The best way to communicate is to minimize the loss of signal between each step.
  • #23: Let's rearrange the "communication" in a slightly different way.
  • #24: It's like magic: now everyone can see the angle. This is how you should think: you are sitting in the "answer" region of the training set, and you want to pass a message to the "answer" in the test set. So, the best strategy is to reduce noise while pushing as much signal as possible to the next step. Find any link that is losing too much signal, and fix that.
  • #25: The first rule of the competition is that "there are no rules". It gets fierce. 1. There is always some form of data leak; that path is the data-leak path, and what we are doing is "data-leak collecting". But sometimes too much data is leaked in a single variable (sometimes machine-measured data has least-significant digits acting as a hash value). 2. The leaderboard is on the test set; cross-validation is within the training set. If the training/test split is not even, trusting the wrong board doesn't get you anywhere. 3. If a bad error metric is chosen, one weird outlier can mess up the whole game, and it becomes about hunting out that one "error"; in this case it is more about debugging. 4. Sometimes, if you understand well how the error metric works, you can score high even without seeing the data. 5. Most algorithms are geared toward a compact matrix format, so if it is not a typical "toy dataset", they do not perform well. Narrow means there are only a few features; when not enough information is given about an object, you use strategies to maximize feature creation, such as one-hot encoding/pairing. Wide means there are lots of features, but you don't know which are important. Sparse means there are lots of 0s in the data. Dense means there is a lot of information in a small space.
  • #26: In my days as a malware analyst, I used to have a chart similar to this. BlackBox and WhiteBox analysis are mainly for extracting the existence of these functionalities. You extract all these "functionality" features and use logic and common sense to determine whether a file is bad or not. A skilled malware analyst can identify malware within about 10 minutes, and in those 10 minutes we look for "red flags"; there are probably about 30 to 50 things you look for. At one point, I tried to make a system that classifies malware using this method.
  • #27: And when a new file comes in, we use this logic to figure out what the unknown file is.
  • #28: I used to make a classification tool using the previous slide's method, but along the way I tried something different. In order to maximize the computer's ability, you need to think a little bit differently. The main strategy is "lots of tiny things". The human brain can't hold 20+ features in its head, so we simplify things into "chunks". Each functionality, such as network access, has "meaning", which means it has already been "chunked". A computer works before "chunking": machines don't have logic, only statistics. In that world, the best approach is lots of meaningless bits, but in very large numbers. Like file sizes (who remembers that stuff…), or the number of occurrences of ASCII 61. Computers can do that easily; what they excel at is doing lots of arithmetic, really, really fast.
  • #29: To make an AI, the human is a good benchmark, but not exactly the target. A good analogy is birds and airplanes: "flying" is what is important. You are trying to make a "flying machine"; it does not have to have "feathered, light-weight, folding wings".
  • #30: This is the solution given by the first-place team. This is how data scientists communicate with each other: with this single picture, everyone knows how to replicate it. Feature engineering: golden features. These are the features that, if missing, mean you cannot win the competition; or, if you are running out of time/computing power, just delving into these features is better than adding other noisy features. Opcode n-grams are important. Segment/section count is important (in real practice I don't think this would be useful, or it could be easily fooled, but for this competition it worked great). Asm pixel intensity is something security people would not think of, but those from "vision" would use it; we could use entropy or something similar, but we could not see a program as pixels. This is what is great about crowd-sourcing. Then, modeling: XGBoost, gradient boosting. It is basically building a tree, then building additional depth to minimize the error, then repeating (plus some other clever methods). Ensemble: basically a random forest. Separate the training dataset into pieces, and let them vote later.
  • #31: But to all the Kagglers, winning the competition is about feature engineering. Feature extraction: what does it mean? Think of features like senses; missing a feature is like missing a sense, like missing the smell of food. There are frameworks to try every algorithm at the same time, so it comes down to "conceptualizing the data", or the feature engineering, that decides the winner. Let's say you are collecting color: you don't have to know that "apple" is "red", but you need to know that "color" is important.
  • #32: 22 min mark. Feature extraction is like taking a snapshot. Missing a critical feature is like, for example, food: it looks good, but how does it smell? Is it hot or cold? Is it spicy or sweet?
  • #33: Many people think "data science" == "very difficult, uber-smart algorithms". For most competitions, the winning team has won using one of these 3 algorithm families, and the rest was feature engineering. You don't have to understand everything, but what is important is that you SHOULD KNOW how it works. It is similar to driving a car: you don't need to know exactly how every part of the car works, but you do need to understand how the internal combustion engine works, how the outside temperature affects it, and how the physics works, such as a large mass having more momentum than a smaller one. These days there are frameworks that apply every algorithm, so picking an algorithm matters less. These are 3 good algorithm families. XGBoost is a GBM (gradient boosted machine); boosting is a way of building trees to minimize misclassification. In the Kaggle world, the global top 10 have "god status", and a first-place winner has been saying "when in doubt, use XGBoost"; it is flexible. Deep Learning is the kind-of-new hot method, and for computer vision challenges DL has been sweeping the wins. For smaller data, or scientific data such as weather data where there is no fine-grained manipulation, SVR has been working well; if you are sure there is some "magic mathematical formula", as in physics, then SVR works well.
  • #35: 23 min mark. Read the quote. Netflix actually didn't put the winning solution into the production system. For crowd-sourcing, and for everything else in life, it is about taking that knowledge and making it better.
  • #36: This might be "backward reasoning": opcode sequences work well, and this must be the reason. Data science has no such thing; it works, and we are trying to conceptualize why it works. Then you use the same conceptualization for other problem sets. "Imagine you are a CPU." I don't think this is that difficult for this conference crowd.
  • #37: I like communication theory; I like linguistics. A 30-second intro to NLP: for human language, this is the very basic structure. In a sentence, the central piece of information is the verb: the action, what is happening. Thus, if you are looking for the "golden feature" in NLP, you look for verbs. And computers are no different. We are still under the "mental frameworks" of communication theory; humans and computers are both "communicating organisms".
  • #38: 30-second intro to NLP, again: in a sentence, the central piece of information is the verb, so the "golden feature" in NLP is the verb. Computers are no different; humans and computers are both "communicating organisms" under the mental framework of communication theory.
  • #39: XGBoost is awesome, but it is not a "solve it all". I don't like it that much, because a complicated tree is hard to interpret. Remember the 3 "usually winning" algorithms: XGBoost, Deep Learning, and SVR. SVR is parametric; don't use SVR for malware, it just does not work. XGBoost is non-parametric. Deep Learning is hard to pin down; it is a stacking of parametric components in a non-parametric way. XGBoost worked well for this competition because the problem set was easy and didn't contain junk: it classifies between malware families, with no "junk", "blah", or "shady", from about 30,000 files. In the real world, XGBoost might not be enough…
  • #40: …If you are familiar with the machine learning scene, deep learning has been the hottest topic.
  • #41: Then I came across the Nilsimsa hash. I didn't make it; it is of mysterious origin, and it looks remarkably similar to a DNN.
  • #42: It is a BEAST of an algorithm: a mix of modular arithmetic, frequency analysis, and efficient storage. You can view it in 4 large stages. I prefer this "intermediate storage" view, because it is easier to trace back.
  • #43: Sliding window. N-grams. Could do byte-grams. Feature hashing.
  • #44: A Bloom filter is already an awesome data structure. In engineering we usually talk about space and time trade-offs; a Bloom filter is a mix of space + time + allowable-error trade-offs. If it is there, the filter will say it is there; if it is not there, the filter could still say it is there. It is awesome. Look it up.
  • #45: A Counting Bloom Filter is a modification of the Bloom filter: it is a Bloom filter that can count past 1. Instead of yes or no (binary), we can store a quantity. N-gram + CBF is just a way of storing a LARGE feature space in the tightest possible space. In data science this is called "feature hashing", though usually without a Bloom filter; the Bloom filter can reduce the memory footprint significantly.
  • #46: We are turning back to a binary format here. This is the step where each count is compared against the average: each small snippet is compared against the whole file, mostly to prep it for the final stage of Hamming distance.
  • #47: And all you need to do is find the known file that is the lowest Hamming distance away. Hamming distance is "XOR bitwise, then count the 1 bits." It is much faster than edit distance.
  • #48: 26 min mark. We don't want to stop here. Just read it and go over each element in the next slides. (This is such a generic slide that it could fit anywhere…)
  • #49: We are not the only company that has been researching improvements to Nilsimsa; Trend Micro did this, and it's a very good improvement. Are there any other vendors looking into this? Please raise your hand. <then, explain what Trend did> Why this works well is… <next slide>
  • #51: You can view this as minimizing collisions. It would double the memory size, but at this point the memory is already tiny.
  • #52: If TLSH improves the "height" of the CBF, we could also look at the "width", earlier rather than later, since once info is lost you can't get it back. If 2 different elements fall into the same bucket in the CBF, that is where info-collision, or info-loss, is occurring. Basically, the CBF needs to be a large enough space to contain the information/entropy. So, how do you measure a good size?
  • #53: HyperLogLog is a cheap and fast way to estimate how much space you need. This is how Bitcoin/blockchain decide the "hash difficulty".
  • #54: A competition only cares about accuracy, but in the real world our industry has to deal with deliberate attack. A Bloom filter is a place where hash collisions are the norm.
  • #55: If dual-hashed, the memory requirement doubles; in security, if preventing attack is important, we can spend some space. Feeling paranoid? Do a triple hash. (It is dual hash, not double-hash == hash(hash(val)) as in blockchain.)
  • #56: There are more ways to improve. We can make our problem more difficult. Why? Because it is fun.
  • #57: The PE file has been transformed into an ASM file using a disassembler. This assumes we know the PE file structure well.
  • #58: Let's say some country or some organization made a totally different computer architecture. You know it is compiled code, or some sort of language; there is a compiler with a totally different opcode structure. We don't have a disassembler. What to do?
  • #59: Because what is important is the frequency of sequences of the frequent elements, we make our own ad-hoc disassembler. "Select-gram" is a word I made up; I don't know what it is called in the industry. Basically, take the most frequently occurring bytes: these will be the "popular opcodes". Any efficient language cannot escape this pattern; if it did, it would start to lose the notion of being a language.
  • #60: This is the last slide. If you have been dozing off, this is the single slide to take home. Data science will get big in the near future, and if you are stuck, you should come back to this framework. All you need to do is pass as much information as possible to the next step, without introducing too much noise in the link. And the rest will work out by itself.