A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS

Silvio Cesare, Deakin University

Introduction








Started off in industry (Qualys, now Volvent).
Have a Masters by Research.
About to receive a PhD from Deakin University.
Last 5 years in post-graduate University research.
Learnt some cool things along the way.

What did I do at University?


Malwise v1 (Masters)
  - Malware variant detection system.
Malwise v2
  - Improved version.
Clonewise
  - Automated detection of embedded libraries in source.
Simseer
  - Binary comparison and visualization service.
Simseer Cluster
  - Binary clustering service.
Simseer Search
  - Improved malware variant search service.
Bugalyze
  - Detection of bugs using data flow analysis.

Outline








Mathematical Objects
Comparing
Similarity Searching
Classification
Clustering
Program Analysis

An incomplete list of mathematical objects








Strings
Vectors
Sets
Sets of Objects
Trees
Graphs

Objects




Objects have different performance.
Example
  - Comparing two vectors is fairly fast.
  - Exact matching of two strings is fairly fast.
  - Inexact matching of two strings is of medium speed.
  - Comparing two graphs is slow.
(Figure: alignment of two strings, character by character; sequence alignment is O(mn).)

Transforming one object to another


Problem
  - Comparing two 100 KB strings using the edit distance is impractically slow.
  - Example of edit distance: ed("hello", "ggello") = 2
Solution
  - Transform the strings into vectors.
  - Then use a vector comparison, which is fast.
Examples
  - Comparing malware samples
  - Finding near duplicate web pages
  - Comparing E-Mails
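To see where the cost comes from, here is a minimal sketch of the textbook dynamic-programming edit distance (Levenshtein). It is not taken from Malwise; the function name is illustrative. The nested loops are the O(mn) work that makes two 100 KB strings expensive to compare.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance by dynamic programming, O(len(a) * len(b)) time."""
    # prev[j] is the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(edit_distance("hello", "ggello"))  # 2
```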

N-Grams







Extract all N-length substrings (N-Grams) from original string.
From training set of strings, choose best N-Grams.
Each unique N-Gram is an index in a vector.
The value of the element is the number of times it occurs.
Example with N = 4: the string W|IEH}R yields
  - W|IE
  - |IEH
  - IEH}
  - EH}R
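A minimal sketch of the string-to-vector transformation. The slide does not say how the "best" N-Grams are chosen, so this sketch assumes "most frequent in the training set"; the helper names are illustrative.

```python
from collections import Counter


def ngrams(s, n):
    """All substrings of length n, in order of appearance."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def choose_vocabulary(training, n, size):
    """Assumption: 'best' N-Grams are the `size` most frequent ones in training."""
    counts = Counter(g for s in training for g in ngrams(s, n))
    return [g for g, _ in counts.most_common(size)]


def to_vector(s, vocabulary, n):
    """One element per vocabulary N-Gram; the value is its number of occurrences in s."""
    counts = Counter(ngrams(s, n))
    return [counts[g] for g in vocabulary]


print(ngrams("W|IEH}R", 4))  # ['W|IE', '|IEH', 'IEH}', 'EH}R']
```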

Another N-Gram example





Extract N-Grams
Represent new object as a ‘Set of N-Grams’
Compare sets using set similarity metrics

A Graph problem










Graph problems like approximate similarity are slow to solve.
Decompose graph into subgraphs of at most k nodes.
Canonicalize small graphs, represent by adjacency matrix, transform to string.
Graph is now a ‘Set of Strings’.
Optionally represent as vector of ‘important k-subgraphs’.
Use vector distance metrics to compare, index, and search.
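A rough sketch of the decomposition step. It assumes each subgraph is the first k nodes reached by a breadth-first walk from a root, and it uses the walk order as a stand-in for canonicalization; the slides do not specify either choice, and the function name is illustrative.

```python
from collections import deque


def k_subgraph_strings(graph, k):
    """graph: {node: [successors]}. Returns a set of adjacency-matrix strings,
    one per root, for the subgraph of up to k nodes reached breadth-first."""
    strings = set()
    for root in graph:
        # Breadth-first walk from root, keeping the first k nodes visited.
        order, seen, queue = [], {root}, deque([root])
        while queue and len(order) < k:
            node = queue.popleft()
            order.append(node)
            for succ in graph.get(node, []):
                if succ not in seen:
                    seen.add(succ)
                    queue.append(succ)
        # Adjacency matrix restricted to visited nodes, rows/columns in walk order.
        index = {node: i for i, node in enumerate(order)}
        rows = []
        for node in order:
            row = ["0"] * len(order)
            for succ in graph.get(node, []):
                if succ in index:
                    row[index[succ]] = "1"
            rows.append("".join(row))
        strings.add("/".join(rows))
    return strings


cfg = {"a": ["b", "c"], "b": ["c"], "c": []}
print(k_subgraph_strings(cfg, 3))
```

The resulting set of strings can then be compared with set similarity measures, or counted into a vector.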

K-subgraph decomposition
(Figure: a control flow graph with nodes L_0 to L_7 and true-branch edges, decomposed into k-subgraphs. Each subgraph is shown with its adjacency matrix.)

Adjacency matrices of three 7-node subgraphs:

0101000
0000000
0000010
0010100
0000010
0000001
1001000

0001010
0000000
1000000
0000100
0010000
0101000
1000000

0000001
0000100
0000001
0010000
0001010
0010000
0100100

Graphs – Case Study






Implemented in Malwise and Simseer
Take control flow graphs of programs.
Decompile into strings.
One:
  - Consider program as a vector of N-Grams of decompiled strings.
Two:
  - Consider program as a set of strings.
(Figure: control flow graph with nodes L_0 to L_7 and true-branch edges, alongside the decompiled procedure.)

proc(){
L_0:
while (v1 || v2) {
L_1:
if (v3) {
L_2:
} else {
L_4:
}
L_5:
}
L_7:
return;
}

Final Remarks on Objects




Know how to represent your problem.
Look into how the representation can be approximated
  - By transforming it into another object
Vectors are often a good choice.

Comparing


Problem
  - Measure the similarity of (or distance between) two objects.
Solution
  - Represent objects mathematically.
  - Use a multitude of mathematical measures.
Examples
  - Malware similarity
  - Near duplicate web pages

Comparing Sets






A set is a collection of elements.
Given an equality function between elements, we can measure set similarity.
Inexact matching
  - Dice coefficient: $s = \frac{2\,|A \cap B|}{|A| + |B|}$
  - Jaccard index: $J(A,B) = \frac{|A \cap B|}{|A \cup B|}$
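Both set measures are a few lines of code. A minimal sketch; returning 1.0 for two empty sets is a convention assumed here, not something the slides state.

```python
def dice(a, b):
    """Dice coefficient: 2|A ∩ B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0  # assumed convention for two empty sets
    return 2 * len(a & b) / (len(a) + len(b))


def jaccard(a, b):
    """Jaccard index: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0  # assumed convention for two empty sets
    return len(a & b) / len(a | b)


A = {"W|IE", "|IEH", "IEH}"}
B = {"|IEH", "IEH}", "EH}R"}
print(jaccard(A, B))  # 0.5
print(dice(A, B))
```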

Comparing Vectors – Ugh, math.

Euclidean Distance
  $d(p,q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$

Manhattan Distance
  $d(p,q) = \sum_{i=1}^{n} |q_i - p_i|$

Cosine Similarity
  $\text{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$
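The three vector measures on this slide, as a plain-Python sketch. It assumes both vectors have the same length and, for cosine similarity, that neither is all zeros.

```python
import math


def euclidean(p, q):
    """Square root of the sum of squared element differences."""
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))


def manhattan(p, q):
    """Sum of absolute element differences."""
    return sum(abs(qi - pi) for pi, qi in zip(p, q))


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)  # raises ZeroDivisionError for a zero vector


print(euclidean([0, 0], [3, 4]))  # 5.0
print(manhattan([0, 0], [3, 4]))  # 7
```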

Vector distance – a different look




A vector is an n-dimensional point in space.
E.g., a 2-d vector is <x,y>

Cosine similarity






Line from origin to n-dimensional point.
Given 2 lines, what’s the angle (theta) between
them?
The smaller the angle, the more similar.
(Figure: lines from the origin to Point A and Point B, with the angle Theta between them.)

Comparing Vectors – Case Study


Malwise v2
  - Feature vector of N-Grams of decompiled flowgraphs
  - Manhattan Distance
Simseer Search
  - Same feature vector
  - Euclidean Distance

Comparing Sets – Case Study








Malwise v1
An element is a graph invariant of the control flow
graph, represented as an integer.
A program is a set of integers.
Compare similarity between two programs using
Dice coefficient.

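Malwise v1 - Comparing Sets

(Figure: control flow graph with nodes 1 to 4 and true/false edges.)
Edges out of each node, in node order:
(1 -> 2), (1 -> 4)
(2 -> 3), ()
(), ()
(4 -> 3), ()

$s(A,B) = \frac{2 \sum_i w_i \times |A_i \cap B_i|}{\sum_i w_i \times |A_i| + \sum_i w_i \times |B_i|}$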

Comparing Sets of Strings in Malwise v2 – Case Study

String is a decompiled flowgraph.
Program is a set of strings.
Edit distance between strings.
Construct 1:1 mapping between elements of sets:
  - Such that the sum of distances is minimized.
Solved using ‘combinatorial optimisation’
  - Assignment Problem
  - Solution by “graph matching”
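A brute-force sketch of the assignment step, for illustration only. It tries every one-to-one mapping of the smaller set into the larger one, so it is only usable on tiny sets; a real system would use a polynomial-time assignment algorithm such as the Hungarian method. Ignoring unmatched strings is an assumption of this sketch, and the function names are illustrative.

```python
from itertools import permutations


def edit_distance(a, b):
    """Levenshtein distance by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def min_cost_matching(p, q):
    """Return (total distance, pairs) for the cheapest 1:1 mapping of the
    smaller list into the larger. Unmatched strings are ignored (assumption)."""
    small, large = (p, q) if len(p) <= len(q) else (q, p)
    best_cost, best_pairs = None, []
    for chosen in permutations(large, len(small)):
        cost = sum(edit_distance(s, t) for s, t in zip(small, chosen))
        if best_cost is None or cost < best_cost:
            best_cost, best_pairs = cost, list(zip(small, chosen))
    return best_cost, best_pairs


program_1 = ["BR", "BW|{B}BR", "BI{B}BR", "BSSR", "BSR", "BSSSR"]
program_2 = ["BR", "BW|{B}BR", "BSSR"]
print(min_cost_matching(program_1, program_2)[0])  # 0
```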

Malwise v2 - Comparing Sets of Strings

(Figure: control flow graph with nodes L_0 to L_7 and true-branch edges, alongside the decompiled procedure.)
proc(){
L_0:
while (v1 || v2) {
L_1:
if (v3) {
L_2:
} else {
L_4:
}
L_5:
}
L_7:
return;
}

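String form of the procedure above: W|IEH}R

(Figure: two sets of strings joined by a one-to-one matching. A matched pair of strings p, q is labelled d=ed(p,q).)
First set:
  - BR
  - BW|{B}BR
  - BI{B}BR
  - BSSR
  - BSR
  - BSSSR
Second set:
  - BR
  - BW|{B}BR
  - BSSR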

Final Remarks on Comparing




Inexact matching is your friend.
Try to use known distance metrics.
  - They have useful properties and index better.
If it’s too slow to compare, transform the object.

Similarity Searching


Problem
  - Find all ‘similar’ objects to my query in a database
Example
  - Find all words in a dictionary with at most 3 differences to my query word.
This problem is known as a ‘similarity search’
Solution
  - Naive exhaustive search.
  - Better to use ‘Metric Trees’

Similarity Search Constraints


Variations
  - K-nearest neighbours: the k closest objects to the query.
  - All objects within a specific distance to the query.
Search based on using a ‘metric distance’.
Metric distances satisfy mathematical properties.
Examples
  - Euclidean Distance
  - Jaccard Distance
  - Cosine Distance is not metric

Searching – Case Study


Malwise v2
  - Distance metric is Manhattan Distance.
  - Use VP-Trees to index and search in stage 1.
  - Use DBM-Trees to index and search in stage 2.
  - Implemented using open source GBDI Arboretum library.
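To show why a metric distance matters for indexing, here is a small vantage-point tree sketch in plain Python. It does not use or mirror the Arboretum API; the class and method names are illustrative. The pruning in `range_search` relies on the triangle inequality, which is why the distance has to be a metric.

```python
import random


def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


class VPTree:
    def __init__(self, points, dist, rng=None):
        self.dist = dist
        self.root = self._build(list(points), rng or random.Random(0))

    def _build(self, points, rng):
        if not points:
            return None
        vantage = points.pop(rng.randrange(len(points)))
        if not points:
            return (vantage, 0, None, None)
        dists = [self.dist(vantage, p) for p in points]
        mu = sorted(dists)[len(dists) // 2]  # median distance to the vantage point
        inner = [p for p, d in zip(points, dists) if d < mu]
        outer = [p for p, d in zip(points, dists) if d >= mu]
        return (vantage, mu, self._build(inner, rng), self._build(outer, rng))

    def range_search(self, query, radius):
        """All indexed points within `radius` of `query`."""
        results, stack = [], [self.root]
        while stack:
            node = stack.pop()
            if node is None:
                continue
            vantage, mu, inner, outer = node
            d = self.dist(query, vantage)
            if d <= radius:
                results.append(vantage)
            # Triangle inequality: visit a side only if the query ball can reach it.
            if d - radius < mu:
                stack.append(inner)
            if d + radius >= mu:
                stack.append(outer)
        return results


points = [(x, y) for x in range(6) for y in range(6)]
tree = VPTree(points, manhattan)
print(sorted(tree.range_search((2, 2), 1)))
```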
(Figure: a query q with search radius r and distance d(p,q) to an object p. Labels: Query Benign, Query Malicious, Query, Malware.)

Final Remarks on Searching





Searching for inexact matches is useful.
Use good distance metrics.
Use open source libraries.

Classification


The problem:
  - Given a set of N classes.
  - And a query object.
  - Assign one of the classes to the object.
(Figure: objects grouped into Class A and Class B.)
Examples
  - Is this binary (malicious, not malicious)?
  - Is this Gmail email (primary, social, promotional)?
  - Is this web page (defaced, not defaced)?

Classification Methodology


Supervised Learning
  - Given a training set of objects labelled by their class.
  - Build a model.
  - Then use the model to classify unknown objects.
Unsupervised Learning
  - No labelled data exists.
  - “Cluster” objects into classes.
  - Use clusters to train model.
  - Then classify as per normal.
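The supervised workflow can be sketched with about the simplest possible model, a 1-nearest-neighbour classifier, where the labelled training set itself is the model. The two features (binary size in KB, number of LoadLibraryA calls) and all values are made up for illustration.

```python
def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


def classify_1nn(training, query):
    """training: list of (feature_vector, label) pairs.
    Returns the label of the training vector closest to the query."""
    _, label = min(training, key=lambda item: manhattan(item[0], query))
    return label


# Hypothetical labelled data: [size in KB, LoadLibraryA calls]
training = [
    ([10, 0], "not malicious"),
    ([12, 1], "not malicious"),
    ([300, 40], "malicious"),
]
print(classify_1nn(training, [290, 35]))  # malicious
```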

Classification – What do I have to do?








Represent objects using “feature vectors”
A vector is an array.
Each element represents a “feature”.
The value of the element tends to be a count of something, or a size.
Feature examples
  - The number of times a dictionary word such as “Hello” appears in an E-Mail.
  - The size of a binary.
  - The number of times LoadLibraryA is executed.

Classification – WEKA?







Put the feature vectors into the text-based ARFF file
format.
Plug into the WEKA machine learning toolkit.
Experiment with different classifiers.
Part of your labelled data can be used to evaluate
the accuracy.

Weka ARFF file
@RELATION iris
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,?
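Writing feature vectors out as ARFF is mostly string formatting. A minimal sketch with a hypothetical helper `to_arff`; it assumes all features are numeric, that names need no quoting, and it writes `?` for an unknown class as in the last data row above.

```python
def to_arff(relation, attributes, classes, rows):
    """rows: list of (feature_values, label); label None means unknown class."""
    lines = ["@RELATION " + relation]
    lines += ["@ATTRIBUTE %s NUMERIC" % name for name in attributes]
    lines.append("@ATTRIBUTE class {%s}" % ",".join(classes))
    lines.append("@DATA")
    for features, label in rows:
        values = [str(v) for v in features]
        values.append(label if label is not None else "?")
        lines.append(",".join(values))
    return "\n".join(lines) + "\n"


print(to_arff(
    "iris",
    ["sepallength", "sepalwidth", "petallength", "petalwidth"],
    ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
    [([5.1, 3.5, 1.4, 0.2], "Iris-setosa"), ([5.0, 3.6, 1.4, 0.2], None)],
))
```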

WEKA

(Figure: WEKA screenshot. Credit: University of Waikato, 10/25/2013.)

Classification – Case Study


Clonewise
  - Feature vector is set of features extracted from a pair of packages.
  - Classify: do these packages share code (yes, no)?
  - Classify: is the 1st package embedded in the 2nd package (yes, no)?

Final Remarks on Classification





Lots of problems can be framed as classification.
Learn how to use WEKA.
Vectors are very good representations.

Clustering


Problem
  - To group together “similar” objects under some notion of similarity.
Easy solution
  - Represent objects using “feature vectors”.
  - Plug into WEKA.
(Figure: clusters of packages in Fedora Linux.)

Clustering - Case Study


Simseer Cluster
  - Represent binaries using N-Grams of decompiled flowgraphs.
  - Use most frequent N-Grams as features.
  - Distance measure is cosine distance.
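A toy clustering sketch using cosine distance. The slides do not say which clustering algorithm Simseer Cluster uses; this uses simple "leader" clustering with an arbitrary threshold, purely to show how a distance measure drives the grouping.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def leader_cluster(vectors, threshold):
    """Each vector joins the first cluster whose leader (first member) is
    within `threshold`; otherwise it starts a new cluster."""
    clusters = []
    for v in vectors:
        for cluster in clusters:
            if cosine_distance(cluster[0], v) <= threshold:
                cluster.append(v)
                break
        else:
            clusters.append([v])
    return clusters


vectors = [[1, 0], [2, 0.1], [0, 1], [0.1, 3]]
print(leader_cluster(vectors, 0.1))
```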

Final Remarks on Clustering




A classic machine learning problem.
Again, learn to use WEKA.

Program Analysis





An incredibly large and deep field.
This section skims the surface.
Main approaches
  - Model Checking
  - Theorem Proving
  - Abstract Interpretation
  - Data Flow Analysis

Model Checking





Looks at program states generated by a program.
Some states indicate bugs.
Try BLAST, a model checker for small C programs.
  - Caveat: it’s pretty old now.

Theorem Proving - SMT


SMT – what is it?
  - An equation solver that covers the types of operations seen in machine code.
Approach for Bug Detection
  - User input can generally be anything, so treat it as a “symbolic” variable.
  - The rest is concrete.
  - Simulate execution of the program, plugging all the machine code that is executed into the solver formulas.
Concolic execution
  - Combining symbolic execution with concrete execution.

Concolic Execution







At branches, can we have user input that forces us to go down each path?
Use the SMT solver to tell us.
Launch execution down ‘feasible’ paths.
Use the solver to tell us if bugs are present.
  - What user input, if any, can make this pointer NULL?
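A toy version of the concolic loop. To stay self-contained it does not use a real SMT solver: `solve` simply enumerates all values of a one-byte input, which stands in for the solver query. The target program and all names are hypothetical.

```python
def program(x, trace):
    """Toy target: reaches 'bug' on one narrow path. Every branch is logged
    as (predicate, outcome) so the path can be replayed as constraints."""
    def branch(pred):
        taken = pred(x)
        trace.append((pred, taken))
        return taken

    if branch(lambda v: v > 100):
        if branch(lambda v: v % 7 == 5):
            return "bug"
    return "ok"


def solve(constraints, domain=range(256)):
    """Stand-in for an SMT solver: any input satisfying all (predicate, wanted) pairs."""
    for candidate in domain:
        if all(pred(candidate) == wanted for pred, wanted in constraints):
            return candidate
    return None  # path is infeasible


def explore(seed):
    worklist, seen_paths, findings = [seed], set(), []
    while worklist:
        x = worklist.pop()
        trace = []
        result = program(x, trace)          # concrete execution
        path = tuple(taken for _, taken in trace)
        if path in seen_paths:
            continue
        seen_paths.add(path)
        if result == "bug":
            findings.append(x)
        # Keep each path prefix, flip the branch after it, ask for a new input.
        for i in range(len(trace)):
            flipped = trace[:i] + [(trace[i][0], not trace[i][1])]
            new_input = solve(flipped)
            if new_input is not None:
                worklist.append(new_input)
    return findings


print(explore(0))  # [103]
```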

Concolic path-sensitive analysis
(Figure: control flow graph of main; each basic block is listed under its number in the figure.)
Block 1
    lea    0x4(%esp),%ecx
    and    $0xfffffff0,%esp
    pushl  -0x4(%ecx)
    push   %ebp
    mov    %esp,%ebp
    push   %ecx
    sub    $0x24,%esp
    call   4011b0 <___main>
    movl   $0x0,-0x8(%ebp)
    jmp    40115f <_main+0x2f>
Block 3
    movl   $0x4020a0,(%esp)
    call   4011b8 <_puts>
    addl   $0x1,-0x8(%ebp)
Block 2
    cmpl   $0x9,-0x8(%ebp)
    jle    40114f <_main+0x1f>
Block 4
    add    $0x24,%esp
    pop    %ecx
    pop    %ebp
    lea    -0x4(%ecx),%esp
    ret

Abstract Interpretation




Abstract the execution of the program.
Example
  - Only consider the sign of a variable, not the actual value.
Requires a transfer function
  - What an instruction does to the abstract data.
And a Join/Meet function
  - How data is combined when it meets from different control flow.
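The sign example can be sketched directly. The abstract values, function names and the two supported operations are choices made for this sketch, not taken from any particular tool.

```python
NEG, ZERO, POS, TOP, BOT = "-", "0", "+", "top", "bot"  # top = unknown sign, bot = no value yet


def abstract_of(n):
    """Abstraction of a concrete integer: keep only its sign."""
    return POS if n > 0 else NEG if n < 0 else ZERO


def join(a, b):
    """Combine facts arriving from two control flow paths."""
    if a == BOT:
        return b
    if b == BOT:
        return a
    return a if a == b else TOP


def mul(a, b):
    """Transfer function for multiplication on signs."""
    if BOT in (a, b):
        return BOT
    if ZERO in (a, b):
        return ZERO
    if TOP in (a, b):
        return TOP
    return POS if a == b else NEG


def add(a, b):
    """Transfer function for addition on signs."""
    if BOT in (a, b):
        return BOT
    if a == ZERO:
        return b
    if b == ZERO:
        return a
    if TOP in (a, b):
        return TOP
    return a if a == b else TOP  # positive + negative could be anything


# x is 5 on one branch and 3 on the other; after the branches meet:
x = join(abstract_of(5), abstract_of(3))
print(x, mul(x, x), add(x, abstract_of(-2)))  # + + top
```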

Data Flow Analysis


Similar to abstract interpretation.
  - Uses a transfer function, a join.
  - Implement both using a monotone framework.
Data Flow analysis is used by compilers.
Classic data flow problems
  - The reach of defining or assigning to a variable.
  - Knowing if a variable will be read again before being assigned a new value.
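The second classic problem (will a variable be read again before it is reassigned?) is live variable analysis. A minimal sketch that iterates to a fixed point; the block names and the tiny loop-shaped control flow graph are made up for illustration.

```python
def liveness(blocks, successors):
    """blocks: {name: (use, define)} where `use` holds variables read before
    being written in the block and `define` holds variables written.
    successors: {name: [successor names]}. Returns (live_in, live_out)."""
    live_in = {b: set() for b in blocks}
    live_out = {b: set() for b in blocks}
    changed = True
    while changed:  # repeat until no set grows any further
        changed = False
        for b, (use, define) in blocks.items():
            # Join: live out of b if live into any successor.
            out = set().union(*(live_in[s] for s in successors.get(b, [])))
            # Transfer: read in b, or live afterwards and not overwritten by b.
            new_in = use | (out - define)
            if out != live_out[b] or new_in != live_in[b]:
                live_out[b], live_in[b] = out, new_in
                changed = True
    return live_in, live_out


# i = 0; while (i <= 9) { puts(...); i += 1; } return;
blocks = {
    "entry": (set(), {"i"}),
    "cond": ({"i"}, set()),
    "body": ({"i"}, {"i"}),
    "exit": (set(), set()),
}
successors = {"entry": ["cond"], "cond": ["body", "exit"], "body": ["cond"]}
live_in, live_out = liveness(blocks, successors)
print(live_out["entry"])  # {'i'}
```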

Data Flow Analysis – Case Study




Implemented in Bugalyze.
Example bug detection
  - In free(ptr), where is ptr used before it is reassigned, and is it used in a free?
Has found real bugs in Debian Linux.
Still a work-in-progress.

Final Remarks on Program Analysis





A wide and deep field.
Good to know the basic approaches.
Reversing is becoming more rigorous (think Hex-Rays).

Conclusion






Academia has some useful techniques.
It’s good to know some of the basic methods.
Will improve industrial programs.
Any questions?
