A WHIRLWIND TOUR
OF ACADEMIC TECHNIQUES
FOR REAL-WORLD SECURITY RESEARCHERS
Silvio Cesare, Deakin University
Introduction








Started off in industry (Qualys, now Volvent).
Have a Masters by Research.
About to receive a PhD from Deakin University.
Last 5 years in post-graduate University research.
Learnt some cool things along the way.
What did I do at University?
Malwise v1 (Masters)
  Malware variant detection system.
Malwise v2
  Improved version.
Simseer
  Binary comparison and visualization service.
Simseer Search
  Improved malware variant search service.
Simseer Cluster
  Binary clustering service.
Clonewise
  Automated detection of embedded libraries in source.
Bugalyze
  Detection of bugs using data flow analysis.
Outline








Mathematical Objects
Comparing
Similarity Searching
Classification
Clustering
Program Analysis
An incomplete list of mathematical objects








Strings
Vectors
Sets
Sets of Objects
Trees
Graphs
Objects




Different kinds of objects cost different amounts to compare.
Example
  Comparing two vectors is fairly fast.
  Exact matching of two strings is fairly fast.
  Inexact matching of two strings is moderately slow.
  Comparing two graphs is slow.
[Figure: sequence alignment of two strings of lengths m and n, which takes O(mn) time.]
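The O(mn) cost of inexact string matching comes from filling a table with one cell per pair of string positions. A minimal sketch of that computation, assuming the Levenshtein edit distance (the function name `edit_distance` is chosen here for illustration):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance by dynamic programming, O(len(a) * len(b)) time."""
    # prev[j] is the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or match for free
            ))
        prev = curr
    return prev[-1]


print(edit_distance("hello", "ggello"))
```

Only two rows of the table are kept at a time, which saves memory but not time: every cell is still visited once.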
Transforming one object to another


Problem
  Comparing two 100 KB strings using the edit distance is impractically slow.
  For example, ed(“hello”, “ggello”) = 2.
Solution
  Transform the strings into vectors.
  Then use a vector comparison, which is fast.
Examples
  Comparing malware samples
  Finding near-duplicate web pages
  Comparing e-mails
N-Grams







Extract all N-length substrings (N-Grams) from the original string.
From a training set of strings, choose the best N-Grams.
Each unique N-Gram is an index in a vector.
The value of the element is the number of times the N-Gram occurs.
Example: the 4-Grams of W|IEH}R are W|IE, |IEH, IEH} and EH}R.
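The N-Gram steps can be sketched in a few lines. This is an illustration only: the `vocabulary` list stands in for the "best" N-Grams chosen from a training set, and its last entry is a made-up N-Gram that does not occur in the example string.

```python
from collections import Counter


def ngrams(s, n):
    """All length-n substrings of s, in order of appearance."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def ngram_vector(s, n, vocabulary):
    """Count how often each vocabulary N-Gram occurs in s."""
    counts = Counter(ngrams(s, n))
    return [counts[g] for g in vocabulary]


# Hypothetical vocabulary; in practice it is selected from training data.
vocabulary = ["W|IE", "|IEH", "IEH}", "EH}R", "BSSR"]

print(ngrams("W|IEH}R", 4))
print(ngram_vector("W|IEH}R", 4, vocabulary))
```

Every string maps to a vector of the same length, so two strings of any size can then be compared with a cheap vector distance.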
Another N-Gram example





Extract N-Grams
Represent new object as a ‘Set of N-Grams’
Compare sets using set similarity metrics
A Graph problem










Graph problems like approximate similarity are slow to solve.
Decompose the graph into subgraphs of at most k nodes.
Canonicalize the small graphs, represent each by its adjacency matrix, and transform it to a string.
The graph is now a ‘Set of Strings’.
Optionally represent it as a vector of ‘important k-subgraphs’.
Use vector distance metrics to compare, index, and search.
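One way to picture the decomposition is the sketch below. It makes two assumptions that the slides do not specify: each k-subgraph is grown breadth-first from a start node, and the canonical form is the smallest adjacency-matrix string over all node orderings (exponential in k, so only sensible for small k). All function names are illustrative.

```python
from collections import deque
from itertools import permutations


def k_subgraph(graph, start, k):
    """Collect up to k nodes reachable from start, in breadth-first order."""
    seen, queue = [start], deque([start])
    while queue and len(seen) < k:
        node = queue.popleft()
        for succ in graph.get(node, []):
            if succ not in seen and len(seen) < k:
                seen.append(succ)
                queue.append(succ)
    return seen


def canonical_string(graph, nodes):
    """Smallest row-major adjacency-matrix string over all node orderings."""
    best = None
    for order in permutations(nodes):
        bits = "".join(
            "1" if dst in graph.get(src, []) else "0"
            for src in order
            for dst in order
        )
        if best is None or bits < best:
            best = bits
    return best


def decompose(graph, k):
    """Represent a directed graph as a set of canonical k-subgraph strings."""
    return {canonical_string(graph, k_subgraph(graph, n, k)) for n in graph}


# Two graphs with the same shape but different node names.
g1 = {"a": ["b", "c"], "b": ["c"], "c": []}
g2 = {"x": ["z", "y"], "y": [], "z": ["y"]}
print(decompose(g1, 3) == decompose(g2, 3))
```

Because the strings do not depend on node names, isomorphic subgraphs produce identical strings, and the resulting sets can be compared with ordinary set or vector measures.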
K-subgraph decomposition
[Figure: a control flow graph with nodes L_0 to L_7 is decomposed into k-subgraphs; each subgraph is shown with its 7x7 adjacency matrix.]
Graphs – Case Study






Implemented in Malwise and Simseer
Take control flow graphs of programs.
Decompile into strings.
One:
  Consider the program as a vector of N-Grams of decompiled strings.
Two:
  Consider the program as a set of strings.
[Figure: control flow graph with nodes L_0 to L_7 and true-branch edges, alongside its decompiled form:]
proc(){
L_0:
while (v1 || v2) {
L_1:
if (v3) {
L_2:
} else {
L_4:
}
L_5:
}
L_7:
return;
}
Final Remarks on Objects




Know how to represent your problem.
Look into how the representation can be approximated
  By transforming it into another object

Vectors are often a good choice.
Comparing


Problem
  Measure the similarity (or distance) between two objects.
Solution
  Represent objects mathematically.
  Use one of the many mathematical measures.
Examples
  Malware similarity
  Near-duplicate web pages
Comparing Sets






A set is a collection of elements.
Given an equality function between elements, we
can measure set similarity.
Inexact matching
Dice coefficient
  $s = \frac{2|A \cap B|}{|A| + |B|}$
Jaccard index
  $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$
Comparing Vectors – Ugh, math.


Euclidean Distance
  $d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$
Manhattan Distance
  $d(p, q) = \sum_{i=1}^{n} |q_i - p_i|$
Cosine Similarity
  $\text{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$
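The three vector measures translate directly into code. A minimal sketch using only the standard library; it assumes both vectors have the same length and, for cosine similarity, that neither is the zero vector.

```python
import math


def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))


def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(qi - pi) for pi, qi in zip(p, q))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b (both assumed non-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(euclidean([0, 0], [3, 4]))
print(manhattan([0, 0], [3, 4]))
print(cosine_similarity([1, 0], [0, 1]))
```

Euclidean and Manhattan are distances (smaller means more similar), while cosine is a similarity (larger means more similar), so the direction of comparison flips between them.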
Vector distance – a different look




A vector is an n-dimensional point in space.
E.g., a 2-d vector is <x,y>
Cosine similarity






Line from origin to n-dimensional point.
Given 2 lines, what’s the angle (theta) between
them?
The smaller the angle, the more similar.
Point A

Point B

Theta
Comparing Vectors – Case Study


Malwise v2
  Feature vector of N-Grams of decompiled flowgraphs
  Manhattan Distance
Simseer Search
  Same feature vector
  Euclidean Distance
Comparing Sets – Case Study








Malwise v1
An element is a graph invariant of the control flow
graph, represented as an integer.
A program is a set of integers.
Compare similarity between two programs using
Dice coefficient.
Malwise v1 - Comparing Sets

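[Figure: a four-node control flow graph with true/false branch edges and its edge list: (1 -> 2), (1 -> 4); (2 -> 3), (); (), (); (4 -> 3), ().]
$s(A, B) = \frac{2 \sum_i w_i \times |A_i \cap B_i|}{\sum_i w_i \times |A_i| + \sum_i w_i \times |B_i|}$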
Comparing Sets of Strings in Malwise v2 – Case Study






String is a decompiled flowgraph.
Program is a set of strings.
Edit distance between strings.
Construct a 1:1 mapping between elements of the sets:
  Such that the sum of distances is minimized.
Solved using ‘combinatorial optimisation’
  Assignment Problem
  Solution by “graph matching”
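To make the assignment problem concrete, the sketch below finds the cheapest 1:1 mapping by trying every permutation. This brute force is a stand-in chosen for brevity and is only usable for tiny sets; it is not the graph-matching solver the slides refer to. It also assumes both sets have the same size, and the function names are illustrative.

```python
from itertools import permutations


def edit_distance(a, b):
    """Levenshtein distance by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def min_cost_mapping(xs, ys):
    """Return (total cost, pairs) of the cheapest 1:1 mapping from xs to ys."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes sets of equal size")
    best_cost, best_pairs = None, None
    for perm in permutations(ys):
        cost = sum(edit_distance(x, y) for x, y in zip(xs, perm))
        if best_cost is None or cost < best_cost:
            best_cost, best_pairs = cost, list(zip(xs, perm))
    return best_cost, best_pairs


print(min_cost_mapping(["BR", "BSR"], ["BSSR", "BR"]))
```

The total cost of the best mapping serves as the distance between the two programs: identical sets give 0, and each differing string adds its edit distance.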
Malwise v2 - Comparing Sets of Strings
[Figure: control flow graph of proc() with nodes L_0 to L_7 and true-branch edges.]
proc(){
L_0:
while (v1 || v2) {
L_1:
if (v3) {
L_2:
} else {
L_4:
}
L_5:
}
L_7:
return;
}

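Decompiled string for proc(): W|IEH}R
Strings of the first program: BR, BW|{B}BR, BI{B}BR, BSSR, BSR, BSSSR
Strings of the second program: BR, BW|{B}BR, BSSR
Distance between a pair of strings p and q: d = ed(p, q)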
Final Remarks on Comparing




Inexact matching is your friend.
Try to use known distance metrics.
  They have useful properties and index better.

If it’s too slow to compare, transform the object.
Similarity Searching


Problem
  Find all objects ‘similar’ to my query in a database.
Example
  Find all words in a dictionary with at most 3 differences from my query word.
This problem is known as a ‘similarity search’.
Solution
  Naive exhaustive search.
  Better to use ‘Metric Trees’.
Similarity Search Constraints


Variations
  K-nearest neighbours: the k closest objects to the query.
  Range query: all objects within a specific distance of the query.
Search is based on a ‘metric distance’.
Metric distances satisfy mathematical properties such as symmetry and the triangle inequality.
Examples
  Euclidean Distance
  Jaccard Distance
  Cosine Distance is not a metric
Searching – Case Study


Malwise v2
  Distance metric is Manhattan Distance.
  Use VP-Trees to index and search in stage 1.
  Use DBM-Trees to index and search in stage 2.
  Implemented using the open source GBDI Arboretum library.
[Figure: range search of radius r around a query q, where d(p,q) is the distance to a database object p; labels: Query Benign, Query Malicious, Query, Malware.]
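A vantage-point tree shows why a metric distance helps searching: the triangle inequality lets whole subtrees be skipped. The sketch below is a small illustrative implementation written for these notes, not the GBDI Arboretum API; the names `Node`, `build` and `range_search` are assumptions.

```python
import random


def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


class Node:
    def __init__(self, point, radius, inside, outside):
        self.point = point      # vantage point stored at this node
        self.radius = radius    # median distance from the vantage point
        self.inside = inside    # subtree of points with distance <= radius
        self.outside = outside  # subtree of points with distance > radius


def build(points, dist, rng=None):
    """Recursively split points around a randomly chosen vantage point."""
    rng = rng or random.Random(0)
    points = list(points)
    if not points:
        return None
    vantage = points.pop(rng.randrange(len(points)))
    if not points:
        return Node(vantage, 0, None, None)
    distances = [dist(vantage, p) for p in points]
    radius = sorted(distances)[len(distances) // 2]
    inside = [p for p, d in zip(points, distances) if d <= radius]
    outside = [p for p, d in zip(points, distances) if d > radius]
    return Node(vantage, radius, build(inside, dist, rng), build(outside, dist, rng))


def range_search(node, query, r, dist):
    """Return all stored points within distance r of query."""
    if node is None:
        return []
    d = dist(query, node.point)
    found = [node.point] if d <= r else []
    if d - r <= node.radius:  # the query ball may reach inside the sphere
        found += range_search(node.inside, query, r, dist)
    if d + r > node.radius:   # the query ball may reach outside the sphere
        found += range_search(node.outside, query, r, dist)
    return found


points = [(x, y) for x in range(5) for y in range(5)]
tree = build(points, manhattan)
print(sorted(range_search(tree, (2, 2), 1, manhattan)))
```

The two pruning tests are only valid because Manhattan distance obeys the triangle inequality; with a non-metric measure the search could miss matches.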
Final Remarks on Searching





Searching for inexact matches is useful.
Use good distance metrics.
Use open source libraries.
Classification


The problem:
  Given a set of N classes.
  And a query object.
  Assign one of the classes to the object.
[Figure: objects separated into Class A and Class B.]
Examples
  Is this binary (malicious, not malicious)?
  Is this Gmail email (primary, social, promotional)?
  Is this web page (defaced, not defaced)?
Classification Methodology


Supervised Learning
  Given a training set of objects labelled by their class.
  Build a model.
  Then use the model to classify unknown objects.
Unsupervised Learning
  No labelled data exists.
  “Cluster” objects into classes.
  Use the clusters to train a model.
  Then classify as normal.
Classification – What do I have to do?








Represent objects using “feature vectors”
A vector is an array.
Each element represents a “feature”.
The value of the element tends to be a count of
something, or a size.
Feature examples
  The number of times a dictionary word such as “Hello” appears in an e-mail.
  The size of a binary.
  The number of times LoadLibraryA is executed.
Classification – WEKA?







Put the feature vectors into the text-based ARFF file
format.
Plug into the WEKA machine learning toolkit.
Experiment with different classifiers.
Part of your labelled data can be used to evaluate
the accuracy.
Weka ARFF file
@RELATION iris
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,?
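Producing an ARFF file from your own feature vectors needs nothing more than string formatting. The helper below is a hypothetical sketch covering only the subset of the format shown above (numeric attributes, one nominal class attribute, and `?` for an unknown value); the relation and attribute names in the example are made up.

```python
def to_arff(relation, attributes, rows):
    """Render feature vectors as ARFF text.

    attributes is a list of (name, kind) pairs, where kind is either a type
    name such as "NUMERIC" or a list of nominal labels.
    """
    lines = [f"@RELATION {relation}"]
    for name, kind in attributes:
        if isinstance(kind, str):
            lines.append(f"@ATTRIBUTE {name} {kind}")
        else:
            lines.append(f"@ATTRIBUTE {name} {{{','.join(kind)}}}")
    lines.append("@DATA")
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


# Hypothetical malware features: file size and number of LoadLibraryA calls.
text = to_arff(
    "binaries",
    [("file_size", "NUMERIC"),
     ("loadlibrary_calls", "NUMERIC"),
     ("class", ["malicious", "benign"])],
    [(10240, 7, "malicious"), (20480, 0, "benign"), (4096, 3, "?")],
)
print(text)
```

The last row uses `?` for the class, mirroring the final iris row above: it is the object whose class is left for the classifier to decide.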
WEKA
[Figure: screenshot of the WEKA toolkit (University of Waikato).]
Classification – Case Study


Clonewise
  Feature vector is a set of features extracted from a pair of packages.
  Classify: do these packages share code (yes, no)?
  Classify: is the 1st package embedded in the 2nd package (yes, no)?
Final Remarks on Classification





Lots of problems can be framed as classification.
Learn how to use WEKA.
Vectors are very good representations.
Clustering


Problem
  To group together “similar” objects under some notion of similarity.
Easy solution
  Represent objects using “feature vectors”.
  Plug into WEKA.
[Figure: clustering of packages in Fedora Linux.]
Clustering - Case Study


Simseer Cluster
  Represent binaries using N-Grams of decompiled flowgraphs.
  Use the most frequent N-Grams as features.
  Distance measure is cosine distance.
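To show how a distance measure drives clustering, here is a deliberately simple greedy scheme: each vector joins the first cluster whose representative is within a threshold, otherwise it starts a new cluster. This is an illustration only and is not claimed to be the algorithm Simseer Cluster uses; the threshold and the feature vectors are made up.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; both vectors are assumed to be non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def cluster(vectors, threshold):
    """Greedy clustering; returns lists of indices into vectors."""
    clusters = []  # the first index in each list acts as the representative
    for i, v in enumerate(vectors):
        for members in clusters:
            if cosine_distance(vectors[members[0]], v) <= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])  # no cluster was close enough
    return clusters


# Hypothetical N-Gram count vectors for four binaries.
vectors = [[5, 1, 0], [10, 2, 0], [0, 1, 6], [0, 2, 11]]
print(cluster(vectors, threshold=0.05))
```

Cosine distance ignores vector length, so a binary with twice as many of the same N-Grams still lands in the same cluster.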
Final Remarks on Clustering




A classic machine learning problem.
Again, learn to use WEKA.
Program Analysis





An incredibly large and deep field.
This section skims the surface.
Main approaches
  Theorem Proving
  Model Checking
  Abstract Interpretation
  Data Flow Analysis
Model Checking





Looks at program states generated by a program.
Some states indicate bugs.
Try BLAST, a model checker for small C programs.
  Caveat: it’s pretty old now.
Theorem Proving - SMT


SMT (Satisfiability Modulo Theories) – what is it?
  An equation solver that covers the types of operations seen in machine code.
Approach for Bug Detection
  User input can generally be anything, so treat it as a “symbolic” variable.
  The rest is concrete.
  Simulate execution of the program, plugging all the machine code that is executed into the solver formulas.
Concolic execution
  Combining symbolic execution with concrete execution.
Concolic Execution







At branches, can we have user input that forces us to go down each path?
Use the SMT solver to tell us.
Launch execution down ‘feasible’ paths.
Use the solver to tell us if bugs are present.
  What user input, if any, can make this pointer NULL?
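The question asked of the solver can be illustrated without one. In the sketch below, a brute-force search over a single input byte stands in for the SMT solver (a real solver reasons about the formulas symbolically instead of enumerating inputs). The program under test, the `idx` function and both paths are hypothetical.

```python
def find_input(path_condition, domain=range(256)):
    """Return an input satisfying every constraint on the path, or None."""
    for value in domain:
        if all(constraint(value) for constraint in path_condition):
            return value
    return None


# Hypothetical program under test, with x as the symbolic user input:
#   idx = (x * 3 + 1) & 0xff
#   if idx > 200:
#       if idx % 16 == 0:
#           ptr = NULL; *ptr = 0      <- bug
def idx(x):
    return (x * 3 + 1) & 0xFF


# Each path is the list of branch conditions taken along it.
bug_path = [lambda x: idx(x) > 200, lambda x: idx(x) % 16 == 0]
infeasible_path = [lambda x: idx(x) > 200, lambda x: idx(x) < 100]

print(find_input(bug_path))         # some input that reaches the bug
print(find_input(infeasible_path))  # None: no input takes this path
```

A feasible path yields a concrete input that can be replayed against the real program, while an infeasible path is simply never explored.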
Concolic path-sensitive analysis
[Figure: control flow graph of a loop in _main, with basic blocks numbered 1 to 4. The blocks are listed below, separated by blank lines.]

lea 0x4(%esp),%ecx
and $0xfffffff0,%esp
pushl -0x4(%ecx)
push %ebp
mov %esp,%ebp
push %ecx
sub $0x24,%esp
call 4011b0 <___main>
movl $0x0,-0x8(%ebp)
jmp 40115f <_main+0x2f>

movl $0x4020a0,(%esp)
call 4011b8 <_puts>
addl $0x1,-0x8(%ebp)

cmpl $0x9,-0x8(%ebp)
jle 40114f <_main+0x1f>

add $0x24,%esp
pop %ecx
pop %ebp
lea -0x4(%ecx),%esp
ret
Abstract Interpretation




Abstract the execution of the program.
Example
  Only consider the sign of a variable, not the actual value.
Requires a transfer function
  What an instruction does to the abstract data.
And a Join/Meet function
  How data is combined when it meets from different control flow paths.
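The sign example can be written down directly. A minimal sketch of a sign domain follows, with transfer functions for addition and multiplication and a join; the domain element names and function names are chosen here for illustration.

```python
# Abstract values: no information yet, negative, zero, positive, unknown sign.
BOTTOM, NEG, ZERO, POS, TOP = "bottom", "-", "0", "+", "top"


def abstract(n):
    """Map a concrete integer to its sign."""
    return NEG if n < 0 else ZERO if n == 0 else POS


def join(a, b):
    """Combine facts arriving from two control flow paths."""
    if a == BOTTOM:
        return b
    if b == BOTTOM:
        return a
    return a if a == b else TOP


def add(a, b):
    """Transfer function for addition on signs."""
    if BOTTOM in (a, b):
        return BOTTOM
    if a == ZERO:
        return b
    if b == ZERO:
        return a
    return a if a == b else TOP  # e.g. (+) + (-) could have any sign


def mul(a, b):
    """Transfer function for multiplication on signs."""
    if BOTTOM in (a, b):
        return BOTTOM
    if ZERO in (a, b):
        return ZERO
    if TOP in (a, b):
        return TOP
    return POS if a == b else NEG


# x is positive on one branch and negative on the other, then squared.
x = join(abstract(3), abstract(-5))
print(x, mul(x, x), mul(NEG, NEG))
```

The result for `mul(x, x)` is the unknown sign even though a square is never negative: the abstraction trades precision for speed, which is the usual price of this kind of analysis.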
Data Flow Analysis


Similar to abstract interpretation.
  Uses a transfer function and a join.
  Implement both using a monotone framework.
Data flow analysis is used by compilers.
Classic data flow problems
  Reaching definitions: which definitions of (or assignments to) a variable reach a given point.
  Liveness: knowing if a variable will be read again before being assigned a new value.
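Liveness, the second classic problem, fits in a short fixed-point loop. The sketch below assumes each basic block is summarised by the variables it reads before writing (`use`) and the variables it writes (`def`); the four-block loop used as input is a hypothetical example.

```python
def liveness(blocks, successors):
    """Iterate to a fixed point; returns (live_in, live_out) per block.

    blocks maps a block name to (use set, def set).
    """
    live_in = {b: set() for b in blocks}
    live_out = {b: set() for b in blocks}
    changed = True
    while changed:
        changed = False
        for b, (use, define) in blocks.items():
            # Join: a variable is live out of b if any successor needs it.
            out = set().union(*(live_in[s] for s in successors.get(b, [])))
            # Transfer: reads in b, plus whatever passes through unmodified.
            new_in = use | (out - define)
            if out != live_out[b] or new_in != live_in[b]:
                live_out[b], live_in[b] = out, new_in
                changed = True
    return live_in, live_out


# Hypothetical loop: i = 0; while (i <= 9) { puts(msg); i = i + 1; } return;
blocks = {
    "entry": (set(), {"i"}),
    "cond": ({"i"}, set()),
    "body": ({"i", "msg"}, {"i"}),
    "exit": (set(), set()),
}
successors = {"entry": ["cond"], "cond": ["body", "exit"], "body": ["cond"]}

live_in, live_out = liveness(blocks, successors)
print(live_in["entry"], live_in["cond"])
```

The sets only ever grow and are bounded by the number of variables, which is why the loop terminates; that is the monotone framework in miniature.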
Data Flow Analysis – Case Study




Implemented in Bugalyze.
Example bug detection
  After free(ptr), where is ptr used before it is reassigned, and is it used in another free?
Has found real bugs in Debian Linux.
Still a work in progress.
Bugalyze – Case Study
Final Remarks on Program Analysis





A wide and deep field.
Good to know the basic approaches.
Reversing is becoming more rigorous (think Hex-Rays).
Conclusion






Academia has some useful techniques.
It’s good to know some of the basic methods.
Applying them will improve industrial programs.
Any questions?

