Examining Malware with Python
Phil Roth
Data Scientist at Endgame
@mrphilroth
Conclusions
Python tools for text classification can easily be
adopted for malware classification.
When using instruction ngrams, your disassembler
and analysis passes are very important.
references: http://bit.ly/scipy-malware
Yes it’s malware, but what kind?
The Data
10868 labeled samples
10873 unlabeled samples
~500 GB uncompressed
9 classes
Classes
Hex Dump
00401000 00 00 80 40 40 28 00 1C 02 42 00 C4 00 20 04 20
00401010 00 00 20 09 2A 02 00 00 00 00 8E 10 41 0A 21 01
00401020 40 00 02 01 00 90 21 00 32 40 00 1C 01 40 C8 18
00401030 40 82 02 63 20 00 00 09 10 01 02 21 00 82 00 04
00401040 82 20 08 83 00 08 00 00 00 00 02 00 60 80 10 80
00401050 18 00 00 20 A9 00 00 00 00 04 04 78 01 02 70 90
00401060 00 02 00 08 20 12 00 00 00 40 10 00 80 00 40 19
00401070 00 00 00 00 11 20 80 04 80 10 00 20 00 00 25 00
00401080 00 00 01 00 00 04 00 10 02 C1 80 80 00 20 20 00
00401090 08 A0 01 01 44 28 00 00 08 10 20 00 02 08 00 00
004010A0 00 40 00 00 00 34 40 40 00 04 00 08 80 08 00 08
004010B0 10 00 40 00 68 02 40 04 E1 00 28 14 00 08 20 0A
004010C0 06 01 02 00 40 00 00 00 00 00 00 20 00 02 00 04
004010D0 80 18 90 00 00 10 A0 00 45 09 00 10 04 40 44 82
004010E0 90 00 26 10 00 00 04 00 82 00 00 00 20 40 00 00
004010F0 B4 00 00 40 00 02 20 25 08 00 00 00 00 00 00 00
00401100 08 00 00 50 00 08 40 50 00 02 06 22 08 85 30 00
00401110 00 80 00 80 60 00 09 00 04 20 00 00 00 00 00 00
00401120 00 82 40 02 00 11 46 01 4A 01 8C 01 E6 00 86 10
00401130 4C 01 22 00 64 00 AE 01 EA 01 2A 11 E8 10 26 11
00401140 4E 11 8E 11 C2 00 6C 00 0C 11 60 01 CA 00 62 10
00401150 6C 01 A0 11 CE 10 2C 11 4E 10 8C 00 CE 01 AE 01
00401160 6C 10 6C 11 A2 01 AE 00 46 11 EE 10 22 00 A8 00
00401170 EC 01 08 11 A2 01 AE 10 6C 00 6E 00 AC 11 8C 00
00401180 EC 01 2A 10 2A 01 AE 00 40 00 C8 10 48 01 4E 11
00401190 0E 00 EC 11 24 10 4A 10 04 01 C8 11 E6 01 C2 00
raw data in hex
Highlighted in the dump: the address 00401180 and the bytes EC 01 2A 10 2A 01 AE that follow it.
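A minimal sketch of reading one line of this dump in Python. It assumes each line is an address followed by space-separated hex byte tokens, with "??" marking bytes whose value is unknown (the later extraction code filters such tokens out). The name parse_hex_line is illustrative and does not come from the talk.

```python
def parse_hex_line(line):
    """Split one dump line into (address, list of byte values).

    Unknown bytes (written "??") are returned as None.
    """
    tokens = line.split()
    address = int(tokens[0], 16)
    values = [None if t.startswith("?") else int(t, 16) for t in tokens[1:]]
    return address, values


address, values = parse_hex_line(
    "00401180 EC 01 2A 10 2A 01 AE 00 40 00 C8 10 48 01 4E 11"
)
print(hex(address), values[:7])
```

Keeping unknown bytes as None rather than dropping them preserves the offset of every byte within the line, which matters if positions are used later.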
Disassembly
HEADER:00400000 ;
HEADER:00400000 ; +-------------------------------------------------------------------------+
HEADER:00400000 ; | This file has been generated by The Interactive Disassembler (IDA) |
HEADER:00400000 ; | Copyright (c) 2013 Hex-Rays, <support@hex-rays.com> |
HEADER:00400000 ; | License info: |
HEADER:00400000 ; | Microsoft |
HEADER:00400000 ; +-------------------------------------------------------------------------+
HEADER:00400000 ;
HEADER:00400000
HEADER:00400000
HEADER:00400000 .686p
HEADER:00400000 .mmx
HEADER:00400000 .model flat
HEADER:00400000
HEADER:00400000 ; ===========================================================================
HEADER:00400000
HEADER:00400000 ; [00001000 BYTES: COLLAPSED SEGMENT HEADER. PRESS KEYPAD CTRL-"+" TO EXPAND]
.text:00401000 ;
.text:00401000 ; Format : Portable executable for 80386 (PE)
.text:00401000 ; Imagebase : 400000
.text:00401000 ; Section 1. (virtual address 00001000)
.text:00401000 ; Virtual size : 00071050 ( 462928.)
.text:00401000 ; Section size in file : 00071200 ( 463360.)
.text:00401000 ; Offset to raw data for section: 00000400
.text:00401000 ; Flags 60000020: Text Executable Readable
.text:00401000 ; Alignment : default
.text:00401000 ; ===========================================================================
Disassembly
.text:00470050 ; =============== S U B R O U T I N E ====================================
.text:00470050
.text:00470050 ; Attributes: bp-based frame
.text:00470050
.text:00470050 sub_470050 proc near ; CODE XREF: start+D8D^Yp
.text:00470050
.text:00470050 var_68 = dword ptr -68h
.text:00470050 var_64 = dword ptr -64h
.text:00470050 var_60 = dword ptr -60h
.text:00470050
.text:00470050 55 push ebp
.text:00470051 8B EC mov ebp, esp
.text:00470053 83 C4 98 add esp, 0FFFFFF98h
.text:00470056 33 C0 xor eax, eax
.text:00470058 8B 15 7C 10 4B 00 mov edx, dword_4B107C
.text:0047005E 89 55 EC mov [ebp+var_14], edx
.text:00470061 89 45 EC mov [ebp+var_14], eax
.text:00470064 53 push ebx
.text:00470065 8B 1D 7C 10 4B 00 mov ebx, dword_4B107C
.text:0047006B 83 FB 2D cmp ebx, 2Dh
.text:0047006E 75 03 jnz short loc_470073
.text:00470070 89 5D EC mov [ebp+var_14], ebx
.text:00470073
.text:00470073 loc_470073: ; CODE XREF: sub_470050+1E^Xj
.text:00470073 56 push esi
.text:00470074 33 C0 xor eax, eax
.text:00470076 8B 5D EC mov ebx, [ebp+var_14]
Highlighted instruction: mov ebx, dword_4B107C
Disassembly
.idata:0046F4DC ;
.idata:0046F4DC ; Imports from KERNEL32.DLL
.idata:0046F4DC ;
.idata:0046F4DC ; ===========================================================================
.idata:0046F4DC
.idata:0046F4DC ; Segment type: Externs
.idata:0046F4DC ; _idata
.idata:0046F4DC ; DWORD __stdcall GetCurrentThreadId()
.idata:0046F4DC ?? ?? ?? ?? extrn __imp_GetCurrentThreadId:dword
.idata:0046F4DC ; DATA XREF: .text:0046F66C^Yo
.idata:0046F4DC ; GetCurrentThreadId^Yr
.idata:0046F4E0 ; BOOL __stdcall WriteFile(HANDLE hFile, LPCVOID lpBuffer, DWORD ...
.idata:0046F4E0 ?? ?? ?? ?? extrn WriteFile:dword ; DATA XREF: .text:00471E4C^Yr
.idata:0046F4E4 ; BOOL __stdcall FindNextVolumeA(HANDLE hFindVolume, LPSTR lpszVolumeName, DW ...
.idata:0046F4E4 ?? ?? ?? ?? extrn FindNextVolumeA:dword
.idata:0046F4E4 ; DATA XREF: .text:00471E46^Yr
.idata:0046F4E8 ; LPVOID __stdcall VirtualAlloc(LPVOID lpAddress, SIZE_T dwSize, DWORD ...
.idata:0046F4E8 ?? ?? ?? ?? extrn __imp_VirtualAlloc:dword
.idata:0046F4E8 ; DATA XREF: VirtualAlloc^Yr
.idata:0046F4EC ; BOOL __stdcall EnumResourceLanguagesA(HMODULE hModule, LPCSTR lpType, LPCSTR ...
.idata:0046F4EC ?? ?? ?? ?? extrn EnumResourceLanguagesA:dword
.idata:0046F4EC ; DATA XREF: .text:00471E70^Yr
Highlighted: Imports from KERNEL32.DLL and __stdcall VirtualAlloc(
My Solution
Features: Byte ngrams, Instruction ngrams, Named features, Manual Features
Feature Selection: SelectKBest (two separate steps)
Model: Gradient Boosting Classifier
Byte ngrams
00401000 00 00 80 40 40 28 00 1C 02 42 00 C4 00 20 04 20
00401010 00 00 20 09 2A 02 00 00 00 00 8E 10 41 0A 21 01
00401020 40 00 02 01 00 90 21 00 32 40 00 1C 01 40 C8 18
00401030 40 82 02 63 20 00 00 09 10 01 02 21 00 82 00 04
Possibilities
1gram: 256
2gram: 65536
3gram: 16777216
4gram: 4294967296
Solution: Hashing
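The counts above are 256**n, which is why an explicit vocabulary quickly becomes impractical. The sketch below illustrates the hashing idea under the assumption of a fixed table of 2**16 buckets. It uses zlib.crc32 purely for illustration, since scikit-learn's HashingVectorizer uses its own hash function, and hashed_ngram_counts is a hypothetical helper.

```python
import zlib
from collections import Counter

N_BUCKETS = 2 ** 16  # fixed number of feature columns


def hashed_ngram_counts(tokens, max_n=3, n_buckets=N_BUCKETS):
    """Count 1..max_n-grams of byte tokens, indexed by hash bucket."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngram = " ".join(tokens[i:i + n])
            bucket = zlib.crc32(ngram.encode("ascii")) % n_buckets
            counts[bucket] += 1
    return counts


# The number of possible byte n-grams grows as 256**n.
for n in range(1, 5):
    print(f"{n}gram: {256 ** n}")

tokens = "00 00 80 40 40 28 00 1C".split()
counts = hashed_ngram_counts(tokens)
print(sum(counts.values()), "ngrams hashed into at most", N_BUCKETS, "columns")
```

The trade-off is that distinct ngrams can collide in one bucket, so the table size balances memory against collisions.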
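Byte ngrams
Code for extracting the byte ngrams and reducing
dimensionality:
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif
from sklearn.pipeline import Pipeline

# Hash 1- to 3-grams of space-separated byte tokens into 2**16 columns.
vectorizer = HashingVectorizer(
    input="content", lowercase=True, stop_words=None, ngram_range=(1,3),
    analyzer="word", n_features=2**16, binary=False, norm=None,
    non_negative=True  # removed in scikit-learn 0.21; use alternate_sign=False there
)
pipe = Pipeline([
    ("extraction", CustomExtractor(vectorizer=vectorizer)),
    ("sel", VarianceThreshold(threshold=0)),  # drop constant columns
    ("tfidf", TfidfTransformer(norm="l2", use_idf=True, smooth_idf=True,
                               sublinear_tf=True)),
    ("kbest", SelectKBest(score_func=f_classif, k=500))  # keep the 500 best columns
])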
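import multiprocessing
import scipy.sparse
import toolz
from toolz.curried import map, filter  # curried versions, usable inside toolz.pipe
from sklearn.feature_extraction.text import HashingVectorizer

class CustomExtractor:
    def __init__(self, vectorizer=HashingVectorizer()):
        self.vectorizer = vectorizer

    def fit(self, X, y=None):
        return self  # stateless

    def transform(self, X, y=None):
        # Extract features from the files in parallel, 32 files per task.
        with multiprocessing.Pool() as pool:
            rows = pool.map(self.feature_extract, X, 32)
        return scipy.sparse.vstack(rows)

    fit_transform = transform

    def feature_extract(self, file_name):
        # Drop the leading address on each line and skip unknown bytes ("??").
        with open(file_name, "r") as f:
            clean_bytes = " ".join(toolz.pipe(
                f,
                map(lambda line: line.rstrip().split()[1:]),
                toolz.concat,
                filter(lambda b: b != "??" and b != "?")
            ))
        return self.vectorizer.transform([clean_bytes])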
Byte ngrams
Why they might be useful: https://github.com/wapiflapi/binglide
sample 0A32eTdBKayjCWhZqDOQ
Instruction ngrams
Extracted instructions:
push lea push mov call mov mov pop retn
mov jmp
push mov mov call test jz push call add mov pop retn
mov mov mov mov retn
mov lea mov inc test jnz sub retn
mov mov mov push mov push push push push call add mov pop retn
mov mov mov push mov push push push push call add mov pop retn
xor retn
mov retn
mov retn
mov retn
mov mov mov retn
mov test jz mov mov push push call mov mov retn
push push push push call push call mov push push push mov call mov retn
mov mov mov retn
mov test jz mov mov push push call mov mov retn
push push push push call mov push push push mov call push call mov retn
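The slides do not show how these mnemonics were pulled out of the IDA listing. One possible approach, sketched here as an assumption, is a regular expression that skips the segment:address prefix and the hex byte column, then captures the first lowercase word after them. The pattern and the name extract_mnemonics are hypothetical and only cover lines shaped like the subroutine listing shown earlier.

```python
import re

# Assumed line shape: ".text:ADDRESS  HEX BYTES  mnemonic operands"
INSTRUCTION_RE = re.compile(
    r"^\.text:[0-9A-F]{8}\s+(?:[0-9A-F]{2}\s+)+([a-z][a-z0-9]*)\b"
)


def extract_mnemonics(lines):
    """Return the mnemonic of every line that looks like an instruction."""
    matches = (INSTRUCTION_RE.match(line) for line in lines)
    return [m.group(1) for m in matches if m is not None]


listing = [
    ".text:00470050 sub_470050 proc near ; CODE XREF: start+D8D",
    ".text:00470050 var_68 = dword ptr -68h",
    ".text:00470050 55 push ebp",
    ".text:00470051 8B EC mov ebp, esp",
    ".text:00470053 83 C4 98 add esp, 0FFFFFF98h",
    ".text:0047006E 75 03 jnz short loc_470073",
    ".text:00470073 loc_470073: ; CODE XREF: sub_470050+1E",
]
print(" ".join(extract_mnemonics(listing)))
```

Lines without a byte column (labels, variable definitions, comments) do not match and are skipped, which leaves only the instruction stream.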
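Instruction ngrams
Code for extracting the instruction ngrams and reducing
dimensionality:
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif
from sklearn.pipeline import Pipeline

# Hash 1- and 2-grams of instruction mnemonics into 2**25 columns.
vectorizer = HashingVectorizer(
    input="content", lowercase=True, stop_words=None, ngram_range=(1, 2),
    analyzer="word", n_features=2**25, binary=False, norm=None,
    non_negative=True  # removed in scikit-learn 0.21; use alternate_sign=False there
)
pipe = Pipeline([
    ("extraction", CustomExtractor(vectorizer=vectorizer)),
    ("sel", VarianceThreshold(threshold=0)),  # drop constant columns
    ("tfidf", TfidfTransformer(norm="l2", use_idf=True, smooth_idf=True,
                               sublinear_tf=True)),
    ("kbest", SelectKBest(score_func=f_classif, k=500))  # keep the 500 best columns
])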
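To see what ngram_range=(1, 2) with analyzer="word" means for a line of mnemonics, the standalone sketch below counts unigrams and bigrams with the standard library only. It is an illustration of the tokenization, not the vectorizer's implementation, and word_ngrams is a hypothetical helper.

```python
from collections import Counter


def word_ngrams(text, min_n=1, max_n=2):
    """Yield word n-grams of a whitespace-separated string."""
    tokens = text.split()
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


counts = Counter(
    word_ngrams("push mov mov call test jz push call add mov pop retn")
)
print(counts["mov"], counts["push call"], counts["mov mov"])
```

In the real pipeline each of these ngrams is hashed to a column index instead of being stored by name.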
Named Features
Section Names, Imports, Imported Functions.
Extracted these features with regular expressions.
Features were (awkwardly) selected in the same
step as instruction ngrams.
import re
# Each feature: a pattern to search for, how to extract the name from the
# match, and a filter applied to the extracted name.
re_features = {
    "imports": {
        "re": re.compile(r"Imports from \w.+"),
        "extract": lambda m: m.group().split()[-1],
        "filter": lambda m: True
    },
    "imported_functions": {
        "re": re.compile(r"__stdcall \w.+\("),
        "extract": lambda m: m.group().split()[-1][:-1],
        "filter": lambda m: not m.startswith("sub_")
    },
    "section_names": {
        "re": re.compile(r"^\S+?:"),
        "extract": lambda m: m.group()[:-1],
        "filter": lambda m: True
    }
}
Named Features
from toolz import pipe, unique
from toolz.curried import map, filter

def process_re_feature(lines, re_dict):
    # Search every line, keep the matches, extract and filter the names,
    # and return each name once.
    return pipe(
        lines,
        map(re_dict["re"].search),
        filter(lambda m: m is not None),
        map(re_dict["extract"]),
        filter(re_dict["filter"]),
        unique
    )
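To make the flow concrete, here is a small standalone sketch that applies the same three kinds of pattern to a few listing lines taken from the disassembly slides. It uses plain loops instead of toolz and leaves out the sub_ filter for brevity. The helper name named_features is illustrative, and the patterns assume the backslash escapes (\w, \S, \() that these regular expressions need.

```python
import re

# name -> (pattern to search for, function extracting the feature from the match text)
PATTERNS = {
    "imports": (re.compile(r"Imports from \w.+"),
                lambda s: s.split()[-1]),
    "imported_functions": (re.compile(r"__stdcall \w.+\("),
                           lambda s: s.split()[-1][:-1]),
    "section_names": (re.compile(r"^\S+?:"),
                      lambda s: s[:-1]),
}


def named_features(lines, pattern, extract):
    """Return the unique extracted values, in order of first appearance."""
    seen = []
    for line in lines:
        m = pattern.search(line)
        if m is not None:
            value = extract(m.group())
            if value not in seen:
                seen.append(value)
    return seen


lines = [
    ".idata:0046F4DC ; Imports from KERNEL32.DLL",
    ".idata:0046F4DC ; DWORD __stdcall GetCurrentThreadId()",
    ".idata:0046F4E0 ; BOOL __stdcall WriteFile(HANDLE hFile, LPCVOID lpBuffer, DWORD ...",
    ".text:00470050 55 push ebp",
]
for name, (pattern, extract) in PATTERNS.items():
    print(name, named_features(lines, pattern, extract))
```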
Manual Features
Sample 0A32eTdBKayjCWhZqDOQ:
{
    "number_of_collapsed_functions": 451,
    "number_of_imported_functions": 101,
    "sample_length": 1201668,
    "number_of_imports": 4,
    "number_of_sections": 4,
    "section_length_0": 979764,
    ...
    "section_length_6": 0,
    "length_of_functions_0": 2706,
    ...
    "length_of_functions_15": 107
}
Final Model
Gradient Boosting Classifier on 1026 features
Grid search optimized parameters
Also tried: LogisticRegression, MultinomialNB,
KNeighborsClassifier, RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier(
    loss='deviance',  # renamed 'log_loss' in newer scikit-learn releases
    learning_rate=0.1, n_estimators=300, subsample=0.9,
    min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0,
    max_depth=3, init=None, random_state=None, max_features=200,
    max_leaf_nodes=None, warm_start=False, verbose=2
)
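Final Model tSNE Plot
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE
from sklearn.pipeline import Pipeline

# Reduce to 50 dimensions with truncated SVD, then embed in 2D with t-SNE.
pipe = Pipeline([
    ("tsvd", TruncatedSVD(n_components=50)),
    ("tsne", TSNE(n_components=2, perplexity=40.0,
                  early_exaggeration=4.0, learning_rate=1000.0,
                  n_iter=1000,  # called max_iter in recent scikit-learn releases
                  metric='euclidean', init='random'))
])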
Results:
I did OK…
More focused on productization
Winning Strategies
xgboost
malware as an image
compression ratio as a feature
other expanded feature sets
probability calibration
semi supervised learning
Axis labels: usable in a product / specific to competitions
Using Capstone
ida ******************************
CV Scores: [ 0.03800 0.02551 0.05283 0.03953 0.0350 ]
mean: 0.03817940685733493 std: 0.008799619405211161
capstone ******************************
CV Scores: [ 0.05065 0.0451 0.06953 0.05583 0.05089]
mean: 0.05441113231562615 std: 0.008283830117670508
from capstone import Cs, CS_ARCH_X86, CS_MODE_32

# Rebuild the raw bytes from the hex dump: drop the address column and "?" placeholders.
code = bytes(bytearray.fromhex("".join(map(
    lambda l : "".join(l.split()[1:]).replace("?", ""),
    open("data/sample/0A32eTdBKayjCWhZqDOQ.bytes", "r")
))))
md = Cs(CS_ARCH_X86, CS_MODE_32)
# disasm_lite yields (address, size, mnemonic, op_str) tuples; keep the mnemonics except int3.
instructions = " ".join(
    [t[2] for t in md.disasm_lite(code, 0x1000) if t[2] != "int3"]
)
Disassemblers
IDA: not (easily) batch distributable
capstone: single pass produces suboptimal results
radare2: Python scriptable reversing framework
vivisect: pure Python, largely undocumented
disassembler and analysis project
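The first step of the Capstone snippet, turning the hex dump text into raw bytes, can be tried on its own without Capstone installed. This sketch assumes the same line layout (an address, then byte tokens, with "?" characters marking unknown bytes that are simply dropped). The name dump_to_bytes and the sample lines are made up for illustration.

```python
def dump_to_bytes(lines):
    """Concatenate the byte columns of hex dump lines into a bytes object."""
    hex_text = "".join(
        "".join(line.split()[1:]).replace("?", "")  # drop address and "??"
        for line in lines
    )
    return bytes.fromhex(hex_text)


sample = [
    "00401000 55 8B EC 83 C4 98 33 C0",
    "00401008 ?? ?? C3",
]
code = dump_to_bytes(sample)
print(len(code), code.hex())
```

Dropping unknown bytes shifts everything after them, so addresses reported by a disassembler run on this buffer may no longer line up with the original file.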
Other Projects
pefile: extracts header information from executables
binglide: visualizations of entropy and byte ngrams
cuckoo: automated dynamic analysis
barf: binary analysis framework with code analysis
Conclusions
Python tools for text classification can easily be
adopted for malware classification.
When using instruction ngrams, your disassembler
and analysis passes are very important.
references: http://bit.ly/scipy-malware
Thank You
Examining Malware with Python

More Related Content

PDF
TC74VHCT04AFN PSpice Model (Free SPICE Model)
PPTX
Big Data mit Microsoft?
PDF
自動車セキュリティの現状 by クリス・ヴァラセク Chris Valasek
PDF
Samsung gt s5570 galaxy mini 05 main electrical part list
PDF
Grant prideco drill_pipe_data_tables
PDF
Al Fazl International 24 Apr 2015 - Weekly
PPTX
Time Series Analysis for Network Secruity
TC74VHCT04AFN PSpice Model (Free SPICE Model)
Big Data mit Microsoft?
自動車セキュリティの現状 by クリス・ヴァラセク Chris Valasek
Samsung gt s5570 galaxy mini 05 main electrical part list
Grant prideco drill_pipe_data_tables
Al Fazl International 24 Apr 2015 - Weekly
Time Series Analysis for Network Secruity

Viewers also liked (14)

PPS
Outpost networksecurity
PDF
A SURVEY ON SECURITY IN WIRELESS SENSOR NETWORKS
PDF
Python reading and writing files
PPT
PPTX
Виртуальное рабочее место на базе продуктов Microsoft (Desktops as a service)
PPTX
Cathexis therapeutic imagery
PDF
Paes Andrano - Bozza - Piano d’Azione per l’Energia Sostenibile
PDF
Assemblea Pubblica 19/12
PDF
Assembly Information Management System
PPT
Chinese food box
PPTX
New Jersey photos
PPTX
Presentasi function room
PPT
Evolution of computers
PPTX
Yoga gives your life a new direction
Outpost networksecurity
A SURVEY ON SECURITY IN WIRELESS SENSOR NETWORKS
Python reading and writing files
Виртуальное рабочее место на базе продуктов Microsoft (Desktops as a service)
Cathexis therapeutic imagery
Paes Andrano - Bozza - Piano d’Azione per l’Energia Sostenibile
Assemblea Pubblica 19/12
Assembly Information Management System
Chinese food box
New Jersey photos
Presentasi function room
Evolution of computers
Yoga gives your life a new direction
Ad

Similar to Examining Malware with Python (20)

PDF
No more dumb hex!
PDF
WebAssembly for the rest of us - Jan-Erik Rediger - Codemotion Amsterdam 2017
PDF
バイナリかるた(アーキテクチャかるた)
PDF
バイナリかるた(アーキテクチャかるた・完全版)
PDF
hashdays 2011: Ange Albertini - Such a weird processor - messing with x86 opc...
DOC
Machine Problem 1
PDF
BlueTeam-RedTeam Exercise - Backdoor containment
PPTX
Assembly Language Tutorials for Windows - 03 Assembly Language Programming
PDF
DEF CON 23 - CHRIS DOMAS - REpsych
PPTX
Introduction to debugging linux applications
PPT
C for Microcontrollers
PDF
The walking 0xDEAD
PDF
Memory management
TXT
Mona cheatsheet
PDF
nullcon 2011 - Memory analysis – Looking into the eye of the bits
PPTX
Introduction to Linux Exploit Development
PPT
class04_x86assembly.ppt hy there u need be
PDF
Compilation process
PDF
nullcon 2011 - Memory analysis – Looking into the eye of the bits
No more dumb hex!
WebAssembly for the rest of us - Jan-Erik Rediger - Codemotion Amsterdam 2017
バイナリかるた(アーキテクチャかるた)
バイナリかるた(アーキテクチャかるた・完全版)
hashdays 2011: Ange Albertini - Such a weird processor - messing with x86 opc...
Machine Problem 1
BlueTeam-RedTeam Exercise - Backdoor containment
Assembly Language Tutorials for Windows - 03 Assembly Language Programming
DEF CON 23 - CHRIS DOMAS - REpsych
Introduction to debugging linux applications
C for Microcontrollers
The walking 0xDEAD
Memory management
Mona cheatsheet
nullcon 2011 - Memory analysis – Looking into the eye of the bits
Introduction to Linux Exploit Development
class04_x86assembly.ppt hy there u need be
Compilation process
nullcon 2011 - Memory analysis – Looking into the eye of the bits
Ad

Recently uploaded (20)

PDF
[EN] Industrial Machine Downtime Prediction
PPTX
Supervised vs unsupervised machine learning algorithms
PPTX
iec ppt-1 pptx icmr ppt on rehabilitation.pptx
PDF
22.Patil - Early prediction of Alzheimer’s disease using convolutional neural...
PDF
168300704-gasification-ppt.pdfhghhhsjsjhsuxush
PPT
Miokarditis (Inflamasi pada Otot Jantung)
PDF
annual-report-2024-2025 original latest.
PPTX
STUDY DESIGN details- Lt Col Maksud (21).pptx
PDF
Fluorescence-microscope_Botany_detailed content
PDF
Recruitment and Placement PPT.pdfbjfibjdfbjfobj
PPT
ISS -ESG Data flows What is ESG and HowHow
PPTX
Introduction to Knowledge Engineering Part 1
PPTX
Introduction to machine learning and Linear Models
PDF
Galatica Smart Energy Infrastructure Startup Pitch Deck
PPTX
Qualitative Qantitative and Mixed Methods.pptx
PPTX
Acceptance and paychological effects of mandatory extra coach I classes.pptx
PPTX
IB Computer Science - Internal Assessment.pptx
PPTX
The THESIS FINAL-DEFENSE-PRESENTATION.pptx
PPTX
Introduction-to-Cloud-ComputingFinal.pptx
[EN] Industrial Machine Downtime Prediction
Supervised vs unsupervised machine learning algorithms
iec ppt-1 pptx icmr ppt on rehabilitation.pptx
22.Patil - Early prediction of Alzheimer’s disease using convolutional neural...
168300704-gasification-ppt.pdfhghhhsjsjhsuxush
Miokarditis (Inflamasi pada Otot Jantung)
annual-report-2024-2025 original latest.
STUDY DESIGN details- Lt Col Maksud (21).pptx
Fluorescence-microscope_Botany_detailed content
Recruitment and Placement PPT.pdfbjfibjdfbjfobj
ISS -ESG Data flows What is ESG and HowHow
Introduction to Knowledge Engineering Part 1
Introduction to machine learning and Linear Models
Galatica Smart Energy Infrastructure Startup Pitch Deck
Qualitative Qantitative and Mixed Methods.pptx
Acceptance and paychological effects of mandatory extra coach I classes.pptx
IB Computer Science - Internal Assessment.pptx
The THESIS FINAL-DEFENSE-PRESENTATION.pptx
Introduction-to-Cloud-ComputingFinal.pptx

Examining Malware with Python

  • 2. Examining Malware with Python Phil Roth Data Scientist at Endgame @mrphilroth
  • 3. 3 Python tools for text classification can easily be adopted for malware classification. When using instruction ngrams, your disassembler and analysis passes are very important. references: http://bit.ly/scipy-malware Conclusions
  • 4. 4 Yes it’s malware, but what kind?
  • 5. The Data 5 10868 labeled samples 10873 unlabeled samples ~500 GB uncompressed 9 classes
  • 7. Hex Dump 7 00401000 00 00 80 40 40 28 00 1C 02 42 00 C4 00 20 04 20 00401010 00 00 20 09 2A 02 00 00 00 00 8E 10 41 0A 21 01 00401020 40 00 02 01 00 90 21 00 32 40 00 1C 01 40 C8 18 00401030 40 82 02 63 20 00 00 09 10 01 02 21 00 82 00 04 00401040 82 20 08 83 00 08 00 00 00 00 02 00 60 80 10 80 00401050 18 00 00 20 A9 00 00 00 00 04 04 78 01 02 70 90 00401060 00 02 00 08 20 12 00 00 00 40 10 00 80 00 40 19 00401070 00 00 00 00 11 20 80 04 80 10 00 20 00 00 25 00 00401080 00 00 01 00 00 04 00 10 02 C1 80 80 00 20 20 00 00401090 08 A0 01 01 44 28 00 00 08 10 20 00 02 08 00 00 004010A0 00 40 00 00 00 34 40 40 00 04 00 08 80 08 00 08 004010B0 10 00 40 00 68 02 40 04 E1 00 28 14 00 08 20 0A 004010C0 06 01 02 00 40 00 00 00 00 00 00 20 00 02 00 04 004010D0 80 18 90 00 00 10 A0 00 45 09 00 10 04 40 44 82 004010E0 90 00 26 10 00 00 04 00 82 00 00 00 20 40 00 00 004010F0 B4 00 00 40 00 02 20 25 08 00 00 00 00 00 00 00 00401100 08 00 00 50 00 08 40 50 00 02 06 22 08 85 30 00 00401110 00 80 00 80 60 00 09 00 04 20 00 00 00 00 00 00 00401120 00 82 40 02 00 11 46 01 4A 01 8C 01 E6 00 86 10 00401130 4C 01 22 00 64 00 AE 01 EA 01 2A 11 E8 10 26 11 00401140 4E 11 8E 11 C2 00 6C 00 0C 11 60 01 CA 00 62 10 00401150 6C 01 A0 11 CE 10 2C 11 4E 10 8C 00 CE 01 AE 01 00401160 6C 10 6C 11 A2 01 AE 00 46 11 EE 10 22 00 A8 00 00401170 EC 01 08 11 A2 01 AE 10 6C 00 6E 00 AC 11 8C 00 00401180 EC 01 2A 10 2A 01 AE 00 40 00 C8 10 48 01 4E 11 00401190 0E 00 EC 11 24 10 4A 10 04 01 C8 11 E6 01 C2 00 raw data in hex
  • 8. Hex Dump 8 00401000 00 00 80 40 40 28 00 1C 02 42 00 C4 00 20 04 20 00401010 00 00 20 09 2A 02 00 00 00 00 8E 10 41 0A 21 01 00401020 40 00 02 01 00 90 21 00 32 40 00 1C 01 40 C8 18 00401030 40 82 02 63 20 00 00 09 10 01 02 21 00 82 00 04 00401040 82 20 08 83 00 08 00 00 00 00 02 00 60 80 10 80 00401050 18 00 00 20 A9 00 00 00 00 04 04 78 01 02 70 90 00401060 00 02 00 08 20 12 00 00 00 40 10 00 80 00 40 19 00401070 00 00 00 00 11 20 80 04 80 10 00 20 00 00 25 00 00401080 00 00 01 00 00 04 00 10 02 C1 80 80 00 20 20 00 00401090 08 A0 01 01 44 28 00 00 08 10 20 00 02 08 00 00 004010A0 00 40 00 00 00 34 40 40 00 04 00 08 80 08 00 08 004010B0 10 00 40 00 68 02 40 04 E1 00 28 14 00 08 20 0A 004010C0 06 01 02 00 40 00 00 00 00 00 00 20 00 02 00 04 004010D0 80 18 90 00 00 10 A0 00 45 09 00 10 04 40 44 82 004010E0 90 00 26 10 00 00 04 00 82 00 00 00 20 40 00 00 004010F0 B4 00 00 40 00 02 20 25 08 00 00 00 00 00 00 00 00401100 08 00 00 50 00 08 40 50 00 02 06 22 08 85 30 00 00401110 00 80 00 80 60 00 09 00 04 20 00 00 00 00 00 00 00401120 00 82 40 02 00 11 46 01 4A 01 8C 01 E6 00 86 10 00401130 4C 01 22 00 64 00 AE 01 EA 01 2A 11 E8 10 26 11 00401140 4E 11 8E 11 C2 00 6C 00 0C 11 60 01 CA 00 62 10 00401150 6C 01 A0 11 CE 10 2C 11 4E 10 8C 00 CE 01 AE 01 00401160 6C 10 6C 11 A2 01 AE 00 46 11 EE 10 22 00 A8 00 00401170 EC 01 08 11 A2 01 AE 10 6C 00 6E 00 AC 11 8C 00 00401180 EC 01 2A 10 2A 01 AE 00 40 00 C8 10 48 01 4E 11 00401190 0E 00 EC 11 24 10 4A 10 04 01 C8 11 E6 01 C2 00 00401180 EC 01 2A 10 2A 01 AE raw data in hex
  • 9. Disassembly 9 HEADER:00400000 ; HEADER:00400000 ; +-------------------------------------------------------------------------+ HEADER:00400000 ; | This file has been generated by The Interactive Disassembler (IDA) | HEADER:00400000 ; | Copyright (c) 2013 Hex-Rays, <support@hex-rays.com> | HEADER:00400000 ; | License info: | HEADER:00400000 ; | Microsoft | HEADER:00400000 ; +-------------------------------------------------------------------------+ HEADER:00400000 ; HEADER:00400000 HEADER:00400000 HEADER:00400000 .686p HEADER:00400000 .mmx HEADER:00400000 .model flat HEADER:00400000 HEADER:00400000 ; =========================================================================== HEADER:00400000 HEADER:00400000 ; [00001000 BYTES: COLLAPSED SEGMENT HEADER. PRESS KEYPAD CTRL-"+" TO EXPAND] .text:00401000 ; .text:00401000 ; Format : Portable executable for 80386 (PE) .text:00401000 ; Imagebase : 400000 .text:00401000 ; Section 1. (virtual address 00001000) .text:00401000 ; Virtual size : 00071050 ( 462928.) .text:00401000 ; Section size in file : 00071200 ( 463360.) .text:00401000 ; Offset to raw data for section: 00000400 .text:00401000 ; Flags 60000020: Text Executable Readable .text:00401000 ; Alignment : default .text:00401000 ; ===========================================================================
  • 10. HEADER:00400000 ; HEADER:00400000 ; +-------------------------------------------------------------------------+ HEADER:00400000 ; | This file has been generated by The Interactive Disassembler (IDA) | HEADER:00400000 ; | Copyright (c) 2013 Hex-Rays, <support@hex-rays.com> | HEADER:00400000 ; | License info: | HEADER:00400000 ; | Microsoft | HEADER:00400000 ; +-------------------------------------------------------------------------+ HEADER:00400000 ; HEADER:00400000 HEADER:00400000 HEADER:00400000 .686p HEADER:00400000 .mmx HEADER:00400000 .model flat HEADER:00400000 HEADER:00400000 ; =========================================================================== HEADER:00400000 HEADER:00400000 ; [00001000 BYTES: COLLAPSED SEGMENT HEADER. PRESS KEYPAD CTRL-"+" TO EXPAND] .text:00401000 ; .text:00401000 ; Format : Portable executable for 80386 (PE) .text:00401000 ; Imagebase : 400000 .text:00401000 ; Section 1. (virtual address 00001000) .text:00401000 ; Virtual size : 00071050 ( 462928.) .text:00401000 ; Section size in file : 00071200 ( 463360.) .text:00401000 ; Offset to raw data for section: 00000400 .text:00401000 ; Flags 60000020: Text Executable Readable .text:00401000 ; Alignment : default .text:00401000 ; =========================================================================== Disassembly 10 HEADER:00400000
  • 11. HEADER:00400000 ; HEADER:00400000 ; +-------------------------------------------------------------------------+ HEADER:00400000 ; | This file has been generated by The Interactive Disassembler (IDA) | HEADER:00400000 ; | Copyright (c) 2013 Hex-Rays, <support@hex-rays.com> | HEADER:00400000 ; | License info: | HEADER:00400000 ; | Microsoft | HEADER:00400000 ; +-------------------------------------------------------------------------+ HEADER:00400000 ; HEADER:00400000 HEADER:00400000 HEADER:00400000 .686p HEADER:00400000 .mmx HEADER:00400000 .model flat HEADER:00400000 HEADER:00400000 ; =========================================================================== HEADER:00400000 HEADER:00400000 ; [00001000 BYTES: COLLAPSED SEGMENT HEADER. PRESS KEYPAD CTRL-"+" TO EXPAND] .text:00401000 ; .text:00401000 ; Format : Portable executable for 80386 (PE) .text:00401000 ; Imagebase : 400000 .text:00401000 ; Section 1. (virtual address 00001000) .text:00401000 ; Virtual size : 00071050 ( 462928.) .text:00401000 ; Section size in file : 00071200 ( 463360.) .text:00401000 ; Offset to raw data for section: 00000400 .text:00401000 ; Flags 60000020: Text Executable Readable .text:00401000 ; Alignment : default .text:00401000 ; =========================================================================== Disassembly 11 HEADER:00400000
  • 12. Disassembly
        .text:00470050 ; =============== S U B R O U T I N E ====================================
        .text:00470050
        .text:00470050 ; Attributes: bp-based frame
        .text:00470050
        .text:00470050 sub_470050 proc near ; CODE XREF: start+D8D↓p
        .text:00470050
        .text:00470050 var_68 = dword ptr -68h
        .text:00470050 var_64 = dword ptr -64h
        .text:00470050 var_60 = dword ptr -60h
        .text:00470050
        .text:00470050 55                 push ebp
        .text:00470051 8B EC              mov ebp, esp
        .text:00470053 83 C4 98           add esp, 0FFFFFF98h
        .text:00470056 33 C0              xor eax, eax
        .text:00470058 8B 15 7C 10 4B 00  mov edx, dword_4B107C
        .text:0047005E 89 55 EC           mov [ebp+var_14], edx
        .text:00470061 89 45 EC           mov [ebp+var_14], eax
        .text:00470064 53                 push ebx
        .text:00470065 8B 1D 7C 10 4B 00  mov ebx, dword_4B107C
        .text:0047006B 83 FB 2D           cmp ebx, 2Dh
        .text:0047006E 75 03              jnz short loc_470073
        .text:00470070 89 5D EC           mov [ebp+var_14], ebx
        .text:00470073
        .text:00470073 loc_470073: ; CODE XREF: sub_470050+1E↑j
        .text:00470073 56                 push esi
        .text:00470074 33 C0              xor eax, eax
        .text:00470076 8B 5D EC           mov ebx, [ebp+var_14]
  • 13, 14. Disassembly: same listing as slide 12, calling out the instruction mov ebx, dword_4B107C.
  • 15. Disassembly
        .idata:0046F4DC ;
        .idata:0046F4DC ; Imports from KERNEL32.DLL
        .idata:0046F4DC ;
        .idata:0046F4DC ; ===========================================================================
        .idata:0046F4DC
        .idata:0046F4DC ; Segment type: Externs
        .idata:0046F4DC ; _idata
        .idata:0046F4DC ; DWORD __stdcall GetCurrentThreadId()
        .idata:0046F4DC ?? ?? ?? ?? extrn __imp_GetCurrentThreadId:dword
        .idata:0046F4DC ; DATA XREF: .text:0046F66C↓o
        .idata:0046F4DC ; GetCurrentThreadId↓r
        .idata:0046F4E0 ; BOOL __stdcall WriteFile(HANDLE hFile, LPCVOID lpBuffer, DWORD ...
        .idata:0046F4E0 ?? ?? ?? ?? extrn WriteFile:dword ; DATA XREF: .text:00471E4C↓r
        .idata:0046F4E4 ; BOOL __stdcall FindNextVolumeA(HANDLE hFindVolume, LPSTR lpszVolumeName, DW ...
        .idata:0046F4E4 ?? ?? ?? ?? extrn FindNextVolumeA:dword
        .idata:0046F4E4 ; DATA XREF: .text:00471E46↓r
        .idata:0046F4E8 ; LPVOID __stdcall VirtualAlloc(LPVOID lpAddress, SIZE_T dwSize, DWORD ...
        .idata:0046F4E8 ?? ?? ?? ?? extrn __imp_VirtualAlloc:dword
        .idata:0046F4E8 ; DATA XREF: VirtualAlloc↓r
        .idata:0046F4EC ; BOOL __stdcall EnumResourceLanguagesA(HMODULE hModule, LPCSTR lpType, LPCSTR ...
        .idata:0046F4EC ?? ?? ?? ?? extrn EnumResourceLanguagesA:dword
        .idata:0046F4EC ; DATA XREF: .text:00471E70↓r
  • 16. Disassembly: same listing as slide 15, calling out "Imports from KERNEL32.DLL" and "__stdcall VirtualAlloc(".
  • 18. Byte ngrams
        00401000 00 00 80 40 40 28 00 1C 02 42 00 C4 00 20 04 20
        00401010 00 00 20 09 2A 02 00 00 00 00 8E 10 41 0A 21 01
        00401020 40 00 02 01 00 90 21 00 32 40 00 1C 01 40 C8 18
        00401030 40 82 02 63 20 00 00 09 10 01 02 21 00 82 00 04
        Possibilities:
          1gram: 256
          2gram: 65536
          3gram: 16777216
          4gram: 4294967296
        Solution: Hashing
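The counts on this slide are 256**n, which is why a fixed-size hashed feature space is attractive. The sketch below illustrates the idea with the standard library only. The helper name hashed_ngram_counts, the bucket count, and the use of zlib.crc32 as the hash are assumptions made for illustration; scikit-learn's HashingVectorizer uses MurmurHash3 internally.

```python
import zlib
from collections import Counter

# The number of distinct byte n-grams grows as 256**n.
possibilities = {n: 256 ** n for n in range(1, 5)}


def hashed_ngram_counts(tokens, n_max=3, n_buckets=2 ** 16):
    """Count all 1..n_max-grams of `tokens` in a fixed number of buckets.

    Hypothetical helper: crc32 stands in for a real feature hash.
    """
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            ngram = " ".join(tokens[i:i + n])
            bucket = zlib.crc32(ngram.encode("ascii")) % n_buckets
            counts[bucket] += 1
    return counts


# First row of the hex dump on the slide, address column removed.
row = "00 00 80 40 40 28 00 1C 02 42 00 C4 00 20 04 20".split()
counts = hashed_ngram_counts(row)
```

Whatever n is, the feature vector never has more than n_buckets columns; the price is that unrelated ngrams can collide in the same bucket.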
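  • 19. Byte ngrams. Code for extracting the byte ngrams and reducing dimensionality:
        # Imports the slide code relies on
        from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
        from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif
        from sklearn.pipeline import Pipeline

        # Hash 1- to 3-grams of space-separated byte tokens into 2**16 columns.
        # Note: non_negative was removed in scikit-learn 0.21;
        # newer versions use alternate_sign=False instead.
        vectorizer = HashingVectorizer(
            input="content", lowercase=True, stop_words=None,
            ngram_range=(1,3), analyzer="word", n_features=2**16,
            binary=False, norm=None, non_negative=True
        )
        pipe = Pipeline([
            ("extraction", CustomExtractor(vectorizer=vectorizer)),
            ("sel", VarianceThreshold(threshold=0)),
            ("tfidf", TfidfTransformer(norm="l2", use_idf=True,
                                       smooth_idf=True, sublinear_tf=True)),
            ("kbest", SelectKBest(score_func=f_classif, k=500))
        ])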
  • 20. Byte ngrams. Same vectorizer and pipeline as slide 19, plus the CustomExtractor class it uses:
        import multiprocessing
        import scipy.sparse
        import toolz
        from toolz.curried import map, filter  # curried versions shadow the builtins

        class CustomExtractor():
            def __init__(self, vectorizer=HashingVectorizer()):
                self.vectorizer = vectorizer

            def fit(self, X, y):
                return self  # stateless

            def transform(self, X, y=None):
                # Extract features from the files in parallel, 32 files per chunk
                pool = multiprocessing.Pool()
                rows = pool.map(self.feature_extract, X, 32)
                return scipy.sparse.vstack(list(rows))

            fit_transform = transform

            def feature_extract(self, file_name):
                # Drop the address column and the "??" placeholders,
                # then join the remaining byte tokens with spaces
                clean_bytes = " ".join(toolz.pipe(
                    open(file_name, "r"),
                    map(lambda line: line.rstrip().split()[1:]),
                    toolz.concat,
                    filter(lambda b: b != "??" and b != "?")
                ))
                return self.vectorizer.transform([clean_bytes])
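The cleaning step inside feature_extract can be tried without toolz or scikit-learn. The following is a minimal standard-library sketch of the same idea, not the speaker's code; the function name clean_byte_tokens is hypothetical, and the "??" entries in the second sample line are made up to exercise the filter.

```python
def clean_byte_tokens(lines):
    """Yield hex byte tokens from hex-dump lines.

    Skips the leading address column and the '??' / '?' placeholders,
    mirroring the filter in feature_extract on the slide.
    """
    for line in lines:
        for token in line.rstrip().split()[1:]:
            if token not in ("??", "?"):
                yield token


sample_lines = [
    "00401000 00 00 80 40 40 28 00 1C 02 42 00 C4 00 20 04 20\n",
    # Illustrative line: the two '??' tokens are invented for this example.
    "00401010 ?? ?? 20 09 2A 02 00 00 00 00 8E 10 41 0A 21 01\n",
]

# A single space-separated "document" of byte tokens, ready for a
# word-level vectorizer.
clean_bytes = " ".join(clean_byte_tokens(sample_lines))
```

Treating each byte as a "word" is what lets a text vectorizer with analyzer="word" and ngram_range=(1,3) produce byte ngrams.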
  • 21. Byte ngrams. Why they might be useful: https://github.com/wapiflapi/binglide
  • 23. Instruction ngrams. Extracted instructions:
        push lea push mov call mov mov pop retn mov jmp push mov mov call test jz push call add mov pop retn mov mov mov mov retn mov lea mov inc test jnz sub retn mov mov mov push mov push push push push call add mov pop retn mov mov mov push mov push push push push call add mov pop retn xor retn mov retn mov retn mov retn mov mov mov retn mov test jz mov mov push push call mov mov retn push push push push call push call mov push push push mov call mov retn mov mov mov retn mov test jz mov mov push push call mov mov retn push push push push call mov push push push mov call push call mov retn
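To make "instruction ngrams" concrete, here is a small sketch that counts 1-grams and 2-grams over the first few mnemonics listed on this slide. It uses only the standard library; the word_ngrams helper is a hypothetical stand-in for what a word-level vectorizer with ngram_range=(1, 2) does before hashing.

```python
from collections import Counter


def word_ngrams(tokens, n_min=1, n_max=2):
    """Yield every n-gram of `tokens` for n_min <= n <= n_max,
    joined with single spaces."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


# First nine mnemonics from the slide.
mnemonics = "push lea push mov call mov mov pop retn".split()
counts = Counter(word_ngrams(mnemonics))
```

Here counts["mov"] is 3 and counts["mov mov"] is 1; each distinct key would become one (hashed) feature column.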
  • 24. Instruction ngrams. Code for extracting the instruction ngrams and reducing dimensionality:
        vectorizer = HashingVectorizer(
            input="content", lowercase=True, stop_words=None,
            ngram_range=(1, 2), analyzer="word", n_features=2**25,
            binary=False, norm=None, non_negative=True
        )
        pipe = Pipeline([
            ("extraction", CustomExtractor(vectorizer=vectorizer)),
            ("sel", VarianceThreshold(threshold=0)),
            ("tfidf", TfidfTransformer(norm="l2", use_idf=True,
                                       smooth_idf=True, sublinear_tf=True)),
            ("kbest", SelectKBest(score_func=f_classif, k=500))
        ])
  • 25. Named Features: section names, imports, and imported functions. These features were extracted with regular expressions and were (awkwardly) selected in the same step as the instruction ngrams.
  • 26. Named Features
        import re

        # Each entry: a regex to find the feature in a disassembly line,
        # a function to extract its name from the match, and a filter
        # applied to the extracted name.
        re_features = {
            "imports" : {
                "re" : re.compile(r"Imports from \w.+"),
                "extract" : lambda m : m.group().split()[-1],
                "filter" : lambda m : True
            },
            "imported_functions" : {
                "re" : re.compile(r"__stdcall \w.+\("),
                "extract" : lambda m : m.group().split()[-1][:-1],
                "filter" : lambda m : not m.startswith("sub_")
            },
            "section_names" : {
                "re" : re.compile(r"^\S+?:"),
                "extract" : lambda m : m.group()[:-1],
                "filter" : lambda m : True
            }
        }
  • 27. Named Features
        from toolz import pipe, unique
        from toolz.curried import map, filter

        def process_re_feature(lines, re_dict):
            # Search each line, keep the matches, extract and filter the
            # names, and return each name once
            return pipe(
                lines,
                map(re_dict["re"].search),
                filter(lambda m : m is not None),
                map(re_dict["extract"]),
                filter(re_dict["filter"]),
                unique
            )
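The named-feature extraction can be checked against a few lines from the disassembly slides. This sketch uses only the standard library instead of toolz, and it assumes the slide's patterns were meant to contain \w, \S, and an escaped parenthesis (the backslashes appear to have been lost in the transcript). The helper name unique_matches is hypothetical.

```python
import re

# Patterns as they presumably appeared on the slide (an assumption).
IMPORT_RE = re.compile(r"Imports from \w.+")
STDCALL_RE = re.compile(r"__stdcall \w.+\(")
SECTION_RE = re.compile(r"^\S+?:")


def unique_matches(lines, pattern, extract):
    """Return the extracted value of every matching line, once each,
    in order of first appearance."""
    seen = []
    for line in lines:
        m = pattern.search(line)
        if m is not None:
            value = extract(m)
            if value not in seen:
                seen.append(value)
    return seen


# Lines taken from the IDA listings shown on the disassembly slides.
lines = [
    ".idata:0046F4DC ; Imports from KERNEL32.DLL",
    ".idata:0046F4DC ; DWORD __stdcall GetCurrentThreadId()",
    ".idata:0046F4E8 ; LPVOID __stdcall VirtualAlloc(LPVOID lpAddress, SIZE_T dwSize, DWORD ...",
    ".text:00470050 55 push ebp",
]

imports = unique_matches(lines, IMPORT_RE, lambda m: m.group().split()[-1])
functions = unique_matches(lines, STDCALL_RE, lambda m: m.group().split()[-1][:-1])
sections = unique_matches(lines, SECTION_RE, lambda m: m.group()[:-1])
```

On these four lines the sketch yields one import (KERNEL32.DLL), two imported functions, and two section names (.idata and .text).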
  • 29. Manual Features (sample 0A32eTdBKayjCWhZqDOQ)
        {
          "number_of_collapsed_functions": 451,
          "number_of_imported_functions": 101,
          "sample_length": 1201668,
          "number_of_imports": 4,
          "number_of_sections": 4,
          "section_length_0": 979764,
          ...
          "section_length_6": 0,
          "length_of_functions_0": 2706,
          ...
          "length_of_functions_15": 107
        }
  • 30. Final Model
        Gradient Boosting Classifier on 1026 features
        Grid search optimized parameters
        Also tried: LogisticRegression, MultinomialNB, KNeighborsClassifier, RandomForestClassifier

        clf = GradientBoostingClassifier(
            loss='deviance', learning_rate=0.1, n_estimators=300,
            subsample=0.9, min_samples_split=2, min_samples_leaf=1,
            min_weight_fraction_leaf=0.0, max_depth=3, init=None,
            random_state=None, max_features=200, max_leaf_nodes=None,
            warm_start=False, verbose=2
        )
  • 31. Final Model tSNE Plot 31
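  • 32. Final Model tSNE Plot
        from sklearn.pipeline import Pipeline
        from sklearn.decomposition import TruncatedSVD
        from sklearn.manifold import TSNE

        # Reduce to 50 dimensions with truncated SVD, then embed in 2D with t-SNE
        pipe = Pipeline([
            ("tsvd", TruncatedSVD(n_components=50)),
            ("tsne", TSNE(n_components=2, perplexity=40.0,
                          early_exaggeration=4.0, learning_rate=1000.0,
                          n_iter=1000, metric='euclidean', init='random'))
        ])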
  • 33. Results: I did OK… More focused on productization
  • 34. Winning Strategies: xgboost; malware as an image; compression ratio as a feature; other expanded feature sets; probability calibration; semi supervised learning. Slide legend: usable in a product; specific to competitions.
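One of the strategies listed here, compression ratio as a feature, is easy to illustrate. The sketch below is an assumption about how such a feature could be computed, not the winning teams' implementation: it uses zlib from the standard library, and the function name compression_ratio is hypothetical.

```python
import hashlib
import zlib


def compression_ratio(data):
    """Return compressed size divided by original size (0.0 for empty input).

    Highly repetitive data gives a ratio near 0; data that already looks
    random (for example packed or encrypted sections) gives a ratio near 1.
    """
    if not data:
        return 0.0
    return len(zlib.compress(data)) / len(data)


# 4096 zero bytes: very compressible.
repetitive = bytes(4096)

# 4096 deterministic pseudo-random bytes built from SHA-256 digests,
# used here only as a stand-in for high-entropy content.
noisy = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(128))

low = compression_ratio(repetitive)
high = compression_ratio(noisy)
```

A single number like this can be appended to the other feature columns before training.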
  • 35. Using Capstone
        ida ******************************
        CV Scores: [ 0.03800 0.02551 0.05283 0.03953 0.0350 ]
        mean: 0.03817940685733493 std: 0.008799619405211161
        capstone ******************************
        CV Scores: [ 0.05065 0.0451 0.06953 0.05583 0.05089]
        mean: 0.05441113231562615 std: 0.008283830117670508

        code = bytes(bytearray.fromhex("".join(map(
            lambda l : "".join(l.split()[1:]).replace("?", ""),
            open("data/sample/0A32eTdBKayjCWhZqDOQ.bytes", "r")
        ))))
        from capstone import Cs, CS_ARCH_X86, CS_MODE_32
        md = Cs(CS_ARCH_X86, CS_MODE_32)
        instructions = " ".join(
            [t[2] for t in md.disasm_lite(code, 0x1000) if t[2] != "int3"]
        )
  • 36. Disassemblers
        IDA: not (easily) batch distributable
        capstone: single pass produces suboptimal results
        radare2: Python scriptable reversing framework
        vivisect: pure Python, largely undocumented disassembler and analysis project
  • 37. Other Projects
        pefile: extracts header information from executables
        binglide: visualizations of entropy and byte ngrams
        cuckoo: automated dynamic analysis
        barf: binary analysis framework with code analysis
  • 38. Conclusions: Python tools for text classification can easily be adopted for malware classification. When using instruction ngrams, your disassembler and analysis passes are very important. references: http://bit.ly/scipy-malware