Malware classification meets crowd-sourcing
John Park, HP Security Research
October, 2015
Overview
1. Malware Classification
2. Crowd-Sourcing Logistics
3. What Crowd-Sourcing Is Good For
4. What Comes After Crowd-Sourcing
What is “Malware Classification”?
“Malware Classification” = “What a Malware Analyst Does.”
[Diagram: an Unknown File goes to the Malware Analysts, who label it Good or Bad.]
[Diagram: an Unknown File goes to the Malware Analysts, who label it Good, Bad, or Blah.]
[Diagram: an Unknown File goes to the Malware Analysts, who label it Good, Bad, Blah, or Shady, using WhiteBox Analysis, BlackBox Analysis, and Prior Knowledge.]
[Diagram: VirusTotal appears alongside WhiteBox Analysis, BlackBox Analysis, and Prior Knowledge.]
[Diagram: VirusTotal is folded into the analysis pipeline itself, next to WhiteBox Analysis, BlackBox Analysis, and Prior Knowledge.]
Malware Classification
“Super-System”
Malware Analyst Experts
Data Sharing
Automation IT Infrastructure
Algorithms
Malware Analyst Experts: makes the final call.
Data Sharing: trusted parties; layered access.
Automation IT Infrastructure: handles large numbers of files, fast.
Algorithms: this is where Crowd-Sourcing is useful; cross-field collaboration.
An easy way to crowd-source algorithms is to hold
a “Data Science Competition”.
Early Days: Netflix Prize Competition (2006-2009)
1. The user ratings data is provided.
2. Netflix's own algorithm can guess within about a 1-star range (Root Mean Square Error = 0.95).
3. If you can improve the accuracy by 10% (RMSE = 0.85), you get $1 million.
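For reference, RMSE here is just the square root of the mean squared prediction error between actual and predicted star ratings. A minimal sketch in Python with made-up ratings:

```python
import math

# Hypothetical actual vs. predicted star ratings for a handful of (user, movie) pairs.
actual    = [4.0, 3.0, 5.0, 2.0, 4.0]
predicted = [3.5, 3.0, 4.0, 3.0, 4.5]

# Root Mean Square Error: the square root of the mean squared prediction error.
rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))
print(f"RMSE = {rmse:.2f}")
```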
[Diagram: movies M1-M3 and users U1-U3 as two rows of a bipartite graph, with each rating drawn as an edge between a user and a movie.]
Microsoft Malware Classification Challenge
Data:
- 21,741 malware files
- 9 different malware families
- Hex dump, without the PE header
- Assembly file, generated with IDA
Objective:
- Classify each file into the right family.
Kaggle: like the Netflix Prize Competition,
with a shorter cycle,
with smaller prizes,
with diverse problems.
Time Frame
Competition begins: the dataset + objectives are posted.
Forum, Month 1: problem clarifying; early birds take a stab at it.
Forum, Month 2: benchmark code is shared; inflow of “benchmarkers”.
Forum, Month 3: last-minute leap-frogging; complaints and moans about the dataset.
Competition ends.
Forum, aftermath: the secret sauce is shared; the top 10 teams usually use the same approach.
Every competition is a little different, like snowflakes.
1. Security is a popular topic
2. Large dataset (~400 GB)
3. Unfamiliar data structure
Typical Data Format
[Diagram: the Training Set and the Test Set are tables of n entries by m features plus an answer column; in the Test Set, the answer column is the value you must guess.]
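As a made-up illustration of that layout (the feature values and names are invented):

```python
# A made-up example of the layout: training set = features + answer, test set = features only.
train_rows = [
    # [m features ...],          answer
    ([3, 920, 0.23, 0.40], "Bot_X"),
    ([2, 3200, 0.15, 0.24], "Hacktool"),
    ([4, 252, 0.55, 0.54], "Known Good"),
]
test_rows = [
    [3, 885, 0.34, 0.98],   # the answer column is missing: this is the value to guess
    [5, 6413, 0.78, 0.45],
]

X_train = [features for features, answer in train_rows]
y_train = [answer for _, answer in train_rows]
X_test = test_rows
print(len(X_train), "training entries,", len(X_test), "test entries")
```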
Data Science, explained by “Communication Theory”
[Diagram: Person A's brain encodes a message into words spoken by Person A (encoding using the language); words heard by Person B are decoded by Person B's brain (decoding the language); room noise, wire noise, and EM interference sit on the channel in between.]
Re-arrange it the Data Science way
[Diagram: Person A's brain → words spoken by Person A → words heard by Person B → Person B's brain, overlaid on the typical data format (Training Set and Test Set, n entries by m features plus answer). The “signal” flows this way: from the Training Set toward the Test Set's answer column.]
Typical Malware Analyst's Knowledge

Network Access | Startup Hooks | Installs Rootkit | Pings to C2 server | Includes string "hacktool" | Digitally Signed by CA | Answer
Yes | Yes | No | Yes | No | No | Bot_X
Yes | No | Yes | No | Yes | No | Hacktool
Yes | Yes | No | No | No | Yes | Known Good
No | Yes | Yes | Yes | No | No | KeyLogger
Yes | Yes | Yes | Yes | No | No | Backdoor

The feature columns come from BlackBox and WhiteBox Analysis; the Answer column comes from Prior Knowledge.
Malware Analyst's Classification Method

Network Access | Startup Hooks | Installs Rootkit | Pings to C2 server | Includes string "hacktool" | Digitally Signed by CA | Answer
Yes | Yes | No | Yes | No | No | Bot_X
Yes | No | Yes | No | Yes | No | Hacktool
Yes | Yes | No | No | No | Yes | Known Good
No | Yes | Yes | Yes | No | No | KeyLogger
Yes | Yes | Yes | Yes | No | No | Backdoor
Yes | Yes | Yes | Yes | Yes | No | Unknown
Yes | Yes | No | Yes | No | Yes | Unknown

BlackBox and WhiteBox Analysis supply the features; Prior Knowledge supplies the answers. Classify the unknown into the known.
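A minimal sketch of that "classify the unknown into the known" step: score each known row by how many yes/no features disagree with the unknown, and pick the closest. The encoding follows the table above; this is only an illustration, not the author's actual tool.

```python
# Hypothetical yes/no feature rows, following the table above:
# (network access, startup hooks, installs rootkit, pings C2, "hacktool" string, signed by CA)
known = {
    "Bot_X":      (1, 1, 0, 1, 0, 0),
    "Hacktool":   (1, 0, 1, 0, 1, 0),
    "Known Good": (1, 1, 0, 0, 0, 1),
    "KeyLogger":  (0, 1, 1, 1, 0, 0),
    "Backdoor":   (1, 1, 1, 1, 0, 0),
}

def classify(unknown_row):
    # Pick the known family whose features disagree with the unknown in the fewest positions.
    def mismatches(label):
        return sum(a != b for a, b in zip(known[label], unknown_row))
    return min(known, key=mismatches)

print(classify((1, 1, 1, 1, 1, 0)))  # closest to Backdoor (one mismatch)
```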
How Machine Learning Works (lots of tiny bits)

Number of Sections | File Size | Entropy (percentile) | Ratio of 1 bits to 0 bits | Includes string "ackto" | Includes string "cktoo" | Answer
3 | 920k | 0.23 | 0.40 | 0 | 0 | Bot_X
2 | 3200k | 0.15 | 0.24 | 1 | 1 | Hacktool
4 | 252k | 0.55 | 0.54 | 0 | 1 | Known Good
3 | 1283k | 0.23 | 0.23 | 0 | 1 | KeyLogger
8 | 884k | 0.44 | 0.21 | 0 | 0 | Backdoor
3 | 885k | 0.34 | 0.98 | 1 | 0 | Unknown
5 | 6413k | 0.78 | 0.45 | 0 | 1 | Unknown

Feature Extraction → Training Set
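A minimal sketch of extracting that kind of "lots of tiny bits" row from raw bytes; the feature set mirrors the table above and the helper name is invented:

```python
import math
from collections import Counter

def tiny_features(data: bytes) -> list:
    """Turn raw bytes into a row of small, individually meaningless features."""
    counts = Counter(data)
    total = len(data) or 1
    # Shannon entropy of the byte distribution, in bits per byte.
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    one_bits = sum(bin(b).count("1") for b in data)
    return [
        len(data),                 # file size
        entropy,                   # byte entropy
        one_bits / (8 * total),    # ratio of 1 bits to all bits
        int(b"ackto" in data),     # contains substring "ackto"
        int(b"cktoo" in data),     # contains substring "cktoo"
    ]

print(tiny_features(b"this is a hacktool sample" * 10))
```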
1st Place Winner Solution
http://blog.kaggle.com/2015/05/26/microsoft-malware-winners-interview-1st-place-no-to-overfitting/
Mental Visualization of Feature Extraction
[Diagram: each feature (opcode n-gram, segment count, virtual size, file size) is a “projection/snapshot” of the file.]
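A minimal sketch of the opcode n-gram projection, assuming the opcode mnemonics have already been pulled out of the disassembly:

```python
from collections import Counter

def opcode_ngrams(opcodes, n=3):
    """Count every length-n window of the opcode sequence."""
    return Counter(tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1))

# A hypothetical opcode stream pulled out of a disassembly listing.
ops = ["push", "mov", "xor", "call", "pop", "ret", "push", "mov", "xor", "call"]
for gram, count in opcode_ngrams(ops).most_common(3):
    print(gram, count)
```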
There are many algorithms,
but these 3 methods will suit your needs:
- Decision trees: XGBoost (most cases)
- Neural nets: Deep Learning (vision + audio)
- Hyperplanes: SVM (small data + scientific data)
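A sketch of the decision-tree option, assuming the xgboost package and its scikit-learn-style XGBClassifier wrapper are available; the feature rows and labels here are made up:

```python
import numpy as np
import xgboost as xgb  # assumes the xgboost package is installed

# Made-up feature rows (section count, file size, entropy, 1-bit ratio) and integer family labels.
X = np.array([[3, 920, 0.23, 0.40],
              [2, 3200, 0.15, 0.24],
              [4, 252, 0.55, 0.54],
              [3, 1283, 0.23, 0.23],
              [8, 884, 0.44, 0.21]])
y = np.array([0, 1, 2, 3, 4])

# Gradient-boosted decision trees; tiny settings because this is a toy example.
model = xgb.XGBClassifier(n_estimators=50, max_depth=3, learning_rate=0.1)
model.fit(X, y)

unknown = np.array([[3, 885, 0.34, 0.98]])
print(model.predict(unknown))
```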
So, plug the winning solution
into the production system,
and call it a day??

“It's like a finger pointing at the moon.
Do not concentrate on the finger,
or you will miss all of the heavenly glory.”
- Bruce Lee
We have learned that these things work well:
- Opcode n-grams (short sequences of actions)
- XGBoost (decision trees, case by case)
Nilsimsa Hash
unsupervised + supervised learning
many similarities to Deep Learning
Nilsimsa Hash
Pipeline: N-gram → Counting Bloom Filter → Normalize → Hamming Distance
Stages: Feature Extraction → Intermediate Info Storage → Importance Selector → Compare
Nilsimsa Hash: N-gram (Feature Extraction)
[Diagram: a sliding window moves over the Opcode Sequence from BOF to EOF, emitting n-grams.]
Nilsimsa Hash: Bloom Filter (Intermediate Info Storage)
Bloom Filter: 0 0 1 0 0 1 0 1   (buckets 0x0000 ... 0x0111)
A good balance of Space, Time, and Allowable Error.
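A minimal Bloom filter sketch (bucket count and hash construction are illustrative): an item that was inserted is always reported as present, and an item that was never inserted may occasionally be reported as present too.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_buckets=8, num_hashes=2):
        self.bits = [0] * num_buckets
        self.num_hashes = num_hashes

    def _positions(self, item: bytes):
        # Derive k bucket positions from a salted hash of the item.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(bytes([salt]) + item).digest()
            yield int.from_bytes(digest[:4], "big") % len(self.bits)

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: bytes) -> bool:
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add(b"mov eax, ebx")
print(bf.might_contain(b"mov eax, ebx"))   # True: inserted items are always found
print(bf.might_contain(b"jmp short $+2"))  # usually False, but a false positive is possible
```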
Nilsimsa Hash: Counting Bloom Filter (Intermediate Info Storage)
Bloom Filter:          0 0 1 0 0 1 0 1
Counting Bloom Filter: 0 0 4 0 0 7 0 1   (buckets 0x0000 ... 0x0111)
Nilsimsa Hash: Normalize (Importance Selector)
Counting Bloom Filter:       0 0 4 0 0 7 0 1   (buckets 0x0000 ... 0x0111)
Normalized:                  0 0 1 0 0 1 0 0   (1 if above avg, 0 if below avg)
Bloom Filter (for comparison): 0 0 1 0 0 1 0 1
Nilsimsa Hash: Hamming Distance (Compare)
Known File 7238: 0 1 0 0 0 1 0 1   (buckets 0x0000 ... 0x0111)
Unknown File:    0 0 1 0 0 1 0 0
Hamming distance of 3.
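Putting the four stages together, a minimal Nilsimsa-flavoured sketch (not the real Nilsimsa constants or trigram selection): count n-grams into a small bucket array, threshold each bucket against the mean, and compare two digests by Hamming distance.

```python
import hashlib

NUM_BUCKETS = 64

def digest(opcodes, n=3):
    """N-gram -> counting buckets -> normalize against the mean -> bit digest."""
    buckets = [0] * NUM_BUCKETS
    for i in range(len(opcodes) - n + 1):
        gram = " ".join(opcodes[i:i + n]).encode()
        h = int.from_bytes(hashlib.md5(gram).digest()[:4], "big")
        buckets[h % NUM_BUCKETS] += 1                       # counting Bloom filter stage
    mean = sum(buckets) / NUM_BUCKETS
    return [1 if count > mean else 0 for count in buckets]  # 1 if above average

def hamming(a, b):
    """Count the positions where two digests differ."""
    return sum(x != y for x, y in zip(a, b))

known   = digest(["push", "mov", "xor", "call", "pop", "ret"] * 20)
unknown = digest(["push", "mov", "xor", "call", "pop", "ret"] * 19 + ["nop", "nop"])
print("Hamming distance:", hamming(known, unknown))
```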
How to improve the Nilsimsa hash
(or: geo-political tension)
1. minimize collisions
2. prevent abuse
3. prepare for the unknowns
Improvements to Nilsimsa Hash (TLSH, by Trend Micro)
Pipeline: N-gram → Counting Bloom Filter → Normalize → Hamming Distance
Counting Bloom Filter: 0 0 4 0 0 7 0 1   (buckets 0x0000 ... 0x0111)
Normalized (Nilsimsa): 0 0 1 0 0 1 0 0
TLSH:                  0 0 2 0 0 3 0 1   (3 if in the 1st quarter, 2 if 2nd quarter, 1 if 3rd quarter, 0 if 4th quarter)
Source: http://www.academia.edu/7833902/TLSH_-A_Locality_Sensitive_Hash
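A sketch of the TLSH-style refinement: each bucket gets a 2-bit code based on which quartile of the bucket counts it falls in, instead of a single above/below-average bit. The exact quartile convention in real TLSH differs in detail, so the example array here is illustrative rather than the slide's numbers.

```python
def quartile_codes(buckets):
    """Give each bucket a 2-bit code: 3 for the top quartile of counts, 0 for the bottom."""
    ordered = sorted(buckets)
    n = len(ordered)
    q1, q2, q3 = ordered[n // 4], ordered[n // 2], ordered[3 * n // 4]
    def code(count):
        if count > q3:
            return 3
        if count > q2:
            return 2
        if count > q1:
            return 1
        return 0
    return [code(count) for count in buckets]

# Illustrative bucket counts; real TLSH chooses its quartile points slightly differently.
print(quartile_codes([12, 0, 3, 40, 7, 1, 25, 5]))
```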
Power Law / Long-Tail
[Figure: power-law / long-tail distribution.]
Improvements to Nilsimsa Hash (Width)
Pipeline: N-gram → Counting Bloom Filter → Normalize → Hamming Distance
CBF:        0 0 4 0 0 7 0 1                       (buckets 0x0000 ... 0x0111)
Longer CBF: 0 0 4 0 0 4 0 1 0 1 0 1 0 0 0 1       (buckets 0x0000 ... 0x0111, 0x1000 ... 0x1111)
HyperLogLog
Hash (baseball)   = 0x0100101110111010
Hash (basketball) = 0x0001010110110011
Hash (blahblah)   = 0x1010111011011010
Hash (rocket)     = 0x0000100101011101
The length of the longest run of leading '0's estimates how many distinct elements have been seen, and therefore how large the CBF needs to be, using only O(1) memory.
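A crude sketch of that estimator, closer to the single-register Flajolet-Martin trick than to full HyperLogLog (which averages many registers): hash each item, track the longest run of leading zero bits, and use 2^r as the rough distinct count.

```python
import hashlib

def leading_zeros(value: int, bits: int = 32) -> int:
    """Count leading zero bits in a fixed-width hash value."""
    return bits - value.bit_length()

def estimate_distinct(items) -> int:
    max_zeros = 0
    for item in items:
        h = int.from_bytes(hashlib.md5(item.encode()).digest()[:4], "big")
        max_zeros = max(max_zeros, leading_zeros(h))
    # If the longest run of leading zeros is r, roughly 2**r distinct items were hashed.
    return 2 ** max_zeros

print(estimate_distinct(["baseball", "basketball", "blahblah", "rocket"]))
```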
Attack: Hash Shaping
Pipeline: N-gram → Counting Bloom Filter → Normalize → Hamming Distance
Cleanfile CBF: 812 12 283 7023 204 34 13 99   (buckets 0x0000 ... 0x0111)
Malicious CBF: 540 23 140  400 170 20 30 80
The high-count bucket is the most important bucket.
If hash(xor – xor – xor – xor) happens to fall into this bucket,
the hash can be shaped by inserting 6500x “XOR EAX, EAX”.
Defend: Resist Hash Shaping, by Dual Hash
Cleanfile CBF, using hash1:  812 12 283 7023 204 34 13 99
Malicious CBF, using hash1:  540 23 140 400 170 20 30 80
hash1(xor – xor – xor – xor) falls into this bucket.
Cleanfile CBF, using hash2:  1032 1094 12 913 375 15 70 5023
These are the most important buckets.
hash2(xor – xor – xor – xor) falls into this bucket. Doh!
Dual Hash:                   1844 1106 295 7936 579 49 83 5122
This bucket should stay low.
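A minimal sketch of the dual-hash idea (bucket count and hash derivation are illustrative): every n-gram is inserted under two independent hash functions, so an attacker's padding n-gram inflates two buckets at once and has to look plausible under both views.

```python
import hashlib

NUM_BUCKETS = 8

def bucket(gram: bytes, salt: int) -> int:
    # Two "independent" hash functions, derived by salting the same digest.
    digest = hashlib.sha256(bytes([salt]) + gram).digest()
    return int.from_bytes(digest[:4], "big") % NUM_BUCKETS

def dual_hash_cbf(grams):
    counts = [0] * NUM_BUCKETS
    for gram in grams:
        counts[bucket(gram, 1)] += 1  # hash1 bucket
        counts[bucket(gram, 2)] += 1  # hash2 bucket
    return counts

# Padding a file with 6500x "xor eax, eax" inflates the two buckets this n-gram
# maps to under hash1 and hash2; both have to look plausible to shape the digest.
padding = [b"xor|xor|xor|xor"] * 6500
print(dual_hash_cbf(padding))
```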
Pipeline recap: N-gram → CBF → Normalize → Hamming Distance
(Feature Extraction → Intermediate Info Storage → Importance Selector → Compare)
PE file → Opcode N-gram → CBF → Normalize → Hamming Distance
Use IDA to convert the binary into assembly.
Prepare for the unknown
PE file → Opcode N-gram → CBF → Normalize → Hamming Distance
? Unknown Architecture
Prepare for the unknown
PE file → Opcode N-gram → CBF → Normalize → Hamming Distance
Unknown Architecture → Select-gram* → CBF → Normalize → Hamming Distance
Select-gram*: n-gram of only the top 10% most frequent elements.
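A minimal sketch of the select-gram idea under the stated assumption: treat only the top 10% most frequent byte values as "popular opcodes" and build n-grams from those alone. The function name is made up for this sketch; no disassembler is involved.

```python
from collections import Counter

def select_grams(data: bytes, n: int = 3, keep_ratio: float = 0.10):
    """N-grams over only the most frequent byte values (an ad-hoc stand-in for opcodes)."""
    freq = Counter(data)
    keep_count = max(1, int(len(freq) * keep_ratio))
    popular = {b for b, _ in freq.most_common(keep_count)}   # top ~10% of distinct byte values
    filtered = [b for b in data if b in popular]             # drop everything else, keep order
    return Counter(tuple(filtered[i:i + n]) for i in range(len(filtered) - n + 1))

# Works on raw bytes from any architecture; no disassembler required.
sample = bytes(range(40)) * 5 + bytes([0x55, 0x8B, 0xEC]) * 200
print(select_grams(sample).most_common(3))
```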
This is the last slide.
It is all about “Communication Theory.”
[Diagram: Person A's brain → words spoken by Person A → words heard by Person B → Person B's brain.]
Thank you
hp.com/go/hpsr
Editor's Notes
  • #2: Thanks for coming. It's late in the afternoon, so I will try to make this presentation interesting and worthwhile for your time. Let's jump right in. In this session, I would like to talk about "malware classification and crowd-sourcing".
  • #3: What malware classification is; the logistics of crowd-sourcing; what crowd-sourcing is good for; and the REAL focus of this talk, what comes after crowd-sourcing. [It starts 20,000 ft up in the air, then zooms down to ground level, to the bit level. I don't like people who just drop "machine learning" as "something smart", so I like to get into the nitty-gritty detail. In a way, it is a Trojan horse: I will come out easy and set the expectation that this is a fluff talk. A still-useful fluff talk.]
  • #5: I love a circular definition. Concept = actual process.
  • #8: Shady = legal, because our user has freedom of choice, but it is in bad taste. (The Apple ecosystem has the notion of "good taste"; the Windows + Android ecosystems are starting to have the notion of "good taste".) <CLICK> So, how does a malware analyst actually make that decision?
  • #9: And we have this thing called "VirusTotal". On the surface it is just a multi-vendor platform, but it is a means of communicating with the rest of the community. It is like asking this room, "have you seen this file? If so, I don't want to duplicate the work." There are already more than enough files to look at.
  • #10: We start to put VT in the pipeline, to detect and to prioritize (if multiple vendors detect it, work on it first). Then we make 3-vendor rules. <CLICK> Inside the circle is what is becoming the "uber-system". So, how do you make this "uber-system"?
  • #11: Subliminal HPE teal-box messaging. The 4 components of the malware classification "system": largely, there are 4 main components in this picture; the next slide describes them in detail.
  • #12: Malware Analysts: malware classification is not simple, as shown before, so we already need a human to make the final call; a human is always the final assurance. There are always edge cases, and the human is central. Data Sharing is "threat intel sharing": VirusTotal is one way, ThreatCentral is another; there are many platforms. Automation IT infrastructure: to process large numbers of files, we need huge server farms (see Vanja's Friday talk). Algorithms are the missing piece, and this is where Data Science comes in. It is also cross-field: algorithms are universal.
  • #13: What is good about "crowd-sourcing", or a "competition", is that you can tap into a large number of smart people from diverse fields.
  • #14: Netflix was the first "internet-scale" competition. For an internet-scale competition, the problem has to be REALLY SIMPLE to explain; these are people who are not familiar with the industry, so it is best if it can be explained within 3 bullet points. It's simple, but not easy: it took the "internet" 3 years to achieve this goal, and the winning team came up with some crazy algorithm.
  • #15: The way you approach this problem is something called "Collaborative Filtering". The simplest visualization: every rating is a line, and you construct this "2-row structure" (users on one row, movies on the other). Once that is done, imagine pouring water into User1; the edges act as conduits, and whichever movie collects the most "water" is the matching movie. It works surprisingly well!
  • #16: 7 min mark
  • #17: Even though this was the first competition from our industry, many companies already do this on Kaggle. The Netflix competition dragged out too long; 3 years was too long, and we don't have that long an attention span. They also found that it is not the prize that draws the crowd; it is ego that drives people, and the job offers. With a shorter cycle and smaller winnings, they went with fast iteration and improved the competition format. It is like having 200 research teams, all working on different methods; it is tremendous "brain collecting". ZDI is one method of crowd-sourcing research.
  • #18: Now it has a large user base. It is its own world, like security people have their own world. Depending on how interesting the topic is, or how easy the competition is, it draws on average 200 to 1000 teams. Most of the registered are non-players, or just want to watch. But the top 10% are very good.
  • #19: Month 3 leap-frogging: once you have put 200 hours into it and other people do better, you get "angry" and try everything. The best thing about these competitions is insight sharing: the insights from these diverse, short-cycle competitions "nurture" the next competition. You learn that some method works well for one dataset, then you apply a similar method to a different set.
  • #20: Every competition is a little different. It is similar to malware analysis: you need to quickly understand the core of the problem set. Security has been a popular topic for a while; on average it draws 2x the usual participation, and this is good for the security industry. This was one of the largest datasets; most datasets are 5 MB to 200 MB, if not image files. You can't do it on a MacBook Air; with data this large, the first pass needs to be a single pass. For most of the crowd, the PE file structure is foreign (it took us a while to get familiar with it).
  • #21: Data scientists like to view things as a "matrix". This is the standard table format: if you give data in this format, they are happy; if the data is in some other format, they can't do anything until it is transformed into this format. N entries, M features, answers. You are given 2 sets of data: the training set is the set with the answer, the test set is the set missing the answer. And your objective is to… <CLICK> The objective is to guess this value. Now, most people get stuck on what to do with the data at this point. They try to learn the new frameworks that come out every month and blindly plug in the data. So, let me offer you an easy framework of thought.
  • #22: I was looking into the old works of Shannon, Turing, Weaver, and Wiener, whom I consider the "founding fathers" of computers. How they visualized data back then is still the best way to explain it. I like "communication theory" or "information theory", because it is technical as well as philosophical. The best way to communicate is to minimize the loss of signal between each step.
  • #23: Let's rearrange the "communication" in a slightly different way.
  • #24: It's like magic: now everyone can see the angle. This is how you should think: you are sitting in the "answer" region of the training set, and you want to pass a message to the "answer" in the test set. So, the best strategy is to reduce noise while pushing as much signal as possible to the next step. Find any link that is losing too much signal, and fix that.
  • #25: The first rule of the competition is that "there are no rules". It gets fierce. 1. There is always some form of data leak; that path is the data-leak path, and what we are doing is "data-leak collecting". But sometimes too much data is leaked in a single variable (sometimes machine-measured data has least-significant digits acting as a hash value). 2. The leaderboard is on the test set; cross-validation is within the training set. If the training/test split is not even, trusting the wrong board doesn't get you anywhere. 3. If a bad error metric is chosen, one weird outlier can mess up the whole game, and it becomes about hunting out that one "error"; in this case it is more about debugging. 4. Sometimes, if you understand well how the error metric works, you can score high even without seeing the data. 5. Most algorithms are geared toward a compact matrix format, so if it is not a typical "toy dataset", they do not perform well. Narrow means there are only a few features; when not enough information is given about an object, you use strategies to maximize feature creation, such as one-hot encoding/pairing. Wide means there are lots of features, but you don't know which are important. Sparse means there are lots of 0s in the data. Dense means there is a lot of information in a small space.
  • #26: In my days as a malware analyst, I used to have a chart similar to this. BlackBox and WhiteBox analysis are mainly for extracting the existence of these functionalities. You extract all these "functionality" features and use logic and common sense to determine whether a file is bad or not. A skilled malware analyst can identify malware within about 10 minutes, and in those 10 minutes we look for "red flags"; there are probably about 30 to 50 things you look for. At one point, I tried to make a system that classifies malware using this method.
  • #27: And when a new file comes in, we use this logic to figure out what the unknown file is.
  • #28: I used to make a classification tool using the previous slide's method, but along the way I tried something different. In order to maximize the computer's ability, you need to think a little bit differently. The main strategy is "lots of tiny things". The human brain can't hold 20+ features in its head, so we simplify things into "chunks". Each functionality, such as network access, has "meaning", which means it has already been "chunked". A computer works before "chunking": machines don't have logic, only statistics. In that world, the best approach is lots of meaningless bits, but in very large numbers. Like file sizes (who remembers that stuff…), or the number of occurrences of ASCII 61. Computers can do that easily; what they excel at is doing lots of arithmetic, really, really fast.
  • #29: To make an AI, the human is a good benchmark, but not exactly the target. A good analogy is birds and airplanes: "flying" is what is important. You are trying to make a "flying machine"; it does not have to have "feathered, light-weight, folding wings".
  • #30: This is the solution given by the first-place team. This is how data scientists communicate with each other: with this single picture, everyone knows how to replicate it. Feature engineering: golden features. These are the features that, if missing, mean you cannot win the competition; or, if you are running out of time/computing power, just delving into these features is better than adding other noisy features. Opcode n-grams are important. Segment/section count is important (in real practice I don't think this would be useful, or it could be easily fooled, but for this competition it worked great). Asm pixel intensity is something security people would not think of, but those from "vision" would use it; we could use entropy or something similar, but we could not see a program as pixels. This is what is great about crowd-sourcing. Then, modeling: XGBoost, gradient boosting. It is basically building a tree, then building additional depth to minimize the error, then repeating (plus some other clever methods). Ensemble: basically a random forest. Separate the training dataset into pieces, and let them vote later.
  • #31: But to all the Kagglers, winning the competition is about feature engineering. Feature extraction: what does it mean? Think of features like senses; missing a feature is like missing a sense, like missing the smell of food. There are frameworks to try every algorithm at the same time, so it comes down to "conceptualizing the data", or the feature engineering, that decides the winner. Let's say you are collecting color: you don't have to know that "apple" is "red", but you need to know that "color" is important.
  • #32: 22 min mark. Feature extraction is like taking a snapshot. Missing a critical feature is like, for example, food: it looks good, but how does it smell? Is it hot or cold? Is it spicy or sweet?
  • #33: Many people think "data science" == "very difficult, uber-smart algorithms". For most competitions, the winning team has won using one of these 3 algorithm families, and the rest was feature engineering. You don't have to understand everything, but what is important is that you SHOULD KNOW how it works. It is similar to driving a car: you don't need to know exactly how every part of the car works, but you do need to understand how the internal combustion engine works, how the outside temperature affects it, and how the physics works, such as a large mass having more momentum than a smaller one. These days there are frameworks that apply every algorithm, so picking an algorithm matters less. These are 3 good algorithm families. XGBoost is a GBM (gradient boosted machine); boosting is a way of building trees to minimize misclassification. In the Kaggle world, the global top 10 have "god status", and a first-place winner has been saying "when in doubt, use XGBoost"; it is flexible. Deep Learning is the kind-of-new hot method, and for computer vision challenges DL has been sweeping the wins. For smaller data, or scientific data such as weather data where there is no fine-grained manipulation, SVR has been working well; if you are sure there is some "magic mathematical formula", as in physics, then SVR works well.
  • #35: 23 min mark. Read the quote. Netflix actually didn't put the winning solution into the production system. For crowd-sourcing, and for everything else in life, it is about taking that knowledge and making it better.
  • #36: This might be "backward reasoning": opcode sequences work well, and this must be the reason. Data science has no such thing; it works, and we are trying to conceptualize why it works. Then you use the same conceptualization for other problem sets. "Imagine you are a CPU." I don't think this is that difficult for this conference crowd.
  • #37: I like communication theory; I like linguistics. A 30-second intro to NLP: for human language, this is the very basic structure. In a sentence, the central piece of information is the verb: the action, what is happening. Thus, if you are looking for the "golden feature" in NLP, you look for verbs. And computers are no different. We are still under the "mental frameworks" of communication theory; humans and computers are both "communicating organisms".
  • #38: 30-second intro to NLP, again: in a sentence, the central piece of information is the verb, so the "golden feature" in NLP is the verb. Computers are no different; humans and computers are both "communicating organisms" under the mental framework of communication theory.
  • #39: XGBoost is awesome, but it is not a "solve it all". I don't like it that much, because a complicated tree is hard to interpret. Remember the 3 "usually winning" algorithms: XGBoost, Deep Learning, and SVR. SVR is parametric; don't use SVR for malware, it just does not work. XGBoost is non-parametric. Deep Learning is hard to pin down; it is a stacking of parametric components in a non-parametric way. XGBoost worked well for this competition because the problem set was easy and didn't contain junk: it classifies between malware families, with no "junk", "blah", or "shady", from about 30,000 files. In the real world, XGBoost might not be enough…
  • #40: …If you are familiar with the machine learning scene, deep learning has been the hottest topic.
  • #41: Then I came across the Nilsimsa hash. I didn't make it; it is of mysterious origin, and it looks remarkably similar to a DNN.
  • #42: It is a BEAST of an algorithm: a mix of modular arithmetic, frequency analysis, and efficient storage. You can view it in 4 large stages. I prefer this "intermediate storage" view, because it is easier to trace back.
  • #43: Sliding window. N-grams. Could do byte-grams. Feature hashing.
  • #44: A Bloom filter is already an awesome data structure. In engineering we usually talk about space and time trade-offs; a Bloom filter is a mix of space + time + allowable-error trade-offs. If it is there, the filter will say it is there; if it is not there, the filter could still say it is there. It is awesome. Look it up.
  • #45: A Counting Bloom Filter is a modification of the Bloom filter: it is a Bloom filter that can count past 1. Instead of yes or no (binary), we can store a quantity. N-gram + CBF is just a way of storing a LARGE feature space in the tightest possible space. In data science this is called "feature hashing", though usually without a Bloom filter; the Bloom filter can reduce the memory footprint significantly.
  • #46: We are turning back to a binary format here. This is the step where each count is compared against the average: each small snippet is compared against the whole file, mostly to prep it for the final stage of Hamming distance.
  • #47: And all you need to do is find the known file that is the lowest Hamming distance away. Hamming distance is "XOR bitwise, then count the 1 bits." It is much faster than edit distance.
  • #48: 26 min mark. We don't want to stop here. Just read it and go over each element in the next slides. (This is such a generic slide that it could fit anywhere…)
  • #49: We are not the only company that has been researching improvements to Nilsimsa; Trend Micro did this, and it's a very good improvement. Are there any other vendors looking into this? Please raise your hand. <then, explain what Trend did> Why this works well is… <next slide>
  • #51: You can view this as minimizing collisions. It would double the memory size, but at this point the memory is already tiny.
  • #52: If TLSH improves the "height" of the CBF, we could also look at the "width", earlier rather than later, since once info is lost you can't get it back. If 2 different elements fall into the same bucket in the CBF, that is where info-collision, or info-loss, is occurring. Basically, the CBF needs to be a large enough space to contain the information/entropy. So, how do you measure a good size?
  • #53: HyperLogLog is a cheap and fast way to estimate how much space you need. This is how Bitcoin/blockchain decide the "hash difficulty".
  • #54: A competition only cares about accuracy, but in the real world our industry has to deal with deliberate attack. A Bloom filter is a place where hash collisions are the norm.
  • #55: If dual-hashed, the memory requirement doubles; in security, if preventing attack is important, we can spend some space. Feeling paranoid? Do a triple hash. (It is dual hash, not double-hash == hash(hash(val)) as in blockchain.)
  • #56: There are more ways to improve. We can make our problem more difficult. Why? Because it is fun.
  • #57: The PE file has been transformed into an ASM file using a disassembler. This assumes we know the PE file structure well.
  • #58: Let's say some country or some organization made a totally different computer architecture. You know it is compiled code, or some sort of language; there is a compiler with a totally different opcode structure. We don't have a disassembler. What to do?
  • #59: Because what is important is the frequency of sequences of the frequent elements, we make our own ad-hoc disassembler. "Select-gram" is a word I made up; I don't know what it is called in the industry. Basically, take the most frequently occurring bytes: these will be the "popular opcodes". Any efficient language cannot escape this pattern; if it did, it would start to lose the notion of being a language.
  • #60: This is the last slide. If you have been dozing off, this is the single slide to take home. Data science will get big in the near future, and if you are stuck, you should come back to this framework. All you need to do is pass as much information as possible to the next step, without introducing too much noise in the link. And the rest will work out by itself.