Recognizing malware
Libor Mořkovský
Computer virus
[Figure: bacterial cell, based on work by Anderson Brito]
Computer virus
Inserting code into files is never “good”.
[Figure: executable file with its entry point marked]
Malware
How do you recognize a thief?
[Images courtesy of Looking Glass Studios, Twentieth Century Fox, Paramount Pictures]
Malware
• completely different behaviors are considered “bad”
• we need a judge to decide who crossed the line
Malware | Many faces
• unlike real thieves, malware can be duplicated
• not only duplicated, but also modified
• all this is done by machines
• too much work to judge each one manually
Finding similar files
[Figure: scatter plot of files on axes MDS1 and MDS2, points labeled by class: CLEAN, MALWARE, QUERY, UNKNOWN]
Finding similar files
• need a file representation
• need a distance function
Finding similar files | File vector
• each executable file is represented by a feature vector
• the PE format is complex, so we keep exactly one version of the extractor code (C++)
• the vector comprises static and dynamic features; the exact content is proprietary
Database record
• One record = constant vector of over 100 attributes
• the “file fingerprint”
• Each attribute has a data type and semantic

Attribute              Data type       Semantic
sha256                 32-byte array   CHECKSUM
pe_sect_cnt            uint16_t        VALUE
pe_sect_rawoff_entry   uint32_t        OFFSET

• The complete contents of the vector are kept secret
• static and dynamic features of PE executables
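The record layout above can be sketched as a small schema. This is only an illustration: it lists the three attributes shown on the slide (the real vector has over 100 and is not public), and the names `Semantic`, `Attribute`, `SCHEMA` and `make_record` are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Semantic(Enum):
    # The three semantics that appear on the slide; others presumably exist.
    CHECKSUM = "CHECKSUM"
    VALUE = "VALUE"
    OFFSET = "OFFSET"


@dataclass(frozen=True)
class Attribute:
    name: str
    data_type: str      # type name as listed on the slide
    semantic: Semantic


# Hypothetical schema holding only the attributes shown on the slide.
SCHEMA = (
    Attribute("sha256", "32-byte array", Semantic.CHECKSUM),
    Attribute("pe_sect_cnt", "uint16_t", Semantic.VALUE),
    Attribute("pe_sect_rawoff_entry", "uint32_t", Semantic.OFFSET),
)


def make_record(**values):
    """Build a fixed-layout record: a tuple ordered like SCHEMA."""
    missing = [a.name for a in SCHEMA if a.name not in values]
    if missing:
        raise ValueError(f"missing attributes: {missing}")
    return tuple(values[a.name] for a in SCHEMA)


# Made-up values, for illustration only.
record = make_record(sha256=bytes(32), pe_sect_cnt=5, pe_sect_rawoff_entry=0x400)
```

Keeping every record in the same attribute order is what lets a later stage pick the partial distance function by position alone.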
Finding similar files | Distance
• sum of partial distances
• each distance operator assigned manually
• weights assigned manually to equalize contribution
Nearest neighbor query
• Compound distance function
• Data type and semantic determine the partial distance function

Data type       Semantic   Partial distance function
32-byte array   CHECKSUM   RETURN_ZERO
uint16_t        VALUE      EQUAL_RET32
uint32_t        OFFSET     LOG

• Each partial distance function = one kernel function
• Over 100 kernels for every NN query
• Intermediate results kept in the “Scratchpad”
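A minimal sketch of such a compound distance follows. The slide gives only the names RETURN_ZERO, EQUAL_RET32 and LOG; the behavior of each function below is a guess from those names, and the weights are placeholders, not the production values.

```python
import math


def return_zero(a, b):
    # Assumed: a checksum identifies a file but says nothing about similarity.
    return 0.0


def equal_ret32(a, b):
    # Assumed from the name: 0 when equal, a fixed penalty of 32 otherwise.
    return 0.0 if a == b else 32.0


def log_distance(a, b):
    # Assumed: difference on a log2 scale, so large offsets are compressed.
    return abs(math.log2(a + 1) - math.log2(b + 1))


PARTIAL = {"CHECKSUM": return_zero, "VALUE": equal_ret32, "OFFSET": log_distance}

# Semantics of the three attributes shown on the slide, in record order.
SEMANTICS = ("CHECKSUM", "VALUE", "OFFSET")
WEIGHTS = (1.0, 1.0, 1.0)  # hypothetical; the talk says weights are tuned by hand


def compound_distance(x, y):
    """Weighted sum of per-attribute partial distances."""
    return sum(
        w * PARTIAL[s](a, b)
        for s, w, a, b in zip(SEMANTICS, WEIGHTS, x, y)
    )


x = (b"a" * 32, 5, 1023)
y = (b"b" * 32, 6, 2047)
d = compound_distance(x, y)  # 0 (checksum) + 32 (count differs) + 1 (one log2 step)
```

Because each term is non-negative and symmetric, the weighted sum stays symmetric too, which matters for any index structure built on top of it.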
Finding similar files | Data
• ~60 M data points
• sparse and well separated (in many cases)
Finding similar files | Implementation
• we started with GPUs
• their high memory throughput allows “naive” implementation and rapid prototyping
• column-oriented database
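The “naive” approach can be illustrated on the CPU as a brute-force scan over a column-oriented store, with one kernel per attribute accumulating into a shared scratchpad. This is a sketch, not the GPU implementation from the talk: the data values, kernel bodies and function names are assumptions.

```python
import math

# Column-oriented store: one list per attribute; row i is entry i of each list.
# Values are made up for illustration.
columns = {
    "pe_sect_cnt": [5, 5, 3, 8],
    "pe_sect_rawoff_entry": [1023, 2047, 1023, 65535],
}


def kernel_equal(column, q, scratchpad, weight=1.0, penalty=32.0):
    # One pass over a single column; adds a fixed penalty where values differ.
    for i, v in enumerate(column):
        if v != q:
            scratchpad[i] += weight * penalty


def kernel_log(column, q, scratchpad, weight=1.0):
    # One pass over a single column; adds the log2-scale difference.
    lq = math.log2(q + 1)
    for i, v in enumerate(column):
        scratchpad[i] += weight * abs(math.log2(v + 1) - lq)


KERNELS = {"pe_sect_cnt": kernel_equal, "pe_sect_rawoff_entry": kernel_log}


def nearest_neighbors(query, k=1):
    """Brute-force scan: run every kernel, then pick the k smallest sums."""
    n = len(next(iter(columns.values())))
    scratchpad = [0.0] * n          # running distance for every stored row
    for name, kernel in KERNELS.items():
        kernel(columns[name], query[name], scratchpad)
    order = sorted(range(n), key=scratchpad.__getitem__)
    return [(i, scratchpad[i]) for i in order[:k]]


hits = nearest_neighbors({"pe_sect_cnt": 5, "pe_sect_rawoff_entry": 1023}, k=2)
```

Each kernel reads one contiguous column and writes one contiguous scratchpad, an access pattern that fits hardware with high memory throughput.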
Classification | Requirements
• find easily what is responsible for a mistake – transparency
• fix the problem quickly – tractability
Classification | Algorithm
Instance based classifier.
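The slide does not say which instance-based scheme is used. As one common example, the sketch below does a k-nearest-neighbor majority vote with a distance cutoff; `k`, `max_distance` and the function name `classify` are assumptions. It returns the supporting neighbors together with the label, so that a wrong verdict can be traced to specific stored instances.

```python
from collections import Counter


def classify(query, labeled, distance, k=3, max_distance=10.0):
    """labeled: list of (vector, label) pairs.

    Returns (label, neighbors), where neighbors are the (distance, vector,
    label) triples that produced the verdict.
    """
    scored = sorted(
        ((distance(query, vec), vec, label) for vec, label in labeled),
        key=lambda t: t[0],
    )
    # Only neighbors within the cutoff may vote; otherwise stay undecided.
    near = [t for t in scored[:k] if t[0] <= max_distance]
    if not near:
        return "UNKNOWN", []
    votes = Counter(label for _, _, label in near)
    label, _ = votes.most_common(1)[0]
    return label, near


# Toy 1-D data, for illustration only.
samples = [(1, "CLEAN"), (2, "CLEAN"), (10, "MALWARE"), (11, "MALWARE"), (12, "MALWARE")]
label, evidence = classify(11.5, samples, lambda a, b: abs(a - b))
```

Fixing a mistake in such a scheme can be as simple as relabeling or removing the instances listed in `evidence`, which is one way to read the transparency and tractability requirements.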
Classification | Optimizations
• scaling and HW problems with GPUs
• we invested in algorithmic optimizations: VP-tree, distance-bounded search
• hand-optimized distance function (assembly)
• CPU version is ~100x faster
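A vantage-point tree with a distance-bounded nearest-neighbor search can be sketched as follows. This is a textbook version under stated assumptions, not the production code: it needs a distance that obeys the triangle inequality, picks vantage points at random, and all names (`VPNode`, `build`, `nearest`) are hypothetical.

```python
import math
import random


class VPNode:
    def __init__(self, point, radius, inside, outside):
        self.point = point        # vantage point
        self.radius = radius      # median distance from the vantage point
        self.inside = inside      # subtree of points with distance < radius
        self.outside = outside    # subtree of points with distance >= radius


def build(points, distance, rng=None):
    if rng is None:
        rng = random.Random(0)
    points = list(points)
    if not points:
        return None
    vp = points.pop(rng.randrange(len(points)))
    if not points:
        return VPNode(vp, 0.0, None, None)
    dists = [distance(vp, p) for p in points]
    radius = sorted(dists)[len(dists) // 2]
    inside = [p for p, d in zip(points, dists) if d < radius]
    outside = [p for p, d in zip(points, dists) if d >= radius]
    return VPNode(vp, radius, build(inside, distance, rng), build(outside, distance, rng))


def nearest(node, query, distance, bound=float("inf")):
    """Return (distance, point) of the nearest point closer than bound,
    or (bound, None) if no stored point is that close."""
    best = [bound, None]

    def visit(n):
        if n is None:
            return
        d = distance(query, n.point)
        if d < best[0]:
            best[0], best[1] = d, n.point
        # Triangle inequality: skip a subtree if none of its points can beat best[0].
        if d < n.radius:
            visit(n.inside)
            if d + best[0] >= n.radius:
                visit(n.outside)
        else:
            visit(n.outside)
            if d - best[0] < n.radius:
                visit(n.inside)

    visit(node)
    return best[0], best[1]


tree = build([(0.0, 0.0), (1.0, 1.0), (4.0, 4.0), (5.0, 5.0)], math.dist)
d, p = nearest(tree, (0.9, 1.2), math.dist)
```

Starting the search with a finite `bound` prunes more aggressively: if only neighbors closer than some threshold can influence the verdict, whole subtrees beyond it are never visited.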
Classification | Deployment
[Diagram: data flows between the components Medusa, Scavenger, Avast users and FileRep]
• → File SHA and user id →
• ← File prevalence ←
• ← File classification ←
• → File fingerprint →
• ← Generic detections ←
• ↑ File classifications and Evo-gen detections
• → Threats →
• Set updates ↓
Rule generator
• detect more variants in the wild
• (our) rule is a conjunction of several conditions
• known as Win32:Evo-Gen
• completely different optimization problem than classification; still uses the GPU
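A rule of this shape, a conjunction of conditions over a file fingerprint, can be sketched as below. The actual Evo-Gen conditions are not given in the talk; the two predicates and the names `Rule` and `example-rule` are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Sequence

# A condition is any predicate over a fingerprint (attribute name -> value).
Condition = Callable[[Mapping[str, int]], bool]


@dataclass(frozen=True)
class Rule:
    name: str
    conditions: Sequence[Condition]

    def matches(self, fingerprint):
        # Conjunction: every condition must hold for the rule to fire.
        return all(cond(fingerprint) for cond in self.conditions)


# Hypothetical conditions on the two numeric attributes shown earlier.
rule = Rule(
    "example-rule",
    (
        lambda f: f["pe_sect_cnt"] == 5,
        lambda f: 0x400 <= f["pe_sect_rawoff_entry"] < 0x800,
    ),
)

hit = rule.matches({"pe_sect_cnt": 5, "pe_sect_rawoff_entry": 0x600})
miss = rule.matches({"pe_sect_cnt": 4, "pe_sect_rawoff_entry": 0x600})
```

Generating such rules means searching for a set of conditions that covers many malware variants and no clean files, which is a different problem from finding nearest neighbors.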
Libor Mořkovský - Recognizing Malware
Q&A
