Motivation Methodology Evaluation Conclusion
Machine Learning for Malware Detection:
Beyond Accuracy Rates
Lucas Galante, Marcus Botacin, André Grégio, Paulo Lício de Geus
SBSEG 2019
Machine Learning for Malware Detection: Beyond Accuracy Rates 1 / 17
Agenda
1 Motivation
Motivation
2 Methodology
Malware Classifier
3 Evaluation
Beyond Accuracy Rates
4 Conclusion
Conclusion
Motivation
Malware Increase
Figure: Increase of 46% in malware activity in Q2 of 2019.
https://tinyurl.com/y6qzn83h
Malware Classification
Figure: Necessity of antimalware applications.
https://tinyurl.com/y2bz5k58
Malware Classifier
Extracted Features
Table: Malware features classified according to extraction method (static or dynamic) and representation (discrete or continuous).
Static, discrete: Embedded files, Disassembly fail, /home string, ptrace syscall, /sys string, Network strings, Linkage, Header present, UPX, passwd string, fork syscall, compiler string
Static, continuous: Size sections, # headers, /home string, # .dynamic, /sys string, # sections, passwd string, # symbols, # libs, # relocations, Size sample, # debug section
Dynamic, both representations: fork syscall, /proc access, ptrace syscall, /home access, socket syscall, passwd access, mmap syscall, permission denied, SIGTERM, SIGSEGV
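The difference between the two representations can be illustrated with a minimal Python sketch. It assumes that a discrete feature records whether an event occurred and a continuous feature records how many times it occurred; that reading, the event list, and the `encode` helper are illustrative assumptions, not the authors' extraction code.

```python
from collections import Counter

# Hypothetical event names, loosely following the dynamic features in the table.
TRACKED_EVENTS = ["fork", "ptrace", "socket", "mmap", "SIGTERM", "SIGSEGV"]

def encode(trace):
    """Return (discrete, continuous) feature vectors for one execution trace."""
    counts = Counter(trace)
    # Continuous representation: number of occurrences of each tracked event.
    continuous = [counts[name] for name in TRACKED_EVENTS]
    # Discrete representation: 1 if the event occurred at least once, else 0.
    discrete = [1 if c > 0 else 0 for c in continuous]
    return discrete, continuous

# A made-up trace, used only to show the two encodings side by side.
trace = ["mmap", "mmap", "fork", "mmap", "socket", "SIGSEGV"]
discrete, continuous = encode(trace)
print(discrete)    # [1, 0, 1, 1, 0, 1]
print(continuous)  # [1, 0, 1, 3, 0, 1]
```

The discrete vector discards magnitude, so two samples that call mmap once and a thousand times look identical to the classifier. The continuous vector keeps that information.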
Classification Overview
Figure: Overview of classification process.
Beyond Accuracy Rates
Importance of a Good Feature Extraction Procedure
SVM classification of static continuous features.
Kernel / Iterations (#)    1000      10000     100000
Poly                       49.32%    49.74%    49.95%
Linear                     73.87%    77.64%    80.94%
rbf                        84.92%    84.92%    84.92%
SVM classification of dynamic continuous features.
Kernel / Iterations (#)    1000      10000     100000
Poly                       49.92%    49.76%    50.71%
Linear                     93.73%    86.51%    86.73%
rbf                        92.63%    92.63%    92.63%
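The two tables can be compared kernel by kernel. The sketch below uses only the accuracies reported on this slide; the variable and function names are illustrative and not taken from the authors' code. It averages each kernel over the three iteration budgets and reports the gain of dynamic over static continuous features.

```python
from statistics import mean

# Accuracies (%) copied from the slide; columns are 1000, 10000, 100000 iterations.
static_continuous = {
    "poly":   [49.32, 49.74, 49.95],
    "linear": [73.87, 77.64, 80.94],
    "rbf":    [84.92, 84.92, 84.92],
}
dynamic_continuous = {
    "poly":   [49.92, 49.76, 50.71],
    "linear": [93.73, 86.51, 86.73],
    "rbf":    [92.63, 92.63, 92.63],
}

def mean_gain(static, dynamic):
    """Per-kernel difference of mean accuracy (dynamic minus static), in points."""
    return {kernel: mean(dynamic[kernel]) - mean(static[kernel]) for kernel in static}

gains = mean_gain(static_continuous, dynamic_continuous)
for kernel, gain in gains.items():
    print(f"{kernel:>6}: {gain:+.2f} points")
```

With these numbers the gain is about 0.5 points for the polynomial kernel, 11.5 for the linear kernel, and 7.7 for rbf. The change of feature source matters far more for the linear and rbf kernels than any change of iteration budget within one table.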
Importance of Evaluated Datasets
Mixed dataset. Random Forest classification of static continuous features.
Max Depth / Estimators (#)    16        32        64
8                             99.17%    99.06%    99.20%
16                            99.13%    99.06%    99.09%
32                            99.09%    99.13%    99.17%
VirusTotal dataset. Random Forest classification of static continuous features.
Max Depth / Estimators (#)    16        32        64
8                             94.29%    94.35%    94.24%
16                            94.24%    94.14%    94.08%
32                            94.08%    94.14%    94.19%
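One way to read these two grids is to compare how much accuracy moves when the hyperparameters change against how much it moves when the dataset changes. The sketch below does that arithmetic on the values reported on this slide; the `summarize` helper and the variable names are illustrative assumptions.

```python
from statistics import mean

# Accuracies (%) copied from the slide, flattened over the 3x3 hyperparameter grid.
mixed = [99.17, 99.06, 99.20, 99.13, 99.06, 99.09, 99.09, 99.13, 99.17]
virustotal = [94.29, 94.35, 94.24, 94.24, 94.14, 94.08, 94.08, 94.14, 94.19]

def summarize(accuracies):
    """Return (mean accuracy, max-min spread across the grid), both in points."""
    return mean(accuracies), max(accuracies) - min(accuracies)

mixed_mean, mixed_spread = summarize(mixed)
vt_mean, vt_spread = summarize(virustotal)
dataset_gap = mixed_mean - vt_mean

print(f"mixed:      mean {mixed_mean:.2f}, spread {mixed_spread:.2f}")
print(f"VirusTotal: mean {vt_mean:.2f}, spread {vt_spread:.2f}")
print(f"gap between datasets: {dataset_gap:.2f} points")
```

Within each dataset the grid moves accuracy by less than 0.3 points, while switching datasets moves it by roughly 4.9 points. On these numbers the choice of evaluation dataset dominates the choice of Random Forest hyperparameters.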
Analyst Importance
SVM classification of dynamic continuous features.
Kernel / Iterations (#)    1000      10000     100000
Poly                       50.91%    54.05%    58.16%
Linear                     97.97%    97.56%    80.35%
rbf                        98.54%    98.54%    98.54%
SVM classification of dynamic discrete features.
Kernel / Iterations (#)    1000      10000     100000
Poly                       79.68%    79.91%    79.91%
Linear                     96.48%    96.48%    96.48%
rbf                        96.35%    96.35%    96.35%
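How stable each representation is across kernels and iteration budgets can be quantified with a simple spread measure. The sketch below is one possible summary, using the range and the population standard deviation over the nine cells of each table. The choice of these two measures and the `spread` helper are assumptions made for illustration.

```python
from statistics import pstdev

# Accuracies (%) copied from the slide, flattened over kernel x iteration budget.
continuous = [50.91, 54.05, 58.16, 97.97, 97.56, 80.35, 98.54, 98.54, 98.54]
discrete = [79.68, 79.91, 79.91, 96.48, 96.48, 96.48, 96.35, 96.35, 96.35]

def spread(accuracies):
    """Return the max-min range and population standard deviation, in points."""
    return {
        "range": max(accuracies) - min(accuracies),
        "stdev": pstdev(accuracies),
    }

continuous_spread = spread(continuous)
discrete_spread = spread(discrete)

print(f"continuous: range {continuous_spread['range']:.2f}, stdev {continuous_spread['stdev']:.2f}")
print(f"discrete:   range {discrete_spread['range']:.2f}, stdev {discrete_spread['stdev']:.2f}")
```

The continuous table spans about 47.6 points between its worst and best cell, against 16.8 points for the discrete table. The best continuous configuration (rbf, 98.54%) is still higher than the best discrete one (linear, 96.48%), so the analyst trades peak accuracy against sensitivity to configuration.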
What ML results teach us
Static feature importance
Discrete                     Continuous
Network strings   40%        Binary size        27%
UPX present       17%        # headers          16.70%
passwd strings    1.40%      # debug sections   0.20%
Dynamic feature importance
Discrete                     Continuous
mmap              50%        # mmap             68%
fork              6%         # fork             10.80%
SIGSEGV           10.60%     # SIGSEGV          1.30%
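Importance values like these can feed back into feature extraction, for example by flagging low-importance features for review. The sketch below is only an illustration of that idea: the 5% threshold and the `split_by_importance` helper are hypothetical, and the slides do not describe how the authors acted on these values.

```python
# Importances (%) copied from the slide; three features are listed per group.
static_discrete = {"Network strings": 40.0, "UPX present": 17.0, "passwd strings": 1.40}
dynamic_continuous = {"# mmap": 68.0, "# fork": 10.80, "# SIGSEGV": 1.30}

def split_by_importance(importances, threshold):
    """Rank features by importance and split them at a threshold (in percent)."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    keep = [name for name, value in ranked if value >= threshold]
    review = [name for name, value in ranked if value < threshold]
    return keep, review

# Hypothetical cut-off: features below 5% importance are flagged for review.
keep, review = split_by_importance(static_discrete, threshold=5.0)
print(keep)    # ['Network strings', 'UPX present']
print(review)  # ['passwd strings']
```

Applied to the dynamic continuous group with the same threshold, the call keeps `# mmap` and `# fork` and flags `# SIGSEGV`.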
Conclusion
Our results show that:
Dynamic features outperform static features
Discrete features present smaller accuracy variance
Distinct dataset characteristics pose challenges to ML models
Feature analysis can be used as feedback information
Acknowledgement
This work is supported by:
Brazilian National Council for Scientific and Technological Development
CESeg assistance
Questions, Criticism, and Suggestions.
Contact
galante@lasca.ic.unicamp.br
Complete version
https://github.com/marcusbotacin/ELF.Classifier
Previous work
https://github.com/marcusbotacin/Linux.Malware
Reverse Engineering Workshop
Thursday @ 13:30