David Pryce, Wandera
Detecting Mobile Malware with Apache Spark
#DSSAIS12
Summary
• The problem: Mobile-first malware detection
• The data and features
• The Machine Learning (ML) model
• Why Apache Spark?
• Making it production ready
• Data Science @ Wandera
The power of enterprise mobility
• Devices are prone to security threats
• Concerns around appropriate usage
• Data usage costs are opaque and spiraling
• Potentially exposing sensitive data
• Seamless internal communication
• Added flexibility to working hours
• Access to more apps and productivity tools
• E-mail and other services available anywhere
Happy hunting ground for attackers
“Mobile threats can no longer be ignored”
- AUGUST 2017 - GARTNER MARKET GUIDE TO MOBILE THREAT DEFENSE
• 100%: mobile malware growth in 2016
• 435%: high severity threats (CVSS) growth in 2016
• 80% of organizations experienced a mobile phishing attack
• 38% of hackers bypass endpoint defense using social engineering
Introducing the Secure Mobile Gateway
• On-device detection
• In-network protection
The rise of mobile malware
Credit: GData 2017
Our objectives: Identify and Classify
Malware types:
• SMS
• Ransomware
• Spyware
• Banker
• Trojan
• Rooting
• Adware
Why is this a novel problem?
• Mobile malware is on the rise
• Signature-based detection is no longer scalable or effective
• We needed a solution that could
◦ work across both known and unknown threats;
◦ effectively protect our customers; and
◦ enable threat research to quickly identify new outbreaks
• First solution = signatures and lists
• Our solution = machine learning!
The data…
Good and bad apps
• Source 1: official app stores
• Source 2: seen on our devices
• Source 3: seen by our gateway
+ 3rd-party threat intelligence
• External input verified for labels (supervised learning)
Currently storing: ~2 million labelled apps
… and the features
Baidu 2016
Feature extraction
Direct metadata extraction
• Total unique fields across all apps ~ 500,000
• A typical app ~ 10+ fields
• SPARSE VECTOR
Solution:
• Hashing function (maps each field to a vector index)
• Allows for fast retrieval
• With a big enough map (2^20 buckets) to keep hash clashes to a minimum
• DENSE VECTOR
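The hashing step can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the talk does not say which hash function was used, so MD5 stands in here, and the field names are made up. The result is stored as an index-to-count mapping for brevity; the index space has the fixed size 2^20 quoted on the slide.

```python
import hashlib

NUM_BUCKETS = 2 ** 20  # map size quoted on the slide


def bucket(field: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a field name to a stable index in [0, num_buckets).

    MD5 is a stand-in (assumption); any well-mixed hash would do.
    """
    digest = hashlib.md5(field.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def hash_features(fields, num_buckets: int = NUM_BUCKETS) -> dict:
    """Return an {index: count} representation of one app's fields."""
    vector = {}
    for field in fields:
        idx = bucket(field, num_buckets)
        # Fields that clash on an index simply add up.
        vector[idx] = vector.get(idx, 0.0) + 1.0
    return vector


def within_app_collision_probability(k: int, num_buckets: int = NUM_BUCKETS) -> float:
    """Birthday-problem probability that any two of k fields share a bucket,
    assuming the hash spreads fields uniformly."""
    p_no_collision = 1.0
    for i in range(k):
        p_no_collision *= (num_buckets - i) / num_buckets
    return 1.0 - p_no_collision


# Hypothetical metadata fields for one app (illustrative values only).
app_fields = ["permission:SEND_SMS", "permission:INTERNET", "sdk:21"]
features = hash_features(app_fields)
```

For an app with around 10 fields, `within_app_collision_probability(10)` is well below one in ten thousand, which is one way to read "big enough map" on the slide.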
The Machine Learning model
• Selected model = Logistic Regression
◦ Models tried = (LogReg, SVM, Decision Tree)
• K-fold cross validation to select best parameters
• Accuracy: 0.96 
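As a rough illustration of K-fold cross validation for parameter selection, the sketch below tunes the L2 penalty of a small logistic regression written in plain Python. Everything in it is an assumption made for the example: the toy data is synthetic, the grid values are arbitrary, and the talk's actual model was trained with Spark rather than with this code.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(model, x):
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train_logreg(X, y, l2, epochs=200, lr=0.5):
    """Batch gradient descent for L2-regularised logistic regression."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba((w, b), xi) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(model, X, y):
    hits = sum((predict_proba(model, xi) >= 0.5) == bool(yi) for xi, yi in zip(X, y))
    return hits / len(y)


def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(X, y, l2, k=5):
    """Mean held-out accuracy over k folds for one value of l2."""
    scores = []
    for held_out in k_fold_indices(len(X), k):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        model = train_logreg([X[i] for i in train], [y[i] for i in train], l2)
        scores.append(accuracy(model, [X[i] for i in held_out], [y[i] for i in held_out]))
    return sum(scores) / k


# Synthetic toy data (assumption): label is 1 when the two features sum to > 0.
rng = random.Random(42)
X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(100)]
y = [1 if x0 + x1 > 0 else 0 for x0, x1 in X]

grid = [0.0, 0.01, 1.0]  # hypothetical candidate penalties
cv_scores = {l2: cross_validate(X, y, l2) for l2 in grid}
best_l2 = max(cv_scores, key=cv_scores.get)
```

The same loop generalises to any parameter grid: each candidate is scored only on data it was not trained on, and the candidate with the best mean score is kept.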

Why Apache Spark?
• Model persistence: PMML paradigm already integrated
• Truly big data: millions of data points, millions of fields
• Ease of use: fast, easy and iterative. From EDA to app in days. Scala and Python API.
• Deployment and scale: from local to cluster is easy!
Wandera 2018
Production ready?
PMML
• Predictive Model Markup Language
• Industry standard
• Pro: language agnostic, REST API, good algorithm coverage
• Con: large file size
Production ready?
• Saving to PMML (ML vs MLlib / DataFrame vs RDD APIs)
• DataFrame API doesn’t have PMML functionality (yet)
• Hacked PMML to get probabilities for predictions
• Size of model ~ 20 MB (compressed)
• Overall time to train: less than 2 hours on a big enough cluster
Live scoring
1. User installs new app
2. Extract features & score app
3. If score > 0.9: INVESTIGATE / NOTIFY
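The scoring step can be illustrated with a minimal Python sketch, assuming a logistic regression whose coefficients have already been loaded. The weights, intercept, feature indices and the "NO ACTION" label are hypothetical; only the 0.9 threshold and the INVESTIGATE / NOTIFY action come from the slide.

```python
import math

THRESHOLD = 0.9  # alert threshold from the slide


def score(features: dict, weights: dict, intercept: float) -> float:
    """Logistic regression probability for an {index: value} feature vector."""
    z = intercept + sum(weights.get(i, 0.0) * v for i, v in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def triage(probability: float, threshold: float = THRESHOLD) -> str:
    """Return the action for a scored app ("NO ACTION" is a made-up label)."""
    return "INVESTIGATE / NOTIFY" if probability > threshold else "NO ACTION"


# Hypothetical model coefficients keyed by hashed feature index.
weights = {3: 2.5, 17: 1.5, 42: -1.0}
intercept = -1.0

suspicious_app = {3: 1.0, 17: 1.0}  # z = -1.0 + 2.5 + 1.5 = 3.0
benign_app = {42: 1.0}              # z = -1.0 - 1.0 = -2.0

suspicious_action = triage(score(suspicious_app, weights, intercept))
benign_action = triage(score(benign_app, weights, intercept))
```

Thresholding on the probability rather than on the hard class label is what makes the cut-off tunable, which may be why the deck mentions extracting probabilities from the exported model.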
Data Science @ Wandera
• Cross-disciplinary team of scientists, analysts & developers
• Focus on solving real-world problems in a real-time, distributed network
• Global team with presence in the USA, the UK (London) and the Czech Republic
= Innovative Research + Scalable Architecture + Efficient Feature Delivery
Thanks for listening
Appendix 1: model testing results
Wandera 2018

