Detecting Mobile Malware with Apache Spark with David Pryce

David Pryce, Wandera
Detecting Mobile Malware
with Apache Spark
#DSSAIS12

Summary
• The problem: Mobile-first malware detection
• The data and features
• The Machine Learning (ML) model
• Why Apache Spark?
• Making it production ready
• Data Science @ Wandera

The power of enterprise mobility
• Devices are prone to security threats
• Concerns around appropriate usage
• Data usage costs are opaque and spiraling
• Potentially exposing sensitive data
• Seamless internal communication
• Added flexibility to working hours
• Access to more apps and productivity tools
• E-mail and other services available anywhere

Happy hunting ground for attackers
“Mobile threats can no longer be ignored”
- AUGUST 2017 - GARTNER MARKET GUIDE TO MOBILE THREAT DEFENSE
• 100%: mobile malware growth in 2016
• 435%: growth in high-severity threats (CVSS) in 2016
• 80% of organizations experienced a mobile phishing attack
• 38% of hackers bypass endpoint defense using social engineering

Introducing the Secure Mobile Gateway
ON-DEVICE DETECTION
IN-NETWORK PROTECTION

The rise of mobile malware
Credit: GData 2017

Our objectives: Identify and Classify
MALWARE TYPES
Ransomware, Spyware, Banker, Trojan, Rooting, Adware, SMS

Why is this a novel problem?
• Mobile malware is on the rise
• Signature-based detection is no longer scalable or effective
• We needed a solution that could:
◦ work across both known and unknown threats;
◦ effectively protect our customers; and
◦ enable threat research to quickly identify new outbreaks
• First solution = signatures and lists
• Our solution = machine learning!

The data…
Good and bad apps
• Source 1: official app stores
• Source 2: seen on our devices
• Source 3: seen by our gateway
+
3rd-party threat intelligence
• External input verified for labels (supervised learning)
Currently storing: ~2 million labelled apps

… and the features
Baidu 2016

Feature extraction
Direct metadata extraction
• Total unique fields for all apps ~ 500,000
• A typical app ~ 10+ fields
• SPARSE VECTOR
Solution:
• Hashing function (maps field names to vector indices)
• Allows for fast retrieval
• With a big enough map (2^20 buckets) to limit clashes (hash collisions)
• DENSE VECTOR
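
The hashing step above can be sketched roughly as follows. This is a minimal illustration, not the pipeline used in the talk: the field names are hypothetical, MD5 is only a stand-in for whatever hash function the real system uses, and the result is stored as an {index: count} map holding just the non-zero buckets of the fixed-length 2^20 vector.

```python
import hashlib

NUM_BUCKETS = 2 ** 20  # map size quoted on the slide


def bucket_index(field: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a metadata field name to a stable index in [0, num_buckets)."""
    # MD5 is a deterministic stand-in hash (assumption). Python's built-in
    # hash() is salted per process for strings, so it is avoided here.
    digest = hashlib.md5(field.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def hash_features(fields, num_buckets: int = NUM_BUCKETS) -> dict:
    """Turn an app's metadata fields into an {index: count} map."""
    vector = {}
    for field in fields:
        idx = bucket_index(field, num_buckets)
        # Fields that collide simply add up in the same bucket.
        vector[idx] = vector.get(idx, 0.0) + 1.0
    return vector


# Hypothetical metadata fields for a single app.
example_app = ["permission:SEND_SMS", "permission:INTERNET", "sdk:23"]
example_vector = hash_features(example_app)
```

Because every app lands in the same fixed index space, no global dictionary of the ~500,000 field names has to be built or shipped with the model.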

The Machine Learning model
• Selected model = Logistic Regression
◦ Models tried = (LogReg, SVM, Decision Tree)
• K-fold cross-validation to select the best parameters
• Accuracy: 0.96
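
As a rough illustration of parameter selection by k-fold cross-validation, the sketch below trains a tiny logistic regression with stochastic gradient descent on synthetic data and picks the L2 penalty with the best mean held-out accuracy. Everything here is an assumption made for illustration (the toy data, the candidate penalties, the learning rate, k = 5); the talk's model was trained with Spark, not with this code.

```python
import math
import random


def sigmoid(z: float) -> float:
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(rows, labels, reg, epochs=50, lr=0.5):
    """Logistic regression via stochastic gradient descent, L2 penalty `reg`."""
    weights, bias = [0.0] * len(rows[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            err = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            weights = [w - lr * (err * v + reg * w) for w, v in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def accuracy(model, rows, labels):
    weights, bias = model
    hits = 0
    for x, y in zip(rows, labels):
        p = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)
        hits += int((p >= 0.5) == bool(y))
    return hits / len(rows)


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return [order[i::k] for i in range(k)]


def cross_validate(rows, labels, reg, k=5):
    """Mean held-out accuracy over k folds for one penalty value."""
    scores = []
    for held_out in k_fold_indices(len(rows), k):
        held = set(held_out)
        train = [i for i in range(len(rows)) if i not in held]
        model = train_logreg([rows[i] for i in train],
                             [labels[i] for i in train], reg)
        scores.append(accuracy(model, [rows[i] for i in held_out],
                               [labels[i] for i in held_out]))
    return sum(scores) / k


# Synthetic data: one informative feature, one pure-noise feature.
rng = random.Random(42)
rows, labels = [], []
for _ in range(100):
    y = rng.randint(0, 1)
    rows.append([rng.gauss(2.0 * y - 1.0, 0.5), rng.gauss(0.0, 1.0)])
    labels.append(y)

candidates = [0.0, 0.01, 0.1]  # hypothetical L2 penalties
cv_scores = {reg: cross_validate(rows, labels, reg) for reg in candidates}
best_reg = max(cv_scores, key=cv_scores.get)
```

Each candidate is scored only on data it was not trained on, so the selected penalty is less likely to be one that merely memorises the training set.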

Why Apache Spark?
• Model persistence: PMML paradigm already integrated
• Truly big data: millions of data points, millions of fields
• Ease of use: fast, easy and iterative. From EDA to app in days. Scala and Python API.
• Deployment and scale: from local to cluster is easy!
Wandera 2018
Production ready?

PMML
• Predictive Model Markup Language
• Industry standard
• Pro: language agnostic, REST API, good algorithm coverage
• Con: large file size

Production ready?
• Saving to PMML (ML vs MLlib / DF vs RDD)
• DataFrame API doesn’t have PMML functionality (yet)
• Hacked PMML to get probabilities for predictions
• Size of model: ~20 MB (compressed)
• Overall time to train: less than 2 hours on a big enough cluster

Live scoring
1. User installs new app
2. Extracts features & scores app
3. If score > 0.9: INVESTIGATE / NOTIFY
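
The three steps above can be sketched as below, assuming the score is the logistic-regression probability computed over a hashed {index: value} feature map. The weights, bias and the "NO ACTION" label are hypothetical placeholders; only the 0.9 threshold and the INVESTIGATE / NOTIFY outcome come from the slide.

```python
import math

NOTIFY_THRESHOLD = 0.9  # from the slide: "If score > 0.9"


def malware_probability(features: dict, weights: dict, bias: float) -> float:
    """Logistic-regression probability for a sparse {index: value} vector."""
    z = bias + sum(value * weights.get(idx, 0.0)
                   for idx, value in features.items())
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
    return 1.0 / (1.0 + math.exp(-z))


def triage(score: float, threshold: float = NOTIFY_THRESHOLD) -> str:
    """Flag the app only when the score is strictly above the threshold."""
    # "NO ACTION" is an assumed label; the slide names only the flagged case.
    return "INVESTIGATE / NOTIFY" if score > threshold else "NO ACTION"


# Hypothetical model parameters and two hypothetical apps.
weights = {3: 2.5, 7: 1.5}
bias = -1.0
suspicious_app = {3: 1.0, 7: 1.0}
ordinary_app = {7: 1.0}

suspicious_decision = triage(malware_probability(suspicious_app, weights, bias))
ordinary_decision = triage(malware_probability(ordinary_app, weights, bias))
```

Keeping the threshold as a separate parameter makes it straightforward to trade false positives against missed detections without retraining the model.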

Data Science @ Wandera
• Cross-disciplinary team of scientists, analysts & developers
• Focus on solving real-world problems in a real-time, distributed network
• Global team with presence in the USA, the UK (London) and the Czech Republic
= Innovative Research + Scalable Architecture + Efficient Feature Delivery

Thanks for listening

Appendix 1: model testing results