Real-Time Anomaly Detection
with Spark MLlib, Akka and
Cassandra
Natalino Busa
Data Platform Architect at ING
Distributed computing, Machine Learning,
Statistics, Big/Fast Data, Streaming Computing
@natbusa | linkedin.com/in/natalinobusa
@natbusa | linkedin: Natalino Busa
ING group
Empowering people to stay a step ahead
in life and in business.
http://www.ing.com/About-us/Purpose-Strategy.htm
Clear and Easy
Anytime, Anywhere
Empower
Keep getting better
Apply advanced, predictive analytics on live data
Event-Driven and exposed via APIs
Lean Architecture, Easy to integrate
Available, Consistent, Streaming, Real-time Data
Resilient, Distributed, Scalable, Maintainable
Clear and Easy
Anytime, Anywhere
Empower
Keep getting better
Data Principles
ING group
Big Data and Fast Data
[Figure: population (events, transactions, sessions, customers, etc.) plotted against time, on an axis running from 10 yrs to 1 min; regions labelled historical big data, recent data, event streams]
Why Fast Data?
1. Relevant up-to-date information.
2. Delivers actionable events.
Why Big Data?
1. Analyze and model
2. Learn, cluster, categorize, organize facts
Distributed
Data Store
Real Time APIs
Streaming Data
Data Sources,
Files, DB extracts
Batched Data
API for mobile and web
Training, Scoring and Exposing models
Distributed
Data Store
Fast Analytics
Real Time APIs
Streaming Data
Data Modeling
Data Sources,
Files, DB extracts
Batched Data
API for mobile and web
Training, Scoring and Exposing models
read the data
write the model
Distributed
Data Store
Fast Analytics
Event Processing
Real Time APIs
Streaming Data
Data Modeling
Data Sources,
Files, DB extracts
Batched Data
Alerts and Notifications
API for mobile and web
Training, Scoring and Exposing models
read the model
read the data
write the model
Cassandra+Akka+Spark: Machine Learning
Cassandra (C*)
Fast writes
2D Data Structure
Replicated
Tunable consistency
Multi-Data centers
Akka
Very Fast processing
Distributed, Scalable computing
Actor-based Pipelines
Actor state can be persisted
Supervision strategies
Spark
Ad-Hoc Queries
Joins, Aggregate
User Defined Functions
Machine Learning, Advanced Stats and Analytics
Akka-Cassandra-Spark Stack
Cassandra-Spark Connector
Cassandra
Spark
Streaming, SQL, MLlib, GraphX
Extract
Data
Create Models,
Enrich, Transform
Fetch from other
Sources: Kafka
Fetch from other
Sources: DBs, Files
Akka
Analytics, Statistics, Data
Science, Model Training
Access
Model
Persist
Actors’ State
Cassandra-Spark Connector
Cassandra: Store all the data
Spark: Analyze all the data
DC1: replication factor 3
DC2: replication factor 3
DC3: replication factor 3 + Spark Executors
Storage! Analytics!
Data
Data Science: Anomaly Detection
An outlier is an observation that deviates so much from other
observations as to arouse suspicion that it was generated by a different
mechanism.
Hawkins, 1980
Data Science: Anomaly Detection
Distance-based
Density-based
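As a hedged illustration of the distance-based view (not code from the talk), the sketch below flags observations that lie more than a chosen number of standard deviations from the mean. The object name `ZScoreOutliers` and the default threshold of 3 are assumptions made for this example.

```scala
object ZScoreOutliers {
  // Returns the values lying more than `threshold` standard deviations from the mean.
  // `threshold` is an illustrative default, not a value given in the slides.
  def outliers(xs: Seq[Double], threshold: Double = 3.0): Seq[Double] = {
    require(xs.nonEmpty, "need at least one observation")
    val mean = xs.sum / xs.size
    val variance = xs.map(x => (x - mean) * (x - mean)).sum / xs.size
    val std = math.sqrt(variance)
    // With zero spread every observation equals the mean, so nothing is an outlier.
    if (std == 0.0) Seq.empty
    else xs.filter(x => math.abs(x - mean) / std > threshold)
  }
}
```

A density-based method such as DBSCAN instead asks how many neighbours a point has, which copes better with data that has several modes.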
Example: Analyze gowalla check-ins
Check-ins dataset
 year | month | day | time | uid  | lat      | lon       | ts                       | vid
------+-------+-----+------+------+----------+-----------+--------------------------+--------
 2010 | 9     | 14  | 91   | 853  | 40.73474 | -73.87434 | 2010-09-14 00:01:31+0000 | 917955
 2010 | 9     | 14  | 328  | 4516 | 40.72585 | -73.99289 | 2010-09-14 00:05:28+0000 | 37160
 2010 | 9     | 14  | 344  | 2964 | 40.67621 | -73.98405 | 2010-09-14 00:05:44+0000 | 956870
Venues dataset
 vid     | name                       | lat      | long
---------+----------------------------+----------+-----------
 754108  | My Suit NY                 | 40.73474 | -73.87434
 249755  | UA Court Street Stadium 12 | 40.72585 | -73.99289
 6919688 | Sky Asian Bistro           | 40.67621 | -73.98405
Data Science: clustering venues
Intuition: weekly visitor patterns!
Madison Square, Apple Store, Radio City Music Hall:
Thursdays, Fridays and Saturdays are busy
Statue of Liberty, Jacob K. Javits Convention Center,
Whole Foods Market (Columbus Circle):
not popular midweek
Data Science: clustering with k-means
Intuition:
Histogram components as dimensions
Similar histograms occupy nearby places in
the feature space
How do I compare histograms?
- EMD (earth mover's distance)
- Chi-squared distance
- Space transformation (DCT)
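To make one of the listed comparisons concrete, here is a minimal sketch of the chi-squared distance between two histograms with the same number of bins. It is an illustration rather than the talk's code; the object name `HistogramDistance` and the choice to skip bins that are empty in both histograms are assumptions.

```scala
object HistogramDistance {
  // Scales a histogram so its bins sum to 1; an all-zero histogram is returned unchanged.
  def normalize(h: Array[Double]): Array[Double] = {
    val total = h.sum
    if (total == 0.0) h else h.map(_ / total)
  }

  // Chi-squared distance: 0.5 * sum((a_i - b_i)^2 / (a_i + b_i)),
  // skipping bins that are empty in both histograms to avoid dividing by zero.
  def chiSquared(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "histograms must have the same number of bins")
    val terms = a.zip(b).collect {
      case (x, y) if x + y > 0.0 => (x - y) * (x - y) / (x + y)
    }
    0.5 * terms.sum
  }
}
```

Normalizing both histograms first makes a quiet venue and a busy venue with the same weekly shape compare as similar.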
K-Means: Featurize data + cluster
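import org.apache.spark.mllib.clustering.KMeans

// vectorize_time and featurize_histogram are helpers not shown on the slide;
// featurize_histogram is assumed to return an MLlib Vector
val weekly_visits = checkins_venues.select("vid", "ts")
  .map(row => (row.getLong("vid"), vectorize_time(row.getTimestamp("ts"))))
  .reduceByKey(_ + _)
  .mapValues(featurize_histogram)
val numClusters = 15
val numIterations = 100
// KMeans.train expects an RDD[Vector], so pass only the feature vectors
val clusters = KMeans.train(weekly_visits.values, numClusters, numIterations)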
PairRDDs, weekly patterns per venue
cluster similar weekly patterns
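The slide uses a `vectorize_time` helper without showing it. One possible reading, sketched here in plain Scala without Spark, is a one-hot vector over the seven days of the week that is summed per venue into a weekly histogram. The bin layout (one bin per day, UTC) and all names in `WeeklyHistogram` are assumptions; the talk may well have used finer bins.

```scala
import java.time.{Instant, ZoneOffset}

object WeeklyHistogram {
  // One-hot vector over the 7 days of the week (index 0 = Monday), computed in UTC.
  def vectorizeTime(ts: Instant): Array[Double] = {
    val day = ts.atZone(ZoneOffset.UTC).getDayOfWeek.getValue // 1 = Monday ... 7 = Sunday
    Array.tabulate(7)(i => if (i == day - 1) 1.0 else 0.0)
  }

  // Element-wise sum: the role played by `_ + _` in reduceByKey.
  def add(a: Array[Double], b: Array[Double]): Array[Double] =
    a.zip(b).map { case (x, y) => x + y }

  // Weekly visit histogram for one venue, built from all of its check-in timestamps.
  def histogram(checkins: Seq[Instant]): Array[Double] =
    checkins.map(vectorizeTime).foldLeft(Array.fill(7)(0.0))(add)
}
```

For example, the three sample check-ins in the check-ins table all fall on Tuesday 2010-09-14, so they would all land in bin 1.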
How to use it
1) Classification
Classify venues into given groups
2) Anomaly Detection
Detect a shift in the cluster assignment of a given venue for a given week
Keep monitoring weekly changes in patterns; when a change happens, trigger a signal
[Figure: week 26 vs. week 27, leading to an Action]
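The anomaly-detection use can be sketched as follows: assign each venue's weekly feature vector to its nearest centroid, then report the venues whose assignment differs from the previous week. This is a simplified plain-Scala illustration, not the talk's implementation; `ClusterShift` and its method names are hypothetical, and the centroids are assumed to come from an already trained k-means model.

```scala
object ClusterShift {
  type Vec = Array[Double]

  def squaredDistance(a: Vec, b: Vec): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the closest centroid, i.e. the cluster a feature vector is assigned to.
  def assign(centroids: Seq[Vec], features: Vec): Int =
    centroids.indices.minBy(i => squaredDistance(centroids(i), features))

  // Venue ids present in both weeks whose cluster assignment changed.
  def shifted(centroids: Seq[Vec],
              previousWeek: Map[Long, Vec],
              currentWeek: Map[Long, Vec]): Set[Long] =
    currentWeek.keySet.filter { vid =>
      previousWeek.get(vid).exists { prev =>
        assign(centroids, prev) != assign(centroids, currentWeek(vid))
      }
    }
}
```

Each id in the returned set is a candidate for the alert or action step; venues seen for the first time are ignored because there is no earlier assignment to compare against.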
Data Science: clustering users’ venues
Intuition:
Users tend to stick to the same places
People have habits
By clustering the places together
we can identify anomalous locations
Size of the cluster matters
More points means less anomalous
Mini-clusters and single anomalies are
treated in similar ways ...
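One hedged way to turn "size of the cluster matters" into a number: given a cluster label per check-in location, score each location by how small its cluster is relative to all of the user's points. The scoring formula and the name `ClusterSizeScore` are assumptions for illustration, not something stated in the talk.

```scala
object ClusterSizeScore {
  // labels(i) is the cluster id of point i; a negative id marks a noise point.
  // Score = 1 - (cluster size / total points); noise points get the maximum score 1.0.
  def scores(labels: Seq[Int]): Seq[Double] = {
    val sizes: Map[Int, Int] =
      labels.filter(_ >= 0).groupBy(identity).map { case (id, members) => id -> members.size }
    labels.map { id =>
      if (id < 0) 1.0 else 1.0 - sizes(id).toDouble / labels.size
    }
  }
}
```

With this scoring a single-point mini-cluster scores almost as high as a noise point, which matches the remark that the two are treated in similar ways.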
Data Science: clustering with DBSCAN
DBSCAN finds clusters based on neighbouring density
Does not require the number of clusters k beforehand.
Clusters need not be spherical
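For readers who have not seen the algorithm, here is a compact, unoptimized DBSCAN sketch over 2D points (quadratic neighbour search, Euclidean distance on raw coordinates). It is written for illustration only and does not use or mirror the API of any library; `SimpleDbscan`, `eps` and `minPts` are names chosen for this example.

```scala
import scala.collection.mutable

object SimpleDbscan {
  type Point = (Double, Double)
  val Noise = -1
  private val Unvisited = -2

  private def distance(a: Point, b: Point): Double =
    math.hypot(a._1 - b._1, a._2 - b._2)

  // Returns one label per point: a cluster id (0, 1, ...) or Noise.
  // eps: neighbourhood radius; minPts: neighbours (the point itself included)
  // required for a point to count as a core point.
  def cluster(points: IndexedSeq[Point], eps: Double, minPts: Int): Array[Int] = {
    val labels = Array.fill(points.size)(Unvisited)
    def neighbours(i: Int): Seq[Int] =
      points.indices.filter(j => distance(points(i), points(j)) <= eps)

    var clusterId = 0
    for (i <- points.indices if labels(i) == Unvisited) {
      val seeds = neighbours(i)
      if (seeds.size < minPts) labels(i) = Noise
      else {
        labels(i) = clusterId
        val queue = mutable.Queue(seeds: _*)
        while (queue.nonEmpty) {
          val j = queue.dequeue()
          // A point earlier marked as noise becomes a border point of this cluster.
          if (labels(j) == Noise) labels(j) = clusterId
          if (labels(j) == Unvisited) {
            labels(j) = clusterId
            val next = neighbours(j)
            // j is itself a core point, so the cluster keeps expanding through it.
            if (next.size >= minPts) queue ++= next
          }
        }
        clusterId += 1
      }
    }
    labels
  }
}
```

Points labelled `Noise` are the natural anomaly candidates. For real latitude and longitude data a geographic distance such as haversine would be a better fit than the plain Euclidean distance used here.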
Data Science: clustering users’ venues
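// dbscan is a per-user clustering helper not shown on the slide
val locs = checkins_venues.select("uid", "lat", "lon")
  .map(s => (s.getLong(0), Seq((s.getDouble(1), s.getDouble(2)))))
  .reduceByKey(_ ++ _)  // concatenate each user's list of locations
  .mapValues(dbscan(_))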
Have a look at: scalanlp/nak
Data Science:
Two ways to find anomalies with clustering
- Cluster a large amount of data with k-means and histograms
- Apply clustering independently to each of millions of users,
identifying each user's patterns with the DBSCAN algorithm
MLlib vs PairRDDs
MLlib
KMeans.train(FeaturesRDD, numClusters, numIterations)
Truly distributed algorithm
Classify venues to given groups
Millions of datapoints
Limited number of clusters
PairRDDs
UserFeaturesPairRDD.groupByKey().mapValues( dbscan(_) )
RDD map functions
Parallelism easy to exploit
The function runs locally for each key
Pick your favourite machine learning algorithms
Limited number of points per key
Running in parallel for millions of keys
Training vs Scoring: Latency budget
● Akka: millisecond response
● Spark: in-memory data models
Train: Spark, Score: Spark. Model training: slow (minutes). Model scoring: slow (minutes)
Train: Spark, Score: Akka. Model training: slow (minutes). Model scoring: fast (milliseconds)
Akka
Mixed Load Cassandra Cluster
Coral: Web API for dynamic data flows
Akka
Web API for dynamic data flows
● a web api to define/manage/run streaming data-flows
● open source and community managed
● event processing as a service
coral-streaming/coral
Steven Raemaekers
Jasper van Zandbeek
Ger van Rossum
Hoda Alemi
Koen Verschuren
Distributed
Data Store
Fast Analytics
Event Processing
Real Time APIs
Streaming Data
Data Modeling
Data Sources,
Files, DB extracts
Batched Data
Alerts and Notifications
API for mobile and web
Summary:
read the model
read the data
write the model
Akka
Feedback to the community:
More algorithms for machine learning!
- DBSCAN, OPTICS, PAM
- More metrics, non-Euclidean spaces, etc.
- Non-distributed algorithms: more scalanlp integration?
Streaming all the way:
Unify batch (Spark) and event streaming (Akka) computing
Thanks!
- Vision and strategy on an event-driven bank
- ING CIO management team and awesome colleagues
Spark, Cassandra, Akka communities !
webinar + live demo: Dec 9th
Resources
Coral: event processing web API
https://github.com/coral-streaming/coral
Spark + Cassandra: Clustering Events
http://www.natalinobusa.com/2015/07/clustering-check-ins-with-spark-and.html
Spark: Machine Learning, SQL frames
https://spark.apache.org/docs/latest/mllib-guide.html
https://spark.apache.org/docs/latest/sql-programming-guide.html
Datastax: Analytics and Spark connector
http://www.slideshare.net/doanduyhai/spark-cassandra-connector-api-best-practices-and-usecases
http://docs.datastax.com/en/datastax_enterprise/4.7/datastax_enterprise/anaHome/anaHome.html
Anomaly Detection
Chandola, V.; Banerjee, A.; Kumar, V. (2009). "Anomaly detection: A survey". ACM Computing Surveys 41 (3): 1. doi:10.1145/1541880.1541882.
Resources
Datasets
https://snap.stanford.edu/data/loc-gowalla.html
E. Cho, S. A. Myers, J. Leskovec. Friendship and Mobility: User Movement in Location-Based Social Networks. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2011
https://code.google.com/p/locrec/downloads/detail?name=gowalla-dataset.zip
The project is being developed in the context of the SInteliGIS project financed by the Portuguese Foundation for Science and Technology (FCT) through project grant PTDC/EIA-EIA/109840/2009.
Pictures:
"DBSCAN-density-data" by Chire - Own work. Licensed under CC BY-SA 3.0 via Commons - https://commons.wikimedia.org/wiki/File:DBSCAN-density-data.svg#/media/File:DBSCAN-density-data.svg
"DBSCAN-Illustration" by Chire - Own work. Licensed under CC BY-SA 3.0 via Commons - https://commons.wikimedia.org/wiki/File:DBSCAN-Illustration.svg#/media/File:DBSCAN-Illustration.svg
"Multimodal" by Visnut - Own work. Licensed under CC BY-SA 4.0 via Commons - https://commons.wikimedia.org/wiki/File:Multimodal.png#/media/File:Multimodal.png
"Standard deviation diagram" by Mwtoews - Own work, based (in concept) on figure by Jeremy Kemp, on 2005-02-09. Licensed under CC BY 2.5 via Commons - https://commons.wikimedia.org/wiki/File:Standard_deviation_diagram.svg#/media/File:Standard_deviation_diagram.svg
"Michelsonmorley-boxplot" by User:Schutz - Own work. Licensed under Public Domain via Commons - https://commons.wikimedia.org/wiki/File:Michelsonmorley-boxplot.svg#/media/File:Michelsonmorley-boxplot.svg
