Streaming Outlier Analysis for Fun and Scalability
Casey Stella
2016
Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
Table of Contents
Streaming Analytics
Framework
Demos
Questions
Introduction
Hi, I’m Casey Stella!
Streaming Analytics
• The future involves non-trivial analytics done on streaming data
• It’s not just IoT
• There is a need for insights to keep pace with the velocity of your data
• The Good: Much of the data can be coerced into timeseries
• The Bad: There is a lot of data and it comes at you fast
• The Good: Outlier analysis (anomaly detection) is a killer app
• The Bad: Outlier analysis can be computationally intensive
• The Good: There is no shortage of computational frameworks to handle streaming
• The Bad: There is no overabundance of high-quality outlier analysis
frameworks
Outlier Analysis
Outlier analysis or anomaly detection is the analytical technique by which “interesting”
points are differentiated from “normal” points. Often “interesting” implies some sort of
error or state which should be researched further.
MacroBase [1], an outlier analysis system built for IoT by MIT, Stanford, and
Cambridge Mobile Telematics, noted several properties of IoT data:
• Data produced by IoT applications often comes from some “ordinary”
distribution
• IoT anomalies are often systemic
• They are often fairly rare
[1] http://arxiv.org/pdf/1603.00567v1.pdf
Outlier Analysis: A Hybrid Approach
In order to function at scale, a two-phase approach is taken
• For every data point
◦ Detect outlier candidates using a robust estimator of variability (e.g. median absolute
deviation) that uses distributional sketching (e.g. Q-trees)
◦ Gather a biased sample (biased by recency)
◦ Deterministic in space and cheap in computation
• For every outlier candidate
◦ Use traditional, more computationally complex approaches to outlier analysis (e.g.
Robust PCA) on the biased sample
◦ Expensive computationally, but run infrequently
This becomes a data filter which can be attached to a timeseries data stream
within a distributed computational framework (e.g. Storm, Spark, Flink, NiFi)
to detect outliers.
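A minimal Python sketch of the two-phase filter described above, under several loudly stated assumptions. It is not the code from the talk's repository. Phase 1 here computes an exact median and MAD over the recency-biased sample rather than using a distributional sketch, and phase 2 uses a simple quartile fence purely as a stand-in for a heavier method such as Robust PCA. The names `RecencyBiasedSample`, `cheap_candidate`, `expensive_check` and `outlier_filter`, and the values `threshold=3.5`, `warmup=10`, `k=3.0` and `capacity=100`, are all hypothetical.

```python
import random
from statistics import median, quantiles


class RecencyBiasedSample:
    """Fixed-size sample that favors recent points (hypothetical helper)."""

    def __init__(self, capacity=100, seed=0):
        self.capacity = capacity
        self.items = []
        self.rng = random.Random(seed)

    def add(self, x):
        if len(self.items) < self.capacity:
            self.items.append(x)
        else:
            # Overwrite a random slot: a stored point survives n later
            # arrivals with probability (1 - 1/capacity) ** n, so old
            # points fade out and memory stays bounded.
            self.items[self.rng.randrange(self.capacity)] = x


def cheap_candidate(x, sample, threshold=3.5, warmup=10):
    """Phase 1: flag x if it is far from the sample median in MAD units."""
    if len(sample.items) < warmup:
        return False
    med = median(sample.items)
    mad = median(abs(v - med) for v in sample.items)
    if mad == 0:
        return x != med
    return abs(x - med) / mad > threshold


def expensive_check(x, sample, k=3.0):
    """Phase 2 stand-in (NOT Robust PCA): a fence built from sample quartiles."""
    q1, _, q3 = quantiles(sample.items, n=4)
    iqr = q3 - q1
    return x < q1 - k * iqr or x > q3 + k * iqr


def outlier_filter(stream, capacity=100):
    """Yield points that pass both phases; every point feeds the sample."""
    sample = RecencyBiasedSample(capacity)
    for x in stream:
        # Phase 2 only runs for phase-1 candidates, so it runs rarely.
        is_outlier = cheap_candidate(x, sample) and expensive_check(x, sample)
        sample.add(x)
        if is_outlier:
            yield x


# Usage: a steady signal in the range 10..14 with one injected spike.
normal = [10 + (i % 5) for i in range(200)]
flagged = list(outlier_filter(normal + [1000] + normal[:50]))
print(flagged)
```

Because `outlier_filter` is a generator over any iterable, the same shape could be wrapped in a bolt, map function or processor of whichever streaming framework is in use; that wiring is left out here.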
Sketchy Outlier Estimator: Median Absolute Deviation
• Median absolute deviation (or MAD) is a robust statistic
◦ Robust statistics perform well on data drawn from a wide range of probability
distributions, especially non-normal ones
◦ Unlike the standard mean/standard deviation combo, MAD is resistant to the
presence of outliers.
• The median absolute deviation is defined for a series of univariate samples $X$ with
$\tilde{x} = \operatorname{median}(X)$ as $\operatorname{MAD}(X) = \operatorname{median}(\{\, |x_i - \tilde{x}| : x_i \in X \,\})$.
• A point is considered an outlier if its distance from the current window median,
scaled by the MAD for the previous window, is above a threshold.
tl;dr: A formal way to encode our intuition: If a point is far away from the
“central” point of our window, then it’s likely an outlier.
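The definition and the windowed outlier rule above can be sketched in a few lines of Python. This is an illustrative sketch, not the talk's implementation: it computes exact medians over small in-memory windows instead of a sketch structure, and the function names `mad` and `mad_score`, the example windows, and the idea of comparing the score against a threshold such as 3.5 are assumptions.

```python
from statistics import median, pstdev


def mad(xs):
    """Median absolute deviation: median of |x_i - median(xs)|."""
    med = median(xs)
    return median(abs(x - med) for x in xs)


def mad_score(x, current_window, previous_window):
    """Distance of x from the current window median, scaled by the
    MAD of the previous window."""
    scale = mad(previous_window)
    distance = abs(x - median(current_window))
    if scale == 0:
        # Degenerate previous window: any deviation is infinitely surprising.
        return float("inf") if distance > 0 else 0.0
    return distance / scale


clean = [10, 11, 12, 13, 14]
dirty = [10, 11, 12, 13, 1000]  # same data with one extreme value

# MAD barely notices the extreme value; the standard deviation explodes.
print(mad(clean), mad(dirty))        # 1 1
print(pstdev(clean), pstdev(dirty))

previous = [10, 11, 12, 13, 14]
current = [11, 12, 13, 12, 50]
print(mad_score(13, current, previous))  # 1.0
print(mad_score(50, current, previous))  # 38.0
```

With a threshold of, say, 3.5 (an assumed value), the point 50 would be treated as an outlier candidate and 13 would not.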
Architecture
Demos
Questions
Thanks for your attention! Questions?
• Code & scripts for this talk available at
http://github.com/cestella/streaming_outliers
• Find me at http://caseystella.com
• Twitter handle: @casey_stella
• Email address: cstella@hortonworks.com