Anomaly Detection in Real-Time Data Streams Using Heron
Arun Kejariwal, MZ Research
Karthik Ramasamy, Twitter
DATA @ MZ
An Overview
GOW AND MOBILE STRIKE
Peaked at 1M events/sec
MARKETING
Serve >1B impressions/day worldwide
Integrated with >150 distinct advertising channels
POTPOURRI
~35B messages/day
Writes: 20TB/day
EXPLOSION IN DATA VELOCITY AND VOLUME
SENSORS
Monitoring
Smartwatches, Refrigerators
Wearables
ACTUATORS
Automation
Manufacturing
Robotics
DRONES
Expanding the scope
Delivery, Real Estate
Power Transmission Lines
MOBILE
Life’s Remote Control
Personalization
Productivity
ANOMALY DETECTION: WHY BOTHER?
MANUFACTURING
HEALTH Care
POWER Grid
GAS Pipelines
SECURITY
OPERATIONS
ROBOTICS
# TWEETS per minute
DIGITAL Marketing
CONNECTED Cars
ANOMALY DETECTION: LIVE EXAMPLE
ANOMALY DETECTION: HISTORY
ANOMALY DETECTION: APPLICATION DOMAINS
RESEARCHED FOR >100 YEARS
Manufacturing
Econometrics
Networking
Image Processing
Computer Vision
(Cyber) Security
Text Mining
Signal Processing
Finance
Experimental Social Psychology
Web Operations
Statistics (and Time Series Analysis)
Data Fidelity
Astronomy
ANOMALY  DETECTION:  RECENT  WORKS  IN  INDUSTRY
JAN’15 MARCH’15 AUG’15
NOV’15 NOV’15 AUG’15
JULY’15
JUNE’16
WHY NOT USE OFF-THE-SHELF?
Anomalies are CONTEXTUAL
FALSE Positive Rate
FALSE Negative Rate
SCALE
Data Granularity
MOSTLY UNSUPERVISED
Severity
Different Actions
Page or not
Data Characteristics
Stationarity, Normal Distribution
Data Fidelity
Missing Data
Data Corruption
DATA VISUALIZATION
Not viable in practice
COMMONLY USED STATISTICS
MEAN AND STANDARD DEVIATION
Mean: Compute incrementally
Not robust in the presence of anomalies
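The mean and standard deviation bullets can be illustrated with a minimal sketch. It assumes Welford's update rule as the incremental method (the slide does not name one), and the class name and sample values are made up for illustration. It also shows why these statistics are not robust: one anomalous point moves both estimates a long way.

```python
import math


class RunningStats:
    """Incrementally updated mean and standard deviation (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def std(self):
        # Sample standard deviation; reported as 0.0 for fewer than two points.
        return math.sqrt(self._m2 / (self.n - 1)) if self.n > 1 else 0.0


stats = RunningStats()
for value in [10.0, 11.0, 9.0, 10.0, 10.0]:
    stats.update(value)
clean_mean, clean_std = stats.mean, stats.std

stats.update(1000.0)  # a single anomaly arrives on the stream
print(clean_mean, stats.mean)  # the mean moves from 10.0 to 175.0
print(clean_std, stats.std)    # the standard deviation grows by orders of magnitude
```

Each update costs constant time and memory, which is what makes these statistics attractive on a stream despite their sensitivity.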
TRIMMED MEAN
Robust in the presence of anomalies
Small samples?
How to handle asymmetric distributions?
Results in a biased estimator
What should be the trimming boundaries?
WINSORIZED MEAN
L-ESTIMATORS
Linear combinations of order statistics
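As a rough illustration of the trimmed and Winsorized means as L-estimators (each is a weighted average of the sorted sample), here is a small sketch. The 10% trimming fraction and the sample are assumptions chosen for the example; the slide leaves the choice of trimming boundaries open.

```python
def trimmed_mean(values, trim_fraction=0.1):
    """Mean after discarding the lowest and highest trim_fraction of the sample."""
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)  # points cut from each tail
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


def winsorized_mean(values, trim_fraction=0.1):
    """Mean after clamping each tail to the nearest retained order statistic."""
    ordered = sorted(values)
    n = len(ordered)
    k = int(n * trim_fraction)
    low, high = ordered[k], ordered[n - k - 1]
    return sum(min(max(v, low), high) for v in ordered) / n


sample = [10, 11, 9, 10, 10, 12, 8, 10, 11, 1000]
print(sum(sample) / len(sample))  # plain mean, dominated by the anomaly
print(trimmed_mean(sample))       # anomaly discarded
print(winsorized_mean(sample))    # anomaly clamped to the largest retained value
```

Both functions assume a trimming fraction below 0.5. With very small samples the integer cut `k` becomes 0 and both fall back to the plain mean, which is one face of the "small samples?" question on the slide.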
ROBUST  STATISTICS
MEDIAN AND MEDIAN ABSOLUTE DEVIATION (MAD)
Robust in the presence of anomalies
Not amenable to incremental computation
Use q-digest, t-digest
What if MAD is zero?
A sample with many similar values
BROADENED MEDIAN, M-ESTIMATORS, SN AND QN
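The median/MAD bullets, including the zero-MAD case, can be sketched as follows. This is a batch computation over a small window, not a streaming one; a streaming version would approximate the medians with a quantile sketch such as q-digest or t-digest. The fallback used when MAD is zero (mean absolute deviation around the median) is an assumption for illustration, not something the slides prescribe.

```python
import statistics


def mad(values):
    """Median absolute deviation around the median."""
    center = statistics.median(values)
    return statistics.median([abs(v - center) for v in values])


def robust_scores(values):
    """Score each point by its distance from the median in scaled-MAD units."""
    center = statistics.median(values)
    spread = 1.4826 * mad(values)  # scaled to match the std. dev. for normal data
    if spread == 0:
        # Assumed fallback: more than half the points equal the median, so use
        # the scaled mean absolute deviation around the median instead.
        spread = 1.2533 * sum(abs(v - center) for v in values) / len(values)
    if spread == 0:
        return [0.0 for _ in values]  # all values identical, nothing stands out
    return [abs(v - center) / spread for v in values]


window = [5, 5, 5, 5, 5, 5, 5, 6, 5, 50]
print(mad(window))  # zero: more than half of the window equals the median
scores = robust_scores(window)
print(scores[-1] > 3)  # the value 50 still stands out
```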
MULTIPLE TIME SERIES
Methods
ANALYZE INDIVIDUAL TIME SERIES
Too many alerts
Not actionable
Alert Fatigue
MINIMUM COVARIANCE DETERMINANT (MCD)
Proposed by Rousseeuw, 1984
Mahalanobis distance [1]
FastMCD
[1] “On the generalised distance in statistics”, by P. C. Mahalanobis, 1936.
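To make the Mahalanobis distance concrete, here is a minimal two-metric sketch. It uses the plain sample mean and covariance of a clean baseline, which is an assumption made to keep the example short; MCD and FastMCD instead estimate these from the subset of points whose covariance has the smallest determinant, so that anomalies in the data do not contaminate the estimate. The baseline values are made up.

```python
def mean_and_covariance(points):
    """Sample mean and 2x2 sample covariance of a list of (x, y) pairs."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))


def mahalanobis(point, mean, cov):
    """Distance sqrt((p - m)^T C^{-1} (p - m)) for the 2-D case."""
    (sxx, sxy), (_, syy) = cov
    det = sxx * syy - sxy * sxy
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    # Closed-form inverse of the 2x2 covariance applied to (dx, dy).
    q = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return q ** 0.5


# Two strongly correlated metrics (hypothetical values).
baseline = [(1, 2.0), (2, 4.1), (3, 5.9), (4, 8.2), (5, 9.8), (6, 12.1)]
mean, cov = mean_and_covariance(baseline)

d_typical = mahalanobis((3.5, 7.0), mean, cov)
d_broken = mahalanobis((3.5, 3.0), mean, cov)  # each value is in range, the pair is not
print(d_typical, d_broken)
```

The second point would pass a per-series check, since 3.5 and 3.0 both lie inside the ranges seen in the baseline. It is only anomalous jointly, which is the case for analyzing multiple time series together.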
  
MULTIPLE TIME SERIES
Other Methods
CORRELATION
Direction
Magnitude
nxn Correlation Matrix?
Bake in context
Exploit topology
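One reading of "exploit topology" is to correlate only the pairs of series that are known to be connected, instead of filling in the full n x n matrix. The sketch below assumes Pearson correlation, and the metric names, values and edges are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: the sign gives direction, the size gives magnitude."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical metrics and service topology, for illustration only.
series = {
    "frontend_requests": [100, 120, 140, 160, 180, 200],
    "backend_requests": [52, 61, 69, 82, 90, 101],
    "cache_hit_rate": [0.95, 0.93, 0.90, 0.88, 0.85, 0.82],
    "batch_job_cpu": [30, 75, 20, 64, 41, 58],
}
edges = [
    ("frontend_requests", "backend_requests"),
    ("frontend_requests", "cache_hit_rate"),
]

# Correlate only the pairs the topology connects, not all n*(n-1)/2 pairs.
correlations = {(a, b): pearson(series[a], series[b]) for a, b in edges}
for pair, r in correlations.items():
    print(pair, round(r, 3))
```

With four series the saving is small (two pairs instead of six), but the number of all pairs grows quadratically with n while the number of topology edges usually does not. The function assumes neither series is constant, since a zero variance would divide by zero.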
MULTIPLE TIME SERIES
Other Methods
CHALLENGES
Susceptible to Anomalies
Data Skew
Missing Data
Speed
TECHNIQUES
Robust Correlation
Cross Correlation
Intersection Analysis
Trade-off between speed and accuracy
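The "susceptible to anomalies" challenge and the "robust correlation" technique can be shown side by side. The sketch below assumes Spearman rank correlation as the robust variant, which is one common choice and not necessarily the one the authors use; the series are made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """Rank of each value (1 = smallest); ties get the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [2, 4, 6, 8, 10, 12, 14, 16]
b_spiky = b[:]
b_spiky[3] = 500  # one anomalous reading in an otherwise aligned series

print(pearson(a, b), pearson(a, b_spiky))    # Pearson collapses toward zero
print(spearman(a, b), spearman(a, b_spiky))  # Spearman stays clearly positive
```

The rank transform costs an extra sort per window, a small instance of the speed versus accuracy trade-off listed on the slide.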
THE BIG PICTURE
THE FLOW
RTplatform and Heron
Live Data
Streaming Computation
RTplatform
RTplatform
Cloud-based platform built for connecting, processing,
and reacting to live data.
+ Extreme scale
+ High performance
+ Unprecedented reliability
+ Natively serverless
RTplatform
“Real-time” has many definitions that have variable KPIs.
Real time results on data-at-rest, not on live data
RTplatform
Live Stream Bots
A backbone for live data: Free Messaging for publishers and subscribers
Filter, analyze and transform messages in live stream
Notify
Anomaly detection
MESSAGING Real-time Pub/Sub with ultra-low latency and high fanout
QUERYING Filter, analyze, and transform messages live, in-stream
BOTS Deploy rule-based bots for real-time anomaly detection/reaction
RTplatform
HERON
HERON DESIGN GOALS
Task isolation
Ease of debug-ability/isolation/profiling
Support for back pressure
Topologies should be self-adjusting
Efficiency
Reduce resource consumption
Off-the-shelf schedulers
Unmanaged - Apache YARN/Mesos
Managed - Apache Aurora, Amazon ECS
Use of mainstream languages
C++, Java and Python
Batching of tuples
Amortizing the cost of transferring tuples
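The "batching of tuples" goal can be sketched in a few lines: tuples are buffered and handed over in groups, so the fixed cost of each transfer is shared by several tuples. The class name, the size and delay limits and the `send` callback are all hypothetical; this is not Heron's implementation.

```python
import time


class TupleBatcher:
    """Buffers tuples and hands them to `send` in batches."""

    def __init__(self, send, max_batch_size=3, max_delay_seconds=1.0):
        self._send = send
        self._max_batch_size = max_batch_size
        self._max_delay_seconds = max_delay_seconds
        self._buffer = []
        self._oldest = None  # arrival time of the oldest buffered tuple

    def add(self, item):
        if not self._buffer:
            self._oldest = time.monotonic()
        self._buffer.append(item)
        too_old = time.monotonic() - self._oldest >= self._max_delay_seconds
        if len(self._buffer) >= self._max_batch_size or too_old:
            self.flush()

    def flush(self):
        if self._buffer:
            self._send(self._buffer)
            self._buffer = []


sent_batches = []
batcher = TupleBatcher(send=sent_batches.append)
for i in range(7):
    batcher.add(("word", i))
batcher.flush()  # push out the final partial batch
print([len(batch) for batch in sent_batches])
```

Seven tuples cross the boundary in three transfers instead of seven. The delay limit bounds how long a tuple can wait in a partially filled batch, which is where batching trades latency for throughput.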
HERON ARCHITECTURE
Topology Submission
Scheduler
Topology 1
Topology 2
Topology N
TOPOLOGY ARCHITECTURE
Topology Master
ZK Cluster
Logical Plan, Physical Plan and Execution State
Sync Physical Plan
CONTAINER: Stream Manager, Metrics Manager, instances I1 I2 I3 I4
CONTAINER: Stream Manager, Metrics Manager, instances I1 I2 I3 I4
STREAM MANAGER
Sample Topology
S1 B2 B3 B4
HERON PHYSICAL EXECUTION
[Figure: instances of S1, B2, B3 and B4 distributed across four Stream Managers]
BACKPRESSURE
Stragglers are the norm in multi-tenant distributed systems
BAD HOST
EXECUTION SKEW
INADEQUATE PROVISIONING
BACKPRESSURE
Approaches to Handle Stragglers
SENDERS TO STRAGGLER: DROP DATA
DETECT STRAGGLERS AND RESCHEDULE THEM
SENDERS SLOW DOWN TO THE SPEED OF STRAGGLER
BACKPRESSURE
Data Drop Strategy
UNPREDICTABLE
AFFECTS ACCURACY
POOR VISIBILITY
BACKPRESSURE
Slow Down Sender
HANDLES TEMPORARY SPIKES
PROCESSES DATA AT MAXIMUM RATE
PROVIDES PREDICTABILITY
REDUCE RECOVERY TIMES
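The difference between slowing down the sender and dropping data can be sketched with a bounded buffer. This is a single-process illustration with made-up names, buffer capacity and timings, not Heron's stream manager: in the first function the sender blocks whenever the buffer is full, in the second it never waits and loses whatever does not fit.

```python
import queue
import threading
import time


def run_with_backpressure(items, capacity=2):
    """Sender blocks on a full buffer, so it slows to the receiver's pace."""
    buffer = queue.Queue(maxsize=capacity)
    received = []

    def slow_receiver():
        while True:
            item = buffer.get()
            if item is None:  # sentinel: the sender is done
                return
            time.sleep(0.001)  # simulated straggler
            received.append(item)

    worker = threading.Thread(target=slow_receiver)
    worker.start()
    for item in items:
        buffer.put(item)  # blocks while the buffer is full
    buffer.put(None)
    worker.join()
    return received


def run_with_drops(items, capacity=2):
    """Sender never waits; tuples that do not fit in the buffer are lost."""
    buffer = queue.Queue(maxsize=capacity)  # the receiver is stalled and never drains it
    dropped = []
    for item in items:
        try:
            buffer.put_nowait(item)
        except queue.Full:
            dropped.append(item)
    return dropped


items = list(range(10))
print(run_with_backpressure(items))  # every tuple arrives, in order
print(run_with_drops(items))         # everything after the first two is lost
```

With backpressure the outcome does not depend on how slow the receiver is, only the elapsed time does. With drops, which tuples survive depends on timing, which is the unpredictability the data drop slide points at.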
BACKPRESSURE
Stream Manager
TCP backpressure
Spout based backpressure
Stagewise backpressure
BACKPRESSURE - TCP
Stream Manager
Slows upstream and downstream instances
[Figure: instances of S1, B2, B3 and B4 distributed across four Stream Managers]
BACKPRESSURE - SPOUT
Stream Manager
[Figure: instances of S1, B2, B3 and B4 distributed across four Stream Managers]
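A spout-based scheme needs a rule for when to pause the spouts and when to let them resume. One plausible rule, sketched below, uses two thresholds on the amount of buffered data so that throttling does not flap on and off around a single limit. The class name, threshold values and buffer sizes are assumptions for illustration and are not taken from the slides.

```python
class BackpressureController:
    """Turns spout throttling on and off with high and low water marks."""

    def __init__(self, high_water_mark, low_water_mark):
        self.high = high_water_mark
        self.low = low_water_mark
        self.spouts_paused = False

    def observe(self, buffered_bytes):
        if not self.spouts_paused and buffered_bytes >= self.high:
            self.spouts_paused = True   # ask the spouts to stop emitting
        elif self.spouts_paused and buffered_bytes <= self.low:
            self.spouts_paused = False  # buffer has drained, resume the spouts
        return self.spouts_paused


controller = BackpressureController(high_water_mark=100, low_water_mark=50)
buffer_sizes = [20, 80, 120, 90, 60, 40, 70]
states = [controller.observe(size) for size in buffer_sizes]
print(states)
```

The spouts stay paused at 90 and 60 even though both are below the high mark, and only resume once the buffer falls to the low mark.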
BACKPRESSURE
In Practice
IN MOST SCENARIOS BACK PRESSURE RECOVERS
Without any manual intervention
SOMETIMES USER PREFERS DROPPING OF DATA
Care about only latest data
SUSTAINED BACK PRESSURE
Irrecoverable GC cycles, Bad or faulty host
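For the case where only the latest data matters, a bounded buffer that evicts its oldest entry is often enough. A minimal sketch, with a hypothetical buffer size of three:

```python
from collections import deque

latest = deque(maxlen=3)  # hypothetical buffer size
for reading in range(10):
    latest.append(reading)  # when full, the oldest reading is discarded
print(list(latest))  # → [7, 8, 9]
```

Unlike dropping at the sender, the choice of what to lose is explicit here: always the oldest readings.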
BACKPRESSURE
Advantages
PREDICTABILITY
Tuple failures are more deterministic
SELF ADJUSTS
Topology goes as fast as the slowest component
HERON: EXTENSIBLE STREAMING ENGINE
HARDWARE
BASIC INTER/INTRA IPC
Topology Master
Stream Manager
Instance
Metrics Manager
Scribe
Graphite
SCHEDULER
STATE MANAGER
HERON: EXTENSIBLE STREAMING ENGINE
PLUG AND PLAY COMPONENTS
As environment changes, core does not change
MULTI LANGUAGE INSTANCES
Support multiple language API with native instances
MULTIPLE PROCESSING SEMANTICS
Efficient stream managers for each semantics
EASE OF DEVELOPMENT
Faster development of components with little dependency
OPTIMIZING HERON
REPEATED SERIALIZATION
Java objects -> Byte Arrays -> Protocol Buffers
EAGER DESERIALIZATION
Stream manager deserializes entire tuple even though full contents are not examined
IMMUTABILITY
Stream manager does not reuse any ProtoBuf objects
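The eager deserialization problem can be illustrated with a toy wire format. The layout below (a fixed header followed by an opaque payload) and the use of JSON for the payload are assumptions made for the example; Heron's real messages are Protocol Buffers with a different layout. The point is only that a router can forward a message after reading the header and never needs to parse the tuple.

```python
import json
import struct

# Hypothetical wire format: 4-byte destination task id, 4-byte payload length,
# then the serialized tuple. This is not Heron's actual message layout.
HEADER = struct.Struct("!II")


def encode(dest_task, tuple_values):
    payload = json.dumps(tuple_values).encode("utf-8")
    return HEADER.pack(dest_task, len(payload)) + payload


def route(message):
    """Stream-manager side: read only the header, leave the payload as bytes."""
    dest_task, length = HEADER.unpack_from(message)
    return dest_task, message[HEADER.size:HEADER.size + length]


def decode_payload(payload):
    """Instance side: deserialize the tuple only where its contents are used."""
    return json.loads(payload.decode("utf-8"))


message = encode(7, ["heron", 42])
dest, raw = route(message)
print(dest, decode_payload(raw))
```

Routing costs one fixed-size header read regardless of tuple size, and the payload is deserialized once, at the instance that consumes it.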
HERON: PERFORMANCE
At most once semantics
[Figure: THROUGHPUT (million tuples/min) vs. SPOUT PARALLELISM (25, 100, 200), without and with optimizations]
[Figure: THROUGHPUT PER CORE (million tuples/min) vs. SPOUT PARALLELISM (25, 100, 200), without and with optimizations]
HERON: PERFORMANCE
At least once semantics
[Figure: THROUGHPUT (million tuples/min) vs. SPOUT PARALLELISM (25, 100, 200), without and with optimizations]
[Figure: LATENCY (millisecs) vs. SPOUT PARALLELISM (25, 100, 200), without and with optimizations]
HERON: PERFORMANCE
At least once semantics - Impact of Cache Drain Frequency
[Figure: THROUGHPUT VS CACHE DRAIN FREQUENCY, million tuples/min against cache drain frequency (ms), series 200, 100, 25]
[Figure: LATENCY VS CACHE DRAIN FREQUENCY, latency (ms) against cache drain frequency (ms), series 200, 100, 25]
WE ARE HIRING!
Halbert Nakagawa, Co-Founder & CTO
Francois Orsini, CTO
Josh Lulewicz, Head of Data Platform
Karthik Ramasamy, Manager
QUESTIONS & ANSWERS
Go ahead.
Don't hesitate.
READINGS
STORM @ TWITTER
A. Toshniwal et al., SIGMOD 2014.
TWITTER HERON: STREAM PROCESSING AT SCALE
S. Kulkarni et al., SIGMOD 2015.
STREAMING @ TWITTER
M. Fu, 2016.
TWITTER HERON: TOWARDS EXTENSIBLE STREAMING ENGINES
M. Fu, ICDE 2017.
READINGS
LIMIT THEOREMS FOR THE MEDIAN DEVIATION
P. Hall and A. H. Welsh, 1985.
ALTERNATIVES TO THE MEDIAN ABSOLUTE DEVIATION
P. J. Rousseeuw and C. Croux, 1993.
ASYMPTOTIC INDEPENDENCE OF MEDIAN AND MAD
M. Falk, 1997.
BAHADUR REPRESENTATIONS FOR THE MEDIAN ABSOLUTE DEVIATION AND ITS
MODIFICATIONS
S. Mazumder and R. Serfling, 2009.
THE MINIMUM REGULARIZED COVARIANCE DETERMINANT ESTIMATOR
K. Boudt, P. J. Rousseeuw, S. Vanduffel and T. Verdonck, 2017.
THANK YOU
For your attention!
