StreamApprox
Approximate Stream Analytics in Apache Flink
Sep 2017
Do Le Quoc, Pramod Bhatotia, Ruichuan Chen, Christof Fetzer, Volker Hilt, Thorsten Strufe

Modern online services
[Diagram: Stream Aggregator → Stream Analytics System → Useful Information]

Modern online services
Tension: low latency vs. efficient resource utilization
Approximate computing

Approximate Computing
Many applications: approximate output is good enough!
E.g.: Google Trends, Big Data vs Machine Learning (Sep/2012 to Sep/2017)
The trend of data is more important than the precise numbers

Approximate Computing
Approximate computing: Take a sample → Compute → Approximate output ± Error bound
Idea: To achieve low latency, compute over a subset of data items instead of the entire dataset
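
The idea above can be illustrated with a minimal sketch, assuming the query is a sum and the sample is a simple random sample. The function name `approximate_sum`, the 95% confidence level (z = 1.96), and the synthetic data are assumptions made for illustration only and are not taken from the slides.

```python
import math
import random
import statistics


def approximate_sum(items, sample_size, z=1.96, seed=0):
    """Estimate sum(items) from a simple random sample.

    Returns (estimate, error_bound), where error_bound is the half-width of
    an approximate confidence interval (z = 1.96 ~ 95%, normal approximation).
    """
    n = len(items)
    rng = random.Random(seed)
    sample = rng.sample(items, sample_size)  # sampling without replacement

    mean = statistics.fmean(sample)
    variance = statistics.variance(sample)   # unbiased sample variance

    # Finite-population correction: the bound shrinks to 0 when sample_size == n.
    fpc = (n - sample_size) / n
    std_error = n * math.sqrt(variance / sample_size * fpc)
    return n * mean, z * std_error


# Hypothetical data set: 100,000 values, of which only 1% are inspected.
data = [float(x % 100) for x in range(100_000)]
estimate, bound = approximate_sum(data, sample_size=1_000)
print(f"approximate sum = {estimate:.0f} ± {bound:.0f} (exact: {sum(data):.0f})")
```

A larger `sample_size` tightens the error bound at the cost of touching more items, which is the latency/accuracy trade-off the slide describes.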

State-of-the-art systems
• ApproxHadoop [ASPLOS’15]: using multi-stage sampling
• BlinkDB [EuroSys’13]: using pre-existing samples
• Quickr [SIGMOD’16]: injecting samplers into query plan
Not designed for stream analytics

StreamApprox: Design goals
• Practical: supports adaptive execution based on query budget
• Efficient: employs online sampling techniques
• Transparent: targets existing applications with minor code changes

Outline
• Motivation
• Design
• Evaluation

StreamApprox: Overview
Inputs:
• Input data stream: sub-streams S1, S2, …, Sn combined by a stream aggregator (e.g. Kafka)
• Streaming query
• Query budget
Output: approximate output ± error bound
Query budget:
• Latency/throughput guarantees
• Desired computing resources for query processing
• Desired accuracy

Key idea: Sampling
• Simple random sampling (SRS)
• Stratified sampling (STS): SRS applied to each stratum

Key idea: Sampling
Reservoir sampling (RS): size of reservoir = k; each incoming item i either replaces an item in the reservoir or is dropped
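
The replace-or-drop rule can be sketched as follows. This is the textbook single-pass reservoir algorithm, given as an illustration rather than as the StreamApprox source code; the function name `reservoir_sample` is an assumption.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of up to k items from an iterable."""
    if rng is None:
        rng = random.Random()
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(item)      # fill phase: keep the first k items
        else:
            j = rng.randrange(i)        # uniform integer in [0, i)
            if j < k:
                reservoir[j] = item     # replace a stored item by item i
            # otherwise: drop item i
    return reservoir


sample = reservoir_sample(range(1_000), k=4, rng=random.Random(42))
print(sample)
```

Item i is kept with probability k/i, so every item seen so far is equally likely to be in the reservoir, and the stream length does not need to be known in advance.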

Spark-based Sampling
Spark-based Stratified Sampling (Spark-based STS):
• Step #1: Create strata using groupByKey()
• Step #2: Apply SRS to each stratum Si
• Step #3: Synchronize between worker nodes to select a sample of size k
These steps are very expensive

StreamApprox: Core idea
Online Adaptive Stratified Reservoir Sampling (OASRS)
• One reservoir sampler (RS: Reservoir Sampling) per sub-stream S1, S2, S3; size of reservoir = k (here k = 4)
• Weight = #items/k when a sub-stream has more than k items (in the example: 8/4 and 6/4); otherwise Weight = 1
• Easy to parallelize, doesn't need any synchronization between workers
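
A minimal sketch of the per-stratum reservoirs and weights, assuming a fixed reservoir size k for every stratum (the adaptive sizing that the name OASRS implies is left out) and a sum query. The class name `StratifiedReservoirSampler`, its methods, and the stratum labels are hypothetical.

```python
import random
from collections import defaultdict


class StratifiedReservoirSampler:
    """One fixed-size reservoir per stratum (sub-stream), filled in one pass."""

    def __init__(self, k, seed=None):
        self.k = k
        self.rng = random.Random(seed)
        self.counts = defaultdict(int)        # items seen per stratum
        self.reservoirs = defaultdict(list)   # sampled items per stratum

    def add(self, stratum, value):
        self.counts[stratum] += 1
        seen = self.counts[stratum]
        reservoir = self.reservoirs[stratum]
        if seen <= self.k:
            reservoir.append(value)           # reservoir not full yet
        else:
            j = self.rng.randrange(seen)      # uniform integer in [0, seen)
            if j < self.k:
                reservoir[j] = value          # replace a stored item

    def weight(self, stratum):
        # Weight = #items / k once the stratum overflowed its reservoir, else 1.
        seen = self.counts[stratum]
        return seen / self.k if seen > self.k else 1.0

    def estimate_sum(self):
        # Each sampled item stands in for `weight` items of its stratum.
        return sum(
            self.weight(s) * sum(values) for s, values in self.reservoirs.items()
        )


# Illustrative strata with 8, 6, and 3 items and k = 4.
sampler = StratifiedReservoirSampler(k=4, seed=0)
for stratum, n_items in (("a", 8), ("b", 6), ("c", 3)):
    for _ in range(n_items):
        sampler.add(stratum, 1.0)
print({s: sampler.weight(s) for s in ("a", "b", "c")})
```

Because each sampler only needs its own counters and reservoirs, separate workers can run one instance each without coordinating during sampling.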

StreamApprox: Core idea
Each worker runs OASRS independently (size of reservoir = 4):
• Worker 1 and Worker 2 each hold reservoirs with Weight = 2, Weight = 1.5, and Weight = 1

Implementation
[Diagram: sub-streams S1, S2, …, Sn → Stream Aggregator → StreamApprox → Approximate output ± Error bound]

Implementation
Flink-based StreamApprox:
• Stream aggregator (S1, S2, …, Sn) → Sampling module → Flink Computation Engine → Error Estimation module → Output ± Error bound
• The Error Estimation module produces refined sampling parameters for the Sampling module
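
The slide does not say how the sampling parameters are refined. One possible rule is sketched below, assuming the error bound shrinks roughly as 1/sqrt(sample size); the function `refine_sample_size`, its parameters, and the clamping limits are hypothetical and not part of StreamApprox.

```python
import math


def refine_sample_size(current_size, observed_error, target_error,
                       min_size=1, max_size=1_000_000):
    """Suggest the next sample size from the last window's error bound.

    Assumes the error bound scales as 1 / sqrt(sample size), so reaching the
    target needs about (observed / target)^2 times as many sampled items.
    """
    if target_error <= 0:
        raise ValueError("target_error must be positive")
    scale = (observed_error / target_error) ** 2
    proposed = math.ceil(current_size * scale)
    # Clamp to the resource limits given by the (hypothetical) query budget.
    return max(min_size, min(max_size, proposed))


# Error twice as large as desired: ask for four times as many samples.
print(refine_sample_size(1_000, observed_error=2.0, target_error=1.0))
```

The same rule shrinks the sample when the observed error is already below the target, which frees resources for other queries.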

Outline
• Motivation
• Design
• Evaluation

Experimental setup
• Evaluation questions
  • Throughput vs sample size
  • Throughput vs accuracy
• Testbed
  • Cluster: 17 nodes
  • Datasets:
    • Synthetic: Gaussian distribution, Poisson distribution datasets
    • CAIDA network traffic traces; NYC Taxi ride records
See the paper for more results!

Throughput
[Figure: throughput (M items/s) vs. sampling fraction (%) for Flink-based StreamApprox, Spark-based StreamApprox, and Spark-based STS; higher is better]
With sampling fraction < 60%:
• Spark-based StreamApprox: ~2X higher throughput over Spark-based STS
• Flink-based StreamApprox: 1.3X higher throughput over Spark-based StreamApprox

Throughput vs Accuracy
[Figure: throughput (M items/s) vs. accuracy loss (%) for Flink-based StreamApprox, Spark-based StreamApprox, and Spark-based STS; higher is better]
With the same accuracy loss:
• Spark-based StreamApprox: ~1.32X higher throughput over Spark-based STS
• Flink-based StreamApprox: 1.62X higher throughput over Spark-based StreamApprox

Conclusion
StreamApprox: Approximate computing for stream analytics
• Practical: adaptive execution based on query budget
• Efficient: online stratified sampling technique
• Transparent: supports applications with minor code changes
Thank you!
Details: StreamApprox [Middleware’17]
https://streamapprox.github.io

Flink Forward Berlin 2017: Pramod Bhatotia, Do Le Quoc - StreamApprox: Approximate computing for stream analytics in Apache Flink
