Flink Forward Berlin 2017: Pramod Bhatotia, Do Le Quoc - StreamApprox: Approximate computing for stream analytics in Apache Flink

StreamApprox: Approximate Stream Analytics in Apache Flink
Sep 2017
Do Le Quoc, Pramod Bhatotia, Ruichuan Chen, Christof Fetzer, Volker Hilt, Thorsten Strufe

Modern online services
[Diagram: Stream Aggregator → Stream Analytics System → Useful Information]

Modern online services
Tension: low latency vs. efficient resource utilization
Approximate computing

Approximate Computing
Many applications:
Approximate output is good enough!
E.g., Google Trends: Big Data vs. Machine Learning (Sep 2012 – Sep 2017)
The trend of data is more important than the precise numbers

Approximate Computing
Idea: To achieve low latency, compute over a subset of data items instead of the entire dataset
Take a sample → Compute → Approximate output ± Error bound
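
The "take a sample, compute, report output ± error bound" flow can be illustrated with a minimal Python sketch. It is not code from StreamApprox: the function name `approximate_mean`, the choice of the mean as the query, and the 95% normal-approximation bound (z = 1.96) are assumptions made for illustration.

```python
import math
import random
import statistics

def approximate_mean(items, sample_size, z=1.96, seed=None):
    """Estimate the mean of `items` from a simple random sample.

    Returns (estimate, error_bound). The bound is the half-width of an
    approximate 95% confidence interval (z = 1.96, an assumed default)
    and ignores the finite population correction.
    """
    rng = random.Random(seed)
    sample = rng.sample(items, sample_size)      # sample without replacement
    estimate = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, z * std_err

# Usage: process 1% of the items instead of all of them.
data = [float(x % 100) for x in range(10_000)]
est, err = approximate_mean(data, sample_size=100, seed=42)
print(f"approximate mean = {est:.2f} ± {err:.2f}")
```

A larger sample narrows the error bound but costs more computation, which is the trade-off the query budget controls.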

State-of-the-art systems
• ApproxHadoop [ASPLOS'15]: Using multi-stage sampling
• BlinkDB [EuroSys'13]: Using pre-existing samples
• Quickr [SIGMOD'16]: Injecting samplers into query plan
Not designed for stream analytics

StreamApprox: Design goals
• Practical: Supports adaptive execution based on query budget
• Efficient: Employs online sampling techniques
• Transparent: Targets existing applications w/ minor code changes

Outline
• Motivation
• Design
• Evaluation

StreamApprox: Overview
[Diagram: sub-streams S1, S2, …, Sn → Stream Aggregator (e.g., Kafka) → input data stream → StreamApprox → Approximate output ± Error bound; StreamApprox also takes a streaming query and a query budget as inputs]
Query budget:
• Latency/throughput guarantees
• Desired computing resources for query processing
• Desired accuracy

Key idea: Sampling
Simple random sampling (SRS):
Stratified sampling (STS):
[Figure: SRS applied to each stratum]
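
The difference between SRS and STS can be sketched in a few lines of Python. This is an offline illustration only, not the StreamApprox implementation: the function name `stratified_sample`, the `key` callback, and the "at least one item per stratum" rule are assumptions.

```python
import random
from collections import defaultdict

def stratified_sample(items, key, fraction, seed=None):
    """Group `items` into strata by `key`, then apply SRS to each stratum.

    Each stratum contributes round(fraction * stratum_size) items, but at
    least one (an assumed rule), so small strata are still represented.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = {}
    for name, members in strata.items():
        n = max(1, round(fraction * len(members)))
        sample[name] = rng.sample(members, n)   # SRS within the stratum
    return sample

# Usage: one large and one small sub-stream.
events = [("S1", i) for i in range(1000)] + [("S2", i) for i in range(10)]
picked = stratified_sample(events, key=lambda e: e[0], fraction=0.1, seed=7)
print({name: len(chosen) for name, chosen in picked.items()})
```

Plain SRS over all 1010 events could easily return no S2 items at all; sampling per stratum guarantees every sub-stream appears in the sample.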

Key idea: Sampling
Reservoir sampling (RS):
Size of reservoir = k
For each incoming item i: either replace an item in the reservoir by item i, or drop item i
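
The replace-or-drop decision can be written out as a short Python sketch. It follows the classic reservoir sampling scheme (often called Algorithm R), in which the i-th item is kept with probability k/i; the slide does not state which variant StreamApprox uses, so treat this as an assumption.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of size k from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(item)      # the first k items fill the reservoir
        else:
            j = rng.randrange(i)        # uniform integer in [0, i)
            if j < k:
                reservoir[j] = item     # replace a stored item by item i
            # otherwise: drop item i
    return reservoir

# Usage: sample 4 items from a stream of 1000 without storing the stream.
print(reservoir_sample(range(1000), k=4, seed=1))
```

The sampler needs only O(k) memory and a single pass, which is what makes it suitable for unbounded streams.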

Spark-based Sampling
Spark-based Stratified Sampling (Spark-based STS)
Step #1: Create strata using groupByKey()
Step #2: Apply SRS to each stratum Si
Step #3: Synchronize between worker nodes to select a sample of size k
These steps are very expensive

StreamApprox: Core idea
Online Adaptive Stratified Reservoir Sampling (OASRS)
RS: Reservoir Sampling; size of reservoir k = 4
S1: RS, Weight = #items/k = 8/4
S2: RS, Weight = #items/k = 6/4
S3: RS, Weight = 1
Easy to parallelize, doesn't need any synchronization between workers
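
The per-stratum reservoirs and weights can be sketched as follows. This is a minimal illustration for a single time interval, not the StreamApprox code: the class name `OASRSSampler` and its methods are hypothetical, and the rule "weight = 1 when a stratum has at most k items" is inferred from the S3 example. The S3 item count (3) is also assumed, since the slide only gives its weight.

```python
import random

class OASRSSampler:
    """One reservoir of size k per stratum, plus a per-stratum item counter."""

    def __init__(self, k, seed=None):
        self.k = k
        self.rng = random.Random(seed)
        self.reservoirs = {}   # stratum -> sampled values
        self.counts = {}       # stratum -> number of items seen so far

    def add(self, stratum, value):
        count = self.counts.get(stratum, 0) + 1
        self.counts[stratum] = count
        reservoir = self.reservoirs.setdefault(stratum, [])
        if count <= self.k:
            reservoir.append(value)
        else:
            j = self.rng.randrange(count)
            if j < self.k:
                reservoir[j] = value   # replace; otherwise the item is dropped

    def weight(self, stratum):
        # #items / k when the reservoir overflowed, else 1 (every item was kept)
        return max(1.0, self.counts[stratum] / self.k)

    def estimated_sum(self):
        return sum(self.weight(s) * sum(r) for s, r in self.reservoirs.items())

# Usage mirroring the slide: k = 4, S1 sees 8 items, S2 sees 6, S3 sees 3 (assumed).
sampler = OASRSSampler(k=4, seed=0)
for stratum, n in (("S1", 8), ("S2", 6), ("S3", 3)):
    for _ in range(n):
        sampler.add(stratum, 1.0)
print({s: sampler.weight(s) for s in ("S1", "S2", "S3")})
print(sampler.estimated_sum())
```

Because every stratum keeps its own counter and reservoir, no stratum needs to know anything about the others, which is why the scheme parallelizes without coordination.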

StreamApprox: Core idea
Size of reservoir = 4
Worker 1 (OASRS): Weight = 2, Weight = 1.5, Weight = 1
Worker 2 (OASRS): Weight = 1, Weight = 2, Weight = 1.5
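
One way to see why the workers need no synchronization: each worker can turn its own weighted samples into partial results, and the partial results are simply added afterwards. The sketch below assumes a sum query; the helper names `local_summary` and `merge_summaries`, the sample values, and the mapping of the weights shown above to strata S1 to S3 are all illustrative assumptions.

```python
def local_summary(weighted_samples):
    """Per-worker step: {stratum: (weight, values)} -> {stratum: weighted sum}."""
    return {s: w * sum(vals) for s, (w, vals) in weighted_samples.items()}

def merge_summaries(summaries):
    """Combine per-worker partial sums; workers never exchange samples."""
    total = {}
    for summary in summaries:
        for stratum, partial in summary.items():
            total[stratum] = total.get(stratum, 0.0) + partial
    return total

# Usage with the weights shown for the two workers (values are made up).
worker1 = {"S1": (2.0, [1, 1, 1, 1]), "S2": (1.5, [1, 1, 1, 1]), "S3": (1.0, [1, 1, 1])}
worker2 = {"S1": (1.0, [1, 1]), "S2": (2.0, [1, 1, 1, 1]), "S3": (1.5, [1, 1, 1, 1])}
merged = merge_summaries([local_summary(worker1), local_summary(worker2)])
print(merged)
```

Each worker uses its own local weights, so the only shared step is the final addition of partial sums.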

Implementation
[Diagram: sub-streams S1, S2, …, Sn → Stream Aggregator → data stream → StreamApprox → Approximate output ± Error bound]

Implementation
Flink-based StreamApprox
[Diagram: sub-streams S1, S2, …, Sn → Stream aggregator → Sampling module → Flink Computation Engine → Error Estimation module → Output ± Error bound; the Error Estimation module feeds refined sampling parameters back to the Sampling module]
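
What an error estimation step could compute is sketched below for a sum query. The slides do not spell out the estimator, so this uses the textbook variance formula for stratified sampling without replacement; the function name `estimate_sum_with_bound` and the 95% factor z = 1.96 are assumptions.

```python
import math
import statistics

def estimate_sum_with_bound(strata, z=1.96):
    """Estimate a total and its error bound from per-stratum samples.

    `strata` is a list of (items_seen, sampled_values) pairs, one per stratum.
    Variance uses the standard stratified estimator:
        sum over strata of C * (C - n) * s^2 / n
    where C = items seen, n = sample size, s^2 = sample variance.
    """
    total = 0.0
    variance = 0.0
    for items_seen, sample in strata:
        n = len(sample)
        if n == 0:
            continue                       # nothing sampled from this stratum
        total += items_seen / n * sum(sample)
        if n > 1 and items_seen > n:       # fully observed strata add no error
            s2 = statistics.variance(sample)
            variance += items_seen * (items_seen - n) * s2 / n
    return total, z * math.sqrt(variance)

# Usage: one stratum sampled 4 of 8 items, another kept all 3 of its items.
est, bound = estimate_sum_with_bound([(8, [1.0, 2.0, 3.0, 4.0]), (3, [5.0, 5.0, 5.0])])
print(f"estimated sum = {est:.2f} ± {bound:.2f}")
```

If the bound exceeds what the query budget allows, the sampling parameters (for example, the reservoir size) can be refined for the next interval, which is the feedback path shown in the diagram.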

Outline
• Motivation
• Design
• Evaluation

Experimental setup
• Evaluation questions
  • Throughput vs sample size
  • Throughput vs accuracy
• Testbed
  • Cluster: 17 nodes
  • Datasets:
    • Synthetic: Gaussian distribution, Poisson distribution datasets
    • CAIDA network traffic traces; NYC Taxi ride records
See the paper for more results!

Throughput
[Plot: throughput (M items/s, higher is better) vs. sampling fraction (%) at 10, 20, 40, 60, 80; series: Spark-based StreamApprox, Spark-based STS]
Spark-based StreamApprox: ~2X higher throughput over Spark-based STS
Flink-based StreamApprox: 1.3X higher throughput over Spark-based StreamApprox
With sampling fraction < 60%

Throughput vs Accuracy
[Plot: throughput (M items/s, higher is better) vs. accuracy loss (%) at 0.5, 1; series: Spark-based StreamApprox, Spark-based STS]
Spark-based StreamApprox: ~1.32X higher throughput over Spark-based STS
Flink-based StreamApprox: 1.62X higher throughput over Spark-based StreamApprox
With the same accuracy loss

Conclusion
StreamApprox: Approximate computing for stream analytics
• Practical: Adaptive execution based on query budget
• Efficient: Online stratified sampling technique
• Transparent: Supports applications w/ minor code changes
Thank you!
Details: StreamApprox [Middleware'17]
https://streamapprox.github.io
