Nitish Upreti 
nzu100@cse.psu.edu 
1
Goal : Solve Big Data ! 
2 
How to achieve the best 
Performance ?
100 TB on 1000 machines: 
Hard disks: ½ - 1 hour 
Memory: 1 - 5 minutes 
?: 1 second 
3
Better and Faster 
Frameworks ? 
[Diagram: … evolved to … evolves to … ?] 
4
If we cannot do better than 
In-Memory, then what? 
5
Can we use Approximate 
Computing ? 
6
Can you tolerate Errors ? 
Well, it depends on the 
scenario, right… 
7
Overview of Big Data Space 
8
Massive Log Batch Processing 
9
Can we use Approximate 
Computing ? 
Answer : YES / NO 
10
Streaming data processing 
11
Can we use Approximate 
Computing ? 
Answer : MAYBE 
12
Exploratory Data Analysis 
13
Exploratory / Interactive 
Data Processing 
-- Getting a sense of data (Data 
Scientists) 
-- Debugging ? (SREs / DevOps) 
14
Can we use Approximate 
Computing ? 
Answer : YES ! 
15
1) BlinkDB : Queries with Bounded Errors and 
Bounded Response Times on Very Large Data. 
2) Blink and It’s Done : Interactive Queries on Very 
Large Data. 
3) A General Bootstrap Performance Diagnostic. 
4) Knowing When You’re Wrong : Building Fast and 
Reliable Approximate Query Processing Systems. 
Sameer Agarwal, Ariel Kleiner, Henry Milner, Barzan Mozafari, 
Ameet Talwalkar, Michael Jordan, Samuel Madden, Ion Stoica
Our Goal 
Support interactive SQL-like aggregate 
queries over massive sets of data 
17
Our Goal 
Support interactive SQL-like aggregate 
queries over massive sets of data 
blinkdb> SELECT AVG(jobtime) 
FROM very_big_log 
AVG, COUNT, 
SUM, STDEV, 
PERCENTILE etc. 
18
Support interactive SQL-like aggregate 
queries over massive sets of data 
blinkdb> SELECT AVG(jobtime) 
FROM very_big_log 
WHERE src = ‘hadoop’ 
FILTERS, GROUP BY clauses 
Our Goal 
19
Support interactive SQL-like aggregate 
queries over massive sets of data 
blinkdb> SELECT AVG(jobtime) 
FROM very_big_log 
LEFT OUTER JOIN logs2 
ON very_big_log.id = logs2.id 
WHERE src = ‘hadoop’ 
JOINS, Nested Queries etc. 
Our Goal 
20
Support interactive SQL-like aggregate 
queries over massive sets of data 
blinkdb> SELECT my_function(jobtime) 
FROM very_big_log 
LEFT OUTER JOIN logs2 
ON very_big_log.id = logs2.id 
WHERE src = ‘hadoop’ 
ML Primitives, 
User Defined Functions 
Our Goal 
21
Our Goal 
Support interactive SQL-like aggregate 
queries over massive sets of data 
blinkdb> SELECT my_function(jobtime) 
FROM very_big_log 
WHERE src = ‘hadoop’ 
ERROR WITHIN 10% AT CONFIDENCE 95% 
22
Our Goal 
Support interactive SQL-like aggregate 
queries over massive sets of data 
blinkdb> SELECT my_function(jobtime) 
FROM very_big_log 
WHERE src = ‘hadoop’ 
WITHIN 5 SECONDS 
23
Query Execution on Samples 
(Exploration Query) What is the average buffering 
ratio in the table? 

ID | City     | Buff Ratio 
 1 | NYC      | 0.78 
 2 | NYC      | 0.13 
 3 | Berkeley | 0.25 
 4 | NYC      | 0.19 
 5 | NYC      | 0.11 
 6 | Berkeley | 0.09 
 7 | NYC      | 0.18 
 8 | NYC      | 0.15 
 9 | Berkeley | 0.13 
10 | Berkeley | 0.49 
11 | NYC      | 0.19 
12 | Berkeley | 0.10 

0.2325 (Precise) 
24
Query Execution on Samples 
What is the average buffering 
ratio in the table? 

ID | City     | Buff Ratio 
 1 | NYC      | 0.78 
 2 | NYC      | 0.13 
 3 | Berkeley | 0.25 
 4 | NYC      | 0.19 
 5 | NYC      | 0.11 
 6 | Berkeley | 0.09 
 7 | NYC      | 0.18 
 8 | NYC      | 0.15 
 9 | Berkeley | 0.13 
10 | Berkeley | 0.49 
11 | NYC      | 0.19 
12 | Berkeley | 0.10 

Uniform Sample: 

ID | City     | Buff Ratio | Sampling Rate 
 2 | NYC      | 0.13       | 1/4 
 3 | Berkeley | 0.25       | 1/4 
 4 | NYC      | 0.19       | 1/4 

0.2325 (Precise) 
0.19 (Estimate) 
25
Query Execution on Samples 
What is the average buffering 
ratio in the table? 

ID | City     | Buff Ratio 
 1 | NYC      | 0.78 
 2 | NYC      | 0.13 
 3 | Berkeley | 0.25 
 4 | NYC      | 0.19 
 5 | NYC      | 0.11 
 6 | Berkeley | 0.09 
 7 | NYC      | 0.18 
 8 | NYC      | 0.15 
 9 | Berkeley | 0.13 
10 | Berkeley | 0.49 
11 | NYC      | 0.19 
12 | Berkeley | 0.10 

Uniform Sample: 

ID | City     | Buff Ratio | Sampling Rate 
 2 | NYC      | 0.13       | 1/4 
 3 | Berkeley | 0.25       | 1/4 
 4 | NYC      | 0.19       | 1/4 

0.2325 (Precise) 
0.19 +/- 0.05 (Estimate with error bars) 
26
Query Execution on Samples 
What is the average buffering 
ratio in the table? 

ID | City     | Buff Ratio 
 1 | NYC      | 0.78 
 2 | NYC      | 0.13 
 3 | Berkeley | 0.25 
 4 | NYC      | 0.19 
 5 | NYC      | 0.11 
 6 | Berkeley | 0.09 
 7 | NYC      | 0.18 
 8 | NYC      | 0.15 
 9 | Berkeley | 0.13 
10 | Berkeley | 0.49 
11 | NYC      | 0.19 
12 | Berkeley | 0.10 

Uniform Sample: 

ID | City     | Buff Ratio | Sampling Rate 
 2 | NYC      | 0.13       | 1/2 
 3 | Berkeley | 0.25       | 1/2 
 4 | NYC      | 0.19       | 1/2 
 6 | Berkeley | 0.09       | 1/2 
 7 | NYC      | 0.18       | 1/2 
10 | Berkeley | 0.49       | 1/2 

0.2325 (Precise) 
0.19 +/- 0.05 (at rate 1/4) 
0.22 +/- 0.02 (at rate 1/2) 
27
Speed/Accuracy Trade-off 
[Chart: Error vs. Execution Time (Sample Size). Time to execute on the 
entire dataset: ~30 mins. Interactive queries: ~2 sec.] 28
Speed/Accuracy Trade-off 
[Chart: the same Error vs. Execution Time (Sample Size) curve, now with a 
"Pre-Existing Noise" floor: error cannot drop below the noise already 
present in the data.] 29
Where do you want to be 
on the curve ? 
30
Sampling Vs No Sampling on 
100 Machines 
[Chart: Query Response Time (seconds) vs. fraction of the full 10 TB dataset. 
Full data: 1,020 s. Samples of 10^-1 through 10^-5 of the data: 102, 18, 13, 
10, and 8 s. The first step is only ~10x faster, as response time 
is dominated by I/O.] 
31
Sampling Vs No Sampling 
[Chart: Query Response Time (seconds) vs. fraction of the full data, with 
error bars. Full data: 1,020 s (exact). Samples of 10^-1 through 10^-5: 
103, 18, 13, 10, and 8 s, with errors of 0.02%, 0.07%, 1.1%, 3.4%, and 
11% respectively.] 
32
Okay, so you can tolerate 
errors… 
What are some of the fundamental 
challenges ? 
What types of samples to create ? (We cannot 
sample everything.) 
This boils down to : what is our assumption about the 
nature of the future query workload ? 
33
Usual Assumption: Future 
queries are SIMILAR to 
past queries. 
What is Similarity ? 
( Choosing the wrong notion carries a heavy penalty : 
under- / over-fitting. ) 
34
Workload Taxonomy 
35
Predictable QCS 
• A QCS (Query Column Set) is the set of columns a 
query uses for grouping and filtering. 
• Fits the model of exploratory queries well. 
(Queries are usually distinct, but most use 
the same column sets.) 
• What kinds of videos are popular in a region ? 
- Requires looking at data from thousands of videos and 
hundreds of geographical regions. However, the column 
sets are fixed : “video titles” (for grouping) and “viewer location” (for 
filtering). 
• Backed by empirical evidence from Conviva & 
Facebook. This is the key reason for BlinkDB’s 
efficiency. ( Lots of related work in database theory. ) 36
37
BlinkDB Overview 
38
What is BlinkDB? 
A framework built on Shark and Spark that … 
- Creates and maintains a variety of offline 
samples from the underlying data. 
- Returns fast, approximate answers with error bars 
by executing queries on samples of the data 
(using a runtime Error Latency Profile for 
sample selection). 
- Verifies the correctness of the error bars that it 
returns at runtime. 
39
1) Sample Creation 
40
Building Samples for Queries 
• Uniform Sampling vs. Stratified Sampling. 
• Uniform sampling is inefficient for 
queries that compute aggregates per 
group : 
- We could simply miss or under-represent a group. 
- We care equally about the error of each query; with uniform 
sampling we would assign more samples to groups 
that are already well represented. 
• Solution : Make sample size assignment 
deterministic rather than random. This can be 
achieved with stratified sampling (a minimal sketch follows). 41
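A minimal sketch (not BlinkDB code; the table, group names, and sizes are hypothetical) of why uniform sampling misses rare groups while stratified sampling covers every group:

```python
import random
from collections import defaultdict

random.seed(0)
# Hypothetical skewed table: 10,000 NYC rows, 50 Berkeley rows.
table = [("NYC", random.random()) for _ in range(10_000)] + \
        [("Berkeley", random.random()) for _ in range(50)]

# Uniform sample: the rare group is often missed entirely.
uniform = random.sample(table, 100)
print({city for city, _ in uniform})  # 'Berkeley' frequently absent

# Stratified sample: a deterministic K rows per group (capped at group size).
K = 100
groups = defaultdict(list)
for city, ratio in table:
    groups[city].append(ratio)
stratified = {city: random.sample(rows, min(K, len(rows)))
              for city, rows in groups.items()}
print({city: len(rows) for city, rows in stratified.items()})  # every group covered
```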
Some Terminology … 
42
QCS to Sample On 
43
What QCS to sample on ? 
• Formulated as an optimization problem, 
where the three major factors to consider are : 
the “sparsity” of the data, “workload characteristics”, 
and the “storage cost of samples”. 
• Sparsity : Define a sparsity function as the 
number of groups whose size in the table ‘T’ is less 
than some number ‘M’. 
44
QCS to sample on (contd.)… 
• Workload : A query with QCS ‘qj’ has some 
unknown probability ‘pj’. The best estimate of 
pj is the past frequency of queries with QCS qj. 
• Storage Cost : Assume a simple formulation 
where K is the same for each group. For a set of 
columns ϕ, the storage cost is |S(ϕ,K)| : 
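(The formula itself appeared as an image on the original slide; a reconstruction consistent with the surrounding definitions, where each group x of ϕ contributes at most K rows to the stratified sample, is: 

|S(ϕ, K)| = Σ_x min(K, |T_x|), summing over the groups x of ϕ.) 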
45
Goal : Maximize the 
weighted sum of coverage. 
where ‘coverage’ for a query ‘qi’ given a sample is 
defined as the probability that a given value ‘x’ for the 
columns is also present among the rows of S(ϕi,K). 
46
Optimization Problem 
Optimize the following MILP : 
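(The MILP itself appeared as an image on the original slide. Schematically, as a paraphrase of the factors above rather than the paper’s exact formulation, it chooses 0/1 indicators z_i for which column sets ϕ_i to sample, maximizing expected coverage under a storage budget C: 

maximize Σ_j p_j · y_j 
subject to Σ_i z_i · |S(ϕ_i, K)| ≤ C, where y_j is the coverage of q_j under the chosen samples, and z_i ∈ {0, 1}.) 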
where ‘m’ is the number of possible QCS, ‘j’ indexes over all 
queries, and ‘i’ over all column sets. 47
How to sample ? 
48
Given a known QCS … 
• Compute the sample count for a group : 
- K = min( n′ / D(ϕ) , |T_x0| ) 
• Take samples as : 
- For each group x, sample K rows uniformly at random 
without replacement, forming sample Sx. 
• The entire sample S(ϕ,K) is the disjoint union 
of the Sx : 
- If |Tx| > K, we answer based on K random tuples; otherwise 
we can provide an exact answer. 
• For the aggregate functions AVG, SUM, COUNT 
and QUANTILE, K directly determines the error (see the sketch below). 49
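A minimal sketch (hypothetical rows and helper names, not BlinkDB code) of building S(ϕ, K) for a known QCS:

```python
import random
from collections import defaultdict

def stratified_sample(rows, qcs, n_budget):
    """rows: list of dicts; qcs: tuple of column names; n_budget: row budget n'."""
    groups = defaultdict(list)
    for r in rows:
        groups[tuple(r[c] for c in qcs)].append(r)
    # Per-group cap, mirroring K = min(n' / D(phi), |T_x0|) on the slide,
    # reading D(phi) as the number of distinct groups (an assumption).
    K = max(1, n_budget // len(groups))
    sample = []
    for g_rows in groups.values():
        # Groups smaller than K are taken whole -> exact answers for them.
        sample.extend(random.sample(g_rows, min(K, len(g_rows))))
    return sample, K

rows = [{"city": random.choice(["NYC", "Berkeley"]), "ratio": random.random()}
        for _ in range(1_000)]
sample, K = stratified_sample(rows, ("city",), n_budget=100)
```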
Sharing a QCS 
• Multiple queries with different ‘t’ and ‘n’ will 
share the same QCS. We need to select an 
appropriately sized subset from our stored sample, 
depending on the query. 
• We need an appropriate storage technique 
to allow such subsets to be identified at 
runtime. 
50
Storage Technique 
• The rows of a stratified sample S(ϕ,K) are 
stored sequentially according to the order of 
columns in ϕ. 
• When Sx is spread over multiple HDFS 
blocks, each block contains a random 
subset of Sx . 
• It is then enough to read any subset of the 
blocks comprising Sx, as long as those 
blocks contain the minimum number of 
needed records. 
51
Bij = Data Block 
(HDFS) 
52
Storage Requirement 
For a table with 1 billion (10^9) tuples and a 
column set with a Zipf distribution (heavy-tailed) 
with an exponent of 1.5, it turns out 
that the storage required by sample S(ϕ, K) is 
only 2.4% of the original table for K = 10^4, 
5.2% for K = 10^5, and 11.4% for K = 10^6. 
This is also consistent with real-world data 
from Conviva & Facebook. 
53
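A back-of-the-envelope check of this storage arithmetic (hypothetical parameters: 10^9 tuples over 100,000 Zipf-distributed groups; exact percentages vary with the group count):

```python
def storage_fraction(num_groups, exponent, K, table_size=1e9):
    """Fraction of the table stored when each Zipf-sized group is capped at K."""
    weights = [rank ** -exponent for rank in range(1, num_groups + 1)]
    scale = table_size / sum(weights)      # group sizes sum to table_size
    stored = sum(min(K, w * scale) for w in weights)
    return stored / table_size

for K in (10**4, 10**5, 10**6):
    print(f"K = {K:>7}: {storage_fraction(100_000, 1.5, K):.1%}")
# Prints a few percent for K = 10^4, growing with K: the same order of
# magnitude as the 2.4% / 5.2% / 11.4% figures above.
```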
What is BlinkDB? 
A framework built on Shark and Spark that … 
- creates and maintains a variety of samples from 
underlying data. 
- returns fast, approximate answers by executing 
queries on samples of data with error bars. 
- verifies the correctness of the error bars that it 
returns at runtime. 
54
2) BlinkDB Runtime 
55
Selecting a Sample 
• If BlinkDB finds one or more stratified 
samples on a set of columns ‘ϕi’ such that our 
query’s column set ‘q’ ⊆ ϕi , we pick the ϕi with the 
smallest number of columns (a selection sketch 
follows this slide). 
• If no such stratified sample exists, we run ‘q’ on 
in-memory subsets of all samples maintained 
by the system. Out of these, we select those 
with high selectivity. 
- Selectivity = Number of Rows Selected by q / Number of 
Rows Read by q 
56
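A sketch of the selection rule above (the data structures are hypothetical):

```python
def select_sample(query_qcs, stratified_samples):
    """query_qcs: frozenset of columns; stratified_samples: {frozenset: handle}."""
    # Candidates: stratified samples whose column set covers the query's QCS.
    covering = [cols for cols in stratified_samples if query_qcs <= cols]
    if covering:
        # Prefer the covering sample with the fewest columns.
        return stratified_samples[min(covering, key=len)]
    return None  # fall back to in-memory subsets, ranked by selectivity

handle = select_sample(frozenset({"src"}),
                       {frozenset({"src", "dst"}): "S1",
                        frozenset({"src"}): "S2"})   # -> "S2"
```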
Selecting the Right Sample Size 
• Construct an ELP (Error Latency Profile) that 
characterizes the rate at which the error 
decreases ( and time increases ) with increasing 
sample size, by running the query on smaller 
samples. 
• The scaling rate depends on query structure 
(JOINs, GROUP BY), physical data 
placement, and the underlying data distribution. 
57
Error Profile 
• Given Q’s error constraints : the idea is to predict the 
size of the smallest sample that satisfies the constraints. 
• Variance and closed-form aggregate functions are 
estimated using standard closed-form formulas. 
• BlinkDB also estimates query selectivity and the input 
data distribution by running the query on smaller 
subsamples. 
• The number of rows needed is then calculated using 
statistical error estimates. 
58
59
Latency Profile 
• Given Q’s time constraints : the idea is to predict the 
maximum-size sample on which we should run the 
query within the constraints. 
• The value of ‘n’ depends on the input data, physical 
placement on disk, query structure, and available 
resources. As a simplification, BlinkDB simply 
predicts ‘n’ by assuming latency scales linearly in 
input size. 
• For very small in-memory samples, BlinkDB runs a 
few smaller samples until performance appears to 
grow linearly, and then estimates the linear scaling 
constants (see the sketch below). 60
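A minimal sketch of this linear-scaling estimate (the timings are hypothetical):

```python
# Fit the linear latency model t = a*n + b from a few small runs,
# then invert it for the query's time budget.
timings = [(1_000, 0.9), (10_000, 1.4), (100_000, 6.0)]   # (rows n, seconds t)
n_vals, t_vals = zip(*timings)
n_mean = sum(n_vals) / len(n_vals)
t_mean = sum(t_vals) / len(t_vals)
a = sum((n - n_mean) * (t - t_mean) for n, t in timings) / \
    sum((n - n_mean) ** 2 for n in n_vals)                # least-squares slope
b = t_mean - a * n_mean
budget = 5.0                                              # e.g. WITHIN 5 SECONDS
max_n = int((budget - b) / a)   # largest sample size predicted to fit the budget
print(max_n)
```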
Correcting Bias 
• Running a query on a non-uniform sample 
introduces a certain amount of statistical bias, 
which can lead to subtle inaccuracies. 
• Solution : BlinkDB periodically replaces the 
samples in use via a low-priority background 
task that periodically ( daily ) re-samples the 
original data; the fresh samples are then used 
by the system. 
61
Error Estimation 
Closed Form Aggregate Functions 
- Central Limit Theorem 
- Applicable to AVG, COUNT, SUM, 
VARIANCE and STDEV 
62
Error Estimation 
Closed Form Aggregate Functions 
- Central Limit Theorem 
- Applicable to AVG, COUNT, SUM, 
VARIANCE and STDEV 
Generalized Aggregate Functions 
- Statistical Bootstrap 
- Applicable to complex and nested 
queries, UDFs, joins etc. 
- Very computationally expensive (a comparison sketch follows). 63
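A minimal sketch of both estimators on synthetic data (illustrative only; BlinkDB's actual estimators are more involved):

```python
import random
import statistics

random.seed(0)
sample = [random.random() for _ in range(1_000)]   # synthetic sampled values

# Closed form (CLT): 95% CI half-width for AVG = 1.96 * s / sqrt(n).
mean = statistics.fmean(sample)
half_width = 1.96 * statistics.stdev(sample) / len(sample) ** 0.5

# Bootstrap: re-run the aggregate on ~100 resamples of the sample.
# Works for arbitrary functions, but costs ~100x the computation.
replicas = sorted(
    statistics.fmean(random.choices(sample, k=len(sample)))
    for _ in range(100)
)
lo, hi = replicas[2], replicas[97]                 # empirical 95% interval
print(f"CLT: {mean:.4f} +/- {half_width:.4f}; bootstrap: [{lo:.4f}, {hi:.4f}]")
```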
But we are not done yet … 
• Statistical techniques like the CLT and the Bootstrap 
operate under a set of assumptions about the query / 
data. 
• We need to have some correctness verifiers ! 
64
What is BlinkDB? 
A framework built on Shark and Spark that … 
- creates and maintains a variety of samples from 
underlying data 
- returns fast, approximate answers with error bars 
by executing queries on samples of data 
- verifies the correctness of the error bars that it 
returns at runtime 
65
Kleiner’s Diagnostics [KDD ’13] 
[Chart: Error vs. Sample Size. More data → higher accuracy; 
~300 data points → 97% accuracy.] 
66
300 Data Points ≈ 30K 
Queries for Bootstrap ! (Each of the ~300 
diagnostic points needs ~100 bootstrap resamples.) 
67
So in an Approximate QP : 
• One query estimates the answer. 
• A hundred queries on resamples of the data 
compute the error. 
• Tens of thousands of queries verify whether this 
error is correct. 
• BAD PERFORMANCE ! 
• Solution : a Single-Pass Execution 
framework (sketched below). 68
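One way to realize a single pass is Poissonized resampling, where each row draws all of its bootstrap-replica weights during one scan. A minimal sketch (a plausible reading of "single pass", not necessarily BlinkDB's exact mechanism):

```python
import math
import random

random.seed(0)

def poisson1():
    # Knuth's method for sampling Poisson(lambda=1).
    L, k, p = math.exp(-1.0), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

sample = [random.random() for _ in range(1_000)]
R = 100                                  # bootstrap replicas
sums, counts = [0.0] * R, [0] * R
for v in sample:                         # single pass over the sample
    for r in range(R):
        w = poisson1()                   # row's multiplicity in replica r
        sums[r] += w * v
        counts[r] += w
estimates = sorted(s / c for s, c in zip(sums, counts))
lo, hi = estimates[2], estimates[97]     # empirical 95% interval
```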
What is BlinkDB? 
A framework built on Shark and Spark that … 
- creates and maintains a variety of samples from 
underlying data 
- returns fast, approximate answers with error bars 
by executing queries on samples of data 
- verifies the correctness of the error bars that it 
returns at runtime 
69
BlinkDB Implementation 
70
BlinkDB Architecture 
[Diagram: 
Interfaces: Command-line Shell, Thrift/JDBC 
Engine modules: SQL Parser, Query Optimizer, Physical Plan, UDFs, Execution, Driver, Metastore 
Runtime: Hadoop / Spark / Presto 
Storage: Hadoop Storage (e.g., HDFS, HBase)]
72
Implementation Changes 
• Additions to the query language parser. 
• The parser can trigger the sample creation and 
maintenance module. 
• A sample selection module that re-writes the query 
and assigns it an appropriately sized sample. 
• An uncertainty module that modifies all pre-existing 
aggregation functions to return error bars and 
confidence intervals. 
• A module that periodically samples from the original 
data, creating new samples which are then used by 
the system. (Handles correlation + workload changes.) 73
BlinkDB Evaluation 
74
BlinkDB Vs. No Sampling 
[Chart, log scale: BlinkDB response times vs. full scans of 
2.5 TB from cache and 7.5 TB from disk.] 
75
Scaling BlinkDB 
Each query operates on 100N GB of data. 
76 
Response Time 
and 
Error Bounds … 
20 Conviva queries, averaged over 10 runs 77
Play with BlinkDB! 
https://github.com/sameeragarwal/blinkdb 
78
Take Away … 
• For now, the only way to escape the in-memory 
performance barrier is to use Approximate 
Computing. 
• Approximate computing has a huge role to play in 
exploratory data analysis. 
• BlinkDB provides a framework for AQP that 
returns error bars and verifies them. 
• Great performance on real-world workloads. 
79
Personal Takeaway : Take a 
STATISTICS class! 
80
Credits 
These slides are derived from Sameer Agarwal’s 
presentation : http://goo.gl/cvVb1X 
81
Questions ? 
THANK YOU! 
82


Editor's Notes

  • #22–24: The goal is to perform some aggregate analysis on this massive data.
  • #28: You can also control the size of the sample.
  • #30: Errors in the database: issues with data collection and measurement strategy.
  • #63: What about error estimates where there is no aggregate function?