Nitish Upreti 
nzu100@cse.psu.edu 
1
Goal : Solve Big Data ! 
2 
How to achieve the best 
Performance ?
100 TB on 1000 machines: 
Hard disks: ½ - 1 hour 
Memory: 1 - 5 minutes 
?: 1 second 
3
Better and Faster 
Frameworks ? 
[Diagram: … evolved to … evolves to … ?] 
4
If we cannot do better than 
In-Memory, then what? 
5
Can we use Approximate 
Computing ? 
6
Can you tolerate Errors ? 
Well, it depends on the 
scenario, right… 
7
Overview of Big Data Space 
8
Massive Log Batch Processing 
9
Can we use Approximate 
Computing ? 
Answer : YES / NO 
10
Streaming data processing 
11
Can we use Approximate 
Computing ? 
Answer : MAYBE 
12
Exploratory Data Analysis 
13
Exploratory / Interactive 
Data Processing 
-- Getting a sense of data (Data 
Scientists) 
-- Debugging ? (SREs / DevOps) 
14
Can we use Approximate 
Computing ? 
Answer : YES ! 
15
1) BlinkDB : Queries with Bounded Errors and 
Bounded Response Times on Very Large Data. 
2) Blink and It’s Done : Interactive Queries on Very 
Large Data. 
3) A General Bootstrap Performance Diagnostic. 
4) Knowing When You’re Wrong : Building Fast and 
Reliable Approximate Query Processing Systems. 
Sameer Agarwal, Ariel Kleiner, Henry Milner, Barzan Mozafari, 
Ameet Talwalkar, Michael Jordan, Samuel Madden, Ion Stoica
Our Goal 
Support interactive SQL-like aggregate 
queries over massive sets of data 
17
Our Goal 
Support interactive SQL-like aggregate 
queries over massive sets of data 
blinkdb> SELECT AVG(jobtime) 
FROM very_big_log 
AVG, COUNT, 
SUM, STDEV, 
PERCENTILE etc. 
18
Support interactive SQL-like aggregate 
queries over massive sets of data 
blinkdb> SELECT AVG(jobtime) 
FROM very_big_log 
WHERE src = ‘hadoop’ 
FILTERS, GROUP BY clauses 
Our Goal 
19
Support interactive SQL-like aggregate 
queries over massive sets of data 
blinkdb> SELECT AVG(jobtime) 
FROM very_big_log 
LEFT OUTER JOIN logs2 
ON very_big_log.id = logs2.id 
WHERE src = ‘hadoop’ 
JOINS, Nested Queries etc. 
Our Goal 
20
Support interactive SQL-like aggregate 
queries over massive sets of data 
blinkdb> SELECT my_function(jobtime) 
FROM very_big_log 
LEFT OUTER JOIN logs2 
ON very_big_log.id = logs2.id 
WHERE src = ‘hadoop’ 
ML Primitives, 
User Defined Functions 
Our Goal 
21
Our Goal 
Support interactive SQL-like aggregate 
queries over massive sets of data 
blinkdb> SELECT my_function(jobtime) 
FROM very_big_log 
WHERE src = ‘hadoop’ 
ERROR WITHIN 10% AT CONFIDENCE 95% 
22
Our Goal 
Support interactive SQL-like aggregate 
queries over massive sets of data 
blinkdb> SELECT my_function(jobtime) 
FROM very_big_log 
WHERE src = ‘hadoop’ 
WITHIN 5 SECONDS 
23
Query Execution on Samples 
(Exploration Query) What is the average buffering 
ratio in the table? 

ID | City     | Buff Ratio 
 1 | NYC      | 0.78 
 2 | NYC      | 0.13 
 3 | Berkeley | 0.25 
 4 | NYC      | 0.19 
 5 | NYC      | 0.11 
 6 | Berkeley | 0.09 
 7 | NYC      | 0.18 
 8 | NYC      | 0.15 
 9 | Berkeley | 0.13 
10 | Berkeley | 0.49 
11 | NYC      | 0.19 
12 | Berkeley | 0.10 

0.2325 (Precise) 
24
Query Execution on Samples 
What is the average buffering 
ratio in the table? 

ID | City     | Buff Ratio 
 1 | NYC      | 0.78 
 2 | NYC      | 0.13 
 3 | Berkeley | 0.25 
 4 | NYC      | 0.19 
 5 | NYC      | 0.11 
 6 | Berkeley | 0.09 
 7 | NYC      | 0.18 
 8 | NYC      | 0.15 
 9 | Berkeley | 0.13 
10 | Berkeley | 0.49 
11 | NYC      | 0.19 
12 | Berkeley | 0.10 

Uniform Sample: 

ID | City     | Buff Ratio | Sampling Rate 
 2 | NYC      | 0.13       | 1/4 
 3 | Berkeley | 0.25       | 1/4 
 4 | NYC      | 0.19       | 1/4 

0.2325 (Precise) 
0.19 (Estimate) 
25
Query Execution on Samples 
What is the average buffering 
ratio in the table? 

ID | City     | Buff Ratio 
 1 | NYC      | 0.78 
 2 | NYC      | 0.13 
 3 | Berkeley | 0.25 
 4 | NYC      | 0.19 
 5 | NYC      | 0.11 
 6 | Berkeley | 0.09 
 7 | NYC      | 0.18 
 8 | NYC      | 0.15 
 9 | Berkeley | 0.13 
10 | Berkeley | 0.49 
11 | NYC      | 0.19 
12 | Berkeley | 0.10 

Uniform Sample: 

ID | City     | Buff Ratio | Sampling Rate 
 2 | NYC      | 0.13       | 1/4 
 3 | Berkeley | 0.25       | 1/4 
 4 | NYC      | 0.19       | 1/4 

0.2325 (Precise) 
0.19 +/- 0.05 (Estimate with error bars) 
26
Query Execution on Samples 
What is the average buffering 
ratio in the table? 

ID | City     | Buff Ratio 
 1 | NYC      | 0.78 
 2 | NYC      | 0.13 
 3 | Berkeley | 0.25 
 4 | NYC      | 0.19 
 5 | NYC      | 0.11 
 6 | Berkeley | 0.09 
 7 | NYC      | 0.18 
 8 | NYC      | 0.15 
 9 | Berkeley | 0.13 
10 | Berkeley | 0.49 
11 | NYC      | 0.19 
12 | Berkeley | 0.10 

Uniform Sample: 

ID | City     | Buff Ratio | Sampling Rate 
 2 | NYC      | 0.13       | 1/2 
 3 | Berkeley | 0.25       | 1/2 
 4 | NYC      | 0.19       | 1/2 
 6 | Berkeley | 0.09       | 1/2 
 7 | NYC      | 0.18       | 1/2 
10 | Berkeley | 0.49       | 1/2 

0.2325 (Precise) 
0.19 +/- 0.05 (at rate 1/4) 
0.22 +/- 0.02 (at rate 1/2) 
27
Speed/Accuracy Trade-off 
[Chart: Error vs. Execution Time (Sample Size). Time to execute on the 
entire dataset: ~30 mins. Interactive queries: ~2 sec.] 28
Speed/Accuracy Trade-off 
[Chart: the same Error vs. Execution Time (Sample Size) curve, now with a 
"Pre-Existing Noise" floor: error cannot drop below the noise already 
present in the data.] 29
Where do you want to be 
on the curve ? 
30
Sampling Vs No Sampling on 
100 Machines 
[Chart: Query Response Time (seconds) vs. fraction of the full 10 TB dataset. 
Full data: 1,020 s. Samples of 10^-1 through 10^-5 of the data: 102, 18, 13, 
10, and 8 s. The first step is only ~10x faster, as response time 
is dominated by I/O.] 
31
Sampling Vs No Sampling 
[Chart: Query Response Time (seconds) vs. fraction of the full data, with 
error bars. Full data: 1,020 s (exact). Samples of 10^-1 through 10^-5: 
103, 18, 13, 10, and 8 s, with errors of 0.02%, 0.07%, 1.1%, 3.4%, and 
11% respectively.] 
32
Okay, so you can tolerate 
errors… 
What are some of the fundamental 
challenges ? 
What types of samples to create ? (We cannot 
sample everything.) 
This boils down to : what is our assumption about the 
nature of the future query workload ? 
33
Usual Assumption: Future 
queries are SIMILAR to 
past queries. 
What is Similarity ? 
( Choosing the wrong notion carries a heavy penalty : 
under- / over-fitting. ) 
34
Workload Taxonomy 
35
Predictable QCS 
• A QCS (Query Column Set) is the set of columns a 
query uses for grouping and filtering. 
• Fits the model of exploratory queries well. 
(Queries are usually distinct, but most use 
the same column sets.) 
• What kinds of videos are popular in a region ? 
- Requires looking at data from thousands of videos and 
hundreds of geographical regions. However, the column 
sets are fixed : “video titles” (for grouping) and “viewer location” (for 
filtering). 
• Backed by empirical evidence from Conviva & 
Facebook. This is the key reason for BlinkDB’s 
efficiency. ( Lots of related work in database theory. ) 36
37
BlinkDB Overview 
38
What is BlinkDB? 
A framework built on Shark and Spark that … 
- Creates and maintains a variety of offline 
samples from the underlying data. 
- Returns fast, approximate answers with error bars 
by executing queries on samples of the data 
(using a runtime Error Latency Profile for 
sample selection). 
- Verifies the correctness of the error bars that it 
returns at runtime. 
39
1) Sample Creation 
40
Building Samples for Queries 
• Uniform Sampling vs. Stratified Sampling. 
• Uniform sampling is inefficient for 
queries that compute aggregates per 
group : 
- We could simply miss or under-represent a group. 
- We care equally about the error of each query; with uniform 
sampling we would assign more samples to groups 
that are already well represented. 
• Solution : Make sample size assignment 
deterministic rather than random. This can be 
achieved with stratified sampling (a minimal sketch follows). 41
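A minimal sketch (not BlinkDB code; the table, group names, and sizes are hypothetical) of why uniform sampling misses rare groups while stratified sampling covers every group:

```python
import random
from collections import defaultdict

random.seed(0)
# Hypothetical skewed table: 10,000 NYC rows, 50 Berkeley rows.
table = [("NYC", random.random()) for _ in range(10_000)] + \
        [("Berkeley", random.random()) for _ in range(50)]

# Uniform sample: the rare group is often missed entirely.
uniform = random.sample(table, 100)
print({city for city, _ in uniform})  # 'Berkeley' frequently absent

# Stratified sample: a deterministic K rows per group (capped at group size).
K = 100
groups = defaultdict(list)
for city, ratio in table:
    groups[city].append(ratio)
stratified = {city: random.sample(rows, min(K, len(rows)))
              for city, rows in groups.items()}
print({city: len(rows) for city, rows in stratified.items()})  # every group covered
```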
Some Terminology … 
42
QCS to Sample On 
43
What QCS to sample on ? 
• Formulated as an optimization problem, 
where the three major factors to consider are : 
the “sparsity” of the data, “workload characteristics”, 
and the “storage cost of samples”. 
• Sparsity : Define a sparsity function as the 
number of groups whose size in the table ‘T’ is less 
than some number ‘M’. 
44
QCS to sample on (contd.)… 
• Workload : A query with QCS ‘qj’ has some 
unknown probability ‘pj’. The best estimate of 
pj is the past frequency of queries with QCS qj. 
• Storage Cost : Assume a simple formulation 
where K is the same for each group. For a set of 
columns ϕ, the storage cost is |S(ϕ,K)| : 
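(The formula itself appeared as an image on the original slide; a reconstruction consistent with the surrounding definitions, where each group x of ϕ contributes at most K rows to the stratified sample, is: 

|S(ϕ, K)| = Σ_x min(K, |T_x|), summing over the groups x of ϕ.) 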
45
Goal : Maximize the 
weighted sum of coverage. 
where ‘coverage’ for a query ‘qi’ given a sample is 
defined as the probability that a given value ‘x’ for the 
columns is also present among the rows of S(ϕi,K). 
46
Optimization Problem 
Optimize the following MILP : 
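(The MILP itself appeared as an image on the original slide. Schematically, as a paraphrase of the factors above rather than the paper’s exact formulation, it chooses 0/1 indicators z_i for which column sets ϕ_i to sample, maximizing expected coverage under a storage budget C: 

maximize Σ_j p_j · y_j 
subject to Σ_i z_i · |S(ϕ_i, K)| ≤ C, where y_j is the coverage of q_j under the chosen samples, and z_i ∈ {0, 1}.) 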
where ‘m’ is the number of possible QCS, ‘j’ indexes over all 
queries, and ‘i’ over all column sets. 47
How to sample ? 
48
Given a known QCS … 
• Compute the sample count for a group : 
- K = min( n′ / D(ϕ) , |T_x0| ) 
• Take samples as : 
- For each group x, sample K rows uniformly at random 
without replacement, forming sample Sx. 
• The entire sample S(ϕ,K) is the disjoint union 
of the Sx : 
- If |Tx| > K, we answer based on K random tuples; otherwise 
we can provide an exact answer. 
• For the aggregate functions AVG, SUM, COUNT 
and QUANTILE, K directly determines the error (see the sketch below). 49
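A minimal sketch (hypothetical rows and helper names, not BlinkDB code) of building S(ϕ, K) for a known QCS:

```python
import random
from collections import defaultdict

def stratified_sample(rows, qcs, n_budget):
    """rows: list of dicts; qcs: tuple of column names; n_budget: row budget n'."""
    groups = defaultdict(list)
    for r in rows:
        groups[tuple(r[c] for c in qcs)].append(r)
    # Per-group cap, mirroring K = min(n' / D(phi), |T_x0|) on the slide,
    # reading D(phi) as the number of distinct groups (an assumption).
    K = max(1, n_budget // len(groups))
    sample = []
    for g_rows in groups.values():
        # Groups smaller than K are taken whole -> exact answers for them.
        sample.extend(random.sample(g_rows, min(K, len(g_rows))))
    return sample, K

rows = [{"city": random.choice(["NYC", "Berkeley"]), "ratio": random.random()}
        for _ in range(1_000)]
sample, K = stratified_sample(rows, ("city",), n_budget=100)
```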
Sharing a QCS 
• Multiple queries with different ‘t’ and ‘n’ will 
share the same QCS. We need to select an 
appropriately sized subset from our stored sample, 
depending on the query. 
• We need an appropriate storage technique 
to allow such subsets to be identified at 
runtime. 
50
Storage Technique 
• The rows of a stratified sample S(ϕ,K) are 
stored sequentially according to the order of 
columns in ϕ. 
• When Sx is spread over multiple HDFS 
blocks, each block contains a random 
subset of Sx . 
• It is then enough to read any subset of the 
blocks comprising Sx, as long as those 
blocks contain the minimum number of 
needed records. 
51
Bij = Data Block 
(HDFS) 
52
Storage Requirement 
For a table with 1 billion (10^9) tuples and a 
column set with a Zipf distribution (heavy-tailed) 
with an exponent of 1.5, it turns out 
that the storage required by sample S(ϕ, K) is 
only 2.4% of the original table for K = 10^4, 
5.2% for K = 10^5, and 11.4% for K = 10^6. 
This is also consistent with real-world data 
from Conviva & Facebook. 
53
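A back-of-the-envelope check of this storage arithmetic (hypothetical parameters: 10^9 tuples over 100,000 Zipf-distributed groups; exact percentages vary with the group count):

```python
def storage_fraction(num_groups, exponent, K, table_size=1e9):
    """Fraction of the table stored when each Zipf-sized group is capped at K."""
    weights = [rank ** -exponent for rank in range(1, num_groups + 1)]
    scale = table_size / sum(weights)      # group sizes sum to table_size
    stored = sum(min(K, w * scale) for w in weights)
    return stored / table_size

for K in (10**4, 10**5, 10**6):
    print(f"K = {K:>7}: {storage_fraction(100_000, 1.5, K):.1%}")
# Prints a few percent for K = 10^4, growing with K: the same order of
# magnitude as the 2.4% / 5.2% / 11.4% figures above.
```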
What is BlinkDB? 
A framework built on Shark and Spark that … 
- creates and maintains a variety of samples from 
underlying data. 
- returns fast, approximate answers by executing 
queries on samples of data with error bars. 
- verifies the correctness of the error bars that it 
returns at runtime. 
54
2) BlinkDB Runtime 
55
Selecting a Sample 
• If BlinkDB finds one or more stratified 
samples on a set of columns ‘ϕi’ such that our 
query’s column set ‘q’ ⊆ ϕi , we pick the ϕi with the 
smallest number of columns (a selection sketch 
follows this slide). 
• If no such stratified sample exists, we run ‘q’ on 
in-memory subsets of all samples maintained 
by the system. Out of these, we select those 
with high selectivity. 
- Selectivity = Number of Rows Selected by q / Number of 
Rows Read by q 
56
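A sketch of the selection rule above (the data structures are hypothetical):

```python
def select_sample(query_qcs, stratified_samples):
    """query_qcs: frozenset of columns; stratified_samples: {frozenset: handle}."""
    # Candidates: stratified samples whose column set covers the query's QCS.
    covering = [cols for cols in stratified_samples if query_qcs <= cols]
    if covering:
        # Prefer the covering sample with the fewest columns.
        return stratified_samples[min(covering, key=len)]
    return None  # fall back to in-memory subsets, ranked by selectivity

handle = select_sample(frozenset({"src"}),
                       {frozenset({"src", "dst"}): "S1",
                        frozenset({"src"}): "S2"})   # -> "S2"
```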
Selecting the Right Sample Size 
• Construct an ELP (Error Latency Profile) that 
characterizes the rate at which the error 
decreases ( and time increases ) with increasing 
sample size, by running the query on smaller 
samples. 
• The scaling rate depends on query structure 
(JOINs, GROUP BY), physical data 
placement, and the underlying data distribution. 
57
Error Profile 
• Given Q’s error constraints : the idea is to predict the 
size of the smallest sample that satisfies the constraints. 
• Variance and closed-form aggregate functions are 
estimated using standard closed-form formulas. 
• BlinkDB also estimates query selectivity and the input 
data distribution by running the query on smaller 
subsamples. 
• The number of rows needed is then calculated using 
statistical error estimates. 
58
59
Latency Profile 
• Given Q’s time constraints : the idea is to predict the 
maximum-size sample on which we should run the 
query within the constraints. 
• The value of ‘n’ depends on the input data, physical 
placement on disk, query structure, and available 
resources. As a simplification, BlinkDB simply 
predicts ‘n’ by assuming latency scales linearly in 
input size. 
• For very small in-memory samples, BlinkDB runs a 
few smaller samples until performance appears to 
grow linearly, and then estimates the linear scaling 
constants (see the sketch below). 60
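A minimal sketch of this linear-scaling estimate (the timings are hypothetical):

```python
# Fit the linear latency model t = a*n + b from a few small runs,
# then invert it for the query's time budget.
timings = [(1_000, 0.9), (10_000, 1.4), (100_000, 6.0)]   # (rows n, seconds t)
n_vals, t_vals = zip(*timings)
n_mean = sum(n_vals) / len(n_vals)
t_mean = sum(t_vals) / len(t_vals)
a = sum((n - n_mean) * (t - t_mean) for n, t in timings) / \
    sum((n - n_mean) ** 2 for n in n_vals)                # least-squares slope
b = t_mean - a * n_mean
budget = 5.0                                              # e.g. WITHIN 5 SECONDS
max_n = int((budget - b) / a)   # largest sample size predicted to fit the budget
print(max_n)
```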
Correcting Bias 
• Running a query on a non-uniform sample 
introduces a certain amount of statistical bias, 
which can lead to subtle inaccuracies. 
• Solution : BlinkDB periodically replaces the 
samples in use via a low-priority background 
task that periodically ( daily ) re-samples the 
original data; the fresh samples are then used 
by the system. 
61
Error Estimation 
Closed Form Aggregate Functions 
- Central Limit Theorem 
- Applicable to AVG, COUNT, SUM, 
VARIANCE and STDEV 
62
Error Estimation 
Closed Form Aggregate Functions 
- Central Limit Theorem 
- Applicable to AVG, COUNT, SUM, 
VARIANCE and STDEV 
Generalized Aggregate Functions 
- Statistical Bootstrap 
- Applicable to complex and nested 
queries, UDFs, joins etc. 
- Very computationally expensive (a comparison sketch follows). 63
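A minimal sketch of both estimators on synthetic data (illustrative only; BlinkDB's actual estimators are more involved):

```python
import random
import statistics

random.seed(0)
sample = [random.random() for _ in range(1_000)]   # synthetic sampled values

# Closed form (CLT): 95% CI half-width for AVG = 1.96 * s / sqrt(n).
mean = statistics.fmean(sample)
half_width = 1.96 * statistics.stdev(sample) / len(sample) ** 0.5

# Bootstrap: re-run the aggregate on ~100 resamples of the sample.
# Works for arbitrary functions, but costs ~100x the computation.
replicas = sorted(
    statistics.fmean(random.choices(sample, k=len(sample)))
    for _ in range(100)
)
lo, hi = replicas[2], replicas[97]                 # empirical 95% interval
print(f"CLT: {mean:.4f} +/- {half_width:.4f}; bootstrap: [{lo:.4f}, {hi:.4f}]")
```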
But we are not done yet … 
• Statistical techniques like the CLT and the Bootstrap 
operate under a set of assumptions about the query / 
data. 
• We need to have some correctness verifiers ! 
64
What is BlinkDB? 
A framework built on Shark and Spark that … 
- creates and maintains a variety of samples from 
underlying data 
- returns fast, approximate answers with error bars 
by executing queries on samples of data 
- verifies the correctness of the error bars that it 
returns at runtime 
65
Kleiner’s Diagnostics [KDD ’13] 
[Chart: Error vs. Sample Size. More data → higher accuracy; 
~300 data points → 97% accuracy.] 
66
300 Data Points ≈ 30K 
Queries for Bootstrap ! (Each of the ~300 
diagnostic points needs ~100 bootstrap resamples.) 
67
So in an Approximate QP : 
• One query estimates the answer. 
• A hundred queries on resamples of the data 
compute the error. 
• Tens of thousands of queries verify whether this 
error is correct. 
• BAD PERFORMANCE ! 
• Solution : a Single-Pass Execution 
framework (sketched below). 68
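One way to realize a single pass is Poissonized resampling, where each row draws all of its bootstrap-replica weights during one scan. A minimal sketch (a plausible reading of "single pass", not necessarily BlinkDB's exact mechanism):

```python
import math
import random

random.seed(0)

def poisson1():
    # Knuth's method for sampling Poisson(lambda=1).
    L, k, p = math.exp(-1.0), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

sample = [random.random() for _ in range(1_000)]
R = 100                                  # bootstrap replicas
sums, counts = [0.0] * R, [0] * R
for v in sample:                         # single pass over the sample
    for r in range(R):
        w = poisson1()                   # row's multiplicity in replica r
        sums[r] += w * v
        counts[r] += w
estimates = sorted(s / c for s, c in zip(sums, counts))
lo, hi = estimates[2], estimates[97]     # empirical 95% interval
```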
What is BlinkDB? 
A framework built on Shark and Spark that … 
- creates and maintains a variety of samples from 
underlying data 
- returns fast, approximate answers with error bars 
by executing queries on samples of data 
- verifies the correctness of the error bars that it 
returns at runtime 
69
BlinkDB Implementation 
70
BlinkDB Architecture 
[Diagram: 
Interfaces: Command-line Shell, Thrift/JDBC 
Engine modules: SQL Parser, Query Optimizer, Physical Plan, UDFs, Execution, Driver, Metastore 
Runtime: Hadoop / Spark / Presto 
Storage: Hadoop Storage (e.g., HDFS, HBase)]
72
Implementation Changes 
• Additions to the query language parser. 
• The parser can trigger the sample creation and 
maintenance module. 
• A sample selection module that re-writes the query 
and assigns it an appropriately sized sample. 
• An uncertainty module that modifies all pre-existing 
aggregation functions to return error bars and 
confidence intervals. 
• A module that periodically samples from the original 
data, creating new samples which are then used by 
the system. (Handles correlation + workload changes.) 73
BlinkDB Evaluation 
74
BlinkDB Vs. No Sampling 
[Chart, log scale: BlinkDB response times vs. full scans of 
2.5 TB from cache and 7.5 TB from disk.] 
75
Scaling BlinkDB 
Each query operates on 100N GB of data. 
76 
Response Time 
and 
Error Bounds … 
20 Conviva queries, averaged over 10 runs 77
Play with BlinkDB! 
https://github.com/sameeragarwal/blinkdb 
78
Take Away … 
• For now, the only way to escape the in-memory 
performance barrier is to use Approximate 
Computing. 
• Approximate computing has a huge role to play in 
exploratory data analysis. 
• BlinkDB provides a framework for AQP that 
returns error bars and verifies them. 
• Great performance on real-world workloads. 
79
Personal Takeaway : Take a 
STATISTICS class! 
80
Credits 
These slides are derived from Sameer Agarwal’s 
presentation : http://goo.gl/cvVb1X 
81
Questions ? 
THANK YOU! 
82


Editor's Notes

  • #22–24: The goal is to perform some aggregate analysis on this massive data.
  • #28: You can also control the size of the sample.
  • #30: Errors in the database: issues with data collection and measurement strategy.
  • #63: What about error estimates where there is no aggregate function?