Succinct: Fast Interactive Queries
Anurag Khandelwal
Interactive Queries at Scale
‣ Search: tweets by @AMPLab about #Succinct
‣ Random Access
‣ Regular Expressions: links to Berkeley or Stanford domains, .*(berkeley|stanford).edu
‣ Range Queries: all my Facebook posts between 2013 and 2016
‣ Graph Queries: friends of my friends who like trekking
‣ Aggregate Queries
‣ Updates
Compute Platforms, Query Engines, Data Stores
Interactive Queries at Scale
Today’s focus on two main issues:
‣ Performance degradation when data size > memory
‣ Handling skewed query workloads
[Figure: throughput (ops, 0 to 2000) against input size (1GB to 128GB), with the maximum sustainable throughput marked]
Our Solution
‣ Compressed representation → More queries in faster storage
‣ Rich functionality directly on compressed representation
  ‣ Search, RegEx, Range queries
‣ Flexible support for different data models
‣ Handles skewed & time-varying workloads
[Diagram labels: BlowFish (NSDI’16), Succinct (NSDI’15), Succinct Encryption, GraphStore, KVStore, ColumnarStore, RowStore, UnstructuredData]
Existing Techniques
Example: SEARCH( )
‣ Data Scans (Ex: Apache Spark): low storage, high latency
‣ Indexes (Ex: SOLR): high storage, low latency
Index contents (lists of offsets):
0, 10, 14, 16, 19, 26, 29
1, 4, 5, 8, 20, 22, 24
2, 15, 17, 27
3, 6, 7, 9, 12, 13, 18, 23 ..
11, 21
Succinct
Queries executed directly on the compressed representation
‣ Low storage
‣ Low latency
What makes Succinct unique
‣ No additional indexes: query responses embedded in the compressed representation
‣ No data scans: functionality of indexes
‣ No decompression: queries directly on the compressed representation (except for data access queries)
Succinct
‣ Scale: in-memory data sizes >= memory capacity
‣ Complex queries: search, range, random access, RegEx
‣ Interactivity: avoids data scans and decompression
Succinct Data Representation
Builds on a large body of theory work
Suffix Arrays
‣ Strong functionality (search)
‣ No structure
Compression?
‣ Sample the suffix array
‣ Store set of pointers to compute unsampled values on the fly
Possesses structure that enables compression!
Succinct Data Model
Unified interface for:
‣ Unstructured data
‣ Key-value stores (Voldemort, Dynamo)
‣ Document stores (Elasticsearch, MongoDB)
‣ Tables (Cassandra, BigTable)
‣ And many more ...
With all the powerful queries on values, documents, columns
Data Model & Functionality
For unstructured data (Original Input → Succinct):
‣ Search: returns offsets of arbitrary strings in uncompressed file
  SEARCH( ) = {0, 10, 14, 16, 19, 26, 29}
‣ Extract: returns data at arbitrary offsets in uncompressed file
  EXTRACT(0, 5) = { , , , , }
‣ Count: returns count of arbitrary strings in uncompressed file
  COUNT( ) = 7
‣ Append: appends arbitrary strings to uncompressed file
  APPEND( , , , , )
‣ Range queries, regular expressions
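The meaning of these four operations can be pinned down with an uncompressed reference model. The sketch below is in Python rather than the deck's Scala, and the class name `PlainTextStore` is hypothetical. It scans the raw bytes, which is exactly the cost Succinct is designed to avoid, so it only illustrates what each call returns, not how Succinct computes it.

```python
class PlainTextStore:
    """Uncompressed reference model of the flat-file interface (hypothetical name)."""

    def __init__(self, data: bytes = b"") -> None:
        self.data = data

    def search(self, query: bytes) -> list[int]:
        """Return every offset at which query occurs, overlaps included."""
        offsets = []
        pos = self.data.find(query)
        while pos != -1:
            offsets.append(pos)
            pos = self.data.find(query, pos + 1)
        return offsets

    def count(self, query: bytes) -> int:
        """Return the number of occurrences of query."""
        return len(self.search(query))

    def extract(self, offset: int, length: int) -> bytes:
        """Return length bytes starting at offset."""
        return self.data[offset:offset + length]

    def append(self, more: bytes) -> None:
        """Append bytes to the end of the file."""
        self.data += more


store = PlainTextStore(b"banana")
print(store.search(b"an"))    # offsets of "an"
print(store.count(b"a"))      # number of "a" characters
print(store.extract(1, 3))    # three bytes starting at offset 1
store.append(b"s")
```

The (offset, length) argument order for `extract` follows the deck's own example, "Extract 100 bytes from offset 50".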
Unifying the Data Models
SEARCH(Column1, )
SEARCH( )
Succinct Architecture
Multi-store Architecture
‣ SuccinctStore
‣ SuffixStore
‣ LogStore (data APPENDS)
Succinct on Apache Spark
Queries on Compressed RDDs
‣ New functionalities: document store, key-value store (search on documents, values)
‣ Faster operations on RDDs: random access, filters avoid scans
‣ More in-memory compressed RDDs: no decompression overheads
Unstructured data using SuccinctRDD
// Import classes
import edu.berkeley.cs.succinct._
// Load data & compress using Succinct
val rdd = ctx.textFile(…).map(_.getBytes)
val succinctRDD = rdd.succinct
// Extract 100 bytes from offset 50
val bytes = succinctRDD.extract(50, 100)
// Count #occurrences of “Berkeley”
val count = succinctRDD.count("Berkeley")
// Find all occurrences of “Berkeley”
val offsets = succinctRDD.search("Berkeley")
Key-Value Store using SuccinctKVRDD
// Import classes
import edu.berkeley.cs.succinct.kv._
// Load data & compress using Succinct
// (rdd here holds strings, since getBytes is called on each element)
val kvRDD = rdd.zipWithIndex.map(t => (t._2, t._1.getBytes))
val succinctKVRDD = kvRDD.succinctKV
// Get value for key 0
val value = succinctKVRDD.get(0)
// Find all keys for values that contain “Berkeley”
val keys = succinctKVRDD.search("Berkeley")
Evaluation
Dataset: Wikipedia dataset, ~40GB data
Cluster: Amazon EC2, 5 machines, 30GB RAM each
Workload: Search queries, 1-10,000 occurrences
Systems: Spark, Elasticsearch
Caveats: Absolute numbers are dataset dependent
Evaluation: Search
Takeaway: Succinct on Apache Spark is 2.5x faster than Elasticsearch while being 2.5x more space efficient.
(Data fits in memory for all systems)
Support for Regular Expressions
Applications: Data Cleaning, Information Extraction, Bioinformatics, Document Stores
Operators: Union, Concat, Wildcard, Repeat
Example: .*(berkeley|stanford).edu
SuccinctRDD
// Find all matches for the RegEx “.*(berkeley|stanford).edu”
val matches = succinctRDD.regexSearch(".*(berkeley|stanford).edu")
SuccinctKVRDD
// Find all keys for values that contain the RegEx “.*(berkeley|stanford).edu”
val matchKeys = succinctKVRDD.regexSearch(".*(berkeley|stanford).edu")
Evaluation: RegEx
Takeaway: Succinct significantly speeds up RegEx queries even when all the data fits in memory for all systems.
Succinct on Apache Spark
Already in use at Elsevier Labs
‣ Use case: Annotation Search
“Find sentences that talk about open problems in research”
Documents
Annotations
1, sentence, (0, 15)
2, word, (0, 4)
3, word, (5, 10)
4, word, (11, 15)
RegEx Annotation
(remains|is|still) (unknown|unclear|uncertain) within <sentence>
https://spark-packages.org/package/amplab/succinct
Problem: Skewed Query Workloads
Load distribution across partitions is often non-uniform
‣ Succinct: Larger fraction of queries in main memory
‣ Challenge: skewed load across shards?
‣ Challenge: time varying loads?
‣ E.g.: Memcached + MySQL deployment @ Facebook
Traditional approach: Selective Replication
‣ #Replicas ∝ Load
‣ Coarse grained: 1-2× throughput → 2× storage
Succinct + BlowFish
[Figure: axes Storage and Throughput, with points for Scans, Indexes, and Succinct]
Storage-Performance tradeoff curve for each partition
BlowFish: Layered Sampled Array
Recap: Succinct stores a sampled suffix array (unsampled values computed on the fly)
Original sampled array (Rate = 2): 9 15 3 0 12 8 14 5
Layers:
RATE = 8: 9 12
RATE = 4: 3 14
RATE = 2: 15 0 8 5
Different combination of layers → Different points on tradeoff curve
Layer Additions and Deletions → Move along tradeoff curve
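The split into layers shown on the slide can be reproduced with a few lines of Python. This is a sketch under an assumption inferred from the slide's numbers, not taken from the BlowFish code: each stored sample goes into the coarsest layer whose rate divides its position in the unsampled array. The function names `build_layers` and `storage_cost` are hypothetical.

```python
def build_layers(sampled, base_rate, rates):
    """Assign each stored sample to the coarsest layer whose rate divides its position."""
    layers = {rate: [] for rate in rates}
    for index, value in enumerate(sampled):
        position = index * base_rate  # position in the unsampled array
        for rate in sorted(rates, reverse=True):
            if position % rate == 0:
                layers[rate].append(value)
                break
    return layers


def storage_cost(layers, active_rates):
    """Number of samples kept in memory when only the given layers are present."""
    return sum(len(layers[rate]) for rate in active_rates)


layers = build_layers([9, 15, 3, 0, 12, 8, 14, 5], base_rate=2, rates=[8, 4, 2])
print(layers)
print(storage_cost(layers, [8]), storage_cost(layers, [8, 4]), storage_cost(layers, [8, 4, 2]))
```

Dropping the RATE = 2 layer halves the number of stored samples, and dropping RATE = 4 as well halves it again, which is one way to read "move along tradeoff curve": fewer stored samples means less storage but more values to compute on the fly.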
BlowFish: Technical Details
‣ How should partitions share cache on a server?
‣ How should partitions share cache across servers?
‣ How should requests be scheduled across replicas?
Unified Solution: Back-pressure style scheduling
‣ Cache proportional to load, without explicit coordination
‣ 1.5x higher throughput than Selective Replication, within 11% of maximum possible throughput
[Figure labels: Low Threshold, High Threshold]
Succinct + BlowFish
‣ Standalone system (prototyped & tested)
‣ Spark Package: Succinct on Apache Spark
‣ As libraries (C++, Java, Scala) for ease of integration
Thanks!



succinct.cs.berkeley.edu
Backup Slides
Array of Suffixes (AoS)
Input: banana$
Suffixes: banana$, anana$, nana$, ana$, na$, a$, $
Array of Suffixes (AoS), in lexicographical order: $, a$, ana$, anana$, banana$, na$, nana$
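The construction above can be checked with a short Python sketch. The function name `array_of_suffixes` is hypothetical, and the approach is the naive one (materialize every suffix, then sort), which is fine for a seven-character example but not for real inputs.

```python
def array_of_suffixes(text):
    """Return all suffixes of text in lexicographical order."""
    suffixes = [text[i:] for i in range(len(text))]
    return sorted(suffixes)


aos = array_of_suffixes("banana$")
for suffix in aos:
    print(suffix)
```

Python compares strings by code point, so the end marker `$` sorts before every lowercase letter, matching the order on the slide.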
AoS to Input (AoS2Input) Array
AoS2Input: locations of suffixes (suffix array)
Index   AoS       AoS2Input
0       $         6
1       a$        5
2       ana$      3
3       anana$    1
4       banana$   0
5       na$       4
6       nana$     2
Input:  b a n a n a $
Offset: 0 1 2 3 4 5 6
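AoS2Input is the classical suffix array: entry i is the input offset at which the i-th smallest suffix starts. A minimal Python sketch, with the hypothetical name `suffix_array` and the same naive sort-by-suffix construction:

```python
def suffix_array(text):
    """Return the start offsets of the suffixes of text, in lexicographical order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


text = "banana$"
aos2input = suffix_array(text)
print(aos2input)
# Each entry points back at the suffix it stands for.
print([text[offset:] for offset in aos2input])
```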
Example: search(“an”)
Index   AoS       AoS2Input
0       $         6
1       a$        5
2       ana$      3
3       anana$    1
4       banana$   0
5       na$       4
6       nana$     2
The suffixes that start with “an” (ana$ and anana$) are adjacent in the AoS; their AoS2Input entries are 3 and 1.
search(“an”) = {1, 3}
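Because the suffixes are sorted, all suffixes that start with the query form one contiguous range, and two binary searches find its ends. The Python sketch below works on the uncompressed text and suffix array; the names `suffix_array` and `search` are hypothetical, and this is the textbook method rather than Succinct's compressed implementation.

```python
def suffix_array(text):
    """Naive suffix array: start offsets of suffixes in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def search(text, sa, pattern):
    """Return the sorted offsets at which pattern occurs in text."""
    m = len(pattern)

    def prefix(rank):
        # First m characters of the suffix at the given rank.
        return text[sa[rank]:sa[rank] + m]

    # Lower bound: first rank whose prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first rank whose prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


text = "banana$"
sa = suffix_array(text)
print(search(text, sa, "an"))
```

The number of occurrences is simply the width of the range, so a count query needs no access to the offsets at all.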
Next Pointer Array: Reducing AoS Size
Index   AoS       First character   NPA
0       $         $                 4
1       a$        a                 0
2       ana$      a                 5
3       anana$    a                 6
4       banana$   b                 3
5       na$       n                 1
6       nana$     n                 2
Store only the first character (entire suffix can be computed “on the fly” using Next Pointer Array (NPA))
Example: starting at index 2 and following NPA (2 → 5 → 1 → 0) yields a, an, ana, ana$
Compact AoS: each distinct first character ($, a, b, n) is stored once
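The NPA can be derived from the suffix array: entry i is the AoS index of the suffix that starts one character later than suffix i, wrapping around at the end of the input. The Python sketch below (hypothetical names `next_pointer_array` and `reconstruct`) computes it for banana$ and rebuilds suffixes from first characters alone.

```python
def suffix_array(text):
    """Naive suffix array: start offsets of suffixes in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def next_pointer_array(sa):
    """NPA[i] = rank of the suffix starting one position after suffix i (cyclic)."""
    n = len(sa)
    rank = [0] * n
    for i, offset in enumerate(sa):
        rank[offset] = i
    return [rank[(offset + 1) % n] for offset in sa]


def reconstruct(first_chars, npa, index, length):
    """Rebuild length characters of the suffix at AoS index by following NPA."""
    out = []
    for _ in range(length):
        out.append(first_chars[index])
        index = npa[index]
    return "".join(out)


text = "banana$"
sa = suffix_array(text)
npa = next_pointer_array(sa)
first_chars = [text[offset] for offset in sa]
print(npa)
print(reconstruct(first_chars, npa, 2, 4))
print(reconstruct(first_chars, npa, 4, 7))
```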
Reducing the size of AoS2Input
Index   AoS2Input   Sampled AoS2Input   NPA
0       6           6                   4
1       5                               0
2       3           3                   5
3       1                               6
4       0           0                   3
5       4                               1
6       2           2                   2
Store only a few sampled values (unsampled values computed “on the fly” using NPA)
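An unsampled AoS2Input value can be recovered by following NPA until a sampled index is reached, then subtracting the number of steps taken, because each NPA hop moves exactly one character forward in the input. The Python sketch below assumes samples are kept at every second index (an illustration choice; the sampling scheme used by the real system may differ), and the name `lookup` is hypothetical.

```python
# Values for "banana$", as derived on the earlier slides.
AOS2INPUT = [6, 5, 3, 1, 0, 4, 2]
NPA = [4, 0, 5, 6, 3, 1, 2]

RATE = 2
SAMPLED = AOS2INPUT[::RATE]  # keep entries at indices 0, 2, 4, 6


def lookup(sampled, npa, index, rate):
    """Recover AoS2Input[index] from the sampled values and NPA."""
    n = len(npa)
    steps = 0
    while index % rate != 0:  # walk forward until a sampled index is hit
        index = npa[index]
        steps += 1
    # The sampled suffix starts `steps` characters after the one we want.
    return (sampled[index // rate] - steps) % n


print([lookup(SAMPLED, NPA, i, RATE) for i in range(len(NPA))])
```

A sparser sample stores fewer values but lengthens the walk, which is the storage versus latency knob that the layered sampled array exposes.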
Compressing NPA
Increasing sequence of integers

(values for suffixes starting with
same character)
Can be compressed

(E.g., using run-length encoding)
$
a
b
n
4
0
5
6
3
1
2
Compressing NPA
Increasing sequence of integers

(values for suffixes starting with
same character)
Can be compressed

(E.g., using run-length encoding)
Succinct uses a 2-dimensional representation of NPA
$
a
b
n
4
0
5
6
3
1
2
Compressing NPA
Increasing sequence of integers

(values for suffixes starting with
same character)
Can be compressed

(E.g., using run-length encoding)
Succinct uses a 2-dimensional representation of NPA
$
a
b
n
4
0
5
6
3
1
2
Compressing NPA
Increasing sequence of integers

(values for suffixes starting with
same character)
Can be compressed

(E.g., using run-length encoding)
Succinct uses a 2-dimensional representation of NPA
- better compressibility
$
a
b
n
4
0
5
6
3
1
2
Compressing NPA
Increasing sequence of integers

(values for suffixes starting with
same character)
Can be compressed

(E.g., using run-length encoding)
Succinct uses a 2-dimensional representation of NPA
- better compressibility
- avoids binary search on AoS (lower latency)
$
a
b
n
4
0
5
6
3
1
2
Compressing NPA
Increasing sequence of integers

(values for suffixes starting with
same character)
Can be compressed

(E.g., using run-length encoding)
Succinct uses a 2-dimensional representation of NPA
- better compressibility
- avoids binary search on AoS (lower latency)
- enables wider range of queries (E.g., RegEx)
$
a
b
n
4
0
5
6
3
1
2
Compressing NPA
Increasing sequence of integers

(values for suffixes starting with
same character)
Can be compressed

(E.g., using run-length encoding)
Succinct uses a 2-dimensional representation of NPA
- better compressibility
- avoids binary search on AoS (lower latency)
- enables wider range of queries (E.g., RegEx)
$
a
b
n
4
0
5
6
3
1
2
Compressing NPA
Increasing sequence of integers (values for suffixes starting with same character)
Can be compressed (e.g., using run-length encoding)
Succinct uses a 2-dimensional representation of NPA
- better compressibility
- avoids binary search on AoS (lower latency)
- enables wider range of queries (e.g., RegEx)
See upcoming NSDI paper!
[Figure: NPA values 4 0 5 6 3 1 2, grouped by the first character of each suffix ($, a, b, n)]
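A minimal sketch of why the grouped NPA values compress well, assuming the figure's values come from the string `banana$` (that input reproduces 4 0 5 6 3 1 2). The helper names are hypothetical, and delta encoding is used here as a simple stand-in for whichever encoding is actually applied. Grouping by first character only is a simplification of the 2-dimensional layout the slide mentions.

```scala
object NpaCompression {
  // Suffix array by direct sorting (toy-sized inputs only).
  def suffixArray(text: String): Array[Int] =
    text.indices.sortBy(i => text.substring(i)).toArray

  // NPA(i) points at the suffix-array position of the next suffix in text order.
  def nextPointerArray(sa: Array[Int]): Array[Int] = {
    val n = sa.length
    val isa = new Array[Int](n)
    for (i <- sa.indices) isa(sa(i)) = i
    sa.map(s => isa((s + 1) % n))
  }

  // One row of NPA values per first character of the suffix, in suffix-array order.
  def rowsByFirstChar(text: String, sa: Array[Int], npa: Array[Int]): Seq[(Char, Seq[Int])] =
    sa.indices
      .groupBy(i => text.charAt(sa(i)))
      .toSeq
      .sortBy(_._1)
      .map { case (c, positions) => (c, positions.sorted.map(p => npa(p))) }

  // Store the first value, then only the gaps between neighbours.
  def deltaEncode(values: Seq[Int]): Seq[Int] =
    if (values.isEmpty) Seq.empty
    else values.head +: values.zip(values.tail).map { case (a, b) => b - a }

  // Inverse of deltaEncode: running sum of the gaps.
  def deltaDecode(deltas: Seq[Int]): Seq[Int] =
    deltas.scanLeft(0)(_ + _).tail
}
```

For `banana$` the rows are `$ -> 4`, `a -> 0, 5, 6`, `b -> 3`, `n -> 1, 2`. Each row is increasing, so its gaps are small non-negative numbers that a run-length or variable-length code can store compactly.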
Evaluation: Storage Footprint
Takeaway: Succinct can push >11x more data in memory
[Plot: data size that fits in memory (GB), axis 75 to 225, for the SmallKV and LargeKV datasets; systems: MongoDB, Cassandra, HyperDex, Succinct; reference line: RAM]
Figure 12: Succinct pushes more than 10× larger amount of data in memory compared to the next best system, while providing similar or stronger functionality.
10 node 150GB cluster
Evaluation: Throughput (95% GET + 5% PUT)
Takeaway: Succinct achieves performance comparable to existing
open source systems for queries on primary attributes
10 node 150GB cluster, uniform random access pattern
Evaluation: Throughput (95% SEARCH + 5% PUT)
Takeaway: By pushing more data into faster storage, Succinct provides
performance similar to existing systems for 10-11x larger data sizes.
10 node 150GB cluster, search queries with 1-10K occurrences
Evaluation: RegEx Latency
Takeaway: Succinct significantly speeds up RegEx queries even when
all the data fits in memory for all systems.
40GB Wikipedia dataset, 5 commonly used RegEx queries
Single EC2 node, 32 vCPUs, 244GB RAM
Support for JSON
val ids1 = succinctJsonRDD.search("AMPLab")
Search for JSON documents containing "AMPLab"
val ids2 = succinctJsonRDD.filter("city", "Berkeley")
Filter JSON documents where the "city" attribute has value "Berkeley"
val jsonDoc = succinctJsonRDD.get(0)
Get JSON document with id 0
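To make the semantics of the three calls concrete, here is a plain-Scala stand-in that needs neither Spark nor Succinct. It is a hypothetical model only: `DocStore`, the flat `Map[String, String]` documents, and the choice of document position as the id are assumptions, and nothing here reflects how Succinct stores or queries compressed JSON.

```scala
object JsonStoreSketch {
  // Assumption: a document is a flat attribute -> value map, identified by its position.
  type Doc = Map[String, String]

  class DocStore(docs: IndexedSeq[Doc]) {
    // get: return the document with the given id.
    def get(id: Int): Doc = docs(id)

    // search: ids of documents in which any value contains the term.
    def search(term: String): Seq[Int] =
      docs.indices.filter(id => docs(id).values.exists(_.contains(term)))

    // filter: ids of documents whose attribute equals the given value exactly.
    def filter(attr: String, value: String): Seq[Int] =
      docs.indices.filter(id => docs(id).get(attr).contains(value))
  }
}
```

Both `search` and `filter` return ids rather than documents, matching the `ids1` and `ids2` names on the slide. A caller would then fetch the matching documents with `get`.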
Layer Additions & Deletions
RATE = 8: 9 12
RATE = 4: 3 14
RATE = 2: 15 0 8 5
Layer Deletions: simple (the RATE = 2 layer is dropped)
Layer Addition: unsampled values already computed during query execution (8 and 15 in the RATE = 2 layer)
Layers in LSA populated opportunistically!!
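The layer mechanics can be sketched as a small Scala class. This is an illustrative toy under stated assumptions, not BlowFish's data structure: the class name, the `compute` function standing in for the expensive on-the-fly computation, and the rule that a value is stored in the coarsest layer whose rate divides its index are all hypothetical. Unlike a real layered sampled array, every layer here starts empty so that the opportunistic filling is easy to observe.

```scala
import scala.collection.mutable

// `compute` stands in for the costly computation of an unsampled value.
class LayeredSampledArray(compute: Int => Int) {
  // rate -> (index -> value) for the values stored in that layer.
  private val layers = mutable.Map[Int, mutable.Map[Int, Int]]()

  // Counts how often the expensive path was taken.
  var computations = 0

  // Adding a layer allocates nothing up front; it fills as queries run.
  def addLayer(rate: Int): Unit = {
    layers.getOrElseUpdate(rate, mutable.Map.empty[Int, Int])
    ()
  }

  // Deleting a layer simply drops its stored values.
  def deleteLayer(rate: Int): Unit = {
    layers.remove(rate)
    ()
  }

  def get(i: Int): Int =
    layers.values.collectFirst { case layer if layer.contains(i) => layer(i) }.getOrElse {
      computations += 1
      val v = compute(i)
      // Opportunistic fill: keep the value if some existing layer covers index i.
      layers.keys.toSeq.sorted.reverse.find(r => i % r == 0).foreach(r => layers(r)(i) = v)
      v
    }
}
```

In this model, a repeated query for an index covered by a layer pays the computation once. After `deleteLayer`, the same index goes back to being computed on every access, which is the storage versus latency trade the slide describes.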
Spatial Skew
Load distribution across partitions is heavily skewed
Selective Replication: #Replicas ∝ Load (wasted cache!)
BlowFish: Fractionally change storage just enough to meet load
1.5x higher throughput than Selective Replication, within 10% of optimal
[Figure: Load per Object (ticks 1 and 10); storage labels: Compressed, Uncompressed]
Changes in Spatial Skew
Study on Facebook Warehouse Cluster [HotStorage'13]
Transient failures → 90% of failures
Replica creation delayed by 15 mins
Leads to variation in load over time
[Figure: data partitions (Replica#1, Replica#2, Replica#3) and their request queues]
Changes in Spatial Skew
[Figure: per-replica panels (Replica#1, Replica#2, Replica#3) showing Load and Throughput in operations/second (0 to 3000) and request queue size (0K to 50K) over time (0 to 120 mins)]
Adapts to 3x higher load in < 5 mins


SF Big Analytics: Introduction to Succinct by UC Berkeley AmpLab

  • 7. Interactive Queries at Scale: Search (Tweets by @AMPLab about #Succinct), Regular Expressions (Links to Berkeley or Stanford domains: .*(berkeley|stanford).edu), Range Queries (All my Facebook posts between 2013 and 2016), Graph Queries (Friends of my friends who like trekking), Random Access, Aggregate Queries, Updates
  • 10. Interactive Queries at Scale: Search, Random Access, Regular Expressions, Range Queries, Graph Queries, Aggregate Queries, Updates; Compute Platforms, Query Engines, Data Stores
  • 18. Interactive Queries at Scale. Today's focus on two main issues: ‣ Performance degradation when data size > memory ‣ Handling skewed query workloads [Plot: Throughput (Ops), 0 to 2000, vs. Input Size, 1GB to 128GB, with the maximum sustainable throughput marked]
  • 23. Our Solution: Succinct [NSDI'15] and BlowFish [NSDI'16] ‣ Compressed representation → More queries in faster storage ‣ Rich functionality directly on compressed representation ‣ Search, RegEx, Range queries ‣ Flexible support for different data models ‣ Handles skewed & time-varying workloads [Diagram labels: Succinct, Encryption, GraphStore, KVStore, ColumnarStore, RowStore, UnstructuredData]
  • 35. Existing Techniques. Example: SEARCH( ). Data Scans (Ex: Apache Spark): Low storage, High Latency. Indexes (Ex: SOLR): High storage, Low Latency. [Index figure, lists of offsets: 0, 10, 14, 16, 19, 26, 29 | 1, 4, 5, 8, 20, 22, 24 | 2, 15, 17, 27 | 3, 6, 7, 9, 12, 13, 18, 23 | .. | 11, 21]
  • 44. Succinct: Queries executed directly on the compressed representation; Low Storage, Low Latency. What makes Succinct unique: No additional indexes (query responses embedded in the compressed representation); No data scans (functionality of indexes); No decompression (queries directly on the compressed representation, except for data access queries)
  • 48. Succinct: Queries executed directly on the compressed representation; Low Storage, Low Latency. Scale (in-memory data sizes >= memory capacity); Complex queries (search, range, random access, RegEx); Interactivity (avoids data scans and decompression)
  • 57. Succinct Data Representation: Builds on a large body of theory work. Suffix Arrays ‣ Strong functionality (search) ‣ No structure. Compression? ‣ Sample the suffix array ‣ Store set of pointers to compute unsampled values on the fly. Possesses structure that enables compression!
  • 60. Succinct Data Model: Unified Interface ‣ Unstructured data ‣ Key-value stores (Voldemort, Dynamo) ‣ Document store (Elasticsearch, MongoDB) ‣ Tables (Cassandra, BigTable) ‣ And many more ... With all the powerful queries on values, documents, columns
  • 61. Data Model & Functionality For unstructured data:
  • 62. Data Model & Functionality Original Input Succinct For unstructured data:
  • 63. Data Model & Functionality Original Input Succinct SEARCH( )= {0, 10, 14, 16, 19, 26, 29} Search: returns offsets of arbitrary strings in uncompressed file For unstructured data:
  • 64. Data Model & Functionality Original Input Succinct SEARCH( )= {0, 10, 14, 16, 19, 26, 29} For unstructured data: Extract(0, 5) = { , , , , } Extract: returns data at arbitrary offsets in uncompressed file
  • 65. Data Model & Functionality Original Input Succinct SEARCH( )= {0, 10, 14, 16, 19, 26, 29} For unstructured data: Extract(0, 5) = { , , , , } COUNT( ) = 7 Count: returns count of arbitrary strings in uncompressed file
  • 66. Data Model & Functionality Original Input Succinct SEARCH( )= {0, 10, 14, 16, 19, 26, 29} For unstructured data: Extract(0, 5) = { , , , , } COUNT( ) = 7 Append( , , , , ) Append: appends arbitrary strings to uncompressed file
  • 67. Data Model & Functionality Original Input Succinct SEARCH( )= {0, 10, 14, 16, 19, 26, 29} For unstructured data: Extract(0, 5) = { , , , , } COUNT( ) = 7 Append( , , , , ) Range Queries, REGULAR EXPRESSIONS
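The four operations above can be pinned down with a naive, uncompressed reference model. This is only a sketch of the semantics described on the slides, not the Succinct API: `FlatFileModel`, `FlatFile` and the method signatures are hypothetical, and the model scans the raw string instead of querying a compressed representation.

```scala
// Naive reference model of search / count / extract / append on a flat file.
// Hypothetical names; Succinct answers the same queries on compressed data.
object FlatFileModel {
  final case class FlatFile(data: String) {
    // search: offsets of every (possibly overlapping) occurrence of `query`
    def search(query: String): Seq[Int] =
      (0 to data.length - query.length).filter(i => data.startsWith(query, i))

    // count: number of occurrences of `query`
    def count(query: String): Int = search(query).size

    // extract: up to `length` characters starting at `offset`
    def extract(offset: Int, length: Int): String =
      data.substring(offset, math.min(data.length, offset + length))

    // append: a new file with `suffix` added at the end
    def append(suffix: String): FlatFile = FlatFile(data + suffix)
  }
}
```

For example, on the input `banana`, `search("ana")` yields offsets 1 and 3, and `count("ana")` is 2.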
  • 72. Unifying the Data Models SEARCH(Column1, )
  • 73. Unifying the Data Models SEARCH(Column1, )SEARCH( )
  • 84. Queries on Compressed RDDs New Functionalities Document store, 
 Key-Value store search on documents, values
  • 85. Queries on Compressed RDDs New Functionalities Document store, 
 Key-Value store search on documents, values Faster operations on RDDs random access, filters avoid scans
  • 86. Queries on Compressed RDDs New Functionalities Document store, 
 Key-Value store search on documents, values Faster operations on RDDs random access, filters avoid scans More in-memory Compressed RDDs no decompression overheads
  • 87. Unstructured data using SuccinctRDD
  • 88. import edu.berkeley.cs.succinct._ Import classes Unstructured data using SuccinctRDD
  • 89. import edu.berkeley.cs.succinct._ val rdd = ctx.textFile(…).map(_.getBytes) val succinctRDD = rdd.succinct Load data & compress using Succinct Unstructured data using SuccinctRDD
  • 90. import edu.berkeley.cs.succinct._ val rdd = ctx.textFile(…).map(_.getBytes) val succinctRDD = rdd.succinct val offsets = succinctRDD.search("Berkeley") Find all occurrences of “Berkeley” Unstructured data using SuccinctRDD
  • 91. import edu.berkeley.cs.succinct._ val rdd = ctx.textFile(…).map(_.getBytes) val succinctRDD = rdd.succinct val count = succinctRDD.count("Berkeley") val offsets = succinctRDD.search("Berkeley") Count #occurrences of “Berkeley” Unstructured data using SuccinctRDD
  • 92. import edu.berkeley.cs.succinct._ val rdd = ctx.textFile(…).map(_.getBytes) val succinctRDD = rdd.succinct val bytes = succinctRDD.extract(50, 100) val count = succinctRDD.count("Berkeley") val offsets = succinctRDD.search("Berkeley") Extract 100 bytes from offset 50 Unstructured data using SuccinctRDD
  • 93. Key-Value Store using SuccinctKVRDD
  • 94. import edu.berkeley.cs.succinct.kv._ Import classes Key-Value Store using SuccinctKVRDD
  • 95. import edu.berkeley.cs.succinct.kv._ val kvRDD = rdd.zipWithIndex.map(t => (t._2, t._1.getBytes))
 val succinctKVRDD = kvRDD.succinctKV Load data & compress using Succinct Key-Value Store using SuccinctKVRDD
  • 96. import edu.berkeley.cs.succinct.kv._ val kvRDD = rdd.zipWithIndex.map(t => (t._2, t._1.getBytes))
 val succinctKVRDD = kvRDD.succinctKV val keys = succinctKVRDD.search("Berkeley") Find all keys for values that contain “Berkeley” Key-Value Store using SuccinctKVRDD
  • 97. import edu.berkeley.cs.succinct.kv._ val kvRDD = rdd.zipWithIndex.map(t => (t._2, t._1.getBytes))
 val succinctKVRDD = kvRDD.succinctKV val value = succinctKVRDD.get(0) val keys = succinctKVRDD.search("Berkeley") Get value for key 0 Key-Value Store using SuccinctKVRDD
  • 101. Evaluation Dataset Cluster Workload Wikipedia dataset
 ~40GB data Amazon EC2, 5 machines, 30GB RAM each Search queries, 1-10,000 occurrences
  • 102. Evaluation Dataset Cluster Workload Systems Wikipedia dataset
 ~40GB data Amazon EC2, 5 machines, 30GB RAM each Search queries, 1-10,000 occurrences Spark, Elasticsearch
  • 103. Evaluation Dataset Cluster Workload Systems Wikipedia dataset
 ~40GB data Amazon EC2, 5 machines, 30GB RAM each Search queries, 1-10,000 occurrences Spark, Elasticsearch Caveats Absolute numbers are dataset dependent
  • 105. Evaluation: Search Takeaway: Succinct on Apache Spark is 2.5x faster than Elasticsearch while being 2.5x more space efficient.
 (Data fits in memory for all systems)
  • 106. Support for Regular Expressions
  • 107. Support for Regular Expressions Applications Data Cleaning
 Information Extraction
 Bioinformatics Document Stores
  • 108. Support for Regular Expressions Applications Operators Data Cleaning
 Information Extraction
 Bioinformatics Document Stores Union, Concat, Wildcard, Repeat
  • 109. Support for Regular Expressions Applications Operators Data Cleaning
 Information Extraction
 Bioinformatics Document Stores Union, Concat, Wildcard, Repeat Example .*(berkeley|stanford).edu
  • 110. Support for Regular Expressions
  • 111. Support for Regular Expressions val matches = succinctRDD.regexSearch(".*(berkeley|stanford).edu") Find all matches for the RegEx “.*(berkeley|stanford).edu” SuccinctRDD
  • 112. Support for Regular Expressions val matches = succinctRDD.regexSearch(".*(berkeley|stanford).edu") Find all matches for the RegEx “.*(berkeley|stanford).edu” SuccinctRDD val matchKeys = succinctKVRDD.regexSearch(".*(berkeley|stanford).edu") Find all keys for values that contain the RegEx “.*(berkeley|stanford).edu” SuccinctKVRDD
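To make the example pattern concrete, here is a small sketch using the standard `scala.util.matching.Regex` on a plain string. It illustrates only what the pattern matches; it does not use Succinct's `regexSearch`, and `RegexSemanticsDemo` and its members are hypothetical names. Note that the `.` before `edu` is unescaped in the slide's pattern, so it matches any character; the `strict` variant below escapes it.

```scala
import scala.util.matching.Regex

object RegexSemanticsDemo {
  // Pattern exactly as written on the slide: the dot before "edu" is a wildcard.
  val asOnSlide: Regex = ".*(berkeley|stanford).edu".r

  // Variant that requires a literal dot before "edu".
  val strict: Regex = """.*(berkeley|stanford)\.edu""".r

  // Start offset and matched text of every match in `text`.
  def matches(pattern: Regex, text: String): List[(Int, String)] =
    pattern.findAllMatchIn(text).map(m => (m.start, m.matched)).toList
}
```

On the three-line input `a.stanford.edu`, `b.mit.edu`, `c.berkeleyXedu`, the slide's pattern matches the first and third lines, while the strict variant matches only the first.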
  • 115. Evaluation: RegEx Takeaway: Succinct significantly speeds up RegEx queries even when all the data fits in memory for all systems.
  • 117. Succinct on Apache Spark Already in use at Elsevier Labs
  • 118. Succinct on Apache Spark Already in use at Elsevier Labs ‣ Use case: Annotation Search
  • 119. Succinct on Apache Spark Already in use at Elsevier Labs ‣ Use case: Annotation Search Documents
  • 120. Succinct on Apache Spark Already in use at Elsevier Labs ‣ Use case: Annotation Search Documents 1, sentence, (0, 15) 2, word, (0, 4) 3, word, (5, 10) 4, word, (11, 15) Annotations
  • 121. Succinct on Apache Spark Already in use at Elsevier Labs ‣ Use case: Annotation Search Documents 1, sentence, (0, 15) 2, word, (0, 4) 3, word, (5, 10) 4, word, (11, 15) Annotations “Find sentences that talk about open problems in research”
  • 122. Succinct on Apache Spark Already in use at Elsevier Labs ‣ Use case: Annotation Search Documents 1, sentence, (0, 15) 2, word, (0, 4) 3, word, (5, 10) 4, word, (11, 15) Annotations (remains|is|still) (unknown|unclear|uncertain) within <sentence> RegEx Annotation “Find sentences that talk about open problems in research”
  • 123. Succinct on Apache Spark Already in use at Elsevier Labs ‣ Use case: Annotation Search Documents 1, sentence, (0, 15) 2, word, (0, 4) 3, word, (5, 10) 4, word, (11, 15) Annotations https://spark-packages.org/package/amplab/succinct (remains|is|still) (unknown|unclear|uncertain) within <sentence> RegEx Annotation “Find sentences that talk about open problems in research”
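The annotation query above combines a RegEx with a containment test against annotation spans. The following is a rough sketch of that idea on a single in-memory document. It is not the Elsevier Labs or Succinct implementation: `AnnotationSearchDemo`, the `Annotation` case class and the choice of an exclusive end offset are all assumptions made for illustration.

```scala
object AnnotationSearchDemo {
  // (id, kind, start, end) as on the slide; `end` is assumed to be exclusive.
  final case class Annotation(id: Int, kind: String, start: Int, end: Int)

  private val openProblem = "(remains|is|still) (unknown|unclear|uncertain)".r

  // Sentence annotations that fully contain at least one RegEx match.
  def sentencesWithOpenProblems(doc: String, annotations: Seq[Annotation]): Seq[Annotation] = {
    val hits = openProblem.findAllMatchIn(doc).map(m => (m.start, m.end)).toList
    annotations.filter { a =>
      a.kind == "sentence" && hits.exists { case (s, e) => a.start <= s && e <= a.end }
    }
  }
}
```

The sketch scans the raw text; the point of the slide is that the same "match within annotation" query can be answered on the compressed representation.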
  • 124. Problem: Skewed Query Workloads
  • 125. Problem: Skewed Query Workloads Load distribution across partitions is often non-uniform
  • 126. Problem: Skewed Query Workloads ‣ Succinct: Larger fraction of queries in main memory ‣ Challenge: skewed load across shards? ‣ Challenge: time varying loads? Load distribution across partitions is often non-uniform
  • 127. Problem: Skewed Query Workloads ‣ Succinct: Larger fraction of queries in main memory ‣ Challenge: skewed load across shards? ‣ Challenge: time varying loads? ‣ E.g.: Memcached + MySQL deployment @ Facebook Load distribution across partitions is often non-uniform
  • 128. Problem: Skewed Query Workloads Load distribution across partitions is often non-uniform
  • 129. Problem: Skewed Query Workloads Load distribution across partitions is often non-uniform
  • 130. Selective Replication Problem: Skewed Query Workloads Load distribution across partitions is often non-uniform Traditional approach:
  • 131. Selective Replication Problem: Skewed Query Workloads Load distribution across partitions is often non-uniform #Replicas Traditional approach:
  • 132. Selective Replication Problem: Skewed Query Workloads Load distribution across partitions is often non-uniform #Replicas #Replicas ∝ Load Traditional approach:
  • 133. Selective Replication Problem: Skewed Query Workloads Load distribution across partitions is often non-uniform #Replicas #Replicas ∝ Load Coarse grained Traditional approach:
  • 134. Selective Replication Problem: Skewed Query Workloads Load distribution across partitions is often non-uniform #Replicas #Replicas ∝ Load Coarse grained 1-2× throughput → 2× storage Traditional approach:
  • 145. Recap: Succinct stores a sampled suffix array BlowFish: Layered Sampled Array
  • 146. Recap: Succinct stores a sampled suffix array BlowFish: Layered Sampled Array Unsampled values computed on the fly
  • 147. BlowFish: Layered Sampled Array Recap: Succinct stores a sampled suffix array Unsampled values computed on the fly Original Sampled Array (Rate = 2): 9 15 3 0 12 8 14 5
  • 148. BlowFish: Layered Sampled Array Recap: Succinct stores a sampled suffix array Unsampled values computed on the fly Original Sampled Array (Rate = 2): 9 15 3 0 12 8 14 5
  • 149. BlowFish: Layered Sampled Array Recap: Succinct stores a sampled suffix array Unsampled values computed on the fly Original Sampled Array (Rate = 2): 9 15 3 0 12 8 14 5 Layers: RATE = 8: 9 12
  • 150. BlowFish: Layered Sampled Array Recap: Succinct stores a sampled suffix array Unsampled values computed on the fly Original Sampled Array (Rate = 2): 9 15 3 0 12 8 14 5 Layers: RATE = 8: 9 12 RATE = 4: 3 14
  • 151. BlowFish: Layered Sampled Array Recap: Succinct stores a sampled suffix array Unsampled values computed on the fly Original Sampled Array (Rate = 2): 9 15 3 0 12 8 14 5 Layers: RATE = 8: 9 12 RATE = 4: 3 14 RATE = 2: 15 0 8 5
  • 152. BlowFish: Layered Sampled Array Recap: Succinct stores a sampled suffix array Unsampled values computed on the fly Original Sampled Array (Rate = 2): 9 15 3 0 12 8 14 5 Layers: RATE = 8: 9 12 RATE = 4: 3 14 RATE = 2: 15 0 8 5 Different combination of layers
  • 153. BlowFish: Layered Sampled Array Recap: Succinct stores a sampled suffix array Unsampled values computed on the fly Original Sampled Array (Rate = 2): 9 15 3 0 12 8 14 5 Layers: RATE = 8: 9 12 RATE = 4: 3 14 RATE = 2: 15 0 8 5 Different combination of layers → Different points on tradeoff curve
  • 154. BlowFish: Layered Sampled Array Recap: Succinct stores a sampled suffix array Unsampled values computed on the fly Original Sampled Array (Rate = 2): 9 15 3 0 12 8 14 5 Layers: RATE = 8: 9 12 RATE = 4: 3 14 RATE = 2: 15 0 8 5 Different combination of layers → Different points on tradeoff curve Layer Additions and Deletions
  • 155. BlowFish: Layered Sampled Array Recap: Succinct stores a sampled suffix array Unsampled values computed on the fly Original Sampled Array (Rate = 2): 9 15 3 0 12 8 14 5 Layers: RATE = 8: 9 12 RATE = 4: 3 14 RATE = 2: 15 0 8 5 Different combination of layers → Different points on tradeoff curve Layer Additions and Deletions → Move along tradeoff curve
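The layer split shown on these slides can be reproduced with a few lines of Scala. This is a sketch under assumptions, not BlowFish's implementation: it assumes the k-th value of the rate-2 sampled array sits at position `k * 2` of the underlying array, and that each sample is stored in the coarsest layer whose rate divides its position. `LayeredSampledArrayDemo`, `layers` and `stored` are hypothetical names.

```scala
object LayeredSampledArrayDemo {
  // values(k) is assumed to be the sample at position k * (smallest rate).
  // Each (position, value) pair goes to the coarsest layer whose rate divides the position.
  def layers(values: Vector[Int], rates: Seq[Int]): Map[Int, Vector[(Int, Int)]] = {
    val baseRate = rates.min
    val coarseFirst = rates.sorted.reverse
    values.zipWithIndex
      .map { case (v, k) => (k * baseRate, v) }
      .groupBy { case (pos, _) => coarseFirst.find(pos % _ == 0).get }
  }

  // Samples available when only the layers in `keep` are held in memory.
  def stored(all: Map[Int, Vector[(Int, Int)]], keep: Set[Int]): Vector[(Int, Int)] =
    all.filter { case (rate, _) => keep(rate) }.values.flatten.toVector.sortBy(_._1)
}
```

With the slide's array `9 15 3 0 12 8 14 5` and rates 8, 4 and 2, the layers come out as `9 12`, `3 14` and `15 0 8 5`. Keeping only the rate-8 and rate-4 layers leaves the samples of a rate-4 array, which is one way to read "different combination of layers, different points on tradeoff curve".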
  • 157. BlowFish: Technical Details ‣ How should partitions share cache on a server?
  • 158. BlowFish: Technical Details ‣ How should partitions share cache on a server?
  • 159. BlowFish: Technical Details ‣ How should partitions share cache on a server? Low Threshold
  • 160. BlowFish: Technical Details ‣ How should partitions share cache on a server? Low Threshold High Threshold
  • 161. BlowFish: Technical Details ‣ How should partitions share cache on a server? Low Threshold High Threshold
  • 162. BlowFish: Technical Details ‣ How should partitions share cache on a server? ‣ How should partitions share cache across servers? Low Threshold High Threshold
  • 163. BlowFish: Technical Details ‣ How should partitions share cache on a server? ‣ How should partitions share cache across servers? ‣ How should requests be scheduled across replicas? Low Threshold High Threshold
  • 164. BlowFish: Technical Details ‣ How should partitions share cache on a server? ‣ How should partitions share cache across servers? ‣ How should requests be scheduled across replicas? Unified Solution: Back-pressure style scheduling Low Threshold High Threshold
  • 165. BlowFish: Technical Details ‣ How should partitions share cache on a server? ‣ How should partitions share cache across servers? ‣ How should requests be scheduled across replicas? Unified Solution: Back-pressure style scheduling Cache proportional to load, Low Threshold High Threshold
  • 166. BlowFish: Technical Details ‣ How should partitions share cache on a server? ‣ How should partitions share cache across servers? ‣ How should requests be scheduled across replicas? Unified Solution: Back-pressure style scheduling Cache proportional to load, without explicit coordination Low Threshold High Threshold
  • 167. BlowFish: Technical Details ‣ How should partitions share cache on a server? ‣ How should partitions share cache across servers? ‣ How should requests be scheduled across replicas? Unified Solution: Back-pressure style scheduling 1.5x higher throughput than Selective Replication, Low Threshold High Threshold
  • 168. BlowFish: Technical Details ‣ How should partitions share cache on a server? ‣ How should partitions share cache across servers? ‣ How should requests be scheduled across replicas? Unified Solution: Back-pressure style scheduling 1.5x higher throughput than Selective Replication, within 11% of maximum possible throughput Low Threshold High Threshold
  • 170. ‣ Standalone system (prototyped & tested) Succinct + BlowFish
  • 171. ‣ Standalone system (prototyped & tested) ‣ Spark Package: Succinct on Apache Spark Succinct + BlowFish
  • 172. ‣ Standalone system (prototyped & tested) ‣ Spark Package: Succinct on Apache Spark ‣ As libraries ‣ C++, Java, Scala ‣ for ease of integration Succinct + BlowFish
  • 175. Array of Suffixes (AoS) banana$ (Input)
  • 176. Array of Suffixes (AoS) banana$ banana$ anana$ nana$ ana$ na$ a$ $ Suffixes (Input)
  • 177. Array of Suffixes (AoS) banana$ banana$ anana$ nana$ ana$ na$ a$ $ Suffixes $ a$ ana$ anana$ banana$ na$ nana$ Array of Suffixes (AoS) lexicographical order (Input)
  • 178. AoS to Input (AoS2Input) Array $ a$ ana$ anana$ banana$ na$ nana$ AoS 6 AoS2Input 5 3 1 0 4 2 b Input 0 1 2 3 4 5 6 a n a n a $
  • 179. AoS to Input (AoS2Input) Array $ a$ ana$ anana$ banana$ na$ nana$ AoS 6 AoS2Input 5 3 1 0 4 2 b Input 0 1 2 3 4 5 6 a n a n a $ locations of suffixes (suffix array)
  • 180. AoS to Input (AoS2Input) Array $ a$ ana$ anana$ banana$ na$ nana$ AoS 6 AoS2Input 5 3 1 0 4 2 b Input 0 1 2 3 4 5 6 a n a n a $ locations of suffixes (suffix array)
  • 181. AoS to Input (AoS2Input) Array $ a$ ana$ anana$ banana$ na$ nana$ AoS 6 AoS2Input 5 3 1 0 4 2 b Input 0 1 2 3 4 5 6 a n a n a $ locations of suffixes (suffix array)
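The AoS2Input array for `banana$` can be reproduced with a short sketch. It sorts suffix start offsets with a naive comparison sort (quadratic-ish and only suitable for tiny inputs), then answers a search by binary searching the sorted suffixes. `SuffixArrayDemo`, `aos2Input` and `search` are hypothetical names, and this is the textbook suffix-array technique rather than Succinct's compressed implementation.

```scala
object SuffixArrayDemo {
  // AoS2Input: start offsets of all suffixes, in lexicographic order of the suffixes.
  def aos2Input(input: String): Vector[Int] =
    input.indices.toVector.sortBy(i => input.substring(i))

  // Offsets of all occurrences of `query`, found by two binary searches over the AoS.
  def search(input: String, sa: Vector[Int], query: String): Vector[Int] = {
    // Suffix at `row`, truncated to the query length so it can be compared to the query.
    def prefixOf(row: Int): String = input.substring(sa(row)).take(query.length)

    // First row whose truncated suffix satisfies `stop` (monotone over sorted rows).
    def firstRow(stop: String => Boolean): Int = {
      var lo = 0
      var hi = sa.length
      while (lo < hi) {
        val mid = (lo + hi) / 2
        if (stop(prefixOf(mid))) hi = mid else lo = mid + 1
      }
      lo
    }

    val start = firstRow(_.compareTo(query) >= 0)
    val end = firstRow(_.compareTo(query) > 0)
    sa.slice(start, end).sorted
  }
}
```

For `banana$` this yields `6 5 3 1 0 4 2`, matching the slide, and a search for `ana` returns offsets 1 and 3.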
  • 189. Next Pointer Array: Reducing AoS Size $ a$ ana$ anana$ banana$ na$ nana$ AoS 0 1 2 3 4 5 6 NPA
  • 190. Next Pointer Array: Reducing AoS Size $ a$ ana$ anana$ banana$ na$ nana$ AoS 0 1 2 3 4 5 6 NPA 3
  • 191. Next Pointer Array: Reducing AoS Size $ a$ ana$ anana$ banana$ na$ nana$ AoS 0 1 2 3 4 5 6 NPA 3
  • 192. Next Pointer Array: Reducing AoS Size $ a$ ana$ anana$ banana$ na$ nana$ AoS 0 1 2 3 4 5 6 NPA 3 6
  • 193. Next Pointer Array: Reducing AoS Size $ a$ ana$ anana$ banana$ na$ nana$ AoS 0 1 2 3 4 5 6 NPA 3 6
  • 194. Next Pointer Array: Reducing AoS Size $ a$ ana$ anana$ banana$ na$ nana$ AoS 0 1 2 3 4 5 6 NPA 2 3 6
  • 195. Next Pointer Array: Reducing AoS Size $ a$ ana$ anana$ banana$ na$ nana$ AoS 0 1 2 3 4 5 6 NPA 2 3 6
  • 196. Next Pointer Array: Reducing AoS Size $ a$ ana$ anana$ banana$ na$ nana$ AoS 0 1 2 3 4 5 6 NPA 4 0 5 1 2 3 6
  • 197. Next Pointer Array: Reducing AoS Size $ a$ ana$ anana$ banana$ na$ nana$ AoS 0 1 2 3 4 5 6 NPA 4 0 5 1 2 AoS NPA $0 1 2 3 4 5 6 a a a b n n 4 0 5 6 3 1 2 3 6
  • 198. Next Pointer Array: Reducing AoS Size $ a$ ana$ anana$ banana$ na$ nana$ AoS 0 1 2 3 4 5 6 NPA 4 0 5 1 2 AoS NPA $0 1 2 3 4 5 6 a a a b n n 4 0 5 6 3 1 2 Store only the first character (entire suffix can be computed “on the fly” using Next Pointer Array (NPA)) 3 6
  • 199. Next Pointer Array: Reducing AoS Size $ a$ ana$ anana$ banana$ na$ nana$ AoS 0 1 2 3 4 5 6 NPA 4 0 5 1 2 AoS NPA $0 1 2 3 4 5 6 a a a b n n 4 0 5 6 3 1 2 3 6
  • 200. Next Pointer Array: Reducing AoS Size $ a$ ana$ anana$ banana$ na$ nana$ AoS 0 1 2 3 4 5 6 NPA 4 0 5 1 2 AoS NPA $0 1 2 3 4 5 6 a a a b n n 4 0 5 6 3 1 2 3 6 a
  • 201. Next Pointer Array: Reducing AoS Size $ a$ ana$ anana$ banana$ na$ nana$ AoS 0 1 2 3 4 5 6 NPA 4 0 5 1 2 AoS NPA $0 1 2 3 4 5 6 a a a b n n 4 0 5 6 3 1 2 3 6 an
  • 202. Next Pointer Array: Reducing AoS Size $ a$ ana$ anana$ banana$ na$ nana$ AoS 0 1 2 3 4 5 6 NPA 4 0 5 1 2 AoS NPA $0 1 2 3 4 5 6 a a a b n n 4 0 5 6 3 1 2 3 6 an
  • 203. Next Pointer Array: Reducing AoS Size $ a$ ana$ anana$ banana$ na$ nana$ AoS 0 1 2 3 4 5 6 NPA 4 0 5 1 2 AoS NPA $0 1 2 3 4 5 6 a a a b n n 4 0 5 6 3 1 2 3 6 ana
  • 204. Next Pointer Array: Reducing AoS Size $ a$ ana$ anana$ banana$ na$ nana$ AoS 0 1 2 3 4 5 6 NPA 4 0 5 1 2 AoS NPA $0 1 2 3 4 5 6 a a a b n n 4 0 5 6 3 1 2 3 6 ana
  • 205. Next Pointer Array: Reducing AoS Size $ a$ ana$ anana$ banana$ na$ nana$ AoS 0 1 2 3 4 5 6 NPA 4 0 5 1 2 AoS NPA $0 1 2 3 4 5 6 a a a b n n 4 0 5 6 3 1 2 3 6 ana$
  • 206. Next Pointer Array: Reducing AoS Size $ a$ ana$ anana$ banana$ na$ nana$ AoS 0 1 2 3 4 5 6 NPA 4 0 5 1 2 AoS NPA $0 1 2 3 4 5 6 a a a b n n 4 0 5 6 3 1 2 3 6 ana$
  • 207. Next Pointer Array: Reducing AoS Size $ a$ ana$ anana$ banana$ na$ nana$ AoS 0 1 2 3 4 5 6 NPA 4 0 5 1 2 AoS NPA $0 1 2 3 4 5 6 a a a b n n 4 0 5 6 3 1 2 3 6
  • 208. Next Pointer Array: Reducing AoS Size $ a$ ana$ anana$ banana$ na$ nana$ AoS 0 1 2 3 4 5 6 NPA 4 0 5 1 2 AoS NPA $ a b n 4 0 5 6 3 1 2 0 1 2 3 4 5 6 AoS NPA $0 1 2 3 4 5 6 a a a b n n 4 0 5 6 3 1 2 3 6
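The NPA walk-through on these slides can be checked with a small sketch. It builds the NPA from the suffix array (each row points to the row of the suffix that starts one character later, wrapping around at the end) and then rebuilds a suffix from first characters alone. `NextPointerArrayDemo` and its method names are hypothetical, and the naive suffix sort is only meant for tiny inputs.

```scala
object NextPointerArrayDemo {
  def suffixArray(input: String): Vector[Int] =
    input.indices.toVector.sortBy(i => input.substring(i))

  // npa(row) = row of the suffix starting one character after the suffix at `row`.
  def npa(sa: Vector[Int]): Vector[Int] = {
    val n = sa.length
    val rowOf = new Array[Int](n) // rowOf(offset) = AoS row of the suffix starting at offset
    for (row <- sa.indices) rowOf(sa(row)) = row
    sa.map(offset => rowOf((offset + 1) % n))
  }

  // First character of every suffix, in AoS order.
  def firstChars(input: String, sa: Vector[Int]): Vector[Char] = sa.map(i => input.charAt(i))

  // Rebuild `length` characters of the suffix at `row` using only first characters and NPA.
  def suffixAt(row: Int, first: Vector[Char], next: Vector[Int], length: Int): String = {
    val sb = new StringBuilder
    var r = row
    for (_ <- 0 until length) {
      sb.append(first(r))
      r = next(r)
    }
    sb.toString
  }
}
```

For `banana$` the NPA is `4 0 5 6 3 1 2` and the first characters are `$ a a a b n n`, as on the slides. Starting at row 2 and following the pointers spells `a`, `an`, `ana`, `ana$`.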
  • 209. Reducing the size of AoS2Input 6 AoS2Input 5 0 2 4 NPA 0 5 6 3 1 2 0 1 2 3 4 5 6 3 1 4
  • 210. Reducing the size of AoS2Input 6 AoS2Input 5 0 2 4 NPA 0 5 6 3 1 2 0 1 2 3 4 5 6 3 1 4
  • 211. Reducing the size of AoS2Input 6 AoS2Input 5 0 2 4 NPA 0 5 6 3 1 2 0 1 2 3 4 5 6 3 1 4
  • 212. Reducing the size of AoS2Input 6 AoS2Input 5 0 2 4 NPA 0 5 6 3 1 2 0 1 2 3 4 5 6 3 1 4 AoS2Input NPA 4 0 5 6 3 1 2 6 0 2 0 1 2 3 4 5 6 3
  • 213. Reducing the size of AoS2Input 6 AoS2Input 5 0 2 4 NPA 0 5 6 3 1 2 0 1 2 3 4 5 6 3 1 4 AoS2Input NPA 4 0 5 6 3 1 2 6 0 2 0 1 2 3 4 5 6 3 Store only a few sampled values (unsampled values computed “on the fly” using NPA)
  • 214. Reducing the size of AoS2Input 6 AoS2Input 5 0 2 4 NPA 0 5 6 3 1 2 0 1 2 3 4 5 6 3 1 4 AoS2Input NPA 4 0 5 6 3 1 2 6 0 2 0 1 2 3 4 5 6 3 Store only a few sampled values (unsampled values computed “on the fly” using NPA)
  • 215. Reducing the size of AoS2Input 6 AoS2Input 5 0 2 4 NPA 0 5 6 3 1 2 0 1 2 3 4 5 6 3 1 4 AoS2Input NPA 4 0 5 6 3 1 2 6 0 2 0 1 2 3 4 5 6 3 Store only a few sampled values (unsampled values computed “on the fly” using NPA)
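One way to see how unsampled AoS2Input values are recovered is the sketch below. Each NPA hop moves to the suffix that starts one character later, so after `hops` hops the stored sample is `hops` larger (modulo the input length) than the value being looked up. The sketch samples by row index (every `rate`-th row), which is an assumption made for simplicity; the slides do not show which sampling rule Succinct uses. `SampledLookupDemo`, `sample` and `lookup` are hypothetical names.

```scala
object SampledLookupDemo {
  // Keep AoS2Input only for rows whose index is a multiple of `rate` (assumed sampling rule).
  def sample(sa: Vector[Int], rate: Int): Map[Int, Int] =
    sa.indices.filter(_ % rate == 0).map(row => row -> sa(row)).toMap

  // Recover AoS2Input(row): hop along NPA to a sampled row, then subtract the hop count.
  def lookup(row: Int, sampled: Map[Int, Int], next: Vector[Int]): Int = {
    val n = next.length
    var r = row
    var hops = 0
    while (!sampled.contains(r)) {
      r = next(r)
      hops += 1
    }
    ((sampled(r) - hops) % n + n) % n
  }
}
```

With the `banana$` arrays from the earlier slides (AoS2Input `6 5 3 1 0 4 2`, NPA `4 0 5 6 3 1 2`) and a rate of 3, only rows 0, 3 and 6 are stored, and every other row is recovered in at most three hops. A lower sampling rate stores more values and needs fewer hops, which is the storage versus latency tradeoff that BlowFish adjusts.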
  • 216. Compressing NPA Increasing sequence of integers (values for suffixes starting with same character) Can be compressed (E.g., using run-length encoding) $ a b n 4 0 5 6 3 1 2
  • 217. Compressing NPA Increasing sequence of integers (values for suffixes starting with same character) Can be compressed (E.g., using run-length encoding) Succinct uses a 2-dimensional representation of NPA $ a b n 4 0 5 6 3 1 2
  • 218. Compressing NPA Increasing sequence of integers (values for suffixes starting with same character) Can be compressed (E.g., using run-length encoding) Succinct uses a 2-dimensional representation of NPA $ a b n 4 0 5 6 3 1 2
  • 219. Compressing NPA Increasing sequence of integers (values for suffixes starting with same character) Can be compressed (E.g., using run-length encoding) Succinct uses a 2-dimensional representation of NPA - better compressibility $ a b n 4 0 5 6 3 1 2
  • 220. Compressing NPA Increasing sequence of integers (values for suffixes starting with same character) Can be compressed (E.g., using run-length encoding) Succinct uses a 2-dimensional representation of NPA - better compressibility - avoids binary search on AoS (lower latency) $ a b n 4 0 5 6 3 1 2
  • 221. Compressing NPA Increasing sequence of integers (values for suffixes starting with same character) Can be compressed (E.g., using run-length encoding) Succinct uses a 2-dimensional representation of NPA - better compressibility - avoids binary search on AoS (lower latency) - enables wider range of queries (E.g., RegEx) $ a b n 4 0 5 6 3 1 2
  • 222. Compressing NPA Increasing sequence of integers (values for suffixes starting with same character) Can be compressed (E.g., using run-length encoding) Succinct uses a 2-dimensional representation of NPA - better compressibility - avoids binary search on AoS (lower latency) - enables wider range of queries (E.g., RegEx) $ a b n 4 0 5 6 3 1 2
  • 223. Compressing NPA Increasing sequence of integers (values for suffixes starting with same character) Can be compressed (E.g., using run-length encoding) Succinct uses a 2-dimensional representation of NPA - better compressibility - avoids binary search on AoS (lower latency) - enables wider range of queries (E.g., RegEx) See upcoming NSDI paper! $ a b n 4 0 5 6 3 1 2
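The observation that NPA values increase within each first-character bucket can be checked directly. The sketch below groups the `banana$` NPA by first character and delta-encodes each bucket. Delta encoding is swapped in here as a simple illustration of why increasing sequences compress well; it is not the run-length or 2-dimensional representation the slides attribute to Succinct. `NpaCompressionDemo` and its method names are hypothetical.

```scala
object NpaCompressionDemo {
  // Group NPA values by the first character of their suffix (rows are already sorted).
  def buckets(first: Vector[Char], next: Vector[Int]): Vector[(Char, Vector[Int])] =
    first.distinct.map { c =>
      c -> next.zip(first).collect { case (v, ch) if ch == c => v }
    }

  // Delta encoding: keep the first value, then the gaps between neighbours.
  def deltaEncode(values: Vector[Int]): Vector[Int] =
    if (values.isEmpty) values
    else values.head +: values.zip(values.tail).map { case (a, b) => b - a }

  // Inverse of deltaEncode: running sums of the gaps.
  def deltaDecode(deltas: Vector[Int]): Vector[Int] =
    deltas.scanLeft(0)(_ + _).tail
}
```

For `banana$` the buckets are `$: 4`, `a: 0 5 6`, `b: 3` and `n: 1 2`. Because each bucket is increasing, all gaps are positive and typically much smaller than the raw values on large inputs, so they fit in fewer bits.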
  • 224. Evaluation: Storage Footprint 10 node 150GB cluster
  • 225. Evaluation: Storage Footprint 10 node 150GB cluster [Figure: Data Size that Fits in Memory (GB), axis 75 to 225; labels: SmallKV, LargeKV; MongoDB, Cassandra, HyperDex, Succinct, RAM] Figure 12: Succinct pushes more than 10× larger amount of data in memory compared to the next best system, while providing similar or stronger functionality.
  • 226. Evaluation: Storage Footprint 10 node 150GB cluster Takeaway: Succinct can push >11x more data in memory [Figure: Data Size that Fits in Memory (GB), axis 75 to 225; labels: SmallKV, LargeKV; MongoDB, Cassandra, HyperDex, Succinct, RAM] Figure 12: Succinct pushes more than 10× larger amount of data in memory compared to the next best system, while providing similar or stronger functionality.
  • 227. Evaluation: Throughput (95% GET + 5% PUT) 10 node 150GB cluster, uniform random access pattern
  • 228. Evaluation: Throughput (95% GET + 5% PUT) 10 node 150GB cluster, uniform random access pattern
  • 229. Evaluation: Throughput (95% GET + 5% PUT) Takeaway: Succinct achieves performance comparable to existing open source systems for queries on primary attributes 10 node 150GB cluster, uniform random access pattern
  • 230. Evaluation: Throughput (95% SEARCH + 5% PUT) 10 node 150GB cluster, search queries with 1-10K occurrences
  • 231. Evaluation: Throughput (95% SEARCH + 5% PUT) 10 node 150GB cluster, search queries with 1-10K occurrences
  • 232. Evaluation: Throughput (95% SEARCH + 5% PUT) Takeaway: Succinct by pushing more data in faster storage provides performance similar to existing systems for 10-11x larger data sizes. 10 node 150GB cluster, search queries with 1-10K occurrences
  • 233. Evaluation: RegEx Latency 40GB Wikipedia dataset, 5 commonly used RegEx queries Single EC2 node, 32 vCPUs, 244GB RAM
  • 234. Evaluation: RegEx Latency 40GB Wikipedia dataset, 5 commonly used RegEx queries Single EC2 node, 32 vCPUs, 244GB RAM
  • 235. Evaluation: RegEx Latency Takeaway: Succinct significantly speeds up RegEx queries even when all the data fits in memory for all systems. 40GB Wikipedia dataset, 5 commonly used RegEx queries Single EC2 node, 32 vCPUs, 244GB RAM
  • 237. val ids1 = succinctJsonRDD.search("AMPLab") Search for JSON documents containing “AMPLab” Support for JSON
  • 238. val ids2 = succinctJsonRDD.filter("city", "Berkeley") val ids1 = succinctJsonRDD.search("AMPLab") Filter JSON documents where the “city” attribute has value “Berkeley” Support for JSON
  • 239. val jsonDoc = succinctJsonRDD.get(0) val ids2 = succinctJsonRDD.filter("city", "Berkeley") val ids1 = succinctJsonRDD.search("AMPLab") Get JSON document with id 0 Support for JSON
  • 240. Layer Additions & Deletions
  • 241. Layer Additions & Deletions RATE = 8: 9 12 RATE = 4: 3 14 RATE = 2: 15 0 8 5
  • 242. Layer Additions & Deletions RATE = 8: 9 12 RATE = 4: 3 14 Layer Deletions: simple
  • 243. Layer Additions & Deletions RATE = 8: 9 12 RATE = 4: 3 14 RATE = 2: Layer Addition:
  • 244. Layer Additions & Deletions RATE = 8: 9 12 RATE = 4: 3 14 RATE = 2: Layer Addition: Unsampled values already computed during query execution
  • 245. Layer Additions & Deletions RATE = 8: 9 12 RATE = 4: 3 14 RATE = 2: 8 15 Layer Addition: Unsampled values already computed during query execution Layers in LSA populated opportunistically!!
  • 247. Spatial Skew Load distribution across partitions is heavily skewed
  • 248. Object Load 1 Compressed Wasted Cache! Spatial Skew Load distribution across partitions is heavily skewed #Replicas ∝ Load Selective Replication
  • 249. Spatial Skew Load distribution across partitions is heavily skewed #Replicas ∝ Load Selective Replication BlowFish Fractionally change storage just enough to meet load 1 Compressed Uncompressed 10 Object Load
  • 250. Spatial Skew Load distribution across partitions is heavily skewed #Replicas ∝ Load Selective Replication BlowFish Fractionally change storage just enough to meet load 1.5x higher throughput than Selective Replication, 1 Compressed Uncompressed 10 Object Load
  • 251. Spatial Skew Load distribution across partitions is heavily skewed #Replicas ∝ Load Selective Replication BlowFish Fractionally change storage just enough to meet load 1.5x higher throughput than Selective Replication, within 10% of optimal 1 Compressed Uncompressed 10 Object Load
  • 253. Changes in Spatial Skew Study on Facebook Warehouse Cluster [HotStorage’13]
  • 254. Changes in Spatial Skew Transient failures → 90% of failures Study on Facebook Warehouse Cluster [HotStorage’13]
  • 255. Changes in Spatial Skew Transient failures → 90% of failures Replica creation delayed by 15 mins Study on Facebook Warehouse Cluster [HotStorage’13]
  • 256. Changes in Spatial Skew Transient failures → 90% of failures Replica creation delayed by 15 mins Study on Facebook Warehouse Cluster [HotStorage’13] Leads to variation in load over time
  • 257. Changes in Spatial Skew Transient failures → 90% of failures Replica creation delayed by 15 mins Replica#1 Replica#2 Replica#3 Data Partitions Request Queues Study on Facebook Warehouse Cluster [HotStorage’13] Leads to variation in load over time
  • 258. Changes in Spatial Skew Transient failures → 90% of failures Replica creation delayed by 15 mins Replica#1 Replica#2 Replica#3 Data Partitions Request Queues Study on Facebook Warehouse Cluster [HotStorage’13] Leads to variation in load over time
  • 259. Changes in Spatial Skew Transient failures → 90% of failures Replica creation delayed by 15 mins Replica#1 Replica#2 Replica#3 Data Partitions Request Queues Study on Facebook Warehouse Cluster [HotStorage’13] Leads to variation in load over time
  • 260. Changes in Spatial Skew Replica#1 Replica#2 Replica#3
  • 261. Changes in Spatial Skew Replica#1 Replica#2 Replica#3
  • 262-271. Changes in Spatial Skew [Figure: Load and Throughput (Operations/second, 0 to 3000) and Request Queue Size (0K to 50K) over Time (mins, 0 to 120) for Replica#1, Replica#2, Replica#3]
  • 272. Changes in Spatial Skew [Figure: Load and Throughput (Operations/second, 0 to 3000) and Request Queue Size (0K to 50K) over Time (mins, 0 to 120) for Replica#1, Replica#2, Replica#3] Adapts to 3x higher load in < 5 mins