Efficient top-k query processing on distributed column family databases
Rui Vieira, MSc ITEC — 14/08/13
Ranking (top-k) queries

We use top-k queries every day:
● Search engines (the top 100 pages for certain words)
● Analytics applications (the most visited pages per day)
Ranking (top-k) queries

Definition
Find the k objects with the highest aggregated score over a function f
(f is usually a summation function over attributes).

Example:
Find the top 10 students with the highest grades over all modules.

Module 1: John 39%, Emma 48%, Brian 50%, Steve 75%, Anna 50%, Peter 59%, Paul 80%, Mary 89%, Richard 91%
Module 2: John 89%, Emma 88%, Brian 70%, Steve 65%, Anna 60%, Peter 59%, Paul 50%, Mary 49%, Richard 31%
Module n: John 82%, Emma 78%, Brian 90%, Steve 85%, Anna 83%, Peter 81%, Paul 70%, Mary 59%, Richard 51%

Summing the three modules shown, for instance, gives Steve 65 + 75 + 85 = 225
and Emma 88 + 48 + 78 = 214, so Steve ranks above Emma.
Motivation: real-time distributed top-k queries

Why real-time top-k queries?
● To be integrated into a larger real-time analytics platform
● “User” real-time = a few hundred milliseconds to one second
● To implement solutions that make efficient use of memory, bandwidth and computation
● To handle massive amounts of data

Use case:
We are logging page views on a website. Can we find the 10 most visited pages
in the last 7 days? What about 10 months? All under 1 second?
Top-k queries: simplistic solution

Solutions that can answer ranking queries, but not in real time:

“Naive” method
• Fetch all objects and scores from all sources
• Aggregate them in memory
• Sort all aggregations
• Select the k highest scoring

[Diagram: peers 1..n send all their <object, score> pairs to a Query
Coordinator, which merges all the data, aggregates the scores, sorts all the
aggregates and selects the k highest.]

Not feasible:
• For large amounts of data
• Possibly doesn't fit in RAM
• Execution time most likely not real-time
• Not efficient: low-scoring objects are processed
• Due to all of the above: not scalable
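As a concrete reference point, a minimal sketch of this naive method in Java,
assuming each peer's data has already been fetched into an in-memory map
(class and method names are illustrative, not from the original implementation):

import java.util.*;

public final class NaiveTopK {
    /** Merge per-peer score lists, aggregate per object id, sort, take the k highest. */
    static List<Map.Entry<String, Long>> topK(List<Map<String, Long>> peers, int k) {
        final Map<String, Long> totals = new HashMap<>();
        for (final Map<String, Long> peer : peers) {                     // merge all data
            for (final Map.Entry<String, Long> e : peer.entrySet()) {
                totals.merge(e.getKey(), e.getValue(), Long::sum);       // aggregate scores
            }
        }
        final List<Map.Entry<String, Long>> all = new ArrayList<>(totals.entrySet());
        all.sort(Map.Entry.<String, Long>comparingByValue().reversed()); // sort every aggregate
        return all.subList(0, Math.min(k, all.size()));                  // select the k highest
    }
}

Every object, however low-scoring, is transferred, held in memory and sorted —
exactly the costs listed above.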
Top-k queries: Batch solutions
Batch operations (Hadoop / Map-Reduce)
Pros
• Proven solution to (some) top-k scenarios
• Excellent for “report” style use cases
Cons
• Still has to process all the information
• Not real-time
Our requirements
● Work with “Peers” which are distributed logically (rows)
as well as physically (nodes)
● Nodes in the cluster support only a (very) limited set of operations (no arbitrary code execution)
● Low latency (fixed number of round-trips)
● Offer considerable savings of bandwidth and execution time
● Possible to adapt to data access patterns and models in Cassandra
Algorithms
Algorithms: Related Work

The threshold family of algorithms was pioneered by Fagin et al.
Objective: determine a threshold below which an object cannot be
a top-k object
Initial Threshold Algorithms (TA) however:
• Not designed with distributed data sources in mind
• Performance highly dependent on data shape (skewness, correlation ...)
• Unbounded round-trips to data source → unbounded latency
• TA keeps performing random accesses until it reaches a
stopping point
Algorithms: Related Work
Three algorithms were selected:
• Three-Phase Uniform Threshold (TPUT)
• Distributed fixed round-trip exact algorithm
• Hybrid Threshold
• Distributed fixed round-trip exact algorithm
• KLEE
• Distributed fixed round-trip approximate algorithm
• However these algorithms were developed for P2P networks
• As far as we know, they have never been implemented with
distributed column-family databases previously
Algorithms: TPUT

Phase 1:
● Request the local top-k list from each of the m peers
● Calculate a partial sum per object (missing scores = 0) and select the kth
highest as min-k

Phase 2:
● Request from every peer all objects with score ⩾ min-k / m
(an object missed by all m peers has a total below m · (min-k / m) = min-k,
so it cannot be in the top-k)
● Re-calculate the partial sums and select the kth highest score as the threshold
● For each object, form a worst-score (missing scores = 0) and a best-score
(missing scores = min-k / m); an object whose best-score exceeds the threshold
(the kth highest worst-score) is a candidate

Phase 3:
● Request the candidates' scores from every peer
● Compute the final partial sums; the k highest are the top-k
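To make the three phases concrete, a minimal in-memory sketch, with each peer
modelled as an id → score map (names such as kthHighest are illustrative; the
actual implementation issues these steps as CQL queries, as shown later):

import java.util.*;
import java.util.stream.Collectors;

public final class TputSketch {

    // kth highest value among the partial sums (0 if fewer than k objects).
    static long kthHighest(Map<String, Long> sums, int k) {
        return sums.values().stream()
                .sorted(Comparator.reverseOrder())
                .skip(k - 1L).findFirst().orElse(0L);
    }

    /** Exact TPUT over in-memory peers (object id -> score, one map per peer). */
    static Map<String, Long> topK(List<Map<String, Long>> peers, int k) {
        final int m = peers.size();
        // Phase 1: local top-k per peer; partial sums with missing scores = 0.
        final Map<String, Long> psum = new HashMap<>();
        for (final Map<String, Long> peer : peers) {
            peer.entrySet().stream()
                    .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                    .limit(k)
                    .forEach(e -> psum.merge(e.getKey(), e.getValue(), Long::sum));
        }
        final double t = kthHighest(psum, k) / (double) m;   // T = min-k / m
        // Phase 2: fetch everything scoring >= T; build worst- and best-scores.
        final Map<String, Long> worst = new HashMap<>();
        final Map<String, Integer> seen = new HashMap<>();
        for (final Map<String, Long> peer : peers) {
            peer.forEach((id, s) -> {
                if (s >= t) {
                    worst.merge(id, s, Long::sum);
                    seen.merge(id, 1, Integer::sum);
                }
            });
        }
        final long threshold = kthHighest(worst, k);         // kth highest worst-score
        final Set<String> candidates = worst.keySet().stream()
                .filter(id -> worst.get(id) + (m - seen.get(id)) * t >= threshold)
                .collect(Collectors.toSet());                // best-score test
        // Phase 3: exact sums for the candidates; the k highest are the top-k.
        final Map<String, Long> exact = new HashMap<>();
        for (final Map<String, Long> peer : peers) {
            peer.forEach((id, s) -> { if (candidates.contains(id)) exact.merge(id, s, Long::sum); });
        }
        return exact.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(k)
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                        (a, b) -> a, LinkedHashMap::new));
    }
}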
Algorithms: Hybrid Threshold

Phase 1:
Same as in TPUT, i.e. the objective is to determine the first threshold
T = min-k / m.

Phase 2:
● Send each peer the candidates seen so far, together with T
● Each peer determines its lowest-scoring candidate S_lowest and returns all
objects with score ⩾ T_i = max(S_lowest, T)
● Re-calculate the partial sums and select the kth highest score as τ2
● For every peer where T_i < τ2 / m, fetch all objects with score > τ2 / m

Phase 3:
● Re-calculate the partial sums and select the kth highest score as τ3
● Candidates = objects with partial sum > τ3
Algorithms: KLEE
• TPUT variant
• Trade-off between accuracy and bandwidth
• Relies on summary data (statistical meta-data)
to better estimate min-k without going “deep” on index lists
Fundamental data structures for meta-data:
• Histograms
• Bloom filters
Algorithms: KLEE (Histograms)
● Equi-width cells
● Configurable number of cells
● Each cell n stores:
● Highest score in n (ub)
● Lowest score in n (lb)
● Average score for n (avg)
● Number of objects in n (freq)
Example:
Cell #10 (covers scores from 900-1000):
● ub = 989
● lb = 901
● avg = 937.4
● freq = 200
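A minimal sketch of building such an equi-width histogram for one peer
(a simplification assuming scores in [0, maxScore]; Cell and build are
illustrative names, not the thesis classes):

import java.util.Collection;

public final class EquiWidthHistogram {
    static final class Cell {
        double ub = Double.NEGATIVE_INFINITY;  // highest score in the cell
        double lb = Double.POSITIVE_INFINITY;  // lowest score in the cell
        double sum;                            // avg = sum / freq (when freq > 0)
        long freq;                             // number of objects in the cell
    }

    /** Bucket scores into n equi-width cells over [0, maxScore]. */
    static Cell[] build(Collection<Double> scores, double maxScore, int n) {
        final Cell[] cells = new Cell[n];
        for (int i = 0; i < n; i++) cells[i] = new Cell();
        final double width = maxScore / n;
        for (final double s : scores) {
            final int i = Math.min((int) (s / width), n - 1); // clamp maxScore into the last cell
            final Cell c = cells[i];
            c.ub = Math.max(c.ub, s);
            c.lb = Math.min(c.lb, s);
            c.sum += s;
            c.freq++;
        }
        return cells;
    }
}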
Algorithms: KLEE (Bloom filters)

[Diagram: an m-bit array; each object O in the set S is hashed by h1 … hn and
the corresponding bits are set. For a probe P, if any of h1(P) … hn(P) points
at an unset bit, then P ∉ S.]
● Bit set with objects hashed into positions
● Allows for very fast membership queries
● Space-efficient data structure
● However, the mapping is not invertible → the member objects cannot be recovered from a Bloom filter alone
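The implementation already relies on Guava (see the ListenableFuture usage
later); Guava's own BloomFilter is enough to illustrate the membership
behaviour — a sketch, not necessarily the filter used in the thesis:

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

public final class BloomSketch {
    public static void main(String[] args) {
        // 10,000 expected insertions, 1% target false-positive rate.
        final BloomFilter<CharSequence> filter =
                BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 10_000, 0.01);
        filter.put("O1");
        filter.put("O89");
        System.out.println(filter.mightContain("O1"));   // true
        System.out.println(filter.mightContain("O99"));  // false (or a rare false positive)
        // There is no "get": membership can be tested, but the objects
        // cannot be enumerated from the filter alone.
    }
}

This is exactly what KLEE needs: the coordinator can test which histogram cell
an object falls into without the peers ever shipping full object lists.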
Algorithms: KLEE
Consists of 4 or (optionally) 3 steps
1 - Exploration Step
Approximate a min-k threshold based on statistical meta-data
2 - Optimisation Step
Decide whether to execute step 3 or go directly to step 4
3 - Candidate Filtering
Filter high-scoring candidates
4 - Candidate Retrieval
Fetch candidates from peers
Algorithms: KLEE (Phase 1)

From each peer, fetch:
● the local top-k objects,
● the c “top” histogram cells together with their Bloom filters,
● the freq and avg of the c “low” cells.

For each object seen so far, at each peer:
● If it was in that peer's top-k, use its true score.
● Otherwise, if it is in one of the peer's Bloom filters, use the
corresponding cell's avg value as the score estimate.
● Otherwise, use the weighted average of the low cells.

Compute the partial sums, select the kth highest score as min-k, and take all
objects with score > min-k / m as candidates.
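A sketch of the per-peer score estimation in this phase. Cell bundles a cell's
avg, freq and Bloom filter; the frequency-weighted average of the low cells is
an assumption based on the slide's “weighted avg”:

import com.google.common.hash.BloomFilter;
import java.util.List;

public final class KleePhase1Sketch {
    static final class Cell {
        final double avg; final long freq; final BloomFilter<CharSequence> filter;
        Cell(double avg, long freq, BloomFilter<CharSequence> filter) {
            this.avg = avg; this.freq = freq; this.filter = filter;
        }
    }

    /** Estimate an object's score at one peer from the "top" cells' filters, else the "low" cells. */
    static double estimate(String id, List<Cell> topCells, List<Cell> lowCells) {
        for (final Cell c : topCells) {
            if (c.filter.mightContain(id)) return c.avg;  // matched a top cell: use its average
        }
        double sum = 0; long n = 0;                       // otherwise: weighted average of low cells
        for (final Cell c : lowCells) { sum += c.avg * c.freq; n += c.freq; }
        return n == 0 ? 0 : sum / n;
    }
}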
Algorithms: KLEE (Phase 3)

● Request from each peer a bit set of all objects scoring higher than min-k / m
● Perform a statistical pruning, leaving only the most “common” objects
(Note: this step was not implemented, due to the computational
limitations of Cassandra nodes)
Algorithms: KLEE (Phase 4)
● Request all the candidates from the peers
● Perform a partial sum with the true scores of objects
● Select the k highest as our top-k
Cassandra
Cassandra (architecture overview)
● Fully decentralised column-family store
● High (almost linear) scalability
● No single point of failure (no “master” or “slave” nodes)
● Automatic replication
● Clients can read and write to any node in cluster
● Cassandra takes over duties of partitioning and replicating automatically
Cassandra (architecture overview)

● Automatic partitioning of data (the random partitioner is commonly used)
● Rows are distributed across nodes by a hash of the partition key (the 1st PK component)

[Diagram: rows of table foo, keyed by dates such as "2013-08-14" … "2013-08-16"
and holding (id, score) columns, are assigned to nodes A–D by MD5 hashing of
the row key.]
Cassandra (data model)

● Columns are ordered upon insertion (ordered by the PKs)
● Columns in the same row are physically co-located
● Range searches are fast, e.g. score < 10000
(simply a linear seek on disk)

[Diagram: for row "2013-08-16", table_forward stores the (id, score) columns
with comparator id (ascending), while table_reverse stores the same pairs with
comparator score (ascending).]
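A sketch of how such a forward/reverse pair might be declared through the
DataStax Java driver the project uses (table and column names follow the
slides; this is an assumption, not the thesis schema):

import com.datastax.driver.core.Session;

public final class SchemaSketch {
    /** Create the forward/reverse pair on an existing Session. */
    static void createTables(Session session) {
        // Forward table: row per peer (date), columns clustered by id,
        // for exact-score lookups of known candidates.
        session.execute("CREATE TABLE table_forward ("
                + "date text, id text, score bigint, "
                + "PRIMARY KEY (date, id))");
        // Reverse table: same data clustered by score, so threshold scans
        // (score >= T within one row) become sequential reads on disk.
        session.execute("CREATE TABLE table_reverse ("
                + "date text, score bigint, id text, "
                + "PRIMARY KEY (date, score, id))");
    }
}

The design choice is to pay twice on write (and storage) so that both access
patterns — by id and by score — stay sequential at read time.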
Cassandra (CQL)
Data manipulation language for Cassandra is CQL
● Similar in syntax to SQL
INSERT INTO table (foo, bar) VALUES (42, 'Meaning')
SELECT foo, bar FROM table WHERE foo = 42
Limitations
● No joins, unions or sub-selects
● No aggregation functions (min, max, etc...)
● Inequality searches are bound to primary key declaration order (next slide)
Cassandra (CQL)
Consider the following table
CREATE TABLE visits(
date timestamp,
user_id bigint,
hits bigint,
PRIMARY KEY (date, user_id))
Although the following would be valid SQL queries,
they are not valid CQL:
SELECT * FROM visits WHERE hits > 1000
SELECT * FROM visits WHERE user_id > 900 AND hits = 0
Inequality queries are restricted to PKs and return
contiguous columns, such as
SELECT * FROM visits WHERE date = 1368438171000 AND user_id > 1000
Implementation
Implementation (overview)

[Diagram: inside the JVM, a Query Coordinator runs the KLEE, HT and TPUT
implementations against a Peer interface (peer1..peern). Each peer issues
asynchronous calls through the driver to Cassandra nodes A–D and receives
results via callbacks.]
Implementation: challenges

Implement forward and reverse tables to allow lookup both by score and by id
● Space is cheap
● Space is even cheaper as Cassandra uses built-in data compression
● Space is even cheaper as denormalised data usually compresses better
than normalised data
● Advantage: score columns are pre-ordered at the row level

[Diagram: the same forward/reverse pair as before — for row "2013-08-16",
table_forward is ordered by id (ascending), table_reverse by score (ascending).]
Implementation: challenges

Map algorithmic steps to CQL logic: decompose tasks.

● Single step in the original algorithm (a node can execute arbitrary code):
the Query Coordinator sends T; the peer determines its local lowest-scoring
candidate S_lowest and directly returns all objects with score > max(T, S_lowest).

● Multiple steps in this implementation (we can only communicate with a node
via CQL): the Query Coordinator fetches the candidates' scores, itself
determines S_lowest and T_i = max(T, S_lowest), then issues a second query
fetching all objects with score > T_i.
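This decomposition suggests an asynchronous Peer interface along these lines
(method names follow those used in the later slides; the exact thesis
interface and the ScoredObject type are assumptions):

import com.google.common.util.concurrent.ListenableFuture;
import java.util.List;

/** One logical peer = one row; every algorithmic step becomes a CQL query. */
interface Peer {
    /** Local top-k from the reverse (score-ordered) table. */
    ListenableFuture<List<ScoredObject>> getTopKAsync(int k);

    /** All objects in this row with score above the given threshold. */
    ListenableFuture<List<ScoredObject>> getAboveAsync(long threshold);

    /** Exact scores for the given candidate ids, from the forward table. */
    ListenableFuture<List<ScoredObject>> getObjectsAsync(List<String> ids);
}

/** Simple (id, score) pair. */
final class ScoredObject {
    final String id; final long score;
    ScoredObject(String id, long score) { this.id = id; this.score = score; }
}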
Implementation: TPUT (phase 1)

• The Query Coordinator (QC) asks for the top-k list from each peer 1..m,
invoking the Peer async methods
• The QC stores the set of all distinct objects received in a
concurrency-safe collection
• The QC calculates a partial sum for each object using a thread-safe
Accumulator data structure:

S_psum(O) = S'_peer1(O) + … + S'_peerm(O)

where S'_i(O) = S_i(O) if O has been returned by node i, and 0 otherwise.

Let's assume the partial sums are:
[O89, 1590], [O73, 1590], [O1, 1000], [O21, 990], [O12, 880], [O51, 780], [O801, 680]

Calculate the first threshold from the kth highest partial sum τ1:

T = τ1 / m

[Diagram: the QC fetches the top-k from each peer's inverse table; e.g. peer 1
holds (1000, O1), (900, O89), (800, O73), …; peer 2 holds (190, O1),
(690, O89), (790, O73), …]
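A minimal sketch of such a thread-safe Accumulator (the thesis version accepts
Pair objects; an (id, score) signature and Java 8 idioms are assumed here):

import java.util.Comparator;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Thread-safe partial-sum accumulator: many driver callbacks add, the QC reads. */
public final class Accumulator {
    private final ConcurrentHashMap<String, AtomicLong> sums = new ConcurrentHashMap<>();

    /** Add one (id, score) pair; safe to call from concurrent callbacks. */
    public void add(String id, long score) {
        sums.computeIfAbsent(id, k -> new AtomicLong()).addAndGet(score);
    }

    /** The kth highest partial sum seen so far (0 if fewer than k objects). */
    public long getKthValue(int k) {
        return sums.values().stream().map(AtomicLong::get)
                .sorted(Comparator.reverseOrder())
                .skip(k - 1L).findFirst().orElse(0L);
    }
}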
Implementation: TPUT (phase 2)

The QC issues a request to every peer for all objects with score > T
from the inverse table (peer.getAbove(T)).

With the received objects, it recalculates the partial sums
(for each Pair → accumulator.add(pair)).

It designates the kth highest partial sum as the new threshold:
t2 = accumulator.getKthValue(k)

[Diagram: the QC fetches score > T from each peer's inverse table.]
Implementation: TPUT (phase 3)

● Fetch the final candidates from the forward table
● Call the async Peer methods
● Aggregate the scores and nominate the k highest scoring as the top-k

[Diagram: the QC fetches the final candidates from each peer's forward table,
e.g. peer 1 holds (O1, 1000), (O89, 900), (O73, 800), …]
Implementation: challenges

Sequential vs. random lookups

All algorithms at some point require random access, and random access is much
slower than sequential access.

[Diagram: in the inverse table, a threshold scan reads a contiguous run of
score-ordered columns (sequential); fetching individual candidates from the
forward table jumps between columns ("random").]

Lookup       # objects   Time (ms)   95% CI (ms)
Sequential   240         1.70        0.27
Random       240         115.16      1.32
Sample size n = 100
Implementation: KLEE challenges

Sequential vs. random lookups

As a consequence of expensive random lookups, a modified KLEE3 variant was
implemented.

KLEE3-M:
In the final phase, instead of filtering out candidates with score < min-k / m,
do a range scan per peer for all objects with score ⩾ min-k / m.

Trade-off: more data transfer for less execution time.
Implementation: KLEE challenges

Mapping data structures to Cassandra's data model

CREATE TABLE table_metadata(
    peer text,
    cell int,
    lb double,
    ub double,
    freq bigint,
    avg double,
    binmax double,
    binmin double,
    filter blob,
    PRIMARY KEY (peer, cell)
) WITH CLUSTERING ORDER BY (cell DESC)

Serialised filter = 0x0000000600000002020100f0084263884418154205141c11
Implementation: KLEE challenges

Mapping data structures to Cassandra's data model

[Diagram: for each peer, the Histogram Creator fetches the entire row,
determines the maximum score and creates n equi-width bins, partitions each
object into its bin and adds it to that bin's Bloom filter, computing freq and
avg per cell (e.g. cell=4: freq=140, avg=230.1), then serialises the Bloom
filters and saves the row.]

Flexible:
● Configurable number of bins
● Configurable maximum false-positive ratio for the filters
Implementation: KLEE

[Diagram: (1) getFullHistAsync / getPartialHistAsync fetch the per-cell
(freq, avg, filter) metadata rows from each peer's metadata table;
(2) getTopKAsync fetches each peer's top-k, e.g. (1000, O1), (900, O12), …,
from the inverse table; (3) the QC combines histograms, Bloom filters and
top-k results to estimate min-k; (4) getObjectsAsync fetches all candidates
scoring above min-k from the forward table and the QC aggregates them.]
Implementation: KLEE challenges

Simple API for Histogram/Bloom table creation:

final HistogramCreator hc =
    new CassandraHistogramCreator(tableDefinition);
// Optionally a max false positive ratio can be defined
hc.createHistogramTableSchema();
hc.createHistogramTable("1998-05-01", … , "1998-07-26");
Implementation: KLEE challenges

Fast generation of the metadata:
● Feasible for “on-the-fly” jobs
● Roughly linear in the number of elements, with an execution time of about
56 ms per peer with 100,000 elements
Implementation: asynchronous communication

● The driver used allows for asynchronous communication
● Extensive use of ListenableFuture
● Allows for highly concurrent access with a smaller thread pool
● Allows asynchronous transformations (e.g. ResultSet to POJO)
public ListenableFuture<ResultList> getAboveAsync(final long value) {
    final ResultSetFuture above = session.executeAsync(statement.bind(value));
    // Transform the driver's ResultSet into our POJO list off the calling thread.
    final Function<ResultSet, ResultList> transformResults = new Function<ResultSet, ResultList>() {
        @Override
        public ResultList apply(final ResultSet rs) {
            final ResultList resultList = new ResultList();
            for (final Row row : rs.all()) {
                resultList.add(
                    Pair.create(row.getBytes(object.getName()), row.getLong(score.getName()))
                );
            }
            return resultList;
        }
    };
    return Futures.transform(above, transformResults, executor);
}
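The per-peer futures can then be gathered with Guava's Futures.allAsList — a
sketch of the surrounding coordinator code, which the slides do not show:

import com.google.common.util.concurrent.Futures;
import com.google.common.util.concurrent.ListenableFuture;
import java.util.List;

public final class GatherSketch {
    /** Combine per-peer futures into one: yields every peer's result, fails fast otherwise. */
    static <T> List<T> awaitAll(List<ListenableFuture<T>> perPeerFutures) throws Exception {
        return Futures.allAsList(perPeerFutures).get();  // single blocking point for the QC
    }
}

This is what keeps each algorithm phase to one round-trip: the coordinator
issues one getAboveAsync(T) per peer, then blocks exactly once.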
Implementation: API

JSON declaration of tables and columns:
{
"wc98_ids": {
"name": "wc98_ids",
"inverse": "wc98_ids_inverse",
"metadata": "wc98_ids_metadata",
"score": {
"name": "visits",
"type": "bigint"
},
"id": {
"name": "id",
"type": "text"
},
"peer": {
"name": "date",
"type": "text"
}
}
}
final QueryCoordinator coordinator =
    QueryCoordinator.create(KLEE.class, tableDefinition);
coordinator.setKeys("1998-05-01", … , "1998-07-26");
final List<Pair> topK = coordinator.getTopK(10);
Datasets
Test data
Datasets: Synthetic (Zipf)
Used in literature as a good approximation of “real-world” data
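Such data can be generated with a simple inverse-CDF sampler — a sketch
assuming a fixed exponent s (the thesis generator's parameters are not shown):

import java.util.Random;

public final class ZipfSketch {
    private final double[] cdf;   // cumulative probabilities over ranks 1..n
    private final Random rng = new Random(42);

    /** Zipf over n ranks with exponent s: P(rank i) ∝ 1 / i^s. */
    ZipfSketch(int n, double s) {
        cdf = new double[n];
        double norm = 0;
        for (int i = 1; i <= n; i++) norm += 1.0 / Math.pow(i, s);
        double acc = 0;
        for (int i = 1; i <= n; i++) {
            acc += (1.0 / Math.pow(i, s)) / norm;
            cdf[i - 1] = acc;
        }
    }

    /** Sample one rank (1-based): smallest index whose CDF covers u, by binary search. */
    int sample() {
        final double u = rng.nextDouble();
        int lo = 0, hi = cdf.length - 1;
        while (lo < hi) {
            final int mid = (lo + hi) >>> 1;
            if (cdf[mid] < u) lo = mid + 1; else hi = mid;
        }
        return lo + 1;
    }
}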
Datasets: 1998 World Cup Data
● Data in Common Log Format (CLF) from the 1998 World Cup web servers
● IP addresses replaced by unique anonymous id
● Widely used in the literature as “real-world” test data
● Around 1.4 billion entries (approximately 2 million unique visitors)
● Range from the 1st of May to the 26th of July 1998
● Highly skewed data
Results
Results: varying k
Results: varying number of peers
Results: Datasets (1998 World Cup Data)

Query: give me the top 20 visitors from 1st June to 18th June.
Data for 18 peers = daily rows from 1st June 1998 to 18th June 1998.
Sample size n = 20.

Algorithm          Data (KB)   Execution time (ms)   95% CI (ms)   Precision (%)
KLEE3              80          319.95                ±8.58         100
KLEE3-M            1,271       84.75                 ±6.5          100
Hybrid Threshold   14,306      1,921.9               ±65.28        100
TPUT               44          141.5                 ±7.36         100
Naive (baseline)   43,572      8,514.6               ±61.38        100
Implementation: Pre-aggregation

Mix and match keys to aggregate results at different granularities.

[Diagram: rows keyed by day ("2013-08-01", "2013-08-02", …) hold per-visitor
counts (e.g. 192.0.43.10 → 98, 192.0.43.11 → 234), while pre-aggregated rows
keyed by month ("2013-08", "2013-09") hold the summed counts
(e.g. 192.0.43.10 → 5398, 192.0.43.11 → 23234).]

coordinator.setKeys("1998-05",
                    "1998-06",
                    "1998-07-01",
                    "1998-07-02");
final List<Pair> topK = coordinator.getTopK(10);

The top-k results are the same, but computed over 4 peers instead of 63 peers.
Results: Pre-aggregation

            Data transfer (KB)               Execution time (ms)
Algorithm   full     aggregated   savings    full      aggregated   savings
KLEE        20,756   633          97%        2,412.2   44.3         98%
HT          14,404   5,894        59%        4,842.6   818.6        83%
TPUT        2,215    61           97%        1,657.1   162.2        90%
Conclusions
• TPUT and HT are well suited for real-time top-k queries with
minimal structural changes in the infrastructure.
• Savings of 98% (TPUT) and 77% (HT) in execution time with no
loss of precision
• Savings of 99.9% (TPUT) and 67% (HT) in data transfer also with no
loss of precision
• KLEE3 requires additional changes to the infrastructure, but:
• It is efficient to create
• The final fetch phase can be discarded for approximate results, with a configurable
trade-off between precision and data transfer / execution time
• Savings of 99% in execution time and 97% in data transfer
Conclusions
• Scalability can be addressed with good planning of data models
together with pre-aggregation
• KLEE3 is more resilient to low object correlation (the common case in
real-world data)
• TPUT and KLEE3 are resilient to high variations of k, which could have
further practical implications
Future work
Implementing KLEE4

● Intravert [1] is an application server built on top of a Cassandra node
● Based on the vert.x application framework
● Communication is done either in a RESTful way or directly with Java client
● Allows passing code (in several JVM languages such as Groovy, Clojure, etc)
which is executed at the “server side”
● Acting as middleware, it is possible to implement processing
(such as the candidate hash set) remotely and return it to our client
● TPUT and HT already implemented using Intravert
● KLEE4 in progress
[1] https://github.com/zznate/intravert-ug
Acknowledgements
Jonathan Halliday (Red Hat)
For technical expertise, supervision and support
Questions?