Efficient top-k query processing on distributed column family databases
Rui Vieira, MSc ITEC — 14/08/13
Ranking (top-k) queries

We use top-k queries every day:
● Search engines (the top 100 pages for certain words)
● Analytics applications (the most visited pages per day)
Ranking (top-k) queries

Definition
Find the k objects with the highest aggregated score over a function f
(f is usually a summation function over attributes).

Example:
Find the top 10 students with the highest grades over all modules.

Module 1: John 39%, Emma 48%, Brian 50%, Steve 75%, Anna 50%, Peter 59%, Paul 80%, Mary 89%, Richard 91%
Module 2: John 89%, Emma 88%, Brian 70%, Steve 65%, Anna 60%, Peter 59%, Paul 50%, Mary 49%, Richard 31%
Module n: John 82%, Emma 78%, Brian 90%, Steve 85%, Anna 83%, Peter 81%, Paul 70%, Mary 59%, Richard 51%

Summing the three modules shown, for instance, gives Steve 65 + 75 + 85 = 225
and Emma 88 + 48 + 78 = 214, so Steve ranks above Emma.
Motivation: real-time distributed top-k queries

Why real-time top-k queries?
● To be integrated into a larger real-time analytics platform
● “User” real-time = a few hundred milliseconds to one second
● To implement solutions that make efficient use of memory, bandwidth and computation
● To handle massive amounts of data

Use case:
We are logging page views on a website. Can we find the 10 most visited pages
in the last 7 days? What about 10 months? All under 1 second?
Top-k queries: simplistic solution

Solutions that can answer ranking queries, but not in real time:

“Naive” method
• Fetch all objects and scores from all sources
• Aggregate them in memory
• Sort all aggregations
• Select the k highest scoring

[Diagram: peers 1..n send all their <object, score> pairs to a Query
Coordinator, which merges all the data, aggregates the scores, sorts all the
aggregates and selects the k highest.]

Not feasible:
• For large amounts of data
• Possibly doesn't fit in RAM
• Execution time most likely not real-time
• Not efficient: low-scoring objects are processed
• Due to all of the above: not scalable
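As a concrete reference point, a minimal sketch of this naive method in Java,
assuming each peer's data has already been fetched into an in-memory map
(class and method names are illustrative, not from the original implementation):

import java.util.*;

public final class NaiveTopK {
    /** Merge per-peer score lists, aggregate per object id, sort, take the k highest. */
    static List<Map.Entry<String, Long>> topK(List<Map<String, Long>> peers, int k) {
        final Map<String, Long> totals = new HashMap<>();
        for (final Map<String, Long> peer : peers) {                     // merge all data
            for (final Map.Entry<String, Long> e : peer.entrySet()) {
                totals.merge(e.getKey(), e.getValue(), Long::sum);       // aggregate scores
            }
        }
        final List<Map.Entry<String, Long>> all = new ArrayList<>(totals.entrySet());
        all.sort(Map.Entry.<String, Long>comparingByValue().reversed()); // sort every aggregate
        return all.subList(0, Math.min(k, all.size()));                  // select the k highest
    }
}

Every object, however low-scoring, is transferred, held in memory and sorted —
exactly the costs listed above.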
Top-k queries: Batch solutions
Batch operations (Hadoop / Map-Reduce)
Pros
• Proven solution to (some) top-k scenarios
• Excellent for “report” style use cases
Cons
• Still has to process all the information
• Not real-time
Our requirements
● Work with “Peers” which are distributed logically (rows)
as well as physically (nodes)
● Nodes in the cluster support only a (very) limited set of operations (no arbitrary code execution)
● Low latency (fixed number of round-trips)
● Offer considerable savings of bandwidth and execution time
● Possible to adapt to data access patterns and models in Cassandra
Algorithms
Algorithms: Related Work

The threshold family of algorithms was pioneered by Fagin et al.
Objective: determine a threshold below which an object cannot be
a top-k object
Initial Threshold Algorithms (TA) however:
• Not designed with distributed data sources in mind
• Performance highly dependent on data shape (skewness, correlation ...)
• Unbounded round-trips to data source → unbounded latency
• TA keeps performing random accesses until it reaches a
stopping point
Algorithms: Related Work
Three algorithms were selected:
• Three-Phase Uniform Threshold (TPUT)
• Distributed fixed round-trip exact algorithm
• Hybrid Threshold
• Distributed fixed round-trip exact algorithm
• KLEE
• Distributed fixed round-trip approximate algorithm
• However these algorithms were developed for P2P networks
• As far as we know, they have never been implemented with
distributed column-family databases previously
Algorithms: TPUT

Phase 1:
● Request the local top-k list from each of the m peers
● Calculate a partial sum per object (missing scores = 0) and select the kth
highest as min-k

Phase 2:
● Request from every peer all objects with score ⩾ min-k / m
(an object missed by all m peers has a total below m · (min-k / m) = min-k,
so it cannot be in the top-k)
● Re-calculate the partial sums and select the kth highest score as the threshold
● For each object, form a worst-score (missing scores = 0) and a best-score
(missing scores = min-k / m); an object whose best-score exceeds the threshold
(the kth highest worst-score) is a candidate

Phase 3:
● Request the candidates' scores from every peer
● Compute the final partial sums; the k highest are the top-k
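To make the three phases concrete, a minimal in-memory sketch, with each peer
modelled as an id → score map (names such as kthHighest are illustrative; the
actual implementation issues these steps as CQL queries, as shown later):

import java.util.*;
import java.util.stream.Collectors;

public final class TputSketch {

    // kth highest value among the partial sums (0 if fewer than k objects).
    static long kthHighest(Map<String, Long> sums, int k) {
        return sums.values().stream()
                .sorted(Comparator.reverseOrder())
                .skip(k - 1L).findFirst().orElse(0L);
    }

    /** Exact TPUT over in-memory peers (object id -> score, one map per peer). */
    static Map<String, Long> topK(List<Map<String, Long>> peers, int k) {
        final int m = peers.size();
        // Phase 1: local top-k per peer; partial sums with missing scores = 0.
        final Map<String, Long> psum = new HashMap<>();
        for (final Map<String, Long> peer : peers) {
            peer.entrySet().stream()
                    .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                    .limit(k)
                    .forEach(e -> psum.merge(e.getKey(), e.getValue(), Long::sum));
        }
        final double t = kthHighest(psum, k) / (double) m;   // T = min-k / m
        // Phase 2: fetch everything scoring >= T; build worst- and best-scores.
        final Map<String, Long> worst = new HashMap<>();
        final Map<String, Integer> seen = new HashMap<>();
        for (final Map<String, Long> peer : peers) {
            peer.forEach((id, s) -> {
                if (s >= t) {
                    worst.merge(id, s, Long::sum);
                    seen.merge(id, 1, Integer::sum);
                }
            });
        }
        final long threshold = kthHighest(worst, k);         // kth highest worst-score
        final Set<String> candidates = worst.keySet().stream()
                .filter(id -> worst.get(id) + (m - seen.get(id)) * t >= threshold)
                .collect(Collectors.toSet());                // best-score test
        // Phase 3: exact sums for the candidates; the k highest are the top-k.
        final Map<String, Long> exact = new HashMap<>();
        for (final Map<String, Long> peer : peers) {
            peer.forEach((id, s) -> { if (candidates.contains(id)) exact.merge(id, s, Long::sum); });
        }
        return exact.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(k)
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                        (a, b) -> a, LinkedHashMap::new));
    }
}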
Algorithms: Hybrid Threshold

Phase 1:
Same as in TPUT, i.e. the objective is to determine the first threshold
T = min-k / m.

Phase 2:
● Send each peer the candidates seen so far, together with T
● Each peer determines its lowest-scoring candidate S_lowest and returns all
objects with score ⩾ T_i = max(S_lowest, T)
● Re-calculate the partial sums and select the kth highest score as τ2
● For every peer where T_i < τ2 / m, fetch all objects with score > τ2 / m

Phase 3:
● Re-calculate the partial sums and select the kth highest score as τ3
● Candidates = objects with partial sum > τ3
Algorithms: KLEE
• TPUT variant
• Trade-off between accuracy and bandwidth
• Relies on summary data (statistical meta-data)
to better estimate min-k without going “deep” on index lists
Fundamental data structures for meta-data:
• Histograms
• Bloom filters
Algorithms: KLEE (Histograms)
● Equi-width cells
● Configurable number of cells
● Each cell n stores:
● Highest score in n (ub)
● Lowest score in n (lb)
● Average score for n (avg)
● Number of objects in n (freq)
Example:
Cell #10 (covers scores from 900-1000):
● ub = 989
● lb = 901
● avg = 937.4
● freq = 200
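A minimal sketch of building such an equi-width histogram for one peer
(a simplification assuming scores in [0, maxScore]; Cell and build are
illustrative names, not the thesis classes):

import java.util.Collection;

public final class EquiWidthHistogram {
    static final class Cell {
        double ub = Double.NEGATIVE_INFINITY;  // highest score in the cell
        double lb = Double.POSITIVE_INFINITY;  // lowest score in the cell
        double sum;                            // avg = sum / freq (when freq > 0)
        long freq;                             // number of objects in the cell
    }

    /** Bucket scores into n equi-width cells over [0, maxScore]. */
    static Cell[] build(Collection<Double> scores, double maxScore, int n) {
        final Cell[] cells = new Cell[n];
        for (int i = 0; i < n; i++) cells[i] = new Cell();
        final double width = maxScore / n;
        for (final double s : scores) {
            final int i = Math.min((int) (s / width), n - 1); // clamp maxScore into the last cell
            final Cell c = cells[i];
            c.ub = Math.max(c.ub, s);
            c.lb = Math.min(c.lb, s);
            c.sum += s;
            c.freq++;
        }
        return cells;
    }
}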
Algorithms: KLEE (Bloom filters)

[Diagram: an m-bit array; each object O in the set S is hashed by h1 … hn and
the corresponding bits are set. For a probe P, if any of h1(P) … hn(P) points
at an unset bit, then P ∉ S.]
● Bit set with objects hashed into positions
● Allows for very fast membership queries
● Space-efficient data structure
● However, the mapping is not invertible → the member objects cannot be recovered from a Bloom filter alone
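The implementation already relies on Guava (see the ListenableFuture usage
later); Guava's own BloomFilter is enough to illustrate the membership
behaviour — a sketch, not necessarily the filter used in the thesis:

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

public final class BloomSketch {
    public static void main(String[] args) {
        // 10,000 expected insertions, 1% target false-positive rate.
        final BloomFilter<CharSequence> filter =
                BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 10_000, 0.01);
        filter.put("O1");
        filter.put("O89");
        System.out.println(filter.mightContain("O1"));   // true
        System.out.println(filter.mightContain("O99"));  // false (or a rare false positive)
        // There is no "get": membership can be tested, but the objects
        // cannot be enumerated from the filter alone.
    }
}

This is exactly what KLEE needs: the coordinator can test which histogram cell
an object falls into without the peers ever shipping full object lists.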
Algorithms: KLEE
Consists of 4 or (optionally) 3 steps
1 - Exploration Step
Approximate a min-k threshold based on statistical meta-data
2 - Optimisation Step
Decide whether to execute step 3 or go directly to step 4
3 - Candidate Filtering
Filter high-scoring candidates
4 - Candidate Retrieval
Fetch candidates from peers
Algorithms: KLEE (Phase 1)

From each peer, fetch:
● the local top-k objects,
● the c “top” histogram cells together with their Bloom filters,
● the freq and avg of the c “low” cells.

For each object seen so far, at each peer:
● If it was in that peer's top-k, use its true score.
● Otherwise, if it is in one of the peer's Bloom filters, use the
corresponding cell's avg value as the score estimate.
● Otherwise, use the weighted average of the low cells.

Compute the partial sums, select the kth highest score as min-k, and take all
objects with score > min-k / m as candidates.
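A sketch of the per-peer score estimation in this phase. Cell bundles a cell's
avg, freq and Bloom filter; the frequency-weighted average of the low cells is
an assumption based on the slide's “weighted avg”:

import com.google.common.hash.BloomFilter;
import java.util.List;

public final class KleePhase1Sketch {
    static final class Cell {
        final double avg; final long freq; final BloomFilter<CharSequence> filter;
        Cell(double avg, long freq, BloomFilter<CharSequence> filter) {
            this.avg = avg; this.freq = freq; this.filter = filter;
        }
    }

    /** Estimate an object's score at one peer from the "top" cells' filters, else the "low" cells. */
    static double estimate(String id, List<Cell> topCells, List<Cell> lowCells) {
        for (final Cell c : topCells) {
            if (c.filter.mightContain(id)) return c.avg;  // matched a top cell: use its average
        }
        double sum = 0; long n = 0;                       // otherwise: weighted average of low cells
        for (final Cell c : lowCells) { sum += c.avg * c.freq; n += c.freq; }
        return n == 0 ? 0 : sum / n;
    }
}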
Algorithms: KLEE (Phase 3)

● Request from each peer a bit set of all objects scoring higher than min-k / m
● Perform a statistical pruning, leaving only the most “common” objects
(Note: this step was not implemented, due to the computational
limitations of Cassandra nodes)
Algorithms: KLEE (Phase 4)
● Request all the candidates from the peers
● Perform a partial sum with the true scores of objects
● Select the k highest as our top-k
Cassandra
Cassandra (architecture overview)
● Fully decentralised column-family store
● High (almost linear) scalability
● No single point of failure (no “master” or “slave” nodes)
● Automatic replication
● Clients can read and write to any node in cluster
● Cassandra takes over duties of partitioning and replicating automatically
Cassandra (architecture overview)

● Automatic partitioning of data (the random partitioner is commonly used)
● Rows are distributed across nodes by a hash of the partition key (the 1st PK component)

[Diagram: rows of table foo, keyed by dates such as "2013-08-14" … "2013-08-16"
and holding (id, score) columns, are assigned to nodes A–D by MD5 hashing of
the row key.]
Cassandra (data model)

● Columns are ordered upon insertion (ordered by the PKs)
● Columns in the same row are physically co-located
● Range searches are fast, e.g. score < 10000
(simply a linear seek on disk)

[Diagram: for row "2013-08-16", table_forward stores the (id, score) columns
with comparator id (ascending), while table_reverse stores the same pairs with
comparator score (ascending).]
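A sketch of how such a forward/reverse pair might be declared through the
DataStax Java driver the project uses (table and column names follow the
slides; this is an assumption, not the thesis schema):

import com.datastax.driver.core.Session;

public final class SchemaSketch {
    /** Create the forward/reverse pair on an existing Session. */
    static void createTables(Session session) {
        // Forward table: row per peer (date), columns clustered by id,
        // for exact-score lookups of known candidates.
        session.execute("CREATE TABLE table_forward ("
                + "date text, id text, score bigint, "
                + "PRIMARY KEY (date, id))");
        // Reverse table: same data clustered by score, so threshold scans
        // (score >= T within one row) become sequential reads on disk.
        session.execute("CREATE TABLE table_reverse ("
                + "date text, score bigint, id text, "
                + "PRIMARY KEY (date, score, id))");
    }
}

The design choice is to pay twice on write (and storage) so that both access
patterns — by id and by score — stay sequential at read time.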
Cassandra (CQL)
Data manipulation language for Cassandra is CQL
● Similar in syntax to SQL
INSERT INTO table (foo, bar) VALUES (42, 'Meaning')
SELECT foo, bar FROM table WHERE foo = 42
Limitations
● No joins, unions or sub-selects
● No aggregation functions (min, max, etc...)
● Inequality searches are bound to primary key declaration order (next slide)
Cassandra (CQL)
Consider the following table
CREATE TABLE visits(
date timestamp,
user_id bigint,
hits bigint,
PRIMARY KEY (date, user_id))
Although the following would be valid SQL queries,
they are not valid CQL:
SELECT * FROM visits WHERE hits > 1000
SELECT * FROM visits WHERE user_id > 900 AND hits = 0
Inequality queries are restricted to PKs and return
contiguous columns, such as
SELECT * FROM visits WHERE date = 1368438171000 AND user_id > 1000
Implementation
Implementation (overview)

[Diagram: inside the JVM, a Query Coordinator runs the KLEE, HT and TPUT
implementations against a Peer interface (peer1..peern). Each peer issues
asynchronous calls through the driver to Cassandra nodes A–D and receives
results via callbacks.]
Implementation: challenges

Implement forward and reverse tables to allow lookup both by score and by id
● Space is cheap
● Space is even cheaper as Cassandra uses built-in data compression
● Space is even cheaper as denormalised data usually compresses better
than normalised data
● Advantage: score columns are pre-ordered at the row level

[Diagram: the same forward/reverse pair as before — for row "2013-08-16",
table_forward is ordered by id (ascending), table_reverse by score (ascending).]
Implementation: challenges

Map algorithmic steps to CQL logic: decompose tasks.

● Single step in the original algorithm (a node can execute arbitrary code):
the Query Coordinator sends T; the peer determines its local lowest-scoring
candidate S_lowest and directly returns all objects with score > max(T, S_lowest).

● Multiple steps in this implementation (we can only communicate with a node
via CQL): the Query Coordinator fetches the candidates' scores, itself
determines S_lowest and T_i = max(T, S_lowest), then issues a second query
fetching all objects with score > T_i.
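This decomposition suggests an asynchronous Peer interface along these lines
(method names follow those used in the later slides; the exact thesis
interface and the ScoredObject type are assumptions):

import com.google.common.util.concurrent.ListenableFuture;
import java.util.List;

/** One logical peer = one row; every algorithmic step becomes a CQL query. */
interface Peer {
    /** Local top-k from the reverse (score-ordered) table. */
    ListenableFuture<List<ScoredObject>> getTopKAsync(int k);

    /** All objects in this row with score above the given threshold. */
    ListenableFuture<List<ScoredObject>> getAboveAsync(long threshold);

    /** Exact scores for the given candidate ids, from the forward table. */
    ListenableFuture<List<ScoredObject>> getObjectsAsync(List<String> ids);
}

/** Simple (id, score) pair. */
final class ScoredObject {
    final String id; final long score;
    ScoredObject(String id, long score) { this.id = id; this.score = score; }
}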
Implementation: TPUT (phase 1)

• The Query Coordinator (QC) asks for the top-k list from each peer 1..m,
invoking the Peer async methods
• The QC stores the set of all distinct objects received in a
concurrency-safe collection
• The QC calculates a partial sum for each object using a thread-safe
Accumulator data structure:

S_psum(O) = S'_peer1(O) + … + S'_peerm(O)

where S'_i(O) = S_i(O) if O has been returned by node i, and 0 otherwise.

Let's assume the partial sums are:
[O89, 1590], [O73, 1590], [O1, 1000], [O21, 990], [O12, 880], [O51, 780], [O801, 680]

Calculate the first threshold from the kth highest partial sum τ1:

T = τ1 / m

[Diagram: the QC fetches the top-k from each peer's inverse table; e.g. peer 1
holds (1000, O1), (900, O89), (800, O73), …; peer 2 holds (190, O1),
(690, O89), (790, O73), …]
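A minimal sketch of such a thread-safe Accumulator (the thesis version accepts
Pair objects; an (id, score) signature and Java 8 idioms are assumed here):

import java.util.Comparator;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Thread-safe partial-sum accumulator: many driver callbacks add, the QC reads. */
public final class Accumulator {
    private final ConcurrentHashMap<String, AtomicLong> sums = new ConcurrentHashMap<>();

    /** Add one (id, score) pair; safe to call from concurrent callbacks. */
    public void add(String id, long score) {
        sums.computeIfAbsent(id, k -> new AtomicLong()).addAndGet(score);
    }

    /** The kth highest partial sum seen so far (0 if fewer than k objects). */
    public long getKthValue(int k) {
        return sums.values().stream().map(AtomicLong::get)
                .sorted(Comparator.reverseOrder())
                .skip(k - 1L).findFirst().orElse(0L);
    }
}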
Implementation: TPUT (phase 2)

The QC issues a request to every peer for all objects with score > T
from the inverse table (peer.getAbove(T)).

With the received objects, it recalculates the partial sums
(for each Pair → accumulator.add(pair)).

It designates the kth highest partial sum as the new threshold:
t2 = accumulator.getKthValue(k)

[Diagram: the QC fetches score > T from each peer's inverse table.]
Implementation: TPUT (phase 3)

● Fetch the final candidates from the forward table
● Call the async Peer methods
● Aggregate the scores and nominate the k highest scoring as the top-k

[Diagram: the QC fetches the final candidates from each peer's forward table,
e.g. peer 1 holds (O1, 1000), (O89, 900), (O73, 800), …]
Implementation: challenges

Sequential vs. random lookups

All algorithms at some point require random access, and random access is much
slower than sequential access.

[Diagram: in the inverse table, a threshold scan reads a contiguous run of
score-ordered columns (sequential); fetching individual candidates from the
forward table jumps between columns ("random").]

Lookup       # objects   Time (ms)   95% CI (ms)
Sequential   240         1.70        0.27
Random       240         115.16      1.32
Sample size n = 100
Implementation: KLEE challenges

Sequential vs. random lookups

As a consequence of expensive random lookups, a modified KLEE3 variant was
implemented.

KLEE3-M:
In the final phase, instead of filtering out candidates with score < min-k / m,
do a range scan per peer for all objects with score ⩾ min-k / m.

Trade-off: more data transfer for less execution time.
Implementation: KLEE challenges

Mapping data structures to Cassandra's data model

CREATE TABLE table_metadata(
    peer text,
    cell int,
    lb double,
    ub double,
    freq bigint,
    avg double,
    binmax double,
    binmin double,
    filter blob,
    PRIMARY KEY (peer, cell)
) WITH CLUSTERING ORDER BY (cell DESC)

Serialised filter = 0x0000000600000002020100f0084263884418154205141c11
Implementation: KLEE challenges

Mapping data structures to Cassandra's data model

[Diagram: for each peer, the Histogram Creator fetches the entire row,
determines the maximum score and creates n equi-width bins, partitions each
object into its bin and adds it to that bin's Bloom filter, computing freq and
avg per cell (e.g. cell=4: freq=140, avg=230.1), then serialises the Bloom
filters and saves the row.]

Flexible:
● Configurable number of bins
● Configurable maximum false-positive ratio for the filters
Implementation: KLEE

[Diagram: (1) getFullHistAsync / getPartialHistAsync fetch the per-cell
(freq, avg, filter) metadata rows from each peer's metadata table;
(2) getTopKAsync fetches each peer's top-k, e.g. (1000, O1), (900, O12), …,
from the inverse table; (3) the QC combines histograms, Bloom filters and
top-k results to estimate min-k; (4) getObjectsAsync fetches all candidates
scoring above min-k from the forward table and the QC aggregates them.]
Implementation: KLEE challenges

Simple API for Histogram/Bloom table creation:

final HistogramCreator hc =
    new CassandraHistogramCreator(tableDefinition);
// Optionally a max false positive ratio can be defined
hc.createHistogramTableSchema();
hc.createHistogramTable("1998-05-01", … , "1998-07-26");
Implementation: KLEE challenges

Fast generation of the metadata:
● Feasible for “on-the-fly” jobs
● Roughly linear in the number of elements, with an execution time of about
56 ms per peer with 100,000 elements
Implementation: asynchronous communication

● The driver used allows for asynchronous communication
● Extensive use of ListenableFuture
● Allows for highly concurrent access with a smaller thread pool
● Allows asynchronous transformations (e.g. ResultSet to POJO)
public ListenableFuture<ResultList> getAboveAsync(final long value) {
    final ResultSetFuture above = session.executeAsync(statement.bind(value));
    // Transform the driver's ResultSet into our POJO list off the calling thread.
    final Function<ResultSet, ResultList> transformResults = new Function<ResultSet, ResultList>() {
        @Override
        public ResultList apply(final ResultSet rs) {
            final ResultList resultList = new ResultList();
            for (final Row row : rs.all()) {
                resultList.add(
                    Pair.create(row.getBytes(object.getName()), row.getLong(score.getName()))
                );
            }
            return resultList;
        }
    };
    return Futures.transform(above, transformResults, executor);
}
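The per-peer futures can then be gathered with Guava's Futures.allAsList — a
sketch of the surrounding coordinator code, which the slides do not show:

import com.google.common.util.concurrent.Futures;
import com.google.common.util.concurrent.ListenableFuture;
import java.util.List;

public final class GatherSketch {
    /** Combine per-peer futures into one: yields every peer's result, fails fast otherwise. */
    static <T> List<T> awaitAll(List<ListenableFuture<T>> perPeerFutures) throws Exception {
        return Futures.allAsList(perPeerFutures).get();  // single blocking point for the QC
    }
}

This is what keeps each algorithm phase to one round-trip: the coordinator
issues one getAboveAsync(T) per peer, then blocks exactly once.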
Implementation: API

JSON declaration of tables and columns:
{
"wc98_ids": {
"name": "wc98_ids",
"inverse": "wc98_ids_inverse",
"metadata": "wc98_ids_metadata",
"score": {
"name": "visits",
"type": "bigint"
},
"id": {
"name": "id",
"type": "text"
},
"peer": {
"name": "date",
"type": "text"
}
}
}
final QueryCoordinator coordinator =
    QueryCoordinator.create(KLEE.class, tableDefinition);
coordinator.setKeys("1998-05-01", … , "1998-07-26");
final List<Pair> topK = coordinator.getTopK(10);
Datasets
Test data
Datasets: Synthetic (Zipf)
Used in literature as a good approximation of “real-world” data
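Such data can be generated with a simple inverse-CDF sampler — a sketch
assuming a fixed exponent s (the thesis generator's parameters are not shown):

import java.util.Random;

public final class ZipfSketch {
    private final double[] cdf;   // cumulative probabilities over ranks 1..n
    private final Random rng = new Random(42);

    /** Zipf over n ranks with exponent s: P(rank i) ∝ 1 / i^s. */
    ZipfSketch(int n, double s) {
        cdf = new double[n];
        double norm = 0;
        for (int i = 1; i <= n; i++) norm += 1.0 / Math.pow(i, s);
        double acc = 0;
        for (int i = 1; i <= n; i++) {
            acc += (1.0 / Math.pow(i, s)) / norm;
            cdf[i - 1] = acc;
        }
    }

    /** Sample one rank (1-based): smallest index whose CDF covers u, by binary search. */
    int sample() {
        final double u = rng.nextDouble();
        int lo = 0, hi = cdf.length - 1;
        while (lo < hi) {
            final int mid = (lo + hi) >>> 1;
            if (cdf[mid] < u) lo = mid + 1; else hi = mid;
        }
        return lo + 1;
    }
}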
Datasets: 1998 World Cup Data
● Data in Common Log Format (CLF) from the 1998 World Cup web servers
● IP addresses replaced by unique anonymous id
● Widely used in the literature as “real-world” test data
● Around 1.4 billion entries (approximately 2 million unique visitors)
● Range from the 1st of May to the 26th of July 1998
● Highly skewed data
Results
Results: varying k
Results: varying number of peers
Results: Datasets (1998 World Cup Data)

Query: give me the top 20 visitors from 1st June to 18th June.
Data for 18 peers = daily rows from 1st June 1998 to 18th June 1998.
Sample size n = 20.

Algorithm          Data (KB)   Execution time (ms)   95% CI (ms)   Precision (%)
KLEE3              80          319.95                ±8.58         100
KLEE3-M            1,271       84.75                 ±6.5          100
Hybrid Threshold   14,306      1,921.9               ±65.28        100
TPUT               44          141.5                 ±7.36         100
Naive (baseline)   43,572      8,514.6               ±61.38        100
Implementation: Pre-aggregation

Mix and match keys to aggregate results at different granularities.

[Diagram: rows keyed by day ("2013-08-01", "2013-08-02", …) hold per-visitor
counts (e.g. 192.0.43.10 → 98, 192.0.43.11 → 234), while pre-aggregated rows
keyed by month ("2013-08", "2013-09") hold the summed counts
(e.g. 192.0.43.10 → 5398, 192.0.43.11 → 23234).]

coordinator.setKeys("1998-05",
                    "1998-06",
                    "1998-07-01",
                    "1998-07-02");
final List<Pair> topK = coordinator.getTopK(10);

The top-k results are the same, but computed over 4 peers instead of 63 peers.
Results: Pre-aggregation

            Data transfer (KB)               Execution time (ms)
Algorithm   full     aggregated   savings    full      aggregated   savings
KLEE        20,756   633          97%        2,412.2   44.3         98%
HT          14,404   5,894        59%        4,842.6   818.6        83%
TPUT        2,215    61           97%        1,657.1   162.2        90%
Conclusions
• TPUT and HT are well suited for real-time top-k queries with
minimal structural changes in the infrastructure.
• Savings of 98% (TPUT) and 77% (HT) in execution time with no
loss of precision
• Savings of 99.9% (TPUT) and 67% (HT) in data transfer also with no
loss of precision
• KLEE3 requires additional changes to the infrastructure, but:
• It is efficient to create
• The final fetch phase can be discarded for approximate results, with a configurable
trade-off between precision and data transfer / execution time
• Savings of 99% in execution time and 97% in data transfer
Conclusions
• Scalability can be addressed with good planning of data models
together with pre-aggregation
• KLEE3 is more resilient to low object correlation (the common case in
real-world data)
• TPUT and KLEE3 are resilient to high variations of k, which could have
further practical implications
Future work
Implementing KLEE4

● Intravert [1] is an application server built on top of a Cassandra node
● Based on the vert.x application framework
● Communication is done either in a RESTful way or directly with Java client
● Allows passing code (in several JVM languages such as Groovy, Clojure, etc)
which is executed at the “server side”
● Acting as middleware, it is possible to implement processing
(such as the candidate hash set) remotely and return it to our client
● TPUT and HT already implemented using Intravert
● KLEE4 in progress
[1] https://github.com/zznate/intravert-ug
Acknowledgements
Jonathan Halliday (Red Hat)
For technical expertise, supervision and support
Questions?