On Relevant Query Answering over Streaming and Distributed Data

ON RELEVANT QUERY ANSWERING OVER
STREAMING AND DISTRIBUTED DATA
Shima Zahmatkesh
Politecnico di Milano – DEIB
Data Science Group – Stream Reasoning Team
Supervisor: Prof. Emanuele Della Valle

Motivation
▪ An application that shows drivers the nearby places
where there is a high probability of finding free parking.
Query: return the best streets (around the car
that calls the service) where there are many free
parking lots and few cars looking for parking in
the last 10 minutes.
▪ Web applications need to combine data streams
with distributed data on the Web to continuously find
the best answers to users' queries.

Solving the example with web stream processing
[Figure: Stream Processing Engine. Inputs: windows over car request streams, free parking lots, and data from the Web, combined by a join. Output: relevant answers. Request/response: best streets to look for parking.]

Problem Statement
[Figure: an RDF Stream Processing (RSP) engine joins windows over RDF streams with data from a SPARQL endpoint on the Web, through a local replica, to produce relevant answers.]
▪ SPARQL endpoint: high latency, rate limits
▪ Goals: being reactive, minimizing computational resources
▪ Local replica: stale data, refresh budget, maintenance policy

Research Question
Is it possible to optimize query evaluation in order to
continuously obtain the most relevant combinations of
streaming and evolving distributed data, while
guaranteeing the reactiveness of the engine?

Related work
[Figure: three research areas: Stream Processing, Top-k Query Evaluation, and Federated Query Processing.]
▪ ACQUA: using resource replication
▪ MinTopK: optimal continuous top-k query evaluation
▪ Top-k linked data query by Wagner
▪ Top-k join queries by Ilyas
▪ Our work: previously unexplored

Scope of the state of the art
Is it possible to optimize query evaluation in order to
continuously obtain the most relevant combinations of
streaming and evolving distributed data, while
guaranteeing the reactiveness of the engine?
Features | ACQUA | MinTopK
Type of data | Streaming and distributed | Streaming
Relevancy | ✗ | ✓
Reactiveness | Refresh budget | Incremental evaluation
Handling evolving data | Local replica, maintenance policies | ✗

Scope of the research
▪ Queries that contain a FILTER clause, which filters
the data coming from the distributed dataset.
▪ Top-k queries where the scoring function involves data
that appears in both the streaming and the distributed
datasets.

QUERIES WITH A FILTER CLAUSE

Query
▪ Every minute, return the best influencers, i.e., users who
were mentioned on the social network in the last 10 minutes
and whose number of followers is greater than 100,000.
REGISTER STREAM <:Influencers> AS
CONSTRUCT {?user a :influentialUser}
WHERE {
  WINDOW :W(10m,1m) ON :S
    {?user :hasMentions ?mentionsNumber}
  SERVICE :BKG
    {?user :hasFollowers ?followersCount}
  FILTER (?followersCount > 100000)
}
Filtering Threshold
ACQUA
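The query above joins the users seen in the current window with their follower counts and keeps those above the Filtering Threshold. A minimal Python sketch of that join-and-filter step, assuming the window content and the local replica are plain dictionaries (the function name, the sample users, and their counts are hypothetical):

```python
from typing import Dict, List

FOLLOWER_THRESHOLD = 100_000  # the constant used in the FILTER clause


def influencers(window_mentions: Dict[str, int],
                replica_followers: Dict[str, int],
                threshold: int = FOLLOWER_THRESHOLD) -> List[str]:
    """Join users in the window with the replica and apply the filter."""
    return sorted(
        user for user in window_mentions
        # Users missing from the replica are treated as below the threshold.
        if replica_followers.get(user, 0) > threshold
    )


# Hypothetical window content (user -> mentions) and replica (user -> followers).
window = {"alice": 12, "bob": 3, "carol": 7}
replica = {"alice": 250_000, "bob": 90_000, "carol": 100_001}
print(influencers(window, replica))
```

If the replica holds a stale follower count for a user close to the threshold, the answer can be wrong, which is the case the maintenance policies on the following slides target.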

State of the art - ACQUA
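[Figure: ACQUA architecture. The WINDOW clause is joined (JOIN) with the Local Replica of the data behind the SERVICE clause. (1) The Proposer selects the candidate set. (2) The Ranker orders the candidates with a policy (RND, LRU, WBM) and produces the elected set. (3) The Maintainer refreshes the elected set in the Local Replica.]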
Proposed Solution – ACQUA.F
[Figure: ACQUA.F architecture. Same components as ACQUA (WINDOW clause, JOIN, Proposer, Ranker, Maintainer, Local Replica), with a SERVICE clause that has a FILTER clause.]
New policies:
✓ Filter Update Policy
✓ RND.F
✓ LRU.F
✓ WBM.F

Filter Update Policy (intuition)
▪ Computes how close the value associated with the variable of each data item is to the Filtering Threshold.
[Figure: number of followers over time for Users A, B, C, and D, compared with the Filtering Threshold at time t.]
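The intuition can be sketched as a ranking by distance to the Filtering Threshold. This is a minimal illustration that assumes items closer to the threshold are refreshed first; the follower counts for Users A to D are hypothetical and not read from the figure.

```python
from typing import Dict, List


def rank_by_threshold_distance(values: Dict[str, int], threshold: int) -> List[str]:
    """Order items by how close their cached value is to the threshold.

    Ties are broken by name so that the ordering is deterministic.
    """
    return sorted(values, key=lambda u: (abs(values[u] - threshold), u))


# Hypothetical follower counts held in the local replica.
followers = {"A": 150_000, "B": 101_000, "C": 99_500, "D": 60_000}
print(rank_by_threshold_distance(followers, threshold=100_000))
```

Items far above or far below the threshold are unlikely to change the filter outcome, so they end up last in this ordering.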

Experimental Result
[Figure: performance, from worst to best, along the experiment dimension.]
▪ For low selectivity, WBM is better than the Filter Update Policy.
▪ For high selectivity, the Filter Update Policy is better than WBM.

Combined Policies – ACQUA.F
▪ Combine the Filter Update Policy with the ACQUA ones
▪ RND.F, LRU.F, and WBM.F
[Figure: number of followers over time for Users A, B, C, and D, with a band highlighted at time t.]
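One way to read the combination is as two stages: keep only the items whose value falls inside the band, then let an ACQUA policy rank them. The sketch below assumes the band is centered on the filtering threshold and uses an LRU-style ranking as the second stage; the band width, the helper names, and the data are hypothetical.

```python
from typing import Callable, Dict, List


def select_with_band(values: Dict[str, int], threshold: int, band: int,
                     policy_rank: Callable[[List[str]], List[str]],
                     budget: int) -> List[str]:
    """Stage 1: keep items within +/- band of the threshold.
    Stage 2: rank the survivors with the given policy and apply the budget."""
    in_band = [u for u in sorted(values) if abs(values[u] - threshold) <= band]
    return policy_rank(in_band)[:budget]


# Hypothetical replica values and refresh times.
followers = {"A": 120_000, "B": 101_000, "C": 99_500, "D": 60_000}
last_refresh = {"A": 1, "B": 5, "C": 2, "D": 0}


def lru(items: List[str]) -> List[str]:
    """LRU-style ranking: least recently refreshed first."""
    return sorted(items, key=lambda u: last_refresh[u])


print(select_with_band(followers, 100_000, band=2_000, policy_rank=lru, budget=1))
```

User D was refreshed longest ago but lies outside the band, so it is never considered; the budget is spent only on items that could cross the threshold.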

Experimental Result
[Figure: performance, from worst to best, along the experiment dimension, with the annotation "Impossible in practice".]

State of the art - Rank Aggregation
▪ Fairly takes into account the opinions of different
algorithms.
▪ Combines the ranking lists by computing an aggregated score.
WBM:
User Score
Alice 0.8
Bob 0.7
David 0.4
Filter Update:
User Score
Bob 0.9
David 0.8
Alice 0.7
WBM.F+ (aggregated, α = 0.5):
User Score_agg
Bob 0.8
Alice 0.75
David 0.6
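The aggregated scores on this slide are consistent with a weighted sum of the two input scores. A minimal sketch, assuming the form α · score_WBM + (1 − α) · score_FilterUpdate (which list α weights is an assumption; with α = 0.5 both readings agree):

```python
from typing import Dict, List, Tuple


def aggregate(scores_a: Dict[str, float], scores_b: Dict[str, float],
              alpha: float) -> List[Tuple[str, float]]:
    """Weighted-sum rank aggregation over the users present in both lists."""
    users = scores_a.keys() & scores_b.keys()
    agg = {u: alpha * scores_a[u] + (1 - alpha) * scores_b[u] for u in users}
    # Highest aggregated score first; names break ties deterministically.
    return sorted(agg.items(), key=lambda kv: (-kv[1], kv[0]))


# Scores taken from the slide.
wbm = {"Alice": 0.8, "Bob": 0.7, "David": 0.4}
filter_update = {"Bob": 0.9, "David": 0.8, "Alice": 0.7}

for user, score in aggregate(wbm, filter_update, alpha=0.5):
    print(user, round(score, 2))
```

For example, Bob gets 0.5 · 0.7 + 0.5 · 0.9 = 0.8, which places him first even though he leads only one of the two input lists.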

Proposed Solution – ACQUA.F+
[Figure: ACQUA.F+ architecture. Same components as ACQUA (WINDOW clause, JOIN, Proposer, Ranker, Maintainer, Local Replica), with a SERVICE clause that has a FILTER clause.]
New policies:
✓ LRU.F+
✓ WBM.F+
✓ WBM.F*
Experimental Results
Comparable to WBM.F
Possible in practice

State of the art – MinTopK
[Figure, four animation steps: objects A to H plotted by score (0 to 8) over time (0 to 13), with the current time marked "now". Window length = 9, slide = 3, top-2 results.]
MTK lists:
W1 (current window in the first step): E, C
W2: E, C
W3: E, F
Super-MTK list:
Object Ws We
E 1 3
C 1 2
F 3 3
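The Super-MTK list stores each object once, with the first (Ws) and last (We) window in which it belongs to the top-k. The sketch below only illustrates that compact representation by deriving it from per-window top-k lists; it is not the incremental MinTopK algorithm, and it assumes the windows an object belongs to are contiguous.

```python
from typing import Dict, List, Tuple


def build_super_list(mtk_lists: Dict[int, List[str]]) -> Dict[str, Tuple[int, int]]:
    """Merge per-window top-k lists into one list of (Ws, We) marks per object."""
    spans: Dict[str, Tuple[int, int]] = {}
    for window in sorted(mtk_lists):
        for obj in mtk_lists[window]:
            # Keep the first window seen as Ws, extend We to the current one.
            start, _ = spans.get(obj, (window, window))
            spans[obj] = (start, window)
    return spans


# Per-window top-2 lists from the slide.
mtk = {1: ["E", "C"], 2: ["E", "C"], 3: ["E", "F"]}
for obj, (ws, we) in build_super_list(mtk).items():
    print(obj, ws, we)
```

Three lists of two entries collapse into three rows, which is the saving the integrated list provides as the number of overlapping windows grows.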

Top-k Query
▪ Return every 3 minutes the top-2 popular users who are
most mentioned on Social Networks in the last 9 minutes
REGISTER STREAM :TopkUsersToContact AS
SELECT ?user F(?mentionCount,?followerCount) AS ?score
FROM NAMED WINDOW :W ON :S [RANGE 9m STEP 3m]
WHERE {
  WINDOW :W {?user :hasMentions ?mentionCount}
  SERVICE :BKG {?user :hasFollowers ?followerCount}
}
ORDER BY DESC (?score)
LIMIT 2

[Figure: three plots over time (0 to 13) for objects A to G: ScoreS, ScoreR, and the resulting FinalScore.]
Final Score = F(ScoreS, ScoreR)

Contributions (1 of 2)
▪ Data structure: Super-MTK+N list
▪ Handles changes in the distributed dataset: N changes per window
▪ N additional slots
▪ MTK+N list: keeps K+N elements
▪ Complexity: O(K+N)
[Figure: MTK+N lists, each split into a K area and an N area, with the LBP marker, next to the Super-MTK+N list.]
MTK+N lists:
W1: E, G, C
W2: E, G, F
W3: G, A
Super-MTK+N list:
Object Ws We
E 1 2
G 1 3
C 1 1
F 2 2
A 3 3

Contributions (2 of 2)
▪ Algorithm:
▪ Top-k+N
▪ Window expiration
▪ New arrival of distinct data items
▪ Handle changes in distributed data
▪ AcquaTop
▪ Handle updating local replica
▪ Complexity: O(K+N)
▪ Framework: AcquaTop Framework
▪ Apply maintenance policies

Top-K+N – New object arrival
[Figure, two animation steps: objects A to G plotted by score (0 to 8) over time (0 to 13); W2 is the current window; K = 2, N = 1.]
Super-MTK+N list before the new arrival:
Object Ws We
E 2 3
C 2 2
F 2 3
A 3 4
Super-MTK+N list after the new arrival (Top-K+N):
Object Ws We
G 2 4
E 2 3
C 2 2
F 3 3
A 4 4
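For a single window, the effect of a new arrival can be sketched as an insertion into a score-ordered list capped at K+N entries. This is a simplified illustration, not the Top-K+N algorithm: it ignores the Ws/We marks and window expiration, and the class name and the scores are hypothetical.

```python
import bisect
from typing import List, Optional, Tuple


class TopKPlusN:
    """Score-ordered list that keeps at most K+N objects."""

    def __init__(self, k: int, n: int) -> None:
        self.k, self.n = k, n
        self._items: List[Tuple[float, str]] = []  # (-score, object), ascending

    def insert(self, obj: str, score: float) -> Optional[str]:
        """Insert an object; return the evicted object, if any."""
        bisect.insort(self._items, (-score, obj))
        if len(self._items) > self.k + self.n:
            return self._items.pop()[1]  # lowest score falls out of the N area
        return None

    def k_area(self) -> List[str]:
        return [obj for _, obj in self._items[: self.k]]

    def n_area(self) -> List[str]:
        return [obj for _, obj in self._items[self.k:]]


# K = 2 and N = 1 as on the slide; scores are hypothetical.
window = TopKPlusN(k=2, n=1)
for name, score in [("E", 7.0), ("C", 6.0), ("F", 4.0)]:
    window.insert(name, score)
evicted = window.insert("G", 8.0)
print(window.k_area(), window.n_area(), evicted)
```

The N extra slots keep near-miss objects at hand, so that a later change in the distributed data can promote them without rescanning the window.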

Top-K+N - Handling Changes
[Figure, two animation steps: objects A to G plotted by score (0 to 8) over time (0 to 13), window W2; the score of F changes.]
Super-MTK+N list before the change:
Object Ws We
G 2 4
E 2 3
C 2 2
F 3 3
A 4 4
Super-MTK+N list after the change of F (Top-K+N):
Object Ws We
G 2 4
F 2 3
E 2 3
C 2 2
A 4 4
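When a value in the distributed dataset changes, the affected object gets a new score and has to be repositioned in the list. The sketch below shows only that reordering step, with hypothetical scores chosen so that the order matches the slide; updating the Ws/We marks is left out.

```python
from typing import List, Tuple


def apply_change(ranked: List[Tuple[str, float]], obj: str,
                 new_score: float) -> List[Tuple[str, float]]:
    """Replace the score of one object and restore descending score order."""
    updated = [(o, new_score if o == obj else s) for o, s in ranked]
    # sorted() is stable, so objects with equal scores keep their order.
    return sorted(updated, key=lambda pair: -pair[1])


# Hypothetical scores; the initial order follows the slide (G, E, C, F, A).
before = [("G", 8.0), ("E", 7.0), ("C", 6.0), ("F", 4.0), ("A", 3.0)]
after = apply_change(before, "F", 7.5)
print([obj for obj, _ in after])
```

A full re-sort is used here for brevity; removing and re-inserting the single changed object would avoid touching the rest of the list.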

AcquaTop Framework
[Figure: AcquaTop framework. Components: RDF Stream, SPARQL endpoint, Local Replica, Super-MTK+N list, candidate set, Ranker, elected set, Maintainer. The AcquaTop algorithm builds on the Top-k+N algorithm (expiration, new arrival, remote changes).]
Maintenance policies:
✓ MTKN-T
✓ MTKN-F
✓ MTKN-A

New Maintenance Policies
▪ MTKN-T: selects objects for updating from the top of the MTKN list.
▪ MTKN-F: selects objects for updating from the border of the K and N areas in the MTKN list (half from the top of the N area and half from the bottom of the K area).
[Figure: the same MTKN list shown once per policy, with 2 items selected for updating in each.]
Object Ws We
E 2 3
G 2 4
C 2 2
F 3 3
A 4 4
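The two policies differ only in which positions of the MTKN list they pick. A minimal sketch, assuming the list is ordered by score, that K marks the end of the K area, and that an odd budget gives the extra item to the N area (the last point and the value K = 2 used in the example are assumptions):

```python
from typing import List


def select_mtkn_t(mtkn: List[str], budget: int) -> List[str]:
    """MTKN-T: take the objects at the top of the list."""
    return mtkn[:budget]


def select_mtkn_f(mtkn: List[str], k: int, budget: int) -> List[str]:
    """MTKN-F: take objects around the border between the K and N areas."""
    from_n = (budget + 1) // 2  # half from the top of the N area
    from_k = budget - from_n    # half from the bottom of the K area
    return mtkn[max(0, k - from_k):k] + mtkn[k:k + from_n]


# List order from the slide; 2 items are selected for updating.
mtkn_list = ["E", "G", "C", "F", "A"]
print(select_mtkn_t(mtkn_list, budget=2))
print(select_mtkn_f(mtkn_list, k=2, budget=2))
```

MTKN-F spends the budget where a small score change can move an object into or out of the top-k, while MTKN-T keeps the scores of the current leaders fresh.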

Experimental setting
▪ Datasets:
▪ Streaming data from Twitter: number of mentions per user
▪ Real data from the Twitter REST API: follower count of users
▪ Realistic and synthetic distributed data
▪ Query
▪ Query with FILTER clause
▪ Top-k query
▪ Scoring function: normalized weighted sum of the number
of mentions in each window and the number of changes in follower count
▪ Generate the Oracle for each query
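A normalized weighted sum of a streaming value and a distributed value can be sketched as follows. Min-max normalization, the weight `w_stream`, the value ranges, and the sample numbers are all assumptions made for illustration; the slides do not specify them.

```python
def normalize(x: float, lo: float, hi: float) -> float:
    """Min-max normalization to [0, 1]; a constant range maps to 0."""
    return 0.0 if hi == lo else (x - lo) / (hi - lo)


def final_score(stream_value: float, distributed_value: float,
                stream_range: tuple, distributed_range: tuple,
                w_stream: float = 0.5) -> float:
    """Weighted sum of the normalized streaming and distributed values."""
    score_s = normalize(stream_value, *stream_range)
    score_r = normalize(distributed_value, *distributed_range)
    return w_stream * score_s + (1 - w_stream) * score_r


# Hypothetical values: 5 mentions on a 0-10 scale, 30 on a 0-40 scale.
print(final_score(5, 30, (0, 10), (0, 40)))
```

Because both inputs are mapped to the same scale before weighting, neither the streaming nor the distributed value dominates only because of its magnitude.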

Experimental setting
▪ Baselines:
▪ WST: no changes are updated
▪ RND: randomly selects items for update
▪ MTKN-A: updates all the elements in the MTKN list
▪ Metrics:
▪ CJD: shows the correctness of the results for 2 different sets
▪ nDCG@K: shows how relevant the results are compared to the
Oracle's
▪ ACC@K: shows the accuracy of the results
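Two of these metrics can be illustrated with their usual textbook forms. The sketch assumes nDCG@K is computed with a log2 discount against relevance values taken from the Oracle, and that CJD accumulates a Jaccard distance between the result set and the Oracle's set; both readings, and the sample data, are assumptions rather than the exact definitions used in the experiments.

```python
import math
from typing import Dict, List, Set


def dcg_at_k(relevances: List[float], k: int) -> float:
    """Discounted cumulative gain over the first k positions."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(result: List[str], oracle_relevance: Dict[str, float], k: int) -> float:
    """DCG of the result divided by the DCG of the ideal (Oracle) ordering."""
    gains = [oracle_relevance.get(obj, 0.0) for obj in result]
    ideal = sorted(oracle_relevance.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """1 - |a ∩ b| / |a ∪ b|; two empty sets have distance 0."""
    union = a | b
    return 0.0 if not union else 1 - len(a & b) / len(union)


# Hypothetical Oracle relevance and two candidate results.
oracle = {"a": 2.0, "b": 1.0}
print(ndcg_at_k(["a", "b"], oracle, k=2))
print(ndcg_at_k(["b", "a"], oracle, k=2))
print(jaccard_distance({"a", "b"}, {"b", "c"}))
```

A result that matches the Oracle's ordering scores 1 on nDCG@K, and swapping the two items lowers the score even though the set of returned objects is unchanged.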

Evaluation Results – Queries with Filter
measuring | varying | Filter Update | LRU.F | WBM.F | LRU.F+ | WBM.F+ | WBM.F*
accuracy selectivity >60% ✓ ✓ <60%
accuracy budget ✓ >=2 =1 ✓ >5
sensitivity to α selectivity,α <50% ✓ ✓
sensitivity to α budget,α ✓ ✓
accuracy - ✓
accuracy S <60% ✓
sensitivity to α - ✓ ✓

Evaluation Results – Top-k Queries
measuring | varying | MTKN-T | MTKN-F
relevancy budget >3
accuracy budget ✓
relevancy CH ✓
accuracy CH =80 <=40
relevancy K ✓
accuracy K <7
relevancy N ✓
accuracy N ✓
relevancy - ✓
accuracy - ✓

Limitations and Future work
Limitations | Future work
Two classes of queries | Broaden the class of queries: N:M join relationships, multi-join operators, preference queries, …
Static refresh budget | Flexible budget allocation
Full replica | Cache and replacement strategies
Single stream of data and one query for evaluation | Distributed streams and multiple queries
Correct and complete data | Inaccurate or incomplete data

Conclusion
▪ In this work, we address the problem of relevant query
answering over streaming and distributed data.
▪ We propose maintenance policies for queries with a FILTER
clause.
▪ We propose a framework for top-k query answering and
maintenance policies that generate more relevant and
accurate results.
▪ We obtain more relevant and accurate results compared
to the state-of-the-art approaches.

Thank you!
Any questions?
Shima Zahmatkesh
shima.zahmatkesh@polimi.it
DEIB - Politecnico di Milano
