ON RELEVANT QUERY ANSWERING OVER
STREAMING AND DISTRIBUTED DATA
Shima Zahmatkesh
Politecnico di Milano – DEIB
Data Science Group – Stream Reasoning Team
Supervisor: Prof. Emanuele Della Valle
INTRODUCTION
Motivation
▪ An application that shows drivers the nearby places
where there is a high
probability of finding free parking.
Query: return the best streets (around the car
that calls the service) where there are many free
parking lots and few cars looking for parking in
the last 10 minutes.
▪ Web applications need to combine data streams
with distributed data on the Web to continuously find
the best answers to users' queries.
Solving the example with web stream processing
Web
Relevant
Answers
Join
Windows
Car request
streams
Stream Processing Engine
Request
Response
Best streets to look for parking
Free parking lots
Problem Statement
Web
Relevant
Answers
Join
SPARQL endpoint
RDF Stream Processing (RSP) Engine
Local Replica
Minimize computational resources
• High Latency
• Rate Limits
Being Reactive
• Stale Data
• Refresh Budget
• Maintenance Policy
Windows
RDF Streams
Research Question
Is it possible to optimize query evaluation in order to
continuously obtain the most relevant combinations of
streaming and evolving distributed data, while
guaranteeing the reactiveness of the engine?
Related work
Stream Processing
Top-k Query Evaluation
Federated Query Processing
ACQUA: using resource replication
MinTopK: optimal continuous top-k query evaluation
• Top-k linked data query by Wagner
• Top-k join queries by Ilyas
Our work: previously unexplored
APPROACH
Scope of the state of the art
Is it possible to optimize query evaluation in order to
continuously obtain the most relevant combinations of
streaming and evolving distributed data, while
guaranteeing the reactiveness of the engine?
Features                ACQUA                                MinTopK
Type of data            Streaming and distributed            Streaming
Relevancy               ✗                                    ✓
Reactiveness            Refresh budget                       Incremental evaluation
Handling evolving data  Local replica, maintenance policies  ✗
Scope of the research
▪ Queries that contain a FILTER clause and have to filter
the data coming from the distributed dataset.
▪ Top-k queries where the scoring function involves data
that appear in both the streaming and the distributed
datasets.
QUERIES WITH A FILTER
CLAUSE
Query
▪ Every minute, give me the best influencers, i.e., users who
were mentioned on a social network in the last 10 minutes and
whose number of followers is greater than 100,000.
REGISTER STREAM <:Influencers> AS
CONSTRUCT {?user a :influentialUser}
WHERE {
WINDOW :W(10m,1m) ON :S
{?user :hasMentions ?mentionsNumber}
SERVICE :BKG
{?user :hasFollowers ?followersCount}
FILTER (?followersCount > 100000)
}
Filtering Threshold
ACQUA
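To make the semantics of the query concrete, here is a minimal Python sketch of one evaluation step. It assumes the window content and the local replica are plain dictionaries; the function name `influential_users` and the sample data are hypothetical and do not come from the slides.

```python
from typing import Dict, List

# Threshold taken from the FILTER clause: ?followersCount > 100000
FILTERING_THRESHOLD = 100_000


def influential_users(window_mentions: Dict[str, int],
                      local_replica: Dict[str, int],
                      threshold: int = FILTERING_THRESHOLD) -> List[str]:
    """Join the users seen in the current window with the follower
    counts stored in the local replica, then apply the filter."""
    return sorted(
        user
        for user in window_mentions
        # Assumption: users missing from the replica never pass the filter.
        if local_replica.get(user, 0) > threshold
    )


# Hypothetical data: mentions in the last 10 minutes and replica values.
window = {"alice": 12, "bob": 3, "carol": 7}
replica = {"alice": 250_000, "bob": 90_000, "carol": 100_000}
print(influential_users(window, replica))  # ['alice']
```

If the replica values are stale, a user close to the threshold may be wrongly included or excluded, which is why the choice of which replica entries to refresh matters.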
State of the art - ACQUA
WINDOW clause
JOIN
Local Replica
Candidate set
Elected set
RND
LRU
WBM
SERVICE clause
1 Proposer
2 Ranker
3 Maintainer
Proposed Solution – ACQUA.F
WINDOW clause
JOIN
Proposer
Ranker
Maintainer
Local Replica
SERVICE clause with FILTER clause
✓ Filter Update Policy
✓ RND.F
✓ LRU.F
✓ WBM.F
Filter Update Policy (intuition)
[Figure: number of followers over time for User A, User B, User C, and User D, with the Filtering Threshold and the current time t marked]
▪ Computes how close the value bound to the filtered
variable of each data item is to the Filtering Threshold.
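The intuition can be sketched in a few lines of Python: items whose replica value is closest to the Filtering Threshold are the most likely to cross it, so they are refreshed first. The function name `filter_update_rank`, the use of absolute distance, and the sample values are assumptions made for illustration.

```python
from typing import Dict, List


def filter_update_rank(replica: Dict[str, int],
                       threshold: int,
                       budget: int) -> List[str]:
    """Return up to `budget` items ordered by how close their replica
    value is to the threshold (closest first)."""
    # Assumption: closeness is the absolute difference; ties are broken
    # by item name so the result is deterministic.
    ranked = sorted(replica, key=lambda item: (abs(replica[item] - threshold), item))
    return ranked[:budget]


# Hypothetical follower counts in the local replica.
replica = {"A": 150_000, "B": 101_000, "C": 99_500, "D": 40_000}
print(filter_update_rank(replica, threshold=100_000, budget=2))  # ['C', 'B']
```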
Experimental Result
[Figure: performance (worst to best) across the experiment dimension]
▪ For low selectivity, WBM is better than the Filter Update Policy.
▪ For high selectivity, the Filter Update Policy is better than WBM.
Combined Policies – ACQUA.F
[Figure: number of followers over time for User A, User B, User C, and User D, with a Band and the current time t marked]
▪ Combine Filter Update Policy with ACQUA ones
▪ RND.F, LRU.F, and WBM.F
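One possible reading of the combined policies, sketched here for LRU.F only: the band first restricts the candidates to items whose replica value lies near the Filtering Threshold, and the ACQUA policy (least recently refreshed first) then ranks what is left. This two-step ordering, the symmetric band, and all names and values below are assumptions, not details given on the slide.

```python
from typing import Dict, List


def lru_f(replica: Dict[str, int],
          last_refreshed: Dict[str, int],
          threshold: int,
          band: int,
          budget: int) -> List[str]:
    """Select up to `budget` items for refresh: keep only items inside
    the band around the threshold, then prefer the least recently
    refreshed ones."""
    # Step 1 (assumed): band filter around the Filtering Threshold.
    in_band = [item for item, value in replica.items()
               if abs(value - threshold) <= band]
    # Step 2 (assumed): LRU ordering, oldest refresh time first.
    in_band.sort(key=lambda item: (last_refreshed[item], item))
    return in_band[:budget]


# Hypothetical replica values and last refresh times.
replica = {"A": 150_000, "B": 101_000, "C": 99_500, "D": 40_000, "E": 103_000}
refreshed_at = {"A": 1, "B": 5, "C": 2, "D": 0, "E": 9}
print(lru_f(replica, refreshed_at, threshold=100_000, band=5_000, budget=2))  # ['C', 'B']
```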
Experimental Result
[Figure: performance (worst to best) across the experiment dimension]
Impossible in practice
State of the art - Rank Aggregation
▪ Takes the opinions of different algorithms into account
fairly.
▪ Combines the ranking lists by computing an aggregated score.
WBM             Filter Update    WBM.F+ (α = 0.5)
User   Score    User   Score     User   Score_agg
Alice  0.8      Bob    0.9       Bob    0.8
Bob    0.7      David  0.8       Alice  0.75
David  0.4      Alice  0.7       David  0.6
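The aggregated scores in the example are consistent with a weighted sum of the two input scores with α = 0.5. The Python sketch below reproduces them under that assumption; the function name `aggregate` and the choice of applying α to the first list are illustrative (with α = 0.5 both choices give the same result).

```python
from typing import Dict, List, Tuple


def aggregate(scores_a: Dict[str, float],
              scores_b: Dict[str, float],
              alpha: float) -> List[Tuple[str, float]]:
    """Combine two score lists as alpha * a + (1 - alpha) * b and
    return the users sorted by aggregated score, best first."""
    users = scores_a.keys() & scores_b.keys()
    agg = {
        # Rounded to 6 decimals only to hide floating-point noise.
        user: round(alpha * scores_a[user] + (1 - alpha) * scores_b[user], 6)
        for user in users
    }
    return sorted(agg.items(), key=lambda kv: (-kv[1], kv[0]))


wbm = {"Alice": 0.8, "Bob": 0.7, "David": 0.4}
filter_update = {"Bob": 0.9, "David": 0.8, "Alice": 0.7}
print(aggregate(wbm, filter_update, alpha=0.5))
# [('Bob', 0.8), ('Alice', 0.75), ('David', 0.6)]
```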
Proposed Solution – ACQUA.F+
WINDOW clause
JOIN
Proposer
Ranker
Maintainer
Local Replica
SERVICE clause with FILTER clause
✓ LRU.F+
✓ WBM.F+
✓ WBM.F*
Experimental Results
Comparable to WBM.F
Possible in practice
TOP-K QUERIES
State of the art – MinTopK
[Figure: object scores over time for objects A to H; window length = 9, slide = 3; windows W1, W2, and W3 are shown in turn as the current window]
MTK lists (top-2 results per window):
W1: E, C
W2: E, C
W3: E, F
Super-MTK list:
Object  Ws  We
E       1   3
C       1   2
F       3   3
Top-k Query
▪ Every 3 minutes, return the top-2 popular users who are
most mentioned on social networks in the last 9 minutes.
REGISTER STREAM :TopkUsersToContact AS
SELECT ?user (F(?mentionCount, ?followerCount) AS ?score)
FROM NAMED WINDOW :W ON :S [RANGE 9m STEP 3m]
WHERE {
WINDOW :W {?user :hasMentions ?mentionCount}
SERVICE :BKG {?user :hasFollowers ?followerCount}
}
ORDER BY DESC (?score)
LIMIT 2
[Figure: three plots over time for objects A to G: ScoreS, ScoreR, and the resulting Final Score]
Final Score = F(ScoreS, ScoreR)
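The experimental setting describes F as a normalized weighted summation, so one way to picture the final score is the Python sketch below. Min-max normalization, the weight value, and the sample data are assumptions chosen for illustration; the slides do not fix them.

```python
from typing import Dict, List


def normalize(value: float, lo: float, hi: float) -> float:
    """Min-max normalization to [0, 1] (assumed normalization scheme)."""
    return 0.0 if hi == lo else (value - lo) / (hi - lo)


def final_score(score_s: float, score_r: float, weight_s: float = 0.5) -> float:
    """Weighted sum of the streaming score and the score computed from
    the distributed data; weight_s is a hypothetical parameter."""
    return weight_s * score_s + (1.0 - weight_s) * score_r


def top_k(raw_s: Dict[str, float], raw_r: Dict[str, float],
          k: int, weight_s: float) -> List[str]:
    s_lo, s_hi = min(raw_s.values()), max(raw_s.values())
    r_lo, r_hi = min(raw_r.values()), max(raw_r.values())
    scores = {
        obj: final_score(normalize(raw_s[obj], s_lo, s_hi),
                         normalize(raw_r[obj], r_lo, r_hi),
                         weight_s)
        for obj in raw_s
    }
    return sorted(scores, key=lambda obj: (-scores[obj], obj))[:k]


# Hypothetical raw values for three objects.
mentions = {"A": 8, "B": 2, "C": 5}
followers = {"A": 1000, "B": 9000, "C": 5000}
print(top_k(mentions, followers, k=2, weight_s=0.75))  # ['A', 'C']
```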
Contributions (1 of 2)
▪ Data structure: Super-MTK+N List
▪ to handle changes in distributed dataset : N changes per window
▪ N additional slots
▪ MTK+N list : Keep K+N elements
▪ Complexity: O(K+N)
[Figure: MTK+N lists for windows W1, W2, and W3, each divided into a K area and an N area, with the LBP marked]
Super-MTK+N List:
Object  Ws  We
E       1   2
G       1   3
C       1   1
F       2   2
A       3   3
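As a rough illustration of the "keep K+N elements" idea, the following Python sketch maintains a single-window list that is cut off after K+N entries and exposes the K area and the N area. It is a simplified stand-in, not the Super-MTK+N list itself (it tracks no Ws/We marks), and the class name `MTKNList` and the scores are hypothetical.

```python
import bisect
from typing import List, Tuple


class MTKNList:
    """Score-ordered list that keeps at most K+N objects."""

    def __init__(self, k: int, n: int) -> None:
        self.k = k
        self.n = n
        # Entries are (-score, object_id) so the best score sorts first.
        self._items: List[Tuple[float, str]] = []

    def insert(self, obj: str, score: float) -> None:
        bisect.insort(self._items, (-score, obj))
        if len(self._items) > self.k + self.n:
            self._items.pop()  # drop the lowest-scored entry

    def k_area(self) -> List[str]:
        return [obj for _, obj in self._items[:self.k]]

    def n_area(self) -> List[str]:
        return [obj for _, obj in self._items[self.k:]]


# Hypothetical scores with K = 2 and N = 1.
mtkn = MTKNList(k=2, n=1)
for obj, score in [("E", 7), ("C", 6), ("F", 4), ("A", 3), ("G", 8)]:
    mtkn.insert(obj, score)
print(mtkn.k_area(), mtkn.n_area())  # ['G', 'E'] ['C']
```

The N extra slots are what let an object just outside the top-k move up when its value in the distributed dataset changes.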
Contributions (2 of 2)
▪ Algorithm:
▪ Top-k+N
▪ Window expiration
▪ New arrival of distinct data items
▪ Handle changes in distributed data
▪ AcquaTop
▪ Handle updating local replica
▪ Complexity: O(K+N)
▪ Framework: AcquaTop Framework
▪ Apply maintenance policies
Top-K+N – New object arrival
[Figure: object scores over time; W2 is the current window; K = 2, N = 1]
List before the new arrival:
Object  Ws  We
E       2   3
C       2   2
F       2   3
A       3   4
List after applying Top-K+N to the new object G:
Object  Ws  We
G       2   4
E       2   3
C       2   2
F       3   3
A       4   4
Top-K+N - Handling Changes
[Figure: object scores over time in window W2; the score of object F changes]
List before the change:
Object  Ws  We
G       2   4
E       2   3
C       2   2
F       3   3
A       4   4
List after applying Top-K+N to the change in F:
Object  Ws  We
G       2   4
F       2   3
E       2   3
C       2   2
A       4   4
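The reordering in this example can be mimicked with a small Python sketch: when the score of an object changes, its old entry is removed and the object is re-inserted at the position its new score dictates. The numeric scores below are hypothetical (the slide only shows the resulting order), and the sketch ignores the Ws/We bookkeeping.

```python
from typing import List, Tuple

Entry = Tuple[str, float]  # (object_id, score)


def apply_change(ranking: List[Entry], obj: str, new_score: float) -> List[Entry]:
    """Return a new ranking in which `obj` carries `new_score` and the
    list is again ordered by descending score."""
    updated = [(o, s) for o, s in ranking if o != obj]
    updated.append((obj, new_score))
    updated.sort(key=lambda entry: (-entry[1], entry[0]))
    return updated


# Hypothetical scores that match the order shown before the change.
before = [("G", 8.0), ("E", 7.0), ("C", 6.0), ("F", 4.0), ("A", 3.0)]
after = apply_change(before, "F", 7.5)
print([obj for obj, _ in after])  # ['G', 'F', 'E', 'C', 'A']
```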
AcquaTop Framework
RDF Stream
Ranker
Maintainer
SPARQL endpoint
Elected set
Candidate set
Local Replica
✓ MTKN-T
✓ MTKN-F
✓ MTKN-A
Super-MTK+N List
AcquaTop Algorithm
Top-k+N Algorithm
Expiration
New Arrival
Remote Changes
New Maintenance Policies
▪ MTKN-T: Select objects
from top of the MTKN list
for updating
▪ MTKN-F: Select objects for
updating from the border
of K and N areas in MTKN
list (half from top N area,
and half from bottom K
area)
MTKN list used to illustrate both policies (2 items are selected for updating in each case):
Object  Ws  We
E       2   3
G       2   4
C       2   2
F       3   3
A       4   4
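The two selection rules can be sketched in Python as follows, assuming the MTKN list is given as a sequence ordered from best to worst and that K marks the border between the K area and the N area. The function names and the handling of an odd budget are assumptions.

```python
from typing import List


def mtkn_t(mtkn: List[str], budget: int) -> List[str]:
    """MTKN-T: take the objects from the top of the list."""
    return mtkn[:budget]


def mtkn_f(mtkn: List[str], k: int, budget: int) -> List[str]:
    """MTKN-F: take objects around the border between the K area
    (positions 0..k-1) and the N area (positions k onward)."""
    from_n = budget // 2       # half from the top of the N area
    from_k = budget - from_n   # assumption: the K area gets the extra item
    k_part = mtkn[max(0, k - from_k):k]
    n_part = mtkn[k:k + from_n]
    return k_part + n_part


# Object order taken from the slide; K = 2 is assumed here.
mtkn_list = ["E", "G", "C", "F", "A"]
print(mtkn_t(mtkn_list, budget=2))       # ['E', 'G']
print(mtkn_f(mtkn_list, k=2, budget=2))  # ['G', 'C']
```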
EXPERIMENTAL
EVALUATION
Experimental setting
▪ Datasets:
▪ Streaming data from Twitter: number of mentions per user
▪ Real data from the Twitter REST API: follower count of users
▪ Realistic and synthetic distributed data
▪ Queries:
▪ Query with FILTER clause
▪ Top-k query
▪ Scoring function: normalized weighted summation of the number
of mentions in each window and the number of changes in follower count
▪ Generate the Oracle for each query
Experimental setting
▪ Baselines:
▪ WST: the local replica is never updated
▪ RND: randomly selects items for update
▪ MTKN-A: updates all the elements in the MTKN list
▪ Metrics:
▪ CJD: shows the correctness of the results for 2 different sets
▪ nDCG@K: shows how relevant the results are compared to the
Oracle's results
▪ ACC@K: shows the accuracy of the results
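For readers unfamiliar with nDCG@K, here is a minimal Python sketch using the common formulation with a log2 position discount. How the relevance grades are derived from the Oracle is an assumption here (a hypothetical mapping from object to grade), and the slides do not state which nDCG variant was used.

```python
import math
from typing import Dict, List


def dcg_at_k(relevances: List[float], k: int) -> float:
    """Discounted cumulative gain over the first k positions."""
    return sum(rel / math.log2(pos + 2)
               for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(result: List[str],
              oracle_relevance: Dict[str, float],
              k: int) -> float:
    """DCG of the returned ranking divided by the DCG of the ideal
    ranking built from the Oracle's relevance grades."""
    gains = [oracle_relevance.get(obj, 0.0) for obj in result]
    ideal = sorted(oracle_relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return 0.0 if ideal_dcg == 0 else dcg_at_k(gains, k) / ideal_dcg


# Hypothetical relevance grades derived from an Oracle ranking.
oracle = {"E": 3.0, "G": 2.0, "C": 1.0}
print(ndcg_at_k(["E", "G"], oracle, k=2))  # 1.0
print(ndcg_at_k(["C", "G"], oracle, k=2) < 1.0)  # True
```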
Evaluation Results – Queries with Filter
Columns: measuring, varying, Filter Update, LRU.F, WBM.F, LRU.F+, WBM.F+, WBM.F*
accuracy selectivity >60% ✓ ✓ <60%
accuracy budget ✓ >=2 =1 ✓ >5
sensitivity to α selectivity,α <50% ✓ ✓
sensitivity to α budget,α ✓ ✓
accuracy - ✓
accuracy S <60% ✓
sensitivity to α - ✓ ✓
Evaluation Results – Top-k Queries
Columns: measuring, varying, MTKN-T, MTKN-F
relevancy budget >3
accuracy budget ✓
relevancy CH ✓
accuracy CH =80 <=40
relevancy K ✓
accuracy K <7
relevancy N ✓
accuracy N ✓
relevancy - ✓
accuracy - ✓
CONCLUSION
Limitations and Future work
Limitations                                         Future work
Two classes of queries                              Broaden the class of queries: N:M join relationship, multi-join operators, preference queries, …
Static refresh budget                               Flexible budget allocation
Full replica                                        Cache and replacement strategies
Single stream of data and one query for evaluation  Distributed streams and multiple queries
Correct and complete data                           Inaccurate or incomplete data
Conclusion
▪ In this work, we address the problem of relevant query
answering over streaming and distributed data.
▪ We proposed maintenance policies for queries with a FILTER
clause.
▪ We proposed a framework for top-k query answering and
maintenance policies that generate more relevant and
accurate results.
▪ We obtain more relevant and accurate results compared
to the state-of-the-art approaches.
Thank you!

Any questions?
On Relevant Query Answering over
Streaming and Distributed Data
Shima Zahmatkesh
shima.zahmatkesh@polimi.it
DEIB - Politecnico di Milano
