[ARM 15 | ACM/IFIP/USENIX Middleware 2015] Research Paper Presentation

1
An Efficient Incremental Indexing Mechanism for Extracting Top-k Representative Queries over Continuous Data Streams
Y.S. Horawalavithana, D.N. Ranasinghe
University of Colombo School of Computing, Sri Lanka
Adaptive and Reflective Middleware (ARM)
ACM/IFIP/USENIX Middleware
Vancouver, BC, Canada
December 08, 2015

2
Overview
• Motivation
• Adaptive Diversification
• Incremental Top-k
• Evaluation
• Conclusion
• Future work

4
Diversity: Top-k representative set
Representative Top-k: a method to retrieve the Top-k publications from the matching publications
• Drawback (without diversity)
• What we want (with diversity)

5
Minimum independent-dominating set
[Figure: publications 𝑝1 to 𝑝5 in the publication space with distance threshold 𝛼, and the corresponding graph model with vertices 𝑣1 to 𝑣5. Four example vertex subsets are shown: three are independent and dominating, one is dominating but not independent.]
$\mathrm{Neighborhood}(p_i) = \{\, p_j \mid d(p_i, p_j) \le \alpha,\ p_i \ne p_j \,\}$

6
NAÏVE Greedy
$\arg\max_{p_i} \dfrac{r(p_i)^2}{\sum_{p_j \in N(p_i)} r(p_j) \times d(p_i, p_j)}$
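A minimal sketch of a greedy independent-dominating-set selection, assuming publications are sets of categorical attributes compared by Jaccard distance. It deliberately uses a simplified scoring rule (highest relevance first) rather than the slide's argmax, and all function names are hypothetical.

```python
from itertools import combinations

def jaccard_distance(a, b):
    """Jaccard distance between two sets (0 = identical, 1 = disjoint)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def neighborhoods(items, alpha):
    """Map each item id to the ids of the other items within distance alpha."""
    nbrs = {pid: set() for pid in items}
    for p, q in combinations(items, 2):
        if jaccard_distance(items[p], items[q]) <= alpha:
            nbrs[p].add(q)
            nbrs[q].add(p)
    return nbrs

def greedy_independent_dominating_set(items, relevance, alpha):
    """Assumed rule: pick the most relevant uncovered item, drop its neighbours."""
    nbrs = neighborhoods(items, alpha)
    uncovered = set(items)
    selected = []
    while uncovered:
        # Ties on relevance are broken by id so the result is deterministic.
        best = max(uncovered, key=lambda pid: (relevance[pid], pid))
        selected.append(best)
        uncovered -= nbrs[best] | {best}
    return selected

items = {
    "p1": {"a", "b", "c"},
    "p2": {"a", "b", "d"},
    "p3": {"x", "y"},
    "p4": {"x", "y", "z"},
    "p5": {"m"},
}
relevance = {"p1": 0.9, "p2": 0.8, "p3": 0.4, "p4": 0.7, "p5": 0.1}
selected = greedy_independent_dominating_set(items, relevance, alpha=0.6)
```

Because every chosen item removes its whole neighborhood from the candidate pool, no two selected items are neighbors (independence) and every unselected item has a selected neighbor (dominance). The pairwise neighborhood computation is the cost the incremental index aims to avoid.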

7
Handling streaming publications
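[Figure: the publication space 𝑝1 to 𝑝5 with distance threshold 𝛼 and its graph model 𝑣1 to 𝑣5, shown again after a new publication 𝑝6 (vertex 𝑣6) arrives.]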
Continuity Requirements
1. Durability
An item selected as diversified in the 𝑖th window may remain selected in the (𝑖+1)th window if it has not expired and no other valid item in the (𝑖+1)th window outcompetes it.
2. Order
The publication stream follows chronological order.
Once an item i has been selected, we avoid later selecting as diverse any item j that is at least as old as i.

8
Adaptive Diversification
[Figure: the matching publication stream 𝑃1, 𝑃2, 𝑃3, 𝑃4, .., 𝑃𝑗, 𝑃𝑗+1, .. with the 𝑖th window and the (𝑖+1)th window, their diverse sets 𝑆𝑖* and 𝑆𝑖+1*, and the four requirements: independence, dominance, durability and order.]
• Straightforward solution: apply the naïve greedy method at each instance
• Proposed: an incremental index mechanism that avoids the curse of re-calculating neighborhoods

9
Locality Sensitive Hashing (LSH)
• Simple idea: if two points are close together, then after a "projection" operation these two points will remain close together

10
LSH in Adaptive Diversification:
Publications as categorical data

11
Characteristic Matrix

12
Minhashing
• No publications any more: each one is represented by a signature
• Technique
  • Randomly permute the rows of the characteristic matrix m times
  • For each permutation, take the number of the first row, in the permuted order, in which the publication's column has a 1
• Advantage: reduces the dimensions to a small minhash signature
[Figure: first permutation of the rows of the characteristic matrix]
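The permutation technique can be sketched as follows, assuming each publication's column is given as the set of row indices that hold a 1 (and is non-empty). The function name and the seeded `random.Random` are illustrative choices, not taken from the paper.

```python
import random

def minhash_by_permutation(columns, n_rows, m, seed=0):
    """columns: dict id -> non-empty set of row indices holding a 1.

    Returns dict id -> list of m minhash values, one per random permutation.
    """
    rng = random.Random(seed)
    signatures = {pid: [] for pid in columns}
    rows = list(range(n_rows))
    for _ in range(m):
        rng.shuffle(rows)  # rows[pos] is the original row placed at position pos
        position = {row: pos for pos, row in enumerate(rows)}
        for pid, ones in columns.items():
            # First position, in permuted order, where this column has a 1.
            signatures[pid].append(min(position[r] for r in ones))
    return signatures

columns = {"A": {0, 2, 5}, "B": {0, 2, 5}, "C": {1, 3}}
sigs = minhash_by_permutation(columns, n_rows=6, m=8)
```

Identical columns always receive identical signatures, and disjoint columns never agree in any component, since their first 1 must fall on different rows.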

13
Signature Matrix
Fast-minhashing
• Select m random hash functions to model the effect of m random permutations
• Mathematically proved only when the number of rows is a prime
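A hedged sketch of the hash-function variant, assuming the common family h(x) = (a·x + b) mod p with p prime and a ≠ 0; the slides do not state which family is used, and the helper names are hypothetical. With a prime modulus, each such function is a true permutation of the row indices 0..p-1, which is why the prime condition matters.

```python
import random

def next_prime(n):
    """Smallest prime >= n (trial division; adequate for small n)."""
    def is_prime(k):
        if k < 2:
            return False
        i = 2
        while i * i <= k:
            if k % i == 0:
                return False
            i += 1
        return True
    while not is_prime(n):
        n += 1
    return n

def make_hash_functions(m, prime, seed=0):
    """m random (a, b) pairs with a != 0, defining h(x) = (a*x + b) % prime."""
    rng = random.Random(seed)
    return [(rng.randrange(1, prime), rng.randrange(0, prime)) for _ in range(m)]

def fast_minhash(ones, hash_params, prime):
    """Signature of one column; ones = non-empty set of row indices holding a 1."""
    return [min((a * r + b) % prime for r in ones) for a, b in hash_params]

prime = next_prime(10)            # pad 10 rows up to the next prime
params = make_hash_functions(m=6, prime=prime)
sig_a = fast_minhash({0, 2, 5}, params, prime)
sig_b = fast_minhash({0, 2, 5}, params, prime)
```

No row permutation is materialised: each signature component costs one pass over the column's 1-rows.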

14
LSH Buckets
• Take r-sized signature vectors from the m-sized minhash signature
• Map them into L hash tables, each with an arbitrary number b of buckets
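One way to sketch the bucket mapping, assuming the m-sized signature is cut into consecutive r-sized vectors so that L = m // r, and that each vector is reduced to a bucket number with Python's built-in `hash` modulo b. Both choices are assumptions for illustration.

```python
from collections import defaultdict

def band_keys(signature, r):
    """Split a signature into consecutive, non-overlapping r-sized vectors."""
    return [tuple(signature[i:i + r]) for i in range(0, len(signature) - r + 1, r)]

def build_tables(signatures, r, b):
    """signatures: dict id -> list of m ints. Returns L tables: bucket -> ids."""
    m = len(next(iter(signatures.values())))
    L = m // r
    tables = [defaultdict(list) for _ in range(L)]
    for pid, sig in signatures.items():
        for t, key in enumerate(band_keys(sig, r)):
            tables[t][hash(key) % b].append(pid)
    return tables

signatures = {"A": [1, 2, 3, 4], "B": [1, 2, 9, 9], "C": [7, 7, 3, 4]}
tables = build_tables(signatures, r=2, b=16)
```

Here A and B share a bucket in the first table (same first vector) and A and C share one in the second, so each pair becomes a candidate neighbor without any pairwise distance computation.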

15
Batch-wise Top-k computation
• Bucket "winner": the publication with the highest relevancy score in the bucket
• The winner is dominant and represents its bucket neighborhood
• Top-k: the "winners" that have a majority of votes
• The k winners are independent
[Figure: publication stream 𝑃𝐴, 𝑃𝐵, 𝑃𝐶, 𝑃𝐷, 𝑃𝐸, 𝑃𝐹, 𝑃𝐺, 𝑃𝐻, .. with the 𝑖th window marked]
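The winner-and-vote idea might look like the sketch below, assuming each hash table is a mapping from bucket number to the ids it holds, and that a "vote" is one bucket won. The tie-breaking rules are assumptions, and independence of the k winners is not checked explicitly here.

```python
from collections import Counter

def bucket_winners(tables, relevance):
    """For every non-empty bucket, the member with the highest relevance."""
    winners = []
    for table in tables:
        for members in table.values():
            if members:
                winners.append(max(members, key=lambda pid: (relevance[pid], pid)))
    return winners

def batch_top_k(tables, relevance, k):
    """Rank winners by buckets won, then by relevance, then by id."""
    votes = Counter(bucket_winners(tables, relevance))
    ranked = sorted(votes, key=lambda pid: (-votes[pid], -relevance[pid], pid))
    return ranked[:k]

tables = [
    {0: ["A", "B"], 1: ["C"]},
    {0: ["A", "C"], 1: ["B"]},
]
relevance = {"A": 0.9, "B": 0.5, "C": 0.7}
top2 = batch_top_k(tables, relevance, k=2)
```

In this toy window A wins two buckets, while B and C win one each, so A is ranked first and C beats B on relevance.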

16
LSH in Dynamic Diversification:
Incremental Top-k computation
1. A new publication 𝑖 arrives
2. Update the 𝑖th characteristic vector (characteristic matrix)
3. Generate the 𝑖th minhash signature (signature matrix)
4. Map the 𝑖th signature into the L hash tables
5. Update the "winner" of each bucket the 𝑖th signature maps into
6. Vote for the Top-k candidates
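The incremental pipeline can be sketched as a small class that keeps one winner per bucket and touches only the buckets a new signature maps into. The class and its methods are hypothetical, the bucket mapping assumes consecutive r-sized vectors hashed modulo b, and window expiry (durability and order) is left out for brevity.

```python
from collections import Counter

class IncrementalTopK:
    """Keeps one winner per (table, bucket) and a vote count per winner."""

    def __init__(self, r, b):
        self.r, self.b = r, b
        self.winner = {}         # (table index, bucket number) -> winner id
        self.relevance = {}      # id -> relevance score
        self.votes = Counter()   # id -> number of buckets currently won

    def add(self, pid, signature, relevance):
        """Insert one publication; only the buckets it maps into are updated."""
        self.relevance[pid] = relevance
        touched = []
        starts = range(0, len(signature) - self.r + 1, self.r)
        for t, start in enumerate(starts):
            key = (t, hash(tuple(signature[start:start + self.r])) % self.b)
            touched.append(key)
            current = self.winner.get(key)
            # The incumbent keeps the bucket unless strictly beaten.
            if current is None or relevance > self.relevance[current]:
                if current is not None:
                    self.votes[current] -= 1
                self.winner[key] = pid
                self.votes[pid] += 1
        return touched

    def top_k(self, k):
        live = [p for p, v in self.votes.items() if v > 0]
        return sorted(live, key=lambda p: (-self.votes[p], -self.relevance[p], p))[:k]

index = IncrementalTopK(r=2, b=16)
index.add("A", [1, 2, 3, 4], relevance=0.5)
touched = index.add("B", [1, 2, 9, 9], relevance=0.9)
```

Adding B re-examines only two buckets; neighborhoods of the other publications are never recomputed.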

17
LSH in Dynamic Diversification:
When new publication F arrives…
• Only buckets 𝐵13, 𝐵23, 𝐵32, 𝐵43 will vote
• Follow the continuity requirements
  • Durability
  • Order
[Figure: publication stream 𝑃𝐴, 𝑃𝐵, 𝑃𝐶, 𝑃𝐷, 𝑃𝐸, 𝑃𝐹, 𝑃𝐺, 𝑃𝐻, .. with the 𝑖th and (𝑖+1)th windows marked]


18
Analysis
For two vectors x, y:
$JD(x, y) = 1 - JSIM(x, y)$, where $JSIM(x, y) = \dfrac{|x \cap y|}{|x \cup y|}$
• For publications x and y: $JSIM(x, y) \propto \mathrm{Prob}(H(x) = H(y))$
• At a particular hash table
  • x and y map into the same bucket: $JSIM(x, y)^b$
  • x and y do not map into the same bucket: $1 - JSIM(x, y)^b$
• At L hash tables
  • x and y do not map into the same bucket in any table: $(1 - JSIM(x, y)^b)^L$
  • x and y map into the same bucket in at least one table: $1 - (1 - JSIM(x, y)^b)^L$
True near neighbors are unlikely to be unlucky in all the projections
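To see how sharply the last expression separates similar from dissimilar pairs, the short sketch below evaluates it numerically. The exponent b and table count L follow the slide's symbols; the values 4 and 5 are arbitrary illustrative choices.

```python
def same_bucket_probability(jsim, b, L):
    """P(x and y share a bucket in at least one of L tables)."""
    return 1.0 - (1.0 - jsim ** b) ** L

# A similar pair (JSIM = 0.8) versus a dissimilar pair (JSIM = 0.2).
p_near = same_bucket_probability(0.8, b=4, L=5)
p_far = same_bucket_probability(0.2, b=4, L=5)
```

With these settings the similar pair meets in some table with probability about 0.93, while the dissimilar pair does so with probability below 0.01.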

19
Evaluation: Dataset
Amazon online marketplace data, as available on 17th to 19th November 2014
• Publication stream
• Zipfian subscriptions
• Normalized preferences
$\mathrm{zipf}(k; s, N) = \dfrac{1/k^s}{\sum_{n=1}^{N} (1/n^s)}$
N: number of elements in the distribution
k: rank of the element
s: value of the exponent
$\text{Total number of subscriber views} = \sum_{i=2}^{32} \left( {}^{48}C_i + {}^{42}C_i + {}^{54}C_i + {}^{66}C_i + {}^{57}C_i + {}^{67}C_i \right)$
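A small sketch of the Zipf probability used for the subscriptions, written directly from the formula's symbols (k, s, N). The function name is hypothetical and the parameter values in the example are arbitrary.

```python
def zipf_pmf(k, s, N):
    """Probability of the element of rank k among N elements, exponent s."""
    normaliser = sum(1.0 / n ** s for n in range(1, N + 1))
    return (1.0 / k ** s) / normaliser

# Probabilities over all ranks for a tiny distribution.
probabilities = [zipf_pmf(k, s=1.0, N=3) for k in range(1, 4)]
```

The probabilities sum to 1 and decrease with rank, so a few popular subscriptions account for most of the mass.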

20
Terminology
ILSH (incremental LSH), BLSH (batch-wise LSH) and NAÏVE (naïve greedy)
[Figure: publication stream 𝑃1, 𝑃2, 𝑃3, 𝑃4, 𝑃5, 𝑃6, 𝑃7, 𝑃8, .. with four spans labeled "BLSH or NAÏVE" and one span labeled "ILSH"]

21
Accuracy:
ILSH vs. NAÏVE
[Figure: probability that ILSH produces the optimal diverse set of results, plotted against the Jaccard similarity threshold (s)]

22
Performance & Efficiency:
ILSH vs. BLSH vs. NAÏVE
[Figure: log(Top-k matching time) versus number of publications, with D = 500]

23
Conclusions
• Locality Sensitive Hashing (LSH) indexing method
  • Produces a diverse set of results at 70% average accuracy relative to the NAÏVE method
  • Reduces the matching time very significantly compared with the NAÏVE method
• Further refined by its incremental version
  • Handles streaming publications
  • Avoids the curse of re-computing neighborhoods
• Top-k restricts delivery to the top publications
  • Given a window size and a delivery method
• The model can produce the best diverse set of personalized results
  • To represent the set of all matching publications at a given instance

24
Future work
• Explore other suitable use cases for the proposed model and develop prototype applications, e.g.
  • Personalized newspaper for every Facebook user
  • Adaptive resource scheduling in large-scale distributed systems
• Exploit the overlap among the diversified results of users who have similar interests
• Develop an LSH-based index for multi-threaded distributed environments
