An Efficient Incremental Indexing Mechanism for Extracting Top-k Representative Queries over Continuous Data Streams
Y.S. Horawalavithana, D.N. Ranasinghe
Adaptive and Reflective Middleware (ARM)
ACM/IFIP/USENIX Middleware
Vancouver, BC, Canada
December 08, 2015
University of Colombo School of Computing,
Sri Lanka
Overview
• Motivation
• Adaptive Diversification
• Incremental Top-k
• Evaluation
• Conclusion
• Future work
Diversity: Top-k representative set
Representative Top-k
Drawback (without diversity)
What we want (with diversity)
Method to retrieve Top-k publications from matching publications
1.Motivation 2.Adaptive Diversification 3.Incremental Top-k 4.Evaluation 5.Conclusion 6.Future Work
Minimum independent-dominating set
[Figure: publication space with publications p1 to p5 and radius α, alongside its graph model on vertices v1 to v5. Four example vertex selections are shown: three are independent and dominating, one is dominating but not independent.]
$Neighborhood(p_i) = \{\, p_j \mid p_i \neq p_j,\ d(p_i, p_j) \le \alpha \,\}$
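The neighborhood definition and the independent/dominating labels on this slide can be checked with a short sketch. It assumes Euclidean distance and a hypothetical one-dimensional layout of five publications; the function names (`neighborhood`, `is_independent`, `is_dominating`) are illustrative and do not come from the slides.

```python
from itertools import combinations


def euclid(p, q):
    """Euclidean distance between two equal-length tuples (assumed metric)."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def neighborhood(i, points, dist, alpha):
    """Indices j != i whose distance to point i is at most alpha."""
    return {j for j in range(len(points))
            if j != i and dist(points[i], points[j]) <= alpha}


def is_independent(selected, points, dist, alpha):
    """True when no two selected points lie within alpha of each other."""
    return all(dist(points[a], points[b]) > alpha
               for a, b in combinations(selected, 2))


def is_dominating(selected, points, dist, alpha):
    """True when every unselected point has at least one selected neighbor."""
    chosen = set(selected)
    return all(i in chosen or bool(neighborhood(i, points, dist, alpha) & chosen)
               for i in range(len(points)))


# Hypothetical layout: five publications on a line, one unit apart, alpha = 1.
pts = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,)]
print(is_independent([0, 2, 4], pts, euclid, 1.0),
      is_dominating([0, 2, 4], pts, euclid, 1.0))
```

In this toy layout the selection {0, 2, 4} is both independent and dominating, while {1, 2, 3} dominates but is not independent, mirroring the two kinds of panel on the slide.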
NAÏVE Greedy: argmax over $p_i$, with terms
$r(p_i)^2$
$\sum_{p_j \in N(p_i)} r(p_j) \times d(p_i, p_j)$
Handling streaming publications
[Figure: publication space with publications p1 to p5 and radius α and its graph model on v1 to v5; a new publication p6 arrives and adds vertex v6]
Continuity Requirements
1. Durability
An item selected as diverse in the i-th window may still be selected in the (i+1)-th window if it has not expired and the other valid items in the (i+1)-th window fail to compete with it.
2. Order
The publication stream follows chronological order.
We avoid selecting an item j as diverse later when we have already selected an item i that is not older than j.
Adaptive Diversification
[Figure: matching publication stream P1, P2, P3, P4, …, Pj, Pj+1, … with the i-th and (i+1)-th windows and their selected sets S*_i and S*_{i+1}; labels: Independence, Dominance, Durability, Order]
• Straightforward solution:
  - Apply the naïve greedy method at each instance
• Proposed: an incremental index mechanism
  - Avoids the curse of re-calculating neighborhoods
Locality Sensitive Hashing (LSH)
• Simple idea
  - If two points are close together, then after a “projection” operation they will remain close together
LSH in Adaptive Diversification:
Publications as categorical data
LSH in Adaptive Diversification:
Characteristic Matrix
LSH in Adaptive Diversification:
Minhashing
• No publications any more!
  - A signature represents each publication
• Technique
  - Randomly permute the rows of the characteristic matrix m times
  - For each permutation, take the number of the first row, in the permuted order, in which the publication's column has a 1
• Advantage:
  - Reduces the dimensions to a small minhash signature
[Figure: first permutation of rows of the characteristic matrix]
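The permutation technique on this slide can be sketched as follows. This is a minimal illustration, assuming each publication is given as the non-empty set of row indices where its column of the characteristic matrix holds a 1; the function name and the `seed` parameter are hypothetical.

```python
import random


def minhash_by_permutation(columns, n_rows, m, seed=0):
    """Build m-value minhash signatures by explicit row permutations.

    columns: dict mapping a publication name to the non-empty set of row
             indices that hold a 1 in its characteristic-matrix column.
    Returns a dict mapping each name to a list of m minhash values.
    """
    rng = random.Random(seed)
    signatures = {name: [] for name in columns}
    for _ in range(m):
        order = list(range(n_rows))
        rng.shuffle(order)  # order[k] is the original row placed at position k
        position = {row: k for k, row in enumerate(order)}
        for name, rows in columns.items():
            # First position, in permuted order, where this column has a 1.
            signatures[name].append(min(position[r] for r in rows))
    return signatures


cols = {"A": {0, 2, 3}, "B": {0, 2, 3}, "C": {1, 4}}
print(minhash_by_permutation(cols, n_rows=5, m=4))
```

Identical columns always receive identical signatures, and disjoint columns never agree in any position, which is the property the later bucket mapping relies on.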
LSH in Adaptive Diversification:
Signature Matrix
Fast minhashing
• Select m random hash functions
  - They model the effect of m random permutations
• Mathematically proved only when
  - the number of rows is a prime
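One common way to realise this idea, shown here only as a sketch, is to use linear hash functions of the form (a·x + b) mod p, where p is the (prime) number of rows. The slide does not say which hash family the authors use, so the family, the helper names and the `seed` parameter are assumptions.

```python
import random


def make_hashes(m, prime, seed=0):
    """Draw m (a, b) pairs for hash functions h(x) = (a * x + b) % prime.

    a is non-zero, so with a prime modulus each h is a permutation of
    the row indices 0 .. prime - 1.
    """
    rng = random.Random(seed)
    return [(rng.randrange(1, prime), rng.randrange(0, prime)) for _ in range(m)]


def fast_minhash(rows, hashes, prime):
    """Signature of one publication: the minimum hashed row index per function."""
    return [min((a * r + b) % prime for r in rows) for a, b in hashes]


p = 7  # hypothetical characteristic matrix with a prime number of rows
hashes = make_hashes(4, p)
print(fast_minhash({0, 2, 3}, hashes, p))
```

Because each such function permutes the row indices when p is prime, taking the minimum hashed index plays the role of "first row in permuted order" without materialising any permutation.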
LSH in Adaptive Diversification:
LSH Buckets
• Take r-sized signature vectors
  - from the m-sized minhash signature
• Map them into
  - L hash tables
  - each with an arbitrary number b of buckets
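The mapping from signatures to buckets might look like the sketch below. It assumes the m-sized signature is cut into consecutive r-sized slices, one per hash table (so L = m / r), and that each slice is hashed with Python's built-in `hash` modulo b; neither choice is stated on the slide, and the function names are illustrative.

```python
from collections import defaultdict


def bucket_ids(signature, r, b):
    """Cut a signature into r-sized slices and hash each slice into [0, b)."""
    if len(signature) % r != 0:
        raise ValueError("signature length must be a multiple of r")
    return [hash(tuple(signature[i:i + r])) % b
            for i in range(0, len(signature), r)]


def build_tables(signatures, r, b):
    """Return L hash tables, each a dict: bucket id -> set of publication names."""
    n_tables = len(next(iter(signatures.values()))) // r
    tables = [defaultdict(set) for _ in range(n_tables)]
    for name, sig in signatures.items():
        for t, bucket in enumerate(bucket_ids(sig, r, b)):
            tables[t][bucket].add(name)
    return tables


sigs = {"A": [1, 2, 3, 4], "B": [1, 2, 9, 9], "C": [7, 7, 7, 7]}
tables = build_tables(sigs, r=2, b=16)
print(len(tables), bucket_ids(sigs["A"], 2, 16), bucket_ids(sigs["B"], 2, 16))
```

Publications A and B agree on their first slice, so they are guaranteed to share a bucket in the first table even though their full signatures differ.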
LSH in Adaptive Diversification:
Batch-wise Top-k computation
• Bucket “winner”: the publication with the highest relevancy score in the bucket
• The winner is dominant, representing its bucket neighborhood
• Top-k “winners” are those with a majority of votes
• The k winners are independent
[Figure: publications P_A … P_H in the i-th window]
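The winner-and-vote step can be sketched as below. It assumes that every non-empty bucket casts exactly one vote for its most relevant member and that ties are broken by relevance and then by name; the slides do not spell out the voting rule, so both are assumptions, as are the function names.

```python
from collections import Counter


def bucket_winners(tables, relevance):
    """One winner per non-empty bucket: the member with the highest relevance.

    tables: list of dicts mapping a bucket id to a set of publication names.
    """
    winners = []
    for table in tables:
        for members in table.values():
            if members:
                winners.append(max(members, key=lambda p: (relevance[p], p)))
    return winners


def top_k_by_votes(tables, relevance, k):
    """Rank winners by number of buckets won, then relevance, then name."""
    votes = Counter(bucket_winners(tables, relevance))
    ranked = sorted(votes, key=lambda p: (-votes[p], -relevance[p], p))
    return ranked[:k]


# Hypothetical window with three publications and two hash tables.
tables = [{0: {"A", "B"}, 1: {"C"}}, {0: {"A"}, 1: {"B", "C"}}]
relevance = {"A": 0.9, "B": 0.5, "C": 0.7}
print(top_k_by_votes(tables, relevance, k=2))
```

In this toy window B never wins a bucket, so it is represented by A or C rather than delivered itself.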
LSH in Dynamic Diversification:
Incremental Top-k computation
1. New publication i arrives
2. Update the i-th characteristic vector (Characteristic Matrix)
3. Generate the i-th minhash signature (Signature Matrix)
4. Map the i-th signature into L hash tables
5. Update the “winner” at each bucket the i-th signature maps into
6. Vote for the Top-k candidates
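The incremental steps can be sketched with a small index that touches only the buckets a new signature maps into. This is an illustration under assumptions: signatures arrive already computed, slices are hashed with Python's built-in `hash`, and expiry, durability and order handling are left out. The class name `IncrementalIndex` and its methods are hypothetical.

```python
from collections import Counter


class IncrementalIndex:
    """Keeps one winner per (table, bucket) and a running vote count."""

    def __init__(self, r, b):
        self.r, self.b = r, b
        self.winner = {}        # (table, bucket) -> (relevance, name)
        self.votes = Counter()  # name -> number of buckets it currently wins

    def _buckets(self, signature):
        r = self.r
        return [(t, hash(tuple(signature[i:i + r])) % self.b)
                for t, i in enumerate(range(0, len(signature), r))]

    def insert(self, name, signature, relevance):
        """Update winners only in the buckets this signature maps into."""
        touched = self._buckets(signature)
        for key in touched:
            current = self.winner.get(key)
            if current is None or relevance > current[0]:
                if current is not None:
                    self.votes[current[1]] -= 1  # old winner loses this bucket
                self.winner[key] = (relevance, name)
                self.votes[name] += 1
        return touched

    def top_k(self, k):
        live = [(n, v) for n, v in self.votes.items() if v > 0]
        live.sort(key=lambda nv: (-nv[1], nv[0]))
        return [n for n, _ in live[:k]]


index = IncrementalIndex(r=2, b=16)
index.insert("A", [1, 2, 3, 4], relevance=0.5)
index.insert("F", [1, 2, 9, 9], relevance=0.9)
print(index.top_k(1))
```

Each insertion costs work proportional to the number of hash tables rather than to the window size, which is the point of avoiding neighborhood re-computation.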
LSH in Dynamic Diversification:
When new publication F arrives…
• Only buckets B13, B23, B32, B43 will vote
• Follow continuity requirements
  - Durability
  - Order
[Figure: publications P_A … P_H with the i-th and (i+1)-th windows]
LSH in Adaptive Diversification:
Analysis
For two vectors x, y:
$JD(x, y) = 1 - JSIM(x, y)$, where $JSIM(x, y) = \frac{|x \cap y|}{|x \cup y|}$
• For publications x and y:
  $JSIM(x, y) \propto \mathrm{Prob}[H(x) = H(y)]$
• At a particular hash table
  - x and y map into the same bucket: $JSIM(x, y)^b$
  - x and y do not map into the same bucket: $1 - JSIM(x, y)^b$
• At L hash tables
  - x and y do not map into the same bucket in any table: $(1 - JSIM(x, y)^b)^L$
  - x and y map into the same bucket in at least one table: $1 - (1 - JSIM(x, y)^b)^L$
True near neighbors are unlikely to be unlucky in all the projections.
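The probabilities on this slide are easy to evaluate numerically. The sketch below keeps the slide's symbols b and L as parameter names; the function names and the sample values are illustrative only.

```python
def jaccard(x, y):
    """Jaccard similarity |x ∩ y| / |x ∪ y| of two collections."""
    x, y = set(x), set(y)
    union = x | y
    return len(x & y) / len(union) if union else 1.0


def same_bucket_prob(jsim, b):
    """Probability of sharing a bucket in one hash table: JSIM^b."""
    return jsim ** b


def collide_somewhere_prob(jsim, b, L):
    """Probability of sharing a bucket in at least one of L tables."""
    return 1.0 - (1.0 - jsim ** b) ** L


s = jaccard({1, 2, 3}, {2, 3, 4})
for tables in (1, 2, 4, 8):
    print(tables, round(collide_somewhere_prob(s, b=2, L=tables), 4))
```

Adding hash tables raises the chance that similar publications meet in at least one bucket, which is the sense in which true near neighbors are unlikely to be unlucky in every projection.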
Evaluation:
Dataset
Publication Stream
Amazon on-line marketplace data available on 17th – 19th November 2014
• Zipfian subscriptions
• Normalized preferences
$zipf(k; s, N) = \frac{1/k^s}{\sum_{n=1}^{N} (1/n^s)}$
N - number of elements in distribution,
k - rank of element
s - value of exponent
Total number of subscriber views $= \sum_{i=2}^{32} \left[ \binom{48}{i} + \binom{42}{i} + \binom{54}{i} + \binom{66}{i} + \binom{57}{i} + \binom{67}{i} \right]$
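The Zipf probability used for the subscriptions can be computed directly from its definition. This is a small sketch; the function name `zipf_pmf` and the sample parameters (s = 1, N = 10) are illustrative and not taken from the evaluation.

```python
def zipf_pmf(k, s, N):
    """Probability of the element with rank k among N elements, exponent s."""
    if not 1 <= k <= N:
        raise ValueError("rank k must lie in 1..N")
    norm = sum(1.0 / n ** s for n in range(1, N + 1))
    return (1.0 / k ** s) / norm


probs = [zipf_pmf(k, s=1.0, N=10) for k in range(1, 11)]
print([round(p, 3) for p in probs])
```

The probabilities decrease with rank and sum to one, so a few top-ranked elements attract most of the subscriptions.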
Terminology
ILSH, BLSH and NAÏVE
[Figure: publication stream P1 … P8, with four windows each labelled "BLSH or NAÏVE" and one span labelled "ILSH"]
Accuracy:
ILSH vs. NAÏVE
Probability of producing optimal diverse set of results by ILSH under Jaccard similarity threshold (s)
Performance & Efficiency:
ILSH vs. BLSH vs. NAÏVE
log (Top-k matching time) on number of publications with D=500
Conclusions
• Locality Sensitive Hashing (LSH) indexing method
  - Produces a diverse set of results at an average of 70% accuracy relative to the naïve method
  - Reduces the matching time significantly compared with the NAÏVE method
• Further refined by its incremental version
  - For handling streaming publications
  - Avoids the curse of re-computing neighborhoods
• Top-k to restrict the delivery of top publications
  - Given a window size and delivery method
• The model can produce the best diverse set of personalized results
  - To represent the set of all matching publications at a given instance
Future work
• Explore other suitable use cases for the proposed model and develop prototype applications, e.g.
  - Personalized newspaper for every Facebook user
  - Adaptive resource scheduling in large-scale distributed systems
• Exploit overlap among the diversified results of users who have similar interests
• Develop an LSH-based index over a multi-threaded distributed environment
Q&A
THANK YOU!


[ARM 15 | ACM/IFIP/USENIX Middleware 2015] Research Paper Presentation


Editor's Notes

  • #4: each user gets exposed to more than 1,500 stories each day, but an average user would only get to see about 
  • #16: Since similar publications tend to map into the same bucket with probability 1 − d, the dominance condition is well served: the "winner" publication, as the most relevant publication in each bucket, can cover its neighborhood. Also, two buckets represent two separate neighborhoods, so all "winner" publications are dissimilar from each other by at least distance d, which also satisfies the independence condition.
  • #23: Talk about the ILSH update cost, which comes from maintaining a large characteristic matrix