Locality Sensitive Hashing
Randomized Algorithm
Problem Statement
• Given a query point q, find the closest items to it with probability 1 − δ
• Why not exact, iterative methods?
• Large volume of data
• Curse of dimensionality
Taxonomy – Near Neighbor Query (NN)
• Trees: k-d tree, range tree, B-tree, cover tree
• Grid
• Voronoi diagram
• Hash
• Approximate: LSH
Approximate LSH
• Simple idea: if two points are close together, then after a “projection” operation they will remain close together
LSH Requirement
• For any given points p, q ∈ ℝ^d:
P_H[h(p) = h(q)] ≥ P1 for ‖p − q‖ ≤ d1
P_H[h(p) = h(q)] ≤ P2 for ‖p − q‖ ≥ c·d1 = d2
• Such a hash function h is (d1, d2, P1, P2)-sensitive; ideally we want
• (P1 − P2) to be large
• (d2 − d1) to be small
[Figure: probability vs. distance on candidate pairs — the collision probability P decreases as the distance between q and a candidate grows from d to c·d to 2d, with P(1) ≥ P(2) ≥ P(c).]
Hash Function (Random)
• Locality-preserving
• Independent
• Deterministic
• A family of hash functions exists per distance measure:
• Euclidean
• Jaccard
• Cosine similarity
• Hamming
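The slides develop only the Euclidean family, but as an illustration of how a family can match a distance measure, here is a minimal sketch (my own, not from the slides) of the sign-of-random-projection family commonly used for cosine similarity; the names `simhash` and `planes` are mine:

```python
import random

def simhash(v, planes):
    # One bit per random hyperplane: the sign of the projection onto it.
    return tuple(1 if sum(vi * pi for vi, pi in zip(v, p)) >= 0 else 0
                 for p in planes)

rng = random.Random(0)
dim = 8
planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(16)]

a = [rng.gauss(0, 1) for _ in range(dim)]
b = [ai + 0.01 * rng.gauss(0, 1) for ai in a]  # a slightly perturbed copy of a
c = [rng.gauss(0, 1) for _ in range(dim)]      # an unrelated direction

same_ab = sum(x == y for x, y in zip(simhash(a, planes), simhash(b, planes)))
same_ac = sum(x == y for x, y in zip(simhash(a, planes), simhash(c, planes)))
# nearly parallel vectors agree on far more bits than unrelated ones (w.h.p.)
```

Each bit flips between two vectors with probability θ/π (θ = angle between them), so agreement on the bit vector is locality-preserving for cosine distance.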
LSH Family for Euclidean distance (2-d)
• Project onto a random line divided into buckets of width a; two points at distance d, whose connecting line makes angle θ with the random line, can collide only if d·cos θ ≤ a
• When d·cos θ ≤ a there is a chance of colliding, but it is not certain
• But we can guarantee:
• If d ≤ a/2, then d·cos θ ≤ a/2 for every θ, so the points fall in the same bucket with probability at least 1/2; ∴ P1 ≥ 1/2
• If d ≥ 2a, collision requires cos θ ≤ 1/2, i.e. 60° ≤ θ ≤ 90°, which happens with probability at most 1/3; ∴ P2 ≤ 1/3
• So the family is (d1, d2, P1, P2) = (a/2, 2a, 1/2, 1/3)-sensitive
How to define the projection?
• Scalar projection (dot product):
h(v) = v · x
v – query point in d-dimensional space
x – vector with random components drawn from N(0, 1)
• Quantized projection:
h(v) = ⌊(v · x + b) / w⌋
w – width of the quantization bin
b – random variable uniformly distributed between 0 and w
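The quantized projection can be written directly as code. A minimal sketch assuming nothing beyond the formula h(v) = ⌊(v·x + b)/w⌋ above (the helper name `make_hash` is mine):

```python
import math
import random

def make_hash(dim, w, rng):
    """One LSH function h(v) = floor((v . x + b) / w),
    with x ~ N(0,1)^dim and b ~ Uniform[0, w)."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, w)
    def h(v):
        return math.floor((sum(vi * xi for vi, xi in zip(v, x)) + b) / w)
    return h

rng = random.Random(42)
h = make_hash(dim=3, w=4.0, rng=rng)
p = [1.0, 2.0, 3.0]
q = [1.1, 2.1, 2.9]  # close to p
# close points usually land in the same quantization bin
bin_p, bin_q = h(p), h(q)
```

Since the projections of p and q differ by far less than the bin width w, the two bin indices can differ by at most one here.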
How to define the projection?
• Use k dot products, so that (P1/P2)^k > (P1/P2): the gap between the probabilities that close and distant points fall into the same quantization bin widens
• Perform k independent dot products
• Achieve success if the query and the nearest neighbor are in the same bin in all k dot products
• Success probability = P1^k; it decreases as we include more dot products
Multiple projections
• Use L independent projections
• A true near neighbor is unlikely to be unlucky in all L projections
• By increasing L, we can find the true nearest neighbor with arbitrarily high probability
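Putting the pieces together — k dot products per table, L independent tables — an LSH index can be sketched as follows (class and method names are my own; this is an illustration under the definitions above, not the implementation behind the slides):

```python
import math
import random

class LSHIndex:
    """L hash tables, each keyed by k concatenated hashes floor((v.x + b)/w)."""
    def __init__(self, dim, k, L, w, seed=0):
        rng = random.Random(seed)
        self.w = w
        # For each of the L tables: k (x, b) pairs defining its projections.
        self.funcs = [[([rng.gauss(0, 1) for _ in range(dim)], rng.uniform(0, w))
                       for _ in range(k)] for _ in range(L)]
        self.tables = [{} for _ in range(L)]
        self.points = []

    def _key(self, t, v):
        return tuple(math.floor((sum(vi * xi for vi, xi in zip(v, x)) + b) / self.w)
                     for x, b in self.funcs[t])

    def add(self, v):
        idx = len(self.points)
        self.points.append(v)
        for t in range(len(self.tables)):
            self.tables[t].setdefault(self._key(t, v), []).append(idx)

    def query(self, q):
        """Candidate neighbors: points colliding with q in at least one table."""
        cand = set()
        for t in range(len(self.tables)):
            cand.update(self.tables[t].get(self._key(t, q), ()))
        return [self.points[i] for i in cand]

rng = random.Random(7)
index = LSHIndex(dim=5, k=4, L=8, w=2.0)
near = [0.1 * rng.gauss(0, 1) for _ in range(5)]  # close to the origin
far = [10 + rng.gauss(0, 1) for _ in range(5)]    # far from the origin
index.add(near)
index.add(far)
cands = index.query([0.0] * 5)  # the near point should be among the candidates
```

Only the candidates returned by `query` need an exact distance check, which is where the speedup over a linear scan comes from.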
Accuracy
• Two close points p and q, separated by u = ‖p − q‖
• Probability of collision P_H(u):
P_H(u) = Pr_H[H(p) = H(q)] = ∫₀^w (1/u) · f_s(t/u) · (1 − t/w) dt
f_s – probability density function of the hash projection
• As the distance u increases, P_H(u) decreases
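Assuming Gaussian projection vectors as on the earlier slide, the projected difference is u·|X| with X ~ N(0,1), so f_s is the density of the absolute normal. The integral can then be evaluated numerically to confirm that P_H(u) falls as u grows (function names are mine):

```python
import math

def f_abs_gauss(t):
    """pdf of |X| for X ~ N(0,1): the 2-stable (Gaussian) case."""
    return math.sqrt(2.0 / math.pi) * math.exp(-t * t / 2.0)

def p_collision(u, w, steps=10_000):
    """Midpoint-rule evaluation of P_H(u) = ∫_0^w (1/u) f_s(t/u) (1 - t/w) dt."""
    dt = w / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * dt
        total += (1.0 / u) * f_abs_gauss(t / u) * (1.0 - t / w) * dt
    return total

w = 4.0
probs = [p_collision(u, w) for u in (0.5, 1.0, 2.0, 4.0)]
# probs is monotonically decreasing in u
```

For these values the probabilities drop from roughly 0.9 toward 0.4, matching the slide's claim that collision probability decays with distance.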
Time complexity
• For a query point q, finding the near neighbors takes (T_g + T_c):
• Calculate & hash the projections (T_g): O(DkL); D – dimension, kL projections
• Search the buckets for collisions (T_c): O(D·L·N_c); D – dimension, L projections, and
N_c = Σ_{q′∈𝒟} p^k(‖q − q′‖); N_c – expected number of collisions for a single projection over the dataset 𝒟
• Analysis:
• T_g increases as k and L increase
• T_c decreases as k increases, since p^k < p
How many projections (L)?
• For a query point p and its neighbor q:
• For a single projection, the success probability of collision is ≥ P1^k
• For L projections, the failure probability (no collision in any of them) is ≤ (1 − P1^k)^L
∴ Setting (1 − P1^k)^L = δ:
L = log δ / log(1 − P1^k)
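This formula gives a direct recipe for sizing L. For example (the helper name `tables_needed` is mine), with the Euclidean family's P1 = 1/2, k = 4 bits per table, and a failure probability δ = 0.05:

```python
import math

def tables_needed(p1, k, delta):
    """Smallest L with (1 - p1**k)**L <= delta,
    i.e. L = ceil(log(delta) / log(1 - p1**k))."""
    return math.ceil(math.log(delta) / math.log(1.0 - p1 ** k))

L = tables_needed(p1=0.5, k=4, delta=0.05)  # → 47
```

With k = 1 the same δ needs only 5 tables, showing the trade-off: larger k gives cheaper bucket scans but forces more tables.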
LSH in MAXDIVREL Diversity
[Figure: four hash tables (L = 4 independent projections), each with rows 1…w and columns #1…#k (one per dot product), holding the binary bucket patterns of the indexed points.]
REFERENCES
[1] A. Rajaraman and J. Ullman, “Mining of Massive Datasets,” Chapter 3, pp. 72–130.
[2] M. Slaney and M. Casey, “Lecture Note: LSH,” 2008.
[3] N. Sundaram, A. Turmukhametova, N. Satish, T. Mostak, P. Indyk,
S. Madden, and P. Dubey, “Streaming similarity search over one billion
tweets using parallel locality-sensitive hashing,” Proc. VLDB Endow., vol.
6, no. 14, pp. 1930–1941, Sep. 2013.


Editor's Notes

  • #2: A randomized algorithm does not guarantee an exact answer but instead provides a high probability guarantee that it will return the correct answer or one close to it
  • #4: Trees: O(log N), where N is the number of objects; when d is one-dimensional this is binary search, but multidimensional algorithms such as k-d trees break down when the dimensionality of the search space is greater than a few dimensions, degrading toward O(N). Grid: close points should be in the same grid cell, but some can always lie across a boundary (no matter how close), some may be further than one grid cell away yet still close, and in high dimensions the number of neighboring grid cells grows exponentially; one option is to randomly shift (and rotate) the grid and try again. Hash: O(1) search, while O(N) memory.
  • #8: Notice that we say nothing about what happens when the distance between the items is strictly between d1 and d2, but we can make d1 and d2 as close as we wish. The penalty is that typically p1 and p2 are then close as well. As we shall see, it is possible to drive p1 and p2 apart while keeping d1 and d2 fixed, according to a Chernoff–Hoeffding bound.
  • #9: the probability that p and q collide under a random choice of hash function depends only on the distance between p and q
  • #10: In fact, if the angle θ between the randomly chosen line and the line connecting the points is large, then there is an even greater chance that the two points will fall in the same bucket. For instance, if θ is 90 degrees, then the two points are certain to fall in the same bucket. However, suppose d is larger than a. In order for there to be any chance of the two points falling in the same bucket, we need d cos θ ≤ a
  • #11: Finding a good hash implementation, and analyzing the hash performance
  • #12: Increasing the quantization bucket width w will increase the number of points that fall into each bucket. To obtain our final nearest neighbor result we will have to perform a linear search through all the points that fall into the same bucket as the query, so varying w effects a trade-off between a larger table with a smaller final linear search, or a more compact table with more points to consider in the final search