Efficient Estimation for High Similarities
using Odd Sketches
Michael Mitzenmacher Rasmus Pagh Ninh Pham
Harvard University IT University of Copenhagen IT University of Copenhagen
Reported by
Souop Fotso Jocelyn Axel
Softskills Seminar, January 2018
Abstract
This paper presents the implementation and evaluation of the Odd Sketch,
a compact binary sketch for estimating the Jaccard similarity of two sets.
The method provides a highly space-efficient and time-efficient estimator for
sets of high similarity, which is relevant in applications such as web duplicate
detection, collaborative filtering, and association rule learning. The method
extends to weighted Jaccard similarity. Experimental results show that the
Odd Sketch is more efficient than b-bit minwise hashing schemes on association
rule learning and web duplicate detection tasks.
1. Introduction
The estimation of the Jaccard similarity is a fundamental problem in
many computer applications that deal with collections of sets containing
thousands (sometimes even billions) of items.
Given two sets S1 and S2 (S1, S2 ⊆ Ω = {0, 1, ..., D − 1}), their similarity
can be quantified using the Jaccard similarity coefficient:
J(S1, S2) = |S1 ∩ S2| / |S1 ∪ S2|
The main challenge in many computer applications is to obtain a quick estimate
of J. Existing solutions, while highly efficient in general, are not optimal
when J is close to 1. The paper presents a novel solution, the Odd Sketch,
that yields improved precision in the high similarity regime.
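As a point of reference for the estimators discussed later, the exact Jaccard coefficient can be computed directly from the definition. The following is a minimal sketch; the function name `jaccard` and the convention of returning 1.0 for two empty sets are assumptions made here, not taken from the paper.

```python
def jaccard(s1: set, s2: set) -> float:
    """Exact Jaccard similarity |S1 ∩ S2| / |S1 ∪ S2|."""
    union = s1 | s2
    if not union:
        return 1.0  # convention chosen for this sketch: two empty sets are identical
    return len(s1 & s2) / len(union)


s1 = {0, 1, 2, 3}
s2 = {2, 3, 4, 5}
print(jaccard(s1, s2))  # 2 shared items out of 6 distinct items
```

The exact computation needs both full sets in memory, which is what sketching methods try to avoid for very large collections.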
2. Previous work
2.1. Minwise Hashing
Minwise hashing is a powerful algorithmic technique to estimate set similarities,
originally proposed by Broder et al. [1].
Given a random permutation π : Ω → Ω, the Jaccard similarity of S1 and S2 is
J(S1, S2) = Pr[min(π(S1)) = min(π(S2))]
where min(π(S1)) denotes the minhash of S1. Therefore we get an estimator
for J by considering a sequence of k independent random permutations π1, ..., πk
and storing the annotated minhashes
S̄1 = {(i, min(πi(S1))) | i = 1, ..., k},
S̄2 = {(i, min(πi(S2))) | i = 1, ..., k}.
We estimate J by the fraction
Ĵ = |S̄1 ∩ S̄2| / k
This estimator is unbiased, and by independence of the permutations it
can be shown that
Var(Ĵ) = J(1 − J) / k
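The minwise estimator can be illustrated with a small simulation. This is a sketch under stated assumptions: the function names `minhash_sketch` and `estimate_jaccard` are hypothetical, the permutations are materialized explicitly (practical only for a small universe; real systems typically use hash functions instead), and the input sets are assumed non-empty.

```python
import random


def minhash_sketch(s, D, k, seed=0):
    """Return [min(pi_i(S)) for i = 1..k] for k random permutations of {0..D-1}.

    The same seed must be used for every set so that all sets share
    the same permutations.
    """
    rng = random.Random(seed)
    sketch = []
    for _ in range(k):
        perm = list(range(D))
        rng.shuffle(perm)  # perm[x] plays the role of pi_i(x)
        sketch.append(min(perm[x] for x in s))
    return sketch


def estimate_jaccard(sk1, sk2):
    """Fraction of permutations on which the two minhashes agree."""
    matches = sum(a == b for a, b in zip(sk1, sk2))
    return matches / len(sk1)


s1 = set(range(0, 60))
s2 = set(range(20, 80))  # true J = 40 / 80 = 0.5
sk1 = minhash_sketch(s1, D=100, k=300)
sk2 = minhash_sketch(s2, D=100, k=300)
print(estimate_jaccard(sk1, sk2))
```

Storing the position i implicitly as the list index is equivalent to storing the annotated pairs (i, min(πi(S))).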
2.2. b-bit Minwise Hashing
Li and König [2] proposed a time- and space-efficient version of the original
minwise hashing scheme. Instead of storing 32 or 64 bits for each
minhash, this approach stores only its lowest b bits, for a small b. It is based on
the intuition that equal hash values give the same lowest b bits, whereas
different hash values give different lowest b bits with probability 1 − 1/2^b.
Proceeding as for the minhash but keeping only the lowest b
bits of each minhash, we obtain an estimate of J and its variance.
However, for similarity close to 1, b-bit minhash will produce almost identical
sketches, which reveal very little about *how* close to 1 the similarity is.
Therefore this approach is not optimal in the high similarity regime.
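The b-bit estimator and its variance can be worked out from the intuition stated above. The derivation below is a reconstruction under the simplifying assumption that two different minhashes agree on their lowest b bits with probability exactly 2^{-b}; the symbols P_b and \hat{P}_b are introduced here for illustration, and the general estimator in [2] may carry additional correction terms.

```latex
% Per-permutation probability that the two b-bit values agree:
P_b = J + (1 - J)\,2^{-b}

% \hat{P}_b is the observed fraction of agreeing b-bit values among k permutations.
\hat{J}_b = \frac{\hat{P}_b - 2^{-b}}{1 - 2^{-b}}

% Since Var(\hat{P}_b) = P_b (1 - P_b) / k and 1 - P_b = (1 - J)(1 - 2^{-b}):
\mathrm{Var}(\hat{J}_b)
  = \frac{P_b (1 - P_b)}{k\,(1 - 2^{-b})^2}
  = \frac{1 - J}{k}\left(J + \frac{1}{2^{b} - 1}\right)
```

Compared with the variance J(1 − J)/k of the original minwise estimator, the extra term 1/(2^b − 1) is the price paid for discarding bits, which is why b-bit schemes need more permutations for the same accuracy.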
3. Proposed solution
The authors proposed the Odd Sketch, a compact binary sketch similar
to a Bloom filter with one hash function, constructed on the original minhashes,
with the "odd" feature that the usual disjunction is replaced by an
exclusive-or operation.
Given a set S, the odd sketch of S, denoted odd(S), is a binary
array of size n (n > 2) that records in the ith position the parity of the number
of elements of S that are hashed (by a fully random hash function) to
position i.
Here is pseudocode for the Odd Sketch construction:
Algorithm 1 Odd Sketch(S, n)
Require: the set S and the sketch size n in bits
  Initialize the array A of size n to zero
  Pick a random hash function h : Ω → [n]
  for each set element x ∈ S do
    A[h(x)] = A[h(x)] ⊕ 1   // flip the bit at position i = h(x)
  end for
  return A
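Because odd(S) records the parity of the number of elements that hash
to each location, the exclusive-or of two odd sketches built with the same hash
function is the odd sketch of the symmetric difference of the two sets:
odd(S1) ⊕ odd(S2) = odd(S1 Δ S2)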
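The construction in Algorithm 1 can be sketched in a few lines. The code below is illustrative only: the helper names `h`, `odd_sketch`, and `xor` are chosen here, and a salted BLAKE2 digest stands in for the fully random hash function assumed by the paper. It also checks that XOR-ing two sketches built with the same hash function gives the sketch of the symmetric difference, since elements common to both sets flip the same bit twice and cancel out.

```python
import hashlib


def h(x, n, seed=0):
    """Map x to a position in {0, ..., n-1} (stand-in for a fully random hash)."""
    digest = hashlib.blake2b(
        repr(x).encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little") % n


def odd_sketch(s, n, seed=0):
    """Bit i holds the parity of the number of elements of s hashed to i."""
    a = [0] * n
    for x in s:
        a[h(x, n, seed)] ^= 1  # flip the bit at position h(x)
    return a


def xor(a, b):
    """Bitwise exclusive-or of two sketches of equal length."""
    return [u ^ v for u, v in zip(a, b)]


s1 = set(range(0, 100))
s2 = set(range(10, 110))
n = 64
lhs = xor(odd_sketch(s1, n), odd_sketch(s2, n))
rhs = odd_sketch(s1 ^ s2, n)  # s1 ^ s2 is the symmetric difference
print(lhs == rhs)
```

The identity holds exactly for any fixed hash function, not just in expectation, which is what makes the sketch usable for estimating the size of a symmetric difference.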
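The authors proved that if we construct the Odd Sketches odd(S̄1) and
odd(S̄2) from the minhash sets S̄1 and S̄2 (the sets of annotated minhashes)
derived from the original sets S1 and S2, we can estimate the Jaccard
similarity coefficient J(S1, S2) as follows:
Ĵ = 1 + (n / (4k)) ln(1 − 2 |odd(S̄1) Δ odd(S̄2)| / n)
where k is the number of permutations used during the minhash step, n is the
sketch size in bits, and |odd(S̄1) Δ odd(S̄2)| is the number of positions in
which the two sketches differ.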
Both Odd Sketches and b-bit minwise hashing can be viewed as variations of
the original minwise hashing scheme that reduce the number of bits used. The
quality of their estimators is dependent on the quality of the original minwise
estimators. In practice, both Odd Sketches and b-bit minwise hashing need
to use more permutations but less storage space than the original minwise
hashing scheme.
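The whole pipeline (minhashes, then Odd Sketches, then the estimate) can be put together as follows. This is a minimal sketch and not the authors' MATLAB implementation: the function names, the explicit permutations, the BLAKE2-based hash, and the clamping of the result to [0, 1] are all assumptions made here. The estimate used is Ĵ = 1 + (n / (4k)) ln(1 − 2d / n), where d is the number of positions in which the two sketches differ.

```python
import hashlib
import math
import random


def _hash(item, n, seed):
    """Stand-in for a fully random hash function into {0, ..., n-1}."""
    digest = hashlib.blake2b(
        repr(item).encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little") % n


def annotated_minhashes(s, D, k, seed=0):
    """Set of pairs (i, min(pi_i(S))) for k random permutations of {0..D-1}."""
    rng = random.Random(seed)  # same seed => same permutations for every set
    out = set()
    for i in range(k):
        perm = list(range(D))
        rng.shuffle(perm)
        out.add((i, min(perm[x] for x in s)))
    return out


def odd_sketch(items, n, seed=0):
    """Bit i holds the parity of the number of items hashed to position i."""
    bits = [0] * n
    for item in items:
        bits[_hash(item, n, seed)] ^= 1
    return bits


def odd_sketch_estimate(sk1, sk2, k):
    """Estimate J from two Odd Sketches built on k annotated minhashes each."""
    n = len(sk1)
    d = sum(u ^ v for u, v in zip(sk1, sk2))  # number of differing positions
    if 2 * d >= n:
        return 0.0  # logarithm undefined; treated as "not similar" in this sketch
    estimate = 1.0 + n / (4.0 * k) * math.log(1.0 - 2.0 * d / n)
    return max(0.0, estimate)  # clamping is a choice made here


D, k, n = 1000, 200, 256
s1 = set(range(0, 500))
s2 = set(range(25, 525))  # true J = 475 / 525
sk1 = odd_sketch(annotated_minhashes(s1, D, k), n)
sk2 = odd_sketch(annotated_minhashes(s2, D, k), n)
print(odd_sketch_estimate(sk1, sk2, k))
```

Each set is summarized by only n bits, regardless of k. The sketch size n should be chosen noticeably larger than the expected number of differing minhashes, otherwise the sketches saturate and the logarithm becomes unstable.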
4. Evaluation Highlights
In order to evaluate performance, the authors implemented b-bit minwise
hashing and Odd Sketch in MATLAB and compared the two
approaches on association rule learning and web duplicate detection
tasks. The main findings are:
• Comparing the accuracy (−log(MSE)) of both approaches on a sparse
data set, Odd Sketch provides a smaller error than the
b-bit minwise approach even when both approaches use the same
number of permutations. The difference is more dramatic when J is very
high.
• Association rule learning: The authors measured the precision-recall
ratio of both approaches on detecting the pairs of items that
have Jaccard similarity larger than a threshold J0 = 0.9. The results
obtained demonstrate the superiority of Odd Sketch compared to 1/2-bit
minwise hashing with respect to precision. The Odd Sketch achieved
up to 20% higher precision while providing similar recall.
• Web duplicate detection:
In this experiment, the authors compared the performance of the two
approaches on web duplicate detection tasks on the bag of words dataset.
They picked three high-dimensional datasets, computed all pairwise
Jaccard similarities among documents, and retrieved every pair
with J ≥ J0. For the sake of comparison, they used the same number
of permutations and considered the thresholds J0 = 0.85 and J0 = 0.90.
The precision-recall ratio was again used as the standard measure.
Odd Sketch is still better in precision but slightly worse
in recall.
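The precision and recall figures used in these experiments can be computed as sketched below. The function name `precision_recall`, the dictionary layout, and the similarity values are hypothetical and serve only to illustrate thresholded pair retrieval; they are not results from the paper.

```python
def precision_recall(true_sim, est_sim, j0):
    """Precision and recall of retrieving pairs whose similarity is >= j0.

    true_sim and est_sim map a pair identifier to its exact and
    estimated Jaccard similarity, respectively.
    """
    relevant = {p for p, j in true_sim.items() if j >= j0}
    retrieved = {p for p, j in est_sim.items() if j >= j0}
    hits = len(relevant & retrieved)
    precision = hits / len(retrieved) if retrieved else 1.0
    recall = hits / len(relevant) if relevant else 1.0
    return precision, recall


# Illustrative values only.
true_sim = {("a", "b"): 0.95, ("a", "c"): 0.80, ("b", "c"): 0.92}
est_sim = {("a", "b"): 0.93, ("a", "c"): 0.91, ("b", "c"): 0.88}
print(precision_recall(true_sim, est_sim, j0=0.9))
```

In this toy example one of the two retrieved pairs is truly above the threshold, and one of the two truly similar pairs is retrieved, so both precision and recall are 0.5.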
5. Conclusion
The paper presented the Odd Sketch, a compact binary sketch for estimating
the similarity of two sets. Odd Sketch is time and space efficient and gives
good results even in the high similarity regime. Experiments on synthetic
and real-world datasets demonstrate the efficiency of Odd Sketches in comparison
with b-bit minwise hashing schemes on association rule learning and
web duplicate detection tasks. The authors expect
that the Odd Sketch will be used in other applications.
6. References
[1] A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-wise
independent permutations. J. Comput. Syst. Sci., 60(3):630–659, 2000.
[2] P. Li and A. C. König. b-bit minwise hashing. In WWW, pages 671–680,
2010.
