Report on Efficient Estimation for High Similarities using Odd Sketches

Efficient Estimation for High Similarities
using Odd Sketches
Michael Mitzenmacher (Harvard University)
Rasmus Pagh (IT University of Copenhagen)
Ninh Pham (IT University of Copenhagen)
Reported by
Souop Fotso Jocelyn Axel
Softskills Seminar, January 2018
Abstract
This paper presents the implementation and evaluation of the Odd Sketch,
a compact binary sketch for estimating the Jaccard similarity of two sets.
The method provides a highly space-efficient and time-efficient estimator for
sets of high similarity, which is relevant in applications such as web duplicate
detection, collaborative filtering, and association rule learning. The method
extends to weighted Jaccard similarity. Experimental results show that the
Odd Sketch is more efficient than b-bit minwise hashing schemes on
association rule learning and web duplicate detection tasks.
1. Introduction
The estimation of the Jaccard similarity is a fundamental problem in
many computer applications that deal with collections of sets
containing thousands (sometimes even billions) of items.
Given two sets S1 and S2 (S1, S2 ⊆ Ω = {0, 1, ..., D − 1}), their similarity
can be quantified using the Jaccard similarity coefficient:
J(S1, S2) = |S1 ∩ S2| / |S1 ∪ S2|
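As a point of reference for the estimators discussed later, the exact coefficient can be computed directly from the definition. The sketch below is illustrative only: the function name `jaccard` and the convention for two empty sets are assumptions, not part of the paper.

```python
def jaccard(s1: set, s2: set) -> float:
    """Exact Jaccard similarity |S1 ∩ S2| / |S1 ∪ S2|."""
    union = s1 | s2
    if not union:
        # Assumed convention: two empty sets are treated as identical.
        return 1.0
    return len(s1 & s2) / len(union)


s1 = {0, 1, 2, 3}
s2 = {2, 3, 4, 5}
print(jaccard(s1, s2))  # 2 shared items out of 6 distinct items
```

The exact computation needs both sets in full, which is what sketching methods try to avoid on large collections.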
The main challenge in many computer applications is to obtain a quick
estimate of J. Existing solutions, while highly efficient in general, are not
optimal when J is close to 1. The paper presents a novel solution, the Odd
Sketch, that yields improved precision in the high similarity regime.
2. Previous works
2.1. Minwise Hashing
Minwise hashing is a powerful algorithmic technique to estimate set sim-
ilarities, originally proposed by Broder et al. [1].
Given a random permutation π : Ω → Ω, the Jaccard similarity of S1 and S2 is
J(S1, S2) = Pr[min(π(S1)) = min(π(S2))]
where min(π(S1)) denotes the minhash of S1. We therefore get an
estimator for J by considering a sequence of permutations π1, ..., πk and storing
the annotated minhashes
S̄1 = {(i, min(πi(S1))) | i = 1, ..., k},
S̄2 = {(i, min(πi(S2))) | i = 1, ..., k}.
We estimate J by the fraction of positions on which the two agree:
Ĵ = |S̄1 ∩ S̄2| / k
This estimator is unbiased, and by independence of the permutations it
can be shown that
Var(Ĵ) = J(1 − J) / k
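The minwise estimator can be illustrated with a short sketch. It assumes that a salted BLAKE2b hash behaves like a random permutation of Ω (an approximation, since true random permutations are impractical for large Ω); the helper names `h64` and `minhash_estimate` are hypothetical.

```python
import hashlib


def h64(i: int, x) -> int:
    """i-th hash function: BLAKE2b salted with i (stand-in for permutation πi)."""
    digest = hashlib.blake2b(
        repr(x).encode(), digest_size=8, salt=i.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_estimate(s1: set, s2: set, k: int) -> float:
    """Fraction of the k hash functions on which the two minhashes agree."""
    matches = 0
    for i in range(k):
        if min(h64(i, x) for x in s1) == min(h64(i, x) for x in s2):
            matches += 1
    return matches / k


s1 = set(range(0, 300))
s2 = set(range(100, 400))          # exact J = 200 / 400 = 0.5
print(minhash_estimate(s1, s2, k=400))
```

With k = 400 and J = 0.5, the variance formula gives a standard deviation of sqrt(0.25 / 400) = 0.025, so the printed value should typically fall close to 0.5.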
2.2. b-bit Minwise Hashing
Li and König [2] proposed a time- and space-efficient version of the original
minwise hashing scheme. Instead of storing 32 or 64 bits for each
minhash, this approach stores only its lowest b bits. It is based on
the intuition that equal hash values give the same lowest b bits, whereas
different hash values give different lowest b bits with probability 1 − 1/2^b.
Proceeding as for the original minhash, but saving only the lowest b
bits of each minhash, we obtain an estimate of J and its variance.
However, for similarity close to 1, b-bit minhash produces almost identical
sketches, which reveal very little about *how* close to 1 the similarity is.
This approach is therefore not optimal in the high similarity regime.
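The following sketch illustrates the b-bit idea under a simplifying assumption: the sets are small relative to Ω, so two different minhashes share their lowest b bits with probability 1/2^b and the observed match rate is about J + (1 − J)/2^b. The estimator in [2] contains further terms that depend on the set sizes; the correction used here is only that sparse-data approximation, and all function names are hypothetical.

```python
import hashlib


def h64(i: int, x) -> int:
    """i-th hash function: BLAKE2b salted with i (stand-in for permutation πi)."""
    digest = hashlib.blake2b(
        repr(x).encode(), digest_size=8, salt=i.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def b_bit_signature(s: set, k: int, b: int) -> list:
    """Keep only the lowest b bits of each of the k minhashes of s."""
    mask = (1 << b) - 1
    return [min(h64(i, x) for x in s) & mask for i in range(k)]


def b_bit_estimate(sig1: list, sig2: list, b: int) -> float:
    """Match rate corrected for accidental b-bit collisions (sparse approximation)."""
    k = len(sig1)
    p_hat = sum(u == v for u, v in zip(sig1, sig2)) / k
    chance = 2.0 ** -b
    return (p_hat - chance) / (1.0 - chance)


s1 = set(range(0, 300))
s2 = set(range(30, 330))           # exact J = 270 / 330 ≈ 0.818
sig1 = b_bit_signature(s1, k=300, b=2)
sig2 = b_bit_signature(s2, k=300, b=2)
print(b_bit_estimate(sig1, sig2, b=2))
```

When J is close to 1, nearly all positions of the two signatures match, so the match rate moves very little as J changes; this is the weakness described above.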
3. Proposed solution
The authors proposed the Odd Sketch, a compact binary sketch similar
to a Bloom filter with one hash function, constructed on the original
minhashes with the "odd" feature that the usual disjunction is replaced by an
exclusive-or operation.
Given a set S, the odd sketch of S, denoted odd(S), is a binary
array of size n (n > 2) that records in its i-th position the parity of the number
of elements of S that are hashed (by a fully random hash function) to
position i.
Here is pseudocode for the Odd Sketch construction:
Algorithm 1 Odd Sketch(S, n)
Require: the set S and the sketch size n in bits
  Initialize the array A of size n to zero
  Pick a random hash function h: Ω → [n]
  for each set element x ∈ S do
    A[h(x)] = A[h(x)] ⊕ 1   // flip the bit at position i = h(x)
  end for
  return A
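A minimal runnable version of the construction might look as follows. It assumes BLAKE2b reduced modulo n as a stand-in for the fully random hash function h; the names `position`, `odd_sketch` and `xor` are illustrative.

```python
import hashlib


def position(x, n: int) -> int:
    """Map item x to a position in [0, n) (BLAKE2b as a stand-in for h)."""
    digest = hashlib.blake2b(repr(x).encode(), digest_size=8).digest()
    return int.from_bytes(digest, "little") % n


def odd_sketch(s, n: int) -> list:
    """Bit array whose i-th entry is the parity of the items hashed to i."""
    a = [0] * n
    for x in s:
        a[position(x, n)] ^= 1   # flip the bit at position h(x)
    return a


def xor(a: list, b: list) -> list:
    """Bitwise exclusive-or of two sketches of equal length."""
    return [u ^ v for u, v in zip(a, b)]


s1 = set(range(100))
s2 = set(range(50, 150))
# Elements common to both sets flip the same bit twice and cancel out.
print(xor(odd_sketch(s1, 64), odd_sketch(s2, 64)) == odd_sketch(s1 ^ s2, 64))
```

Because shared elements cancel, the exclusive-or of two sketches built with the same hash function depends only on the elements in which the sets differ.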
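Because odd(S) records the parity of the number of elements that hash
to a location, it follows that the exclusive-or of two sketches is the sketch
of the symmetric difference of the sets:
odd(S1) ⊕ odd(S2) = odd(S1 Δ S2)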
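The authors proved that if we construct the Odd Sketches odd(S̄1) and
odd(S̄2) from the annotated minhash sets S̄1 and S̄2 derived from the original
sets S1 and S2, we can estimate the Jaccard similarity coefficient J(S1, S2) as follows:
Ĵ = 1 + (n / (4k)) ln(1 − 2 |odd(S̄1) ⊕ odd(S̄2)| / n)
where k is the number of permutations used during the minhash step, n is the
sketch size in bits, and |odd(S̄1) ⊕ odd(S̄2)| is the number of positions at which
the two sketches differ.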
Both Odd Sketches and b-bit minwise hashing can be viewed as variations of
the original minwise hashing scheme that reduce the number of bits used. The
quality of their estimators is dependent on the quality of the original minwise
estimators. In practice, both Odd Sketches and b-bit minwise hashing need
to use more permutations but less storage space than the original minwise
hashing scheme.
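The whole pipeline (minhashes, then Odd Sketches, then the estimate) can be sketched end to end. The estimator used below is Ĵ = 1 + (n / (4k)) ln(1 − 2z / n), where z is the number of positions at which the two sketches differ. As before, salted BLAKE2b stands in for the random permutations and for the sketch hash function, and the function names and parameter values are illustrative assumptions.

```python
import hashlib
import math


def h64(salt: int, x) -> int:
    """Salted BLAKE2b hash (stand-in for a fully random hash function)."""
    digest = hashlib.blake2b(
        repr(x).encode(), digest_size=8, salt=salt.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def annotated_minhashes(s: set, k: int) -> set:
    """Pairs (i, minhash of s under hash function i) for i = 1..k."""
    return {(i, min(h64(i, x) for x in s)) for i in range(1, k + 1)}


def odd_sketch(items, n: int) -> list:
    """Parity bit array of size n; salt 0 is reserved for the sketch hash."""
    a = [0] * n
    for item in items:
        a[h64(0, item) % n] ^= 1
    return a


def odd_sketch_estimate(sk1: list, sk2: list, k: int) -> float:
    """Estimate J from two Odd Sketches built on k annotated minhashes."""
    n = len(sk1)
    z = sum(u ^ v for u, v in zip(sk1, sk2))   # differing positions
    if 2 * z >= n:
        raise ValueError("sketch too small for this dissimilarity")
    return 1.0 + n / (4.0 * k) * math.log(1.0 - 2.0 * z / n)


k, n = 256, 512
s1 = set(range(1000))
s2 = set(range(50, 1050))          # exact J = 950 / 1050 ≈ 0.905
sk1 = odd_sketch(annotated_minhashes(s1, k), n)
sk2 = odd_sketch(annotated_minhashes(s2, k), n)
print(odd_sketch_estimate(sk1, sk2, k))
```

Each sketch here occupies n = 512 bits regardless of the set size. Minhashes on which the two sets agree produce identical pairs that cancel in the exclusive-or, so only the disagreeing minhashes contribute to z.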
4. Evaluation Highlights
In order to evaluate performance, the authors implemented b-bit minwise
hashing and the Odd Sketch in Matlab and compared
both approaches on association rule learning and web duplicate detection
tasks. The main findings are:
• Comparing the accuracy (−log(MSE)) of both approaches on a sparse
data set, the Odd Sketch provides a smaller error than the
b-bit minwise approach even when both approaches use the same
number of permutations. The difference is more dramatic when J is very
high.
• Association rule learning: The authors measured the precision-recall
ratio of both approaches on detecting the pairs of items that
have Jaccard similarity larger than a threshold J0 = 0.9. The results
demonstrate the superiority of the Odd Sketch over 1/2-bit
minwise hashing with respect to precision. The Odd Sketch achieved
up to 20% higher precision while providing similar recall.
• Web duplicate detection:
In this experiment, the authors compared the performance of the two
approaches on web duplicate detection tasks on the bag of words dataset.
They picked three high-dimensional datasets, computed all pairwise
Jaccard similarities among documents, and retrieved every pair
with J ≥ J0. For the sake of comparison, they used the same number
of permutations and considered the thresholds J0 = 0.85 and J0 = 0.90.
The precision-recall ratio was again used as the standard measure. The
Odd Sketch is still better in precision but slightly worse
in recall.
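The precision and recall measures used in both retrieval experiments can be computed as in the sketch below. The similarity values are made-up toy numbers, not results from the paper, and the function name and dictionary layout are assumptions chosen for illustration.

```python
def precision_recall(true_sim: dict, est_sim: dict, j0: float):
    """Precision and recall of retrieving pairs whose similarity is >= j0.

    true_sim and est_sim map a pair identifier to its exact and its
    estimated Jaccard similarity, respectively.
    """
    relevant = {p for p, j in true_sim.items() if j >= j0}
    retrieved = {p for p, j in est_sim.items() if j >= j0}
    hits = len(relevant & retrieved)
    precision = hits / len(retrieved) if retrieved else 1.0
    recall = hits / len(relevant) if relevant else 1.0
    return precision, recall


# Toy values for illustration only.
true_sim = {"ab": 0.95, "ac": 0.91, "bc": 0.80, "bd": 0.40}
est_sim = {"ab": 0.93, "ac": 0.88, "bc": 0.92, "bd": 0.35}
print(precision_recall(true_sim, est_sim, j0=0.9))  # (0.5, 0.5)
```

In this toy case the estimator retrieves one true pair and one false pair and misses one true pair, hence precision and recall of 0.5 each.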
5. Conclusion
The paper presented the Odd Sketch, a compact binary sketch for
estimating the similarity of two sets. The Odd Sketch is time- and space-efficient
and gives good results even in the high similarity regime. Experiments on synthetic
and real-world datasets demonstrate the efficiency of Odd Sketches in
comparison with b-bit minwise hashing schemes on association rule learning and
web duplicate detection tasks. The authors expect the Odd Sketch to be
used in other applications.
6. References
[1] A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-wise
independent permutations. J. Comput. Syst. Sci., 60(3):630–659, 2000.
[2] P. Li and A. C. König. b-bit minwise hashing. In WWW, pages 671–680,
2010.