Sketching and locality sensitive hashing for alignment
Guillaume Marçais
2/15/23
Why do we need sketching and Locality Sensitive Hashing for
alignment?
1
Large scale alignment problems
Cluster N samples based on sequence similarity
• → $N^2/2$ alignment problems
• Speed up the pairwise alignment task?
• Skip hopeless alignments?
Sequence search in a large database
• Avoid aligning to all sequences in the database?
• Approximate nearest neighbor search
• High dimension, non-geometric space
2
Fast growth of sequence databases
[Figure: SRA open accessible bases. x-axis: Year, 2006 to 2022; y-axis: Number of bases, log scale from $10^{10}$ to $10^{17}$.]
• Exponential growth in public and private databases (SRA: 1.5×/year)
• ⟹ hidden exponential slowdown in large scale analysis
3
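The "hidden" slowdown can be made concrete with a little arithmetic. The sketch below assumes, purely for illustration, that the number of samples grows at the same 1.5×/year rate quoted for SRA bases; the starting count of 1,000 samples is hypothetical. Under that assumption an all-pairs workload of about $N^2/2$ alignments grows by $1.5^2 = 2.25$× per year.

```python
def pairwise_alignments(n_samples: int) -> int:
    """Number of unordered pairs among n_samples, i.e. about N^2 / 2."""
    return n_samples * (n_samples - 1) // 2


def yearly_factors(growth: float = 1.5) -> tuple[float, float]:
    """Yearly growth factor of the data and of the all-pairs workload."""
    return growth, growth ** 2


if __name__ == "__main__":
    n = 1_000  # hypothetical starting number of samples
    for year in range(5):
        print(f"year {year}: {n} samples, {pairwise_alignments(n)} pairs")
        n = int(n * 1.5)  # assumed growth rate, taken from the SRA figure
```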
Sequence alignment is hard
No strongly subquadratic time algorithm, most likely (Backurs, Indyk 2015)
Computing the edit distance $E_d$ in time $O(n^{2-\delta})$, $\delta > 0$, violates the Strong Exponential Time Hypothesis (SETH).
• Usual dynamic programming: $O(n^2)$
• Masek and Paterson¹: $O\left(\frac{n^2}{\log n}\right)$
• $n^{2-\delta} \ll \frac{n^2}{\log n} \ll n^2$
• Can’t fundamentally improve
¹ A faster algorithm computing string edit distances (1980)
4
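For reference, the usual $O(n^2)$ dynamic program mentioned above can be sketched as follows. This is a minimal textbook version with unit costs for insertion, deletion and substitution, keeping only two rows of the table; the function name is ours, not from the slides.

```python
def edit_distance(x: str, y: str) -> int:
    """Unit-cost edit (Levenshtein) distance in O(len(x) * len(y)) time."""
    prev = list(range(len(y) + 1))  # distances from the empty prefix of x
    for i, cx in enumerate(x, start=1):
        curr = [i]  # turning x[:i] into the empty string costs i deletions
        for j, cy in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cx
                curr[j - 1] + 1,           # insert cy
                prev[j - 1] + (cx != cy),  # match or substitute
            ))
        prev = curr
    return prev[-1]
```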
Seed and extend paradigm
Main paradigm:
• Find seeds (small exact matches)
• Cluster “coherent” seeds
• Extend between seeds using DP
• Used since the '90s (BLAST, MUMmer)
• Still computationally intensive at large scale
• Many ways to find seeds:
  • k-mers
  • Suffix trees/arrays, FM-index
  • LSH / sketching
[Figure: a query aligned to a reference, with seeds and the extensions between them.]
5
Sketching / Locality Sensitive Hashing
Avoid computing edit distance directly, use proxy measures easier to compute
• LSH: hashing method to avoid fruitless comparisons
• Sketching: sparse representation allowing quick comparison
6
Locality Sensitive Hashing: Make collisions matter
U: universe. T: hash table. $|T| \ll |U|$. $h : U \to [0, |T| - 1]$.
$H = \{h : U \to [0, |T| - 1]\}$
[Figure: elements hashed into a table with buckets 0 to 4.]
Universal Hashing
• Collisions as rare as possible
• $\forall x, y \in U$, $x \neq y$: $\Pr_{h \in H}[h(x) = h(y)] = \frac{1}{|T|}$
Locality Sensitive Hashing
• Collisions between similar elements
• $\forall x, y \in U$:
  $E_d(x, y) \le d_1 \implies \Pr_{h \in H}[h(x) = h(y)] \ge p_1$
  $E_d(x, y) \ge d_2 \implies \Pr_{h \in H}[h(x) = h(y)] \le p_2$
7
Locality Sensitive Hashing Definition
The family H is “$(d_1, d_2, p_1, p_2)$-sensitive” for distance D if there exist $d_1 < d_2$, $p_1 > p_2$ such that for all $x, y \in U$
$D(x, y) \le d_1 \implies \Pr_{h \in H}[h(x) = h(y)] \ge p_1$
$D(x, y) \ge d_2 \implies \Pr_{h \in H}[h(x) = h(y)] \le p_2$
• Low distance ⟺ High collisions
• High distance ⟺ Low collisions
• In between $d_1$, $d_2$: No guarantee
• Probability over choice of $h \in H$, not over the elements x, y
• $d_1 < d_2$: “gapped” LSH
• $d_1 = d_2$: “ungapped” LSH
• Gap not desirable but not always avoidable.
Locality sensitive hash family
Family H of hash functions where similar elements are more likely to have the same value than distant elements.
8
Overlap computation
• Compute overlaps between reads (MHAP²)
• Instance of the “Nearest Neighbor Problem” for edit distance
• Use multiple hash tables
• Orange ellipse in same location as yellow circle
[Figure labels: Reads, Overlap?, Hash Tables]
² Assembling large genomes with single-molecule sequencing and locality-sensitive hashing
9
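The idea of using multiple hash tables to propose overlap candidates can be sketched as below. This is an illustrative toy, not MHAP's implementation: each table uses one seeded min-hash over k-mers (simulated with `hashlib.blake2b`), and two reads become a candidate pair as soon as they share a bucket in any table. The names `candidate_pairs`, `min_hash` and the read identifiers are assumptions made for the example.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def kmer_hash(kmer: str, seed: int) -> int:
    """Seeded 64-bit hash standing in for a random permutation of k-mers."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def min_hash(seq: str, k: int, seed: int) -> str:
    """The k-mer of seq with the smallest hash value under this seed."""
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return min(kmers, key=lambda m: kmer_hash(m, seed))


def candidate_pairs(reads: dict[str, str], k: int, n_tables: int) -> set[tuple[str, str]]:
    """Pairs of read names that collide in at least one of n_tables tables."""
    pairs: set[tuple[str, str]] = set()
    for seed in range(n_tables):
        table: dict[str, list[str]] = defaultdict(list)
        for name, seq in reads.items():
            table[min_hash(seq, k, seed)].append(name)
        for bucket in table.values():
            pairs.update(combinations(sorted(bucket), 2))
    return pairs
```

Only the candidate pairs would then be passed on to an actual alignment; more tables raise the chance that truly similar reads collide at least once, at the cost of more spurious candidates.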
LSH for the edit distance
How to design an LSH for edit distance?
• minHash: LSH for k-mer Jaccard distance
• OMH: Order Min Hash
10
Jaccard distance
Jaccard distance between sets A, B:
$J_d(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|}$
Jaccard between sequences x, y: Jaccard distance of their k-mer sets
$J_d(x, y) = J_d(K(x), K(y))$
• Low $E_d(x, y) \implies$ Low $J_d(x, y)$
• High $E_d(x, y) \nRightarrow$ High $J_d(x, y)$
• Can have false positives, few false negatives
11
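The definition above translates directly into code. The sketch below computes the k-mer set $K(x)$ and the Jaccard distance exactly, without any hashing; the function names are ours.

```python
def kmer_set(seq: str, k: int) -> set[str]:
    """K(seq): the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(x: str, y: str, k: int) -> float:
    """J_d(x, y) = 1 - |K(x) & K(y)| / |K(x) | K(y)|."""
    a, b = kmer_set(x, k), kmer_set(y, k)
    return 1 - len(a & b) / len(a | b)
```

For example, with k = 2, "ACGTAC" and "ACGTTC" share the k-mers AC, CG and GT out of 6 distinct k-mers in total, giving a distance of 0.5.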
MinHash: an LSH for the Jaccard distance
• Permutation of k-mers: $\pi : 4^k \to 4^k$ one-to-one
$H = \{ h_\pi(S) = \operatorname{argmin}_{m \in K(S)} \pi(m) \mid \pi \text{ permutation of k-mers} \}$
• For $\pi$ chosen uniformly at random, every k-mer of $A \cup B$ is equally likely to be the minimum for $\pi$
$\Pr_{h \in H}[h(A) = h(B)] = \frac{|A \cap B|}{|A \cup B|}$
• Unbiased estimator, ungapped LSH
12
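The collision probability can be checked exhaustively on a small example. The sketch below enumerates every ordering of the elements of $A \cup B$ (the restriction of a uniformly random permutation of all k-mers to $A \cup B$ is a uniformly random ordering, so this is enough) and counts how often the two minima coincide. It is a brute-force verification for tiny, non-empty sets only, not a practical method.

```python
from fractions import Fraction
from itertools import permutations


def collision_probability(a: set[str], b: set[str]) -> Fraction:
    """Exact Pr[h(A) = h(B)] over all orderings of A | B (A, B non-empty)."""
    universe = sorted(a | b)
    hits = total = 0
    for perm in permutations(range(len(universe))):
        rank = dict(zip(universe, perm))  # one permutation pi of the k-mers
        if min(a, key=rank.__getitem__) == min(b, key=rank.__getitem__):
            hits += 1
        total += 1
    return Fraction(hits, total)
```

With A = {AC, CG, GT} and B = {CG, GT, TT}, the intersection has 2 elements and the union 4, and the enumeration over the 24 orderings returns exactly 1/2.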
minHash sketch: dimensionality reduction
• Choose L hash functions from H: $h_i$, $1 \le i \le L$
• Sketch of S: vector $Sk(S) = (h_i(S))_{1 \le i \le L}$
• Big compression: Mash³, L = 1000, k = 21, 7000× compression
• Very fast pairwise comparison (Hamming distance between sketches)
$Sk(A) = (\text{CGAG}, \text{TTAC}, \text{CATC}, \text{CCAT}, \text{CATG}, \text{ACAA})$, $Sk(B) = (\text{GTTT}, \text{TTAC}, \text{GTAG}, \text{ATTT}, \text{ACCC}, \text{ACAA})$ → $J_d(K(A), K(B)) \approx 1 - \frac{2}{6}$
³ Mash: fast genome and metagenome distance estimation using MinHash
13
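A minimal sketch of building and comparing such sketches follows. The L hash functions are simulated with seeded `hashlib.blake2b` digests, which is an assumption of this example and not how Mash is implemented. The two vectors `SK_A` and `SK_B` are copied from the slide to show the comparison step: they agree in 2 of 6 positions, so the estimated distance is $1 - 2/6$.

```python
import hashlib


def kmer_hash(kmer: str, seed: int) -> int:
    """Seeded 64-bit hash standing in for the i-th random permutation."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def minhash_sketch(seq: str, k: int, n_hashes: int) -> list[str]:
    """Sk(seq): for each hash function, the k-mer with the smallest hash."""
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return [min(kmers, key=lambda m, s=seed: kmer_hash(m, s))
            for seed in range(n_hashes)]


def sketch_jaccard_distance(sk_a: list[str], sk_b: list[str]) -> float:
    """Normalized Hamming distance between two sketches of equal length."""
    if len(sk_a) != len(sk_b):
        raise ValueError("sketches must have the same length")
    matches = sum(a == b for a, b in zip(sk_a, sk_b))
    return 1 - matches / len(sk_a)


# Sketch vectors shown on the slide (L = 6, k = 4).
SK_A = ["CGAG", "TTAC", "CATC", "CCAT", "CATG", "ACAA"]
SK_B = ["GTTT", "TTAC", "GTAG", "ATTT", "ACCC", "ACAA"]
```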
OMH: LSH for the edit distance
• minHash: LSH for k-mer Jaccard distance
• OMH: Order Min Hash
14
Jaccard ignores k-mer repetition
$x = \overbrace{\text{AAAAAAAAAAAAAAA}}^{n-k}\overbrace{\text{CCCCC}}^{k}$ → {AAAAA, AAAAC, AAACC, AACCC, ACCCC, CCCCC}
$y = \underbrace{\text{AAAAA}}_{k}\underbrace{\text{CCCCCCCCCCCCCCC}}_{n-k}$ → {AAAAA, AAAAC, AAACC, AACCC, ACCCC, CCCCC}
Jaccard distance $J_d(x, y) = 0$, edit distance $E_d(x, y) \ge 1 - \frac{2k}{n}$
Identical k-mer content and high edit distance
15
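The example can be checked numerically for n = 20, k = 5. The sketch below builds both strings, confirms that their k-mer sets are identical, and computes the edit distance with a plain quadratic dynamic program; dividing by n gives the normalized distance used on the slide.

```python
def kmer_set(seq: str, k: int) -> set[str]:
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def edit_distance(x: str, y: str) -> int:
    """Unit-cost edit distance, two-row dynamic program."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]
        for j, cy in enumerate(y, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (cx != cy)))
        prev = curr
    return prev[-1]


N, K = 20, 5
X = "A" * (N - K) + "C" * K  # 15 A's followed by 5 C's
Y = "A" * K + "C" * (N - K)  # 5 A's followed by 15 C's

if __name__ == "__main__":
    print(kmer_set(X, K) == kmer_set(Y, K))  # same k-mer content
    print(edit_distance(X, Y) / N)           # normalized edit distance
```

Here the edit distance is 10 (each edit changes the number of A's by at most one, and the counts differ by 10), so the normalized distance is 0.5, which meets the bound $1 - 2k/n$ with equality.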
Weighted Jaccard: Jaccard on multi-set
• $\chi_A : U \to \{0, 1\}$, $\chi_A(x) = 1 \iff x \in A$
• $\chi^w_A : U \to \mathbb{N}$, $\chi^w_A(x) =$ # of instances of x in A
$J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{\sum_{x \in U} \min(\chi_A(x), \chi_B(x))}{\sum_{x \in U} \max(\chi_A(x), \chi_B(x))}$
$J^w(A, B) = \frac{\sum_{x \in U} \min(\chi^w_A(x), \chi^w_B(x))}{\sum_{x \in U} \max(\chi^w_A(x), \chi^w_B(x))}$
16
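Both formulas are sums of element-wise minima over sums of element-wise maxima, which maps neatly onto Python's `collections.Counter`: `&` takes the minimum of counts and `|` the maximum. The sketch below is a small illustration on made-up strings (not the slide's example); exact fractions are used to avoid rounding.

```python
from collections import Counter
from fractions import Fraction


def kmer_counts(seq: str, k: int) -> Counter:
    """chi^w: number of occurrences of each k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def weighted_jaccard(x: str, y: str, k: int) -> Fraction:
    """J^w: sum of min counts over sum of max counts."""
    a, b = kmer_counts(x, k), kmer_counts(y, k)
    return Fraction(sum((a & b).values()), sum((a | b).values()))


def jaccard(x: str, y: str, k: int) -> Fraction:
    """J: the same formula with every count clipped to 0 or 1."""
    a, b = set(kmer_counts(x, k)), set(kmer_counts(y, k))
    return Fraction(len(a & b), len(a | b))
```

For "AAAAC" and "ACCCC" with k = 2, the k-mer sets {AA, AC} and {AC, CC} give J = 1/3, while the counts (AA×3, AC×1) and (AC×1, CC×3) give J^w = 1/7: the repeated k-mers lower the weighted similarity.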
Weighted Jaccard handles repetitions
$x = \overbrace{\text{AAAAAAAAAAAAAAA}}^{n-k}\overbrace{\text{CCCCC}}^{k}$ → {(AAAAA,1), (AAAAA,2), ..., (AAAAA,11), (AAAAC,1), (AAACC,1), (AACCC,1), (ACCCC,1), (CCCCC,1)}
$y = \underbrace{\text{AAAAA}}_{k}\underbrace{\text{CCCCCCCCCCCCCCC}}_{n-k}$ → {(AAAAA,1), (AAAAC,1), (AAACC,1), (AACCC,1), (ACCCC,1), (CCCCC,1), (CCCCC,2), ..., (CCCCC,11)}
Weighted Jaccard $J^w_d(x, y) = 1 - \frac{k+2}{n}$, edit distance $E_d(x, y) \ge 1 - \frac{2k}{n}$
Weighted Jaccard = Jaccard for multi-sets
17
Jaccard and weighted Jaccard ignore relative order
x = CCCCACCAACACAAAACCC → {AAAA, AAAC, AACA, AACC, ACAA, ACAC, ACCA, ACCC, CAAA, CAAC, CACA, CACC, CCAA, CCAC, CCCA, CCCC}
y = AAAACACAACCCCACCAAA → {AAAA, AAAC, AACA, AACC, ACAA, ACAC, ACCA, ACCC, CAAA, CAAC, CACA, CACC, CCAA, CCAC, CCCA, CCCC}
x, y: de Bruijn sequences over the alphabet {A, C}, contain all 16 possible 4-mers once
$(\sigma!)^{\sigma^{k-1}}$ de Bruijn sequences of length $\sigma^k + k - 1$
$J_d(x, y) = J^w_d(x, y) = 0$, $E_d(x, y) = 0.63$
18
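The de Bruijn property of the two strings is easy to verify mechanically. The sketch below counts the 4-mers of each string, checks that every 4-mer over {A, C} appears exactly once, and computes the edit distance with a plain quadratic dynamic program. Function names are ours.

```python
from collections import Counter
from itertools import product


def kmer_counts(seq: str, k: int) -> Counter:
    """Multiplicity of each k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def is_de_bruijn(seq: str, k: int, alphabet: str) -> bool:
    """True if seq contains every k-mer over alphabet exactly once."""
    counts = kmer_counts(seq, k)
    expected = {"".join(p) for p in product(alphabet, repeat=k)}
    return set(counts) == expected and all(c == 1 for c in counts.values())


def edit_distance(x: str, y: str) -> int:
    """Unit-cost edit distance, two-row dynamic program."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]
        for j, cy in enumerate(y, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (cx != cy)))
        prev = curr
    return prev[-1]


X = "CCCCACCAACACAAAACCC"
Y = "AAAACACAACCCCACCAAA"

if __name__ == "__main__":
    print(is_de_bruijn(X, 4, "AC"), is_de_bruijn(Y, 4, "AC"))
    print(kmer_counts(X, 4) == kmer_counts(Y, 4))  # identical multi-sets
    print(edit_distance(X, Y) / len(X))            # normalized edit distance
```

Since the two multi-sets of 4-mers are identical, both the Jaccard and the weighted Jaccard distances are 0 even though the strings themselves differ.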
Jaccard is different from edit distance
Unlike edit distance, k-mer Jaccard is insensitive to:
1. k-mer repetitions
2. relative positions of k-mers
• k-mer Jaccard is not an LSH for the edit distance
• Still provides big computation saving: asymmetric error model
19
OMH: Order Min Hash
• minHash is an LSH for Jaccard
• OMH is a refinement of minHash
• OMH is sensitive to
• repeated k-mers
• relative order of k-mers
20
minHash & OMH sketches
S = AGTTGAGCGGAAGGTG, k = 2, L = 3, ℓ = 2
[Figure: animation building both sketches. The k-mers of S are annotated with their positions: (AG,0), (GT,1), (TT,2), (TG,3), (GA,4), (AG,5), (GC,6), (CG,7), (GG,8), (GA,9), (AA,10), (AG,11), (GG,12), (GT,13), (TG,14). The list is reordered by each of the L = 3 permutations; for OMH the ℓ = 2 smallest entries are kept in their order of appearance in S.]
Jaccard:
Sk(S) = (GC, TG, GT)
OMH:
Sk(S) = ((GC, CA), (AG, GG), (AG, TG))
21
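A possible way to compute an OMH-style sketch is outlined below. It is a sketch under stated assumptions, not the author's implementation: repeated k-mers are made distinct by pairing each with its occurrence number (as in the weighted Jaccard slide), the L permutations are simulated with seeded `hashlib.blake2b` digests, and for each hash function the ℓ smallest entries are reported in their order of appearance in the sequence. Because the hashes differ from the permutations drawn on the slide, the output will not match the slide's vectors.

```python
import hashlib


def occurrence_kmers(seq: str, k: int) -> list[tuple[str, int, int]]:
    """(k-mer, occurrence number, position) triples in sequence order."""
    seen: dict[str, int] = {}
    out = []
    for pos in range(len(seq) - k + 1):
        kmer = seq[pos:pos + k]
        occ = seen.get(kmer, 0)
        seen[kmer] = occ + 1
        out.append((kmer, occ, pos))
    return out


def pair_hash(kmer: str, occ: int, seed: int) -> int:
    """Seeded hash of a (k-mer, occurrence) pair; stands in for a permutation."""
    digest = hashlib.blake2b(f"{kmer}:{occ}".encode(), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def omh_sketch(seq: str, k: int, n_hashes: int, ell: int) -> list[tuple[str, ...]]:
    """For each hash: the ell smallest (k-mer, occurrence) pairs, in sequence order."""
    items = occurrence_kmers(seq, k)
    sketch = []
    for seed in range(n_hashes):
        smallest = sorted(items, key=lambda t: pair_hash(t[0], t[1], seed))[:ell]
        smallest.sort(key=lambda t: t[2])  # restore order of appearance in seq
        sketch.append(tuple(kmer for kmer, _, _ in smallest))
    return sketch
```

With ℓ = 1 each entry is a single k-mer chosen among (k-mer, occurrence) pairs, which is a min-hash over the multi-set; larger ℓ additionally records the relative order of the selected k-mers.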
OMH is an LSH for edit distance
Theorem: OMH is an LSH for edit distance
There exists (d1, d2, p1, p2) such that OMH is sensitive for the edit distance.
• p1: related to probability of hash collisions of weighted Jaccard
• p2: related to length of increasing sequence given weighted Jaccard
22
Practical considerations with Jaccard sketches
Jaccard:
• Can use canonical k-mers
• Difficult to find independent hashes:
use bottom sketches (L ≪ n)
OMH:
• ℓ times as large (cost to encode order)
• ℓ = 1: LSH / unbiased estimator of
weighted Jaccard
• Can’t use canonical k-mers: double
sketch
23
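Two of the practical points for Jaccard sketches, canonical k-mers and bottom sketches, can be illustrated together. The sketch below is one common formulation and an assumption of this example: a single hash function is used, the s smallest hash values over canonical k-mers form the sketch, and the Jaccard similarity is estimated from the s smallest values of the merged sketches. Names and parameters are ours.

```python
import hashlib

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def kmer_hash(kmer: str) -> int:
    """Single 64-bit hash shared by all sketches."""
    return int.from_bytes(
        hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "little")


def bottom_sketch(seq: str, k: int, size: int) -> list[int]:
    """The `size` smallest hash values among the canonical k-mers of seq."""
    kmers = {canonical(seq[i:i + k]) for i in range(len(seq) - k + 1)}
    return sorted(kmer_hash(m) for m in kmers)[:size]


def estimate_jaccard(sk_a: list[int], sk_b: list[int], size: int) -> float:
    """Fraction of the `size` smallest merged values present in both sketches."""
    merged = sorted(set(sk_a) | set(sk_b))[:size]
    shared = set(sk_a) & set(sk_b)
    return sum(v in shared for v in merged) / len(merged)
```

Because every k-mer is replaced by its canonical form, a sequence and its reverse complement produce the same sketch, so the strand of a read does not matter.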
OMH has a large gap
• |S| = 100, k = 5
• Current proof has a large gap
• What is the smallest gap possible?
• OMH/minHash similar to embedding in Hamming space: gap probably unavoidable
[Figure: probability p versus similarity s, both from 0 to 1, with curves $p_1$ and $p_2$ for ℓ = 2, ℓ = 5, ℓ = 10.]
24

More Related Content

PPT
20140327 - Hashing Object Embedding
PDF
Building graphs to discover information by David Martínez at Big Data Spain 2015
PDF
Learn to Make a Machine Learn Presentation by Dr. Angana Chakraborty
PDF
Scribed lec8
PPTX
PDF
Locality Sensitive Hashing By Spark
PDF
Probabilistic data structures. Part 4. Similarity
PDF
large_scale_search.pdf
20140327 - Hashing Object Embedding
Building graphs to discover information by David Martínez at Big Data Spain 2015
Learn to Make a Machine Learn Presentation by Dr. Angana Chakraborty
Scribed lec8
Locality Sensitive Hashing By Spark
Probabilistic data structures. Part 4. Similarity
large_scale_search.pdf

Similar to Sketching and locality sensitive hashing for alignment (20)

PPT
20140702 xu jiaming hashinglearning - lite
PDF
OpenLSH - a framework for locality sensitive hashing
PDF
Locality sensitive hashing
PDF
Similarity Search in High Dimensions via Hashing
PDF
Hash function landscape
PDF
Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: L...
PDF
Finding similar items in high dimensional spaces locality sensitive hashing
PDF
Graph Regularised Hashing
PDF
Local sensitive hashing & minhash on facebook friend
PPTX
3 - Finding similar items
PDF
04-lsh_theory.pdfCS246: Mining Massive Datasets Jure Leskovec, Stanford Univ...
PDF
Locality-sensitive hashing for search in metric space
PDF
RecSplit Minimal Perfect Hashing
PDF
Graph Regularised Hashing (ECIR'15 Talk)
PPTX
Probabilistic data structure
PPTX
Data Mining Lecture_6.pptx
PDF
Approximate methods for scalable data mining (long version)
PPTX
Locality sensitive hashing
PDF
5 efficient-matching.ppt
PDF
03 antoine
20140702 xu jiaming hashinglearning - lite
OpenLSH - a framework for locality sensitive hashing
Locality sensitive hashing
Similarity Search in High Dimensions via Hashing
Hash function landscape
Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: L...
Finding similar items in high dimensional spaces locality sensitive hashing
Graph Regularised Hashing
Local sensitive hashing & minhash on facebook friend
3 - Finding similar items
04-lsh_theory.pdfCS246: Mining Massive Datasets Jure Leskovec, Stanford Univ...
Locality-sensitive hashing for search in metric space
RecSplit Minimal Perfect Hashing
Graph Regularised Hashing (ECIR'15 Talk)
Probabilistic data structure
Data Mining Lecture_6.pptx
Approximate methods for scalable data mining (long version)
Locality sensitive hashing
5 efficient-matching.ppt
03 antoine
Ad

Recently uploaded (20)

PPTX
PA Analog/Digital System: The Backbone of Modern Surveillance and Communication
PDF
Spectral efficient network and resource selection model in 5G networks
PDF
Review of recent advances in non-invasive hemoglobin estimation
PDF
Building Integrated photovoltaic BIPV_UPV.pdf
PPTX
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
PDF
Electronic commerce courselecture one. Pdf
PDF
Advanced methodologies resolving dimensionality complications for autism neur...
PDF
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
PDF
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
PDF
Bridging biosciences and deep learning for revolutionary discoveries: a compr...
PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
PPTX
MYSQL Presentation for SQL database connectivity
PDF
Approach and Philosophy of On baking technology
PDF
Agricultural_Statistics_at_a_Glance_2022_0.pdf
PDF
Encapsulation_ Review paper, used for researhc scholars
PDF
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
PDF
Reach Out and Touch Someone: Haptics and Empathic Computing
PPTX
20250228 LYD VKU AI Blended-Learning.pptx
PDF
Machine learning based COVID-19 study performance prediction
PPT
“AI and Expert System Decision Support & Business Intelligence Systems”
PA Analog/Digital System: The Backbone of Modern Surveillance and Communication
Spectral efficient network and resource selection model in 5G networks
Review of recent advances in non-invasive hemoglobin estimation
Building Integrated photovoltaic BIPV_UPV.pdf
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
Electronic commerce courselecture one. Pdf
Advanced methodologies resolving dimensionality complications for autism neur...
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
Bridging biosciences and deep learning for revolutionary discoveries: a compr...
Digital-Transformation-Roadmap-for-Companies.pptx
MYSQL Presentation for SQL database connectivity
Approach and Philosophy of On baking technology
Agricultural_Statistics_at_a_Glance_2022_0.pdf
Encapsulation_ Review paper, used for researhc scholars
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
Reach Out and Touch Someone: Haptics and Empathic Computing
20250228 LYD VKU AI Blended-Learning.pptx
Machine learning based COVID-19 study performance prediction
“AI and Expert System Decision Support & Business Intelligence Systems”
Ad

Sketching and locality sensitive hashing for alignment

  • 1. Sketching and locality sensitive hashing for alignment Guillaume Marçais 2/15/23
  • 2. Why do we need sketching and Locality Sensitive Hashing for alignment? 1
  • 3. Large scale alignment problems Cluster N samples based on sequence similarity • → N2/2 alignment problems • Speed-up pairwise alignment task? • Skip hopeless alignments? Sequence search in large database • Avoid aligning to all sequences in database? • Approximate nearer neighbor search • High dimension, non-geometric space 2
  • 4. Large scale alignment problems Cluster N samples based on sequence similarity • → N2/2 alignment problems • Speed-up pairwise alignment task? • Skip hopeless alignments? Sequence search in large database • Avoid aligning to all sequences in database? • Approximate nearer neighbor search • High dimension, non-geometric space ? 2
  • 5. Fast growth of sequence databases 1x1010 1x1011 1x1012 1x1013 1x1014 1x1015 1x1016 1x1017 2006 2008 2010 2012 2014 2016 2018 2020 2022 Number of bases Year SRA open accessible bases • Exponential growth in public and private databases (SRA: 1.5×/year) • =⇒ hidden exponential slow down in large scale analysis 3
  • 6. Sequence alignment is hard No strongly subquadratic time algorithm, most likely (Backurs, Indyk 2015) Computing the edit distance Ed in time O(n2−δ), δ > 0 violates the Strong Exponential Time Hypothesis (SETH). • Usual dynamic programming: O(n2) • 1Masek and Paterson: O n2 log(n) • n2−δ ≪ n2 log(n) ≪ n2 • Can’t fundamentally improve 1 A faster algorithm computing string edit distances (1980) 4
  • 7. Sequence alignment is hard No strongly subquadratic time algorithm, most likely (Backurs, Indyk 2015) Computing the edit distance Ed in time O(n2−δ), δ 0 violates the Strong Exponential Time Hypothesis (SETH). • Usual dynamic programming: O(n2) • 1Masek and Paterson: O n2 log(n) • n2−δ ≪ n2 log(n) ≪ n2 • Can’t fundamentally improve 1 A faster algorithm computing string edit distances (1980) 4
  • 8. Sequence alignment is hard No strongly subquadratic time algorithm, most likely (Backurs, Indyk 2015) Computing the edit distance Ed in time O(n2−δ), δ 0 violates the Strong Exponential Time Hypothesis (SETH). • Usual dynamic programming: O(n2) • 1Masek and Paterson: O n2 log(n) • n2−δ ≪ n2 log(n) ≪ n2 • Can’t fundamentally improve 1 A faster algorithm computing string edit distances (1980) 4
  • 9. Seed and extend paradigm Main paradigm: • Find seeds (small exact matches) • Cluster “coherent” seeds • Extend between seeds using DP • Used since the 90s’ (Blast, MUMmer) • Still computationally intensive for large scale • Many ways to find seeds: • k-mers • Suffix trees/arrays, FM Index • LSH / sketching Reference Query Seed Extend 5
  • 10. Sketching / Locality Sensitive Hashing Avoid computing edit distance directly, use proxy measures easier to compute • LSH: hashing method to avoid fruitless comparisons • Sketching: sparse representation allowing quick comparison 6
  • 11. Locality Sensitive Hashing: Make collisions matters U: universe. T: hash table. |T| ≪ |U|. h : U → [0, |T| − 1]. H = {h : U → [0, |T| − 1]} 0 1 2 3 4 Universal Hashing • Collisions as rare as possible • ∀x, y ∈ U, x ̸= y, Pr h∈H [h(x) = h(y)] = 1 |T| Locality Sensitive Hashing • Collision between similar elements • ∀x, y ∈ U Ed(x, y) ≤ d1 =⇒ Pr h∈H [h(x) = h(y)] ≥ p1 Ed(x, y) ≥ d2 =⇒ Pr h∈H [h(x) = h(y)] ≤ p2 7
  • 12. Locality Sensitive Hashing: Make collisions matters U: universe. T: hash table. |T| ≪ |U|. h : U → [0, |T| − 1]. H = {h : U → [0, |T| − 1]} 0 1 2 3 4 Universal Hashing • Collisions as rare as possible • ∀x, y ∈ U, x ̸= y, Pr h∈H [h(x) = h(y)] = 1 |T| Locality Sensitive Hashing • Collision between similar elements • ∀x, y ∈ U Ed(x, y) ≤ d1 =⇒ Pr h∈H [h(x) = h(y)] ≥ p1 Ed(x, y) ≥ d2 =⇒ Pr h∈H [h(x) = h(y)] ≤ p2 7
  • 13. Locality Sensitive Hashing: Make collisions matters U: universe. T: hash table. |T| ≪ |U|. h : U → [0, |T| − 1]. H = {h : U → [0, |T| − 1]} 0 1 2 3 4 Universal Hashing • Collisions as rare as possible • ∀x, y ∈ U, x ̸= y, Pr h∈H [h(x) = h(y)] = 1 |T| Locality Sensitive Hashing • Collision between similar elements • ∀x, y ∈ U Ed(x, y) ≤ d1 =⇒ Pr h∈H [h(x) = h(y)] ≥ p1 Ed(x, y) ≥ d2 =⇒ Pr h∈H [h(x) = h(y)] ≤ p2 7
  • 14. Locality Sensitive Hashing Definition The family H is “(d1, d2, p1, p2)-sensitive” for distance D if there exists d1 d2, p1 p2 such that for all x, y ∈ U D(x, y) ≤ d1 =⇒ Pr h∈H [h(x) = h(y)] ≥ p1 D(x, y) ≥ d2 =⇒ Pr h∈H [h(x) = h(y)] ≤ p2 • Low distance ⇐⇒ High collisions • High distance ⇐⇒ Low collisions • In between d1, d2: No guarantee Locality sensitive hash family Family H of hash functions where similar elements are more likely to have the same value than distant elements. 8
  • 15. Locality Sensitive Hashing Definition The family H is “(d1, d2, p1, p2)-sensitive” for distance D if there exists d1 d2, p1 p2 such that for all x, y ∈ U D(x, y) ≤ d1 =⇒ Pr h∈H [h(x) = h(y)] ≥ p1 D(x, y) ≥ d2 =⇒ Pr h∈H [h(x) = h(y)] ≤ p2 • Probability over choice of h ∈ H, not over the elements x, y Locality sensitive hash family Family H of hash functions where similar elements are more likely to have the same value than distant elements. 8
  • 16. Locality Sensitive Hashing Definition The family H is “(d1, d2, p1, p2)-sensitive” for distance D if there exists d1 d2, p1 p2 such that for all x, y ∈ U D(x, y) ≤ d1 =⇒ Pr h∈H [h(x) = h(y)] ≥ p1 D(x, y) ≥ d2 =⇒ Pr h∈H [h(x) = h(y)] ≤ p2 • d1 d2: “gapped” LSH • d1 = d2, “ungapped” LSH • Gap not desirable but not always avoidable. Locality sensitive hash family Family H of hash functions where similar elements are more likely to have the same value than distant elements. 8
  • 17. Overlap computation • Compute overlaps between reads (MHAP2) • Instance of “Nearest Neighbor Problem” for edit distance • Use multiple hash tables • Orange ellipse in same location as yellow circle Reads Overlap? 2 Assembling large genomes with single-molecule sequencing and locality-sensitive hashing 9
  • 18. Overlap computation • Compute overlaps between reads (MHAP2) • Instance of “Nearest Neighbor Problem” for edit distance • Use multiple hash tables • Orange ellipse in same location as yellow circle Reads Overlap? 2 Assembling large genomes with single-molecule sequencing and locality-sensitive hashing 9
  • 19. Overlap computation • Compute overlaps between reads (MHAP2) • Instance of “Nearest Neighbor Problem” for edit distance • Use multiple hash tables • Orange ellipse in same location as yellow circle Reads Overlap? Hash Tables 2 Assembling large genomes with single-molecule sequencing and locality-sensitive hashing 9
  • 20. Overlap computation • Compute overlaps between reads (MHAP2) • Instance of “Nearest Neighbor Problem” for edit distance • Use multiple hash tables • Orange ellipse in same location as yellow circle Reads Overlap? Hash Tables 2 Assembling large genomes with single-molecule sequencing and locality-sensitive hashing 9
  • 21. LSH for the edit distance How to design an LSH for edit distance? • minHash: LSH for k-mer Jaccard distance • OMH: Ordered Min Hash 10
  • 22. LSH for the edit distance How to design an LSH for edit distance? • minHash: LSH for k-mer Jaccard distance • OMH: Ordered Min Hash 10
  • 23. Jaccard distance Jaccard distance between sets A, B: Jd(A, B) = 1 − |A ∩ B| |A ∪ B| 11
  • 24. Jaccard distance Jaccard distance between sets A, B: Jd(A, B) = 1 − |A ∩ B| |A ∪ B| Jaccard between sequences x, y: Jaccard distance of their k-mer sets Jd(x, y) = Jd(K(x), K(y)) • Low Ed(x, y) =⇒ Low Jd(x, y) • High Ed(x, y) ̸ =⇒ High Jd(x, y) • Can have false positive, few false negative 11
  • 25. MinHash: an LSH for the Jaccard distance • Permutation of k-mers: π : 4k → 4k one-to-one H = {hπ(S) = argmin m∈K(S) π(m) | π permutation of k-mers} • Fix π, every k-mer of A ∪ B equally likely to be the minimum for π Pr h∈H [h(A) = h(B)] = |A ∩ B| |A ∪ B| • Unbiased estimator, ungapped LSH 12
  • 26. minHash sketch: dimensionality reduction • Choose L hash functions from H: hi, 1 ≤ i ≤ L • Sketch of S: vector Sk(S) = (hi(S))1≤i≤L • Big compression: Mash3 L = 1000, k = 21, 7000 × compression • Very fast pairwise comparison (Hamming distance between sketches) Sk(A) =            CGAG TTAC CATC CCAT CATG ACAA            , Sk(B) =            GTTT TTAC GTAG ATTT ACCC ACAA            → Jd(K(A), K(B)) ≈ 1 − 2 6 3 Mash: fast genome and metagenome distance estimation using MinHash 13
  • 27. OMH: LSH for the edit distance • minHash: LSH for k-mer Jaccard distance • OMH: Ordered Min Hash 14
  • 28. Jaccard ignores k-mer repetition x = n−k z }| { AAAAAAAAAAAAAAA k z }| { CCCCC y = AAAAA | {z } k CCCCCCCCCCCCCCC | {z } n−k 15
  • 29. Jaccard ignores k-mer repetition x = n−k z }| { AAAAAAAAAAAAAAA k z }| { CCCCC → {AAAAA, AAAAC, AAACC, AACCC, ACCCC, CCCCC} y = AAAAA | {z } k CCCCCCCCCCCCCCC | {z } n−k → {AAAAA, AAAAC, AAACC, AACCC, ACCCC, CCCCC} 15
  • 30. Jaccard ignores k-mer repetition x = n−k z }| { AAAAAAAAAAAAAAA k z }| { CCCCC → {AAAAA, AAAAC, AAACC, AACCC, ACCCC, CCCCC} y = AAAAA | {z } k CCCCCCCCCCCCCCC | {z } n−k → {AAAAA, AAAAC, AAACC, AACCC, ACCCC, CCCCC} Jaccard distance Jd(x, y) = 0 Edit distance Ed(x, y) ≥ 1 − 2k n Identical k-mer content and high edit distance 15
  • 31. Weighted Jaccard: Jaccard on multi-set • χA : U → {0, 1}, χA(x) = 1 ⇐⇒ x ∈ A • χw A : U → N, χw A(x) = # of instances of x in A J(A, B) = |A ∩ B| |A ∪ B| = P x∈U min(χA(x), χB(x)) P x∈U max(χA(x), χB(x)) 16
  • 32. Weighted Jaccard: Jaccard on multi-set • χA : U → {0, 1}, χA(x) = 1 ⇐⇒ x ∈ A • χw A : U → N, χw A(x) = # of instances of x in A J(A, B) = |A ∩ B| |A ∪ B| = P x∈U min(χA(x), χB(x)) P x∈U max(χA(x), χB(x)) Jw (A, B) = P x∈U min(χw A(x), χw B(x)) P x∈U max(χw A(x), χw B(x)) 16
  • 33. Weighted Jaccard handles repetitions x = n−k z }| { AAAAAAAAAAAAAAA k z }| { CCCCC → n (AAAAA,1),(AAAAA,2),...,(AAAAA,11) (AAAAC,1),(AAACC,1),(AACCC,1),(ACCCC,1),(CCCCC,1) o y = AAAAA | {z } k CCCCCCCCCCCCCCC | {z } n−k → n (AAAAA,1),(AAAAC,1),(AAACC,1),(AACCC,1),(ACCCC,1), (CCCCC,1),(CCCCC,2),...,(CCCCC,11) o Weighted Jaccard Jw d (x, y) = 1 − k+2 n Edit distance Ed(x, y) ≥ 1 − 2k n Weighted Jaccard = Jaccard for multi-sets 17
  • 34. Jaccard and weighted Jaccard ignore relative order x = CCCCACCAACACAAAACCC y = AAAACACAACCCCACCAAA 18
  • 35. Jaccard and weighted Jaccard ignore relative order x = CCCCACCAACACAAAACCC → n AAAA,AAAC,AACA,AACC,ACAA,ACAC,ACCA,ACCC CAAA,CAAC,CACA,CACC,CCAA,CCAC,CCCA,CCCC o y = AAAACACAACCCCACCAAA → n AAAA,AAAC,AACA,AACC,ACAA,ACAC,ACCA,ACCC CAAA,CAAC,CACA,CACC,CCAA,CCAC,CCCA,CCCC o x, y: de Bruijn sequences, contain all 16 possible 4-mers once (σ!)σk−1 de Bruijn sequences of length σk + σ − 1 18
  • 36. Jaccard and weighted Jaccard ignore relative order x = CCCCACCAACACAAAACCC → n AAAA,AAAC,AACA,AACC,ACAA,ACAC,ACCA,ACCC CAAA,CAAC,CACA,CACC,CCAA,CCAC,CCCA,CCCC o y = AAAACACAACCCCACCAAA → n AAAA,AAAC,AACA,AACC,ACAA,ACAC,ACCA,ACCC CAAA,CAAC,CACA,CACC,CCAA,CCAC,CCCA,CCCC o x, y: de Bruijn sequences, contain all 16 possible 4-mers once (σ!)σk−1 de Bruijn sequences of length σk + σ − 1 Jd(x, y) = Jw d (x, y) = 0 Ed(x, y) = 0.63 18
  • 37. Jaccard is different from edit distance Unlike edit distance, k-mer Jaccard is insensitive to: 1. k-mer repetitions 2. relative positions of k-mers • k-mer Jaccard is not an LSH for the edit distance • Still provides big computation saving: asymmetric error model 19
  • 39. OMH: Order Min Hash • minHash is an LSH for Jaccard • OMH is a refinement of minHash • OMH is sensitive to • repeated k-mers • relative order of k-mers 20
  • 47. minHash OMH sketches • S = AGTTGAGCGGAAGGTG, k = 2, L = 3, ℓ = 2 • 2-mers of S, each paired with its 0-based position in S: (AG,0), (GT,1), (TT,2), (TG,3), (GA,4), (AG,5), (GC,6), (CG,7), (GG,8), (GA,9), (AA,10), (AG,11), (GG,12), (GT,13), (TG,14) • [Figure: step-by-step animation in which the 2-mer list is reordered under each of the L = 3 hashes; the minHash side shows 2-mers only, the OMH side shows 2-mers with their positions]
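  • 48. minHash OMH sketches • S = AGTTGAGCGGAAGGTG, k = 2, L = 3, ℓ = 2 • Jaccard: $\mathrm{Sk}(S) = \begin{pmatrix} \texttt{GC} \\ \texttt{TG} \\ \texttt{GT} \end{pmatrix}$ • OMH: $\mathrm{Sk}(S) = \begin{pmatrix} \texttt{GC} & \texttt{CA} \\ \texttt{AG} & \texttt{GG} \\ \texttt{AG} & \texttt{TG} \end{pmatrix}$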
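A minimal sketch of how such sketches might be computed, under stated assumptions: each of the L hash functions is simulated by a seeded BLAKE2b digest (an illustrative choice, not the talk's), repeated k-mers are made distinct by an occurrence index as in the multi-set notation of the weighted Jaccard slides, and the helper names are hypothetical. Since the hash functions differ from those behind the slide, the output will not reproduce the matrices shown there.

```python
import hashlib


def seeded_hash(seed, kmer, occ=0):
    """64-bit hash of (k-mer, occurrence index) under a given seed."""
    data = f"{seed}:{kmer}:{occ}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def positioned_kmers(s, k):
    """(k-mer, occurrence index, position) for every k-mer of s."""
    seen, out = {}, []
    for pos in range(len(s) - k + 1):
        kmer = s[pos:pos + k]
        occ = seen.get(kmer, 0)
        seen[kmer] = occ + 1
        out.append((kmer, occ, pos))
    return out


def minhash_sketch(s, k, L):
    """For each of the L hashes, the k-mer with the smallest hash value."""
    distinct = {s[i:i + k] for i in range(len(s) - k + 1)}
    return [min(distinct, key=lambda m: seeded_hash(seed, m))
            for seed in range(L)]


def omh_sketch(s, k, L, ell):
    """For each hash: the ell smallest k-mers, listed in order of position."""
    items = positioned_kmers(s, k)
    sketch = []
    for seed in range(L):
        smallest = sorted(items,
                          key=lambda t: seeded_hash(seed, t[0], t[1]))[:ell]
        smallest.sort(key=lambda t: t[2])  # restore order of appearance in s
        sketch.append(tuple(t[0] for t in smallest))
    return sketch


S = "AGTTGAGCGGAAGGTG"
print(minhash_sketch(S, k=2, L=3))
print(omh_sketch(S, k=2, L=3, ell=2))
```

The final re-sort by position is the step that makes each OMH row depend on the relative order of k-mers, which a plain minHash row cannot capture.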
  • 49. OMH is an LSH for edit distance • Theorem: OMH is an LSH for edit distance. There exist $(d_1, d_2, p_1, p_2)$ such that OMH is $(d_1, d_2, p_1, p_2)$-sensitive for the edit distance. • $p_1$: related to the probability of hash collisions of weighted Jaccard • $p_2$: related to the length of an increasing sequence given weighted Jaccard
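For reference, the usual meaning of sensitivity for a family of hash functions can be written as follows. This is the standard textbook definition, restated here on the assumption that it matches the notation used in the theorem; $\mathcal{H}$ (the hash family) and $d$ (the distance, here the edit distance) are symbols introduced for this restatement.

```latex
\[
\begin{aligned}
d(x, y) \le d_1 &\implies \Pr_{h \in \mathcal{H}}\bigl[h(x) = h(y)\bigr] \ge p_1, \\
d(x, y) \ge d_2 &\implies \Pr_{h \in \mathcal{H}}\bigl[h(x) = h(y)\bigr] \le p_2,
\end{aligned}
\qquad \text{with } d_1 < d_2 \text{ and } p_1 > p_2 .
\]
```

Close sequences collide with probability at least $p_1$ and distant ones with probability at most $p_2$; pairs with distance between $d_1$ and $d_2$ carry no guarantee.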
  • 50. Practical considerations with Jaccard sketches Jaccard: • Can use canonical k-mers • Difficult to find independent hashes: use bottom sketches (L ≪ n) OMH: • ℓ times as large (cost to encode order) • ℓ = 1: LSH / unbiased estimator of weighted Jaccard • Can’t use canonical k-mers: double sketch 23
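The two Jaccard-side points can be illustrated with a short sketch. It makes several assumptions not taken from the talk: a canonical k-mer is the lexicographically smaller of a k-mer and its reverse complement, the single hash function is a BLAKE2b digest, and the bottom sketch keeps the L smallest hash values. All function names are hypothetical.

```python
import hashlib

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(s):
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return s.translate(_COMPLEMENT)[::-1]


def canonical(kmer):
    """Lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def kmer_hash(kmer):
    """64-bit hash of a k-mer (one hash function for the whole sketch)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bottom_sketch(s, k, L):
    """The L smallest hash values over the distinct canonical k-mers of s."""
    distinct = {canonical(s[i:i + k]) for i in range(len(s) - k + 1)}
    return sorted(kmer_hash(m) for m in distinct)[:L]


S = "AGTTGAGCGGAAGGTG"
forward = bottom_sketch(S, k=4, L=5)
reverse = bottom_sketch(reverse_complement(S), k=4, L=5)
print(forward == reverse)  # True
```

With canonical k-mers the sketch does not depend on which strand was read. An order-sensitive sketch such as OMH cannot use this trick because reverse complementation reverses k-mer order, which is consistent with the slide's suggestion of a double sketch.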
  • 51. OMH has a large gap • |S| = 100, k = 5 • Current proof has a large gap • What is the smallest gap possible? • OMH/minHash similar to embedding in Hamming space: gap probably unavoidable • [Figure: curves p1 and p2 for ℓ = 2, ℓ = 5 and ℓ = 10; axes: probability p and similarity s, both ranging from 0 to 1]