Advanced Algorithms – COMS31900
Pattern matching part three
Hamming distance
Benjamin Sach
Exact pattern matching
Input: A text string T (length n) and a pattern string P (length m)
Goal: Find all the locations where P matches in T
P matches at location i iff for all $0 \le j < m$ we have that $P[j] = T[i + j]$
(our strings are zero-indexed: P[j] is the j-th character of P and T[i + j] is the (i + j)-th character of T, e.g. T[2] = c)
[Figure: the text T, positions 0 to 12, with the pattern P = aba shown aligned at location 6]
• A naive algorithm takes O(nm) time
• Many O(n) time algorithms are known (for example the KMP algorithm)
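The naive O(nm) method mentioned on this slide can be sketched directly from the definition of a match. This is a minimal illustration only: the function name `naive_match` and the example strings are assumptions made here, not taken from the slides.

```python
def naive_match(text, pattern):
    """Return every location i where pattern matches text, in O(nm) time."""
    n, m = len(text), len(pattern)
    matches = []
    for i in range(n - m + 1):
        # P matches at location i iff P[j] == T[i + j] for all 0 <= j < m.
        if all(pattern[j] == text[i + j] for j in range(m)):
            matches.append(i)
    return matches


# Made-up example strings (not the text from the slide figure).
print(naive_match("abababa", "aba"))  # [0, 2, 4]
```

Each of the n − m + 1 alignments is checked with up to m character comparisons, which is where the O(nm) bound comes from.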
Pattern matching with mismatches
Input: A text string T (length n) and a pattern string P (length m)
Goal: For every alignment i, output Ham(i), the Hamming distance between P and T[i . . . i + m − 1]
The Hamming distance is the number of mismatches, i.e. the number of distinct j such that $P[j] \ne T[i + j]$
[Figure: the text T, positions 0 to 12, with the pattern P shown at alignment 8]
In the running example, Ham(4) = 1, Ham(5) = 4, Ham(6) = 1, Ham(7) = 3 and Ham(8) = 3
A naive algorithm for this problem takes O(nm) time . . . but we can do better
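As a baseline, the naive O(nm) computation of Ham(i) can be written straight from the definition. The sketch below is illustrative only: `naive_hamming` and the example strings are hypothetical names and data, not taken from the slides.

```python
def naive_hamming(text, pattern):
    """Return [Ham(0), Ham(1), ...] for every alignment, in O(nm) time."""
    n, m = len(text), len(pattern)
    distances = []
    for i in range(n - m + 1):
        # Ham(i) counts the positions j with P[j] != T[i + j].
        distances.append(sum(1 for j in range(m) if pattern[j] != text[i + j]))
    return distances


# Made-up example strings (not the text from the slide figure).
print(naive_hamming("abcab", "ab"))  # [0, 2, 2, 0]
```

An alignment with Ham(i) = 0 is an exact match, so this problem generalises exact pattern matching.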
It’s a small alphabet after all
Imagine that the alphabet contains only a small number of different symbols, which we will consider individually. . .
Replace all d symbols with 1 and everything else with 0
We denote these new strings $T_d$ and $P_d$ and consider. . .
$(T_d \otimes P_d)[i] = \sum_{j=0}^{m-1} P_d[j] \times T_d[i + j]$
Each term $P_d[j] \times T_d[i + j]$ is 1 iff $P[j] = T[i + j] = d$, so this is exactly the number of matching ds at the i-th alignment.
For example, $(T_d \otimes P_d)[4] = (1 \times 1) + (0 \times 0) + (1 \times 0) + (1 \times 1) = 2$
[Figure: the 0/1 strings $T_d$ (positions 0 to 12) and $P_d$ (length m)]
How can we work out $(T_d \otimes P_d)$ quickly?
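Before looking for a fast method, it may help to evaluate the definition directly. The sketch below builds the 0/1 strings for one symbol and computes the sum above for every alignment in O(nm) time. The helper names `indicator` and `cross_correlation_naive` and the example strings are assumptions made for illustration.

```python
def indicator(s, symbol):
    """Replace every occurrence of symbol with 1 and everything else with 0."""
    return [1 if ch == symbol else 0 for ch in s]


def cross_correlation_naive(t_bits, p_bits):
    """Return sum_j p_bits[j] * t_bits[i + j] for every alignment i."""
    n, m = len(t_bits), len(p_bits)
    return [
        sum(p_bits[j] * t_bits[i + j] for j in range(m))
        for i in range(n - m + 1)
    ]


# Made-up example strings (not the ones in the slide figure).
T, P = "dbddad", "dad"
T_d, P_d = indicator(T, "d"), indicator(P, "d")
print(T_d)                                # [1, 0, 1, 1, 0, 1]
print(P_d)                                # [1, 0, 1]
print(cross_correlation_naive(T_d, P_d))  # [2, 1, 1, 2]
```

Entry i of the result is the number of positions where P and T both contain a d at alignment i.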
Last year on COMS21103. . .
Let A and B be (n − 1) degree polynomials which can be expressed as. . .
$A(x) = \sum_{i=0}^{n-1} a_i x^i$ and $B(x) = \sum_{i=0}^{n-1} b_i x^i$
(or be seen as arrays of length n, with $A[i] = a_i$ and $B[i] = b_i$)
The polynomial C = A × B can be expressed as. . .
$C(x) = \sum_{i=0}^{2n-2} c_i x^i$ where $c_i = \sum_{j=0}^{i} a_j \times b_{i-j}$ (taking $a_k = b_k = 0$ for $k \ge n$), with $C[i] = c_i$
By the magic of the FFT we can compute C (i.e. every $c_i$) in O(n log n) time.
Compare $c_i$ with $\sum_{j=0}^{m-1} P_d[j] \, T_d[i + j]$: these look similar!
Hint 1: Let $A = P_d$ and $B = T_d$, so that $A[i] = a_i = P_d[i]$ and $B[i] = b_i = T_d[i]$
The polynomial C = A × B then has coefficients $c_i = \sum_{j=0}^{i} P_d[j] \, T_d[i - j]$
This has $T_d[i - j]$ where the cross-correlation has $T_d[i + j]$.
Hint 2: Let $A = P_d$ (padded with zeros) and $B = T_d$
$P_d$ has length m, so zeros are appended to make A an array of length n: $A[i] = P_d[i]$ for $i < m$ and $A[i] = 0$ otherwise, with $B[i] = b_i = T_d[i]$
The coefficients of C = A × B are still $c_i = \sum_{j=0}^{i} A[j] \, T_d[i - j]$
Hint 3: Let $A = P_d$ (padded with zeros) and $B = T_d$ (reversed). . . now C contains $(T_d \otimes P_d)$
Here $A[i] = a_i = P_d[i]$ and, with zero-indexed arrays of length n, $B[i] = b_i = T_d[n - 1 - i]$
The coefficients of C = A × B satisfy $c_{n-1-i} = \sum_{j=0}^{m-1} P_d[j] \, T_d[i + j] = (T_d \otimes P_d)[i]$ for every $0 \le i \le n - m$
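The padding-and-reversal trick can be checked on a small input. The sketch below uses a quadratic schoolbook polynomial product as a stand-in for the FFT, so it only demonstrates where the cross-correlation values end up in C, not the O(n log n) running time. The function names and the example arrays are assumptions made for illustration.

```python
def poly_multiply(a, b):
    """Schoolbook product: c[i] = sum_j a[j] * b[i - j]. O(n^2) stand-in for the FFT."""
    c = [0] * (len(a) + len(b) - 1)
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(b):
            c[i + j] += a_i * b_j
    return c


def cross_correlation_via_product(t_bits, p_bits):
    """Read the cross-correlation of t_bits and p_bits out of a polynomial product."""
    n, m = len(t_bits), len(p_bits)
    a = p_bits + [0] * (n - m)  # pattern padded with zeros to length n
    b = t_bits[::-1]            # text reversed: b[k] = t_bits[n - 1 - k]
    c = poly_multiply(a, b)
    # c[n - 1 - i] = sum_j p_bits[j] * t_bits[i + j] for 0 <= i <= n - m.
    return [c[n - 1 - i] for i in range(n - m + 1)]


# Made-up 0/1 arrays (not the ones in the slide figure).
print(cross_correlation_via_product([1, 0, 1, 1, 0, 1], [1, 0, 1]))  # [2, 1, 1, 2]
```

Reversing the text turns the index i + j of the cross-correlation into the index i − j that polynomial multiplication produces, which is why the answers appear in C in reverse order.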
Computing cross-correlations via the FFT
Let $T_\sigma$ be T with all σs replaced with 1s and everything else replaced with 0s ($P_\sigma$ is defined analogously)
$(T_\sigma \otimes P_\sigma)[i] = \sum_{j=0}^{m-1} P_\sigma[j] \times T_\sigma[i + j]$ is exactly the number of matching σs at the i-th alignment.
[Figure: the 0/1 strings $T_\sigma$ (positions 0 to 12) and $P_\sigma$ (length m), shown at alignment 4]
We can compute $(T_\sigma \otimes P_\sigma)$ in O(n log n) time via the FFT,
i.e. after O(n log n) time we have $(T_\sigma \otimes P_\sigma)[i]$ for every i
$(T_\sigma \otimes P_\sigma)$ is called the cross-correlation of $T_\sigma$ and $P_\sigma$
It is also very often (but technically incorrectly) called the convolution
Cross-correlations are used a lot in the pattern matching literature (but they mostly call them convolutions)
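One possible way to realise the O(n log n) bound is sketched below using only the standard library. It assumes a textbook recursive radix-2 FFT and reads the cross-correlation out of the product of the pattern with the reversed text. The function names `fft` and `cross_correlation_fft` are hypothetical, and a production implementation would more likely call an optimised FFT library.

```python
import cmath


def fft(values, invert=False):
    """Recursive radix-2 FFT; len(values) must be a power of two.

    With invert=True this is the unnormalised inverse transform.
    """
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], invert)
    odd = fft(values[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def cross_correlation_fft(t_bits, p_bits):
    """Cross-correlation of two 0/1 arrays for every alignment, via the FFT."""
    n, m = len(t_bits), len(p_bits)
    if m > n:
        return []
    size = 1
    while size < n + m - 1:  # room for the full product of length n + m - 1
        size *= 2
    a = fft([complex(x) for x in p_bits] + [0j] * (size - m))
    b = fft([complex(x) for x in reversed(t_bits)] + [0j] * (size - n))
    c = fft([x * y for x, y in zip(a, b)], invert=True)
    # Coefficient n - 1 - i of the product holds the value for alignment i.
    # Divide by size to normalise, then round away floating-point error.
    return [round(c[n - 1 - i].real / size) for i in range(n - m + 1)]


# Made-up 0/1 arrays (not the ones in the slide figure).
print(cross_correlation_fft([1, 0, 1, 1, 0, 1], [1, 0, 1]))  # [2, 1, 1, 2]
```

Rounding is safe here because the true values are small integers; for very long inputs the accumulated floating-point error would need more care.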
It’s a small alphabet after all
Let Σ denote the set of alphabet symbols and |Σ| be its size (in the example Σ = {a, b, c, d} so |Σ| = 4)
We have seen how to find all matches with a single symbol in O(n log n) time
Algorithm Summary
• Construct $T_\sigma$ and $P_\sigma$ for each symbol σ in Σ (O(n|Σ|) time)
• Compute $(T_\sigma \otimes P_\sigma)$ for each symbol σ in Σ (O(n|Σ| log n) time)
• For every i, compute $\mathrm{Ham}(i) = m - \sum_{\sigma \in \Sigma} (T_\sigma \otimes P_\sigma)[i]$ (O(n|Σ|) time)
Here $(T_\sigma \otimes P_\sigma)[i]$ counts the matches involving σ, the sum counts all matches, and mismatches = m − matches
This takes O(n|Σ| log n) total time (and uses O(n) space)
However, |Σ| could be as big as m... in which case, this is worse than the naive method!
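The three steps of the summary can be sketched as follows. To stay short, the sketch uses a naive O(nm) cross-correlation as a stand-in; an FFT-based routine would be needed to reach the O(n|Σ| log n) bound. It also loops only over the symbols that occur in P, since other symbols can never contribute a match. All function names and example strings are assumptions made for illustration.

```python
def cross_correlation(t_bits, p_bits):
    """Naive stand-in; an FFT-based version would take O(n log n) per symbol."""
    n, m = len(t_bits), len(p_bits)
    return [
        sum(p_bits[j] * t_bits[i + j] for j in range(m))
        for i in range(n - m + 1)
    ]


def hamming_small_alphabet(text, pattern):
    """Ham(i) = m - (sum over symbols of the per-symbol match counts)."""
    n, m = len(text), len(pattern)
    matches = [0] * (n - m + 1)
    for symbol in set(pattern):
        # Step 1: build the 0/1 strings for this symbol.
        t_bits = [1 if ch == symbol else 0 for ch in text]
        p_bits = [1 if ch == symbol else 0 for ch in pattern]
        # Step 2: count the matches involving this symbol at every alignment.
        for i, count in enumerate(cross_correlation(t_bits, p_bits)):
            matches[i] += count
    # Step 3: mismatches = m - matches.
    return [m - total for total in matches]


# Made-up example strings (not the text from the slide figure).
print(hamming_small_alphabet("abcab", "ab"))  # [0, 2, 2, 0]
```

Only one pair of 0/1 strings is alive at a time, which is consistent with the O(n) space figure quoted on the slide.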
Coping with a large alphabet
We will now see an algorithm which runs in $O(n \sqrt{m} \log n)$ time regardless of the alphabet size
Definition: An alphabet symbol is frequent if it occurs at least $\sqrt{m}$ times in P.
Example: P = adbcabbda (positions 0 to 8), so m = 9 and $\sqrt{m} = 3$
a is frequent, b is frequent; c and d are infrequent
Key idea: Our algorithm will have two main stages:
Stage 1 will count all the matches involving frequent symbols (at each alignment of P and T)
Stage 2 will count all the matches involving infrequent symbols (at each alignment of P and T)
The total number of matches is the sum of the matches from Stage 1 and Stage 2
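The frequent/infrequent split can be computed with a single pass over P. The sketch below is a minimal illustration; `split_symbols` is a hypothetical name, and the comparison `count * count >= m` is used in place of `count >= sqrt(m)` to avoid floating point.

```python
from collections import Counter


def split_symbols(pattern):
    """Split the symbols of pattern into (frequent, infrequent) sets."""
    m = len(pattern)
    counts = Counter(pattern)
    # A symbol is frequent if it occurs at least sqrt(m) times in P.
    frequent = {s for s, count in counts.items() if count * count >= m}
    infrequent = set(counts) - frequent
    return frequent, infrequent


# The pattern from the slide: m = 9, so the threshold is 3 occurrences.
frequent, infrequent = split_symbols("adbcabbda")
print(sorted(frequent))    # ['a', 'b']
print(sorted(infrequent))  # ['c', 'd']
```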
The frequent/infrequent symbols trick
Definition: An alphabet symbol is frequent if it occurs at least
√
m times in P.
Stage 1: For each alignment i, count the number of matches involving frequent symbols:
Consider each frequent symbol σ ∈ Σ separately and compute (Tσ ⊗ Pσ)
How many frequent symbols can there be?
Assume that there at least (
√
m + 1) freq. symbols
each occurs at least
√
m times. . . (
√
m + 1)
√
m > m
P a d bc ab b da
0 1 2 3 4 5 6 7 8
m = 9
a is frequent , b is frequent
c and d are infrequent
in O(n log n) time (per symbol σ) using cross-correlations
The frequent/infrequent symbols trick
Definition: An alphabet symbol is frequent if it occurs at least
√
m times in P.
Stage 1: For each alignment i, count the number of matches involving frequent symbols:
Consider each frequent symbol σ ∈ Σ separately and compute (Tσ ⊗ Pσ)
How many frequent symbols can there be?
Assume that there at least (
√
m + 1) freq. symbols
each occurs at least
√
m times. . . (
√
m + 1)
√
m > m Contradiction!
P a d bc ab b da
0 1 2 3 4 5 6 7 8
m = 9
a is frequent , b is frequent
c and d are infrequent
in O(n log n) time (per symbol σ) using cross-correlations
The frequent/infrequent symbols trick
Definition: An alphabet symbol is frequent if it occurs at least
√
m times in P.
Stage 1: For each alignment i, count the number of matches involving frequent symbols:
Consider each frequent symbol σ ∈ Σ separately and compute (Tσ ⊗ Pσ)
How many frequent symbols can there be?
Assume that there at least (
√
m + 1) freq. symbols
so there are at most
√
m frequent symbols
each occurs at least
√
m times. . . (
√
m + 1)
√
m > m Contradiction!
P a d bc ab b da
0 1 2 3 4 5 6 7 8
m = 9
a is frequent , b is frequent
c and d are infrequent
in O(n log n) time (per symbol σ) using cross-correlations
The frequent/infrequent symbols trick
Definition: An alphabet symbol is frequent if it occurs at least √m times in P.
Example: a pattern P of length m = 9 (positions 0 to 8) over the symbols a, b, c, d, in which a and b each occur 3 times, c occurs once and d occurs twice.
a is frequent, b is frequent
c and d are infrequent
Stage 1: For each alignment i, count the number of matches involving frequent symbols:
Consider each frequent symbol σ ∈ Σ separately and compute (Tσ ⊗ Pσ) in O(n log n) time (per symbol σ) using cross-correlations
How many frequent symbols can there be?
Assume that there are at least (√m + 1) frequent symbols
each occurs at least √m times, so P would have length at least (√m + 1)√m > m. Contradiction!
so there are at most √m frequent symbols
So Stage 1 takes O(n√m log n) time.
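Stage 1 can be sketched as follows. This is an illustration under stated assumptions, not the lecturer's code: Tσ and Pσ are taken to be the 0/1 indicator strings of σ in T and P, the cross-correlation is computed with a small hand-written radix-2 FFT, and the function names and example strings are hypothetical.

```python
import cmath
from collections import Counter


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two (unnormalised)."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out


def cross_correlate(t_bits, p_bits):
    """Return sum_j p_bits[j] * t_bits[i + j] for every alignment i."""
    n, m = len(t_bits), len(p_bits)
    size = 1
    while size < n + m:
        size *= 2
    ft = fft([complex(x) for x in t_bits] + [0j] * (size - n))
    # Correlation with P equals convolution with P reversed.
    fp = fft([complex(x) for x in reversed(p_bits)] + [0j] * (size - m))
    conv = fft([x * y for x, y in zip(ft, fp)], invert=True)
    return [round(conv[i + m - 1].real / size) for i in range(n - m + 1)]


def stage1_frequent_matches(T, P):
    """Count, per alignment, the matches that involve frequent symbols."""
    n, m = len(T), len(P)
    counts = Counter(P)
    # Frequent: at least sqrt(m) occurrences in P, i.e. count^2 >= m.
    frequent = [s for s, c in counts.items() if c * c >= m]
    total = [0] * (n - m + 1)
    for s in frequent:  # at most sqrt(m) symbols, one correlation each
        t_bits = [1 if ch == s else 0 for ch in T]
        p_bits = [1 if ch == s else 0 for ch in P]
        for i, v in enumerate(cross_correlate(t_bits, p_bits)):
            total[i] += v
    return frequent, total


# Hypothetical strings (not the slide's own): a and b are frequent in P.
T = "dbabcccdababcadb"
P = "ababcadbd"
frequent, stage1 = stage1_frequent_matches(T, P)
```

Comparing counts with `c * c >= m` avoids taking a floating-point square root when deciding which symbols are frequent.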
The infrequent/frequent symbols trick
Definition: A symbol is infrequent if it occurs fewer than √m times in P.
Every symbol is either frequent or infrequent
a is frequent, b is frequent
c and d are infrequent
[Figure: the text T, with the pattern P and the array A aligned below it]
Stage 2: Count all matches involving infrequent symbols.
Construct an array A of length (n − m + 1), which initially contains all zeros
A = 0 0 0 0 0 0 0 0
Make a single pass through T. . .
For each character T[k] (where 0 ≤ k < n)
  If T[k] is infrequent. . .
    For all j such that T[k] = P[j],
      Increase A[k − j] by one (except when (k − j) < 0)
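The single pass of Stage 2 can be sketched like this. It is a minimal illustration, not the lecturer's code: the function name and example strings are hypothetical, and the upper bound check `k - j <= n - m` is an addition made here so that the index always stays inside A (the slide only mentions the case (k − j) < 0).

```python
from collections import Counter, defaultdict


def stage2_infrequent_matches(T, P):
    """Count, per alignment, the matches that involve infrequent symbols."""
    n, m = len(T), len(P)
    counts = Counter(P)
    # For each infrequent symbol, list the positions j where it occurs in P.
    positions = defaultdict(list)
    for j, ch in enumerate(P):
        if counts[ch] * counts[ch] < m:  # fewer than sqrt(m) occurrences
            positions[ch].append(j)

    A = [0] * (n - m + 1)
    for k, ch in enumerate(T):
        # Frequent symbols and symbols absent from P have no entry here.
        for j in positions.get(ch, ()):
            i = k - j
            if 0 <= i <= n - m:  # skip alignments that fall outside A
                A[i] += 1
    return A


# Hypothetical strings (not the slide's own): c and d are infrequent in P.
T = "dbabcccdababcadb"
P = "ababcadbd"
stage2 = stage2_infrequent_matches(T, P)
```

Each character T[k] is compared against fewer than √m positions of P, so the pass does O(n√m) work after the position lists are built.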
Example step: k = 4. T[4] is infrequent and T[4] = P[j] for j = 4, so k − j = 0 and A[0] is increased by one.
A = 1 0 0 0 0 0 0 0
Example step: k = 5. T[5] is infrequent and T[5] = P[j] for j = 4, so k − j = 1 and A[1] is increased by one.
A = 1 1 0 0 0 0 0 0
Example step: k = 6. T[6] is infrequent and T[6] = P[j] for j = 4, so k − j = 2 and A[2] is increased by one.
A = 1 1 1 0 0 0 0 0
Example step: k = 7, j = 8. T[7] is infrequent and T[7] = P[8], but (k − j) < 0, so A is not changed.
A = 1 1 1 0 0 0 0 0
The infrequent/frequent symbols trick
Definition: A symbol is infrequent if it occurs fewer than √m times in P.
Every symbol is either frequent or infrequent.
Example: a is frequent, b is frequent; c and d are infrequent.
(Figure: the text T, the pattern P and the array A.)
Stage 2: Count all matches involving infrequent symbols.
Construct an array A of length (n − m + 1), which initially contains all zeros.
Make a single pass through T. . .
For each character T[k] (where 0 ≤ k < n),
If T[k] is infrequent. . .
For all j such that T[k] = P[j],
Increase A[k − j] by one (except when (k − j) < 0)
Example step: k = 7, j = 6, so k − j = 1 and A[1] is increased by one.
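The Stage 2 pass can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the lecture's own code: the function name count_infrequent_matches and the test strings are made up for illustration. The extra check that k − j is at most n − m is added because A has only n − m + 1 entries.

```python
from collections import Counter


def count_infrequent_matches(T, P):
    """Return A, where A[i] counts matches at alignment i that involve
    an infrequent symbol (one occurring fewer than sqrt(m) times in P)."""
    n, m = len(T), len(P)
    counts = Counter(P)

    # For each infrequent symbol, list every position j where it occurs in P.
    # "count < sqrt(m)" is tested as "count * count < m" to stay in integers.
    positions = {}
    for j, c in enumerate(P):
        if counts[c] * counts[c] < m:
            positions.setdefault(c, []).append(j)

    A = [0] * (n - m + 1)
    # Single pass through T: each T[k] updates one entry per matching j.
    for k, c in enumerate(T):
        for j in positions.get(c, ()):
            i = k - j
            # Skip alignments that fall off either end of A.
            if 0 <= i <= n - m:
                A[i] += 1
    return A
```

For the hypothetical inputs T = "abacbbac" and P = "abac", the symbols b and c are infrequent and the call returns [2, 0, 0, 1, 2].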
After the pass through T: A = 1 1 2 1 1 3 2 0
What is A[i]?
Fact: A[i] is the number of matches at alignment i involving an infrequent symbol.
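The Fact can be read as a definition and checked directly. The sketch below is an assumption-laden illustration (the function name and test strings are hypothetical): it counts, alignment by alignment, the positions where P and T agree on an infrequent symbol. It is a slow reference in O(nm) time, useful only for checking the faster pass.

```python
from collections import Counter


def infrequent_matches_bruteforce(T, P):
    """For every alignment i, count positions j with P[j] == T[i + j]
    where the matching symbol occurs fewer than sqrt(m) times in P."""
    n, m = len(T), len(P)
    counts = Counter(P)
    # count < sqrt(m) is tested as count * count < m to avoid floats.
    infrequent = {c for c, cnt in counts.items() if cnt * cnt < m}
    return [
        sum(1 for j in range(m) if P[j] == T[i + j] and P[j] in infrequent)
        for i in range(n - m + 1)
    ]
```

With the hypothetical inputs T = "abacbbac" and P = "abac" this returns [2, 0, 0, 1, 2].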
How quick is Stage 2?
Constructing A and stepping through T takes O(n) time.
To find all j such that T[k] = P[j], store a list for each infrequent symbol of its positions in P (each list has length less than √m).
Each character T[k] therefore causes fewer than √m updates to A, so Stage 2 takes O(n√m) time in total.
Pattern matching with mismatches: putting it all together
Algorithm summary
Definition: An alphabet symbol is frequent if it occurs at least √m times in P, and infrequent otherwise.
Stage 0: Classify each symbol as frequent or infrequent - O(m log n) time
(by alphabetically sorting the characters from P)
Stage 1: Count all matches involving frequent symbols - O(n√m · log n) time
(matches with a single symbol σ can be found using a cross-correlation of the 0/1 strings Tσ and Pσ)
Stage 2: Count all matches involving infrequent symbols - O(n√m) time
(matches with an infrequent symbol can be found by direct counting)
At any alignment i, the number of mismatches is just m minus the total number of matches.
Overall, we obtain a time complexity of O(n√m · log n).
Notice that Stage 1 takes longer than Stage 2...
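The three stages can be combined as in the Python sketch below. It is a minimal illustration, not the lecture's implementation, and it makes several assumptions: all function names are hypothetical, Stage 0 counts symbols with a hash table instead of sorting, and the cross-correlation uses a small hand-written recursive FFT rather than a library routine.

```python
import cmath
from collections import Counter


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two.
    The inverse is left unnormalised (the caller divides by the length)."""
    n = len(a)
    if n == 1:
        return list(a)
    even, odd = fft(a[0::2], invert), fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + w, even[k] - w
    return out


def cross_correlate(t_bits, p_bits):
    """Return C with C[i] = sum over j of t_bits[i + j] * p_bits[j]."""
    n, m = len(t_bits), len(p_bits)
    size = 1
    while size < n + m:
        size *= 2
    # Correlation equals convolution with the reversed pattern.
    fa = fft(t_bits + [0] * (size - n))
    fb = fft(p_bits[::-1] + [0] * (size - m))
    conv = fft([x * y for x, y in zip(fa, fb)], invert=True)
    # Alignment i sits at index i + m - 1; rounding removes float error.
    return [round(conv[i + m - 1].real / size) for i in range(n - m + 1)]


def hamming_distances(T, P):
    """Return the number of mismatches between P and T at every alignment."""
    n, m = len(T), len(P)
    counts = Counter(P)  # Stage 0: frequent means count >= sqrt(m)
    matches = [0] * (n - m + 1)

    # Stage 1: one cross-correlation per frequent symbol.
    for c, cnt in counts.items():
        if cnt * cnt >= m:
            t_bits = [1 if x == c else 0 for x in T]
            p_bits = [1 if x == c else 0 for x in P]
            for i, v in enumerate(cross_correlate(t_bits, p_bits)):
                matches[i] += v

    # Stage 2: direct counting for infrequent symbols.
    positions = {}
    for j, c in enumerate(P):
        if counts[c] * counts[c] < m:
            positions.setdefault(c, []).append(j)
    for k, c in enumerate(T):
        for j in positions.get(c, ()):
            if 0 <= k - j <= n - m:
                matches[k - j] += 1

    # Mismatches at alignment i are m minus the total number of matches.
    return [m - x for x in matches]
```

For the hypothetical inputs T = "abacbbac" and P = "abac", hamming_distances returns [0, 4, 3, 3, 1].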
Improving the Time Complexity 1 - balance the stages
Current Definition: An alphabet symbol is frequent if it occurs at least √m times in P, and infrequent otherwise.
What happens if we generalise this definition?
New Definition: An alphabet symbol is frequent if it occurs at least f times in P, and infrequent otherwise.
How long does each stage take now?
Stage 0: Classify each symbol as frequent or infrequent - O(m log n) time
(this stage is unaffected - the time complexity doesn't depend on f)
Stage 1: Count all matches involving frequent symbols - O((m/f) · n log n) time
Matches with a single symbol σ can be found using a cross-correlation of the 0/1 strings Tσ and Pσ.
As each frequent symbol occurs at least f times, there are at most m/f frequent symbols,
and we do one cross-correlation for each frequent symbol. . .
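The generalised classification and the m/f bound can be illustrated with a short Python sketch. The function name classify is hypothetical. The pattern string uses the letters of the slide's example P in the order they were extracted, which may not be the original order; only the symbol counts matter here.

```python
from collections import Counter


def classify(P, f):
    """Split the symbols of P into frequent (at least f occurrences)
    and infrequent (fewer than f occurrences)."""
    counts = Counter(P)
    frequent = {c for c, cnt in counts.items() if cnt >= f}
    infrequent = set(counts) - frequent
    return frequent, infrequent


# Example pattern with m = 9: a and b occur 3 times, d twice, c once.
P = "adbcabbda"
for f in (2, 3):
    frequent, infrequent = classify(P, f)
    # Each frequent symbol uses up at least f of the m positions in P,
    # so there can be at most m / f frequent symbols.
    assert len(frequent) <= len(P) / f
```

Raising f leaves fewer frequent symbols (fewer cross-correlations in Stage 1) but longer position lists for the infrequent symbols (more work in Stage 2).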
Stage 2: Count all matches involving infrequent symbols - O(nf) time
Matches with an infrequent symbol can be found by direct counting.
We make a single pass through T. . .
and for each T[k] we update at most (f − 1) locations in A
Pattern matching with mismatches: putting it all together
(Generalised) Algorithm summary
Definition: An alphabet symbol is frequent if it occurs at least f times in P, and infrequent otherwise.
Stage 0: Classify each symbol as frequent or infrequent - O(m log n) time
(by alphabetically sorting the characters from P)
Stage 1: Count all matches involving frequent symbols - O(n · (m/f) · log n) time
(matches with a single symbol σ can be found using a cross-correlation of the 0/1 strings Tσ and Pσ)
Stage 2: Count all matches involving infrequent symbols - O(nf) time
(matches with an infrequent symbol can be found by direct counting)
At any alignment i, the number of mismatches is just m minus the total number of matches.
What should we set f to?
Pattern matching with mismatches: putting it all together
(Generalised) Algorithm summary
Stage 0: Classify each symbol as frequent or infrequent - O(m log n) time
What should we set f to?
Stage 1: Count all matches involving frequent symbols - O(n m
f log n) time
Stage 2: Count all matches involving infrequent symbols. - O(nf) time
at any alignment i the number of mismatches is just m minus the total number of matches
Definition: An alphabet symbol is frequent if it occurs at least f times in P.
and infrequent otherwise
(by alphabetically sorting the characters from P)
Tσ
Pσ
00 0
0 0
00 01
1
1 1 11 1 1
1
Matches with a single symbol
can be found using a cross-correlation
aaaT d b c c c d d c d c d ca
A 1 1 2 1 13 2 0
P a d bc ab b da
Matches with an infrequent symbol
can be found by direct counting
Let f =
√
m log n. . .
Pattern matching with mismatches: putting it all together
(Generalised) Algorithm summary
What should we set f to?
Let f = √(m log n). . .
Definition: An alphabet symbol is frequent if it occurs at least √(m log n) times in P, and infrequent otherwise.
Stage 0: Classify each symbol as frequent or infrequent (by alphabetically sorting the characters from P) - O(m log n) time
Stage 1: Count all matches involving frequent symbols - O(n · (m / √(m log n)) · log n) time
Stage 2: Count all matches involving infrequent symbols - O(n√(m log n)) time
At any alignment i, the number of mismatches is just m minus the total number of matches.
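One way to see where this value of f comes from is to balance the Stage 1 and Stage 2 costs. The derivation below is a sketch of that reasoning (it is not spelled out on the slide) and also simplifies the Stage 1 bound:

```latex
% Balance the Stage 1 cost (nm/f) log n against the Stage 2 cost nf.
\begin{align*}
\frac{nm}{f}\,\log n = nf
  &\iff f^{2} = m\log n
  \iff f = \sqrt{m\log n},\\
\text{Stage 1: } n\cdot\frac{m}{\sqrt{m\log n}}\cdot\log n
  &= n\cdot\frac{m\log n}{\sqrt{m\log n}}
  = n\sqrt{m\log n},\\
\text{Stage 2: } nf
  &= n\sqrt{m\log n}.
\end{align*}
```

Both stages therefore cost O(n√(m log n)) for this choice of f.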
Pattern matching with mismatches: putting it all together
(Generalised) Algorithm summary
Definition: An alphabet symbol is frequent if it occurs at least √(m log n) times in P, and infrequent otherwise.
Stage 0: Classify each symbol as frequent or infrequent (by alphabetically sorting the characters from P) - O(m log n) time
Stage 1: Count all matches involving frequent symbols - O(n√(m log n)) time
Matches with a single symbol can be found using a cross-correlation (figure: the binary strings Tσ and Pσ).
Stage 2: Count all matches involving infrequent symbols - O(n√(m log n)) time
Matches with an infrequent symbol can be found by direct counting (figure: T, P and the counter array A).
At any alignment i, the number of mismatches is just m minus the total number of matches.
This improves the overall time complexity from O(n√m log n) to O(n√(m log n)).
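The three stages can be sketched in Python as follows. This is a minimal illustration under several assumptions, not the lecturer's implementation: the names `fft`, `cross_correlate` and `hamming_all` are hypothetical, Stage 0 uses a hash-based counter rather than sorting, and a small recursive FFT is written out in pure Python so that the block needs only the standard library.

```python
import cmath
import math
from collections import Counter


def fft(a, invert=False):
    """Recursive radix-2 FFT (unnormalised); len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out


def cross_correlate(t_bits, p_bits):
    """Return c[i] = sum_j p_bits[j] * t_bits[i + j] for every alignment i."""
    n, m = len(t_bits), len(p_bits)
    size = 1
    while size < n + m:
        size *= 2
    fa = fft(t_bits + [0] * (size - n))
    # Reversing the pattern turns polynomial multiplication into a correlation.
    fb = fft(p_bits[::-1] + [0] * (size - m))
    prod = fft([x * y for x, y in zip(fa, fb)], invert=True)
    # Coefficient (m - 1 + i) of the product is the correlation at alignment i.
    return [round(prod[m - 1 + i].real / size) for i in range(n - m + 1)]


def hamming_all(text, pattern, f=None):
    """Ham(i) for every alignment i, using the frequent/infrequent split."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    if f is None:
        # Assumed cut-off: f = sqrt(m log n), rounded, and at least 1.
        f = max(1, round(math.sqrt(m * math.log2(max(n, 2)))))

    # Stage 0: classify symbols (a counter here, instead of sorting P).
    counts = Counter(pattern)
    matches = [0] * (n - m + 1)

    # Stage 1: one cross-correlation per frequent symbol.
    for sym, cnt in counts.items():
        if cnt >= f:
            t_bits = [1 if c == sym else 0 for c in text]
            p_bits = [1 if c == sym else 0 for c in pattern]
            for i, v in enumerate(cross_correlate(t_bits, p_bits)):
                matches[i] += v

    # Stage 2: direct counting; each text symbol meets fewer than f positions of P.
    positions = {}
    for j, c in enumerate(pattern):
        if counts[c] < f:
            positions.setdefault(c, []).append(j)
    for k, c in enumerate(text):
        for j in positions.get(c, ()):
            i = k - j
            if 0 <= i <= n - m:
                matches[i] += 1

    # Mismatches at alignment i are m minus the matches at alignment i.
    return [m - x for x in matches]
```

Passing an explicit `f` makes it easy to force every symbol through Stage 1 (`f=1`) or through Stage 2 (a large `f`) and check that both paths agree.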
Improving the time complexity 2 - split the text
We have just seen an algorithm which takes O(n√(m log n)) time.
Imagine that n is a lot bigger than m. . .
Split T into O(n/m) contiguous 2m length substrings, T1, T2, T3 . . . (the final substring might be shorter)
(figure: T divided into consecutive blocks T1, T2, . . . , T11, each of length 2m, with P of length m drawn below)
Run the previous algorithm once with P and each Tk
How long does running the previous algorithm take?
O(|Tk|√(m log |Tk|)) time = O(m√(m log m)) time.
We run the previous algorithm O(n/m) times so this process takes O(n√(m log m)) time in total
What about this alignment? (figure: P placed across the boundary between two neighbouring substrings, so no single Tk contains it)
Improving the time complexity 2 - split the text
We have just seen an algorithm which takes O(n√(m log n)) time.
Imagine that n is a lot bigger than m. . .
Split T into O(n/m) overlapping 2m length substrings, T1, T2, T3 . . .
(figure: T covered by blocks T1, T2, . . . , T21 of length 2m; consecutive blocks start m positions apart, so every alignment of P lies entirely inside some Tk)
Run the previous algorithm once with P and each Tk
How long does running the previous algorithm take?
O(|Tk|√(m log |Tk|)) time = O(m√(m log m)) time.
We run the previous algorithm O(n/m) times so this process takes O(n√(m log m)) time in total
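The overlapping split can be sketched as below. The function names are hypothetical, and `naive_hamming` is only a stand-in for "the previous algorithm" so that the block stays self-contained. Any routine that returns Ham(i) for every alignment of P in a block could be passed as `solve` instead.

```python
def naive_hamming(text, pattern):
    """Stand-in for the previous algorithm: Ham(i) for every alignment i."""
    m = len(pattern)
    return [sum(a != b for a, b in zip(pattern, text[i:i + m]))
            for i in range(len(text) - m + 1)]


def hamming_by_blocks(text, pattern, solve=naive_hamming):
    """Compute every Ham(i) by solving overlapping blocks of length 2m."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    result = []
    # Blocks start m positions apart; the final block might be shorter.
    for start in range(0, n - m + 1, m):
        block = text[start:start + 2 * m]
        block_hams = solve(block, pattern)
        # A full block has m + 1 alignments; the last one is also the first
        # alignment of the next block, so keep only the first m.
        result.extend(block_hams[:m])
    return result
```

Each block has length at most 2m, so the per-block cost depends on m rather than n, which is where the log m in the final bound comes from.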
Conclusion
Input: A text string T (length n) and a pattern string P (length m)
Goal: For every alignment i, output Ham(i), the Hamming distance between P and T[i . . . i + m − 1]
(the Hamming distance is the number of mismatches)
(figure: P aligned with T at alignment 8, above the index ruler 0 to 12; Ham(8) = 3)
A naive algorithm for this problem takes O(nm) time
We have seen two alternative algorithms:
One algorithm takes O(n|Σ| log n) time (where |Σ| is the alphabet size)
and can be improved to O(n|Σ| log m) (by splitting the text)
The other algorithm takes O(n√m log n) time (regardless of the alphabet size)
and can be improved to O(n√(m log m)) (by changing the freq./infreq. cut off and splitting the text)


Pattern Matching Part Three: Hamming Distance

  • 1. Advanced Algorithms – COMS31900 Pattern matching part three Hamming distance Benjamin Sach
  • 2. Exact pattern matching T Input: A text string T (length n) and a pattern string P (length m) P ba b c a b a a b a cb a Goal: Find all the locations where P matches in T P matches at location i iff a b a m for all 0 j < m we have that P[j] = T[i + j] (our strings are zero-indexed) 0 1 2 3 4 5 6 7 8 9 10 11 12 n
  • 3. Exact pattern matching T Input: A text string T (length n) and a pattern string P (length m) P ba b c a b a a b a cb a Goal: Find all the locations where P matches in T P matches at location i iff a b a m 4 for all 0 j < m we have that P[j] = T[i + j] (our strings are zero-indexed) 0 1 2 3 4 5 6 7 8 9 10 11 12 n
  • 4. Exact pattern matching T Input: A text string T (length n) and a pattern string P (length m) P ba b c a b a cb a Goal: Find all the locations where P matches in T P matches at location i iff a b a a b a m 6 for all 0 j < m we have that P[j] = T[i + j] (our strings are zero-indexed) 0 1 2 3 4 5 6 7 8 9 10 11 12 n
  • 5. Exact pattern matching T Input: A text string T (length n) and a pattern string P (length m) P ba b c a b a cb a Goal: Find all the locations where P matches in T P matches at location i iff a b a a b a m 10 for all 0 j < m we have that P[j] = T[i + j] (our strings are zero-indexed) 0 1 2 3 4 5 6 7 8 9 10 11 12 n
  • 6. Exact pattern matching T Input: A text string T (length n) and a pattern string P (length m) P ba b c a b a cb a Goal: Find all the locations where P matches in T P matches at location i iff a b a a b a m 6 for all 0 j < m we have that P[j] = T[i + j] (our strings are zero-indexed) 0 1 2 3 4 5 6 7 8 9 10 11 12 n
  • 7. Exact pattern matching T Input: A text string T (length n) and a pattern string P (length m) P ba b c a b a cb a Goal: Find all the locations where P matches in T P matches at location i iff a b a a b a m 6 for all 0 j < m we have that P[j] = T[i + j] (our strings are zero-indexed) j-th character of P (i + j)-th char. of T T[2] = c 0 1 2 3 4 5 6 7 8 9 10 11 12 n
  • 8. Exact pattern matching T Input: A text string T (length n) and a pattern string P (length m) P ba b c a b a cb a Goal: Find all the locations where P matches in T P matches at location i iff a b a a b a m 6 for all 0 j < m we have that P[j] = T[i + j] (our strings are zero-indexed) • A naive algorithm takes O(nm) time j-th character of P (i + j)-th char. of T T[2] = c 0 1 2 3 4 5 6 7 8 9 10 11 12 n
  • 9. Exact pattern matching T Input: A text string T (length n) and a pattern string P (length m) P ba b c a b a cb a Goal: Find all the locations where P matches in T P matches at location i iff a b a a b a m 6 for all 0 j < m we have that P[j] = T[i + j] (our strings are zero-indexed) • A naive algorithm takes O(nm) time • Many O(n) time algorithms are known (for example the KMP algorithm) j-th character of P (i + j)-th char. of T T[2] = c 0 1 2 3 4 5 6 7 8 9 10 11 12 n
  • 10. Pattern matching with mismatches T Input: A text string T (length n) and a pattern string P (length m) P ba b c a b d a a d ad a Goal: For every alignment i, output The Hamming distance is the number of mismatches. . . c a a m i.e. the number of distinct j such that P[j] = T[i + j] 0 1 2 3 4 5 6 7 8 9 10 11 12 n a Ham(i), the Hamming distance between P and T[i . . . i + m − 1]
  • 11. Pattern matching with mismatches T Input: A text string T (length n) and a pattern string P (length m) P ba b c a b d a a d ad a Goal: For every alignment i, output The Hamming distance is the number of mismatches. . . c a a m i.e. the number of distinct j such that P[j] = T[i + j] 0 1 2 3 4 5 6 7 8 9 10 11 12 n a Ham(4) = 1 Ham(i), the Hamming distance between P and T[i . . . i + m − 1]
  • 12. Pattern matching with mismatches T Input: A text string T (length n) and a pattern string P (length m) P ba b c a a d ad a Goal: For every alignment i, output The Hamming distance is the number of mismatches. . . c a a i.e. the number of distinct j such that P[j] = T[i + j] 0 1 2 3 4 5 6 7 8 9 10 11 12 n a b d m a Ham(5) = 4 Ham(i), the Hamming distance between P and T[i . . . i + m − 1]
  • 13. Pattern matching with mismatches T Input: A text string T (length n) and a pattern string P (length m) P ba b c a a d ad a Goal: For every alignment i, output The Hamming distance is the number of mismatches. . . c a a i.e. the number of distinct j such that P[j] = T[i + j] 0 1 2 3 4 5 6 7 8 9 10 11 12 n a b d m a Ham(6) = 1 Ham(i), the Hamming distance between P and T[i . . . i + m − 1]
  • 14. Pattern matching with mismatches T Input: A text string T (length n) and a pattern string P (length m) P ba b c a a d ad a Goal: For every alignment i, output The Hamming distance is the number of mismatches. . . c a a i.e. the number of distinct j such that P[j] = T[i + j] 0 1 2 3 4 5 6 7 8 9 10 11 12 n a b d m a Ham(7) = 3 Ham(i), the Hamming distance between P and T[i . . . i + m − 1]
  • 15. Pattern matching with mismatches T Input: A text string T (length n) and a pattern string P (length m) P ba b c a a d ad a Goal: For every alignment i, output The Hamming distance is the number of mismatches. . . c a a i.e. the number of distinct j such that P[j] = T[i + j] 0 1 2 3 4 5 6 7 8 9 10 11 12 n a b d m a Ham(7) = 3 Ham(i), the Hamming distance between P and T[i . . . i + m − 1] this is alignment 7
  • 16. Pattern matching with mismatches T Input: A text string T (length n) and a pattern string P (length m) P ba b c a a d ad a Goal: For every alignment i, output The Hamming distance is the number of mismatches. . . c a a i.e. the number of distinct j such that P[j] = T[i + j] 0 1 2 3 4 5 6 7 8 9 10 11 12 n a b d m a Ham(8) = 3 Ham(i), the Hamming distance between P and T[i . . . i + m − 1] this is alignment 8
  • 17. Pattern matching with mismatches T Input: A text string T (length n) and a pattern string P (length m) P ba b c a a d ad a Goal: For every alignment i, output The Hamming distance is the number of mismatches. . . c a a i.e. the number of distinct j such that P[j] = T[i + j] A naive algorithm for this problem takes O(nm) time 0 1 2 3 4 5 6 7 8 9 10 11 12 n a b d m a Ham(8) = 3 Ham(i), the Hamming distance between P and T[i . . . i + m − 1] this is alignment 8
  • 18. Pattern matching with mismatches T Input: A text string T (length n) and a pattern string P (length m) P ba b c a a d ad a Goal: For every alignment i, output The Hamming distance is the number of mismatches. . . c a a i.e. the number of distinct j such that P[j] = T[i + j] A naive algorithm for this problem takes O(nm) time . . . but we can do better 0 1 2 3 4 5 6 7 8 9 10 11 12 n a b d m a Ham(8) = 3 Ham(i), the Hamming distance between P and T[i . . . i + m − 1] this is alignment 8
  • 19. It’s a small alphabet after all T P d d d d dd d d m 0 1 2 3 4 5 6 7 8 9 10 11 12 n d Imagine that the alphabet contains only a small number of different symbols, aa c a b bb c which we will consider individually. . .
  • 20. It’s a small alphabet after all T P d d d d dd d d m 0 1 2 3 4 5 6 7 8 9 10 11 12 n d Imagine that the alphabet contains only a small number of different symbols, Replace all d symbols with 1 and everything else with 0 aa c a b bb c which we will consider individually. . .
  • 21. It’s a small alphabet after all T P m 0 1 2 3 4 5 6 7 8 9 10 11 12 n Imagine that the alphabet contains only a small number of different symbols, Replace all d symbols with 1 and everything else with 0 d d d d dd d d d aa c a b bb c which we will consider individually. . .
  • 22. It’s a small alphabet after all T P m 0 1 2 3 4 5 6 7 8 9 10 11 12 n Imagine that the alphabet contains only a small number of different symbols, Replace all d symbols with 1 and everything else with 0 aa c a b bb c which we will consider individually. . . 1 1 1 1 11 1 1 1
  • 23. It’s a small alphabet after all T P m 0 1 2 3 4 5 6 7 8 9 10 11 12 n Imagine that the alphabet contains only a small number of different symbols, Replace all d symbols with 1 and everything else with 0 aa c a b bb c which we will consider individually. . . 1 1 1 1 11 1 1 1
  • 24. It’s a small alphabet after all T P aa c a b bb c m 0 1 2 3 4 5 6 7 8 9 10 11 12 n Imagine that the alphabet contains only a small number of different symbols, Replace all d symbols with 1 and everything else with 0 which we will consider individually. . . 1 1 1 1 11 1 1 1
  • 25. It’s a small alphabet after all T P m 0 1 2 3 4 5 6 7 8 9 10 11 12 n Imagine that the alphabet contains only a small number of different symbols, Replace all d symbols with 1 and everything else with 0 which we will consider individually. . . 00 0 0 0 00 01 1 1 1 11 1 1 1
  • 26. It’s a small alphabet after all m 0 1 2 3 4 5 6 7 8 9 10 11 12 n Imagine that the alphabet contains only a small number of different symbols, Replace all d symbols with 1 and everything else with 0 Td Pd We denote these new strings Td and Pd and consider. . . which we will consider individually. . . 00 0 0 0 00 01 1 1 1 11 1 1 1
  • 27. It’s a small alphabet after all m 0 1 2 3 4 5 6 7 8 9 10 11 12 n Imagine that the alphabet contains only a small number of different symbols, Replace all d symbols with 1 and everything else with 0 Td Pd We denote these new strings Td and Pd and consider. . . (Td ⊗ Pd)[i] = m−1 j=0 Pd[j] × Td[i + j] which we will consider individually. . . 00 0 0 0 00 01 1 1 1 11 1 1 1
  • 28. It’s a small alphabet after all m 0 1 2 3 4 5 6 7 8 9 10 11 12 n Imagine that the alphabet contains only a small number of different symbols, Replace all d symbols with 1 and everything else with 0 Td Pd We denote these new strings Td and Pd and consider. . . (Td ⊗ Pd)[i] = m−1 j=0 Pd[j] × Td[i + j] which we will consider individually. . . 00 0 0 0 00 01 1 1 1 11 1 1 1
  • 29. It’s a small alphabet after all m 0 1 2 3 4 5 6 7 8 9 10 11 12 n Imagine that the alphabet contains only a small number of different symbols, Replace all d symbols with 1 and everything else with 0 Td Pd We denote these new strings Td and Pd and consider. . . (Td ⊗ Pd)[i] = m−1 j=0 Pd[j] × Td[i + j] which we will consider individually. . . 00 0 0 0 00 01 1 1 1 11 1 1 1 (Td ⊗ Pd)[4] = (1 × 1)+ (0 × 0)+ (1 × 0)+ (1 × 1) = 2
  • 30. It’s a small alphabet after all m 0 1 2 3 4 5 6 7 8 9 10 11 12 n Imagine that the alphabet contains only a small number of different symbols, Replace all d symbols with 1 and everything else with 0 Td Pd We denote these new strings Td and Pd and consider. . . (Td ⊗ Pd)[i] = m−1 j=0 Pd[j] × Td[i + j] 1 iff P [j]=T [i+j]=d which we will consider individually. . . 00 0 0 0 00 01 1 1 1 11 1 1 1 (Td ⊗ Pd)[4] = (1 × 1)+ (0 × 0)+ (1 × 0)+ (1 × 1) = 2
  • 31. It’s a small alphabet after all m 0 1 2 3 4 5 6 7 8 9 10 11 12 n Imagine that the alphabet contains only a small number of different symbols, Replace all d symbols with 1 and everything else with 0 Td Pd We denote these new strings Td and Pd and consider. . . (Td ⊗ Pd)[i] = m−1 j=0 Pd[j] × Td[i + j] This is the exactly number of matching ds at the i-th alignment. 1 iff P [j]=T [i+j]=d which we will consider individually. . . 00 0 0 0 00 01 1 1 1 11 1 1 1 (Td ⊗ Pd)[4] = (1 × 1)+ (0 × 0)+ (1 × 0)+ (1 × 1) = 2
  • 32. It’s a small alphabet after all m 0 1 2 3 4 5 6 7 8 9 10 11 12 n Imagine that the alphabet contains only a small number of different symbols, Replace all d symbols with 1 and everything else with 0 Td Pd We denote these new strings Td and Pd and consider. . . (Td ⊗ Pd)[i] = m−1 j=0 Pd[j] × Td[i + j] This is the exactly number of matching ds at the i-th alignment. 1 iff P [j]=T [i+j]=d which we will consider individually. . . How can we work out (Td ⊗ Pd) quickly? 00 0 0 0 00 01 1 1 1 11 1 1 1 (Td ⊗ Pd)[4] = (1 × 1)+ (0 × 0)+ (1 × 0)+ (1 × 1) = 2
  • 33. It’s a small alphabet after all m 0 1 2 3 4 5 6 7 8 9 10 11 12 n Imagine that the alphabet contains only a small number of different symbols, Replace all d symbols with 1 and everything else with 0 Td Pd We denote these new strings Td and Pd and consider. . . (Td ⊗ Pd)[i] = m−1 j=0 Pd[j] × Td[i + j] This is the exactly number of matching ds at the i-th alignment. 1 iff P [j]=T [i+j]=d which we will consider individually. . . How can we work out (Td ⊗ Pd) quickly? 00 0 0 0 00 01 1 1 1 11 1 1 1
  • 34-46. Last year on COMS21103. . .
    Let A and B be polynomials of degree (n − 1), which can be expressed as
      $A(x) = \sum_{i=0}^{n-1} a_i x^i$ and $B(x) = \sum_{i=0}^{n-1} b_i x^i$
    (or be seen as arrays of length n, with A[i] = a_i and B[i] = b_i).
    The polynomial C = A × B can be expressed as
      $C(x) = \sum_{i=0}^{2n-2} c_i x^i$ where $c_i = \sum_{j=0}^{i} a_j \times b_{i-j}$
    (taking a_j = b_j = 0 for j ≥ n; as an array, C[i] = c_i).
    By the magic of the FFT we can compute C (i.e. every c_i) in O(n log n) time.
    Compare c_i with $\sum_{j=0}^{m-1} P_d[j] \, T_d[i + j]$: these look similar!
    Hint 1: Let A = P_d and B = T_d, so that A[i] = P_d[i], B[i] = T_d[i] and $c_i = \sum_{j=0}^{i} P_d[j] \, T_d[i - j]$.
    Hint 2: Let A = P_d padded with (n − m) zeros, so that both arrays have length n.
    Hint 3: Let A = P_d (padded with zeros) and B = T_d (reversed), i.e. A[i] = P_d[i] and B[i] = T_d[n − 1 − i] with zero-indexing. Then
      $c_{n-1-i} = \sum_{j=0}^{m-1} P_d[j] \, T_d[i + j]$ for 0 ≤ i ≤ n − m,
    so C now contains (T_d ⊗ P_d).
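The reduction in the hints can be sketched in Python as follows. This is an illustrative sketch rather than the course's implementation: `fft`, `multiply` and `cross_correlation` are hypothetical names, the FFT is a plain recursive radix-2 version written with the standard `cmath` module, and the reversal is zero-indexed, so alignment i is read from coefficient n − 1 − i of the product.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = -1 if invert else 1
    out = [0] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out


def multiply(a, b):
    """Integer coefficients of the product of polynomials a and b."""
    size = 1
    while size < len(a) + len(b) - 1:
        size *= 2
    # Zero-padding to a power of two also covers the padding of P_d.
    fa = fft(list(a) + [0] * (size - len(a)))
    fb = fft(list(b) + [0] * (size - len(b)))
    prod = fft([x * y for x, y in zip(fa, fb)], invert=True)
    # The inverse transform above is unnormalised, so divide by size here.
    return [round((v / size).real) for v in prod[: len(a) + len(b) - 1]]


def cross_correlation(t_bits, p_bits):
    """Return sum_j p_bits[j] * t_bits[i + j] for 0 <= i <= n - m."""
    n, m = len(t_bits), len(p_bits)
    c = multiply(p_bits, t_bits[::-1])  # B is T_d reversed
    # Coefficient n - 1 - i of the product holds the value for alignment i.
    return [c[n - 1 - i] for i in range(n - m + 1)]


print(cross_correlation([1, 0, 1, 0, 1, 0, 1, 0, 1], [1, 0, 1]))
# [2, 0, 2, 0, 2, 0, 2]
```

Rounding to the nearest integer is safe here because the inputs are 0/1 arrays, so every coefficient of the product is a small non-negative integer.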
  • 47-54. Computing cross-correlations via the FFT
    Let T_σ be T with every σ replaced with a 1 and everything else replaced with a 0 (P_σ is defined analogously).
      $(T_\sigma \otimes P_\sigma)[i] = \sum_{j=0}^{m-1} P_\sigma[j] \times T_\sigma[i + j]$
    is exactly the number of matching σs at the i-th alignment.
    We can compute (T_σ ⊗ P_σ) in O(n log n) time via the FFT, i.e. after O(n log n) time we have (T_σ ⊗ P_σ)[i] for every i.
    (T_σ ⊗ P_σ) is called the cross-correlation of T_σ and P_σ. It is also very often (but technically incorrectly) called the convolution.
    Cross-correlations are used a lot in the pattern matching literature (but they mostly call them convolutions).
  • 55-65. It’s a small alphabet after all
    Let Σ denote the set of alphabet symbols and |Σ| be its size (in the example Σ = {a, b, c, d} so |Σ| = 4).
    We have seen how to count all matches involving a single symbol, at every alignment, in O(n log n) time.
    Algorithm Summary
      1. Construct T_σ and P_σ for each symbol σ in Σ. (O(n|Σ|) time)
      2. Compute (T_σ ⊗ P_σ) for each symbol σ in Σ. (O(n|Σ| log n) time)
      3. For every i, compute $\mathrm{Ham}(i) = m - \sum_{\sigma \in \Sigma} (T_\sigma \otimes P_\sigma)[i]$. (O(n|Σ|) time)
    Here (T_σ ⊗ P_σ)[i] counts the matches involving σ, the sum over σ counts all matches, and mismatches = m − matches.
    This takes O(n|Σ| log n) total time (and uses O(n) space).
    However, |Σ| could be as big as m, in which case this is worse than the naive O(nm) method!
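The algorithm summary can be sketched end to end in Python. This is a minimal sketch under stated assumptions: `hamming_distances` is a hypothetical name, the example strings are invented, only symbols that occur in P are considered (other symbols cannot contribute matches), and the per-symbol cross-correlation is evaluated by a direct sum as a stand-in for the FFT step, so this version runs in O(nm|Σ|) time rather than O(n|Σ| log n).

```python
def hamming_distances(T, P):
    """Return Ham(i) = m - (number of matches) for every alignment i."""
    n, m = len(T), len(P)
    matches = [0] * max(n - m + 1, 0)
    for sigma in set(P):
        # Step 1: build the 0/1 strings for this symbol.
        t_bits = [1 if ch == sigma else 0 for ch in T]
        p_bits = [1 if ch == sigma else 0 for ch in P]
        # Step 2: cross-correlation (direct sum here; the FFT would go here).
        for i in range(n - m + 1):
            matches[i] += sum(p_bits[j] * t_bits[i + j] for j in range(m))
    # Step 3: mismatches = m - matches.
    return [m - x for x in matches]


# Hypothetical example strings.
print(hamming_distances("abadacada", "ada"))  # [1, 3, 0, 3, 1, 3, 0]
```

Alignments with distance 0 (here 2 and 6) are exactly the exact matches of P in T.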
  • 66-77. Coping with a large alphabet
    We will now see an algorithm which runs in O(n √m log n) time regardless of the alphabet size.
    Definition: An alphabet symbol is frequent if it occurs at least √m times in P.
    Example: P has length m = 9 (indices 0 to 8), so √m = 3. In P, a and b each occur 3 times, d occurs twice and c occurs once, so a is frequent, b is frequent, and c and d are infrequent.
    Key idea: Our algorithm will have two main stages:
      Stage 1 will count all the matches involving frequent symbols (at each alignment of P and T).
      Stage 2 will count all the matches involving infrequent symbols (at each alignment of P and T).
    The total number of matches is the sum of the matches from Stage 1 and Stage 2.
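The frequent/infrequent classification is easy to compute in one pass over P. The sketch below is illustrative only: `split_symbols` is a hypothetical helper, and the pattern string is an invented one that has the same symbol counts as the slide's example (a and b three times each, d twice, c once), not necessarily the same order.

```python
from collections import Counter


def split_symbols(P):
    """Split the symbols of P into (frequent, infrequent) sets."""
    m = len(P)
    counts = Counter(P)
    # count >= sqrt(m) is equivalent to count * count >= m (no floats needed).
    frequent = {s for s, c in counts.items() if c * c >= m}
    infrequent = set(counts) - frequent
    return frequent, infrequent


frequent, infrequent = split_symbols("abdabcabd")  # m = 9, sqrt(m) = 3
print(sorted(frequent), sorted(infrequent))  # ['a', 'b'] ['c', 'd']
```

Symbols that do not occur in P at all are infrequent by the definition, but they never match anything, so the sketch leaves them out.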
  • 78-88. The frequent/infrequent symbols trick
    Definition: An alphabet symbol is frequent if it occurs at least √m times in P. (In the example m = 9: a is frequent, b is frequent, c and d are infrequent.)
    Stage 1: For each alignment i, count the number of matches involving frequent symbols:
      Consider each frequent symbol σ ∈ Σ separately and compute (T_σ ⊗ P_σ) in O(n log n) time (per symbol σ) using cross-correlations.
    How many frequent symbols can there be?
      Assume that there are more than √m frequent symbols. Each occurs at least √m times, so P would contain more than √m × √m = m characters. Contradiction! So there are at most √m frequent symbols.
    So Stage 1 takes O(n √m log n) time.
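The counting argument can also be written as a single chain of inequalities. In the sketch below, F (the set of frequent symbols) and occ_P(σ) (the number of occurrences of σ in P) are notation introduced here for illustration, not taken from the slides.

```latex
% F = set of frequent symbols; occ_P(sigma) = occurrences of sigma in P
% (both names are assumptions made for this sketch).
\[
  |F| \cdot \sqrt{m}
  \;\le\; \sum_{\sigma \in F} \mathrm{occ}_P(\sigma)
  \;\le\; m
  \quad\Longrightarrow\quad
  |F| \le \sqrt{m},
\]
\[
  \text{Stage 1 time} \;=\; |F| \cdot O(n \log n) \;=\; O(n \sqrt{m} \log n).
\]
```

The first inequality holds because every frequent symbol occurs at least √m times, and the second because P has only m characters in total.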
  • 89. The infrequent/frequent symbols trick aaa Definition: A symbol is infrequent if it occurs fewer than √ m times in P. m = 9 a is frequent , b is frequent c and d are infrequent Every symbol is either frequent or infrequent T d b c c c d d c d c d ca P a d bc ab b da
  • 90. The infrequent/frequent symbols trick aaa Definition: A symbol is infrequent if it occurs fewer than √ m times in P. m = 9 a is frequent , b is frequent c and d are infrequent Every symbol is either frequent or infrequent T d b c c c d d c d c d ca Stage 2: Count all matches involving infrequent symbols. P a d bc ab b da
  • 91. The infrequent/frequent symbols trick aaa Definition: A symbol is infrequent if it occurs fewer than √ m times in P. a is frequent , b is frequent c and d are infrequent Every symbol is either frequent or infrequent T d b c c c d d c d c d ca Stage 2: Count all matches involving infrequent symbols. P a d bc ab b da
  • 92. The infrequent/frequent symbols trick aaa Definition: A symbol is infrequent if it occurs fewer than √ m times in P. a is frequent , b is frequent c and d are infrequent Every symbol is either frequent or infrequent T d b c c c d d c d c d ca Stage 2: Count all matches involving infrequent symbols. Construct an array A of length (n − m + 1) - which initially contains all zeros A 00 0 0 0 0 00 P a d bc ab b da
  • 93. The infrequent/frequent symbols trick aaa Definition: A symbol is infrequent if it occurs fewer than √ m times in P. a is frequent , b is frequent c and d are infrequent Every symbol is either frequent or infrequent T d b c c c d d c d c d ca Stage 2: Count all matches involving infrequent symbols. Make a single pass through T. . . Construct an array A of length (n − m + 1) - which initially contains all zeros A 00 0 0 0 0 00 P a d bc ab b da
  • 94. The infrequent/frequent symbols trick aaa Definition: A symbol is infrequent if it occurs fewer than √ m times in P. a is frequent , b is frequent c and d are infrequent Every symbol is either frequent or infrequent T d b c c c d d c d c d ca Stage 2: Count all matches involving infrequent symbols. Make a single pass through T. . . Construct an array A of length (n − m + 1) - which initially contains all zeros A 00 0 0 0 0 00 P a d bc ab b da
  • 95. The infrequent/frequent symbols trick aaa Definition: A symbol is infrequent if it occurs fewer than √ m times in P. a is frequent , b is frequent c and d are infrequent Every symbol is either frequent or infrequent T d b c c c d d c d c d ca Stage 2: Count all matches involving infrequent symbols. Make a single pass through T. . . For each character T[k], (where 0 k < n) Construct an array A of length (n − m + 1) - which initially contains all zeros A 00 0 0 0 0 00 P a d bc ab b da
  • 96. The infrequent/frequent symbols trick aaa Definition: A symbol is infrequent if it occurs fewer than √ m times in P. a is frequent , b is frequent c and d are infrequent Every symbol is either frequent or infrequent T d b c c c d d c d c d ca Stage 2: Count all matches involving infrequent symbols. Make a single pass through T. . . For each character T[k], (where 0 k < n) If T[k] is infrequent. . . Construct an array A of length (n − m + 1) - which initially contains all zeros A 00 0 0 0 0 00 P a d bc ab b da
  • 97. The infrequent/frequent symbols trick aaa Definition: A symbol is infrequent if it occurs fewer than √ m times in P. a is frequent , b is frequent c and d are infrequent Every symbol is either frequent or infrequent T d b c c c d d c d c d ca Stage 2: Count all matches involving infrequent symbols. Make a single pass through T. . . For each character T[k], (where 0 k < n) If T[k] is infrequent. . . Construct an array A of length (n − m + 1) - which initially contains all zeros A 00 0 0 0 0 00 P a d bc ab b da
  • 98. The infrequent/frequent symbols trick aaa Definition: A symbol is infrequent if it occurs fewer than √ m times in P. a is frequent , b is frequent c and d are infrequent Every symbol is either frequent or infrequent T d b c c c d d c d c d ca Stage 2: Count all matches involving infrequent symbols. Make a single pass through T. . . For each character T[k], (where 0 k < n) If T[k] is infrequent. . . Construct an array A of length (n − m + 1) - which initially contains all zeros A 00 0 0 0 0 00 P a d bc ab b da
  • 99. The infrequent/frequent symbols trick aaa Definition: A symbol is infrequent if it occurs fewer than √ m times in P. a is frequent , b is frequent c and d are infrequent Every symbol is either frequent or infrequent T d b c c c d d c d c d ca Stage 2: Count all matches involving infrequent symbols. Make a single pass through T. . . For each character T[k], (where 0 k < n) If T[k] is infrequent. . . Construct an array A of length (n − m + 1) - which initially contains all zeros A 00 0 0 0 0 00 P a d bc ab b da
  • 100. The infrequent/frequent symbols trick aaa Definition: A symbol is infrequent if it occurs fewer than √ m times in P. a is frequent , b is frequent c and d are infrequent Every symbol is either frequent or infrequent T d b c c c d d c d c d ca Stage 2: Count all matches involving infrequent symbols. Make a single pass through T. . . For each character T[k], (where 0 k < n) For all j such that T[k] = P[j], If T[k] is infrequent. . . Increase A[k − j] by one Construct an array A of length (n − m + 1) - which initially contains all zeros A 00 0 0 0 0 00 P a d bc ab b da
  • 101. The infrequent/frequent symbols trick aaa Definition: A symbol is infrequent if it occurs fewer than √ m times in P. a is frequent , b is frequent c and d are infrequent Every symbol is either frequent or infrequent T d b c c c d d c d c d ca Stage 2: Count all matches involving infrequent symbols. Make a single pass through T. . . For each character T[k], (where 0 k < n) For all j such that T[k] = P[j], If T[k] is infrequent. . . Increase A[k − j] by one Construct an array A of length (n − m + 1) - which initially contains all zeros A 00 0 0 0 0 0 (except when (k − j) < 0) 0 P a d bc ab b da
  • 110-115. The infrequent/frequent symbols trick (Stage 2 example): at k = 4 the character T[4] is infrequent and T[4] = P[j] for j = 4, so k − j = 0 and A[0] increases from 0 to 1.
  • 116-123. The infrequent/frequent symbols trick (Stage 2 example): at k = 5, T[5] = P[j] for j = 4, so k − j = 1 and A[1] increases from 0 to 1.
  • 124-131. The infrequent/frequent symbols trick (Stage 2 example): at k = 6, T[6] = P[j] for j = 4, so k − j = 2 and A[2] increases from 0 to 1.
  • 132-146. The infrequent/frequent symbols trick (Stage 2 example): at k = 7, T[7] = P[j] for two values of j. For j = 8 we have (k − j) < 0, so nothing is updated. For j = 6 we have k − j = 1, so A[1] increases from 1 to 2.
  • 147-152. The infrequent/frequent symbols trick (Stage 2 example, continued): the pass proceeds through the remaining characters of T in the same way, increasing A[k − j] by one for each j with T[k] = P[j] whenever T[k] is infrequent.
  • 153-157. The infrequent/frequent symbols trick (Stage 2 example, end of the pass): the array A is now complete. What is A[i]? Fact: A[i] is the number of matches at alignment i involving an infrequent symbol.
    [Figure: the text T, the pattern P and the completed array A]
  • 158. The infrequent/frequent symbols trick aaa Definition: A symbol is infrequent if it occurs fewer than √ m times in P. a is frequent , b is frequent c and d are infrequent Every symbol is either frequent or infrequent T d b c c c d d c d c d ca Stage 2: Count all matches involving infrequent symbols. Make a single pass through T. . . For each character T[k], (where 0 k < n) For all j such that T[k] = P[j], If T[k] is infrequent. . . Increase A[k − j] by one Construct an array A of length (n − m + 1) - which initially contains all zeros A 1 1 2 1 13 2 0 (except when (k − j) < 0) How quick is Stage 2? P a d bc ab b da
  • 159. The infrequent/frequent symbols trick aaa Definition: A symbol is infrequent if it occurs fewer than √ m times in P. a is frequent , b is frequent c and d are infrequent Every symbol is either frequent or infrequent T d b c c c d d c d c d ca Stage 2: Count all matches involving infrequent symbols. Make a single pass through T. . . For each character T[k], (where 0 k < n) For all j such that T[k] = P[j], If T[k] is infrequent. . . Increase A[k − j] by one Construct an array A of length (n − m + 1) - which initially contains all zeros A 1 1 2 1 13 2 0 (except when (k − j) < 0) How quick is Stage 2? O(n) time P a d bc ab b da
  • 160. The infrequent/frequent symbols trick aaa Definition: A symbol is infrequent if it occurs fewer than √ m times in P. a is frequent , b is frequent c and d are infrequent Every symbol is either frequent or infrequent T d b c c c d d c d c d ca Stage 2: Count all matches involving infrequent symbols. Make a single pass through T. . . For each character T[k], (where 0 k < n) For all j such that T[k] = P[j], If T[k] is infrequent. . . Increase A[k − j] by one store a list for each infrequent symbol Construct an array A of length (n − m + 1) - which initially contains all zeros A 1 1 2 1 13 2 0 (except when (k − j) < 0) How quick is Stage 2? O(n) time P a d bc ab b da
  • 161. The infrequent/frequent symbols trick aaa Definition: A symbol is infrequent if it occurs fewer than √ m times in P. a is frequent , b is frequent c and d are infrequent Every symbol is either frequent or infrequent T d b c c c d d c d c d ca Stage 2: Count all matches involving infrequent symbols. Make a single pass through T. . . For each character T[k], (where 0 k < n) For all j such that T[k] = P[j], If T[k] is infrequent. . . Increase A[k − j] by one store a list for each infrequent symbol Construct an array A of length (n − m + 1) - which initially contains all zeros A 1 1 2 1 13 2 0 (except when (k − j) < 0) How quick is Stage 2? (each list has length less than √ m) O(n) time P a d bc ab b da
  • 162. The infrequent/frequent symbols trick aaa Definition: A symbol is infrequent if it occurs fewer than √ m times in P. a is frequent , b is frequent c and d are infrequent Every symbol is either frequent or infrequent T d b c c c d d c d c d ca Stage 2: Count all matches involving infrequent symbols. Make a single pass through T. . . For each character T[k], (where 0 k < n) For all j such that T[k] = P[j], If T[k] is infrequent. . . Increase A[k − j] by one store a list for each infrequent symbol Construct an array A of length (n − m + 1) - which initially contains all zeros A 1 1 2 1 13 2 0 (except when (k − j) < 0) How quick is Stage 2? (each list has length less than √ m) O(n) time P a d bc ab b da O(n √ m) time
  • 163. The infrequent/frequent symbols trick aaa Definition: A symbol is infrequent if it occurs fewer than √ m times in P. a is frequent , b is frequent c and d are infrequent Every symbol is either frequent or infrequent T d b c c c d d c d c d ca Stage 2: Count all matches involving infrequent symbols. Make a single pass through T. . . For each character T[k], (where 0 k < n) For all j such that T[k] = P[j], If T[k] is infrequent. . . Increase A[k − j] by one Construct an array A of length (n − m + 1) - which initially contains all zeros A 1 1 2 1 13 2 0 (except when (k − j) < 0) P a d bc ab b da
  • 164. The infrequent/frequent symbols trick aaa Definition: A symbol is infrequent if it occurs fewer than √ m times in P. a is frequent , b is frequent c and d are infrequent Every symbol is either frequent or infrequent T d b c c c d d c d c d ca Stage 2: Count all matches involving infrequent symbols. Make a single pass through T. . . For each character T[k], (where 0 k < n) For all j such that T[k] = P[j], If T[k] is infrequent. . . Increase A[k − j] by one Construct an array A of length (n − m + 1) - which initially contains all zeros A 1 1 2 1 13 2 0 (except when (k − j) < 0) O(n √ m) total time P a d bc ab b da
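Stage 2 as described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the lecturer's code: the function name `count_infrequent_matches` and the dictionary-of-lists layout are assumptions made for the example, and the test `count * count < m` is used in place of `count < √m` to avoid floating point.

```python
from collections import defaultdict


def count_infrequent_matches(text, pattern):
    """Return A, where A[i] is the number of matches at alignment i
    that involve a symbol occurring fewer than sqrt(m) times in pattern."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []

    # One list of pattern positions per symbol.
    positions = defaultdict(list)
    for j, symbol in enumerate(pattern):
        positions[symbol].append(j)

    # Keep only infrequent symbols: count < sqrt(m)  <=>  count * count < m.
    infrequent = {s: js for s, js in positions.items() if len(js) ** 2 < m}

    counts = [0] * (n - m + 1)
    # Single pass through the text; each step scans a list shorter than sqrt(m).
    for k, symbol in enumerate(text):
        for j in infrequent.get(symbol, ()):
            i = k - j
            if 0 <= i <= n - m:  # skip updates that fall outside A
                counts[i] += 1
    return counts
```

Calling `count_infrequent_matches(T, P)` returns the array A for any strings `T` and `P`; adding the matches found for the frequent symbols gives the total number of matches per alignment.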
  • 176. Pattern matching with mismatches: putting it all together. Algorithm summary. Definition: An alphabet symbol is frequent if it occurs at least √m times in P, and infrequent otherwise. Stage 0: Classify each symbol as frequent or infrequent (by alphabetically sorting the characters from P) - O(m log n) time. Stage 1: Count all matches involving frequent symbols - O(n · √m · log n) time. Matches with a single symbol can be found using a cross-correlation. [Figure: binary indicator strings Tσ and Pσ for a single symbol σ.] Stage 2: Count all matches involving infrequent symbols - O(n√m) time. Matches with an infrequent symbol can be found by direct counting. [Figure: an example text T and pattern P with the count array A.] At any alignment i, the number of mismatches is just m minus the total number of matches. Overall, we obtain a time complexity of O(n · √m · log n). Notice that Stage 1 takes longer than Stage 2...
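The three stages of the summary can be put together as in the sketch below. Two deliberate simplifications are assumptions of this example rather than part of the slides: Stage 0 counts symbols with `collections.Counter` instead of sorting, and Stage 1 computes each cross-correlation naively in O(nm) time, so the code shows the structure of the algorithm but not its running time (an FFT-based cross-correlation would be needed for the O(n log n) bound per symbol). The name `hamming_distances` is hypothetical.

```python
from collections import Counter


def hamming_distances(text, pattern):
    """Return Ham(i) for every alignment i, using the frequent/infrequent split."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []

    # Stage 0: a symbol is frequent if it occurs at least sqrt(m) times in P.
    occurrences = Counter(pattern)
    frequent = {s for s, c in occurrences.items() if c * c >= m}

    matches = [0] * (n - m + 1)

    # Stage 1: one cross-correlation of 0/1 indicator strings per frequent symbol.
    # (Naive O(nm) version here; stands in for an FFT-based cross-correlation.)
    for symbol in frequent:
        t_bits = [1 if ch == symbol else 0 for ch in text]
        p_bits = [1 if ch == symbol else 0 for ch in pattern]
        for i in range(n - m + 1):
            matches[i] += sum(t_bits[i + j] * p_bits[j] for j in range(m))

    # Stage 2: direct counting for the infrequent symbols.
    positions = {}
    for j, ch in enumerate(pattern):
        if ch not in frequent:
            positions.setdefault(ch, []).append(j)
    for k, ch in enumerate(text):
        for j in positions.get(ch, ()):
            if 0 <= k - j <= n - m:
                matches[k - j] += 1

    # Mismatches at alignment i = m minus the total number of matches.
    return [m - c for c in matches]
```

For example, `hamming_distances("abcabd", "abd")` gives one distance per alignment of the three-character pattern in the six-character text.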
  • 181. Improving the Time Complexity 1 - balance the stages. Current Definition: An alphabet symbol is frequent if it occurs at least √m times in P, and infrequent otherwise. What happens if we generalise this definition? New Definition: An alphabet symbol is frequent if it occurs at least f times in P, and infrequent otherwise. How long does each stage take now? Stage 0: Classify each symbol as frequent or infrequent - O(m log n) time (this stage is unaffected: the time complexity doesn't depend on f).
  • 186. Improving the Time Complexity 1 - balance the stages. New Definition: An alphabet symbol is frequent if it occurs at least f times in P, and infrequent otherwise. How long does each stage take now? Stage 1: Count all matches involving frequent symbols. Matches with a single symbol can be found using a cross-correlation. [Figure: binary indicator strings Tσ and Pσ for a single symbol σ.] As each frequent symbol occurs at least f times and P has only m characters, there are at most m/f frequent symbols, and we do one cross-correlation for each frequent symbol. Stage 1 therefore takes O((m/f) · n log n) time.
  • 190. Improving the Time Complexity 1 - balance the stages. New Definition: An alphabet symbol is frequent if it occurs at least f times in P, and infrequent otherwise. How long does each stage take now? Stage 2: Count all matches involving infrequent symbols. Matches with an infrequent symbol can be found by direct counting. [Figure: an example text T and pattern P with the count array A.] We make a single pass through T, and for each T[i] we update at most (f − 1) locations in A, because an infrequent symbol occurs fewer than f times in P. Stage 2 therefore takes O(nf) time.
  • 194. Pattern matching with mismatches: putting it all together. (Generalised) Algorithm summary. Definition: An alphabet symbol is frequent if it occurs at least f times in P, and infrequent otherwise. Stage 0: Classify each symbol as frequent or infrequent (by alphabetically sorting the characters from P) - O(m log n) time. Stage 1: Count all matches involving frequent symbols - O((nm/f) log n) time. Matches with a single symbol can be found using a cross-correlation. [Figure: binary indicator strings Tσ and Pσ for a single symbol σ.] Stage 2: Count all matches involving infrequent symbols - O(nf) time. Matches with an infrequent symbol can be found by direct counting. [Figure: an example text T and pattern P with the count array A.] At any alignment i, the number of mismatches is just m minus the total number of matches. What should we set f to? Let f = √(m log n)...
  • 198. Pattern matching with mismatches: putting it all together. (Generalised) Algorithm summary. What should we set f to? Let f = √(m log n). Definition: An alphabet symbol is frequent if it occurs at least √(m log n) times in P, and infrequent otherwise. Stage 0: Classify each symbol as frequent or infrequent (by alphabetically sorting the characters from P) - O(m log n) time. Stage 1: Count all matches involving frequent symbols - O(n · (m/√(m log n)) · log n) = O(n√(m log n)) time. Matches with a single symbol can be found using a cross-correlation. [Figure: binary indicator strings Tσ and Pσ for a single symbol σ.] Stage 2: Count all matches involving infrequent symbols - O(n√(m log n)) time. Matches with an infrequent symbol can be found by direct counting. [Figure: an example text T and pattern P with the count array A.] At any alignment i, the number of mismatches is just m minus the total number of matches. This improves the overall time complexity from O(n · √m · log n) to O(n√(m log n)).
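One way to see where this choice of f comes from is to balance the costs of Stage 1 and Stage 2. The following is a sketch of that calculation, using only the stage bounds stated in the generalised summary:

```latex
\begin{align*}
\frac{nm}{f}\,\log n &= nf
  && \text{(set the Stage 1 and Stage 2 bounds equal)} \\
f^{2} &= m \log n \\
f &= \sqrt{m \log n} \\
O\!\left(m \log n + \frac{nm}{f}\,\log n + nf\right)
  &= O\!\left(n\sqrt{m \log n}\right)
  && \text{(Stage 0 is dominated, since } \sqrt{m \log n} \le n\text{)}
\end{align*}
```

A smaller f makes Stage 1 more expensive (more frequent symbols, hence more cross-correlations), while a larger f makes Stage 2 more expensive (longer position lists), so the total is minimised, up to constant factors, where the two bounds meet.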
  • 217. Improving the time complexity 2 - split the text. We have just seen an algorithm which takes O(n√(m log n)) time. Imagine that n is a lot bigger than m... Split T into O(n/m) contiguous substrings T1, T2, T3, ..., each of length 2m (the final substring might be shorter). Run the previous algorithm once for P and each Tk. How long does running the previous algorithm take? O(|Tk| √(m log |Tk|)) time, which is O(m√(m log m)) time because |Tk| ≤ 2m. We run the previous algorithm O(n/m) times, so this process takes O(n√(m log m)) time in total. What about an alignment where P spans two adjacent substrings? [Figure: T split into blocks T1, ..., T11 of length 2m, with P (length m) placed across a block boundary.]
  • 220. Improving the time complexity 2 - split the text. We have just seen an algorithm which takes O(n√(m log n)) time. Imagine that n is a lot bigger than m... Split T into O(n/m) overlapping substrings T1, T2, T3, ..., each of length 2m, with consecutive substrings starting m characters apart, so that every alignment of P lies entirely inside some Tk. Run the previous algorithm once for P and each Tk. How long does running the previous algorithm take? O(|Tk| √(m log |Tk|)) time, which is O(m√(m log m)) time because |Tk| ≤ 2m. We run the previous algorithm O(n/m) times, so this process takes O(n√(m log m)) time in total. [Figure: overlapping blocks T1, T3, ..., T21 and T2, T4, ..., T20, each of length 2m and offset from the previous block by m.]
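The overlapping split can be sketched as a wrapper around any routine that computes Hamming distances for one block. In this illustration the names `hamming_by_blocks`, `solve` and `naive_hamming` are hypothetical; `naive_hamming` is only a placeholder for "the previous algorithm" so that the example runs on its own.

```python
def naive_hamming(text, pattern):
    """Placeholder block solver: O(|text| * m) Hamming distances."""
    m = len(pattern)
    return [
        sum(text[i + j] != pattern[j] for j in range(m))
        for i in range(len(text) - m + 1)
    ]


def hamming_by_blocks(text, pattern, solve=naive_hamming):
    """Compute Ham(i) for all alignments by running `solve` on overlapping
    blocks of length 2m whose start positions are m characters apart."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []

    result = [None] * (n - m + 1)
    # Blocks start at 0, m, 2m, ...; the final block may be shorter than 2m.
    for start in range(0, n - m + 1, m):
        block = text[start:start + 2 * m]
        # Alignment `offset` inside the block is alignment start + offset in T.
        for offset, distance in enumerate(solve(block, pattern)):
            result[start + offset] = distance
    return result
```

Each block is at most 2m characters long, so the cost of `solve` per block depends on m only, and there are O(n/m) blocks. Alignments that are multiples of m are computed twice (once at the end of one block and once at the start of the next) with the same value.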
  • 221. Conclusion. Input: A text string T (length n) and a pattern string P (length m). Goal: For every alignment i, output Ham(i), the Hamming distance between P and T[i . . . i + m − 1] (the Hamming distance is the number of mismatches). [Figure: an example text and pattern for which Ham(8) = 3.] A naive algorithm for this problem takes O(nm) time. We have seen two alternative algorithms. One algorithm takes O(n|Σ| log n) time (where |Σ| is the alphabet size) and can be improved to O(n|Σ| log m) (by splitting the text). The other algorithm takes O(n · √m · log n) time (regardless of the alphabet size) and can be improved to O(n√(m log m)) (by changing the freq./infreq. cut-off and splitting the text).