2. Strings
Let Σ be an alphabet, e.g. Σ = {‘ ’, a, b, c, …, z}
A string is any member of Σ*, i.e. any sequence of 0 or more members of Σ
‘this is a string’ ∈ Σ*
‘this is also a string’ ∈ Σ*
‘1234’ ∉ Σ*
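A minimal sketch of the membership test for Σ*, assuming the example alphabet is the space character plus the lowercase letters a to z. The names `SIGMA` and `in_sigma_star` are illustrative and do not come from the slides.

```python
import string

# Assumed alphabet: the space character plus the lowercase letters a..z.
SIGMA = set(" " + string.ascii_lowercase)

def in_sigma_star(s: str) -> bool:
    """Return True if every symbol of s belongs to SIGMA (the empty string qualifies)."""
    return all(ch in SIGMA for ch in s)

print(in_sigma_star("this is a string"))  # True
print(in_sigma_star("1234"))              # False: digits are not in SIGMA
print(in_sigma_star(""))                  # True: a sequence of 0 symbols
```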
3. String operations
Given strings s1 of length n and s2 of length m
Equality: is s1 = s2? (case sensitive or insensitive)
Running time
O(min(n, m)), i.e. linear in the length of the shorter string
‘this is a string’ = ‘this is a string’
‘this is a string’ ≠ ‘this is another string’
‘this is a string’ =? ‘THIS IS A STRING’
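The comparison can be sketched as a character-by-character scan that stops at the first mismatch. The function name `equal` and the `case_sensitive` flag are hypothetical, and `str.casefold` is just one way to ignore case.

```python
def equal(s1: str, s2: str, case_sensitive: bool = True) -> bool:
    """Compare two strings symbol by symbol, stopping at the first mismatch."""
    if not case_sensitive:
        # Normalising case first costs O(n + m); shown for illustration only.
        s1, s2 = s1.casefold(), s2.casefold()
    if len(s1) != len(s2):
        return False
    for a, b in zip(s1, s2):
        if a != b:
            return False
    return True

print(equal("this is a string", "this is a string"))         # True
print(equal("this is a string", "this is another string"))   # False
print(equal("this is a string", "THIS IS A STRING"))         # False
print(equal("this is a string", "THIS IS A STRING", False))  # True
```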
4. String operations
Concatenate (append): create string s1s2
Running time
Θ(n+m)
‘this is a’ . ‘ string’ → ‘this is a string’
5. String operations
Substitute: Exchange all occurrences of a particular
character with another character
Running time
Θ(n)
Substitute(‘this is a string’, ‘i’, ‘x’) → ‘thxs xs a strxng’
Substitute(‘banana’, ‘a’, ‘o’) → ‘bonono’
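As a rough illustration, concatenation and substitution can be written as follows. The function names are placeholders chosen to mirror the slide notation. Both build a new string, which is where the Θ(n+m) and Θ(n) costs come from.

```python
def concatenate(s1: str, s2: str) -> str:
    """Create the string s1s2; every symbol of both inputs is copied once."""
    return s1 + s2

def substitute(s: str, old: str, new: str) -> str:
    """Exchange all occurrences of the character old with the character new."""
    return "".join(new if ch == old else ch for ch in s)

print(concatenate("this is a", " string"))      # this is a string
print(substitute("this is a string", "i", "x")) # thxs xs a strxng
print(substitute("banana", "a", "o"))           # bonono
```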
9. Edit distance EXAM MUST
(aka Levenshtein distance)
Edit distance between two strings is the minimum number of
insertions, deletions and substitutions required to transform
string s1 into string s2
Insertion:
ABACED → ABACCED (insert ‘C’) → DABACCED (insert ‘D’)
12. Edit distance
Deletion:
ABACED → BACED (delete ‘A’) → BACE (delete ‘D’)
13. Edit distance
Substitution:
ABACED → ABADED (sub ‘D’ for ‘C’) → ABADES (sub ‘S’ for ‘D’)
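The three edit operations can be sketched as small helpers that act at a given position. The helper names and the 0-based index convention are assumptions made for this example. The calls reproduce the ABACED examples from the slides.

```python
def insert_char(s: str, i: int, ch: str) -> str:
    """Insert ch so that it ends up at index i."""
    return s[:i] + ch + s[i:]

def delete_char(s: str, i: int) -> str:
    """Delete the character at index i."""
    return s[:i] + s[i + 1:]

def substitute_char(s: str, i: int, ch: str) -> str:
    """Replace the character at index i with ch."""
    return s[:i] + ch + s[i + 1:]

# Insertion: ABACED -> ABACCED -> DABACCED
print(insert_char(insert_char("ABACED", 3, "C"), 0, "D"))
# Deletion: ABACED -> BACED -> BACE
print(delete_char(delete_char("ABACED", 0), 4))
# Substitution: ABACED -> ABADED -> ABADES
print(substitute_char(substitute_char("ABACED", 3, "D"), 5, "S"))
```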
16. Edit distance examples
Edit(Banana, Car) = 5
Operations (starting from Banana):
Delete ‘B’ → anana
Delete ‘a’ → nana
Delete ‘n’ → naa
Sub ‘C’ for ‘n’ → Caa
Sub ‘r’ for ‘a’ → Car
17. Edit distance examples
Edit(Simple, Apple) = 3 (number of operations needed)
Operations (starting from Simple):
Delete ‘S’ → imple
Sub ‘A’ for ‘i’ → Ample
Sub ‘p’ for ‘m’ → Apple
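18. Is edit distance symmetric (reversible)?
that is, is Edit(s1, s2) = Edit(s2, s1)?
Why? Yes: every operation has an inverse operation of the same cost, so a shortest sequence from s1 to s2 can be reversed into an equally long sequence from s2 to s1.
sub ‘i’ for ‘j’ → sub ‘j’ for ‘i’
delete ‘i’ → insert ‘i’
insert ‘i’ → delete ‘i’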
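One way to make the symmetry argument concrete is to record an edit script and invert it step by step. The tuple encoding of operations and the names `apply_script` and `invert_script` are invented for this sketch. The script is the Banana to Car example from the earlier slide, with 0-based positions.

```python
def apply_script(s, script):
    """Apply a list of edit operations to s, one after another."""
    for op in script:
        if op[0] == "ins":            # ("ins", index, char)
            _, i, ch = op
            s = s[:i] + ch + s[i:]
        elif op[0] == "del":          # ("del", index, deleted char)
            _, i, ch = op
            assert s[i] == ch
            s = s[:i] + s[i + 1:]
        else:                         # ("sub", index, old char, new char)
            _, i, old, new = op
            assert s[i] == old
            s = s[:i] + new + s[i + 1:]
    return s

def invert_script(script):
    """Reverse the order and swap each operation for its inverse."""
    inverse = []
    for op in reversed(script):
        if op[0] == "ins":
            inverse.append(("del", op[1], op[2]))
        elif op[0] == "del":
            inverse.append(("ins", op[1], op[2]))
        else:
            _, i, old, new = op
            inverse.append(("sub", i, new, old))
    return inverse

forward = [("del", 0, "B"), ("del", 0, "a"), ("del", 2, "n"),
           ("sub", 0, "n", "C"), ("sub", 2, "a", "r")]
backward = invert_script(forward)
print(apply_script("Banana", forward))   # Car
print(apply_script("Car", backward))     # Banana
print(len(forward) == len(backward))     # True: same number of operations
```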
30. Equal
X = A B C B D A ?
Y = B D C A B ?
$$\mathrm{Edit}(X_{1 \ldots n}, Y_{1 \ldots m}) = \mathrm{Edit}(X_{1 \ldots n-1}, Y_{1 \ldots m-1})$$
31. Combining results
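$\mathrm{Edit}(X_{1 \ldots n}, Y_{1 \ldots m})$ is the minimum over the cases that apply:
Insert: $\mathrm{Edit}(X_{1 \ldots n}, Y_{1 \ldots m}) = 1 + \mathrm{Edit}(X_{1 \ldots n}, Y_{1 \ldots m-1})$
Delete: $\mathrm{Edit}(X_{1 \ldots n}, Y_{1 \ldots m}) = 1 + \mathrm{Edit}(X_{1 \ldots n-1}, Y_{1 \ldots m})$
Substitute: $\mathrm{Edit}(X_{1 \ldots n}, Y_{1 \ldots m}) = 1 + \mathrm{Edit}(X_{1 \ldots n-1}, Y_{1 \ldots m-1})$
Equal: $\mathrm{Edit}(X_{1 \ldots n}, Y_{1 \ldots m}) = \mathrm{Edit}(X_{1 \ldots n-1}, Y_{1 \ldots m-1})$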
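The four cases can be combined in a table filled bottom-up, taking the cheapest applicable case at each cell. This is a sketch of the standard dynamic-programming formulation, not necessarily the exact presentation used in the lecture. The function name `edit_distance` and the table name `d` are illustrative.

```python
def edit_distance(x: str, y: str) -> int:
    """d[i][j] holds Edit(x[:i], y[:j]); the answer is d[n][m]."""
    n, m = len(x), len(y)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                      # delete all i symbols of x
    for j in range(m + 1):
        d[0][j] = j                      # insert all j symbols of y
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                diagonal = d[i - 1][j - 1]       # Equal
            else:
                diagonal = 1 + d[i - 1][j - 1]   # Substitute
            d[i][j] = min(1 + d[i][j - 1],       # Insert
                          1 + d[i - 1][j],       # Delete
                          diagonal)
    return d[n][m]

print(edit_distance("Banana", "Car"))    # 5
print(edit_distance("Simple", "Apple"))  # 3
```

The table has (n+1)(m+1) cells and each cell takes constant time, so this sketch runs in Θ(nm) time and space.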
32. Rabin-Karp algorithm
P = ABA
S = BABABBABABA
- Use a function T that computes a numerical representation of P
- Calculate T for all m symbol sequences of S (m is the length of P) and compare
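The two steps above can be sketched directly, hashing every m symbol window of S from scratch. The slides do not specify T, so the polynomial hash here, with its constants `BASE` and `MOD`, is an assumption for illustration. A hash match is confirmed with a direct comparison because different strings can share a hash value.

```python
BASE = 256         # assumed radix for the polynomial hash
MOD = 1_000_003    # assumed prime modulus

def T(s: str) -> int:
    """A numerical representation of s (one possible choice of T)."""
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % MOD
    return h

def find_matches(p: str, s: str) -> list:
    """Hash each m symbol window of s and compare it with T(p)."""
    m = len(p)
    target = T(p)
    hits = []
    for i in range(len(s) - m + 1):
        window = s[i:i + m]
        if T(window) == target and window == p:  # verify to rule out collisions
            hits.append(i)
    return hits

print(find_matches("ABA", "BABABBABABA"))  # [1, 6, 8]
```

Recomputing T for every window costs O(m) per position, so this version is no faster than naive matching.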
33. Rabin-Karp algorithm
P = ABA
S = BABABBABABA
Hash P: T(P)
34. Rabin-Karp algorithm
P = ABA
S = BABABBABABA
Hash m symbol sequences and compare:
T(P) = T(BAB)?
35. Rabin-Karp algorithm
P = ABA
S = BABABBABABA
Hash m symbol sequences and compare:
T(P) = T(ABA)? match
36. Rabin-Karp algorithm
P = ABA
S = BABABBABABA
Hash m symbol sequences and compare:
T(P) = T(BAB)?
37. Rabin-Karp algorithm
P = ABA
S = BABABBABABA
Hash m symbol sequences and compare:
T(P) = T(BAB)?
…
38. Rabin-Karp algorithm
P = ABA
S = BABABBABABA
T(P) = T(BAB)?
…
For this to be useful/efficient, what needs to be true about T?
Given $T(s_{i \ldots i+m-1})$ we must be able to efficiently calculate $T(s_{i+1 \ldots i+m})$
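One choice of T with this property is a polynomial rolling hash, where the hash of the next window is obtained in constant time from the current one. The sketch below assumes that choice. The `base` and `mod` values and the function name `rabin_karp` are illustrative rather than taken from the slides.

```python
def rabin_karp(p: str, s: str, base: int = 256, mod: int = 1_000_003) -> list:
    """Return all start indices where p occurs in s, using a rolling hash."""
    n, m = len(s), len(p)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)   # weight of the leading symbol of a window
    hp = hs = 0
    for i in range(m):             # hash P and the first window of S
        hp = (hp * base + ord(p[i])) % mod
        hs = (hs * base + ord(s[i])) % mod
    hits = []
    for i in range(n - m + 1):
        if hp == hs and s[i:i + m] == p:   # verify to rule out collisions
            hits.append(i)
        if i + m < n:
            # Drop s[i], shift the window, append s[i + m]: O(1) per step.
            hs = ((hs - ord(s[i]) * high) * base + ord(s[i + m])) % mod
    return hits

print(rabin_karp("ABA", "BABABBABABA"))  # [1, 6, 8]
```

With constant-time updates the expected running time is O(n + m), although many hash collisions can push the worst case to O(nm) because each candidate is verified symbol by symbol.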