Pattern matching programs

Searching for Patterns | Set 1 (Naive Pattern Searching)
Given a text txt[0..n-1] and a pattern pat[0..m-1], write a function search(char pat[], char txt[]) that prints all
occurrences of pat[] in txt[]. You may assume that n > m.
Examples:
1) Input:
txt[] = "THIS IS A TEST TEXT"
pat[] = "TEST"
Output:
Pattern found at index 10
2) Input:
txt[] = "AABAACAADAABAAABAA"
pat[] = "AABA"
Output:
Pattern found at index 0
Pattern found at index 9
Pattern found at index 13
Pattern searching is an important problem in computer science. When we search for a string in a
Notepad or Word file, a browser, or a database, pattern searching algorithms are used to show the
search results.
Naive Pattern Searching: Slide the pattern over the text one position at a time and check for a match
at each position. If a match is found, slide by 1 again to check for subsequent matches.
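The sliding procedure described above can be sketched in C roughly as follows. This is a minimal illustration, not the original listing: instead of printing, it records each match position in a caller-supplied array, and the names naive_search, out, and max_out are assumptions made for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Naive search: stores the start index of each occurrence of pat in txt
 * into out (at most max_out entries) and returns the total number found.
 * The name and signature are hypothetical, chosen for this sketch. */
size_t naive_search(const char *pat, const char *txt,
                    size_t *out, size_t max_out)
{
    assert(pat != NULL && txt != NULL);
    size_t m = strlen(pat);
    size_t n = strlen(txt);
    size_t found = 0;

    if (m == 0 || m > n)
        return 0;

    for (size_t s = 0; s + m <= n; s++) {       /* try every shift s */
        size_t j = 0;
        while (j < m && txt[s + j] == pat[j])   /* compare left to right */
            j++;
        if (j == m) {                           /* whole pattern matched */
            if (found < max_out)
                out[found] = s;
            found++;
        }
    }
    return found;
}
```

A caller that wants the output format shown in the examples could loop over the returned positions and print "Pattern found at index" followed by each value.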

What is the best case?
The best case occurs when the first character of the pattern is not present in text at all.
txt[] = "AABCCAADDEE"
pat[] = "FAA"
The number of comparisons in best case is O(n).
What is the worst case?
The worst case of Naive Pattern Searching occurs in the following scenarios.
1) When all characters of the text and pattern are the same.
txt[] = "AAAAAAAAAAAAAAAAAA"
pat[] = "AAAAA"
2) Worst case also occurs when only the last character is different.
txt[] = "AAAAAAAAAAAAAAAAAB"
pat[] = "AAAAB"

The number of comparisons in the worst case is O(m*(n-m+1)). Although strings with repeated
characters are unlikely to appear in English text, they may well occur in other applications (for
example, in binary texts). The KMP matching algorithm improves the worst case to O(n). We will be
covering KMP in the next post. Also, we will be writing more posts to cover all pattern searching
algorithms and data structures.
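To make the best-case and worst-case counts concrete, the following C sketch counts the character comparisons that a naive search performs. The function name naive_comparisons is an assumption for this illustration; it mirrors the naive procedure but only tallies comparisons.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Counts the character comparisons made by a naive left-to-right search.
 * Hypothetical helper, written only to illustrate the case analysis. */
size_t naive_comparisons(const char *pat, const char *txt)
{
    assert(pat != NULL && txt != NULL);
    size_t m = strlen(pat);
    size_t n = strlen(txt);
    size_t count = 0;

    if (m == 0 || m > n)
        return 0;

    for (size_t s = 0; s + m <= n; s++) {
        for (size_t j = 0; j < m; j++) {
            count++;                      /* one comparison: txt[s+j] vs pat[j] */
            if (txt[s + j] != pat[j])
                break;                    /* give up on this shift */
        }
    }
    return count;
}
```

For the best-case example (pat "FAA", txt "AABCCAADDEE") every shift fails on its first comparison, so the count is n-m+1 = 9. For both worst-case examples every shift needs all m comparisons, so the count is m*(n-m+1).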

Searching for Patterns | Set 2 (KMP Algorithm)
We have discussed the Naive pattern searching algorithm in the previous program. The worst case
complexity of the Naive algorithm is O(m(n-m+1)). The time complexity of the KMP algorithm is O(n) in
the worst case.
KMP (Knuth Morris Pratt) Pattern Searching
The Naive pattern searching algorithm doesn’t work well in cases where we see many matching
characters followed by a mismatching character. Following are some examples.
txt[] = "AAAAAAAAAAAAAAAAAB"
pat[] = "AAAAB"
txt[] = "ABABABCABABABCABABABC"
pat[] = "ABABAC" (not a worst case, but a bad case for Naive)
The KMP matching algorithm uses the degenerating property of the pattern (the same sub-patterns
appearing more than once in the pattern) and improves the worst case complexity to O(n). The basic
idea behind KMP's algorithm is: whenever we detect a mismatch (after some matches), we already know
some of the characters in the text (since they matched the pattern characters prior to the mismatch). We
take advantage of this information to avoid re-matching characters that we know will match anyway.
The KMP algorithm does some preprocessing over the pattern pat[] and constructs an auxiliary array
lps[] of size m (same as the size of the pattern). The name lps stands for the longest proper prefix
which is also a suffix.

For each sub-pattern pat[0..i] where i = 0 to m-1, lps[i] stores the length of the longest proper
prefix of pat[0..i] which is also a suffix of pat[0..i]. A proper prefix is a prefix that is not the
whole string.
lps[i] = length of the longest proper prefix of pat[0..i]
which is also a suffix of pat[0..i].
Examples:
For the pattern “AABAACAABAA”, lps[] is [0, 1, 0, 1, 2, 0, 1, 2, 3, 4, 5]
For the pattern “ABCDE”, lps[] is [0, 0, 0, 0, 0]
For the pattern “AAAAA”, lps[] is [0, 1, 2, 3, 4]
For the pattern “AAABAAA”, lps[] is [0, 1, 2, 0, 1, 2, 3]
For the pattern “AAACAAAAAC”, lps[] is [0, 1, 2, 0, 1, 2, 3, 3, 3, 4]
Searching Algorithm:
Unlike the Naive algorithm, where we slide the pattern by one, we use a value from lps[] to decide the
next sliding position. Let us see how we do that. When we compare pat[j] with txt[i] and see a
mismatch, we know that characters pat[0..j-1] match txt[i-j..i-1], and we also know that the first
lps[j-1] characters of pat[0..j-1] are both a proper prefix and a suffix of it. This means we do not
need to match these lps[j-1] characters against the end of txt[i-j..i-1], because we know they will
match anyway. So we set j to lps[j-1] and continue comparing at txt[i]; if j is already 0, we simply
move on to txt[i+1]. See KMPSearch() in the below code for details.
Preprocessing Algorithm:
In the preprocessing part, we calculate the values in lps[]. To do that, we keep track of the length of
the longest prefix suffix value for the previous index (we use the variable len for this purpose). We
initialize lps[0] and len to 0. If pat[len] and pat[i] match, we increment len by 1 and assign the
incremented value to lps[i]. If pat[i] and pat[len] do not match and len is not 0, we update len to
lps[len-1] and compare again without advancing i. If they do not match and len is 0, we set lps[i] to 0
and move to the next i. See computeLPSArray() in the below code for details.
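The preprocessing and searching steps described above can be sketched in C as follows. This is an illustrative reconstruction under stated assumptions, not the original listing: the function names follow the text, but the signatures (explicit length m, an output array out with capacity max_out, a returned match count) are choices made for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Fills lps[0..m-1]; lps[i] is the length of the longest proper prefix of
 * pat[0..i] that is also a suffix of pat[0..i]. Requires m >= 1. */
void computeLPSArray(const char *pat, size_t m, size_t *lps)
{
    size_t len = 0;               /* longest prefix-suffix length so far */
    size_t i = 1;

    lps[0] = 0;                   /* one character has no proper prefix */
    while (i < m) {
        if (pat[i] == pat[len]) {
            lps[i++] = ++len;     /* extend the previous prefix-suffix */
        } else if (len != 0) {
            len = lps[len - 1];   /* fall back; i is not advanced */
        } else {
            lps[i++] = 0;         /* nothing to fall back to */
        }
    }
}

/* Stores match positions in out (at most max_out) and returns the count.
 * Returning 0 on allocation failure is a simplification of this sketch. */
size_t KMPSearch(const char *pat, const char *txt,
                 size_t *out, size_t max_out)
{
    assert(pat != NULL && txt != NULL);
    size_t m = strlen(pat);
    size_t n = strlen(txt);
    size_t found = 0;

    if (m == 0 || m > n)
        return 0;

    size_t *lps = malloc(m * sizeof *lps);
    if (lps == NULL)
        return 0;
    computeLPSArray(pat, m, lps);

    size_t i = 0;                 /* index into txt */
    size_t j = 0;                 /* index into pat */
    while (i < n) {
        if (txt[i] == pat[j]) {
            i++;
            j++;
            if (j == m) {         /* full match ending at txt[i-1] */
                if (found < max_out)
                    out[found] = i - j;
                found++;
                j = lps[j - 1];   /* keep going for overlapping matches */
            }
        } else if (j != 0) {
            j = lps[j - 1];       /* reuse the known prefix-suffix */
        } else {
            i++;                  /* mismatch at pat[0]; advance in txt */
        }
    }
    free(lps);
    return found;
}
```

Note that i never moves backwards in the search loop, which is where the O(n) search bound comes from.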

Searching for Patterns | Set 3 (Rabin-Karp Algorithm)
The Naive String Matching algorithm slides the pattern one position at a time. After each slide, it
checks the characters at the current shift one by one, and if all characters match it prints the match.
Like the Naive algorithm, the Rabin-Karp algorithm also slides the pattern one position at a time. But
unlike the Naive algorithm, Rabin-Karp compares the hash value of the pattern with the hash value of
the current substring of text, and only if the hash values match does it start matching individual
characters. So the Rabin-Karp algorithm needs to calculate hash values for the following strings.
1) Pattern itself.
2) All the substrings of text of length m.
Since we need to efficiently calculate hash values for all the substrings of size m of the text, we must
have a hash function with the following property: the hash at the next shift must be efficiently
computable from the current hash value and the next character in the text (together with the character
that leaves the window). In other words, hash(txt[s+1 .. s+m]) must be efficiently computable from
hash(txt[s .. s+m-1]) and txt[s+m], i.e., hash(txt[s+1 .. s+m]) = rehash(txt[s+m], hash(txt[s .. s+m-1])),
and rehash must be an O(1) operation.
The hash function suggested by Rabin and Karp calculates an integer value. The integer value for a
string is the numeric value of the string. For example, if all possible characters are from 1 to 10, the
numeric value of “122” will be 122. The number of possible characters is higher than 10 (256 in general)
and the pattern length can be large, so the numeric values cannot practically be stored as an integer.
Therefore, the numeric value is calculated using modular arithmetic to make sure that the hash values
can be stored in an integer variable (can fit in memory words). To do rehashing, we need to take off the
most significant digit and add the new least significant digit to the hash value. Rehashing is done
using the following formula.
hash( txt[s+1 .. s+m] ) = ( d * ( hash( txt[s .. s+m-1] ) - txt[s]*h ) + txt[s+m] ) mod q

hash( txt[s .. s+m-1] ) : Hash value at shift s.
hash( txt[s+1 .. s+m] ) : Hash value at next shift (or shift s+1)
d: Number of characters in the alphabet
q: A prime number
h: d^(m-1) (reduced mod q in practice)
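The rehashing formula can be sketched in C as below. This is a minimal illustration under stated assumptions: d is taken to be 256 (all byte values), the prime q is passed in by the caller and assumed small enough that 256*256*q fits in a long long, and the name rabin_karp_search with its out and max_out parameters is hypothetical. Because different strings can share a hash value, a hash match is confirmed with a direct comparison.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define D 256   /* d: alphabet size, assumed here to be all byte values */

/* Stores match positions in out (at most max_out) and returns the count.
 * q is the prime modulus; the signature is an assumption of this sketch. */
size_t rabin_karp_search(const char *pat, const char *txt, long long q,
                         size_t *out, size_t max_out)
{
    assert(pat != NULL && txt != NULL && q > 1);
    size_t m = strlen(pat);
    size_t n = strlen(txt);
    size_t found = 0;

    if (m == 0 || m > n)
        return 0;

    long long h = 1;                     /* h = d^(m-1) mod q */
    for (size_t k = 0; k + 1 < m; k++)
        h = (h * D) % q;

    long long p = 0;                     /* hash of the pattern */
    long long t = 0;                     /* hash of the current window */
    for (size_t k = 0; k < m; k++) {
        p = (D * p + (unsigned char)pat[k]) % q;
        t = (D * t + (unsigned char)txt[k]) % q;
    }

    for (size_t s = 0; s + m <= n; s++) {
        /* Compare characters only when the hash values agree. */
        if (p == t && memcmp(pat, txt + s, m) == 0) {
            if (found < max_out)
                out[found] = s;
            found++;
        }
        if (s + m < n) {
            /* Drop txt[s], shift by one digit, append txt[s+m]. */
            t = (D * (t - (unsigned char)txt[s] * h)
                 + (unsigned char)txt[s + m]) % q;
            if (t < 0)
                t += q;                  /* C's % can yield a negative value */
        }
    }
    return found;
}
```

A small prime such as 101 is enough to try the sketch; a larger prime reduces the number of spurious hash matches that need character-by-character checking.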

Searching for Patterns | Set 4 (Finite Automata)
We have discussed the following algorithms in the previous posts:
Naive Algorithm
KMP Algorithm
Rabin Karp Algorithm
In this post, we will discuss the Finite Automata (FA) based pattern searching algorithm. In the FA
based algorithm, we preprocess the pattern and build a 2D array that represents a finite automaton.
Construction of the FA is the tricky part of this algorithm. Once the FA is built, the searching is
simple. In search, we start from the first state of the automaton and the first character of the text.
At every step, we consider the next character of the text, look up the next state in the built FA, and
move to that state. If we reach the final state, then the pattern is found in the text. The time
complexity of the search process is O(n).

Before we discuss FA construction, let us take a look at the FA for the pattern ACACAGA.
[Figure: state diagram and transition table of the FA for the pattern ACACAGA]
The number of states in the FA will be M+1, where M is the length of the pattern. The main task in
constructing the FA is to get the next state from the current state for every possible character. Given
a character x and a state k, we can get the next state by considering the string “pat[0..k-1]x”, which
is the concatenation of pattern characters pat[0], pat[1] … pat[k-1] and the character x. The idea is
to get the length of the longest prefix of the given pattern such that the prefix is also a suffix of
“pat[0..k-1]x”. This length gives us the next state. For example, let us see how to get the next state
from current state 5 and character ‘C’ in the above diagram. We need to consider the string
“pat[0..4]C”, which is “ACACAC”. The length of the longest prefix of the pattern such that the prefix
is a suffix of “ACACAC” is 4 (“ACAC”). So the next state (from state 5) is 4 for character ‘C’.
In the following code, computeTF() constructs the FA. The time complexity of computeTF() is
O(m^3*NO_OF_CHARS), where m is the length of the pattern and NO_OF_CHARS is the size of the alphabet
(total number of possible characters in pattern and text). The implementation tries all possible
prefixes, starting from the longest one that can be a suffix of “pat[0..k-1]x”. There are better
implementations that construct the FA in O(m*NO_OF_CHARS) (Hint: we can use something like the lps
array construction in the KMP algorithm).
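The construction and search described above can be sketched in C as follows. This is an illustrative reconstruction, not the original listing: computeTF and NO_OF_CHARS follow the names in the text, while getNextState, fa_search, the flat row-by-row table layout, and the out and max_out parameters are assumptions made for this sketch. It uses the straightforward O(m^3*NO_OF_CHARS) construction, not the faster one hinted at.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define NO_OF_CHARS 256

/* Next state from `state` on character x: the length of the longest prefix
 * of pat that is also a suffix of pat[0..state-1] followed by x. */
static size_t getNextState(const char *pat, size_t m, size_t state, int x)
{
    if (state < m && x == (unsigned char)pat[state])
        return state + 1;                  /* x extends the current match */

    /* Try prefix lengths from the longest possible down to 1. */
    for (size_t ns = state; ns > 0; ns--) {
        if ((unsigned char)pat[ns - 1] != x)
            continue;                      /* the prefix must end with x */
        size_t i = 0;
        while (i + 1 < ns && pat[i] == pat[state - ns + 1 + i])
            i++;
        if (i + 1 == ns)
            return ns;                     /* pat[0..ns-1] is a suffix */
    }
    return 0;
}

/* Fills tf, an (m+1) x NO_OF_CHARS table stored row by row. */
void computeTF(const char *pat, size_t m, size_t *tf)
{
    for (size_t state = 0; state <= m; state++)
        for (int x = 0; x < NO_OF_CHARS; x++)
            tf[state * NO_OF_CHARS + x] = getNextState(pat, m, state, x);
}

/* Stores match positions in out (at most max_out) and returns the count.
 * Returning 0 on allocation failure is a simplification of this sketch. */
size_t fa_search(const char *pat, const char *txt,
                 size_t *out, size_t max_out)
{
    assert(pat != NULL && txt != NULL);
    size_t m = strlen(pat);
    size_t n = strlen(txt);
    size_t found = 0;

    if (m == 0 || m > n)
        return 0;

    size_t *tf = malloc((m + 1) * NO_OF_CHARS * sizeof *tf);
    if (tf == NULL)
        return 0;
    computeTF(pat, m, tf);

    size_t state = 0;                      /* start in the first state */
    for (size_t i = 0; i < n; i++) {
        state = tf[state * NO_OF_CHARS + (unsigned char)txt[i]];
        if (state == m) {                  /* final state: match ends at i */
            if (found < max_out)
                out[found] = i + 1 - m;
            found++;
        }
    }
    free(tf);
    return found;
}
```

For the pattern ACACAGA, the table entry for state 5 and character ‘C’ comes out as 4, which agrees with the worked example in the text.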
