Searching for Patterns | Set 1 (Naive Pattern Searching)
Given a text txt[0..n-1] and a pattern pat[0..m-1], write a function search(char pat[], char txt[]) that prints all
occurrences of pat[] in txt[]. You may assume that n > m.
Examples:
1) Input:
txt[] = "THIS IS A TEST TEXT"
pat[] = "TEST"
Output:
Pattern found at index 10
2) Input:
txt[] = "AABAACAADAABAAABAA"
pat[] = "AABA"
Output:
Pattern found at index 0
Pattern found at index 9
Pattern found at index 13
Pattern searching is an important problem in computer science. When we search for a string in a
Notepad or Word file, a browser, or a database, pattern searching algorithms are used to produce the
search results.
Naive Pattern Searching: Slide the pattern over the text one position at a time and check for a match
at each position. If a match is found, slide by 1 again to check for subsequent matches.
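The sliding procedure described above can be sketched in C roughly as follows. This is a minimal sketch, not the code from the original slides: the name naive_search and the out/max_out parameters (used to return the match positions in addition to printing them) are assumptions made for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Naive search: try every shift s of pat over txt and compare character by
 * character. Each match is printed, and its start index is also stored in
 * out[] (at most max_out entries). Returns the total number of matches.
 * The out/max_out parameters are an assumption of this sketch. */
int naive_search(const char *pat, const char *txt, int *out, int max_out)
{
    int m = (int)strlen(pat);
    int n = (int)strlen(txt);
    int count = 0;

    if (m == 0 || m > n)
        return 0;

    for (int s = 0; s <= n - m; s++) {       /* slide the pattern by one */
        int j = 0;
        while (j < m && txt[s + j] == pat[j])
            j++;
        if (j == m) {                         /* all m characters matched */
            printf("Pattern found at index %d\n", s);
            if (count < max_out)
                out[count] = s;
            count++;
        }
    }
    return count;
}
```

Calling naive_search("AABA", "AABAACAADAABAAABAA", idx, 8) with an int idx[8] should report the indices 0, 9 and 13 from the second example.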
Output (for the second example above):
Pattern found at index 0
Pattern found at index 9
Pattern found at index 13
What is the best case?
The best case occurs when the first character of the pattern does not occur in the text at all, so
every shift fails on its first comparison.
txt[] = "AABCCAADDEE"
pat[] = "FAA"
The number of comparisons in best case is O(n).
What is the worst case?
The worst case of Naive Pattern Searching occurs in the following scenarios.
1) When all characters of the text and the pattern are the same.
txt[] = "AAAAAAAAAAAAAAAAAA"
pat[] = "AAAAA"
2) Worst case also occurs when only the last character is different.
txt[] = "AAAAAAAAAAAAAAAAAB"
pat[] = "AAAAB"
The number of comparisons in the worst case is O(m*(n-m+1)). Although strings with long runs of
repeated characters are unlikely to appear in English text, they may well occur in other applications
(for example, in binary texts). The KMP matching algorithm improves the worst case to O(n). We will be
covering KMP in the next post. We will also be writing more posts to cover all pattern searching
algorithms and data structures.
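To make the best-case and worst-case counts concrete, the following sketch counts the character comparisons the naive algorithm performs. The function name naive_comparisons is hypothetical and is introduced only for this illustration.

```c
#include <string.h>

/* Counts the character comparisons made by the naive algorithm when
 * searching for pat in txt. Each shift stops at its first mismatch. */
long naive_comparisons(const char *pat, const char *txt)
{
    long m = (long)strlen(pat);
    long n = (long)strlen(txt);
    long comparisons = 0;

    for (long s = 0; s + m <= n; s++) {
        for (long j = 0; j < m; j++) {
            comparisons++;
            if (txt[s + j] != pat[j])
                break;                /* mismatch: move to the next shift */
        }
    }
    return comparisons;
}
```

For a pattern whose first character never occurs in the text, the count is n-m+1 (one comparison per shift). For a text and pattern made of the same repeated character, or differing only in the last character, every shift costs m comparisons, giving m*(n-m+1).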
Searching for Patterns | Set 2 (KMP Algorithm)
We discussed the Naive pattern searching algorithm in the previous set. The worst-case complexity of
the Naive algorithm is O(m(n-m+1)). The time complexity of the KMP algorithm is O(n) in the worst case.
KMP (Knuth Morris Pratt) Pattern Searching
The Naive pattern searching algorithm does not work well when many matching characters are followed
by a mismatching character. Some examples follow.
txt[] = "AAAAAAAAAAAAAAAAAB"
pat[] = "AAAAB"
txt[] = "ABABABCABABABCABABABC"
pat[] = "ABABAC" (not a worst case, but a bad case for Naive)
The KMP matching algorithm exploits the degenerating property of the pattern (the same sub-patterns
appearing more than once in the pattern) and improves the worst-case complexity to O(n). The basic
idea behind KMP is this: whenever we detect a mismatch after some matches, we already know some of the
characters in the text, since they matched the pattern characters before the mismatch. We use this
information to avoid re-comparing characters that we know will match anyway.
The KMP algorithm preprocesses the pattern pat[] and constructs an auxiliary array lps[] of size m
(the size of the pattern). The name lps stands for the longest proper prefix which is also a suffix.
For each sub-pattern pat[0..i], where i = 0 to m-1, lps[i] stores the length of the longest proper
prefix of pat[0..i] that is also a suffix of pat[0..i]. A proper prefix is any prefix other than the
whole string.
lps[i] = length of the longest proper prefix of pat[0..i]
which is also a suffix of pat[0..i].
Examples:
For the pattern “AABAACAABAA”, lps[] is [0, 1, 0, 1, 2, 0, 1, 2, 3, 4, 5]
For the pattern “ABCDE”, lps[] is [0, 0, 0, 0, 0]
For the pattern “AAAAA”, lps[] is [0, 1, 2, 3, 4]
For the pattern “AAABAAA”, lps[] is [0, 1, 2, 0, 1, 2, 3]
For the pattern “AAACAAAAAC”, lps[] is [0, 1, 2, 0, 1, 2, 3, 3, 3, 4]
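The lps values listed above can be checked directly against the definition. The sketch below is a deliberately slow brute-force check, not the preprocessing used by KMP itself, and the function name lps_by_definition is an assumption made for illustration.

```c
#include <string.h>

/* Fills lps[0..m-1] straight from the definition: for each prefix pat[0..i],
 * try proper-prefix lengths from longest to shortest and keep the first one
 * that is also a suffix. Roughly O(m^3); intended only for small examples. */
void lps_by_definition(const char *pat, int *lps)
{
    int m = (int)strlen(pat);

    for (int i = 0; i < m; i++) {
        lps[i] = 0;
        for (int len = i; len > 0; len--) {   /* proper: len < i + 1 */
            /* compare prefix pat[0..len-1] with suffix pat[i-len+1..i] */
            if (memcmp(pat, pat + i - len + 1, (size_t)len) == 0) {
                lps[i] = len;
                break;
            }
        }
    }
}
```

Running it on “AAACAAAAAC” should reproduce [0, 1, 2, 0, 1, 2, 3, 3, 3, 4].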
Searching Algorithm:
Unlike the Naive algorithm, where we slide the pattern by one, we use a value from lps[] to decide the
next sliding position. When we compare pat[j] with txt[i] and see a mismatch, we know that the
characters pat[0..j-1] match txt[i-j..i-1]. We also know that the first lps[j-1] characters of
pat[0..j-1] are both a proper prefix and a suffix of it, so they already match the last lps[j-1]
characters of txt[i-j..i-1] and need not be compared again. We therefore set j to lps[j-1] and keep i
unchanged; if j is already 0, we simply advance i. The original code implements this in KMPSearch().
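The search step can be sketched as below. To keep the block short it assumes that lps[] has already been computed for the pattern and is passed in by the caller. The name kmp_search and the out/max_out parameters are assumptions of this sketch, not the KMPSearch() code from the original slides.

```c
#include <string.h>

/* KMP search using a precomputed lps[] array for pat.
 * Stores match start indices in out[] (at most max_out entries) and
 * returns the total number of matches. */
int kmp_search(const char *pat, const char *txt, const int *lps,
               int *out, int max_out)
{
    int m = (int)strlen(pat);
    int n = (int)strlen(txt);
    int i = 0;        /* index into txt */
    int j = 0;        /* index into pat */
    int count = 0;

    if (m == 0)
        return 0;

    while (i < n) {
        if (txt[i] == pat[j]) {
            i++;
            j++;
            if (j == m) {                 /* full match ending at i-1 */
                if (count < max_out)
                    out[count] = i - j;
                count++;
                j = lps[j - 1];           /* continue, allowing overlaps */
            }
        } else if (j > 0) {
            j = lps[j - 1];               /* reuse the matched prefix; i stays */
        } else {
            i++;                          /* nothing matched at this position */
        }
    }
    return count;
}
```

Note that i never moves backwards, which is why the search runs in O(n) once lps[] is available.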
Preprocessing Algorithm:
In the preprocessing part, we calculate the values in lps[]. To do that, we keep track of the length
of the longest prefix that is also a suffix for the previous index (we use the variable len for this
purpose). We initialize lps[0] and len to 0. If pat[len] and pat[i] match, we increment len by 1,
assign the incremented value to lps[i], and move to the next i. If pat[i] and pat[len] do not match
and len is not 0, we update len to lps[len-1] and compare again without advancing i. If they do not
match and len is 0, we set lps[i] to 0 and move to the next i. The original code implements this in
computeLPSArray().
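The three cases of the preprocessing step map onto a short loop. The following is a sketch under the description above. The name compute_lps is an assumption and is not meant to reproduce the original computeLPSArray() exactly.

```c
#include <string.h>

/* Fills lps[0..m-1] for pat in O(m) time.
 * len is the length of the longest proper prefix-suffix of pat[0..i-1]. */
void compute_lps(const char *pat, int *lps)
{
    int m = (int)strlen(pat);
    int len = 0;
    int i = 1;

    if (m == 0)
        return;
    lps[0] = 0;                    /* a single character has no proper prefix */

    while (i < m) {
        if (pat[i] == pat[len]) {
            len++;                 /* case 1: extend the current prefix-suffix */
            lps[i] = len;
            i++;
        } else if (len != 0) {
            len = lps[len - 1];    /* case 2: fall back, do not advance i */
        } else {
            lps[i] = 0;            /* case 3: no prefix-suffix ends at i */
            i++;
        }
    }
}
```

For “AABAACAABAA” this should produce [0, 1, 0, 1, 2, 0, 1, 2, 3, 4, 5], matching the example listed earlier.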
Searching for Patterns | Set 3 (Rabin-Karp Algorithm)
The Naive String Matching algorithm slides the pattern one position at a time. After each slide, it
checks the characters at the current shift one by one, and if all characters match, it prints the match.
Like the Naive algorithm, the Rabin-Karp algorithm also slides the pattern one position at a time. But
unlike the Naive algorithm, Rabin-Karp compares the hash value of the pattern with the hash value of the
current substring of the text, and starts matching individual characters only if the hash values match.
So the Rabin-Karp algorithm needs to calculate hash values for the following strings.
1) The pattern itself.
2) All the substrings of the text of length m.
Since we need to efficiently calculate hash values for all the substrings of length m of the text, the
hash function must have the following property: the hash at the next shift must be efficiently
computable from the current hash value, the character leaving the window, and the next character in the
text. In other words, hash(txt[s+1 .. s+m]) must be efficiently computable from hash(txt[s .. s+m-1]),
txt[s], and txt[s+m], i.e., hash(txt[s+1 .. s+m]) = rehash(txt[s], txt[s+m], hash(txt[s .. s+m-1])),
and rehash must be an O(1) operation.
The hash function suggested by Rabin and Karp calculates an integer value. The integer value for a
string is the numeric value of the string. For example, if the only possible characters are the decimal
digits, the numeric value of “122” is 122. In practice the number of possible characters is higher than
10 (256 in general) and the pattern can be long, so the numeric values cannot practically be stored as
an integer. Therefore, the numeric value is calculated using modular arithmetic so that the hash values
fit in an integer variable (a machine word). To rehash, we remove the most significant digit and add
the new least significant digit to the hash value. Rehashing is done using the following formula.
hash( txt[s+1 .. s+m] ) = ( d ( hash( txt[s .. s+m-1] ) - txt[s]*h ) + txt[s + m] ) mod q
hash( txt[s .. s+m-1] ) : Hash value at shift s.
hash( txt[s+1 .. s+m] ) : Hash value at next shift (or shift s+1)
d: Number of characters in the alphabet
q: A prime number
h: d^(m-1) mod q
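The rehash formula and the symbols d, q and h can be put together in a short sketch. This is one possible arrangement under the definitions above, not the code from the original slides. The name rabin_karp_search, the out/max_out parameters, and the choice of passing q as an argument are assumptions.

```c
#include <string.h>

#define D 256   /* d: number of characters in the alphabet */

/* Rabin-Karp search with modulus q (a prime is recommended).
 * Stores match start indices in out[] (at most max_out entries) and
 * returns the total number of matches. */
int rabin_karp_search(const char *pat, const char *txt, int q,
                      int *out, int max_out)
{
    int m = (int)strlen(pat);
    int n = (int)strlen(txt);
    long long p = 0;   /* hash of the pattern */
    long long t = 0;   /* hash of the current text window */
    long long h = 1;   /* d^(m-1) mod q */
    int count = 0;

    if (m == 0 || m > n)
        return 0;

    for (int i = 0; i < m - 1; i++)
        h = (h * D) % q;

    for (int i = 0; i < m; i++) {
        p = (D * p + (unsigned char)pat[i]) % q;
        t = (D * t + (unsigned char)txt[i]) % q;
    }

    for (int s = 0; s <= n - m; s++) {
        /* Equal hashes only suggest a match; confirm character by character. */
        if (p == t && memcmp(pat, txt + s, (size_t)m) == 0) {
            if (count < max_out)
                out[count] = s;
            count++;
        }
        if (s < n - m) {
            /* Rehash: drop txt[s], shift by d, append txt[s+m]. */
            t = (D * (t - (unsigned char)txt[s] * h)
                 + (unsigned char)txt[s + m]) % q;
            if (t < 0)
                t += q;   /* C's % can yield a negative remainder */
        }
    }
    return count;
}
```

A small q such as 3 still gives correct results because every hash hit is verified, but it produces many spurious hits; a larger prime such as 101 keeps them rare.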
Searching for Patterns | Set 4 (Finite Automata)
We have discussed the following algorithms in the previous posts:
Naive Algorithm
KMP Algorithm
Rabin Karp Algorithm
In this post, we will discuss the Finite Automata (FA) based pattern searching algorithm. In the FA
based algorithm, we preprocess the pattern and build a 2D array that represents a finite automaton.
Construction of the FA is the tricky part of this algorithm. Once the FA is built, searching is simple:
we start from the first state of the automaton and the first character of the text. At every step, we
consider the next character of the text, look up the next state in the FA, and move to that state. If
we reach the final state, the pattern has been found in the text. The time complexity of the search
process is O(n).
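Once a transition table exists, the search loop itself is only a few lines. The sketch below assumes the table is a 2D int array indexed by state and character code, with m+1 rows. The names fa_search and NO_OF_CHARS as used here, and the out/max_out parameters, are assumptions for illustration.

```c
#define NO_OF_CHARS 256   /* assumed alphabet size: all unsigned char values */

/* Runs the automaton tf (m + 1 states, final state m) over txt.
 * Stores match start indices in out[] (at most max_out entries) and
 * returns the total number of matches. */
int fa_search(int tf[][NO_OF_CHARS], int m, const char *txt,
              int *out, int max_out)
{
    int state = 0;
    int count = 0;

    for (int i = 0; txt[i] != '\0'; i++) {
        state = tf[state][(unsigned char)txt[i]];   /* one lookup per char */
        if (state == m) {                            /* final state reached */
            if (count < max_out)
                out[count] = i - m + 1;
            count++;
        }
    }
    return count;
}
```

Each text character costs a single table lookup, which is where the O(n) search time comes from.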
Before we discuss FA construction, let us take a look at the FA for the pattern ACACAGA.
[Figure: state diagram and transition table of the FA for the pattern ACACAGA]
The number of states in the FA is M+1, where M is the length of the pattern. The main task in
constructing the FA is to get the next state from the current state for every possible character. Given
a character x and a state k, we can get the next state by considering the string “pat[0..k-1]x”, which
is the concatenation of the pattern characters pat[0], pat[1] … pat[k-1] and the character x. The idea
is to get the length of the longest prefix of the given pattern such that the prefix is also a suffix
of “pat[0..k-1]x”. This length is the next state. For example, let us see how to get the next state
from current state 5 and character ‘C’ for the pattern ACACAGA. We need to consider the string
“pat[0..4]C”, which is “ACACAC”. The length of the longest prefix of the pattern that is also a suffix
of “ACACAC” is 4 (“ACAC”). So the next state (from state 5) is 4 for character ‘C’.
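The next-state rule can be written as a small function that tries prefix lengths from longest to shortest. This is a straightforward sketch of the rule as stated. The name get_next_state and its signature are assumptions, not the original code.

```c
#include <string.h>

/* Next state from `state` on character x, for pattern pat of length m.
 * The result is the length of the longest prefix of pat that is also a
 * suffix of pat[0..state-1] followed by x. */
int get_next_state(const char *pat, int m, int state, int x)
{
    /* Common case: x is exactly the next pattern character. */
    if (state < m && x == (unsigned char)pat[state])
        return state + 1;

    /* Otherwise try shorter prefixes pat[0..ns-1], longest first. */
    for (int ns = state; ns > 0; ns--) {
        if ((unsigned char)pat[ns - 1] != x)
            continue;                     /* prefix must end with x */
        /* pat[0..ns-2] must equal pat[state-ns+1..state-1] */
        if (memcmp(pat, pat + state - ns + 1, (size_t)(ns - 1)) == 0)
            return ns;
    }
    return 0;
}
```

For the worked example, get_next_state("ACACAGA", 7, 5, 'C') should return 4.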
In the original code, computeTF() constructs the FA. The time complexity of computeTF() is
O(m^3*NO_OF_CHARS), where m is the length of the pattern and NO_OF_CHARS is the size of the alphabet
(the total number of possible characters in the pattern and text). The implementation tries all
possible prefixes, starting from the longest one that can be a suffix of “pat[0..k-1]x”. There are
better implementations that construct the FA in O(m*NO_OF_CHARS) (hint: we can use something like the
lps array construction in the KMP algorithm).
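One way to follow the hint is to carry a single fallback state, playing the role of the lps value, and copy its row when filling each new row. The sketch below is one such O(m*NO_OF_CHARS) construction and is not claimed to be the implementation the author had in mind. The name compute_tf and its signature are assumptions.

```c
#define NO_OF_CHARS 256   /* assumed alphabet size: all unsigned char values */

/* Builds the transition table for pat (length m) in O(m * NO_OF_CHARS).
 * tf must have room for m + 1 rows. */
void compute_tf(const char *pat, int m, int tf[][NO_OF_CHARS])
{
    int lps = 0;   /* state reached by the longest proper prefix-suffix */

    for (int x = 0; x < NO_OF_CHARS; x++)
        tf[0][x] = 0;
    if (m == 0)
        return;
    tf[0][(unsigned char)pat[0]] = 1;

    for (int i = 1; i <= m; i++) {
        /* On a mismatch, state i behaves like the fallback state. */
        for (int x = 0; x < NO_OF_CHARS; x++)
            tf[i][x] = tf[lps][x];

        if (i < m) {
            /* The matching character advances to the next state. */
            tf[i][(unsigned char)pat[i]] = i + 1;
            /* Advance the fallback state using the row it already has. */
            lps = tf[lps][(unsigned char)pat[i]];
        }
    }
}
```

For the pattern ACACAGA, the resulting table should give state 4 for state 5 on ‘C’, in agreement with the worked example.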

Pattern matching programs
