240-301 Comp. Eng. Lab III (Software), Pattern Matching 1
Pattern Matching
Dr. Andrew Davison
WiG Lab (teachers room), CoE
ad@fivedots.coe.psu.ac.th
240-301, Computer Engineering Lab III (Software)
Semester 1, 2006-2007
Overview
1. What is Pattern Matching?
2. The Brute Force Algorithm
3. The Knuth-Morris-Pratt Algorithm
4. The Boyer-Moore Algorithm
5. More Information
1. What is Pattern Matching?
• Definition:
– given a text string T and a pattern string P, find the pattern inside the text
• T: “the rain in spain stays mainly on the plain”
• P: “n th”
• Applications:
– text editors, Web search engines (e.g. Google), image analysis
String Concepts
• Assume S is a string of size m.
S = x1 x2 … xm
• A prefix of S is a substring S[1 .. k-1]
• A suffix of S is a substring S[k+1 .. m]
– k is any index between 1 and m
– a range whose start is past its end (e.g. S[1 .. 0]) is the empty string ""
– S[0] is the null character
Examples
S: "andrew"
• All possible prefixes of S:
– "", "a", "an", "and", "andr", "andre"
• All possible suffixes of S:
– "", "w", "ew", "rew", "drew", "ndrew"
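The prefix and suffix lists for "andrew" can be reproduced with a short Java sketch. The class and method names (PrefixSuffixDemo, properPrefixes, properSuffixes) are made up for this illustration and are not part of the lab code. Java strings are indexed from 0, unlike the 1-based notation used for S.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the original lab code.
class PrefixSuffixDemo {

    // Prefixes of s that are shorter than s, shortest first.
    static List<String> properPrefixes(String s) {
        List<String> result = new ArrayList<>();
        for (int len = 0; len < s.length(); len++)
            result.add(s.substring(0, len));
        return result;
    }

    // Suffixes of s that are shorter than s, shortest first.
    static List<String> properSuffixes(String s) {
        List<String> result = new ArrayList<>();
        for (int len = 0; len < s.length(); len++)
            result.add(s.substring(s.length() - len));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(properPrefixes("andrew")); // [, a, an, and, andr, andre]
        System.out.println(properSuffixes("andrew")); // [, w, ew, rew, drew, ndrew]
    }
}
```

The first entry of each printed list is the empty string, which is why the output starts with "[,".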
2. The Brute Force Algorithm
• Check each position in the text T to see if the pattern P starts in that position
T: a n d r e w
P: r e w
T: a n d r e w
P:   r e w
. . . .
P moves 1 char at a time through T
Brute Force in Java
// Return index where pattern starts, or -1
public static int brute(String text, String pattern)
{
  int n = text.length();       // n is length of text
  int m = pattern.length();    // m is length of pattern
  int j;
  for (int i = 0; i <= (n-m); i++) {
    j = 0;
    while ((j < m) &&
           (text.charAt(i+j) == pattern.charAt(j)))
      j++;
    if (j == m)
      return i;    // match at i
  }
  return -1;    // no match
}  // end of brute()
Usage
public static void main(String args[])
{
  if (args.length != 2) {
    System.out.println("Usage: java BruteSearch <text> <pattern>");
    System.exit(0);
  }
  System.out.println("Text: " + args[0]);
  System.out.println("Pattern: " + args[1]);
  int posn = brute(args[0], args[1]);
  if (posn == -1)
    System.out.println("Pattern not found");
  else
    System.out.println("Pattern starts at posn " + posn);
}
Analysis
• Brute force pattern matching runs in time O(mn) in the worst case.
• But most searches of ordinary text take O(m+n), which is very quick.
• The brute force algorithm is fast when the alphabet of the text is large, because most positions are rejected after one or two comparisons
– e.g. A..Z, a..z, 0..9, etc.
• It is slower when the alphabet is small, because partial matches are longer
– e.g. 0, 1 (as in binary files, image files, etc.)
• Example of a worst case:
– T: "aaaaaaaaaaaaaaaaaaaaaaaaaah"
– P: "aaah"
• Example of a more average case:
– T: "a string searching example is standard"
– P: "store"
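To see the gap between the worst case and a more average case, the sketch below counts character comparisons. It is an instrumented copy of brute(); the class name BruteCount and the counter are assumptions added for illustration only.

```java
// Hypothetical instrumented copy of brute(); the counter is an addition.
class BruteCount {

    // Number of character comparisons brute force makes before it
    // finds the pattern or runs out of positions to try.
    static int comparisons(String text, String pattern) {
        int n = text.length();
        int m = pattern.length();
        int count = 0;
        for (int i = 0; i <= n - m; i++) {
            int j = 0;
            while (j < m) {
                count++;                              // one character comparison
                if (text.charAt(i + j) != pattern.charAt(j))
                    break;
                j++;
            }
            if (j == m)
                return count;                         // match at i
        }
        return count;                                 // no match
    }

    public static void main(String[] args) {
        String worst = "aaaaaaaaaaaaaaaaaaaaaaaaaah";
        System.out.println("worst case:   " + comparisons(worst, "aaah")
                + " comparisons, n = " + worst.length());
        String average = "a string searching example is standard";
        System.out.println("average case: " + comparisons(average, "store")
                + " comparisons, n = " + average.length());
    }
}
```

In the worst-case text every position costs m comparisons, m(n-m+1) in total, while in the English sentence almost every position is rejected on its first comparison.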
3. The KMP Algorithm
• The Knuth-Morris-Pratt (KMP) algorithm looks for the pattern in the text in a left-to-right order (like the brute force algorithm).
• But it shifts the pattern more intelligently than the brute force algorithm.
• If a mismatch occurs between the text and pattern P at P[j], what is the most we can shift the pattern to avoid wasteful comparisons?
• Answer: shift P so that the largest prefix of P[1 .. j-1] that is also a suffix of P[1 .. j-1] (other than P[1 .. j-1] itself) stays lined up with the text already matched
Example
[Figure: T and P aligned, with a mismatch at text position i and pattern position j = 6; after the shift, matching resumes at jnew = 3]
Why
• Find largest prefix (start) of:
"a b a a b" ( P[1 .. j-1], with j == 6 )
which is suffix (end) of:
"a b a a b" ( P[1 .. j-1] )
• Answer: "a b"
• Set j = 3 // the new j value: the size of "a b", plus 1
KMP Border Function
• KMP preprocesses the pattern to find matches of prefixes of the pattern with the pattern itself.
• j = mismatch position in P[]
• k = position before the mismatch (k = j-1).
• The border function b(k), also called the failure function, is defined as the size of the largest prefix of P[1..k] that is also a suffix of P[1..k], not counting P[1..k] itself.
Border Function Example
• P: "abaaba"
  j:  123456
k    | 1 2 3 4 5
b(k) | 0 0 1 1 2
(k == j-1)
b(k) is the size of the largest border.
• In code, b() is represented by an array, like the table.
Why is b(5) == 2?
P: "abaaba"
• b(5) means
– find the size of the largest prefix of P[1..5] that is also a suffix of P[1..5]
= find the size of the largest prefix of "abaab" that is also a suffix of "baab" (P[2..5], so that the whole of "abaab" is excluded)
= find the size of "ab"
= 2
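The border function can be checked straight from its definition. The sketch below is a deliberately naive version (it tries every candidate size, so it is slower than the preprocessing KMP actually uses); the class name BorderDemo and the method border() are hypothetical, and k is the 1-based position used in the slides.

```java
// Hypothetical helper, written directly from the definition of b(k).
class BorderDemo {

    // Size of the largest prefix of P[1..k] that is also a suffix of P[1..k],
    // not counting P[1..k] itself. k is 1-based: 1 <= k <= p.length().
    static int border(String p, int k) {
        String head = p.substring(0, k);              // P[1..k] in slide notation
        for (int size = k - 1; size > 0; size--)      // largest candidate first
            if (head.startsWith(head.substring(k - size)))
                return size;
        return 0;                                     // only the empty border is left
    }

    public static void main(String[] args) {
        String p = "abaaba";
        for (int k = 1; k < p.length(); k++)
            System.out.println("b(" + k + ") = " + border(p, k));
        // prints b(1) = 0, b(2) = 0, b(3) = 1, b(4) = 1, b(5) = 2
    }
}
```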
Using the Failure Function
• Knuth-Morris-Pratt’s algorithm modifies the brute-force algorithm.
– if a mismatch occurs at P[j] (i.e. P[j] != T[i]), then
k = j-1;
j = b(k) + 1; // obtain the new j
– if the mismatch is at the first position of P, there is nothing to fall back to, so move on to T[i+1]
KMP in Java
// Return index where pattern starts, or -1
public static int kmpMatch(String text, String pattern)
{
  int n = text.length();
  int m = pattern.length();
  if (m == 0)
    return 0;    // empty pattern matches at the start (as in brute())
  int fail[] = computeFail(pattern);
  int i = 0;    // position in text
  int j = 0;    // position in pattern
  while (i < n) {
    if (pattern.charAt(j) == text.charAt(i)) {
      if (j == m - 1)
        return i - m + 1;    // match
      i++;
      j++;
    }
    else if (j > 0)
      j = fail[j-1];    // mismatch: fall back to the border of the matched part
    else
      i++;    // mismatch on the first pattern char
  }
  return -1;    // no match
}  // end of kmpMatch()
// Similar code to kmpMatch().
// The array is 0-based: fail[k-1] holds b(k).
public static int[] computeFail(String pattern)
{
  int fail[] = new int[pattern.length()];
  fail[0] = 0;
  int m = pattern.length();
  int j = 0;
  int i = 1;
  while (i < m) {
    if (pattern.charAt(j) == pattern.charAt(i)) {    // j+1 chars match
      fail[i] = j + 1;
      i++;
      j++;
    }
    else if (j > 0)    // j follows matching prefix
      j = fail[j-1];
    else {    // no match
      fail[i] = 0;
      i++;
    }
  }
  return fail;
}  // end of computeFail()
Usage
public static void main(String args[])
{
  if (args.length != 2) {
    System.out.println("Usage: java KmpSearch <text> <pattern>");
    System.exit(0);
  }
  System.out.println("Text: " + args[0]);
  System.out.println("Pattern: " + args[1]);
  int posn = kmpMatch(args[0], args[1]);
  if (posn == -1)
    System.out.println("Pattern not found");
  else
    System.out.println("Pattern starts at posn " + posn);
}
Example
T: abacaabaccabacabaabb
P: abacab                 comparisons 1-6
       abacab             comparison 7
        abacab            comparisons 8-12
            abacab        comparison 13
             abacab       comparisons 14-19 (match)
k    | 1 2 3 4 5
b(k) | 0 0 1 0 1
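The 19 comparisons numbered in the example can be checked with an instrumented copy of the KMP code. This is only a sketch: the class name KmpCount and the comparisons counter are additions, and it assumes the example text is "abacaabaccabacabaabb" (the lone c in the figure taken as the tenth letter of T).

```java
// Hypothetical instrumented version of kmpMatch() and computeFail().
class KmpCount {
    static int comparisons = 0;   // text/pattern comparisons in the last search

    static int[] computeFail(String pattern) {
        int m = pattern.length();
        int[] fail = new int[m];          // all entries start at 0
        int i = 1, j = 0;
        while (i < m) {
            if (pattern.charAt(j) == pattern.charAt(i)) {
                fail[i] = j + 1;
                i++;
                j++;
            } else if (j > 0) {
                j = fail[j - 1];
            } else {
                i++;                      // fail[i] stays 0
            }
        }
        return fail;
    }

    static int kmpMatch(String text, String pattern) {
        comparisons = 0;
        int n = text.length(), m = pattern.length();
        if (m == 0)
            return 0;
        int[] fail = computeFail(pattern);
        int i = 0, j = 0;
        while (i < n) {
            comparisons++;                // one comparison of P[j] with T[i]
            if (pattern.charAt(j) == text.charAt(i)) {
                if (j == m - 1)
                    return i - m + 1;     // match
                i++;
                j++;
            } else if (j > 0) {
                j = fail[j - 1];
            } else {
                i++;
            }
        }
        return -1;                        // no match
    }

    public static void main(String[] args) {
        int posn = kmpMatch("abacaabaccabacabaabb", "abacab");
        System.out.println("match at " + posn + " after " + comparisons + " comparisons");
        // prints: match at 10 after 19 comparisons
    }
}
```

The reported position is 0-based, so 10 corresponds to T[11] in the slide numbering.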
Why is b(5) == 1?
P: "abacab"
• b(5) means
– find the size of the largest prefix of P[1..5] that is also a suffix of P[1..5]
= find the size of the largest prefix of "abaca" that is also a suffix of "baca"
= find the size of "a"
= 1
KMP Advantages
• KMP runs in optimal time: O(m+n)
– very fast
• The algorithm never needs to move backwards in the input text, T
– this makes the algorithm good for processing very large files that are read in from external devices or through a network stream
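Because KMP never moves backwards in T, the text can be consumed one character at a time from a stream. The following is a rough sketch of that idea using java.io.Reader; the class KmpStream and its search() methods are hypothetical names, not part of the lab code, and the failure function is built the same way as in computeFail().

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical streaming variant of kmpMatch(); names are illustrative only.
class KmpStream {

    // Failure function, built as in computeFail() (0-based array).
    static int[] computeFail(String pattern) {
        int m = pattern.length();
        int[] fail = new int[m];
        int i = 1, j = 0;
        while (i < m) {
            if (pattern.charAt(j) == pattern.charAt(i)) {
                fail[i] = j + 1;
                i++;
                j++;
            } else if (j > 0) {
                j = fail[j - 1];
            } else {
                i++;                 // fail[i] stays 0
            }
        }
        return fail;
    }

    // Reads the text one character at a time and never goes back in it.
    // Returns the 0-based index where the pattern starts, or -1.
    static long search(Reader in, String pattern) throws IOException {
        int m = pattern.length();
        if (m == 0)
            return 0;
        int[] fail = computeFail(pattern);
        long i = 0;                  // index of the character just read
        int j = 0;                   // pattern characters matched so far
        int c;
        while ((c = in.read()) != -1) {
            while (j > 0 && pattern.charAt(j) != (char) c)
                j = fail[j - 1];     // fall back inside the pattern only
            if (pattern.charAt(j) == (char) c)
                j++;
            if (j == m)
                return i - m + 1;    // match ends at the character just read
            i++;
        }
        return -1;                   // end of stream, no match
    }

    // Convenience wrapper for text that is already in memory.
    static long search(String text, String pattern) {
        try {
            return search(new StringReader(text), pattern);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(search("the rain in spain stays mainly on the plain", "n th"));
    }
}
```

Only the pattern, the failure array and the single character just read are held in memory, so the size of the text does not matter. For a file or a socket, the Reader would normally be wrapped in a BufferedReader.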
KMP Disadvantages
• KMP doesn’t work so well as the size of the alphabet increases
– more chance of a mismatch (more possible mismatches)
– mismatches tend to occur early in the pattern, but KMP is faster when the mismatches occur later
KMP Extensions
• The basic algorithm doesn't take into account the letter in the text that caused the mismatch.
[Figure: T: "a a a b b x", P: "a a a b b a"; the mismatch is on the text letter x, and P is drawn again further along T. Basic KMP does not do this.]
4. The Boyer-Moore Algorithm
• The Boyer-Moore pattern matching algorithm is based on two techniques.
• 1. The looking-glass technique
– find P in T by moving backwards through P, starting at its end
• 2. The character-jump technique
– when a mismatch occurs at T[i] == x
– the character in pattern P[j] is not the same as T[i]
• There are 3 possible cases, tried in order.
[Figure: T and P aligned, with i marking the mismatched text character x and j marking the pattern position that failed to match it]
Case 1
• If P contains x somewhere, then try to shift P right to align the last occurrence of x in P with T[i].
[Figure: P is shifted right until its last x is under T[i]; then i and j move right, so that j is at the end of P (inew, jnew)]
Case 2
• If P contains x somewhere, but a shift right to the last occurrence is not possible, then shift P right by 1 character to T[i+1].
[Figure: the last x in P is after the j position, so P is moved right by 1 instead; then i and j move right, so that j is at the end of P (inew, jnew)]
Case 3
• If cases 1 and 2 do not apply, then shift P to align P[1] with T[i+1].
[Figure: there is no x in P, so P[1] is lined up with T[i+1]; then i and j move right, so that j is at the end of P (inew, jnew)]
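The three character-jump cases can be written as one small function that says how far P moves right. This is a sketch under stated assumptions: the class CharacterJump and the method shift() are hypothetical, indexes are 0-based as in Java, and String.lastIndexOf() stands in for the last occurrence of x in P.

```java
// Hypothetical helper that names the three character-jump cases (0-based indexes).
class CharacterJump {

    // How far P moves right after P[j] fails to match the text character x.
    static int shift(String pattern, int j, char x) {
        int lo = pattern.lastIndexOf(x);   // last occurrence of x in P, or -1
        if (lo != -1 && lo < j)
            return j - lo;                 // case 1: last x in P goes under T[i]
        if (lo != -1)
            return 1;                      // case 2: last x is after j, move by 1
        return j + 1;                      // case 3: no x in P, P starts at T[i+1]
    }

    public static void main(String[] args) {
        System.out.println(shift("rithm", 4, 't'));    // case 1: prints 2
        System.out.println(shift("abacab", 3, 'a'));   // case 2: prints 1
        System.out.println(shift("abacab", 5, 'd'));   // case 3: prints 6
    }
}
```

In every case the result equals j + 1 - Math.min(j, 1 + lo), so a single expression can cover all three cases.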
Boyer-Moore Example (1)
T: a pattern matching algorithm
P: rithm                           comparison 1
     rithm                         comparison 2
          rithm                    comparison 3
               rithm               comparison 4
                    rithm          comparison 5
                         rithm     comparison 6
                          rithm    comparisons 7-11 (right to left, match)
Last Occurrence Function
• Boyer-Moore’s algorithm preprocesses the pattern P and the alphabet A to build a last occurrence function L()
– L() maps all the letters in A to integers
• L(x) is defined as: // x is a letter in A
– the largest index i such that P[i] == x, or
– -1 if no such index exists
L() Example
• A = {a, b, c, d}
• P: "abacab"
P     | a b a c a b
index | 1 2 3 4 5 6
x     | a  b  c  d
L(x)  | 5  6  4  -1
L() stores indexes into P[]
Note
• In Boyer-Moore code, L() is calculated when the pattern P is read in.
• Usually L() is stored as an array
– something like the table in the previous slide
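An array works well when the alphabet is small and known in advance. As an alternative sketch, L() can be stored in a map, which also copes with characters outside ASCII. The class LastOccurrence and its methods are hypothetical names; the indexes are 0-based as in Java, so each value is one less than in the L() example table.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical map-based variant of the last occurrence function.
class LastOccurrence {

    // Maps each character of the pattern to the 0-based index of its last occurrence.
    static Map<Character, Integer> buildLast(String pattern) {
        Map<Character, Integer> last = new HashMap<>();
        for (int i = 0; i < pattern.length(); i++)
            last.put(pattern.charAt(i), i);    // later positions overwrite earlier ones
        return last;
    }

    // L(x): -1 when x does not occur in the pattern.
    static int L(Map<Character, Integer> last, char x) {
        return last.getOrDefault(x, -1);
    }

    public static void main(String[] args) {
        Map<Character, Integer> last = buildLast("abacab");
        for (char x : new char[] {'a', 'b', 'c', 'd'})
            System.out.println(x + " -> " + L(last, x));
        // prints a -> 4, b -> 5, c -> 3, d -> -1
    }
}
```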
Boyer-Moore Example (2)
T: abacaabadcabacabaabb
P: abacab                 comparison 1
    abacab                comparisons 2-4
     abacab               comparison 5
      abacab              comparison 6
            abacab        comparison 7
             abacab       comparisons 8-13 (match)
x    | a  b  c  d
L(x) | 5  6  4  -1
Boyer-Moore in Java
// Return index where pattern starts, or -1
public static int bmMatch(String text, String pattern)
{
  int last[] = buildLast(pattern);
  int n = text.length();
  int m = pattern.length();
  if (m == 0)
    return 0;    // empty pattern matches at the start (as in brute())
  int i = m-1;
  if (i > n-1)
    return -1;    // no match if pattern is longer than text
  int j = m-1;
  do {
    if (pattern.charAt(j) == text.charAt(i)) {
      if (j == 0)
        return i;    // match
      else {    // looking-glass technique
        i--;
        j--;
      }
    }
    else {    // character jump technique
      int lo = last[text.charAt(i)];    // last occ
      i = i + m - Math.min(j, 1+lo);
      j = m - 1;
    }
  } while (i <= n-1);
  return -1;    // no match
}  // end of bmMatch()
/* Return array storing index of last
   occurrence of each ASCII char in pattern. */
public static int[] buildLast(String pattern)
{
  int last[] = new int[128];    // ASCII char set
  for (int i = 0; i < 128; i++)
    last[i] = -1;    // initialize array
  for (int i = 0; i < pattern.length(); i++)
    last[pattern.charAt(i)] = i;
  return last;
}  // end of buildLast()
Usage
public static void main(String args[])
{
  if (args.length != 2) {
    System.out.println("Usage: java BmSearch <text> <pattern>");
    System.exit(0);
  }
  System.out.println("Text: " + args[0]);
  System.out.println("Pattern: " + args[1]);
  int posn = bmMatch(args[0], args[1]);
  if (posn == -1)
    System.out.println("Pattern not found");
  else
    System.out.println("Pattern starts at posn " + posn);
}
Analysis
• Boyer-Moore worst case running time is O(nm + A)
• But, Boyer-Moore is fast when the alphabet (A) is large, slow when the alphabet is small.
– e.g. good for English text, poor for binary
• Boyer-Moore is significantly faster than brute force for searching English text.
Worst Case Example
• T: "aaaaa…a"
• P: "baaaaa"
T: aaaaaaaaa
P: baaaaa       comparisons 1-6
    baaaaa      comparisons 7-12
     baaaaa     comparisons 13-18
      baaaaa    comparisons 19-24
(each position matches five a's from the right, then fails on b)
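The comparison counts in the Boyer-Moore examples can be checked with an instrumented copy of the code. This is a sketch only: the class name BmCount and the comparisons counter are additions, the worst-case text is taken to be the nine a's drawn in the figure, and the second call assumes the text of Boyer-Moore Example (2) is "abacaabadcabacabaabb".

```java
import java.util.Arrays;

// Hypothetical instrumented copy of bmMatch() and buildLast().
class BmCount {
    static int comparisons = 0;   // text/pattern comparisons in the last search

    static int[] buildLast(String pattern) {
        int[] last = new int[128];            // ASCII only, as in the lab code
        Arrays.fill(last, -1);
        for (int i = 0; i < pattern.length(); i++)
            last[pattern.charAt(i)] = i;
        return last;
    }

    static int bmMatch(String text, String pattern) {
        comparisons = 0;
        int n = text.length(), m = pattern.length();
        if (m == 0)
            return 0;
        if (m > n)
            return -1;
        int[] last = buildLast(pattern);
        int i = m - 1, j = m - 1;
        do {
            comparisons++;                    // one comparison of P[j] with T[i]
            if (pattern.charAt(j) == text.charAt(i)) {
                if (j == 0)
                    return i;                 // match
                i--;                          // looking-glass technique
                j--;
            } else {                          // character-jump technique
                int lo = last[text.charAt(i)];
                i = i + m - Math.min(j, 1 + lo);
                j = m - 1;
            }
        } while (i <= n - 1);
        return -1;                            // no match
    }

    public static void main(String[] args) {
        int posn = bmMatch("aaaaaaaaa", "baaaaa");
        System.out.println(posn + " after " + comparisons + " comparisons");
        // prints: -1 after 24 comparisons
        posn = bmMatch("abacaabadcabacabaabb", "abacab");
        System.out.println(posn + " after " + comparisons + " comparisons");
        // prints: 10 after 13 comparisons
    }
}
```

In the worst case every position costs m comparisons and the pattern then moves by only one character, which is where the nm term in the running time comes from.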
5. More Information
• Algorithms in C++
Robert Sedgewick
Addison-Wesley, 1992
– chapter 19, String Searching
– This book is in the CoE library.
• Online Animated Algorithms:
– http://www.ics.uci.edu/~goodrich/dsa/11strings/demos/pattern/
– http://www-sr.informatik.uni-tuebingen.de/~buehler/BM/BM1.html
– http://www-igm.univ-mlv.fr/~lecroq/string/
