SlideShare a Scribd company logo
Regular Expressions and
Automata
Introduction
• Regular Expression (RE) – is a language for
specifying text search strings.
– First developed by Kleene (1956)
– Requires a:
• Pattern – specification formula using a special
language that specifies simple classes of strings.
• Corpus – a body of text to search through.
Introduction
• The regular expression is used for
specifying:
– text strings in situations like Web-search
example, and in other
– information retrieval applications, but also
plays an important role in
– word-processing,
– computation of frequencies from corpora, and
other such tasks.
Introduction
• Regular Expressions can be implemented via finite-state
automaton.
• Finite-State Automaton (FSA) is one of the most significant
tools of computational linguistics.
• Its variations:
– Finite-state transducers
– Hidden Markov Models, and
– N-gram grammars
Important components of the Speech Recognition and
Synthesis, spell-checking, and information-extraction
applications.
Regular Expressions
• Formally, a regular expression is an algebraic notation for characterizing a
set of strings.
– Thus they can be used to specify search strings as well as to define a language
in a formal way.
• Regular Expression requires
– A pattern that we want to search for, and
– A corpus of text to search through.
• Thus when we give a search pattern, we will assume that the search engine
returns the line of the document is returned. This is what the UNIX grep
command does.
• We will underline the exact part of the pattern that matches the regular
expression.
• A search can be designed to return all matches to a regular expression or
only the first match.
Basic Regular Expression Patterns
• The simplest kind of regular expression is a sequence of
simple characters:
– /woodchuck/
– /Buttercup/
– /!/
RE Example Patterns Matched
/woodchucks/ “interesting links to woodchucks and lemurs”
/a/ “Mary Ann stopped by Mona’s”
/Claire says,/ “Dagmar, my gift please,” Claire says,”
/song/ “all our pretty songs”
/!/ “You’ve left the burglar behind again!” said Nori
Basic Regular Expression Patterns
• Regular Expressions are case sensitive
– /s/, is not the same as
– /S/
• /woodchucks/ will not match “Woodchucks”
• Disjunction: “[“ and “]”.
RE Match Example Pattern
/[wW]oodchuck/ Woodchuck or
woodchuck
“Woodchuck”
/[abc]/ ‘a’, ‘b’, or ‘c’ “In uomini, in
soldati”
/[1234567890]/ Any digit “plenty of 7 to 5”
Basic Regular Expression Patterns
• It is inconvenient to search for example:
/[ABCDEFGHIJKLMONPQRSTUVWXYZ]/
• Specifying range in Regular Expressions: “-”
RE Match Example Patterns Matched
/[A-Z]/ An uppercase letter
“we should call it ‘Drenched
Blossoms’”
/[a-z]/ A lower case letter
“my beans were impatient to
be hoed!”
/[0-9]/ A single digit
“Chapter 1: Down the Rabbit
Hole”
Basic Regular Expression Patterns
• Negative Specification – what pattern can not be: “^”
– If the first symbol after the open square brace “[” is “^” the resulting pattern is negated.
– Example /[^a]/ matches any single character (including special characters) except a.
RE Match (single characters) Example Patterns Matched
/[^A-Z]/ Not an uppercase letter “Oyfn pripetchik”
/[^Ss]/ Neither ‘S’ nor ‘s’
“I have no exquisite reason
for ’t”
/[^.]/ Not a period “our resident Djinn”
/[e^]/ Either ‘e’ or ‘^’ “look up ^ now”
/a^b/ Pattern ‘a^b’ “look up a^b now”
Basic Regular Expression Patterns
• How do we specify both woodchuck and woodchucks?
– Optional character specification: /?/
– /?/ means “the preceding character or nothing”.
RE Match Example Patterns Matched
/woodchucks?/
woodchuck or
woodchucks
“woodchuck”
colou?r color or colour “colour”
Basic Regular Expression Patterns
• Question-mark “?” can be though of as “zero or one instance of the
previous character”.
• It is a way to specify how many of something that we want.
• Sometimes we need to specify regular expressions that allow
repetitions of things.
• For example, consider the language of (certain) sheep, which
consists of strings that look like the following:
– baa!
– baaa?
– baaaa?
– baaaaa?
– baaaaaa?
– …
Basic Regular Expression Patterns
• Any number of repetitions is specified by
“*” which means “any string of 0 or more”.
• Examples:
– /aa*/ - a followed by zero or more a’s
– /[ab]*/ - zero or more a’s or b’s. This will
match aaaa or abababa or bbbb
Basic Regular Expression Patterns
• We know enough to specify part of our regular expression for prices:
multiple digits.
– Regular expression for individual digit:
• /[0-9]/
– Regular expression for an integer:
• /[0-9][0-9]*/
– Why is not just /[0-9]*/?
• Because it is annoying to specify “at least once” RE since it involves
repetition of the same pattern there is a special character that is used for “at
least once”: “+”
– Regular expression for an integer becomes then:
• /[0-9]+/
– Regular expression for sheep language:
• /baa*!/, or
• /ba+!/
Basic Regular Expression Patterns
• One very important special character is the period: /./, a wildcard expression that matches any
single character (except carriage return).
• Example: Find any line in which a particular word (for example Veton) appears twice:
– /Veton.*Veton/
RE Match Example Pattern
/beg.n/ Any character between beg and n begin beg’n, begun
Anchors
• Anchors are special characters that anchor regular expressions to
particular places in a string.
– The most common anchors are:
• “^” – matches the start of a line
• “$” – matches the end of the line
– Examples:
• /^The/ - matches the word “The” only at the start of the line.
• Three uses of “^”:
1. /^xyz/ - Matches the start of the line
2. [^xyz] – Negation
3. /^/ - Just to mean a caret
 /⌴$/ - “⌴” Stands for space “character”; matches a space at the end
of line.
 /^The dog.$/ - matches a line that contains only the phrase “The
dog”.
Anchors
• /b/ - matches a word boundary
• /B/ - matches a non-boundary
• /btheb/ - matches the word “the” but not the word “other”.
• Word is defined as a any sequence of digits, underscores or
letters.
• /b99/ - will match the string 99 in
“There are 99 bottles of beer on the wall” but NOT
“There are 299 bottles of beer on the wall”
and it will match the string
“$99” since 99 follows a “$” which is not a digit, underscore, or
a letter.
Disjunction, Grouping and Precedence.
• Suppose we need to search for texts about pets;
specifically we may be interested in cats and dogs. If
we want to search for either “cat” or the string “dog”
we can not use any of the constructs we have
introduced so far (why not “[]”?).
• New operator that defines disjunction, also called the
pipe symbol is “|”.
• /cat|dog/ - matches either cat or the string dog.
Grouping
• In many instances it is necessary to be able to group
the sequence of characters to be treated as one set.
• Example: Search for guppy and guppies.
– /gupp(y|ies)/
• Useful in conjunction to “*” operator.
– /*/ - applies to single character and not to a whole sequence.
• Example: Match “Column 1 Column 2 Column 3 …”
– /Column⌴[0-9]+⌴*/ - will match “Column # …“
– /(Column⌴[0-9]+⌴*)*/ - will match “Column 1 Column 2
Column 3 …”
Operator Precedence Hierarchy
Operator Class Precedence from Highest to
Lowest
Parenthesis ()
Counters * + ? {}
Sequences and anchors ^ $
Disjunction |
Simple Example
• Problem Statement: Want to write RE to find cases of the English article
“the”.
1. /the/ - It will miss “The”
2. /[tT]he/ - It will match “amalthea”, “Bethesda”, “theology”, etc.
3. /b[tT]heb/ - Is the correct RE
• Problem Statement: If we want to find “the” where it might also have
undelines or numbers nearby (“The-” , “the_” or “the25”) one needs to
specify that we want instances in which there are no alphabetic letters
on either side of “the”:
1. /[^a-zA-Z][tT]he[^a-zA-Z]/ - it will not find “the” if it begins the line.
2. /(^|[^a-zA-Z])[tT]he[^a-zA-Z]/
Simple Example (cont.)
• Refining RE by reduction of:
– false positives (false acceptance):
• Strings that are incorrectly matched.
– false negatives (false rejection):
• Strings that are incorrectly missed.
A More Complex Example
• Problem Statement: Build an application to
help a user by a computer on the Web.
– The user might want “any PC with more than 6
GHz and 256 GB of disk space for less than
$1000
– To solve the problem must be able to match the
expressions like 1000 MHz, 6 GHz and 256 GB
as well as $999.99 etc.
Solution – Dollar Amounts
• Complete regular expression for prices of full dollar
amounts:
– /$[0-9]+/
• Adding fractions of dollars:
– /$[0-9]+.[0-9][0-9]/ or
– /$[0-9]+.[0-9] {2}/
• Problem since this RE only will match “$199.99” and
not “$199”. To solve this issue must make cents
optional and make sure the $ amount is a word:
– /b$[0-9]+(.[0-9][0-9])?b/
Solution: Processor Speech
• Processor speed in megahertz = MHz or gigahertz = GHz)
– /b[0-9]+⌴*(MHz|[Mm]egahertz|GHz|[Gg]igahertz)b/
– ⌴* is used to denote “zero or more spaces”.
• Simple pattern to specify operating systems:
– /b(Windows ⌴*(8|7|Vista|XP)?)b/
– /b(Mac|Macintosh|Apple|OS ⌴X)b/
Solution: Disk Space
• Dealing with disk space:
– Gb = gigabytes
• Memory size:
– Mb or MB = megabytes or
– Gb or GB = gigabytes
• Must allow optional fractions:
– /b[0-9]+⌴*(M[Bb]|[Mm]egabytes?)b/
– /b[0-9]+(.[0-9]+)?⌴*(G[Bb]|[Gg]igabytes?)b/
Solution: Operating Systems and Vendors
• /b((Windows)+⌴*(XP|Vista|7|8)?)b/
• /b((Mac|Macintosh|Apple|OS ⌴X)b/
Advanced Operators
RE Expansion Match Example Patterns
d [0-9] Any digit “Party of 5”
D [^0-9] Any non-digit “Blue moon”
w
[a-zA-Z0-
9⌴]
Any
alphanumeric
or space
Daiyu
W [^w]
A non-
alphanumeric
!!!!
s [⌴rtnf]
Whitespace
(space, tab)
“ ”
S [^s]
Non-
whitespace
“in Concord”
Aliases for common sets of characters
Repetition Metacharacters
RE Description Example
*
Matches any number of occurrences
of the previous character – zero or
more
/ac*e/ - matches “ae”, “ace”, “acce”, “accce”
as in “The aerial acceleration alerted the ace
pilot”
?
Matches at most one occurrence of
the previous characters – zero or
one.
/ac?e/ - matches “ae” and “ace” as in “The
aerial acceleration alerted the ace pilot”
+
Matches one or more occurrences of
the previous characters
/ac+e/ - matches “ace”, “acce”, “accce” as in
“The aerial acceleration alerted the ace pilot”
{n}
Matches exactly n occurrences of the
previous characters.
/ac{2}e/ - matches “acce” as in “The aerial
acceleration alerted the ace pilot”
{n,}
Matches n or more occurrences of
the previous characters
/ac{2,}e/ - matches “acce”, “accce” etc., as in
“The aerial acceleration alerted the ace pilot”
{n,m}
Matches from n to m occurrences of
the previous characters.
/ac{2,4}e/ - matches “acce”, “accce” and
“acccce” , as in “The aerial acceleration alerted
the ace pilot”
.
Matches one occurrence of any
characters of the alphabet except
the new line character
/a.e/ matches aae, aAe, abe, aBe, a1e, etc.,
as in ““The aerial acceleration alerted the ace
pilot”
.* Matches any string of characters and until it encounters a new line character
Literal Matching of Special Characters & “”
Characters
RE Match Example Patterns
* An asterisk “*” “K*A*P*L*A*N”
. A period “.” “Dr. Këpuska, I presume”
? A question mark “?”
“Would you like to light my
candle?”
n A newline
t A tab
r A carriage return character
Some characters that need to be back-slashed “”
Finate State Automata
• The regular expression is more than just a
convenient metalangue for text searching.
1. A regular expression is one way of describing a finite-
state-automaton (FSA).
 FSA – are the theoretical foundation of significant
number of computational work.
 Any regular expression can be implemented as FSA
(except regular expressions that use the memory feature).
2. Regular expression is one way of characterizing a
particular kind of formal language called a regular
language.
 Both FSA and RE can be used to describe regular
languages.
FSA, RE and Regular Languages
Finite
automata
Regular
languages
Regular
expressions
Finite-State Automaton for Regular
Expressions
• Using FSA to Recognize Sheeptalk with RE:
/baa+!/
q0 q4
q1 q2 q3
b a a
a
!
Start State Final State
Transitions
FSA Use
• The FSA can be used for recognizing (we also say accepting)
strings in the following way. First, think of the input as being
written on a long tape broken up into cells, with one symbol
written in each cell of the tape, as figure below:
a b a ! b
q0
Recognition Process
• The machine starts in the start state (q0), and iterates the following
process:
1. Check the next letter of the input.
a. If it matches the symbol on an arc leaving the current state, then
i. cross that arc
ii. move to the next state, also
iii. advance one symbol in the input
b. If we are in the accepting state (q4) when we run out of input, the machine
has successfully recognized an instance of sheeptalk.
2. If the machine never gets to the final state,
a. either because it runs out of input, or
b. it gets some input that doesn’t match an arc (as in Fig in previous slide), or
c. if it just happens to get stuck in some non-final state, we say the machine
rejects or fails to accept an input.
State Transition Table
Input
State b a !
0 1 Ø Ø
1 Ø 2 Ø
2 Ø 3 Ø
3 Ø 3 4
4: Ø Ø Ø
We’ve marked state 4 with a colon to indicate that it’s a final state (you
can have as many final states as you want), and the Ø indicates an
illegal or missing transition. We can read the first row as “if we’re in
state 0 and we see the input b we must go to state 1. If we’re in state 0
and we see the input a or !, we fail”.
Formal Definition of Automaton
Q={q0,q1,…,qN} A finite set of N states
 a finite input alphabet of symbols
q0 the start state
F the set of final states, F ⊆ Q
(q, i)
the transition function or transition
matrix between states. Given a state q ∈
Q and an input symbol i ∈ , δ(q, i)
returns a new state q′ ∈ Q. δ is thus a
relation from Q×S to Q;
FSA Example
• Q = {q0,q1,q2,q3,q4},
•  = {a,b, !},
• F = {q4}, and
• (q, i)
q0 q4
q1 q2 q3
b a a
a
!
Start State Final State
Transitions
Deterministic Algorithm for Recognizing a
String
function D-RECOGNIZE(tape,machine) returns accept or reject
index←Beginning of tape
current-state←Initial state of machine
loop
if End of input has been reached then
if current-state is an accept state then
return accept
else
return reject
elsif transition-table[current-state,tape[index]] is empty then
return reject
else
current-state←transition-table[current-state,tape[index]]
index←index + 1
end
Tracing Execution for Some Sheep Talk
q0 q4
q1 q2 q3
b a a
a
!
Start State Final State
Transitions
b a a a !
q0 q1 q2 q3 q3 q4
Input
State b a !
0 1 Ø Ø
1 Ø 2 Ø
2 Ø 3 Ø
3 Ø 3 4
4: Ø Ø Ø
Tracing Execution for Some Sheep
Talk (cont.)
Before examining the beginning of the tape, the machine is in
state q0. Finding a b on input tape, it changes to state q1 as
indicated by the contents of transition-table[q0,b] in Fig.
It then finds an a and switches to state q2, another a puts it in
state q3, a third a leaves it in state q3, where it reads the “!”,
and switches to state q4. Since there is no more input, the
End of input condition at the beginning of the loop is
satisfied for the first time and the machine halts in q4.
State q4 is an accepting state, and so the machine has accepted
the string baaa! as a sentence in the sheep language.
Fail State
• The algorithm will fail whenever there is no legal
transition for a given combination of state and input. The
input abc will fail to be recognized since there is no legal
transition out of state q0 on the input a, (i.e., this entry of
the transition table has a Ø).
• Even if the automaton had allowed an initial a it would
have certainly failed on c, since c isn’t even in the
sheeptalk alphabet! We can think of these “empty”
elements in the table as if they all pointed at one “empty”
state, which we might call the fail state or sink state.
• In a sense then, we could FAIL STATE view any machine
with empty transitions as if we had augmented it with a fail
state, and drawn in all the extra arcs, so we always had
somewhere to go from any state on any possible input. Just
for completeness, next Fig. shows the FSA from previous
Figure with the fail state qF filled in.
Adding a Fail State to FSA
q0 q4
q1
b a a !
Start State Final State
a
a
qF
q2 q3
! b
! b
b
b
!
!
a
Formal Languages
• Key Concept #1. Formal Language:
– A model which can both generate and recognize all and only the
strings of a formal language acts as a definition of the formal
language.
– A formal language is a set of strings, each string composed of
symbols from a finite symbol-set called an alphabet (the same
alphabet used above for defining an automaton!).
• The alphabet for a “sheep” language is the set
 = {a,b, !}.
• Given a model m (such as FSA) we can use L(m) to mean “the
formal language characterized by m”.
• L(m)={baa!,baaa!, baaaa!, baaaaa!,….}
Formal Languages
• Automaton can express an infinite set in a closed form.
• Formal Languages are not the same as natural languages.
– Formal languages are used to model part of a natural language:
• Phonology,
• Morphology, or
• Syntax
• Generative Grammar:
– It is used sometimes in linguistics to mean a grammar of a formal
language;
– The use of an automaton to define a language by generating all possible
strings.
Example 2
• Alphabet consisting of words. Must build an FSA that
models the sub-part of English language that deals with
amounts of money:
– Ten cents,
– Three dollars,
– One dollar thirty-five cents, …
• Such a formal language would model the subset of
English that consists of phrases like ten cents, three
dollars, one dollar thirty-five cents, etc.
1. Solve the problem of building FSA for numbers 1-99
with which will model cents.
2. Model dollar amounts by adding cents to it.
FSA for the words for English numbers 1-
99
q0 q2
q1
Start State
Twenty
Thirty
Forty
Fifty
Sixty
Seventy
Eighty
Ninety
One
Two
Three
Four
Five
Six
Seven
Eight
Nine
Final State
Ten
One Eleven
Two Twelve Twenty
Three Thirteen Thirty
Four Fourteen Forty
Five Fifteen Fifty
Six Sixteen Sixty
Seven Seventeen Seventy
Eight Eighteen Eighty
Nine Nineteen Ninety
FSA for the simple Dollars and Cents
q0 q1
Ten
One Eleven
Two Twelve Twenty
Three Thirteen Thirty
Four Fourteen Forty
Five Fifteen Fifty
Six Sixteen Sixty
Seven Seventeen Seventy
Eight Eighteen Eighty
Nine Nineteen Ninety
Twenty
Thirty
Forty
Fifty
Sixty
Seventy
Eighty
Ninety
One
Two
Three
Four
Five
Six
Seven
Eight
Nine
q3
q2
cents
q5
Ten
One Eleven
Two Twelve Twenty
Three Thirteen Thirty
Four Fourteen Forty
Five Fifteen Fifty
Six Sixteen Sixty
Seven Seventeen Seventy
Eight Eighteen Eighty
Nine Nineteen Ninety
Twenty
Thirty
Forty
Fifty
Sixty
Seventy
Eighty
Ninety
One
Two
Three
Four
Five
Six
Seven
Eight
Nine
q7
q6
cents
dollars
q4
Non-Deterministic FSAs
q0 q4
q1 q2 q3
b a a !
Start State Final State
a
q0 q4
q1 q2 q3
b a a
a
!
Start State Final State
Deterministic FSA
Non-Deterministic FSA
Deterministic vs Non-deterministic FSA
• Deterministic FSA is one whose behavior during
recognition is fully determined by the state it is in and
the symbol it is looking at.
• The FSA in the previous slide when FSA is at the state
q2 and the input symbol is a we do not know whether
to remain in state 2 (self-loop transition) or state 3 (the
other transition) .
• Clearly the decision dependents on the next input
symbols.
Another NFSA for “sheep” language:  -
transition
q0 q4
q1
b a
a
!
Start State Final State
q2 q3

•  - transition defines the arc that causes transition without an
input symbol. Thus when in state q3 transition to state q2 is
allowed without looking at the input symbol or advancing input
pointer.
• This example is another kind of non-deterministic behavior – we
“might not know” whether to follow the  - transition or the ! arc.
Using NFSA to Accept Strings
• There is a problem of (wrong) choice in non-deterministic FSA. There are
three standard solutions to the problem of non-determinism:
– Backup: Whenever we come to a choice point, we could put a marker to mark
where we were in the input, and what state the automaton was in. Then if it
turns out that we took the wrong choice, we could back up and try another
path.
– Look-ahead: We could look ahead in the input to help us decide which path to
take.
– Parallelism: Whenever we come to a choice point, we could look at every
alternative path in parallel.
• We will focus here on the backup approach and defer discussion of the
look-ahead and parallelism approaches to later chapters.
Back-up Approach for NFSA Recognizer
• The backup approach suggests that we should make
choices that might lead to dead-ends, knowing that
we can always return to unexplored alternative
choices.
• There are two key points to this approach:
1. Must know ALL alternatives for each choice point.
2. Store sufficient information about each alternative so that
we can return to it when necessary.
Back-up Approach for NFSA Recognizer
• When a backup algorithm reaches a point in its processing where
no progress can be made:
– Runs out of input, or
– Has no legal transitions,
It returns to a previous choice point and selects one of the unexplored
alternatives and continues from there.
• To apply this notion to current definition of FSA we need only to
store two things for each choice point:
1. The State (or node)
2. Corresponding position on the tape.
Search State
• Combination of the node and the position specifies
the search state of the recognition algorithm.
• To avoid confusion, the state of automaton is called
a node or machine-state.
• Two changes are necessary in transition table:
1. To represent nodes that have  - transitions we need to
add  - column,
2. Accommodate multiple transitions to different nodes from
the same input symbol. Each cell entry consists of the list
of destination nodes rather then a single node.
The Transition table of a NFSA
Input
State b a ! 
0 1 Ø Ø Ø
1 Ø 2 Ø Ø
2 Ø 2,3 Ø Ø
3 Ø Ø 4 2
4: Ø Ø Ø Ø
q0 q4
q1
b a
a
!
Start State Final State
q2 q3

An Algorithm for NFSA Recognition
function ND-RECOGNIZE(tape,machine) returns accept or reject
agenda←{(Initial state of machine, beginning of tape)}
current-search-state←NEXT(agenda)
loop
if ACCEPT-STATE?(current-search-state) returns true
then
return accept
else
agenda← agenda ∪ GENERATE-NEW-STATES(current-search-state)
if agenda is empty
then
return reject
else
current-search-state←NEXT(agenda)
end
An Algorithm for NFSA Recognition
(cont.)
function GENERATE-NEW-STATES(current-state) returns a set of search-states
current-node←the node the current search-state is in
index←the point on the tape the current search-state is looking at
return a list of search states from transition table as follows:
(transition-table[current-node, ], index)
∪
(transition-table[current-node, tape[index]], index + 1)
function ACCEPT-STATE?(search-state) returns true or false
current-node←the node search-state is in
index←the point on the tape search-state is looking at
if index is at the end of the tape and current-node is an accept state of machine
then
return true
else
return false
Possible execution of ND-RECOGNIZE
q0 q4
q1
b a
a
!
Start State Final State
q2 q3

Input
State b a ! 
0 1 Ø Ø Ø
1 Ø 2 Ø Ø
2 Ø 2,3 Ø Ø
3 Ø Ø 4 2
4: Ø Ø Ø Ø
b a a a !
q0
1
b a a a !
q0
2
q1
b a a a !
3
q1 q2
b a a a !
4
q2 q3
b a a a !
5
q3
X
b a a a ! 6
q2
b a a a ! 7
q3
b a a a ! 8
q3
2 possibilities
Depth-First-Search
• Depth-First-Search or Last-In-First-Out
(LIFO):
– Uses stack data structure to implement the
function NEXT.
• NEXT returns the state at the front of the
agenda.
• Pitfall: Under certain circumstances they
can enter an infinite loop.
Depth-First Search of ND-RECOGNIZE
q0 q4
q1
b a
a
!
Start State Final State
q2 q3

Input
State b a ! 
0 1 Ø Ø Ø
1 Ø 2 Ø Ø
2 Ø 2,3 Ø Ø
3 Ø Ø 4 2
4: Ø Ø Ø Ø
b a a a !
q0
1
b a a a !
q0
2
q1
b a a a !
3
q1 q2
b a a a !
4
q2 q3
b a a a !
5
q3
X
b a a a ! 6
q2
b a a a ! 7
q3
b a a a ! 8
q4
A depth-first trace of FSA on some sheeptalk
1 possibility is
evaluated first
Breadth-First Search
• Breadth-First Search or First In First Out (FIFO)
strategy.
– All possible choices explored at once.
– Uses a queue data structure to implement NEXT function
• Pitfalls:
– As with depth-first if the state-space is infinite, the search
may never terminate.
– More importantly due to growth in the size of the agenda if
the state-space is even moderately large, the search may
require an impractically large amount of memory.
Breadth-First Search of ND-RECOGNIZE
q0 q4
q1
b a
a
!
Start State Final State
q2 q3

Input
State b a ! 
0 1 Ø Ø Ø
1 Ø 2 Ø Ø
2 Ø 2,3 Ø Ø
3 Ø Ø 4 2
4: Ø Ø Ø Ø
b a a a !
q0
1
b a a a !
q0
2
q1
b a a a !
3
q1 q2
b a a a !
4
q2 q3
b a a a !
5
q3
X
b a a a !
4
q2
b a a a ! 5
q3
b a a a ! 6
b a a a !
q2
5
q4
A breadth-first trace of FSA on some sheeptalk
2 possibilities
are evaluated
Advanced Search Algorithms
• For larger problems, more complex search
techniques such as dynamic programming
or A* must be used.
Equivalence of D-FSA and N-FSA Automata
• For many languages N-FSA’s are easier to construct
then D-FSA’s.
• However, any language that can be described with N-
FSA can also be described by some D-FSA.
– D-FSA typically has about as many states as N-FSA,
– D-FSA typically has more transitions
– Worst case scenario D-FSA can have no more than 2n states
describing a language specified by N-FSA with n-states.
Regular Languages and FSA
• The class of languages that is definable by
Regular Expressions is exactly the same as
the class of languages that are
characterizable by finite-state automata:
– Those languages are called Regular
Languages.
Formal Definition of Regular Languages
(important)
•  - alphabet = set of symbols in a language.
•  - empty string
• Ø – empty set.
• The regular languages (or regular sets) over  is then formally
defined as follows:
1. Ø is a regular language
2. ∀a ∈  ∪ , {a} is a regular language
3. If L1 and L2 are regular languages, then so are:
a) L1˙L2 = {xy|x ∈ L1, y ∈ L2},
the concatenation of L1 and L2
b) L1 ∪ L2, the union or disjunction of L1 and L2
c) L1
*, the * closure of L1 , as well as
L2
*, the * closure of L2
• All and only the sets of languages which meet the above properties are regular
languages.
Regular Languages and FSAs
• All regular languages can be implemented by the three operations
which define regular languages:
– Concatenation
– Disjunction|Union (also called “|”),
– * closure.
• Example:
– (*,+,{n,m}) are just a special case of repetition plus * closure.
– All the anchors can be thought of as individual special symbols.
– The square braces [] are a kind of disjunction:
• [ab] means “a or b”, or
• The disjunction of a and b.
Regular Languages and FSAs
• Regular languages are also closed under the following
operations:
– Intersection: if L1 and L2 are regular languages, then so is
L1 ∩ L2, the language consisting of the set of strings that are
in both L1 and L2.
– Difference: if L1 and L2 are regular languages, then so is L1
– L2, the language consisting of the set of strings that are in
L1 but not L2.
– Complementation: if L1 and L2 are regular languages, then
so is *-L1, *-L2 the set of all possible strings that are not
in L1, L2.
– Reversal: if L1 is regular language, then so is L1
R, the
language consisting of the set of reversals of the strings that
are in L1.
Regular Expressions and FSA
• The regular expressions are equivalent to finite-state automaton (Proof:
Hopcroft and Ullman 1979).
• Proof is inductive. Each primitive operations of a regular expression
(concatenation, union, closure) is shown as part of inductive step of the
proof:
q0 qf
(a) r=
q0 qf
(a) r=Ø
q0 qf
(a) r=a
a
Automata for the base case (no operators) for the induction showing that
any regular expression can be turned into an equivalent automaton
Concatenation
• FSAs next to each other by connecting all the final states of FSA1 to the
initial state of FSA2 by an -transition
q0
qf q0 qf

FSA1 FSA2
Closure
• Repetition: All final states of the FSA back to the initial states by -
transition
• Zero occurrences case: Direct link from the initial state to final state
Union
• Add a single new initial state q0, and add new -transitions from it to the
former initial states of the two machines to be joined
q0
qf
q0 qf

FSA1
FSA2
qf
q0



The union (|) of two FSAs

More Related Content

PPT
Chapter-three automata and complexity theory.ppt
PPTX
REGULAR EXPRESSION FOR NATURAL LANGUAGES
PPTX
NLP_KASHK:Regular Expressions
PPT
Regular Expressions grep and egrep
PPTX
Chapter 3: Introduction to Regular Expression
PPTX
Regular Expressions
PPTX
Regular expressions in Python
PPT
Introduction to regular expressions
Chapter-three automata and complexity theory.ppt
REGULAR EXPRESSION FOR NATURAL LANGUAGES
NLP_KASHK:Regular Expressions
Regular Expressions grep and egrep
Chapter 3: Introduction to Regular Expression
Regular Expressions
Regular expressions in Python
Introduction to regular expressions

Similar to Construction of a predictive parsing table.pdf (20)

PPT
regular-expressions lecture 28-string regular expression
DOCX
Python - Regular Expressions
PDF
Python (regular expression)
PPT
Adv. python regular expression by Rj
PPT
Chapter Two(1)
PPTX
Regular expressions
PPTX
Regular Expression
PDF
Regular expressions
PDF
Basta mastering regex power
PDF
An Introduction to Regular expressions
PDF
Regular expressions
PDF
Regularexpressions
PPT
Introduction to Regular Expressions RootsTech 2013
PPTX
[Compilers23] Lexical Analysis – Scanning Part I.pptx
PPTX
Regular Expression Crash Course
PDF
Regular expressions
ODP
Looking for Patterns
PDF
Python Programming - XI. String Manipulation and Regular Expressions
PDF
Regex startup
PDF
FUNDAMENTALS OF REGULAR EXPRESSION (RegEX).pdf
regular-expressions lecture 28-string regular expression
Python - Regular Expressions
Python (regular expression)
Adv. python regular expression by Rj
Chapter Two(1)
Regular expressions
Regular Expression
Regular expressions
Basta mastering regex power
An Introduction to Regular expressions
Regular expressions
Regularexpressions
Introduction to Regular Expressions RootsTech 2013
[Compilers23] Lexical Analysis – Scanning Part I.pptx
Regular Expression Crash Course
Regular expressions
Looking for Patterns
Python Programming - XI. String Manipulation and Regular Expressions
Regex startup
FUNDAMENTALS OF REGULAR EXPRESSION (RegEX).pdf
Ad

More from anwarkade1 (20)

PPT
CH 5 PUSH DOWN AUTOMATA PUSH DOWN AUTOMATA
PPTX
chapter 4 context-freegrammarintroduction context-freegramma
PPTX
Unit-1compiler design and its lecture note .pptx
PPT
Chapter-2-lexical-analyser and its property lecture note.ppt
PPTX
chapter 1 automata and complexity theory lecture note.pptx
PPT
context-freelanguages and properties lecture note.ppt
PPTX
Chapter-twoChapter-three automata and complexity theory .pptx
PPTX
Deterministic Finite Automata (DFA).pptx
PPTX
INSTAL AND OPTIMIZE OPERATING SYSTEM WINDOWS
PPTX
tvet management system for analysis in school or college
PPTX
Computer Hardware Technology PowerPoint Templates.pptx
PPTX
BUSINESS PLAN PRERARATION GUIDE FINAL PPT.pptx
PPTX
introduction of enter pruner lecture note
PPTX
group 3 bussiness plan FOR MANAUFACTURIN
PPTX
window server 2008 mail configuration
PPT
Information Technology.ppt
PPT
informatics1.ppt
PPT
chapter one Introduction to HCI.ppt
PPT
Chapter 9 TCP IP Reference Model.ppt
PPT
Chapter 5 Network Configuration Basics.ppt
CH 5 PUSH DOWN AUTOMATA PUSH DOWN AUTOMATA
chapter 4 context-freegrammarintroduction context-freegramma
Unit-1compiler design and its lecture note .pptx
Chapter-2-lexical-analyser and its property lecture note.ppt
chapter 1 automata and complexity theory lecture note.pptx
context-freelanguages and properties lecture note.ppt
Chapter-twoChapter-three automata and complexity theory .pptx
Deterministic Finite Automata (DFA).pptx
INSTAL AND OPTIMIZE OPERATING SYSTEM WINDOWS
tvet management system for analysis in school or college
Computer Hardware Technology PowerPoint Templates.pptx
BUSINESS PLAN PRERARATION GUIDE FINAL PPT.pptx
introduction of enter pruner lecture note
group 3 bussiness plan FOR MANAUFACTURIN
window server 2008 mail configuration
Information Technology.ppt
informatics1.ppt
chapter one Introduction to HCI.ppt
Chapter 9 TCP IP Reference Model.ppt
Chapter 5 Network Configuration Basics.ppt
Ad

Recently uploaded (20)

PPTX
Big Data Technologies - Introduction.pptx
PDF
Bridging biosciences and deep learning for revolutionary discoveries: a compr...
PDF
Approach and Philosophy of On baking technology
PDF
Spectral efficient network and resource selection model in 5G networks
PDF
Dropbox Q2 2025 Financial Results & Investor Presentation
PDF
Reach Out and Touch Someone: Haptics and Empathic Computing
PPTX
A Presentation on Artificial Intelligence
PDF
Mobile App Security Testing_ A Comprehensive Guide.pdf
PDF
NewMind AI Monthly Chronicles - July 2025
PPTX
20250228 LYD VKU AI Blended-Learning.pptx
PDF
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
PDF
Machine learning based COVID-19 study performance prediction
PDF
Encapsulation_ Review paper, used for researhc scholars
PDF
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
PDF
Agricultural_Statistics_at_a_Glance_2022_0.pdf
PDF
KodekX | Application Modernization Development
PDF
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
PPT
Teaching material agriculture food technology
PDF
Modernizing your data center with Dell and AMD
PPTX
Cloud computing and distributed systems.
Big Data Technologies - Introduction.pptx
Bridging biosciences and deep learning for revolutionary discoveries: a compr...
Approach and Philosophy of On baking technology
Spectral efficient network and resource selection model in 5G networks
Dropbox Q2 2025 Financial Results & Investor Presentation
Reach Out and Touch Someone: Haptics and Empathic Computing
A Presentation on Artificial Intelligence
Mobile App Security Testing_ A Comprehensive Guide.pdf
NewMind AI Monthly Chronicles - July 2025
20250228 LYD VKU AI Blended-Learning.pptx
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
Machine learning based COVID-19 study performance prediction
Encapsulation_ Review paper, used for researhc scholars
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
Agricultural_Statistics_at_a_Glance_2022_0.pdf
KodekX | Application Modernization Development
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
Teaching material agriculture food technology
Modernizing your data center with Dell and AMD
Cloud computing and distributed systems.

Construction of a predictive parsing table.pdf

  • 2. Introduction • Regular Expression (RE) – is a language for specifying text search strings. – First developed by Kleene (1956) – Requires a: • Pattern – specification formula using a special language that specifies simple classes of strings. • Corpus – a body of text to search through.
  • 3. Introduction • The regular expression is used for specifying: – text strings in situations like Web-search example, and in other – information retrieval applications, but also plays an important role in – word-processing, – computation of frequencies from corpora, and other such tasks.
  • 4. Introduction • Regular Expressions can be implemented via finite-state automaton. • Finite-State Automaton (FSA) is one of the most significant tools of computational linguistics. • Its variations: – Finite-state transducers – Hidden Markov Models, and – N-gram grammars Important components of the Speech Recognition and Synthesis, spell-checking, and information-extraction applications.
  • 5. Regular Expressions • Formally, a regular expression is an algebraic notation for characterizing a set of strings. – Thus they can be used to specify search strings as well as to define a language in a formal way. • Regular Expression requires – A pattern that we want to search for, and – A corpus of text to search through. • Thus when we give a search pattern, we will assume that the search engine returns the line of the document is returned. This is what the UNIX grep command does. • We will underline the exact part of the pattern that matches the regular expression. • A search can be designed to return all matches to a regular expression or only the first match.
  • 6. Basic Regular Expression Patterns • The simplest kind of regular expression is a sequence of simple characters: – /woodchuck/ – /Buttercup/ – /!/ RE Example Patterns Matched /woodchucks/ “interesting links to woodchucks and lemurs” /a/ “Mary Ann stopped by Mona’s” /Claire says,/ “Dagmar, my gift please,” Claire says,” /song/ “all our pretty songs” /!/ “You’ve left the burglar behind again!” said Nori
  • 7. Basic Regular Expression Patterns • Regular Expressions are case sensitive – /s/, is not the same as – /S/ • /woodchucks/ will not match “Woodchucks” • Disjunction: “[“ and “]”. RE Match Example Pattern /[wW]oodchuck/ Woodchuck or woodchuck “Woodchuck” /[abc]/ ‘a’, ‘b’, or ‘c’ “In uomini, in soldati” /[1234567890]/ Any digit “plenty of 7 to 5”
  • 8. Basic Regular Expression Patterns • It is inconvenient to search for example: /[ABCDEFGHIJKLMONPQRSTUVWXYZ]/ • Specifying range in Regular Expressions: “-” RE Match Example Patterns Matched /[A-Z]/ An uppercase letter “we should call it ‘Drenched Blossoms’” /[a-z]/ A lower case letter “my beans were impatient to be hoed!” /[0-9]/ A single digit “Chapter 1: Down the Rabbit Hole”
  • 9. Basic Regular Expression Patterns • Negative Specification – what pattern can not be: “^” – If the first symbol after the open square brace “[” is “^” the resulting pattern is negated. – Example /[^a]/ matches any single character (including special characters) except a. RE Match (single characters) Example Patterns Matched /[^A-Z]/ Not an uppercase letter “Oyfn pripetchik” /[^Ss]/ Neither ‘S’ nor ‘s’ “I have no exquisite reason for ’t” /[^.]/ Not a period “our resident Djinn” /[e^]/ Either ‘e’ or ‘^’ “look up ^ now” /a^b/ Pattern ‘a^b’ “look up a^b now”
  • 10. Basic Regular Expression Patterns • How do we specify both woodchuck and woodchucks? – Optional character specification: /?/ – /?/ means “the preceding character or nothing”. RE Match Example Patterns Matched /woodchucks?/ woodchuck or woodchucks “woodchuck” colou?r color or colour “colour”
  • 11. Basic Regular Expression Patterns • Question-mark “?” can be though of as “zero or one instance of the previous character”. • It is a way to specify how many of something that we want. • Sometimes we need to specify regular expressions that allow repetitions of things. • For example, consider the language of (certain) sheep, which consists of strings that look like the following: – baa! – baaa? – baaaa? – baaaaa? – baaaaaa? – …
  • 12. Basic Regular Expression Patterns • Any number of repetitions is specified by “*” which means “any string of 0 or more”. • Examples: – /aa*/ - a followed by zero or more a’s – /[ab]*/ - zero or more a’s or b’s. This will match aaaa or abababa or bbbb
  • 13. Basic Regular Expression Patterns • We know enough to specify part of our regular expression for prices: multiple digits. – Regular expression for individual digit: • /[0-9]/ – Regular expression for an integer: • /[0-9][0-9]*/ – Why is not just /[0-9]*/? • Because it is annoying to specify “at least once” RE since it involves repetition of the same pattern there is a special character that is used for “at least once”: “+” – Regular expression for an integer becomes then: • /[0-9]+/ – Regular expression for sheep language: • /baa*!/, or • /ba+!/
  • 14. Basic Regular Expression Patterns • One very important special character is the period: /./, a wildcard expression that matches any single character (except carriage return). • Example: Find any line in which a particular word (for example Veton) appears twice: – /Veton.*Veton/ RE Match Example Pattern /beg.n/ Any character between beg and n begin beg’n, begun
  • 15. Anchors • Anchors are special characters that anchor regular expressions to particular places in a string. – The most common anchors are: • “^” – matches the start of a line • “$” – matches the end of the line – Examples: • /^The/ - matches the word “The” only at the start of the line. • Three uses of “^”: 1. /^xyz/ - Matches the start of the line 2. [^xyz] – Negation 3. /^/ - Just to mean a caret  /⌴$/ - “⌴” Stands for space “character”; matches a space at the end of line.  /^The dog.$/ - matches a line that contains only the phrase “The dog”.
  • 16. Anchors • /b/ - matches a word boundary • /B/ - matches a non-boundary • /btheb/ - matches the word “the” but not the word “other”. • Word is defined as a any sequence of digits, underscores or letters. • /b99/ - will match the string 99 in “There are 99 bottles of beer on the wall” but NOT “There are 299 bottles of beer on the wall” and it will match the string “$99” since 99 follows a “$” which is not a digit, underscore, or a letter.
  • 17. Disjunction, Grouping and Precedence. • Suppose we need to search for texts about pets; specifically we may be interested in cats and dogs. If we want to search for either “cat” or the string “dog” we can not use any of the constructs we have introduced so far (why not “[]”?). • New operator that defines disjunction, also called the pipe symbol is “|”. • /cat|dog/ - matches either cat or the string dog.
  • 18. Grouping • In many instances it is necessary to be able to group the sequence of characters to be treated as one set. • Example: Search for guppy and guppies. – /gupp(y|ies)/ • Useful in conjunction to “*” operator. – /*/ - applies to single character and not to a whole sequence. • Example: Match “Column 1 Column 2 Column 3 …” – /Column⌴[0-9]+⌴*/ - will match “Column # …“ – /(Column⌴[0-9]+⌴*)*/ - will match “Column 1 Column 2 Column 3 …”
  • 19. Operator Precedence Hierarchy Operator Class Precedence from Highest to Lowest Parenthesis () Counters * + ? {} Sequences and anchors ^ $ Disjunction |
  • 20. Simple Example • Problem Statement: Want to write RE to find cases of the English article “the”. 1. /the/ - It will miss “The” 2. /[tT]he/ - It will match “amalthea”, “Bethesda”, “theology”, etc. 3. /b[tT]heb/ - Is the correct RE • Problem Statement: If we want to find “the” where it might also have undelines or numbers nearby (“The-” , “the_” or “the25”) one needs to specify that we want instances in which there are no alphabetic letters on either side of “the”: 1. /[^a-zA-Z][tT]he[^a-zA-Z]/ - it will not find “the” if it begins the line. 2. /(^|[^a-zA-Z])[tT]he[^a-zA-Z]/
  • 21. Simple Example (cont.) • Refining RE by reduction of: – false positives (false acceptance): • Strings that are incorrectly matched. – false negatives (false rejection): • Strings that are incorrectly missed.
  • 22. A More Complex Example • Problem Statement: Build an application to help a user by a computer on the Web. – The user might want “any PC with more than 6 GHz and 256 GB of disk space for less than $1000 – To solve the problem must be able to match the expressions like 1000 MHz, 6 GHz and 256 GB as well as $999.99 etc.
  • 23. Solution – Dollar Amounts • Complete regular expression for prices of full dollar amounts: – /$[0-9]+/ • Adding fractions of dollars: – /$[0-9]+.[0-9][0-9]/ or – /$[0-9]+.[0-9] {2}/ • Problem since this RE only will match “$199.99” and not “$199”. To solve this issue must make cents optional and make sure the $ amount is a word: – /b$[0-9]+(.[0-9][0-9])?b/
  • 24. Solution: Processor Speech • Processor speed in megahertz = MHz or gigahertz = GHz) – /b[0-9]+⌴*(MHz|[Mm]egahertz|GHz|[Gg]igahertz)b/ – ⌴* is used to denote “zero or more spaces”. • Simple pattern to specify operating systems: – /b(Windows ⌴*(8|7|Vista|XP)?)b/ – /b(Mac|Macintosh|Apple|OS ⌴X)b/
  • 25. Solution: Disk Space • Dealing with disk space: – Gb = gigabytes • Memory size: – Mb or MB = megabytes or – Gb or GB = gigabytes • Must allow optional fractions: – /b[0-9]+⌴*(M[Bb]|[Mm]egabytes?)b/ – /b[0-9]+(.[0-9]+)?⌴*(G[Bb]|[Gg]igabytes?)b/
  • 26. Solution: Operating Systems and Vendors • /b((Windows)+⌴*(XP|Vista|7|8)?)b/ • /b((Mac|Macintosh|Apple|OS ⌴X)b/
  • 27. Advanced Operators RE Expansion Match Example Patterns d [0-9] Any digit “Party of 5” D [^0-9] Any non-digit “Blue moon” w [a-zA-Z0- 9⌴] Any alphanumeric or space Daiyu W [^w] A non- alphanumeric !!!! s [⌴rtnf] Whitespace (space, tab) “ ” S [^s] Non- whitespace “in Concord” Aliases for common sets of characters
  • 28. Repetition Metacharacters RE Description Example * Matches any number of occurrences of the previous character – zero or more /ac*e/ - matches “ae”, “ace”, “acce”, “accce” as in “The aerial acceleration alerted the ace pilot” ? Matches at most one occurrence of the previous characters – zero or one. /ac?e/ - matches “ae” and “ace” as in “The aerial acceleration alerted the ace pilot” + Matches one or more occurrences of the previous characters /ac+e/ - matches “ace”, “acce”, “accce” as in “The aerial acceleration alerted the ace pilot” {n} Matches exactly n occurrences of the previous characters. /ac{2}e/ - matches “acce” as in “The aerial acceleration alerted the ace pilot” {n,} Matches n or more occurrences of the previous characters /ac{2,}e/ - matches “acce”, “accce” etc., as in “The aerial acceleration alerted the ace pilot” {n,m} Matches from n to m occurrences of the previous characters. /ac{2,4}e/ - matches “acce”, “accce” and “acccce” , as in “The aerial acceleration alerted the ace pilot” . Matches one occurrence of any characters of the alphabet except the new line character /a.e/ matches aae, aAe, abe, aBe, a1e, etc., as in ““The aerial acceleration alerted the ace pilot” .* Matches any string of characters and until it encounters a new line character
  • 29. Literal Matching of Special Characters & “” Characters RE Match Example Patterns * An asterisk “*” “K*A*P*L*A*N” . A period “.” “Dr. Këpuska, I presume” ? A question mark “?” “Would you like to light my candle?” n A newline t A tab r A carriage return character Some characters that need to be back-slashed “”
  • 30. Finate State Automata • The regular expression is more than just a convenient metalangue for text searching. 1. A regular expression is one way of describing a finite- state-automaton (FSA).  FSA – are the theoretical foundation of significant number of computational work.  Any regular expression can be implemented as FSA (except regular expressions that use the memory feature). 2. Regular expression is one way of characterizing a particular kind of formal language called a regular language.  Both FSA and RE can be used to describe regular languages.
  • 31. FSA, RE and Regular Languages Finite automata Regular languages Regular expressions
  • 32. Finite-State Automaton for Regular Expressions • Using FSA to Recognize Sheeptalk with RE: /baa+!/ q0 q4 q1 q2 q3 b a a a ! Start State Final State Transitions
  • 33. FSA Use • The FSA can be used for recognizing (we also say accepting) strings in the following way. First, think of the input as being written on a long tape broken up into cells, with one symbol written in each cell of the tape, as figure below: a b a ! b q0
  • 34. Recognition Process • The machine starts in the start state (q0), and iterates the following process: 1. Check the next letter of the input. a. If it matches the symbol on an arc leaving the current state, then i. cross that arc ii. move to the next state, also iii. advance one symbol in the input b. If we are in the accepting state (q4) when we run out of input, the machine has successfully recognized an instance of sheeptalk. 2. If the machine never gets to the final state, a. either because it runs out of input, or b. it gets some input that doesn’t match an arc (as in Fig in previous slide), or c. if it just happens to get stuck in some non-final state, we say the machine rejects or fails to accept an input.
  • 35. State Transition Table Input State b a ! 0 1 Ø Ø 1 Ø 2 Ø 2 Ø 3 Ø 3 Ø 3 4 4: Ø Ø Ø We’ve marked state 4 with a colon to indicate that it’s a final state (you can have as many final states as you want), and the Ø indicates an illegal or missing transition. We can read the first row as “if we’re in state 0 and we see the input b we must go to state 1. If we’re in state 0 and we see the input a or !, we fail”.
  • 36. Formal Definition of Automaton Q={q0,q1,…,qN} A finite set of N states  a finite input alphabet of symbols q0 the start state F the set of final states, F ⊆ Q (q, i) the transition function or transition matrix between states. Given a state q ∈ Q and an input symbol i ∈ , δ(q, i) returns a new state q′ ∈ Q. δ is thus a relation from Q×S to Q;
  • 37. FSA Example • Q = {q0,q1,q2,q3,q4}, •  = {a,b, !}, • F = {q4}, and • (q, i) q0 q4 q1 q2 q3 b a a a ! Start State Final State Transitions
  • 38. Deterministic Algorithm for Recognizing a String function D-RECOGNIZE(tape,machine) returns accept or reject index←Beginning of tape current-state←Initial state of machine loop if End of input has been reached then if current-state is an accept state then return accept else return reject elsif transition-table[current-state,tape[index]] is empty then return reject else current-state←transition-table[current-state,tape[index]] index←index + 1 end
  • 39. Tracing Execution for Some Sheep Talk q0 q4 q1 q2 q3 b a a a ! Start State Final State Transitions b a a a ! q0 q1 q2 q3 q3 q4 Input State b a ! 0 1 Ø Ø 1 Ø 2 Ø 2 Ø 3 Ø 3 Ø 3 4 4: Ø Ø Ø
  • 40. Tracing Execution for Some Sheep Talk (cont.) Before examining the beginning of the tape, the machine is in state q0. Finding a b on input tape, it changes to state q1 as indicated by the contents of transition-table[q0,b] in Fig. It then finds an a and switches to state q2, another a puts it in state q3, a third a leaves it in state q3, where it reads the “!”, and switches to state q4. Since there is no more input, the End of input condition at the beginning of the loop is satisfied for the first time and the machine halts in q4. State q4 is an accepting state, and so the machine has accepted the string baaa! as a sentence in the sheep language.
  • 41. Fail State • The algorithm will fail whenever there is no legal transition for a given combination of state and input. The input abc will fail to be recognized since there is no legal transition out of state q0 on the input a, (i.e., this entry of the transition table has a Ø). • Even if the automaton had allowed an initial a it would have certainly failed on c, since c isn’t even in the sheeptalk alphabet! We can think of these “empty” elements in the table as if they all pointed at one “empty” state, which we might call the fail state or sink state. • In a sense then, we could FAIL STATE view any machine with empty transitions as if we had augmented it with a fail state, and drawn in all the extra arcs, so we always had somewhere to go from any state on any possible input. Just for completeness, next Fig. shows the FSA from previous Figure with the fail state qF filled in.
  • 42. Adding a Fail State to FSA q0 q4 q1 b a a ! Start State Final State a a qF q2 q3 ! b ! b b b ! ! a
  • 43. Formal Languages • Key Concept #1. Formal Language: – A model which can both generate and recognize all and only the strings of a formal language acts as a definition of the formal language. – A formal language is a set of strings, each string composed of symbols from a finite symbol-set called an alphabet (the same alphabet used above for defining an automaton!). • The alphabet for a “sheep” language is the set  = {a,b, !}. • Given a model m (such as FSA) we can use L(m) to mean “the formal language characterized by m”. • L(m)={baa!,baaa!, baaaa!, baaaaa!,….}
  • 44. Formal Languages • Automaton can express an infinite set in a closed form. • Formal Languages are not the same as natural languages. – Formal languages are used to model part of a natural language: • Phonology, • Morphology, or • Syntax • Generative Grammar: – It is used sometimes in linguistics to mean a grammar of a formal language; – The use of an automaton to define a language by generating all possible strings.
  • 45. Example 2 • Alphabet consisting of words. Must build an FSA that models the sub-part of English language that deals with amounts of money: – Ten cents, – Three dollars, – One dollar thirty-five cents, … • Such a formal language would model the subset of English that consists of phrases like ten cents, three dollars, one dollar thirty-five cents, etc. 1. Solve the problem of building FSA for numbers 1-99 with which will model cents. 2. Model dollar amounts by adding cents to it.
  • 46. FSA for the words for English numbers 1- 99 q0 q2 q1 Start State Twenty Thirty Forty Fifty Sixty Seventy Eighty Ninety One Two Three Four Five Six Seven Eight Nine Final State Ten One Eleven Two Twelve Twenty Three Thirteen Thirty Four Fourteen Forty Five Fifteen Fifty Six Sixteen Sixty Seven Seventeen Seventy Eight Eighteen Eighty Nine Nineteen Ninety
  • 47. FSA for the simple Dollars and Cents q0 q1 Ten One Eleven Two Twelve Twenty Three Thirteen Thirty Four Fourteen Forty Five Fifteen Fifty Six Sixteen Sixty Seven Seventeen Seventy Eight Eighteen Eighty Nine Nineteen Ninety Twenty Thirty Forty Fifty Sixty Seventy Eighty Ninety One Two Three Four Five Six Seven Eight Nine q3 q2 cents q5 Ten One Eleven Two Twelve Twenty Three Thirteen Thirty Four Fourteen Forty Five Fifteen Fifty Six Sixteen Sixty Seven Seventeen Seventy Eight Eighteen Eighty Nine Nineteen Ninety Twenty Thirty Forty Fifty Sixty Seventy Eighty Ninety One Two Three Four Five Six Seven Eight Nine q7 q6 cents dollars q4
  • 48. Non-Deterministic FSAs q0 q4 q1 q2 q3 b a a ! Start State Final State a q0 q4 q1 q2 q3 b a a a ! Start State Final State Deterministic FSA Non-Deterministic FSA
  • 49. Deterministic vs Non-deterministic FSA • Deterministic FSA is one whose behavior during recognition is fully determined by the state it is in and the symbol it is looking at. • The FSA in the previous slide when FSA is at the state q2 and the input symbol is a we do not know whether to remain in state 2 (self-loop transition) or state 3 (the other transition) . • Clearly the decision dependents on the next input symbols.
  • 50. Another NFSA for “sheep” language:  - transition q0 q4 q1 b a a ! Start State Final State q2 q3  •  - transition defines the arc that causes transition without an input symbol. Thus when in state q3 transition to state q2 is allowed without looking at the input symbol or advancing input pointer. • This example is another kind of non-deterministic behavior – we “might not know” whether to follow the  - transition or the ! arc.
  • 51. Using NFSA to Accept Strings • There is a problem of (wrong) choice in non-deterministic FSA. There are three standard solutions to the problem of non-determinism: – Backup: Whenever we come to a choice point, we could put a marker to mark where we were in the input, and what state the automaton was in. Then if it turns out that we took the wrong choice, we could back up and try another path. – Look-ahead: We could look ahead in the input to help us decide which path to take. – Parallelism: Whenever we come to a choice point, we could look at every alternative path in parallel. • We will focus here on the backup approach and defer discussion of the look-ahead and parallelism approaches to later chapters.
  • 52. Back-up Approach for NFSA Recognizer • The backup approach suggests that we should make choices that might lead to dead-ends, knowing that we can always return to unexplored alternative choices. • There are two key points to this approach: 1. Must know ALL alternatives for each choice point. 2. Store sufficient information about each alternative so that we can return to it when necessary.
  • 53. Back-up Approach for NFSA Recognizer • When a backup algorithm reaches a point in its processing where no progress can be made: – Runs out of input, or – Has no legal transitions, It returns to a previous choice point and selects one of the unexplored alternatives and continues from there. • To apply this notion to current definition of FSA we need only to store two things for each choice point: 1. The State (or node) 2. Corresponding position on the tape.
  • 54. Search State • Combination of the node and the position specifies the search state of the recognition algorithm. • To avoid confusion, the state of automaton is called a node or machine-state. • Two changes are necessary in transition table: 1. To represent nodes that have  - transitions we need to add  - column, 2. Accommodate multiple transitions to different nodes from the same input symbol. Each cell entry consists of the list of destination nodes rather then a single node.
  • 55. The Transition table of a NFSA Input State b a !  0 1 Ø Ø Ø 1 Ø 2 Ø Ø 2 Ø 2,3 Ø Ø 3 Ø Ø 4 2 4: Ø Ø Ø Ø q0 q4 q1 b a a ! Start State Final State q2 q3 
  • 56. An Algorithm for NFSA Recognition function ND-RECOGNIZE(tape,machine) returns accept or reject agenda←{(Initial state of machine, beginning of tape)} current-search-state←NEXT(agenda) loop if ACCEPT-STATE?(current-search-state) returns true then return accept else agenda← agenda ∪ GENERATE-NEW-STATES(current-search-state) if agenda is empty then return reject else current-search-state←NEXT(agenda) end
  • 57. An Algorithm for NFSA Recognition (cont.) function GENERATE-NEW-STATES(current-state) returns a set of search-states current-node←the node the current search-state is in index←the point on the tape the current search-state is looking at return a list of search states from transition table as follows: (transition-table[current-node, ], index) ∪ (transition-table[current-node, tape[index]], index + 1) function ACCEPT-STATE?(search-state) returns true or false current-node←the node search-state is in index←the point on the tape search-state is looking at if index is at the end of the tape and current-node is an accept state of machine then return true else return false
  • 58. Possible execution of ND-RECOGNIZE q0 q4 q1 b a a ! Start State Final State q2 q3  Input State b a !  0 1 Ø Ø Ø 1 Ø 2 Ø Ø 2 Ø 2,3 Ø Ø 3 Ø Ø 4 2 4: Ø Ø Ø Ø b a a a ! q0 1 b a a a ! q0 2 q1 b a a a ! 3 q1 q2 b a a a ! 4 q2 q3 b a a a ! 5 q3 X b a a a ! 6 q2 b a a a ! 7 q3 b a a a ! 8 q3 2 possibilities
  • 59. Depth-First-Search • Depth-First-Search or Last-In-First-Out (LIFO): – Uses stack data structure to implement the function NEXT. • NEXT returns the state at the front of the agenda. • Pitfall: Under certain circumstances they can enter an infinite loop.
  • 60. Depth-First Search of ND-RECOGNIZE q0 q4 q1 b a a ! Start State Final State q2 q3  Input State b a !  0 1 Ø Ø Ø 1 Ø 2 Ø Ø 2 Ø 2,3 Ø Ø 3 Ø Ø 4 2 4: Ø Ø Ø Ø b a a a ! q0 1 b a a a ! q0 2 q1 b a a a ! 3 q1 q2 b a a a ! 4 q2 q3 b a a a ! 5 q3 X b a a a ! 6 q2 b a a a ! 7 q3 b a a a ! 8 q4 A depth-first trace of FSA on some sheeptalk 1 possibility is evaluated first
  • 61. Breadth-First Search • Breadth-First Search or First In First Out (FIFO) strategy. – All possible choices explored at once. – Uses a queue data structure to implement NEXT function • Pitfalls: – As with depth-first if the state-space is infinite, the search may never terminate. – More importantly due to growth in the size of the agenda if the state-space is even moderately large, the search may require an impractically large amount of memory.
  • 62. Breadth-First Search of ND-RECOGNIZE q0 q4 q1 b a a ! Start State Final State q2 q3  Input State b a !  0 1 Ø Ø Ø 1 Ø 2 Ø Ø 2 Ø 2,3 Ø Ø 3 Ø Ø 4 2 4: Ø Ø Ø Ø b a a a ! q0 1 b a a a ! q0 2 q1 b a a a ! 3 q1 q2 b a a a ! 4 q2 q3 b a a a ! 5 q3 X b a a a ! 4 q2 b a a a ! 5 q3 b a a a ! 6 b a a a ! q2 5 q4 A breadth-first trace of FSA on some sheeptalk 2 possibilities are evaluated
  • 63. Advanced Search Algorithms • For larger problems, more complex search techniques such as dynamic programming or A* must be used.
  • 64. Equivalence of D-FSA and N-FSA Automata • For many languages N-FSA’s are easier to construct then D-FSA’s. • However, any language that can be described with N- FSA can also be described by some D-FSA. – D-FSA typically has about as many states as N-FSA, – D-FSA typically has more transitions – Worst case scenario D-FSA can have no more than 2n states describing a language specified by N-FSA with n-states.
  • 65. Regular Languages and FSA • The class of languages that is definable by Regular Expressions is exactly the same as the class of languages that are characterizable by finite-state automata: – Those languages are called Regular Languages.
  • 66. Formal Definition of Regular Languages (important) •  - alphabet = set of symbols in a language. •  - empty string • Ø – empty set. • The regular languages (or regular sets) over  is then formally defined as follows: 1. Ø is a regular language 2. ∀a ∈  ∪ , {a} is a regular language 3. If L1 and L2 are regular languages, then so are: a) L1˙L2 = {xy|x ∈ L1, y ∈ L2}, the concatenation of L1 and L2 b) L1 ∪ L2, the union or disjunction of L1 and L2 c) L1 *, the * closure of L1 , as well as L2 *, the * closure of L2 • All and only the sets of languages which meet the above properties are regular languages.
  • 67. Regular Languages and FSAs • All regular languages can be implemented by the three operations which define regular languages: – Concatenation – Disjunction|Union (also called “|”), – * closure. • Example: – (*,+,{n,m}) are just a special case of repetition plus * closure. – All the anchors can be thought of as individual special symbols. – The square braces [] are a kind of disjunction: • [ab] means “a or b”, or • The disjunction of a and b.
  • 68. Regular Languages and FSAs • Regular languages are also closed under the following operations: – Intersection: if L1 and L2 are regular languages, then so is L1 ∩ L2, the language consisting of the set of strings that are in both L1 and L2. – Difference: if L1 and L2 are regular languages, then so is L1 – L2, the language consisting of the set of strings that are in L1 but not L2. – Complementation: if L1 and L2 are regular languages, then so is *-L1, *-L2 the set of all possible strings that are not in L1, L2. – Reversal: if L1 is regular language, then so is L1 R, the language consisting of the set of reversals of the strings that are in L1.
  • 69. Regular Expressions and FSA • The regular expressions are equivalent to finite-state automaton (Proof: Hopcroft and Ullman 1979). • Proof is inductive. Each primitive operations of a regular expression (concatenation, union, closure) is shown as part of inductive step of the proof: q0 qf (a) r= q0 qf (a) r=Ø q0 qf (a) r=a a Automata for the base case (no operators) for the induction showing that any regular expression can be turned into an equivalent automaton
  • 70. Concatenation • FSAs next to each other by connecting all the final states of FSA1 to the initial state of FSA2 by an -transition q0 qf q0 qf  FSA1 FSA2
  • 71. Closure • Repetition: All final states of the FSA back to the initial states by - transition • Zero occurrences case: Direct link from the initial state to final state
  • 72. Union • Add a single new initial state q0, and add new -transitions from it to the former initial states of the two machines to be joined q0 qf q0 qf  FSA1 FSA2 qf q0    The union (|) of two FSAs