Lexical Analyzer
 Lexical Analyzer reads the source program
character by character to produce tokens.
 Normally a lexical analyzer doesn’t return a list of
tokens in one shot; it returns one token each time the
parser asks for the next token.
Compiler Construction 1
Chapter 2
[Figure: the source program enters the Lexical Analyzer; the Parser sends "get next token" requests and receives a token in return; both components use the Symbol Table]
• Since the LA is the part of the compiler that reads the source text, it may also perform certain other tasks, such as:
1. Stripping out comments and whitespace (blank, newline, tab).
2. Correlating error messages generated by the compiler with the source program. For example, the LA may keep track of the number of newline characters seen, so that it can associate a line number with each error message.
3. In some compilers, the LA makes a copy of the source program with the error messages inserted at the appropriate positions.
4. If the source program uses a macro preprocessor, the expansion of macros may also be performed by the LA.
5. Sometimes the LA is divided into a cascade of two processes: A) scanning and B) lexical analysis.
Issues in Lexical Analysis
Reasons for separating the analysis phase of compiling into
lexical analysis and parsing
 Simpler Design
 Compiler Efficiency Is Improved
 Compiler Portability Is Enhanced
Tokens, Patterns and Lexemes
• A token represents a set of strings described by a pattern.
– The token identifier represents the set of strings that start with a letter and continue with letters and digits.
– The actual string (e.g. newval) is called a lexeme.
– Tokens: identifier, number, addop, delimiter, …
• Since a token can represent more than one lexeme, additional information must be held for the specific lexeme. This additional information is called the attribute of the token.
• For simplicity, a token may have a single attribute that holds the required information for that token.
– For identifiers, this attribute is a pointer to the symbol table, and
the symbol table holds the actual attributes for that token.
 Some attributes:
– <id,attr> where attr is pointer to the symbol table
– <assgop,_> no attribute is needed (if there is only one
assignment operator)
– <num,val> where val is the actual value of the number.
• The token type and its attribute together uniquely identify a lexeme.
• Regular expressions are widely used to specify patterns.
Example
E = M * C ** 2
Token names and associated attributes are
<id, pointer to symbol table entry for E>
<assign_op>
<id, pointer to symbol table entry for M>
<mult_op>
<id, pointer to symbol table entry for C>
<exp_op>
<number,integer value 2>
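The token stream above can be sketched in Python. This is only an illustration under stated assumptions: the names TOKEN_SPEC, tokens and symtab are hypothetical, a list index stands in for the "pointer to symbol table entry", and characters that match no pattern are silently skipped.

```python
import re

# Hypothetical token names mirroring the example above.
TOKEN_SPEC = [
    ("exp_op", r"\*\*"),          # listed before mult_op so ** is not read as two *
    ("mult_op", r"\*"),
    ("assign_op", r"="),
    ("number", r"[0-9]+"),
    ("id", r"[A-Za-z][A-Za-z0-9]*"),
    ("ws", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokens(source, symtab):
    """Yield (token, attribute) pairs one at a time, as a parser would request them."""
    for m in MASTER.finditer(source):
        kind, lexeme = m.lastgroup, m.group()
        if kind == "ws":
            continue                      # whitespace is stripped, not returned
        if kind == "id":
            if lexeme not in symtab:
                symtab.append(lexeme)     # install the identifier on first sight
            yield (kind, symtab.index(lexeme))
        elif kind == "number":
            yield (kind, int(lexeme))     # attribute is the integer value
        else:
            yield (kind, None)            # operators need no attribute

symtab = []
stream = list(tokens("E = M * C ** 2", symtab))
```

Because tokens is a generator, each token is produced only when the consumer asks for it, which matches the "get next token" interaction between parser and lexical analyzer.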
Attributes of Tokens
y := 31 + 28*x
<id, “y”> <assign, > <num, 31> <+, > <num, 28> <*, > <id, “x”>
(In each pair the first component is the token and the second is its attribute. The lexical analyzer passes these pairs to the parser.)
Tokens, Patterns, and Lexemes
A token is a classification of lexical units
– For example: id and num
Lexemes are the specific character strings that
make up a token
– For example: abc and 123
Patterns are rules describing the set of lexemes
belonging to a token
– For example: “letter followed by letters and
digits” and “non-empty sequence of digits”
Lexical Errors
fi ( a == f(x) )…
• fi may be a misspelling of the keyword if.
• It may also be an undeclared function identifier.
• Since fi is a valid identifier, the lexical analyzer must return the token id to the parser and let some other phase handle the error.
Panic Mode Recovery – Delete successive characters from the remaining input until the LA can find a well-formed token.
Other possible recovery actions
1. Delete one character from the remaining input
2. Insert a missing character into the remaining input
3. Replace a character by another character
4. Transpose two adjacent characters
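Panic mode recovery, as described above, can be sketched as follows. The token pattern TOKEN and the function name scan are assumptions made for this illustration; a real scanner would also report each deleted character as an error.

```python
import re

# Hypothetical token set: identifiers, integers, ==, and a few single characters.
TOKEN = re.compile(r"[A-Za-z][A-Za-z0-9]*|[0-9]+|==|[()=+*]")

def scan(text):
    """Return (tokens, skipped); skipped holds the characters deleted by panic mode."""
    tokens, skipped, pos = [], [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        m = TOKEN.match(text, pos)
        if m:
            tokens.append(m.group())
            pos = m.end()
        else:
            skipped.append(text[pos])   # panic mode: delete one character and retry
            pos += 1
    return tokens, skipped

result = scan("fi ( a == f(x) ) @# y")
```

Note that fi is returned as an ordinary identifier: the scanner cannot tell that it is a misspelled keyword.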
Input Buffering
Three approaches to the implementation of a lexical
analyzer
1. Use of lexical analyzer generator e.g. Lex compiler
2. Write the lexical analyzer in a conventional systems programming language
3. Write the lexical analyzer in assembly language, using I/O facilities to read the input
Buffer Pairs
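[Figure: a pair of buffers holding the input E = M * C ** 2 followed by eof]
• Each buffer is of the same size N, where N is usually the size of a disk block.
• N characters are read into a buffer with a single system read command.
• eof marks the end of the source file.
• Two pointers are maintained: lexemeBegin marks the start of the current lexeme, and forward scans ahead until a pattern match is found.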
Sentinels
• Each time we advance the forward pointer, we must make two tests:
1. One for the end of the buffer
2. One to determine which character was read
• These two tests can be combined by placing a sentinel character at the end of each buffer.
• The sentinel is a special character that cannot be part of the source program, e.g. eof.
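A minimal sketch of a buffer pair with sentinels, assuming the NUL character as the sentinel and a tiny buffer size so that reloads are easy to follow. The class and method names are hypothetical, and lexemeBegin is left out to keep the example short.

```python
import io

EOF = "\0"  # assumed sentinel: a character that cannot appear in the source text

class BufferPair:
    def __init__(self, stream, n=4):
        self.stream, self.n = stream, n
        self.buf = [EOF] * (2 * n + 2)   # two halves, each followed by a sentinel slot
        self.forward = 0
        self._load(0)

    def _load(self, start):
        """Read up to n characters into the half that begins at index start."""
        chunk = self.stream.read(self.n)
        for i, ch in enumerate(chunk):
            self.buf[start + i] = ch
        self.buf[start + len(chunk)] = EOF   # end of half, or true end of input

    def next_char(self):
        """Return the next source character, or None at the real end of input."""
        ch = self.buf[self.forward]
        if ch != EOF:                  # the only test on the common path
            self.forward += 1
            return ch
        n = self.n
        if self.forward == n:          # sentinel closing the first half
            self._load(n + 1)
            self.forward = n + 1
            return self.next_char()
        if self.forward == 2 * n + 1:  # sentinel closing the second half
            self._load(0)
            self.forward = 0
            return self.next_char()
        return None                    # sentinel inside a half: end of input

def read_all(text, n=4):
    bp, out = BufferPair(io.StringIO(text), n), []
    while (c := bp.next_char()) is not None:
        out.append(c)
    return "".join(out)
```

The end-of-buffer check only runs after the sentinel has been seen, so ordinary characters cost one comparison each.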
Can we run out of buffer space?
• In modern languages lexemes are short, and one or two characters of lookahead are sufficient.
• A buffer size N in the thousands is ample, and the double-buffer scheme works without problems.
• To avoid problems with long character strings, we can treat them as a concatenation of components.
• In Java, the + operator is used to build long strings that are written across several lines.
• PL/I does not treat keywords as reserved words.
Problem with DECLARE( Arg1, Arg2……: it is not clear whether DECLARE is a keyword or an identifier until the text after the closing parenthesis has been seen.
Specification of Tokens
 Regular expressions are an important notation for
specifying lexeme patterns
Terminology of Languages
• Alphabet: a finite set of symbols (e.g. the ASCII characters)
• String:
– A finite sequence of symbols drawn from an alphabet
– The terms sentence and word are also used for string
– ε is the empty string
– |s| is the length of string s
• Language: a set of strings over some fixed alphabet
– ∅, the empty set, is a language
– {ε}, the set containing only the empty string, is a language
– The set of well-formed C programs is a language
– The set of all possible identifiers is a language
Operations on Languages
• Concatenation:
– $L_1 L_2 = \{\, s_1 s_2 \mid s_1 \in L_1 \text{ and } s_2 \in L_2 \,\}$
• Union:
– $L_1 \cup L_2 = \{\, s \mid s \in L_1 \text{ or } s \in L_2 \,\}$
• Exponentiation:
– $L^0 = \{\varepsilon\}$, $L^1 = L$, $L^2 = LL$
• Kleene closure:
– $L^* = \bigcup_{i=0}^{\infty} L^i$
• Positive closure:
– $L^+ = \bigcup_{i=1}^{\infty} L^i$
Example
• L1 = {a,b,c,d}, L2 = {1,2}
• L1L2 = {a1,a2,b1,b2,c1,c2,d1,d2}
• L1 ∪ L2 = {a,b,c,d,1,2}
• L1^3 = all strings of length three over {a,b,c,d}
• L1* = all strings over the letters a,b,c,d, including the empty string
• L1+ = the same set without the empty string
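The operations in this example can be tried out on finite sets of Python strings. This is a sketch only: the function names are hypothetical, the empty string "" plays the role of ε, and the closures are approximated by a finite union because L* itself is infinite.

```python
def concat(l1, l2):
    """L1L2: every string of l1 followed by every string of l2."""
    return {s1 + s2 for s1 in l1 for s2 in l2}

def power(l, n):
    """L^n, with L^0 = {ε}."""
    result = {""}
    for _ in range(n):
        result = concat(result, l)
    return result

def closure_up_to(l, k, positive=False):
    """Finite approximation of L* (or L+): the union of L^i for i up to k."""
    out = set()
    for i in range(1 if positive else 0, k + 1):
        out |= power(l, i)
    return out

L1, L2 = {"a", "b", "c", "d"}, {"1", "2"}
```

For instance, power(L1, 3) contains the 4 × 4 × 4 = 64 strings of length three over {a,b,c,d}.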
Regular Expressions
 We use regular expressions to describe tokens of a
programming language.
 A regular expression is built up of simpler regular
expressions (using defining rules)
 Each regular expression denotes a language.
 A language denoted by a regular expression is
called a regular set.
Regular Expressions (Rules)
Regular expressions over alphabet Σ
Reg. Expr      Language it denotes
ε              {ε}
a ∈ Σ          {a}
(r1) | (r2)    L(r1) ∪ L(r2)
(r1) (r2)      L(r1) L(r2)
(r)*           (L(r))*
(r)            L(r)
Regular Expressions (cont.)
• We may remove parentheses by using precedence rules:
– * has the highest precedence
– concatenation comes next
– | has the lowest precedence
• ab*|c means (a(b)*)|(c)
• Ex:
– Σ = {0,1}
– 0|1 => {0,1}
– (0|1)(0|1) => {00,01,10,11}
– 0* => {ε,0,00,000,0000,....}
– (0|1)* => all strings of 0s and 1s, including the empty string
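Python's re module uses the same precedence for *, concatenation and |, so the rules above can be checked with a short sketch. The helper name denotes is hypothetical; re.fullmatch is used so that the whole string must belong to the language.

```python
import re

def denotes(regex, s):
    """True if the whole string s belongs to the language of regex."""
    return re.fullmatch(regex, s) is not None

implicit = "ab*|c"          # relies on the precedence rules
explicit = "(a(b)*)|(c)"    # fully parenthesized form
samples = ["a", "abbb", "c", "ac", "abc", ""]
agree = all(denotes(implicit, s) == denotes(explicit, s) for s in samples)

candidates = ["00", "01", "10", "11", "0", "000"]
pairs = sorted(s for s in candidates if denotes("(0|1)(0|1)", s))
```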
Regular Definitions
♠ Writing a regular expression for some languages can be difficult, because the expression can become quite complex. In those cases, we may use regular definitions.
♠ We can give names to regular expressions, and use these names as symbols to define other regular expressions.
♠ A regular definition is a sequence of definitions of the form:
d1 → r1
d2 → r2
...
dn → rn
where each di is a distinct name and each ri is a regular expression over the alphabet Σ ∪ {d1,d2,...,di-1}, i.e. the basic symbols and the previously defined names.
Regular Definitions (cont.)
Ex: Identifiers in Pascal
letter → A | B | ... | Z | a | b | ... | z
digit → 0 | 1 | ... | 9
id → letter (letter | digit)*
– If we try to write the regular expression for identifiers without using regular definitions, it becomes more complex:
(A|...|Z|a|...|z) ( (A|...|Z|a|...|z) | (0|...|9) )*
Regular Definitions (cont.)
• Ex: Unsigned numbers in Pascal
digit → 0 | 1 | ... | 9
digits → digit+
opt-exponent → ( E ( + | - | ε ) digits ) | ε
opt-fraction → ( . digits ) | ε
unsigned-num → digits opt-fraction opt-exponent
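The regular definition for unsigned numbers can be mirrored with named Python string fragments, each defined in terms of the earlier ones. This is a sketch; the variable names follow the definition above, and is_unsigned_num is a hypothetical helper.

```python
import re

digit = r"[0-9]"
digits = rf"{digit}+"
opt_fraction = rf"(\.{digits})?"        # ( . digits ) | ε
opt_exponent = rf"(E[+-]?{digits})?"    # ( E ( + | - | ε ) digits ) | ε
unsigned_num = re.compile(rf"{digits}{opt_fraction}{opt_exponent}")

def is_unsigned_num(s):
    """True if the whole string s is an unsigned number under the definition."""
    return unsigned_num.fullmatch(s) is not None
```

Each name is expanded textually into the next pattern, which is exactly how a regular definition is meant to be read.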
Problems
1. Describe the languages denoted by the following regular expressions
a) a( a | b )*a
Solution:
L(r) = { aa, aaa, aba, aaaa, aaba, abaa, abba, ... }
This is the language of all strings of a's and b's of length at least two that start and end with a.
b) ((ε | a) b*)*
Solution:
L(r) = { ε, a, b, aa, ab, ba, bb, aab, abb, abab, ... }
Since (ε | a) b* can generate both a and b by themselves, the closure generates every combination of them. This is the language of all strings of a's and b's, including the empty string; it is the same language as (a | b)*.
Problems
Write regular definitions for the following languages
a) All strings of letters that contain the five vowels in
order
non_vowel → [b-d B-D f-h F-H j-n J-N p-t P-T v-z V-Z]
string → (non_vowel)* (a | A)+
(non_vowel)* (e | E)+
(non_vowel)* (i | I)+
(non_vowel)* (o | O)+
(non_vowel)* (u | U)+
(non_vowel)*
b) All strings of lowercase letters in which the letters are in
ascending lexicographic order
Recognition of Tokens
Consider the following grammar
stmt → if expr then stmt
     | if expr then stmt else stmt
     | ε
expr → term relop term
     | term
term → id
     | number
Patterns for Tokens
digit → [0-9]
digits → digit+
number → digits (. digits)? (E [+-]? digits)?
letter → [A-Za-z]
id → letter (letter | digit)*
if → if
then → then
else → else
relop → < | > | <= | >= | = | <>
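One way to turn these patterns into a working scanner is sketched below. The names SPEC, KEYWORDS and tokenize are assumptions of this example; keywords are recognized by matching the id pattern first and then looking the lexeme up in a keyword set, which is one of several possible designs.

```python
import re

KEYWORDS = {"if", "then", "else"}
SPEC = [
    ("number", r"[0-9]+(?:\.[0-9]+)?(?:E[+-]?[0-9]+)?"),
    ("id",     r"[A-Za-z][A-Za-z0-9]*"),
    ("relop",  r"<=|>=|<>|<|>|="),     # two-character operators listed first
    ("ws",     r"[ \t\n]+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in SPEC))

def tokenize(text):
    """Return a list of (token, lexeme) pairs for the patterns above."""
    pos, out = 0, []
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind == "ws":
            continue                   # whitespace separates tokens but is not one
        if kind == "id" and lexeme in KEYWORDS:
            kind = lexeme              # reserved word that looks like an identifier
        out.append((kind, lexeme))
    return out

toks = tokenize("if x1 <= 2.5E+3 then y <> 0 else z")
```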
Transition Diagram
• As an intermediate step in the construction of an LA, patterns are first converted into stylized flowcharts called transition diagrams (TDs).
• Here RE patterns are converted to TDs by hand, but there are also mechanical ways to construct these diagrams from REs.
• A TD has a set of nodes, called states. Each state represents a condition that could occur during the process of scanning.
• Edges are directed from one state of the TD to another. Each edge is labeled by a symbol or a set of symbols.
• Certain states are called accepting or final. These states indicate that a lexeme has been found.
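• If it is necessary to retract the forward pointer one position, a * is placed near that accepting state.
[Transition diagram for relop, start state 0]
state 0: on < go to 1; on = go to 5; on > go to 6
state 1: on = go to 2; on > go to 3; on other go to 4
state 2: return (relop, LE)
state 3: return (relop, NE)
state 4*: return (relop, LT)
state 5: return (relop, EQ)
state 6: on = go to 7; on other go to 8
state 7: return (relop, GE)
state 8*: return (relop, GT)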
Recognition of reserved words and identifiers
[Transition diagram for identifiers and keywords: start state 0 goes to state 10 on a letter; state 10 loops on letter or digit; on any other character it goes to accepting state 11*, which executes return(getToken(), installID())]
There are two ways to handle reserved words that look like identifiers:
1. Install the reserved words in the symbol table initially. A field of the symbol table entry indicates which token each one represents.
2. Create a separate transition diagram for each keyword. Such a diagram consists of states representing the situation after each successive letter of the keyword is seen, followed by a test for a “nonletter-or-digit”.
[Transition diagram for the keyword then: start, then edges labeled t, h, e, n, then an edge labeled nonlet/digit to an accepting state marked *]
Sketch of implementation of relop transition
diagram
TOKEN getRelop()
{
    TOKEN retToken = new(RELOP);
    while (1) {  /* repeat character processing until a return or failure occurs */
        switch (state) {
        case 0: c = nextChar();
                if (c == '<') state = 1;
                else if (c == '=') state = 5;
                else if (c == '>') state = 6;
                else fail();  /* lexeme is not a relop */
                break;
        case 1: …
        case 2: …
        case 8: retract();  /* state 8 is marked *: give back the lookahead character */
                retToken.attribute = GT;
                return(retToken);
        }
    }
}
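For comparison, here is a complete Python sketch of the same idea. It assumes the usual relop transition diagram (states 0 to 8, with states 4 and 8 retracting one character); the function name get_relop, the ACCEPT table and the return convention are choices made for this example and are not part of the sketch above.

```python
# Accepting states: attribute to return, and whether to retract one character.
ACCEPT = {
    2: ("LE", False), 3: ("NE", False), 4: ("LT", True),
    5: ("EQ", False), 7: ("GE", False), 8: ("GT", True),
}

def get_relop(text, pos=0):
    """Return (("relop", attr), new_pos), or None if no relop starts at pos."""
    state, forward = 0, pos
    while True:
        if state in ACCEPT:
            attr, retract = ACCEPT[state]
            return ("relop", attr), forward - (1 if retract else 0)
        c = text[forward] if forward < len(text) else None  # None acts as "other"
        forward += 1
        if state == 0:
            nxt = {"<": 1, "=": 5, ">": 6}.get(c)
            if nxt is None:
                return None            # fail(): the lexeme is not a relop
            state = nxt
        elif state == 1:
            state = {"=": 2, ">": 3}.get(c, 4)
        elif state == 6:
            state = 7 if c == "=" else 8
```

For the input "<x" the function reads two characters, reaches state 4 and retracts, so the returned position points at x.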
Finite Automata
• A recognizer for a language is a program that takes a string x and answers “yes” if x is a sentence of that language, and “no” otherwise.
• The recognizer for tokens is called a finite automaton.
• A finite automaton can be deterministic (DFA) or non-deterministic (NFA).
• This means that we may use either a deterministic or a non-deterministic automaton as a lexical analyzer.
• Both deterministic and non-deterministic finite automata recognize regular sets.
• Which one?
– deterministic: faster recognizer, but it may take more space
– non-deterministic: slower, but it may take less space
– Deterministic automata are widely used in lexical analyzers.
• First, we define regular expressions for tokens; then we convert them into a DFA to get a lexical analyzer for our tokens.
– Algorithm 1: Regular Expression → NFA → DFA (two steps: first to NFA, then to DFA)
– Algorithm 2: Regular Expression → DFA (directly convert a regular expression into a DFA)
Non-Deterministic Finite Automaton (NFA)
• A non-deterministic finite automaton (NFA)
is a mathematical model that consists of:
– S: a set of states
– Σ: a set of input symbols (the alphabet)
– move: a transition function that maps state-symbol pairs to sets of states
– s0: a start (initial) state
– F: a set of accepting (final) states
• ε-transitions are allowed in NFAs. In other words, we can move from one state to another without consuming any input symbol.
• An NFA accepts a string x if and only if there is a path from the start state to one of the accepting states such that the edge labels along this path spell out x.
NFA (Example)
[Transition graph of the NFA: state 0 loops on a and b; 0 goes to 1 on a; 1 goes to 2 on b]
0 is the start state s0
{2} is the set of final states F
Σ = {a,b}
S = {0,1,2}
Transition function:
        a       b
0     {0,1}   {0}
1       _     {2}
2       _      _
The language recognized by this NFA is (a|b)*ab
Deterministic Finite Automaton (DFA)
• It is a special form of an NFA:
• no state has an ε-transition
• for each symbol a and state s, there is at most one edge labeled a leaving s,
i.e. the transition function maps a state-symbol pair to a single state (not a set of states)
[Transition graph of a DFA with states 0, 1, 2 and edges labeled a and b]
The language recognized by this DFA is also (a|b)*ab
Implementing a DFA
• Let us assume that the end of a string is marked with a special symbol (say eos). The algorithm for recognition is as follows (an efficient implementation):
s ← s0                  { start from the initial state }
c ← nextchar            { get the next character from the input string }
while (c != eos) do     { repeat until the end of the string }
begin
    s ← move(s,c)       { transition function }
    c ← nextchar
end
if (s in F) then        { if s is an accepting state }
    return “yes”
else
    return “no”
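The recognition loop above translates almost line by line into Python. The transition table MOVE below is an assumption: it is one DFA for (a|b)*ab with states 0, 1, 2 and is not copied from the slide's diagram. The end of the Python string takes the place of eos.

```python
# Assumed DFA for (a|b)*ab: state 1 means "just saw a", state 2 means "just saw ab".
MOVE = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 0,
}
START, ACCEPTING = 0, {2}

def dfa_accepts(x, move=MOVE, start=START, accepting=ACCEPTING):
    """Return True ("yes") if the DFA accepts x, False ("no") otherwise."""
    s = start
    for c in x:
        s = move.get((s, c))
        if s is None:          # no transition on c: reject
            return False
    return s in accepting
```

Each input character costs one table lookup, which is why the DFA is the faster recognizer.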
Implementing a NFA
S ← ε-closure({s0})          { set of all states accessible from s0 by ε-transitions }
c ← nextchar
while (c != eos) do
begin
    S ← ε-closure(move(S,c)) { set of all states accessible from a state in S by a transition on c, followed by ε-transitions }
    c ← nextchar
end
if (S ∩ F != ∅) then         { if S contains an accepting state }
    return “yes”
else
    return “no”
• This algorithm is less efficient than the DFA algorithm, because it tracks a set of states rather than a single state.
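A Python sketch of the NFA algorithm, assuming the NFA is given as two dictionaries: move maps (state, symbol) to a set of states and eps maps a state to its ε-successors. The first NFA is the (a|b)*ab example shown earlier; the second, for the language {b, ab}, is a hypothetical NFA added only to exercise ε-transitions.

```python
def eps_closure(states, eps):
    """All states reachable from the given states on ε-transitions alone."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in eps.get(s, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

def nfa_accepts(x, move, eps, start, accepting):
    """Simulate the NFA on x by tracking the set of states it may be in."""
    S = eps_closure({start}, eps)
    for c in x:
        reached = set()
        for s in S:
            reached |= move.get((s, c), set())
        S = eps_closure(reached, eps)
    return bool(S & accepting)

# NFA for (a|b)*ab from the earlier example (it has no ε-transitions).
MOVE1 = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}
# Hypothetical NFA for {b, ab}: 0 goes to 1 on ε or on a, and 1 goes to 2 on b.
MOVE2, EPS2 = {(0, "a"): {1}, (1, "b"): {2}}, {0: {1}}
```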
Converting a Regular Expression into an NFA (Thompson's Construction)
• This is one way to convert a regular expression into an NFA.
• There are other, more efficient ways to do the conversion.
• Thompson's construction is a simple and systematic method. It guarantees that the resulting NFA has exactly one start state and one final state.
• Construction starts from the simplest parts (alphabet symbols). To create an NFA for a complex regular expression, the NFAs of its sub-expressions are combined.
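Thompson's Construction (cont.)
• To recognize the empty string ε: [NFA: start state i goes to final state f on ε]
• To recognize a symbol a in the alphabet Σ: [NFA: start state i goes to final state f on a]
• If N(r1) and N(r2) are NFAs for regular expressions r1 and r2:
• For regular expression r1 | r2: [NFA for r1 | r2: a new start state i has ε-transitions to the start states of N(r1) and N(r2); the final states of N(r1) and N(r2) have ε-transitions to a new final state f]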
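Thompson's Construction (cont.)
• For regular expression r1 r2: [NFA for r1 r2: N(r1) followed by N(r2); i is the start state of N(r1) and f is the final state of N(r2)]
The final state of N(r2) becomes the final state of N(r1r2).
• For regular expression r*: [NFA for r*: a new start state i and a new final state f; ε-transitions from i to the start state of N(r), from the final state of N(r) to f, from the final state of N(r) back to its start state, and from i directly to f]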
Thompson's Construction (Example: (a|b)*a)
[Figure: the NFA is built step by step: first the NFAs for a and for b, then the NFA for (a | b), then the NFA for (a|b)*, and finally the NFA for (a|b)*a]
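The construction can be sketched in Python with one function per rule. The representation is an assumption of this example: an NFA fragment is a triple (start, final, edges), an edge is (source, label, target), and the label None stands for ε. State numbers come from a counter, so they will not match the numbering in the figure.

```python
from itertools import count

_ids = count()   # hypothetical source of fresh state numbers

def symbol(a):
    i, f = next(_ids), next(_ids)
    return i, f, [(i, a, f)]

def union(n1, n2):                      # r1 | r2
    (i1, f1, e1), (i2, f2, e2) = n1, n2
    i, f = next(_ids), next(_ids)
    return i, f, e1 + e2 + [(i, None, i1), (i, None, i2), (f1, None, f), (f2, None, f)]

def concat(n1, n2):                     # r1 r2
    (i1, f1, e1), (i2, f2, e2) = n1, n2
    ren = lambda s: f1 if s == i2 else s   # merge start of N(r2) into final of N(r1)
    return i1, ren(f2), e1 + [(ren(s), a, ren(t)) for s, a, t in e2]

def star(n):                            # r*
    i1, f1, e1 = n
    i, f = next(_ids), next(_ids)
    return i, f, e1 + [(i, None, i1), (f1, None, f), (f1, None, i1), (i, None, f)]

def accepts(nfa, x):
    """Simulate the constructed NFA on x (used here only to check the result)."""
    start, final, edges = nfa
    def closure(states):
        stack, seen = list(states), set(states)
        while stack:
            s = stack.pop()
            for src, a, dst in edges:
                if src == s and a is None and dst not in seen:
                    seen.add(dst)
                    stack.append(dst)
        return seen
    current = closure({start})
    for c in x:
        current = closure({dst for src, a, dst in edges if src in current and a == c})
    return final in current

nfa = concat(star(union(symbol("a"), symbol("b"))), symbol("a"))   # (a|b)*a
states = {s for s, _, t in nfa[2]} | {t for s, _, t in nfa[2]}
```

The resulting NFA for (a|b)*a has nine states, the same number as the states 0 to 8 used in the worked example on subset construction.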
Converting an NFA into a DFA (subset construction)
• The general idea behind the subset construction is that each state of the DFA corresponds to a set of NFA states.
• After reading input a1a2…an, the DFA is in the state that corresponds to the set of states the NFA can reach from its start state on that input.
• It is possible for the number of DFA states to be exponential in the number of NFA states.
• In practice, the NFA and DFA usually have approximately the same number of states.
Converting an NFA into a DFA (subset construction)
• The subset construction algorithm performs three operations on NFA states:
1. ε-closure(s): the set of NFA states reachable from NFA state s on ε-transitions alone
2. ε-closure(T): the set of NFA states reachable from some NFA state s in set T on ε-transitions alone
3. move(T,a): the set of NFA states to which there is a transition on input symbol a from some state s in T
Converting an NFA into a DFA (subset construction)
T is a set of NFA states
put T = ε-closure({s0}) as an unmarked state into the set of DFA states (DS)
while (there is an unmarked state T in DS) do
begin
    mark T
    for each input symbol a do
    begin
        U ← ε-closure(move(T,a))
        if (U is not in DS) then
            add U into DS as an unmarked state
        Dtran[T,a] ← U
    end
end
– ε-closure({s0}) is the set of all states accessible from s0 by ε-transitions.
– move(T,a) is the set of states to which there is a transition on a from a state s in T.
– A state S in DS is an accepting state of the DFA if some state in S is an accepting state of the NFA.
– The start state of the DFA is ε-closure({s0}).
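A Python sketch of the algorithm. The NFA table is an assumption: it encodes an NFA for (a|b)*a with states 0 to 8, numbered so that its ε-closures agree with the sets used in the worked example (S0 = {0,1,2,4,7} and so on). The key None stands for ε.

```python
EPS = None
NFA = {
    (0, EPS): {1, 7}, (1, EPS): {2, 4}, (2, "a"): {3}, (4, "b"): {5},
    (3, EPS): {6}, (5, EPS): {6}, (6, EPS): {1, 7}, (7, "a"): {8},
}

def eps_closure(T):
    """NFA states reachable from some state in T on ε-transitions alone."""
    stack, closure = list(T), set(T)
    while stack:
        s = stack.pop()
        for t in NFA.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def move(T, a):
    """NFA states reachable on symbol a from some state in T."""
    return {t for s in T for t in NFA.get((s, a), ())}

def subset_construction(start, alphabet):
    s0 = eps_closure({start})
    dstates, unmarked, dtran = {s0}, [s0], {}
    while unmarked:
        T = unmarked.pop()                  # taking T off the list marks it
        for a in alphabet:
            U = eps_closure(move(T, a))
            if U not in dstates:
                dstates.add(U)
                unmarked.append(U)
            dtran[T, a] = U
    return s0, dstates, dtran

S0, DS, DTRAN = subset_construction(0, "ab")
```

Frozen sets are used for DFA states so that they can serve as dictionary keys in Dtran.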
Converting a NFA into a DFA (Example)
• S0 = ε-closure({0}) = {0,1,2,4,7}; put S0 into DS as an unmarked state
• mark S0
• ε-closure(move(S0,a)) = ε-closure({3,8}) = {1,2,3,4,6,7,8} = S1; put S1 into DS
• ε-closure(move(S0,b)) = ε-closure({5}) = {1,2,4,5,6,7} = S2; put S2 into DS
• Dtran[S0,a] ← S1, Dtran[S0,b] ← S2
• mark S1
• ε-closure(move(S1,a)) = ε-closure({3,8}) = {1,2,3,4,6,7,8} = S1
• ε-closure(move(S1,b)) = ε-closure({5}) = {1,2,4,5,6,7} = S2
[Figure: the NFA for (a|b)*a with states 0 to 8]
Converting an NFA into a DFA (Example, contd.)
• Dtran[S1,a] ← S1, Dtran[S1,b] ← S2
• mark S2
• ε-closure(move(S2,a)) = ε-closure({3,8}) = {1,2,3,4,6,7,8} = S1
• ε-closure(move(S2,b)) = ε-closure({5}) = {1,2,4,5,6,7} = S2
• Dtran[S2,a] ← S1, Dtran[S2,b] ← S2
Converting an NFA into a DFA (Example, cont.)
S0 is the start state of the DFA, since 0 is a member of S0 = {0,1,2,4,7}
S1 is an accepting state of the DFA, since 8 is a member of S1 = {1,2,3,4,6,7,8}
[Transition graph of DFA D with states S0, S1, S2, following the table below]
Transition Table for DFA D
NFA States         DFA State   a    b
{0,1,2,4,7}        S0          S1   S2
{1,2,3,4,6,7,8}    S1          S1   S2
{1,2,4,5,6,7}      S2          S1   S2
A Language for Specifying Lexical Analyzers
• Several tools have been built for constructing LAs from REs.
• Lex has been widely used to specify LAs for a variety of languages.
• The tool is referred to as the Lex compiler, and its input specification as the Lex language.
• First, a specification of an LA is prepared by creating a program lex.l in the Lex language.
• Then lex.l is run through the Lex compiler to produce a C program lex.yy.c.
• lex.yy.c consists of a tabular representation of a transition diagram constructed from the REs of lex.l.
• The actions associated with the REs in lex.l are pieces of C code and are carried over directly to lex.yy.c.
• Finally, lex.yy.c is run through the C compiler to produce an object program a.out.
Creating a Lexical Analyzer with Lex
[Figure: Lex source program lex.l → Lex compiler → lex.yy.c; lex.yy.c → C compiler → a.out; input stream → a.out → sequence of tokens]
Lex Specifications
A Lex program consists of three parts:
declarations            { includes declarations of variables, constants and regular definitions }
%%
translation rules
%%
auxiliary procedures    { holds procedures needed by the actions }
The translation rules of a Lex program are of the form
p1 {action1}
Design of Lexical Analyzer Generator
• We consider the design of a software tool that automatically constructs a lexical analyzer from a program in the Lex language.
• The specification of an LA has the form
p1 {action 1}
p2 {action 2}
..
pn {action n}
where each pi is a regular expression and each action i is a program fragment.
• Our problem is to construct a recognizer that looks for lexemes in the input buffer.
• If more than one pattern matches, the recognizer chooses the longest lexeme matched.
• If two or more patterns match the longest lexeme, the first-listed matching pattern is chosen.
Design of Lexical Analyzer Generator (cont.)
[Figure: model of the Lex compiler: Lex specification → Lex compiler → transition table]
[Figure: schematic lexical analyzer: the input buffer, with the current lexeme marked, is read by an FA simulator that is driven by the transition table]
Pattern Matching Based on NFAs
• Construct the transition table of an NFA N for the composite pattern p1 | p2 | p3 | … | pn
• Create an NFA N(pi) for each pattern pi and link it to a new start state S0 with an ε-transition
[Figure: start state S0 with ε-transitions to N(p1), N(p2), …, N(pn)]
Example
We have the following Lex program consisting of three regular expressions:
a     { }  /* actions are omitted here */
abb   { }
a*b+  { }
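[NFAs for a, abb, and a*b+]
a: start state 1 goes to accepting state 2 on a
abb: start state 3 goes to 4 on a, 4 goes to 5 on b, 5 goes to accepting state 6 on b
a*b+: start state 7 loops on a and goes to accepting state 8 on b; state 8 loops on b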
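Example (cont.)
[Combined NFA recognizing the three patterns: a new start state 0 has ε-transitions to states 1, 3 and 7]
Sequence of sets of states entered in processing input aaba:
{0,1,3,7} --a--> {2,4,7} --a--> {7} --b--> {8} --a--> none
State 2 in the second set announces pattern a; state 8 in the last set announces pattern a*b+, so the lexeme selected is aab.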
Pattern Matching Based on DFA
• Another approach to constructing a lexical analyzer from a Lex specification is to use a DFA.
• When we convert an NFA to a DFA using the subset construction algorithm, a given subset of NFA states may contain several accepting states.
• In that situation, the accepting state corresponding to the pattern listed first has priority.
• The following is the transition table of the converted DFA.
Pattern Matching Based on DFA
STATE    a      b      PATTERN ANNOUNCED
0137     247    8      none
247      7      58     a
8        -      8      a*b+
7        7      8      none
58       -      68     a*b+
68       -      8      abb
Test for the strings aaba and aba
Example
The input string is aba
• The DFA starts off in state 0137
• On input a it goes to state 247
• Then on input b it goes to state 58
• On the next input a it has no next state
• We have thus reached termination
• The last of these states, 58, includes the accepting NFA state 8
• In state 58 the DFA announces that the pattern a*b+ has been recognized, and selects ab, the prefix of the input that led to state 58, as the lexeme.
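The table-driven matching just traced can be sketched in Python. DTRAN and ANNOUNCE are transcribed from the transition table for the patterns a, abb and a*b+; the function name next_lexeme and the tuple it returns are choices made for this illustration.

```python
DTRAN = {
    ("0137", "a"): "247", ("0137", "b"): "8",
    ("247", "a"): "7",    ("247", "b"): "58",
    ("8", "b"): "8",
    ("7", "a"): "7",      ("7", "b"): "8",
    ("58", "b"): "68",
    ("68", "b"): "8",
}
ANNOUNCE = {"247": "a", "8": "a*b+", "58": "a*b+", "68": "abb"}

def next_lexeme(text, pos=0):
    """Run the DFA until it is stuck, then report the last accepting state seen."""
    state, last, i = "0137", None, pos
    while i < len(text) and (state, text[i]) in DTRAN:
        state = DTRAN[state, text[i]]
        i += 1
        if state in ANNOUNCE:
            last = (ANNOUNCE[state], text[pos:i])   # remember the longest match so far
    return last
```

On the input abb the DFA ends in state 68, which announces abb rather than a*b+ because abb is listed first.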
Minimizing the Number of States of a DFA
• Partition the set of states into two groups:
– G1: the set of accepting states
– G2: the set of non-accepting states
• For each new group G:
– partition G into subgroups such that states s1 and s2 are in the same subgroup if and only if, for all input symbols a, s1 and s2 have transitions on a to states in the same group.
• The start state of the minimized DFA is the group containing the start state of the original DFA.
• The accepting states of the minimized DFA are the groups containing the accepting states of the original DFA.
Algorithm: Minimizing the number of states of a DFA
Input: A DFA D with set of states S, input alphabet Σ, start state s0, and set of accepting states F
Output: A DFA D' accepting the same language as D and having as few states as possible
Method:
1. Start with an initial partition π with two groups, F and S – F, the accepting and nonaccepting states of D
2. Apply the following procedure to construct a new partition π_new:
initially, let π_new = π;
for (each group G of π) {
    partition G into subgroups such that two states s and t are in the same subgroup if and only if, for all input symbols a, s and t have transitions on a to states in the same group of π;
    replace G in π_new by the set of all subgroups formed;
}
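Minimizing DFA - Example
[Figure: a DFA with states 1, 2 and 3 and edges labeled a and b; state 2 is accepting]
G1 = {2}
G2 = {1,3}
G2 cannot be partitioned because
move(1,a)=2 move(1,b)=3
move(3,a)=2 move(3,b)=3
So the minimized DFA (with the minimum number of states) has the two states {1,3} and {2}.
[Figure: minimized DFA with states {1,3} and {2}]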
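Minimizing DFA – Another Example
[Figure: a DFA with states 1, 2, 3 and 4 and edges labeled a and b]
Groups: {1,2,3} {4}
        a       b
1      1->2    1->3
2      2->2    2->3
3      3->4    3->3
On a, state 3 moves to group {4} while states 1 and 2 stay in {1,2,3}, so {1,2,3} splits into {1,2} and {3}.
No more partitioning is possible.
So the minimized DFA has the states {1,2}, {3} and {4}.
[Figure: minimized DFA with states {1,2}, {3} and {4}]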
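The partition-refinement idea can be sketched in Python as follows. The function name minimize is hypothetical, the DFA is assumed to be complete (every state has a transition on every symbol), and the test DFA is the three-state DFA D with states S0, S1, S2 obtained earlier by subset construction for (a|b)*a.

```python
def minimize(states, alphabet, move, accepting):
    """Return the groups of equivalent states of a complete DFA."""
    partition = [g for g in (set(accepting), set(states) - set(accepting)) if g]
    while True:
        def group_of(s):
            # index of the group of the current partition that contains s
            return next(i for i, g in enumerate(partition) if s in g)
        new_partition = []
        for g in partition:
            buckets = {}
            for s in g:
                # states stay together only if every symbol leads to the same group
                key = tuple(group_of(move[s, a]) for a in alphabet)
                buckets.setdefault(key, set()).add(s)
            new_partition.extend(buckets.values())
        if len(new_partition) == len(partition):   # nothing was split: done
            return [frozenset(g) for g in new_partition]
        partition = new_partition

MOVE = {
    ("S0", "a"): "S1", ("S0", "b"): "S2",
    ("S1", "a"): "S1", ("S1", "b"): "S2",
    ("S2", "a"): "S1", ("S2", "b"): "S2",
}
groups = minimize({"S0", "S1", "S2"}, "ab", MOVE, {"S1"})
```

Here S0 and S2 behave identically on both symbols, so they end up in one group and the minimized DFA has two states.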

Lexical Analyser PPTs for Third Lease Computer Sc. and Engineering

  • 1. Lexical Analyzer  Lexical Analyzer reads the source program character by character to produce tokens.  Normally a lexical analyzer doesn’t return a list of tokens at one shot, it returns a token when the parser asks a token from it. Compiler Construction 1 Lexical Analyzer Parser source program token get next token Chapter 2 Symbol Table
  • 2.  Since the LA is part of compiler that reads the source text, it may perform certain other tasks such as, 1. Stripping out comments and whitespaces(blank, newline, tab) 2. Correlating error messages generated by the compiler with the source program e.g. LA may keep track of the number of newline characters seen, so it can associate a line number with each error message. 3. In some compilers, the LA makes a copy of the source program with the error messages inserted at the appropriate positions. 4. If the source program uses a macro-preprocessor, the expantion of macros may also be performed by the LA 5. Sometimes, LA are divided into cascade of two processes A) Scanning B) Lexical anlysis Compiler Construction 2
  • 3. Compiler Construction 3 Issues in Lexical Analysis Reasons for separating the analysis phase of compiling into lexical analysis and parsing  Simpler Design  Compiler Efficiency Is Improved  Compiler Portability Is Enhanced
  • 4. Tokens, Patterns and Lexemes  Token represents a set of strings described by a pattern. – Identifier represents a set of strings which start with a letter continues with letters and digits – The actual string (newval) is called as lexeme. – Tokens: identifier, number, addop, delimeter, …  Since a token can represent more than one lexeme, additional information should be held for that specific lexeme. This additional information is called as the attribute of the token.  For simplicity, a token may have a single attribute which holds the required information for that token. Compiler Construction 4
  • 5. Compiler Construction 5 – For identifiers, this attribute is a pointer to the symbol table, and the symbol table holds the actual attributes for that token.  Some attributes: – <id,attr> where attr is pointer to the symbol table – <assgop,_> no attribute is needed (if there is only one assignment operator) – <num,val> where val is the actual value of the number.  Token type and its attribute uniquely identifies a lexeme.  Regular expressions are widely used to specify patterns.
  • 6. Compiler Construction 6 Example E = M * C ** 2 Token names and associated attributes are <id, pointer to symbol table entry for E> <assign_op> <id, pointer to symbol table entry for M> <mult_op> <id, pointer to symbol table entry for C> <exp_op> <number,integer value 2>
  • 7. Attributes of Tokens y := 31 + 28*x <id, “y”> <assign, > <num, 31> <+, > <num, 28> <*, > <id, “x”> Token Attribute Compiler Construction 7 Lexical Analyzer Parser
  • 8. Compiler Construction 8 Tokens, Patterns, and Lexemes A token is a classification of lexical units – For example: id and num Lexemes are the specific character strings that make up a token – For example: abc and 123 Patterns are rules describing the set of lexemes belonging to a token – For example: “letter followed by letters and digits” and “non-empty sequence of digits”
  • 9. Compiler Construction 9 Lexical Errors fi ( a == f(x) )… It may be misspelling of the keyword if It may be undeclared function identifier Lexical analyzer must return the token id to the parser and some other phase handle the error Panic Mode Recovery – Delete successive characters from the remaining input, until the LA can find well formed token
  • 10. Other possible recovery actions 1. Delete one character from the remaining input 2. Insert a missing character into the remaining input 3. Replace a character by another character 4. Transpose two adjacent characters
  • 11. Input Buffering: Three approaches to the implementation of a lexical analyzer: 1. Use a lexical analyzer generator, e.g. the Lex compiler 2. Write the lexical analyzer in a conventional systems programming language 3. Write the lexical analyzer in assembly language, using I/O facilities to read the input
  • 12. Buffer Pairs: [Figure: two buffer halves holding the input E = M * C ** 2 followed by eof] Each buffer is of the same size N, where N is usually the size of a disk block. With one system read command, N characters are read into a buffer. eof marks the end of the source file. Two pointers, lexemeBegin and forward, are maintained.
  • 13. Sentinels: Each time we advance the forward pointer we must make two tests: 1. one for the end of the buffer 2. one to determine what character was read. These two tests can be combined by using a sentinel. A sentinel is a special character that cannot be part of the source program, e.g. eof.
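The buffer-pair scheme with sentinels can be sketched roughly as follows. This is a minimal illustration under stated assumptions: `"\0"` plays the role of eof and is assumed never to appear in the source text, `N` is kept tiny so that reloads happen often, and the `lexemeBegin` pointer is omitted. The class and function names are hypothetical.

```python
import io

EOF = "\0"  # sentinel character; assumed never to occur in the source text
N = 8       # size of each buffer half (tiny here so the demo reloads often)

class BufferPair:
    """Two N-character halves, each followed by one sentinel slot."""

    def __init__(self, stream):
        self.stream = stream
        self.buf = [EOF] * (2 * N + 2)
        self.forward = 0
        self._load(0)

    def _load(self, start):
        # One read fills a half; the sentinel goes right after the last
        # character actually read (slot N or 2N+1 when the half is full).
        chunk = self.stream.read(N)
        self.buf[start:start + len(chunk)] = chunk
        self.buf[start + len(chunk)] = EOF

    def next_char(self):
        ch = self.buf[self.forward]
        if ch != EOF:                    # common case: a single test
            self.forward += 1
            return ch
        if self.forward == N:            # sentinel closing the first half
            self._load(N + 1)
            self.forward = N + 1
            return self.next_char()
        if self.forward == 2 * N + 1:    # sentinel closing the second half
            self._load(0)
            self.forward = 0
            return self.next_char()
        return None                      # sentinel inside a half: end of input

def read_all(text):
    """Drain a BufferPair over text and return everything it produced."""
    pair = BufferPair(io.StringIO(text))
    chars = []
    while (ch := pair.next_char()) is not None:
        chars.append(ch)
    return "".join(chars)
```

In the common case `next_char` performs only the comparison against the sentinel; the end-of-buffer checks run only when a sentinel is actually seen.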
  • 14. Can we run out of buffer space? In modern languages lexemes are short, and one or two characters of lookahead are sufficient. A buffer size N in the thousands is ample, and the double-buffer scheme works without problems. To avoid problems with long character strings, we can treat them as a concatenation of components. In Java, the + operator is used to write a long string across several lines. PL/I does not treat keywords as reserved words, which causes a problem with input such as DECLARE( Arg1, Arg2…… because the lexical analyzer cannot tell whether DECLARE is a keyword or an array name until it has seen what follows the closing parenthesis.
  • 15. Specification of Tokens: Regular expressions are an important notation for specifying lexeme patterns.
  • 16. Terminology of Languages: Alphabet: a finite set of symbols (e.g. ASCII characters). String: – a finite sequence of symbols over an alphabet – the terms sentence and word are also used for string – ε is the empty string – |s| is the length of string s. Language: a set of strings over some fixed alphabet – ∅, the empty set, is a language – {ε}, the set containing only the empty string, is a language – the set of well-formed C programs is a language – the set of all possible identifiers is a language.
  • 17. Operations on Languages: Concatenation: $L_1 L_2 = \{ s_1 s_2 \mid s_1 \in L_1 \text{ and } s_2 \in L_2 \}$. Union: $L_1 \cup L_2 = \{ s \mid s \in L_1 \text{ or } s \in L_2 \}$. Exponentiation: $L^0 = \{\varepsilon\}$, $L^1 = L$, $L^2 = LL$. Kleene closure: $L^* = \bigcup_{i=0}^{\infty} L^i$. Positive closure: $L^+ = \bigcup_{i=1}^{\infty} L^i$.
  • 18. Example • L1 = {a,b,c,d} L2 = {1,2} • L1L2 = {a1,a2,b1,b2,c1,c2,d1,d2} • L1 ∪ L2 = {a,b,c,d,1,2} • L1^3 = all strings of length three over {a,b,c,d} • L1* = all strings over the letters a,b,c,d, including the empty string • L1+ = the same strings, but without the empty string
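The operations in this example can be checked with ordinary Python sets of strings. This is a sketch only: the function names are made up for illustration, and because L* and L+ are infinite, the closure helper truncates the union at a chosen maximum power.

```python
from itertools import product

def concat(l1, l2):
    """L1L2: every string of l1 followed by every string of l2."""
    return {s1 + s2 for s1, s2 in product(l1, l2)}

def power(lang, n):
    """L^n, with L^0 = {""} (the set containing only the empty string)."""
    result = {""}
    for _ in range(n):
        result = concat(result, lang)
    return result

def closure_up_to(lang, max_power, positive=False):
    """Finite approximation of L* (or L+): union of L^i up to max_power."""
    first = 1 if positive else 0
    out = set()
    for i in range(first, max_power + 1):
        out |= power(lang, i)
    return out

L1 = {"a", "b", "c", "d"}
L2 = {"1", "2"}
```

With these definitions `concat(L1, L2)` gives the eight strings a1 … d2, `power(L1, 3)` has 4³ = 64 elements, and the empty string appears in `closure_up_to(L1, 2)` but not in `closure_up_to(L1, 2, positive=True)`.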
  • 19. Regular Expressions: We use regular expressions to describe the tokens of a programming language. A regular expression is built up from simpler regular expressions (using defining rules). Each regular expression denotes a language. A language denoted by a regular expression is called a regular set.
  • 20. Regular Expressions (Rules): Regular expressions over alphabet Σ
      Reg. Expr        Language it denotes
      ε                {ε}
      a ∈ Σ            {a}
      (r1) | (r2)      L(r1) ∪ L(r2)
      (r1) (r2)        L(r1) L(r2)
      (r)*             (L(r))*
      (r)              L(r)
  • 21. Regular Expressions (cont.) • We may remove parentheses by using precedence rules: – * highest – concatenation next – | lowest • ab*|c means (a(b)*)|(c) • Ex: – Σ = {0,1} – 0|1 => {0,1} – (0|1)(0|1) => {00,01,10,11} – 0* => {ε,0,00,000,0000,....} – (0|1)* => all strings of 0s and 1s, including the empty string
  • 22. Regular Definitions: Writing a regular expression for some languages can be difficult, because the expression can become quite complex. In those cases, we may use regular definitions. We can give names to regular expressions, and we can use these names as symbols to define other regular expressions. A regular definition is a sequence of definitions of the form:
      d1 → r1
      d2 → r2
      ...
      dn → rn
      where each di is a distinct name and each ri is a regular expression over Σ ∪ {d1,d2,...,di-1}, i.e. over the basic symbols and the previously defined names.
  • 23. Regular Definitions (cont.) Ex: Identifiers in Pascal
      letter → A | B | ... | Z | a | b | ... | z
      digit → 0 | 1 | ... | 9
      id → letter (letter | digit)*
      – If we try to write the regular expression for identifiers without using regular definitions, that regular expression becomes complex: (A|...|Z|a|...|z) ( (A|...|Z|a|...|z) | (0|...|9) )*
  • 24. Regular Definitions (cont.) Ex: Unsigned numbers in Pascal
      digit → 0 | 1 | ... | 9
      digits → digit+
      opt-exponent → ( E ( + | - | ε ) digits ) | ε
      opt-fraction → ( . digits ) | ε
      unsigned-num → digits opt-fraction opt-exponent
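Regular definitions map naturally onto named pattern fragments that are composed into larger patterns. The sketch below does this with Python's `re` module; the variable names mirror the definitions above, and using `re` syntax (character classes, `?`, non-capturing groups) is an assumption of this illustration rather than the slides' notation.

```python
import re

# Each name is defined only in terms of basic symbols and earlier names.
digit = r"[0-9]"
digits = rf"{digit}+"
opt_fraction = rf"(?:\.{digits})?"       # ( . digits ) | ε
opt_exponent = rf"(?:E[+-]?{digits})?"   # ( E ( + | - | ε ) digits ) | ε
unsigned_num = re.compile(rf"{digits}{opt_fraction}{opt_exponent}")

letter = r"[A-Za-z]"
ident = re.compile(rf"{letter}(?:{letter}|{digit})*")   # letter (letter | digit)*
```

For instance, `unsigned_num.fullmatch("6.02E+23")` succeeds, while `unsigned_num.fullmatch("6.")` returns `None` because a fraction must contain at least one digit.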
  • 25. Problems 1. Describe the languages denoted by the following regular expressions. a) a( a | b )*a Solution: L(r) = { aa, aaa, aba, aaaa, aaba, abaa, abba, ... } This is the language of all strings of a’s and b’s, of length at least two, that start and end with a.
  • 26. b) ((ε | a) b*)* Solution: L(r) = { ε, a, b, aa, ab, bb, abb, abab, ... } Each iteration of the outer star can contribute a single a or a single b, so this is the language of all strings of a’s and b’s, including the empty string; it is the same language as (a | b)*. Note that consecutive a’s are allowed (aa is in the language).
  • 27. Problems: Write regular definitions for the following languages. a) All strings of letters that contain the five vowels in order
      non_vowel → [b-d B-D f-h F-H j-n J-N p-t P-T v-z V-Z]
      string → (non_vowel)* (a|A)+ (non_vowel)* (e|E)+ (non_vowel)* (i|I)+ (non_vowel)* (o|O)+ (non_vowel)* (u|U)+ (non_vowel)*
      b) All strings of lowercase letters in which the letters are in ascending lexicographic order
  • 28. Recognition of Tokens: Consider the following grammar
      stmt → if expr then stmt | if expr then stmt else stmt | ε
      expr → term relop term | term
      term → id | number
  • 29. Patterns for Tokens
      digit → [0-9]
      digits → digit+
      number → digits (. digits)? (E [+-]? digits)?
      letter → [A-Za-z]
      id → letter (letter | digit)*
      if → if
      then → then
      else → else
      relop → < | > | <= | >= | = | <>
  • 30. Transition Diagram: As an intermediate step in the construction of an LA, patterns are first converted into stylized flowcharts called transition diagrams (TDs). Here RE patterns are converted to TDs by hand, but there is also a mechanical way to construct these diagrams from REs. A TD has a set of nodes, called states. Each state represents a condition that could occur during the process of scanning. Edges are directed from one state of the TD to another. Each edge is labeled by a symbol or a set of symbols. Certain states are called accepting or final. These states indicate that a lexeme has been found.
  • 31. If it is necessary to retract the forward pointer one position, a * is placed near that accepting state. [Figure: transition diagram for relop]
      start state 0: on < go to 1; on = go to 5; on > go to 6
      state 1: on = go to 2, return (relop LE); on > go to 3, return (relop NE); on other go to 4*, return (relop LT)
      state 5: return (relop EQ)
      state 6: on = go to 7, return (relop GE); on other go to 8*, return (relop GT)
  • 32. Recognition of reserved words and identifiers: [Figure: transition diagram for identifiers: from the start state, a letter leads to state 10, which loops on letter or digit; any other character leads to accepting state 11, marked *, with action return(getToken(), installID())] There are two ways that we can handle reserved words that look like ids: 1. Install the reserved words in the symbol table initially. A field of the symbol-table entry indicates which token they represent. 2. Create a separate transition diagram for each keyword. Such a transition diagram consists of states representing the situation after each successive letter of the keyword is seen, followed by a test for a “nonletter-or-digit”. [Figure: transition diagram for the keyword then: start, t, h, e, n, nonlet/digit, accepting state marked *]
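The first approach (pre-installing reserved words in the symbol table) might look like the sketch below. It is illustrative only: the symbol table is a plain dictionary from lexeme to token name, the dictionary key stands in for the "pointer" to the entry, and `install_id`, `get_token` and `scan_word` are hypothetical Python counterparts of the actions named in the diagram.

```python
KEYWORDS = ("if", "then", "else")

# Reserved words are installed up front; the stored value records
# which token each entry represents.
symbol_table = {kw: kw for kw in KEYWORDS}

def install_id(lexeme):
    """Add the lexeme as an identifier unless it is already present."""
    symbol_table.setdefault(lexeme, "id")
    return lexeme            # the key plays the role of the entry pointer

def get_token(lexeme):
    """Return the token recorded for the lexeme: a keyword name or 'id'."""
    return symbol_table[lexeme]

def scan_word(source, pos):
    """Follow the identifier diagram: letter (letter | digit)*.

    Returns (token, attribute, next_pos), or None if no letter starts at pos.
    """
    if pos >= len(source) or not source[pos].isalpha():
        return None
    end = pos + 1
    while end < len(source) and source[end].isalnum():
        end += 1             # stay in the looping state
    lexeme = source[pos:end] # 'other' seen: retract, lexeme ends before it
    entry = install_id(lexeme)
    return get_token(lexeme), entry, end
```

Because keywords are already in the table, `scan_word("then x1", 0)` reports the token `then`, while `scan_word("thenx", 0)` reports an ordinary `id`.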
  • 33. Sketch of implementation of relop transition diagram
      TOKEN getRelop()
      {
          TOKEN retToken = new(RELOP);
          while (1) {   /* repeat character processing until a return or failure occurs */
              switch (state) {
              case 0: c = nextChar();
                      if (c == '<') state = 1;
                      else if (c == '=') state = 5;
                      else if (c == '>') state = 6;
                      else fail();   /* lexeme is not a relop */
                      break;
              case 1: …
              case 2: …
              case 8: retract();     /* the extra character belongs to the next lexeme */
                      retToken.attribute = GT;
                      return(retToken);
              }
          }
      }
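A runnable counterpart of this sketch is given below. It is an assumption-laden illustration: the state numbers follow the relop transition diagram (states 2, 3, 4, 5 and 7 are folded into immediate returns), the empty string stands in for eof, and retraction is modeled by returning a next position that does not include the lookahead character.

```python
def get_relop(source, pos=0):
    """Recognize a relational operator starting at pos.

    Returns ((token, attribute), next_pos), or None if no relop starts there.
    """
    state = 0
    forward = pos
    while True:
        # "" plays the role of eof when we run past the end of the input
        c = source[forward] if forward < len(source) else ""
        if state == 0:
            if c == "<":
                state = 1
            elif c == "=":
                return ("relop", "EQ"), forward + 1   # state 5
            elif c == ">":
                state = 6
            else:
                return None                           # fail(): not a relop
            forward += 1
        elif state == 1:
            if c == "=":
                return ("relop", "LE"), forward + 1   # state 2
            if c == ">":
                return ("relop", "NE"), forward + 1   # state 3
            return ("relop", "LT"), forward           # state 4*: retract
        elif state == 6:
            if c == "=":
                return ("relop", "GE"), forward + 1   # state 7
            return ("relop", "GT"), forward           # state 8*: retract
```

For the starred states the returned position excludes the lookahead character, so the caller resumes scanning at that character.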
  • 34. Finite Automata: A recognizer for a language is a program that takes a string x and answers “yes” if x is a sentence of that language, and “no” otherwise. We call the recognizer of the tokens a finite automaton. A finite automaton can be deterministic (DFA) or non-deterministic (NFA). This means that we may use either a deterministic or a non-deterministic automaton as a lexical analyzer.
  • 35. Both deterministic and non-deterministic finite automata recognize regular sets. Which one? – deterministic – faster recognizer, but it may take more space – non-deterministic – slower, but it may take less space – Deterministic automata are widely used in lexical analyzers. First, we define regular expressions for tokens; then we convert them into a DFA to get a lexical analyzer for our tokens. – Algorithm 1: Regular Expression → NFA → DFA (two steps: first to NFA, then to DFA) – Algorithm 2: Regular Expression → DFA (directly convert a regular expression into a DFA)
  • 36. Non-Deterministic Finite Automaton (NFA) • A non-deterministic finite automaton (NFA) is a mathematical model that consists of: – S – a set of states – Σ – a set of input symbols (alphabet) – move – a transition function that maps state-symbol pairs to sets of states – s0 – a start (initial) state – F – a set of accepting states (final states)
  • 37. ε-transitions are allowed in NFAs. In other words, we can move from one state to another without consuming any symbol. An NFA accepts a string x if and only if there is a path from the start state to one of the accepting states such that the edge labels along this path spell out x.
  • 38. NFA (Example): [Figure: transition graph of the NFA] 0 is the start state s0; {2} is the set of final states F; Σ = {a,b}; S = {0,1,2}.
      Transition function:
      state    a        b
      0        {0,1}    {0}
      1        –        {2}
      2        –        –
      The language recognized by this NFA is (a|b)*ab.
  • 39. Deterministic Finite Automaton (DFA): A DFA is a special form of an NFA: no state has an ε-transition, and for each symbol a and state s there is at most one edge labeled a leaving s, i.e. the transition function maps a state-symbol pair to a single state (not a set of states). [Figure: transition graph of a DFA with states 0, 1 and 2 over {a,b}] The language recognized by this DFA is also (a|b)*ab.
  • 40. Implementing a DFA • Let us assume that the end of a string is marked with a special symbol (say eos). The algorithm for recognition is as follows (an efficient implementation):
      s ← s0                  { start from the initial state }
      c ← nextchar            { get the next character from the input string }
      while (c != eos) do     { do until the end of the string }
      begin
          s ← move(s,c)       { transition function }
          c ← nextchar
      end
      if (s in F) then        { if s is an accepting state }
          return “yes”
      else
          return “no”
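The recognition loop translates almost line for line into Python. In this sketch the transition function is a dictionary keyed by (state, symbol); the table `MOVE` is one possible DFA for (a|b)*ab written out for illustration (it is not copied from the slide's figure), and a missing entry is treated as rejection.

```python
def run_dfa(move, start, accepting, text):
    """Return True if the DFA accepts text, False otherwise."""
    s = start
    for c in text:                 # the end of the string acts as eos
        s = move.get((s, c))       # s <- move(s, c)
        if s is None:              # no edge labeled c leaves s
            return False
    return s in accepting

# One possible DFA for (a|b)*ab: state 1 = "just saw a", state 2 = "just saw ab".
MOVE = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 0,
}
ACCEPTING = {2}
```

Each input character costs one dictionary lookup, which is why the DFA simulation is the efficient one.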
  • 41. Implementing a NFA
      S ← ε-closure({s0})              { set of all states accessible from s0 by ε-transitions }
      c ← nextchar
      while (c != eos) do
      begin
          S ← ε-closure(move(S,c))     { set of all states accessible from a state in S by a transition on c }
          c ← nextchar
      end
      if (S ∩ F != ∅) then             { if S contains an accepting state }
          return “yes”
      else
          return “no”
      • This algorithm is less efficient than the DFA simulation, because it tracks a set of states for every input character.
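The NFA simulation can be sketched in the same style. Assumptions of this illustration: transitions are stored in a dictionary from (state, label) to a set of states, and the label `None` (named `EPS`) marks an ε-transition. The example table `NFA_AB` encodes the three-state NFA for (a|b)*ab from the NFA example slide.

```python
EPS = None  # label used for ε-transitions in this sketch

def eps_closure(states, delta):
    """All states reachable from the given states by ε-transitions alone."""
    closure = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        for t in delta.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

def move(states, symbol, delta):
    """States reachable on symbol from some state in the given set."""
    return {t for s in states for t in delta.get((s, symbol), ())}

def run_nfa(delta, start, accepting, text):
    current = eps_closure({start}, delta)
    for c in text:
        current = eps_closure(move(current, c, delta), delta)
    return bool(current & accepting)   # S ∩ F != ∅

# NFA for (a|b)*ab: start state 0, final state 2.
NFA_AB = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}
```

On input `"ab"` the current set goes from {0} to {0,1} to {0,2}, which contains the final state 2.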
  • 42. Converting A Regular Expression into A NFA (Thompson’s Construction): This is one way to convert a regular expression into an NFA. There are other (more efficient) ways to do the conversion. Thompson’s Construction is a simple and systematic method. It guarantees that the resulting NFA has exactly one start state and exactly one final state. Construction starts from the simplest parts (alphabet symbols). To create an NFA for a complex regular expression, the NFAs of its sub-expressions are combined.
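  • 43. Thompson’s Construction (cont.): To recognize the empty string ε: [Figure: start state i with an ε-transition to final state f] To recognize a symbol a in the alphabet Σ: [Figure: start state i with a transition on a to final state f] If N(r1) and N(r2) are NFAs for regular expressions r1 and r2, then for the regular expression r1 | r2: [Figure: NFA for r1 | r2, with a new start state i that has ε-transitions to the start states of N(r1) and N(r2), and ε-transitions from their final states to a new final state f]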
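  • 44. Thompson’s Construction (cont.): For the regular expression r1 r2: [Figure: NFA for r1 r2, with N(r1) followed by N(r2) between start state i and final state f] The final state of N(r2) becomes the final state of N(r1r2). For the regular expression r*: [Figure: NFA for r*, with a new start state i and a new final state f; ε-transitions lead from i to the start of N(r) and to f, and from the final state of N(r) back to its start and to f]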
  • 45. Thompson’s Construction (Example – (a|b)*a): [Figure: the NFAs built step by step for a, for b, for (a | b), for (a|b)*, and finally for (a|b)*a]
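The construction rules can be sketched as a small builder in which every rule returns a (start, final) pair. This is a hedged illustration, not the slides' exact construction: `NFABuilder` and `accepts` are hypothetical names, state numbers will not match the figure, and concatenation links the two fragments with an ε-transition instead of merging states, which recognizes the same language.

```python
from itertools import count

EPS = None  # label for ε-transitions (an assumption of this sketch)

class NFABuilder:
    def __init__(self):
        self.delta = {}            # (state, label) -> set of states
        self._ids = count()

    def _state(self):
        return next(self._ids)

    def _edge(self, src, label, dst):
        self.delta.setdefault((src, label), set()).add(dst)

    def symbol(self, a):           # i --a--> f
        i, f = self._state(), self._state()
        self._edge(i, a, f)
        return i, f

    def union(self, n1, n2):       # r1 | r2
        i, f = self._state(), self._state()
        for start, final in (n1, n2):
            self._edge(i, EPS, start)
            self._edge(final, EPS, f)
        return i, f

    def concat(self, n1, n2):      # r1 r2 (ε-link instead of merging states)
        self._edge(n1[1], EPS, n2[0])
        return n1[0], n2[1]

    def star(self, n):             # r*
        i, f = self._state(), self._state()
        for src in (i, n[1]):
            self._edge(src, EPS, n[0])
            self._edge(src, EPS, f)
        return i, f

def accepts(builder, nfa, text):
    """Simulate the built NFA on text using sets of states."""
    def closure(states):
        seen, stack = set(states), list(states)
        while stack:
            for t in builder.delta.get((stack.pop(), EPS), ()):
                if t not in seen:
                    seen.add(t)
                    stack.append(t)
        return seen
    current = closure({nfa[0]})
    for c in text:
        step = {t for s in current for t in builder.delta.get((s, c), ())}
        current = closure(step)
    return nfa[1] in current

b = NFABuilder()
nfa = b.concat(b.star(b.union(b.symbol("a"), b.symbol("b"))), b.symbol("a"))
```

The last line builds an NFA for (a|b)*a bottom-up, in the same order as the example: a, b, a|b, (a|b)*, then the final a.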
  • 46. Converting a NFA into a DFA (subset construction): The general idea behind the subset construction is that each state of the DFA corresponds to a set of NFA states. After reading input a1a2…an, the DFA is in the state that corresponds to the set of states the NFA can reach from its start state along paths labeled a1a2…an. It is possible that the number of DFA states is exponential in the number of NFA states. In practice, the NFA and DFA have approximately the same number of states.
  • 47. Converting a NFA into a DFA (subset construction): The subset construction algorithm performs three operations on NFA states: 1. ε-closure(s) – set of NFA states reachable from NFA state s on ε-transitions alone 2. ε-closure(T) – set of NFA states reachable from some NFA state s in set T on ε-transitions alone 3. move(T,a) – set of NFA states to which there is a transition on input symbol a from some state s in T
  • 48. Converting a NFA into a DFA (subset construction): T is a set of NFA states; DS is the set of DFA states.
      put T = ε-closure({s0}) into DS as an unmarked state
      while (there is an unmarked T in DS) do
      begin
          mark T
          for each input symbol a do
          begin
              U ← ε-closure(move(T,a))
              if (U is not in DS) then
                  add U into DS as an unmarked state
              Dtran[T,a] ← U
          end
      end
      – ε-closure({s0}) is the set of all states accessible from s0 by ε-transitions.
      – move(T,a) is the set of states to which there is a transition on a from a state s in T.
      – A state S in DS is an accepting state of the DFA if a state in S is an accepting state of the NFA.
      – The start state of the DFA is ε-closure({s0}).
  • 49. Converting a NFA into a DFA (Example, using the NFA for (a|b)*a)
      S0 = ε-closure({0}) = {0,1,2,4,7}; put S0 into DS as an unmarked state
      mark S0
      ε-closure(move(S0,a)) = ε-closure({3,8}) = {1,2,3,4,6,7,8} = S1; put S1 into DS
      ε-closure(move(S0,b)) = ε-closure({5}) = {1,2,4,5,6,7} = S2; put S2 into DS
      Dtran[S0,a] ← S1, Dtran[S0,b] ← S2
      mark S1
      ε-closure(move(S1,a)) = ε-closure({3,8}) = {1,2,3,4,6,7,8} = S1
      ε-closure(move(S1,b)) = ε-closure({5}) = {1,2,4,5,6,7} = S2
  • 50. Converting a NFA into a DFA (Example) contd...
      Dtran[S1,a] ← S1, Dtran[S1,b] ← S2
      mark S2
      ε-closure(move(S2,a)) = ε-closure({3,8}) = {1,2,3,4,6,7,8} = S1
      ε-closure(move(S2,b)) = ε-closure({5}) = {1,2,4,5,6,7} = S2
      Dtran[S2,a] ← S1, Dtran[S2,b] ← S2
  • 51. Converting a NFA into a DFA (Example – cont.): S0 is the start state of the DFA since 0 is a member of S0 = {0,1,2,4,7}. S1 is an accepting state of the DFA since 8 is a member of S1 = {1,2,3,4,6,7,8}. [Figure: transition graph of the DFA with states S0, S1, S2]
      Transition table for the DFA:
      NFA states          DFA state    a     b
      {0,1,2,4,7}         S0           S1    S2
      {1,2,3,4,6,7,8}     S1           S1    S2
      {1,2,4,5,6,7}       S2           S1    S2
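The worked example can be reproduced with a short sketch of the subset construction. Two things here are assumptions: the edge list `NFA_DELTA` is reconstructed from the ε-closures and moves shown in the example (the figure itself is not available in the text), and `None` is used as the ε label. DFA states are represented as frozensets of NFA states.

```python
EPS = None  # label for ε-transitions in this sketch

def eps_closure(states, delta):
    closure, stack = set(states), list(states)
    while stack:
        for t in delta.get((stack.pop(), EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

def move(states, symbol, delta):
    return {t for s in states for t in delta.get((s, symbol), ())}

def subset_construction(delta, start, accepting, alphabet):
    start_set = frozenset(eps_closure({start}, delta))
    dstates = {start_set}
    unmarked = [start_set]
    dtran = {}
    while unmarked:
        T = unmarked.pop()                      # "mark T"
        for a in alphabet:
            U = frozenset(eps_closure(move(T, a, delta), delta))
            if U not in dstates:
                dstates.add(U)
                unmarked.append(U)
            dtran[(T, a)] = U
    daccepting = {S for S in dstates if S & accepting}
    return start_set, dstates, dtran, daccepting

# NFA for (a|b)*a, edges inferred from the example's closures (an assumption).
NFA_DELTA = {
    (0, EPS): {1, 7}, (1, EPS): {2, 4},
    (2, "a"): {3},    (4, "b"): {5},
    (3, EPS): {6},    (5, EPS): {6},
    (6, EPS): {1, 7}, (7, "a"): {8},
}
S0 = frozenset({0, 1, 2, 4, 7})
S1 = frozenset({1, 2, 3, 4, 6, 7, 8})
S2 = frozenset({1, 2, 4, 5, 6, 7})
start_state, dstates, dtran, daccepting = subset_construction(
    NFA_DELTA, 0, {8}, "ab")
```

Following the algorithm literally, an empty set U would also be added as a (dead) DFA state; this NFA never produces one.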
  • 52. A Language for Specifying Lexical Analyzers: Several tools have been built for constructing LAs from REs. Lex has been widely used to specify LAs for a variety of languages. The tool is referred to as the Lex compiler, and its input specification is called the Lex language. First, a specification of an LA is prepared by creating a program lex.l in the Lex language.
  • 53. Then lex.l is run through the Lex compiler to produce a C program lex.yy.c. lex.yy.c consists of a tabular representation of a transition diagram constructed from the REs of lex.l. The actions associated with the REs in lex.l are pieces of C code and are carried over directly to lex.yy.c. Finally, lex.yy.c is run through the C compiler to produce an object program a.out.
  • 54. Creating a Lexical Analyzer with Lex
      Lex source program lex.l → Lex compiler → lex.yy.c
      lex.yy.c → C compiler → a.out
      input stream → a.out → sequence of tokens
  • 55. Lex Specifications: A Lex program consists of three parts:
      declarations            { includes declarations of variables, constants and regular definitions }
      %%
      translation rules
      %%
      auxiliary procedures    { holds procedures needed by the actions }
      The translation rules of a Lex program are of the form p1 {action1}
  • 56. Design of Lexical Analyzer Generator: Design of a software tool that automatically constructs a lexical analyzer from a program in the Lex language. The specification of an LA has the form p1 {action 1} p2 {action 2} .. pn {action n}, where each pi is a regular expression and each action i is a program fragment.
  • 57. Design of Lexical Analyzer Generator (cont.): Our problem is to construct a recognizer that looks for lexemes in the input buffer. If more than one pattern matches, the recognizer chooses the longest lexeme matched. If two or more patterns match the longest lexeme, the first-listed matching pattern is chosen.
  • 58. Model of Lex Compiler: [Figure: the Lex compiler turns a Lex specification into a transition table] Schematic Lexical Analyzer: [Figure: an FA simulator reads the lexeme from the input buffer and is driven by the transition table]
  • 59. Pattern Matching Based on NFAs: Construct the transition table of an NFA N for the composite pattern p1 | p2 | p3 … | pn. To do so, create an NFA N(pi) for each pattern pi and link it to a new start state S0 by an ε-transition. [Figure: start state S0 linked to N(p1), N(p2), …, N(pn)]
  • 60. Example: We have the following Lex program consisting of three regular expressions
      a      { }   /* actions are omitted here */
      abb    { }
      a*b+   { }
      [Figure: NFAs for a, abb and a*b+. For a: state 1 goes to 2 on a. For abb: 3 goes to 4 on a, 4 to 5 on b, 5 to 6 on b. For a*b+: 7 loops on a and goes to 8 on b; 8 loops on b]
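  • 61. Example Cont.: [Figure: combined NFA recognizing the three patterns; a new start state 0 has ε-transitions to states 1, 3 and 7] Sequence of sets of states entered in processing input aaba: {0,1,3,7} on a → {2,4,7} on a → {7} on b → {8} on a → no state. The last set containing an accepting state is {8}, so the pattern a*b+ is announced for the lexeme aab.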
  • 62. Pattern Matching Based on DFA • Another approach to the construction of a lexical analyzer from a Lex specification is to use a DFA • When we convert an NFA to a DFA using the subset construction algorithm, there may be several accepting NFA states in a given subset of NFA states • In such a situation, the accepting state corresponding to the pattern listed first has priority • The transition table of the converted DFA follows
  • 63. Pattern Matching Based on DFA
      STATE    a      b     PATTERN ANNOUNCED
      0137     247    8     none
      247      7      58    a
      8        –      8     a*b+
      7        7      8     none
      58       –      68    a*b+
      68       –      8     abb
      Test for strings aaba and aba
  • 64. Example: The input string is aba. The DFA starts off in state 0137. On input a it goes to state 247. Then on input b it goes to state 58. On the next input a it has no next state, so we have reached termination after passing through the states 0137, 247 and 58. The last of these includes the accepting NFA state 8. In state 58 the DFA therefore announces that the pattern a*b+ has been recognized, and selects ab, the prefix of the input that led to state 58, as the lexeme.
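The two disambiguation rules (longest lexeme first, then first-listed pattern) can be illustrated without building the DFA at all. The sketch below is a stand-in, not how Lex works internally: it tries each pattern separately with Python's `re` module and keeps the longest match, breaking ties in listing order. `RULES` and `next_lexeme` are hypothetical names; for these three simple patterns the backtracking matcher finds the longest match, which is not guaranteed for arbitrary patterns containing alternation.

```python
import re

# Patterns in the order they are listed in the Lex program.
RULES = [
    ("a",    re.compile(r"a")),
    ("abb",  re.compile(r"abb")),
    ("a*b+", re.compile(r"a*b+")),
]

def next_lexeme(text, pos=0):
    """Return (pattern, lexeme) for the longest match at pos, or None."""
    best = None
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        # A strictly longer match wins; on a tie the earlier rule is kept.
        if m and m.end() > pos and (best is None or m.end() > best[1]):
            best = (name, m.end())
    if best is None:
        return None
    name, end = best
    return name, text[pos:end]
```

On input `aba` this selects the lexeme `ab` for a*b+, matching the DFA trace above; on `abb` both abb and a*b+ match three characters and the first-listed pattern abb wins.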
  • 65. Minimizing Number of States of a DFA: Partition the set of states into two groups: G1, the set of accepting states, and G2, the set of non-accepting states. For each new group G, partition G into subgroups such that states s1 and s2 are in the same subgroup if and only if, for all input symbols a, states s1 and s2 have transitions to states in the same group. Repeat until no group can be split further. • The start state of the minimized DFA is the group containing the start state of the original DFA. • The accepting states of the minimized DFA are the groups containing the accepting states of the original DFA.
  • 66. Algorithm: Minimizing the number of states of a DFA. Input: A DFA D with set of states S, input alphabet Σ, start state s0, and set of accepting states F. Output: A DFA D’ accepting the same language as D and having as few states as possible. Method: 1. Start with an initial partition π with two groups, F and S – F, the accepting and nonaccepting states of D. 2. Apply the following procedure to construct a new partition π new: initially, π new = π; for (each group G of π) { partition G into subgroups such that
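  • 67. Minimizing DFA – Example: [Figure: DFA with states 1, 2 and 3 over {a,b}] G1 = {2}, G2 = {1,3}. G2 cannot be partitioned because move(1,a)=2 and move(3,a)=2 lead to the same group, and move(1,b)=3 and move(3,b)=3 lead to the same group. So the minimized DFA (with the minimum number of states) has the two states {1,3} and {2}. [Figure: minimized DFA]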
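  • 68. Minimizing DFA – Another Example: [Figure: DFA with states 1, 2, 3 and 4 over {a,b}] Groups: {1,2,3} {4}
      state    a       b
      1        1→2     1→3
      2        2→2     2→3
      3        3→4     3→3
      On a, state 3 goes to group {4} while states 1 and 2 stay in {1,2,3}, so {1,2,3} splits into {1,2} and {3}; no more partitioning is possible. So the minimized DFA has the states {1,2}, {3} and {4}. [Figure: minimized DFA]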
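The partition-refinement procedure can be sketched as below. The sketch assumes a total transition function given as a dictionary keyed by (state, symbol), and it is demonstrated on the three-state DFA for (a|b)*a obtained earlier by the subset construction (S0, S1, S2), since that table appears in full in the text. The function name `minimize` is hypothetical, and it returns only the final groups, not the transitions between them.

```python
def minimize(states, alphabet, move, accepting):
    """Return the groups of equivalent states as a list of frozensets."""
    accepting = set(accepting)
    # Initial partition: accepting and non-accepting states.
    partition = [g for g in (accepting, set(states) - accepting) if g]
    while True:
        def group_of(state):
            for index, group in enumerate(partition):
                if state in group:
                    return index

        new_partition = []
        for group in partition:
            # States stay together only if, for every input symbol,
            # they move into the same group of the current partition.
            buckets = {}
            for s in group:
                key = tuple(group_of(move[(s, a)]) for a in alphabet)
                buckets.setdefault(key, set()).add(s)
            new_partition.extend(buckets.values())
        if len(new_partition) == len(partition):   # nothing was split
            return [frozenset(g) for g in new_partition]
        partition = new_partition

# DFA for (a|b)*a from the subset-construction example.
DTRAN = {
    ("S0", "a"): "S1", ("S0", "b"): "S2",
    ("S1", "a"): "S1", ("S1", "b"): "S2",
    ("S2", "a"): "S1", ("S2", "b"): "S2",
}
groups = minimize(["S0", "S1", "S2"], "ab", DTRAN, {"S1"})
```

Here S0 and S2 behave identically on both symbols, so they merge into one group and the minimized DFA has two states, {S0,S2} and {S1}.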