Lexical Analyser PPTs for Third Lease Computer Sc. and Engineering

Lexical Analyzer
 Lexical Analyzer reads the source program
character by character to produce tokens.
 Normally a lexical analyzer doesn’t return a list of
tokens at one shot, it returns a token when the
parser asks a token from it.
Compiler Construction 1
Lexical
Analyzer
Parser
source
program
token
get next token
Chapter 2
Symbol Table

 Since the LA is part of compiler that reads the
source text, it may perform certain other tasks
such as,
1. Stripping out comments and whitespaces(blank, newline,
tab)
2. Correlating error messages generated by the compiler
with the source program e.g. LA may keep track of the
number of newline characters seen, so it can associate a
line number with each error message.
3. In some compilers, the LA makes a copy of the source
program with the error messages inserted at the
appropriate positions.
4. If the source program uses a macro-preprocessor, the
expantion of macros may also be performed by the LA
5. Sometimes, LA are divided into cascade of two processes
A) Scanning B) Lexical anlysis

Issues in Lexical Analysis
Reasons for separating the analysis phase of compiling into
lexical analysis and parsing
 Simpler Design
 Compiler Efficiency Is Improved
 Compiler Portability Is Enhanced

Tokens, Patterns and Lexemes
 Token represents a set of strings described by a pattern.
– Identifier represents a set of strings which start with a
letter continues with letters and digits
– The actual string (newval) is called as lexeme.
– Tokens: identifier, number, addop, delimeter, …
 Since a token can represent more than one lexeme, additional
information should be held for that specific lexeme. This
additional information is called as the attribute of the token.
 For simplicity, a token may have a single attribute which
holds the required information for that token.

– For identifiers, this attribute is a pointer to the symbol table, and
the symbol table holds the actual attributes for that token.
 Some attributes:
– <id,attr> where attr is pointer to the symbol table
– <assgop,_> no attribute is needed (if there is only one
assignment operator)
– <num,val> where val is the actual value of the number.
 Token type and its attribute uniquely identifies a lexeme.
 Regular expressions are widely used to specify patterns.

Example
E = M * C ** 2
Token names and associated attributes are
<id, pointer to symbol table entry for E>
<assign_op>
<id, pointer to symbol table entry for M>
<mult_op>
<id, pointer to symbol table entry for C>
<exp_op>
<number,integer value 2>

Attributes of Tokens
y := 31 + 28*x
<id, “y”> <assign, > <num, 31> <+, > <num, 28> <*, > <id, “x”>
Token Attribute
Lexical Analyzer
Parser

Tokens, Patterns, and Lexemes
A token is a classification of lexical units
– For example: id and num
Lexemes are the specific character strings that
make up a token
– For example: abc and 123
Patterns are rules describing the set of lexemes
belonging to a token
– For example: “letter followed by letters and
digits” and “non-empty sequence of digits”

Lexical Errors
fi ( a == f(x) )…
It may be misspelling of the keyword if
It may be undeclared function identifier
Lexical analyzer must return the token id to the parser
and some other phase handle the error
Panic Mode Recovery – Delete successive characters from
the remaining input, until the LA can find well formed
token

Other possible recovery actions
1. Delete one character from the remaining input
2. Insert a missing character into the remaining input
3. Replace a character by another character
4. Transpose two adjacent characters

Input Buffering
Three approaches to the implementation of a lexical
analyzer
1. Use of lexical analyzer generator e.g. Lex compiler
2. Write the Lexical Analyzer in a conventional system
programming languages
3. Write the Lexical Analyzer in assembly language using
I/O facilities to read the input

Buffer Pairs
E = M * C * * 2 eof
 Each buffer is of the same size N, and N is usually size
of a disk block
 Using one system read command N characters are read
 eof marks the end of source file
 Two pointers lexemeBegin and forward are maintained

Sentinels
 Each time we advance forward pointer we must check
for two tests
1. One for end of buffer
2. Other to determine what character is read
 These two tests can be combined by using sentinel
 Sentinel is a special character that can be part of
source program e.g. eof

Can we run out of buffer space?
 In modern languages lexemes are short and one or two
character lookahead is sufficient
 The buffer size in thousand is ample and the double buffer
scheme works without problem
 To avoid problem with long character strings, we can treat
them as concatenation of components
 In JAVA a+ operator is used to represents long strings on
different lines
 PL/I do not treat keywords as reserved
Problem with DECLARE( Arg1, Arg2……

Specification of Tokens
 Regular expressions are an important notation for
specifying lexeme patterns

Terminology of Languages
 Alphabet : a finite set of symbols (ASCII characters)
 String :
– Finite sequence of symbols on an alphabet
– Sentence and word are also used in terms of string
–  is the empty string
– |s| is the length of string s.
 Language: sets of strings over some fixed alphabet
–  the empty set is a language.
– {} the set containing empty string is a language
– The set of well-formed C programs is a language
– The set of all possible identifiers is a language.

Operations on Languages
 Concatenation:
– L1L2 = { s1s2 | s1  L1 and s2  L2 }
 Union
– L1 L2 = { s | s  L1 or s  L2 }
 Exponentiation:
– L0
= {} L1
= L L2
= LL
 Kleene Closure
– L*
=
 Positive Closure
– +



0
i
i
L



0
i
i
L



1
i
i
L

Example
• L1 = {a,b,c,d} L2 = {1,2}
• L1L2 = {a1,a2,b1,b2,c1,c2,d1,d2}
• L1  L2 = {a,b,c,d,1,2}
• L1
3
= all strings with length three (using a,b,c,d}
• L1
*
= all strings using letters a,b,c,d and empty
string
• L1
+
= doesn’t include the empty string

Regular Expressions
 We use regular expressions to describe tokens of a
programming language.
 A regular expression is built up of simpler regular
expressions (using defining rules)
 Each regular expression denotes a language.
 A language denoted by a regular expression is
called as a regular set.

Regular Expressions (Rules)
Regular expressions over alphabet 
Reg. Expr Language it denotes
 {}
a  {a}
(r1) | (r2) L(r1)  L(r2)
(r1) (r2) L(r1) L(r2)
(r)*
(L(r))*
(r) L(r)

Regular Expressions (cont.)
• We may remove parentheses by using precedence rules.
– * highest
– concatenation next
– | lowest
• ab*
|c means (a(b)*
)|(c)
• Ex:
–  = {0,1}
– 0|1 => {0,1}
– (0|1)(0|1) => {00,01,10,11}
– 0*
=> { ,0,00,000,0000,....}
– (0|1)*
=> all strings with 0 and 1, including the empty
string

Regular Definitions
♠ To write regular expression for some languages can be difficult,
because their regular expressions can be quite complex. In those
cases, we may use regular definitions.
♠ We can give names to regular expressions, and we can use these
names as symbols to define other regular expressions.
♠ A regular definition is a sequence of the definitions of the form:
d1  r1 where di is a distinct name and
d2  r2 ri is a regular expression over the
alphabet   {d1,d2,...,di-1}
dn  rn
basic symbols previously defined
names

Regular Definitions (cont.)
Ex: Identifiers in Pascal
letter  A | B | ... | Z | a | b | ... | z
digit  0 | 1 | ... | 9
id  letter (letter | digit ) *
– If we try to write the regular expression
representing identifiers without using regular
definitions, that regular expression will be
complex.
(A|...|Z|a|...|z) ( (A|...|Z|a|...|z) | (0|...|9) ) *

Problems
1. Describe the languages denoted by the following regular
expressions
a) a( a | b )*a
Solution :
L(r ) = { aa, aaa, aba, aaaa, aaba, abaa, abba, ..}
From above we can say that the above language says
that it is “ Language consisting of strings with a’s and
b’s always starting and ending with a.

b) ((ε | a) b*)*
Solution:
L(r ) = {ε, a, b, aa, ab, bb, abb, abab, …..}
From above, we can say that the above language
says that it is “Language consisting of strings
with no consecutive a’s when there is combination
of a’s and b’s in the string

Problems
Write regular definitions for the following languages
a) All strings of letters that contain the five vowels in
order
Non_vowel → [b-d B-D f-h F-H j-n J-N p-t P-T v-z V-Z]
String → (non-vowel)*([a| A])+
(non-vowel)*([e|E])+
(non-vowel)*([i| I])+
(non-vowel)*([o| O])+
(non-vowel)*([u| U])+
b) All strings of lowercase letters in which the letters are in
ascending lexicographic order

Recognition of Tokens
Consider the following grammar
Stmt if
→ expr then stmt
| if expr then stmt else stmt
| ε
expr → term relop term
| term
term id
→
| number

Patterns for Tokens
digit [0-9]
→
digits digit
→ +
number digits(. Digits)? (E[+-]? Digits)?
→
letter [A-Za-z]
→
id letter (letter | digit)*
→
if if
→
then then
→
else else
→
relop < | > | <= | >= | = | <>
→

Transition Diagram
 As an intermediate step in the construction of LA patterns are
first converted into stylized flowcharts called transition diagrams
 RE patterns are converted to TD by hand.There are mechanical way
to construct these diagrams from Res
 TD have set of nodes, called states. Each state represents a condition
that could occur during the process of scanning
 Edges are directed from one state of TD to another. Each edge is
labeled by a symbol or set of symbols.
 Certain states are called accepting or final. These states indicate that
a lexeme has been found.

 If it is necessary to retract forward pointer one position then *
is placed near that accepting state.
0 1 2
3
4
7
8
5
6
< =
>
return (relop LE)
return (relop NE)
return (relop LT)
return (relop GE)
return (relop GT)
other
return (relop EQ)
*
=
>
=
other
*
start

Recognition of reserved words and identifiers
0 10 11
start letter
Letter or digit
other *
return(getToken(), installID())
There are two ways that we can handle reserved words that look like ids
1. Install the reserved words in the symbol table initially
A field of the symbol table entry indicates that which token they
represent
2. Create separate transition diagram for each keyword.
Such transition diagram consists of states representing the situation
after each successive letter of the keyword is seen, followed by a test
for a “ nonletter-or-digit
start t h e n
nonlet/digit
*

Sketch of implementation of relop transition
diagram
TOKEN getRelop()
{
Token retToken = new(RELOP);
while(1) { /* repeat charater processing until a return or failure occurs*/
switch(state){
case 0: c = nextChar();
if (c == ‘<‘ ) state =1;
else if ( c == ‘=‘ ) state = 5;
else if ( c == ‘>’ ) state = 6;
else fail (); /* lexeme is not a relop */
break;
case 1: …
case 2: …
case 8: retract();
retToken.attribute = GT;
return(retToken);
}
}
}

Finite Automata
 A recognizer for a language is a program that
takes a string x, and answers “yes” if x is a
sentence of that language, and “no” otherwise.
 We call the recognizer of the tokens as a finite
automaton.
 A finite automaton can be: deterministic(DFA) or
non-deterministic (NFA)
 This means that we may use a deterministic or
non-deterministic automaton as a lexical analyzer

 Both deterministic and non-deterministic finite
automaton recognize regular sets.
 Which one?
– deterministic – faster recognizer, but it may take more
space
– non-deterministic – slower, but it may take less space
– Deterministic automatons are widely used lexical analyzers.
 First, we define regular expressions for tokens;
Then we convert them into a DFA to get a lexical
analyzer for our tokens.
– Algorithm1: Regular Expression  NFA  DFA (two
steps: first to NFA, then to DFA)
– Algorithm2: Regular Expression  DFA (directly convert
a regular expression into a DFA)

Non-Deterministic Finite Automaton (NFA)
• A non-deterministic finite automaton (NFA)
is a mathematical model that consists of:
– S - a set of states
–  - a set of input symbols (alphabet)
– move – a transition function move to map state-
symbol pairs to sets of states.
– s0 - a start (initial) state
– F – a set of accepting states (final states)

 - transitions are allowed in NFAs. In other
words, we can move from one state to another one
without consuming any symbol
 A NFA accepts a string x, if and only if there is a
path from the starting state to one of accepting
states such that edge labels along this path spell
out x

NFA (Example)
1
0 2
a b
start
a
b
0 is the start state s0
{2} is the set of final states F
S = {a,b}
S = {0,1,2}
Transition Function: a b
0 {0,1} {0}
1 _ {2}
2 _ _
Transition graph of the NFA
The language recognized by this NFA is (a|b) *
a b

Deterministic Finite Automaton (DFA)
 It is a special form of a NFA
 no state has - transition
 for each symbol a and state s, there is at most one labeled edge
a leaving s.
i.e. transition function is from pair of state-symbol to state
(not set of states)
1
0 2
b
a
a
b
The language recognized by
this DFA is also (a|b) *
a b
b a

Implementing a DFA
• Le us assume that the end of a string is marked with a
special symbol (say eos). The algorithm for recognition
will be as follows: (an efficient implementation)
s  s0 { start from the initial state }
c  nextchar { get the next character from the
input string }
while (c != eos) do { do until the end of the string }
begin
s  move(s,c) { transition function }
c  nextchar
end
if (s in F) then { if s is an accepting state }
return “yes”
else
return “no”

Implementing a NFA
S  -closure({s0}) { set all of states can be accessible
from s0 by -transitions }
c  nextchar
while (c != eos) {
begin
s  -closure(move(S,c)) { set of all states can be
accessible from a state in S
c  nextchar by a transition on c }
end
if (SF != ) then { if S contains an accepting state }
return “yes”
else
return “no”
• This algorithm is not efficient.

Converting A Regular Expression into A
NFA (Thomson’s Construction)
 This is one way to convert a regular expression into a
NFA.
 There can be other ways (much efficient) for the
conversion.
 Thomson’s Construction is simple and systematic
method. It guarantees that the resulting NFA will have
exactly one final state, and one start state.
 Construction starts from simplest parts (alphabet
symbols). To create a NFA for a complex regular
expression, NFAs of its sub-expressions are combined
to create its NFA,

Thomson’s Construction (cont.)
 To recognize an empty string 
 To recognize a symbol a in the alphabet 
 If N(r1) and N(r2) are NFAs for regular expressions r1
and r2
 For regular expression r1 | r2
a
f
i
f
i

N(r2)
N(r1)
f
i NFA for r1 | r2
 



Thomson’s Construction (cont.)
 For regular expression r1 r2
i f
N(r2)
N(r1)
NFA for r1 r2
Final state of N(r2) become final state of N(r1r
 For regular expression r*
N(r)
i f
NFA for r*
 



Thomson’s Construction (Example - (a|b) *
a )
a:
a
b
b:
(a | b)
a
b




b




a

 
(a|b) *


b




a
 

a
(a|b) *
a

Converting a NFA into a DFA (subset
construction)
 The general idea behind the subset construction is that each
state of DFA corresponds to a set of NFA states
 After reading input a1a2….an, the DFA is in that state which
correponds to the set of states that the NFA can reach, from its
start state
 It is possible that the number of DFA states is exponential in
the number of NFA states
 In practice the NFA and DFA have approximately the same
number of states

Converting a NFA into a DFA
(subset construction)
 The subset construction algorithm performs
three operations on NFA states
1. -closure(s) – Set of NFA states reachable from
NFA state s on -transitions alone
2. -closure(T) – Set of NFA states reachable
from some NFA state s in set T on -transitions
alone
3. move(T,a) – Set of NFA states to which there is
a transition on input symbol a from some state s
in T

(subset construction)
T is set of states of NFA
put T= -closure({s0}) as an unmarked state into the set of DFA (DS)
while (there is one unmarked T in DS) do
begin
mark T
for each input symbol a do
begin
U  -closure(move(T,a))
if (U is not in DS) then
add U into DS as an unmarked state
Dtran[T,a]  U
end
end
– a state S in DS is an accepting state of DFA if a state in S is an accepting
state of NFA
– the start state of DFA is -closure({s0})
set of states to which there is a transition on
a from a state s in S1
-closure({s0}) is the set of all states can be accessible
from s0 by -transition.

Converting a NFA into a DFA (Example)
• S0 = -closure({0}) = {0,1,2,4,7} S0 into DS as an unmarked state
•  mark S0
• -closure(move(S0,a)) = -closure({3,8}) = {1,2,3,4,6,7,8} = S1 S1
into DS
• -closure(move(S0,b)) = -closure({5}) = {1,2,4,5,6,7} = S2 S2
into DS
• Dtran[S0,a]  S1 transfunc[S0,b]  S2
•  mark S1
• -closure(move(S1,a)) = -closure({3,8}) = {1,2,3,4,6,7,8} = S1
• -closure(move(S1,b)) = -closure({5}) = {1,2,4,5,6,7} = S2
b




a
 

a

Converting a NFA into a DFA (Example)
contd...
Dtran[S1,a]  S1 Dtran[S1,b]  S2
 mark S2
-closure(move(S2,a)) = -closure({3,8}) = {1,2,3,4,6,7,8} = S1
-closure(move(S2,b)) = -closure({5}) = {1,2,4,5,6,7} = S2
Dtran[S2,a]  S1 Dtran[S2,b]  S2

(Example – cont.)
S0 is the start state of DFA since 0 is a member of S0={0,1,2,4,7}
S1 is an accepting state of DFA since 8 is a member of S1 = {1,2,3,4,6,7,8}
b
a
a
b
b
a
S1
S2
S0
Transition Table for DFA D
NFA State DFA State a b
{0,1,2,4,7} S0 S1 S2
(1,2,3,4,6,7,8} S1 S1 S2
{1,2,4,5,6,7} S2 S1 S2

A Language for Specifying Lexical
Analyzers
 Several tools have been built for constructing LA
from Res
 Lex, has been widely used to specify Las for a
variety of languages
 The tool is referred as Lex compiler and its input
specifications is called Lex language
 First, a specification of a LA is prepared by
creating a program lex.1 in the lex language

 Then lex.1 is run through the Lex compiler to produce
C program lex.yy.c
 lex.yy.c consists of a tabular representation of a
transition diagram constructed from the RE of lex.1
 The actions associated with RE in lex.1 are pieces of C
code and are carried over diretly to lex.yy.c
 Finally, lex.yy.c is run through the C compiler to produce
an object program a.out.

Lex source
program lex.1
Lex
compiler
C
compiler
a.out
Lex.yy.c
a.out
Lex.yy.c
Input
stream
Sequence of
tokens
Creating Lexical Analyzer with Lex

Lex Specifications
Lex program consists of three parts
declarations { includes declarations of
variables,
%% constants and regular
definitions}
translation rules
%%
auxiliary procedures { holds procedures needed by
the actions}
The translation rules of a Lex program are of the
form
p1 {action1 }

Design of Lexical Analyzer Generator
 Design of a S/W tool that automatically constructs
a lexical analyzer from a program in the Lex
language.
 Specification of a LA of the form
p1 {action 1}
p2 {action 2}
..
pn {action n}
pi is Regular Expression and actioni is a
program fragment

 Our problem is to construct a recognizer that
looks for lexemes in the input buffer
 If more that one pattern matches, the recognizer
is to choose the longest lexeme matched
 If there are two or more patterns that match the
longest lexeme, the first listed matching pattern is
chosen
Design of Lexical Analyzer Generator
(cont.)

Lex Specification Transition table
input buffer
Model of Lex Compiler
Lex Compiler
lexeme
FA
Simulator
Transition
Table
Schematic Lexical Analyzer

Pattern Matching Based on NFA’s
 Construct the transition table of NFA N for the
composite pattern p1| p2| p3 …… | pn
 Create an NFA N(pi) for each pattern pi and link
it to S0 the start state
N(p1)
N(p2)
N(pn)
S0

Example
We have the following Lex program consisting of
three regular expressions
a { } /* actions are omitted here */
abb { }
a*b+
{ }
start a
1 2
start a b b
3 4 5
1
6
start b
7 8
a b
NFA for a, abb, and a*b

Example Cont.
a
a b b
1 2
3 4 5 6
b
7 8
a b
0
ε
ε
ε
Combined NFA recognizing three different patterns
0
1
3
7
a a b a
2
4
7
7 8
Sequence of sets of states entered in processing input aaba
A*b+

Pattern Matching Based on DFA
• Another approach to the construction of lexical
analyzer from Lex specification is to use a DFA
• When we convert an NFA to DFA using subset
construction algorithm, there may be several
accepting states in given subset of NFA states
• In such situation, the accepting state corrosponding
to the pattern listed first has priority
• Following is the transition table of converted DFA

Pattern Matching Based on DFA
STATE INPUT SYMBOL
a b
PATTEN
ANNOUNCED
0137 247 8 none
247 7 58 a
8 - 8 a*b+
7 7 8 None
58 - 68 a*b+
68 - 8 abb
Test for strings aaba and aba

Example
The input string is aba
 The DFA starts off in state 0137
 On input a it goes to state 247
 Then on input b it goes to state 58
 On next input a it has no next state
 We thus have reached termination
 The last of these includes the accepting NFA state 8
 In state 58 the DFA announces that the pattern a*b+
has been recognized and selects ab the prefix of
input
as lexeme.

Minimizing Number of States of a DFA
 partition the set of states into two groups:
 G1 : set of accepting states
 G2 : set of non-accepting states
 For each new group G
 partition G into subgroups such that states s1 and s2 are in
the same group if
 for all input symbols a, states s1 and s2 have
transitions to states in the same group.
• Start state of the minimized DFA is the group containing
the start state of the original DFA.
• Accepting states of the minimized DFA are the groups
containing the accepting states of the original DFA.

Algorithm : Minimizing the number of states of a
DFA
Input : A DFA D with set of states S, input alphabet Σ, start state
s0, and set of accepting states F’
Output : A DFA D’ accepting the same language as D and having as
few states as possible
Method :
1. Start with an initial partition π with two groups, F and S – F, the
accepting and nonaccepting states of D
2. Apply following procedure to construct new partition π new
initially, π new = π;
for (each group G of π) {
partition G into subgroups such that

Minimizing DFA - Example
b a
a
a
b
b
3
2
1
G1 = {2}
G2 = {1,3}
G2 cannot be partitioned because
move(1,a)=2 move(1,b)=3
move(3,a)=2 move(2,b)=3
So, the minimized DFA (with minimum states)
{1,3}
a
a
b
b
{2}

Minimizing DFA – Another Example
b
b
b
a
a
a
a
b 4
3
2
1
Groups: {1,2,3} {4}
a b
1->2 1->3
2->2 2->3
3->4 3->3
{1,2} {3}
no more partitioning
So, the minimized DFA
{1,2}
{4}
{3}
b
a
a
a
b
b

Lexical Analyser PPTs for Third Lease Computer Sc. and Engineering

More Related Content

Similar to Lexical Analyser PPTs for Third Lease Computer Sc. and Engineering (20)

Recently uploaded (20)

Lexical Analyser PPTs for Third Lease Computer Sc. and Engineering