LexicalAnalysis in Compiler design .pt

CHAPTER 2
Lexical Analysis
(Scanning)

1. THE ROLE OF LEXICAL ANALYZER
[Figure: source program -> lexical analyzer (scanner) -> tokens -> syntax analyzer (parser), with a symbol table manager]

• Main task: to read input characters and group them into “tokens.”
• Secondary tasks:
   • Skip comments and whitespace;
   • Correlate error messages with the source program (e.g., line number of error).

Different approaches for implementing lexical analyzers:
• Using a scanner generator, e.g., lex or flex. This automatically generates a lexical analyzer from a high-level description of the tokens. (easiest to implement; least efficient)
• Programming it in a language such as C, using the I/O facilities of the language. (intermediate in ease, efficiency)
• Writing it in assembly language and explicitly managing the input. (hardest to implement, but most efficient)

• token: a name for a set of input strings with related structure.
   Example: “identifier,” “integer constant”
• pattern: a rule describing the set of strings associated with a token.
   Example: “a letter followed by zero or more letters, digits, or underscores.”
• lexeme: the actual input string that matches a pattern.
   Example: count

Example
Input: count = 123
Tokens:
   identifier      Rule: “letter followed by …”   Lexeme: count
   assg_op         Rule: =                         Lexeme: =
   integer_const   Rule: “digit followed by …”    Lexeme: 123
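
The grouping of characters into (token, lexeme) pairs can be sketched in Python as follows. This is a minimal illustration, not the scanner the slides have in mind: the regular expressions are assumed readings of the rules above (the slide elides the full rule text with "…"), and the names TOKEN_SPEC, MASTER, and tokenize are hypothetical.

```python
import re

# Token patterns, tried in order. The token names follow the example above;
# the regular expressions are assumed readings of the elided rules.
TOKEN_SPEC = [
    ("identifier",    r"[A-Za-z][A-Za-z0-9_]*"),  # letter, then letters/digits/underscores
    ("integer_const", r"[0-9]+"),                 # one or more digits (assumption)
    ("assg_op",       r"="),
    ("whitespace",    r"[ \t\n]+"),               # skipped, per the secondary tasks
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    """Return a list of (token, lexeme) pairs, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        if m.lastgroup != "whitespace":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

print(tokenize("count = 123"))
# prints [('identifier', 'count'), ('assg_op', '='), ('integer_const', '123')]
```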


• If more than one lexeme can match the pattern for a token, the scanner must indicate the actual lexeme that matched.
• This information is given using an attribute associated with the token.
Example: The program statement
   count = 123
yields the following token-attribute pairs:
   identifier, pointer to the string “count”
   assg_op, (no attribute needed)
   integer_const, the integer value 123
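
One way to attach attributes to tokens is sketched below. It is only an illustration under stated assumptions: the list symbol_table stands in for the symbol table manager, an index into that list stands in for the "pointer to the string", and the helper names install_id and with_attribute are hypothetical.

```python
symbol_table = []  # stand-in for the symbol table manager (assumption)

def install_id(lexeme):
    """Return the index of lexeme in symbol_table, adding it if absent."""
    if lexeme not in symbol_table:
        symbol_table.append(lexeme)
    return symbol_table.index(lexeme)

def with_attribute(token, lexeme):
    """Pair a token with the attribute that identifies its lexeme."""
    if token == "identifier":
        return (token, install_id(lexeme))   # index plays the role of a pointer
    if token == "integer_const":
        return (token, int(lexeme))          # the integer value itself
    return (token, None)                     # only one possible lexeme: no attribute

scanned = [("identifier", "count"), ("assg_op", "="), ("integer_const", "123")]
pairs = [with_attribute(tok, lex) for tok, lex in scanned]
print(pairs)
# prints [('identifier', 0), ('assg_op', None), ('integer_const', 123)]
```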

2. Input Buffering Scheme
Three approaches for implementing lexical analyzers:
• Using a scanner generator, e.g., lex or flex. This automatically generates a lexical analyzer from a high-level description of the tokens. (easiest to implement; least efficient)
• Programming it in a language such as C, using the I/O facilities of the language. (intermediate in ease, efficiency)
• Writing it in assembly language and explicitly managing the input. (hardest to implement, but most efficient)
These three choices are listed in order of increasing difficulty for the implementer (the compiler writer).

• Lexical analyzer performance (speed) is crucial, since:
   • This is the only part of the compiler that examines the entire input program one character at a time.
   • Disk input can be slow.
   • The scanner accounts for a considerable share (25-30%) of total compile time.
• The lexical analyzer (LA) has to look ahead to determine when a match has been found before it can announce a token.
• Scanners use an input buffering technique called double-buffering to minimize the overhead of identifying tokens and keep scanning fast.

Input buffering scheme with sentinels
• Objective: optimize the common case by reducing the number of tests to one per advance of fwd (the forward pointer).
• Idea: extend each buffer half to hold a sentinel at the end.
   • This is a special character that cannot occur in a program (e.g., EOF).
   • It signals the need for some special action (fill the other buffer half, or terminate processing).
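
The sentinel idea can be sketched as below. This is a simplified illustration, not a production scanner: the class name SentinelBuffer, the half size N = 4, the use of "\0" as the EOF sentinel, and reading from an io.StringIO in place of disk input are all assumptions, and the sketch tracks only the fwd pointer (no lexeme-start pointer).

```python
import io

EOF = "\0"  # sentinel character, assumed never to occur in source text
N = 4       # size of each buffer half (tiny here so reloads are easy to trace)

class SentinelBuffer:
    def __init__(self, stream):
        self.stream = stream
        # Layout: [half 0: N chars][EOF][half 1: N chars][EOF]
        self.buf = [EOF] * (2 * N + 2)
        self.fwd = 0
        self._fill(0)

    def _fill(self, half):
        """Load up to N characters into a half; pad a short read with EOF."""
        start = half * (N + 1)
        chunk = self.stream.read(N)
        for i in range(N):
            self.buf[start + i] = chunk[i] if i < len(chunk) else EOF

    def next_char(self):
        c = self.buf[self.fwd]
        self.fwd += 1
        if c != EOF:                 # common case: a single test per advance
            return c
        at = self.fwd - 1            # rare case: a sentinel was read
        if at == N:                  # end of first half: fill the second
            self._fill(1)
            return self.next_char()
        if at == 2 * N + 1:          # end of second half: fill the first, wrap
            self._fill(0)
            self.fwd = 0
            return self.next_char()
        self.fwd = at                # EOF inside a half: real end of input
        return None

def read_all(text):
    buf = SentinelBuffer(io.StringIO(text))
    out = []
    while (c := buf.next_char()) is not None:
        out.append(c)
    return "".join(out)

print(read_all("count = 123"))  # prints count = 123
```

The position checks run only after a sentinel has been seen, so the ordinary path through next_char performs one comparison per character.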

3. Specification of Tokens: regular expressions
Terminology:
alphabet : a finite set of symbols
string : a finite sequence of alphabet symbols
language : a (finite or infinite) set of strings.

Regular Expressions
A pattern notation for describing certain kinds of sets of strings:
Given an alphabet Σ:
• ε is a regular exp. (denotes the language {ε})
• for each a ∈ Σ, a is a regular exp. (denotes the language {a})
• if r and s are regular exps. denoting L(r) and L(s) respectively, then so are:
   • (r) | (s) (denotes the language L(r) ∪ L(s))
   • (r)(s) (denotes the language L(r)L(s))
   • (r)* (denotes the language L(r)*)
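
The three language operations can be tried out on small finite sets of strings. The sketch below is illustrative only: the function names are hypothetical, and since L* is infinite in general, star takes an assumed max_reps bound and returns only the strings built from at most that many pieces.

```python
from itertools import product

def union(L1, L2):
    """L(r) ∪ L(s): strings in either language."""
    return L1 | L2

def concat(L1, L2):
    """L(r)L(s): every string of L1 followed by every string of L2."""
    return {x + y for x, y in product(L1, L2)}

def star(L, max_reps):
    """Finite approximation of L*: at most max_reps pieces drawn from L."""
    result = {""}   # zero repetitions gives the empty string ε
    layer = {""}
    for _ in range(max_reps):
        layer = concat(layer, L)
        result |= layer
    return result

a, b = {"a"}, {"b"}
print(sorted(star(union(a, b), 2)))
# prints ['', 'a', 'aa', 'ab', 'b', 'ba', 'bb']
```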

4. FINITE AUTOMATA – NFA and DFA

Finite Automata
A finite automaton is a 5-tuple (Q, Σ, T, q0, F), where:
• Σ is a finite alphabet;
• Q is a finite set of states;
• T: Q × Σ → Q is the transition function;
• q0 ∈ Q is the initial state; and
• F ⊆ Q is a set of final states.
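
The 5-tuple maps directly onto plain data plus a short simulation loop. The sketch below is an assumption-laden example: the four-state automaton is a commonly used deterministic machine for (a|b)*abb, not one taken from the slides (whose diagrams are not reproduced here), and the name accepts is hypothetical.

```python
# The five components (Q, Σ, T, q0, F) as plain Python data.
Q = {0, 1, 2, 3}
SIGMA = {"a", "b"}
T = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 3,
    (3, "a"): 1, (3, "b"): 0,
}
Q0 = 0
F = {3}

def accepts(s):
    """Run the automaton on s; accept if it ends in a final state."""
    state = Q0
    for ch in s:
        if ch not in SIGMA:
            return False          # symbol outside the alphabet
        state = T[(state, ch)]    # T: Q × Σ → Q
    return state in F

print(accepts("babb"), accepts("abba"))  # prints True False
```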

[Figure: NFA with ε-transitions for the RE (a|b)*abb]

5. From Regular Expressions to NFA
The following algorithm constructs an NFA for a given RE.

Construct an NFA for the given RE: (a|b)*abb
First decompose the complex RE into simpler REs.

[Figure: the final NFA after applying the algorithm]

Construct an NFA with ε-transitions for the RE (a|b)*abb and convert it into a DFA
Start from the NFA that accepts the strings described by the given RE. First consider its starting state 0, then compute ε-closure(0), which becomes the starting state of the DFA and is a set of NFA states.

The DFA after conversion accepts the set of strings represented by the RE (a|b)*abb.
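
The conversion can be sketched with an ε-closure routine and a worklist. Because the NFA diagram is not reproduced in this text, the sketch assumes the common textbook NFA for (a|b)*abb with states 0 to 10 (start 0, final 10); the state numbering and all function names are assumptions.

```python
EPS = ""  # label used for ε-transitions

# Assumed NFA for (a|b)*abb: state -> {symbol: set of next states}.
NFA = {
    0: {EPS: {1, 7}}, 1: {EPS: {2, 4}}, 2: {"a": {3}}, 3: {EPS: {6}},
    4: {"b": {5}},    5: {EPS: {6}},    6: {EPS: {1, 7}},
    7: {"a": {8}},    8: {"b": {9}},    9: {"b": {10}}, 10: {},
}
NFA_START, NFA_FINAL = 0, 10

def eps_closure(states):
    """All NFA states reachable from `states` using ε-transitions only."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in NFA[s].get(EPS, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def move(states, symbol):
    """NFA states reachable from `states` on one `symbol` transition."""
    return {t for s in states for t in NFA[s].get(symbol, ())}

def subset_construction(alphabet=("a", "b")):
    """Return (start, dfa), where each DFA state is a frozenset of NFA states."""
    start = eps_closure({NFA_START})
    dfa, worklist = {}, [start]
    while worklist:
        S = worklist.pop()
        if S in dfa:
            continue
        dfa[S] = {}
        for sym in alphabet:
            U = eps_closure(move(S, sym))
            dfa[S][sym] = U
            if U not in dfa:
                worklist.append(U)
    return start, dfa

def dfa_accepts(s):
    """Run the constructed DFA on a string over {a, b}."""
    start, dfa = subset_construction()
    state = start
    for ch in s:
        state = dfa[state][ch]
    return NFA_FINAL in state  # a DFA state is final if it contains NFA state 10

start, dfa = subset_construction()
print(sorted(start), len(dfa))  # prints [0, 1, 2, 4, 7] 5
```

Under this assumed numbering, ε-closure(0) is {0, 1, 2, 4, 7} and the construction yields five DFA states.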

7. Recognition of tokens
Consider the language generated by the following grammar, whose tokens the lexical analyzer must recognize. The grammar is:

State 9:  c = nextchar();
          if LETTER(c) then STATE = 10 else FAIL();
State 10: c = nextchar();
          if LETTER(c) or DIGIT(c) then STATE = 10
          else if OTHER(c) then STATE = 11
          else FAIL();
State 11: return (getToken(), install_ID())
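
The three states above can be simulated directly. The following Python sketch is one possible reading, with some assumptions made explicit: the function name scan_identifier is hypothetical, any character that is neither a letter nor a digit (including end of input) is treated as OTHER, and instead of calling getToken() and install_ID() the function returns the lexeme and the position of the unconsumed lookahead character.

```python
def scan_identifier(text, pos=0):
    """Simulate states 9, 10, 11; return (lexeme, next_pos), or None on failure."""
    state = 9
    fwd = pos
    while True:
        c = text[fwd] if fwd < len(text) else ""  # "" stands for end of input
        if state == 9:
            if c.isalpha():
                state = 10
                fwd += 1
            else:
                return None                 # FAIL(): no identifier starts here
        elif state == 10:
            if c.isalpha() or c.isdigit():
                fwd += 1                    # stay in state 10
            else:
                state = 11                  # OTHER(c): lookahead ends the identifier
        else:                               # state 11
            return text[pos:fwd], fwd       # the lookahead character is not consumed

print(scan_identifier("count = 123"))  # prints ('count', 5)
```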

8. Language for specifying lexical analyzers (Lex, flex): the LEX tool

Consider the language generated by the following grammar, whose tokens the lexical analyzer must recognize. The grammar is:

The following is a LEX program that recognizes tokens of various categories: white space, identifiers, numbers, relational operators, and the keywords if, then, and else.

9. Design of scanner or Lexical Analyzer generator

First construct an NFA for each pattern Pi in the LEX program. Then construct a compound NFA that recognizes all strings represented by all the patterns.
[Figure: NFAs for Pattern P1, Pattern P2, and Pattern P3]
In the next step, convert this compound NFA to a DFA that the LA uses to recognize tokens.

Converting the above NFA to a DFA:
The starting state A of the DFA is ε-closure(0), which is the set of NFA states {0, 1, 3, 7}.
[Table: DFA states A through F, each a set of NFA states; A = {0, 1, 3, 7}]
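
What the combined automaton has to decide can be sketched without building it: at each input position, pick the longest match, and break ties in favor of the pattern listed first. The sketch below is hedged on several points. The patterns a, abb, and a*b+ are assumed stand-ins for P1, P2, and P3 (the slides' actual patterns are not reproduced in this text), the function names are hypothetical, and Python's re module is used per pattern in place of the single DFA that a generated scanner would run.

```python
import re

# Assumed stand-ins for P1, P2, P3, in priority order (earlier wins ties).
PATTERNS = [
    ("P1", re.compile(r"a")),
    ("P2", re.compile(r"abb")),
    ("P3", re.compile(r"a*b+")),
]

def next_token(text, pos):
    """Return (pattern_name, end) for the longest match at pos, or None."""
    best = None
    for name, rx in PATTERNS:
        m = rx.match(text, pos)
        # Strictly longer matches replace the current best, so on a tie
        # the pattern listed first is kept.
        if m and m.end() > pos and (best is None or m.end() > best[1]):
            best = (name, m.end())
    return best

def scan(text):
    """Split text into (pattern_name, lexeme) pairs using longest match."""
    pos, out = 0, []
    while pos < len(text):
        hit = next_token(text, pos)
        if hit is None:
            raise ValueError(f"no pattern matches at position {pos}")
        name, end = hit
        out.append((name, text[pos:end]))
        pos = end
    return out

print(scan("aaba"))  # prints [('P3', 'aab'), ('P1', 'a')]
print(scan("abb"))   # prints [('P2', 'abb')]
```

For these three patterns the greedy per-pattern match is also the longest one; for arbitrary patterns a backtracking matcher does not guarantee that, which is one reason scanner generators work from a DFA.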
