CSc 453
Lexical Analysis
(Scanning)
Saumya Debray
The University of Arizona
Tucson
CSc 453: Lexical Analysis 2
Overview
 Main task: to read input characters and group them into
“tokens.”
 Secondary tasks:
 Skip comments and whitespace;
 Correlate error messages with source program (e.g., line number of error).
(Figure: the source program is read by the lexical analyzer (scanner), which passes tokens to the syntax analyzer (parser); both interact with the symbol table manager.)
Overview (cont’d)
CSc 453: Lexical Analysis 3
Input file (a stream of characters):

    /* pgm.c */
    int main(int argc, char **argv)
    {
        int x, y;
        float w;
        ...

The lexical analyzer turns this character stream into the token sequence:

    keywd_int
    identifier: “main”
    left_paren
    keywd_int
    identifier: “argc”
    comma
    keywd_char
    star
    star
    identifier: “argv”
    right_paren
    left_brace
    keywd_int
    …
CSc 453: Lexical Analysis 4
Implementing Lexical Analyzers
Different approaches:
 Using a scanner generator, e.g., lex or flex. This automatically
generates a lexical analyzer from a high-level description of the tokens.
(easiest to implement; least efficient)
 Programming it in a language such as C, using the I/O facilities of the
language.
(intermediate in ease, efficiency)
 Writing it in assembly language and explicitly managing the input.
(hardest to implement, but most efficient)
CSc 453: Lexical Analysis 5
Lexical Analysis: Terminology
 token: a name for a set of input strings with related
structure.
Example: “identifier,” “integer constant”
 pattern: a rule describing the set of strings
associated with a token.
Example: “a letter followed by zero or more letters, digits, or
underscores.”
 lexeme: the actual input string that matches a
pattern.
Example: count
CSc 453: Lexical Analysis 6
Examples
Input: count = 123
Tokens:
identifier : Rule: “letter followed by …”
Lexeme: count
assg_op : Rule: =
Lexeme: =
integer_const : Rule: “digit followed by …”
Lexeme: 123
CSc 453: Lexical Analysis 7
Attributes for Tokens
 If more than one lexeme can match the pattern for a
token, the scanner must indicate the actual lexeme
that matched.
 This information is given using an attribute
associated with the token.
Example: The program statement
count = 123
yields the following token-attribute pairs:
⟨identifier, pointer to the string “count”⟩
⟨assg_op, ⟩ (no attribute needed)
⟨integer_const, the integer value 123⟩
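For instance, a token and its attribute can be packaged together as a small record. A minimal C sketch (the type and field names are illustrative, not from the slides):

    typedef enum { IDENTIFIER, ASSG_OP, INTEGER_CONST /* ... */ } TokenType;

    typedef struct {
        TokenType type;
        union {
            char *name;     /* IDENTIFIER: pointer to the lexeme, e.g. "count" */
            long  value;    /* INTEGER_CONST: the converted value, e.g. 123 */
        } attr;             /* ASSG_OP carries no attribute */
    } Token;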
CSc 453: Lexical Analysis 8
Specifying Tokens: regular expressions
 Terminology:
alphabet : a finite set of symbols
string : a finite sequence of alphabet symbols
language : a (finite or infinite) set of strings.
 Regular Operations on languages:
Union: R ∪ S = { x | x ∈ R or x ∈ S }
Concatenation: RS = { xy | x ∈ R and y ∈ S }
Kleene closure: R* = R concatenated with itself 0 or more times
  = {ε} ∪ R ∪ RR ∪ RRR ∪ …
  = strings obtained by concatenating a finite number of strings from the set R.
CSc 453: Lexical Analysis 9
Regular Expressions
A pattern notation for describing certain kinds of sets of strings.
Given an alphabet Σ:
 ε is a regular exp. (denotes the language {ε})
 for each a ∈ Σ, a is a regular exp. (denotes the language {a})
 if r and s are regular exps. denoting L(r) and L(s) respectively, then so are:
   (r) | (s)  (denotes the language L(r) ∪ L(s))
   (r)(s)  (denotes the language L(r)L(s))
   (r)*  (denotes the language L(r)*)
CSc 453: Lexical Analysis 10
Common Extensions to r.e. Notation
 One or more repetitions of r : r+
 A range of characters : [a-zA-Z], [0-9]
 An optional expression: r?
 Any single character: .
 Giving names to regular expressions, e.g.:
 letter = [a-zA-Z_]
 digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
 ident = letter ( letter | digit )*
 integer_const = digit+
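As an illustration of these definitions, here is a small hand-coded C check that a string matches ident = letter ( letter | digit )* (a sketch; the function names are assumptions):

    #include <ctype.h>

    static int is_letter(int c) { return isalpha(c) || c == '_'; }   /* letter = [a-zA-Z_] */

    int matches_ident(const char *s)
    {
        if (!is_letter((unsigned char)*s))            /* must start with a letter */
            return 0;
        for (s++; *s != '\0'; s++)                    /* then zero or more letters/digits */
            if (!is_letter((unsigned char)*s) && !isdigit((unsigned char)*s))
                return 0;
        return 1;
    }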
CSc 453: Lexical Analysis 11
Recognizing Tokens: Finite Automata
A finite automaton is a 5-tuple (Q, Σ, T, q0, F), where:
 Σ is a finite alphabet;
 Q is a finite set of states;
 T: Q × Σ → Q is the transition function;
 q0 ∈ Q is the initial state; and
 F ⊆ Q is a set of final states.
CSc 453: Lexical Analysis 12
Finite Automata: An Example
A (deterministic) finite automaton (DFA) to match C-style comments:
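A hand-coded C sketch of such a DFA (the figure itself is not reproduced here; the state names and function interface are assumptions):

    #include <stdio.h>

    int match_comment(FILE *in)   /* 1 if the stream starts with a C-style comment */
    {
        enum { START, SAW_SLASH, IN_COMMENT, SAW_STAR } state = START;
        int c;
        while ((c = getc(in)) != EOF) {
            switch (state) {
            case START:      if (c == '/') state = SAW_SLASH;  else return 0;  break;
            case SAW_SLASH:  if (c == '*') state = IN_COMMENT; else return 0;  break;
            case IN_COMMENT: if (c == '*') state = SAW_STAR;                   break;
            case SAW_STAR:   if (c == '/') return 1;                /* accept */
                             else if (c != '*') state = IN_COMMENT;            break;
            }
        }
        return 0;   /* input ended before the comment was closed */
    }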
CSc 453: Lexical Analysis 13
Formalizing Automata Behavior
To formalize automata behavior, we extend the
transition function to deal with strings:
Ŧ : Q × Σ* → Q
Ŧ(q, ε) = q
Ŧ(q, aw) = Ŧ(r, w), where r = T(q, a)
The language accepted by an automaton M is
L(M) = { w | Ŧ(q0, w) ∈ F }.
A language L is regular if it is accepted by
some finite automaton.
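In code, the extended transition function simply runs the single-step transition over the whole string. A C sketch (the table T and its dimensions are assumptions, not part of the slides):

    #define NSTATES 4

    extern int T[NSTATES][256];          /* single-step transition: T[q][a] = next state */

    int run(int q, const char *w)        /* computes Ŧ(q, w) */
    {
        for (; *w != '\0'; w++)
            q = T[q][(unsigned char)*w];
        return q;
    }

    /* w is accepted iff run(q0, w) is a final state. */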
CSc 453: Lexical Analysis 14
Finite Automata and Lexical Analysis
 The tokens of a language are specified using
regular expressions.
 A scanner is a big DFA, essentially the
“aggregate” of the automata for the individual
tokens.
 Issues:
 What does the scanner automaton look like?
 How much should we match? (When do we stop?)
 What do we do when a match is found?
 Buffer management (for efficiency reasons).
CSc 453: Lexical Analysis 15
Structure of a Scanner Automaton
CSc 453: Lexical Analysis 16
How much should we match?
In general, find the longest match possible.
E.g., on input 123.45, match this as
num_const(123.45)
rather than
num_const(123), “.”, num_const(45).
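One common way to implement this longest-match rule is to keep running the combined DFA while remembering the last final state reached, and to back up to that point when the automaton gets stuck. A C sketch (the table T and the helpers accepting() and token_for() are assumptions, as is the string-based interface):

    #define NO_STATE     (-1)
    #define START_STATE  0
    #define ERROR_TOKEN  (-1)

    extern int T[][256];                 /* combined transition table */
    extern int accepting(int state);     /* is this a final state? */
    extern int token_for(int state);     /* token reported by this final state */

    int scan_token(const char *input, int *length)
    {
        int state = START_STATE;
        int last_final = NO_STATE, last_final_pos = 0;
        int pos = 0;

        while (input[pos] != '\0') {
            int next = T[state][(unsigned char)input[pos]];
            if (next == NO_STATE)
                break;                   /* no transition enabled: stop */
            state = next;
            pos++;
            if (accepting(state)) {      /* remember the longest match seen so far */
                last_final = state;
                last_final_pos = pos;
            }
        }
        if (last_final == NO_STATE)
            return ERROR_TOKEN;          /* nothing matched */
        *length = last_final_pos;        /* lexeme = input[0 .. length-1] */
        return token_for(last_final);    /* scanning resumes at input + *length */
    }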
CSc 453: Lexical Analysis 17
Input Buffering
 Scanner performance is crucial:
 This is the only part of the compiler that examines the entire
input program one character at a time.
 Disk input can be slow.
 The scanner accounts for ~25-30% of total compile time.
 We need lookahead to determine when a
match has been found.
 Scanners use double-buffering to minimize
the overheads associated with this.
CSc 453: Lexical Analysis 18
Buffer Pairs
 Use two N-byte buffers (N = size of a disk block;
typically, N = 1024 or 4096).
 Read N bytes into one half of the buffer each time.
If input has less than N bytes, put a special EOF
marker in the buffer.
 When one buffer has been processed, read N bytes
into the other buffer (“circular buffers”).
CSc 453: Lexical Analysis 19
Buffer pairs (cont’d)
Code:
    if (fwd at end of first half) {
        reload second half;
        set fwd to point to beginning of second half;
    }
    else if (fwd at end of second half) {
        reload first half;
        set fwd to point to beginning of first half;
    }
    else
        fwd++;

Note that it takes two tests for each advance of the fwd pointer.
CSc 453: Lexical Analysis 20
Buffer pairs: Sentinels
 Objective: Optimize the common case by reducing
the number of tests to one per advance of fwd.
 Idea: Extend each buffer half to hold a sentinel at
the end.
 This is a special character that cannot occur in a
program (e.g., EOF).
 It signals the need for some special action (fill
other buffer-half, or terminate processing).
CSc 453: Lexical Analysis 21
Buffer pairs with sentinels (cont’d)
Code:
    fwd++;
    if ( *fwd == EOF ) {   /* special processing needed */
        if (fwd at end of first half)
            . . .
        else if (fwd at end of second half)
            . . .
        else   /* end of input */
            terminate processing;
    }

The common case now needs just a single test per character.
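A more concrete C sketch of the buffer-pair-with-sentinel scheme (the function names and the choice of '\0' as the sentinel are assumptions; any byte that cannot occur in the source text, such as the slides' EOF, would do):

    #include <stdio.h>

    #define N        4096        /* size of each buffer half (one disk block) */
    #define SENTINEL '\0'        /* a byte that cannot occur in the source text */

    static char  buf[2 * N + 2]; /* two halves, each followed by a sentinel slot */
    static char *fwd;            /* forward (lookahead) pointer */
    static FILE *src;

    static void fill(char *half) /* read up to N bytes, then drop the sentinel */
    {
        size_t n = fread(half, 1, N, src);
        half[n] = SENTINEL;
    }

    void scanner_init(FILE *f)
    {
        src = f;
        fill(buf);                     /* first half: buf[0..N-1], sentinel at buf[N] */
        buf[2 * N + 1] = SENTINEL;     /* second half: buf[N+1..2N], sentinel at buf[2N+1] */
        fwd = buf;
    }

    int next_char(void)     /* common case: one sentinel test per character */
    {
        if (*fwd == SENTINEL) {
            if (fwd == buf + N) {                  /* end of first half: refill the second */
                fill(buf + N + 1);
                fwd = buf + N + 1;
            } else if (fwd == buf + 2 * N + 1) {   /* end of second half: refill the first */
                fill(buf);
                fwd = buf;
            }
            if (*fwd == SENTINEL)                  /* sentinel mid-half, or an empty refill */
                return EOF;                        /* real end of input */
        }
        return (unsigned char)*fwd++;
    }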
CSc 453: Lexical Analysis 22
Handling Reserved Words
1. Hard-wire them directly into the scanner
automaton:
 harder to modify;
 increases the size and complexity of the automaton;
 performance benefits unclear (fewer tests, but cache effects
due to larger code size).
2. Fold them into “identifier” case, then look up
a keyword table:
 simpler, smaller code;
 table lookup cost can be mitigated using perfect hashing.
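A sketch of the second approach in C, using a sorted table and bsearch() (the token codes and keyword list are assumptions; the slides' perfect-hashing suggestion would replace the binary search with a direct hash lookup):

    #include <stdlib.h>
    #include <string.h>

    enum { TOK_IDENT, TOK_KEYWORD };     /* token codes for this sketch */

    static const char *keywords[] = {    /* must be kept sorted for bsearch() */
        "char", "else", "float", "if", "int", "return", "while"
    };

    static int cmp(const void *key, const void *elem)
    {
        return strcmp((const char *)key, *(const char **)elem);
    }

    int classify_identifier(const char *lexeme)   /* called after "identifier" matches */
    {
        size_t n = sizeof keywords / sizeof keywords[0];
        return bsearch(lexeme, keywords, n, sizeof keywords[0], cmp)
                   ? TOK_KEYWORD : TOK_IDENT;
    }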
CSc 453: Lexical Analysis 23
Implementing Finite Automata 1
Encoded as program code:
 each state corresponds to a labeled code fragment;
 state transitions are represented as control transfers.
E.g.:
while ( TRUE ) {
    …
state_k:
    ch = NextChar();           /* buffer mgt happens here */
    switch (ch) {
    case … : goto ...;         /* state transition */
    …
    }
state_m:                       /* final state */
    copy lexeme to where parser can get at it;
    return token_type;
    …
}
CSc 453: Lexical Analysis 24
Direct-Coded Automaton: Example
int scanner()
{
    char ch;
    while (TRUE) {
        ch = NextChar( );
state_1:                          /* initial state */
        switch (ch) {
        case 'a' : goto state_2;
        case 'b' : goto state_3;
        default  : Error();
        }
state_2: …
state_3:
        switch (ch) {
        case 'a' : goto state_2;
        default  : return SUCCESS;
        }
    } /* while */
}
CSc 453: Lexical Analysis 25
Implementing Finite Automata 2
Table-driven automata (e.g., lex, flex):
 Use a table to encode transitions:
next_state = T(curr_state, next_char);
 Use one bit in state no. to indicate whether it’s a final (or
error) state. If so, consult a separate table for what action to
take.
(Figure: the table T is indexed by the current state and the next input character.)
CSc 453: Lexical Analysis 26
Table-Driven Automaton: Example
#define isFinal(s)  ((s) < 0)

int scanner()
{
    char ch;
    int currState = 1;
    while (TRUE) {
        ch = NextChar( );
        if (ch == EOF) return 0;        /* fail */
        currState = T[currState][ch];
        if (isFinal(currState)) {
            return 1;                   /* success */
        }
    } /* while */
}
Transition table T:

              input
    state     a     b
      1       2     3
      2       2     3
      3       2    -1
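Filling in the pieces, here is a self-contained C version of this example with the transition table above encoded directly (the column mapping and the driver in main() are assumptions added for illustration; input is assumed to be over {a, b}):

    #include <stdio.h>

    #define isFinal(s) ((s) < 0)

    /* Transition table from the slide: rows are states 1-3, columns are 'a','b'.
       The negative entry marks the final state (sign bit = "final"). */
    static int T[4][2] = {
        {0, 0},      /* row 0 unused: states are numbered from 1 */
        {2, 3},      /* state 1: a -> 2, b -> 3 */
        {2, 3},      /* state 2: a -> 2, b -> 3 */
        {2, -1}      /* state 3: a -> 2, b -> final */
    };

    int scanner(const char *input)   /* string-based stand-in for NextChar() */
    {
        int currState = 1;
        for (; *input != '\0'; input++) {
            int col = (*input == 'a') ? 0 : 1;
            currState = T[currState][col];
            if (isFinal(currState))
                return 1;            /* success */
        }
        return 0;                    /* fail: input exhausted before a final state */
    }

    int main(void)
    {
        printf("%d\n", scanner("abb"));   /* 1: final state reached after "abb" */
        printf("%d\n", scanner("aba"));   /* 0: no final state reached */
        return 0;
    }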
CSc 453: Lexical Analysis 27
What do we do on finding a match?
 A match is found when:
 The current automaton state is a final state; and
 No transition is enabled on the next input character.
 Actions on finding a match:
 if appropriate, copy lexeme (or other token attribute) to where
the parser can access it;
 save any necessary scanner state so that scanning can
subsequently resume at the right place;
 return a value indicating the token found.