Compiler Design
Chapter Two
This Chapter Covers:
Role of lexical analyser
Token Specification and Recognition
NFA to DFA
Lexical Analyzer
Lexical Analyzer reads the source program character by character to produce tokens.
Normally a lexical analyzer does not return a list of tokens in one shot; it returns a
token each time the parser requests one.
Token
Token represents a set of strings described by a pattern.
Identifier represents the set of strings that start with a letter and continue
with letters and digits.
Lexeme: a sequence of characters in the source program that is matched
by the pattern for a token.
Tokens: identifier, number, addop, delimiter, …
Since a token can represent more than one lexeme, additional information
should be held for that specific lexeme. This additional information is called
the attribute of the token.
For simplicity, a token may have a single attribute which holds the required
information for that token.
For identifiers, this attribute is a pointer to the symbol table, and the
symbol table holds the actual attributes for that token.
Token (Cont.)
Some attributes:
<id,attr> where attr is a pointer to the symbol table
<assgop,_> no attribute is needed (if there is only one assignment operator)
<num,val> where val is the actual value of the number.
A token type and its attribute uniquely identify a lexeme.
Regular expressions are widely used to specify patterns.
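As an illustration, here is a minimal C sketch of a token with a single attribute (the names Token, TOK_*, and the union layout are assumptions for illustration, not the slides' scheme):

#include <stdio.h>

/* A token is a type tag plus one attribute. */
enum TokenType { TOK_ID, TOK_ASSGOP, TOK_NUM };

typedef struct {
    enum TokenType type;
    union {
        int sym_index;  /* <id,attr>: index (pointer) into the symbol table */
        int value;      /* <num,val>: the actual value of the number */
    } attr;             /* <assgop,_>: attribute unused */
} Token;

int main(void) {
    Token t = { TOK_NUM, { .value = 1 } };  /* the token <num,1> */
    printf("token type %d, value %d\n", t.type, t.attr.value);
    return 0;
}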
Scanner
A scanner groups (classes together) input characters into tokens.
For example, if the input is:
x = x*(b+1); then the scanner generates the following sequence of tokens
id(x), =, id(x), *, (, id(b), +, num(1), ), ;
where id(x) indicates the identifier with name x (a programme variable in this case) and
num(1) indicates the integer 1.
Each time the parser needs a token, it sends a request to the scanner.
Then, the scanner reads as many characters from the input stream as are necessary to
construct a single token.
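To make this pull model concrete, here is a minimal C sketch that tokenizes the example input above; the function and token names are illustrative assumptions, and the token classes are deliberately coarse:

#include <stdio.h>
#include <ctype.h>

enum { TOK_EOF, TOK_ID, TOK_NUM, TOK_OTHER };

static const char *src = "x = x*(b+1);";

/* Called once per parser request: consumes just enough characters
   to recognize a single token. */
static int get_next_token(void) {
    while (*src == ' ') src++;                /* skip whitespace */
    if (*src == '\0') return TOK_EOF;
    if (isalpha((unsigned char)*src)) {       /* identifier: letter (letter|digit)* */
        while (isalnum((unsigned char)*src)) src++;
        return TOK_ID;
    }
    if (isdigit((unsigned char)*src)) {       /* number: digit+ */
        while (isdigit((unsigned char)*src)) src++;
        return TOK_NUM;
    }
    src++;                                    /* single-character token */
    return TOK_OTHER;
}

int main(void) {
    for (int t = get_next_token(); t != TOK_EOF; t = get_next_token())
        printf("token class %d\n", t);
    return 0;
}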
Scanner (Cont.)
The scanner may report an error during scanning.
Otherwise, when a single token is formed, the scanner is suspended (stops
being active temporarily) and returns the token to the parser.
The parser will repeatedly call the scanner until all the tokens have been read
from the input stream or an error is detected (such as a syntax error).
Some tokens require some extra information.
For example, an identifier is a token (so it is represented by some number)
but it is also associated with a string that holds the identifier name.
Scanner (Cont.)
For example, the token id(x) is associated with the string, "x".
Similarly, the token num(1) is associated with the number, 1.
Tokens are specified by patterns, called regular expressions.
For example, the regular expression [a-z][a-zA-Z0-9]* recognises all identifiers with at
least one alphanumeric letter whose first letter is lower-case alphabetic.
A typical scanner:
recognises the keywords of the language (these are the reserved words that have a
special meaning in the language, such as the word class in Java);
recognises and processes special directives (such as the #include "file" directive in C).
Scanner (Cont.)
recognises special characters, such as parentheses ( and ), or groups of special
characters, such as := (equal by definition) and ==;
recognises identifiers, integers, reals, decimals, strings, etc;
ignores whitespaces and comments;
Efficient scanners can be built using regular expressions and finite automata.
There are automated tools called scanner generators, such as flex (Fast Lexical
Analyzer Generator) for C and JLex for Java, which construct a fast scanner
automatically according to specifications (regular expressions).
Role of Lexical Analyser
The lexical analyzer performs the following tasks:
Removes white space and comments from the source program.
Correlates error messages with the source program.
Reads input characters from the source program.
Helps enter identified tokens into the symbol table.
Example: symbol table for the following code:
// Define a global function
int add(int a, int b) {
    int sum = 0;
    sum = a + b;
    return sum;
}
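The slide's symbol-table figure did not survive extraction; for this code its entries would look roughly as follows (illustrative reconstruction):

Lexeme   Token   Attribute
add      id      function, returns int
a        id      int, parameter
b        id      int, parameter
sum      id      int, local variable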
Lexical Analysis
In lexical analysis, we read the source programme character by character and
convert the characters into tokens.
A token is the smallest unit recognisable by the compiler.
Generally, we have four classes of tokens that are usually recognised and they are:
1. Keywords
2. Identifiers
3. Constants
4. Delimiters
Construction of Lexical Analyser
There are two general ways to construct a lexical analyser:
Hand implementation
Automatic generation of lexical analyser
Hand Implementation
There are two ways to use hand implementation:
Input Buffer approach
Transition diagrams approach
Input Buffering
The lexical analyser scans the characters of the source programme one at a time to
discover tokens.
Cont.
Often, many characters beyond (past) the next
token may have to be examined before the next token itself
can be determined.
For this and other reasons, it is desirable for the lexical analyser
to read its input from an input buffer.
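A minimal sketch of such buffered input in C (a single buffer refilled from a FILE*; production scanners typically use the two-buffer scheme with sentinel characters, and also keep a lexeme-begin pointer):

#include <stdio.h>

#define BUFSZ 4096

static char buf[BUFSZ];
static int forward = 0;  /* current scanning position in the buffer */
static int nread = 0;    /* number of valid characters in the buffer */

/* Return the next input character, refilling the buffer when exhausted. */
static int next_char(FILE *src) {
    if (forward == nread) {
        nread = (int)fread(buf, 1, BUFSZ, src);
        forward = 0;
        if (nread == 0) return EOF;
    }
    return (unsigned char)buf[forward++];
}

int main(void) {
    int c;
    while ((c = next_char(stdin)) != EOF)
        putchar(c);
    return 0;
}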
Operations on Languages
Concatenation: the operation of joining two or more strings together.
L1L2 = { s1s2 | s1 ∈ L1 and s2 ∈ L2 }
Union: the operation of combining the strings of two or more languages.
L1 ∪ L2 = { s | s ∈ L1 or s ∈ L2 }
Exponentiation: repeated concatenation of the base language.
L⁰ = {ε}    L¹ = L    L² = LL
Kleene Closure: the infinite set of all possible strings over L, including
the empty string ε.
L* = L⁰ ∪ L¹ ∪ L² ∪ …
Positive Closure: the infinite set of all possible strings over L,
excluding ε.
L⁺ = L¹ ∪ L² ∪ L³ ∪ …
Example
L1 = {a,b,c,d}   L2 = {1,2}
L1L2 = {a1,a2,b1,b2,c1,c2,d1,d2}
L1 ∪ L2 = {a,b,c,d,1,2}
L1³ = all strings of length three over {a,b,c,d}
L1* (zero or more) = all strings using the letters a, b, c, d, including the
empty string ε.
a* can generate {ε, a, aa, aaa, …}
L1⁺ (one or more) = the same, but excluding the empty string ε.
a⁺ can generate {a, aa, aaa, …}
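The concatenation above can be checked mechanically; a tiny C sketch that enumerates L1L2 for these finite languages:

#include <stdio.h>

int main(void) {
    const char *L1[] = { "a", "b", "c", "d" };
    const char *L2[] = { "1", "2" };

    /* L1L2 = every s1s2 with s1 in L1 and s2 in L2 */
    printf("L1L2 = { ");
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 2; j++)
            printf("%s%s ", L1[i], L2[j]);
    printf("}\n");
    return 0;
}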
Regular Expressions (REs)
We use regular expressions to describe tokens of a
programming language.
Regular expressions are a very convenient (suitable) form of
representing (possibly infinite) sets of strings, called regular
sets.
For example, the RE (a|b)*aa represents the infinite set
{"aa", "aaa", "baa", "abaa", …}, which is the set of all strings
of characters a and b that end in aa.
RE Order of Precedence
We can freely put parentheses around REs to denote the
order of evaluation.
We can drop redundant parentheses by assuming:
the Kleene star operator * has the highest precedence and is
left associative
concatenation has the next highest precedence and is left
associative
the union operator | has the lowest precedence and is left
associative
Tokens/Patterns/Lexemes/Attributes
A token is a sequence of characters which represents a unit of
information in the source program.
Lexeme: a sequence of characters in the source program that
is matched by the pattern for a token
A Pattern is a rule describing the set of lexemes that can
represent a particular token in source programs.
In other words: a set of strings in the input for which the same
token is produced as output.
A pattern is specified by a regular expression or a regular definition.
Cont.
An attribute of a token is usually a pointer to the symbol table
entry that gives additional information about the token, such
as its type, value, line number, etc.
Attributes are used to distinguish different lexemes that match the same token.
Example:
Lexeme           Token                 Pattern
int              keyword               int
if               keyword               if
<,<=,=,>,>=      relational operator   < or <= or = or > or >=
newval           identifier            letter (letter | digit)*
Regular Expression
To write regular expression for some languages can be difficult,
because their regular expressions can be quite complex. In
those cases, we may use regular definitions.
A regular definition is a sequence of definitions of the form
d1 → r1, …, dn → rn, where each di is a distinct name and each ri is a
regular expression. For example:
digit → [0-9] ---- any of the numerals from 0 to 9.
letter → [A-Za-z] ---- the set of upper- and lower-case letters.
id → letter ( letter | digit )* ---- a letter followed by any sequence of
letters and digits.
relop → < | > | <= | >= | = | <>
Lexical Analyser generator Lex
There are tools that can generate lexical analyzers.
Lex is a special-purpose programming language for creating
programmes to process streams of input characters.
An input file, which we call lex.l, is written in the Lex
language and describes the lexical analyzer to be generated.
The Lex compiler transforms lex.l to a C program, in a file
that is always named lex.yy.c. The latter file is compiled by
the C compiler into a file called a.out, as always.
The C-compiler output is a working lexical analyzer that can
take a stream of input characters and produce a stream of
tokens.
Cont.
Creating a lexical analyzer with Lex
Lex Specifications
Lex source is separated into three sections by %% delimiters
Declarations: declarations of variables, constants, and regular definitions.
%%
Translation rules: rules that match tokens in the input stream (regular
expressions paired with actions).
%%
Auxiliary functions (optional)
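Putting the three sections together, here is a small illustrative Lex specification (a sketch, not a complete scanner) that reuses the regular definitions given earlier to recognize numbers, identifiers, and relational operators:

%{
/* Declarations section: C code copied into lex.yy.c. */
#include <stdio.h>
%}
digit    [0-9]
letter   [A-Za-z]
id       {letter}({letter}|{digit})*
%%
{digit}+                     { printf("NUM(%s)\n", yytext); }
{id}                         { printf("ID(%s)\n", yytext); }
"<="|">="|"<>"|"<"|">"|"="   { printf("RELOP(%s)\n", yytext); }
[ \t\n]+                     { /* ignore whitespace */ }
.                            { printf("OTHER(%s)\n", yytext); }
%%
/* Auxiliary functions section. */
int yywrap(void) { return 1; }
int main(void)   { yylex(); return 0; }

Compiling with lex lex.l && cc lex.yy.c produces a.out, as described above.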
Steps in lex implementation
1. Read input language specification
2. Construct NFA with epsilon-moves (Can also do DFA
directly)
3. Convert NFA to DFA
4. Optimise the DFA
5. Generate parsing tables & code
Finite Automata
A finite automaton can be deterministic (DFA) or non-deterministic (NFA).
Both deterministic and non-deterministic finite automata
recognize regular sets.
Which one?
deterministic – a faster recognizer, but it may take more space
non-deterministic – slower, but it may take less space
Deterministic automata are widely used in lexical analyzers.
First, we define regular expressions for tokens; Then we
convert them into a DFA to get a lexical analyzer for our
tokens.
Finite Automata (Cont.)
Algorithm 1: Regular Expression → NFA → DFA (two steps:
first to an NFA, then to a DFA)
Algorithm 2: Regular Expression → DFA (directly convert a
regular expression into a DFA)
Converting a RE to an NFA
Every regular expression (RE) can be converted into an
equivalent NFA.
Every NFA can be converted into an equivalent DFA.
The task of a scanner generator, such as JLex, is to generate
the transition tables or to synthesise the scanner programme
given a scanner specification (in the form of a set of REs).
This is accomplished in two steps: first it converts REs into
an NFA and then it converts the NFA into a DFA
(Algorithm1).
Converting a RE to an NFA (Cont.)
An NFA is similar to a DFA, but it also permits (allows)
multiple transitions over the same character and transitions
over ɛ.
In the case of multiple transitions from a state over the same
character, when we are at this state and we read this
character, we have more than one choice; the NFA succeeds
if at least one of these choices leads to acceptance.
The ɛ-transition does not consume any input characters, so
you may jump to another state for free.
Clearly DFAs are a subset of NFAs.
Non-Deterministic Finite Automaton
A non-deterministic finite automaton (NFA) is a
mathematical model that consists (is made up) of:
S – a set of states
Σ (sigma) – a set of input symbols (the alphabet)
move – a transition function
s0 – a start (initial) state
F – a set of accepting (final) states
ɛ-transitions are allowed in NFAs. In other words, we can
move from one state to another without consuming any
symbol.
Non-Deterministic Finite Automaton
An NFA accepts a string x if and only if there is a path from
the start state to one of the accepting states such that the
edge labels along this path spell out x.
NFA (Example)
0 is the start state s0
{2} is the set of final states F
Σ = {a,b}
S = {0,1,2}
Transition function:
state    a        b
0        {0,1}    {0}
1        –        {2}
2        –        –
The language recognized by this NFA is (a|b)*ab
Transition Tables
We can also represent an NFA by a transition table, whose
rows correspond to states, and whose columns correspond to
the input symbols and ɛ.
The entry for a given state and input is the value of the
transition function applied to those arguments.
If the transition function has no information about a state-input
pair, we put ∅ in the table for that pair.
Transition table for the NFA of RE (a|b)*abb
Deterministic Finite Automaton (DFA)
A Deterministic Finite Automaton (DFA) is a special form of
a NFA.
no state has an ɛ-transition
for each state s and input symbol a there is exactly one
transition out of s labelled a.
A DFA represents a finite state machine that recognises a
RE.
The language recognized by this DFA is also (a|b)*ab
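Such a DFA is easy to drive from its transition table. A C sketch for (a|b)*ab (the state numbering is an assumption, since the slide's diagram did not survive extraction; state 2 is accepting):

#include <stdio.h>

/* delta[state][symbol], symbol 0 = 'a', 1 = 'b' */
static int delta[3][2] = {
    /*        a  b */
    /* q0 */ {1, 0},
    /* q1 */ {1, 2},
    /* q2 */ {1, 0},
};

static int accepts(const char *s) {
    int state = 0;
    for (; *s; s++) {
        if (*s != 'a' && *s != 'b') return 0;  /* not in the alphabet */
        state = delta[state][*s - 'a'];        /* exactly one move per character */
    }
    return state == 2;
}

int main(void) {
    const char *tests[] = { "ab", "aab", "babab", "abb", "" };
    for (int i = 0; i < 5; i++)
        printf("%-6s -> %s\n", tests[i], accepts(tests[i]) ? "accept" : "reject");
    return 0;
}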
Converting an NFA into a DFA (Example)
ɛ-closure({0}) = {0,1,2,4,7}
mark S0
ɛ-closure(move(S0,a)) = ɛ-closure({3,8}) = {1,2,3,4,6,7,8} = S1
ɛ-closure(move(S0,b)) = ɛ-closure({5}) = {1,2,4,5,6,7} = S2
transfunc[S0,a] ← S1    transfunc[S0,b] ← S2
mark S1
ɛ-closure(move(S1,a)) = ɛ-closure({3,8}) = {1,2,3,4,6,7,8} = S1
ɛ-closure(move(S1,b)) = ɛ-closure({5}) = {1,2,4,5,6,7} = S2
transfunc[S1,a] ← S1    transfunc[S1,b] ← S2
mark S2
ɛ-closure(move(S2,a)) = ɛ-closure({3,8}) = {1,2,3,4,6,7,8} = S1
ɛ-closure(move(S2,b)) = ɛ-closure({5}) = {1,2,4,5,6,7} = S2
transfunc[S2,a] ← S1    transfunc[S2,b] ← S2
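The closure sets above match a Thompson-style NFA with states 0–8 and accepting state 8; since the slide's diagram is missing, the transition encoding below is reconstructed from those sets. A C sketch of ɛ-closure and move using bitmasks:

#include <stdio.h>
#include <stdint.h>

#define NSTATES 9

/* Reconstructed NFA: eps[s] = states reachable from s on one epsilon move,
   on_a[s]/on_b[s] = states reachable from s on 'a'/'b' (as bitmasks). */
static uint16_t eps[NSTATES]  = { 1<<1|1<<7, 1<<2|1<<4, 0, 1<<6, 0, 1<<6, 1<<1|1<<7, 0, 0 };
static uint16_t on_a[NSTATES] = { 0, 0, 1<<3, 0, 0, 0, 0, 1<<8, 0 };
static uint16_t on_b[NSTATES] = { 0, 0, 0, 0, 1<<5, 0, 0, 0, 0 };

/* epsilon-closure of a state set, computed to a fixed point */
static uint16_t closure(uint16_t set) {
    uint16_t prev;
    do {
        prev = set;
        for (int s = 0; s < NSTATES; s++)
            if (set & (1 << s)) set |= eps[s];
    } while (set != prev);
    return set;
}

/* move(set, c): union of the transitions on character c */
static uint16_t move(uint16_t set, char c) {
    uint16_t *t = (c == 'a') ? on_a : on_b, out = 0;
    for (int s = 0; s < NSTATES; s++)
        if (set & (1 << s)) out |= t[s];
    return out;
}

int main(void) {
    uint16_t S0 = closure(1 << 0);        /* {0,1,2,4,7} */
    uint16_t S1 = closure(move(S0, 'a')); /* {1,2,3,4,6,7,8} */
    uint16_t S2 = closure(move(S0, 'b')); /* {1,2,4,5,6,7} */
    printf("S0=%#x S1=%#x S2=%#x\n", S0, S1, S2);
    printf("S1 accepting (contains state 8): %s\n", (S1 & (1 << 8)) ? "yes" : "no");
    return 0;
}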
Converting an NFA into a DFA (Cont.)
S0 is the start state of the DFA, since 0 is a member of
S0 = {0,1,2,4,7}.
S1 is an accepting state of the DFA, since 8 is a member of
S1 = {1,2,3,4,6,7,8}.
Converting RE Directly to DFAs
We may convert a regular expression into a DFA (without
creating an NFA first).
First we augment (make greater by adding to it) the given regular
expression by concatenating it with a special symbol #.
r → (r)# augmented regular expression
Then, we create a syntax tree for this augmented regular
expression.
In this syntax tree, all alphabet symbols (plus # and the
empty string) in the augmented regular expression will be on
the leaves, and all inner nodes will be the operators in that
augmented regular expression.
Regular Expression → DFA (Cont.)
Then each alphabet symbol (plus #) will be numbered
(position numbers).
(a|b)*a → (a|b)*a# augmented regular expression
[Syntax tree of (a|b)*a# (diagram not recoverable): the leaves are the
numbered symbols a(1), b(2), a(3), #(4); the inner nodes are the
operators |, *, and concatenation.]
• each symbol is numbered (positions)
• each symbol is at a leaf
• inner nodes are operators
Minimizing Number of States of a DFA
partition the set of states into two groups:
G1 : set of accepting states
G2 : set of non-accepting states
For each new group G
partition G into subgroups such that states s1 and s2 are in
the same group if, for all input symbols a, states s1 and s2
have transitions to states in the same group.
Start state of the minimized DFA is the group containing
the start state of the original DFA.
Accepting states of the minimized DFA are the groups
containing the accepting states of the original DFA.
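A sketch of this partition refinement in C, applied to the three-state DFA obtained from the subset construction above (S1 is accepting; every state moves to S1 on a and to S2 on b, so S0 and S2 end up in one group and the minimized DFA has two states):

#include <stdio.h>

#define N 3  /* states S0, S1, S2 */
#define A 2  /* alphabet {a, b} */

static int delta[N][A]  = { {1, 2}, {1, 2}, {1, 2} };  /* column 0 = a, 1 = b */
static int accepting[N] = { 0, 1, 0 };                 /* S1 contains NFA state 8 */

int main(void) {
    int group[N], next[N], prev_groups = 0;
    for (int s = 0; s < N; s++) group[s] = accepting[s];  /* initial split G1/G2 */

    for (;;) {
        int ngroups = 0;
        for (int s = 0; s < N; s++) next[s] = -1;
        for (int s = 0; s < N; s++) {
            if (next[s] != -1) continue;  /* already placed in a subgroup */
            next[s] = ngroups;
            for (int t = s + 1; t < N; t++) {
                /* s and t stay together iff they are in the same group and
                   agree, for every symbol, on the group of their successor */
                int same = (next[t] == -1 && group[t] == group[s]);
                for (int c = 0; same && c < A; c++)
                    if (group[delta[s][c]] != group[delta[t][c]]) same = 0;
                if (same) next[t] = ngroups;
            }
            ngroups++;
        }
        for (int s = 0; s < N; s++) group[s] = next[s];
        if (ngroups == prev_groups) break;  /* partition is stable */
        prev_groups = ngroups;
    }

    for (int s = 0; s < N; s++)
        printf("S%d -> group %d\n", s, group[s]);
    return 0;
}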
Minimizing DFA - Example
G1 = {2}
G2 = {1,3}
G2 cannot be partitioned further because
move(1,a)=2    move(1,b)=3
move(3,a)=2    move(3,b)=3
so states 1 and 3 go to the same groups on every input symbol.
So, the minimized DFA (with minimum states)
Minimizing DFA – Another Example
state    a    b
1        2    3
2        2    3
3        4    3
So, the minimized DFA
Quiz (5%)
1. Write a LEX program to implement a simple calculator.
2. L1 = {0,1}, L2 = {0,1}. L1L2 = ?
3. RE 0* = ?
4. RE (0|1)* = ?
5. RE (0|1)*11 = ?