Lexical Analysis
Role of the Lexical Analyzer
Remove comments and white space (also known as scanning)
Expand macros
Read input characters from the source program
Group them into lexemes
Produce as output a sequence of tokens
Interact with the symbol table
Correlate error messages generated by the compiler with the source program
Send tokens to the parser
Scanner-Parser Interaction
Scanners are usually implemented to produce tokens only when requested by a
parser.
Here is how it works:
1. "Get next token" is a command sent from the parser to the lexical analyzer.
2. On receiving this command, the lexical analyzer scans the input until it finds the next token.
3. It returns the token to the parser.
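A minimal sketch of this pull model in C; the names (getNextToken, TokenKind) are hypothetical, and only identifiers and numbers are recognized:

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical pull-style interface: the parser calls getNextToken()
   each time it needs another token. All names are illustrative. */
enum TokenKind { TOK_ID, TOK_NUM, TOK_EOF };

struct Token {
    enum TokenKind kind;
    char lexeme[32];
};

struct Token getNextToken(const char **src) {
    struct Token t = { TOK_EOF, "" };
    const char *p = *src;
    size_t n = 0;
    while (*p == ' ' || *p == '\t' || *p == '\n')
        p++;                              /* scanning: skip white space */
    if (*p == '\0') { *src = p; return t; }
    if (isdigit((unsigned char)*p)) {
        t.kind = TOK_NUM;
        while (isdigit((unsigned char)*p) && n < 31) t.lexeme[n++] = *p++;
    } else {
        t.kind = TOK_ID;                  /* group characters into a lexeme */
        while (isalnum((unsigned char)*p) && n < 31) t.lexeme[n++] = *p++;
        if (n == 0 && *p) t.lexeme[n++] = *p++;  /* single-char fallback */
    }
    t.lexeme[n] = '\0';
    *src = p;                             /* remember where to resume */
    return t;
}
```

Each call consumes exactly one token's worth of input, which is why the scanner can be driven on demand by the parser.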
https://www.slideshare.net/Amansharma1037
https://www.linkedin.com/in/includeaman
Issues in Lexical Analysis
Why separate lexical and syntactic analysis?
Simplicity of design:
A parser containing the rules for comments and white space is more complex to
make than a parser that can assume that comments and whitespaces have been
removed.
Improved compiler efficiency:
Reading source code and classifying it into tokens is a time-consuming task. Separating it from the parser allows us to use specialized techniques for the lexer, which can speed up scanning.
Higher portability: input-device-specific peculiarities are restricted to the lexer.
Basic Terminology
Token: a pair consisting of
Token name: an abstract symbol representing a lexical unit [affects parsing decisions]
Optional attribute value [influences translations after parsing]
Pattern: a description of the form that different lexemes take
Lexeme: a sequence of characters in the source program matching a pattern
Example
#include <stdio.h>

int maximum(int x, int y) {
    // This will compare 2 numbers
    if (x > y)
        return x;
    else {
        return y;
    }
}
Examples of Tokens created
Lexeme     Token
int        Keyword
maximum    Identifier
(          Operator
int        Keyword
x          Identifier
,          Operator
int        Keyword
y          Identifier
)          Operator
{          Operator
if         Keyword
Examples of Nontokens
Type Examples
Comment // This will compare 2 numbers
Pre-processor directive #include <stdio.h>
Pre-processor directive #define NUMS 8,9
Macro NUMS
Whitespace \n \b \t
Attributes for Tokens
When more than one lexeme can match a pattern, a lexical analyzer must provide the subsequent compiler phases additional information about the particular lexeme that matched.
Information about an identifier (its lexeme, its type, and the location at which it was first found) is kept in the symbol table.
The appropriate attribute value for an identifier is therefore a pointer to the symbol-table entry for that identifier.
Recall :
Tokens influence parsing decision;
The attributes influence the translation of tokens.
Example: tokens and attributes for the following Fortran statement
       E = M * C ** 2
<id, pointer to symbol-table entry for E>
<assign_op, >
<id, pointer to symbol-table entry for M>
<mult_op, >
<id, pointer to symbol-table entry for C>
<exp_op, >
<num, integer value 2>
Lexical Errors
A character sequence that cannot be scanned into any valid token is a lexical error.
It is hard for the lexical analyzer to tell, without the aid of other components, that there is a source-code error.
Example: if the string fi is encountered for the first time in a C program, the lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared identifier.
Probably the parser will be able to handle this case.
Error handling is thus very localized (limited) with respect to the input source.
For example: whil ( x = 0 ) do generates no lexical errors in Pascal, because whil is scanned as a perfectly valid identifier.
Handling Lexical Errors
Panic-mode recovery:
Delete successive characters from the remaining input until the analyzer can find a well-formed token.
This may confuse the parser by creating a syntax error.
Possible error recovery actions:
Deleting an extraneous character
Inserting a missing character
Replacing an incorrect character by a correct one
Transposing (exchanging) two adjacent characters
Input Buffering
Processing the characters of a large source program takes significant time.
Specialized buffering techniques have been developed to reduce the overhead required to process a single input character.
A lexical analyzer may need to look at least one character ahead to make a token decision.
For example:
We cannot be sure we have seen the end of an identifier until we see a character that is not a letter or digit, and therefore is not part of the lexeme for id.
In C, single-character operators like -, =, or < could also be the beginning of a two-character operator like ->, ==, or <=.
We first consider a two-buffer scheme that handles large lookaheads safely.
We then consider an improvement involving "sentinels" that saves time checking for the ends of buffers.
Buffer Pairs
Use a buffer (an area of memory) divided into two N-character halves, where N is the number of characters in one disk block.
One system call reads N characters, instead of one system call per character.
Fewer than N characters read => end of input (eof).
Two pointers into the input are maintained:
Pointer lexemeBegin marks the beginning of the current lexeme, whose extent we are attempting to determine.
Pointer forward scans ahead until a pattern match is found.
The string of characters between the pointers is the current lexeme.
Initially both pointers point to the first character of the next lexeme to be found.
Advancing forward requires that we first test whether we have reached the end of one of the buffers.
If so, we must reload the other buffer from the input and move forward to the beginning of the newly loaded buffer.
Once the next lexeme is determined, forward is set to the character at its right end.
After the lexeme is processed, both pointers are set to the character immediately past the lexeme.
Comments and white space can be treated as patterns that yield no token.
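The reload logic above can be sketched in C; N, load_half, and advance are illustrative names, and a string stands in for the input file:

```c
#include <string.h>

#define N 8   /* in practice, N is the size of a disk block, e.g. 4096 */

/* Buffer-pair sketch: one array holds two N-character halves that are
   refilled alternately as the forward pointer crosses a boundary. */
static char buffer[2 * N];
static const char *input;          /* stands in for the source file */

/* Fill one half from the simulated input; fewer than N chars => eof,
   marked here with a NUL terminator. */
static void load_half(int half) {
    int i;
    for (i = 0; i < N && *input; i++)
        buffer[half * N + i] = *input++;
    if (i < N)
        buffer[half * N + i] = '\0';
}

/* Advance the forward pointer, reloading the other half at a boundary. */
static char *advance(char *forward) {
    forward++;
    if (forward == buffer + N) {            /* end of first half */
        load_half(1);                       /* reload second half */
    } else if (forward == buffer + 2 * N) { /* end of second half */
        load_half(0);                       /* reload first half */
        forward = buffer;                   /* wrap to its beginning */
    }
    return forward;
}
```

Note that every advance pays for two boundary comparisons even in the common case; the sentinel scheme below removes that cost.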
Sentinels
During buffering, for each character read we make two tests:
one to check for the end of the buffer,
and a second to determine what character was read.
The two tests can be combined by using sentinels.
We can combine the buffer-end test with the test for the current character if we extend each buffer to hold a sentinel character at the end.
The sentinel is a special character that cannot be part of the source program, and a natural choice is the character eof.
The figure shows the same arrangement as before, but with the sentinels added.
Note that eof retains its use as a marker for the end of the entire input:
any eof that appears somewhere other than at the end of a buffer means that the input is at an end.
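A sketch of the sentinel version, reserving one sentinel slot after each half so the inner loop normally makes a single test. Here NUL stands in for the eof character, and all names are illustrative:

```c
#include <string.h>

#define N 8
#define SENTINEL '\0'   /* stand-in for eof; a real lexer uses a code
                           that cannot appear in the source program */

/* Each N-char half is followed by one sentinel slot. */
static char sbuf[2 * N + 2];
static const char *sinput;         /* stands in for the source file */

static void fill_half(int half) {
    char *base = sbuf + half * (N + 1);
    int i;
    for (i = 0; i < N && *sinput; i++)
        base[i] = *sinput++;
    base[i] = SENTINEL;            /* sentinel marks the end of the half */
}

/* Read one character. The common case is the single test c != SENTINEL;
   only on a sentinel do we decide: buffer boundary or true end of input. */
static char get_char(char **fw) {
    char c = *(*fw)++;
    if (c != SENTINEL)
        return c;
    if (*fw - 1 == sbuf + N) {             /* end of first half */
        fill_half(1);                      /* *fw already points at it */
        return get_char(fw);
    }
    if (*fw - 1 == sbuf + 2 * N + 1) {     /* end of second half */
        fill_half(0);
        *fw = sbuf;                        /* wrap around */
        return get_char(fw);
    }
    (*fw)--;                               /* eof inside a half: input ends */
    return SENTINEL;
}
```

The expensive boundary checks now run only once per N characters, when a sentinel is actually hit.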
Specification of Tokens
An alphabet is a finite set of symbols.
Typical examples of symbols are letters, digits, and punctuation.
The set {0, 1} is the binary alphabet.
A string over an alphabet is a finite sequence of symbols drawn from that alphabet.
The length of a string s is denoted |s|.
The empty string is denoted by ε.
A language is any countable set of strings over some fixed alphabet.
Operations
Example
Let: L = { a, b, c, ..., z } and D = { 0, 1, 2, ..., 9 }
D+ = "the set of strings with one or more digits"
L U D = "the set of all letters and digits (alphanumeric characters)"
LD = "the set of strings consisting of a letter followed by a digit"
L* = "the set of all strings of letters, including Ɛ, the empty string"
(L U D)* = "sequences of zero or more letters and digits"
L((L U D)*) = "the set of strings that start with a letter, followed by zero or more letters and digits"
Regular Expression
A regular expression is a pattern that provides a clear and flexible means to "match" (specify and recognize) strings of text.
Regular expressions over alphabet Σ
Ɛ is a regular expression that denotes {Ɛ}.
If a is a symbol (i.e., a ∈ Σ), then a is a regular expression that denotes {a}.
Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then:
(r) | (s) is a regular expression denoting L(r) ∪ L(s).
(r)(s) is a regular expression denoting L(r)L(s).
(r)* is a regular expression denoting (L(r))*.
(r) is a regular expression denoting L(r).
Regular Definitions
Sometimes we may wish to give names to regular expressions and to define new regular expressions using these names as if they were symbols.
If Σ is an alphabet of basic symbols, then a regular definition is a sequence of the following form:
d1 → r1
d2 → r2
...
dn → rn
where
where
Each di is a new symbol such that di ∉ Σ and di ≠ dj for all j < i.
Each ri is a regular expression over Σ ∪ {d1, d2, ..., di-1}.
Shorthand Notation
r* = r+ | Ɛ
r+ = rr* = r*r
Zero or one instance: r? is equivalent to r | Ɛ
Character classes: [a-z] is shorthand for a | b | ... | z
Token Recognition
Implementation: Transition Diagrams
Intermediate step in constructing  lexical analyzer
Convert patterns into flowcharts called  transition diagrams.
As characters are read, the relevant TDs are used to attempt to match lexeme to a
pattern
Each TD has:
States : Represented by Circles
Actions : Represented by Arrows between states
Start State : Beginning of a pattern (Arrowhead)
Final State(s) : End of pattern (Concentric Circles)
Edges: arrows connecting the states
Each TD is assumed deterministic: there is never a need to choose between two different actions!
Example : Transition diagram for all RELOPs (Relational Operators)
The star on states 4 and 8 means we retract the forward pointer one step back
(remember that in input buffering we scan ahead with forward).
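The RELOP diagram can be transcribed into C as follows; the retraction at states 4 and 8 shows up as consuming only one character. The function and enum names are illustrative:

```c
/* RELOP recognizer following the transition diagram: <, <=, <>, =, >, >=.
   *consumed reports how many characters belong to the lexeme, so the
   "retract" states simply consume one character instead of two. */
enum Relop { LT, LE, NE, EQ, GT, GE, NOT_RELOP };

enum Relop relop(const char *s, int *consumed) {
    if (s[0] == '<') {
        if (s[1] == '=') { *consumed = 2; return LE; }
        if (s[1] == '>') { *consumed = 2; return NE; }
        *consumed = 1; return LT;      /* state 4: retract the lookahead */
    }
    if (s[0] == '=') { *consumed = 1; return EQ; }
    if (s[0] == '>') {
        if (s[1] == '=') { *consumed = 2; return GE; }
        *consumed = 1; return GT;      /* state 8: retract the lookahead */
    }
    *consumed = 0; return NOT_RELOP;
}
```

The lookahead character that triggered the retract is left unconsumed, exactly as the forward pointer is moved back in the diagram.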
Example: Transition diagram for id
installID():
It has access to the buffer (memory) where the lexeme is located and is mainly used to get the attribute value.
If the lexeme is a keyword, 0 is returned.
If the lexeme is a variable, a pointer to its symbol-table entry is returned.
If the lexeme is not found in the symbol table, it is installed as a new variable and a pointer to its symbol-table entry is returned.
getToken():
If the lexeme is a keyword, the corresponding token is returned.
Otherwise, the token id is returned.
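A toy version of these two routines, assuming a fixed keyword list and an array-based symbol table, with a 1-based index standing in for the symbol-table pointer; all names besides installID and getToken are illustrative:

```c
#include <string.h>

#define MAX_SYMS 64

static const char *keywords[] = { "if", "else", "int", "return", NULL };
static char symtab[MAX_SYMS][32];   /* toy symbol table */
static int nsyms = 0;

/* Returns 0 for a keyword; otherwise the (1-based) symbol-table entry,
   installing the lexeme first if it is not already present. */
int installID(const char *lexeme) {
    for (int i = 0; keywords[i]; i++)
        if (strcmp(lexeme, keywords[i]) == 0)
            return 0;                        /* keyword => 0 */
    for (int i = 0; i < nsyms; i++)
        if (strcmp(symtab[i], lexeme) == 0)
            return i + 1;                    /* already installed */
    strncpy(symtab[nsyms], lexeme, 31);      /* install new variable */
    return ++nsyms;                          /* 1-based "pointer" */
}

/* Returns the token name: the keyword itself, or "id" for identifiers. */
const char *getToken(const char *lexeme) {
    for (int i = 0; keywords[i]; i++)
        if (strcmp(lexeme, keywords[i]) == 0)
            return keywords[i];
    return "id";
}
```

A real lexer would return the symbol-table pointer as the token's attribute value, as described in the Attributes for Tokens section.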
Implementation
Language for specifying Lexical Analyser
The tool used is flex.
All the specifications are stored in a file, conventionally named lex.l.
The declaration section includes declarations of variables and constants, plus the regular definitions.
The translation rules pair patterns (regular expressions) with actions.
The auxiliary functions implement the actions that are performed when a token is matched.
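A minimal lex.l sketch showing the three sections; the printed token names are illustrative, not a fixed convention:

```lex
%{
/* declaration section: C declarations; regular definitions follow */
#include <stdio.h>
%}
digit   [0-9]
letter  [a-zA-Z]
id      {letter}({letter}|{digit})*

%%
{digit}+   { printf("NUM(%s)\n", yytext); }
{id}       { printf("ID(%s)\n", yytext); }
[ \t\n]    { /* white space yields no token */ }
.          { printf("OP(%s)\n", yytext); }
%%

/* auxiliary functions */
int yywrap(void) { return 1; }
int main(void)   { yylex(); return 0; }
```

Running flex on this file generates lex.yy.c, which contains the yylex() scanner that a parser would call for each token.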