Lexical Analysis
Role of the Lexical Analyzer
Remove comments and white space (also known as scanning)
Expand macros
Read input characters from the source program
Group them into lexemes
Produce as output a sequence of tokens
Interact with the symbol table
Correlate error messages generated by the compiler with the source program
Send tokens to the parser
Scanner-Parser Interaction
Scanners are usually implemented to produce tokens only when requested by a
parser.
Here is how it works:
1. "Get next token" is a command sent from the parser to the lexical analyzer.
2. On receiving this command, the lexical analyzer scans the input until it finds the next token.
3. It returns the token to the parser.
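A minimal sketch of this pull model in C; the names (getNextToken, TokenKind) are hypothetical, and only identifiers and numbers are recognized:

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical pull-style interface: the parser calls getNextToken()
   each time it needs another token. All names are illustrative. */
enum TokenKind { TOK_ID, TOK_NUM, TOK_EOF };

struct Token {
    enum TokenKind kind;
    char lexeme[32];
};

struct Token getNextToken(const char **src) {
    struct Token t = { TOK_EOF, "" };
    const char *p = *src;
    size_t n = 0;
    while (*p == ' ' || *p == '\t' || *p == '\n')
        p++;                              /* scanning: skip white space */
    if (*p == '\0') { *src = p; return t; }
    if (isdigit((unsigned char)*p)) {
        t.kind = TOK_NUM;
        while (isdigit((unsigned char)*p) && n < 31) t.lexeme[n++] = *p++;
    } else {
        t.kind = TOK_ID;                  /* group characters into a lexeme */
        while (isalnum((unsigned char)*p) && n < 31) t.lexeme[n++] = *p++;
        if (n == 0 && *p) t.lexeme[n++] = *p++;  /* single-char fallback */
    }
    t.lexeme[n] = '\0';
    *src = p;                             /* remember where to resume */
    return t;
}
```

Each call consumes exactly one token's worth of input, which is why the scanner can be driven on demand by the parser.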
https://www.slideshare.net/Amansharma1037
https://www.linkedin.com/in/includeaman
Issues in Lexical Analysis
Why separate lexical and syntactic analysis?
Simplicity of design:
A parser containing the rules for comments and white space is more complex to
make than a parser that can assume that comments and whitespaces have been
removed.
Improved compiler efficiency:
Reading source code and classifying it into tokens is a time-consuming task. Separating it from the parser allows us to use specialized techniques for the lexer, which can speed up scanning.
Higher portability: input-device-specific peculiarities are restricted to the lexer.
Basic Terminology
Token: a pair consisting of
Token name: an abstract symbol representing a lexical unit [affects parsing decisions]
Optional attribute value [influences translations after parsing]
Pattern: a description of the form that different lexemes take
Lexeme: a sequence of characters in the source program matching a pattern
Example
#include <stdio.h>

int maximum(int x, int y) {
    // This will compare 2 numbers
    if (x > y)
        return x;
    else {
        return y;
    }
}
Examples of Tokens created
Lexeme     Token
int        Keyword
maximum    Identifier
(          Operator
int        Keyword
x          Identifier
,          Operator
int        Keyword
y          Identifier
)          Operator
{          Operator
if         Keyword
Examples of Nontokens
Type Examples
Comment // This will compare 2 numbers
Pre-processor directive #include <stdio.h>
Pre-processor directive #define NUMS 8,9
Macro NUMS
Whitespace \n \b \t
Attributes for Tokens
When more than one lexeme can match a pattern, a lexical analyzer must provide the subsequent compiler phases additional information about the particular lexeme that matched.
Information about an identifier (its lexeme, its type, and the location at which it was first found) is kept in the symbol table.
The appropriate attribute value for an identifier is therefore a pointer to the symbol-table entry for that identifier.
Recall :
Tokens influence parsing decision;
The attributes influence the translation of tokens.
Example: tokens and attributes for the following Fortran statement
       E = M * C ** 2
<id, pointer to symbol-table entry for E>
<assign_op, >
<id, pointer to symbol-table entry for M>
<mult_op, >
<id, pointer to symbol-table entry for C>
<exp_op, >
<num, integer value 2>
Lexical Errors
A character sequence that cannot be scanned into any valid token is a lexical error.
It is hard for the lexical analyzer to tell, without the aid of other components, that there is a source-code error.
Example: if the string fi is encountered for the first time in a C program, the lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared identifier.
Probably the parser will be able to handle this case.
Error handling is thus very localized (limited) with respect to the input source.
For example: whil ( x = 0 ) do generates no lexical errors in Pascal, because whil is scanned as a perfectly valid identifier.
Handling Lexical Errors
Panic-mode recovery:
Delete successive characters from the remaining input until the analyzer can find a well-formed token.
This may confuse the parser by creating a syntax error.
Possible error recovery actions:
Deleting an extraneous character
Inserting a missing character
Replacing an incorrect character by a correct one
Transposing (exchanging) two adjacent characters
Input Buffering
Processing the characters of a large source program takes significant time.
Specialized buffering techniques have been developed to reduce the overhead required to process a single input character.
A lexical analyzer may need to look at least one character ahead to make a token decision.
For example:
We cannot be sure we have seen the end of an identifier until we see a character that is not a letter or digit, and therefore is not part of the lexeme for id.
In C, single-character operators like -, =, or < could also be the beginning of a two-character operator like ->, ==, or <=.
We first consider a two-buffer scheme that handles large lookaheads safely.
We then consider an improvement involving "sentinels" that saves time checking for the ends of buffers.
Buffer Pairs
Use a buffer (an area of memory) divided into two N-character halves, where N is the number of characters in one disk block.
One system call reads N characters, instead of one system call per character.
Fewer than N characters read => end of input (eof).
Two pointers into the input are maintained:
Pointer lexemeBegin marks the beginning of the current lexeme, whose extent we are attempting to determine.
Pointer forward scans ahead until a pattern match is found.
The string of characters between the pointers is the current lexeme.
Initially both pointers point to the first character of the next lexeme to be found.
Advancing forward requires that we first test whether we have reached the end of one of the buffers.
If so, we must reload the other buffer from the input and move forward to the beginning of the newly loaded buffer.
Once the next lexeme is determined, forward is set to the character at its right end.
After the lexeme is processed, both pointers are set to the character immediately past the lexeme.
Comments and white space can be treated as patterns that yield no token.
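The reload logic above can be sketched in C; N, load_half, and advance are illustrative names, and a string stands in for the input file:

```c
#include <string.h>

#define N 8   /* in practice, N is the size of a disk block, e.g. 4096 */

/* Buffer-pair sketch: one array holds two N-character halves that are
   refilled alternately as the forward pointer crosses a boundary. */
static char buffer[2 * N];
static const char *input;          /* stands in for the source file */

/* Fill one half from the simulated input; fewer than N chars => eof,
   marked here with a NUL terminator. */
static void load_half(int half) {
    int i;
    for (i = 0; i < N && *input; i++)
        buffer[half * N + i] = *input++;
    if (i < N)
        buffer[half * N + i] = '\0';
}

/* Advance the forward pointer, reloading the other half at a boundary. */
static char *advance(char *forward) {
    forward++;
    if (forward == buffer + N) {            /* end of first half */
        load_half(1);                       /* reload second half */
    } else if (forward == buffer + 2 * N) { /* end of second half */
        load_half(0);                       /* reload first half */
        forward = buffer;                   /* wrap to its beginning */
    }
    return forward;
}
```

Note that every advance pays for two boundary comparisons even in the common case; the sentinel scheme below removes that cost.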
Sentinels
During buffering, for each character read we make two tests:
one to check for the end of the buffer,
and a second to determine what character was read.
The two tests can be combined by using sentinels.
We can combine the buffer-end test with the test for the current character if we extend each buffer to hold a sentinel character at the end.
The sentinel is a special character that cannot be part of the source program, and a natural choice is the character eof.
The figure shows the same arrangement as before, but with the sentinels added.
Note that eof retains its use as a marker for the end of the entire input:
any eof that appears somewhere other than at the end of a buffer means that the input is at an end.
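A sketch of the sentinel version, reserving one sentinel slot after each half so the inner loop normally makes a single test. Here NUL stands in for the eof character, and all names are illustrative:

```c
#include <string.h>

#define N 8
#define SENTINEL '\0'   /* stand-in for eof; a real lexer uses a code
                           that cannot appear in the source program */

/* Each N-char half is followed by one sentinel slot. */
static char sbuf[2 * N + 2];
static const char *sinput;         /* stands in for the source file */

static void fill_half(int half) {
    char *base = sbuf + half * (N + 1);
    int i;
    for (i = 0; i < N && *sinput; i++)
        base[i] = *sinput++;
    base[i] = SENTINEL;            /* sentinel marks the end of the half */
}

/* Read one character. The common case is the single test c != SENTINEL;
   only on a sentinel do we decide: buffer boundary or true end of input. */
static char get_char(char **fw) {
    char c = *(*fw)++;
    if (c != SENTINEL)
        return c;
    if (*fw - 1 == sbuf + N) {             /* end of first half */
        fill_half(1);                      /* *fw already points at it */
        return get_char(fw);
    }
    if (*fw - 1 == sbuf + 2 * N + 1) {     /* end of second half */
        fill_half(0);
        *fw = sbuf;                        /* wrap around */
        return get_char(fw);
    }
    (*fw)--;                               /* eof inside a half: input ends */
    return SENTINEL;
}
```

The expensive boundary checks now run only once per N characters, when a sentinel is actually hit.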
Specification of Tokens
An alphabet is a finite set of symbols.
Typical examples of symbols are letters, digits, and punctuation.
The set {0, 1} is the binary alphabet.
A string over an alphabet is a finite sequence of symbols drawn from that alphabet.
The length of a string s is denoted |s|.
The empty string is denoted by ε.
A language is any countable set of strings over some fixed alphabet.
Operations
Example
Let: L = { a, b, c, ..., z } and D = { 0, 1, 2, ..., 9 }
D+ = "the set of strings with one or more digits"
L U D = "the set of all letters and digits (alphanumeric characters)"
LD = "the set of strings consisting of a letter followed by a digit"
L* = "the set of all strings of letters, including Ɛ, the empty string"
(L U D)* = "sequences of zero or more letters and digits"
L((L U D)*) = "the set of strings that start with a letter, followed by zero or more letters and digits"
Regular Expression
A regular expression is a pattern that provides a clear and flexible means to "match" (specify and recognize) strings of text.
Regular expressions over alphabet Σ
Ɛ is a regular expression that denotes {Ɛ}.
If a is a symbol (i.e., a ∈ Σ), then a is a regular expression that denotes {a}.
Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then:
(r) | (s) is a regular expression denoting L(r) ∪ L(s).
(r)(s) is a regular expression denoting L(r)L(s).
(r)* is a regular expression denoting (L(r))*.
(r) is a regular expression denoting L(r).
Regular Definitions
Sometimes we may wish to give names to regular expressions and to define new regular expressions using these names as if they were symbols.
If Σ is an alphabet of basic symbols, then a regular definition is a sequence of the following form:
d1 → r1
d2 → r2
...
dn → rn
where
where
Each di is a new symbol such that di ∉ Σ and di ≠ dj for all j < i.
Each ri is a regular expression over Σ ∪ {d1, d2, ..., di-1}.
Shorthand Notation
r* = r+ | Ɛ
r+ = rr* = r*r
Zero or one instance: r? is equivalent to r | Ɛ
Character classes: [a-z] is shorthand for a | b | ... | z
Token Recognition
Implementation: Transition Diagrams
Intermediate step in constructing  lexical analyzer
Convert patterns into flowcharts called  transition diagrams.
As characters are read, the relevant TDs are used to attempt to match lexeme to a
pattern
Each TD has:
States : Represented by Circles
Actions : Represented by Arrows between states
Start State : Beginning of a pattern (Arrowhead)
Final State(s) : End of pattern (Concentric Circles)
Edges: arrows connecting the states
Each TD is assumed deterministic: there is never a need to choose between two different actions!
Example : Transition diagram for all RELOPs (Relational Operators)
The star on states 4 and 8 means we retract the forward pointer one step back
(remember that in input buffering we scan ahead with forward).
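The RELOP diagram can be transcribed into C as follows; the retraction at states 4 and 8 shows up as consuming only one character. The function and enum names are illustrative:

```c
/* RELOP recognizer following the transition diagram: <, <=, <>, =, >, >=.
   *consumed reports how many characters belong to the lexeme, so the
   "retract" states simply consume one character instead of two. */
enum Relop { LT, LE, NE, EQ, GT, GE, NOT_RELOP };

enum Relop relop(const char *s, int *consumed) {
    if (s[0] == '<') {
        if (s[1] == '=') { *consumed = 2; return LE; }
        if (s[1] == '>') { *consumed = 2; return NE; }
        *consumed = 1; return LT;      /* state 4: retract the lookahead */
    }
    if (s[0] == '=') { *consumed = 1; return EQ; }
    if (s[0] == '>') {
        if (s[1] == '=') { *consumed = 2; return GE; }
        *consumed = 1; return GT;      /* state 8: retract the lookahead */
    }
    *consumed = 0; return NOT_RELOP;
}
```

The lookahead character that triggered the retract is left unconsumed, exactly as the forward pointer is moved back in the diagram.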
Example: Transition diagram for id
installID():
It has access to the buffer (memory) where the lexeme is located and is mainly used to get the attribute value.
If the lexeme is a keyword, 0 is returned.
If the lexeme is a variable, a pointer to its symbol-table entry is returned.
If the lexeme is not found in the symbol table, it is installed as a new variable and a pointer to its symbol-table entry is returned.
getToken():
If the lexeme is a keyword, the corresponding token is returned.
Otherwise, the token id is returned.
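A toy version of these two routines, assuming a fixed keyword list and an array-based symbol table, with a 1-based index standing in for the symbol-table pointer; all names besides installID and getToken are illustrative:

```c
#include <string.h>

#define MAX_SYMS 64

static const char *keywords[] = { "if", "else", "int", "return", NULL };
static char symtab[MAX_SYMS][32];   /* toy symbol table */
static int nsyms = 0;

/* Returns 0 for a keyword; otherwise the (1-based) symbol-table entry,
   installing the lexeme first if it is not already present. */
int installID(const char *lexeme) {
    for (int i = 0; keywords[i]; i++)
        if (strcmp(lexeme, keywords[i]) == 0)
            return 0;                        /* keyword => 0 */
    for (int i = 0; i < nsyms; i++)
        if (strcmp(symtab[i], lexeme) == 0)
            return i + 1;                    /* already installed */
    strncpy(symtab[nsyms], lexeme, 31);      /* install new variable */
    return ++nsyms;                          /* 1-based "pointer" */
}

/* Returns the token name: the keyword itself, or "id" for identifiers. */
const char *getToken(const char *lexeme) {
    for (int i = 0; keywords[i]; i++)
        if (strcmp(lexeme, keywords[i]) == 0)
            return keywords[i];
    return "id";
}
```

A real lexer would return the symbol-table pointer as the token's attribute value, as described in the Attributes for Tokens section.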
Implementation
Language for specifying Lexical Analyser
The tool used is flex.
All the specifications are stored in a file, conventionally named lex.l.
The declaration section includes declarations of variables and constants, plus the regular definitions.
The translation rules pair patterns (regular expressions) with actions.
The auxiliary functions implement the actions that are performed when a token is matched.
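A minimal lex.l sketch showing the three sections; the printed token names are illustrative, not a fixed convention:

```lex
%{
/* declaration section: C declarations; regular definitions follow */
#include <stdio.h>
%}
digit   [0-9]
letter  [a-zA-Z]
id      {letter}({letter}|{digit})*

%%
{digit}+   { printf("NUM(%s)\n", yytext); }
{id}       { printf("ID(%s)\n", yytext); }
[ \t\n]    { /* white space yields no token */ }
.          { printf("OP(%s)\n", yytext); }
%%

/* auxiliary functions */
int yywrap(void) { return 1; }
int main(void)   { yylex(); return 0; }
```

Running flex on this file generates lex.yy.c, which contains the yylex() scanner that a parser would call for each token.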