Syntax analysis

Syntax Analyzer
 Syntax Analyzer creates the syntactic structure of the given
source program.
 This syntactic structure - parse tree.
 Syntax Analyzer is also known as parser.
 The syntax analyzer (parser) checks whether a given source
program satisfies the rules implied by a context-free grammar
or not.
 If it satisfies, the parser creates the parse tree of that program.
 Otherwise the parser gives the error messages.

INTRODUCTION
 Every programming language has precise rules that prescribe the syntactic
structure of well-formed programs.
 Program is made up of functions, a function out of declarations and statements, a
statement out of expressions
 The syntax of programming language constructs can be specified by context-
free grammars
 A context-free grammar
 gives a precise syntactic specification of a programming language.
 the design of the grammar is an initial phase of the design of a
compiler.
 a grammar can be directly converted into a parser by some tools.

Parser
• Parser works on a stream of tokens.
• The smallest item is a token.
Lexical
Analyzer
Parser
source
program
token
get next token
parse tree

Parsing
 Parsing is the process of determining whether a string of tokens can be
generated by a grammar.
 Parsing methods
 The top-down
 Bottom-up methods.
 Top-down parsing, construction starts at the root and proceeds to the
leaves.
 Bottom-up parsing, construction starts at the leaves and proceeds towards
the root.
 Top-down parsers are easy to build by hand.
 Bottom-up parsing,
 Can handle a larger class of grammars.
 They are not as easy to build, but tools for generating them directly from a grammar are available.
 Both top-down and bottom-up parsers scan the input from left to right (one symbol at a time).

Top- Down Parsing
 Done by starting with the root, labeled with the starting nonterminal stmt,
and repeatedly performing the following two steps.
 At node N, labeled with nonterminal A, select one of the productions for A and
construct children at N for the symbols in the production body.
 Find the next node at which a subtree is to be constructed, typically the leftmost
unexpanded nonterminal of the tree.
 The current terminal being scanned in the input is frequently referred to as
the lookahead symbol.

Top-Down Parsing
 Top-Down Parsing is an attempt to find a left-most
derivation for an input string
 Example:
S  cAd Find a derivation for
A  ab | a for w  cad
S S Backtrack S
/ |  / |  / |
c A d c A d c A d
/ |
a b a

Predictive Parsing
 Recursive-descent parsing is a top-down method of syntax analysis in
which a set of recursive procedures is used to process the input.
 Simple form of recursive descent – Predictive Parsing

Syntax Error Handling
 Goals in error handling
 Report the presence of errors clearly and accurately.
 Recover from each error quickly enough to detect subsequent errors.
 Add minimal overhead to the processing of correct programs.

Error-Recovery Strategies
 The simplest approach is for the parser to quit with an informative error
message when it detects the first error.
 Panic-mode recovery
 Phrase-level recovery
 Error-productions
 Global-correction.

Panic-Mode Recovery
 The parser discards input symbols one at a time until one of a designated set of
synchronizing tokens is found.
 The synchronizing tokens are usually delimiters, such as ; or }.
 Skips a considerable amount of input without checking for additional errors
 It has the advantage of simplicity, and is guaranteed not to go into an infinite
loop.

Phrase-Level Recovery
 Perform local correction on the remaining input;
 It may replace a prefix of the remaining input by some string that allows the
parser to continue.
 A typical local correction is to replace a comma by a semicolon.
 Delete an extraneous semicolon.
 Insert a missing semicolon.
 Disadvantage in coping with situations in which the actual error has occurred
before the point of detection.

Error Productions
 Expand the grammar for the language at hand with productions that generate the
erroneous constructs.
 The parser can then generate appropriate error diagnostics about the erroneous
construct that has been recognized in the input.

Global Correction
 Compiler to make as few changes as possible in processing an incorrect input
string.
 Given an incorrect input string x and grammar G, algorithms will find a parse
tree for a related string y, such that the number of insertions, deletions, and
changes of tokens required to transform x into y is as small as possible.
 Not implemented.

Syntax Definition
 A grammar describes the hierarchical structure of programming language constructs.
 Eg: if ( expression ) statement else statement
 An if-else statement is the concatenation of the keyword if, an opening parenthesis, an
expression, a closing parenthesis, a statement, the keyword else, and another statement.
 Stmt -> if ( expr ) stmt else stmt
 Rule is called a production.
 In a production, lexical elements if and the parentheses are called terminals.
 Variables like expr and stmt are called nonterminals.

A Context Free Grammar
 A context-free grammar has four components:
 A set of terminal symbols, sometimes referred to as "tokens.“
 A set of nonterminals, sometimes called "syntactic variables."
 A set of productions, where each production consists of a nonterminal,called the head or
left side of the production, an arrow, and a sequence of terminals and/or nonterminals ,
called the body or right side of the production
 A designation of one of the nonterminals as the start symbol.

A Context Free Grammar
The terminal symbols are

Notational Conventions
These symbols are terminals:
 Lowercase letters early in the alphabet, such as a, b, c.
 Operator symbols such as +, *, and so on.
 Punctuation symbols such as parentheses, comma, and so on.
 The digits 0, 1, . . . , 9.
 Boldface strings such as id or if, each of which represents a single terminal
symbol.

These symbols are nonterminals:
 Uppercase letters early in the alphabet, such as A, B, C.
 The letter s, which, when it appears, is usually the start symbol.
 Lowercase, italic names such as expr or stmt.
 Uppercase letters may be used t o represent nonterminals for the constructs.
For example, nonterminals for expressions, terms, and factors are often
represented by E, T, and F, respectively.

 Uppercase letters late in the alphabet, such as X, Y, Z, represent grammar
symbols; that is, either nonterminals or terminals.
 Lowercase letters late in the alphabet , chiefly u, v, ... ,z, represent (possibly
empty) strings of terminals.
 Lowercase greek letters,α, β, γ for example, represent (possibly empty) strings
of grammar symbols.
 A set of productions a -> α 1 , a -> α2, ... , a -> α k with a common head
 A (call them a-productions) , may be written A -> α 1 I α 2 I . , . I α k · call α1 ,
α2 , ... ,αk the alternatives for A.
 Unless stated otherwise, the head of the first production is the start symbol

Derivations
 E  E+E : E+E derives from E
 E  E+E  id+E  id+id
 A sequence of replacements of non-terminal symbols is called a derivation
of id+id from E.
 A   if there is a production rule A in our grammar and  and
 are arbitrary strings of terminal and non-terminal symbols
1  2  ...  n (n derives from 1 or 1 derives n )
 : derives in one step
 : derives in zero or more steps
 : derives in one or more steps
*
+

CFG - Terminology
 L(G) is the language of G (the language generated by G) which is a set of
sentences.
 A sentence of L(G) is a string of terminal symbols of G.
 If S is the start symbol of G then
 is a sentence of L(G) iff S   where  is a string of terminals of G
 If G is a context-free grammar, L(G) is a context-free language.
 Two grammars are equivalent if they produce the same language.
 S   - If  contains non-terminals, it is called as a sentential form of G.
- If  does not contain non-terminals, it is called as a sentence of G.
*
*

Derivation Example
 E  -E  -(E)  -(E+E)  -(id+E)  -(id+id)
OR
 E  -E  -(E)  -(E+E)  -(E+id)  -(id+id)
 At each derivation step, we can choose any of the non-terminal in the
sentential form of G for the replacement.
 If we always choose the left-most non-terminal in each derivation
step, this derivation is called as left-most derivation.
 If we always choose the right-most non-terminal in each derivation
step, this derivation is called as right-most derivation.

Left-Most and Right-Most Derivations
Left-Most Derivation
E  -E  -(E)  -(E+E)  -(id+E)  -(id+id)
Right-Most Derivation
E  -E  -(E)  -(E+E)  -(E+id)  -(id+id)
 We will see that the top-down parsers try to find the left-most derivation of the
given source program.
 We will see that the bottom-up parsers try to find the right-most derivation of
the given source program in the reverse order.
lmlmlmlmlm
rmrmrmrmrm

Parse Trees and Derivations
 A parse tree is a graphical representation of a derivation that filters out the
order in which productions are applied to replace nonterminals.
 The interior node is labeled with the nonterminal A in the head of the
production;
 The children of the node are labeled, from left to right, by the symbols in the
body of the production
 The leaves of a parse tree are labeled by nonterminals or terminals
 Read from left to right, constitute a sentential form, called the yield or frontier
of the tree.
 There is a many-to-one relationship between derivations and parse trees.

Ambiguity
 a grammar that produces more than one parse tree for some sentence is said
to be ambiguous

Writing a Grammar
 Grammars are capable of describing most, of the syntax of programming
languages .
 Grammar should be unambiguous.
 Left-recursion elimination and left factoring - are useful for rewriting
grammars .
 From the resulting grammar we can create top down parsers without
backtracking.
 Such parsers are called predictive parsers or recursive-descent parser

Eliminating Ambiguity
 ambiguous grammar can be rewritten to eliminate the ambiguity.
 stmt -> if expr then stmt
|if expr then stmt else stmt
|other
 is ambiguous since the string
 if E1 then if E2 then S1 else S2 has the two parse trees

Two parse trees for an ambiguous sentence

Eliminating Ambiguity
 The general rule is, "Match each else with the closest unmatched then."

Left Recursion
 A grammar is left recursive if it has a non-terminal A such that there is a
derivation.
A  A for some string 
 Top-down parsing techniques cannot handle left-recursive grammars.
 The left-recursion may appear in a single step of the derivation (immediate left-
recursion), or may appear in more than one step of the derivation.
*

Immediate Left-Recursion
A  A  |  where  does not start with A
 eliminate immediate left recursion
A   A’
A’   A’ | 
A  A 1 | ... | A m | 1 | ... | n where 1 ... n do not start with A
 eliminate immediate left recursion
A  1 A’ | ... | n A’
A’  1 A’ | ... | m A’ |  an equivalent grammar
In general,

Left-Recursion -- Problem
• A grammar cannot be immediately left-recursive, but it still can
be left-recursive.
S  Aa | b
A  Sc | d
S  Aa  Sca
A  Sc  Aac causes to a left-recursion

Eliminate Left-Recursion -- Algorithm
- Arrange non-terminals in some order: A1 ... An
- for i from 1 to n do {
- for j from 1 to i-1 do {
replace each production
Ai  Aj 
by
Ai  1  | ... | k 
where Aj  1 | ... | k
}
- eliminate immediate left-recursions among Ai productions
}

Eliminate Left-Recursion
S  Aa | b
A  Ac | Sd | f
- Order of non-terminals: S, A
- A  Ac | Aad | bd | f
- Eliminate the immediate left-recursion in A
A  bdA’ | fA’
A’  cA’ | adA’ | 
So, the resulting equivalent grammar which is not left-recursive is:
S  Aa | b
A  bdA’ | fA’
A’  cA’ | adA’ | 

Left-Recursive Grammars III
 Here is an example of a (directly) left-recursive grammar:
E  E + T | T
T  T * F | F
F  ( E ) | id
 This grammar can be re-written as the following non left-
recursive grammar:
E  T E’ E’  + TE’ | є
T  F T’ T’  * F T’ | є
F  (E) | id

Left Factoring
 Left factoring is a grammar transformation that is useful for
producing a grammar suitable for predictive, or top-down,
parsing.
 Stmt -> if expr then stmt else stmt
|if expr then stmt
 A ->α 1 | α 2
 So it should be left factored as

Left-Factoring -- Algorithm
 For each non-terminal A with two or more alternatives (production rules)
with a common non-empty prefix
A  1 | ... | n | 1 | ... | m
convert it into
A  A’ | 1 | ... | m
A’  1 | ... | n

Left-Factoring – Example1
A  abB | aB | cdg | cdeB | cdfB

A  aA’ | cdg | cdeB | cdfB
A’  bB | B

A  aA’ | cdA’’
A’  bB | B
A’’  g | eB | fB

Left-Factoring – Example2
A  ad | a | ab | abc | b

A  aA’ | b
A’  d |  | b | bc

A  aA’ | b
A’  d |  | bA’’
A’’   | c

Top-Down Parsing
 The parse tree is created top to bottom.
 Top-down parser
 Recursive-descent parsing
 Backtracking is needed
 It is a general parsing technique, but not widely used.
 Not efficient
 Predictive parsing
 No backtracking
 Efficient
 Needs a special form of grammars - (LL(1) grammars).
 Recursive predictive parsing is a special form of recursive descent parsing without
backtracking.
 Non-recursive (table driven) predictive parser is also known as LL(1) parser.

Recursive Predictive Parsing
Each non-terminal corresponds to a procedure.
Ex: A  aBb
proc A {
- match the current token with a, and move to the next
token;
- call ‘B’;
- match the current token with b, and move to the next
token;
}

Recursive Predictive Parsing (cont.)
A  aBb | bAB
proc A {
case of the current token {
‘a’: - match the current token with a, and move to the next token;
- call ‘B’;
- match the current token with b, and move to the next token;
‘b’: - match the current token with b, and move to the next token;
- call ‘A’;
- call ‘B’;
}
}

Top-down parse for id + id * id

FIRST and FOLLOW
 FIRST and FOLLOW allow us to choose which production toapply, based on the
next input symbol.
 FIRST(α), where α is any string of grammar symbols, to be the set of terminals that
begin strings derived from α.
 If α => ε, then ε is also in FIRST(α) .
 A => cY, so c is in FIRST(A)
 FOLLOW(A) is the set of the terminals which occur immediately after (follow) the
non-terminal A in the strings derived from the starting symbol.
 a terminal a is in FOLLOW(A) if S  Aa
 $ is in FOLLOW(A) if S  A
*
*

FIRST
1. If X is a terminal, then FIRST(X) = {X}.
2. If X is a nonterminal and X -> YI Y2 ... Yk is a production for some k>=1, then
place a in FIRST(X) if for some i, a is in FIRST(Yi), and ε is in all of
FIRST(YI), ... ,FIRST(Yi-I); that is , YI Y2 ... Yi-1 => ε. If ε is in FIRST (Yj) for
all j = 1, 2,... ,k, then add ε to FIRST (X). For example, everything in FIRST(Y1)
is surely in FIRST(X) . If Yi does not derive ε then we add nothing more to
FIRST (X) , but if Y1 => ε, then we add FIRST(Y2), and so on.
3. 3. If X => ε is a production, then add ε to FIRST (X).
*

LL ( 1 ) Grammars
 L: scanning the input from left to right
 L: producing a leftmost derivation
 1 : one input symbol of lookahead at each step
 A grammar G is LL(1) if and only if whenever A -> α | β are two distinct
productions of G, the following conditions hold:

Construction of a predictive parsing
table.

Syntax analysis

More Related Content

What's hot (20)

Similar to Syntax analysis (20)

More from Akshaya Arunan (9)

Recently uploaded (20)

Syntax analysis