4. Language processing systems
• We have learnt that any computer system is made of hardware and
software. The hardware understands a language (machine language)
that humans cannot easily understand.
• So we write programs in a high-level language, which is easier for us
to understand and remember. These programs are then fed into a series
of tools and OS components to obtain the desired code that can be used
by the machine. This is known as a Language Processing System.
5. Language processing systems
• The user writes a program in a high-level language
• The compiler compiles the program and translates it to an assembly
program (low-level language)
• An assembler then translates the assembly program into machine
code (an object file)
• A linker links all the parts of the program together for
execution (executable machine code)
• A loader loads the executable into memory, and then the program is
executed (a sketch of these stages follows below)
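These stages can be driven explicitly with a real toolchain. A minimal sketch
in Python, assuming a Unix-like system with gcc and as on the PATH and a
source file hello.c in the current directory:

# Drive the language processing pipeline one stage at a time (illustrative).
import subprocess

subprocess.run(["gcc", "-S", "hello.c", "-o", "hello.s"], check=True)  # compile: C -> assembly
subprocess.run(["as", "hello.s", "-o", "hello.o"], check=True)         # assemble: assembly -> object code
subprocess.run(["gcc", "hello.o", "-o", "hello"], check=True)          # link: object code -> executable
subprocess.run(["./hello"], check=True)                                # the OS loader loads and runs it

In practice a single invocation such as gcc hello.c performs all of these
stages internally.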
6. What is a compiler?
A program that translates source code into
equivalent target code
Source code is written in a programming language,
e.g. C++ or Java
Target code is often computer-understandable object
code
A bridge between (application) software and
hardware
7. Compiler: a black box view
Source program → [compiler] → Target program
(error messages are reported when translation fails)
9. Compilation process
Broadly divided into two phases
Analysis (also called front-end)
• (Lexical, Syntactic, Semantic) Analysis
• Programming language dependent
• Computer architecture independent
Synthesis (also called back-end)
• (Intermediate) Code generation and optimisation
• Programming language independent
• Computer architecture dependent
10. Compilation process (cont.)
Each phase transforms the source program from
one representation into another representation
The phases communicate with the symbol table
and the error handler
11. • Lexical Analysis (Or Scanning Or Tokenizing):
In lexical analysis, the stream of characters making up the source
program is read from left to right and grouped into sequences of
characters with a collective meaning, called lexemes; each lexeme is
then classified into a category called a token.
For example: Consider the following assignment statement:
Position = Initial + Rate * 60
(All the variables are of type real)
After passing through the lexical analysis phase, the above
assignment statement takes the following form.
id1 = id2 + id3 * 60
Note: the lexical analysis phase also strips white space and
comments from the source program.
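A minimal scanner sketch in Python (illustrative, not a production lexer),
showing how the statement above is grouped into tokens while white space
is discarded:

# Token categories described by regular expressions (see also slide 27).
import re

TOKEN_SPEC = [
    ("NUM",  r"\d+(?:\.\d+)?"),   # numeric constants
    ("ID",   r"[A-Za-z_]\w*"),    # identifiers
    ("OP",   r"[+\-*/=]"),        # operators
    ("SKIP", r"\s+"),             # white space: recognised, then discarded
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(source):
    for match in TOKEN_RE.finditer(source):
        if match.lastgroup != "SKIP":
            yield (match.lastgroup, match.group())

print(list(tokenize("Position = Initial + Rate * 60")))
# [('ID', 'Position'), ('OP', '='), ('ID', 'Initial'), ('OP', '+'),
#  ('ID', 'Rate'), ('OP', '*'), ('NUM', '60')]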
12. • Syntax Analysis (Parsing, or Hierarchical Analysis):
In syntax analysis, tokens are grouped hierarchically into nested
collections with collective meanings. This involves grouping the
tokens of the source program into grammatical phrases, which the
compiler then uses to synthesise the output.
After passing through the syntax analysis phase, the above
assignment statement takes the following form.
          =
        /   \
     id1     +
           /   \
        id2     *
              /   \
           id3     60
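Inside a compiler this hierarchy is typically held as a tree data structure;
a minimal sketch using nested Python tuples of the form (operator, left, right):

# Syntax tree for: id1 = id2 + id3 * 60
tree = ("=", "id1",
          ("+", "id2",
             ("*", "id3", "60")))

def show(node, depth=0):
    if isinstance(node, tuple):
        op, left, right = node
        print("  " * depth + op)     # operator at this level
        show(left, depth + 1)
        show(right, depth + 1)
    else:
        print("  " * depth + node)   # leaf: identifier or constant

show(tree)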
13. • Semantic Analysis:
The semantic analysis phase performs certain checks to ensure that the
components of a program fit together meaningfully. It checks the
source program for semantic errors and gathers type information from
the symbol table for the subsequent code-generation phase.
It uses the hierarchical structure determined by the syntax analysis
phase to identify the operators and operands of expressions and
statements.
An important component of semantic analysis is type checking. Here
the compiler checks that each operator has operands that are permitted
by the source language specification.
14. • For example, if we try to add a function name to an array name
and store the result in a variable, the compiler will generate an error.
Moreover, many language specifications require a compiler to
generate an error when a real number is used as the index of an array.
However, some languages allow type conversions (coercions), which
are inserted during the semantic analysis phase.
• After passing through the semantic analysis phase, the assignment
statement takes the following form, with the integer constant 60
converted to real:

          =
        /   \
     id1     +
           /   \
        id2     *
              /   \
           id3     inttoreal
                       |
                       60
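A minimal type-checking sketch in Python; the symbol-table contents and the
rules here are assumptions made up for this illustration:

# Types assumed to have been gathered into the symbol table earlier.
SYMBOLS = {"id1": "real", "id2": "real", "id3": "real",
           "scores": "array", "compute": "function"}

def check_add(left, right):
    lt, rt = SYMBOLS[left], SYMBOLS[right]
    if "array" in (lt, rt) or "function" in (lt, rt):
        raise TypeError(f"cannot add {lt} '{left}' and {rt} '{right}'")
    return "real" if "real" in (lt, rt) else "int"

print(check_add("id2", "id3"))        # fine: real + real -> real
try:
    check_add("scores", "compute")    # array + function: semantic error
except TypeError as error:
    print("semantic error:", error)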
15. • Intermediate Code Generation:
After the syntax and semantic analysis phases, some compilers generate
an explicit intermediate representation of the source program. We can
think of this intermediate representation as a program for an abstract
machine.
A common form is three-address code, which is like the assembly
language for a machine in which every memory location (i.e. variable)
can act like a register. It consists of a sequence of instructions, each of
which has at most three operands.
The above statement will become as follows:
Temp1 = inttoreal(60)
Temp2 = id3 * Temp1
Temp3 = id2 + Temp2
id1 = Temp3
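In a compiler this intermediate form is often held as a plain sequence of
quadruples; a minimal sketch:

# Three-address code as (result, operator, arg1, arg2) quadruples.
tac = [
    ("Temp1", "inttoreal", "60",    None),
    ("Temp2", "*",         "id3",   "Temp1"),
    ("Temp3", "+",         "id2",   "Temp2"),
    ("id1",   "=",         "Temp3", None),
]
for result, op, arg1, arg2 in tac:
    print(result, "=", op, arg1, arg2 or "")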
16. Brief Introduction to Phases of Compiler (Contd...)
• Code Optimization:
The code optimization phase attempts to improve the intermediate
code so that faster-running machine code will result; here a constant
conversion is folded and a redundant temporary is eliminated.
The above intermediate code is optimized to:
Temp1 = id3 * 60.0
id1 = id2 + Temp1
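A hedged sketch of the two improvements applied here: constant folding (the
inttoreal of a constant becomes a real constant) and elimination of the final
copy through a temporary. This is a toy pass over the quadruples from the
previous sketch, not a general optimiser:

def optimise(tac):
    out, consts = [], {}
    for result, op, arg1, arg2 in tac:
        arg1 = consts.get(arg1, arg1)          # substitute folded constants
        arg2 = consts.get(arg2, arg2)
        if op == "inttoreal" and arg1.isdigit():
            consts[result] = str(float(arg1))  # fold: Temp1 becomes "60.0"
        else:
            out.append((result, op, arg1, arg2))
    if out and out[-1][1] == "=":              # drop the trailing copy id1 = Temp3
        result, _, source, _ = out.pop()
        out = [(result if r == source else r, op, a1, a2)
               for r, op, a1, a2 in out]
    return out

tac = [("Temp1", "inttoreal", "60", None), ("Temp2", "*", "id3", "Temp1"),
       ("Temp3", "+", "id2", "Temp2"), ("id1", "=", "Temp3", None)]
for result, op, arg1, arg2 in optimise(tac):
    print(f"{result} = {arg1} {op} {arg2}")
# Temp2 = id3 * 60.0
# id1 = id2 + Temp2

The surviving temporary keeps the name Temp2 here rather than being renamed
to Temp1 as on the slide; the instructions are otherwise the same.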
17. Brief Introduction to Phases of Compiler (Contd...)
• Code Generation:
It generates the target code, normally assembly code (or sometimes
relocatable machine code, when an assembler is embedded in the
compiler).
So, the above optimized code will be written in assembly
language as:
MOVF R1, id3
MULF R1, #60.0
MOVF R2, id2
ADDF R1, R2
MOVF id1, R1
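A naive code-generation sketch for the optimised code, using the same
dest-first MOVF/MULF/ADDF convention as above. Register allocation is
deliberately simplistic (one fresh register per instruction), so the register
numbering differs slightly from the slide:

OPCODES = {"*": "MULF", "+": "ADDF"}

def generate(tac):
    asm, location, next_reg = [], {}, 1
    for result, op, arg1, arg2 in tac:
        reg = f"R{next_reg}"
        next_reg += 1
        asm.append(f"MOVF {reg}, {location.get(arg1, arg1)}")        # load left operand
        asm.append(f"{OPCODES[op]} {reg}, {location.get(arg2, arg2)}")
        if result.startswith("Temp"):
            location[result] = reg                  # temporaries stay in registers
        else:
            asm.append(f"MOVF {result}, {reg}")     # store program variables
    return asm

optimised = [("Temp1", "*", "id3", "#60.0"), ("id1", "+", "id2", "Temp1")]
print("\n".join(generate(optimised)))
# MOVF R1, id3
# MULF R1, #60.0
# MOVF R2, id2
# ADDF R2, R1
# MOVF id1, R2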
18. Symbol Table Management
A symbol table is a data structure containing information about
various attributes of each identifier.
For example, in the case of a variable, its attributes may include:
• type of variable
• memory allocated to this variable
• address of variable
• scope of variable
and in the case of a procedure:
• procedure name
• the number and type of arguments
• return type (if any)
• Each phase of the compiler retrieves data from and stores data into
the symbol table, updating it as required.
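A minimal symbol-table sketch using a Python dict; the attribute names are
assumptions chosen for this illustration:

symbol_table = {}

def declare(name, **attributes):
    symbol_table[name] = attributes          # e.g. type, address, scope

def lookup(name):
    return symbol_table.get(name)            # None means "not declared"

declare("Rate", type="real", size=8, address=0x1010, scope="global")
declare("compute", kind="procedure", args=["real", "int"], returns="real")
print(lookup("Rate")["type"])                # real

Real compilers typically layer one table per scope, so that an inner
declaration can hide an outer one.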
19. Error Handling
• Each phase of the compiler can encounter errors. However, after detecting
an error, a phase must somehow deal with it so that compilation can
proceed and further errors in the source program can be detected.
• This work is done by the error-handling routines. The syntax and
semantic analysis phases usually handle a large fraction of the errors
detectable by the compiler.
• The lexical analyser can detect errors only when the input characters
cannot form any token of the language. Errors in which the token stream
violates the structure rules of the language are detected by the syntax
analysis phase.
• During semantic analysis, the compiler tries to detect constructs that
have a correct syntactic structure but no meaning to the operation
involved, e.g. trying to add two identifiers, one of which is the name of
an array and the other the name of a function; the semantic analyser will
generate an error.
20. Compiler vs. Interpreter
Compiler transforms a source code file into an
object code file
Interpreter translates source code line-by-line,
executes it, and then discards the translated
version
Compiled languages are more time-efficient than
interpreted languages (Think why?)
Compiled languages are more space-efficient than
interpreted languages (Think why?)
Interpreted languages are easier to debug (Think
why?)
21. Applications of Compiler Technology
HTML and Word processing documents
General consistency checking could benefit
from type checking
Textual user interfaces use parsing to recognise
users’ utterances
22. Tokens, Patterns, and Lexemes
A token is a name given to a logical unit in the
language, often it is a pair:
token name (e.g., identifier, number)
token value (e.g., "myCounter")
A lexeme is a sequence of program characters
that form a token
e.g., "myCounter"
A pattern is a description of the form that the
lexemes of a token may take
e.g., character strings including A-Z, a-z, 0-9, and _
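To make the three terms concrete, a small sketch (the names are illustrative):

import re

pattern = r"[A-Za-z_][A-Za-z0-9_]*"    # the pattern: form the lexemes may take
lexeme = "myCounter"                   # a lexeme: actual program characters
if re.fullmatch(pattern, lexeme):
    token = ("identifier", lexeme)     # the token: (name, value) pair
    print(token)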
23. Lexical Analyser
Groups sequences of characters into lexemes
the smallest meaningful units in a language (keywords,
identifiers, constants)
Makes use of the theory of regular languages and finite
state machines.
24. Lexical Analysis
input token value/lexeme
ID r
ASN =
ID x
MUL *
r = x * (a+10) LP (
ID a
PLUS +
INT 10
RP )
Tokens are typically represented by numbers, for efficiency reasons
25. Classes of Tokens
Keywords (also called reserved words)
Operators
Identifiers
Constants: numbers and literal strings
Punctuation symbols
26. Examples of Tokens

Token     Description                               Sample lexemes
IF        the characters 'i', 'f'                   if
ELSE      the characters 'e', 'l', 's', 'e'         else
OPR       plus, minus, times                        +, -, *
ID        a letter followed by letters and digits   pi, score, D2
NUM       any numeric constant                      345, 45.6
LITERAL   anything in double or single quotes       “core dumped”
27. Lexical Analyser: Issues & Remedies
How to describe tokens?
Regular expressions could be used
Often called specification
How to break text down into tokens?
Finite automata could be used
Often called implementation
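As a small illustration of the split between specification and implementation:
identifiers can be specified by the regular expression [A-Za-z_][A-Za-z0-9_]*
and recognised by a hand-coded finite automaton. A minimal sketch (note that
isalpha/isalnum accept slightly more than A-Z/a-z, close enough for
illustration):

def is_identifier(text):
    state = "START"                    # states: START, IN_ID, REJECT
    for ch in text:
        if state == "START":
            state = "IN_ID" if (ch.isalpha() or ch == "_") else "REJECT"
        elif state == "IN_ID":
            state = "IN_ID" if (ch.isalnum() or ch == "_") else "REJECT"
        else:
            break                      # REJECT is a trap state
    return state == "IN_ID"

print(is_identifier("myCounter"), is_identifier("2bad"))   # True False

Lexer generators such as lex/flex automate exactly this translation from
regular expressions to automata.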