Assembly Code Generation

Summary: Direct Code Generation
1 Direct Code Generation
Code generation involves the generation of the target representation (object code) from
the annotated parse tree (or Abstract Syntax Tree, AST) produced by syntactic and
semantic analysis.
The output of code generation is typically assembler code, although compilers can also be
used to translate a high level language to another high level language (source to source
compiler) or from a low level language to a high level one (decompiler). We will assume
here that assembler is produced.
Code generation can be direct or indirect:
• Direct code generation: The object code is produced directly from the syntactic
tree.
• Indirect Code Generation: The code generator produces an intermediate
representation, which is at a level of abstraction between the parse tree and the
target form. The intermediate code is mapped into the target form by a separate
process. This approach has the advantage that all processing up until the
intermediate representation is machine independent, and only the final step is
machine dependent.
Code generation can also be distinguished by how it is integrated into the syntactic
analysis:
• A single pass compiler performs semantic actions as syntactic rules are applied,
and these semantic actions generate the code (either assembler directly, or the
intermediate code).
• A multiple pass compiler separates syntactic analysis and code generation. The
parse tree is produced in its entirety, and this is the input to the code generation
phase.
The rest of this summary assumes direct code generation, and a single pass compiler,
generating code via semantic actions.
2 NASM Assembler
2.1 Instructions
Each instruction of a NASM file has the following form:
label: instruction operands ; comment
Typical example: fadd st1
All fields are optional, with some restrictions. White space is used relatively freely: labels
may have white space before them, and instructions may have none. The colon after a
label is optional, but use it for clarity.

2.2 Data Space
The data space of a program can be divided between:
• Addressable Memory of the program: which can be accessed by referring, e.g., to
location 345
• Registers: defined locations which can hold one datum each.
The program's own instructions also occupy part of this addressable memory.
Some assembler ‘pseudo instructions’ do not end up as machine code instructions, but
rather reserve space, e.g.
buffer: resb 64
…reserves 64 bytes of memory at this point of the program. The label ‘buffer’ can be used
to access this memory.
Registers: The x86 processors that NASM targets provide, among others, eight 16-bit and
eight 32-bit general-purpose registers. Our examples will work with the following 32-bit
registers:
• EAX (accumulator), EBX, ECX, EDX, ESP (stack pointer), EBP, ESI, EDI
2.3 Operands
Operands of instructions can be constants, registers, or references to locations in memory
of the program.
• Constants: E.g.,
resw 64 (reserves 64 words of memory)
mov eax,100 (store 100 in register eax)
• Registers:
mov eax,ebx (moves contents of register ebx into register eax)
• Indirect addressing: placing a register in [ ] brackets indicates that the location to use
is contained within the register
mov [eax],ebx
(moves contents of register ebx into memory location contained in
register eax)
• Expressions: any operand can be replaced by an expression, e.g.,
mov [eax+1],ebx
(moves contents of register ebx into memory location denoted by adding 1 to the content
of register eax)
Labels as Operands: Labels can be used in place of registers inside the [ ] brackets. When
translated to machine code, the label will be replaced by the memory location of the
instruction or data associated with the label. E.g.,
    wordvar: resw 2
    mov [wordvar], eax
(reserve 2 words of memory and then move the content of register eax into this space;
since eax is 32 bits wide, the value fills both 16-bit words)

3 Direct Code Generation
3.1 How code is generated
There are several ways to generate code from the syntax tree. In this course, we will
assume it is done via the semantic actions connected to the syntax rules. For instance,
    E :- E1 + E2 { GEN("add", E1.loc, E2.loc) }
The semantic action calls the function GEN with the arguments provided. GEN generates
assembler code for its arguments, which is saved to the object code file for this source
code.
GEN would be C code defined elsewhere and provided to the Bison/Yacc compiler. When
this rule is applied, the current values of the attributes E1.loc and E2.loc would be
substituted. E1.loc and E2.loc were calculated as E1 and E2 were parsed. The ‘loc’
attribute records which variable or temporary location holds the value of the CFG symbol.
GEN is responsible for resolving how these memory locations are referred to in the
assembler code.
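As a rough illustration of what such a GEN function might look like, the following C sketch formats one instruction and appends it to an in-memory buffer standing in for the object code file. The three-string signature, the `out` buffer and the `gen_reset` helper are assumptions made for this example; the course material does not specify them.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical output buffer standing in for the object code file. */
char out[1024];

/* Empty the buffer (helper assumed for this sketch only). */
void gen_reset(void) { out[0] = '\0'; }

/* Append one line "op dst, src" to the buffer.
   Operands are passed already rendered, e.g. "EAX" or "[B]". */
void GEN(const char *op, const char *dst, const char *src) {
    size_t used = strlen(out);
    snprintf(out + used, sizeof out - used, "%s %s, %s\n", op, dst, src);
}
```

Calling GEN("mov", "EAX", "[B]") and then GEN("add", "EAX", "[C]") would leave two assembler lines in `out`, in that order. A real GEN would also have to decide how a ‘loc’ attribute is rendered as an operand, which this sketch leaves to the caller.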
3.2 Dealing with Registers
Operations can be performed more rapidly when the operands are in registers than when
they are in addressable memory. Also, some operations require one operand to be a
particular register (e.g., mul, div). For these reasons, to generate assembler instructions,
we sometimes need to load variables into registers. For instance, to generate code for:
A = B + C
...we might produce the following code:
    mov EAX, [B] ; move the contents of variable B into register EAX
    add EAX, [C] ; add the contents of variable C to register EAX
    mov [A], EAX ; move the contents of register EAX to location A
Note we need to generate three types of assembler instructions dealing with registers:
• Instructions to move variables from memory location into a register
• Instructions to perform operations on the registers
• Instructions to store register values into memory locations.
An important job of the code generator is to keep track of where variable values are at a
given point of time. In the example above, if the prior line of source code had left the value
of B in the EAX register, then it would not be necessary to generate a line of code to move
B into EAX.
For this reason, the compiler maintains a variable “AC”, which records which variable’s
value is currently held in the EAX register. Before generating an instruction to place a
variable’s value into the register, the compiler checks what is the current value held in the
register, and only generates the line if needed.
3.3 The CAC function
The CAC function is C code provided by the user for use in a YACC/Bison compiler,
allowing this function to be referenced in semantic actions.
It is called to ensure particular values are placed in the EAX register. The EAX register is
sometimes called the “accumulator”, and CAC thus stands for “Control of Accumulator”.

The CAC function is called with two variables as arguments, and generates the assembler
code needed to ensure one of them is in the EAX register. It returns 0 if the first of these
variables ends up in the register, and 1 if the second ends up in the register.
The code for CAC is as follows:
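    int CAC (opd *x, opd *y) {
        if (AC == y) return 1;                      /* y is already in EAX */
        if (AC != x) {
            if (AC != NULL) GEN ("MOV", AC, "EAX"); /* save the value now in EAX */
            GEN ("MOV", "EAX", x);                  /* load x into EAX */
            AC = x;
        }
        return 0;                                   /* x is in EAX */
    }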
x and y represent the two variables which CAC needs to deal with. AC is a variable
maintained by the compiler, keeping track of which variable should currently be loaded into
the EAX register (NULL if the current value is not a variable value).
NOTE: AC and CAC belong to the compiler; they are not part of the assembler code.
In the first line of the function, the program checks if y is already in EAX. If so, nothing
needs to be done, so the program returns 1, indicating that the y value is in the register.
In the next lines of code, the program makes sure that the current value of the register is x.
If it is not the current value, code is generated to move the value currently in the register
back to its place, and then code is generated to move x into the register.
The line “AC=x” tells the compiler that at that point, the execution of generated code would
leave the variable x in EAX.
The CAC function can be used in other code as follows. Assume we wish to generate
code to add two values.
    if (CAC(x, y))
        GEN("add", "EAX", x);
    else
        GEN("add", "EAX", y);
    AC = z;
The caller invokes CAC, which returns 0 if x is in the EAX register, and 1 if y is. The final
line, AC = z, records that EAX now holds the result of the addition (here called z) rather
than x or y. CAC itself would issue 0, 1 or 2 lines of code:
• If x or y was already in EAX, no code would be generated by CAC.
• If AC was NULL, then CAC would generate 1 line to move x into EAX:
    mov eax, [x]
• If AC was not NULL (and held neither x nor y), then CAC would generate 1 line to move
EAX back to its location, and another to move x into EAX:
    mov [%AC], eax
    mov eax, [x]
(where %AC would be replaced at compile time with the contents of AC)
CAC will return 0 or 1. If 1 is returned, the code above generates a line of code to add x to
EAX; otherwise, it generates a line of code to add y to EAX.
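The behaviour described above can be tried out with a small self-contained C sketch. It keeps the logic of CAC but makes several assumptions for the sake of a runnable example: `opd` is reduced to a record holding a variable name, the GEN calls are replaced by a hypothetical `emit` helper that appends text to a buffer, and `gen_add` is an invented wrapper around the usage pattern shown above.

```c
#include <stdio.h>
#include <string.h>

/* Assumed operand record: here just the variable's name. */
typedef struct { const char *name; } opd;

opd *AC = NULL;   /* which variable the compiler believes is in EAX */
char out[512];    /* generated assembler text (stand-in for a file) */

/* Append "<before><name><after>" plus a newline to the output. */
static void emit(const char *before, const char *name, const char *after) {
    size_t used = strlen(out);
    snprintf(out + used, sizeof out - used, "%s%s%s\n", before, name, after);
}

/* Same decisions as CAC above, with the moves spelled out as text. */
int CAC(opd *x, opd *y) {
    if (AC == y) return 1;                               /* y already in EAX */
    if (AC != x) {
        if (AC != NULL) emit("mov [", AC->name, "], eax"); /* save old value */
        emit("mov eax, [", x->name, "]");                /* load x */
        AC = x;
    }
    return 0;
}

/* Generate code for z = x + y, leaving z in EAX. */
void gen_add(opd *x, opd *y, opd *z) {
    if (CAC(x, y)) emit("add eax, [", x->name, "]");
    else           emit("add eax, [", y->name, "]");
    AC = z;
}
```

With AC initially NULL, generating B + C and then adding D to that result produces one load and two adds: the second call finds its first operand already in EAX, so CAC emits nothing.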

4 Generating Mathematical Expressions
This section deals with the generation of code for a mathematical expression, such as “A +
B” or “A * B”, etc. This would correspond to a grammar rule such as “E :- E1 + E2”. Each of
the Es on the right hand side can correspond to a simple constant (int or float), an identifier
(a variable), another mathematical expression, or a function call.
Let's assume a bottom-up parser with code generation performed at the same time as
syntactic analysis. In this case, we generate code for “E :- E1 + E2” at the time of
reduction of the rule.
The recognition of E1 and E2 would also have generated some code, which would thus
appear in the assembly program before the code for the current rule. This code would
calculate the values of the right-hand-side Es.
The code we generate for the rule “E :- E1 + E2” depends on where the values calculated
for E1 and E2 are left. If E1 is a variable B and E2 is a constant, 120, we might simply
generate lines of code such as:
MOV EAX, [B]
ADD EAX, 120
If prior code left the value of B already in EAX, then we would not need to generate the
first line.
If E1 was itself a mathematical expression, we need to generate code keeping in mind
where the previously generated code left the value of E1 (possibly in EAX itself).
The other problem here is that in many languages, mathematical expressions can
combine data of different types (e.g., int, long, float). Often, the number of bytes of the
operands will determine the register which will be used to perform the operation.
One solution is to use conditional code to generate different assembly code depending on
where the values of the expression are currently stored. The problem splits into two parts:
1. Getting the values into the correct locations (at least one in a register of the
appropriate type for the operation, e.g., a float or int register).
2. Generating the assembler code to perform the operation (the operator needs to be
float or int).
Below we give a possible implementation for the realisation of “E+E” (sum). The example
assumes we are dealing with only three data types:
• Unsigned chars: 1 byte
• Int: 2 bytes
• Double (float): 8 bytes
These numbers can be from three sources:
• In the variable space
• A constant
• Already in a register
Where one of the numbers is an unsigned char, it is loaded into an int register, and this
register is used instead of the original location. In the process described below, it is then
treated as an int.
We need three distinct assembler operations:
• If both of the numbers are int, we use ADD x, y to add the numbers, leaving an int
in the location of x.
• If both of the numbers are double, we use the ‘FADD x’ operation. This operation
assumes a stack used for storing results. The first number is assumed to be at the
top of the stack, and the operation adds the operand to this location, leaving the
result in place of the original value (on top of the stack).
• If one number is a double, and the other an integer, the double is placed on the
top of the stack, and then an ‘FIADD x’ operation is used, which adds its integer
operand to the value on top of the stack, leaving the result in place of the original
value.
The following table could be used by the compiler as part of the generation of the
operation E+E. It allows two numbers, of whatever type, and wherever located, to be
added together.
Rows give the type of the x operand; columns give the type of the y operand.

x \ y           | unsigned char    | int              | Register int     | Constant int     | double            | Register double
----------------+------------------+------------------+------------------+------------------+-------------------+------------------
unsigned char   | Load x; Re-enter | Swap; Re-enter   | Swap; Re-enter   | Load x; Re-enter | Load x; Re-enter  | Load x; Re-enter
int             | Load y; Re-enter | Load y; Re-enter | ADD y,x          | Load y; Re-enter | Load x; Re-enter  | FIADD x
Register int    | Load y; Re-enter | ADD x,y          | ADD x,y          | ADD x,y          | MOV T,x; Re-enter | MOV T,x; Re-enter
Constant int    | Swap; Re-enter   | Swap; Re-enter   | Swap; Re-enter   | -                | Load x; Re-enter  | FADD x
double          | Load y; Re-enter | Load y; Re-enter | Swap; Re-enter   | Swap; Re-enter   | Load y; Re-enter  | FADD x
Register double | Swap; Re-enter   | Swap; Re-enter   | Swap; Re-enter   | Swap; Re-enter   | Swap; Re-enter    | FADD y

The table assumes there is a function “Load” within the compiler which places the named
value into a register of the appropriate type. This function is driven from the following table.
It assumes that the value to load is either unsigned char, int, int constant or double. The
table generates distinct code depending on whether you want to load the value into an int
or double register.
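Rows give the type of register to load into; columns give the type of operand to load.

Load into a register of type | unsigned char | int      | int constant | double
-----------------------------+---------------+----------+--------------+----------
int                          | XOR RH,RH     | MOV RX,x | MOV RX,x     | FLD x
                             | MOV RL,x      |          |              | FISTP x
                             |               |          |              | MOV RX,x
double                       | XOR RH,RH     | FILD x   | MOV T,x      | FLD x
                             | MOV RL,x      |          | FLD T        |
                             | MOV T,RX      |          |              |
                             | FLD T         |          |              |
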
Integer operations load their values into a 2-byte register RX. Each byte of RX can be
accessed individually: RH is the high byte, and RL is the low byte. The operation “XOR
RH,RH” sets all bits of RH to 0 (since the ‘exclusive or’ of two identical numbers is 0). If
the number to load is an unsigned char, the high byte is cleared, and the char is loaded into
the low byte. If the number to load is an integer, it is loaded into both bytes directly.
Float operations make use of the stack (an area of memory assigned for such operations).
The FLD operation loads the float operand onto the top of the stack. The FILD operation
loads an integer operand onto the top of the stack with 8 bytes of space.
Let's try an example. We start with code “S+7”, where S is a variable of type float, and 7 is
an integer constant. On the entry to the function, we have “x” (=S) as a double and “y” (=7)
as a constant int.
The code for this cell is “Swap; Re-enter”. This means that we swap the values of x and y,
and then restart the procedure.
Now we have x (=7) and y (=S), which means we look at the cell for x=const int and y =
double. The code for this cell is “load x; re-enter”. We therefore call “load x” with x as a
const int, which we want to put into a double register, and issue the assembler code:
    MOV T, 7
    FLD T
We then perform the “re-enter” command, and restart the routine with x in a double
register, and y still a double variable. The cell for this case is “Swap; re-enter”, so we
re-enter with x as a double variable, and y as a double register. We thus issue the
assembler:
    FADD S
…and are finished, with 3 assembler instructions issued.
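The “look up the cell, act, re-enter” procedure can be sketched in C as a loop over a 6x6 action table. This is only an illustration: the enum and function names are invented, the sketch tracks operand kinds and records which cells were visited rather than emitting assembler, and the rule in `loaded` for choosing the target register of a Load (unsigned char always goes to an int register; anything else goes to a double register when the other operand is a double) is an assumption the text does not spell out.

```c
#include <stdio.h>
#include <string.h>

/* Operand kinds, in the row/column order of the table. */
typedef enum { UCHAR, INT, REG_INT, CONST_INT, DOUBLE, REG_DOUBLE } kind;

/* Cell actions; the first four are followed by "Re-enter". */
typedef enum { LOAD_X, LOAD_Y, SWAP, MOV_T_X,
               ADD_XY, ADD_YX, FIADD_X, FADD_X, FADD_Y, NONE } action;

static const char *names[] = { "Load x", "Load y", "Swap", "MOV T,x",
    "ADD x,y", "ADD y,x", "FIADD x", "FADD x", "FADD y", "-" };

/* table[x][y], transcribed from the table for E+E. */
static const action table[6][6] = {
    /* unsigned char   */ { LOAD_X, SWAP,   SWAP,   LOAD_X, LOAD_X,  LOAD_X  },
    /* int             */ { LOAD_Y, LOAD_Y, ADD_YX, LOAD_Y, LOAD_X,  FIADD_X },
    /* Register int    */ { LOAD_Y, ADD_XY, ADD_XY, ADD_XY, MOV_T_X, MOV_T_X },
    /* Constant int    */ { SWAP,   SWAP,   SWAP,   NONE,   LOAD_X,  FADD_X  },
    /* double          */ { LOAD_Y, LOAD_Y, SWAP,   SWAP,   LOAD_Y,  FADD_X  },
    /* Register double */ { SWAP,   SWAP,   SWAP,   SWAP,   SWAP,    FADD_Y  },
};

static int is_double(kind k) { return k == DOUBLE || k == REG_DOUBLE; }

/* Kind of register an operand ends up in after Load (assumed rule). */
static kind loaded(kind k, kind other) {
    if (k == UCHAR) return REG_INT;
    return (k == DOUBLE || is_double(other)) ? REG_DOUBLE : REG_INT;
}

/* Run the procedure; writes the visited actions to trace, "; "-separated,
   and returns the final (non re-entering) action. */
action plan_add(kind x, kind y, char *trace, size_t n) {
    trace[0] = '\0';
    for (;;) {
        action a = table[x][y];
        size_t used = strlen(trace);
        snprintf(trace + used, n - used, "%s%s", used ? "; " : "", names[a]);
        switch (a) {
        case LOAD_X:  x = loaded(x, y); break;
        case LOAD_Y:  y = loaded(y, x); break;
        case SWAP:    { kind t = x; x = y; y = t; } break;
        case MOV_T_X: x = INT; break;   /* x now lives in temporary T */
        default:      return a;         /* final cell reached: stop */
        }
    }
}
```

For the S+7 example, plan_add(DOUBLE, CONST_INT, ...) visits the same four cells as the walkthrough: Swap, Load x, Swap, FADD x.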
5 Generating Conditional Instructions
5.1 The Status Flags and conditional jumping
A special register exists called the “FLAGS” register. It consists of a sequence of bits,
which are set (1) or unset (0). These flags are set as the result of arithmetic
operations such as ADD or SUB (on x86, MUL and DIV leave ZF and SF undefined, and the
x87 floating-point arithmetic instructions report their results in the FPU status word instead).
• ZF (Zero Flag): set if the operation results in a zero value, unset otherwise.
• SF (Sign Flag): set if operation results in a negative value, unset otherwise.
These flags can be referenced in conditional jump operations, e.g.,
jz L100 ; jump to L100 if last op resulted in zero
5.2 Integer Comparison: CMP
The NASM instruction CMP subtracts its second argument from the first. The result is not
stored anywhere, but the ZF and SF flags are set as a result of the operation. The CMP
instruction is thus usually followed by a conditional jump. Since CMP cannot take two
memory operands, one of the values is first loaded into a register, e.g.,
    MOV EAX, [A]
    CMP EAX, [B]
    JZ L1 ; jump if cmp result was zero

5.3 Simple If statements
If-then statements can be mapped into assembler as follows. Assume code like:
    if A == B then <stmt1>
Firstly, we generate code for the comparison, e.g.,
    MOV EAX, [A]
    CMP EAX, [B]
Then we generate code to jump over the code for <stmt1> if the test fails, e.g.,
    JNZ L1
Then we put the code for <stmt1>, e.g.
    ADD X, Y
On the line following this, we put the label that the jump refers to:
    L1: …
Putting these together:
    Source:     if A == B then <block>
    Assembler:  MOV EAX, [A]
                CMP EAX, [B]
                JNZ L1 ; skip <block> if A and B differ
                .... CODE FOR <block>
                L1:
We can use semantic actions to generate the assembler for the source structure. #A1 and
#A2 correspond to lambda rules with associated semantic actions, used as a means to
generate code in the correct location (e.g., in parsing “<stmt>:-if <exp> then #A1 <block>
#A2”, we reduce elements in the following order: <exp>, #A1, <block>, #A2 and then
<stmt>, and thus the semantic actions to produce code are performed in that order).
Attribute Grammar:
<stmt> :- if <exp> then #A1 <block> #A2
#A1 :- λ { Generate code to jump if exp non-zero (i.e., the test failed) }
#A2 :- λ { Generate line with label }
<exp> :- ...

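One way the two actions #A1 and #A2 could cooperate is through a stack of pending labels, so that nested if-statements pair each jump with the right label. The C sketch below illustrates the idea; the names `action_A1`, `action_A2`, `pending` and `emit` are invented for this example, the output goes to a buffer rather than a file, and it assumes the code for <exp> ends with a CMP whose non-zero result means the test failed.

```c
#include <stdio.h>
#include <string.h>

char out[512];        /* generated assembler (stand-in for the object file) */
int  next_label = 1;  /* counter used to invent unique labels */
int  pending[32];     /* labels that have been jumped to but not yet placed */
int  depth = 0;

/* Append "<text>L<label><tail>" plus a newline to the output. */
static void emit(const char *text, int label, const char *tail) {
    size_t used = strlen(out);
    snprintf(out + used, sizeof out - used, "%sL%d%s\n", text, label, tail);
}

/* #A1: after <exp>, jump over the block if the comparison failed. */
void action_A1(void) {
    int l = next_label++;
    pending[depth++] = l;
    emit("JNZ ", l, "");
}

/* #A2: after <block>, place the label the jump refers to. */
void action_A2(void) {
    emit("", pending[--depth], ":");
}
```

Because the parser reduces an inner if-statement completely before the outer #A2, the stack discipline makes the inner label come out first.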
5.4 If -else statements
If-else statements are a little more complex. A typical if-else statement might generate
code like:
    MOV EAX, [A]
    CMP EAX, [B]
    JNZ L1 ; test failed: go to the else part
    <block1 code>
    JMP L2
    L1: <block2 code>
    L2: …
Attribute Grammar:
<stmt> :- if <exp> then #A1 <block1> #A2 else <block2> #A3
#A1 :- λ { Generate code to jump to start_else if exp non-zero }
#A2 :- λ { Generate jmp to end_ifelse;
           Then generate label for start_else }
#A3 :- λ { Generate line with label for end_ifelse }
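The three actions can be sketched in C as follows. As before, this is an illustration under assumptions rather than a prescribed implementation: the function names, the `pending` label stack and the `emit` helper are invented, and JNZ is assumed to be the right jump for a failed test. The point to notice is that #A2 both consumes the start_else label created by #A1 and leaves the end_ifelse label behind for #A3.

```c
#include <stdio.h>
#include <string.h>

char out[512];        /* generated assembler (stand-in for the object file) */
int  next_label = 1;  /* counter used to invent unique labels */
int  pending[32];     /* labels awaiting placement */
int  depth = 0;

/* Append "<text>L<label><tail>" plus a newline to the output. */
static void emit(const char *text, int label, const char *tail) {
    size_t used = strlen(out);
    snprintf(out + used, sizeof out - used, "%sL%d%s\n", text, label, tail);
}

/* #A1: jump to start_else when the test fails. */
void action_A1(void) {
    int start_else = next_label++;
    pending[depth++] = start_else;
    emit("JNZ ", start_else, "");
}

/* #A2: end of <block1>: skip the else part, then place start_else. */
void action_A2(void) {
    int start_else = pending[--depth];
    int end_ifelse = next_label++;
    emit("JMP ", end_ifelse, "");
    emit("", start_else, ":");
    pending[depth++] = end_ifelse;
}

/* #A3: end of <block2>: place end_ifelse. */
void action_A3(void) {
    emit("", pending[--depth], ":");
}
```

Calling the three actions in order yields the JNZ L1 / JMP L2 / L1: / L2: skeleton of the example above, with the code for the two blocks falling between them.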
6 Generating Loops
6.1 While Loops
While loops map onto assembler in much the same way as if-statements. E.g., for
    while <exp> do <instructions> end
we can use:
<loop> :- while #A1 <exp> #A2 do <instructions> end #A3
#A1 :- λ { Generate line with a unique label for loop start }
#A2 :- λ { Generate line with jump to end if expr fails }
#A3 :- λ { Generate jump back to start, and label for loop end }
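A minimal C sketch of these three actions is given below. The names and the label stack are again assumptions made for illustration, and JNZ stands in for whatever jump signals that <exp> failed. Unlike the if-statement, the loop needs two labels alive at once, so #A3 pops both.

```c
#include <stdio.h>
#include <string.h>

char out[512];        /* generated assembler (stand-in for the object file) */
int  next_label = 1;  /* counter used to invent unique labels */
int  pending[32];     /* labels awaiting use */
int  depth = 0;

/* Append "<text>L<label><tail>" plus a newline to the output. */
static void emit(const char *text, int label, const char *tail) {
    size_t used = strlen(out);
    snprintf(out + used, sizeof out - used, "%sL%d%s\n", text, label, tail);
}

/* #A1: place a fresh label at the top of the loop. */
void action_A1(void) {
    int start = next_label++;
    pending[depth++] = start;
    emit("", start, ":");
}

/* #A2: after <exp>, leave the loop if the test failed. */
void action_A2(void) {
    int end = next_label++;
    pending[depth++] = end;
    emit("JNZ ", end, "");
}

/* #A3: after the body, jump back to the top and place the end label. */
void action_A3(void) {
    int end   = pending[--depth];
    int start = pending[--depth];
    emit("JMP ", start, "");
    emit("", end, ":");
}
```

The code for <exp> lands between the output of #A1 and #A2, and the loop body between #A2 and #A3, simply because that is the order in which the parser performs the reductions.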

Below is code from a real while loop:
topwhile:                           ; a label to mark the top of this WHILE loop
    mov eax, 3                      ; planning to invoke function 3: read from a file
    mov ebx, [infileID]             ; the file ID must be placed into register ebx
    mov ecx, mybyte                 ; the address of memory to receive file content
                                    ; must be placed into register ecx
    mov edx, 1                      ; the number of bytes to read is placed in edx
    int 80h                         ; invokes a kernel function according to
                                    ; the number in register eax
    cmp eax, 0                      ; check whether a byte was read
    je dunwhile                     ; skip the body if no bytes were read
    xor byte [mybyte], 00001111b    ; [] dereferences, thereby referring to the
                                    ; contents at mybyte
    mov eax, 4                      ; planning to invoke function 4: write to a file
    mov ebx, [outfileID]            ; the file ID must be placed into register ebx
    mov ecx, mybyte                 ; the address of memory to write from must be
                                    ; placed into register ecx
    mov edx, 1                      ; the no. of bytes to write must be placed in edx
    int 80h                         ; invokes a kernel function according to no. in eax
    jmp topwhile                    ; go back to the top of the loop
dunwhile:                           ; jump to here if no byte is read

6.2 Repeat Loops
<loop> :- repeat #A1 <instructions> until <exp> #A2
#A1: Generate unique label for loop start
#A2: Generate jump to end if last result zero
     Generate unconditional jump back to beginning
     Generate label for loop end
<instructions>: Code for <instructions> generated by other productions
<exp>: Code for <exp> generated by other productions
7 Generating Code from Functions
This section covers the generation of code for functions. This includes the generation of
function calls and the generation of the code of the function body itself. Three important
issues here are:
1. How are the parameters passed to the function?
2. How are local variables represented within the function?
3. How are values returned from the function?
There are many possible ways to implement functions. Basically, it is up to the person
writing the code generator to decide how to do it. We describe here one of the more
standard ways of generating functions and function calls.
7.1 The Stack Space
Our implementation of functions depends heavily on the use of a stack in the program
memory. Many assemblers assign part of the addressable memory of the program to a
stack to hold information about the current variable context. Basically, when we enter a
function, space is allocated on top of the stack for the local variables, and when we exit
from the function, this allocated space is popped off the stack. The stack thus represents
the embedded block structure we discussed under symbol tables.
The stack typically starts at the top of addressable memory, and expands downwards. So,
assuming we have 1000 bytes of addressable memory, the “bottom of the stack” will be at
address 1000. If we push a 2-byte integer onto the stack, it will occupy bytes 999-1000. If
we then push an 8-byte float value onto this stack, it will occupy bytes 991-998.
A register called SP (for Stack Pointer) indicates the top of the stack. In some systems, SP
will point at the next free location in the stack. In others, it points to the lowest byte of the
top element of the stack. We will assume the latter approach, so in the above case, after
pushing the two numbers, SP would contain 991.
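The arithmetic of a downward-growing stack can be checked with a few lines of C. The sketch models only the stack pointer, not the memory itself. The `range` type and `push` function are invented for this example, and the starting value 1001 (one past the highest address, meaning an empty stack) is an assumption chosen to match the convention that SP points at the lowest byte of the top element.

```c
/* Byte range occupied by a pushed value (inclusive addresses). */
typedef struct { int lo, hi; } range;

/* Empty stack in a 1000-byte memory: SP sits one past address 1000. */
int SP = 1001;

/* Push `size` bytes: the stack grows towards lower addresses. */
range push(int size) {
    SP -= size;
    range r = { SP, SP + size - 1 };
    return r;
}
```

Pushing 2 bytes and then 8 bytes from this starting point gives the ranges 999-1000 and 991-998, and leaves SP at 991, as in the text.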

7.2 The Function Call
Before calling the function, parameters are pushed onto the stack. These can then be
accessed by the called routine, from the top of the stack. So that the parameters are
available in the required order, they are pushed onto the stack in reverse order.
The Call in Source Code:
    rutina(a, b)
The Call in Assembler:
    ...
    PUSH b
    PUSH a
    CALL rutina
    ...
The NASM instruction “CALL” firstly pushes the address of the following instruction onto
the stack. This will be used as the return address when the function call returns.
7.3 Entering the Routine
On entering the routine, the routine firstly establishes the boundaries of the local space of
the stack. A register BP (Base Pointer) is used to indicate the lowest point of the stack
which is part of the current context. Consequently, the first thing a routine does on entry is
to store the old value of BP onto the stack (for later recovery and restoration), and then
reset the BP to point at the current top of the stack (which is the point from which the local
context will grow).
The first lines of any routine will thus be something like the following:
rutina:
PUSH BP
MOV BP,SP
‘rutina’ is the name of the function, represented as a label in assembler. The old value of
BP is pushed onto the stack, and then BP is reset to the value of SP (top of the stack).

7.4 Allocating Space for Local Variables
The next step is to allocate stack space for the local variables. The compiler works out
how many bytes of memory are required for the local variables, and decrements the stack
pointer (the stack grows down, remember) by this amount. In the following example, each
int takes 2 bytes and the double takes 8 bytes, a total of 14 bytes.
Source code:
int rutina (int a, char *b)
{
int i, j, k;
double r;
. . .
}
Assembler Code:
rutina:
PUSH BP
MOV BP,SP
SUB SP,14
...
7.5 Referring to parameters and local variables
In the body of the function, rather than referring to variables by name, one references
them in terms of offsets from the base pointer.
Parameters: parameters were pushed onto the stack BEFORE the function was called, and
are thus part of the previous context, located above BP. In the above example,
parameters a and b can be accessed using [BP+6] and [BP+8] (note the 6 bytes used to
store the old BP and the return address).
Local Variables: The local variables are available under BP in memory. i, j and k are thus
available as, respectively: [BP-2], [BP-4], [BP-6]. r starts at [BP-14].
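After the parameters, the program address to return to is pushed on the stack, followed by
the old BP and the local variables. The resulting stack layout (highest addresses first) is:
    Offset     Contents
    [BP+8]     b
    [BP+6]     a
    [BP+2]     return address
    [BP]       old_bp            <- BP
    [BP-2]     i
    [BP-4]     j
    [BP-6]     k
    [BP-14]    r                 <- SP
               Free Memory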
On entering the routine, space is allocated for local variables of that routine.
On leaving the routine, the part of the stack used by the routine can be ‘popped’.
Recursive routines thus have separate memory space.
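The offsets above follow mechanically from the sizes of the declarations, which is what the compiler would compute from its symbol table. The C sketch below shows one way to do this; the function names and the `linkage` parameter (the 6 bytes taken by the old BP and the return address in this example) are assumptions made for illustration.

```c
/* Fill off[i] with the BP-relative offset of the i-th local variable,
   given each variable's size in bytes. Returns the total number of
   bytes to subtract from SP on entry. */
int local_offsets(const int *size, int n, int *off) {
    int total = 0;
    for (int i = 0; i < n; i++) {
        total += size[i];
        off[i] = -total;      /* locals live below BP */
    }
    return total;
}

/* Fill off[i] with the BP-relative offset of the i-th parameter.
   `linkage` is the space taken by the old BP and the return address.
   Returns the number of bytes the caller must remove after the call. */
int param_offsets(const int *size, int n, int linkage, int *off) {
    int at = linkage;
    for (int i = 0; i < n; i++) {
        off[i] = at;          /* parameters live above BP */
        at += size[i];
    }
    return at - linkage;
}
```

For locals of sizes 2, 2, 2 and 8 this gives the offsets -2, -4, -6 and -14 with a total of 14 bytes; for two 2-byte parameters and 6 bytes of linkage it gives +6 and +8, with 4 bytes for the caller to clean up.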
7.6 Placing The function’s code
After we generate the line to allocate space for the local variables, we then generate the
code for the body of the function, followed by the code that leaves the routine. Firstly, the
line MOV SP,BP resets the stack pointer to the value it had just after the old BP was saved
(we thus pop all the local stack space off the stack). At this point, the top element on the
stack is the old BP. We can thus issue the instruction POP BP, which pops this element off
the stack into BP, thus resetting BP to its prior value (SP is also moved up two bytes).
At this point, the element on top of the stack is the address where execution should
resume in the calling context. The RET instruction pops this element off the stack, and
resumes execution from that address.
Back in the calling function, after the function call, we then need to wipe the function
parameters off the stack. We do this simply by ADD SP,4.
The calling code:
PUSH b
PUSH a
CALL rutina
ADD SP,4
…
The routine code:
rutina:
PUSH BP
MOV BP,SP
SUB SP,14
...
MOV SP,BP
POP BP
RET
...
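Since the entry, exit and call sequences have a fixed shape, the code generator can produce them from a handful of values. The C sketch below shows this with three hypothetical helpers that write into a caller-supplied buffer; their names and signatures are assumptions for illustration, and a real generator would take the byte counts from the symbol table.

```c
#include <stdio.h>

/* Entry sequence: save BP, set up the new frame, reserve the locals. */
int gen_prologue(char *buf, size_t n, const char *name, int local_bytes) {
    return snprintf(buf, n, "%s:\nPUSH BP\nMOV BP,SP\nSUB SP,%d\n",
                    name, local_bytes);
}

/* Exit sequence: drop the locals, restore BP, return to the caller. */
int gen_epilogue(char *buf, size_t n) {
    return snprintf(buf, n, "MOV SP,BP\nPOP BP\nRET\n");
}

/* Call sequence: push the arguments in reverse order, call, clean up.
   Assumes n > 0. */
int gen_call(char *buf, size_t n, const char *name,
             const char *const *args, int argc, int arg_bytes) {
    size_t used = 0;
    buf[0] = '\0';
    for (int i = argc - 1; i >= 0 && used < n; i--)
        used += (size_t)snprintf(buf + used, n - used, "PUSH %s\n", args[i]);
    if (used < n)
        used += (size_t)snprintf(buf + used, n - used,
                                 "CALL %s\nADD SP,%d\n", name, arg_bytes);
    return (int)used;
}
```

With the name rutina, 14 bytes of locals, and the arguments a and b occupying 4 bytes, these helpers reproduce the calling code and routine code listed above.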
7.7 Returning Values
A function may or may not return a value. There are various ways to return a value, and it
is up to the compiler writer to decide how it is done. One way is to leave the returned value
in the EAX register (if it fits), or in a float register for larger numbers. Alternatively, the
number could be placed on the stack, to be popped later by the calling routine.
Source Program:
    . . .
    rutina(a, b)
    . . .
    int rutina (int a, char *b)
    {
        int i, j, k;
        double r;
        . . .
        return k;
    }
Object Program:
    PUSH b
    PUSH a
    CALL rutina
    ADD SP,4
    ...
    rutina:
    PUSH BP
    MOV BP,SP
    SUB SP,14
    ...
    MOV EAX, [BP-6] ; move k into EAX
    MOV SP,BP
    POP BP
    RET
