Tomasulo Algorithm

ASSIGNMENT # 1
Subject
“COMPUTER ARCHITECTURE”
Teacher
“Ma’am Aden Iqbal”
By
“Farwa Abdul Hannan”
(12-CS-13)
Monday, 28 March, 2016
NFC – INSTITUTE OF ENGINEERING AND
FERTILIZER RESEARCH, FSD

Tomasulo Algorithm
1) Consider the code sequence shown below.
LD F6, 12(R2)
LD F2, 16(R3)
ADDD F0, F2, F4
DIVD F10, F0, F6
SUBD F8, F6, F2
ADDI R2, R2, 8
ADDI R3, R3, 16
ADDD F6, F8, F2
a) Identify all WAR, WAW, and RAW dependencies in the instruction stream.
WAR:
  SUBD F8, F6, F2   ->  ADDD F6, F8, F2
WAW:
  LD F6, 12(R2)     ->  ADDD F6, F8, F2
RAW:
  LD F2, 16(R3)     ->  ADDD F0, F2, F4
  ADDD F0, F2, F4   ->  DIVD F10, F0, F6
  LD F6, 12(R2)     ->  SUBD F8, F6, F2
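
The hazard search can also be done mechanically. The following is a minimal sketch, not part of the original assignment: the names PROGRAM and find_hazards are hypothetical, and the check is a plain pairwise comparison of destination and source registers (it includes the integer registers R2 and R3 and every overlapping pair, so it may report more pairs than the table above lists).

```python
from itertools import combinations

# (mnemonic, destination register, source registers), in program order.
# The encoding is an assumption made for this sketch.
PROGRAM = [
    ("LD",   "F6",  ("R2",)),
    ("LD",   "F2",  ("R3",)),
    ("ADDD", "F0",  ("F2", "F4")),
    ("DIVD", "F10", ("F0", "F6")),
    ("SUBD", "F8",  ("F6", "F2")),
    ("ADDI", "R2",  ("R2",)),
    ("ADDI", "R3",  ("R3",)),
    ("ADDD", "F6",  ("F8", "F2")),
]


def find_hazards(program):
    """Return RAW/WAR/WAW hazards as (earlier, later, register) index triples."""
    hazards = {"RAW": [], "WAR": [], "WAW": []}
    for (i, (_, dst_i, src_i)), (j, (_, dst_j, src_j)) in combinations(
        enumerate(program), 2
    ):
        if dst_i in src_j:      # later instruction reads what the earlier one writes
            hazards["RAW"].append((i, j, dst_i))
        if dst_j in src_i:      # later instruction overwrites what the earlier one reads
            hazards["WAR"].append((i, j, dst_j))
        if dst_i == dst_j:      # both instructions write the same register
            hazards["WAW"].append((i, j, dst_i))
    return hazards


if __name__ == "__main__":
    for kind, triples in find_hazards(PROGRAM).items():
        for i, j, reg in triples:
            print(f"{kind}: instruction {i + 1} -> instruction {j + 1} on {reg}")
```

Instruction numbers in the printed output are 1-based positions in the code sequence of question 1.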
b) Draw a pipeline diagram of how instructions would issue in a machine using
Tomasulo’s algorithm as discussed in class. Assume that the FP Add unit has 4
EX phases, the FP Multiply unit has 7 EX phases, and divide has 24 EX phases.
FP Adds, Subtracts, and Multiplies are fully-pipelined, while divide operations
are NOT pipelined.
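
The latency assumptions above can be written down as a small lookup table. This is only an illustrative sketch: the names LATENCY, PIPELINED, last_ex_cycle and unit_free_again are hypothetical, the MULTD mnemonic is assumed for the FP multiply, and the sketch assumes that an operation starting execution in cycle s with n EX phases finishes in cycle s + n - 1.

```python
# Number of EX phases per operation, as stated in the question.
LATENCY = {"ADDD": 4, "SUBD": 4, "MULTD": 7, "DIVD": 24}

# Adds, subtracts and multiplies are fully pipelined; divide is not.
PIPELINED = {"ADDD": True, "SUBD": True, "MULTD": True, "DIVD": False}


def last_ex_cycle(op, start_cycle):
    """Cycle in which an operation that starts EX in start_cycle finishes EX."""
    return start_cycle + LATENCY[op] - 1


def unit_free_again(op, start_cycle):
    """First cycle in which the same unit can start another operation."""
    if PIPELINED[op]:
        return start_cycle + 1              # a new operation can enter every cycle
    return start_cycle + LATENCY[op]        # unit is occupied for the whole latency


if __name__ == "__main__":
    print(last_ex_cycle("ADDD", 6))      # an add starting in cycle 6
    print(unit_free_again("DIVD", 11))   # a divide starting in cycle 11
```

Under these assumptions, an ADDD that starts executing in cycle 6 finishes in cycle 9, which is the execute cycle the tables below record for ADDD F0, F2, F4.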

Cycles 1, 2, 3

Instruction status:
Instruction  j     k     Issue  Execute  Write
LD    F6     12+   R2    1      3
LD    F2     16+   R3    2
ADDD  F0     F2    F4    3
DIVD  F10    F0    F6
SUBD  F8     F6    F2
ADDI  R2     R2    8
ADDI  R3     R3    16
ADDD  F6     F8    F2

Load buffers:
Name   Busy  Address
Load1  Yes   12+R2
Load2  Yes   16+R3
Load3  No

Reservation stations:
Time  Name   Busy  Op    Vj     Vk     Qj     Qk
      ADD1   Yes   ADDD         R(F4)  Load2
      ADD2   No
      ADD3   No
      MULT1  No
      MULT2  No

Register result status:
Clock      F0     F2     F4     F6      F8     F10       F12  F14
3      FU  ADD1   Load2         Load1

Cycle 4

Instruction status:
Instruction  j     k     Issue  Execute  Write
LD    F6     12+   R2    1      3        4
LD    F2     16+   R3    2      4
DIVD  F10    F0    F6    4
SUBD  F8     F6    F2
ADDI  R2     R2    8
ADDI  R3     R3    16
ADDD  F6     F8    F2

Load buffers:
Name   Busy  Address
Load1  No
Load2  Yes   16+R3

Reservation stations:
Time  Name   Busy  Op    Vj     Vk     Qj     Qk
      ADD1   Yes   ADDD         R(F4)  Load2
      ADD2   No
      ADD3   No
      MULT1  Yes   DIVD         M(A1)  ADD1
      MULT2  No

Register result status:
Clock      F0     F2     F4     F6      F8     F10       F12  F14
4      FU  ADD1   Load2         M(A1)          MULT1

Cycle 5

Instruction status:
Instruction  j     k     Issue  Execute  Write
LD    F6     12+   R2    1      3        4
LD    F2     16+   R3    2      4        5
DIVD  F10    F0    F6    4
SUBD  F8     F6    F2    5
ADDI  R2     R2    8
ADDI  R3     R3    16
ADDD  F6     F8    F2

Load buffers:
Name   Busy  Address
Load1  No
Load2  No

Reservation stations:
Time  Name   Busy  Op    Vj     Vk     Qj     Qk
4     ADD1   Yes   ADDD  M(A2)  R(F4)
4     ADD2   Yes   SUBD  M(A1)  M(A2)
      ADD3   No
      MULT2  No

Register result status:
Clock      F0     F2     F4     F6      F8     F10       F12  F14
5      FU  ADD1   M(A2)         M(A1)   ADD2   MULT1

Cycle 6

Instruction status:
Instruction  j     k     Issue  Execute  Write
LD    F6     12+   R2    1      3        4
LD    F2     16+   R3    2      4        5
DIVD  F10    F0    F6    4
SUBD  F8     F6    F2    5
ADDI  R2     R2    8     6      6
ADDI  R3     R3    16
ADDD  F6     F8    F2

Load buffers:
Name   Busy  Address
Load1  No
Load2  No

Reservation stations:
Time  Name   Busy  Op    Vj     Vk     Qj     Qk
      ADD3   No
      MULT2  No

Cycle 7

Instruction status:
Instruction  j     k     Issue  Execute  Write
LD    F6     12+   R2    1      3        4
LD    F2     16+   R3    2      4        5
DIVD  F10    F0    F6    4
SUBD  F8     F6    F2    5
ADDI  R2     R2    8     6      6        7
ADDI  R3     R3    16    7      7
ADDD  F6     F8    F2

Load buffers:
Name   Busy  Address
Load1  No
Load2  No

Reservation stations:
Time  Name   Busy  Op    Vj     Vk     Qj     Qk
      ADD3   No
      MULT2  No

Cycle 8

Instruction status:
Instruction  j     k     Issue  Execute  Write
LD    F6     12+   R2    1      3        4
LD    F2     16+   R3    2      4        5
DIVD  F10    F0    F6    4
SUBD  F8     F6    F2    5
ADDI  R2     R2    8     6      6        7
ADDI  R3     R3    16    7      7        8
ADDD  F6     F8    F2    8

Load buffers:
Name   Busy  Address
Load1  No
Load2  No

Reservation stations:
Time  Name   Busy  Op    Vj     Vk     Qj     Qk
      ADD3   Yes   ADDD         M(A2)  ADD2
      MULT2  No

Register result status:
Clock      F0     F2     F4     F6      F8     F10       F12  F14
8      FU  ADD1   M(A2)         ADD3    ADD2   MULT1

Cycle 9

Instruction status:
Instruction  j     k     Issue  Execute  Write
LD    F6     12+   R2    1      3        4
LD    F2     16+   R3    2      4        5
ADDD  F0     F2    F4    3      9
DIVD  F10    F0    F6    4
SUBD  F8     F6    F2    5      9
ADDI  R2     R2    8     6      6        7
ADDI  R3     R3    16    7      7        8
ADDD  F6     F8    F2    8

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation stations:
Time  Name   Busy  Op    Vj     Vk     Qj     Qk
      MULT2  No

Cycle 10

Instruction status:
Instruction  j     k     Issue  Execute  Write
LD    F6     12+   R2    1      3        4
LD    F2     16+   R3    2      4        5
ADDD  F0     F2    F4    3      9        10
DIVD  F10    F0    F6    4
SUBD  F8     F6    F2    5      9
ADDI  R2     R2    8     6      6        7
ADDI  R3     R3    16    7      7        8
ADDD  F6     F8    F2    8

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation stations:
Time  Name   Busy  Op    Vj     Vk     Qj     Qk
      ADD1   No
      ADD2   Yes   SUBD  M(A1)  M(A2)
4     ADD3   Yes   ADDD  M-M    M(A2)
24    MULT1  Yes   DIVD  M+R4   M(A1)
      MULT2  No

Register result status:
Clock      F0     F2     F4     F6      F8     F10       F12  F14
10     FU  M+R4   M(A2)         ADD3    ADD2   MULT1

Cycle 11

Instruction status:
Instruction  j     k     Issue  Execute  Write
LD    F6     12+   R2    1      3        4
LD    F2     16+   R3    2      4        5
DIVD  F10    F0    F6    4
SUBD  F8     F6    F2    5      9        11
ADDI  R2     R2    8     6      6        7
ADDI  R3     R3    16    7      7        8
ADDD  F6     F8    F2    8

Load buffers:
Name   Busy  Address
Load1  No
Load2  No

Reservation stations:
Time  Name   Busy  Op    Vj     Vk     Qj     Qk
      ADD1   No
      ADD2   No
      MULT2  No

Register result status:
Clock      F0     F2     F4     F6      F8     F10       F12  F14
11     FU  M+R4   M(A2)         ADD3    M-M    MULT1

Cycle 14

Instruction status:
Instruction  j     k     Issue  Execute  Write
LD    F6     12+   R2    1      3        4
LD    F2     16+   R3    2      4        5
DIVD  F10    F0    F6    4
SUBD  F8     F6    F2    5      8        9
ADDI  R2     R2    8     6      6        7
ADDI  R3     R3    16    7      7        8
ADDD  F6     F8    F2    8      14

Load buffers:
Name   Busy  Address
Load1  No
Load2  No

Reservation stations:
Time  Name   Busy  Op    Vj     Vk     Qj     Qk
      ADD1   No
      ADD2   No
      MULT2  No

Register result status:
Clock      F0     F2     F4     F6      F8     F10       F12  F14
14     FU  M+R4   M(A2)         ADD3    M-M    MULT1

Cycle 15

Instruction status:
Instruction  j     k     Issue  Execute  Write
LD    F6     12+   R2    1      3        4
LD    F2     16+   R3    2      4        5
DIVD  F10    F0    F6    4
SUBD  F8     F6    F2    5      8
ADDI  R2     R2    8     6      6        7
ADDI  R3     R3    16    7      7        8
ADDD  F6     F8    F2    8      14       15

Load buffers:
Name   Busy  Address
Load1  No
Load2  No

Reservation stations:
Time  Name   Busy  Op    Vj     Vk     Qj     Qk
      ADD1   No
      ADD2   No
      ADD3   No
      MULT2  No

Register result status:
Clock      F0     F2     F4     F6      F8     F10       F12  F14
15     FU  M+R4   M(A2)         M-M+M   M-M    MULT1

Cycle 35

Instruction status:
Instruction  j     k     Issue  Execute  Write
LD    F6     12+   R2    1      3        4
LD    F2     16+   R3    2      4        5
DIVD  F10    F0    F6    4      35
SUBD  F8     F6    F2    5      8        9
ADDI  R2     R2    8     6      6        7
ADDI  R3     R3    16    7      7        8
ADDD  F6     F8    F2    8      14       15

Load buffers:
Name   Busy  Address
Load1  No
Load2  No

Reservation stations:
Time  Name   Busy  Op    Vj     Vk     Qj     Qk
      ADD1   No
      ADD2   No
      ADD3   No
      MULT2  No

Register result status:
Clock      F0     F2     F4     F6      F8     F10       F12  F14
35     FU  M+R4   M(A2)         M-M+M   M-M    MULT1

Cycle 36

Instruction status:
Instruction  j     k     Issue  Execute  Write
LD    F6     12+   R2    1      3        4
LD    F2     16+   R3    2      4        5
DIVD  F10    F0    F6    4      35       36
SUBD  F8     F6    F2    5      8        9
ADDI  R2     R2    8     6      6        7
ADDI  R3     R3    16    7      7        8
ADDD  F6     F8    F2    8      14       15

Load buffers:
Name   Busy  Address
Load1  No
Load2  No

Reservation stations:
Time  Name   Busy  Op    Vj     Vk     Qj     Qk
      ADD1   No
      ADD2   No
      ADD3   No
      MULT1  No
      MULT2  No

Register result status:
Clock      F0     F2     F4     F6      F8     F10       F12  F14
36     FU  M+R4   M(A2)         M-M+M   M-M    (M+R4)/M

c) Tomasulo’s algorithm has a disadvantage. Only one result can complete per
clock, per CDB. Using the same latencies as above, find a code sequence of no
more than 12 instructions where Tomasulo’s algorithm must stall due to CDB
contention. Indicate where this occurs in your sequence.
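The stall occurs in cycle 9 of the sequence from part b. ADDD F0, F2, F4 and
SUBD F8, F6, F2 both finish executing in cycle 9, but only one result can be
broadcast on the CDB per cycle. ADDD F0 writes its result in cycle 10, so
SUBD F8 has to wait until cycle 11 (see the Cycle 10 and Cycle 11 tables in part b).

Cycle 9

Instruction status:
Instruction  j     k     Issue  Execute  Write
LD    F6     12+   R2    1      3        4
LD    F2     16+   R3    2      4        5
DIVD  F10    F0    F6    4
SUBD  F8     F6    F2    5      9
ADDI  R2     R2    8     6      6        7
ADDI  R3     R3    16    7      7        8
ADDD  F6     F8    F2    8

Load buffers:
Name   Busy  Address
Load1  No
Load2  No

Reservation stations:
Time  Name   Busy  Op    Vj     Vk     Qj     Qk
      MULT2  No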
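
The effect of a single CDB can be illustrated with a short sketch. It is not taken from the assignment: the function name assign_write_cycles is hypothetical, and it assumes that a result is broadcast in the cycle after execution completes and that, when two results are ready together, the instruction earlier in program order gets the bus first.

```python
def assign_write_cycles(exec_done):
    """Assign write-result cycles with at most one CDB broadcast per cycle.

    exec_done: list of (name, cycle in which EX completes), in program order.
    Returns a dict mapping each name to its write cycle.
    """
    writes = {}
    taken = set()
    # sorted() is stable, so instructions finishing in the same cycle
    # keep their program order (assumed priority rule).
    for name, done in sorted(exec_done, key=lambda item: item[1]):
        cycle = done + 1
        while cycle in taken:   # CDB already used in this cycle: wait one more
            cycle += 1
        taken.add(cycle)
        writes[name] = cycle
    return writes


if __name__ == "__main__":
    # Completion cycles taken from the Cycle 9 table of part b.
    print(assign_write_cycles([("ADDD F0", 9), ("SUBD F8", 9)]))
```

With both instructions completing in cycle 9, the sketch assigns cycle 10 to ADDD F0 and cycle 11 to SUBD F8, matching the write cycles recorded in part b.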
2) Evaluate the performance of several implementation options for
the following workload:
LOOP:
L.D F3, R4(R6) # F3 = MEM[r4+r6]
MUL.D F4, F3, F2 # F4 = F3*F2
S.D F4, R3(R6) # MEM[R3+R6] = F4
A.D F4, F3, F3 # F4 = F3+F3
A.D F10, F10, F4 # F10 = F10 + F4
DSUBUI R6,R6, #4 # R6 = R6 - 4
BNEQ R6, loop # if R6 != 0, jump to LOOP
Assume the processor implements Tomasulo’s algorithm (with reservation stations and no reorder
buffer), as well as the following:
• A single instruction is issued per cycle.
• None of the functional units is pipelined.
• No forwarding between or within functional units; results are communicated via the single
CDB.
• The memory execution unit takes 3 cycles for a load and 2 cycles for a store. Loads and stores
have separate reservation stations, but only one load or one store can execute at any one time,
since they share the memory port.
• Issue and write result stages require one cycle each. Address generation is performed
separately from the ALU, in the load and store buffers.
• Branches execute in the integer unit, and instructions issued after a branch wait until the
branch has been resolved and broadcast on the CDB.
Functional Unit Queues and Latencies:
Functional Unit   # of Functional Units   Latency (cycles in EX)   # of Reservation Stations
Memory – Load     1                       3                        2
Memory – Store    1                       2                        2
Integer           1                       1                        5
FP – Add          1                       4                        3
FP – Multiply     1                       2                        2
a) Perform a simulation of the first two iterations for a single issue architecture.
Create the table below
Iteration 1
Instruction  j     k     Issue  Execute  Write
L.D     F3   R4    R6    1      2        6
MUL.D   F4   F3    F2    2      6        17
S.D     F4   R3    R6    3      17       21
A.D     F4   F3    F3    4      21       26
A.D     F10  F10   F4    5      26       31
DSUBUI  R6   R6    #4    6      31       36
BNEQ    R6   loop        7      37

Iteration 2
Instruction  j     k     Issue  Execute  Write
L.D     F3   R4    R6    8      38       42
MUL.D   F4   F3    F2    9      42       53
S.D     F4   R3    R6    10     53       56
A.D     F4   F3    F3    11     56       61
A.D     F10  F10   F4    12     61       66
DSUBUI  R6   R6    #4    13     66       71
BNEQ    R6   loop        14     71
b) What is the performance bottleneck?
The bottleneck is the delay in moving data through the processor. It arises when the
available bandwidth cannot carry results as fast as they are produced. In this machine,
every result has to be communicated over the single CDB.
c) What is the “steady state” of this loop – that is how many cycles will an average
loop iteration take if loop startup and shutdown effects are ignored?
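Going by the tables above, each instruction starts executing 35 to 36 cycles later in iteration 2
than in iteration 1 (for example, L.D in cycle 2 versus cycle 38), so an average iteration takes
roughly 36 cycles in the steady state. The loop stops iterating once R6 reaches zero.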
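
One way to estimate the steady-state iteration time is to compare the two iteration tables from part a row by row. The sketch below is only an illustration: the names EXEC_ITER1, EXEC_ITER2, iteration_gaps and average_gap are hypothetical, the numbers are the Execute-cycle entries copied from those tables, and BNEQ is left out because its row has only two entries.

```python
# Execute-cycle column of the two iteration tables in part a.
EXEC_ITER1 = {"L.D": 2, "MUL.D": 6, "S.D": 17, "A.D F4": 21, "A.D F10": 26, "DSUBUI": 31}
EXEC_ITER2 = {"L.D": 38, "MUL.D": 42, "S.D": 53, "A.D F4": 56, "A.D F10": 61, "DSUBUI": 66}


def iteration_gaps(first, second):
    """Cycles between the same instruction in two consecutive iterations."""
    return {name: second[name] - first[name] for name in first}


def average_gap(first, second):
    """Mean distance in cycles between the two iterations."""
    gaps = iteration_gaps(first, second)
    return sum(gaps.values()) / len(gaps)


if __name__ == "__main__":
    print(iteration_gaps(EXEC_ITER1, EXEC_ITER2))
    print(average_gap(EXEC_ITER1, EXEC_ITER2))
```

With only two iterations simulated, this gives an estimate of the cycles per iteration rather than a guaranteed steady-state value.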
d) Where will the first issue stall occur?
The first stall occurs at the second instruction, MUL.D F4, F3, F2. It needs F3, which the
L.D has not yet produced, so it is delayed by a RAW dependence.
