Single Core Design Space Exploration Project Report
Vishesh Chanana
UIN:665319085
University of Illinois, Chicago
773-703-6742
ABSTRACT
SimpleScalar is a simulation tool created by Todd
Austin as part of his PhD work. The simulator is open
source: it is freely available and can be modified by
anyone. It is written in C and is used to compare the
performance of different machine configurations.
SimpleScalar is widely used for research purposes.
In this project, we are tasked with finding the most
suitable SimpleScalar architecture for a given benchmark
application using the Wattch (sim-outorder) simulator.
We compute the Energy-Delay Product (EDP) for different
configurations and then analyze the results. Based on our
analysis, we combine the configurations that give the
lowest EDP and finalize the best configuration for our
architecture.
1. INTRODUCTION
With rising competition in the computer market, the
cost of a computer has become a major factor.
Companies continuously look at various techniques and
technologies to lower the cost of their machines. From
using cheaper but stronger materials for the structure of
a computer to putting more and more transistors on a
single chip, companies were able to lower their costs to
a certain extent, but this cost reduction had its own
effect on power consumption, which started increasing
at a fast rate.
Power consumption depends on many factors, such as
the clock tree, the control and data paths, memory, and
registers. Varying these factors can bring down the
power consumption of a computer, but that gives rise to
another problem: the decrease in power consumption
decreases the speed of the computer. So, to get a good
machine, companies started looking at the Energy-Delay
Product (EDP) of the computer to find an optimum
design point.
The EDP can be defined as the product of the average
power consumed and the square of the propagation
delay. An optimum computer should not use excessive
power, and its cycles per instruction (CPI) should not be
so high as to inflate the execution time.
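The branch-prediction and functional-unit tables later in
this report are consistent with computing EDP as average
power multiplied by the square of CPI. A minimal sketch of
that calculation in Python (assuming CPI and average power
are read off the simulator output; the function name is ours):

def edp(avg_power_watts, cpi):
    # Energy-delay product as used in this report: power x delay^2,
    # with delay proxied by cycles per instruction (CPI).
    return avg_power_watts * cpi ** 2

# Values from the XOR row of the merge-function table in Section 2.1:
# IPC = 1.523 -> CPI = 0.6566, average power = 18.3145 W
print(edp(18.3145, 0.6566))   # ~7.8958, matching the reported EDP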
In this project, we divide the configuration parameters
of a computer architecture into four groups: branch
prediction, memory system (cache configuration),
functional units, and data path & others. We then
simulate the architecture while varying the parameters
of each group separately. To select the best value for a
given parameter, we look at its EDP and pick the value
with the lowest EDP. We then obtain the EDP for each
group as a whole, and after finding the lowest-EDP
configuration of every group, we find the EDP of the
architecture that combines the winning values from all
the groups.
For this experiment we installed a virtual machine
running the 32-bit Ubuntu OS, and then installed the
SimpleScalar toolset on top of Ubuntu. The VM
configuration was as follows: 2048 MB RAM, 23 MB video
memory, and a 20 GB hard disk. The remote desktop
server and video capture were disabled, as they were
not required during this experiment.
2. RESULTS
2.1 Branch Prediction
We start by finding the best possible branch prediction
method that we could implement in our architecture. A
branch predictor is a circuit that guesses which way a
branch will go before the outcome is known for sure,
and it helps increase the speed of a program. There are
about five types of branch predictors: the not-taken
predictor, the taken predictor, the perfect predictor,
the bimodal predictor, and the 2-level predictor. In this
project we are concerned with only the bimodal
predictor and the 2-level predictor. Further, we also vary
the return address stack size and the Branch Target
Buffer size to see their effects on the EDP of the
architecture.
The bimodal branch predictor, also known as the direct
history table, is one of the cheapest and simplest branch
predictors in use. It uses the branch address to index a
prediction table that predicts the outcome of the branch.
Being the simplest predictor, the only thing we can vary
is the table size, and only in powers of two (2^order
entries). The figure referenced below shows the variation
of the EDP with respect to the table size of a bimod
predictor.
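For intuition, a bimodal predictor can be modeled as a table
of 2-bit saturating counters indexed by the low-order bits of
the branch address. The sketch below is illustrative only,
not SimpleScalar's actual code:

class BimodalPredictor:
    def __init__(self, table_size):
        # table_size must be a power of two; each entry is a 2-bit counter.
        self.mask = table_size - 1
        self.table = [2] * table_size     # start every counter at weakly taken

    def predict(self, branch_addr):
        # Counter values 2 and 3 predict taken; 0 and 1 predict not taken.
        return self.table[branch_addr & self.mask] >= 2

    def update(self, branch_addr, taken):
        # Saturating increment on taken, saturating decrement on not taken.
        i = branch_addr & self.mask
        self.table[i] = min(3, self.table[i] + 1) if taken else max(0, self.table[i] - 1)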
The graph shows that the smallest table is not the best
choice: with only 128 entries the predictor has too few
prediction counters, which hampers the speed of the
benchmark because some branches cannot be predicted
correctly. This is evident from the figure, where the IPC
for the 128-entry table is the lowest (and its EDP the
highest). Among the remaining table sizes, the one with
512 entries has the lowest EDP. Increasing the size
further brings no improvement in the EDP, nor any
significant improvement in the IPC of the architecture.
Looking at the miss rate for each table size, we see that
it reaches its minimum value of 0.0206 misses per
reference at size 512; increasing the table beyond that
makes no difference to the miss rate.
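A sweep like the one behind this graph can be scripted; the
sketch below assumes SimpleScalar's documented sim-outorder
options (-redir:sim, -bpred, -bpred:bimod), while the
simulator and benchmark paths are hypothetical:

import subprocess

SIM = "./sim-outorder"       # Wattch build of sim-outorder (assumed path)
BENCH = "./eeg_benchmark"    # benchmark binary (assumed name)

for size in [128, 256, 512, 1024, 2048, 4096, 8192]:
    # Each run redirects its statistics (IPC, miss rate, power) to a log file.
    subprocess.run([SIM, "-redir:sim", "bimod_%d.log" % size,
                    "-bpred", "bimod", "-bpred:bimod", str(size), BENCH],
                   check=True)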
Now we look at the 2-level predictor. To configure a
2-level predictor we have four parts: the l1 table size,
the l2 table size, the history register size, and the XOR
flag. The l1 size can be varied from 1 to 32K. The figure
referenced below shows the EDP values for the different
l1 sizes; the lowest EDP occurs when the size is 32K. The
IPC for this benchmark remains constant, but the average
energy consumption decreases as the l1 size increases.
This is because the larger the first level, the higher the
probability of finding the branch prediction bits there,
which reduces the number of operations required to
fetch them from the l2 table or from memory. This
decrease in the number of operations decreases the
energy consumption of the architecture, and thus the
EDP.
The l2 table is the second level of the 2-level predictor.
As the l1 size cannot be increased beyond 32K, there is a
second level with a larger size, which supplies entries
the first level does not hold. The l2 uses the same logic
as the l1, so it is fast as well as larger. We vary the l2
size from 256K to 8192K. The graph referenced below
provides some relevant information.
[Figure: EDP (blue) and IPC vs. bimodal table size, 128-8192; EDP ranges from about 7.87 at 512 to 7.99 at 128, IPC from 1.5085 to 1.5231.]
[Figure: EDP vs. 2-level predictor l1 size, 1-32; EDP ranges from about 7.88 to 7.91, lowest at 32.]
[Figure: EDP vs. 2-level predictor l2 size, 256-8192; EDP ranges from about 7.8 to 8.]
The EDP for the 256K and 512K l2 sizes is quite low
compared to the larger sizes, and as the l2 size increases
the EDP keeps increasing. The increase in EDP is due to
the fact that a larger table requires a larger overhead to
search for a particular branch target address, and this
overhead consumes some extra power. So, for our
project we select an l2 size of 512K.
A branch history table is used to predict future behavior
by storing the previous action and target of branches, so
the prediction of a particular branch depends on how it
behaved the last time it executed. The history table is of
great importance in the 2-level predictor: it provides the
behavior of the branch, while the l1 and l2 tables provide
the addresses of the instructions to be executed. In our
experiment we vary the history table size from 2 to 256.
The branch history table is essentially a shift register
that stores the outcomes of branches; a new value is
stored by shifting the register.
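A one-line sketch of that shift-register update (illustrative
only):

def update_history(history, taken, hist_bits):
    # Shift the newest branch outcome in at the bottom; the oldest bit falls off.
    return ((history << 1) | (1 if taken else 0)) & ((1 << hist_bits) - 1)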
A larger history table seems attractive, but it comes at a
price as the size increases. The 2-level predictor is far
more accurate than a single-level predictor, yet it carries
its own cost, namely the warm-up phase effect: the time
required to fill the tables with usable values grows.
Using a larger history table would further lengthen this
warm-up. To refine our choice, we look at the EDP values
for different history table sizes.
Going by the figure, the lowest EDP occurs at size 2. But
a very small size would increase the overhead of getting
the desired data, so from the figure we select 8 as the
optimum size for the history table.
We now look at how to merge the above information,
namely the branch history and the branch address, to
index the second-level table. There are many ways to
merge the two, but in this project we use either XORing
or concatenation. Concatenation simply takes bits from
both registers and joins them into one index; with XOR,
the two registers are bitwise XORed to produce the
index.
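As a sketch (not SimpleScalar's exact code), the two ways of
forming the second-level table index could look like this:

def index_concat(history, branch_addr, addr_bits):
    # Concatenation: history bits in the high part, low-order address bits below.
    return (history << addr_bits) | (branch_addr & ((1 << addr_bits) - 1))

def index_xor(history, branch_addr, table_bits):
    # XOR: fold the history and the address into one table_bits-wide index.
    return (history ^ branch_addr) & ((1 << table_bits) - 1)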
When the xor flag is 0, concatenation is used; when it is
1, XOR is used. The table for XOR versus concatenation is
given below; for this benchmark application, both
schemes yield almost equal EDP values.
Type          IPC    CPI     Avg Power  EDP
Concatenate   1.523  0.6566  18.3204    7.898356
XOR           1.523  0.6566  18.3145    7.895812
So, we select XOR for our project, as it has a slight
advantage over concatenation.
The return address stack (RAS) holds the return
addresses that calls will come back to, so the instruction
after a return can be fetched immediately. As we can see
from the graph, a small return address stack requires a
lot of power: a small stack has to redo many push and
pop operations to recover the required data, which
increases the power expenditure. As the stack size
increases, the EDP decreases; the lowest EDP
encountered is for a stack size of 8. Beyond that it
increases slightly, as the overhead also grows.
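An illustrative model of a fixed-size return address stack (a
sketch; the overwrite-on-overflow behavior is our assumption,
not the simulator's documented policy):

class ReturnAddressStack:
    def __init__(self, size):
        self.size = size
        self.stack = []

    def push(self, return_addr):
        # On a call: a full stack overwrites its oldest entry.
        if len(self.stack) == self.size:
            self.stack.pop(0)
        self.stack.append(return_addr)

    def pop(self):
        # On a return: an empty stack means the return address must be guessed.
        return self.stack.pop() if self.stack else None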
[Figure: EDP vs. history table size, 2-256; EDP ranges from about 7.88 to 7.91.]
Next, we move to another branch prediction structure,
the Branch Target Buffer (BTB). The BTB is a special type
of cache that stores the target addresses of the most
recently executed branches. It further improves branch
handling, since we get the target address and the
direction of the branch already in the instruction fetch
stage, reducing the penalty. In the BTB we can vary two
attributes, the number of sets and the associativity. The
EDP data for the number of sets is shown in the figure
referenced below.
From the graph we see that the EDP is almost equal for
set counts of 64 and 128; we could choose either of the
two.
Increasing the associativity of a cache increases its hit
rate, but higher associativity also consumes more power.
We can see from the graph that as the associativity
increases, the EDP rises slightly at first and then at a
rapid pace. We could take any associativity from 1 to 4;
the table below shows the characteristics of a 128-set
BTB with associativity 2 or 4.
Associativity  IPC     CPI     Avg Power  EDP
2              1.5224  0.6569  17.8613    7.707465
4              1.5228  0.6567  17.8641    7.70398
We can conclude from the table that the parallel lookup
in a set-associative cache is not efficient from an energy
standpoint, but it is very efficient from the standpoint of
cache latency. The decrease in CPI at higher associativity
negates the effect of the higher average power, giving us
a lower EDP.
We now take all the values we selected for our
architecture and do an intra-group simulation. The
parameters selected are as follows:
Bimodal predictor – 512
2 level predictor – 32 512 8 1
Return Address Stack size – 8
BTB configuration – 128 4
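Assuming the usual sim-outorder option names (and
SimpleScalar's combined "comb" predictor, which is the
natural way to pair the bimodal and 2-level tables listed
above), this group's configuration maps onto flags roughly
as follows:

bpred_flags = ["-bpred", "comb",
               "-bpred:bimod", "512",                  # bimodal table entries
               "-bpred:2lev", "32", "512", "8", "1",   # l1 size, l2 size, history size, xor flag
               "-bpred:ras", "8",                      # return address stack entries
               "-bpred:btb", "128", "4"]               # BTB sets and associativity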
With this configuration we get an EDP of 7.6607 and an
IPC of 1.5226. It is clear from the given data that using
the optimized values yields a lower EDP without any
effect on the speed of the program.
[Figure: EDP vs. return address stack size, 2-256; EDP ranges from about 7.7 to 8.4.]
[Figure: EDP vs. BTB set count, 64-8192; EDP ranges from about 7 to 9.5.]
[Figure: EDP vs. BTB associativity.]
2.2 Memory System
Now we come to the components that affect the
memory system. The components we compare are the
l1 data cache, the l2 data cache, the l1 instruction cache,
and the l2 instruction cache. First we vary the l1 data
cache. Several factors can be varied in the dl1 cache,
such as the number of sets it has and the number of
bytes per block.
Let's look at the EDP values for different set counts of
the l1 data cache.
We can see from the figure that the EDP is lowest when
the set count is 8. The higher energy consumption at
larger sizes arises because data is transferred between
memory and cache in fixed-size units, and moving more
data requires more energy. But we cannot conclude our
analysis on the basis of the EDP alone; we have to look
at the IPC and the miss rate of each configuration. A
smaller cache implies a higher miss rate, and a higher
miss rate means the cache must replace its data far more
often than a larger one. This shows in the data we
collected: the miss rate with 8 sets is 0.0305, while with
64 sets it is 0.0087. Increasing the cache size further
decreases the miss rate, but the EDP grows
proportionately.
We also look at the bytes-per-block attribute of the
level 1 data cache, i.e., the cache block size. As explained
above, increasing the block size decreases the miss rate,
since each fill brings in more neighboring data, but it also
increases power consumption. So, for our architecture
we take 16 bytes per block.
Now we change the configuration of the level 2 data
cache to find its optimum value. The level 2 cache
provides fast access to data that is not present in the
first-level cache; the logic is the same as in the first
level. As for the first level, we vary the number of sets
and the bytes per block. The set count of the level 2
cache indicates how many sets are available to store
data. The EDP graph is referenced below.
It is clear from the graph that the lowest energy-delay
product occurs when the set count is 512. Besides the
EDP, the miss rate at 512 is also lower than at all the
smaller sizes: the larger cache retains more useful data,
which also raises the IPC as the size increases.
The bytes per block is the block size of the cache: it
determines how much data each block holds. Going by
the data, as we increase the bytes per block, the EDP
rises steadily. What we can conclude is that as the bytes
per block increase, the data stored in each block also
increases, which increases the overall overhead required
to access a particular block and thus the power
consumption.
[Figure: EDP vs. dl1 set count, 8-256; EDP ranges from about 6.6 to 8.6.]
[Figure: EDP vs. dl1 bytes per block; EDP ranges from about 7.75 to 8.15.]
Now we vary the configuration of the level 1 instruction
cache. The instruction cache is similar to the data cache;
the only difference is that it stores instructions instead
of data. According to the data we obtained, a set count
of 32 is enough for the level 1 instruction cache to
achieve a high IPC while consuming the least power.
With less space available, the overhead required to scan
the whole cache is quite low, which speeds up instruction
lookup and in turn lowers the power requirements.
Next we look at the bytes-per-block requirement for the
level 1 instruction cache. As the program executes about
840 million instructions, the blocks need to be large
enough to achieve a good IPC; having too few bytes per
block would hamper the storage of instructions, as the
graph shows. A small block size would increase the
amount of energy needed to fetch and evict instructions
from the level 1 instruction cache.
For the level 2 instruction cache, we simply reuse the
level 2 data cache as a unified second level.
Now we do an intra-group simulation of all the cache
configurations that we have finalized. The cache
configuration is:
L1 Data cache – dl1:64:16:2:f
L2 Data cache – ul2:512:64:2:r
L1 Inst cache – il1:32:64:2:r
L2 inst cache – dl2
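These strings follow SimpleScalar's <name>:<nsets>:<bsize>:<assoc>:<repl>
cache format (replacement policy f = FIFO, r = random,
l = LRU). As a sketch, the corresponding flags would be:

cache_flags = ["-cache:dl1", "dl1:64:16:2:f",    # 64 sets, 16 B blocks, 2-way, FIFO
               "-cache:dl2", "ul2:512:64:2:r",   # unified L2: 512 sets, 64 B blocks, 2-way, random
               "-cache:il1", "il1:32:64:2:r",    # instruction L1 (revised to 128 sets below)
               "-cache:il2", "dl2"]              # instructions share the unified L2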
When these cache configurations are run together, the
results are striking. The average power consumed drops
to about 11 W, but at the expense of the IPC, which falls
below 1, increasing the EDP of the whole architecture. It
seems the level 1 instruction cache is too small for this
workload. The data shows that, for a very small increase
in power, we can use a level 1 instruction cache with 128
sets; with this new configuration the IPC becomes 1.6969
and the EDP comes out to 6.26, which is very low.
2.3 Functional Units
Now we examine the effects of varying the attributes of
the functional units. The functional units are the
different ALUs available in the machine: the integer ALU,
the floating-point ALU, the integer multiplier/divider,
and the floating-point multiplier/divider. As the name
suggests, the ALUs perform the arithmetic processed in
the benchmark. First, we varied the number of integer
ALUs in our architecture from 1 to 8; the simulation data
is referenced below.
[Figure: EDP vs. il1 set count, 32-512; EDP ranges from about 7.4 to 8.6.]
[Figure: il1 bytes-per-block sweep, 8-64.]
We can see from the data that the architecture with the
lowest EDP is the one with 4 integer ALUs. However, we
could lower the cost by taking only 3 ALUs: the EDP
increases slightly, but it reduces the cost of the whole
setup, and a higher number of integer ALUs would
dissipate more heat, requiring a better cooling system.
So, we select 3 integer ALUs for our system.
Now we consider the number of integer multiplication
ALUs. As with the integer ALUs, we vary their number
from 1 to 8. Since there are far fewer multiplication
operations than integer operations, we expect that a
larger number of multiplication ALUs will not affect the
IPC but will definitely increase the cost of the
architecture. The numbers are shown in the graph
referenced below.
As expected, there was no effect on the IPC. We also
notice that the EDP is lowest when there is a single
multiplication ALU, but there is a second dip once the
number of multiplication ALUs exceeds 5, reaching a
local minimum at 7 before increasing again. This is
because the available instruction-level parallelism is
fully exploited once there are more than 5 multiplication
ALUs.
Besides the integer operations, some floating-point
operations are processed during the execution of the
benchmark, though far fewer than integer operations.
The figure referenced below shows the EDP for different
numbers of FP ALUs in our architecture.
We can see from the data that the EDP of the
architecture keeps increasing as the number of FP ALUs
increases. Going by the EDP alone we would select a
single FP ALU, but we should also consider the speed of
the program: with 2 FP ALUs the EDP is slightly higher
while the speed also improves slightly. Depending on
whether the architecture should favor speed or energy
consumption, we could choose either 1 or 2 FP ALUs.
Corresponding to the FP ALUs there are FP multiplication
ALUs, used for multiplication and division operations on
floating-point numbers. We can see from the table that
there is not much difference in the EDP as the number of
FP multiplication ALUs is varied from 1 to 8.
[Figure: EDP vs. number of integer ALUs.]
[Figure: EDP vs. number of integer multiplication ALUs; EDP ranges from about 7.875 to 7.905, IPC constant at 1.523.]
[Figure: EDP vs. number of FP ALUs, 1-8; IPC is 1.5157 with one FP ALU and 1.5226 otherwise.]
FP MUL  IPC     CPI     Avg Power  EDP
1       1.523   0.6566  18.3266    7.901029
2       1.5347  0.6516  18.4007    7.812616
3       1.5347  0.6516  18.4359    7.827562
4       1.539   0.6498  18.4885    7.806585
5       1.539   0.6498  18.4275    7.780828
6       1.539   0.6498  18.4349    7.783953
7       1.539   0.6498  18.46      7.794551
8       1.539   0.6498  18.4347    7.783868
As it is difficult to conclude anything from the EDP alone,
we also consider the IPC of every configuration. The IPC
increases with the number of FP multiplication ALUs:
more ALUs means more parallelism, and thus more
pipelining. But beyond a certain point, additional
parallelism has no further effect, so increasing the
number of ALUs no longer improves the IPC of the
architecture. From the data above we can conclude that
the optimum number of FP multiplication ALUs is 4.
From the above analysis we are left with the following
configuration for the intra-group simulation:
Integer ALU – 3
Integer multiplication ALU – 7
Floating point Integer ALU - 1
Floating point multiplication ALU - 4
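In sim-outorder these counts are set with the -res:* options;
a sketch of this group's configuration:

res_flags = ["-res:ialu", "3",       # integer ALUs
             "-res:imult", "7",      # integer multiplier/dividers
             "-res:fpalu", "1",      # floating-point ALUs
             "-res:fpmult", "4"]     # floating-point multiplier/dividers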
Using all these values together, the performance of our
architecture improves: the EDP decreases to 7.313,
though the program becomes slightly slower as well.
This is due to the extra multiplication ALUs relative to
the default values: multiplication ALUs take many cycles
on long-latency operations, and increasing their number
increases the time lost in such cases, decreasing the IPC.
2.4 Data Path and Others
Now we look at some miscellaneous parameters, such as
the instruction fetch queue size, the instruction decode
width, and the register update unit (RUU), to name a
few. The instruction fetch queue size is the first factor
we consider. The graph of the EDP and IPC values is
referenced below.
We see from the data that as the instruction fetch queue
size grows toward 32, the EDP decreases. The conclusion
is that a larger fetch queue gives more space to fetch
and buffer instructions, which helps speed up the
processor. This is also evident from the IPC values shown
in the graph.
Next is the instruction decode width. We keep the
default value of 4, as it provides the optimum result for
our project.
The next attribute is in-order issue of instructions. We
set it to false, because out-of-order execution gives
better speed: it executes an instruction without waiting
for the previous one to finish, provided there is no
dependency between them, whereas in-order execution
simply stalls until the required data arrives.
The table shows the data for this attribute (the in-order
IPC is written consistently with its CPI of 1.2423, i.e.,
about 0.805):

Issue in-order  IPC    CPI     Avg Power  EDP
FALSE           1.523  0.6566  18.3372    12.04021
TRUE            0.805  1.2423  13.0127    16.16568
The RUU or the Register Update Unit has the sole
purpose of keeping the instructions ready for the hungry
Functional Units. As we increase the number of
Functional units the RUU size also needs to be increased.
[Figure: EDP and IPC vs. instruction fetch queue size, 2-32; EDP falls from about 13.44 at size 2 to 11.47 at size 32, while IPC rises from 1.2712 to 1.7536.]
[Figure: EDP vs. RUU size, 2-32.]
As we can see from the graph, the optimum value for
the RUU size is 16. The EDP continuously decreases until
it reaches the default value, and increases after that.
This is because the number of instructions that can be
fetched is always greater than the number that can be
issued; keeping the extra functional units busy requires
more power, which increases the EDP.
For the rest of the attributes in the Data Path & Others
category we keep the default values, as they provide the
best possible EDP values.
We then run an intra-group analysis for these
miscellaneous parameters, changing only the instruction
fetch queue size to 32 and leaving the rest at their
defaults. The simulation results show that the EDP drops
considerably to 6.6, with the IPC rising to 1.7356.
3. INTER GROUP SIMULATION
We have now done all the intra-group simulations for
our architecture; next, we examine its behavior when all
these configurations are combined into a single
architecture.
Config     IPC     CPI     Avg Power  EDP
Default    1.5230  0.6569  18.2972    7.89
Optimized  1.6989  0.5886  16.4427    5.6965
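For reference, the final optimized run could be assembled
roughly as follows, under the same assumptions as the earlier
snippets (hypothetical paths, documented sim-outorder option
names, the combined predictor, and the revised 128-set il1):

import subprocess

cmd = ["./sim-outorder", "-redir:sim", "optimized.log",
       "-bpred", "comb", "-bpred:bimod", "512",
       "-bpred:2lev", "32", "512", "8", "1",
       "-bpred:ras", "8", "-bpred:btb", "128", "4",
       "-cache:dl1", "dl1:64:16:2:f", "-cache:dl2", "ul2:512:64:2:r",
       "-cache:il1", "il1:128:64:2:r", "-cache:il2", "dl2",
       "-res:ialu", "3", "-res:imult", "7",
       "-res:fpalu", "1", "-res:fpmult", "4",
       "-fetch:ifqsize", "32",
       "./eeg_benchmark"]
subprocess.run(cmd, check=True)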
From the data above we can conclude that applying all
the optimizations gives a very low EDP and a very high
IPC for our new architecture. Combining the simple
bimodal branch predictor with a 2-level branch predictor
decreases the number of mispredictions, and fewer
mispredictions reduce the power requirements of the
system. Changing the cache replacement policies and
tweaking some block sizes had a great effect on the IPC
of the system: a better replacement policy keeps more
useful data in the cache, which helps speed up
processing. The functional units had their own
advantages in increasing the speed: more functional
units working in parallel save time and thus increase the
IPC; they draw a little extra power, but the lower CPI
negates that effect and yields better overall
performance. The components of the Data Path & Others
category did not contribute much to the optimization, as
most of their values were left at defaults.
[Figure: screenshots from the final optimized simulation.]
4. CONCLUSION
In this project we tried to find the best possible
configuration for an optimized result. Although we found
a better configuration for our machine than the default
one, we cannot claim it is the optimal machine: this
optimization targets only the eeg benchmark. Other
benchmarks will have different instruction counts and
different percentages of branch instructions; some will
favor low energy consumption and some will favor speed
alone. Thus, we conclude that there is no single optimum
configuration.