Single Core Design Space Exploration Project Report
Vishesh Chanana
UIN:665319085
University of Illinois, Chicago
773-703-6742
ABSTRACT
SimpleScalar is a simulation tool created by Todd
Austin as part of his PhD work. The simulator is open
source: it is freely available and can be modified by
anyone. It is written in C and is used to compare the
performance of different machine configurations.
SimpleScalar is widely used for research purposes.
In this project, we are tasked with finding the most
suitable SimpleScalar architecture for a given benchmark
application using the Wattch (sim-outorder) simulator.
We compute the Energy-Delay Product (EDP) for different
configurations and then analyze the results. Based on our
analysis, we combine the configurations that give the
lowest EDP and finalize the best configuration for our
architecture.
1. INTRODUCTION
With rising competition in the computer market, the
cost of a computer has become a major factor.
Companies continuously look at various techniques and
technologies to lower the cost of their machines. From
using cheaper but stronger materials for the structure of
a computer to putting more and more transistors on a
single chip, companies were able to lower their costs to
a certain extent, but this cost reduction had its own
effect on power consumption, which started increasing
at a fast rate.
Power consumption depends on many factors, such as
the clock tree, the control and data paths, memory, and
registers. Varying these factors can bring down the
power consumption of a computer, but that gives rise to
another problem: the decrease in power consumption
decreases the speed of the computer. So, to get a good
machine, companies started looking at the Energy-Delay
Product (EDP) of the computer to find an optimum
design point.
The EDP can be defined as the product of the average
power consumed and the square of the propagation
delay. An optimum computer should not use excessive
power, and its cycles per instruction (CPI) should not be
so high as to inflate the execution time.
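The branch-prediction and functional-unit tables later in
this report are consistent with computing EDP as average
power multiplied by the square of CPI. A minimal sketch of
that calculation in Python (assuming CPI and average power
are read off the simulator output; the function name is ours):

def edp(avg_power_watts, cpi):
    # Energy-delay product as used in this report: power x delay^2,
    # with delay proxied by cycles per instruction (CPI).
    return avg_power_watts * cpi ** 2

# Values from the XOR row of the merge-function table in Section 2.1:
# IPC = 1.523 -> CPI = 0.6566, average power = 18.3145 W
print(edp(18.3145, 0.6566))   # ~7.8958, matching the reported EDP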
In this project, we divide the configuration parameters
of a computer architecture into four groups: branch
prediction, memory system (cache configuration),
functional units, and data path & others. We then
simulate the architecture while varying the parameters
of each group separately. To select the best value for a
given parameter, we look at its EDP and pick the value
with the lowest EDP. We then obtain the EDP for each
group as a whole, and after finding the lowest-EDP
configuration of every group, we find the EDP of the
architecture that combines the winning values from all
the groups.
For this experiment we installed a virtual machine
running the 32-bit Ubuntu OS, and then installed the
SimpleScalar toolset on top of Ubuntu. The VM
configuration was as follows: 2048 MB RAM, 23 MB video
memory, and a 20 GB hard disk. The remote desktop
server and video capture were disabled, as they were
not required during this experiment.
2. RESULTS
2.1 Branch Prediction
We start by finding the best possible branch prediction
method that we could implement in our architecture. A
branch predictor is a circuit that guesses which way a
branch will go before the outcome is known for sure,
and it helps increase the speed of a program. There are
about five types of branch predictors: the not-taken
predictor, the taken predictor, the perfect predictor,
the bimodal predictor, and the 2-level predictor. In this
project we are concerned with only the bimodal
predictor and the 2-level predictor. Further, we also vary
the return address stack size and the Branch Target
Buffer size to see their effects on the EDP of the
architecture.
The bimodal branch predictor, also known as the direct
history table, is one of the cheapest and simplest branch
predictors in use. It uses the branch address to index a
prediction table that predicts the outcome of the branch.
Being the simplest predictor, the only thing we can vary
is the table size, and only in powers of two (2^order
entries). The figure referenced below shows the variation
of the EDP with respect to the table size of a bimod
predictor.
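For intuition, a bimodal predictor can be modeled as a table
of 2-bit saturating counters indexed by the low-order bits of
the branch address. The sketch below is illustrative only,
not SimpleScalar's actual code:

class BimodalPredictor:
    def __init__(self, table_size):
        # table_size must be a power of two; each entry is a 2-bit counter.
        self.mask = table_size - 1
        self.table = [2] * table_size     # start every counter at weakly taken

    def predict(self, branch_addr):
        # Counter values 2 and 3 predict taken; 0 and 1 predict not taken.
        return self.table[branch_addr & self.mask] >= 2

    def update(self, branch_addr, taken):
        # Saturating increment on taken, saturating decrement on not taken.
        i = branch_addr & self.mask
        self.table[i] = min(3, self.table[i] + 1) if taken else max(0, self.table[i] - 1)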
The graph shows that the smallest table is not the best
choice: with only 128 entries the predictor has too few
prediction counters, which hampers the speed of the
benchmark because some branches cannot be predicted
correctly. This is evident from the figure, where the IPC
for the 128-entry table is the lowest (and its EDP the
highest). Among the remaining table sizes, the one with
512 entries has the lowest EDP. Increasing the size
further brings no improvement in the EDP, nor any
significant improvement in the IPC of the architecture.
Looking at the miss rate for each table size, we see that
it reaches its minimum value of 0.0206 misses per
reference at size 512; increasing the table beyond that
makes no difference to the miss rate.
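A sweep like the one behind this graph can be scripted; the
sketch below assumes SimpleScalar's documented sim-outorder
options (-redir:sim, -bpred, -bpred:bimod), while the
simulator and benchmark paths are hypothetical:

import subprocess

SIM = "./sim-outorder"       # Wattch build of sim-outorder (assumed path)
BENCH = "./eeg_benchmark"    # benchmark binary (assumed name)

for size in [128, 256, 512, 1024, 2048, 4096, 8192]:
    # Each run redirects its statistics (IPC, miss rate, power) to a log file.
    subprocess.run([SIM, "-redir:sim", "bimod_%d.log" % size,
                    "-bpred", "bimod", "-bpred:bimod", str(size), BENCH],
                   check=True)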
Now we look at the 2-level predictor. To configure a
2-level predictor we have four parts: the l1 table size,
the l2 table size, the history register size, and the XOR
flag. The l1 size can be varied from 1 to 32K. The figure
referenced below shows the EDP values for the different
l1 sizes; the lowest EDP occurs when the size is 32K. The
IPC for this benchmark remains constant, but the average
energy consumption decreases as the l1 size increases.
This is because the larger the first level, the higher the
probability of finding the branch prediction bits there,
which reduces the number of operations required to
fetch them from the l2 table or from memory. This
decrease in the number of operations decreases the
energy consumption of the architecture, and thus the
EDP.
The l2 table is the second level of the 2-level predictor.
As the l1 size cannot be increased beyond 32K, there is a
second level with a larger size, which supplies entries
the first level does not hold. The l2 uses the same logic
as the l1, so it is fast as well as larger. We vary the l2
size from 256K to 8192K. The graph referenced below
provides some relevant information.
[Figure: EDP (blue) and IPC vs. bimodal table size, 128-8192; EDP ranges from about 7.87 at 512 to 7.99 at 128, IPC from 1.5085 to 1.5231.]
[Figure: EDP vs. 2-level predictor l1 size, 1-32; EDP ranges from about 7.88 to 7.91, lowest at 32.]
[Figure: EDP vs. 2-level predictor l2 size, 256-8192; EDP ranges from about 7.8 to 8.]
The EDP for the 256K and 512K l2 sizes is quite low
compared to the larger sizes, and as the l2 size increases
the EDP keeps increasing. The increase in EDP is due to
the fact that a larger table requires a larger overhead to
search for a particular branch target address, and this
overhead consumes some extra power. So, for our
project we select an l2 size of 512K.
A branch history table is used to predict future behavior
by storing the previous action and target of branches, so
the prediction of a particular branch depends on how it
behaved the last time it executed. The history table is of
great importance in the 2-level predictor: it provides the
behavior of the branch, while the l1 and l2 tables provide
the addresses of the instructions to be executed. In our
experiment we vary the history table size from 2 to 256.
The branch history table is essentially a shift register
that stores the outcomes of branches; a new value is
stored by shifting the register.
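A one-line sketch of that shift-register update (illustrative
only):

def update_history(history, taken, hist_bits):
    # Shift the newest branch outcome in at the bottom; the oldest bit falls off.
    return ((history << 1) | (1 if taken else 0)) & ((1 << hist_bits) - 1)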
A larger history table seems attractive, but it comes at a
price as the size increases. The 2-level predictor is far
more accurate than a single-level predictor, yet it carries
its own cost, namely the warm-up phase effect: the time
required to fill the tables with usable values grows.
Using a larger history table would further lengthen this
warm-up. To refine our choice, we look at the EDP values
for different history table sizes.
Going by the figure, the lowest EDP occurs at size 2. But
a very small size would increase the overhead of getting
the desired data, so from the figure we select 8 as the
optimum size for the history table.
We now look at how to merge the above information,
namely the branch history and the branch address, to
index the second-level table. There are many ways to
merge the two, but in this project we use either XORing
or concatenation. Concatenation simply takes bits from
both registers and joins them into one index; with XOR,
the two registers are bitwise XORed to produce the
index.
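As a sketch (not SimpleScalar's exact code), the two ways of
forming the second-level table index could look like this:

def index_concat(history, branch_addr, addr_bits):
    # Concatenation: history bits in the high part, low-order address bits below.
    return (history << addr_bits) | (branch_addr & ((1 << addr_bits) - 1))

def index_xor(history, branch_addr, table_bits):
    # XOR: fold the history and the address into one table_bits-wide index.
    return (history ^ branch_addr) & ((1 << table_bits) - 1)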
When the xor flag is 0, concatenation is used; when it is
1, XOR is used. The table for XOR versus concatenation is
given below; for this benchmark application, both
schemes yield almost equal EDP values.
Type          IPC    CPI     Avg Power  EDP
Concatenate   1.523  0.6566  18.3204    7.898356
XOR           1.523  0.6566  18.3145    7.895812
So, we select XOR for our project, as it has a slight
advantage over concatenation.
The return address stack (RAS) holds the return
addresses that calls will come back to, so the instruction
after a return can be fetched immediately. As we can see
from the graph, a small return address stack requires a
lot of power: a small stack has to redo many push and
pop operations to recover the required data, which
increases the power expenditure. As the stack size
increases, the EDP decreases; the lowest EDP
encountered is for a stack size of 8. Beyond that it
increases slightly, as the overhead also grows.
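An illustrative model of a fixed-size return address stack (a
sketch; the overwrite-on-overflow behavior is our assumption,
not the simulator's documented policy):

class ReturnAddressStack:
    def __init__(self, size):
        self.size = size
        self.stack = []

    def push(self, return_addr):
        # On a call: a full stack overwrites its oldest entry.
        if len(self.stack) == self.size:
            self.stack.pop(0)
        self.stack.append(return_addr)

    def pop(self):
        # On a return: an empty stack means the return address must be guessed.
        return self.stack.pop() if self.stack else None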
[Figure: EDP vs. history table size, 2-256; EDP ranges from about 7.88 to 7.91.]
Next, we move to another branch prediction structure,
the Branch Target Buffer (BTB). The BTB is a special type
of cache that stores the target addresses of the most
recently executed branches. It further improves branch
handling, since we get the target address and the
direction of the branch already in the instruction fetch
stage, reducing the penalty. In the BTB we can vary two
attributes, the number of sets and the associativity. The
EDP data for the number of sets is shown in the figure
referenced below.
From the graph we see that the EDP is almost equal for
set counts of 64 and 128; we could choose either of the
two.
Increasing the associativity of a cache increases its hit
rate, but higher associativity also consumes more power.
We can see from the graph that as the associativity
increases, the EDP rises slightly at first and then at a
rapid pace. We could take any associativity from 1 to 4;
the table below shows the characteristics of a 128-set
BTB with associativity 2 or 4.
Associativity  IPC     CPI     Avg Power  EDP
2              1.5224  0.6569  17.8613    7.707465
4              1.5228  0.6567  17.8641    7.70398
We can conclude from the table that the parallel lookup
in a set-associative cache is not efficient from an energy
standpoint, but it is very efficient from the standpoint of
cache latency. The decrease in CPI at higher associativity
negates the effect of the higher average power, giving us
a lower EDP.
We now take all the values we selected for our
architecture and do an intra-group simulation. The
parameters selected are as follows:
Bimodal predictor – 512
2 level predictor – 32 512 8 1
Return Address Stack size – 8
BTB configuration – 128 4
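Assuming the usual sim-outorder option names (and
SimpleScalar's combined "comb" predictor, which is the
natural way to pair the bimodal and 2-level tables listed
above), this group's configuration maps onto flags roughly
as follows:

bpred_flags = ["-bpred", "comb",
               "-bpred:bimod", "512",                  # bimodal table entries
               "-bpred:2lev", "32", "512", "8", "1",   # l1 size, l2 size, history size, xor flag
               "-bpred:ras", "8",                      # return address stack entries
               "-bpred:btb", "128", "4"]               # BTB sets and associativity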
With this configuration we get an EDP of 7.6607 and an
IPC of 1.5226. It is clear from the given data that using
the optimized values yields a lower EDP without any
effect on the speed of the program.
[Figure: EDP vs. return address stack size, 2-256; EDP ranges from about 7.7 to 8.4.]
[Figure: EDP vs. BTB set count, 64-8192; EDP ranges from about 7 to 9.5.]
[Figure: EDP vs. BTB associativity.]
2.2 Memory System
Now we come to the components that affect the
memory system. The components we compare are the
l1 data cache, the l2 data cache, the l1 instruction cache,
and the l2 instruction cache. First we vary the l1 data
cache. Several factors can be varied in the dl1 cache,
such as the number of sets it has and the number of
bytes per block.
Let's look at the EDP values for different set counts of
the l1 data cache.
We can see from the figure that the EDP is lowest when
the set count is 8. The higher energy consumption at
larger sizes arises because data is transferred between
memory and cache in fixed-size units, and moving more
data requires more energy. But we cannot conclude our
analysis on the basis of the EDP alone; we have to look
at the IPC and the miss rate of each configuration. A
smaller cache implies a higher miss rate, and a higher
miss rate means the cache must replace its data far more
often than a larger one. This shows in the data we
collected: the miss rate with 8 sets is 0.0305, while with
64 sets it is 0.0087. Increasing the cache size further
decreases the miss rate, but the EDP grows
proportionately.
We also look at the bytes-per-block attribute of the
level 1 data cache, i.e., the cache block size. As explained
above, increasing the block size decreases the miss rate,
since each fill brings in more neighboring data, but it also
increases power consumption. So, for our architecture
we take 16 bytes per block.
Now we change the configuration of the level 2 data
cache to find its optimum value. The level 2 cache
provides fast access to data that is not present in the
first-level cache; the logic is the same as in the first
level. As for the first level, we vary the number of sets
and the bytes per block. The set count of the level 2
cache indicates how many sets are available to store
data. The EDP graph is referenced below.
It is clear from the graph that the lowest energy-delay
product occurs when the set count is 512. Besides the
EDP, the miss rate at 512 is also lower than at all the
smaller sizes: the larger cache retains more useful data,
which also raises the IPC as the size increases.
The bytes per block is the block size of the cache: it
determines how much data each block holds. Going by
the data, as we increase the bytes per block, the EDP
rises steadily. What we can conclude is that as the bytes
per block increase, the data stored in each block also
increases, which increases the overall overhead required
to access a particular block and thus the power
consumption.
[Figure: EDP vs. dl1 set count, 8-256; EDP ranges from about 6.6 to 8.6.]
[Figure: EDP vs. dl1 bytes per block; EDP ranges from about 7.75 to 8.15.]
Now we vary the configuration of the level 1 instruction
cache. The instruction cache is similar to the data cache;
the only difference is that it stores instructions instead
of data. According to the data we obtained, a set count
of 32 is enough for the level 1 instruction cache to
achieve a high IPC while consuming the least power.
With less space available, the overhead required to scan
the whole cache is quite low, which speeds up instruction
lookup and in turn lowers the power requirements.
Next we look at the bytes-per-block requirement for the
level 1 instruction cache. As the program executes about
840 million instructions, the blocks need to be large
enough to achieve a good IPC; having too few bytes per
block would hamper the storage of instructions, as the
graph shows. A small block size would increase the
amount of energy needed to fetch and evict instructions
from the level 1 instruction cache.
For the level 2 instruction cache, we simply reuse the
level 2 data cache as a unified second level.
Now we do an intra-group simulation of all the cache
configurations that we have finalized. The cache
configuration is:
L1 Data cache – dl1:64:16:2:f
L2 Data cache – ul2:512:64:2:r
L1 Inst cache – il1:32:64:2:r
L2 inst cache – dl2
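These strings follow SimpleScalar's <name>:<nsets>:<bsize>:<assoc>:<repl>
cache format (replacement policy f = FIFO, r = random,
l = LRU). As a sketch, the corresponding flags would be:

cache_flags = ["-cache:dl1", "dl1:64:16:2:f",    # 64 sets, 16 B blocks, 2-way, FIFO
               "-cache:dl2", "ul2:512:64:2:r",   # unified L2: 512 sets, 64 B blocks, 2-way, random
               "-cache:il1", "il1:32:64:2:r",    # instruction L1 (revised to 128 sets below)
               "-cache:il2", "dl2"]              # instructions share the unified L2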
When these cache configurations are run together, the
results are striking. The average power consumed drops
to about 11 W, but at the expense of the IPC, which falls
below 1, increasing the EDP of the whole architecture. It
seems the level 1 instruction cache is too small for this
workload. The data shows that, for a very small increase
in power, we can use a level 1 instruction cache with 128
sets; with this new configuration the IPC becomes 1.6969
and the EDP comes out to 6.26, which is very low.
2.3 Functional Units
Now we examine the effects of varying the attributes of
the functional units. The functional units are the
different ALUs available in the machine: the integer ALU,
the floating-point ALU, the integer multiplier/divider,
and the floating-point multiplier/divider. As the name
suggests, the ALUs perform the arithmetic processed in
the benchmark. First, we varied the number of integer
ALUs in our architecture from 1 to 8; the simulation data
is referenced below.
[Figure: EDP vs. il1 set count, 32-512; EDP ranges from about 7.4 to 8.6.]
[Figure: il1 bytes-per-block sweep, 8-64.]
We can see from the data that the architecture with the
lowest EDP is the one with 4 integer ALUs. However, we
could lower the cost by taking only 3 ALUs: the EDP
increases slightly, but it reduces the cost of the whole
setup, and a higher number of integer ALUs would
dissipate more heat, requiring a better cooling system.
So, we select 3 integer ALUs for our system.
Now we consider the number of integer multiplication
ALUs. As with the integer ALUs, we vary their number
from 1 to 8. Since there are far fewer multiplication
operations than integer operations, we expect that a
larger number of multiplication ALUs will not affect the
IPC but will definitely increase the cost of the
architecture. The numbers are shown in the graph
referenced below.
As expected, there was no effect on the IPC. We also
notice that the EDP is lowest when there is a single
multiplication ALU, but there is a second dip once the
number of multiplication ALUs exceeds 5, reaching a
local minimum at 7 before increasing again. This is
because the available instruction-level parallelism is
fully exploited once there are more than 5 multiplication
ALUs.
Besides the integer operations, some floating-point
operations are processed during the execution of the
benchmark, though far fewer than integer operations.
The figure referenced below shows the EDP for different
numbers of FP ALUs in our architecture.
We can see from the data that the EDP of the
architecture keeps increasing as the number of FP ALUs
increases. Going by the EDP alone we would select a
single FP ALU, but we should also consider the speed of
the program: with 2 FP ALUs the EDP is slightly higher
while the speed also improves slightly. Depending on
whether the architecture should favor speed or energy
consumption, we could choose either 1 or 2 FP ALUs.
Corresponding to the FP ALUs there are FP multiplication
ALUs, used for multiplication and division operations on
floating-point numbers. We can see from the table that
there is not much difference in the EDP as the number of
FP multiplication ALUs is varied from 1 to 8.
[Figure: EDP vs. number of integer ALUs.]
[Figure: EDP vs. number of integer multiplication ALUs; EDP ranges from about 7.875 to 7.905, IPC constant at 1.523.]
[Figure: EDP vs. number of FP ALUs, 1-8; IPC is 1.5157 with one FP ALU and 1.5226 otherwise.]
FP MUL  IPC     CPI     Avg Power  EDP
1       1.523   0.6566  18.3266    7.901029
2       1.5347  0.6516  18.4007    7.812616
3       1.5347  0.6516  18.4359    7.827562
4       1.539   0.6498  18.4885    7.806585
5       1.539   0.6498  18.4275    7.780828
6       1.539   0.6498  18.4349    7.783953
7       1.539   0.6498  18.46      7.794551
8       1.539   0.6498  18.4347    7.783868
As it is difficult to conclude anything from the EDP alone,
we also consider the IPC of every configuration. The IPC
increases with the number of FP multiplication ALUs:
more ALUs means more parallelism, and thus more
pipelining. But beyond a certain point, additional
parallelism has no further effect, so increasing the
number of ALUs no longer improves the IPC of the
architecture. From the data above we can conclude that
the optimum number of FP multiplication ALUs is 4.
From the above analysis we are left with the following
configuration for the intra-group simulation:
Integer ALU – 3
Integer multiplication ALU – 7
Floating point Integer ALU - 1
Floating point multiplication ALU - 4
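In sim-outorder these counts are set with the -res:* options;
a sketch of this group's configuration:

res_flags = ["-res:ialu", "3",       # integer ALUs
             "-res:imult", "7",      # integer multiplier/dividers
             "-res:fpalu", "1",      # floating-point ALUs
             "-res:fpmult", "4"]     # floating-point multiplier/dividers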
Using all these values together, the performance of our
architecture improves: the EDP decreases to 7.313,
though the program becomes slightly slower as well.
This is due to the extra multiplication ALUs relative to
the default values: multiplication ALUs take many cycles
on long-latency operations, and increasing their number
increases the time lost in such cases, decreasing the IPC.
2.4 Data Path and Others
Now we look at some miscellaneous parameters, such as
the instruction fetch queue size, the instruction decode
width, and the register update unit (RUU), to name a
few. The instruction fetch queue size is the first factor
we consider. The graph of the EDP and IPC values is
referenced below.
We see from the data that as the instruction fetch queue
size grows toward 32, the EDP decreases. The conclusion
is that a larger fetch queue gives more space to fetch
and buffer instructions, which helps speed up the
processor. This is also evident from the IPC values shown
in the graph.
Next is the instruction decode width. We keep the
default value of 4, as it provides the optimum result for
our project.
The next attribute is in-order issue of instructions. We
set it to false, because out-of-order execution gives
better speed: it executes an instruction without waiting
for the previous one to finish, provided there is no
dependency between them, whereas in-order execution
simply stalls until the required data arrives.
The table shows the data for this attribute (the in-order
IPC is written consistently with its CPI of 1.2423, i.e.,
about 0.805):

Issue in-order  IPC    CPI     Avg Power  EDP
FALSE           1.523  0.6566  18.3372    12.04021
TRUE            0.805  1.2423  13.0127    16.16568
The RUU or the Register Update Unit has the sole
purpose of keeping the instructions ready for the hungry
Functional Units. As we increase the number of
Functional units the RUU size also needs to be increased.
[Figure: EDP and IPC vs. instruction fetch queue size, 2-32; EDP falls from about 13.44 at size 2 to 11.47 at size 32, while IPC rises from 1.2712 to 1.7536.]
[Figure: EDP vs. RUU size, 2-32.]
As we can see from the graph, the optimum value for
the RUU size is 16. The EDP continuously decreases until
it reaches the default value, and increases after that.
This is because the number of instructions that can be
fetched is always greater than the number that can be
issued; keeping the extra functional units busy requires
more power, which increases the EDP.
For the rest of the attributes in the Data Path & Others
category we keep the default values, as they provide the
best possible EDP values.
We then run an intra-group analysis for these
miscellaneous parameters, changing only the instruction
fetch queue size to 32 and leaving the rest at their
defaults. The simulation results show that the EDP drops
considerably to 6.6, with the IPC rising to 1.7356.
3. INTER GROUP SIMULATION
We have now done all the intra-group simulations for
our architecture; next, we examine its behavior when all
these configurations are combined into a single
architecture.
Config     IPC     CPI     Avg Power  EDP
Default    1.5230  0.6569  18.2972    7.89
Optimized  1.6989  0.5886  16.4427    5.6965
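For reference, the final optimized run could be assembled
roughly as follows, under the same assumptions as the earlier
snippets (hypothetical paths, documented sim-outorder option
names, the combined predictor, and the revised 128-set il1):

import subprocess

cmd = ["./sim-outorder", "-redir:sim", "optimized.log",
       "-bpred", "comb", "-bpred:bimod", "512",
       "-bpred:2lev", "32", "512", "8", "1",
       "-bpred:ras", "8", "-bpred:btb", "128", "4",
       "-cache:dl1", "dl1:64:16:2:f", "-cache:dl2", "ul2:512:64:2:r",
       "-cache:il1", "il1:128:64:2:r", "-cache:il2", "dl2",
       "-res:ialu", "3", "-res:imult", "7",
       "-res:fpalu", "1", "-res:fpmult", "4",
       "-fetch:ifqsize", "32",
       "./eeg_benchmark"]
subprocess.run(cmd, check=True)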
From the data above we can conclude that applying all
the optimizations gives a very low EDP and a very high
IPC for our new architecture. Combining the simple
bimodal branch predictor with a 2-level branch predictor
decreases the number of mispredictions, and fewer
mispredictions reduce the power requirements of the
system. Changing the cache replacement policies and
tweaking some block sizes had a great effect on the IPC
of the system: a better replacement policy keeps more
useful data in the cache, which helps speed up
processing. The functional units had their own
advantages in increasing the speed: more functional
units working in parallel save time and thus increase the
IPC; they draw a little extra power, but the lower CPI
negates that effect and yields better overall
performance. The components of the Data Path & Others
category did not contribute much to the optimization, as
most of their values were left at defaults.
[Figure: screenshots from the final optimized simulation.]
4. CONCLUSION
In this project we tried to find the best possible
configuration for an optimized result. Although we found
a better configuration for our machine than the default
one, we cannot claim it is the optimal machine: this
optimization targets only the eeg benchmark. Other
benchmarks will have different instruction counts and
different percentages of branch instructions; some will
favor low energy consumption and some will favor speed
alone. Thus, we conclude that there is no single optimum
configuration.