Performance Considerations For Cache
Memory Design in a Multi-core Processor *
Divya Ravindran, dxr150630
Ilango Jeyasubramanian, ixj150230
Kavitha Thiagarajan, kxt132230
Susmitha Gogineni, sxg155930
University of Texas at Dallas, Richardson, TX 75030 USA
Abstract: In recent times, multi-core processors have gained importance over traditional
uniprocessors because the performance growth of uniprocessors has saturated. Multi-core
processors make use of multiple cores, and in order to improve their performance it is necessary
to reduce the memory access time, improve power efficiency, and maintain the coherence of data
among the cores. To address the efficiency of multiple cores, a filter cache is designed with an
efficient Segmented Least Recently Used replacement policy. This technique effectively reduces
the energy consumed by 11%. Finally, to address the coherence of the caches, a modified
MOESI-based snooping protocol for the ring topology was used. This improved the performance
of the processor by increasing the hit rate by 7%.
Keywords: multi-core; filter cache; energy efficient; hit ratio; coherence; LRU; ring-order
1. INTRODUCTION
As the number of transistors on a chip doubles roughly every
18 months following Moore's law, processor speed has improved at a
comparable rate, but memory latency has not progressed at the same rate
as the processor. Due to this difference in growth, the
time to access memory becomes relatively larger as processor
speed improves further. Caches were built to overcome this memory wall.
A cache is a small, fast memory with a shorter access
time than the main memory. These properties make the
cache desirable for improving the efficiency of the
processor [1].
This project concentrates on how caches can be
modified to make a multi-core processor work
efficiently, so that the overall speedup is improved,
the energy consumed by the processor is reduced, and its
performance increases. The analysis of the
newly implemented cache designs is done using some of
the SPEC2006 benchmarks. In this experiment, the size
and associativity of the caches are fixed in order to keep
the analysis simple, and the instruction set architecture (ISA)
used is x86-64.
The first modification performed was introducing a filter
cache, a tiny cache assumed to run at the speed of
the core. It holds the most frequently used instructions,
and the access time of data in the filter cache is very
short, but the hit rate of the filter cache is low [2]. This is
improved by implementing a prediction technique that
chooses the memory level to be accessed in order to reduce misses [3]
[4].

* This project paper is edited in the format of International Federation
of Automatic Control conference papers in LaTeX 2ε as part of
EEDG 6304 Computer Architecture coursework.
To further improve the hit rate, a Segmented
Least Recently Used (LRU) block replacement policy is
implemented along with the filter cache and analyzed. The SLRU
consists of two segments and uses access
probability to perform cache block replacement.
The coherence of the multi-core processor was analyzed
next with the help of various topologies. The idea was to
introduce a modified MOESI-based snooping protocol for
the ring topology, which helps improve the coherence of
data in a multi-core processor. This modification makes use
of the round-robin order of the ring to provide fast and stable
performance.
2. FILTER CACHE
2.1 Idea of Filter Cache
The cache is a very important component of a modern processor
that can effectively alleviate the speed gap between the
CPU and the off-chip memory system. Multi-core processors
have become the main development trend of processors due
to their high performance, but power dissipation is a major
issue given the large number of memory accesses made by multiple cores.
Therefore, an energy-efficient cache design is required.
A filter cache is used to improve performance and power
efficiency. It is a small-capacity cache that stores the data
and instructions most frequently accessed by the
cores. The filter cache acts as the first instruction source and
consumes less energy for the most used instructions and
data. The filter is assumed to have almost the same speed as
the core and to consume less energy than the normal cache.
Fig. 1 shows the basic idea of the filter cache.
Fig. 1. Filter Cache: The basic idea
Fig. 2. Filter Cache: How prediction works
The improvement in performance and energy saving is
achieved by accessing the filter instead of the normal cache.
The CPU accesses the filter first, and only when the filter
misses is the normal cache visited.
2.2 Prediction Filter Cache
For any instruction or data, the processor first accesses the
filter cache. If the filter hits, the fetch completes at a very
low cost, without the extra loss of performance and energy
that a miss would cause. Past studies
have shown that the hit ratio is extremely important
for a filter cache [5]. Therefore, to ensure a good hit ratio, a prediction
algorithm is incorporated to improve the hit ratio of the public filter.
In the prediction scheme, the CPU accesses either the filter or the
normal cache depending on a prediction signal. The prediction
algorithm is designed to eliminate unnecessary accesses to
the filter cache [6]. If the prediction in favor of the filter fails, the
CPU re-fetches the instruction through the normal cache,
which causes an extra loss of performance and
energy.
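The internals of the predictor are not specified here; purely as an illustration of how a prediction signal could steer accesses, the C++ sketch below uses a hypothetical table of 2-bit saturating counters indexed by the fetch address. The class, table size, and indexing are our assumptions, not the design of [2] or [6].

#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical predictor: a table of 2-bit saturating counters indexed by
// the fetch address. A high counter value means "the filter is likely to
// hit, probe it first"; a low value means "go straight to the normal cache".
class FilterPredictor {
  public:
    bool predictFilterHit(uint64_t fetchAddr) const {
        return table_[index(fetchAddr)] >= 2;      // states 2,3: predict filter
    }
    // Train the counter with the actual outcome of the filter probe.
    void update(uint64_t fetchAddr, bool filterHit) {
        uint8_t &c = table_[index(fetchAddr)];
        if (filterHit && c < 3) ++c;
        else if (!filterHit && c > 0) --c;
    }
  private:
    static constexpr std::size_t kEntries = 1024;  // size chosen arbitrarily
    static std::size_t index(uint64_t addr) { return (addr >> 2) % kEntries; }
    std::array<uint8_t, kEntries> table_{};
};

On a fetch, the core would consult predictFilterHit(); when it returns false, the filter probe is skipped and the normal cache is accessed directly, avoiding a wasted filter access on a likely miss.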
2.3 Architecture of the Energy-Efficient Multi-core Cache System:
Public Filter
Each core has separate level 1 instruction and data
caches, and all cores share the level 2 LLC. In addition, a public
filter cache unit shared by all cores is introduced as the first
shared cache for all cores.
Fig. 3 shows how the architecture has been modified to
accommodate the filter cache [7].
For each instruction fetch, every core accesses the public
filter first. If the public filter hits, the instruction is returned to
the core directly [8]. Otherwise, the next memory level, the L2 cache,
is accessed until the right instruction is returned, and the
public filter is updated with the new cache block that
contains the missed instruction.
Fig. 3. Filter Cache: Architectural Change made to the
Baseline Cache
Algorithm 1 Algorithm for the proposed cache design
CPU sends a request for the data;
while Resolving the public filter for the data do
    Visit the public filter;
    if data was hit then
        Return the instruction;
    else
        Visit the LLC;
        if hit then
            Return the instruction and update the filter;
        else
            Visit the main memory and update the filter;
        end
    end
end
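Algorithm 1 can be read as the lookup path sketched below in C++. The memory levels are reduced to hypothetical map-backed stand-ins so the sketch is self-contained; this shows only the control flow of the proposed design, not the gem5 implementation.

#include <cstdint>
#include <optional>
#include <unordered_map>

using Instruction = uint32_t;

// Minimal map-backed stand-in for one memory level (illustrative only).
struct Level {
    std::unordered_map<uint64_t, Instruction> lines;
    std::optional<Instruction> lookup(uint64_t addr) const {
        auto it = lines.find(addr);
        if (it == lines.end()) return std::nullopt;
        return it->second;
    }
    void fill(uint64_t addr, Instruction inst) { lines[addr] = inst; }
};

// Fetch path of Algorithm 1: public filter -> shared LLC -> main memory,
// refilling the public filter whenever a lower level supplies the block.
Instruction fetch(uint64_t addr, Level &publicFilter, Level &llc, Level &mainMem) {
    if (auto inst = publicFilter.lookup(addr))     // filter hit: cheapest path
        return *inst;
    if (auto inst = llc.lookup(addr)) {            // LLC hit
        publicFilter.fill(addr, *inst);
        return *inst;
    }
    Instruction inst = mainMem.lookup(addr).value_or(0);  // simplified DRAM read
    publicFilter.fill(addr, inst);                 // filter gets the missed block
    return inst;
}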
A dynamic replacement method, Segmented LRU (SLRU),
is used to maintain a good hit ratio, and dynamic memory
management methods are used to distribute the hit ratio
equally among all cores.
3. SEGMENTED LRU POLICY
3.1 Existing Segmented LRU
An SLRU cache is divided into two segments, a probationary
segment and a protected segment. Lines in each
segment are ordered from the most to the least recently
accessed. Fig. 4 explains how the block is segmented.
On a miss, data fetched from memory is added to the cache
at the most recently accessed end of the probationary
segment. Lines that hit are removed from wherever they
currently reside and added to the most recently accessed
end of the protected segment. Lines in the protected
segment have thus been accessed at least twice, giving each
such line another chance to be accessed before being replaced.
The lines to be discarded for replacement are taken
from the LRU end of the probationary segment [9].
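As a concrete, simplified illustration of this two-segment behaviour, the sketch below keeps both segments as recency-ordered lists with fixed capacities. It ignores tags, sets, and the dynamic sizing of Section 3.2, and all names are illustrative rather than taken from the cited work.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <deque>

// Fixed-size two-segment SLRU over block addresses (illustrative sketch).
class SlruSet {
  public:
    SlruSet(std::size_t probCap, std::size_t protCap)
        : probCap_(probCap), protCap_(protCap) {}

    // Returns true on hit. Hits are promoted to the protected MRU end;
    // misses are inserted at the probationary MRU end.
    bool access(uint64_t addr) {
        if (erase(protected_, addr) || erase(probationary_, addr)) {
            promote(addr);
            return true;
        }
        insertProbationary(addr);
        return false;
    }

  private:
    static bool erase(std::deque<uint64_t> &seg, uint64_t addr) {
        auto it = std::find(seg.begin(), seg.end(), addr);
        if (it == seg.end()) return false;
        seg.erase(it);
        return true;
    }
    void promote(uint64_t addr) {
        protected_.push_front(addr);            // MRU end of protected
        if (protected_.size() > protCap_) {
            // Demote the protected LRU line back to the probationary MRU end.
            uint64_t demoted = protected_.back();
            protected_.pop_back();
            insertProbationary(demoted);
        }
    }
    void insertProbationary(uint64_t addr) {
        probationary_.push_front(addr);         // MRU end of probationary
        if (probationary_.size() > probCap_)
            probationary_.pop_back();           // victim: probationary LRU end
    }

    std::size_t probCap_, protCap_;
    std::deque<uint64_t> probationary_, protected_;
};

A protected line that overflows is demoted to the probationary MRU end, so it still gets one more chance before it reaches the eviction end.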
3.2 Dynamic Segmented LRU
Based on our observation of the existing SLRU algorithm,
we found that it uses a constant
number of protected and probationary ways. The proposed
scheme handles the dynamic sizing of the two segments
based on the access probability of each cache line in the set
[10].
The access probabilities are accumulated starting from
the first line, and the cache line selected for the
insertion of new miss data from memory is
the line at which the accumulated probability
reaches approximately 0.5. This dynamically adjusts the
segment sizes according to access probability.
3.3 Code Snippet for LRU Changes
void
LRU::insertBlock(PacketPtr pkt, BlkType *blk)
{
    BaseSetAssoc::insertBlock(pkt, blk);
    int set = extractSet(pkt->getAddr());

    // Calculate the total number of accesses in the set
    int tot = 0;
    for (int i = 0; i < assoc; i++) {
        BlkType *b1 = sets[set].blks[i];
        tot += b1->refCount;
    }

    double add = 0.0;
    int start = 0;
    if (tot != 0) {
        for (int i = 0; i < assoc; i++) {
            BlkType *b2 = sets[set].blks[i];
            // Access probability of each line
            double prob = static_cast<double>(b2->refCount) / tot;
            add += prob;
            // Select the line where the accumulated probability reaches 0.5
            if (add >= 0.5) {
                start = i;
                break;
            }
        }
    }

    // Set the head of the probationary segment for the new data
    sets[set].moveToHead1(blk, start);
}
3.4 Code Snippet for Cacheset
template <class Blktype>
void
CacheSet<Blktype>::moveToHead1(Blktype *blk, int start)
{
    // nothing to do if block is already head
    if (blks[0] == blk)
        return;

    /* write 'next' block into blks[i],
       moving up from MRU toward LRU
       until we overwrite the block we moved to head,
       starting at the head of the probationary segment */
    int i = start;
    Blktype *next = blk;
    do {
        assert(i < assoc);
        // swap blks[i] and next
        Blktype *tmp = blks[i];
        blks[i] = next;
        next = tmp;
        ++i;
    } while (next != blk);
}
Fig. 4. LRU Segmentation: The probationary vs protected
segments
3.5 Dynamic SLRU with Random Promotion and Aging
Traditional implementations of SLRU have shown benefit
from making selected random promotions as well. Random
promotion in the SLRU algorithm randomly picks a cache line
from the probationary segment and promotes it to the
protected segment. This random promotion
is also added to the dynamic segmented LRU policy to
obtain further performance improvements.
Complementing random promotion, we also added a cache-line
aging mechanism that moves the aged cache line with the
lowest access probability from the protected to the probationary
segment, again to obtain further performance improvements.
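Both mechanisms are simple enough to show directly. The sketch below operates on plain per-segment vectors carrying a per-line access probability; the structures and names are our assumptions, used only to make the two operations concrete.

#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

struct Line {
    uint64_t addr;
    double accessProb;   // per-line access probability (as in Section 3.2)
};

// Randomly promote one probationary line to the protected segment.
void randomPromotion(std::vector<Line> &probationary,
                     std::vector<Line> &protectedSeg, std::mt19937 &rng) {
    if (probationary.empty()) return;
    std::uniform_int_distribution<std::size_t> pick(0, probationary.size() - 1);
    std::size_t i = pick(rng);
    protectedSeg.push_back(probationary[i]);
    probationary.erase(probationary.begin() + i);
}

// Age the protected line with the lowest access probability back down
// to the probationary segment.
void ageProtected(std::vector<Line> &probationary, std::vector<Line> &protectedSeg) {
    if (protectedSeg.empty()) return;
    std::size_t victim = 0;
    for (std::size_t i = 1; i < protectedSeg.size(); ++i)
        if (protectedSeg[i].accessProb < protectedSeg[victim].accessProb)
            victim = i;
    probationary.push_back(protectedSeg[victim]);
    protectedSeg.erase(protectedSeg.begin() + victim);
}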
3.6 Dynamic SLRU With Adaptive Bypassing
Cache bypassing helps avoid invalidating a cache line
with high access probability for the sake of just one or two misses.
The new data is read directly from memory, with
no update of the cache line where the miss occurred. This helps
improve the hit rate by keeping highly accessed
cache lines a little longer in the cache set.
Initially, our bypass algorithm arbitrarily picks a bypass
probability for implementing adaptive bypassing [11] [12].
The probability of bypassing is then dynamically
adapted according to how effective past bypass decisions have
been, as measured by the hit rate.
Each effective bypass doubles the probability that a future
bypass will occur; for example, if the current probability
is 0.25, it doubles to 0.5. Similarly, each
ineffective bypass halves the probability of a future bypass,
for example cutting the current probability from 0.5 to 0.25.
To turn off adaptive bypassing, the bypass probability is
set to 0, which prevents any bypassing and allocates all
missed lines.
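The double/halve adaptation is straightforward to state in code. The sketch below is an assumption-laden rendering: the class name, the initial probability, and the clamp to 1.0 are ours, not taken from [11] or [12].

#include <algorithm>
#include <random>

// Adaptive bypass probability: doubled after an effective bypass,
// halved after an ineffective one (limits are illustrative).
class BypassController {
  public:
    bool shouldBypass(std::mt19937 &rng) {
        if (prob_ <= 0.0) return false;           // probability 0 disables bypassing
        std::bernoulli_distribution coin(prob_);
        return coin(rng);
    }
    void feedback(bool bypassWasEffective) {
        if (bypassWasEffective)
            prob_ = std::min(1.0, prob_ * 2.0);   // e.g. 0.25 -> 0.5
        else
            prob_ = prob_ / 2.0;                  // e.g. 0.5 -> 0.25
    }
  private:
    double prob_ = 0.25;                          // arbitrary initial probability
};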
3.7 Miss Status Holding Register (MSHR)
Adaptive bypassing is implemented with a Miss Status
Holding Register, which stores the cache miss
information without invalidating the corresponding cache
line. This in turn improves the hit rate by supplying cache
hits even under a miss.
When the data becomes available from memory, the pending
miss is resolved with the new data placed in the cache line.
However, adaptive bypassing cannot be performed if the
MSHR becomes full; the pipeline must stall until pending misses
are resolved and enough space is freed to record the new pending
miss and continue the adaptive bypassing mechanism.
This adaptive bypassing with the MSHR is also combined
with dynamic SLRU to obtain further performance
improvements.
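A minimal sketch of how a full MSHR gates the bypass decision is shown below; the structure and interface are hypothetical simplifications, not gem5's MSHR.

#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Hypothetical miss-status-holding register: one outstanding entry per block
// address. Bypassing is only attempted while a free entry exists.
class Mshr {
  public:
    explicit Mshr(std::size_t capacity) : capacity_(capacity) {}

    // True if the miss could be registered (bypass may proceed).
    bool allocate(uint64_t blockAddr) {
        if (pending_.size() >= capacity_ && !pending_.count(blockAddr))
            return false;                 // MSHR full: stall, no bypass
        ++pending_[blockAddr];            // record (or merge) the miss
        return true;
    }
    // Called when memory returns the data; frees the entry.
    void resolve(uint64_t blockAddr) { pending_.erase(blockAddr); }

  private:
    std::size_t capacity_;
    std::unordered_map<uint64_t, unsigned> pending_;
};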
4. COHERENCE POLICY
4.1 Coherence and Ring Interconnects
In multi-core processors, data transactions between
several processors and their respective caches give rise to
a coherence problem. This occurs when two processors
access the same physical address space [13]. Thus
shared-memory models and their respective cache hierarchies
should be designed from a performance-sensitive
standpoint. In this paper, the cache coherence problem
is addressed for the ring interconnect model, which was
chosen since rings are proven to address the coherence
problem quite well. Rings have an exploitable
coherence ordering and simple, distributed arbitration
as opposed to bus topologies, with short, fast point-to-point
links and fewer ports [14] [15].
The ordering of a ring is not the ordering of a bus, since a bus
has a centralized arbiter [16]. To initiate a request, a core
must first access a centralized arbiter and then send its
request to a queue. The queue creates the total order of
requests and resends each request on a separate set of snoop
links [17] [18]. Caches snoop the requests and send the
results on another set of links, where the snoop results are
collected at another queue. Finally, the snoop queue resends
the final snoop results to all cores [19]. This type of logical
bus incurs significant performance loss in recreating the
ordering of an atomic bus [20], which is a drawback of
crossbar interconnects. We therefore choose the ring interconnect
for the efficiency of its wires. In a way, the topology is
analogous to a traffic roundabout, and this is the idea on which
the snooping was implemented. Rings offer distributed
access through the "Token Ring" approach [20]-[23].
There have been several proposals for implementing coherence
on the ring topology. The Greedy-Order protocol uses
unbounded re-entries of cache requests onto the ring to handle
contention; this improves latency but hurts bandwidth.
The Ordering-Point protocol uses a performance-costly
ordering point, which hurts latency.
The Ring-Order consistency used in this paper is fast and
stable in performance. It exploits the round-robin order
of the ring and uses a token-counting approach
that passes tokens to ring nodes in order to ensure
coherence safety [22]. A program was designed to simulate
an LRU cache with a write-back and write-allocate policy.
The MOESI snooping protocol was modified for the ring
topology so that initial requests succeed every
time, and as a result there are no re-entries or ordering
points.
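The Ring-Order mechanism itself is specified in [22]; purely to illustrate its round-robin, token-counting flavour, the toy sketch below walks a request once around a ring of nodes, collecting the tokens held for a block. Everything here, including the node interface and the rule "holding all tokens permits a write", is an assumption for illustration, not the protocol of [22].

#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Toy token-counting walk around a unidirectional ring (illustrative only).
struct RingNode {
    std::unordered_map<uint64_t, unsigned> tokens;   // tokens held per block
};

// Requester at position 'start' walks the ring once in ring order, collecting
// every token for 'block'; holding all 'totalTokens' permits a write.
bool collectTokens(std::vector<RingNode> &ring, std::size_t start,
                   uint64_t block, unsigned totalTokens) {
    unsigned held = ring[start].tokens[block];
    for (std::size_t hop = 1; hop < ring.size(); ++hop) {
        RingNode &node = ring[(start + hop) % ring.size()];
        auto it = node.tokens.find(block);
        if (it != node.tokens.end()) {
            held += it->second;                       // tokens move to requester
            node.tokens.erase(it);
        }
    }
    ring[start].tokens[block] = held;
    return held == totalTokens;                       // all tokens => safe to write
}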
5. SIMULATION
In order to evaluate the e↵ectiveness of the energy e cient
cache design for multi-core processor, the simulation of the
Fig. 5. CPI for various Cache designs vs Benchmarks used
improved cache protocols were done on Gem5 simulator
[24]. The baseline was taken as an X86 processor with 4
cores. Some of the SPEC 2006 benchmark programs were
used for the simulation
The following table 1 explains the configuration of our
baseline system.
Table 1. Baseline System Settings

System               Configuration
Private L1 caches    Split I&D, 4 kB, 4-way set associative
Shared filter cache  Unified I&D, 8 kB, 8-way set associative
Shared L2 cache      Unified I&D, 64 kB, 16-way set associative
Main memory          1 GB of DRAM
Ring interconnect    80-byte, unidirectional
5.1 Results
The energy-efficient cache design was integrated into
gem5, and comparative experiments were performed against the
baseline 4-core cache system and against the filter cache with a fixed
distribution of the public filter using a crossbar interconnect
[25]. The public-filter associativity is 16, fixed for
simplicity, which means each core has 4 filter lines in the
fixed filter cache. The dynamic management method
(SLRU) is activated every 1000 instructions.
In the experiments, the performance and energy consumption
of each benchmark were observed. The performance is
evaluated by the CPI (cycles per instruction); the smaller
the CPI, the higher the performance of the system.
The results obtained from this experiment were fed into
CACTI to observe the power and energy consumption.
Figure 5 shows the improvement in CPI for every benchmark
and each modified system. On average, there is
about a 7.68% improvement in the CPI of the fully enhanced
system when compared to the baseline system.
Figure 6 shows the reduced energy consumption of each
proposed cache system for the benchmarks. The energy
consumption is improved by about 11%.
Fig. 6. Energy consumption for the cache implementations
The coherence policy was simulated using gem5 as well
as the SMP Cache simulator. The write transactions were
recorded against the normalized traffic (the total L2
cache misses/transactions on a scale of 0 to 1). Figure 7
shows the improvements.
Fig. 7. Write Transactions vs L2 Misses
The hit rates were also recorded, and it is seen that with a
128 KB cache for all 4 cores, the performance was quite
good for the given workload, with hit rates ranging from
89% to 97%. The hit rate for Ring-Order was found to be
higher than that of Ordering-Point and Greedy-Order, as
shown in the table below.
The snoops per cycle for rings also show an improvement over the
bus for these benchmarks, as shown in the table below.
6. FUTURE WORK
The cache system proposed can be integrated with the
coherence policy discussed in this paper. In that case, instead
of a crossbar interconnect, the filter-cache and SLRU design
would be implemented on a system in which the cores are
connected in a ring topology.
Cache power and performance can also be improved using deterministic
naps and early miss detection. Dynamic power
can be reduced by 2% by using a hash-based mechanism
to minimize cache-line lookups. There is a 92% improvement
in performance due to skipping a few cache pipeline stages on
guaranteed misses. Static power savings of about 17% are
achieved by using cache accesses to deterministically lower
the power state of cache lines that are guaranteed not
to be accessed in the immediate future [26]. If this were
implemented in the proposed cache design, there would be
better results in terms of performance and power.
7. CONCLUSION
In this paper, an energy-efficient cache design for multi-core
processors was proposed. The baseline cache is implemented
with a filter cache structure on the multi-core
I-cache in the form of a public filter, which is the shared first
instruction source for all cores. Meanwhile, a dynamic LRU
policy for the public filter is also applied. Together they
improved the power and performance of the cache.
The experimental results show that the presented method
can save about 11% energy and also shows a significant
improvement in performance. The coherence policy for
a ring topology was also discussed, and the results showed
improvement when compared with the bus topology.
ACKNOWLEDGEMENTS
We profoundly thank Professor Dr. Bhanu Kapoor for
providing us guidance, support and encouragement. We
also thank Jiacong He, whose PhD qualifier presentation
inspired us to work on this research.
REFERENCES
[1] Hennessy, J. L. and Patterson, D. A. (2012). "Computer
Architecture: A Quantitative Approach." Elsevier.
[2] Weiyu Tang, R. Gupta, and A. Nicolau, "A Design of
a Predictive Filter Cache for Energy Savings in High
Performance Processor Architectures," Proceedings of the
International Conference on Computer Design, 2001: 68-73.
[3] Brooks, D., Tiwari, V., and Martonosi, M. (2000).
"Wattch: a framework for architectural-level power analysis
and optimizations" (Vol. 28, No. 2, pp. 83-94). ACM.
[4] S. Srinath, O. Mutlu, H. Kim, and Y. N. Patt, "Feedback
directed prefetching: Improving the performance and
bandwidth-efficiency of hardware prefetchers," in Proc. of
the 13th International Symposium on High Performance
Computer Architecture, 2007.
[5] Cao, X. and Z. Xiaolin, "An Energy Efficient Cache Design
for Multi-core Processors," in IEEE International Conference
on Green Computing and Communications, 2013.
[6] Advanced Micro Devices, Inc., AMD64 Architecture
Programmer's Manual Volume 3: "General-Purpose and
System Instructions," May 2013, revision 3.20.
[7] Johnson Kin, Munish Gupta, and William H. Mangione-Smith,
"The Filter Cache: An Energy Efficient Memory
Structure," Proceedings of the Thirtieth Annual IEEE/ACM
International Symposium on Microarchitecture, 1997: 184-193.
[8] J. Kin, M. Gupta, and W. H. Mangione-Smith, "Filtering
memory references to increase energy efficiency," IEEE
Trans. Comput., vol. 49, no. 1, pp. 1-15, Jan. 2000.
[9] H. Gao and C. Wilkerson, "A dueling segmented LRU
replacement algorithm with adaptive bypassing," 1st JILP
Cache Replacement Championship, France, 2010.
[10] K. Morales and B. K. Lee, "Fixed Segmented LRU
cache replacement scheme with selective caching," 2012
IEEE 31st International Performance Computing and
Communications Conference (IPCCC), Austin, TX, 2012.
[11] H. Gao and C. Wilkerson, "A dueling segmented
LRU replacement algorithm with adaptive bypassing," in
Proceedings of the 1st JILP Workshop on Computer
Architecture Competitions, 2010.
[12] Jayesh Gaur et al., "Bypass and Insertion Algorithms
for Exclusive Last-level Caches," in ISCA 2011.
[13] Hongil Yoon and Gurindar S. Sohi, "Reducing Coherence
Overheads with Multi-line Invalidation (MLI) Messages,"
Computer Sciences Department, University of
Wisconsin-Madison.
[14] Daniel J. Sorin, Mark D. Hill, and David A. Wood, "A
Primer on Memory Consistency and Cache Coherence,"
Synthesis Lectures in Computer Architecture, Morgan &
Claypool Publishers, 2011.
[15] I. Singh, A. Shriraman, W. W. L. Fung, M. O'Connor,
and T. M. Aamodt, "Cache coherence for GPU architectures,"
in HPCA, 2013, pp. 578-590.
[16] R. Kumar, V. Zyuban, and D. Tullsen, "Interconnections
in multi-core architectures: Understanding Mechanisms,
Overheads and Scaling," in Proceedings of the 32nd
Annual International Symposium on Computer Architecture,
June 2005.
[17] M. M. K. Martin, P. J. Harper, D. J. Sorin, M. D. Hill,
and D. A. Wood, "Using destination-set prediction to improve
the latency/bandwidth trade-off in shared-memory
multiprocessors," in Proceedings of the 30th ISCA, June
2003.
[18] M. M. K. Martin, M. D. Hill, and D. A. Wood, "Token
coherence: Decoupling performance and correctness," in
ISCA-30, 2003.
[19] M. M. K. Martin, D. J. Sorin, M. D. Hill, and D. A.
Wood, "Bandwidth adaptive snooping," in HPCA-8, 2002.
[20] M. R. Marty, "Cache coherence techniques for multicore
processors," PhD Dissertation, University of
Wisconsin-Madison, 2008.
[21] M. R. Marty, J. D. Bingham, M. D. Hill, A. J. Hu,
M. M. K. Martin, and D. A. Wood, "Improving multiple-CMP
systems using token coherence," in HPCA, February
2005.
[22] M. R. Marty and M. D. Hill, "Coherence ordering for
ring-based chip multiprocessors," in MICRO-39, December
2006.
[23] M. R. Marty and M. D. Hill, "Virtual hierarchies to
support server consolidation," in ISCA-34, 2007.
[24] N. Binkert et al., "The gem5 simulator," SIGARCH
Comput. Archit. News, 2011.
[25] gem5-gpu.cs.wisc.edu
[26] Oluleye Olorode and Mehrdad Nourani, "Improving
Cache Power and Performance Using Deterministic Naps
and Early Miss Detection," IEEE Trans. Multi-Scale
Computing Systems, Vol. 1, No. 3, pp. 150-158, 2015.
