Performance Considerations For Cache
Memory Design in a Multi-core Processor *
Divya Ravindran, dxr150630
Ilango Jeyasubramanian, ixj150230
Kavitha Thiagarajan, kxt132230
Susmitha Gogineni, sxg155930
University of Texas at Dallas, Richardson, TX 75030 USA
Abstract: In recent times, multi-core processors have gained importance over traditional
uniprocessors because the performance growth of uniprocessors has saturated. Multi-core
processors make use of multiple cores, and in order to improve their performance it is necessary
to reduce the memory access time, improve power efficiency, and maintain the coherence of data
among the cores. To address the efficiency of multiple cores, a filter cache is designed with an
efficient Segmented Least Recently Used replacement policy. This technique effectively reduces
the energy consumed by 11%. Finally, to address the coherence of the caches, a modified
MOESI-based snooping protocol for the ring topology was used. This improved the performance
of the processor by increasing the hit rate by 7%.
Keywords: multi-core; filter cache; energy efficient; hit ratio; coherence; LRU; ring-order
1. INTRODUCTION
As the number of transistors on a chip doubles roughly every
18 months following Moore's law, processor speed has improved at a
comparable rate, but memory latency has not progressed at the same rate
as the processor. Due to this difference in growth, the
time to access memory becomes relatively larger as processor
speed improves further. Caches were built to overcome this memory wall.
A cache is a small, fast memory with a shorter access
time than the main memory. These properties make the
cache desirable for improving the efficiency of the
processor [1].
This project concentrates on how caches can be
modified to make a multi-core processor work
efficiently, so that the overall speedup is improved,
the energy consumed by the processor is reduced, and its
performance increases. The analysis of the
newly implemented cache designs is done using some of
the SPEC2006 benchmarks. In this experiment, the size
and associativity of the caches are fixed in order to keep
the analysis simple, and the instruction set architecture (ISA)
used is x86-64.
The first modification performed was introducing a filter
cache, a tiny cache assumed to run at the speed of
the core. It holds the most frequently used instructions,
and the access time of data in the filter cache is very
short, but the hit rate of the filter cache is low [2]. This is
improved by implementing a prediction technique that
chooses the memory level to be accessed in order to reduce misses [3]
[4].

* This project paper is edited in the format of International Federation
of Automatic Control conference papers in LaTeX 2ε as part of
EEDG 6304 Computer Architecture coursework.
To further improve the hit rate, a Segmented
Least Recently Used (LRU) block replacement policy is
implemented along with the filter cache and analyzed. The SLRU
consists of two segments and uses access
probability to perform cache block replacement.
The coherence of the multi-core processor was analyzed
next with the help of various topologies. The idea was to
introduce a modified MOESI-based snooping protocol for
the ring topology, which helps improve the coherence of
data in a multi-core processor. This modification makes use
of the round-robin order of the ring to provide fast and stable
performance.
2. FILTER CACHE
2.1 Idea of Filter Cache
The cache is a very important component of a modern processor
that can effectively alleviate the speed gap between the
CPU and the off-chip memory system. Multi-core processors
have become the main development trend of processors due
to their high performance, but power dissipation is a major
issue given the large number of memory accesses made by multiple cores.
Therefore, an energy-efficient cache design is required.
A filter cache is used to improve performance and power
efficiency. It is a small-capacity cache that stores the data
and instructions most frequently accessed by the
cores. The filter cache acts as the first instruction source and
consumes less energy for the most used instructions and
data. The filter is assumed to have almost the same speed as
the core and to consume less energy than the normal cache.
Fig. 1 shows the basic idea of the filter cache.
Fig. 1. Filter Cache: The basic idea
Fig. 2. Filter Cache: How prediction works
The improvement in performance and energy saving is
achieved by accessing the filter instead of the normal cache.
The CPU accesses the filter first, and only when the filter
misses is the normal cache visited.
2.2 Prediction Filter Cache
For any instruction or data, the processor first accesses the
filter cache. If the filter hits, the fetch completes at a very
low cost, without the extra loss of performance and energy
that a miss would cause. Past studies
have shown that the hit ratio is extremely important
for a filter cache [5]. Therefore, to ensure a good hit ratio, a prediction
algorithm is incorporated to improve the hit ratio of the public filter.
In the prediction scheme, the CPU accesses either the filter or the
normal cache depending on a prediction signal. The prediction
algorithm is designed to eliminate unnecessary accesses to
the filter cache [6]. If the prediction in favor of the filter fails, the
CPU re-fetches the instruction through the normal cache,
which causes an extra loss of performance and
energy.
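The internals of the predictor are not specified here; purely as an illustration of how a prediction signal could steer accesses, the C++ sketch below uses a hypothetical table of 2-bit saturating counters indexed by the fetch address. The class, table size, and indexing are our assumptions, not the design of [2] or [6].

#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical predictor: a table of 2-bit saturating counters indexed by
// the fetch address. A high counter value means "the filter is likely to
// hit, probe it first"; a low value means "go straight to the normal cache".
class FilterPredictor {
  public:
    bool predictFilterHit(uint64_t fetchAddr) const {
        return table_[index(fetchAddr)] >= 2;      // states 2,3: predict filter
    }
    // Train the counter with the actual outcome of the filter probe.
    void update(uint64_t fetchAddr, bool filterHit) {
        uint8_t &c = table_[index(fetchAddr)];
        if (filterHit && c < 3) ++c;
        else if (!filterHit && c > 0) --c;
    }
  private:
    static constexpr std::size_t kEntries = 1024;  // size chosen arbitrarily
    static std::size_t index(uint64_t addr) { return (addr >> 2) % kEntries; }
    std::array<uint8_t, kEntries> table_{};
};

On a fetch, the core would consult predictFilterHit(); when it returns false, the filter probe is skipped and the normal cache is accessed directly, avoiding a wasted filter access on a likely miss.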
2.3 Architecture of the Energy-Efficient Multi-core Cache System:
Public Filter
Each core has separate level 1 instruction and data
caches, and all cores share the level 2 LLC. In addition, a public
filter cache unit shared by all cores is introduced as the first
shared cache for all cores.
Fig. 3 shows how the architecture has been modified to
accommodate the filter cache [7].
For each instruction fetch, every core accesses the public
filter first. If the public filter hits, the instruction is returned to
the core directly [8]. Otherwise, the next memory level, the L2 cache,
is accessed until the right instruction is returned, and the
public filter is updated with the new cache block that
contains the missed instruction.
Fig. 3. Filter Cache: Architectural Change made to the
Baseline Cache
Algorithm 1 Algorithm for the proposed cache design
CPU sends a request for the data;
while Resolving the public filter for the data do
    Visit the public filter;
    if data was hit then
        Return the instruction;
    else
        Visit the LLC;
        if hit then
            Return the instruction and update the filter;
        else
            Visit the main memory and update the filter;
        end
    end
end
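Algorithm 1 can be read as the lookup path sketched below in C++. The memory levels are reduced to hypothetical map-backed stand-ins so the sketch is self-contained; this shows only the control flow of the proposed design, not the gem5 implementation.

#include <cstdint>
#include <optional>
#include <unordered_map>

using Instruction = uint32_t;

// Minimal map-backed stand-in for one memory level (illustrative only).
struct Level {
    std::unordered_map<uint64_t, Instruction> lines;
    std::optional<Instruction> lookup(uint64_t addr) const {
        auto it = lines.find(addr);
        if (it == lines.end()) return std::nullopt;
        return it->second;
    }
    void fill(uint64_t addr, Instruction inst) { lines[addr] = inst; }
};

// Fetch path of Algorithm 1: public filter -> shared LLC -> main memory,
// refilling the public filter whenever a lower level supplies the block.
Instruction fetch(uint64_t addr, Level &publicFilter, Level &llc, Level &mainMem) {
    if (auto inst = publicFilter.lookup(addr))     // filter hit: cheapest path
        return *inst;
    if (auto inst = llc.lookup(addr)) {            // LLC hit
        publicFilter.fill(addr, *inst);
        return *inst;
    }
    Instruction inst = mainMem.lookup(addr).value_or(0);  // simplified DRAM read
    publicFilter.fill(addr, inst);                 // filter gets the missed block
    return inst;
}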
A dynamic replacement method, Segmented LRU (SLRU),
is used to maintain a good hit ratio, and dynamic memory
management methods are used to distribute the hit ratio
equally among all cores.
3. SEGMENTED LRU POLICY
3.1 Existing Segmented LRU
An SLRU cache is divided into two segments, a probationary
segment and a protected segment. Lines in each
segment are ordered from the most to the least recently
accessed. Fig. 4 explains how the block is segmented.
On a miss, data fetched from memory is added to the cache
at the most recently accessed end of the probationary
segment. Lines that hit are removed from wherever they
currently reside and added to the most recently accessed
end of the protected segment. Lines in the protected
segment have thus been accessed at least twice, giving each
such line another chance to be accessed before being replaced.
The lines to be discarded for replacement are taken
from the LRU end of the probationary segment [9].
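As a concrete, simplified illustration of this two-segment behaviour, the sketch below keeps both segments as recency-ordered lists with fixed capacities. It ignores tags, sets, and the dynamic sizing of Section 3.2, and all names are illustrative rather than taken from the cited work.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <deque>

// Fixed-size two-segment SLRU over block addresses (illustrative sketch).
class SlruSet {
  public:
    SlruSet(std::size_t probCap, std::size_t protCap)
        : probCap_(probCap), protCap_(protCap) {}

    // Returns true on hit. Hits are promoted to the protected MRU end;
    // misses are inserted at the probationary MRU end.
    bool access(uint64_t addr) {
        if (erase(protected_, addr) || erase(probationary_, addr)) {
            promote(addr);
            return true;
        }
        insertProbationary(addr);
        return false;
    }

  private:
    static bool erase(std::deque<uint64_t> &seg, uint64_t addr) {
        auto it = std::find(seg.begin(), seg.end(), addr);
        if (it == seg.end()) return false;
        seg.erase(it);
        return true;
    }
    void promote(uint64_t addr) {
        protected_.push_front(addr);            // MRU end of protected
        if (protected_.size() > protCap_) {
            // Demote the protected LRU line back to the probationary MRU end.
            uint64_t demoted = protected_.back();
            protected_.pop_back();
            insertProbationary(demoted);
        }
    }
    void insertProbationary(uint64_t addr) {
        probationary_.push_front(addr);         // MRU end of probationary
        if (probationary_.size() > probCap_)
            probationary_.pop_back();           // victim: probationary LRU end
    }

    std::size_t probCap_, protCap_;
    std::deque<uint64_t> probationary_, protected_;
};

A protected line that overflows is demoted to the probationary MRU end, so it still gets one more chance before it reaches the eviction end.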
3.2 Dynamic Segmented LRU
Based on our observation of the existing SLRU algorithm,
we found that it uses a constant
number of protected and probationary ways. The proposed
scheme handles the dynamic sizing of the two segments
based on the access probability of each cache line in the set
[10].
The access probabilities are accumulated starting from
the first line, and the cache line selected for the
insertion of new miss data from memory is
the line at which the accumulated probability
reaches approximately 0.5. This dynamically adjusts the
segment sizes according to access probability.
3.3 Code Snippet for LRU Changes
void
LRU::insertBlock(PacketPtr pkt, BlkType *blk)
{
    BaseSetAssoc::insertBlock(pkt, blk);
    int set = extractSet(pkt->getAddr());

    // Calculate the total number of accesses in the set
    int tot = 0;
    for (int i = 0; i < assoc; i++) {
        BlkType *b1 = sets[set].blks[i];
        tot += b1->refCount;
    }

    double add = 0.0;
    int start = 0;
    if (tot != 0) {
        for (int i = 0; i < assoc; i++) {
            BlkType *b2 = sets[set].blks[i];
            // Access probability of each line
            double prob = static_cast<double>(b2->refCount) / tot;
            add += prob;
            // Select the line where the accumulated probability reaches 0.5
            if (add >= 0.5) {
                start = i;
                break;
            }
        }
    }

    // Set the head of the probationary segment for the new data
    sets[set].moveToHead1(blk, start);
}
3.4 Code Snippet for Cacheset
template <class Blktype>
void
CacheSet<Blktype>::moveToHead1(Blktype *blk, int start)
{
    // nothing to do if block is already head
    if (blks[0] == blk)
        return;

    /* write 'next' block into blks[i],
       moving up from MRU toward LRU
       until we overwrite the block we moved to head,
       starting at the head of the probationary segment */
    int i = start;
    Blktype *next = blk;
    do {
        assert(i < assoc);
        // swap blks[i] and next
        Blktype *tmp = blks[i];
        blks[i] = next;
        next = tmp;
        ++i;
    } while (next != blk);
}
Fig. 4. LRU Segmentation: The probationary vs protected
segments
3.5 Dynamic SLRU with Random Promotion and Aging
Traditional implementations of SLRU have shown benefit
from making selected random promotions as well. Random
promotion in the SLRU algorithm randomly picks a cache line
from the probationary segment and promotes it to the
protected segment. This random promotion
is also added to the dynamic segmented LRU policy to
obtain further performance improvements.
Complementing random promotion, we also added a cache-line
aging mechanism that moves the aged cache line with the
lowest access probability from the protected to the probationary
segment, again to obtain further performance improvements.
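Both mechanisms are simple enough to show directly. The sketch below operates on plain per-segment vectors carrying a per-line access probability; the structures and names are our assumptions, used only to make the two operations concrete.

#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

struct Line {
    uint64_t addr;
    double accessProb;   // per-line access probability (as in Section 3.2)
};

// Randomly promote one probationary line to the protected segment.
void randomPromotion(std::vector<Line> &probationary,
                     std::vector<Line> &protectedSeg, std::mt19937 &rng) {
    if (probationary.empty()) return;
    std::uniform_int_distribution<std::size_t> pick(0, probationary.size() - 1);
    std::size_t i = pick(rng);
    protectedSeg.push_back(probationary[i]);
    probationary.erase(probationary.begin() + i);
}

// Age the protected line with the lowest access probability back down
// to the probationary segment.
void ageProtected(std::vector<Line> &probationary, std::vector<Line> &protectedSeg) {
    if (protectedSeg.empty()) return;
    std::size_t victim = 0;
    for (std::size_t i = 1; i < protectedSeg.size(); ++i)
        if (protectedSeg[i].accessProb < protectedSeg[victim].accessProb)
            victim = i;
    probationary.push_back(protectedSeg[victim]);
    protectedSeg.erase(protectedSeg.begin() + victim);
}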
3.6 Dynamic SLRU With Adaptive Bypassing
Cache bypassing helps avoid invalidating a cache line
with high access probability for the sake of just one or two misses.
The new data is read directly from memory, with
no update of the cache line where the miss occurred. This helps
improve the hit rate by keeping highly accessed
cache lines a little longer in the cache set.
Initially, our bypass algorithm arbitrarily picks a bypass
probability for implementing adaptive bypassing [11] [12].
The probability of bypassing is then dynamically
adapted according to how effective past bypass decisions have
been, as measured by the hit rate.
Each effective bypass doubles the probability that a future
bypass will occur; for example, if the current probability
is 0.25, it doubles to 0.5. Similarly, each
ineffective bypass halves the probability of a future bypass,
for example cutting the current probability from 0.5 to 0.25.
To turn off adaptive bypassing, the bypass probability is
set to 0, which prevents any bypassing and allocates all
missed lines.
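The double/halve adaptation is straightforward to state in code. The sketch below is an assumption-laden rendering: the class name, the initial probability, and the clamp to 1.0 are ours, not taken from [11] or [12].

#include <algorithm>
#include <random>

// Adaptive bypass probability: doubled after an effective bypass,
// halved after an ineffective one (limits are illustrative).
class BypassController {
  public:
    bool shouldBypass(std::mt19937 &rng) {
        if (prob_ <= 0.0) return false;           // probability 0 disables bypassing
        std::bernoulli_distribution coin(prob_);
        return coin(rng);
    }
    void feedback(bool bypassWasEffective) {
        if (bypassWasEffective)
            prob_ = std::min(1.0, prob_ * 2.0);   // e.g. 0.25 -> 0.5
        else
            prob_ = prob_ / 2.0;                  // e.g. 0.5 -> 0.25
    }
  private:
    double prob_ = 0.25;                          // arbitrary initial probability
};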
3.7 Miss Status Holding Register (MSHR)
Adaptive bypassing is implemented with a Miss Status
Holding Register, which stores the cache miss
information without invalidating the corresponding cache
line. This in turn improves the hit rate by supplying cache
hits even under a miss.
When the data becomes available from memory, the pending
miss is resolved with the new data placed in the cache line.
However, adaptive bypassing cannot be performed if the
MSHR becomes full; the pipeline must stall until pending misses
are resolved and enough space is freed to record the new pending
miss and continue the adaptive bypassing mechanism.
This adaptive bypassing with the MSHR is also combined
with dynamic SLRU to obtain further performance
improvements.
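A minimal sketch of how a full MSHR gates the bypass decision is shown below; the structure and interface are hypothetical simplifications, not gem5's MSHR.

#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Hypothetical miss-status-holding register: one outstanding entry per block
// address. Bypassing is only attempted while a free entry exists.
class Mshr {
  public:
    explicit Mshr(std::size_t capacity) : capacity_(capacity) {}

    // True if the miss could be registered (bypass may proceed).
    bool allocate(uint64_t blockAddr) {
        if (pending_.size() >= capacity_ && !pending_.count(blockAddr))
            return false;                 // MSHR full: stall, no bypass
        ++pending_[blockAddr];            // record (or merge) the miss
        return true;
    }
    // Called when memory returns the data; frees the entry.
    void resolve(uint64_t blockAddr) { pending_.erase(blockAddr); }

  private:
    std::size_t capacity_;
    std::unordered_map<uint64_t, unsigned> pending_;
};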
4. COHERENCE POLICY
4.1 Coherence and Ring Interconnects
In multi-core processors, data transactions between
several processors and their respective caches give rise to
a coherence problem. This occurs when two processors
access the same physical address space [13]. Thus
shared-memory models and their respective cache hierarchies
should be designed from a performance-sensitive
standpoint. In this paper, the cache coherence problem
is addressed for the ring interconnect model, which was
chosen since rings are proven to address the coherence
problem quite well. Rings have an exploitable
coherence ordering and simple, distributed arbitration
as opposed to bus topologies, with short, fast point-to-point
links and fewer ports [14] [15].
The ordering of a ring is not the ordering of a bus, since a bus
has a centralized arbiter [16]. To initiate a request, a core
must first access a centralized arbiter and then send its
request to a queue. The queue creates the total order of
requests and resends each request on a separate set of snoop
links [17] [18]. Caches snoop the requests and send the
results on another set of links, where the snoop results are
collected at another queue. Finally, the snoop queue resends
the final snoop results to all cores [19]. This type of logical
bus incurs significant performance loss in recreating the
ordering of an atomic bus [20], which is a drawback of
crossbar interconnects. We therefore choose the ring interconnect
for the efficiency of its wires. In a way, the topology is
analogous to a traffic roundabout, and this is the idea on which
the snooping was implemented. Rings offer distributed
access through the "Token Ring" approach [20]-[23].
There have been several proposals for implementing coherence
on the ring topology. The Greedy-Order protocol uses
unbounded re-entries of cache requests onto the ring to handle
contention; this improves latency but hurts bandwidth.
The Ordering-Point protocol uses a performance-costly
ordering point, which hurts latency.
The Ring-Order consistency used in this paper is fast and
stable in performance. It exploits the round-robin order
of the ring and uses a token-counting approach
that passes tokens to ring nodes in order to ensure
coherence safety [22]. A program was designed to simulate
an LRU cache with a write-back and write-allocate policy.
The MOESI snooping protocol was modified for the ring
topology so that initial requests succeed every
time, and as a result there are no re-entries or ordering
points.
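The Ring-Order mechanism itself is specified in [22]; purely to illustrate its round-robin, token-counting flavour, the toy sketch below walks a request once around a ring of nodes, collecting the tokens held for a block. Everything here, including the node interface and the rule "holding all tokens permits a write", is an assumption for illustration, not the protocol of [22].

#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Toy token-counting walk around a unidirectional ring (illustrative only).
struct RingNode {
    std::unordered_map<uint64_t, unsigned> tokens;   // tokens held per block
};

// Requester at position 'start' walks the ring once in ring order, collecting
// every token for 'block'; holding all 'totalTokens' permits a write.
bool collectTokens(std::vector<RingNode> &ring, std::size_t start,
                   uint64_t block, unsigned totalTokens) {
    unsigned held = ring[start].tokens[block];
    for (std::size_t hop = 1; hop < ring.size(); ++hop) {
        RingNode &node = ring[(start + hop) % ring.size()];
        auto it = node.tokens.find(block);
        if (it != node.tokens.end()) {
            held += it->second;                       // tokens move to requester
            node.tokens.erase(it);
        }
    }
    ring[start].tokens[block] = held;
    return held == totalTokens;                       // all tokens => safe to write
}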
5. SIMULATION
In order to evaluate the e↵ectiveness of the energy e cient
cache design for multi-core processor, the simulation of the
Fig. 5. CPI for various Cache designs vs Benchmarks used
improved cache protocols were done on Gem5 simulator
[24]. The baseline was taken as an X86 processor with 4
cores. Some of the SPEC 2006 benchmark programs were
used for the simulation
The following table 1 explains the configuration of our
baseline system.
Table 1. Baseline System Settings

System               Configuration
Private L1 caches    Split I&D, 4 kB, 4-way set associative
Shared filter cache  Unified I&D, 8 kB, 8-way set associative
Shared L2 cache      Unified I&D, 64 kB, 16-way set associative
Main memory          1 GB of DRAM
Ring interconnect    80-byte, unidirectional
5.1 Results
The energy-efficient cache design was integrated into
gem5, and comparative experiments were performed against the
baseline 4-core cache system and against the filter cache with a fixed
distribution of the public filter using a crossbar interconnect
[25]. The public-filter associativity is 16, fixed for
simplicity, which means each core has 4 filter lines in the
fixed filter cache. The dynamic management method
(SLRU) is activated every 1000 instructions.
In the experiments, the performance and energy consumption
of each benchmark were observed. The performance is
evaluated by the CPI (cycles per instruction); the smaller
the CPI, the higher the performance of the system.
The results obtained from this experiment were fed into
CACTI to observe the power and energy consumption.
Figure 5 shows the improvement in CPI for every benchmark
and each modified system. On average, there is
about a 7.68% improvement in the CPI of the fully enhanced
system when compared to the baseline system.
Figure 6 shows the reduced energy consumption of each
proposed cache system for the benchmarks. The energy
consumption is improved by about 11%.
Fig. 6. Energy consumption for the cache implementations
The coherence policy was simulated using gem5 as well
as the SMP Cache simulator. The write transactions were
recorded against the normalized traffic (the total L2
cache misses/transactions on a scale of 0 to 1). Figure 7
shows the improvements.
Fig. 7. Write Transactions vs L2 Misses
The hit rates were also recorded, and it is seen that with a
128 KB cache for all 4 cores, the performance was quite
good for the given workload, with hit rates ranging from
89% to 97%. The hit rate for Ring-Order was found to be
higher than that of Ordering-Point and Greedy-Order, as
shown in the table below.
The snoops per cycle for rings also show an improvement over the
bus for these benchmarks, as shown in the table below.
6. FUTURE WORK
The cache system proposed can be integrated with the
coherence policy discussed in this paper. In that case, instead
of a crossbar interconnect, the filter-cache and SLRU design
would be implemented on a system in which the cores are
connected in a ring topology.
Cache power and performance can also be improved using deterministic
naps and early miss detection. Dynamic power
can be reduced by 2% by using a hash-based mechanism
to minimize cache-line lookups. There is a 92% improvement
in performance due to skipping a few cache pipeline stages on
guaranteed misses. Static power savings of about 17% are
achieved by using cache accesses to deterministically lower
the power state of cache lines that are guaranteed not
to be accessed in the immediate future [26]. If this were
implemented in the proposed cache design, there would be
better results in terms of performance and power.
7. CONCLUSION
In this paper, an energy-efficient cache design for multi-core
processors was proposed. The baseline cache is implemented
with a filter cache structure on the multi-core
I-cache in the form of a public filter, which is the shared first
instruction source for all cores. Meanwhile, a dynamic LRU
policy for the public filter is also applied. Together they
improved the power and performance of the cache.
The experimental results show that the presented method
can save about 11% energy and also shows a significant
improvement in performance. The coherence policy for
a ring topology was also discussed, and the results showed
improvement when compared with the bus topology.
ACKNOWLEDGEMENTS
We profoundly thank Professor Dr. Bhanu Kapoor for
providing us guidance, support and encouragement. We
also thank Jiacong He, whose PhD qualifier presentation
inspired us to work on this research.
REFERENCES
[1] Hennessy, J. L. and Patterson, D. A. (2012). "Computer
Architecture: A Quantitative Approach." Elsevier.
[2] Weiyu Tang, R. Gupta, and A. Nicolau, "A Design of
a Predictive Filter Cache for Energy Savings in High
Performance Processor Architectures," Proceedings of the
International Conference on Computer Design, 2001: 68-73.
[3] Brooks, D., Tiwari, V., and Martonosi, M. (2000).
"Wattch: a framework for architectural-level power analysis
and optimizations" (Vol. 28, No. 2, pp. 83-94). ACM.
[4] S. Srinath, O. Mutlu, H. Kim, and Y. N. Patt, "Feedback
directed prefetching: Improving the performance and
bandwidth-efficiency of hardware prefetchers," in Proc. of
the 13th International Symposium on High Performance
Computer Architecture, 2007.
[5] Cao, X. and Z. Xiaolin, "An Energy Efficient Cache Design
for Multi-core Processors," in IEEE International Conference
on Green Computing and Communications, 2013.
[6] Advanced Micro Devices, Inc., AMD64 Architecture
Programmer's Manual Volume 3: "General-Purpose and
System Instructions," May 2013, revision 3.20.
[7] Johnson Kin, Munish Gupta, and William H. Mangione-Smith,
"The Filter Cache: An Energy Efficient Memory
Structure," Proceedings of the Thirtieth Annual IEEE/ACM
International Symposium on Microarchitecture, 1997: 184-193.
[8] J. Kin, M. Gupta, and W. H. Mangione-Smith, "Filtering
memory references to increase energy efficiency," IEEE
Trans. Comput., vol. 49, no. 1, pp. 1-15, Jan. 2000.
[9] H. Gao and C. Wilkerson, "A dueling segmented LRU
replacement algorithm with adaptive bypassing," 1st JILP
Cache Replacement Championship, France, 2010.
[10] K. Morales and B. K. Lee, "Fixed Segmented LRU
cache replacement scheme with selective caching," 2012
IEEE 31st International Performance Computing and
Communications Conference (IPCCC), Austin, TX, 2012.
[11] H. Gao and C. Wilkerson, "A dueling segmented
LRU replacement algorithm with adaptive bypassing," in
Proceedings of the 1st JILP Workshop on Computer
Architecture Competitions, 2010.
[12] Jayesh Gaur et al., "Bypass and Insertion Algorithms
for Exclusive Last-level Caches," in ISCA 2011.
[13] Hongil Yoon and Gurindar S. Sohi, "Reducing Coherence
Overheads with Multi-line Invalidation (MLI) Messages,"
Computer Sciences Department, University of
Wisconsin-Madison.
[14] Daniel J. Sorin, Mark D. Hill, and David A. Wood, "A
Primer on Memory Consistency and Cache Coherence,"
Synthesis Lectures in Computer Architecture, Morgan &
Claypool Publishers, 2011.
[15] I. Singh, A. Shriraman, W. W. L. Fung, M. O'Connor,
and T. M. Aamodt, "Cache coherence for GPU architectures,"
in HPCA, 2013, pp. 578-590.
[16] R. Kumar, V. Zyuban, and D. Tullsen, "Interconnections
in multi-core architectures: Understanding Mechanisms,
Overheads and Scaling," in Proceedings of the 32nd
Annual International Symposium on Computer Architecture,
June 2005.
[17] M. M. K. Martin, P. J. Harper, D. J. Sorin, M. D. Hill,
and D. A. Wood, "Using destination-set prediction to improve
the latency/bandwidth trade-off in shared-memory
multiprocessors," in Proceedings of the 30th ISCA, June
2003.
[18] M. M. K. Martin, M. D. Hill, and D. A. Wood, "Token
coherence: Decoupling performance and correctness," in
ISCA-30, 2003.
[19] M. M. K. Martin, D. J. Sorin, M. D. Hill, and D. A.
Wood, "Bandwidth adaptive snooping," in HPCA-8, 2002.
[20] M. R. Marty, "Cache coherence techniques for multicore
processors," PhD Dissertation, University of
Wisconsin-Madison, 2008.
[21] M. R. Marty, J. D. Bingham, M. D. Hill, A. J. Hu,
M. M. K. Martin, and D. A. Wood, "Improving multiple-CMP
systems using token coherence," in HPCA, February
2005.
[22] M. R. Marty and M. D. Hill, "Coherence ordering for
ring-based chip multiprocessors," in MICRO-39, December
2006.
[23] M. R. Marty and M. D. Hill, "Virtual hierarchies to
support server consolidation," in ISCA-34, 2007.
[24] N. Binkert et al., "The gem5 simulator," SIGARCH
Comput. Archit. News, 2011.
[25] gem5-gpu.cs.wisc.edu
[26] Oluleye Olorode and Mehrdad Nourani, "Improving
Cache Power and Performance Using Deterministic Naps
and Early Miss Detection," IEEE Trans. Multi-Scale
Computing Systems, Vol. 1, No. 3, pp. 150-158, 2015.
