IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308
Volume: 02 Special Issue: 02 | Dec-2013, Available @ http://www.ijret.org
ARCHITECTURE AND IMPLEMENTATION ISSUES OF MULTI-CORE
PROCESSORS AND CACHING – A SURVEY
Bhaskar Das1, Ashim Kumar Mahato2, Ajoy Kumar Khan3
1, 2, 3 Department of Information Technology, Assam University, Silchar
Abstract
As the performance gap between processors and main memory continues to widen, increasingly aggressive implementations of cache
memories are needed to bridge the gap. This paper surveys what brought about the change from single-processor architectures to
multiple processors on a single die, some of the hurdles involved, and the technologies behind it. Having every processor on a single
die allows much greater communication speeds between the processors. Multi-threading and multitasking, security and virtualization,
and physical constraints such as heat generation and die size all create the need for multi-core processors. The processor cache is the
performance bottleneck in most current architectures, so we also consider some of the issues involved in the implementation of highly
optimized cache memories and survey the techniques that can be used to help achieve the increasingly stringent design targets and
constraints of multi-core processors.
Keywords: Cache, Multi-core, Multi-tasking, Multi-threading, Virtualization.
---------------------------------------------------------------------***---------------------------------------------------------------------
1. INTRODUCTION
Traditional processor architectures have pushed transistor counts into the hundreds of millions. Each transistor, a nano-scale
electronic switch, can toggle between the 1 and 0 states billions of times per second, and this switching consumes considerable
power. One way to counteract the power consumed is to reduce the size of the transistor. However, a transistor can only shrink so
much before the functionality of the electronic switch breaks down and current passes improperly [1]. This power consumption also
leads to heat production, another side effect of high transistor counts. These issues point toward a shift in architectures: greater
parallelism.
When multiple applications run on a single-core processor, the operating system acts as a scheduler, switching contexts between the
applications. A context switch can require a complete dump of all processor registers and possibly the cache(s), which is costly in
terms of completion time. With two processors working in parallel, for example, there is no need to switch contexts between two
running applications. To gain the main advantage of multiple cores, the programmer must divide the application into simultaneous
threads, or the operating system must do so for multitasking. A thread is a lightweight sub-program that shares the same memory
space as other threads under the same process. Multi-threading is challenging, relatively new, and is not yet taught as being as
fundamental as, say, data structures.
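To make the notion concrete, here is a minimal sketch (ours, not from the surveyed work) of two threads sharing one process's memory; the names counter, lock, and worker are illustrative.

```python
import threading

counter = 0              # shared state: visible to every thread in the process
lock = threading.Lock()  # threads share memory, so updates must be synchronized

def worker(iterations):
    global counter
    for _ in range(iterations):
        with lock:
            counter += 1

threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)           # 200000: both threads updated the same memory location
```

On a multi-core processor, the operating system is free to schedule the two threads on different cores.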
Cache memory is a small, fast memory used to temporarily store the portions of main memory that are most frequently used. The
concept of using a cache to improve processor performance is easy to understand, and today caches have become an essential part
of every processor. The main aim is to reduce the performance gap between processor and main memory, which is why a cache is
inserted between them.
The ability of caches to bridge the performance gap depends on two primary factors: the time needed to retrieve data from the cache
and the fraction of memory references that can be satisfied by the cache. These two factors are commonly referred to as the access
(hit) time and the hit ratio, respectively [16], [17]. The access time is most important for first-level caches, because a longer access
time means a slower processor clock rate or more pipeline stages. To minimize access time, the cache access should be triggered as
soon as the address of the memory reference is available. The hit ratio is also critical, both because misses impose delays and
because off-chip bandwidth, especially on a shared bus, is a very limited resource.
The rest of this paper is organized as follows. Section 2 discusses the basics of multi-core architecture and cache technologies.
Section 3 shows how the performance of a multi-core processor can be critically analyzed. Section 4 describes the architecture of
multi-core processors. Section 5 states some implementation issues of modern caches. Section 6 describes how multi-core processors
are used.
2. BASICS
2.1 Computer Architectures
Past architectures used multiple physically separate processors. Those architectures have gradually become outdated in favor of
multiple on-chip processors, mainly because of wire delay and caching techniques. Wire delay is the time it takes for data to traverse
the physical wires, and it can have a drastic effect on achievable frequencies. There is also the added problem of limited
inter-processor communication pins between multiple separate processors - a problem not seen in multi-core processors.
2.2 Cache
As noted in the introduction, cache memory is a small, fast memory that temporarily holds the most frequently used portions of main
memory. It is inserted between the processor and main memory to reduce the performance gap between them, and it has become an
essential part of every processor.
A data cache is designed with two key concepts in mind:
1. Temporal locality [3]: an element that has been referenced is likely to be required again in the near future, so code should be
arranged so that an element already in the cache is reused often.
2. Spatial locality [3]: when one element is referenced, other elements in the same block will likely be needed soon. A whole cache
line is fetched together, so code should work on consecutive data elements within the same cache line.
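As a small illustration (ours, not from [3]), the two traversal orders below compute the same sum over a 2-D array, but the row-major loop visits consecutive elements of each row, exploiting spatial locality, while the column-major loop jumps to a different row on every access. In Python the effect is muted by interpreter overhead; in a language with contiguous arrays, such as C, the row-major version is measurably faster.

```python
N = 512
matrix = [[i * N + j for j in range(N)] for i in range(N)]

def sum_row_major(m):
    # Consecutive elements of a row share a cache line fetched together.
    total = 0
    for row in m:
        for x in row:
            total += x
    return total

def sum_column_major(m):
    # Each access lands in a different row: poor spatial locality.
    total = 0
    for j in range(N):
        for i in range(N):
            total += m[i][j]
    return total

assert sum_row_major(matrix) == sum_column_major(matrix)
```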
When the processor looks for a data item and finds it in the cache, the access is called a cache hit; when the item is not in the cache,
it is called a cache miss. To reduce the cost of misses, we should concentrate on both the latency and the bandwidth of the memory.
The details of cache operation lead to different cache design choices: cache mapping techniques, the cache replacement policy, and
cache write techniques.
Cache mapping techniques [5]: there are three basic techniques.
In direct mapping, each block in main memory has exactly one location in the cache it can be copied to (see Fig. 1 for an example).
This technique is the least costly, as no searching is required. However, thrashing, in which one cache location is continually
swapped between two or more memory blocks, can make the overhead an issue.
In fully-associative mapping, each memory block can be stored anywhere in the cache. The entire cache must be searched on every
memory access, which requires more hardware and is thus very costly.
A combination of the two mapping concepts, direct and fully-associative, forms the most common technique: set-associative
mapping. Here the cache is divided into separate sets, each made up of two or more blocks. A two-block set-associative mapping is
referred to as 2-way, because data retrieved from main memory can be placed in two different locations instead of just one. This
technique is flexible and limits the amount of thrashing that can occur, as the toy simulation following Fig. 2 illustrates.
Fig.1. Direct Main Memory to Cache mapping
Fig.2. Associative Main Memory to Cache mapping
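The toy simulator below (ours, not from the surveyed work) makes the thrashing contrast concrete: two blocks that map to the same set evict each other endlessly in a direct-mapped cache, while a 2-way set-associative cache of the same total capacity holds both.

```python
def simulate(num_sets, ways, accesses):
    """Count misses for a cache of num_sets sets, each holding ways lines (LRU)."""
    sets = [[] for _ in range(num_sets)]  # each set lists block addresses, LRU first
    misses = 0
    for block in accesses:
        s = sets[block % num_sets]        # index = block address mod number of sets
        if block in s:
            s.remove(block)               # hit: refresh LRU position
        else:
            misses += 1
            if len(s) == ways:
                s.pop(0)                  # evict the least recently used line
        s.append(block)
    return misses

pattern = [0, 8] * 100                    # blocks 0 and 8 conflict in both caches
print(simulate(num_sets=8, ways=1, accesses=pattern))  # direct-mapped: 200 misses
print(simulate(num_sets=4, ways=2, accesses=pattern))  # 2-way: 2 compulsory misses
```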
Note:
i) Direct mapping ≡ 1-way set-associative.
ii) Fully associative mapping ≡ n-way set-associative, where n is the number of cache blocks.
Block identification:
- The block address is used to identify the contents of a cache block uniquely.
- The division of the block address changes according to the mapping technique.
Consider Fig. 4. The address is divided into the block address (Tag and Index fields) and the block offset:
[ Tag | Index | Block offset ]
Fig.4. Addressing a cache block
- Offset: the least significant bits of an address, which index words within a block.
- The block address is divided into two parts: the Tag and the Index.
- The Tag is used to identify each cache block uniquely.
- The Index selects a particular set (set-associative) or slot (direct-mapped).
For a set-associative mapping, the number of index bits satisfies:
2^Index = Cache Size / (Block Size × Set Associativity)
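Applying the relation (with illustrative parameters, not taken from the paper): a 32 KB cache with 64-byte blocks and 4-way set-associativity has 32768 / (64 × 4) = 128 sets, hence 7 index bits. The sketch below splits an address accordingly.

```python
def split_address(addr, cache_size, block_size, ways):
    """Split an address into tag, index, and block offset (sizes are powers of two)."""
    num_sets = cache_size // (block_size * ways)  # 2**index_bits
    offset_bits = block_size.bit_length() - 1     # log2(block size)
    index_bits = num_sets.bit_length() - 1        # log2(number of sets)
    offset = addr & (block_size - 1)
    index = (addr >> offset_bits) & (num_sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

# 32 KB, 64 B blocks, 4-way: 6 offset bits, 7 index bits, the rest is tag.
print(split_address(0x1234ABCD, cache_size=32 * 1024, block_size=64, ways=4))
```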
Writing to the cache from the CPU presents another opportunity for optimization. There are two simple write policies: write-through
and write-back [4][5]. During a typical write, the CPU stores its computed data to a location in the cache, and that data must
eventually be stored back into main memory. Write-through stores the data into the cache and into main memory at the same time.
Write-back stores the data only in the cache, and writes to main memory only when the line is evicted.
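A hedged sketch (ours, not from [4] or [5]) of the two policies; the counters show the trade-off, with write-through issuing a memory write on every store and write-back only on eviction of a dirty line.

```python
class WriteThroughCache:
    def __init__(self):
        self.lines = {}
        self.memory_writes = 0

    def store(self, addr, value):
        self.lines[addr] = value
        self.memory_writes += 1          # every store also reaches main memory

class WriteBackCache:
    def __init__(self, capacity=4):
        self.lines = {}                  # addr -> (value, dirty flag)
        self.capacity = capacity
        self.memory_writes = 0

    def store(self, addr, value):
        if addr not in self.lines and len(self.lines) == self.capacity:
            _, (_, dirty) = self.lines.popitem()
            if dirty:
                self.memory_writes += 1  # memory written only on dirty eviction
        self.lines[addr] = (value, True)

wt, wb = WriteThroughCache(), WriteBackCache()
for i in range(100):                     # 100 stores to the same location
    wt.store(0, i)
    wb.store(0, i)
print(wt.memory_writes, wb.memory_writes)  # 100 vs 0: the dirty line is never evicted
```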
3. ANALYSIS PROCEDURE
Mathematical equations are needed to verify performance, and this is very important in cache design.
The miss rate, the ratio of misses to memory accesses, is a common metric for cache implementations. The analysis also involves the
times associated with miss penalties and hit times. From [4], the average memory access time (AMAT), in seconds or clock cycles,
can be found by the following equation.
AMAT = Hit time + (Miss rate * Miss penalty)
where the hit time is the time it takes to access a memory location in the cache, and the miss penalty is the time involved when the
requested memory is not found in the cache. Miss penalties are much larger than hit times.
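A small worked example (numbers illustrative, not from the paper) applying the equation:

```python
def amat(hit_time, miss_rate, miss_penalty):
    # Average memory access time, in clock cycles, per the equation above.
    return hit_time + miss_rate * miss_penalty

# A 1-cycle hit, a 5% miss rate, and a 100-cycle penalty give 6 cycles on average.
print(amat(hit_time=1, miss_rate=0.05, miss_penalty=100))  # 6.0
```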
Emulation and simulation software are the most common ways to test configurations before a complete physical implementation. In
[2], hardware prototyping and testing are analyzed using a Xilinx Virtex-II Pro FPGA; using an FPGA as a test bed gives great
reconfigurability. As stated in [6], random program generators combined with simulation methods are used to test the basic
structures. Upon implementation, benchmarking software such as SPEC CPU2006 [7] is used to test the many aspects of a processor.
4. ARCHITECTURE
In 2006, Intel and AMD introduced multi-core processors to the consumer and server markets. The AMD Athlon 64 FX dual-core
processor has two L1 caches, data and instruction, and one unified L2 cache for each core [8] (see Fig. 5). Intel uses a shared L2
cache in what is referred to as the "Advanced Smart Cache" [8] (see Fig. 6). This implementation technique is used to reduce cache
misses and increase performance.
Fig.5: AMD Athlon 64 FX Architecture.
Another concept, proposed in [9], is a non-uniform cache architecture that shares cache between cores dynamically. The architecture
addresses cache pollution, which occurs when one core uses cache space unnecessarily and intrudes on another core's space. The
proposal uses a quad-core processor with three levels of cache; the third level, L3, is partly shared and partly private. Each core is
allotted a certain amount of private space in L3 that cannot be intruded upon.
In [10], the idea is to specialize cores for virtual machines. Two main designs are proposed: a single virtual-machine core shared by
all other general-purpose and specialized cores (for system virtualization), or a virtual-machine-specific core attached to each
general-purpose core (for process virtualization).
System security and dependability are addressed in [11] with an "integrated framework for dependable and revivable architectures",
or INDRA. INDRA uses a core set at a higher privilege level that is protected from remote attacks, the resurrector, which monitors
the execution of the other cores, the resurrectees.
Fig.6: Intel Core Duo Architecture.
5. IMPLEMENTATION ISSUES
5.1 Addressing Constraints
To minimize the effective memory access time, the cache access should be triggered as soon as the effective address of the memory
reference becomes available. In most computers, however, address translation imposes a delay that cannot be removed completely.
5.1.1 Physical Address Cache
The caches are organized as 2-dimensional arrays and are
accessed in a two phase cycle. In the first phase, a cache set is
selected by using a portion of the address known as the index
bits. In the second phase, the remaining part of the address is
used to make a further selection from within this cache set to
yield either a cache miss determination or the requested data.
There are various techniques that exploit this two phase access
cycle to enable a physically addressed cache to be accessed
without requiring an extra address translation cycle. However,
there is a practical limit to this approach, because increasing the set-associativity provides only a diminishing return on cache hit
ratio but adds hardware complexity and adversely impacts the access time [12].
number of address bits available before address translation is to
restrict the virtual to physical page mapping so that the low-
order bits of the physical and virtual page numbers are identical
[14]. Another way to make more address bits available before
address translation is to predict the additional address bits. An
example of a good predictor is the content of the base register
that is used to compute the effective address [13].
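An illustrative calculation of the practical limit mentioned above (parameters assumed, not from [12]): only the page-offset bits of a virtual address are untranslated, so a physically indexed cache whose index phase starts before translation must keep the index plus block-offset bits within the page offset, which forces the associativity up with the cache size.

```python
import math

def min_ways(cache_size, page_size):
    # Index + block-offset bits must fit in the untranslated page-offset bits,
    # i.e. cache_size / ways <= page_size, so ways >= cache_size / page_size.
    return math.ceil(cache_size / page_size)

# A 32 KB cache with 4 KB pages needs at least 8-way set-associativity
# to be indexed entirely with untranslated bits.
print(min_ways(32 * 1024, 4 * 1024))  # 8
```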
5.1.2 Virtual Address Cache
Instead of using bits from the virtual address as a predictor for
the physical address, a different approach is to use the virtual
address to directly access the cache [15], [16]. This avoids the
delay for translation. In addition, all the addresses must be
tagged with an address space identifier or else the cache must be
purged on every task switch [16]. The most serious drawback of
the virtual address cache is that multiple virtual addresses may
be mapped to the same physical address, i.e. synonyms may
occur [16]. The usual approach to handling synonyms is to
prevent them from being present in the cache at the same time.
In general, a reverse translation buffer (RTB) is needed in order
for this approach to be feasible. One way to reduce the
complexity in handling synonyms is to make sure that the index
bits used to select the cache set are the same for both the
physical and virtual addresses.
5.2 Access Time and Miss Ratio Targets
The performance of a cache is determined both by the fraction
of memory requests it can satisfy (hit/miss ratio) and the speed
at which it can satisfy them (access time). There have been
numerous studies on cache hit/miss ratios with respect to the
cache and line sizes, and the set associativity [16], [17].
5.2.1 Decoupled Caches
In a decoupled cache, the data array access and line selection are carried out independently of the tag array access and comparison,
so as to circumvent the delay imbalance between the paths through the tag and data arrays. This is trivially true in the direct-mapped
case, because such a cache has only one line per set; however, a direct-mapped cache tends to have an inferior hit ratio due to
conflict misses.
5.2.2 Multiple-Access Caches
In a multiple-access cache, a direct-mapped cache is accessed sequentially more than once, achieving the access time of a
direct-mapped cache for the fast first access and the hit ratio of a set-associative cache overall. In [20], a simple rehashing function
based on flipping the highest-order index bit is used; upon a hit to a secondary location, the lines in the primary and secondary
locations are swapped.
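A minimal sketch of this rehash scheme (ours, modeled on the description of [20]; the slot layout and names are illustrative):

```python
class MultipleAccessCache:
    def __init__(self, num_sets=8):
        self.num_sets = num_sets
        self.slots = [None] * num_sets  # direct-mapped array of block addresses
        self.flip = num_sets >> 1       # mask that flips the highest-order index bit

    def lookup(self, block):
        primary = block % self.num_sets
        secondary = primary ^ self.flip          # rehash to the alternate slot
        if self.slots[primary] == block:
            return "primary hit"                 # direct-mapped access time
        if self.slots[secondary] == block:
            # Swap so the line is found in a single probe next time.
            self.slots[primary], self.slots[secondary] = (
                self.slots[secondary], self.slots[primary])
            return "secondary hit"
        # Miss: install in the primary slot, displacing its line to the secondary.
        self.slots[secondary] = self.slots[primary]
        self.slots[primary] = block
        return "miss"

c = MultipleAccessCache()
print(c.lookup(3), c.lookup(11), c.lookup(3))  # miss, miss, secondary hit
```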
5.2.3 Multi-level Caches
A small and fast upstream cache is used for the fast access, while one or more larger and slower downstream caches capture the
fast-access misses with minimal penalties. The organization and performance of multi-level caches have been studied extensively
[18].
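Extending the AMAT equation of Section 3 to two levels gives a hedged way to see the benefit (numbers illustrative, not from [18]):

AMAT = L1 hit time + L1 miss rate × (L2 hit time + L2 local miss rate × memory penalty)

For example, a 1-cycle L1 with a 5% miss rate, a 10-cycle L2 with a 20% local miss rate, and a 100-cycle memory penalty give AMAT = 1 + 0.05 × (10 + 0.2 × 100) = 2.5 cycles, against 6 cycles with no L2.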
5.3 Area and Bandwidth Constraints
In order to bridge the growing performance gap between
processor and memory, more and more silicon area is being
dedicated to the on-chip caches. For example, the Intel Pentium
Pro consists of a pair of 8KB on-die instruction and data L1
caches and an on-module 512KB L2 cache. Together these
caches occupy 65% of the total die area and account for 88% of
the total number of transistors. There are several approaches to increasing cache bandwidth. A straightforward way is to have
separate instruction and data caches so that instruction and data references can be handled simultaneously. Fetch bandwidth also
suffers when logically consecutive instructions are not physically contiguous; the trace cache [19] alleviates this problem by storing
the logically contiguous instructions in a physically contiguous block in a separate cache.
6. USES OF MULTI-CORE PROCESSOR
6.1 Servers
Servers make very good use of multi-core processors. A server can potentially have many simultaneous connections from many
users. To accept these connections, the server either spawns a new process or forks off a new thread for each one, leaving the main
process to wait for the next connection. The operating system can then allocate these workloads across the available cores.
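A minimal thread-per-connection sketch (ours, not from the surveyed work; the port and handler are illustrative). The main thread blocks in accept() while the operating system may schedule each handler thread on any available core.

```python
import socket
import threading

def handle(conn, addr):
    with conn:
        data = conn.recv(1024)
        conn.sendall(data)               # trivial echo service

def serve(port=8080):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind(("", port))
        srv.listen()
        while True:
            conn, addr = srv.accept()    # main thread waits for a connection
            threading.Thread(target=handle,
                             args=(conn, addr), daemon=True).start()

if __name__ == "__main__":
    serve()
```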
6.2 Consumers
The consumer market has adopted these new processors, banking on the multi-tasking parallelism granted by the multiple cores.
Consumer applications reap direct benefit from a multi-core architecture either through multi-threaded programs or via scheduling
by the operating system. Multi-core processors are not limited to traditional computers; two such examples are the Cell processor
and the NVIDIA Tesla GPU.
6.3 Virtualization
The idea of virtualization traces back to the days of mainframes. The costs are now far lower, but one thing remains true:
under-utilization. A system administrator can configure a computer to "virtualize" its devices, or its operating system, allowing one
or more simultaneous virtual machines (VMs) to use the computer as if each VM were its own computer.
7. CONCLUSIONS
Multi-core processors are already expanding their niche and are
finding many new and creative uses. Due to physical limitations
and increased multitasking requirements, the multi-core
architecture is expected to become the standard over the single-
core predecessors. Further caching schemes, both specialized
and general, will continue to be honed, narrowing the
performance gap between the processor and main memory.
During the past decade, the performance of processors has
improved by almost 60% each year. Current trends in the
industry suggest that in the future, it may become economically
feasible to integrate a processor on the same die as the DRAM.
Such integration has the potential to reduce system cost and
improve both DRAM latency and available bandwidth. For
general purpose computing, cache memories will continue to
play a crucial role in bridging the processor-DRAM
performance gap.
REFERENCES
[1] D. Geer, "Industry Trends: Chip Makers Turn to Multi-core Processors," IEEE Computer, pp. 11-13, May 2005.
[2] C. R. Clark, R. Nathuji, H. S. Lee, "Using an FPGA as a Prototyping Platform for Multi-core Processor Applications," Georgia Institute of Technology, Atlanta, GA.
[3] V. P. Heuring and H. F. Jordan, "Computer Systems Design and Architecture," Prentice Hall, 2nd Edition, 2003.
[4] J. L. Hennessy, D. A. Patterson, "Computer Architecture: A Quantitative Approach," Morgan Kaufmann Publishers, 4th Edition, 2007.
[5] L. Null, J. Lobur, "Computer Organization and Architecture," Jones and Bartlett Publishers, 2003.
[6] D. Lewin, D. Lorenz, S. Ur, "A Methodology for Processor Implementation Verification," Technion, Haifa, Israel.
[7] J. L. Henning, SPEC CPU Subcommittee, "SPEC CPU2006 Benchmark Descriptions," Standard Performance Evaluation Corporation, 2006.
[8] J. W. Langston and X. He, "Multi-core Processors and Caching - A Survey," Tennessee Technological University, 2007.
[9] H. Dybdahl, P. Stenstrom, "An Adaptive Shared/Private NUCA Cache Partitioning Scheme for Chip Multiprocessors," HiPEAC Network of Excellence.
[10] D. Upton, K. Hazelwood, "Heterogeneous Chip Multiprocessor Design for Virtual Machines," University of Virginia.
[11] W. Shi, H. S. Lee, L. Falk, M. Ghosh, "An Integrated Framework for Dependable and Revivable Architectures Using Multicore Processors," Georgia Institute of Technology, Atlanta, GA, 2006.
[12] M. Hill, A. Smith, "Evaluating Associativity in CPU Caches," IEEE Trans. Computers, Vol. 22(12), Dec. 1989.
[13] K. Hua, et al., "Early Resolution of Address Translation in Cache Design," Int'l Conf. on Computer Design, pp. 408-412, Oct. 1990.
[14] K. Inoue, H. Nonogaki, T. Urakawa, K. Shimizu, "Plural Virtual Address Space Processing System," US Patent No. 4145738, March 20, 1979.
[15] F. Reiley, J. Richcreek, "Parallel Addressing of a Storage Hierarchy in a Data Processing System Using Virtual Address," US Patent No. 3693165, Sep. 19, 1972.
[16] A. Smith, "Cache Memories," ACM Computing Surveys, Vol. 14(4), Sep. 1982, pp. 473-530.
[17] A. Smith, "Cache Memory Design: An Evolving Art," IEEE Spectrum, Dec. 1987, pp. 40-44.
[18] F. Sparacio, "Data Processing System with Second Level Cache," IBM Technical Disclosure Bulletin, 21(6), Nov. 1978, pp. 2468-2469.
[19] E. Rotenberg, S. Bennett, J. Smith, "Trace Cache: A Low-Latency Approach to High-Bandwidth Instruction Fetching," MICRO-29, Dec. 1996, pp. 24-34.
[20] A. Agarwal, J. Hennessy, M. Horowitz, "Cache Performance of Operating Systems and Multiprogramming," ACM Trans. Computer Systems, Vol. 6(4), Nov. 1988, pp. 393-431.