Directory based cache coherence

Outline
• Non-Uniform Cache Architecture (NUCA)
• Cache Coherence
• Implementation of directories in multicore
architecture
1

Non-Uniform Cache Architecture [1]
• Uniform Cache Architecture
▫ Multi-level cache hierarchies
 Organized into a few discrete levels
 Each level reduces access to the lower level
 Inclusion overhead
 Internal wire delays
 Restricted number of ports
▫ Large on-chip cache
 Single and discrete hit latency
 Undesirable due to increasing wire delays
2

• Non-uniform cache architecture (NUCA)
▫ Exploit non-uniformity
 Data in large cache closer to processor is accessed
faster than data residing physically farther
Level 2 caches architectures, 16MB with 50nm technology (taken from [1])
3

• Static NUCA
▫ Each bank can be accessed at different speeds
 Proportional to the distance from the controller
 Lower latency when closer to controller
▫ Mapping of data into banks based on block index
▫ Banks are independently addressable
▫ Access to banks may proceed in parallel
Banks have private channels
▫ Large number of wires
▫ Access time and routing delay increase with time
 Best organization at smaller technologies uses larger
banks
4

Static NUCA design (taken from [1])
5

• Switched Static NUCA
▫ 2D Mesh, point-to-point links
▫ Removes most of the large number of wires
▫ Allows a large number of faster, smaller banks
• Dynamic NUCA
▫ Allows data to be mapped to many banks
▫ Allows data to migrate among the banks
▫ Frequently used data can be promoted to faster
banks
6

Switched NUCA design (taken from [1])
7

• Policies
▫ Bank placement policy
 Where is data placed in the NUCA cache memory
▫ Bank access policy
 Determines bank-searching algorithm
▫ Bank migration policy
 Determines if a data element is allowed to change its
placement from one bank to another
 Regulates migration of data
▫ Bank replacement policy
 How NUCA behaves when there is a data eviction from
one of the banks
8

Taken from [2]
9

Cache Coherence
• Cache-coherence problem
• Support for large number of processors
▫ Need for high bandwidth
▫ Bus architecture insufficient
• Point-to-Point networks
▫ No broadcast mechanism
▫ Snooping protocol unusable
• Directory
▫ Solution for point-to-point networks
▫ Stores location of cache copies of blocks of data
▫ Centralized or distributed
10

Implementation of directories in
multicore architectures [3]
• DRAM (off-chip) directory
▫ Stores directory information in DRAM
 Ex: full-map protocol
▫ Does not exploit distance locality
▫ Treats each tile as a potential sharer of data
▫ Directory can be cached in on-chip SRAM
 Do not need to access off-chip memory each time
11

Taken from [3]
12

multicore architecture [4]
• DRAM (off-chip) directory with directory caches
▫ Private cache
▫ Directory is cached in each tile
 Do not need to access off-chip memory each time
 Non-coherent caches
 Home node for any given cache line
 Different range of memory address for each tile
▫ Directory controller in each tile
 Controls coherency between private caches
13

Taken from [4]
14

• Duplicate tag directory
▫ Directory centrally located in SRAM
▫ Connected to individual cores
▫ Exact duplicate tag store
 Directory state for a block is determined by examining
copy of tags of every possible cache that can hold the
block
 Keep copied tags up-to-date
▫ No more need to read states from DRAM memory
▫ Challenging as the number of cores increases
 64 cores, 16-way associative cache = 1024 aggregate
associativity of all tiles
15

Taken from [3]
16

Directory memory, 4-way associative caches (taken from [5])
17

• Static cache bank directory
▫ Distributed directory among the tiles
 Mapping block address to a tile (called the home tile)
 Home tiles selected by simple interleaving
 Location can be sub-optimal (see next slide)
 Tile’s cache extended to contain directory
information
 Integrates directory states with cache tags
 Avoids SRAM or DRAM separate directory
18

multicore architectures [3,6]
Taken from [3]
19
Taken from [6]

• SGI Origin2000 multiprocessor system
▫ Directory memory connected to on-chip memory
 Shared L2 cache
 Directory memory distributed over multiple tiles
 Cache coherence controller
 Home tile sends appropriate messages to cores
20

SGI Origin2000 multiprocessor system (taken from [7])
21

• Tilera Tile64 architecture
▫ 2d mesh network (8X8)
▫ Provides coherent shared-memory environment
▫ Uses neighborhood caching
 Provides on-chip distributed shared cache
▫ Coherency is maintained at the home tile
 Data is not cached at non-home tiles
▫ Communication over a Tile Dynamic Network
22

23
Tilera Tile64 (taken from)

References
• [1] C. Kim, D. Burger, S.W. Keckler, “An Adaptative, Non-Uniform Cache Structure for Wire-Delay Dominated On-Chip
Caches”, in Proc. 10th Int. Conf. ASPLOS, San Jose, CA, 2002, pp. 1-12
• [2] J. Lira, C. Molina, A. Gonzalez, “Analysis of Non-Uniform Cache Architecture Policies for Chip-Multiprocessors Using
the Parsec Benchmark Suite”, MMCS’09, Mar. 2009, pp. 1-8
• [3] M.R. Marty, M.D. Hill, “Virtual Hierarchies to Support Server Consolidation”, ISCA’07, June 2007, pp. 1-11
• [4] J.A. Brown, R. Kumar, D. Tullsen, “Proximity-Aware Directory-based Coherence for Multi-core Processor Architectures”,
SPAA’07, June 2007, pp. 1-9
• [5] J. Chang, G.S. Sophi, “Cooperative Caching for Chip Multiprocessors”, Computer Architecture, ISCA '06. 33rd
International Symposium on, 2006, pp.264-276
• [6] S. Cho, L. Jin, "Managing Distributed, Shared L2 Caches through OS-Level Page Allocation“, Microarchitecture, 2006.
MICRO-39. 39th Annual IEEE/ACM International Symposium on, Dec. 2006, pp.455-468
• [7] H. Lee, S. Cho, B.R. Childers, "PERFECTORY: A Fault-Tolerant Directory Memory Architecture“, Computers, IEEE
Transactions on , vol.59, no.5, May 2010, p.638-650
• [8] D. Wentzlaff, P. Griffin, H. Hoffmann, L. Bao, B. Edwards, C. Ramey, M. Mattina, C.C. Miao, J.F. Brown, A. Agarwal,
"On-Chip Interconnection Architecture of the Tile Processor“, Micro, IEEE , vol.27, no.5, Sept.-Oct. 2007, pp.15-31
• [9] Linux Devices, “4-way chip gains Linux IDE, dev cards, design wins” [online], Linux Devices, Apr. 2008 [cited Oct. 21
2010] , available from World Wide Web: < http://thing1.linuxdevices.com/news/NS4811855366.html >
24

Directory based cache coherence

More Related Content

What's hot (16)

Viewers also liked (16)

Similar to Directory based cache coherence (20)

More from James Wong (20)

Recently uploaded (20)

Directory based cache coherence

Editor's Notes