SPARC-T1 Cache & Virtual Memory Architecture
by: Kaushik Patra (kpatra@gmail.com)
Agenda
SPARC-T1 overview
SPARC core overview
L1 caches and TLBs
L1 I-cache
IFQ & MIL
I-cache fill path
I-cache miss path
L1 D-cache
Data flow through LSU
Memory Management Unit
Data flow in MMU
TLB structure
TLB entry replacement algorithm
L2-cache overview
L2-cache structure
L2-cache line replacement algorithm
SPARC T1 overview
8 SPARC V9 cores
4 threads per core
16 KB L1 instruction cache (I-Cache) per core
8 KB L1 data cache (D-Cache) per core
SPARC T1 overview
3 MB L2 cache, shared by all cores
4-way banked
12-way associative
132 GB/sec crossbar interconnect for on-chip communication
SPARC T1 overview
4 DDR-II DRAM controllers
144-bit interface per channel
25 GB/sec total peak bandwidth
IEEE 754 compliant floating point unit (FPU), shared by all cores
SPARC T1 overview
External interfaces
J-Bus interface for I/O
2.56 GB/sec peak bandwidth
128-bit multiplexed address and data bus
Serial System Interface (SSI) for boot PROM
SPARC core overview
Instruction Fetch Unit (IFU)
Load Store Unit (LSU)
Memory Management Unit (MMU)
Execution Unit (EXU)
Multiplier Unit
Trap Logic Unit (TLU)
Floating Point Front-end Unit
Stream Processing Unit (SPU)
SPARC core overview
SPARC core data path
Separate instruction cache (I-Cache) and data cache (D-Cache)
SPARC core overview
We’ll limit our discussion
within I-Cache and D-
Cache.
We’ll also include associated
TLB architecture for
supporting memory
virtualization.
L1 Cache and TLBs
The IFU contains the I-Cache and I-TLB.
The LSU contains the D-Cache and D-TLB.
L1 Cache and TLBs
The IFU controls the I-Cache content.
The LSU controls the D-Cache content.
The MMU controls both the I-TLB and the D-TLB.
L1 I-Cache
Physically indexed and tagged: the address is translated by the I-TLB before the cache hit/miss is determined.
4-way set associative.
16 KB data storage with a 32-byte line size.
Single-ported data and tag arrays.
The address breakdown implied by this geometry is sketched below.
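Since the slides don't spell out the bit split, here is a minimal sketch, assuming a straightforward power-of-two breakdown (an illustration, not the OpenSPARC RTL): 16 KB across 4 ways with 32-byte lines gives 128 sets, so 5 offset bits and 7 index bits.

```c
#include <stdint.h>
#include <stdio.h>

#define LINE_BYTES  32u
#define WAYS         4u
#define CACHE_BYTES (16u * 1024u)
#define SETS        (CACHE_BYTES / (WAYS * LINE_BYTES))   /* 128 sets */

int main(void) {
    uint64_t pa = 0x1F2A4;                         /* example physical address */
    uint64_t offset = pa & (LINE_BYTES - 1);       /* bits [4:0]  */
    uint64_t index  = (pa / LINE_BYTES) % SETS;    /* bits [11:5] */
    uint64_t tag    = pa / (LINE_BYTES * SETS);    /* bits above 11 */
    printf("tag=0x%jx set=%ju offset=%ju\n",
           (uintmax_t)tag, (uintmax_t)index, (uintmax_t)offset);
    return 0;
}
```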
L1 I-Cache
I-Cache fill size: 16 bytes per access.
Each cached entry contains:
a 32-bit instruction
1 parity bit
1 pre-decode bit
The valid-bit array has 1 read and 1 write port.
Cache invalidation accesses only the V-bit array.
L1 I-Cache
Cache line replacement is pseudo-random.
Read accesses have higher priority than write accesses to the I-cache.
The maximum wait time for a write access is 25 SPARC core clock cycles.
Any write access waiting more than 25 clock cycles stalls the pipeline in order to let the pending write operation complete (see the sketch below).
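A hedged, cycle-by-cycle sketch of this priority rule; the exact grant/stall conditions are my reading of the slide, not the RTL, and the function name is illustrative.

```c
#include <stdbool.h>

#define MAX_WRITE_WAIT 25    /* SPARC core clock cycles */

static unsigned write_wait;  /* cycles the pending write has waited */

/* Called once per core clock with the pending requests.
 * Returns true when the pipeline must stall so the write can drain. */
bool icache_write_must_drain(bool read_req, bool write_req) {
    if (!write_req) {
        write_wait = 0;
        return false;
    }
    if (read_req && write_wait < MAX_WRITE_WAIT) {
        write_wait++;        /* read wins; write keeps waiting */
        return false;
    }
    write_wait = 0;          /* write is granted this cycle */
    return read_req;         /* stall only if a read had to be displaced */
}
```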
IFQ & MIL
The Instruction Fill Queue (IFQ) feeds into the I-Cache.
The Missed Instruction List (MIL) stores the addresses that missed the I-Cache or I-TLB access.
The MIL feeds into the LSU for further processing.
Instruction fetch
Two instructions are fetched per instruction issue in every SPARC core clock cycle.
This strategy was taken to reduce I-Cache read accesses through opportunistic I-Cache line fill.
Each thread is allowed one outstanding I-cache miss, i.e. a total of 4 I-cache misses per core.
Duplicate I-cache misses do not induce redundant fill requests to the L2-Cache.
I-Cache fill path
The fill packet comes from the L2-cache via the LSU over the CPX interface.
Parity and pre-decode bits are computed before the I-cache is filled.
CPX packets also include:
invalidations
test access point (TAP) reads & writes
error notifications
[IFQ block diagram: fill sources with priority BIST > ASI > CPX; outputs to the V-bit array, the I-Cache, and a bypass to the TIR.]
I-Cache fill path
Invalidation CPX packets are handled through the INV block, which accesses the V-bit array.
The IFQ has a bypass circuit to deliver the current CPX packet directly to the Thread Instruction Register (TIR), avoiding an extra stall in instruction processing.
I-Cache fill path
Each I-cache fill takes 2 CPX packets of 16 bytes each; the I-cache line size is 32 bytes.
The I-cache line is invalidated after the first packet is written.
The line becomes valid again only after the 2nd packet is written (V to I on CPX-1, I to V on CPX-2), as sketched below.
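A small sketch of this two-packet protocol; the struct and function names are illustrative, not from the OpenSPARC sources.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

struct icache_line {
    bool    valid;
    uint8_t data[32];          /* 32-byte line, filled 16 bytes at a time */
};

/* pkt_no is 1 or 2; pkt16 points at the 16-byte CPX payload. */
void icache_fill(struct icache_line *line, int pkt_no, const uint8_t *pkt16) {
    if (pkt_no == 1) {
        line->valid = false;               /* V -> I on the first packet  */
        memcpy(line->data, pkt16, 16);
    } else {
        memcpy(line->data + 16, pkt16, 16);
        line->valid = true;                /* I -> V once the line is whole */
    }
}
```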
I-Cache miss path
The Missed Instruction List (MIL) sends I-Cache miss requests to the L2-Cache through the LSU.
One miss entry per thread, i.e. a total of 4 miss entries per SPARC core.
Each entry in the MIL contains:
the physical address (PA)
the replacement way information
the MIL state information
the cacheability
the error information
I-Cache miss path
The PA keeps track of the I-cache fetch progress from I-cache miss until I-cache fill.
A round-robin algorithm dispatches I-cache fill requests from the different threads.
The MIL uses a linked list of size 4 to keep track of duplicate I-cache misses:
a duplicate request is marked as a child;
a child request is serviced as soon as the parent request gets its response.
I-Cache miss path
[State diagram: S1 -> S2 -> S3 -> S4; a new I-cache miss enters S1; the machine loops while CPX-1 or CPX-2 is not done, and sends notifications on completion.]
The MIL alternates among 4 states:
it starts in S1 upon a new I-cache miss;
it makes the fill request;
it waits until the I-cache fill is done.
Upon completing the CPX-1 fill, it sends a speculative completion notification to the thread scheduler.
I-Cache miss path
An I-Cache fill request may be cancelled upon a trap or exception.
However, the MIL still goes through filling the cache line, but the bypass to the TIR is blocked.
Why? Because any pending child request should be serviced even if the parent request is cancelled.
A child I-cache miss request needs to wait until the parent's I-cache miss request is serviced. The child instruction fetch is rolled back to the fetch stage to let it access the I-cache. This is referred to as 'miss-fill crossover'. A minimal sketch of this parent/child tracking follows.
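A hedged sketch of the parent/child bookkeeping described above, assuming one MIL entry per thread and exact PA matching; the names (mil_insert, mil_fill_done) are illustrative, not from the OpenSPARC RTL.

```c
#include <stdbool.h>
#include <stdint.h>

#define THREADS 4

struct mil_entry {
    bool     valid;
    uint64_t pa;        /* physical address of the miss           */
    int      parent;    /* -1 if this entry is itself the parent  */
};

static struct mil_entry mil[THREADS];

/* Returns true if a fill request must be sent to L2, false if the
 * miss is a duplicate and piggybacks on the parent's fill. */
bool mil_insert(int thread, uint64_t pa) {
    for (int t = 0; t < THREADS; t++) {
        if (t != thread && mil[t].valid && mil[t].pa == pa) {
            mil[thread] = (struct mil_entry){ true, pa, t };
            return false;                /* duplicate: mark as child */
        }
    }
    mil[thread] = (struct mil_entry){ true, pa, -1 };
    return true;                         /* parent: request L2 fill */
}

/* On the fill response, retire the parent and wake its children so
 * their fetches can roll back and re-access the I-cache
 * ("miss-fill crossover"). */
void mil_fill_done(int parent_thread) {
    mil[parent_thread].valid = false;
    for (int t = 0; t < THREADS; t++)
        if (mil[t].valid && mil[t].parent == parent_thread)
            mil[t].valid = false;        /* child refetches from I-cache */
}
```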
L1 D-Cache
4-way set associative.
8 KB data storage with a 16-byte line size.
Single read-write port for the data and tag arrays.
Dual-ported valid-bit (V-bit) array.
Cache invalidation accesses only this V-bit array.
L1 D-Cache
The cache line replacement policy is pseudo-random, using a linear-feedback shift register; load misses allocate a line, but store misses do not.
A cacheable load miss allocates a line and executes the write-through policy before the line is loaded.
Stores do not allocate. Hence a store causes line invalidation if the target address is already in the D-cache, as determined by the L2 cache directory.
A sketch of LFSR-driven way selection follows.
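A minimal sketch of pseudo-random way selection with a linear-feedback shift register; the 4-bit polynomial and seed here are assumptions, as the slides don't state which LFSR the hardware uses.

```c
#include <stdint.h>

static uint8_t lfsr = 0xA;               /* any non-zero seed */

/* 4-bit Fibonacci LFSR, taps at bits 3 and 2 (x^4 + x^3 + 1). */
unsigned pick_victim_way(void) {
    uint8_t bit = ((lfsr >> 3) ^ (lfsr >> 2)) & 1u;
    lfsr = (uint8_t)(((lfsr << 1) | bit) & 0xFu);
    return lfsr & 3u;                    /* choose one of the 4 ways */
}
```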
L1 D-Cache
The L1 D-cache is always inclusive to the L2 cache.
The L1 D-cache is always exclusive to the L1 I-cache.
The L1 D-cache is parity protected.
A parity error causes a D-cache miss, so the data gets corrected by a refill.
In addition to pipeline reads, the L1 D-cache may be accessed by ASI, BIST and RAM-test operations through the test access port (TAP).
Data flow through LSU
One store buffer (STB) per thread.
Load misses are kept in the Load Miss Queue (LMQ).
One outstanding load miss per thread.
A load miss with a duplicate physical address (PA) is not sent to the L2-cache.
Fully associative D-TLB.
All CAM/RAM accesses are single-cycle operations.
Data flow through LSU
The STB consists of a store buffer CAM (SCM) and a store data array (STBDATA).
The SCM has 1 CAM port and 1 RW port.
STBDATA has 1 read and 1 write port.
Each thread is allocated 8 fixed entries in the shared data structure, as sketched below.
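A field-level sketch of the SCM/STBDATA split described above; entry widths and the exact-match-only lookup are simplifying assumptions (real store-to-load bypass logic must also handle partial overlaps).

```c
#include <stdbool.h>
#include <stdint.h>

#define THREADS      4
#define STB_PER_THR  8

struct scm_entry { bool valid; uint64_t pa; };   /* address CAM side */
struct stb_data  { uint8_t bytes[8]; };          /* data array side  */

static struct scm_entry scm[THREADS * STB_PER_THR];
static struct stb_data  stbdata[THREADS * STB_PER_THR];

/* A load CAMs its own thread's 8 entries for read-after-write bypass.
 * Returns the matching entry index, or -1 if there is none. */
int stb_lookup(int thread, uint64_t pa) {
    for (int i = 0; i < STB_PER_THR; i++) {
        int e = thread * STB_PER_THR + i;
        if (scm[e].valid && scm[e].pa == pa)
            return e;                            /* forward stbdata[e] */
    }
    return -1;
}
```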
Data flow through LSU
A load instruction speculates on a D-cache miss to reduce the CCX access latency.
If the speculation fails, the load instruction is taken out of the LMQ.
The arbiter (PCX generator) takes 13 different inputs to generate the packet for the PCX (processor-to-crossbar) interface.
Data flow through LSU
The arbiter inputs consist of:
4 load-type instructions
4 store-type instructions
one I-cache fill
one FPU access
one SPU access
one interrupt
one forward packet
Data flow through LSU
The arbitration priority levels consist of:
I-cache misses
load misses
stores
{FPU operations, SPU operations, interrupts}
A two-level history mechanism implements fair scheduling among the different priority levels; an illustrative sketch follows.
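The slides don't detail the two-level history mechanism, so the sketch below is only a stand-in: one round-robin pointer at the group level and one per group at the requester level, giving fairness both across and within the priority levels. All names and sizes are assumptions.

```c
#include <stdbool.h>

#define GROUPS         4
#define REQS_PER_GROUP 4

static unsigned grp_ptr, req_ptr[GROUPS];   /* the two levels of "history" */

/* ready[g][r] marks requester r of priority group g as having a packet.
 * Returns g * REQS_PER_GROUP + r for the winner, or -1 if idle. */
int pcx_arbitrate(const bool ready[GROUPS][REQS_PER_GROUP]) {
    for (unsigned gi = 0; gi < GROUPS; gi++) {
        unsigned g = (grp_ptr + gi) % GROUPS;
        for (unsigned ri = 0; ri < REQS_PER_GROUP; ri++) {
            unsigned r = (req_ptr[g] + ri) % REQS_PER_GROUP;
            if (ready[g][r]) {
                req_ptr[g] = (r + 1) % REQS_PER_GROUP;  /* update inner history */
                grp_ptr    = (g + 1) % GROUPS;          /* update outer history */
                return (int)(g * REQS_PER_GROUP + r);
            }
        }
    }
    return -1;
}
```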
Data flow through LSU
Incoming packets are stored in the Data Fill Queue (DFQ).
Packets can be:
acknowledgments
data
The targets for the DFQ are:
the Instruction Fetch Unit (IFU)
the Load Store Unit (LSU)
the Trap Logic Unit (TLU)
the Stream Processing Unit (SPU)
Memory Management Unit
Maintains the content of the ITLB and DTLB.
The MMU helps the SPARC-T1 provide support for virtualization:
multiple OSes co-exist on top of the CMT processor;
a hypervisor layer virtualizes the underlying CPU.
A virtual address (VA) from an application is translated into a Real Address (RA) and then into a Physical Address (PA) using the TLBs and MMU.
Data Flow in MMU
The system software maintains the content of the TLBs by sending instructions to the MMU.
The instructions are: read, write, de-map.
TLB entries are shared among threads.
Consistency among TLB entries is maintained through auto-de-map.
The MMU is responsible for:
generating the pointers to the software Translation Storage Buffer (TSB);
maintaining fault status for various traps.
Access to the MMU is through hypervisor-managed ASI (Alternate Space Identifier) operations, e.g. ldxa, stxa.
TLB structure
The TLB consists of a Content Addressable Memory (CAM) and a Random Access Memory (RAM).
The CAM has 1 compare port and 1 read-write port.
The RAM has 1 read-write port.
The TLB supports the mutually exclusive operations: CAM (lookup), read, write, bypass, de-map, soft reset, hard reset.
TLB structure
The RAM contains the following fields:
Physical Address (PA)
attributes
The CAM contains the following fields:
Partition ID (PID)
Real bit (indicates VA-to-PA vs. RA-to-PA translation)
Virtual Address (VA), divided into page-size-based fields (V0 - V3)
Context ID (CTXT)
These fields are sketched as structs below.
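The same CAM/RAM split expressed as C structs; the field widths are illustrative assumptions, not the actual entry format.

```c
#include <stdbool.h>
#include <stdint.h>

struct tlb_cam {                /* matched associatively on lookup */
    uint8_t  pid;               /* partition ID                    */
    bool     real;              /* RA-to-PA vs. VA-to-PA           */
    uint64_t va;                /* V0-V3 page-size-based fields    */
    uint16_t ctxt;              /* context ID                      */
};

struct tlb_ram {                /* read out on a CAM hit           */
    uint64_t pa;                /* physical address                */
    uint32_t attr;              /* page attributes                 */
};
```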
TLB entry replacement algorithm
Each entry has a used bit.
The replacement victim is picked by the least significant unused bit among all 64 entries.
A used bit is set on a write, a CAM hit, or a lock.
A locked page always has its used bit set.
Entry invalidation clears the used bit.
All used bits are cleared, except those of locked entries, when the TLB reaches saturation.
If the TLB is saturated with all entries locked, default location 63 is chosen and an error is reported.
The sketch below illustrates this victim choice.
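A sketch of the victim choice under the rules above, assuming the 64 used and locked bits are packed into 64-bit words; __builtin_ctzll is a GCC/Clang intrinsic (count trailing zeros).

```c
#include <stdint.h>

static uint64_t used, locked;   /* bit i tracks TLB entry i */

/* Pick the replacement victim: the least significant entry whose
 * used bit is clear, after handling saturation. */
int tlb_pick_victim(void) {
    if (~used == 0) {           /* saturation: clear all but locked bits */
        used = locked;
        if (~used == 0)         /* every entry locked:                   */
            return 63;          /* default location; an error is reported */
    }
    return __builtin_ctzll(~used);   /* index of lowest clear used bit */
}
```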
L2-cache overview
3 MB total size, with four symmetrical data banks.
Each bank operates independently.
Each bank is 12-way set associative and 768 KB in size.
The line size is 64 bytes.
The number of sets is 1024.
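A quick consistency check of these numbers, written as C11 static assertions: 1024 sets x 12 ways x 64-byte lines is exactly 768 KB per bank, and four such banks give the stated 3 MB.

```c
#include <assert.h>

static_assert(1024u * 12u * 64u == 768u * 1024u,
              "sets x ways x line size = 768 KB per bank");
static_assert(4u * 768u * 1024u == 3u * 1024u * 1024u,
              "4 banks of 768 KB = 3 MB total");
```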
L2-cache overview
Accepts requests from the processor-to-crossbar (PCX) interface, a part of the CCX.
Puts responses on the crossbar-to-processor (CPX) interface, a part of the CCX.
Responsible for maintaining on-chip coherency across all L1 caches.
Keeps a copy of all L1 tags in a directory structure.
L2-cache overview
128-bit fill interface.
64-bit write interface with the DRAM controller.
Each bank has a dedicated DRAM controller.
8-stage pipelined cache controller.
L2-cache overview
Each 32-bit word is protected by a 7-bit single error correction, double error detection (SEC/DED) ECC code.
J-Bus interface (JBI) access goes through a snoop input queue and an RDMA write buffer.
L2-Cache structure
3 main components:
SCTAG (Secondary Cache TAG): contains the TAG array, VUAD array, L2-TAG directory, and cache controller.
SCBUF (Secondary Cache BUF): contains the write back buffer (WBB), fill buffer (FB), and DMA buffer.
SCDATA (Secondary Cache DATA): contains the L2-cache data.
L2-cache : Arbiter
Manages L2-cache pipeline access among the various request sources.
The arbiter gets input from:
instructions from the CCX and the bypass path of the input queue (IQ);
DMA instructions from the snoop input queue;
instructions recycled from the miss buffer (MB) and fill buffer (FB);
stall signals from the pipeline.
L2-cache : TAG
22-bit tag with 6 bits of SEC ECC protection.
No double-bit error detection.
Single-ported array.
Four states are maintained per tag line in the VUAD array:
Valid (V)
Used (U)
Allocated (A)
Dirty (D)
L2-cache : VUAD
Dual-ported array structure.
The V, A, and D bits are parity protected, since an error there would be fatal.
The Used bit is not protected, since an error there is not fatal.
The VUAD array is accessed when making the line replacement decision.
L2-cache : DATA
Single-ported SRAM structure.
768 KB in size with a 64-byte logical line size.
Allows read accesses of 16 bytes and 64 bytes.
Sixteen byte-enables allow writes at 4-byte granularity.
A line fill updates all 64 bytes at a time.
L2-cache : DATA
The data array is subdivided into 4 columns, with six 32 KB sub-arrays in each column.
A data array access takes 2 cycles to complete.
No column can be accessed in consecutive cycles.
All accesses are pipelined, so accesses have a throughput of one per cycle.
Each 32-bit word is protected by 7 bits of SEC/DED ECC.
L2-cache : Input Queue (IQ)
A 16-entry FIFO queue takes incoming PCX packets.
Each entry is 130 bits wide.
The FIFO is implemented with a dual-ported array.
The IQ asserts a stall when 11 entries are filled, to leave room for packets already in flight.
L2-cache : Output Queue (OQ)
A 16-entry FIFO for packets waiting to get access to the CPX.
Each entry is 146 bits wide.
The FIFO is implemented with a dual-ported array.
When the OQ reaches its high-water mark, the L2-cache stops accepting PCX packets and input from the miss buffer.
Fills can still happen, since they do not cause CPX packets.
L2-cache : Miss Buffer (MB)
The 16-entry miss buffer stores instructions that cannot be processed as a simple cache hit:
true L2 cache misses;
accesses to the same cache line address as a pending miss;
accesses matching an entry in the write back buffer;
instructions needing multiple L2 cache pipeline passes;
unallocated L2-cache misses;
accesses causing a tag ECC error.
The non-tag part holds data: a RAM with 1 read and 1 write port.
The tag part holds addresses: a CAM with 1 read, 1 write, and 1 CAM port.
L2-cache : Fill Buffer (FB)
An 8-entry buffer.
Contains cache-line-wide entries to stage data from DRAM before it fills the cache (a RAM structure).
The address is also stored, to maintain the age ordering needed for data coherence (a CAM structure).
Data arrives from DRAM in four 16-byte blocks, starting with the critical quad-word.
L2-cache : Write Back Buffer (WBB)
An 8-entry buffer, used to store 64-byte dirty lines upon eviction.
The evicted lines are streamed to DRAM opportunistically.
If an instruction has the same cache line address as a WBB entry, the instruction is pushed back into the MB.
The WBB also has RAM and CAM parts to hold data and addresses, respectively.
64-byte read interface with the data array and 64-bit write interface to the DRAM controller.
L2-cache : Directory
2048 entries, with one entry per L1 tag.
It is an L1-tag-to-L2-bank mapping.
Half the entries are for the L1 I-cache and the other half for the L1 D-cache (an I-cache dir and a D-cache dir).
Participates in coherency management.
Also ensures the same line is not part of both the I-Cache and the D-Cache.
L2-cache : Line Replacement Algorithm
Uses pseudo-LRU for line replacement.
The 'U' bit (12 in total, 1 per way) is set upon a cache hit.
All 12 'U' bits get cleared when no way is left unused or unallocated.
The 'A' bit means the line is allocated for a miss; it is analogous to a 'lock' bit.
The 'A' bit gets cleared when the line fill happens.
The 'D' bit indicates the line is valid only inside the cache and needs to be written back.
It is set when data is written to the L2-cache.
It is cleared when the line is invalidated.
The LRU examines all the ways starting from a point chosen in a round-robin fashion.
The first unused, unallocated line is allocated for the miss.
If there is no unused line, the first unallocated line is allocated for the miss.
A sketch of this scan follows.
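A sketch of the pseudo-LRU scan under the stated rules; the round-robin pointer bookkeeping and names are assumptions, not the SCTAG implementation.

```c
#define L2_WAYS 12

static unsigned rr_ptr;                 /* round-robin starting point */

/* u[i], a[i]: Used / Allocated bits for way i of the indexed set.
 * Returns the victim way, or -1 if every way is allocated. */
int pick_l2_victim(const unsigned char *u, const unsigned char *a) {
    int fallback = -1;
    for (unsigned i = 0; i < L2_WAYS; i++) {
        unsigned w = (rr_ptr + i) % L2_WAYS;
        if (!a[w]) {
            if (!u[w]) {                /* first unused, unallocated way */
                rr_ptr = (w + 1) % L2_WAYS;
                return (int)w;
            }
            if (fallback < 0)
                fallback = (int)w;      /* remember first unallocated way */
        }
    }
    rr_ptr = (rr_ptr + 1) % L2_WAYS;
    return fallback;                    /* no unused way: first unallocated */
}
```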
Scope of future study
Cache crossbar (CCX) data transactions.
L2-cache pipelined data flow control.
Cache memory consistency and instruction ordering.
Reference
http://opensparc-t1.sunsource.net/specs/OpenSPARCT1_Micro_Arch.pdf