Phytium 64 core cpu preview

Mars: A 64-core ARMv8 Processor
Charles Zhang
Phytium Technology Co., Ltd

Statements
The following slides introduce the general features of one of our
products and do not constitute any commitment about it. They are for
information purposes only and may not be incorporated into any
contract. Purchasing decisions should not be based on them. The
development, release, and timing of any features or functionality
described here remain at the sole discretion of Phytium.

A Brief Introduction of Phytium
• Chinese corporation, founded in 2012
  • Guangzhou
  • Tianjin
• Vision
  • Leading-edge CPU and ASIC provider in China
• Market focus: chips for
  • Internet & cloud computing infrastructure
  • Traditional workload mainframe servers

China is a Fast-growing Server Market

Worldwide (WW) server revenue
Company  1Q15 Revenue    1Q15 Share (%)  1Q14 Revenue    1Q14 Share (%)  1Q15-1Q14 Growth (%)
HP        3,191,694,948   23.8            2,890,992,229   25.5            10.4
Dell      2,296,473,026   17.1            2,006,639,006   17.7            14.4
IBM       1,887,939,141   14.1            2,244,631,789   19.8           -15.9
Lenovo      970,254,659    7.2              127,973,470    1.1           658.2
Cisco       890,179,930    6.6              616,620,000    5.4            44.4
Others    4,157,871,704   31.0            3,469,383,444   30.6            19.8
Total    13,394,413,409  100.0           11,356,239,939  100.0            17.9

China server revenue
Company  1Q15 Revenue    1Q15 Share (%)  1Q14 Revenue    1Q14 Share (%)  1Q15-1Q14 Growth (%)
Inspur      332,613,480   21                227,328,256   17               46
Dell        322,063,140   20                246,281,271   19               31
Lenovo      295,914,571   18                 80,084,826    6              270
HP          217,487,450   14                167,775,923   13               30
Huawei      197,490,419   12                189,963,266   14                4
Sugon       140,377,091    9                 70,705,366    5               99
Others      104,566,737    6                329,549,621   25              -68
Total     1,610,512,888  100.0            1,311,688,529  100.0             23

Source: Gartner (May 2015)

What is Mars for?
Mars
• High performance
• High volume of memory
• High-bandwidth memory access
• High-bandwidth I/O access
• Large-scale cache coherency maintained
Earth
• Moderate performance
• High power efficiency
• High-density computing
• High-bandwidth memory access
• Low cost

Mars Overview
• Architecture features
  • 64 Xiaomi cores, ARMv8 compatible
  • Hardware-maintained global cache coherency
  • Panel-based data affinity architecture
  • Mesh-topology on-chip network
  • 32MB L2 cache
  • 8 Cache & Memory Chips (CMC)
  • 128MB L3 cache
  • 16 DDR3-1600 channels
  • Two 16-lane PCIe 3.0 interfaces
  • ECC and parity protection on all caches, tags and TLBs
• Physical
  • ~180M instances
  • 2.0GHz @ 28nm
  • 120W
• Performance
  • Peak: 512 GFLOPS
  • Memory bandwidth: 204GB/s
  • I/O bandwidth: 32GB/s
[Figure: Mars chip layout. Eight panels (panel0 to panel7) in two rows of four, two PCIe interfaces, and eight CMCs, each attached to two DDR3 channels.]
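The headline figures on this slide can be cross-checked against the listed features. The sketch below is a back-of-the-envelope check, not a vendor formula. It assumes 4 double-precision FLOPs per core per cycle (two FMA units, each one fused multiply-add per cycle) and a standard 8-byte DDR3 channel; neither is stated on the slide, but both reproduce its numbers. The per-panel L2 and per-CMC L3 sizes come from the later slides.

```python
# Back-of-the-envelope check of the headline figures on the overview slide.
CORES = 64
FREQ_GHZ = 2.0
# Assumption (not stated on the slide): 2 FMA units per core, each retiring
# one double-precision fused multiply-add (2 FLOPs) per cycle.
FLOPS_PER_CYCLE = 2 * 2

DDR_CHANNELS = 16
DDR_MT_PER_S = 1600        # DDR3-1600: 1600 mega-transfers per second
BYTES_PER_TRANSFER = 8     # assumption: standard 64-bit DDR3 channel

PANELS, L2_PER_PANEL_MB = 8, 4
CMCS, L3_PER_CMC_MB = 8, 16

peak_gflops = CORES * FREQ_GHZ * FLOPS_PER_CYCLE
mem_bw_gbs = DDR_CHANNELS * DDR_MT_PER_S * BYTES_PER_TRANSFER / 1000
l2_total_mb = PANELS * L2_PER_PANEL_MB
l3_total_mb = CMCS * L3_PER_CMC_MB

print(f"peak: {peak_gflops:.0f} GFLOPS")
print(f"memory bandwidth: {mem_bw_gbs:.1f} GB/s")
print(f"L2: {l2_total_mb} MB, L3: {l3_total_mb} MB")
```

Under these assumptions the script gives 512 GFLOPS, 204.8 GB/s, 32 MB and 128 MB, matching the slide.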

Panel Architecture
• Eight Xiaomi cores
  • ARMv8-compatible design under an architecture license
  • Both AArch32 and AArch64 modes
  • EL0 to EL3 supported
  • ASIMD-128 supported
  • Advanced hybrid branch prediction
  • 4-fetch/4-decode/4-dispatch out-of-order superscalar pipeline
• Cache hierarchy
  • Separate L1 I-cache and L1 D-cache
  • Shared L2 cache, 4MB in total
• Directory-based cache coherency maintenance
  • Directory Control Unit (DCU)
• Routing cell
[Figure: panel floorplan, 10600μm x 6000μm. Two groups of four Xiaomi cores, each group with its own L2 cache block, plus two DCUs and one routing cell.]

Xiaomi Core
[Figure: Xiaomi core block diagram. Front end: ITLB, I-cache, BTB, direction predictor (DirPre), indirect predictor (IndPre), speculative return stack (SRS), loop detect, prefetch, instruction buffer. Four decoders, rename logic with architectural and physical register files, dispatch logic, reorder buffer. Issue queues: SCInt/VT, MCI/VT, FP/VT, LD/ST. Execution units: two ALU/BR, two MUL/DIV, two FMAC/FDIV. Memory side: DTLB, D-cache, STB & prefetch, L2 cache. Also debug/trace/interrupt/timer logic.]

Xiaomi Core Front End
• 32KB L1 instruction cache
  • Next-line prefetch
• Hybrid branch predictor
  • 2048-entry BTB
  • Direction prediction with a TAGE predictor
  • 512-entry indirect predictor
  • 48-entry speculative return stack (SRS)
• Four instructions fetched per cycle
• 32-entry instruction buffer
• Loop detection and instruction cache bypass
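To illustrate what a return stack in the front end does, here is a minimal sketch of a return-address predictor: a call pushes its return address, and a return pops the predicted target. Only the 48-entry size comes from the slide. The class name `ReturnStack`, the circular overwrite on overflow, and the lack of any speculation recovery are assumptions made for illustration, not a description of the Xiaomi design.

```python
class ReturnStack:
    """Toy return-address predictor: push on call, pop on return."""

    def __init__(self, entries=48):      # 48 entries, as on the slide
        self.entries = entries
        self.slots = [None] * entries
        self.top = 0                     # index of the next free slot
        self.depth = 0                   # number of valid entries

    def on_call(self, return_addr):
        # Assumption: when full, the oldest entry is silently overwritten.
        self.slots[self.top] = return_addr
        self.top = (self.top + 1) % self.entries
        self.depth = min(self.depth + 1, self.entries)

    def on_return(self):
        # Returns the predicted target, or None if the stack is empty.
        if self.depth == 0:
            return None
        self.top = (self.top - 1) % self.entries
        self.depth -= 1
        return self.slots[self.top]


rs = ReturnStack()
rs.on_call(0x1000)
rs.on_call(0x2000)
print(hex(rs.on_return()))   # 0x2000
print(hex(rs.on_return()))   # 0x1000
print(rs.on_return())        # None
```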

Xiaomi Core Decode, Rename & Dispatch
• Up to four instructions decoded per cycle
• 192 physical registers
• Up to four instructions renamed per cycle
• Up to four instructions dispatched per cycle
• The reorder buffer holds 160 instructions; about 210+ instructions can be in flight in the whole pipeline.
• In-order dispatch, out-of-order execution, in-order retirement.
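The rename step above can be sketched as a map table plus a free list. This is a generic textbook scheme, not the actual Xiaomi implementation. The 192-entry physical register file is from the slide; the single pool of 32 architectural registers and the names `Renamer`, `rename` and `retire` are assumptions for illustration (a real design would likely split integer and FP/SIMD registers).

```python
from collections import deque

NUM_ARCH_REGS = 32    # assumption: AArch64 general-purpose registers only
NUM_PHYS_REGS = 192   # from the slide


class Renamer:
    def __init__(self):
        # Initially architectural register i lives in physical register i.
        self.map = list(range(NUM_ARCH_REGS))
        self.free = deque(range(NUM_ARCH_REGS, NUM_PHYS_REGS))

    def rename(self, dst, srcs):
        """Rename one instruction; returns (new_dst, renamed_srcs, old_dst)."""
        if not self.free:
            raise RuntimeError("no free physical register: rename stalls")
        phys_srcs = [self.map[s] for s in srcs]   # read sources before updating dst
        old = self.map[dst]
        new = self.free.popleft()
        self.map[dst] = new
        return new, phys_srcs, old

    def retire(self, old_phys):
        # The previous mapping can be recycled once the instruction retires.
        self.free.append(old_phys)


r = Renamer()
# x1 = x2 + x3 ; x1 = x1 + x4  (the second write gets a fresh physical register)
first = r.rename(1, [2, 3])
second = r.rename(1, [1, 4])
print(first)    # (32, [2, 3], 1)
print(second)   # (33, [32, 4], 32)
```

Because the second instruction reads the renamed copy of x1 (physical register 32) while writing a new one (33), the two writes to x1 no longer conflict, which is what lets execution proceed out of order.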

Xiaomi Core Function Units
• Two separate 16-entry integer and ASIMD queues, shared by four integer units
  • Two integer units process single-cycle integer instructions and integer SIMD instructions; one can also process branch instructions.
  • Two integer units process multi-cycle integer instructions and integer SIMD instructions.
• One shared 16-entry floating-point and ASIMD queue
  • Two FP/ASIMD units, which can be combined into one lockstep ASIMD unit.
  • FMA supported in both units.
  • FMUL: 3 cycles, FADD: 3 cycles, FMA: 6 cycles
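The FMA latency and unit count above determine how much independent work a loop needs to keep both units busy. The sketch below is a simple throughput model, assuming each unit is fully pipelined and can start one FMA per cycle (the slide does not say so); `fma_rate` is a hypothetical helper name.

```python
FMA_LATENCY_CYCLES = 6   # from the slide
FMA_UNITS = 2            # from the slide
FREQ_GHZ = 2.0

# Assumption: each unit is fully pipelined and can start one FMA per cycle.
independent_fmas_needed = FMA_LATENCY_CYCLES * FMA_UNITS


def fma_rate(independent_chains):
    """Sustained FMAs per cycle for a loop with the given number of
    independent dependency chains (each chain issues one FMA per latency)."""
    return min(FMA_UNITS, independent_chains / FMA_LATENCY_CYCLES)


print("independent FMAs needed in flight:", independent_fmas_needed)
for chains in (1, 4, 12):
    gflops = fma_rate(chains) * 2 * FREQ_GHZ   # 2 FLOPs per FMA
    print(f"{chains:2d} chains: {fma_rate(chains):.2f} FMA/cycle, {gflops:.2f} GFLOPS per core")
```

Under this model a single dependent chain of FMAs reaches only one sixth of one unit's throughput, while 12 or more independent FMAs in flight are needed to saturate both units.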

Xiaomi Core Function Units
• One 24-entry load/store queue
• 32KB L1 data cache
• 6 outstanding loads
• 4-cycle load-to-use latency
• Next-line and stride-detected data prefetch
• Streaming patterns detected automatically
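Stride-detected prefetching, as listed above, can be illustrated with a small per-instruction stride table. This is a generic sketch of the technique, not the Xiaomi prefetcher: the table organization, the "same stride seen twice" confirmation rule, and the name `StridePrefetcher` are all assumptions.

```python
class StridePrefetcher:
    """Toy per-PC stride detector: prefetch after the same stride is seen twice."""

    def __init__(self):
        self.table = {}   # load PC -> (last address, last stride)

    def access(self, pc, addr):
        """Record a load; return an address to prefetch, or None."""
        if pc not in self.table:
            self.table[pc] = (addr, 0)
            return None
        last_addr, last_stride = self.table[pc]
        stride = addr - last_addr
        confirmed = stride != 0 and stride == last_stride
        self.table[pc] = (addr, stride)
        return addr + stride if confirmed else None


pf = StridePrefetcher()
for a in (0x100, 0x140, 0x180, 0x1C0):
    hint = pf.access(pc=0x4000, addr=a)
    print(hex(a), "->", hex(hint) if hint is not None else None)
```

For the 64-byte stride in the example, the first two accesses only train the table; from the third access on, the next address in the sequence is returned as a prefetch hint.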

Cache coherence protocol
• Hawk cache coherence protocol
  • Distributed directory-based global cache coherency
  • MOESI-like packet-based coherence protocol
• A home node DCU (directory control unit) supports
  • Affinitive pairing of L2Cs and CMCs
  • “Infinite” capacity for non-conflicting reads and writes
  • Optimized transaction flow for exclusive atomic accesses
  • Reduced latency by cache line forwarding
[Figure: Hawk protocol overview. Panel0 to PanelN each contain Core0 to Core7 with local monitors and coherence logic; the L2Cs, L3C & memory, and I/O connect through the Hawk interconnects, which include a global exclusive monitor.]
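Since the slide describes Hawk only as "MOESI-like", the sketch below shows the textbook MOESI state machine for one cache line in one cache, as a point of reference. It is not the Hawk protocol: Hawk's actual states, messages and directory behavior are not given in the slides, and the function name and event names are assumptions.

```python
def next_state(state, event, others_have_copy=False):
    """Generic textbook MOESI transition for one cache line in one cache.

    States: M(odified), O(wned), E(xclusive), S(hared), I(nvalid).
    """
    if event == "local_write":
        return "M"                          # all other copies are invalidated
    if event == "remote_write":
        return "I"
    if event == "local_read":
        if state == "I":
            return "S" if others_have_copy else "E"
        return state
    if event == "remote_read":
        # A dirty owner keeps the line and forwards the data (M -> O);
        # a clean exclusive copy is downgraded to shared (E -> S).
        return {"M": "O", "O": "O", "E": "S", "S": "S", "I": "I"}[state]
    raise ValueError(f"unknown event: {event}")


state = "I"
for ev in ("local_read", "local_write", "remote_read", "remote_write"):
    state = next_state(state, ev)
    print(f"{ev:13s} -> {state}")
```

The M to O transition is the one that relates to cache line forwarding: the owning cache supplies the dirty line directly to the requester instead of first writing it back to memory.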

Network on Chip
• 2D concentrated mesh architecture
• Cell-based switch with 6 bidirectional ports
• Uniform packet format for each port; a port can be configured to connect to a device or to a cascaded cell
• 4 physical channels for CC and 1 channel for debug, DOR Y-X routing
• Low latency: 3 cycles per hop
• High bandwidth: 384GB/s per cell

Dest.  Lat. (cycles)
0      3
1      6
2      9
3      12
4      15
5      12
6      9
7      6
Avg.   9

[Figure: a routing cell with ports 0 to 5, connected to two L2 caches, an MIU/IOU, and an MIU or cascaded cell; the mesh of Cell0 to Cell7; and the destination numbering 0 to 7 relative to a master.]
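The latency table can be reproduced with a small model of dimension-order (Y first, then X) routing. The sketch assumes the eight cells form a 2 x 4 mesh, that the source sits in a corner, and that every cell on the path (including the source cell) costs one 3-cycle hop. These assumptions are not stated on the slide; they are chosen because they yield the same set of latencies and the same average. The mapping of the slide's destination numbers 0 to 7 onto mesh coordinates is not known, so only the sorted values are compared.

```python
ROWS, COLS = 2, 4          # assumption: eight cells in a 2 x 4 mesh
CYCLES_PER_HOP = 3         # from the slide


def yx_route(src, dst):
    """Dimension-order route: move along Y first, then along X.
    Cells are (row, col) tuples; returns the list of cells visited."""
    (r, c), (dr, dc) = src, dst
    path = [(r, c)]
    while r != dr:
        r += 1 if dr > r else -1
        path.append((r, c))
    while c != dc:
        c += 1 if dc > c else -1
        path.append((r, c))
    return path


def latency(src, dst):
    # Assumption: every cell on the path, including the source cell,
    # costs one 3-cycle hop, so a local access takes 3 cycles.
    return len(yx_route(src, dst)) * CYCLES_PER_HOP


cells = [(r, c) for r in range(ROWS) for c in range(COLS)]
lats = [latency((0, 0), d) for d in cells]
print(sorted(lats))              # [3, 6, 6, 9, 9, 12, 12, 15]
print(sum(lats) / len(lats))     # 9.0
```

Because the route is fully determined by source and destination, dimension-order routing needs no routing tables in the cells.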

Cache & Memory Chip
• L3 cache
  • 16MB data array
  • 2MB data ECC
• DDR bandwidth
  • 2 x DDR3 channels at 800MHz (DDR3-1600): 25.6GB/s
• Proprietary interface between Mars and CMC
  • Parallel interface
  • Needs more pins, but has lower latency than SerDes
  • Separate write/cmd and read data channels
  • Effective read channel bandwidth: 12.8GB/s
  • Effective write/cmd channel bandwidth: 6.4GB/s
[Figure: CMC block diagram. Mars interface, four L3 banks (Bank0 to Bank3), and two memory controllers (Mem Ctrl0, Mem Ctrl1), each driving one DDR channel.]
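The per-CMC numbers can be checked with a few lines of arithmetic. The sketch assumes each DDR channel is a standard 64-bit DDR3-1600 channel, which is what 25.6GB/s for two channels implies, and that the chip has 8 CMCs as stated on the overview slide. The variable names are illustrative only.

```python
CMCS = 8                   # from the overview slide
CHANNELS_PER_CMC = 2
# Assumption: each channel is a 64-bit DDR3-1600 channel,
# i.e. 1600 MT/s * 8 bytes = 12.8 GB/s.
CHANNEL_GBS = 1600 * 8 / 1000

per_cmc_ddr_gbs = CHANNELS_PER_CMC * CHANNEL_GBS

L3_DATA_MB, L3_ECC_MB = 16, 2
ecc_bits_per_64_data_bits = 64 * L3_ECC_MB / L3_DATA_MB

READ_GBS, WRITE_CMD_GBS = 12.8, 6.4      # effective link bandwidth per CMC
total_read_gbs = CMCS * READ_GBS
total_write_cmd_gbs = CMCS * WRITE_CMD_GBS

print(f"DDR bandwidth per CMC: {per_cmc_ddr_gbs:.1f} GB/s")
print(f"ECC check bits per 64 data bits: {ecc_bits_per_64_data_bits:.0f}")
print(f"aggregate read link: {total_read_gbs:.1f} GB/s, "
      f"write/cmd link: {total_write_cmd_gbs:.1f} GB/s")
```

The 2MB of ECC for 16MB of data works out to 8 check bits per 64 data bits, which is the overhead of a common (72,64) single-error-correct, double-error-detect code; the slides do not name the code actually used.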

Latency of affinitive access

Memory access              Latency (ns)
Local L1 cache hit         ~2
Local L2 cache hit         ~8
Affinitive L2 cache hit    ~20
Affinitive L3 cache hit    ~36
Affinitive DDR access      ~70

Clock frequencies
• Panel: 2.0GHz
• NoC: 2.0GHz
• CMC: 1.5GHz
* PCB latency not considered
[Figure: two panels, each with eight Xiaomi cores, two L2 caches, two DCUs and a routing cell, connected to one CMC.]
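The latencies above can be converted to core cycles and combined into an average access time. The sketch below does both. The latency values and the 2.0GHz panel clock are from the slide; the access mix is purely hypothetical and only illustrates the calculation, and `amat_ns` is an illustrative helper name.

```python
PANEL_GHZ = 2.0
latency_ns = {   # approximate values from the slide
    "local L1 hit": 2,
    "local L2 hit": 8,
    "affinitive L2 hit": 20,
    "affinitive L3 hit": 36,
    "affinitive DDR": 70,
}
cycles = {name: ns * PANEL_GHZ for name, ns in latency_ns.items()}


def amat_ns(levels):
    """Average access time for a list of (latency_ns, fraction) pairs,
    where fraction is the share of all accesses served at that latency."""
    if abs(sum(f for _, f in levels) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return sum(lat * f for lat, f in levels)


for name, c in cycles.items():
    print(f"{name}: {c:.0f} core cycles")

# Hypothetical access mix, for illustration only (not measured data).
mix = [(2, 0.90), (8, 0.06), (36, 0.03), (70, 0.01)]
print(f"average access time: {amat_ns(mix):.2f} ns")
```

The ~2ns local L1 hit corresponds to 4 cycles at 2.0GHz, which agrees with the 4-cycle load-to-use latency given for the core.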

Memory Tune (mTune)
• Rich data collection
  • Number of cache hits/misses for L1/L2/L3
  • Workload of cache pipelines
  • Busyness of the NoC
  • ECC corrections of the memory system
• Support for multiple metrics
  • Average miss rate/hit rate
  • Minimum/maximum/average access latency
  • Bandwidth analysis
  • Concurrent Average Memory Access Time (CAMAT)
• Support for MPI/OpenMP applications
  • Thread behavior analysis
  • Inter-process behavior analysis
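The simpler metrics in the list above are straightforward to derive from raw counters. The sketch below shows the arithmetic only; the slides do not describe mTune's interface, so the function names, inputs and sample values are all hypothetical.

```python
from statistics import mean


def cache_metrics(hits, misses):
    """Hit and miss rate from raw event counts."""
    total = hits + misses
    return {"hit_rate": hits / total, "miss_rate": misses / total}


def latency_metrics(samples_ns):
    """Minimum, maximum and average of sampled access latencies."""
    return {"min": min(samples_ns), "max": max(samples_ns), "avg": mean(samples_ns)}


def bandwidth_gbs(bytes_moved, seconds):
    """Average bandwidth in GB/s over a measurement interval."""
    return bytes_moved / seconds / 1e9


# Hypothetical sample values, for illustration only.
print(cache_metrics(hits=900, misses=100))
print(latency_metrics([2, 8, 2, 70, 2]))
print(bandwidth_gbs(bytes_moved=64e9, seconds=2))
```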

Scalable Debug System
• ARMv8 CoreSight-compatible debug system
• Scalable dedicated debug network across 64 cores
• Distributed debug components
• Configurable event broadcast scope
• Timestamp broadcast with a single signal to simplify implementation
[Figure: debug network. Each of Cell0 to Cell7 (panel0 to panel7) has two DNB_C blocks and one DNB_M block; a DNB_I block, hdbg, JTAG1 and JTAG2 attach to the network.]

Physical Design
• 28nm process
• 0.9V core/1.8V I/O
• 10 metal layers
• ~180M instances
• 2.0GHz
• 120W
• 640mm² die size (25.38mm x 25.2mm)
• FCBGA
• ~3000 pins

Performance Evaluation
• SPEC CPU2006

Benchmark                          INT    FP
SPEC_CPU2006_base (single copy)    19.2   17.8
SPEC_CPU2006_rate (64 copies)      672    585
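A rough scaling figure can be derived from the two sets of scores. The sketch below divides the 64-copy rate score by the single-copy base score. Treat the result as an indication only: base and rate are different SPEC metrics, and using their ratio as a speedup is an assumption made here, not a claim from the slides.

```python
base = {"INT": 19.2, "FP": 17.8}    # single copy, from the slide
rate = {"INT": 672, "FP": 585}      # 64 copies, from the slide
COPIES = 64

scaling = {}
for suite in ("INT", "FP"):
    speedup = rate[suite] / base[suite]
    scaling[suite] = speedup / COPIES
    print(f"{suite}: {speedup:.1f}x over one copy, "
          f"{scaling[suite]:.0%} of linear scaling")
```

With the slide's numbers this gives about 35x for INT and about 33x for FP, that is, a little over half of linear scaling across 64 copies.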

Performance Evaluation
• STREAM
[Figure: STREAM triad bandwidth (GB/s, axis from 0 to 90) versus number of cores (1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64).]

Next Generation Scale-up CPU
• More powerful core
  • Aggressive branch predictor
  • Multithreading
  • More aggressive ILP exploitation
  • Wider SIMD
• More RAS features
• Higher-bandwidth memory access
• Higher power efficiency

Mars: A 64-core ARMv8 Processor
Charles Zhang
Charles.zhang@phytium.com.cn
