Mars: A 64-core ARMv8 Processor
Charles Zhang
Phytium Technology Co., Ltd
Statements
The following slides introduce the general features of one of our
products and are not a commitment about it. They are for information
purposes only and may not be incorporated into any contract.
Purchasing decisions should not be based on them. The development,
release, and timing of any features or functionality described here
remain at the sole discretion of Phytium.
A Brief Introduction of Phytium
• Chinese corporation, founded in 2012
  • Guangzhou
  • Tianjin
• Vision
  • Leading edge CPU and ASIC provider in China
• Market focus: chips for
  • Internet & Cloud Computing infrastructure
  • Traditional workload mainframe servers
China is a Fast-growing Server Market
Worldwide (WW)

| Company | 1Q15 Revenue | 1Q15 Market Share (%) | 1Q14 Revenue | 1Q14 Market Share (%) | 1Q15-1Q14 Growth (%) |
|---|---|---|---|---|---|
| HP | 3,191,694,948 | 23.8 | 2,890,992,229 | 25.5 | 10.4 |
| Dell | 2,296,473,026 | 17.1 | 2,006,639,006 | 17.7 | 14.4 |
| IBM | 1,887,939,141 | 14.1 | 2,244,631,789 | 19.8 | -15.9 |
| Lenovo | 970,254,659 | 7.2 | 127,973,470 | 1.1 | 658.2 |
| Cisco | 890,179,930 | 6.6 | 616,620,000 | 5.4 | 44.4 |
| Others | 4,157,871,704 | 31.0 | 3,469,383,444 | 30.6 | 19.8 |
| Total | 13,394,413,409 | 100.0 | 11,356,239,939 | 100.0 | 17.9 |

China

| Company | 1Q15 Revenue | 1Q15 Market Share (%) | 1Q14 Revenue | 1Q14 Market Share (%) | 1Q15-1Q14 Growth (%) |
|---|---|---|---|---|---|
| Inspur | 332,613,480 | 21 | 227,328,256 | 17 | 46 |
| Dell | 322,063,140 | 20 | 246,281,271 | 19 | 31 |
| Lenovo | 295,914,571 | 18 | 80,084,826 | 6 | 270 |
| HP | 217,487,450 | 14 | 167,775,923 | 13 | 30 |
| Huawei | 197,490,419 | 12 | 189,963,266 | 14 | 4 |
| Sugon | 140,377,091 | 9 | 70,705,366 | 5 | 99 |
| Others | 104,566,737 | 6 | 329,549,621 | 25 | -68 |
| Total | 1,610,512,888 | 100.0 | 1,311,688,529 | 100.0 | 23 |

Source: Gartner (May 2015)
What is Mars for?
Mars
• High performance
• High volume of memory
• High bandwidth memory access
• High bandwidth I/O access
• Large scale cache coherency maintained

Earth
• Moderate performance
• High power efficiency
• High density computing
• High bandwidth memory access
• Low cost
Mars Overview
• Architecture Features
  • 64 Xiaomi cores, ARMv8 compatible
  • Hardware-maintained global cache coherency
  • Panel-based data affinity architecture
  • Mesh topology on-chip network
  • 32MB L2 cache
  • 8 Cache & Memory Chips (CMC)
  • 128MB L3 cache
  • 16 DDR3-1600 channels
  • Two 16-lane PCIe 3.0 interfaces
  • ECC and parity protection on all caches, tags and TLBs

Physical
• ~180M instances
• 2.0GHz @ 28nm
• 120W

Performance
• Peak: 512GFLOPS
• Mem BW: 204GB/s
• I/O BW: 32GB/s

[Figure: chip floorplan with eight panels (panel0 to panel7), two PCIe interfaces, and eight CMCs, each CMC attached to two DDR3 channels.]
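The headline figures on this slide can be cross-checked against each other. The sketch below is plain arithmetic on numbers quoted in the deck (4MB of L2 per panel and 16MB of L3 per CMC come from the later panel and CMC slides). The 8-byte data width per DDR3 channel is an assumption based on standard 64-bit DDR3 channels, and the flops-per-cycle value is derived from the quoted peak rather than stated by the author.

```python
# Figures quoted in the deck
CORES = 64
CLOCK_GHZ = 2.0
PEAK_GFLOPS = 512
PANELS = 8
L2_PER_PANEL_MB = 4
CMCS = 8
L3_PER_CMC_MB = 16
DDR_CHANNELS = 16
DDR_MT_PER_S = 1600  # DDR3-1600: 1600 mega-transfers per second

# Assumption (not on the slide): each DDR3 channel is 64 bits (8 bytes) wide
BYTES_PER_TRANSFER = 8

# Floating-point operations each core must retire per cycle to reach the peak
flops_per_cycle_per_core = PEAK_GFLOPS / (CORES * CLOCK_GHZ)

total_l2_mb = PANELS * L2_PER_PANEL_MB
total_l3_mb = CMCS * L3_PER_CMC_MB

# MT/s * bytes = MB/s; divide by 1000 for GB/s
mem_bw_gb_s = DDR_CHANNELS * DDR_MT_PER_S * BYTES_PER_TRANSFER / 1000

print(flops_per_cycle_per_core, total_l2_mb, total_l3_mb, mem_bw_gb_s)
```

Four operations per cycle per core is consistent with two FMA units each completing one double-precision fused multiply-add per cycle, if an FMA is counted as two operations.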
Panel Architecture
• Eight Xiaomi Cores
  • Compatible design with ARMv8 architecture license
  • Both AArch32 and AArch64 modes
  • EL0~EL3 supported
  • ASIMD-128 supported
  • Advanced hybrid branch prediction
  • 4 fetch/4 decode/4 dispatch out-of-order superscalar pipeline
• Cache Hierarchy
  • Separate L1 ICache and L1 DCache
  • Shared L2 cache, 4MB in total
• Directory-based cache coherency maintenance
  • Directory Control Unit (DCU)
• Routing Cell

[Figure: panel floorplan, 10600μm x 6000μm: two groups of four Xiaomi cores, each group with its own L2 cache, plus two DCUs and one routing cell.]
Xiaomi Core
[Figure: core block diagram. Front end: ITLB, I Cache, BTB, DirPre, IndPre, SRS, Loop Detect, Instruction Buffer, Prefetch. Four decoders, Rename Logic with architectural and physical register files, Dispatch Logic, Reorder Buffer. Queues: SCInt/VT, MCI/VT, FP/VT, LD/ST. Execution units: two ALU/BR, two MUL/DIV, two FMAC/FDIV. Memory side: DTLB, D Cache, STB & Prefetch, L2 Cache. Also Debug/Trace/Interrupt/Timer.]
Xiaomi Core Front End
• 32KB L1 instruction cache
  • Next line prefetch
• Hybrid Branch Predictor
  • 2048-entry BTB
  • Direction prediction with TAGE predictor
  • 512-entry indirect predictor
  • 48-entry Speculative Return Stack
• Four instructions fetched per cycle
• 32-entry instruction buffer
• Loop detect and instruction cache bypass
Xiaomi Core Decode, Rename & Dispatch
• Up to four instructions decoded per cycle
• 192 physical registers
• Up to four instructions renamed per cycle
• Up to four instructions dispatched per cycle
• Reorder buffer can hold 160 instructions, and 210+ instructions can be in flight in the whole pipeline.
• Dispatch in-order, execution out-of-order, retirement in-order.
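To get a feel for the size of the out-of-order window, the dispatch width and reorder buffer capacity can be turned into a time span. This is a back-of-the-envelope sketch, assuming the core dispatches at its peak rate of four instructions every cycle and runs at the 2.0GHz clock quoted elsewhere in the deck; real code rarely sustains peak dispatch.

```python
DISPATCH_WIDTH = 4   # instructions per cycle (from the slide)
ROB_ENTRIES = 160    # reorder buffer capacity (from the slide)
CLOCK_GHZ = 2.0      # core clock quoted elsewhere in the deck

# Cycles of peak-rate dispatch before the reorder buffer is full,
# i.e. how long the oldest instruction can stall before dispatch stops
cycles_to_fill_rob = ROB_ENTRIES / DISPATCH_WIDTH

# The same window expressed in nanoseconds (cycles / GHz = ns)
window_ns = cycles_to_fill_rob / CLOCK_GHZ

print(cycles_to_fill_rob, window_ns)
```

Under these assumptions the window is about 40 cycles, or 20ns, which is in the same range as the affinitive L2 hit latency listed on the latency slide.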
Xiaomi Core Function Units
• Two separate 16-entry integer and ASIMD queues shared by four integer units
• Two integer units can process single-cycle integer instructions and integer SIMD instructions; one of them can also process branch instructions.
• Two integer units can process multi-cycle integer instructions and integer SIMD instructions.
• One shared 16-entry floating point and ASIMD queue
• Two FP/ASIMD units, which can be combined into one lockstep ASIMD unit.
• FMA supported in both units.
• FMUL: 3 cycles, FADD: 3 cycles, FMA: 6 cycles
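The latencies above determine how much independent work a loop must expose to keep both FP units busy. The sketch below applies the usual latency-times-throughput rule. It assumes both units are fully pipelined and can each accept a new operation every cycle, which the slide does not state.

```python
FP_UNITS = 2  # FP/ASIMD units (from the slide)
LATENCY_CYCLES = {"FMUL": 3, "FADD": 3, "FMA": 6}  # from the slide

def independent_ops_needed(units, latency_cycles):
    """Operations that must be in flight to issue one per unit per cycle.

    Assumes fully pipelined units (one new operation per unit per cycle).
    """
    return units * latency_cycles

in_flight = {op: independent_ops_needed(FP_UNITS, lat)
             for op, lat in LATENCY_CYCLES.items()}

print(in_flight)
```

Under that assumption a dependent chain of FMAs would need roughly twelve independent accumulators to saturate both units, which fits within the 16-entry floating point queue.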
Xiaomi Core Function Units
• One 24-entry load/store queue
• 32KB L1 data cache
• 6 outstanding loads
• 4 cycles latency from load to use
• Next-line and stride-detected data prefetch
• Streaming patterns detected automatically
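The limit of six outstanding loads puts a ceiling on how much data one core can pull in without help from the prefetchers. The sketch below estimates that ceiling as concurrency times transfer size divided by latency. The 64-byte cache line is an assumption (the deck does not give the line size), the latencies are the approximate values from the latency slide, and the model ignores prefetching entirely.

```python
OUTSTANDING_LOADS = 6   # from the slide
CACHE_LINE_BYTES = 64   # assumption: line size is not given in the deck

# Approximate latencies in ns, taken from the latency slide
LATENCY_NS = {
    "affinitive L2 hit": 20,
    "affinitive L3 hit": 36,
    "affinitive DDR access": 70,
}

def concurrency_bound_gb_s(outstanding, bytes_each, latency_ns):
    """Upper bound on demand-load bandwidth; bytes per ns equals GB/s."""
    return outstanding * bytes_each / latency_ns

bounds = {level: concurrency_bound_gb_s(OUTSTANDING_LOADS, CACHE_LINE_BYTES, ns)
          for level, ns in LATENCY_NS.items()}

for level, gb_s in bounds.items():
    print(f"{level}: {gb_s:.2f} GB/s per core")
```

With these assumptions a single core streaming from DDR on demand misses alone would stay below about 5.5GB/s, which is one reason next-line, stride, and stream prefetching matter for bandwidth-bound code.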
Cache coherence protocol
• Hawk cache coherence protocol
  • Distributed directory-based global cache coherency
  • MOESI-like packet-based coherence protocol
• A home node DCU (directory control unit) supports
  • Affinitive pairing of L2Cs and CMCs
  • “Infinite” capacity for non-conflicting Reads & Writes
  • Optimized transaction flow for exclusive atomic accesses
  • Reduced latency by cacheline forwarding

[Figure: each panel (Panel0 to PanelN) contains Core0 to Core7, a Local Monitor, and Coherence Logic; the L2Cs connect through Hawk to the L3C & Memory, I/O, Interconnects, and a Global Exclusive Monitor.]
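The slide describes Hawk only as "MOESI-like" and directory-based, so its actual transactions are not known from this deck. As a point of reference, the sketch below models a generic textbook MOESI directory entry for a single cache line, including forwarding of a line from its current owner to a reader. The class and method names are hypothetical and the transitions are the textbook ones, not Phytium's.

```python
from enum import Enum

class State(Enum):
    M = "Modified"
    O = "Owned"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"

class DirectoryEntry:
    """Home-node bookkeeping for one cache line (hypothetical layout)."""

    def __init__(self):
        self.states = {}   # cache id -> State; absent means Invalid
        self.owner = None  # cache holding the line in M, O or E

    def state_of(self, cache):
        return self.states.get(cache, State.I)

    def read(self, cache):
        """Serve a read and return where the data comes from."""
        if self.state_of(cache) is not State.I:
            return cache                       # hit in the requester itself
        if self.owner is None:
            if self.states:                    # only clean shared copies exist
                self.states[cache] = State.S
            else:                              # first copy anywhere
                self.states[cache] = State.E
                self.owner = cache
            return "memory"
        source = self.owner                    # owner forwards the line
        if self.state_of(source) is State.M:
            self.states[source] = State.O      # dirty data stays with the owner
        elif self.state_of(source) is State.E:
            self.states[source] = State.S      # clean line needs no owner
            self.owner = None
        self.states[cache] = State.S
        return source

    def write(self, cache):
        """Serve a write and return the caches that were invalidated."""
        invalidated = {c for c in self.states if c != cache}
        self.states = {cache: State.M}
        self.owner = cache
        return invalidated

line = DirectoryEntry()
line.read("L2C0")          # first reader gets the line Exclusive
line.write("L2C0")         # silent upgrade to Modified
print(line.read("L2C1"))   # L2C0 forwards the dirty line and becomes Owned
```

The Owned state is what lets a dirty line be shared by forwarding without first writing it back to memory, which matches the "cacheline forwarding" idea on the slide in spirit.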
Network on Chip
• 2D Concentrated Mesh Architecture
• Cell-based switch with 6 bidirectional ports
• Uniform packet format for each port; a port can be configured to connect to a device or to a cascaded cell
• 4 physical channels for cache coherence (CC) and 1 channel for debug, dimension-order (DOR) Y-X routing
• Low latency: 3 cycles per hop
• High bandwidth: 384GB/s per cell

| Dest. | Lat. (cycles) |
|---|---|
| 0 | 3 |
| 1 | 6 |
| 2 | 9 |
| 3 | 12 |
| 4 | 15 |
| 5 | 12 |
| 6 | 9 |
| 7 | 6 |
| Avg. | 9 |

[Figure: a routing cell with six ports (0 to 5) attached to two L2 caches, MIU/IOU, and MIU or cascade links; eight cells (Cell0 to Cell7) connected in a mesh; a small diagram of the master and destinations 0 to 7 used for the latency table.]
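The latency table can be reproduced with a small model of Y-X dimension-order routing. The sketch below assumes the eight destinations sit on a 4 x 2 grid numbered around a loop (0 to 3 along one row, 7 down to 4 along the other) and that every cell on the path, including the source cell, costs one 3-cycle hop. Neither assumption is stated on the slide; they are chosen because they match the published table.

```python
CYCLES_PER_HOP = 3  # from the slide

# Assumed placement (not stated on the slide): (x, y) on a 4 x 2 grid,
# with cells 0..3 on row 0 and cells 7..4 on row 1.
POSITION = {0: (0, 0), 1: (1, 0), 2: (2, 0), 3: (3, 0),
            7: (0, 1), 6: (1, 1), 5: (2, 1), 4: (3, 1)}

def yx_route(src, dst):
    """Return the cells visited, travelling along Y first and then X."""
    x, y = POSITION[src]
    dst_x, dst_y = POSITION[dst]
    path = [(x, y)]
    while y != dst_y:                 # resolve the Y dimension first
        y += 1 if dst_y > y else -1
        path.append((x, y))
    while x != dst_x:                 # then resolve the X dimension
        x += 1 if dst_x > x else -1
        path.append((x, y))
    return path

def latency_cycles(src, dst):
    # Assumption: each cell traversed, including the source, is one hop.
    return CYCLES_PER_HOP * len(yx_route(src, dst))

table = {dst: latency_cycles(0, dst) for dst in range(8)}
average = sum(table.values()) / len(table)

print(table, average)
```

Because the route is fixed by the coordinates alone, dimension-order routing needs no routing tables and cannot deadlock on a mesh, at the cost of not adapting to congestion.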
Cache & Memory Chip
• L3 cache
  • 16MB Data Array
  • 2MB Data ECC
• DDR bandwidth
  • 2 x DDR3-1600 (800MHz clock): 25.6GB/s
• Proprietary interface between Mars & CMC
  • Parallel interface
  • Needs more pins, but lower latency than SerDes
  • Separate write/cmd and read data channels
  • Effective read channel bandwidth: 12.8GB/s
  • Effective write/cmd channel bandwidth: 6.4GB/s

[Figure: CMC block diagram with the Mars Interface, four L3 banks (Bank0 to Bank3), and two memory controllers (Mem Ctrl0, Mem Ctrl1), each driving DDR.]
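The CMC figures can be related to the chip-level numbers with a few lines of arithmetic. The sketch assumes the read and write/cmd bandwidths are quoted per CMC and that each DDR channel delivers 12.8GB/s, the rate of a 64-bit DDR3-1600 channel as listed on the overview slide; both are readings of the slides rather than stated facts.

```python
CMCS = 8                   # from the overview slide
L3_DATA_MB = 16            # from the slide
L3_ECC_MB = 2              # from the slide
DDR_CHANNELS_PER_CMC = 2   # from the slide
DDR_CHANNEL_GB_S = 12.8    # assumption: one 64-bit DDR3-1600 channel
READ_LINK_GB_S = 12.8      # slide value, assumed to be per CMC
WRITE_LINK_GB_S = 6.4      # slide value, assumed to be per CMC

ecc_overhead = L3_ECC_MB / L3_DATA_MB                    # ECC bytes per data byte
ddr_per_cmc_gb_s = DDR_CHANNELS_PER_CMC * DDR_CHANNEL_GB_S
read_total_gb_s = CMCS * READ_LINK_GB_S                  # all CMC read links
write_total_gb_s = CMCS * WRITE_LINK_GB_S                # all CMC write/cmd links

print(ecc_overhead, ddr_per_cmc_gb_s, read_total_gb_s, write_total_gb_s)
```

The 12.5% ECC overhead is what 8 check bits per 64 data bits would give. Under the per-CMC reading, the aggregate read bandwidth into Mars (about 102GB/s) is lower than the 204GB/s raw DDR figure, so the interface rather than the DRAM would bound read-heavy streaming.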
Latency of affinitive access

| Memory access | Latency (ns) |
|---|---|
| Local L1 cache hit | ~2 |
| Local L2 cache hit | ~8 |
| Affinitive L2 cache hit | ~20 |
| Affinitive L3 cache hit | ~36 |
| Affinitive DDR access | ~70 |

• Panel: 2.0GHz
• NoC: 2.0GHz
• CMC: 1.5GHz
* PCB latency not considered

[Figure: two panels, each with eight Xiaomi cores, two L2 caches, two DCUs, and a routing cell, connected to a CMC.]
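The latencies above are given in nanoseconds, while the core slides quote cycles. Converting with the 2.0GHz panel clock makes the two comparable. The values are the approximate ("~") figures from the table, so the cycle counts are approximate too.

```python
PANEL_CLOCK_GHZ = 2.0  # from the slide

# Approximate latencies in ns, from the table
LATENCY_NS = {
    "Local L1 cache hit": 2,
    "Local L2 cache hit": 8,
    "Affinitive L2 cache hit": 20,
    "Affinitive L3 cache hit": 36,
    "Affinitive DDR access": 70,
}

# ns * GHz = core cycles
latency_cycles = {name: ns * PANEL_CLOCK_GHZ for name, ns in LATENCY_NS.items()}

for name, cycles in latency_cycles.items():
    print(f"{name}: ~{cycles:.0f} cycles")
```

The L1 result of about 4 cycles agrees with the 4-cycle load-to-use latency quoted for the core's load/store unit.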
Memory Tune (mTune)
• Rich Data Collection
  • Number of cache hits/misses for L1/L2/L3
  • Workload of cache pipelines
  • Busyness of the NoC
  • ECC corrections of the memory system
• Support Multiple Metrics
  • Average Miss rate/Hit rate
  • Minimum/Maximum/Average Access Latency
  • Bandwidth Analysis
  • Concurrent Average Memory Access Time (CAMAT)
• Support MPI/OpenMP Applications
  • Thread behavior analysis
  • Inter-process behavior analysis
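The deck does not show mTune's interface, so the following is only an illustration of how metrics such as miss rate and average access latency follow from hit and miss counts. The counter values are made up for the example, the function names are hypothetical, and the latencies are the approximate figures from the latency slide. CAMAT additionally accounts for overlapping accesses and is not modelled here.

```python
def miss_rate(hits, misses):
    """Fraction of accesses to a cache level that miss."""
    return misses / (hits + misses)

def average_access_time(served):
    """Mean latency given (access_count, latency_ns) pairs per level."""
    total = sum(count for count, _ in served)
    return sum(count * latency for count, latency in served) / total

# Hypothetical counters for 1000 loads (illustrative, not measured)
l1_hits, l1_misses = 950, 50
l2_hits, l2_misses = 40, 10   # the L2 only sees the 50 L1 misses

l1_miss_rate = miss_rate(l1_hits, l1_misses)
l2_miss_rate = miss_rate(l2_hits, l2_misses)

# Latencies (ns) from the latency slide; L2 misses are treated as DDR accesses
avg_ns = average_access_time([(l1_hits, 2), (l2_hits, 8), (l2_misses, 70)])

print(l1_miss_rate, l2_miss_rate, avg_ns)
```

Even with a 95% L1 hit rate in this made-up case, the ten DDR accesses contribute about a quarter of the total access time, which is the kind of imbalance such a tool is meant to expose.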
Scalable Debug System
• ARMv8 CoreSight compatible debug system
• Scalable dedicated debug network across 64 cores
• Distributed debug components
• Configurable event broadcast scope
• Timestamp broadcasts with a single signal to simplify implementation

[Figure: debug network across the eight cells (Cell0 to Cell7, one per panel), each with two DNB_C components and one DNB_M; Cell0 also has a DNB_I. The diagram also shows hdbg, JTAG1 and JTAG2.]
Physical Design
• 28nm process
• 0.9V core / 1.8V I/O
• 10 metal layers
• ~180M instances
• 2.0GHz
• 120W
• 640mm² die size (25.38mm x 25.2mm)
• FCBGA
• ~3000 pins
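The die dimensions, power, and peak performance can be combined into a few derived figures. This is simple arithmetic on the slide's numbers; the 512GFLOPS peak comes from the overview slide, and the efficiency figure uses peak rather than measured performance.

```python
DIE_WIDTH_MM = 25.38    # from the slide
DIE_HEIGHT_MM = 25.2    # from the slide
POWER_W = 120           # from the slide
PEAK_GFLOPS = 512       # from the overview slide

die_area_mm2 = DIE_WIDTH_MM * DIE_HEIGHT_MM
power_density_w_per_mm2 = POWER_W / die_area_mm2
peak_gflops_per_watt = PEAK_GFLOPS / POWER_W

print(f"{die_area_mm2:.1f} mm^2, "
      f"{power_density_w_per_mm2:.3f} W/mm^2, "
      f"{peak_gflops_per_watt:.2f} GFLOPS/W")
```

The product of the two dimensions is about 639.6mm², which rounds to the 640mm² quoted on the slide.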
Performance Evaluation
• SPEC CPU2006
  • Single copy of SPEC CPU benchmark (SPEC_CPU2006_base): INT 19.2, FP 17.8
  • 64 copies of SPEC CPU benchmark (SPEC_CPU2006_rate): INT 672, FP 585
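One rough way to read these results is to divide the 64-copy score by the single-copy score. SPEC defines its speed and rate metrics differently, so treating their ratio as a throughput scaling factor is an approximation made here for illustration, not a claim from the slide.

```python
COPIES = 64
SINGLE_COPY = {"INT": 19.2, "FP": 17.8}   # from the slide
RATE_64 = {"INT": 672, "FP": 585}         # from the slide

# Approximate throughput gain from running 64 copies instead of one
scaling = {k: RATE_64[k] / SINGLE_COPY[k] for k in SINGLE_COPY}

# Fraction of ideal linear scaling achieved
efficiency = {k: scaling[k] / COPIES for k in scaling}

for k in scaling:
    print(f"{k}: {scaling[k]:.1f}x over one copy, {efficiency[k]:.0%} of linear")
```

Read this way, 64 copies deliver roughly 35x (INT) and 33x (FP) the single-copy score, a little over half of linear scaling, which would be consistent with shared cache and memory bandwidth limiting fully loaded runs.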
Performance Evaluation
• STREAM

[Figure: STREAM triad bandwidth (GB/s, axis from 0 to 90) versus number of cores (1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64).]
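For readers unfamiliar with the benchmark, the triad kernel and the way STREAM converts its runtime into bandwidth can be sketched as follows. This is an illustrative pure-Python version, far slower than the compiled benchmark, so the number it prints says nothing about Mars; the array length and helper names are choices made for this example.

```python
import time
from array import array

def triad(a, b, c, scalar):
    """STREAM triad kernel: a[i] = b[i] + scalar * c[i]."""
    for i in range(len(a)):
        a[i] = b[i] + scalar * c[i]

def triad_bandwidth_gb_s(n, seconds):
    """STREAM counts three 8-byte words per element: read b, read c, write a."""
    return 3 * 8 * n / seconds / 1e9

n = 100_000                   # illustrative length, much smaller than a real run
a = array("d", [0.0]) * n
b = array("d", [1.0]) * n
c = array("d", [2.0]) * n

start = time.perf_counter()
triad(a, b, c, 3.0)
elapsed = time.perf_counter() - start

print(f"{triad_bandwidth_gb_s(n, elapsed):.3f} GB/s (interpreter-bound, illustrative only)")
```

A real measurement uses arrays much larger than the caches and, for the multi-core points on the chart, runs the loop across threads.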
Next Generation Scale-up CPU
• More powerful core
  • Aggressive Branch Predictor
  • Multithreading
  • More aggressive ILP exploitation
  • Wider SIMD
• More RAS features
• Higher bandwidth memory access
• Higher power efficiency
Mars: A 64-core ARMv8 Processor
Charles Zhang
Charles.zhang@phytium.com.cn
