RISC-V open-ISA and open-HW – a Swiss army knife for HPC
ICS2020, Workshop on RISC-V and OpenPOWER, 29.06.2020
Andrea Bartolini¹ & PULP team¹,²
¹Department of Electrical, Electronic and Information Engineering
²Integrated Systems Laboratory
||
Energy efficiency challenge: Exascale
Copyright © European Processor Initiative 2019. EPI Tutorial/Barcelona/17-07-2019
[Chart: projected peak performance, from 1 TFLOPS in 2000 to 100 EFLOPS after 2030 (x10 every 4 years), against the energy per operation a fixed power budget allows, from 2 nJ/FLOP down to 0.2 pJ/FLOP (/10 every 4 years).]
HPC is now power-bound → need 10x energy efficiency improvement every 4 years
*20 MWatt supercomputer: performance fixes the energy per operation; 1 EFLOPS → 20 pJ/FLOP
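To make the footnote concrete: 20 MW ÷ 10^18 FLOP/s = 2 × 10^-11 J = 20 pJ per FLOP, and every further x10 in performance under the same 20 MW budget divides the allowed energy per operation by ten.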
||
Peak Performance:  Moore's law      -> Exaflops (10^18)
FPU Performance:   Dennard scaling  -> Gigaflops (10^9)
Number of FPUs:    Moore + Dennard  -> 10^9
App. Parallelism:  Amdahl's law     -> serial fraction 1/10^9
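To see why the serial fraction must be that small, a quick numerical check of the Amdahl's-law row (a minimal sketch; the 10^9 FPUs and 1/10^9 serial fraction are the figures from the table above):

#include <stdio.h>

/* Amdahl's law: speedup(N) = 1 / (s + (1 - s) / N),
   with serial fraction s and N parallel FPUs. */
static double amdahl(double s, double n) {
    return 1.0 / (s + (1.0 - s) / n);
}

int main(void) {
    double n = 1e9;  /* 10^9 FPUs */
    printf("s = 0:    speedup = %.3g\n", amdahl(0.0, n));   /* 1e9: ideal   */
    printf("s = 1e-9: speedup = %.3g\n", amdahl(1e-9, n));  /* ~5e8: halved */
    printf("s = 1e-6: speedup = %.3g\n", amdahl(1e-6, n));  /* ~1e6: lost   */
    return 0;
}

Even a 10^-6 serial fraction caps the speedup at about a million, three orders of magnitude short of exascale; hence the programmability point below.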
We need programmability support
Already at 20 MWatt
C.Cavazzoni
HPC trends
||
“Traditional” CPU chips are designed for maximum performance on all possible workloads. Silicon area is wasted to maximize single-thread performance.
[Chart: compute power vs. energy vs. datacenter capacity.]
C.Cavazzoni
Energy trends
||
New chips are designed for maximum performance on a reduced set of workloads: simple functional units with poor single-thread performance, but maximum throughput.
[Chart: compute power vs. energy vs. datacenter capacity.]
C.Cavazzoni
Change of paradigm #1
||
Change of paradigm #2
||
https://indico-jsc.fz-juelich.de/event/76/session/0/contribution/1/material/slides/0.pdf
Wayne Joubert - OpenPOWER ADG 2018
Change of paradigm #3
||
Change of paradigm #4
[Diagram: a scalable monitoring framework with a common interface to heterogeneous sensors (CRAC, PDU, cluster, environment) feeds machine learning, performance analysis, and data visualization, and returns reactive and proactive feedback for resource management, job scheduling, and energy efficiency.]
||
… a Swiss army knife for HPC (https://pulp-platform.org/)
▪ ARIANE: the 64b application processor
▪ ARA: the vector engine
▪ NTX: the network training accelerator
▪ SNITCH: the pseudo dual-issue processor for FP workloads
▪ HERO: the open heterogeneous research platform
▪ sPIN on PULP: network-accelerated memory transfers
▪ ControlPULP: the power controller for HPC servers
EXASCALE 2021
||
▪ RV64GC, 6-stage, in-order issue, out-of-order execute
▪ 16 KiB instruction cache, 32 KiB data cache
▪ Transprecision floating-point unit (TP-FPU) [3]
▪ double-, single- and half-precision FP formats
▪ Two custom formats FP16alt and FP8
▪ All standard RISC-V formats as well as SIMD
▪ Two different implementations:
▪ Ariane High Performance (AHP): tuned for high-performance applications
▪ Ariane Low Power (ALP): tuned for light, single-threaded applications
Architecture: Ariane RISC-V Cores
ARIANE:
The 64b
Application
Processor
The cost of application-class processing: Energy and Performance Analysis of a Linux-ready 1.7-GHz 64-bit RISC-V Core in 22-nm FDSOI Technology
||
OpenPiton+Ariane
▪ Boots SMP Linux
▪ New write-through
cache subsystem
with invalidations
and the TRI
interface
▪ LR/SC in L1.5 cache
▪ Fetch-and-op in L2
cache
▪ RISC-V Debug
▪ RISC-V Peripherals
If you are really passionate about cache coherent “scalable” machines…
OpenPiton+Ariane: The First Open-Source, SMP Linux-booting RISC-V System Scaling From One to Many Cores
||
▪ “Network Training Accelerator”
▪ 32 bit float streaming co-processor
(IEEE 754 compatible)
▪ Custom 300 bit “wide-inside” Fused
Multiply-Accumulate
▪ 1.7x lower RMSE than conventional
FPU
▪ 1 RISC-V core (“RI5CY”) and DMA
▪ 8 NTX co-processors
▪ 64 kB L1 scratchpad memory
(comparable to 48 kB in V100)
Key ideas to increase hardware efficiency:
▪ Reduction of von Neumann bottleneck
(load/store elision through streaming)
▪ Latency hiding through DMA-based
double-buffering
Architecture: Network Training Accelerator (NTX)
Schuiki, Fabian, Michael Schaffner, Frank K. Gürkaynak, and Luca Benini. "A scalable near-memory
architecture for training deep neural networks on large in-memory datasets." IEEE Transactions on
Computers 68, no. 4 (2018): 484-497.
Schuiki, Fabian, Michael Schaffner, and Luca Benini. "Ntx: An energy-efficient streaming accelerator
for floating-point generalized reduction workloads in 22 nm fd-soi." In 2019 Design, Automation & Test
in Europe Conference & Exhibition (DATE), pp. 662-667. IEEE, 2019.
||
Flexible architecture: the NTX-accelerated cluster
▪ 1 processor core controls 8 NTX coprocessors
▪ Attached to 128 kB of shared TCDM via a logarithmic interconnect
▪ A DMA engine transfers data (double buffering; see the sketch below)
▪ Multiple clusters are connected via an interconnect (crossbar/NoC)
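To illustrate the double-buffering pattern (a minimal C sketch; dma_start, dma_wait, and ntx_compute are hypothetical stand-ins for the cluster's DMA driver and the offloaded kernel): while the NTXs compute on one L1 buffer, the DMA fills the other, hiding the transfer latency.

#include <stddef.h>
#include <string.h>

#define TILE 1024
static double tcdm[2][TILE];   /* two L1 scratchpad buffers */

/* Stubs standing in for the DMA driver and NTX offload (illustration only). */
static void dma_start(double *dst, const double *src, size_t n) {
    memcpy(dst, src, n * sizeof *dst);   /* a real DMA runs asynchronously */
}
static void dma_wait(void) { /* join the in-flight transfer */ }
static void ntx_compute(const double *buf, size_t n) { (void)buf; (void)n; }

void process(const double *ext, size_t n_tiles) {
    dma_start(tcdm[0], ext, TILE);            /* prefetch the first tile */
    for (size_t t = 0; t < n_tiles; t++) {
        dma_wait();                           /* tile t is now in L1 */
        if (t + 1 < n_tiles)                  /* fetch tile t+1 ... */
            dma_start(tcdm[(t + 1) & 1], ext + (t + 1) * TILE, TILE);
        ntx_compute(tcdm[t & 1], TILE);       /* ... while computing on tile t */
    }
}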
||
Network Training Accelerator (NTX)
▪ Processor configures Reg IF and manages DMA double-buffering in L1 memory
▪ Controller issues AGU, HWL, and FPU micro-commands based on configuration
▪ AGUs generate address streams for data access
▪ FMAC with extended precision + ML functions
▪ Reads/writes data via 2 memory ports (2 operand and 1 writeback streams)
[Diagram: one RISC-V core drives eight NTX coprocessors ("1 for 8") attached to a multi-banked L1 memory.]
||
▪ 22nm FDX technology
▪ Two application-class RISC-V
Ariane cores [1] - DP
▪ RV64GCXsmallfloat
▪ General purpose workloads
▪ Network Training Accelerator (NTX)
[2] - FP
▪ Accelerates oblivious kernels:
▪ Deep neural network training
▪ Stencils
▪ General linear algebra workloads
▪ 1.25 MiB of shared L2 memory
▪ Peripherals
Kosmodrom
Ariane and NTX on the same technology
ARIANE:
The 64b
Application
Processor
Schuiki, Fabian, Michael Schaffner, and Luca Benini. "NTX: A 260 Gflop/sW Streaming
Accelerator for Oblivious Floating-Point Algorithms in 22 nm FD-SOI." In 2019 International
SoC Design Conference (ISOCC), pp. 117-118. IEEE, 2019.
||
▪ We achieve higher energy-efficiency for AHP and ALP than competitive RISC-V processors (Rocket)
▪ Ariane contains slightly larger caches (32 KiB compared to 16 KiB)
▪ The ALP implementation is penalized because of the less mature cell libraries available to us (7k cells vs 2k cells)
▪ NTX achieves a 2x gain in energy-efficiency compared to a Tesla V100
Summary on Kosmodrom: State of the Art
[Chart: energy-efficiency comparison; 6x and 18x gains annotated.]
Zaruba, Florian, Fabian Schuiki, Stefan Mach, and Luca Benini. "The Floating
Point Trinity: A Multi-modal Approach to Extreme Energy-Efficiency and
Performance." In 2019 26th IEEE International Conference on Electronics,
Circuits and Systems (ICECS), pp. 767-770. IEEE, 2019.
||
[Diagram: Ariane (1 GHz, 2 DP-GFLOPS, 8 GB/s, with I$ and D$) dispatches vector instructions to Ara (1 GHz, 16 DP-GFLOPS, 32 GB/s, with VRF) through an instruction queue with ACK/TRAP feedback; Ara has a 256b data port into the interconnect, Ariane 64b instruction and data ports.]
Enter ARA: Open-Source RISC-V Vector Engine
⚫ Ara targets 0.5 DP-FLOP/B
– memory bandwidth scales with the number of physical lanes
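The target follows directly from the figures above: 16 DP-GFLOPS against 32 GB/s of memory bandwidth is 0.5 DP-FLOP per byte, and since both throughput and bandwidth grow with the lane count, the ratio holds as the design scales.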
Cavalcante, Matheus, Fabian Schuiki, Florian Zaruba, Michael Schaffner, and Luca
Benini. "Ara: A 1-GHz+ Scalable and Energy-Efficient RISC-V Vector Processor
With Multiprecision Floating-Point Support in 22-nm FD-SOI." IEEE Transactions on
Very Large Scale Integration (VLSI) Systems 28, no. 2 (2019): 530-543.
||
Matrix multiplication on Ara
⚫ Load row i of matrix B into vB
⚫ for (int j = 0; j < n; j++)
– Load element A[j, i]
– Broadcast it into vA
– vC ← vA * vB + vC

vld vB, 0(addrB)          # load row i of B once
(Unrolled loop)
ld t0, 0(addrA)           # load scalar A[j, i]
addi addrA, addrA, 8      # advance to the next element
vins vA, t0, zero         # broadcast the scalar into vA
vmadd vC, vA, vB, vC      # vC += vA * vB
ld t0, 0(addrA)           # next j: same four-instruction pattern
addi addrA, addrA, 8
vins vA, t0, zero
vmadd vC, vA, vB, vC
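For orientation, a plain C version of the same computation (my sketch; on Ara the whole inner k-loop collapses into the single vmadd above):

/* C (n x n) += A (n x n) * B (n x n), row-times-scalar formulation:
   row i of B is scaled by A[j][i] and accumulated into row j of C. */
void matmul(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i++)           /* row i of B plays the role of vB  */
        for (int j = 0; j < n; j++) {     /* scalar A[j][i] is broadcast (vA) */
            double a = A[j * n + i];
            for (int k = 0; k < n; k++)   /* whole row in one vmadd on Ara    */
                C[j * n + k] += a * B[i * n + k];
        }
}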
||
Issue-rate performance limitation
⚫ vmadds are issued at best every four cycles
– since Ariane is single-issue
⚫ If the vector MACs take less than four cycles to execute, the FPUs starve waiting for instructions
– the von Neumann bottleneck
⚫ This translates to a boundary in the roofline plot
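Concretely (my arithmetic, using the 4-lane configuration of the next slide): each 64-bit lane retires one FMA per cycle, so a vmadd on a vector of length l occupies the lanes for l/4 cycles; with at best one vmadd issued every 4 cycles, the lanes stay busy only for l ≥ 16 elements, and shorter vectors leave the FPUs idle between issues.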
||
Ara: Figures of Merit
⚫ Ara: 4-lane implementation in GF 22FDX
⚫ Clock frequency: 1.25 GHz (nominal), 0.92 GHz (worst condition)
– 40 gate delays
⚫ Area: 3400 kGE (0.68 mm²)
⚫ 256 x 256 MATMUL:
– Performance: 9.8 DP-GFLOPS
– Power: 259 mW
– Efficiency: 38 DP-GFLOPS/W
– ~2.5x better than Ariane on the same benchmark
[Chart: area breakdown.]
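As a sanity check, the efficiency figure is just the ratio of the two lines above it: 9.8 DP-GFLOPS / 0.259 W ≈ 38 DP-GFLOPS/W.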
||
Ara: Scalability
⚫ Each lane is almost independent
– contains part of the VRF and its functional units
⚫ Scalability limitations
– the VLSU and SLDU need to communicate with all banks
⚫ Instance with 16 lanes:
– 1.04 GHz (nominal), 0.78 GHz (worst case)
– 10.7 MGE (2.13 mm²)
– 32.4 DP-GFLOPS
– 40.8 DP-GFLOPS/W (peak)
[Diagram: 16-lane floorplan, with Ariane, the VLSU, and the SLDU highlighted.]
16 Aras give you 1 TFLOPS at 12 W - NOT BAD!
||
SNITCH
[Diagram: Snitch cluster hierarchy. Core complexes CC0…CC3 share a MULDIV unit and an I$ to form Hive 0; Hive 1 is identical. Two hives, the TCDM banks B0…B31, and peripherals form Cluster 0; Clusters 1, 2, … connect through a system crossbar to memory.]
▪ Built around the Snitch core: RV32I, 15 kGE
▪ Add a 64b FPU subsystem: core complex (CC)
▪ 4 CCs, MULDIV, I-cache: hive
▪ 2 hives, TCDM, peripherals: cluster
▪ N clusters, system X-bar, memory: system
▪ The float subsystem adds novel HW:
▪ 2 stream semantic registers
▪ FPU sequencer
||
Stream Semantic Registers and FREP
▪ Vanilla RISC cores: low functional-unit utilization
▪ Solutions are often complex: CISC, VLIW, vectoring
▪ Map registers to memory streams: SSRs
▪ Reads and writes become memory requests
▪ A programmable generator emits the addresses
+ Unmodified ISA
+ Orthogonal to hardware loops
▪ Snitch CC: the FPU instruction stream is decoupled
▪ The FPU sequencer can buffer and loop over instructions
▪ Core and FPU are fed in parallel → pseudo-dual issue

Baseline dot product (six instructions per element pair):
dotp: fld ft0, 0(a0)            # load x[i]
fld ft1, 0(a1)                  # load y[i]
fmadd.d ft2, ft0, ft1, ft2      # acc += x[i] * y[i]
addi a0, a0, 8                  # advance both pointers
addi a1, a1, 8
blt a0, t0, dotp                # loop while a0 < end

With SSRs and FREP:
call conf_addr_gen              # program the SSR address generators
frep buflen, rep                # hardware-repeat the next instruction
fmadd.d ft2, ft0, ft1, ft2      # ft0/ft1 now stream from memory
Zaruba, Florian, Fabian Schuiki, Torsten Hoefler, and Luca Benini. "Snitch: A 10
kGE Pseudo Dual-Issue Processor for Area and Energy Efficient Execution of
Floating-Point Intensive Workloads." arXiv preprint cs.AR/2002.10143 (2020).
Schuiki, Fabian, Florian Zaruba, Torsten Hoefler, and Luca Benini. "Stream
Semantic Registers: A Lightweight RISC-V ISA Extension Achieving Full Compute
Utilization in Single-Issue Cores." IEEE Transactions on Computers (2020).
||
SNITCH Figures of Merit
[Chart: normalized performance.]
Higher performance than VP
Almost 80 DP-GFLOPS/W
✓ 2x more efficient than Ara
✓ 13 pJ per DP FLOP
✓ Exascale!!
22nm FDX technology
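The two headline figures agree: at 13 pJ per DP FLOP, one watt buys 1 W / 13 pJ ≈ 77 GFLOP/s, i.e. "almost 80 DP-GFLOPS/W", and 20 MW at that efficiency is about 1.5 EFLOPS, hence the exascale claim.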
||
System-Level Integration of Accelerators
• Our accelerators show leading performance and energy efficiency in
silicon at the core and cluster level
• How can we unleash this potential in real computing systems?
Images: Nvidia Xavier die shot annotated by WikiChip; Summit supercomputer by OLCF at ORNL
||
HERO: Open-Source Heterogeneous Research Platform
HERO combines
• general-purpose Host CPUs
• domain-specific programmable many-core accelerators
to unite versatility with performance,
enabling task offloading and data sharing across heterogeneous
• ISAs (e.g., ARMv8 and RV32)
• memory subsystems (e.g., caches and SPMs, virtual and physical addresses)
• data models (e.g., LP64 and ILP32)
• OSes and runtime libraries (e.g., Linux and OpenMP Device RTL)
with minimal run-time overhead, transparently to application programmers.
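To make the data-model point concrete, a minimal illustration (generic C, not HERO's API): the same struct gets two different layouts under LP64 (the ARMv8 host) and ILP32 (the RV32 accelerator), so naively shared data would be misinterpreted across the offload boundary.

#include <stdio.h>

/* Under LP64, long and pointers are 8 bytes; under ILP32 they are 4.
   The same source type therefore has two layouts across the boundary. */
struct shared_naive { long len; char *data; };  /* 16 B on LP64, 8 B on ILP32 */

/* Fixed-width fields and offsets instead of pointers keep the layout
   identical on both sides. */
struct shared_safe { unsigned int len; unsigned int data_offset; };  /* 8 B on both */

int main(void) {
    printf("naive: %zu bytes, safe: %zu bytes\n",
           sizeof(struct shared_naive), sizeof(struct shared_safe));
    return 0;
}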
A. Kurth, P. Vogel, A. Marongiu, A. Capotondi, and L. Benini: "HERO: Heterogeneous Embedded Research Platform for Exploring RISC-V Manycore Accelerators on FPGA." Proceedings of the First Workshop on Computer Architecture Research with RISC-V (CARRV), pp. 1-7, IEEE/ACM, 2017.
A. Kurth, A. Capotondi, P. Vogel, L. Benini, and A. Marongiu: "HERO: an Open-Source Research Platform for HW/SW Exploration of Heterogeneous Manycore Systems." Proceedings of the Second Workshop on Autotuning and Adaptivity Approaches for Energy-Efficient HPC Systems (ANDARE), pp. 13-18, ACM, 2018.
||
Network-Accelerated Memory Transfers: sPIN on PULP
S. Di Girolamo, K. Taranov, A. Kurth, M. Schaffner, T. Schneider, J. Beranek, M. Besta, L. Benini, D. Roweth, and T. Hoefler: "Network-Accelerated Non-Contiguous Memory Transfers." Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pp. 1-14, ACM, 2019.
sPIN on PULP: Network-Accelerated Memory Transfers
Processing user-defined network packet kernels on a PULP-based accelerator in the NIC
||
ControlPULP
[Diagram: ControlPULP sits in the node between the VRM, the PEs, the DIMMs, and the BMC (RJ45). In band, the operating system, the application, and system management / RM reach it through governors and low-latency hints/prescriptions (power cap, energy vs. throughput); out of band, system management / RM uses it for node power capping and RAS.]
Main architectural blocks:
- Sensors (PVT, utilization, architectural)
- Controls (f, Vdd, Vbb, power gating, clock gating)
- In-band, a.k.a. low-latency / user-space telemetry (power, perf, …)
- O.S. PM governors:
- cpufreq / cpuidle
- based on O.S. metrics
- slow & often unused
- Low-latency PM requests and/or suggestions from the application/run-time:
- Power cap => max perf @ P < Pmax
- Energy => min energy @ f = f*
- Throughput => f > Fmax @ T, P < max
- Out-of-band, zero-overhead telemetry
- Node Pcap: max perf @ Pnode < Pmax
- RAS: error and condition reporting
Copyright © European Processor Initiative 2019. EPI Tutorial/Barcelona/17-07-2019
Coming Soon…
Andrea Bartolini, et al.: "A PULP-based Parallel Power Controller for Future Exascale Systems," in 2019 26th IEEE International Conference on Electronics, Circuits and Systems (ICECS).
||
ControlPULP
Coming Soon…
PM task, run every T_control:
• Read voltage regulator power and status (VR)
• Update the power model
• Read PVT sensors
• Read the workload from the O.S.
• Read target P/C-state settings and the power budget
• Read pending BMC requests
• Compute the controller settings
• Write the power controller settings
• Write telemetry data to internal memory
• Reset the watchdog
BMC task:
• Read the pending command queue
• Decode command/data
• Perform the action:
• Change target P/C state or power budget
• Set pending BMC
• Ask for telemetry data
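A schematic sketch of one iteration of such a PM task (illustrative C with hypothetical hook names and gains; not ControlPULP firmware):

/* Stubs standing in for the real sensor/actuator drivers. */
static double read_pvt_sensors(void)        { return 65.0;  } /* degC */
static double read_power(void)              { return 180.0; } /* W    */
static double read_power_budget(void)       { return 200.0; } /* W    */
static void   write_freq_settings(double f) { (void)f;      } /* GHz  */
static void   kick_watchdog(void)           {}

/* One iteration of the control task, run every T_control. */
void pm_task_step(void) {
    double temp   = read_pvt_sensors();
    double power  = read_power();
    double budget = read_power_budget();   /* may be updated by the BMC */

    /* Toy proportional power-cap controller: scale frequency down when
       over budget, creep back up when there is headroom. */
    static double freq = 2.0;              /* GHz */
    freq += 0.01 * (budget - power);       /* Kp = 0.01 GHz/W */
    if (temp > 85.0) freq -= 0.1;          /* thermal backoff */
    if (freq < 0.4) freq = 0.4;
    if (freq > 3.0) freq = 3.0;

    write_freq_settings(freq);             /* actuate */
    kick_watchdog();                       /* prove liveness to the watchdog */
}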
Copyright © European Processor Initiative 2019. EPI Tutorial/bologna/22-01-2020
Ack. Robert Balas, Giovanni Bambini, Andrea Bentivogli, Davide Rossi,
Antonio Mastrandrea, Christian Conficoni, Simone Benatti, Andrea Tilli
||
HPC Vertical: The European Processor Initiative
▪ High-performance general-purpose processor for HPC
▪ High-performance RISC-V based accelerator
▪ Computing platform for autonomous cars
▪ Will also target the AI, Big Data, and other markets in order to be economically sustainable
Europe needs its own processors
▪ Processors now control almost every aspect of our lives
▪ Security (back doors etc.)
▪ Possible future restrictions on exports to the EU due to increasing protectionism
▪ A competitive EU supply chain for HPC technologies will create jobs and growth in Europe
▪ Sovereignty (data, economic, embargo)
||
General Purpose Processor (GPP) chip
▪ 7 nm, chiplet technology
▪ ARM-SVE tiles
▪ EPAC RISC-V vector+AI accelerator tiles
▪ L1, L2, L3 cache subsystem + HBM + DDR
RISC-V Accelerator Demonstrator Test Chip
▪ 22 nm FDSOI
▪ Only one RISC-V accelerator tile
▪ On-chip L1, L2 + off-chip HBM + DDR PHY
▪ Targets 128 DP-GFLOPS (vector), 200+ GOPS/W SP (STX)
[Diagram: GPP floorplan with HBM and DDR interfaces; accelerator tile with high-speed SerDes, two scalar cores, vector lanes, and STX units.]
Copyright © European Processor Initiative 2019. EPI Tutorial/Barcelona/17-07-2019
First-generation EPI chips
Scalar core + STX units based on NTX and Snitch!
GPP power manager based on ControlPULP!
||
… a Swiss army knife for HPC
EXASCALE
2021
||
http://pulp-platform.org
The fun is just beginning 😊