FPGAs for Supercomputing: The Why and How
Hal Finkel² (hfinkel@anl.gov), Kazutomo Yoshii¹, and Franck Cappello¹
¹ Mathematics and Computer Science (MCS)
² Leadership Computing Facility (ALCF)
Argonne National Laboratory
Advanced Scientific Computing Advisory Committee
Tuesday, December 20, 2016
Washington, DC
Outline
● Why are FPGAs interesting?
● Can FPGAs competitively accelerate traditional HPC workloads?
● Challenges and potential solutions to FPGA programming.
For some things, FPGAs are really good!
bioinformatics: 70x faster!
http://escholarship.org/uc/item/35x310n6
For some things, FPGAs are really good!
machine learning and neural networks
http://ieeexplore.ieee.org/abstract/document/7577314/
The FPGA is faster than both the CPU and GPU, 10x more power efficient, and achieves a much higher percentage of peak!
http://www.socforhpc.org/wp-content/uploads/2015/06/SBorkar-SoC-WS-DAC-June-7-2015-v1.pptx
Parallelism Triumphs As We Head Toward Exascale
[Chart: relative transistor performance, 1986–2021, marking the Giga, Tera, Peta, and Exa eras]
● Giga → Tera: 32x from transistor, 32x from parallelism
● Tera → Peta: 8x from transistor, 128x from parallelism
● Peta → Exa: 1.5x from transistor, 670x from parallelism
System performance from parallelism
http://science.energy.gov/~/media/ascr/ascac/pdf/meetings/201604/McCormick-ASCAC.pdf
(Maybe) It's All About the Power...
Do FPGAs perform less data movement per computation?
http://www.socforhpc.org/wp-content/uploads/2015/06/SBorkar-SoC-WS-DAC-June-7-2015-v1.pptx
To Decrease Energy, Move Data Less!
On-die Data Movement vs. Compute
● Interconnect energy (per mm) reduces more slowly than compute energy
● On-die data movement energy will start to dominate
[Chart, source: Intel – normalized energy vs. process node (90, 65, 45, 32, 22, 14, 10, 7 nm); compute energy improves ~6x across the range, on-die interconnect energy/mm only ~60%]
https://www.semiwiki.com/forum/content/6160-2016-leading-edge-semiconductor-landscape.html
Compute vs. Movement – Changes Afoot
http://iwcse.phys.ntu.edu.tw/plenary/HorstSimon_IWCSE2013.pdf
(2013)
FPGAs vs. CPUs
FPGA: http://evergreen.loyola.edu/dhhoe/www/HoeResearchFPGA.htm
CPU: http://www.ics.ele.tue.nl/~heco/courses/EmbSystems/adv-architectures.ppt
Where Does the Power Go (CPU)?
http://link.springer.com/article/10.1186/1687-3963-2013-9
(Model with (# register files) x (read ports) x (write ports))
● Fetch and decode take most of the energy!
● More centralized register files mean more data movement, which takes more power.
● Only a small portion of the energy goes to the underlying computation.
See also: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-2008-130.pdf
Modern FPGAs: DSP Blocks and Block RAM
http://yosefk.com/blog/category/hardware
Design mapped (Place & Route)
Intel Stratix 10 will have up to:
● 5760 DSP blocks = 9.2 SP TFLOPS
● 11,721 20Kb block RAMs = 28 MB
● 64-bit 4-core ARM @ 1.5 GHz
https://www.altera.com/products/fpga/stratix-series/stratix-10/features.html
DSP blocks multiply (Intel/Altera FPGAs have full SP FMA)
GFLOPS/Watt (Single Precision)
[Bar chart, 0–120 GFLOPS/Watt: Intel Skylake, Intel Knights Landing, NVIDIA Pascal, Altera Stratix 10, Xilinx Virtex UltraScale+]
● http://wccftech.com/massive-intel-xeon-e5-xeon-e7-skylake-purley-biggest-advancement-nehalem/ - taking the 165 W max range
● http://cgo.org/cgo2016/wp-content/uploads/2016/04/sodani-slides.pdf
● http://www.xilinx.com/applications/high-performance-computing.html - UltraScale+ figure inferred from a 33% performance increase (from Hot Chips presentation)
● https://devblogs.nvidia.com/parallelforall/inside-pascal/
● https://www.altera.com/products/fpga/stratix-series/stratix-10/features.html
Marketing numbers for unreleased products... (be skeptical)
Do these FPGA numbers include system memory?
GFLOPS/Watt (Single Precision) – Let's be more realistic...
[Bar chart, 0–120 GFLOPS/Watt: Intel Skylake, Intel Knights Landing, NVIDIA Pascal, Altera Stratix 10, Xilinx Virtex UltraScale+]
● http://www.tomshardware.com/reviews/intel-core-i7-5960x-haswell-e-cpu,3918-13.html
● https://hal.inria.fr/hal-00686006v2/document
● http://www.eecg.toronto.edu/~davor/papers/capalija_fpl2014_slides.pdf - tile approach yields 75% of peak clock rate on full device
90% of peak on a CPU is excellent!
70% of peak on a GPU is excellent!
Plus system memory: assuming 6 W for 16 GB DDR4 (and 150 W for the FPGA)
Conclusion: FPGAs are a competitive HPC accelerator technology by 2017!
GFLOPS/device (Single Precision)
[Bar chart, 0–12000 GFLOPS: Intel Skylake, Intel Knights Landing, NVIDIA Pascal, Altera Stratix 10, Xilinx Virtex UltraScale+]
● https://www.altera.com/content/dam/altera-www/global/en_US/pdfs/literature/pt/stratix-10-product-table.pdf - largest variant with all DSPs doing FMAs @ the 800 MHz max
● http://www.xilinx.com/support/documentation/ip_documentation/ru/floating-point.html
● http://www.xilinx.com/support/documentation/selection-guides/ultrascale-plus-fpga-product-selection-guide.pdf - LUTs, not DSPs, are the limiting resource – filling the device with FMAs @ 1 GHz
● https://devblogs.nvidia.com/parallelforall/inside-pascal/
● http://wccftech.com/massive-intel-xeon-e5-xeon-e7-skylake-purley-biggest-advancement-nehalem/ - 28 cores @ 3.7 GHz * 16 FP ops per cycle * 2 for FMA (assuming same clock rate as the E5-1660 v2)
● http://cgo.org/cgo2016/wp-content/uploads/2016/04/sodani-slides.pdf
All in theory...
GFLOPS/device (Single Precision) – Let's be more realistic...
[Bar chart, 0–12000 GFLOPS: Intel Skylake, Intel Knights Landing, NVIDIA Pascal, Altera Stratix 10, Xilinx Virtex UltraScale+]
● https://www.altera.com/content/dam/altera-www/global/en_US/pdfs/literature/wp/wp-01222-understanding-peak-floating-point-performance-claims.pdf
● https://www.altera.com/en_US/pdfs/literature/wp/wp-01028.pdf (old but still useful)
90% of peak on a CPU is excellent!
70% of peak on a GPU is excellent!
80% usage at peak frequency of an FPGA is excellent!
Xilinx has no hard FP logic... reserving 30% of the LUTs for other purposes.
For FPGAs, Parallelism is Essential
[Figure: FPGA (90 nm, 65 nm) vs. CPU/GPU (90 nm) comparison]
http://rssi.ncsa.illinois.edu/proceedings/academic/Williams.pdf
(2008)
An experiment...
Nallatech 385A Arria 10 board:
● 200–300 MHz (depends on the design)
● 20 nm
● two DRAM channels, 34.1 GB/s peak
Sandy Bridge E5-2670:
● 2.6 GHz (3.3 GHz w/ turbo)
● 32 nm
● four DRAM channels, 51.2 GB/s peak
An experiment: Power is Measured...
● Intel RAPL is used to measure CPU energy
– CPU and memory
● A Yokogawa WT310, an external power meter, is used to measure the FPGA power
– FPGA_pwr = meter_pwr - host_idle_pwr + FPGA_idle_pwr (~17 W)
– Note that meter_pwr includes both CPU and FPGA
An experiment: Random Access with Computation using OpenCL
● # work-units is 256
● CPU: Sandy Bridge (4ch memory)
● FPGA: Arria 10 (2ch memory)
for (int i = 0; i < M; i++) {
  double8 tmp;
  index = rand() % len;
  tmp = array[index];
  sum += (tmp.s0 + tmp.s1) / 2.0;
  sum += (tmp.s2 + tmp.s3) / 2.0;
  sum += (tmp.s4 + tmp.s5) / 2.0;
  sum += (tmp.s6 + tmp.s7) / 2.0;
}
An experiment: Random Access with Computation using OpenCL
● # work-units is 256
● CPU: Sandy Bridge (2ch memory)
● FPGA: Arria 10 (2ch memory)
for (int i = 0; i < M; i++) {
  double8 tmp;
  index = rand() % len;
  tmp = array[index];
  sum += (tmp.s0 + tmp.s1) / 2.0;
  sum += (tmp.s2 + tmp.s3) / 2.0;
  sum += (tmp.s4 + tmp.s5) / 2.0;
  sum += (tmp.s6 + tmp.s7) / 2.0;
}
Make the comparison more fair...
FPGAs – Molecular Dynamics – Strong Scaling Again!
Martin Herbordt (Boston University)
FPGAs – Molecular Dynamics – Strong Scaling Again!
Martin Herbordt (Boston University)
High-End CPU + FPGA Systems Are Coming...
● Intel/Altera are starting to produce Xeon + FPGA systems
● Xilinx are producing ARM + FPGA systems
These are not just embedded cores,
but state-of-the-art multicore CPUs
Low latency and high bandwidth
CPU + FPGA systems fit nicely into the
HPC accelerator model! (“#pragma omp
target” can work for FPGAs too)
https://www.nextplatform.com/2016/03/14/intel-marrying-fpga-beefy-broadwell-open-compute-future/
Common Algorithm Classes in HPC
http://crd.lbl.gov/assets/pubs_presos/CDS/ATG/WassermanSOTON.pdf
Common Algorithm Classes in HPC – What do they need?
http://crd.lbl.gov/assets/pubs_presos/CDS/ATG/WassermanSOTON.pdf
FPGAs Can Help Everyone!
● Compute Bound (FPGAs have lots of compute)
● Memory-Latency Bound (FPGAs can pipeline deeply)
● Memory-Bandwidth Bound (FPGAs can do on-the-fly compression)
FPGAs have lots of registers. FPGAs have lots of embedded memory.
FPGA Programming: Levels of Abstraction
[Diagram, derived from Deming Chen's slide (UIUC) – levels of abstraction and the transformations between them:]
● High-level languages (OpenMP, OpenACC, etc.) → source-to-source translation
● Behavior level (C, C++, SystemC, OpenCL) → high-level synthesis (datapath + controller)
● RT level (VHDL, Verilog) → logic synthesis
● Gate level (netlist) → place & route (Altera/Xilinx toolchains)
● Bitstream
FPGA Programming Techniques
● Use FPGAs as accelerators through (vendor-)optimized libraries (lowest risk, lowest user difficulty)
● Use of FPGAs through overlay architectures (pre-compiled custom processors)
● Use of FPGAs through high-level synthesis (e.g., via OpenMP)
● Use of FPGAs through programming in Verilog/VHDL (the FPGA “assembly language”) (highest risk, highest user difficulty)
Beware of Compile Time...
● Compiling a full design for a large FPGA (synthesis + place & route) can take many hours!
● Tile-based designs can help, but can still take tens of minutes!
● Overlay architectures (pre-compiled custom processors and on-chip networks) can help...
[Decision flow: Is the kernel really important in this application? If not, use traditional compilation for an optimized overlay architecture; if so, use high-level synthesis to generate custom hardware.]
Overlay (iDEA)
https://www2.warwick.ac.uk/fac/sci/eng/staff/saf/publications/fpt2012-cheah.pdf
● A very small CPU.
● Runs near the peak clock rate of the block RAM / DSP block!
● Makes use of dynamic configuration of the DSP block.
Overlay (DeCO)
https://www2.warwick.ac.uk/fac/sci/eng/staff/saf/publications/fccm2016-jain.pdf
● Also spatial computing, but with much coarser resources.
● Place & route is much faster!
● Performance is very good.
Each of these is a small soft CPU.
A Toolchain using HLS in Practice?
[Diagram: the compiler (C/C++/Fortran) extracts parallel regions and compiles for the host in the usual way, producing an executable; the parallel regions also go through high-level synthesis and then place & route.]
If placement and routing takes hours, we can't do it this way!
A Toolchain using HLS in Practice?
[Diagram: the compiler (C/C++/Fortran) extracts parallel regions and compiles for the host in the usual way, but high-level synthesis embeds some kind of token in the executable, deferring place & route.]
Challenges Remain...
● OpenMP 4 technology for FPGAs is in its infancy (even less mature than the GPU implementations).
● High-level synthesis technology has come a long way, but is just now starting to give performance competitive with hand-programmed HDL designs.
● CPU + FPGA systems with cache-coherent interconnects are very new.
● High-performance overlay architectures have been created in academia, but none targeting HPC workloads. High-performance on-chip networks are tricky.
● No one has yet created a complete HPC-practical toolchain.
The achievable fraction of theoretical peak performance on many algorithms on GPUs is 50–70%. This is lower than on CPU systems, but CPU systems have higher overhead. In theory, FPGAs offer a high percentage of peak and low overhead, but can that be realized in practice?
Conclusions
✔ FPGA technology offers the most-promising direction toward higher FLOPS/Watt.
✔ FPGAs, soon combined with powerful CPUs, will naturally fit into our accelerator-infused HPC ecosystem.
✔ FPGAs can compete with CPUs/GPUs on traditional workloads while excelling at bioinformatics, machine
learning, and more!
✔ Combining high-level synthesis with overlay architectures can address FPGA programming challenges.
✔ Even so, pulling all of the pieces together will be challenging!
➔ ALCF is supported by DOE/SC under contract DE-AC02-06CH11357
Extra Slides
http://fire.pppl.gov/FESAC_AdvComput_Binkley_041014.pdf
ALCF Systems
https://www.alcf.anl.gov/files/alcfscibro2015.pdf
Current Large-Scale Scientific Computing
http://science.energy.gov/~/media/ascr/ascac/pdf/meetings/201604/2016-0404-ascac-01.pdf
http://science.energy.gov/~/media/ascr/ascac/pdf/meetings/20150324/20150324_ASCAC_02a_No_Backups.pdf
How do we express parallelism?
http://llvm-hpc2-workshop.github.io/slides/Tian.pdf
How do we express parallelism - MPI+X?
http://llvm-hpc2-workshop.github.io/slides/Tian.pdf
OpenMP Evolving Toward Accelerators
http://llvm-hpc2-workshop.github.io/slides/Tian.pdf
New in OpenMP 4
OpenMP Accelerator Support – An Example (SAXPY)
http://llvm-hpc2-workshop.github.io/slides/Wong.pdf
OpenMP Accelerator Support – An Example (SAXPY)
http://llvm-hpc2-workshop.github.io/slides/Wong.pdf
Memory transfer if necessary.
Traditional CPU-targeted OpenMP might only need this directive!
HPC-relevant Parallelism is Coming to C++17!
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4071.htm

using namespace std::execution; // in final C++17, the policies live in std::execution

int a[] = {0, 1};
std::for_each(par, std::begin(a), std::end(a), [&](int i) {
  do_something(i);
});

void f(float* a, float* b) {
  ...
  std::for_each(par_unseq, begin, end, [&](int i) {
    a[i] = b[i] + c;
  });
}

The "par_unseq" execution policy allows for vectorization as well.
Almost as concise as OpenMP, but in many ways more powerful!
HPC-relevant Parallelism is Coming to C++17!
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4071.htm
Clang/LLVM – Where do we stand now?
[Diagram of the compilation flow:]
● Clang (OpenMP 4 support nearly done) – Intel, IBM, and others finishing target offload support
● LLVM, with Polly (polyhedral optimizations)
● SPIR-V (prototypes available, but only for LLVM 3.6; vendor tools not yet ready)
● C (the C backend is not upstream; there is a relatively recent version on GitHub)
● Vendor HLS / OpenCL tools
● Generate VHDL/Verilog directly?
Current FPGA + CPU System
http://www.panoradio-sdr.de/sdr-implementation/fpga-software-design/
The Xilinx Zynq 7020 has two ARM Cortex-A9 cores:
● 53,200 LUTs
● 560 KB SRAM
● 220 DSP slices
http://www.socforhpc.org/wp-content/uploads/2015/06/SBorkar-SoC-WS-DAC-June-7-2015-v1.pptx
Interconnect Energy – Interconnect Structures
● Shared bus: buses over short distance; 1 to 10 fJ/bit; 0 to 5 mm; limited scalability
● Multi-ported memory: shared memory; 10 to 100 fJ/bit; 1 to 5 mm; limited scalability
● Cross-bar switch: 0.1 to 1 pJ/bit; 2 to 10 mm; moderate scalability
● Packet-switched network: 1 to 3 pJ/bit; >5 mm; scalable
CPU and GPU Trends
https://www.hpcwire.com/2016/08/23/2016-important-year-hpc-two-decades/
[Chart: CPU and GPU trends over time; KNL (Knights Landing) marked]
CPU vs. FPGA Efficiency
http://authors.library.caltech.edu/1629/1/DEHcomputer00.pdf
CPUs and FPGAs achieve maximum algorithmic efficiency at polar-opposite ends of the parameter space!
PDF
A High Performance Heterogeneous FPGA-based Accelerator with PyCoRAM (Runner ...
PDF
FPGA_Overview_Ibr_2014
PDF
PyCoRAM: Yet Another Implementation of CoRAM Memory Architecture for Modern F...
PDF
FPGA_BasedGCD
PPTX
Introduction to FPGA acceleration
PDF
FPGA/Reconfigurable computing (HPRC)
PDF
2013 06-ohkawa-heart-presen
PPTX
RISC-V 30907 summit 2020 joint picocom_mentor
A High Performance Heterogeneous FPGA-based Accelerator with PyCoRAM (Runner ...
FPGA_Overview_Ibr_2014
PyCoRAM: Yet Another Implementation of CoRAM Memory Architecture for Modern F...
FPGA_BasedGCD
Introduction to FPGA acceleration
FPGA/Reconfigurable computing (HPRC)
2013 06-ohkawa-heart-presen
RISC-V 30907 summit 2020 joint picocom_mentor

What's hot (20)

PPTX
FPGA workshop
PPTX
An open flow for dn ns on ultra low-power RISC-V cores
PPT
Fpga 03-cpld-and-fpga
PPT
Synopsys User Group Presentation
PPTX
Design options for digital systems
PDF
TRACK F: OpenCL for ALTERA FPGAs, Accelerating performance and design product...
PDF
Session 2,3 FPGAs
PDF
Fpga Device Selection
PDF
FPGA In a Nutshell
PDF
Cpld fpga
PDF
Fpga computing
PPT
Fpga technology
PPTX
An Open Discussion of RISC-V BitManip, trends, and comparisons _ Cuff
PDF
SDVIs and In-Situ Visualization on TACC's Stampede
DOCX
FPGA in outer space seminar report
PPTX
SoC FPGA Technology
PPTX
Dr.s.shiyamala fpga ppt
PPT
NWU and HPC
PPT
Programmable Logic Devices Plds
PPT
FPGA workshop
An open flow for dn ns on ultra low-power RISC-V cores
Fpga 03-cpld-and-fpga
Synopsys User Group Presentation
Design options for digital systems
TRACK F: OpenCL for ALTERA FPGAs, Accelerating performance and design product...
Session 2,3 FPGAs
Fpga Device Selection
FPGA In a Nutshell
Cpld fpga
Fpga computing
Fpga technology
An Open Discussion of RISC-V BitManip, trends, and comparisons _ Cuff
SDVIs and In-Situ Visualization on TACC's Stampede
FPGA in outer space seminar report
SoC FPGA Technology
Dr.s.shiyamala fpga ppt
NWU and HPC
Programmable Logic Devices Plds
Ad

Similar to FPGAs for Supercomputing: The Why and How (20)

PDF
The basic graphics architecture for all modern PCs and game consoles is similar
PDF
On the Capability and Achievable Performance of FPGAs for HPC Applications
PPTX
FPGAs in the cloud? (October 2017)
PDF
FPGA-accelerated High-Performance Computing – Close to Breakthrough or Pipedr...
PDF
INFN Advanced ML Hackaton 2022 Talk
PDF
Gaurav slides
PDF
E3MV - Embedded Vision - Sundance
PPTX
FPGAs versus GPUs in Data centers
PDF
FPGA Embedded Design
PDF
Using a Field Programmable Gate Array to Accelerate Application Performance
PPTX
fpga1 - What is.pptx
PPTX
HiPEAC 2022_Marco Tassemeier presentation
PDF
Can FPGAs Compete with GPUs?
PDF
SCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
PPTX
HiPEAC Computing Systems Week 2022_Mario Porrmann presentation
PDF
Flexible and Scalable Domain-Specific Architectures
PDF
Nikravesh big datafeb2013bt
PDF
Challenges and Opportunities of FPGA Acceleration in Big Data
PPTX
AI Hardware Landscape 2021
PDF
Making the most out of Heterogeneous Chips with CPU, GPU and FPGA
The basic graphics architecture for all modern PCs and game consoles is similar
On the Capability and Achievable Performance of FPGAs for HPC Applications
FPGAs in the cloud? (October 2017)
FPGA-accelerated High-Performance Computing – Close to Breakthrough or Pipedr...
INFN Advanced ML Hackaton 2022 Talk
Gaurav slides
E3MV - Embedded Vision - Sundance
FPGAs versus GPUs in Data centers
FPGA Embedded Design
Using a Field Programmable Gate Array to Accelerate Application Performance
fpga1 - What is.pptx
HiPEAC 2022_Marco Tassemeier presentation
Can FPGAs Compete with GPUs?
SCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
HiPEAC Computing Systems Week 2022_Mario Porrmann presentation
Flexible and Scalable Domain-Specific Architectures
Nikravesh big datafeb2013bt
Challenges and Opportunities of FPGA Acceleration in Big Data
AI Hardware Landscape 2021
Making the most out of Heterogeneous Chips with CPU, GPU and FPGA
Ad

More from DESMOND YUEN (20)

PDF
2022-AI-Index-Report_Master.pdf
PDF
Small Is the New Big
PDF
Intel® Blockscale™ ASIC Product Brief
PDF
Cryptography Processing with 3rd Gen Intel Xeon Scalable Processors
PDF
Intel 2021 Product Security Report
PDF
How can regulation keep up as transformation races ahead? 2022 Global regulat...
PDF
NASA Spinoffs Help Fight Coronavirus, Clean Pollution, Grow Food, More
PDF
A Survey on Security and Privacy Issues in Edge Computing-Assisted Internet o...
PDF
PUTTING PEOPLE FIRST: ITS IS SMART COMMUNITIES AND CITIES
PDF
BUILDING AN OPEN RAN ECOSYSTEM FOR EUROPE
PDF
An Introduction to Semiconductors and Intel
PDF
Changing demographics and economic growth bloom
PDF
Intel’s Impacts on the US Economy
PDF
2021 private networks infographics
PDF
Transforming the Modern City with the Intel-based 5G Smart City Road Side Uni...
PDF
Accelerate Your AI Today
PDF
Increasing Throughput per Node for Content Delivery Networks
PDF
3rd Generation Intel® Xeon® Scalable Processor - Achieving 1 Tbps IPsec with ...
PDF
"Life and Learning After One-Hundred Years: Trust Is The Coin Of The Realm."
PDF
Telefónica views on the design, architecture, and technology of 4G/5G Open RA...
2022-AI-Index-Report_Master.pdf
Small Is the New Big
Intel® Blockscale™ ASIC Product Brief
Cryptography Processing with 3rd Gen Intel Xeon Scalable Processors
Intel 2021 Product Security Report
How can regulation keep up as transformation races ahead? 2022 Global regulat...
NASA Spinoffs Help Fight Coronavirus, Clean Pollution, Grow Food, More
A Survey on Security and Privacy Issues in Edge Computing-Assisted Internet o...
PUTTING PEOPLE FIRST: ITS IS SMART COMMUNITIES AND CITIES
BUILDING AN OPEN RAN ECOSYSTEM FOR EUROPE
An Introduction to Semiconductors and Intel
Changing demographics and economic growth bloom
Intel’s Impacts on the US Economy
2021 private networks infographics
Transforming the Modern City with the Intel-based 5G Smart City Road Side Uni...
Accelerate Your AI Today
Increasing Throughput per Node for Content Delivery Networks
3rd Generation Intel® Xeon® Scalable Processor - Achieving 1 Tbps IPsec with ...
"Life and Learning After One-Hundred Years: Trust Is The Coin Of The Realm."
Telefónica views on the design, architecture, and technology of 4G/5G Open RA...

Recently uploaded (20)

PPTX
Prograce_Present.....ggation_Simple.pptx
PPT
chapter_1_a.ppthduushshwhwbshshshsbbsbsbsbsh
PDF
Layer23-Switch.com The Cisco Catalyst 9300 Series is Cisco’s flagship stackab...
PPT
Hypersensitivity Namisha1111111111-WPS.ppt
PPTX
code of ethics.pptxdvhwbssssSAssscasascc
PDF
Dynamic Checkweighers and Automatic Weighing Machine Solutions
PDF
-DIGITAL-INDIA.pdf one of the most prominent
PPTX
Nanokeyer nano keyekr kano ketkker nano keyer
PDF
Smarter Security: How Door Access Control Works with Alarms & CCTV
PPTX
STEEL- intro-1.pptxhejwjenwnwnenemwmwmwm
DOCX
A PROPOSAL ON IoT climate sensor 2.docx
PPTX
"Fundamentals of Digital Image Processing: A Visual Approach"
PPTX
KVL KCL ppt electrical electronics eee tiet
PPTX
making presentation that do no stick.pptx
PPTX
Embeded System for Artificial intelligence 2.pptx
PPTX
Wireless and Mobile Backhaul Market.pptx
PDF
PPT Determiners.pdf.......................
PPTX
Operating System Processes_Scheduler OSS
PPTX
Computers and mobile device: Evaluating options for home and work
PPTX
Entre CHtzyshshshshshshshzhhzzhhz 4MSt.pptx
Prograce_Present.....ggation_Simple.pptx
chapter_1_a.ppthduushshwhwbshshshsbbsbsbsbsh
Layer23-Switch.com The Cisco Catalyst 9300 Series is Cisco’s flagship stackab...
Hypersensitivity Namisha1111111111-WPS.ppt
code of ethics.pptxdvhwbssssSAssscasascc
Dynamic Checkweighers and Automatic Weighing Machine Solutions
-DIGITAL-INDIA.pdf one of the most prominent
Nanokeyer nano keyekr kano ketkker nano keyer
Smarter Security: How Door Access Control Works with Alarms & CCTV
STEEL- intro-1.pptxhejwjenwnwnenemwmwmwm
A PROPOSAL ON IoT climate sensor 2.docx
"Fundamentals of Digital Image Processing: A Visual Approach"
KVL KCL ppt electrical electronics eee tiet
making presentation that do no stick.pptx
Embeded System for Artificial intelligence 2.pptx
Wireless and Mobile Backhaul Market.pptx
PPT Determiners.pdf.......................
Operating System Processes_Scheduler OSS
Computers and mobile device: Evaluating options for home and work
Entre CHtzyshshshshshshshzhhzzhhz 4MSt.pptx

FPGAs for Supercomputing: The Why and How

  • 1. FPGAs for Supercomputing: The Why and How Hal Finkel2 (hfinkel@anl.gov), Kazutomo Yoshii1 , and Franck Cappello1 1 Mathematics and Computer Science (MCS) 2 Leadership Computing Facility (ALCF) Argonne National Laboratory Advanced Scientific Computing Advisory Committee Tuesday, December 20, 2016 Washington, DC
  • 2. Outline ● Why are FPGAs interesting? ● Can FPGAs competitively accelerate traditional HPC workloads? ● Challenges and potential solutions to FPGA programming.
  • 3. For some things, FPGAs are really good! http://escholarship.org/uc/item/35x310n6 70x faster! bioinformatics
  • 4. For some things, FPGAs are really good! machine learning and neural networks http://ieeexplore.ieee.org/abstract/document/7577314/ FPGA is faster than both the CPU and GPU, 10x more power efficient, and a much higher percentage of peak!
  • 5. http://www.socforhpc.org/wp-content/uploads/2015/06/SBorkar-SoC-WS-DAC-June-7-2015-v1.pptx Parallelism Triumphs As We Head Toward Exascale 1986 1991 1996 2001 2006 2011 2016 2021 1 10 RelativeTransistorPerf Giga Tera Peta Exa 32x from transistor 32x from parallelism 8x from transistor 128x from parallelism 1.5x from transistor 670x from parallelism System performance from parallelism
  • 7. http://www.socforhpc.org/wp-content/uploads/2015/06/SBorkar-SoC-WS-DAC-June-7-2015-v1.pptx To Decrease Energy, Move Data Less! On-die Data Movement vs Compute Interconnect energy (per mm) reduces slower than compute On-die data movement energy will start to dominate 90 65 45 32 22 14 10 7 0 0.2 0.4 0.6 0.8 1 1.2 Technology (nm) Source: Intel On die IC energy/mm Compute energy 6X 60% https://www.semiwiki.com/forum/content/6160-2016-leading-edge-semiconductor-landscape.html
  • 8. Compute vs. Movement – Changes Afoot http://iwcse.phys.ntu.edu.tw/plenary/HorstSimon_IWCSE2013.pdf (2013)
  • 10. Where Does the Power Go (CPU)? http://link.springer.com/article/10.1186/1687-3963-2013-9 (Model with (# register files) x (read ports) x (write ports)) Fetch and decode take most of the energy! More centralized register files means more data movement which takes more power. Only a small portion of the energy goes to the underlying computation. See also: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-2008-130.pdf
  • 11. Modern FPGAs: DSP Blocks and Block RAM http://yosefk.com/blog/category/hardware Design mapped (Place & Route) Intel Stratix 10 will have up to: ● 5760 DSP Blocks = 9.2 SP TFLOPS ● 11721 20Kb Block RAMs = 28MB ● 64-bit 4-core ARM @ 1.5 GHz https://www.altera.com/products/fpga/stratix-series/stratix-10/features.html DSP blocks multiply (Intel/Altera FPGAs have full SP FMA)
  • 12. GFLOPS/Watt (Single Precision) Intel Skylake Intel Knights Landing NVIDIA Pascal Altera Stratix 10 Xilinx Virtex Ultrascale+ 0 20 40 60 80 100 120 GFLOPS/Watt ● http://wccftech.com/massive-intel-xeon-e5-xeon-e7-skylake-purley-biggest-advancement-nehalem/ - Taking 165 W max range ● http://cgo.org/cgo2016/wp-content/uploads/2016/04/sodani-slides.pdf ● http://www.xilinx.com/applications/high-performance-computing.html - Ultrascale+ figure inferred by a 33% performance increase (from Hotchips presentation) ● https://devblogs.nvidia.com/parallelforall/inside-pascal/ ● https://www.altera.com/products/fpga/stratix-series/stratix-10/features.html Marketing Numbers for unreleased products… (be skeptical) Do these FPGA numbers include system memory?
  • 13. GFLOPS/Watt (Single Precision) – Let's be more realistic... Intel Skylake Intel Knights Landing NVIDIA Pascal Altera Stratix 10 Xilinx Virtex Ultrascale+ 0 20 40 60 80 100 120 GFLOPS/Watt ● http://www.tomshardware.com/reviews/intel-core-i7-5960x-haswell-e-cpu,3918-13.html ● https://hal.inria.fr/hal-00686006v2/document ● http://www.eecg.toronto.edu/~davor/papers/capalija_fpl2014_slides.pdf - Tile approach yields 75% of peak clock rate on full device Conclusion: FPGAs are a competitive HPC accelerator technology by 2017! 90% of peak on a CPU is excellent! 70% of peak on a GPU is excellent! Plus system memory: assuming 6W for 16 GB DDR4 (and 150 W for the FPGA)
  • 14. GFLOPS/device (Single Precision) Intel Skylake Intel Knights Landing NVIDIA Pascal Altera Stratix 10 Xilinx Virtex Ultrascale+ 0 2000 4000 6000 8000 10000 12000 GFLOPS ● https://www.altera.com/content/dam/altera-www/global/en_US/pdfs/literature/pt/stratix-10-product-table.pdf - Largest variant with all DSPs doing FMAs @ the 800 MHz max ● http://www.xilinx.com/support/documentation/ip_documentation/ru/floating-point.html ● http://www.xilinx.com/support/documentation/selection-guides/ultrascale-plus-fpga-product-selection-guide.pdf - LUTs, not DSPs, are the limiting resource – filling device with FMAs @ 1 GHz ● https://devblogs.nvidia.com/parallelforall/inside-pascal/ ● http://wccftech.com/massive-intel-xeon-e5-xeon-e7-skylake-purley-biggest-advancement-nehalem/ - 28 cores @ 3.7 GHz * 16 FP ops per cycle * 2 for FMA (assuming same clock rate as the E5-1660 v2) ● http://cgo.org/cgo2016/wp-content/uploads/2016/04/sodani-slides.pdf All in theory...
  • 15. GFLOPS/device (Single Precision) – Let's be more realistic... Intel Skylake Intel Knights Landing NVIDIA Pascal Altera Stratix 10 Xilinx Virtex Ultrascale+ 0 2000 4000 6000 8000 10000 12000 GFLOPS ● https://www.altera.com/content/dam/altera-www/global/en_US/pdfs/literature/wp/wp-01222-understanding-peak-floating-point-performance-claims.pdf ● https://www.altera.com/en_US/pdfs/literature/wp/wp-01028.pdf (old but still useful) 90% of peak on a CPU is excellent! 70% of peak on a GPU is excellent! 80% usage at peak frequency of an FPGA is excellent! Xilinx has no hard FP logic... Reserving 30% of the LUTs for other purposes.
  • 16. For FPGAs, Parallelism is Essential (CPU/GPU)(FPGA) 90nm 90nm65nm http://rssi.ncsa.illinois.edu/proceedings/academic/Williams.pdf (2008)
  • 17. An experiment... ● Nallatech 385A Arria10 board ● 200 – 300 MHz (depend on a design) ● 20 nm ● two DRAM channels. 34.1 ● Sandy Bridge E5-2670 ● 2.6 GHz (3.3 GHz w/ turbo) ● 32 nm ● four DRAM channels. 51.2 GB/s peak
  • 18. An experiment: Power is Measured... ● Intel RAPL is used to measure CPU energy – CPU and memory ● Yokogawa WT310, an external power meter, is used to measure the FPGA power – FPGA_pwr = meter_pwr - host_idle_pwr + FPGA_idle_pwr (~17 W) – Note that meter_pwr includes both CPU and FPGA
  • 19. An experiment: Random Access with Computation using OpenCL ● # work-units is 256 ● CPU: Sandy Bridge (4ch memory) ● FPGA: Arria 10 (2ch memory) for (int i = 0; i < M; i++) { double8 tmp; index = rand() % len; tmp = array[index]; sum += (tmp.s0 + tmp.s1) / 2.0; sum += (tmp.s2 + tmp.s3) / 2.0; sum += (tmp.s4 + tmp.s5) / 2.0; sum += (tmp.s6 + tmp.s7) / 2.0; }
  • 20. An experiment: Random Access with Computation using OpenCL ● # work-units is 256 ● CPU: Sandy Bridge (2ch memory) ● FPGA: Arria 10 (2ch memory) for (int i = 0; i < M; i++) { double8 tmp; index = rand() % len; tmp = array[index]; sum += (tmp.s0 + tmp.s1) / 2.0; sum += (tmp.s2 + tmp.s3) / 2.0; sum += (tmp.s4 + tmp.s5) / 2.0; sum += (tmp.s6 + tmp.s7) / 2.0; } Make the comparison more fair...
  • 21. FPGAs – Molecular Dynamics – Strong Scaling Again! Martin Herbordt (Boston University)
  • 22. FPGAs – Molecular Dynamics – Strong Scaling Again! Martin Herbordt (Boston University)
  • 23. High-End CPU + FPGA Systems Are Coming... ● Intel/Altera are starting to produce Xeon + FPGA systems ● Xilinx are producing ARM + FPGA systems These are not just embedded cores, but state-of-the-art multicore CPUs Low latency and high bandwidth CPU + FPGA systems fit nicely into the HPC accelerator model! (“#pragma omp target” can work for FPGAs too) https://www.nextplatform.com/2016/03/14/intel-marrying-fpga-beefy-broadwell-open-compute-future/
  • 24. Common Algorithm Classes in HPC http://crd.lbl.gov/assets/pubs_presos/CDS/ATG/WassermanSOTON.pdf
  • 25. Common Algorithm Classes in HPC – What do they need? http://crd.lbl.gov/assets/pubs_presos/CDS/ATG/WassermanSOTON.pdf
  • 26. FPGAs Can Help Everyone! ● Compute Bound (FPGAs have lots of compute) ● Memory-Latency Bound (FPGAs can pipeline deeply) ● Memory-Bandwidth Bound (FPGAs can do on-the-fly compression) ● FPGAs have lots of registers ● FPGAs have lots of embedded memory
  • 27. FPGA Programming: Levels of Abstraction ● High-level languages (OpenMP, OpenACC, etc.) → source-to-source → C, C++, SystemC, OpenCL ● High-level synthesis (datapath + controller) → behavior level / RT level (VHDL, Verilog) ● Logic synthesis → gate level (netlist) ● Place & route (Altera/Xilinx toolchains) → bitstream ● Derived from Deming Chen's slide (UIUC).
  • 28. FPGA Programming Techniques (ordered from lowest risk and user difficulty to highest) ● Use FPGAs as accelerators through (vendor-)optimized libraries ● Use of FPGAs through overlay architectures (pre-compiled custom processors) ● Use of FPGAs through high-level synthesis (e.g., via OpenMP) ● Use of FPGAs through programming in Verilog/VHDL (the FPGA “assembly language”)
  • 29. Beware of Compile Time... ● Compiling a full design for a large FPGA (synthesis + place & route) can take many hours! ● Tile-based designs can help, but can still take tens of minutes! ● Overlay architectures (pre-compiled custom processors and on-chip networks) can help... ● Is the kernel really important in this application? If not, use traditional compilation for an optimized overlay architecture; if so, use high-level synthesis to generate custom hardware.
  • 30. Overlay (iDEA) https://www2.warwick.ac.uk/fac/sci/eng/staff/saf/publications/fpt2012-cheah.pdf ● A very small soft CPU. ● Runs near the peak clock rate of the block RAM / DSP block! ● Makes use of dynamic configuration of the DSP block.
  • 31. Overlay (DeCO) https://www2.warwick.ac.uk/fac/sci/eng/staff/saf/publications/fccm2016-jain.pdf ● Also spatial computing, but with much coarser resources. ● Place & Route is much faster! ● Performance is very good. Each of these is a small soft CPU.
  • 32. A Toolchain using HLS in Practice? Compiler (C/C++/Fortran) Executable Extract parallel regions and compile for the host in the usual way High-level Synthesis Place and Route If placement and routing takes hours, we can't do it this way!
  • 33. A Toolchain using HLS in Practice? Compiler (C/C++/Fortran) Executable Extract parallel regions and compile for the host in the usual way High-level Synthesis Place and Route Some kind of token
  • 34. Challenges Remain... ● OpenMP 4 technology for FPGAs is in its infancy (even less mature than the GPU implementations). ● High-level synthesis technology has come a long way, but is only now starting to deliver performance competitive with hand-programmed HDL designs. ● CPU + FPGA systems with cache-coherent interconnects are very new. ● High-performance overlay architectures have been created in academia, but none targeting HPC workloads. High-performance on-chip networks are tricky. ● No one has yet created a complete HPC-practical toolchain. ● Theoretical maximum performance on many algorithms on GPUs is 50-70% of peak. This is lower than on CPU systems, but CPU systems have higher overhead. In theory, FPGAs offer a high percentage of peak and low overhead, but can that be realized in practice?
  • 35. Conclusions ✔ FPGA technology offers the most promising direction toward higher FLOPS/Watt. ✔ FPGAs, soon combined with powerful CPUs, will naturally fit into our accelerator-infused HPC ecosystem. ✔ FPGAs can compete with CPUs/GPUs on traditional workloads while excelling at bioinformatics, machine learning, and more! ✔ Combining high-level synthesis with overlay architectures can address FPGA programming challenges. ✔ Even so, pulling all of the pieces together will be challenging! ➔ ALCF is supported by DOE/SC under contract DE-AC02-06CH11357
  • 43. How do we express parallelism? http://llvm-hpc2-workshop.github.io/slides/Tian.pdf
  • 44. How do we express parallelism - MPI+X? http://llvm-hpc2-workshop.github.io/slides/Tian.pdf
  • 45. OpenMP Evolving Toward Accelerators http://llvm-hpc2-workshop.github.io/slides/Tian.pdf New in OpenMP 4
  • 46. OpenMP Accelerator Support – An Example (SAXPY) http://llvm-hpc2-workshop.github.io/slides/Wong.pdf
  • 47. OpenMP Accelerator Support – An Example (SAXPY) http://llvm-hpc2-workshop.github.io/slides/Wong.pdf Memory transfer if necessary. Traditional CPU-targeted OpenMP might only need this directive!
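A sketch of the kind of SAXPY kernel the slide discusses, written in C with OpenMP 4 accelerator directives (the function name and map clauses here are illustrative, not copied from the slide; on a compiler without device support the pragma is ignored and the loop simply runs on the host, which is what makes the directive approach portable):

```c
#include <stddef.h>

/* SAXPY with OpenMP target offload: x is read-only on the device,
 * y is transferred in and back out. The combined construct maps the
 * loop iterations across the device's teams and threads. */
static void saxpy(size_t n, float a, const float *x, float *y) {
    #pragma omp target teams distribute parallel for \
            map(to: x[0:n]) map(tofrom: y[0:n])
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```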
  • 48. HPC-relevant Parallelism is Coming to C++17! http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4071.htm
    using namespace std::execution::parallel;
    int a[] = {0,1};
    for_each(par, std::begin(a), std::end(a), [&](int i) {
      do_something(i);
    });

    void f(float* a, float* b) {
      ...
      for_each(par_unseq, begin, end, [&](int i) {
        a[i] = b[i] + c;
      });
    }
    The “par_unseq” execution policy allows for vectorization as well. Almost as concise as OpenMP, but in many ways more powerful!
  • 49. HPC-relevant Parallelism is Coming to C++17! http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4071.htm
  • 50. Where do we stand now? ● Clang: OpenMP 4 support nearly done; Intel, IBM, and others finishing target offload support ● LLVM: Polly (polyhedral optimizations) ● SPIR-V: prototypes available, but only for LLVM 3.6; vendor tools not yet ready ● C backend: not upstream; there is a relatively recent version on GitHub ● Vendor HLS / OpenCL tools ● Generate VHDL/Verilog directly?
  • 51. Current FPGA + CPU System http://www.panoradio-sdr.de/sdr-implementation/fpga-software-design/ ● Xilinx Zynq 7020 has two ARM Cortex-A9 cores, 53,200 LUTs, 560 KB SRAM, and 220 DSP slices.
  • 52. Interconnect Energy and Interconnect Structures http://www.socforhpc.org/wp-content/uploads/2015/06/SBorkar-SoC-WS-DAC-June-7-2015-v1.pptx ● Buses over short distance (shared bus): 1 to 10 fJ/bit, 0 to 5 mm, limited scalability ● Shared memory (multi-ported memory): 10 to 100 fJ/bit, 1 to 5 mm, limited scalability ● Cross-bar switch: 0.1 to 1 pJ/bit, 2 to 10 mm, moderate scalability ● Packet-switched network: 1 to 3 pJ/bit, >5 mm, scalable
  • 53. CPU and GPU Trends (KNL marked) https://www.hpcwire.com/2016/08/23/2016-important-year-hpc-two-decades/
  • 54. CPU vs. FPGA Efficiency http://authors.library.caltech.edu/1629/1/DEHcomputer00.pdf ● CPUs and FPGAs achieve maximum algorithmic efficiency at polar-opposite sides of the parameter space!