• Simon Hammond – Sandia National Laboratories (sdhammo@sandia.gov)
• Howard Pritchard – Los Alamos National Laboratory (howardp@lanl.gov)
UNCLASSIFIED UNLIMITED RELEASE
NNSA Explorations:
ARM for Supercomputing
Exciting Time to be in HPC…
Exascale Computing Adoption of ML/AI for HPC New Hardware/Software
What we’ll cover today
• Why Arm – what’s so interesting about it?
• Marvell ThunderX2 overview and comparison with x86_64
• Astra/Vanguard Program
• ASC mini-app and application performance
• Porting to ARM
7/30/19 Unclassified
What’s sooooooo Interesting About Arm?
• In many ways, not much…
• It's just an instruction set
• As long as it can run Fortran, C and C++ we are good, right?
• In other ways, quite a lot is interesting
• Different business model, consortium of implementations
• Open for partners to suggest new instructions
• Broad range of intellectual property opportunities
• Broad(er) range of implementations than, say, x86, POWER, SPARC, etc.
What’s sooooooo Interesting About Arm?
• DOE invests more than $100M in the hardware of a typical supercomputer
(often substantially more than this when the final bill comes in)
• Competition helps to drive down prices and increase innovation
• We want to optimize price/perf for our machines – get the absolute best workload
performance we can for the best price we can buy hardware
• The future is interesting – Arm is an IP company, not an implementation
• What if we could blend existing Arm IP blocks with our own DOE inspired accelerators?
• Build workload optimized processors and computers that benefit DOE scientists?
• e.g. a machine just for designing new materials but one which is 100X faster than today?
• Arm is an opportunity to engage with a broad range of suppliers and an ecosystem
• Not the only way to do this, can partner with traditional vendors like Intel, IBM, AMD etc
7/30/19 Unclassified
Arm is Growing in HPC…
7/30/19 Unclassified
NNSA/ASC Vanguard Program
A proving ground for next-generation HPC technologies in support of the
NNSA mission
http://vanguard.sandia.gov
Astra – the First Petascale Arm-based Supercomputer
7/30/19 Unclassified
Test Beds
• Small testbeds (~10-100 nodes)
• Breadth of architectures
• Brave users

Vanguard
• Larger-scale experimental systems
• Focused efforts to mature new technologies
• Broader user-base
• Not Production
• Tri-lab resource, but not for ATCC runs

ATS/CTS Platforms
• Leadership-class systems (Petascale, Exascale, ...)
• Advanced technologies, sometimes first-of-kind
• Broad user-base
• Production Use

ASC Test Beds → Vanguard → ATS and CTS Platforms
Higher Risk, Greater Architectural Diversity → Greater Scalability, Larger Scale, Focus on Production
Where Vanguard Fits in our Program Strategy
7/30/19 Unclassified
NNSA/ASC Advanced Trilab Software Environment (ATSE) Project
• Advanced Tri-lab Software Environment
• Sandia leading development with input from Tri-lab Arm team
• Will be the user programming environment for Vanguard-Astra
• Partnership across the NNSA/ASC Labs and with HPE
• Lasting value
• Documented specification of:
• Software components needed for HPC production applications
• How they are configured (i.e., what features and capabilities are enabled) and interact
• User interfaces and conventions
• Reference implementation:
• Deployable on multiple ASC systems and architectures with common look and feel
• Tested against real ASC workloads
• Community inspired, focused and supported
ATSE is an integrated software environment for ASC workloads
ATSE
stack
7/30/19 Unclassified
HPE’s HPC Software Stack
HPE:
• HPE MPI (+ XPMEM)
• HPE Cluster Manager
• Arm:
• Arm HPC Compilers
• Arm Math Libraries
• Allinea Tools
• Mellanox-OFED & HPC-X
• RedHat 7.x for aarch64
ATSE Collaboration with HPE’s HPC Software Stack
ATSE
stack
7/30/19 Unclassified
SVE Enablement – Next Generation of SIMD/Vector Instructions
• SVE work is underway
• SVE = Scalable Vector Extension
• Length-agnostic vector instructions at the ISA level
• Using ArmIE (fast emulation) and the RIKEN GEM5 simulator
• GCC and Arm toolchains
• Collaboration with RIKEN
• Visited Sandia (participants from SNL, LANL, LLNL, RIKEN)
• Discussion of performance and simulation techniques
• Deep-dive on SVE (GEM5)
• Short term plan (see the sketch below)
• Use of SVE intrinsics for Kokkos-Kernels SIMD C++/data-parallel types
• Underpins a number of key performance routines for Trilinos libraries
• Seen large (6X) speedups for AVX512 on KNL and Skylake
• Expect to see similar gains for SVE vector units
• Critical performance enablement for Sandia production codes
7/30/19 Unclassified
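To make the "length-agnostic" idea concrete, here is a minimal sketch of an SVE loop written with ACLE intrinsics; the function and array names are hypothetical, and it assumes an SVE-enabled compiler (e.g. GCC or the Arm compiler with -march=armv8-a+sve):

#include <arm_sve.h>

/* y[i] += a * x[i], processed svcntd() doubles at a time. The predicate
   handles the loop tail, so the same binary runs on any SVE vector length. */
void daxpy_sve(double a, const double *x, double *y, long n) {
    for (long i = 0; i < n; i += svcntd()) {
        svbool_t pg = svwhilelt_b64(i, n);        /* active lanes this pass */
        svfloat64_t vx = svld1_f64(pg, &x[i]);
        svfloat64_t vy = svld1_f64(pg, &y[i]);
        vy = svmla_n_f64_m(pg, vy, vx, a);        /* vy += vx * a (predicated) */
        svst1_f64(pg, &y[i], vy);
    }
}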
• Workflows leveraging containers and virtual machines
• Support for machine learning frameworks
• ARMv8.1 includes new virtualization extensions, SR-IOV
• Evaluating parallel filesystems + I/O systems @ scale
• GlusterFS, Ceph, BeeGFS, Sandia Data Warehouse, …
• Resilience studies over Astra lifetime
• Improved MPI thread support, matching acceleration
• OS optimizations for HPC @ scale
• Exploring spectrum from stock distro Linux kernel to HPC-tuned Linux
kernels to non-Linux lightweight kernels and multi-kernels
• Arm-specific optimizations
ATSE
stack
ATSE R&D Efforts – Developing Next-Generation NNSA Workflows
7/30/19 Unclassified
Marvell Thunder X2
7/30/19 Unclassified
ThunderX2 - Second Generation High-End Armv8-A Server SoC
7/30/19 Unclassified
Up to 32 custom Armv8.1 cores, up to 2.5GHz
Full OoO, 1, 2, 4 threads per core
1S and 2S Configuration
Up to 8 DDR4-2667 Memory Controllers, 1 & 2 DPC
Up to 56 lanes of PCIe, 14 PCIe controllers
Full SoC: Integrated SATAv3 USB3 and GPIOs
Server class RAS & Virtualization
Extensive Power Management
LGA and BGA for most flexibility
40+ SKUs (75W – 180W)
7/30/19 Unclassified
ThunderX2 Comparison with Xeon Processors

|                        | Marvell ThunderX2 | Haswell E5-2698 v3 | Broadwell E5-2695 | Skylake Gold 6152 |
| Cores/Socket           | 32 (max 4 HT)     | 16 (2 HT)          | 22 (2 HT)         | 22 (2 HT)         |
| L1 Cache/Core          | 32KB I/D (8-way)  | 32KB I/D (8-way)   | 32KB I/D (8-way)  | 32KB I/D (8-way)  |
| L2 Cache/Core          | 256KB (8-way)     | 256KB (8-way)      | 256KB (8-way)     | 1MB (16-way)      |
| L3 Cache/Socket        | 32 MB             | 40 MB              | 33 MB             | 30.25 MB          |
| Memory Channels/Socket | 8 DDR4            | 4 DDR4             | 4 DDR4            | 6 DDR4            |
| Base Clock Rate        | 2.2 GHz           | 2.3 GHz            | 2.2 GHz           | 2.1 GHz           |
| Vector/SIMD Length     | 128b (NEON)       | 256b (AVX2)        | 256b (AVX2)       | 512b (AVX512)     |
7/30/19 Unclassified
Roofline Comparison

[Figure: a sequence of roofline plots, Attainable GFlop/s vs. Arithmetic Intensity (Flop/Byte), adding one processor per panel: Marvell TX2-CN9980 (560 GFlops peak, ~170 GB/s), Intel Skylake-8168 (2.08 TFlops, ~127 GB/s), Fujitsu A64FX (2.99 TFlops, ~1024 GB/s), Huawei Kunpeng920 (1.33 TFlops), AMD EPYC Naples/7601 (1.13 TFlops), AMD EPYC Rome (2.41 TFlops), Amazon Graviton (294 GFlops, ~42 GB/s), Intel KNL-7250 (3.04 TFlops, ~400 GB/s), and NVIDIA Tesla V100 (7.5 TFlops, ~900 GB/s). Arithmetic intensities are marked for the FDTD-Elastic 4th-order, FDTD-Acoustic (ISO/TTI/VTI) 8th-order, and SEM-Elastic 4th-order kernels.]
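For reference, the ceiling plotted in these figures follows the standard roofline model (not specific to these slides):

Attainable GFlop/s = min( peak GFlop/s, arithmetic intensity (Flop/Byte) × memory bandwidth (GB/s) )

For example, at an arithmetic intensity of 1 Flop/Byte the TX2-CN9980 is bandwidth-bound at roughly 170 GFlop/s, well below its 560 GFlop/s compute peak.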
STREAM Triad Bandwidth
• ThunderX2 provides the highest bandwidth of all processors
• Vectorization makes no discernible difference to performance at large core counts
• Around 10% higher with NEON at smaller core counts (5 – 14)
• A significant number of kernels in HPC are bound by the rate at which they can load/store to memory ("memory bandwidth bound")
• Makes high memory bandwidth desirable
• Ideally want to get to these bandwidths without needing to vectorize
7/30/19 Unclassified
[Chart: Measured Bandwidth (GB/s) vs. Processor Cores (up to ~60) for ThunderX2 (NEON / No Vec), Skylake (AVX512 / No Vec), and Haswell (AVX2 / No Vec). Higher is better.]
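For context, STREAM Triad measures sustained memory bandwidth with a single fused multiply-add loop; below is a minimal sketch (array size, scalar, and timing are illustrative, not the official benchmark configuration):

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N (1L << 26)   /* ~512 MB per array: large enough to defeat caches */

int main(void) {
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for            /* triad: a = b + scalar * c */
    for (long i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];
    double t1 = omp_get_wtime();

    /* three arrays, 8 bytes per element moved */
    printf("Triad bandwidth: %.1f GB/s\n", 3.0 * N * 8 / (t1 - t0) / 1e9);
    free(a); free(b); free(c);
    return 0;
}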
[Chart: Measured Bandwidth (GB/s) vs. Data Array Size (up to ~2.5x10^6), separate read and write curves for Haswell, Skylake, and ThunderX2.]
Cache Performance
• Haswell has the highest per-core bandwidth (read and write) at L1, slower at L2
• Skylake's redesigned cache sizes (larger L2, smaller L3) show up in the graph
• Higher performance for certain working-set sizes (typical for unstructured codes)
• TX2 has more uniform bandwidth at larger sizes (less asymmetry between read and write)
7/30/19 Unclassified
Higher is better; the larger L2 capacity for Skylake is visible in the graph.
[Chart: Measured Performance (GF/s) vs. Processor Cores (up to ~30) for Skylake, Haswell, and ThunderX2 running DGEMM.]
DGEMM Compute Performance
• ThunderX2 has similar performance at scale to Haswell
• Roughly twice as many cores (TX2)
• Half the vector width (TX2 vs. HSW)
• Strata in the Intel MKL results are usually a result of matrix-size kernel optimization
• Arm PL provides smoother performance results (essentially linear growth); see the CBLAS sketch below
7/30/19 Unclassified
Higher is better
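As a usage illustration, DGEMM here is just a call into the vendor BLAS; a minimal sketch using the standard CBLAS interface, which both Arm Performance Libraries and Intel MKL provide (matrix size and values are arbitrary):

#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>

int main(void) {
    const int n = 1024;
    double *A = malloc(sizeof(double) * n * n);
    double *B = malloc(sizeof(double) * n * n);
    double *C = malloc(sizeof(double) * n * n);
    for (int i = 0; i < n * n; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

    /* C = 1.0 * A * B + 0.0 * C, all row-major n x n */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);

    printf("C[0][0] = %.1f\n", C[0]);   /* expect n * 1.0 * 2.0 = 2048.0 */
    free(A); free(B); free(C);
    return 0;
}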
Floating Point Performance Sanity Check: HPL
7/30/19 Unclassified
• ThunderX2 has about half the floating point capacity of comparable Xeon CPUs
• Xeon 8180 vs. ThunderX2
• HPL.dat input used:

163840  Ns
256     NBs
0       PMAP process mapping (0=Row-, 1=Column-major)
7       Ps
8       Qs
1       PFACTs (0=left, 1=Crout, 2=Right)
2       RFACTs (0=left, 1=Crout, 2=Right)
0       BCASTs (0=1rg, 1=1rM, 2=2rg, 3=2rM, 4=Lng, 5=LnM)
0       DEPTHs (>=0)
2       SWAP (0=bin-exch, 1=long, 2=mix)
64      swapping threshold
[Chart: HPL GFLOPS at N=163840 (200GB): Xeon 8180 ≈ 2.00E+03; ThunderX2 SMT=2+Turbo ≈ 8.82E+02; ThunderX2 SMT=4 w/o Turbo ≈ 4.99E+02.]
Results from using Astra and other TX2 Platforms
Applications
7/30/19 Unclassified
[Chart: Giga-Updates/Second (GUP/s) vs. Processor Cores (up to ~30) for ThunderX2, Skylake, and Haswell, all without vectorization.]
GUPS Random Access
• Running all processors in SMT-1 mode; SMT(>1) usually gives better performance
• Expect SMT2/4 on TX2 to give better numbers
• Usually more cores give higher performance (more load/store units driving requests)
• Typical for TLB performance to be a limiter
• Need to consider larger pages for future runs (see the sketch below)
7/30/19 Unclassified
Higher is better
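To show why this benchmark stresses the TLB, here is a toy GUPS-style update loop (a sketch, not the official HPCC RandomAccess kernel; the table size and update count are illustrative):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define TABLE_BITS 26                       /* 2^26 * 8B = 512 MB table */
#define TABLE_SIZE (1UL << TABLE_BITS)

int main(void) {
    uint64_t *table = malloc(TABLE_SIZE * sizeof(uint64_t));
    for (uint64_t i = 0; i < TABLE_SIZE; i++) table[i] = i;

    uint64_t x = 1;                         /* cheap LCG stream of indices */
    for (uint64_t i = 0; i < 4 * TABLE_SIZE; i++) {
        x = x * 6364136223846793005UL + 1442695040888963407UL;
        /* random read-modify-write: with 4 KB pages almost every update
           misses the TLB, which is why larger pages help */
        table[x >> (64 - TABLE_BITS)] ^= x;
    }
    printf("table[0] = %llu\n", (unsigned long long)table[0]);
    free(table);
    return 0;
}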
[Chart: Figure of Merit (Zones/s) vs. Processor Cores (up to ~30) for Skylake (AVX512 / No Vec), Haswell (AVX2 / No Vec), and ThunderX2 (NEON / No Vec).]
LULESH Hydrodynamics Mini-App
• Typically fairly intensive L2 accesses for an unstructured mesh (although LULESH is a regular structure in unstructured format)
• Expect slightly higher performance with SMT(>1) modes for all processors
7/30/19 Unclassified
Higher is better
[Chart: Figure of Merit (Lookups/s) vs. Processor Cores (up to ~30) for Skylake (AVX512 / No Vec), Haswell (AVX2 / No Vec), and ThunderX2 (NEON / No Vec).]
XSBench Cross-Section Lookup Mini-App
• Two-level random-like access into memory: look up in a first table, then use indirection to reach a second lookup (see the sketch below)
• Random access, but search-like, so vectors can help
• See gains on Haswell and Skylake, which both have vector-gather support
• No support for gather in NEON
• XSBench is mostly read-only (gather)
7/30/19 Unclassified
Higher is better
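To make the access pattern concrete, a toy two-level indirection lookup (the names and sizes are hypothetical): the second load is data-dependent, and AVX2/AVX512 can issue several such loads at once via gather instructions, while NEON cannot.

#include <stdint.h>

/* Level 1: a key selects an entry in an index table (random access).
   Level 2: that entry points into a data table (indirection). */
double lookup(const int32_t *index_table, const double *data_table,
              uint64_t key, int32_t n_index) {
    int32_t first = index_table[key % n_index];
    return data_table[first];
}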
Branson Mini-App and Benchmark
7/30/19 Unclassified
• Monte Carlo based radiation transport mini-app
• Lots of time spent in math intrinsics (exp, log, sin, cos); benefits from Arm-optimized math routines
• Poor memory locality; benefits somewhat from large pages
• Doesn't vectorize
• Random number generator not yet optimized for ARM
• On a per-node basis, TX2 is on par with SKL-gold
• Need to improve vectorizability
[Chart: performance relative to SKL-gold vs. MPI processes (1 – 64) for TX2, TX2+armpl, and SKL-vec.]
EMPIRE on Astra
Trinity HSW: 32 MPI x 1 OMP; Astra TX2: 56 MPI x 1 OMP
Strong and weak scaling studies for EMPIRE-PIC on the awesome blob test case
Missing Trinity XL mesh 512 and 4096 node results because of a MueLu FPE
Missing Astra XL mesh 2048 node results because of a MueLu FPE
Work by Paul Lin. 7/30/19 Unclassified
EMPIRE on Astra
• TX2 node has ~2x memory bandwidth and 1.75x cores (56 vs. 32) of Trinity HSW
node
• (HSW time)/(TX2 time) > 1 means TX2 is faster
• Strong scaling for awesome blob: small mesh (1-8 nodes), medium mesh (8-64 nodes), large mesh (64-512 nodes)
• (HSW time)/(TX2 time) for the linear solve is not great: this is a low computation/communication regime
(Good)
7/30/19 Unclassified Work by Paul Lin
• TX2 node has ~2x memory bandwidth and 1.75x cores (56 vs. 32) of Trinity HSW
node
• (HSW time)/(TX2 time) > 1 means TX2 is faster
• Strong scaling for awesome blob: medium mesh (1-8 nodes), large mesh (8-64 nodes)
• (HSW time)/(TX2 time) for the linear solve is definitely better than on the previous slide, due to increased computation/communication
EMPIRE on Astra
(Good)
7/30/19 Unclassified Work by Paul Lin
xRAGE
7/30/19 Unclassified
• Eulerian-based hydrodynamics/radiation transport application
• Uses adaptive mesh refinement
• Significant amount of gather/scatter
• Does not currently benefit from AVX2/512 vectorization
• Memory bound
[Chart: Walltime (secs), lower is better, at 8/16/32 nodes for TX2 (50ppn), BWL (48ppn), and SKL (56ppn). Results from a Cray XC50 using the Cray CCE9 compiler.]
PARTISN
7/30/19 Unclassified
• Neutron transport code – deterministic SN method
• Sensitive to cache performance; not typically memory bound
• Vectorizes well for AVX512 and NEON
• Can be run mixed MPI/OpenMP
• Limited by cache bandwidth on TX2 and front-end stalls
[Chart: performance relative to BWL-vec, higher is better, vs. MPI processes (1 – 32) for TX2, SKL-novec, and SKL-vec.]
PARTISN can benefit from 4 SMTs/core
7/30/19 Unclassified
• Example of a code with significant front-end stalls
• TAU Commander indicates a high rate of branch misprediction in the sweep kernel
Cray XC50 - CCE 9.0 compiler
RIKEN Fiber Benchmarks – Compiler Performance Comparison
7/30/19 Unclassified
• Comparison of the Cray 8/9 compilers against Allinea 19 using the RIKEN Fiber benchmarks
• Results are mixed; no clear winner in terms of compilers
• Takeaway: try building your app with several compilers
Cray XC50 - CCE 9.0 compiler. Lower is better.
Early Results from Astra
7/30/19 Unclassified
System has been online for around two weeks; an incredible team working round the clock is already running full application ports and many of our key frameworks.
Baseline: Trinity ASC Platform (current production), dual-socket Haswell
Speedups vs. baseline: CFD Models 1.60X | Hydrodynamics 1.45X | Molecular Dynamics 1.30X | Monte Carlo 1.42X | Linear Solvers 1.87X
Porting to ARM
7/30/19 Unclassified
Sanity Checks
• See if your software has already been ported to aarch64:
• www.gitlab.com/arm-hpc/packages/wikis
• See if it's available via Spack: https://github.com/spack/spack
• Don't use old compilers:
• GCC 8.2 or newer; 9.1 is better
• Allinea armflang/armclang 19.0 or newer
• If your package relies on system packages in performance-critical areas, you may want to build your own versions; libraries that come with the base release are not optimized for ThunderX2
• If your application has lots of dependencies, this may be a good time to learn how to use Spack (see the example below)
• Check out training material at https://gitlab.com/arm-hpc/training
7/30/19 Unclassified
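For instance, a minimal Spack workflow for building a tuned dependency might look like the following (the package name, compiler version, and target string are illustrative; check spack arch and spack compilers for the names your installation reports):

spack compiler find                                  # register available compilers
spack install openblas %gcc@9.1.0 target=thunderx2   # build optimized for ThunderX2
spack load openblas                                  # bring it into the environment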
7/30/19 Unclassified
Porting Cheat Sheet
Ensure all dependencies have been ported.
•Arm HPC Packages Wiki: https://gitlab.com/arm-hpc/packages/wikis/categories/allPackages
Update or patch autotools and libtool as needed
•wget 'http://git.savannah.gnu.org/gitweb/?p=config.git;a=blob_plain;f=config.guess;hb=HEAD' -O config.guess
•wget 'http://git.savannah.gnu.org/gitweb/?p=config.git;a=blob_plain;f=config.sub;hb=HEAD' -O config.sub
•sed -i -e 's#wl=""#wl="-Wl,"#g' libtool
•sed -i -e 's#pic_flag=""#pic_flag=" -fPIC -DPIC"#g' libtool
Update build system to use the right compiler and architecture
•Check #ifdef in Makefiles. Use other architectures as a template.
Use the right compiler flags
•Start with -mcpu=native -Ofast
Avoid non-standard compiler extensions and language features
•Arm compiler team is actively adding new “unique” features, but it’s best to stick to the standard.
Update hard-wired intrinsics for other architectures
•https://developer.arm.com/technologies/neon/intrinsics
•Worst case: fall back to a slow scalar version (see the sketch below).
Update, and possibly fix, your test suite
•Regression tests are a porter’s best friend.
•Beware of tests that expect exactly the same answer on all architectures!
Know architectural features and what they mean for your code
•Arm’s weak memory model.
•Division by zero is silently zero on Arm.
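As an illustration of the hard-wired intrinsics item above, a hedged sketch of guarding an x86 SSE path with a NEON equivalent plus a portable scalar fallback (the function is hypothetical):

/* Add two 4-float vectors with whichever SIMD set the target provides;
   the plain-C branch is the "worst case: slow code" path. */
#if defined(__ARM_NEON)
  #include <arm_neon.h>
  void add4(const float *a, const float *b, float *out) {
      vst1q_f32(out, vaddq_f32(vld1q_f32(a), vld1q_f32(b)));
  }
#elif defined(__SSE__)
  #include <xmmintrin.h>
  void add4(const float *a, const float *b, float *out) {
      _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
  }
#else
  void add4(const float *a, const float *b, float *out) {
      for (int i = 0; i < 4; i++) out[i] = a[i] + b[i];
  }
#endif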
Questions?
7/30/19 Unclassified