• Simon Hammond – Sandia National Laboratories (sdhammo@sandia.gov)
• Howard Pritchard – Los Alamos National Laboratory (howardp@lanl.gov)
UNCLASSIFIED UNLIMITED RELEASE
NNSA Explorations:
ARM for Supercomputing
Exciting Time to be in HPC…
Exascale Computing Adoption of ML/AI for HPC New Hardware/Software
What we’ll cover today
• Why Arm – what’s so interesting about it?
• Marvell ThunderX2 overview and comparison with x86_64
• Astra/Vanguard Program
• ASC mini-app and application performance
• Porting to ARM
7/30/19 Unclassified
What’s sooooooo Interesting About Arm?
• In many ways, not much…
• It's just an instruction set
• As long as it can run Fortran, C and C++ we are good, right?
• In other ways, quite a lot is interesting
• Different business model, consortium of implementations
• Open for partners to suggest new instructions
• Broad range of intellectual property opportunities
• Broad(er) range of implementations than, say, x86, POWER, SPARC, etc.
What’s sooooooo Interesting About Arm?
• DOE invests more than $100M in the hardware of a typical supercomputer
(often substantially more than this when the final bill comes in)
• Competition helps to drive down prices and increase innovation
• We want to optimize price/perf for our machines – get the absolute best workload
performance we can for the best price we can buy hardware
• The future is interesting – Arm is an IP company, not an implementation
• What if we could blend existing Arm IP blocks with our own DOE inspired accelerators?
• Build workload optimized processors and computers that benefit DOE scientists?
• e.g. a machine just for designing new materials but one which is 100X faster than today?
• Arm is an opportunity to engage with a broad range of suppliers and an ecosystem
• Not the only way to do this, can partner with traditional vendors like Intel, IBM, AMD etc
7/30/19 Unclassified
Arm is Growing in HPC…
7/30/19 Unclassified
NNSA/ASC Vanguard Program
A proving ground for next-generation HPC technologies in support of the
NNSA mission
http://vanguard.sandia.gov
Astra – the First Petascale Arm-based Supercomputer
7/30/19 Unclassified
Test Beds
• Small testbeds (~10-100 nodes)
• Breadth of architectures
• Brave users

Vanguard
• Larger-scale experimental systems
• Focused efforts to mature new technologies
• Broader user-base
• Not Production
• Tri-lab resource, but not for ATCC runs

ATS/CTS Platforms
• Leadership-class systems (Petascale, Exascale, ...)
• Advanced technologies, sometimes first-of-kind
• Broad user-base
• Production Use

ASC Test Beds → Vanguard → ATS and CTS Platforms
Higher Risk, Greater Architectural Diversity → Greater Scalability, Larger Scale, Focus on Production
Where Vanguard Fits in our Program Strategy
7/30/19 Unclassified
NNSA/ASC Advanced Trilab Software Environment (ATSE) Project
• Advanced Tri-lab Software Environment
• Sandia leading development with input from Tri-lab Arm team
• Will be the user programming environment for Vanguard-Astra
• Partnership across the NNSA/ASC Labs and with HPE
• Lasting value
• Documented specification of:
• Software components needed for HPC production applications
• How they are configured (i.e., what features and capabilities are enabled) and interact
• User interfaces and conventions
• Reference implementation:
• Deployable on multiple ASC systems and architectures with common look and feel
• Tested against real ASC workloads
• Community inspired, focused and supported
ATSE is an integrated software environment for ASC workloads
ATSE
stack
7/30/19 Unclassified
HPE’s HPC Software Stack
HPE:
• HPE MPI (+ XPMEM)
• HPE Cluster Manager
• Arm:
• Arm HPC Compilers
• Arm Math Libraries
• Allinea Tools
• Mellanox-OFED & HPC-X
• RedHat 7.x for aarch64
ATSE Collaboration with HPE’s HPC Software Stack
ATSE
stack
7/30/19 Unclassified
SVE Enablement – Next Generation of SIMD/Vector Instructions
• SVE work is underway
• SVE = Scalable Vector Extension
• Length-agnostic vector instructions at the ISA level
• Using ArmIE (fast emulation) and the RIKEN GEM5 simulator
• GCC and Arm toolchains
• Collaboration with RIKEN
• Visited Sandia (participants from SNL, LANL, LLNL, RIKEN)
• Discussion of performance and simulation techniques
• Deep-dive on SVE (GEM5)
• Short term plan (see the sketch below)
• Use of SVE intrinsics for Kokkos-Kernels SIMD C++/data-parallel types
• Underpins a number of key performance routines for Trilinos libraries
• Seen large (6X) speedups for AVX512 on KNL and Skylake
• Expect to see similar gains for SVE vector units
• Critical performance enablement for Sandia production codes
7/30/19 Unclassified
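To make the "length-agnostic" idea concrete, here is a minimal sketch of an SVE loop written with ACLE intrinsics; the function and array names are hypothetical, and it assumes an SVE-enabled compiler (e.g. GCC or the Arm compiler with -march=armv8-a+sve):

#include <arm_sve.h>

/* y[i] += a * x[i], processed svcntd() doubles at a time. The predicate
   handles the loop tail, so the same binary runs on any SVE vector length. */
void daxpy_sve(double a, const double *x, double *y, long n) {
    for (long i = 0; i < n; i += svcntd()) {
        svbool_t pg = svwhilelt_b64(i, n);        /* active lanes this pass */
        svfloat64_t vx = svld1_f64(pg, &x[i]);
        svfloat64_t vy = svld1_f64(pg, &y[i]);
        vy = svmla_n_f64_m(pg, vy, vx, a);        /* vy += vx * a (predicated) */
        svst1_f64(pg, &y[i], vy);
    }
}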
• Workflows leveraging containers and virtual machines
• Support for machine learning frameworks
• ARMv8.1 includes new virtualization extensions, SR-IOV
• Evaluating parallel filesystems + I/O systems @ scale
• GlusterFS, Ceph, BeeGFS, Sandia Data Warehouse, …
• Resilience studies over Astra lifetime
• Improved MPI thread support, matching acceleration
• OS optimizations for HPC @ scale
• Exploring spectrum from stock distro Linux kernel to HPC-tuned Linux
kernels to non-Linux lightweight kernels and multi-kernels
• Arm-specific optimizations
ATSE
stack
ATSE R&D Efforts – Developing Next-Generation NNSA Workflows
7/30/19 Unclassified
Marvell Thunder X2
7/30/19 Unclassified
ThunderX2 - Second Generation High-End Armv8-A Server SoC
7/30/19 Unclassified
Up to 32 custom Armv8.1 cores, up to 2.5GHz
Full OoO, 1, 2, 4 threads per core
1S and 2S Configuration
Up to 8 DDR4-2667 Memory Controllers, 1 & 2 DPC
Up to 56 lanes of PCIe, 14 PCIe controllers
Full SoC: Integrated SATAv3 USB3 and GPIOs
Server class RAS & Virtualization
Extensive Power Management
LGA and BGA for most flexibility
40+ SKUs (75W – 180W)
7/30/19 Unclassified
ThunderX2 Comparison with Xeon Processors

|                        | Marvell ThunderX2 | Haswell E5-2698 v3 | Broadwell E5-2695 | Skylake Gold 6152 |
| Cores/Socket           | 32 (max 4 HT)     | 16 (2 HT)          | 22 (2 HT)         | 22 (2 HT)         |
| L1 Cache/Core          | 32KB I/D (8-way)  | 32KB I/D (8-way)   | 32KB I/D (8-way)  | 32KB I/D (8-way)  |
| L2 Cache/Core          | 256KB (8-way)     | 256KB (8-way)      | 256KB (8-way)     | 1MB (16-way)      |
| L3 Cache/Socket        | 32 MB             | 40 MB              | 33 MB             | 30.25 MB          |
| Memory Channels/Socket | 8 DDR4            | 4 DDR4             | 4 DDR4            | 6 DDR4            |
| Base Clock Rate        | 2.2 GHz           | 2.3 GHz            | 2.2 GHz           | 2.1 GHz           |
| Vector/SIMD Length     | 128b (NEON)       | 256b (AVX2)        | 256b (AVX2)       | 512b (AVX512)     |
7/30/19 Unclassified
Roofline Comparison

[Figure: a sequence of roofline plots, Attainable GFlop/s vs. Arithmetic Intensity (Flop/Byte), adding one processor per panel: Marvell TX2-CN9980 (560 GFlops peak, ~170 GB/s), Intel Skylake-8168 (2.08 TFlops, ~127 GB/s), Fujitsu A64FX (2.99 TFlops, ~1024 GB/s), Huawei Kunpeng920 (1.33 TFlops), AMD EPYC Naples/7601 (1.13 TFlops), AMD EPYC Rome (2.41 TFlops), Amazon Graviton (294 GFlops, ~42 GB/s), Intel KNL-7250 (3.04 TFlops, ~400 GB/s), and NVIDIA Tesla V100 (7.5 TFlops, ~900 GB/s). Arithmetic intensities are marked for the FDTD-Elastic 4th-order, FDTD-Acoustic (ISO/TTI/VTI) 8th-order, and SEM-Elastic 4th-order kernels.]
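For reference, the ceiling plotted in these figures follows the standard roofline model (not specific to these slides):

Attainable GFlop/s = min( peak GFlop/s, arithmetic intensity (Flop/Byte) × memory bandwidth (GB/s) )

For example, at an arithmetic intensity of 1 Flop/Byte the TX2-CN9980 is bandwidth-bound at roughly 170 GFlop/s, well below its 560 GFlop/s compute peak.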
STREAM Triad Bandwidth
• ThunderX2 provides the highest bandwidth of all processors
• Vectorization makes no discernible difference to performance at large core counts
• Around 10% higher with NEON at smaller core counts (5 – 14)
• A significant number of kernels in HPC are bound by the rate at which they can load/store to memory ("memory bandwidth bound")
• Makes high memory bandwidth desirable
• Ideally want to get to these bandwidths without needing to vectorize
7/30/19 Unclassified
[Chart: Measured Bandwidth (GB/s) vs. Processor Cores (up to ~60) for ThunderX2 (NEON / No Vec), Skylake (AVX512 / No Vec), and Haswell (AVX2 / No Vec). Higher is better.]
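For context, STREAM Triad measures sustained memory bandwidth with a single fused multiply-add loop; below is a minimal sketch (array size, scalar, and timing are illustrative, not the official benchmark configuration):

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N (1L << 26)   /* ~512 MB per array: large enough to defeat caches */

int main(void) {
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for            /* triad: a = b + scalar * c */
    for (long i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];
    double t1 = omp_get_wtime();

    /* three arrays, 8 bytes per element moved */
    printf("Triad bandwidth: %.1f GB/s\n", 3.0 * N * 8 / (t1 - t0) / 1e9);
    free(a); free(b); free(c);
    return 0;
}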
[Chart: Measured Bandwidth (GB/s) vs. Data Array Size (up to ~2.5x10^6), separate read and write curves for Haswell, Skylake, and ThunderX2.]
Cache Performance
• Haswell has the highest per-core bandwidth (read and write) at L1, slower at L2
• Skylake's redesigned cache sizes (larger L2, smaller L3) show up in the graph
• Higher performance for certain working-set sizes (typical for unstructured codes)
• TX2 has more uniform bandwidth at larger sizes (less asymmetry between read and write)
7/30/19 Unclassified
Higher is better; the larger L2 capacity for Skylake is visible in the graph.
[Chart: Measured Performance (GF/s) vs. Processor Cores (up to ~30) for Skylake, Haswell, and ThunderX2 running DGEMM.]
DGEMM Compute Performance
• ThunderX2 has similar performance at scale to Haswell
• Roughly twice as many cores (TX2)
• Half the vector width (TX2 vs. HSW)
• Strata in the Intel MKL results are usually a result of matrix-size kernel optimization
• Arm PL provides smoother performance results (essentially linear growth); see the CBLAS sketch below
7/30/19 Unclassified
Higher is better
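As a usage illustration, DGEMM here is just a call into the vendor BLAS; a minimal sketch using the standard CBLAS interface, which both Arm Performance Libraries and Intel MKL provide (matrix size and values are arbitrary):

#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>

int main(void) {
    const int n = 1024;
    double *A = malloc(sizeof(double) * n * n);
    double *B = malloc(sizeof(double) * n * n);
    double *C = malloc(sizeof(double) * n * n);
    for (int i = 0; i < n * n; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

    /* C = 1.0 * A * B + 0.0 * C, all row-major n x n */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);

    printf("C[0][0] = %.1f\n", C[0]);   /* expect n * 1.0 * 2.0 = 2048.0 */
    free(A); free(B); free(C);
    return 0;
}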
Floating Point Performance Sanity Check: HPL
7/30/19 Unclassified
• ThunderX2 has about half the floating point capacity of comparable Xeon CPUs
• Xeon 8180 vs. ThunderX2
• HPL.dat input used:

163840  Ns
256     NBs
0       PMAP process mapping (0=Row-, 1=Column-major)
7       Ps
8       Qs
1       PFACTs (0=left, 1=Crout, 2=Right)
2       RFACTs (0=left, 1=Crout, 2=Right)
0       BCASTs (0=1rg, 1=1rM, 2=2rg, 3=2rM, 4=Lng, 5=LnM)
0       DEPTHs (>=0)
2       SWAP (0=bin-exch, 1=long, 2=mix)
64      swapping threshold
[Chart: HPL GFLOPS at N=163840 (200GB): Xeon 8180 ≈ 2.00E+03; ThunderX2 SMT=2+Turbo ≈ 8.82E+02; ThunderX2 SMT=4 w/o Turbo ≈ 4.99E+02.]
Results from using Astra and other TX2 Platforms
Applications
7/30/19 Unclassified
[Chart: Giga-Updates/Second (GUP/s) vs. Processor Cores (up to ~30) for ThunderX2, Skylake, and Haswell, all without vectorization.]
GUPS Random Access
• Running all processors in SMT-1 mode; SMT(>1) usually gives better performance
• Expect SMT2/4 on TX2 to give better numbers
• Usually more cores give higher performance (more load/store units driving requests)
• Typical for TLB performance to be a limiter
• Need to consider larger pages for future runs (see the sketch below)
7/30/19 Unclassified
Higher is better
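To show why this benchmark stresses the TLB, here is a toy GUPS-style update loop (a sketch, not the official HPCC RandomAccess kernel; the table size and update count are illustrative):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define TABLE_BITS 26                       /* 2^26 * 8B = 512 MB table */
#define TABLE_SIZE (1UL << TABLE_BITS)

int main(void) {
    uint64_t *table = malloc(TABLE_SIZE * sizeof(uint64_t));
    for (uint64_t i = 0; i < TABLE_SIZE; i++) table[i] = i;

    uint64_t x = 1;                         /* cheap LCG stream of indices */
    for (uint64_t i = 0; i < 4 * TABLE_SIZE; i++) {
        x = x * 6364136223846793005UL + 1442695040888963407UL;
        /* random read-modify-write: with 4 KB pages almost every update
           misses the TLB, which is why larger pages help */
        table[x >> (64 - TABLE_BITS)] ^= x;
    }
    printf("table[0] = %llu\n", (unsigned long long)table[0]);
    free(table);
    return 0;
}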
[Chart: Figure of Merit (Zones/s) vs. Processor Cores (up to ~30) for Skylake (AVX512 / No Vec), Haswell (AVX2 / No Vec), and ThunderX2 (NEON / No Vec).]
LULESH Hydrodynamics Mini-App
• Typically fairly intensive L2 accesses for an unstructured mesh (although LULESH is a regular structure in unstructured format)
• Expect slightly higher performance with SMT(>1) modes for all processors
7/30/19 Unclassified
Higher is better
[Chart: Figure of Merit (Lookups/s) vs. Processor Cores (up to ~30) for Skylake (AVX512 / No Vec), Haswell (AVX2 / No Vec), and ThunderX2 (NEON / No Vec).]
XSBench Cross-Section Lookup Mini-App
• Two-level random-like access into memory: look up in a first table, then use indirection to reach a second lookup (see the sketch below)
• Random access, but search-like, so vectors can help
• See gains on Haswell and Skylake, which both have vector-gather support
• No support for gather in NEON
• XSBench is mostly read-only (gather)
7/30/19 Unclassified
Higher is better
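To make the access pattern concrete, a toy two-level indirection lookup (the names and sizes are hypothetical): the second load is data-dependent, and AVX2/AVX512 can issue several such loads at once via gather instructions, while NEON cannot.

#include <stdint.h>

/* Level 1: a key selects an entry in an index table (random access).
   Level 2: that entry points into a data table (indirection). */
double lookup(const int32_t *index_table, const double *data_table,
              uint64_t key, int32_t n_index) {
    int32_t first = index_table[key % n_index];
    return data_table[first];
}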
Branson Mini-App and Benchmark
7/30/19 Unclassified
• Monte Carlo based radiation transport mini-app
• Lots of time spent in math intrinsics (exp, log, sin, cos); benefits from Arm-optimized math routines
• Poor memory locality; benefits somewhat from large pages
• Doesn't vectorize
• Random number generator not yet optimized for ARM
• On a per-node basis, TX2 is on par with SKL-gold
• Need to improve vectorizability
[Chart: performance relative to SKL-gold vs. MPI processes (1 – 64) for TX2, TX2+armpl, and SKL-vec.]
EMPIRE on Astra
Trinity HSW: 32 MPI x 1 OMP; Astra TX2: 56 MPI x 1 OMP
Strong and weak scaling studies for EMPIRE-PIC on the awesome blob test case
Missing Trinity XL mesh 512 and 4096 node results because of a MueLu FPE
Missing Astra XL mesh 2048 node results because of a MueLu FPE
Work by Paul Lin. 7/30/19 Unclassified
EMPIRE on Astra
• TX2 node has ~2x memory bandwidth and 1.75x cores (56 vs. 32) of Trinity HSW
node
• (HSW time)/(TX2 time) > 1 means TX2 is faster
• Strong scaling for awesome blob: small mesh (1-8 nodes), medium mesh (8-64 nodes), large mesh (64-512 nodes)
• (HSW time)/(TX2 time) for the linear solve is not great: this is a low computation/communication regime
(Good)
7/30/19 Unclassified Work by Paul Lin
• TX2 node has ~2x memory bandwidth and 1.75x cores (56 vs. 32) of Trinity HSW
node
• (HSW time)/(TX2 time) > 1 means TX2 is faster
• Strong scaling for awesome blob: medium mesh (1-8 nodes), large mesh (8-64 nodes)
• (HSW time)/(TX2 time) for the linear solve is definitely better than on the previous slide, due to increased computation/communication
EMPIRE on Astra
(Good)
7/30/19 Unclassified Work by Paul Lin
xRAGE
7/30/19 Unclassified
• Eulerian-based hydrodynamics/radiation transport application
• Uses adaptive mesh refinement
• Significant amount of gather/scatter
• Does not currently benefit from AVX2/512 vectorization
• Memory bound
[Chart: Walltime (secs), lower is better, at 8/16/32 nodes for TX2 (50ppn), BWL (48ppn), and SKL (56ppn). Results from a Cray XC50 using the Cray CCE9 compiler.]
PARTISN
7/30/19 Unclassified
• Neutron transport code – deterministic SN method
• Sensitive to cache performance; not typically memory bound
• Vectorizes well for AVX512 and NEON
• Can be run mixed MPI/OpenMP
• Limited by cache bandwidth on TX2 and front-end stalls
[Chart: performance relative to BWL-vec, higher is better, vs. MPI processes (1 – 32) for TX2, SKL-novec, and SKL-vec.]
PARTISN can benefit from 4 SMTs/core
7/30/19 Unclassified
• Example of a code with significant front-end stalls
• TAU Commander indicates a high rate of branch misprediction in the sweep kernel
Cray XC50 - CCE 9.0 compiler
RIKEN Fiber Benchmarks – Compiler Performance Comparison
7/30/19 Unclassified
• Comparison of the Cray 8/9 compilers against Allinea 19 using the RIKEN Fiber benchmarks
• Results are mixed; no clear winner in terms of compilers
• Takeaway: try building your app with several compilers
Cray XC50 - CCE 9.0 compiler. Lower is better.
Early Results from Astra
7/30/19 Unclassified
System has been online for around two weeks; an incredible team working round the clock is already running full application ports and many of our key frameworks.
Baseline: Trinity ASC Platform (current production), dual-socket Haswell
Speedups vs. baseline: CFD Models 1.60X | Hydrodynamics 1.45X | Molecular Dynamics 1.30X | Monte Carlo 1.42X | Linear Solvers 1.87X
Porting to ARM
7/30/19 Unclassified
Sanity Checks
• See if your software has already been ported to aarch64:
• www.gitlab.com/arm-hpc/packages/wikis
• See if it's available via Spack: https://github.com/spack/spack
• Don't use old compilers:
• GCC 8.2 or newer; 9.1 is better
• Allinea armflang/armclang 19.0 or newer
• If your package relies on system packages in performance-critical areas, you may want to build your own versions; libraries that come with the base release are not optimized for ThunderX2
• If your application has lots of dependencies, this may be a good time to learn how to use Spack (see the example below)
• Check out training material at https://gitlab.com/arm-hpc/training
7/30/19 Unclassified
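For instance, a minimal Spack workflow for building a tuned dependency might look like the following (the package name, compiler version, and target string are illustrative; check spack arch and spack compilers for the names your installation reports):

spack compiler find                                  # register available compilers
spack install openblas %gcc@9.1.0 target=thunderx2   # build optimized for ThunderX2
spack load openblas                                  # bring it into the environment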
7/30/19 Unclassified
Porting Cheat Sheet
Ensure all dependencies have been ported.
•Arm HPC Packages Wiki: https://gitlab.com/arm-hpc/packages/wikis/categories/allPackages
Update or patch autotools and libtool as needed
•wget 'http://git.savannah.gnu.org/gitweb/?p=config.git;a=blob_plain;f=config.guess;hb=HEAD' -O config.guess
•wget 'http://git.savannah.gnu.org/gitweb/?p=config.git;a=blob_plain;f=config.sub;hb=HEAD' -O config.sub
•sed -i -e 's#wl=""#wl="-Wl,"#g' libtool
•sed -i -e 's#pic_flag=""#pic_flag=" -fPIC -DPIC"#g' libtool
Update build system to use the right compiler and architecture
•Check #ifdef in Makefiles. Use other architectures as a template.
Use the right compiler flags
•Start with -mcpu=native -Ofast
Avoid non-standard compiler extensions and language features
•Arm compiler team is actively adding new “unique” features, but it’s best to stick to the standard.
Update hard-wired intrinsics for other architectures
•https://developer.arm.com/technologies/neon/intrinsics
•Worst case: fall back to a slow scalar version (see the sketch below).
Update, and possibly fix, your test suite
•Regression tests are a porter’s best friend.
•Beware of tests that expect exactly the same answer on all architectures!
Know architectural features and what they mean for your code
•Arm’s weak memory model.
•Division by zero is silently zero on Arm.
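As an illustration of the hard-wired intrinsics item above, a hedged sketch of guarding an x86 SSE path with a NEON equivalent plus a portable scalar fallback (the function is hypothetical):

/* Add two 4-float vectors with whichever SIMD set the target provides;
   the plain-C branch is the "worst case: slow code" path. */
#if defined(__ARM_NEON)
  #include <arm_neon.h>
  void add4(const float *a, const float *b, float *out) {
      vst1q_f32(out, vaddq_f32(vld1q_f32(a), vld1q_f32(b)));
  }
#elif defined(__SSE__)
  #include <xmmintrin.h>
  void add4(const float *a, const float *b, float *out) {
      _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
  }
#else
  void add4(const float *a, const float *b, float *out) {
      for (int i = 0; i < 4; i++) out[i] = a[i] + b[i];
  }
#endif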
Questions?
7/30/19 Unclassified