Preparing Fusion Codes for Perlmutter
Igor Sfiligoi
San Diego Supercomputer Center
Under contract with General Atomics
NUG 2022
This talk focuses on CGYRO
• There are many tools used in Fusion research
• This talk focuses on CGYRO
  • An Eulerian fusion plasma turbulence simulation tool
  • Optimized for multi-scale simulations
  • Both memory and compute heavy
Experimental methods are essential for gathering new
operational modes. But simulations are used to validate
basic theory, plan experiments, interpret results on
present devices, and ultimately to design future devices.
E. Belli and J. Candy, main authors
https://gafusion.github.io/doc/cgyro.html
CGYRO inherently parallel
• Operates on a 5+1 dimensional grid
• Several steps in the simulation loop, where each step
  • Can cleanly partition the problem in at least one dimension
  • But no single dimension is common to all of them
• All dimensions are compute-parallel
  • But some dimensions may rely on neighbor data from the previous step
Easy to split among several CPU/GPU cores and nodes
Most of the compute-intensive portion is based on small-ish 2D FFTs; can use system-optimized libraries
Using OpenMP + OpenACC + MPI
Requires frequent TRANSPOSE operations, i.e. MPI_AllToAll (see the sketch below)
Exploring alternatives, but none ready yet
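To make the transpose callout above concrete, below is a minimal, hypothetical sketch of such an exchange using MPI_Alltoall. It is not CGYRO source (CGYRO is written in Fortran; C is used here for brevity), and the function name, buffer layout, and tile sizes are illustrative assumptions.

```c
/* Hypothetical sketch of a distributed transpose step (not CGYRO code).
 * Each rank owns a block of rows; after the exchange it owns a block of
 * columns. Every rank sends one chunk*chunk tile to every other rank. */
#include <mpi.h>

void transpose_partition(const double *send, double *recv,
                         int chunk, MPI_Comm comm)
{
    /* All-to-all: communication volume grows with the number of ranks,
     * which is why interconnect performance matters for CGYRO. */
    MPI_Alltoall(send, chunk * chunk, MPI_DOUBLE,
                 recv, chunk * chunk, MPI_DOUBLE, comm);

    /* A purely local reshuffle of the received tiles into the new layout
     * would follow here; it is omitted for brevity. */
}
```

The local reshuffle is cheap; the MPI_Alltoall exchange itself is the expensive part, which is why the networking discussion later in the talk matters so much.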
Cori and Perlmutter
• Cori was long a major CGYRO compute resource
  • And we were very happy with the KNL CPUs
  • Lots of (slower) cores were always better than fewer, marginally faster cores
• CGYRO was ported to GPUs first for ORNL Titan
  • Then improved for ORNL Summit
    (Titan's Kepler-generation GPUs have severe limitations, like tiny memory and limited communication)
• Deploying on Perlmutter (GPU partition) required just a recompilation
  • It just worked
  • Most of the time since has been spent on environment optimizations, e.g. NVIDIA MPS
Already had experience with A100s from Cloud compute
CPU vs GPU code paths
• CGYRO uses an OpenMP+OpenACC(+MPI) parallelization approach
  • Plus native FFT libraries: FFTW/MKL on Cori, cuFFT on Perlmutter
• Most code is identical for the two paths
  • Enabling OpenMP or OpenACC based on a compile flag (see the sketch below)
  • A few loops have specialized OpenMP vs OpenACC implementations (but most don't)
  • cuFFT required batch execution (reminder: many small FFTs; see the batching sketch below)
• Efficient OpenACC requires careful memory handling
  • Was especially a problem while porting pieces of the code to GPU
    (now virtually all compute is on GPU, and partitioned memory just works)
  • Host-device data movement is now mostly for interacting with IO / diagnostics printouts
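As a hedged illustration of the shared CPU/GPU code path described above (not actual CGYRO source, which is Fortran; the loop body and the _USE_OPENACC flag name are assumptions), a single loop can carry both an OpenMP and an OpenACC directive, selected at compile time:

```c
/* Sketch only: one loop with both CPU (OpenMP) and GPU (OpenACC)
 * directives, selected by a hypothetical compile-time flag. */
void axpy(const double *a, const double *b, double *c, int n)
{
#ifdef _USE_OPENACC
    #pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
#else
    #pragma omp parallel for
#endif
    for (int i = 0; i < n; i++)
        c[i] = a[i] + 0.5 * b[i];
}
```

The cuFFT batching mentioned above maps naturally onto cufftPlanMany, which plans many identical small 2D transforms that are then executed with a single call; the sizes and packing below are placeholder assumptions:

```c
#include <cufft.h>

/* Sketch: one plan covering a batch of identical small 2D complex FFTs,
 * assuming tightly packed, contiguous data. Sizes are placeholders. */
cufftHandle make_batched_2d_plan(int nx, int ny, int batch)
{
    cufftHandle plan;
    int n[2] = { nx, ny };
    cufftPlanMany(&plan, 2, n,
                  NULL, 1, nx * ny,   /* input layout  */
                  NULL, 1, nx * ny,   /* output layout */
                  CUFFT_Z2Z, batch);
    return plan;                      /* execute later with cufftExecZ2Z */
}
```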
Importance of great networking
• CGYRO is communication intensive
  • Large memory footprint + frequent MPI_AllToAll
  • Non-negligible MPI_AllReduce, too
• First experience on Perlmutter with Slingshot 10 was a mixed bag
  • Great compute speedup
  • But the simulation was bottlenecked by communication
    [Chart: benchmark sh04 case, preview of SC22 poster; ~30% vs ~70%]
• The updated Slingshot 11 networking makes us much happier
    [Chart: benchmark sh04 case, preview of SC22 poster; ~30% vs ~50%]
• But it brings new problems
  • SS11 does not play well with MPS
    • It gets drastically slower when mapping multiple MPI processes per GPU
      (something we are currently relying on for optimization reasons; see the sketch below)
  • Not a showstopper, but it slows down our simulations in certain setups
  • NERSC ticket open, hopefully it can be fixed
  • But we are also working on alternatives in the CGYRO code
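For context on the multiple-ranks-per-GPU point above, the mapping typically amounts to a round-robin device assignment like the generic sketch below (an assumption, not CGYRO's actual launch logic); with more ranks per node than GPUs, NVIDIA MPS is what lets the ranks share a device efficiently.

```c
#include <mpi.h>
#include <cuda_runtime.h>

/* Generic sketch: bind MPI ranks to GPUs round-robin within a node.
 * With more ranks per node than devices, several ranks share one GPU,
 * which is where NVIDIA MPS comes into play. */
void bind_rank_to_gpu(void)
{
    MPI_Comm node_comm;
    int local_rank = 0, ndev = 1;

    /* Group the ranks that share this node, then use the local rank. */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &local_rank);

    cudaGetDeviceCount(&ndev);
    cudaSetDevice(local_rank % ndev);

    MPI_Comm_free(&node_comm);
}
```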
Disk IO light
• CGYRO does not have much disk IO
• Updates results every O(10 mins)
• Checkpoints every O(1h)
• Uses MPI-mediated parallel writes (see the sketch below)
  • Only a couple of files, one per logical data type
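A minimal sketch of what an MPI-mediated parallel write can look like (generic MPI-IO usage, not CGYRO's actual checkpoint code; the file name and fixed-size layout are assumptions): every rank writes its slice of one shared file at a non-overlapping offset.

```c
#include <mpi.h>

/* Generic sketch: all ranks write equally sized local chunks into one
 * shared file at rank-dependent offsets, using a collective MPI-IO call. */
void write_checkpoint(const double *local, int count, MPI_Comm comm)
{
    MPI_File fh;
    int rank;

    MPI_Comm_rank(comm, &rank);
    MPI_File_open(comm, "checkpoint.bin",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    MPI_Offset offset = (MPI_Offset)rank * count * (MPI_Offset)sizeof(double);
    MPI_File_write_at_all(fh, offset, local, count, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
}
```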
A comparison to other systems
[Chart: compute-only vs. total time across systems; Perlmutter results shown with SS 10]
• A single GCP node is faster than 16x Summit nodes
• Looking at compute-only, Perlmutter's A100s are about twice as fast as Summit's V100s
Presented at PEARC22 - https://doi.org/10.1145/3491418.3535130
Summary and Conclusions
• Fusion CGYRO users happy with transition from Cori to Perlmutter
• Much faster at equivalent chip count
• Porting required just a recompile
• Perlmutter still in deployment phase
• Had periods when things were not working well
  • But these were typically transient; hopefully things will stabilize
• Waiting for the quotas to be raised (128 nodes is not a lot for CGYRO)
• Only known remaining annoyance is the SS11+MPS interference
Acknowledgements
• This work was partially supported by
• The U.S. Department of Energy under awards DE-FG02-95ER54309,
DE-FC02-06ER54873 (Edge Simulation Laboratory) and
DE-SC0017992 (AToM SciDAC-4 project).
• The US National Science Foundation (NSF) Grant OAC-1826967.
• An award of computer time was provided by the INCITE program.
• This research used resources of the Oak Ridge Leadership Computing Facility,
  which is an Office of Science User Facility supported under Contract
  DE-AC05-00OR22725.
• Computing resources were also provided by the National Energy Research
Scientific Computing Center, which is an Office of Science User Facility
supported under Contract DE-AC02-05CH11231.