Progress Toward Accelerating CAM-SE
Jeff Larkin <larkin@cray.com>
Along with: Rick Archibald, Ilene Carpenter, Kate Evans, Paulius Micikevicius, Jim Rosinski, Jim Schwarzmeier, Mark Taylor
Background
- In 2009, ORNL asked many of their top users: what sort of science would you do on a 20-petaflops machine in 2012? (Answer to come on the next slide.)
- The Center for Accelerated Application Research (CAAR) was established to determine whether a set of codes from various disciplines can be made to use GPU accelerators effectively, with the combined efforts of domain scientists and vendors.
- Each team has a science lead, a code lead, and members from ORNL, Cray, NVIDIA, and elsewhere.
CAM-SE Target Problem
- 1/8-degree CAM, using the CAM-SE dynamical core and MOZART tropospheric chemistry.
- Why is acceleration needed to "do" the problem? When all the tracers associated with MOZART atmospheric chemistry are included, the simulation is too expensive to run at high resolution on today's systems.
- What unrealized parallelism needs to be exposed? In many parts of the dynamics, parallelism needs to include levels (k) and chemical constituents (q).
Profile of Runtime
[Bar chart: percentage of total runtime by routine; see the editor's note at the end for how the bars map onto the kernels discussed in the following slides.]
Next Steps
- Once the dominant routines were identified, standalone kernels were created for each.
- Early efforts tested PGI and HMPP directives, plus CUDA C, CUDA Fortran, and OpenCL.
- The directives-based compilers were too immature at the time:
  - Poor support for Fortran modules and derived types.
  - They did not allow implementation at a high enough level.
- CUDA Fortran provided good performance while allowing us to remain in Fortran.
Identifying Parallelism
- HOMME parallelizes with both MPI and OpenMP over elements.
- Most of the tracer advection can also be parallelized over tracers (q) and levels (k).
- Vertical remap is the exception, due to the vertical dependence across levels.
- Parallelizing over tracers, and sometimes levels, while threading over quadrature points (nv) provides ample parallelism within each element to utilize the GPU effectively (see the sketch below).
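To make the mapping concrete, here is a minimal CUDA Fortran sketch of that decomposition: one thread block per (element, tracer) pair, threads covering the nv x nv quadrature points, and a sequential loop over levels. The module name, kernel name, array shapes, and the update itself are invented placeholders, not the actual euler_step code.

module advection_kernels
  use cudafor
  implicit none
  integer, parameter :: nv = 4, nlev = 26, qsize = 25   ! illustrative sizes only
contains
  attributes(global) subroutine scale_tracers(qdp, dp, nelem)
    integer, value  :: nelem
    real(8), device :: qdp(nv, nv, nlev, qsize, nelem)
    real(8), device :: dp(nv, nv, nlev, nelem)
    integer :: i, j, k, q, ie
    ie = blockIdx%x        ! one block per element ...
    q  = blockIdx%y        ! ... and per tracer
    i  = threadIdx%x       ! threads cover the quadrature points
    j  = threadIdx%y
    do k = 1, nlev         ! levels walked sequentially here; they can also be threaded
      qdp(i, j, k, q, ie) = qdp(i, j, k, q, ie) * dp(i, j, k, ie)
    end do
  end subroutine scale_tracers
end module advection_kernels

A host-side launch for this mapping would look like
  call scale_tracers<<<dim3(nelem, qsize, 1), dim3(nv, nv, 1)>>>(qdp_d, dp_d, nelem)
with qdp_d and dp_d being device-resident arrays.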
Status
- euler_step and laplace_sphere_wk were straightforward to rewrite in CUDA Fortran.
- Vertical remap was rewritten to be more amenable to the GPU (made it vectorize). The resulting code is 2X faster on the CPU than the original and has been given back to the community.
- Edge packing/unpacking for the boundary exchange needs to be rewritten (Ilene talked about this already). It was designed for one element per MPI rank, but we plan to run with more; once it is node-aware, it can also be device-aware and greatly reduce PCIe transfers.
- Someone said yesterday: "As with many kernels, the ratio of FLOPS per byte transferred determines successful acceleration."
Status (cont.)
- Kernels were put back into HOMME, and validation tests were run and passed. This version did nothing to reduce data movement; it only tested kernel accuracy.
- We are in the process of porting forward to the current trunk and doing more intelligent data movement.
- We are currently re-evaluating directives now that the compilers have matured: a directives-based vertical remap now slightly outperforms the hand-tuned CUDA (an illustrative directive version of such a loop nest follows). We are still working around derived-type issues.
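For readers unfamiliar with the directive approach, the sketch below shows what a directive-annotated loop nest of this shape looks like. It is written in OpenACC syntax purely for illustration; the work described here predates OpenACC and used the vendor directives available at the time, and the loop body is a placeholder rather than the real vertical remap.

subroutine remap_like_loop(qdp, dp_ratio, nelem)
  implicit none
  integer, parameter :: nv = 4, nlev = 26, qsize = 25
  integer, intent(in)    :: nelem
  real(8), intent(inout) :: qdp(nv, nv, nlev, qsize, nelem)
  real(8), intent(in)    :: dp_ratio(nv, nv, nlev, nelem)
  integer :: i, j, k, q, ie

  ! Gangs over (element, tracer); the vertical dependence keeps k as an
  ! ordinary sequential loop, and the quadrature points run in vector lanes.
  ! In the real code the arrays stay resident on the device; copy clauses are
  ! used here only so the sketch is self-contained.
  !$acc parallel loop gang collapse(2) copy(qdp) copyin(dp_ratio)
  do ie = 1, nelem
    do q = 1, qsize
      do k = 1, nlev
        !$acc loop vector collapse(2)
        do j = 1, nv
          do i = 1, nv
            qdp(i, j, k, q, ie) = qdp(i, j, k, q, ie) * dp_ratio(i, j, k, ie)
          end do
        end do
      end do
    end do
  end do
end subroutine remap_like_loop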
Challenges
- Data structures (object-oriented Fortran): every node has an array of element derived types, each of which contains more arrays. We only care about some of these arrays, so data movement isn't very natural; we must essentially gather many non-contiguous CPU arrays into one contiguous GPU array (a packing sketch follows this list).
- Parallelism occurs at various levels of the call tree, not just in leaf routines, so the compiler must be able to inline the leaves in order to use directives. The Cray compiler handles this via whole-program analysis; the PGI compiler may support it via an inline library.
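The gather step looks roughly like the following, where elem_t and its components are stand-ins for HOMME's element type rather than its actual definition: each field is copied out of the scattered per-element arrays into one contiguous host buffer, which is then moved to the device in a single transfer (in CUDA Fortran, assignment to a device array performs the copy).

module element_pack
  use cudafor
  implicit none
  integer, parameter :: nv = 4, nlev = 26
  type elem_t                          ! stand-in for HOMME's element type
    real(8) :: ps(nv, nv)              ! one of many per-element arrays
    real(8) :: T(nv, nv, nlev)
  end type elem_t
contains
  subroutine pack_temperature(elem, T_dev)
    type(elem_t), intent(in)     :: elem(:)
    real(8), device, intent(out) :: T_dev(:, :, :, :)   ! (nv, nv, nlev, nelem)
    real(8), allocatable :: T_host(:, :, :, :)
    integer :: ie, nelem
    nelem = size(elem)
    allocate(T_host(nv, nv, nlev, nelem))
    do ie = 1, nelem                   ! gather the scattered CPU arrays ...
      T_host(:, :, :, ie) = elem(ie)%T
    end do
    T_dev = T_host                     ! ... then one contiguous host-to-device copy
    deallocate(T_host)
  end subroutine pack_temperature
end module element_pack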
Challenges (cont.)
- CUDA Fortran requires that everything live in the same module, so we must duplicate some routines and data structures from several modules in our "cuda_mod".
- We insert ifdefs that hijack CPU routine calls and forward the request to the matching cuda_mod routine (illustrated below). This is simple for the user, but the developer must maintain duplicate routines.
- Hey Dave, when will this get changed? ;)
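A hypothetical illustration of that ifdef "hijack" pattern follows. The wrapper, the macro name, and the simplified argument lists are invented for this sketch; only the idea of forwarding an unchanged call site to a duplicated cuda_mod routine comes from the slide.

subroutine advance_tracers(elem, dt)
  use element_mod, only : element_t
#ifdef USE_CUDA_MOD
  use cuda_mod, only : euler_step_cuda       ! duplicated GPU version
#else
  use prim_advection_mod, only : euler_step  ! original CPU version
#endif
  implicit none
  type(element_t), intent(inout) :: elem(:)
  real(8),         intent(in)    :: dt
#ifdef USE_CUDA_MOD
  call euler_step_cuda(elem, dt)   ! GPU path: call forwarded into cuda_mod
#else
  call euler_step(elem, dt)        ! CPU path: the call site itself is unchanged
#endif
end subroutine advance_tracers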
Until the boundary exchange is rewritten, euler_step performance is hampered by data movement. Streaming over elements helps (sketched below), but may not be realistic for the full code.
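One way to stream over elements is sketched below, assuming pinned host staging buffers and one CUDA stream per chunk of elements so that the transfers for one chunk overlap the kernel work of another. Buffer names, sizes, and the chunk count are invented, the kernel launch is left as a placeholder comment, and the exact cudaMemcpyAsync argument order is my recollection of the CUDA Fortran binding and should be checked against the reference in use.

subroutine stream_over_elements(nelem)
  use cudafor
  implicit none
  integer, intent(in) :: nelem
  integer, parameter :: nv = 4, nlev = 26, qsize = 25, nchunks = 4
  real(8), pinned, allocatable :: q_h(:, :, :, :, :)   ! pinned host staging buffer
  real(8), device, allocatable :: q_d(:, :, :, :, :)   ! device copy
  integer(kind=cuda_stream_kind) :: streams(nchunks)
  integer :: c, istat, chunk, first, last, n

  allocate(q_h(nv, nv, nlev, qsize, nelem))
  allocate(q_d(nv, nv, nlev, qsize, nelem))
  do c = 1, nchunks
    istat = cudaStreamCreate(streams(c))
  end do
  ! ... fill q_h from the element structures (see the packing sketch above) ...

  chunk = (nelem + nchunks - 1) / nchunks
  do c = 1, nchunks
    first = (c - 1) * chunk + 1
    last  = min(c * chunk, nelem)
    n     = nv * nv * nlev * qsize * (last - first + 1)
    ! Copy up, compute, and copy back for this chunk all on the same stream,
    ! so different chunks overlap with one another.
    istat = cudaMemcpyAsync(q_d(1,1,1,1,first), q_h(1,1,1,1,first), n, &
                            cudaMemcpyHostToDevice, streams(c))
    ! call euler_step_kernel<<<grid, block, 0, streams(c)>>>(...)   ! per-chunk kernel
    istat = cudaMemcpyAsync(q_h(1,1,1,1,first), q_d(1,1,1,1,first), n, &
                            cudaMemcpyDeviceToHost, streams(c))
  end do
  istat = cudaDeviceSynchronize()
  do c = 1, nchunks
    istat = cudaStreamDestroy(streams(c))
  end do
end subroutine stream_over_elements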
With data transfer included, laplace_sphere_wk is a wash, but since all necessary data is already resident from euler_step, the kernel-only time is realistic.
The vertical remap rewrite is 2X faster on the CPU and still faster on the GPU. All data is already resident on the device from euler_step, so the kernel-only time is realistic.
Future Work
- Use CUDA 4.0 dynamic pinning of memory to allow overlapping and better PCIe performance (see the sketch after this list).
- Move forward to CAM5/CESM1; there is no chance of our work being used otherwise.
- Some additional, small kernels are needed to allow data to remain resident: it is cheaper to run these on the GPU than to copy the data.
- Reprofile the accelerated application to identify the next most important routines. The chemistry implicit solver is expected to be next; the physics is expected to require a mature, directives-based compiler.
- Rinse, repeat.
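"Dynamic pinning" refers to CUDA 4.0's cudaHostRegister, which page-locks memory that was allocated elsewhere, so existing model buffers can take part in asynchronous, overlapped PCIe transfers without changing how they are allocated. The sketch below reaches that runtime call through a hand-written iso_c_binding interface; this is just one plausible way to do it, and the buffer being registered is illustrative.

module host_pinning
  use iso_c_binding
  implicit none
  interface
    integer(c_int) function cudaHostRegister(ptr, nbytes, flags) &
        bind(c, name="cudaHostRegister")
      import :: c_ptr, c_size_t, c_int
      type(c_ptr),       value :: ptr
      integer(c_size_t), value :: nbytes
      integer(c_int),    value :: flags
    end function cudaHostRegister
  end interface
contains
  subroutine pin_buffer(buf)
    ! Page-lock an already-allocated host buffer so later asynchronous copies
    ! from it can overlap with kernel execution.
    real(8), target, intent(inout) :: buf(:)
    integer(c_int) :: istat
    istat = cudaHostRegister(c_loc(buf(1)),                &
                             int(8 * size(buf), c_size_t), &  ! bytes: 8 per real(8)
                             0_c_int)                         ! 0 = cudaHostRegisterDefault
    if (istat /= 0) print *, 'cudaHostRegister failed, status ', istat
  end subroutine pin_buffer
end module host_pinning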
Conclusions
- Much has been done; much remains.
- For a fairly new, cleanly written code, CUDA Fortran was tractable. HOMME has very similar loop nests throughout, which was key to making this possible.
- It still results in multiple code paths to maintain, so we'd prefer to move to directives in the long run.
- We believe GPU accelerators will be beneficial for the selected problem, and we hope they will also benefit a wider audience (CAM5 should help with this).

Editor's Notes

  • #5: Added outlines to show how these bars relate to our kernels. Edge Packing & Unpacking are part of the "Boundary Exchange", which was designed for maximum MPI scaling with one element per task and one task per core and needs to be redesigned for a smaller number of more powerful nodes and a lower surface/volume ratio. Verremap2 is the "Vertical Remap". "Euler Step" consists of euler_step, divergence_sphere, and limiter2d_zero. NOTE: in the application, a boundary exchange occurs inside of euler_step. "Laplace Sphere Weak" is a call to divergence_sphere_wk and gradient_sphere.