Optimizing communication
Vincent C. Betro, Ph.D.
Optimizing Communication Types and Patterns for
High Performance Computing
© ARM
This is the one time you want everyone
talking over each other!
© ARM
Outline
• Importance of communication
• Types of communication
• Performance implications of communication
• How to identify performance issues with Allinea
• Improving communication patterns
© ARM
Performance Roadmap
© ARM
Communication in HPC applications
• HPC used to solve large scientific problems
• Need to spread problem over multiple nodes
– Domain decomposition
– Requires data to be communicated between nodes
• Communication determined by algorithm
– Need to understand the pattern
– What type of communication?
– How does it scale?
© ARM
Why care about communication?
Because communication DOES NOT always
scale linearly with problem size!
N-body problem: each object interacts with all (n-1) other objects, giving n(n-1) communications.
[Figure: meshes with n = 2, 4, and 6 objects, needing 2, 12, and 30 communications respectively]
© ARM
Why care about communication?
Decomposition must balance computational work
against communication overhead.
Keep the CPU busy!
Strong scaling example: 100x100 cell domain
– 1 core: 10,000 cells/core, no communication
– 2 cores: 5,000 cells/core, 200 communications
– 4 cores: 2,500 cells/core, 400 communications
© ARM
Topologies
• Fat Tree
– Common for InfiniBand
– Branching factor important
– Fast localised communication
• Torus
– Extension of hypercubes
– High connectivity
– Good for complex spatial decompositions
• Specialist
– Dragonfly, Islands
– Trade-off performance and scalability
[Figures: fat tree and 2D torus topologies]
© ARM
Communication: latency vs. bandwidth
• Latency: how long it takes a message to travel between processes
• Bandwidth: how much data the network can carry in a given time
© ARM
MPI in HPC applications
• Focus on MPI
– De facto standard for inter-node communication
– Allinea tools also support other communication types
• Explicit message passing
– Unlike shared-memory communication
– Some support for one-sided communication
• Hybrid
– Can be used in conjunction with shared memory
– OpenMP on node, MPI off node
• Scalability
– Communication doesn’t always scale linearly
– Depends on algorithm and domain decomposition
© ARM
Types of communication
• Point-to-point
– Single source to single destination
– Used for neighbour communication
• Collectives
– Reduction or broadcast operations
– To or from multiple ranks
– Used for sharing global data
• Synchronisations
– Used to align ranks
– Can include point-to-point and collectives
© ARM
Performance implications of communication
• Scaling
– Communication often the limiting factor in scaling
– Impacted by data size and node count
• Complexity
– Subtle interplay between components
– Algorithm: communication pattern, synchronisation, decomposition
– Hardware: network topology, bandwidth, latency
• Performance
– Synchronisation a major problem
– Highlights load imbalance
– Ranks sit idle
• Tools
– Allinea Forge shows the cost of communication
– Identifies where and why
– Helps application programmers resolve issues
Demonstration
Any questions? Feel free to ask!
© ARM
Simple MPI communication example
• Basic 2D CFD example
• Nearest neighbour halo exchange
– After every inner iteration
• Global mesh reduction
– AllGatherV at end of each outer iteration
• Focus on halo exchange
• Currently a naïve implementation (a minimal sketch follows below)
– Everyone sends to their left and then to their right
– Suitable at low core count
– Scaling issues at higher core counts
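The webinar does not reproduce the source here, so the following is a minimal C/MPI sketch of the naïve pattern just described; the function and variable names (naive_halo_exchange, nhalo, the halo buffers) are illustrative assumptions, not the original application code. Because each blocking send may not complete until its matching receive is posted, the ranks end up progressing one after another.

```c
#include <mpi.h>

/* Naive halo exchange (sketch): every rank sends its halo to the left
 * neighbour, then to the right, with blocking calls. Boundary ranks use
 * MPI_PROC_NULL so the same code works at the domain edges. */
void naive_halo_exchange(double *send_left, double *send_right,
                         double *recv_left, double *recv_right,
                         int nhalo, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Send left, then receive the halo arriving from the right. */
    MPI_Send(send_left,  nhalo, MPI_DOUBLE, left,  0, comm);
    MPI_Recv(recv_right, nhalo, MPI_DOUBLE, right, 0, comm, MPI_STATUS_IGNORE);

    /* Send right, then receive the halo arriving from the left.
     * Each send may block until the neighbour posts its receive, so the
     * chain of ranks drains one rank at a time. */
    MPI_Send(send_right, nhalo, MPI_DOUBLE, right, 1, comm);
    MPI_Recv(recv_left,  nhalo, MPI_DOUBLE, left,  1, comm, MPI_STATUS_IGNORE);
}
```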
© ARM
How Allinea can help
• Performance Reports
– Summary overview
– Time spent in MPI
– Breakdown of communication type
– Effective bandwidths
• Profiled on 8 cores (1 node)
• MAP
– Detailed event timeline
– Activity over time
– In-depth MPI metrics
– Source code linkage
© ARM
Application MPI profile – 8 cores (1 node)
© ARM
Strong scaling – 64 cores (8 nodes)
© ARM
Application MPI profile – 64 cores (8 nodes)
© ARM
Obvious scaling problem
• Runtime increases
– From 64.2 s to 161.8 s
• MPI point-to-point bandwidth has fallen
– From ~80 MB/s to ~35 MB/s
• Dependency on neighbours
– Causes serialisation
– A rank is blocked from exiting its send until the neighbouring process has received
• Collective call duration increased
• Altogether not scalable
• MAP highlights location of problem
© ARM
Initial communication pattern (with dependencies)
[Figure: ranks 0–3 exchanging halos, with send/receive arrows forming a dependency chain]
© ARM
Option 1
Improving communication pattern (pairwise)
• Ranks operate in pairs
– Odd ranks send while even ranks receive
– Then the roles swap
– Repeat for the left and right neighbours
[Figure: ranks 0–3 exchanging halos in odd/even pairs, with send and receive arrows]
© ARM
Code changes
[The original slide shows before/after screenshots of the code; a sketch of the pairwise version follows below]
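As a hedged illustration of what the "after" code might look like, here is a minimal C/MPI sketch of the pairwise scheme described in Option 1; names such as pairwise_halo_exchange and nhalo are assumptions, not taken from the webinar's application.

```c
#include <mpi.h>

/* Pairwise halo exchange (sketch): odd ranks send while even ranks receive,
 * then the roles swap; done for both the left and right neighbours. */
void pairwise_halo_exchange(double *send_left, double *send_right,
                            double *recv_left, double *recv_right,
                            int nhalo, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    if (rank % 2 == 1) {
        /* Odd ranks: send first, starting with the left neighbour. */
        MPI_Send(send_left,  nhalo, MPI_DOUBLE, left,  0, comm);
        MPI_Recv(recv_left,  nhalo, MPI_DOUBLE, left,  1, comm, MPI_STATUS_IGNORE);
        MPI_Send(send_right, nhalo, MPI_DOUBLE, right, 0, comm);
        MPI_Recv(recv_right, nhalo, MPI_DOUBLE, right, 1, comm, MPI_STATUS_IGNORE);
    } else {
        /* Even ranks: receive first, starting with the right neighbour,
         * so every send is immediately matched by a posted receive. */
        MPI_Recv(recv_right, nhalo, MPI_DOUBLE, right, 0, comm, MPI_STATUS_IGNORE);
        MPI_Send(send_right, nhalo, MPI_DOUBLE, right, 1, comm);
        MPI_Recv(recv_left,  nhalo, MPI_DOUBLE, left,  0, comm, MPI_STATUS_IGNORE);
        MPI_Send(send_left,  nhalo, MPI_DOUBLE, left,  1, comm);
    }
}
```

Pairing the ranks this way removes the rank-by-rank dependency chain: every blocking send has a partner already waiting in a receive.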
© ARM
Results
• Better profile
– From ~160 s down to ~70 s
– Still very MPI-bound
© ARM
Option 2
Initial communication pattern, with non-blocking (ISend) method
• Non-blocking sends
– Everyone sends data to both neighbours
– Then moves to receive data
– Blocking receives
– Wait on the sends (so the send buffers are safe to reuse)
© ARM
Code changes
[The original slide shows before/after screenshots of the code; a sketch of the non-blocking version follows below]
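A minimal sketch of the Option 2 approach in C/MPI, assuming MPI_Isend for the outgoing halos, blocking receives, and a final wait on the sends; as before, the names are illustrative rather than the original code.

```c
#include <mpi.h>

/* Non-blocking halo exchange (sketch): post both sends immediately,
 * receive from both neighbours, then wait on the sends. */
void isend_halo_exchange(double *send_left, double *send_right,
                         double *recv_left, double *recv_right,
                         int nhalo, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    MPI_Request reqs[2];

    /* Everyone posts both sends up front; no rank waits on a neighbour. */
    MPI_Isend(send_left,  nhalo, MPI_DOUBLE, left,  0, comm, &reqs[0]);
    MPI_Isend(send_right, nhalo, MPI_DOUBLE, right, 1, comm, &reqs[1]);

    /* Blocking receives for the matching incoming halos. */
    MPI_Recv(recv_right, nhalo, MPI_DOUBLE, right, 0, comm, MPI_STATUS_IGNORE);
    MPI_Recv(recv_left,  nhalo, MPI_DOUBLE, left,  1, comm, MPI_STATUS_IGNORE);

    /* Completing the sends guarantees the send buffers can be reused. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}
```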
© ARM
Results
© ARM
Non-blocking – synchronization points
• Overlap sends
– No dependency
• Much faster
– More scalable
– 46.4 s (down from 161.8 s)
• Helps load imbalance
• Could be improved further
– Non-blocking receives
– Wait-any on the responses (see the sketch below)
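A hedged sketch of that suggested improvement, using non-blocking receives and MPI_Waitany so each rank can process whichever halo arrives first; this is an assumption about how the suggestion might look, not code shown in the webinar.

```c
#include <mpi.h>

/* Fully non-blocking halo exchange (sketch): receives are posted first,
 * then sends, and halos are processed in arrival order via MPI_Waitany. */
void irecv_halo_exchange(double *send_left, double *send_right,
                         double *recv_left, double *recv_right,
                         int nhalo, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    MPI_Request recv_reqs[2], send_reqs[2];

    /* Post receives first so incoming data always has somewhere to land. */
    MPI_Irecv(recv_right, nhalo, MPI_DOUBLE, right, 0, comm, &recv_reqs[0]);
    MPI_Irecv(recv_left,  nhalo, MPI_DOUBLE, left,  1, comm, &recv_reqs[1]);

    MPI_Isend(send_left,  nhalo, MPI_DOUBLE, left,  0, comm, &send_reqs[0]);
    MPI_Isend(send_right, nhalo, MPI_DOUBLE, right, 1, comm, &send_reqs[1]);

    /* Handle whichever halo arrives first, then the other. */
    for (int i = 0; i < 2; i++) {
        int idx;
        MPI_Waitany(2, recv_reqs, &idx, MPI_STATUS_IGNORE);
        /* ... unpack the halo identified by idx here ... */
    }
    MPI_Waitall(2, send_reqs, MPI_STATUSES_IGNORE);
}
```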
© ARM
Communication method comparison
Method             Runtime (8 cores)   Runtime (64 cores)   P2P bandwidth (64 cores)
Naïve Neighbour    64.2 s              161.8 s              35.2 MB/s
Pairwise           57.8 s              69.9 s               82.1 MB/s
Non-Blocking       57.7 s              46.4 s               123 MB/s
• Shows the importance of the algorithm at scale
– Tools provide the understanding of where the performance problems lie
• Performance now dominated by send volume
– An artefact of the problem decomposition
– Surface-to-volume ratio is very inefficient
– Use MAP to guide code changes
© ARM
Load imbalance identification with Allinea MAP
• Identified as a synchronization point
• Long MPI call duration - ‘triangle’
• Mean line close to bottom
– Means only a few ranks are waiting a long time
– Other ranks are busy
– In this case with I/O (orange)
• In this case, synchronization on MPI_Finalize
© ARM
Summary
• Getting MPI communication right is difficult
– Can have a dramatic impact on performance
– Subtle interplay between lots of components
• Allinea tools can help identify the problem
– Performance Reports for an overview summary
– MAP for an in-depth analysis
– Provide data for scaling studies
• Needs application knowledge to improve code
– Tools can only show the way and confirm improvement
© ARM
THANK YOU!
Vincent C. Betro, Ph.D.
Mathematics Instructor
Baylor School, Chattanooga, TN
See our full menu of live or recorded webinars:
https://www.allinea.com/performance-webinars-menu
To take a trial, visit:
https://www.allinea.com/get-your-free-allinea-forge-and-allinea-performance-reports-trial
To contact sales, email:
sales@allinea.com
Editor's Notes
  • #8–9: Growth in problem sizes and use of machine clusters. Communication doesn’t always scale linearly with problem size, so we need to understand how algorithms will scale. Hardware architecture and physical limitations.
  • #10: There is little one can do about a supercomputer’s network configuration (e.g. 3D torus vs. star network). Explore process and network speeds to determine the importance of the surface-to-volume ratio for a compute-bound problem.
  • #11: Explain how this can be seen through MAP! A function of the type of network and the distance between nodes; negligible unless on cloud!
  • #31: NOTE FROM ALLINEA: Is it necessary to discuss MPI imbalance?