Optimizing communication
Vincent C. Betro, Ph.D.
Optimizing Communication Types and Patterns for
High Performance Computing
© ARM
This is the one time you want everyone
talking over each other!
© ARM
Outline
• Importance of communication
• Types of communication
• Performance implications of communication
• How to identify performance issues with Allinea
• Improving communication patterns
© ARM
Performance Roadmap
© ARM
Communication in HPC applications
• HPC used to solve large scientific problems
• Need to spread problem over multiple nodes
– Domain decomposition
– Requires data to be communicated between nodes
• Communication determined by algorithm
– Need to understand the pattern
– What type of communication?
– How does it scale?
© ARM
Why care about communication?
Because communication DOES NOT always
scale linearly with problem size!
N-body problem: each object interacts with all (n-1) other objects, giving n(n-1) communications.
[Figure: meshes with n = 2, 4, and 6 objects, needing 2, 12, and 30 communications respectively]
© ARM
Why care about communication?
Decomposition must balance computational work
against communication overhead.
Keep the CPU busy!
Strong scaling example: 100x100 cell domain
– 1 core: 10,000 cells/core, no communication
– 2 cores: 5,000 cells/core, 200 communications
– 4 cores: 2,500 cells/core, 400 communications
© ARM
Topologies
• Fat Tree
– Common for InfiniBand
– Branching factor important
– Fast localised communication
• Torus
– Extension of hypercubes
– High connectivity
– Good for complex spatial decompositions
• Specialist
– Dragonfly, Islands
– Trade-off performance and scalability
[Figures: fat tree and 2D torus topologies]
© ARM
Communication: latency vs. bandwidth
• Latency: how long it takes a message to travel between processes
• Bandwidth: how much data the network can carry in a given time
© ARM
MPI in HPC applications
• Focus on MPI
– De facto standard for inter-node communication
– Allinea tools also support other communication types
• Explicit message passing
– Unlike shared-memory communication
– Some support for one-sided communication
• Hybrid
– Can be used in conjunction with shared memory
– OpenMP on node, MPI off node
• Scalability
– Communication doesn’t always scale linearly
– Depends on algorithm and domain decomposition
© ARM
Types of communication
• Point-to-point
– Single source to single destination
– Used for neighbour communication
• Collectives
– Reduction or broadcast operations
– To or from multiple ranks
– Used for sharing global data
• Synchronisations
– Used to align ranks
– Can include point-to-point and collectives
© ARM
Performance implications of communication
• Scaling
– Communication often the limiting factor in scaling
– Impacted by data size and node count
• Complexity
– Subtle interplay between components
– Algorithm: communication pattern, synchronisation, decomposition
– Hardware: network topology, bandwidth, latency
• Performance
– Synchronisation a major problem
– Highlights load imbalance
– Ranks sit idle
• Tools
– Allinea Forge shows the cost of communication
– Identifies where and why
– Helps application programmers resolve issues
Demonstration
Any questions? Feel free to ask!
© ARM
Simple MPI communication example
• Basic 2D CFD example
• Nearest neighbour halo exchange
– After every inner iteration
• Global mesh reduction
– AllGatherV at end of each outer iteration
• Focus on halo exchange
• Currently a naïve implementation (a minimal sketch follows below)
– Everyone sends to their left and then to their right
– Suitable at low core count
– Scaling issues at higher core counts
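The webinar does not reproduce the source here, so the following is a minimal C/MPI sketch of the naïve pattern just described; the function and variable names (naive_halo_exchange, nhalo, the halo buffers) are illustrative assumptions, not the original application code. Because each blocking send may not complete until its matching receive is posted, the ranks end up progressing one after another.

```c
#include <mpi.h>

/* Naive halo exchange (sketch): every rank sends its halo to the left
 * neighbour, then to the right, with blocking calls. Boundary ranks use
 * MPI_PROC_NULL so the same code works at the domain edges. */
void naive_halo_exchange(double *send_left, double *send_right,
                         double *recv_left, double *recv_right,
                         int nhalo, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Send left, then receive the halo arriving from the right. */
    MPI_Send(send_left,  nhalo, MPI_DOUBLE, left,  0, comm);
    MPI_Recv(recv_right, nhalo, MPI_DOUBLE, right, 0, comm, MPI_STATUS_IGNORE);

    /* Send right, then receive the halo arriving from the left.
     * Each send may block until the neighbour posts its receive, so the
     * chain of ranks drains one rank at a time. */
    MPI_Send(send_right, nhalo, MPI_DOUBLE, right, 1, comm);
    MPI_Recv(recv_left,  nhalo, MPI_DOUBLE, left,  1, comm, MPI_STATUS_IGNORE);
}
```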
© ARM
How Allinea can help
• Performance Reports
– Summary overview
– Time spent in MPI
– Breakdown of communication type
– Effective bandwidths
• Profiled on 8 cores (1 node)
• MAP
– Detailed event timeline
– Activity over time
– In-depth MPI metrics
– Source code linkage
© ARM
Application MPI profile – 8 cores (1 node)
© ARM
Strong scaling – 64 cores (8 nodes)
© ARM
Application MPI profile – 64 cores (8 nodes)
© ARM
Obvious scaling problem
• Runtime increases
– From 64.2 s to 161.8 s
• MPI point-to-point bandwidth has fallen
– From ~80 MB/s to ~35 MB/s
• Dependency on neighbours
– Causes serialisation
– A rank is blocked from exiting its send until the neighbouring process has received
• Collective call duration increased
• Altogether not scalable
• MAP highlights location of problem
© ARM
Initial communication pattern (with dependencies)
[Figure: ranks 0–3 exchanging halos, with send/receive arrows forming a dependency chain]
© ARM
Option 1
Improving communication pattern (pairwise)
• Ranks operate in pairs
– Odd ranks send while even ranks receive
– Then the roles swap
– Repeat for the left and right neighbours
[Figure: ranks 0–3 exchanging halos in odd/even pairs, with send and receive arrows]
© ARM
Code changes
[The original slide shows before/after screenshots of the code; a sketch of the pairwise version follows below]
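As a hedged illustration of what the "after" code might look like, here is a minimal C/MPI sketch of the pairwise scheme described in Option 1; names such as pairwise_halo_exchange and nhalo are assumptions, not taken from the webinar's application.

```c
#include <mpi.h>

/* Pairwise halo exchange (sketch): odd ranks send while even ranks receive,
 * then the roles swap; done for both the left and right neighbours. */
void pairwise_halo_exchange(double *send_left, double *send_right,
                            double *recv_left, double *recv_right,
                            int nhalo, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    if (rank % 2 == 1) {
        /* Odd ranks: send first, starting with the left neighbour. */
        MPI_Send(send_left,  nhalo, MPI_DOUBLE, left,  0, comm);
        MPI_Recv(recv_left,  nhalo, MPI_DOUBLE, left,  1, comm, MPI_STATUS_IGNORE);
        MPI_Send(send_right, nhalo, MPI_DOUBLE, right, 0, comm);
        MPI_Recv(recv_right, nhalo, MPI_DOUBLE, right, 1, comm, MPI_STATUS_IGNORE);
    } else {
        /* Even ranks: receive first, starting with the right neighbour,
         * so every send is immediately matched by a posted receive. */
        MPI_Recv(recv_right, nhalo, MPI_DOUBLE, right, 0, comm, MPI_STATUS_IGNORE);
        MPI_Send(send_right, nhalo, MPI_DOUBLE, right, 1, comm);
        MPI_Recv(recv_left,  nhalo, MPI_DOUBLE, left,  0, comm, MPI_STATUS_IGNORE);
        MPI_Send(send_left,  nhalo, MPI_DOUBLE, left,  1, comm);
    }
}
```

Pairing the ranks this way removes the rank-by-rank dependency chain: every blocking send has a partner already waiting in a receive.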
© ARM
Results
• Better profile
– From ~160 s down to ~70 s
– Still very MPI-bound
© ARM
Option 2
Initial communication pattern, with non-blocking (ISend) method
• Non-blocking sends
– Everyone sends data to both neighbours
– Then moves to receive data
– Blocking receives
– Wait on the sends (so the send buffers are safe to reuse)
© ARM
Code changes
[The original slide shows before/after screenshots of the code; a sketch of the non-blocking version follows below]
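A minimal sketch of the Option 2 approach in C/MPI, assuming MPI_Isend for the outgoing halos, blocking receives, and a final wait on the sends; as before, the names are illustrative rather than the original code.

```c
#include <mpi.h>

/* Non-blocking halo exchange (sketch): post both sends immediately,
 * receive from both neighbours, then wait on the sends. */
void isend_halo_exchange(double *send_left, double *send_right,
                         double *recv_left, double *recv_right,
                         int nhalo, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    MPI_Request reqs[2];

    /* Everyone posts both sends up front; no rank waits on a neighbour. */
    MPI_Isend(send_left,  nhalo, MPI_DOUBLE, left,  0, comm, &reqs[0]);
    MPI_Isend(send_right, nhalo, MPI_DOUBLE, right, 1, comm, &reqs[1]);

    /* Blocking receives for the matching incoming halos. */
    MPI_Recv(recv_right, nhalo, MPI_DOUBLE, right, 0, comm, MPI_STATUS_IGNORE);
    MPI_Recv(recv_left,  nhalo, MPI_DOUBLE, left,  1, comm, MPI_STATUS_IGNORE);

    /* Completing the sends guarantees the send buffers can be reused. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}
```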
© ARM
Results
© ARM
Non-blocking – synchronization points
• Overlap sends
– No dependency
• Much faster
– More scalable
– 46.4 s (down from 161.8 s)
• Helps load imbalance
• Could be improved further
– Non-blocking receives
– Wait-any on the responses (see the sketch below)
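A hedged sketch of that suggested improvement, using non-blocking receives and MPI_Waitany so each rank can process whichever halo arrives first; this is an assumption about how the suggestion might look, not code shown in the webinar.

```c
#include <mpi.h>

/* Fully non-blocking halo exchange (sketch): receives are posted first,
 * then sends, and halos are processed in arrival order via MPI_Waitany. */
void irecv_halo_exchange(double *send_left, double *send_right,
                         double *recv_left, double *recv_right,
                         int nhalo, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    MPI_Request recv_reqs[2], send_reqs[2];

    /* Post receives first so incoming data always has somewhere to land. */
    MPI_Irecv(recv_right, nhalo, MPI_DOUBLE, right, 0, comm, &recv_reqs[0]);
    MPI_Irecv(recv_left,  nhalo, MPI_DOUBLE, left,  1, comm, &recv_reqs[1]);

    MPI_Isend(send_left,  nhalo, MPI_DOUBLE, left,  0, comm, &send_reqs[0]);
    MPI_Isend(send_right, nhalo, MPI_DOUBLE, right, 1, comm, &send_reqs[1]);

    /* Handle whichever halo arrives first, then the other. */
    for (int i = 0; i < 2; i++) {
        int idx;
        MPI_Waitany(2, recv_reqs, &idx, MPI_STATUS_IGNORE);
        /* ... unpack the halo identified by idx here ... */
    }
    MPI_Waitall(2, send_reqs, MPI_STATUSES_IGNORE);
}
```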
© ARM
Communication method comparison
Method             Runtime (8 cores)   Runtime (64 cores)   P2P bandwidth (64 cores)
Naïve Neighbour    64.2 s              161.8 s              35.2 MB/s
Pairwise           57.8 s              69.9 s               82.1 MB/s
Non-Blocking       57.7 s              46.4 s               123 MB/s
• Shows the importance of the algorithm at scale
– Tools provide the understanding of where the performance problems lie
• Performance now dominated by send volume
– An artefact of the problem decomposition
– Surface-to-volume ratio is very inefficient
– Use MAP to guide code changes
© ARM
Load imbalance identification with Allinea MAP
• Identified as a synchronization point
• Long MPI call duration - ‘triangle’
• Mean line close to bottom
– Means only a few ranks are waiting a long time
– Other ranks are busy
– In this case with I/O (orange)
• In this case, synchronization on MPI_Finalize
© ARM
Summary
• Getting MPI communication right is difficult
– Can have a dramatic impact on performance
– Subtle interplay between lots of components
• Allinea tools can help identify the problem
– Performance Reports for an overview summary
– MAP for an in-depth analysis
– Provide data for scaling studies
• Needs application knowledge to improve code
– Tools can only show the way and confirm improvement
© ARM
THANK YOU!
Vincent C. Betro, Ph.D.
Mathematics Instructor
Baylor School, Chattanooga, TN
See our full menu of live or recorded webinars:
https://www.allinea.com/performance-webinars-menu
To take a trial, visit:
https://www.allinea.com/get-your-free-allinea-forge-and-allinea-performance-reports-trial
To contact sales, email:
sales@allinea.com
Editor's Notes
  • #8–9: Growth in problem sizes and use of machine clusters. Communication doesn’t always scale linearly with problem size, so we need to understand how algorithms will scale. Hardware architecture and physical limitations.
  • #10: There is little one can do about a supercomputer’s network configuration (e.g. 3D torus vs. star network). Explore process and network speeds to determine the importance of the surface-to-volume ratio for a compute-bound problem.
  • #11: Explain how this can be seen through MAP! A function of the type of network and the distance between nodes; negligible unless on cloud!
  • #31: NOTE FROM ALLINEA: Is it necessary to discuss MPI imbalance?