Evaluation of Intel Omni-Path on the Intel
Knights Landing Processor
Carlos Rosales, Antonio Gómez-Iglesias
July 11, 2017
Goal
▶ Study and develop a set of best practices to use Intel Omni-Path (OPA) with Intel
Knights Landing (KNL)
▶ We used the Stampede-KNL Upgrade for these tests
▶ Intel 17.0.0 compiler and MPI
▶ OSU micro-benchmarks 5.3.2
▶ Our findings can be applied to Stampede2
▶ In the Top500 list of November 2016, 28 systems already used OPA (38 in June 2017)
OPA on KNL | July 11, 2017 | 2
Disclaimer (I)
▶ There are many moving parts here
▶ Results change quickly
▶ Some of the issues will be fixed
▶ Some won’t
▶ We provide a set of guidelines: good starting point
▶ If you care about performance, run your own tests
OPA on KNL | July 11, 2017 | 3
Disclaimer (II)
▶ This KNL-OPA thing has
been going on at TACC
for over a year now
▶ A lot of people at TACC
behind results like these
▶ Carlos and I just took
advantage of their work
OPA on KNL | July 11, 2017 | 4
Things We Care About
1. Inter-node bandwidth
2. Bandwidth saturation
3. Inter-node latency
4. MPI_Init times
5. MPI overhead
6. MPI collectives
OPA on KNL | July 11, 2017 | 5
Inter-node Bandwidth
▶ Study the bandwidth from core to core between two different nodes (see the measurement sketch below)
▶ 1 MPI task on each node
▶ Are there differences between cores?
▶ You see in multi-socket machines that there are differences between sockets
▶ But what about a system without sockets?
▶ OPA uses CPU cores to perform internal operations
▶ KNL cores are slow
OPA on KNL | July 11, 2017 | 6
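As a rough sketch of how such a sweep can be scripted (assuming Intel MPI 17 and OSU micro-benchmarks 5.3.2 as above; the node names and benchmark path are placeholders), one MPI task per node is pinned to a chosen core and the peak osu_bw figure is recorded for each core:

```bash
# Sweep the core used on each node and record the peak point-to-point bandwidth.
OSU_BW=./osu-micro-benchmarks-5.3.2/mpi/pt2pt/osu_bw   # placeholder path

for core in $(seq 0 67); do   # 68 physical cores on a KNL 7250
    # One MPI task per node, both pinned to logical core $core
    mpirun -np 2 -ppn 1 -hosts node1,node2 \
        -genv I_MPI_PIN_PROCESSOR_LIST "$core" \
        "$OSU_BW" | tail -n 1 | awk -v c="$core" '{print "core", c, ":", $2, "MB/s"}'
done
```

Plotting the per-core numbers from a loop like this is what reveals the "good" and "bad" cores discussed on the next slides.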
Inter-node Bandwidth (All-to-all and Quadrant)
[Figure: Inter-node bandwidth in Cache Quadrant mode]
[Figure: Inter-node bandwidth in Flat All-to-all mode]
OPA on KNL | July 11, 2017 | 7
Inter-node Bandwidth (SNC)
[Figure: Inter-node bandwidth in Flat SNC4 mode]
▶ SNC4 exposes the quadrants as 4 sockets: 2 good and 2 bad (see the sketch below)
▶ The bad core is actually worse than in
the other configurations
▶ The good core is also worse
OPA on KNL | July 11, 2017 | 8
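In Flat SNC4 mode each quadrant appears as its own NUMA node (four CPU+DDR4 nodes plus four MCDRAM-only nodes), which is why the quadrants behave like sockets. A minimal sketch for inspecting the layout and binding a run to one quadrant; the NUMA node numbers vary, so check them first, and ./my_benchmark is a placeholder:

```bash
# Show the SNC4 NUMA layout: which cores and how much memory belong to each node.
numactl --hardware

# Run a test bound to the cores and DDR4 memory of quadrant 0 only.
numactl --cpunodebind=0 --membind=0 ./my_benchmark
```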
Bandwidth Saturation
[Figure: Inter-node bandwidth (GB/s) with increasing number of MPI tasks per node (1-7). The bandwidth is nearly saturated for two point-to-point messaging tasks.]
▶ How many MPI tasks do we need on each node to saturate the network? (sweep sketch below)
▶ With 1 MPI task, we get approx. 72% of the saturation bandwidth (except for the good and bad cores)
▶ Quick saturation: 2 MPI tasks per node → 97% of saturation
OPA on KNL | July 11, 2017 | 9
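A sketch of the saturation sweep, using the OSU multiple-bandwidth/message-rate test so that every task pair crosses the network (node names and benchmark path are placeholders):

```bash
OSU_MBW=./osu-micro-benchmarks-5.3.2/mpi/pt2pt/osu_mbw_mr   # placeholder path

for tpn in 1 2 3 4 5 6 7; do
    echo "== ${tpn} task(s) per node =="
    # Block placement puts the first $tpn ranks on node1; osu_mbw_mr pairs each
    # of them with a partner rank on node2, so all pairs go over Omni-Path.
    mpirun -np $((2 * tpn)) -ppn "$tpn" -hosts node1,node2 "$OSU_MBW"
done
```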
Inter-node Latency - Early Results
[Figure: Inter-node latency (µs) from each core (0-67) for adjacent nodes with the first releases of the software stack; y-axis range 3.84-3.90 µs.]
▶ 2 nodes on the same chassis
▶ Average latency from each core in one
node to all cores in the other node
▶ Highly variable system (these results
were the best we got)
OPA on KNL | July 11, 2017 | 10
Inter-node Latency - Current Status
[Figure: Inter-node latency (µs) per core between two nodes, for three placements: same chassis, top-bottom rack, and two racks; y-axis range 3.1-3.5 µs.]
▶ Much lower latency
▶ No good and bad cores
▶ Tested different configurations to get
the best combination of bandwidth
and latency
▶ I_MPI_FABRICS=shm:tmi, I_MPI_TMI_PROVIDER=psm2 (see the sketch below)
OPA on KNL | July 11, 2017 | 11
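A minimal sketch of these settings in use, together with a latency check between one core on each of two nodes (node names and benchmark path are placeholders):

```bash
# Fabric selection used above, plus a latency check between two nodes.
export I_MPI_FABRICS=shm:tmi        # shared memory inside a node, TMI between nodes
export I_MPI_TMI_PROVIDER=psm2      # PSM2 is the Omni-Path interface under TMI

mpirun -np 2 -ppn 1 -hosts node1,node2 \
    ./osu-micro-benchmarks-5.3.2/mpi/pt2pt/osu_latency
```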
MPI_Init in Intel MPI
[Figure: MPI_Init time for the alltoall, cache, nocache, and lazy-cache algorithms at 10, 20, 30, and 40 nodes, normalized to alltoall; the y-axis (up to 25) shows how many times each algorithm is slower or faster than alltoall.]
▶ Intel MPI accepts 4 different
algorithms for MPI_Init
▶ alltoall
▶ cache
▶ nocache
▶ lazy-cache
▶ We wanted to identify the best
algorithm
▶ Recent releases of Intel MPI pick the best one by default
▶ At TACC, we still enforce the best one, just in case (see the sketch below)
OPA on KNL | July 11, 2017 | 12
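These four algorithms match the values of the Intel MPI I_MPI_HYDRA_PMI_CONNECT variable, which is one way to enforce a specific choice. A hedged sketch follows; the value alltoall is only illustrative, since the best choice depends on the system and library version, and ./mpi_hello is a placeholder binary:

```bash
# Force a startup (PMI connection) algorithm explicitly.
export I_MPI_HYDRA_PMI_CONNECT=alltoall   # alltoall | cache | nocache | lazy-cache

# Time the launch + MPI_Init of a trivial MPI program (2048 tasks, 32 per node).
time mpirun -np 2048 -ppn 32 ./mpi_hello
```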
MPI_Init
▶ KNL can run many MPI tasks
▶ Based on previous results you might think that the best scenario is to use as many
MPI tasks as possible
▶ Remember the amount of memory per node
▶ 16 GB MCDRAM
▶ 96 GB DDR4
▶ The fact that you can do something doesn’t mean you have to do it
OPA on KNL | July 11, 2017 | 13
MPI_Init Scaling
[Figure: MPI_Init time (s) vs. number of nodes (up to 64) for 1, 32, 64, 128, and 256 tasks per node (TPN); y-axis up to 500 s.]
▶ Tried different numbers of tasks per node (1, 32, 64, 128, 256)
▶ With 1 task per node, 64 tasks total
(largest case)
▶ 16384 tasks when using 64 nodes and
256 tasks per node
▶ Compare 128 tasks per node with 256 tasks per node at 64 nodes
OPA on KNL | July 11, 2017 | 14
MPI_Init Scaling
[Figure: MPI_Init time (s) vs. total number of MPI tasks (up to ~4500) for 2, 4, 8, 16, 32, and 64 nodes; y-axis up to 100 s.]
▶ Compare the results for 4096 tasks:
▶ Under 30 seconds when using 64
nodes (64 tasks per node)
▶ 60 seconds when using 32 nodes (128
tasks per node)
▶ 100 seconds for 16 nodes (256 tpn)
▶ Now check 2048 tasks: 9 s (64 nodes; 32 tpn), 14 s (32 nodes; 64 tpn), 27 s (16 nodes; 128 tpn), 50 s (8 nodes; 256 tpn)
▶ Keep the number of MPI tasks per
node low
▶ Reduced MPI overhead
▶ More memory per task
OPA on KNL | July 11, 2017 | 15
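A sketch of how the 4096-task comparison above can be reproduced with a trivial MPI program; ./mpi_hello is a placeholder, and the measured wall time includes launcher overhead on top of MPI_Init, so treat it as an upper bound:

```bash
# Launch the same 4096 tasks over 64, 32, and 16 nodes and time the startup.
for nodes in 64 32 16; do
    tpn=$((4096 / nodes))
    echo "== ${nodes} nodes x ${tpn} tasks per node =="
    time mpirun -np 4096 -ppn "$tpn" ./mpi_hello   # node list taken from the batch allocation
done
```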
MPI_Init Scaling
▶ Using too many tasks per node → limited memory per task
▶ MPI_Init will be (very) slow at large node counts with many MPI tasks per node
▶ Instabilities in the system
▶ Nodes can crash
▶ Consider MPI overhead and that individual cores are slow
OPA on KNL | July 11, 2017 | 16
MPI Collectives
▶ Probably the part that needs more work
▶ Crashes on early versions of the MPI library for some collectives at given message
sizes
▶ Intel MPI uses different algorithms for each collective
▶ Switches between algorithms depending on
▶ Number of nodes
▶ Tasks per node
▶ Message size
▶ It’s possible to force Intel MPI to use a specific algorithm
▶ Based on nodes, tasks per node, and/or message size (see the sketch below)
OPA on KNL | July 11, 2017 | 17
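One way to force a choice is the I_MPI_ADJUST_&lt;COLLECTIVE&gt; family of variables, which accept an algorithm ID optionally scoped by message-size ranges. The sketch below uses placeholder IDs and ranges, not the tuned values behind the plots on the next slide; the ID-to-algorithm mapping is in the Intel MPI reference and should be benchmarked per system:

```bash
# Force specific MPI_Allreduce algorithms by message-size range (placeholder values).
export I_MPI_ADJUST_ALLREDUCE="1:0-65536;2:65537-1048576"

# Check the effect with the OSU collective benchmark (e.g. 40 nodes x 64 tasks per node).
mpirun -np 2560 -ppn 64 ./osu-micro-benchmarks-5.3.2/mpi/collective/osu_allreduce
```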
MPI_Allreduce
[Figure: Time (µs) for MPI_Allreduce vs. message size (up to 1 MB) for 1, 2, 4, 10, 20, and 40 nodes and different task counts; y-axis up to 8000 µs.]
[Figure: Time (µs) for tuned MPI_Allreduce vs. message size (up to 1 MB) for the same node counts; y-axis up to 4000 µs.]
OPA on KNL | July 11, 2017 | 18
MPI_Allgather
[Figure: Time for MPI_Allgather with 20 nodes, 64 tasks per node, and different internal algorithms]
[Figure: Time for MPI_Allgather with 256 nodes, 32 tasks per node, and different internal algorithms]
[Figure: Time for MPI_Allgather with 256 nodes, 64 tasks per node, and different internal algorithms]
OPA on KNL | July 11, 2017 | 19
Stampede2
▶ 18 PF system
▶ #12 in the latest Top500 list
▶ Phase 1: KNL + OPA
▶ Will add SKX (Skylake) nodes
▶ While Stampede2 is easy
to use, having a set of
best practices is a good
idea
“Stampede2: The Evolution of an XSEDE Supercomputer”, Tomorrow 12:00pm - 12:30pm
OPA on KNL | July 11, 2017 | 20
Best Practices
▶ Use at least 2 MPI tasks per node to maximize achieved inter-node bandwidth.
▶ Use as few MPI tasks per node as possible when running large jobs to minimize
startup times.
▶ Configure the MPI library to use the best available algorithm for initialization.
▶ Keep the number of MPI tasks per node below or equal to the number of physical
cores to avoid performance and stability issues.
▶ Modify the default MPI collective algorithms configuration in order to improve
overall performance.
▶ Be aware of the cores that provide low and high MPI bandwidth.
▶ SNC4 memory mode introduces challenges not only in terms of effective
utilization, but also in terms of the bandwidth that each quadrant in the node can
achieve.
OPA on KNL | July 11, 2017 | 21
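A hedged sketch of a launch script that strings these practices together; the variable names are real Intel MPI controls, but every value (fabric, startup algorithm, collective override, 64 tasks per node, ./my_application) is illustrative and should be replaced by whatever your own measurements select:

```bash
#!/bin/bash
# All values below are illustrative; tune them with your own benchmarks.

export I_MPI_FABRICS=shm:tmi                 # fabric combination that tested best for us
export I_MPI_TMI_PROVIDER=psm2
export I_MPI_HYDRA_PMI_CONNECT=alltoall      # enforce an MPI_Init algorithm explicitly
export I_MPI_ADJUST_ALLREDUCE="2"            # example collective override (placeholder ID)

# 64 nodes x 64 tasks per node: at or below the 68 physical cores, and as low a
# task count per node as the application allows for large jobs.
mpirun -np 4096 -ppn 64 ./my_application
```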
Thank you!
agomez@tacc.utexas.edu
OPA on KNL | July 11, 2017 | 22