Evaluation of Intel Omni-Path on the Intel
Knights Landing Processor
Carlos Rosales, Antonio Gómez-Iglesias
July 11, 2017
Goal
▶ Study and develop a set of best practices to use Intel Omni-Path (OPA) with Intel
Knights Landing (KNL)
▶ We used the Stampede-KNL Upgrade for these tests
▶ Intel 17.0.0 compiler and MPI
▶ OSU micro-benchmarks 5.3.2
▶ Our findings can be applied to Stampede2
▶ In the Top500 list of November 2016, 28 systems already used OPA (38 in June 2017)
OPA on KNL | July 11, 2017 | 2
Disclaimer (I)
▶ There are many moving parts here
▶ Results change quickly
▶ Some of the issues will be fixed
▶ Some won’t
▶ We provide a set of guidelines: good starting point
▶ If you care about performance, run your own tests
OPA on KNL | July 11, 2017 | 3
Disclaimer (II)
▶ This KNL-OPA thing has
been going on at TACC
for over a year now
▶ A lot of people at TACC
behind results like these
▶ Carlos and I just took
advantage of their work
OPA on KNL | July 11, 2017 | 4
Things We Care About
1. Inter-node bandwidth
2. Bandwidth saturation
3. Inter-node latency
4. MPI_Init times
5. MPI overhead
6. MPI collectives
OPA on KNL | July 11, 2017 | 5
Inter-node Bandwidth
▶ Study the bandwidth from core to core between two different nodes (see the measurement sketch below)
▶ 1 MPI task on each node
▶ Are there differences between cores?
▶ You see in multi-socket machines that there are differences between sockets
▶ But what about a system without sockets?
▶ OPA uses CPU cores to perform internal operations
▶ KNL cores are slow
OPA on KNL | July 11, 2017 | 6
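As a rough sketch of how such a sweep can be scripted (assuming Intel MPI 17 and OSU micro-benchmarks 5.3.2 as above; the node names and benchmark path are placeholders), one MPI task per node is pinned to a chosen core and the peak osu_bw figure is recorded for each core:

```bash
# Sweep the core used on each node and record the peak point-to-point bandwidth.
OSU_BW=./osu-micro-benchmarks-5.3.2/mpi/pt2pt/osu_bw   # placeholder path

for core in $(seq 0 67); do   # 68 physical cores on a KNL 7250
    # One MPI task per node, both pinned to logical core $core
    mpirun -np 2 -ppn 1 -hosts node1,node2 \
        -genv I_MPI_PIN_PROCESSOR_LIST "$core" \
        "$OSU_BW" | tail -n 1 | awk -v c="$core" '{print "core", c, ":", $2, "MB/s"}'
done
```

Plotting the per-core numbers from a loop like this is what reveals the "good" and "bad" cores discussed on the next slides.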
Inter-node Bandwidth (All-to-all and Quadrant)
[Figure: Inter-node bandwidth in Cache Quadrant mode]
[Figure: Inter-node bandwidth in Flat All-to-all mode]
OPA on KNL | July 11, 2017 | 7
Inter-node Bandwidth (SNC)
[Figure: Inter-node bandwidth in Flat SNC4 mode]
▶ SNC4 exposes the quadrants as 4 sockets: 2 good and 2 bad (see the sketch below)
▶ The bad core is actually worse than in
the other configurations
▶ The good core is also worse
OPA on KNL | July 11, 2017 | 8
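In Flat SNC4 mode each quadrant appears as its own NUMA node (four CPU+DDR4 nodes plus four MCDRAM-only nodes), which is why the quadrants behave like sockets. A minimal sketch for inspecting the layout and binding a run to one quadrant; the NUMA node numbers vary, so check them first, and ./my_benchmark is a placeholder:

```bash
# Show the SNC4 NUMA layout: which cores and how much memory belong to each node.
numactl --hardware

# Run a test bound to the cores and DDR4 memory of quadrant 0 only.
numactl --cpunodebind=0 --membind=0 ./my_benchmark
```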
Bandwidth Saturation
[Figure: Inter-node bandwidth (GB/s) with increasing number of MPI tasks per node (1-7). The bandwidth is nearly saturated for two point-to-point messaging tasks.]
▶ How many MPI tasks do we need on each node to saturate the network? (sweep sketch below)
▶ With 1 MPI task, we get approx. 72% of the saturation bandwidth (except for the good and bad cores)
▶ Quick saturation: 2 MPI tasks per node → 97% of saturation
OPA on KNL | July 11, 2017 | 9
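A sketch of the saturation sweep, using the OSU multiple-bandwidth/message-rate test so that every task pair crosses the network (node names and benchmark path are placeholders):

```bash
OSU_MBW=./osu-micro-benchmarks-5.3.2/mpi/pt2pt/osu_mbw_mr   # placeholder path

for tpn in 1 2 3 4 5 6 7; do
    echo "== ${tpn} task(s) per node =="
    # Block placement puts the first $tpn ranks on node1; osu_mbw_mr pairs each
    # of them with a partner rank on node2, so all pairs go over Omni-Path.
    mpirun -np $((2 * tpn)) -ppn "$tpn" -hosts node1,node2 "$OSU_MBW"
done
```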
Inter-node Latency - Early Results
[Figure: Inter-node latency (µs) from each core (0-67) for adjacent nodes with the first releases of the software stack; y-axis range 3.84-3.90 µs.]
▶ 2 nodes on the same chassis
▶ Average latency from each core in one
node to all cores in the other node
▶ Highly variable system (these results
were the best we got)
OPA on KNL | July 11, 2017 | 10
Inter-node Latency - Current Status
[Figure: Inter-node latency (µs) per core between two nodes, for three placements: same chassis, top-bottom rack, and two racks; y-axis range 3.1-3.5 µs.]
▶ Much lower latency
▶ No good and bad cores
▶ Tested different configurations to get
the best combination of bandwidth
and latency
▶ I_MPI_FABRICS=shm:tmi, I_MPI_TMI_PROVIDER=psm2 (see the sketch below)
OPA on KNL | July 11, 2017 | 11
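A minimal sketch of these settings in use, together with a latency check between one core on each of two nodes (node names and benchmark path are placeholders):

```bash
# Fabric selection used above, plus a latency check between two nodes.
export I_MPI_FABRICS=shm:tmi        # shared memory inside a node, TMI between nodes
export I_MPI_TMI_PROVIDER=psm2      # PSM2 is the Omni-Path interface under TMI

mpirun -np 2 -ppn 1 -hosts node1,node2 \
    ./osu-micro-benchmarks-5.3.2/mpi/pt2pt/osu_latency
```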
MPI_Init in Intel MPI
[Figure: MPI_Init time for the alltoall, cache, nocache, and lazy-cache algorithms at 10, 20, 30, and 40 nodes, normalized to alltoall; the y-axis (up to 25) shows how many times each algorithm is slower or faster than alltoall.]
▶ Intel MPI accepts 4 different
algorithms for MPI_Init
▶ alltoall
▶ cache
▶ nocache
▶ lazy-cache
▶ We wanted to identify the best
algorithm
▶ Recent releases of Intel MPI pick the best one by default
▶ At TACC, we still enforce the best one, just in case (see the sketch below)
OPA on KNL | July 11, 2017 | 12
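These four algorithms match the values of the Intel MPI I_MPI_HYDRA_PMI_CONNECT variable, which is one way to enforce a specific choice. A hedged sketch follows; the value alltoall is only illustrative, since the best choice depends on the system and library version, and ./mpi_hello is a placeholder binary:

```bash
# Force a startup (PMI connection) algorithm explicitly.
export I_MPI_HYDRA_PMI_CONNECT=alltoall   # alltoall | cache | nocache | lazy-cache

# Time the launch + MPI_Init of a trivial MPI program (2048 tasks, 32 per node).
time mpirun -np 2048 -ppn 32 ./mpi_hello
```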
MPI_Init
▶ KNL can run many MPI tasks
▶ Based on previous results you might think that the best scenario is to use as many
MPI tasks as possible
▶ Remember the amount of memory per node
▶ 16 GB MCDRAM
▶ 96 GB DDR4
▶ The fact that you can do something doesn’t mean you have to do it
OPA on KNL | July 11, 2017 | 13
MPI_Init Scaling
[Figure: MPI_Init time (s) vs. number of nodes (up to 64) for 1, 32, 64, 128, and 256 tasks per node (TPN); y-axis up to 500 s.]
▶ Tried different numbers of tasks per node (1, 32, 64, 128, 256)
▶ With 1 task per node, 64 tasks total
(largest case)
▶ 16384 tasks when using 64 nodes and
256 tasks per node
▶ Compare 128 tasks per node with 256 tasks per node at 64 nodes
OPA on KNL | July 11, 2017 | 14
MPI_Init Scaling
[Figure: MPI_Init time (s) vs. total number of MPI tasks (up to ~4500) for 2, 4, 8, 16, 32, and 64 nodes; y-axis up to 100 s.]
▶ Compare the results for 4096 tasks:
▶ Under 30 seconds when using 64
nodes (64 tasks per node)
▶ 60 seconds when using 32 nodes (128
tasks per node)
▶ 100 seconds for 16 nodes (256 tpn)
▶ Now check 2048 tasks: 9 s (64 nodes; 32 tpn), 14 s (32 nodes; 64 tpn), 27 s (16 nodes; 128 tpn), 50 s (8 nodes; 256 tpn)
▶ Keep the number of MPI tasks per
node low
▶ Reduced MPI overhead
▶ More memory per task
OPA on KNL | July 11, 2017 | 15
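A sketch of how the 4096-task comparison above can be reproduced with a trivial MPI program; ./mpi_hello is a placeholder, and the measured wall time includes launcher overhead on top of MPI_Init, so treat it as an upper bound:

```bash
# Launch the same 4096 tasks over 64, 32, and 16 nodes and time the startup.
for nodes in 64 32 16; do
    tpn=$((4096 / nodes))
    echo "== ${nodes} nodes x ${tpn} tasks per node =="
    time mpirun -np 4096 -ppn "$tpn" ./mpi_hello   # node list taken from the batch allocation
done
```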
MPI_Init Scaling
▶ Using too many tasks per node → limited memory per task
▶ MPI_Init will be (very) slow at large node counts with many MPI tasks per node
▶ Instabilities in the system
▶ Nodes can crash
▶ Consider MPI overhead and that individual cores are slow
OPA on KNL | July 11, 2017 | 16
MPI Collectives
▶ Probably the part that needs more work
▶ Crashes on early versions of the MPI library for some collectives at given message
sizes
▶ Intel MPI uses different algorithms for each collective
▶ Switches between algorithms depending on
▶ Number of nodes
▶ Tasks per node
▶ Message size
▶ It’s possible to force Intel MPI to use a specific algorithm
▶ Based on nodes, tasks per node, and/or message size (see the sketch below)
OPA on KNL | July 11, 2017 | 17
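One way to force a choice is the I_MPI_ADJUST_&lt;COLLECTIVE&gt; family of variables, which accept an algorithm ID optionally scoped by message-size ranges. The sketch below uses placeholder IDs and ranges, not the tuned values behind the plots on the next slide; the ID-to-algorithm mapping is in the Intel MPI reference and should be benchmarked per system:

```bash
# Force specific MPI_Allreduce algorithms by message-size range (placeholder values).
export I_MPI_ADJUST_ALLREDUCE="1:0-65536;2:65537-1048576"

# Check the effect with the OSU collective benchmark (e.g. 40 nodes x 64 tasks per node).
mpirun -np 2560 -ppn 64 ./osu-micro-benchmarks-5.3.2/mpi/collective/osu_allreduce
```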
MPI_Allreduce
[Figure: Time (µs) for MPI_Allreduce vs. message size (up to 1 MB) for 1, 2, 4, 10, 20, and 40 nodes and different task counts; y-axis up to 8000 µs.]
[Figure: Time (µs) for tuned MPI_Allreduce vs. message size (up to 1 MB) for the same node counts; y-axis up to 4000 µs.]
OPA on KNL | July 11, 2017 | 18
MPI_Allgather
[Figure: Time for MPI_Allgather with 20 nodes, 64 tasks per node, and different internal algorithms]
[Figure: Time for MPI_Allgather with 256 nodes, 32 tasks per node, and different internal algorithms]
[Figure: Time for MPI_Allgather with 256 nodes, 64 tasks per node, and different internal algorithms]
OPA on KNL | July 11, 2017 | 19
Stampede2
▶ 18 PF system
▶ #12 in the latest Top500 list
▶ Phase 1: KNL + OPA
▶ Will add SKX (Skylake) nodes
▶ While Stampede2 is easy
to use, having a set of
best practices is a good
idea
“Stampede2: The Evolution of an XSEDE Supercomputer”, Tomorrow 12:00pm - 12:30pm
OPA on KNL | July 11, 2017 | 20
Best Practices
▶ Use at least 2 MPI tasks per node to maximize achieved inter-node bandwidth.
▶ Use as few MPI tasks per node as possible when running large jobs to minimize
startup times.
▶ Configure the MPI library to use the best available algorithm for initialization.
▶ Keep the number of MPI tasks per node below or equal to the number of physical
cores to avoid performance and stability issues.
▶ Modify the default MPI collective algorithms configuration in order to improve
overall performance.
▶ Be aware of the cores that provide low and high MPI bandwidth.
▶ SNC4 memory mode introduces challenges not only in terms of effective
utilization, but also in terms of the bandwidth that each quadrant in the node can
achieve.
OPA on KNL | July 11, 2017 | 21
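A hedged sketch of a launch script that strings these practices together; the variable names are real Intel MPI controls, but every value (fabric, startup algorithm, collective override, 64 tasks per node, ./my_application) is illustrative and should be replaced by whatever your own measurements select:

```bash
#!/bin/bash
# All values below are illustrative; tune them with your own benchmarks.

export I_MPI_FABRICS=shm:tmi                 # fabric combination that tested best for us
export I_MPI_TMI_PROVIDER=psm2
export I_MPI_HYDRA_PMI_CONNECT=alltoall      # enforce an MPI_Init algorithm explicitly
export I_MPI_ADJUST_ALLREDUCE="2"            # example collective override (placeholder ID)

# 64 nodes x 64 tasks per node: at or below the 68 physical cores, and as low a
# task count per node as the application allows for large jobs.
mpirun -np 4096 -ppn 64 ./my_application
```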
Thank you!
agomez@tacc.utexas.edu
OPA on KNL | July 11, 2017 | 22