Alan F. Sill, PhD
Managing Director, High Performance Computing Center, Texas Tech University
On behalf of the TTU IT Division and HPCC Staff
AMD HPC User Forum — September 15-17, 2020
Design Considerations, Installation, and
Commissioning of the RedRaider
Cluster at the Texas Tech University
High Performance Computing Center
Outline of this talk
HPCC Staff and Students
Previous clusters
• History, performance, usage patterns, and experience
Motivation for Upgrades
• Compute Capacity Goals
• Related Considerations
Installation and Benchmarks
Conclusions and Q&A
HPCC Staff and Students (Fall 2020)
These people provide and support the TTU HPCC resources.
Staff members:
Alan Sill, PhD — Managing Director
Eric Rees, PhD — Assistant Managing Director
Chris Turner, PhD — Research Associate
Tom Brown, PhD — Research Associate
Amanda McConnell, BA — Administrative Business Assistant
Amy Wang, MSc — Programmer Analyst IV (Lead System Administrator)
Nandini Ramanathan, MSc — Programmer Analyst III
Sowmith Lakki-Reddy, MSc — System Administrator III
Graduate students:
Misha Ahmadian, Graduate Research Assistant
Undergraduate students:
Arthur Jones, Student Assistant
Nhi Nguyen, Student Assistant
Travis Turner, Student Assistant
Quanah I cluster (Commissioned March 2017)
4 racks, 243 nodes, 36 cores/node, 8,748 total Intel Broadwell cores.
100 Gbps non-blocking Omni-Path fabric. Benchmarked at 253 TF.
Quanah II Cluster (As Upgraded Nov. 2017)
• 10 racks, 467 Dell™ C6320 nodes
- 36 CPU cores/node Intel Xeon E5-2695
v4 (two 18-core sockets per node)
- 192 GB of RAM per node
- 16,812 worker node cores total
- Compute power: ~1 Tflop/s per node
- Benchmarked at 485 TF
• Operating System:
- CentOS 7.4.1708, 64-bit, Kernel 3.10
• High Speed Fabric:
- Intel™ Omni-Path, 100 Gbps/connection
- Non-blocking fat tree topology
- 12 core switches, 48 ports/switch
- 57.6 Tbit/s core throughput capacity
• Management/Control Network:
- Ethernet, 10 Gbps, sequential chained
switches, 36 ports per switch
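As a quick sanity check (not from the slide itself), the quoted 57.6 Tbit/s core throughput follows directly from the switch and port counts above; a minimal sketch, assuming every port on every core switch runs at the full 100 Gbps:

```python
# Core throughput of the Quanah Omni-Path fabric from the counts listed above.
core_switches = 12
ports_per_switch = 48
link_gbps = 100  # Omni-Path, 100 Gbps per connection

total_tbps = core_switches * ports_per_switch * link_gbps / 1000
print(f"Core throughput capacity: {total_tbps:.1f} Tbit/s")  # 57.6 Tbit/s
```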
Uptime and Utilization - Previous Cluster (Quanah)
[Chart: monthly uptime and utilization history, spanning the Quanah I period through the Quanah II upgrade.]
Job Size Patterns - Previous Cluster (Quanah)
[Charts: histograms of the most recent month's jobs binned by core count (1, 2-10, 11-100, 101-1000, 1001+). Left: "Jobs in Range", log scale up to ~100,000 jobs. Right: "Slots Taken in Range", linear scale up to ~600,000 slots.]
Typical usage pattern for jobs:
• Charts above show most recent month of
job activity
• Large number of small jobs (note log scale)
• Most jobs below 1000 cores
• Not unusual to see requests for several
thousand cores
Typical usage pattern for queue slots:
• Most cores consumed by jobs in the middle
(11-1000 cores/job) range
• Scheduling a job of more than 2000 cores
allocates ~1/8 of the cluster
• Some evidence of users self-limiting job
sizes to avoid long scheduling queue waits
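For reference, a minimal sketch of how histograms like these could be produced from scheduler accounting data. It assumes Slurm's sacct; the Quanah-era charts came from that cluster's own scheduler logs, so the command and parsing below are illustrative assumptions, not the HPCC's actual reporting tool.

```python
#!/usr/bin/env python3
"""Bin one month of completed jobs by allocated core count, mirroring the two
histograms on this slide ("Jobs in Range" and "Slots Taken in Range")."""
import subprocess
from collections import Counter

BINS = [(1, 1, "1"), (2, 10, "2-10"), (11, 100, "11-100"),
        (101, 1000, "101-1000"), (1001, float("inf"), "1001+")]

def bin_label(cores):
    for lo, hi, label in BINS:
        if lo <= cores <= hi:
            return label
    return None

# One line per completed allocation: "JobID|AllocCPUS"
raw = subprocess.run(
    ["sacct", "-a", "-X", "-n", "-P", "-S", "now-30days",
     "--state=COMPLETED", "--format=JobID,AllocCPUS"],
    capture_output=True, text=True, check=True).stdout

jobs, slots = Counter(), Counter()
for line in raw.splitlines():
    _, alloc = line.split("|")
    cores = int(alloc or 0)
    label = bin_label(cores)
    if label:
        jobs[label] += 1        # contributes to "Jobs in Range"
        slots[label] += cores   # contributes to "Slots Taken in Range"

for _, _, label in BINS:
    print(f"{label:>10}: {jobs[label]:>8} jobs  {slots[label]:>10} slots")
```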
RedRaider Design Goals
1. Add at least 1 petaflops total computing capacity beyond existing Quanah cluster.
2. Fit within existing cooling capacity and recently expanded power limits, which
means that approximately two-thirds of the new power used by the cluster must be
removed through direct liquid cooling to stay within room air cooling limits.
3. Coalesce operation of the existing Quanah and older Ivy Bridge and Community
Cluster nodes with the addition above, to be operated as a single cluster.
4. Streamline storage network utilizing LNet routers to connect all components to
expanded central storage based on 200 Gbps Mellanox HDR fabric.
5. Chosen path: Addition of 240 nodes (30,720 cores) of dual 64-core AMD
Rome processors + 20 GPU nodes with 2 Nvidia GPU accelerators per node.
This new cluster will eventually include the new AMD Rome CPU, NVidia GPU, previous Intel Broadwell cluster, and other previous specialty queues operated as Slurm partitions.
RedRaider cluster (Delivered July 2020)
CPUs: 256 physical cores per rack unit; 200 Gbps non-blocking HDR200 Infiniband fabric
GPUs: 2 NVidia V100s per two rack units; 100 Gbps HDR100 Infiniband to HDR200 core
Why non-blocking HDR200?
• Previous experience with Quanah and earlier clusters shows that being able to place jobs anywhere,
without scheduling into islands in the fabric, keeps scheduling simple and allows a high degree of
utilization of the cluster.
• The increase in density puts high demands on the fabric in terms of bandwidth per core.
This figure of merit is actually lower for the RedRaider Nocona CPUs than for the previous
Quanah cluster (illustrated in the sketch after this list).
• Simple fat-tree non-blocking arrangements with multiple core-to-leaf links per switch
provide redundancy and resilience in the event of cable or connector failures.
• User community has sometimes asked for jobs of many thousands of cores.
• Compare and contrast to Expanse design: RedRaider uses full non-blocking fabric;
Expanse is non-blocking within racks with modest oversubscription between rack groups.
Overall cost at our scale is not that different.
• Given the density of our racks and relatively small size of the overall Infiniband fabric,
reaching full non-blocking did not add very much to the cost.
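Illustrative numbers behind two of the bullets above. The per-core bandwidth figures use the node specs quoted elsewhere in this deck; the fat-tree sizing is only a sketch that assumes generic 40-port HDR switches (the actual RedRaider switch and cable counts are not given on this slide).

```python
import math

# Fabric bandwidth per core is lower on RedRaider Nocona than on Quanah.
quanah_gbps_per_core = 100 / 36    # Omni-Path 100 Gbps, 36 cores/node
nocona_gbps_per_core = 200 / 128   # HDR200, 128 cores/node
print(f"Quanah : {quanah_gbps_per_core:.2f} Gbps/core")   # ~2.78
print(f"Nocona : {nocona_gbps_per_core:.2f} Gbps/core")   # ~1.56

# Two-level non-blocking fat tree sized for the 240 CPU nodes alone.
nodes, switch_ports = 240, 40        # 40-port HDR switches (assumption)
down_per_leaf = switch_ports // 2    # non-blocking: equal down- and up-links
leaves = math.ceil(nodes / down_per_leaf)
uplinks = leaves * down_per_leaf
spines = math.ceil(uplinks / switch_ports)
print(f"{leaves} leaf + {spines} spine switches, {uplinks} core-to-leaf links")
```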
RedRaider cluster initial installation
Photos: front view; back view of CPU racks
20 GPU nodes (40 GPUs total): Dell R740, 2 GPUs/node, 256 GB main memory/node, air cooled
240 CPU nodes: Dell C6525, two Rome 7702s and 512 GB memory/node, liquid cooled
RedRaider cluster initial installation - close-ups
Photos: secondary cooling line installation under the floor; back-view close-up of cooling lines in a CPU rack; interior of an example 1/2-U C6525 CPU worker node
RedRaider final installation w/ cold-aisle enclosure
Benchmark (GPUs): 226 Tflops across 20 nodes, 80.6% efficiency
Benchmark (CPUs): 804 Tflops across 240 nodes, 81.4% efficiency
Total cluster performance: 1030 Tflops
(A rough cross-check of these figures appears below.)
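A rough cross-check of the benchmark figures above. Node and core counts come from the cluster summary slides; the 16 double-precision FLOPs/cycle for Rome and the ~7.0 Tflops V100 double-precision peak are assumptions not stated in this deck, so the computed efficiencies (~81.8% and ~80.7%) agree with the quoted 81.4% and 80.6% only to within a percent (the official Rpeak values presumably used slightly different clocks).

```python
# CPU partition: 240 nodes x 128 cores at 2.0 GHz, assumed 16 DP FLOPs/cycle.
cpu_cores, cpu_ghz, flops_per_cycle = 240 * 128, 2.0, 16
cpu_rpeak_tf = cpu_cores * cpu_ghz * flops_per_cycle / 1000   # ~983 Tflops
cpu_rmax_tf = 804
print(f"CPU: Rpeak ~{cpu_rpeak_tf:.0f} TF, efficiency ~{cpu_rmax_tf / cpu_rpeak_tf:.1%}")

# GPU partition: 20 nodes x 2 V100s, assumed ~7.0 Tflops DP peak per GPU.
gpus, v100_dp_tf = 20 * 2, 7.0
gpu_rpeak_tf = gpus * v100_dp_tf                              # ~280 Tflops
gpu_rmax_tf = 226
print(f"GPU: Rpeak ~{gpu_rpeak_tf:.0f} TF, efficiency ~{gpu_rmax_tf / gpu_rpeak_tf:.1%}")

print(f"Total benchmarked: {cpu_rmax_tf + gpu_rmax_tf} Tflops")  # 1030 Tflops
```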
Software Environment (in progress, starting deployment)
Operating System
• CentOS 8.1 *
Cluster Management Software
• OpenHPC 2.0
Infiniband Drivers
• Mellanox OFED 5.0-2.1.8.0 *
Cluster-Wide File System
• Lustre 2.12.5 *
BMC Firmware
• Dell iDRAC 4.10.10.10 *
Image and Node Provisioning
• Warewulf 3.9.0
• rpmbuild 4.14.2
Job Resource Manager (batch scheduler)
• Slurm 20.02.3-1
Package Build Environment
• Spack 0.15.4
Software Deployment Environment
• LMod 8.2.10
Other Conditions and Tools
• Single-instance Slurm DBD and job status area (investigating shared-mount NVMe-oF for job status)
• Dual NFS 2.3.3 (HA mode) for applications
• GNU compilers made available through Spack and LMod. Others (NVidia HPC-X, AOCC) also loadable as alternatives through LMod.
• Cluster will also have Open OnDemand access.
* Had to fall back to previous version in each starred case above to get consistent deployable conditions
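A hypothetical commissioning-style check that a node reports the versions listed above. The target strings are copied from this slide; the probe commands and expected output substrings are assumptions and would need adjusting to the actual deployment.

```python
#!/usr/bin/env python3
"""Compare versions reported on a node against the stack listed on this slide."""
import subprocess

CHECKS = {
    "CentOS":        (["cat", "/etc/centos-release"], "8.1"),
    "Mellanox OFED": (["ofed_info", "-s"],            "5.0-2.1.8.0"),
    "Lustre client": (["lfs", "--version"],           "2.12.5"),
    "Slurm":         (["scontrol", "--version"],      "20.02.3"),
    "Spack":         (["spack", "--version"],         "0.15.4"),
}

for name, (cmd, want) in CHECKS.items():
    try:
        out = subprocess.run(cmd, capture_output=True, text=True).stdout.strip()
    except FileNotFoundError:
        out = "(command not found)"
    status = "OK" if want in out else "??"
    print(f"{status}  {name:<14} expected {want:<12} got: {out}")
```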
Total Compute Capacity versus Fiscal Year: 2011 - 2021 (All Clusters)
[Chart 1: HPCC Total Theoretical Capacity by Cluster (Teraflops), TTU-Owned and Researcher-Owned Systems, FY2011 through FY2021. Clusters shown: Campus Grid; Janus, Weland; Hrothgar Westmere; Hrothgar Ivy Bridge; Quanah (public); Lonestar 4/5; RedRaider Nocona CPU (public); RedRaider Matador GPU; RedRaider Nocona CPU (researchers); Hrothgar CC; Quanah HEP; Realtime/Realtime2; Antaeus HEP; Nepag; Discfarm. Yearly totals (TF): 105.0, 105.1, 109.9, 132.2, 150.6, 152.4, 417.5, 641.0, 629.0, 581.0, 1536.0.]
[Chart 2: HPCC Total Usable Capacity (Teraflops, 80% of Theoretical Peak), FY2011 through FY2021, showing the successive additions of Hrothgar, Ivy Bridge, Ivy + CC, Quanah I, Quanah II, and RedRaider, with gradual retirement of Hrothgar Westmere.]
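The lower chart is simply the theoretical-capacity series scaled by the HPCC's 80%-of-peak rule for usable capacity; a minimal sketch of that conversion, using the yearly totals read off the upper chart:

```python
# Usable capacity = 80% of theoretical peak, per the chart caption above.
theoretical_tf = [105.0, 105.1, 109.9, 132.2, 150.6, 152.4,
                  417.5, 641.0, 629.0, 581.0, 1536.0]          # FY2011-FY2021
for year, tf in zip(range(2011, 2022), theoretical_tf):
    print(f"FY{year}: {0.80 * tf:7.1f} TF usable")
```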
Design goals for the RedRaider cluster:
• Add at least 1 PF of overall compute capacity
• Allow retirement of older Hrothgar cluster
• Merge operation of primary clusters
• Support GPU computing
Practical restrictions:
• < 300 kVA power usage
• Fit within 8 racks of available floor space
• Stay within existing cooling capacity
• Commission by early FY 2021
Questions for this group
We look to this forum to help with the following community topics and issues:
• Application building
• Spack, EasyBuild recipes
• Compatible compilers and libraries
• Benchmarking and Workload Optimization
• Processor Settings
• NUMA and Memory Settings
• I/O and PCIe Settings
• On-the-fly versus fixed permanent choice settings
• Safe conditions for liquid vs. air-cooled operation
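On the NUMA and processor-settings question: a minimal, hypothetical sketch of recording the topology a dual-socket Rome node actually presents (which changes with BIOS NPS and SMT choices) before each benchmark run, so results can be tied to a known configuration. The commands used (lscpu, numactl) are standard tools, but this is illustrative only, not the HPCC's tooling.

```python
#!/usr/bin/env python3
"""Snapshot CPU and NUMA topology of a node ahead of a benchmark run."""
import platform
import subprocess

def capture(cmd):
    try:
        return subprocess.run(cmd, capture_output=True, text=True).stdout
    except FileNotFoundError:
        return f"({cmd[0]} not available)\n"

report = {
    "host": platform.node() + "\n",
    "lscpu": capture(["lscpu"]),                 # sockets, cores, SMT threads
    "numa": capture(["numactl", "--hardware"]),  # NUMA node count per NPS setting
}

for key, value in report.items():
    print(f"===== {key} =====")
    print(value)
```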
Conclusions
We have designed, installed, and commissioned a cluster that delivers the desired
performance level of >1 PF in less than 7 racks, with space left over for future expansion.
The cluster was designed to allow an increase in the average job size for large jobs while
still delivering good performance for small and medium-size jobs, with simple scheduling
due to non-blocking fat tree Infiniband.
Overall, adding a new cluster based on AMD Rome CPUs and NVidia GPUs allowed us to put more
than twice as much computing capacity in two-thirds of the space of our existing cluster while
staying within our desired power and cooling profile. Since we also retain and still run the
previous cluster, the overall capacity has nearly tripled as a result of this addition.
Future expansions are possible.
Backup slides
Primary Cluster Utilization
Hrothgar cluster averaged about 80% utilization before addition of Quanah phase I,
commissioned in April 2017.
Addition of Quanah I roughly doubled the total number of utilized cores to ~16,000, with a
total number of available cores of ~18,000 across all of HPCC.
Addition of Quanah II in November 2017 required decreasing size of Hrothgar to make
everything fit, but still led to over 20,000 cores in regular use. Power limits and campus
research network bandwidth restrictions prevented running all of former Hrothgar.
Power limits in ESB solved with new UPS and generator upgrade in FY 2019. (Campus
research network limits will be addressed in upcoming upgrades to be discussed here.)
Quanah utilization has been extremely good, beginning with Quanah I in early 2017 and
extending on to current state with nearly 95% calendar uptime and near-100% usage.
Former Hrothgar systems exceeded expected end of life (oldest components 8 years old).
Primary HPCC Clusters With RedRaider Expansion
Quanah Cluster - Omni CPU
• 16,812 cores (467 nodes, 36 cores/node, 192 GB/node), Intel Xeon E5-2695 v4 @ 2.1 GHz
• Dell C6300 4-node 2U enclosures
• Benchmarked 486 Teraflop/s (464/467 nodes)
• Omni-Path non-blocking 100 Gbit/s fabric for MPI communication & storage
• 10 Gb/s Ethernet network for cluster management
• Total power drawn by Quanah cluster: ~200 kW
RedRaider Cluster - Nocona CPU
• 30,720 cores (240 nodes, 128 cores/node, 512 GB/node), AMD Rome 7702 @ 2.0 GHz
• Dell C6525 4-node 2U enclosures, direct liquid cooled CPUs
• Benchmarked 804 Teraflop/s (240 nodes)
• Mellanox non-blocking 200 Gbit/s fabric for MPI communication & storage
• 25 Gb/s Ethernet network for cluster management
• Total power drawn: ~150 kW
RedRaider Cluster - Matador GPU
• 40 NVidia V100 GPUs; 640 CPU + 25,600 tensor + 204,800 CUDA cores total
• Dell R740 2U host nodes, 2 GPUs/node
• Benchmarked 226 Teraflop/s (20 nodes)
• Mellanox non-blocking 100 Gbit/s fabric for MPI communication & storage
• 25 Gb/s Ethernet network for cluster management
• Total power drawn: ~21 kW
• Total power for RedRaider cluster including LNet routers: ~180 kW
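A back-of-the-envelope power budget from the figures above. The kW numbers are taken from this slide; the power factor used to compare against the 300 kVA design limit is an assumption.

```python
# Sum the RedRaider partitions and compare against the < 300 kVA design limit.
nocona_kw, matador_kw, redraider_total_kw = 150, 21, 180
lnet_and_misc_kw = redraider_total_kw - nocona_kw - matador_kw   # ~9 kW
power_factor = 0.95                                              # assumed
print(f"LNet routers and misc.: ~{lnet_and_misc_kw} kW")
print(f"RedRaider: ~{redraider_total_kw} kW, ~{redraider_total_kw / power_factor:.0f} kVA "
      f"(design limit was < 300 kVA)")
print(f"Running alongside Quanah (~200 kW): ~{redraider_total_kw + 200} kW combined")
```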
