Performance Evaluation,
Scalability Analysis, and
Optimization Tuning of Altair HyperWorks
on a Modern HPC Compute Cluster
Pak Lui
pak@hpcadvisorycouncil.com
May 7, 2015
Copyright © 2015 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.
Agenda
• Introduction to HPC Advisory Council
• Benchmark Configuration
• Performance Benchmark Testing and Results
• MPI Profiling
• Summary
• Q&A / For More Information
• World-wide HPC organization (400+ members)
• Bridges the gap between HPC usage and its full potential
• Provides best practices and a support/development center
• Explores future technologies and developments
• Working Groups – HPC|Cloud, HPC|Scale, HPC|GPU, HPC|Storage
• Leading edge solutions and technology demonstrations
The HPC Advisory Council
HPC Advisory Council Members
• HPC Advisory Council Chairman
Gilad Shainer - gilad@hpcadvisorycouncil.com
• HPC Advisory Council Media Relations and Events Director
Brian Sparks - brian@hpcadvisorycouncil.com
• HPC Advisory Council China Events Manager
Blade Meng - blade@hpcadvisorycouncil.com
• Director of the HPC Advisory Council, Asia
Tong Liu - tong@hpcadvisorycouncil.com
• HPC Advisory Council HPC|Works SIG Chair
and Cluster Center Manager
Pak Lui - pak@hpcadvisorycouncil.com
• HPC Advisory Council Director of Educational Outreach
Scot Schultz – scot@hpcadvisorycouncil.com
• HPC Advisory Council Programming Advisor
Tarick Bedeir - Tarick@hpcadvisorycouncil.com
• HPC Advisory Council HPC|Scale SIG Chair
Richard Graham – richard@hpcadvisorycouncil.com
• HPC Advisory Council HPC|Cloud SIG Chair
William Lu – william@hpcadvisorycouncil.com
• HPC Advisory Council HPC|GPU SIG Chair
Sadaf Alam – sadaf@hpcadvisorycouncil.com
• HPC Advisory Council India Outreach
Goldi Misra – goldi@hpcadvisorycouncil.com
• Director of the HPC Advisory Council Switzerland Center of
Excellence and HPC|Storage SIG Chair
Hussein Harake – hussein@hpcadvisorycouncil.com
• HPC Advisory Council Workshop Program Director
Eric Lantz – eric@hpcadvisorycouncil.com
• HPC Advisory Council Research Steering Committee
Director
Cydney Stevens - cydney@hpcadvisorycouncil.com
HPC Council Board
HPC Advisory Council HPC Center
[Diagram: HPC Advisory Council cluster center systems – Juniper, Heimdall, Plutus, Janus, Athena, Vesta, Thor, and Mala clusters (ranging from 80 to 896 cores, plus a 16-GPU system), backed by InfiniBand-based Lustre storage]
• HPC|Scale
• To explore usage of commodity HPC as a replacement for multi-million dollar mainframes and proprietary-based supercomputers, with networks and clusters of microcomputers acting in unison to deliver high-end computing services.
• HPC|Cloud
• To explore usage of HPC components as part of the creation of external/public/internal/private cloud
computing environments.
• HPC|Works
• To provide best practices for building balanced and scalable HPC systems, performance tuning and
application guidelines.
• HPC|Storage
• To demonstrate how to build high-performance storage solutions and their effect on application performance
and productivity. One of the main interests of the HPC|Storage subgroup is to explore Lustre-based solutions,
and to expose more users to the potential of Lustre over high-speed networks.
• HPC|GPU
• To explore usage models of GPU components as part of next generation compute environments and potential
optimizations for GPU based computing.
• HPC|FSI
• To explore the usage of high-performance computing solutions for low latency trading,
more productive simulations (such as Monte Carlo) and overall more efficient financial services.
Special Interest Subgroups Missions
• HPC Advisory Council (HPCAC)
• 400+ members
• http://www.hpcadvisorycouncil.com/
• Application best practices, case studies (Over 150)
• Benchmarking center with remote access for users
• World-wide workshops
• Value add for your customers to stay up to date and in tune with the HPC market
• 2015 Workshops
• USA (Stanford University) – February 2015
• Switzerland – March 2015
• Brazil – August 2015
• Spain – September 2015
• China (HPC China) – Oct 2015
• For more information
• www.hpcadvisorycouncil.com
• info@hpcadvisorycouncil.com
HPC Advisory Council
• University-based teams to compete and demonstrate the incredible
capabilities of state-of-the-art HPC systems and applications on the
2015 ISC High Performance Conference show-floor
• The Student Cluster Competition is designed to introduce the next
generation of students to the high performance computing world and
community
2015 ISC High Performance Conference
– Student Cluster Competition
• Research performed under the HPC Advisory Council activities
• Participating vendors: Intel, Dell, Mellanox
• Compute resource - HPC Advisory Council Cluster Center
• Objectives
• Give overview of RADIOSS performance
• Compare different MPI libraries, network interconnects and other factors
• Understand RADIOSS communication patterns
• Provide best practices to increase RADIOSS productivity
RADIOSS Performance Study
• Compute-intensive simulation software for Manufacturing
• For 20+ years an established standard for automotive crash and impact
• Differentiated by its high scalability, quality and robustness
• Supports multiphysics simulation and advanced materials
• Used across all industries to improve safety and manufacturability
• Companies use RADIOSS to simulate real-world scenarios (crash tests,
climate effects, etc.) to test the performance of a product
About RADIOSS
• Dell™ PowerEdge™ R730 32-node (896-core) “Thor” cluster
• Dual-Socket 14-core Intel E5-2697v3 @ 2.60 GHz CPUs (Static Max Performance profile in BIOS)
• OS: RHEL 6.5, OFED MLNX_OFED_LINUX-2.4-1.0.5 InfiniBand SW stack (see the verification sketch below)
• Memory: 64GB per node, DDR4 2133 MHz
• Hard Drives: 1TB 7,200 RPM SATA 2.5”
• Mellanox Switch-IB SB7700 100Gb/s InfiniBand VPI switch
• Mellanox ConnectX-4 EDR 100Gb/s InfiniBand VPI adapters
• Mellanox ConnectX-3 40/56Gb/s QDR/FDR InfiniBand VPI adapters
• Mellanox SwitchX SX6036 56Gb/s FDR InfiniBand VPI switch
• MPI: Intel MPI 5.0.2, Mellanox HPC-X v1.2.0
• Application: Altair RADIOSS 13.0
• Benchmark datasets:
• Neon benchmarks: 1 million elements (8ms, Double Precision), unless otherwise stated
Test Cluster Configuration
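The node-level configuration above can be spot-checked with standard Linux and OFED tools. The commands below are a minimal verification sketch, not part of the original deck; exact output strings depend on the installed stack.

  lscpu | grep "Model name"              # expect Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz
  ofed_info -s                           # expect MLNX_OFED_LINUX-2.4-1.0.5
  ibstat | grep -E "Rate|Link layer"     # confirm the InfiniBand link rate (e.g., 100 for EDR, 56 for FDR)
  free -g | head -2                      # confirm roughly 64GB of memory per node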
• Intel® Cluster Ready systems make it practical to use a cluster to
increase your simulation and modeling productivity
• Simplifies selection, deployment, and operation of a cluster
• A single architecture platform supported by many OEMs, ISVs, cluster
provisioning vendors, and interconnect providers
• Focus on your work productivity, spend less management time on the cluster
• Select Intel Cluster Ready
• Where the cluster is delivered ready to run
• Hardware and software are integrated and configured together
• Applications are registered, validating execution on the Intel Cluster Ready
architecture
• Includes Intel® Cluster Checker tool, to verify functionality and periodically
check cluster health
• RADIOSS is Intel Cluster Ready
About Intel® Cluster Ready
• Performance and efficiency
• Intelligent hardware-driven systems management
with extensive power management features
• Innovative tools including automation for
parts replacement and lifecycle manageability
• Broad choice of networking technologies from GbE to IB
• Built in redundancy with hot plug and swappable PSU, HDDs and fans
• Benefits
• Designed for performance workloads
• from big data analytics, distributed storage, or distributed computing where local storage is key, to classic HPC and large-scale hosting environments
• High performance scale-out compute and low cost dense storage in one package
• Hardware Capabilities
• Flexible compute platform with dense storage capacity
• 2S/2U server, 6 PCIe slots
• Large memory footprint (Up to 768GB / 24 DIMMs)
• High I/O performance and optional storage configurations
• HDD options: 12 x 3.5” or 24 x 2.5” + 2 x 2.5” HDDs in rear of server
• Up to 26 HDDs with 2 hot plug drives in rear of server for boot or scratch
PowerEdge R730
Massive flexibility for data intensive operations
RADIOSS Performance – Interconnect (MPP)
[Chart: performance vs. node count for 1GbE, 10GbE, and EDR InfiniBand – 28 Processes/Node, Intel MPI, higher is better; 70x and 4.8x gains annotated]
• EDR InfiniBand provides better scalability than Ethernet
• 70x better performance than 1GbE at 16 nodes / 448 cores
• 4.8x better performance than 10GbE at 16 nodes / 448 cores
• Ethernet solutions do not scale beyond 4 nodes with pure MPI
RADIOSS Performance – Interconnect (MPP)
[Chart: performance vs. node count for QDR, FDR, and EDR InfiniBand – 28 Processes/Node, Intel MPI, higher is better; 28% and 25% gains annotated]
• EDR InfiniBand provides better scalability performance
• EDR InfiniBand improves over QDR InfiniBand by 28% at 16 nodes / 448 cores
• Similarly, EDR InfiniBand outperforms FDR InfiniBand by 25% at 16 nodes
• Running more cores per node generally improves overall performance
• An 18% improvement was seen going from 20 to 28 cores per node at 8 nodes
• The improvement is less consistent at higher node counts
RADIOSS Performance – CPU Cores
[Chart: performance at different core counts per node – Intel MPI, higher is better; 18% and 6% gains annotated]
• Increasing the simulated time increases the run time
• Extending an 8ms simulation to 80ms results in a much longer runtime
• A 10x longer simulation results in roughly a 13x to 14x increase in runtime
RADIOSS Performance – Simulation Time
[Chart: runtime for 8ms vs. 80ms simulations – Intel MPI, higher is better; 13x–14x increases annotated]
• Tuning the MPI collective algorithms can improve performance (see the sketch below)
• The MPI profile shows about 20% of runtime is spent in MPI_Allreduce communications
• The default MPI_Allreduce algorithm in Intel MPI is Recursive Doubling
• The default algorithm is the best among all algorithms tested for MPP
RADIOSS Performance – Intel MPI Tuning for MPP
[Chart: performance with different MPI_Allreduce algorithms – 8 Threads/MPI proc, Intel MPI, higher is better]
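Intel MPI selects the algorithm for each collective through its I_MPI_ADJUST_* environment variables. The lines below are a minimal sketch of how a specific MPI_Allreduce algorithm would be pinned; the executable and input names are placeholders, not taken from the deck.

  export I_MPI_ADJUST_ALLREDUCE=1        # 1 = Recursive doubling; leaving it unset keeps Intel MPI's default choice
  mpirun -np 448 -ppn 28 ./radioss_engine neon1m_8ms.rad   # hypothetical executable and input names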
• Highly parallel code
• Multi-level parallelization
• Domain decomposition MPI parallelization
• Multithreading OpenMP
• Enhanced performance
• Best scalability in the marketplace
• High efficiency on large HPC clusters
• Unique, proven method for rich scalability over thousands of cores for FEA
• Flexibility – easy tuning of MPI & OpenMP
• Robustness – parallel arithmetic allows perfect repeatability in parallel
RADIOSS Hybrid MPP Parallelization
• Enabling Hybrid MPP mode unlocks RADIOSS scalability
• At larger scale, productivity improves as more threads are involved
• As more threads are involved, the amount of communication between processes is reduced
• At 32 nodes (896 cores), the best configuration is 2 PPN with 14 threads each
• The following environment settings and tuned flags are used (a launch sketch follows below):
• Intel MPI flags: I_MPI_PIN_DOMAIN=auto
• I_MPI_ADJUST_BCAST=1
• I_MPI_ADJUST_ALLREDUCE=5
• KMP_AFFINITY=compact
• KMP_STACKSIZE=400m
• User limits: ulimit -s unlimited
RADIOSS Performance – Hybrid MPP version
[Chart: performance for different PPN/thread combinations over EDR InfiniBand – Intel MPI, higher is better; 3.7x, 70%, and 32% gains annotated]
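A minimal launch sketch combining the settings above is shown below. The RADIOSS engine executable name, input deck, and the use of OMP_NUM_THREADS are assumptions for illustration; the deck does not give the exact command line.

  # 32 nodes x 2 MPI processes per node x 14 OpenMP threads = 896 cores
  export I_MPI_PIN_DOMAIN=auto
  export I_MPI_ADJUST_BCAST=1
  export I_MPI_ADJUST_ALLREDUCE=5
  export KMP_AFFINITY=compact
  export KMP_STACKSIZE=400m
  export OMP_NUM_THREADS=14              # assumed thread-count setting; not listed in the deck
  ulimit -s unlimited
  mpirun -np 64 -ppn 2 ./radioss_engine neon1m_8ms.rad   # hypothetical executable and input names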
• A single precision job runs faster than the equivalent double precision job
• SP provides a 47% speedup over DP
• Similar scalability is seen for the double precision tests
RADIOSS Performance – Floating Point Precision
[Chart: performance for SP vs. DP – 2 PPN / 14 OpenMP, Intel MPI, higher is better; 47% gain annotated]
• Increasing the CPU core frequency enables higher job efficiency
• An 18% performance jump from 2.3GHz to 2.6GHz (a 13% increase in clock speed)
• A 29% performance jump from 2.0GHz to 2.6GHz (a 30% increase in clock speed)
• The performance gain matches or exceeds the increase in CPU frequency
• CPU-bound applications see a larger benefit from higher-frequency CPUs
RADIOSS Performance – CPU Frequency
[Chart: performance at 2.0GHz, 2.3GHz, and 2.6GHz – 2 PPN / 14 OpenMP, Intel MPI, higher is better; 18% and 29% gains annotated]
• RADIOSS uses point-to-point communications for most data transfers
• Among the most time-consuming MPI calls are MPI_Waitany() and MPI_Wait()
• Breakdown of MPI time: MPI_Recv (55%), MPI_Waitany (23%), MPI_Allreduce (13%) (a profiling sketch follows below)
RADIOSS Profiling – % Time Spent on MPI
[Chart: MPI time breakdown by call – Pure MPP, 16 Processes/Node]
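The deck does not state which MPI profiler produced these breakdowns. The lines below are a minimal sketch of collecting such a profile with a preload-style profiler such as IPM; the library path, process counts, and job command are placeholders.

  export LD_PRELOAD=$HOME/ipm/lib/libipm.so   # hypothetical install path of the IPM profiling library
  export IPM_REPORT=full                      # print the per-call MPI time breakdown when the job exits
  mpirun -np 256 -ppn 16 ./radioss_engine neon1m_8ms.rad   # hypothetical executable and input names (16 nodes assumed)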
• For Hybrid MPP DP, tuning MPI_Allreduce shows a larger gain than for pure MPP
• With the DAPL provider, the Binomial gather+scatter algorithm (#5) improved performance by 27% over the default
• With the OFA provider, the tuned MPI_Allreduce algorithm improved performance by 44% over the default
• Both OFA and DAPL improved with I_MPI_ADJUST_ALLREDUCE=5 (see the sketch below)
RADIOSS Performance – Intel MPI Tuning (DP)
[Chart: performance with default vs. tuned MPI_Allreduce for the DAPL and OFA providers – 2 PPN / 14 OpenMP, Intel MPI, higher is better; 27% and 44% gains annotated]
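In Intel MPI 5.x, both the fabric provider and the collective algorithm are selected through environment variables. The lines below are a minimal sketch for each provider; the executable and input names are placeholders.

  # DAPL provider with the tuned Binomial gather+scatter MPI_Allreduce (#5)
  export I_MPI_FABRICS=shm:dapl
  export I_MPI_ADJUST_ALLREDUCE=5
  mpirun -np 64 -ppn 2 ./radioss_engine neon1m_8ms.rad   # hypothetical executable and input names

  # OFA provider with the same tuned algorithm
  export I_MPI_FABRICS=shm:ofa
  export I_MPI_ADJUST_ALLREDUCE=5
  mpirun -np 64 -ppn 2 ./radioss_engine neon1m_8ms.rad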
• The most time-consuming MPI communications are:
• MPI_Recv: messages concentrated at 640B, 1KB, 320B, and 1280B
• MPI_Waitany: messages at 48B, 8B, and 384B
• MPI_Allreduce: most messages appear at 80B
RADIOSS Profiling – MPI Message Sizes
[Chart: message-size distribution per MPI call – Pure MPP, 28 Processes/Node]
• EDR InfiniBand provides better scalability than Ethernet
• 214% better performance than 1GbE at 16 nodes
• 104% better performance than 10GbE at 16 nodes
• InfiniBand typically outperforms other interconnects in collective operations
RADIOSS Performance – Interconnect (HMPP)
[Chart: performance vs. node count for 1GbE, 10GbE, and EDR InfiniBand – 2 PPN / 14 OpenMP, Intel MPI, higher is better; 214% and 104% gains annotated]
• EDR InfiniBand provides better scalability than FDR InfiniBand
• EDR IB outperforms FDR IB by 27% at 32 nodes
• The improvement for EDR InfiniBand appears at higher node counts
RADIOSS Performance – Interconnect (HMPP)
[Chart: performance vs. node count for FDR vs. EDR InfiniBand – 2 PPN / 14 OpenMP, Intel MPI, higher is better; 27% gain annotated]
• RADIOSS 12.0 uses mostly non-blocking calls for communication
• MPI_Wait, MPI_Waitany, MPI_Irecv and MPI_Isend are used almost exclusively
• RADIOSS 13.0 appears to use the same calls for the most part
• The MPI_Bcast calls appear to have been replaced by MPI_Allreduce calls
RADIOSS Profiling – Number of MPI Calls
[Chart: MPI call counts at 32 nodes, Pure MPP – RADIOSS 12.0 vs. RADIOSS 13.0]
• The Intel E5-2697v3 (Haswell) cluster outperforms prior generations
• It performs faster by 100% vs Jupiter and by 238% vs Janus at 16 nodes
• System components used:
• Thor: 2-socket Intel E5-2697v3@2.6GHz, 2133MHz DIMMs, EDR IB, RADIOSS v13.0
• Jupiter: 2-socket Intel E5-2680@2.7GHz, 1600MHz DIMMs, FDR IB, RADIOSS v12.0
• Janus: 2-socket Intel X5670@2.93GHz, 1333MHz DIMMs, QDR IB, RADIOSS v12.0
RADIOSS Performance – System Generations
[Chart: performance on Janus, Jupiter, and Thor – Single Precision; 100% and 238% gains annotated]
• The memory required to run this workload is around 5GB per node
• This is considered a small workload, but it is large enough to observe application behavior
RADIOSS Profiling – Memory Required
[Chart: memory usage per node – Pure MPP, 28 Processes/Node]
• RADIOSS is designed to perform in large-scale HPC environments
• It shows excellent scalability up to 896 cores / 32 nodes and beyond with Hybrid MPP
• The Hybrid MPP version enhances RADIOSS scalability
• Best configuration: 2 MPI processes per node (1 per socket), 14 threads each
• Additional CPU cores generally accelerate time to solution
• The Intel E5-2697v3 (Haswell) cluster outperforms prior generations
• It performs faster by 100% vs “Sandy Bridge” and by 238% vs “Westmere” at 16 nodes
• Network and MPI tuning
• EDR InfiniBand outperforms Ethernet-based interconnects in scalability
• EDR InfiniBand delivers higher scalability than FDR and QDR InfiniBand
• Tuning environment parameters is important to maximize performance
• Tuning MPI collective operations helps RADIOSS achieve even better scalability
RADIOSS – Summary
All trademarks are property of their respective owners. All information is provided “As-Is” without any kind of warranty. The HPC Advisory Council makes no representation as to the accuracy and
completeness of the information contained herein. The HPC Advisory Council undertakes no duty and assumes no obligation to update or correct any information presented herein.
Thank You
• Questions?
Pak Lui
pak@hpcadvisorycouncil.com