Performance Evaluation,
Scalability Analysis, and
Optimization Tuning of Altair HyperWorks
on a Modern HPC Compute Cluster
Pak Lui
pak@hpcadvisorycouncil.com
May 7, 2015
Copyright © 2015 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.
Agenda
• Introduction to HPC Advisory Council
• Benchmark Configuration
• Performance Benchmark Testing and Results
• MPI Profiling
• Summary
• Q&A / For More Information
• World-wide HPC organization (400+ members)
• Bridges the gap between HPC usage and its full potential
• Provides best practices and a support/development center
• Explores future technologies and developments
• Working Groups – HPC|Cloud, HPC|Scale, HPC|GPU, HPC|Storage
• Leading edge solutions and technology demonstrations
The HPC Advisory Council
HPC Advisory Council Members
• HPC Advisory Council Chairman
Gilad Shainer - gilad@hpcadvisorycouncil.com
• HPC Advisory Council Media Relations and Events Director
Brian Sparks - brian@hpcadvisorycouncil.com
• HPC Advisory Council China Events Manager
Blade Meng - blade@hpcadvisorycouncil.com
• Director of the HPC Advisory Council, Asia
Tong Liu - tong@hpcadvisorycouncil.com
• HPC Advisory Council HPC|Works SIG Chair
and Cluster Center Manager
Pak Lui - pak@hpcadvisorycouncil.com
• HPC Advisory Council Director of Educational Outreach
Scot Schultz – scot@hpcadvisorycouncil.com
• HPC Advisory Council Programming Advisor
Tarick Bedeir - Tarick@hpcadvisorycouncil.com
• HPC Advisory Council HPC|Scale SIG Chair
Richard Graham – richard@hpcadvisorycouncil.com
• HPC Advisory Council HPC|Cloud SIG Chair
William Lu – william@hpcadvisorycouncil.com
• HPC Advisory Council HPC|GPU SIG Chair
Sadaf Alam – sadaf@hpcadvisorycouncil.com
• HPC Advisory Council India Outreach
Goldi Misra – goldi@hpcadvisorycouncil.com
• Director of the HPC Advisory Council Switzerland Center of
Excellence and HPC|Storage SIG Chair
Hussein Harake – hussein@hpcadvisorycouncil.com
• HPC Advisory Council Workshop Program Director
Eric Lantz – eric@hpcadvisorycouncil.com
• HPC Advisory Council Research Steering Committee
Director
Cydney Stevens - cydney@hpcadvisorycouncil.com
HPC Council Board
HPC Advisory Council HPC Center
[Diagram: HPC Advisory Council cluster center systems – Juniper, Heimdall, Plutus, Janus, Athena, Vesta, Thor, and Mala clusters (ranging from 80 to 896 cores, plus a 16-GPU system), backed by InfiniBand-based Lustre storage]
• HPC|Scale
• To explore usage of commodity HPC as a replacement for multi-million dollar mainframes and proprietary-based supercomputers, with networks and clusters of microcomputers acting in unison to deliver high-end computing services.
• HPC|Cloud
• To explore usage of HPC components as part of the creation of external/public/internal/private cloud
computing environments.
• HPC|Works
• To provide best practices for building balanced and scalable HPC systems, performance tuning and
application guidelines.
• HPC|Storage
• To demonstrate how to build high-performance storage solutions and their effect on application performance
and productivity. One of the main interests of the HPC|Storage subgroup is to explore Lustre-based solutions,
and to expose more users to the potential of Lustre over high-speed networks.
• HPC|GPU
• To explore usage models of GPU components as part of next generation compute environments and potential
optimizations for GPU based computing.
• HPC|FSI
• To explore the usage of high-performance computing solutions for low latency trading,
more productive simulations (such as Monte Carlo) and overall more efficient financial services.
Special Interest Subgroups Missions
• HPC Advisory Council (HPCAC)
• 400+ members
• http://www.hpcadvisorycouncil.com/
• Application best practices, case studies (Over 150)
• Benchmarking center with remote access for users
• World-wide workshops
• Value add for your customers to stay up to date and in tune with the HPC market
• 2015 Workshops
• USA (Stanford University) – February 2015
• Switzerland – March 2015
• Brazil – August 2015
• Spain – September 2015
• China (HPC China) – Oct 2015
• For more information
• www.hpcadvisorycouncil.com
• info@hpcadvisorycouncil.com
HPC Advisory Council
• University-based teams to compete and demonstrate the incredible
capabilities of state-of-the-art HPC systems and applications on the
2015 ISC High Performance Conference show-floor
• The Student Cluster Competition is designed to introduce the next
generation of students to the high performance computing world and
community
2015 ISC High Performance Conference
– Student Cluster Competition
• Research performed under the HPC Advisory Council activities
• Participating vendors: Intel, Dell, Mellanox
• Compute resource - HPC Advisory Council Cluster Center
• Objectives
• Give overview of RADIOSS performance
• Compare different MPI libraries, network interconnects and other factors
• Understand RADIOSS communication patterns
• Provide best practices to increase RADIOSS productivity
RADIOSS Performance Study
• Compute-intensive simulation software for Manufacturing
• For 20+ years an established standard for automotive crash and impact
• Differentiated by its high scalability, quality and robustness
• Supports multiphysics simulation and advanced materials
• Used across all industries to improve safety and manufacturability
• Companies use RADIOSS to simulate real-world scenarios (crash tests,
climate effects, etc.) to test the performance of a product
About RADIOSS
• Dell™ PowerEdge™ R730 32-node (896-core) “Thor” cluster
• Dual-Socket 14-core Intel E5-2697v3 @ 2.60 GHz CPUs (Static Max Performance profile in BIOS)
• OS: RHEL 6.5, OFED MLNX_OFED_LINUX-2.4-1.0.5 InfiniBand SW stack (see the verification sketch below)
• Memory: 64GB per node, DDR4 2133 MHz
• Hard Drives: 1TB 7,200 RPM SATA 2.5”
• Mellanox Switch-IB SB7700 100Gb/s InfiniBand VPI switch
• Mellanox ConnectX-4 EDR 100Gb/s InfiniBand VPI adapters
• Mellanox ConnectX-3 40/56Gb/s QDR/FDR InfiniBand VPI adapters
• Mellanox SwitchX SX6036 56Gb/s FDR InfiniBand VPI switch
• MPI: Intel MPI 5.0.2, Mellanox HPC-X v1.2.0
• Application: Altair RADIOSS 13.0
• Benchmark datasets:
• Neon benchmarks: 1 million elements (8ms, Double Precision), unless otherwise stated
Test Cluster Configuration
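The node-level configuration above can be spot-checked with standard Linux and OFED tools. The commands below are a minimal verification sketch, not part of the original deck; exact output strings depend on the installed stack.

  lscpu | grep "Model name"              # expect Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz
  ofed_info -s                           # expect MLNX_OFED_LINUX-2.4-1.0.5
  ibstat | grep -E "Rate|Link layer"     # confirm the InfiniBand link rate (e.g., 100 for EDR, 56 for FDR)
  free -g | head -2                      # confirm roughly 64GB of memory per node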
• Intel® Cluster Ready systems make it practical to use a cluster to
increase your simulation and modeling productivity
• Simplifies selection, deployment, and operation of a cluster
• A single architecture platform supported by many OEMs, ISVs, cluster
provisioning vendors, and interconnect providers
• Focus on your work productivity, spend less management time on the cluster
• Select Intel Cluster Ready
• Where the cluster is delivered ready to run
• Hardware and software are integrated and configured together
• Applications are registered, validating execution on the Intel Cluster Ready
architecture
• Includes Intel® Cluster Checker tool, to verify functionality and periodically
check cluster health
• RADIOSS is Intel Cluster Ready
About Intel® Cluster Ready
• Performance and efficiency
• Intelligent hardware-driven systems management
with extensive power management features
• Innovative tools including automation for
parts replacement and lifecycle manageability
• Broad choice of networking technologies from GbE to IB
• Built in redundancy with hot plug and swappable PSU, HDDs and fans
• Benefits
• Designed for performance workloads
• from big data analytics, distributed storage, or distributed computing where local storage is key, to classic HPC and large-scale hosting environments
• High performance scale-out compute and low cost dense storage in one package
• Hardware Capabilities
• Flexible compute platform with dense storage capacity
• 2S/2U server, 6 PCIe slots
• Large memory footprint (Up to 768GB / 24 DIMMs)
• High I/O performance and optional storage configurations
• HDD options: 12 x 3.5” or 24 x 2.5” + 2 x 2.5” HDDs in rear of server
• Up to 26 HDDs with 2 hot plug drives in rear of server for boot or scratch
PowerEdge R730
Massive flexibility for data intensive operations
RADIOSS Performance – Interconnect (MPP)
[Chart: performance vs. node count for 1GbE, 10GbE, and EDR InfiniBand – 28 Processes/Node, Intel MPI, higher is better; 70x and 4.8x gains annotated]
• EDR InfiniBand provides better scalability than Ethernet
• 70x better performance than 1GbE at 16 nodes / 448 cores
• 4.8x better performance than 10GbE at 16 nodes / 448 cores
• Ethernet solutions do not scale beyond 4 nodes with pure MPI
RADIOSS Performance – Interconnect (MPP)
[Chart: performance vs. node count for QDR, FDR, and EDR InfiniBand – 28 Processes/Node, Intel MPI, higher is better; 28% and 25% gains annotated]
• EDR InfiniBand provides better scalability performance
• EDR InfiniBand improves over QDR InfiniBand by 28% at 16 nodes / 448 cores
• Similarly, EDR InfiniBand outperforms FDR InfiniBand by 25% at 16 nodes
• Running more cores per node generally improves overall performance
• An 18% improvement was seen going from 20 to 28 cores per node at 8 nodes
• The improvement is less consistent at higher node counts
RADIOSS Performance – CPU Cores
[Chart: performance at different core counts per node – Intel MPI, higher is better; 18% and 6% gains annotated]
• Increasing the simulated time increases the run time
• Extending an 8ms simulation to 80ms results in a much longer runtime
• A 10x longer simulation results in roughly a 13x to 14x increase in runtime
RADIOSS Performance – Simulation Time
[Chart: runtime for 8ms vs. 80ms simulations – Intel MPI, higher is better; 13x–14x increases annotated]
• Tuning the MPI collective algorithms can improve performance (see the sketch below)
• The MPI profile shows about 20% of runtime is spent in MPI_Allreduce communications
• The default MPI_Allreduce algorithm in Intel MPI is Recursive Doubling
• The default algorithm is the best among all algorithms tested for MPP
RADIOSS Performance – Intel MPI Tuning for MPP
[Chart: performance with different MPI_Allreduce algorithms – 8 Threads/MPI proc, Intel MPI, higher is better]
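Intel MPI selects the algorithm for each collective through its I_MPI_ADJUST_* environment variables. The lines below are a minimal sketch of how a specific MPI_Allreduce algorithm would be pinned; the executable and input names are placeholders, not taken from the deck.

  export I_MPI_ADJUST_ALLREDUCE=1        # 1 = Recursive doubling; leaving it unset keeps Intel MPI's default choice
  mpirun -np 448 -ppn 28 ./radioss_engine neon1m_8ms.rad   # hypothetical executable and input names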
• Highly parallel code
• Multi-level parallelization
• Domain decomposition MPI parallelization
• Multithreading OpenMP
• Enhanced performance
• Best scalability in the marketplace
• High efficiency on large HPC clusters
• Unique, proven method for rich scalability over thousands of cores for FEA
• Flexibility – easy tuning of MPI & OpenMP
• Robustness – parallel arithmetic allows perfect repeatability in parallel
RADIOSS Hybrid MPP Parallelization
• Enabling Hybrid MPP mode unlocks RADIOSS scalability
• At larger scale, productivity improves as more threads are involved
• As more threads are involved, the amount of communication between processes is reduced
• At 32 nodes (896 cores), the best configuration is 2 PPN with 14 threads each
• The following environment settings and tuned flags are used (a launch sketch follows below):
• Intel MPI flags: I_MPI_PIN_DOMAIN=auto
• I_MPI_ADJUST_BCAST=1
• I_MPI_ADJUST_ALLREDUCE=5
• KMP_AFFINITY=compact
• KMP_STACKSIZE=400m
• User limits: ulimit -s unlimited
RADIOSS Performance – Hybrid MPP version
[Chart: performance for different PPN/thread combinations over EDR InfiniBand – Intel MPI, higher is better; 3.7x, 70%, and 32% gains annotated]
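A minimal launch sketch combining the settings above is shown below. The RADIOSS engine executable name, input deck, and the use of OMP_NUM_THREADS are assumptions for illustration; the deck does not give the exact command line.

  # 32 nodes x 2 MPI processes per node x 14 OpenMP threads = 896 cores
  export I_MPI_PIN_DOMAIN=auto
  export I_MPI_ADJUST_BCAST=1
  export I_MPI_ADJUST_ALLREDUCE=5
  export KMP_AFFINITY=compact
  export KMP_STACKSIZE=400m
  export OMP_NUM_THREADS=14              # assumed thread-count setting; not listed in the deck
  ulimit -s unlimited
  mpirun -np 64 -ppn 2 ./radioss_engine neon1m_8ms.rad   # hypothetical executable and input names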
• A single precision job runs faster than the equivalent double precision job
• SP provides a 47% speedup over DP
• Similar scalability is seen for the double precision tests
RADIOSS Performance – Floating Point Precision
[Chart: performance for SP vs. DP – 2 PPN / 14 OpenMP, Intel MPI, higher is better; 47% gain annotated]
• Increasing the CPU core frequency enables higher job efficiency
• An 18% performance jump from 2.3GHz to 2.6GHz (a 13% increase in clock speed)
• A 29% performance jump from 2.0GHz to 2.6GHz (a 30% increase in clock speed)
• The performance gain matches or exceeds the increase in CPU frequency
• CPU-bound applications see a larger benefit from higher-frequency CPUs
RADIOSS Performance – CPU Frequency
[Chart: performance at 2.0GHz, 2.3GHz, and 2.6GHz – 2 PPN / 14 OpenMP, Intel MPI, higher is better; 18% and 29% gains annotated]
• RADIOSS uses point-to-point communications for most data transfers
• Among the most time-consuming MPI calls are MPI_Waitany() and MPI_Wait()
• Breakdown of MPI time: MPI_Recv (55%), MPI_Waitany (23%), MPI_Allreduce (13%) (a profiling sketch follows below)
RADIOSS Profiling – % Time Spent on MPI
[Chart: MPI time breakdown by call – Pure MPP, 16 Processes/Node]
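The deck does not state which MPI profiler produced these breakdowns. The lines below are a minimal sketch of collecting such a profile with a preload-style profiler such as IPM; the library path, process counts, and job command are placeholders.

  export LD_PRELOAD=$HOME/ipm/lib/libipm.so   # hypothetical install path of the IPM profiling library
  export IPM_REPORT=full                      # print the per-call MPI time breakdown when the job exits
  mpirun -np 256 -ppn 16 ./radioss_engine neon1m_8ms.rad   # hypothetical executable and input names (16 nodes assumed)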
• For Hybrid MPP DP, tuning MPI_Allreduce shows a larger gain than for pure MPP
• With the DAPL provider, the Binomial gather+scatter algorithm (#5) improved performance by 27% over the default
• With the OFA provider, the tuned MPI_Allreduce algorithm improved performance by 44% over the default
• Both OFA and DAPL improved with I_MPI_ADJUST_ALLREDUCE=5 (see the sketch below)
RADIOSS Performance – Intel MPI Tuning (DP)
[Chart: performance with default vs. tuned MPI_Allreduce for the DAPL and OFA providers – 2 PPN / 14 OpenMP, Intel MPI, higher is better; 27% and 44% gains annotated]
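In Intel MPI 5.x, both the fabric provider and the collective algorithm are selected through environment variables. The lines below are a minimal sketch for each provider; the executable and input names are placeholders.

  # DAPL provider with the tuned Binomial gather+scatter MPI_Allreduce (#5)
  export I_MPI_FABRICS=shm:dapl
  export I_MPI_ADJUST_ALLREDUCE=5
  mpirun -np 64 -ppn 2 ./radioss_engine neon1m_8ms.rad   # hypothetical executable and input names

  # OFA provider with the same tuned algorithm
  export I_MPI_FABRICS=shm:ofa
  export I_MPI_ADJUST_ALLREDUCE=5
  mpirun -np 64 -ppn 2 ./radioss_engine neon1m_8ms.rad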
• The most time-consuming MPI communications are:
• MPI_Recv: messages concentrated at 640B, 1KB, 320B, and 1280B
• MPI_Waitany: messages at 48B, 8B, and 384B
• MPI_Allreduce: most messages appear at 80B
RADIOSS Profiling – MPI Message Sizes
[Chart: message-size distribution per MPI call – Pure MPP, 28 Processes/Node]
• EDR InfiniBand provides better scalability than Ethernet
• 214% better performance than 1GbE at 16 nodes
• 104% better performance than 10GbE at 16 nodes
• InfiniBand typically outperforms other interconnects in collective operations
RADIOSS Performance – Interconnect (HMPP)
[Chart: performance vs. node count for 1GbE, 10GbE, and EDR InfiniBand – 2 PPN / 14 OpenMP, Intel MPI, higher is better; 214% and 104% gains annotated]
• EDR InfiniBand provides better scalability than FDR InfiniBand
• EDR IB outperforms FDR IB by 27% at 32 nodes
• The improvement for EDR InfiniBand appears at higher node counts
RADIOSS Performance – Interconnect (HMPP)
[Chart: performance vs. node count for FDR vs. EDR InfiniBand – 2 PPN / 14 OpenMP, Intel MPI, higher is better; 27% gain annotated]
• RADIOSS 12.0 uses mostly non-blocking calls for communication
• MPI_Wait, MPI_Waitany, MPI_Irecv and MPI_Isend are used almost exclusively
• RADIOSS 13.0 appears to use the same calls for the most part
• The MPI_Bcast calls appear to have been replaced by MPI_Allreduce calls
RADIOSS Profiling – Number of MPI Calls
[Chart: MPI call counts at 32 nodes, Pure MPP – RADIOSS 12.0 vs. RADIOSS 13.0]
• The Intel E5-2697v3 (Haswell) cluster outperforms prior generations
• It performs faster by 100% vs Jupiter and by 238% vs Janus at 16 nodes
• System components used:
• Thor: 2-socket Intel E5-2697v3@2.6GHz, 2133MHz DIMMs, EDR IB, RADIOSS v13.0
• Jupiter: 2-socket Intel E5-2680@2.7GHz, 1600MHz DIMMs, FDR IB, RADIOSS v12.0
• Janus: 2-socket Intel X5670@2.93GHz, 1333MHz DIMMs, QDR IB, RADIOSS v12.0
RADIOSS Performance – System Generations
[Chart: performance on Janus, Jupiter, and Thor – Single Precision; 100% and 238% gains annotated]
• The memory required to run this workload is around 5GB per node
• This is considered a small workload, but it is large enough to observe application behavior
RADIOSS Profiling – Memory Required
[Chart: memory usage per node – Pure MPP, 28 Processes/Node]
• RADIOSS is designed to perform in large-scale HPC environments
• It shows excellent scalability up to 896 cores / 32 nodes and beyond with Hybrid MPP
• The Hybrid MPP version enhances RADIOSS scalability
• Best configuration: 2 MPI processes per node (1 per socket), 14 threads each
• Additional CPU cores generally accelerate time to solution
• The Intel E5-2697v3 (Haswell) cluster outperforms prior generations
• It performs faster by 100% vs “Sandy Bridge” and by 238% vs “Westmere” at 16 nodes
• Network and MPI tuning
• EDR InfiniBand outperforms Ethernet-based interconnects in scalability
• EDR InfiniBand delivers higher scalability than FDR and QDR InfiniBand
• Tuning environment parameters is important to maximize performance
• Tuning MPI collective operations helps RADIOSS achieve even better scalability
RADIOSS – Summary
All trademarks are property of their respective owners. All information is provided “As-Is” without any kind of warranty. The HPC Advisory Council makes no representation as to the accuracy and
completeness of the information contained herein. The HPC Advisory Council undertakes no duty and assumes no obligation to update or correct any information presented herein.
Thank You
• Questions?
Pak Lui
pak@hpcadvisorycouncil.com