Scaling Applications on High Performance Computing Clusters and Analysis of the System Performance
Rusif Eyvazli1, Sabrina Mikayilova2, Mike Mikailov3, Stuart Barkley4
FDA/CDRH/OSEL/DIDSR
US Food and Drug Administration, Silver Spring, MD, USA
{1Rusif.Eyvazli, 2Sabrina.Mikayilova, 3Mike.Mikailov, 4Stuart.Barkley}@fda.hhs.gov
References
[1] L'Ecuyer et al., RngStream, listed in the CRAN High-Performance Computing task view:
https://cran.r-project.org/web/views/HighPerformanceComputing.html
[2] Loop unrolling across HPC nodes using the array job technique (Table 1):
https://scl-wiki.fda.gov/wiki/images/6/6f/MS_Tasks_Parallelization-V2_FINAL.pdf
[3] Sample Python project for Table 2:
https://rusife.github.io/Performance-analysis-for-HPC-cluster-system-/
[4] Sample Python project for Table 3:
https://rusife.github.io/Performance-Analysis-for-Network-Storage/
Conclusion
Application run times are drastically reduced after applying the
techniques based on the array job facility of the job schedulers.
The application performance measurement graphs help the HPC team with
preventive maintenance and system capacity management.
The network storage performance plots help the HPC team further analyze
the performance of InfiniBand, 10Gb, and 1Gb Ethernet as the number of
nodes increases. In this case, the 10Gb performance levels off at about
10 and 8 Gb/s, while the 1Gb performance continues to improve roughly
linearly as nodes are added.
Future Work
Research is needed to automate the setup and convergence phases of the
array-job-based scaling techniques. Experimental and theoretical work is
also needed to estimate the real speedup achieved compared to the ideal
linear speedup.
| Scaling Technique | Advantages | Disadvantages |
| --- | --- | --- |
| Multi-threading, OpenMP | More than one execution thread working in parallel within a node. | • Scaling is limited to the cores of one computing node. |
| MPI | More than one execution thread working in parallel on more than one node. | • Increased likelihood of I/O and load-balancing problems. • In practice, all computational resources requested by all parallel threads must be available for an MPI application to start, which may lead to job starvation. • No checkpointing. |
| Single-loop parallelization | All of the above. | • Does not parallelize multilevel nested loops. |
| Scientific workflows, MapReduce, Spark, Hadoop | Series of computational or data-manipulation tasks run on more than one node in a scalable manner. | • Incomplete approach for scaling and parallelizing Modeling and Simulation (M&S) applications. • Not integrated with Son of Grid Engine and similar widely used job schedulers. |
| Array job | All of the above. | • Setup and convergence phases. |
Scaling Techniques Comparison
System Performance Plots
The Python program was further enhanced to process another set of data
files measuring DataDirect storage read/write performance. The challenge
in this task was reading data from input files in which the records were
mixed with notes added by the data producer (the application). To read
the data and obtain the resulting graph, the records were first filtered
by the first two columns and the last column; three important columns
were then derived from the resulting data frame and plotted with the
Seaborn Python package.
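A minimal sketch of this kind of processing is shown below. The input
file name (ddn_log.csv) and the column names (node_count, block_size_kb,
throughput_gbps) are assumptions for illustration only; the actual file
layout and filtering rules used in the project referenced in [4] may
differ.

```python
# Sketch only: the file name, column names, and filtering rule are
# assumptions, not the exact logic of the project in [4].
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

records = []
with open("ddn_log.csv") as fh:          # hypothetical input file
    for line in fh:
        fields = [f.strip() for f in line.split(",")]
        if len(fields) < 3:
            continue
        # Keep only real data records: the first two and the last field
        # must be numeric; lines holding free-text notes are dropped.
        try:
            nodes = int(fields[0])
            block_kb = float(fields[1])
            gbps = float(fields[-1])
        except ValueError:
            continue
        records.append({"node_count": nodes,
                        "block_size_kb": block_kb,
                        "throughput_gbps": gbps})

df = pd.DataFrame(records)               # the three derived columns
sns.lineplot(data=df, x="node_count", y="throughput_gbps",
             hue="block_size_kb", marker="o")
plt.title("DataDirect storage read/write throughput")
plt.savefig("ddn_throughput.png", dpi=150)
```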
Disk Performance Analysis
Scaling Simulation Loops
Why Array jobs?
Traditional software parallelization (scaling) techniques include
multi-threading, OpenMP, the Message Passing Interface (MPI), single-loop
parallelization, scientific workflows, MapReduce, Spark, and Hadoop. The
Scaling Techniques Comparison table lists the advantages and
disadvantages of these techniques.
Array-job-based techniques combine the advantages of the traditional
techniques. Their disadvantage is the introduction of setup and
convergence phases, but these overheads are insignificant compared to
those associated with the traditional techniques. Some specific
advantages include:
(a) Natural checkpointing: every task of an array job is independent, so
system failures affect only a subset of the tasks.
(b) Automated identification of incomplete partial result files by
counting and sorting the numbers of lines in the partial result files.
(c) Automated identification of missing partial result files by comparing
the list of expected partial result file names with the actual ones.
(d) Rerunning only the failed tasks.
A sketch of checks (b) and (c) is given below.
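The following minimal sketch illustrates checks (b) and (c). The file
naming scheme (result_<task_id>.txt) and the expected line count are
hypothetical; the real result layout used by the HPC team may differ.

```python
# Sketch only: the file naming scheme and expected line count are assumptions.
import os

def find_bad_tasks(result_dir, n_tasks, expected_lines):
    """Return task IDs whose partial result file is missing or incomplete."""
    missing, incomplete = [], []
    for task_id in range(1, n_tasks + 1):
        path = os.path.join(result_dir, f"result_{task_id}.txt")
        if not os.path.exists(path):          # check (c): missing file
            missing.append(task_id)
            continue
        with open(path) as fh:
            n_lines = sum(1 for _ in fh)
        if n_lines < expected_lines:          # check (b): incomplete file
            incomplete.append(task_id)
    return missing, incomplete

# Per advantage (d), the returned IDs can be resubmitted as new array jobs
# that cover only the failed tasks.
missing, incomplete = find_bad_tasks("results", n_tasks=1000, expected_lines=500)
print("rerun tasks:", sorted(missing + incomplete))
```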
Application Scaling Techniques
We use the array job facility of the cluster job schedulers to scale
applications across the computing nodes of the clusters and adapt
L'Ecuyer's RngStream package [1] to provide a high-quality Random Number
Generator (RNG) for the massively parallel computations. The simulation
iterations of an application are divided into subsets of iterations,
which are delegated as array job tasks to the computing nodes of the
clusters. Since the tasks are independent, the array job starts as soon
as resources are available for even a single task. This avoids the job
starvation problem encountered with the most dominant software
parallelization technique, the Message Passing Interface.
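A minimal sketch of how a task can map its array task ID to its subset of
iterations is shown below, assuming an SGE-family scheduler (such as Son
of Grid Engine) that sets SGE_TASK_ID and SGE_TASK_LAST for array job
tasks; the iteration count and the simulation body are placeholders.

```python
# Sketch only: TOTAL_ITERATIONS and run_one_iteration() are placeholders for
# the real simulation; SGE_TASK_ID / SGE_TASK_LAST are set by SGE-family
# schedulers for array job tasks.
import os

TOTAL_ITERATIONS = 1_000_000

def run_one_iteration(i):
    pass  # placeholder for one simulation iteration

task_id = int(os.environ.get("SGE_TASK_ID", 1))     # 1-based task index
n_tasks = int(os.environ.get("SGE_TASK_LAST", 1))   # total number of tasks

# Split the iteration range into n_tasks nearly equal, non-overlapping chunks.
chunk = (TOTAL_ITERATIONS + n_tasks - 1) // n_tasks
start = (task_id - 1) * chunk
stop = min(start + chunk, TOTAL_ITERATIONS)

for i in range(start, stop):
    run_one_iteration(i)
```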
The quality of simulation applications strongly depends on the quality of
the random numbers employed across tasks running independently on
different computing nodes. Studies show that traditional RNGs are not
adequate for parallel simulations. L'Ecuyer's RngStream package provides
2^64 independent random number streams, each with a period of 2^127. The
unique task ID of each task is used to select the unique stream for that
task.
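The task-ID-to-stream mapping can be illustrated with NumPy's independent
bit-generator streams used here as a stand-in for RngStream (the
production simulations are written in C/C++ and Java and use the
RngStream package itself); the seed value is a placeholder.

```python
# Sketch only: NumPy's PCG64/SeedSequence streams stand in for L'Ecuyer's
# RngStream; the production codes use RngStream directly.
import os
import numpy as np

task_id = int(os.environ.get("SGE_TASK_ID", 1))

# One root SeedSequence per array job; spawning children by task ID gives
# statistically independent streams, so every task draws its own numbers.
root = np.random.SeedSequence(20190601)        # placeholder seed
child = root.spawn(task_id)[task_id - 1]       # stream unique to this task
rng = np.random.Generator(np.random.PCG64(child))

sample = rng.random(5)                         # task-local random numbers
print(f"task {task_id}: {sample}")
```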
ABSTRACT
The FDA/CDRH High-Performance Computing (HPC) team provides training and
expert consultations to HPC users on migrating and scaling scientific
applications on the HPC clusters. These applications include large-scale
modeling and simulations, genomic analysis, computational physics and
chemistry, molecular and fluid dynamics, and others that overwhelm even
the most powerful scientific computers and workstations. This work
presents software scaling techniques for simulation programs written in
the C/C++ and Java programming languages. The Python programming language
was also used for the performance analysis.
Computing Nodes Performance Analysis
A Python program was created to help analyze system performance. An input
file named hostnum.csv contains the node names with their unique IDs, and
three other CSV input files contain the system performance logs
accumulated while running test application programs on the Betsy cluster.
Based on these files, the Python program produces plots showing that
nodes bc111-bc120 do better on the BLASTX and HPL problems but perform
significantly worse on the JAGS problem for June 2019. The other nodes
showing differences are bc121-bc168. We have known that different
applications behave differently, but these plots give a striking example
of how differently these problems behave. The plots can be zoomed in to
discover more detailed information about system performance.
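A minimal sketch of this kind of per-node comparison plot is shown below.
The benchmark log file name (blastx_june2019.csv) and the column names
(host, node_id, runtime_s) are hypothetical; the project referenced in
[3] defines its own file and column layout.

```python
# Sketch only: file and column names are assumptions, not the layout used in [3].
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

hosts = pd.read_csv("hostnum.csv")            # node name -> unique node ID
perf = pd.read_csv("blastx_june2019.csv")     # per-node benchmark run times

# Join on the node name so each measurement gets its numeric node ID, then
# plot run time against node ID to spot under- or over-performing nodes.
df = perf.merge(hosts, on="host")
sns.scatterplot(data=df, x="node_id", y="runtime_s")
plt.xlabel("node ID")
plt.ylabel("run time (s)")
plt.title("BLASTX run time per computing node (June 2019)")
plt.savefig("blastx_node_performance.png", dpi=150)
```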
Acknowledgements
The authors appreciate the great support of FDA/CDRH/OSEL/DIDSR
management, their mentors Dr. Mike Mikailov and Stuart Barkley, and all
HPC team members, including Fu-Jyh Luo.
Table 1: Loop unrolling across HPC nodes using the array job technique [2]
Table 2: Performance analysis for the HPC cluster system [3]
Table 3: Disk read performance analysis for the DataDirect network storage used by FDA [4]