spcl.inf.ethz.ch
@spcl_eth
TORSTEN HOEFLER, ROBERTO BELLI
Scientific Benchmarking of Parallel Computing Systems
Twelve ways to tell the masses when reporting performance results
Disclaimer(s)
 This is a state of the practice talk!
 Explained in SC15 FAQ:
“generalizable insights as gained from experiences with particular HPC
machines/operations/applications/benchmarks, overall analysis
of the status quo of a particular metric of the entire field or
historical reviews of the progress of the field.”
 Don’t expect novel insights
I hope to communicate new knowledge nevertheless
 My musings shall not offend anybody
 Everything is (now) anonymized
 Criticism may be rhetorically exaggerated
 Watch for tropes!
 This talk should be entertaining!
How does Garth measure and report performance?
 We are all interested in High Performance Computing
 We (want to) see it as a science – reproducing experiments is a major pillar of the scientific method
 When measuring performance, important questions are
 “How many iterations do I have to run per measurement?”
 “How many measurements should I run?”
 “Once I have all data, how do I summarize it into a single number?”
 “How do I compare the performance of different systems?”
 “How do I measure time in a parallel system?”
 …
 How are they answered in the field today?
 Let me start with a little anecdote … a reaction to this paper 
 Original findings:
 If carefully tuned, NBC speeds up a 3D solver
Full code published
 800^3 domain – 4 GB array
1 process per node, 8-96 nodes
Opteron 246 (old even in 2006, retired now)
 Super-linear speedup for 96 nodes
~5% better than linear
 9 years later: attempt to reproduce!
System A: 28 quad-core nodes, Xeon E5520
System B: 4 nodes, dual Opteron 6274
“Neither the experiment in A nor the one in B could
reproduce the results presented in the original paper,
where the usage of the NBC library resulted in a
performance gain for practically all node counts,
reaching a superlinear speedup for 96 cores (explained
as being due to cache effects in the inner part of the
matrix vector product).”
[Figure: original results (2006) vs. reproduction attempts (2015) on systems A and B; baseline: 1 node of system B]
State of the Practice in HPC
 Stratified random sample of three top conferences over four years
 HPDC, PPoPP, SC (years: 2011, 2012, 2013, 2014)
 10 random papers from each (10-50% of population)
 120 total papers; 25 (20%) did not report performance and were excluded
 Main results:
1. Most papers report details about the hardware but fail to describe the software environment.
Important details for reproducibility missing
2. The average paper’s results are hard to interpret and easy to question
Measurements and data not well explained
3. No statistically significant evidence for improvement over the years 
 Our main thesis: Performance results are often nearly impossible to reproduce! Thus, we need to provide enough information to allow scientists to understand the experiment, draw their own conclusions, assess their certainty, and possibly generalize the results.
This is especially important for HPC conferences and activities such as the Gordon Bell award!
Yes, this is a garlic press! We simply provide data for well-known issues (SoP).
1991 – the classic!
2012 – the shocking
2013 – the extension
Our constructive approach: provide a set of (12) rules
 Attempt to emphasize interpretability of performance experiments
 The set is not complete
 And probably never will be
 Intended to serve as a solid start
 Call to the community to extend it
 I will illustrate the 12 rules now
 Using real-world examples
All anonymized!
 Garth and Eddie will represent the scientists
The most common issue: speedup plots
“Check out my wonderful speedup!”
“I can’t tell if this is useful at all!”
 Most common and oldest-known issue
 First seen 1988 – also included in Bailey’s 12 ways
 39 papers reported speedups
15 (38%) did not specify the base-performance 
 Recently rediscovered in the “big data” universe
A. Rowstron et al.: Nobody ever got fired for using Hadoop on a cluster, HotCDP 2012
F. McSherry et al.: Scalability! but at what cost?, HotOS 2015
Rule 1: When publishing parallel speedup, report if the base
case is a single parallel process or best serial execution, as
well as the absolute execution performance of the base case.
 A simple generalization of this rule implies that one should never report ratios without
absolute values.
Garth’s new compiler optimization
[Bar chart: performance in Gflop/s on NAS CG, NAS LU, and NAS EP for ICC, LLVM, and GarthCC]
“Check out my new compiler!”
“How did it perform for FT and BT?”
“Well, GarthCC segfaulted for FT and was 20% slower for BT.”
Rule 2: Specify the reason for only reporting subsets of
standard benchmarks or applications or not using all system
resources.
 This implies: Show results even if your code/approach stops scaling!
The mean parts of means – or how to summarize data
[Bar chart: performance in Gflop/s on NAS CG, NAS LU, NAS EP, and NAS BT for ICC and GarthCC; GarthCC vs. ICC: +20% (CG), +20% (LU), +20% (EP), -20% (BT)]
“But GarthCC is 10% faster than ICC on average!”
“You cannot use the arithmetic mean for ratios!”
“Ah, true, the geometric mean is 8% speedup!”
“The geometric mean has no clear interpretation! What was the completion time of the whole workload?”
“Ugh, well, BT ran much longer than the others. GarthCC is actually 10% slower!”
Rule 3: Use the arithmetic mean only for summarizing costs.
Use the harmonic mean for summarizing rates.
Rule 4: Avoid summarizing ratios; summarize the costs or rates that the ratios are based on instead. Only if these are not available should the geometric mean be used for summarizing ratios.
 51 papers use means to summarize data, only four (!) specify which mean was used
 A single paper correctly specifies the use of the harmonic mean
 Two use geometric means, without reason
 Similar issues in other communities (PLDI, CGO, LCTES) – see N. Amaral’s report
 harmonic mean ≤ geometric mean ≤ arithmetic mean
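To make the difference concrete, here is a minimal sketch (not from the talk; the Gflop/s values are made up and equal work per benchmark is assumed) that computes all three means side by side:

/* means.c: summarizing the same hypothetical rates three ways */
#include <stdio.h>
#include <math.h>

int main(void) {
    double rate[] = {3.6, 3.6, 3.6, 2.4};   /* hypothetical Gflop/s per benchmark */
    int n = sizeof(rate) / sizeof(rate[0]);
    double arith = 0.0, harm = 0.0, geo = 0.0;
    for (int i = 0; i < n; i++) {
        arith += rate[i] / n;            /* arithmetic mean: appropriate for costs (times) */
        harm  += (1.0 / rate[i]) / n;    /* harmonic mean: appropriate for rates over equal work */
        geo   += log(rate[i]) / n;       /* geometric mean: last resort for ratios */
    }
    harm = 1.0 / harm;
    geo  = exp(geo);
    printf("arithmetic %.3f >= geometric %.3f >= harmonic %.3f Gflop/s\n", arith, geo, harm);
    return 0;
}

If the underlying times and the total work are available, summarizing those directly (Rule 4) avoids the question of which mean to pick in the first place.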
Dealing with variation
“The latency of Piz Dora is 1.75us!”
“How did you get to this?”
“I averaged 10^6 tests, it must be right!”
“Why do you think so? Can I see the data?”
[Plot: individual latency samples in usec vs. sample index]
Rule 5: Report if the measurement values are deterministic.
For nondeterministic data, report confidence intervals of the
measurement.
 Most papers report nondeterministic measurement results
 Only 15 mention some measure of variance
 Only two (!) report confidence intervals
 CIs allow us to compute the number of required measurements!
 Can be very simple, e.g., single sentence in evaluation:
“We collected measurements until the 99% confidence interval was within 5% of our reported means.”
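As a minimal sketch (not from the talk), such an interval for the mean can be computed with a normal approximation; it assumes enough samples that the sample mean is approximately normal, and the latency values below are made up:

#include <stdio.h>
#include <math.h>

/* z is a standard normal quantile, e.g., 2.576 for a 99% interval */
void mean_ci(const double *x, int n, double z, double *lo, double *hi) {
    double mean = 0.0, var = 0.0;
    for (int i = 0; i < n; i++) mean += x[i] / n;
    for (int i = 0; i < n; i++) var += (x[i] - mean) * (x[i] - mean) / (n - 1);
    double sem = sqrt(var / n);   /* standard error of the mean */
    *lo = mean - z * sem;
    *hi = mean + z * sem;
}

int main(void) {
    double lat[] = {1.74, 1.76, 1.75, 1.73, 1.77, 1.75, 1.78, 1.72};  /* hypothetical us */
    double lo, hi;
    mean_ci(lat, 8, 2.576, &lo, &hi);
    printf("99%% CI: [%.3f, %.3f] us\n", lo, hi);
    return 0;
}

Wrapping this in a loop that keeps measuring until hi - lo is within the desired fraction of the mean gives exactly the one-sentence policy quoted above (see also the backup slide on the number of measurements).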
“The confidence interval is 1.745us to 1.755us.”
“Did you assume normality?”
“Yes, I used the central limit theorem to normalize by summing subsets of size 100!”
“Can we test for normality?”
“Ugh, the data is not normal at all! The real CI is actually 1.6us to 1.8us!”
Rule 6: Do not assume normality of collected data (e.g.,
based on the number of samples) without diagnostic checking.
 Most events will slow down performance
 Heavy right-tailed distributions
 The Central Limit Theorem only applies asymptotically
 Some papers/textbooks mention “30-40 samples”; don’t trust them!
 Two papers used CIs around the mean without testing for normality
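When diagnostics show the data is not normal, a rank-based interval for the median avoids the normality assumption on the data itself. A minimal sketch (not from the talk; the rank formula uses a normal approximation to the binomial, so it still assumes a reasonably large sample, and the latencies are made up):

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

static int cmp_double(const void *a, const void *b) {
    double d = *(const double *)a - *(const double *)b;
    return (d > 0) - (d < 0);
}

/* approximate CI for the median from order statistics; z = 1.96 gives ~95% */
void median_ci(double *x, int n, double z, double *lo, double *hi) {
    qsort(x, n, sizeof(double), cmp_double);
    int l = (int)floor((n - z * sqrt((double)n)) / 2.0);       /* 1-based lower rank */
    int u = (int)ceil(1.0 + (n + z * sqrt((double)n)) / 2.0);  /* 1-based upper rank */
    if (l < 1) l = 1;
    if (u > n) u = n;
    *lo = x[l - 1];
    *hi = x[u - 1];
}

int main(void) {
    double lat[] = {1.6, 1.7, 1.7, 1.8, 1.7, 1.6, 2.9, 1.8, 1.7, 6.3,
                    1.7, 1.8, 1.6, 1.7, 1.9, 1.7, 1.8, 1.7, 1.6, 1.7};  /* hypothetical, right-tailed */
    double lo, hi;
    median_ci(lat, 20, 1.96, &lo, &hi);
    printf("median CI: [%.2f, %.2f] us\n", lo, hi);
    return 0;
}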
Comparing nondeterministic measurements
“I saw variance using GarthCC as well!”
“Show me the data!”
“Retract the paper! You have not shown anything!”
[Plot: execution time of ICC vs. GarthCC with 95% CIs; the difference is about 20%]
Rule 7: Compare nondeterministic data in a statistically sound
way, e.g., using non-overlapping confidence intervals or
ANOVA.
 None of the investigated papers used statistically sound comparisons
 The “effect size” can even be a stronger metric
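A minimal sketch of such a comparison (not from the talk; the execution times are made up): it reports the two 95% intervals and an effect size, and it only claims a difference when the intervals do not overlap, which is a deliberately conservative check.

#include <stdio.h>
#include <math.h>

static void mean_var(const double *x, int n, double *m, double *v) {
    *m = 0.0; *v = 0.0;
    for (int i = 0; i < n; i++) *m += x[i] / n;
    for (int i = 0; i < n; i++) *v += (x[i] - *m) * (x[i] - *m) / (n - 1);
}

int main(void) {
    double icc[]     = {1.02, 0.98, 1.05, 1.01, 0.99, 1.03};  /* hypothetical seconds */
    double garthcc[] = {0.84, 0.88, 0.81, 0.86, 0.90, 0.83};
    int n = 6;
    double ma, va, mb, vb;
    mean_var(icc, n, &ma, &va);
    mean_var(garthcc, n, &mb, &vb);
    double z = 1.96;                                /* 95%, normal approximation */
    double lo_a = ma - z * sqrt(va / n), hi_a = ma + z * sqrt(va / n);
    double lo_b = mb - z * sqrt(vb / n), hi_b = mb + z * sqrt(vb / n);
    double d = (ma - mb) / sqrt((va + vb) / 2.0);   /* Cohen's d (equal sample sizes) */
    printf("ICC [%.3f, %.3f]  GarthCC [%.3f, %.3f]  effect size d = %.2f\n",
           lo_a, hi_a, lo_b, hi_b, d);
    if (hi_a < lo_b || hi_b < lo_a)
        printf("CIs do not overlap: the difference is unlikely to be noise\n");
    else
        printf("CIs overlap: this data alone does not show a difference\n");
    return 0;
}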
What if the data looks weird!?
“Look what data I got!”
“Clearly, the mean/median are not sufficient!”
“Try quantile regression!”
Quantile Regression
“Wow, so Pilatus is better for latency-critical workloads even though Dora is expected to be faster.”
Rule 8: Carefully investigate if measures of central tendency
such as mean or median are useful to report. Some problems,
such as worst-case latency, may require other percentiles.
 Check Oliveira et al. “Why you should care about quantile regression”. SIGARCH
Computer Architecture News, 2013.
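A minimal sketch of reporting tail percentiles next to (or instead of) the median (not from the talk, and not quantile regression itself; the latencies are made up):

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

static int cmp_double(const void *a, const void *b) {
    double d = *(const double *)a - *(const double *)b;
    return (d > 0) - (d < 0);
}

/* nearest-rank percentile of a sorted array, p in (0, 100] */
double percentile(const double *sorted, int n, double p) {
    int rank = (int)ceil(p / 100.0 * n);
    if (rank < 1) rank = 1;
    if (rank > n) rank = n;
    return sorted[rank - 1];
}

int main(void) {
    double lat[] = {1.6, 1.7, 1.7, 1.8, 1.7, 1.6, 4.9, 1.8, 1.7, 12.3};  /* hypothetical us */
    int n = sizeof(lat) / sizeof(lat[0]);
    qsort(lat, n, sizeof(double), cmp_double);
    printf("median %.1f us, 75th %.1f us, 99th %.1f us\n",
           percentile(lat, n, 50.0), percentile(lat, n, 75.0), percentile(lat, n, 99.0));
    return 0;
}

For latency-critical workloads the 99th percentile often tells a different story than the median, which is the point of the Pilatus-vs.-Dora example above.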
Experimental design
“MPI_Reduce behaves much simpler!”
“I don’t believe you, try other numbers of processes!”
Rule 9: Document all varying factors and their levels as well
as the complete experimental setup (e.g., software, hardware,
techniques) to facilitate reproducibility and provide
interpretability.
 We recommend factorial design
 Consider parameters such as node allocation, process-to-node mapping, network or
node contention
 If they cannot be controlled easily, use randomization and model as random variable
 This is hard in practice and not easy to capture in rules
Time in parallel systems
“My simple broadcast takes only one latency!”
“That’s nonsense!”
“But I measured it so it must be true!”
double t = -MPI_Wtime();
for (int i = 0; i < 1000; i++) {
  MPI_Bcast(/* … */);
}
t += MPI_Wtime();
t /= 1000;  /* reports the amortized time of 1,000 pipelined broadcasts, not the latency of one */
/* … */
“Measure each operation separately!”
Summarizing times in parallel systems!
“My new reduce takes only 30us on 64 ranks.”
“Come on, show me the data!”
Rule 10: For parallel time measurements, report all
measurement, (optional) synchronization, and summarization
techniques.
 Measure events separately
 Use high-precision timers
 Synchronize processes
 Summarize across processes:
 Min/max (unstable), average, median – depends on use-case
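A minimal sketch of these points (not from the talk and not LibSciBench; buffer size and repetition count are made up, and a barrier is only a coarse stand-in for the window-based synchronization schemes discussed in the paper):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

static int cmp_double(const void *a, const void *b) {
    double d = *(const double *)a - *(const double *)b;
    return (d > 0) - (d < 0);
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    enum { REPS = 1000, COUNT = 1024 };
    static double buf[COUNT];
    double t[REPS], local, start;

    for (int i = 0; i < REPS; i++) {
        MPI_Barrier(MPI_COMM_WORLD);          /* coarse synchronization before each event */
        start = MPI_Wtime();                  /* high-precision timer */
        MPI_Bcast(buf, COUNT, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        local = MPI_Wtime() - start;          /* each operation measured separately */
        /* one possible per-event summary across processes: the slowest rank */
        MPI_Reduce(&local, &t[i], 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    }

    if (rank == 0) {
        qsort(t, REPS, sizeof(double), cmp_double);
        printf("median %.2f us, 99th percentile %.2f us over %d events\n",
               1e6 * t[REPS / 2], 1e6 * t[(99 * REPS) / 100], REPS);
    }
    MPI_Finalize();
    return 0;
}

Whether min, max, average, or median across processes is the right per-event summary depends on the use case, as the slide notes.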
Give times a meaning!
“I compute 10^10 digits of Pi in 2ms on Dora!”
“I have no clue. Can you provide: ideal speedup, Amdahl’s speedup, parallel overheads?”
“Ok: The code runs 17ms on a single core, 0.2ms are initialization, and it has one reduction!”
Rule 11: If possible, show upper performance bounds to
facilitate interpretability of the measured results.
 Model computer system as k-dimensional space
 Each dimension represents a capability
Floating point, Integer, memory bandwidth, cache bandwidth, etc.
 Features are typical rates
 Determine maximum rate for each feature
E.g., from documentation or benchmarks
 Can be used to prove optimality of an implementation
 If the requirements of the bottleneck feature are minimal
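A minimal sketch of such a bound (not from the talk; the peak rates and per-run requirements are made up and would in practice come from documentation or microbenchmarks). An upper bound on performance corresponds to a lower bound on time, determined by the bottleneck feature:

#include <stdio.h>

struct feature {
    const char *name;
    double requirement;   /* how much of this feature one run needs */
    double peak_rate;     /* maximum achievable rate for this feature */
};

int main(void) {
    struct feature f[] = {
        { "flop",          2.0e9, 9.6e9  },   /* operations, op/s */
        { "memory bytes",  1.6e9, 12.8e9 },   /* bytes, bytes/s */
        { "network bytes", 0.2e9, 5.0e9  },   /* bytes, bytes/s */
    };
    int k = sizeof(f) / sizeof(f[0]);
    double bound = 0.0;
    const char *bottleneck = "none";
    for (int i = 0; i < k; i++) {
        double t = f[i].requirement / f[i].peak_rate;  /* time if only this feature limited the run */
        if (t > bound) { bound = t; bottleneck = f[i].name; }
    }
    printf("lower bound on time: %.3f s (bottleneck: %s)\n", bound, bottleneck);
    return 0;
}

Reporting the measured time next to such a bound is what makes a statement like “2ms on Dora” interpretable.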
Plot as much information as possible!
“My most common request was ‘show me the data’.”
“This is how I should have presented the Dora results.”
Rule 12: Plot as much information as needed to interpret the
experimental results. Only connect measurements by lines if
they indicate trends and the interpolation is valid.
Acknowledgments
 ETH’s mathematics department (home of R)
 Hans Rudolf Künsch, Martin Maechler, and Robert Gantner
 Comments on early drafts
 David H. Bailey, William T. Kramer, Matthias Hauswirth, Timothy
Roscoe, Gustavo Alonso, Georg Hager, Jesper Träff, and Sascha
Hunold
 Help with HPL run
 Gilles Fourestier (CSCS) and Massimiliano Fatica (NVIDIA)
Conclusions and call for action
 Performance may not be reproducible
 At least not for some (important) results
 Interpretability fosters scientific progress
Enables building on the results
 Sound statistics is the biggest gap today
 We need to foster interpretability
 Do it ourselves (this is not easy)
 Teach young students
 Maybe even enforce in TPCs
 See the 12 rules as a start
 Need to be extended (or concretized)
 Much is implemented in LibSciBench [1]
No vegetables were harmed for creating these slides!
[1]: http://spcl.inf.ethz.ch/Research/Performance/LibLSB/
Backup slides
Dealing with non-normal data – nonparametric statistics
 Rank-based measures (no assumption about distribution)
 Almost always better than assuming normality
 Example: median (50th percentile) vs. mean for HPL
 Rather stable statistic for expectation
 Other percentiles (usually 25th and 75th) are also useful
How many measurements are needed?
 Measurements are expensive!
 Yet necessary to reach certain confidence
 How to determine the minimal number of measurements?
 Measure until the confidence interval has a certain acceptable width
 For example, measure until the 95% CI is within 5% of the mean/median
 Can be computed analytically assuming normal data
 Compute iteratively for nonparametric statistics
 Often heard: “we cannot afford more than a single measurement”
 E.g., Gordon Bell runs
 Well, then one cannot say anything about the variance
Even 3-4 measurements can provide a very tight CI (assuming normality)
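A minimal sketch of this stopping rule (not from the talk; run_benchmark() is a hypothetical stand-in for a single measurement, and the normal approximation from the earlier sketch is reused):

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

/* hypothetical stand-in: replace with one real timed run */
static double run_benchmark(void) {
    return 1.0 + 0.02 * (rand() / (double)RAND_MAX);
}

int main(void) {
    enum { MIN_N = 10, MAX_N = 100000 };
    double sum = 0.0, sumsq = 0.0;
    int n = 0;
    while (n < MAX_N) {
        double x = run_benchmark();
        sum += x; sumsq += x * x; n++;
        if (n < MIN_N) continue;
        double mean = sum / n;
        double var = (sumsq - n * mean * mean) / (n - 1);   /* numerically naive, fine for a sketch */
        if (var < 0.0) var = 0.0;
        double half = 1.96 * sqrt(var / n);                 /* 95% CI half-width */
        if (half <= 0.05 * mean) break;                     /* stop when CI is within 5% of the mean */
    }
    printf("stopped after %d measurements\n", n);
    return 0;
}

For nonparametric statistics the same loop applies, but the interval is recomputed from order statistics each round rather than from the normal formula.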