Optimizing Thread Performance for a
Genomics Variant Caller
This talk
• Introduce two tools that can help improve the performance of
multithreaded code
• Apply the tools to a real world Genomics code
Tool 1: Allinea Performance Reports – benchmarking and
characterization
Tool 2: Allinea Forge - Debugging and Profiling
• Debug and profile from
one interface,
configuration
• Secure native remote and
local access
• Rapidly switch between
the tasks
• Edit, build, commit,
debug, profile, optimize..
Small data files
<5% slowdown
No instrumentation
No recompilation
Our profiler finds the performance bottlenecks
Our debugger helps with bugs and performance
• Observe why
workload is
imbalanced
• Observe why
particular code paths
are followed
• .. And fix any bugs
that optimization
creates!
Above all…
• The tools are aimed at any performance problem that matters
– Focus on time: the ultimate judge of performance
• Do not prejudge the problem
– Don’t assume it’s MPI messages, threads or I/O before profiling!
• If there’s a problem..
– Allinea Performance Reports shows it, and advises you on solutions
– Allinea Forge’s profiler shows it, next to your code
6 steps to improve performance
Get a realistic test
case
• Performance on real data
matters
• Keep the test case for
reference and re-use
Profile your code
• Add “-g” flag to your
compilation
• Run with a profiler
Look for the significant
• Which part/phase of the
code dominates time?
• Is there any unexpected
significant time use?
What is the nature of
the problem?
• Compute? I/O? MPI?
Thread synchronization?
• Display the metrics that
show the problem best
Apply brain to solve
• MPI – can you balance the
work better?
• Compute – is memory time
dominant – can you improve
layout?
Think of the future
• Try larger process or thread
counts to watch for
scalability problems
• Keep the profile (.map file)
for future comparison
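One way to make "watch for scalability problems" concrete is to compare run times at different thread counts. The sketch below is a minimal illustration, not part of the talk: the function names `speedup` and `efficiency` are hypothetical, and the timings would come from your own runs.

```cpp
// Speedup of an n-thread run relative to the 1-thread run.
double speedup(double t1_seconds, double tn_seconds) {
    return t1_seconds / tn_seconds;
}

// Parallel efficiency: 1.0 means perfect scaling; values well below 1.0
// mean the extra threads are adding little useful work.
double efficiency(double t1_seconds, double tn_seconds, int n_threads) {
    return speedup(t1_seconds, tn_seconds) / n_threads;
}
```

For example, a run that takes 100 s on one thread and 25 s on eight has a speedup of 4 and an efficiency of 0.5 (illustrative numbers, not from the talk). Efficiency that keeps falling as threads are added is a prompt to profile at the larger count.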
Example: Improving Thread Usage in Genomics
• DISCOVAR
– Variant caller and small genome assembler
– Sub-mammalian sized genomes
– Newer DISCOVAR de novo for larger genomes
• C++ and OpenMP
• Developed by the Broad Institute of MIT and Harvard
A first look – on real hardware
• It’s not I/O intensive
• Good quantity of
OpenMP time
• No vectorization
OpenMP in detail
• Physical cores are
200% loaded:
hyperthreading is on
• 17% of parallel region
time is synchronization
• .. That’s quite high
Investigating the OpenMP synchronization
• Horizontal time axis:
colour coded
– Dark green – single core
– Light green – OpenMP work
– Light blue – pthread
synchronization
– Gray – idle
• Vertical axis
– #cores doing something
• Something’s very wrong
towards the end – with
all the gray
Zoom in on the region
• Stacks, code, regions,
time are all focused on
zoom area
• Key observation:
– OpenMP region with
“omp critical” is where
the time is being wasted
Fixing
• #pragma omp critical
– Execute exactly one
thread at a time to
ensure safety
• Is costing too much
– Passing “token” from
thread to thread to do
small pieces of work.
• Run whole section on
one thread instead
– Has same semantics
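The pattern and the fix can be sketched as follows. This is an illustration only: the DISCOVAR source is not shown in the slides, so the functions `sum_with_critical` and `sum_serial` and the summing workload are assumptions chosen to show the shape of the problem.

```cpp
#include <vector>

// Pattern described on the slide: every iteration takes the lock for a
// tiny piece of work, so threads mostly hand the "token" to each other.
long sum_with_critical(const std::vector<int>& items) {
    long total = 0;
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(items.size()); ++i) {
        #pragma omp critical
        {
            total += items[i];
        }
    }
    return total;
}

// The fix: run the whole section on one thread. The result is the same
// and no lock is passed around.
long sum_serial(const std::vector<int>& items) {
    long total = 0;
    for (int v : items) {
        total += v;
    }
    return total;
}
```

Both functions return the same total. When the whole loop body sits inside the critical section, the parallel version can only add overhead, which is why the serial form wins here.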
Impact of change
• Runtime down by 7%
As a performance report
• Improvements in
– Runtime
– Synchronization
overhead
Let’s try something bigger – into Amazon cloud!
• C4.8xlarge
– 36 hyperthreaded cores
– 60 GB RAM
– Xeon E5-2666 v3 Haswell
– 25 MB Cache
– 2.6 GHz
vs
• Our physical server
– 24 hyperthreaded cores
– 24 GB RAM
– Xeon E5-2407 v2
– 10 MB Cache
– 2.4 GHz
$ ./runme.sh
discovar version: Discovar r52488
loadaverage: 0.05 0.98 1.36 1/790 16317
2015-07-27 07:57 PERF: REAL 835.857 USER 36.188
SYSTEM 5.441 PERC 4.71
835 seconds to run on EC2
… vs …
~448 seconds on our physical server
Why?
Profile with Allinea Forge to find where the problem is
• Focus on initial 300
seconds: something
must be wrong here
• Serious lack of good
“green” compute
In detail…
• 36 threads, waiting… but who is using madvise?!
Why is glibc so bad?
• madvise system call in
_int_free()
– At least two context
switches each call ..
– This glibc version has
issues…?
• What other options are
there?
Maybe Google TCMalloc?
• Optimized for multi-threaded
applications
• No win
– Same run time
– Issue is use of sys_futex
not madvise
• .. Not optimized for this
multithreaded
application!
Jemalloc?
• As recommended by
the Broad Institute
• … same runtime
Jemalloc – same problem
• Source proves the issue
again…
Can Intel libraries help?
• We try the Intel TBB
multithreaded allocator
• 14 minutes down to 10
minutes!
• .. But still this code has
scope for more…
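To compare allocators outside the full application, a small allocation-churn test can help. The sketch below is an assumption about what such a test might look like, not code from the talk: `churn` and `time_churn` are hypothetical names, and the block size and round count are placeholders. Build it with OpenMP enabled so the workers really run concurrently.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Each call repeatedly allocates and frees a small buffer, the kind of
// churn that can keep threads inside the allocator instead of computing.
// Returns the total number of bytes requested.
std::size_t churn(int rounds, std::size_t block_bytes) {
    std::size_t touched = 0;
    for (int r = 0; r < rounds; ++r) {
        std::vector<char> block(block_bytes, 1);
        touched += block.size();
    }
    return touched;
}

// Runs churn() on n_threads OpenMP threads and returns elapsed seconds.
double time_churn(int n_threads, int rounds, std::size_t block_bytes) {
    std::vector<std::size_t> touched(n_threads, 0);
    auto start = std::chrono::steady_clock::now();
    #pragma omp parallel for num_threads(n_threads)
    for (int t = 0; t < n_threads; ++t) {
        touched[t] = churn(rounds, block_bytes);
    }
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

Timing the same binary at 1 thread and at 36 threads, once per allocator, gives a rough view of how each one copes with contention. The slides do not say how the allocators were swapped; preloading the allocator's shared library with `LD_PRELOAD` is one common way that avoids rebuilding. A synthetic test like this is only a hint, and the profile of the real run remains the judge.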
Real optimization of OpenMP regions
• NB – still profiling for
first 300 seconds only
• Significant inactivity in
final 60 seconds
• OpenMP region
– #pragma omp parallel for
• Is it working?
– No – the threads are idle
• Let’s remove
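The talk simply removed the pragma. The slides do not say why the threads were idle; if the cause is too little work per loop, a less drastic variant, not used in the talk, is OpenMP's `if` clause, which keeps the loop serial below a size cut-off. The function `scale_in_place` and the threshold value are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical cut-off below which the loop stays serial; the right
// value has to come from profiling, not from this sketch.
constexpr std::size_t kParallelThreshold = 100000;

// Doubles every element. The `if` clause keeps small inputs on one
// thread, where starting and joining a thread team could cost more
// than the loop itself.
void scale_in_place(std::vector<double>& data) {
    const long n = static_cast<long>(data.size());
    #pragma omp parallel for if(data.size() >= kParallelThreshold)
    for (long i = 0; i < n; ++i) {
        data[i] *= 2.0;
    }
}
```

The result is the same either way; only the decision to use a thread team changes with input size.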
After the first fix…
• Now able to run to
completion
– 358 seconds
• Still inactivity at end of
run
Zoomed to the inactivity…
• Another OpenMP region
• Quick edit: comment out
the OpenMP, again!
… and the impact
• Down to 304 seconds
Finally… something to sort out
• Recursive, in-place
multithreaded sorter
• Is not scaling well with
thread count
• Options?
– Re-engineer
– Replace
– Tune
Let’s tune
• Try limiting the thread pool to 8 workers
– Better than 36 clashing threads?
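The idea of capping the workers can be sketched with OpenMP tasks. This is a stand-in, not DISCOVAR's sorter, and the slides do not show how the pool was limited there: `sort_range`, `capped_sort` and the serial cut-off are all assumptions for illustration.

```cpp
#include <algorithm>
#include <vector>

// Recursive merge sort over [lo, hi) within the original vector.
// Sub-ranges above the cut-off become OpenMP tasks; smaller ones are
// sorted directly.
void sort_range(std::vector<int>& v, long lo, long hi) {
    const long kSerialCutoff = 4096;  // hypothetical cut-off
    if (hi - lo <= kSerialCutoff) {
        std::sort(v.begin() + lo, v.begin() + hi);
        return;
    }
    const long mid = lo + (hi - lo) / 2;
    #pragma omp task shared(v)
    sort_range(v, lo, mid);
    #pragma omp task shared(v)
    sort_range(v, mid, hi);
    #pragma omp taskwait
    std::inplace_merge(v.begin() + lo, v.begin() + mid, v.begin() + hi);
}

// Sorts v with a team of at most max_workers threads, however many
// cores the machine has.
void capped_sort(std::vector<int>& v, int max_workers) {
    #pragma omp parallel num_threads(max_workers)
    {
        #pragma omp single
        sort_range(v, 0, static_cast<long>(v.size()));
    }
}
```

Calling `capped_sort(v, 8)` mirrors the "8 workers" experiment: the recursion still creates many tasks, but only eight threads ever compete for them. Whether eight is the right number for a given machine and data set is something to measure, not assume.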
Result…
• Runtime 4.7 minutes
• 3x improvement on
original
• #1 position on the
Broad Benchmark list
for a sub-$2 / hour
system!
Lessons learned
• Real codes exhibit many different performance patterns
– Profiling real data sets at real scales is vital to target the effort
– Small test cases do not expose all the problems
– Small thread counts can be too small to find real problems
• Changing code can be simple
– Use threads wisely – it will not always be faster
– Changing libraries – someone else might have fixed your problem
• Re-engineering is sometimes necessary
– Take advantage of vector units
– Take advantage of threads
Increase the performance of your software
Analyze and tune
with Allinea
Performance Reports
Develop, profile and
debug applications
with Allinea Forge
With professional
support when you
need it most
Read more!
