Optimizing thread performance for a genomics variant caller

This talk
• Introduce two tools that can help improve the performance of multithreaded code
• Apply the tools to a real-world genomics code

Tool 1: Allinea Performance Reports – benchmarking and
characterization

Tool 2: Allinea Forge – debugging and profiling
• Debug and profile from one interface and one configuration
• Secure native remote and local access
• Rapidly switch between the tasks
• Edit, build, commit, debug, profile, optimize…

Our profiler finds the performance bottlenecks
• Small data files
• <5% slowdown
• No instrumentation
• No recompilation

Our debugger helps with bugs and performance
• Observe why workload is imbalanced
• Observe why particular code paths are followed
• … And fix any bugs that optimization creates!

Above all…
• The tools are aimed at any performance problem that matters
– Focus on time: the ultimate judge of performance
• Do not prejudge the problem
– Don’t assume it’s MPI messages, threads or I/O before profiling!
• If there’s a problem..
– Allinea Performance Reports shows it, and advises you on solutions
– Allinea Forge’s profiler shows it, next to your code

6 steps to improve performance
1. Get a realistic test case
• Performance on real data matters
• Keep the test case for reference and re-use
2. Profile your code
• Add the “-g” flag to your compilation
• Run with a profiler
3. Look for what is significant
• Which part/phase of the code dominates time?
• Is there any unexpected significant time use?
4. What is the nature of the problem?
• Compute? I/O? MPI? Thread synchronization?
• Display the metrics that show the problem best
5. Apply brain to solve
• MPI – can you balance the work better?
• Compute – is memory time dominant? Can you improve layout?
6. Think of the future
• Try larger process or thread counts to watch for scalability problems
• Keep the profile (.map file) for future comparison

Example: Improving Thread Usage in Genomics
• DISCOVAR
– Variant caller and small genome assembler
– Sub-mammalian sized genomes
– Newer DISCOVAR de novo for larger genomes
• C++ and OpenMP
• Developed by the Broad Institute of MIT and Harvard

A first look – on real hardware
• It’s not I/O intensive
• Good quantity of
OpenMP time
• No vectorization

OpenMP in detail
• Physical cores are 200% loaded: hyperthreading is on
• 17% of parallel region time is synchronization
• … That’s quite high
Investigating the OpenMP synchronization
• Horizontal time axis: colour coded
– Dark green – single core
– Light green – OpenMP work
– Light blue – pthread synchronization
– Gray – idle
• Vertical axis
– #cores doing something
• Something’s very wrong towards the end – with all the gray

Zoom in on the region
• Stacks, code, regions, time are all focused on zoom area
• Key observation:
– OpenMP region with “omp critical” is where the time is being wasted

Fixing
• #pragma omp critical
– Lets exactly one thread at a time execute the section, to ensure safety
• It is costing too much
– Threads pass a “token” from one to the next to do small pieces of work
• Run the whole section on one thread instead
– Has the same semantics
Impact of change
• Runtime down by 7%

As a performance report
• Improvements in
– Runtime
– Synchronization
overhead

Let’s try something bigger – into Amazon cloud!
• C4.8xlarge
– 36 hyperthreaded cores
– 60 GB RAM
– Xeon E5-2666 v3 Haswell
– 25 MB cache
– 2.6 GHz
vs
• Our physical server
– 24 hyperthreaded cores
– 24 GB RAM
– Xeon E5-2407 v2
– 10 MB cache
– 2.4 GHz
$ ./runme.sh
discovar version: Discovar r52488
loadaverage: 0.05 0.98 1.36 1/790 16317
2015-07-27 07:57 PERF: REAL 835.857 USER 36.188 SYSTEM 5.441 PERC 4.71
835 seconds to run on EC2
… vs …
~448 seconds on our physical server
Why?

Profile with Allinea Forge to find where the problem is
• Focus on the initial 300 seconds: something must be wrong here
• Serious lack of good “green” compute

In detail…
• 36 threads, waiting… but who is using madvise?!

Why is glibc so bad?
• madvise system call in _int_free()
– At least two context switches each call…
– This glibc version has issues…?
• What other options are there?
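
One way to compare allocators before committing to one is a small stress loop that allocates and frees from every thread. The sketch below is an assumption-laden illustration, not the allocation pattern of DISCOVAR: `churn_allocator` and its parameters are hypothetical, and real workloads mix many sizes and lifetimes. Timing the same binary with different allocators linked in or preloaded gives a rough first comparison.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Each iteration allocates, fills, and frees a short-lived buffer, so the
// allocator is exercised by every thread in the team.
// Returns a checksum so the compiler cannot discard the work.
long churn_allocator(int iterations, std::size_t buffer_bytes) {
    assert(iterations >= 0 && buffer_bytes > 0);
    long checksum = 0;
    #pragma omp parallel for reduction(+ : checksum)
    for (int i = 0; i < iterations; ++i) {
        // Heap allocation and release happen once per iteration.
        std::vector<char> scratch(buffer_bytes, static_cast<char>(i % 100));
        checksum += scratch[buffer_bytes - 1];
    }
    return checksum;
}
```

A profile of such a loop shows whether time goes to compute or to system calls and locks inside the allocator, which is the distinction the slides draw between madvise and sys_futex.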

Maybe Google TCMalloc?
• Optimized for multithreaded applications
• No win
– Same run time
– Issue is use of sys_futex, not madvise
• … Not optimized for this multithreaded application!

Jemalloc?
• As recommended by
the Broad Institute
• … same runtime

Jemalloc – same problem
• Source proves the issue
again…

Can Intel libraries help?
• We try the Intel TBB
multithreaded allocator
• 14 minutes down to 10
minutes!
• .. But still this code has
scope for more…

Real optimization of OpenMP regions
• NB – still profiling for the first 300 seconds only
• Significant inactivity in the final 60 seconds
• OpenMP region
– #pragma omp parallel for
• Is it working?
– No – the threads are idle
• Let’s remove it
After the first fix…
• Now able to run to
completion
– 358 seconds
• Still inactivity at end of
run

Zoomed to the inactivity…
• Another OpenMP region
• Quick edit: comment out
the OpenMP, again!

… and the impact
• Down to 304 seconds

Finally… something to sort out
• Recursive, in-place multithreaded sorter
• Does not scale well with thread count
• Options?
– Re-engineer
– Replace
– Tune

Let’s tune
• Try limiting the thread pool to 8 workers
– Better than 36 clashing threads?
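
Capping the worker count can be sketched with OpenMP tasks. The slides do not show how DISCOVAR's sorter manages its thread pool, so this is an illustration only: `sort_range`, `sort_with_capped_threads` and the grain size `kSerialCutoff` are hypothetical, and the recursive merge sort stands in for the real sorter.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical grain size: ranges this small are sorted directly.
constexpr std::ptrdiff_t kSerialCutoff = 4096;

// Recursive merge sort over [lo, hi); each half becomes an OpenMP task.
void sort_range(std::vector<int>& data, std::ptrdiff_t lo, std::ptrdiff_t hi) {
    assert(0 <= lo && lo <= hi &&
           hi <= static_cast<std::ptrdiff_t>(data.size()));
    if (hi - lo <= kSerialCutoff) {
        std::sort(data.begin() + lo, data.begin() + hi);
        return;
    }
    const std::ptrdiff_t mid = lo + (hi - lo) / 2;
    #pragma omp task shared(data)
    sort_range(data, lo, mid);
    #pragma omp task shared(data)
    sort_range(data, mid, hi);
    // Both halves must be sorted before they are merged.
    #pragma omp taskwait
    std::inplace_merge(data.begin() + lo, data.begin() + mid,
                       data.begin() + hi);
}

void sort_with_capped_threads(std::vector<int>& data) {
    // num_threads(8) requests a team of at most 8 workers,
    // however many cores the machine has.
    #pragma omp parallel num_threads(8)
    #pragma omp single
    sort_range(data, 0, static_cast<std::ptrdiff_t>(data.size()));
}
```

One thread creates the task tree and the rest of the team picks tasks up, so the number of threads competing for memory bandwidth and locks stays at 8 or fewer.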

Result…
• Runtime 4.7 minutes
• 3x improvement on original
• #1 position on the Broad Benchmark list for a sub-$2 / hour system!

Lessons learned
• Real codes exhibit many different performance patterns
– Profiling real data sets at real scales is vital to target the effort
– Small test cases do not expose all the problems
– Small thread counts can be too small to find real problems
• Changing code can be simple
– Use threads wisely – it will not always be faster
– Changing libraries – someone else might have fixed your problem
• Re-engineering is sometimes necessary
– Take advantage of vector units
– Take advantage of threads

Increase the performance of your software
Analyze and tune
with Allinea
Performance Reports
Develop, profile and
debug applications
with Allinea Forge
With professional
support when you
need it most
Read more!
