Reshaping Core Genomics Software Tools for the Manycore Era
Ben Langmead
Assistant Professor
Johns Hopkins University, Department of Computer Science
November 2016
Reshaping core genomics software
tools for the many-core era
3
The Intel Parallel Computing Center at Johns Hopkins University
4
Outline
§  Introduction to sequencing data analysis & Bowtie
§  Thread scaling improvements using TBB
–  Choice of mutex
–  Two-stage parsing
§  AVX2, AVX512-KNC & AVX512-KNL improvements
§  Impact on the field
5
Sequencing
6
Sequencing
7
Sequencing
8
Read alignment
9
Read alignment
§  Needle in a haystack
§  Billions of reads from
a single week-long
sequencing run
§  Human reference
genome is ~3B bases
(letters) long
10
Bowtie and Bowtie 2
§  Together cited by
>12K other scientific
studies since 2009
§  Bundled with dozens
of other tools & many
Linux distros
11
HISAT
§  Based on Bowtie 2; a leading spliced aligner for RNA sequencing data
§  Cited in >75 scientific studies since 2015
12
Design of Bowtie & Bowtie 2
Bowtie 1
Bowtie 2
13
Design of Bowtie & Bowtie 2
Bowtie 1
Bowtie 2
Random access to large index
data structure and minimal ILP
14
Design of Bowtie & Bowtie 2
Bowtie 1
Bowtie 2
Dynamic programming, lots of ILP
[Plot: normalized running time vs. # threads (unpaired); lock: TBB spin_mutex, tinythreads fast_mutex]
15
Thread scaling
§  Switching to analogous TBB lock could bring big improvement
Ivy Bridge, 4 NUMA nodes,
120 threads
Vertical axis is per-thread running time; lower is better
Bowtie 1 unpaired
[Plot: normalized running time vs. # threads (unpaired); lock: None (stubbed I/O), TBB spin_mutex; version: Original parsing]
16
Thread scaling
§  Removing synchronization by “stubbing” input lock gives further improvement
Bowtie 2 unpaired
Ivy Bridge, 4 NUMA nodes,
120 threads
Vertical axis is per-thread
running time; lower is better
17
Thread scaling
§  VTune investigation indicates synchronization itself (e.g. see __TBB_LockByte)
is taking the time
18
Thread scaling
Bowtie 2 unpaired
How to close the
gap between
actual and ideal
performance?
19
Thread scaling
Bowtie 2 unpaired
Why does mutex
choice have outsize
effect?
CMU 15-418/618, Spring 2015
Test-and-set lock performance
Benchmark executes: lock(L); critical-section(c); unlock(L);
[Plot: time (µs) vs. number of processors]
Benchmark: total of N lock/unlock sequences (in aggregate) by P processors
Critical section time removed so graph plots only time acquiring/releasing the lock
Bus contention increases amount of
time to transfer lock (lock holder must
wait to acquire bus to release)
Not shown: bus contention also slows
down execution of critical section
Figure credit: Culler, Singh, and Gupta
20
Thread scaling
§  Mutex spinning on an atomic op (compare-and-swap, test-and-set) spurs an exchange of cache coherence messages (see the spin-lock sketch below)
§  Image by Kayvon Fatahalian,
Copyright 2015 Carnegie
Mellon University
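For concreteness, here is a minimal sketch (not the TBB or pthreads implementation) of a test-and-set spin lock and its test-and-test-and-set refinement; the comments note where the coherence traffic described above comes from.

#include <atomic>

// Minimal test-and-set spin lock (illustrative sketch only).  Every call to
// exchange() is a write to the lock's cache line, so each spinning thread
// repeatedly pulls that line into its cache in exclusive state and
// invalidates everyone else's copy -- the coherence "ping-pong" above.
class TasSpinLock {
public:
    void lock()   { while (flag.exchange(true, std::memory_order_acquire)) { /* spin */ } }
    void unlock() { flag.store(false, std::memory_order_release); }
private:
    std::atomic<bool> flag{false};
};

// Test-and-test-and-set reduces (but does not eliminate) the traffic:
// threads spin on a plain read, which a shared cache line can satisfy, and
// only attempt the atomic write once the lock looks free.
class TtasSpinLock {
public:
    void lock() {
        for (;;) {
            while (flag.load(std::memory_order_relaxed)) { /* read-only spin */ }
            if (!flag.exchange(true, std::memory_order_acquire)) return;
        }
    }
    void unlock() { flag.store(false, std::memory_order_release); }
private:
    std::atomic<bool> flag{false};
};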
21
Thread scaling
§  Even a standard pthreads mutex was outperforming the
spin lock when running one thread per available core
–  More evidence that cache coherence traffic is culprit
§  Queue locks are known to have better cache properties (see the queue-lock sketch below)
–  Waiting thread spins on normal (non-atomic) read
–  Cache line read belongs exclusively to that thread
and can live in L1
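Below is a minimal MCS-style queue lock sketch using standard C++11 atomics. It shows one common queue-lock design for illustration; TBB's queuing_mutex follows the same general idea but is not this code. Each waiter spins on a flag in its own node, so the cache line it polls stays in its local L1 cache.

#include <atomic>

// Each waiter brings its own node; the node's 'locked' flag is the only
// thing that waiter spins on, so it stays in that core's L1 cache.
struct McsNode {
    std::atomic<McsNode*> next{nullptr};
    std::atomic<bool>     locked{false};
};

class McsQueueLock {
public:
    void lock(McsNode& me) {
        me.next.store(nullptr, std::memory_order_relaxed);
        me.locked.store(true, std::memory_order_relaxed);
        // Join the tail of the queue with a single atomic exchange.
        McsNode* prev = tail.exchange(&me, std::memory_order_acq_rel);
        if (prev != nullptr) {
            prev->next.store(&me, std::memory_order_release);
            // Local spin: an ordinary read of our own node, no atomic RMW.
            while (me.locked.load(std::memory_order_acquire)) { }
        }
    }
    void unlock(McsNode& me) {
        McsNode* succ = me.next.load(std::memory_order_acquire);
        if (succ == nullptr) {
            McsNode* expected = &me;
            // No known successor: try to empty the queue.
            if (tail.compare_exchange_strong(expected, nullptr,
                                             std::memory_order_acq_rel))
                return;
            // A successor is arriving; wait for it to link itself in.
            while ((succ = me.next.load(std::memory_order_acquire)) == nullptr) { }
        }
        succ->locked.store(false, std::memory_order_release);  // hand off
    }
private:
    std::atomic<McsNode*> tail{nullptr};
};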
22
Thread scaling
§  We hypothesized a NUMA-aware “cohort lock” could help further
Dice, David, Virendra J. Marathe, and Nir Shavit. "Lock cohorting: a general technique for
designing NUMA locks." ACM SIGPLAN Notices. Vol. 47. No. 8. ACM, 2012.
23
Cohort locking
class CohortLock {
public:
    CohortLock() : lockers_numa_idx(-1) {
        starvation_counters = new int[MAX_NODES]();
        own_global = new bool[MAX_NODES]();
        local_locks = new TKTLock[MAX_NODES];
    }

    ~CohortLock() {
        delete[] starvation_counters;
        delete[] own_global;
        delete[] local_locks;
    }

    void lock();
    void unlock();

private:
    static const int STARVATION_LIMIT = 100;
    static const int MAX_NODES = 128;

    volatile int*  starvation_counters;  // 1 per node
    volatile bool* own_global;           // 1 per node
    volatile int   lockers_numa_idx;     // NUMA node of the current holder
    TKTLock*       local_locks;          // 1 per node
    PTLLock        global_lock;
};
§  Each NUMA node has a per-node ticket lock (a TKTLock sketch follows below)
§  Other per-node information tracks when to
pass lock to other threads on same node
§  Single global partitioned ticket lock
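The class above uses a TKTLock type that the slides do not show; below is a plain ticket-lock sketch that could stand in for it (an assumption for illustration, not the code used in the experiments; the global PTLLock, a partitioned ticket lock, is likewise not shown here). The q_length() helper is what the cohort lock's unlock() uses to ask whether any other thread on the same node is still queued.

#include <atomic>

class TKTLock {
public:
    void lock() {
        // Take a ticket, then wait until that ticket is being served.
        const unsigned my = next.fetch_add(1, std::memory_order_acq_rel);
        while (serving.load(std::memory_order_acquire) != my) { /* spin */ }
    }
    void unlock() {
        serving.fetch_add(1, std::memory_order_release);
    }
    // Holder plus waiters; == 1 means nobody else on this node is queued.
    unsigned q_length() const {
        return next.load(std::memory_order_relaxed) -
               serving.load(std::memory_order_relaxed);
    }
private:
    std::atomic<unsigned> next{0};     // next ticket to hand out
    std::atomic<unsigned> serving{0};  // ticket currently allowed to enter
};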
24
Cohort locking
void CohortLock::lock() {
    const int numa_idx = determine_numa_idx();
    local_locks[numa_idx].lock();
    if(!own_global[numa_idx]) {
        global_lock.lock();
    }
    starvation_counters[numa_idx]++;
    own_global[numa_idx] = true;
    lockers_numa_idx = numa_idx;
}

void CohortLock::unlock() {
    assert(lockers_numa_idx != -1);
    int numa_idx = lockers_numa_idx;
    lockers_numa_idx = -1;
    if(local_locks[numa_idx].q_length() == 1 ||
       starvation_counters[numa_idx] > STARVATION_LIMIT)
    {
        global_lock.unlock();
        starvation_counters[numa_idx] = 0;
        own_global[numa_idx] = false;
    }
    local_locks[numa_idx].unlock();
}
§  When locking:
–  Grab local lock
–  Once grabbed, grab global lock if not
already owned by this node
§  When unlocking:
–  Is another thread on same node queued?
If so, hand lock to next in queue
–  Otherwise release global & local locks
–  Override hand-off if others are starving
25
Cohort locking
§  Another implementation of cohort locking available in ConcurrencyKit:
http://concurrencykit.org
–  https://github.com/concurrencykit/ck/blob/master/include/ck_cohort.h
26
Thread scaling
§  Chris Wilks added TBB queue locks, JHU/TBB
Cohort locks (2 flavors) to Bowtie 2, Bowtie & HISAT
§  Available in public branches, with all but cohort locks
available in master branch and in recent releases
27
Thread scaling
§  Novel strategy splits input parsing into two “phases”
§  First (“light parsing”) rapidly detects record
boundaries, requiring synchronization but with very
brief critical section
§  Second (“full parsing”) fully parses each record
(pictured, right) with no synchronization
§  Minimizes time spent in the crucial critical section (sketched below, after the example records)
@ABC_123_1
GCTATTATGCTAT
+
JJSYEGGU8233^
@ABC_424_1
GTGATATGCAT
+
SYEG!U8@233
@ABCD_9_1
GCTATTATGCTATAAAC
+
JJSYEGGU8233^32FR
@D_91231_1
GCTATTATGCTAT
+
JJSYEGGU8233^
…
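A minimal sketch of the two-phase scheme on FASTQ records like those above, assuming a shared std::ifstream guarded by a std::mutex (hypothetical helper names, not Bowtie's actual parser). Only the light phase runs inside the critical section.

#include <fstream>
#include <mutex>
#include <string>

struct Read { std::string name, seq, qual; };

// Phase 1 ("light parsing"): hold the lock only long enough to slice the
// next 4-line FASTQ record off the shared input stream.
bool light_parse(std::ifstream& in, std::mutex& in_lock, std::string raw[4]) {
    std::lock_guard<std::mutex> guard(in_lock);   // brief critical section
    for (int i = 0; i < 4; ++i)
        if (!std::getline(in, raw[i])) return false;
    return true;
}

// Phase 2 ("full parsing"): interpret the buffered record; no lock held.
// Assumes well-formed FASTQ (name line starts with '@').
Read full_parse(const std::string raw[4]) {
    Read r;
    r.name = raw[0].substr(1);   // drop the leading '@'
    r.seq  = raw[1];             // raw[2] is the '+' separator line
    r.qual = raw[3];
    return r;
}

// Per-thread loop: only light_parse() touches shared state.
void worker(std::ifstream& in, std::mutex& in_lock) {
    std::string raw[4];
    while (light_parse(in, in_lock, raw)) {
        Read r = full_parse(raw);
        // ... align r against the index ...
        (void)r;
    }
}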
[Plot: normalized running time vs. # threads (unpaired); lock: None (stubbed I/O), TBB mutex, TBB queuing_mutex, TBB spin_mutex, TBB/JHU CohortLock, tinythreads fast_mutex]
28
Thread scaling: Bowtie 2 unpaired
Vertical axis is per-thread
running time; lower is better
Ivy Bridge, 4 NUMA nodes,
120 threads
§  TBB queuing_mutex and TBB/JHU cohort lock perform best
[Plot: normalized running time vs. # threads (unpaired); lock: None (stubbed I/O), TBB mutex, TBB queuing_mutex, TBB spin_mutex, TBB/JHU CohortLock, tinythreads fast_mutex; version: Optimized parsing, Original parsing]
29
Thread scaling: Bowtie 2 unpaired
Vertical axis is per-thread
running time; lower is better
Ivy Bridge, 4 NUMA nodes,
120 threads
§  Two-phase parsing yields substantial thread-scaling boost; close to perfect up
to 120 threads, regardless of mutex
[Plot: normalized running time vs. # threads (unpaired); lock: None (stubbed I/O), TBB mutex, TBB queuing_mutex, TBB spin_mutex, TBB/JHU CohortLock, tinythreads fast_mutex]
30
Thread scaling: Bowtie 2 paired-end
Vertical axis is per-thread
running time; lower is better
Ivy Bridge, 4 NUMA nodes,
120 threads
§  queuing_mutex and cohort lock again perform the best, near ideal
31
Thread scaling: Bowtie 2 paired-end
Vertical axis is per-thread
running time; lower is better
Ivy Bridge, 4 NUMA nodes,
120 threads
§  Two-phase parsing yields substantial thread-scaling boost; close to perfect up
to 120 threads, with mutex having smaller impact
[Plot: normalized running time vs. # threads (paired-end); lock: None (stubbed I/O), TBB mutex, TBB queuing_mutex, TBB spin_mutex, TBB/JHU CohortLock, tinythreads fast_mutex]
32
Thread scaling: Bowtie
Vertical axis is per-thread
running time; lower is better
Ivy Bridge, 4 NUMA nodes,
120 threads
§  As with Bowtie 2, near-ideal scaling with queuing and cohort locks
[Plot: normalized running time vs. # threads; version: Optimized parsing, Original parsing; lock: TBB queuing_mutex, tinythreads fast_mutex]
33
Thread scaling: HISAT unpaired
Vertical axis is per-thread
running time; lower is better
Ivy Bridge, 4 NUMA nodes,
120 threads
§  Huge improvements with queuing_lock and two-phase parsing
34
Thread scaling
§  Further gains possible with batch parsing, where the first phase “lightly” parses several reads at once, reducing the number of critical-section entrances (sketched below)
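Extending the earlier light-parsing sketch, a batched phase 1 might look like the following (again hypothetical, for illustration): the lock is taken once per batch rather than once per read.

#include <fstream>
#include <mutex>
#include <string>
#include <vector>

// Copy the raw lines of up to batch_size 4-line FASTQ records under a single
// lock acquisition; returns the number of complete records buffered.
int light_parse_batch(std::ifstream& in, std::mutex& in_lock,
                      std::vector<std::string>& raw_lines, int batch_size) {
    std::lock_guard<std::mutex> guard(in_lock);   // entered once per batch
    raw_lines.clear();
    int n_records = 0;
    while (n_records < batch_size) {
        std::string rec[4];
        bool ok = true;
        for (int i = 0; i < 4 && ok; ++i)
            ok = static_cast<bool>(std::getline(in, rec[i]));
        if (!ok) break;                           // end of input
        for (int i = 0; i < 4; ++i) raw_lines.push_back(std::move(rec[i]));
        ++n_records;
    }
    return n_records;                             // full parsing then runs lock-free
}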
[Plots: normalized running time vs. # threads for bowtie, bowtie2, and hisat; lock: MP, tinythreads fast_mutex, TBB queuing_mutex; version: Batch parsing, Original parsing, Two-phase parsing]
35
Thread scaling: Bowtie 2 on Broadwell
§  Experiment conducted by John Oneill at Intel
§  TBB + optimized parsing yields speedups of 1.1x - 1.8x on 88 threads on
Broadwell E5-2699 v4 part. TBB/JHU Cohort lock outperforms other mutexes.
36
Thread scaling: Bowtie 2 on Knight’s Landing
§  Experiment conducted by John Oneill at Intel
§  TBB + optimized parsing yields speedups of 2x - 2.7x on 192 threads on KNL
B0 bin3 part. TBB/JHU Cohort lock outperforms other mutexes.
37
Thread scaling: summary
§  Using a queue mutex / cohort lock can yield big improvement over spin /
normal lock
§  Achieved near-ideal scaling up to 120 threads with (a) queue/cohort locks and
(b) cleaner parsing for Bowtie, Bowtie 2.
§  Promising scaling results on KNC & KNL; more to do
§  Cohort locks were best option in Broadwell & KNL experiments
§  Cohort locks seem to put KNL in a better position to outperform Xeon on
genomics workloads
38
Vectorization of Bowtie 2 inner loop
§  Dynamic programming alignment not unique to Bowtie 2
§  Common to many sequence alignment problems
39
Vectorization of Bowtie 2 inner loop
40
Vectorization of Bowtie 2 inner loop
The wider the vector word, the more times the fixup loop iterates
§  Mitigates the benefit of
having wider words
41
Vectorization of Bowtie 2 inner loop
…but in some situations, the fixup loop can be skipped with little or no downside
§  Important future work is to
determine whether selective
suppression of fixup loop
can remove most or all of
the downside of having
wider words
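A heavily simplified sketch of the “lazy-F” fixup pattern behind striped SIMD dynamic programming (the Farrar-style approach that Bowtie 2's vectorized inner loop builds on). Vector lanes are emulated with std::array so the control flow is visible; the real code uses SSE/AVX intrinsics and maintains gap-extension state omitted here. This is illustrative only, not Bowtie 2's implementation. The point to notice: the loop can sweep the column up to once per lane, so its cost grows with vector width, and it exits (or never runs) as soon as the carried-over F value cannot improve any cell.

#include <algorithm>
#include <array>
#include <vector>

constexpr int LANES = 8;                 // elements per vector word
using Vec = std::array<int, LANES>;

// SIMD-style lane shift: move every element up one lane, inserting 'fill'.
static Vec shift_in(const Vec& a, int fill) {
    Vec r;
    r[0] = fill;
    for (int l = 1; l < LANES; ++l) r[l] = a[l - 1];
    return r;
}

// H holds one striped reference column (seg_len stripes of LANES cells); F is
// the vertical-gap score left over from the main vectorized pass.  Assumes
// gap_extend > 0.  Returns how many stripe visits the fixup needed.
static int lazy_f_fixup(std::vector<Vec>& H, Vec F, int gap_open, int gap_extend) {
    const int seg_len = static_cast<int>(H.size());
    if (seg_len == 0) return 0;
    int visits = 0;
    F = shift_in(F, 0);                  // F crosses into the next lane
    int i = 0;
    while (true) {
        // Can F still beat (or out-gap) anything in this stripe?
        bool improves = false;
        for (int l = 0; l < LANES; ++l)
            if (F[l] > std::max(H[i][l] - gap_open, 0)) improves = true;
        if (!improves) break;            // common case: fixup ends (or never runs)
        ++visits;
        for (int l = 0; l < LANES; ++l) {
            H[i][l] = std::max(H[i][l], F[l]);         // apply the correction
            F[l]    = std::max(F[l] - gap_extend, 0);  // extend the gap downward
        }
        if (++i == seg_len) { i = 0; F = shift_in(F, 0); }  // wrap: next lane
    }
    return visits;                       // grows with LANES in the worst case
}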
42
Impact on the field
§  As of Bowtie 1.0.1 release / Bowtie 2 2.2.0 release, Intel improvements are “in
the wild,” assisting life science researchers
43
Impact on the field
§  Added TBB to Bowtie 1.1.2, Bowtie 2 2.2.6. Also added to public branch of
HISAT. Plan to make TBB the default threading library in upcoming release.
44
Impact on the field
§  Daehwan Kim of JHU IPCC team parallelized the index building process in
Bowtie 2; TBB version of parallel index building available as of 2.2.7
45
Impact on the field
§  With changes fully reflected in
Bowtie 1.2.0 and Bowtie 2 2.3.0,
JHU team drafting manuscript
describing improvements and
lessons learned
46
Future directions
§  Where and why does the cohort lock help?
§  Does cohort lock have a future in TBB?
§  Can selective suppression of Bowtie 2 fixup loop
unlock power of wider vector words?
§  Can all of the above yield a big Knight’s Landing
throughput win?
47
Other resources
§  http://www.langmead-lab.org
§  https://www.coursera.org/learn/dna-sequencing
–  YouTube videos for above: http://bit.ly/ADS1_videos
48
Thank you
§  John Oneill, Ram Ramanujam, Kevin O’leary, and many other great Intel
engineers we spoke to and worked with
§  Lisa Smith, Brian Napier and others in IPCC program
§  Langmead lab team: Chris Wilks, Valentin Antonescu
§  Salzberg lab team: Steven Salzberg, Daehwan Kim
§  Intel
Thank you for your time
Ben Langmead
langmea@cs.jhu.edu
www.intel.com/hpcdevcon
