Parallel computing
in bioinformatics
Dr Torsten Seemann
Ideal world
● A single computer with
o one really fast processor
o huge amount of really fast memory
● Compromise #1: a single computer with
o lots of processors
o huge memory fast enough for all processors
● Compromise #2: a bunch of computers with
o lots of fast processors on each node
o lots of memory on each node
o really fast, low latency inter-node communication
The real world
● None of these exist :-(
● Computer nodes
o Good: CPU & RAM on the increase
o Bad: CPU is competing for RAM
● Node:Node communication
o Good: getting faster
o Bad: latency gets worse with more nodes
Types of parallelism
● Cluster
o distribute workload across networked computers
● SMP
o symmetric multiprocessing
o use multiple cores on a single computer
o (we’ll ignore NUMA)
● SIMD
o single instruction, multiple data
o same machine code instruction on vector of values
o (we’ll ignore MIMD, GPU)
Clusters
● Can be ad hoc
o bunch of PCs over Ethernet (Beowulf)
● Cluster specific
o high density, fast interconnect (Blade)
● Highly specialised
o high density, low power, very fast interconnect, low
latency, many switches (eg. IBM BlueGene)
Using clusters
Break task into subtasks:
● Independent tasks
○ “pleasantly parallel” is a good situation!
○ Submit these to cluster queue
o Combine results
● Dependent tasks
o Need to communicate during run
o Various ways to do this (more later)
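The independent-task pattern above (split, run, combine) can be sketched in plain shell. This is only an illustration under stated assumptions: the toy input, the chunk size and the `wc -c` "analysis" are placeholders, and background jobs stand in for submissions to a real cluster queue.

```shell
# Scatter/gather sketch: split the input, process chunks independently, combine.
work=$(mktemp -d)

# Toy input: 10 "reads", one per line (placeholder data).
i=1
while [ "$i" -le 10 ]; do
    echo "ACGTACGT$i"
    i=$((i + 1))
done > "$work/reads.txt"

# Scatter: 3 lines per chunk -> chunk.aa, chunk.ab, ...
split -l 3 "$work/reads.txt" "$work/chunk."

# Run: each chunk is an independent subtask. On a real cluster each of these
# would be submitted to the queue instead of being run in the background.
for chunk in "$work"/chunk.*; do
    wc -c < "$chunk" > "$chunk.count" &
done
wait

# Gather: combine the per-chunk results into one answer.
total=$(cat "$work"/chunk.*.count | awk '{ s += $1 } END { print s }')
echo "total characters: $total"
```

Because the subtasks never talk to each other, the only coordination needed is the final `wait` before combining.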
SMP
SMP: symmetric multiprocessing
Use multiple cores on one node:
● Simple case
○ run multiple subtasks, one per core
● Multi-threading
○ use tools that support multiple cores
■ BWA, bowtie, samtools 0.1.19+
○ use languages that support native threading
■ Java
■ C, C++, Perl, Haskell - with standard libraries
■ Python has issues here (the CPython GIL limits CPU-bound threads)
Using SMP
● POSIX threads
o standard “C” Unix interface
o a library of functions
● OpenMP
o standard API for C, C++ and Fortran
o functions and #pragmas to help compiler parallelize
● Unix Shell
o use job control and ‘&’ and ‘wait’
o Makefiles, GNU parallel, pipelines (more later)
● Use tools that do this natively for you
SMP communication
● Sometimes threads need to talk
o Just like cluster nodes need to talk
● IPC
o Inter-Process Communication
● Methods
o files, time-stamped “touch files”
o pipes, sockets, message passing
o shared memory
o semaphores
o signals
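As a small illustration of one of the methods listed above, here is a sketch of two processes talking through a named pipe (FIFO). The file name `channel` and the three-line message are made up for the example.

```shell
# IPC sketch: a producer and a consumer connected by a named pipe.
work=$(mktemp -d)
mkfifo "$work/channel"

# Producer: runs in the background and writes three lines into the pipe.
# Opening the FIFO blocks until a reader appears.
printf 'chunk1\nchunk2\nchunk3\n' > "$work/channel" &

# Consumer: reads from the pipe until the producer closes it.
received=$(wc -l < "$work/channel")
wait

echo "consumer received $received lines"
rm -r "$work"
```

No data touches the disk: the FIFO is only a rendezvous point, and the kernel passes the bytes between the two processes.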
SIMD
Machine code 101
● CPUs run “machine code” instructions:
○ load  R0, [years]   # put var in reg
  mul   R0, 365       # mult by 365
  add   R0, 1         # add 1
  store [days], R0    # put reg in mem
● Each instruction does one atomic operation
○ to change one piece of data
■ memory location (RAM variable - slow)
■ register (CPU variable - fast)
Vector operations
● Example
○ vector dot product: $x \cdot y = \sum_{i=1}^{|x|} x_i \, y_i$
● Pseudo-code
○ var x, y : integer[8]
  var sum : integer
  sum := 0;
  for i in 0..7:
      sum := sum + x[i] * y[i]
● Operations
○ 1 + 8 * 3 = 25 ops
Vector operations
● Vector registers and instructions
○ assume 8-element operations (actually common!)
● SIMD
○ load V0, [x]   # put x[] in vec register
  load V1, [y]   # same for y[]
  mult V0, V1    # vector multiply!
  vsum R7, V0    # vec sum into scalar reg
● Operations
○ 1 + 1 + 1 + 1 = 4 ops
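To make the two operation counts concrete, here is a rough sketch that runs the scalar loop and tallies operations the way the slides do. The vector values are invented, and the split into "1 initialisation plus 3 operations per iteration" is an assumption about what the slide's count covers. Shell cannot issue SIMD instructions, so the vector count is simply the four instructions listed above.

```shell
# Scalar dot product of two 8-element vectors with an operation tally.
x="1 2 3 4 5 6 7 8"   # made-up example data
y="8 7 6 5 4 3 2 1"

sum=0
scalar_ops=1          # the "sum := 0" initialisation
i=1
while [ "$i" -le 8 ]; do
    xi=$(echo "$x" | cut -d ' ' -f "$i")
    yi=$(echo "$y" | cut -d ' ' -f "$i")
    sum=$((sum + xi * yi))
    scalar_ops=$((scalar_ops + 3))   # assumed: 3 operations per iteration
    i=$((i + 1))
done

vector_ops=4          # load, load, mult, vsum
echo "dot product = $sum"
echo "scalar ops  = $scalar_ops, vector ops = $vector_ops"
```

With 8-element registers the instruction count drops from 25 to 4, which is where the appeal of SIMD comes from.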
SIMD Instruction Sets
● Specialised since 1970s
○ MASPAR
○ Connection Machine
○ Cray super-scalar
○ DEC Alpha MVI
● Consumer grade
○ Intel MMX / AMD 3DNow! (integer) [x86]
○ Intel SSE, SSE2, SSE3, SSE4.x (floating point) [x86]
○ IBM Altivec (both) [BlueGene,POWER]
● GPUs also, but they do MIMD too.
Using SIMD
● Not accessible from scripting languages
o they are too many layers away from machine code
● Some libraries exploit it
o NumPy (uses some SSE in CoreFunc)
o GSL - GNU Scientific Library
o BLAS - Linear algebra
● Find the tools that use it
o HMMER (profile:sequence alignment)
o FASTA 35+, SWIFT (full local/global/semi alignment)
o BWA, Bowtie (short read alignment)
Automatic SIMD vectorization
● Some compilers can recognise patterns that
can be converted into SIMD instructions
○ Simple loops
○ Array operations
○ Data copying
● Re-compile your C/C++ code
○ GCC (GNU Compiler Collection)
■ gcc -march=native -O3
○ ICC (Intel C++ Compiler)
■ vectorization is automatic
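One way to check whether GCC actually vectorized a loop is to ask it for a report. The sketch below is an assumption-laden example: the file name `vec_demo.c` and the function `multiply` are invented, and `-fopt-info-vec-optimized` needs a reasonably recent GCC (other compilers may reject it, which the script tolerates).

```shell
# Write a tiny C file with a simple loop, then ask GCC to report vectorization.
work=$(mktemp -d)
cat > "$work/vec_demo.c" <<'EOF'
/* A simple element-wise loop of the kind compilers can often vectorize. */
void multiply(float *restrict a, const float *restrict b, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = a[i] * b[i];
}
EOF

if command -v gcc >/dev/null 2>&1; then
    # -O3 turns on the vectorizer, -march=native allows this CPU's SIMD set,
    # -fopt-info-vec-optimized prints a note for each loop that was vectorized.
    gcc -std=c99 -O3 -march=native -fopt-info-vec-optimized \
        -c "$work/vec_demo.c" -o "$work/vec_demo.o" \
        || echo "compile failed: this compiler may not accept these flags"
else
    echo "gcc not found; skipping the compile step"
fi
```

If the report stays silent for a loop you expected to be vectorized, pointer aliasing or a loop-carried dependency is a common cause.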
Using SMP
Spawn multiple jobs
# run 23 alignments, 1 core per chromosome
for CHR in $(seq 1 1 23); do
  bwa mem "$CHR.fasta" reads.fq.gz \
    1> "$CHR.sam" 2> "$CHR.err" &
done
# wait until all background jobs finish
wait
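The loop above starts all 23 jobs at once, which oversubscribes a machine with fewer cores. A possible variant, sketched here with `xargs -P` (available in GNU and BSD xargs), caps the number of simultaneous jobs. The `echo` is a stand-in for the real aligner call, and the limit of 4 is arbitrary.

```shell
# Run 23 subtasks, but never more than 4 at a time.
work=$(mktemp -d)

# Each input line becomes one job; -P 4 keeps at most 4 running concurrently.
# In real use the sh -c body would be something like:
#   bwa mem "$1.fasta" reads.fq.gz > "$1.sam" 2> "$1.err"
seq 1 23 | xargs -P 4 -I{} sh -c 'echo "aligned chr$1" > "$2/$1.out"' _ {} "$work"

finished=$(ls "$work" | wc -l)
echo "finished jobs: $finished"
```

xargs returns only after every job has exited, so no explicit `wait` is needed.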
Use a Makefile
% ls
1.fasta 2.fasta 3.fasta
% vi Makefile
all: 1.sam 2.sam 3.sam
%.sam: %.fasta reads.fq.gz
	bwa mem $< reads.fq.gz > $@
% make -j 8 # use 8 cores
% ls
1.fasta 2.fasta 3.fasta 1.sam 2.sam 3.sam
GNU Parallel
% ls
1.fasta 2.fasta 3.fasta
% parallel -j 8 \
"bwa mem {} reads.fq.gz > {.}.sam" \
::: *.fasta
% ls
1.fasta 2.fasta 3.fasta 1.sam 2.sam 3.sam
{} replaced by each *.fasta in turn
{.} is {} but with file extension removed
Underused multi-threaded tools
● pigz
○ parallel gzip
○ if you have fast disks, scales to 64 cores easily
○ compression parallelises better than decompression
○ command line option: --processes=16 or -p 16
● pbzip2
○ parallel bzip2
● sort
○ yes, good ol’ Unix sort!
○ command line option: --parallel=16
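A quick way to try these options is sketched below. It assumes GNU sort (for `--parallel`) and treats pigz as optional; the generated numbers are throwaway test data and the thread count of 4 is arbitrary.

```shell
# Sort with several threads, then compress the result in parallel if pigz exists.
work=$(mktemp -d)

# 20000 scrambled integers as throwaway input.
seq 1 20000 | awk '{ print ($1 * 7919) % 10007 }' > "$work/numbers.txt"

# GNU sort: use up to 4 threads for the numeric sort.
sort --parallel=4 -n "$work/numbers.txt" > "$work/sorted.txt"

# pigz: 4 compression threads, -k keeps the uncompressed file.
if command -v pigz >/dev/null 2>&1; then
    pigz -p 4 -k "$work/sorted.txt"
else
    echo "pigz not installed; skipping compression"
fi
```

On inputs this small the extra threads make no visible difference; the benefit shows up on multi-gigabyte files.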
Dedicated pipeline system
● Ruffus / Rubra
● BPIPE
● Nesoni
.... and so many more
.......... and so many more still coming!
Implicit Unix SMP
Pipes
● When you pipe two commands together
○ two separate processes are started: A and B
○ a “pipe” connects A:stdout to B:stdin (A | B)
● Example
○ frequency distribution of initial 4-mers in English
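cat /usr/dict/words |   # already sorted (/usr/share/dict/words on many Linux systems)
  cut -c 1-4 |          # first 4 characters
  tr 'A-Z' 'a-z' |      # canonicalize to lc
  sort |                # regroup: lowercasing can break the sorted order
  uniq -c |             # count dupes
  sort -n -r |          # most freq first
  head -10              # top 10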
Pipes (result)
428 over
410 inte
300 comp
272 unde
262 cons
261 tran
248 cont
211 disc
197 comm
171 fore
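The reason a pipeline counts as "implicit SMP" is that all its stages run at the same time as separate processes. A small timing sketch shows this; the two `sleep 2` calls are stand-ins for real work in each stage.

```shell
# Two pipeline stages that each take ~2 s. If they ran one after the other
# the pipeline would need ~4 s; run concurrently it needs ~2 s.
start=$(date +%s)
{ sleep 2; echo "stage A done"; } | { sleep 2; cat; } > /dev/null
end=$(date +%s)

elapsed=$((end - start))
echo "pipeline took about ${elapsed}s (4s if the stages ran sequentially)"
```

On a multi-core machine the kernel is free to schedule each stage on its own core, so a six-stage pipeline like the one above can keep several cores busy without any explicit threading.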
Sub-shells
● Use case:
○ software alignerX only accepts .fastq files
○ you have compressed .fastq.gz files
○ your disk is slow and has no space left
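● Sub-shells (bash process substitution) to the rescue!
# before: needs uncompressed files on disk
alignerX ref.fa R1.fq R2.fq
# after: decompress on the fly, nothing written to disk
alignerX ref.fa <(zcat R1.fq.gz) \
    <(zcat R2.fq.gz)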
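A runnable toy version of this trick is sketched below. Since `alignerX` is hypothetical, `cat` plays the role of the tool that insists on file names, and the two tiny FASTQ records are invented. Process substitution is a bash (also zsh/ksh) feature rather than POSIX sh, so the sketch calls bash explicitly.

```shell
# Feed compressed files to a tool that only accepts file names.
work=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n' | gzip -c > "$work/R1.fq.gz"
printf '@r1\nTTGG\n+\nIIII\n' | gzip -c > "$work/R2.fq.gz"

# <(cmd) expands to a file name (typically /dev/fd/N) that yields cmd's output.
# "cat" stands in for the aligner here.
lines=$(bash -c 'cat <(gzip -dc "$1") <(gzip -dc "$2") | wc -l' _ \
        "$work/R1.fq.gz" "$work/R2.fq.gz")

echo "tool saw $lines lines; no uncompressed copy was written"
```

Each `<( )` runs as its own process, so decompression of both files overlaps with the tool that reads them.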
Sub shells + Pipes
● Use case:
○ software alignerX only accepts .fasta files
○ you have compressed .fastq.gz files
● Sub-shells can be pipes too!
alignerX ref.fa \
  <(zcat R1.fq.gz | paste - - - - | cut -f 1,2 \
    | sed 's/^@/>/' | tr "\t" "\n") \
  <(zcat R2.fq.gz | paste - - - - | cut -f 1,2 \
    | sed 's/^@/>/' | tr "\t" "\n")
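The FASTQ-to-FASTA pipe used above is easier to trust after seeing it on a tiny input. The two records below are invented; the second one has a quality string that starts with `@`, which is why the pipe groups lines four at a time instead of searching for `@`.

```shell
# Convert FASTQ to FASTA: join each 4-line record with tabs, keep the header
# and sequence fields, turn the leading @ into >, then split back into lines.
fasta=$(printf '@read1 extra\nACGT\n+\nIIII\n@read2\nGGCC\n+\n@III\n' \
    | paste - - - - \
    | cut -f 1,2 \
    | sed 's/^@/>/' \
    | tr '\t' '\n')

echo "$fasta"
```

Only the header at the start of each joined record is rewritten, so the `@` in the quality line of the second record never gets a chance to be mistaken for a header.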
Nested sub shells
HC SVNT DRACONES
(here be dragons)
Putting it all together
Making BAMs
● Align FASTQ to reference
o bwa mem ref R1.fq.gz R2.fq.gz > SAM
● Convert to BAM
o samtools view SAM > BAM
● Sort BAM
o samtools sort BAM > SORTBAM
● Remove dupes
o samtools rmdup SORTBAM > DEDUPBAM
Making BAMs
Look mum! No intermediate files! Fewer idle CPUs!
% bwa mem -t 16 ref.fa R1.fq.gz R2.fq.gz \
| samtools view -@ 16 -S -b -u -T ref.fa - \
| samtools sort -@ 16 -m 1G -o - \
| samtools rmdup - out.bam
-t 16   16 threads for bwa
-@ 16   16 threads for samtools 0.1.19+
-m 1G   1 GB RAM per thread for RAM sorting
-u      pipe an uncompressed BAM
-o      use stdout instead of writing to a file
Conclusions
● The “cluster” level
○ we are pretty good at that now
● The “SIMD” level
○ too low level, depend on others to exploit
○ thankfully many of our key tools already use it
● The “SMP” level
○ our pipelines still have single-threaded bottlenecks
○ always check if your tool has a --threads option
○ exploit pipes and sub-shells wherever possible
○ and use GNU Parallel - it’s awesome (and Perl)
Parallel computing in bioinformatics   t.seemann - balti bioinformatics - wed 10 sep 2014


Editor's Notes

  • #17: Dave used MASPAR at Monash during his PhD for DPA alignment!!!