Parallel computing in bioinformatics t.seemann - balti bioinformatics - wed 10 sep 2014

Parallel computing
in bioinformatics
Dr Torsten Seemann

Ideal world
● A single computer with
o one really fast processor
o huge amount of really fast memory
● Compromise #1: a single computer with
o lots of processors
o huge memory fast enough for all processors
● Compromise #2: a bunch of computers with
o lots of fast processors on each node
o lots of memory on each node
o really fast, low latency inter-node communication

The real world
● None of these exist :-(
● Computer nodes
o Good: CPU & RAM on the increase
o Bad: CPU is competing for RAM
● Node:Node communication
o Good: getting faster
o Bad: latency gets worse with more nodes

Types of parallelism
● Cluster
o distribute workload across networked computers
● SMP
o symmetric multiprocessing
o use multiple cores on a single computer
o (we’ll ignore NUMA)
● SIMD
o single instruction, multiple data
o same machine code instruction on vector of values
o (we’ll ignore MIMD, GPU)

Clusters
● Can be ad hoc
o bunch of PCs over Ethernet (Beowulf)
● Cluster specific
o high density, fast interconnect (Blade)
● Highly specialised
o high density, low power, very fast interconnect, low latency, many switches (e.g. IBM BlueGene)

Using clusters
Break task into subtasks:
● Independent tasks
o “pleasantly parallel” is a good situation!
o Submit these to cluster queue
o Combine results
● Dependent tasks
o Need to communicate during run
o Various ways to do this (more later)
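
The independent-task pattern above (split, run, combine) can be sketched on a single machine in plain shell. This is only an illustration under stated assumptions: the input is a generated list of numbers, the per-chunk "work" is a hypothetical sum computed with awk, and background jobs stand in for submissions to a real cluster queue.

```shell
# Scatter/gather sketch: split the input, process chunks independently, combine.
# Assumption: background jobs stand in for submissions to a cluster queue.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical input: the integers 1..1000, one per line
seq 1 1000 > input.txt

# Scatter: 4 chunks of 250 lines each (chunk.aa, chunk.ab, ...)
split -l 250 input.txt chunk.

# Run: one independent subtask per chunk, each writing its own result file
for f in chunk.??; do
  awk '{ s += $1 } END { print s }' "$f" > "$f.sum" &
done
wait   # block until every subtask has finished

# Gather: combine the partial results into one answer
total=$(cat chunk.??.sum | awk '{ s += $1 } END { print s }')
echo "total = $total"
```

Each subtask writes to its own file, so no subtask needs to talk to any other while it runs; that is what makes the job "pleasantly parallel".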

SMP: symmetric multiprocessing
Use multiple cores on one node:
● Simple case
○ run multiple subtasks, one per core
● Multi-threading
○ use tools that support multiple cores
■ BWA, Bowtie, samtools 0.1.19+
○ use languages that support native threading
■ Java
■ C, C++, Perl, Haskell - with standard libraries
■ Python has issues here

Using SMP
● POSIX threads
o standard “C” Unix interface
o a library of functions
● OpenMP
o compiler-supported standard for C, C++ and Fortran
o functions and #pragmas to help compiler parallelize
● Unix Shell
o use job control and ‘&’ and ‘wait’
o Makefiles, GNU parallel, pipelines (more later)
● Use tools that do this natively for you

SMP communication
● Sometimes threads need to talk
o Just like cluster nodes need to talk
● IPC
o Inter-Process Communication
● Methods
o files, time-stamped “touch files”
o pipes, sockets, message passing
o shared memory
o semaphores
o signals
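
One of the methods listed above, a pipe, can be tried directly from the shell using a named pipe (FIFO). The sketch below is a minimal illustration, not a recommended design: the file name `channel` and the three messages are made up for the example.

```shell
# IPC sketch: a producer and a consumer talking through a named pipe (FIFO).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

mkfifo channel   # looks like a file, but the data never touches the disk

# Producer: writes three messages into the pipe, in the background
( for i in 1 2 3; do echo "message $i"; done > channel ) &

# Consumer: reads from the pipe until the producer closes its end
received=$(wc -l < channel | tr -d ' ')
wait             # reap the producer
echo "consumer received $received lines"

rm channel
```

The producer blocks until a reader opens the pipe, so the two processes synchronise without any extra signalling.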

Machine code 101
● CPUs run “machine code” instructions:
○ load  R0, [years]   # put var in reg
  mul   R0, 365       # mult by 365
  add   R0, 1         # add 1
  store [days], R0    # put reg in mem
● Each instruction does one atomic operation
○ to change one piece of data
■ memory location (RAM variable - slow)
■ register (CPU variable - fast)

Vector operations
● Example
○ vector dot product: $x \cdot y = \sum_{i=1}^{|x|} x_i \times y_i$
● Pseudo-code
○ var x, y : integer[8]
  var sum : integer
  sum := 0;
  for i in 0..7:
      sum := sum + x[i] * y[i]
● Operations
○ 1 + 8 * 3 = 25 ops

Vector operations
● Vector registers and instructions
○ assume 8-element operations (actually common!)
● SIMD
○ load V0, [x]    # put x[] in vec register
  load V1, [y]    # same for y[]
  mult V0, V1     # vector multiply!
  vsum R7, V0     # vec sum into scalar reg
● Operations
○ 1 + 1 + 1 + 1 = 4 ops
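
Putting the scalar and SIMD operation counts side by side gives a rough feel for the gain. The arithmetic below is a back-of-envelope sketch that assumes every instruction costs the same and that one vector register holds all 8 elements; real hardware differs on both points.

```latex
\begin{align*}
\text{scalar ops} &= 1 + 8 \times 3 = 25 \\
\text{SIMD ops}   &= 1 + 1 + 1 + 1 = 4 \\
\text{speed-up}   &\approx \frac{25}{4} = 6.25\times
\end{align*}
```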

SIMD Instruction Sets
● Specialised since 1970s
○ MASPAR
○ Connection Machine
○ Cray vector supercomputers
○ DEC Alpha MVI
● Consumer grade
○ Intel MMX (integer), AMD 3DNow! (floating point) [x86]
○ Intel SSE (floating point), SSE2, SSE3, SSE4.x (integer and floating point) [x86]
○ IBM AltiVec (both) [PowerPC, POWER]
● GPUs also, but they do MIMD too.

Using SIMD
● Not accessible from scripting languages
o they are too many layers away from machine code
● Some libraries exploit it
o NumPy (uses some SSE in CoreFunc)
o GSL - GNU Scientific Library
o BLAS - Linear algebra
● Find the tools that use it
o HMMER (profile:sequence alignment)
o FASTA 35+, SWIFT (full local/global/semi alignment)
o BWA, Bowtie (short read alignment)

Automatic SIMD vectorization
● Some compilers can recognise patterns that
can be converted into SIMD instructions
○ Simple loops
○ Array operations
○ Data copying
● Re-compile your C/C++ code
○ GCC (GNU Compiler Collection)
■ gcc -march=native -O3
○ ICC (Intel C++ Compiler)
■ vectorization is automatic

Spawn multiple jobs
# run 23 alignments in parallel, one background job (core) per chromosome
for CHR in $(seq 1 23); do
  bwa mem "$CHR.fasta" reads.fq.gz > "$CHR.sam" 2> "$CHR.err" &
done
# wait until all background jobs finish
wait
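
The loop above starts all 23 jobs at once, which may oversubscribe a machine with fewer cores. One common way to cap the number of simultaneous jobs is `xargs -P`. The sketch below assumes an xargs that supports `-P` (the GNU and BSD versions do, but it is not part of POSIX) and uses a hypothetical stand-in command in place of bwa so that it runs anywhere.

```shell
# Bounded parallelism sketch: at most 4 jobs run at any one time.
# Assumption: xargs supports -P (GNU/BSD extension, not POSIX).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Stand-in for "bwa mem $CHR.fasta reads.fq.gz > $CHR.sam":
# each job just writes a small file named after its chromosome number.
seq 1 23 |
  xargs -P 4 -n 1 sh -c 'echo "aligned chromosome $1" > "$1.out"' sh

# xargs returns only after every job has finished, so no explicit wait is needed
njobs=$(ls *.out | wc -l | tr -d ' ')
echo "$njobs jobs completed"
```

Passing the chromosome number as `$1` rather than pasting it into the command string keeps the quoting safe if the inputs are ever file names instead of numbers.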

Use a Makefile
% ls
1.fasta 2.fasta 3.fasta
% vi Makefile
all: 1.sam 2.sam 3.sam
# pattern rule: build any N.sam from N.fasta (the recipe line must start with a TAB)
%.sam: %.fasta reads.fq.gz
	bwa mem $< reads.fq.gz > $@
% make -j 8 # use 8 cores
% ls
1.fasta 2.fasta 3.fasta 1.sam 2.sam 3.sam

GNU Parallel
% ls
1.fasta 2.fasta 3.fasta
% parallel -j 8 "bwa mem {} reads.fq.gz > {.}.sam" ::: *.fasta
% ls
1.fasta 2.fasta 3.fasta 1.sam 2.sam 3.sam
{} replaced by each *.fasta in turn
{.} is {} but with file extension removed

Underused multi-threaded tools
● pigz
○ parallel gzip
○ if you have fast disks, scales to 64 cores easily
○ compression scales better than decompression
○ command line option: --processes=16 or -p 16
● pbzip2
○ parallel bzip2
● sort
○ yes, good ol’ Unix sort!
○ command line option: --parallel=16
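
A minimal sketch of using these options together, assuming GNU coreutils sort (the `--parallel` option is a GNU extension) and falling back to plain gzip when pigz is not installed. The file names and the thread count are placeholders chosen for the example.

```shell
# Sketch: multi-threaded sort followed by multi-threaded compression.
# Assumptions: GNU sort (for --parallel); pigz may or may not be installed.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
THREADS=4

# Hypothetical input: 100000 integers in descending order
seq 100000 -1 1 > numbers.txt

# Numeric sort using up to $THREADS threads
sort -n --parallel="$THREADS" numbers.txt > sorted.txt

# Compress with pigz if available, otherwise with single-threaded gzip
if command -v pigz > /dev/null 2>&1; then
  pigz -p "$THREADS" sorted.txt
else
  gzip sorted.txt
fi

# Either tool leaves a standard .gz file behind
first=$(gzip -dc sorted.txt.gz | head -n 1)
echo "first line after sorting: $first"
```

Because pigz writes ordinary gzip files, downstream tools that read `.gz` input need no changes.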

Dedicated pipeline system
● Ruffus / Rubra
● BPIPE
● Nesoni
.... and so many more
.......... and so many more still coming!

Pipes
● When you pipe two commands together
○ two separate processes are started: A and B
○ a “pipe” connects A:stdout to B:stdin (A | B)
● Example
○ frequency distribution of initial 4-mers in English
cat /usr/dict/words |   # already sorted
  cut -c 1-4 |          # first 4 characters
  tr 'A-Z' 'a-z' |      # canonicalize to lower case
  uniq -c |             # count duplicates
  sort -n -r |          # most frequent first
  head -10              # top 10

Pipes (result)
428 over
410 inte
300 comp
272 unde
262 cons
261 tran
248 cont
211 disc
197 comm
171 fore
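
Because each stage of a pipeline is a separate process, the stages run at the same time rather than one after another. The sketch below illustrates this with three hypothetical stages that each pretend to work for one second (`sleep 1`): run back to back they would need about three seconds, whereas as a pipeline they should finish in roughly one.

```shell
# Sketch: the stages of a pipeline run concurrently.
# Stage A produces a string, stage B upper-cases it, stage C passes it through;
# each stage sleeps for 1 second first to simulate work.
start=$(date +%s)

result=$(
  { sleep 1; echo "acgt"; } |
  { sleep 1; tr 'a-z' 'A-Z'; } |
  { sleep 1; cat; }
)

end=$(date +%s)
elapsed=$((end - start))   # whole seconds; expected to be well under 3
echo "result=$result elapsed=${elapsed}s"
```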

Sub-shells
● Use case:
○ software alignerX only accepts .fastq files
○ you have compressed .fastq.gz files
○ your disk is slow and has no space left
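● Sub-shells to the rescue! (bash calls the <(...) syntax “process substitution”)
○ what alignerX expects:
alignerX ref.fa R1.fq R2.fq
○ what you can run instead:
alignerX ref.fa <(zcat R1.fq.gz) <(zcat R2.fq.gz)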
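
The `<(...)` construct can be tried without any aligner installed. The sketch below assumes bash (the syntax is not available in plain POSIX sh) and uses `paste` as a stand-in for the hypothetical alignerX; the file names and contents are made up for the example.

```shell
# Sketch (bash): feed compressed files to a tool that only reads plain files.
# Assumption: "paste" stands in for the hypothetical alignerX.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two small compressed inputs standing in for R1.fq.gz and R2.fq.gz
printf 'read1/1\nread2/1\n' | gzip > R1.txt.gz
printf 'read1/2\nread2/2\n' | gzip > R2.txt.gz

# <(cmd) expands to a file name (such as /dev/fd/63) connected to cmd's output,
# so nothing is decompressed to disk
paired=$(paste <(gzip -dc R1.txt.gz) <(gzip -dc R2.txt.gz))
echo "$paired"
```

Both decompressors run as separate processes alongside the consuming tool, so this also spreads the work over several cores.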

Nested sub-shells
HC SVNT DRACONES
(here be dragons)

Making BAMs
● Align FASTQ to reference
o bwa mem ref R1.fq.gz R2.fq.gz > SAM
● Convert to BAM
o samtools view SAM > BAM
● Sort BAM
o samtools sort BAM > SORTBAM
● Remove dupes
o samtools rmdup SORTBAM > DEDUPBAM

Making BAMs
Look mum! No intermediate files! Fewer idle CPUs!
% bwa mem -t 16 ref.fa R1.fq.gz R2.fq.gz \
  | samtools view -@ 16 -S -b -u -T ref.fa - \
  | samtools sort -@ 16 -m 1G -o - \
  | samtools rmdup - out.bam
-t 16   16 threads for bwa
-@ 16   16 threads for samtools 0.1.19+
-m 1G   1 GB RAM per thread for RAM sorting
-u      pipe an uncompressed BAM
-o      use stdout instead of writing to a file

Conclusions
● The “cluster” level
○ we are pretty good at that now
● The “SIMD” level
○ too low level, depend on others to exploit
○ thankfully many of our key tools already use it
● The “SMP” level
○ our pipelines still have single-threaded bottlenecks
○ always check if your tool has --threads option
○ exploit pipes and sub-shells wherever possible
○ and use GNU Parallel - it’s awesome (and Perl)
