Parallel DNA Sequence Alignment

SEQUENCE ALIGNMENT SPEED-UP
A PARALLEL APPROACH
University of Salerno
Parallel and Concurrent Computing Course
19 February 2013
Giuliana Carullo
Luca Pepe
Daniele Valenza

• Introduction
• Problem definition
• Simple Search
• Approximate Search
• Parallelization
• Cross-Chunk Matching
• Bigger chunk
• On Demand
• Side to Side Sliding Query
• Approximate Search
• Test plan

Sequence alignment is the process of comparing two or more DNA or RNA sequences.
It is performed to find similar or identical regions in the provided sequences, or to check whether a sequence is a known one stored in a database.

DNA STRUCTURE
DNA bases: A C G T
Bonds (base pairs): (A, T) (C, G)

DNA ALIGNMENT
Affinity measures:
• MATCH
• MISMATCH
• GAP
MATCHING TYPE:
• SIMPLE
• REVERSE AND COMPLEMENT
Q: ATGATTACC DNA String
R(Q): CCATTAGTA Reverse
C (R(Q)): GGTAATCAT Complement
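The reverse-and-complement transformation above can be sketched in a few lines. This is a minimal illustration rather than the authors' code; the function names are hypothetical, and only the base pairs (A, T) and (C, G) come from the slides.

```python
# Complement of each base, following the pairs (A, T) and (C, G).
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse(seq: str) -> str:
    """Return the sequence read backwards: R(Q)."""
    return seq[::-1]


def complement(seq: str) -> str:
    """Replace every base with its paired base."""
    return "".join(COMPLEMENT[base] for base in seq)


def reverse_complement(seq: str) -> str:
    """Return C(R(Q)), the reverse-and-complement form of the query."""
    return complement(reverse(seq))


q = "ATGATTACC"
print(reverse(q))             # CCATTAGTA
print(reverse_complement(q))  # GGTAATCAT
```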

• Global Alignment:
• Local Alignment:
DNA ALIGNMENT TYPES

Searching for all the perfect matches of a small query string in a larger DNA string.
INPUT: DNA string, query string
OUTPUT: Number of occurrences, starting positions of the occurrences
SIMPLE SEARCH
Notation:
• Number of workers: $n$
• Query length: $l_q$
• DNA length: $l_d$
• Relative position (offset within a chunk): $off_i$
• Absolute position: $s_i$
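As a point of reference, the sequential version of simple search can be sketched as below. This is an illustrative baseline, not the presented implementation; `simple_search` is a hypothetical name and positions are assumed to be 0-based.

```python
def simple_search(dna: str, query: str) -> list:
    """Return the starting positions of every exact occurrence of query in dna."""
    l_d, l_q = len(dna), len(query)
    positions = []
    for start in range(l_d - l_q + 1):
        # Compare character by character and stop at the first mismatch.
        for k in range(l_q):
            if dna[start + k] != query[k]:
                break
        else:
            positions.append(start)  # no mismatch found: perfect match
    return positions


occurrences = simple_search("ACGTACGTAC", "GTAC")
print(len(occurrences), occurrences)  # 2 [2, 6]
```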

Searching for the «best» n alignments of a small query string in a larger DNA string.
INPUT: DNA string, query string
OUTPUT: Starting positions of the best alignments
APPROXIMATE SEARCH

APPROXIMATE SEARCH – SIMILARITY EVALUATION
Character similarity function:
$$ s_i = \begin{cases} x, & \text{match} \\ y, & \text{mismatch} \\ z, & \text{gap} \end{cases} \qquad x > 0;\ y, z \le 0 $$
(In this work gaps are not considered)
Objective function to maximize:
$$ S = \sum_{i=1}^{l_q} s_i $$
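The objective function can be illustrated with a small sequential sketch. The score values `MATCH = 1` and `MISMATCH = -1` are assumptions chosen only to satisfy x > 0 and y ≤ 0 (the slides do not give concrete values), gaps are ignored as stated above, and the function names are hypothetical.

```python
MATCH, MISMATCH = 1, -1  # assumed values for x > 0 and y <= 0


def similarity(dna: str, query: str, start: int) -> int:
    """Objective S: sum of per-character scores for the alignment at start."""
    return sum(
        MATCH if dna[start + k] == query[k] else MISMATCH
        for k in range(len(query))
    )


def best_alignments(dna: str, query: str, n_best: int) -> list:
    """Return the n_best (score, position) pairs, highest score first."""
    scored = [
        (similarity(dna, query, start), start)
        for start in range(len(dna) - len(query) + 1)
    ]
    # Sort by descending score, breaking ties by ascending position.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:n_best]


print(best_alignments("ACGTTCGA", "ACGA", 2))  # [(2, 0), (2, 4)]
```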

All solutions share a common approach based on the MapReduce model:
• The master node splits the string into chunks and scatters them to the worker nodes.
• The workers perform the computation and send their results back to the master.
• The master combines the individual solutions and returns the output.
GENERAL IDEA
Attention must be paid to matching strings that cross chunk boundaries.
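Why cross-chunk strings need attention can be seen in a single-process simulation of the split, compute, and combine steps. This is only a sketch: workers are simulated by a loop, they are assumed to be numbered from 0, and $n$ is assumed to divide $l_d$.

```python
def find_all(text: str, query: str) -> list:
    """Return every starting position of query in text (overlaps included)."""
    hits, start = [], text.find(query)
    while start != -1:
        hits.append(start)
        start = text.find(query, start + 1)
    return hits


def naive_map_reduce(dna: str, query: str, n: int) -> list:
    """Split dna into n equal chunks, search each one, merge absolute positions."""
    l_c = len(dna) // n  # chunk size l_d / n (assumes n divides l_d)
    results = []
    for i in range(n):  # one iteration per simulated worker
        chunk = dna[i * l_c:(i + 1) * l_c]  # scatter
        results += [i * l_c + off for off in find_all(chunk, query)]  # combine
    return results


dna, query = "GTACAAGTACAA", "GTAC"
print(find_all(dna, query))             # [0, 6]
print(naive_map_reduce(dna, query, 3))  # [0]
```

The occurrence at position 6 spans the second and third chunks, so a plain split misses it.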

GENERIC SPLIT AND COMPUTATION
[Figure: the DNA string split into chunks i−1, i and i+1, each of size $l_d/n$; the query string slides over it at times T0 … T7, producing complete matchings inside a chunk and partial matchings at chunk boundaries.]

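GENERIC REDUCE PHASE
[Figure: the workers output (worker ID, offset) pairs such as $(i, off_i)$ and $(j, off_j)$; for the final output the master translates each pair into the absolute position of a match of size $l_q$ in the DNA string.]
$$ s_i = i \cdot \frac{l_d}{n} + off_i, \qquad s_j = j \cdot \frac{l_d}{n} + off_j $$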
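The position translation performed in the reduce phase can be sketched as follows. The helper names are hypothetical, workers are assumed to be numbered from 0, and $n$ is assumed to divide $l_d$.

```python
def to_absolute(worker_id: int, offset: int, l_d: int, n: int) -> int:
    """Map a worker-relative offset to s_i = i * (l_d / n) + off_i."""
    return worker_id * (l_d // n) + offset


def reduce_positions(worker_outputs: dict, l_d: int, n: int) -> list:
    """Merge {worker_id: [offsets]} into one sorted list of absolute positions."""
    return sorted(
        to_absolute(i, off, l_d, n)
        for i, offsets in worker_outputs.items()
        for off in offsets
    )


# Workers 0 and 2 report matches inside their own 4-character chunks.
print(reduce_positions({0: [0], 1: [], 2: [1, 3]}, l_d=12, n=3))  # [0, 9, 11]
```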

Bigger chunk:
The master sends every worker a chunk of size $l_d/n + l_q - 1$, so that cross-chunk matching strings can be found.
On Demand:
The master sends chunks of size $l_d/n$. If a worker finds a partial matching of length $k$ at the end of its chunk, it requests the remaining part, $r \le l_q - k$ characters, so that cross-chunk matching strings can be found.
Two possible heuristics: big request and small request.
Side to Side Sliding Query (3SQ):
Every worker receives a chunk of size $l_d/n$ and computes its complete matchings and all partial matchings. The master combines the partial matchings to find cross-chunk matchings.
SOLUTION APPROACHES

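BIGGER CHUNK APPROACH
[Figure: chunks i−1, i and i+1, each of size $l_d/n$ plus $l_q - 1$ extra characters that are the same characters as the start of the next chunk; the query string slides over the DNA string at times T0 … T7 and every complete matching falls inside a single chunk.]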

ADVANTAGES:
• it does not require communication between workers;
• it does not produce duplicated occurrences;
• the master has only an extremely small amount of sequential work to perform.
DISADVANTAGES:
• each worker (except the last one) receives $l_q - 1$ extra characters
Thus, the extra bandwidth usage $b_e$ is:
$$ b_e = (l_q - 1) \cdot (n - 1) $$
BIGGER CHUNK APPROACH
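A single-process sketch of the Bigger Chunk split is given below. It is not the authors' implementation: workers are simulated by a loop and numbered from 0, $n$ is assumed to divide $l_d$, and the rule that a match may only start in the first $l_d/n$ characters of a chunk is one assumed way of avoiding duplicated occurrences.

```python
def bigger_chunks(dna: str, l_q: int, n: int) -> list:
    """Chunk i covers l_d / n characters plus the next l_q - 1 (fewer at the end)."""
    l_c = len(dna) // n  # assumes n divides l_d
    return [dna[i * l_c:(i + 1) * l_c + l_q - 1] for i in range(n)]


def bigger_chunk_search(dna: str, query: str, n: int) -> list:
    """Search every enlarged chunk and return absolute match positions."""
    l_c, l_q = len(dna) // n, len(query)
    positions = []
    for i, chunk in enumerate(bigger_chunks(dna, l_q, n)):
        # A match may start only in the first l_c characters of the chunk,
        # so no position is reported by two workers.
        for off in range(min(l_c, len(chunk) - l_q + 1)):
            if chunk.startswith(query, off):
                positions.append(i * l_c + off)
    return positions


def extra_bandwidth(l_q: int, n: int) -> int:
    """b_e = (l_q - 1) * (n - 1) extra characters sent overall."""
    return (l_q - 1) * (n - 1)


print(bigger_chunk_search("GTACAAGTACAA", "GTAC", 3))  # [0, 6]
print(extra_bandwidth(l_q=4, n=3))                     # 6
```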

Bigger Chunk:
analogous to the generic approach
On Demand:
analogous to the generic approach
additional work is performed by the master node to combine partial matchings.
REDUCE PHASE

ON DEMAND – BIG REQUEST APPROACH
[Figure: the query string slides over the DNA string at times T0 … Tj+3 across chunk i and chunk i+1, each of size $l_d/n$; individual character comparisons are marked v or x; legend: complete matching, partial matching.]

ADVANTAGES:
• extra data is requested only when needed
• it does not produce duplicated occurrences
• a single request is performed for each worker
DISADVANTAGES:
• extra overhead for the big request
• potentially useless extra characters
ON DEMAND – BIG REQUEST APPROACH

ON DEMAND – SMALL REQUEST APPROACH
[Figure: the query string slides over the DNA string at times T0 … Tj+3 across chunk i and chunk i+1, each of size $l_d/n$; individual character comparisons are marked v or x; legend: complete matching, partial matching.]

ADVANTAGES:
• extra data are requested only when needed
• better bandwidth usage than big request
DISADVANTAGES:
• the number of requests grows proportionally to the length of the query
ON DEMAND – SMALL REQUEST APPROACH
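The trade-off between the two heuristics can be made concrete with a small simulation. The slides do not specify the request sizes, so this sketch assumes that a big request asks once for enough characters to resolve the shortest partial matching, and that a small request asks for one character at a time until every candidate is resolved. All names are hypothetical, and `rest` stands for the DNA that follows the chunk, wherever it is fetched from.

```python
def partial_starts(chunk: str, query: str) -> list:
    """Offsets in the chunk tail where the remaining characters are a query prefix."""
    first = max(0, len(chunk) - len(query) + 1)
    return [s for s in range(first, len(chunk)) if query.startswith(chunk[s:])]


def on_demand(chunk: str, rest: str, query: str, small: bool) -> tuple:
    """Resolve partial matchings at the end of chunk.

    Returns (cross-chunk match offsets, number of requests, characters received).
    """
    l_q = len(query)
    alive = partial_starts(chunk, query)
    matches, got, requests, step = [], "", 0, 1
    if alive and not small:
        # Big request: enough characters for the shortest partial matching.
        step = l_q - (len(chunk) - alive[-1])
    while alive and len(got) < len(rest):
        got += rest[len(got):len(got) + step]
        requests += 1
        still_alive = []
        for s in alive:
            seen = (chunk[s:] + got)[:l_q]
            if not query.startswith(seen):
                continue               # mismatch: drop this candidate
            if len(seen) == l_q:
                matches.append(s)      # complete cross-chunk matching
            else:
                still_alive.append(s)  # still partial: needs more characters
        alive = still_alive
    return matches, requests, len(got)


print(on_demand("AAAAGT", "ACTTTT", "GTACGG", small=False))  # ([], 1, 4)
print(on_demand("AAAAGT", "ACTTTT", "GTACGG", small=True))   # ([], 3, 3)
```

In this example the big request transfers one character that turns out to be useless, while the small request saves that character at the cost of three separate messages.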

Two kinds of communication can be adopted:
ON DEMAND – CENTRALIZED VS DISTRIBUTED
Centralized:
the request is made to the master node
Distributed:
the request is made to the adjacent node on the right

ON DEMAND – CENTRALIZED VS DISTRIBUTED
|               | Centralized | Distributed |
|---------------|-------------|-------------|
| ADVANTAGES    | Master idle time is reduced. | No extra accesses to the DNA are needed. No linearization point. |
| DISADVANTAGES | A linearization point is added. The DNA must be accessed. | Extra data requests may be slowed down. |

SIDE TO SIDE SLIDING QUERY (3SQ) APPROACH
[Figure: the query string slides over the DNA string at times T0 … T3 and Tj … Tj+3 across chunks i−1, i and i+1, each of size $l_d/n$; legend: complete matching, right-side partial matching (at the end of a chunk), left-side partial matching (at the start of a chunk).]

ADVANTAGES:
• no extra data is required
• no extra communication is needed
• the master does not need to store the DNA string
• it reduces the bandwidth needed to check cross-chunk strings: workers return bits instead of integers.
DISADVANTAGES:
• extra work is required of the master (combining the partial matchings)
SIDE TO SIDE SLIDING QUERY (3SQ) APPROACH

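3SQ REDUCE PHASE
[Figure: worker i returns a right-side array of bits (e.g. 1 1 0 1) and worker i+1 a left-side array (e.g. 1 0 0 1); the master ANDs them into a results array (1 0 0 1). Each 1, at index j or k, marks a cross-chunk query match of size $l_q$ in the DNA string, reported in the final output at position $s_j = i \cdot (l_d/n) - j$ or $s_k = i \cdot (l_d/n) - k$.]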
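The 3SQ combine step can be sketched with plain lists of bits. This is an assumed reading of the figure, not the authors' code: index $k - 1$ of each array refers to a match starting $k$ characters before the chunk boundary, and `boundary` is the absolute position at which the right-hand chunk starts.

```python
def right_side(chunk: str, query: str) -> list:
    """Bit k-1 is 1 if the last k characters of chunk equal the first k of query."""
    return [int(chunk.endswith(query[:k])) for k in range(1, len(query))]


def left_side(chunk: str, query: str) -> list:
    """Bit k-1 is 1 if chunk starts with the query characters after the first k."""
    return [int(chunk.startswith(query[k:])) for k in range(1, len(query))]


def combine(right: list, left: list, boundary: int) -> list:
    """AND the two arrays; a 1 at index k-1 is a match starting k before boundary."""
    return [
        boundary - k
        for k, (r, l) in enumerate(zip(right, left), start=1)
        if r & l
    ]


# "AAGTACAA" split into two chunks of 4 characters; the query spans the boundary.
right = right_side("AAGT", "GTAC")
left = left_side("ACAA", "GTAC")
print(right, left)                       # [0, 1, 0] [0, 1, 0]
print(combine(right, left, boundary=4))  # [2]
```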

• Splitting phase:
same as in simple search
• Computation phase:
• The similarity function is evaluated for every alignment of the query string
• As in simple search, cross-chunk strings must be considered
• Every worker returns its $n$ best similarity values, with their relative positions
• Reduce phase:
All similarity values are merged in order and the best $n$ alignments are returned
PARALLELIZATION MODEL

REDUCE PHASE
Workers output (offset, similarity):
• Worker 1: (X, 10), (Y, 8), (Z, 3)
• Worker 2: (U, 7), (V, 2), (W, -1)
• Worker 3: (A, 5), (B, -3), (C, -6)
Ordered merge (worker ID, offset, similarity):
(1, X, 10), (1, Y, 8), (2, U, 7), (3, A, 5), (1, Z, 3), (2, V, 2), (2, W, -1), (3, B, -3), (3, C, -6)
Position translation, final output (position, similarity):
(X’, 10), (Y’, 8), (U’, 7)
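The ordered merge and position translation can be sketched with the standard library. The numeric offsets below are invented stand-ins for the symbolic offsets X, Y, Z and so on, workers are assumed to be numbered from 0, and `merge_best` is a hypothetical name.

```python
import heapq
from itertools import islice


def merge_best(worker_results: dict, l_c: int, n_best: int) -> list:
    """Merge per-worker lists of (offset, similarity), each already sorted by
    descending similarity, into the n_best (absolute position, similarity) pairs."""
    streams = [
        [(sim, worker_id, off) for off, sim in results]
        for worker_id, results in worker_results.items()
    ]
    # heapq.merge walks the sorted streams without re-sorting everything.
    merged = heapq.merge(*streams, key=lambda t: -t[0])
    # Position translation: worker-relative offset -> absolute position.
    return [(w * l_c + off, sim) for sim, w, off in islice(merged, n_best)]


results = {
    0: [(3, 10), (7, 8), (1, 3)],
    1: [(2, 7), (5, 2), (0, -1)],
    2: [(4, 5), (6, -3), (9, -6)],
}
print(merge_best(results, l_c=10, n_best=3))  # [(3, 10), (7, 8), (12, 7)]
```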

Bigger chunk:
The master sends every worker a chunk of size $s \le l_d/n + l_q - 1$, so that cross-chunk matching similarities can be evaluated.
Side to Side Sliding Query (3SQ):
Every worker receives a chunk of size $s = l_d/n$ and computes its similarity values and all partial similarities (left-side and right-side). The master sums the partial similarities to compute the similarity values of cross-chunk strings.
CROSS-CHUNK MATCHING

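3SQ PARTIAL SIMILARITY COMBINE PHASE
[Figure: worker i returns a right-side array of partial similarities (e.g. 4 2 0 1) and worker i+1 a left-side array (e.g. 1 0 3 -4) for the query matches of size $l_q$ that span chunk i and chunk i+1; the master sums them element-wise into a results array (5 2 3 -3). The output lists (worker ID, offset, similarity) triples such as $(i, s_j, 5)$ and $(i, s_k, 3)$, with $s_j = l_c - (l_q - 1) + j$.]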
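The element-wise sum of partial similarities can be sketched as follows. The scores `MATCH = 1` and `MISMATCH = -1` are assumed values, and the array layout (index $k - 1$ for an alignment starting $k$ characters before the boundary) is one possible convention rather than the one used in the original implementation.

```python
MATCH, MISMATCH = 1, -1  # assumed values for x > 0 and y <= 0


def score(a: str, b: str) -> int:
    """Sum of per-character scores over the common length of a and b."""
    return sum(MATCH if x == y else MISMATCH for x, y in zip(a, b))


def right_partial(chunk: str, query: str) -> list:
    """Entry k-1: score of the last k characters of chunk against query[:k]."""
    return [score(chunk[-k:], query[:k]) for k in range(1, len(query))]


def left_partial(chunk: str, query: str) -> list:
    """Entry k-1: score of the start of chunk against query[k:]."""
    return [score(chunk, query[k:]) for k in range(1, len(query))]


def combine(right: list, left: list) -> list:
    """Sum the partial similarities of the two sides of each cross-chunk alignment."""
    return [r + l for r, l in zip(right, left)]


# "AAGTACAA" split into two chunks of 4 characters.
right = right_partial("AAGT", "GTAC")
left = left_partial("ACAA", "GTAC")
print(right, left)           # [-1, 2, -3] [-3, 2, -1]
print(combine(right, left))  # [-4, 4, -4]
```

The combined values equal the similarities of the alignments starting at positions 3, 2 and 1 of the unsplit string.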

Varying parameters:
• Number of Workers
• Query Length
We plan to evaluate the running time of every algorithm presented.
Analysing these results will validate our proposal and highlight the best-performing algorithm.
OVERVIEW

DEVELOPMENT AND BENCHMARKING

• Implementation
• Introduction
• DNA Splitting
• Bandwidth usage
• Communication
• Benchmarking
• Testing environment
• Test plan
• Results
• Conclusions

Every proposed algorithm has been implemented in C using the Open MPI library.
Advantages:
• High performance
• Scalability
• Portability
INTRODUCTION

A natural approach: load the DNA entirely from file, calculate its size ($l_d$), split it into $n$ chunks and send them to the workers.
Problems:
A genome may be very large ($3.0 \times 10^9$ bp (base pairs)).
The available memory may not be enough.
Design choice:
The whole DNA is never actually needed.
The DNA is never entirely loaded into memory: first the DNA size and the chunk size are calculated, then, step by step, $l_c$ characters are read from file and sent to a worker.
DESIGN CHOICES: DNA SPLITTING
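The streaming split can be sketched as a generator that reads one chunk at a time. This is an illustration only: the file is assumed to contain nothing but base characters (no header lines or newlines), the chunk size is rounded up when $n$ does not divide $l_d$, and yielding a chunk stands in for sending it to a worker.

```python
import os
import tempfile


def stream_chunks(path: str, n: int):
    """Yield (worker_id, chunk) pairs, reading l_c characters at a time."""
    l_d = os.path.getsize(path)  # DNA size from file metadata, no full load
    l_c = -(-l_d // n)           # chunk size l_d / n, rounded up
    with open(path, "r", encoding="ascii") as f:
        for worker_id in range(n):
            chunk = f.read(l_c)
            if not chunk:
                break
            yield worker_id, chunk  # in a real run: send the chunk to the worker


# Demo on a throwaway file holding a 12-base string.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("GTACAAGTACAA")
print(list(stream_chunks(tmp.name, 3)))  # [(0, 'GTAC'), (1, 'AAGT'), (2, 'ACAA')]
os.remove(tmp.name)
```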

• The messages exchanged during the simple search computation would normally consist of:
• Characters (splitting phase)
• Integers (reduce phase)
Bandwidth usage:
• 1 byte (char size) × $l_c$ × $n$ (splitting phase)
• 4 bytes (integer size) × $l_c$ × $n$ (reduce phase, in the extreme case where every position is a matching)
Can we do better? … YES!
DESIGN CHOICES: BANDWIDTH USAGE

In the Simple Search algorithm, compression can be applied to drastically reduce bandwidth consumption.
Simple Search reduce phase compression: instead of sending the actual positions, a bit array of size $l_c$ is used.
Bit array construction:
for each position, the bit is set to 1 if a matching starts there, 0 otherwise.
Compression ratio:
1:32 (e.g., the 128 bits of 4 integers cover 128 positions instead of 4)
DESIGN CHOICES: BANDWIDTH USAGE
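The bit array encoding can be sketched as below. The bit order inside each byte (lowest bit first) is an assumption, and the function names are hypothetical.

```python
def pack_positions(offsets: list, l_c: int) -> bytes:
    """Encode match offsets as a bit array of l_c bits (1 where a match starts)."""
    bits = bytearray((l_c + 7) // 8)
    for off in offsets:
        bits[off // 8] |= 1 << (off % 8)
    return bytes(bits)


def unpack_positions(bits: bytes, l_c: int) -> list:
    """Recover the match offsets from the bit array."""
    return [off for off in range(l_c) if (bits[off // 8] >> (off % 8)) & 1]


packed = pack_positions([0, 9, 127], l_c=128)
print(len(packed))                    # 16
print(unpack_positions(packed, 128))  # [0, 9, 127]
```

For a chunk of 128 positions the bit array always takes 16 bytes, whereas 128 matches sent as 4-byte integers would take 512 bytes, which is the 1:32 ratio.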

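COMMUNICATION
| Approach | Master to workers: messages | Data type | Type | Extra communication: messages | Data type | Type | Workers to master: messages | Data type | Type |
|---|---|---|---|---|---|---|---|---|---|
| Bigger Chunk | $N(l_d/n + l_q - 1)$ | Char | Async / Sync | none | | | $N(l_d/n)$ | Int / Bit | Sync / Sync |
| On Demand | $N(l_d/n)$ | Char | Async / Sync | $(N-1)(l_q - 1)$ | Char | Sync / Sync | $N(l_d/n)$ | Int / Bit | Sync / Sync |
| 3SQ | $N(l_d/n)$ | Char | Async / Sync | none | | | $N(l_d/n) + 2(l_q - 1)$ | Int / Bit | Sync / Sync |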

Cluster
8 nodes, 100 Mbps Ethernet connection
Node
CPU: Intel Xeon Dual Core 2.8 GHz
RAM: 4 GB
Hard drive: 2 × 30 GB SCSI
Software
OS: Debian 6.0.4
OpenMPI 1.6.1
TESTING ENVIRONMENT

Benchmarking consisted of evaluating and comparing the running times of each algorithm as a function of the following parameters:
• Number of processors (# workers + 1) [2, 4, 8, 16]
• DNA length (Small: 5 MB, Medium: 149 MB, Large: 292 MB)
• Query length (Small: 8 bytes, Medium: 32 bytes, Large: 64 bytes)
• Number of best alignments, approximate search only (10, 50, 100)
In grey: the value at which each parameter is fixed when it is not being evaluated
TEST PLAN

SIMPLE SEARCH: NUMBER OF WORKERS (1/2)
Results:
• Good scalability for every algorithm
• 3SQ performs worse than the others because additional sequential work must be performed.

SIMPLE SEARCH: NUMBER OF WORKERS (2/2)
Results:
• The bit version of Bigger Chunk performs better than the int version.
• As the number of processors increases, Bigger Chunk performs better than the others because more cross-chunk matchings occur.
• No relevant improvement occurred between 8 and 16 processors.

SIMPLE SEARCH: SPEED UP
[Figure: Speed Up Simple Search. DNA size: Big; query size: Small. x-axis: number of processors (2, 4, 8, 16); y-axis: speedup (0 to 4); series: BC-bit, OD-cent, BC-int, OD-dist, 3SQ.]
Results:
• Speedup increases for every algorithm (except BC-int)
• The speedup grows proportionally to $n + 1$
• BC-int suffers from a network bottleneck due to the size of its messages.

SIMPLE SEARCH: DNA LENGTH
Results:
• Good scalability for every algorithm
• 3SQ performs worse: it has more sequential work than the others
• The bit version of Bigger Chunk performs better than the int version
• Execution time grows linearly with DNA size

SIMPLE SEARCH: QUERY LENGTH
Results:
• 3SQ is highly sensitive to query length variations because of the partial matching combine phase.
• No significant variation for the other algorithms, since a single query matching is interrupted at the first mismatch found.

APPROXIMATE SEARCH: NUMBER OF WORKERS
Results:
• Running times decrease linearly with the number of processors.
• 3SQ is only slightly worse than Bigger Chunk because the sequential work is almost the same (ordered merge)

APPROXIMATE SEARCH: SPEED UP
[Figure: Speed Up Approximate Search. DNA size: Medium; query size: Small. x-axis: number of processors (2, 4, 8, 16); y-axis: speedup (0 to 16); series: 3SQ, BC-int.]
Results:
Speedup is globally better than in simple search and close to the ideal value.

APPROXIMATE SEARCH: DNA SIZE
Results:
Running time grows linearly with DNA size.
Motivation:
The main sequential computation is the ordered merge, which has linear complexity.

APPROXIMATE SEARCH: QUERY SIZE
Results:
Running time is influenced by query size.
Motivation:
The computation of the similarity function is affected by query length.

APPROXIMATE SEARCH: NUMBER OF BEST ALIGNMENTS
Results:
Running time grows almost linearly.
Motivation:
Each worker returns to the master as many alignments as the requested number of best alignments, and the ordered merge is affected by it.
[Figure: Approximate Search. DNA size: Big; processors: 16; query size: Small. x-axis: number of best alignments (10, 50, 100); y-axis: running time in seconds (0 to 35); series: BC-int, 3SQ.]

The winner is….
Bigger Chunk
On Demand
3SQ 

Further improvements can be applied to the presented algorithms.
Splitting phase: the DNA alphabet consists of only 4 characters, so 2 bits are enough to represent a character, instead of 8 bits.
Bit mapping:
e.g. A=00, T=01, C=10, G=11
Compression ratio:
1:4 (e.g., one 8-bit character holds 4 bases instead of 1)
IMPROVEMENTS
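The 2-bit mapping can be sketched as follows, using the example codes from the slide. Placing the first base in the two lowest bits of each byte is an assumption, and the sequence length has to be transmitted separately because padding bits decode as A.

```python
CODE = {"A": 0b00, "T": 0b01, "C": 0b10, "G": 0b11}
BASE = {code: base for base, code in CODE.items()}


def pack_bases(seq: str) -> bytes:
    """Pack four bases per byte, first base in the two lowest bits."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= CODE[base] << (2 * (i % 4))
    return bytes(out)


def unpack_bases(data: bytes, length: int) -> str:
    """Decode length bases from the packed bytes."""
    return "".join(
        BASE[(data[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(length)
    )


packed = pack_bases("ATGATTACC")
print(len(packed))              # 3
print(unpack_bases(packed, 9))  # ATGATTACC
```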

3SQ algorithm:
The partial matching combine phase can be performed in a distributed manner.
Each node sends its left or right partial matchings to its left or right sibling, which combines them with its own results and sends them to the master.
In this way the sequential work can be reduced.
IMPROVEMENTS
