SEQUENCE ALIGNMENT SPEED-UP
A PARALLEL APPROACH
University of Salerno
Parallel and Concurrent Computing Course
19 February 2013
Giuliana Carullo
Luca Pepe
Daniele Valenza
• Introduction
• Problem definition
• Simple Search
• Approximate Search
• Parallelization
• Cross-Chunk Matching
• Bigger chunk
• On Demand
• Side to Side Sliding Query
• Approximate Search
• Test plan
Sequence alignment is a process for comparing two
or more DNA or RNA sequences.
Sequence alignment is performed to find similar or identical
regions in the provided sequences, or to check whether a
sequence is already known and stored in a database.
DNA STRUCTURE
DNA bases: A C G T
Bonds (base pairs): (A, T) (C, G)
DNA ALIGNMENT
Affinity measures:
• MATCH
• MISMATCH
• GAP
MATCHING TYPE:
• SIMPLE
• REVERSE AND COMPLEMENT
Q: ATGATTACC DNA String
R(Q): CCATTAGTA Reverse
C (R(Q)): GGTAATCAT Complement
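The reverse-and-complement example above can be reproduced with a short Python sketch. The function name `reverse_complement` and the lookup table are illustrative assumptions, not part of the original implementation.

```python
# Complement of each DNA base, following the pairs (A, T) and (C, G).
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq: str) -> str:
    """Return C(R(seq)): reverse the string, then complement each base."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


q = "ATGATTACC"
print(q[::-1])                # R(Q)
print(reverse_complement(q))  # C(R(Q))
```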
DNA ALIGNMENT TYPES
• Global Alignment
• Local Alignment
SIMPLE SEARCH
Search for all the perfect matchings of a small query string in a larger
DNA string.
INPUT: DNA string, query string
OUTPUT: Number of occurrences, starting positions of the occurrences
Notation:
Number of workers: $n$
Query length: $l_q$
DNA length: $l_d$
Relative position (offset inside a chunk): $\mathit{off}_i$
Absolute position: $s_i$
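As a point of reference, the sequential version of the simple search can be sketched in a few lines of Python. This is a minimal illustration, not the authors' C code; the name `simple_search` is an assumption.

```python
from typing import List


def simple_search(dna: str, query: str) -> List[int]:
    """Return the starting positions of all exact occurrences of query in dna.

    Overlapping occurrences are reported as well.
    """
    positions = []
    start = dna.find(query)
    while start != -1:
        positions.append(start)
        start = dna.find(query, start + 1)
    return positions


occurrences = simple_search("ACGTACGTAC", "GTAC")
print(len(occurrences), occurrences)
```

The output of the problem (number of occurrences and their starting positions) corresponds to `len(occurrences)` and `occurrences`.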
APPROXIMATE SEARCH
Search for the $n$ best alignments of a small query string in a larger
DNA string.
INPUT: DNA string, query string
OUTPUT: Starting positions of the best alignments
Notation:
Number of workers: $n$
Query length: $l_q$
DNA length: $l_d$
Relative position (offset inside a chunk): $\mathit{off}_i$
Absolute position: $s_i$
APPROXIMATE SEARCH – SIMILARITY EVALUATION
Character similarity function:
$$ s_i = \begin{cases} x, & \text{match} \\ y, & \text{mismatch} \\ z, & \text{gap} \end{cases} \qquad x > 0;\; y, z \le 0 $$
(In this work gaps are not considered)
Objective function to maximize:
$$ S = \sum_{i=1}^{l_q} s_i $$
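The similarity evaluation can be sketched as follows. The values x = 1 and y = -1 are assumptions chosen only to satisfy x > 0 and y ≤ 0; gaps are ignored, as in the slides. The function names are hypothetical.

```python
def similarity(window, query, match=1, mismatch=-1):
    """Score S of one gap-free alignment: the sum of the per-character scores."""
    if len(window) != len(query):
        raise ValueError("window and query must have the same length")
    return sum(match if a == b else mismatch for a, b in zip(window, query))


def best_alignments(dna, query, n_best, match=1, mismatch=-1):
    """Return the n_best (score, position) pairs, highest score first."""
    l_q = len(query)
    scored = [
        (similarity(dna[pos:pos + l_q], query, match, mismatch), pos)
        for pos in range(len(dna) - l_q + 1)
    ]
    # Sort by descending score; ties are broken by ascending position.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:n_best]


print(best_alignments("ACGTACGA", "ACGA", 2))
```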
GENERAL IDEA
All solutions share a common approach based on the MapReduce
model:
• The master node splits the DNA string into chunks and scatters
them to the worker nodes.
• The workers perform the computation and send their results
back to the master.
• The master combines the partial solutions and returns the output.
Attention must be paid to matchings that span two adjacent chunks (cross-chunk matchings).
[Figure: GENERIC SPLIT AND COMPUTATION. The DNA string is split into Chunk i−1, Chunk i and Chunk i+1, each of size $l_d/n$; the query string is shown at steps $T_0$ … $T_7$, with complete matchings and partial matchings marked.]
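GENERIC REDUCE PHASE
Workers output (worker ID, offset): $(i, \mathit{off}_i)$, $(j, \mathit{off}_j)$
Final output (absolute positions of query matches of size $l_q$ in the DNA string):
$$ s_i = i \cdot (l_d / n) + \mathit{off}_i, \qquad s_j = j \cdot (l_d / n) + \mathit{off}_j $$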
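The generic split, computation and reduce phases can be sketched sequentially in Python. The function names are hypothetical and the sketch assumes that $n$ divides $l_d$; it also shows why cross-chunk matchings need attention, since the occurrence at position 6 of the sample string is missed.

```python
from typing import List, Tuple


def split(dna: str, n: int) -> List[str]:
    """Split dna into n chunks of size l_d / n (assumes n divides l_d)."""
    size = len(dna) // n
    return [dna[i * size:(i + 1) * size] for i in range(n)]


def worker(chunk: str, query: str) -> List[int]:
    """Return the offsets (relative positions) of the matches inside one chunk."""
    l_q = len(query)
    return [off for off in range(len(chunk) - l_q + 1)
            if chunk[off:off + l_q] == query]


def reduce_phase(results: List[Tuple[int, List[int]]], chunk_size: int) -> List[int]:
    """Translate (worker id, offsets) into absolute positions i * (l_d / n) + off."""
    return sorted(i * chunk_size + off for i, offsets in results for off in offsets)


dna, query, n = "ACGTTTACGTAC", "ACG", 3
chunks = split(dna, n)
results = [(i, worker(chunk, query)) for i, chunk in enumerate(chunks)]
# The occurrence starting at position 6 spans chunks 1 and 2 and is not found.
print(reduce_phase(results, len(dna) // n))
```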
SOLUTION APPROACHES
Bigger chunk:
The master sends every worker a chunk of size $s = l_d/n + l_q - 1$, so
that cross-chunk matchings can be found.
On Demand:
The master sends chunks of size $s = l_d/n$. If a worker finds a partial
matching (the first $k$ query characters) at the end of its chunk, it asks
for the remaining part, of size $r \le l_q - k$, so that cross-chunk
matchings can be found.
Two heuristics are possible: big request and small request.
Side to Side Sliding Query (3SQ):
Every worker receives a chunk of size $s = l_d/n$ and computes its complete
matchings and all its partial matchings. The master combines the partial
matchings to find the cross-chunk matchings.
[Figure: BIGGER CHUNK APPROACH. Chunk i−1, Chunk i and Chunk i+1, each of size $l_d/n$ plus $l_q - 1$ extra characters shared with the following chunk ("Same Char"); the query string is shown at steps $T_0$ … $T_7$, with complete matchings marked.]
BIGGER CHUNK APPROACH
ADVANTAGES:
• it does not require inter-worker communication;
• it does not produce duplicate occurrences;
• the master has very little sequential work to perform.
DISADVANTAGES:
• each worker (except the last one) receives $l_q - 1$ extra characters.
Thus, the extra bandwidth usage $b_e$ is:
$$ b_e = (l_q - 1) \cdot (n - 1) $$
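A sequential Python sketch of the bigger chunk approach, under the assumption that $n$ divides $l_d$ (function names are hypothetical). Every starting position belongs to exactly one chunk, so no duplicates arise, and the extra characters sent add up to $(l_q - 1)(n - 1)$.

```python
def split_bigger(dna, n, l_q):
    """Chunks of size l_d / n + l_q - 1; the last chunk has no extra characters."""
    size = len(dna) // n
    return [dna[i * size:(i + 1) * size + l_q - 1] for i in range(n)]


def search_bigger_chunk(dna, query, n):
    """Search every enlarged chunk and translate offsets to absolute positions."""
    size, l_q = len(dna) // n, len(query)
    positions = []
    for i, chunk in enumerate(split_bigger(dna, n, l_q)):
        for off in range(len(chunk) - l_q + 1):
            if chunk[off:off + l_q] == query:
                positions.append(i * size + off)
    return positions


def extra_bandwidth(n, l_q):
    """Extra characters sent overall: (l_q - 1) * (n - 1)."""
    return (l_q - 1) * (n - 1)


print(search_bigger_chunk("ACGTTTACGTAC", "ACG", 3))
```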
REDUCE PHASE
Bigger Chunk: analogous to the generic approach.
On Demand: analogous to the generic approach.
Side to Side Sliding Query (3SQ): additional work is performed by the
master node to combine the partial matchings.
[Figure: ON DEMAND – BIG REQUEST APPROACH. Chunk i and Chunk i+1, each of size $l_d/n$; the query string is shown at steps $T_0$ … $T_j$, $T_{j+1}$, $T_{j+2}$, $T_{j+3}$, with character comparisons marked as match (v) or mismatch (x) for complete and partial matchings.]
ON DEMAND – BIG REQUEST APPROACH
ADVANTAGES:
• extra data is requested only when needed
• it does not produce duplicate occurrences
• a single request is performed by each worker
DISADVANTAGES:
• extra overhead for the big request
• potentially useless extra characters
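One possible reading of the big request heuristic, sketched in Python: the worker collects the partial matchings at the end of its chunk and asks once for enough characters to verify the shortest of them. This interpretation and all function names are assumptions, not the authors' implementation.

```python
def partial_matches(chunk, query):
    """Lengths k such that the last k characters of chunk equal the first k of query."""
    l_q = len(query)
    return [k for k in range(1, min(l_q, len(chunk) + 1))
            if chunk.endswith(query[:k])]


def big_request_size(chunk, query):
    """Characters to ask for in a single request: l_q - k for the shortest partial matching."""
    ks = partial_matches(chunk, query)
    return len(query) - min(ks) if ks else 0


def complete_cross_matches(chunk, extra, query):
    """Offsets (relative to chunk) of the partial matchings confirmed by the extra characters."""
    return [len(chunk) - k for k in partial_matches(chunk, query)
            if extra[:len(query) - k] == query[k:]]


chunk, next_chunk, query = "TTAC", "GTAC", "ACG"
r = big_request_size(chunk, query)
print(r, complete_cross_matches(chunk, next_chunk[:r], query))
```

With several partial matchings of different lengths, some of the requested characters may turn out to be useless, which matches the disadvantage listed above.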
[Figure: ON DEMAND – SMALL REQUEST APPROACH. Chunk i and Chunk i+1, each of size $l_d/n$; the query string is shown at steps $T_0$ … $T_j$, $T_{j+1}$, $T_{j+2}$, $T_{j+3}$, with character comparisons marked as match (v) or mismatch (x) for complete and partial matchings.]
ON DEMAND – SMALL REQUEST APPROACH
ADVANTAGES:
• extra data is requested only when needed
• it does not produce duplicate occurrences
• better bandwidth usage than big request
DISADVANTAGES:
• the number of requests grows proportionally to the length of the query
ON DEMAND – CENTRALIZED VS DISTRIBUTED
Two kinds of communication can be adopted:
Centralized: the request is made to the master node.
Distributed: the request is made to the adjacent node on the right.
ON DEMAND – CENTRALIZED VS DISTRIBUTED
Centralized
  ADVANTAGES: Master idle time is reduced.
  DISADVANTAGES: A linearization point is added. Access to the DNA must be performed.
Distributed
  ADVANTAGES: No extra accesses to the DNA are needed. No linearization point.
  DISADVANTAGES: Extra data requests may be slowed down.
[Figure: SIDE TO SIDE SLIDING QUERY (3SQ) APPROACH. Chunk i−1, Chunk i and Chunk i+1, each of size $l_d/n$; the query string is shown at steps $T_0$ … $T_3$ and $T_j$ … $T_{j+3}$, with complete matchings, right-side partial matchings and left-side partial matchings marked.]
SIDE TO SIDE SLIDING QUERY (3SQ) APPROACH
ADVANTAGES:
• no extra data is required
• it does not produce duplicate occurrences
• no extra communication is needed
• the master does not need to store the DNA string
• it reduces the bandwidth needed to check cross-chunk strings,
since workers return bits instead of integers.
DISADVANTAGES:
• extra work is required of the master (combining the partial matchings)
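3SQ REDUCE PHASE
Worker i, right-side array: 1 1 0 1
Worker i+1, left-side array: 1 0 0 1
Results array (bitwise AND): 1 0 0 1
Each 1 in the results array, at indices $j$ and $k$, marks a query match of size $l_q$ across the two chunks.
Final output (positions in the DNA string):
$$ s_j = i \cdot (l_d / n) - j, \qquad s_k = i \cdot (l_d / n) - k $$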
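The 3SQ combine step can be sketched in Python as follows. The exact layout of the bit arrays is an assumption: here entry j−1 refers to an alignment that starts j characters before the chunk boundary, `boundary` is the absolute position where the right-hand chunk starts, and chunks are assumed to be at least $l_q - 1$ characters long.

```python
def right_side(chunk, query):
    """Entry j-1 is 1 if the last j characters of chunk equal the first j of query."""
    return [int(chunk.endswith(query[:j])) for j in range(1, len(query))]


def left_side(chunk, query):
    """Entry j-1 is 1 if chunk starts with the last l_q - j characters of query."""
    return [int(chunk.startswith(query[j:])) for j in range(1, len(query))]


def combine(right, left, boundary):
    """AND the two arrays; a 1 at entry j-1 is a matching starting at boundary - j."""
    return [boundary - j
            for j, (r, l) in enumerate(zip(right, left), start=1)
            if r & l]


right = right_side("TTAC", "ACG")   # produced by worker i
left = left_side("GTAC", "ACG")     # produced by worker i+1
print(combine(right, left, boundary=8))
```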
PARALLELIZATION MODEL
Same as simple search.
• Splitting phase:
same as in simple search.
• Computation phase:
• The similarity function is evaluated for every alignment of the query string.
• As in simple search, cross-chunk strings must be considered.
• Every worker returns its $n$ best similarity values, with their relative
positions.
• Reduce phase:
All similarity values are merged in order and the best $n$ alignments
are returned.
REDUCE PHASE
Workers output (offset, similarity):
  Worker 1: (X, 10), (Y, 8), (Z, 3)
  Worker 2: (U, 7), (V, 2), (W, -1)
  Worker 3: (A, 5), (B, -3), (C, -6)
Ordered merge (worker ID, offset, similarity):
  1  X  10
  1  Y  8
  2  U  7
  3  A  5
  1  Z  3
  2  V  2
  2  W  -1
  3  B  -3
  3  C  -6
Position translation, final output (position, similarity):
  X'  10
  Y'  8
  U'  7
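The ordered merge and the position translation can be sketched with Python's `heapq.merge`. Worker ids are 0-based here and every worker list is assumed to be already sorted by descending similarity; the function name and the sample offsets are illustrative.

```python
import heapq
import itertools


def ordered_merge(worker_results, chunk_size, n_best):
    """Merge per-worker (offset, similarity) lists and keep the n_best alignments.

    worker_results[i] is the list returned by worker i, sorted by descending
    similarity. Offsets are translated to absolute positions i * chunk_size + off.
    """
    tagged = [
        [(sim, i * chunk_size + off) for off, sim in results]
        for i, results in enumerate(worker_results)
    ]
    merged = heapq.merge(*tagged, key=lambda pair: pair[0], reverse=True)
    return [(pos, sim) for sim, pos in itertools.islice(merged, n_best)]


workers = [
    [(3, 10), (7, 8), (1, 3)],
    [(0, 7), (5, 2), (2, -1)],
    [(4, 5), (6, -3), (8, -6)],
]
print(ordered_merge(workers, chunk_size=10, n_best=3))
```

The similarity values mirror the example above (10, 8, 3 / 7, 2, -1 / 5, -3, -6); the numeric offsets stand in for X, Y, Z and so on.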
CROSS-CHUNK MATCHING
Bigger chunk:
The master sends every worker a chunk of size $s \le l_d/n + l_q - 1$, so
that cross-chunk similarities can be evaluated.
Side to Side Sliding Query (3SQ):
Every worker receives a chunk of size $s = l_d/n$ and computes its similarity
values and all its partial similarities (left side and right side). The master
sums the partial similarities to compute the similarity values of the
cross-chunk strings.
3SQ PARTIAL SIMILARITY COMBINE PHASE
Worker i, right-side array: 4 2 0 1
Worker i+1, left-side array: 1 0 3 -4
Results array (element-wise sum): 5 2 3 -3
Output (worker ID, offset, similarity): $(i, s_j, 5)$, $(i, s_k, 3)$, …
Offset of a query match of size $l_q$ spanning Chunk i and Chunk i+1, at index $j$ of the arrays:
$$ s_j = l_c - (l_q - 1) + j $$
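A Python sketch of the partial similarity combine, assuming match = 1 and mismatch = -1 and chunks of at least $l_q - 1$ characters. The array layout (entry j−1 for an alignment starting j characters before the boundary) and the function names are assumptions.

```python
def score(a, b, match=1, mismatch=-1):
    """Gap-free similarity of two equally long strings."""
    return sum(match if x == y else mismatch for x, y in zip(a, b))


def right_partial(chunk, query):
    """Entry j-1: similarity of the last j characters of chunk with the first j of query."""
    return [score(chunk[-j:], query[:j]) for j in range(1, len(query))]


def left_partial(chunk, query):
    """Entry j-1: similarity of the first l_q - j characters of chunk with the rest of query."""
    return [score(chunk[:len(query) - j], query[j:]) for j in range(1, len(query))]


def combine(right, left):
    """Element-wise sum: full similarity of every cross-chunk alignment."""
    return [r + l for r, l in zip(right, left)]


right = right_partial("TTAC", "ACG")   # produced by worker i
left = left_partial("GTAC", "ACG")     # produced by worker i+1
print(combine(right, left))
```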
OVERVIEW
Varying parameters:
• Number of workers
• Query length
We plan to evaluate the running times of every presented
algorithm. The analysis of these results will validate our
proposal, highlighting the algorithm that performs best.
DEVELOPMENT AND BENCHMARKING
• Implementation
• Introduction
• DNA Splitting
• Bandwidth usage
• Communication
• Benchmarking
• Testing environment
• Test plan
• Results
• Conclusions
INTRODUCTION
Every proposed algorithm has been implemented in C using the
OpenMPI library.
Advantages:
• High performance
• Scalability
• Portability
DESIGN CHOICES: DNA SPLITTING
A natural approach: load the DNA entirely from file, compute its size ($l_d$),
split it into $n$ chunks and send them to the workers.
Problems:
A genome may be very large ($3.0 \times 10^9$ bp (base pairs)).
The available memory may not be enough.
Design choice:
The whole DNA is never actually needed at once.
The DNA is never entirely loaded in memory: first the DNA size and the chunk
size are calculated, then, step by step, $l_c$ characters are read from file and
sent to a worker.
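The step-by-step reading can be sketched in Python. The sketch assumes a plain file with one byte per base and no line breaks; `read_chunks` is a hypothetical name, and yielding a chunk stands in for sending it to a worker.

```python
import os


def read_chunks(path, n):
    """Yield (worker index, chunk) pairs, reading l_c characters at a time.

    Only one chunk is held in memory at any moment.
    """
    l_d = os.path.getsize(path)   # DNA size, without loading the file
    l_c = -(-l_d // n)            # chunk size l_d / n, rounded up
    with open(path, "rb") as dna_file:
        for i in range(n):
            chunk = dna_file.read(l_c)
            if not chunk:
                break
            yield i, chunk.decode("ascii")
```

In the MPI implementation described in the slides, each chunk would be sent to its worker as soon as it is read.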
DESIGN CHOICES: BANDWIDTH USAGE
The messages exchanged during the simple search computation would
normally consist of:
• Characters (splitting phase)
• Integers (reduce phase)
Bandwidth usage:
• 1 byte (char size) × $l_c$ × $n$: splitting phase
• 4 bytes (integer size) × $l_c$ × $n$ (upper bound, reached when every position is a matching): reduce phase
Can we do better? … YES!
DESIGN CHOICES: BANDWIDTH USAGE
In the simple search algorithm, compression can drastically reduce
bandwidth consumption.
Reduce phase compression: instead of sending the actual positions,
each worker sends a bit array of size $l_c$.
Bit array construction:
for each position, the bit is set to 1 if a matching starts there, and
to 0 otherwise.
Compression ratio:
1:32 (e.g., the 4 integers needed for 4 positions hold the bits for 128 positions)
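A minimal Python sketch of the bit array construction, assuming bit p of the array is stored in byte p // 8 (the bit order and function names are assumptions).

```python
def to_bit_array(positions, l_c):
    """Pack match positions into ceil(l_c / 8) bytes: bit p is 1 if a matching starts at p."""
    bits = bytearray((l_c + 7) // 8)
    for p in positions:
        bits[p // 8] |= 1 << (p % 8)
    return bytes(bits)


def from_bit_array(bits, l_c):
    """Recover the match positions from the bit array."""
    return [p for p in range(l_c) if (bits[p // 8] >> (p % 8)) & 1]


packed = to_bit_array([0, 9, 127], l_c=128)
print(len(packed), from_bit_array(packed, 128))
```

For a chunk of 128 positions the array takes 16 bytes, the size of 4 four-byte integers, regardless of how many matchings are found.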
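COMMUNICATION
For each direction: number of messages, data type, communication type.
Bigger Chunk
  Master to workers: $n(l_d/n + l_q - 1)$, Char, Async / Sync
  Extra communication: none
  Workers to master: $n(l_d/n)$, Int / Bit, Sync / Sync
On Demand
  Master to workers: $n(l_d/n)$, Char, Async / Sync
  Extra communication: $(n - 1)(l_q - 1)$, Char, Sync / Sync
  Workers to master: $n(l_d/n)$, Int / Bit, Sync / Sync
3SQ
  Master to workers: $n(l_d/n)$, Char, Async / Sync
  Extra communication: none
  Workers to master: $n(l_d/n) + 2(l_q - 1)$, Int / Bit, Sync / Sync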
TESTING ENVIRONMENT
Cluster
8 nodes, 100 Mbps Ethernet connection
Node
CPU: Intel Xeon Dual Core 2.8 GHz
RAM: 4 GB
Hard drive: 2 × 30 GB SCSI
Software
OS: Debian 6.0.4
OpenMPI 1.6.1
TEST PLAN
Benchmarking consisted of evaluating and comparing the running
times of each algorithm as a function of the following
parameters:
• Number of processors (# workers + 1): 2, 4, 8, 16
• DNA length: Small (5 MB), Medium (149 MB), Large (292 MB)
• Query length: Small (8 bytes), Medium (32 bytes), Large (64 bytes)
• Number of best alignments (approximate search only): 10, 50, 100
On the original slide, the value at which each parameter is held fixed when not under evaluation is shown in grey.
SIMPLE SEARCH: NUMBER OF WORKERS (1/2)
Results:
• Good scalability for every algorithm.
• 3SQ performs worse than the others because additional sequential work must be performed.
SIMPLE SEARCH: NUMBER OF WORKERS (2/2)
Results:
• Bigger Chunk Bit performs better than the int solution.
• As the number of processors increases, Bigger Chunk performs better than the others because more cross-chunk matchings occur.
• No relevant improvement occurred between 8 and 16 processors.
SIMPLE SEARCH: SPEED UP
[Chart: Speed Up Simple Search. DNA size: Big; query size: Small. X axis: number of processors (2, 4, 8, 16); Y axis: speedup (0 to 4). Series: BC-bit, OD-cent, BC-int, OD-dist, 3SQ.]
Results:
• Speedup increases for every algorithm (except BC-int).
• The speedup grows proportionally to $n + 1$.
• BC-int suffers from a network bottleneck due to the size of the messages.
SIMPLE SEARCH: DNA LENGTH
Results:
• Good scalability for every algorithm.
• 3SQ performs worse than the others because of its additional sequential work.
• Bigger Chunk Bit performs better than the int solution.
• Execution time grows linearly with the DNA size.
SIMPLE SEARCH: QUERY LENGTH
Results:
• 3SQ is highly sensitive to query length variations because of the partial matching combine phase.
• No significant variations for the other algorithms, since each query comparison is interrupted at the first mismatch found.
APPROXIMATE SEARCH: NUMBER OF WORKERS
Results:
• Running times decrease linearly with the number of processors.
• 3SQ is only slightly worse than Bigger Chunk because the sequential work is almost the same (ordered merge).
APPROXIMATE SEARCH: SPEED UP
[Chart: Speed Up Approximate Search. DNA size: Medium; query size: Small. X axis: number of processors (2, 4, 8, 16); Y axis: speedup (0 to 16). Series: 3SQ, BC-int.]
Results:
The speedup is globally better than in simple search and close to the ideal value.
APPROXIMATE SEARCH: DNA SIZE
Results:
Running times grow linearly with the DNA size.
Motivation:
The main sequential computation is the ordered merge, which has linear complexity.
APPROXIMATE SEARCH: QUERY SIZE
Results:
Running times are influenced by the query size.
Motivation:
The computation of the similarity function depends on the query length.
APPROXIMATE SEARCH: NUMBER OF BEST ALIGNMENTS
[Chart: Approximate Search. DNA size: Big; processors: 16; query size: Small. X axis: number of best alignments (10, 50, 100); Y axis: running time in seconds (0 to 35). Series: BC-int, 3SQ.]
Results:
Running times grow almost linearly.
Motivation:
Each worker returns to the master as many alignments as the requested number of best alignments, and this affects the ordered merge process.
The winner is…
Bigger Chunk
On Demand
3SQ
IMPROVEMENTS
Further improvements can be applied to the presented algorithms.
Splitting phase: the DNA alphabet consists of only 4 characters, so 2 bits
are enough to represent a character, instead of 8 bits.
Bit mapping:
e.g. A=00, T=01, C=10, G=11
Compression ratio:
1:4 (e.g., one 8-bit character holds 4 bases instead of 1)
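The 2-bit encoding can be sketched in Python with the mapping given above (A=00, T=01, C=10, G=11). Packing the first base into the two most significant bits of each byte is an assumption, as are the function names.

```python
ENCODE = {"A": 0b00, "T": 0b01, "C": 0b10, "G": 0b11}
DECODE = {code: base for base, code in ENCODE.items()}


def pack(seq):
    """Pack 4 bases per byte, first base in the two most significant bits."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= ENCODE[base] << (6 - 2 * (i % 4))
    return bytes(out)


def unpack(data, length):
    """Decode the first `length` bases from the packed bytes."""
    return "".join(
        DECODE[(data[i // 4] >> (6 - 2 * (i % 4))) & 0b11] for i in range(length)
    )


packed = pack("ATGATTACC")
print(len(packed), unpack(packed, 9))
```

The sequence length has to travel with the packed data, because the last byte may be only partially filled.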
IMPROVEMENTS
3SQ algorithm:
The partial matchings combine phase can be performed in a distributed
manner.
Each node sends its left or right partial matchings to its left or right sibling,
which combines them with its own results and sends them to the master.
In this way the sequential work can be reduced.
Thanks!
Parallel DNA Sequence Alignment

  • 1. SEQUENCE ALIGNMENT SPEED-UP A PARALLEL APPROACH University of Salerno Parallel and Concurrent Computing Course 19 February 2013 Giuliana Carullo Luca Pepe Daniele Valenza
  • 2. • Introduction • Problem definition • Simple Search • Approximate Search • Parallelization • Cross-Chunk Matching • Bigger chunk • On Demand • Side to Side Sliding Query • Approximate Search • Test plan
  • 3. Sequence alignment is a process for comparing two or more DNA or RNA sequences. Sequence alignment is performed in order to find similar or identical regions in the provided sequences, or to check if it is a known sequence stored in a database.
  • 4. DNA STRUCTURE DNA bases: A C G T Bounds: (A, T) (C, G)
  • 5. DNA ALIGNMENT Affinity measures: • MATCH • MISMATCH • GAP MATCHING TYPE: • SIMPLE • REVERSE AND COMPLEMENT Q: ATGATTACC DNA String R(Q): CCATTAGTA Reverse C (R(Q)): GGTAATCAT Complement
  • 6. • Global Alignment: • Local Alignment: • Local Alignment: DNA ALIGNMENT TYPES
  • 7. • Introduction • Problem definition • Simple Search • Approximate Search • Parallelization • Cross-Chunk Matching • Bigger chunk • On Demand • Side to Side Sliding Query • Approximate Search • Test plan
  • 8. Searching all the perfect matchings of a small query string in a bigger DNA string. INPUT: DNA String, Query String OUTPUT: Number of occurences, Occurences starting positions SIMPLE SEARCH Variables Notation # Workers 𝑛 Query length 𝑙 𝑞 DNA Length 𝑙 𝑑 Relative pos. 𝑂𝑓𝑓𝑖 Absolute pos. 𝑠𝑖
  • 9. Searching the «best» n alignments of a small query string in a bigger DNA string INPUT: DNA String, Query String OUTPUT: Best alignments starting positions APPROXIMATE SEARCH Variables Notation # Workers 𝑛 Query length 𝑙 𝑞 DNA Length 𝑙 𝑑 Relative pos. 𝑂𝑓𝑓𝑖 Absolute pos. 𝑠𝑖
  • 10. APPROXIMATE SEARCH – SIMILARITY EVALUATION Character similarity function 𝑠𝑖 = 𝑥, 𝑀𝑎𝑡𝑐ℎ 𝑦, 𝑀𝑖𝑠𝑚𝑎𝑡𝑐ℎ 𝑧, 𝐺𝑎𝑝 x > 0; y, z ≤ 0 (In this work gaps are not considered) Objective function to maximize: 𝑆 = 𝑖 𝑙 𝑞 𝑠𝑖
  • 11. • Introduction • Problem definition • Simple Search • Approximate Search • Parallelization • Cross-Chunk Matching • Bigger chunk • On Demand • Side to Side Sliding Query • Approximate Search • Test plan
  • 12. The common approach to all solutions is based on Map Reduce model: • Master node splits the string into chunks and scatters them to workers node. • The workers perform the computation and results are sent back to the master. • Master combines the single solutions and returns the output. GENERAL IDEA Attention must be paid to the cross-matching strings
  • 13. GENERIC SPLIT AND COMPUTATION Complete Matching Partial Matching 𝑇0 𝑇1 𝑇2 𝑇3 𝑇4 𝑇5 𝑇6 𝑇7 Query string DNA string 𝑙 𝑑/n 𝑙 𝑑/n 𝑙 𝑑/n Chunk size Chunk 𝒊 − 𝟏 Chunk 𝒊 Chunk 𝒊 + 𝟏
  • 14. GENERIC REDUCE PHASE 𝑖 ∗ (𝑙 𝑑 𝑛) + 𝑜𝑓𝑓𝑖 Worker ID Offset 𝑖 𝑜𝑓𝑓𝑖 𝑗 𝑜𝑓𝑓𝑗 𝑙 𝑞 Query string DNA string Size WORKERS OUTPUT FINAL OUTPUT 𝑗 ∗ (𝑙 𝑑 𝑛) + 𝑜𝑓𝑓𝑗 𝑙 𝑞 Positions 𝑠𝑖 𝑠𝑗 𝑠𝑖 𝑠𝑗
  • 15. • Introduction • Problem definition • Simple Search • Approximate Search • Parallelization • Cross-Chunk Matching • Bigger chunk • On Demand • Side to Side Sliding Query • Approximate Search • Test plan
  • 16. Bigger chunk: The master sends to every worker a chunk of sizes = 𝑙 𝑑 𝑛 + 𝑙 𝑞 − 1 such that cross chunk matching strings can be found. On Demand: The master sends chunks of sizes = 𝑙 𝑑 𝑛 , whether a worker finds a partial matching at the end of its chunk, it asks the remaining part r ≤ 𝑙 𝑞 − 𝑘 such that cross chunk matching strings can be found. Two possible heuristics: big request and small request Side to Side Sliding Query (3SQ): Every worker receives a chunk of sizes = 𝑙 𝑑 𝑛 and computes its complete matchings and all partial matchings. Partial matchings will be combined by the master in order to find Cross-Chunk Matchings. SOLUTION APPROACHES
  • 17. 𝑇0 𝑇1 𝑇2 𝑇3 𝑇4 𝑇5 𝑇6 𝑇7 BIGGER CHUNK APPROACH Complete Matching Chunk 𝒊 − 𝟏 𝑙 𝑑/n Query string DNA string 𝑙 𝑞-1 Chunk size Chunk 𝒊 𝑙 𝑑/n 𝑙 𝑞-1 Chunk 𝒊 + 𝟏 𝑙 𝑑/n 𝑙 𝑞-1 Same Char
  • 18. ADVANTAGES: • it does not requires intra-workers communication; • it does not produce duplicated occurrences; • the master has an extremely small sequential work to perform. DISADVANTAGES: • each worker (except the last one) receives 𝑙 𝑞 − 1 extra characters Thus, an extra bandwidth 𝑏 𝑒 usage is produced such as: 𝑏 𝑒 = 𝑙 𝑞 − 1 ⋅ (𝑛 − 1) BIGGER CHUNK APPROACH
  • 19. Bigger Chunk analogous to generic approach On Demand: analogous to generic approach Side to Side Sliding Query (3SQ): additional work is performed by master node for combining partial matchings. REDUCE PHASE
  • 20. Bigger chunk: The master sends to every worker a chunk of sizes = 𝑙 𝑑 𝑛 + 𝑙 𝑞 − 1 such that cross chunk matching strings can be found. On Demand: The master sends chunks of sizes = 𝑙 𝑑 𝑛 , whether a worker finds a partial matching at the end of its chunk, it asks the remaining part r ≤ 𝑙 𝑞 − 𝑘 such that cross chunk matching strings can be found. Two possible heuristics: big request and small request Side to Side Sliding Query (3SQ): Every worker receives a chunk of sizes = 𝑙 𝑑 𝑛 and computes its complete matchings and all partial matchings. Partial matchings will be combined by the master in order to find Cross-Chunk Matchings. SOLUTION APPROACHES
  • 21. 𝑇0 … 𝑇𝑗 𝑇𝑗 + 1 𝑇𝑗 + 2 𝑇𝑗 + 3 ON DEMAND – BIG REQUEST APPROACH 𝑙 𝑑/n v v v v v x v v v v … Chunk 𝒊 Chunk 𝒊 + 𝟏 Complete Matching Partial Matching Query string DNA string Chunk size
  • 22. ADVANTAGES: • extra data is requested only when needed • it does not produce duplicated occurrences • a single request is performed for each worker DISADVANTAGES: • extra overhead for the big request • potential useless extra characters ON DEMAND – BIG REQUEST APPROACH
  • 23. 𝑇0 … 𝑇𝑗 𝑇𝑗 + 1 𝑇𝑗 + 2 𝑇𝑗 + 3 ON DEMAND – SMALL REQUEST APPROACH 𝑙 𝑑/n v v v v v x v v v v … Chunk 𝒊 Chunk 𝒊 + 𝟏 Complete Matching Partial Matching Query string DNA string Chunk size
  • 24. ADVANTAGES: • extra data are requested only when needed • it does not produce duplicated occurrences • better bandwidth usage than big request DISADVANTAGES: • Number of requests grows proportionally to the length of the query ON DEMAND – SMALL REQUEST APPROACH
  • 25. Two kind of communication can be adopted: ON DEMAND – CENTRALIZED VS DISTRIBUTED Centralized: request is made to master node Distributed: request is made to adjacent right node k )
  • 26. ON DEMAND – CENTRALIZED VS DISTRIBUTED Centralized Distributed ADVANTAGES Master idle time is reduced. No extra accesses to DNA are needed. No linearization point. DISADVANTAGES Linearization point is added. Access to DNA must be performed. Extra data requests may be slowed down.
  • 27. Bigger Chunk analogous to generic approach On Demand: analogous to generic approach Side to Side Sliding Query (3SQ): additional work is performed by master node for combining partial matchings. REDUCE PHASE
  • 28. Bigger chunk: The master sends to every worker a chunk of sizes = 𝑙 𝑑 𝑛 + 𝑙 𝑞 − 1 such that cross chunk matching strings can be found. On Demand: The master sends chunks of sizes = 𝑙 𝑑 𝑛 , whether a worker finds a partial matching at the end of its chunk, it asks the remaining part r ≤ 𝑙 𝑞 − 𝑘 such that cross chunk matching strings can be found. Two possible heuristics: big request and small request Side to Side Sliding Query (3SQ): Every worker receives a chunk of sizes = 𝑙 𝑑 𝑛 and computes its complete matchings and all partial matchings. Partial matchings will be combined by the master in order to find Cross-Chunk Matchings. SOLUTION APPROACHES
  • 29. … 𝑇0 𝑇1 𝑇2 𝑇3 𝑇𝑗 𝑇𝑗+1 𝑇𝑗+2 𝑇𝑗+3 SIDE TO SIDE SLIDING QUERY (3SQ) APPROACH Complete Matching Right-side Partial Matching Query string DNA string 𝑙 𝑑/n 𝑙 𝑑/n 𝑙 𝑑/n Chunk size Chunk 𝒊 − 𝟏 Chunk 𝒊 … Left-side Partial Matching … Chunk 𝒊 + 𝟏 …
  • 30. ADVANTAGES: • no extra data is required • it does not produce duplicated occurrences • no extra communication is needed • the master does not need to store the DNA string • it reduces bandwidth consumption to perform cross-chunk strings checking. Indeed workers return bits instead of integers. DISADVANTAGES: • Extra work is required to the master (partial matchings combine) SIDE TO SIDE SLIDING QUERY (3SQ) APPROACH
  • 31. Bigger Chunk analogous to generic approach On Demand: analogous to generic approach Side to Side Sliding Query (3SQ): additional work is performed by master node for combining partial matchings. REDUCE PHASE
  • 32. 3SQ REDUCE PHASE 𝑖 ∗ (𝑙 𝑑 𝑛)- j 1 1 0 1 𝑙 𝑞 Query match DNA string Size WORKER i Right side array FINAL OUTPUT 𝑖 ∗ (𝑙 𝑑 𝑛)- k 𝑙 𝑞 Positions 𝑠𝑗 𝑠 𝑘 Results array 𝑠 𝑘 1 0 0 1 AND 1 0 0 1 WORKER i+1 Left side array 𝑠𝑖 𝒋 𝒌
  • 34. PARALLELIZATION MODEL. Same as simple search. • Splitting phase: same as in simple search • Computation phase: • The similarity function is evaluated for every alignment of the query string • As in simple search, cross-chunk strings must be considered • Every worker returns its $n$ best similarity values, with their relative positions • Reduce phase: all similarity values are merged in order and the best $n$ alignments are returned.
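A worker's computation phase can be sketched as below. This is an illustrative sketch only: the scores `MATCH = 1` and `MISMATCH = -1` are assumed values for the slide's $x$ and $y$, gaps are ignored as in the original work, and the function names are hypothetical.

```python
import heapq

MATCH, MISMATCH = 1, -1  # assumed values for x > 0 and y <= 0


def similarity(window, query):
    """Sum the per-character scores of one gap-free alignment."""
    return sum(MATCH if a == b else MISMATCH for a, b in zip(window, query))


def best_alignments(chunk, query, k):
    """Return the k best (relative offset, similarity) pairs inside one chunk."""
    lq = len(query)
    scored = (
        (similarity(chunk[off:off + lq], query), off)
        for off in range(len(chunk) - lq + 1)
    )
    top = heapq.nlargest(k, scored, key=lambda t: t[0])
    return [(off, sim) for sim, off in top]


local_best = best_alignments("ACGTACTA", "ACG", k=2)
```

The result is already ordered by decreasing similarity, which is what an ordered merge in the reduce phase would expect from each worker.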
  • 35. REDUCE PHASE. Worker 1 returns (Off., Similarity): (X, 10), (Y, 8), (Z, 3). Worker 2: (U, 7), (V, 2), (W, -1). Worker 3: (A, 5), (B, -3), (C, -6). ORDERED MERGE (W. Id, Off., Sim.): (1, X, 10), (1, Y, 8), (2, U, 7), (3, A, 5), (1, Z, 3), (2, V, 2), (2, W, -1), (3, B, -3), (3, C, -6). POS. TRANSLATION, FINAL OUTPUT (Pos., Similarity): (X', 10), (Y', 8), (U', 7).
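The ordered merge followed by position translation can be sketched like this. It is a hedged illustration, not the authors' code: the function name `reduce_best` is hypothetical, workers are numbered from 0, and the translation rule "absolute position = worker index × chunk length + offset" is an assumption.

```python
import heapq
from itertools import islice


def reduce_best(per_worker, chunk_len, k):
    """Merge per-worker lists of (offset, similarity), each sorted by
    decreasing similarity, and return the k best (absolute position, similarity)."""
    tagged = (
        [(sim, worker, off) for off, sim in results]
        for worker, results in enumerate(per_worker)
    )
    # heapq.merge consumes already-sorted inputs lazily, so only the first
    # k merged entries are ever produced.
    merged = heapq.merge(*tagged, key=lambda t: t[0], reverse=True)
    return [(worker * chunk_len + off, sim) for sim, worker, off in islice(merged, k)]


# Similarity values from the slide; the numeric offsets are made up.
per_worker = [
    [(2, 10), (7, 8), (5, 3)],
    [(1, 7), (4, 2), (9, -1)],
    [(0, 5), (3, -3), (8, -6)],
]
final_output = reduce_best(per_worker, chunk_len=10, k=3)
```

As on the slide, the three best similarities (10, 8, 7) come from the first two workers, and the third entry has its offset shifted by one chunk length.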
  • 36. CROSS-CHUNK MATCHING. Bigger Chunk: the master sends every worker a chunk of size $s \le l_d/n + l_q - 1$, so that the similarities of cross-chunk alignments can be evaluated. Side to Side Sliding Query (3SQ): every worker receives a chunk of size $s = l_d/n$ and computes its similarity values and all its partial similarities (left-side and right-side). The master sums the partial similarities to obtain the similarity values of cross-chunk strings.
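  • 37. 3SQ PARTIAL SIMILARITY COMBINE PHASE. [Figure: for a query of length $l_q$ aligned across the boundary between chunk $i$ and chunk $i+1$, worker $i$ returns a right-side array (4 2 0 1) and worker $i+1$ a left-side array (1 0 3 -4); the master adds them element-wise into a results array (5 2 3 -3). Entries $j$ and $k$ appear in the output (W. Id, Off., Sim.) as ($i$, $s_j$, 5), ($i$, $s_k$, 3), …, where $s_j = l_c - (l_q - 1) + j$.]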
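The element-wise sum of partial similarities can be sketched as follows. This is an illustration under assumptions: scores of +1 and -1 stand in for the match and mismatch values, index $j$ starts at 0, both chunks are assumed to hold at least $l_q - 1$ characters, and the function names are hypothetical. The offset formula $s_j = l_c - (l_q - 1) + j$ is the one given on the slide.

```python
MATCH, MISMATCH = 1, -1  # assumed values for x > 0 and y <= 0


def score(a, b):
    """Score two strings character by character, up to the shorter length."""
    return sum(MATCH if x == y else MISMATCH for x, y in zip(a, b))


def right_partial_sims(chunk, query):
    """Entry j scores the tail of chunk from offset s_j against the query prefix."""
    lc, lq = len(chunk), len(query)
    return [score(chunk[lc - (lq - 1) + j:], query) for j in range(lq - 1)]


def left_partial_sims(chunk, query):
    """Entry j scores the head of the next chunk against the last j + 1 query characters."""
    lq = len(query)
    return [score(chunk, query[lq - 1 - j:]) for j in range(lq - 1)]


def combine_sims(right, left):
    """Add matching entries to obtain full cross-chunk similarity values."""
    return [r + l for r, l in zip(right, left)]


query = "TAC"
right = right_partial_sims("ACGT", query)  # chunk i
left = left_partial_sims("ACGT", query)    # chunk i + 1
cross = combine_sims(right, left)
```

Here entry 1 corresponds to offset $s_1 = 4 - 2 + 1 = 3$ in chunk $i$, where the query matches exactly across the boundary and the combined similarity reaches the maximum of 3.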
  • 39. OVERVIEW. Varying parameters: • Number of workers • Query length. We plan to evaluate the running time of every presented algorithm. The analysis of these results will validate our proposal and highlight the algorithm that performs best.
  • 40. SEQUENCE ALIGNMENT SPEED-UP A PARALLEL APPROACH University of Salerno Parallel and Concurrent Computing Course 19 February 2013 Giuliana Carullo Luca Pepe Daniele Valenza DEVELOPMENT AND BENCHMARKING
  • 41. • Implementation • Introduction • DNA Splitting • Bandwidth usage • Communication • Benchmarking • Testing environment • Test plan • Results • Conclusions
  • 42. INTRODUCTION. Every proposed algorithm has been implemented in C with the OpenMPI library. Advantages: • High performance • Scalability • Portability
  • 43. DESIGN CHOICES: DNA SPLITTING. A natural approach: load the DNA entirely from file, compute its size ($l_d$), split it into $n$ chunks and send them to the workers. Problems: a genome may be very large ($3.0 \times 10^9$ bp, base pairs), and the available memory may not be enough. Design choice: the whole DNA is never actually needed, so it is never loaded entirely into memory. First the DNA size and the chunk size are computed; then, step by step, $l_c$ characters are read from file and sent to a worker.
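The streaming split can be sketched as below. This is a minimal single-process illustration rather than the authors' C/OpenMPI code: the generator `stream_chunks` is a hypothetical name, yielding a chunk stands in for sending it to a worker, the chunk length is rounded up (an assumption), and an in-memory `io.StringIO` replaces the file on disk.

```python
import io
import math


def stream_chunks(stream, total_len, n_workers):
    """Yield (worker index, chunk) pairs, reading one chunk at a time."""
    chunk_len = math.ceil(total_len / n_workers)  # l_c, assumed rounding
    for worker in range(n_workers):
        # Only l_c characters are read per step; in the real program the
        # chunk would be sent to the worker and then discarded.
        yield worker, stream.read(chunk_len)


dna_file = io.StringIO("ACGTACGTAC")  # stand-in for a DNA file on disk
# list() is used only to inspect the result of this small demo.
chunks = list(stream_chunks(dna_file, total_len=10, n_workers=3))
```

For a real file the total length could be obtained from the file size before reading, so the splitting loop never holds more than one chunk in memory.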
  • 44. DESIGN CHOICES: BANDWIDTH USAGE. The messages exchanged during the simple search computation would normally consist of: • Characters (splitting phase) • Integers (reduce phase). Bandwidth usage: • 1 byte (char size) × $l_c$ × $n$ in the splitting phase • 4 bytes (integer size) × $l_c$ × $n$ in the reduce phase (when every position is a matching). Can we do better? … YES!
  • 45. DESIGN CHOICES: BANDWIDTH USAGE. In the Simple Search algorithm, compression can drastically reduce bandwidth consumption. Simple Search reduce phase compression: instead of sending the actual positions, a bit array of size $l_c$ is used. Bit array construction: for each position, the bit is set to 1 if a matching starts there and to 0 otherwise. Compression ratio: 1:32 (e.g. the space of 4 integers, which holds 4 positions, can describe 128 positions).
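The bit array construction can be sketched as follows. The sketch is illustrative only: the function names are hypothetical and the bit order inside each byte (least significant bit first) is an assumption, since the slides do not specify it.

```python
def pack_matches(positions, chunk_len):
    """Encode match start positions as a bit array with one bit per chunk position."""
    bits = bytearray((chunk_len + 7) // 8)
    for p in positions:
        bits[p >> 3] |= 1 << (p & 7)  # set bit p
    return bytes(bits)


def unpack_matches(bits, chunk_len):
    """Recover the list of match start positions from the bit array."""
    return [p for p in range(chunk_len) if (bits[p >> 3] >> (p & 7)) & 1]


packed = pack_matches([0, 9, 127], chunk_len=128)
restored = unpack_matches(packed, chunk_len=128)
```

A chunk of 128 positions fits in 16 bytes, the size of four 4-byte integers, which matches the 1:32 ratio per position stated on the slide.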
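  • 46. COMMUNICATION. For each algorithm: messages, data type and type (Async / Sync) in three groups: master to workers, extra communication, workers to master. Bigger Chunk: master to workers $N(l_d/n + l_q - 1)$, Char, Async / Sync; extra communication: none (X); workers to master $N(l_d/n)$, Int / Bit, Sync / Sync. On Demand: master to workers $N(l_d/n)$, Char, Async / Sync; extra communication $(N-1)(l_q - 1)$, Char, Sync / Sync; workers to master $N(l_d/n)$, Int / Bit, Sync / Sync. 3SQ: master to workers $N(l_d/n)$, Char, Async / Sync; extra communication: none (X); workers to master $N(l_d/n) + 2(l_q - 1)$, Int / Bit, Sync / Sync.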
  • 48. TESTING ENVIRONMENT. Cluster: 8 nodes, 100 Mbps Ethernet connection. Node: CPU Intel Xeon Dual Core 2.8 GHz; RAM 4 GB; hard drive 2 × 30 GB SCSI. Software: OS Debian 6.0.4; OpenMPI 1.6.1.
  • 49. TEST PLAN. Benchmarking consisted of evaluating and comparing the running times of each algorithm as a function of the following parameters: • Number of processors (# workers + 1): 2, 4, 8, 16 • DNA length: Small (5 MB), Medium (149 MB), Large (292 MB) • Query length: Small (8 bytes), Medium (32 bytes), Large (64 bytes) • Number of best alignments (approximate search only): 10, 50, 100. In grey, the value at which each parameter is fixed when it is not the one being evaluated.
  • 50. SIMPLE SEARCH: NUMBER OF WORKERS (1/2). Results: • Good scalability for every algorithm • 3SQ performs worse than the others because additional sequential work must be performed.
  • 51. SIMPLE SEARCH: NUMBER OF WORKERS (2/2). Results: • Bigger Chunk with bit arrays performs better than the integer solution • As the number of processors increases, Bigger Chunk performs better than the others because more cross-chunk matchings occur • No relevant improvement occurred between 8 and 16 processors.
  • 52. SIMPLE SEARCH: SPEED UP. [Chart: Speed Up Simple Search; DNA size: Big; query size: Small; x-axis: number of processors (2, 4, 8, 16); y-axis: speedup (0 to 4); series: BC-bit, OD-cent, BC-int, OD-dist, 3SQ.] Results: • Speedup increases for every algorithm (except BC-int) • The speedup grows proportionally to $n + 1$ • BC-int suffers from a network bottleneck due to the size of its messages.
  • 53. SIMPLE SEARCH: DNA LENGTH. Results: • Good scalability for every algorithm • 3SQ performs worse because it does more sequential work than the others • Bigger Chunk with bit arrays performs better than the integer solution • Execution time grows linearly with DNA size.
  • 54. SIMPLE SEARCH: QUERY LENGTH. Results: • 3SQ is highly sensitive to query length variations because of the partial matching combine phase • No significant variation for the other algorithms, since a single query matching is interrupted at the first mismatch found.
  • 55. APPROXIMATE SEARCH: NUMBER OF WORKERS. Results: • Running times decrease linearly with respect to the number of processors • 3SQ is only slightly worse than Bigger Chunk because the sequential work (the ordered merge) is almost the same.
  • 56. APPROXIMATE SEARCH: SPEED UP. [Chart: Speed Up Approximate Search; DNA size: Medium; query size: Small; x-axis: number of processors (2, 4, 8, 16); y-axis: speedup (0 to 16); series: 3SQ, BC-int.] Results: speedup is globally better than in simple search and close to the ideal value.
  • 57. APPROXIMATE SEARCH: DNA SIZE. Results: running time grows linearly with DNA size. Motivation: the main sequential computation is the ordered merge, which has linear complexity.
  • 58. APPROXIMATE SEARCH: QUERY SIZE. Results: running time is influenced by query size. Motivation: the cost of computing the similarity function depends on the query length.
  • 59. APPROXIMATE SEARCH: NUMBER OF BEST ALIGNMENTS. Results: running time grows almost linearly. Motivation: each worker returns its best alignments to the master, so their number affects the ordered merge. [Chart: Approximate Search; DNA size: Big; processors: 16; query size: Small; x-axis: number of best alignments (10, 50, 100); y-axis: running time in seconds (0 to 35); series: BC-int, 3SQ.]
  • 61. The winner is… Bigger Chunk, On Demand, 3SQ
  • 62. IMPROVEMENTS. Further improvements can be applied to the presented algorithms. Splitting phase: the DNA alphabet consists of only 4 characters, so 2 bits are enough to represent a character instead of 8. Bit mapping: e.g. A=00, T=01, C=10, G=11. Compression ratio: 1:4 (e.g. one character-sized byte holds 4 bases instead of 1).
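The 2-bit mapping can be sketched as follows. The mapping A=00, T=01, C=10, G=11 is the one given on the slide; the function names and the placement of the first base in the lowest bits of each byte are assumptions made for illustration.

```python
CODE = {"A": 0b00, "T": 0b01, "C": 0b10, "G": 0b11}  # mapping from the slide
BASE = {bits: base for base, bits in CODE.items()}


def pack_bases(seq):
    """Pack a DNA string into bytes, 4 bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i >> 2] |= CODE[base] << (2 * (i & 3))
    return bytes(out)


def unpack_bases(packed, length):
    """Rebuild the DNA string; length is needed because the last byte may be partial."""
    return "".join(
        BASE[(packed[i >> 2] >> (2 * (i & 3))) & 0b11] for i in range(length)
    )


packed = pack_bases("ACGTACG")
restored = unpack_bases(packed, 7)
```

Seven bases fit in two bytes here, and the sequence length has to travel with the packed data so that padding bits in the last byte are not decoded as bases.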
  • 63. IMPROVEMENTS. 3SQ algorithm: the partial matching combine phase can be performed in a distributed manner. Each node sends its left or right partial matchings to its left or right sibling, which combines them with its own results and sends the outcome to the master. In this way the sequential work can be reduced.