T-Coffee:
A Method for Fast and Accurate Multiple Sequence Alignment
Chen, Gui
03/18/2015
Background & Motivation
Algorithm Illustration
Validation & Result
Why Do We Need Multiple Sequence Alignment?
Homology Modeling
Phylogenetic reconstruction
Illustrate conserved and variable sites within a family
Can be used to construct a profile or HMM to scour databases for
distantly related members of the family
When constructing an MSA, we should in theory consider the evolutionary
and structural relationships within the family. However…
1. Specific expert knowledge (where it exists at all) is hard to integrate into an algorithm
2. General empirical models of protein evolution do not work well when sequences are less than 30% identical
3. Mathematically sound methods are prohibitively demanding in computer resources
That is why we introduce heuristic methods.
A Brief Review of Previous Methods to Construct MSA
ClustalW: heuristic method
Dynamic programming to build the guide tree, then progressive alignment.
Comment: fast, but suffers from its greediness (once a gap, always a gap); uses no local
alignment information and takes little reference from the other sequences in the set
while constructing the MSA.

MSA & DCA: mathematically sound
Simultaneous alignment of all the sequences (Carrillo and Lipman algorithm,
multidimensional dynamic programming).
Comment: extremely CPU- and memory-intensive, and not better than other methods in terms
of alignment performance; uses no local alignment information.

Prrp & Muscle: iterative method
Heuristic method combined with iteration.
Comment: an interesting approach, but neither significantly faster nor better at aligning
than ClustalW; uses no local alignment information and takes some reference from the other
sequences in the set while constructing the MSA.
Question: theoretically speaking, why are iterative methods not better than ClustalW? And
why should local alignment information be taken into account?
A Brief Review of Previous Methods to Construct MSA
Dialign2: heuristic method
Simultaneous alignment of all sequences, but with a crude heuristic local alignment method.
Comment: fast, but considers only local alignment information and, to some extent, takes
reference from all sequences in the set; aligns poorly in practical use.
Question: theoretically speaking, why are iterative methods not better than ClustalW? And
why should local alignment information be taken into account?
…so the motivation now is to build a method with all the merits listed below:
Heuristic method with practical computational time
Combine information from global alignment
Combine information from local alignment
Take reference from other sequences in the sequence set during alignment
Result: the T-Coffee method

Notes on the questions above:
Local alignment is more sensitive to domains and motifs.
Reference should be taken from the other sequences when aligning conserved parts.
An optimal alignment is not equivalent to a biologically meaningful alignment.
*A Case in Which the ClustalW Method Fails
Background & Motivation
Algorithm Illustration
Validation & Result
An Overview of T-coffee Method
…so how does T-Coffee provide all the merits mentioned earlier?
Requirement: solution
1. Heuristic method with practical computational time: progressive alignment (code from ClustalW)
2. Combine information from global alignment: primary library from global alignment (code from ClustalW)
3. Combine information from local alignment: primary library from local alignment (code from Lalign in the FASTA package)
4. Take reference from other sequences in the sequence set during progressive alignment: extension library built from the primary library (based on the intermediate-sequence method)
Three major steps in FASTA:
1. Build a hash table of k-tuples
2. Concatenate matched k-tuples
3. Extend to get high-scoring segments
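The three steps above can be sketched roughly as follows. This is only a toy illustration, not the FASTA or Lalign code: the function names are made up for this sketch, matched k-tuples are grouped by diagonal, and step 3 is reduced to picking the diagonal with the most hits (the real program re-scores and extends segments with a substitution matrix).

```python
from collections import defaultdict

def build_ktuple_index(seq, k):
    """Step 1: hash every k-tuple of seq to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        index[seq[pos:pos + k]].append(pos)
    return index

def diagonal_hits(query, target, k):
    """Step 2: group matching k-tuples by diagonal (target_pos - query_pos)."""
    index = build_ktuple_index(target, k)
    diagonals = defaultdict(list)
    for q in range(len(query) - k + 1):
        for t in index.get(query[q:q + k], []):
            diagonals[t - q].append(q)
    return dict(diagonals)

def best_diagonal(query, target, k):
    """Step 3 (simplified): return the diagonal with the most k-tuple hits."""
    hits = diagonal_hits(query, target, k)
    if not hits:
        return None
    return max(hits, key=lambda d: len(hits[d]))

# All 2-tuples of the query are found on the same diagonal of the target.
print(best_diagonal("GARFIELD", "THEGARFIELD", 2))  # 3
```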
An Overview of T-coffee Method
Primary Library of Alignments (Global and Local)
Library: a set of pairwise alignments between all of the sequences to be aligned, stored as
lists of weights specific to each pair of residue positions in each pair of sequences. A library
can be stored as an N*N lower (or upper) triangular matrix in which the main diagonal is ignored
and each entry is a weight list; in other words, a list of weighted pairwise constraints. The
primary library of global alignments for a sequence set is denoted AG and the primary library of
local alignments is denoted AL. A* refers to either AG or AL.
Suppose we have a sequence set of size N (N is the number of sequences in the set); the total
number of sequence pairs is N*(N-1)/2.
We use Ai to denote the ith sequence in the sequence set A, so that entry A*ij of matrix A*
contains the information from the pairwise alignment of Ai and Aj. Before generating the global
library AG or the local library AL, we first compute all possible pairwise alignments, using a
global alignment method or the FASTA local segment matching method (Lalign).
*For local pairwise alignment we keep, by default, the ten top-scoring non-intersecting local
alignments from each pair of sequences. The number of segments actually derived from a pair is
very likely less than 10 (simply because there are not that many qualifying matched segments) and can be 0.
After the pairwise alignments, we derive a list of pairwise residue matches for each entry of A*.
Xm denotes the mth position in sequence Ai and Xn the nth position in sequence Aj, so each item of the list in an entry can be denoted (Xm, Xn)|A*ij.
Finally, we assign a weight to every pairwise residue match in every entry of A*;
the weight equals the percent identity of the alignment of Ai and Aj from which the residue
match is derived: W[(Xm, Xn)|A*ij] = P.I.(A*ij). Each weighted match is also referred to as a constraint.
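A minimal sketch of how one pairwise alignment could be turned into library entries. The dictionary layout `(i, m, j, n) -> weight`, the 0-based positions, the definition of percent identity over ungapped columns, and the choice to sum weights when the same match is seen again are all assumptions of this sketch, not the T-Coffee file format.

```python
def percent_identity(row_i, row_j):
    """Percent identity over the columns where both rows hold a residue."""
    pairs = [(a, b) for a, b in zip(row_i, row_j) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)

def add_alignment(library, i, j, row_i, row_j):
    """Record every residue match of one alignment of Ai and Aj, weighted by P.I."""
    weight = percent_identity(row_i, row_j)
    m = n = 0  # positions in the ungapped sequences Ai and Aj
    for a, b in zip(row_i, row_j):
        if a != "-" and b != "-":
            key = (i, m, j, n)
            library[key] = library.get(key, 0.0) + weight
        if a != "-":
            m += 1
        if b != "-":
            n += 1
    return library

# Two aligned rows (with a gap in the second one) for sequences 0 and 1.
lib = add_alignment({}, 0, 1, "ACGT", "A-GA")
print(sorted(lib))
```

Every match taken from the same alignment receives the same weight, which is why the weight depends on the entry A*ij rather than on the residues themselves.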
Primary Library of Alignments (Global and Local)
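[Figure: the library drawn as an N*N triangular matrix with rows and columns A1 … AN; the entry for the pair (Ai, Aj) holds a list of weights W(Xm, Xn|A*ij).]
The library is a generalized list: its keys are sequence pairs, and each key maps to a list of key-value pairs (residue pair, weight).
[Figure: example lists produced for a global alignment and for a local alignment.]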
Combination of the Libraries: Addition
Pool the ClustalW and Lalign primary libraries by simple addition:
AGL = AG + AL
W[(Xm, Xn)|AGLij] = W[(Xm, Xn)|AGij] + W[(Xm, Xn)|ALij]

If W[(Xm, Xn)|A*ij] is not recorded in A*, it is taken to be 0.
Entry AGLij can then be regarded as a 'sparse' list of L(i)*L(j) key-value pairs
(many of the values are 0), where L(i) denotes the length (number of
residues) of sequence Ai.
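The addition can be sketched with plain dictionaries, where a missing key plays the role of a weight of 0. The key layout `(i, m, j, n)` and the function name are assumptions made for illustration.

```python
def add_libraries(lib_g, lib_l):
    """AGL = AG + AL: sum the weights per residue pair; a missing pair counts as 0."""
    pooled = dict(lib_g)
    for key, weight in lib_l.items():
        pooled[key] = pooled.get(key, 0) + weight
    return pooled

AG = {(0, 0, 1, 0): 88, (0, 1, 1, 1): 88}    # matches from a global alignment
AL = {(0, 0, 1, 0): 100, (0, 2, 1, 2): 100}  # matches from a local alignment
AGL = add_libraries(AG, AL)
print(AGL[(0, 0, 1, 0)])  # 188
```

Storing only the nonzero weights keeps the 'sparse' list small even though L(i)*L(j) pairs are conceptually present.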
Library can be used as scoring scheme
Library A* can be regarded as a scoring scheme specific to each pair of positions
in each pair of sequences.
It is a secondary scoring scheme, derived from dynamic programming pairwise
alignments that use a substitution matrix as the primary scoring scheme.
Extending the library: Background
Purpose: to take reference from the other sequences at each step of the progressive
alignment.
Previous solutions for this purpose:
Fitting a set of weighted constraints into a multiple alignment is a well-known
problem, formulated by Kececioglu as an instance of the “maximum weight
trace” problem, which is NP-complete. Two optimization strategies have been proposed:

1. Genetic algorithm: prohibitive computational time
2. Graph-theoretical method: not robust enough for all cases
In short, the problem is not handled well from a graph-theoretical point of view.
Solution proposed by this paper: a heuristic algorithm inspired by the
intermediate-sequence method, a triplet approach.
Extending the library: Triplet approach
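Example: extending the weight W(A(G), B(G)), i.e. residue G of seqA matched with residue G of seqB.
Consider seqC: compare W(A(G), C(?)) with W(B(G), C(?)).
Consider seqD: compare W(A(G), D(?)) with W(B(G), D(?)).
If the two weights point to the same residue of the intermediate sequence (C(?) == C(?)), the contribution is the minimum of the two weights; otherwise it is 0.
Here W(A(G), C(?)) = 77 and W(A(G), D(?)) = 100; seqC contributes min = 77 and seqD contributes 0.
E[W(A(G), B(G))] = W(A(G), B(G)) + contributions = 88 + 77 = 165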
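The arithmetic of the triplet example can be checked with a few lines. The weights 88, 77 and 100 come from the slide; the weight of 100 on the seqB-seqC side is an assumption, and the absence of a consistent residue in seqD is modelled here as a weight of 0.

```python
def extended_weight(direct, supports):
    """Primary weight plus, per intermediate sequence, the smaller of its two weights."""
    return direct + sum(min(w_first, w_second) for w_first, w_second in supports)

# (weight to seqA, weight to seqB) for each intermediate sequence.
supports = [
    (77, 100),  # seqC: W(A,C) = 77 from the slide, W(B,C) = 100 assumed
    (100, 0),   # seqD: W(A,D) = 100 from the slide, no consistent match with seqB
]
total = extended_weight(88, supports)
print(total)  # 165
```

Taking the minimum means a triplet is only as trustworthy as its weaker link.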
Sometimes we get a better alignment if we do not strictly follow the guide tree.
That is why we take inference from the other sequences when aligning two sequences
along the guide tree.
Iterative methods achieve this goal by modifying the guide tree in a heuristic manner,
e.g. MUSCLE.
Extending the library: Let’s code this process
Denote the library extension operator by AE. Note that AE is not a library that can
be added to A*; it is a function of A*: AE(A*) = A*E.
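def extend_library(lib, lengths):
    """Return the extended library A*E = AE(A*).

    lib maps (i, m, j, n) -> W[(Xm, Xn)|A*ij] and is stored symmetrically;
    lengths[i] is L(Ai); sequences and positions are 0-based.
    """
    N = len(lengths)
    extended = dict(lib)  # write to a copy so that every read still sees A*
    for i in range(N):
        for j in range(i + 1, N):              # go through A*ij: C(N,2) entries
            for m in range(lengths[i]):
                for n in range(lengths[j]):    # all residue pairs of the entry: L^2
                    E = 0
                    for k in range(N):         # every Ak in A - Ai - Aj: N-2
                        if k in (i, j):
                            continue
                        a = get_positions(lib, i, m, k, lengths[k])
                        b = get_positions(lib, j, n, k, lengths[k])
                        for p in a & b:        # Xp|Ak supports Xm|Ai and Xn|Aj: 2L
                            e1 = lib[(i, m, k, p)]
                            e2 = lib[(j, n, k, p)]
                            E += min(e1, e2)   # extension weight of this triplet
                    if E:
                        extended[(i, m, j, n)] = lib.get((i, m, j, n), 0) + E
                        extended[(j, n, i, m)] = extended[(i, m, j, n)]
    return extended

def get_positions(lib, i, m, k, length_k):
    """Positions p of Ak with a nonzero weight W[(Xm, Xp)|A*ik]: L(Ak) lookups."""
    return {p for p in range(length_k) if lib.get((i, m, k, p), 0) != 0}

Cost of the loops as written: C(N,2) * L^2 * (N-2) * 2L = O(N^3 * L^3)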
Extending the library: Let’s formulate this process
AGLE ≠ AE(AG) + AE(AL)
Note that the distributive law does not hold for the operator AE.

That is to say: AGLE = AE(AGL) = AE(AG + AL)
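A small check of why the extension does not distribute over addition. This is a compact sketch, not the T-Coffee implementation: residues are written as `(sequence, position)` tuples, libraries are symmetric dictionaries, and the weights 77 and 100 are illustrative values. Because each triplet contributes a minimum, a pair supported by a global match on one side and a local match on the other only appears when the libraries are pooled first.

```python
from collections import defaultdict

def symmetric(pairs):
    """Store each weighted residue pair in both directions."""
    lib = {}
    for (x, y), w in pairs.items():
        lib[(x, y)] = w
        lib[(y, x)] = w
    return lib

def add(lib1, lib2):
    """Library addition: sum the weights, a missing pair counts as 0."""
    out = dict(lib1)
    for key, w in lib2.items():
        out[key] = out.get(key, 0) + w
    return out

def extend(lib):
    """AE: add min(W(x,z), W(z,y)) to W(x,y) for every intermediate residue z."""
    neighbours = defaultdict(dict)
    for (x, z), w in lib.items():
        neighbours[x][z] = w
    out = dict(lib)
    for z, linked in neighbours.items():
        for x, w_xz in linked.items():
            for y, w_zy in linked.items():
                # x, y and z must come from three different sequences
                if x[0] != y[0] and x[0] != z[0] and y[0] != z[0]:
                    out[(x, y)] = out.get((x, y), 0) + min(w_xz, w_zy)
    return out

A, B, C = ("A", 0), ("B", 0), ("C", 0)
AG = symmetric({(A, C): 77})    # found only by the global alignments
AL = symmetric({(B, C): 100})   # found only by the local alignments

pooled_then_extended = extend(add(AG, AL))
extended_then_pooled = add(extend(AG), extend(AL))
print(pooled_then_extended.get((A, B), 0))  # 77
print(extended_then_pooled.get((A, B), 0))  # 0
```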
Conclusion: Coffee Score Scheme
Given any pair of residues from any two sequences in the sequence set:
If weight = 0, the residue pair is never supported by the global alignment, the local alignments or the triplet extension
(in other words, the residues of that pair may end up aligned against gaps).
If weight > 0, the weight reflects a combination of the similarity of the pair of
sequences (global) or sequence segments (local) that the residue pair comes from, and the
consistency of the match with residues from the other sequences in the sequence set.
The weight library can then be used as the Coffee scoring scheme for progressive alignment.
*When applying the Coffee scoring scheme in dynamic programming or progressive alignment, there is no need to set
additional gap-opening or gap-extension penalties, for two reasons:
1. The Coffee scoring scheme is a secondary scoring scheme generated by dynamic programming with a primary
scoring scheme, in which gap penalties have already been taken into account.
2. Although the local alignment primary library does not reflect how a residue match introduces gaps
globally, if the match is also supported by the global alignment, the gap information is
reflected through the global alignment. Otherwise the match will not receive a high weight
unless it is supported by consistency with the other sequences. In this case a gap penalty is still
not necessary.
In other words, the weight reflects how well the residue pair is supported
by the direct local or global alignment it comes from, and by the
indirect alignments through all other sequences acting as
intermediate sequences.
In practice, gap penalty = 0.
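A sketch of pairwise dynamic programming driven only by library weights, with gap penalty = 0 as stated above. The function name, the `weight(m, n)` callback and the toy weights are assumptions for illustration; the recursion is the usual global alignment recursion with the gap terms set to zero.

```python
def align_with_library(len_i, len_j, weight):
    """Best-scoring set of residue matches between Ai and Aj, no gap penalty."""
    score = [[0] * (len_j + 1) for _ in range(len_i + 1)]
    for m in range(1, len_i + 1):
        for n in range(1, len_j + 1):
            score[m][n] = max(
                score[m - 1][n - 1] + weight(m - 1, n - 1),  # match Xm with Xn
                score[m - 1][n],                             # gap in Aj, cost 0
                score[m][n - 1],                             # gap in Ai, cost 0
            )
    matches = []
    m, n = len_i, len_j
    while m > 0 and n > 0:                                   # traceback
        if score[m][n] == score[m - 1][n - 1] + weight(m - 1, n - 1):
            matches.append((m - 1, n - 1))
            m, n = m - 1, n - 1
        elif score[m][n] == score[m - 1][n]:
            m -= 1
        else:
            n -= 1
    matches.reverse()
    return score[len_i][len_j], matches

# Toy extended library for one pair of sequences: (m, n) -> weight.
toy = {(0, 0): 165, (1, 2): 100, (2, 3): 88}
best, matched = align_with_library(3, 4, lambda m, n: toy.get((m, n), 0))
print(best, matched)  # 353 [(0, 0), (1, 2), (2, 3)]
```

Residue 1 of the second sequence stays unmatched at no cost, because nothing in the library supports aligning it.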
Progressive Alignment Strategy
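Given column n of each of the two existing alignments to be merged:
[Figure: columns of C and T residues from the two groups being merged. The figure defines three averages of library weights: average_1 (a1) over the residue pairs within the first column, average_2 (a2) over the residue pairs within the second column, and average_3 (a3) over the residue pairs between the two columns. Figure labels: average_1’ = a1 + a2 + a3, average_2’, average_3’; a1 and a2 are marked "Within", a3 is marked "Between".]
There is no need to align pairs of residues within an existing column of an
alignment; only the weights of matched pairs of residues between the existing
columns are considered, which is the average_3 (a3) term.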
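A rough sketch of the "between" term: the score for putting two existing columns on top of each other, taken as the average library weight over all residue pairs that span the two columns. The residue representation `(sequence, position)`, the function name and the example weights are assumptions; how gaps inside a column enter the average is left out of this sketch.

```python
def between_column_score(col_a, col_b, weight):
    """Average library weight over all pairs (x in col_a, y in col_b)."""
    pairs = [(x, y) for x in col_a for y in col_b]
    if not pairs:
        return 0.0
    return sum(weight(x, y) for x, y in pairs) / len(pairs)

# Hypothetical weights between residues of two already-aligned groups.
toy_lib = {((0, 5), (2, 4)): 165, ((1, 5), (2, 4)): 77}
col_a = [(0, 5), (1, 5)]   # column from the first group (sequences 0 and 1)
col_b = [(2, 4), (3, 4)]   # column from the second group (sequences 2 and 3)
avg = between_column_score(col_a, col_b, lambda x, y: toy_lib.get((x, y), 0))
print(avg)  # 60.5
```

Pairs inside col_a or inside col_b never enter the sum, since those residues are already aligned with each other.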
Background & Motivation
Algorithm Illustration
Validation & Result
Test Cases Are from BAliBASE: Why BAliBASE?
Reliability: the MSAs in BAliBASE result from manual structure comparison and are validated
using the structure-superposition algorithms SSAP and DALI
Comprehensiveness: the 141 MSA cases in BAliBASE can be grouped into 5 categories:
1. Group with phylogenetically equidistant members
2. Group with one orphan sequence and a group of close relatives
3. Group with two distant subgroups
4. Group in which some members have long terminal insertions
5. Group in which some members have long internal insertions
Thus the cases are unlikely to be biased toward any specific multiple-alignment
method.
Validation method: Scoring Scheme and Multimethod Comparison
Scoring Scheme:
1. Column-wise comparison: a point is earned only when the whole column is aligned correctly
2. Sum-of-pairs (SP): partial credit is earned when a column is only partially correct.
Validation is carried out by comparing each calculated multiple alignment
with its counterpart in BAliBASE.
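Both scores can be sketched as follows. This is a common simplified formulation, not the BAliBASE scoring program: alignments are lists of gapped rows in the same sequence order, a column counts only if it is reproduced exactly, and the SP score is the fraction of residue pairs aligned in the reference that are also aligned in the test alignment. The function names and the toy alignments are made up.

```python
from itertools import combinations

def columns(alignment):
    """Turn gapped rows into columns of residue indices (None marks a gap)."""
    counters = [0] * len(alignment)
    cols = []
    for chars in zip(*alignment):
        col = []
        for s, ch in enumerate(chars):
            if ch == "-":
                col.append(None)
            else:
                col.append(counters[s])
                counters[s] += 1
        cols.append(tuple(col))
    return cols

def aligned_pairs(alignment):
    """Every pair of residues sharing a column, as ((seq, pos), (seq, pos))."""
    pairs = set()
    for col in columns(alignment):
        residues = [(s, p) for s, p in enumerate(col) if p is not None]
        pairs.update(combinations(residues, 2))
    return pairs

def sum_of_pairs_score(test, reference):
    """Fraction of reference residue pairs that the test alignment reproduces."""
    ref_pairs = aligned_pairs(reference)
    return len(ref_pairs & aligned_pairs(test)) / len(ref_pairs)

def column_score(test, reference):
    """Fraction of reference columns that appear unchanged in the test alignment."""
    ref_cols = columns(reference)
    test_cols = set(columns(test))
    return sum(col in test_cols for col in ref_cols) / len(ref_cols)

reference = ["ACGT", "ACGT", "A-GT"]
test = ["ACGT", "ACGT", "AG-T"]
print(column_score(test, reference))        # 0.5
print(sum_of_pairs_score(test, reference))  # 0.8
```

The misplaced G in the third row spoils two whole columns but only two of the ten reference pairs, which is why the SP score is more forgiving.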
Multimethod Comparison:
Candidate methods
1. Prrp
2. ClustalW
3. MSA & DCA (eliminated at the very beginning)
4. Dialign2
Statistical method: Wilcoxon matched-pairs signed-rank test, a non-parametric test
whose statistic is built from the ranks of the paired differences between two series of data
H0: no difference; H1: there is a difference
If the P-value is large, H0 is not rejected.
Otherwise H0 is rejected.
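A self-contained sketch of the signed-rank test with an exact two-sided P-value, obtained by enumerating all sign patterns (only practical for a small number of pairs). The function name is hypothetical and the scores below are made-up toy numbers, not results from the paper.

```python
from itertools import product

def signed_rank_test(x, y):
    """Exact two-sided Wilcoxon signed-rank test for paired samples x and y."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences are dropped
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    pos = 0
    while pos < n:                                   # average ranks over ties
        end = pos
        while end + 1 < n and abs(diffs[order[end + 1]]) == abs(diffs[order[pos]]):
            end += 1
        for idx in order[pos:end + 1]:
            ranks[idx] = (pos + end) / 2 + 1
        pos = end + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    statistic = min(w_plus, w_minus)
    # Under H0 every sign pattern is equally likely: enumerate all 2**n of them.
    total = sum(ranks)
    extreme = 0
    for signs in product((0, 1), repeat=n):
        plus = sum(r for r, s in zip(ranks, signs) if s)
        if min(plus, total - plus) <= statistic:
            extreme += 1
    return statistic, extreme / 2 ** n

# Made-up accuracy scores of two methods on the same five test cases.
method_1 = [80, 75, 90, 60, 85]
method_2 = [70, 72, 85, 61, 70]
stat, p_value = signed_rank_test(method_1, method_2)
print(stat, p_value)  # 1.0 0.125
```

With only five pairs the smallest possible two-sided P-value is 2/32, so H0 cannot be rejected at the 5% level here even though four of the five differences favour the first method.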
Result: Extension Library is Superior to Primary Library
Comparison of three types of primary library:
1. ClustalW pair-wise library(C) (extended to CE)
2. Lalign pairwise Library(L) (extended to LE)
3. Pooling of the ClustalW and Lalign pairwise libraries(CL) (extended to CLE)
Result: CL > C , CL > L
Comparison of Extension library with Primary library
Result: CE > C , LE > L , CLE > CL
Comparison between three types of Extension Libraries
Result: CLE > CE , CLE > LE
We can therefore conclude that CLE is the best library to use as a scoring scheme.
Result: T-Coffee Method is Superior to Other Methods
For the comparison with other methods, the two scoring schemes were applied separately, and for each
scoring scheme two kinds of test were run.
Column-wise, core region test: T-Coffee > Prrp > ClustalW
Column-wise, complete alignment test: T-Coffee > ClustalW > Prrp
Sum of pairs, core region test: T-Coffee > ClustalW > Prrp
Sum of pairs, complete alignment test: ?
Result: T-Coffee does not outperform the other methods in every specific case
Thank you!
Questions?

