Sequence alignment 1

INTRODUCTION TO
SEQUENCE ALIGNMENT
PART 1

CONTENT
1-SEQUENCE ALIGNMENT
2-APPLICATIONS OF SEQUENCE
ALIGNMENT
3-DEFINITIONS FOR ALIGNED
SEQUENCES
(a) ALGORITHM (b) DIVERGENT
EVOLUTION (c) CONSERVATION
(d) PROGRAM (e) IDENTITY
(f) SIMILARITY (g) HOMOLOGS
(h) HETEROLOGS (i) ANALOGS
(j) ORTHOLOGS (k) PARALOGS
(l) XENOLOGS
4-PHYSICOCHEMICAL RELATIONSHIPS
BETWEEN AMINO ACIDS
5-MEASURES OF SEQUENCE
SIMILARITY
(a) HAMMING AND LEVENSHTEIN
DISTANCES
(b) CONCEPT OF SIMILARITY AND
DISTANCE TABLE
(c) SIMPLE MATCHING COEFFICIENT

SEQUENCE ALIGNMENT
• Sequence alignment describes the relationship between
biological sequences by designating portions of sequences that
correspond to each other.
• It is the method used to analyze the similarities and
differences at the level of individual bases or amino acids
with the aim of inferring structural, functional and
evolutionary relationships or random events among the
sequences.
• It is the identification of residue-residue correspondences.
• Equivalently, any assignment of correspondences that preserves
the order of the residues within the sequences is an alignment.
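
The order-preserving property can be checked mechanically. The sketch below is a minimal illustration, not a standard tool: the function name is_valid_alignment and the use of "-" as the gap symbol are assumptions made here. It treats an alignment as two gapped rows and verifies that removing the gaps gives back each sequence with its residues in the original order.

```python
GAP = "-"  # gap symbol assumed for this sketch


def is_valid_alignment(seq_a, seq_b, row_a, row_b):
    """Return True if (row_a, row_b) is an order-preserving alignment
    of seq_a and seq_b."""
    # Both rows must have the same number of columns.
    if len(row_a) != len(row_b):
        return False
    # A column that pairs a gap with a gap assigns no correspondence.
    if any(x == GAP and y == GAP for x, y in zip(row_a, row_b)):
        return False
    # Removing the gaps must give back each sequence unchanged,
    # i.e. the residues still appear in their original order.
    return row_a.replace(GAP, "") == seq_a and row_b.replace(GAP, "") == seq_b


print(is_valid_alignment("AGTCC", "CGCTCA", "AG-TCC", "CGCTCA"))  # True
print(is_valid_alignment("AGTCC", "CGCTCA", "GA-TCC", "CGCTCA"))  # False
```

The second call fails because the first two residues of the first sequence were swapped, so the rows no longer preserve residue order.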

APPLICATIONS OF SEQUENCE ALIGNMENT
Information gained by aligning DNA, RNA and protein sequences
is used for:
 ● Searching for patterns and informative elements within
a sequence.
 ● Obtaining statistical information on a sequence.
 ● Searching for similarities between two sequences, or
among many sequences.
 ● Constructing phylogenetic trees based on sequences.
 ● Predicting and analyzing secondary/tertiary structure
and folding on the basis of the sequence.

 ● Identifying unknown sequences.
 ● Finding other members of multigene families.
 ● Gaining information for primer design.
 ● Reconstructing long DNA sequences from
overlapping fragments.
 ● Determining physical and genetic maps from
probe data under various experimental
protocols.
 ● Predicting the function of gene products.
 ● Getting information for molecular modelling.

DEFINITIONS FOR ALIGNED SEQUENCES
 ALGORITHM
 An algorithm is a logical sequence of steps by which a
task can be performed. It is a set of rules for
calculating or solving a problem, normally carried out
by a computer program.
 Important features of an algorithm are:
 (i) It should stop after a finite number of steps.
 (ii) All steps of the algorithm must be precisely
defined.
 (iii) The input to the algorithm must be specified.
 (iv) The output of the algorithm must be specified.
 (v) Each step must be effective, that is, basic
enough to be carried out.
Thus an algorithm is a complete and precise specification
of a method for solving a problem.

 DIVERGENT EVOLUTION
 Similarity among sequences could arise by chance, or
by convergence towards a common sequence and
structure, and therefore function, through evolution,
or by divergent evolution of the two sequences from a
common ancestral sequence. Similarity that arises
from this last mechanism alone (i.e. divergent
evolution) is called homology. Homologs, heterologs,
analogs, orthologs, paralogs and xenologs are words
that describe the different ways in which sequence
similarity could arise.
 CONSERVATION
 A change at a specific position of an amino acid in a
protein (less commonly, of a nucleotide in a DNA
sequence) that preserves the physicochemical
properties of the original residue is known as
conservation.

 PROGRAM
 A program is an implementation of an algorithm.
 IDENTITY
 The extent to which two (nucleotide or amino acid)
sequences are invariant expresses the identity between
the sequences.
 SIMILARITY
 The extent to which nucleotide or protein sequences
are related is known as similarity. The extent of
similarity between two sequences can be based on
percent sequence identity and/or conservation. In
BLAST, similarity refers to a positive matrix score.

 HOMOLOGS
 ‘Homologs’ is a general term for sequences with a
common origin.
 It is a qualitative term and is somewhat arbitrary, in
the sense that it is up to the researcher to decide
what level of similarity indicates homology.
 Numerical indicators of similarity determined by
sequence alignment programs, together with other
parameters derived from biochemistry or other areas
of biology, may contribute to the decision on whether
a pair of sequences is homologous.

 HETEROLOGS
 The opposite of ‘homologs’ is ‘heterologs’.
 Heterologous sequences may still be similar, but they
do not have a common origin, nor do they have a common
function or activity.

 ANALOGS
 Sequences that have the same function but lack
sufficient similarity to imply a common origin are
said to be ‘analogs’.
 This means that analogous sequences followed
evolutionary pathways from different origins to
converge upon the same function and activity. They
have homologous function but heterologous origins, and
may be considered a product of convergent evolution.
Examples are the proteins chymotrypsin and subtilisin.

 ORTHOLOGS
 ‘Orthologs’ are homologs that arise by speciation.
 The implication is that the two sequences are taken
from two different species, but they show similarity
because they are both derived from the same ancestral
sequence that was present in the ancestral organism.
 The amount of difference between two orthologous
sequences may be taken as a rough indicator of the
amount of time that has passed since the speciation
event took place. An example is the hemoglobin
sequences from horse and zebra.

 PARALOGS
 ‘Paralogs’ are homologs that arise by gene
duplication, without this duplication event being
followed by a speciation event.
 This means that paralogs are homologous sequences
that exist in the same organism and that have
different functions.
 Paralogs have homologous origin but heterologous
function, e.g. myoglobin and hemoglobin from human
beings.
 In DNA sequences, the functional difference between
paralogs may simply be that one is a functional gene,
while the other is a silent, non-expressed gene or a
pseudogene.

 XENOLOGS
 ‘Xenologs’ are homologs resulting from horizontal
gene transfer.
 They are an exception to the rule that homologs are
always inherited by descent from a common ancestor.
 Horizontal or lateral exchange of genes may take
place between different species. The function of such
genes would be homologous.

PHYSICOCHEMICAL RELATIONSHIPS BETWEEN AMINO ACIDS
Construction of biologically significant alignments should consider the
fact that protein evolution is constrained by the chemical properties of
amino acids, and by the degeneracy of the genetic code.
Chemically conservative replacements tend to occur more frequently than
replacements with amino acids that are chemically different.
For example, a substitution of isoleucine for leucine, both of which are
nonpolar, is more likely to be seen than a substitution of aspartic
acid, which is negatively charged, for leucine. Conservative changes of
this kind are less likely to affect the structure and function of the
protein. The figure given below gives an idea of allowed substitutions
among amino acids.

[Figure: physicochemical relationships between amino acids]
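
The idea of a chemically conservative replacement can be illustrated with a small sketch. The four-way grouping below is a simplified textbook classification chosen for this example only (an assumption, not the scheme of the original figure); other groupings exist and differ in detail. The names GROUPS, group_of and is_conservative are hypothetical.

```python
# Simplified grouping of the 20 amino acids by side-chain chemistry
# (one-letter codes). This grouping is an assumption for illustration.
GROUPS = {
    "nonpolar": set("GAVLIMFWP"),
    "polar uncharged": set("STCYNQ"),
    "positively charged": set("KRH"),
    "negatively charged": set("DE"),
}


def group_of(residue):
    """Return the name of the group that contains the residue."""
    for name, members in GROUPS.items():
        if residue in members:
            return name
    raise ValueError(f"unknown residue: {residue!r}")


def is_conservative(a, b):
    """True if replacing a with b stays within one chemical group."""
    return group_of(a) == group_of(b)


print(is_conservative("L", "I"))  # True: leucine and isoleucine, both nonpolar
print(is_conservative("D", "L"))  # False: aspartic acid is negatively charged
```

A real scoring scheme would grade substitutions rather than give a yes/no answer; this sketch only mirrors the leucine/isoleucine versus aspartic acid/leucine example above.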

MEASURES OF SEQUENCE SIMILARITY
 Two quantitative measures of the distance between two
sequences are:
 (1) The Hamming distance, defined between two strings of
equal length, is the number of positions with mismatching
characters.
 (2) The Levenshtein (edit) distance, defined between two
strings of not necessarily equal length, is the minimal
number of 'edit operations' required to change one string
into the other, where an edit operation is a deletion,
insertion or alteration of a single character in either
sequence. It is desirable to assign variable weights to
different edit operations, since certain changes are more
likely to happen naturally than others.


 A given sequence of edit operations induces a unique
alignment, but not vice versa.
 Example:
   AGTC
   CGTA
 Hamming distance = 2
   AG-TCC
   CGCTCA
 Levenshtein distance = 3
 Hamming and Levenshtein distances measure the
dissimilarity of two sequences: similar sequences give
small distances and dissimilar sequences give large
distances.
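
Both distances can be computed with a few lines of code. The following is a minimal sketch, assuming unit costs by default. The parameters sub_cost and indel_cost are names introduced here to show where variable weights for edit operations could be plugged in. The Levenshtein function uses the usual dynamic-programming recurrence over prefixes of the two strings.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b, sub_cost=1, indel_cost=1):
    """Minimal total cost of edit operations that turn a into b."""
    # prev[j] is the distance between the prefix of a processed so far
    # and the first j characters of b.
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [i * indel_cost]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + indel_cost,                       # delete x
                curr[j - 1] + indel_cost,                   # insert y
                prev[j - 1] + (0 if x == y else sub_cost),  # match or alter
            ))
        prev = curr
    return prev[-1]


print(hamming("AGTC", "CGTA"))         # 2
print(levenshtein("AGTCC", "CGCTCA"))  # 3
```

The two calls reproduce the distances of the example pairs above, with the gap removed from the first sequence of the second pair.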

CONCEPT OF SIMILARITY AND DISTANCE TABLE
The concepts of similarity and distance may be explained with
hypothetical tables. Here a-e are five organisms that are scored for
resemblance.

Table 1 - Similarity table
      a     b     c     d     e
a   100    65    50    50    50
b    65   100    50    50    50
c    50    50   100    97    65
d    50    50    97   100    65
e    50    50    65    65   100

The similarity table shows the percentage of matches; thus the diagonal
(in which each species is compared to itself) consists of 100% values.
(Such data form the basis of Adansonian analysis, or numerical
taxonomy, defined as "the classification of organisms based on giving
equal weight to every character of the organism; this principle has its
greatest application in numeric taxonomy.")

Table 2 - Distance table
      a     b     c     d     e
a     0     6    11    11    11
b     6     0    11    11    11
c    11    11     0     2     6
d    11    11     2     0     6
e    11    11     6     6     0

The numbers in the distance table show percent differences, or
distances; thus the diagonal consists of 0% values.
Distance tables are in general use. A common measure of difference
between macromolecular sequences is 100-S, where S is the percentage of
identical monomers when the sequences have been optimally aligned.
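
The 100-S measure can be sketched for a pair of sequences that has already been aligned. The code below is an illustration under stated assumptions: the function names are hypothetical, gaps are written as "-", and S is taken over all alignment columns (other conventions divide by the length of the shorter sequence or by the number of ungapped columns). It does not compute the optimal alignment itself.

```python
def percent_identity(row_a, row_b, gap="-"):
    """S: percentage of alignment columns holding identical monomers."""
    if len(row_a) != len(row_b) or not row_a:
        raise ValueError("rows must be non-empty and of equal length")
    # A column counts as a match only if both residues are identical
    # and neither is a gap.
    matches = sum(x == y and x != gap for x, y in zip(row_a, row_b))
    return 100.0 * matches / len(row_a)


def percent_distance(row_a, row_b, gap="-"):
    """Distance 100 - S for an already aligned pair."""
    return 100.0 - percent_identity(row_a, row_b, gap)


s = percent_identity("AG-TCC", "CGCTCA")
print(s, percent_distance("AG-TCC", "CGCTCA"))  # 50.0 50.0
```

Three of the six columns (G, T and C) are identical, so S is 50% and the distance is also 50%.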

SIMPLE MATCHING COEFFICIENT
 The simple matching coefficient (SMC) is a statistic used for
comparing the similarity and diversity of sample sets. For two
objects A and B, each described by binary attributes:

SMC = (M00 + M11) / (M00 + M01 + M10 + M11)

M00 is the total number of attributes where A and B both
have a value of 0.
M11 is the total number of attributes where A and B both
have a value of 1.
M01 is the total number of attributes where the attribute
of A is 0 and the attribute of B is 1.
M10 is the total number of attributes where the attribute
of A is 1 and the attribute of B is 0.
The simple matching distance (SMD), which measures
dissimilarity between sample sets, is given by 1-SMC.
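
A small worked example may help. The sketch below assumes the standard definition, in which the SMC is the number of matching attributes (M00 plus M11, where M11 counts attributes with value 1 in both A and B) divided by the total number of attributes. The function name simple_matching and the two example vectors are made up for illustration.

```python
def simple_matching(a, b):
    """Return (SMC, SMD) for two equal-length sequences of 0/1 attributes."""
    if len(a) != len(b) or not a:
        raise ValueError("attribute lists must be non-empty and of equal length")
    pairs = list(zip(a, b))
    m00 = pairs.count((0, 0))  # both 0
    m11 = pairs.count((1, 1))  # both 1
    m01 = pairs.count((0, 1))  # A is 0, B is 1
    m10 = pairs.count((1, 0))  # A is 1, B is 0
    smc = (m00 + m11) / (m00 + m01 + m10 + m11)
    return smc, 1 - smc


# Hypothetical attribute vectors for two samples A and B.
A = [1, 0, 1, 1, 0, 0, 1, 0]
B = [1, 1, 0, 1, 0, 0, 0, 0]
print(simple_matching(A, B))  # (0.625, 0.375)
```

Here M11 = 2, M00 = 3, M01 = 1 and M10 = 2, so five of the eight attributes match.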
