Multiple Sequence Alignment James McInerney bioinf4biologists Feb. 2009
Alignment can be easy, or difficult due to insertions or deletions (indels).
Homology: Definition. Homology: similarity that is the result of inheritance from a common ancestor. Identification and analysis of homologies is central to phylogenetic systematics. An alignment is a hypothesis of positional homology between bases/amino acids.
Multiple Sequence Alignment- Goals To generate a concise, information-rich summary of sequence data. Sometimes used to illustrate the  dissimilarity  between a group of sequences. Alignments can be treated as models that can be used to test hypotheses. Used to identify homologous residues within sequences.
Multiple sequence alignments - problems All sequences show some similarity (even random sequences). Similarity levels might be high in some parts of the sequence and low in other parts. Sequences might show substantial length variation and presence/absence of various domains.
 
 
SSU rRNA: Structural RNA (not translated). Found in the small ribosomal subunit. Widely used for phylogeny reconstruction (found in every species). Contains stem and loop structures. Stem structures usually conform to Watson-Crick base pairing.
Alignment of 16S rRNA can be guided by secondary structure Alignment of 16S rRNA sequences from different bacteria
Protein Alignment may be guided by Tertiary Structure Interactions Homo sapiens DjlA protein Escherichia coli DjlA protein
Multiple Sequence Alignment- Methods 3 main methods of alignment: Manual (using custom-built text editors). Automatic (using custom-built alignment software). Combined
Manual Alignment - reasons Might be carried out because: Alignment is easy. There is some extraneous information (structural). Automated alignment methods have encountered the local minimum problem. An automated alignment method can be “improved”.
Local minimum GARFIELDTHEFAT---CAT GARFIELDTHEFATFATCAT
Dotplots: The dotplot provides a way of quickly visualizing the similarities between all parts of two sequences simultaneously. Let's consider a dotplot between sperm whale and human myoglobins.
Sperm whale myoglobin: GLSDGEWQLV LNVWGKVEAD IPGHGQEVLI RLFKGHPETL EKFDKFKHLK SEDEMKASED LKKHGATVLT ALGGILKKKG HHEAEIKPLA QSHATKHKIP VKYLEFISEC IIQVLQSKHP GDFGADAQGA MNKALELFRK DMASNYKELG FQG
Human myoglobin: VLSEGEWQLV LHVWAKVEAD VAGHGQDILI RLFKSHPETL EKFDRFKHLK TEAEMKASED LKKHGVTVLT ALGAILKKKG HHEAELKPLA QSHATKHKIP IKYLEFISEA IIHVLHSRHP GDFGADAQGA MNKALELFRK DIAAKYKELG YQG
Dotplot example, sperm whale vs human myoglobin: Put one sequence on top and the other on the side; where residues are identical, put a dot. Diagonal lines of dots show similarities. Just do the first 10 amino acids of each: make a table with the whale sequence (G L S D G E W Q L V ...) on top and the human sequence (V L S E G E W Q L V ...) on the side. [Figure: dotplot grid, sperm whale myoglobin across the top, human myoglobin down the side]
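The table described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the dotlet implementation; the function name `dotplot` is made up for the example, and only the first 10 residues of each myoglobin are used, as the slide suggests.

```python
def dotplot(top: str, side: str) -> list[str]:
    """Return one text row per residue of `side`; '*' marks identity with `top`."""
    return ["".join("*" if a == b else " " for b in top) for a in side]


whale = "GLSDGEWQLV"  # first 10 residues of sperm whale myoglobin (from the slide)
human = "VLSEGEWQLV"  # first 10 residues of human myoglobin (from the slide)

rows = dotplot(whale, human)

# Print the grid: whale sequence on top, human sequence down the side.
print("  " + whale)
for residue, row in zip(human, rows):
    print(residue + " " + row)

# Count dots on the main diagonal (positions where both sequences agree).
diagonal_dots = sum(rows[i][i] == "*" for i in range(len(human)))
print("dots on the diagonal:", diagonal_dots)
```

Dots off the main diagonal (for example the second L of one sequence matching the first L of the other) are the "noise" mentioned on the next slide.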
Dotplot example, sperm whale vs human myoglobin: This is the result for the whole sequence. It is easy to see that the diagonal is a line of dots, so sperm whale and human myoglobin are very similar. But the picture is noisy; it can be smoothed using a sliding window which considers neighbouring residues as well. [Figure: dotplot, sperm whale myoglobin across the top, human myoglobin down the side]
Dotplot example, sperm whale vs human myoglobin: Noise can be smoothed using a sliding window which considers neighbouring residues as well. This has been done here, and the diagonal can be seen to be highly similar. Also, instead of using simple identity, use a scoring matrix.
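A sliding-window filter of the kind described here might look like the following sketch. The window length of 3 and the threshold are arbitrary choices made for this example, not values taken from the slides or from dotlet, and identity is used for scoring; swapping the `==` test for a substitution-matrix lookup would give the scoring-matrix variant.

```python
def windowed_dotplot(top: str, side: str, window: int = 3, threshold: int = 3) -> set:
    """Return cells (i, j) where at least `threshold` of the `window` residue
    pairs along the diagonal starting at side[i], top[j] are identical."""
    hits = set()
    for i in range(len(side) - window + 1):
        for j in range(len(top) - window + 1):
            score = sum(side[i + k] == top[j + k] for k in range(window))
            if score >= threshold:
                hits.add((i, j))
    return hits


whale = "GLSDGEWQLV"  # first 10 residues of sperm whale myoglobin
human = "VLSEGEWQLV"  # first 10 residues of human myoglobin

hits = windowed_dotplot(whale, human)
print(sorted(hits))
```

With these settings the isolated off-diagonal dots disappear and only cells on the main diagonal survive, which is the smoothing effect the slide describes.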
Dotplots in practice: The best tool is an applet* called dotlet: www.isrec.isb-sib.ch/java/dotlet/Dotlet.html www.bip.bham.ac.uk/dotlet/Dotlet.html * An applet is a program that runs in a web browser. This means that you can produce dotplots within a Netscape/IE window. Dotplots are often useful to identify things like repeated domains or duplications in big proteins.
Example dotplot - repeated domains in  Drosophila melanogaster  SLIT protein.  Protein has many repeats  SLIT_DROME (P24014): MAAPSRTTLMPPPFRLQLRLLILPILLLLRHDAVHAEPYSGGFGSSAVSSGGLGSVGIHIPGGGVGVITEARCPRVCSCT GLNVDCSHRGLTSVPRKISADVERLELQGNNLTVIYETDFQRLTKLRMLQLTDNQIHTIERNSFQDLVSLERLDISNNVI TTVGRRVFKGAQSLRSLQLDNNQITCLDEHAFKGLVELEILTLNNNNLTSLPHNIFGGLGRLRALRLSDNPFACDCHLSW LSRFLRSATRLAPYTRCQSPSQLKGQNVADLHDQEFKCSGLTEHAPMECGAENSCPHPCRCADGIVDCREKSLTSVPVTL PDDTTDVRLEQNFITELPPKSFSSFRRLRRIDLSNNNISRIAHDALSGLKQLTTLVLYGNKIKDLPSGVFKGLGSLRLLL LNANEISCIRKDAFRDLHSLSLLSLYDNNIQSLANGTFDAMKSMKTVHLAKNPFICDCNLRWLADYLHKNPIETSGARCE SPKRMHRRRIESLREEKFKCSWGELRMKLSGECRMDSDCPAMCHCEGTTVDCTGRRLKEIPRDIPLHTTELLLNDNELGR ISSDGLFGRLPHLVKLELKRNQLTGIEPNAFEGASHIQELQLGENKIKEISNKMFLGLHQLKTLNLYDNQISCVMPGSFE HLNSLTSLNLASNPFNCNCHLAWFAECVRKKSLNGGAARCGAPSKVRDVQIKDLPHSEFKCSSENSEGCLGDGYCPPSCT CTGTVVACSRNQLKEIPRGIPAETSELYLESNEIEQIHYERIRHLRSLTRLDLSNNQITILSNYTFANLTKLSTLIISYN KLQCLQRHALSGLNNLRVVSLHGNRISMLPEGSFEDLKSLTHIALGSNPLYCDCGLKWFSDWIKLDYVEPGIARCAEPEQ MKDKLILSTPSSSFVCRGRVRNDILAKCNACFEQPCQNQAQCVALPQREYQCLCQPGYHGKHCEFMIDACYGNPCRNNAT CTVLEEGRFSCQCAPGYTGARCETNIDDCLGEIKCQNNATCIDGVESYKCECQPGFSGEFCDTKIQFCSPEFNPCANGAK CMDHFTHYSCDCQAGFHGTNCTDNIDDCQNHMCQNGGTCVDGINDYQCRCPDDYTGKYCEGHNMISMMYPQTSPCQNHEC KHGVCFQPNAQGSDYLCRCHPGYTGKWCEYLTSISFVHNNSFVELEPLRTRPEANVTIVFSSAEQNGILMYDGQDAHLAV ELFNGRIRVSYDVGNHPVSTMYSFEMVADGKYHAVELLAIKKNFTLRVDRGLARSIINEGSNDYLKLTTPMFLGGLPVDP AQQAYKNWQIRNLTSFKGCMKEVWINHKLVDFGNAQRQQKITPGCALLEGEQQEEEDDEQDFMDETPHIKEEPVDPCLEN KCRRGSRCVPNSNARDGYQCKCKHGQRGRYCDQGEGSTEPPTVTAASTCRKEQVREYYTENDCRSRQPLKYAKCVGGCGN QCCAAKIVRRRKVRMVCSNNRKYIKNLDIVRKCGCTKKCY  Perform a dotplot of the SLIT protein against itself www.bio.bham.ac.uk/dotlet/Dotlet.html.
Example dotplot - repeated domains in  Drosophila melanogaster  SLIT protein Swiss-prot entry For further discussion of dotplot see Attwood and Parry-Smith p116-8
Dynamic programming. 2 methods: (1) Dynamic programming: consider 2 protein sequences of 100 amino acids in length. If it takes 100^2 seconds to exhaustively align these sequences, then it will take 100^3 seconds to align 3 sequences, 100^4 to align 4 sequences, etc. Exhaustively aligning 20 sequences would take more time than the universe has existed. (2) Progressive alignment.
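The arithmetic behind this claim can be checked directly. The sketch below uses the slide's own cost model (100^n seconds for n sequences); the age of the universe, about 13.8 billion years, is an outside figure assumed for the comparison and does not come from the slides.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
AGE_OF_UNIVERSE_YEARS = 13.8e9  # assumption: commonly quoted estimate, not from the slides


def exhaustive_seconds(n_sequences: int, length: int = 100) -> int:
    """Slide's cost model: length**n seconds to align n sequences exhaustively."""
    return length ** n_sequences


for n in (2, 3, 4, 20):
    years = exhaustive_seconds(n) / SECONDS_PER_YEAR
    print(f"{n:2d} sequences: {years:.3g} years")

# How many ages of the universe would 20 sequences take under this model?
ratio = exhaustive_seconds(20) / SECONDS_PER_YEAR / AGE_OF_UNIVERSE_YEARS
print(f"20 sequences: {ratio:.3g} times the age of the universe")
```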
Progressive Alignment: Devised by Feng and Doolittle in 1987. Essentially a heuristic method and as such is not guaranteed to find the ‘optimal’ alignment. Requires (n-1)+(n-2)+...+1 = n(n-1)/2 pairwise alignments as a starting point. Most successful implementation is Clustal (Des Higgins). This software is cited 3,000 times per year in the scientific literature.
Overview of ClustalW Procedure (CLUSTAL W)
Sequences (alpha-helices marked in the figure):
  1 PEEKSAVTALWGKVN--VDEVGG
  2 GEEKAAVLALWDKVN--EEEVGG
  3 PADKTNVKAAWGKVGAHAGEYGA
  4 AADKTNVKAAWSKVGGHAGEYGA
  5 EHEWQLVLHVWAKVEADVAGHGQ
Step 1. Quick pairwise alignment: calculate distance matrix
  Hbb_Human 1  -
  Hbb_Horse 2  .17  -
  Hba_Human 3  .59  .60  -
  Hba_Horse 4  .59  .59  .13  -
  Myg_Whale 5  .77  .77  .75  .75  -
Step 2. Neighbor-joining tree (guide tree), with leaves Hbb_Human, Hbb_Horse, Hba_Horse, Hba_Human, Myg_Whale
Step 3. Progressive alignment following guide tree
ClustalW- Pairwise Alignments: First perform all possible pairwise alignments between each pair of sequences. There are (n-1)+(n-2)+...+1 = n(n-1)/2 possibilities. Calculate the ‘distance’ between each pair of sequences based on these isolated pairwise alignments. Generate a distance matrix.
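As a rough illustration of this step, the sketch below enumerates every pair and fills a distance matrix. Two simplifications are assumed here: the distance is taken to be the fraction of mismatched, ungapped columns (ClustalW's own distance calculation may differ), and the short pre-aligned fragments from the overview slide stand in for real pairwise alignments, so the numbers will not reproduce the matrix shown on that slide.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Fraction of differing residues over columns where neither sequence has a gap."""
    columns = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    mismatches = sum(x != y for x, y in columns)
    return mismatches / len(columns)


# Fragments copied from the overview slide (already aligned to each other).
aligned = {
    "Hbb_Human": "PEEKSAVTALWGKVN--VDEVGG",
    "Hbb_Horse": "GEEKAAVLALWDKVN--EEEVGG",
    "Hba_Human": "PADKTNVKAAWGKVGAHAGEYGA",
    "Hba_Horse": "AADKTNVKAAWSKVGGHAGEYGA",
    "Myg_Whale": "EHEWQLVLHVWAKVEADVAGHGQ",
}

pairs = list(combinations(aligned, 2))  # n(n-1)/2 pairs
matrix = {(a, b): p_distance(aligned[a], aligned[b]) for a, b in pairs}

n = len(aligned)
print(f"{n} sequences -> {len(pairs)} pairwise comparisons")
for (a, b), d in matrix.items():
    print(f"{a:10s} {b:10s} {d:.2f}")
```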
Path Graph for aligning two sequences.
Possible alignment: Scoring scheme: match +1, mismatch 0, indel -1. Column scores for this path: 1 1 0 1 0 -1. Score for this path = 2.
Alignment using this path: GATTC- over GAATTC, with column scores 1 1 0 1 0 -1.
Optimal Alignment 1: Column scores: 1 1 -1 1 1 1. Alignment score: 4. Alignment using this path: GA-TTC over GAATTC.
Optimal Alignment 2: Column scores: 1 -1 1 1 1 1. Alignment score: 4. Alignment using this path: G-ATTC over GAATTC.
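The path scores on these slides can be reproduced with a short sketch. `score_alignment` scores a given alignment column by column, and `best_global_score` is a standard Needleman-Wunsch style dynamic programme that finds the best achievable score; both function names are made up for this example, and the scoring scheme (match +1, mismatch 0, indel -1) is the one from the slides.

```python
def score_alignment(top: str, bottom: str, match=1, mismatch=0, indel=-1) -> int:
    """Sum the column scores of an existing alignment ('-' marks an indel)."""
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total += indel
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


def best_global_score(a: str, b: str, match=1, mismatch=0, indel=-1) -> int:
    """Best global alignment score, filling the path graph row by row."""
    prev = [j * indel for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * indel]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diagonal, prev[j] + indel, curr[j - 1] + indel))
        prev = curr
    return prev[-1]


print(score_alignment("GATTC-", "GAATTC"))   # the first, suboptimal path
print(score_alignment("GA-TTC", "GAATTC"))   # optimal alignment 1
print(score_alignment("G-ATTC", "GAATTC"))   # optimal alignment 2
print(best_global_score("GATTC", "GAATTC"))  # best score over all paths
```

Both optimal alignments reach the same score, which is why the choice between them is arbitrary.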
Alignment of 3 sequences
ClustalW- Guide Tree Generate a Neighbor-Joining ‘guide tree’ from these pairwise distances. This guide tree gives the order in which the progressive alignment will be carried out.
Neighbor joining method: The neighbor-joining method is a greedy heuristic which joins, at each step, the two closest sub-trees that are not already joined. It is based on the minimum evolution principle. Neighbors are defined as two taxa that are connected by a single node in an unrooted tree (in the figure, taxa A and B are connected by Node 1).
Distance Matrix What is required for the Neighbour joining method? Distance matrix
First Step: PAM distance 3.3 (Human - Monkey) is the minimum, so we'll join Human and Monkey into Mon-Hum and calculate the new distances. [Figure: Monkey and Human joined as Mon-Hum; Spinach, Mosquito and Rice not yet joined]
Calculation of New Distances: After we have joined two species in a subtree we have to compute the distances from every other node to the new subtree. We do this with a simple average of distances:
  Dist[Spinach, MonHum] = (Dist[Spinach, Monkey] + Dist[Spinach, Human])/2
                        = (90.8 + 86.3)/2
                        = 88.55
[Figure: Monkey and Human joined as Mon-Hum, with Spinach]
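The joining step and the averaging rule above can be sketched as follows. Only the three distances quoted on the slides are used; the helper name `merge_average` and the dictionary layout are choices made for this example. Note that this implements just the simple-average update stated on the slide; a complete neighbor-joining implementation uses a more involved pair-selection criterion and distance update.

```python
def merge_average(dist: dict, a: str, b: str, new: str) -> dict:
    """Join nodes `a` and `b` into `new`; the distance from every other node
    to `new` is the simple average of its distances to `a` and `b`."""
    others = {x for pair in dist for x in pair} - {a, b}
    merged = {pair: d for pair, d in dist.items() if a not in pair and b not in pair}
    for x in others:
        merged[frozenset((x, new))] = (
            dist[frozenset((x, a))] + dist[frozenset((x, b))]
        ) / 2
    return merged


# Distances quoted on the slides (PAM distances).
dist = {
    frozenset(("Human", "Monkey")): 3.3,
    frozenset(("Spinach", "Monkey")): 90.8,
    frozenset(("Spinach", "Human")): 86.3,
}

closest = min(dist, key=dist.get)  # the pair with the minimum distance
print("join first:", sorted(closest))

merged = merge_average(dist, "Human", "Monkey", "MonHum")
print("Dist[Spinach, MonHum] =", round(merged[frozenset(("Spinach", "MonHum"))], 2))
```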
Next Cycle Human Mosquito Mon-Hum Monkey Spinach Rice Mos-(Mon-Hum)
Penultimate Cycle Human Mosquito Mon-Hum Monkey Spinach Rice Mos-(Mon-Hum) Spin-Rice
Last Joining Human Mosquito Mon-Hum Monkey Spinach Rice Mos-(Mon-Hum) Spin-Rice (Spin-Rice)-(Mos-(Mon-Hum))
Unrooted Neighbor-Joining Tree Human Monkey Mosquito Rice Spinach
Multiple Alignment- First pair Align the two most closely-related sequences first. This alignment is then ‘fixed’ and will never change.  If a gap is to be introduced subsequently, then it will be introduced in the same place  in both sequences , but their relative alignment remains unchanged.
ClustalW- Decision time Consult the guide tree to see what alignment is performed next. Align a third sequence to the first two Or Align two entirely different sequences to each other. Option 1 Option 2
ClustalW- Alternative 1: If the situation arises where a third sequence is aligned to the first two, then when a gap has to be introduced to improve the alignment, the two entities (the existing pairwise alignment and the new sequence) are each treated as a single sequence. ClustalW- Alternative 2: If, on the other hand, two separate sequences have to be aligned together, then the first pairwise alignment is placed to one side and the pairwise alignment of the other two is carried out.
ClustalW- Progression The alignment is progressively built up in this way, with each step being treated as a pairwise alignment, sometimes with each member of a ‘pair’ having more than one sequence.
Progressive alignment - step 1 1.  gctcgatacgatacgatgactagcta 2.  gctcgatacaagacgatgacagcta 3. gctcgatacacgatgactagcta 4. gctcgatacacgatgacgagcga 5. ctcgaacgatacgatgactagct 1.  gctcgatacgatacgatgactagcta 2.  gctcgatacaagacgatgac-agcta 1 2 3 4 5
Progressive alignment - step 2 1. gctcgatacgatacgatgactagcta 2. gctcgatacaagacgatgacagcta 3.  gctcgatacacgatgactagcta 4.  gctcgatacacgatgacgagcga 5. ctcgaacgatacgatgactagct 3.  gctcgatacacgatgactagcta 4.  gctcgatacacgatgacgagcga 1 2 3 4 5
Progressive alignment - step 3 1.  gctcgatacgatacgatgactagcta 2.  gctcgatacaagacgatgac-agcta + 3.  gctcgatacacgatgactagcta 4.  gctcgatacacgatgacgagcga 1.  gctcgatacgatacgatgactagcta 2.  gctcgatacaagacgatgac-agcta 3.  gctcgatacacga---tgactagcta 4.  gctcgatacacga---tgacgagcga 1 2 3 4 5
Progressive alignment - final step 1.  gctcgatacgatacgatgactagcta 2.  gctcgatacaagacgatgac-agcta 3.  gctcgatacacga---tgactagcta 4.  gctcgatacacga---tgacgagcga + 5. ctcgaacgatacgatgactagct 1.  gctcgatacgatacgatgactagcta 2.  gctcgatacaagacgatgac-agcta 3.  gctcgatacacga---tgactagcta 4.  gctcgatacacga---tgacgagcga 5.  -ctcga-acgatacgatgactagct- 1 2 3 4 5
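In step 3 above, the gap of three columns appears at the same place in sequences 3 and 4 because their earlier pairwise alignment is fixed and the pair is treated as a single unit. A minimal sketch of that bookkeeping, with a made-up helper name `insert_gap_columns`, is shown below; it only inserts the gap columns and does not decide where they should go.

```python
def insert_gap_columns(group: list[str], position: int, length: int = 1) -> list[str]:
    """Insert `length` gap columns at `position` in every sequence of an
    already-aligned group, leaving their relative alignment unchanged."""
    return [seq[:position] + "-" * length + seq[position:] for seq in group]


# Sequences 3 and 4 as aligned to each other in step 2 (from the slides).
group_34 = [
    "gctcgatacacgatgactagcta",
    "gctcgatacacgatgacgagcga",
]

# Step 3 opens a three-column gap after the first 13 bases of both sequences.
for seq in insert_gap_columns(group_34, position=13, length=3):
    print(seq)
```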
ClustalW-Good points/Bad points Advantages: Speed. Disadvantages: No objective function. No way of quantifying whether or not the alignment is good No way of knowing if the alignment is ‘correct’.
ClustalW-Local Minimum Potential problems: Local minimum problem.  If an error is introduced early in the alignment process, it is impossible to correct this later in the procedure. Arbitrary alignment.
Increasing the sophistication of the alignment process. Should we treat all the sequences in the same way? - even though some sequences are closely-related and some sequences are distant relatives. Should we treat all positions in the sequences as though they were the same? - even though they might have different functions and different locations in the 3-dimensional structure.
 
ClustalW- Caveats Sequence weighting Varying substitution matrices Residue-specific gap penalties and reduced penalties in hydrophilic regions (external regions of protein sequences), encourage gaps in loops rather than in core regions. Positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage openings in subsequent alignments
ClustalW- User-supplied values: Two penalties are set by the user (there are default values, but you should know that it is possible to change these). GOP - Gap Opening Penalty is the cost of opening a gap in an alignment. GEP - Gap Extension Penalty is the cost of extending this gap.
Position-Specific gap penalties: Before any pair of (groups of) sequences is aligned, a table of GOPs is generated for each position in the two (sets of) sequences. The GOP is manipulated in a position-specific manner, so that it can vary over the sequences. If there is a gap at a position, the GOP and GEP penalties are lowered and the other rules do not apply. This makes gaps more likely at positions where gaps already exist.
Discouraging too many gaps: If there is no gap opened, then the GOP is increased if the position is within 8 residues of an existing gap. This discourages gaps that are too close together. At any position within a run of hydrophilic residues, the GOP is decreased. These runs usually indicate loop regions in protein structures. A run of 5 hydrophilic residues is considered to be a hydrophilic stretch. The default hydrophilic residues are D, E, G, K, N, Q, P, R, S, but this can be changed by the user.
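The rules on the last two slides can be put together in a small sketch. The residue set, the run length of 5 and the 8-residue distance come from the slides; the base penalty and the three multipliers are placeholder values invented for this example, since the slides do not give ClustalW's actual factors, and applying the "near a gap" and "hydrophilic" adjustments together is likewise an assumption.

```python
HYDROPHILIC = set("DEGKNQPRS")  # default hydrophilic residues listed on the slide


def hydrophilic_positions(seq: str, run: int = 5) -> set:
    """Positions lying inside a run of at least `run` hydrophilic residues."""
    positions, start = set(), None
    for i, residue in enumerate(seq + "#"):  # '#' sentinel closes a trailing run
        if residue in HYDROPHILIC:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= run:
                positions.update(range(start, i))
            start = None
    return positions


def position_gops(length, gap_positions, hydrophilic, base=10.0,
                  lowered=0.5, raised=2.0, hydro=0.5, near=8):
    """Table of gap opening penalties, one per position.
    `base` and the multipliers are hypothetical placeholder values."""
    gops = []
    for i in range(length):
        if i in gap_positions:
            gop = base * lowered  # existing gap: lowered, other rules skipped
        else:
            gop = base
            if any(abs(i - g) <= near for g in gap_positions):
                gop *= raised     # close to an existing gap: discourage another
            if i in hydrophilic:
                gop *= hydro      # hydrophilic stretch: likely loop region
        gops.append(gop)
    return gops


stretch = hydrophilic_positions("AADEGKNAA")
print(sorted(stretch))
print(position_gops(length=20, gap_positions={10}, hydrophilic={1}))
```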
Divergent Sequences: The most divergent sequences (most different, on average, from all of the other sequences) are usually the most difficult to align. It is sometimes better to delay their alignment until later (when the easier sequences have already been aligned). The user has the choice of setting a cutoff (default is 40% identity). This will delay the alignment of such sequences until the others have been aligned.
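One way to picture the cutoff is the sketch below, which flags sequences whose average identity to all the others falls under 40%. It reuses the distance matrix from the ClustalW overview slide and assumes, purely for illustration, that percent identity is 100 × (1 − distance) and that the average over all other sequences is the quantity compared with the cutoff; ClustalW's exact rule may differ.

```python
# Pairwise distances from the overview slide.
distances = {
    ("Hbb_Human", "Hbb_Horse"): 0.17,
    ("Hbb_Human", "Hba_Human"): 0.59,
    ("Hbb_Horse", "Hba_Human"): 0.60,
    ("Hbb_Human", "Hba_Horse"): 0.59,
    ("Hbb_Horse", "Hba_Horse"): 0.59,
    ("Hba_Human", "Hba_Horse"): 0.13,
    ("Hbb_Human", "Myg_Whale"): 0.77,
    ("Hbb_Horse", "Myg_Whale"): 0.77,
    ("Hba_Human", "Myg_Whale"): 0.75,
    ("Hba_Horse", "Myg_Whale"): 0.75,
}


def mean_identity(name: str, distances: dict) -> float:
    """Average percent identity of `name` to every other sequence,
    assuming identity = 100 * (1 - distance)."""
    values = [100 * (1 - d) for pair, d in distances.items() if name in pair]
    return sum(values) / len(values)


def delayed_sequences(distances: dict, cutoff: float = 40.0) -> list:
    """Names whose mean identity is below the cutoff, in sorted order."""
    names = sorted({n for pair in distances for n in pair})
    return [n for n in names if mean_identity(n, distances) < cutoff]


delayed = delayed_sequences(distances)
print("align last:", delayed)
```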
Advice on progressive alignment Progressive alignment is a mathematical process that is completely independent of biological reality. Can be a very good estimate Can be an impossibly poor estimate. Requires user input and skill. Treat cautiously Can be improved by eye (usually) Often helps to have colour-coding. Depending on the use, the user should be able to make a judgement on those regions that are reliable or not. For phylogeny reconstruction, only use those positions whose hypothesis of positional homology is unimpeachable
Alignment of protein-coding DNA sequences: It is not very sensible to align the DNA sequences of protein-coding genes.
Unaligned:
  ATGCTGTTAGGG
  ATGACTCTGTTAGGG
Aligned as DNA:
  ATG-CT--GTTAGGG
  ATGACTCTGTTAGGG
The result might be highly implausible and might not reflect what is known about biological processes. It is much more sensible to translate the sequences to their corresponding amino acid sequences, align these protein sequences and then put the gaps in the DNA sequences according to where they are found in the amino acid alignment.
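The last step, putting the gaps back into the DNA, can be sketched as below. The two DNA sequences are the ones on the slide (they translate to MLLG and MTLLG); the protein alignment `M-LLG` / `MTLLG` is supplied by hand as an assumed input rather than computed, and the function name `thread_codons` is made up for the example.

```python
def thread_codons(aligned_protein: str, dna: str) -> str:
    """Lay the codons of `dna` onto an aligned protein sequence:
    each residue takes its codon, each protein gap becomes '---'."""
    codons = iter(dna[i:i + 3] for i in range(0, len(dna), 3))
    return "".join("---" if aa == "-" else next(codons) for aa in aligned_protein)


dna_1 = "ATGCTGTTAGGG"     # ATG CTG TTA GGG     -> M L L G
dna_2 = "ATGACTCTGTTAGGG"  # ATG ACT CTG TTA GGG -> M T L L G

# Assumed protein alignment (not computed here).
protein_1 = "M-LLG"
protein_2 = "MTLLG"

codon_aligned_1 = thread_codons(protein_1, dna_1)
codon_aligned_2 = thread_codons(protein_2, dna_2)
print(codon_aligned_1)
print(codon_aligned_2)
```

The gap now spans a whole codon, so the reading frame is preserved, unlike the direct DNA alignment shown on the slide.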
Manual Alignment- software:
GDE- The Genetic Data Environment (UNIX)
CINEMA- Java applet available from: http://www.biochem.ucl.ac.uk
Seqapp/Seqpup- Mac/PC/UNIX available from: http://iubio.bio.indiana.edu
SeAl for Macintosh, available from: http://evolve.zoo.ox.ac.uk/Se-Al/Se-Al.html
BioEdit for PC, available from: http://www.mbio.ncsu.edu/RNaseP/info/programs/BIOEDIT/bioedit.html

Alignments

  • 1. Multiple Sequence Alignment James McInerney bioinf4biologists Feb. 2009
  • 2. Alignment can be easy or difficult Easy Difficult due to insertions or deletions (indels)
  • 3. Homology: Definition Homology: similarity that is the result of inheritance from a common ancestor - identification and analysis of homologies is central to phylogenetic systematics. An Alignment is an hypothesis of positional homology between bases/Amino Acids.
  • 4. Multiple Sequence Alignment- Goals To generate a concise, information-rich summary of sequence data. Sometimes used to illustrate the dissimilarity between a group of sequences. Alignments can be treated as models that can be used to test hypotheses. Used to identify homologous residues within sequences.
  • 5. Multiple sequence alignments - problems All sequences show some similarity (even random sequences). Similarity levels might be high in some parts of the sequence and low in other parts. Sequences might show substantial length variation and presence/absence of various domains.
  • 6.  
  • 7.  
  • 8. SSU rRNA Structural RNA (not translated) Found in the small ribosomal subunit. Widely-used for phylogeny reconstruction (found in every species) Contains stem and loop structures. Stem structures usually conform to watson-crick base pairing.
  • 9. Alignment of 16S rRNA can be guided by secondary structure Alignment of 16S rRNA sequences from different bacteria
  • 10. Protein Alignment may be guided by Tertiary Structure Interactions Homo sapiens DjlA protein Escherichia coli DjlA protein
  • 11. Multiple Sequence Alignment- Methods 3 main methods of alignment: Manual (using custom-built text editors). Automatic (using custom-built alignment software). Combined
  • 12. Manual Alignment - reasons Might be carried out because: Alignment is easy. There is some extraneous information (structural). Automated alignment methods have encountered the local minimum problem. An automated alignment method can be “improved”.
  • 13. Local minimum GARFIELDTHEFAT---CAT GARFIELDTHEFATFATCAT
  • 14. The dotplot provides a way of quickly visualizing the similarities between all parts of two sequences simultaneously. Lets consider a dotplot between sperm whale and human myoglobins Dotplots Sperm whale myoglobin GLSDGEWQLV LNVWGKVEAD IPGHGQEVLI RLFKGHPETL EKFDKFKHLK SEDEMKASED LKKHGATVLT ALGGILKKKG HHEAEIKPLA QSHATKHKIP VKYLEFISEC IIQVLQSKHP GDFGADAQGA MNKALELFRK DMASNYKELG FQG human myoglobin VLSEGEWQLV LHVWAKVEAD VAGHGQDILI RLFKSHPETL EKFDRFKHLK TEAEMKASED LKKHGVTVLT ALGAILKKKG HHEAELKPLA QSHATKHKIP IKYLEFISEA IIHVLHSRHP GDFGADAQGA MNKALELFRK DIAAKYKELG YQG
  • 15. Put one sequence on top the other on the side where residues are identical put a dot Diagonal lines of dots show similarities Dotplot example sperm whale vs human myg Sperm whale myoglobin  G L S D G E W Q L V ... V * L * * S * E * G * * E * W * Q * L * * V * * . . . Human myoglobin  Just do the first 10 amino acids of each Make a table with whale sequence on top human sequence on the side
  • 16. This is the result for the whole sequence It is easy to see that the diagonal is a line of dots. So sperm whale and human myoglobin are very similar But the picture is noisy can smooth using a sliding window which considers neighbouring residues as well Dotplot example sperm whale vs human myg Sperm whale myoglobin  G L S D G E W Q L V ... V * L * * S * E * G * * E * W * Q * L * * V * * . . . Human myoglobin 
  • 17. can smooth noise using a sliding window which considers neighbouring residues as well Have done this here can see the diagonal is highly similar Also instead of using using a simple identity use a scoring matrix Dotplot example sperm whale vs human myg
  • 18. Dotplots in practice The best tool is an applet* called dotlet www.isrec.isb-sib.ch/java/dotlet/Dotlet.html www.bip.bham.ac.uk/dotlet/Dotlet.html * an applet is a program that runs in a web browser. This means that you can produce dotplots within a netscape/IE window. Dotplots are often useful to identify things like repeated domains or duplications in big proteins...
  • 19. Example dotplot - repeated domains in Drosophila melanogaster SLIT protein. Protein has many repeats SLIT_DROME (P24014): MAAPSRTTLMPPPFRLQLRLLILPILLLLRHDAVHAEPYSGGFGSSAVSSGGLGSVGIHIPGGGVGVITEARCPRVCSCT GLNVDCSHRGLTSVPRKISADVERLELQGNNLTVIYETDFQRLTKLRMLQLTDNQIHTIERNSFQDLVSLERLDISNNVI TTVGRRVFKGAQSLRSLQLDNNQITCLDEHAFKGLVELEILTLNNNNLTSLPHNIFGGLGRLRALRLSDNPFACDCHLSW LSRFLRSATRLAPYTRCQSPSQLKGQNVADLHDQEFKCSGLTEHAPMECGAENSCPHPCRCADGIVDCREKSLTSVPVTL PDDTTDVRLEQNFITELPPKSFSSFRRLRRIDLSNNNISRIAHDALSGLKQLTTLVLYGNKIKDLPSGVFKGLGSLRLLL LNANEISCIRKDAFRDLHSLSLLSLYDNNIQSLANGTFDAMKSMKTVHLAKNPFICDCNLRWLADYLHKNPIETSGARCE SPKRMHRRRIESLREEKFKCSWGELRMKLSGECRMDSDCPAMCHCEGTTVDCTGRRLKEIPRDIPLHTTELLLNDNELGR ISSDGLFGRLPHLVKLELKRNQLTGIEPNAFEGASHIQELQLGENKIKEISNKMFLGLHQLKTLNLYDNQISCVMPGSFE HLNSLTSLNLASNPFNCNCHLAWFAECVRKKSLNGGAARCGAPSKVRDVQIKDLPHSEFKCSSENSEGCLGDGYCPPSCT CTGTVVACSRNQLKEIPRGIPAETSELYLESNEIEQIHYERIRHLRSLTRLDLSNNQITILSNYTFANLTKLSTLIISYN KLQCLQRHALSGLNNLRVVSLHGNRISMLPEGSFEDLKSLTHIALGSNPLYCDCGLKWFSDWIKLDYVEPGIARCAEPEQ MKDKLILSTPSSSFVCRGRVRNDILAKCNACFEQPCQNQAQCVALPQREYQCLCQPGYHGKHCEFMIDACYGNPCRNNAT CTVLEEGRFSCQCAPGYTGARCETNIDDCLGEIKCQNNATCIDGVESYKCECQPGFSGEFCDTKIQFCSPEFNPCANGAK CMDHFTHYSCDCQAGFHGTNCTDNIDDCQNHMCQNGGTCVDGINDYQCRCPDDYTGKYCEGHNMISMMYPQTSPCQNHEC KHGVCFQPNAQGSDYLCRCHPGYTGKWCEYLTSISFVHNNSFVELEPLRTRPEANVTIVFSSAEQNGILMYDGQDAHLAV ELFNGRIRVSYDVGNHPVSTMYSFEMVADGKYHAVELLAIKKNFTLRVDRGLARSIINEGSNDYLKLTTPMFLGGLPVDP AQQAYKNWQIRNLTSFKGCMKEVWINHKLVDFGNAQRQQKITPGCALLEGEQQEEEDDEQDFMDETPHIKEEPVDPCLEN KCRRGSRCVPNSNARDGYQCKCKHGQRGRYCDQGEGSTEPPTVTAASTCRKEQVREYYTENDCRSRQPLKYAKCVGGCGN QCCAAKIVRRRKVRMVCSNNRKYIKNLDIVRKCGCTKKCY Perform a dotplot of the SLIT protein against itself www.bio.bham.ac.uk/dotlet/Dotlet.html.
  • 20. Example dotplot - repeated domains in Drosophila melanogaster SLIT protein Swiss-prot entry For further discussion of dotplot see Attwood and Parry-Smith p116-8
  • 21. Dynamic programming 2 methods: Dynamic programming Consider 2 protein sequences of 100 amino acids in length. If it takes 100 2 seconds to exhaustively align these sequences, then it will take 100 3 seconds to align 3 sequences, 100 4 to align 4 sequences...etc. More time than the universe has existed to align 20 sequences exhaustively. Progressive alignment
  • 22. Progressive Alignment Devised by Feng and Doolittle in 1987. Essentially a heuristic method and as such is not guaranteed to find the ‘optimal’ alignment. Requires n-1+n-2+n-3...n-n+1 pairwise alignments as a starting point Most successful implementation is Clustal (Des Higgins). This software is cited 3,000 times per year in the scientific literature.
  • 23. Overview of ClustalW Procedure 1 PEEKSAVTALWGKVN--VDEVGG 2 GEEKAAVLALWDKVN--EEEVGG 3 PADKTNVKAAWGKVGAHAGEYGA 4 AADKTNVKAAWSKVGGHAGEYGA 5 EHEWQLVLHVWAKVEADVAGHGQ Hbb_Human 1 - Hbb_Horse 2 .17 - Hba_Human 3 .59 .60 - Hba_Horse 4 .59 .59 .13 - Myg_Whale 5 .77 .77 .75 .75 - Hbb_Human Hbb_Horse Hba_Horse Hba_Human Myg_Whale 2 1 3 4 2 1 3 4 alpha-helices Quick pairwise alignment: calculate distance matrix Neighbor-joining tree (guide tree) Progressive alignment following guide tree CLUSTAL W
  • 24. ClustalW- Pairwise Alignments First perform all possible pairwise alignments between each pair of sequences. There are ( n-1)+(n-2)...(n-n+1) possibilities. Calculate the ‘distance’ between each pair of sequences based on these isolated pairwise alignments. Generate a distance matrix.
  • 25. Path Graph for aligning two sequences.
  • 26. Possible alignment 1 1 0 1 0 -1 Scoring Scheme: Match: +1 Mismatch: 0 Indel: -1 Score for this path= 2
  • 27. Alignment using this path GATTC- GAATTC 1 1 0 1 0 -1
  • 28. Optimal Alignment 1 1 1 -1 1 1 1 Alignment score: 4 Alignment using this path GA-TTC GAATTC
  • 29. Optimal Alignment 2 1 -1 1 1 1 1 Alignment score: 4 Alignment using this path G-ATTC GAATTC
  • 30. Alignment of 3 sequences
  • 31. ClustalW- Guide Tree Generate a Neighbor-Joining ‘guide tree’ from these pairwise distances. This guide tree gives the order in which the progressive alignment will be carried out.
  • 32. Neighbor joining method The neighbor-joining method is a greedy heuristic which joins at each step, the two closest sub-trees that are not already joined. It is based on the minimum evolution principle. neighbors are defined as two taxa that are connected by a single node in an unrooted tree A B Node 1
  • 33. Distance Matrix What is required for the Neighbour joining method? Distance matrix
  • 34. First Step PAM distance 3.3 (Human - Monkey) is the minimum. So we'll join Human and Monkey to MonHum and we'll calculate the new distances. Mon-Hum Monkey Human Spinach Mosquito Rice
  • 35. Calculation of New Distances After we have joined two species in a subtree we have to compute the distances from every other node to the new subtree. We do this with a simple average of distances: Dist[Spinach, MonHum] = (Dist[Spinach, Monkey] + Dist[Spinach, Human])/2 = (90.8 + 86.3)/2 = 88.55 Mon-Hum Monkey Human Spinach
  • 36. Next Cycle Human Mosquito Mon-Hum Monkey Spinach Rice Mos-(Mon-Hum)
  • 37. Penultimate Cycle Human Mosquito Mon-Hum Monkey Spinach Rice Mos-(Mon-Hum) Spin-Rice
  • 38. Last Joining Human Mosquito Mon-Hum Monkey Spinach Rice Mos-(Mon-Hum) Spin-Rice (Spin-Rice)-(Mos-(Mon-Hum))
  • 39. Unrooted Neighbor-Joining Tree Human Monkey Mosquito Rice Spinach
  • 40. Multiple Alignment- First pair Align the two most closely-related sequences first. This alignment is then ‘fixed’ and will never change. If a gap is to be introduced subsequently, then it will be introduced in the same place in both sequences , but their relative alignment remains unchanged.
  • 41. ClustalW- Decision time Consult the guide tree to see what alignment is performed next. Align a third sequence to the first two Or Align two entirely different sequences to each other. Option 1 Option 2
  • 42. ClustalW- Alternative 1 If, on the other hand, two separate sequences have to be aligned together, then the first pairwise alignment is placed to one side and the pairwise alignment of the other two is carried out. If the situation arises where a third sequence is aligned to the first two, then when a gap has to be introduced to improve the alignment, each of these two entities are treated as two single sequences. + ClustalW- Alternative 2 +
  • 43. ClustalW- Progression The alignment is progressively built up in this way, with each step being treated as a pairwise alignment, sometimes with each member of a ‘pair’ having more than one sequence.
  • 44. Progressive alignment - step 1 1. gctcgatacgatacgatgactagcta 2. gctcgatacaagacgatgacagcta 3. gctcgatacacgatgactagcta 4. gctcgatacacgatgacgagcga 5. ctcgaacgatacgatgactagct 1. gctcgatacgatacgatgactagcta 2. gctcgatacaagacgatgac-agcta 1 2 3 4 5
  • 45. Progressive alignment - step 2 1. gctcgatacgatacgatgactagcta 2. gctcgatacaagacgatgacagcta 3. gctcgatacacgatgactagcta 4. gctcgatacacgatgacgagcga 5. ctcgaacgatacgatgactagct 3. gctcgatacacgatgactagcta 4. gctcgatacacgatgacgagcga 1 2 3 4 5
  • 46. Progressive alignment - step 3 1. gctcgatacgatacgatgactagcta 2. gctcgatacaagacgatgac-agcta + 3. gctcgatacacgatgactagcta 4. gctcgatacacgatgacgagcga 1. gctcgatacgatacgatgactagcta 2. gctcgatacaagacgatgac-agcta 3. gctcgatacacga---tgactagcta 4. gctcgatacacga---tgacgagcga 1 2 3 4 5
  • 47. Progressive alignment - final step 1. gctcgatacgatacgatgactagcta 2. gctcgatacaagacgatgac-agcta 3. gctcgatacacga---tgactagcta 4. gctcgatacacga---tgacgagcga + 5. ctcgaacgatacgatgactagct 1. gctcgatacgatacgatgactagcta 2. gctcgatacaagacgatgac-agcta 3. gctcgatacacga---tgactagcta 4. gctcgatacacga---tgacgagcga 5. -ctcga-acgatacgatgactagct- 1 2 3 4 5
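The order of the four steps above follows from the guide tree. As a rough sketch, assume the tree for the five example sequences can be written as the nested tuple `(((1, 2), (3, 4)), 5)`; this encoding is inferred from the order of the steps in the slides and is not a ClustalW file format. A post-order walk then lists which groups are aligned at each step. The function name `merge_order` is hypothetical.

```python
def merge_order(tree, steps):
    """Walk a nested-tuple guide tree and record each pairwise merge in order.

    Returns the list of sequence labels under `tree`; appends one
    (left_group, right_group) entry to `steps` per internal node.
    """
    if not isinstance(tree, tuple):
        return [tree]  # a leaf: a single sequence
    left = merge_order(tree[0], steps)
    right = merge_order(tree[1], steps)
    steps.append((left, right))  # align the two groups as if they were a pair
    return left + right


steps = []
merge_order((((1, 2), (3, 4)), 5), steps)
for number, (left, right) in enumerate(steps, start=1):
    print(f"step {number}: align {left} with {right}")
```

Each recorded step is a pairwise alignment in which either member of the "pair" may already contain more than one sequence.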
  • 48. ClustalW-Good points/Bad points Advantages: Speed. Disadvantages: No objective function. No way of quantifying whether or not the alignment is good. No way of knowing if the alignment is ‘correct’.
  • 49. ClustalW-Local Minimum Potential problems: Local minimum problem. If an error is introduced early in the alignment process, it is impossible to correct this later in the procedure. Arbitrary alignment.
  • 50. Increasing the sophistication of the alignment process. Should we treat all the sequences in the same way? - even though some sequences are closely-related and some sequences are distant relatives. Should we treat all positions in the sequences as though they were the same? - even though they might have different functions and different locations in the 3-dimensional structure.
  • 51.  
  • 52. ClustalW- Caveats Sequence weighting. Varying substitution matrices. Residue-specific gap penalties and reduced penalties in hydrophilic regions (external regions of protein sequences) encourage gaps in loops rather than in core regions. Positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage openings in subsequent alignments.
  • 53. ClustalW- User-supplied values Two penalties are set by the user (there are default values, but you should know that it is possible to change these). GOP - Gap Opening Penalty is the cost of opening a gap in an alignment. GEP - Gap Extension Penalty is the cost of extending this gap.
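How the two penalties combine can be sketched with an affine gap cost. The formula used here, GOP + GEP × gap length, is one common convention and is an assumption, as are the function names and the numeric values (they are illustrative, not ClustalW defaults). Terminal gaps are charged like internal gaps, which is a simplification.

```python
import re


def gap_cost(length, gop, gep):
    """Affine cost of a single gap: pay GOP once, plus GEP per gapped position."""
    if length <= 0:
        return 0.0
    return gop + gep * length


def row_gap_penalty(row, gop, gep):
    """Total gap penalty for one aligned row, summing over each run of '-'."""
    return sum(gap_cost(len(run.group()), gop, gep) for run in re.finditer(r"-+", row))


# Rows 3 and 5 of the final alignment in the worked example.
one_long_gap = "gctcgatacacga---tgactagcta"
three_short_gaps = "-ctcga-acgatacgatgactagct-"

print(row_gap_penalty(one_long_gap, gop=10.0, gep=0.5))
print(row_gap_penalty(three_short_gaps, gop=10.0, gep=0.5))
```

Both rows contain three gap characters, but three separate gaps cost far more than one gap of length three, because the opening penalty is paid each time.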
  • 54. Position-Specific gap penalties Before any pair of (groups of) sequences is aligned, a table of GOPs is generated for each position in the two (sets of) sequences. The GOP is manipulated in a position-specific manner, so that it can vary over the sequences. If there is a gap at a position, the GOP and GEP penalties are lowered and the other rules do not apply. This makes gaps more likely at positions where gaps already exist.
  • 55. Discouraging too many gaps If there is no gap opened, then the GOP is increased if the position is within 8 residues of an existing gap. This discourages gaps that are too close together. At any position within a run of hydrophilic residues, the GOP is decreased. These runs usually indicate loop regions in protein structures. A run of 5 hydrophilic residues is considered to be a hydrophilic stretch. The default hydrophilic residues are: D, E, G, K, N, Q, P, R, S. This can be changed by the user.
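The position-specific rules can be sketched as a small table builder. This is an illustration only: the scaling factors (`lowered`, `raised`, `hydrophilic`) are hypothetical placeholders rather than the values ClustalW uses, "within 8 residues" is read as a column distance of at most 8, and giving the near-gap rule priority over the hydrophilic rule is an assumption. Only the GOP is handled; the GEP is left out.

```python
HYDROPHILIC = set("DEGKNQPRS")  # default hydrophilic residues listed in the slides


def position_specific_gops(group, gop, lowered=0.25, raised=2.0,
                           hydrophilic=0.5, window=8, run=5):
    """Return one GOP per column of an aligned group (list of equal-length rows)."""
    width = len(group[0])
    gap_cols = [c for c in range(width) if any(row[c] == "-" for row in group)]
    gap_set = set(gap_cols)

    # Mark columns lying inside a run of at least `run` hydrophilic residues.
    in_run = [False] * width
    for row in group:
        start = 0
        while start < width:
            end = start
            while end < width and row[end].upper() in HYDROPHILIC:
                end += 1
            if end - start >= run:
                for c in range(start, end):
                    in_run[c] = True
            start = max(end, start + 1)

    table = []
    for c in range(width):
        if c in gap_set:
            table.append(gop * lowered)       # existing gap: other rules ignored
        elif any(abs(c - g) <= window for g in gap_cols):
            table.append(gop * raised)        # close to an existing gap
        elif in_run[c]:
            table.append(gop * hydrophilic)   # likely loop region
        else:
            table.append(gop)
    return table


# Toy protein rows: one gap at column 12, a hydrophilic run at columns 25-29.
group = ["A" * 12 + "-" + "A" * 12 + "DEGKN" + "AAA", "A" * 33]
table = position_specific_gops(group, gop=10.0)
print(table)
```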
  • 56. Divergent Sequences The most divergent sequences (most different, on average, from all of the other sequences) are usually the most difficult to align. It is sometimes better to delay their alignment until later (when the easier sequences have already been aligned). The user has the choice of setting a cutoff (default is 40% identity). Sequences below the cutoff are held back until the others have been aligned.
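A minimal sketch of the delay rule, assuming pairwise percent identities are already available. The sequence names and identity values below are made up for illustration, the function name `split_by_divergence` is hypothetical, and using the mean identity to all other sequences as the test is an assumption based on the wording above.

```python
def split_by_divergence(names, identity, cutoff=40.0):
    """Split sequences into those aligned first and those delayed.

    `identity` maps frozenset({a, b}) to the percent identity of a and b.
    A sequence is delayed when its mean identity to all others is below `cutoff`.
    """
    means = {}
    for name in names:
        others = [identity[frozenset((name, other))] for other in names if other != name]
        means[name] = sum(others) / len(others)
    early = [n for n in names if means[n] >= cutoff]
    delayed = [n for n in names if means[n] < cutoff]
    return early, delayed


# Made-up identities for four hypothetical sequences.
names = ["A", "B", "C", "D"]
identity = {
    frozenset(("A", "B")): 90.0,
    frozenset(("A", "C")): 80.0,
    frozenset(("B", "C")): 85.0,
    frozenset(("A", "D")): 30.0,
    frozenset(("B", "D")): 35.0,
    frozenset(("C", "D")): 25.0,
}

early, delayed = split_by_divergence(names, identity)
print("align first:", early)
print("delay:", delayed)
```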
  • 57. Advice on progressive alignment Progressive alignment is a mathematical process that is completely independent of biological reality. Can be a very good estimate. Can be an impossibly poor estimate. Requires user input and skill. Treat cautiously. Can be improved by eye (usually). Often helps to have colour-coding. Depending on the use, the user should be able to make a judgement on which regions are reliable and which are not. For phylogeny reconstruction, only use those positions whose hypothesis of positional homology is unimpeachable.
  • 58. Alignment of protein-coding DNA sequences It is not very sensible to align the DNA sequences of protein-coding genes directly. Unaligned: ATGCTGTTAGGG ATGACTCTGTTAGGG. Aligned as DNA: ATG-CT--GTTAGGG ATGACTCTGTTAGGG. The result might be highly implausible and might not reflect what is known about biological processes. It is much more sensible to translate the sequences to their corresponding amino acid sequences, align these protein sequences and then put the gaps in the DNA sequences according to where they are found in the amino acid alignment.
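The last step, putting the gaps back into the DNA, can be sketched as follows. The function name `thread_dna` is hypothetical, and the protein alignment (`M-LLG` against `MTLLG`) is assumed to have been produced separately by aligning the translations of the two example sequences.

```python
def thread_dna(dna, aligned_protein):
    """Place gaps in a coding DNA sequence to match its aligned protein sequence.

    Each residue in `aligned_protein` consumes one codon of `dna`;
    each '-' becomes a three-base gap '---'.
    """
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    residues = sum(1 for aa in aligned_protein if aa != "-")
    if len(dna) % 3 != 0 or residues != len(codons):
        raise ValueError("DNA length does not match the number of residues")
    remaining = iter(codons)
    return "".join("---" if aa == "-" else next(remaining) for aa in aligned_protein)


# The two sequences from the slide, threaded onto their protein alignment.
short_row = thread_dna("ATGCTGTTAGGG", "M-LLG")
long_row = thread_dna("ATGACTCTGTTAGGG", "MTLLG")
print(short_row)
print(long_row)
```

The gap now spans exactly one codon, so the reading frame is preserved, unlike the DNA-level alignment shown on the slide.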
  • 59. Manual Alignment- software GDE- The Genetic Data Environment (UNIX). CINEMA- Java applet available from: http://www.biochem.ucl.ac.uk Seqapp/Seqpup- Mac/PC/UNIX available from: http://iubio.bio.indiana.edu SeAl for Macintosh, available from: http://evolve.zoo.ox.ac.uk/Se-Al/Se-Al.html BioEdit for PC, available from: http://www.mbio.ncsu.edu/RNaseP/info/programs/BIOEDIT/bioedit.html