SlideShare a Scribd company logo
BLAST and FASTA


                  1
Pairwise Alignment

          Global                        Local
• Best score from among       • Best score from among
  alignments of full-length     alignments of partial
  sequences                     sequences
• Needelman-Wunch             • Smith-Waterman
  algorithm                     algorithm




                                                        2
Why do we need local alignments?

 •   To compare a short sequence to a large one.

 •   To compare a single sequence to an entire
     database

 •   To compare a partial sequence to the whole.



                                                   3
Why do we need local alignments?
 • Identify newly determined sequences
 • Compare new genes to known ones
 • Guess functions for entire genomes full of
   ORFs of unknown function




                                                4
Mathematical Basis
for Local Alignment
• Model matches as a sequence of coin
  tosses
• Let p be the probability of “head”
   – For a “fair” coin, p = 0.5
• According to Paul Erdös-Alfréd Rényi
  law:
  If there are n throws, then the expected
  length, R, of the longest run of “heads”
  is
               R = log1/p (n).               Paul Erdös
                                                          5
Mathematical Basis
for Local Alignment

• Example: Suppose n = 20 for a “fair” coin
             R=log2(20)=4.32
• Problem: How does one model DNA (or
  amino acid) alignments as coin tosses.




                                              6
Modeling Sequence Alignments
• To model random sequence alignments, replace a match by
  “head” (H) and mismatch by “tail” (T).

             AATCAT
                              HTHHHT
             ATTCAG

• For ungapped DNA alignments, the probability of a “head”
  is 1/4.

• For ungapped amino acid alignments, the probability of a
  “head” is 1/20.
                                                             7
Modeling Sequence Alignments
• Thus, for any one particular alignment, the Erdös-
  Rényi law can be applied
• What about for all possible alignments?
   – Consider that sequences can being shifted back and
     forth in the dot matrix plot
• The expected length of the longest match is
                   R = log1/p(mn)
  where m and n are the lengths of the two
  sequences.
                                                          8
Modeling Sequence Alignments
• Suppose m = n = 10, and we deal with DNA
  sequences
            R = log4(100) = 3.32
• This analysis assumes that the base
  composition is uniform and the alignment is
  ungapped. The result is approximate, but
  not bad.

                                            9
10
Heuristic Methods: FASTA and BLAST

FASTA
• First fast sequence searching algorithm for
  comparing a query sequence against a database.

BLAST
• Basic Local Alignment Search Technique
  improvement of FASTA: Search speed, ease of
  use, statistical rigor.
                                              11
FASTA and BLAST
• Basic idea: a good alignment contains
  subsequences of absolute identity (short lengths
  of exact matches):

  – First, identify very short exact matches.
  – Next, the best short hits from the first step are
    extended to longer regions of similarity.
  – Finally, the best hits are optimized.


                                                        12
FASTA
Derived from logic of the dot plot
  – compute best diagonals from all frames of
    alignment
The method looks for exact matches between
 words in query and test sequence
  – DNA words are usually 6 nucleotides long
  – protein words are 2 amino acids long



                                                13
FASTA Algorithm




                  14
Makes Longest Diagonal
After all diagonals are found, tries to join
 diagonals by adding gaps

Computes alignments in regions of best
 diagonals


                                           15
FASTA Alignments




                   16
FASTA Results - Histogram
!!SEQUENCE_LIST 1.0
(Nucleotide) FASTA of: b2.seq from: 1 to: 693 December 9, 2002 14:02
TO: /u/browns02/Victor/Search-set/*.seq Sequences:     2,050 Symbols:
913,285 Word Size: 6
 Searching with both strands of the query.
 Scoring matrix: GenRunData:fastadna.cmp
 Constant pamfactor used
 Gap creation penalty: 16 Gap extension penalty: 4

Histogram Key:
 Each histogram symbol represents 4 search set sequences
 Each inset symbol represents 1 search set sequences
 z-scores computed from opt scores
z-score obs    exp
        (=)    (*)
< 20      0      0:
  22      0      0:
  24      3      0:=
  26      2      0:=
  28      5      0:==
  30     11      3:*==
  32     19     11:==*==
  34     38     30:=======*==
  36     58     61:===============*
  38     79    100:====================    *
  40    134    140:==================================*
  42    167    171:==========================================*
  44    205    189:===============================================*====
  46    209    192:===============================================*=====   17
  48    177    184:=============================================*
FASTA Results - List
The best scores are:                   init1 initn      opt     z-sc E(1018780)..

SW:PPI1_HUMAN    Begin: 1 End: 269
! Q00169 homo sapiens (human). phosph... 1854   1854   1854   2249.3   1.8e-117
SW:PPI1_RABIT    Begin: 1 End: 269
! P48738 oryctolagus cuniculus (rabbi... 1840   1840   1840   2232.4   1.6e-116
SW:PPI1_RAT    Begin: 1 End: 270
! P16446 rattus norvegicus (rat). pho... 1543   1543   1837   2228.7   2.5e-116
SW:PPI1_MOUSE    Begin: 1 End: 270
! P53810 mus musculus (mouse). phosph... 1542   1542   1836   2227.5   2.9e-116
SW:PPI2_HUMAN    Begin: 1 End: 270
! P48739 homo sapiens (human). phosph... 1533   1533   1533   1861.0   7.7e-96
SPTREMBL_NEW:BAC25830    Begin: 1 End: 270
! Bac25830 mus musculus (mouse). 10, ... 1488   1488   1522   1847.6   4.2e-95
SP_TREMBL:Q8N5W1    Begin: 1 End: 268
! Q8n5w1 homo sapiens (human). simila... 1477   1477   1522   1847.6   4.3e-95
SW:PPI2_RAT    Begin: 1 End: 269
! P53812 rattus norvegicus (rat). pho... 1482   1482   1516   1840.4   1.1e-94




                                                                                    18
FASTA Results - Alignment
SCORES   Init1: 1515 Initn: 1565 Opt: 1687 z-score: 1158.1 E(): 2.3e-58
>>GB_IN3:DMU09374                                         (2038 nt)
 initn: 1565 init1: 1515 opt: 1687 Z-score: 1158.1 expect(): 2.3e-58
  66.2% identity in 875 nt overlap
 (83-957:151-1022)

                   60        70        80         90      100       110
u39412.gb_pr CCCTTTGTGGCCGCCATGGACAATTCCGGGAAGGAAGCGGAGGCGATGGCGCTGTTGGCC
                                            || ||| | ||||| |    ||| |||||
DMU09374     AGGCGGACATAAATCCTCGACATGGGTGACAACGAACAGAAGGCGCTCCAACTGATGGCC
                    130       140       150        160      170       180

                  120       130       140       150       160       170
u39412.gb_pr GAGGCGGAGCGCAAAGTGAAGAACTCGCAGTCCTTCTTCTCTGGCCTCTTTGGAGGCTCA
             |||||||||   || |||    |   | || ||| |         || || ||||| ||
DMU09374     GAGGCGGAGAAGAAGTTGACCCAGCAGAAGGGCTTTCTGGGATCGCTGTTCGGAGGGTCC
                    190       200       210       220       230       240

                  180       190       200       210       220       230
u39412.gb_pr TCCAAAATAGAGGAAGCATGCGAAATCTACGCCAGAGCAGCAAACATGTTCAAAATGGCC
               ||| | ||||| ||    |||   ||||    | || | |||||||| || ||| ||
DMU09374     AACAAGGTGGAGGACGCCATCGAGTGCTACCAGCGGGCGGGCAACATGTTTAAGATGTCC
                    250       260       270       280       290       300

                  240       250       260       270       280       290
u39412.gb_pr AAAAACTGGAGTGCTGCTGGAAACGCGTTCTGCCAGGCTGCACAGCTGCACCTGCAGCTC
             ||||||||||     ||||| |     |||||| |||| |||   || ||| || |
DMU09374     AAAAACTGGACAAAGGCTGGGGAGTGCTTCTGCGAGGCGGCAACTCTACACGCGCGGGCT   19
                    310       320       330       340       350       360
FASTA on the Web

• Many websites offer
  FASTA searches
• Each server has its limits
• Be aware that you
  depend “on the kindness
  of strangers.”

                               20
Institut de Génétique Humaine, Montpellier France, GeneStream server
         http://www2.igh.cnrs.fr/bin/fasta-guess.cgi
Oak Ridge National Laboratory GenQuest server
         http://avalon.epm.ornl.gov/
European Bioinformatics Institute, Cambridge, UK
         http://www.ebi.ac.uk/htbin/fasta.py?request
EMBL, Heidelberg, Germany
         http://www.embl-heidelberg.de/cgi/fasta-wrapper-free
Munich Information Center for Protein Sequences (MIPS)
at Max-Planck-Institut, Germany
         http://speedy.mips.biochem.mpg.de/mips/programs/fasta.html
Institute of Biology and Chemistry of Proteins Lyon, France
         http://www.ibcp.fr/serv_main.html
Institute Pasteur, France
         http://central.pasteur.fr/seqanal/interfaces/fasta.html
GenQuest at The Johns Hopkins University
         http://www.bis.med.jhmi.edu/Dan/gq/gq.form.html
National Cancer Center of Japan
         http://bioinfo.ncc.go.jp

                                                                       21
FASTA Format
• simple format used by almost all programs
• >header line with a [return] at end
• Sequence (no specific requirements for line
  length, characters, etc)
>URO1 uro1.seq   Length: 2018   November 9, 2000 11:50   Type: N   Check: 3854   ..
CGCAGAAAGAGGAGGCGCTTGCCTTCAGCTTGTGGGAAATCCCGAAGATGGCCAAAGACA
ACTCAACTGTTCGTTGCTTCCAGGGCCTGCTGATTTTTGGAAATGTGATTATTGGTTGTT
GCGGCATTGCCCTGACTGCGGAGTGCATCTTCTTTGTATCTGACCAACACAGCCTCTACC
CACTGCTTGAAGCCACCGACAACGATGACATCTATGGGGCTGCCTGGATCGGCATATTTG
TGGGCATCTGCCTCTTCTGCCTGTCTGTTCTAGGCATTGTAGGCATCATGAAGTCCAGCA
GGAAAATTCTTCTGGCGTATTTCATTCTGATGTTTATAGTATATGCCTTTGAAGTGGCAT
CTTGTATCACAGCAGCAACACAACAAGACTTTTTCACACCCAACCTCTTCCTGAAGCAGA
TGCTAGAGAGGTACCAAAACAACAGCCCTCCAAACAATGATGACCAGTGGAAAAACAATG
GAGTCACCAAAACCTGGGACAGGCTCATGCTCCAGGACAATTGCTGTGGCGTAAATGGTC
CATCAGACTGGCAAAAATACACATCTGCCTTCCGGACTGAGAATAATGATGCTGACTATC
CCTGGCCTCGTCAATGCTGTGTTATGAACAATCTTAAAGAACCTCTCAACCTGGAGGCTT                          22
Assessing Alignment Significance
• Generate random alignments and
calculate their scores
• Compute the mean and the standard
deviation (SD) for random scores
• Compute the deviation of the actual score
from the mean of random scores
               Z = (meanX)/SD
• Evaluate the significance of the alignment
• The probability of a Z value is called the E
score
                                            23
E scores or E values
E scores are not equivalent to p
values where
             p < 0.05
are generally considered
statistically significant.
                               24
E values (rules of thumb)
E values below 10-6 are most probably
statistically significant.
E values above 10-6 but below 10-3
deserve a second look.
E values above 10-3 should not be
tossed aside lightly; they should be
thrown out with great force.           25
BLAST
• Basic Local Alignment Search Tool
  – Altschul et al. 1990,1994,1997
• Heuristic method for local alignment
• Designed specifically for database searches
• Based on the same assumption as FASTA
  that good alignments contain short lengths
  of exact matches
                                            26
BLAST
• Both BLAST and FASTA search for local
  sequence similarity - indeed they have exactly
  the same goals, though they use somewhat
  different algorithms and statistical approaches.

• BLAST benefits
  – Speed
  – User friendly
  – Statistical rigor
  – More sensitive
                                                27
Input/Output
• Input:
  – Query sequence Q
  – Database of sequences DB
  – Minimal score S

• Output:
  – Sequences from DB (Seq), such that Q and Seq
    have scores > S

                                               28
BLAST Searches GenBank
[BLAST= Basic Local Alignment Search Tool]
The NCBI BLAST web server lets you compare your
  query sequence to various sections of GenBank:
        –   nr = non-redundant (main sections)
        –   month = new sequences from the past few weeks
        –   refseq_rna
        –   RNA entries from NCBI's Reference Sequence project
        –   refseq_genomic
        –   Genomic entries from NCBI's Reference Sequence project
        –   ESTs
        –   Taxon = e.g., human, Drososphila, yeast, E. coli
        –   proteins (by automatic translation)
        –   pdb = Sequences derived from the 3-dimensional structure
            from Brookhaven Protein Data Bank
                                                                  29
BLAST
• Uses word matching like FASTA
• Similarity matching of words (3 amino acids, 11
  bases)
  – does not require identical words.
• If no words are similar, then no alignment
  – Will not find matches for very short sequences

• Does not handle gaps well
• “gapped BLAST” is somewhat better
                                                     30
BLAST Algorithm




                  31
BLAST Word Matching
MEAAVKEEISVEDEAVDKNI
MEA
 EAA
  AAV        Break query
    AVK
     VKE     into words:
      KEE
       EEI
         EIS
          ISV
          ...         Break database
                        sequences
                        into words:


                                       32
Find locations of matching words
       in database sequences

      ELEPRRPRYRVPDVLVADPPIARLSVSGRDENSVELT MEAT
MEA
EAA     TDVRWMSETGIIDVFLLLGPSISDVFRQYASLTGTQALPPLFSLGYHQSRWNY
AAV        IWLDIEEIHADGKRYFTWDPSRFPQPRTMLERLASKRRV KLVAIVDPH
AVK
KLV
KEE
EEI
EIS
ISV




                                                         33
Extend hits one base at a time




                                 34
Seq_XYZ:      HVTGRSAF_FSYYGYGCYCGLGTGKGLPVDATDRCCWA
Query:           QSVFDYIYYGCYCGWGLG_GK__PRDA

E-val=10-13




  •Use two word matches as anchors to build an alignment
  between the query and a database sequence.

  •Then score the alignment.
                                                     35
HSPs are Aligned Regions
• The results of the word matching and
  attempts to extend the alignment are
  segments
   - called HSPs (High-Scoring Segment
     Pairs)
• BLAST often produces several short HSPs
  rather than a single aligned region

                                            36
•   >gb|BE588357.1|BE588357 194087 BARC 5BOV Bos taurus cDNA 5'.
•             Length = 369
•    Score =    272 bits (137),   Expect = 4e-71
•    Identities = 258/297 (86%), Gaps = 1/297 (0%)
•    Strand = Plus / Plus
•
•   Query: 17    aggatccaacgtcgctccagctgctcttgacgactccacagataccccgaagccatggca 76
•                |||||||||||||||| | ||| | ||| || ||| | |||| ||||| |||||||||
•   Sbjct: 1     aggatccaacgtcgctgcggctacccttaaccact-cgcagaccccccgcagccatggcc 59
•
•   Query: 77    agcaagggcttgcaggacctgaagcaacaggtggaggggaccgcccaggaagccgtgtca 136
•                |||||||||||||||||||||||| | || ||||||||| | ||||||||||| ||| ||
•   Sbjct: 60    agcaagggcttgcaggacctgaagaagcaagtggagggggcggcccaggaagcggtgaca 119
•
•   Query: 137 gcggccggagcggcagctcagcaagtggtggaccaggccacagaggcggggcagaaagcc 196
•               |||||||| | || | ||||||||||||||| ||||||||||| || ||||||||||||
•   Sbjct: 120 tcggccggaacagcggttcagcaagtggtggatcaggccacagaagcagggcagaaagcc 179
•
•   Query: 197 atggaccagctggccaagaccacccaggaaaccatcgacaagactgctaaccaggcctct 256
•              ||||||||| | |||||||| |||||||||||||||||| ||||||||||||||||||||
•   Sbjct: 180 atggaccaggttgccaagactacccaggaaaccatcgaccagactgctaaccaggcctct 239
•
•   Query: 257 gacaccttctctgggattgggaaaaaattcggcctcctgaaatgacagcagggagac 313
•              || || ||||| || ||||||||||| | |||||||||||||||||| ||||||||
•   Sbjct: 240 gagactttctcgggttttgggaaaaaacttggcctcctgaaatgacagaagggagac 296




                                                                                    37
BLAST variants




                 38
39
40
41
42
43
Understanding BLAST output




                         44
45
46
47
48
49
50
51
52
53
Choosing the right parameters




                            54
55
56
57
Controlling the output




                         58
59
60
61
62
More on BLAST

NCBI Blast Glossary
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/glossary2.html

Education: Blast Information
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html

Steve Altschul's Blast Course
http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html




                                                             63

More Related Content

PDF
Gene prediction methods vijay
PPTX
Scoring matrices
PPTX
blast bioinformatics
PPTX
MULTIPLE SEQUENCE ALIGNMENT
PPT
Est database
Gene prediction methods vijay
Scoring matrices
blast bioinformatics
MULTIPLE SEQUENCE ALIGNMENT
Est database

What's hot (20)

PPTX
Multiple sequence alignment
PPT
Primary and secondary database
PPTX
Primary and secondary databases ppt by puneet kulyana
PPTX
Protein information resource (PIR)
PPTX
Sequence Alignment
PPTX
Transcriptome analysis
PPTX
Comparative genomics
PPTX
Sequence alig Sequence Alignment Pairwise alignment:-
PDF
Ab Initio Protein Structure Prediction
PPTX
Genomic mapping, genetic mapping
PPTX
Major databases in bioinformatics
PPTX
222397 lecture 16 17
PPTX
Orthologs,Paralogs & Xenologs
PDF
Gene prediction method
PDF
MEGA (Molecular Evolutionary Genetics Analysis)
PPTX
DNA data bank of japan (DDBJ)
PPTX
Genome sequencing
PPT
Clustal X
PPTX
Express sequence tags
Multiple sequence alignment
Primary and secondary database
Primary and secondary databases ppt by puneet kulyana
Protein information resource (PIR)
Sequence Alignment
Transcriptome analysis
Comparative genomics
Sequence alig Sequence Alignment Pairwise alignment:-
Ab Initio Protein Structure Prediction
Genomic mapping, genetic mapping
Major databases in bioinformatics
222397 lecture 16 17
Orthologs,Paralogs & Xenologs
Gene prediction method
MEGA (Molecular Evolutionary Genetics Analysis)
DNA data bank of japan (DDBJ)
Genome sequencing
Clustal X
Express sequence tags
Ad

Similar to Blast fasta 4 (20)

PPT
Similarity
PPTX
Fasta : steps, features, algorithm, result etc.
PPTX
Bioinformatics t5-databasesearching v2014
PPTX
Bioinformatica t4-alignments
PPTX
Bioinformatics t4-alignments wim_vancriekingev2013
PPTX
2016 bioinformatics i_database_searching_wimvancriekinge
PDF
Ch06 alignment
PDF
PPT
Phylogenetics1
PPT
PDF
MSc Thesis Presentation
PPTX
2015 bioinformatics database_searching_wimvancriekinge
PPT
De novo genome assembly - IMB Winter School - 7 July 2015
PDF
(SAC2020 SVT-2) Constrained Detecting Arrays for Fault Localization in Combin...
PDF
Presentation_Parallel GRASP algorithm for job shop scheduling
PDF
Lecture6.pptx
PDF
Jogging While Driving, and Other Software Engineering Research Problems (invi...
PPT
PDF
Representations for large-scale (Big) Sequence Data Mining
Similarity
Fasta : steps, features, algorithm, result etc.
Bioinformatics t5-databasesearching v2014
Bioinformatica t4-alignments
Bioinformatics t4-alignments wim_vancriekingev2013
2016 bioinformatics i_database_searching_wimvancriekinge
Ch06 alignment
Phylogenetics1
MSc Thesis Presentation
2015 bioinformatics database_searching_wimvancriekinge
De novo genome assembly - IMB Winter School - 7 July 2015
(SAC2020 SVT-2) Constrained Detecting Arrays for Fault Localization in Combin...
Presentation_Parallel GRASP algorithm for job shop scheduling
Lecture6.pptx
Jogging While Driving, and Other Software Engineering Research Problems (invi...
Representations for large-scale (Big) Sequence Data Mining
Ad

Recently uploaded (20)

PDF
Business Ethics Teaching Materials for college
PDF
Abdominal Access Techniques with Prof. Dr. R K Mishra
PDF
grade 11-chemistry_fetena_net_5883.pdf teacher guide for all student
PPTX
school management -TNTEU- B.Ed., Semester II Unit 1.pptx
PDF
Physiotherapy_for_Respiratory_and_Cardiac_Problems WEBBER.pdf
PDF
01-Introduction-to-Information-Management.pdf
PDF
Origin of periodic table-Mendeleev’s Periodic-Modern Periodic table
PPTX
BOWEL ELIMINATION FACTORS AFFECTING AND TYPES
PPTX
human mycosis Human fungal infections are called human mycosis..pptx
PDF
The Lost Whites of Pakistan by Jahanzaib Mughal.pdf
PPTX
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
PPTX
Renaissance Architecture: A Journey from Faith to Humanism
PPTX
Introduction_to_Human_Anatomy_and_Physiology_for_B.Pharm.pptx
PDF
3rd Neelam Sanjeevareddy Memorial Lecture.pdf
PDF
ANTIBIOTICS.pptx.pdf………………… xxxxxxxxxxxxx
PDF
Mark Klimek Lecture Notes_240423 revision books _173037.pdf
PDF
2.FourierTransform-ShortQuestionswithAnswers.pdf
PDF
Supply Chain Operations Speaking Notes -ICLT Program
PDF
Saundersa Comprehensive Review for the NCLEX-RN Examination.pdf
PDF
Pre independence Education in Inndia.pdf
Business Ethics Teaching Materials for college
Abdominal Access Techniques with Prof. Dr. R K Mishra
grade 11-chemistry_fetena_net_5883.pdf teacher guide for all student
school management -TNTEU- B.Ed., Semester II Unit 1.pptx
Physiotherapy_for_Respiratory_and_Cardiac_Problems WEBBER.pdf
01-Introduction-to-Information-Management.pdf
Origin of periodic table-Mendeleev’s Periodic-Modern Periodic table
BOWEL ELIMINATION FACTORS AFFECTING AND TYPES
human mycosis Human fungal infections are called human mycosis..pptx
The Lost Whites of Pakistan by Jahanzaib Mughal.pdf
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
Renaissance Architecture: A Journey from Faith to Humanism
Introduction_to_Human_Anatomy_and_Physiology_for_B.Pharm.pptx
3rd Neelam Sanjeevareddy Memorial Lecture.pdf
ANTIBIOTICS.pptx.pdf………………… xxxxxxxxxxxxx
Mark Klimek Lecture Notes_240423 revision books _173037.pdf
2.FourierTransform-ShortQuestionswithAnswers.pdf
Supply Chain Operations Speaking Notes -ICLT Program
Saundersa Comprehensive Review for the NCLEX-RN Examination.pdf
Pre independence Education in Inndia.pdf

Blast fasta 4

  • 2. Pairwise Alignment Global Local • Best score from among • Best score from among alignments of full-length alignments of partial sequences sequences • Needelman-Wunch • Smith-Waterman algorithm algorithm 2
  • 3. Why do we need local alignments? • To compare a short sequence to a large one. • To compare a single sequence to an entire database • To compare a partial sequence to the whole. 3
  • 4. Why do we need local alignments? • Identify newly determined sequences • Compare new genes to known ones • Guess functions for entire genomes full of ORFs of unknown function 4
  • 5. Mathematical Basis for Local Alignment • Model matches as a sequence of coin tosses • Let p be the probability of “head” – For a “fair” coin, p = 0.5 • According to Paul Erdös-Alfréd Rényi law: If there are n throws, then the expected length, R, of the longest run of “heads” is R = log1/p (n). Paul Erdös 5
  • 6. Mathematical Basis for Local Alignment • Example: Suppose n = 20 for a “fair” coin R=log2(20)=4.32 • Problem: How does one model DNA (or amino acid) alignments as coin tosses. 6
  • 7. Modeling Sequence Alignments • To model random sequence alignments, replace a match by “head” (H) and mismatch by “tail” (T). AATCAT HTHHHT ATTCAG • For ungapped DNA alignments, the probability of a “head” is 1/4. • For ungapped amino acid alignments, the probability of a “head” is 1/20. 7
  • 8. Modeling Sequence Alignments • Thus, for any one particular alignment, the Erdös- Rényi law can be applied • What about for all possible alignments? – Consider that sequences can being shifted back and forth in the dot matrix plot • The expected length of the longest match is R = log1/p(mn) where m and n are the lengths of the two sequences. 8
  • 9. Modeling Sequence Alignments • Suppose m = n = 10, and we deal with DNA sequences R = log4(100) = 3.32 • This analysis assumes that the base composition is uniform and the alignment is ungapped. The result is approximate, but not bad. 9
  • 10. 10
  • 11. Heuristic Methods: FASTA and BLAST FASTA • First fast sequence searching algorithm for comparing a query sequence against a database. BLAST • Basic Local Alignment Search Technique improvement of FASTA: Search speed, ease of use, statistical rigor. 11
  • 12. FASTA and BLAST • Basic idea: a good alignment contains subsequences of absolute identity (short lengths of exact matches): – First, identify very short exact matches. – Next, the best short hits from the first step are extended to longer regions of similarity. – Finally, the best hits are optimized. 12
  • 13. FASTA Derived from logic of the dot plot – compute best diagonals from all frames of alignment The method looks for exact matches between words in query and test sequence – DNA words are usually 6 nucleotides long – protein words are 2 amino acids long 13
  • 15. Makes Longest Diagonal After all diagonals are found, tries to join diagonals by adding gaps Computes alignments in regions of best diagonals 15
  • 17. FASTA Results - Histogram !!SEQUENCE_LIST 1.0 (Nucleotide) FASTA of: b2.seq from: 1 to: 693 December 9, 2002 14:02 TO: /u/browns02/Victor/Search-set/*.seq Sequences: 2,050 Symbols: 913,285 Word Size: 6 Searching with both strands of the query. Scoring matrix: GenRunData:fastadna.cmp Constant pamfactor used Gap creation penalty: 16 Gap extension penalty: 4 Histogram Key: Each histogram symbol represents 4 search set sequences Each inset symbol represents 1 search set sequences z-scores computed from opt scores z-score obs exp (=) (*) < 20 0 0: 22 0 0: 24 3 0:= 26 2 0:= 28 5 0:== 30 11 3:*== 32 19 11:==*== 34 38 30:=======*== 36 58 61:===============* 38 79 100:==================== * 40 134 140:==================================* 42 167 171:==========================================* 44 205 189:===============================================*==== 46 209 192:===============================================*===== 17 48 177 184:=============================================*
  • 18. FASTA Results - List The best scores are: init1 initn opt z-sc E(1018780).. SW:PPI1_HUMAN Begin: 1 End: 269 ! Q00169 homo sapiens (human). phosph... 1854 1854 1854 2249.3 1.8e-117 SW:PPI1_RABIT Begin: 1 End: 269 ! P48738 oryctolagus cuniculus (rabbi... 1840 1840 1840 2232.4 1.6e-116 SW:PPI1_RAT Begin: 1 End: 270 ! P16446 rattus norvegicus (rat). pho... 1543 1543 1837 2228.7 2.5e-116 SW:PPI1_MOUSE Begin: 1 End: 270 ! P53810 mus musculus (mouse). phosph... 1542 1542 1836 2227.5 2.9e-116 SW:PPI2_HUMAN Begin: 1 End: 270 ! P48739 homo sapiens (human). phosph... 1533 1533 1533 1861.0 7.7e-96 SPTREMBL_NEW:BAC25830 Begin: 1 End: 270 ! Bac25830 mus musculus (mouse). 10, ... 1488 1488 1522 1847.6 4.2e-95 SP_TREMBL:Q8N5W1 Begin: 1 End: 268 ! Q8n5w1 homo sapiens (human). simila... 1477 1477 1522 1847.6 4.3e-95 SW:PPI2_RAT Begin: 1 End: 269 ! P53812 rattus norvegicus (rat). pho... 1482 1482 1516 1840.4 1.1e-94 18
  • 19. FASTA Results - Alignment SCORES Init1: 1515 Initn: 1565 Opt: 1687 z-score: 1158.1 E(): 2.3e-58 >>GB_IN3:DMU09374 (2038 nt) initn: 1565 init1: 1515 opt: 1687 Z-score: 1158.1 expect(): 2.3e-58 66.2% identity in 875 nt overlap (83-957:151-1022) 60 70 80 90 100 110 u39412.gb_pr CCCTTTGTGGCCGCCATGGACAATTCCGGGAAGGAAGCGGAGGCGATGGCGCTGTTGGCC || ||| | ||||| | ||| ||||| DMU09374 AGGCGGACATAAATCCTCGACATGGGTGACAACGAACAGAAGGCGCTCCAACTGATGGCC 130 140 150 160 170 180 120 130 140 150 160 170 u39412.gb_pr GAGGCGGAGCGCAAAGTGAAGAACTCGCAGTCCTTCTTCTCTGGCCTCTTTGGAGGCTCA ||||||||| || ||| | | || ||| | || || ||||| || DMU09374 GAGGCGGAGAAGAAGTTGACCCAGCAGAAGGGCTTTCTGGGATCGCTGTTCGGAGGGTCC 190 200 210 220 230 240 180 190 200 210 220 230 u39412.gb_pr TCCAAAATAGAGGAAGCATGCGAAATCTACGCCAGAGCAGCAAACATGTTCAAAATGGCC ||| | ||||| || ||| |||| | || | |||||||| || ||| || DMU09374 AACAAGGTGGAGGACGCCATCGAGTGCTACCAGCGGGCGGGCAACATGTTTAAGATGTCC 250 260 270 280 290 300 240 250 260 270 280 290 u39412.gb_pr AAAAACTGGAGTGCTGCTGGAAACGCGTTCTGCCAGGCTGCACAGCTGCACCTGCAGCTC |||||||||| ||||| | |||||| |||| ||| || ||| || | DMU09374 AAAAACTGGACAAAGGCTGGGGAGTGCTTCTGCGAGGCGGCAACTCTACACGCGCGGGCT 19 310 320 330 340 350 360
  • 20. FASTA on the Web • Many websites offer FASTA searches • Each server has its limits • Be aware that you depend “on the kindness of strangers.” 20
  • 21. Institut de Génétique Humaine, Montpellier France, GeneStream server http://www2.igh.cnrs.fr/bin/fasta-guess.cgi Oak Ridge National Laboratory GenQuest server http://avalon.epm.ornl.gov/ European Bioinformatics Institute, Cambridge, UK http://www.ebi.ac.uk/htbin/fasta.py?request EMBL, Heidelberg, Germany http://www.embl-heidelberg.de/cgi/fasta-wrapper-free Munich Information Center for Protein Sequences (MIPS) at Max-Planck-Institut, Germany http://speedy.mips.biochem.mpg.de/mips/programs/fasta.html Institute of Biology and Chemistry of Proteins Lyon, France http://www.ibcp.fr/serv_main.html Institute Pasteur, France http://central.pasteur.fr/seqanal/interfaces/fasta.html GenQuest at The Johns Hopkins University http://www.bis.med.jhmi.edu/Dan/gq/gq.form.html National Cancer Center of Japan http://bioinfo.ncc.go.jp 21
  • 22. FASTA Format • simple format used by almost all programs • >header line with a [return] at end • Sequence (no specific requirements for line length, characters, etc) >URO1 uro1.seq Length: 2018 November 9, 2000 11:50 Type: N Check: 3854 .. CGCAGAAAGAGGAGGCGCTTGCCTTCAGCTTGTGGGAAATCCCGAAGATGGCCAAAGACA ACTCAACTGTTCGTTGCTTCCAGGGCCTGCTGATTTTTGGAAATGTGATTATTGGTTGTT GCGGCATTGCCCTGACTGCGGAGTGCATCTTCTTTGTATCTGACCAACACAGCCTCTACC CACTGCTTGAAGCCACCGACAACGATGACATCTATGGGGCTGCCTGGATCGGCATATTTG TGGGCATCTGCCTCTTCTGCCTGTCTGTTCTAGGCATTGTAGGCATCATGAAGTCCAGCA GGAAAATTCTTCTGGCGTATTTCATTCTGATGTTTATAGTATATGCCTTTGAAGTGGCAT CTTGTATCACAGCAGCAACACAACAAGACTTTTTCACACCCAACCTCTTCCTGAAGCAGA TGCTAGAGAGGTACCAAAACAACAGCCCTCCAAACAATGATGACCAGTGGAAAAACAATG GAGTCACCAAAACCTGGGACAGGCTCATGCTCCAGGACAATTGCTGTGGCGTAAATGGTC CATCAGACTGGCAAAAATACACATCTGCCTTCCGGACTGAGAATAATGATGCTGACTATC CCTGGCCTCGTCAATGCTGTGTTATGAACAATCTTAAAGAACCTCTCAACCTGGAGGCTT 22
  • 23. Assessing Alignment Significance • Generate random alignments and calculate their scores • Compute the mean and the standard deviation (SD) for random scores • Compute the deviation of the actual score from the mean of random scores Z = (meanX)/SD • Evaluate the significance of the alignment • The probability of a Z value is called the E score 23
  • 24. E scores or E values E scores are not equivalent to p values where p < 0.05 are generally considered statistically significant. 24
  • 25. E values (rules of thumb) E values below 10-6 are most probably statistically significant. E values above 10-6 but below 10-3 deserve a second look. E values above 10-3 should not be tossed aside lightly; they should be thrown out with great force. 25
  • 26. BLAST • Basic Local Alignment Search Tool – Altschul et al. 1990,1994,1997 • Heuristic method for local alignment • Designed specifically for database searches • Based on the same assumption as FASTA that good alignments contain short lengths of exact matches 26
  • 27. BLAST • Both BLAST and FASTA search for local sequence similarity - indeed they have exactly the same goals, though they use somewhat different algorithms and statistical approaches. • BLAST benefits – Speed – User friendly – Statistical rigor – More sensitive 27
  • 28. Input/Output • Input: – Query sequence Q – Database of sequences DB – Minimal score S • Output: – Sequences from DB (Seq), such that Q and Seq have scores > S 28
  • 29. BLAST Searches GenBank [BLAST= Basic Local Alignment Search Tool] The NCBI BLAST web server lets you compare your query sequence to various sections of GenBank: – nr = non-redundant (main sections) – month = new sequences from the past few weeks – refseq_rna – RNA entries from NCBI's Reference Sequence project – refseq_genomic – Genomic entries from NCBI's Reference Sequence project – ESTs – Taxon = e.g., human, Drososphila, yeast, E. coli – proteins (by automatic translation) – pdb = Sequences derived from the 3-dimensional structure from Brookhaven Protein Data Bank 29
  • 30. BLAST • Uses word matching like FASTA • Similarity matching of words (3 amino acids, 11 bases) – does not require identical words. • If no words are similar, then no alignment – Will not find matches for very short sequences • Does not handle gaps well • “gapped BLAST” is somewhat better 30
  • 32. BLAST Word Matching MEAAVKEEISVEDEAVDKNI MEA EAA AAV Break query AVK VKE into words: KEE EEI EIS ISV ... Break database sequences into words: 32
  • 33. Find locations of matching words in database sequences ELEPRRPRYRVPDVLVADPPIARLSVSGRDENSVELT MEAT MEA EAA TDVRWMSETGIIDVFLLLGPSISDVFRQYASLTGTQALPPLFSLGYHQSRWNY AAV IWLDIEEIHADGKRYFTWDPSRFPQPRTMLERLASKRRV KLVAIVDPH AVK KLV KEE EEI EIS ISV 33
  • 34. Extend hits one base at a time 34
  • 35. Seq_XYZ: HVTGRSAF_FSYYGYGCYCGLGTGKGLPVDATDRCCWA Query: QSVFDYIYYGCYCGWGLG_GK__PRDA E-val=10-13 •Use two word matches as anchors to build an alignment between the query and a database sequence. •Then score the alignment. 35
  • 36. HSPs are Aligned Regions • The results of the word matching and attempts to extend the alignment are segments - called HSPs (High-Scoring Segment Pairs) • BLAST often produces several short HSPs rather than a single aligned region 36
  • 37. >gb|BE588357.1|BE588357 194087 BARC 5BOV Bos taurus cDNA 5'. • Length = 369 • Score = 272 bits (137), Expect = 4e-71 • Identities = 258/297 (86%), Gaps = 1/297 (0%) • Strand = Plus / Plus • • Query: 17 aggatccaacgtcgctccagctgctcttgacgactccacagataccccgaagccatggca 76 • |||||||||||||||| | ||| | ||| || ||| | |||| ||||| ||||||||| • Sbjct: 1 aggatccaacgtcgctgcggctacccttaaccact-cgcagaccccccgcagccatggcc 59 • • Query: 77 agcaagggcttgcaggacctgaagcaacaggtggaggggaccgcccaggaagccgtgtca 136 • |||||||||||||||||||||||| | || ||||||||| | ||||||||||| ||| || • Sbjct: 60 agcaagggcttgcaggacctgaagaagcaagtggagggggcggcccaggaagcggtgaca 119 • • Query: 137 gcggccggagcggcagctcagcaagtggtggaccaggccacagaggcggggcagaaagcc 196 • |||||||| | || | ||||||||||||||| ||||||||||| || |||||||||||| • Sbjct: 120 tcggccggaacagcggttcagcaagtggtggatcaggccacagaagcagggcagaaagcc 179 • • Query: 197 atggaccagctggccaagaccacccaggaaaccatcgacaagactgctaaccaggcctct 256 • ||||||||| | |||||||| |||||||||||||||||| |||||||||||||||||||| • Sbjct: 180 atggaccaggttgccaagactacccaggaaaccatcgaccagactgctaaccaggcctct 239 • • Query: 257 gacaccttctctgggattgggaaaaaattcggcctcctgaaatgacagcagggagac 313 • || || ||||| || ||||||||||| | |||||||||||||||||| |||||||| • Sbjct: 240 gagactttctcgggttttgggaaaaaacttggcctcctgaaatgacagaagggagac 296 37
  • 39. 39
  • 40. 40
  • 41. 41
  • 42. 42
  • 43. 43
  • 45. 45
  • 46. 46
  • 47. 47
  • 48. 48
  • 49. 49
  • 50. 50
  • 51. 51
  • 52. 52
  • 53. 53
  • 54. Choosing the right parameters 54
  • 55. 55
  • 56. 56
  • 57. 57
  • 59. 59
  • 60. 60
  • 61. 61
  • 62. 62
  • 63. More on BLAST NCBI Blast Glossary http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/glossary2.html Education: Blast Information http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html Steve Altschul's Blast Course http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html 63

Editor's Notes