http://www.bits.vib.be/training
search engines



lennart martens
lennart.martens@ugent.be
Computational Omics and Systems Biology Group
Department of Medical Protein Research, VIB
Department of Biochemistry, Ghent University
Ghent, Belgium

Lennart MARTENS
lennart.martens@ebi.ac.uk
Proteomics Services Group
European Bioinformatics Institute
Hinxton, Cambridge
United Kingdom
www.ebi.ac.uk

Lennart Martens
                           BITS MS Data Processing – Search Engines
lennart.martens@UGent.be   UGent, Gent, Belgium – 19 September 2011
THREE TYPICAL PRE-PROCESSING STEPS




Noise thresholding
   [Figure: two spectra, each with the precursor peak marked]
   Top panel: Global thresholding
   Bottom panel: Local thresholding
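
The difference between the two panels can be sketched in a few lines of Python. This is only an illustration, not the method of any particular tool: the 5% relative cut-off, the 100 m/z window, the "keep the top N peaks per window" rule and the example spectrum are all assumptions made for this sketch.

```python
def global_threshold(peaks, fraction=0.05):
    """Keep peaks whose intensity is at least `fraction` of the most intense peak."""
    if not peaks:
        return []
    cutoff = fraction * max(intensity for _, intensity in peaks)
    return [(mz, intensity) for mz, intensity in peaks if intensity >= cutoff]


def local_threshold(peaks, window=100.0, top_n=2):
    """Keep the `top_n` most intense peaks inside each m/z window."""
    windows = {}
    for mz, intensity in peaks:
        windows.setdefault(int(mz // window), []).append((mz, intensity))
    kept = []
    for members in windows.values():
        members.sort(key=lambda peak: peak[1], reverse=True)
        kept.extend(members[:top_n])
    return sorted(kept)


# Hypothetical spectrum of (m/z, intensity) pairs with a dominant precursor at 800.4.
spectrum = [(150.1, 12.0), (175.1, 40.0), (180.2, 3.0), (420.3, 25.0),
            (455.2, 8.0), (460.9, 2.0), (800.4, 1000.0), (805.0, 30.0)]

print(global_threshold(spectrum))
print(local_threshold(spectrum))
```

With this example the dominant precursor sets the global cut-off so high that only the precursor itself survives, while the local variant still retains fragment peaks in every m/z region.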




Charge deconvolution (peptides)




                    From: http://www.purdue.edu/dp/bioscience/images/spectrum.jpg
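
For peptides, charge deconvolution can be sketched as two small steps: estimate the charge z from the spacing between adjacent isotope peaks (about 1.00336/z in m/z), then convert the observed m/z to a neutral mass with M = z * (m/z - proton mass). The function names and the example peak positions below are hypothetical; only positive-mode protonated ions are assumed.

```python
PROTON = 1.007276           # mass of a proton in Da
ISOTOPE_SPACING = 1.003355  # mass difference between 13C and 12C in Da


def charge_from_isotope_spacing(mz_a, mz_b):
    """Estimate the charge from the m/z gap between two adjacent isotope peaks."""
    return round(ISOTOPE_SPACING / abs(mz_b - mz_a))


def neutral_mass(mz, charge):
    """Convert an m/z observed at a positive charge state to the neutral mass."""
    return charge * (mz - PROTON)


# Hypothetical first two isotope peaks of one peptide ion.
mz_first, mz_second = 500.7500, 501.2517
z = charge_from_isotope_spacing(mz_first, mz_second)
print(z, round(neutral_mass(mz_first, z), 4))
```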

Charge deconvolution (proteins)




                           From: Gill et al, EMBO Journal, 2000

Centroiding (peak picking)




   [Figure: two panels, "Monoisotopic mass" and "Average mass", each with the picked position marked by an x]
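
One common way to pick a peak position is the intensity-weighted centroid. The sketch below is a minimal version of that idea with made-up data points. Applied to a single resolved isotope peak it returns that peak's position; applied to a whole unresolved isotope envelope it returns a value shifted towards the average mass.

```python
def centroid(profile):
    """Intensity-weighted mean m/z of a list of (m/z, intensity) points."""
    total = sum(intensity for _, intensity in profile)
    if total == 0:
        raise ValueError("profile contains no signal")
    return sum(mz * intensity for mz, intensity in profile) / total


# Hypothetical, symmetric profile of one resolved isotope peak.
resolved_peak = [(499.98, 10.0), (499.99, 60.0), (500.00, 100.0),
                 (500.01, 60.0), (500.02, 10.0)]

# Hypothetical isotope envelope treated as one unresolved peak.
isotope_envelope = [(1000.0, 100.0), (1001.0, 60.0), (1002.0, 20.0)]

print(round(centroid(resolved_peak), 4))
print(round(centroid(isotope_envelope), 4))
```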
Combined results




                      A total ion current chromatogram, corrected by
                                typical pre-processing steps.


                           From: Last et al, Nature Rev. Mol. Cell Bio., 2007

Data size reduction
   [Bar chart: file size (MB) per data type, for two instruments]

   Data type              Q-TOF II    Esquire HCT
   RAW                      51.4         24.5
   RAW GZIPped              25.8         23.7
   Peak lists                0.7          0.2
   Peak lists GZIPped        0.3          0.1


                                                  See: Martens et al., Proteomics, 2005

MS/MS IDENTIFICATION
PEPTIDE FRAGMENTATION FINGERPRINTING




Peptide sequences and MS/MS spectra

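   [Figure: MS/MS spectrum of the peptide LENNART, intensity versus m/z]
   Peaks labeled with N-terminal fragments: L, LE, LEN, LENN, LENNA, LENNAR, LENNART
   Peaks labeled with C-terminal fragments: T, RT, ART, NART, NNART, ENNART, LENNART
   Residue ladder along the m/z axis: L  E  N  N  A  R  T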
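
The two fragment ladders for LENNART can be computed directly from residue masses. The slide does not name the ion types; the sketch below assumes singly charged b-ions (N-terminal) and y-ions (C-terminal) and uses standard monoisotopic residue masses. The function name is hypothetical.

```python
# Monoisotopic residue masses (Da) for the residues that occur in LENNART.
RESIDUE_MASS = {"L": 113.08406, "E": 129.04259, "N": 114.04293,
                "A": 71.03711, "R": 156.10111, "T": 101.04768}
WATER = 18.010565
PROTON = 1.007276


def fragment_ladder(peptide):
    """Return singly charged b- and y-ion m/z values keyed by fragment sequence."""
    b_ions, y_ions = {}, {}
    for i in range(1, len(peptide)):
        prefix, suffix = peptide[:i], peptide[i:]
        # b-ion: N-terminal residues plus one proton.
        b_ions[prefix] = sum(RESIDUE_MASS[aa] for aa in prefix) + PROTON
        # y-ion: C-terminal residues plus water plus one proton.
        y_ions[suffix] = sum(RESIDUE_MASS[aa] for aa in suffix) + WATER + PROTON
    return b_ions, y_ions


b_ions, y_ions = fragment_ladder("LENNART")
for fragment, mz in b_ions.items():
    print(f"b  {fragment:<7} {mz:9.4f}")
for fragment, mz in y_ions.items():
    print(f"y  {fragment:<7} {mz:9.4f}")
```

Each gap between neighbouring peaks of the same ladder equals one residue mass, which is what allows a sequence to be read off the spectrum.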
Peptide fragment fingerprinting (PFF)
   [Workflow diagram]

   1. protein sequence database
        -> in silico digest ->
   2. peptide sequences: YSFVATAER, HETSINGK, MILQEESTVYYR, SEFASTPINK, …
        -> in silico MS/MS ->
   3. theoretical MS/MS spectra (Int versus m/z)
        -> in silico matching against the experimental MS/MS spectrum ->
   4. peptide scores:
        1) YSFVATAER 34
        2) YSFVSAIR 12
        3) FFLIGGGGK 12
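
The first step of this workflow, the in silico digest, can be sketched as follows. The slide does not name the protease; the example assumes trypsin with the usual rule (cleave after K or R, but not before P), no missed cleavages, and a minimum peptide length of 6. The protein sequence is a hypothetical one built by concatenating the peptides shown on the slide.

```python
import re


def tryptic_digest(protein, min_length=6):
    """Cleave after K or R unless the next residue is P; drop short peptides."""
    peptides = re.split(r"(?<=[KR])(?!P)", protein)
    return [peptide for peptide in peptides if len(peptide) >= min_length]


# Hypothetical protein made from the example peptides on the slide.
protein = "YSFVATAERHETSINGKMILQEESTVYYRSEFASTPINK"
print(tryptic_digest(protein))
```

Splitting on a zero-width pattern needs Python 3.7 or later.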

Three types of PFF identification

   Spectral comparison
     database sequence -> theoretical spectrum <-[compare]-> experimental spectrum

   Sequential comparison
     database sequence <-[compare]-> de novo sequence <- experimental spectrum

   Threading comparison
     database sequence <-[thread]-> experimental spectrum


                           From: Eidhammer, Flikka, Martens, Mikalsen – Wiley 2007

The most popular algorithms


      • MASCOT (Matrix Science)
          http://www.matrixscience.com

      • SEQUEST (Scripps, Thermo Fisher Scientific)
          http://fields.scripps.edu/sequest


      • X!Tandem (The Global Proteome Machine Organization)
          http://www.thegpm.org/TANDEM


      • OMSSA (NCBI)
          http://pubchem.ncbi.nlm.nih.gov/omssa/




Overall concept of scores and cut-offs

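   [Figure: overlapping score distributions of incorrect and correct identifications,
    divided by a threshold score. Correct identifications below the threshold are
    false negatives; incorrect identifications above it are false positives.]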
                           Adapted from: www.proteomesoftware.com – Wiki pages
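
The trade-off between false positives and false negatives can be made concrete with a small counting example. The scores and their correct/incorrect labels below are invented for illustration; in a real experiment the labels are not known and have to be estimated.

```python
def outcomes_at_threshold(scored_ids, threshold):
    """Count outcomes when every identification scoring >= threshold is accepted."""
    counts = {"true_pos": 0, "false_pos": 0, "false_neg": 0, "true_neg": 0}
    for score, is_correct in scored_ids:
        accepted = score >= threshold
        if accepted and is_correct:
            counts["true_pos"] += 1
        elif accepted:
            counts["false_pos"] += 1
        elif is_correct:
            counts["false_neg"] += 1
        else:
            counts["true_neg"] += 1
    return counts


# Hypothetical (score, is_correct) pairs.
scored_ids = [(12, False), (18, False), (22, False), (25, True),
              (31, False), (35, True), (44, True), (52, True)]

for threshold in (30, 40):
    print(threshold, outcomes_at_threshold(scored_ids, threshold))
```

Raising the threshold from 30 to 40 removes the one false positive in this toy data set, at the price of one additional false negative.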

Playing with probabilistic cut-off scores

   [Figure: false positives (left axis, 0% to 6%) and identifications (right axis,
    0% to 100%) at cut-offs p=0.05, p=0.01, p=0.005 and p=0.0005; stringency
    increases towards lower p]



SEQUEST
   • Very well established search engine
   • Can be used for MS/MS (PFF) identifications
   • Based on a cross-correlation score (includes peak height)
   • Published core algorithm (patented, licensed to Thermo), Eng, JASMS 1994
   • Provides preliminary (Sp) score, rank, cross-correlation score (XCorr),
       and score difference between the top two ranks (deltaCn, ∆Cn)
   • Thresholding is up to the user, and is commonly done per charge state


             
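   • Many extensions exist to perform a more automatic validation of results

     R_\tau = \sum_{i=1}^{n} x_i \cdot y_{i+\tau}
         (x: experimental spectrum, y: theoretical spectrum, \tau: offset)

     XCorr = R_0 - \frac{1}{151} \sum_{\tau=-75}^{+75} R_\tau

     deltaCn = \frac{XCorr_1 - XCorr_2}{XCorr_1}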
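
The cross-correlation score can be sketched for binned spectra: XCorr is the dot product at zero offset minus the mean dot product over offsets from -75 to +75 bins, and deltaCn is the relative difference between the two best XCorr values. The code below is a simplified illustration, not the SEQUEST implementation: spectrum preprocessing is omitted, and the binned vectors and helper names are hypothetical.

```python
def correlation(x, y, tau):
    """Dot product of x with y shifted by tau bins; bins outside y count as 0."""
    return sum(x[i] * y[i + tau] for i in range(len(x)) if 0 <= i + tau < len(y))


def xcorr(x, y, max_shift=75):
    """Zero-offset correlation minus the mean correlation over all offsets."""
    background = sum(correlation(x, y, tau)
                     for tau in range(-max_shift, max_shift + 1))
    return correlation(x, y, 0) - background / (2 * max_shift + 1)


def delta_cn(xcorr_1, xcorr_2):
    """Relative score difference between the best and second-best candidate."""
    return (xcorr_1 - xcorr_2) / xcorr_1


def binned(peak_bins, n_bins=200):
    """Hypothetical helper: unit-intensity vector with peaks at the given bins."""
    vector = [0.0] * n_bins
    for index in peak_bins:
        vector[index] = 1.0
    return vector


experimental = binned([20, 50, 90, 130])
candidate_1 = binned([20, 50, 90, 130])   # all four fragment bins match
candidate_2 = binned([25, 50, 100, 140])  # only one fragment bin matches

score_1 = xcorr(experimental, candidate_1)
score_2 = xcorr(experimental, candidate_2)
print(round(score_1, 4), round(score_2, 4), round(delta_cn(score_1, score_2), 4))
```

Subtracting the averaged background penalises candidates that match only because the spectra are dense, since such matches also appear at non-zero offsets.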
SEQUEST: some additional pictures




   From: MacCoss et al., Anal. Chem. 2002




                                                           From: Peng et al., J. Prot. Res., 2002

Mascot


    • Very well established search engine, Perkins, Electrophoresis 1999
    • Can do MS (PMF) and MS/MS (PFF) identifications
    • Based on the MOWSE score
    • Unpublished core algorithm (trade secret)
    • Predicts an a priori threshold score that identifications need to pass
    • From version 2.2, Mascot allows integrated decoy searches
    • Provides rank, score, threshold and expectation value per identification
    • Customizable confidence level for the threshold score




Mascot: some additional pictures

   [Figure: average identity threshold (0 to 40) versus log10(number of AA)
    (6.50 to 8.50), with linear fit y = 8.3761x - 34.089, R² = 0.9985]

   [Figure: false positives (left axis, 0% to 6%) and identifications (right axis,
    0% to 100%) at cut-offs p=0.05, p=0.01, p=0.005 and p=0.0005]



X!Tandem


   • A successful open source search engine, Craig and Beavis, RCMS 2003
   • Can be used for MS/MS (PFF) identifications
   • Based on a hyperscore (P_i is either 0 or 1):
         HyperScore = \left( \sum_{i=0}^{n} I_i \cdot P_i \right) \cdot N_b! \cdot N_y!
   • Relies on a hypergeometric distribution (hence hyperscore)
   • Published core algorithm, and is freely available
   • Provides hyperscore and expectancy score (the discriminating one)
   • X!Tandem is fast and can handle modifications in an iterative fashion
   • Has rapidly gained popularity as (auxiliary) search engine
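
The hyperscore formula translates almost literally into code. In this sketch I_i are fragment peak intensities and P_i flags whether peak i matched a predicted fragment; N_b and N_y are taken here to be the numbers of matched b- and y-ions, which the slide does not spell out. The example values are invented.

```python
from math import factorial


def hyperscore(intensities, matched, n_b, n_y):
    """(sum of I_i * P_i) multiplied by N_b! and N_y!."""
    dot_product = sum(i * p for i, p in zip(intensities, matched))
    return dot_product * factorial(n_b) * factorial(n_y)


# Hypothetical spectrum: five peaks, four of them matched (two b-ions, two y-ions).
intensities = [10, 20, 5, 40, 15]
matched = [1, 0, 1, 1, 1]
print(hyperscore(intensities, matched, n_b=2, n_y=2))
```

The factorial terms make the score grow very quickly with the number of matched ions, so long matched ion series dominate over a few intense matches.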




X!Tandem: some additional pictures
   [Figure: histogram of # results (0 to 60) versus hyperscore (0 to 100)]

   [Figure: log(# results) (0 to 4) versus hyperscore (20 to 50)]

   [Figure: log(# results) (-10 to 6) versus hyperscore (0 to 100), with the
    significance threshold and E-value=e-8.2 marked]

   Adapted from: Brian Searle, ProteomeSoftware,
   http://www.proteomesoftware.com/XTandem_edited.pdf
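
The idea behind these plots can be sketched as a straight-line fit: the logarithm of the number of results falls off roughly linearly with hyperscore, so a line fitted to that tail can be extrapolated to the score of the best hit to estimate how many random matches would be expected there. The sketch below uses an ordinary least-squares fit and base-10 logarithms; both choices, the function names and the histogram values are assumptions for illustration and not X!Tandem's actual code.

```python
from math import log10


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def log10_expectation(tail_histogram, top_score):
    """Extrapolate log10(# results) of the histogram tail to `top_score`."""
    scores = sorted(tail_histogram)
    log_counts = [log10(tail_histogram[score]) for score in scores]
    slope, intercept = fit_line(scores, log_counts)
    return slope * top_score + intercept


# Hypothetical decaying tail of a hyperscore histogram: {hyperscore: # results}.
tail_histogram = {20: 1000, 30: 100, 40: 10, 50: 1}
print(log10_expectation(tail_histogram, top_score=90))
```

For this invented histogram the extrapolation gives a log10 expectation of about -4 for a top hyperscore of 90, that is, roughly one such random match per 10,000 searches of this kind.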
A note on how the scores differ

                    Accuracy Score       Relative Score
   SEQUEST          XCorr                DeltaCn
   X! Tandem        HyperScore           E-Value



                                                           Adapted from: Brian Searle, ProteomeSoftware

OMSSA


   • A successful open source search engine, Geer, JPR 2004
   • Can be used for MS/MS (PFF) identifications
   • Relies on a Poisson distribution
   • Published core algorithm, and is freely available
   • Provides an expectancy score, similar to the BLAST E-value
   • OMSSA was recently upgraded to take peak intensity into account
   • Got really good marks in a recently published comparative study
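
A Poisson-based expectancy score can be sketched as follows: if random candidate peptides match on average mu fragment peaks, the chance that a random candidate matches k or more peaks is the Poisson tail probability, and multiplying by the number of candidates gives an E-value in the spirit of BLAST. In the published method mu is derived from properties of the spectrum and the search settings; here mu, k and the number of candidates are simply invented inputs.

```python
from math import exp, factorial


def poisson_tail(k, mu):
    """P(X >= k) for X following a Poisson distribution with mean mu."""
    return 1.0 - sum(exp(-mu) * mu ** i / factorial(i) for i in range(k))


def expect_value(k, mu, n_candidates):
    """Expected number of random candidates with at least k matched peaks."""
    return n_candidates * poisson_tail(k, mu)


# Hypothetical search: 1000 candidate peptides, 2 random matches expected on
# average, and a top hit that matches 10 peaks.
print(expect_value(k=10, mu=2.0, n_candidates=1000))
```

An E-value well below 1 means that fewer than one random candidate would be expected to match this well.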




OMSSA: some additional pictures




   [Figure] Yeast lysate spectrum, m/z matches of fragment peak matches versus
   all NCBI nr sequence library. Poisson distribution fitted.

   [Figure] Validation of the Poisson distribution model: mean number of modelled
   and measured matching peaks (against the NCBI nr database) for two mass
   tolerances.

                             Adapted from: Geer et al., J. Prot. Res., 2004

COMPARATIVE STUDIES




Kapp et al., Proteomics, 2005




Balgley et al., Mol. Cell. Proteomics, 2007




                                                1.6x more?!




Combining the output of search algorithms

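   [Figure: four-way Venn diagram of identifications per search engine]

   Totals per engine: Mascot 3229, SEQUEST 3792, ProteinSolver 3203, Phenyx 3186
   Region counts as labeled in the diagram: 212 (+4.2%), 486 (+9.6%), 329 (+6.5%),
   380 (+7.5%), 179, 168, 40, 501, 348, 1776, 139, 96, 195, 77, 146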




                 Figure courtesy of Dr. Christian Stephan, Medizinisches Proteom-Center,
                        Ruhr-Universität Bochum; Human Brain Proteome Project

SEQUENTIAL COMPARISON
                             ALGORITHMS




Sequence tags




                                                                  sequence tag




        The concept of sequence tags was introduced by Mann and Wilm
             (Mann and Wilm, Anal. Chem. 1994, 66: 4390-4399).

                          Image from: Matthias Wilm, EMBL Heidelberg, Germany
            http://www.narrador.embl-heidelberg.de/GroupPages/PageLink/activities/SeqTag.html
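
A sequence tag is a short stretch of sequence read from consecutive peak gaps, together with the masses that flank it. The sketch below is a deliberately simple version: it only looks at neighbouring peaks in a sorted peak list and matches each gap to a single residue mass within a tolerance. The residue table is a small subset, and the peak list, tolerance and function names are hypothetical; real implementations consider all peak pairs and score the alternatives.

```python
# Monoisotopic residue masses (Da) for a small, illustrative subset of residues.
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841,
                "T": 101.04768, "L": 113.08406, "N": 114.04293,
                "E": 129.04259, "R": 156.10111}


def residue_for_gap(gap, tolerance=0.02):
    """Return the residue whose mass matches the gap, or None."""
    for residue, mass in RESIDUE_MASS.items():
        if abs(gap - mass) <= tolerance:
            return residue
    return None


def longest_tag(peaks, tolerance=0.02):
    """Return (start m/z, tag, end m/z) for the longest run of residue-sized gaps."""
    peaks = sorted(peaks)
    best = None
    start, tag = 0, ""
    for i in range(1, len(peaks)):
        residue = residue_for_gap(peaks[i] - peaks[i - 1], tolerance)
        if residue is None:
            start, tag = i, ""  # the run is broken; restart at this peak
            continue
        tag += residue
        if best is None or len(tag) > len(best[1]):
            best = (peaks[start], tag, peaks[i])
    return best


# Hypothetical peak list (m/z values only) with two unrelated peaks.
peaks = [150.0, 243.1339, 357.1769, 471.2198, 542.2569, 600.3]
print(longest_tag(peaks))
```

The flanking masses are what make a short tag useful: a database peptide must contain the tag and also account for the mass before and after it.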
GutenTag, DirecTag, TagRecon


   • Tabb, Anal. Chem. 2003, Tabb, JPR 2008, Dasari, JPR 2010
   • Recent implementations of the sequence tag approach
   • Refine hits by peak mapping in a second stage to resolve ambiguities
   • Rely on an empirical fragmentation model
   • Published core algorithms, DirecTag and TagRecon freely available
   • Most useful to retrieve unexpected peptides (modifications, variations)
   • Entire workflows exist (e.g., combination with IDPicker)




GutenTag: some additional pictures




                           From: Tabb et al., Anal. Chem., 2003

De novo compared to sequence tags




                        Example of manual de novo sequencing of an MS/MS spectrum
                       No database is needed anymore to extract a sequence!


                     Algorithms                                      References

                       Lutefisk                            Dancik 1999, Taylor 2000
                      Sherenga                            Fernandez-de-Cossio 2000
                       PEAKS                                 Ma 2003, Zhang 2004
                      PepNovo                            Frank 2005, Grossmann 2005
                          …                                           …


Thank you!
                Questions?

More Related Content

PPTX
Emerging challenges in data-intensive genomics
PDF
Introduction to Linux for bioinformatics
PDF
Towards an understanding of diversity in biological and biomedical systems
PDF
NGS analysis of micro-RNA
PPTX
Data analytics challenges in genomics
PDF
RNA-seq for DE analysis: the biology behind observed changes - part 6
PDF
BITS - Introduction to comparative genomics
PDF
RNA-seq for DE analysis: extracting counts and QC - part 4
Emerging challenges in data-intensive genomics
Introduction to Linux for bioinformatics
Towards an understanding of diversity in biological and biomedical systems
NGS analysis of micro-RNA
Data analytics challenges in genomics
RNA-seq for DE analysis: the biology behind observed changes - part 6
BITS - Introduction to comparative genomics
RNA-seq for DE analysis: extracting counts and QC - part 4

Viewers also liked (19)

PDF
Utilidad de la genómica en la salud humana
PDF
Text mining on the command line - Introduction to linux for bioinformatics
PDF
Managing your data - Introduction to Linux for bioinformatics
PDF
RNA-seq: Mapping and quality control - part 3
PPTX
Deep learning with Tensorflow in R
PDF
Differential expression in RNA-Seq
PPTX
Mass Spectrometry: Protein Identification Strategies
PDF
BITS - Genevestigator to easily access transcriptomics data
PDF
BITS - Comparative genomics: the Contra tool
PDF
Productivity tips - Introduction to linux for bioinformatics
PDF
BITS - Comparative genomics on the genome level
PDF
BITS - Protein inference from mass spectrometry data
PDF
BITS - Comparative genomics: gene family analysis
PDF
The structure of Linux - Introduction to Linux for bioinformatics
POT
RNA-seq quality control and pre-processing
PDF
RNA-seq for DE analysis: detecting differential expression - part 5
PDF
RNA-seq: analysis of raw data and preprocessing - part 2
PPTX
RNA-seq differential expression analysis
PDF
RNA-seq: general concept, goal and experimental design - part 1
Utilidad de la genómica en la salud humana
Text mining on the command line - Introduction to linux for bioinformatics
Managing your data - Introduction to Linux for bioinformatics
RNA-seq: Mapping and quality control - part 3
Deep learning with Tensorflow in R
Differential expression in RNA-Seq
Mass Spectrometry: Protein Identification Strategies
BITS - Genevestigator to easily access transcriptomics data
BITS - Comparative genomics: the Contra tool
Productivity tips - Introduction to linux for bioinformatics
BITS - Comparative genomics on the genome level
BITS - Protein inference from mass spectrometry data
BITS - Comparative genomics: gene family analysis
The structure of Linux - Introduction to Linux for bioinformatics
RNA-seq quality control and pre-processing
RNA-seq for DE analysis: detecting differential expression - part 5
RNA-seq: analysis of raw data and preprocessing - part 2
RNA-seq differential expression analysis
RNA-seq: general concept, goal and experimental design - part 1
Ad

More from BITS (13)

PDF
BITS - Overview of sequence databases for mass spectrometry data analysis
PDF
BITS - Introduction to proteomics
PDF
BITS - Introduction to Mass Spec data generation
PPTX
BITS training - UCSC Genome Browser - Part 2
PPTX
Marcs (bio)perl course
PDF
Basics statistics
PDF
Cytoscape: Integrating biological networks
PDF
Cytoscape: Gene coexppression and PPI networks
PDF
Genevestigator
PDF
BITS: UCSC genome browser - Part 1
PPT
Vnti11 basics course
PPT
Bits protein structure
PPT
BITS: Introduction to Linux - Software installation the graphical and the co...
BITS - Overview of sequence databases for mass spectrometry data analysis
BITS - Introduction to proteomics
BITS - Introduction to Mass Spec data generation
BITS training - UCSC Genome Browser - Part 2
Marcs (bio)perl course
Basics statistics
Cytoscape: Integrating biological networks
Cytoscape: Gene coexppression and PPI networks
Genevestigator
BITS: UCSC genome browser - Part 1
Vnti11 basics course

BITS - Search engines for mass spec data

  • 2. Search engines. Lennart Martens (lennart.martens@ugent.be), Computational Omics and Systems Biology Group, Department of Medical Protein Research, VIB, and Department of Biochemistry, Ghent University, Ghent, Belgium. Also listed: lennart.martens@ebi.ac.uk, Proteomics Services Group, European Bioinformatics Institute, Hinxton, Cambridge, United Kingdom, www.ebi.ac.uk. BITS MS Data Processing – Search Engines, UGent, Gent, Belgium – 19 September 2011
  • 3. THREE TYPICAL PRE-PROCESSING STEPS
  • 4. Noise thresholding. [Figure: two spectra with the precursor marked, one after global thresholding and one after local thresholding]
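The slide names the two thresholding modes without defining them. As a minimal sketch, assuming that "global" means one cut-off relative to the most intense peak of the whole spectrum and "local" means the same rule applied per m/z window, the difference could look like this. The function names, the 5% fraction, and the 100 m/z window are illustrative assumptions, not values from the slides.

```python
def global_threshold(peaks, fraction=0.05):
    """Keep (m/z, intensity) peaks at or above `fraction` of the base peak."""
    if not peaks:
        return []
    base = max(intensity for _, intensity in peaks)
    return [(mz, i) for mz, i in peaks if i >= fraction * base]


def local_threshold(peaks, window=100.0, fraction=0.05):
    """Apply the same rule separately inside consecutive m/z windows."""
    bins = {}
    for mz, intensity in peaks:
        bins.setdefault(int(mz // window), []).append((mz, intensity))
    kept = []
    for key in sorted(bins):
        kept.extend(global_threshold(bins[key], fraction))
    return sorted(kept)


# Hypothetical spectrum dominated by one very intense peak at m/z 450.
peaks = [(150.0, 1.0), (175.0, 40.0), (420.0, 3.0), (450.0, 1000.0), (455.0, 20.0)]
print(global_threshold(peaks))  # only the dominant peak survives
print(local_threshold(peaks))   # the fragment at m/z 175 survives as well
```

With a dominant peak such as an unfragmented precursor, a global cut-off can discard informative low-intensity fragments that a local cut-off retains.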
  • 5. Charge deconvolution (peptides) From: http://www.purdue.edu/dp/bioscience/images/spectrum.jpg
  • 6. Charge deconvolution (proteins) From: Gill et al, EMBO Journal, 2000
  • 7. Centroiding (peak picking). [Figure: peak profile with the monoisotopic mass and the average mass each marked with an x]
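One common way to centroid a profile peak is to take the intensity-weighted mean of its data points. The sketch below assumes that approach; the slides do not say which peak-picking method is meant, and the function name and data are hypothetical.

```python
def centroid(profile):
    """Collapse profile points [(m/z, intensity), ...] into one centroided peak.

    Returns the intensity-weighted mean m/z and the summed intensity.
    """
    total = sum(intensity for _, intensity in profile)
    if total == 0:
        raise ValueError("profile has no signal")
    mz = sum(m * intensity for m, intensity in profile) / total
    return mz, total


# Hypothetical symmetric profile peak around m/z 500.1.
mz, intensity = centroid([(500.0, 1.0), (500.1, 2.0), (500.2, 1.0)])
print(round(mz, 4), intensity)
```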
  • 8. Combined results: a total ion current chromatogram, corrected by typical pre-processing steps. From: Last et al, Nature Rev. Mol. Cell Bio., 2007
  • 9. Data size reduction. [Figure: bar chart of file size (MB) by data type (RAW, RAW GZIPped, Peak lists, Peak lists GZIPped) for the Q-TOF II and Esquire HCT instruments; bar labels in extraction order: 51.4, 24.5, 25.8, 23.7, 0.7, 0.2, 0.3, 0.1] See: Martens et al., Proteomics, 2005
  • 10. MS/MS IDENTIFICATION: PEPTIDE FRAGMENTATION FINGERPRINTING
  • 11. Peptide sequences and MS/MS spectra. [Figure: MS/MS spectrum (intensity versus m/z) of the peptide LENNART, with peaks labelled by the fragments L, LE, LEN, LENN, LENNA, LENNAR and T, RT, ART, NART, NNART, ENNART]
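The fragment labels on this slide correspond to the N-terminal and C-terminal pieces of the peptide. As a rough sketch, assuming these are singly charged b- and y-type ions and using standard monoisotopic residue masses, their m/z values can be computed as follows. The mass table covers only the residues of the example peptide, and the function name is hypothetical.

```python
# Monoisotopic residue masses (Da), only for the residues in LENNART.
RESIDUE_MASS = {
    "L": 113.08406, "E": 129.04259, "N": 114.04293,
    "A": 71.03711, "R": 156.10111, "T": 101.04768,
}
WATER = 18.01056    # added to C-terminal (y) fragments
PROTON = 1.007276   # charge carrier for singly charged ions


def fragment_ions(peptide):
    """Return ([(b fragment, m/z)], [(y fragment, m/z)]) for charge 1+."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        b_ions.append((peptide[:i], sum(masses[:i]) + PROTON))
        y_ions.append((peptide[-i:], sum(masses[-i:]) + WATER + PROTON))
    return b_ions, y_ions


b_ions, y_ions = fragment_ions("LENNART")
for seq, mz in b_ions + y_ions:
    print(f"{seq:>7s} {mz:10.4f}")
```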
  • 12. Peptide fragment fingerprinting (PFF). A protein sequence database is digested in silico into peptide sequences (YSFVATAER, HETSINGK, MILQEESTVYYR, SEFASTPINK, …). An in silico MS/MS step turns each peptide into a theoretical MS/MS spectrum (Int versus m/z). In silico matching against the experimental MS/MS spectrum yields peptide scores: 1) YSFVATAER 34, 2) YSFVSAIR 12, 3) FFLIGGGGK 12
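The first and last steps of this workflow can be sketched in a few lines. The example below assumes tryptic cleavage (after K or R, but not before P) for the in silico digest and a plain shared-peak count for the matching step. Both are illustrative choices; the slide does not name an enzyme or a scoring function, and the function names and tolerance are hypothetical.

```python
def tryptic_digest(protein):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        is_last = i == len(protein) - 1
        if aa in "KR" and not is_last and protein[i + 1] != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def shared_peak_count(theoretical, experimental, tolerance=0.5):
    """Count theoretical m/z values that have an experimental peak within tolerance."""
    return sum(
        any(abs(t - e) <= tolerance for e in experimental)
        for t in theoretical
    )


print(tryptic_digest("MKYSFVATAERHETSINGK"))
print(shared_peak_count([100.0, 200.0, 300.0], [100.2, 250.0, 299.7]))
```

Real search engines replace the shared-peak count with the scoring schemes described on the later slides.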
  • 13. Three types of PFF identification. Spectral comparison: a theoretical spectrum is derived from the database sequence and compared with the experimental spectrum. Sequential comparison: a de novo sequence is derived from the experimental spectrum and compared with the database sequence. Threading comparison: the database sequence is threaded onto the experimental spectrum. From: Eidhammer, Flikka, Martens, Mikalsen – Wiley 2007
  • 14. The most popular algorithms
      • MASCOT (Matrix Science) http://www.matrixscience.com
      • SEQUEST (Scripps, Thermo Fisher Scientific) http://fields.scripps.edu/sequest
      • X!Tandem (The Global Proteome Machine Organization) http://www.thegpm.org/TANDEM
      • OMSSA (NCBI) http://pubchem.ncbi.nlm.nih.gov/omssa/
  • 15. Overall concept of scores and cut-offs. [Figure: score distributions of incorrect and correct identifications with a threshold score; the regions labelled false negatives and false positives lie on either side of the threshold] Adapted from: www.proteomesoftware.com – Wiki pages
  • 16. Playing with probabilistic cut-off scores. [Figure: identifications and false positives at p=0.05, p=0.01, p=0.005 and p=0.0005, with stringency increasing towards smaller p; two vertical axes, 0% to 100% and 0% to 6%]
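The trade-off behind a score cut-off can be made concrete with a small counting example. This sketch assumes the true status of every identification is known, which holds only for test data; the scores and the function name are invented for illustration.

```python
def errors_at(threshold, correct_scores, incorrect_scores):
    """Return (false positives, false negatives) for a given score threshold.

    An identification is accepted when its score is >= threshold.
    """
    false_positives = sum(s >= threshold for s in incorrect_scores)
    false_negatives = sum(s < threshold for s in correct_scores)
    return false_positives, false_negatives


# Hypothetical scores of known-correct and known-incorrect identifications.
correct = [20, 35, 40, 55]
incorrect = [5, 10, 25, 30]

for threshold in (15, 30, 45):
    fp, fn = errors_at(threshold, correct, incorrect)
    print(f"threshold {threshold}: {fp} false positives, {fn} false negatives")
```

Raising the threshold removes false positives at the cost of more false negatives, which is the effect of the higher stringency shown on the slide.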
  • 17. SEQUEST
      • Very well established search engine
      • Can be used for MS/MS (PFF) identifications
      • Based on a cross-correlation score (includes peak height)
      • Published core algorithm (patented, licensed to Thermo), Eng, JASMS 1994
      • Provides preliminary (Sp) score, rank, cross-correlation score (XCorr), and score difference between the top two ranks (deltaCn, ∆Cn)
      • Thresholding is up to the user, and is commonly done per charge state
      • Many extensions exist to perform a more automatic validation of results
      • Scores: $R_\tau = \sum_{i=1}^{n} x_i \cdot y_{i+\tau}$; $\mathrm{XCorr} = R_0 - \frac{1}{151} \sum_{\tau=-75}^{+75} R_\tau$; $\mathrm{deltaCn} = \frac{\mathrm{XCorr}_1 - \mathrm{XCorr}_2}{\mathrm{XCorr}_1}$
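The XCorr and deltaCn formulas on this slide translate directly into code. The following is a simplified sketch that assumes both spectra are already binned into equal-length intensity vectors; it omits the spectrum pre-processing that the real SEQUEST performs, and the function names are hypothetical.

```python
def correlation(x, y, tau):
    """R_tau: dot product of x with y shifted by tau bins (out-of-range bins ignored)."""
    n = len(x)
    return sum(x[i] * y[i + tau] for i in range(n) if 0 <= i + tau < n)


def xcorr(x, y, max_shift=75):
    """Correlation at zero shift minus the mean correlation over all shifts."""
    n_shifts = 2 * max_shift + 1  # 151 for the default of +/-75
    background = sum(
        correlation(x, y, tau) for tau in range(-max_shift, max_shift + 1)
    ) / n_shifts
    return correlation(x, y, 0) - background


def delta_cn(xcorr_best, xcorr_second):
    """Relative XCorr difference between the top two ranked candidates."""
    return (xcorr_best - xcorr_second) / xcorr_best


# Toy binned spectra; a shift range of +/-1 keeps the arithmetic easy to follow.
experimental = [0, 1, 0, 1]
theoretical = [0, 1, 0, 1]
print(xcorr(experimental, theoretical, max_shift=1))
print(delta_cn(4.0, 3.0))
```

Subtracting the mean over shifted alignments penalises spectra that correlate with almost anything, so a high XCorr indicates a match specific to the correct alignment.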
  • 18. SEQUEST: some additional pictures. From: MacCoss et al., Anal. Chem. 2002; Peng et al., J. Prot. Res. 2002
  • 19. Mascot
      • Very well established search engine, Perkins, Electrophoresis 1999
      • Can do MS (PMF) and MS/MS (PFF) identifications
      • Based on the MOWSE score
      • Unpublished core algorithm (trade secret)
      • Predicts an a priori threshold score that identifications need to pass
      • From version 2.2, Mascot allows integrated decoy searches
      • Provides rank, score, threshold and expectation value per identification
      • Customizable confidence level for the threshold score
  • 20. Mascot: some additional pictures. [Figure: average identity threshold versus log10(number of AA), with linear fit y = 8.3761x - 34.089, R² = 0.9985] [Figure: identifications and false positives at p=0.05, p=0.01, p=0.005 and p=0.0005]
  • 21. X!Tandem
      • A successful open source search engine, Craig and Beavis, RCMS 2003
      • Can be used for MS/MS (PFF) identifications
      • Based on a hyperscore ($P_i$ is either 0 or 1): $\mathrm{HyperScore} = \left( \sum_{i=0}^{n} I_i \cdot P_i \right) \cdot N_b! \cdot N_y!$
      • Relies on a hypergeometric distribution (hence hyperscore)
      • Published core algorithm, and is freely available
      • Provides hyperscore and expectancy score (the discriminating one)
      • X!Tandem is fast and can handle modifications in an iterative fashion
      • Has rapidly gained popularity as (auxiliary) search engine
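The hyperscore formula above can be evaluated directly. This is a minimal sketch of the formula as written on the slide, not of X!Tandem's actual implementation; the function name and the example values are hypothetical.

```python
from math import factorial


def hyperscore(intensities, matched, n_b, n_y):
    """(sum of I_i * P_i) * N_b! * N_y!

    intensities: peak intensities I_i
    matched:     P_i flags, 1 if peak i matches a theoretical fragment, else 0
    n_b, n_y:    number of matched b ions and y ions
    """
    dot_product = sum(i * p for i, p in zip(intensities, matched))
    return dot_product * factorial(n_b) * factorial(n_y)


# Four peaks, three of them matched: two b ions and one y ion.
print(hyperscore([10, 20, 30, 40], [1, 0, 1, 1], n_b=2, n_y=1))
```

The factorial terms reward long ion series strongly: each additional matched b or y ion multiplies the score by a growing factor.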
  • 22. X!Tandem: some additional pictures. [Figures: # results versus hyperscore; log(# results) versus hyperscore; log(# results) versus hyperscore with the significance threshold marked, E-value=e-8.2] Adapted from: Brian Searle, ProteomeSoftware, http://www.proteomesoftware.com/XTandem_edited.pdf
  • 23. A note on how the scores differ. SEQUEST: accuracy score XCorr, relative score DeltaCn. X!Tandem: accuracy score HyperScore, relative score E-Value. Adapted from: Brian Searle, ProteomeSoftware
  • 24. OMSSA
      • A successful open source search engine, Geer, JPR 2004
      • Can be used for MS/MS (PFF) identifications
      • Relies on a Poisson distribution
      • Published core algorithm, and is freely available
      • Provides an expectancy score, similar to the BLAST E-value
      • OMSSA was recently upgraded to take peak intensity into account
      • Received really good marks in a recently published comparative study
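To illustrate how a Poisson model can lead to an expectancy score, the sketch below computes the chance of seeing at least k random fragment matches when the expected number of random matches is mu, then scales it by the number of candidate peptides in the way a BLAST-style E-value is formed. This is a simplified assumption about the general idea, not OMSSA's actual calculation, and all names and numbers are hypothetical.

```python
from math import exp, factorial


def poisson_tail(k, mu):
    """P(X >= k) for X ~ Poisson(mu): chance of k or more random matches."""
    return 1.0 - sum(exp(-mu) * mu ** j / factorial(j) for j in range(k))


def e_value(k, mu, n_candidates):
    """Expected number of candidates reaching k matches purely by chance."""
    return n_candidates * poisson_tail(k, mu)


# 3 matched peaks, 0.5 random matches expected, 1000 candidate peptides.
print(e_value(3, 0.5, 1000))
# 8 matched peaks under the same assumptions gives a far smaller E-value.
print(e_value(8, 0.5, 1000))
```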
  • 25. OMSSA: some additional pictures. [Figure: yeast lysate spectrum, m/z matches of fragment peak matches versus all NCBI nr sequence library, Poisson distribution fitted] [Figure: validation of the Poisson distribution model: mean number of modelled and measured matching peaks (against the NCBI nr database) for two mass tolerances] Adapted from: Geer et al., J. Prot. Res., 2004
  • 26. COMPARATIVE STUDIES
  • 27. Kapp et al., Proteomics, 2005
  • 28. Balgley et al., Mol. Cell. Proteomics, 2007 1.6x more?!
  • 29. Combining the output of search algorithms. [Figure: Venn diagram of the identifications obtained by Mascot, SEQUEST, ProteinSolver and Phenyx; labels in extraction order: 3229, 3792, 212, 486 (+4,2%), (+9,6%), 3203, 179, 168, 40, 3186, 329, 380 (+6,5%), 501, 348 (+7,5%), 1776, 139, 96, 195, 77, 146] Figure courtesy of Dr. Christian Stephan, Medizinisches Proteom-Center, Ruhr-Universität Bochum; Human Brain Proteome Project
  • 30. SEQUENTIAL COMPARISON ALGORITHMS
  • 31. Sequence tags. The concept of sequence tags was introduced by Mann and Wilm (Mann and Wilm, Anal. Chem. 1994, 66: 4390-4399). Image from: Matthias Wilm, EMBL Heidelberg, Germany http://www.narrador.embl-heidelberg.de/GroupPages/PageLink/activities/SeqTag.html
  • 32. GutenTag, DirecTag, TagRecon
      • Tabb, Anal. Chem. 2003, Tabb, JPR 2008, Dasari, JPR 2010
      • Recent implementations of the sequence tag approach
      • Refine hits by peak mapping in a second stage to resolve ambiguities
      • Rely on an empirical fragmentation model
      • Published core algorithms, DirecTag and TagRecon freely available
      • Most useful to retrieve unexpected peptides (modifications, variations)
      • Entire workflows exist (e.g., combination with IDPicker)
  • 33. GutenTag: some additional pictures From: Tabb et al., Anal. Chem., 2003
  • 34. De novo compared to sequence tags. [Figure: example of a manual de novo interpretation of an MS/MS spectrum] No more database necessary to extract a sequence!
      Algorithms and references:
      • Lutefisk: Dancik 1999, Taylor 2000
      • Sherenga: Fernandez-de-Cossio 2000
      • PEAKS: Ma 2003, Zhang 2004
      • PepNovo: Frank 2005, Grossmann 2005
      • …
  • 35. Thank you! Questions?