Introduction to the PSI standard data formats
Dr. Juan Antonio Vizcaíno
EMBL-EBI
Hinxton, Cambridge, UK
Juan A. Vizcaíno
juan@ebi.ac.uk
WT Proteomics Bioinformatics Course 2018
Hinxton, 18 July 2018
Overview
• A couple of slides about the need for data standards
• The Proteomics Standards Initiative
• Existing data standards
Standards are needed in real life: also in bioinformatics…
With a small number of standards, converters are feasible
Data standards are needed
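The point about converters can be made concrete with a little arithmetic. The sketch below is only an illustration under a simple assumption that is not stated on the slide: every converter is one-way, and without a standard each format needs a converter to every other format. The function names are hypothetical.

```python
def direct_converters(n_formats: int) -> int:
    """One-way converters needed to translate between every ordered pair of formats."""
    return n_formats * (n_formats - 1)


def hub_converters(n_formats: int, n_standards: int = 1) -> int:
    """One-way converters needed when each format only reads and writes the shared standards."""
    return 2 * n_formats * n_standards


# Compare the two strategies for a few (hypothetical) numbers of formats.
for n in (3, 10, 20):
    print(n, direct_converters(n), hub_converters(n))
```

Under these assumptions the pairwise count grows quadratically with the number of formats, while the count through a shared standard grows linearly.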
Taken from Biocomicals, http://biocomicals.blogspot.com
Mass Spectrometry (MS)-based proteomics
• Many different workflows -> many different data types -> need for several data standards.
• Discovery mode:
  • Bottom-up proteomics
    • Data-dependent acquisition (DDA)
    • Data-independent acquisition (DIA)
  • Top-down proteomics
• Targeted mode:
  • SRM/MRM/PRM (Selected/Multiple/Parallel Reaction Monitoring)
HUPO Proteomics Standards Initiative
• Develops data standards for proteomics.
• Both data representation and annotation standards.
• Involves data producers, database providers, software producers, publishers, everyone who wants to be involved…
• Active Workgroups: MI, MS, PI, Mod and the new QC.
• Inter-group activities: MIAPE and Controlled Vocabularies.
• Started in 2002, so some experience already…
• One annual meeting in March-April, regular phone calls.
• Close interaction with the metabolomics community (MSI).
http://www.psidev.info
PSI Deliverables
• Minimum information (MIAPE) specifications: format-independent specification of minimum information guidelines.
• Formats: usually XML-based (but also tab-delimited files), capable of representing the relevant Minimum Information, plus additional detailed data for the domain.
• Controlled vocabularies: usually an OBO-style hierarchical controlled vocabulary precisely defining the metadata that are encoded in the formats.
• Databases and Tools: foster open software implementations to make the standards truly useful.
• Community interaction to ensure deposition of data in public repositories.
PSI MS Controlled Vocabulary
~2,700 terms by June 2017
Mayer et al., Database, 2013
The Ontology Lookup Service (OLS)
http://www.ebi.ac.uk/ontology-lookup/
The typical dilemma
• Data standards need to be stable to promote adoption
• Proteomics standards need to evolve very rapidly:
  • Data is inherently very complex
  • Experimental techniques are evolving all the time
Current PSI Standard File Formats for MS
• mzML: MS data
• mzIdentML: Identification
• mzQuantML: Quantitation
• mzTab: Final results
• TraML: SRM
Data formats for mass spectra data
• Binary data
• XML-based files: mzData, mzXML, mzML
• Peak lists: .dta, .pkl, .mgf, .ms2
An example of a success story: mzML
• A data format for the storage and exchange of MS output files
• Designed by merging the best aspects of both mzData and mzXML
• Developed with full participation of academic researchers and hardware and software vendors
• Expected to replace mzXML and mzData, but not expected to completely replace vendor binary formats
• Captures spectra (raw data or peak lists), chromatograms and related metadata
• Version 1.0 released in June 2008, version 1.1 released in June 2009
• Many implementations already exist
• Version 1.2 with enhanced compression is being considered for the near future
Martens et al., MCP, 2011
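As background to how mzML captures spectra: the numeric arrays of a spectrum (m/z and intensity values) are stored as base64-encoded text, optionally zlib-compressed, holding little-endian 32-bit or 64-bit floats. The following is a minimal sketch of that round trip using only the Python standard library. The function names and the example values are hypothetical, and a real reader would take the precision and compression settings from the file's controlled-vocabulary annotations rather than from function arguments.

```python
import base64
import struct
import zlib


def encode_array(values, precision=64, compress=True):
    """Pack floats as little-endian binary, optionally zlib-compress, then base64-encode."""
    code = "d" if precision == 64 else "f"
    raw = struct.pack("<%d%s" % (len(values), code), *values)
    if compress:
        raw = zlib.compress(raw)
    return base64.b64encode(raw).decode("ascii")


def decode_array(text, precision=64, compressed=True):
    """Reverse of encode_array: base64-decode, optionally decompress, unpack floats."""
    raw = base64.b64decode(text)
    if compressed:
        raw = zlib.decompress(raw)
    width = 8 if precision == 64 else 4
    code = "d" if precision == 64 else "f"
    count = len(raw) // width
    return list(struct.unpack("<%d%s" % (count, code), raw))


# Hypothetical m/z values, chosen to be exactly representable in 32 and 64 bits.
mz_values = [100.0, 200.5, 300.25]
encoded = encode_array(mz_values)
decoded = decode_array(encoded)
print(decoded == mz_values)
```

In practice one of the existing parser libraries would normally handle this step; the sketch only shows what such a library has to do for each array.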
An example of a success story: mzML
• The most popular search engines support mzML
• Many parser libraries available
• Conversion from raw files into mzML
http://www.psidev.info/mzml_1_0_0
Data formats for output from search engines
• mzIdentML, Mascot .dat, SEQUEST .out, SpectrumMill .spo, pep.xml, prot.xml
• Only qualitative data!
mzIdentML
• XML-based data standard for peptide and protein identifications, e.g. following database search and protein inference
• Sections for all PSMs, proteins/protein groups, protocols/parameters, etc.
• Timeline:
  • Original version 1.0 in Aug 2009
  • Version 1.1 stable (Aug 2011); original manuscript published in MCP in 2012*
    • Well supported by many open-source and commercial software tools
    • Fully supported by ProteomeXchange resources
  • 2012 onwards (mzIdentML 1.2): extended use cases
    • Better support for protein grouping. Manuscript published in Proteomics**
  • 2017: mzIdentML 1.2 release; manuscript published in MCP***
* Jones, A. R., Eisenacher, M., Mayer, G., Kohlbacher, O., et al., The mzIdentML data standard for mass spectrometry-based proteomics results. Molecular & Cellular Proteomics 2012, 11, M111.014381.
** Seymour, S. L., Farrah, T., Binz, P. A., Chalkley, R. J., et al., A standardized framing for reporting protein identifications in mzIdentML 1.2. Proteomics 2014, 14, 2389-2399.
*** Vizcaíno, J. A., Mayer, G., Perkins, S., Barsnes, H., et al., The mzIdentML Data Standard Version 1.2, Supporting Advances in Proteome Informatics. Molecular & Cellular Proteomics 2017, 16, 1275-1285.
mzIdentML: data standard for peptide and protein identification data
• mzIdentML 1.1 (2011-2012)
• mzIdentML 1.2 (2017)
  • New support for:
    - Cross-linking approaches
    - Peptide-level scores
    - Modification localization scores
    - Proteogenomics approaches
  • Improved support for:
    - Protein inference
    - Pre-fractionation
    - De novo sequencing
    - Spectral library searches
• Increasingly supported by the most-used proteomics software and databases
  • jmzIdentML, mzid Library, ms-data-core-api, MyriMatch, ProteoAnnotator, PIA, ProCon
mzQuantML status
• XML-based standard for quantification data
• Can report tables of data (QuantLayers): columns are StudyVariables, Assays or Ratios; rows are ProteinGroups, Proteins or Peptides
• Can also capture 2D coordinates of quantified regions in LC-MS (Features)
Timeline
• Work started in Oct 2011 and progressed at various PSI meetings
• Completed the PSI process in Feb 2013 (version 1.0 release)
• Supports label-free (intensity), label-free (spectral counting), MS2 tag techniques (e.g. iTRAQ) and MS1 label techniques (e.g. SILAC)*
• Updated in 2013-2014 to support SRM as a new technique**; mzqLibrary***
• 2015: mzQuantML 1.0.1, a minor update with SRM included
Open issues
• Not widely supported by software. No active development. Efforts are being put into mzTab support instead
* Walzer et al., MCP, 2013, 12(8), 2332-2340. doi: 10.1074/mcp.O113.028506
** Qi et al., Proteomics, 2015, 15(18), 3152-3162.
*** Qi et al., Proteomics, 2015, 15, 2592-2596.
mzTab – Aims and concept
• To provide a simple and efficient way of exchanging results from MS approaches.
• Simpler summary report of the experimental results:
  • Peptides and proteins identified in a given experimental setting
  • Small molecules identified
  • Reported quantification values
  • Technical and biological metadata
• Easier to parse and use by the research community, systems biologists and providers of knowledge bases.
• It can be used by non-experts in bioinformatics.
• It does not aim to replace mzIdentML and mzQuantML.
mzTab - Sections
• Metadata (key-value pairs): basic information about experiment and sample
• Protein (table-based): basic information about protein identifications
• Peptide (table-based): information about quantified peptides
• PSM (table-based): information about identified spectra
• Small Molecule (table-based): basic information about identified small molecules
Griss et al., MCP, 2014
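Because mzTab is tab-delimited and every line starts with a three-letter prefix naming its section (MTD for metadata, PRH/PRT for the protein header and rows, PEH/PEP for peptides, PSH/PSM for PSMs, SMH/SML for small molecules, COM for comments), the sections listed above can be separated with a few lines of code. The sketch below is a simplified reader, not a validating parser: the function name is hypothetical, and the embedded example is illustrative content only, with far fewer columns than a real mzTab file requires.

```python
from collections import defaultdict

# Maps each table header prefix to the prefix used by its data rows.
HEADER_TO_ROW = {"PRH": "PRT", "PEH": "PEP", "PSH": "PSM", "SMH": "SML"}


def parse_mztab(text):
    """Split mzTab text into a metadata dict and per-section lists of row dicts."""
    metadata = {}
    headers = {}
    tables = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        prefix = fields[0]
        if prefix == "COM":
            continue  # comment lines carry no data
        if prefix == "MTD" and len(fields) >= 3:
            metadata[fields[1]] = fields[2]  # key-value pair
        elif prefix in HEADER_TO_ROW:
            headers[HEADER_TO_ROW[prefix]] = fields[1:]
        elif prefix in headers:
            tables[prefix].append(dict(zip(headers[prefix], fields[1:])))
    return metadata, dict(tables)


# Hypothetical, heavily reduced example (not a complete or valid mzTab file).
EXAMPLE = "\n".join([
    "COM\tIllustrative content only",
    "MTD\tmzTab-version\t1.0.0",
    "MTD\tdescription\tHypothetical example",
    "PRH\taccession\tdescription",
    "PRT\tP12345\tExample protein",
    "PSH\tsequence\taccession\tcharge",
    "PSM\tPEPTIDEK\tP12345\t2",
])

metadata, tables = parse_mztab(EXAMPLE)
print(metadata)
print(tables)
```

The fact that a plain line split is enough to get usable tables reflects the design goal of a format that non-experts can open in a spreadsheet or a short script.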
Metadata section - Example
Unify exchange of transitions with TraML
• PSI’s TraML (Transitions Markup Language)
• Format for encoding SRM/MRM transitions
• Version 1.0.0 now released and published in MCP (Deutsch et al. 2012)
[Diagram: TraML as the exchange format linking journal articles, transition databases, Excel sheets, SRM analysis software and instruments]
PSI document process
• Every data standard has to undergo a thorough review process…
• In practice, two review processes happen in parallel: the PSI review and the manuscript review.
Proteogenomics related data formats
• Two formats are under development: proBed (version 1 available) and proBAM (still under review).
• Same overall objective: to map identified peptides to genome coordinates.
• Different levels of detail:
  • proBed is tab-delimited and simpler, based on the original BED format. Lower level of detail.
  • proBAM is based on the original SAM/BAM formats, widely used in genomics. Much higher level of detail.
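To illustrate what mapping a peptide to genome coordinates looks like in a BED-style line, the sketch below reads the first six standard BED columns (chromosome, 0-based start, exclusive end, name, score, strand). It is a minimal illustration only: any further columns, including proBed-specific ones, are ignored, and the record type, function name and example line are hypothetical.

```python
from typing import NamedTuple


class BedRecord(NamedTuple):
    chrom: str
    start: int   # 0-based start position
    end: int     # end position, exclusive
    name: str
    score: int
    strand: str


def parse_bed_line(line: str) -> BedRecord:
    """Parse the first six tab-separated BED columns; extra columns are ignored."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 6:
        raise ValueError("expected at least 6 tab-separated columns")
    return BedRecord(
        chrom=fields[0],
        start=int(fields[1]),
        end=int(fields[2]),
        name=fields[3],
        score=int(fields[4]),
        strand=fields[5],
    )


# Hypothetical line: a peptide mapped to a 30-base region on the forward strand.
line = "chr1\t1000\t1030\thypothetical_peptide\t900\t+"
record = parse_bed_line(line)
length = record.end - record.start
print(record, length)
```

Because start is 0-based and end is exclusive, the length of the mapped region is simply end minus start.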
Provide your own data to genome browsers
TrackHubs in Genome Browsers
Data standard publications
• mzML (for MS data): Martens et al., MCP, 2011
• mzIdentML (peptide/protein IDs): Jones et al., MCP, 2012; Vizcaíno et al., MCP, 2017
• TraML (for SRM transitions): Deutsch et al., MCP, 2012
• mzQuantML (for quantitative data): Walzer et al., MCP, 2013
• mzTab (ID and quantification): Griss et al., MCP, 2014
• proBed & proBAM (proteogenomics): Menschaert et al., Genome Biology, 2018
Importance of making software available
• jmzML (https://github.com/PRIDE-Utilities/jmzml): Cote et al., Proteomics, 2009
• jmzIdentML (https://github.com/PRIDE-Utilities/jmzidentML): Reisinger et al., Proteomics, 2012
• jmzReader (https://github.com/PRIDE-Utilities/jmzReader): Griss et al., Proteomics, 2012
• jmzQuantML (https://github.com/UKQIDA/jmzquantml): Qi et al., Proteomics, 2014
• jmzTab (https://github.com/PRIDE-Utilities/jmzTab): Xu et al., Proteomics, 2014
• ms-data-core-api (https://github.com/PRIDE-Utilities/ms-data-core-api): Perez-Riverol et al., Bioinformatics, 2015
PSI promotes implementations. The reference libraries are always open source and can be used by anyone!
And also… protein-protein interactions
PSI-MI XML: XML-based format
• Version 2.5 is the working version
• Version 3.0 under development
MITAB: tab-delimited format
Summary slide
Deutsch et al., JPR, 2017
Do you want to learn more?
Juan A. Vizcaíno
juan@ebi.ac.uk
WT Proteomics Bioinformatics Course 2018
Hinxton, 18 July 2018

More Related Content

PDF
Proteomics repositories
PDF
Reusing and integrating public proteomics data to improve our knowledge of th...
PDF
Introduction to the Proteomics Bioinformatics Course 2018
PDF
PRIDE resources and ProteomeXchange
PDF
Reuse of public proteomics data
PPTX
Introduction to the Proteomics Bioinformatics Course 2017
PPTX
Proteomics repositories
PPTX
Introduction to the Proteomics Bioinformatics Course 2016
Proteomics repositories
Reusing and integrating public proteomics data to improve our knowledge of th...
Introduction to the Proteomics Bioinformatics Course 2018
PRIDE resources and ProteomeXchange
Reuse of public proteomics data
Introduction to the Proteomics Bioinformatics Course 2017
Proteomics repositories
Introduction to the Proteomics Bioinformatics Course 2016

What's hot (20)

PPTX
Proteomics data standards
PPTX
Reuse of public proteomics data
PPTX
PRIDE and ProteomeXchange
PPTX
Introduction to proteomics
PPSX
Bioinformatic tools in Pheromone technology
PPT
Bioinformatics Information Sources
PDF
Bioinformatics resources and search tools - report on summer training proj...
PPTX
Interoperable Data for KnetMiner and DFW Use Cases
PPTX
Application of bioinformatics in climate smart horticulture
PPTX
Is it feasible to identify novel biomarkers by mining public proteomics data?
PPTX
FAIR Agronomy, where are we? The KnetMiner Use Case
PPTX
KnetMiner - EBI Workshop 2017
PPTX
KnetMiner - Knowledge Network Miner
PPTX
Introduction to Gene Mining Part A: BLASTn-off!
PPTX
Bioinformatics
PDF
Multi-Omics Bioinformatics across Application Domains
PPTX
AgriSchemas: Sharing Agrifood data with Bioschemas
PDF
PubChem for drug discovery in the age of big data and artificial intelligence
PPT
Pathology is being disrupted by Data Integration, AI & Blockchain
PDF
Basics of Data Analysis in Bioinformatics
Proteomics data standards
Reuse of public proteomics data
PRIDE and ProteomeXchange
Introduction to proteomics
Bioinformatic tools in Pheromone technology
Bioinformatics Information Sources
Bioinformatics resources and search tools - report on summer training proj...
Interoperable Data for KnetMiner and DFW Use Cases
Application of bioinformatics in climate smart horticulture
Is it feasible to identify novel biomarkers by mining public proteomics data?
FAIR Agronomy, where are we? The KnetMiner Use Case
KnetMiner - EBI Workshop 2017
KnetMiner - Knowledge Network Miner
Introduction to Gene Mining Part A: BLASTn-off!
Bioinformatics
Multi-Omics Bioinformatics across Application Domains
AgriSchemas: Sharing Agrifood data with Bioschemas
PubChem for drug discovery in the age of big data and artificial intelligence
Pathology is being disrupted by Data Integration, AI & Blockchain
Basics of Data Analysis in Bioinformatics
Ad

Similar to Introduction to the PSI standard data formats (20)

PPTX
Proteomics data standards
PPTX
Proteomics data standards
PPTX
Proteomics data standards
PPTX
PSI-Proteome Informatics update
PPTX
Mass Spectrometry Informatics formats in progress
PPTX
Experiences to learn from the MS proteomics field
PPTX
The Proteomics Standards Initiative (PSI)
PDF
Standarization in Proteomics: From raw data to metadata files
PPT
EMBL-EBI Proteomics data resources and services
PPTX
ProteomeXchange_and_PRIDE_Semmeting_2015
PPTX
Data formats and ontologies
PPTX
Pride and ProteomeXchange
PPTX
Proteomics public data resources: enabling "big data" analysis in proteomics
PPTX
PRIDE-ProteomeXchange
PPTX
PRIDE and ProteomeXchange: Training webinar
PDF
A proteomics data “gold mine” at your disposal: Now that the data is there, w...
PPTX
Proteomics repositories
PPTX
Reuse of public proteomics data
PPTX
Proteomics repositories
PPTX
Public proteomics data: a (mostly unexploited) gold mine for computational re...
Proteomics data standards
Proteomics data standards
Proteomics data standards
PSI-Proteome Informatics update
Mass Spectrometry Informatics formats in progress
Experiences to learn from the MS proteomics field
The Proteomics Standards Initiative (PSI)
Standarization in Proteomics: From raw data to metadata files
EMBL-EBI Proteomics data resources and services
ProteomeXchange_and_PRIDE_Semmeting_2015
Data formats and ontologies
Pride and ProteomeXchange
Proteomics public data resources: enabling "big data" analysis in proteomics
PRIDE-ProteomeXchange
PRIDE and ProteomeXchange: Training webinar
A proteomics data “gold mine” at your disposal: Now that the data is there, w...
Proteomics repositories
Reuse of public proteomics data
Proteomics repositories
Public proteomics data: a (mostly unexploited) gold mine for computational re...
Ad

More from Juan Antonio Vizcaino (13)

PDF
ELIXIR Implementation Study: “Mining the Proteome: Enabling Automated Process...
PDF
ProteomeXchange update
PDF
Developing open data analysis pipelines in the cloud: Enabling the ‘big data’...
PDF
The ELIXIR Proteomics community
PDF
The ELIXIR Proteomics Community
PPTX
The ProteomeXchange Consoritum: 2017 update
PPTX
How to run and maintain a popular biological data repository?
PPTX
PRIDE and ProteomeXchange: A golden age for working with public proteomics data
PPTX
The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...
PPTX
ProteomeXchange update 2017
PPTX
Enabling automated processing and analysis of large-scale proteomics data
PPTX
Introduction to EBI for Proteomics in ELIXIR
PPTX
Reuse of public data in proteomics
ELIXIR Implementation Study: “Mining the Proteome: Enabling Automated Process...
ProteomeXchange update
Developing open data analysis pipelines in the cloud: Enabling the ‘big data’...
The ELIXIR Proteomics community
The ELIXIR Proteomics Community
The ProteomeXchange Consoritum: 2017 update
How to run and maintain a popular biological data repository?
PRIDE and ProteomeXchange: A golden age for working with public proteomics data
The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...
ProteomeXchange update 2017
Enabling automated processing and analysis of large-scale proteomics data
Introduction to EBI for Proteomics in ELIXIR
Reuse of public data in proteomics

Recently uploaded (20)

PPTX
Introduction to Cardiovascular system_structure and functions-1
PPT
The World of Physical Science, • Labs: Safety Simulation, Measurement Practice
PDF
Unveiling a 36 billion solar mass black hole at the centre of the Cosmic Hors...
PPTX
Protein & Amino Acid Structures Levels of protein structure (primary, seconda...
PPTX
Classification Systems_TAXONOMY_SCIENCE8.pptx
PPTX
Taita Taveta Laboratory Technician Workshop Presentation.pptx
PDF
CAPERS-LRD-z9:AGas-enshroudedLittleRedDotHostingaBroad-lineActive GalacticNuc...
PPTX
cpcsea ppt.pptxssssssssssssssjjdjdndndddd
PPTX
ECG_Course_Presentation د.محمد صقران ppt
PPTX
INTRODUCTION TO EVS | Concept of sustainability
PPTX
BIOMOLECULES PPT........................
PDF
Placing the Near-Earth Object Impact Probability in Context
PPT
protein biochemistry.ppt for university classes
PPTX
TOTAL hIP ARTHROPLASTY Presentation.pptx
PDF
. Radiology Case Scenariosssssssssssssss
PDF
HPLC-PPT.docx high performance liquid chromatography
PDF
Biophysics 2.pdffffffffffffffffffffffffff
PPTX
The KM-GBF monitoring framework – status & key messages.pptx
PDF
Formation of Supersonic Turbulence in the Primordial Star-forming Cloud
PDF
An interstellar mission to test astrophysical black holes
Introduction to Cardiovascular system_structure and functions-1
The World of Physical Science, • Labs: Safety Simulation, Measurement Practice
Unveiling a 36 billion solar mass black hole at the centre of the Cosmic Hors...
Protein & Amino Acid Structures Levels of protein structure (primary, seconda...
Classification Systems_TAXONOMY_SCIENCE8.pptx
Taita Taveta Laboratory Technician Workshop Presentation.pptx
CAPERS-LRD-z9:AGas-enshroudedLittleRedDotHostingaBroad-lineActive GalacticNuc...
cpcsea ppt.pptxssssssssssssssjjdjdndndddd
ECG_Course_Presentation د.محمد صقران ppt
INTRODUCTION TO EVS | Concept of sustainability
BIOMOLECULES PPT........................
Placing the Near-Earth Object Impact Probability in Context
protein biochemistry.ppt for university classes
TOTAL hIP ARTHROPLASTY Presentation.pptx
. Radiology Case Scenariosssssssssssssss
HPLC-PPT.docx high performance liquid chromatography
Biophysics 2.pdffffffffffffffffffffffffff
The KM-GBF monitoring framework – status & key messages.pptx
Formation of Supersonic Turbulence in the Primordial Star-forming Cloud
An interstellar mission to test astrophysical black holes

Introduction to the PSI standard data formats

  • 1. Introduction to the PSI standard data formats Dr. Juan Antonio Vizcaíno EMBL-EBI Hinxton, Cambridge, UK
  • 2. Juan A. Vizcaíno juan@ebi.ac.uk WT Proteomics Bioinformatics Course 2018 Hinxton, 18 July 2018 Overview • A couple of slides about the need of data standards • The Proteomics Standards Initiative • Existing data standards
  • 3. Juan A. Vizcaíno juan@ebi.ac.uk WT Proteomics Bioinformatics Course 2018 Hinxton, 18 July 2018 Standards are needed in real life: also in bioinformatics… With a small number of standards, converters are feasible Data standards are needed
  • 4. Juan A. Vizcaíno juan@ebi.ac.uk WT Proteomics Bioinformatics Course 2018 Hinxton, 18 July 2018 Taken from Biocomicals, http://biocomicals.blogspot.com
  • 5. Juan A. Vizcaíno juan@ebi.ac.uk WT Proteomics Bioinformatics Course 2018 Hinxton, 18 July 2018 Mass Spectrometry (MS)-based proteomics • Many different workflows -> Many different data types -> Need for several data standards. • Discovery mode: • Bottom-up proteomics • Data dependent acquisition (DDA) • Data independent acquisition (DIA) • Top down proteomics • Targeted mode: • SRM/MRM/PRM (Selected/ Multiple/Parallel Reaction Monitoring)
  • 6. Juan A. Vizcaíno juan@ebi.ac.uk WT Proteomics Bioinformatics Course 2018 Hinxton, 18 July 2018 Overview • A couple of slides about the need of data standards • The Proteomics Standards Initiative • Existing data standards
  • 7. Juan A. Vizcaíno juan@ebi.ac.uk WT Proteomics Bioinformatics Course 2018 Hinxton, 18 July 2018 •Develops data standards for proteomics. •Both data representation and annotation standards. •Involves data producers, database providers, software producers, publishers, everyone who wants to be involved… •Active Workgroups: MI, MS, PI, Mod and the new QC. •Inter-group activities: MIAPE and Controlled Vocabularies. •Started in 2002, so some experience already… •One annual meeting in March-April, regular phone calls. •Close interaction with the metabolomics community (MSI). http://www.psidev.info HUPO Proteomics Standards Initiative
  • 8. Juan A. Vizcaíno juan@ebi.ac.uk WT Proteomics Bioinformatics Course 2018 Hinxton, 18 July 2018 PSI Deliverables •Minimum information (MIAPE) specifications: Format-independent specification of minimum information guidelines. •Formats: Usually XML-based (but also tab-delimited files), capable of representing the relevant Minimum Information, plus additional detailed data for the domain. •Controlled vocabularies: Usually an OBO-style hierarchical controlled vocabulary precisely defining the metadata that are encoded in the formats. •Databases and Tools: Foster open software implementations to make the standards truly useful. •Community interaction to ensure deposition of data in public repositories.
  • 9. Juan A. Vizcaíno juan@ebi.ac.uk WT Proteomics Bioinformatics Course 2018 Hinxton, 18 July 2018 PSI MS Controlled Vocabulary Mayer et al., Database, 2013~2,700 terms by June 2017
  • 10. Juan A. Vizcaíno juan@ebi.ac.uk WT Proteomics Bioinformatics Course 2018 Hinxton, 18 July 2018 The Ontology Lookup Service (OLS) http://www.ebi.ac.uk/ontology-lookup/
  • 11. Juan A. Vizcaíno juan@ebi.ac.uk WT Proteomics Bioinformatics Course 2018 Hinxton, 18 July 2018 Overview • A couple of slides about the need of data standards • The Proteomics Standards Initiative • Existing data standards
  • 12. Juan A. Vizcaíno juan@ebi.ac.uk WT Proteomics Bioinformatics Course 2018 Hinxton, 18 July 2018 The typical dilemma •Data standards need to be stable to promote adoption •Proteomics standards need to evolve very rapidly: • Data is inherently very complex • Experimental techniques are evolving all the time
  • 13. Juan A. Vizcaíno juan@ebi.ac.uk WT Proteomics Bioinformatics Course 2018 Hinxton, 18 July 2018 Current PSI Standard File Formats for MS • mzMLMS data • mzIdentMLIdentification • mzQuantMLQuantitation • mzTabFinal Results • TraMLSRM
  • 14. Juan A. Vizcaíno juan@ebi.ac.uk WT Proteomics Bioinformatics Course 2018 Hinxton, 18 July 2018 Binary data mzData mzXML mzML XML-based files .dta, .pkl, .mgf, .ms2 Peak lists Data formats for mass spectra data
  • 15. Juan A. Vizcaíno juan@ebi.ac.uk WT Proteomics Bioinformatics Course 2018 Hinxton, 18 July 2018 An example of success story: mzML • A data format for the storage and exchange of MS output files • Designed by merging the best aspects of both mzData and mzXML • Developed with full participation of academic researchers, hardware and software vendors • Expected to replace mzXML and mzData, but not expected to completely replace vendor binary formats • Captures spectra (raw data or peak lists), chromatograms and related metadata • Version 1.0 released in June 2008, v1.1 released in June 2009 • Many implementations already exist • Version 1.2 with enhanced compression considered for the near future. Martens et al., MCP, 2011
  • 16. Juan A. Vizcaíno juan@ebi.ac.uk WT Proteomics Bioinformatics Course 2018 Hinxton, 18 July 2018 An example of success story: mzML The most popular search engines support mzML Many parser libraries available Conversion from raw files into mzMLhttp://www.psidev.info/mzml_1_0_0
  • 17. Juan A. Vizcaíno juan@ebi.ac.uk WT Proteomics Bioinformatics Course 2018 Hinxton, 18 July 2018 Current PSI Standard File Formats for MS • mzMLMS data • mzIdentMLIdentification • mzQuantMLQuantitation • mzTabFinal Results • TraMLSRM
  • 18. Juan A. Vizcaíno juan@ebi.ac.uk WT Proteomics Bioinformatics Course 2018 Hinxton, 18 July 2018 mzIdentML, mascot .dat, sequest .out, SpectrumMill .spo pep.xml, prot.xml Only qualitative data! Data formats for output from search engines
  • 19. Juan A. Vizcaíno juan@ebi.ac.uk WT Proteomics Bioinformatics Course 2018 Hinxton, 18 July 2018 mzIdentML • XML-based data standard for peptide and protein identifications e.g. following database search and protein inference • Sections for all PSMs, proteins/protein groups, protocols/parameters etc. • Timeline: • Original 1.0 version in Aug 2009 • Version 1.1 stable (Aug 2011); Original manuscript published in MCP in 2012* • Well supported in lots of open source and commercial software • Fully supported by ProteomeXchange resources • 2012 onwards (mzIdentML 1.2): extended use cases • Better support for protein grouping. Manuscript published in Proteomics ** • 2017 mzIdentML 1.2 release; manuscript published at MCP*** * Jones, A. R., Eisenacher, M., Mayer, G., Kohlbacher, O., et al., The mzIdentML data standard for mass spectrometry-based proteomics results. Molecular & Cellular Proteomics 2012, 11, M111.014381. ** Seymour, S. L., Farrah, T., Binz, P. A., Chalkley, R. J., et al., A standardized framing for reporting protein identifications in mzIdentML 1.2. Proteomics 2014, 14, 2389-2399. *** Vizcaíno, J. A., Mayer G., Perkins S., Barsnes H., et al., The mzIdentML Data Standard Version 1.2, Supporting advances in Proteome Informatics. Molecular & Cellular Proteomics 2017, 16, 1275-1285.
  • 20. Juan A. Vizcaíno juan@ebi.ac.uk WT Proteomics Bioinformatics Course 2018 Hinxton, 18 July 2018 mzIdentML 1.1 Data standard for peptide and protein identification data mzIdentML 1.2 2011- 2012 2017 New support for: - Cross-linking approaches - Peptide level scores - Modification localization scores - Proteogenomics approaches Improved support for: - Protein inference - Pre-fractionation - de novo sequencing - Spectral library searches Increasingly supported by the most- used proteomics software and databases jmzIdentML mzid Library ms-data-core-api MyriMatch ProteoAnnotator PIA ProCon
  • 21. Juan A. Vizcaíno juan@ebi.ac.uk WT Proteomics Bioinformatics Course 2018 Hinxton, 18 July 2018 Current PSI Standard File Formats for MS • mzMLMS data • mzIdentMLIdentification • mzQuantMLQuantitation • mzTabFinal Results • TraMLSRM
  • 22. Juan A. Vizcaíno juan@ebi.ac.uk WT Proteomics Bioinformatics Course 2018 Hinxton, 18 July 2018 mzQuantML status • XML-based standard for quantification data Can report tables of data (QuantLayers), columns are: StudyVariables, Assays or Ratios, rows are ProteinGroups, Proteins or Peptides • Can also capture 2D coordinates of quantified regions in LC-MS (Features) Timeline • Work started in Oct 2011, and progressed at various PSI meetings • Completed PSI process in Feb 2013 – version 1.0 release • Supports label-free (intensity), label-free (spectral counting), MS2 tag techniques (e.g. iTRAQ) and MS1 label techniques e.g. SILAC* • Updated in 2013-2014 to support SRM as a new technique**; mzqLibrary*** • 2015, mzQuantML 1.0.1 – minor update with SRM included Open issues • Not widely supported by software. No live development. Efforts are being put into mzTab support instead *Walzer et al. MCP 2013 Aug;12(8):2332-40. doi: 10.1074/mcp.O113.028506 **Qi et al. PROTEOMICS, 2015, 15(18):3152-62 *** Qi et al PROTEOMICS 2015, 15, 2592-2596.
  • 23. Juan A. Vizcaíno juan@ebi.ac.uk WT Proteomics Bioinformatics Course 2018 Hinxton, 18 July 2018 Current PSI Standard File Formats for MS • mzMLMS data • mzIdentMLIdentification • mzQuantMLQuantitation • mzTabFinal Results • TraMLSRM
  • 24. Juan A. Vizcaíno juan@ebi.ac.uk WT Proteomics Bioinformatics Course 2018 Hinxton, 18 July 2018 mzTab – Aims and concept • To provide a simple and efficient way of exchanging results from MS approaches. • Simpler summary report of the experimental results • Peptides and proteins identified in a given experimental setting • Small molecules identified • Reported quantification values • Technical and biological metadata • Easier to parse and use by the research community, systems biologists as well as providers of knowledge bases. • It can be used by non-experts in bioinformatics. • It does not aim to replace mzIdentMl and mzQuantML
  • 25. mzTab - Sections
    • Metadata: basic information about experiment and sample (key-value pairs)
    • Protein: basic information about protein identifications (table-based)
    • Peptide: information about quantified peptides (table-based)
    • PSM: information about identified spectra (table-based)
    • Small Molecule: basic information about identified small molecules (table-based)
    Griss et al., MCP, 2014
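Because mzTab is tab-delimited and each line starts with a prefix naming its section (MTD for metadata; PRH/PRT, PEH/PEP, PSH/PSM and SMH/SML for the header and data rows of the protein, peptide, PSM and small molecule tables), splitting a file into the sections listed above takes only a few lines. The sketch below is a minimal illustration, not a validating parser and not the jmzTab API: the function name and the toy file are assumptions, and the toy file omits most columns a real mzTab file would require.

```python
from collections import defaultdict

# Map each mzTab line prefix to the section it belongs to.
SECTION_BY_PREFIX = {
    "MTD": "metadata",
    "PRH": "protein", "PRT": "protein",
    "PEH": "peptide", "PEP": "peptide",
    "PSH": "psm", "PSM": "psm",
    "SMH": "small_molecule", "SML": "small_molecule",
}
HEADER_PREFIXES = {"PRH", "PEH", "PSH", "SMH"}


def parse_mztab(text):
    """Split mzTab text into a metadata dict and per-section row lists."""
    metadata = {}
    headers = {}
    tables = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines separate sections
        fields = line.split("\t")
        prefix = fields[0]
        section = SECTION_BY_PREFIX.get(prefix)
        if section is None:
            continue  # e.g. COM comment lines are skipped
        if prefix == "MTD":
            # Metadata lines are key-value pairs: MTD <key> <value>
            if len(fields) >= 3:
                metadata[fields[1]] = fields[2]
        elif prefix in HEADER_PREFIXES:
            headers[section] = fields[1:]
        else:
            # Pair each data cell with the column name from the header row.
            tables[section].append(dict(zip(headers.get(section, []), fields[1:])))
    return metadata, dict(tables)


# Toy input for illustration only (far fewer columns than a real file).
EXAMPLE = "\n".join([
    "\t".join(["MTD", "mzTab-version", "1.0.0"]),
    "\t".join(["MTD", "description", "Toy example"]),
    "",
    "\t".join(["PRH", "accession", "description"]),
    "\t".join(["PRT", "P12345", "Example protein"]),
    "",
    "\t".join(["PSH", "sequence", "PSM_ID", "accession"]),
    "\t".join(["PSM", "PEPTIDEK", "1", "P12345"]),
])

metadata, tables = parse_mztab(EXAMPLE)
```

After the call, `metadata` holds the key-value pairs and `tables` holds one list of row dictionaries per table-based section, which is the kind of simple access the format is designed to offer.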
  • 26. Metadata section - Example
  • 27. Current PSI Standard File Formats for MS • mzML: MS data • mzIdentML: Identification • mzQuantML: Quantitation • mzTab: Final results • TraML: SRM
  • 28. Unify exchange of transitions with TraML
    • PSI’s TraML (Transitions Markup Language)
    • Format for encoding SRM/MRM transitions
    • Version 1.0.0 now released and published in MCP (Deutsch et al. 2012)
    (Diagram labels: journal articles, transition databases, Excel sheets, SRM analysis software, instruments, TraML)
  • 29. PSI document process
    • Every data standard has to undergo a thorough review process.
    • In practice, two review processes happen in parallel: the PSI review and the manuscript review.
  • 30. Proteogenomics related data formats
    • Two formats are under development: proBed (version 1 available) and proBAM (still under review).
    • Same overall objective: to map identified peptides to genome coordinates.
    • Different levels of detail:
      • proBed is tab-delimited and simpler, based on the original BED format. Lower level of detail.
      • proBAM is based on the original SAM/BAM formats, widely used in genomics. Much higher level of detail.
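Since proBed builds on BED, each line begins with the usual BED fields that place a feature on the genome: chrom, chromStart (0-based), chromEnd (exclusive), name, score and strand. The sketch below reads only those leading fields to show how a peptide maps to genome coordinates. It is an assumption-laden illustration: the record class, the function name and the example line are hypothetical, and the proteomics-specific columns that proBed adds are ignored.

```python
from dataclasses import dataclass


@dataclass
class BedRecord:
    chrom: str
    start: int   # 0-based, inclusive (BED convention)
    end: int     # exclusive (BED convention)
    name: str
    strand: str


def parse_bed_line(line):
    """Parse the leading BED fields of one tab-delimited line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 6:
        raise ValueError("expected at least 6 tab-separated BED fields")
    # fields[4] is the BED score, which this sketch does not use.
    return BedRecord(
        chrom=fields[0],
        start=int(fields[1]),
        end=int(fields[2]),
        name=fields[3],
        strand=fields[5],
    )


# Hypothetical example: a peptide mapped to 30 bases on chr1, forward strand.
record = parse_bed_line("chr1\t1000\t1030\tpeptide_1\t0\t+")
span = record.end - record.start
```

With half-open coordinates the genomic span is simply `end - start`.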
  • 31. Proteogenomics related data formats
  • 32. Provide your own data to genome browsers
  • 33. TrackHubs in Genome Browsers
  • 34. Data standard publications
    • mzML (for MS data): Martens et al., MCP, 2011
    • mzIdentML (peptide/protein IDs): Jones et al., MCP, 2012; Vizcaíno et al., MCP, 2017
    • TraML (for SRM transitions): Deutsch et al., MCP, 2012
    • mzQuantML (for quantitative data): Walzer et al., MCP, 2013
    • mzTab (ID and quantification): Griss et al., MCP, 2014
    • proBed & proBAM (proteogenomics): Menschaert et al., Genome Biology, 2018
  • 35. Importance of making software available
    • jmzML (https://github.com/PRIDE-Utilities/jmzml): Cote et al., Proteomics, 2009
    • jmzIdentML (https://github.com/PRIDE-Utilities/jmzidentML): Reisinger et al., Proteomics, 2012
    • jmzReader (https://github.com/PRIDE-Utilities/jmzReader): Griss et al., Proteomics, 2012
    • jmzQuantML (https://github.com/UKQIDA/jmzquantml): Qi et al., Proteomics, 2014
    • jmzTab (https://github.com/PRIDE-Utilities/jmzTab): Xu et al., Proteomics, 2014
    • ms-data-core-api (https://github.com/PRIDE-Utilities/ms-data-core-api): Perez-Riverol et al., Bioinformatics, 2015
    PSI promotes implementations. The reference libraries are always open source and can be used by anyone!
  • 36. And also… protein-protein interactions
    • PSI-XML: XML-based format
      • Version 2.5 is the working version
      • Version 3.0 under development
    • MITAB: tab-delimited format
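As a tab-delimited format, MITAB can be read line by line with ordinary string handling. In MITAB the first two columns carry the identifiers of interactor A and interactor B, written as `database:accession` with alternatives separated by `|`. The sketch below extracts only those two columns; the function name and the example accessions are hypothetical, and the handling of quoted values and of the remaining columns is deliberately left out.

```python
def parse_mitab_ids(line):
    """Return the identifier lists for interactor A and interactor B."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < 2:
        raise ValueError("expected at least two tab-separated columns")

    def split_ids(cell):
        # Each identifier looks like "database:accession"; "|" separates alternatives.
        ids = []
        for item in cell.split("|"):
            database, _, accession = item.partition(":")
            ids.append((database, accession))
        return ids

    return split_ids(columns[0]), split_ids(columns[1])


# Hypothetical example line, truncated to three columns for brevity.
ids_a, ids_b = parse_mitab_ids("uniprotkb:P12345\tuniprotkb:Q67890\t-")
```

Each returned list holds `(database, accession)` tuples, one per identifier given for that interactor.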
  • 37. Summary slide. Deutsch et al., JPR, 2017
  • 38. Do you want to learn more?