Public proteomics data: a (mostly unexploited) gold mine for computational researchers
Dr. Juan Antonio Vizcaíno
Proteomics Team Leader
EMBL-European Bioinformatics Institute
Hinxton, Cambridge, UK
E-mail: juan@ebi.ac.uk
Juan A. Vizcaíno
juan@ebi.ac.uk
Danish Bioinformatics Conference
Odense, 25 August 2017
Overview
• Short introduction to proteomics and PRIDE
• Reuse of public proteomics data
• “Big data” approach -> PRIDE Cluster
• Open analysis pipelines
One-slide intro to MS-based proteomics
Hein et al., Handbook of Systems Biology, 2012
Data resources at EMBL-EBI
• Genes, genomes & variation: Ensembl, Ensembl Genomes, GWAS Catalog, Metagenomics portal, European Nucleotide Archive, European Variation Archive, European Genome-phenome Archive
• Gene & protein expression: ArrayExpress, Expression Atlas, PRIDE
• Protein sequences, families & motifs: InterPro, Pfam, UniProt
• Molecular structures: Protein Data Bank in Europe, Electron Microscopy Data Bank
• Chemical biology: ChEMBL, ChEBI
• Reactions, interactions & pathways: IntAct, Reactome, MetaboLights
• Systems: BioModels, Enzyme Portal, BioSamples
• Literature & ontologies: Europe PubMed Central, Gene Ontology, Experimental Factor Ontology
PRIDE (PRoteomics IDEntifications) Archive
http://www.ebi.ac.uk/pride/archive
• PRIDE stores mass spectrometry (MS)-based proteomics data:
• Peptide and protein expression data (identification and quantification)
• Post-translational modifications
• Mass spectra (raw data and peak lists)
• Technical and biological metadata
• Any other related information
• Full support for tandem MS approaches
• Any type of data can be stored
• Since July 2017, an ELIXIR core resource
Martens et al., Proteomics, 2005
Vizcaíno et al., NAR, 2016
Stats (1): Data submissions to PRIDE Archive continue to increase
1,950 datasets were submitted to PRIDE Archive in 2016, and the number of submitted datasets is still growing.
Stats (2): Data growth in EBI resources
Genomics
Transcriptomics
Metabolomics
ProteomeXchange: a global, distributed proteomics database
• Goal: development of a framework to allow standard data submission and dissemination pipelines between the main existing proteomics repositories.
• Member repositories: PRIDE (MS/MS data), MassIVE (MS/MS data), jPOST (MS/MS data), PASSEL (SRM data)
• Data types in the diagram: raw data (Raw), identification and quantification results (ID/Q), metadata (Meta)
• Mandatory raw data deposition since July 2015
http://www.proteomexchange.org
Vizcaíno et al., Nat Biotechnol, 2014
Deutsch et al., NAR, 2017
Stats (3): 5,198 ProteomeXchange datasets in PRIDE
Type:
• 3835 ‘Partial’ submissions (73.8%)
• 1363 ‘Complete’ submissions (26.2%)
• Released: 3462 datasets (66.6%)
• Unpublished: 1736 datasets (33.4%)
Data volume in PRIDE:
• Total: ~400 TB
• Number of files: ~670,000
• PXD000320-324: ~4 TB
• PXD002319-26: ~2.4 TB
• PXD001471: ~1.6 TB
Top species represented (at least 100 datasets; ~1000 species in total):
• 2267 Homo sapiens
• 765 Mus musculus
• 201 Saccharomyces cerevisiae
• 169 Arabidopsis thaliana
• 154 Rattus norvegicus
• 124 Escherichia coli
Countries with at least 100 submitted datasets:
• 1019 USA
• 734 Germany
• 492 United Kingdom
• 470 China
• 273 France
• 209 Netherlands
• 173 Canada
• 165 Switzerland
• 157 Australia
• 148 Austria
• 142 Denmark
• 137 Spain
• 115 Sweden
• 109 Japan
• 100 India
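Stats (4): PRIDE share in ProteomeXchange (May 2017)
[Pie chart of datasets per repository; slice labels are not recoverable from the extracted text: 5571 (88.2%), 516 (8.2%), 139 (2.2%), 86 (1.4%)]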
PRIDE Inspector Toolsuite: data visualisation / QC
• PRIDE Inspector: standalone tool to enable visualisation and validation of MS data.
• Built on top of ms-data-core-api: open source algorithms and libraries for computational proteomics.
• Supported file formats: mzIdentML, mzML, mzTab (PSI standards), and PRIDE XML.
• Broad functionality, including summary and QC charts, and peptide spectra annotation and visualisation.
https://github.com/PRIDE-Utilities/ms-data-core-api
https://github.com/PRIDE-Toolsuite/pride-inspector
Wang et al., Nat. Biotechnology, 2012
Perez-Riverol et al., Bioinformatics, 2015
Perez-Riverol et al., MCP, 2016
The “dark” proteome
Sequence-based
search engines
The “dark” proteome
• Only ~25-30% of the spectra in a typical proteomics experiment are identified.
• What does the fraction of unidentified spectra correspond to?
• There will certainly be artefacts (e.g. chimeric spectra).
• Undetected protein variants: what is not included in the searched database cannot be found.
• Peptides containing unexpected post-translational modifications (PTMs).
• Big potential to find novel, biologically relevant “proteoforms”.
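The point that a peptide absent from the searched database cannot be found can be illustrated with a toy precursor-mass search. This is only a minimal sketch under strong assumptions: fully tryptic peptides with no missed cleavages, matching on peptide mass alone (real search engines score fragment spectra), and made-up sequences. All function names are hypothetical.

```python
import re

# Monoisotopic residue masses in Da for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # mass of H2O added to the sum of residue masses


def tryptic_digest(protein):
    """Cleave after K or R unless followed by P; no missed cleavages."""
    # Splitting on an empty match requires Python 3.7 or later.
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def search(observed_masses, database, tol_da=0.01):
    """Return, for each observed mass, the database peptides within tolerance."""
    index = [(peptide_mass(p), p)
             for protein in database
             for p in tryptic_digest(protein)]
    return [[p for mass, p in index if abs(mass - obs) <= tol_da]
            for obs in observed_masses]


reference_db = ["AGLSDKVTEFR"]       # hypothetical reference protein
observed = [
    peptide_mass("AGLSDK"),          # peptide present in the database
    peptide_mass("VTEYR"),           # F->Y variant, absent from the database
]
print(search(observed, reference_db))  # [['AGLSDK'], []]
```

The variant peptide produces a perfectly valid measurement, but it stays unidentified because the reference database only contains the unmodified sequence.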
Concept of “proteoform”
Could any of these “undetected” proteoforms have an important biological
function?
Smith et al., Nat Methods, 2013
Reuse of public proteomics data is on the rise!!
Martens & Vizcaíno, Trends Biochem Sci, 2017
Vaudel et al., Proteomics, 2016
Data downloads are increasing
Data download volume for PRIDE Archive in 2016: 243 TB
[Bar chart: downloads in TB per year, 2013-2016]
MS proteomics: Discovery proteomics (DDA)
in vivo in silico
Public data re-analysis -> Data repurposing
• Individual authors can re-analyze MS proteomics
raw data with new hypotheses in mind (not taken
into account by the original authors).
• Proteogenomics studies.
• Discovery of new PTMs.
• Meta-analysis studies.
Across-omics -> Proteogenomics approaches
• Proteomics data are combined with genomics and/or transcriptomics information, typically by searching against sequence databases derived from DNA sequencing efforts, RNA-Seq experiments, Ribo-Seq approaches, and long non-coding RNAs.
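One common way to derive such a sequence database from nucleotide data is six-frame translation. The following is a minimal sketch, assuming the standard genetic code, stop-to-stop open reading frames (no start codon required) and a hypothetical entry-naming scheme. It is not taken from the talk, and real proteogenomics pipelines add further filtering and variant handling.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_orfs(dna, min_len=5):
    """Return {entry_name: protein} for stop-to-stop ORFs in all six frames."""
    dna = dna.upper()
    entries = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            protein = translate(seq[frame:])
            for n, orf in enumerate(protein.split("*")):
                if len(orf) >= min_len:
                    # Hypothetical naming scheme: strand, frame, ORF index.
                    entries[f"frame{strand}{frame + 1}_orf{n}"] = orf
    return entries


def to_fasta(entries):
    """Format the entries as FASTA text for use as a search database."""
    return "".join(f">{name}\n{seq}\n" for name, seq in entries.items())


if __name__ == "__main__":
    toy_transcript = "ATGGCCAAGCTGTCCGATAAATAA"  # made-up sequence
    print(to_fasta(six_frame_orfs(toy_transcript)))
```

The resulting FASTA text can be appended to a reference proteome, so that spectra from unannotated ORFs have a chance of being matched.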
MS proteomics: Proteogenomics
in vivo in silico
DNA, RNASeq,
RiboSeq
Proteogenomics
MS proteomics: ProteoGenomics
Nesvizhskii, Nat Methods, 2014
Examples of repurposing datasets: proteogenomics
Data in public resources can be used for genome annotation purposes ->
Discovery of short ORFs, translated lncRNAs, etc
Examples of repurposing datasets: proteogenomics
Studies have also been performed in model organisms (mouse, rat, Drosophila) and in microorganisms (Mycobacterium tuberculosis, Helicobacter pylori).
Across-omics -> Proteogenomics approaches
• Proteogenomics approaches are increasingly utilized to
understand the information flow from genotype to phenotype
in complex diseases such as cancer and to support
personalized medicine studies.
• Study of human variation, e.g. in diseases such as cancer.
MS proteomics: ProteoGenomics
in vivo in silico
Personal genomes
Personal
proteomes
Personalised medicine
Public datasets from different omics: OmicsDI
http://www.omicsdi.org/
• Aims to integrate ‘omics’ datasets (proteomics, transcriptomics, metabolomics and genomics at present).
PRIDE
MassIVE
jPOST
PASSEL
GPMDB
ArrayExpress
Expression Atlas
MetaboLights
Metabolomics Workbench
GNPS
EGA
Perez-Riverol et al., Nat Biotechnol, 2017
OmicsDI: Portal for omics datasets
Repurposing: new PTMs found
• Some examples (using phosphoproteomics data sets):
• O-GlcNAc-6-phosphate [1]
• Phosphoglyceryl [2]
• ADP-ribosylation [3]
[1] Hahne & Kuster, Mol Cell Proteomics (2012) 11(10):1063-9
[2] Moellering & Cravatt, Science (2013) 341:549-553
[3] Matic et al., Nat Methods (2012) 9:771-2
Recent examples of meta-analysis studies
Lund-Johansen et al., Nat Methods, 2016
Drew et al., Mol Systems Biol, 2017
Introduction to Spectrum Clustering
[Figure: input mass spectra (an unidentified spectrum, a spectrum identified as peptide A, and a spectrum identified as peptide B) are grouped by the spectra-cluster algorithm into consensus spectra (= data reduction)]
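The general idea of spectrum clustering can be sketched in a few lines. This is not the spectra-cluster algorithm itself. It is a simplified, greedy illustration that assumes unit-width m/z bins, cosine similarity, an arbitrary 0.8 threshold and an averaged consensus spectrum, and it ignores precursor m/z and charge. All names are hypothetical.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=1.0):
    """Turn (m/z, intensity) pairs into a sparse vector keyed by m/z bin."""
    vec = defaultdict(float)
    for mz, intensity in peaks:
        vec[int(mz / bin_width)] += intensity
    return dict(vec)


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def consensus(vectors):
    """Average the binned intensities of all member spectra."""
    total = defaultdict(float)
    for vec in vectors:
        for key, value in vec.items():
            total[key] += value
    return {key: value / len(vectors) for key, value in total.items()}


def cluster_spectra(spectra, threshold=0.8):
    """Greedy clustering: join the most similar cluster or start a new one."""
    clusters = []
    for idx, peaks in enumerate(spectra):
        vec = bin_spectrum(peaks)
        best = max(clusters, key=lambda c: cosine(vec, c["consensus"]),
                   default=None)
        if best is not None and cosine(vec, best["consensus"]) >= threshold:
            best["members"].append(idx)
            best["vectors"].append(vec)
            best["consensus"] = consensus(best["vectors"])
        else:
            clusters.append({"members": [idx], "vectors": [vec],
                             "consensus": vec})
    return clusters


if __name__ == "__main__":
    toy_spectra = [                        # made-up peak lists
        [(100.1, 10.0), (200.2, 5.0)],
        [(100.2, 9.0), (200.1, 6.0)],      # near-duplicate of the first
        [(150.0, 10.0), (300.0, 10.0)],    # unrelated spectrum
    ]
    for c in cluster_spectra(toy_spectra):
        print(c["members"])
```

Each cluster is then represented by a single consensus spectrum, which is where the data reduction comes from.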
The spectra-cluster toolsuite
Clustering
• Command-line tool, graphical user interface and Hadoop implementation of the spectra-cluster algorithm.
• Stand-alone tools optimised for small datasets
Development
• Parser APIs for Java and Python
• spectra-cluster Java API to facilitate the development of new clustering algorithms
Analysis
• Growing collection of simple-to-use tools for detailed analysis
• spectra-cluster-py Python framework available for developing your own scripts
https://spectra-cluster.github.io
PRIDE Cluster - Concept
[Figure: originally submitted identified spectra (identified as NMMAACDPR or PPECPDFDPPR) are grouped by spectrum clustering into a cluster represented by a consensus spectrum]
Threshold: at least 3 spectra in a cluster and ratio >70%.
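The threshold above can be expressed as a small check. This sketch assumes that "ratio" means the fraction of spectra in the cluster that carry the most frequent peptide identification. The function name and the example cluster are illustrative only.

```python
from collections import Counter


def assess_cluster(peptide_ids, min_size=3, min_ratio=0.7):
    """Return (top_peptide, ratio, reliable) for one cluster.

    peptide_ids holds one submitted peptide identification per spectrum.
    Assumption: ratio = share of spectra assigned to the most common peptide.
    """
    if not peptide_ids:
        return None, 0.0, False
    top_peptide, count = Counter(peptide_ids).most_common(1)[0]
    ratio = count / len(peptide_ids)
    reliable = len(peptide_ids) >= min_size and ratio > min_ratio
    return top_peptide, ratio, reliable


if __name__ == "__main__":
    # Illustrative cluster: four spectra agree, one disagrees.
    cluster = ["NMMAACDPR"] * 4 + ["PPECPDFDPPR"]
    print(assess_cluster(cluster))  # ('NMMAACDPR', 0.8, True)
```

A cluster that passes the check supports the majority identification, while the disagreeing spectra become candidates for a wrong original identification.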
PRIDE Cluster: Second Implementation
First implementation (Griss et al., Nat. Methods, 2013):
• Clustered all public, identified spectra in PRIDE
• EBI compute farm, LSF
• 20.7 M identified spectra
• 610 CPU days, two calendar weeks
• Validation, calibration
• Feedback into PRIDE datasets
Second implementation (Griss et al., Nat. Methods, 2016):
• Clustered all public spectra in PRIDE by summer 2015.
• Apache Hadoop.
• Starting with 256 M spectra: 190 M unidentified spectra (filtered to 111 M spectra that are likely to represent a peptide) and 66 M identified spectra.
• Result: 28 M clusters
• 5 calendar days on a 30-node Hadoop cluster, 340 CPU cores
One perfect cluster in PRIDE Cluster web
- 880 PSMs give the same peptide ID
- 4 species
- 28 datasets
- Same instruments
http://www.ebi.ac.uk/pride/cluster/
Consistently unidentified clusters
[Figure: originally submitted spectra, none of them identified, are grouped by spectrum clustering into a consensus spectrum of unknown identity (??)]
Method to target recurrent unidentified spectra
Consistently unidentified clusters (Recurring Unidentified Spectra)
• 19 M clusters contain only unidentified spectra.
• Most of them are likely to be derived from peptides.
• They could correspond to PTMs or variant peptides -> Potential
Biomarkers?
• With various methods, we found likely identifications for about 20%.
• Vast amount of data mining remains to be done.
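Selecting such recurring unidentified spectra from clustering results can be sketched as a simple filter. The record layout, the thresholds (at least 3 spectra, at least 2 datasets) and the dataset labels below are assumptions made for illustration, not the criteria used by PRIDE Cluster.

```python
def recurring_unidentified(clusters, min_size=3, min_datasets=2):
    """Keep clusters in which no spectrum was identified.

    Each cluster is a list of dicts with keys 'peptide' (None when the
    spectrum was not identified) and 'dataset' (source dataset label).
    Thresholds are hypothetical defaults.
    """
    selected = []
    for cluster in clusters:
        if len(cluster) < min_size:
            continue  # too small to count as recurring
        if any(s["peptide"] is not None for s in cluster):
            continue  # at least one identification exists
        if len({s["dataset"] for s in cluster}) < min_datasets:
            continue  # seen in a single dataset only
        selected.append(cluster)
    return selected


if __name__ == "__main__":
    clusters = [
        [{"peptide": None, "dataset": "dataset_A"},
         {"peptide": None, "dataset": "dataset_B"},
         {"peptide": None, "dataset": "dataset_B"}],
        [{"peptide": "NMMAACDPR", "dataset": "dataset_A"},
         {"peptide": None, "dataset": "dataset_C"},
         {"peptide": None, "dataset": "dataset_C"}],
    ]
    print(len(recurring_unidentified(clusters)))  # 1
```

The consensus spectra of the selected clusters are natural targets for more expensive analyses, such as open modification searches or de novo sequencing.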
PRIDE Cluster as a Public Data Mining Resource
• http://www.ebi.ac.uk/pride/cluster
• Spectral libraries for 16 species.
• Spectral archives (including the Recurring Unidentified Spectra)
• All clustering results, as well as specific subsets of interest available.
• Source code (open source) and Java API
Status of PRIDE Cluster in 2017
[Pipeline diagram; labels listed in slide order]
• Input: PX ‘Complete’ datasets (1 ... n)
• Infrastructure: Hadoop cluster
• PRIDE Archive import: MGF files (with annotations)
• QC: PX successfully converted; new peptides/PTMs; number of identified and non-identified spectra
• Clustering: clustering files
• QC: number of new clusters; PRIDE Cluster score distribution; number of clusters by modification
• Peptide tables
• QC: number of peptides; number of new peptides; number of PTMs; number of new PTMs
Refined and improved pipeline, including robust QC checks.
The main focus is no longer on quantity: more PSMs are filtered out a priori.
Applications of spectrum clustering…
• Applicable to small groups of “similar” datasets:
• Can be used to target spectra that are “consistently” unidentified.
• Unidentified spectra could represent PTMs or sequence variants.
• Try “more-expensive” computational analysis methods (e.g.
spectral searches, de novo).
• Improve protein quantification.
Open analysis pipelines
• Goal: Development of open, reproducible and modular pipelines
(based on OpenMS as a starting point) for DDA (Data Dependent
Acquisition) approaches.
• Deployment in the EMBL-EBI “Embassy Cloud”, with the goal that in the future they can be deployed in other cloud infrastructures and reused by anyone in the community.
• Connected to PRIDE, bringing the tools closer to the data.
• We can use these pipelines to reanalyse PRIDE data.
Open analysis pipelines -> In the near future…
• A 3-year BBSRC grant was recently awarded to do the same for DIA approaches (to start in December 2017).
• In collaboration with the Stoller Center (Manchester) (co-PIs Graham, Hubbard & Townsend)
• A 4-year Wellcome Trust grant was recently awarded to develop (among other things) pipelines for proteogenomics approaches (to start in mid-2018).
• In collaboration with J. Choudhary (Institute of Cancer Research, London)
Summary
• Public proteomics datasets are on the rise! Reliable, widely used infrastructure now exists.
• Many possibilities are open for reuse of these data.
• New purposes: proteogenomics, new PTMs, ...
• It is possible to mine public data using spectrum clustering to look for new proteoforms (new potential biomarkers?).
• Work has started on open and reproducible analysis pipelines.
• Aim: to make them available to everyone in the community in the future.
Acknowledgements: People
Attila Csordas
Tobias Ternent
Mathias Walzer
Gerhard Mayer (de.NBI)
Johannes Griss
Yasset Perez-Riverol
Manuel Bernal-Llinares
Andrew Jarnuczak
Former team members, especially
Rui Wang, Florian Reisinger, Noemi
del Toro, Jose A. Dianes & Henning
Hermjakob
Acknowledgements: The PRIDE Team
All data submitters !!!
@pride_ebi
@proteomexchange
www.hupo2017.ie
Dublin 17-21st September
Juan A. Vizcaíno
juan@ebi.ac.uk
Danish Bioinformatics Conference
Odense, 25 August 2017

More Related Content

PPTX
How to run and maintain a popular biological data repository?
PPTX
The ProteomeXchange Consoritum: 2017 update
PPTX
PRIDE and ProteomeXchange
PDF
A proteomics data “gold mine” at your disposal: Now that the data is there, w...
PPTX
Proteomics data standards
PPTX
Reuse of public proteomics data
PPTX
Proteomics and the "big data" trend: challenges and new possibilitites (Talk ...
PPTX
Proteomics public data resources: enabling "big data" analysis in proteomics
How to run and maintain a popular biological data repository?
The ProteomeXchange Consoritum: 2017 update
PRIDE and ProteomeXchange
A proteomics data “gold mine” at your disposal: Now that the data is there, w...
Proteomics data standards
Reuse of public proteomics data
Proteomics and the "big data" trend: challenges and new possibilitites (Talk ...
Proteomics public data resources: enabling "big data" analysis in proteomics

What's hot (20)

PPTX
Experiences to learn from the MS proteomics field
PPTX
Mining the hidden proteome using hundreds of public proteomics datasets
PPTX
Reuse of public proteomics data
PPTX
Proteomics data standards
PPTX
Proteomics repositories
PPTX
PRIDE-ProteomeXchange
PPTX
Mass spectrometry resources at the EBI
PDF
Developing open data analysis pipelines in the cloud: Enabling the ‘big data’...
PDF
Pride cluster presentation
PDF
GBIF towards 2030 (November 2018)
PDF
Museum collections as research data - October 2019
PDF
FAIR and open biodiversity collection data management
PDF
2021-01-27--biodiversity-informatics-gbif-(52slides)
PDF
GBIF and Open Science
PDF
GBIF and Biodiversity informatics for museums, 15 March 2021
PDF
Introduction to GBIF. GBIF seminar in Bergen. 2016-12-14
PDF
The Biodiversity Informatics Landscape
PDF
Global Biodiversity Information Facility - 2013
PDF
GBIF and reuse of research data, Bergen (2016-12-14)
PDF
GBIF & GRScicoll, Høstseminar Norges museumsforbunds Seksjon for natur, 2021-...
Experiences to learn from the MS proteomics field
Mining the hidden proteome using hundreds of public proteomics datasets
Reuse of public proteomics data
Proteomics data standards
Proteomics repositories
PRIDE-ProteomeXchange
Mass spectrometry resources at the EBI
Developing open data analysis pipelines in the cloud: Enabling the ‘big data’...
Pride cluster presentation
GBIF towards 2030 (November 2018)
Museum collections as research data - October 2019
FAIR and open biodiversity collection data management
2021-01-27--biodiversity-informatics-gbif-(52slides)
GBIF and Open Science
GBIF and Biodiversity informatics for museums, 15 March 2021
Introduction to GBIF. GBIF seminar in Bergen. 2016-12-14
The Biodiversity Informatics Landscape
Global Biodiversity Information Facility - 2013
GBIF and reuse of research data, Bergen (2016-12-14)
GBIF & GRScicoll, Høstseminar Norges museumsforbunds Seksjon for natur, 2021-...
Ad

Similar to Public proteomics data: a (mostly unexploited) gold mine for computational researchers (20)

PPTX
Is it feasible to identify novel biomarkers by mining public proteomics data?
PPTX
An overview of the PRIDE ecosystem of resources and computational tools for m...
PPTX
PRIDE and ProteomeXchange: A golden age for working with public proteomics data
PPTX
Reuse of public data in proteomics
PDF
Reusing and integrating public proteomics data to improve our knowledge of th...
PPTX
PRIDE and ProteomeXchange: supporting the cultural change in proteomics publi...
PPTX
PRIDE and ProteomeXchange: Training webinar
PDF
The ELIXIR Proteomics community
PDF
Reuse of public proteomics data
PPTX
Pride and ProteomeXchange
PPTX
ProteomeXchange_and_PRIDE_Semmeting_2015
PDF
ProteomeXchange update
PPTX
Proteomexchange
PDF
Proteomics repositories
PPTX
ProteomeXchange: data deposition and data retrieval made easy
PPTX
Proteomics repositories
PDF
PRIDE and ProteomeXchange – Making proteomics data accessible and reusable
PPTX
Proteomics repositories
PPTX
Introduction to EBI for Proteomics in ELIXIR
PDF
PRIDE resources and ProteomeXchange
Is it feasible to identify novel biomarkers by mining public proteomics data?
An overview of the PRIDE ecosystem of resources and computational tools for m...
PRIDE and ProteomeXchange: A golden age for working with public proteomics data
Reuse of public data in proteomics
Reusing and integrating public proteomics data to improve our knowledge of th...
PRIDE and ProteomeXchange: supporting the cultural change in proteomics publi...
PRIDE and ProteomeXchange: Training webinar
The ELIXIR Proteomics community
Reuse of public proteomics data
Pride and ProteomeXchange
ProteomeXchange_and_PRIDE_Semmeting_2015
ProteomeXchange update
Proteomexchange
Proteomics repositories
ProteomeXchange: data deposition and data retrieval made easy
Proteomics repositories
PRIDE and ProteomeXchange – Making proteomics data accessible and reusable
Proteomics repositories
Introduction to EBI for Proteomics in ELIXIR
PRIDE resources and ProteomeXchange
Ad

More from Juan Antonio Vizcaino (11)

PPTX
Introduction to the PSI standard data formats
PDF
Introduction to the Proteomics Bioinformatics Course 2018
PDF
ELIXIR Implementation Study: “Mining the Proteome: Enabling Automated Process...
PPTX
PSI-Proteome Informatics update
PDF
The ELIXIR Proteomics Community
PPTX
Introduction to the Proteomics Bioinformatics Course 2017
PPTX
The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...
PPTX
ProteomeXchange update 2017
PPTX
Enabling automated processing and analysis of large-scale proteomics data
PPTX
The Proteomics Standards Initiative (PSI)
PPTX
Introduction to the Proteomics Bioinformatics Course 2016
Introduction to the PSI standard data formats
Introduction to the Proteomics Bioinformatics Course 2018
ELIXIR Implementation Study: “Mining the Proteome: Enabling Automated Process...
PSI-Proteome Informatics update
The ELIXIR Proteomics Community
Introduction to the Proteomics Bioinformatics Course 2017
The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...
ProteomeXchange update 2017
Enabling automated processing and analysis of large-scale proteomics data
The Proteomics Standards Initiative (PSI)
Introduction to the Proteomics Bioinformatics Course 2016

Recently uploaded (20)

PDF
VARICELLA VACCINATION: A POTENTIAL STRATEGY FOR PREVENTING MULTIPLE SCLEROSIS
PDF
IFIT3 RNA-binding activity primores influenza A viruz infection and translati...
PPTX
Vitamins & Minerals: Complete Guide to Functions, Food Sources, Deficiency Si...
PPTX
cpcsea ppt.pptxssssssssssssssjjdjdndndddd
PPTX
Comparative Structure of Integument in Vertebrates.pptx
PPTX
Microbiology with diagram medical studies .pptx
PDF
ELS_Q1_Module-11_Formation-of-Rock-Layers_v2.pdf
PPTX
Taita Taveta Laboratory Technician Workshop Presentation.pptx
PDF
Formation of Supersonic Turbulence in the Primordial Star-forming Cloud
PPTX
famous lake in india and its disturibution and importance
PPTX
TOTAL hIP ARTHROPLASTY Presentation.pptx
PDF
Unveiling a 36 billion solar mass black hole at the centre of the Cosmic Hors...
PPTX
7. General Toxicologyfor clinical phrmacy.pptx
PPTX
ANEMIA WITH LEUKOPENIA MDS 07_25.pptx htggtftgt fredrctvg
PDF
AlphaEarth Foundations and the Satellite Embedding dataset
PDF
CAPERS-LRD-z9:AGas-enshroudedLittleRedDotHostingaBroad-lineActive GalacticNuc...
PDF
. Radiology Case Scenariosssssssssssssss
PPTX
Derivatives of integument scales, beaks, horns,.pptx
PPTX
BIOMOLECULES PPT........................
PDF
Mastering Bioreactors and Media Sterilization: A Complete Guide to Sterile Fe...
VARICELLA VACCINATION: A POTENTIAL STRATEGY FOR PREVENTING MULTIPLE SCLEROSIS
IFIT3 RNA-binding activity primores influenza A viruz infection and translati...
Vitamins & Minerals: Complete Guide to Functions, Food Sources, Deficiency Si...
cpcsea ppt.pptxssssssssssssssjjdjdndndddd
Comparative Structure of Integument in Vertebrates.pptx
Microbiology with diagram medical studies .pptx
ELS_Q1_Module-11_Formation-of-Rock-Layers_v2.pdf
Taita Taveta Laboratory Technician Workshop Presentation.pptx
Formation of Supersonic Turbulence in the Primordial Star-forming Cloud
famous lake in india and its disturibution and importance
TOTAL hIP ARTHROPLASTY Presentation.pptx
Unveiling a 36 billion solar mass black hole at the centre of the Cosmic Hors...
7. General Toxicologyfor clinical phrmacy.pptx
ANEMIA WITH LEUKOPENIA MDS 07_25.pptx htggtftgt fredrctvg
AlphaEarth Foundations and the Satellite Embedding dataset
CAPERS-LRD-z9:AGas-enshroudedLittleRedDotHostingaBroad-lineActive GalacticNuc...
. Radiology Case Scenariosssssssssssssss
Derivatives of integument scales, beaks, horns,.pptx
BIOMOLECULES PPT........................
Mastering Bioreactors and Media Sterilization: A Complete Guide to Sterile Fe...

Public proteomics data: a (mostly unexploited) gold mine for computational researchers

  • 1. Public proteomics data: a (mostly unexploited) gold mine for computational researchers Dr. Juan Antonio Vizcaíno Proteomics Team Leader EMBL-European Bioinformatics Institute Hinxton, Cambridge, UK E-mail: juan@ebi.ac.uk
  • 2. Juan A. Vizcaíno juan@ebi.ac.uk Danish Bioinformatics Conference Odense, 25 August 2017 Overview • Short introduction to proteomics and PRIDE • Reuse of public proteomics data • “Big data” approach -> PRIDE Cluster • Open analysis pipelines
  • 3. Juan A. Vizcaíno juan@ebi.ac.uk Danish Bioinformatics Conference Odense, 25 August 2017 One slide intro to MS based proteomics Hein et al., Handbook of Systems Biology, 2012
  • 4. Juan A. Vizcaíno juan@ebi.ac.uk Danish Bioinformatics Conference Odense, 25 August 2017 Data resources at EMBL-EBI Genes, genomes & variation ArrayExpress Expression Atlas PRIDE InterPro Pfam UniProt ChEMBL ChEBI Molecular structures Protein Data Bank in Europe Electron Microscopy Data Bank European Nucleotide Archive European Variation Archive European Genome-phenome Archive Gene & protein expression Protein sequences, families & motifs Chemical biology Reactions, interactions & pathways IntAct Reactome MetaboLights Systems BioModels Enzyme Portal BioSamples Ensembl Ensembl Genomes GWAS Catalog Metagenomics portal Europe PubMed Central Gene Ontology Experimental Factor Ontology Literature & ontologies
  • 5. Juan A. Vizcaíno juan@ebi.ac.uk Danish Bioinformatics Conference Odense, 25 August 2017 Data resources at EMBL-EBI Genes, genomes & variation ArrayExpress Expression Atlas PRIDE InterPro Pfam UniProt ChEMBL ChEBI Molecular structures Protein Data Bank in Europe Electron Microscopy Data Bank European Nucleotide Archive European Variation Archive European Genome-phenome Archive Gene & protein expression Protein sequences, families & motifs Chemical biology Reactions, interactions & pathways IntAct Reactome MetaboLights Systems BioModels Enzyme Portal BioSamples Ensembl Ensembl Genomes GWAS Catalog Metagenomics portal Europe PubMed Central Gene Ontology Experimental Factor Ontology Literature & ontologies
  • 6. Juan A. Vizcaíno juan@ebi.ac.uk Danish Bioinformatics Conference Odense, 25 August 2017 • PRIDE stores mass spectrometry (MS)-based proteomics data: • Peptide and protein expression data (identification and quantification) • Post-translational modifications • Mass spectra (raw data and peak lists) • Technical and biological metadata • Any other related information • Full support for tandem MS approaches • Any type of data can be stored • From July 2017, an ELIXIR core resource PRIDE (PRoteomics IDEntifications) Archive http://www.ebi.ac.uk/pride/archive Martens et al., Proteomics, 2005 Vizcaíno et al., NAR, 2016
  • 7. Juan A. Vizcaíno juan@ebi.ac.uk Danish Bioinformatics Conference Odense, 25 August 2017 Stats (1): Data submissions to PRIDE Archive continue to increase 1,950 datasets submitted to PRIDE Archive in 2016 … and still the number of submitted datasets is growing…
  • 8. Juan A. Vizcaíno juan@ebi.ac.uk Danish Bioinformatics Conference Odense, 25 August 2017 Stats (2): Data growth in EBI resources Genomics Transcriptomics Metabolomics
  • 9. Juan A. Vizcaíno juan@ebi.ac.uk Danish Bioinformatics Conference Odense, 25 August 2017 ProteomeXchange: A Global, distributed proteomics database PASSEL (SRM data) PRIDE (MS/MS data) MassIVE (MS/MS data) Raw ID/Q Meta jPOST (MS/MS data) Mandatory raw data deposition since July 2015 • Goal: Development of a framework to allow standard data submission and dissemination pipelines between the main existing proteomics repositories. http://www.proteomexchange.org Vizcaíno et al., Nat Biotechnol, 2014 Deutsch et al., NAR, 2017
  • 10. Juan A. Vizcaíno juan@ebi.ac.uk Danish Bioinformatics Conference Odense, 25 August 2017 Countries with at least 100 submitted datasets : 1019 USA 734 Germany 492 United Kingdom 470 China 273 France 209 Netherlands 173 Canada 165 Switzerland 157 Australia 148 Austria 142 Denmark 137 Spain 115 Sweden 109 Japan 100 India Stats (3): 5,198 ProteomeXchange datasets in PRIDE Type: 3835 ‘Partial’ submissions (73.8%) 1363 ‘Complete’ submissions (26.2%) Released: 3462 datasets (66.6%) Unpublished: 1736 datasets (33.4%) Data volume in PRIDE: Total: ~400 TB Number of files: ~670,000 PXD000320-324: ~ 4 TB PXD002319-26 ~2.4 TB PXD001471 ~1.6 TB Top Species represented (at least 100 datasets): 2267 Homo sapiens 765 Mus musculus 201 Saccharomyces cerevisiae 169 Arabidopsis thaliana 154 Rattus norvegicus 124Escherichia coli ~ 1000 species in total
  • 11. Stats (4): PRIDE share in ProteomeXchange (May 2017) [pie chart; slice values: 5571 (88.2%), 516 (8.2%), 139 (2.2%), 86 (1.4%)]
  • 12. PRIDE Inspector Toolsuite: data visualisation/QC • PRIDE Inspector: standalone tool to enable visualisation and validation of MS data. • Built on top of ms-data-core-api: open source algorithms and libraries for computational proteomics. • Supported file formats: mzIdentML, mzML, mzTab (PSI standards), and PRIDE XML. • Broad functionality. [screenshots: Summary and QC charts; Peptide spectra annotation and visualization] https://github.com/PRIDE-Utilities/ms-data-core-api https://github.com/PRIDE-Toolsuite/pride-inspector Wang et al., Nat. Biotechnology, 2012; Perez-Riverol et al., Bioinformatics, 2015; Perez-Riverol et al., MCP, 2016
  • 13. Overview • Short introduction to proteomics and PRIDE • Reuse of public proteomics data • “Big data” approach -> PRIDE Cluster • Open analysis pipelines
  • 14. The “dark” proteome Sequence-based search engines
  • 15. The “dark” proteome • Only ~25-30% of spectra in a typical proteomics experiment are identified. • What does the fraction of unidentified spectra correspond to? • There will certainly be artefacts (e.g. chimeric spectra). • Undetected protein variants: • What is not included in the searched database cannot be found. • Peptides containing unexpected Post-Translational Modifications (PTMs). • Big potential to find novel, biologically relevant “proteoforms”.
  • 16. Concept of “proteoform” Could any of these “undetected” proteoforms have an important biological function? Smith et al., Nat Methods, 2013
  • 17. Reuse of public proteomics data is on the rise!! Martens & Vizcaíno, Trends Bioch Sci, 2017 Vaudel et al., Proteomics, 2016
  • 18. Data downloads are increasing • Data download volume for PRIDE Archive in 2016: 243 TB [bar chart: downloads in TB per year, 2013-2016]
  • 19. MS proteomics: Discovery proteomics (DDA) in vivo in silico
  • 20. Public data re-analysis -> Data repurposing • Individual authors can re-analyze MS proteomics raw data with new hypotheses in mind (not taken into account by the original authors). • Proteogenomics studies. • Discovery of new PTMs. • Meta-analysis studies.
  • 21. Across-omics -> Proteogenomics approaches • Proteomics data is combined with genomics and/or transcriptomics information, typically by using sequence databases generated from DNA sequencing efforts, RNA-Seq experiments, Ribo-Seq approaches, and long non-coding RNAs.
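One common way to build a search database from nucleotide sequence is six-frame translation. The sketch below is only an illustration of that general idea, not the method used in any of the cited studies; it assumes the standard genetic code, and all function names are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate from position 0; codons with unknown bases become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Translate both strands in all three reading frames ('*' marks a stop)."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translation("ATGGCCTAA")
print(frames)
```

In a real proteogenomics workflow the translated frames would then be split at stop codons and digested in silico before being searched against the spectra.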
  • 22. MS proteomics: Proteogenomics in vivo in silico DNA, RNASeq, RiboSeq Proteogenomics
  • 23. MS proteomics: ProteoGenomics Nesvizhskii, Nat Methods, 2014
  • 24. Examples of repurposing datasets: proteogenomics Data in public resources can be used for genome annotation purposes -> Discovery of short ORFs, translated lncRNAs, etc
  • 25. Examples of repurposing datasets: proteogenomics Some studies have also been performed in model organisms (mouse, rat, Drosophila) and in microorganisms (Mycobacterium tuberculosis, Helicobacter pylori)
  • 26. Across-omics -> Proteogenomics approaches • Proteogenomics approaches are increasingly utilized to understand the information flow from genotype to phenotype in complex diseases such as cancer and to support personalized medicine studies. • Study of human variation, e.g. in diseases such as cancer.
  • 27. MS proteomics: ProteoGenomics in vivo in silico Personal genomes Personal proteomes Personalised medicine
  • 28. Public datasets from different omics: OmicsDI http://www.omicsdi.org/ • Aims to integrate ‘omics’ datasets (proteomics, transcriptomics, metabolomics and genomics at present). • Resources shown: PRIDE, MassIVE, jPOST, PASSEL, GPMDB, ArrayExpress, Expression Atlas, MetaboLights, Metabolomics Workbench, GNPS, EGA. Perez-Riverol et al., Nat Biotechnol, 2017
  • 29. OmicsDI: Portal for omics datasets
  • 30. Public data re-analysis -> Data repurposing • Individual authors can re-analyze MS proteomics raw data with new hypotheses in mind (not taken into account by the original authors). • Proteogenomics studies. • Discovery of new PTMs. • Meta-analysis studies.
  • 31. Repurposing: new PTMs found • Some examples (using phosphoproteomics data sets): • O-GlcNAc-6-phosphate [1] • Phosphoglyceryl [2] • ADP-ribosylation [3] [1] Hahne & Kuster, Mol Cell Proteomics (2012) 11 10 1063-9 [2] Moellering & Cravatt, Science (2013) 341 549-553 [3] Matic et al., Nat Methods (2012) 9 771-2
  • 32. Public data re-analysis -> Data repurposing • Individual authors can re-analyze MS proteomics raw data with new hypotheses in mind (not taken into account by the original authors). • Proteogenomics studies. • Discovery of new PTMs. • Meta-analysis studies.
  • 33. Recent examples of meta-analysis studies Lund-Johansen et al., Nat Methods, 2016 Drew et al., Mol Systems Biol, 2017
  • 34. Overview • Short introduction to proteomics and PRIDE • Reuse of public proteomics data • “Big data” approach -> PRIDE Cluster • Open analysis pipelines
  • 35. Introduction to Spectrum Clustering • Input: mass spectra (unidentified spectra, spectra identified as peptide A, spectra identified as peptide B) • spectra-cluster algorithm • Output: consensus spectra (= data reduction)
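To make the idea of spectrum clustering concrete, here is a toy sketch. It is explicitly not the spectra-cluster algorithm: it assumes a simple greedy scheme in which spectra are binned by m/z, compared by cosine similarity, and assigned to the first cluster whose representative is similar enough. The bin width, the 0.8 threshold and the example peak lists are all made up for illustration, and precursor m/z is ignored.

```python
import math
from collections import defaultdict


def binned(peaks, bin_width=1.0):
    """Sum intensities of (mz, intensity) peaks into fixed-width m/z bins."""
    vector = defaultdict(float)
    for mz, intensity in peaks:
        vector[int(mz / bin_width)] += intensity
    return dict(vector)


def cosine(a, b):
    """Cosine similarity between two sparse intensity vectors."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def greedy_cluster(spectra, threshold=0.8):
    """Assign each spectrum to the first cluster whose first member is similar."""
    clusters = []  # each entry: {"rep": binned vector, "members": [spectrum ids]}
    for spectrum_id, peaks in spectra:
        vector = binned(peaks)
        for cluster in clusters:
            if cosine(cluster["rep"], vector) >= threshold:
                cluster["members"].append(spectrum_id)
                break
        else:
            clusters.append({"rep": vector, "members": [spectrum_id]})
    return [cluster["members"] for cluster in clusters]


# Hypothetical peak lists: s1 and s2 share peaks, s3 does not.
spectra = [
    ("s1", [(100.1, 10.0), (200.2, 50.0), (300.3, 20.0)]),
    ("s2", [(100.2, 12.0), (200.1, 45.0), (300.4, 25.0)]),
    ("s3", [(150.5, 30.0), (250.5, 40.0), (350.5, 10.0)]),
]
result = greedy_cluster(spectra)
print(result)
```

Replacing many similar spectra by one representative per cluster is what gives the data reduction mentioned on the slide.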
  • 36. The spectra-cluster toolsuite • Clustering: • Command-line tool, graphical user interface and Hadoop implementation of the spectra-cluster algorithm. • Stand-alone tools optimised for small datasets • Development: • Parser APIs for Java and Python • spectra-cluster Java API to facilitate the development of new clustering algorithms • Analysis: • Growing collection of simple-to-use tools for detailed analysis • spectra-cluster-py Python framework available for developing your own scripts https://spectra-cluster.github.io
  • 37. PRIDE Cluster - Concept • Originally submitted identified spectra -> Spectrum clustering -> Consensus spectrum • Threshold: At least 3 spectra in a cluster and ratio >70%. [diagram labels: NMMAACDPR, NMMAACDPR, PPECPDFDPPR, NMMAACDPR, NMMAACDPR, NMMAACDPR; Consensus spectrum; PPECPDFDPPR]
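The threshold on this slide can be read as a simple rule. The sketch below assumes that "ratio" means the share of spectra in a cluster that carry the most common peptide identification; that interpretation, the function name and the example cluster (loosely modelled on the peptide labels in the diagram) are assumptions, not the PRIDE Cluster implementation.

```python
from collections import Counter


def reliable_identification(peptides, min_size=3, min_ratio=0.7):
    """Return the majority peptide if the cluster passes both thresholds, else None.

    peptides: one peptide sequence per identified spectrum in the cluster.
    """
    if len(peptides) < min_size:
        return None
    peptide, count = Counter(peptides).most_common(1)[0]
    ratio = count / len(peptides)
    return peptide if ratio > min_ratio else None


# Hypothetical cluster: five spectra agree, one disagrees (ratio 5/6).
cluster = ["NMMAACDPR"] * 5 + ["PPECPDFDPPR"]
print(reliable_identification(cluster))
```

Under this reading the single dissenting identification is outvoted, and the cluster as a whole is assigned the majority peptide.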
  • 38. PRIDE Cluster: Second Implementation • First implementation (Griss et al., Nat. Methods, 2013): • Clustered all public, identified spectra in PRIDE • EBI compute farm, LSF • 20.7 M identified spectra • 610 CPU days, two calendar weeks • Validation, calibration • Feedback into PRIDE datasets • Second implementation (Griss et al., Nat. Methods, 2016): • Clustered all public spectra in PRIDE by summer 2015 • Apache Hadoop • Starting with 256 M spectra: • 190 M unidentified spectra (filtered to 111 M that are likely to represent a peptide) • 66 M identified spectra • Result: 28 M clusters • 5 calendar days on a 30-node Hadoop cluster, 340 CPU cores
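The figures quoted for the Hadoop-based run can be cross-checked with a little arithmetic. This sketch assumes that all 66 M identified spectra plus the 111 M filtered unidentified spectra went into the clustering; the slide does not state this explicitly, so the derived average is only indicative.

```python
M = 1_000_000

# Figures copied from the slide.
unidentified_raw = 190 * M
unidentified_kept = 111 * M
identified = 66 * M
clusters = 28 * M

total_input = unidentified_raw + identified

# Assumption: every identified spectrum and every filtered unidentified
# spectrum was clustered.
clustered = unidentified_kept + identified

spectra_per_cluster = clustered / clusters
filtered_out_share = 1 - unidentified_kept / unidentified_raw

print(f"input spectra: {total_input // M} M")
print(f"clustered spectra: {clustered // M} M")
print(f"average spectra per cluster: {spectra_per_cluster:.1f}")
print(f"unidentified spectra filtered out: {filtered_out_share:.1%}")
```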
  • 39. One perfect cluster in PRIDE Cluster web • 880 PSMs give the same peptide ID • 4 species • 28 datasets • Same instruments http://www.ebi.ac.uk/pride/cluster/
  • 40. 3. Consistently unidentified clusters • Method to target recurrent unidentified spectra [diagram: originally submitted spectra, all labelled “Not identified”; spectrum clustering; consensus spectrum; “??”]
  • 41. Consistently unidentified clusters (Recurring Unidentified Spectra) • 19 M clusters contain only unidentified spectra. • Most of them are likely to be derived from peptides. • They could correspond to PTMs or variant peptides -> Potential Biomarkers? • With various methods, we found likely identifications for about 20%. • Vast amount of data mining remains to be done.
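Selecting clusters that contain only unidentified spectra can be sketched as a simple filter. The data structure (a dict mapping a cluster id to one label per spectrum, with None for an unidentified spectrum), the minimum size and the example ids are hypothetical and chosen for illustration only.

```python
def recurring_unidentified(clusters, min_size=3):
    """Return ids of clusters whose spectra are all unidentified.

    clusters: dict of cluster id -> list of peptide labels, None = unidentified.
    """
    return [
        cluster_id
        for cluster_id, labels in clusters.items()
        if len(labels) >= min_size and all(label is None for label in labels)
    ]


# Hypothetical clusters.
clusters = {
    "c1": ["NMMAACDPR", "NMMAACDPR", None],  # mixed: has identified spectra
    "c2": [None, None, None, None],          # recurring and never identified
    "c3": [None],                            # unidentified but too small
}
candidates = recurring_unidentified(clusters)
print(candidates)
```

Clusters returned by such a filter are the natural targets for the more expensive follow-up analyses mentioned in the talk.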
  • 42. 3. Consistently unidentified clusters
  • 43. PRIDE Cluster as a Public Data Mining Resource • http://www.ebi.ac.uk/pride/cluster • Spectral libraries for 16 species. • Spectral archives (including the Recurring Unidentified Spectra) • All clustering results, as well as specific subsets of interest, are available. • Source code (open source) and Java API
  • 44. Status of PRIDE Cluster in 2017 • Refined / improved pipeline including robust QC checks. • The main focus is no longer on quantity: more PSMs are filtered a priori. [pipeline diagram labels: PRIDE Archive; PX Complete datasets; Import MGF (Annotations); Hadoop Cluster; Clustering Files; Peptide tables; New Peptide/PTMs. QC labels: PX successfully converted; number of identified and non-identified spectra; number of new clusters; PRIDE Cluster score distribution; number of clusters by modification; number of peptides; number of new peptides; number of PTMs; number of new PTMs]
  • 45. Applications of spectrum clustering… • Applicable to small groups of “similar” datasets: • Can be used to target spectra that are “consistently” unidentified. • Unidentified spectra could represent PTMs or sequence variants. • Try “more-expensive” computational analysis methods (e.g. spectral searches, de novo). • Improve protein quantification.
  • 46. Overview • Short introduction to proteomics and PRIDE • Reuse of public proteomics data • “Big data” approach -> PRIDE Cluster • Open analysis pipelines
  • 47. Open analysis pipelines • Goal: Development of open, reproducible and modular pipelines (based on OpenMS as a starting point) for DDA (Data Dependent Acquisition) approaches. • Deployment in the EMBL-EBI “Embassy Cloud”, with the goal that in the future they can be deployed in other cloud infrastructures and reused by anyone in the community. • Connected to PRIDE, bringing the tools closer to the data. • We can use these pipelines to reanalyse PRIDE data.
  • 48. Open analysis pipelines
  • 49. Open analysis pipelines -> In the near future… • Recent 3-year BBSRC grant awarded to do the same for DIA approaches (to start in December 2017). • In collaboration with the Stoller Center (Manchester) (co-PIs Graham, Hubbard & Townsend) • Recent 4-year Wellcome Trust grant awarded to develop (among other things) pipelines for proteogenomics approaches (to start mid-2018). • In collaboration with J. Choudhary (Institute of Cancer Research, London)
  • 50. Summary
  • 51. Summary • Public proteomics datasets are on the rise! Reliable (widely used) infrastructure now exists. • There are many possibilities for reusing this data. • New purposes: proteogenomics, new PTMs,... • It is possible to mine public data using spectrum clustering to look for new proteoforms (new potential biomarkers?) • Starting to work on open and reproducible analysis pipelines. • Aim: to make them available to everyone in the community in the future.
  • 52. Acknowledgements: People • Attila Csordas, Tobias Ternent, Mathias Walzer, Gerhard Mayer (de.NBI), Johannes Griss, Yasset Perez-Riverol, Manuel Bernal-Llinares, Andrew Jarnuczak • Former team members, especially Rui Wang, Florian Reisinger, Noemi del Toro, Jose A. Dianes & Henning Hermjakob • Acknowledgements: The PRIDE Team • All data submitters!!! @pride_ebi @proteomexchange
  • 53. www.hupo2017.ie Dublin 17-21st September