Proteomics public data resources:
enabling “big data” analysis in proteomics
Dr. Juan Antonio Vizcaíno
EMBL-European Bioinformatics Institute
Hinxton, Cambridge, UK
Juan A. Vizcaíno
juan@ebi.ac.uk
International de.NBI Symposium
Heidelberg, 9 November 2016
Overview
• Intro: Concept of “Big data” in biology and proteomics
• PRIDE Archive and ProteomeXchange
• PRIDE tools
• Reuse of public proteomics data
• Working with “Big data”: PRIDE Cluster
“Big data”: definition
Slide from: http://www.ibmbigdatahub.com/
“Big data” in biology
The term has been applied so far mainly to genomics
One slide intro to MS based proteomics
Hein et al., Handbook of Systems Biology, 2012
Data resources at EMBL-EBI
• Genes, genomes & variation: Ensembl, Ensembl Genomes, GWAS Catalog, Metagenomics portal, European Nucleotide Archive, European Variation Archive, European Genome-phenome Archive
• Gene & protein expression: ArrayExpress, Expression Atlas, PRIDE
• Protein sequences, families & motifs: InterPro, Pfam, UniProt
• Chemical biology: ChEMBL, ChEBI
• Molecular structures: Protein Data Bank in Europe, Electron Microscopy Data Bank
• Reactions, interactions & pathways: IntAct, Reactome, MetaboLights
• Systems: BioModels, Enzyme Portal, BioSamples
• Literature & ontologies: Europe PubMed Central, Gene Ontology, Experimental Factor Ontology
What is a proteomics publication in 2016?
• Proteomics studies generate potentially large amounts of
data and results.
• Ideally, a proteomics publication needs to:
• Summarize the results of the study
• Provide supporting information for reliability of any
results reported
• Information in a publication:
• Manuscript
• Supplementary material
• Associated data submitted to a public repository
PRIDE (PRoteomics IDEntifications) Archive
• PRIDE stores mass spectrometry (MS)-based proteomics data:
• Peptide and protein expression data (identification and quantification)
• Post-translational modifications
• Mass spectra (raw data and peak lists)
• Technical and biological metadata
• Any other related information
• Full support for tandem MS approaches
http://www.ebi.ac.uk/pride/archive
Martens et al., Proteomics, 2005
Vizcaíno et al., NAR, 2016
ProteomeXchange: a global, distributed proteomics database
• Goal: Development of a framework to allow standard data submission and dissemination pipelines between the main existing proteomics repositories.
• Member repositories:
• PRIDE (MS/MS data)
• MassIVE (MS/MS data)
• jPOST (MS/MS data), new in 2016
• PASSEL (SRM data)
• Data types in a submission: raw data (Raw), identification and quantification results (ID/Q), metadata (Meta)
• Mandatory raw data deposition since July 2015
http://www.proteomexchange.org
Vizcaíno et al., Nat Biotechnol, 2014
Deutsch et al., NAR, 2017, in press
ProteomeCentral: Centralised portal for all PX datasets
http://proteomecentral.proteomexchange.org/cgi/GetDataset
ProteomeXchange data workflow
[Figure: workflow diagram. Labelled elements: research groups submitting datasets (researcher's results, raw data, metadata); receiving repositories PRIDE, MassIVE and jPOST (MS/MS data, as complete submissions) and PASSEL (SRM data); any other workflow (mainly partial submissions); ProteomeCentral; data types metadata/manuscript, raw data and results; consumers: journals, UniProt/neXtProt, PeptideAtlas, GPMDB, proteomicsDB and other DBs; reanalysis of datasets giving reprocessed results (MassIVE); OmicsDI for integration with other omics datasets.]
Vizcaíno et al., Nat Biotechnol, 2014
Deutsch et al., NAR, 2017, in press
PRIDE: Source of MS proteomics data
• PRIDE Archive already provides or
will soon provide MS proteomics
data to other EMBL-EBI resources
such as UniProt, Ensembl and the
EBI Expression Atlas.
http://www.ebi.ac.uk/pride/archive
PRIDE Archive – over 4,500 datasets from
over 51 countries and 1,700 groups
• USA – 814 datasets
• Germany – 528
• UK – 338
• China – 328
• France – 222
• Netherlands – 175
• Canada – 137
Data volume:
• Total: ~275 TB
• Number of all files: ~560,000
• PXD000320-324: ~4 TB
• PXD002319-26: ~2.4 TB
• PXD001471: ~1.6 TB
• 1,973 datasets, i.e. 52% of all, are publicly accessible
• PRIDE Archive holds ~90% of all ProteomeXchange datasets
[Chart: PRIDE Archive growth. Axes: year vs. submissions; series: all submissions, complete]
In the last 12 months: ~165 submitted datasets per month
Top species studied by at least 100 datasets:
2,010 Homo sapiens
604 Mus musculus
191 Saccharomyces cerevisiae
140 Arabidopsis thaliana
127 Rattus norvegicus
>900 reported taxa in total
PRIDE Components: Data Submission Process
• In addition to PRIDE Archive, the PRIDE team develops and maintains different tools and software libraries to facilitate the handling and visualisation of MS proteomics data and the submission process.
• Tools: PRIDE Converter 2, PRIDE Inspector, PX Submission Tool
• File formats: mzIdentML, PRIDE XML
Current PSI Standard File Formats for MS
• mzML: MS data
• mzIdentML: Identification
• mzQuantML: Quantitation
• mzTab: Final results
• TraML: SRM
PRIDE Inspector Toolsuite
• PRIDE Inspector: standalone tool to enable visualisation and validation of MS data.
• Built on top of ms-data-core-api: open source algorithms and libraries for computational proteomics.
• Supported file formats: mzIdentML, mzML, mzTab (PSI standards), and PRIDE XML.
• Broad functionality.
https://github.com/PRIDE-Utilities/ms-data-core-api
https://github.com/PRIDE-Toolsuite/pride-inspector
[Screenshots: summary and QC charts; peptide spectra annotation and visualization]
Wang et al., Nat. Biotechnology, 2012
Perez-Riverol et al., Bioinformatics, 2015
Perez-Riverol et al., MCP, 2016
PX Submission Tool
• Desktop application for data submissions to ProteomeXchange via PRIDE
• Implemented in Java 7
• Streamlines the submission process
• Captures mappings between files
• Retains metadata
• Fast file transfer with Aspera (FASP® transfer technology); FTP also available
• Command line option
[Screenshot: submission tool]
Datasets are being reused more and more…
Vaudel et al., Proteomics, 2016
Data download volume for PRIDE Archive in 2015: 198 TB
[Chart: downloads in TB per year, 2013 to 2016; y-axis 0 to 250]
Data sharing in Proteomics
Vaudel et al., Proteomics, 2016
Draft Human proteome papers published in 2014
Wilhelm et al., Nature, 2014; Kim et al., Nature, 2014
• Two independent groups claimed to have produced the first complete draft of the human proteome by MS.
• Some of their findings are controversial and need further validation… but generated a lot of discussion and put proteomics in the spotlight.
• They used many different tissues.
[Image: Nature cover, 29 May 2014]
Draft Human proteome papers published in 2014
Wilhelm et al., Nature, 2014
• Around 60% of the data used for the analysis comes from previous experiments, most of them stored in proteomics repositories such as PRIDE/ProteomeXchange, PASSEL or MassIVE.
• They complement that data with “exotic” tissues.
Examples of repurposing in proteogenomics
Challenges for data reuse in proteomics
• Insufficient technical and biological metadata.
• Large computational infrastructure may be needed (e.g. when
analysing many datasets together).
• Shortage of expertise (people).
• Lack of standardisation in the field.
Summary of the talk so far
• PRIDE Archive and other ProteomeXchange resources make
data sharing possible in the MS proteomics field.
• Data sharing is becoming the norm in the field.
• Standalone tools: PRIDE Inspector and PX Submission tool.
• Datasets are increasingly reused (many opportunities):
• Example of one of the drafts of the human proteome.
• Proteogenomics approaches.
• But there are important challenges as well.
PRIDE Cluster: Initial Motivation
• Provide a QC-filtered peptide-centric view of PRIDE.
• Data is stored in PRIDE Archive as originally analysed by the
submitters (no data reprocessing is done).
• Heterogeneous quality, difficult to make the data comparable.
• Enable assessment of (published) proteomics data.
PRIDE Cluster
• Provide an aggregated peptide-centric view of PRIDE Archive.
• Hypothesis: same peptide will generate similar MS/MS spectra across
experiments.
• Enables QC of peptide-spectrum matches (PSMs). Infer reliable
identifications by comparing submitted identifications of spectra within a
cluster.
• After clustering, a representative spectrum is built for all peptides
consistently identified across different datasets.
Griss et al., Nat. Methods, 2013
Griss et al., Nat. Methods, 2016
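The hypothesis on this slide (the same peptide gives similar MS/MS spectra across experiments) needs a way to score how similar two spectra are. The slide does not say which measure PRIDE Cluster uses, so the following is only a minimal sketch under an assumption: a cosine (normalised dot product) score on spectra binned by m/z. The function names and the 1.0 m/z bin width are hypothetical choices for illustration.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins (bin width is an assumption)."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz // bin_width)] += intensity
    return binned


def cosine_similarity(peaks_a, peaks_b, bin_width=1.0):
    """Cosine score between two peak lists given as (m/z, intensity) pairs."""
    a = bin_spectrum(peaks_a, bin_width)
    b = bin_spectrum(peaks_b, bin_width)
    # Only bins present in both spectra contribute to the dot product.
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up peak lists: the second is the first with all intensities doubled.
spectrum_1 = [(175.1, 100.0), (262.2, 50.0), (375.3, 80.0)]
spectrum_2 = [(175.1, 200.0), (262.2, 100.0), (375.3, 160.0)]
unrelated = [(500.4, 10.0)]
```

Because the score is normalised, `spectrum_1` and `spectrum_2` score as essentially identical even though their absolute intensities differ, while `unrelated` shares no bins with either and scores zero.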
PRIDE Cluster - Concept
[Figure: originally submitted identified spectra (example identifications: NMMAACDPR, PPECPDFDPPR) are grouped by spectrum clustering, and a consensus spectrum is built per cluster.]
• Threshold: At least 3 spectra in a cluster and ratio >70%.
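The threshold on this slide can be sketched as a small check. This is one reading of the slide, not the PRIDE Cluster code: "ratio" is taken to mean the fraction of spectra in the cluster that carry the most common peptide identification. The function names and the example counts (five spectra for one peptide, one for another) are illustrative assumptions.

```python
from collections import Counter

MIN_SPECTRA = 3   # "at least 3 spectra in a cluster"
MIN_RATIO = 0.70  # "ratio >70%"


def dominant_sequence(sequences):
    """Return the most common peptide sequence and its share of the cluster."""
    sequence, count = Counter(sequences).most_common(1)[0]
    return sequence, count / len(sequences)


def is_reliable(sequences, min_spectra=MIN_SPECTRA, min_ratio=MIN_RATIO):
    """Apply the slide's threshold to the identifications in one cluster."""
    if len(sequences) < min_spectra:
        return False
    _, ratio = dominant_sequence(sequences)
    return ratio > min_ratio


# Illustrative cluster: 5 of 6 spectra agree, so the ratio is about 0.83.
cluster = ["NMMAACDPR"] * 5 + ["PPECPDFDPPR"]
print(dominant_sequence(cluster)[0], is_reliable(cluster))
```

A cluster of three spectra in which only two agree has a ratio of about 0.67 and would not pass.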
PRIDE Cluster: Implementation
• Griss et al., Nat. Methods, 2013
• Clustered all public, identified
spectra in PRIDE
• EBI compute farm, LSF
• 20.7 M identified spectra
• 610 CPU days, two
calendar weeks
• Validation, calibration
• Feedback into PRIDE datasets
• EBI farm, LSF
PRIDE Cluster Iteration 2: Why?
• PRIDE Archive has experienced a huge increase in data
since 2013.
• We wanted to develop an algorithm that could also work
with unidentified spectra.
[Chart: PRIDE Archive growth. Axes: year vs. submissions; series: all submissions, complete]
Parallelizing Spectrum Clustering: Hadoop
• Optimizes work distribution among machines.
• Hadoop is an open source framework for parallel processing, based on
the MapReduce programming model introduced by Google.
• Solves many general issues of large parallel jobs:
• Scheduling
• Inter-job communication
• Failure handling
https://hadoop.apache.org/
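To make the MapReduce idea concrete, here is a toy, single-process sketch in plain Python. It does not use Hadoop and it is not the PRIDE Cluster implementation. It only shows the map, shuffle and reduce steps that a framework like Hadoop distributes across machines. Keying spectra by a precursor m/z bin is an assumption made for illustration, and the spectrum ids and m/z values are made up.

```python
from collections import defaultdict


def map_phase(spectra, bin_width=1.0):
    """Map: emit (precursor m/z bin, spectrum id) for every input spectrum."""
    for spectrum_id, precursor_mz in spectra:
        yield int(precursor_mz // bin_width), spectrum_id


def shuffle(pairs):
    """Shuffle: group all values that share a key (done by the framework in Hadoop)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: process each key's group independently; here we just sort the ids."""
    return {key: sorted(values) for key, values in grouped.items()}


spectra = [("s1", 500.27), ("s2", 500.31), ("s3", 612.84), ("s4", 500.90)]
result = reduce_phase(shuffle(map_phase(spectra)))
print(result)
```

Each key's group can be reduced without looking at any other group, which is what lets the work be spread over many nodes. In a real clustering job the reduce step would compare the spectra within a group rather than sort their ids.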
PRIDE Cluster: Second Implementation
• Griss et al., Nat. Methods, 2013
• Clustered all public, identified
spectra in PRIDE
• EBI compute farm, LSF
• 20.7 M identified spectra
• 610 CPU days, two
calendar weeks
• Validation, calibration
• Feedback into PRIDE datasets
• EBI farm, LSF
• Griss et al., Nat. Methods, 2016
• Clustered all public spectra in
PRIDE by April 2015
• Apache Hadoop.
• Starting with 256 M spectra.
• 190 M unidentified spectra (they
were filtered to 111 M for spectra
that are likely to represent a
peptide).
• 66 M identified spectra
• Result: 28 M clusters
• 5 calendar days on 30 node
Hadoop cluster, 340 CPU cores
Examples: one perfect cluster
- 880 PSMs give the same peptide ID
- 4 species
- 28 datasets
- Same instruments
Examples: one perfect cluster (2)
PRIDE Cluster
Sequence-based
search engines
Spectrum clustering
Incorrectly or
unidentified spectra
Output of the analysis
• 1. Inconsistent spectrum clusters
• 2. Clusters including identified and unidentified spectra.
• 3. Clusters just containing unidentified spectra.
1. Re-analysis of inconsistent clusters
[Figure: originally submitted identified spectra are grouped by spectrum clustering into one cluster whose members carry conflicting identifications (NMMAACDPR, IGGIGTVPVGR, PPECPDFDPPR, VFDEFKPLVEEPQNLIK); a consensus spectrum is built for the cluster.]
• Inconsistent cluster: no sequence has a proportion in the cluster >50%.
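The three output categories and the 50% criterion for inconsistency can be sketched as a classifier over one cluster's identifications. This is an illustration under assumptions: the proportion is computed over identified spectra only (the slides do not say whether unidentified spectra count in the denominator), and the function name and category labels are hypothetical. The example counts are the peptide labels as they appear in the flattened figure text: four, two, one and one.

```python
from collections import Counter


def classify_cluster(identifications, min_ratio=0.5):
    """Classify a cluster from one entry per spectrum: a peptide sequence or None."""
    identified = [seq for seq in identifications if seq is not None]
    if not identified:
        return "unidentified_only"        # category 3
    top_count = Counter(identified).most_common(1)[0][1]
    if top_count / len(identified) <= min_ratio:
        return "inconsistent"             # category 1: no sequence above 50%
    if len(identified) < len(identifications):
        return "mixed"                    # category 2: identified and unidentified
    return "consistent"


# Four of eight spectra agree: a proportion of exactly 50%, which is not >50%.
example = (["NMMAACDPR"] * 4 + ["IGGIGTVPVGR"] * 2
           + ["PPECPDFDPPR", "VFDEFKPLVEEPQNLIK"])
print(classify_cluster(example))
```

With these counts the most frequent sequence reaches exactly 50%, so the cluster falls just short of the cut-off and is reported as inconsistent.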
1. Re-analysis of inconsistent clusters
• Re-analysed 3,997 large (>100 spectra), inconsistent clusters with
PepNovo, SpectraST, X!Tandem.
• 453 clusters (11%) were identified as peptides originating from
keratins, trypsin, albumin, and hemoglobin.
• In these cases, it is likely that a contaminants DB was not used in the
search.
Validation
2. Inferring identifications for originally unidentified spectra
• 9.1 M unidentified spectra were contained in clusters with a reliable
identification.
• These are candidate new identifications (that need to be confirmed),
often missed due to search engine settings
• Example: 49,263 reliable clusters (containing 560,000 identified and
130,000 unidentified spectra) contained phosphorylated peptides,
many of them from non-enriched studies.
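The inference step described above can be sketched as follows: when a cluster has a reliable identification, its unidentified members inherit that peptide as a candidate. This is a hedged illustration rather than the published method. It reuses the "at least 3 spectra, ratio >70%" threshold from the concept slide and assumes both numbers apply to the identified spectra in the cluster. The function name and spectrum ids are hypothetical.

```python
from collections import Counter


def infer_identifications(cluster, min_spectra=3, min_ratio=0.70):
    """Propose candidate peptides for unidentified spectra in one cluster.

    cluster maps spectrum id -> peptide sequence, or None if unidentified.
    Returns {spectrum id: candidate sequence}; empty if the cluster is not reliable.
    """
    identified = [seq for seq in cluster.values() if seq is not None]
    if len(identified) < min_spectra:
        return {}
    sequence, count = Counter(identified).most_common(1)[0]
    if count / len(identified) <= min_ratio:
        return {}
    # Candidates only: as the slide notes, these still need to be confirmed.
    return {sid: sequence for sid, seq in cluster.items() if seq is None}


cluster = {
    "s1": "NMMAACDPR",
    "s2": "NMMAACDPR",
    "s3": "NMMAACDPR",
    "s4": None,
    "s5": None,
}
print(infer_identifications(cluster))
```

A cluster whose identified spectra disagree, or which has fewer than three identified spectra, yields no candidates.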
3. Consistently unidentified clusters
• 19 M clusters contain only unidentified spectra.
• 41,155 of these clusters have more than 100 spectra (= 12 M
spectra).
• Most of them are likely to be derived from peptides.
• They could correspond to PTMs or variant peptides.
• With various methods, we found likely identifications for about 20%.
• Vast amount of data mining remains to be done.
PRIDE Cluster as a Public Data Mining Resource
• http://www.ebi.ac.uk/pride/cluster
• Spectral libraries for 16 species.
• All clustering results, as well as specific subsets of interest available.
• Source code (open source) and Java API
Public datasets from different omics: OmicsDI
http://www.ebi.ac.uk/Tools/omicsdi/
• Aims to integrate ‘omics’ datasets (proteomics,
transcriptomics, metabolomics and genomics at present).
PRIDE
MassIVE
jPOST
PASSEL
GPMDB
ArrayExpress
Expression Atlas
MetaboLights
Metabolomics Workbench
GNPS
EGA
Perez-Riverol et al., 2016, bioRxiv
OmicsDI: Portal for omics datasets
Summary part 2
• Using a “big data” approach we were able to get extra
knowledge from all the public data in PRIDE Archive.
• Spectrum clustering enables QC in proteomics resources
such as PRIDE Archive.
• It is possible to detect spectra that are consistently
unidentified across hundreds of datasets (maybe peptide
variants, or peptides with PTMs not initially considered).
• OmicsDI: new platform to identify public datasets coming
from different omics technologies (more possibilities for data
reuse!)
Acknowledgements: People
Attila Csordas
Tobias Ternent
Gerhard Mayer (de.NBI)
Johannes Griss
Yasset Perez-Riverol
Manuel Bernal-Llinares
Andrew Jarnuczak
Enrique Perez
Former team members, especially
Rui Wang, Florian Reisinger, Noemi
del Toro, Jose A. Dianes & Henning
Hermjakob
Acknowledgements: The PRIDE Team
All data submitters !!!
@pride_ebi
@proteomexchange
Questions?
http://www.slideshare.net/JuanAntonioVizcaino

More Related Content

PPTX
Proteomics and the "big data" trend: challenges and new possibilitites (Talk ...
PPTX
Experiences to learn from the MS proteomics field
PPTX
Public proteomics data: a (mostly unexploited) gold mine for computational re...
PPTX
Reuse of public proteomics data
PPTX
PRIDE-ProteomeXchange
PPTX
Mass spectrometry resources at the EBI
PPTX
Proteomics data standards
PPTX
Mining the hidden proteome using hundreds of public proteomics datasets
Proteomics and the "big data" trend: challenges and new possibilitites (Talk ...
Experiences to learn from the MS proteomics field
Public proteomics data: a (mostly unexploited) gold mine for computational re...
Reuse of public proteomics data
PRIDE-ProteomeXchange
Mass spectrometry resources at the EBI
Proteomics data standards
Mining the hidden proteome using hundreds of public proteomics datasets

What's hot (20)

PPTX
Proteomics repositories
PPTX
The ProteomeXchange Consoritum: 2017 update
PPTX
Proteomics data standards
PPTX
How to run and maintain a popular biological data repository?
PPTX
An overview of the PRIDE ecosystem of resources and computational tools for m...
PPTX
PRIDE and ProteomeXchange: supporting the cultural change in proteomics publi...
PDF
Pride cluster presentation
PDF
ICIC 2013 Conference Proceedings Uwe Rosemann TIB
PPTX
What is Reproducibility? The R* brouhaha (and how Research Objects can help)
PDF
ICIC 2013 New Product Introductions ChemAxon
PPTX
FAIR Software (and Data) Citation: Europe, Research Object Systems, Networks ...
PPTX
FAIR Data, Operations and Model management for Systems Biology and Systems Me...
PPTX
FAIRDOM - FAIR Asset management and sharing experiences in Systems and Synthe...
PPTX
Reflections on a (slightly unusual) multi-disciplinary academic career
PDF
OEG-Tools for supporting Ontology Engineering
PPTX
Reproducibility, Research Objects and Reality, Leiden 2016
PPTX
Reproducible Research: how could Research Objects help
PPTX
FAIR data and model management for systems biology.
PPTX
ProteomeXchange Experience: PXD Identifiers and Release of Data on Acceptance...
PPT
News from the DOI and DataCite Community
Proteomics repositories
The ProteomeXchange Consoritum: 2017 update
Proteomics data standards
How to run and maintain a popular biological data repository?
An overview of the PRIDE ecosystem of resources and computational tools for m...
PRIDE and ProteomeXchange: supporting the cultural change in proteomics publi...
Pride cluster presentation
ICIC 2013 Conference Proceedings Uwe Rosemann TIB
What is Reproducibility? The R* brouhaha (and how Research Objects can help)
ICIC 2013 New Product Introductions ChemAxon
FAIR Software (and Data) Citation: Europe, Research Object Systems, Networks ...
FAIR Data, Operations and Model management for Systems Biology and Systems Me...
FAIRDOM - FAIR Asset management and sharing experiences in Systems and Synthe...
Reflections on a (slightly unusual) multi-disciplinary academic career
OEG-Tools for supporting Ontology Engineering
Reproducibility, Research Objects and Reality, Leiden 2016
Reproducible Research: how could Research Objects help
FAIR data and model management for systems biology.
ProteomeXchange Experience: PXD Identifiers and Release of Data on Acceptance...
News from the DOI and DataCite Community
Ad

Viewers also liked (12)

PPTX
Use of spark for proteomic scoring seattle presentation
POT
Proteomic identification of host and parasite biomarkers in saliva from patie...
PDF
Proteomics course 1
PPT
Soil organisms
PPTX
Soil microbiology and cycles of the elements
PPT
07 soil microbiology
PPTX
Soil Microbiology
PPT
158 genomic and proteomic risk factors
PPTX
Proteomics
PPTX
Soil microbiology
 
PPTX
Proteomics ppt
PPT
Proteomics
Use of spark for proteomic scoring seattle presentation
Proteomic identification of host and parasite biomarkers in saliva from patie...
Proteomics course 1
Soil organisms
Soil microbiology and cycles of the elements
07 soil microbiology
Soil Microbiology
158 genomic and proteomic risk factors
Proteomics
Soil microbiology
 
Proteomics ppt
Proteomics
Ad

Similar to Proteomics public data resources: enabling "big data" analysis in proteomics (20)

PPTX
PRIDE and ProteomeXchange: A golden age for working with public proteomics data
PPTX
Pride and ProteomeXchange
PPTX
Human microbiome project
PPTX
PRIDE and ProteomeXchange
PDF
Developing open data analysis pipelines in the cloud: Enabling the ‘big data’...
PDF
The ELIXIR Proteomics Community
PPTX
ProteomeXchange update 2017
PDF
ProteomeXchange update
PDF
PRIDE and ProteomeXchange – Making proteomics data accessible and reusable
PPTX
PRIDE and ProteomeXchange: Training webinar
PPTX
ProteomeXchange: data deposition and data retrieval made easy
PDF
A proteomics data “gold mine” at your disposal: Now that the data is there, w...
PPTX
ProteomeXchange update HUPO 2016
PPTX
ProteomeXchange update
PPTX
Introduction to EBI for Proteomics in ELIXIR
PDF
Big Data Europe SC6 WS 3: Ron Dekker, Director CESSDA European Open Science A...
PDF
TIB's action for research data managament as a national library's strategy in...
PDF
Intact danish workshop_20171001
PPTX
Enabling simultaneous analysis of multiple cohort studies: A BRISSKit use case
PPTX
Enabling better science: Results and vision of the OpenAIRE infrastructure an...
PRIDE and ProteomeXchange: A golden age for working with public proteomics data
Pride and ProteomeXchange
Human microbiome project
PRIDE and ProteomeXchange
Developing open data analysis pipelines in the cloud: Enabling the ‘big data’...
The ELIXIR Proteomics Community
ProteomeXchange update 2017
ProteomeXchange update
PRIDE and ProteomeXchange – Making proteomics data accessible and reusable
PRIDE and ProteomeXchange: Training webinar
ProteomeXchange: data deposition and data retrieval made easy
A proteomics data “gold mine” at your disposal: Now that the data is there, w...
ProteomeXchange update HUPO 2016
ProteomeXchange update
Introduction to EBI for Proteomics in ELIXIR
Big Data Europe SC6 WS 3: Ron Dekker, Director CESSDA European Open Science A...
TIB's action for research data managament as a national library's strategy in...
Intact danish workshop_20171001
Enabling simultaneous analysis of multiple cohort studies: A BRISSKit use case
Enabling better science: Results and vision of the OpenAIRE infrastructure an...

More from Juan Antonio Vizcaino (18)

PDF
Reusing and integrating public proteomics data to improve our knowledge of th...
PPTX
Introduction to the PSI standard data formats
PDF
Reuse of public proteomics data
PDF
PRIDE resources and ProteomeXchange
PDF
Proteomics repositories
PDF
Introduction to the Proteomics Bioinformatics Course 2018
PDF
ELIXIR Implementation Study: “Mining the Proteome: Enabling Automated Process...
PPTX
PSI-Proteome Informatics update
PDF
The ELIXIR Proteomics community
PPTX
Reuse of public proteomics data
PPTX
Proteomics repositories
PPTX
Introduction to the Proteomics Bioinformatics Course 2017
PPTX
Is it feasible to identify novel biomarkers by mining public proteomics data?
PPTX
The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...
PPTX
Enabling automated processing and analysis of large-scale proteomics data
PPTX
The Proteomics Standards Initiative (PSI)
PPTX
Introduction to the Proteomics Bioinformatics Course 2016
PPTX
Reuse of public data in proteomics
Reusing and integrating public proteomics data to improve our knowledge of th...
Introduction to the PSI standard data formats
Reuse of public proteomics data
PRIDE resources and ProteomeXchange
Proteomics repositories
Introduction to the Proteomics Bioinformatics Course 2018
ELIXIR Implementation Study: “Mining the Proteome: Enabling Automated Process...
PSI-Proteome Informatics update
The ELIXIR Proteomics community
Reuse of public proteomics data
Proteomics repositories
Introduction to the Proteomics Bioinformatics Course 2017
Is it feasible to identify novel biomarkers by mining public proteomics data?
The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...
Enabling automated processing and analysis of large-scale proteomics data
The Proteomics Standards Initiative (PSI)
Introduction to the Proteomics Bioinformatics Course 2016
Reuse of public data in proteomics

Recently uploaded (20)

PDF
Sciences of Europe No 170 (2025)
PPTX
G5Q1W8 PPT SCIENCE.pptx 2025-2026 GRADE 5
PPTX
Protein & Amino Acid Structures Levels of protein structure (primary, seconda...
PDF
HPLC-PPT.docx high performance liquid chromatography
PPTX
neck nodes and dissection types and lymph nodes levels
PPTX
Microbiology with diagram medical studies .pptx
PPTX
EPIDURAL ANESTHESIA ANATOMY AND PHYSIOLOGY.pptx
PDF
Unveiling a 36 billion solar mass black hole at the centre of the Cosmic Hors...
PPTX
The KM-GBF monitoring framework – status & key messages.pptx
PPTX
2. Earth - The Living Planet earth and life
PPTX
Comparative Structure of Integument in Vertebrates.pptx
PDF
lecture 2026 of Sjogren's syndrome l .pdf
PDF
ELS_Q1_Module-11_Formation-of-Rock-Layers_v2.pdf
PDF
Placing the Near-Earth Object Impact Probability in Context
PPTX
ECG_Course_Presentation د.محمد صقران ppt
PPT
The World of Physical Science, • Labs: Safety Simulation, Measurement Practice
DOCX
Q1_LE_Mathematics 8_Lesson 5_Week 5.docx
PDF
The scientific heritage No 166 (166) (2025)
PPTX
Cell Membrane: Structure, Composition & Functions
PPTX
Classification Systems_TAXONOMY_SCIENCE8.pptx
Sciences of Europe No 170 (2025)
G5Q1W8 PPT SCIENCE.pptx 2025-2026 GRADE 5
Protein & Amino Acid Structures Levels of protein structure (primary, seconda...
HPLC-PPT.docx high performance liquid chromatography
neck nodes and dissection types and lymph nodes levels
Microbiology with diagram medical studies .pptx
EPIDURAL ANESTHESIA ANATOMY AND PHYSIOLOGY.pptx
Unveiling a 36 billion solar mass black hole at the centre of the Cosmic Hors...
The KM-GBF monitoring framework – status & key messages.pptx
2. Earth - The Living Planet earth and life
Comparative Structure of Integument in Vertebrates.pptx
lecture 2026 of Sjogren's syndrome l .pdf
ELS_Q1_Module-11_Formation-of-Rock-Layers_v2.pdf
Placing the Near-Earth Object Impact Probability in Context
ECG_Course_Presentation د.محمد صقران ppt
The World of Physical Science, • Labs: Safety Simulation, Measurement Practice
Q1_LE_Mathematics 8_Lesson 5_Week 5.docx
The scientific heritage No 166 (166) (2025)
Cell Membrane: Structure, Composition & Functions
Classification Systems_TAXONOMY_SCIENCE8.pptx

Proteomics public data resources: enabling "big data" analysis in proteomics

  • 1. Proteomics public data resources: enabling “big data” analysis in proteomics Dr. Juan Antonio Vizcaíno EMBL-European Bioinformatics Institute Hinxton, Cambridge, UK
  • 2. Juan A. Vizcaíno juan@ebi.ac.uk International de.NBI Symposium Heidelberg, 9 November 2016 Overview • Intro: Concept of “Big data” in biology and proteomics • PRIDE Archive and ProteomeXchange • PRIDE tools • Reuse of public proteomics data • Working with “Big data”: PRIDE Cluster
  • 3. “Big data”: definition Slide from: http://www.ibmbigdatahub.com/
  • 4. “Big data” in biology The term has been applied so far mainly to genomics
  • 5. One slide intro to MS based proteomics Hein et al., Handbook of Systems Biology, 2012
  • 6. Overview • Intro: Concept of “Big data” in biology and proteomics • PRIDE Archive and ProteomeXchange • PRIDE tools • Reuse of public proteomics data • Working with “Big data”: PRIDE Cluster
  • 7. Data resources at EMBL-EBI • Genes, genomes & variation: Ensembl, Ensembl Genomes, GWAS Catalog, Metagenomics portal, European Nucleotide Archive, European Variation Archive, European Genome-phenome Archive • Gene & protein expression: ArrayExpress, Expression Atlas, PRIDE • Protein sequences, families & motifs: InterPro, Pfam, UniProt • Chemical biology: ChEMBL, ChEBI • Molecular structures: Protein Data Bank in Europe, Electron Microscopy Data Bank • Reactions, interactions & pathways: IntAct, Reactome, MetaboLights • Systems: BioModels, Enzyme Portal, BioSamples • Literature & ontologies: Europe PubMed Central, Gene Ontology, Experimental Factor Ontology
  • 8. What is a proteomics publication in 2016? • Proteomics studies generate potentially large amounts of data and results. • Ideally, a proteomics publication needs to: • Summarize the results of the study • Provide supporting information for reliability of any results reported • Information in a publication: • Manuscript • Supplementary material • Associated data submitted to a public repository
  • 9. PRIDE (PRoteomics IDEntifications) Archive http://www.ebi.ac.uk/pride/archive • PRIDE stores mass spectrometry (MS)-based proteomics data: • Peptide and protein expression data (identification and quantification) • Post-translational modifications • Mass spectra (raw data and peak lists) • Technical and biological metadata • Any other related information • Full support for tandem MS approaches • Martens et al., Proteomics, 2005; Vizcaíno et al., NAR, 2016
  • 10. ProteomeXchange: a global, distributed proteomics database • Members: PRIDE (MS/MS data), MassIVE (MS/MS data), jPOST (MS/MS data; new in 2016), PASSEL (SRM data) • Data types: raw, ID/Q, meta • Mandatory raw data deposition since July 2015 • Goal: Development of a framework to allow standard data submission and dissemination pipelines between the main existing proteomics repositories. http://www.proteomexchange.org • Vizcaíno et al., Nat Biotechnol, 2014; Deutsch et al., NAR, 2017, in press
  • 11. ProteomeXchange data workflow [diagram] • Research groups: researcher’s results, raw data, metadata • Receiving repositories: PRIDE, MassIVE, jPOST (MS/MS data, as complete submissions); PASSEL (SRM data); any other workflow (mainly partial submissions) • Datasets • ProteomeCentral: metadata / manuscript, raw data, results • Journals • Reanalysis of datasets: PeptideAtlas • Reprocessed results: MassIVE • Vizcaíno et al., Nat Biotechnol, 2014; Deutsch et al., NAR, 2017, in press
  • 12. ProteomeCentral: Centralised portal for all PX datasets http://proteomecentral.proteomexchange.org/cgi/GetDataset
  • 13. ProteomeXchange data workflow [diagram, repeated from slide 11] • Vizcaíno et al., Nat Biotechnol, 2014; Deutsch et al., NAR, 2017, in press
  • 14. ProteomeXchange data workflow [diagram, extended] • Research groups: researcher’s results, raw data, metadata • Receiving repositories: PRIDE, MassIVE, jPOST (MS/MS data, as complete submissions); PASSEL (SRM data); any other workflow (mainly partial submissions) • Datasets • ProteomeCentral: metadata / manuscript, raw data, results • Journals • Reanalysis of datasets: PeptideAtlas, GPMDB, proteomicsDB • Reprocessed results: MassIVE • UniProt/neXtProt, other DBs • OmicsDI: integration with other omics datasets • Vizcaíno et al., Nat Biotechnol, 2014; Deutsch et al., NAR, 2017, in press
  • 15. PRIDE: Source of MS proteomics data • PRIDE Archive already provides or will soon provide MS proteomics data to other EMBL-EBI resources such as UniProt, Ensembl and the EBI Expression Atlas. http://www.ebi.ac.uk/pride/archive
  • 16. PRIDE Archive: over 4,500 datasets from over 51 countries and 1,700 groups • Datasets by country: USA 814, Germany 528, UK 338, China 328, France 222, Netherlands 175, Canada 137 • Data volume: total ~275 TB; number of all files ~560,000; PXD000320-324 ~4 TB; PXD002319-26 ~2.4 TB; PXD001471 ~1.6 TB • 1,973 datasets, i.e. 52% of all, are publicly accessible • ~90% of all ProteomeXchange datasets • In the last 12 months: ~165 submitted datasets per month • Top species studied by at least 100 datasets: Homo sapiens 2,010; Mus musculus 604; Saccharomyces cerevisiae 191; Arabidopsis thaliana 140; Rattus norvegicus 127 • >900 reported taxa in total • [Chart: PRIDE Archive growth, submissions per year, all submissions vs. complete submissions]
  • 17. Overview • Intro: Concept of “Big data” in biology and proteomics • PRIDE Archive and ProteomeXchange • PRIDE tools • Reuse of public proteomics data • Working with “Big data”: PRIDE Cluster
  • 18. PRIDE Components: Data Submission Process • PRIDE Converter 2 • PRIDE Inspector • PX Submission Tool • File formats: mzIdentML, PRIDE XML • In addition to PRIDE Archive, the PRIDE team develops and maintains different tools and software libraries to facilitate the handling and visualisation of MS proteomics data and the submission process.
  • 19. Current PSI Standard File Formats for MS • mzML: MS data • mzIdentML: identification • mzQuantML: quantitation • mzTab: final results • TraML: SRM
  • 20. PRIDE Inspector Toolsuite • PRIDE Inspector: standalone tool to enable visualisation and validation of MS data. • Built on top of ms-data-core-api: open source algorithms and libraries for computational proteomics. • Supported file formats: mzIdentML, mzML, mzTab (PSI standards), and PRIDE XML. • Broad functionality, including summary and QC charts and peptide spectra annotation and visualization. • https://github.com/PRIDE-Utilities/ms-data-core-api • https://github.com/PRIDE-Toolsuite/pride-inspector • Wang et al., Nat. Biotechnology, 2012; Perez-Riverol et al., Bioinformatics, 2015; Perez-Riverol et al., MCP, 2016
  • 21. PX Submission Tool • Desktop application for data submissions to ProteomeXchange via PRIDE • Implemented in Java 7 • Streamlines the submission process • Captures mappings between files • Retains metadata • Fast file transfer with Aspera (FASP® transfer technology); FTP also available • Command line option • [Submission tool screenshot]
  • 22. Overview • Intro: Concept of “Big data” in biology and proteomics • PRIDE Archive and ProteomeXchange • PRIDE tools • Reuse of public proteomics data • Working with “Big data”: PRIDE Cluster
  • 23. Datasets are being reused more and more… • Data download volume for PRIDE Archive in 2015: 198 TB • [Chart: downloads in TB per year, 2013 to 2016] • Vaudel et al., Proteomics, 2016
  • 24. Data sharing in Proteomics Vaudel et al., Proteomics, 2016
  • 25. Draft Human proteome papers published in 2014 • Wilhelm et al., Nature, 2014; Kim et al., Nature, 2014 • Two independent groups claimed to have produced the first complete draft of the human proteome by MS. • Some of their findings are controversial and need further validation, but they generated a lot of discussion and put proteomics in the spotlight. • They used many different tissues. • [Nature cover, 29 May 2014]
  • 26. Draft Human proteome papers published in 2014 • Wilhelm et al., Nature, 2014 • Around 60% of the data used for the analysis comes from previous experiments, most of them stored in proteomics repositories such as PRIDE/ProteomeXchange, PASSEL or MassIVE. • They complement that data with “exotic” tissues.
  • 27. Data sharing in Proteomics Vaudel et al., Proteomics, 2016
  • 28. Examples of repurposing in proteogenomics
  • 29. Data sharing in Proteomics
  • 30. Challenges for data reuse in proteomics • Insufficient technical and biological metadata. • Large computational infrastructure may be needed (e.g. when analysing many datasets together). • Shortage of expertise (people). • Lack of standardisation in the field.
  • 31. Summary of the talk so far • PRIDE Archive and other ProteomeXchange resources make data sharing possible in the MS proteomics field. • Data sharing is becoming the norm in the field. • Standalone tools: PRIDE Inspector and PX Submission tool. • Datasets are increasingly reused (many opportunities): • Example of one of the drafts of the human proteome. • Proteogenomics approaches. • But there are important challenges as well.
  • 32. Overview • Intro: Concept of “Big data” in biology and proteomics • PRIDE Archive and ProteomeXchange • PRIDE tools • Reuse of public proteomics data • Working with “Big data”: PRIDE Cluster
  • 33. Data sharing in Proteomics Vaudel et al., Proteomics, 2016
  • 34. PRIDE Cluster: Initial Motivation • Provide a QC-filtered peptide-centric view of PRIDE. • Data is stored in PRIDE Archive as originally analysed by the submitters (no data reprocessing is done). • Heterogeneous quality, difficult to make the data comparable. • Enable assessment of (published) proteomics data.
  • 35. PRIDE Cluster • Provide an aggregated peptide-centric view of PRIDE Archive. • Hypothesis: the same peptide will generate similar MS/MS spectra across experiments. • Enables QC of peptide-spectrum matches (PSMs): infer reliable identifications by comparing submitted identifications of spectra within a cluster. • After clustering, a representative spectrum is built for all peptides consistently identified across different datasets. • Griss et al., Nat. Methods, 2013; Griss et al., Nat. Methods, 2016
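The hypothesis on this slide (the same peptide yields similar MS/MS spectra) can be illustrated with a simple spectrum comparison. The sketch below is only an illustration: the slides do not say which similarity measure PRIDE Cluster uses, so the binned cosine similarity, the 1.0 m/z bin width, the function names and the peak values are all assumptions made for this example.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins (bin width is an assumption)."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz // bin_width)] += intensity
    return binned


def cosine_similarity(peaks_a, peaks_b, bin_width=1.0):
    """Cosine similarity between two spectra given as lists of (m/z, intensity)."""
    a = bin_spectrum(peaks_a, bin_width)
    b = bin_spectrum(peaks_b, bin_width)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical peak lists: a and b mimic the same peptide measured twice,
# c mimics an unrelated peptide.
spectrum_a = [(175.1, 30.0), (262.2, 100.0), (375.3, 55.0)]
spectrum_b = [(175.2, 28.0), (262.1, 95.0), (375.4, 60.0)]
spectrum_c = [(120.0, 80.0), (301.5, 40.0), (488.9, 100.0)]

print(cosine_similarity(spectrum_a, spectrum_b))  # close to 1
print(cosine_similarity(spectrum_a, spectrum_c))  # no shared bins
```

Spectra whose similarity exceeds some cut-off would be candidates for the same cluster; the cut-off itself is not given in the slides.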
  • 36. PRIDE Cluster: Concept [diagram] • Originally submitted identified spectra → spectrum clustering → consensus spectrum • Peptide labels shown on the spectra: NMMAACDPR (most spectra) and PPECPDFDPPR • Threshold: at least 3 spectra in a cluster and ratio >70%.
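The threshold on this slide can be written down as a small check. This is a minimal sketch, assuming that "ratio" means the share of spectra in the cluster that carry the most common peptide identification; the function name and the example cluster (loosely modelled on the labels in the diagram) are hypothetical.

```python
from collections import Counter


def assess_cluster(peptides, min_size=3, min_ratio=0.7):
    """Return (is_reliable, top_peptide, ratio) for the identifications in one cluster.

    Assumption: a cluster is reliable if it holds at least `min_size` spectra
    and the most common peptide accounts for more than `min_ratio` of them.
    """
    if len(peptides) < min_size:
        return False, None, 0.0
    top_peptide, count = Counter(peptides).most_common(1)[0]
    ratio = count / len(peptides)
    return ratio > min_ratio, top_peptide, ratio


# Illustrative cluster: five spectra identified as one peptide, one as another.
cluster = ["NMMAACDPR"] * 5 + ["PPECPDFDPPR"]
reliable, peptide, ratio = assess_cluster(cluster)
print(reliable, peptide, round(ratio, 2))
```

With five of six spectra agreeing, the ratio is about 0.83, which passes the >70% rule; a two-spectrum cluster would fail on size alone.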
  • 37. PRIDE Cluster: Concept
  • 38. PRIDE Cluster: Implementation • Griss et al., Nat. Methods, 2013 • Clustered all public, identified spectra in PRIDE • EBI compute farm, LSF • 20.7 M identified spectra • 610 CPU days, two calendar weeks • Validation, calibration • Feedback into PRIDE datasets • EBI farm, LSF
  • 39. PRIDE Cluster Iteration 2: Why? • PRIDE Archive has experienced a huge increase in data since 2013. • We wanted to develop an algorithm that could also work with unidentified spectra. • [Chart: PRIDE Archive growth, submissions per year, all submissions vs. complete submissions]
  • 40. Parallelizing Spectrum Clustering: Hadoop • Hadoop is an open source framework for parallel processing based on the MapReduce model introduced by Google. • Optimizes work distribution among machines. • Solves many general issues of large parallel jobs: • Scheduling • Inter-job communication • Failure handling • https://hadoop.apache.org/
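To make the MapReduce idea concrete, here is a toy, single-process sketch of the map, shuffle and reduce steps in plain Python. It does not use the Hadoop API, and it is not the PRIDE Cluster implementation: grouping spectra by a precursor m/z bin, the 2.0 m/z bin width, and the record layout are assumptions chosen only to show how work can be split into independent groups.

```python
from collections import defaultdict


def map_phase(spectra, bin_width=2.0):
    """Map: emit (precursor m/z bin, spectrum id) for every spectrum."""
    for spectrum in spectra:
        yield int(spectrum["precursor_mz"] // bin_width), spectrum["id"]


def shuffle(pairs):
    """Shuffle: group all values that share a key (Hadoop does this between phases)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: process each group independently; here we only sort the ids.

    In a real job each group could be clustered on a different machine.
    """
    return {key: sorted(ids) for key, ids in groups.items()}


# Hypothetical input records.
spectra = [
    {"id": "s1", "precursor_mz": 500.3},
    {"id": "s2", "precursor_mz": 500.9},
    {"id": "s3", "precursor_mz": 812.4},
]
result = reduce_phase(shuffle(map_phase(spectra)))
print(result)
```

Because each key's group is processed without reference to the others, a framework such as Hadoop can schedule the groups on separate nodes and rerun only the failed ones.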
  • 41. PRIDE Cluster: Second Implementation • First implementation (Griss et al., Nat. Methods, 2013): • Clustered all public, identified spectra in PRIDE • EBI compute farm, LSF • 20.7 M identified spectra • 610 CPU days, two calendar weeks • Validation, calibration • Feedback into PRIDE datasets • Second implementation (Griss et al., Nat. Methods, 2016): • Clustered all public spectra in PRIDE by April 2015 • Apache Hadoop • Starting with 256 M spectra: 190 M unidentified spectra (filtered to 111 M spectra that are likely to represent a peptide) and 66 M identified spectra • Result: 28 M clusters • 5 calendar days on a 30-node Hadoop cluster, 340 CPU cores
  • 42. Examples: one perfect cluster • 880 PSMs give the same peptide ID • 4 species • 28 datasets • Same instruments
  • 43. Examples: one perfect cluster (2)
  • 44. PRIDE Cluster [diagram]: sequence-based search engines; spectrum clustering; incorrectly identified or unidentified spectra
  • 45. Output of the analysis • 1. Inconsistent spectrum clusters. • 2. Clusters including identified and unidentified spectra. • 3. Clusters just containing unidentified spectra.
  • 46. 1. Re-analysis of inconsistent clusters [diagram] • Originally submitted identified spectra → spectrum clustering → consensus spectrum • Peptide labels shown on the spectra: NMMAACDPR, IGGIGTVPVGR, PPECPDFDPPR, VFDEFKPLVEEPQNLIK • No sequence has a proportion in the cluster >50%.
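The "no sequence has a proportion >50%" rule can be sketched in the same way as the reliability threshold. This is a minimal illustration, assuming the proportion is computed over the identified spectra in the cluster; the function names are hypothetical and the example cluster only borrows the peptide labels from the diagram, with made-up counts.

```python
from collections import Counter


def top_sequence_ratio(peptides):
    """Share of spectra in the cluster that carry the most common peptide sequence."""
    if not peptides:
        raise ValueError("cluster has no identified spectra")
    count = Counter(peptides).most_common(1)[0][1]
    return count / len(peptides)


def is_inconsistent(peptides, max_ratio=0.5):
    """Assumption: a cluster is inconsistent if no sequence exceeds `max_ratio`."""
    return top_sequence_ratio(peptides) <= max_ratio


# Illustrative cluster of eight spectra with four competing identifications.
mixed = (["NMMAACDPR"] * 4 + ["IGGIGTVPVGR"] * 2
         + ["PPECPDFDPPR", "VFDEFKPLVEEPQNLIK"])
print(top_sequence_ratio(mixed), is_inconsistent(mixed))
```

Here the most common sequence covers exactly half of the spectra, so no sequence exceeds 50% and the cluster would be flagged for re-analysis.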
  • 47. 1. Re-analysis of inconsistent clusters • Re-analysed 3,997 large (>100 spectra), inconsistent clusters with PepNovo, SpectraST and X!Tandem. • 453 clusters (11%) were identified as peptides originating from keratins, trypsin, albumin, and hemoglobin. • In these cases, it is likely that a contaminants DB was not used in the original search.
  • 48. Validation
  • 52. 2. Inferring identifications for originally unidentified spectra • 9.1 M unidentified spectra were contained in clusters with a reliable identification. • These are candidate new identifications (that need to be confirmed), often missed due to search engine settings. • Example: 49,263 reliable clusters (containing 560,000 identified and 130,000 unidentified spectra) contained phosphorylated peptides, many of them from non-enriched studies.
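The inference step described on this slide can be sketched as follows. This is a hedged illustration rather than the published pipeline: it assumes a cluster is a list of (spectrum id, peptide or None) pairs, reuses the "at least 3 spectra, ratio >70%" threshold from the concept slide, and applies it to the identified spectra only; all names are hypothetical.

```python
from collections import Counter


def candidate_identifications(cluster, min_size=3, min_ratio=0.7):
    """Propose the cluster's dominant peptide for its unidentified spectra.

    `cluster` is a list of (spectrum_id, peptide) pairs, with peptide set to
    None for unidentified spectra. Returns an empty list unless the identified
    spectra pass the (assumed) reliability threshold.
    """
    identified = [peptide for _, peptide in cluster if peptide is not None]
    if len(identified) < min_size:
        return []
    top_peptide, count = Counter(identified).most_common(1)[0]
    if count / len(identified) <= min_ratio:
        return []
    return [(sid, top_peptide) for sid, peptide in cluster if peptide is None]


# Hypothetical cluster: three agreeing identifications, two unidentified spectra.
cluster = [
    ("s1", "NMMAACDPR"),
    ("s2", "NMMAACDPR"),
    ("s3", "NMMAACDPR"),
    ("s4", None),
    ("s5", None),
]
print(candidate_identifications(cluster))
```

As the slide stresses, the returned pairs are only candidates: each would still need to be confirmed, for example by a targeted re-search.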
  • 53. 3. Consistently unidentified clusters • 19 M clusters contain only unidentified spectra. • 41,155 of these clusters contain more than 100 spectra each (12 M spectra in total). • Most of them are likely to be derived from peptides. • They could correspond to PTMs or variant peptides. • With various methods, we found likely identifications for about 20%. • A vast amount of data mining remains to be done.
  • 54. 3. Consistently unidentified clusters
  • 55. PRIDE Cluster as a Public Data Mining Resource • http://www.ebi.ac.uk/pride/cluster • Spectral libraries for 16 species. • All clustering results, as well as specific subsets of interest, are available. • Source code (open source) and Java API.
  • 56. Public datasets from different omics: OmicsDI http://www.ebi.ac.uk/Tools/omicsdi/ • Aims to integrate ‘omics’ datasets (proteomics, transcriptomics, metabolomics and genomics at present). • Resources shown: PRIDE, MassIVE, jPOST, PASSEL, GPMDB, ArrayExpress, Expression Atlas, MetaboLights, Metabolomics Workbench, GNPS, EGA • Perez-Riverol et al., 2016, bioRxiv
  • 57. OmicsDI: Portal for omics datasets
  • 58. OmicsDI: Portal for omics datasets
  • 59. OmicsDI: Portal for omics datasets
  • 60. Summary part 2 • Using a “big data” approach we were able to get extra knowledge from all the public data in PRIDE Archive. • Spectrum clustering enables QC in proteomics resources such as PRIDE Archive. • It is possible to detect spectra that are consistently unidentified across hundreds of datasets (maybe peptide variants, or peptides with PTMs not initially considered). • OmicsDI: new platform to identify public datasets coming from different omics technologies (more possibilities for data reuse!)
  • 61. Acknowledgements: People • Attila Csordas, Tobias Ternent, Gerhard Mayer (de.NBI), Johannes Griss, Yasset Perez-Riverol, Manuel Bernal-Llinares, Andrew Jarnuczak, Enrique Perez • Former team members, especially Rui Wang, Florian Reisinger, Noemi del Toro, Jose A. Dianes & Henning Hermjakob • Acknowledgements: The PRIDE Team • All data submitters!!! • @pride_ebi @proteomexchange
  • 62. Questions? http://www.slideshare.net/JuanAntonioVizcaino