Proteomics - Analysis and integration of large-scale data sets Lars Juhl Jensen EMBL Heidelberg
Overview
Methods for predicting protein-protein interactions: Cross-species inference; Genomic context methods; Prediction from expression data; Automated extraction from text
Quality control of high-throughput interaction data: Types of data sets available; Network representations of interaction data sets; Topology-based quality scores; Benchmarking of data sets; Filtering strategies
Prediction of protein features and function: Linear motifs in proteins; Relation to interaction networks; Motif prediction from sequence; From features to function
Qualitative modeling of the yeast cell cycle: Modeling the cell cycle through large-scale data integration; What does the model tell us?; A neural network approach to predicting cell cycle proteins; The cell cycle in feature space
Part 1 Methods for predicting protein-protein interactions Lars Juhl Jensen EMBL Heidelberg
Cross-species integration of diverse data Challenges and promises of large-scale data integration Explosive increase in both the amounts and the different types of high-throughput data sets that are being produced These data are highly heterogeneous and lack standardization Most data sets are error-prone and suffer from systematic biases Experiments should be integrated across model organisms STRING is a web resource that integrates and transfers diverse large-scale data across 100+ species, but it is not a primary repository for experimental data, a curated database of complexes or pathways, or a substitute for expert annotation
What is STRING? Genomic neighborhood Species co-occurrence Gene fusions Database imports Exp. interaction data Microarray expression data Literature co-mentioning
Genomic context methods © Nature Biotechnology, 2004
Inferring functional modules from gene presence/absence patterns Trends in Microbiology
Inferring functional modules from gene presence/absence patterns Trends in Microbiology Figure: the “cellulosome” (labels: cell, cell wall, anchoring proteins, cellulosomes, cellulose, resting protuberances, protracted protuberance) © Trends Microbiol, 1999
Formalizing the phylogenetic profile method Align all proteins against all Calculate best-hit profile Join similar species by PCA Calculate PC profile distances Calibrate against KEGG maps
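The profile comparison at the heart of these steps can be illustrated with a minimal sketch. It is a simplification, not the pipeline described on the slide: it assumes binary presence/absence profiles rather than best-hit scores, skips the PCA step that joins similar species, and uses plain Euclidean distance. The gene names and profiles are made up for illustration.

```python
from math import sqrt

# Hypothetical presence/absence profiles: one column per species,
# 1 = a homolog of the gene is found in that species.
profiles = {
    "geneA": [1, 1, 0, 0, 1, 0],
    "geneB": [1, 1, 0, 0, 1, 0],
    "geneC": [0, 0, 1, 1, 0, 1],
    "geneD": [1, 1, 0, 1, 1, 0],
}

def profile_distance(p, q):
    """Euclidean distance between two profiles over the same species."""
    if len(p) != len(q):
        raise ValueError("profiles must cover the same species")
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def ranked_pairs(profiles):
    """Return all gene pairs as (distance, gene_a, gene_b), closest first."""
    names = sorted(profiles)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            pairs.append((profile_distance(profiles[a], profiles[b]), a, b))
    return sorted(pairs)

pairs = ranked_pairs(profiles)
```

Pairs with small distances (here geneA and geneB, which occur in exactly the same species) are the candidates for a functional association; in the approach on the slide, such raw distances would then be calibrated against KEGG maps.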
Inferring functional associations from evolutionarily conserved operons Identify runs of adjacent genes with the same direction Score each gene pair based on intergenic distances Calibrate against KEGG maps Infer associations in other species
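The first two steps can be sketched as follows. This is an illustrative simplification under stated assumptions: the gene table (names, strands, coordinates) is hypothetical, only directly neighbouring genes within a run are paired, and the raw intergenic distance is reported without the scoring and KEGG calibration mentioned on the slide.

```python
from itertools import groupby

# Hypothetical gene table: (name, strand, start, end) on one chromosome.
genes = [
    ("g1", "+", 100, 900),
    ("g2", "+", 950, 1800),
    ("g3", "+", 1820, 2500),
    ("g4", "-", 2600, 3300),
    ("g5", "-", 3400, 4100),
    ("g6", "+", 4500, 5000),
]

def same_strand_runs(genes):
    """Group consecutive genes (ordered by start) that share a strand."""
    ordered = sorted(genes, key=lambda g: g[2])
    return [list(run) for _, run in groupby(ordered, key=lambda g: g[1])]

def neighbour_distances(run):
    """Return (gene_a, gene_b, intergenic distance) for neighbours in a run."""
    return [(a[0], b[0], b[2] - a[3]) for a, b in zip(run, run[1:])]

runs = same_strand_runs(genes)
candidates = [pair for run in runs for pair in neighbour_distances(run)]
```

Short intergenic distances within a run are the signal of interest; a run of length one (g6 here) contributes no pairs.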
Predicting functional and physical interactions from gene fusion/fission events Find genes in A that match the same gene in B Exclude overlapping alignments Calibrate against KEGG maps Calculate all-against-all pairwise alignments
Integrating physical interaction screens Make binary representation of complexes Yeast two-hybrid data sets are inherently binary Calculate score from number of (co-)occurrences Calculate score from non-shared partners Calibrate against KEGG maps Infer associations in other species Combine evidence from experiments
Mining microarray expression databases Re-normalize arrays by modern method to remove biases Build expression matrix Combine similar arrays by PCA Construct predictor by Gaussian kernel density estimation Calibrate against KEGG maps Infer associations in other species
The Qspline method for non-linear intensity normalization of expression data From the empirical distribution, a number of quantiles are calculated for each of the channels to be normalized (one channel shown in red) and for the reference distribution (shown in black) A QQ-plot is made and a normalization curve is constructed by fitting a cubic spline function As reference one can either use an artificial “median array” for a set of arrays or use a log-normal distribution, which is a good approximation
Non-linear normalization of intensities and correction for spatial effects Downloaded SMD data After intensity normalization Spatial bias estimate After spatial normalization
Co-mentioning in the scientific literature Associate abstracts with species Identify gene names in title/abstract Count (co-)occurrences of genes Test significance of associations Calibrate against KEGG maps Infer associations in other species
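The counting step can be sketched in a few lines. The abstracts below are hypothetical sets of already-recognized gene names, and the observed/expected ratio is only an illustrative stand-in for the significance test mentioned on the slide, whose exact form is not given there.

```python
from collections import Counter
from itertools import combinations

# Hypothetical input: the set of gene names recognized in each abstract.
abstracts = [
    {"CDC28", "CLN2"},
    {"CDC28", "CLN2", "SIC1"},
    {"CDC28", "SIC1"},
    {"ACT1"},
]

def count_mentions(abstracts):
    """Count single-gene occurrences and pairwise co-occurrences."""
    single = Counter()
    pair = Counter()
    for genes in abstracts:
        single.update(genes)
        pair.update(combinations(sorted(genes), 2))
    return single, pair

def observed_over_expected(a, b, single, pair, n_abstracts):
    """Ratio of observed co-mentions to the number expected by chance."""
    expected = single[a] * single[b] / n_abstracts
    return pair[tuple(sorted((a, b)))] / expected

single, pair = count_mentions(abstracts)
ratio = observed_over_expected("CDC28", "CLN2", single, pair, len(abstracts))
```

A ratio above 1 means the two genes are mentioned together more often than their individual frequencies would suggest.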
Evidence transfer based on “fuzzy orthology” Orthology transfer is tricky Correct assignment of orthology is difficult for distant species Functional equivalence cannot be guaranteed for in-paralogs These problems are addressed by our “fuzzy orthology” scheme Confidence scores for functional equivalence are calculated from all-against-all alignment Evidence is distributed across possible pairs according to confidence scores in the case of many-to-many relationships ? Source species Target species
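One way to picture the many-to-many case is the sketch below. It assumes that evidence is split in proportion to the confidence of each candidate pair; the slide only says the evidence is distributed "according to confidence scores", so the proportional rule, the function name and the example weights are all assumptions.

```python
def distribute_evidence(score, candidate_pairs):
    """Split one evidence score across candidate target-species pairs.

    candidate_pairs maps (target_protein_a, target_protein_b) to a
    confidence weight for functional equivalence (hypothetical values).
    """
    total = sum(candidate_pairs.values())
    if total <= 0:
        return {}
    return {pair: score * weight / total
            for pair, weight in candidate_pairs.items()}

# An interaction observed in the source species with score 0.9 maps to
# two possible pairs in the target species.
transferred = distribute_evidence(0.9, {("a1", "b1"): 3.0, ("a2", "b1"): 1.0})
```

Under this rule the transferred scores always sum to the original score, so a many-to-many relationship dilutes the evidence rather than multiplying it.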
The power of cross-species transfer and evidence integration
Conclusions Many types of data can be used for interaction prediction To make the best of these data they must each be benchmarked and integrated across species The STRING web resource does just this
Questions?
Part 2 Quality control of high-throughput interaction data Lars Juhl Jensen EMBL Heidelberg
Protein interaction data sets Many high-throughput data sets have been published in the past 5 years S. cerevisiae is by far the best covered organism Recently, large data sets were made for two metazoans Two fundamentally different techniques have been used Affinity purification/MS The yeast two-hybrid assay Interaction databases IntAct, BIND, DIP, MINT Species-specific databases © Current Opinion in Structural Biology, 2004
The topology of protein interaction networks A multitude of publications exist on protein network topology Global measures of topology Degree distribution Mean path length Clustering coefficient Theoretical models of networks Random Scale-free Hierarchical Local topology, network motifs
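Two of the global measures listed above, degree and clustering coefficient, are easy to compute on a small example. The sketch assumes an undirected network stored as a dictionary of neighbour sets; the four-node network is made up.

```python
# Hypothetical undirected network: each node maps to its set of neighbours.
network = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}

def degrees(adj):
    """Number of interaction partners per node."""
    return {node: len(nbrs) for node, nbrs in adj.items()}

def clustering_coefficient(adj, node):
    """Fraction of a node's neighbour pairs that are themselves connected."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
    return 2.0 * links / (k * (k - 1))

def degree_distribution(adj):
    """Map each degree to the number of nodes that have it."""
    dist = {}
    for k in degrees(adj).values():
        dist[k] = dist.get(k, 0) + 1
    return dist
```

Node A has three neighbours of which only one pair (B and C) interacts, giving a clustering coefficient of 1/3, while B sits in a fully connected triangle and scores 1.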
What is an interaction? Physical protein interactions Proteins that physically touch each other within a complex Members of the same stable complex Transient interactions, e.g. a protein kinase with its substrate More broadly defined “functional interactions” Direct neighbors in metabolic networks Members of the same pathway The pragmatic definition – whatever the assays detect Affinity purification tends to find members of stable complexes Yeast two-hybrid assays also detect more transient interactions
Binary representations of purification data © Drug Discovery Today: TARGETS, 2004
Topology-based quality scores
Scoring scheme for yeast two-hybrid data: $S_1 = -\log\left((N_1 + 1) \cdot (N_2 + 1)\right)$
$N_1$ and $N_2$ are the numbers of non-shared interaction partners. Similar scoring schemes have been published by Saito et al.
Scoring scheme for complex pull-down data: $S_2 = \log\left[\frac{N_{12} \cdot N}{(N_1 + 1) \cdot (N_2 + 1)}\right]$
$N_{12}$ is the number of purifications containing both proteins, $N_1$ is the number containing protein 1, $N_2$ is defined similarly, and $N$ is the total number of purifications.
Both schemes aim at identifying ubiquitous interactors
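The two scoring schemes translate directly into code. The sketch below assumes the natural logarithm, since the slide does not state the base; the function names are hypothetical.

```python
from math import log

def y2h_score(n1, n2):
    """S1 for a yeast two-hybrid interaction.

    n1, n2: numbers of non-shared interaction partners of the two proteins.
    """
    return -log((n1 + 1) * (n2 + 1))

def pulldown_score(n12, n1, n2, n_total):
    """S2 for a pair of proteins seen in complex purifications.

    n12: purifications containing both proteins
    n1, n2: purifications containing protein 1 and protein 2, respectively
    n_total: total number of purifications
    """
    if n12 <= 0:
        raise ValueError("the pair must co-occur in at least one purification")
    return log((n12 * n_total) / ((n1 + 1) * (n2 + 1)))
```

For example, two proteins with no other partners get the maximal S1 of 0, and S1 drops as either protein accumulates non-shared partners. Similarly, a pair that always co-purifies (`pulldown_score(3, 3, 3, 100)`) scores well above a pair of promiscuous proteins that co-occur once (`pulldown_score(1, 50, 50, 100)`).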
Calibration of quality scores and combination of evidence  Different pieces of evidence are not directly comparable A different raw quality score is used for each evidence type  Quality differences exist among data sets of the same type Solved by calibrating all scores against a common reference The accuracy relative to a “gold standard” is calculated within score intervals The resulting points are approximated by a sigmoid
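The binning step of the calibration can be sketched as follows. The input format, the bin width and the sigmoid parameterization are assumptions made for illustration; the fitting of the sigmoid parameters to the binned points is left out.

```python
from math import exp

def accuracy_by_bin(scored_pairs, bin_width):
    """Accuracy relative to a gold standard within raw-score intervals.

    scored_pairs: iterable of (raw_score, in_gold_standard) tuples.
    Returns a list of (bin_center, accuracy) points, sorted by score.
    """
    bins = {}
    for score, is_true in scored_pairs:
        key = int(score // bin_width)
        hits, total = bins.get(key, (0, 0))
        bins[key] = (hits + int(is_true), total + 1)
    return [((key + 0.5) * bin_width, hits / total)
            for key, (hits, total) in sorted(bins.items())]

def sigmoid(x, low, high, midpoint, slope):
    """Four-parameter sigmoid used to approximate the binned points."""
    return low + (high - low) / (1.0 + exp(-slope * (x - midpoint)))

# Hypothetical benchmark: raw scores and whether each pair is in the gold standard.
benchmark = [(0.1, False), (0.2, False), (0.3, True),
             (1.1, True), (1.4, True), (1.9, False),
             (2.5, True), (2.6, True)]
points = accuracy_by_bin(benchmark, bin_width=1.0)
```

Once a sigmoid has been fitted to such points for each evidence type, any raw score can be mapped onto the same scale, which is what makes the different kinds of evidence comparable.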
Benchmarks for protein interaction sets To benchmark interaction sets, one needs a reference set Several options exist Directly compare with a curated set of protein complexes from e.g. MIPS Check consistency with metabolic pathways from e.g. KEGG Check consistency with GO biological process or cellular component categories Look for co-expression of genes within complexes © Current Opinion in Structural Biology, 2004
Benchmark of published interaction sets against the MIPS curated yeast complexes Data sets were filtered to remove the most obvious biases by removing ribosomal proteins and interactions obtained from MIPS High specificity is often obtained at the price of low coverage
Filtering by subcellular localization Proteins cannot interact if they are not in the same place Large-scale subcellular localization screens have been made in yeast A matrix can be constructed that describes the compartments between which interactions should be allowed Two proteins cannot interact if no combination of their observed subcellular compartments allows for interaction
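A minimal sketch of such a filter is shown below. The compartment compatibility set stands in for the matrix and its contents are hypothetical. Keeping pairs for which one protein has no observed localization is a design assumption, not something stated on the slide.

```python
# Hypothetical compatibility "matrix", stored as unordered compartment pairs.
ALLOWED = {
    frozenset({"nucleus"}),
    frozenset({"cytoplasm"}),
    frozenset({"mitochondrion"}),
    frozenset({"nucleus", "cytoplasm"}),
    frozenset({"cytoplasm", "mitochondrion"}),
}

def may_interact(locs_a, locs_b, allowed=ALLOWED):
    """True if some combination of observed compartments allows interaction."""
    if not locs_a or not locs_b:
        return True  # assumption: no localization data means no filtering
    return any(frozenset({a, b}) in allowed for a in locs_a for b in locs_b)

def filter_interactions(interactions, localization, allowed=ALLOWED):
    """Drop interactions between proteins with incompatible localization."""
    return [(a, b) for a, b in interactions
            if may_interact(localization.get(a, set()),
                            localization.get(b, set()), allowed)]

localization = {
    "P1": {"nucleus"},
    "P2": {"mitochondrion"},
    "P3": {"cytoplasm", "nucleus"},
}
kept = filter_interactions(
    [("P1", "P2"), ("P1", "P3"), ("P2", "P3"), ("P1", "P4")], localization)
```

Here the nuclear protein P1 and the mitochondrial protein P2 are filtered out as a pair, while P3, which is observed in two compartments, remains compatible with both.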
Restricting the network to a “system” Why do large-scale interaction data have high error rates? In a systematic screen we test the hypothesis that any protein interacts with any other protein in the cell The vast majority of these possible interactions do not take place By subsequently limiting the “interaction search space” to only the system of interest, the error rate can be reduced to that of small-scale experiments! A simple strategy for making a network of a “system” Define an initial parts list of proteins that should be in the system Use “high confidence” interactions to pull in additional proteins Show all “medium confidence” interactions within the system
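The three-step strategy can be sketched as below. The confidence thresholds, the single round of expansion from the initial parts list, and the example scores are assumptions; the slide gives no numbers.

```python
def build_system_network(parts_list, interactions, high=0.9, medium=0.5):
    """Return (members, edges) for a network restricted to one system.

    interactions: iterable of (protein_a, protein_b, confidence).
    The thresholds are hypothetical placeholders.
    """
    core = set(parts_list)
    members = set(core)
    # Step 2: high-confidence interactions pull in partners of core proteins.
    for a, b, score in interactions:
        if score >= high:
            if a in core:
                members.add(b)
            if b in core:
                members.add(a)
    # Step 3: show all interactions of at least medium confidence in the system.
    edges = [(a, b, s) for a, b, s in interactions
             if s >= medium and a in members and b in members]
    return members, edges

# Made-up confidence scores for illustration only.
interactions = [
    ("CDC28", "CLN2", 0.95),
    ("CLN2", "FAR1", 0.60),
    ("CDC28", "ACT1", 0.30),
    ("FAR1", "STE5", 0.92),
    ("CDC28", "SIC1", 0.91),
    ("SIC1", "CLN2", 0.55),
]
members, edges = build_system_network(["CDC28"], interactions)
```

The medium-confidence link between SIC1 and CLN2 is shown because both proteins are already in the system, whereas the equally confident CLN2 and FAR1 link is dropped because FAR1 was never pulled in.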
Can the type of interaction be predicted by combining different evidence types? Different types of experimental evidence tell us something different Correct Y2H interactions that are missed by complex purification methods generally correspond to transient interactions
Conclusions When dealing with high-throughput experimental data, it is crucial to do proper benchmarking Globally, the error rates are generally very high A very large part of the errors can be filtered away by computational methods, allowing high confidence data sets to be constructed
Questions?
Part 3 Prediction of protein features and function Lars Juhl Jensen EMBL Heidelberg
Proteins – more than just globular domains Eukaryotic linear motifs (ELMs) Ligand peptides Modification sites Targeting signals Disordered regions Transmembrane helices Toby Gibson, EMBL Heidelberg Insulin Receptor Substrate 1
Most ELMs are “information poor” Weak/short consensus sequences for ELMs The typical ELM only has three conserved residues Some variance is often allowed even for these ELMs are very hard to predict from sequence Consensus sequences simply match everywhere The information is not in the local sequence Most ELMs can only be predicted using context Toby Gibson, EMBL Heidelberg
Example ELM consensus patterns:
L.C.E                 RB interaction
[RK].{0,1}V.F         PP1 interaction
R.L.{0,1}[FLIMVP]     Cyclin binding motif
SP.[KR]               CDK phosphorylation
L..LL                 NR Box
P.L.P                 MYND finger interaction
F...W..[LIV]          MDM2-binding
RGD                   Integrin-binding
SKL$                  Peroxisome targeting
[RK][RK].[ST]         PKA phosphorylation
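Because the consensus patterns are ordinary regular expressions, scanning a sequence for them takes only a few lines, which also shows why they "simply match everywhere". The sketch below uses a subset of the patterns from the slide; the test sequence is made up and the function name is hypothetical.

```python
import re

# A subset of the consensus patterns listed on the slide.
ELM_PATTERNS = {
    "RB interaction": r"L.C.E",
    "CDK phosphorylation": r"SP.[KR]",
    "NR box": r"L..LL",
    "Integrin-binding": r"RGD",
    "Peroxisome targeting": r"SKL$",
    "PKA phosphorylation": r"[RK][RK].[ST]",
}

def scan_for_elms(sequence, patterns=ELM_PATTERNS):
    """Return (motif name, 1-based start, matched text) for every match."""
    hits = []
    for name, pattern in patterns.items():
        # A lookahead is used so that overlapping matches are all reported.
        for match in re.finditer(f"(?=({pattern}))", sequence):
            hits.append((name, match.start() + 1, match.group(1)))
    return hits

# Made-up 20-residue sequence for illustration.
hits = scan_for_elms("MRGDLKCPESPAKRRASSKL")
```

Even this short invented sequence matches five different motifs, none of which need be functional, illustrating that context rather than local sequence carries the information.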
Prediction of protein disorder/globularity Using known domains SMART Pfam Interpro Ab initio from sequence GlobPlot DisEMBL PONDR Toby Gibson, EMBL Heidelberg Known Domains Order Preference Disorder Preference
Prediction of signal peptides from sequence Signal peptides play different roles They mediate transport of proteins  to the ER in eukaryotes They target proteins for secretion in prokaryotes The architecture of signal peptides Positively charged N-terminus Hydrophobic core Short, more polar region Cleavage site with small amino acids at positions -3 and -1 Signal peptides can be accurately predicted by several methods Henrik Nielsen, CBS, DTU Lyngby
Function prediction from post-translational modifications Proteins with similar function may not be related in sequence Still they must perform their function in the context of the same cellular machinery Similarities in features such as PTMs and physical/chemical properties could be expected for proteins with similar function Henrik Nielsen, CBS, DTU Lyngby
The concept of ProtFun Predict as many biologically relevant features as we can from the sequence Train artificial neural networks for each category, also optimizing the feature combinations Assign a probability for each category from the NN outputs © Journal of Molecular Biology, 2002
Training of neural networks Human protein sequences from SWISS-PROT were assigned to functional classes based on their keywords by using the EUCLID dictionary The set of sequences was divided into a test and a training set with no significant sequence similarity between the two sets Neural networks were first trained for single features and subsequently for combinations of the best performing features
Prediction performance on cellular role categories © Journal of Molecular Biology, 2002
© Journal of Molecular Biology, 2002
An example – 1AOZ vs. 1PLC
Scoring matrix: BLOSUM50, gap penalties: -12/-2; 15.5% identity; global alignment score: -23
1AOZ  SQIRHYKWEVEYMFWAPNCNENIVMGINGQFPGPTIRANAGDSVVVELTNKLHTEGVVIH
1PLC  ---------IDVLLGA---DDGSLAFVPSEFS-----ISPGEKIVFK-NNAGFPHNIVFD
1AOZ  WHGILQRGTPWADGTASISQCAINPGETFFYNFTVDNPGTFFYHGHLGMQRSAGLYGSLI
1PLC  EDSI-PSGVDASKISMSEEDLLNAKGETFEVALSNKGEYSFYCSPHQG----AGMVGKVT
1AOZ  VDPPQGKKE
1PLC  VN-------
An enzyme and a non-enzyme from the Cupredoxin superfamily
Similar structure, different functions Many examples exist of structurally similar proteins which have different functions Two PDB structures from the Cupredoxin superfamily were shown 1AOZ is an enzyme 1PLC is not an enzyme Despite their structural similarity, our method predicts both correctly
# Functional category              1AOZ    1PLC
  Amino_acid_biosynthesis          0.126   0.070
  Biosynthesis_of_cofactors        0.100   0.075
  Cell_envelope                    0.429   0.032
  Cellular_processes               0.057   0.059
  Central_intermediary_metabolism  0.063   0.041
  Energy_metabolism                0.126   0.268
  Fatty_acid_metabolism            0.027   0.072
  Purines_and_pyrimidines          0.439   0.088
  Regulatory_functions             0.102   0.019
  Replication_and_transcription    0.052   0.089
  Translation                      0.079   0.150
  Transport_and_binding            0.032   0.052
# Enzyme/nonenzyme                 1AOZ    1PLC
  Enzyme                           0.773   0.310
  Nonenzyme                        0.227   0.690
# Enzyme class                     1AOZ    1PLC
  Oxidoreductase (EC 1.-.-.-)      0.077   0.077
  Transferase    (EC 2.-.-.-)      0.260   0.099
  Hydrolase      (EC 3.-.-.-)      0.114   0.071
  Lyase          (EC 4.-.-.-)      0.025   0.020
  Isomerase      (EC 5.-.-.-)      0.010   0.068
  Ligase         (EC 6.-.-.-)      0.017   0.017
Conclusions Short linear motifs are likely as important for protein function as the large, well-studied domains These features are generally very hard to predict from sequence; however, some can be predicted Many functional classes of proteins can be predicted from sequence alone by non-homology-based methods
Questions?
Part 4 Qualitative modeling of the yeast cell cycle Lars Juhl Jensen EMBL Heidelberg
Qualitative versus quantitative modeling Our aim: a qualitative model of the yeast cell cycle that is accurate even at the level of individual interactions and provides a global overview of temporal complex formation © Chen et al., Mol. Biol. Cell, 2004 Ulrik de Lichtenberg, CBS, DTU Lyngby
Model generation through data integration Model Generation A Parts List Literature Microarray data Dynamic data Microarray data Proteomics data PPI data TF-target data Connections YER001W YBR088C YOL007C YPL127C YNR009W YDR224C YDL003W YBL003C YDR225W YBR010W YKR013W … YDR097C YBR089W YBR054W YMR215W YBR071W YBL002W YGR189C YNL031C YNL030W YNL283C YGR152C … Ulrik de Lichtenberg, CBS, DTU Lyngby
Getting the parts list yeast culture Microarrays Gene expression Expression profile Ulrik de Lichtenberg, CBS, DTU Lyngby Cho  et al. &  Spellman  et al. 600 periodically expressed genes (with associated peak times) that encode “dynamic proteins” The Parts list New Analysis
The temporal interaction network Observation:  For two thirds of the dynamic proteins, no interactions were found Why?   Some may be missed components of the complexes and modules already in the network Some may not participate in protein-protein interactions But, the majority probably participate in  transient  interactions that are not so well captured by current interaction assays Ulrik de Lichtenberg, CBS, DTU Lyngby © Science, 2005
Interactions are close in time Observation: Interacting dynamic proteins are typically expressed close in time Ulrik de Lichtenberg, CBS, DTU Lyngby © Science, 2005
Static proteins play a major role Observation:  Static ( scaffold ) proteins comprise about a third of the network and participate in interactions throughout the entire cycle Ulrik de Lichtenberg, CBS, DTU Lyngby © Science, 2005
Just-in-time synthesis? yes and no! Observation:  The dynamic proteins are generally expressed just before they are needed to carry out their function, generally referred to as  just-in-time synthesis But, the general design principle seems to be that only some key components of each module/complex are dynamic This suggests a mechanism of  just-in-time assembly  or  partial just-in-time synthesis Ulrik de Lichtenberg, CBS, DTU Lyngby © Science, 2005
Network as a discovery tool Observation: The network places 30+ uncharacterized proteins in a temporal interaction context. The network thus generates detailed hypotheses about their function. Observation: The network contains entire novel modules and complexes. Ulrik de Lichtenberg, CBS, DTU Lyngby © Science, 2005
Network Hubs: “Party” versus “Date” “Date” Hub: the hub protein interacts with different proteins at different times. “Party” Hub: the hub protein and its interactors are expressed close in time. Ulrik de Lichtenberg, CBS, DTU Lyngby © Science, 2005
Transcription is linked to phosphorylation Observation: 332 putative targets of the cyclin-dependent kinase Cdc28 have been determined experimentally (Übersax et al.). We find that: 6% of all yeast proteins are putative Cdk targets 8% of the static proteins (white) are putative Cdk targets 27% of the dynamic proteins (colored) are putative Cdk targets Conclusion: this reveals a hitherto undescribed link between the levels of transcriptional and post-translational control of the cell cycle Ulrik de Lichtenberg, CBS, DTU Lyngby © Science, 2005
A neural network strategy for prediction of cell cycle related proteins Ulrik de Lichtenberg, CBS, DTU Lyngby
Prediction of cell cycle related proteins from sequence derived features Ulrik de Lichtenberg, CBS, DTU Lyngby © Journal of Molecular Biology, 2003
Evaluating the performance Ulrik de Lichtenberg, CBS, DTU Lyngby
Ulrik de Lichtenberg, CBS, DTU Lyngby
The yeast cell cycle in feature space © Journal of Molecular Biology, 2003 Ulrik de Lichtenberg, CBS, DTU Lyngby
S phase feature snapshot S phase 40% into the cell cycle we see High isoelectric point Many nuclear proteins Short proteins Low N-glycosylation potential Low potential for Ser/Thr-phosphorylation Few PEST regions Low aliphatic index Ulrik de Lichtenberg, CBS, DTU Lyngby © Journal of Molecular Biology, 2003
G1/S phase feature snapshot G1/S transition 25% into the cell cycle we see Low isoelectric point Many extracellular proteins Many PEST regions Very high Tyr-phosphorylation potential Higher glycosylation potential Higher potential for Ser/Thr-phosphorylation Ulrik de Lichtenberg, CBS, DTU Lyngby © Journal of Molecular Biology, 2003
Conclusions Accurate models can be constructed by careful integration of several types of high-throughput experimental data  We have constructed a model of the yeast cell cycle that reveals global trends that were not previously known The same strategies are applicable to other systems The integrative approach is applicable to any process for which both interaction data and time series are available Most broad classes of proteins can be predicted using neural networks with sequence derived features as input.
Questions?
Summary The many types of high-throughput data should be integrated Better standardization and quality control is crucial Scoring schemes and filtering schemes can reduce the error rate of high-throughput data drastically Integration of many evidence types allows high-confidence predictions of functional relationships New biological discoveries can be made through data integration There is more to proteins than just globular domains Proteins contain many short linear motifs (ELMs) Most of these are very difficult to predict from sequence Sequence-derived features can give hints about protein function
Acknowledgments STRING and ArrayProspector Peer Bork Christian von Mering Jan Korbel Berend Snel Martijn Huynen Daniel Jaeggi Steffen Schmidt Sean Hooper Mathilde Foglierini Julien Lagarde Chris Workman ELMs – linear motifs Rune Linding Toby Gibson Rob Russell Protein feature/function prediction Søren Brunak Alfonso Valencia Ramneek Gupta Can Kesmir Kristoffer Rapacki Hans-Henrik Stærfeldt Henrik Nielsen Nikolaj Blom Claus A.F. Andersen Anders Krogh Steen Knudsen Chris Workman Damien Devos Javier Tamames Analysis of the yeast cell cycle Ulrik de Lichtenberg Thomas Skøt Anders Fausbøll Søren Brunak
Thank you!

Katrina Stoneking: Shaking Up the Alcohol Beverage Industry
Data mining for business intelligence ch04 sharda
Dragon_Fruit_Cultivation_in Nepal ppt.pptx
kom-180-proposal-for-a-directive-amending-directive-2014-45-eu-and-directive-...
Training And Development of Employee .pdf
Roadmap Map-digital Banking feature MB,IB,AB
SIMNET Inc – 2023’s Most Trusted IT Services & Solution Provider
Solara Labs: Empowering Health through Innovative Nutraceutical Solutions
Business model innovation report 2022.pdf
job Avenue by vinith.pptxvnbvnvnvbnvbnbmnbmbh
How to Get Business Funding for Small Business Fast
Ôn tập tiếng anh trong kinh doanh nâng cao

Proteomics - Analysis and integration of large-scale data sets

  • 1. Proteomics Analysis and integration of large-scale data sets Lars Juhl Jensen EMBL Heidelberg
  • 2. Overview Methods for predicting protein-protein interactions Cross-species inference Genomic context methods Prediction from expression data Automated extraction from text Quality control of high-throughput interaction data Types of data sets available Network representations of interaction data sets Topology-based quality scores Benchmarking of data sets Filtering strategies Prediction of protein features and function Linear motifs in proteins Relation to interaction networks Motif prediction from sequence From features to function Qualitative modeling of the yeast cell cycle Modeling the cell cycle through large-scale data integration What does the model tell us? A neural network approach to predicting cell cycle proteins The cell cycle in feature space
  • 3. Part 1 Methods for predicting protein-protein interactions Lars Juhl Jensen EMBL Heidelberg
  • 5. Cross-species integration of diverse data Challenges and promises of large-scale data integration Explosive increase in both the amounts and different types of high-throughput data sets that are being produced These data are highly heterogeneous and lack standardization Most data sets are error-prone and suffer from systematic biases Experiments should be integrated across model organisms STRING is a web resource that integrates and transfers diverse large-scale data across 100+ species, but it is not a primary repository for experimental data, a curated database of complexes or pathways, or a substitute for expert annotation
  • 6. What is STRING? Genomic neighborhood Species co-occurrence Gene fusions Database imports Exp. interaction data Microarray expression data Literature co-mentioning
  • 7. Genomic context methods © Nature Biotechnology, 2004
  • 8. Inferring functional modules from gene presence/absence patterns Trends in Microbiology
  • 11. Inferring functional modules from gene presence/absence patterns Trends in Microbiology [Figure: The “Cellulosome”. Labels: cell, cell wall, anchoring proteins, cellulosomes, cellulose, resting protuberances, protracted protuberance. © Trends Microbiol, 1999]
  • 12. Formalizing the phylogenetic profile method Align all proteins against all Calculate best-hit profile Join similar species by PCA Calculate PC profile distances Calibrate against KEGG maps
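The profile steps on this slide can be sketched roughly as follows. This is a minimal illustration only: it assumes a toy binary presence/absence matrix in place of real best-hit profiles, uses plain SVD-based PCA to join correlated species, and measures Euclidean distance in the reduced space. The function name `profile_distances` and the example genes are hypothetical, and the calibration against KEGG maps is not shown.

```python
import numpy as np

def profile_distances(profiles: np.ndarray, n_components: int = 2) -> np.ndarray:
    """Pairwise Euclidean distances between genes in a PCA-reduced species space."""
    centered = profiles - profiles.mean(axis=0)
    # SVD of the centered gene-by-species matrix yields the principal components.
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    reduced = u[:, :n_components] * s[:n_components]  # gene coordinates in PC space
    diff = reduced[:, None, :] - reduced[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Toy matrix (assumption): rows are genes, columns are species, 1 = homolog present.
profiles = np.array([
    [1, 1, 0, 0, 1, 0],  # geneA
    [1, 1, 0, 0, 1, 0],  # geneB, same pattern as geneA
    [0, 0, 1, 1, 0, 1],  # geneC, complementary pattern
    [1, 0, 1, 0, 1, 1],  # geneD
], dtype=float)

dist = profile_distances(profiles)
```

Genes with identical profiles (geneA and geneB) end up at distance zero, which is the signal the method interprets as a candidate functional association.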
  • 13. Inferring functional associations from evolutionarily conserved operons Identify runs of adjacent genes with the same direction Score each gene pair based on intergenic distances Calibrate against KEGG maps Infer associations in other species
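The first step on this slide, finding runs of adjacent genes with the same direction, might look like the sketch below. It is an assumption-laden simplification: the slide scores gene pairs by intergenic distance, whereas this example applies a hard cutoff (`max_gap`, a hypothetical parameter) just to keep the illustration short. The gene tuples are invented.

```python
from typing import List, Tuple

Gene = Tuple[str, int, int, str]  # (name, start, end, strand)

def same_strand_runs(genes: List[Gene], max_gap: int = 300) -> List[List[str]]:
    """Group genes into runs that share a strand and lie within max_gap of each other."""
    runs: List[List[str]] = []
    current: List[Gene] = []
    for gene in sorted(genes, key=lambda g: g[1]):
        close = current and gene[1] - current[-1][2] <= max_gap
        if close and gene[3] == current[-1][3]:
            current.append(gene)
        else:
            if len(current) > 1:  # only runs of two or more genes are of interest
                runs.append([g[0] for g in current])
            current = [gene]
    if len(current) > 1:
        runs.append([g[0] for g in current])
    return runs

# Invented coordinates for illustration.
genes = [
    ("a", 100, 900, "+"),
    ("b", 950, 1800, "+"),
    ("c", 1850, 2500, "+"),
    ("d", 2600, 3300, "-"),
    ("e", 5000, 5800, "-"),
]
runs = same_strand_runs(genes)
```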
  • 14. Predicting functional and physical interactions from gene fusion/fission events Find genes in A that match the same gene in B Exclude overlapping alignments Calibrate against KEGG maps Calculate all-against-all pairwise alignments
  • 15. Integrating physical interaction screens Make binary representation of complexes Yeast two-hybrid data sets are inherently binary Calculate score from number of (co-)occurrences Calculate score from non-shared partners Calibrate against KEGG maps Infer associations in other species Combine evidence from experiments
  • 16. Mining microarray expression databases Re-normalize arrays by modern method to remove biases Build expression matrix Combine similar arrays by PCA Construct predictor by Gaussian kernel density estimation Calibrate against KEGG maps Infer associations in other species
  • 17. The Qspline method for non-linear intensity normalization of expression data From the empirical distribution, a number of quantiles are calculated for each of the channels to be normalized (one channel shown in red) and for the reference distribution (shown in black) A QQ-plot is made and a normalization curve is constructed by fitting a cubic spline function As reference one can either use an artificial “median array” for a set of arrays or use a log-normal distribution, which is a good approximation
  • 18. Non-linear normalization of intensities and correction for spatial effects Downloaded SMD data After intensity normalization Spatial bias estimate After spatial normalization
  • 19. Co-mentioning in the scientific literature Associate abstracts with species Identify gene names in title/abstract Count (co-)occurrences of genes Test significance of associations Calibrate against KEGG maps Infer associations in other species
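The counting and significance-testing steps can be illustrated with a small sketch. The slide does not say which test is used, so a one-sided hypergeometric test is assumed here; `cooccurrence_pvalue` and its arguments are hypothetical names.

```python
from math import comb

def cooccurrence_pvalue(n_total: int, n_a: int, n_b: int, n_ab: int) -> float:
    """P(X >= n_ab) if n_a abstracts mention gene A and n_b of n_total mention gene B at random."""
    upper = min(n_a, n_b)
    tail = sum(
        comb(n_a, k) * comb(n_total - n_a, n_b - k)
        for k in range(n_ab, upper + 1)
    )
    return tail / comb(n_total, n_b)

# Invented counts: 10 abstracts, each gene mentioned in 3, all 3 shared.
p_value = cooccurrence_pvalue(n_total=10, n_a=3, n_b=3, n_ab=3)
```

A small p-value means the two gene names are mentioned together more often than expected from their individual frequencies.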
  • 20. Evidence transfer based on “fuzzy orthology” Orthology transfer is tricky Correct assignment of orthology is difficult for distant species Functional equivalence cannot be guaranteed for in-paralogs These problems are addressed by our “fuzzy orthology” scheme Confidence scores for functional equivalence are calculated from all-against-all alignment Evidence is distributed across possible pairs according to confidence scores in the case of many-to-many relationships ? Source species Target species
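One way to picture the distribution of evidence across many-to-many relationships is the sketch below. The exact weighting used by the "fuzzy orthology" scheme is not given on the slide, so this example simply assumes that confidence scores are normalized per source protein and multiplied. All names and numbers are hypothetical.

```python
from typing import Dict, Tuple

def transfer_evidence(
    score: float,
    orthologs_a: Dict[str, float],
    orthologs_b: Dict[str, float],
) -> Dict[Tuple[str, str], float]:
    """Spread one source-species score over all candidate target-species pairs."""
    total_a = sum(orthologs_a.values())
    total_b = sum(orthologs_b.values())
    return {
        (ta, tb): score * (ca / total_a) * (cb / total_b)
        for ta, ca in orthologs_a.items()
        for tb, cb in orthologs_b.items()
    }

# Source protein A has two candidate orthologs, source protein B has one.
transferred = transfer_evidence(0.8, {"A1": 0.9, "A2": 0.3}, {"B1": 1.0})
```

Under this assumed weighting the transferred scores sum to the original score, so duplicated genes do not inflate the total evidence.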
  • 21. The power of cross-species transfer and evidence integration
  • 27. Conclusions Many types of data can be used for interaction prediction To make the best of these data they must each be benchmarked and integrated across species The STRING web resource does just this
  • 29. Part 2 Quality control of high-throughput interaction data Lars Juhl Jensen EMBL Heidelberg
  • 31. Protein interaction data sets Many high-throughput data sets have been published in the past 5 years S. cerevisiae is by far the best covered organism Recently, large data sets were made for two metazoans Two fundamentally different techniques have been used Affinity purification/MS The yeast two-hybrid assay Interaction databases IntAct, BIND, DIP, MINT Species-specific databases © Current Opinion in Structural Biology, 2004
  • 32. The topology of protein interaction networks A multitude of publications exist on protein network topology Global measures of topology Degree distribution Mean path length Clustering coefficient Theoretical models of networks Random Scale-free Hierarchical Local topology, network motifs
  • 33. What is an interaction? Physical protein interactions Proteins that physically touch each other within a complex Members of the same stable complex Transient interactions, e.g. a protein kinase with its substrate More broadly defined “functional interactions” Direct neighbors in metabolic networks Members of the same pathway The pragmatic definition – whatever the assays detect Affinity purification tends to find members of stable complexes Yeast two-hybrid assays also detect more transient interactions
  • 34. Binary representations of purification data © Drug Discovery Today: TARGETS, 2004
  • 35. Topology based quality scores Scoring scheme for yeast two-hybrid data: $S_1 = -\log((N_1 + 1) \cdot (N_2 + 1))$, where $N_1$ and $N_2$ are the numbers of non-shared interaction partners Similar scoring schemes have been published by Saito et al. Scoring scheme for complex pull-down data: $S_2 = \log[(N_{12} \cdot N)/((N_1 + 1) \cdot (N_2 + 1))]$, where $N_{12}$ is the number of purifications containing both proteins, $N_1$ is the number containing protein 1, $N_2$ is defined similarly, and $N$ is the total number of purifications Both schemes aim at identifying ubiquitous interactors
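The two scores on this slide translate directly into code. The sketch below assumes the natural logarithm (the slide does not state the base) and uses hypothetical function names and invented counts.

```python
from math import log

def y2h_score(n1: int, n2: int) -> float:
    """S1 = -log((N1 + 1) * (N2 + 1)); N1, N2 are counts of non-shared partners."""
    return -log((n1 + 1) * (n2 + 1))

def pulldown_score(n12: int, n1: int, n2: int, n: int) -> float:
    """S2 = log((N12 * N) / ((N1 + 1) * (N2 + 1))); requires N12 > 0."""
    return log((n12 * n) / ((n1 + 1) * (n2 + 1)))

# A pair with no other partners versus a pair involving a promiscuous protein.
specific = y2h_score(0, 0)
promiscuous = y2h_score(50, 2)

# Two proteins seen together in 5 of 100 purifications.
together = pulldown_score(n12=5, n1=9, n2=4, n=100)
```

Both scores drop as the proteins accumulate other partners, which is how ubiquitous interactors end up with low confidence.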
  • 36. Calibration of quality scores and combination of evidence Different pieces of evidence are not directly comparable A different raw quality score is used for each evidence type Quality differences exist among data sets of the same type Solved by calibrating all scores against a common reference The accuracy relative to a “gold standard” is calculated within score intervals The resulting points are approximated by a sigmoid
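The calibration idea can be sketched as follows: compute the accuracy against a gold standard within score intervals, then approximate the points by a sigmoid. This example assumes equal-sized bins and fits the sigmoid with a straight line in logit space, which is a shortcut chosen for brevity rather than the fitting procedure used for STRING. All function names and the toy data are hypothetical.

```python
import numpy as np

def binned_accuracy(scores, is_true, n_bins=4):
    """Mean raw score and fraction of gold-standard hits within each score interval."""
    order = np.argsort(scores)
    score_bins = np.array_split(np.asarray(scores, dtype=float)[order], n_bins)
    label_bins = np.array_split(np.asarray(is_true, dtype=float)[order], n_bins)
    centers = np.array([b.mean() for b in score_bins])
    accuracy = np.array([b.mean() for b in label_bins])
    return centers, accuracy

def fit_sigmoid(centers, accuracy, eps=1e-3):
    """Fit accuracy ~ 1 / (1 + exp(-(a * score + b))) via a line in logit space."""
    clipped = np.clip(accuracy, eps, 1 - eps)  # avoid log(0) for pure bins
    logit = np.log(clipped / (1 - clipped))
    a, b = np.polyfit(centers, logit, 1)
    return a, b

def calibrated(score, a, b):
    """Map a raw score onto the fitted probability scale."""
    return 1.0 / (1.0 + np.exp(-(a * score + b)))

# Toy data: raw scores and whether each pair is in the gold standard.
raw_scores = [0, 1, 2, 3, 4, 5, 6, 7]
in_gold = [0, 0, 0, 1, 0, 1, 1, 1]
centers, accuracy = binned_accuracy(raw_scores, in_gold)
a, b = fit_sigmoid(centers, accuracy)
```

Once every evidence type is mapped onto the same probability scale, scores from different data sets become directly comparable.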
  • 37. Benchmarks for protein interaction sets To benchmark interaction sets, one needs a reference set Several options exist Directly compare with a curated set of protein complexes from e.g. MIPS Check consistency with metabolic pathways from e.g. KEGG Check consistency with GO biological process or cellular component categories Look for co-expression of genes within complexes © Current Opinion in Structural Biology, 2004
  • 38. Benchmark of published interaction sets against the MIPS curated yeast complexes Data sets were filtered to remove the most obvious biases by removing ribosomal proteins and interactions obtained from MIPS High specificity is often obtained at the price of low coverage
  • 39. Filtering by subcellular localization Proteins cannot interact if they are not in the same place Large-scale subcellular localization screens have been made in yeast A matrix can be constructed that describes the compartments between which interactions should be allowed Two proteins cannot interact if no combination of their observed subcellular compartments allows for interaction
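The localization filter amounts to a lookup in a compartment-by-compartment matrix. The sketch below stores the allowed combinations as a set of unordered pairs; the compartments and the choice of which pairs are allowed are invented for illustration and are not taken from the slide.

```python
from itertools import product

# Hypothetical "matrix" of compartment combinations between which interactions are allowed.
ALLOWED = {
    frozenset(["nucleus"]),
    frozenset(["cytoplasm"]),
    frozenset(["mitochondrion"]),
    frozenset(["nucleus", "cytoplasm"]),
}

def can_interact(loc_a, loc_b):
    """True if any combination of the observed compartments is allowed."""
    return any(frozenset((a, b)) in ALLOWED for a, b in product(loc_a, loc_b))

# Proteins may be observed in more than one compartment.
shuttling = can_interact({"nucleus", "cytoplasm"}, {"cytoplasm"})
separated = can_interact({"mitochondrion"}, {"nucleus"})
```

Interactions for which `can_interact` returns `False` would be removed from the data set.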
  • 40. Restricting the network to a “system” Why do large-scale interaction data have high error rates? In a systematic screen we test the hypothesis that any protein interacts with any other protein in the cell The vast majority of these possible interactions do not take place By subsequently limiting the “interaction search space” to only the system of interest, the error rate can be reduced to that of small-scale experiments! A simple strategy for making a network of a “system” Define an initial parts list of proteins that should be in the system Use “high confidence” interactions to pull in additional proteins Show all “medium confidence” interactions within the system
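The three-step strategy at the end of this slide can be sketched as below. The confidence cutoffs (`high`, `medium`) and the toy interactions are assumptions made for the example; the slide does not give numeric thresholds.

```python
def system_network(parts, interactions, high=0.9, medium=0.5):
    """Build a network of a 'system' from a parts list and scored interactions."""
    seed = set(parts)          # step 1: the initial parts list
    members = set(seed)
    # Step 2: pull in direct partners of the parts list via high-confidence interactions.
    for a, b, score in interactions:
        if score >= high:
            if a in seed:
                members.add(b)
            if b in seed:
                members.add(a)
    # Step 3: show all medium-confidence (or better) interactions within the system.
    edges = [
        (a, b, s) for a, b, s in interactions
        if s >= medium and a in members and b in members
    ]
    return members, edges

# Invented scored interactions.
interactions = [
    ("A", "B", 0.95),
    ("B", "C", 0.60),
    ("C", "D", 0.95),
    ("A", "E", 0.40),
    ("E", "F", 0.70),
]
members, edges = system_network({"A", "C"}, interactions)
```

The E–F interaction is ignored even though its score is above the medium cutoff, because neither protein belongs to the system; this is the restriction of the search space that the slide describes.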
  • 41. Can the type of interaction be predicted by combining different evidence types? Different types of experimental evidence tell us something different Correct Y2H interactions that are missed by complex purification methods generally correspond to transient interactions
  • 42. Conclusions When dealing with high-throughput experimental data, it is crucial to do proper benchmarking Globally, the error rates are generally very high A very large part of the errors can be filtered away by computational methods, allowing high confidence data sets to be constructed
  • 44. Part 3 Prediction of protein features and function Lars Juhl Jensen EMBL Heidelberg
  • 46. Proteins – more than just globular domains Eukaryotic linear motifs (ELMs) Ligand peptides Modification sites Targeting signals Disordered regions Transmembrane helices Toby Gibson, EMBL Heidelberg Insulin Receptor Substrate 1
  • 47. Most ELMs are “information poor” Weak/short consensus sequences for ELMs The typical ELM only has three conserved residues Some variance is often allowed even for these ELMs are very hard to predict from sequence Consensus sequences simply match everywhere The information is not in the local sequence Most ELMs can only be predicted using context Toby Gibson, EMBL Heidelberg L . C . E RB interaction [RK] .{0,1} V . F PP1 interaction R . L .{0,1} [FLIMVP] Cyclin binding motif SP . [KR] CDK phosphorylation L . . LL NR Box P . L . P MYND finger interaction F . . . W . . [LIV] MDM2-binding RGD Integrin-binding SKL$ Peroxisome targeting [RK][RK] . [ST] PKA phosphorylation
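The consensus patterns listed on this slide are already valid regular expressions once the spaces are removed, which makes it easy to see why they "match everywhere". The sketch below counts matches in an arbitrary sequence and in a random one; the random sequence, its length, and the function name `count_matches` are assumptions for illustration.

```python
import random
import re

# Patterns transcribed from the slide, with spaces removed.
ELM_PATTERNS = {
    "RB interaction": r"L.C.E",
    "PP1 interaction": r"[RK].{0,1}V.F",
    "Cyclin binding motif": r"R.L.{0,1}[FLIMVP]",
    "CDK phosphorylation": r"SP.[KR]",
    "NR Box": r"L..LL",
    "MYND finger interaction": r"P.L.P",
    "MDM2-binding": r"F...W..[LIV]",
    "Integrin-binding": r"RGD",
    "Peroxisome targeting": r"SKL$",
    "PKA phosphorylation": r"[RK][RK].[ST]",
}

def count_matches(sequence: str) -> dict:
    """Number of non-overlapping matches of each consensus pattern in a sequence."""
    return {name: len(re.findall(pat, sequence)) for name, pat in ELM_PATTERNS.items()}

# A random "protein" with uniform residue frequencies (a simplifying assumption).
rng = random.Random(0)
random_sequence = "".join(rng.choices("ACDEFGHIKLMNPQRSTVWY", k=100_000))
random_counts = count_matches(random_sequence)
```

Any hits in `random_counts` occur purely by chance, which illustrates why context is needed to separate functional motifs from spurious matches.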
  • 48. Prediction of protein disorder/globularity Using known domains SMART Pfam Interpro Ab initio from sequence GlobPlot DisEMBL PONDR Toby Gibson, EMBL Heidelberg Known Domains Order Preference Disorder Preference
  • 49. Prediction of signal peptides from sequence Signal peptides play different roles They mediate transport of proteins to the ER in eukaryotes They target proteins for secretion in prokaryotes The architecture of signal peptides Positively charged N-terminus Hydrophobic core Short, more polar region Cleavage site with small amino acids at positions -3 and -1 Signal peptides can be accurately predicted by several methods Henrik Nielsen, CBS, DTU Lyngby
  • 50. Function prediction from post-translational modifications Proteins with similar function may not be related in sequence Still they must perform their function in the context of the same cellular machinery Similarities in features such as PTMs and physical/chemical properties could be expected for proteins with similar function Henrik Nielsen, CBS, DTU Lyngby
  • 51. The concept of ProtFun Predict as many biologically relevant features as we can from the sequence Train artificial neural networks for each category, also optimizing the feature combinations Assign a probability for each category from the NN outputs © Journal of Molecular Biology, 2002
  • 52. Training of neural networks Human protein sequences from SWISS-PROT were assigned to functional classes based on their keywords by using the EUCLID dictionary The set of sequences was divided into a test and a training set with no significant sequence similarity between the two sets Neural networks were first trained for single features and subsequently for combinations of the best performing features
  • 53. Prediction performance on cellular role categories © Journal of Molecular Biology, 2002
  • 54. © Journal of Molecular Biology, 2002
  • 55. An example – 1AOZ vs. 1PLC Scoring matrix: BLOSUM50, gap penalties: -12/-2; 15.5% identity; global alignment score: -23
      1AOZ  SQIRHYKWEVEYMFWAPNCNENIVMGINGQFPGPTIRANAGDSVVVELTNKLHTEGVVIH
      1PLC  ---------IDVLLGA---DDGSLAFVPSEFS-----ISPGEKIVFK-NNAGFPHNIVFD

      1AOZ  WHGILQRGTPWADGTASISQCAINPGETFFYNFTVDNPGTFFYHGHLGMQRSAGLYGSLI
      1PLC  EDSI-PSGVDASKISMSEEDLLNAKGETFEVALSNKGEYSFYCSPHQG----AGMVGKVT

      1AOZ  VDPPQGKKE
      1PLC  VN-------
  • 56. An enzyme and a non-enzyme from the Cupredoxin superfamily
  • 57. Similar structure, different functions Many examples exist of structurally similar proteins which have different functions Two PDB structures from the Cupredoxin superfamily were shown 1AOZ is an enzyme 1PLC is not an enzyme Despite their structural similarity, our method predicts both correctly
      # Functional category              1AOZ   1PLC
      Amino_acid_biosynthesis            0.126  0.070
      Biosynthesis_of_cofactors          0.100  0.075
      Cell_envelope                      0.429  0.032
      Cellular_processes                 0.057  0.059
      Central_intermediary_metabolism    0.063  0.041
      Energy_metabolism                  0.126  0.268
      Fatty_acid_metabolism              0.027  0.072
      Purines_and_pyrimidines            0.439  0.088
      Regulatory_functions               0.102  0.019
      Replication_and_transcription      0.052  0.089
      Translation                        0.079  0.150
      Transport_and_binding              0.032  0.052
      # Enzyme/nonenzyme                 1AOZ   1PLC
      Enzyme                             0.773  0.310
      Nonenzyme                          0.227  0.690
      # Enzyme class                     1AOZ   1PLC
      Oxidoreductase (EC 1.-.-.-)        0.077  0.077
      Transferase (EC 2.-.-.-)           0.260  0.099
      Hydrolase (EC 3.-.-.-)             0.114  0.071
      Lyase (EC 4.-.-.-)                 0.025  0.020
      Isomerase (EC 5.-.-.-)             0.010  0.068
      Ligase (EC 6.-.-.-)                0.017  0.017
  • 58. Conclusions Short linear motifs are likely as important for protein function as the large, well-studied domains These features are generally very hard to predict from sequence, although some can be predicted Many functional classes of proteins can be predicted from sequence alone by non-homology based methods
  • 60. Part 4 Qualitative modeling of the yeast cell cycle Lars Juhl Jensen EMBL Heidelberg
  • 62. Qualitative versus quantitative modeling Our aim: a qualitative model of the yeast cell cycle that is accurate even at the level of individual interactions and provides a global overview of temporal complex formation © Chen et al., Mol. Biol. Cell, 2004 Ulrik de Lichtenberg, CBS, DTU Lyngby
  • 63. Model generation through data integration Model Generation A Parts List Literature Microarray data Dynamic data Microarray data Proteomics data PPI data TF-target data Connections YER001W YBR088C YOL007C YPL127C YNR009W YDR224C YDL003W YBL003C YDR225W YBR010W YKR013W … YDR097C YBR089W YBR054W YMR215W YBR071W YBL002W YGR189C YNL031C YNL030W YNL283C YGR152C … Ulrik de Lichtenberg, CBS, DTU Lyngby
  • 64. Getting the parts list yeast culture Microarrays Gene expression Expression profile Ulrik de Lichtenberg, CBS, DTU Lyngby Cho et al. & Spellman et al. 600 periodically expressed genes (with associated peak times) that encode “dynamic proteins” The Parts list New Analysis
  • 65. The temporal interaction network Observation: For two thirds of the dynamic proteins, no interactions were found Why? Some may be missed components of the complexes and modules already in the network Some may not participate in protein-protein interactions But, the majority probably participate in transient interactions that are not so well captured by current interaction assays Ulrik de Lichtenberg, CBS, DTU Lyngby © Science, 2005
  • 66. Interactions are close in time Observation: Interacting dynamic proteins are typically expressed close in time Ulrik de Lichtenberg, CBS, DTU Lyngby © Science, 2005
  • 67. Static proteins play a major role Observation: Static ( scaffold ) proteins comprise about a third of the network and participate in interactions throughout the entire cycle Ulrik de Lichtenberg, CBS, DTU Lyngby © Science, 2005
  • 68. Just-in-time synthesis? yes and no! Observation: The dynamic proteins are generally expressed just before they are needed to carry out their function, generally referred to as just-in-time synthesis But, the general design principle seems to be that only some key components of each module/complex are dynamic This suggests a mechanism of just-in-time assembly or partial just-in-time synthesis Ulrik de Lichtenberg, CBS, DTU Lyngby © Science, 2005
  • 69. Network as a discovery tool Observation: The network places 30+ uncharacterized proteins in a temporal interaction context. The network thus generates detailed hypotheses about their function. Observation: The network contains entire novel modules and complexes. Ulrik de Lichtenberg, CBS, DTU Lyngby © Science, 2005
  • 70. Network Hubs: “Party” versus “Date” “Date” Hub: the hub protein interacts with different proteins at different times. “Party” Hub: the hub protein and its interactors are expressed close in time. Ulrik de Lichtenberg, CBS, DTU Lyngby © Science, 2005
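The distinction between the two hub types can be made concrete with peak expression times. The sketch below assumes peak times are given as a percentage of the cell cycle (0 to 100) and uses a hypothetical cutoff `max_gap` to decide what counts as "close in time"; neither the cutoff nor the classification rule comes from the slides.

```python
def circular_gap(t1: float, t2: float, period: float = 100.0) -> float:
    """Shortest distance between two time points on a cyclic time axis."""
    d = abs(t1 - t2) % period
    return min(d, period - d)

def classify_hub(hub_peak: float, partner_peaks, max_gap: float = 15.0) -> str:
    """'party' if all partners peak close to the hub, otherwise 'date'."""
    gaps = [circular_gap(hub_peak, p) for p in partner_peaks]
    return "party" if max(gaps) <= max_gap else "date"

# Invented peak times, in percent of the cell cycle.
wrap_around = circular_gap(95, 5)
party = classify_hub(20, [15, 25, 30])
date = classify_hub(20, [15, 60, 90])
```

The circular distance matters because a protein peaking at 95% and one peaking at 5% of the cycle are close in time even though their raw values are far apart.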
  • 71. Transcription is linked to phosphorylation Observation: 332 putative targets of the cyclin-dependent kinase Cdc28 have been determined experimentally (Übersax et al.). We find that: 6% of all yeast proteins are putative Cdk targets 8% of the static proteins (white) are putative Cdk targets 27% of the dynamic proteins (colored) are putative Cdk targets Conclusion: this reveals a hitherto undescribed link between the levels of transcriptional and post-translational control of the cell cycle Ulrik de Lichtenberg, CBS, DTU Lyngby © Science, 2005
  • 72. A neural network strategy for prediction of cell cycle related proteins Ulrik de Lichtenberg, CBS, DTU Lyngby
  • 73. Prediction of cell cycle related proteins from sequence derived features Ulrik de Lichtenberg, CBS, DTU Lyngby © Journal of Molecular Biology, 2003
  • 74. Evaluating the performance Ulrik de Lichtenberg, CBS, DTU Lyngby
  • 75. Ulrik de Lichtenberg, CBS, DTU Lyngby
  • 76. The yeast cell cycle in feature space © Journal of Molecular Biology, 2003 Ulrik de Lichtenberg, CBS, DTU Lyngby
  • 77. S phase feature snapshot S phase 40% into the cell cycle we see High isoelectric point Many nuclear proteins Short proteins Low N-glycosylation potential Low potential for Ser/Thr-phosphorylation Few PEST regions Low aliphatic index Ulrik de Lichtenberg, CBS, DTU Lyngby © Journal of Molecular Biology, 2003
  • 78. G1/S phase feature snapshot G1/S transition 25% into the cell cycle we see Low isoelectric point Many extracellular proteins Many PEST regions Very high Tyr-phosphorylation potential Higher glycosylation potential Higher potential for Ser/Thr-phosphorylation Ulrik de Lichtenberg, CBS, DTU Lyngby © Journal of Molecular Biology, 2003
  • 79. Conclusions Accurate models can be constructed by careful integration of several types of high-throughput experimental data We have constructed a model of the yeast cell cycle that reveals global trends that were not previously known The same strategies are applicable to other systems The integrative approach is applicable to any process for which both interaction data and time series are available Most broad classes of proteins can be predicted using neural networks with sequence derived features as input.
  • 81. Summary The many types of high-throughput data should to be Better standardization and quality control is crucial Scoring schemes and filtering schemes can reduce the error rate of high-throughput data drastically Integration of many evidence types allows high-confidence predictions of functional relationships New biological discoveries can be made through data integration There is more to proteins than just globular domains Proteins contain many short linear motifs (ELMs) Most of these are very difficult to predict from sequence Sequence derived features can give hints about protein function
  • 82. Acknowledgments STRING and ArrayProspector Peer Bork Christian von Mering Jan Korbel Berend Snel Martijn Huynen Daniel Jaeggi Steffen Schmidt Sean Hooper Mathilde Foglierini Julien Lagarde Chris Workman ELMs – linear motifs Rune Linding Toby Gibson Rob Russell Protein feature/function prediction Søren Brunak Alfonso Valencia Ramneek Gupta Can Kesmir Kristoffer Rapacki Hans-Henrik Stærfeldt Henrik Nielsen Nikolaj Blom Claus A.F. Andersen Anders Krogh Steen Knudsen Chris Workman Damien Devos Javier Tamames Analysis of the yeast cell cycle Ulrik de Lichtenberg Thomas Skøt Anders Fausbøll Søren Brunak