STRING - Prediction of functionally associated proteins from heterogeneous genome scale data sets
Lars Juhl Jensen, EMBL Heidelberg
Cross-species integration of diverse data
  • Challenges and promises of large-scale data integration
      • Explosive increase in both the amount and the variety of high-throughput data sets being produced
      • These data are highly heterogeneous and lack standardization
      • Most data sets are error-prone and suffer from systematic biases
      • Experiments should be integrated across model organisms
  • STRING is a web resource that integrates and transfers diverse large-scale data across 100+ species, but it is not
      • a primary repository for experimental data
      • a curated database of complexes or pathways
      • a substitute for expert annotation
STRING provides a modular protein network by integrating diverse types of evidence
  • Genomic neighborhood
  • Species co-occurrence
  • Gene fusions
  • Database imports
  • Experimental interaction data
  • Microarray expression data
  • Literature co-mentioning
Two modes of operation
  • “Protein mode”: separate network for each species
  • “COG mode”: one network covering all species
Inferring functional modules from gene presence/absence patterns
  [Figure: the “cellulosome”. Labels: cell, cell wall, anchoring proteins, cellulosomes, resting protuberances, protracted protuberance, cellulose. © Trends Microbiol, 1999]
Formalizing the phylogenetic profile method
  • Align all proteins against all
  • Calculate best-hit profile
  • Join similar species by PCA
  • Calculate PC profile distances
  • Calibrate against KEGG maps
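The first and fourth steps above can be illustrated with a minimal sketch. It is a simplification, not the STRING pipeline: it uses binary presence/absence profiles and Hamming distance, and it skips the PCA step that joins similar species. The bit-score threshold `min_score`, the gene names, and the species names are all hypothetical.

```python
from itertools import combinations

def presence_profile(best_hits, species, min_score=50.0):
    """Binary profile: 1 if the best hit in a species reaches min_score, else 0."""
    return tuple(1 if best_hits.get(sp, 0.0) >= min_score else 0 for sp in species)

def hamming(p, q):
    """Number of species in which two profiles disagree."""
    return sum(a != b for a, b in zip(p, q))

# Hypothetical best-hit bit scores per gene and species.
species = ["sp1", "sp2", "sp3", "sp4"]
best_hits = {
    "geneA": {"sp1": 210.0, "sp3": 95.0},
    "geneB": {"sp1": 180.0, "sp3": 120.0},
    "geneC": {"sp2": 300.0, "sp4": 75.0},
}

profiles = {gene: presence_profile(hits, species) for gene, hits in best_hits.items()}
distances = {
    (a, b): hamming(profiles[a], profiles[b])
    for a, b in combinations(sorted(profiles), 2)
}
```

In this toy data, geneA and geneB occur in the same species and so have distance 0, which is the kind of signal the method treats as evidence of a functional link.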
Score calibration against a common reference
  • Different pieces of evidence are not directly comparable
      • A different raw quality score is used for each evidence type
      • Quality differences exist among data sets of the same type
  • Solved by calibrating all scores against a common reference
  • Requirements for the reference
      • Must represent a compromise between all the types of evidence
      • Broad species coverage
  • Our chosen reference is KEGG metabolic maps
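One way to picture calibration against a reference is to bin pairs by raw score and measure, per bin, the fraction that share a reference map. The sketch below does only that. The equal-size binning, the function name `calibrate`, and the toy data are assumptions for illustration, and the slides do not say how the calibration curve is fitted.

```python
def calibrate(scored_pairs, same_map, n_bins=2):
    """Return (mean raw score, fraction of pairs sharing a reference map) per bin.

    scored_pairs: list of (pair, raw_score); same_map: dict pair -> bool.
    Pairs are sorted by raw score and cut into bins of roughly equal size.
    """
    ranked = sorted(scored_pairs, key=lambda item: item[1])
    size = max(1, len(ranked) // n_bins)
    curve = []
    for start in range(0, len(ranked), size):
        chunk = ranked[start:start + size]
        mean_raw = sum(score for _, score in chunk) / len(chunk)
        hits = sum(1 for pair, _ in chunk if same_map[pair])
        curve.append((mean_raw, hits / len(chunk)))
    return curve

# Hypothetical raw scores and reference-map membership.
scored = [(("a", "b"), 0.9), (("a", "c"), 0.8), (("b", "c"), 0.2), (("a", "d"), 0.1)]
same_map = {("a", "b"): True, ("a", "c"): True, ("b", "c"): False, ("a", "d"): True}

curve = calibrate(scored, same_map)
```

Because every evidence type is mapped onto the same scale (the fraction of pairs on a shared map), the calibrated values become comparable across evidence types even though the raw scores are not.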
Predicting functional and physical interactions from gene fusion/fission events
  • Calculate all-against-all pairwise alignments
  • Find genes in A that match the same gene in B
  • Exclude overlapping alignments
  • Calibrate against KEGG maps
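The fusion test can be sketched as follows, assuming alignment hits are already available as tuples with coordinates on the target gene. The tuple layout, the function name `fusion_candidates`, and the gene identifiers are hypothetical, and no alignment tool is invoked here.

```python
from collections import defaultdict
from itertools import combinations

def fusion_candidates(hits):
    """Find pairs of genes in genome A that align to separate parts of one gene in B.

    hits: iterable of (gene_in_A, gene_in_B, start, end), coordinates on the B gene.
    Returns a set of (gene_a1, gene_a2, gene_b) with non-overlapping alignments.
    """
    by_target = defaultdict(list)
    for query, target, start, end in hits:
        by_target[target].append((query, start, end))

    pairs = set()
    for target, regions in by_target.items():
        for (q1, s1, e1), (q2, s2, e2) in combinations(regions, 2):
            overlapping = s1 <= e2 and s2 <= e1
            if q1 != q2 and not overlapping:
                pairs.add((min(q1, q2), max(q1, q2), target))
    return pairs

# Hypothetical hits: a1 and a2 cover different halves of b1; a3 overlaps both.
hits = [
    ("a1", "b1", 1, 200),
    ("a2", "b1", 230, 480),
    ("a3", "b1", 150, 400),
    ("a4", "b2", 1, 300),
]
candidates = fusion_candidates(hits)
```

Excluding overlapping alignments matters because two paralogs matching the same region of a gene say nothing about fusion; only genes that match distinct regions suggest that the target is a fused form.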
Inferring functional associations from evolutionarily conserved operons
  • Identify runs of adjacent genes with the same direction
  • Score each gene pair based on intergenic distances
  • Calibrate against KEGG maps
  • Infer associations in other species
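A minimal sketch of the first two steps, assuming genes are given as `(name, start, end, strand)` tuples sorted by start coordinate. It reports only the intergenic distance for adjacent pairs within a run; how distances are turned into scores is not stated on the slide and is left out. All names and coordinates are made up.

```python
def same_strand_runs(genes):
    """Yield runs (lists) of two or more adjacent genes on the same strand."""
    run = []
    for gene in genes:
        if run and gene[3] != run[-1][3]:
            if len(run) > 1:
                yield run
            run = []
        run.append(gene)
    if len(run) > 1:
        yield run

def adjacent_pair_distances(run):
    """Intergenic distance (0 if the genes overlap) for each adjacent pair in a run."""
    return [((a[0], b[0]), max(0, b[1] - a[2])) for a, b in zip(run, run[1:])]

# Hypothetical genome segment, sorted by start coordinate.
genes = [
    ("g1", 100, 400, "+"),
    ("g2", 420, 900, "+"),
    ("g3", 1500, 2000, "-"),
    ("g4", 2050, 2500, "-"),
    ("g5", 2600, 3000, "+"),
]
runs = list(same_strand_runs(genes))
pair_distances = [d for run in runs for d in adjacent_pair_distances(run)]
```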
Evidence transfer based on “fuzzy orthology”
  • Orthology transfer is tricky
      • Correct assignment of orthology is difficult for distant species
      • Functional equivalence cannot be guaranteed for in-paralogs
  • These problems are addressed by our “fuzzy orthology” scheme
      • Confidence scores for functional equivalence are calculated from all-against-all alignment
      • In the case of many-to-many relationships, evidence is distributed across possible pairs according to confidence scores
  [Figure: evidence transfer from source species to target species]
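The distribution of evidence over many-to-many relationships might look like the sketch below. Weighting each target pair by the product of the two equivalence confidences is an assumption made here for illustration; the slide does not give the formula. The function name and protein identifiers are hypothetical.

```python
def transfer_evidence(score, equivalents_a, equivalents_b):
    """Spread one source-species association score over candidate target pairs.

    equivalents_a / equivalents_b map target-species proteins to a confidence
    in [0, 1] that they are functionally equivalent to source proteins a and b.
    The product weighting is an illustrative assumption.
    """
    return {
        (ta, tb): score * ca * cb
        for ta, ca in equivalents_a.items()
        for tb, cb in equivalents_b.items()
        if ta != tb
    }

# Source pair (a, b) scored 0.8; b has two in-paralogs in the target species.
transferred = transfer_evidence(
    0.8,
    {"A1": 1.0},
    {"B1": 0.75, "B2": 0.25},
)
```

The point of the scheme is that no single ortholog has to be chosen: the more confident equivalence (B1) receives most of the transferred evidence, while the less confident one (B2) still receives some.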
Integrating physical interaction screens
  • Make binary representation of complexes
  • Yeast two-hybrid data sets are inherently binary
  • Calculate score from number of (co-)occurrences
  • Calculate score from non-shared partners
  • Calibrate against KEGG maps
  • Infer associations in other species
  • Combine evidence from experiments
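For complex purification data, a binary representation and a co-occurrence score could be sketched as follows. Every pair of proteins seen in the same purification is counted once, and a Jaccard index stands in for the raw score. The Jaccard choice is an assumption for illustration, not the scoring function used by STRING, and the protein names are invented.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_scores(purifications):
    """Score protein pairs by how often they are purified together.

    Each purification is expanded into all pairwise links (a binary
    representation of the complex). The score is a Jaccard index:
    co-occurrences divided by purifications containing either protein.
    """
    singles, pairs = Counter(), Counter()
    for members in purifications:
        unique = sorted(set(members))
        singles.update(unique)
        pairs.update(combinations(unique, 2))
    return {
        (a, b): n / (singles[a] + singles[b] - n)
        for (a, b), n in pairs.items()
    }

# Hypothetical purifications.
purifications = [["P1", "P2", "P3"], ["P1", "P2"], ["P2", "P4"]]
scores = cooccurrence_scores(purifications)
```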
Mining microarray expression databases
  • Re-normalize arrays by modern method to remove biases
  • Build expression matrix
  • Combine similar arrays by PCA
  • Construct predictor by Gaussian kernel density estimation
  • Calibrate against KEGG maps
  • Infer associations in other species
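The kernel density step can be sketched in plain Python. The example assumes the predictor input is a single co-expression value per gene pair (for instance a correlation coefficient) and that densities are estimated separately for known-associated and non-associated pairs. The bandwidth, the weighting by class size, and the sample values are illustrative assumptions.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a density function built from Gaussian kernels centred on samples."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density

def association_probability(x, positives, negatives, bandwidth=0.2):
    """Estimate P(associated | x), weighting each class density by its size."""
    pos = gaussian_kde(positives, bandwidth)(x) * len(positives)
    neg = gaussian_kde(negatives, bandwidth)(x) * len(negatives)
    return pos / (pos + neg)

# Hypothetical correlation values for reference pairs.
positives = [0.7, 0.8, 0.9]        # pairs known to be associated
negatives = [-0.1, 0.0, 0.1, 0.2]  # pairs assumed not to be associated

p_high = association_probability(0.8, positives, negatives)
p_low = association_probability(0.0, positives, negatives)
```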
Co-mentioning in the scientific literature
  • Associate abstracts with species
  • Identify gene names in title/abstract
  • Count (co-)occurrences of genes
  • Test significance of associations
  • Calibrate against KEGG maps
  • Infer associations in other species
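Counting and testing co-occurrences can be sketched as below. The slide does not name the significance test; a one-sided hypergeometric test is used here as an assumption. Abstracts are represented as sets of already-identified gene names, and the gene names are only examples. The code needs Python 3.8 or later for `math.comb`.

```python
from math import comb

def hypergeom_tail(k, n_a, n_b, n_total):
    """P(X >= k) for the number of abstracts mentioning both genes by chance.

    n_a and n_b abstracts mention each gene among n_total abstracts.
    """
    upper = min(n_a, n_b)
    favourable = sum(
        comb(n_a, i) * comb(n_total - n_a, n_b - i) for i in range(k, upper + 1)
    )
    return favourable / comb(n_total, n_b)

def comention_pvalue(abstracts, gene_a, gene_b):
    """Return (co-occurrence count, p-value) for two genes over a list of abstracts."""
    n_a = sum(gene_a in a for a in abstracts)
    n_b = sum(gene_b in a for a in abstracts)
    k = sum(gene_a in a and gene_b in a for a in abstracts)
    return k, hypergeom_tail(k, n_a, n_b, len(abstracts))

# Hypothetical abstracts, each reduced to the set of gene names it mentions.
abstracts = [
    {"cdc28", "cln2"},
    {"cdc28", "cln2"},
    {"cdc28"},
    {"act1"},
    {"act1", "tub1"},
    {"tub1"},
]
count, p_value = comention_pvalue(abstracts, "cdc28", "cln2")
```

With only six abstracts the p-value is necessarily large; the test becomes informative at the scale of a full literature database.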
The power of cross-species transfer and evidence integration
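The slide does not state how calibrated scores from different evidence types are combined. A common choice, sketched here as an assumption, is to treat the evidence types as independent, so that the combined score is the probability that at least one of them is right. The code needs Python 3.8 or later for `math.prod`.

```python
from math import prod

def combined_score(scores):
    """Combine calibrated scores in [0, 1], assuming independent evidence types."""
    return 1.0 - prod(1.0 - s for s in scores)

# Two moderate lines of evidence yield a higher combined confidence.
two_sources = combined_score([0.5, 0.5])
three_sources = combined_score([0.6, 0.5, 0.2])
```

Under this assumption, two independent lines of evidence at 0.5 each combine to 0.75, which illustrates why integrating several weak signals can produce a confident prediction.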
Predicting and defining metabolic pathways and other functional modules
  • Defined manually: cutting metabolic maps into pathways
  • Defined objectively: standard clustering of genome-scale data
  [Figure: metabolism overview. Labels: purine biosynthesis, histidine biosynthesis. Image: Molecular Biology of the Cell, 3rd edition]
Getting more specific – generally speaking
  • Benchmarking against one common reference allows integration of heterogeneous data
  • The different types of data do not all tell us about the same kind of functional associations
  • It should be possible to assign likely interaction types from supporting evidence types
  • An accurate model of the yeast mitotic cell cycle
      • Approach
          • High-confidence set of physical interactions
          • Custom analysis of cell cycle expression data
      • Observations
          • Dynamic assembly of cell cycle complexes
          • Temporal regulation of Cdk specificity
Summary
  • Quality assessment of each individual large-scale data set is a prerequisite for successful data integration
  • High-confidence prediction of functional associations and modules is possible when combining lines of evidence
  • Transfer of evidence between species is an increasingly important aspect of large-scale data integration
  • Take a look at STRING – an update is in the pipeline
Acknowledgments
  • The STRING team: Christian von Mering, Berend Snel, Martijn Huynen, Daniel Jaeggi, Steffen Schmidt, Mathilde Foglierini, Peer Bork
  • ArrayProspector web service: Julien Lagarde, Chris Workman
  • NetView visualization tool: Sean Hooper
  • Analysis of yeast cell cycle: Ulrik de Lichtenberg, Thomas Skøt, Anders Fausbøll, Søren Brunak
  • Web resources
      • string.embl.de
      • www.bork.embl.de/ArrayProspector
      • www.bork.embl.de/synonyms
Thank you!
STRING - Examples for practical session
Lars Juhl Jensen, EMBL Heidelberg