Prediction of protein function Lars Juhl Jensen EMBL Heidelberg
Overview Part 1 Homology-based transfer of annotation Function prediction from protein domains Part 2 Prediction of functional motifs from sequence Feature-based prediction of protein function Part 3 Prediction of functional interaction networks
Why do we need to predict function?
What do we mean by function? The concept “function” is not clearly defined A structural biologist, a cell biologist, and a medical doctor will have very different views Many levels of granularity For the overall definition of “function”, the knowledge and description can be more or less specific Functional categories are somewhat artificial People like to put things in boxes …
 
Descriptions of protein function Controlled vocabularies Gene Ontology SwissProt keywords KEGG pathways EcoCyc pathways Interaction networks More accurate data models Reactome Systems Biology Markup Language (SBML)
Molecular function Molecular function describes activities, such as catalytic or binding activities, at the molecular level GO molecular function terms represent activities rather than the entities that perform the actions, and do not specify where or when, or in what context, the action takes place Examples of broad functional terms are catalytic activity or transporter activity; an example of a narrower term is adenylate cyclase activity
Biological process A biological process is a series of events accomplished by one or more ordered assemblies of molecular functions An example of a broad GO biological process term is signal transduction; examples of more specific terms are pyrimidine metabolism or alpha-glucoside transport It can be difficult to distinguish between a biological process and a molecular function
Cellular component A cellular component is just that, a component of a cell that is part of some larger object It may be an anatomical structure (for example, the rough endoplasmic reticulum or the nucleus) or a gene product group (for example, the ribosome, the proteasome or a protein dimer) The cellular component categories are probably the best defined categories since they correspond to actual entities
Homology-based transfer of annotation Lars Juhl Jensen EMBL Heidelberg
Detection of homologs Pairwise sequence similarity searches BLAST (fastest) FASTA Full Smith-Waterman (most sensitive) Profile-based similarity searches PSI-BLAST Hidden Markov Models (HMMs) Sequence similarity should always be evaluated at the protein level
 
Sequence similarity, sequence homology, and functional homology Sequence similarity means that the sequences are similar – no more, no less Sequence homology implies that the proteins are encoded by genes that share a common ancestry Functional homology means that two proteins from two organisms have the same function Sequence similarity or sequence homology does not guarantee functional homology
Orthologs vs. paralogs
Functional consequences of gene duplication Neofunctionalization One copy has retained the ancestral function and can be treated as a 1–to–1 ortholog (functional homolog) The other copy has changed its function and behaves much like a paralog Subfunctionalization Each copy has taken on a part of the ancestral function A functional homolog cannot be defined Each ortholog typically has the same molecular function in a different sub-process or location
1–to–1 orthology A single gene in one organism corresponds to a single gene in another organism These can generally be assumed to encode functionally equivalent proteins Same molecular function Same biological process Same localization 1–to–1 orthology is fairly common in prokaryotes and among very closely related organisms
1–to–many orthology A single gene in one organism corresponds to multiple genes in another organism Any mixture of neo- and sub-functionalizations can have occurred Typically same molecular function Often different biological process or sub-process Often different sub-cellular localization or tissue 1–to–many orthology is very common between simple model organisms and higher eukaryotes
Many–to–many orthology Many genes in each organism have arisen from a single gene in their last common ancestor Different neo- and sub-functionalizations have likely taken place in each lineage Typically same molecular function Often different biological process or sub-process Often different sub-cellular localization or tissue Many–to–many orthology is common between higher eukaryotes that are distantly related
Detection of orthologs Reconstruction of phylogenetic trees The theoretically most correct way Works for analyzing particular genes of interest Methods based on reciprocal matches What currently works at the genomic scale Manual curation Detection of very remote orthologs may require that knowledge of gene synteny and/or protein function is taken into account
Construction of gene trees Identify the relevant proteins Sequence similarity and possibly additional information Construct a blocked multiple sequence alignment Use, for example, Muscle and Gblocks Reconstruct the most likely phylogenetic tree Use, for example, PhyML Orthologs and paralogs can be trivially extracted based on a gene tree
Reciprocal matches Simple “best reciprocal match” is a bad choice Can only deal with one-to-one orthology Detection of in-paralogs Similarity higher within species than between species Orthologs can now be detected based on best reciprocal matches between in-paralogous groups One or more out-group organisms can optionally be used to improve the definition of orthologs
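The limitation of the simple best reciprocal match can be illustrated with a minimal sketch. The gene names and similarity scores below are hypothetical, and the functions `best_hits` and `reciprocal_best_matches` are illustrative helpers, not part of any published tool.

```python
def best_hits(scores):
    """Map each query to its highest-scoring target.

    `scores` is a dict {(query, target): similarity score}.
    """
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def reciprocal_best_matches(a_vs_b, b_vs_a):
    """Return pairs (a, b) where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical scores: gene a1 has two co-orthologs (b1 and b2) in species B.
a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 180, ("a2", "b3"): 150}
b_vs_a = {("b1", "a1"): 200, ("b2", "a1"): 180, ("b3", "a2"): 150}
pairs = reciprocal_best_matches(a_vs_b, b_vs_a)
print(pairs)
```

In this toy case b2 is dropped even though its best hit is a1, which is why in-paralogs are grouped first before reciprocal matches are evaluated.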
 
Orthologous groups Orthologs and paralogs are in principle always defined with respect to two organisms Orthologous groups instead try to encompass an entire set of organisms The “inclusiveness” of the orthologous groups depends on how broad a set of organisms the groups cover
Definition of orthologous groups
 
 
COGs, KOGs, and NOGs The COGs and KOGs were manually curated These were automatically expanded to more species Tri-clustering Detection of in-paralogs Identification of triangles of best reciprocal matches Merging of triangles that share an edge Broad phylogenetic coverage COGs and NOGs cover all three domains of life KOGs cover all eukaryotes
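The two tri-clustering steps (finding triangles, then merging triangles that share an edge) can be sketched as follows. This is only an illustration under simplifying assumptions: the input is a hypothetical list of best reciprocal matches between genes, in-paralog detection is skipped, and the function names are made up for this example.

```python
from itertools import combinations


def find_triangles(edges):
    """Return every triangle in an undirected graph as a frozenset of nodes."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    triangles = set()
    for u, v in edges:
        # Any common neighbor of u and v closes a triangle.
        for w in neighbors[u] & neighbors[v]:
            triangles.add(frozenset((u, v, w)))
    return triangles


def merge_triangles(triangles):
    """Merge triangles that share an edge (two nodes) into groups."""
    triangles = sorted(triangles, key=sorted)
    parent = list(range(len(triangles)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edge_owner = {}
    for i, triangle in enumerate(triangles):
        for edge in combinations(sorted(triangle), 2):
            if edge in edge_owner:
                parent[find(i)] = find(edge_owner[edge])
            else:
                edge_owner[edge] = i

    groups = {}
    for i, triangle in enumerate(triangles):
        groups.setdefault(find(i), set()).update(triangle)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical best reciprocal matches between genes from species a, b, c, d.
edges = [("a1", "b1"), ("b1", "c1"), ("a1", "c1"), ("b1", "d1"), ("c1", "d1"),
         ("a2", "b2"), ("b2", "c2"), ("a2", "c2")]
groups = merge_triangles(find_triangles(edges))
print(groups)
```

Here a1 and d1 end up in the same group without being direct best matches, because their triangles share the edge b1-c1.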
 
 
Clustering based on similarity All-against-all sequence similarity is calculated A standard clustering method is applied to define groups of homologous genes TribeMCL Hierarchical clustering These methods generally detect groups of homologous genes, but are not good for distinguishing between orthologs and paralogs
 
Meta-servers Since numerous methods exist for identifying groups of orthologous proteins, meta-servers have begun to emerge These can be very useful for “fishing expeditions” where one is looking for a remote ortholog of a particular protein of interest However, such meta-servers do not attempt to unify the different orthologous groups and are thus not useful for genome-wide studies
Function prediction from protein domains Lars Juhl Jensen EMBL Heidelberg
When homology searches fail Sometimes no orthologs or even paralogs can be identified by sequence similarity searches, or they are all of unknown function No functional information can thus be transferred based on simple sequence homology By instead analyzing the various parts that make up the complete protein, it is nonetheless often possible to predict the protein function
Protein domains Many eukaryotic proteins consist of multiple globular domains that can fold independently These domains have been mixed and matched through evolution Each type of domain contributes towards the molecular function of the complete protein Numerous resources are able to identify such domains from sequence alone using HMMs
 
 
 
 
 
 
 
 
Which domain resource should I use? SMART is focused on signal transduction domains Pfam is very actively developed and thus tends to have the most up-to-date domain collection InterPro is useful for genome annotation since the domains are annotated with GO terms CDD is conveniently integrated with the NCBI BLAST web interface
Predicting globular domains and intrinsically disordered regions Not all globular domains have been discovered and the databases are thus not comprehensive Methods exist for predicting from sequence which regions are globular and which are disordered GlobPlot uses a simple propensity scale DisEMBL, DISOPRED, and PONDR all use ensembles of artificial neural networks Many disordered regions are important for protein function and they should thus not be ignored
 
 
 
 
 
 
Summary Functional annotation Molecular function vs. biological process Inference of molecular function by sequence similarity Biological process only transferable between orthologs Detection of orthologs In-depth studies: phylogenetic trees Automated analysis: InParanoid and COG/KOG/NOG Profile searches for protein domains Each domain contributes a different molecular function
Acknowledgments Christian von Mering Christopher Creevey Ivica Letunic Rune Linding Tobias Doerks Francesca Ciccarelli Berend Snel Martijn Huynen Toby Gibson Rob Russell Peer Bork
Prediction of functional motifs from sequence Lars Juhl Jensen EMBL Heidelberg
Proteins – more than just globular domains Transmembrane helices Disordered regions Eukaryotic linear motifs (ELMs) Modification sites, e.g. phosphorylation sites Ligand peptides, e.g. SH3 binding sites Targeting signals, e.g. nuclear localization sequences The short functional motifs are as important as the globular domains
Insulin Receptor Substrate 1
Databases of functional motifs Fewer and smaller databases General databases of motifs: ProSite and ELM Phosphorylation sites: Phospho.ELM and PhosphoSite These databases contain far fewer instances than protein domain databases Curation is more difficult Protein domain databases can be constructed based on analysis of protein sequences alone Short functional motifs must be curated based on experimental evidence
 
 
 
 
Prediction of ELMs Most functional motifs are “information poor” Weak/short consensus sequences for ELMs The typical ELM only has three conserved residues Some variance is often allowed even for these ELMs are very hard to predict from sequence Simple consensus sequences match everywhere Even more advanced methods like PSSMs, ANNs, or SVMs give poor specificity The full information is not in the site itself
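A small sketch of why simple consensus sequences "match everywhere": a pattern with only two fixed positions, such as the minimal SH3 ligand consensus PxxP, hits several positions even in a very short sequence. The sequence below is a made-up example, and `find_motif` is an illustrative helper.

```python
import re


def find_motif(sequence, pattern):
    """Return 0-based start positions of all matches, including overlapping ones."""
    # A lookahead lets the regex engine report overlapping occurrences.
    return [m.start() for m in re.finditer(f"(?=({pattern}))", sequence)]


pxxp = "P..P"            # two fixed prolines, two unconstrained positions
sequence = "APPLPPRSP"   # hypothetical proline-rich stretch
hits = find_motif(sequence, pxxp)
print(hits)
```

Three hits in nine residues says little about which, if any, is a real binding site; the context of the site has to be taken into account.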
 
 
 
 
Construction of data sets Compiling an initial data set Positive examples can be obtained from existing databases or curated from the literature Good negative examples are often harder to get Separate training and test sets A method may be able to learn the training examples but fail to generalize to new examples Homology reduction! It is crucial that there is no significant sequence similarity between examples in the training and test sets
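One simple way to keep similar sequences out of both sets is to cluster them first and assign whole clusters to either training or test. The sketch below assumes that a list of significantly similar pairs has already been computed (for example from an all-against-all search); the identifiers, the 25% test fraction, and the function names are assumptions for illustration.

```python
def similarity_clusters(ids, similar_pairs):
    """Group sequence ids into connected components of the similarity graph."""
    parent = {i: i for i in ids}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in similar_pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    # Largest clusters first, ties broken by content for reproducibility.
    return sorted(clusters.values(), key=lambda c: (-len(c), c))


def split_by_cluster(clusters, test_fraction=0.25):
    """Assign whole clusters to train or test so no similar pair is split."""
    total = sum(len(c) for c in clusters)
    train, test = [], []
    for cluster in clusters:
        if len(test) + len(cluster) <= test_fraction * total:
            test.extend(cluster)
        else:
            train.extend(cluster)
    return train, test


ids = ["p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8"]
similar_pairs = [("p1", "p2"), ("p2", "p3"), ("p4", "p5")]  # hypothetical
train, test = split_by_cluster(similarity_clusters(ids, similar_pairs))
print(train, test)
```

Splitting by cluster is a stricter alternative to a random split: a random split would almost certainly place p1 and p2 on opposite sides in some runs, inflating the apparent test performance.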
Machine learning Numerous algorithms exist Artificial neural networks Support vector machines Decision trees The choice of algorithm is not so important Providing the relevant input is important Having high-quality training data is crucial
 
 
Kinase-specific prediction of phosphorylation sites (NetPhosK) Artificial neural networks (ANNs) were trained for several different kinases The sequence logos show only the positive examples Negative examples also provide information Also, ANNs and SVMs can capture correlations between positions
Prediction of signal peptides from sequence (SignalP) Function Eukaryotic proteins are targeted to the ER Prokaryotic proteins are targeted for secretion Architecture Positively charged N-terminus Hydrophobic core Short, more polar region Cleavage site Signal peptides can be accurately predicted
Machine learning can help identify errors in curated databases Some of the manually curated databases contain obvious errors that can be eliminated General “SIGNAL” errors Wrong signal peptide cleavage site The secreted protein is processed by proteases Signal peptide includes propeptide Wrong start codon used
Signal peptide or propeptide
Signal peptide or propeptide Propeptide cleavage Signal peptide cleavage
Wrong start codon
Use of short linear motifs for function prediction Only a few motifs (mostly localization signals) can be predicted with high accuracy Even in these cases advanced machine learning methods are typically needed These can be treated in the same way as domains Most motifs are weak, and predictions should be approached with care To tell if these sites are likely to be true, one needs to consider the context An experiment is needed to prove that it is functional
Feature-based prediction of protein function Lars Juhl Jensen EMBL Heidelberg
Function prediction from post-translational modifications Proteins with similar function may not be related in sequence Still they must perform their function in the context of the same cellular machinery Similarities in features such as PTMs and physical/chemical properties could be expected for proteins with similar function
The concept of ProtFun
 
Function prediction on the human prion sequence
############## ProtFun 1.1 predictions ##############
>PRIO_HUMAN
# Functional category                  Prob    Odds
  Amino_acid_biosynthesis              0.020   0.909
  Biosynthesis_of_cofactors            0.032   0.444
  Cell_envelope                        0.146   2.393
  Cellular_processes                   0.053   0.726
  Central_intermediary_metabolism      0.130   2.063
  Energy_metabolism                    0.029   0.322
  Fatty_acid_metabolism                0.017   1.308
  Purines_and_pyrimidines              0.528   2.173
  Regulatory_functions                 0.013   0.081
  Replication_and_transcription        0.020   0.075
  Translation                          0.035   0.795
  Transport_and_binding             => 0.831   2.027
# Enzyme/nonenzyme                     Prob    Odds
  Enzyme                               0.250   0.873
  Nonenzyme                         => 0.750   1.051
# Enzyme class                         Prob    Odds
  Oxidoreductase (EC 1.-.-.-)          0.070   0.336
  Transferase    (EC 2.-.-.-)          0.031   0.090
  Hydrolase      (EC 3.-.-.-)          0.057   0.180
  Isomerase      (EC 4.-.-.-)          0.020   0.426
  Ligase         (EC 5.-.-.-)          0.010   0.313
  Lyase          (EC 6.-.-.-)          0.017   0.334
ProtFun data sets Labeling of training and test data Cellular role categories: human SwissProt sequences were categorized using EUCLID Enzyme categories: top-level enzyme classifications were extracted from human SwissProt description lines Gene Ontology terms were transferred from InterPro The sequences were divided into training and test sets without significant sequence similarity Binary predictors were trained for each category
Prediction performance on cellular role categories
Prediction performance on enzyme categories
Predictive performance on Gene Ontology categories
Non-classical secretion Some proteins without N-terminal signal peptides are secreted via alternative secretion pathways Several growth factors, e.g. FGF1 and FGF2 Interleukin 1 beta HIV-1 Tat No consensus sequence motif is known Maybe they have some features in common with other secreted proteins …
SecretomeP data sets Training and test set Positive examples: 3321 extracellular mammalian proteins with their signal peptides removed Negative examples: 3654 mammalian proteins from cytoplasm or nucleus Validation set 14 known non-classically secreted proteins
Secreted proteins are typically small
ROC plot for SecretomeP
Similar properties of classically and non-classically secreted proteins
 
A look into the black box Neural networks are often criticized for being a “black box” method However, there are several ways to investigate what a neural network ensemble has learned Which fraction of the ensemble uses a certain feature? How good a performance can be attained using each of the features individually? How much does performance decrease if the neural networks are retrained without a certain feature (or combination of features)?
 
 
 
SecretomeP feature usage
ProtFun performance for other organisms Our predictors work in general for eukaryotes Best performance on metazoan proteins Some categories work quite well for prokaryotes Most metabolism categories Transport and binding While other categories fail Energy metabolism Regulatory functions
Mapping category performances onto input features
Performance contribution of sequence-derived features The correlations between features and function are conserved for eukaryotes Some correlations extend to archaea and bacteria Physical/chemical properties Secondary structure and transmembrane helices Other correlations only hold for eukaryotes PTMs and subcellular localization features
Evolution conserves protein features and function Protein features are more conserved between orthologs than paralogs This leads to ProtFun predicting orthologs to be more likely to share function than paralogs That prediction is fully consistent with the notion that it is best to infer function from orthologous proteins
Conclusions Short linear motifs are likely equally important for protein function as the large well-studied domains These are much harder to predict from sequence Reasonable accuracy can be obtained by applying machine learning methods on high-quality datasets Many classes of proteins can be predicted based on such sequence-derived protein features These methods are not nearly as reliable as homology However, often they are the only option
Acknowledgments Ramneek Gupta Can Kesmir Jannick Dyrløv Bendtsen Henrik Nielsen Nikolaj Blom Francesca Diella Rune Linding Damien Devos Alfonso Valencia Søren Brunak Toby Gibson
Prediction of functional interaction networks Lars Juhl Jensen EMBL Heidelberg
What is an interaction? Physical protein interactions Proteins that physically touch each other Members of the same stable complex Transient interactions, e.g. a kinase and its substrate The pragmatic definition – whatever the assay in question can measure Functional interactions Neighbors in metabolic networks Members of the same pathway
The use of interaction networks for function prediction A functional interaction implies that two proteins are involved in the same biological process However, the networks do not divide proteins into a predefined set of functional classes such as the Gene Ontology terms Functional associations do not require homology to proteins of known function, and can complement the predictions even when homology is present
 
 
Functional interaction networks
Evidence types Genomic context methods Phylogenetic profiles, gene neighborhood, and fusion Primary experimental data Physical protein interactions and gene expression data Manually curated databases Pathways and protein complexes Automatic literature mining Co-occurrence and Natural Language Processing
Phylogenetic profiles
 
 
 
Cell Cellulosomes Cellulose
Formalizing the phylogenetic profile method Align all proteins against all Calculate best-hit profile Join similar species by PCA Calculate PC profile distances Calibrate against KEGG maps
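The core idea behind profile comparison can be sketched with plain presence/absence vectors. Note the simplifications: the slide describes best-hit profiles reduced by PCA, whereas this sketch uses hypothetical binary profiles and a Hamming distance, with no PCA step and no calibration.

```python
def profile_distance(p, q):
    """Fraction of species in which presence/absence differs (Hamming distance)."""
    if len(p) != len(q):
        raise ValueError("profiles must cover the same species")
    return sum(a != b for a, b in zip(p, q)) / len(p)


# Hypothetical profiles over six species: 1 = homolog present, 0 = absent.
profiles = {
    "geneA": (1, 1, 0, 1, 0, 0),
    "geneB": (1, 1, 0, 1, 0, 0),
    "geneC": (0, 1, 1, 0, 1, 1),
}

d_ab = profile_distance(profiles["geneA"], profiles["geneB"])
d_ac = profile_distance(profiles["geneA"], profiles["geneC"])
print(d_ab, d_ac)
```

Genes with near-identical profiles (geneA and geneB here) are candidates for a functional association; the raw distance would still need to be calibrated against a reference such as KEGG maps before it can be read as a confidence.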
Gene neighborhood
Gene neighborhood Identify runs of adjacent genes with the same direction Score each gene pair based on intergenic distances Calibrate against KEGG maps Infer associations in other species
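The first step, identifying runs of adjacent genes with the same direction, might look like the sketch below. The gene coordinates are invented, and the 300 bp cut-off on the intergenic distance is an assumption made for this example rather than a value taken from the slides.

```python
def same_strand_runs(genes, max_gap=300):
    """Find runs of two or more adjacent genes on the same strand.

    `genes` is a list of (name, start, end, strand) tuples sorted by start.
    A run is broken when the strand changes or the intergenic distance
    exceeds `max_gap` (hypothetical threshold).
    """
    runs, current, previous = [], [], None
    for gene in genes:
        name, start, _end, strand = gene
        same_run = (
            previous is not None
            and strand == previous[3]
            and start - previous[2] <= max_gap
        )
        if same_run:
            current.append(name)
        else:
            if len(current) > 1:
                runs.append(current)
            current = [name]
        previous = gene
    if len(current) > 1:
        runs.append(current)
    return runs


genes = [
    ("g1", 100, 900, "+"), ("g2", 950, 1800, "+"), ("g3", 1850, 2500, "+"),
    ("g4", 2600, 3400, "-"), ("g5", 5000, 5900, "-"), ("g6", 5950, 6700, "-"),
]
runs = same_strand_runs(genes)
print(runs)
```

Gene pairs within a run would then be scored from their intergenic distances and calibrated, as listed on the slide.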
Gene fusion
Gene fusion Calculate all-against-all pairwise alignments Find in A genes that match the same gene in B Exclude overlapping alignments Calibrate against KEGG maps
Calibration of quality scores Different pieces of evidence are not directly comparable A different raw quality score is used for each evidence type  Quality differences exist among data sets of the same type Solved by calibrating all scores against a common reference The accuracy relative to a “gold standard” is calculated within score intervals The resulting points are approximated by a sigmoid
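The binning step of the calibration can be sketched as follows: raw scores are grouped into intervals and the fraction of pairs confirmed by the gold standard is computed per interval. The pairs, scores, and gold standard below are hypothetical, and the final sigmoid fit through the resulting points is not shown.

```python
def binned_accuracy(scored_pairs, gold_positive, bin_width):
    """Accuracy relative to a gold standard within raw-score intervals.

    `scored_pairs` is a list of (pair, raw_score); `gold_positive` is the set
    of pairs that the gold standard considers true associations.
    Returns {interval start: fraction of pairs in the gold standard}.
    """
    counts = {}
    for pair, score in scored_pairs:
        bin_start = (score // bin_width) * bin_width
        hits, total = counts.get(bin_start, (0, 0))
        counts[bin_start] = (hits + (pair in gold_positive), total + 1)
    return {b: hits / total for b, (hits, total) in sorted(counts.items())}


scored_pairs = [
    (("a", "b"), 0.2), (("a", "c"), 0.7),
    (("b", "c"), 1.1), (("b", "d"), 1.8),
    (("c", "d"), 2.3), (("d", "e"), 2.9),
]
gold_positive = {("b", "d"), ("c", "d"), ("d", "e")}
accuracy = binned_accuracy(scored_pairs, gold_positive, bin_width=1.0)
print(accuracy)
```

Because every evidence type is mapped onto the same accuracy scale, scores from different sources become directly comparable after this step.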
Data integration
Protein-protein interaction databases Imported databases BIND, Biomolecular Interaction Network Database DIP, Database of Interacting Proteins GRID, General Repository for Interaction Datasets HPRD, Human Protein Reference Database MINT, Molecular Interactions Database Databases to be added IntAct PDB
Physical protein interactions Make binary representation of complexes Yeast two-hybrid data sets are inherently binary Calculate score from number of (co-)occurrences Calculate score from non-shared partners Calibrate against KEGG maps Infer associations in other species Combine evidence from experiments
Binary representations of purification data
Topology-based quality scores Scoring scheme for yeast two-hybrid data: $S_1 = -\log\left((N_1+1)\cdot(N_2+1)\right)$ $N_1$ and $N_2$ are the numbers of non-shared interaction partners Similar scoring schemes have been published by Saito et al. Scoring scheme for complex pull-down data: $S_2 = \log\left[(N_{12}\cdot N)/((N_1+1)\cdot(N_2+1))\right]$ $N_{12}$ is the number of purifications containing both proteins $N_1$ is the number containing protein 1, $N_2$ is defined similarly $N$ is the total number of purifications Both schemes aim at identifying ubiquitous interactors
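The two scoring schemes translate directly into code. The slide does not state the base of the logarithm, so the natural logarithm is assumed here; the function names and the example counts are illustrative.

```python
import math


def y2h_score(n1, n2):
    """S1 = -log((N1 + 1) * (N2 + 1)); N1, N2 = non-shared interaction partners."""
    return -math.log((n1 + 1) * (n2 + 1))


def pulldown_score(n12, n1, n2, n_total):
    """S2 = log((N12 * N) / ((N1 + 1) * (N2 + 1))).

    n12: purifications containing both proteins (must be > 0)
    n1, n2: purifications containing protein 1 / protein 2
    n_total: total number of purifications
    """
    return math.log((n12 * n_total) / ((n1 + 1) * (n2 + 1)))


# Two proteins with no other partners score higher than two promiscuous ones.
specific_y2h = y2h_score(0, 0)
promiscuous_y2h = y2h_score(5, 5)

# Seen together in 3 of 100 purifications: always together vs. mostly apart.
specific_pd = pulldown_score(3, 3, 3, 100)
ubiquitous_pd = pulldown_score(3, 50, 50, 100)
print(specific_y2h, promiscuous_y2h, specific_pd, ubiquitous_pd)
```

In both schemes, proteins that interact with or co-purify with almost everything are pushed towards low scores.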
Mining microarray expression databases Re-normalize arrays by modern method to remove biases Build expression matrix Combine similar arrays by PCA Construct predictor by Gaussian kernel density estimation Calibrate against KEGG maps Infer associations in other species
Databases of curated knowledge Pathway databases BioCarta KEGG, Kyoto Encyclopedia of Genes and Genomes Reactome STKE, Signal Transduction Knowledge Environment Curated protein complexes MIPS, Munich Information center for Protein Sequences Databases to be added Gene Ontology annotation
Co-occurrence in the scientific texts Associate abstracts with species Identify gene names in title/abstract Count (co-)occurrences of genes Test significance of associations Calibrate against KEGG maps Infer associations in other species
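The counting step can be sketched as below, assuming gene names have already been identified in each abstract. The abstracts are invented, and the observed-over-expected ratio is only a simple stand-in for a proper significance test, which the slide does not specify.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(abstract_genes):
    """Count in how many abstracts each gene and each gene pair is mentioned."""
    single, pair = Counter(), Counter()
    for genes in abstract_genes:
        unique = sorted(set(genes))          # count each gene once per abstract
        single.update(unique)
        pair.update(combinations(unique, 2))
    return single, pair


def observed_over_expected(a, b, single, pair, n_abstracts):
    """Ratio of observed co-mentions to the number expected by chance."""
    expected = single[a] * single[b] / n_abstracts
    return pair[tuple(sorted((a, b)))] / expected


# Hypothetical gene mentions per abstract.
abstracts = [
    ["GAL4", "GAL80"], ["GAL4", "GAL80", "CYC1"], ["CYC1", "HAP1"],
    ["CYC7", "HAP1"], ["GAL4"], ["CYC1", "CYC7", "HAP1"],
]
single, pair = cooccurrence_counts(abstracts)
ratio = observed_over_expected("GAL4", "GAL80", single, pair, len(abstracts))
print(pair[("GAL4", "GAL80")], ratio)
```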
Databases used for text mining Corpora Medline OMIM, Online Mendelian Inheritance in Man SGD, Saccharomyces Genome Database The Interactive Fly These text sources are all parsed and converted into a unified format Gene synonyms Ensembl SwissProt HUGO LocusLink SGD TAIR Cross references and sequence comparison are used for merging
Natural Language Processing Gene and protein names Cue words for entity recognition Verbs for relation extraction [nxgene The GAL4 gene] [nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]
Multiple types of interactions
Transfer of evidence STRING “red” – COG mode Each node in the network represents a COG For each pair of COGs, the highest confidence score for each evidence type counts from each clade The scores are combined using naïve Bayes STRING “blue” – protein mode Each node in the network represents a single locus Evidence from other organisms is transferred based on fuzzy orthology The scores are combined using naïve Bayes
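Under the naïve Bayes assumption that the evidence types are independent, calibrated scores can be combined as one minus the probability that every evidence type is wrong. The sketch below shows only this basic form; any correction for prior probabilities is left out, and the function name and example scores are illustrative.

```python
def combine_scores(scores):
    """Combine independent confidence scores: 1 - prod(1 - s_i)."""
    all_wrong = 1.0
    for score in scores:
        if not 0.0 <= score <= 1.0:
            raise ValueError("scores must be probabilities between 0 and 1")
        all_wrong *= 1.0 - score
    return 1.0 - all_wrong


# Hypothetical calibrated scores for one protein pair.
combined = combine_scores([0.6, 0.4, 0.2])
print(combined)
```

Several weak pieces of evidence can thus add up to a confident association, which is the motivation for integrating evidence types and transferring evidence between organisms.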
 
 
Evidence transfer based on “fuzzy orthology” Orthology transfer is tricky Correct assignment of orthology is difficult for distant species Functional equivalence is not guaranteed for paralogs These problems are addressed by our “fuzzy orthology” scheme Functional equivalence scores are calculated from all-against-all alignment Evidence is distributed across possible pairs (figure: source species, target species)
The power of cross-species transfer and evidence integration
The big challenge
Prediction of “mode of action”
Summary Functional interaction networks are useful for predicting the biological role of a protein Many algorithms and types of data can be used for predicting functional interactions Each method must be benchmarked The different types of evidence should be integrated in a probabilistic scoring scheme To make the most of the available data, evidence should also be transferred between organisms
Acknowledgments Christian von Mering Jasmin Saric Berend Snel Sean Hooper Rossitza Ouzounova Samuel Chaffron Julien Lagarde Mathilde Foglierini Isabel Rojas Martijn Huynen Peer Bork

Protein function prediction

  • 1. Prediction of protein function Lars Juhl Jensen EMBL Heidelberg
  • 2. Overview Part 1 Homology-based transfer of annotation Function prediction from protein domains Part 2 Prediction of functional motifs from sequence Feature-based prediction of protein function Part 3 Prediction of functional interaction networks
  • 3. Why do we need to predict function?
  • 4. What do we mean by function? The concept “function” is not clearly defined A structural biologist, a cell biologist, and a medical doctor will have very different views Many levels of granularity For the overall definition of “function”, the knowledge and description can be more or less specific Functional categories are somewhat artificial People like to put things in boxes …
  • 6. Descriptions of protein function Controlled vocabularies Gene Ontology SwissProt keywords KEGG pathways EcoCyc pathways Interaction networks More accurate data models Reactome Systems Biology Markup Language (SBML)
  • 7. Molecular function Molecular function describes activities, such as catalytic or binding activities, at the molecular level GO molecular function terms represent activities rather than the entities that perform the actions, and do not specify where or when, or in what context, the action takes place Examples of broad functional terms are catalytic activity or transporter activity; an example of a narrower term is adenylate cyclase activity
  • 8. Biological process A biological process is a series of events accomplished by one or more ordered assemblies of molecular functions An example of a broad GO biological process term is signal transduction; examples of more specific terms are pyrimidine metabolism or alpha-glucoside transport It can be difficult to distinguish between a biological process and a molecular function
  • 9. Cellular component A cellular component is just that, a component of a cell that is part of some larger object It may be an anatomical structure (for example, the rough endoplasmic reticulum or the nucleus) or a gene product group (for example, the ribosome, the proteasome or a protein dimer) The cellular component categories are probably the best defined categories since they correspond to actual entities
  • 10. Homology-based transfer of annotation Lars Juhl Jensen EMBL Heidelberg
  • 11. Detection of homologs Pairwise sequence similarity searches BLAST (fastest) FASTA Full Smith-Waterman (most sensitive) Profile-based similarity searches PSI-BLAST Hidden Markov Models (HMMs) Sequence similarity should always be evaluated at the protein level
  • 13. Sequence similarity, sequence homology, and functional homology Sequence similarity means that the sequences are similar – no more, no less Sequence homology implies that the proteins are encoded by genes that share a common ancestry Functional homology means that two proteins from two organisms have the same function Sequence similarity or sequence homology does not guarantee functional homology
  • 15. Functional consequences of gene duplication Neofunctionalization One copy has retained the ancestral function and can be treated as a 1–to–1 ortholog (functional homolog) The other copy has changed its function and behaves much like a paralog Subfunctionalization Each copy has taken on a part of the ancestral function A functional homolog cannot be defined Each ortholog typically has the same molecular function in a different sub-process or location
  • 16. 1–to–1 orthology A single gene in one organism corresponds to a single gene in another organism These can generally be assumed to encode functionally equivalent proteins Same molecular function Same biological process Same localization 1–to–1 orthology is fairly common in prokaryotes and among very closely related organisms
  • 17. 1–to–many orthology A single gene in one organism corresponds to multiple genes in another organism Any mixture of neo- and sub-functionalizations can have occurred Typically same molecular function Often different biological process or sub-process Often different sub-cellular localization or tissue 1–to–many orthology is very common between simple model organisms and higher eukaryotes
  • 18. Many–to–many orthology Many genes in each organism have arisen from a single gene in their last common ancestor Different neo- and sub-functionalizations have likely taken place in each lineage Typically same molecular function Often different biological process or sub-process Often different sub-cellular localization or tissue Many–to–many orthology is common between higher eukaryotes that are distantly related
  • 19. Detection of orthologs Reconstruction of phylogenetic trees The theoretically most correct way Works for analyzing particular genes of interest Methods based on reciprocal matches What currently works at the genomic scale Manual curation Detection of very remote orthologs may require that knowledge of gene synteny and/or protein function is taken into account
  • 20. Construction of gene trees Identify the relevant proteins Sequence similarity and possibly additional information Construct a blocked multiple sequence alignment Use, for example, Muscle and Gblocks Reconstruct the most likely phylogenetic tree Use, for example, PhyML Orthologs and paralogs can be trivially extracted based on a gene tree
  • 21. Reciprocal matches Simple “best reciprocal match” is a bad choice Can only deal with one-to-one orthology Detection of in-paralogs Similarity higher within species than between species Orthologs can now be detected based on best reciprocal matches between in-paralogous groups One or more out-group organisms can optionally be used to improve the definition of orthologs
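The simple "best reciprocal match" idea named on this slide can be sketched as below. The score dictionaries and gene names are hypothetical; a real pipeline would start from BLAST or Smith-Waterman scores and add the in-paralog step, which this sketch leaves out.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring target gene.

    scores: dict {(query, target): similarity score} for one search direction.
    """
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def reciprocal_best_matches(a_vs_b, b_vs_a):
    """Return pairs (a, b) where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical similarity scores between two genes in species A and two in B.
a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 150,
          ("a2", "b1"): 180, ("a2", "b2"): 90}
b_vs_a = {("b1", "a1"): 200, ("b1", "a2"): 180,
          ("b2", "a1"): 150, ("b2", "a2"): 90}

pairs = reciprocal_best_matches(a_vs_b, b_vs_a)
print(pairs)
```

In this toy input only one pair is reciprocal, and a2 and b2 are left without a partner, which illustrates why the plain approach cannot represent 1–to–many or many–to–many orthology.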
  • 23. Orthologous groups Orthologs and paralogs are in principle always defined with respect to two organisms Orthologous groups instead try to encompass an entire set of organisms The “inclusiveness” of the orthologous groups depends on how broad a set of organisms the groups cover
  • 27. COGs, KOGs, and NOGs The COGs and KOGs were manually curated These were automatically expanded to more species Tri-clustering Detection of in-paralogs Identification of triangles of best reciprocal matches Merging of triangles that share an edge Broad phylogenetic coverage COGs and NOGs cover all three domains of life KOGs cover all eukaryotes
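The two tri-clustering steps listed on this slide (finding triangles of best reciprocal matches, then merging triangles that share an edge) might look roughly like this. It is a minimal sketch, not the COG procedure itself: the gene names, the species mapping, and the requirement that a triangle spans three different species are assumptions made for illustration, and in-paralog detection is omitted.

```python
from itertools import combinations


def find_triangles(edges, species_of):
    """Find triples of genes from three species that are all pairwise linked."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    triangles = set()
    for a, b in edges:
        for c in neighbors[a] & neighbors[b]:
            if len({species_of[a], species_of[b], species_of[c]}) == 3:
                triangles.add(frozenset((a, b, c)))
    return triangles


def merge_triangles(triangles):
    """Merge triangles that share an edge (two genes) into groups."""
    triangles = sorted(triangles, key=sorted)
    parent = list(range(len(triangles)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(triangles)), 2):
        if len(triangles[i] & triangles[j]) == 2:
            parent[find(i)] = find(j)

    groups = {}
    for i, triangle in enumerate(triangles):
        groups.setdefault(find(i), set()).update(triangle)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical best reciprocal matches between genes of species X, Y, Z, W.
species_of = {"x1": "X", "y1": "Y", "z1": "Z", "w1": "W",
              "x2": "X", "y2": "Y", "z2": "Z"}
edges = [("x1", "y1"), ("y1", "z1"), ("x1", "z1"),
         ("y1", "w1"), ("z1", "w1"),
         ("x2", "y2"), ("y2", "z2"), ("x2", "z2")]

groups = merge_triangles(find_triangles(edges, species_of))
print(groups)
```

Here the triangles x1-y1-z1 and y1-z1-w1 share the edge y1-z1 and end up in one group, while x2-y2-z2 stays separate.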
  • 30. Clustering based on similarity All-against-all sequence similarity is calculated A standard clustering method is applied to define groups of homologous genes TribeMCL Hierarchical clustering These methods generally detect groups of homologous genes, but are not good for distinguishing between orthologs and paralogs
  • 32. Meta-servers Since numerous methods exist for identifying groups of orthologous proteins, meta-servers have begun to emerge These can be very useful for “fishing expeditions” where one is looking for a remote ortholog of a particular protein of interest However, such meta-servers do not attempt to unify the different orthologous groups and are thus not useful for genome-wide studies
  • 33. Function prediction from protein domains Lars Juhl Jensen EMBL Heidelberg
  • 34. When homology searches fail Sometimes no orthologs or even paralogs can be identified by sequence similarity searches, or they are all of unknown function No functional information can thus be transferred based on simple sequence homology By instead analyzing the various parts that make up the complete protein, it is nonetheless often possible to predict the protein function
  • 35. Protein domains Many eukaryotic proteins consist of multiple globular domains that can fold independently These domains have been mixed and matched through evolution Each type of domain contributes towards the molecular function of the complete protein Numerous resources are able to identify such domains from sequence alone using HMMs
  • 44. Which domain resource should I use? SMART is focused on signal transduction domains Pfam is very actively developed and thus tends to have the most up-to-date domain collection InterPro is useful for genome annotation since the domains are annotated with GO terms CDD is conveniently integrated with the NCBI BLAST web interface
  • 45. Predicting globular domains and intrinsically disordered regions Not all globular domains have been discovered and the databases are thus not comprehensive Methods exist for predicting from sequence which regions are globular and which are disordered GlobPlot uses a simple propensity scale DisEMBL, DISOPRED, and PONDR all use ensembles of artificial neural networks Many disordered regions are important for protein function and they should thus not be ignored
  • 52. Summary Functional annotation Molecular function vs. biological process Inference of molecular function by sequence similarity Biological process only transferable between orthologs Detection of orthologs In-depth studies: phylogenetic trees Automated analysis: InParanoid and COG/KOG/NOG Profile searches for protein domains Each domain contributes a different molecular function
  • 53. Acknowledgments Christian von Mering Christopher Creevey Ivica Letunic Rune Linding Tobias Doerks Francesca Ciccarelli Berend Snel Martijn Huynen Toby Gibson Rob Russell Peer Bork
  • 54. Prediction of functional motifs from sequence Lars Juhl Jensen EMBL Heidelberg
  • 55. Proteins – more than just globular domains Transmembrane helices Disordered regions Eukaryotic linear motifs (ELMs) Modification sites, e.g. phosphorylation sites Ligand peptides, e.g. SH3 binding sites Targeting signals, e.g. nuclear localization sequences The short functional motifs are as important as the globular domains
  • 57. Databases of functional motifs Fewer and smaller databases General databases of motifs: ProSite and ELM Phosphorylation sites: Phospho.ELM and PhosphoSite These databases contain far fewer instances than protein domain databases Curation is more difficult Protein domain databases can be constructed based on analysis of protein sequences alone Short functional motifs must be curated based on experimental evidence
  • 62. Prediction of ELMs Most functional motifs are “information poor” Weak/short consensus sequences for ELMs The typical ELM only has three conserved residues Some variance is often allowed even for these ELMs are very hard to predict from sequence Simple consensus sequences match everywhere Even more advanced methods like PSSMs, ANNs, or SVMs give poor specificity The full information is not in the site itself
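A back-of-the-envelope calculation shows why a consensus with only three conserved residues "matches everywhere". The sketch assumes uniform and independent amino acid frequencies, which real proteins do not have, and the figure of ten million scanned positions is a hypothetical proteome size chosen for illustration.

```python
def expected_chance_matches(allowed_per_position, n_positions, alphabet_size=20):
    """Expected number of random matches to a consensus motif.

    allowed_per_position: number of accepted residues at each conserved
    position (1 = strictly conserved). Wildcard positions are left out
    because they match any residue.
    """
    probability = 1.0
    for allowed in allowed_per_position:
        probability *= allowed / alphabet_size
    return probability * n_positions


# Three strictly conserved residues, scanned over 10 million positions.
strict = expected_chance_matches([1, 1, 1], 10_000_000)
# Same motif, but two of the positions tolerate two alternative residues.
relaxed = expected_chance_matches([2, 1, 2], 10_000_000)
print(round(strict), round(relaxed))
```

Under these assumptions the strict motif is expected to match about 1250 times by chance and the relaxed one about 5000 times, which is typically far more than the number of known true instances.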
  • 67. Construction of data sets Compiling an initial data set Positive examples can be obtained from existing databases or curated from the literature Good negative examples are often harder to get Separate training and test sets A method may be able to learn the training examples but fail to generalize to new examples Homology reduction! It is crucial that there is no significant sequence similarity between examples in the training and test sets
  • 68. Machine learning Numerous algorithms exist Artificial neural networks Support vector machines Decision trees The choice of algorithm is not so important Providing the relevant input is important Having high-quality training data is crucial
  • 71. Kinase-specific prediction of phosphorylation sites (NetPhosK) Artificial neural networks (ANNs) were trained for several different kinases The sequence logos show only the positive examples Negative examples also provide information Also, ANNs and SVMs can capture correlations between positions
  • 72. Prediction of signal peptides from sequence (SignalP) Function Eukaryotic proteins are targeted to the ER Prokaryotic proteins are targeted for secretion Architecture Positively charged N-terminus Hydrophobic core Short, more polar region Cleavage site Signal peptides can be accurately predicted
  • 73. Machine learning can help identify errors in curated databases Some of the manually curated databases contain obvious errors that can be eliminated General “SIGNAL” errors Wrong signal peptide cleavage site The secreted protein is processed by proteases Signal peptide includes propeptide Wrong start codon used
  • 74. Signal peptide or propeptide
  • 75. Signal peptide or propeptide Propeptide cleavage Signal peptide cleavage
  • 77. Use of short linear motifs for function prediction Only a few motifs (mostly localization signals) can be predicted with high accuracy Even in these cases advanced machine learning methods are typically needed These can be treated in the same way as domains Most motifs are weak, and predictions should be approached with care To tell if these sites are likely to be true, one needs to consider the context An experiment is needed to prove that it is functional
  • 78. Feature-based prediction of protein function Lars Juhl Jensen EMBL Heidelberg
  • 79. Function prediction from post-translational modifications Proteins with similar function may not be related in sequence Still they must perform their function in the context of the same cellular machinery Similarities in features such as PTMs and physical/chemical properties could be expected for proteins with similar function
  • 80. The concept of ProtFun
  • 82. Function prediction on the human prion sequence
      ############## ProtFun 1.1 predictions ##############
      >PRIO_HUMAN
      # Functional category                 Prob    Odds
      Amino_acid_biosynthesis               0.020   0.909
      Biosynthesis_of_cofactors             0.032   0.444
      Cell_envelope                         0.146   2.393
      Cellular_processes                    0.053   0.726
      Central_intermediary_metabolism       0.130   2.063
      Energy_metabolism                     0.029   0.322
      Fatty_acid_metabolism                 0.017   1.308
      Purines_and_pyrimidines               0.528   2.173
      Regulatory_functions                  0.013   0.081
      Replication_and_transcription         0.020   0.075
      Translation                           0.035   0.795
      Transport_and_binding              => 0.831   2.027
      # Enzyme/nonenzyme                    Prob    Odds
      Enzyme                                0.250   0.873
      Nonenzyme                          => 0.750   1.051
      # Enzyme class                        Prob    Odds
      Oxidoreductase (EC 1.-.-.-)           0.070   0.336
      Transferase (EC 2.-.-.-)              0.031   0.090
      Hydrolase (EC 3.-.-.-)                0.057   0.180
      Lyase (EC 4.-.-.-)                    0.020   0.426
      Isomerase (EC 5.-.-.-)                0.010   0.313
      Ligase (EC 6.-.-.-)                   0.017   0.334
  • 83. ProtFun data sets Labeling of training and test data Cellular role categories: human SwissProt sequences were categorized using EUCLID Enzyme categories: top-level enzyme classifications were extracted from human SwissProt description lines Gene Ontology terms were transferred from InterPro The sequences were divided into training and test sets without significant sequence similarity Binary predictors were trained for each category
  • 84. Prediction performance on cellular role categories
  • 85. Prediction performance on enzyme categories
  • 86. Predictive performance on Gene Ontology categories
  • 87. Non-classical secretion Some proteins without N-terminal signal peptides are secreted via alternative secretion pathways Several growth factors, e.g. FGF1 and FGF2 Interleukin-1 beta HIV-1 Tat No consensus sequence motif is known Maybe they have some features in common with other secreted proteins …
  • 88. SecretomeP data sets Training and test set Positive examples: 3321 extracellular mammalian proteins with their signal peptides removed Negative examples: 3654 mammalian proteins from cytoplasm or nucleus Validation set 14 known non-classically secreted proteins
  • 89. Secreted proteins are typically small
  • 90. ROC plot for SecretomeP
  • 91. Similar properties of classically and non-classically secreted proteins
  • 93. A look into the black box Neural networks are often criticized for being a “black box” method However, there are several ways to investigate what a neural network ensemble has learned Which fraction of the ensemble uses a certain feature? How good a performance can be attained using each of the features individually? How much does performance decrease if the neural networks are retrained without a certain feature (or combination of features)?
  • 98. ProtFun performance for other organisms Our predictors work in general for eukaryotes Best performance on metazoan proteins Some categories work quite well for prokaryotes Most metabolism categories Transport and binding While other categories fail Energy metabolism Regulatory functions
  • 99. Mapping category performances onto input features
  • 100. Performance contribution of sequence-derived features The correlations between features and function are conserved for eukaryotes Some correlations extend to archaea and bacteria Physical/chemical properties Secondary structure and transmembrane helices Other correlations only hold for eukaryotes PTMs and subcellular localization features
  • 101. Evolution conserves protein features and function Protein features are more conserved between orthologs than paralogs This leads to ProtFun predicting orthologs to be more likely to share function than paralogs That prediction is fully consistent with the notion that it is best to infer function from orthologous proteins
  • 102. Conclusions Short linear motifs are likely equally important for protein function as the large well-studied domains These are much harder to predict from sequence Reasonable accuracy can be obtained by applying machine learning methods on high-quality datasets Many classes of proteins can be predicted based on such sequence-derived protein features These methods are not nearly as reliable as homology However, often they are the only option
  • 103. Acknowledgments Ramneek Gupta Can Kesmir Jannick Dyrløv Bendtsen Henrik Nielsen Nikolaj Blom Francesca Diella Rune Linding Damien Devos Alfonso Valencia Søren Brunak Toby Gibson
  • 104. Prediction of functional interaction networks Lars Juhl Jensen EMBL Heidelberg
  • 105. What is an interaction? Physical protein interactions Proteins that physically touch each other Members of the same stable complex Transient interactions, e.g. a kinase and its substrate The pragmatic definition – whatever the assay in question can measure Functional interactions Neighbors in metabolic networks Members of the same pathway
  • 106. The use of interaction networks for function prediction A functional interaction implies that two proteins are involved in the same biological process However, the networks do not divide proteins into a predefined set of functional classes such as the Gene Ontology terms Functional associations do not require homology to proteins of known function, and can complement the predictions even when homology is present
  • 110. Evidence types Genomic context methods Phylogenetic profiles, gene neighborhood, and fusion Primary experimental data Physical protein interactions and gene expression data Manually curated databases Pathways and protein complexes Automatic literature mining Co-occurrence and Natural Language Processing
  • 116. Formalizing the phylogenetic profile method Align all proteins against all Calculate best-hit profile Join similar species by PCA Calculate PC profile distances Calibrate against KEGG maps
  • 118. Gene neighborhood Identify runs of adjacent genes with the same direction Score each gene pair based on intergenic distances Calibrate against KEGG maps Infer associations in other species
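The first step on this slide, identifying runs of adjacent genes with the same direction, can be sketched as follows. The gene coordinates are invented, and the fixed `max_gap` cutoff of 300 bp is a hypothetical simplification: the slide describes scoring each gene pair by intergenic distance rather than applying a hard threshold.

```python
def same_strand_runs(genes, max_gap=300):
    """Group adjacent genes on the same strand into runs.

    genes: list of (name, start, end, strand) tuples for one chromosome.
    max_gap: largest intergenic distance (bp) allowed within a run;
             a hypothetical cutoff used only for this illustration.
    """
    genes = sorted(genes, key=lambda gene: gene[1])
    runs, current = [], []
    for gene in genes:
        if current:
            previous = current[-1]
            gap = gene[1] - previous[2]
            if gene[3] == previous[3] and gap <= max_gap:
                current.append(gene)
                continue
            runs.append(current)
        current = [gene]
    if current:
        runs.append(current)
    return [[gene[0] for gene in run] for run in runs]


# Hypothetical gene coordinates.
genes = [("geneA", 100, 900, "+"), ("geneB", 950, 1800, "+"),
         ("geneC", 1850, 2500, "-"), ("geneD", 2600, 3300, "-"),
         ("geneE", 5000, 5600, "-")]

runs = same_strand_runs(genes)
print(runs)
```

geneE shares its strand with geneD but sits 1700 bp away, so it starts a new run; pairs within a run would then be the candidates for scoring and calibration.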
  • 120. Gene fusion Find in A genes that match the same gene in B Exclude overlapping alignments Calibrate against KEGG maps Calculate all-against-all pairwise alignments
  • 121. Calibration of quality scores Different pieces of evidence are not directly comparable A different raw quality score is used for each evidence type Quality differences exist among data sets of the same type Solved by calibrating all scores against a common reference The accuracy relative to a “gold standard” is calculated within score intervals The resulting points are approximated by a sigmoid
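The step "accuracy relative to a gold standard is calculated within score intervals" can be illustrated with a small sketch. The pair names, raw scores, and gold-standard labels are made up, and equal-sized bins of ranked pairs are an assumption about how the intervals are formed. Fitting a sigmoid through the resulting points would be the next step and is not shown here.

```python
def accuracy_per_interval(scored_pairs, gold_standard, bin_size):
    """Compute (mean raw score, accuracy) for consecutive bins of ranked pairs.

    scored_pairs: list of (pair, raw_score).
    gold_standard: dict {pair: True if both proteins share a pathway}.
                   Pairs missing from it cannot be benchmarked and are skipped.
    """
    ranked = sorted(
        (item for item in scored_pairs if item[0] in gold_standard),
        key=lambda item: item[1],
        reverse=True,
    )
    points = []
    for start in range(0, len(ranked), bin_size):
        chunk = ranked[start:start + bin_size]
        mean_score = sum(score for _, score in chunk) / len(chunk)
        accuracy = sum(gold_standard[pair] for pair, _ in chunk) / len(chunk)
        points.append((mean_score, accuracy))
    return points


# Hypothetical raw scores and gold-standard labels.
gold = {("a", "b"): True, ("a", "c"): True, ("b", "c"): True,
        ("a", "d"): False, ("b", "d"): False, ("c", "d"): False}
scored = [(("a", "b"), 0.9), (("a", "c"), 0.8), (("b", "c"), 0.6),
          (("a", "d"), 0.5), (("b", "d"), 0.2), (("c", "d"), 0.1),
          (("x", "y"), 0.95)]  # not in the gold standard, so ignored

points = accuracy_per_interval(scored, gold, bin_size=2)
accuracies = [accuracy for _, accuracy in points]
print(accuracies)
```

Because every evidence type is mapped onto the same accuracy scale, scores from different methods become directly comparable after this step.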
  • 123. Protein-protein interaction databases Imported databases BIND, Biomolecular Interaction Network Database DIP, Database of Interacting Proteins GRID, General Repository for Interaction Datasets HPRD, Human Protein Reference Database MINT, Molecular Interactions Database Databases to be added IntAct PDB
  • 124. Physical protein interactions Make binary representation of complexes Yeast two-hybrid data sets are inherently binary Calculate score from number of (co-)occurrences Calculate score from non-shared partners Calibrate against KEGG maps Infer associations in other species Combine evidence from experiments
  • 125. Binary representations of purification data
  • 126. Topology based quality scores Scoring scheme for yeast two-hybrid data: $S_1 = -\log\big((N_1+1)\cdot(N_2+1)\big)$ $N_1$ and $N_2$ are the numbers of non-shared interaction partners Similar scoring schemes have been published by Saito et al. Scoring scheme for complex pull-down data: $S_2 = \log\left[\frac{N_{12}\cdot N}{(N_1+1)\cdot(N_2+1)}\right]$ $N_{12}$ is the number of purifications containing both proteins $N_1$ is the number containing protein 1, $N_2$ is defined similarly $N$ is the total number of purifications Both schemes aim at identifying ubiquitous interactors
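The two scoring schemes on this slide translate directly into code. The slide does not state the base of the logarithm, so the natural logarithm is an assumption here; the function names and example counts are likewise hypothetical.

```python
import math


def y2h_score(n1, n2):
    """S1 for yeast two-hybrid data.

    n1, n2: numbers of non-shared interaction partners of the two proteins.
    """
    return -math.log((n1 + 1) * (n2 + 1))


def pulldown_score(n12, n1, n2, n_total):
    """S2 for complex pull-down data.

    n12: purifications containing both proteins (must be at least 1).
    n1, n2: purifications containing protein 1 and protein 2, respectively.
    n_total: total number of purifications.
    """
    return math.log((n12 * n_total) / ((n1 + 1) * (n2 + 1)))


# A pair with no other partners scores highest under S1 ...
specific = y2h_score(0, 0)
# ... while a pair involving promiscuous proteins is penalized.
promiscuous = y2h_score(9, 4)
# Two proteins seen together in all 5 of their purifications, out of 72.
together = pulldown_score(5, 5, 5, 72)
print(specific, promiscuous, together)
```

Both scores drop as the proteins accumulate partners or purifications that are not shared, which is how interactions involving ubiquitous proteins end up with low confidence.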
  • 127. Mining microarray expression databases Re-normalize arrays by modern method to remove biases Build expression matrix Combine similar arrays by PCA Construct predictor by Gaussian kernel density estimation Calibrate against KEGG maps Infer associations in other species
  • 128. Databases of curated knowledge Pathway databases BioCarta KEGG, Kyoto Encyclopedia of Genes and Genomes Reactome STKE, Signal Transduction Knowledge Environment Curated protein complexes MIPS, Munich Information center for Protein Sequences Databases to be added Gene Ontology annotation
  • 129. Co-occurrence in the scientific texts Associate abstracts with species Identify gene names in title/abstract Count (co-)occurrences of genes Test significance of associations Calibrate against KEGG maps Infer associations in other species
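The counting step of the co-occurrence approach can be sketched as below. The abstracts are invented sentences, and matching whole tokens against a flat name list is a simplification that ignores synonyms, species assignment, and the significance test mentioned on the slide.

```python
import re
from collections import Counter
from itertools import combinations


def count_cooccurrences(abstracts, gene_names):
    """Count in how many abstracts each gene and each gene pair is mentioned."""
    names = {name.upper() for name in gene_names}
    single, pair = Counter(), Counter()
    for text in abstracts:
        tokens = set(re.findall(r"[A-Za-z0-9]+", text.upper()))
        found = sorted(tokens & names)
        single.update(found)
        pair.update(combinations(found, 2))
    return single, pair


# Hypothetical abstracts and gene dictionary.
abstracts = [
    "HAP1 controls expression of CYC1 and CYC7.",
    "CYC1 is regulated by HAP1.",
    "GAL4 activates transcription.",
]
single, pair = count_cooccurrences(abstracts, ["GAL4", "CYC1", "CYC7", "HAP1"])
print(single["HAP1"], pair[("CYC1", "HAP1")])
```

The single counts are needed alongside the pair counts because a pair of frequently mentioned genes will co-occur by chance more often than a pair of rarely mentioned ones.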
  • 130. Databases used for text mining Corpora Medline OMIM, Online Mendelian Inheritance in Man SGD, Saccharomyces Genome Database The Interactive Fly These text sources are all parsed and converted into a unified format Gene synonyms Ensembl SwissProt HUGO LocusLink SGD TAIR Cross references and sequence comparison are used for merging
  • 131. Gene and protein names Cue words for entity recognition Verbs for relation extraction [ nxgene The GAL4 gene ] [ nxexpr The expression of [ nxgene the cytochrome genes [ nxpg CYC1 and CYC7 ]]] is controlled by [ nxpg HAP1 ] Natural Language Processing
  • 132. Multiple types of interactions
  • 133. Transfer of evidence STRING “red” – COG mode Each node in the network represents a COG For each pair of COGs, the highest confidence score for each evidence type counts from each clade The scores are combined using naïve Bayes STRING “blue” – protein mode Each node in the network represents a single locus Evidence from other organisms is transferred based on fuzzy orthology The scores are combined using naïve Bayes
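The slide does not spell out the combination formula. One simple reading of a naïve Bayes style combination, assuming the calibrated scores are probabilities from independent evidence types and ignoring any correction for prior probability, is that an association is wrong only if every evidence type is wrong. The function name and the example scores are hypothetical.

```python
def combine_scores(scores):
    """Combine independent confidence scores as 1 - prod(1 - s_i)."""
    probability_all_wrong = 1.0
    for score in scores:
        probability_all_wrong *= 1.0 - score
    return 1.0 - probability_all_wrong


# Two moderately confident, independent pieces of evidence.
combined = combine_scores([0.5, 0.5])
print(combined)
```

Under this assumption the combined score is never lower than the best individual score, so weak evidence from several sources can add up to a confident association.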
  • 136. Evidence transfer based on “fuzzy orthology” Orthology transfer is tricky Correct assignment of orthology is difficult for distant species Functional equivalence is not guaranteed for paralogs These problems are addressed by our “fuzzy orthology” scheme Functional equivalence scores are calculated from all-against-all alignment Evidence is distributed across possible pairs
  • 137. The power of cross-species transfer and evidence integration
  • 144. Prediction of “mode of action”
  • 145. Summary Functional interaction networks are useful for predicting the biological role of a protein Many algorithms and types of data can be used for predicting functional interactions Each method must be benchmarked The different types of evidence should be integrated in a probabilistic scoring scheme To make the most of the available data, evidence should also be transferred between organisms
  • 146. Acknowledgments Christian von Mering Jasmin Saric Berend Snel Sean Hooper Rossitza Ouzounova Samuel Chaffron Julien Lagarde Mathilde Foglierini Isabel Rojas Martijn Huynen Peer Bork