Gene Ontology
Slimming tips
Val Wood, GO Consortium meeting, Cambridge, Oct 2017
Common Uses of GO Slims
Whole genome slims
● Provide a summary of an organism’s biology
● As a resource to plan curation (unannotated genes, intersections); to identify “unknown/uncharacterised genes”
■ Need to be biologically relevant; reduced redundancy is better
■ Need as complete coverage as possible
Single gene overview (Alliance of Genome Resources ribbon)
○ Informs the database user of the branches of GO applicable to a single gene product (filter)
■ Usually higher-level general grouping terms (redundancy is less critical)
Slimming results sets (subsets of genes)
● To interpret analysis (slimming prior to enrichment helps to interpret results, orientation)
● Summarize/display experimental results: smallest possible number of terms, but specific enough to convey results
● Taxon-specific slim
● There is no “one size fits all”; different slims for different use cases.
● With a ‘generic slim’ we should provide instructions on how to refine it
Coverage 1: Only slim one aspect at a time!
All 3 aspects “unknown” = 103
Biological Process unknown = 750 (103+195+23+429)
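The arithmetic on this slide can be reproduced with plain sets. The following is a minimal sketch, assuming hypothetical gene IDs; only the four region sizes (103, 195, 23, 429) come from the slide, and which Venn region each of the last three belongs to is an assumption, since the figure itself is not available.

```python
# Hypothetical gene IDs; only the region sizes come from the slide.
def make_ids(prefix, n):
    return {f"{prefix}{i}" for i in range(n)}

all_three = make_ids("all3_", 103)   # unknown in BP, MF and CC
bp_and_mf = make_ids("bp_mf_", 195)  # assumed: unknown in BP and MF only
bp_and_cc = make_ids("bp_cc_", 23)   # assumed: unknown in BP and CC only
bp_only = make_ids("bp_", 429)       # assumed: unknown in BP only

# Genes with no informative annotation, per aspect.
unknown = {
    "BP": all_three | bp_and_mf | bp_and_cc | bp_only,
    "MF": all_three | bp_and_mf,
    "CC": all_three | bp_and_cc,
}

bp_unknown = len(unknown["BP"])
all_aspects_unknown = len(unknown["BP"] & unknown["MF"] & unknown["CC"])
print(bp_unknown, all_aspects_unknown)
```

Counting a gene as unknown only when all three aspects are empty would report 103 genes and hide the other 647 whose biological process is still unknown, which is why each aspect is slimmed separately.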
Coverage 2: Distinguish unannotated/unknown/unslimmed
Pombe using PomBase slim
● Unannotated = IDs not recognised by the slim tool (i.e. not in the GO database)
● Unknown = IDs with no non-root annotations
● Unslimmed = IDs with non-root annotations, none of which are in the slim
GO Term Mapper (http://go.princeton.edu/cgi-bin/GOTermMapper) will provide all 3 numbers (and can use your own slim)
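The three categories on this slide can be told apart programmatically. The sketch below is an illustration, not the GO Term Mapper implementation: the function name, the toy term names and the `ancestors` lookup are all hypothetical, and only the three root term IDs are real GO identifiers.

```python
# Root terms of the three GO aspects (BP, MF, CC).
ROOTS = {"GO:0008150", "GO:0003674", "GO:0005575"}

def classify(gene_ids, annotations, ancestors, slim_terms):
    """Sort each identifier into one of four buckets for a given slim."""
    buckets = {"unannotated": [], "unknown": [], "unslimmed": [], "slimmed": []}
    for gene in gene_ids:
        if gene not in annotations:
            # ID not recognised in the annotation set at all.
            buckets["unannotated"].append(gene)
            continue
        # Expand direct annotations to include all ancestor terms.
        terms = set()
        for term in annotations[gene]:
            terms.add(term)
            terms |= ancestors.get(term, set())
        if terms <= ROOTS:
            buckets["unknown"].append(gene)    # root-only annotation
        elif terms & slim_terms:
            buckets["slimmed"].append(gene)    # maps to at least one slim term
        else:
            buckets["unslimmed"].append(gene)  # annotated, but outside the slim
    return buckets

# Toy data with made-up term names (assumption, for illustration only).
annotations = {
    "geneA": {"translation"},
    "geneB": {"GO:0008150"},
    "geneC": {"detoxification"},
}
ancestors = {
    "translation": {"gene expression", "GO:0008150"},
    "detoxification": {"GO:0008150"},
}
result = classify(["geneA", "geneB", "geneC", "geneD"],
                  annotations, ancestors, {"gene expression"})
print(result)
```

Keeping "unknown" and "unslimmed" in separate buckets matters because only the second one measures a gap in the slim; the first measures a gap in biological knowledge.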
Coverage 3: Minimise “unslimmed”
Pombe using AGR slim:
These 365 identifiers were not annotated in the slim, but they had non-root annotations that were not in the slim:
These 734 identifiers had no non-root annotations:
Total 1099 un-slimmed (365 + 734)
This number should be small in a slim with good coverage
Pombe using PomBase slim
It is difficult to define a slim to cover all annotated gene products without including terms with:
■ very small numbers of annotations,
■ or high level or biologically uninformative terms
Relevance 1: A balance between coverage and content
PomBase AGR slim summary in matrix http://amigo.geneontology.org/matrix (terms are on both axes, totals on the diagonal)
Some terms are not biologically informative for a generic slim because they can apply to *any* biological process
Indicated by intersections with every process
Matrix labels: Information content low; Non-specific; Exact subset; OK
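A term-by-term matrix like the one described on this slide is straightforward to compute from gene sets. The sketch below is an assumption-laden illustration (it is not the AmiGO matrix tool): function names and the toy gene sets are hypothetical. It builds the intersection counts with totals on the diagonal and flags exact subsets.

```python
from itertools import product

def overlap_matrix(term_genes):
    """matrix[a][b] = genes shared by terms a and b; matrix[a][a] = total for a."""
    return {a: {b: len(term_genes[a] & term_genes[b]) for b in term_genes}
            for a in term_genes}

def exact_subsets(matrix):
    """Pairs (a, b) where every gene annotated to a is also annotated to b."""
    return [(a, b) for a, b in product(matrix, repeat=2)
            if a != b and matrix[a][a] > 0 and matrix[a][b] == matrix[a][a]]

# Toy gene sets (hypothetical).
term_genes = {
    "development": {"g1", "g2", "g3"},
    "cell differentiation": {"g1", "g2"},
    "transport": {"g4"},
}
matrix = overlap_matrix(term_genes)
subsets = exact_subsets(matrix)
print(matrix["development"]["cell differentiation"], subsets)
```

A term that intersects every other row at a high count is a candidate "non-specific" term; a pair returned by `exact_subsets` is a candidate for dropping the narrower term from a generic slim.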
Relevance 1: Avoid going “too high”
Broad groupings
● Good for ribbon diagrams (display)
● Not good for summarizing biology
“Response to stimulus” is not very informative about biology (but covers >8000 (33%) mouse gene products)
“Regulation of biological process”: 50%
Mouse AGR slim in matrix http://amigo.geneontology.org/matrix (terms are on both axes, totals on the diagonal)
Relevance 2: Minimising overlaps (redundancy)
Because many gene products are annotated to multiple terms, it is not possible to create a slim with no overlaps.
If removing a term doesn’t change the number of gene products slimmed, it might not be so useful for a slim.
Complete subset terms should be avoided
Mouse using AGR slim and GO term slim mapper
Your input list contains 22928 genes.
These 2037 identifiers were found to be unannotated:
These 420 identifiers were not annotated in the slim, but they had non-root annotations that were not in the slim:
These 4037 identifiers had no non-root annotations:
● Deleting “cell differentiation” loses 0 (descendant of development).
● Deleting “cell proliferation” loses only 5, most covered by development and cell cycle.
● Deleting “regulation of biological process” only loses 51, even though over half of the proteins (11658) are annotated to it.
After all three deletions (420 + 0 + 5 + 51 = 476):
These 476 identifiers were not annotated in the slim, but they had non-root annotations that were not in the slim:
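The term-deletion test on this slide amounts to a leave-one-out comparison. A minimal sketch, assuming a hypothetical mapping from slim term to the set of gene products it covers (the gene IDs and set contents are invented for illustration, and the function names are not from any GO tool):

```python
def slimmed_genes(term_genes, slim_terms):
    """Union of all gene products covered by the given slim terms."""
    covered = set()
    for term in slim_terms:
        covered |= term_genes.get(term, set())
    return covered

def loss_on_removal(term_genes, slim_terms):
    """For each slim term, count the genes that would become unslimmed without it."""
    full = slimmed_genes(term_genes, slim_terms)
    return {term: len(full - slimmed_genes(term_genes, slim_terms - {term}))
            for term in slim_terms}

# Toy data (hypothetical).
term_genes = {
    "development": {"g1", "g2", "g3", "g6"},
    "cell differentiation": {"g1", "g2"},
    "cell cycle": {"g3", "g4", "g7"},
    "regulation of biological process": {"g1", "g4", "g5"},
}
losses = loss_on_removal(term_genes, set(term_genes))
print(losses)
```

A loss of zero, as for "cell differentiation" here, marks a term that adds no coverage. A small loss for a very large term suggests the term is broad but mostly redundant with the rest of the slim.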
Relevance 3: Lumping vs. splitting with common parent
Very little intersection between/within modules
Largely unconnected, but have a common parent in the GO:
● nuclear and mitochondrial gene expression
● transmembrane, vesicle-mediated and nucleocytoplasmic transport
Relevance 3: Value of splitting, example with real data
Gene expression
From Hayles et al., “A genome-wide resource of cell cycle and cell shape genes of fission yeast.”
Relevance 3: Lumping vs. splitting with common parent
Current PomBase slim in matrix: overlaps low, information content high
Zero overlaps between vesicle-mediated transport, nucleocytoplasmic transport and transmembrane transport; they are not biologically connected.
Relevance 4: Avoid single step processes
● GO:0016570 histone modification
● GO:0006468 protein phosphorylation
● GO:0006470 protein dephosphorylation
● GO:0043543 protein acylation
● GO:0016310 phosphorylation
● GO:0016311 dephosphorylation
● GO:0055114 oxidation-reduction process
● GO:0006464 cellular protein modification process
● GO:0043086 negative regulation of catalytic activity
All are examples of molecular function grouping terms in the BP ontology.
They are not informative about physiological role, only biochemical role.
For this reason “protein metabolism”, the ancestor of the protein modification terms, should also be avoided in the generic slim.
Proposed Iterative procedure
Evaluate individual species coverage of existing generic slim (BP)
What is missing? Add terms to cover
Evaluate species coverage
Which terms could be removed without affecting coverage? Remove
Test (evaluate species coverage changes)
What is missing? Add terms to cover
Evaluate species coverage
Which terms should be split to improve biological relevance? Split
Check coverage was not affected (or recommend improved annotation specificity)
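The add/remove part of this procedure can be sketched as a loop. This is an illustration under strong assumptions: it uses coverage as the only criterion, the function and variable names are hypothetical, and the "split" step is left out because it depends on a curator's judgement about biological relevance.

```python
def covered(term_genes, slim):
    """Union of gene products covered by the slim terms."""
    genes = set()
    for term in slim:
        genes |= term_genes.get(term, set())
    return genes

def refine_slim(term_genes, slim, candidates, annotated, max_rounds=10):
    """Alternate 'add what is missing' and 'remove what is redundant' until stable."""
    slim = set(slim)
    for _ in range(max_rounds):
        before = set(slim)
        # Add: the candidate term covering the most still-uncovered genes.
        missing = annotated - covered(term_genes, slim)
        best = max(sorted(candidates - slim),
                   key=lambda t: len(term_genes.get(t, set()) & missing),
                   default=None)
        if best is not None and term_genes.get(best, set()) & missing:
            slim.add(best)
        # Remove: terms whose removal leaves coverage unchanged.
        for term in sorted(slim):
            if covered(term_genes, slim - {term}) == covered(term_genes, slim):
                slim.discard(term)
        if slim == before:
            break
    return slim

# Toy data (hypothetical gene IDs and gene sets).
term_genes = {
    "development": {"g1", "g2", "g3"},
    "cell differentiation": {"g1", "g2"},
    "transport": {"g4"},
    "detoxification": {"g5"},
}
final = refine_slim(term_genes,
                    slim={"development", "cell differentiation"},
                    candidates={"transport", "detoxification"},
                    annotated={"g1", "g2", "g3", "g4", "g5"})
print(sorted(final))
```

In practice each automated suggestion would still need review, since a term can be redundant for coverage and yet be the more informative one to keep.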
Spares
Possible changes to evaluate
Remove
● cell proliferation
● cell differentiation
● cellular component organization and biogenesis
● RNA processing (see Gene expression)
● regulation of biological process
Add
● cytoskeleton organization
● chromatin organization
● ribosome biogenesis (>1000 annot)
● tRNA metabolic process (1157 annot)
● gene expression (includes translation)
Not covered currently
● detoxification
● amino acid metabolic process
● vitamin metabolic process
● cofactor metabolic process
(or small molecule?)
