A Centralized Model Organism Database (CMOD)
for the Long Tail of Genomes
ABSTRACT
Andrew I. Su, Benjamin M. Good, Chinmay Naik and Adriel Carolino
The Scripps Research Institute, La Jolla, California, USA
Background
How Gene Wiki?
We acknowledge support from the National
Institute of General Medical Sciences
(GM089820 and GM083924).
CONTACT
Benjamin Good: bgood@scripps.edu, @bgood
Andrew Su: asu@scripps.edu, @andrewsu
The CMOD vision
GENE WIKI EXAMPLE
FUNDING
Progress and status
CONCLUSION
One: structure from text mining
The Dark Matter of genome annotation
We need more hands on deck! We have
multiple positions open for postdocs and
programmers interested in crowdsourcing
and bioinformatics projects (like CMOD)!
[Figure: log-scale y-axis from 1 to 1,000,000; x-axis years 1997 to 2025; series: Bacteria, Eukaryotes, Archaea]
Model organism databases (MODs) are fantastic resources for
organizing genomic information for commonly-studied
organisms. To facilitate the creation and maintenance of MODs,
the Generic Model Organism Database (GMOD) Project
provides “a set of interoperable open-source software
components for visualizing, annotating, and managing biological
data.”
Provide a database of the
world’s knowledge that
anyone can edit.
- Denny Vrandečić
Despite the obvious success and value of GMOD, the number of
sequenced genomes is growing exponentially. Does this model
scale with the rate of genome sequencing?
Figure courtesy Scott Cain
Wikidata (http://wikidata.org) is an innovative and important
new tool for community-based knowledge management.
Wikidata is supported by the Wikimedia Foundation, which also
operates Wikipedia. In short, Wikidata is to structured data what
Wikipedia is to free text.
Model organism databases are fantastic resources for
genomics researchers. But relatively few model organisms
have stable funding for their database, and the number of
sequenced genomes is increasing exponentially. It seems
impractical to create and fund a model organism database
for each of them. Here, we describe our efforts to build a
Centralized Model Organism Database (CMOD), a single
online resource to support all genomes and organisms. To
scale to the Long Tail of Genomes, CMOD employs an open
editing model in which the entire research community is
empowered to edit and maintain genomic data. We
describe our efforts to systematically populate CMOD with
two core data types across all organisms – genome
annotations and Gene Ontology annotations.
We propose to build a Centralized Model Organism Database
(CMOD), which would house gene and genome annotations for
all genomes. This database would be based on Wikidata,
enabling it to be community-curated, continuously-updated, and
computer-readable.
CMOD
Gene and genome annotations
CMOD data can be accessed using a number of mechanisms.
The Wikidata web interface offers convenient access using a web
browser. The Wikidata application programming interface (API)
and associated programming libraries allow programmers and
bioinformaticians computational access to the data. Wikidata
export to RDF offers compatibility with the Semantic Web and
Linked Data. We also envision that many popular GMOD tools,
including GBrowse, JBrowse, and WebApollo, can be
modified to use CMOD as the back-end data warehouse.
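
As a rough illustration of the API route described above, the sketch below builds a request to the Wikidata `wbgetentities` action and pulls English labels out of the JSON it returns. It is a minimal sketch, not the poster authors' code: the helper names (`build_entity_url`, `english_labels`, `fetch_entities`) and the User-Agent string are hypothetical, and any item IDs passed in are placeholders rather than actual CMOD gene items.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Public Wikidata API endpoint.
WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def build_entity_url(entity_ids):
    """Build a wbgetentities URL for the given item IDs (e.g. ["Q1", "Q2"])."""
    params = {
        "action": "wbgetentities",
        "ids": "|".join(entity_ids),   # the API accepts pipe-separated IDs
        "props": "labels|claims",      # labels plus statements
        "format": "json",
    }
    return WIKIDATA_API + "?" + urlencode(params)


def english_labels(response):
    """Map each entity ID in a parsed wbgetentities response to its English label."""
    labels = {}
    for entity_id, entity in response.get("entities", {}).items():
        label = entity.get("labels", {}).get("en", {}).get("value")
        if label is not None:
            labels[entity_id] = label
    return labels


def fetch_entities(entity_ids):
    """Fetch and parse entities over the network (hypothetical User-Agent)."""
    request = Request(
        build_entity_url(entity_ids),
        headers={"User-Agent": "cmod-poster-example/0.1 (illustrative only)"},
    )
    with urlopen(request) as reply:
        return json.load(reply)
```

A caller would typically run `english_labels(fetch_entities([...]))` with the item IDs of interest; the URL builder and the parser are kept separate so they can be exercised without network access.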
Wikidata
Wikidata currently catalogs over 14 million entities, and
describes those entities in the form of 27 million statements.
This knowledgebase is the product of over 50 million edits. Of
those edits, ~90% are contributed by bots that predominantly
import data from structured resources, and ~10% are contributed
by human editors.
This seminal paper identified 517 operons and 103 small
regulatory RNAs in Listeria monocytogenes, an important human
pathogen. Unfortunately, these annotations cannot be
downloaded from the Broad’s “Listeria monocytogenes
Database”, nor NCBI Genome, nor UCSC’s Microbial Genome
Browser, nor EnsemblBacteria, nor any GMOD instance. The
only place they are available is from the Supplementary
information on the Nature website in PDF format.
We have loaded gene and genome annotation data for ~1000
human genes, the human proteins they encode, and their mouse
orthologs according to the data model shown above. The code
repository for managing these data is available at
https://bitbucket.org/sulab/wikidatagenebot.
The Skeptic’s Corner
Will CMOD scale with the exponential growth in sequenced
genomes? Yes, because there is no gatekeeper to adding new
content. Anyone is empowered to directly contribute. Even
though the technical infrastructure is centralized, the data
management is highly distributed.
Who will contribute to CMOD? We envision a wide spectrum of
contributors, from large biocuration/annotation centers adding
large data sets, to individual bioinformaticians who deposit
structured versions of previously unstructured data, to individual
scientists contributing individual annotations.
Will CMOD content be trustworthy? As with Wikipedia, we expect
that Wikidata overall will asymptotically approach perfect
accuracy and completeness. Moreover, because provenance is a
core part of the data model, the presence/absence/type of the
reference can be used to systematically filter the knowledgebase
according to each user’s needs.
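
The provenance-based filtering mentioned above could look something like the following sketch. It assumes the standard Wikidata entity JSON layout, in which each statement under `claims` may carry a `references` list and each reference holds its `snaks` keyed by property ID. The function name `statements_with_references` is hypothetical, and the choice of P248 ("stated in") as a marker of a curated source is an illustrative assumption, not a policy stated on the poster.

```python
def statements_with_references(entity, required_property=None):
    """Return (property_id, statement) pairs that carry provenance.

    entity: one parsed Wikidata entity (a dict with a "claims" mapping).
    required_property: if given, keep only statements with at least one
        reference that uses this property (e.g. "P248", "stated in");
        if None, keep any statement that has at least one reference.
    """
    kept = []
    for property_id, statements in entity.get("claims", {}).items():
        for statement in statements:
            references = statement.get("references", [])
            if required_property is None:
                has_provenance = bool(references)
            else:
                has_provenance = any(
                    required_property in reference.get("snaks", {})
                    for reference in references
                )
            if has_provenance:
                kept.append((property_id, statement))
    return kept
```

A user who trusts only statements citing a specific source type would pass that property ID, while a more permissive user could accept any referenced statement and drop only the unreferenced ones.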
Managing genomic information and knowledge is a critical
challenge for biomedical research. Community infrastructure
that allows individuals to collaboratively and collectively organize
knowledge has the potential to be an enabling technology in
biological research. Here, we propose CMOD as one such
application that is particularly focused on the Long Tail of
sequenced genomes.
Cumulative number of sequenced genomes

Centralized Model Organism Database (Biocuration 2014 poster)
