A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes
Andrew Su, Ph.D.
@andrewsu
asu@scripps.edu
http://sulab.org

January 16, 2014

GMOD 2014

2

Why am I giving this keynote?
3

Harnessing the crowd…

http://www.flickr.com/photos/portland_mike/6140660504/
… to organize information
4

http://www.flickr.com/photos/45697441@N00/6629580443
My simplified history of MODs

5
My simplified history of MODs

6
GMOD is widely used

199 (!) organizations listed as GMOD users

7
Does the current model scale?

8
Does the current model scale?

9
Does the current model scale?

11
The Long Tail of genomic data is being lost

Identified 517 operons and 103 small regulatory RNAs...

12
The Long Tail of genomic data is being lost

Identified 517 operons and 103 small regulatory RNAs...

13
At least you can download structured data…

14
Centralized Model Organism Database concept

CMOD

15
16

GMOD as a Service (GaaS)

http://www.flickr.com/photos/aigle_dore/5626312363/
17

http://www.flickr.com/photos/shannonmary/187131727/
18

Few genes are well annotated…

[Figure: GO annotation counts per gene, genes sorted by decreasing counts, for 20,473 protein-coding genes. Labeled genes: CTNNB1, VEGFA, SIRT1, FGFR2, TGFB1, TP53, MEF2C, BMP4, LEF1, WNT5A, TNF. Callouts: 65% and 41%.]

Data: NCBI, February 2013
… because the literature is sparsely curated?

[Figure: number of PubMed-indexed articles by year, 1979 to 2009; y-axis from 0 to 1,000,000]

19
… because the literature is sparsely curated?

[Figure: average reading capacity of a typical scientist (number of articles), 1979 to 2009; y-axis from 0 to 20]

20
21

311,696 articles (1.5% of PubMed) have been cited by GO annotations
22

Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.
The Long Tail is a prolific source of content

[Figure: content produced versus contributors (sorted), divided into a Short Head and a Long Tail]

                  Short Head         Long Tail
News:             Newspapers         Blogs
Video:            TV/Hollywood       YouTube
Product reviews:  Consumer reports   Amazon reviews
Food reviews:     Food critics       Yelp
Talent judging:   Olympics           American Idol

23
Wikipedia is reasonably accurate

24
Wikipedia has breadth and depth

25

[Figure: number of articles and words (millions), Wikipedia compared with Britannica Online]

http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008
26

We can harness the Long Tail of scientists to directly participate in the gene annotation process.
Filtering, extracting, and summarizing PubMed

Documents

Concepts

Review article
Filtering, extracting, and summarizing PubMed

Documents

Concepts
Wiki success depends on a positive feedback
[Figure: positive feedback loop among Gene Wiki page utility, number of contributors, and number of users]

29
10,000 gene “stubs” within Wikipedia

30

[Figure: example gene stub page with callouts: protein structure, gene summary, symbols and identifiers, Gene Ontology annotations, protein interactions, linked references, tissue expression pattern, links to structured databases]

Huss, PLoS Biol, 2008
Gene Wiki has a critical mass of readers

31

Total: 4.0 million views / month

Huss, PLoS Biol, 2008; Good, NAR, 2011
32

Gene Wiki has a critical mass of editors

[Figure: edit count and editor count for the Gene Wiki]

Increase of ~10,000 words / month from >1,000 edits
Currently 1.42 million words
Approximately equal to 230 full-length articles
Good, NAR, 2011
A review article for every gene is powerful

Reelin: 98 editors, 703 edits since July 2002
Heparin: 358 editors, 654 edits since June 2003
AMPK: 109 editors, 203 edits since March 2004
RNAi: 394 editors, 994 edits since October 2002
Hyperlinks to related concepts
References to the literature

33
Making the Gene Wiki more computable

Free text

Structured annotations

34
35

Filling the gaps in gene annotation
[Figure: annotation-mining pipeline. Labels: NCBI Entrez Gene: 334; Gene Wiki mapping; Wikilink; Candidate assertion; GO exact match; GO:0006897]

6319 novel GO annotations
2147 novel DO annotations
36
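
The mining step on this slide (gene page, wikilinks, exact match against GO term names) can be sketched roughly as below. This is a minimal illustration and not the Gene Wiki pipeline itself: the function names, the normalization rule, the input format, and the list of wikilinks are all assumptions made for the example. Only the gene identifier 334 and GO:0006897 come from the slide.

```python
def normalize(label):
    """Lowercase a label and collapse underscores and whitespace for exact matching."""
    return " ".join(label.replace("_", " ").lower().split())


def candidate_annotations(gene_id, wikilinks, go_names):
    """Return (gene_id, go_id) pairs for wikilinks that exactly match a GO term name.

    go_names maps GO identifiers to term names (a hypothetical input format).
    """
    index = {normalize(name): go_id for go_id, name in go_names.items()}
    found = []
    for link in wikilinks:
        go_id = index.get(normalize(link))
        if go_id is not None and (gene_id, go_id) not in found:
            found.append((gene_id, go_id))
    return found


# Illustrative inputs only: one GO term and a made-up list of wikilinks.
GO_NAMES = {"GO:0006897": "endocytosis"}
LINKS = ["Endocytosis", "Protein", "endocytosis"]

print(candidate_annotations("334", LINKS, GO_NAMES))
```

Each pair is only a candidate assertion; exact string matching says nothing about whether the linked concept actually describes the gene's function.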

Gene Wiki content improves enrichment analysis
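GO term: axon guidance (GO:0007411)

[Figure: concept recognition over PubMed abstracts links the GO term to a gene list, which is then tested by enrichment analysis. Labels: 811 articles, 264 genes]

                                   GO:0007411: Yes    GO:0007411: No
Linked genes through PubMed: Yes   13                 2
Linked genes through PubMed: No    251                12033

P = 1.55 E-20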
37
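
One common way to score a 2x2 table like the one on this slide is a one-sided hypergeometric (Fisher-style) test. The sketch below assumes that is the kind of test behind the slide's P value, and it assumes the cell layout inferred from the counts (13 + 251 = 264 genes annotated to the GO term). The function and variable names are illustrative.

```python
from math import comb


def enrichment_p_value(overlap, go_genes, list_genes, total_genes):
    """One-sided hypergeometric tail: P(X >= overlap).

    X is the number of GO-annotated genes in a random gene list of size
    list_genes drawn from total_genes, of which go_genes carry the GO term.
    """
    upper = min(go_genes, list_genes)
    favorable = sum(
        comb(go_genes, k) * comb(total_genes - go_genes, list_genes - k)
        for k in range(overlap, upper + 1)
    )
    return favorable / comb(total_genes, list_genes)


# Counts from the slide's table (cell assignment is an assumption).
both, list_only, go_only, neither = 13, 2, 251, 12033
total = both + list_only + go_only + neither

p = enrichment_p_value(both, both + go_only, both + list_only, total)
print(f"P = {p:.2e}")
```

The printed value can be compared with the P = 1.55 E-20 shown on the slide.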

Gene Wiki content improves enrichment analysis
GO term: muscle contraction (GO:0006936)

[Figure: concept recognition over PubMed abstracts, plus Gene Wiki, produces the gene list used for enrichment analysis. Labels: 251 articles, 87 genes; Gene Wiki: 87 articles]

GO:0006936, linked genes through PubMed: P = 1.0
GO:0006936, linked genes through PubMed + Gene Wiki: P = 1.22 E-09
38

Gene Wiki content improves enrichment analysis

[Figure: scatter plot of p-value (PubMed + GW) against p-value (PubMed only), with regions marked "More significant PubMed only" and "More significant PubMed + GW"; the muscle contraction term is labeled]
39

The Long Tail of scientists is a valuable source of information on gene function
Can we skip text mining?

http://fiehnlab.ucdavis.edu/projects/rice_metabolome/
41

Wikidata
Provide a database of the world’s knowledge that anyone can edit
- Denny Vrandečić
Wikidata understands scale

42
Wikidata understands scale

43

14 million Wikidata items…

…13 million total genes in Entrez Gene
Wikidata understands scale

44

27 million Wikidata statements…

…150k total GO annotations
Wikidata for biology

45

[Figure: Wikidata statements for Reelin (Q414043). Property:P31 is labeled "is a"; Property:P128 and Property:P129 carry the labels "regulates" and "interacts with". Linked items: Protein, Glycoprotein, Neural development, VLDL receptor, Amyloid precursor protein. Item identifiers shown: Q8054, Q187126, Q1345738, Q1979313, Q423510.]

http://www.wikidata.org/wiki/Q414043
Wikidata for biology

46

[Figure: the same statements shown as identifiers only: Q414043, Property:P31, Q8054, Property:P128, Property:P129, Q187126, Q1345738, Q1979313, Q423510]

http://wikidata.org/w/api.php?action=wbgetentities&ids=Q414043&languages=en
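
As a rough sketch of how a client might use the `wbgetentities` call above, the code below builds the request URL and reads item-valued claims out of a response. It makes no network request: `SAMPLE_RESPONSE` is a hand-written, heavily abridged stand-in for the JSON the API returns, and the helper names and the added `format=json` parameter are assumptions for the example.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://www.wikidata.org/w/api.php"


def build_entity_url(item_id, language="en"):
    """Build a wbgetentities request URL for one item."""
    params = {
        "action": "wbgetentities",
        "ids": item_id,
        "languages": language,
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)


def claim_targets(response, item_id, prop):
    """List the item IDs that item_id points to through property prop."""
    claims = response["entities"][item_id].get("claims", {}).get(prop, [])
    targets = []
    for claim in claims:
        snak = claim["mainsnak"]
        if snak.get("snaktype") != "value":
            continue  # skip "no value" / "unknown value" statements
        value = snak["datavalue"]["value"]
        if isinstance(value, dict) and "numeric-id" in value:
            targets.append("Q%d" % value["numeric-id"])
    return targets


# Hand-written, abridged stand-in for an API response (illustrative only).
SAMPLE_RESPONSE = {
    "entities": {
        "Q414043": {
            "labels": {"en": {"language": "en", "value": "Reelin"}},
            "claims": {
                "P31": [
                    {
                        "mainsnak": {
                            "snaktype": "value",
                            "property": "P31",
                            "datavalue": {
                                "value": {"entity-type": "item", "numeric-id": 8054},
                                "type": "wikibase-entityid",
                            },
                        }
                    }
                ]
            },
        }
    }
}

print(build_entity_url("Q414043"))
print(claim_targets(SAMPLE_RESPONSE, "Q414043", "P31"))
```

The second call returns the target of the "is a" statement as an identifier; resolving it to a label would take another lookup.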
Increasing biological data in Wikidata

http://www.wikidata.org/wiki/Wikidata:Molecular_Biology_task_force

47
Loading genomic data into Wikidata
[Figure: source databases: Entrez Gene, Ensembl, UniProt, UCSC, PDB, RefSeq]

48
Wikidata gene model

49

Added ~1000 human genes so far…
50

Wikidata as CMOD?

CMOD
51

Wikidata as CMOD?

CMOD
Powered by:

CMOD
52

The Long Tail of bioinformaticians can collaboratively build a Centralized Model Organism Database (CMOD).
Gene Wiki Collaborators
Doug Howe, ZFIN
John Hogenesch, U Penn
Jon Huss, GNF
Luca de Alfaro, UCSC
Angel Pizzaro, U Penn
Faramarz Valafar, SDSU
Pierre Lindenbaum, Fondation Jean Dausset
Michael Martone, Rush
Konrad Koehler, Karo Bio
Warren Kibbe, Simon Lim
Many Wikipedia editors
WP:MCB Project

Group members
Katie Fisch
Ben Good
Salvatore Loguercio

53

Tobias Meissner
Max Nanis
Chunlei Wu

Key group alumni

Adriel Carolino
Erik Clarke
Jon Huss
Marc Leglise
Maximilian Ludvigsson
Ian MacLeod
Camilo Orozco

Contact
http://sulab.org
asu@scripps.edu
@andrewsu
+Andrew Su
Funding and Support

(BioGPS: GM83924, Gene Wiki: GM089820)

Editor's Notes

  • #6: At least three functions – gene and genome annotation, software development, system administration
  • #7: GMOD reduces redundancy in software development
  • #9: 96 species shown
  • #10: ~3000 species shown
  • #11: Currently over 3000 sequenced genomes; will hit 10,000 in 2015, 100k in 2022, 1M in 2028. The number of sequenced genomes doubles roughly every 2 years.
  • #12: Every group still individually hosts database and web servers
  • #16: CMOD reduces redundancy in system administration, leaving MOD communities to focus on what they do best – gene and genome annotations.
  • #19: We are very early in our efforts to comprehensively annotate human gene function. Why important? Genome-scale surveys aren’t biased toward well-studied genes, so there is a huge opportunity for biomedical discovery. No IEA.
  • #22: Numbers updated 7/15/2011
  • #28: Relying on the entire community of scientists to digest the biomedical literature: identification, filtering, extraction, summarization
  • #29: Relying on the entire community of scientists to digest the biomedical literature: identification, filtering, extraction, summarization
  • #35: Structured annotations enable pathway analysis, statistical analyses, cross-species comparisons
  • #37: Tried on 773 GO categories, significant in 356 cases (46%)
  • #39: We extended this analysis to all 773 GO terms used in human gene annotations and found a consistent improvement in the enrichment scores
  • #40: Also want to convince you that the Long Tail of bioinformatics developers is valuable too, but first have to convince you that there is a bottleneck in tool development.
  • #41: Hamburger to cow algorithm or “wishful thinking” requires Jurassic Park technology
  • #42: Combines open editing of a wiki, with the robust community of editors at Wikipedia, with the structured data model of a database
  • #43: Wikipedia: > 4M articles, averages over 2500 views per second
  • #52: CMOD reduces redundancy in system administration, leaving MOD communities to focus on what they do best – gene and genome annotations.
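
The growth projection in note #11 can be checked with a short calculation. Assuming simple exponential growth between the stated milestones, the doubling time implied by each pair is the elapsed time divided by log2 of the growth ratio. The function name is illustrative; the years and counts are the ones given in the note.

```python
from math import log2


def implied_doubling_time(year_a, count_a, year_b, count_b):
    """Doubling time in years implied by exponential growth between two milestones."""
    return (year_b - year_a) / log2(count_b / count_a)


# Milestones as stated in note #11.
milestones = [(2015, 10_000), (2022, 100_000), (2028, 1_000_000)]

for (y0, n0), (y1, n1) in zip(milestones, milestones[1:]):
    t = implied_doubling_time(y0, n0, y1, n1)
    print(f"{y0}-{y1}: {t:.1f} years")
```

Both intervals give a doubling time close to the "~2 years" quoted in the note.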