Gene-specific review article for every human gene
Data integration for genes, drugs, diseases
Robust classifiers of breast cancer prognosis
Annotation of biomedical literature
Expert-guided classifier design
Gene-centric web portal
Bioinformatics algorithm optimization
Andrew Su, Ph.D.
@andrewsu
asu@scripps.edu
http://sulab.org
Slides: slideshare.net/andrewsu
Mark2Cure – biocuration by microtasking
• Challenge: The biomedical literature is massive and growing exponentially, but it is largely inaccessible
• Opportunity: Better access to existing knowledge can make the scientific process more efficient and productive
• Current situation
– Manual biocuration by experts
– Natural language processing
Mark2Cure – biocuration by microtasking
• Our approach: Use the Amazon Mechanical Turk platform for paid microtask crowdsourcing
• Results: Reproduced an expert-generated gold standard at equivalent accuracy, in a shorter time, and at a fraction of the cost
[Figure: precision and recall; K = 6, F score = 0.87]
• 593 documents
• 9 days
• 145 workers
• $0.06 / task
• Total cost: $630.96
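The voting scheme behind the K = 6 result can be sketched as follows. This is a minimal illustration, assuming K is the minimum number of workers who must mark the same annotation for it to be kept. The function names, the (document, term) annotation format, and the toy data are hypothetical and are not taken from the Mark2Cure code.

```python
from collections import Counter


def aggregate_by_votes(worker_annotations, k):
    """Keep every annotation that at least k distinct workers marked."""
    votes = Counter()
    for annotations in worker_annotations.values():
        # set() ensures each worker contributes at most one vote per annotation
        votes.update(set(annotations))
    return {annotation for annotation, n in votes.items() if n >= k}


def precision_recall_f(predicted, gold):
    """Score a set of predicted annotations against a gold-standard set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical toy data: three workers annotating disease mentions in one abstract.
workers = {
    "w1": [("doc1", "diabetes"), ("doc1", "asthma")],
    "w2": [("doc1", "diabetes")],
    "w3": [("doc1", "diabetes"), ("doc1", "obesity"), ("doc1", "asthma")],
}
gold = {("doc1", "diabetes"), ("doc1", "obesity")}

for k in (1, 2, 3):
    print(k, precision_recall_f(aggregate_by_votes(workers, k), gold))
```

Raising k generally trades recall for precision, so sweeping k and picking the value with the best F score is one plausible way to arrive at a threshold such as K = 6.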
Mark2Cure – biocuration by citizen science
• Our approach: Use volunteer-based citizen science for microtask crowdsourcing
• Results: Reproduced an expert-generated gold standard at equivalent accuracy, in a shorter time, and at no cost
• 593 documents
• 28 days
• 212 workers
• Total cost: $0.00
[Figure: precision, recall, and voting threshold; k = 6, F score = 0.84]
http://mark2cure.org
Collaborative knowledge management
• Challenge: Biomedical research allows for genome-scale profiling, but few of the profiled genes are already familiar to the researcher
• Opportunity: Better access to existing knowledge can make the scientific process more efficient and productive
• Current situation
– Review articles (but sparse coverage)
– Lots of reading of primary literature
Collaborative knowledge management
• Our approach: Create a gene-specific review article for every human gene that is collaboratively written, continuously updated, and community reviewed
• Results: 5M page views and >1000 edits per month
Collaborative knowledge management
• Our approach: Create a gene-specific Wikidata database entry for every human gene that is collaboratively integrated, continuously updated, and community reviewed
• Results: All human genes and diseases loaded in Wikidata, soon to have drugs and relationships
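As a rough illustration of what gene entries in Wikidata make possible, the sketch below builds a SPARQL query that lists human genes with their Entrez Gene IDs. It assumes the Wikidata identifiers Q7187 (gene), P703 (found in taxon), Q15978631 (Homo sapiens), and P351 (Entrez Gene ID); the function name is hypothetical, and the query is only constructed here, not sent anywhere.

```python
def human_gene_query(limit=10):
    """Build a SPARQL query for human gene items in Wikidata.

    Assumed identifiers: Q7187 = gene, P703 = found in taxon,
    Q15978631 = Homo sapiens, P351 = Entrez Gene ID.
    """
    return f"""
SELECT ?gene ?geneLabel ?entrez WHERE {{
  ?gene wdt:P31 wd:Q7187 ;
        wdt:P703 wd:Q15978631 ;
        wdt:P351 ?entrez .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
LIMIT {int(limit)}
""".strip()


print(human_gene_query(5))
```

The resulting text could be pasted into the Wikidata Query Service; the data model may have changed since these slides were presented, so the identifiers should be checked before relying on them.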
Bioinformatics algorithm optimization
• Challenge: Antibody sequence clustering is computationally expensive (CPU and memory)
• Opportunity: Large-scale clustering of antibody sequences can aid vaccine development
• Current situation: Research-grade code can cluster ~100k sequences in 1.7 hours on a high-memory (150 GB) machine
Bioinformatics algorithm optimization
• Our approach: Ran a TopCoder contest for 10 days, offering $7500 in prize money
• Results: The best solution can cluster 2.3M sequences in 30 seconds on a typical desktop computer (1.1 GB)
[Figure: Benchmarks; log(execution time) versus log(# sequences processed)]
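To put the two sets of figures side by side, the arithmetic below compares sequences clustered per second and memory footprint. It is a back-of-the-envelope sketch only: it treats "~100k" as exactly 100,000, assumes the 1.1 GB figure refers to memory, and ignores that the two runs used different hardware and that clustering time need not scale linearly with input size.

```python
def throughput(n_sequences, seconds):
    """Sequences clustered per second."""
    return n_sequences / seconds


# Research-grade baseline: ~100k sequences in 1.7 hours on a 150 GB machine.
baseline = throughput(100_000, 1.7 * 3600)

# Best contest solution: 2.3M sequences in 30 seconds using 1.1 GB.
winner = throughput(2_300_000, 30)

speedup = winner / baseline      # ratio of per-second throughput
memory_ratio = 150 / 1.1         # ratio of the quoted memory figures

print(f"baseline: {baseline:.1f} seq/s")
print(f"winner:   {winner:.1f} seq/s")
print(f"throughput ratio: about {speedup:.0f}x")
print(f"memory ratio:     about {memory_ratio:.0f}x")
```

Under these assumptions the throughput ratio is in the thousands and the memory ratio is over a hundred.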
Other group members
Cyrus Afrasiabi
Ramya Gamini
Louis Gioia
Salvatore Loguercio
Adam Mark
Erick Scott
Greg Stupp
Kevin Xin
Contact
http://sulab.org
asu@scripps.edu
@andrewsu
+Andrew Su
Mark2Cure
Ben Good
Max Nanis
Ginger Tsueng
Chunlei Wu
All Mark2Curators!
Funding and Support
BioGPS: GM83924
Gene Wiki: GM089820
BD2K Center of Excellence: GM114833
Gene Wiki
Ben Good
Sebastian Burgstaller
Andra Waagmeester
Elvira Mitraka, UMB
Lynn Schriml, UMB
Paul Pavlidis, UBC
Gang Fu, NCBI
Contests
Chunlei Wu
Ben Good
Brian Briney, TSRI
Dennis Burton, TSRI
Rinat Sergeev, HBS
Jin Paik, HBS
Karim Laklani, HBS
Jingbo Shang
Rashid Sial, Appirio
Join the team! bit.ly/sulabawesome
Game for breast cancer prognosis
• Challenge: Genomic classifiers of disease are difficult to train in a way that consistently validates on secondary datasets
• Opportunity: Better classifiers of disease diagnosis and/or prognosis have many clinical applications
• Current situation: Most attempts to train classifiers rely on machine learning methods that utilize little or no biological knowledge
Game for breast cancer prognosis
• Our approach: Enlist a crowd of expert game players with diverse perspectives to identify the most biologically relevant genes
• Results: Gene sets derived from game player data showed comparable performance to expert-generated gene sets
• 1077 registered players
• 15,669 games played
• Demographics
– 59% male, 41% female
– 21-29 is the most frequent age group
– 35% had a graduate degree, 32% were biologists


Panel on Citizen Science and Crowdsourcing Games - March 27, 2015
