Infrastructure for Data
Intensive Biology
“Better Science through Superior Software”
C. Titus Brown
Current research:
Compressive algorithms for
sequence analysis
[Diagram: raw data (~10-100 GB), compression (~2 GB), analysis, "information" (~1 GB), database & integration]
Can we enable and accelerate sequence-
based inquiry by making all basic analysis
easier and some analyses possible?
Three super-awesome
technologies…
1. Low-memory k-mer counting
(Zhang et al., PLoS One, 2014)
2. Compressible assembly graphs
(Pell et al., PNAS, 2012)
3. Streaming lossy compression of sequence
data
(Brown et al., arXiv, 2012)
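The first and third ideas above can be illustrated together. What follows is a minimal sketch, not khmer's actual code or API: it assumes a Count-Min Sketch as the fixed-memory k-mer counter and a median-count cutoff as the streaming filter. The class name, function names, and the parameters `width`, `depth`, `k`, and `cutoff` are all hypothetical choices made for illustration.

```python
import hashlib
from statistics import median


class CountMinSketch:
    """Fixed-memory approximate counter (illustrative, not khmer's implementation).

    Counts can be overestimated when k-mers collide, but never underestimated.
    """

    def __init__(self, width=10007, depth=4):
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _slots(self, kmer):
        # One slot per row, chosen by a hash salted with the row number.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                kmer.encode(), salt=bytes([row]), digest_size=8
            ).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def add(self, kmer):
        for row, col in self._slots(kmer):
            self.tables[row][col] += 1

    def count(self, kmer):
        # The smallest of the row counters is the least inflated estimate.
        return min(self.tables[row][col] for row, col in self._slots(kmer))


def kmers(read, k):
    """Return every overlapping substring of length k."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def normalize(reads, k=5, cutoff=2, sketch=None):
    """Single pass over the reads: keep a read only while the median count
    of its k-mers is still below `cutoff`, then record its k-mers."""
    if sketch is None:
        sketch = CountMinSketch()
    kept = []
    for read in reads:
        words = kmers(read, k)
        if not words:
            continue  # read shorter than k
        if median(sketch.count(w) for w in words) < cutoff:
            for w in words:
                sketch.add(w)
            kept.append(read)
    return kept


if __name__ == "__main__":
    reads = ["ACGTTGCAAG"] * 5 + ["TTTTGGGGCC"]
    print(normalize(reads))
```

The filter is lossy on purpose: redundant high-coverage reads are dropped, while reads from low-coverage regions pass through. Memory stays fixed at `width * depth` counters regardless of how many reads are streamed.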
…implemented in one super-
awesome software package…
github.com/ged-lab/khmer/
BSD licensed
Openly developed using good practice.
> 10 external contributors.
Thousands of downloads/month.
50 citations in 3 years.
We think > 1000 people are using it; have heard
from dozens.
…enabling super-awesome
biology.
1. Assembling soil metagenomes
Howe et al., PNAS, 2014
2. Understanding bone-eating worm symbionts
Goffredi et al., ISME, 2014.
3. An ultra-deep look at the lamprey transcriptome
(in preparation)
4. Understanding derived anural development in
Molgulid ascidians (in preparation)
Early on, lack of replicability in pubs slowed us down =>
Strategy: “level up” the field
High quality & novel science,
done openly,
written up in reproducible and
remixable papers,
using IPython Notebook,
and posted to preprint servers.
Expression-based
clustering of 85 lamprey
tissue samples (de novo
assembly of 3 billion reads)
~ 1 month
Camille Scott
Open protocols for the
cloud: ~$100/analysis
Read cleaning
Preprocessing
Assembly
Annotation
khmer-protocols.readthedocs.org/
Transcriptome and metagenome assembly protocols
The data challenge in biology
In 5-10 years, we will have nigh-infinite data.
(Genomic, transcriptomic, proteomic,
metabolomic, …?)
We currently have no good way of querying,
exploring, investigating, or mining these data
sets, especially across multiple locations.
Moreover, most data is unavailable until after
publication…
…which, in practice, means it will be lost.
Proposal: distributed graph database server
[Architecture diagram; components:]
Web interface + API
Compute server (Galaxy? Arvados?)
Data/Info
Raw data sets
Graph query layer
Public servers; "walled garden" server; private server
Upload/submit (NCBI, KBase)
Import (MG-RAST, SRA, EBI)
Graph queries across public & walled-garden data sets
[Diagram: nodes "assembled sequence", "raw sequence", "nitrite reductase", "ppaZ"; edge labels "SIMILARITY TO", "ALSO CONTAINS"]
See Lee, Alekseyenko, Brown, paper in SciPy 2009: the “pygr” project.
The larger vision
Enable and incentivize sharing by providing
immediate utility; frictionless sharing.
Permissionless innovation for e.g. new data
mining approaches.
Plan for poverty with federated infrastructure
built on open & cloud.
Solve people’s current problems, while
remaining agile for the future.
Who needs this?
Everyone.
Environmental microbiology, evo devo,
agriculture, VetMed...
How would I start?
1-2 pilot projects w/domain
postdocs: drive computational
infrastructure with biology
problems.
Support postdocs with
software engineer
(infrastructure) and graduate
student CS (research).
Cross-train postdocs in data-
intensive research methods
and software engineering.
Note: finding existing data is not a
problem :)
“DeepDOM” cruise: examination
of dissolved organic matter &
microbial metabolism vs
physical parameters – potential
collab.
Via Elizabeth Kujawinski
Education and training
Biology is underprepared for data-intensive
investigation.
We must teach and train the next generations.
~5-10 workshops / year, novice -> masterclass; open
materials.
Deeply self-interested:
What problems does everyone have, now?
(Assembly)
What problems do leading-edge researchers have?
(Data integration)
Pre-answered Questions
Q: What will be open?
A: Everything; I succeed & fail publicly.
Q: How will you measure success?
A: By other people using & extending our
“products” without talking to us.
Blog: ivory.idyll.org/blog/ - search for “moore”, “satire”
@ctitusbrown
Graph queries
across public & walled-garden data
sets:
“What data sets contain <this gene>?”
“Which reads match to <this gene>, but
not in <conserved domain>?”
“Give me relative abundance of <gene
X> across all data sets, grouped by
nitrogen exposure.”
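To make two of these example queries concrete, here is a toy in-memory stand-in written under stated assumptions. It is not the proposed server, a graph database, or any existing API. The data set names, gene names, read counts, and metadata fields are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical toy data: (data set, gene) -> number of matching reads.
READ_COUNTS = {
    ("soil_A", "geneX"): 30,
    ("soil_A", "geneY"): 70,
    ("soil_B", "geneX"): 10,
    ("soil_B", "geneY"): 90,
    ("private_C", "geneY"): 50,
}

# Hypothetical per-data-set metadata, including where the data lives.
METADATA = {
    "soil_A": {"nitrogen": "high", "access": "public"},
    "soil_B": {"nitrogen": "low", "access": "public"},
    "private_C": {"nitrogen": "low", "access": "walled-garden"},
}


def datasets_containing(gene):
    """'What data sets contain <this gene>?'"""
    return sorted(
        {ds for (ds, g), n in READ_COUNTS.items() if g == gene and n > 0}
    )


def relative_abundance(gene):
    """Fraction of each data set's reads that match the gene."""
    totals = defaultdict(int)
    for (ds, _), n in READ_COUNTS.items():
        totals[ds] += n
    return {ds: READ_COUNTS.get((ds, gene), 0) / totals[ds] for ds in totals}


def abundance_grouped_by(gene, field):
    """'Give me relative abundance of <gene X> across all data sets,
    grouped by nitrogen exposure.' Returns the mean fraction per group."""
    groups = defaultdict(list)
    for ds, fraction in relative_abundance(gene).items():
        groups[METADATA[ds][field]].append(fraction)
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}


if __name__ == "__main__":
    print(datasets_containing("geneX"))
    print(abundance_grouped_by("geneX", "nitrogen"))
```

In the proposal these lookups would span public and walled-garden servers through a shared query layer; the sketch only shows the shape of the questions and answers on a single machine.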

2014 moore-ddd


Editor's Notes

  • #3: Squeeze information out of data; speed up downstream analyses; make impossible possible.
  • #4: Applicable to many basic sequence analysis problems: error removal, species sorting, and de novo sequence assembly.
  • #5: Hard to tell how many people are using it because it’s freely available in several locations.
  • #6: The point is to enable biology; volume and velocity of data from sequencers is blocking.
  • #7: Doing computational science with good software engineering approaches is enabling; scientist + soft eng grad students are super capable.
  • #8: 1000s of people want to do what we do, can’t collaborate with them all => open protocols. Forkable, citable, open, tested. This is your methods section for computational analysis.
  • #10-#13: Analyze data in cloud; import and export important; connect to other databases.
  • #14: Set up infrastructure for distributed query; base on graph database concept of standing relationships between data sets.
  • #15: Work with other Moore DDD folk on the data mining aspect. Start with cross validation, move to more sophisticated in-server implementations.
  • #17: Drive with pilot projects; train domain postdocs in computation; e.g. 20+ sites with multi-omic sampling, clearly the future but no way to analyze the data.
  • #18: Passionate about training; necessary for advancement of field; also deeply self-interested because I find out what the real problems are. (“Some people can do assembly” is not “everyone can do assembly”)
  • #19: Mention moore science fiction project