Cloud-scale genomics: examples and lessons
Ben Langmead, Department of Biostatistics

Cloud debate on 1 slide

Why?
  • Cost?
  • Elastic supply
  • Not my hardware
  • Our only hope?

Why not?
  • Cost?
  • Harder to program
  • Less user-friendly
  • Data movement
  • Loosely-coupled only
  • Privacy (e.g. IRB)

Throughput: 1.6 Gbp/day [1], 5 Gbp/day [1], 25 Gbp/day [2]

Conclusion: let’s try it but hedge our bets

1. http://www.politigenomics.com/next-generation-sequencing-informatics
2. http://www.politigenomics.com/2010/01/hiseq-2000.html

Crossbow

[Figure: short reads are aligned to a reference sequence, with mismatches and gaps allowed; the alignments are aggregated into a pileup; a genotype is called from the pileup. Example call: HET A, G; p-value: 0.0023]

Pipeline stages:
  • Align: parallel by read
  • Aggregate: handled by Hadoop
  • Statistics: parallel by genome bin
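
The three stages above map naturally onto map, shuffle, and reduce. The following is a minimal sketch of that shape in plain Python, not Crossbow's code: `align` is a toy mismatch-tolerant matcher standing in for a real aligner, `call_bin` is a majority vote standing in for a real SNP caller, and `BIN_SIZE`, the function names, and the shortened reference are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Toy reference: the first 39 bases of the sequence shown on the slide.
REFERENCE = "GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGC"
BIN_SIZE = 10  # hypothetical genome-bin width


def align(read, reference=REFERENCE, max_mismatches=1):
    """Toy aligner: offset with the fewest mismatches (leftmost on ties), or None."""
    best = None
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (offset, mismatches)
    return None if best is None else best[0]


def map_reads(reads):
    """Map step (parallel by read): emit (bin, (position, base)) per aligned base."""
    for read in reads:
        offset = align(read)
        if offset is None:
            continue  # unaligned reads are dropped
        for i, base in enumerate(read):
            pos = offset + i
            yield pos // BIN_SIZE, (pos, base)


def shuffle(pairs):
    """Aggregate step: group values by bin (the part a framework like Hadoop handles)."""
    bins = defaultdict(list)
    for key, value in pairs:
        bins[key].append(value)
    return bins


def call_bin(values, reference=REFERENCE):
    """Reduce step (parallel by bin): report positions whose majority base differs."""
    pileup = defaultdict(Counter)
    for pos, base in values:
        pileup[pos][base] += 1
    calls = {}
    for pos, counts in pileup.items():
        base, _ = counts.most_common(1)[0]
        if base != reference[pos]:
            calls[pos] = (reference[pos], base)
    return calls


def run(reads):
    """Run map, shuffle, and reduce in sequence and merge the per-bin calls."""
    calls = {}
    for values in shuffle(map_reads(reads)).values():
        calls.update(call_bin(values))
    return dict(sorted(calls.items()))
```

Calling `run(["GTCTGTCACC", "CTGTCACCCT", "ATTAACCACT"])` reports a single difference at position 12, where both covering reads carry G against the reference A. Because each bin is reduced independently, the reduce calls could run on separate workers.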

Myrna

[Figure: reads from Sample A and Sample B are aligned to the reference; alignments overlapping Gene 1 are counted per sample and the counts are compared. Example result: Gene 1 differentially expressed? YES; p-value: 0.0012]

Pipeline stages:
  • Align: parallel by read
  • Aggregate: handled by Hadoop
  • Overlap: parallel by genome bin
  • Aggregate: handled by Hadoop
  • Normalize: parallel by sample
  • Aggregate: handled by Hadoop
  • Statistics: parallel by gene

Myrna

Table 1. Timing and cost for a Myrna experiment with 1.1 billion 35 bp unpaired reads from the Pickrell et al. study as input. Costs are approximate and based on the pricing as of this writing, that is, $0.68 per extra-large high-CPU EC2 node per hour in the Northern Virginia zone and $0.78 in other zones, plus a $0.12 per-node-per-hour surcharge for Elastic MapReduce in all zones. Times can vary subject to, for example, congestion and Internet traffic conditions.

Myrna runtime and cost for 1.1 billion reads from the Pickrell et al. study

EC2 nodes                                    1 master, 10 workers   1 master, 20 workers   1 master, 40 workers
Worker CPU cores                             80                     160                    320
Wall clock time                              4h:20m                 2h:32m                 1h:38m
  Cluster setup                              4m                     4m                     3m
  Align                                      2h:56m                 1h:31m                 54m
  Overlap                                    52m                    31m                    16m
  Normalize                                  6m                     7m                     6m
  Statistics                                 9m                     6m                     6m
  Summarize & Postprocess                    13m                    14m                    13m
Approximate cost (N. Virginia / Elsewhere)   $44.00 / $49.50        $50.40 / $56.70        $65.60 / $73.80

Data transfer adds about 1hr:15m, $11.
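
The cost row can be reproduced from the prices quoted in the caption. The sketch below assumes that every node, master included, is billed for whole hours with partial hours rounded up; the caption does not state this, but that assumption yields the table's figures. Prices are held in integer cents to avoid floating-point drift, and the function names are illustrative.

```python
# Prices from the Table 1 caption, in cents per node per hour.
NODE_CENTS = {"N. Virginia": 68, "Elsewhere": 78}
EMR_SURCHARGE_CENTS = 12


def billed_hours(hours, minutes):
    """Round wall-clock time up to whole hours (assumed billing rule)."""
    total_minutes = hours * 60 + minutes
    return -(-total_minutes // 60)  # ceiling division on integers


def cluster_cost_cents(workers, hours, minutes, zone):
    """Cost of one master plus `workers` nodes for the given wall-clock time."""
    nodes = workers + 1  # the master is billed too
    rate = NODE_CENTS[zone] + EMR_SURCHARGE_CENTS
    return nodes * rate * billed_hours(hours, minutes)


def as_dollars(cents):
    """Format integer cents as a dollar string."""
    return f"${cents // 100}.{cents % 100:02d}"
```

For the 10-worker run, `as_dollars(cluster_cost_cents(10, 4, 20, "N. Virginia"))` gives `$44.00`: 11 nodes at $0.80 per hour for 5 billed hours.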

Myrna

[Figure values: 71%, 55%]
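
The slide text carries no labels for these two percentages. One reading, offered here as an assumption rather than the author's stated definition, is that they are the gains in speed from each doubling of the worker count, since the wall-clock times in Table 1 produce the same numbers. A minimal sketch with illustrative names:

```python
def to_minutes(hours, minutes):
    """Convert an h:m wall-clock time to minutes."""
    return hours * 60 + minutes


# Wall-clock times from Table 1, keyed by worker count.
WALL_CLOCK = {
    10: to_minutes(4, 20),
    20: to_minutes(2, 32),
    40: to_minutes(1, 38),
}


def gain_percent(fewer_workers, more_workers):
    """Percent increase in speed when moving to the larger cluster."""
    ratio = WALL_CLOCK[fewer_workers] / WALL_CLOCK[more_workers]
    return round((ratio - 1) * 100)
```

Under this reading, `gain_percent(10, 20)` is 71 and `gain_percent(20, 40)` is 55. A perfectly scaling pipeline would gain 100% per doubling, so the shrinking figures are consistent with fixed-cost stages (setup, summarize, postprocess) taking a growing share of the run.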
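
Bet-hedging architecture

The same three wrappers (bowtie, soapsnp, postprocess) run in every mode; only the driver script and the layer underneath change.

  • Cloud mode: Cloud driver script; Wrapper bowtie, Wrapper soapsnp, Postprocess; runs on Hadoop
  • Hadoop mode: Hadoop driver script; Wrapper bowtie, Wrapper soapsnp, Postprocess; runs on Hadoop
  • Single-computer mode: Singleton driver script; Wrapper bowtie, Wrapper soapsnp, Postprocess; runs on Perl, fork, sort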
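
The point of this layout is that the wrappers are written once and each mode supplies only a driver and an execution layer. The sketch below shows that separation as data; it is an illustration of the design idea, not the tools' actual code, and `Mode`, `MODES`, and `plan` are hypothetical names.

```python
from dataclasses import dataclass

# Wrappers shared by every mode, in pipeline order.
WRAPPERS = ("bowtie", "soapsnp", "postprocess")


@dataclass(frozen=True)
class Mode:
    driver: str     # script that launches the pipeline
    substrate: str  # layer that distributes and sorts the work


MODES = {
    "cloud": Mode("cloud driver script", "Hadoop"),
    "hadoop": Mode("Hadoop driver script", "Hadoop"),
    "single": Mode("singleton driver script", "Perl, fork, sort"),
}


def plan(mode_name):
    """Describe the steps a mode would run; the wrapper list never changes."""
    mode = MODES[mode_name]
    return [
        f"{mode.driver}: run {wrapper} wrapper on {mode.substrate}"
        for wrapper in WRAPPERS
    ]
```

Supporting a new environment then means adding one entry to `MODES` rather than touching the wrappers, which is what makes hedging the bet cheap.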

Acknowledgements

Michael Schatz, Jimmy Lin, Mihai Pop, Steven Salzberg, Jeff Leek, Kasper Hansen, Hector Corrada Bravo, Rafael Irizarry

Crossbow

Data transfer adds about 1hr:15m, $28.

Crossbow

[Figure values: 43%, 58%]


Langmead bosc2010 cloud-genomics
