Instrument ALL the things:
Studying data-intensive
workflows in the clowd.
C. Titus Brown
Michigan State University
(See blog post)
A few upfront definitions
Big Data, n: whatever is still inconvenient to compute on.
Data scientist, n: a statistician who lives in San Francisco.
Professor, n: someone who writes grants to fund people
who do the work (cf. Fernando Perez)
I am a professor (not a data scientist) who
writes grants so that others can do
data-intensive biology.
This talk dedicated to Terry Peppers
Titus, I no longer understand
what you actually do…
Daddy, what do you do at
work!?
I assemble puzzles for a living.
Well, ok, I strategize about solving multi-dimensional puzzles with billions of pieces and no box.
Three bioinformatic strategies in use
• Greedy: “if the piece sorta fits…”
• N^2: “Do these two pieces match? How about
this next one?”
• The Dutch approach.
The Dutch Solution
(De Bruijn assembly)
Find similarities within puzzle pieces
The Dutch Solution
Algorithmically:
• Is linear in time with number of pieces.
(Way better than N^2!)
• Is linear in memory with volume of data.
(This is due to errors in the digitization process.)
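The De Bruijn idea above can be sketched in a few lines: cut each piece (read) into overlapping k-mers and link each k-mer's prefix to its suffix. This is a toy illustration only, not the implementation used by Velvet or khmer; the names `build_de_bruijn` and `count_edges`, the example reads, and the choice k=4 are all assumptions made for the example. One pass over the reads is enough, which is where the linear time comes from, and every distinct k-mer, including those created by errors, adds an edge, which is where the memory growth comes from.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Toy De Bruijn graph: nodes are (k-1)-mers, edges are distinct k-mers."""
    graph = defaultdict(set)
    for read in reads:
        # One pass over each read: work is linear in the total read length.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def count_edges(graph):
    """Number of distinct k-mers stored, a rough proxy for memory use."""
    return sum(len(successors) for successors in graph.values())


# Hypothetical overlapping reads from one underlying sequence.
reads = ["ATGGCGT", "GGCGTGC", "CGTGCAA"]
clean = build_de_bruijn(reads, k=4)

# An extra copy of the first read with a single-base error adds new k-mers,
# so the graph grows even though no new true sequence was added.
noisy = build_de_bruijn(reads + ["ATGGAGT"], k=4)

print(count_edges(clean), count_edges(noisy))
```

Comparing the two edge counts shows the effect described on the slide: the erroneous read adds edges that correspond to no real sequence.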
Practical memory measurements
[Figure: Velvet memory usage in GB RAM, measured by Adina Howe, for about $500 of data]
Our research challenges –
1. It costs only $10k & 1 week to generate
enough sequence data that no commodity
computer (and few supercomputers) can
assemble it.
2. Hundreds -> thousands of such data sets are
being generated each year.
Our research (i) - CS
• Streaming lossy compression approach that
discards pieces we’ve seen before.
• Low memory probabilistic data structures.
(…see PyCon 2013 talk)
=> RAM now scales better: O(I) where I << N
(I is sample dependent but typically I < N/20)
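A minimal sketch of the streaming idea, under stated assumptions: a read counts as "seen before" when the median count of its k-mers has reached a cutoff, and an exact `Counter` stands in for the low-memory probabilistic counting structure. The names `stream_filter`, `kmers`, `K`, and `CUTOFF`, and the values chosen for them, are hypothetical; this is not khmer's API.

```python
from collections import Counter
from statistics import median

K = 4        # k-mer size (illustrative value)
CUTOFF = 2   # keep a read only while its median k-mer count is below this


def kmers(read, k=K):
    """All overlapping substrings of length k."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def stream_filter(reads, cutoff=CUTOFF):
    """Yield reads whose k-mers are not yet well represented.

    Reads are examined once, in order. Only kept reads update the counts,
    so memory tracks the retained data rather than the full input.
    """
    counts = Counter()
    for read in reads:
        words = kmers(read)
        if not words:
            continue  # read shorter than K
        if median(counts[w] for w in words) < cutoff:
            counts.update(words)
            yield read


# Hypothetical input: the same two pieces sequenced many times over.
reads = ["ATGGCGTA"] * 5 + ["TTCAGGAC"] * 3
kept = list(stream_filter(reads))
print(len(reads), "reads in,", len(kept), "reads kept")
```

Swapping the `Counter` for a fixed-size approximate counter would trade a small error rate for bounded memory, which is the role the probabilistic data structures play in the bullet above.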
Our research (ii) - approach
• Open source, open data, open science, and
reproducible computational research.
– GitHub
– Automated testing, CI, & literate reSTing
– Blogging, Twitter
– IPython Notebook for data analysis, figures.
• Protocols for assembling in the cloud.
Real solutions, tackling squishy biology!
[Images: Molgula oculata and Molgula occulta]
Elijah Lowe & Billie Swalla
Doing things right => #awesomesauce
• Protocols in English for running analyses in the cloud
• Literate reSTing => shell scripts
• Tool competitions
• Benchmarking
• Education
• Acceptance tests
Benchmarking strategy
• Rent a bunch of cloud VMs from Amazon and
Rackspace.
• Extract commands from tutorials using
literate-resting.
• Use ‘sar’ (sysstat pkg) to sample CPU, RAM,
and disk I/O.
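The command-extraction step can be sketched as below. This is an illustrative stand-in and not the actual literate-resting script: it assumes commands live in indented reST literal blocks introduced by a line ending in `::`, and the function name `extract_commands` and the sample tutorial text are hypothetical.

```python
def extract_commands(rst_text):
    """Collect the lines of reST literal blocks (introduced by '::')."""
    commands = []
    state = "text"  # "text", "expect" (just saw '::'), or "block"
    for line in rst_text.splitlines():
        stripped = line.strip()
        indented = bool(stripped) and line[0].isspace()
        if state in ("expect", "block") and indented:
            state = "block"
            commands.append(stripped)
        elif stripped:
            # Unindented text ends any open block; a trailing '::'
            # announces that a new literal block follows.
            state = "expect" if stripped.endswith("::") else "text"
        # Blank lines leave the state unchanged.
    return commands


TUTORIAL = """\
Make a working directory::

   mkdir -p /tmp/work
   cd /tmp/work

Then count the lines::

   wc -l reads.fa

This paragraph is prose, not a command.
"""

print("\n".join(extract_commands(TUTORIAL)))
```

Writing the printed lines to a file gives a shell script that can be run unattended on each rented VM while `sar` samples in the background, so the tutorial text doubles as the benchmark definition.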
Benchmarking output
[Figure: data subset; AWS m1.xlarge]
Each protocol has many steps
[Figure: data subset; AWS m1.xlarge]
Most interested in RAM-intensive bit
[Figure: data subset; AWS m1.xlarge]
[Figure: complete data; AWS m1.xlarge]
Observation #1: Rackspace is faster
machine         data disk      working disk   hours  cost
rackspace-15gb  200 GB         100 GB         34.9   $23.70
m2.xlarge       EBS            ephemeral      44.7   $18.34
m1.xlarge       EBS            ephemeral      45.5   $21.82
m1.xlarge       EBS, max IOPS  ephemeral      49.1   $23.56
m1.xlarge       EBS, max IOPS  EBS, max IOPS  52.5   $25.20
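The hours and cost columns in the table above imply an effective hourly rate for each run. The sketch below is plain arithmetic on the slide's numbers, not a published price list; the helper names `hourly_rate` and `slowdown` are made up for the example.

```python
# (machine, data disk, working disk, hours, cost in USD), copied from the table
RUNS = [
    ("rackspace-15gb", "200 GB", "100 GB", 34.9, 23.70),
    ("m2.xlarge", "EBS", "ephemeral", 44.7, 18.34),
    ("m1.xlarge", "EBS", "ephemeral", 45.5, 21.82),
    ("m1.xlarge", "EBS, max IOPS", "ephemeral", 49.1, 23.56),
    ("m1.xlarge", "EBS, max IOPS", "EBS, max IOPS", 52.5, 25.20),
]


def hourly_rate(run):
    """Effective USD per hour implied by the hours and cost columns."""
    return run[4] / run[3]


def slowdown(run, runs=RUNS):
    """Runtime relative to the fastest run in the table."""
    return run[3] / min(r[3] for r in runs)


for run in RUNS:
    print(f"{run[0]:<15} {run[1]:<14} {run[2]:<14} "
          f"${hourly_rate(run):.2f}/h  {slowdown(run):.2f}x")
```

Read this way, the fastest run is also the most expensive per hour, and the cheapest total bill belongs to a slower machine, so "faster" and "cheaper" pull in different directions.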
Surprise #1: AWS ephemeral storage is FASTER
machine         data disk      working disk   hours  cost
rackspace-15gb  200 GB         100 GB         34.9   $23.70
m2.xlarge       EBS            ephemeral      44.7   $18.34
m1.xlarge       EBS            ephemeral      45.5   $21.82
m1.xlarge       EBS, max IOPS  ephemeral      49.1   $23.56
m1.xlarge       EBS, max IOPS  EBS, max IOPS  52.5   $25.20
Observation #2: NUMA costs
Same task done with varying memory sizes.
Can’t we just use a faster computer?
• Demo data on m1.xlarge: 2789 s
• Demo data on m3.xlarge: 1970 s – 30% faster!
(Why?
m3.xlarge has 2x40 GB SSD drives & 40% faster
cores.)
Great! Let’s try it out!
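The "30% faster" figure is the reduction in wall-clock time on the demo data. A short worked calculation from the two timings on the slide (variable names are illustrative):

```python
m1_seconds = 2789  # demo data on m1.xlarge
m3_seconds = 1970  # demo data on m3.xlarge

# Fraction of the m1.xlarge runtime that the m3.xlarge run saves.
time_saved = (m1_seconds - m3_seconds) / m1_seconds

# How many times faster the m3.xlarge run finishes.
speedup = m1_seconds / m3_seconds

print(f"time reduced by {time_saved:.1%}, speedup {speedup:.2f}x")
```

The two numbers describe the same measurement from different angles: a runtime cut of just under 30% corresponds to a speedup of roughly 1.4x.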
Observation #3: multifaceted problem!
• Full data on m1.xlarge: 45.5 h
• Full data on m3.xlarge: out of disk space.
We need about 200 GB to run the full pipeline.
You can have fast disk or lots of disk but not
both, for the moment.
Future directions
1. Invest in cache-local data structures and
algorithms.
2. Invest in streaming/in-memory approaches.
3. Not clear (to me) that straight code
optimization or infrastructure engineering is
a worthwhile investment.
Frequently Offered Solutions
1. You should like, totally multithread that.
(See: McDonald & Brown, POSA)
2. Hadoop will just crush that workload, dude.
(Unlikely to be cost-effective.)
3. Have you tried <my proprietary Big Data
technology stack>?
(Thatz Not Science)
Optimization vs scaling
• Linear time/memory improvements would not
have addressed our core problem.
(2 years, 20x improvement, 100x increase in data.)
• Puzzle problem is a graph problem with big
data, no locality, small compute. Not friendly.
• We need(ed) to scale our algorithms.
• Can now run on single-chassis, in ~15 GB RAM.
Optimization vs scaling --
Scaling can be more important!
What are we losing by focusing our
engineering on pleasantly parallel
problems?
• Hadoop is fundamentally not that interesting.
• Research is about the 100x.
• Scaling new problems, evaluating/creating
new data structures and algorithms, etc.
(From my PyCon 2011 talk.)
Theme: Life’s too short to tackle the
easy problems – come to academia!
Thanks!
• Leigh Sheneman, for starting the
benchmarking project.
• Labbies: Michael R. Crusoe, Luiz Irber, Likit
Preeyanon, Camille Scott, and Qingpeng
Zhang.
Thanks!
• github.com/ged-lab/
– khmer – core project
– khmer-protocols – tutorials/acceptance tests
– literate-resting – script to pull out code from reST tutorials
• Blog post at: http://ivory.idyll.org/blog/2014-pycon.html
• Michael R. Crusoe, Likit Preeyanon, Camille Scott, and
Qingpeng Zhang are here at PyCon.
…note, you can probably afford to
buy them off me :)
Different computational strategies for
k-mer counting, revealed!
Khmer-counting paper pipeline; Qingpeng Zhang

2014 pycon-talk

Editor's Notes

  • #2: …spent last 15 years getting to the point where I earn considerably less than many of you
  • #5: Billions of pieces; hi-dimensional puzzle
  • #15: Acceptance testing other people’s software
  • #17: Color.
  • #18: Walk through
  • #21: Add cost.
  • #22: Add cost.
  • #28: Apparently I’m approachable, trying to work on that.