Data and Computing Infrastructure for the Life Sciences:
Best Practices, Observations, and Lessons Learned
Chris Dwan
https://dwan.org
Agenda:
• Real world systems
• Power, cooling, and information transfer
• Molecules, Measurements, and Models
• Business
• Technology
• People
Real World Systems
Research computing at CCGB
Center for Computational Genomics & Bioinformatics
University of Minnesota
2002
• 5 x Quad P-II 450MHz Dell (Rocks 3.1)
• 2 ‘G’ series TimeLogic Decypher boards (BLAST accelerators)
• Alphaserver E40 Quad Processor
• 7 x Dual P-III 1.4 GHz Dell (Rocks 3.1)
Research computing systems at NASA
Data Acquisition and Analysis Center (DAAC) for the Earth Science Technology Office (ESTO)
Langley Air Force Base
Hampton Roads, VA
2007
• 1.2PB GPFS (28 cabinets, 2,560 disk drives)
• 100TB XSan
• ~50 compute servers
New York Genome Center
Manhattan, NY
2012
• ~1PB storage
• Commodity compute
• More commodity compute
• GPUs
• Network
• More commodity compute
• And a bit of cloud
Broad Institute
Cambridge, MA
2015
• ~20PB file storage
• Another ~20PB file storage
• Enterprise services
• Private cloud
• More Linux HPC
• 14k cores
• Linux HPC
• Differentiated hardware
• Secondary data center
• Quite a lot of cloud
Sema4
Clinical Genomic Screening and Diagnostics
Stamford, CT
2021
All cloud, except for the physical touch points with the lab, and one little network closet
Cambridge, MA
2024
• 1.2PB flash storage
• Network
• Commodity servers
• GPUs
• Artificial Intelligence
• Direct connections to multiple public cloud providers
• Private cloud
• Diverse fiber from HQ to data center
• Modest data center footprint
1. Power, Cooling, and Information Transfer
Electricity is expensive, cooling is a pain, and light moves really fast
Watts = Amps × Volts (the power law that accompanies Ohm’s Law)
• Watts: amount of power (you get charged for kilowatt hours)
• Amps: capacity of the pipe (circuit breakers are rated in amps)
• Volts: “pressure” in the pipe
Important Facts
• Transmission of electrical power is much more efficient at higher voltages.
• High voltages at the same amperage will deliver too much power and damage sensitive components.
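A minimal sketch of the Watts = Amps × Volts relationship, to make the "capacity of the pipe" idea concrete. The 30 A breaker, the two supply voltages, and the 5 kW rack are hypothetical numbers chosen for illustration, not figures from any of the sites described earlier.

```python
def watts(amps: float, volts: float) -> float:
    """Power available from a circuit: Watts = Amps x Volts."""
    return amps * volts


def monthly_kwh(load_watts: float, hours: float = 24 * 30) -> float:
    """Energy drawn by a constant load over a period, in kilowatt hours."""
    return load_watts / 1000.0 * hours


# Hypothetical example: the same 30 A breaker fed at two different voltages.
for volts in (120, 208):
    print(f"30 A at {volts} V -> {watts(30, volts):.0f} W of capacity")

# Hypothetical 5 kW rack running flat out for a 30-day month.
print(f"5 kW rack: {monthly_kwh(5000):.0f} kWh per 30-day month")
```

The same breaker delivers more power at the higher voltage, which is one reason data center circuits are usually specified in both amps and volts.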
Fourier’s Law: q = −k∇T
• q: heat flux (how much energy moves)
• k: thermal conductivity of the material carrying the energy
• ∇T: temperature gradient
Important Facts
• k of liquid water is ~25 times greater than that of air.
• k of metals is thousands of times greater than that of air.
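A rough sketch of Fourier's Law in one dimension, comparing air, water, and copper across the same gap and temperature difference. The conductivity values are rounded room-temperature textbook figures included as assumptions, and the 20 K difference across 1 cm is hypothetical. This covers conduction only; real cooling systems also depend heavily on moving the air or water.

```python
# Approximate thermal conductivities in W/(m*K). Rounded textbook values,
# used here as assumptions rather than measured figures.
K_AIR = 0.026
K_WATER = 0.6
K_COPPER = 400.0


def heat_flux(k: float, delta_t: float, thickness: float) -> float:
    """Magnitude of 1-D conductive heat flux, |q| = k * dT / dx, in W/m^2."""
    return k * delta_t / thickness


# Hypothetical case: 20 K temperature difference across a 1 cm layer.
for name, k in (("air", K_AIR), ("water", K_WATER), ("copper", K_COPPER)):
    q = heat_flux(k, delta_t=20.0, thickness=0.01)
    print(f"{name:>6}: {q:>10.0f} W/m^2 ({k / K_AIR:.0f}x air)")
```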
Shannon-Hartley Theorem: C = B × log2(1 + S/N)
• C: channel capacity (b/sec)
• B: bandwidth (hertz)
• S/N: signal to noise ratio
Important Facts:
The channel capacity of just one strand of
single mode fiber optic cable is functionally
unlimited (2.4 Pb/sec)
Coast to coast latencies on the commercial
internet are ~80ms
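A small sketch of both facts: the Shannon-Hartley capacity, C = B × log2(1 + S/N), and the floor that the speed of light puts on latency. The 4,000 km path length and the refractive index of 1.47 for glass fiber are assumptions for illustration; real routes are longer and add switching delay, which is part of why observed latencies sit above this floor.

```python
import math

SPEED_OF_LIGHT_KM_PER_S = 299_792.458


def shannon_capacity(bandwidth_hz: float, snr_linear: float) -> float:
    """Channel capacity in bits/sec. snr_linear is a plain ratio, not decibels."""
    return bandwidth_hz * math.log2(1.0 + snr_linear)


def fiber_one_way_ms(distance_km: float, refractive_index: float = 1.47) -> float:
    """Minimum one-way transit time through fiber, in milliseconds."""
    speed_in_fiber = SPEED_OF_LIGHT_KM_PER_S / refractive_index
    return distance_km / speed_in_fiber * 1000.0


# Hypothetical channel: 1 MHz of bandwidth with signal power equal to noise power.
print(f"capacity: {shannon_capacity(1e6, 1.0):.0f} b/sec")

# Hypothetical 4,000 km coast-to-coast fiber path, out and back.
round_trip = 2 * fiber_one_way_ms(4000)
print(f"round trip lower bound: {round_trip:.1f} ms")
```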
Computing infrastructure should be:
• Adjacent to cheap, renewable power
• Someplace where cooling is straightforward
• Close to high quality fiber
You probably do not need to put the
ARTIFICIAL INTELLIGENCE in your
2. Molecules, Measurements, and Models
The underlying mechanisms of biology are pretty stable. The models we use to understand them change constantly.
The size of the human genome hasn’t changed very much in ~3 × 10^5 years:
Humans have 23 pairs of chromosomes with a total of 3.2 × 10^9 unique locations.
If a base call is 2 bits (four-letter alphabet) and the quality scores are 8 bits, then we need 10 bits / base pair.
30x WGS = 30 × 10 × 3.2 × 10^9 bits
= 9.6 × 10^(9+1+1) bits
= 960,000,000,000 bits
= 120,000,000,000 bytes (counting the comma groups: K, M, G)
The reads to sequence a human
genome to 30x, including 8-bit quality
scores, are still about 120GB prior to
compression and de-duplication.
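The arithmetic above can be checked with a few lines of code. This is a sketch using only the numbers on the slide; the function and constant names are made up for illustration.

```python
GENOME_POSITIONS = 3_200_000_000  # 3.2 x 10^9 unique locations
BITS_PER_BASE_CALL = 2            # four-letter alphabet
BITS_PER_QUALITY = 8              # 8-bit quality score
COVERAGE = 30                     # 30x whole genome sequencing


def raw_read_bytes(positions: int, coverage: int, bits_per_base: int) -> int:
    """Uncompressed size of the reads, in bytes (8 bits per byte)."""
    return positions * coverage * bits_per_base // 8


total = raw_read_bytes(
    GENOME_POSITIONS, COVERAGE, BITS_PER_BASE_CALL + BITS_PER_QUALITY
)
print(f"{total:,} bytes = {total / 1e9:.0f} GB")
```

Changing the coverage or the quality score width shows how quickly the raw footprint scales, before any compression is applied.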
Level 1:
Raw instrument outputs
Reads, gel images, …
Level 2:
Normalized / QC’d instrument outputs
Quantifications, BQSR, …
Level 3:
Mappings against a model
Variants, Peptides, Regulation, Causation …
Level 4:
Interpretation
Clinical significance, diagnosis, ….
Levels 1 and 2 are
immutable. They are
statements of fact.
Level 3 and beyond are
model dependent. We
must interrogate both the
model and data.
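One way to carry this distinction into a data model, sketched under assumptions: measurements (Levels 1 and 2) are frozen records, and anything derived (Level 3 and beyond) must name the model and version that produced it. The class names, field names, and `store://` paths are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """Level 1 or 2: a statement of fact, never modified after capture."""
    sample_id: str
    level: int
    payload_uri: str


@dataclass(frozen=True)
class Derived:
    """Level 3 or beyond: only meaningful alongside the model that made it."""
    source: Measurement
    level: int
    model_name: str
    model_version: str
    result_uri: str


# Hypothetical sample: one set of reads, interpreted by two model versions.
reads = Measurement("sample-001", 1, "store://raw/sample-001.reads")
calls_v1 = Derived(reads, 3, "reference-model", "v1", "store://derived/sample-001.v1")
calls_v2 = Derived(reads, 3, "reference-model", "v2", "store://derived/sample-001.v2")

print(calls_v1.source is calls_v2.source)  # both point at the same raw data
```

Keeping the model version on every derived record is what lets old and new interpretations of the same raw data sit side by side.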
Predicting coat color in mice:
An LLM might be able to do the job, but you’ll
have no idea if it’s correct or not.
GWAS will show correlations, but it’ll open
more questions than it resolves.
Just use a Punnett square.
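For scale, a Punnett square for a single gene fits in a dozen lines. This is a sketch assuming one locus with a hypothetical dominant allele "B" and recessive allele "b"; real coat color involves more than one locus.

```python
from collections import Counter
from itertools import product


def punnett(parent_a: str, parent_b: str) -> Counter:
    """Count offspring genotypes for a single-gene cross, e.g. "Bb" x "Bb"."""
    offspring = Counter()
    for allele_a, allele_b in product(parent_a, parent_b):
        # Sort so that "bB" and "Bb" are counted as the same genotype.
        offspring["".join(sorted(allele_a + allele_b))] += 1
    return offspring


cross = punnett("Bb", "Bb")
for genotype, count in sorted(cross.items()):
    print(f"{genotype}: {count}/4")
```

Every cell of the square is visible and checkable by hand, which is the point of preferring the simplest model that does the job.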
What colour are your eyes?
Teaching the genetics of eye colour
& colour vision
Mackey, Eye 36, 704-715 (2022)
A lovely review of a genetics topic that
we all thought we learned in junior
high.
• Store primary data (Level 1) forever
• Support multiple versions of Level 3 (and
beyond) as models and context change
• Be cognizant of models and assumptions,
especially if you are outside of your expertise
• Parsimony still holds: The simplest model
that gets the job done is usually the best
one.
3. Business
Do not call the tortoise unworthy because she is not
something else. (Whitman, Song of Myself)
Fast
Good
Cheap
Early discovery / startup
Lots of consultants and
cloud
Move fast! Break stuff!
Clinical Trials
Internalize core functions
Compliance / quality
For Profit Operations
Internalize non-core
functions
“Cloud Hangover”
Technologists:
• Align with business priorities
• Nobody cares about the tech
Strategists:
• Translate business priorities into technology guidance
• Plan for success! Cost and quality will eventually
matter.
Executives:
• Give clear guidance.
4. Technology
The future is already here, it’s just not very well
distributed. (William Gibson)
2006: Amazon launched S3 and EC2
Shortly thereafter, I wrote a Perl script:
• Spin up an instance
• Install / configure / activate MySQL and Apache
• Populate MySQL by screen scraping various public websites
• Output a URL where I could see my formatted results
Over time, we did away with:
• The server and the standalone database (yay serverless!)
• Me running the script (yay Lambda!)
• The screen scraping (yay APIs!)
• Perl (good riddance!)
• Much of the HTML intermediation
Yes, I actually did replace myself with a small Perl script.
In 2013, I was occasionally
mystified about how best to
engage with a “cloud strategy”
“To be without method is
deplorable, but to depend
entirely on method is worse.”
The Mustard Seed Garden Manual of Painting, 1679
CLOUD
AI!!!
CLOUD
AI!!!
“Cloud hangover” was mostly a
realization that cloud technologies
did not change rental economics.
I suspect that “AI hangover” will
involve realizing that domain
expertise matters.
5. People
Everybody had a 1st day
Coach and develop junior people
Listen to them about new tech
“Largest survivable failure”
This is the only way to create
experts
By working 24 hours a day, one could
multiply themselves three times. To
multiply oneself more than three times,
the only recourse is to train others to
take over some of your work.
Hyman Rickover
Most investors are not domain
experts.
Investors flock to health care and life sciences
because we are a good bet (even now).
It is on us to understand their business
priorities and how those might shape
technology choices - especially in the very early
days.
It is on us as scientists, technologists, and
domain experts to guide, shape, and push
back on incorrect, unethical, and harmful asks.
People Matter
Human organizations are machines that require
constant supervision and tuning.
Unsupervised, they merely do what we made them
to do. In for-profit companies that means hoarding
resources and growing without limit.
If we want good outcomes for people, we need to
keep humans in the loop and make human
happiness and well-being explicit priorities of the
system.
This will not happen by accident, but it is not
impossible and we should push for it.
Thank
You

Editor's Notes

  • #21: Kitchen / Cloud metaphor.