My slides from the 2025 Bio-IT World Expo.
I tried to rise above the churn to find constants that an architect or strategist could use to make well-informed and durable technology choices.
Data and Computing Infrastructure for the Life Sciences
1. Data and Computing Infrastructure for the Life Sciences: Best Practices, Observations, and Lessons Learned
Chris Dwan
https://dwan.org
2. Agenda:
• Real world systems
• Power, cooling, and information transfer
• Molecules, Measurements, and Models
• Business
• Technology
• People
4. Research computing at CCGB
Center for Computational Genomics & Bioinformatics
University of Minnesota
2002
• 5 x Quad P-II 450MHz Dell (Rocks 3.1)
• 2 ‘G’ series TimeLogic Decypher boards (BLAST accelerators)
• Alphaserver E40 Quad Processor
• 7 x Dual P-III 1.4 GHz Dell (Rocks 3.1)
5. Research computing systems at NASA
Data Acquisition and Analysis Center (DAAC) for the Earth Science Technology Office (ESTO)
Langley Air Force Base
Hampton Roads, VA
2007
• 1.2PB GPFS (28 cabinets, 2,560 disk drives)
• 100TB XSan
• ~50 compute servers
6. New York Genome Center
Manhattan, NY
2012
• ~1PB Storage
• Commodity Compute
• More commodity compute
• GPUs
• Network
• More commodity compute
• And a bit of cloud
7. Broad Institute
Cambridge, MA
2015
• ~20PB file storage
• Another ~20PB file storage
• Enterprise services
• Private Cloud
• More Linux HPC
• 14k Cores
• Linux HPC
• Differentiated Hardware
• Secondary Data Center
• Quite a lot of cloud
8. Sema4
Clinical Genomic Screening and Diagnostics
Stamford, CT
2021
All cloud, except for the physical touch points with the lab, and one little network closet
10. 1. Power, Cooling, and Information Transfer
Electricity is expensive, cooling is a pain, and light moves really fast
11. 𝑊𝑎𝑡𝑡𝑠=𝐴𝑚𝑝𝑠×𝑉𝑜𝑙𝑡𝑠 Electrical power law (a companion to Ohm’s Law)
Watts: Amount of power (you get charged for kilowatt hours)
Amps: Capacity of the pipe (circuit breakers are in Amps)
Volts: “Pressure” in the pipe
Important Facts
Transmission of electrical power is much more efficient at higher voltages.
High voltages at the same amperage will deliver too much power and damage sensitive components.
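A minimal sketch of the arithmetic on this slide. The 30 A breaker, the 208 V and 120 V supplies, and the $0.15 per kWh price are illustrative assumptions, not figures from the talk.

```python
def circuit_watts(amps: float, volts: float) -> float:
    """Power available from a circuit: Watts = Amps x Volts."""
    return amps * volts


def monthly_energy_cost(watts: float, price_per_kwh: float, hours: float = 730.0) -> float:
    """Cost of a constant load over `hours` (730 is roughly one month)."""
    kilowatt_hours = watts / 1000.0 * hours
    return kilowatt_hours * price_per_kwh


# Hypothetical rack circuits: the same 30 A breaker at two voltages.
high_voltage_watts = circuit_watts(amps=30, volts=208)  # 6240 W
low_voltage_watts = circuit_watts(amps=30, volts=120)   # 3600 W

print(f"30 A at 208 V: {high_voltage_watts:.0f} W")
print(f"30 A at 120 V: {low_voltage_watts:.0f} W")

# Assumed electricity price of $0.15 per kWh, for illustration only.
cost = monthly_energy_cost(high_voltage_watts, price_per_kwh=0.15)
print(f"Running the 208 V circuit flat out for a month: ${cost:.2f}")
```

The breaker limits amps, the bill counts kilowatt hours, and raising the voltage is what lets the same breaker deliver more power.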
12. 𝑞=−𝑘∇𝑇 Fourier’s Law
q: Heat flux: how much energy moves.
k: Thermal conductivity of the material carrying the energy
∇T: Temperature gradient
Important Fact
k of liquid water is ~25 times greater than that of air.
k of metals is thousands of times greater than that of air.
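A rough sketch of Fourier's law in one dimension, comparing air, water, and a metal. The conductivity values are approximate room-temperature textbook figures, and the 40 K drop across a 1 cm layer is a made-up scenario. It covers conduction only and ignores convection, which dominates real air and liquid cooling.

```python
# Approximate room-temperature thermal conductivities in W/(m*K).
# Ballpark textbook values, included only for illustration.
THERMAL_CONDUCTIVITY = {"air": 0.026, "water": 0.6, "copper": 400.0}


def heat_flux(k: float, delta_t: float, thickness_m: float) -> float:
    """One-dimensional Fourier's law, q = -k * dT/dx, in W/m^2.

    delta_t is (far-side temperature - near-side temperature) across a
    layer of the given thickness. A cooler far side (negative delta_t)
    gives a positive flux away from the hot side.
    """
    return -k * (delta_t / thickness_m)


# Hypothetical case: a 40 K drop across a 1 cm layer of each material.
for name, k in THERMAL_CONDUCTIVITY.items():
    q = heat_flux(k, delta_t=-40.0, thickness_m=0.01)
    ratio = k / THERMAL_CONDUCTIVITY["air"]
    print(f"{name:>6}: {q:12.0f} W/m^2 ({ratio:,.0f}x air)")
```

For the same gradient, the flux scales directly with k, which is why moving heat through liquid or metal is so much easier than moving it through air.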
13. $C = B \log_2\left(1 + \frac{S}{N}\right)$ Shannon-Hartley Theorem
C: Channel capacity (b/sec)
B: Bandwidth (hertz)
S/N: Signal to noise ratio
Important Facts:
The channel capacity of just one strand of single mode fiber optic cable is functionally unlimited (2.4 Pb/sec).
Coast to coast latencies on the commercial internet are ~80ms.
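A small sketch of the two ideas on this slide: the Shannon-Hartley capacity formula, and the latency floor set by the speed of light in glass. The 1 MHz / 30 dB channel and the 4,000 km coast-to-coast path length are assumptions chosen for illustration. The refractive index of 1.47 is a typical value for silica fiber.

```python
import math

SPEED_OF_LIGHT_KM_PER_S = 299_792.458  # in vacuum


def db_to_linear(snr_db: float) -> float:
    """Convert a signal-to-noise ratio from decibels to a plain ratio."""
    return 10.0 ** (snr_db / 10.0)


def channel_capacity_bps(bandwidth_hz: float, snr_linear: float) -> float:
    """Shannon-Hartley: C = B * log2(1 + S/N), with S/N as a linear ratio."""
    return bandwidth_hz * math.log2(1.0 + snr_linear)


def fiber_round_trip_ms(distance_km: float, refractive_index: float = 1.47) -> float:
    """Lower bound on round-trip time over a fiber path of this length.

    Light in glass travels at c / refractive_index. Routing, queuing,
    and switching delays are ignored.
    """
    one_way_s = distance_km * refractive_index / SPEED_OF_LIGHT_KM_PER_S
    return 2.0 * one_way_s * 1000.0


# Hypothetical channel: 1 MHz of bandwidth at 30 dB SNR (about 10 Mb/s).
capacity = channel_capacity_bps(1e6, db_to_linear(30.0))
print(f"Capacity: {capacity / 1e6:.2f} Mb/s")

# Assumed 4,000 km straight-line path (about 39 ms round trip).
print(f"Round-trip floor: {fiber_round_trip_ms(4000):.1f} ms")
```

Under these assumptions the physical floor is the same order of magnitude as the ~80ms quoted above, so bandwidth can keep growing but latency cannot improve much.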
14. Computing infrastructure should be:
• Adjacent to cheap, renewable power
• Someplace where cooling is straightforward
• Close to high quality fiber
You probably do not need to put the
ARTIFICIAL INTELLIGENCE in your
15. 2. Molecules, Measurements, and Models
The underlying mechanisms of biology are pretty stable.
The models we use to understand them change constantly.
16. The size of the human genome hasn’t changed very much in ~$3 \times 10^5$ years:
Humans have 23 pairs of chromosomes with a total of $3.2 \times 10^9$ unique locations.
If a base call is 2 bits (four letter alphabet) and the quality scores are 8 bits, then we need 10 bits / base pair.
30x WGS = $30 \times 10 \times 3.2 \times 10^9$ bits
= $9.6 \times 10^{9+1+1}$ bits
= 960,000,000,000 bits
= 120,000,000,000 bytes (120 GB)
The reads to sequence a human genome to 30x, including 8-bit quality scores, are still about 120GB prior to compression and de-duplication.
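The same back-of-the-envelope calculation as a short sketch. The constants come from the slide. The function and variable names are made up for this example.

```python
GENOME_POSITIONS = 3_200_000_000  # unique locations in the human genome
BITS_PER_BASE_CALL = 2            # four letter alphabet
BITS_PER_QUALITY_SCORE = 8        # one 8-bit quality score per base
COVERAGE = 30                     # 30x whole genome sequencing


def raw_read_bytes(positions: int, coverage: int,
                   bits_per_call: int, bits_per_quality: int) -> int:
    """Uncompressed size of the reads, in bytes, before de-duplication."""
    bits_per_base = bits_per_call + bits_per_quality
    total_bits = coverage * bits_per_base * positions
    return total_bits // 8


size = raw_read_bytes(GENOME_POSITIONS, COVERAGE,
                      BITS_PER_BASE_CALL, BITS_PER_QUALITY_SCORE)
print(f"{size:,} bytes = {size / 1e9:.0f} GB")
```

Changing the coverage or the quality score width shows how directly the storage estimate follows from a few stable numbers.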
17. Level 1: Raw instrument outputs
Reads, gel images, …
Level 2: Normalized / QC’d instrument outputs
Quantifications, BQSR, …
Level 3: Mappings against a model
Variants, Peptides, Regulation, Causation, …
Level 4: Interpretation
Clinical significance, diagnosis, …
Levels 1 and 2 are immutable. They are statements of fact.
Level 3 and beyond are model dependent. We must interrogate both the model and data.
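One way the four levels might be encoded in a data catalog, as a hedged sketch. The class names, field names, and sample file names are hypothetical, and the reference build label is only an example of a model identifier. The idea is that every record is immutable, and any Level 3 or higher product must say which model it was computed against.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class DataLevel(IntEnum):
    RAW = 1          # raw instrument outputs
    NORMALIZED = 2   # normalized / QC'd instrument outputs
    MAPPED = 3       # mappings against a model
    INTERPRETED = 4  # interpretation


@dataclass(frozen=True)  # frozen: records cannot be edited after creation
class DataProduct:
    name: str
    level: DataLevel
    model_version: Optional[str] = None  # hypothetical field, e.g. a reference build

    def __post_init__(self) -> None:
        # Level 3 and beyond are model dependent, so the model must be recorded.
        if self.level >= DataLevel.MAPPED and self.model_version is None:
            raise ValueError("Level 3+ products must record their model version")

    @property
    def is_model_dependent(self) -> bool:
        return self.level >= DataLevel.MAPPED


# Hypothetical records for one sample.
reads = DataProduct("sample1.reads", DataLevel.RAW)
variants = DataProduct("sample1.variants", DataLevel.MAPPED, model_version="GRCh38")

print(reads.is_model_dependent, variants.is_model_dependent)
```

Keeping the model version on the record makes it possible to hold several Level 3 results for the same Level 1 data side by side as models change.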
18. Predicting coat color in mice:
An LLM might be able to do the job, but you’ll have no idea if it’s correct or not.
GWAS will show correlations, but it’ll open more questions than it resolves.
Just use a Punnett square.
What colour are your eyes? Teaching the genetics of eye colour & colour vision
Mackey, Eye 36, 704-715 (2022)
A lovely review of a genetics topic that we all thought we learned in junior high.
19. • Store primary data (Level 1) forever
• Support multiple versions of Level 3 (and beyond) as models and context change
• Be cognizant of models and assumptions, especially if you are outside of your expertise
• Parsimony still holds: The simplest model that gets the job done is usually the best one.
20. 3. Business
Do not call the tortoise unworthy because she is not something else. (Whitman, Song of Myself)
21. Fast, Good, Cheap
Early discovery / startup:
• Lots of consultants and cloud
• Move fast! Break stuff!
Clinical Trials:
• Internalize core functions
• Compliance / quality
For Profit Operations:
• Internalize non-core functions
• “Cloud Hangover”
22. Technologists:
• Align with business priorities
• Nobody cares about the tech
Strategists:
• Translate business priorities into technology guidance
• Plan for success! Cost and quality will eventually matter.
Executives:
• Give clear guidance.
24. 2006: Amazon launched S3 and EC2
Shortly thereafter, I wrote a Perl script:
• Spin up an instance
• Install / configure / activate MySQL and Apache
• Populate MySQL by screen scraping various public websites
• Output a URL where I could see my formatted results
Over time, we did away with:
• The server and the standalone database (yay serverless!)
• Me running the script (yay Lambda!)
• The screen scraping (yay APIs!)
• Perl (good riddance!)
• Much of the HTML intermediation
Yes, I actually did replace myself with a small Perl script.
25. In 2013, I was occasionally mystified about how best to engage with a “cloud strategy”
26. “To be without method is deplorable, but to depend entirely on method is worse.”
The Mustard Seed Garden Manual of Painting, 1679
CLOUD
AI!!!
27. “Cloud hangover” was mostly a realization that cloud technologies did not change rental economics.
I suspect that “AI hangover” will involve realizing that domain expertise matters.
29. Everybody had a 1st day
Coach and develop junior people
Listen to them about new tech
“Largest survivable failure”
This is the only way to create experts
By working 24 hours a day, one could multiply themselves three times. To multiply oneself more than three times, the only recourse is to train others to take over some of your work.
Hyman Rickover
30. Most investors are not domain experts.
Investors flock to health care and life sciences because we are a good bet (even now).
It is on us to understand their business priorities and how those might shape technology choices, especially in the very early days.
It is on us as scientists, technologists, and domain experts to guide, shape, and push back on incorrect, unethical, and harmful asks.
31. People Matter
Human organizations are machines that require constant supervision and tuning.
Unsupervised, they merely do what we made them to do. In for-profit companies that means hoarding resources and growing without limit.
If we want good outcomes for people, we need to keep humans in the loop and make human happiness and well-being explicit priorities of the system.
This will not happen by accident, but it is not impossible and we should push for it.