Erin Shellman
DAML, July 27, 2016
Catching the most with
high-throughput screening
Zymergen provides
a platform for the
rapid improvement
of microbial strains
through genetic
engineering.
http://www.yourgenome.org/facts/what-is-genetic-engineering
What if you don’t know which
gene to perturb?
Reduce the system to its
constituent parts and
experiment on each part, one
at a time, until you’ve described
the causal mechanism.
…then publish a paper.
Try thousands of
things and see
what works.
High-throughput
screening (HTS) is a
process for evaluating
many simultaneous
hypotheses.
Screening goals
Tier 1 Screen: Minimize false negatives
Tier 2 Screen: Minimize false positives
Tank validation: Confirm that our best tier 2 strains perform well in the tank
Operational Implications
HTS poses two
unique challenges:
1. low sample size
2. high variance
Small sample size
This is largely intentional. We
know most things don’t work, so
why waste resources on a
gamble?
High variance
Experimental
complexity creates
opportunities for
injection of bias and
unwanted variance.
😓
• Many classical statistical methods
assume normality and common
variance—which we can’t assume.
• We need to be especially thoughtful in
designing our tests.
Real, unknown distributions
15% difference in means
When we take
measurements, we get
little bits of information
about what these shapes
might be.
But we don’t always arrive
at the right answer.
Base strain: 0.86, 0.98, 1.2 (mean = 1.01)
Candidate: 0.55, 0.87, 1.02 (mean = 0.81)
p-value = 0.31
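The p-value above can be reproduced from the six measurements with a two-sample t-test. This is a minimal sketch, assuming SciPy's `ttest_ind` with its default pooled-variance, two-sided setting; the slides do not say which variant was used.

```python
from scipy import stats

base_strain = [0.86, 0.98, 1.2]
candidate = [0.55, 0.87, 1.02]

mean_base = sum(base_strain) / len(base_strain)
mean_candidate = sum(candidate) / len(candidate)

# Two-sided, pooled-variance t-test (SciPy's default; an assumption here).
p_value = stats.ttest_ind(base_strain, candidate).pvalue

print(f"mean base strain = {mean_base:.2f}")
print(f"mean candidate   = {mean_candidate:.2f}")
print(f"p-value          = {p_value:.2f}")
```

With three replicates per strain, a gap of 0.2 between the sample means is not enough to reach a conventional significance threshold.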
How many measurements
should I take so I can sleep
knowing that I’ve made the
best possible promotion
decisions?
It depends.
What is the expected effect size, i.e. how different
do you think the strains are?
[Figure: two panels, 5% difference in means and 50% difference in means]
How variable are the measurement values?
Power analysis
• Power analysis is a method for estimating the
sample size required to detect changes at
assumed levels.
• Power is the probability of detecting a difference,
when a difference is present.
• We compute it through simulation.
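As a rough illustration of computing power through simulation, the sketch below draws many simulated experiments and counts how often a t-test flags the difference. The function name `estimate_power`, the normal data model, the shared sigma, and the example values are all assumptions made for illustration, not the talk's code.

```python
import numpy as np
from scipy import stats

def estimate_power(mu_base, mu_candidate, sigma, n, alpha=0.05,
                   n_sims=5000, seed=0):
    """Fraction of simulated experiments in which a t-test detects the difference."""
    rng = np.random.default_rng(seed)
    # Each row is one simulated experiment with n replicates per strain.
    base = rng.normal(mu_base, sigma, size=(n_sims, n))
    candidate = rng.normal(mu_candidate, sigma, size=(n_sims, n))
    p_values = stats.ttest_ind(base, candidate, axis=1).pvalue
    return float(np.mean(p_values < alpha))

# Hypothetical strains: 5% vs. 50% difference in means, 3 replicates each.
power_small = estimate_power(1.00, 1.05, sigma=0.10, n=3)
power_large = estimate_power(1.00, 1.50, sigma=0.10, n=3)
print(f"5% difference:  power = {power_small:.2f}")
print(f"50% difference: power = {power_large:.2f}")
```

Under these assumed values, the large effect is detected almost every time while the small one is mostly missed, which is why the answer to "how many measurements?" depends on effect size and variability.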
Power is a fixed parameter
The power threshold is set at 0.80,
meaning that if a real difference in means
exists and we run the same experiment
100 times, we can expect to detect it
in about 80 of those 100 runs.
Simulation study design
Tests: t-test, sum rank
Conditions: contamination, no contamination
t-test
• Parametric test, i.e. assumes that the data are normally distributed
• Sensitive to extreme values
sum rank
• Non-parametric test, i.e. makes no distributional assumptions
• Less sensitive to extreme values
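To make the difference in sensitivity concrete, here is a small sketch comparing the two tests on made-up data, with and without a single extreme value. It assumes "sum rank" refers to the Wilcoxon rank-sum test and uses SciPy's `ranksums`, which relies on a normal approximation; the measurements are invented for illustration.

```python
from scipy import stats

base = [0.80, 0.85, 0.90, 0.95, 1.00, 1.05]
mutant_clean = [1.10, 1.15, 1.20, 1.25, 1.30, 1.35]
# Same mutant, but one well reads far too high (a made-up extreme value).
mutant_extreme = [1.10, 1.15, 1.20, 1.25, 1.30, 5.00]

t_clean = stats.ttest_ind(base, mutant_clean).pvalue
t_extreme = stats.ttest_ind(base, mutant_extreme).pvalue
r_clean = stats.ranksums(base, mutant_clean).pvalue
r_extreme = stats.ranksums(base, mutant_extreme).pvalue

print(f"t-test:   clean p = {t_clean:.4f}, with extreme value p = {t_extreme:.4f}")
print(f"sum rank: clean p = {r_clean:.4f}, with extreme value p = {r_extreme:.4f}")
```

The extreme value inflates the variance estimate, so the t-test's p-value rises sharply, while the ranks, and therefore the rank-sum p-value, are unchanged.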
Initialize Strains: 𝝁basestrain, 𝝈basestrain, 𝝁mutant, 𝝈mutant
Initialize Campaign: basestrain, mutants, N, contamination rate, test
Simulate Data
Test for differences in means
×5000
power = # times diff detected / 5000
import numpy as np

N = range(3, 11)  # replicates per strain: 3 through 10
mu_ref = 0.80
sigma_ref = 0.30
mu_range = np.arange(0.80, 1.70, 0.10)
sigma_range = np.arange(0.05, 2, 0.10)

e.g.
reference.get_observations(3)
Out: array([ 0.96, 1.00, 1.28])
e.g.
mutant.get_observations(3)
Out: array([ 1.98, 1.60, 1.70])
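The simulation loop described above might look something like the following sketch. Only `get_observations`, `mu_ref`, and `sigma_ref` come from the slides; the `Strain` class internals, `run_campaign`, the mutant's parameters, and especially the contamination model (each measurement independently replaced by a uniform extreme value) are assumptions. "sum rank" is again assumed to mean SciPy's `ranksums`.

```python
import numpy as np
from scipy import stats

class Strain:
    """A strain with normally distributed measurements (hypothetical internals)."""
    def __init__(self, mu, sigma, rng):
        self.mu, self.sigma, self.rng = mu, sigma, rng

    def get_observations(self, n, contamination_rate=0.0):
        obs = self.rng.normal(self.mu, self.sigma, size=n)
        # Assumed contamination model: each measurement is independently
        # replaced, with probability contamination_rate, by an extreme value.
        hit = self.rng.random(n) < contamination_rate
        obs[hit] = self.rng.uniform(0.0, 5.0 * self.mu, size=int(hit.sum()))
        return obs

TESTS = {
    "t-test": lambda a, b: stats.ttest_ind(a, b).pvalue,
    "sum rank": lambda a, b: stats.ranksums(a, b).pvalue,
}

def run_campaign(reference, mutant, n, contamination_rate, test,
                 alpha=0.05, n_sims=5000):
    """power = # times a difference is detected / n_sims."""
    detected = 0
    for _ in range(n_sims):
        ref_obs = reference.get_observations(n, contamination_rate)
        mut_obs = mutant.get_observations(n, contamination_rate)
        if TESTS[test](ref_obs, mut_obs) < alpha:
            detected += 1
    return detected / n_sims

rng = np.random.default_rng(0)
reference = Strain(mu=0.80, sigma=0.30, rng=rng)  # mu_ref, sigma_ref from the slide
mutant = Strain(mu=1.40, sigma=0.30, rng=rng)     # hypothetical mutant within mu_range

for rate in (0.0, 0.05):
    for test in TESTS:
        # n_sims is reduced from 5000 to keep the example quick.
        power = run_campaign(reference, mutant, n=8, contamination_rate=rate,
                             test=test, n_sims=1000)
        print(f"contamination = {rate:.2f}  {test:8s}  power = {power:.2f}")
```

Sweeping `n` over `N` and the mutant's parameters over `mu_range` and `sigma_range` would fill in a full power grid for each test and contamination setting.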
A candidate strain would
have to show about 40%
improvement to be
detectable with 3 replicates
A candidate strain would
have to show about 15%
improvement to be
detectable with 10 replicates
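The trade-off between replicates and detectable improvement can also be checked analytically. The sketch below uses the noncentral t distribution to compute t-test power under ideal conditions (normal data, equal variance, no contamination), which is a swapped-in cross-check rather than the simulation the talk describes. `SIGMA` is hypothetical, since the slides do not state the variability behind their figures, so the printed numbers are not the talk's results.

```python
import numpy as np
from scipy import stats

def t_test_power(effect, sigma, n, alpha=0.05):
    """Analytic power of a two-sided, equal-variance t-test with n per group."""
    df = 2 * n - 2
    noncentrality = effect / (sigma * np.sqrt(2.0 / n))
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    # Probability that |T| exceeds the critical value under the alternative.
    return (stats.nct.sf(t_crit, df, noncentrality)
            + stats.nct.cdf(-t_crit, df, noncentrality))

def min_detectable_improvement(mu_ref, sigma, n, target_power=0.80):
    """Smallest whole-percent improvement over mu_ref detected with target power."""
    for pct in range(1, 301):
        if t_test_power(mu_ref * pct / 100, sigma, n) >= target_power:
            return pct
    return None  # not detectable within a 300% improvement

MU_REF = 0.80  # from the slides
SIGMA = 0.08   # hypothetical measurement standard deviation
for n in (3, 10):
    pct = min_detectable_improvement(MU_REF, SIGMA, n)
    print(f"{n} replicates: about {pct}% improvement needed")
```

Whatever sigma is assumed, the pattern matches the slide: more replicates bring smaller improvements within reach.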
No contamination
5% contamination
Results
• The presence of extreme values undermines
our ability to detect differences by effectively
decreasing N.
• We can make progress in the face of
extreme values by using non-parametric tests,
like sum rank, that perform equally well in
ideal conditions and better than the t-test in
typical conditions.
Adaptive experimental design?
[Figure: strain performance vs. project time]
1. ZERO TO MILLIGRAMS
Hits are big enough to detect with
low N.
2. MILLIGRAMS TO KILOGRAMS
Hit sizes shrink and become more
variable as the low-hanging fruit dries up.
3. KILOGRAMS TO COMMODITY
Hit sizes are at their smallest as we
approach the theoretical max.
Screening goals
Tier 1 Screen: Minimize false negatives
Tier 2 Screen: Minimize false positives
Tank validation: Confirm that our best tier 2 strains perform well in the tank
Operational Implications
Tier 1 Screen: many hypotheses, low N, low promotion threshold
Tier 2 Screen: fewer hypotheses, bigger N, higher promotion threshold
Hypothetical design
Zero to Milligrams
Tier 1: 4 replicates, p-value <= 0.10
Tier 2: 8 replicates, p-value <= 0.05
Milligrams to Kilograms
Tier 1: 6 replicates, p-value <= 0.10
Tier 2: 12 replicates, p-value <= 0.05
Kilograms to Commodity
Tier 1: 8 replicates, p-value <= 0.10
Tier 2: 16 replicates, p-value <= 0.05
Tank: Tank is truth.☝
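One way to encode a design like this is as a lookup table keyed by project phase and tier. The sketch below is illustrative only: the names `TierRule`, `HYPOTHETICAL_DESIGN`, and `promote` are invented here, and the rule ignores the direction of the effect, which a real promotion decision would also check.

```python
from typing import NamedTuple

class TierRule(NamedTuple):
    replicates: int
    max_p_value: float

# Replicates and p-value thresholds copied from the hypothetical design.
HYPOTHETICAL_DESIGN = {
    "zero to milligrams": {"tier 1": TierRule(4, 0.10), "tier 2": TierRule(8, 0.05)},
    "milligrams to kilograms": {"tier 1": TierRule(6, 0.10), "tier 2": TierRule(12, 0.05)},
    "kilograms to commodity": {"tier 1": TierRule(8, 0.10), "tier 2": TierRule(16, 0.05)},
}

def promote(phase, tier, p_value):
    """Return True if a strain's p-value clears the tier's promotion threshold."""
    rule = HYPOTHETICAL_DESIGN[phase][tier]
    return p_value <= rule.max_p_value

print(promote("zero to milligrams", "tier 1", 0.08))  # True
print(promote("zero to milligrams", "tier 2", 0.08))  # False
```

The same p-value passes the looser tier 1 threshold but not tier 2, reflecting the shift from minimizing false negatives to minimizing false positives.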
Thanks for listening!
Questions?

