STATISTICAL
LITERACY FOR
DEEP TECH
Noel Jee
Think in distributions
Part 1
Noel Jee, PhD
https://medium.com/mytake/understanding-different-types-of-distributions-you-will-encounter-as-a-data-scientist-27ea4c375eec
Disclaimers
• All examples are made up or blinded
• None of this is investment advice
• All views are my own
• Everything is illustrative
• This is not a statistics class
Credentials
• PhD in something science-y
• Took that one STAT400 class in college
• 4 years of evaluating startups
• Until last week, worked at
Research from top
institutions
Combined team experience
of over 67 years!
Changing the world through
new biology
For illustrative purposes only
Healthy Disease
Marker                     n      AUC    Indication
Banana factor + DNA        51     1.00   Breast Cancer
Banana factor + RNA        80     0.97   Breast Cancer
Banana factor + Protein    369    0.94   Breast Cancer
Banana factor + DNA        87     0.98   Colon Cancer
Banana factor + Protein    46     0.97   Colon Cancer
Banana factor              36     0.89   Systemic Lupus Erythematosus
Amazing numbers can be deceiving
AUC => ROC => Sensitivity/Specificity
Most data can be viewed as distributions
Thinking in distributions
is a framework – not
always applicable but
very useful when it is
uniform
exponential
log
Distributions explain sens/spec tradeoffs
True positives
Healthy Diseased
True negatives
FN FP
Technologies change
distributions to
decrease tradeoffs
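A minimal sketch of the tradeoff, assuming the healthy and diseased populations are two overlapping normal distributions of some biomarker. The means, spreads, and thresholds are hypothetical, chosen only to illustrate the picture; `sens_spec` is an illustrative helper, not something from the talk.

```python
from statistics import NormalDist

# Hypothetical biomarker distributions (illustrative values only).
healthy = NormalDist(mu=0.0, sigma=1.0)
diseased = NormalDist(mu=2.0, sigma=1.0)


def sens_spec(threshold, healthy_dist, diseased_dist):
    """Call a sample 'diseased' when its value is above the threshold."""
    sensitivity = 1.0 - diseased_dist.cdf(threshold)  # diseased above the cutoff (true positives)
    specificity = healthy_dist.cdf(threshold)         # healthy below the cutoff (true negatives)
    return sensitivity, specificity


# Moving the cutoff trades one metric for the other.
for t in (0.0, 1.0, 2.0):
    sens, spec = sens_spec(t, healthy, diseased)
    print(f"threshold={t:.1f}  sensitivity={sens:.3f}  specificity={spec:.3f}")

# A technology that pulls the distributions apart reduces the overlap,
# so a cutoff between them gives up less of either metric.
better_diseased = NormalDist(mu=4.0, sigma=1.0)
sens_b, spec_b = sens_spec(2.0, healthy, better_diseased)
print(f"better separation: sensitivity={sens_b:.3f}  specificity={spec_b:.3f}")
```

Raising the threshold lowers sensitivity and raises specificity; only changing the distributions themselves improves both at once.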
The ROC and the AUC: the AUROC
Banana factor + Protein, n = 369, AUC 0.94, Breast Cancer
Banana factor + Protein, n = 46, AUC 0.97, Colon Cancer
AUC can be deceptive: look at the shape of the ROC curve
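One way to see why shape matters is to build two ROC curves by hand. The sketch below assumes eight made-up samples per test (not data from the talk); `roc_points`, `auc`, and `best_tpr_at` are hypothetical helper names. Both toy tests end up with the same AUC but behave differently when no false positives are allowed.

```python
def roc_points(scores, labels):
    """Sweep the threshold from high to low and collect (FPR, TPR) points."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        # Samples with identical scores cross the threshold together.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points


def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def best_tpr_at(points, max_fpr):
    """Highest sensitivity reachable without exceeding the given false positive rate."""
    return max(tpr for fpr, tpr in points if fpr <= max_fpr)


# Made-up scores; 1 = diseased, 0 = healthy.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
labels_a = [1, 1, 0, 1, 1, 0, 0, 0]
labels_b = [1, 1, 1, 0, 0, 1, 0, 0]

roc_a = roc_points(scores, labels_a)
roc_b = roc_points(scores, labels_b)
print("AUC A:", auc(roc_a), " AUC B:", auc(roc_b))
print("Sensitivity at 0% FPR  A:", best_tpr_at(roc_a, 0.0), " B:", best_tpr_at(roc_b, 0.0))
```

With identical AUCs, test B catches more cases at zero false positives, while test A reaches full sensitivity at a lower false positive rate. Which shape is preferable depends on the intended use.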
Truly good and too good to be true
Deconstructing the data into components
is necessary, but not sufficient
Banana factor + RNA
n = 80
AUC 0.97
Breast Cancer
Additional questions
• What’s the current and assumed future standard [of care]?
• What’s already known [in the literature]?
• Etc.
Then…
Dig into the sample (n = 80)
Healthy vs diseased? Disease stage? Age? Prior treatments? Method of diagnosis? Region?
Beyond sensitivity and specificity
93.8% sensitivity, 95.6% specificity
53% PPV, 99.7% NPV
Great numbers can still lead to a useless test; it’s important to understand the whole picture
https://step1.medbullets.com/stats/101006/testing-and-screening
https://www.medmastery.com/guide/covid-19-clinical-guide/covid-19-test-validity-how-accurate-are-available-tests
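The gap between a 93.8% sensitivity and a 53% PPV comes from prevalence. A minimal sketch, assuming a disease prevalence of 5%: the slide does not state the prevalence, but 5% happens to reproduce its PPV and NPV figures. `predictive_values` is an illustrative helper.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied to a population with the given prevalence."""
    true_pos = sensitivity * prevalence
    false_neg = (1.0 - sensitivity) * prevalence
    true_neg = specificity * (1.0 - prevalence)
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    ppv = true_pos / (true_pos + false_pos)  # chance a positive result is a real case
    npv = true_neg / (true_neg + false_neg)  # chance a negative result is truly healthy
    return ppv, npv


# Sensitivity and specificity from the slide; the prevalences are assumptions.
for prevalence in (0.01, 0.05, 0.20, 0.50):
    ppv, npv = predictive_values(0.938, 0.956, prevalence)
    print(f"prevalence={prevalence:.0%}  PPV={ppv:.1%}  NPV={npv:.1%}")
```

The same test looks very different depending on how common the condition is in the population being tested, which is why sensitivity and specificity alone do not describe usefulness.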
A COVID story
DOI: 10.3389/fmicb.2020.01818
[Figure labels: PCR, Antigen, Antibody]
Antigen test attributes
• 80% sensitivity
• 15-30 min turnaround time (TAT)
• < 50% of PCR cost
Sensitivity: ~80% of patients who tested positive with PCR
Sample: symptomatic patients with COVID-19
Including asymptomatic patients reduced the sensitivity to below 40%
“True positive” is a relative phrase; always note the baseline comparison
Sometimes numbers just mean nothing
Combined team experience of over 67 years!
Accuracy = (True positives + True negatives) / All tests
90% accuracy could mean nothing
https://investor.aclaristx.com/static-files/a47d4b59-da7b-417a-b62f-02ac4210dea7
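A quick illustration of how 90% accuracy can mean nothing, assuming a hypothetical cohort of 900 healthy and 100 diseased people and a "test" that simply calls everyone healthy. The cohort and helper names are made up for the example.

```python
def accuracy(tp, tn, fp, fn):
    """(True positives + true negatives) / all tests."""
    return (tp + tn) / (tp + tn + fp + fn)


def sensitivity(tp, fn):
    """Share of diseased people the test actually catches."""
    return tp / (tp + fn)


# Hypothetical cohort: 100 diseased, 900 healthy; the test flags nobody.
tp, fn = 0, 100   # every diseased person is missed
tn, fp = 900, 0   # every healthy person is "correctly" cleared

acc = accuracy(tp, tn, fp, fn)
sens = sensitivity(tp, fn)
print(f"accuracy={acc:.0%}  sensitivity={sens:.0%}")
```

When one class dominates the sample, accuracy mostly reports the class balance rather than the quality of the test.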
“There are three kinds of lies:
lies, damned lies,
and statistics”
Mark Twain*
* who was wrongly quoting someone else – sometimes words are deceptive too
Impact.Tech "Statistical Literacy for Deep Tech"
Deconstruct everything
Part 2
Noel Jee, PhD
Defeating aging through big data
Using AI … big data … unique signature of bok bok 99.9% accuracy … novel drugs bok bok capture $100B total addressable market
Not a real company
“AI” – What do you want it to mean?
Statistical regression
R² = explained variation / total variation
Confidence interval: the range of values where the true mean likely lies
Linear vs. non-linear
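Both quantities are easy to compute by hand. The sketch below assumes a straight-line least-squares fit and five made-up data points; the confidence interval uses a normal approximation (a t-distribution would be more appropriate for a sample this small). `r_squared` and `mean_confidence_interval` are illustrative names.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def r_squared(x, y):
    """Explained variation / total variation for a least-squares straight line."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    total = sum((b - my) ** 2 for b in y)
    explained = sum((intercept + slope * a - my) ** 2 for a in x)
    return explained / total


def mean_confidence_interval(values, level=0.95):
    """Normal-approximation interval for the true mean (an assumption; see note above)."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * stdev(values) / sqrt(len(values))
    centre = mean(values)
    return centre - half_width, centre + half_width


# Made-up measurements for illustration.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r2 = r_squared(x, y)
low, high = mean_confidence_interval(y)
print(f"R^2 = {r2:.4f}")
print(f"95% CI for the mean of y: ({low:.2f}, {high:.2f})")
```

A high R² on five points says little on its own; the width of the interval is a reminder of how much the small sample leaves undetermined.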
Machine learning
sample → feature extraction → classification → output
Classes: Healthy, Disease
Binary: healthy, disease
Ordinal: mild, moderate, severe
Nominal: lupus, Alzheimer’s, cancer
Logistic regression
Start with the output and input before digging into the “AI”
What goes in?
The output: 99.9% sensitivity / specificity
Output looks great, let’s move on to the input
“Over 2,000 micro RNAs and 20,000 genes were fed into our proprietary algorithms”
“We trained our algorithms on 200 patient samples”
“Our algorithm was validated on a public dataset of 2,000,000 individuals”
When is bigger not better?
“Over 2,000 micro RNAs and 20,000 genes were fed into our proprietary algorithms”
More variables may lead to overfitting, may be impractical, and are almost certainly biased
DOI: 10.1093/ofid/ofv093
https://brainstation.io/blog/3-ux-tips-from-the-worlds-first-ux-designer
Heatmap of gene expression
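A toy demonstration of overfitting from too many variables, assuming a scaled-down setting of 40 samples and 2,000 candidate features where every value is a coin flip, so no feature carries real signal. The sizes, seed, and function name are hypothetical, and picking the single best-matching feature stands in for a real model fit.

```python
import random


def best_noise_feature(n_train=40, n_features=2000, n_test=5000, seed=0):
    """Pick the random feature that best matches random labels, then test it on new samples."""
    rng = random.Random(seed)
    y_train = [rng.randint(0, 1) for _ in range(n_train)]

    # "Training": keep the feature that agrees most often with the labels.
    best_train_acc = 0.0
    for _ in range(n_features):
        feature = [rng.randint(0, 1) for _ in range(n_train)]
        acc = sum(f == y for f, y in zip(feature, y_train)) / n_train
        best_train_acc = max(best_train_acc, acc)

    # "Validation": on new samples the chosen feature is still pure noise,
    # so its values are drawn independently of the new labels.
    y_test = [rng.randint(0, 1) for _ in range(n_test)]
    f_test = [rng.randint(0, 1) for _ in range(n_test)]
    test_acc = sum(f == y for f, y in zip(f_test, y_test)) / n_test
    return best_train_acc, test_acc


train_acc, test_acc = best_noise_feature()
print(f"training accuracy of best feature: {train_acc:.0%}")
print(f"accuracy on fresh samples:         {test_acc:.0%}")
```

With far more variables than samples, something will always look predictive in training. Only performance on data the selection never saw says whether it is real.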
Where might bias be hiding?
Whole genome: 6 billion base pairs
99% accurate sequencer
60 million errors (1% of 6 billion)
1% error is biased, not random
Bias is intrinsic to each technique
Where might bias be hiding?
DOI: 10.1093/ofid/ofv093
20,000 genes
30% captured
70% not captured
May only be 2,000 genes but contain most relevant genes
May not be relevant
Even bias might not be a dealbreaker, look for consistency
When are great samples not great?
200 doctor-annotated samples were used to train our algorithms
Patients are not representative of the targeted population
Googled some celebrity images
One doctor diagnosed the entire sample population, introducing bias for the “true positive”
Great data can still be garbage; context is key
When are great samples not great?
“But it’s not why the polls were wrong. It just isn’t. People
tell the truth when you ask them who they’re voting for.
They really do, on average. The reason why the polls
are wrong is because the people who were answering
these surveys were the wrong people.”
David Shor (Vox interview)
How do I know it’s real?
Validation methods have hard tradeoffs for startups
[Figure labels: Aging, Ageless]
Retrospective
• Quality of data
• Sample size
• Sample breakdown
• Blinding
• Training vs validation
Prospective
• Recruitment centers
• Trial arms
• Statistical power
• Endpoints
• Duration
https://everydaypsych.com/scientific-survey-results-blog-awesome/
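Statistical power is one place where the tradeoff becomes concrete. A rough sketch, assuming a two-arm trial comparing means with a two-sided test and using the standard normal-approximation formula; the effect sizes are hypothetical standardized differences and `samples_per_arm` is an illustrative name.

```python
from math import ceil
from statistics import NormalDist


def samples_per_arm(effect_size, alpha=0.05, power=0.80):
    """Approximate participants needed per arm to detect a standardized mean difference."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)  # two-sided significance cutoff
    z_power = z.inv_cdf(power)              # quantile for the desired power
    return ceil(2.0 * (z_alpha + z_power) ** 2 / effect_size ** 2)


# Hypothetical effect sizes: the smaller the effect, the larger the trial.
for d in (0.8, 0.5, 0.2):
    print(f"effect size {d}: about {samples_per_arm(d)} participants per arm")
```

Required enrollment grows with the inverse square of the effect size, which is a large part of why prospective validation is expensive and slow for a startup.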
Find Relevance
Part 3
Noel Jee, PhD
Significance and the pesky p-value
Possible outcomes: large p, small p
Presented p values will almost always be small; when they’re not, they’re usually hidden in supplemental / backup data
Significance doesn’t imply relevance
[Figure: axes are Significance (decreasing p-value) and Effect size; two regions are labeled “Meh”]
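A small numerical illustration, assuming a two-arm comparison of means with known unit variance (a z-test). The effect sizes and sample sizes are made up, and `two_sided_p` is an illustrative helper.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(effect_size, n_per_arm):
    """Two-sided p-value for a standardized mean difference between two equal arms."""
    z = effect_size / sqrt(2.0 / n_per_arm)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# A negligible effect measured on a huge sample...
p_tiny_effect = two_sided_p(effect_size=0.02, n_per_arm=100_000)
# ...versus a sizeable effect measured on a handful of subjects.
p_large_effect = two_sided_p(effect_size=0.5, n_per_arm=10)

print(f"tiny effect, n=100,000 per arm: p = {p_tiny_effect:.2g}")
print(f"large effect, n=10 per arm:     p = {p_large_effect:.2g}")
```

The negligible effect comes out highly significant and the sizeable one does not. The p-value mixes effect size with sample size, so it cannot stand in for relevance.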
When significance is assumed relevant
[Figure: share price chart, 7/1/20 to 3/8/21, y-axis $0 to $30. Annotations: Data Readout ~$6; ATI-450 P2a ~$19; Today ~$25; “Left 4 Dead”]
More than 3x jump in share price overnight. Why?
Significant jump on significant data
https://investor.aclaristx.com/static-files/a47d4b59-da7b-417a-b62f-02ac4210dea7
Relevance isn’t found in significance
• Randomization led to 2 control subjects
• Control group was ~10 years younger
Significance can add little benefit to assessing risks
https://investor.aclaristx.com/static-files/a47d4b59-da7b-417a-b62f-02ac4210dea7
Getting started with technical diligence
Catalogue data into a ‘Rumsfeld Matrix’
Dissect the key data from the pitch
Understand experimental methods
Prioritize and “read” related papers
Talk to people who know more than you
Ask yourself if it makes sense
How to read journal articles
1. Read the title and abstract
2. Dive into the figures (data)
3. Check out the methods
4. Look at the supplemental data
If everything checks out, read the paper (or don’t)
Reading list
linkedin.com/in/noeljee
Impact.Tech "Statistical Literacy for Deep Tech"

  • 2. Think in distributions Part 1 Noel Jee, PhD https://medium.com/mytake/understanding-different-types-of-distributions-you-will-encounter-as-a-data-scientist-27ea4c375eec
  • 3. Noel Jee, PhD Disclaimers • All examples are made up or blinded • None of this is investment advice • All views are my own • Everything is illustrative • This is not a statistics class Credentials • PhD in something science-y • Took that one STAT400 class in college • 4 years of evaluating startups • Until last week, worked at
  • 4. Research from top institutions Combined team experience of over 67 years! Changing the world through new biology
  • 5. Research from top institutions Combined team experience of over 67 years! Changing the world through new biology
  • 6. For illustrative purposes only Healthy Disease Research from top institutions Combined team experience of over 67 years! Changing the world through new biology
  • 7. Banana factor + DNA n = 51 AUC 1.00 Breast Cancer Banana factor + RNA n = 80 AUC 0.97 Breast Cancer Banana factor + Protein n = 369 AUC 0.94 Breast Cancer Banana factor + DNA n = 87 AUC 0.98 Colon Cancer Banana factor + Protein n = 46 AUC 0.97 Colon Cancer Banana factor n = 36 AUC 0.89 Systemic Lupus Erythematosus Amazing numbers can be deceiving AUC => ROC => Sensitivity/Specificity
  • 8. Most data can be viewed as distributions
  • 9. Most data can be viewed as distributions uniform exponential log
  • 10. Most data can be viewed as distributions
  • 11. Most data can be viewed as distributions Thinking in distributions is a framework – not always applicable but very useful when it is uniform exponential log
  • 12. Distributions explain sens/spec tradeoffs Healthy Diseased
  • 13. Distributions explain sens/spec tradeoffs True positives Healthy Diseased True negatives FN FP
  • 14. Distributions explain sens/spec tradeoffs True positives Healthy Diseased True negatives FN FP
  • 15. Distributions explain sens/spec tradeoffs True positives Healthy Diseased True negatives FN FP Technologies change distributions to decrease tradeoffs
  • 16. The ROC and the AUC – the AUROC
  • 17. The ROC and the AUC – the AUROC Banana factor + Protein n = 369 AUC 0.94 Breast Cancer Banana factor + Protein n = 46 AUC 0.97 Colon Cancer
  • 18. The ROC and the AUC – the AUROC AUC can be deceptive – look at the shape of the ROC curve
  • 19. Truly good and too good to be true Banana factor + RNA n = 80 AUC 0.97 Breast Cancer
  • 20. Truly good and too good to be true Banana factor + RNA n = 80 AUC 0.97 Breast Cancer
  • 21. Truly good and too good to be true Banana factor + RNA n = 80 AUC 0.97 Breast Cancer Deconstructing the data into components is necessary, but not sufficient
  • 22. Truly good and too good to be true Deconstructing the data into components is necessary, but not sufficient Banana factor + RNA n = 80 AUC 0.97 Breast Cancer Additional questions • What’s the current and assumed future standard [of care]? • What’s already known [in the literature]? • Etc. Then… Dig into the sample n = 80 Healthy vs diseased? Disease stage? Age? Prior treatments? Method of diagnosis? Region?
  • 23. Beyond sensitivity and specificity https://step1.medbullets.com/stats/101006/testing-and-screening https://www.medmastery.com/guide/covid-19-clinical-guide/covid-19-test-validity-how-accurate-are-available-tests
  • 24. Beyond sensitivity and specificity https://step1.medbullets.com/stats/101006/testing-and-screening https://www.medmastery.com/guide/covid-19-clinical-guide/covid-19-test-validity-how-accurate-are-available-tests
  • 25. Beyond sensitivity and specificity https://step1.medbullets.com/stats/101006/testing-and-screening https://www.medmastery.com/guide/covid-19-clinical-guide/covid-19-test-validity-how-accurate-are-available-tests 93.8% sensitivity 95.6% specificity
  • 26. Beyond sensitivity and specificity https://step1.medbullets.com/stats/101006/testing-and-screening https://www.medmastery.com/guide/covid-19-clinical-guide/covid-19-test-validity-how-accurate-are-available-tests 93.8% sensitivity 95.6% specificity 53% PPV 99.7% NPV
  • 27. Beyond sensitivity and specificity https://step1.medbullets.com/stats/101006/testing-and-screening Great numbers can still lead to a useless test – it’s important to understand the whole picture https://www.medmastery.com/guide/covid-19-clinical-guide/covid-19-test-validity-how-accurate-are-available-tests 93.8% sensitivity 95.6% specificity 53% PPV 99.7% NPV
  • 28. A COVID story DOI: 10.3389/fmicb.2020.01818 PCR Antigen Antibody
  • 29. A COVID story Antigen test attributes • 80% sensitivity • 15-30 min TAT • < 50% of PCR cost PCR Antigen Antibody
  • 30. A COVID story DOI: 10.3389/fmicb.2020.01818 Antigen test attributes • 80% sensitivity • 15-30 min TAT • < 50% of PCR cost PCR Antigen Antibody Sensitivity ~80% of patients who tested positive with PCR “True positive” is a relative phrase; always note the baseline comparison
  • 31. A COVID story Antigen test attributes • 80% sensitivity • 15-30 min TAT • < 50% of PCR cost PCR Antigen Antibody Sensitivity ~80% of patients who tested positive with PCR Sample Symptomatic patients with COVID-19 “True positive” is a relative phrase; always note the baseline comparison Including asymptomatic patients reduced the sensitivity below 40%
  • 32. Sometimes numbers just mean nothing Accuracy = True positives + True negatives All tests 90% accuracy could mean nothing
  • 33. Sometimes numbers just mean nothing Combined team experience of over 67 years! Accuracy = True positives + True negatives All tests 90% accuracy could mean nothing
  • 34. Sometimes numbers just mean nothing Combined team experience of over 67 years! Accuracy = True positives + True negatives All tests 90% accuracy could mean nothing https://investor.aclaristx.com/static-files/a47d4b59-da7b-417a-b62f-02ac4210dea7
  • 35. “There are three kinds of lies: lies, damned lies, and statistics” Mark Twain* * who was wrongly quoting someone else – sometimes words are deceptive too
  • 40. Using AI … big data … unique signature of bok bok 99.9% accuracy … novel drugs bok bok capture $100B total addressable market Not a real company Defeating aging through big data
  • 41. “AI” – What do you want it to mean? Statistical regression R2 = explained variation total variation Confidence interval: the range of values where the true mean likely lies Linear Non-linear
  • 42. “AI” – What do you want it to mean? Healthy Disease Machine learning sample feature extraction classification output Binary – healthy, disease Ordinal – mild, moderate, severe Nominal – lupus, alzheimer’s, cancer Logistic regression
  • 43. “AI” – What do you want it to mean? Healthy Disease Machine learning sample feature extraction classification output
  • 44. “AI” – What do you want it to mean? Healthy Disease Machine learning sample feature extraction classification output Start with the output and input before digging into the “AI”
  • 45. What’s goes in? The output: 99.9% sensitivity / specificity Output looks great, let’s move on to the input
  • 46. What’s goes in? The output: 99.9% sensitivity / specificity “Over 2,000 micro RNAs and 20,000 genes were fed into our proprietary algorithms” “We trained our algorithms on 200 patient samples” Output looks great, let’s move on to the input “Our algorithm was validated on a public dataset of 2,000,000 individuals”
  • 47. When is bigger not better? DOI: 10.1093/ofid/ofv093 Heatmap of gene expression “Over 2,000 micro RNAs and 20,000 genes were fed into our proprietary algorithms”
  • 48. When is bigger not better? DOI: 10.1093/ofid/ofv093 Overfitting More variables may lead to Heatmap of gene expression
  • 49. When is bigger not better? DOI: 10.1093/ofid/ofv093 https://brainstation.io/blog/3-ux-tips-from-the-worlds-first-ux-designer Overfitting Impractical More variables may lead to may be Heatmap of gene expression
  • 50. When is bigger not better? DOI: 10.1093/ofid/ofv093 https://brainstation.io/blog/3-ux-tips-from-the-worlds-first-ux-designer Overfitting Impractical More variables may lead to may be and is almost certainly biased Heatmap of gene expression
  • 51. Where might bias be hiding? 6 billion base pairs 99% accurate sequencer 60 million errors Whole genome
  • 52. Where might bias be hiding? 6 billion base pairs 99% accurate sequencer 60 million errors Whole genome 1% error is biased, not random Bias is intrinsic to each technique
  • 53. Where might bias be hiding? DOI: 10.1093/ofid/ofv093
  • 54. Where might bias be hiding? 20,000 genes: 30% captured, 70% not captured. DOI: 10.1093/ofid/ofv093
  • 55. Where might bias be hiding? 20,000 genes: 30% captured, 70% not captured. May only be 2,000 genes but contain most relevant genes. May not be relevant. DOI: 10.1093/ofid/ofv093
  • 56. Where might bias be hiding? 20,000 genes: 30% captured, 70% not captured. May only be 2,000 genes but contain most relevant genes. May not be relevant. Even bias might not be a dealbreaker, look for consistency. DOI: 10.1093/ofid/ofv093
  • 57. When are great samples not great? 200 doctor-annotated samples were used to train our algorithms
  • 58. When are great samples not great? “200 doctor-annotated samples were used to train our algorithms.” Patients are not representative of the targeted population. Googled some celebrity images. One doctor diagnosed the entire sample population, introducing bias for the “true positive”
  • 59. When are great samples not great? “200 doctor-annotated samples were used to train our algorithms.” Patients are not representative of the targeted population. Googled some celebrity images. One doctor diagnosed the entire sample population, introducing bias for the “true positive”. Great data can still be garbage – context is key
  • 60. When are great samples not great? “But it’s not why the polls were wrong. It just isn’t. People tell the truth when you ask them who they’re voting for. They really do, on average. The reason why the polls are wrong is because the people who were answering these surveys were the wrong people.” David Shor (Vox interview)
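A small worked example of the point in the quote: every respondent answers truthfully, yet the estimate is wrong because the respondents are the wrong mix of people. The two groups, their sizes, their support levels, and their response rates are all made-up assumptions chosen only to show the mechanism.

```python
# Hypothetical electorate: every number here is an illustrative assumption.
groups = {
    "group_a": {"population_share": 0.40, "support": 0.70, "response_rate": 0.20},
    "group_b": {"population_share": 0.60, "support": 0.40, "response_rate": 0.05},
}

# True support: weight each group by its share of the population.
true_support = sum(g["population_share"] * g["support"] for g in groups.values())

# Poll estimate: weight each group by its share of the people who responded.
respondents = {
    name: g["population_share"] * g["response_rate"] for name, g in groups.items()
}
total_respondents = sum(respondents.values())
poll_estimate = sum(
    respondents[name] / total_respondents * g["support"]
    for name, g in groups.items()
)

print(f"true support:  {true_support:.1%}")
print(f"poll estimate: {poll_estimate:.1%}")
```

In this toy setup the poll overstates support by roughly ten points without anyone lying. The same mechanism applies to patient samples that do not match the population a test is meant for.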
  • 61. How do I know it’s real? Validation methods have hard tradeoffs for startups Retrospective Prospective Aging Ageless
  • 62. How do I know it’s real? Validation methods have hard tradeoffs for startups. Retrospective: • Quality of data • Sample size • Sample breakdown • Blinding • Training vs validation
  • 63. How do I know it’s real? Validation methods have hard tradeoffs for startups. Retrospective: • Quality of data • Sample size • Sample breakdown • Blinding • Training vs validation. Prospective: • Recruitment centers • Trial arms • Statistical power • Endpoints • Duration
  • 69. Significance and the pesky p-value
  • 70. Significance and the pesky p-value Possible outcomes large p small p
  • 71. Significance and the pesky p-value. Possible outcomes: large p, small p. Presented p-values will almost always be small – when they’re not, they’re usually hidden in supplemental / backup data
  • 72. Significance doesn’t imply relevance. Plot axes: significance (decreasing p-value) vs. effect size
  • 73. Significance doesn’t imply relevance. Plot axes: significance (decreasing p-value) vs. effect size. Regions labeled “Meh”, “Meh”
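A hedged sketch of how significance and effect size come apart. It uses a two-sided, two-sample z-test with a known standard deviation and equal group sizes, which is a simplifying assumption chosen so the example needs only the standard library. The function name, effect sizes, and sample sizes are all hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z_p_value(mean_diff, sd, n_per_group):
    """Two-sided p-value for a difference in means, assuming known sd."""
    standard_error = sd * sqrt(2 / n_per_group)
    z = mean_diff / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Tiny effect (1% of a standard deviation), huge sample.
p_tiny_effect = two_sample_z_p_value(mean_diff=0.01, sd=1.0, n_per_group=1_000_000)

# Large effect (80% of a standard deviation), very small sample.
p_large_effect = two_sample_z_p_value(mean_diff=0.8, sd=1.0, n_per_group=5)

print(f"tiny effect, n = 1,000,000 per group: p = {p_tiny_effect:.2e}")
print(f"large effect, n = 5 per group:        p = {p_large_effect:.2f}")
```

The negligible effect comes out highly significant and the large effect does not, so a small p-value on its own says little about whether a result matters.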
  • 75. When significance is assumed relevant. Share price chart, 7/1/20 to 3/8/21 (y-axis: $0 to $30). Annotations: ATI-450 P2a - ~$19; Today - ~$25; Left 4 Dead Data Readout - ~$6. More than 3x jump in share price overnight. Why?
  • 76. Significant jump on significant data https://investor.aclaristx.com/static-files/a47d4b59-da7b-417a-b62f-02ac4210dea7
  • 78. Relevance isn’t found in significance • Randomization led to 2 control subjects • Control group was ~10 years younger. Significance can add little benefit to assessing risks. https://investor.aclaristx.com/static-files/a47d4b59-da7b-417a-b62f-02ac4210dea7
  • 79. Getting started with technical diligence: Dissect the key data from the pitch
  • 80. Getting started with technical diligence: Dissect the key data from the pitch; Ask yourself if it makes sense
  • 81. Getting started with technical diligence: Dissect the key data from the pitch; Ask yourself if it makes sense; Catalogue data into a ‘Rumsfeld Matrix’
  • 82. Getting started with technical diligence: Dissect the key data from the pitch; Ask yourself if it makes sense; Catalogue data into a ‘Rumsfeld Matrix’; Understand experimental methods
  • 83. Getting started with technical diligence: Dissect the key data from the pitch; Ask yourself if it makes sense; Catalogue data into a ‘Rumsfeld Matrix’; Understand experimental methods; Prioritize and “read” related papers
  • 84. Getting started with technical diligence: Dissect the key data from the pitch; Ask yourself if it makes sense; Catalogue data into a ‘Rumsfeld Matrix’; Understand experimental methods; Prioritize and “read” related papers; Talk to people who know more than you
  • 85. How to read journal articles: 1. Read the title and abstract 2. Dive into the figures (data) 3. Check out the methods 4. Look at the supplemental data. If everything checks out, read the paper (or don’t)