gsk.com
AI & Big Data Expo, London
Machine learning, biomedical data & trust
Paul Agapow (Statistics & Data Science Innovation Hub)
Background & disclaimer
• Previously a health informatician, biomedical ML researcher, bioinformatician, “computer guy”, disease chaser, epi-informatician, phylogeneticist, evolutionary biologist, immunologist, biochemist …
• Now a director @GSK
• This presentation does not reflect thought, policy or projects in progress at GSK
• There are no conflicts of interest
10 June 2021
“AI will not replace drug hunters, but drug hunters who don’t use AI will be replaced by those who do.”
- Andrew Hopkins, CEO, Exscientia
07 February 2023
3 hurdles to using AI/ML in therapy development
Biological & physiological complexity
Insufficient & uneven data
A gap between AI/ML practice & medical needs
To make a new drug, you must first solve for everything
12 July 2021
The complexity of biology:
About 50 trillion cells of 200 types
Each cell has 23 pairs of chromosomes
In total 6.4 billion base pairs (positions)
Organised into about 18,000 genes
(Or maybe more like 40,000 genes)
Genetic material elsewhere in the cell
Epigenetic modification
1 million different types of molecules
Lifestyle & history
Exposure & environment
Immune system repertoire & priming
…
Of which we know only a fraction
The data types and sources we need are myriad & varied
Hughes et al. (2010) “Principles of early drug discovery”
There are many different means to the same end
• There are many different modalities of intervention
• With different (data) considerations & different levels of ML experience
McKinsey, EvaluatePharma 2022
It’s often not the right data
• Difficult / expensive to generate
• Unstructured
• Unlabeled
• The wrong type
• Sparse, unevenly sampled
• WEIRD
• In different formats and silos
Melanie Mitchell via Dagmar Monett
A disconnect between AI/ML practice and medical needs
Academic focus on problems with low medical value
A disconnect between AI/ML practice and medical needs
A tendency to treat biomedicine as simply a data / ML problem
• There are many models that work perfectly … in the lab
• Why?
- Unrealistic or poor training data
- Emphasis on hitting metrics
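One way an emphasis on hitting metrics can mislead, as a minimal sketch: when the outcome is rare, a "model" that never flags anyone still reports high accuracy while missing every patient who matters. The cohort size and event count below are made-up illustrative numbers, not taken from any study, and the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = event)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    """Fraction of all predictions that are correct."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)


def sensitivity(y_true, y_pred):
    """Fraction of true events that the model catches (recall)."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical cohort: 1000 patients, 50 of whom deteriorate.
y_true = [1] * 50 + [0] * 950
# A "model" that predicts that nobody deteriorates.
y_pred = [0] * 1000

acc = accuracy(y_true, y_pred)      # looks impressive
sens = sensitivity(y_true, y_pred)  # catches no one
print(f"accuracy={acc:.2f}, sensitivity={sens:.2f}")
```

The headline metric rewards the majority class; the clinically relevant question (who deteriorates?) is answered for nobody.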
The classic analytical tension
What we tend to solve:
• Easy things
• Available, ideal data
• Ground truth
• Simplify
• “Interesting”
• “Table-land”
What we need to solve:
• Useful things
• Incomplete messy data
• Unclear biological reality
• Uncertain findings
• Needful
• “Network-land”
Laure Wynants via Maarten van Smeden
A disconnect between AI/ML practice and medical needs
Many “good” models are not fit for production
COVID was a lightning rod for bad biomedical ML
• The pandemic prompted a flood of publications & preprints
• Most plagued by the usual biomedical AI problems
• … and also produced by those outside the field
• As a general principle, any paper applying ML to COVID is terrible
• Bad models in a crisis situation are not neutral: they distract, expend effort and carry an opportunity cost
“Interpretable Prediction of Severity & Crucial Factors of COVID Patients”
Zheng et al. BioMed Research International (2021), DOI: 10.1155/2021/8840835
• What does it purport to do? Find risk factors associated with deterioration of COVID patients
• Why? Better / faster assessment of incoming patients
• Who? Patients admitted to two hospitals with a +ve PCR test for COVID and a CT scan showing lesions
• Data? Demographics, bloods, labs, breathing / oxygen scores, manually scored CT scans
Thoughts and questions
Not necessarily faults, not all easily answerable
• Conflates diagnosis & prognosis
• The cohort:
- Suggests this can replace PCR, but the cohort is selected by PCR result
- The act of taking a CT scan in some ways selects the cohort
- Unclear when some readings were taken, which matters when we are looking at deterioration
- Is the training set the set that a model might be used on in the clinic?
- Not many critical cases, so actually testing for severe cases
- What’s the split between hospitals?
- Patients are different already: pre-existing conditions
- Association with age & general health
- Old patients running a temperature with lesioned lungs do poorly
• Clinical use:
- Will all this data be available in a timely fashion for a model in the clinic?
- If the severity is based on bloods & oxygenation readings, why not just use them?
- Information complexity?
• Validation:
- Would it work for another time period at the same hospitals? At other hospitals?
• Analytics:
- “The impenetrable wall of math”
- XGBoost is always a good place to start
- Ensemble methods usually are
- Feature interaction?
- Some features overlap (neutrophils, n. ratio, NLR)
- What features correlate?
- No attempt to simplify model
- Any model is interpretable with SHAP
• Still useful for intrinsic / research purposes
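The "what features correlate?" question can be checked cheaply before any modelling. The sketch below is one possible approach, not the method of the paper discussed: it computes pairwise Pearson correlations and flags strongly related feature pairs. All values are invented for illustration, and the 0.9 threshold and the name `overlapping_features` are assumptions.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def overlapping_features(features, threshold=0.9):
    """Return (name_a, name_b, r) for every feature pair with |r| >= threshold."""
    flagged = []
    for (name_a, xs), (name_b, ys) in combinations(features.items(), 2):
        r = pearson(xs, ys)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, round(r, 2)))
    return flagged


# Invented blood counts for six hypothetical patients (not real data).
neutrophils = [3.0, 4.5, 6.0, 7.5, 9.0, 10.5]
lymphocytes = [2.0, 1.8, 1.5, 1.2, 1.0, 0.8]
features = {
    "neutrophils": neutrophils,
    "lymphocytes": lymphocytes,
    # NLR is derived from the two counts above, so it overlaps by construction.
    "nlr": [n / l for n, l in zip(neutrophils, lymphocytes)],
    "age": [34, 71, 45, 62, 50, 58],
}

flagged = overlapping_features(features)
for name_a, name_b, r in flagged:
    print(f"{name_a} ~ {name_b}: r = {r}")
```

A derived ratio and its components carry largely the same signal; keeping all three complicates any reading of feature importance without adding much information.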
Some principles for better biomedical ML
• Models will always tell you the truth
- But it’s the truth conditioned on the data they’ve seen
- It might not be the truth you think
• Biomedical data is complex, it always comes with a context
• Patients are complex, they always come with a medical history
• How were these patients selected?
• What is this model actually saying and why?
• Does this model replicate in other populations?
• But despite all this, we have to make and actionably interpret models
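One hedged way to start asking "does this model replicate in other populations?" with data already in hand is to hold out one site at a time, so the model is always tested on a hospital it never saw. The sketch below only generates the index splits; the site labels are hypothetical, and `leave_one_site_out` is an illustrative name (scikit-learn's `LeaveOneGroupOut` provides the same idea for real pipelines).

```python
from collections import defaultdict


def leave_one_site_out(site_labels):
    """Yield (held_out_site, train_idx, test_idx), holding out one site at a time."""
    by_site = defaultdict(list)
    for idx, site in enumerate(site_labels):
        by_site[site].append(idx)
    for site in sorted(by_site):
        test_idx = by_site[site]
        train_idx = [i for i, s in enumerate(site_labels) if s != site]
        yield site, train_idx, test_idx


# Hypothetical site label for each of five patients.
sites = ["hospital_A", "hospital_A", "hospital_B", "hospital_A", "hospital_B"]

splits = list(leave_one_site_out(sites))
for held_out, train_idx, test_idx in splits:
    print(f"test on {held_out}: train={train_idx}, test={test_idx}")
```

A random split across pooled patients mixes both hospitals into training and test sets, which tends to flatter the model; splitting by site (or by time period) is a closer stand-in for deployment somewhere new.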
Why not join us?
Academic Press (2021)
Some light reading
Academic Press (2021)
