Machine learning, health data & the limits of knowledge
Paul Agapow
ONC R&D ML&AI AstraZeneca
<paul.agapow@astrazeneca.com>
2021/3/10
Disclosure
• Does not reflect official AZ thought or projects
• No conflicts of interest
About me
• Have been a:
• health informatician, data scientist, bioinformatician, database administrator, epi-informaticist, software dev, data manager, consultant, molecular geneticist, data scientist, evolutionary scientist, biochemist, phylogeneticist, immunologist, programmer …
• At:
• Oncology R&D ML&AI / RWE @AZ
• Data Science Institute @ICL
• Centre for Infection @HPA (UK)
• Universities, industry, government …
Using this paper as a jumping-off point
• The Hierarchical Classifier for COVID-19
Resistance Evaluation (2021) Shakhovska,
Izonin & Melnykova, Data v6:6
• https://doi.org/10.3390/data6010006
• https://www.mdpi.com/2306-5729/6/1/6/htm
• How to analyse for patterns in COVID data
when the observational data is diverse &
complex
Data is a saviour & a curse
• Data & analytics have saved us several times in the current crisis
• But too much data can create problems
• And data is not information
RWE: real world evidence
• Electronic Health Records
• Registries
• Claims databases
• Repurposed trial data
• Defined:
• Anything that isn’t an RCT
(randomised controlled trial)
• Observational data
• Anything we have to consider the
context & sourcing of?
• Why?
• Cheap
• Ethical
• Accesses scales & types of data &
situations that are otherwise
unavailable
But all (RWE) data is biased
What population does it come from?
• Where was it collected?
• Who did they look for?
• What are those people's habits and histories?
What are the definitions used?
• “severe asthma” or “PDL1 expression”
• What are the diagnostic devices?
• What’s common medical practice there?
What causes data to be included / excluded?
• E.g. surveys, visits
• Are inclusion / exclusion at random?
• What incidental correlations?
• Choice of features
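To make the point about non-random inclusion concrete, here is a minimal simulation sketch. All numbers (true prevalence, probability of being recorded) are hypothetical and chosen only for illustration; none come from the paper. It assumes ill people are far more likely than well people to appear in the dataset, as might happen when records are generated by clinic visits.

```python
import random

def simulate(n=100_000, true_prevalence=0.10,
             p_seen_if_ill=0.60, p_seen_if_well=0.05, seed=0):
    """Return the prevalence observed among recorded people only.

    All parameters are hypothetical. Inclusion depends on illness status,
    so the recorded sample is not a random draw from the population.
    """
    rng = random.Random(seed)
    seen_total = 0
    seen_ill = 0
    for _ in range(n):
        ill = rng.random() < true_prevalence
        p_seen = p_seen_if_ill if ill else p_seen_if_well
        if rng.random() < p_seen:  # this person ends up in the dataset
            seen_total += 1
            seen_ill += ill
    return seen_ill / seen_total

# Inclusion depends on being ill: observed prevalence is inflated
print("biased inclusion:", round(simulate(), 3))

# Inclusion at random: observed prevalence stays near the true 10%
print("random inclusion:", round(simulate(p_seen_if_ill=0.3, p_seen_if_well=0.3), 3))
```

Under these assumed rates the recorded sample overstates prevalence several-fold, while inclusion at random leaves the estimate close to the true value.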
The COVID publication: is it good data?
• Do we know where it came from?
• Do we know who is in it?
• Is there missing data?
• “maybe” COVID?
• Are the populations comparable?
• Are antibody levels comparable?
• Different test kits?
• Imbalanced classes?
The data
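Several of the questions above (missing data, "maybe" labels, different test kits, imbalanced classes) can be checked mechanically before any modelling. The sketch below is one way to do that. The records and field names (`antibody_igg`, `test_kit`, `covid`) are hypothetical toy values, not the paper's dataset.

```python
from collections import Counter

def missing_rate(rows, field):
    """Fraction of records where `field` is absent or None."""
    return sum(1 for r in rows if r.get(field) is None) / len(rows)

def class_balance(rows, label):
    """Share of each label value among records that have a label."""
    counts = Counter(r[label] for r in rows if r.get(label) is not None)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}

# Hypothetical toy records (field names are assumptions for illustration)
records = [
    {"antibody_igg": 1.2,  "test_kit": "A",  "covid": "yes"},
    {"antibody_igg": None, "test_kit": "A",  "covid": "no"},
    {"antibody_igg": 0.4,  "test_kit": "B",  "covid": "no"},
    {"antibody_igg": 0.3,  "test_kit": "B",  "covid": "maybe"},
    {"antibody_igg": 0.1,  "test_kit": None, "covid": "no"},
]

print("missing antibody values:", missing_rate(records, "antibody_igg"))
print("label balance:", class_balance(records, "covid"))
print("records per test kit:", Counter(r["test_kit"] for r in records))
```

A report like this does not answer whether antibody levels from different kits are comparable, but it shows where that question needs asking.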
How do we analyse RWE correctly?
• Patients are complex:
• Co-morbidities
• Lifestyle, prior history, exposure
• Demographics, genetics, epigenome, microbiome …
• Disease is complex:
• Affects different body subsystems
• Health data is complex:
• Sparse, irregular
• A product of a healthcare system …
• Underlying models unclear
• Many opportunities for confounders & noise
Is ML the best approach for RWE analysis?
A continuum of approaches: Statistical modelling … Machine Learning / AI
• Statistical modelling: clear assumptions, explicit models, clean & controlled data
• Machine Learning / AI: few assumptions, no model (trained from data), messy data, larger data
But what are the pitfalls of using ML on health data?
• Need more (labelled) data
• Bias – how was the data
sourced?
• Needs to be handled carefully
• May require specialised
computation & skills
• Some problems difficult to
adapt to ML
• Interpretability – data never
lies, but what is it telling us?
Clustering: how simple algorithms can
actually be very complex
• Idea of clustering is simple: but what does it actually do?
• Every dataset has clusters, even random noise
• Do clusters reflect the underlying reality?
• Are the clusters revealed valid and/or robust?
• Are the clusters of groups you are interested in?
• A cluster is not the truth, it’s a hypothesis
(The paper is modestly convincing about these points)
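The claim that every dataset has clusters, even random noise, can be illustrated with a short sketch. It assumes a plain k-means (Lloyd's algorithm) written from scratch and uniform random points; this is not the clustering used in the paper. Comparing two runs with different starting centres via the Rand index is one simple, assumed proxy for robustness.

```python
import random
from itertools import combinations

def kmeans(points, k, seed, iters=25):
    """Plain Lloyd's algorithm on 2D points; returns one label per point."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest centre (squared Euclidean distance)
        labels = [
            min(range(k), key=lambda c: (p[0] - centres[c][0]) ** 2
                                        + (p[1] - centres[c][1]) ** 2)
            for p in points
        ]
        # Move each centre to the mean of its members
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centre if a cluster empties
                centres[c] = (sum(m[0] for m in members) / len(members),
                              sum(m[1] for m in members) / len(members))
    return labels

def rand_index(a, b):
    """Fraction of point pairs that two labelings treat the same way."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

# Pure noise: 200 uniform random points with no structure at all
rng = random.Random(42)
noise = [(rng.random(), rng.random()) for _ in range(200)]

run1 = kmeans(noise, k=3, seed=1)
run2 = kmeans(noise, k=3, seed=2)
print("clusters found:", sorted(set(run1)))
print("agreement between runs:", round(rand_index(run1, run2), 2))
```

The algorithm returns clusters regardless of whether any exist, which is why agreement across restarts, resamples, or methods is worth checking before interpreting them.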
The COVID publication: is it good methodology?
• Many different methods but:
• What’s the concordance?
• What use are 6-7 methods?
• Ensemble them?
• Where’s the validation?
• What’s the question?
• How many people are actually
infected with COVID? or
• Can we build a model to calculate
this?
The data
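The questions about concordance and ensembling can be sketched as follows. Cohen's kappa is used here as one common chance-corrected agreement measure and a simple majority vote as the ensemble; both are assumptions for illustration, not what the paper did. The method names and predictions are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cohen_kappa(a, b):
    """Chance-corrected agreement between two label lists.

    Undefined (division by zero) if both raters always give one same label.
    """
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[l] * count_b[l] for l in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

def majority_vote(predictions):
    """Per-sample most common label across methods (use an odd number of methods)."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]

# Hypothetical predictions from three methods for eight patients
preds = {
    "tree":   [1, 1, 0, 0, 1, 0, 0, 1],
    "svm":    [1, 1, 0, 0, 1, 0, 1, 1],
    "kmeans": [1, 0, 0, 1, 1, 0, 1, 0],
}

for m1, m2 in combinations(preds, 2):
    print(m1, "vs", m2, "kappa =", round(cohen_kappa(preds[m1], preds[m2]), 2))

print("ensemble:", majority_vote(list(preds.values())))
```

If several methods barely agree beyond chance, reporting all of them adds little; if they agree closely, one well-validated method may be enough.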
What makes a good machine-learning approach?
• Be clear what it is predicting
• It should be reproducible
• It should be validated:
• Internally: performance, convergence, loss, sensitivity, robustness, …
• Externally: against another dataset
• Almost any ML method can
• Do (slightly) better than humans
• Get better than 50%
• If it is “better”, compared to what?
How do the systems in the paper measure up?
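The question "better, compared to what?" is easiest to see with imbalanced classes. The sketch below compares a hypothetical model against a trivial baseline that always predicts the majority class. The class sizes and the model's errors are invented for illustration.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall; 0.5 for any constant guess on two classes."""
    recalls = []
    for cls in set(y_true):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

# Hypothetical imbalanced outcome: 90 negatives, 10 positives
y_true = [0] * 90 + [1] * 10

# Baseline: always predict the majority class
majority = Counter(y_true).most_common(1)[0][0]
baseline = [majority] * len(y_true)

# Hypothetical model: 8 false alarms, finds 6 of the 10 positives
model = [0] * 82 + [1] * 8 + [1] * 6 + [0] * 4

for name, pred in (("baseline", baseline), ("model", model)):
    print(name,
          "accuracy =", round(accuracy(y_true, pred), 3),
          "balanced accuracy =", round(balanced_accuracy(y_true, pred), 3))
```

Here the do-nothing baseline scores higher on plain accuracy than the model, while balanced accuracy shows the model is the only one detecting any positives. Which metric and which comparator matter depends on the question being asked.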
How do we know what a system is doing?
• Interpretability is non-negotiable
• AI models can only be built for data that
you have
• Biased data gives rise to biased models
• A model may not be doing what we
think it is
• Toolkits like SHAP & LIME make interpretability easy and comparable
(Paper used very interpretable systems)
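As a minimal illustration of checking what a model is actually doing, the sketch below uses permutation importance: shuffle one feature and measure how much accuracy drops. This is a simpler, swapped-in technique, not SHAP or LIME themselves. The data, the `site` and `antibody` features, and the "trained" model are all hypothetical.

```python
import random

def accuracy(model, X, y):
    """Fraction of rows where the model's output equals the label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(model, X, y, feature, seed=0):
    """Drop in accuracy after shuffling one feature column across rows."""
    rng = random.Random(seed)
    column = [row[feature] for row in X]
    rng.shuffle(column)
    X_perm = [{**row, feature: value} for row, value in zip(X, column)]
    return accuracy(model, X, y) - accuracy(model, X_perm, y)

# Hypothetical data: the outcome label happens to track the hospital site
rng = random.Random(1)
X = [{"antibody": rng.random(), "site": rng.choice(["A", "B"])}
     for _ in range(500)]
y = [1 if row["site"] == "B" else 0 for row in X]

# Hypothetical "trained" model that latched onto the site, not the biology
def model(row):
    return 1 if row["site"] == "B" else 0

for feature in ("antibody", "site"):
    print(feature, "importance =",
          round(permutation_importance(model, X, y, feature), 2))
```

A model like this would score perfectly on its own dataset yet carry no biological signal, which is the sense in which a model may not be doing what we think it is.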
How could this have been done better?
• What question are we trying to solve?
• “What’s the actual level of infected people in the population”?
• In what time period or setting?
• What’s actionable?
• What data can we get?
• What data can we get for validation?
• We don’t need 6-7 different methods, just 1 good one
• Be clear about “how good” the results are
Summary
• RWE may be a broad and over-reaching category
• But it underlines the complexity & biases of health data
• ML may be the best approach for analysing RWE
• However, its power and flexibility introduce other problems
• Data “bias”
• Validation
• Interpretability
• ML “findings” are almost always just hypotheses
• Healthcare analytics should not be about analytics but about biology
Final thought
• If you are driven by science and passionate about improving lives, why not work at
AstraZeneca?
• Example jobs – please visit our careers website
• Principal Data Scientist - https://careers.astrazeneca.com/job/gaithersburg/principal-data-scientist/7684/14833674
• Associate Director Imaging & AI - Imaging & Data Analytics - https://careers.astrazeneca.com/job/gothenburg/associate-director-imaging-and-ai-imaging-and-data-analytics/7684/14469379
• Data Sciences & AI Graduate Programme – UK - https://careers.astrazeneca.com/data-sciences-and-ai-graduate-programme
