Best Practices for Prototyping
Machine Learning Models for Healthcare
Lorenzo Rossi, PhD
Data Scientist
City of Hope National Medical Center
DataCon LA, August 2019
Machine learning in healthcare is growing fast,
but best practices are not well established yet
Towards Guidelines for ML in Health (Stanford, August 2018)
Motivations for ML in Healthcare
1. Lots of information about patients, but not enough time for clinicians
to process it
2. Physicians spend too much time typing information about patients
during encounters
3. Overwhelming amount of false alerts (e.g. in ICU)
Topics
1. The electronic health record (EHR)
2. Cohort definition
3. Data quality
4. Training - testing split
5. Performance metrics and reporting
6. Survival analysis
(Topics 1–4: data preparation)
1. The Electronic Health Record (EHR)
EHR data are very heterogeneous
• Laboratory tests [multidimensional time series]
• Vitals [multidimensional time series]
• Diagnoses [text, codes]
• Medications [text, codes, numeric]
• X-rays, CT scans, EKGs, … [2D/3D images, time series, …]
• Notes [text]
Time is a key aspect of EHR data
[Figure: timelines of labs, vitals, notes, … for patients p01–p03]
Temporal resolution varies a lot:
• ICU patient [minutes]
• Hospital patient [hours]
• Outpatient [weeks]
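One common way to hold such heterogeneous, irregularly sampled data is a single long-format event table, one row per observation. A minimal sketch in pandas follows; the schema (patient_id, ts, kind, code, value) is an illustrative assumption, not the system used in the talk.

```python
import pandas as pd

# A long-format event table: one row per observation, whatever its type.
# Column names (patient_id, ts, kind, code, value) are illustrative only.
events = pd.DataFrame({
    "patient_id": ["p01", "p01", "p02", "p03"],
    "ts": pd.to_datetime(["2018-01-05 08:10", "2018-01-05 09:00",
                          "2018-02-11 14:30", "2018-03-02 07:45"]),
    "kind": ["lab", "vital", "lab", "note"],
    "code": ["albumin", "heart_rate", "albumin", "progress_note"],
    "value": [3.6, 88.0, 2.9, None],  # free text would be stored separately
})

# One irregular time series per patient, regardless of temporal resolution
events = events.sort_values(["patient_id", "ts"])
```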
Events hospitals want to predict from EHR data
• Unplanned 30-day readmission
• Length of stay
• Mortality
• Sepsis
• ICU admission
• Surgical complications
Goals: improve capacity, optimize decisions
Consider only binary prediction tasks for simplicity
Prediction algorithm gives a score from 0 to 1
– e.g. close to 1 → high risk of readmission within 30 days
Thresholding the score into a 0/1 decision trades off falsely detected vs. missed targets
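A tiny synthetic illustration of that trade-off (NumPy/scikit-learn; all numbers are made up for the sketch):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
y_true = rng.binomial(1, 0.15, size=1000)                          # ~15% positives
scores = np.clip(0.4 * y_true + rng.uniform(0, 0.8, 1000), 0, 1)   # noisy risk scores

# Raising the threshold cuts false alarms but misses more true targets
for t in (0.3, 0.5, 0.7):
    tn, fp, fn, tp = confusion_matrix(y_true, (scores >= t).astype(int)).ravel()
    print(f"threshold={t}: falsely detected={fp}, missed={fn}")
```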
2. Cohort Definition
Cohort
Individuals “who experienced particular event during specific period of time”
Given a prediction task, select a clinically relevant cohort
E.g. for surgical complication prediction: patients who had one
or more surgeries between 2011 and 2018.
A. Pick records of a subset of patients
[Figure: timelines of labs, vitals, notes, … for patients p01–p03]
B. Pick a prediction time for each patient. Records after the
prediction time are discarded.
[Figure: the same timelines, truncated at each patient’s prediction time]
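A minimal sketch of step B, reusing the long-format events table sketched earlier; the prediction_times table is a hypothetical input, one time per patient:

```python
import pandas as pd

# Hypothetical prediction times, one per patient
prediction_times = pd.DataFrame({
    "patient_id": ["p01", "p02", "p03"],
    "prediction_time": pd.to_datetime(["2018-01-05 09:30",
                                       "2018-02-11 12:00",
                                       "2018-03-10 00:00"]),
})

# Keep only records observed at or before each patient's prediction time
truncated = events.merge(prediction_times, on="patient_id")
truncated = truncated[truncated["ts"] <= truncated["prediction_time"]]
```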
3. Data Quality
[Image source: Salesforce]
EHR data are challenging in many different ways
Example: most common non-numeric entries for
lab values in a legacy EHR system
• pending
• “>60”
• see note
• not done
• “<2”
• normal
• “1+”
• “2 to 5”
• “<250”
• “<0.1”
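Such entries might be normalized along the lines below (pandas/regex; the mapping rules are illustrative assumptions, and censored entries like “>60” deserve clinical review rather than blind coercion):

```python
import re
import numpy as np
import pandas as pd

def parse_lab_value(raw):
    """Map a raw lab entry to a float or NaN. Illustrative rules only."""
    s = str(raw).strip().lower()
    if s in {"pending", "see note", "not done", "normal"}:
        return np.nan                                    # no usable number
    m = re.fullmatch(r"[<>]\s*(\d+(\.\d+)?)", s)
    if m:
        return float(m.group(1))                         # censored: keep the bound
    m = re.fullmatch(r"(\d+(\.\d+)?)\s*\+", s)
    if m:
        return float(m.group(1))                         # ordinal grade like "1+"
    m = re.fullmatch(r"(\d+)\s*to\s*(\d+)", s)
    if m:
        return (float(m.group(1)) + float(m.group(2))) / 2   # midpoint of a range
    try:
        return float(s)
    except ValueError:
        return np.nan

values = pd.Series(["pending", ">60", "2 to 5", "1+", "<0.1", "3.6"])
print(values.map(parse_lab_value))
```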
Example: discrepancies in dates of death between hospital
records and Social Security records (~4.8% of shared patients)
Anomalies vs. Outliers
Outlier: legitimate data point far away from the mean/median of
the distribution
Anomaly: illegitimate data point generated by a process different
from the one producing the rest of the data
Need domain knowledge to differentiate
E.g.: albumin level in blood. Normal range: 3.4–5.4 g/dL;
µ = 3.5, σ = 0.65 over the cohort.
– value = -1 → anomaly: a negative concentration is impossible (treat as missing value)
– value = 1 → possibly an outlier: very low but physically possible (clinically relevant)
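In code, the two cases end up handled differently. A minimal sketch, assuming illustrative plausibility bounds that a clinician would need to confirm:

```python
import pandas as pd

albumin = pd.Series([3.6, 4.1, -1.0, 1.0, 3.9, 5.2])  # g/dL

# Anomalies: values outside what is physically possible -> treat as missing
PLAUSIBLE_MIN, PLAUSIBLE_MAX = 0.0, 10.0               # illustrative bounds
cleaned = albumin.mask((albumin < PLAUSIBLE_MIN) | (albumin > PLAUSIBLE_MAX))

# Outliers: legitimate but far from the cohort mean -> keep, but flag
mu, sigma = 3.5, 0.65
is_outlier = (cleaned - mu).abs() > 3 * sigma
print(pd.DataFrame({"albumin": cleaned, "outlier": is_outlier}))
```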
4. Training - Testing Split
Guidelines
• Machine learning models are evaluated on their ability to make
predictions on new (unseen) data
• Split train (cross-validation) and test sets based on temporal criteria
– e.g. no records in the train set after prediction dates in the test set
– random splits, even if stratified, could include records virtually
from the ‘future’ to train the model
• In retrospective studies, also avoid records of the same
patients across train and test
– the model could just learn to recognize patients
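A minimal sketch of such a split (pandas; assumes one row per prediction example with patient_id and prediction_time columns, as produced in the cohort step):

```python
import pandas as pd

def temporal_patient_split(examples: pd.DataFrame, cutoff: pd.Timestamp):
    """Temporal split with no patient overlap; a sketch, not the talk's code."""
    test = examples[examples["prediction_time"] > cutoff]
    train = examples[
        (examples["prediction_time"] <= cutoff)
        & ~examples["patient_id"].isin(test["patient_id"])  # patient-disjoint
    ]
    return train, test

# e.g. train on predictions made before 2017, test on the rest:
# train, test = temporal_patient_split(examples, pd.Timestamp("2017-01-01"))
```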
5. Performance Metrics and Reporting
Background
Generally highly imbalanced problems:
• 15% unplanned 30-day readmissions
• <10% sepsis cases
• <1% 30-day mortality
Types of Performance Metrics
1. Measure trade-offs
– (ROC) AUC
– average precision / PR AUC
2. Measure error rate at a specific decision point
– false positive, false negative rates
– precision, recall
– F1
– accuracy
Types of Performance Metrics (II)
1. Measure trade-offs
– AUC, average precision / PR AUC
– good for global performance characterization and (intra-)model
comparisons
2. Measure error rate at a specific decision point
– false positives, false negatives, …, precision, recall
– possibly good for interpretation of specific clinical costs and
benefits
Don’t use accuracy unless dataset is balanced
ROC AUC can be misleading too
ROC AUC can be misleading (II)
ROC AUC (1 year) > ROC AUC (5 years), but PR AUC (1 year) <
PR AUC (5 years)! The latter prediction task is easier.
[Avati, Ng et al., Countdown Regression: Sharp and Calibrated
Survival Predictions. arXiv, 2018]
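A small synthetic demonstration of the effect (scikit-learn; the numbers are made up, not the paper's): with identical score separability, ROC AUC barely moves across prevalences while average precision collapses at low prevalence.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
n = 100_000
for prevalence in (0.20, 0.01):               # 5-year-like vs 1-year-like positive rate
    y = rng.binomial(1, prevalence, n)
    scores = y * 1.0 + rng.normal(0, 1.2, n)  # same separability in both cases
    print(f"prevalence={prevalence:.0%}  "
          f"ROC AUC={roc_auc_score(y, scores):.3f}  "
          f"AP={average_precision_score(y, scores):.3f}")
```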
Performance should be reported with both types of metrics
• 1 or 2 metrics for trade-off evaluation
– ROC AUC
– average precision
• 1 metric for performance at a clinically meaningful decision point
– e.g. recall @ 90% precision
+ Comparison with a known benchmark (baseline)
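Recall at a fixed precision can be read off the precision-recall curve; a sketch with scikit-learn (one reasonable way to compute it, not necessarily how the cited papers did):

```python
from sklearn.metrics import precision_recall_curve

def recall_at_precision(y_true, scores, min_precision=0.90):
    """Highest recall achievable while keeping precision >= min_precision."""
    precision, recall, _ = precision_recall_curve(y_true, scores)
    feasible = precision >= min_precision
    return recall[feasible].max() if feasible.any() else 0.0
```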
Metrics in Stanford 2017 paper on mortality prediction:
AUC, average precision, recall @ 90% precision
Benchmarks
Main paper [Google, Nature, 2018] only reports deep
learning results, with no benchmark comparison
Comparison only in the supplemental online file (not in the
Nature paper): deep learning only 1-2% better than the
logistic regression benchmark
Plot scales can be deceiving [undisclosed vendor, 2017]!
[Figure: the same TP and FP plots, rescaled]
6. Survival Analysis
B. Pick a prediction time for each patient. Records after the
prediction time are discarded.
[Figure: patient timelines p01–p03 truncated at prediction times]
C. Plot survival curves
• Consider binary classification tasks
– Event of interest (e.g. death) either happens or not before
censoring time
• Survival curve: distribution of time to event and time to
censoring
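A minimal survival-curve sketch, assuming the lifelines library and synthetic durations/censoring flags:

```python
import numpy as np
from lifelines import KaplanMeierFitter

rng = np.random.default_rng(0)
durations = rng.exponential(scale=300, size=500)   # days from prediction time
observed = rng.binomial(1, 0.6, size=500)          # 1 = event (death), 0 = censored

kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed=observed)
ax = kmf.plot_survival_function()                  # Kaplan-Meier curve for the cohort
ax.set_xlabel("days from prediction time")
```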
Different selections of prediction times lead to different
survival profiles over the same cohort
Example: a high percentage of patients deceased within 30
days. The model is trained to distinguish mostly between
relatively healthy and moribund patients → performance
overestimate
Final Remarks
• Outliers should not be treated like anomalies
• Split train (CV) and test sets temporally
• Metrics:
– ROC AUC alone can be misleading
– Precision-Recall curve often more useful than ROC
– Compare with meaningful benchmarks
• Performance possibly overestimated for cohorts with
unrealistic survival curves
Thank You!
Twitter: @LorenzoARossi
Supplemental Material
Example: ROC Curve
Very high detection rate, but also high false alarm rate