Thesis Defense
Investigating the Use of Novel Data Mining and Machine Learning Methods in Healthcare Data Sources of Multiple Nature
Roberto Batista
Content
• Introduction
• Part I - Survey Data
  • Overview
  • Literature review
  • Methods
  • Exploratory Data Analysis
  • Data Transformation
  • Conclusions
• Part II - Electronic Health Record
  • Overview
  • Literature review
  • Methods
  • Exploratory Data Analysis
  • Data Transformation
  • Conclusions
Introduction
• Non-federal hospitals with basic EHR systems: 9.4% in 2008, 84% in 2015.
Source: The Office of the National Coordinator for Health Information Technology (ONC)
gender, ethnicity, religion, age, social, finance, exams
Health Data Sources
National Library of Medicine (NIH)
• Survey
• Electronic Medical Record (EMR)
• Claim Data
• Vital Signs Data (SpO2)
• Electronic Health Record (EHR)
Thesis Components
• PART I: Survey
• PART II: Electronic Health Record (EHR)
Part I
Survey Data
- How can groups of personality traits be identified in the Health and Retirement Study survey data?
Health and Retirement Study (HRS)
HRS Overview
• 3 surveys (6 aspects, 5 aspects, 5 aspects)
• 22,000 participants, older than 50 years
• 9 aspects
• 4 derived datasets
• 58.54%
• Medical Ethics Training
Literature Review
Machine Learning Studies
Gould et al., 2015:
Examines symptoms of anxiety and depression in veterans and non-veterans using the CES-D and BAI.
Seligman et al., 2018:
Machine learning improves the understanding of social determinants of health.
Hülür et al., 2015:
Investigates the association between subjective memory, subjective age, and personality traits.
Fehrman et al., 2015:
Correlates personality with individuals' consumption of eight psychoactive drugs.
Aschwanden et al., 2019:
Associates personality traits with the probability of having a preventive cancer screening.
Five Personality Traits (OCEAN):
• Openness
• Conscientiousness
• Extraversion
• Agreeableness
• Neuroticism
HRS Datasets Overview
Datasets: HRS - RAND, HRS Core, HRS Exit, HRS Post-Exit (each spanning 1992 to 2016)
• Adult ADHD
• Financial
• Material Hardship
• Long-term Care
• Medication Non-Adherence
• Religious

• Proxy informant
• Health
• Family
• Finance

• Proxy informant
• Unresolved financial situations
HRS Datasets of Interest
Available datasets: HRS - RAND, HRS Core, HRS Exit, HRS Post-Exit
Waves used: 2006, 2008, 2010, 2012
• HRS - RAND
• HRS Core - Section LB - Leave-Behind: subjective well-being, lifestyle and experience of stress, quality of social ties, personality traits, work-related beliefs, and self-related beliefs.
• HRS Core - Section D - Cognition: immediate and delayed free recall, working memory and mental processing, vocabulary, mental status, and self-rated memory.
Data Conversion
Data Transformation
HRS:
• RAND
• Core D
• Core LB
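The slides do not say how the three HRS sources were combined. As a rough sketch only, the snippet below assumes that RAND, Core D, and Core LB records share a respondent key and joins them on it. The key columns (HHID, PN), the other column names, and all values are illustrative assumptions, not the thesis code.

```python
def merge_on_key(tables, key=("HHID", "PN")):
    """Inner-join lists of dict records on the key columns (assumed respondent key)."""
    merged = None
    for table in tables:
        # Index this table's records by their key tuple.
        indexed = {tuple(rec[c] for c in key): rec for rec in table}
        if merged is None:
            merged = {k: dict(rec) for k, rec in indexed.items()}
        else:
            # Keep only respondents present in every table seen so far.
            merged = {k: {**merged[k], **indexed[k]} for k in merged if k in indexed}
    return [merged[k] for k in sorted(merged or {})]

# Made-up records; column names are hypothetical.
rand = [{"HHID": "000001", "PN": "010", "age": 67},
        {"HHID": "000002", "PN": "010", "age": 72}]
core_d = [{"HHID": "000001", "PN": "010", "recall": 5}]
core_lb = [{"HHID": "000001", "PN": "010", "calm": "A lot"},
           {"HHID": "000002", "PN": "010", "calm": "Some"}]

combined = merge_on_key([rand, core_d, core_lb])
```

Only respondents present in all three sources survive the join, which is one possible design choice; keeping partial records would require an outer join instead.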
Methods
Cloud of Individuals: stars represent individuals
Cloud of Variables: points represent variables
Data table (N individuals, categorical variables A, B, C):
     A   B   C
1    a1  b2  c1
⋮    ⋮   ⋮   ⋮
i    a2  b2  c3
i'   a1  b1  c1
⋮    ⋮   ⋮   ⋮
N    a4  b2  c2
• Unsupervised Machine Learning
  • Multiple Correspondence Analysis (MCA)
  • Clustering
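MCA operates on an indicator (one-hot) coding of a categorical table like the one above. The sketch below builds that coding for the four visible rows of the slide's table. The function name and the `variable_category` column naming are assumptions chosen to resemble the plot labels; it covers only the encoding step, not the full MCA decomposition.

```python
def indicator_matrix(rows):
    """One-hot encode categorical records into a 0/1 indicator table."""
    # One column per observed (variable, category) pair, in sorted order.
    columns = sorted({f"{var}_{cat}" for row in rows for var, cat in row.items()})
    position = {name: j for j, name in enumerate(columns)}
    matrix = []
    for row in rows:
        line = [0] * len(columns)
        for var, cat in row.items():
            line[position[f"{var}_{cat}"]] = 1
        matrix.append(line)
    return columns, matrix

# The four rows that are visible in the slide's table.
rows = [
    {"A": "a1", "B": "b2", "C": "c1"},
    {"A": "a2", "B": "b2", "C": "c3"},
    {"A": "a1", "B": "b1", "C": "c1"},
    {"A": "a4", "B": "b2", "C": "c2"},
]
columns, Z = indicator_matrix(rows)
```

Each row of `Z` contains exactly one 1 per variable, so every row sums to the number of variables (three here).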
Figure: MCA cloud of variables, shown in four regions. Axes: Dim1 (8.1%), Dim2 (4.7%).
Region 1: sophist_A lot, sophist_Some, bminded_A lot, curious_A lot, intellig_A lot, imagina_A lot, creative_A lot, sympath_A lot, softheart_A lot, caring_A lot, warm_A lot, helpful_A lot, talkactive_A lot, active_A lot, lively_A lot, friendly_A lot, outgoing_A lot, careless_A lot, careless_Not at all, thorough_A lot, hardworker_A lot, responsible_A lot, organized_A lot, calm_A lot, nervous_Not at all, worry_Not at all, moody_Not at all
Region 2: sophist_A little, bminded_Some, curious_Some, intellig_Some, imagina_A little, creative_A little, creative_Some, sympath_Some, softheart_Some, caring_Some, warm_Some, helpful_Some, talkactive_A little, talkactive_Some, active_Some, lively_Some, friendly_Some, outgoing_Some, careless_A little, careless_Some, thorough_Some, hardworker_Some, organized_Some, calm_Some, nervous_A little, nervous_Some, worry_A little, worry_Some, moody_A little, moody_Some
Region 3: sophist_A little, sophist_Not at all, bminded_A little, bminded_Not at all, curious_A little, intellig_A little, imagina_A little, imagina_Not at all, creative_Not at all, sympath_A little, softheart_A little, caring_A little, caring_Some, warm_A little, helpful_A little, talkactive_A little, talkactive_Not at all, active_A little, lively_A little, friendly_A little, friendly_Some, outgoing_A little, outgoing_Not at all, thorough_A little, hardworker_A little, responsible_A little, responsible_Some, organized_A little, organized_Not at all, calm_A little, calm_Some, nervous_A lot, worry_A lot, moody_A lot
Region 4: bminded_Not at all, curious_Not at all, intellig_Not at all, imagina_Not at all, sympath_Not at all, softheart_Not at all, warm_Not at all, helpful_Not at all, active_Not at all, lively_Not at all, friendly_Not at all, thorough_Not at all, hardworker_Not at all, calm_Not at all
Conclusions
Clusters: [figure]
• The hierarchical clustering technique, applied to the low-dimensional representation of participants provided by the MCA method, suggested a reasonable separation of respondent profiles characterized by a personality scale.
• This can be applied to survey design and sampling procedures.
• This can support correlation studies with other physical and mental health indicators.
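To illustrate hierarchical clustering on low-dimensional coordinates, here is a small agglomerative sketch. The slides do not state the linkage criterion or give the MCA coordinates, so average linkage and the six 2-D points below are assumptions made for the example.

```python
from itertools import combinations
from math import dist

def agglomerative(points, n_clusters):
    """Repeatedly merge the two closest clusters (average linkage)."""
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Mean pairwise Euclidean distance between members of two clusters.
        return sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]  # a < b, so deleting b keeps a valid
        del clusters[b]
    return [sorted(c) for c in clusters]

# Hypothetical (Dim1, Dim2) coordinates for six respondents.
coords = [(-0.8, 0.1), (-0.7, 0.2), (0.3, -0.2), (0.4, -0.1), (1.8, 2.5), (1.9, 2.4)]
groups = agglomerative(coords, n_clusters=3)
```

The number of clusters is passed in directly here; in practice it would typically be chosen by inspecting the dendrogram or a cluster validity measure.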
Paper Presented and Published
18th IEEE International Conference on Machine Learning and Applications - ICMLA 2019
December 16-19, Boca Raton, Florida, USA
Part II
Electronic Health Record
- How can Intensive Care Unit (ICU) Length of Stay (LOS) be predicted using Machine Learning models?
Medical Information Mart for Intensive Care - III (MIMIC-III)
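MIMIC-III Overview
• Population: newborns (NB) and patients older than 15 years
• 53,423 adult admissions (adm)
• 38,597 adult patients
• 7,870 neonates
• Median ICU stay: 2.1 days
• Median hospital stay: 6.9 days
• In-hospital mortality: 11.5%
• 44.1% female
• 380 laboratory measurements (meas.) per admission on average
• 7.76%
• Data type: EHR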
Beth Israel Deaconess Medical Center
• CareVue DB
• MetaVision DB
Both databases feed MIMIC-III.
Literature Review
Azari et al., 2012:
Approached LOS prediction by identifying similar groups. Reached an accuracy of 74.3%.
Van Houdenhoven et al., 2007:
LOS prediction for elective transthoracic esophagectomy with reconstruction for carcinoma, considering the presence of gastroesophageal reflux disease and respiratory minute volume. R² of 45%.
Clark & Ryan, 2002:
Tested by age group: patients younger than 55 years reached the highest accuracy, 69%; those aged 55 to 70 reached 13%; and those older than 70 reached 17%.
Gustafson, 1968:
Uses five different methodologies for predicting the LOS of inguinal herniotomy patients.
Afrin et al., 2019:
Predicts LOS using three classifications, focused on the age and death outcome of the patients. Accuracy 54.8% (RF and LR).
Intensive Care Unit (ICU) Length of Stay (LOS)
• Wait time for ICU admission
• ICU management
• Important predictor for death rate
• ICU cost
Data Access
Data or Specimens Only Research Training - CITI Program:
1. Belmont Report and Its Principles (ID 1127)
2. History and Ethics of Human Subjects Research (ID 498)
3. Basic Institutional Review Board (IRB) Regulations and Review Process (ID 2)
4. Records-Based Research (ID 5)
5. Genetic Research in Human Populations (ID 6)
6. Populations in Research Requiring Additional Considerations and/or Protections (ID 16680)
7. Conflicts of Interest in Human Subjects Research (ID 17464)
Exploratory Data Analysis
CSV to SQLite conversion: 26 CSV files loaded into SQLite
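A CSV to SQLite conversion of this kind can be sketched with Python's standard `csv` and `sqlite3` modules. The loader below is an illustration under assumptions, not the thesis code: the function name `load_csv` is hypothetical, the sample rows are made up, and every column is created without a declared type, so values are stored as text.

```python
import csv
import io
import sqlite3

def load_csv(conn, table, csv_file):
    """Create a table from the CSV header and insert every data row."""
    reader = csv.reader(csv_file)
    header = next(reader)
    columns = ", ".join(f'"{name}"' for name in header)
    marks = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', reader)
    conn.commit()
    return header

# Made-up sample in the shape of a PATIENTS file (not real MIMIC-III data).
sample = "ROW_ID,SUBJECT_ID,GENDER\n1,100,F\n2,101,M\n"
conn = sqlite3.connect(":memory:")
header = load_csv(conn, "PATIENTS", io.StringIO(sample))
row_count = conn.execute('SELECT COUNT(*) FROM "PATIENTS"').fetchone()[0]
```

For real files, `open(path, newline="")` would replace the `io.StringIO` object, and the same call would be repeated once per CSV file.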
Data Transformation
CSV PATIENTS
1. ROW_ID
2. SUBJECT_ID
3. GENDER
4. DOB
5. DOD
6. DOD_HOSP
7. DOD_SSN
8. EXPIRE_FLAG
CSV STAYS
1. ROW_ID
2. SUBJECT_ID
3. HADM_ID
4. ICUSTAY_ID
5. DBSOURCE
6. FIRST_CAREUNIT
7. LAST_CAREUNIT
8. FIRST_WARDID
9. LAST_WARDID
10. INTIME
11. OUTTIME
12. LOS
CSV DIAGNOSIS
1. ROW_ID
2. SUBJECT_ID
3. HADM_ID
4. SEQ_NUM
5. ICD9_CODE
CSV ADMISSIONS
1. ROW_ID
2. SUBJECT_ID
3. HADM_ID
4. ADMITTIME
5. DISCHTIME
6. DEATHTIME
7. ADMISSION_TYPE
8. ADMISSION_LOCATION
9. DISCHARGE_LOCATION
10. INSURANCE
11. LANGUAGE
12. RELIGION
13. MARITAL_STATUS
14. ETHNICITY
15. EDREGTIME
16. EDOUTTIME
17. DIAGNOSIS
18. HOSPITAL_EXPIRE_FLAG
19. HAS_CHARTEVENTS_DATA
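The four tables share SUBJECT_ID and HADM_ID, so they can be combined into one row per ICU stay. The query below is a possible sketch using SQLite. The table names follow the slide labels (STAYS, DIAGNOSIS), only a few of the listed columns are included, the rows are made up, and keeping only the first-ranked diagnosis (SEQ_NUM = 1) is an assumption rather than the documented thesis procedure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PATIENTS (SUBJECT_ID INTEGER, GENDER TEXT);
CREATE TABLE ADMISSIONS (SUBJECT_ID INTEGER, HADM_ID INTEGER,
                         ADMISSION_TYPE TEXT, ETHNICITY TEXT);
CREATE TABLE STAYS (SUBJECT_ID INTEGER, HADM_ID INTEGER, ICUSTAY_ID INTEGER,
                    FIRST_CAREUNIT TEXT, LOS REAL);
CREATE TABLE DIAGNOSIS (SUBJECT_ID INTEGER, HADM_ID INTEGER,
                        SEQ_NUM INTEGER, ICD9_CODE TEXT);

-- Made-up rows for illustration only.
INSERT INTO PATIENTS VALUES (1, 'F'), (2, 'M');
INSERT INTO ADMISSIONS VALUES (1, 10, 'EMERGENCY', 'WHITE'), (2, 20, 'URGENT', 'ASIAN');
INSERT INTO STAYS VALUES (1, 10, 100, 'SICU', 2.5), (2, 20, 200, 'MICU', 1.0);
INSERT INTO DIAGNOSIS VALUES (1, 10, 1, '486'), (1, 10, 2, '4019'), (2, 20, 1, '41401');
""")

# One row per ICU stay: patient, admission, and first-ranked diagnosis attributes.
query = """
SELECT s.ICUSTAY_ID, p.GENDER, a.ADMISSION_TYPE, s.FIRST_CAREUNIT, d.ICD9_CODE, s.LOS
FROM STAYS AS s
JOIN PATIENTS AS p ON p.SUBJECT_ID = s.SUBJECT_ID
JOIN ADMISSIONS AS a ON a.HADM_ID = s.HADM_ID
JOIN DIAGNOSIS AS d ON d.HADM_ID = s.HADM_ID AND d.SEQ_NUM = 1
ORDER BY s.ICUSTAY_ID
"""
stays = conn.execute(query).fetchall()
```

Restricting the join to one diagnosis per admission keeps the result at one row per stay; joining all diagnoses would instead repeat each stay once per code.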
Data Transformation
[Figure: transformation flow with seven numbered steps (1 to 7)]
Methods
Tidymodels framework:
• rsample (data sampling)
• recipes (data preprocessing)
• parsnip (machine learning modeling)
• yardstick (performance evaluation)
• Algorithm families: Decision Trees, Random Forest, Boosted Trees, SVM, and Linear Regression
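The four stages listed above (sampling, preprocessing, modeling, evaluation) can be mirrored in a few lines of plain Python. This is only an analogue of the workflow, not tidymodels code: the function names are hypothetical, the data are synthetic, and a one-predictor least-squares fit stands in for the model families used in the thesis.

```python
import random
from math import sqrt

def initial_split(rows, prop=0.75, seed=42):
    """Sampling stage: shuffle a copy and cut it into training and test sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * prop)
    return shuffled[:cut], shuffled[cut:]

def fit_linear(xs, ys):
    """Modeling stage: ordinary least squares, returns (intercept, slope)."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def rmse(truth, estimate):
    """Evaluation stage: root mean squared error."""
    return sqrt(sum((t - e) ** 2 for t, e in zip(truth, estimate)) / len(truth))

# Synthetic data on an exact line, y = 2x + 1.
rows = [(float(x), 2.0 * x + 1.0) for x in range(20)]
train, test = initial_split(rows)

# Preprocessing stage: centering learned on the training set only.
center = sum(x for x, _ in train) / len(train)

def prep(x):
    return x - center

intercept, slope = fit_linear([prep(x) for x, _ in train], [y for _, y in train])
predictions = [intercept + slope * prep(x) for x, _ in test]
error = rmse([y for _, y in test], predictions)
```

Learning the centering value from the training set alone, then reusing it on the test set, avoids leaking test information into the model.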
Methods
Predictors
• Ethnicity
• Respiratory diagnosis
Subset
• ICU: SICU
• Admission Type: Urgency
Linear Regression
• Adjusted R²: 63.75%
• RMSE: 9.56
Classifier
• Accuracy: 92.7%
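For reference, the reported metrics are usually defined as follows. The notation is an assumption for this note: n observations, p predictors, observed values y_i with mean \bar{y}, and predictions \hat{y}_i. The slides do not state the exact formulas or units used.

```latex
\begin{align}
\mathrm{RMSE} &= \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \\
R^2 &= 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2} \\
R^2_{\mathrm{adj}} &= 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1} \\
\mathrm{Accuracy} &= \frac{\text{number of correctly classified cases}}{n}
\end{align}
```

RMSE is expressed in the same units as the response, so it reads as a typical prediction error in LOS units.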
Conclusions
• LOS prediction is a very specific, case-oriented prediction task, and it is unlikely that one model can generalize to every case.
• It was possible to create a specific prediction model for:
  • Surgical Intensive Care Unit
  • Admitted from Emergency
  • Patients diagnosed with respiratory disease
• The novel R library tidymodels enables the use of multiple ML libraries under a unifying collection of packages for modeling and statistical analysis that share the underlying design philosophy, grammar, and data structures of the modern data science tools in the tidyverse.
Next Steps
• Format a paper and submit it to machine learning conferences/journals.
  • ACM-BCB ’20: 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics
• Apply the unsupervised learning techniques used in Part I to the MIMIC-III dataset.
• Create MIMIC-III subsets with Lab Exams for further investigation.
Thanks to
Friends at
Icons: http://www.flaticons.com
