Final_Presentation.pptx

Analysis and Early Prediction of
Sepsis using Clinical Data
By Anushree Ankola
Advisor: Dr. Anand Panangadan
Reviewer: Prof. Tseng Chen James

Agenda
 Sepsis – Who It Affects and Symptoms
 Objective
 Challenge Dataset
 Procedure
 EDA – missing values
 Feature Engineering
 EDA – Dataset Imbalance
 Choosing an Evaluation Metric
 Decision Tree for prediction
 Future Scope 1 - Using XGBoost for prediction
 Further research and findings
 Conclusion
 References

Sepsis – Who It Affects and Symptoms
 Affects:
• very young children,
• older adults,
• people with chronic diseases,
• and those with weakened immune system
 Sepsis can be difficult to diagnose because it occurs quickly and can be confused
with other conditions. Watch for a combination of the following symptoms.
 S Shivering, fever, or very cold
E Extreme pain or general discomfort (“worst ever”)
P Pale or discolored skin
S Sleepy, difficult to rouse, confused
I “I feel like I might die!”
S Short of breath

Objective
 The goal of the analysis is the early detection of sepsis using physiological data.
 The early prediction of sepsis is potentially life-saving, and we aim to predict
sepsis 6 hours before the clinical prediction of sepsis.
 Late prediction of sepsis is potentially life-threatening and also consumes heavy
hospital resources.
 Predicting sepsis in non-sepsis patients, or predicting it very early in sepsis
patients, consumes limited hospital resources. We assume this cost of an early or
false alarm is small compared with the benefit of early detection.

Challenge Dataset
 The data used in the competition comes from ICU patients in two separate hospital
systems and is obtained from PhysioNet.
 The data is split into a 70% training set and a 30% testing set. The training set is
further split to hold out a validation set.
 The original data for each patient is contained in a single pipe-delimited text
file. Every file has the same header, and each row represents a single hour's
worth of data. Each hospital has 20,000 patients and hence 20,000 files.
 Available patient covariates consist of demographics, vital signs, and laboratory
values.
 Features:
• 8 Vital Signs: Heart Rate, Temperature, Blood Pressure, Respiratory Rate, etc.
• 26 Laboratory Values: Platelet Count, Glucose, Calcium, etc.
• 6 Demographics: Age, Gender, Time in ICU, Hospital Admit Time, etc.
 1 Label:
• 0 (Non-sepsis) and 1 (Sepsis)
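
A minimal sketch of reading one pipe-delimited patient file with pandas. The two-row excerpt and its column names (HR, Temp, Resp, Age, Gender, ICULOS, SepsisLabel) are assumptions made for illustration, not the full 41-column header; empty fields stand for measurements that were not taken in that hour.

```python
import io

import pandas as pd

# Hypothetical two-hour excerpt of one patient file. The column names are
# assumptions for illustration; empty fields are missing measurements.
sample_psv = """HR|Temp|Resp|Age|Gender|ICULOS|SepsisLabel
97.0||19.0|83.14|0|1|0
89.0|36.7||83.14|0|2|0
"""


def read_patient(source):
    """Read one pipe-delimited patient file into a DataFrame (one row per hour)."""
    return pd.read_csv(source, sep="|")


patient = read_patient(io.StringIO(sample_psv))
print(patient.shape)         # rows = hours in the ICU, columns = covariates + label
print(patient.isna().sum())  # missing values per column
```

The same function accepts a file path, so it could be pointed at a real .psv file instead of the in-memory sample.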

Assumptions
 Combined the dataset by appending all the patient files
 Total files: 43,765 psv files
 Shape of the combined data: 1,552,287 rows × 41 columns
 The dataset is treated as not time dependent.
 2 approaches to solve it:
1. Add a time component and patient ID
2. Ignore the time component and consider each row independently
 Following the 2nd approach. Reason: sepsis can be predicted without a patient's past
data, which is more robust and needs fewer resources.
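
The second approach can be sketched as follows: every patient file is read and appended into one table, and no patient ID or time index is kept. The folder layout, the file names, and the two-column files used in the demonstration are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd


def combine_patient_files(folder):
    """Append every .psv file in `folder` into one frame, ignoring patient identity."""
    paths = sorted(Path(folder).glob("*.psv"))
    frames = [pd.read_csv(path, sep="|") for path in paths]
    return pd.concat(frames, ignore_index=True)


# Tiny demonstration with two hypothetical patient files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "p000001.psv").write_text("HR|SepsisLabel\n80|0\n82|0\n")
    Path(tmp, "p000002.psv").write_text("HR|SepsisLabel\n110|1\n")
    combined = combine_patient_files(tmp)

print(combined.shape)  # every hourly row from every patient, treated independently
```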

Procedure
1. Combine all data
2. Non-time-dependent approach
3. Handling missing values
4. Handling data imbalance
5. Baseline prediction
6. Feature engineering

EDA - Handling
Missing Values
 Most of the laboratory values have missing
entries (see figure)
 For many of these features, more than 90% of
the values are missing
 2 steps to handle:
• Remove features with missingness > 92%
• Categorically encode the remaining features so
that "missing" becomes its own category.
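
The first step can be sketched as below, assuming the combined data is a pandas DataFrame in which missing measurements are NaN. The toy frame and its column names are invented for illustration; only the 92% threshold comes from the slide.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, threshold=0.92):
    """Drop columns whose fraction of missing values is above `threshold`."""
    missing_fraction = df.isna().mean()  # per-column share of NaN values
    keep = missing_fraction[missing_fraction <= threshold].index
    return df[keep]


# Hypothetical frame: one mostly complete vital sign, one very sparse lab value.
df = pd.DataFrame({
    "HR": [80.0] * 90 + [np.nan] * 10,     # 10% missing, kept
    "Lactate": [1.2] * 5 + [np.nan] * 95,  # 95% missing, dropped
})

kept = drop_sparse_columns(df)
print(df.isna().mean())
print(list(kept.columns))
```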

Feature Selection – Part 1
 Two approaches were employed for feature selection:
1. Checked the correlation of each feature with the presence of sepsis
2. Read health magazines and research journals such as
• US National Library of Medicine, National Institutes of Health
• Centers for Disease Control and Prevention
• Sepsis - The American Journal of Medicine
and shortlisted the most frequently named indicators of sepsis
 Outcome: Heart rate, Pulse Oximetry, Body temperature, Blood
Pressure (SBP, DBP), Mean Arterial Pressure, Respiration rate, Fraction of
inspired oxygen (FiO2), Age, Gender, Hospital Admission Time and ICU
length of stay.
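
A minimal sketch of the correlation check in approach 1, assuming Pearson correlation between each numeric feature and the label. The slide does not say which correlation measure was used, and the toy values and column names here are invented.

```python
import pandas as pd

# Hypothetical hourly rows; values are made up for illustration only.
df = pd.DataFrame({
    "HR": [70, 72, 75, 110, 115, 120],
    "Age": [60, 30, 45, 50, 40, 55],
    "SepsisLabel": [0, 0, 0, 1, 1, 1],
})

# Absolute Pearson correlation of every feature with the label, strongest first.
ranked = (
    df.corr()["SepsisLabel"]
    .drop("SepsisLabel")
    .abs()
    .sort_values(ascending=False)
)
print(ranked)
```

Correlation only captures linear association with the label, which is one reason to combine it with the literature review in approach 2.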

Feature Engineering & label encoding
 Developed 8 new features, described below:
1. new_age: has 3 categorical values – old, young and adult
2. new_hr, new_temp, new_o2sat, new_bp, new_resp, new_map, new_fio2: each has 3
categorical values – normal, abnormal and missing
 Next, performed feature selection again on them and selected all of the above features,
plus Gender, Hospital Admission Time and ICU length of stay, for further
processing as the training set

 ]
 All these are categorically values. They are encoded so that it is easier to run a ML
algorithm.
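
The engineered features and their encoding can be sketched as follows. The cut-offs (60 to 100 for heart rate, 18 and 65 for age) and the integer codes are assumptions chosen for illustration; the slides do not state the thresholds that were actually used.

```python
import numpy as np
import pandas as pd


def categorize_vital(series, low, high):
    """Label each reading 'normal' (within [low, high]), 'abnormal', or 'missing'."""
    labels = pd.Series("abnormal", index=series.index)
    labels[series.between(low, high)] = "normal"
    labels[series.isna()] = "missing"
    return labels


def categorize_age(age, young_below=18, old_from=65):
    """Bin age into 'young', 'adult', or 'old' (thresholds are assumptions)."""
    bins = [-np.inf, young_below, old_from, np.inf]
    return pd.cut(age, bins=bins, right=False,
                  labels=["young", "adult", "old"]).astype(str)


df = pd.DataFrame({"HR": [72.0, 130.0, np.nan], "Age": [12.0, 40.0, 83.0]})
df["new_hr"] = categorize_vital(df["HR"], low=60, high=100)
df["new_age"] = categorize_age(df["Age"])

# Label encoding: map each category to an integer code (codes are arbitrary).
df["new_hr_code"] = df["new_hr"].map({"normal": 0, "abnormal": 1, "missing": 2})
print(df)
```

Treating "missing" as its own category keeps rows with absent measurements in the training set instead of dropping or imputing them.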

EDA – Handling Data
Imbalance
 98% of patients do not have sepsis and 2%
have sepsis.
 Problem with accuracy: a model that always
predicts non-sepsis would still score about 98%
 Ways to deal with imbalance:
• Undersampling
• Oversampling
• Using an algorithm that handles imbalance well
• Using Balanced Bagging Classifier
 Which is better?
• Balanced Bagging Classifier with Decision Trees
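
Random undersampling and oversampling, the first two options above, can be sketched with pandas as follows. This assumes a binary label column; the column names and the 98/2 toy frame are illustrative.

```python
import pandas as pd


def split_by_class(df, label_col):
    """Return (majority_rows, minority_rows) for a binary label column."""
    counts = df[label_col].value_counts()
    majority = df[df[label_col] == counts.idxmax()]
    minority = df[df[label_col] == counts.idxmin()]
    return majority, minority


def undersample_majority(df, label_col, random_state=0):
    """Keep all minority rows and an equally sized random subset of majority rows."""
    majority, minority = split_by_class(df, label_col)
    kept = majority.sample(n=len(minority), random_state=random_state)
    return pd.concat([kept, minority], ignore_index=True)


def oversample_minority(df, label_col, random_state=0):
    """Keep every row and add resampled minority rows until the classes match."""
    majority, minority = split_by_class(df, label_col)
    extra = minority.sample(n=len(majority) - len(minority), replace=True,
                            random_state=random_state)
    return pd.concat([majority, minority, extra], ignore_index=True)


# Toy data with the 98% / 2% split described above.
df = pd.DataFrame({"HR": range(100), "SepsisLabel": [1] * 2 + [0] * 98})
under = undersample_majority(df, "SepsisLabel")
over = oversample_minority(df, "SepsisLabel")
print(under["SepsisLabel"].value_counts())
print(over["SepsisLabel"].value_counts())
```

Undersampling discards most of the majority rows, while oversampling repeats the few minority rows many times; both trade-offs matter at a 98/2 ratio.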

Training Data with Decision
Trees
 Pre-work:
• Common classification metrics such as the accuracy score are not useful
because the data is imbalanced
• Precision is the fraction of the examples predicted to belong to a class
that truly belong to that class.
Precision = (true positives) / (true positives + false positives)
• Recall is the fraction of the examples that truly belong to a class that
were predicted to belong to it.
Recall = (true positives) / (true positives + false negatives)
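
A worked example of the two formulas on a made-up set of 100 patients with the 98/2 class split. Returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = sepsis)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)


def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0  # 0.0 by convention when undefined


def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0


# 2 septic patients followed by 98 non-septic patients.
y_true = [1, 1] + [0] * 98

# Model A never predicts sepsis.
always_negative = [0] * 100
# Model B flags 10 patients, only one of whom is actually septic.
flags_ten = [1, 0] + [1] * 9 + [0] * 89

for name, y_pred in [("always negative", always_negative), ("flags ten", flags_ten)]:
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    print(name, accuracy(tp, fp, fn, tn), precision(tp, fp), recall(tp, fn))
```

Model A reaches 98% accuracy while finding no sepsis case at all (recall 0), which is why accuracy is set aside in favour of precision and recall.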

Training Data with
Decision Trees
 Using the Balanced Bagging Classifier from the
imblearn library, which automatically creates
balanced samples of the input data.
 It has the parameter 'ratio' (renamed
'sampling_strategy' in later imblearn releases),
which controls how the data is sampled. I used
'majority', which resamples only the majority class
 From the figure, although the ROC curve looks
promising, the P-R curve shows that the model is
poor at identifying the sepsis cases.
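
The idea behind balanced bagging can be sketched by hand with scikit-learn and NumPy: each decision tree is trained on all minority rows plus an equally sized random draw from the majority class, and the trees' probabilities are averaged. This is a simplified stand-in written for illustration, not the imblearn implementation used in the project, and the data is synthetic. It also reports both the ROC AUC and the average precision (a summary of the P-R curve) so the two views can be compared.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def fit_balanced_bagging(X, y, n_estimators=10, max_depth=3, seed=0):
    """Train trees on balanced subsets: all minority rows + sampled majority rows.

    Assumes label 1 is the minority class and label 0 the majority class.
    """
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == 1)
    majority_idx = np.flatnonzero(y == 0)
    trees = []
    for _ in range(n_estimators):
        sampled = rng.choice(majority_idx, size=len(minority_idx), replace=False)
        idx = np.concatenate([minority_idx, sampled])
        tree = DecisionTreeClassifier(max_depth=max_depth,
                                      random_state=int(rng.integers(1_000_000)))
        trees.append(tree.fit(X[idx], y[idx]))
    return trees


def predict_sepsis_probability(trees, X):
    """Average the positive-class probability over all trees."""
    return np.mean([tree.predict_proba(X)[:, 1] for tree in trees], axis=0)


# Synthetic data with roughly 2% positives; not the challenge dataset.
rng = np.random.default_rng(42)
n = 5000
y = (rng.random(n) < 0.02).astype(int)
X = rng.normal(size=(n, 3)) + 1.5 * y[:, None]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

trees = fit_balanced_bagging(X_train, y_train)
scores = predict_sepsis_probability(trees, X_test)

roc_auc = roc_auc_score(y_test, scores)
pr_auc = average_precision_score(y_test, scores)
print(f"ROC AUC: {roc_auc:.3f}  average precision: {pr_auc:.3f}")
```

With a rare positive class, ROC AUC is barely affected by false positives among the many negatives, whereas average precision penalises them directly, which is consistent with the ROC curve looking better than the P-R curve.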

Training the data with XGBoost
XGBoost - eXtreme Gradient Boosting
• Boosting: a method that combines
weak learners into a strong learner
• A boosting algorithm like XGBoost adds trees to
the model sequentially, with each new tree fitted to
correct the errors of the trees before it. This reduces
the bias of the model and typically improves accuracy.
• Benefits of XGBoost: highly scalable/parallelizable,
quick to execute, and often outperforms other
algorithms on tabular data.
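
The sequential behaviour of boosting can be illustrated with scikit-learn's GradientBoostingClassifier, used here as a stand-in because it follows the same gradient-boosting principle. It is not XGBoost itself, and the data is synthetic. The sketch uses depth-1 trees as weak learners and shows the training loss falling as trees are added.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic binary problem with a non-linear boundary; not the challenge data.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 4))
y = (X[:, 0] + X[:, 1] ** 2 > 1.0).astype(int)

# Weak learners: depth-1 trees (stumps), added one after another.
model = GradientBoostingClassifier(n_estimators=50, max_depth=1,
                                   learning_rate=0.1, random_state=0)
model.fit(X, y)

# train_score_[i] is the training loss after the (i + 1)-th tree has been added.
first_loss = model.train_score_[0]
last_loss = model.train_score_[-1]
print(f"training loss after 1 tree: {first_loss:.3f}, after 50 trees: {last_loss:.3f}")
```

A falling training loss only shows that bias is being reduced; performance on held-out data still has to be checked separately.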

Further Research and Findings
 Time-component approach; needs a domain expert
 PCA for understanding variables better
 Using SMOTE for handling Imbalance
 Work further on XGBoost
 Better Feature Engineering
 Ways to reduce Hospital stay time
Learning Curve with the Project
 Python – Object Oriented Structure and Programming
 Libraries heavily used – scikit-learn, Matplotlib
 Built on Jupyter Notebook

Conclusion
 We have handled the missingness and imbalance in the large dataset
 We removed features with more than 92% missing values
 Performed feature engineering (8 new features) and selected important features
 We aimed to predict the onset of sepsis 6 hours early, and so far the machine
learning model employed seems to classify it only partially
 The project can be continued with further research on the importance of
the features and better model building, under the guidance of a good health
science domain expert.

References
[1] https://www.physionet.org/content/challenge-2019/1.0.0/
[2] https://www.datacamp.com/community/tutorials/decision-tree-classification-python
[3] https://towardsdatascience.com/using-bagging-and-boosting-to-improve-classification-tree-accuracy-6d3bb6c95e5b
[4] https://towardsdatascience.com/early-detection-of-sepsis-using-physiological-data-78d5f31fab9d
[5] https://iopscience.iop.org/article/10.1088/1757-899X/428/1/012004
[6] https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-classification-in-python/
[7] https://www.cdc.gov/
[8] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6429642/
[9] http://www.erogol.com/fighting-class-unbalance-supervised-ml-problem/

Thank You
 I would like to thank my advisor Dr.
Anand Panangadan for helping me
with the project
 I would like to thank my friends at
Edward Life Sciences for advising me
on ways to approach the problem
 I would like to thank my university for giving
me the necessary skills to attempt and
complete the project
