Analysis and Early Prediction of
Sepsis using Clinical Data
By Anushree Ankola
Advisor: Dr. Anand Panangadan
Reviewer: Prof. Tseng Chen James
Agenda
 Sepsis – Who It Affects and Symptoms
 Objective
 Challenge Dataset
 Procedure
 EDA – missing values
 Feature Engineering
 EDA – Dataset Imbalance
 Choosing Accuracy metric
 Decision Tree for prediction
 Future Scope 1 - Using XGBoost for prediction
 Further research and findings
 Conclusion
 References
Sepsis - Statistics
Sepsis - Who It Affects and Symptoms
 Affects:
• very young children,
• older adults,
• people with chronic diseases,
• and those with weakened immune systems
 Sepsis can be difficult to diagnose because it occurs quickly and can be confused
with other conditions. Watch for a combination of the following symptoms:
S Shivering, fever, or very cold
E Extreme pain or general discomfort (“worst ever”)
P Pale or discolored skin
S Sleepy, difficult to rouse, confused
I “I feel like I might die!”
S Short of breath
Objective
 The goal of the analysis is the early detection of sepsis using physiological data.
 The early prediction of sepsis is potentially life-saving, and we aim to predict
sepsis 6 hours before the clinical prediction of sepsis.
 Late prediction of sepsis is potentially life-threatening, and also consumes heavy
hospital resources.
 Predicting sepsis in non-sepsis patients, or predicting it very early in sepsis
patients, also consumes limited hospital resources. We assume this risk is small
compared with the potential benefit of early prediction.
Challenge Dataset
 Data used in the competition is sourced from ICU patients in two separate hospital
systems and is obtained from PhysioNet.
 The data is split into a 70% training set and a 30% test set. Part of the training set
is held out for validation.
 The original data for each patient is contained in a single pipe-delimited text
file. Every file has the same header, and each row represents a single hour's
worth of data. Each hospital has 20,000 patients and hence 20,000 files.
 Available patient covariates consist of demographics, vital signs, and laboratory
values.
 Features:
• 8 Vital Signs: Heart Rate, Temperature, Blood Pressure, Respiratory Rate,
End-Tidal Carbon Dioxide, etc.
• 26 Laboratory Values: Platelet Count, Glucose, Calcium, etc.
• 6 Demographics: Age, Gender, Time in ICU, Hospital Admit Time, etc.
 1 Label:
• 0 (Non-sepsis) and 1 (Sepsis)
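As a rough illustration of the file layout described on this slide, the sketch below parses pipe-delimited text with Python's standard `csv` module and appends the rows of several patients into one table. The sample text, the patient IDs and the helper names `parse_patient_file` and `combine` are assumptions made for the example, and only a handful of the 41 columns are shown.

```python
import csv
import io
import math

# Hypothetical two-hour excerpt of one patient file. The column names are
# assumed for illustration; a real file has 41 columns.
SAMPLE_PSV = """HR|Temp|Resp|Age|Gender|ICULOS|SepsisLabel
88|NaN|18|67.5|1|1|0
92|37.2|NaN|67.5|1|2|0
"""


def parse_patient_file(text, patient_id):
    """Parse one pipe-delimited patient file into a list of hourly rows."""
    rows = []
    for record in csv.DictReader(io.StringIO(text), delimiter="|"):
        row = {"patient_id": patient_id}
        for name, value in record.items():
            number = float(value)  # the literal "NaN" parses to float("nan")
            row[name] = None if math.isnan(number) else number
        rows.append(row)
    return rows


def combine(files):
    """Append the rows of every patient file into one table (list of dicts)."""
    table = []
    for patient_id, text in files.items():
        table.extend(parse_patient_file(text, patient_id))
    return table


# Two hypothetical patients that happen to share the same sample content.
table = combine({"p000001": SAMPLE_PSV, "p000002": SAMPLE_PSV})
print(len(table), "hourly rows; first row:", table[0])
```

Missing measurements are stored as `None` here so that later steps can count them or turn them into a category of their own.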
Sepsis Data
Assumptions
 Combined the dataset by appending all the patient files
 Total files: 43,765 PSV files
 Shape of the combined data: 1,552,287 rows × 41 columns
 The rows carry no explicit time label, so the dataset can be treated as not time dependent.
 2 approaches to solve it:
1. Add a time component and a patient ID
2. Ignore the time component and consider each row independently
 Following the 2nd approach. Reason: it can predict sepsis without past patient data,
is more robust, and needs fewer resources.
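One point to watch when rows are treated independently: a purely row-wise random split can place hours from the same patient in both the training and the test set. The sketch below shows one way to keep every patient in exactly one split, using the 70/30 proportion from the Challenge Dataset slide. The function name `split_by_patient`, the seed and the toy rows are assumptions for illustration, not the project's actual script.

```python
import random


def split_by_patient(patient_ids, train_frac=0.7, seed=0):
    """Assign whole patients to train or test so no patient spans both."""
    unique_ids = sorted(set(patient_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_ids)
    cut = int(round(train_frac * len(unique_ids)))
    return set(unique_ids[:cut]), set(unique_ids[cut:])


# Hypothetical hourly rows: (patient_id, hour) for 10 patients, 5 hours each.
rows = [(f"p{n:03d}", hour) for n in range(10) for hour in range(1, 6)]

train_ids, test_ids = split_by_patient([pid for pid, _ in rows])
train_rows = [r for r in rows if r[0] in train_ids]
test_rows = [r for r in rows if r[0] in test_ids]
print(len(train_rows), "training rows,", len(test_rows), "test rows")
```

The split is made on patient IDs first and the rows follow their patient, so the row counts only match the 70/30 proportion when patients have similar lengths of stay.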
Procedure
 Combine all data
 Non-time-dependent approach
 Handling missing values
 Handling data imbalance
 Baseline prediction
 Feature engineering
EDA - Handling Missing Values
 Most of the laboratory values are missing (Fig.)
 More than 90% of the values are missing for many of these features
 2 steps to handle this:
• Remove features with more than 92% missing values
• Categorically encode the remaining features so that a missing value becomes
its own category.
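The first step can be sketched as follows: compute the fraction of missing values per column and keep only the columns at or below the 92% threshold quoted above. The toy table and the helper names `missing_fraction` and `drop_sparse_columns` are hypothetical; missing values are assumed to be stored as `None`.

```python
def missing_fraction(table):
    """Return {column: fraction of rows in which the value is None}."""
    n = len(table)
    return {c: sum(row[c] is None for row in table) / n for c in table[0]}


def drop_sparse_columns(table, threshold=0.92):
    """Drop columns whose missing fraction is strictly above the threshold."""
    fractions = missing_fraction(table)
    keep = [c for c, f in fractions.items() if f <= threshold]
    return [{c: row[c] for c in keep} for row in table]


# Hypothetical toy table: 'Lactate' is missing in 24 of 25 rows (96%),
# 'HR' is missing in 2 of 25 rows (8%).
table = [
    {"HR": None if i < 2 else 80.0, "Lactate": 1.1 if i == 0 else None}
    for i in range(25)
]

fractions = missing_fraction(table)
cleaned = drop_sparse_columns(table)
print(fractions)
print("columns kept:", list(cleaned[0]))
```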
Feature Selection – Part 1
 Two approaches employed for feature selection:
1. Checked the correlation of each feature with the presence of sepsis
2. Read health magazines and research journals such as
• US National Library of Medicine, National Institutes of Health
• Centers for Disease Control and Prevention
• Sepsis - The American Journal of Medicine
and picked out the most frequently named indicators of sepsis
 Outcome: Heart Rate, Pulse Oximetry, Body Temperature, Blood
Pressure (SBP, DBP), Mean Arterial Pressure, Respiration Rate, Fraction of
Inspired Oxygen, Age, Gender, Hospital Admission Time and ICU
Length of Stay.
Feature Engineering & label encoding
 Developed 8 new features, described below:
1. new_age: has 3 categorical values – old, young and adult
2. new_hr, new_temp, new_o2sat, new_bp, new_resp, new_map, new_fio2: each has 3
categorical values – normal, abnormal and missing
 Next, performed feature selection again on them and selected all of the above features,
plus Gender, Hospital Admission Time and ICU Length of Stay, for further
processing as a training set
 All of these are categorical values. They are label encoded so that it is easier to run an ML
algorithm on them.
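A minimal sketch of this binning and encoding step is given below. The slides do not state the cut-offs used for the vital signs, so the ranges in `NORMAL_RANGES` are placeholder assumptions; the age bands follow the speaker notes (under 10, 10 to 60, over 60). The helper names are hypothetical.

```python
# Placeholder reference ranges, assumed for illustration only.
NORMAL_RANGES = {"HR": (60.0, 100.0), "Temp": (36.1, 37.2), "Resp": (12.0, 20.0)}


def bin_vital(name, value):
    """Map a raw vital sign to 'normal', 'abnormal' or 'missing'."""
    if value is None:
        return "missing"
    low, high = NORMAL_RANGES[name]
    return "normal" if low <= value <= high else "abnormal"


def bin_age(age):
    """Age bands from the speaker notes: under 10, 10 to 60, over 60."""
    if age < 10:
        return "young"
    return "adult" if age <= 60 else "old"


def label_encode(values):
    """Map each distinct category to an integer code (sorted for stability)."""
    codes = {category: i for i, category in enumerate(sorted(set(values)))}
    return [codes[v] for v in values], codes


hr_raw = [72.0, None, 131.0, 95.0]  # hypothetical heart-rate readings
new_hr = [bin_vital("HR", v) for v in hr_raw]
encoded, codes = label_encode(new_hr)
print(new_hr, encoded, codes)
```

Turning "missing" into a category lets a tree-based model split on the absence of a measurement instead of requiring imputation.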
EDA – Handling Data Imbalance
 98% of patients do not have sepsis and 2%
have sepsis.
 Problem with accuracy: a model that always predicts "no sepsis" would already
score about 98%
 Ways to deal with imbalance:
• Undersampling
• Oversampling
• Using a good algorithm
• Using a Balanced Bagging Classifier
 Which is better?
• Balanced Bagging Classifier with Decision Trees
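The problem with accuracy can be made concrete with a small sketch. The labels below are hypothetical and simply reproduce the 98% / 2% split quoted on this slide; the "model" never predicts sepsis.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical labels: 98 non-sepsis rows and 2 sepsis rows.
y_true = [0] * 98 + [1] * 2
always_negative = [0] * 100  # a "model" that never predicts sepsis

baseline_accuracy = accuracy(y_true, always_negative)
sepsis_found = sum(t == 1 and p == 1 for t, p in zip(y_true, always_negative))
print("accuracy:", baseline_accuracy, "sepsis cases found:", sepsis_found)
```

The score looks excellent even though not a single sepsis case is detected, which is why accuracy alone is a poor guide on this dataset.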
Training Data with Decision Trees
 Pre-work:
• Common classification metrics such as the accuracy score are not useful because
the data is imbalanced
• Precision is the fraction of relevant examples (true positives)
among all of the examples that were predicted to belong to a certain
class.
Precision = (true positives) / (true positives + false positives)
• Recall is the fraction of the examples that truly belong to a class
that were predicted to belong to that class.
Recall = (true positives) / (true positives + false negatives)
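The two formulas translate directly into code. The sketch below counts true positives, false positives and false negatives for the sepsis class on hypothetical predictions; the function names are chosen for the example.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives and false negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, fp, fn


def precision_recall(y_true, y_pred, positive=1):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    tp, fp, fn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical example: 4 true sepsis rows; the model flags 5 rows, 3 correctly.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print("precision:", precision, "recall:", recall)
```

Here precision is 3 / 5 and recall is 3 / 4. Returning 0.0 when a denominator is zero is a convention chosen for this sketch.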
Training Data with Decision Trees
 Using the Balanced Bagging Classifier from the
imblearn library, which automatically creates
balanced samples of the input data.
 It has the parameter 'ratio' (named 'sampling_strategy'
in later imblearn releases), which controls how the data
is sampled. I used 'majority', which resamples only the
majority class.
 From the Fig., although the ROC curve seems
promising, the precision-recall curve shows that the
model is not great at classifying the sepsis class.
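To show what "balanced samples" means, here is a simplified pure-Python sketch of the sampling idea, not imblearn's implementation: every bag keeps the minority rows and adds an equally sized random draw from the majority rows, and one decision tree would then be trained per bag. The function name `balanced_bags` is hypothetical, and the sketch assumes label 1 is the minority class.

```python
import random


def balanced_bags(labels, n_bags=3, seed=0):
    """Build index lists in which both classes appear equally often.

    Assumes label 1 is the minority class and label 0 the majority class.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    bags = []
    for _ in range(n_bags):
        # Undersample the majority class down to the minority class size.
        bag = minority + rng.sample(majority, len(minority))
        rng.shuffle(bag)
        bags.append(bag)
    return bags


labels = [0] * 98 + [1] * 2  # hypothetical 98% / 2% label split
bags = balanced_bags(labels)
for bag in bags:
    print(sorted(bag), "sepsis rows in bag:", sum(labels[i] for i in bag))
```

Because each bag sees a different slice of the majority class, the ensemble as a whole still uses much of the non-sepsis data even though every individual tree trains on a balanced sample.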
Training the data with XGBoost
XGBoost - eXtreme Gradient Boosting
• Boosting: a method that converts
weak learners -> strong learners
• A boosting algorithm like XGBoost adds models
sequentially, adjusting the weights of the
weak learners along the way. This reduces the bias of
the model and typically improves accuracy.
• Benefits of XGBoost: highly scalable/parallelizable,
quick to execute, and often outperforms other
algorithms.
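The sequential idea behind boosting can be sketched in a few lines. The example below is plain gradient boosting for squared error with one-split regression stumps on a single feature; it is not XGBoost itself, which adds regularisation and second-order gradient information. The data, the learning rate and the function names are assumptions for illustration.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Add stumps one at a time, each fitted to the current residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(xs)
    stumps, errors = [], []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]
        errors.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))

    def model(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return model, errors


# Hypothetical one-feature data with a noisy step from 0 to 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0]

model, errors = boost(xs, ys)
print("training error after round 1:", errors[0])
print("training error after round 20:", errors[-1])
```

Each stump on its own is a weak learner; the training error falls as more of them are added, which is the "weak learners -> strong learner" effect described above.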
Further Research and Findings
 Time-component approach; needs a domain expert
 PCA to understand the variables better
 Using SMOTE to handle imbalance
 Work further on XGBoost
 Better feature engineering
 Ways to reduce hospital stay time
Learning Curve with the Project
 Python – object-oriented structure and programming
 Libraries heavily used – scikit-learn (sklearn), Matplotlib
 Built in Jupyter Notebook
Conclusion
 We handled the missingness and the imbalance in the large dataset
 We removed features with more than 92% missing values
 Performed feature engineering (8 new features) and selected important features
 We aimed to predict the onset of sepsis 6 hours in advance; so far the machine
learning model employed classifies it only partially
 The project can be continued with further research on the importance of
the features and better model building, under the guidance of a good health
science domain expert.
References
[1] https://www.physionet.org/content/challenge-2019/1.0.0/
[2] https://www.datacamp.com/community/tutorials/decision-tree-classification-python
[3] https://towardsdatascience.com/using-bagging-and-boosting-to-improve-classification-tree-accuracy-6d3bb6c95e5b
[4] https://towardsdatascience.com/early-detection-of-sepsis-using-physiological-data-78d5f31fab9d
[5] https://iopscience.iop.org/article/10.1088/1757-899X/428/1/012004
[6] https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-classification-in-python/
[7] https://www.cdc.gov/
[8] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6429642/
[9] http://www.erogol.com/fighting-class-unbalance-supervised-ml-problem/
Thank You
 I would like to thank my advisor, Dr.
Anand Panangadan, for helping me
with the project
 I would like to thank my friends at
Edward Life Sciences for advising me
on ways to approach the problem
 I would like to thank my university for giving
me the necessary skills to attempt and
complete the project



Editor's Notes

  • #2: Thank you so much for attending my presentation. I welcome you both. If you have any questions during my presentation please stop me and ask and I will try my best to answer them. My final year project is Analysis and Prediction of Sepsis using Clinical Data
  • #3: The agenda for today’s presentation is – first I will talk about sepsis, its statistics, affects and symptoms. The objective of the project, the challenge dataset, Procedure I took to solve the problem, Exploratory Data Analysis and my intuitions , findings and inferring the course of project, handling data imbalance and missingness, choosing the right accuracy metric. Then building prediction models, future scope of project and conclusion.
  • #4: What is sepsis? Sepsis is a potentially life-threatening condition caused by the body’s response to an infection. Normally, the body releases chemicals into the bloodstream to neutralise an infection. Sepsis occurs when the body’s response to these chemicals is out of balance, triggering changes that can damage multiple organ systems. Sepsis is caused by infection and can happen to anyone. It is most common and most dangerous in: older adults; pregnant women; children younger than 1; people who have chronic conditions, such as diabetes, kidney or lung disease, or cancer; and people who have weakened immune systems. Statistics: in the USA, 270,000 people die from sepsis each year; internationally, 6 million people die from sepsis each year; US hospitals spend 24 billion each year on sepsis (13% of the health budget); each hour of delay in treatment can increase mortality by roughly 4–8%. Source: https://www.mayoclinic.org/diseases-conditions/sepsis/symptoms-causes/syc-20351214
  • #7: The Challenge data repository contains one file per patient (e.g., training/p00101.psv ). Each training data file provides a table with measurements over time. Each column of the table provides a sequence of measurements over time (e.g., heart rate over several hours), where the header of the column describes the measurement. Each row of the table provides a collection of measurements at the same time (e.g., heart rate and oxygen level at the same time). Features: Vital Signs : Heart Rate, Temperature , Blood Pressure, Respiratory rate, End tidal carbon dioxide Laboratory Values : Platelet Count, Glucose , Calcium etc Demographics : Age, Gender, Time in ICU , Hospital Admit time Label : 0 (Non-sepsis) and 1 (Sepsis) Hence we can see that this is a Binary Classification problem
  • #8: I will explain the relevant features later
  • #9: The data for the problem is an hourly time-sequence record for each patient. But the records do not have a time label associated with them, which opens the scope for interpreting it as a non-temporal problem (ignoring the time component). There are two ways to approach this problem. Temporal approach: take the time component of the data into account; sepsis is diagnosed for each patient at each hour using the past data. Non-temporal approach: ignore the time component and treat the records as independently and identically distributed; this approach would help in predicting sepsis at each hour for any patient (with or without the patient's past data).
  • #10: Plan of action: the same temporal versus non-temporal framing as in note #9 applies here.
  • #13: Age has three categories. Child: age less than 10 years. Adult: age more than 10 years and less than 60 years. Senior: age more than 60 years.
  • #15: Non-temporal approach: we ignore the time component associated with each patient's hourly records and treat them as independently and identically distributed. Train-validation-test split: the data repository has data from two hospitals and a total of 40 thousand patients. The actual number of records is higher, as a patient could have stayed in the hospital for a variable amount of time. When splitting these records into train, validation and test sets, I made sure that each patient is fully contained in exactly one of the splits. Train: 30K patients. Test: 5K patients. Validation: 5K patients. Note: the script that divides the data into the train-validation-test split can be found at https://github.com/kskaran94/Sepsis_Identification. Exploratory data analysis: descriptive analysis of the train data highlighted one concern, extremely imbalanced data. As the bar plot shows, the records are extremely imbalanced (less than 1% vs. more than 99%), with the minority class being sepsis (1).
  • #17: Attribute selection measure (ASM): a heuristic for selecting the splitting criterion that partitions the data in the best possible manner. It is also known as a splitting rule because it helps determine the breakpoints for tuples at a given node. An ASM gives a rank to each feature (or attribute) by explaining the given dataset, and the attribute with the best score is selected as the splitting attribute. For a continuous-valued attribute, split points for the branches also need to be defined. The most popular selection measures are Information Gain, Gain Ratio and Gini Index. Entropy: the impurity in a group of examples. Information Gain: the difference between the entropy before the split and the average entropy after the split of the dataset on the given attribute values. Gain Ratio: an extension of information gain that handles its bias by normalizing the information gain using Split Info. Gini Index: considers a binary split for each attribute; the impurity of a split is the weighted sum of the impurity of each partition.
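The attribute selection measures named in the last note can be computed in a few lines. The sketch below implements entropy, the Gini index and information gain for a candidate split; the label lists are hypothetical and the function names are chosen for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Impurity of a group of examples, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of a group of examples."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def information_gain(parent, children):
    """Entropy before the split minus the weighted entropy after it."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical node with 4 sepsis and 4 non-sepsis rows, and one candidate split.
parent = [1, 1, 1, 1, 0, 0, 0, 0]
left, right = [1, 1, 1, 0], [1, 0, 0, 0]

print("entropy of parent:", entropy(parent))
print("gini of parent:", gini(parent))
print("information gain of split:", information_gain(parent, [left, right]))
```

A split that separates the two classes perfectly would have an information gain equal to the parent's entropy, so the tree prefers candidate splits with higher gain.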