Comparative Analysis of Different Methods for Hospital Length of Stay Prediction Using New York SPARCS Datasets
SUBMITTED BY:
PROJECT ADVISOR:
Dr. DEBOTOSH BHATTACHARJEE
A.ACKNOWLEDGMENT:
We would like to express our sincere gratitude to all the individuals and organizations who
contributed to the successful completion of this project.
First and foremost, we extend our appreciation to the healthcare institutions and professionals who
provided the data, insights, and domain expertise necessary for this analysis. Your collaboration was
invaluable in understanding the complex dynamics of healthcare costs and patient care.
We are deeply grateful to our mentors and advisors for their guidance and constructive feedback
throughout the project. Your expertise in machine learning, data analysis, and healthcare systems was
instrumental in shaping the methodology and interpretation of results.
We also acknowledge the contributions of our technical team for their efforts in preprocessing the
data, building robust models, and ensuring the interpretability of results using tools such as SHAP.
Your dedication and teamwork were crucial in achieving the project's objectives.
Finally, we thank the open-source community for providing access to the tools and libraries that
enabled advanced data analysis and visualization. Resources like Python, Scikit-learn, Matplotlib, and
SHAP greatly facilitated the development and presentation of our findings.
This project would not have been possible without the collective efforts of everyone involved, and we
are grateful for the opportunity to contribute to meaningful insights in healthcare analytics.
B.ABSTRACT:
The growing digitization of healthcare services has led to the generation of massive datasets that offer
immense potential for analysis and decision-making. This study leverages the New York SPARCS
(Statewide Planning and Research Cooperative System) hospital inpatient discharge dataset to
evaluate four distinct methodologies: BOAT Framework, K-Means Clustering, CatBoost Regressor,
and Random Forest Regressor. Each method is assessed for its suitability in addressing specific
objectives, such as exploratory analysis, clustering, anomaly detection, and predictive modeling.
The BOAT Framework facilitates trend identification and basic modeling by providing tools for
efficient data preprocessing, feature engineering, and visualization. K-Means Clustering enables
unsupervised learning, identifying patterns in patient data while effectively detecting outliers.
CatBoost Regressor and Random Forest Regressor serve as predictive modeling tools to estimate
hospital costs, offering insights into key cost-driving factors. These methods were evaluated based on
performance metrics such as R-squared (R²), Root Mean Squared Error (RMSE), and interpretability.
Results indicate that CatBoost achieves the highest predictive accuracy, benefiting from its advanced
handling of categorical variables. Random Forest provides a reliable alternative with moderate
computational demands and robust predictions. The BOAT Framework excels in exploratory tasks,
while K-Means Clustering proves effective in anomaly detection and data segmentation. This
comparative analysis demonstrates the potential of combining exploratory and predictive techniques
to enhance healthcare analytics. Future work should explore the integration of these methodologies
into hybrid frameworks to address diverse challenges in healthcare data analysis and decision-making.
C.INTRODUCTION:
The healthcare industry generates vast amounts of data daily, ranging from patient records to hospital
operational metrics. Harnessing this data effectively is critical for improving patient outcomes,
optimizing resource utilization, and controlling costs. The New York SPARCS (Statewide Planning
and Research Cooperative System) dataset, a comprehensive repository of inpatient discharge data,
presents an invaluable opportunity to analyze patterns, identify trends, and predict outcomes within
the healthcare domain.
This project explores the application of four distinct methodologies to derive meaningful insights
from the SPARCS dataset:
1. BOAT Framework: A tool designed for exploratory analysis and trend identification,
enabling researchers to uncover key patterns in the dataset.
2. K-Means Clustering: A technique for grouping data points based on feature similarity,
providing insights into patient groupings and identifying outliers.
3. CatBoost Regressor: A cutting-edge machine learning model optimized for handling
categorical data, employed here to predict hospital costs with high accuracy.
4. Random Forest Regressor: A robust ensemble learning approach that builds multiple
decision trees for reliable predictions of healthcare costs.
The study focuses on evaluating these methodologies based on their effectiveness in data processing,
pattern recognition, and predictive accuracy. By leveraging tools like SHAP (Shapley Additive
Explanations) for interpretability and advanced clustering techniques for anomaly detection, the
project aims to offer a holistic view of healthcare analytics.
The primary goals of this project are to:
• Analyze patterns in hospital admissions and costs.
• Identify outliers and anomalies that could indicate inefficiencies or unique cases.
• Develop predictive models for hospital costs to support decision-making.
• Compare methodologies to highlight their relative strengths and limitations.
Through this multi-method approach, the study provides actionable insights into cost management,
resource allocation, and operational optimization within hospitals, contributing to the broader goal of
data-driven healthcare transformation.
D.METHODS:
1. BOAT FRAMEWORK:
The BOAT (Big Data Open Analytics Tool) framework is used for exploratory analysis, trend
identification, and basic modeling in large datasets. Here, it is applied to analyze the New York
SPARCS hospital inpatient discharge data. The framework's implementation can be explained in the
following steps:
1. Data Ingestion:
This step involves loading large-scale datasets efficiently into a manageable format. The BOAT
framework utilizes optimized tools and techniques to ensure that the ingestion process handles
massive data volumes without performance degradation. This ensures that the dataset is ready for
analysis without compromising speed or integrity.
2. Data Preprocessing:
During preprocessing, missing values in the dataset are addressed using appropriate imputation
techniques to avoid analytical errors. The framework cleans text-based numeric fields by
standardizing them into usable formats, such as removing non-numeric characters or normalizing
inconsistent entries. This step is critical to ensure the data’s quality and consistency before further
analysis.
3. Feature Engineering:
In this phase, categorical variables are transformed into numerical representations using methods
like one-hot encoding. This transformation enables machine learning models to process
categorical data effectively. Additionally, numerical features are scaled to standardize their
ranges, ensuring better model performance and avoiding bias introduced by unscaled variables.
4. Analysis and Modeling:
The dataset is split into training and testing subsets to validate the model’s performance
effectively. The BOAT framework applies a simple linear regression model, which is used to
explore relationships between features and predict key outcomes, such as total costs. This
approach provides insights into cost patterns and helps identify significant predictors.
5. Visualization:
A histogram of “Type of Admission” shows the frequency of each admission type, helping identify
dominant categories. Key observations:
• Emergency admissions dominate the dataset with the highest count, approximately 1.4 million cases, indicating that most hospital admissions are unplanned and urgent in nature.
• Elective admissions come next, with a smaller but significant count. These admissions are planned and scheduled, likely for non-emergency procedures.
• Newborn admissions have a moderate count, representing cases where hospitalizations are related to childbirth.
• Urgent, Not Available, and Trauma categories have relatively low counts, indicating that these admission types are less common.
A boxplot of “Total Costs” by “APR Severity of Illness Description” highlights variations in costs
across severity levels and identifies potential outliers and trends in the cost distribution. Key
observations:
• Cost distribution: For all severity levels, most total costs are concentrated near the lower end, as seen from the dense cluster of data points close to the lower bound of the plot.
• Outliers: There are significant outliers across all severity levels, with some cases exceeding $3 million in total costs. This indicates that a small number of patients incur exceptionally high costs, possibly due to extended stays or complex treatments.
• Trend: The overall spread of costs increases slightly with severity, but the median values (the central line in each box) remain relatively low. This suggests that while more severe illnesses tend to incur higher costs, the majority of cases remain within a lower cost range.
• Cost variability: The high number of outliers in each category indicates substantial variability in treatment costs, even for the same severity level.
• Resource utilization: Extreme severity levels may be associated with greater resource consumption, as reflected in the higher cost outliers.
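The BOAT framework itself is not reproduced here; the following is a minimal Python sketch of the ingestion, preprocessing, feature-engineering, modeling, and visualization steps described above, using pandas, scikit-learn, and Matplotlib. The file name and the selected columns are illustrative placeholders based on the SPARCS schema, not the project's exact configuration.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# 1. Data ingestion (placeholder file name for the SPARCS inpatient discharge extract).
df = pd.read_csv("sparcs_inpatient_discharges.csv")

# 2. Preprocessing: clean text-based numeric fields (e.g. "$12,345.67" -> 12345.67)
#    and handle missing values with simple imputation.
df["Total Costs"] = pd.to_numeric(
    df["Total Costs"].astype(str).str.replace(r"[$,]", "", regex=True), errors="coerce")
df["Length of Stay"] = pd.to_numeric(df["Length of Stay"], errors="coerce")
df = df.dropna(subset=["Total Costs"])
df["Length of Stay"] = df["Length of Stay"].fillna(df["Length of Stay"].mean())

# 3. Feature engineering: one-hot encode categorical columns, scale numeric features.
numeric_features = ["Length of Stay"]
categorical_features = ["Type of Admission", "APR Severity of Illness Description"]
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_features),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])

# 4. Analysis and modeling: simple linear regression on a train/test split.
X = df[numeric_features + categorical_features]
y = df["Total Costs"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = Pipeline([("prep", preprocess), ("reg", LinearRegression())])
model.fit(X_train, y_train)
print("R^2 on held-out data:", model.score(X_test, y_test))

# 5. Visualization: frequency of each admission type.
df["Type of Admission"].value_counts().plot(kind="bar", title="Type of Admission")
plt.tight_layout()
plt.show()
```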
2. K-MEANS CLUSTERING:
Iterative K-Means Clustering is employed to analyze and detect outliers in hospital inpatient discharge
data. The process systematically identifies small clusters that may represent anomalous or outlier data
points. This iterative approach enhances the precision of outlier detection by refining clusters through
repeated processing. The key steps involved are described below:
1. k-Means Clustering:
The initial step partitions the dataset into a fixed number (“k”) of clusters. This is done by
assigning each data point to the nearest cluster center based on a distance metric (e.g.,
Euclidean distance). The cluster centers are updated iteratively to minimize intra-cluster
variance, ensuring that the points within a cluster are as similar as possible.
2. Small Cluster Identification:
After clustering, the sizes of all clusters are evaluated. Clusters with very few data points
(below a predefined threshold, “small_cluster_threshold”) are flagged as outliers. These small
clusters often represent anomalies or data points that deviate significantly from the main
dataset patterns.
3. Outlier Removal:
Data points belonging to the flagged small clusters are removed from the dataset. This step
ensures that anomalous data points do not influence subsequent clustering iterations, thereby
refining the cluster boundaries and improving overall accuracy.
4. Iteration:
The clustering process is repeated with the remaining dataset. This iterative refinement
continues for a set number of iterations (“max_iterations”) or until no small clusters are
detected. The iterative nature of the method ensures progressive improvement in identifying
and excluding anomalous data points.
This approach is particularly effective for handling datasets with a mix of dense and sparse regions, as
it dynamically adapts to varying data distributions and ensures the robustness of the main dataset by
systematically eliminating outliers.
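A minimal Python sketch of this iterative procedure is shown below, using scikit-learn's KMeans. The feature columns, k, small_cluster_threshold, and max_iterations are illustrative values rather than the exact settings used in the project.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def iterative_kmeans_outliers(df, feature_cols, k=5,
                              small_cluster_threshold=50, max_iterations=5):
    """Repeatedly cluster the data and strip points belonging to very small clusters."""
    data = df.copy()
    flagged_frames = []
    scaler = StandardScaler()

    for _ in range(max_iterations):
        features = scaler.fit_transform(data[feature_cols])
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
        data = data.assign(cluster=labels)

        sizes = data["cluster"].value_counts()
        small_clusters = sizes[sizes < small_cluster_threshold].index
        if len(small_clusters) == 0:      # stop early once no small clusters remain
            break

        flagged_frames.append(data[data["cluster"].isin(small_clusters)])
        data = data[~data["cluster"].isin(small_clusters)]

    outliers = pd.concat(flagged_frames) if flagged_frames else data.head(0)
    return data.drop(columns="cluster"), outliers

# Illustrative usage on cost and length-of-stay columns:
# clean_df, outlier_df = iterative_kmeans_outliers(df, ["Total Costs", "Length of Stay"])
```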
Visualization: A scatterplot is generated showing the clusters and outliers:
• Clusters are visualized with different colors, representing distinct patterns in costs and stay durations.
• Outliers are marked in red (or with an “x”) to indicate data points that deviate significantly from cluster norms.
Key observations include:
1. Cluster Distribution: The majority of data points are concentrated in clusters with
low total costs (below $50,000) and short lengths of stay (below 20 days). There is a
noticeable variation in cluster density, with some clusters being tightly packed (e.g.,
Cluster 0 and Cluster 4) and others more dispersed.
2. Outliers and Extreme Cases: Some points in Cluster 3 represent extreme cases, with
exceptionally high total costs (exceeding $250,000) and/or long lengths of stay (up to
100 days). These may indicate high-resource patients or unusual scenarios.
3. Trends in Costs and Stays: As total costs increase, the length of stay tends to vary
more widely, reflecting diversity in treatment complexity. Clusters demonstrate
distinct patterns, potentially indicating patient groups with similar resource
utilization.
4. No Isolated Outliers: All data points are part of the defined clusters, with no detected
outliers outside the groups.
3. CATBOOST REGRESSOR AND RANDOM FOREST:
This approach involves two separate models:
Random Forest Regressor: Random Forest is an ensemble learning method that creates multiple
decision trees during training and outputs the average prediction of the individual trees. It works by
randomly sampling data points and features to build each tree, reducing overfitting and improving
model robustness. The final prediction is the average of the predictions from all individual trees,
making it less prone to noise and outliers compared to a single decision tree.
CatBoost Regressor: CatBoost (Categorical Boosting) is a gradient boosting algorithm designed to
handle categorical features efficiently. It converts categorical variables into numerical representations
without requiring extensive preprocessing. CatBoost uses a combination of decision trees built in a
sequential manner, where each tree attempts to correct the errors of the previous one. It is known for
its speed, accuracy, and ability to work well with datasets containing categorical variables, while also
reducing overfitting and being less sensitive to parameter tuning.
Here, we build regression models (Random Forest and CatBoost) to predict hospital costs using the
New York SPARCS dataset. The workflow involves the following steps:
1. Data Loading and Cleaning:
The data is first loaded from a CSV file. The 'Total Costs' column is cleaned by removing
non-numeric characters, such as "$" and ",", to convert the values into a usable numeric
format. Missing values are imputed using the mean of the respective column to ensure no data
loss. Outliers with costs exceeding $200,000 are removed to improve model reliability.
Irrelevant columns are dropped, and rows with missing values in key columns are also
excluded, resulting in a clean and consistent dataset for analysis.
2. Feature Engineering:
To prepare the data for modeling, categorical features are encoded using target encoding. This
method replaces each category with the mean of the target variable ('Total Costs') for that
category, capturing the relationship between the category and the target variable. The target
variable 'Total Costs' is transformed using a logarithmic function to reduce skewness,
ensuring a more uniform distribution that enhances model performance.
3. Model Training:
The dataset is split into training and testing subsets to validate model performance. Two
regression models—Random Forest and CatBoost Regressor—are trained. Random Forest
uses an ensemble of decision trees, aggregating their predictions to improve accuracy and
robustness. CatBoost builds sequential decision trees, where each tree corrects errors made by
the previous ones. Both models are evaluated using R-squared (R²) to measure explained
variance and Root Mean Squared Error (RMSE) to assess prediction accuracy.
4. Feature Importance Analysis:
To interpret the contributions of various features to the predictions, SHAP (Shapley Additive
Explanations) values are calculated for the CatBoost model. This analysis quantifies the
impact of each feature on the model’s output. A SHAP summary plot is generated to visualize
the importance and influence of features, identifying key predictors such as 'Length of Stay'
and 'Facility Name' that significantly affect total costs.
5. Visualization: a SHAP summary plot and a scatter plot comparing model predictions to actual
values are created.
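The following is a minimal Python sketch of the cleaning, target encoding, log transform, and model training steps described above, using pandas, scikit-learn, and the catboost package. The file name, feature list, hyperparameters, and the log1p transform are illustrative choices rather than the project's precise configuration.

```python
import numpy as np
import pandas as pd
from catboost import CatBoostRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Placeholder file name; illustrative subset of SPARCS columns.
df = pd.read_csv("sparcs_inpatient_discharges.csv")
df["Total Costs"] = pd.to_numeric(
    df["Total Costs"].astype(str).str.replace(r"[$,]", "", regex=True), errors="coerce")
df["Length of Stay"] = pd.to_numeric(df["Length of Stay"], errors="coerce")

categorical = ["Facility Name", "Type of Admission", "CCS Procedure Description",
               "APR Medical Surgical Description", "Age Group"]
numeric = ["Length of Stay"]

df = df.dropna(subset=categorical + numeric + ["Total Costs"])
df = df[df["Total Costs"] <= 200_000]            # drop extreme cost outliers, as in step 1

y = np.log1p(df["Total Costs"])                  # log transform to reduce skewness
X = df[categorical + numeric].copy()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Target encoding: replace each category with the mean log-cost observed in the training set.
for col in categorical:
    means = y_train.groupby(X_train[col]).mean()
    X_train[col] = X_train[col].map(means)
    X_test[col] = X_test[col].map(means).fillna(y_train.mean())

rf_model = RandomForestRegressor(n_estimators=200, n_jobs=-1, random_state=42)
cat_model = CatBoostRegressor(verbose=0, random_state=42)

for name, model in [("Random Forest", rf_model), ("CatBoost", cat_model)]:
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse = mean_squared_error(y_test, pred) ** 0.5
    print(f"{name}: R2 = {r2_score(y_test, pred):.4f}, RMSE = {rmse:.4f} (log scale)")
```

Note that computing the target-encoding means on the training split only, then mapping them onto the test split, avoids leaking test-set information into the features.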
Key observations from the SHAP summary plot:
1. Top Contributing Features: Length of Stay has the highest impact on the prediction of total costs.
Longer stays tend to increase total costs. Features like Facility Name, CCS Procedure
Description, and APR Medical Surgical Description are also significant predictors.
2. Feature Impact: High values for certain features (e.g., Length of Stay) are positively correlated
with higher costs (red points on the right). Some features, such as Age Group, have less influence
on the model's predictions (clustered near SHAP value = 0).
3. Variability: The spread of SHAP values indicates how much each feature affects individual
predictions. Features like Length of Stay and Facility Name show wide variability, meaning they
have a diverse impact depending on the case.
SHAP Summary Plot
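Continuing from the training sketch above (which provides the fitted cat_model and the held-out X_test), the SHAP analysis can be reproduced roughly as follows using the shap package; the resulting figure will differ in detail from the report's plot.

```python
import shap

# cat_model (fitted CatBoostRegressor) and X_test come from the training sketch above.
explainer = shap.TreeExplainer(cat_model)
shap_values = explainer.shap_values(X_test)

# Summary plot: per-feature importance and direction of effect across the test set.
shap.summary_plot(shap_values, X_test)
```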
Key observations from the model predictions vs. actual plot:
1. Strong Correlation: The scatter plot shows a strong alignment between predicted and
actual total costs, indicating high model accuracy.
2. Minimal Deviation: Most points closely align with the diagonal line, suggesting low
prediction errors.
3. Outliers: Few points deviate slightly from the line, indicating rare instances of higher
prediction errors.
Model Predictions vs. Actual Plot
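The predictions-vs-actual comparison can be drawn from the same objects; in this sketch the predictions are back-transformed from the log scale to the dollar scale for readability, which may differ from the exact scale used in the report's figure.

```python
import matplotlib.pyplot as plt
import numpy as np

# cat_model, X_test, and y_test come from the training sketch above (log-scale target).
pred = cat_model.predict(X_test)
actual, predicted = np.expm1(y_test), np.expm1(pred)   # back to dollar scale

plt.scatter(actual, predicted, s=2, alpha=0.3)
lims = [0, max(actual.max(), predicted.max())]
plt.plot(lims, lims, color="red")                      # diagonal = perfect prediction
plt.xlabel("Actual Total Costs")
plt.ylabel("Predicted Total Costs")
plt.title("Model Predictions vs. Actual")
plt.show()
```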
E.RESULTS AND COMPARISON:
The performance of each method is summarized below:

CatBoost Regressor: RMSE 0.34738, R² 0.86951
Purpose: Predicting healthcare costs using machine learning.
Strengths: Handles categorical variables efficiently; reduces overfitting.
Limitations: Computationally intensive; requires careful hyperparameter tuning.

Random Forest: RMSE 0.37393, R² 0.84881
Purpose: Predicting healthcare costs with interpretable ensemble methods.
Strengths: Robust to overfitting; works well with mixed data types.
Limitations: Less accurate than CatBoost; high memory usage for large datasets.

BOAT (Big Data Open Source Analytics Tool): RMSE 19977.65, R² 0.48456
Purpose: Exploratory analysis and trend identification in SPARCS data (e.g., hip replacement costs, mental health trends).
Strengths: Open-source and accessible; enables broad exploration of datasets.
Limitations: Not optimized for direct predictive modeling.

K-Means Clustering: RMSE 3520.8396, R² 0.9356
Purpose: Data preprocessing and grouping based on feature similarity.
Strengths: Helps in identifying patterns or group-specific trends; useful for unsupervised learning tasks.
Limitations: Limited application to supervised prediction tasks.
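For reference, RMSE and R² follow their standard definitions, given below in general form (not specific to this report). Note also that, since the CatBoost and Random Forest models were trained on a log-transformed cost target as described in the Methods section, their RMSE values are on the log scale and are not directly comparable with the RMSE values reported for BOAT and K-Means.

```latex
% Standard definitions, where y_i are observed values, \hat{y}_i model predictions,
% and \bar{y} the mean of the observed values over n test cases:
\[
\mathrm{RMSE} = \sqrt{\tfrac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)^2},
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)^2}{\sum_{i=1}^{n}\bigl(y_i - \bar{y}\bigr)^2}.
\]
```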
F.DISCUSSION:
This study provides a comparative evaluation of four methodologies applied to the New York
SPARCS hospital inpatient discharge dataset, offering insights into their respective capabilities and
limitations in healthcare analytics. Each method—BOAT Framework, K-Means Clustering, CatBoost
Regressor, and Random Forest Regressor—addresses distinct analytical needs, ranging from
exploratory analysis to predictive modeling. The discussion focuses on the practical implications,
performance outcomes, and integration potential of these methodologies.
The BOAT Framework proved effective for exploratory data analysis and trend identification. By
leveraging efficient data ingestion, preprocessing, and visualization tools, the framework uncovered
significant patterns, such as the dominance of emergency admissions and the presence of cost outliers
exceeding $3 million. While valuable for initial investigations, the BOAT Framework lacks advanced
predictive capabilities, limiting its application to tasks requiring more complex modeling.
K-Means Clustering excelled in grouping data points based on feature similarity and detecting
outliers. Its iterative approach ensured progressive refinement, systematically identifying anomalous
clusters. The visualization of clusters highlighted variations in cost and stay durations, offering
actionable insights into patient groupings. However, the method’s reliance on predefined cluster
numbers (“k”) and its unsuitability for supervised prediction tasks constrain its broader applicability
in predictive analytics.
CatBoost demonstrated exceptional performance in predictive modeling, achieving the highest R-
squared and lowest RMSE values among the evaluated methods. Its ability to handle categorical
features efficiently without extensive preprocessing significantly streamlined the modeling process.
The use of SHAP values provided interpretability, identifying key features such as 'Length of Stay'
and 'Facility Name' as major cost predictors. Despite its strengths, CatBoost’s computational intensity
and sensitivity to hyperparameter tuning present challenges for large-scale applications.
Random Forest offered a robust and interpretable alternative for predictive modeling. Its ensemble
approach mitigated overfitting and ensured reliable predictions, particularly for mixed data types.
Although slightly less accurate than CatBoost, Random Forest required fewer computational
resources, making it a practical choice for scenarios with limited processing power. Its limitations
include higher memory usage and reduced precision compared to more advanced methods like
CatBoost.
The findings suggest that no single method universally outperforms others across all analytical
objectives. Instead, their strengths can be synergized in hybrid approaches. For instance, combining
the exploratory capabilities of the BOAT Framework with the predictive accuracy of CatBoost or
Random Forest could provide comprehensive insights. Similarly, K-Means Clustering could serve as
a preprocessing step to detect and exclude outliers, enhancing the reliability of predictive models.
Future research should focus on the following areas:
1. Developing hybrid frameworks that integrate exploratory, clustering, and predictive methods.
2. Extending these methodologies to other healthcare datasets for broader validation.
3. Enhancing computational efficiency, particularly for resource-intensive models like CatBoost.
4. Incorporating domain knowledge into model development to improve interpretability and
relevance.
By leveraging the complementary strengths of these methodologies, healthcare analytics can achieve a
balance between exploratory insights and predictive precision, ultimately contributing to better
resource management, cost control, and patient care outcomes.
G.CONCLUSION:
This study provides a comprehensive comparison of four methodologies—BOAT Framework, K-
Means Clustering, CatBoost Regressor, and Random Forest Regressor—applied to the New York
SPARCS dataset. Each method demonstrated unique strengths tailored to specific objectives in
healthcare analytics. The BOAT Framework excelled in exploratory data analysis, offering valuable
insights into trends and patterns. K-Means Clustering effectively identified outliers and grouped data
based on feature similarities, enhancing the understanding of patient distributions.
CatBoost Regressor emerged as the most accurate predictive model, leveraging its ability to handle
categorical data with minimal preprocessing. Random Forest, while slightly less accurate, proved
robust and interpretable, making it a reliable alternative for healthcare cost prediction. The findings
emphasize the importance of selecting methods based on the specific requirements of the analysis,
whether for exploration, anomaly detection, or predictive accuracy.
The study's results highlight the potential for hybrid approaches that combine exploratory and
predictive capabilities. For instance, integrating clustering methods to preprocess data before applying
machine learning models could enhance performance and interpretability. Future research should
explore such integrated frameworks and validate these methods across diverse healthcare datasets,
aiming to optimize resource allocation and improve patient outcomes.
H.REFERENCES:
This project drew inspiration and methodology from several key research papers, including:
1. Hiding in Plain Sight: Insights About Health-Care Trends Gained Through Open Health
Data by A. Ravishankar Rao and Daniel Clarke. This study highlights the use of open health
data for analyzing trends and creating analytical tools for healthcare insights.
2. A System for Exploring Big Data: An Iterative K-Means Searchlight for Outlier Detection
on Open Health Data by A. Ravishankar Rao et al., which discusses advanced clustering
techniques applied to open healthcare datasets, including the SPARCS dataset, for outlier
detection.
3. Predictive Interpretable Analytics Models for Forecasting Healthcare Costs Using Open
Healthcare Data by A. Ravishankar Rao et al., which focuses on building predictive models
for healthcare costs using machine learning and interpretable analytics.
4. Predicting Hospital Length of Stay Using Machine Learning on a Large Open Health
Dataset by Raunak Jain, Mrityunjai Singh, A. Ravishankar Rao, and Rahul Garg.
This paper applies machine learning techniques to predict the length of hospital stays
from a variety of features in a large open health dataset, including data from the
SPARCS system.
These works provided a foundation for our analysis of the New York SPARCS dataset and the
methodologies employed in this project.