Comparative Analysis of Different Methods for Hospital Length of Stay Prediction Using New York SPARCS Datasets
SUBMITTED BY:
PROJECT ADVISOR:
Dr. DEBOTOSH BHATTACHARJEE
A.ACKNOWLEDGMENT:
We would like to express our sincere gratitude to all the individuals and organizations who
contributed to the successful completion of this project.
First and foremost, we extend our appreciation to the healthcare institutions and professionals who
provided the data, insights, and domain expertise necessary for this analysis. Your collaboration was
invaluable in understanding the complex dynamics of healthcare costs and patient care.
We are deeply grateful to our mentors and advisors for their guidance and constructive feedback
throughout the project. Your expertise in machine learning, data analysis, and healthcare systems was
instrumental in shaping the methodology and interpretation of results.
We also acknowledge the contributions of our technical team for their efforts in preprocessing the
data, building robust models, and ensuring the interpretability of results using tools such as SHAP.
Your dedication and teamwork were crucial in achieving the project's objectives.
Finally, we thank the open-source community for providing access to the tools and libraries that
enabled advanced data analysis and visualization. Resources like Python, Scikit-learn, Matplotlib, and
SHAP greatly facilitated the development and presentation of our findings.
This project would not have been possible without the collective efforts of everyone involved, and we
are grateful for the opportunity to contribute to meaningful insights in healthcare analytics.
B.ABSTRACT:
The growing digitization of healthcare services has led to the generation of massive datasets that offer
immense potential for analysis and decision-making. This study leverages the New York SPARCS
(Statewide Planning and Research Cooperative System) hospital inpatient discharge dataset to
evaluate four distinct methodologies: BOAT Framework, K-Means Clustering, CatBoost Regressor,
and Random Forest Regressor. Each method is assessed for its suitability in addressing specific
objectives, such as exploratory analysis, clustering, anomaly detection, and predictive modeling.
The BOAT Framework facilitates trend identification and basic modeling by providing tools for
efficient data preprocessing, feature engineering, and visualization. K-Means Clustering enables
unsupervised learning, identifying patterns in patient data while effectively detecting outliers.
CatBoost Regressor and Random Forest Regressor serve as predictive modeling tools to estimate
hospital costs, offering insights into key cost-driving factors. These methods were evaluated based on
performance metrics such as R-squared (R²), Root Mean Squared Error (RMSE), and interpretability.
Results indicate that CatBoost achieves the highest predictive accuracy, benefiting from its advanced
handling of categorical variables. Random Forest provides a reliable alternative with moderate
computational demands and robust predictions. The BOAT Framework excels in exploratory tasks,
while K-Means Clustering proves effective in anomaly detection and data segmentation. This
comparative analysis demonstrates the potential of combining exploratory and predictive techniques
to enhance healthcare analytics. Future work should explore the integration of these methodologies
into hybrid frameworks to address diverse challenges in healthcare data analysis and decision-making.
C.INTRODUCTION:
The healthcare industry generates vast amounts of data daily, ranging from patient records to hospital
operational metrics. Harnessing this data effectively is critical for improving patient outcomes,
optimizing resource utilization, and controlling costs. The New York SPARCS (Statewide Planning
and Research Cooperative System) dataset, a comprehensive repository of inpatient discharge data,
presents an invaluable opportunity to analyze patterns, identify trends, and predict outcomes within
the healthcare domain.
This project explores the application of four distinct methodologies to derive meaningful insights
from the SPARCS dataset:
1. BOAT Framework: A tool designed for exploratory analysis and trend identification,
enabling researchers to uncover key patterns in the dataset.
2. K-Means Clustering: A technique for grouping data points based on feature similarity,
providing insights into patient groupings and identifying outliers.
3. CatBoost Regressor: A cutting-edge machine learning model optimized for handling
categorical data, employed here to predict hospital costs with high accuracy.
4. Random Forest Regressor: A robust ensemble learning approach that builds multiple
decision trees for reliable predictions of healthcare costs.
The study focuses on evaluating these methodologies based on their effectiveness in data processing,
pattern recognition, and predictive accuracy. By leveraging tools like SHAP (Shapley Additive
Explanations) for interpretability and advanced clustering techniques for anomaly detection, the
project aims to offer a holistic view of healthcare analytics.
The primary goals of this project are to:
• Analyze patterns in hospital admissions and costs.
• Identify outliers and anomalies that could indicate inefficiencies or unique cases.
• Develop predictive models for hospital costs to support decision-making.
• Compare methodologies to highlight their relative strengths and limitations.
Through this multi-method approach, the study provides actionable insights into cost management,
resource allocation, and operational optimization within hospitals, contributing to the broader goal of
data-driven healthcare transformation.
D.METHODS:
1. BOAT FRAMEWORK:
The BOAT (Big Data Open Analytics Tool) framework is used for exploratory analysis, trend
identification, and basic modeling in large datasets. Here, it is applied to analyze the New York
SPARCS hospital inpatient discharge data. The framework's implementation can be explained in the
following steps:
1. Data Ingestion:
This step involves loading large-scale datasets efficiently into a manageable format. The BOAT
framework utilizes optimized tools and techniques to ensure that the ingestion process handles
massive data volumes without performance degradation. This ensures that the dataset is ready for
analysis without compromising speed or integrity.
2. Data Preprocessing:
During preprocessing, missing values in the dataset are addressed using appropriate imputation
techniques to avoid analytical errors. The framework cleans text-based numeric fields by
standardizing them into usable formats, such as removing non-numeric characters or normalizing
inconsistent entries. This step is critical to ensure the data’s quality and consistency before further
analysis.
3. Feature Engineering:
In this phase, categorical variables are transformed into numerical representations using methods
like one-hot encoding. This transformation enables machine learning models to process
categorical data effectively. Additionally, numerical features are scaled to standardize their
ranges, ensuring better model performance and avoiding bias introduced by unscaled variables.
4. Analysis and Modeling:
The dataset is split into training and testing subsets to validate the model’s performance
effectively. The BOAT framework applies a simple linear regression model, which is used to
explore relationships between features and predict key outcomes, such as total costs. This
approach provides insights into cost patterns and helps identify significant predictors.
5. Visualization:
A histogram of “Type of Admission” shows the frequency of each admission type, helping identify
dominant categories. Key observations:
• Emergency admissions dominate the dataset with the highest count, approximately 1.4 million cases, indicating that most hospital admissions are unplanned and urgent in nature.
• Elective admissions come next, with a smaller but significant count. These admissions are planned and scheduled, likely for non-emergency procedures.
• Newborn admissions have a moderate count, representing cases where hospitalizations are related to childbirth.
• Urgent, Not Available, and Trauma categories have relatively low counts, indicating that these admission types are less common.
A boxplot of “Total Costs” by “APR Severity of Illness Description” highlights variations in costs
across severity levels and identifies potential outliers and trends in the cost distribution. Key
observations:
• Cost distribution: For all severity levels, most total costs are concentrated near the lower end, as seen from the dense cluster of data points close to the lower bound of the plot.
• Outliers: There are significant outliers across all severity levels, with some cases exceeding $3 million in total costs. This indicates that a small number of patients incur exceptionally high costs, possibly due to extended stays or complex treatments.
• Trend: The overall spread of costs increases slightly with severity, but the median values (the central line in each box) remain relatively low. This suggests that while more severe illnesses tend to incur higher costs, the majority of cases remain within a lower cost range.
• Cost variability: The high number of outliers in each category indicates substantial variability in treatment costs, even for the same severity level.
• Resource utilization: Extreme severity levels may be associated with greater resource consumption, as reflected in the higher cost outliers.
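The BOAT framework itself is not reproduced here; the following is a minimal Python sketch of the ingestion, preprocessing, feature-engineering, modeling, and visualization steps described above, using pandas, scikit-learn, and Matplotlib. The file name and the selected columns are illustrative placeholders based on the SPARCS schema, not the project's exact configuration.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# 1. Data ingestion (placeholder file name for the SPARCS inpatient discharge extract).
df = pd.read_csv("sparcs_inpatient_discharges.csv")

# 2. Preprocessing: clean text-based numeric fields (e.g. "$12,345.67" -> 12345.67)
#    and handle missing values with simple imputation.
df["Total Costs"] = pd.to_numeric(
    df["Total Costs"].astype(str).str.replace(r"[$,]", "", regex=True), errors="coerce")
df["Length of Stay"] = pd.to_numeric(df["Length of Stay"], errors="coerce")
df = df.dropna(subset=["Total Costs"])
df["Length of Stay"] = df["Length of Stay"].fillna(df["Length of Stay"].mean())

# 3. Feature engineering: one-hot encode categorical columns, scale numeric features.
numeric_features = ["Length of Stay"]
categorical_features = ["Type of Admission", "APR Severity of Illness Description"]
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_features),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])

# 4. Analysis and modeling: simple linear regression on a train/test split.
X = df[numeric_features + categorical_features]
y = df["Total Costs"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = Pipeline([("prep", preprocess), ("reg", LinearRegression())])
model.fit(X_train, y_train)
print("R^2 on held-out data:", model.score(X_test, y_test))

# 5. Visualization: frequency of each admission type.
df["Type of Admission"].value_counts().plot(kind="bar", title="Type of Admission")
plt.tight_layout()
plt.show()
```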
2. K-MEANS CLUSTERING:
Iterative K-Means Clustering is employed to analyze and detect outliers in hospital inpatient discharge
data. The process systematically identifies small clusters that may represent anomalous or outlier data
points. This iterative approach enhances the precision of outlier detection by refining clusters through
repeated processing. The key steps involved are described below:
1. k-Means Clustering:
The initial step partitions the dataset into a fixed number (“k”) of clusters. This is done by
assigning each data point to the nearest cluster center based on a distance metric (e.g.,
Euclidean distance). The cluster centers are updated iteratively to minimize intra-cluster
variance, ensuring that the points within a cluster are as similar as possible.
2. Small Cluster Identification:
After clustering, the sizes of all clusters are evaluated. Clusters with very few data points
(below a predefined threshold, “small_cluster_threshold”) are flagged as outliers. These small
clusters often represent anomalies or data points that deviate significantly from the main
dataset patterns.
3. Outlier Removal:
Data points belonging to the flagged small clusters are removed from the dataset. This step
ensures that anomalous data points do not influence subsequent clustering iterations, thereby
refining the cluster boundaries and improving overall accuracy.
4. Iteration:
The clustering process is repeated with the remaining dataset. This iterative refinement
continues for a set number of iterations (“max_iterations”) or until no small clusters are
detected. The iterative nature of the method ensures progressive improvement in identifying
and excluding anomalous data points.
This approach is particularly effective for handling datasets with a mix of dense and sparse regions, as
it dynamically adapts to varying data distributions and ensures the robustness of the main dataset by
systematically eliminating outliers.
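A minimal Python sketch of this iterative procedure is shown below, using scikit-learn's KMeans. The feature columns, k, small_cluster_threshold, and max_iterations are illustrative values rather than the exact settings used in the project.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def iterative_kmeans_outliers(df, feature_cols, k=5,
                              small_cluster_threshold=50, max_iterations=5):
    """Repeatedly cluster the data and strip points belonging to very small clusters."""
    data = df.copy()
    flagged_frames = []
    scaler = StandardScaler()

    for _ in range(max_iterations):
        features = scaler.fit_transform(data[feature_cols])
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
        data = data.assign(cluster=labels)

        sizes = data["cluster"].value_counts()
        small_clusters = sizes[sizes < small_cluster_threshold].index
        if len(small_clusters) == 0:      # stop early once no small clusters remain
            break

        flagged_frames.append(data[data["cluster"].isin(small_clusters)])
        data = data[~data["cluster"].isin(small_clusters)]

    outliers = pd.concat(flagged_frames) if flagged_frames else data.head(0)
    return data.drop(columns="cluster"), outliers

# Illustrative usage on cost and length-of-stay columns:
# clean_df, outlier_df = iterative_kmeans_outliers(df, ["Total Costs", "Length of Stay"])
```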
Visualization: A scatterplot is generated showing the clusters and outliers:
• Clusters are visualized with different colors, representing distinct patterns in costs and stay durations.
• Outliers are marked in red (or with an “x”) to indicate data points that deviate significantly from cluster norms.
Key observations include:
1. Cluster Distribution: The majority of data points are concentrated in clusters with
low total costs (below $50,000) and short lengths of stay (below 20 days). There is a
noticeable variation in cluster density, with some clusters being tightly packed (e.g.,
Cluster 0 and Cluster 4) and others more dispersed.
2. Outliers and Extreme Cases: Some points in Cluster 3 represent extreme cases, with
exceptionally high total costs (exceeding $250,000) and/or long lengths of stay (up to
100 days). These may indicate high-resource patients or unusual scenarios.
3. Trends in Costs and Stays: As total costs increase, the length of stay tends to vary
more widely, reflecting diversity in treatment complexity. Clusters demonstrate
distinct patterns, potentially indicating patient groups with similar resource
utilization.
4. No Isolated Outliers: All data points are part of the defined clusters, with no detected
outliers outside the groups.
3. CATBOOST REGRESSOR AND RANDOM FOREST:
This approach involves two separate models:
Random Forest Regressor: Random Forest is an ensemble learning method that creates multiple
decision trees during training and outputs the average prediction of the individual trees. It works by
randomly sampling data points and features to build each tree, reducing overfitting and improving
model robustness. The final prediction is the average of the predictions from all individual trees,
making it less prone to noise and outliers compared to a single decision tree.
CatBoost Regressor: CatBoost (Categorical Boosting) is a gradient boosting algorithm designed to
handle categorical features efficiently. It converts categorical variables into numerical representations
without requiring extensive preprocessing. CatBoost uses a combination of decision trees built in a
sequential manner, where each tree attempts to correct the errors of the previous one. It is known for
its speed, accuracy, and ability to work well with datasets containing categorical variables, while also
reducing overfitting and being less sensitive to parameter tuning.
Here, we build regression models (Random Forest and CatBoost) to predict hospital costs using the
New York SPARCS dataset. The workflow involves the following steps:
1. Data Loading and Cleaning:
The data is first loaded from a CSV file. The 'Total Costs' column is cleaned by removing
non-numeric characters, such as "$" and ",", to convert the values into a usable numeric
format. Missing values are imputed using the mean of the respective column to ensure no data
loss. Outliers with costs exceeding $200,000 are removed to improve model reliability.
Irrelevant columns are dropped, and rows with missing values in key columns are also
excluded, resulting in a clean and consistent dataset for analysis.
2. Feature Engineering:
To prepare the data for modeling, categorical features are encoded using target encoding. This
method replaces each category with the mean of the target variable ('Total Costs') for that
category, capturing the relationship between the category and the target variable. The target
variable 'Total Costs' is transformed using a logarithmic function to reduce skewness,
ensuring a more uniform distribution that enhances model performance.
3. Model Training:
The dataset is split into training and testing subsets to validate model performance. Two
regression models—Random Forest and CatBoost Regressor—are trained. Random Forest
uses an ensemble of decision trees, aggregating their predictions to improve accuracy and
robustness. CatBoost builds sequential decision trees, where each tree corrects errors made by
the previous ones. Both models are evaluated using R-squared (R²) to measure explained
variance and Root Mean Squared Error (RMSE) to assess prediction accuracy.
4. Feature Importance Analysis:
To interpret the contributions of various features to the predictions, SHAP (Shapley Additive
Explanations) values are calculated for the CatBoost model. This analysis quantifies the
impact of each feature on the model’s output. A SHAP summary plot is generated to visualize
the importance and influence of features, identifying key predictors such as 'Length of Stay'
and 'Facility Name' that significantly affect total costs.
5. Visualization: a SHAP summary plot and a scatter plot comparing model predictions to actual
values are created.
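The following is a minimal Python sketch of the cleaning, target encoding, log transform, and model training steps described above, using pandas, scikit-learn, and the catboost package. The file name, feature list, hyperparameters, and the log1p transform are illustrative choices rather than the project's precise configuration.

```python
import numpy as np
import pandas as pd
from catboost import CatBoostRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Placeholder file name; illustrative subset of SPARCS columns.
df = pd.read_csv("sparcs_inpatient_discharges.csv")
df["Total Costs"] = pd.to_numeric(
    df["Total Costs"].astype(str).str.replace(r"[$,]", "", regex=True), errors="coerce")
df["Length of Stay"] = pd.to_numeric(df["Length of Stay"], errors="coerce")

categorical = ["Facility Name", "Type of Admission", "CCS Procedure Description",
               "APR Medical Surgical Description", "Age Group"]
numeric = ["Length of Stay"]

df = df.dropna(subset=categorical + numeric + ["Total Costs"])
df = df[df["Total Costs"] <= 200_000]            # drop extreme cost outliers, as in step 1

y = np.log1p(df["Total Costs"])                  # log transform to reduce skewness
X = df[categorical + numeric].copy()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Target encoding: replace each category with the mean log-cost observed in the training set.
for col in categorical:
    means = y_train.groupby(X_train[col]).mean()
    X_train[col] = X_train[col].map(means)
    X_test[col] = X_test[col].map(means).fillna(y_train.mean())

rf_model = RandomForestRegressor(n_estimators=200, n_jobs=-1, random_state=42)
cat_model = CatBoostRegressor(verbose=0, random_state=42)

for name, model in [("Random Forest", rf_model), ("CatBoost", cat_model)]:
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse = mean_squared_error(y_test, pred) ** 0.5
    print(f"{name}: R2 = {r2_score(y_test, pred):.4f}, RMSE = {rmse:.4f} (log scale)")
```

Note that computing the target-encoding means on the training split only, then mapping them onto the test split, avoids leaking test-set information into the features.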
Key observations from the SHAP summary plot:
1. Top Contributing Features: Length of Stay has the highest impact on the prediction of total costs.
Longer stays tend to increase total costs. Features like Facility Name, CCS Procedure
Description, and APR Medical Surgical Description are also significant predictors.
2. Feature Impact: High values for certain features (e.g., Length of Stay) are positively correlated
with higher costs (red points on the right). Some features, such as Age Group, have less influence
on the model's predictions (clustered near SHAP value = 0).
3. Variability: The spread of SHAP values indicates how much each feature affects individual
predictions. Features like Length of Stay and Facility Name show wide variability, meaning they
have a diverse impact depending on the case.
SHAP Summary Plot
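Continuing from the training sketch above (which provides the fitted cat_model and the held-out X_test), the SHAP analysis can be reproduced roughly as follows using the shap package; the resulting figure will differ in detail from the report's plot.

```python
import shap

# cat_model (fitted CatBoostRegressor) and X_test come from the training sketch above.
explainer = shap.TreeExplainer(cat_model)
shap_values = explainer.shap_values(X_test)

# Summary plot: per-feature importance and direction of effect across the test set.
shap.summary_plot(shap_values, X_test)
```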
Key observations from the model predictions vs. actual plot:
1. Strong Correlation: The scatter plot shows a strong alignment between predicted and
actual total costs, indicating high model accuracy.
2. Minimal Deviation: Most points closely align with the diagonal line, suggesting low
prediction errors.
3. Outliers: Few points deviate slightly from the line, indicating rare instances of higher
prediction errors.
Model Predictions vs. Actual Plot
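The predictions-vs-actual comparison can be drawn from the same objects; in this sketch the predictions are back-transformed from the log scale to the dollar scale for readability, which may differ from the exact scale used in the report's figure.

```python
import matplotlib.pyplot as plt
import numpy as np

# cat_model, X_test, and y_test come from the training sketch above (log-scale target).
pred = cat_model.predict(X_test)
actual, predicted = np.expm1(y_test), np.expm1(pred)   # back to dollar scale

plt.scatter(actual, predicted, s=2, alpha=0.3)
lims = [0, max(actual.max(), predicted.max())]
plt.plot(lims, lims, color="red")                      # diagonal = perfect prediction
plt.xlabel("Actual Total Costs")
plt.ylabel("Predicted Total Costs")
plt.title("Model Predictions vs. Actual")
plt.show()
```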
E.RESULTS AND COMPARISON:
The performance of each method is summarized below:

CatBoost Regressor: RMSE 0.34738, R² 0.86951
Purpose: Predicting healthcare costs using machine learning.
Strengths: Handles categorical variables efficiently; reduces overfitting.
Limitations: Computationally intensive; requires careful hyperparameter tuning.

Random Forest: RMSE 0.37393, R² 0.84881
Purpose: Predicting healthcare costs with interpretable ensemble methods.
Strengths: Robust to overfitting; works well with mixed data types.
Limitations: Less accurate than CatBoost; high memory usage for large datasets.

BOAT (Big Data Open Source Analytics Tool): RMSE 19977.65, R² 0.48456
Purpose: Exploratory analysis and trend identification in SPARCS data (e.g., hip replacement costs, mental health trends).
Strengths: Open-source and accessible; enables broad exploration of datasets.
Limitations: Not optimized for direct predictive modeling.

K-Means Clustering: RMSE 3520.8396, R² 0.9356
Purpose: Data preprocessing and grouping based on feature similarity.
Strengths: Helps in identifying patterns or group-specific trends; useful for unsupervised learning tasks.
Limitations: Limited application to supervised prediction tasks.
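For reference, RMSE and R² follow their standard definitions, given below in general form (not specific to this report). Note also that, since the CatBoost and Random Forest models were trained on a log-transformed cost target as described in the Methods section, their RMSE values are on the log scale and are not directly comparable with the RMSE values reported for BOAT and K-Means.

```latex
% Standard definitions, where y_i are observed values, \hat{y}_i model predictions,
% and \bar{y} the mean of the observed values over n test cases:
\[
\mathrm{RMSE} = \sqrt{\tfrac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)^2},
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)^2}{\sum_{i=1}^{n}\bigl(y_i - \bar{y}\bigr)^2}.
\]
```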
F.DISCUSSION:
This study provides a comparative evaluation of four methodologies applied to the New York
SPARCS hospital inpatient discharge dataset, offering insights into their respective capabilities and
limitations in healthcare analytics. Each method—BOAT Framework, K-Means Clustering, CatBoost
Regressor, and Random Forest Regressor—addresses distinct analytical needs, ranging from
exploratory analysis to predictive modeling. The discussion focuses on the practical implications,
performance outcomes, and integration potential of these methodologies.
The BOAT Framework proved effective for exploratory data analysis and trend identification. By
leveraging efficient data ingestion, preprocessing, and visualization tools, the framework uncovered
significant patterns, such as the dominance of emergency admissions and the presence of cost outliers
exceeding $3 million. While valuable for initial investigations, the BOAT Framework lacks advanced
predictive capabilities, limiting its application to tasks requiring more complex modeling.
K-Means Clustering excelled in grouping data points based on feature similarity and detecting
outliers. Its iterative approach ensured progressive refinement, systematically identifying anomalous
clusters. The visualization of clusters highlighted variations in cost and stay durations, offering
actionable insights into patient groupings. However, the method’s reliance on predefined cluster
numbers (“k”) and its unsuitability for supervised prediction tasks constrain its broader applicability
in predictive analytics.
CatBoost demonstrated exceptional performance in predictive modeling, achieving the highest R-
squared and lowest RMSE values among the evaluated methods. Its ability to handle categorical
features efficiently without extensive preprocessing significantly streamlined the modeling process.
The use of SHAP values provided interpretability, identifying key features such as 'Length of Stay'
and 'Facility Name' as major cost predictors. Despite its strengths, CatBoost’s computational intensity
and sensitivity to hyperparameter tuning present challenges for large-scale applications.
Random Forest offered a robust and interpretable alternative for predictive modeling. Its ensemble
approach mitigated overfitting and ensured reliable predictions, particularly for mixed data types.
Although slightly less accurate than CatBoost, Random Forest required fewer computational
resources, making it a practical choice for scenarios with limited processing power. Its limitations
include higher memory usage and reduced precision compared to more advanced methods like
CatBoost.
The findings suggest that no single method universally outperforms others across all analytical
objectives. Instead, their strengths can be synergized in hybrid approaches. For instance, combining
the exploratory capabilities of the BOAT Framework with the predictive accuracy of CatBoost or
Random Forest could provide comprehensive insights. Similarly, K-Means Clustering could serve as
a preprocessing step to detect and exclude outliers, enhancing the reliability of predictive models.
Future research should focus on the following areas:
1. Developing hybrid frameworks that integrate exploratory, clustering, and predictive methods.
2. Extending these methodologies to other healthcare datasets for broader validation.
3. Enhancing computational efficiency, particularly for resource-intensive models like CatBoost.
4. Incorporating domain knowledge into model development to improve interpretability and
relevance.
By leveraging the complementary strengths of these methodologies, healthcare analytics can achieve a
balance between exploratory insights and predictive precision, ultimately contributing to better
resource management, cost control, and patient care outcomes.
G.CONCLUSION:
This study provides a comprehensive comparison of four methodologies—BOAT Framework, K-
Means Clustering, CatBoost Regressor, and Random Forest Regressor—applied to the New York
SPARCS dataset. Each method demonstrated unique strengths tailored to specific objectives in
healthcare analytics. The BOAT Framework excelled in exploratory data analysis, offering valuable
insights into trends and patterns. K-Means Clustering effectively identified outliers and grouped data
based on feature similarities, enhancing the understanding of patient distributions.
CatBoost Regressor emerged as the most accurate predictive model, leveraging its ability to handle
categorical data with minimal preprocessing. Random Forest, while slightly less accurate, proved
robust and interpretable, making it a reliable alternative for healthcare cost prediction. The findings
emphasize the importance of selecting methods based on the specific requirements of the analysis,
whether for exploration, anomaly detection, or predictive accuracy.
The study's results highlight the potential for hybrid approaches that combine exploratory and
predictive capabilities. For instance, integrating clustering methods to preprocess data before applying
machine learning models could enhance performance and interpretability. Future research should
explore such integrated frameworks and validate these methods across diverse healthcare datasets,
aiming to optimize resource allocation and improve patient outcomes.
H.REFERENCES:
This project drew inspiration and methodology from several key research papers, including:
1. Hiding in Plain Sight: Insights About Health-Care Trends Gained Through Open Health
Data by A. Ravishankar Rao and Daniel Clarke. This study highlights the use of open health
data for analyzing trends and creating analytical tools for healthcare insights.
2. A System for Exploring Big Data: An Iterative K-Means Searchlight for Outlier Detection
on Open Health Data by A. Ravishankar Rao et al., which discusses advanced clustering
techniques applied to open healthcare datasets, including the SPARCS dataset, for outlier
detection.
3. Predictive Interpretable Analytics Models for Forecasting Healthcare Costs Using Open
Healthcare Data by A. Ravishankar Rao et al., which focuses on building predictive models
for healthcare costs using machine learning and interpretable analytics.
4. Predicting Hospital Length of Stay Using Machine Learning on a Large Open Health
Dataset by Raunak Jain, Mrityunjai Singh, A. Ravishankar Rao, and Rahul Garg.
This paper applies machine learning techniques to predict the length of hospital stays
from a variety of features in a large open health dataset, including data from the
SPARCS system.
These works provided a foundation for our analysis of the New York SPARCS dataset and the
methodologies employed in this project.