CIS-5210 HEALTHCARE DATA ANALYTICS
Diabetic Encounter Analysis using SAS Studio
Monika Mishra
Sushant Burde
CIS 5210: Healthcare Data Analytics
Submitted to: Professor Shilpa Balan
Table of Contents
1 DATA SET
   1. Data Set URL
   2. About the dataset
   3. Dataset details
   4. Column details
2 DATA REFINEMENT
   1. Removing duplicates
   2. Removing unwanted columns
   3. Removing unwanted spaces
   4. Converting text to columns
3 ANALYSIS & VISUALIZATIONS
   1. Bar Chart
   2. Box Plot
   3. Line Chart
   4. Pie Chart
   5. Mosaic Plot
   6. Bar-Line Chart
4 STATISTICAL SUMMARY
5 STATISTICAL TESTS
   1. One-Way Frequency
   2. Correlation Analysis
   3. T-Test
6 REFERENCES
DATA SET
1. Data Set URL:
https://www.kaggle.com/brandao/diabetes
2. About the dataset:
The data set represents 10 years (1999-2008) of clinical care at 130 US hospitals and
integrated delivery networks. It includes over 50 features representing patient and
hospital outcomes. Information was extracted from the database for encounters that
satisfied the following criteria:
• It is an inpatient encounter (a hospital admission).
• It is a diabetic encounter, that is, one during which any kind of diabetes was
entered into the system as a diagnosis.
• The length of stay was at least 1 day and at most 14 days.
• Laboratory tests were performed during the encounter.
• Medications were administered during the encounter.
The data contains such attributes as patient number, race, gender, age, admission type,
time in hospital, medical specialty of admitting physician, number of lab tests performed,
HbA1c test result, diagnosis, number of medications, diabetic medications, and number of
outpatient, inpatient, and emergency visits in the year before the hospitalization.
3. Dataset details:
                     Original     Modified for the analysis
File size            19.2 MB      6 MB
Number of columns    55           15
Number of rows       101767       68379
File format          CSV          CSV
4. Column details:
The original dataset had 55 columns. For our analysis, we have reduced it to 15 columns.
The details of the columns are given below:
Column Name                   Column Detail
Encounter ID                  Unique identifier of an encounter
Patient number                Unique identifier of a patient
Race                          Patient’s race
Gender                        Patient’s gender
Age                           Patient’s age group
Time in hospital              Integer number of days between admission and discharge
Medical specialty             Department (medical specialty) that treated the patient
Number of lab procedures      Number of lab tests performed during the encounter
Number of procedures          Number of procedures (other than lab tests) performed during the encounter
Number of medications         Number of distinct generic names administered during the encounter
Number of outpatient visits   Number of outpatient visits of the patient in the year preceding the encounter
Number of emergency visits    Number of emergency visits of the patient in the year preceding the encounter
Number of inpatient visits    Number of inpatient visits of the patient in the year preceding the encounter
Number of diagnoses           Number of diagnoses entered into the system
Diabetes medications          Indicates whether any diabetic medication was prescribed
DATA REFINEMENT
Removing Duplicates
[Screenshots: Before, After, Process]
Explanation:
There were many duplicate rows in the dataset. We used Excel’s “Remove Duplicates”
feature to remove them. The feature can be found through the path
Data > Table Tools > Remove Duplicates.
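For readers who work outside Excel, the same clean-up step can be sketched in Python. This is only an illustration under assumptions: the sample rows and the column names encounter_id, patient_nbr and race are hypothetical stand-ins, and the helper remove_duplicate_rows is not part of the original workflow. It drops exact duplicate rows and keeps the first occurrence of each.

```python
import csv
import io

# Hypothetical sample data; the real file has many more rows and columns.
SAMPLE_CSV = """encounter_id,patient_nbr,race
1001,501,Caucasian
1002,502,AfricanAmerican
1001,501,Caucasian
1003,503,Hispanic
"""


def remove_duplicate_rows(rows):
    """Return the rows with exact duplicates dropped, keeping first occurrences."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)  # lists are not hashable, tuples are
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


reader = csv.reader(io.StringIO(SAMPLE_CSV))
header = next(reader)  # first line holds the column names
rows = list(reader)
deduped = remove_duplicate_rows(rows)
print(len(rows), "rows before;", len(deduped), "rows after")
```

Comparing the row counts before and after, as the screenshots do, is a quick check that the step removed only what was expected.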
Removing Unwanted Columns
[Screenshots: Before, After, Process]
Explanation:
Many columns held just one value and were not required for the visualizations, so we
deleted them. One of the deleted columns is “max_glu_serum”. To delete a column, we
selected it, right-clicked on it, and then clicked “Delete”.
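Finding single-valued columns by eye is tedious in a wide sheet, so a small Python sketch of the idea may help. The sample rows are hypothetical (the value "None" for max_glu_serum is made up for illustration), and the helpers constant_columns and drop_columns are assumptions, not part of the original process.

```python
def constant_columns(header, rows):
    """Return the names of columns that hold at most one distinct value."""
    return [
        name
        for i, name in enumerate(header)
        if len({row[i] for row in rows}) <= 1
    ]


def drop_columns(header, rows, to_drop):
    """Return a new header and new rows without the named columns."""
    keep = [i for i, name in enumerate(header) if name not in to_drop]
    new_header = [header[i] for i in keep]
    new_rows = [[row[i] for i in keep] for row in rows]
    return new_header, new_rows


# Hypothetical sample: max_glu_serum has the same value in every row.
header = ["encounter_id", "max_glu_serum", "time_in_hospital"]
rows = [
    ["1001", "None", "3"],
    ["1002", "None", "5"],
    ["1003", "None", "2"],
]

to_drop = constant_columns(header, rows)
new_header, new_rows = drop_columns(header, rows, to_drop)
print("dropped:", to_drop)
print("remaining:", new_header)
```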
Removing Unwanted Spaces
[Screenshots: Before, After, Process]
Explanation:
There was extra white space between the words in the medical specialty column. We created a new
column and used the formula builder to apply the TRIM function to the medical specialty column.
TRIM removes leading and trailing spaces and reduces repeated spaces between words to a single space.
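The behaviour of TRIM can be approximated in a few lines of Python. This is a sketch, not Excel's implementation: the helper name excel_trim and the sample values are hypothetical. It handles ordinary space characters only, which is the case this clean-up step relied on.

```python
def excel_trim(text):
    """Strip leading/trailing spaces and collapse runs of spaces to one."""
    # Splitting on a single space yields empty strings for repeated spaces;
    # filtering those out and re-joining leaves exactly one space per gap.
    return " ".join(part for part in text.split(" ") if part)


# Hypothetical values with stray spaces, as they might appear in the column.
values = ["  Internal   Medicine ", "Surgery-General  ", "Cardiology"]
cleaned = [excel_trim(v) for v in values]
print(cleaned)
```

Writing the cleaned values to a new column, as was done in Excel, keeps the original column available for comparison.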
Converting Text to Columns
[Screenshots: Before, After, Process]
Explanation:
Two fields, race and gender, were merged into one column. We used the “Convert Text to
Columns” wizard to separate them into two columns, using a comma as the delimiter. The wizard
can be found through the path Data > Text to Columns.
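The split performed by the wizard can be sketched in Python as follows. The merged sample values and the helper split_race_gender are hypothetical; the sketch assumes each cell contains exactly one comma separating race from gender.

```python
def split_race_gender(value, delimiter=","):
    """Split a merged 'race,gender' cell into its two parts."""
    race, gender = value.split(delimiter, 1)  # split on the first delimiter only
    return race.strip(), gender.strip()


# Hypothetical merged cells.
merged = ["Caucasian,Female", "AfricanAmerican, Male", "Asian,Female"]

pairs = [split_race_gender(v) for v in merged]
races = [race for race, _ in pairs]
genders = [gender for _, gender in pairs]
print(races)
print(genders)
```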
ANALYSIS & VISUALIZATIONS
1. Which race had the most diabetic encounters?
Chart used:
• Bar Chart
Analysis:
The bar chart shows the number of diabetic encounters for each race. Caucasian patients
had the most diabetic encounters (51,042), followed by African American patients (12,604)
and Hispanic patients (1,372). Asian patients had the fewest diabetic encounters.
2. What are the statistics of the number of diagnoses?
Chart used:
• Box Plot
Analysis:
A box plot is a graphical summary of data based on the minimum, first quartile,
median, third quartile, and maximum. Here it shows the statistics for the number of diagnoses. The
mean is about 7.6 and the median is 9. The first quartile is 6 and the third quartile
is 9. The minimum value is 1 and the maximum value is 16. These numbers are
based on all 68,379 observations of the variable number of diagnoses.
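The numbers a box plot is built from can be computed with Python's statistics module, as in the sketch below. The ten sample values are hypothetical and are not the project data. Quartile definitions differ between tools, so the "inclusive" method used here is an assumption and may not match the chart's output exactly.

```python
import statistics


def box_plot_summary(values):
    """Return the statistics a box plot displays, plus the mean."""
    ordered = sorted(values)
    # quantiles(n=4) returns the three cut points: Q1, median, Q3.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {
        "min": ordered[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": ordered[-1],
        "mean": statistics.mean(ordered),
    }


# Hypothetical counts of diagnoses for ten encounters.
number_diagnoses = [1, 5, 6, 6, 9, 9, 9, 9, 9, 16]
summary = box_plot_summary(number_diagnoses)
for name, value in summary.items():
    print(name, value)
```

In this sample the median and the third quartile coincide because many observations share the same value, which is also how a median of 9 and a third quartile of 9 can arise in the analysis above.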
3. Which age group has the most inpatient encounters?
Chart used:
• Line Chart
Analysis:
The line chart shows the number of inpatient encounters for each age group. The
70-80 age group had the most inpatient encounters, followed by the 60-70 age group.
The 0-10 age group had the fewest. In general, the number of encounters increases
with age up to the 70-80 group and declines after that.
4. Which medical specialty was involved in the most patient encounters?
Chart used:
• Pie Chart
Analysis:
Pie charts show the relative contribution of the parts to the whole. The size of a slice
represents the contribution of that category to the total. The Internal Medicine
department had the most encounters with diabetic patients. The Surgery-General
department had the fewest.
5. Are there more females than males who take diabetic medicines?
Chart used:
• Mosaic Plot
Analysis:
Mosaic plots display tiles that correspond to the cells of a cross-tabulation table. The areas of the
tiles are proportional to the frequencies in the table cells. Most males and most females
admitted to the hospitals take diabetic medicines. The number of females who take diabetic
medicines is lower than the number of males.
6. Which race accounts for the maximum and minimum number of inpatient and
outpatient visits?
Chart used:
• Bar-Line Chart
Analysis:
The chart displays the number of outpatient visits and the number of inpatient visits grouped by
race. The Caucasian race has the highest value for both outpatient and inpatient visits.
The Asian race has the lowest value for both.
STATISTICAL SUMMARY
Analysis:
Statistic                      Value   Meaning
Mean                           4.28    The average time spent in hospital: the sum of all time spent in hospital divided by the total number of observations (68,379)
Std Dev (Standard Deviation)   2.92    The extent to which time spent in hospital deviates from the mean. Here it is about two-thirds of the mean, so lengths of stay vary considerably.
Minimum                        1       The lowest value of the time spent in hospital
Maximum                        14      The highest value of the time spent in hospital
Median                         4       The middle value when the observations are ordered by rank
N                              68379   The total number of observations, that is, the total number of rows in the table
We took time spent in hospital (in days) as the analysis variable. The table above shows its
statistical summary with an explanation of each statistic.
STATISTICAL TESTS
1. One-Way Frequency
Analysis:
For the one-way frequency analysis, we took gender as the analysis variable and number of
inpatient visits as the frequency count. We wanted to know which gender had more inpatient
encounters.
From the table and the “Distribution of gender” graph, the number of inpatient
encounters is higher for females than for males. Females have a frequency count of
24,985 (55.13%), while males have 20,339 (44.87%).
Cumulative frequency is a running total of frequencies. The frequency of an
element in a set is how many times that element occurs in the set; the cumulative frequency
at a given point is the sum of the frequencies up to and including that point.
Cumulative frequency is useful when analyzing data because its value indicates the
number of elements in the data set that lie at or below the current value.
The cumulative frequency adds up to the total number of observations, which here is
45,324. The cumulative percentage is always 100% for the last group, which in our analysis
is the Male gender. The “Cumulative Distribution of gender” graph displays the
cumulative frequency distribution.
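The arithmetic behind the frequency table can be reproduced with a short Python sketch. The two counts are the ones reported in the analysis; the function one_way_frequency and the field names are hypothetical and only mirror the usual layout of such a table.

```python
def one_way_frequency(counts):
    """Build frequency, percent, and cumulative columns from level counts."""
    total = sum(counts.values())
    running = 0
    table = []
    for level, freq in counts.items():
        running += freq  # running total gives the cumulative frequency
        table.append({
            "level": level,
            "frequency": freq,
            "percent": round(100 * freq / total, 2),
            "cum_frequency": running,
            "cum_percent": round(100 * running / total, 2),
        })
    return table


# Frequency counts reported in the analysis above.
counts = {"Female": 24985, "Male": 20339}
table = one_way_frequency(counts)
for row in table:
    print(row)
```

The last row's cumulative frequency equals the total of 45,324 and its cumulative percentage is 100, matching the description above.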
2. Correlation Analysis
Analysis:
Correlation analysis provides statistics for investigating associations among
variables. Here the correlation analysis was performed for the
variables time_in_hospital and number_diagnoses, and the correlation coefficient is 0.21469.
Because this value is positive but much closer to 0 than to 1, the two variables are weakly
positively correlated. A value close to +1 or -1 signifies a strong correlation.
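A correlation coefficient of this kind can be computed by hand, as the following Python sketch shows. It assumes the reported value is a Pearson coefficient, which the analysis does not state explicitly. The function pearson_r and the sample lists are hypothetical and are not the project data.

```python
import math


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values for a handful of encounters.
time_in_hospital = [1, 3, 2, 7, 4, 5]
number_diagnoses = [6, 9, 5, 9, 4, 8]
r = pearson_r(time_in_hospital, number_diagnoses)
print(round(r, 5))
```

The result always lies between -1 and +1. The sign gives the direction of the relationship and the magnitude gives its strength.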
3. T-Test
Analysis:
A t-test is an inferential statistic used to determine whether there is a significant
difference between means: between the means of two groups, or, in the one-sample case,
between a sample mean and a hypothesized value. It is a hypothesis-testing tool that
allows an assumption about a population to be tested.
For our analysis, we used a one-sample t-test with time_in_hospital as the
analysis variable. A one-sample t-test compares the mean of the sample to the mean
stated in the null hypothesis.
The output also reports goodness-of-fit tests for normality. For the Kolmogorov-Smirnov
test, p < alpha (p < 0.0100), and for the Cramer-von Mises and Anderson-Darling tests,
p < 0.0050. These tests assess whether time_in_hospital follows a normal distribution,
not whether its mean differs from the hypothesized value. The result is therefore that
the distribution of time_in_hospital departs significantly from normality.
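To make the one-sample comparison concrete, the test statistic can be computed as in the Python sketch below. The sample values and the null-hypothesis mean mu0 are hypothetical, since the report does not state the hypothesized mean, and the helper one_sample_t is an assumption rather than the procedure run in the tool. It returns only the t statistic and degrees of freedom. Obtaining a p-value needs a t-distribution table or a statistics package.

```python
import math
import statistics


def one_sample_t(sample, mu0):
    """Return (t statistic, degrees of freedom) for H0: population mean == mu0."""
    n = len(sample)
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)      # sample standard deviation (n - 1)
    standard_error = sd / math.sqrt(n)
    t = (mean - mu0) / standard_error
    return t, n - 1


# Hypothetical lengths of stay in days, and a hypothetical null mean.
stays = [2, 4, 4, 4, 5, 5, 7, 9]
t_stat, df = one_sample_t(stays, mu0=4)
print("t =", round(t_stat, 4), "df =", df)
```

A large absolute t value relative to the critical value for the given degrees of freedom would lead to rejecting the null hypothesis.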
REFERENCES
https://www.kaggle.com/brandao/diabetes
https://www.wyzant.com/resources/lessons/math/statistics_and_probability/averages/cumulative_frequency_percentiles_and_quartiles
