National College of Ireland
Project Submission Sheet – 2018/2019
Student Name: SHANTANU DESHPANDE
Student ID: X18125514
Programme: MSc. In Data Analytics (Cohort B) Year: Jan 2019
Module: Statistics for Data Analytics
Lecturer: Prof. Tony Delaney
Submission Due Date: 7th January 2019
Project Title: CA2 – Regression
Word Count: 2,329
I hereby certify that the information contained in this (my submission) is information
pertaining to research I conducted for this project. All information other than my own
contribution will be fully referenced and listed in the relevant bibliography section at
the rear of the project.
ALL internet material must be referenced in the references section. Students are
encouraged to use the Harvard Referencing Standard supplied by the Library. To use
other authors' written or electronic work is illegal (plagiarism) and may result in
disciplinary action. Students may be required to undergo a viva (oral examination) if
there is suspicion about the validity of their submitted work.
Signature: ………………………………………………………………………………………………………………
Date: ………………………………………………………………………………………………………………
PLEASE READ THE FOLLOWING INSTRUCTIONS:
1. Please attach a completed copy of this sheet to each project (including multiple copies).
2. Projects should be submitted to your Programme Coordinator.
3. You must ensure that you retain a HARD COPY of ALL projects, both for your own
reference and in case a project is lost or mislaid. It is not sufficient to keep a copy on
computer. Please do not bind projects or place in covers unless specifically requested.
4. You must ensure that all projects are submitted to your Programme Coordinator on or
before the required submission date. Late submissions will incur penalties.
5. All projects must be submitted and passed in order to successfully complete the year.
Any project/assignment not submitted will be marked as a fail.
Office Use Only
Signature:
Date:
Penalty Applied (if applicable):
MULTIPLE LINEAR REGRESSION
Objective of the study: The objective is to analyze the chosen data table with the help of the statistical software IBM SPSS, using multiple linear regression to model the relationship between a dependent variable and two independent variables.
Problem Analysis:
The relevant datasets have been taken from the following web links:
1) Adult Mortality rate (probability of dying between 15 and 60 years per 1000 population) by country-
http://data.un.org/Data.aspx?d=WHO&f=MEASURE_CODE%3aWHOSIS_000004,
2) Adult Obesity Rate (adults aged >= 20 years who are obese (%))-
http://data.un.org/Data.aspx?d=WHO&f=MEASURE_CODE%3aWHOSIS_000010
3) Alcohol Consumption among adults aged >= 15 years (litres of alcohol per person per year)
http://data.un.org/Data.aspx?d=WHO&f=MEASURE_CODE%3aWHOSIS_000011
Description of the dataset:
The dataset consists of 3 variables: mortality, obesity and alcohol. Of these, the mortality column is treated as the
dependent variable, whereas obesity and alcohol are treated as independent variables. Multiple regression
analysis is used to examine possible predictors of adult mortality with the help of the 2 independent variables. As the
dependent variable is continuous, a multiple linear regression model is appropriate for this data. We use the
independent variables to see how well adult mortality can be predicted.
Description of Analysis:
The analysis is carried out to assess the significance of the different independent variables in predicting the
dependent variable.
1. b-Value:
The b-value indicates the extent to which each independent variable influences the outcome when the values of all
other independent variables are held constant.
2. Durbin-Watson Method:
The Durbin-Watson statistic tests whether the residuals are independent (uncorrelated). Its value should ideally be
close to 2; a value below 1 or above 3 indicates that successive residuals are correlated and the results may be unreliable.
3. ANOVA / F-test:
The F-test assesses whether the variance explained by the proposed model is considerably greater than the error
variance within the model. It tells us whether the multiple regression model as a whole is significantly better at
predicting the outcome than simply using the mean.
4. Collinearity Test:
This test checks whether the chosen predictor variables are closely related to each other. It reports two statistics:
Tolerance, which should not fall below 0.1, and the VIF, which should lie in the range 1-10.
Assumptions:
There are several key assumptions of multiple linear regression: homoscedasticity, linearity, normality and the
absence of multicollinearity.
 By observing the normal P-P plot, we hope that all our data points lie in a reasonably straight
diagonal line from bottom left to top right. (Pallant, 2016)
 This would suggest that there is no major deviation from normality.
 In our plot the points lie on or close to the diagonal line, hence there is no major deviation from
normality.
 In the scatter plot, we hope that the residuals are roughly rectangularly distributed, with
most of the scores concentrated in the centre. (Pallant, 2016)
 Our data fulfils this criterion and no particular pattern is observed in the scatter
plot.
 Hence, we can conclude that the homoscedasticity assumption holds.
 Outliers can be identified from the scatter plot. Tabachnick and Fidell define outliers as cases that have a
standardised residual of more than 3.3 or less than -3.3. (Pallant, 2016)
 Here, we can identify 3 potential outliers located at the top-centre of the scatter plot.
 From the histogram, we can see that the residuals approximate a normal distribution reasonably well.
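The Tabachnick and Fidell cut-off above can be sketched as a short check. This is a minimal illustration with hypothetical standardised residuals, not the actual SPSS residuals from this model:

```python
# Flag outliers using the +/-3.3 standardised-residual rule cited above.
def flag_outliers(std_residuals, cutoff=3.3):
    """Return indices of cases whose standardised residual exceeds the cutoff."""
    return [i for i, z in enumerate(std_residuals) if abs(z) > cutoff]

# Hypothetical residuals for illustration: cases 2 and 4 breach the bound.
residuals = [0.4, -1.2, 3.6, 0.1, -3.5, 2.9]
print(flag_outliers(residuals))  # [2, 4]
```

In practice these residuals would be read from the SPSS casewise diagnostics output.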
Multicollinearity:
Correlations
                                     mortalityrate   alcohol    obese
Pearson Correlation  mortalityrate       1.000        -.336     -.453
                     alcohol             -.336        1.000      .240
                     obese               -.453         .240     1.000
Sig. (1-tailed)      mortalityrate         .           .000      .000
                     alcohol              .000          .        .001
                     obese                .000         .001       .
N                    mortalityrate        178          178       178
                     alcohol              178          178       178
                     obese                178          178       178
 The correlation table summarises the correlations between the independent variables and the
dependent variable. Correlation values always lie within the range -1 to 1.
 There should be some relationship between the dependent and independent variables (preferably above 0.3 in absolute value).
 In this case, our dependent variable, ‘mortalityrate’, has a moderate negative
correlation with both independent variables, ‘alcohol’ and ‘obese’, with values of -0.336 and
-0.453 respectively.
 Conversely, the correlation between the independent variables should not be very high (it should stay below 0.7).
 Our independent variables have a weak correlation of 0.24 between them. As this value is much
less than the 0.7 threshold, we can conclude that there is no multicollinearity among the
independent variables.
IBM SPSS Analysis Interpretation:
1) B values (Coefficients table):
Coefficients(a)
                   Unstandardized Coefficients   Standardized Coefficients
Model              B         Std. Error          Beta                        t        Sig.
1  (Constant)      271.961   13.704                                          19.845   .000
   alcohol         -5.996    1.669               -.241                       -3.592   .000
   obese           -3.258    .553                -.395                       -5.894   .000
a. Dependent Variable: mortalityrate
 The above table gives the constant and the slopes which form the regression line
equation.
 Based on the figures above, the equation for our regression line is:
Y = 271.961 – 5.996(alcohol) – 3.258(obese)
 The coefficient (B value) of an independent variable in a multiple regression model tells us the amount by
which the dependent variable changes when that independent variable increases by one unit while the values of all other
independent variables in the model remain the same.
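The fitted equation above can be turned into a small prediction sketch. The second input pair (5 litres per person per year, 20% obese) is a hypothetical example, not a country from the dataset:

```python
# Fitted regression equation from the Coefficients table:
# Y = 271.961 - 5.996*alcohol - 3.258*obese
def predict_mortality(alcohol, obese):
    """Predicted adult mortality rate per 1,000 population."""
    return 271.961 - 5.996 * alcohol - 3.258 * obese

# With both predictors at zero, the prediction is the intercept.
print(round(predict_mortality(0, 0), 3))   # 271.961
# Hypothetical case: 5 litres of alcohol per year, 20% obesity.
print(round(predict_mortality(5, 20), 3))  # 176.821
```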
2) Durbin-Watson Method:
Model Summary(b)
Model   R      R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .513a  .263       .255                87.622                       2.011
Change Statistics: R Square Change = .263, F Change = 31.273, df1 = 2, df2 = 175, Sig. F Change = .000
a. Predictors: (Constant), obese, alcohol
b. Dependent Variable: mortalityrate
 The observed Durbin-Watson value is 2.011, which is very close to the ideal value of 2 and indicates that the
residuals are independent (no autocorrelation).
 To gauge how well the model predicts the dependent variable, we refer to the R
value. Here, an R of 0.51 indicates a moderate level of prediction.
 We refer to the R Square value to understand how much of the variation in the response variable is explained by the
predictors. In this case, about 26% of the variance in the response variable is explained by our predictor variables.
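The R and Adjusted R Square figures in the Model Summary follow from R Square by the standard formulas, with n = 178 cases and k = 2 predictors:

```python
import math

# Reproduce R and Adjusted R Square from the reported R Square.
n, k = 178, 2
r_square = 0.263

r = math.sqrt(r_square)                                  # R = sqrt(R^2)
adj_r_square = 1 - (1 - r_square) * (n - 1) / (n - k - 1)

print(round(r, 3))             # 0.513, matching the reported R
print(round(adj_r_square, 3))  # 0.255, matching Adjusted R Square
```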
3) ANOVA / F-Test:
ANOVA(a)
Model          Sum of Squares   df    Mean Square   F        Sig.
1  Regression  482647.118       2     241323.559    30.678   .000b
   Residual    1376609.134      175   7866.338
   Total       1859256.253      177
a. Dependent Variable: mortalityrate
b. Predictors: (Constant), obese, alcohol
 From the Sum of Squares column we can see that 482647.118 of the total 1859256.253 in the response
variable is explained by the predictors, which means around 1376609.134 of the variation in y is left
unexplained.
 The obtained F value is 30.678 (F(2, 175) = 30.678, p < .001), so the model is statistically significant.
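The F statistic can be rebuilt directly from the ANOVA sums of squares: each mean square is a sum of squares divided by its degrees of freedom, and F is their ratio:

```python
# F = (regression mean square) / (residual mean square),
# built from the ANOVA table figures above.
ss_regression, df1 = 482647.118, 2
ss_residual, df2 = 1376609.134, 175

ms_regression = ss_regression / df1   # 241323.559
ms_residual = ss_residual / df2
f_stat = ms_regression / ms_residual

print(round(ms_residual, 3))  # 7866.338, the Mean Square for Residual
print(round(f_stat, 3))       # 30.678, matching the reported F
```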
4) Collinearity:
Coefficients(a)
               Unstandardized Coefficients          95.0% Confidence Interval for B   Correlations                    Collinearity Statistics
Model          B         Std. Error    Sig.    Lower Bound   Upper Bound      Zero-order  Partial  Part    Tolerance  VIF
1  (Constant)  270.366   13.464        .000    243.792       296.939
   alcohol     -5.909    1.649         .000    -9.164        -2.654           -.336       -.261    -.232   .942       1.061
   obese       -3.249    .543          .000    -4.321        -2.177           -.457       -.412    -.388   .942       1.061
a. Dependent Variable: mortalityrate
 Tolerance is an indicator of how much of the variability of the specified independent variable is not explained by the
other independent variables in the model and is calculated using the formula 1 – R squared for each variable.
If this value is very small (less than .10) it indicates that the multiple correlation with other variables is high,
suggesting the possibility of multicollinearity. The other value given is the VIF (Variance Inflation Factor),
which is just the inverse of the Tolerance value (1 divided by Tolerance). VIF values above 10 would be a
concern here, indicating multicollinearity. (Pallant, 2016)
 The VIF under Collinearity Statistics is 1.061, far below 10, so we can conclude there is no multicollinearity.
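With only two predictors, the R squared of one predictor regressed on the other is simply their squared correlation (r = .240 from the correlation table), so the Tolerance and VIF columns can be reproduced directly:

```python
# Tolerance = 1 - R^2 of each predictor on the other; VIF = 1 / Tolerance.
# With two predictors, that R^2 is just the squared inter-predictor correlation.
r_predictors = 0.240

tolerance = 1 - r_predictors ** 2
vif = 1 / tolerance

print(round(tolerance, 3))  # 0.942, matching the SPSS Tolerance column
print(round(vif, 3))        # 1.061, matching the SPSS VIF column
```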
Conclusion:
Taken together, the tests show a clear association between the independent and dependent variables: changes in
the independent variables are accompanied by a significant amount of change in the dependent
variable.
On the basis of the figures above, we can see how closely the dependent and independent variables are
related: even a small change in the independent variables is associated with a change in the value of the
dependent variable.
LOGISTIC REGRESSION
Objective of the study: To analyze the chosen data table in the statistical software IBM SPSS with the help
of logistic regression, relating a dichotomous dependent variable to two independent
variables.
Source of dataset:
The data used for multiple regression has been reused for logistic regression. The predictor
variables and response variable are the same as in the multiple regression, except that the response
variable has been converted into a dichotomous variable: values of 177.81 or above were
coded 1 and values below 177.81 were coded 0.
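The recoding described above can be sketched as a one-line transformation. The input values below are hypothetical mortality rates for illustration:

```python
# Recode a continuous series into a 0/1 outcome at the 177.81 cut-off:
# values at or above the cut-off become 1, values below it become 0.
def dichotomize(values, cutoff=177.81):
    return [1 if v >= cutoff else 0 for v in values]

print(dichotomize([90.0, 177.81, 250.5, 177.80]))  # [0, 1, 1, 0]
```

In SPSS the same step would be done with Transform > Recode into Different Variables.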
Description of the analysis:
The analysis is performed to observe the significance of the independent variables in predicting the
dichotomous dependent variable at the 95% confidence level.
1) The Hosmer and Lemeshow Test:
According to this test, if the significance value is less than 0.05 then the model is a poor fit. We therefore want a
significance value above 0.05 for our model; if so, we can say that our model is a good fit.
2) Cox & Snell and Nagelkerke R Square:
These statistics indicate the amount of variation in the dependent variable explained by the
model. Rather than true R square values, they are termed pseudo R square statistics (Pallant, 2016).
A value close to 1 indicates an excellent fit, whereas a value close to 0 indicates little or no relationship.
3) Variables in the Equation table:
The Wald statistic in this table is equivalent to the t-statistic in linear regression. Similarly, the B value in
this table is analogous to the b value in linear regression.
4) Percentage Accuracy in Classification:
The classification table tells us what percentage of cases the model classifies correctly, both overall and
within each outcome category.
5) The Omnibus Test:
The omnibus test compares the fitted model against the Block 0 (constant-only) model using the chi-square statistic and its significance value.
IBM SPSS Analysis Interpretation:
1) Hosmer and Lemeshow Test:
Hosmer and Lemeshow Test
Step Chi-square df Sig.
1 11.605 8 .170
 As stated by Hosmer-Lemeshow [1], in order for a model to fit properly the significance value should ideally
exceed 0.05. As our significance value is 0.170 we can say that our model is a good fit.
2) Cox & Snell and Nagelkerke R Square:
Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      174.266a            .310                   .418
a. Estimation terminated at iteration number 5 because parameter estimates changed by less than .001.
 The variation in the response variable explained by the model is indicated by the Nagelkerke R
Square and Cox & Snell R Square values [1].
 In our model these values are 0.418 and 0.310 respectively.
 The model therefore explains between 31.0% and 41.8% of the variation in the response variable.
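Both pseudo R square values can be rebuilt from the -2 log likelihoods. The constant-only -2LL is the model -2LL plus the omnibus chi-square reported later in the document (174.266 + 65.960):

```python
import math

# Cox & Snell and Nagelkerke pseudo R-squares from the -2LL values.
n = 178
neg2ll_model = 174.266
neg2ll_null = neg2ll_model + 65.960   # 240.226, constant-only model

cox_snell = 1 - math.exp((neg2ll_model - neg2ll_null) / n)
# Nagelkerke rescales Cox & Snell so that 1 is attainable.
nagelkerke = cox_snell / (1 - math.exp(-neg2ll_null / n))

print(round(cox_snell, 3))   # 0.310
print(round(nagelkerke, 3))  # 0.418
```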
3) Variables in the Equation table:
Variables in the Equation
                                                               95% C.I. for EXP(B)
                   B       S.E.   Wald     df   Sig.   Exp(B)  Lower    Upper
Step 1a  alcohol   -.154   .051   9.148    1    .002   .858    .776     .947
         obese     -.108   .020   28.192   1    .000   .898    .863     .934
         Constant  2.274   .430   28.014   1    .000   9.720
a. Variable(s) entered on step 1: alcohol, obese.
 The table above shows the contribution of each individual x variable; both are statistically significant (Sig. < .05).
 The B coefficient for alcohol (-0.154) is larger in magnitude than that for obese (-0.108), which suggests that, per unit change, alcohol consumption is the stronger predictor in this model and could be prioritised, although the two variables are measured on different scales.
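The Exp(B) column is simply the exponential of each B coefficient, i.e. the odds ratio. Recomputing it from the printed (rounded) B values reproduces the SPSS column to within rounding:

```python
import math

# Exp(B) = odds ratio = e^B for each term in the logistic model.
# SPSS computes these from unrounded B values, so tiny discrepancies
# against the printed coefficients are expected.
for name, b in [("alcohol", -0.154), ("obese", -0.108), ("constant", 2.274)]:
    print(name, round(math.exp(b), 3))
# alcohol  ~0.857 (SPSS prints .858 from the unrounded B)
# obese     0.898
# constant ~9.718 (SPSS prints 9.720)
```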
4) Percentage Accuracy in classification:
Classification Table(a)
                                  Predicted
                                  mortality         Percentage
Observed                          0        1        Correct
Step 1   mortality    0           91       15       85.8
                      1           19       53       73.6
         Overall Percentage                         80.9
a. The cut value is .500
 The model correctly classifies 85.8% of the cases observed as 0 and 73.6% of the cases observed as 1.
 Overall, 80.9% of the cases are classified correctly, so the remaining 19.1% of cases are
misclassified by the model.
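The classification percentages follow directly from the confusion counts in the table (rows are observed 0/1, columns predicted 0/1):

```python
# Reproduce the classification percentages from the confusion counts.
table = {(0, 0): 91, (0, 1): 15, (1, 0): 19, (1, 1): 53}

correct_0 = table[(0, 0)] / (table[(0, 0)] + table[(0, 1)])  # specificity
correct_1 = table[(1, 1)] / (table[(1, 0)] + table[(1, 1)])  # sensitivity
overall = (table[(0, 0)] + table[(1, 1)]) / sum(table.values())

print(round(100 * correct_0, 1))  # 85.8
print(round(100 * correct_1, 1))  # 73.6
print(round(100 * overall, 1))    # 80.9
```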
5) The Omnibus Test:
Classification Table(a,b)
                                  Predicted
                                  mortality         Percentage
Observed                          0        1        Correct
Step 0   mortality    0           106      0        100.0
                      1           72       0        .0
         Overall Percentage                         59.6
a. Constant is included in the model.
b. The cut value is .500
In the Block 0 classification table, the software simply predicts the predominant category, 0, for every case, which
classifies 59.6% of the cases correctly. The remaining cases, about 40.4%, fall under 1 and are all
misclassified.
Variables in the Equation
                   B       S.E.   Wald    df   Sig.   Exp(B)
Step 0   Constant  -.387   .153   6.414   1    .011   .679
Omnibus Tests of Model Coefficients
                 Chi-square   df   Sig.
Step 1   Step    65.960       2    .000
         Block   65.960       2    .000
         Model   65.960       2    .000
In the omnibus test, the p value is reported as .000, which satisfies the condition p < 0.05. Therefore, with a
chi-square value of 65.960 on 2 degrees of freedom, we can conclude that this model is a significant
improvement over the constant-only model in Block 0.
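The omnibus p-value can be sanity-checked without a statistics library: for exactly 2 degrees of freedom, the chi-square survival function has the closed form P(X > x) = exp(-x/2):

```python
import math

# Chi-square tail probability for the omnibus test.
# The closed form below is valid ONLY for df = 2.
chi_square, df = 65.960, 2
assert df == 2
p_value = math.exp(-chi_square / 2)

print(p_value < 0.05)   # True: the model beats the constant-only Block 0
print(p_value < 1e-14)  # True: SPSS rounds this down and prints .000
```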
Conclusion:
The conclusion that can be drawn from all the above tests is that the dependent and independent variables are
clearly related in the fitted model. It can be observed that a change in the value of the independent
variables is accompanied by a significant change in the value of the dichotomous dependent variable, showing how
closely the independent and dependent variables are related to each other.
References:
 Pallant, Julie. SPSS Survival Manual: A Step by Step Guide to Data Analysis Using SPSS.
 https://statistics.laerd.com/spss-tutorials/multiple-regression-using-spss-statistics.php
 https://statistics.laerd.com/spss-tutorials/binomial-logistic-regression-using-spssstatistics.php
  • 1. National College of Ireland Project Submission Sheet – 2018/2019 Student Name: SHANTANU DESHPANDE Student ID: X18125514 Programme: MSc. In Data Analytics (Cohort B) Year: Jan 2019 Module: Statistics for Data Analytics Lecturer: Prof. Tony Delaney Submission Due Date: 7th January 2019 Project Title: CA2 – Regression Word Count: 2,329 I hereby certify that the information contained in this (my submission) is information pertaining to research I conducted for this project. All information other than my own contribution will be fully referenced and listed in the relevant bibliography section at the rear of the project. ALL internet material must be referenced in the references section. Students are encouraged to use the Harvard Referencing Standard supplied by the Library. To use other author's written or electronic work is illegal (plagiarism) and may result in disciplinary action. Students may be required to undergo a viva (oral examination) if there is suspicion about the validity of their submitted work. Signature: ……………………………………………………………………………………………………………… Date: ……………………………………………………………………………………………………………… PLEASE READ THE FOLLOWING INSTRUCTIONS: 1. Please attach a completed copy of this sheet to each project (including multiple copies). 2. Projects should be submitted to your Programme Coordinator. 3. You must ensure that you retain a HARD COPY of ALL projects, both for your own reference and in case a project is lost or mislaid. It is not sufficient to keep a copy on computer. Please do not bind projects or place in covers unless specifically requested. 4. You must ensure that all projects are submitted to your Programme Coordinator on or before the required submission date. Late submissions will incur penalties. 5. All projects must be submitted and passed in order to successfully complete the year. Any project/assignment not submitted will be marked as a fail. Office Use Only Signature: Date: Penalty Applied (if applicable):
  • 2. MULTI LINEAR REGRESSION Objective of the study: The objective is to analyze the chosen data table with help of an automated software ‘IBM- SPSS’ by using multi linear regression to predict relation of a dependent variable with two independent variables. Problem Analysis: The relevant datasets have been taken from the below web links: 1) Adult Mortality rate (probability of dying between 15 and 60 years per 1000 population) by country- http://data.un.org/Data.aspx?d=WHO&f=MEASURE_CODE%3aWHOSIS_000004, 2) Adult Obesity Rate (adults aged >= 20 years who are obese (%))- http://data.un.org/Data.aspx?d=WHO&f=MEASURE_CODE%3aWHOSIS_000010 3) Alcohol Consumption among adults aged >= 15 years (litres of alcohol per person per year) http://data.un.org/Data.aspx?d=WHO&f=MEASURE_CODE%3aWHOSIS_000011 Description of the dataset: The dataset consists of 3 variables, mortality, obesity and alcohol. Out of these 3, the mortality column is assumed as dependent variable whereas obesity & alcohol are assumed as independent variables. The multiple regression analysis is taken to speculate the causes of adult mortality with the help of 2 independent variables. As the dependant variable is a continuous variable multi linear regression model is used for this data. Let us use the independent variables and analyze to see how well adult mortality is predictable. Description of Analysis: The analysis is carried out to show observation of significance of different independent variables on the prediction of the dependent variable. 1. b-Value: The b-value explains the extent to which each independent value influences the result if the results are constant for all other independent values. 2. Durbin Watson Method: The value obtained from Durbin Watson method should ideally be close to 2 for an efficient result whereas if the obtained result is <1 or >3 then observation obtained is far away from the predicted result. 3. 
ANOVA / F-test: The F test asserts whether the variance thus obtained by the advocated model is considerably higher than error within the calculated model. It will tell us whether the use of multiple regression is fine at predicting values of the result.
  • 3. 4. Collinearity Test: This test is used to check whether taken predictor variables are closely related to each other or not. This test confirms by two results one tolerance which should not be greater than 1 and secondly the VIF should be in range of 1-10. Assumptions: There are several key assumptions of multi linear regression. They are: homoscedasticity, linearity, normality and multicollinearity.  By observing the normal PP plot, we are hoping that all our data points will lie in a reasonably straight diagonal line from bottom left to top right. (Pallant, 2016)  This would suggest us that there is no major deviation from normality.  From our plot we can see that our points are either on the diagonal line or close to it hence no major deviation from normal.
  • 4.  Through the Scatter Plot, we are hoping that the data should be roughly rectangularly distributed with almost major part of data should be concentrated in the centre. (Pallant, 2016)  We can see from our data that it is fulfilling the criteria and no particular pattern is observed in the scatter plot.  Hence, we can assure that it is homoscedastic.  Outliers can be identified from above scatter plot. Tabachnick and Fidell define outliers as cases that have standardised residual of more than 3.3 or less than -3.3 (Pallant, 2016)  Here, we can identify 3 potential outliers located at the top-centre of the scatter plot.  From the above histogram, we can clearly see that model fits very well in the graph of normal distribution.
  • 5. Multicollinearity: Correlations mortalityrate alcohol obese Pearson Correlation mortalityrate 1.000 -.336 -.453 alcohol -.336 1.000 .240 obese -.453 .240 1.000 Sig. (1-tailed) mortalityrate . .000 .000 alcohol .000 . .001 obese .000 .001 . N mortalityrate 178 178 178 alcohol 178 178 178 obese 178 178 178  The correlation table gives us the idea about the correlation between the independent variables and the dependent variables. Normally, the correlated values lie within the range of -1 to 1.  There should be some relationship between the dependent and independent variable (preferably above 0.3)  In this case, we can observe that our dependant variable, ‘mortalityrate’ has a moderate and negative correlation with both the independent variables ‘alcohol’ and ‘obese’ with the values being -0.336 and - 0.453 respectively.  Alternatively, the correlation between independent variables should not be very high (<0.7)  Our independent variables have a weak correlation between them with a value of 0.24. As this value is much less than the threshold value of 0.7, we can assure that there is an absence of multicollinearity within the independent variables.
  • 6. IBM SPSS Analysis Interpretation: 1) B-test: Coefficientsa Model Unstandardized Coefficients Standardized Coefficients t Sig.B Std. Error Beta 1 (Constant) 271.961 13.704 19.845 .000 alcohol -5.996 1.669 -.241 -3.592 .000 obese -3.258 .553 -.395 -5.894 .000 a. Dependent Variable: mortalityrate  From the above table, we can explain about the constants and the slope which forms the regression line equation.  Based on the above figure, we can frame the equation for our regression line as follows- Y = 271.961 – 5.996(alcohol) – 3.258(obese)  The coefficient (B value) of a independent variable in a multiple regression model tells us the amount by which dependent variable changes if that independent variable increases by one and the values of all other independent variables in the model remain the same. 2) Durbin-Watson Method: Model Summaryb Model R R Square Adjusted R Square Std. Error of the Estimate Change Statistics Durbin- Watson R Square Change F Change df1 df2 Sig. F Change 1 .513a .263 .255 87.622 .263 31.273 2 175 .000 2.011 a. Predictors: (Constant), obese, alcohol b. Dependent Variable: mortalityrate  The observed Durbin-Watson value is 2.011 which tells us that our observation fits in to the data model from which we can infer that our predictor and prediction values are continuous.  To get an idea of how well our model is able to predict the values in the dependent variable, we refer the R value. In this case, R level of 0.51 illustrates that our model gives a moderate level of prediction.  We refer the R square value to understand how much our response variable is explained by our predictor variable. In this case, 26% of variance in the response variable is explained by our predictor variables.
  • 7. 3) ANOVA / F-Test:

  ANOVA(a)
  Model          Sum of Squares   df    Mean Square   F        Sig.
  1  Regression  482647.118       2     241323.559    30.678   .000b
     Residual    1376609.134      175   7866.338
     Total       1859256.253      177
  a. Dependent Variable: mortalityrate
  b. Predictors: (Constant), obese, alcohol

   From the Sum of Squares column in the ANOVA table, we can observe that 482647.118 of the total variation of 1859256.253 in the response variable is explained by our predictor variables, which also means that around 1376609.134 of the variation in y is left unexplained.
   The obtained F value is 30.678, which is statistically significant (p < .001).

  4) Collinearity:

  Coefficients(a)
                  Unstandardized Coefficients          95.0% Confidence Interval for B          Correlations                    Collinearity Statistics
  Model           B        Std. Error   Sig.   Lower Bound   Upper Bound   Zero-order   Partial   Part    Tolerance   VIF
  1  (Constant)   270.366  13.464       .000   243.792       296.939
     alcohol      -5.909   1.649        .000   -9.164        -2.654        -.336        -.261     -.232   .942        1.061
     obese        -3.249   .543         .000   -4.321        -2.177        -.457        -.412     -.388   .942        1.061
  a. Dependent Variable: mortalityrate

   Tolerance is an indicator of how much of the variability of the specified independent variable is not explained by the other independent variables in the model, and is calculated as 1 - R squared for each variable. If this value is very small (less than .10), it indicates that the multiple correlation with the other variables is high, suggesting the possibility of multicollinearity. The other value given is the VIF (Variance Inflation Factor), which is just the inverse of the Tolerance value (1 divided by Tolerance). VIF values above 10 would be a concern here, indicating multicollinearity. (Pallant, 2016)
   The VIF under Collinearity Statistics is 1.061, which is far below 10, so we can be confident that there is no multicollinearity.
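The Tolerance/VIF relationship described above is easy to verify directly. With only two predictors, the R squared from regressing one predictor on the other is just the squared correlation, so Tolerance = 1 - r^2 and VIF = 1 / Tolerance. A short sketch on synthetic predictors (the data is assumed, chosen only to give a mild correlation in the spirit of the reported r = 0.240):

```python
import numpy as np

# Synthetic predictors with a mild correlation (values are assumptions).
rng = np.random.default_rng(2)
alcohol = rng.normal(6, 2, 178)
obese = 0.6 * alcohol + rng.normal(18, 4, 178)

r = np.corrcoef(alcohol, obese)[0, 1]
tolerance = 1 - r ** 2        # share of each predictor's variance not shared with the other
vif = 1 / tolerance           # Variance Inflation Factor: inverse of Tolerance

print(f"r = {r:.3f}, Tolerance = {tolerance:.3f}, VIF = {vif:.3f}")
assert vif < 10, "VIF above 10 would indicate multicollinearity"
```

With more than two predictors, the same formula applies but the R squared must come from a regression of each predictor on all the others.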
  • 8. Conclusion: Our conclusion from all the tests above is that there is a close association between the independent and dependent variables. Under the influence of the independent variables, there is a significant amount of change in the dependent variable. On the basis of the above figures, we can clearly observe how closely the dependent and independent variables are related to each other: any change in the values of the independent variables produces a corresponding change in the value of the dependent variable.
  • 9. BINARY LOGISTIC REGRESSION

  Objective of the study: To analyze the chosen data table using IBM SPSS, applying binary logistic regression to model a dichotomous dependent variable from two independent variables.

  Source of dataset: The data used for multiple regression has been used for logistic regression as well. The predictor variables and the response variable are also the same as in the multiple regression; however, the response variable has been converted into a dichotomous variable such that values of 177.81 or above have been coded 1 and values below 177.81 have been coded 0.

  Description of the analysis: The analysis is performed to observe the significance of the independent variables in predicting the dichotomous dependent variable at the 95% confidence level.
  1) The Hosmer and Lemeshow Test: As per this test, if the significance value is less than 0.05 then the model is a poor fit. Therefore, we want a significance value above 0.05 for our model; if so, we can say that our model is a good fit.
  2) Cox & Snell and Nagelkerke R Square: These statistics provide an indication of the amount of variation in the dependent variable explained by the model. Rather than true R square values, they are termed pseudo R square statistics (Pallant, 2016). A value close to 1 would indicate a near-perfect fit, whereas a value close to 0 would show little to no relationship.
  3) Variables in the Equation table: The Wald value in this table is equivalent to the t-statistic in linear regression. Similarly, the B value in this table is analogous to the B value in linear regression.
  4) Percentage Accuracy in Classification: The classification table helps us understand what percentage of cases the model classifies correctly overall.
  5) The Omnibus Test: The omnibus test compares the fitted model against the baseline Block 0 (intercept-only) model using a chi-square statistic, based on the Sig. value present in both tables.
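The pseudo R square statistics and the omnibus chi-square described above are all functions of the model log-likelihoods: Cox & Snell R^2 = 1 - (L0/L1)^(2/n), Nagelkerke rescales it to a maximum of 1, and the omnibus statistic is 2(LL1 - LL0). Below is a minimal sketch on synthetic data; the coefficients in the data-generating step are taken from the SPSS output later in the report, but the dataset is simulated and `fit_logistic` is an illustrative Newton-Raphson helper, not SPSS's estimator:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, iters=25):
    """Fit logistic regression by Newton-Raphson; return (beta, log-likelihood)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ beta)
        grad = X.T @ (y - p)                          # score vector
        hess = X.T @ (X * (p * (1 - p))[:, None])     # observed information
        beta += np.linalg.solve(hess, grad)
    p = sigmoid(X @ beta)
    ll = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return beta, ll

# Synthetic dichotomous data standing in for the report's dataset (assumed).
rng = np.random.default_rng(3)
n = 178
alcohol = rng.normal(6, 2, n)
obese = rng.normal(22, 5, n)
y = (rng.random(n) < sigmoid(2.274 - 0.154 * alcohol - 0.108 * obese)).astype(float)

X = np.column_stack([np.ones(n), alcohol, obese])
beta1, ll1 = fit_logistic(X, y)                       # full model
beta0, ll0 = fit_logistic(np.ones((n, 1)), y)         # Block 0: intercept only

cox_snell = 1 - np.exp(2 * (ll0 - ll1) / n)           # pseudo R-squared
nagelkerke = cox_snell / (1 - np.exp(2 * ll0 / n))    # rescaled to max out at 1
chi_square = 2 * (ll1 - ll0)                          # omnibus test statistic
print(f"Cox & Snell = {cox_snell:.3f}, Nagelkerke = {nagelkerke:.3f}, chi2 = {chi_square:.3f}")
```

Nagelkerke is always at least as large as Cox & Snell, which is why SPSS reports the two as a range of explained variation.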
  • 10. IBM SPSS Analysis Interpretation:

  1) Hosmer and Lemeshow Test:

  Hosmer and Lemeshow Test
  Step   Chi-square   df   Sig.
  1      11.605       8    .170

   As noted above, for the model to fit properly the significance value of the Hosmer and Lemeshow test should exceed 0.05. As our significance value is 0.170, we can say that our model is a good fit.

  2) Cox & Snell and Nagelkerke R Square:

  Model Summary
  Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
  1      174.266a            .310                   .418
  a. Estimation terminated at iteration number 5 because parameter estimates changed by less than .001.

   The variation in the response variable explained by the model is indicated by the Cox & Snell R Square and Nagelkerke R Square values.
   The values for our model are 0.310 and 0.418 respectively.
   That is, the model explains between 31.0% and 41.8% of the variation in the response variable.

  3) Variables in the Equation:

  Variables in the Equation
                                                              95% C.I. for EXP(B)
  Step 1a    B       S.E.   Wald     df   Sig.   Exp(B)   Lower   Upper
  alcohol    -.154   .051   9.148    1    .002   .858     .776    .947
  obese      -.108   .020   28.192   1    .000   .898     .863    .934
  Constant   2.274   .430   28.014   1    .000   9.720
  a. Variable(s) entered on step 1: alcohol, obese.

   The table above shows the contribution of each individual x variable.
   The B coefficient for alcohol (-0.154) is larger in magnitude than that for obese (-0.108), suggesting that, per unit, alcohol consumption has the stronger association with mortality, although obesity has the larger Wald statistic.

  • 11. 4) Percentage Accuracy in Classification:

  Classification Table(a)
                           Predicted
                           mortality         Percentage
  Observed                 0        1        Correct
  Step 1   mortality   0   91       15       85.8
                       1   19       53       73.6
           Overall Percentage               80.9
  a. The cut value is .500

   We can observe that 85.8% of the observed 0s and 73.6% of the observed 1s are classified correctly by the model.
   Overall, 80.9% of cases are classified correctly, meaning that the remaining 19.1% of cases are misclassified.

  5) The Omnibus Test:

  Classification Table(a,b) (Block 0)
                           Predicted
                           mortality         Percentage
  Observed                 0        1        Correct
  Step 0   mortality   0   106      0        100.0
                       1   72       0        .0
           Overall Percentage               59.6
  a. Constant is included in the model.
  b. The cut value is .500

  In the Block 0 classification table, the model contains only the constant, so every case is predicted as the predominant category 0. This classifies 59.6% of cases correctly, while the remaining 40.4% (the observed 1s) are misclassified.

  Variables in the Equation
  Step 0     B       S.E.   Wald    df   Sig.   Exp(B)
  Constant   -.387   .153   6.414   1    .011   .679

  Omnibus Tests of Model Coefficients
                   Chi-square   df   Sig.
  Step 1   Step    65.960       2    .000
           Block   65.960       2    .000
           Model   65.960       2    .000

  In the omnibus test, the p value is .000, which satisfies the condition p < 0.05. With 2 degrees of freedom we have a chi-square value of 65.960, so we can conclude that the model with predictors is a significant improvement over the baseline Block 0 model.
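The classification tables above are built mechanically: predicted probabilities are cut at .500 to give predicted classes, which are then cross-tabulated against the observed outcomes. A short sketch with synthetic probabilities (the real fitted probabilities from the model are not reproduced here):

```python
import numpy as np

# Synthetic predicted probabilities and outcomes consistent with them (assumed data).
rng = np.random.default_rng(4)
n = 178
p_hat = rng.random(n)                                 # model-predicted probabilities
observed = (rng.random(n) < p_hat).astype(int)        # observed 0/1 outcomes
predicted = (p_hat >= 0.5).astype(int)                # SPSS cut value of .500

# 2x2 classification table: rows = observed class, columns = predicted class.
table = np.zeros((2, 2), dtype=int)
for o, p in zip(observed, predicted):
    table[o, p] += 1

pct_correct_0 = 100 * table[0, 0] / table[0].sum()    # % of observed 0s correct
pct_correct_1 = 100 * table[1, 1] / table[1].sum()    # % of observed 1s correct
overall = 100 * np.trace(table) / table.sum()         # overall percentage correct
print(table)
print(f"0s: {pct_correct_0:.1f}%  1s: {pct_correct_1:.1f}%  overall: {overall:.1f}%")
```

The Block 0 table is the degenerate case of the same procedure: with only a constant in the model, every case gets the same predicted class, so one column of the table is all zeros.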
  • 12. Conclusion: The conclusion that can be derived from all the above tests is that the independent variables and the dichotomous dependent variable together form a meaningful model. It can be observed that a change in the values of the independent variables produces a significant change in the predicted probability of the dichotomous dependent variable, showing how closely the independent and dependent variables are related to each other.

  References:
   Pallant, J. (2016). SPSS Survival Manual: A Step by Step Guide to Data Analysis Using SPSS.
   https://statistics.laerd.com/spss-tutorials/multiple-regression-using-spss-statistics.php
   https://statistics.laerd.com/spss-tutorials/binomial-logistic-regression-using-spss-statistics.php