National College of Ireland
Project Submission Sheet – 2018/2019
Student Name: SHANTANU DESHPANDE
Student ID: X18125514
Programme: MSc. In Data Analytics (Cohort B) Year: Jan 2019
Module: Statistics for Data Analytics
Lecturer: Prof. Tony Delaney
Submission Due Date: 7th January 2019
Project Title: CA2 – Regression
Word Count: 2,329
I hereby certify that the information contained in this (my submission) is information
pertaining to research I conducted for this project. All information other than my own
contribution will be fully referenced and listed in the relevant bibliography section at
the rear of the project.
ALL internet material must be referenced in the references section. Students are
encouraged to use the Harvard Referencing Standard supplied by the Library. To use
other authors' written or electronic work is illegal (plagiarism) and may result in
disciplinary action. Students may be required to undergo a viva (oral examination) if
there is suspicion about the validity of their submitted work.
Signature: ………………………………………………………………………………………………………………
Date: ………………………………………………………………………………………………………………
PLEASE READ THE FOLLOWING INSTRUCTIONS:
1. Please attach a completed copy of this sheet to each project (including multiple copies).
2. Projects should be submitted to your Programme Coordinator.
3. You must ensure that you retain a HARD COPY of ALL projects, both for your own
reference and in case a project is lost or mislaid. It is not sufficient to keep a copy on
computer. Please do not bind projects or place in covers unless specifically requested.
4. You must ensure that all projects are submitted to your Programme Coordinator on or
before the required submission date. Late submissions will incur penalties.
5. All projects must be submitted and passed in order to successfully complete the year.
Any project/assignment not submitted will be marked as a fail.
Office Use Only
Signature:
Date:
Penalty Applied (if applicable):
MULTIPLE LINEAR REGRESSION
Objective of the study: The objective is to analyze the chosen data table with the help of the statistical software IBM SPSS, using multiple linear regression to model the relationship between a dependent variable and two independent variables.
Problem Analysis:
The relevant datasets have been taken from the following web links:
1) Adult Mortality rate (probability of dying between 15 and 60 years per 1000 population) by country-
http://data.un.org/Data.aspx?d=WHO&f=MEASURE_CODE%3aWHOSIS_000004,
2) Adult Obesity Rate (adults aged >= 20 years who are obese (%))-
http://data.un.org/Data.aspx?d=WHO&f=MEASURE_CODE%3aWHOSIS_000010
3) Alcohol Consumption among adults aged >= 15 years (litres of alcohol per person per year)
http://data.un.org/Data.aspx?d=WHO&f=MEASURE_CODE%3aWHOSIS_000011
Description of the dataset:
The dataset consists of 3 variables: mortality, obesity and alcohol. Of these, the mortality column is treated as the
dependent variable, whereas obesity and alcohol are treated as independent variables. Multiple regression
analysis is used to examine possible predictors of adult mortality with the help of the 2 independent variables. As the
dependent variable is continuous, a multiple linear regression model is appropriate for this data. We use the
independent variables to see how well adult mortality can be predicted.
Description of Analysis:
The analysis is carried out to assess the significance of the different independent variables in predicting the
dependent variable.
1. b-Value:
The b-value indicates the extent to which each independent variable influences the outcome when the values of all
other independent variables are held constant.
2. Durbin-Watson Method:
The Durbin-Watson statistic tests whether the residuals are independent (uncorrelated). Its value should ideally be
close to 2; a value below 1 or above 3 indicates that successive residuals are correlated and the results may be unreliable.
3. ANOVA / F-test:
The F-test assesses whether the variance explained by the proposed model is considerably greater than the error
variance within the model. It tells us whether the multiple regression model as a whole is significantly better at
predicting the outcome than simply using the mean.
4. Collinearity Test:
This test checks whether the chosen predictor variables are closely related to each other. It reports two statistics:
Tolerance, which should not fall below 0.1, and the VIF, which should lie in the range 1-10.
Assumptions:
There are several key assumptions of multiple linear regression: homoscedasticity, linearity, normality and the
absence of multicollinearity.
 By observing the normal P-P plot, we hope that all our data points lie in a reasonably straight
diagonal line from bottom left to top right. (Pallant, 2016)
 This would suggest that there is no major deviation from normality.
 In our plot the points lie on or close to the diagonal line, hence there is no major deviation from
normality.
 In the scatter plot, we hope that the residuals are roughly rectangularly distributed, with
most of the scores concentrated in the centre. (Pallant, 2016)
 Our data fulfils this criterion and no particular pattern is observed in the scatter
plot.
 Hence, we can conclude that the homoscedasticity assumption holds.
 Outliers can be identified from the scatter plot. Tabachnick and Fidell define outliers as cases that have a
standardised residual of more than 3.3 or less than -3.3. (Pallant, 2016)
 Here, we can identify 3 potential outliers located at the top-centre of the scatter plot.
 From the histogram, we can see that the residuals approximate a normal distribution reasonably well.
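The Tabachnick and Fidell cut-off above can be sketched as a short check. This is a minimal illustration with hypothetical standardised residuals, not the actual SPSS residuals from this model:

```python
# Flag outliers using the +/-3.3 standardised-residual rule cited above.
def flag_outliers(std_residuals, cutoff=3.3):
    """Return indices of cases whose standardised residual exceeds the cutoff."""
    return [i for i, z in enumerate(std_residuals) if abs(z) > cutoff]

# Hypothetical residuals for illustration: cases 2 and 4 breach the bound.
residuals = [0.4, -1.2, 3.6, 0.1, -3.5, 2.9]
print(flag_outliers(residuals))  # [2, 4]
```

In practice these residuals would be read from the SPSS casewise diagnostics output.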
Multicollinearity:
Correlations
                                     mortalityrate   alcohol    obese
Pearson Correlation  mortalityrate       1.000        -.336     -.453
                     alcohol             -.336        1.000      .240
                     obese               -.453         .240     1.000
Sig. (1-tailed)      mortalityrate         .           .000      .000
                     alcohol              .000          .        .001
                     obese                .000         .001       .
N                    mortalityrate        178          178       178
                     alcohol              178          178       178
                     obese                178          178       178
 The correlation table summarises the correlations between the independent variables and the
dependent variable. Correlation values always lie within the range -1 to 1.
 There should be some relationship between the dependent and independent variables (preferably above 0.3 in absolute value).
 In this case, our dependent variable, ‘mortalityrate’, has a moderate negative
correlation with both independent variables, ‘alcohol’ and ‘obese’, with values of -0.336 and
-0.453 respectively.
 Conversely, the correlation between the independent variables should not be very high (it should stay below 0.7).
 Our independent variables have a weak correlation of 0.24 between them. As this value is much
less than the 0.7 threshold, we can conclude that there is no multicollinearity among the
independent variables.
IBM SPSS Analysis Interpretation:
1) B values (Coefficients table):
Coefficients(a)
                   Unstandardized Coefficients   Standardized Coefficients
Model              B         Std. Error          Beta                        t        Sig.
1  (Constant)      271.961   13.704                                          19.845   .000
   alcohol         -5.996    1.669               -.241                       -3.592   .000
   obese           -3.258    .553                -.395                       -5.894   .000
a. Dependent Variable: mortalityrate
 The above table gives the constant and the slopes which form the regression line
equation.
 Based on the figures above, the equation for our regression line is:
Y = 271.961 – 5.996(alcohol) – 3.258(obese)
 The coefficient (B value) of an independent variable in a multiple regression model tells us the amount by
which the dependent variable changes when that independent variable increases by one unit while the values of all other
independent variables in the model remain the same.
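The fitted equation above can be turned into a small prediction sketch. The second input pair (5 litres per person per year, 20% obese) is a hypothetical example, not a country from the dataset:

```python
# Fitted regression equation from the Coefficients table:
# Y = 271.961 - 5.996*alcohol - 3.258*obese
def predict_mortality(alcohol, obese):
    """Predicted adult mortality rate per 1,000 population."""
    return 271.961 - 5.996 * alcohol - 3.258 * obese

# With both predictors at zero, the prediction is the intercept.
print(round(predict_mortality(0, 0), 3))   # 271.961
# Hypothetical case: 5 litres of alcohol per year, 20% obesity.
print(round(predict_mortality(5, 20), 3))  # 176.821
```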
2) Durbin-Watson Method:
Model Summary(b)
Model   R      R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .513a  .263       .255                87.622                       2.011
Change Statistics: R Square Change = .263, F Change = 31.273, df1 = 2, df2 = 175, Sig. F Change = .000
a. Predictors: (Constant), obese, alcohol
b. Dependent Variable: mortalityrate
 The observed Durbin-Watson value is 2.011, which is very close to the ideal value of 2 and indicates that the
residuals are independent (no autocorrelation).
 To gauge how well the model predicts the dependent variable, we refer to the R
value. Here, an R of 0.51 indicates a moderate level of prediction.
 We refer to the R Square value to understand how much of the variation in the response variable is explained by the
predictors. In this case, about 26% of the variance in the response variable is explained by our predictor variables.
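The R and Adjusted R Square figures in the Model Summary follow from R Square by the standard formulas, with n = 178 cases and k = 2 predictors:

```python
import math

# Reproduce R and Adjusted R Square from the reported R Square.
n, k = 178, 2
r_square = 0.263

r = math.sqrt(r_square)                                  # R = sqrt(R^2)
adj_r_square = 1 - (1 - r_square) * (n - 1) / (n - k - 1)

print(round(r, 3))             # 0.513, matching the reported R
print(round(adj_r_square, 3))  # 0.255, matching Adjusted R Square
```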
3) ANOVA / F-Test:
ANOVA(a)
Model          Sum of Squares   df    Mean Square   F        Sig.
1  Regression  482647.118       2     241323.559    30.678   .000b
   Residual    1376609.134      175   7866.338
   Total       1859256.253      177
a. Dependent Variable: mortalityrate
b. Predictors: (Constant), obese, alcohol
 From the Sum of Squares column we can see that 482647.118 of the total 1859256.253 in the response
variable is explained by the predictors, which means around 1376609.134 of the variation in y is left
unexplained.
 The obtained F value is 30.678 (F(2, 175) = 30.678, p < .001), so the model is statistically significant.
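The F statistic can be rebuilt directly from the ANOVA sums of squares: each mean square is a sum of squares divided by its degrees of freedom, and F is their ratio:

```python
# F = (regression mean square) / (residual mean square),
# built from the ANOVA table figures above.
ss_regression, df1 = 482647.118, 2
ss_residual, df2 = 1376609.134, 175

ms_regression = ss_regression / df1   # 241323.559
ms_residual = ss_residual / df2
f_stat = ms_regression / ms_residual

print(round(ms_residual, 3))  # 7866.338, the Mean Square for Residual
print(round(f_stat, 3))       # 30.678, matching the reported F
```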
4) Collinearity:
Coefficients(a)
               Unstandardized Coefficients          95.0% Confidence Interval for B   Correlations                    Collinearity Statistics
Model          B         Std. Error    Sig.    Lower Bound   Upper Bound      Zero-order  Partial  Part    Tolerance  VIF
1  (Constant)  270.366   13.464        .000    243.792       296.939
   alcohol     -5.909    1.649         .000    -9.164        -2.654           -.336       -.261    -.232   .942       1.061
   obese       -3.249    .543          .000    -4.321        -2.177           -.457       -.412    -.388   .942       1.061
a. Dependent Variable: mortalityrate
 Tolerance is an indicator of how much of the variability of the specified independent variable is not explained by the
other independent variables in the model and is calculated using the formula 1 – R squared for each variable.
If this value is very small (less than .10) it indicates that the multiple correlation with other variables is high,
suggesting the possibility of multicollinearity. The other value given is the VIF (Variance Inflation Factor),
which is just the inverse of the Tolerance value (1 divided by Tolerance). VIF values above 10 would be a
concern here, indicating multicollinearity. (Pallant, 2016)
 The VIF under Collinearity Statistics is 1.061, far below 10, so we can conclude there is no multicollinearity.
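With only two predictors, the R squared of one predictor regressed on the other is simply their squared correlation (r = .240 from the correlation table), so the Tolerance and VIF columns can be reproduced directly:

```python
# Tolerance = 1 - R^2 of each predictor on the other; VIF = 1 / Tolerance.
# With two predictors, that R^2 is just the squared inter-predictor correlation.
r_predictors = 0.240

tolerance = 1 - r_predictors ** 2
vif = 1 / tolerance

print(round(tolerance, 3))  # 0.942, matching the SPSS Tolerance column
print(round(vif, 3))        # 1.061, matching the SPSS VIF column
```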
Conclusion:
Taken together, the tests show a clear association between the independent and dependent variables: changes in
the independent variables are accompanied by a significant amount of change in the dependent
variable.
On the basis of the figures above, we can see how closely the dependent and independent variables are
related: even a small change in the independent variables is associated with a change in the value of the
dependent variable.
LOGISTIC REGRESSION
Objective of the study: To analyze the chosen data table in the statistical software IBM SPSS with the help
of logistic regression, relating a dichotomous dependent variable to two independent
variables.
Source of dataset:
The data used for multiple regression has been reused for logistic regression. The predictor
variables and response variable are the same as in the multiple regression, except that the response
variable has been converted into a dichotomous variable: values of 177.81 or above were
coded 1 and values below 177.81 were coded 0.
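The recoding described above can be sketched as a one-line transformation. The input values below are hypothetical mortality rates for illustration:

```python
# Recode a continuous series into a 0/1 outcome at the 177.81 cut-off:
# values at or above the cut-off become 1, values below it become 0.
def dichotomize(values, cutoff=177.81):
    return [1 if v >= cutoff else 0 for v in values]

print(dichotomize([90.0, 177.81, 250.5, 177.80]))  # [0, 1, 1, 0]
```

In SPSS the same step would be done with Transform > Recode into Different Variables.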
Description of the analysis:
The analysis is performed to observe the significance of the independent variables in predicting the
dichotomous dependent variable at the 95% confidence level.
1) The Hosmer and Lemeshow Test:
According to this test, if the significance value is less than 0.05 then the model is a poor fit. We therefore want a
significance value above 0.05 for our model; if so, we can say that our model is a good fit.
2) Cox & Snell and Nagelkerke R Square:
These statistics indicate the amount of variation in the dependent variable explained by the
model. Rather than true R square values, they are termed pseudo R square statistics (Pallant, 2016).
A value close to 1 indicates an excellent fit, whereas a value close to 0 indicates little or no relationship.
3) Variables in the Equation table:
The Wald statistic in this table is equivalent to the t-statistic in linear regression. Similarly, the B value in
this table is analogous to the b value in linear regression.
4) Percentage Accuracy in Classification:
The classification table tells us what percentage of cases the model classifies correctly, both overall and
within each outcome category.
5) The Omnibus Test:
The omnibus test compares the fitted model against the Block 0 (constant-only) model using the chi-square statistic and its significance value.
IBM SPSS Analysis Interpretation:
1) Hosmer and Lemeshow Test:
Hosmer and Lemeshow Test
Step Chi-square df Sig.
1 11.605 8 .170
 As stated by Hosmer-Lemeshow [1], in order for a model to fit properly the significance value should ideally
exceed 0.05. As our significance value is 0.170 we can say that our model is a good fit.
2) Cox & Snell and Nagelkerke R Square:
Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      174.266a            .310                   .418
a. Estimation terminated at iteration number 5 because parameter estimates changed by less than .001.
 The variation in the response variable explained by the model is indicated by the Nagelkerke R
Square and Cox & Snell R Square values [1].
 In our model these values are 0.418 and 0.310 respectively.
 The model therefore explains between 31.0% and 41.8% of the variation in the response variable.
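Both pseudo R square values can be rebuilt from the -2 log likelihoods. The constant-only -2LL is the model -2LL plus the omnibus chi-square reported later in the document (174.266 + 65.960):

```python
import math

# Cox & Snell and Nagelkerke pseudo R-squares from the -2LL values.
n = 178
neg2ll_model = 174.266
neg2ll_null = neg2ll_model + 65.960   # 240.226, constant-only model

cox_snell = 1 - math.exp((neg2ll_model - neg2ll_null) / n)
# Nagelkerke rescales Cox & Snell so that 1 is attainable.
nagelkerke = cox_snell / (1 - math.exp(-neg2ll_null / n))

print(round(cox_snell, 3))   # 0.310
print(round(nagelkerke, 3))  # 0.418
```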
3) Variables in the Equation table:
Variables in the Equation
                                                               95% C.I. for EXP(B)
                   B       S.E.   Wald     df   Sig.   Exp(B)  Lower    Upper
Step 1a  alcohol   -.154   .051   9.148    1    .002   .858    .776     .947
         obese     -.108   .020   28.192   1    .000   .898    .863     .934
         Constant  2.274   .430   28.014   1    .000   9.720
a. Variable(s) entered on step 1: alcohol, obese.
 The table above shows the contribution of each individual x variable; both are statistically significant (Sig. < .05).
 The B coefficient for alcohol (-0.154) is larger in magnitude than that for obese (-0.108), which suggests that, per unit change, alcohol consumption is the stronger predictor in this model and could be prioritised, although the two variables are measured on different scales.
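The Exp(B) column is simply the exponential of each B coefficient, i.e. the odds ratio. Recomputing it from the printed (rounded) B values reproduces the SPSS column to within rounding:

```python
import math

# Exp(B) = odds ratio = e^B for each term in the logistic model.
# SPSS computes these from unrounded B values, so tiny discrepancies
# against the printed coefficients are expected.
for name, b in [("alcohol", -0.154), ("obese", -0.108), ("constant", 2.274)]:
    print(name, round(math.exp(b), 3))
# alcohol  ~0.857 (SPSS prints .858 from the unrounded B)
# obese     0.898
# constant ~9.718 (SPSS prints 9.720)
```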
4) Percentage Accuracy in classification:
Classification Table(a)
                                  Predicted
                                  mortality         Percentage
Observed                          0        1        Correct
Step 1   mortality    0           91       15       85.8
                      1           19       53       73.6
         Overall Percentage                         80.9
a. The cut value is .500
 The model correctly classifies 85.8% of the cases observed as 0 and 73.6% of the cases observed as 1.
 Overall, 80.9% of the cases are classified correctly, so the remaining 19.1% of cases are
misclassified by the model.
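The classification percentages follow directly from the confusion counts in the table (rows are observed 0/1, columns predicted 0/1):

```python
# Reproduce the classification percentages from the confusion counts.
table = {(0, 0): 91, (0, 1): 15, (1, 0): 19, (1, 1): 53}

correct_0 = table[(0, 0)] / (table[(0, 0)] + table[(0, 1)])  # specificity
correct_1 = table[(1, 1)] / (table[(1, 0)] + table[(1, 1)])  # sensitivity
overall = (table[(0, 0)] + table[(1, 1)]) / sum(table.values())

print(round(100 * correct_0, 1))  # 85.8
print(round(100 * correct_1, 1))  # 73.6
print(round(100 * overall, 1))    # 80.9
```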
5) The Omnibus Test:
Classification Table(a,b)
                                  Predicted
                                  mortality         Percentage
Observed                          0        1        Correct
Step 0   mortality    0           106      0        100.0
                      1           72       0        .0
         Overall Percentage                         59.6
a. Constant is included in the model.
b. The cut value is .500
In the Block 0 classification table, the software simply predicts the predominant category, 0, for every case, which
classifies 59.6% of the cases correctly. The remaining cases, about 40.4%, fall under 1 and are all
misclassified.
Variables in the Equation
                   B       S.E.   Wald    df   Sig.   Exp(B)
Step 0   Constant  -.387   .153   6.414   1    .011   .679
Omnibus Tests of Model Coefficients
                 Chi-square   df   Sig.
Step 1   Step    65.960       2    .000
         Block   65.960       2    .000
         Model   65.960       2    .000
In the omnibus test, the p value is reported as .000, which satisfies the condition p < 0.05. Therefore, with a
chi-square value of 65.960 on 2 degrees of freedom, we can conclude that this model is a significant
improvement over the constant-only model in Block 0.
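The omnibus p-value can be sanity-checked without a statistics library: for exactly 2 degrees of freedom, the chi-square survival function has the closed form P(X > x) = exp(-x/2):

```python
import math

# Chi-square tail probability for the omnibus test.
# The closed form below is valid ONLY for df = 2.
chi_square, df = 65.960, 2
assert df == 2
p_value = math.exp(-chi_square / 2)

print(p_value < 0.05)   # True: the model beats the constant-only Block 0
print(p_value < 1e-14)  # True: SPSS rounds this down and prints .000
```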
Conclusion:
The conclusion that can be drawn from all the above tests is that the dependent and independent variables are
clearly related in the fitted model. It can be observed that a change in the value of the independent
variables is accompanied by a significant change in the value of the dichotomous dependent variable, showing how
closely the independent and dependent variables are related to each other.
References:
 Pallant, Julie. SPSS Survival Manual: A Step by Step Guide to Data Analysis Using SPSS.
 https://statistics.laerd.com/spss-tutorials/multiple-regression-using-spss-statistics.php
 https://statistics.laerd.com/spss-tutorials/binomial-logistic-regression-using-spssstatistics.php
  • 1. National College of Ireland Project Submission Sheet – 2018/2019 Student Name: SHANTANU DESHPANDE Student ID: X18125514 Programme: MSc. In Data Analytics (Cohort B) Year: Jan 2019 Module: Statistics for Data Analytics Lecturer: Prof. Tony Delaney Submission Due Date: 7th January 2019 Project Title: CA2 – Regression Word Count: 2,329 I hereby certify that the information contained in this (my submission) is information pertaining to research I conducted for this project. All information other than my own contribution will be fully referenced and listed in the relevant bibliography section at the rear of the project. ALL internet material must be referenced in the references section. Students are encouraged to use the Harvard Referencing Standard supplied by the Library. To use other author's written or electronic work is illegal (plagiarism) and may result in disciplinary action. Students may be required to undergo a viva (oral examination) if there is suspicion about the validity of their submitted work. Signature: ……………………………………………………………………………………………………………… Date: ……………………………………………………………………………………………………………… PLEASE READ THE FOLLOWING INSTRUCTIONS: 1. Please attach a completed copy of this sheet to each project (including multiple copies). 2. Projects should be submitted to your Programme Coordinator. 3. You must ensure that you retain a HARD COPY of ALL projects, both for your own reference and in case a project is lost or mislaid. It is not sufficient to keep a copy on computer. Please do not bind projects or place in covers unless specifically requested. 4. You must ensure that all projects are submitted to your Programme Coordinator on or before the required submission date. Late submissions will incur penalties. 5. All projects must be submitted and passed in order to successfully complete the year. Any project/assignment not submitted will be marked as a fail. Office Use Only Signature: Date: Penalty Applied (if applicable):
  • 2. MULTI LINEAR REGRESSION Objective of the study: The objective is to analyze the chosen data table with help of an automated software ‘IBM- SPSS’ by using multi linear regression to predict relation of a dependent variable with two independent variables. Problem Analysis: The relevant datasets have been taken from the below web links: 1) Adult Mortality rate (probability of dying between 15 and 60 years per 1000 population) by country- http://data.un.org/Data.aspx?d=WHO&f=MEASURE_CODE%3aWHOSIS_000004, 2) Adult Obesity Rate (adults aged >= 20 years who are obese (%))- http://data.un.org/Data.aspx?d=WHO&f=MEASURE_CODE%3aWHOSIS_000010 3) Alcohol Consumption among adults aged >= 15 years (litres of alcohol per person per year) http://data.un.org/Data.aspx?d=WHO&f=MEASURE_CODE%3aWHOSIS_000011 Description of the dataset: The dataset consists of 3 variables, mortality, obesity and alcohol. Out of these 3, the mortality column is assumed as dependent variable whereas obesity & alcohol are assumed as independent variables. The multiple regression analysis is taken to speculate the causes of adult mortality with the help of 2 independent variables. As the dependant variable is a continuous variable multi linear regression model is used for this data. Let us use the independent variables and analyze to see how well adult mortality is predictable. Description of Analysis: The analysis is carried out to show observation of significance of different independent variables on the prediction of the dependent variable. 1. b-Value: The b-value explains the extent to which each independent value influences the result if the results are constant for all other independent values. 2. Durbin Watson Method: The value obtained from Durbin Watson method should ideally be close to 2 for an efficient result whereas if the obtained result is <1 or >3 then observation obtained is far away from the predicted result. 3. 
ANOVA / F-test: The F test asserts whether the variance thus obtained by the advocated model is considerably higher than error within the calculated model. It will tell us whether the use of multiple regression is fine at predicting values of the result.
  • 3. 4. Collinearity Test: This test is used to check whether taken predictor variables are closely related to each other or not. This test confirms by two results one tolerance which should not be greater than 1 and secondly the VIF should be in range of 1-10. Assumptions: There are several key assumptions of multi linear regression. They are: homoscedasticity, linearity, normality and multicollinearity.  By observing the normal PP plot, we are hoping that all our data points will lie in a reasonably straight diagonal line from bottom left to top right. (Pallant, 2016)  This would suggest us that there is no major deviation from normality.  From our plot we can see that our points are either on the diagonal line or close to it hence no major deviation from normal.
  • 4.  Through the Scatter Plot, we are hoping that the data should be roughly rectangularly distributed with almost major part of data should be concentrated in the centre. (Pallant, 2016)  We can see from our data that it is fulfilling the criteria and no particular pattern is observed in the scatter plot.  Hence, we can assure that it is homoscedastic.  Outliers can be identified from above scatter plot. Tabachnick and Fidell define outliers as cases that have standardised residual of more than 3.3 or less than -3.3 (Pallant, 2016)  Here, we can identify 3 potential outliers located at the top-centre of the scatter plot.  From the above histogram, we can clearly see that model fits very well in the graph of normal distribution.
  • 5. Multicollinearity: Correlations mortalityrate alcohol obese Pearson Correlation mortalityrate 1.000 -.336 -.453 alcohol -.336 1.000 .240 obese -.453 .240 1.000 Sig. (1-tailed) mortalityrate . .000 .000 alcohol .000 . .001 obese .000 .001 . N mortalityrate 178 178 178 alcohol 178 178 178 obese 178 178 178  The correlation table gives us the idea about the correlation between the independent variables and the dependent variables. Normally, the correlated values lie within the range of -1 to 1.  There should be some relationship between the dependent and independent variable (preferably above 0.3)  In this case, we can observe that our dependant variable, ‘mortalityrate’ has a moderate and negative correlation with both the independent variables ‘alcohol’ and ‘obese’ with the values being -0.336 and - 0.453 respectively.  Alternatively, the correlation between independent variables should not be very high (<0.7)  Our independent variables have a weak correlation between them with a value of 0.24. As this value is much less than the threshold value of 0.7, we can assure that there is an absence of multicollinearity within the independent variables.
  • 6. IBM SPSS Analysis Interpretation: 1) B-test: Coefficientsa Model Unstandardized Coefficients Standardized Coefficients t Sig.B Std. Error Beta 1 (Constant) 271.961 13.704 19.845 .000 alcohol -5.996 1.669 -.241 -3.592 .000 obese -3.258 .553 -.395 -5.894 .000 a. Dependent Variable: mortalityrate  From the above table, we can explain about the constants and the slope which forms the regression line equation.  Based on the above figure, we can frame the equation for our regression line as follows- Y = 271.961 – 5.996(alcohol) – 3.258(obese)  The coefficient (B value) of a independent variable in a multiple regression model tells us the amount by which dependent variable changes if that independent variable increases by one and the values of all other independent variables in the model remain the same. 2) Durbin-Watson Method: Model Summaryb Model R R Square Adjusted R Square Std. Error of the Estimate Change Statistics Durbin- Watson R Square Change F Change df1 df2 Sig. F Change 1 .513a .263 .255 87.622 .263 31.273 2 175 .000 2.011 a. Predictors: (Constant), obese, alcohol b. Dependent Variable: mortalityrate  The observed Durbin-Watson value is 2.011 which tells us that our observation fits in to the data model from which we can infer that our predictor and prediction values are continuous.  To get an idea of how well our model is able to predict the values in the dependent variable, we refer the R value. In this case, R level of 0.51 illustrates that our model gives a moderate level of prediction.  We refer the R square value to understand how much our response variable is explained by our predictor variable. In this case, 26% of variance in the response variable is explained by our predictor variables.
  • 7. 3) ANOVA / F-Test:

  ANOVA(a)
  Model          Sum of Squares   df    Mean Square   F        Sig.
  1  Regression  482647.118       2     241323.559    30.678   .000b
     Residual    1376609.134      175   7866.338
     Total       1859256.253      177
  a. Dependent Variable: mortalityrate
  b. Predictors: (Constant), obese, alcohol

   From the Sum of Squares column in the ANOVA table, we can observe that 482647.118 of the total variation of 1859256.253 in the response variable is explained by our predictor variables, which also means that around 1376609.134 of the variation in y is left unexplained.
   The obtained F value is 30.678, which is statistically significant (p < .001).

  4) Collinearity:

  Coefficients(a)
                  Unstandardized Coefficients          95.0% Confidence Interval for B          Correlations                    Collinearity Statistics
  Model           B        Std. Error   Sig.   Lower Bound   Upper Bound   Zero-order   Partial   Part    Tolerance   VIF
  1  (Constant)   270.366  13.464       .000   243.792       296.939
     alcohol      -5.909   1.649        .000   -9.164        -2.654        -.336        -.261     -.232   .942        1.061
     obese        -3.249   .543         .000   -4.321        -2.177        -.457        -.412     -.388   .942        1.061
  a. Dependent Variable: mortalityrate

   Tolerance is an indicator of how much of the variability of the specified independent variable is not explained by the other independent variables in the model, and is calculated as 1 - R squared for each variable. If this value is very small (less than .10), it indicates that the multiple correlation with the other variables is high, suggesting the possibility of multicollinearity. The other value given is the VIF (Variance Inflation Factor), which is just the inverse of the Tolerance value (1 divided by Tolerance). VIF values above 10 would be a concern here, indicating multicollinearity. (Pallant, 2016)
   The VIF under Collinearity Statistics is 1.061, which is far below 10, so we can be confident that there is no multicollinearity.
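The Tolerance/VIF relationship described above is easy to verify directly. With only two predictors, the R squared from regressing one predictor on the other is just the squared correlation, so Tolerance = 1 - r^2 and VIF = 1 / Tolerance. A short sketch on synthetic predictors (the data is assumed, chosen only to give a mild correlation in the spirit of the reported r = 0.240):

```python
import numpy as np

# Synthetic predictors with a mild correlation (values are assumptions).
rng = np.random.default_rng(2)
alcohol = rng.normal(6, 2, 178)
obese = 0.6 * alcohol + rng.normal(18, 4, 178)

r = np.corrcoef(alcohol, obese)[0, 1]
tolerance = 1 - r ** 2        # share of each predictor's variance not shared with the other
vif = 1 / tolerance           # Variance Inflation Factor: inverse of Tolerance

print(f"r = {r:.3f}, Tolerance = {tolerance:.3f}, VIF = {vif:.3f}")
assert vif < 10, "VIF above 10 would indicate multicollinearity"
```

With more than two predictors, the same formula applies but the R squared must come from a regression of each predictor on all the others.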
  • 8. Conclusion: Our conclusion from all the tests above is that there is a close association between the independent and dependent variables. Under the influence of the independent variables, there is a significant amount of change in the dependent variable. On the basis of the above figures, we can clearly observe how closely the dependent and independent variables are related to each other: any change in the values of the independent variables produces a corresponding change in the value of the dependent variable.
  • 9. BINARY LOGISTIC REGRESSION

  Objective of the study: To analyze the chosen data table using IBM SPSS, applying binary logistic regression to model a dichotomous dependent variable from two independent variables.

  Source of dataset: The data used for multiple regression has been used for logistic regression as well. The predictor variables and the response variable are also the same as in the multiple regression; however, the response variable has been converted into a dichotomous variable such that values of 177.81 or above have been coded 1 and values below 177.81 have been coded 0.

  Description of the analysis: The analysis is performed to observe the significance of the independent variables in predicting the dichotomous dependent variable at the 95% confidence level.
  1) The Hosmer and Lemeshow Test: As per this test, if the significance value is less than 0.05 then the model is a poor fit. Therefore, we want a significance value above 0.05 for our model; if so, we can say that our model is a good fit.
  2) Cox & Snell and Nagelkerke R Square: These statistics provide an indication of the amount of variation in the dependent variable explained by the model. Rather than true R square values, they are termed pseudo R square statistics (Pallant, 2016). A value close to 1 would indicate a near-perfect fit, whereas a value close to 0 would show little to no relationship.
  3) Variables in the Equation table: The Wald value in this table is equivalent to the t-statistic in linear regression. Similarly, the B value in this table is analogous to the B value in linear regression.
  4) Percentage Accuracy in Classification: The classification table helps us understand what percentage of cases the model classifies correctly overall.
  5) The Omnibus Test: The omnibus test compares the fitted model against the baseline Block 0 (intercept-only) model using a chi-square statistic, based on the Sig. value present in both tables.
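The pseudo R square statistics and the omnibus chi-square described above are all functions of the model log-likelihoods: Cox & Snell R^2 = 1 - (L0/L1)^(2/n), Nagelkerke rescales it to a maximum of 1, and the omnibus statistic is 2(LL1 - LL0). Below is a minimal sketch on synthetic data; the coefficients in the data-generating step are taken from the SPSS output later in the report, but the dataset is simulated and `fit_logistic` is an illustrative Newton-Raphson helper, not SPSS's estimator:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, iters=25):
    """Fit logistic regression by Newton-Raphson; return (beta, log-likelihood)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ beta)
        grad = X.T @ (y - p)                          # score vector
        hess = X.T @ (X * (p * (1 - p))[:, None])     # observed information
        beta += np.linalg.solve(hess, grad)
    p = sigmoid(X @ beta)
    ll = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return beta, ll

# Synthetic dichotomous data standing in for the report's dataset (assumed).
rng = np.random.default_rng(3)
n = 178
alcohol = rng.normal(6, 2, n)
obese = rng.normal(22, 5, n)
y = (rng.random(n) < sigmoid(2.274 - 0.154 * alcohol - 0.108 * obese)).astype(float)

X = np.column_stack([np.ones(n), alcohol, obese])
beta1, ll1 = fit_logistic(X, y)                       # full model
beta0, ll0 = fit_logistic(np.ones((n, 1)), y)         # Block 0: intercept only

cox_snell = 1 - np.exp(2 * (ll0 - ll1) / n)           # pseudo R-squared
nagelkerke = cox_snell / (1 - np.exp(2 * ll0 / n))    # rescaled to max out at 1
chi_square = 2 * (ll1 - ll0)                          # omnibus test statistic
print(f"Cox & Snell = {cox_snell:.3f}, Nagelkerke = {nagelkerke:.3f}, chi2 = {chi_square:.3f}")
```

Nagelkerke is always at least as large as Cox & Snell, which is why SPSS reports the two as a range of explained variation.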
  • 10. IBM SPSS Analysis Interpretation:

  1) Hosmer and Lemeshow Test:

  Hosmer and Lemeshow Test
  Step   Chi-square   df   Sig.
  1      11.605       8    .170

   As noted above, for the model to fit properly the significance value of the Hosmer and Lemeshow test should exceed 0.05. As our significance value is 0.170, we can say that our model is a good fit.

  2) Cox & Snell and Nagelkerke R Square:

  Model Summary
  Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
  1      174.266a            .310                   .418
  a. Estimation terminated at iteration number 5 because parameter estimates changed by less than .001.

   The variation in the response variable explained by the model is indicated by the Cox & Snell R Square and Nagelkerke R Square values.
   The values for our model are 0.310 and 0.418 respectively.
   That is, the model explains between 31.0% and 41.8% of the variation in the response variable.

  3) Variables in the Equation:

  Variables in the Equation
                                                              95% C.I. for EXP(B)
  Step 1a    B       S.E.   Wald     df   Sig.   Exp(B)   Lower   Upper
  alcohol    -.154   .051   9.148    1    .002   .858     .776    .947
  obese      -.108   .020   28.192   1    .000   .898     .863    .934
  Constant   2.274   .430   28.014   1    .000   9.720
  a. Variable(s) entered on step 1: alcohol, obese.

   The table above shows the contribution of each individual x variable.
   The B coefficient for alcohol (-0.154) is larger in magnitude than that for obese (-0.108), suggesting that, per unit, alcohol consumption has the stronger association with mortality, although obesity has the larger Wald statistic.

  • 11. 4) Percentage Accuracy in Classification:

  Classification Table(a)
                           Predicted
                           mortality         Percentage
  Observed                 0        1        Correct
  Step 1   mortality   0   91       15       85.8
                       1   19       53       73.6
           Overall Percentage               80.9
  a. The cut value is .500

   We can observe that 85.8% of the observed 0s and 73.6% of the observed 1s are classified correctly by the model.
   Overall, 80.9% of cases are classified correctly, meaning that the remaining 19.1% of cases are misclassified.

  5) The Omnibus Test:

  Classification Table(a,b) (Block 0)
                           Predicted
                           mortality         Percentage
  Observed                 0        1        Correct
  Step 0   mortality   0   106      0        100.0
                       1   72       0        .0
           Overall Percentage               59.6
  a. Constant is included in the model.
  b. The cut value is .500

  In the Block 0 classification table, the model contains only the constant, so every case is predicted as the predominant category 0. This classifies 59.6% of cases correctly, while the remaining 40.4% (the observed 1s) are misclassified.

  Variables in the Equation
  Step 0     B       S.E.   Wald    df   Sig.   Exp(B)
  Constant   -.387   .153   6.414   1    .011   .679

  Omnibus Tests of Model Coefficients
                   Chi-square   df   Sig.
  Step 1   Step    65.960       2    .000
           Block   65.960       2    .000
           Model   65.960       2    .000

  In the omnibus test, the p value is .000, which satisfies the condition p < 0.05. With 2 degrees of freedom we have a chi-square value of 65.960, so we can conclude that the model with predictors is a significant improvement over the baseline Block 0 model.
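The classification tables above are built mechanically: predicted probabilities are cut at .500 to give predicted classes, which are then cross-tabulated against the observed outcomes. A short sketch with synthetic probabilities (the real fitted probabilities from the model are not reproduced here):

```python
import numpy as np

# Synthetic predicted probabilities and outcomes consistent with them (assumed data).
rng = np.random.default_rng(4)
n = 178
p_hat = rng.random(n)                                 # model-predicted probabilities
observed = (rng.random(n) < p_hat).astype(int)        # observed 0/1 outcomes
predicted = (p_hat >= 0.5).astype(int)                # SPSS cut value of .500

# 2x2 classification table: rows = observed class, columns = predicted class.
table = np.zeros((2, 2), dtype=int)
for o, p in zip(observed, predicted):
    table[o, p] += 1

pct_correct_0 = 100 * table[0, 0] / table[0].sum()    # % of observed 0s correct
pct_correct_1 = 100 * table[1, 1] / table[1].sum()    # % of observed 1s correct
overall = 100 * np.trace(table) / table.sum()         # overall percentage correct
print(table)
print(f"0s: {pct_correct_0:.1f}%  1s: {pct_correct_1:.1f}%  overall: {overall:.1f}%")
```

The Block 0 table is the degenerate case of the same procedure: with only a constant in the model, every case gets the same predicted class, so one column of the table is all zeros.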
  • 12. Conclusion: The conclusion that can be derived from all the above tests is that the independent variables and the dichotomous dependent variable together form a meaningful model. It can be observed that a change in the values of the independent variables produces a significant change in the predicted probability of the dichotomous dependent variable, showing how closely the independent and dependent variables are related to each other.

  References:
   Pallant, J. (2016). SPSS Survival Manual: A Step by Step Guide to Data Analysis Using SPSS.
   https://statistics.laerd.com/spss-tutorials/multiple-regression-using-spss-statistics.php
   https://statistics.laerd.com/spss-tutorials/binomial-logistic-regression-using-spss-statistics.php