Statistic report

STATISTICS REPORT
ON
MULTIPLE REGRESSION,
LOGISTIC REGRESSION
AND TWO-WAY ANOVA
By: SONALI GUPTA
X01527245
MSc in Data Analytics
National College of Ireland

Table of Contents
MULTIPLE REGRESSION ANALYSIS
  DATA SOURCE
  OBJECTIVE
  DATA INFORMATION
  SOFTWARE
  DATA CLEANING
  OUTPUT OF MULTIPLE REGRESSION
  DATA SUMMARY
  CORRELATION MATRIX
  PAIRWISE MATRIX OF SCATTER PLOT
  LINEAR MODEL FIT
  RESIDUAL PLOT
  RESULT
LOGISTIC REGRESSION
  DATA SOURCE
  OBJECTIVE
  DATA INFORMATION
  SOFTWARE
  PROCEDURE OF LOGISTIC REGRESSION IN SPSS
  OUTPUT OF LOGISTIC REGRESSION
  CASE PROCESSING
  CATEGORICAL VARIABLES CODINGS
  CLASSIFICATION TABLE
  OMNIBUS TESTS OF MODEL COEFFICIENTS
  HOSMER AND LEMESHOW TEST
  MODEL SUMMARY
  CLASSIFICATION TABLE
  VARIABLES IN THE EQUATION
  CASEWISE LIST
  RESULT
TWO-WAY ANOVA
  DATA SOURCE
  OBJECTIVE
  DATA INFORMATION
  SOFTWARE
  PROCEDURE FOR TWO-WAY ANOVA IN SPSS
  OUTPUT OF TWO-WAY ANOVA
  DESCRIPTIVE STATISTICS
  LEVENE’S TEST OF EQUALITY OF ERROR VARIANCES
  INTERACTION EFFECTS
  PLOTS
  RESULT
REFERENCES

MULTIPLE REGRESSION ANALYSIS
In simple terms, multiple regression is used to describe the relationship between a continuous
dependent variable and two or more independent variables.
DATA SOURCE
This analysis uses soil solution chemistry data from UK environmental monitoring. The
dataset is available at:
https://data.gov.uk/dataset/uk-environmental-change-network-ecn-soil-solution-chemistry-data-1992-2012
The data was provided in CSV format.
OBJECTIVE
This dataset was selected to analyse which factors are related to soil quality.
The objectives of this analysis are to:
• Study the various components of soil.
• Study the relationship between class type and the different components.
• Study the relationships among all the components.
DATA INFORMATION
This dataset contains 1014 observations and 12 variables.
1) Type: the type of soil, for example Bixa, Sodic, Grassland, Horticulture, Lawn etc.
2) pH: potential of hydrogen, an acidity scale from 0 to 14.
3) Conductivity
4) Carbon
5) Nitrogen
6) Phosphorus
7) Potassium
8) WHC
9) Class type: high fertile, medium fertile or low fertile.
SOFTWARE
R is a simple, effective, open-source language that is widely used for data manipulation,
data handling, data visualisation, statistical analysis and graphics.
In RStudio, the data was loaded with the read.table command, as shown below:
soil <- read.table("soil_15nov_ANN.CSV", sep = ",", header = TRUE)

DATA CLEANING
The dataset has many NA values, so it was cleaned with the complete.cases() function. This
function flags the rows that contain no NA values, and only those rows are kept. The class
type column is stored as text, so that column was also removed.
clean <- complete.cases(soil)
soil_data <- soil[clean, ]
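As a minimal sketch of the cleaning step, assuming a small made-up data frame (the name toy and all of its values are hypothetical, not taken from the soil file), complete.cases() returns one logical flag per row, and a second step can keep only the numeric columns:

```r
# Hypothetical toy data; not the soil dataset from the report.
toy <- data.frame(
  ph = c(8.2, NA, 9.7, 8.8),
  carbon = c(0.12, 0.24, NA, 0.59),
  type = c("Lawn", "Sodic", "Grassland", "Lawn"),
  stringsAsFactors = FALSE
)

# TRUE for rows with no missing value in any column.
keep <- complete.cases(toy)
toy_clean <- toy[keep, ]

# Keep only numeric columns, so that cor() and lm() can use them.
toy_numeric <- toy_clean[, sapply(toy_clean, is.numeric), drop = FALSE]

print(toy_numeric)
```

Dropping text columns by type rather than by name is a design choice of this sketch; the report does not show how its text column was removed.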
OUTPUT OF MULTIPLE REGRESSION
DATA SUMMARY
This table summarises the dataset: it gives the minimum, 1st quartile, median, mean,
3rd quartile and maximum value of every component.
summary(soil_data)
         depth  ph     conductivity  carbon  nitrogen  phosphorus  potassium  WHC    porosity
Min.     1      5.6    40            0.015   3.98      4           60         27.1   29.3
1st Qu.  1      8.2    265           0.12    17        14.43       200        42     40
Median   2      8.81   410           0.235   30.52     19.84       290        46.3   44.5
Mean     2.4    8.8    499.1         0.395   37.08     24.29       379.9      47.4   44.96
3rd Qu.  3      9.7    660           0.589   50.4      32.3        400        52.5   49
Max.     4      11.5   1720          2.35    185       82.42       3000       76.8   65.72

CORRELATION MATRIX
Correlation measures the strength of the linear relationship between two variables and shows
whether that relationship is positive or negative.
cor(soil_data)
depth ph conductivity carbon nitrogen
depth 1.00000000 0.3513589 0.43664406 -0.5636996 -0.5411645
ph 0.35135888 1.0000000 0.75864988 -0.7100049 -0.6449397
conductivity 0.43664406 0.7586499 1.00000000 -0.5188081 -0.4582881
carbon -0.56369961 -0.7100049 -0.51880806 1.0000000 0.7896135
nitrogen -0.54116446 -0.6449397 -0.45828809 0.7896135 1.0000000
phosphorus -0.32741495 -0.6335258 -0.51024771 0.6872795 0.6610887
potassium -0.20167701 -0.5825826 -0.31843323 0.6639411 0.4568622
WHC -0.22324381 -0.2304440 -0.27958500 0.2367519 0.2386911
porosity -0.20642418 -0.2157805 -0.27194201 0.2055014 0.2414338
class 0.07352693 0.1423210 0.04065294 -0.1363034 -0.1771025
phosphorus potassium WHC porosity class
depth -0.32741496 -0.2016770 -0.2232438 -0.2064242 0.073526927
ph -0.63352582 -0.5825826 -0.2304440 -0.2157805 0.14232104
conductivity -0.51024770 -0.3184332 -0.2795850 -0.2719420 0.040652945
carbon 0.687279522 0.6639411 0.2367519 0.2055014 -0.136303384
nitrogen 0.661088713 0.4568622 0.2386911 0.2414338 -0.177102523
phosphorus 1.0000000 0.6125755 0.3444425 0.3380200 0.001868296
potassium 0.612575523 1.0000000 0.1906594 0.1153716 -0.117629191
WHC 0.344442511 0.1906594 1.0000000 0.8128219 0.395313489
porosity 0.338020020 0.1153716 0.8128219 1.0000000 0.33599933
class 0.001868296 -0.1176292 0.3953135 0.3359993 1.000000000
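The behaviour of cor() can be illustrated with a short sketch on simulated data. The variable names mirror the report, but every value below is made up, and the relationships are built in by construction rather than estimated from the soil file.

```r
# Simulated data; variable names mirror the report but values are made up.
set.seed(1)
n <- 200
ph <- rnorm(n, mean = 8.8, sd = 1)
conductivity <- 100 * ph + rnorm(n, sd = 80)   # built to rise with ph
carbon <- 3 - 0.3 * ph + rnorm(n, sd = 0.2)    # built to fall as ph rises
sim <- data.frame(ph, conductivity, carbon)

r <- cor(sim)          # Pearson correlation matrix
print(round(r, 2))

# A correlation matrix is symmetric with ones on the diagonal,
# so entry [i, j] must equal entry [j, i].
print(isSymmetric(r))
```

The symmetry property is a quick sanity check when reading a printed matrix: each pair of variables should show the same value in both places where it appears.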

PAIRWISE MATRIX OF SCATTER PLOT
The command below plots every pair of variables against each other, which makes the
relationship between the components easy to inspect.
pairs(soil_data[c("depth", "ph", "conductivity", "carbon", "nitrogen", "phosphorus", "WHC", "porosity", "class")])
The figure shows that as the ph value rises, conductivity also rises, so the correlation
between them is positive. A negative correlation appears when one variable falls as the other
rises, for example carbon against ph. As we can see, there is essentially no correlation between
class and depth.

LINEAR MODEL FIT
fit <- lm(class ~ ., data = soil_data)
fit
Call:
lm(formula = class ~ ., data = soil_data)
Coefficients:
 (Intercept)        depth           ph conductivity       carbon     nitrogen   phosphorus
  -1.291e-01    2.770e-02    6.608e-02    1.793e-05    1.466e-01   -7.559e-03    7.730e-03
   potassium          WHC     porosity
  -2.519e-04    3.501e-02    5.376e-03
summary(fit)
Call:
lm(formula = class ~ ., data = soil_data)
Residuals:
Min 1Q Median 3Q Max
-1.29350 -0.32275 0.00726 0.43999 1.29096
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.291e-01 3.360e-01 -0.384 0.701002
depth 2.770e-02 2.309e-02 1.199 0.230659
ph 6.608e-02 3.313e-02 1.995 0.046367 *
conductivity 1.793e-05 1.031e-04 0.174 0.862007
carbon 1.466e-01 1.091e-01 1.344 0.179149
nitrogen -7.559e-03 1.313e-03 -5.755 1.15e-08 ***
phosphorus 7.730e-03 2.239e-03 3.452 0.000579 ***
potassium -2.520e-04 7.570e-05 -3.328 0.000906 ***
WHC 3.501e-02 4.011e-03 8.730 < 2e-16 ***
porosity 5.376e-03 4.962e-03 1.084 0.278821
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.5995 on 1004 degrees of freedom
Multiple R-squared: 0.2583, Adjusted R-squared: 0.2517
F-statistic: 38.86 on 9 and 1004 DF, p-value: < 2.2e-16
The summary() command gives the regression statistics. The p-values show that nitrogen,
phosphorus, potassium and WHC are significant at the .001 level and ph is significant at the
.05 level, while depth, conductivity, carbon and porosity are not significant. The multiple
R-squared is 0.2583, which means the model explains about 26% of the variation in class.
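The same reading of the summary can be done programmatically. The sketch below uses simulated data (the names y, x1, x2, x3 and the effect sizes are hypothetical) and pulls the p-values and R-squared out of the object returned by summary().

```r
# Simulated data; names and effect sizes are hypothetical.
set.seed(42)
n <- 300
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
y <- 1 + 0.8 * x1 - 0.5 * x2 + rnorm(n)   # x3 has no true effect
sim <- data.frame(y, x1, x2, x3)

fit_sim <- lm(y ~ ., data = sim)
s <- summary(fit_sim)

# Coefficient table with columns Estimate, Std. Error, t value, Pr(>|t|).
coefs <- s$coefficients
p_values <- coefs[, "Pr(>|t|)"]

# Predictors (excluding the intercept) significant at the .05 level.
significant <- setdiff(names(p_values)[p_values < 0.05], "(Intercept)")
print(significant)

# Share of the variance in y explained by the model.
print(s$r.squared)
```

A small p-value marks a predictor as significant, so the predictors listed in significant are the ones that would carry stars in the printed summary.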

RESIDUAL PLOT
plot(fit, which = 1)
RESULT
In this multiple regression analysis we examined the relationships among the variables and the
effect they have on one another. There is a strong positive relationship between ph and
conductivity (r = 0.76).

LOGISTIC REGRESSION
In simple terms, logistic regression is a statistical method for analysing the relationship
between one or more independent variables and an outcome that is measured as a dichotomous
variable.
The dependent variable is dichotomous, for example win/lose, yes/no or fail/pass.
Assumptions:
1. The dependent variable is dichotomous in nature.
2. There are no outliers in the data, since these have a negative impact on the model.
3. The outcome variable is coded as 0 and 1.
Logistic regression in mathematical terms:
\[ \operatorname{logit}(p) = \ln\left(\frac{p}{1-p}\right) = \beta_0 + \sum_{i=1}^{n} \beta_i x_i \]
where p is the probability that the outcome equals 1 and x_i, for i = 1 to n, are the
independent variables.
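The logit transformation and its inverse can be sketched in a few lines of R. The helper names logit and inv_logit are introduced here for illustration only; base R already provides the same functions as qlogis() and plogis().

```r
# logit(p) = log(p / (1 - p)); the inverse maps a linear predictor
# back to a probability between 0 and 1.
logit <- function(p) log(p / (1 - p))
inv_logit <- function(eta) 1 / (1 + exp(-eta))

p <- c(0.1, 0.5, 0.9)
eta <- logit(p)
print(eta)

# The hand-written helpers agree with the base R equivalents.
print(all.equal(eta, qlogis(p)))
print(all.equal(inv_logit(eta), plogis(eta)))
```

A probability of 0.5 corresponds to a logit of 0, and probabilities above or below 0.5 give positive or negative logits respectively.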
DATA SOURCE
https://data.gov.uk/dataset/victim-and-offender-gender-and-age
OBJECTIVE
The data set contains drug test reports for different age groups.
• To estimate the probability of the decision to quit or not quit drugs (the dependent
variable) from gender, age and level of drug addiction.
DATA INFORMATION
Dependent variable: Decision (whether the person quit drugs after the test score), coded as 0 = “no” and 1 = “yes”.
Independent variables:
Gender, coded as 0 = “male” and 1 = “female”.
Age
Drug, coded as 1 = “high” and 2 = “low” (whether the person is highly or weakly addicted to drugs).
SOFTWARE
SPSS software is used to produce and analyse the logistic regression output.

PROCEDURE OF LOGISTIC REGRESSION IN SPSS
1. Import the dataset into SPSS.
2. Click Analyze -> Regression -> Binary Logistic.
3. Move the dependent variable into the Dependent box and the independent variables into the
Covariates box below it, then click the Categorical button.
4. In the window that opens, move the categorical variables into the Categorical Covariates
box, select First as the reference category, then click Continue.
5. Click the Options button, tick Classification plots, Hosmer-Lemeshow goodness-of-fit,
Casewise listing of residuals and CI for exp(B), then click Continue and OK.
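For readers who work in R rather than SPSS, a roughly equivalent model can be fitted with glm(). This is only a sketch under stated assumptions: the data are simulated, the column names follow the report, and the "true" effects used to generate the outcome are hypothetical.

```r
# Simulated data; variable names follow the report, values are made up.
set.seed(7)
n <- 199
age <- round(runif(n, 18, 60))
gender <- factor(sample(c("male", "female"), n, replace = TRUE),
                 levels = c("male", "female"))   # first level = reference
drug <- factor(sample(c("high", "low"), n, replace = TRUE),
               levels = c("high", "low"))
# Hypothetical generating model: only gender affects the odds of quitting.
eta <- -1 + 1.5 * (gender == "female")
decision <- rbinom(n, size = 1, prob = plogis(eta))
sim <- data.frame(decision, age, gender, drug)

# Binary logistic regression with a logit link.
fit_logit <- glm(decision ~ age + gender + drug, data = sim, family = binomial)
print(summary(fit_logit)$coefficients)

# Odds ratios, the counterpart of the Exp(B) column in SPSS.
odds_ratios <- exp(coef(fit_logit))
print(round(odds_ratios, 3))
```

Setting the factor levels explicitly plays the same role as choosing First as the reference category in the SPSS Categorical dialog.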
OUTPUT OF LOGISTIC REGRESSION
CASE PROCESSING
This table shows the number of cases in our data set.
Case Processing Summary
Unweighted Cases(a)                        N     Percent
Selected Cases     Included in Analysis    199   100.0
                   Missing Cases           0     .0
                   Total                   199   100.0
Unselected Cases                           0     .0
Total                                      199   100.0
a. If weight is in effect, see classification table for the total number of cases.

CATEGORICAL VARIABLES CODINGS
This table shows how the independent categorical variables are coded.
Categorical Variables Codings
                          Frequency   Parameter coding (1)
drug addiction   high     91          .000
                 low      108         1.000
gender           male     121         .000
                 female   78          1.000
CLASSIFICATION TABLE
This table shows that 64.3 percent of cases are correctly classified overall before any
predictors are entered, which reflects the higher percentage of people answering no to quitting drugs.
OMNIBUS TESTS OF MODEL COEFFICIENTS
This table shows how well the model performs compared with a model that contains no
predictors. The significance value for this model is .000, which is less than .05, so the model
with predictors fits significantly better than the model that simply predicts no to quitting
drugs for every case. The Chi-square value is 24.022 with 3 degrees of freedom.
Omnibus Tests of Model Coefficients
Chi-square df Sig.
Step 1 Step 24.022 3 .000
Block 24.022 3 .000
Model 24.022 3 .000
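The significance value in this table can be checked from the chi-square statistic and its degrees of freedom. The sketch below uses only the two numbers reported above; the variable names are introduced here for illustration.

```r
chi_square <- 24.022   # model chi-square reported in the Omnibus table
df_model <- 3          # degrees of freedom reported in the Omnibus table

# Upper-tail probability of the chi-square distribution.
p_value <- pchisq(chi_square, df = df_model, lower.tail = FALSE)
print(p_value)

# SPSS displays p-values this small as .000.
print(p_value < 0.001)
```

A displayed value of .000 therefore means p < .001, not a p-value of exactly zero.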

HOSMER AND LEMESHOW TEST
The Hosmer and Lemeshow test indicates a poorly fitting model when its significance value is
less than .05, so a good model needs a significance value greater than .05. In this test the
Chi-square value is 1.730 with a significance level of .988, which is larger than .05 and
therefore supports the model.
Hosmer and Lemeshow Test
Step Chi-square df Sig.
1 1.730 8 .988
MODEL SUMMARY
This table gives useful information about the model and indicates how much of the variation
in the dependent variable it explains, on a scale from a minimum of 0 to a maximum of
approximately 1. The two R Square values in this table are .114 and .156, which means that
between 11.4 percent and 15.6 percent of the variability is explained by this set of variables.
Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      235.293(a)          .114                   .156
a. Estimation terminated at iteration number 4 because parameter estimates changed by less than .001.
CLASSIFICATION TABLE
This table shows how well the model predicts the correct category (quit drugs / did not quit
drugs) for each case. The model correctly classifies 69.3 percent of cases overall.
The sensitivity of the model is 62.0 percent (the percentage of people who quit drugs that the
model identifies correctly) and the specificity is 73.4 percent (the percentage of people who
did not quit drugs that the model identifies correctly).
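Sensitivity, specificity and overall accuracy all come from the four cells of the classification table. The cell counts below are not printed in the report; they are back-calculated from the reported percentages and the total of 199 cases, so they should be read as an assumption.

```r
# Counts back-calculated from the reported percentages (an assumption):
# 128 observed "no" and 71 observed "yes" out of 199 cases.
tab <- matrix(c(94, 34,
                27, 44),
              nrow = 2, byrow = TRUE,
              dimnames = list(observed = c("no", "yes"),
                              predicted = c("no", "yes")))

sensitivity <- tab["yes", "yes"] / sum(tab["yes", ])   # correct yes / observed yes
specificity <- tab["no", "no"] / sum(tab["no", ])      # correct no / observed no
accuracy <- sum(diag(tab)) / sum(tab)                  # all correct / all cases
baseline <- max(rowSums(tab)) / sum(tab)               # always predict larger group

print(round(100 * c(sensitivity = sensitivity, specificity = specificity,
                    accuracy = accuracy, baseline = baseline), 1))
```

The baseline figure is the accuracy obtained by assigning every case to the larger observed group, which is what the first classification table (with no predictors) reports.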

CASEWISE LIST
No casewise table was produced, which means there are no outliers and the model fits the sample well.
RESULT
This model contains three independent variables: age, gender and drug addiction. The full
model containing all the predictors was statistically significant, Chi-square (df = 3) = 24.022,
p < .001, which means the model was able to distinguish between people who quit drugs and
people who did not. The model explains between 11.4 % (Cox and Snell R Square) and 15.6 %
(Nagelkerke R Square) of the variance in the decision to quit. Of the three predictors, only
gender made a statistically significant contribution.
REFERENCES
1. Hosmer, D. & Lemeshow, S. (2000). Applied Logistic Regression (Second Edition). New York: John Wiley & Sons, Inc.
2. Long, J. Scott (1997). Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage Publications.
VARIABLES IN THE EQUATION
This table reports the Wald test, which assesses the contribution of each predictor variable.
We look for significance values less than .05.
In this table only one predictor is significant (gender, p = .000).
The B values are the logistic regression coefficients, analogous to the B values in multiple
regression. The sign of B shows whether the relationship is positive or negative.
Exp(B) is the odds ratio for each independent variable. For drug addiction, the odds of
answering yes for people with low addiction are .961 times the odds for people with high
addiction (the reference category). For gender, the odds of answering yes for females are
4.494 times the odds for males.
The table also gives the lower and upper limits of the 95% confidence interval for Exp(B).
For drug addiction (OR = .961) the interval ranges from .516 to 1.789, which means we can be
95 percent confident that the population odds ratio lies between .516 and 1.789. Because this
interval contains 1, the effect of drug addiction is not statistically significant.
Variables in the Equation
                                                                        95% C.I. for EXP(B)
                             B        S.E.   Wald     df   Sig.   Exp(B)   Lower   Upper
Step 1(a)  age               .001     .012   .003     1    .957   1.001    .977    1.025
           gender(1)         1.503    .317   22.530   1    .000   4.494    2.416   8.359
           drug addiction(1) -.040    .317   .016     1    .900   .961     .516    1.789
           Constant          -1.252   .583   4.617    1    .032   .286
a. Variable(s) entered on step 1: age, gender, drug addiction.
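The Wald, Exp(B) and confidence interval columns can all be derived from B and S.E. The sketch below does this for the gender(1) row, using the rounded values printed in the table, so the results agree with SPSS only up to rounding error.

```r
b  <- 1.503   # B for gender(1), from the table
se <- 0.317   # S.E. for gender(1), from the table

wald <- (b / se)^2                                    # Wald statistic, 1 df
p_value <- pchisq(wald, df = 1, lower.tail = FALSE)   # Sig.
odds_ratio <- exp(b)                                  # Exp(B)
ci <- exp(b + c(-1, 1) * qnorm(0.975) * se)           # 95% C.I. for Exp(B)

print(round(c(wald = wald, odds_ratio = odds_ratio,
              lower = ci[1], upper = ci[2]), 3))
```

Because the interval is computed on the log-odds scale and then exponentiated, it is not symmetric around the odds ratio.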

TWO-WAY ANOVA
Two-way ANOVA is an extension of one-way ANOVA. One-way ANOVA has one independent
variable, whereas two-way ANOVA has two independent variables. In this technique one
dependent variable depends on two independent variables.
Advantages of two-way ANOVA:
1) Using two-way ANOVA, we can examine the main effect of each independent variable.
2) We can also test for an interaction effect, which occurs when the effect of one independent
variable on the dependent variable depends on the level of the second independent variable.
ASSUMPTIONS
The key points for a two-way ANOVA are as follows:
1) The two independent variables are categorical. Gender, for example, is either female or
male, which can be coded as 1 and 0. In this data set, Irish classification and county
group are the two categorical variables.
2) The dependent variable is continuous, which means it can take a different value for every
case, as students' scores do. In this case it is the total number of student
enrolments.
3) Outliers have a negative impact on two-way ANOVA.
Two-Way ANOVA Table
DATA SOURCE
https://data.gov.ie/dataset/data-on-individual-schools

OBJECTIVE
The dataset covers total enrolment in mainstream national schools for the 2016/2017 school
year, broken down by county and Irish classification.
1) To compare the total number of enrolments between counties.
2) To find the difference in enrolments by Irish classification, that is, all
subjects through Irish or no subjects through Irish.
DATA INFORMATION
There are two categorical independent variables. One is Irish classification, coded as All
subjects through Irish = 1 and No subjects through Irish = 0. The other is county group,
coded as Donegal = 1, Galway = 2, Limerick = 3.
The dependent variable is the total number of students enrolled in 2016/2017.
SOFTWARE
SPSS software is used for this analysis.
PROCEDURE FOR TWO-WAY ANOVA IN SPSS
1. Import the data set into SPSS.
2. Click Analyze -> General Linear Model -> Univariate.
3. In the Univariate window, use the arrow to move Total Pupils into the Dependent Variable
box, and County and IrishClassification Description into the Fixed Factor(s) box.
4. Click the Options button, then in the Display section choose Descriptive statistics,
Estimates of effect size and Homogeneity tests, then click Continue.
5. Click Post Hoc, move County and IrishClassification Description into the Post Hoc Tests
box, choose Tukey and then click Continue.
6. Click the Plots button, put County on the Horizontal Axis and IrishClassification
Description in Separate Lines, then click Continue and after this click OK.
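The same kind of analysis can be sketched in R with aov() and TukeyHSD(). The data below are simulated: the factor names follow the report, but the cell means, group sizes and spread are hypothetical. Note that aov() reports sequential (Type I) sums of squares, whereas SPSS reports Type III; the two agree for a balanced design such as this simulated one but generally differ for unbalanced data.

```r
# Simulated data; factor names follow the report, values are made up.
set.seed(3)
county <- factor(rep(c("Donegal", "Galway", "Limerick"), each = 60))
irish <- factor(rep(rep(c("All subjects through Irish",
                          "No subjects through Irish"), each = 30), times = 3))
# Hypothetical cell means, including an interaction in Limerick.
cell_mean <- 100 + 20 * (county == "Galway") + 40 * (county == "Limerick") +
  60 * (county == "Limerick" & irish == "All subjects through Irish")
total_pupils <- cell_mean + rnorm(length(cell_mean), sd = 40)
sim <- data.frame(total_pupils, county, irish)

# Two main effects plus their interaction.
fit_aov <- aov(total_pupils ~ county * irish, data = sim)
print(summary(fit_aov))

# Tukey HSD post-hoc comparisons between the three counties.
tukey <- TukeyHSD(fit_aov, which = "county")
print(tukey)
```

The formula county * irish expands to both main effects and the county:irish interaction, which corresponds to placing both variables in the Fixed Factor(s) box.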

OUTPUT OF TWO-WAY ANOVA
DESCRIPTIVE STATISTICS
This table shows the mean, standard deviation and number of cases (N) for each county.
The mean for Donegal is 108.33, the mean for Galway is 130.95 and the mean for Limerick is 168.33.
The standard deviation for Limerick (148.851) is larger than those for Donegal (108.550) and
Galway (114.347). The largest standard deviation in the table, 195.876, belongs to the All
subjects through Irish group in Limerick.
Descriptive Statistics
Dependent Variable: Total Pupils
County     Irish Classification Description   Mean     Std. Deviation   N
Donegal    All subjects through Irish         77.86    74.570           37
           No subjects through Irish          116.68   114.962          135
           Total                              108.33   108.550          172
Galway     All subjects through Irish         132.09   143.605          44
           Total                              130.95   114.347          223
Limerick   All subjects through Irish         311.00   195.876          6
           Total                              168.33   148.851          134
Total      All subjects through Irish         121.37   135.098          87
           Total                              133.06   124.143          529
LEVENE’S TEST OF EQUALITY OF ERROR VARIANCES
Decision Rule:
If p <= .05, the variances are significantly different.
If p > .05, the variances are not significantly different.
The significance value in this table is .006, which is less than .05, so the homogeneity of
variance assumption is violated.
Levene's Test of Equality of Error Variances(a)
F       df1   df2   Sig.
3.309   5     523   .006
Tests the null hypothesis that the error variance of the dependent variable is equal across groups.
a. Design: Intercept + County + IrishClassificationDescription + County * IrishClassificationDescription
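Levene's test in its mean-centred form is simply a one-way ANOVA on the absolute deviations of each observation from its group mean, which can be sketched in base R without extra packages. The data and group labels below are simulated and hypothetical; for the two-way design in this report, the grouping factor would be the six county by classification cells, which is why df1 = 5.

```r
# Simulated data; group labels and values are hypothetical.
set.seed(11)
group <- factor(rep(c("A", "B", "C"), each = 50))
y <- rnorm(150, mean = 100, sd = rep(c(10, 10, 30), each = 50))

# Levene's test (mean-centred form): one-way ANOVA on the absolute
# deviations of each value from its own group mean.
abs_dev <- abs(y - ave(y, group))
levene <- anova(lm(abs_dev ~ group))
print(levene)

levene_p <- levene[["Pr(>F)"]][1]
print(levene_p < 0.05)   # TRUE suggests the group variances differ
```

Group C was simulated with three times the standard deviation of the other groups, so a small p-value is expected here.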

INTERACTION EFFECTS
The interaction effect shows the combined effect of the factors on the dependent variable. An
interaction effect is significant when its significance value is less than .05.
In the Effects table the significance value for County * IrishClassificationDescription is .003,
which is less than .05. This means the effect of county on enrolment differs significantly
between No subjects through Irish and All subjects through Irish schools.
Tests of Between-Subjects Effects
Source                                    Type III Sum of Squares   df    Mean Square   F         Sig.   Partial Eta Squared
Corrected Model                           444536.320(a)             5     88907.264     6.044     .000   .055
Intercept                                 3645418.517               1     3645418.517   247.837   .000   .322
County                                    378953.543                2     189476.772    12.882    .000   .047
IrishClassificationDescription            52848.123                 1     52848.123     3.593     .059   .007
County * IrishClassificationDescription   170988.424                2     85494.212     5.812     .003   .022
Error                                     7692784.621               523   14708.957
Total                                     17503582.000              529
Corrected Total                           8137320.941               528
a. R Squared = .055 (Adjusted R Squared = .046)
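The F, Sig. and Partial Eta Squared columns follow directly from the sums of squares and degrees of freedom. The sketch below recomputes them from the values in the table; the variable names are introduced here for illustration.

```r
# Sums of squares and degrees of freedom copied from the table.
ss <- c(county = 378953.543, irish = 52848.123, interaction = 170988.424)
df_effect <- c(county = 2, irish = 1, interaction = 2)
ss_error <- 7692784.621
df_error <- 523

ms_error <- ss_error / df_error
f_stat <- (ss / df_effect) / ms_error          # F = MS effect / MS error
p_value <- pf(f_stat, df_effect, df_error, lower.tail = FALSE)
partial_eta_sq <- ss / (ss + ss_error)         # SS effect / (SS effect + SS error)

print(round(cbind(F = f_stat, p = p_value, partial_eta_sq), 3))
```

Partial eta squared is an effect size, the share of variance attributable to an effect after excluding the other effects, and is read separately from the Sig. column.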
EFFECT SIZE
The effect size of Irish classification Description is .007 (in the Partial Eta Squared
column), which is a very small effect. Its main effect is not statistically significant,
because its significance value of .059 is greater than .05.

POST-HOC TEST
This table reports the Tukey Honestly Significant Difference (HSD) post-hoc test. Post-hoc
tests are needed when a factor has more than two groups, as county does here, to show which
groups differ. In the multiple comparisons table, Limerick differs significantly from both
Donegal (p = .000) and Galway (p = .014), while Donegal and Galway do not differ significantly
(p = .158). Mean differences with a significance value less than .05 are marked with an
asterisk.
Multiple Comparisons
Tukey HSD
                                                                       95% Confidence Interval
(I) County   (J) County   Mean Difference (I-J)   Std. Error   Sig.   Lower Bound   Upper Bound
Donegal      Galway       -22.61                  12.308       .158   -51.54        6.31
             Limerick     -60.00*                 13.974       .000   -92.84        -27.15
Galway       Donegal      22.61                   12.308       .158   -6.31         51.54
             Limerick     -37.38*                 13.256       .014   -68.54        -6.22
Limerick     Donegal      60.00*                  13.974       .000   27.15         92.84
             Galway       37.38*                  13.256       .014   6.22          68.54
Based on observed means.
The error term is Mean Square(Error) = 14708.957.
*. The mean difference is significant at the .05 level.

PLOTS
The plot is very useful for interpreting the output of a two-way ANOVA, because it makes the
relationship between the variables easy to see.
In this plot, enrolment for All subjects through Irish is slightly lower than for No subjects
through Irish in Donegal; there is almost no difference between All subjects through Irish and
No subjects through Irish in Galway; but in Limerick there is quite a large difference between
All subjects through Irish and No subjects through Irish.
RESULT
Two-way ANOVA was used to examine whether enrolment differs between counties and by Irish
classification. Irish classification has two levels: All subjects through Irish and No subjects
through Irish. County was divided into three groups: Donegal as 1, Galway as 2 and
Limerick as 3. The interaction effect between county and Irish classification was
statistically significant, F = 5.812, p = .003.
REFERENCES
1. Brett Lantz (2013) Machine Learning with R. Second Edition.
2. Pallant, J. (2016) SPSS Survival Manual. 6th Ed. New York, McGraw Hill Education.
