STATISTICS REPORT
ON
MULTIPLE REGRESSION,
LOGISTIC REGRESSION
AND TWO-WAY ANOVA
By: SONALI GUPTA
X01527245
MSc in Data Analytics
National College of Ireland
Table of Contents
MULTIPLE REGRESSION ANALYSIS
  DATA SOURCE
  OBJECTIVE
  DATA INFORMATION
  SOFTWARE
  DATA CLEANING
OUTPUT OF MULTIPLE REGRESSION
  DATA SUMMARY
  CORRELATION MATRIX
  PAIRWISE MATRIX OF SCATTER PLOT
  LINEAR MODEL FIT
  RESIDUAL PLOT
  RESULT
LOGISTIC REGRESSION
  DATA SOURCE
  OBJECTIVE
  DATA INFORMATION
  SOFTWARE
  PROCEDURE OF LOGISTIC REGRESSION IN SPSS
OUTPUT OF LOGISTIC REGRESSION
  CASE PROCESSING
  CATEGORICAL VARIABLES CODINGS
  CLASSIFICATION TABLE
  OMNIBUS TESTS OF MODEL COEFFICIENTS
  HOSMER AND LEMESHOW TEST
  MODEL SUMMARY
  CLASSIFICATION TABLE
  VARIABLES IN THE EQUATION
  CASEWISE LIST
  RESULT
TWO-WAY ANOVA
  DATA SOURCE
  OBJECTIVE
  DATA INFORMATION
  SOFTWARE
  PROCEDURE FOR TWO-WAY ANOVA IN SPSS
OUTPUT OF TWO-WAY ANOVA
  DESCRIPTIVE STATISTICS
  LEVENE'S TEST OF EQUALITY OF ERROR VARIANCES
  INTERACTION EFFECTS
  PLOTS
  RESULT
MULTIPLE REGRESSION ANALYSIS
In simple terms, multiple regression describes the relationship between one continuous
dependent variable and two or more independent variables.
DATA SOURCE
This analysis uses soil solution chemistry data from the UK Environmental Change Network.
The dataset is available at:
https://data.gov.uk/dataset/uk-environmental-change-network-ecn-soil-solution-chemistry-data-1992-2012
The data was provided in CSV format.
OBJECTIVE
This dataset was selected so that we can analyse which factors soil quality depends on.
The objectives of this analysis are to:
• Study the various components of soil.
• Study how class type relates to the different components.
• Study the relationships among all the components.
DATA INFORMATION
This dataset contains 1014 observations and 12 variables, including:
1) Type: the type of soil, e.g. Bixa, Sodic, Grassland, Horticulture, Lawn.
2) pH: potential of hydrogen, an acidity scale from 0 to 14.
3) Conductivity
4) Carbon
5) Nitrogen
6) Phosphorus
7) Potassium
8) WHC
9) Class type: high fertile, medium fertile or low fertile.
SOFTWARE
R is a simple, effective, open-source language that is widely used for data manipulation,
data handling, data visualization, statistical analysis and graphics.
In RStudio, the data was loaded with the read.table command, as shown below:
soil <- read.table("soil_15nov_ANN.CSV", sep = ",", header = TRUE)
DATA CLEANING
The dataset has many NA values, so it was cleaned with the complete.cases() function, which
flags the rows that contain no NA values; only those rows were kept. The class type column
holds strings, so that column was also removed.
clean <- complete.cases(soil)
soil_data <- soil[clean, ]
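As a minimal sketch of what complete.cases() does, the example below applies it to a small made-up data frame. The name toy_soil and all of its values are assumptions for illustration only and are not taken from the soil dataset.

```r
# Hypothetical toy data; the values are invented for illustration only.
toy_soil <- data.frame(
  ph           = c(8.2, NA, 9.7, 8.8),
  conductivity = c(265, 410, NA, 660),
  carbon       = c(0.12, 0.24, 0.59, 0.40)
)

# TRUE for every row that has no NA in any column
keep <- complete.cases(toy_soil)

# Keep only the complete rows
toy_clean <- toy_soil[keep, ]

print(keep)
print(nrow(toy_clean))
```

Rows 2 and 3 each contain one NA, so only rows 1 and 4 survive. Dropping whole rows is simple, but it discards every other value in the row as well, which is worth checking when many rows are incomplete.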
OUTPUT OF MULTIPLE REGRESSION
DATA SUMMARY
This table summarises the dataset, giving the minimum, 1st quartile, median, mean, 3rd
quartile and maximum value of each component.
summary(soil_data)
          depth   ph     conductivity   carbon   nitrogen   phosphorus   potassium   WHC    porosity
Min.      1       5.6    40             0.015    3.98       4            60          27.1   29.3
1st Qu.   1       8.2    265            0.12     17         14.43        200         42     40
Median    2       8.81   410            0.235    30.52      19.84        290         46.3   44.5
Mean      2.4     8.8    499.1          0.395    37.08      24.29        379.9       47.4   44.96
3rd Qu.   3       9.7    660            0.589    50.4       32.3         400         52.5   49
Max.      4       11.5   1720           2.35     185        82.42        3000        76.8   65.72
CORRELATION MATRIX
Correlation measures the strength of the linear relationship between two variables and shows
whether that relationship is positive or negative.
cor(soil_data)
               depth         ph           conductivity   carbon       nitrogen
depth           1.00000000    0.3513589    0.43664406    -0.5636996   -0.5411645
ph              0.35135888    1.0000000    0.75864988    -0.7100049   -0.6449397
conductivity    0.43664406    0.7586499    1.00000000    -0.5188081   -0.4582881
carbon         -0.56369961   -0.7100049   -0.51880806     1.0000000    0.7896135
nitrogen       -0.54116446   -0.6449397   -0.45828809     0.7896135    1.0000000
phosphorus     -0.32741495   -0.6335258   -0.51024771     0.6872795    0.6610887
potassium      -0.20167701   -0.5825826   -0.31843323     0.6639411    0.4568622
WHC            -0.22324381   -0.2304440   -0.27958500     0.2367519    0.2386911
porosity       -0.20642418   -0.2157805   -0.27194201     0.2055014    0.2414338
class           0.07352693    0.1423210    0.04065294    -0.1363034   -0.1771025
               phosphorus    potassium    WHC          porosity     class
depth          -0.32741496   -0.2016770   -0.2232438   -0.2064242    0.073526927
ph             -0.63352582   -0.5825826   -0.2304440   -0.2157805    0.14232104
conductivity   -0.51024770   -0.3184332   -0.2795850   -0.2719420    0.040652945
carbon          0.687279522   0.6639411    0.2367519    0.2055014   -0.136303384
nitrogen        0.661088713   0.4568622    0.2386911    0.2414338   -0.177102523
phosphorus      1.0000000     0.6125755    0.3444425    0.3380200    0.001868296
potassium       0.612575523   1.0000000    0.1906594    0.1153716   -0.117629191
WHC             0.344442511   0.1906594    1.0000000    0.8128219    0.395313489
porosity        0.338020020   0.1153716    0.8128219    1.0000000    0.33599933
class           0.001868296  -0.1176292    0.3953135    0.3359993    1.000000000
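To illustrate how to read a correlation matrix, the sketch below runs cor() on a tiny made-up data frame. The name toy and its values are assumptions chosen so that one pair rises together and another pair moves in opposite directions; they are not taken from the soil data.

```r
# Hypothetical values, invented for illustration only.
toy <- data.frame(
  ph           = c(7.9, 8.3, 8.8, 9.4, 10.1),
  conductivity = c(210, 300, 420, 640, 900),
  carbon       = c(0.90, 0.55, 0.30, 0.20, 0.05)
)

r <- cor(toy)   # Pearson correlation is the default method
print(round(r, 2))

# The matrix is symmetric: each pair appears twice with the same value and sign
print(isSymmetric(r))
```

Because the matrix is symmetric, the entry for row potassium and column conductivity must equal the entry for row conductivity and column potassium, which is a quick way to spot a transcription error in a printed matrix.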
PAIRWISE MATRIX OF SCATTER PLOT
The command below draws a scatter plot for every pair of components, which makes the
relationships between them easy to inspect.
pairs(soil_data[c("depth", "ph", "conductivity", "carbon", "nitrogen", "phosphorus", "WHC", "porosity", "class")])
This figure shows that as pH rises, conductivity also rises, which indicates a positive
correlation (r = 0.76). A negative correlation appears when one variable falls as the other
rises, as with pH and carbon (r = -0.71). There is almost no correlation between class and
depth (r = 0.07).
LINEAR MODEL FIT
fit <- lm(class ~ ., data = soil_data)
fit
Call:
lm(formula = class ~ ., data = soil_data)
Coefficients:
 (Intercept)         depth            ph  conductivity        carbon
  -1.291e-01     2.770e-02     6.608e-02     1.793e-05     1.466e-01
    nitrogen    phosphorus     potassium           WHC      porosity
  -7.559e-03     7.730e-03    -2.519e-04     3.501e-02     5.376e-03
summary(fit)
Call:
lm(formula = class ~ ., data = soil_data)
Residuals:
Min 1Q Median 3Q Max
-1.29350 -0.32275 0.00726 0.43999 1.29096
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.291e-01 3.360e-01 -0.384 0.701002
depth 2.770e-02 2.309e-02 1.199 0.230659
ph 6.608e-02 3.313e-02 1.995 0.046367 *
conductivity 1.793e-05 1.031e-04 0.174 0.862007
carbon 1.466e-01 1.091e-01 1.344 0.179149
nitrogen -7.559e-03 1.313e-03 -5.755 1.15e-08 ***
phosphorus 7.730e-03 2.239e-03 3.452 0.000579 ***
potassium -2.520e-04 7.570e-05 -3.328 0.000906 ***
WHC 3.501e-02 4.011e-03 8.730 < 2e-16 ***
porosity 5.376e-03 4.962e-03 1.084 0.278821
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.5995 on 1004 degrees of freedom
Multiple R-squared: 0.2583, Adjusted R-squared: 0.2517
F-statistic: 38.86 on 9 and 1004 DF, p-value: < 2.2e-16
The summary() command reports the regression statistics. Nitrogen, phosphorus, potassium and
WHC are highly significant predictors of class (p < .001) and pH is significant at the .05
level (p = .046), whereas depth, conductivity, carbon and porosity are not significant
(p > .05). The multiple R-squared of 0.2583 means the model explains about 26% of the
variation in class.
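The sketch below shows one way to pull the significant predictors and the R-squared out of a fitted model instead of reading them off the printout. It uses simulated data, not the soil dataset; the names x1, x2, y and fit_toy are assumptions for illustration, and y is built to depend on x1 only.

```r
set.seed(1)  # simulated data, not the soil dataset
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 + rnorm(n)   # y depends on x1 only

fit_toy <- lm(y ~ x1 + x2)
coefs   <- summary(fit_toy)$coefficients

# p-values of the predictors (first row is the intercept, so drop it)
p_values    <- coefs[-1, "Pr(>|t|)"]
significant <- names(p_values)[p_values < 0.05]
print(significant)

# Share of the variance in y explained by the model
r_squared <- summary(fit_toy)$r.squared
print(r_squared)
```

A small p-value marks a predictor as significant, so in the soil model the rows with three stars are the strongest predictors, not the weakest.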
RESIDUAL PLOT
plot(fit, which = 1)
RESULT
In this multiple regression analysis, we looked for the relationships between the different
variables and the impact they have on one another. There is a strong relationship between pH
and conductivity.
LOGISTIC REGRESSION
In simple terms, logistic regression is a statistical method used to analyse the relationship
between one or more independent variables and an outcome that is measured with a dichotomous
variable.
The dependent variable is dichotomous, e.g. win/lose, yes/no, fail/pass.
Assumptions:
1. The dependent variable is always dichotomous in nature.
2. There are no outliers in the data, since they would distort the model.
3. The outcome variable is coded as 0 and 1.
Logistic regression in mathematical terms:
\[ \operatorname{logit}(p) = \ln\left(\frac{p}{1-p}\right) = \beta_0 + \sum_{i=1}^{n} \beta_i x_i \]
where p is the probability that the outcome equals 1 and x_i, for i = 1 to n, are the
independent variables.
DATA SOURCE
https://data.gov.uk/dataset/victim-and-offender-gender-and-age
OBJECTIVE
The dataset contains drug test reports for different age groups.
• To model the probability of the decision to quit or not quit the drug (dependent
variable) using gender, age and the person's level of drug addiction.
DATA INFORMATION
Dependent variable: Decision (whether the person quit drugs after the test score), coded as 0 = "no" and 1 = "yes".
Independent variables:
Gender, coded as 0 = "male" and 1 = "female".
Age
Drug, coded as 1 = "high" and 2 = "low" (how strongly the person is addicted to drugs).
SOFTWARE
SPSS was used to run the logistic regression and analyse its output.
PROCEDURE OF LOGISTIC REGRESSION IN SPSS
1. Import the dataset (screenshot below).
2. Click Analyze -> Regression -> Binary Logistic.
3. Move the dependent variable into the Dependent box, move the independent variables into
the Covariates box in the window below, and click the Categorical button.
4. In the window that opens, move the categorical variables into the Categorical Covariates
box, select First, then click Continue.
5. Click the Options button, tick Classification plots, Hosmer-Lemeshow goodness-of-fit,
Casewise listing and CI for exp(B), then click Continue and OK.
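For readers without SPSS, the steps above can be approximated in R with glm(). This is only a sketch under stated assumptions: the data are simulated rather than the report's dataset, the variable names (decision, age, gender, drug) and the simulation coefficients -1.2 and 1.5 are invented for illustration, and the first factor level is used as the reference category.

```r
set.seed(42)  # simulated data, not the report's dataset
n      <- 199
age    <- round(runif(n, 18, 60))
gender <- factor(sample(c("male", "female"), n, replace = TRUE),
                 levels = c("male", "female"))   # male is the reference level
drug   <- factor(sample(c("high", "low"), n, replace = TRUE),
                 levels = c("high", "low"))      # high is the reference level

# Simulate a 0/1 outcome that depends on gender only (assumed coefficients)
p        <- plogis(-1.2 + 1.5 * (gender == "female"))
decision <- rbinom(n, 1, p)

# Binary logistic regression: family = binomial uses the logit link by default
model <- glm(decision ~ age + gender + drug, family = binomial)
print(summary(model)$coefficients)

# Odds ratios, the counterpart of the Exp(B) column in SPSS
odds_ratios <- exp(coef(model))
print(odds_ratios)
```

The coefficient named genderfemale compares females with the reference level male, which mirrors the gender(1) row that SPSS prints when the first category is the reference.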
OUTPUT OF LOGISTIC REGRESSION
CASE PROCESSING
This table shows the number of cases in the dataset.
Case Processing Summary
Unweighted Cases(a)                       N     Percent
Selected Cases     Included in Analysis   199   100.0
                   Missing Cases          0     .0
                   Total                  199   100.0
Unselected Cases                          0     .0
Total                                     199   100.0
a. If weight is in effect, see classification table for the total number of cases.
CATEGORICAL VARIABLES CODINGS
This table shows how the categorical independent variables are coded.
Categorical Variables Codings
                          Frequency   Parameter coding (1)
drug addiction   high     91          .000
                 low      108         1.000
gender           male     121         .000
                 female   78          1.000
CLASSIFICATION TABLE
This table shows that the overall percentage of correctly classified cases is 64.3 percent,
and that the larger share of people answered no to quitting drugs.
OMNIBUS TESTS OF MODEL COEFFICIENTS
This table shows how well the model performs compared with a model that contains no
predictors. The significance value is .000, which is less than .05, so the model with
predictors fits significantly better than the baseline model, which predicts no to quitting
drugs. The chi-square value is 24.022 with 3 degrees of freedom.
Omnibus Tests of Model Coefficients
                 Chi-square   df   Sig.
Step 1   Step    24.022       3    .000
         Block   24.022       3    .000
         Model   24.022       3    .000
HOSMER AND LEMESHOW TEST
For this test, a significance value below .05 indicates a poorly fitting model, so we want a
significance value greater than .05. Here the chi-square value is 1.730 with a significance
level of .988, which is larger than .05 and therefore supports the model.
Hosmer and Lemeshow Test
Step   Chi-square   df   Sig.
1      1.730        8    .988
MODEL SUMMARY
This table gives useful information about the model and indicates how much of the variation
in the dependent variable it explains, on a scale from a minimum of 0 to a maximum of 1. The
two R Square values, .114 and .156, mean that between 11.4 percent and 15.6 percent of the
variability is explained by this set of variables.
Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      235.293(a)          .114                   .156
a. Estimation terminated at iteration number 4 because parameter estimates changed by less than .001.
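As a worked check, both pseudo R-squared values can be recomputed from numbers already reported: the -2 log likelihood (235.293), the omnibus model chi-square (24.022) and the 199 included cases. The sketch assumes the standard Cox & Snell and Nagelkerke definitions; the variable names are illustrative.

```r
n          <- 199       # cases included in the analysis
dev_model  <- 235.293   # -2 log likelihood of the fitted model
chi_square <- 24.022    # omnibus model chi-square
dev_null   <- dev_model + chi_square   # -2 log likelihood of the null model

# Cox & Snell: 1 - (L_null / L_model)^(2/n), written in terms of deviances
cox_snell <- 1 - exp(-chi_square / n)

# Nagelkerke rescales Cox & Snell by its maximum attainable value
nagelkerke <- cox_snell / (1 - exp(-dev_null / n))

print(round(cox_snell, 3))
print(round(nagelkerke, 3))
```

Both results agree with the Model Summary table to three decimals. Nagelkerke is always the larger of the two because Cox & Snell cannot reach 1 for a binary outcome.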
CLASSIFICATION TABLE
This table shows how well the model predicts the correct category (quit drugs / not quit
drugs) for each case. Overall, the model classifies 69.3 percent of cases correctly.
The model correctly identifies 62.0 percent of the people who quit drugs (the sensitivity of
the model) and 73.4 percent of the people who did not quit drugs (the specificity of the
model).
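The sketch below shows how sensitivity, specificity and overall accuracy follow from the four cells of a classification table. The cell counts are an assumption: they were back-calculated from the reported percentages and N = 199, and are not copied from the SPSS output.

```r
# Counts back-calculated from the reported percentages and N = 199;
# they are an assumption, not values copied from the SPSS table.
true_pos  <- 44   # observed yes, predicted yes
false_neg <- 27   # observed yes, predicted no
true_neg  <- 94   # observed no,  predicted no
false_pos <- 34   # observed no,  predicted yes

sensitivity <- true_pos / (true_pos + false_neg)
specificity <- true_neg / (true_neg + false_pos)
accuracy    <- (true_pos + true_neg) /
               (true_pos + false_neg + true_neg + false_pos)

print(round(100 * c(sensitivity, specificity, accuracy), 1))
```

With these assumed counts the three percentages come out as 62.0, 73.4 and 69.3, matching the figures quoted above.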
CASEWISE LIST
No casewise list table was produced, which means the model fits the sample well and there are no outliers.
RESULT
This model contains three independent variables: age, gender and drug addiction. The full
model was statistically significant, chi-square (df = 3, N = 199) = 24.022, p < .001, which
means the model was able to distinguish between people who decided to quit drugs and people
who did not. Of the three predictors, only gender made a statistically significant
contribution (p < .001). The model explains between 11.4% (Cox and Snell R square) and 15.6%
(Nagelkerke R square) of the variance in the decision to quit.
VARIABLES IN THE EQUATION
This table reports the Wald test, which shows the contribution of each predictor variable. We
look for significance values below .05.
In this table only one predictor is significant: gender (p = .000).
The B values are the regression coefficients, comparable to those in multiple regression.
Their sign gives the direction of the relationship: positive or negative.
Exp(B) is the odds ratio for each independent variable. For drug addiction, the odds of
answering yes for people with low addiction are .961 times the odds for people with high
addiction, and this difference is not significant (p = .900). For gender, the odds of
answering yes for females are 4.494 times the odds for males.
The 95% confidence interval for Exp(B) gives a lower and an upper value. For drug addiction
(OR = .961) it ranges from .516 to 1.789, which means we can be 95% confident that the odds
ratio in the population lies between .516 and 1.789.
Variables in the Equation
                                B        S.E.   Wald     df   Sig.   Exp(B)   95% C.I. for EXP(B)
                                                                              Lower    Upper
Step 1(a)   age                 .001     .012   .003     1    .957   1.001    .977     1.025
            gender(1)           1.503    .317   22.530   1    .000   4.494    2.416    8.359
            drug addiction(1)   -.040    .317   .016     1    .900   .961     .516     1.789
            Constant            -1.252   .583   4.617    1    .032   .286
a. Variable(s) entered on step 1: age, gender, drug addiction.
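The columns of this table are linked by simple formulas, which the sketch below applies to the gender(1) row. It assumes the usual definitions (Wald = (B / S.E.)^2, Exp(B) = e^B, and a normal-approximation confidence interval); because B and S.E. are only printed to three decimals, the results match the table approximately rather than exactly.

```r
b  <- 1.503   # B for gender(1), from the table
se <- 0.317   # S.E. for gender(1), from the table

odds_ratio <- exp(b)                       # Exp(B)
z          <- qnorm(0.975)                 # about 1.96 for a 95% interval
ci         <- exp(b + c(-1, 1) * z * se)   # lower and upper bound for Exp(B)
wald       <- (b / se)^2                   # Wald chi-square with 1 df
p_value    <- pchisq(wald, df = 1, lower.tail = FALSE)

print(round(odds_ratio, 3))
print(round(ci, 3))
print(round(wald, 2))
print(p_value < 0.001)
```

The interval for gender lies entirely above 1, which is another way of seeing that the effect is significant, whereas the interval for drug addiction (.516 to 1.789) contains 1.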
TWO-WAY ANOVA
Two-way ANOVA is an extension of one-way ANOVA. One-way ANOVA has one independent variable,
whereas two-way ANOVA has two. In this technique one dependent variable depends on two
independent variables.
Advantages of two-way ANOVA:
1) We can examine the main effect of each independent variable.
2) We can test for an interaction effect, which occurs when the effect of one independent
variable on the dependent variable depends on the level of the second independent variable.
ASSUMPTIONS
The key points for a two-way ANOVA are as follows:
1) The two independent variables are always categorical. Gender, for example, has the
categories female and male, which can be coded as 1 and 0. In this dataset, Irish
classification and county group are the two categorical variables.
2) The dependent variable is always continuous, which means it can take a different value
for every case, such as students' scores. In this case it is the total number of student
enrolments.
3) Outliers have a negative impact on the two-way ANOVA.
Two-Way ANOVA Table
DATA SOURCE
https://data.gov.ie/dataset/data-on-individual-schools
OBJECTIVE
The dataset covers total enrolment in mainstream national schools for the 2016/2017 school
year, broken down by county and Irish classification.
1) To compare the total number of enrolments across counties.
2) To find the difference in enrolments by Irish classification, i.e. all subjects taught
through Irish or no subjects taught through Irish.
DATA INFORMATION
There are two categorical independent variables. One is Irish classification, coded as All
subjects through Irish = 1 and No subjects through Irish = 0. The other is county group,
coded as Donegal = 1, Galway = 2, Limerick = 3.
The dependent variable is the total number of students enrolled in 2016/2017.
SOFTWARE
SPSS was used for this analysis.
PROCEDURE FOR TWO-WAY ANOVA IN SPSS
1. Import the dataset into SPSS (screenshot below).
2. Click Analyze -> General Linear Model -> Univariate.
3. In the Univariate window, use the arrows to move Total Pupils into the Dependent Variable
box and County and Irish Classification Description into the Fixed Factor(s) box.
4. Click the Options button, choose Descriptive statistics, Estimates of effect size and
Homogeneity tests in the Display section, then click Continue.
5. Click Post Hoc, move County and IrishClassification Description into the Post Hoc Tests
box, choose Tukey in the window below, then click Continue.
6. Click the Plots button, put County on the Horizontal Axis and IrishClassification
Description in Separate Lines, click Continue and then click OK.
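A rough R counterpart of the steps above is sketched here with aov(). The data are simulated, not the schools dataset, and the names county, irish and pupils as well as the simulated effect sizes are assumptions for illustration.

```r
set.seed(7)  # simulated enrolments, not the schools dataset
county <- factor(rep(c("Donegal", "Galway", "Limerick"), each = 40))
irish  <- factor(rep(c("All", "None"), times = 60))
pupils <- 120 + 30 * (county == "Limerick") + rnorm(120, sd = 25)

# county * irish expands to both main effects plus their interaction
fit_aov <- aov(pupils ~ county * irish)
tab     <- summary(fit_aov)[[1]]
print(tab)
```

One design difference is worth noting: aov() reports sequential (Type I) sums of squares, while SPSS reports Type III by default. The two agree in a balanced design like this simulated one, but they can differ for unbalanced data such as the school counts in this report.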
OUTPUT OF TWO-WAY ANOVA
DESCRIPTIVE STATISTICS
This table shows the mean, standard deviation and number of cases (N) for each county.
The mean for Donegal is 108.33, the mean for Galway is 130.95 and the mean for Limerick is
168.33.
The largest standard deviation, 195.876, belongs to Limerick schools that teach all subjects
through Irish. Taking each county as a whole, Limerick also has the largest standard
deviation (148.851), ahead of Galway (114.347) and Donegal (108.550).
Descriptive Statistics
Dependent Variable: Total Pupils
County     Irish Classification Description   Mean     Std. Deviation   N
Donegal    All subjects through Irish         77.86    74.570           37
           No subjects through Irish          116.68   114.962          135
           Total                              108.33   108.550          172
Galway     All subjects through Irish         132.09   143.605          44
           No subjects through Irish          130.66   106.420          179
           Total                              130.95   114.347          223
Limerick   All subjects through Irish         311.00   195.876          6
           No subjects through Irish          161.64   143.826          128
           Total                              168.33   148.851          134
Total      All subjects through Irish         121.37   135.098          87
           No subjects through Irish          135.36   121.903          442
           Total                              133.06   124.143          529
LEVENE’S TEST OF EQUALITY OF ERROR VARIANCES
Decision rule:
If p < .05, the variances are significantly different.
If p >= .05, the variances are not significantly different.
The significance value in this table is .006, which is less than .05, so the homogeneity of
variance assumption is violated.
Levene's Test of Equality of Error Variances(a)
Dependent Variable: Total Pupils
F       df1   df2   Sig.
3.309   5     523   .006
Tests the null hypothesis that the error variance of the dependent variable is equal across groups.
a. Design: Intercept + County + IrishClassificationDescription + County * IrishClassificationDescription
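Levene's test can be reproduced in base R as a one-way ANOVA on the absolute deviations from each group mean. The sketch below uses simulated data with one deliberately noisier group; the names group, y and abs_dev are assumptions, and this is the mean-centred form of the test, written out by hand rather than taken from a package.

```r
set.seed(3)  # simulated data, not the schools dataset
group <- factor(rep(c("a", "b", "c"), each = 30))
y     <- rnorm(90, mean = 100, sd = rep(c(10, 10, 40), each = 30))

# Absolute deviation of each value from its own group mean
abs_dev <- abs(y - ave(y, group))

# A one-way ANOVA on the deviations gives the Levene F statistic
levene <- anova(lm(abs_dev ~ group))
print(levene)
```

In a two-way design the groups are the cells formed by both factors, for example interaction(county, irish). With 3 counties and 2 classifications that gives 6 cells, which is where df1 = 5 and df2 = 529 - 6 = 523 in the table above come from.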
INTERACTION EFFECTS
An interaction effect is the combined effect of the factors on the dependent variable. The
interaction is statistically significant when its significance value is less than .05.
In the effects table, the significance value for County * IrishClassificationDescription is
.003, which is less than .05. This means the effect of county on enrolment differs
significantly between schools teaching no subjects through Irish and schools teaching all
subjects through Irish.
Tests of Between-Subjects Effects
Dependent Variable: Total Pupils
Source                                    Type III Sum of Squares   df    Mean Square   F         Sig.   Partial Eta Squared
Corrected Model                           444536.320(a)             5     88907.264     6.044     .000   .055
Intercept                                 3645418.517               1     3645418.517   247.837   .000   .322
County                                    378953.543                2     189476.772    12.882    .000   .047
IrishClassificationDescription            52848.123                 1     52848.123     3.593     .059   .007
County * IrishClassificationDescription   170988.424                2     85494.212     5.812     .003   .022
Error                                     7692784.621               523   14708.957
Total                                     17503582.000              529
Corrected Total                           8137320.941               528
a. R Squared = .055 (Adjusted R Squared = .046)
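The F, Sig. and Partial Eta Squared columns can be recomputed from the sums of squares and degrees of freedom in the table. The sketch below does this for the two main effects and the interaction, assuming the usual formulas F = MS_effect / MS_error and partial eta squared = SS_effect / (SS_effect + SS_error); the shortened row labels are illustrative.

```r
# Sums of squares and degrees of freedom taken from the table above
ss_effect <- c(County = 378953.543, Irish = 52848.123, Interaction = 170988.424)
df_effect <- c(County = 2, Irish = 1, Interaction = 2)
ss_error  <- 7692784.621
df_error  <- 523

ms_error    <- ss_error / df_error
f_value     <- (ss_effect / df_effect) / ms_error
p_value     <- pf(f_value, df_effect, df_error, lower.tail = FALSE)
partial_eta <- ss_effect / (ss_effect + ss_error)

print(round(cbind(F = f_value, Sig = p_value, PartialEta2 = partial_eta), 3))
```

The recomputed values agree with the table to three decimals. They also make the distinction between the last two columns explicit: Sig. is the p-value that is compared with .05, while partial eta squared is an effect size and is not compared with .05.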
EFFECT SIZE
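The Partial Eta Squared column gives the effect size. For Irish Classification Description it
is .007, a very small effect, and the main effect itself is not statistically significant
(Sig. = .059, which is greater than .05). The effect sizes for county (.047) and for the
interaction (.022) are larger but still modest.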
POST-HOC TEST
This table shows the Tukey Honestly Significant Difference (HSD) post hoc test. Post hoc
tests are only needed when an independent variable has more than two levels, as county does
here. In the multiple comparisons table, Limerick differs significantly from both Donegal
(p = .000) and Galway (p = .014), while the difference between Donegal and Galway is not
significant (p = .158). Mean differences with a significance value below .05 are marked with
an asterisk.
Multiple Comparisons
Dependent Variable: Total Pupils
Tukey HSD
                                                                       95% Confidence Interval
(I) County   (J) County   Mean Difference (I-J)   Std. Error   Sig.   Lower Bound   Upper Bound
Donegal      Galway       -22.61                  12.308       .158   -51.54        6.31
             Limerick     -60.00*                 13.974       .000   -92.84        -27.15
Galway       Donegal      22.61                   12.308       .158   -6.31         51.54
             Limerick     -37.38*                 13.256       .014   -68.54        -6.22
Limerick     Donegal      60.00*                  13.974       .000   27.15         92.84
             Galway       37.38*                  13.256       .014   6.22          68.54
Based on observed means.
The error term is Mean Square(Error) = 14708.957.
*. The mean difference is significant at the .05 level.
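A comparable post hoc test is available in R through TukeyHSD(). The sketch below runs it on simulated data with three groups; the group means, the standard deviation and the names county, pupils and fit_one are assumptions for illustration and do not reproduce the schools data.

```r
set.seed(11)  # simulated data, not the schools dataset
county <- factor(rep(c("Donegal", "Galway", "Limerick"), each = 50))
pupils <- rnorm(150, mean = rep(c(110, 130, 170), each = 50), sd = 40)

fit_one <- aov(pupils ~ county)
tukey   <- TukeyHSD(fit_one, "county", conf.level = 0.95)

# One row per pair of counties; columns are diff, lwr, upr and p adj
print(tukey$county)
```

R prints each pair once, as the later factor level minus the earlier one, whereas SPSS lists every pair in both directions, so the signs in R correspond to the lower half of the SPSS table.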
PLOTS
A plot is very useful for interpreting the output of a two-way ANOVA, because it makes the
relationship between the variables easy to see.
The plot shows that in Donegal, schools teaching all subjects through Irish have somewhat
lower enrolment than schools teaching no subjects through Irish. In Galway there is almost no
difference between the two. In Limerick there is a large difference, with much higher
enrolment in schools teaching all subjects through Irish.
RESULT
A two-way ANOVA was used to find out whether enrolment differs between counties according to
Irish classification. Irish classification has two groups: All subjects through Irish and No
subjects through Irish. County has three groups: Donegal = 1, Galway = 2 and Limerick = 3.
The interaction effect between county and Irish classification was statistically significant,
F(2, 523) = 5.812, p = .003.
More Related Content

PDF
Storytelling For The Web: Integrate Storytelling in your Design Process
PDF
Artificial Intelligence, Data and Competition – SCHREPEL – June 2024 OECD dis...
PDF
How to Leverage AI to Boost Employee Wellness - Lydia Di Francesco - SocialHR...
PDF
2024 Trend Updates: What Really Works In SEO & Content Marketing
PDF
Project crm submission sonali
PDF
HBase Mongo_DB Project
PDF
Dwbi Project
PDF
Salesforce
Storytelling For The Web: Integrate Storytelling in your Design Process
Artificial Intelligence, Data and Competition – SCHREPEL – June 2024 OECD dis...
How to Leverage AI to Boost Employee Wellness - Lydia Di Francesco - SocialHR...
2024 Trend Updates: What Really Works In SEO & Content Marketing
Project crm submission sonali
HBase Mongo_DB Project
Dwbi Project
Salesforce

Recently uploaded (20)

PPT
ISS -ESG Data flows What is ESG and HowHow
PDF
Introduction to Data Science and Data Analysis
PPTX
MODULE 8 - DISASTER risk PREPAREDNESS.pptx
PPTX
Introduction to Knowledge Engineering Part 1
PPTX
Supervised vs unsupervised machine learning algorithms
PPTX
The THESIS FINAL-DEFENSE-PRESENTATION.pptx
PPTX
01_intro xxxxxxxxxxfffffffffffaaaaaaaaaaafg
PPTX
oil_refinery_comprehensive_20250804084928 (1).pptx
PPTX
modul_python (1).pptx for professional and student
PPTX
Qualitative Qantitative and Mixed Methods.pptx
PPTX
IBA_Chapter_11_Slides_Final_Accessible.pptx
PPTX
Data_Analytics_and_PowerBI_Presentation.pptx
PPTX
AI Strategy room jwfjksfksfjsjsjsjsjfsjfsj
PDF
168300704-gasification-ppt.pdfhghhhsjsjhsuxush
PPTX
STERILIZATION AND DISINFECTION-1.ppthhhbx
PPTX
Computer network topology notes for revision
PPTX
Introduction to Firewall Analytics - Interfirewall and Transfirewall.pptx
PDF
Business Analytics and business intelligence.pdf
PDF
22.Patil - Early prediction of Alzheimer’s disease using convolutional neural...
ISS -ESG Data flows What is ESG and HowHow
Introduction to Data Science and Data Analysis
MODULE 8 - DISASTER risk PREPAREDNESS.pptx
Introduction to Knowledge Engineering Part 1
Supervised vs unsupervised machine learning algorithms
The THESIS FINAL-DEFENSE-PRESENTATION.pptx
01_intro xxxxxxxxxxfffffffffffaaaaaaaaaaafg
oil_refinery_comprehensive_20250804084928 (1).pptx
modul_python (1).pptx for professional and student
Qualitative Qantitative and Mixed Methods.pptx
IBA_Chapter_11_Slides_Final_Accessible.pptx
Data_Analytics_and_PowerBI_Presentation.pptx
AI Strategy room jwfjksfksfjsjsjsjsjfsjfsj
168300704-gasification-ppt.pdfhghhhsjsjhsuxush
STERILIZATION AND DISINFECTION-1.ppthhhbx
Computer network topology notes for revision
Introduction to Firewall Analytics - Interfirewall and Transfirewall.pptx
Business Analytics and business intelligence.pdf
22.Patil - Early prediction of Alzheimer’s disease using convolutional neural...
Ad
Ad

Statistic report

  • 1. STATISTICS REPORT ON MULTIPLE REGRESSION, LOGISTIC REGRESSION AND TWO-WAY ANOVA By: SONALI GUPTA X01527245 Msc in Data Analytics National College ofIreland
  • 4. MULTIPLE REGRESSION ANALYSIS
Multiple regression describes the relationship between one continuous dependent variable and two or more independent variables.
DATA SOURCE
This analysis uses soil solution chemistry data from UK environmental monitoring. The dataset is available at: https://data.gov.uk/dataset/uk-environmental-change-network-ecn-soil-solution-chemistry-data-1992-2012 The data was provided in CSV format.
OBJECTIVE
This dataset was selected to analyse which factors soil quality depends on. The objectives of the analysis are to:
• Study the various components of soil.
• Study how class type relates to the different components.
• Study the relationships between all the components.
DATA INFORMATION
This dataset contains 1014 observations and 12 variables, including:
1) Type: the type of soil, e.g. Bixa, Sodic, Grassland, Horticulture, Lawn.
2) pH: potential of hydrogen, an acidity scale from 0 to 14.
3) Conductivity
4) Carbon
5) Nitrogen
6) Phosphorus
7) Potassium
8) WHC
9) Class type: high fertile, low fertile or medium fertile.
SOFTWARE
R is a simple, effective, open-source language that is widely used for data manipulation, data handling, data visualization, statistical analysis and graphics. In RStudio, the data was loaded with the read.table command:
    soil <- read.table("soil_15nov_ANN.CSV", sep = ",", header = TRUE)
  • 5. DATA CLEANING
The dataset contains many NA values, so it was cleaned with the complete.cases() function, which keeps only the rows that have no NA values. The class type column holds strings, so that column was also removed.
    clean <- complete.cases(soil)
    soil_data <- soil[clean, ]
OUTPUT OF MULTIPLE REGRESSION
DATA SUMMARY
The summary table gives the minimum, 1st quartile, median, mean, 3rd quartile and maximum of each component.
    summary(soil_data)
             depth  ph     conductivity  carbon  nitrogen  phosphorus  potassium  WHC   porosity
    Min.     1      5.6    40            0.015   3.98      4           60         27.1  29.3
    1st Qu.  1      8.2    265           0.12    17        14.43       200        42    40
    Median   2      8.81   410           0.235   30.52     19.84       290        46.3  44.5
    Mean     2.4    8.8    499.1         0.395   37.08     24.29       379.9      47.4  44.96
    3rd Qu.  3      9.7    660           0.589   50.4      32.3        400        52.5  49
    Max.     4      11.5   1720          2.35    185       82.42       3000       76.8  65.72
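The cleaning step can be tried on a small example. The sketch below is only an illustration: the data frame and its column names (ph, carbon, class_type) are made up and are not the real soil data. It shows complete.cases() dropping rows with NA values, followed by one possible way of dropping string columns.

```r
# Toy data frame standing in for the soil data (all values are invented).
soil <- data.frame(
  ph         = c(8.2, NA, 9.1, 7.5),
  carbon     = c(0.12, 0.30, NA, 0.45),
  class_type = c("high", "low", "medium", "high"),
  stringsAsFactors = FALSE
)

# TRUE for rows that have no NA in any column.
clean <- complete.cases(soil)
soil_data <- soil[clean, ]

# Keep only the numeric columns (an assumed way to drop string columns).
numeric_cols <- sapply(soil_data, is.numeric)
soil_numeric <- soil_data[, numeric_cols, drop = FALSE]

print(nrow(soil_data))      # number of complete rows
print(names(soil_numeric))  # remaining numeric columns
```

Selecting columns with is.numeric avoids hard-coding column positions, which may help if the file layout changes.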
  • 6. CORRELATION MATRIX
Correlation measures the strength and direction of the linear relationship between two variables: a positive value indicates a positive trend and a negative value indicates a negative trend.
    cor(soil_data)
                   depth        ph         conductivity  carbon      nitrogen
    depth          1.00000000   0.3513589  0.43664406    -0.5636996  -0.5411645
    ph             0.35135888   1.0000000  0.75864988    -0.7100049  -0.6449397
    conductivity   0.43664406   0.7586499  1.00000000    -0.5188081  -0.4582881
    carbon         -0.56369961  -0.7100049 -0.51880806   1.0000000   0.7896135
    nitrogen       -0.54116446  -0.6449397 -0.45828809   0.7896135   1.0000000
    phosphorus     -0.32741495  -0.6335258 -0.51024771   0.6872795   0.6610887
    potassium      -0.20167701  -0.5825826 0.31843323    0.6639411   0.4568622
    WHC            -0.22324381  -0.2304440 -0.27958500   0.2367519   0.2386911
    porosity       -0.20642418  -0.2157805 -0.27194201   0.2055014   0.2414338
    class          0.07352693   0.1423210  0.04065294    -0.1363034  -0.1771025
                   phosphorus   potassium   WHC         porosity    class
    depth          -0.32741496  -0.2016770  -0.2232438  -0.2064242  0.073526927
    ph             -0.63352582  -0.5825826  -0.2304440  -0.2157805  0.14232104
    conductivity   -0.51024770  -0.3184332  -0.2795850  -0.2719420  0.040652945
    carbon         0.687279522  0.6639411   0.2367519   0.2055014   -0.136303384
    nitrogen       0.661088713  0.4568622   0.2386911   0.2414338   -0.177102523
    phosphorus     1.0000000    0.6125755   0.3444425   0.3380200   0.001868296
    potassium      0.612575523  1.0000000   0.1906594   0.1153716   -0.117629191
    WHC            0.344442511  0.1906594   1.0000000   0.8128219   0.395313489
    porosity       0.338020020  0.1153716   0.8128219   1.0000000   0.33599933
    class          0.001868296  -0.1176292  0.3953135   0.3359993   1.000000000
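To make the sign of a correlation concrete, here is a minimal sketch on simulated data. The variable names (x, rises_with_x, falls_with_x) are hypothetical and have nothing to do with the soil variables; one series is built to increase with x and the other to decrease with it.

```r
set.seed(3)
x <- rnorm(50)
rises_with_x <- 2 * x + rnorm(50, sd = 0.5)   # built to move with x
falls_with_x <- -2 * x + rnorm(50, sd = 0.5)  # built to move against x

toy <- data.frame(x, rises_with_x, falls_with_x)

# Pearson correlation matrix: symmetric, with 1 on the diagonal.
r <- cor(toy)
print(round(r, 2))
```

The entry for x and rises_with_x comes out positive and the entry for x and falls_with_x comes out negative, which is how the signs in the soil correlation matrix above can be read.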
  • 7. PAIRWISE MATRIX OF SCATTER PLOT
The command below draws a scatter plot for each pair of components, which makes the relationships between them easy to inspect.
    pairs(soil_data[c("depth","ph","conductivity","carbon","nitrogen","phosphorus","WHC","porosity","class")])
The figure shows that conductivity increases as the pH value rises, which indicates a positive correlation. A negative correlation would instead mean that one variable decreases as the other increases. There is virtually no correlation between class and depth.
  • 8. LINEAR MODEL FIT
    fit <- lm(class ~ ., data = soil_data)
    fit
    Call:
    lm(formula = class ~ ., data = soil_data)
    Coefficients:
    (Intercept)  depth      ph         conductivity  carbon     nitrogen    phosphorus  potassium   WHC        porosity
    -1.291e-01   2.770e-02  6.608e-02  1.793e-05     1.466e-01  -7.559e-03  7.730e-03   -2.519e-04  3.501e-02  5.376e-03
    summary(fit)
    Call:
    lm(formula = class ~ ., data = soil_data)
    Residuals:
    Min       1Q        Median   3Q       Max
    -1.29350  -0.32275  0.00726  0.43999  1.29096
    Coefficients:
                  Estimate    Std. Error  t value  Pr(>|t|)
    (Intercept)   -1.291e-01  3.360e-01   -0.384   0.701002
    depth         2.770e-02   2.309e-02   1.199    0.230659
    ph            6.608e-02   3.313e-02   1.995    0.046367 *
    conductivity  1.793e-05   1.031e-04   0.174    0.862007
    carbon        1.466e-01   1.091e-01   1.344    0.179149
    nitrogen      -7.559e-03  1.313e-03   -5.755   1.15e-08 ***
    phosphorus    7.730e-03   2.239e-03   3.452    0.000579 ***
    potassium     -2.520e-04  7.570e-05   -3.328   0.000906 ***
    WHC           3.501e-02   4.011e-03   8.730    < 2e-16 ***
    porosity      5.376e-03   4.962e-03   1.084    0.278821
    ---
    Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    Residual standard error: 0.5995 on 1004 degrees of freedom
    Multiple R-squared: 0.2583, Adjusted R-squared: 0.2517
    F-statistic: 38.86 on 9 and 1004 DF, p-value: < 2.2e-16
The summary() command summarises the regression statistics. Nitrogen, phosphorus, potassium and WHC are highly significant predictors (p < 0.001) and ph is significant at the 0.05 level, whereas depth, conductivity, carbon and porosity are not significant. The R-squared value of 0.2583 means that the model explains about 26% of the variation in class.
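The quantities discussed above (p-values and R-squared) can also be pulled out of the summary object directly rather than read off the printout. The following is a rough sketch on simulated data, so the variables y, x1 and x2 are assumptions and the numbers will not match the soil model.

```r
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)  # simulated response
toy <- data.frame(y, x1, x2)

fit <- lm(y ~ ., data = toy)  # same "response ~ ." form as above
s   <- summary(fit)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|).
coef_table  <- s$coefficients
p_values    <- coef_table[, "Pr(>|t|)"]
significant <- names(p_values)[p_values < 0.05]

r2     <- s$r.squared      # share of variance explained
adj_r2 <- s$adj.r.squared  # penalised for the number of predictors

print(significant)
print(round(c(r2 = r2, adj_r2 = adj_r2), 4))
```

Filtering on p_values < 0.05 lists the significant terms programmatically, which reduces the chance of misreading the significance stars.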
  • 9. RESIDUAL PLOT
    plot(fit, which = 1)
RESULT
This multiple regression analysis examined the relationships between the variables and the impact they have on one another. There is a strong positive relationship between ph and conductivity (r = 0.76).
  • 10. LOGISTIC REGRESSION
Logistic regression is a statistical method for analysing the relationship between one or more independent variables and an outcome that is measured with a dichotomous variable, e.g. win/lose, yes/no or fail/pass.
Assumptions:
1. The dependent variable is dichotomous.
2. There are no outliers in the data, because outliers have a negative impact on the model.
3. The outcome variable is coded as 0 and 1.
Logistic regression in mathematical terms: $\operatorname{logit}(p) = \ln\left(\frac{p}{1-p}\right) = \beta_0 + \sum_{i=1}^{n} \beta_i x_i$, for $i = 1$ to $n$, where $p$ is the probability of the outcome and $x_i$ are the independent variables.
DATA SOURCE
https://data.gov.uk/dataset/victim-and-offender-gender-and-age
OBJECTIVE
The dataset contains drug test reports for different age groups.
• To model the probability of the decision to quit or not quit drugs (the dependent variable) using gender, age and level of drug addiction.
DATA INFORMATION
Dependent variable: Decision (whether the person quit drugs after the test score), coded as 0 = "no" and 1 = "yes".
Independent variables:
Gender, coded as 0 = "male" and 1 = "female".
Age.
Drug, coded as 1 = "high" and 2 = "low" (high or low level of drug addiction).
SOFTWARE
SPSS was used to produce and analyse the logistic regression output.
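The report fits this model in SPSS. For readers who prefer R, a comparable fit could be sketched with glm() and a binomial family. Everything below is simulated: the variables decision, gender and age only mimic the coding described above and are not the real data.

```r
set.seed(42)
n <- 200

# Simulated predictors (hypothetical, loosely mimicking the coding above).
gender <- rbinom(n, 1, 0.4)          # 0 = male, 1 = female
age    <- round(runif(n, 18, 60))

# Simulated outcome: the probability of "yes" follows the logit model.
p        <- plogis(-1.25 + 1.5 * gender)
decision <- rbinom(n, 1, p)          # 0 = no, 1 = yes

toy <- data.frame(decision, gender = factor(gender), age)

# Binary logistic regression: logit(p) = b0 + b1 * gender + b2 * age.
model <- glm(decision ~ gender + age, data = toy, family = binomial)

odds_ratios <- exp(coef(model))              # plays the role of Exp(B)
pred <- predict(model, type = "response")    # fitted probabilities

print(summary(model)$coefficients)
print(round(odds_ratios, 3))
```

plogis() is the inverse of the logit, so it turns the linear predictor back into a probability between 0 and 1.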
  • 11. PROCEDURE OF LOGISTIC REGRESSION IN SPSS
1. Import the dataset (see the screenshot below).
2. Click Analyze -> Regression -> Binary Logistic.
  • 12. 3. Move the dependent variable into the Dependent box, move the independent variables into the box below it, then click the Categorical button.
4. Clicking the Categorical button opens the window below. Move the categorical variables into the Categorical Covariates box, select First, then click Continue.
  • 13. 5. Click the Options button, tick Classification plots, Hosmer-Lemeshow goodness-of-fit, Casewise listing and CI for exp(B), then click Continue and OK.
OUTPUT OF LOGISTIC REGRESSION
CASE PROCESSING
This table shows the number of cases in the dataset.
Case Processing Summary
    Unweighted Cases (a)                      N     Percent
    Selected Cases    Included in Analysis    199   100.0
                      Missing Cases           0     .0
                      Total                   199   100.0
    Unselected Cases                          0     .0
    Total                                     199   100.0
    a. If weight is in effect, see classification table for the total number of cases.
  • 14. CATEGORICAL VARIABLES CODINGS
This table shows how the categorical independent variables are coded.
Categorical Variables Codings
                              Frequency   Parameter coding (1)
    drug addiction   high     91          .000
                     low      108         1.000
    gender           male     121         .000
                     female   78          1.000
CLASSIFICATION TABLE
This table shows that 64.3 percent of cases are correctly classified overall, and that the larger share of people answered no to quitting drugs.
OMNIBUS TESTS OF MODEL COEFFICIENTS
This table tests how well the model performs compared with a model that contains no predictors. The significance value is .000, which is less than .05, so the model with predictors fits significantly better than the baseline model, in which every case is predicted as not quitting drugs. The chi-square value is 24.022 with 3 degrees of freedom.
Omnibus Tests of Model Coefficients
                     Chi-square   df   Sig.
    Step 1   Step    24.022       3    .000
             Block   24.022       3    .000
             Model   24.022       3    .000
  • 15. HOSMER AND LEMESHOW TEST
In this test a significance value of less than .05 indicates a poorly fitting model, so a well-fitting model needs a significance value greater than .05. Here the chi-square value is 1.730 with a significance level of .988, which is larger than .05 and therefore supports the model.
Hosmer and Lemeshow Test
    Step   Chi-square   df   Sig.
    1      1.730        8    .988
MODEL SUMMARY
This table indicates how much of the variation in the dependent variable the model explains, on a scale from a minimum of 0 to a maximum of 1. The two R Square values are .114 (Cox & Snell) and .156 (Nagelkerke), which suggests that between 11.4 percent and 15.6 percent of the variability is explained by this set of variables.
Model Summary
    Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
    1      235.293 (a)         .114                   .156
    a. Estimation terminated at iteration number 4 because parameter estimates changed by less than .001.
CLASSIFICATION TABLE
This table shows how well the model predicts the correct category (quit drugs / did not quit drugs) for each case. Overall, the model correctly classifies 69.3 percent of cases. It correctly identifies 62.0 percent of the people who quit drugs (the sensitivity of the model) and 73.4 percent of the people who did not quit drugs (the specificity of the model).
  • 16. CASEWISE LIST
SPSS produced no casewise list, which means that no cases were flagged as outliers and the model fits the sample well.
RESULT
The model contains three independent variables: age, gender and drug addiction. The full model was statistically significant, chi-square (df = 3, N = 199) = 24.022, p < .001, which means that the model was able to distinguish between people who quit drugs and people who did not. The model as a whole explained between 11.4% (Cox and Snell R square) and 15.6% (Nagelkerke R square) of the variance in the decision to quit drugs. Only gender made a statistically significant contribution to the model.
REFERENCES
1. Hosmer, D. & Lemeshow, S. (2000). Applied Logistic Regression (Second Edition). New York: John Wiley & Sons, Inc.
2. Long, J. Scott (1997). Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage Publications.
VARIABLES IN THE EQUATION
This table reports the Wald test, which assesses the contribution of each predictor variable. We look for significance values less than .05, and here only one predictor meets that criterion (gender, p = .000). The B values are the regression coefficients, comparable to the B values in multiple regression; their sign shows whether the relationship is positive or negative. Exp(B) is the odds ratio for each independent variable. For gender, the odds of answering yes are 4.494 times higher for females than for males. For drug addiction, the odds of answering yes for a person with low addiction are .961 times the odds for a person with high addiction. The table also gives the lower and upper bounds of the 95% confidence interval for Exp(B). For drug addiction (OR = .961) the interval ranges from .516 to 1.789, so we can be 95 percent confident that the population odds ratio lies between .516 and 1.789; because this interval contains 1, the effect is not statistically significant.
Variables in the Equation
                                   B        S.E.   Wald     df   Sig.   Exp(B)   95% C.I. for EXP(B)
                                                                                 Lower    Upper
    Step 1 (a)  age                .001     .012   .003     1    .957   1.001    .977     1.025
                gender(1)          1.503    .317   22.530   1    .000   4.494    2.416    8.359
                drug addiction(1)  -.040    .317   .016     1    .900   .961     .516     1.789
                Constant           -1.252   .583   4.617    1    .032   .286
    a. Variable(s) entered on step 1: age, gender, drug addiction.
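The Wald, Exp(B) and confidence interval columns of the Variables in the Equation table all follow from B and S.E. As an illustrative check for the gender row, the sketch below uses the standard formulas Wald = (B / S.E.)^2, Exp(B) = exp(B) and CI = exp(B ± z × S.E.). The variable names are assumptions, and the results only approximate the SPSS output because B and S.E. are rounded to three decimals in the table.

```r
# Values copied from the gender(1) row of the table.
B  <- 1.503
SE <- 0.317

# Wald statistic: squared ratio of the coefficient to its standard error.
wald <- (B / SE)^2

# Odds ratio, reported by SPSS as Exp(B).
odds_ratio <- exp(B)

# 95% confidence interval for the odds ratio.
z  <- qnorm(0.975)
ci <- exp(B + c(-1, 1) * z * SE)

print(round(c(wald = wald, odds_ratio = odds_ratio,
              lower = ci[1], upper = ci[2]), 3))
```

The printed values land close to the table entries (22.530, 4.494, 2.416 and 8.359). The interval excludes 1, which agrees with gender being the only significant predictor.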
  • 17. TWO-WAY ANOVA
Two-way ANOVA is an extension of one-way ANOVA. One-way ANOVA has one independent variable, whereas two-way ANOVA has two. The technique examines how one dependent variable varies across the levels of two independent variables.
Advantages of two-way ANOVA:
1) We can examine the main effect of each independent variable.
2) We can test for an interaction effect, which occurs when the effect of one independent variable on the dependent variable depends on the level of the second independent variable.
ASSUMPTIONS
The key assumptions of two-way ANOVA are as follows:
1) The two independent variables are categorical, e.g. gender with the categories female and male, coded as 1 and 0. In this dataset, Irish classification and county group are the two categorical variables.
2) The dependent variable is continuous, i.e. its value can differ from case to case. In this analysis it is the total number of student enrolments; another example would be students' test scores.
3) There should be no significant outliers, because outliers distort the results of the two-way ANOVA.
[Two-Way ANOVA table]
DATA SOURCE
https://data.gov.ie/dataset/data-on-individual-schools
  • 18. OBJECTIVE
The dataset gives the total enrolment in mainstream national schools for the 2016/2017 school year, broken down by county and Irish classification. The objectives are:
1) To find the total number of enrolments by county.
2) To find the difference in enrolments by Irish classification, i.e. all subjects through Irish versus no subjects through Irish.
DATA INFORMATION
There are two categorical independent variables. The first is Irish classification, coded as All subjects through Irish = 1 and No subjects through Irish = 0. The second is county group, coded as Donegal = 1, Galway = 2 and Limerick = 3. The dependent variable is the total number of students enrolled in 2016/2017.
SOFTWARE
SPSS was used for this analysis.
PROCEDURE FOR TWO-WAY ANOVA IN SPSS
1. Import the dataset into SPSS (see the screenshot below):
  • 19. 2. Click Analyze -> General Linear Model -> Univariate.
3. Clicking Univariate opens the window below. Use the arrow buttons to move Total Pupils into the Dependent Variable box, and County and Irish classification into the Fixed Factor(s) box.
  • 20. 4. Click the Options button, then in the Display section choose Descriptive statistics, Estimates of effect size and Homogeneity tests, and click Continue.
5. Next, click Post Hoc, move County and IrishClassification Description into the Post Hoc Tests box, choose Tukey in the window below and click Continue.
  • 21. 6. Click the Plots button, put County in Horizontal Axis and IrishClassification Description in Separate Lines, click Continue and then click OK.
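For comparison with the SPSS procedure above, a two-way ANOVA with an interaction term and a Tukey post hoc test can be sketched in R with aov() and TukeyHSD(). The data here are simulated and the column names (pupils, county, irish) are hypothetical. Note also that aov() uses sequential sums of squares, whereas SPSS reports Type III; the two agree for a balanced design such as this toy one, but not in general.

```r
set.seed(7)

# Simulated, balanced data: 3 counties x 2 classifications x 20 schools.
county <- factor(rep(c("Donegal", "Galway", "Limerick"), each = 40))
irish  <- factor(rep(c("All subjects through Irish",
                       "No subjects through Irish"), times = 60))
pupils <- round(rnorm(120, mean = 130, sd = 40) +
                ifelse(county == "Limerick", 40, 0))
schools <- data.frame(pupils, county, irish)

# county * irish expands to both main effects plus their interaction.
fit <- aov(pupils ~ county * irish, data = schools)
anova_table <- summary(fit)[[1]]
print(anova_table)

# Tukey HSD pairwise comparisons between counties.
tukey <- TukeyHSD(fit, which = "county")
print(tukey$county)

# Profile plot: county on the horizontal axis, one line per classification.
with(schools, interaction.plot(county, irish, pupils))
```

The interaction.plot() call corresponds to the profile plot requested in step 6, with county on the horizontal axis and a separate line for each Irish classification.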
  • 22. OUTPUT OF TWO-WAY ANOVA
DESCRIPTIVE STATISTICS
This table gives the mean, standard deviation and number of schools for each county. The mean for Donegal is 108.33, the mean for Galway is 130.95 and the mean for Limerick is 168.33. The overall standard deviation is largest for Limerick (148.851), compared with Donegal (108.550) and Galway (114.347).
Descriptive Statistics
Dependent Variable: Total Pupils
    County     Irish Classification Description   Mean     Std. Deviation   N
    Donegal    All subjects through Irish         77.86    74.570           37
               No subjects through Irish          116.68   114.962          135
               Total                              108.33   108.550          172
    Galway     All subjects through Irish         132.09   143.605          44
               No subjects through Irish          130.66   106.420          179
               Total                              130.95   114.347          223
    Limerick   All subjects through Irish         311.00   195.876          6
               No subjects through Irish          161.64   143.826          128
               Total                              168.33   148.851          134
    Total      All subjects through Irish         121.37   135.098          87
               No subjects through Irish          135.36   121.903          442
               Total                              133.06   124.143          529
LEVENE’S TEST OF EQUALITY OF ERROR VARIANCES
Decision rule: if p <= .05, the variances are significantly different; if p > .05, the variances are not significantly different. The significance value in this table is .006, which is less than .05, so the homogeneity of variance assumption is violated.
Levene's Test of Equality of Error Variances (a)
Dependent Variable: Total Pupils
    F       df1   df2   Sig.
    3.309   5     523   .006
    Tests the null hypothesis that the error variance of the dependent variable is equal across groups.
    a. Design: Intercept + County + IrishClassificationDescription + County * IrishClassificationDescription
  • 23. INTERACTION EFFECTS
The interaction effect shows the combined effect of the two factors on the dependent variable. It is statistically significant when its significance value is less than .05. In the effects table the significance value for County * IrishClassificationDescription is .003, which is less than .05, so the effect of county on total pupils differs significantly between schools teaching no subjects through Irish and schools teaching all subjects through Irish.
Tests of Between-Subjects Effects
Dependent Variable: Total Pupils
    Source                                    Type III Sum of Squares   df    Mean Square   F         Sig.   Partial Eta Squared
    Corrected Model                           444536.320 (a)            5     88907.264     6.044     .000   .055
    Intercept                                 3645418.517               1     3645418.517   247.837   .000   .322
    County                                    378953.543                2     189476.772    12.882    .000   .047
    IrishClassificationDescription            52848.123                 1     52848.123     3.593     .059   .007
    County * IrishClassificationDescription   170988.424                2     85494.212     5.812     .003   .022
    Error                                     7692784.621               523   14708.957
    Total                                     17503582.000              529
    Corrected Total                           8137320.941               528
    a. R Squared = .055 (Adjusted R Squared = .046)
EFFECT SIZE
The Partial Eta Squared column gives the effect size of each term. For Irish classification it is .007, a very small effect, and the main effect of Irish classification is not statistically significant (p = .059). County has a partial eta squared of .047 (p = .000) and the interaction has a partial eta squared of .022 (p = .003).
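The F, Sig. and Partial Eta Squared columns can be recomputed from the sums of squares and degrees of freedom in the table. The sketch below is a check under the usual definitions, F = MS_effect / MS_error and partial eta squared = SS_effect / (SS_effect + SS_error). The vector names are assumptions and the inputs are copied from the SPSS output.

```r
# Sums of squares and degrees of freedom from the effects table.
ss_effect <- c(County = 378953.543, Irish = 52848.123,
               Interaction = 170988.424)
df_effect <- c(County = 2, Irish = 1, Interaction = 2)
ss_error  <- 7692784.621
df_error  <- 523

# Mean squares and F ratios.
ms_effect <- ss_effect / df_effect
ms_error  <- ss_error / df_error
F_stat    <- ms_effect / ms_error

# Upper-tail p-values from the F distribution.
p_value <- pf(F_stat, df_effect, df_error, lower.tail = FALSE)

# Partial eta squared: effect SS relative to effect SS plus error SS.
partial_eta_sq <- ss_effect / (ss_effect + ss_error)

print(round(cbind(F_stat, p_value, partial_eta_sq), 3))
```

Rounded to three decimals, these reproduce the table: F values of 12.882, 3.593 and 5.812, and partial eta squared values of .047, .007 and .022.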
  • 24. POST-HOC TEST
This table gives the post hoc comparisons using the Tukey Honestly Significant Difference (HSD) test. Post hoc tests are only meaningful for a factor with more than two levels, such as county. In the multiple comparisons table, Limerick differs significantly from both Donegal (p = .000) and Galway (p = .014), while the difference between Donegal and Galway is not significant (p = .158). A mean difference whose significance value is less than .05 is marked with an asterisk.
Multiple Comparisons
Dependent Variable: Total Pupils
Tukey HSD
    (I) County   (J) County   Mean Difference (I-J)   Std. Error   Sig.   95% Confidence Interval
                                                                          Lower Bound   Upper Bound
    Donegal      Galway       -22.61                  12.308       .158   -51.54        6.31
                 Limerick     -60.00*                 13.974       .000   -92.84        -27.15
    Galway       Donegal      22.61                   12.308       .158   -6.31         51.54
                 Limerick     -37.38*                 13.256       .014   -68.54        -6.22
    Limerick     Donegal      60.00*                  13.974       .000   27.15         92.84
                 Galway       37.38*                  13.256       .014   6.22          68.54
    Based on observed means. The error term is Mean Square(Error) = 14708.957.
    *. The mean difference is significant at the .05 level.
  • 25. PLOTS
The plot is an important aid for interpreting the output of a two-way ANOVA because it makes the relationship between the variables easy to see. In Donegal, mean enrolment is somewhat lower for All subjects through Irish than for No subjects through Irish. In Galway there is almost no difference between the two classifications. In Limerick there is a large difference, with much higher mean enrolment for All subjects through Irish.
RESULT
Two-way ANOVA was used to find out whether enrolment differs between counties according to Irish classification. Irish classification has two levels, All subjects through Irish and No subjects through Irish. County is divided into three groups: Donegal as 1, Galway as 2 and Limerick as 3. The interaction effect between county and Irish classification was statistically significant, F = 5.812, p = .003.
REFERENCES
1. Lantz, B. (2013). Machine Learning with R. Second Edition.
2. Pallant, J. (2016). SPSS Survival Manual. 6th Ed. New York: McGraw Hill Education.