Correlation, Regression & T-test 
Prepared By: Dr. Kumara Thevan a/l 
Krishnan
Introduction 
An investigation of a relationship between two or more
numerical (quantitative) variables can be conducted
using the techniques of correlation and regression analysis.
!
- Correlation is a statistical method used to determine
whether a linear relationship between variables exists.
!
- Simple linear regression is a statistical method used to
describe the nature of the relationship between two
variables.
Definition 
A scatterplot (or scatter diagram) is a
graph in which the paired (x, y)
sample data are plotted with a
horizontal x axis and a vertical y axis.
! 
Each individual (x,y) pair is plotted as a 
single point.
Definition 
Correlation 
! 
exists between two variables 
when one of them is related to 
the other in some way
Example 
Open SPSS.
Data: weight height biometry male 2012.sav
Graphs
Legacy Dialogs => scatter plot
Height (X); Weight (Y)
BMI test
Category                              BMI range (kg/m²)
Very severely underweight             less than 15
Severely underweight                  from 15.0 to 16.0
Underweight                           from 16.0 to 18.5
Normal (healthy weight)               from 18.5 to 25
Overweight                            from 25 to 30
Obese Class I (Moderately obese)      from 30 to 35
Obese Class II (Severely obese)       from 35 to 40
Obese Class III (Very severely obese) over 40
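The classification table above can be sketched as a small Python function (illustrative only; the function names are ours, and the handling of values exactly on a boundary is an assumption, since the slide's ranges overlap at the cut-offs):

```python
def bmi(weight_kg, height_m):
    """Body mass index in kg/m^2."""
    return weight_kg / height_m ** 2

def bmi_category(b):
    """Map a BMI value to the category names used in the table above.
    Boundary values are assigned to the higher category (an assumption)."""
    if b < 15:   return "Very severely underweight"
    if b < 16:   return "Severely underweight"
    if b < 18.5: return "Underweight"
    if b < 25:   return "Normal (healthy weight)"
    if b < 30:   return "Overweight"
    if b < 35:   return "Obese Class I (Moderately obese)"
    if b < 40:   return "Obese Class II (Severely obese)"
    return "Obese Class III (Very severely obese)"

# Example: 70 kg at 1.75 m gives BMI ≈ 22.86
print(bmi_category(bmi(70, 1.75)))  # Normal (healthy weight)
```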
Normality test?
• The Kolmogorov-Smirnov and Shapiro-Wilk tests.
• These compare the scores in the sample to a
normally distributed set of scores with the same
mean and s.d.
• If p > 0.05, the test is non-significant. This tells us
that the distribution of the sample is not
significantly different from a normal
distribution.
• If the test is significant (p < 0.05), then the
distribution in question is significantly different
from a normal distribution (non-normal).
What can you see?
Positive Linear Correlation
[Three scatterplots of y against x: (a) Positive, (b) Strong positive, (c) Perfect positive]
Negative Linear Correlation
[Three scatterplots of y against x: (d) Negative, (e) Strong negative, (f) Perfect negative]
What can you see?
Test 1 
- Draw a scatter plot using the data in
ExamAnxiety.sav
- Exam performance (%) – y axis
- Exam anxiety – x axis
- Colour – set markers by gender
- Results?
! 
- Try 3D plot – 3 variables
Bivariate correlation 
• Having taken a preliminary glance at the 
data, we can proceed to conduct the 
correlation analysis.
Definition 
! 
Linear Correlation Coefficient r 
measures the strength of the linear
relationship between paired x and y
quantitative values in a sample
No Linear Correlation
[Two scatterplots of y against x: (g) No Correlation, (h) Nonlinear Correlation]
Definition 
! 
Linear Correlation Coefficient r 
sometimes referred to as the 
Pearson product moment correlation 
coefficient
Notation for the
Linear Correlation Coefficient
n      number of pairs of data presented.
Σ      denotes the addition of the items indicated.
Σx     denotes the sum of all x values.
Σx²    indicates that each x score should be squared and then
those squares added.
(Σx)²  indicates that the x scores should be added and the total
then squared.
Σxy    indicates that each x score should first be multiplied by its
corresponding y score. After obtaining all such products, find their sum.
r      represents the linear correlation coefficient for a sample
ρ      represents the linear correlation coefficient for a population
Definition
Linear Correlation Coefficient r

r = [nΣxy − (Σx)(Σy)] / √{ [nΣx² − (Σx)²] [nΣy² − (Σy)²] }
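As a sanity check outside SPSS, the formula above translates directly into Python (a sketch; the function name `pearson_r` is ours):

```python
import math

def pearson_r(xs, ys):
    """Sample linear correlation coefficient, computed from the slide's
    formula: r = [nΣxy − ΣxΣy] / sqrt([nΣx² − (Σx)²][nΣy² − (Σy)²])."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return (n * sxy - sx * sy) / math.sqrt(
        (n * sxx - sx ** 2) * (n * syy - sy ** 2))

# A perfect positive linear relationship gives r = 1
print(pearson_r([1, 2, 3], [2, 4, 6]))  # 1.0
```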
Test 2 
• Run the correlation analysis 
ExamAnxiety.sav 
• Assumption – Data is normally distributed
Results
Correlations
                       Exam Performance (%)  Exam Anxiety  Time Spent Revising
Exam Performance (%)
  Pearson Correlation  1                     -.441         .397
  Sig. (1-tailed)                            .000          .000
  N                    103                   103           103
Exam Anxiety
  Pearson Correlation  -.441                 1             -.709
  Sig. (1-tailed)      .000                                .000
  N                    103                   103           103
Time Spent Revising
  Pearson Correlation  .397                  -.709         1
  Sig. (1-tailed)      .000                  .000
  N                    103                   103           103
**. Correlation is significant at the 0.01 level (1-tailed).
Interpretation 
• Exam performance is positively related to 
the amount of time spent revising, with a 
coefficient of r= 0.397, which is also 
significant at p< 0.01. 
! 
• Exam anxiety appears to be negatively 
related to the time spent revising (r= 
-0.709, p< 0.01)
Interpretation 
• Each variable is perfectly correlated with 
itself (r=1). 
! 
• Exam performance is negatively related to
exam anxiety, with a Pearson correlation
coefficient of r = -0.441, and there is less
than a 0.01 probability that a correlation
coefficient this big would have occurred by
chance in a sample of 103 people.
In layman's terms
• Higher exam anxiety, lower exam mark
• More revision time, higher exam mark
• More revision time, lower exam anxiety
Hands on 
• Is there a linear association between 
weight and heart girth in this herd of cows? 
• Weight was measured in kg and heart girth 
in cm on 10 cows 
! 
! 
! 
• Assume data is normally distributed
• The sample coefficient of correlation is
0.704. The P value is 0.012, which is less
than 0.05. The conclusion is that a
correlation exists in the population.
Correlations
        Weight  Girth
Weight  Pearson Correlation  1      .704
        Sig. (1-tailed)             .012
        N                    10     10
Girth   Pearson Correlation  .704   1
        Sig. (1-tailed)      .012
        N                    10     10
*. Correlation is significant at the 0.05 level (1-tailed).
Using R² for interpretation
(correlation coefficient)² = coefficient of
determination, R²
!
R² is a measure of the amount of variability
in one variable that is explained by the
other.
Example
Correlations
                       Exam Performance (%)  Exam Anxiety  Time Spent Revising
Exam Performance (%)
  Pearson Correlation  1                     -.441         .397
  Sig. (1-tailed)                            .000          .000
  N                    103                   103           103
Exam Anxiety
  Pearson Correlation  -.441                 1             -.709
  Sig. (1-tailed)      .000                                .000
  N                    103                   103           103
Time Spent Revising
  Pearson Correlation  .397                  -.709         1
  Sig. (1-tailed)      .000                  .000
  N                    103                   103           103
**. Correlation is significant at the 0.01 level (1-tailed).
Example
Exam anxiety and exam performance
• (correlation coefficient)² = coefficient of
determination, R²
!
R² = (−0.441)² = 0.194
!
• In % = 0.194 × 100 = 19.4%
• Although exam anxiety was correlated
with exam performance, it can account
for only 19.4% of the variation in exam
scores.
!
• 80.6% of the variability is to be accounted
for by other variables, such as different
ability, different levels of preparation and
so on.
Hands on 
Subject Age, x Pressure, y 
A 43 128 
B 48 120 
C 56 135 
D 61 143 
E 67 141 
F 70 152 
Compute the value of the correlation coefficient for the data.
Do you have enough statistical evidence that this relationship
does not occur by chance?
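Under the normality assumption, this hands-on exercise can be checked by hand with the Pearson formula from earlier (a sketch; the variable names are ours):

```python
import math

# Age/blood-pressure data from the table above
age      = [43, 48, 56, 61, 67, 70]
pressure = [128, 120, 135, 143, 141, 152]

n = len(age)
sx, sy = sum(age), sum(pressure)
sxy = sum(x * y for x, y in zip(age, pressure))
sxx = sum(x * x for x in age)
syy = sum(y * y for y in pressure)

# r = [nΣxy − ΣxΣy] / sqrt([nΣx² − (Σx)²][nΣy² − (Σy)²])
r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx**2) * (n * syy - sy**2))
print(round(r, 3))      # 0.897, matching the SPSS output on the next slide
print(round(r**2, 3))   # R² ≈ 0.804
```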
Correlations
          Age    Pressure
Age       Pearson Correlation  1      .897
          Sig. (2-tailed)             .015
          N                    6      6
Pressure  Pearson Correlation  .897   1
          Sig. (2-tailed)      .015
          N                    6      6
*. Correlation is significant at the 0.05 level (2-tailed).
R² = ?
Regression
Correlation does not provide the predictive
power of variables.
! 
In regression analysis we fit a predictive 
model to our data and use that model to 
predict values of the dependent variable 
from one or more independent variables.
Independent vs. Dependent
Independent variable:
• Intentionally manipulated
• Controlled
• Varies at a known rate
• Cause
Dependent variable:
• Intentionally left alone
• Measured
• Varies at an unknown rate
• Effect
• Simple regression seeks to predict an 
outcome variable from a single predictor 
variable whereas multiple regression 
seeks to predict an outcome from several 
predictors. 
! 
Outcomei = (Modeli) + errori 
Yi = (b0 + b1Xi) + ei
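The intercept b0 and slope b1 of this model can be estimated by the method of least squares described below; a minimal stdlib-only sketch (the function name is ours):

```python
def least_squares(xs, ys):
    """Fit Yi = b0 + b1*Xi by ordinary least squares,
    i.e. the line minimising the sum of squared residuals."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b1 = sxy / sxx          # slope
    b0 = my - b1 * mx       # intercept
    return b0, b1

# Data lying exactly on the line y = 1 + 2x are recovered exactly
b0, b1 = least_squares([1, 2, 3, 4], [3, 5, 7, 9])
print(b0, b1)  # 1.0 2.0
```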
Least squares 
Least squares is a method of finding the line 
that best fits the data. 
! 
This “line of best fit” is found by 
ascertaining which line, of all of the 
possible lines that could be drawn, results 
in the least amount of difference between 
the observed data points and the line.
The vertical lines (dashed) represent the differences (or residuals) 
between the line and the actual data
• “The best fit line” – there will be small 
differences between the values predicted by the 
line and the data that were actually observed. 
! 
• Our interest is in the vertical differences
between the line and the actual data, because
we are using the line to predict values of Y from
values of the X-variable.
!
• Some data fall above or below the line,
indicating there is a difference between the
model fitted to these data and the data
collected.
• These differences are called “residuals”.
• If simply summed, the positive and negative
residuals cancel each other out.
!
How to avoid this?
!
• Square the differences before adding them up.
• If the squared differences are large, the
line is not representative of the data; if the
squared differences are small, then it is
representative.
Total sum of squares, SST 
SST uses the differences between the observed data and the mean value of Y
• The sum of squared differences (SS) can be 
calculated for any line that is fitted to some 
data; the “goodness of fit” of each line can 
then be compared by looking at the sum of 
squares for each. 
! 
• The method of least squares works by
selecting the line that has the lowest sum of
squared differences (so it chooses the line that
best represents the observed data).
!
• This “line of best fit” is known as a regression
line.
Residual sum of squares, SSR 
SSR uses the differences between the observed data and the regression line
Model sum of squares, SSM
SSM uses the differences between the mean
value of Y and the regression line
F-ratio
F = MSM / MSR
!
MSM (mean square for the model)
!
= SSM / (number of variables in the model)
F-ratio
F = MSM / MSR
!
MSR (mean square for the residuals)
!
= SSR / (number of observations − number of
parameters being estimated)
F-ratio
• a good model should have a large F-ratio 
(greater than 1 at least)
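The two formulas above can be checked against the sums of squares that appear later in the Record1.sav ANOVA output (a sketch; the figures are taken from that table):

```python
# F-ratio from sums of squares (simple regression on Record1.sav):
# SSM = 433687.833 with 1 predictor; SSR = 862264.167 with
# 198 residual df (n − parameters = 200 − 2).
ss_m, df_m = 433687.833, 1
ss_r, df_r = 862264.167, 198

ms_m = ss_m / df_m   # mean square for the model
ms_r = ss_r / df_r   # mean square for the residuals
f = ms_m / ms_r

print(round(ms_r, 2), round(f, 2))  # 4354.87 99.59, as in the SPSS output
```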
Test 1
Open sample data – Record1.sav
Graphs => scatterplot
!
Analyze => Regression
Model Summary
Model  R     R Square  Adjusted R Square  Std. Error of the Estimate
1      .578  .335      .331               65.991
a. Predictors: (Constant), Advertsing Budget (thousands of pounds)
Interpretation
• The value of R² is 0.335, which tells us that
advertising expenditure can account for
33.5% of the variation in record sales.
!
• This means that 66.5% of the variation in
record sales cannot be explained by
advertising alone.
The F ratio is 99.59, which is significant at p < 0.001 (because the value in the
column labelled Sig. is less than 0.001).
!
This result tells us there is less than a 0.1% chance that an F ratio this
large would happen by chance alone. Overall, the regression model predicts record
sales significantly well.
Multiple regression 
• Open data ; Record2.sav
Results
Descriptive Statistics
                                         Mean    Std. Deviation  N
Record Sales (thousands)                 193.20  80.699          200
Advertsing Budget (thousands of pounds)  614.41  485.655         200
No. of plays on Radio 1 per week         27.50   12.270          200
Attractiveness of Band                   6.77    1.395           200
Correlations
                                   Record Sales  Advertsing   No. of plays on   Attractiveness
                                   (thousands)   Budget       Radio 1 per week  of Band
Pearson Correlation
  Record Sales (thousands)         1.000         .578         .599              .326
  Advertsing Budget                .578          1.000        .102              .081
  No. of plays on Radio 1 per week .599          .102         1.000             .182
  Attractiveness of Band           .326          .081         .182              1.000
Sig. (1-tailed)
  Record Sales (thousands)         .             .000         .000              .000
  Advertsing Budget                .000          .            .076              .128
  No. of plays on Radio 1 per week .000          .076         .                 .005
  Attractiveness of Band           .000          .128         .005              .
N                                  200           200          200               200
(N = 200 for every cell)
Model Summary (simple regression)
Model  R     R Square  Adjusted R Square  Std. Error of the Estimate
1      .578  .335      .331               65.991
a. Predictors: (Constant), Advertsing Budget (thousands of pounds)

Model Summary (multiple regression)
Model  R     R Square  Adjusted R  Std. Error of  R Square  F Change  df1  df2  Sig. F   Durbin-
                       Square      the Estimate   Change                        Change   Watson
1      .815  .665      .660        47.087         .665      129.498   3    196  .000     1.950
a. Predictors: (Constant), Attractiveness of Band, Advertsing Budget (thousands of pounds), No. of plays on
Radio 1 per week
b. Dependent Variable: Record Sales (thousands)
ANOVA (simple regression)
Model         Sum of Squares  df   Mean Square  F       Sig.
1 Regression  433687.833      1    433687.833   99.587  .000
  Residual    862264.167      198  4354.870
  Total       1295952.000     199
a. Predictors: (Constant), Advertsing Budget (thousands of pounds)
b. Dependent Variable: Record Sales (thousands)

ANOVA (multiple regression)
Model         Sum of Squares  df   Mean Square  F        Sig.
1 Regression  861377.418      3    287125.806   129.498  .000
  Residual    434574.582      196  2217.217
  Total       1295952.000     199
a. Predictors: (Constant), Attractiveness of Band, Advertsing Budget
(thousands of pounds), No. of plays on Radio 1 per week
b. Dependent Variable: Record Sales (thousands)
Hands on 
• Open file softdrinks.sav 
! 
• Do a multiple regression analysis
!
• Y (dependent variable) – delivery time
Results
Model Summary
Model  R     R Square  Adjusted R Square  Std. Error of the Estimate
1      .980  .960      .956               3.25947
a. Predictors: (Constant), distance, cases

ANOVA
Model         Sum of Squares  df  Mean Square  F        Sig.
1 Regression  5550.811        2   2775.405     261.235  .000
  Residual    233.732         22  10.624
  Total       5784.543        24
a. Predictors: (Constant), distance, cases
b. Dependent Variable: time
Coefficients
Model         Unstandardized B  Std. Error  Standardized Beta  t      Sig.
1 (Constant)  2.341             1.097                          2.135  .044
  cases       1.616             .171        .716               9.464  .000
  distance    .014              .004        .301               3.981  .001
a. Dependent Variable: time
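The fitted coefficients above give the prediction equation time = 2.341 + 1.616·cases + 0.014·distance. A quick sketch of using it (the coefficients are rounded as printed, so predictions are approximate, and the example input values are ours):

```python
# Prediction equation from the coefficients table above
# (B values rounded as printed in the SPSS output).
def predicted_time(cases, distance):
    return 2.341 + 1.616 * cases + 0.014 * distance

# Hypothetical delivery: 10 cases, distance 500
print(round(predicted_time(10, 500), 2))  # 25.5
```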
T-test 
• Testing differences between means 
! 
• Dependent means t-test: used when there are 
two experimental conditions and the same 
participants took part in both conditions of 
the experiment. 
! 
• Independent means t-test: used when there 
are two experimental conditions and different 
participants were assigned to each condition.
Dependent t-test 
• 12 spider-phobes who were exposed to a picture of a 
spider (picture) and on a separate occasion a real live 
tarantula (real). Their anxiety was measured in each 
condition (half of the participants were exposed to the 
picture before the real spider while the other half were 
exposed to the real spider first). 
• Which situation caused more anxiety? 
! 
! 
! 
! 
• Open spiderRM.sav
Results
Paired Samples Statistics
                           Mean   N   Std. Deviation  Std. Error Mean
Pair 1  Picture of Spider  40.00  12  9.293           2.683
        Real Spider        47.00  12  11.029          3.184

Paired Samples Correlations
                                         N   Correlation  Sig.
Pair 1  Picture of Spider & Real Spider  12  .545         .067
r = 0.545, not significantly correlated (p > 0.05)
Paired Samples Test
Paired Differences
                            Mean    Std.       Std. Error  95% CI of the Difference  t       df  Sig.
                                    Deviation  Mean        Lower      Upper                      (2-tailed)
Pair 1  Picture of Spider
        - Real Spider       -7.000  9.807      2.831       -13.231    -.769           -2.473  11  .031

The t-value is negative; this tells us that the picture had a smaller mean than the
real tarantula, so the real spider led to greater anxiety than the picture.
!
Conclusion: exposure to a real spider caused significantly more reported anxiety
in spider-phobes than exposure to a picture (t(11) = -2.47, p < 0.05).
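The t statistic in the table can be recomputed from the paired-differences summary alone (a sketch, using the values printed in the output above):

```python
import math

# Paired t statistic from summary statistics:
# mean difference -7.000, SD of differences 9.807, n = 12 pairs.
mean_diff, sd_diff, n = -7.000, 9.807, 12

se = sd_diff / math.sqrt(n)   # standard error of the mean difference
t = mean_diff / se            # t on df = n - 1 = 11

print(round(t, 2))  # -2.47, matching the SPSS output
```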
Hands on
• All students who enrol in a certain
memory course are given a pretest before
the course begins. At the completion of the
course they take a post-test; their scores
are listed here. Verify the results shown on
the output by calculating the values, and
assume normality.
Std     1   2   3   4   5   6   7   8   9   10
Before  93  86  72  54  92  65  80  81  62  73
After   98  92  80  62  91  78  89  78  71  80
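For checking by hand, the paired t statistic for these data can be computed directly from the differences (a stdlib-only sketch; variable names are ours):

```python
import math

before = [93, 86, 72, 54, 92, 65, 80, 81, 62, 73]
after  = [98, 92, 80, 62, 91, 78, 89, 78, 71, 80]

# Paired differences (After − Before)
diffs = [a - b for a, b in zip(after, before)]
n = len(diffs)
mean_d = sum(diffs) / n
# Sample SD of the differences (n − 1 in the denominator)
sd_d = math.sqrt(sum((d - mean_d) ** 2 for d in diffs) / (n - 1))
t = mean_d / (sd_d / math.sqrt(n))   # df = n - 1 = 9

print(round(mean_d, 1), round(t, 2))  # 6.1 4.02
```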
Independent t-test
• We have 12 spider-phobes who were
exposed to a picture of a spider and 12
different spider-phobes who were exposed
to a real live tarantula. Anxiety levels
were measured.
!
• Open spiderBG.sav
Group Statistics
         Spider or Picture?  N   Mean   Std. Deviation  Std. Error Mean
Anxiety  Picture             12  40.00  9.293           2.683
         Real Spider         12  47.00  11.029          3.184

Independent Samples Test
                                  Levene's Test   t-test for Equality of Means
                                  F      Sig.     t       df      Sig.        Mean        Std. Error  95% CI of the Difference
                                                                  (2-tailed)  Difference  Difference  Lower     Upper
Anxiety  Equal variances
         assumed                  .782   .386     -1.681  22      .107        -7.000      4.163       -15.634   1.634
         Equal variances
         not assumed                              -1.681  21.385  .107        -7.000      4.163       -15.649   1.649
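The equal-variances t statistic in the table can be reproduced from the group statistics alone with the pooled-variance formula (a sketch; values are taken from the Group Statistics output):

```python
import math

# Independent-samples t, equal variances assumed, from summary stats:
n1 = n2 = 12
m1, s1 = 40.00, 9.293    # picture group
m2, s2 = 47.00, 11.029   # real-spider group

# Pooled variance weights each group's variance by its df
pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
t = (m1 - m2) / se       # df = n1 + n2 - 2 = 22

print(round(t, 2))  # -1.68, matching the SPSS output
```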
Thank you

More Related Content

PDF
Multiple linear regression
PDF
Correlation in Statistics
PPTX
Introduction to Regression Analysis and R
PPT
Point Bicerial Correlation Coefficient
PPTX
PPTX
Correlation and Regression
ODP
Correlation
PPTX
Linear regression analysis
Multiple linear regression
Correlation in Statistics
Introduction to Regression Analysis and R
Point Bicerial Correlation Coefficient
Correlation and Regression
Correlation
Linear regression analysis

What's hot (18)

PPTX
Regression and corelation (Biostatistics)
PPT
Correlation and Regression
PPTX
Data analysis 1
ODP
Multiple linear regression II
PPTX
Regression analysis in R
PPTX
correlation and regression
PDF
Research Methodology Module-06
PDF
Introduction to correlation and regression analysis
PPT
Lesson 8 Linear Correlation And Regression
PDF
Correlations using SPSS
PPTX
Correlation and regression
PPT
regression and correlation
PPTX
Correlation and regression
PPTX
Correlation & Regression
PPT
Simple linear regressionn and Correlation
PPT
Correlation and regression
PPTX
Regression
PPTX
Pearson Correlation
Regression and corelation (Biostatistics)
Correlation and Regression
Data analysis 1
Multiple linear regression II
Regression analysis in R
correlation and regression
Research Methodology Module-06
Introduction to correlation and regression analysis
Lesson 8 Linear Correlation And Regression
Correlations using SPSS
Correlation and regression
regression and correlation
Correlation and regression
Correlation & Regression
Simple linear regressionn and Correlation
Correlation and regression
Regression
Pearson Correlation
Ad

Similar to Lect w8 w9_correlation_regression (20)

PPT
Correlation and Regression analysis .ppt
PPTX
Measure of Association
PPTX
Lecture 2_Chapter 4_Simple linear regression.pptx
PPTX
Correlation and Regression ppt
PPT
Regression and Co-Relation
PPTX
6 the six uContinuous data analysis.pptx
PPTX
Correlation _ Regression Analysis statistics.pptx
PPTX
Correlation.pptx
PPT
Stats For Life Module7 Oc
PPTX
3.3 correlation and regression part 2.pptx
PDF
Unit 1 Correlation- BSRM.pdf
PPTX
Introduction to Regression - The Importance.pptx
PPT
correlation and r3433333333333333333333333333333333333333333333333egratio111n...
PPTX
Fundamental of Statistics and Types of Correlations
PPT
Medical statistics2
PPT
Correlation analysis
PDF
Correlation and Regression
DOCX
EXERCISE 23 PEARSONS PRODUCT-MOMENT CORRELATION COEFFICIENT .docx
PPTX
simple and multiple linear Regression. (1).pptx
PDF
Deepak_DAI101_Data_Anal_lecture6 (1).pdf
Correlation and Regression analysis .ppt
Measure of Association
Lecture 2_Chapter 4_Simple linear regression.pptx
Correlation and Regression ppt
Regression and Co-Relation
6 the six uContinuous data analysis.pptx
Correlation _ Regression Analysis statistics.pptx
Correlation.pptx
Stats For Life Module7 Oc
3.3 correlation and regression part 2.pptx
Unit 1 Correlation- BSRM.pdf
Introduction to Regression - The Importance.pptx
correlation and r3433333333333333333333333333333333333333333333333egratio111n...
Fundamental of Statistics and Types of Correlations
Medical statistics2
Correlation analysis
Correlation and Regression
EXERCISE 23 PEARSONS PRODUCT-MOMENT CORRELATION COEFFICIENT .docx
simple and multiple linear Regression. (1).pptx
Deepak_DAI101_Data_Anal_lecture6 (1).pdf
Ad

More from Rione Drevale (20)

PPT
Risk financing
PPTX
Managing specialized risk_14
PDF
Arntzen
PDF
Banana acclimatization
DOCX
Strategic entrepreneurship tempelate
PPT
Chapter 2
PDF
Sign and symptoms in crops
PPT
Chapter 4 risk
PPT
Chapter 5 risk_
PPT
PPT
L3 amp l4_fpe3203
PPT
L2 fpe3203
PPT
L5 fpe3203 23_march_2015-1
PPT
Agricultural technology upscaling_1
PPT
Water science l3 available soil water 150912ed
PPT
Water science l2 cwr final full ed
PDF
W2 lab design_new2
PDF
W1 intro plant_tc
PPT
Risk management chpt 2
PPT
Risk management chpt 3 and 9
Risk financing
Managing specialized risk_14
Arntzen
Banana acclimatization
Strategic entrepreneurship tempelate
Chapter 2
Sign and symptoms in crops
Chapter 4 risk
Chapter 5 risk_
L3 amp l4_fpe3203
L2 fpe3203
L5 fpe3203 23_march_2015-1
Agricultural technology upscaling_1
Water science l3 available soil water 150912ed
Water science l2 cwr final full ed
W2 lab design_new2
W1 intro plant_tc
Risk management chpt 2
Risk management chpt 3 and 9

Recently uploaded (20)

PPTX
1st Inaugural Professorial Lecture held on 19th February 2020 (Governance and...
PDF
Classroom Observation Tools for Teachers
PDF
3rd Neelam Sanjeevareddy Memorial Lecture.pdf
PDF
O7-L3 Supply Chain Operations - ICLT Program
PDF
Black Hat USA 2025 - Micro ICS Summit - ICS/OT Threat Landscape
PPTX
GDM (1) (1).pptx small presentation for students
PDF
01-Introduction-to-Information-Management.pdf
PDF
The Lost Whites of Pakistan by Jahanzaib Mughal.pdf
PPTX
BOWEL ELIMINATION FACTORS AFFECTING AND TYPES
PPTX
human mycosis Human fungal infections are called human mycosis..pptx
PPTX
Introduction_to_Human_Anatomy_and_Physiology_for_B.Pharm.pptx
PPTX
PPT- ENG7_QUARTER1_LESSON1_WEEK1. IMAGERY -DESCRIPTIONS pptx.pptx
PPTX
Cell Types and Its function , kingdom of life
PDF
VCE English Exam - Section C Student Revision Booklet
PDF
Physiotherapy_for_Respiratory_and_Cardiac_Problems WEBBER.pdf
PDF
Computing-Curriculum for Schools in Ghana
PPTX
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
PDF
Abdominal Access Techniques with Prof. Dr. R K Mishra
PPTX
Final Presentation General Medicine 03-08-2024.pptx
PDF
FourierSeries-QuestionsWithAnswers(Part-A).pdf
1st Inaugural Professorial Lecture held on 19th February 2020 (Governance and...
Classroom Observation Tools for Teachers
3rd Neelam Sanjeevareddy Memorial Lecture.pdf
O7-L3 Supply Chain Operations - ICLT Program
Black Hat USA 2025 - Micro ICS Summit - ICS/OT Threat Landscape
GDM (1) (1).pptx small presentation for students
01-Introduction-to-Information-Management.pdf
The Lost Whites of Pakistan by Jahanzaib Mughal.pdf
BOWEL ELIMINATION FACTORS AFFECTING AND TYPES
human mycosis Human fungal infections are called human mycosis..pptx
Introduction_to_Human_Anatomy_and_Physiology_for_B.Pharm.pptx
PPT- ENG7_QUARTER1_LESSON1_WEEK1. IMAGERY -DESCRIPTIONS pptx.pptx
Cell Types and Its function , kingdom of life
VCE English Exam - Section C Student Revision Booklet
Physiotherapy_for_Respiratory_and_Cardiac_Problems WEBBER.pdf
Computing-Curriculum for Schools in Ghana
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
Abdominal Access Techniques with Prof. Dr. R K Mishra
Final Presentation General Medicine 03-08-2024.pptx
FourierSeries-QuestionsWithAnswers(Part-A).pdf

Lect w8 w9_correlation_regression

  • 1. Correlation, Regression & T-test Prepared By: Dr. Kumara Thevan a/l Krishnan
  • 2. Introduction Investigation on a relationship between two or more numerical or quantitative variables can be conducted using techniques of correlation and regression analysis. ! - Correlation is a statistical method used to determine whether a linear relationship between variable exist. ! - Simple linear regression is a statistical method used to described the nature of the relationship between two variables.
  • 3. Definition Scatterplot (or scatter diagram) i s a graph in which the paired (x,y) sample data are plotted with a horizontal x axis and a vertical y axis. ! Each individual (x,y) pair is plotted as a single point.
  • 4. Definition Correlation ! exists between two variables when one of them is related to the other in some way
  • 5. Example Open SPSS. Data ; weight height biometry male 2012.sav Graph Legacy dialogs=> scatter plot Height (X); Weight ( Y)
  • 6. BMI test Category BMI range – kg/m Very severely underweight less than 15 Severely underweight from 15.0 to 16.0 Underweight from 16.0 to 18.5 Normal (healthy weight) from 18.5 to 25 Overweight from 25 to 30 Obese Class I (Moderately from 30 to 35 obese) Obese Class II (Severely obese) from 35 to 40 Obese Class III (Very severely obese) over 40
  • 7. Normality test? • The Kolmogorov-Smirnov and Shapiro-Wilk test. • The compare the scores in the sample to a normally distributed set of scores with the same mean and s.d. • If p>0.05, the test is non-significant. Tells us that the distribution of the sample is not significantly different from a normal distribution. • The test is significant (p<0.05) then the distribution in question is significantly different from a normal distribution(non-normal)
  • 9. Positive Linear Correlation y y y x x x (a) Positive (b) Strong positive (c) Perfect positive
  • 10. Negative Linear Correlation y y y x x x (d) Negative (e) Strong negative (f) Perfect negative
  • 11. What can you see?
  • 12. Test 1 - Do scatter plot using data ExamAnxiety.sav - Exam performance (%) – y axis - Exam anxiety – x axis - Color – place gender - Results? ! - Try 3D plot – 3 variables
  • 13. Bivariate correlation • Having taken a preliminary glance at the data, we can proceed to conduct the correlation analysis.
  • 14. Definition ! Linear Correlation Coefficient r measures strength of the linear relationship between paired x- and y-quantitative values in a sample
  • 15. No Linear Correlation y y x x (g) No Correlation (h) Nonlinear Correlation
  • 16. Definition ! Linear Correlation Coefficient r sometimes referred to as the Pearson product moment correlation coefficient
  • 17. Notation for the Linear Correlation Coefficient n number of pairs of data presented. Σ denotes the addition of the items indicated. Σx denotes the sum of all x values. Σx2 indicates that each x score should be squared and then those squares added. (Σx)2 indicates that the x scores should be added and the total then squared. Σxy indicates that each x score should be first multiplied by its corresponding y score. After obtaining all such products, find their sum. r represents linear correlation coefficient for a sample ρ represents linear correlation coefficient for a population
  • 18. Definition Linear Correlation Coefficient r nΣxy - (Σx)(Σy) n(Σx2) - (Σx)2 n(Σy2) - (Σy)2 r =
  • 19. Test 2 • Run the correlation analysis ExamAnxiety.sav • Assumption – Data is normally distributed
  • 20. Results Correlations Exam Performance (%) Exam Anxiety Time Spent Revising Exam Performance (%) Pearson Correlation 1 -.441 .397 Sig. (1-tailed) .000 .000 N 103 103 103 Exam Anxiety Pearson Correlation -.441 1 -.709 Sig. (1-tailed) .000 .000 N 103 103 103 Time Spent Revising Pearson Correlation .397 -.709 1 Sig. (1-tailed) .000 .000 N 103 103 103 **. Correlation is significant at the 0.01 level (1-tailed).
  • 21. Interpretation • Exam performance is positively related to the amount of time spent revising, with a coefficient of r= 0.397, which is also significant at p< 0.01. ! • Exam anxiety appears to be negatively related to the time spent revising (r= -0.709, p< 0.01)
  • 22. Interpretation • Each variable is perfectly correlated with itself (r=1). ! • Exam performance is negatively related to exam anxiety with a Pearson correlation coefficient of r= - 0.441 and there is less than 0.01 probability that a correlation coeficient this big would have occurred by chance in a sample of 103 people.
  • 23. In layman term • exam anxiety , exam mark • Revision time , exam mark • Revision time , exam anxiety
  • 24. Hands on • Is there a linear association between weight and heart girth in this herd of cows? • Weight was measured in kg and heart girth in cm on 10 cows ! ! ! • Assume data is normally distributed
  • 26. • The sample coefficient of correlation is 0.704. The P value is 0.012, which is less than 0.05. The conclusion is that correlation exists in the population. Correlations Weight Girth Weight Pearson Correlation 1 .704 Sig. (1-tailed) .012 N 10 10 Girth Pearson Correlation .704 1 Sig. (1-tailed) .012 N 10 10 *. Correlation is significant at the 0.05 level (1-tailed).
  • 27. Using R2 for interpretation ( correlation coefficient) 2 = coefficient of determination, R2 ! R2 is a measure of the amount of variability in one variable that is explained by the other.
  • 28. Example Correlations Exam Performance (%) Exam Anxiety Time Spent Revising Exam Performance (%) Pearson Correlation 1 -.441 .397 Sig. (1-tailed) .000 .000 N 103 103 103 Exam Anxiety Pearson Correlation -.441 1 -.709 Sig. (1-tailed) .000 .000 N 103 103 103 Time Spent Revising Pearson Correlation .397 -.709 1 Sig. (1-tailed) .000 .000 N 103 103 103 **. Correlation is significant at the 0.01 level (1-tailed).
  • 29. Example Exam anxiety and exam performance • ( correlation coefficient) 2 = coefficient of determination, R2 ! R2 = ( -0.441) 2 = 0.194 ! • In % = 0.194 x 100 = 19.4%
  • 30. • Although exam anxiety was correlated with exam performance, it can account for only 19.4 % of variation in exam scores. ! • 80.6% of the variability to be accounted for other variables such as different ability, different level of preparation and so on…)
  • 31. Hands on Subject Age, x Pressure, y A 43 128 B 48 120 C 56 135 D 61 143 E 67 141 F 70 152 Compute the value of the correlation coefficient for the data? Do you have enough Statistical evidence that this relationship does not occur by chance?
  • 32. Correlations Age Pressure Age Pearson Correlation 1 .897 Sig. (2-tailed) .015 N 6 6 Pressure Pearson Correlation .897 1 Sig. (2-tailed) .015 N 6 6 *. Correlation is significant at the 0.05 level (2-tailed). R 2 = ?
  • 33. Regression Correlation do not provide the predictive power of variables. ! In regression analysis we fit a predictive model to our data and use that model to predict values of the dependent variable from one or more independent variables.
  • 34. Independent V. Dependent • Intentionally manipulated • Controlled • Vary at known rate • Cause • Intentionally left alone • Measured • Vary at unknown rate • Effect
  • 35. • Simple regression seeks to predict an outcome variable from a single predictor variable whereas multiple regression seeks to predict an outcome from several predictors. ! Outcomei = (Modeli) + errori Yi = (bo + b1 xi ) + ei
  • 37. Least squares Least squares is a method of finding the line that best fits the data. ! This “line of best fit” is found by ascertaining which line, of all of the possible lines that could be drawn, results in the least amount of difference between the observed data points and the line.
  • 38. The vertical lines (dashed) represent the differences (or residuals) between the line and the actual data
  • 39. • “The best fit line” – there will be small differences between the values predicted by the line and the data that were actually observed. ! • Our interest- in the vertical differences between the line and the actual data because we are using the line to predict values of Y from values of the X-variable. ! • Some data fall above or below the line, indicating there is difference between the model fitted to these data and the data collected.
  • 40. • These difference called “residuals”. • If the “residuals” +ve and –ve cancelled each other ! How ? ! • Square the differences before adding up. • If the squared differences are large, the line is not representative of the data; if the squared differences is small then is representative.
  • 41. Total sum of squares, SST SST uses the differences between the observed data and the mean value of Y
  • 42. • The sum of squared differences (SS) can be calculated for any line that is fitted to some data; the “goodness of fit” of each line can then be compared by looking at the sum of squares for each. ! • The method of least squares works by selecting the line that has the lowest sum of squared differences(so it chooses the line that best represents the observed data) ! • This “line of best fit” known as a regression line.
  • 43. Residual sum of squares, SSR SSR uses the differences between the observed data and the regression line
  • 44. Model sum of squares, SSM SSM uses the differences between the mean value of Y and the regression line
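The three sums of squares on these slides can be computed directly, and for an OLS fit they satisfy the partition SST = SSM + SSR. A sketch on hypothetical data:

```python
def sums_of_squares(x, y):
    """Return (sst, ssm, ssr) for an OLS fit of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    b1 = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
          / sum((a - mean_x) ** 2 for a in x))
    b0 = mean_y - b1 * mean_x
    fitted = [b0 + b1 * xi for xi in x]
    sst = sum((yi - mean_y) ** 2 for yi in y)                  # observed vs mean of Y
    ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))     # observed vs regression line
    ssm = sum((fi - mean_y) ** 2 for fi in fitted)             # regression line vs mean of Y
    return sst, ssm, ssr

# Hypothetical data
sst, ssm, ssr = sums_of_squares([1, 2, 3, 4], [2, 5, 4, 9])
```

Here SST = 26, SSM = 20 and SSR = 6, so the model accounts for 20/26 of the total variation.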
  • 45. F-ratio F = MSM / MSR ! MSM (mean square for the model) = SSM / (number of variables in the model)
  • 46. F-ratio F = MSM / MSR ! MSR (mean square for the residuals) = SSR / (number of observations − number of parameters being estimated)
  • 47. F-ratio • A good model should have a large F-ratio (greater than 1 at least), because the systematic variance explained by the model (MSM) should exceed the unsystematic, residual variance (MSR).
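Putting the two mean squares together, the F-ratio can be sketched as a small function. The SSM and SSR values below are hypothetical; for k predictors, k + 1 parameters (slopes plus intercept) are estimated, so the residual degrees of freedom are n − k − 1.

```python
def f_ratio(ssm, ssr, k, n):
    """F = MSM / MSR for a regression with k predictors and n observations.
    MSM = SSM / k; MSR = SSR / (n - k - 1)."""
    msm = ssm / k
    msr = ssr / (n - k - 1)
    return msm / msr

# Hypothetical single-predictor example: SSM = 20, SSR = 6, n = 4
f = f_ratio(20, 6, k=1, n=4)
```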
  • 48. Test 1 Open sample date – Record1.sav Graph=> scatterplot ! Analyze=> regression
  • 50. Model Summary
Model 1: R = .578, R Square = .335, Adjusted R Square = .331, Std. Error of the Estimate = 65.991
a. Predictors: (Constant), Advertsing Budget (thousands of pounds)
  • 51. Interpretation • The value of R2 is 0.335, which tells us that advertising expenditure can account for 33.5% of the variation in record sales. ! • This means that 66.5% of the variation in record sales cannot be explained by advertising alone
  • 52. The F-ratio is 99.587, which is significant at p < 0.001 (because the value in the column labelled Sig. is less than 0.001). ! This result tells us there is less than a 0.1% chance that an F-ratio this large would arise by chance alone. Overall, the regression model predicts record sales significantly well.
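These figures can be reproduced from the sums of squares SPSS reports in the ANOVA table for this model (SSM = 433687.833 on 1 df, SSR = 862264.167 on 198 df), using R² = SSM/SST and F = MSM/MSR:

```python
# Reproducing R^2 and the F-ratio for the Record1.sav model
# (advertising budget predicting record sales) from SPSS's ANOVA output.
ssm, df_m = 433687.833, 1      # regression (model) sum of squares
ssr, df_r = 862264.167, 198    # residual sum of squares
sst = ssm + ssr                # total sum of squares (1295952.000)

r_squared = ssm / sst              # proportion of variance explained
f = (ssm / df_m) / (ssr / df_r)    # F-ratio
```

This recovers R² ≈ 0.335 and F ≈ 99.59, matching the SPSS output.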
  • 53. Multiple regression • Open data ; Record2.sav
  • 58. Results — Descriptive Statistics (N = 200 for all variables)
Record Sales (thousands): Mean = 193.20, Std. Deviation = 80.699
Advertsing Budget (thousands of pounds): Mean = 614.41, Std. Deviation = 485.655
No. of plays on Radio 1 per week: Mean = 27.50, Std. Deviation = 12.270
Attractiveness of Band: Mean = 6.77, Std. Deviation = 1.395
  • 59. Correlations (N = 200 for every pair)
Pearson correlations with Record Sales (thousands): Advertsing Budget r = .578, No. of plays on Radio 1 per week r = .599, Attractiveness of Band r = .326 (all Sig. 1-tailed = .000)
Correlations among predictors: Advertsing Budget & Radio 1 plays r = .102 (Sig. 1-tailed = .076); Advertsing Budget & Attractiveness r = .081 (Sig. = .128); Radio 1 plays & Attractiveness r = .182 (Sig. = .005)
  • 60. Model Summary (simple regression)
Model 1: R = .578, R Square = .335, Adjusted R Square = .331, Std. Error of the Estimate = 65.991
a. Predictors: (Constant), Advertsing Budget (thousands of pounds)
Model Summary (multiple regression)
Model 1: R = .815, R Square = .665, Adjusted R Square = .660, Std. Error of the Estimate = 47.087; Change Statistics: R Square Change = .665, F Change = 129.498, df1 = 3, df2 = 196, Sig. F Change = .000; Durbin-Watson = 1.950
a. Predictors: (Constant), Attractiveness of Band, Advertsing Budget (thousands of pounds), No. of plays on Radio 1 per week
b. Dependent Variable: Record Sales (thousands)
  • 61. ANOVA (simple regression)
Regression: Sum of Squares = 433687.833, df = 1, Mean Square = 433687.833, F = 99.587, Sig. = .000
Residual: Sum of Squares = 862264.167, df = 198, Mean Square = 4354.870
Total: Sum of Squares = 1295952.000, df = 199
a. Predictors: (Constant), Advertsing Budget (thousands of pounds)
b. Dependent Variable: Record Sales (thousands)
ANOVA (multiple regression)
Regression: Sum of Squares = 861377.418, df = 3, Mean Square = 287125.806, F = 129.498, Sig. = .000
Residual: Sum of Squares = 434574.582, df = 196, Mean Square = 2217.217
Total: Sum of Squares = 1295952.000, df = 199
a. Predictors: (Constant), Attractiveness of Band, Advertsing Budget (thousands of pounds), No. of plays on Radio 1 per week
b. Dependent Variable: Record Sales (thousands)
  • 62. Hands on • Open file softdrinks.sav ! • Do multiple regression analysis ! • Y – dependent – delivery time
  • 63. Results — Model Summary
Model 1: R = .980, R Square = .960, Adjusted R Square = .956, Std. Error of the Estimate = 3.25947
a. Predictors: (Constant), distance, cases
ANOVA
Regression: Sum of Squares = 5550.811, df = 2, Mean Square = 2775.405, F = 261.235, Sig. = .000
Residual: Sum of Squares = 233.732, df = 22, Mean Square = 10.624
Total: Sum of Squares = 5784.543, df = 24
a. Predictors: (Constant), distance, cases
b. Dependent Variable: time
  • 64. Coefficients
(Constant): B = 2.341, Std. Error = 1.097, t = 2.135, Sig. = .044
cases: B = 1.616, Std. Error = .171, Standardized Beta = .716, t = 9.464, Sig. = .000
distance: B = .014, Std. Error = .004, Standardized Beta = .301, t = 3.981, Sig. = .001
a. Dependent Variable: time
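The unstandardized B coefficients above give the fitted equation time = 2.341 + 1.616·cases + 0.014·distance, which can be used for prediction. A sketch, with a purely hypothetical delivery (10 cases, distance 500, in the units of the softdrinks data):

```python
# Prediction from the fitted softdrinks multiple regression
# (coefficients taken from the SPSS Coefficients table above).
def predicted_delivery_time(cases, distance):
    """Predicted delivery time: B0 + B1*cases + B2*distance."""
    return 2.341 + 1.616 * cases + 0.014 * distance

# Hypothetical delivery: 10 cases over a distance of 500
t_hat = predicted_delivery_time(10, 500)
```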
  • 65. T-test • Testing differences between means ! • Dependent means t-test: used when there are two experimental conditions and the same participants took part in both conditions of the experiment. ! • Independent means t-test: used when there are two experimental conditions and different participants were assigned to each condition.
  • 66. Dependent t-test • 12 spider-phobes who were exposed to a picture of a spider (picture) and on a separate occasion a real live tarantula (real). Their anxiety was measured in each condition (half of the participants were exposed to the picture before the real spider while the other half were exposed to the real spider first). • Which situation caused more anxiety? ! ! ! ! • Open spiderRM.sav
  • 67. Results — Paired Samples Statistics
Picture of Spider: Mean = 40.00, N = 12, Std. Deviation = 9.293, Std. Error Mean = 2.683
Real Spider: Mean = 47.00, N = 12, Std. Deviation = 11.029, Std. Error Mean = 3.184
Paired Samples Correlations
Pair 1, Picture of Spider & Real Spider: N = 12, Correlation = .545, Sig. = .067
r = 0.545; the two conditions are not significantly correlated (p > 0.05)
  • 68. Paired Samples Test
Pair 1, Picture of Spider − Real Spider: Mean = −7.000, Std. Deviation = 9.807, Std. Error Mean = 2.831, 95% Confidence Interval of the Difference = (−13.231, −.769), t = −2.473, df = 11, Sig. (2-tailed) = .031
The negative t-value tells us that the picture condition had a smaller mean than the real tarantula, so the real spider led to greater anxiety than the picture. ! Conclusion: exposure to a real spider caused significantly more reported anxiety in spider-phobes than exposure to a picture (t(11) = −2.47, p < 0.05)
  • 69. Hands on • All students who enrol in a certain memory course are given a pretest before the course begins. At the completion of the course they take a post-test; both sets of scores are listed here. Verify the results shown on the output by calculating the values yourself, assuming normality.
Std:    1  2  3  4  5  6  7  8  9  10
Before: 93 86 72 54 92 65 80 81 62 73
After:  98 92 80 62 91 78 89 78 71 80
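As a check on this hands-on exercise, the paired (dependent) t statistic can be computed by hand from the differences d_i = after_i − before_i, following the same logic as the SPSS Paired Samples Test:

```python
import math

# Paired (dependent) t-test on the memory-course scores from the slide.
before = [93, 86, 72, 54, 92, 65, 80, 81, 62, 73]
after = [98, 92, 80, 62, 91, 78, 89, 78, 71, 80]

d = [a - b for a, b in zip(after, before)]        # per-student differences
n = len(d)
mean_d = sum(d) / n                               # mean difference
var_d = sum((x - mean_d) ** 2 for x in d) / (n - 1)  # sample variance of d
se_d = math.sqrt(var_d / n)                       # standard error of mean difference
t = mean_d / se_d                                 # t statistic with n - 1 = 9 df
```

This gives a mean improvement of 6.1 points and t ≈ 4.02 on 9 degrees of freedom, so the course appears to raise scores significantly.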
  • 70. Independent t-test • We have 12 spider-phobes who were exposed to a picture of a spider and 12 different spider-phobes who were exposed to a real live tarantula. Anxiety levels were measured. ! • Open spiderBG.sav
  • 71. Group Statistics
Anxiety, Picture: N = 12, Mean = 40.00, Std. Deviation = 9.293, Std. Error Mean = 2.683
Anxiety, Real Spider: N = 12, Mean = 47.00, Std. Deviation = 11.029, Std. Error Mean = 3.184
Independent Samples Test
Levene's Test for Equality of Variances: F = .782, Sig. = .386
t-test for Equality of Means, equal variances assumed: t = −1.681, df = 22, Sig. (2-tailed) = .107, Mean Difference = −7.000, Std. Error Difference = 4.163, 95% Confidence Interval of the Difference = (−15.634, 1.634)
Equal variances not assumed: t = −1.681, df = 21.385, Sig. (2-tailed) = .107, Mean Difference = −7.000, Std. Error Difference = 4.163, 95% CI = (−15.649, 1.649)
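The equal-variances t value SPSS reports can be reproduced from the group statistics alone, using the pooled-variance formula:

```python
import math

# Independent-means t-test (equal variances assumed), computed from the
# group statistics above: Picture (n=12, mean 40.00, sd 9.293) vs
# Real Spider (n=12, mean 47.00, sd 11.029).
def independent_t(mean1, sd1, n1, mean2, sd2, n2):
    """Pooled-variance t statistic; df = n1 + n2 - 2."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se

t = independent_t(40.00, 9.293, 12, 47.00, 11.029, 12)
```

This recovers t ≈ −1.68 on 22 df, matching the SPSS output; with p = .107 the difference between groups is not significant, in contrast to the dependent design, which illustrates the extra power of repeated measures.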