Correlation, Regression & T-test 
Prepared By: Dr. Kumara Thevan a/l 
Krishnan
Introduction 
An investigation of a relationship between two or more
numerical (quantitative) variables can be conducted
using the techniques of correlation and regression analysis.
!
- Correlation is a statistical method used to determine
whether a linear relationship between variables exists.
!
- Simple linear regression is a statistical method used to
describe the nature of the relationship between two
variables.
Definition 
A scatterplot (or scatter diagram) is a
graph in which the paired (x, y)
sample data are plotted with a
horizontal x axis and a vertical y axis.
! 
Each individual (x,y) pair is plotted as a 
single point.
Definition 
Correlation 
! 
exists between two variables 
when one of them is related to 
the other in some way
Example 
Open SPSS.
Data: weight height biometry male 2012.sav
Graphs
Legacy Dialogs => scatter plot
Height (X); Weight (Y)
BMI test
Category                              BMI range (kg/m²)
Very severely underweight             less than 15
Severely underweight                  from 15.0 to 16.0
Underweight                           from 16.0 to 18.5
Normal (healthy weight)               from 18.5 to 25
Overweight                            from 25 to 30
Obese Class I (Moderately obese)      from 30 to 35
Obese Class II (Severely obese)       from 35 to 40
Obese Class III (Very severely obese) over 40
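The classification table above can be sketched as a small Python function (illustrative only; the function names are ours, and the handling of values exactly on a boundary is an assumption, since the slide's ranges overlap at the cut-offs):

```python
def bmi(weight_kg, height_m):
    """Body mass index in kg/m^2."""
    return weight_kg / height_m ** 2

def bmi_category(b):
    """Map a BMI value to the category names used in the table above.
    Boundary values are assigned to the higher category (an assumption)."""
    if b < 15:   return "Very severely underweight"
    if b < 16:   return "Severely underweight"
    if b < 18.5: return "Underweight"
    if b < 25:   return "Normal (healthy weight)"
    if b < 30:   return "Overweight"
    if b < 35:   return "Obese Class I (Moderately obese)"
    if b < 40:   return "Obese Class II (Severely obese)"
    return "Obese Class III (Very severely obese)"

# Example: 70 kg at 1.75 m gives BMI ≈ 22.86
print(bmi_category(bmi(70, 1.75)))  # Normal (healthy weight)
```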
Normality test?
• The Kolmogorov-Smirnov and Shapiro-Wilk tests.
• These compare the scores in the sample to a
normally distributed set of scores with the same
mean and s.d.
• If p > 0.05, the test is non-significant. This tells us
that the distribution of the sample is not
significantly different from a normal
distribution.
• If the test is significant (p < 0.05), then the
distribution in question is significantly different
from a normal distribution (non-normal).
What can you see?
Positive Linear Correlation
[Three scatterplots of y against x: (a) Positive, (b) Strong positive, (c) Perfect positive]
Negative Linear Correlation
[Three scatterplots of y against x: (d) Negative, (e) Strong negative, (f) Perfect negative]
What can you see?
Test 1 
- Draw a scatter plot using the data in
ExamAnxiety.sav
- Exam performance (%) – y axis
- Exam anxiety – x axis
- Colour – set markers by gender
- Results?
! 
- Try 3D plot – 3 variables
Bivariate correlation 
• Having taken a preliminary glance at the 
data, we can proceed to conduct the 
correlation analysis.
Definition 
! 
Linear Correlation Coefficient r 
measures the strength of the linear
relationship between paired x and y
quantitative values in a sample
No Linear Correlation
[Two scatterplots of y against x: (g) No Correlation, (h) Nonlinear Correlation]
Definition 
! 
Linear Correlation Coefficient r 
sometimes referred to as the 
Pearson product moment correlation 
coefficient
Notation for the
Linear Correlation Coefficient
n      number of pairs of data presented.
Σ      denotes the addition of the items indicated.
Σx     denotes the sum of all x values.
Σx²    indicates that each x score should be squared and then
those squares added.
(Σx)²  indicates that the x scores should be added and the total
then squared.
Σxy    indicates that each x score should first be multiplied by its
corresponding y score. After obtaining all such products, find their sum.
r      represents the linear correlation coefficient for a sample
ρ      represents the linear correlation coefficient for a population
Definition
Linear Correlation Coefficient r

r = [nΣxy − (Σx)(Σy)] / √{ [nΣx² − (Σx)²] [nΣy² − (Σy)²] }
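As a sanity check outside SPSS, the formula above translates directly into Python (a sketch; the function name `pearson_r` is ours):

```python
import math

def pearson_r(xs, ys):
    """Sample linear correlation coefficient, computed from the slide's
    formula: r = [nΣxy − ΣxΣy] / sqrt([nΣx² − (Σx)²][nΣy² − (Σy)²])."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return (n * sxy - sx * sy) / math.sqrt(
        (n * sxx - sx ** 2) * (n * syy - sy ** 2))

# A perfect positive linear relationship gives r = 1
print(pearson_r([1, 2, 3], [2, 4, 6]))  # 1.0
```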
Test 2 
• Run the correlation analysis 
ExamAnxiety.sav 
• Assumption – Data is normally distributed
Results
Correlations
                       Exam Performance (%)  Exam Anxiety  Time Spent Revising
Exam Performance (%)
  Pearson Correlation  1                     -.441         .397
  Sig. (1-tailed)                            .000          .000
  N                    103                   103           103
Exam Anxiety
  Pearson Correlation  -.441                 1             -.709
  Sig. (1-tailed)      .000                                .000
  N                    103                   103           103
Time Spent Revising
  Pearson Correlation  .397                  -.709         1
  Sig. (1-tailed)      .000                  .000
  N                    103                   103           103
**. Correlation is significant at the 0.01 level (1-tailed).
Interpretation 
• Exam performance is positively related to 
the amount of time spent revising, with a 
coefficient of r= 0.397, which is also 
significant at p< 0.01. 
! 
• Exam anxiety appears to be negatively 
related to the time spent revising (r= 
-0.709, p< 0.01)
Interpretation 
• Each variable is perfectly correlated with 
itself (r=1). 
! 
• Exam performance is negatively related to
exam anxiety, with a Pearson correlation
coefficient of r = -0.441, and there is less
than a 0.01 probability that a correlation
coefficient this big would have occurred by
chance in a sample of 103 people.
In layman's terms
• Higher exam anxiety, lower exam mark
• More revision time, higher exam mark
• More revision time, lower exam anxiety
Hands on 
• Is there a linear association between 
weight and heart girth in this herd of cows? 
• Weight was measured in kg and heart girth 
in cm on 10 cows 
! 
! 
! 
• Assume data is normally distributed
• The sample coefficient of correlation is
0.704. The P value is 0.012, which is less
than 0.05. The conclusion is that a
correlation exists in the population.
Correlations
        Weight  Girth
Weight  Pearson Correlation  1      .704
        Sig. (1-tailed)             .012
        N                    10     10
Girth   Pearson Correlation  .704   1
        Sig. (1-tailed)      .012
        N                    10     10
*. Correlation is significant at the 0.05 level (1-tailed).
Using R² for interpretation
(correlation coefficient)² = coefficient of
determination, R²
!
R² is a measure of the amount of variability
in one variable that is explained by the
other.
Example
Correlations
                       Exam Performance (%)  Exam Anxiety  Time Spent Revising
Exam Performance (%)
  Pearson Correlation  1                     -.441         .397
  Sig. (1-tailed)                            .000          .000
  N                    103                   103           103
Exam Anxiety
  Pearson Correlation  -.441                 1             -.709
  Sig. (1-tailed)      .000                                .000
  N                    103                   103           103
Time Spent Revising
  Pearson Correlation  .397                  -.709         1
  Sig. (1-tailed)      .000                  .000
  N                    103                   103           103
**. Correlation is significant at the 0.01 level (1-tailed).
Example
Exam anxiety and exam performance
• (correlation coefficient)² = coefficient of
determination, R²
!
R² = (−0.441)² = 0.194
!
• In % = 0.194 × 100 = 19.4%
• Although exam anxiety was correlated
with exam performance, it can account
for only 19.4% of the variation in exam
scores.
!
• 80.6% of the variability is to be accounted
for by other variables, such as different
ability, different levels of preparation and
so on.
Hands on 
Subject Age, x Pressure, y 
A 43 128 
B 48 120 
C 56 135 
D 61 143 
E 67 141 
F 70 152 
Compute the value of the correlation coefficient for the data.
Do you have enough statistical evidence that this relationship
does not occur by chance?
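Under the normality assumption, this hands-on exercise can be checked by hand with the Pearson formula from earlier (a sketch; the variable names are ours):

```python
import math

# Age/blood-pressure data from the table above
age      = [43, 48, 56, 61, 67, 70]
pressure = [128, 120, 135, 143, 141, 152]

n = len(age)
sx, sy = sum(age), sum(pressure)
sxy = sum(x * y for x, y in zip(age, pressure))
sxx = sum(x * x for x in age)
syy = sum(y * y for y in pressure)

# r = [nΣxy − ΣxΣy] / sqrt([nΣx² − (Σx)²][nΣy² − (Σy)²])
r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx**2) * (n * syy - sy**2))
print(round(r, 3))      # 0.897, matching the SPSS output on the next slide
print(round(r**2, 3))   # R² ≈ 0.804
```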
Correlations
          Age    Pressure
Age       Pearson Correlation  1      .897
          Sig. (2-tailed)             .015
          N                    6      6
Pressure  Pearson Correlation  .897   1
          Sig. (2-tailed)      .015
          N                    6      6
*. Correlation is significant at the 0.05 level (2-tailed).
R² = ?
Regression
Correlation does not provide the predictive
power of variables.
! 
In regression analysis we fit a predictive 
model to our data and use that model to 
predict values of the dependent variable 
from one or more independent variables.
Independent vs. Dependent
Independent variable:
• Intentionally manipulated
• Controlled
• Varies at a known rate
• Cause
Dependent variable:
• Intentionally left alone
• Measured
• Varies at an unknown rate
• Effect
• Simple regression seeks to predict an 
outcome variable from a single predictor 
variable whereas multiple regression 
seeks to predict an outcome from several 
predictors. 
! 
Outcomei = (Modeli) + errori 
Yi = (b0 + b1Xi) + ei
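The intercept b0 and slope b1 of this model can be estimated by the method of least squares described below; a minimal stdlib-only sketch (the function name is ours):

```python
def least_squares(xs, ys):
    """Fit Yi = b0 + b1*Xi by ordinary least squares,
    i.e. the line minimising the sum of squared residuals."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b1 = sxy / sxx          # slope
    b0 = my - b1 * mx       # intercept
    return b0, b1

# Data lying exactly on the line y = 1 + 2x are recovered exactly
b0, b1 = least_squares([1, 2, 3, 4], [3, 5, 7, 9])
print(b0, b1)  # 1.0 2.0
```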
Least squares 
Least squares is a method of finding the line 
that best fits the data. 
! 
This “line of best fit” is found by 
ascertaining which line, of all of the 
possible lines that could be drawn, results 
in the least amount of difference between 
the observed data points and the line.
The vertical lines (dashed) represent the differences (or residuals) 
between the line and the actual data
• “The best fit line” – there will be small 
differences between the values predicted by the 
line and the data that were actually observed. 
! 
• Our interest is in the vertical differences
between the line and the actual data, because
we are using the line to predict values of Y from
values of the X-variable.
!
• Some data fall above or below the line,
indicating there is a difference between the
model fitted to these data and the data
collected.
• These differences are called “residuals”.
• If simply summed, the positive and negative
residuals cancel each other out.
!
How to avoid this?
!
• Square the differences before adding them up.
• If the squared differences are large, the
line is not representative of the data; if the
squared differences are small, then it is
representative.
Total sum of squares, SST 
SST uses the differences between the observed data and the mean value of Y
• The sum of squared differences (SS) can be 
calculated for any line that is fitted to some 
data; the “goodness of fit” of each line can 
then be compared by looking at the sum of 
squares for each. 
! 
• The method of least squares works by
selecting the line that has the lowest sum of
squared differences (so it chooses the line that
best represents the observed data).
!
• This “line of best fit” is known as a regression
line.
Residual sum of squares, SSR 
SSR uses the differences between the observed data and the regression line
Model sum of squares, SSM
SSM uses the differences between the mean
value of Y and the regression line
F-ratio
F = MSM / MSR
!
MSM (mean square for the model)
!
= SSM / (number of variables in the model)
F-ratio
F = MSM / MSR
!
MSR (mean square for the residuals)
!
= SSR / (number of observations − number of
parameters being estimated)
F-ratio
• a good model should have a large F-ratio 
(greater than 1 at least)
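The two formulas above can be checked against the sums of squares that appear later in the Record1.sav ANOVA output (a sketch; the figures are taken from that table):

```python
# F-ratio from sums of squares (simple regression on Record1.sav):
# SSM = 433687.833 with 1 predictor; SSR = 862264.167 with
# 198 residual df (n − parameters = 200 − 2).
ss_m, df_m = 433687.833, 1
ss_r, df_r = 862264.167, 198

ms_m = ss_m / df_m   # mean square for the model
ms_r = ss_r / df_r   # mean square for the residuals
f = ms_m / ms_r

print(round(ms_r, 2), round(f, 2))  # 4354.87 99.59, as in the SPSS output
```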
Test 1
Open sample data – Record1.sav
Graphs => scatterplot
!
Analyze => Regression
Model Summary
Model  R     R Square  Adjusted R Square  Std. Error of the Estimate
1      .578  .335      .331               65.991
a. Predictors: (Constant), Advertsing Budget (thousands of pounds)
Interpretation
• The value of R² is 0.335, which tells us that
advertising expenditure can account for
33.5% of the variation in record sales.
!
• This means that 66.5% of the variation in
record sales cannot be explained by
advertising alone.
The F ratio is 99.59, which is significant at p < 0.001 (because the value in the
column labelled Sig. is less than 0.001).
!
This result tells us there is less than a 0.1% chance that an F ratio this
large would happen by chance alone. Overall, the regression model predicts record
sales significantly well.
Multiple regression 
• Open data ; Record2.sav
Results
Descriptive Statistics
                                         Mean    Std. Deviation  N
Record Sales (thousands)                 193.20  80.699          200
Advertsing Budget (thousands of pounds)  614.41  485.655         200
No. of plays on Radio 1 per week         27.50   12.270          200
Attractiveness of Band                   6.77    1.395           200
Correlations
                                   Record Sales  Advertsing   No. of plays on   Attractiveness
                                   (thousands)   Budget       Radio 1 per week  of Band
Pearson Correlation
  Record Sales (thousands)         1.000         .578         .599              .326
  Advertsing Budget                .578          1.000        .102              .081
  No. of plays on Radio 1 per week .599          .102         1.000             .182
  Attractiveness of Band           .326          .081         .182              1.000
Sig. (1-tailed)
  Record Sales (thousands)         .             .000         .000              .000
  Advertsing Budget                .000          .            .076              .128
  No. of plays on Radio 1 per week .000          .076         .                 .005
  Attractiveness of Band           .000          .128         .005              .
N                                  200           200          200               200
(N = 200 for every cell)
Model Summary (simple regression)
Model  R     R Square  Adjusted R Square  Std. Error of the Estimate
1      .578  .335      .331               65.991
a. Predictors: (Constant), Advertsing Budget (thousands of pounds)

Model Summary (multiple regression)
Model  R     R Square  Adjusted R  Std. Error of  R Square  F Change  df1  df2  Sig. F   Durbin-
                       Square      the Estimate   Change                        Change   Watson
1      .815  .665      .660        47.087         .665      129.498   3    196  .000     1.950
a. Predictors: (Constant), Attractiveness of Band, Advertsing Budget (thousands of pounds), No. of plays on
Radio 1 per week
b. Dependent Variable: Record Sales (thousands)
ANOVA (simple regression)
Model         Sum of Squares  df   Mean Square  F       Sig.
1 Regression  433687.833      1    433687.833   99.587  .000
  Residual    862264.167      198  4354.870
  Total       1295952.000     199
a. Predictors: (Constant), Advertsing Budget (thousands of pounds)
b. Dependent Variable: Record Sales (thousands)

ANOVA (multiple regression)
Model         Sum of Squares  df   Mean Square  F        Sig.
1 Regression  861377.418      3    287125.806   129.498  .000
  Residual    434574.582      196  2217.217
  Total       1295952.000     199
a. Predictors: (Constant), Attractiveness of Band, Advertsing Budget
(thousands of pounds), No. of plays on Radio 1 per week
b. Dependent Variable: Record Sales (thousands)
Hands on 
• Open file softdrinks.sav 
! 
• Do a multiple regression analysis
!
• Y (dependent variable) – delivery time
Results
Model Summary
Model  R     R Square  Adjusted R Square  Std. Error of the Estimate
1      .980  .960      .956               3.25947
a. Predictors: (Constant), distance, cases

ANOVA
Model         Sum of Squares  df  Mean Square  F        Sig.
1 Regression  5550.811        2   2775.405     261.235  .000
  Residual    233.732         22  10.624
  Total       5784.543        24
a. Predictors: (Constant), distance, cases
b. Dependent Variable: time
Coefficients
Model         Unstandardized B  Std. Error  Standardized Beta  t      Sig.
1 (Constant)  2.341             1.097                          2.135  .044
  cases       1.616             .171        .716               9.464  .000
  distance    .014              .004        .301               3.981  .001
a. Dependent Variable: time
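The fitted coefficients above give the prediction equation time = 2.341 + 1.616·cases + 0.014·distance. A quick sketch of using it (the coefficients are rounded as printed, so predictions are approximate, and the example input values are ours):

```python
# Prediction equation from the coefficients table above
# (B values rounded as printed in the SPSS output).
def predicted_time(cases, distance):
    return 2.341 + 1.616 * cases + 0.014 * distance

# Hypothetical delivery: 10 cases, distance 500
print(round(predicted_time(10, 500), 2))  # 25.5
```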
T-test 
• Testing differences between means 
! 
• Dependent means t-test: used when there are 
two experimental conditions and the same 
participants took part in both conditions of 
the experiment. 
! 
• Independent means t-test: used when there 
are two experimental conditions and different 
participants were assigned to each condition.
Dependent t-test 
• 12 spider-phobes who were exposed to a picture of a 
spider (picture) and on a separate occasion a real live 
tarantula (real). Their anxiety was measured in each 
condition (half of the participants were exposed to the 
picture before the real spider while the other half were 
exposed to the real spider first). 
• Which situation caused more anxiety? 
! 
! 
! 
! 
• Open spiderRM.sav
Results
Paired Samples Statistics
                           Mean   N   Std. Deviation  Std. Error Mean
Pair 1  Picture of Spider  40.00  12  9.293           2.683
        Real Spider        47.00  12  11.029          3.184

Paired Samples Correlations
                                         N   Correlation  Sig.
Pair 1  Picture of Spider & Real Spider  12  .545         .067
r = 0.545, not significantly correlated (p > 0.05)
Paired Samples Test
Paired Differences
                            Mean    Std.       Std. Error  95% CI of the Difference  t       df  Sig.
                                    Deviation  Mean        Lower      Upper                      (2-tailed)
Pair 1  Picture of Spider
        - Real Spider       -7.000  9.807      2.831       -13.231    -.769           -2.473  11  .031

The t-value is negative; this tells us that the picture had a smaller mean than the
real tarantula, so the real spider led to greater anxiety than the picture.
!
Conclusion: exposure to a real spider caused significantly more reported anxiety
in spider-phobes than exposure to a picture (t(11) = -2.47, p < 0.05).
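The t statistic in the table can be recomputed from the paired-differences summary alone (a sketch, using the values printed in the output above):

```python
import math

# Paired t statistic from summary statistics:
# mean difference -7.000, SD of differences 9.807, n = 12 pairs.
mean_diff, sd_diff, n = -7.000, 9.807, 12

se = sd_diff / math.sqrt(n)   # standard error of the mean difference
t = mean_diff / se            # t on df = n - 1 = 11

print(round(t, 2))  # -2.47, matching the SPSS output
```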
Hands on
• All students who enrol in a certain
memory course are given a pretest before
the course begins. At the completion of the
course they take a post-test; their scores
are listed here. Verify the results shown on
the output by calculating the values, and
assume normality.
Std     1   2   3   4   5   6   7   8   9   10
Before  93  86  72  54  92  65  80  81  62  73
After   98  92  80  62  91  78  89  78  71  80
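For checking by hand, the paired t statistic for these data can be computed directly from the differences (a stdlib-only sketch; variable names are ours):

```python
import math

before = [93, 86, 72, 54, 92, 65, 80, 81, 62, 73]
after  = [98, 92, 80, 62, 91, 78, 89, 78, 71, 80]

# Paired differences (After − Before)
diffs = [a - b for a, b in zip(after, before)]
n = len(diffs)
mean_d = sum(diffs) / n
# Sample SD of the differences (n − 1 in the denominator)
sd_d = math.sqrt(sum((d - mean_d) ** 2 for d in diffs) / (n - 1))
t = mean_d / (sd_d / math.sqrt(n))   # df = n - 1 = 9

print(round(mean_d, 1), round(t, 2))  # 6.1 4.02
```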
Independent t-test
• We have 12 spider-phobes who were
exposed to a picture of a spider and 12
different spider-phobes who were exposed
to a real live tarantula. Anxiety levels
were measured.
!
• Open spiderBG.sav
Group Statistics
         Spider or Picture?  N   Mean   Std. Deviation  Std. Error Mean
Anxiety  Picture             12  40.00  9.293           2.683
         Real Spider         12  47.00  11.029          3.184

Independent Samples Test
                                  Levene's Test   t-test for Equality of Means
                                  F      Sig.     t       df      Sig.        Mean        Std. Error  95% CI of the Difference
                                                                  (2-tailed)  Difference  Difference  Lower     Upper
Anxiety  Equal variances
         assumed                  .782   .386     -1.681  22      .107        -7.000      4.163       -15.634   1.634
         Equal variances
         not assumed                              -1.681  21.385  .107        -7.000      4.163       -15.649   1.649
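The equal-variances t statistic in the table can be reproduced from the group statistics alone with the pooled-variance formula (a sketch; values are taken from the Group Statistics output):

```python
import math

# Independent-samples t, equal variances assumed, from summary stats:
n1 = n2 = 12
m1, s1 = 40.00, 9.293    # picture group
m2, s2 = 47.00, 11.029   # real-spider group

# Pooled variance weights each group's variance by its df
pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
t = (m1 - m2) / se       # df = n1 + n2 - 2 = 22

print(round(t, 2))  # -1.68, matching the SPSS output
```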
Thank you

More Related Content

PDF
Multiple linear regression
PDF
Correlation in Statistics
PPTX
Introduction to Regression Analysis and R
PPT
Point Bicerial Correlation Coefficient
PPTX
PPTX
Correlation and Regression
ODP
Correlation
PPTX
Linear regression analysis
Multiple linear regression
Correlation in Statistics
Introduction to Regression Analysis and R
Point Bicerial Correlation Coefficient
Correlation and Regression
Correlation
Linear regression analysis

What's hot (18)

PPTX
Regression and corelation (Biostatistics)
PPT
Correlation and Regression
PPTX
Data analysis 1
ODP
Multiple linear regression II
PPTX
Regression analysis in R
PPTX
correlation and regression
PDF
Research Methodology Module-06
PDF
Introduction to correlation and regression analysis
PPT
Lesson 8 Linear Correlation And Regression
PDF
Correlations using SPSS
PPTX
Correlation and regression
PPT
regression and correlation
PPTX
Correlation and regression
PPTX
Correlation & Regression
PPT
Simple linear regressionn and Correlation
PPT
Correlation and regression
PPTX
Regression
PPTX
Pearson Correlation
Regression and corelation (Biostatistics)
Correlation and Regression
Data analysis 1
Multiple linear regression II
Regression analysis in R
correlation and regression
Research Methodology Module-06
Introduction to correlation and regression analysis
Lesson 8 Linear Correlation And Regression
Correlations using SPSS
Correlation and regression
regression and correlation
Correlation and regression
Correlation & Regression
Simple linear regressionn and Correlation
Correlation and regression
Regression
Pearson Correlation
Ad

Similar to Lect w8 w9_correlation_regression (20)

PPT
Correlation and Regression analysis .ppt
PPTX
Measure of Association
PPTX
Lecture 2_Chapter 4_Simple linear regression.pptx
PPTX
Correlation and Regression ppt
PPT
Regression and Co-Relation
PPTX
6 the six uContinuous data analysis.pptx
PPTX
Correlation _ Regression Analysis statistics.pptx
PPTX
Correlation.pptx
PPT
Stats For Life Module7 Oc
PPTX
3.3 correlation and regression part 2.pptx
PDF
Unit 1 Correlation- BSRM.pdf
PPTX
Introduction to Regression - The Importance.pptx
PPT
correlation and r3433333333333333333333333333333333333333333333333egratio111n...
PPTX
Fundamental of Statistics and Types of Correlations
PPT
Medical statistics2
PPT
Correlation analysis
PDF
Correlation and Regression
DOCX
EXERCISE 23 PEARSONS PRODUCT-MOMENT CORRELATION COEFFICIENT .docx
PPTX
simple and multiple linear Regression. (1).pptx
PDF
Deepak_DAI101_Data_Anal_lecture6 (1).pdf
Correlation and Regression analysis .ppt
Measure of Association
Lecture 2_Chapter 4_Simple linear regression.pptx
Correlation and Regression ppt
Regression and Co-Relation
6 the six uContinuous data analysis.pptx
Correlation _ Regression Analysis statistics.pptx
Correlation.pptx
Stats For Life Module7 Oc
3.3 correlation and regression part 2.pptx
Unit 1 Correlation- BSRM.pdf
Introduction to Regression - The Importance.pptx
correlation and r3433333333333333333333333333333333333333333333333egratio111n...
Fundamental of Statistics and Types of Correlations
Medical statistics2
Correlation analysis
Correlation and Regression
EXERCISE 23 PEARSONS PRODUCT-MOMENT CORRELATION COEFFICIENT .docx
simple and multiple linear Regression. (1).pptx
Deepak_DAI101_Data_Anal_lecture6 (1).pdf
Ad

More from Rione Drevale (20)

PPT
Risk financing
PPTX
Managing specialized risk_14
PDF
Arntzen
PDF
Banana acclimatization
DOCX
Strategic entrepreneurship tempelate
PPT
Chapter 2
PDF
Sign and symptoms in crops
PPT
Chapter 4 risk
PPT
Chapter 5 risk_
PPT
PPT
L3 amp l4_fpe3203
PPT
L2 fpe3203
PPT
L5 fpe3203 23_march_2015-1
PPT
Agricultural technology upscaling_1
PPT
Water science l3 available soil water 150912ed
PPT
Water science l2 cwr final full ed
PDF
W2 lab design_new2
PDF
W1 intro plant_tc
PPT
Risk management chpt 2
PPT
Risk management chpt 3 and 9
Risk financing
Managing specialized risk_14
Arntzen
Banana acclimatization
Strategic entrepreneurship tempelate
Chapter 2
Sign and symptoms in crops
Chapter 4 risk
Chapter 5 risk_
L3 amp l4_fpe3203
L2 fpe3203
L5 fpe3203 23_march_2015-1
Agricultural technology upscaling_1
Water science l3 available soil water 150912ed
Water science l2 cwr final full ed
W2 lab design_new2
W1 intro plant_tc
Risk management chpt 2
Risk management chpt 3 and 9

Recently uploaded (20)

PPTX
1st Inaugural Professorial Lecture held on 19th February 2020 (Governance and...
PDF
Classroom Observation Tools for Teachers
PDF
3rd Neelam Sanjeevareddy Memorial Lecture.pdf
PDF
O7-L3 Supply Chain Operations - ICLT Program
PDF
Black Hat USA 2025 - Micro ICS Summit - ICS/OT Threat Landscape
PPTX
GDM (1) (1).pptx small presentation for students
PDF
01-Introduction-to-Information-Management.pdf
PDF
The Lost Whites of Pakistan by Jahanzaib Mughal.pdf
PPTX
BOWEL ELIMINATION FACTORS AFFECTING AND TYPES
PPTX
human mycosis Human fungal infections are called human mycosis..pptx
PPTX
Introduction_to_Human_Anatomy_and_Physiology_for_B.Pharm.pptx
PPTX
PPT- ENG7_QUARTER1_LESSON1_WEEK1. IMAGERY -DESCRIPTIONS pptx.pptx
PPTX
Cell Types and Its function , kingdom of life
PDF
VCE English Exam - Section C Student Revision Booklet
PDF
Physiotherapy_for_Respiratory_and_Cardiac_Problems WEBBER.pdf
PDF
Computing-Curriculum for Schools in Ghana
PPTX
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
PDF
Abdominal Access Techniques with Prof. Dr. R K Mishra
PPTX
Final Presentation General Medicine 03-08-2024.pptx
PDF
FourierSeries-QuestionsWithAnswers(Part-A).pdf
1st Inaugural Professorial Lecture held on 19th February 2020 (Governance and...
Classroom Observation Tools for Teachers
3rd Neelam Sanjeevareddy Memorial Lecture.pdf
O7-L3 Supply Chain Operations - ICLT Program
Black Hat USA 2025 - Micro ICS Summit - ICS/OT Threat Landscape
GDM (1) (1).pptx small presentation for students
01-Introduction-to-Information-Management.pdf
The Lost Whites of Pakistan by Jahanzaib Mughal.pdf
BOWEL ELIMINATION FACTORS AFFECTING AND TYPES
human mycosis Human fungal infections are called human mycosis..pptx
Introduction_to_Human_Anatomy_and_Physiology_for_B.Pharm.pptx
PPT- ENG7_QUARTER1_LESSON1_WEEK1. IMAGERY -DESCRIPTIONS pptx.pptx
Cell Types and Its function , kingdom of life
VCE English Exam - Section C Student Revision Booklet
Physiotherapy_for_Respiratory_and_Cardiac_Problems WEBBER.pdf
Computing-Curriculum for Schools in Ghana
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
Abdominal Access Techniques with Prof. Dr. R K Mishra
Final Presentation General Medicine 03-08-2024.pptx
FourierSeries-QuestionsWithAnswers(Part-A).pdf

Lect w8 w9_correlation_regression

  • 1. Correlation, Regression & T-test Prepared By: Dr. Kumara Thevan a/l Krishnan
  • 2. Introduction Investigation on a relationship between two or more numerical or quantitative variables can be conducted using techniques of correlation and regression analysis. ! - Correlation is a statistical method used to determine whether a linear relationship between variable exist. ! - Simple linear regression is a statistical method used to described the nature of the relationship between two variables.
  • 3. Definition Scatterplot (or scatter diagram) i s a graph in which the paired (x,y) sample data are plotted with a horizontal x axis and a vertical y axis. ! Each individual (x,y) pair is plotted as a single point.
  • 4. Definition Correlation ! exists between two variables when one of them is related to the other in some way
  • 5. Example Open SPSS. Data ; weight height biometry male 2012.sav Graph Legacy dialogs=> scatter plot Height (X); Weight ( Y)
  • 6. BMI test Category BMI range – kg/m Very severely underweight less than 15 Severely underweight from 15.0 to 16.0 Underweight from 16.0 to 18.5 Normal (healthy weight) from 18.5 to 25 Overweight from 25 to 30 Obese Class I (Moderately from 30 to 35 obese) Obese Class II (Severely obese) from 35 to 40 Obese Class III (Very severely obese) over 40
  • 7. Normality test? • The Kolmogorov-Smirnov and Shapiro-Wilk test. • The compare the scores in the sample to a normally distributed set of scores with the same mean and s.d. • If p>0.05, the test is non-significant. Tells us that the distribution of the sample is not significantly different from a normal distribution. • The test is significant (p<0.05) then the distribution in question is significantly different from a normal distribution(non-normal)
  • 9. Positive Linear Correlation y y y x x x (a) Positive (b) Strong positive (c) Perfect positive
  • 10. Negative Linear Correlation y y y x x x (d) Negative (e) Strong negative (f) Perfect negative
  • 11. What can you see?
  • 12. Test 1 - Do scatter plot using data ExamAnxiety.sav - Exam performance (%) – y axis - Exam anxiety – x axis - Color – place gender - Results? ! - Try 3D plot – 3 variables
  • 13. Bivariate correlation • Having taken a preliminary glance at the data, we can proceed to conduct the correlation analysis.
  • 14. Definition ! Linear Correlation Coefficient r measures strength of the linear relationship between paired x- and y-quantitative values in a sample
  • 15. No Linear Correlation y y x x (g) No Correlation (h) Nonlinear Correlation
  • 16. Definition ! Linear Correlation Coefficient r sometimes referred to as the Pearson product moment correlation coefficient
  • 17. Notation for the Linear Correlation Coefficient n number of pairs of data presented. Σ denotes the addition of the items indicated. Σx denotes the sum of all x values. Σx2 indicates that each x score should be squared and then those squares added. (Σx)2 indicates that the x scores should be added and the total then squared. Σxy indicates that each x score should be first multiplied by its corresponding y score. After obtaining all such products, find their sum. r represents linear correlation coefficient for a sample ρ represents linear correlation coefficient for a population
  • 18. Definition Linear Correlation Coefficient r nΣxy - (Σx)(Σy) n(Σx2) - (Σx)2 n(Σy2) - (Σy)2 r =
  • 19. Test 2 • Run the correlation analysis ExamAnxiety.sav • Assumption – Data is normally distributed
  • 20. Results Correlations Exam Performance (%) Exam Anxiety Time Spent Revising Exam Performance (%) Pearson Correlation 1 -.441 .397 Sig. (1-tailed) .000 .000 N 103 103 103 Exam Anxiety Pearson Correlation -.441 1 -.709 Sig. (1-tailed) .000 .000 N 103 103 103 Time Spent Revising Pearson Correlation .397 -.709 1 Sig. (1-tailed) .000 .000 N 103 103 103 **. Correlation is significant at the 0.01 level (1-tailed).
  • 21. Interpretation • Exam performance is positively related to the amount of time spent revising, with a coefficient of r= 0.397, which is also significant at p< 0.01. ! • Exam anxiety appears to be negatively related to the time spent revising (r= -0.709, p< 0.01)
  • 22. Interpretation • Each variable is perfectly correlated with itself (r=1). ! • Exam performance is negatively related to exam anxiety with a Pearson correlation coefficient of r= - 0.441 and there is less than 0.01 probability that a correlation coeficient this big would have occurred by chance in a sample of 103 people.
  • 23. In layman term • exam anxiety , exam mark • Revision time , exam mark • Revision time , exam anxiety
  • 24. Hands on • Is there a linear association between weight and heart girth in this herd of cows? • Weight was measured in kg and heart girth in cm on 10 cows ! ! ! • Assume data is normally distributed
  • 26. • The sample coefficient of correlation is 0.704. The P value is 0.012, which is less than 0.05. The conclusion is that correlation exists in the population. Correlations Weight Girth Weight Pearson Correlation 1 .704 Sig. (1-tailed) .012 N 10 10 Girth Pearson Correlation .704 1 Sig. (1-tailed) .012 N 10 10 *. Correlation is significant at the 0.05 level (1-tailed).
  • 27. Using R2 for interpretation ( correlation coefficient) 2 = coefficient of determination, R2 ! R2 is a measure of the amount of variability in one variable that is explained by the other.
  • 28. Example Correlations Exam Performance (%) Exam Anxiety Time Spent Revising Exam Performance (%) Pearson Correlation 1 -.441 .397 Sig. (1-tailed) .000 .000 N 103 103 103 Exam Anxiety Pearson Correlation -.441 1 -.709 Sig. (1-tailed) .000 .000 N 103 103 103 Time Spent Revising Pearson Correlation .397 -.709 1 Sig. (1-tailed) .000 .000 N 103 103 103 **. Correlation is significant at the 0.01 level (1-tailed).
  • 29. Example Exam anxiety and exam performance • ( correlation coefficient) 2 = coefficient of determination, R2 ! R2 = ( -0.441) 2 = 0.194 ! • In % = 0.194 x 100 = 19.4%
  • 30. • Although exam anxiety was correlated with exam performance, it can account for only 19.4 % of variation in exam scores. ! • 80.6% of the variability to be accounted for other variables such as different ability, different level of preparation and so on…)
  • 31. Hands on Subject Age, x Pressure, y A 43 128 B 48 120 C 56 135 D 61 143 E 67 141 F 70 152 Compute the value of the correlation coefficient for the data? Do you have enough Statistical evidence that this relationship does not occur by chance?
  • 32. Correlations Age Pressure Age Pearson Correlation 1 .897 Sig. (2-tailed) .015 N 6 6 Pressure Pearson Correlation .897 1 Sig. (2-tailed) .015 N 6 6 *. Correlation is significant at the 0.05 level (2-tailed). R 2 = ?
  • 33. Regression Correlation do not provide the predictive power of variables. ! In regression analysis we fit a predictive model to our data and use that model to predict values of the dependent variable from one or more independent variables.
  • 34. Independent V. Dependent • Intentionally manipulated • Controlled • Vary at known rate • Cause • Intentionally left alone • Measured • Vary at unknown rate • Effect
  • 35. • Simple regression seeks to predict an outcome variable from a single predictor variable whereas multiple regression seeks to predict an outcome from several predictors. ! Outcomei = (Modeli) + errori Yi = (bo + b1 xi ) + ei
  • 37. Least squares Least squares is a method of finding the line that best fits the data. ! This “line of best fit” is found by ascertaining which line, of all of the possible lines that could be drawn, results in the least amount of difference between the observed data points and the line.
  • 38. The vertical lines (dashed) represent the differences (or residuals) between the line and the actual data
  • 39. • “The best fit line” – there will be small differences between the values predicted by the line and the data that were actually observed. ! • Our interest- in the vertical differences between the line and the actual data because we are using the line to predict values of Y from values of the X-variable. ! • Some data fall above or below the line, indicating there is difference between the model fitted to these data and the data collected.
  • 40. • These difference called “residuals”. • If the “residuals” +ve and –ve cancelled each other ! How ? ! • Square the differences before adding up. • If the squared differences are large, the line is not representative of the data; if the squared differences is small then is representative.
  • 41. Total sum of squares, SST SST uses the differences between the observed data and the mean value of Y
  • 42. • The sum of squared differences (SS) can be calculated for any line that is fitted to some data; the “goodness of fit” of each line can then be compared by looking at the sum of squares for each. ! • The method of least squares works by selecting the line that has the lowest sum of squared differences(so it chooses the line that best represents the observed data) ! • This “line of best fit” known as a regression line.
  • 43. Residual sum of squares, SSR SSR uses the differences between the observed data and the regression line
  • 44. Model sum of squares, SSM SSM uses the differences between the mean value of Y and the regression line
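The three sums of squares on these slides can be computed directly, and for an OLS fit they satisfy the partition SST = SSM + SSR. A sketch on hypothetical data:

```python
def sums_of_squares(x, y):
    """Return (sst, ssm, ssr) for an OLS fit of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    b1 = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
          / sum((a - mean_x) ** 2 for a in x))
    b0 = mean_y - b1 * mean_x
    fitted = [b0 + b1 * xi for xi in x]
    sst = sum((yi - mean_y) ** 2 for yi in y)                  # observed vs mean of Y
    ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))     # observed vs regression line
    ssm = sum((fi - mean_y) ** 2 for fi in fitted)             # regression line vs mean of Y
    return sst, ssm, ssr

# Hypothetical data
sst, ssm, ssr = sums_of_squares([1, 2, 3, 4], [2, 5, 4, 9])
```

Here SST = 26, SSM = 20 and SSR = 6, so the model accounts for 20/26 of the total variation.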
  • 45. F-ratio F = MSM / MSR ! MSM (mean square for the model) = SSM / (number of variables in the model)
  • 46. F-ratio F = MSM / MSR ! MSR (mean square for the residuals) = SSR / (number of observations − number of parameters being estimated)
  • 47. F-ratio • A good model should have a large F-ratio (greater than 1 at least), because the systematic variance explained by the model (MSM) should exceed the unsystematic, residual variance (MSR).
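Putting the two mean squares together, the F-ratio can be sketched as a small function. The SSM and SSR values below are hypothetical; for k predictors, k + 1 parameters (slopes plus intercept) are estimated, so the residual degrees of freedom are n − k − 1.

```python
def f_ratio(ssm, ssr, k, n):
    """F = MSM / MSR for a regression with k predictors and n observations.
    MSM = SSM / k; MSR = SSR / (n - k - 1)."""
    msm = ssm / k
    msr = ssr / (n - k - 1)
    return msm / msr

# Hypothetical single-predictor example: SSM = 20, SSR = 6, n = 4
f = f_ratio(20, 6, k=1, n=4)
```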
  • 48. Test 1 Open sample date – Record1.sav Graph=> scatterplot ! Analyze=> regression
  • 50. Model Summary
Model 1: R = .578, R Square = .335, Adjusted R Square = .331, Std. Error of the Estimate = 65.991
a. Predictors: (Constant), Advertsing Budget (thousands of pounds)
  • 51. Interpretation • The value of R2 is 0.335, which tells us that advertising expenditure can account for 33.5% of the variation in record sales. ! • This means that 66.5% of the variation in record sales cannot be explained by advertising alone
  • 52. The F-ratio is 99.587, which is significant at p < 0.001 (because the value in the column labelled Sig. is less than 0.001). ! This result tells us there is less than a 0.1% chance that an F-ratio this large would arise by chance alone. Overall, the regression model predicts record sales significantly well.
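These figures can be reproduced from the sums of squares SPSS reports in the ANOVA table for this model (SSM = 433687.833 on 1 df, SSR = 862264.167 on 198 df), using R² = SSM/SST and F = MSM/MSR:

```python
# Reproducing R^2 and the F-ratio for the Record1.sav model
# (advertising budget predicting record sales) from SPSS's ANOVA output.
ssm, df_m = 433687.833, 1      # regression (model) sum of squares
ssr, df_r = 862264.167, 198    # residual sum of squares
sst = ssm + ssr                # total sum of squares (1295952.000)

r_squared = ssm / sst              # proportion of variance explained
f = (ssm / df_m) / (ssr / df_r)    # F-ratio
```

This recovers R² ≈ 0.335 and F ≈ 99.59, matching the SPSS output.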
  • 53. Multiple regression • Open data ; Record2.sav
  • 58. Results — Descriptive Statistics (N = 200 for all variables)
Record Sales (thousands): Mean = 193.20, Std. Deviation = 80.699
Advertsing Budget (thousands of pounds): Mean = 614.41, Std. Deviation = 485.655
No. of plays on Radio 1 per week: Mean = 27.50, Std. Deviation = 12.270
Attractiveness of Band: Mean = 6.77, Std. Deviation = 1.395
  • 59. Correlations (N = 200 for every pair)
Pearson correlations with Record Sales (thousands): Advertsing Budget r = .578, No. of plays on Radio 1 per week r = .599, Attractiveness of Band r = .326 (all Sig. 1-tailed = .000)
Correlations among predictors: Advertsing Budget & Radio 1 plays r = .102 (Sig. 1-tailed = .076); Advertsing Budget & Attractiveness r = .081 (Sig. = .128); Radio 1 plays & Attractiveness r = .182 (Sig. = .005)
  • 60. Model Summary (simple regression)
Model 1: R = .578, R Square = .335, Adjusted R Square = .331, Std. Error of the Estimate = 65.991
a. Predictors: (Constant), Advertsing Budget (thousands of pounds)
Model Summary (multiple regression)
Model 1: R = .815, R Square = .665, Adjusted R Square = .660, Std. Error of the Estimate = 47.087; Change Statistics: R Square Change = .665, F Change = 129.498, df1 = 3, df2 = 196, Sig. F Change = .000; Durbin-Watson = 1.950
a. Predictors: (Constant), Attractiveness of Band, Advertsing Budget (thousands of pounds), No. of plays on Radio 1 per week
b. Dependent Variable: Record Sales (thousands)
  • 61. ANOVA (simple regression)
Regression: Sum of Squares = 433687.833, df = 1, Mean Square = 433687.833, F = 99.587, Sig. = .000
Residual: Sum of Squares = 862264.167, df = 198, Mean Square = 4354.870
Total: Sum of Squares = 1295952.000, df = 199
a. Predictors: (Constant), Advertsing Budget (thousands of pounds)
b. Dependent Variable: Record Sales (thousands)
ANOVA (multiple regression)
Regression: Sum of Squares = 861377.418, df = 3, Mean Square = 287125.806, F = 129.498, Sig. = .000
Residual: Sum of Squares = 434574.582, df = 196, Mean Square = 2217.217
Total: Sum of Squares = 1295952.000, df = 199
a. Predictors: (Constant), Attractiveness of Band, Advertsing Budget (thousands of pounds), No. of plays on Radio 1 per week
b. Dependent Variable: Record Sales (thousands)
  • 62. Hands on • Open file softdrinks.sav ! • Do multiple regression analysis ! • Y – dependent – delivery time
  • 63. Results — Model Summary
Model 1: R = .980, R Square = .960, Adjusted R Square = .956, Std. Error of the Estimate = 3.25947
a. Predictors: (Constant), distance, cases
ANOVA
Regression: Sum of Squares = 5550.811, df = 2, Mean Square = 2775.405, F = 261.235, Sig. = .000
Residual: Sum of Squares = 233.732, df = 22, Mean Square = 10.624
Total: Sum of Squares = 5784.543, df = 24
a. Predictors: (Constant), distance, cases
b. Dependent Variable: time
  • 64. Coefficients
(Constant): B = 2.341, Std. Error = 1.097, t = 2.135, Sig. = .044
cases: B = 1.616, Std. Error = .171, Standardized Beta = .716, t = 9.464, Sig. = .000
distance: B = .014, Std. Error = .004, Standardized Beta = .301, t = 3.981, Sig. = .001
a. Dependent Variable: time
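The unstandardized B coefficients above give the fitted equation time = 2.341 + 1.616·cases + 0.014·distance, which can be used for prediction. A sketch, with a purely hypothetical delivery (10 cases, distance 500, in the units of the softdrinks data):

```python
# Prediction from the fitted softdrinks multiple regression
# (coefficients taken from the SPSS Coefficients table above).
def predicted_delivery_time(cases, distance):
    """Predicted delivery time: B0 + B1*cases + B2*distance."""
    return 2.341 + 1.616 * cases + 0.014 * distance

# Hypothetical delivery: 10 cases over a distance of 500
t_hat = predicted_delivery_time(10, 500)
```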
  • 65. T-test • Testing differences between means ! • Dependent means t-test: used when there are two experimental conditions and the same participants took part in both conditions of the experiment. ! • Independent means t-test: used when there are two experimental conditions and different participants were assigned to each condition.
  • 66. Dependent t-test • 12 spider-phobes who were exposed to a picture of a spider (picture) and on a separate occasion a real live tarantula (real). Their anxiety was measured in each condition (half of the participants were exposed to the picture before the real spider while the other half were exposed to the real spider first). • Which situation caused more anxiety? ! ! ! ! • Open spiderRM.sav
  • 67. Results — Paired Samples Statistics
Picture of Spider: Mean = 40.00, N = 12, Std. Deviation = 9.293, Std. Error Mean = 2.683
Real Spider: Mean = 47.00, N = 12, Std. Deviation = 11.029, Std. Error Mean = 3.184
Paired Samples Correlations
Pair 1, Picture of Spider & Real Spider: N = 12, Correlation = .545, Sig. = .067
r = 0.545; the two conditions are not significantly correlated (p > 0.05)
  • 68. Paired Samples Test
Pair 1, Picture of Spider − Real Spider: Mean = −7.000, Std. Deviation = 9.807, Std. Error Mean = 2.831, 95% Confidence Interval of the Difference = (−13.231, −.769), t = −2.473, df = 11, Sig. (2-tailed) = .031
The negative t-value tells us that the picture condition had a smaller mean than the real tarantula, so the real spider led to greater anxiety than the picture. ! Conclusion: exposure to a real spider caused significantly more reported anxiety in spider-phobes than exposure to a picture (t(11) = −2.47, p < 0.05)
  • 69. Hands on • All students who enrol in a certain memory course are given a pretest before the course begins. At the completion of the course they take a post-test; both sets of scores are listed here. Verify the results shown on the output by calculating the values yourself, assuming normality.
Std:    1  2  3  4  5  6  7  8  9  10
Before: 93 86 72 54 92 65 80 81 62 73
After:  98 92 80 62 91 78 89 78 71 80
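As a check on this hands-on exercise, the paired (dependent) t statistic can be computed by hand from the differences d_i = after_i − before_i, following the same logic as the SPSS Paired Samples Test:

```python
import math

# Paired (dependent) t-test on the memory-course scores from the slide.
before = [93, 86, 72, 54, 92, 65, 80, 81, 62, 73]
after = [98, 92, 80, 62, 91, 78, 89, 78, 71, 80]

d = [a - b for a, b in zip(after, before)]        # per-student differences
n = len(d)
mean_d = sum(d) / n                               # mean difference
var_d = sum((x - mean_d) ** 2 for x in d) / (n - 1)  # sample variance of d
se_d = math.sqrt(var_d / n)                       # standard error of mean difference
t = mean_d / se_d                                 # t statistic with n - 1 = 9 df
```

This gives a mean improvement of 6.1 points and t ≈ 4.02 on 9 degrees of freedom, so the course appears to raise scores significantly.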
  • 70. Independent t-test • We have 12 spider-phobes who were exposed to a picture of a spider and 12 different spider-phobes who were exposed to a real live tarantula. Anxiety levels were measured. ! • Open spiderBG.sav
  • 71. Group Statistics
Anxiety, Picture: N = 12, Mean = 40.00, Std. Deviation = 9.293, Std. Error Mean = 2.683
Anxiety, Real Spider: N = 12, Mean = 47.00, Std. Deviation = 11.029, Std. Error Mean = 3.184
Independent Samples Test
Levene's Test for Equality of Variances: F = .782, Sig. = .386
t-test for Equality of Means, equal variances assumed: t = −1.681, df = 22, Sig. (2-tailed) = .107, Mean Difference = −7.000, Std. Error Difference = 4.163, 95% Confidence Interval of the Difference = (−15.634, 1.634)
Equal variances not assumed: t = −1.681, df = 21.385, Sig. (2-tailed) = .107, Mean Difference = −7.000, Std. Error Difference = 4.163, 95% CI = (−15.649, 1.649)
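The equal-variances t value SPSS reports can be reproduced from the group statistics alone, using the pooled-variance formula:

```python
import math

# Independent-means t-test (equal variances assumed), computed from the
# group statistics above: Picture (n=12, mean 40.00, sd 9.293) vs
# Real Spider (n=12, mean 47.00, sd 11.029).
def independent_t(mean1, sd1, n1, mean2, sd2, n2):
    """Pooled-variance t statistic; df = n1 + n2 - 2."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se

t = independent_t(40.00, 9.293, 12, 47.00, 11.029, 12)
```

This recovers t ≈ −1.68 on 22 df, matching the SPSS output; with p = .107 the difference between groups is not significant, in contrast to the dependent design, which illustrates the extra power of repeated measures.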