Correlation and Regression
Topics Covered: Is there a relationship between x and y? What is the strength of this relationship? (Pearson's r) Can we describe this relationship and use it to predict y from x? (Regression) Is the relationship we have described statistically significant? (t test) Relevance to SPM (GLM)
The relationship between x and y. Correlation: is there a relationship between two variables? Regression: how well does a certain independent variable predict the dependent variable? CORRELATION ≠ CAUSATION. In order to infer causality: manipulate the independent variable and observe the effect on the dependent variable.
Scattergrams. [Three scatter plots of y against x, illustrating positive correlation, negative correlation, and no correlation.]
Variance vs Covariance. First, a note on your sample: if you wish to assume that your sample is representative of the general population (RANDOM EFFECTS MODEL), use the degrees of freedom (n − 1) in your calculations of variance or covariance. But if you simply want to assess your current sample (FIXED EFFECTS MODEL), substitute n for the degrees of freedom.
Variance vs Covariance. Do two variables change together? Variance: gives information on the variability of a single variable. Covariance: gives information on the degree to which two variables vary together. Note how similar the covariance is to the variance: the equation simply multiplies x's error scores by y's error scores, as opposed to squaring x's error scores.
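For reference, the two formulas side by side (standard definitions in the sample, n − 1, form; on the original slide these presumably appeared as images):

    Variance:    s_x² = Σ(x_i − x̄)² / (n − 1)
    Covariance:  cov(x,y) = Σ(x_i − x̄)(y_i − ȳ) / (n − 1)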
Covariance. When X ↑ and Y ↑: cov(x,y) = positive. When X ↑ and Y ↓: cov(x,y) = negative. When there is no constant relationship: cov(x,y) = 0.
Example Covariance (x̄ = 3, ȳ = 3):

    x_i   y_i   x_i − x̄   y_i − ȳ   (x_i − x̄)(y_i − ȳ)
    0     3     −3          0          0
    2     2     −1         −1          1
    3     4      0          1          0
    4     0      1         −3         −3
    6     6      3          3          9
                                 Σ =   7

so cov(x,y) = 7 / (n − 1) = 7/4 = 1.75. What does this number tell us?
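As a quick check, a few lines of Python (my own addition, not from the slides) reproduce this calculation; note that numpy's cov uses the same n − 1 denominator by default:

    import numpy as np

    x = np.array([0, 2, 3, 4, 6])
    y = np.array([3, 2, 4, 0, 6])

    # products of each point's deviations from the two means
    products = (x - x.mean()) * (y - y.mean())
    print(products.sum())                 # 7.0
    print(products.sum() / (len(x) - 1))  # 1.75

    # np.cov returns the 2x2 covariance matrix; element [0, 1] is cov(x, y)
    print(np.cov(x, y)[0, 1])             # 1.75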
Problem with covariance: the value obtained depends on the size of the data's standard deviations. If they are large, the covariance will be greater than if they are small, even if the relationship between x and y is exactly the same in the large and small standard deviation datasets.
Example of how the covariance value relies on variance:

                 High variance data            Low variance data
    Subject      x     y    x_err * y_err      x     y    x_err * y_err
    1          101   100    2500               54    53    9
    2           81    80     900               53    52    4
    3           61    60     100               52    51    1
    4           51    50       0               51    50    0
    5           41    40     100               50    49    1
    6           21    20     900               49    48    4
    7            1     0    2500               48    47    9
    Mean        51    50                       51    50

    Sum of x_err * y_err:  7000                Sum of x_err * y_err:  28
    Covariance:            1166.67             Covariance:            4.67

The relationship between x and y is identical in both datasets (y = x − 1 throughout), yet the covariances differ enormously.
Solution: Pearson's r. On its own, the covariance value does not tell us much, because it depends on the scale of the data. Solution: standardise this measure. Pearson's r standardises the covariance value by dividing it by the product of the standard deviations of X and Y: r = cov(x,y) / (s_x · s_y)
Pearson's r continued. Written out in full: r = Σ(x_i − x̄)(y_i − ȳ) / ((n − 1) · s_x · s_y). Note that r is a pure number with no units, and always lies between −1 and +1.
Limitations of r. When r = 1 or r = −1: we can predict y from x with certainty; all data points lie on a straight line, y = ax + b. r is actually an estimate: ρ = the true correlation of the whole population, while r = the estimate of ρ based on the sample data. r is very sensitive to extreme values: a single outlier can dramatically inflate or deflate the correlation.
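A quick illustration of that sensitivity (my own example, not from the slides): twenty points with no true relationship give r near zero, but adding a single extreme point produces a large correlation:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=20)
    y = rng.normal(size=20)           # independent of x, so the true correlation is 0
    print(np.corrcoef(x, y)[0, 1])    # typically close to 0

    # one extreme value appended to both variables
    x_out = np.append(x, 10.0)
    y_out = np.append(y, 10.0)
    print(np.corrcoef(x_out, y_out)[0, 1])  # now large and positive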
Regression Correlation tells you if there is an association between x and y but it doesn’t describe the relationship or allow you to predict one variable from the other. To do this we need REGRESSION!
Best-fit Line. The aim of linear regression is to fit a straight line, ŷ = ax + b, to the data that gives the best prediction of y for any value of x. This will be the line that minimises the distance between the data and the fitted line, i.e. the residuals. Here ŷ = predicted value, y_i = true value, a = slope, b = intercept, and ε = residual error = y_i − ŷ_i.
Least Squares Regression. To find the best line we must minimise the sum of the squares of the residuals (the vertical distances from the data points to our line). Residual (ε) = y − ŷ. Sum of squares of residuals = Σ(y − ŷ)². Model line: ŷ = ax + b (a = slope, b = intercept). We must find the values of a and b that minimise Σ(y − ŷ)².
Finding b. First we find the value of b that gives the minimum sum of squares. Trying different values of b is equivalent to shifting the line up and down the scatter plot.
Finding a. Now we find the value of a that gives the minimum sum of squares. Trying out different values of a is equivalent to changing the slope of the line, while b stays constant.
Minimising sums of squares. We need to minimise Σ(y − ŷ)². Since ŷ = ax + b, we need to minimise Σ(y − ax − b)². If we plot the sum of squares against the different values of a and b we get a parabola, because it is a squared term. So the minimum sum of squares is at the bottom of the curve, where the gradient is zero. [Plot: sum of squares (S) against values of a and b; the minimum S is where the gradient = 0.]
The maths bit. The minimum sum of squares is at the bottom of the curve, where the gradient = 0. So we can find the a and b that give the minimum sum of squares by taking the partial derivatives of Σ(y − ax − b)² with respect to a and b separately. Then we set these to 0 and solve, giving the values of a and b that minimise the sum of squares.
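Spelling that step out (a standard derivation; the working was not shown on the slide): writing S = Σ(y − ax − b)² and setting both partial derivatives to zero gives

    ∂S/∂b = −2 Σ (y − ax − b) = 0    →   b = ȳ − a·x̄
    ∂S/∂a = −2 Σ x(y − ax − b) = 0   →   a = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²

Dividing the numerator and denominator of a by (n − 1) shows a = cov(x,y) / s_x², which is equivalent to the r·s_y/s_x form on the next slide.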
The solution. Doing this gives the following equation for a: a = r · (s_y / s_x), where r = correlation coefficient of x and y, s_y = standard deviation of y, s_x = standard deviation of x. From this you can see that: a low correlation coefficient gives a flatter slope (small value of a); a large spread of y, i.e. a high standard deviation, results in a steeper slope (high value of a); a large spread of x, i.e. a high standard deviation, results in a flatter slope (small value of a).
The solution cont. Our model equation is ŷ = ax + b. This line must pass through the mean point (x̄, ȳ), so ȳ = a·x̄ + b, which rearranges to b = ȳ − a·x̄. Putting our equation for a into this gives: b = ȳ − r · (s_y / s_x) · x̄. The smaller the correlation, the closer the intercept is to the mean of y.
Back to the model. Substituting a and b gives ŷ = ax + b = r(s_y/s_x)·x + ȳ − r(s_y/s_x)·x̄, which rearranges to: ŷ = r(s_y/s_x)(x − x̄) + ȳ. If the correlation is zero, we will simply predict the mean of y for every value of x, and our regression line is just a flat horizontal line crossing the y-axis at ȳ. But this isn't very useful. We can calculate the regression line for any data, but the important question is: how well does this line fit the data, i.e. how good is it at predicting y from x?
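A short sketch (my own, with made-up example data) that builds the line from r and the standard deviations, then checks it against numpy's least-squares fit:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.1, 3.9, 6.2, 7.8, 9.9, 12.3])

    r = np.corrcoef(x, y)[0, 1]
    a = r * y.std(ddof=1) / x.std(ddof=1)  # slope: a = r * s_y / s_x
    b = y.mean() - a * x.mean()            # intercept: the line passes through (x̄, ȳ)
    print(a, b)

    # np.polyfit(x, y, 1) fits the same line by least squares: [slope, intercept]
    print(np.polyfit(x, y, 1))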
How good is our model? Total variance of y: s_y² = Σ(y − ȳ)² / (n − 1) = SS_y / df_y. Variance of the predicted y values (ŷ): s_ŷ² = Σ(ŷ − ȳ)² / (n − 1) = SS_pred / df_ŷ. This is the variance explained by our regression model. Error variance: s_error² = Σ(y − ŷ)² / (n − 2) = SS_er / df_er. This is the variance of the error between our predicted y values and the actual y values, and thus is the variance in y that is NOT explained by the regression model.
How good is our model cont. Total variance = predicted variance + error variance: s_y² = s_ŷ² + s_er². Conveniently, via some complicated rearranging, s_ŷ² = r² · s_y², so r² = s_ŷ² / s_y². Thus r² is the proportion of the variance in y that is explained by our regression model. For example, r = 0.8 gives r² = 0.64: the model explains 64% of the variance in y.
How good is our model cont. Inserting r² · s_y² into s_y² = s_ŷ² + s_er² and rearranging gives: s_er² = s_y² − r² · s_y² = s_y² (1 − r²). From this we can see that the greater the correlation, the smaller the error variance, and so the better our prediction.
Is the model significant? i.e. do we get a significantly better prediction of y from our regression equation than by just predicting the mean? F-statistic: F(df_ŷ, df_er) = s_ŷ² / s_er², which after some complicated rearranging becomes F = r²(n − 2) / (1 − r²). And it follows (because F = t²) that: t(n−2) = r·√(n − 2) / √(1 − r²). So all we need to know are r and n.
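In code (my own sketch; r and n are made-up example values, and scipy's t distribution supplies the p-value):

    import numpy as np
    from scipy import stats

    r, n = 0.45, 30

    t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
    F = t**2                             # F = t², with df = (1, n − 2)

    # two-tailed p-value from the t distribution with n − 2 degrees of freedom
    p = 2 * stats.t.sf(abs(t), df=n - 2)
    print(t, F, p)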
General Linear Model. Linear regression is actually a form of the General Linear Model, y = ax + b + ε, where the parameters are a, the slope of the line, and b, the intercept. A General Linear Model is just any model that describes the data in terms of a straight line, or more generally a linear combination of predictors plus an error term.
Multiple regression. Multiple regression is used to determine the effect of a number of independent variables, x_1, x_2, x_3, etc., on a single dependent variable, y. The different x variables are combined in a linear way and each has its own regression coefficient: y = a_1·x_1 + a_2·x_2 + … + a_n·x_n + b + ε. The a parameters reflect the independent contribution of each independent variable x to the value of the dependent variable y, i.e. the amount of variance in y that is accounted for by each x variable after all the other x variables have been accounted for. (See the sketch below.)
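A minimal sketch of fitting such a model (my own example; np.linalg.lstsq solves the least-squares problem for all the coefficients at once):

    import numpy as np

    rng = np.random.default_rng(1)
    n = 100
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)
    y = 2.0 * x1 - 1.5 * x2 + 4.0 + rng.normal(scale=0.5, size=n)

    # design matrix: one column per predictor, plus a column of ones for the intercept b
    X = np.column_stack([x1, x2, np.ones(n)])
    coeffs, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
    print(coeffs)   # approximately [2.0, -1.5, 4.0], i.e. [a1, a2, b]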
SPM. Linear regression is a GLM that models the effect of one independent variable, x, on ONE dependent variable, y. Multiple regression models the effect of several independent variables, x_1, x_2, etc., on ONE dependent variable, y. Both are types of General Linear Model. The GLM also allows you to analyse the effects of several independent x variables on several dependent variables, y_1, y_2, y_3, etc., in a linear combination.