Linear Regression and Correlation
• Explanatory and Response Variables are Numeric
• Relationship between the mean of the response
  variable and the level of the explanatory variable
  assumed to be approximately linear (straight line)
• Model:

   Y = β0 + β1x + ε          ε ~ N(0, σ)
  • β1 > 0 ⇒ Positive Association
  • β1 < 0 ⇒ Negative Association
  • β1 = 0 ⇒ No Association
Least Squares Estimation of β0, β1
    • β0 ≡ Mean response when x=0 (y-intercept)
    • β1 ≡ Change in mean response when x increases
      by 1 unit (slope)
    • β0, β1 are unknown parameters (like µ)
    • β0+β1x ≡ Mean response when explanatory
      variable takes on the value x
    • Goal: Choose values (estimates) that minimize the
      sum of squared errors (SSE) of observed values to
      the straight-line:
                 SSE = ∑i=1..n (yi − ŷi)² = ∑i=1..n (yi − (β̂0 + β̂1 xi))²

                 where the fitted line is  ŷ = β̂0 + β̂1 x
Example - Pharmacodynamics of LSD
  • Response (y) - Math score (mean among 5 volunteers)
  • Predictor (x) - LSD tissue concentration (mean of 5 volunteers)
  • Raw Data and scatterplot of Score vs LSD concentration:
     Score (y)                  LSD Conc (x)
      78.93                         1.17
      58.20                         2.97
      67.47                         3.26
      37.47                         4.69
      45.65                         5.83
      32.92                         6.00
      29.97                         6.41

   [Scatterplot: SCORE (20–80) vs LSD_CONC (1–7)]
Source: Wagner, et al (1968)
Least Squares Computations
    Sxx = ∑(x − x̄)²

    Sxy = ∑(x − x̄)(y − ȳ)

    Syy = ∑(y − ȳ)²

    β̂1 = ∑(x − x̄)(y − ȳ) / ∑(x − x̄)² = Sxy / Sxx

    β̂0 = ȳ − β̂1 x̄

    s² = ∑(y − ŷ)² / (n − 2) = SSE / (n − 2)
Example - Pharmacodynamics of LSD
     Score (y)   LSD Conc (x)   x-xbar    y-ybar      Sxx           Sxy          Syy
      78.93          1.17       -3.163    28.843   10.004569    -91.230409    831.918649
      58.20          2.97       -1.363    8.113     1.857769    -11.058019    65.820769
      67.47          3.26       -1.073    17.383    1.151329    -18.651959    302.168689
      37.47          4.69        0.357   -12.617    0.127449     -4.504269    159.188689
      45.65          5.83        1.497    -4.437    2.241009     -6.642189    19.686969
      32.92          6.00        1.667   -17.167    2.778889    -28.617389    294.705889
      29.97          6.41        2.077   -20.117    4.313929    -41.783009    404.693689
      350.61        30.33       -0.001    0.001    22.474943   -202.487243   2078.183343


(Column totals given in bottom row of table)

    ȳ = 350.61/7 = 50.087        x̄ = 30.33/7 = 4.333

    β̂1 = −202.4872/22.4749 = −9.01        β̂0 = ȳ − β̂1 x̄ = 50.09 − (−9.01)(4.33) = 89.10

    ŷ = 89.10 − 9.01x        s² = 50.72
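The hand computations above can be checked with a short script. A minimal sketch in plain Python (data values taken from the table above):

```python
# Least squares estimates for the LSD example, from first principles
x = [1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41]        # LSD tissue concentration
y = [78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97]  # mean math score

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
Syy = sum((yi - ybar) ** 2 for yi in y)

b1 = Sxy / Sxx           # slope estimate
b0 = ybar - b1 * xbar    # intercept estimate
SSE = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s2 = SSE / (n - 2)       # error variance estimate

print(round(b1, 2), round(b0, 2), round(s2, 2))  # → -9.01 89.12 50.78
```

The unrounded intercept, 89.124, matches the SPSS output on the next slide; the slide values 89.10 and s² = 50.72 differ slightly because they carry the rounded slope −9.01 through the arithmetic.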
SPSS Output and Plot of Equation
                                                   Coefficientsa


                                   Unstandardized               Standardized
                                      Coefficients              Coefficients
Model                               B         Std. Error            Beta                 t                  Sig.
1       (Constant)                 89.124          7.048                                12.646                 .000
        LSD_CONC                   -9.009          1.503                        -.937   -5.994                 .002
  a. Dependent Variable: SCORE



                             Math Score vs LSD Concentration (SPSS)
                             [Scatterplot with fitted line: score vs lsd_conc;
                              score = 89.12 + -9.01 * lsd_conc,  R-Square = 0.88]
Inference Concerning the Slope (β1)

• Parameter: Slope in the population model (β1)
• Estimator: Least squares estimate β̂1
• Estimated standard error: σ̂(β̂1) = s / √Sxx
• Methods of making inference regarding population:
   – Hypothesis tests (2-sided or 1-sided)
   – Confidence Intervals
Hypothesis Test for β1

• 2-Sided Test                      • 1-sided Test
    – H0: β1 = 0                          – H0: β1 = 0
    – HA: β1 ≠ 0                          – HA+: β1 > 0 or
                                          – HA-: β1 < 0
T.S. (both tests): tobs = β̂1 / σ̂(β̂1)

R.R.: |tobs| ≥ tα/2,n−2        R.R.+: tobs ≥ tα,n−2      R.R.−: tobs ≤ −tα,n−2
P-val: 2P(t ≥ |tobs|)          P-val+: P(t ≥ tobs)       P-val−: P(t ≤ tobs)
(1-α)100% Confidence Interval for β1

              β̂1 ± tα/2 σ̂(β̂1)  ≡  β̂1 ± tα/2 s/√Sxx

• Conclude positive association if entire interval above 0
• Conclude negative association if entire interval below 0
• Cannot conclude an association if interval contains 0
• Conclusion based on interval is same as 2-sided hypothesis test
Example - Pharmacodynamics of LSD
    n = 7    β̂1 = −9.01    s = √50.72 = 7.12    Sxx = 22.475

    σ̂(β̂1) = 7.12 / √22.475 = 1.50

    • Testing H0: β1 = 0 vs HA: β1 ≠ 0

    T.S.: tobs = −9.01/1.50 = −6.01        R.R.: |tobs| ≥ t.025,5 = 2.571

    • 95% Confidence Interval for β1 :

− 9.01 ± 2.571(1.50) ≡ − 9.01 ± 3.86 ≡ (−12.87,−5.15)
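The test statistic and interval above can be reproduced in a few lines. A sketch assuming scipy is available (summary values from the slides; scipy supplies the exact critical value rather than the rounded 2.571):

```python
from math import sqrt
from scipy import stats

# Summary values from the LSD example
b1, s2, Sxx, n = -9.01, 50.72, 22.475, 7

s = sqrt(s2)                   # s = 7.12
se_b1 = s / sqrt(Sxx)          # estimated standard error of the slope
t_obs = b1 / se_b1             # test statistic for H0: beta1 = 0

t_crit = stats.t.ppf(0.975, df=n - 2)              # t_{.025,5} = 2.571
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)    # 95% CI for beta1
p_val = 2 * stats.t.sf(abs(t_obs), df=n - 2)       # 2-sided p-value
```

The interval excludes 0, so the 2-sided test rejects H0 at α = 0.05, matching the conclusion from the slide.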
Correlation Coefficient
• Measures the strength of the linear association
  between two variables
• Takes on the same sign as the slope estimate from
  the linear regression
• Not affected by linear transformations of y or x
• Does not distinguish between dependent and
  independent variable (e.g. height and weight)
• Population Parameter - ρ
• Pearson’s Correlation Coefficient:
              r = Sxy / √(Sxx Syy)          −1 ≤ r ≤ 1
Correlation Coefficient
• Values close to 1 in absolute value ⇒ strong
  linear association, positive or negative from sign
• Values close to 0 imply little or no association
• If data contain outliers (are non-normal),
  Spearman’s coefficient of correlation can be
  computed based on the ranks of the x and y values
• Test of H0:ρ = 0 is equivalent to test of H0:β1=0
• Coefficient of Determination (r2) - Proportion of
  variation in y “explained” by the regression on x:
          r² = (r)² = (Syy − SSE) / Syy          0 ≤ r² ≤ 1
Example - Pharmacodynamics of LSD
    Sxx = 22.475    Sxy = −202.487    Syy = 2078.183    SSE = 253.89

    r = −202.487 / √((22.475)(2078.183)) = −0.94

    r² = (2078.183 − 253.89) / 2078.183 = 0.88 = (−0.94)²
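Both Pearson's and Spearman's coefficients can be computed directly from the raw data. A sketch assuming scipy is available (data from the earlier table; the results match the SPSS output shown two slides below):

```python
from scipy import stats

x = [1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41]
y = [78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97]

r, p_pearson = stats.pearsonr(x, y)        # Pearson's r with 2-sided p-value
rho, p_spearman = stats.spearmanr(x, y)    # Spearman's rank correlation

r_squared = r ** 2   # coefficient of determination
```

Spearman's coefficient (−0.929) is close to Pearson's (−0.937) here because the relationship is nearly monotone with no extreme outliers.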

        [Two scatterplots of score vs lsd_conc: left panel (Syy) shows deviations
         from the overall mean (Mean = 50.09); right panel (SSE) shows deviations
         from the fitted line (score = 89.12 + -9.01 * lsd_conc, R-Square = 0.88)]
Example - SPSS Output
  Pearson’s and Spearman’s Measures
                                    Correlations

                                                         SCORE      LSD_CONC
SCORE                 Pearson Correlation                       1        -.937**
                      Sig. (2-tailed)                           .         .002
                      N                                         7            7
LSD_CONC              Pearson Correlation                   -.937**          1
                      Sig. (2-tailed)                        .002            .
                      N                                         7            7
    **. Correlation is significant at the 0.01 level (2-tailed).



                                       Correlations

                                                                 SCORE     LSD_CONC
Spearman's rho      SCORE            Correlation Coefficient       1.000        -.929**
                                     Sig. (2-tailed)                   .         .003
                                     N                                 7            7
                    LSD_CONC         Correlation Coefficient       -.929**      1.000
                                     Sig. (2-tailed)                .003            .
                                     N                                 7            7
  **. Correlation is significant at the 0.01 level (2-tailed).
Analysis of Variance in Regression
• Goal: Partition the total variation in y into
  variation “explained” by x and random variation
   (yi − ȳ) = (yi − ŷi) + (ŷi − ȳ)

   ∑(yi − ȳ)² = ∑(yi − ŷi)² + ∑(ŷi − ȳ)²

 • These three sums of squares and degrees of freedom are:
     • Total (Syy)       dfTotal = n-1

    • Error (SSE)      dfError = n-2

    • Model (SSR)       dfModel = 1
Analysis of Variance in Regression

Source of                  Sum of          Degrees of       Mean
Variation                  Squares          Freedom        Square              F
Model                       SSR                1         MSR = SSR/1      F = MSR/MSE
Error                       SSE               n-2       MSE = SSE/(n-2)
Total                        Syy              n-1


            • Analysis of Variance - F-test
            • H0: β1 = 0             HA: β1 ≠ 0
            T.S.: Fobs = MSR / MSE
            R.R.: Fobs ≥ Fα,1,n−2
            P-val: P(F ≥ Fobs)
Example - Pharmacodynamics of LSD

       • Total Sum of Squares:

           Syy = ∑(yi − ȳ)² = 2078.183        dfTotal = 7 − 1 = 6

       • Error Sum of Squares:

           SSE = ∑(yi − ŷi)² = 253.890        dfError = 7 − 2 = 5

       • Model Sum of Squares:

           SSR = ∑(ŷi − ȳ)² = 2078.183 − 253.890 = 1824.293        dfModel = 1
Example - Pharmacodynamics of LSD
Source of            Sum of          Degrees of    Mean
Variation            Squares          Freedom      Square      F
Model               1824.293             1        1824.293   35.93
Error               253.890              5         50.778
Total               2078.183             6




      • Analysis of Variance - F-test
      • H0: β1 = 0              HA: β1 ≠ 0

    T.S.: Fobs = MSR/MSE = 1824.293/50.778 = 35.93
    R.R.: Fobs ≥ F.05,1,5 = 6.61
    P-val: P(F ≥ 35.93)
Example - SPSS Output

                                        ANOVAb

                       Sum of
Model                 Squares           df       Mean Square    F       Sig.
1       Regression    1824.302               1      1824.302   35.928      .002a
        Residual       253.881               5        50.776
        Total         2078.183               6
  a. Predictors: (Constant), LSD_CONC
  b. Dependent Variable: SCORE
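The ANOVA quantities above can be reproduced from the two sums of squares. A sketch assuming scipy is available (Syy and SSE from the previous slides):

```python
from scipy import stats

Syy, SSE = 2078.183, 253.890    # total and error sums of squares
n = 7

SSR = Syy - SSE                 # model sum of squares, dfModel = 1
MSR = SSR / 1
MSE = SSE / (n - 2)             # dfError = n - 2 = 5
F_obs = MSR / MSE

F_crit = stats.f.ppf(0.95, 1, n - 2)    # F_{.05,1,5} = 6.61
p_val = stats.f.sf(F_obs, 1, n - 2)

# For simple linear regression F = t^2: here 35.93 = (-5.994)^2,
# so the F-test and the 2-sided t-test for beta1 are equivalent.
```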
Multiple Regression

• Numeric Response variable (Y)
• p Numeric predictor variables
• Model:
   Y = β0 + β1x1 + ⋅⋅⋅ + βpxp + ε


• Partial Regression Coefficients: βi ≡ effect (on the
  mean response) of increasing the ith predictor
  variable by 1 unit, holding all other predictors
  constant
Example - Effect of Birth weight on
         Body Size in Early Adolescence
   • Response: Height at Early adolescence (n =250 cases)
   • Predictors (p=6 explanatory variables)
       • Adolescent Age (x1, in years -- 11-14)
       • Tanner stage (x2, units not given)
       • Gender (x3=1 if male, 0 if female)
       • Gestational age (x4, in weeks at birth)
       • Birth length (x5, units not given)
        • Birthweight Group (x6=1,...,6: <1500g (1), 1500-1999g (2), 2000-2499g (3),
          2500-2999g (4), 3000-3499g (5), >3500g (6))
Source: Falkner, et al (2004)
Least Squares Estimation

     • Population Model for mean response:

          E(Y) = β0 + β1x1 + ⋯ + βpxp

    • Least Squares Fitted (predicted) equation, minimizing SSE:

     Ŷ = β̂0 + β̂1x1 + ⋯ + β̂pxp          SSE = ∑(Y − Ŷ)²
     • All statistical software packages/spreadsheets can
     compute least squares estimates and their standard errors
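Under the hood, those packages solve the least squares problem for the full design matrix. A minimal numpy sketch on made-up data (the data and coefficients here are illustrative, not from the example above):

```python
import numpy as np

# Hypothetical noise-free data: y is an exact linear function of 2 predictors,
# with true beta0 = 5, beta1 = 2, beta2 = -3
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))              # n = 20 cases, p = 2 predictors
y = 5.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]

# Design matrix with a leading column of 1s for the intercept
A = np.column_stack([np.ones(len(y)), X])
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)  # minimizes SSE

y_hat = A @ beta_hat
SSE = np.sum((y - y_hat) ** 2)   # minimized sum of squared errors
```

Because the toy response has no error term, the fit recovers the true coefficients and SSE is essentially zero; with real data the same call returns the least squares estimates and a positive SSE.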
Analysis of Variance
     • Direct extension to ANOVA based on simple
       linear regression
     • Only adjustments are to degrees of freedom:
            – dfModel = p         dfError = n-p-1

Source of               Sum of             Degrees of        Mean
Variation               Squares             Freedom         Square               F
Model                    SSR                   p          MSR = SSR/p       F = MSR/MSE
Error                    SSE                 n-p-1      MSE = SSE/(n-p-1)
Total                     Syy                 n-1



                             R² = (Syy − SSE) / Syy = SSR / Syy
Testing for the Overall Model - F-test
• Tests whether any of the explanatory variables are
  associated with the response
• H0: β1=⋅⋅⋅=βp=0 (None of the xs associated with y)
• HA: Not all βi = 0
  T.S.: Fobs = MSR/MSE = (R²/p) / ((1 − R²)/(n − p − 1))
  R.R.: Fobs ≥ Fα,p,n−p−1
  P-val: P(F ≥ Fobs)
Example - Effect of Birth weight on
        Body Size in Early Adolescence
 • Authors did not print ANOVA, but did provide following:
     • n=250      p=6   R2=0.26
 • H0: β1=⋅⋅⋅=β6=0
 • HA: Not all βi = 0


    T.S.: Fobs = MSR/MSE = (R²/p) / ((1 − R²)/(n − p − 1))

        = (0.26/6) / ((1 − 0.26)/(250 − 6 − 1)) = .0433/.0030 = 14.2

    R.R.: Fobs ≥ Fα,6,243 = 2.13

    P-val: P(F ≥ 14.2)
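The arithmetic above can be verified in a few lines. A sketch assuming scipy is available (n, p, and R² as reported by the authors):

```python
from scipy import stats

n, p, R2 = 250, 6, 0.26

# Overall F-test computed from R2 alone (no ANOVA table needed)
F_obs = (R2 / p) / ((1 - R2) / (n - p - 1))   # = .0433/.0030 ≈ 14.2
F_crit = stats.f.ppf(0.95, p, n - p - 1)      # F_{.05,6,243} ≈ 2.13
p_val = stats.f.sf(F_obs, p, n - p - 1)
```

Since F_obs far exceeds the critical value, at least one of the six predictors is associated with adolescent height.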
Testing Individual Partial Coefficients - t-tests
  • Wish to determine whether the response is
    associated with a single explanatory variable, after
    controlling for the others


  • H0: βi = 0             HA: βi ≠ 0 (2-sided alternative)
       T.S.: tobs = β̂i / σ̂(β̂i)

       R.R.: |tobs| ≥ tα/2,n−p−1
       P-val: 2P(t ≥ |tobs|)
Example - Effect of Birth weight on
   Body Size in Early Adolescence
Variable           b       sb     t=b/sb   P-val (z)
Adolescent Age     2.86    0.99    2.89      .0038
Tanner Stage       3.41    0.89    3.83      <.001
Male               0.08    1.26    0.06      .9522
Gestational Age   -0.11    0.21   -0.52      .6030
Birth Length       0.44    0.19    2.32      .0204
Birth Wt Grp      -0.78    0.64   -1.22      .2224

Controlling for all other predictors, adolescent age,
Tanner stage, and Birth length are associated with
adolescent height measurement
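The t statistics and z-based p-values in the table can be recomputed from b and sb. A sketch assuming scipy is available (coefficients and standard errors from the table above; the table's "P-val (z)" column uses the normal approximation, so norm is used here rather than the t distribution):

```python
from scipy import stats

# (variable, b, sb) from the birth weight example
coefs = [
    ("Adolescent Age",   2.86, 0.99),
    ("Tanner Stage",     3.41, 0.89),
    ("Male",             0.08, 1.26),
    ("Gestational Age", -0.11, 0.21),
    ("Birth Length",     0.44, 0.19),
    ("Birth Wt Grp",    -0.78, 0.64),
]

results = {}
for name, b, sb in coefs:
    t = b / sb                       # test statistic for H0: beta_i = 0
    p = 2 * stats.norm.sf(abs(t))    # 2-sided p-value, normal approximation
    results[name] = (round(t, 2), round(p, 4))
```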
Models with Dummy Variables
• Some models have both numeric and categorical
  explanatory variables (Recall gender in example)
• If a categorical variable has k levels, need to create
  k-1 dummy variables that take on the values 1 if
  the level of interest is present, 0 otherwise.
• The baseline level of the categorical variable is the
  one for which all k-1 dummy variables are set to 0
• The regression coefficient corresponding to a
  dummy variable is the difference between the
  mean for that level and the mean for baseline
  group, controlling for all numeric predictors
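The dummy coding described above can be written out explicitly. A sketch in plain Python with a hypothetical 3-level categorical variable (the level names are illustrative):

```python
# Hypothetical categorical predictor with k = 3 levels; "low" chosen as the
# baseline, so k-1 = 2 dummy variables are created.
levels = ["low", "medium", "high"]
baseline = "low"
dummy_levels = [lev for lev in levels if lev != baseline]

def dummy_code(value):
    """Return the k-1 dummies: 1 if that level is present, 0 otherwise."""
    return [1 if value == lev else 0 for lev in dummy_levels]

rows = ["medium", "low", "high", "low"]
coded = [dummy_code(v) for v in rows]
# baseline "low" codes to [0, 0]; "medium" to [1, 0]; "high" to [0, 1]
```

In the fitted model, the coefficient on each dummy then estimates the mean difference between that level and the baseline, controlling for the numeric predictors.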
Example - Deep Cervical Infections
     • Subjects - Patients with deep neck infections
     • Response (Y) - Length of Stay in hospital
     • Predictors: (One numeric, 11 Dichotomous)
              – Age (x1)
              – Gender (x2=1 if female, 0 if male)
              – Fever (x3=1 if Body Temp > 38C, 0 if not)
              – Neck swelling (x4=1 if Present, 0 if absent)
              – Neck Pain (x5=1 if Present, 0 if absent)
              – Trismus (x6=1 if Present, 0 if absent)
              – Underlying Disease (x7=1 if Present, 0 if absent)
              – Respiration Difficulty (x8=1 if Present, 0 if absent)
              – Complication (x9=1 if Present, 0 if absent)
              – WBC > 15000/mm3 (x10=1 if Present, 0 if absent)
              – CRP > 100µg/ml (x11=1 if Present, 0 if absent)
Source: Wang, et al (2003)
Example - Weather and Spinal Patients
                   • Subjects - Visitors to National Spinal Network in 23 cities
                     Completing SF-36 Form
                   • Response - Physical Function subscale (1 of 10 reported)
                   • Predictors:
                               – Patient’s age (x1)
                               – Gender (x2=1 if female, 0 if male)
                               – High temperature on day of visit (x3)
                               – Low temperature on day of visit (x4)
                                – Dew point (x5)
                                – Wet bulb (x6)
                                – Total precipitation (x7)
                                – Barometric Pressure (x8)
                                – Length of sunlight (x9)
                                – Moon Phase (new, waxing crescent, 1st Qtr, waxing gibbous, full moon,
                                  waning gibbous, last Qtr, waning crescent; 8 levels, presumably coded
                                  as 8-1=7 dummy variables)
Source: Glaser, et al (2004)
Analysis of Covariance
• Combination of 1-Way ANOVA and Linear
  Regression
• Goal: Comparing numeric responses among k
  groups, adjusting for numeric concomitant
  variable(s), referred to as Covariate(s)
• Clinical trial applications: Response is Post-Trt
  score, covariate is Pre-Trt score
• Epidemiological applications: Outcomes
  compared across exposure conditions, adjusted for
  other risk factors (age, smoking status, sex,...)

More Related Content

PPT
BS2506 tutorial 1
PDF
Introduction to Numerical Methods for Differential Equations
PDF
Properties of bivariate and conditional Gaussian PDFs
PDF
Low Complexity Regularization of Inverse Problems - Course #2 Recovery Guaran...
PDF
Model Selection with Piecewise Regular Gauges
PDF
Numerical Methods - Oridnary Differential Equations - 3
PDF
Chapter 13opciones financieras
PDF
Finite Element Analysis Made Easy Lr
BS2506 tutorial 1
Introduction to Numerical Methods for Differential Equations
Properties of bivariate and conditional Gaussian PDFs
Low Complexity Regularization of Inverse Problems - Course #2 Recovery Guaran...
Model Selection with Piecewise Regular Gauges
Numerical Methods - Oridnary Differential Equations - 3
Chapter 13opciones financieras
Finite Element Analysis Made Easy Lr

What's hot (20)

PDF
Mesh Processing Course : Multiresolution
PDF
Low Complexity Regularization of Inverse Problems
PPT
5icdsa2007 v4
PDF
Signal Processing Course : Convex Optimization
PDF
Introduction to Gaussian Processes
PDF
Low Complexity Regularization of Inverse Problems - Course #1 Inverse Problems
PDF
Low Complexity Regularization of Inverse Problems - Course #3 Proximal Splitt...
PPT
BS1501 tutorial 2
PDF
Signal Processing Course : Inverse Problems Regularization
PDF
Proximal Splitting and Optimal Transport
PDF
03 finding roots
PDF
On Convergence of Jungck Type Iteration for Certain Contractive Conditions
PDF
Geometry Transformation
PDF
Presentation
PPT
Ch05 6
PPTX
Meja kursi
PDF
Regression Theory
PDF
Mesh Processing Course : Active Contours
PDF
Prediction of Financial Processes
PDF
Probability and Statistics Cookbook
Mesh Processing Course : Multiresolution
Low Complexity Regularization of Inverse Problems
5icdsa2007 v4
Signal Processing Course : Convex Optimization
Introduction to Gaussian Processes
Low Complexity Regularization of Inverse Problems - Course #1 Inverse Problems
Low Complexity Regularization of Inverse Problems - Course #3 Proximal Splitt...
BS1501 tutorial 2
Signal Processing Course : Inverse Problems Regularization
Proximal Splitting and Optimal Transport
03 finding roots
On Convergence of Jungck Type Iteration for Certain Contractive Conditions
Geometry Transformation
Presentation
Ch05 6
Meja kursi
Regression Theory
Mesh Processing Course : Active Contours
Prediction of Financial Processes
Probability and Statistics Cookbook
Ad

Similar to Linreg (20)

DOC
Statistics Project1
XLS
Chapter 9
PPT
BS2506 tutorial3
PPT
Simple linear regression (final)
PDF
Coeficiente de correlacion lineal 5 1
PPT
linear regression and correlation power point
PPT
Factorial design
PPTX
Bivariate
PDF
Scientific Computing with Python Webinar 9/18/2009:Curve Fitting
PPTX
Exercise data 2_-_forecasting
PDF
Business Statistics_an overview
DOCX
Est3 tutorial3mejorado
PPT
T tests anovas and regression
DOC
Regresi Ni3.Nf
XLS
Week7 Quiz Help Excel File
PPTX
Thesis Multi Step Method 111006
PDF
2010 smg training_cardiff_day1_session1 (1 of 3)_mckenzie
PDF
Teaching the Correlation Coefficient
PPTX
Presentation2 stats
DOC
Tugas Anova Desti
Statistics Project1
Chapter 9
BS2506 tutorial3
Simple linear regression (final)
Coeficiente de correlacion lineal 5 1
linear regression and correlation power point
Factorial design
Bivariate
Scientific Computing with Python Webinar 9/18/2009:Curve Fitting
Exercise data 2_-_forecasting
Business Statistics_an overview
Est3 tutorial3mejorado
T tests anovas and regression
Regresi Ni3.Nf
Week7 Quiz Help Excel File
Thesis Multi Step Method 111006
2010 smg training_cardiff_day1_session1 (1 of 3)_mckenzie
Teaching the Correlation Coefficient
Presentation2 stats
Tugas Anova Desti
Ad

Recently uploaded (20)

PDF
A review of recent deep learning applications in wood surface defect identifi...
PPTX
Final SEM Unit 1 for mit wpu at pune .pptx
PDF
Zenith AI: Advanced Artificial Intelligence
PPTX
Modernising the Digital Integration Hub
PPT
Geologic Time for studying geology for geologist
PDF
Hindi spoken digit analysis for native and non-native speakers
PPTX
Tartificialntelligence_presentation.pptx
PPTX
Benefits of Physical activity for teenagers.pptx
PDF
Unlock new opportunities with location data.pdf
PPTX
observCloud-Native Containerability and monitoring.pptx
PDF
1 - Historical Antecedents, Social Consideration.pdf
PDF
NewMind AI Weekly Chronicles – August ’25 Week III
PDF
August Patch Tuesday
PDF
Developing a website for English-speaking practice to English as a foreign la...
PDF
A novel scalable deep ensemble learning framework for big data classification...
PDF
sustainability-14-14877-v2.pddhzftheheeeee
PPT
Module 1.ppt Iot fundamentals and Architecture
PDF
A comparative study of natural language inference in Swahili using monolingua...
PDF
A contest of sentiment analysis: k-nearest neighbor versus neural network
PDF
TrustArc Webinar - Click, Consent, Trust: Winning the Privacy Game
A review of recent deep learning applications in wood surface defect identifi...
Final SEM Unit 1 for mit wpu at pune .pptx
Zenith AI: Advanced Artificial Intelligence
Modernising the Digital Integration Hub
Geologic Time for studying geology for geologist
Hindi spoken digit analysis for native and non-native speakers
Tartificialntelligence_presentation.pptx
Benefits of Physical activity for teenagers.pptx
Unlock new opportunities with location data.pdf
observCloud-Native Containerability and monitoring.pptx
1 - Historical Antecedents, Social Consideration.pdf
NewMind AI Weekly Chronicles – August ’25 Week III
August Patch Tuesday
Developing a website for English-speaking practice to English as a foreign la...
A novel scalable deep ensemble learning framework for big data classification...
sustainability-14-14877-v2.pddhzftheheeeee
Module 1.ppt Iot fundamentals and Architecture
A comparative study of natural language inference in Swahili using monolingua...
A contest of sentiment analysis: k-nearest neighbor versus neural network
TrustArc Webinar - Click, Consent, Trust: Winning the Privacy Game

Linreg

  • 1. Linear Regression and Correlation • Explanatory and Response Variables are Numeric • Relationship between the mean of the response variable and the level of the explanatory variable assumed to be approximately linear (straight line) • Model: Y = β 0 + β1 x + ε ε ~ N (0, σ ) • β1 > 0 ⇒ Positive Association • β1 < 0 ⇒ Negative Association • β1 = 0 ⇒ No Association
  • 2. Least Squares Estimation of β0, β1 ∀ β0 ≡ Mean response when x=0 (y-intercept) ∀ β1 ≡ Change in mean response when x increases by 1 unit (slope) • β0, β1 are unknown parameters (like µ) • β0+β1x ≡ Mean response when explanatory variable takes on the value x • Goal: Choose values (estimates) that minimize the sum of squared errors (SSE) of observed values to the straight-line: 2 2 ^ ^ ^  ^  n   ^ ^  SSE = ∑i =1  yi − y i  = ∑i =1  yi −  β 0 + β 1 xi   n y = β 0+ β1 x     
  • 3. Example - Pharmacodynamics of LSD • Response (y) - Math score (mean among 5 volunteers) • Predictor (x) - LSD tissue concentration (mean of 5 volunteers) • Raw Data and scatterplot of Score vs LSD concentration: 80 70 60 Score (y) LSD Conc (x) 78.93 1.17 50 58.20 2.97 67.47 3.26 40 37.47 4.69 45.65 5.83 30 SCORE 32.92 6.00 20 29.97 6.41 1 2 3 4 5 6 7 LSD_CONC Source: Wagner, et al (1968)
  • 4. Least Squares Computations S xx =∑− x x( ) 2 S xy =∑ − )(y − ) (x x y ∑− ) (y y 2 S yy = β= ^ ∑ − )(y − ) = (x x y S xy ∑− ) (x x 1 2 S xx β β ^ ^ 0 = − 1 x y 2   ^ ∑ −   y y  =SSE s2 = n− 2 n−2
  • 5. Example - Pharmacodynamics of LSD Score (y) LSD Conc (x) x-xbar y-ybar Sxx Sxy Syy 78.93 1.17 -3.163 28.843 10.004569 -91.230409 831.918649 58.20 2.97 -1.363 8.113 1.857769 -11.058019 65.820769 67.47 3.26 -1.073 17.383 1.151329 -18.651959 302.168689 37.47 4.69 0.357 -12.617 0.127449 -4.504269 159.188689 45.65 5.83 1.497 -4.437 2.241009 -6.642189 19.686969 32.92 6.00 1.667 -17.167 2.778889 -28.617389 294.705889 29.97 6.41 2.077 -20.117 4.313929 -41.783009 404.693689 350.61 30.33 -0.001 0.001 22.474943 -202.487243 2078.183343 (Column totals given in bottom row of table) 350.61 30.33 y= = 50.087 x= = 4.333 7 7 ^ − 202.4872 ^ ^ β1 = = − 9.01 β 0 = y − β 1 x = 50.09 − (− 9.01)(4.33) = 89.10 22.4749 ^ y = 89.10 − 9.01x s 2 = 50.72
  • 6. SPSS Output and Plot of Equation Coefficientsa Unstandardized Standardized Coefficients Coefficients Model B Std. Error Beta t Sig. 1 (Constant) 89.124 7.048 12.646 .000 LSD_CONC -9.009 1.503 -.937 -5.994 .002 a. Dependent Variable: SCORE Math Score vs LSD Concentration (SPSS) 80.00  Linear Regression 70.00  60.00  score 50.00  40.00    30.00 score = 89.12 + -9.01 * lsd_conc 1.00 2.00 R-Square = 0.88 5.00 3.00 4.00 6.00 lsd_conc
• 7. Inference Concerning the Slope (β1)
  • Parameter: Slope in the population model (β1)
  • Estimator: Least squares estimate β̂1
  • Estimated standard error: σ̂β̂1 = s/√Sxx
  • Methods of making inference regarding population:
    – Hypothesis tests (2-sided or 1-sided)
    – Confidence Intervals
• 8. Hypothesis Test for β1
  • 2-Sided Test
    – H0: β1 = 0    HA: β1 ≠ 0
    – T.S.: tobs = β̂1/σ̂β̂1
    – R.R.: |tobs| ≥ tα/2,n−2
    – P-val: 2P(t ≥ |tobs|)
  • 1-Sided Tests
    – H0: β1 = 0    HA+: β1 > 0 or HA−: β1 < 0
    – T.S.: tobs = β̂1/σ̂β̂1
    – R.R.+: tobs ≥ tα,n−2    R.R.−: tobs ≤ −tα,n−2
    – P-val+: P(t ≥ tobs)    P-val−: P(t ≤ tobs)
• 9. (1−α)100% Confidence Interval for β1

      β̂1 ± tα/2,n−2 σ̂β̂1 ≡ β̂1 ± tα/2,n−2 (s/√Sxx)

  • Conclude positive association if entire interval above 0
  • Conclude negative association if entire interval below 0
  • Cannot conclude an association if interval contains 0
  • Conclusion based on interval is the same as the 2-sided hypothesis test
• 10. Example - Pharmacodynamics of LSD

      n = 7    β̂1 = −9.01    s = √50.78 = 7.12    Sxx = 22.475
      σ̂β̂1 = 7.12/√22.475 = 1.50

  • Testing H0: β1 = 0 vs HA: β1 ≠ 0
      T.S.: tobs = −9.01/1.50 = −6.01    R.R.: |tobs| ≥ t.025,5 = 2.571
  • 95% Confidence Interval for β1:
      −9.01 ± 2.571(1.50) ≡ −9.01 ± 3.86 ≡ (−12.87, −5.15)
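The same test statistic and interval, recomputed from the summary numbers on the slide (a sketch; the critical value t.025,5 = 2.571 is taken from a t-table rather than computed):

```python
import math

# Slope inference from the slide's summary numbers for the LSD data.
n, b1, s, Sxx = 7, -9.01, 7.12, 22.475
se_b1 = s / math.sqrt(Sxx)          # estimated standard error, ~1.50
t_obs = b1 / se_b1                  # T.S. for H0: beta1 = 0
t_crit = 2.571                      # t_{.025,5} from a t-table
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)
print(round(t_obs, 1), tuple(round(v, 2) for v in ci))
```

Since |tobs| exceeds 2.571 and the interval excludes 0, both approaches lead to the same conclusion: a negative association between score and LSD concentration.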
• 11. Correlation Coefficient
  • Measures the strength of the linear association between two variables
  • Takes on the same sign as the slope estimate from the linear regression
  • Not affected by linear transformations of y or x
  • Does not distinguish between dependent and independent variables (e.g. height and weight)
  • Population parameter: ρ
  • Pearson’s Correlation Coefficient:

      r = Sxy/√(SxxSyy)        −1 ≤ r ≤ 1
• 12. Correlation Coefficient
  • Values close to 1 in absolute value ⇒ strong linear association (positive or negative, from the sign)
  • Values close to 0 imply little or no association
  • If data contain outliers (are non-normal), Spearman’s coefficient of correlation can be computed based on the ranks of the x and y values
  • Test of H0: ρ = 0 is equivalent to test of H0: β1 = 0
  • Coefficient of Determination (r²) - Proportion of variation in y “explained” by the regression on x:

      r² = (r)² = (Syy − SSE)/Syy        0 ≤ r² ≤ 1
• 13. Example - Pharmacodynamics of LSD

      Sxx = 22.475    Sxy = −202.487    Syy = 2078.183    SSE = 253.89

      r = −202.487/√((22.475)(2078.183)) = −0.94
      r² = (2078.183 − 253.89)/2078.183 = 0.88 = (−0.94)²

  [Side-by-side SPSS plots omitted: variation of score around the mean ȳ = 50.09 (Syy) vs variation around the fitted line score = 89.12 + -9.01 * lsd_conc (SSE), R-Square = 0.88]
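Both quantities follow directly from the sums of squares already computed; a short check (variable names are mine):

```python
import math

# Pearson correlation and r^2 from the slide's sums of squares.
Sxx, Sxy, Syy, SSE = 22.475, -202.487, 2078.183, 253.89
r = Sxy / math.sqrt(Sxx * Syy)       # same sign as the slope estimate
r2 = (Syy - SSE) / Syy               # proportion of variation "explained"
print(round(r, 2), round(r2, 2))
```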
• 14. Example - SPSS Output: Pearson’s and Spearman’s Measures

  Pearson Correlations
                                SCORE     LSD_CONC
  SCORE      Pearson Corr.      1         -.937**
             Sig. (2-tailed)    .         .002
             N                  7         7
  LSD_CONC   Pearson Corr.      -.937**   1
             Sig. (2-tailed)    .002      .
             N                  7         7
  **. Correlation is significant at the 0.01 level (2-tailed).

  Spearman's rho
                                SCORE     LSD_CONC
  SCORE      Corr. Coefficient  1.000     -.929**
             Sig. (2-tailed)    .         .003
             N                  7         7
  LSD_CONC   Corr. Coefficient  -.929**   1.000
             Sig. (2-tailed)    .003      .
             N                  7         7
  **. Correlation is significant at the 0.01 level (2-tailed).
• 15. Analysis of Variance in Regression
  • Goal: Partition the total variation in y into variation “explained” by x and random variation

      (yi − ȳ) = (yi − ŷi) + (ŷi − ȳ)
      Σ(yi − ȳ)² = Σ(yi − ŷi)² + Σ(ŷi − ȳ)²

  • These three sums of squares and degrees of freedom are:
    • Total (Syy)    dfTotal = n−1
    • Error (SSE)    dfError = n−2
    • Model (SSR)    dfModel = 1
• 16. Analysis of Variance in Regression

  Source of    Sum of     Degrees of   Mean
  Variation    Squares    Freedom      Square             F
  Model        SSR        1            MSR = SSR/1        F = MSR/MSE
  Error        SSE        n-2          MSE = SSE/(n-2)
  Total        Syy        n-1

  • Analysis of Variance - F-test
  • H0: β1 = 0    HA: β1 ≠ 0
      T.S.: Fobs = MSR/MSE
      R.R.: Fobs ≥ Fα,1,n−2
      P-val: P(F ≥ Fobs)
• 17. Example - Pharmacodynamics of LSD
  • Total Sum of Squares:
      Syy = Σ(yi − ȳ)² = 2078.183    dfTotal = 7 − 1 = 6
  • Error Sum of Squares:
      SSE = Σ(yi − ŷi)² = 253.890    dfError = 7 − 2 = 5
  • Model Sum of Squares:
      SSR = Σ(ŷi − ȳ)² = 2078.183 − 253.890 = 1824.293    dfModel = 1
• 18. Example - Pharmacodynamics of LSD

  Source of    Sum of      Degrees of   Mean
  Variation    Squares     Freedom      Square      F
  Model        1824.293    1            1824.293    35.93
  Error         253.890    5              50.778
  Total        2078.183    6

  • Analysis of Variance - F-test
  • H0: β1 = 0    HA: β1 ≠ 0
      T.S.: Fobs = MSR/MSE = 35.93
      R.R.: Fobs ≥ F.05,1,5 = 6.61
      P-val: P(F ≥ 35.93)
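The ANOVA table entries follow mechanically from Syy and SSE; a quick sketch reproducing them (the critical value F.05,1,5 = 6.61 is taken from an F-table):

```python
# ANOVA partition and F-test for the LSD regression.
Syy, SSE, n = 2078.183, 253.890, 7
SSR = Syy - SSE            # model sum of squares
MSR = SSR / 1              # df_model = 1
MSE = SSE / (n - 2)        # df_error = n - 2 = 5
F_obs = MSR / MSE
print(round(F_obs, 2))     # well above F_{.05,1,5} = 6.61
```

Note that for simple linear regression Fobs = tobs² up to rounding (35.93 ≈ (−5.994)²), so the F-test and the 2-sided t-test for β1 are equivalent.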
• 19. Example - SPSS Output

  ANOVA (Dependent Variable: SCORE; Predictors: (Constant), LSD_CONC)
  Model          Sum of Squares   df   Mean Square   F        Sig.
  1 Regression   1824.302         1    1824.302      35.928   .002
    Residual      253.881         5      50.776
    Total        2078.183         6
• 20. Multiple Regression
  • Numeric Response variable (Y)
  • p Numeric predictor variables
  • Model: Y = β0 + β1x1 + ⋅⋅⋅ + βpxp + ε
  • Partial Regression Coefficients: βi ≡ effect (on the mean response) of increasing the ith predictor variable by 1 unit, holding all other predictors constant
• 21. Example - Effect of Birth Weight on Body Size in Early Adolescence
  • Response: Height at early adolescence (n=250 cases)
  • Predictors (p=6 explanatory variables):
    • Adolescent age (x1, in years: 11-14)
    • Tanner stage (x2, units not given)
    • Gender (x3=1 if male, 0 if female)
    • Gestational age (x4, in weeks at birth)
    • Birth length (x5, units not given)
    • Birth weight group (x6=1,...,6: <1500g (1), 1500-1999g (2), 2000-2499g (3), 2500-2999g (4), 3000-3499g (5), >3500g (6))
  Source: Falkner, et al (2004)
• 22. Least Squares Estimation
  • Population model for the mean response:

      E(Y) = β0 + β1x1 + ⋅⋅⋅ + βpxp

  • Least squares fitted (predicted) equation, minimizing SSE:

      Ŷ = β̂0 + β̂1x1 + ⋅⋅⋅ + β̂pxp        SSE = Σ(Y − Ŷ)²

  • All statistical software packages/spreadsheets can compute least squares estimates and their standard errors
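What the software does under the hood can be sketched as solving the normal equations (X′X)β̂ = X′y. The implementation and the tiny dataset below are mine, purely for illustration; the data are generated exactly from y = 2 + 3x1 − x2, so the fit should recover those coefficients:

```python
# Minimal multiple-regression least squares via the normal equations,
# solved with Gaussian elimination (a sketch, not production code).
def fit_ols(X, y):
    """X: list of rows [1, x1, ..., xp] (intercept column included)."""
    k, m = len(X[0]), len(X)
    # Build X'X and X'y
    A = [[sum(X[i][r] * X[i][c] for i in range(m)) for c in range(k)]
         for r in range(k)]
    b = [sum(X[i][r] * y[i] for i in range(m)) for r in range(k)]
    # Forward elimination with partial pivoting
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution
    beta = [0.0] * k
    for r in range(k - 1, -1, -1):
        beta[r] = (b[r] - sum(A[r][c] * beta[c]
                              for c in range(r + 1, k))) / A[r][r]
    return beta

# Hypothetical noiseless data from y = 2 + 3*x1 - 1*x2.
X = [[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1], [1, 2, 3]]
y = [2 + 3 * x1 - 1 * x2 for (_, x1, x2) in X]
print([round(v, 6) for v in fit_ols(X, y)])
```

In practice one would use a numerically stabler method (e.g. a QR decomposition, as statistical packages do), but the estimates solve the same system.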
• 23. Analysis of Variance
  • Direct extension of the ANOVA based on simple linear regression
  • Only adjustments are to degrees of freedom:
    – dfModel = p    dfError = n−p−1

  Source of    Sum of     Degrees of   Mean
  Variation    Squares    Freedom      Square               F
  Model        SSR        p            MSR = SSR/p          F = MSR/MSE
  Error        SSE        n-p-1        MSE = SSE/(n-p-1)
  Total        Syy        n-1

      R² = (Syy − SSE)/Syy = SSR/Syy
• 24. Testing for the Overall Model - F-test
  • Tests whether any of the explanatory variables are associated with the response
  • H0: β1 = ⋅⋅⋅ = βp = 0 (None of the xs associated with y)
  • HA: Not all βi = 0

      T.S.: Fobs = MSR/MSE = (R²/p) / [(1 − R²)/(n − p − 1)]
      R.R.: Fobs ≥ Fα,p,n−p−1
      P-val: P(F ≥ Fobs)
• 25. Example - Effect of Birth Weight on Body Size in Early Adolescence
  • Authors did not print the ANOVA table, but did provide the following:
  • n=250    p=6    R²=0.26
  • H0: β1 = ⋅⋅⋅ = β6 = 0    HA: Not all βi = 0

      T.S.: Fobs = MSR/MSE = (R²/p) / [(1 − R²)/(n − p − 1)]
                 = (0.26/6) / [(1 − 0.26)/(250 − 6 − 1)] = 0.0433/0.0030 = 14.2
      R.R.: Fobs ≥ F.05,6,243 = 2.13
      P-val: P(F ≥ 14.2)
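This is a useful trick: the overall F-statistic needs only n, p, and R², even when the ANOVA table itself is not reported. A one-line check (the critical value 2.13 is taken from an F-table):

```python
# Overall F-statistic from R^2 alone (n=250, p=6, R^2=0.26, as reported).
n, p, R2 = 250, 6, 0.26
F_obs = (R2 / p) / ((1 - R2) / (n - p - 1))
print(round(F_obs, 1))   # exceeds F_{.05,6,243} = 2.13, so reject H0
```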
• 26. Testing Individual Partial Coefficients - t-tests
  • Wish to determine whether the response is associated with a single explanatory variable, after controlling for the others
  • H0: βi = 0    HA: βi ≠ 0 (2-sided alternative)

      T.S.: tobs = β̂i/σ̂β̂i
      R.R.: |tobs| ≥ tα/2,n−p−1
      P-val: 2P(t ≥ |tobs|)
• 27. Example - Effect of Birth Weight on Body Size in Early Adolescence

  Variable           b       sb      t=b/sb    P-val (z)
  Adolescent Age     2.86    0.99     2.89     .0038
  Tanner Stage       3.41    0.89     3.83     <.001
  Male               0.08    1.26     0.06     .9522
  Gestational Age   -0.11    0.21    -0.52     .6030
  Birth Length       0.44    0.19     2.32     .0204
  Birth Wt Grp      -0.78    0.64    -1.22     .2224

  • Controlling for all other predictors: adolescent age, Tanner stage, and birth length are associated with adolescent height
• 28. Models with Dummy Variables
  • Some models have both numeric and categorical explanatory variables (recall gender in the example)
  • If a categorical variable has k levels, create k−1 dummy variables that take on the value 1 if the level of interest is present, 0 otherwise
  • The baseline level of the categorical variable is the one for which all k−1 dummy variables are set to 0
  • The regression coefficient corresponding to a dummy variable is the difference between the mean for that level and the mean for the baseline group, controlling for all numeric predictors
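The k−1 coding scheme can be sketched in a few lines. The three levels and the choice of "low" as baseline below are hypothetical, purely for illustration:

```python
# k-1 dummy coding for a categorical predictor with k = 3 levels.
levels = ["low", "medium", "high"]
baseline = "low"                                      # arbitrary choice here
dummies = [lv for lv in levels if lv != baseline]     # k - 1 = 2 dummy variables

def encode(value):
    """Return the dummy vector for one observation."""
    return [1 if value == lv else 0 for lv in dummies]

print(encode("low"), encode("medium"), encode("high"))
# Baseline maps to all zeros; every other level sets exactly one dummy to 1.
```

In the regression, the coefficient on each dummy then estimates that level's mean difference from the baseline, as stated above.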
• 29. Example - Deep Cervical Infections
  • Subjects - Patients with deep neck infections
  • Response (Y) - Length of stay in hospital
  • Predictors (one numeric, 11 dichotomous):
    – Age (x1)
    – Gender (x2=1 if female, 0 if male)
    – Fever (x3=1 if body temp > 38C, 0 if not)
    – Neck swelling (x4=1 if present, 0 if absent)
    – Neck pain (x5=1 if present, 0 if absent)
    – Trismus (x6=1 if present, 0 if absent)
    – Underlying disease (x7=1 if present, 0 if absent)
    – Respiration difficulty (x8=1 if present, 0 if absent)
    – Complication (x9=1 if present, 0 if absent)
    – WBC > 15000/mm3 (x10=1 if present, 0 if absent)
    – CRP > 100µg/ml (x11=1 if present, 0 if absent)
  Source: Wang, et al (2003)
• 30. Example - Weather and Spinal Patients
  • Subjects - Visitors to National Spinal Network in 23 cities completing the SF-36 form
  • Response - Physical Function subscale (1 of 10 reported)
  • Predictors:
    – Patient’s age (x1)
    – Gender (x2=1 if female, 0 if male)
    – High temperature on day of visit (x3)
    – Low temperature on day of visit (x4)
    – Dew point (x5)
    – Wet bulb (x6)
    – Total precipitation (x7)
    – Barometric pressure (x8)
    – Length of sunlight (x9)
    – Moon phase (new, wax crescent, 1st Qtr, wax gibbous, full moon, wan gibbous, last Qtr, wan crescent; presumably coded as 8−1=7 dummy variables)
  Source: Glaser, et al (2004)
• 31. Analysis of Covariance
  • Combination of 1-Way ANOVA and Linear Regression
  • Goal: Compare numeric responses among k groups, adjusting for numeric concomitant variable(s), referred to as covariate(s)
  • Clinical trial applications: Response is post-treatment score, covariate is pre-treatment score
  • Epidemiological applications: Outcomes compared across exposure conditions, adjusted for other risk factors (age, smoking status, sex, ...)