REGRESSION ANALYSIS
Regression
• Regression: a technique concerned with predicting some variables by knowing others
• The process of predicting variable Y using variable X
• Tells you how values of Y change as a function of changes in values of X
Correlation and Regression
• Correlation describes the strength of a linear
relationship between two variables
• Linear means “straight line”
• Regression tells us how to draw the straight line
described by the correlation
Regression
• Calculates the “best-fit” line for a certain set of data
• The regression line makes the sum of the squares of the
residuals smaller than for any other line
• Regression minimizes residuals
[Scatter plot: Wt (kg) on the x-axis (60 to 120) versus SBP (mmHg) on the y-axis (80 to 220), with the fitted regression line]
Regression
• We are able to construct a best-fitting straight line through the scatter-diagram points and then formulate a regression equation of the form ŷ = b0 + b1X.
Simple Linear Regression
The output of a regression is a function that predicts the dependent variable (y) based upon values of the independent variable (x). Simple regression fits a straight line to the data:

ŷ = b0 + b1X ± ε

where b0 is the y-intercept, b1 is the slope (= ∆y/∆x), and ε is the error term.
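As a minimal sketch of these formulas in Python (the data are hypothetical, loosely echoing the weight/SBP scatter plot earlier), b0 and b1 can be estimated directly with the least-squares formulas:

```python
import numpy as np

# Hypothetical data: weight (kg) vs. systolic blood pressure (mmHg)
x = np.array([64, 70, 75, 82, 88, 95, 103, 110])
y = np.array([110, 118, 125, 134, 142, 155, 168, 180])

# Least-squares estimates of the slope (b1) and intercept (b0)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

y_hat = b0 + b1 * x        # prediction for each observed x
residuals = y - y_hat      # prediction errors (ε)

print(f"y^ = {b0:.2f} + {b1:.2f}x")
```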
Simple Linear Regression
The function will make a prediction for each observed data point. The observation is denoted by y and the prediction is denoted by ŷ.
Simple Linear Regression
For each observation, the variation can be described as:

y = ŷ + ε
Actual = Explained + Error

where ε is the prediction error.
Regression
A least squares regression selects the line with the lowest total sum of squared prediction errors. This value is called the Sum of Squares of Error, or SSE.
Calculating SSR
The Sum of Squares Regression (SSR) is the sum of the squared differences between the prediction for each observation and the population mean ȳ.
Regression Formulas
The Total Sum of Squares (SST) is equal to SSR + SSE. Mathematically,

SSR = ∑(ŷ − ȳ)²   (measure of explained variation)
SSE = ∑(y − ŷ)²   (measure of unexplained variation)
SST = SSR + SSE = ∑(y − ȳ)²   (measure of total variation in y)
The Coefficient of Determination
The proportion of total variation (SST) that is explained by the regression (SSR) is known as the Coefficient of Determination, and is often referred to as R².

R² = SSR / SST = SSR / (SSR + SSE)

The value of R² can range between 0 and 1; the higher its value, the more of the variation in y the regression explains. It is often expressed as a percentage.
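A short Python sketch of the decomposition, repeating the hypothetical data from the earlier fit so the block runs on its own; the assert confirms SST = SSR + SSE:

```python
import numpy as np

# Hypothetical data and fit, repeated from the earlier sketch
x = np.array([64, 70, 75, 82, 88, 95, 103, 110])
y = np.array([110, 118, 125, 134, 142, 155, 168, 180])
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

ssr = np.sum((y_hat - y.mean()) ** 2)   # explained variation
sse = np.sum((y - y_hat) ** 2)          # unexplained variation
sst = np.sum((y - y.mean()) ** 2)       # total variation
assert np.isclose(sst, ssr + sse)       # SST = SSR + SSE

r_squared = ssr / sst                   # equivalently SSR / (SSR + SSE)
print(f"SSR={ssr:.1f}  SSE={sse:.1f}  SST={sst:.1f}  R^2={r_squared:.3f}")
```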
Standard Error of Regression
The Standard Error of a regression is a measure of its variability. It can be used in a similar manner to a standard deviation, allowing for prediction intervals: ŷ ± 2 standard errors gives approximately a 95% prediction interval, and ± 3 standard errors approximately a 99% interval.

The Standard Error is calculated by taking the square root of the average squared prediction error:

Standard Error = √( SSE / (n − k) )

where n is the number of observations in the sample and k is the total number of variables in the model.
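Continuing the same hypothetical example (repeated so the block runs on its own), the Standard Error follows directly from SSE; here k = 2, counting the intercept and the slope per the slide's definition:

```python
import numpy as np

# Hypothetical data and fit, repeated from the earlier sketch
x = np.array([64, 70, 75, 82, 88, 95, 103, 110])
y = np.array([110, 118, 125, 134, 142, 155, 168, 180])
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
sse = np.sum((y - (b0 + b1 * x)) ** 2)

# n observations; k = 2 variables in the model (intercept and slope)
n, k = len(y), 2
standard_error = np.sqrt(sse / (n - k))
print(f"Standard Error = {standard_error:.2f}")
# A rough 95% prediction interval around a new prediction:
# (b0 + b1 * x_new) +/- 2 * standard_error
```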
The output of a simple regression is the coefficient β and the constant A. The equation is then:

y = A + βx + ε

where ε is the residual error. β is the per-unit change in the dependent variable for each unit change in the independent variable. Mathematically:

β = ∆y / ∆x
Multiple Linear Regression
More than one independent variable can be used to explain variance in the dependent variable, as long as the independent variables are not linearly related to each other. A multiple regression takes the form:

y = A + β1X1 + β2X2 + … + βkXk + ε

where k is the number of independent variables (parameters).
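A minimal sketch of a two-predictor fit using NumPy's least-squares solver; X1, X2, and y are hypothetical data invented for illustration:

```python
import numpy as np

# Hypothetical data: two predictors X1, X2 and outcome y
X1 = np.array([3.0, 5.0, 6.0, 8.0, 9.0, 11.0])
X2 = np.array([1.0, 2.0, 2.5, 3.0, 4.0, 5.0])
y = np.array([12.0, 18.0, 20.0, 26.0, 29.0, 35.0])

# Design matrix with a leading column of ones for the constant A
X = np.column_stack([np.ones_like(X1), X1, X2])

# Least-squares solution: coefficients [A, beta1, beta2]
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
A, beta1, beta2 = coefs
print(f"y = {A:.2f} + {beta1:.2f}*X1 + {beta2:.2f}*X2")
```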
Multicollinearity
Multicollinearity is a condition in which two or more independent variables are highly linearly correlated. It inflates the variance of the coefficient estimates and can make the regression numerically unstable.

Example table of correlations:

      Y      X1     X2
Y     1.000
X1    0.802  1.000
X2    0.848  0.578  1.000

A correlation table can suggest which independent variables may be significant. Generally, an independent variable that is strongly correlated with the dependent variable and only weakly correlated with the other independent variables can be included as a possible predictor.
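A quick way to produce such a correlation table, assuming hypothetical data with columns y, x1, x2:

```python
import numpy as np

# Hypothetical observations (columns: y, x1, x2)
data = np.array([
    [12.0,  3.0, 1.0],
    [18.0,  5.0, 2.0],
    [20.0,  6.0, 2.5],
    [26.0,  8.0, 3.0],
    [29.0,  9.0, 4.0],
    [35.0, 11.0, 5.0],
])

# Pairwise Pearson correlations between the columns
corr = np.corrcoef(data, rowvar=False)
print(np.round(corr, 3))
# High off-diagonal values between the predictor columns
# (x1 vs. x2) signal potential multicollinearity.
```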
Nonlinear Regression
Nonlinear functions can also be fit as regressions. Common
choices include Power, Logarithmic, Exponential, and Logistic,
but any continuous function can be used.
[Diagnostic residual plots: when the relationship is not linear, the residuals show a curved pattern against x; when it is linear, the residuals scatter randomly around zero]
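As an illustrative sketch of the nonlinear case, a model can be fit with SciPy's curve_fit; the exponential form and the data here are assumptions chosen for the example:

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical data following an exponential trend
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.7, 7.4, 20.1, 54.6, 148.4, 403.2])

def exponential(x, a, b):
    """Exponential model y = a * exp(b * x)."""
    return a * np.exp(b * x)

# Non-linear least squares fit of the model parameters
(a, b), _ = curve_fit(exponential, x, y, p0=(1.0, 1.0))
print(f"y = {a:.2f} * exp({b:.2f} * x)")
```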
Regression Output in Excel

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.982655
R Square            0.96561
Adjusted R Square   0.959879
Standard Error      26.01378
Observations        15

ANOVA
             df   SS         MS         F          Significance F
Regression    2   228014.6   114007.3   168.4712   1.65E-09
Residual     12   8120.603   676.7169
Total        14   236135.2

              Coefficients   Standard Error   t Stat      P-value    Lower 95%   Upper 95%
Intercept     562.151        21.0931          26.65094    4.78E-12   516.1931    608.1089
Temperature   -5.436581      0.336216         -16.1699    1.64E-09   -6.169133   -4.704029
Insulation    -20.01232      2.342505         -8.543127   1.91E-06   -25.1162    -14.90844

Estimated Heating Oil = 562.15 - 5.436 (Temperature) - 20.012 (Insulation)

Y = B0 + B1X1 + B2X2 + B3X3 + … ± Error
Total = Estimated/Predicted ± Error
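An analogous summary can be produced in Python with statsmodels; this is a sketch only, and the heating-oil numbers below are hypothetical, so they will not reproduce the Excel figures above:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data with the same structure as the Excel example:
# predictors Temperature and Insulation, outcome HeatingOil
temperature = np.array([35, 29, 36, 58, 65, 30, 52, 45, 53, 38, 46, 58, 34, 60, 62])
insulation  = np.array([ 3,  4,  7,  7,  5,  6,  5,  5,  4,  4,  7,  9,  3,  8,  6])
heating_oil = np.array([400, 420, 300, 150, 120, 350, 200, 250, 240, 320, 210, 100, 410, 110, 130])

# Add a constant column so the model includes an intercept
X = sm.add_constant(np.column_stack([temperature, insulation]))
model = sm.OLS(heating_oil, X).fit()

# Prints a table analogous to Excel's regression output: R-squared,
# ANOVA F statistic, coefficients, t stats, p-values, and CIs
print(model.summary())
```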
Significance testing…

Slope: the sampling distribution of the estimated slope β̂1 follows a t distribution with n − 2 degrees of freedom.

H0: β1 = 0 (no linear relationship)
H1: β1 ≠ 0 (a linear relationship does exist)

Tn−2 = (β̂1 − 0) / s.e.(β̂1)
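A minimal sketch of this test on the hypothetical data used earlier (repeated so the block runs on its own); scipy.stats supplies the t distribution:

```python
import numpy as np
from scipy import stats

# Hypothetical data and fit, repeated from the earlier sketch
x = np.array([64, 70, 75, 82, 88, 95, 103, 110])
y = np.array([110, 118, 125, 134, 142, 155, 168, 180])
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
residuals = y - (b0 + b1 * x)

# Standard error of the slope estimate
n = len(y)
s = np.sqrt(np.sum(residuals ** 2) / (n - 2))        # residual std. dev.
se_b1 = s / np.sqrt(np.sum((x - x.mean()) ** 2))

# Test H0: beta1 = 0 against H1: beta1 != 0
t_stat = (b1 - 0) / se_b1
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)      # two-sided p-value
print(f"t({n - 2}) = {t_stat:.2f}, p = {p_value:.2g}")
```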
Functions of multivariate analysis:
• Control for confounders
• Test for interactions between predictors (effect modification)
• Improve predictions
Interpreting Regression

Continuous outcome (means)

Outcome variable: Continuous (e.g., pain scale, cognitive function)

Are the observations independent or correlated?

Independent observations:
• T-test: compares means between two independent groups
• ANOVA: compares means between more than two independent groups
• Pearson’s correlation coefficient (linear correlation): shows linear correlation between two continuous variables
• Linear regression: multivariate regression technique used when the outcome is continuous; gives slopes

Correlated observations:
• Paired t-test: compares means between two related groups (e.g., the same subjects before and after)
• Repeated-measures ANOVA: compares changes over time in the means of two or more groups (repeated measurements)
• Mixed models/GEE modeling: multivariate regression techniques to compare changes over time between two or more groups; gives rate of change over time

Alternatives if the normality assumption is violated (and small sample size): non-parametric statistics
• Wilcoxon signed-rank test: non-parametric alternative to the paired t-test
• Wilcoxon rank-sum test (= Mann-Whitney U test): non-parametric alternative to the t-test
• Kruskal-Wallis test: non-parametric alternative to ANOVA
• Spearman rank correlation coefficient: non-parametric alternative to Pearson’s correlation coefficient
Covariance

cov(x, y) = ∑ᵢ (xᵢ − x̄)(yᵢ − ȳ) / (n − 1)

Covariance is a measure of the joint variability of two random variables.

Interpreting Covariance
• cov(X,Y) > 0: X and Y are positively correlated
• cov(X,Y) < 0: X and Y are inversely correlated
• cov(X,Y) = 0: X and Y are uncorrelated (zero covariance does not by itself imply independence)
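A small Python check of the covariance formula on hypothetical data; NumPy's cov uses the same n − 1 denominator:

```python
import numpy as np

# Hypothetical paired observations
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 5.0, 9.0])

# Sample covariance per the slide's formula
cov_manual = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

# Same quantity from NumPy: off-diagonal of the 2x2 covariance matrix
cov_numpy = np.cov(x, y)[0, 1]
print(cov_manual, cov_numpy)   # both positive: x and y move together
```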
Types of variables to be analyzed

Predictor variable(s) | Outcome variable | Statistical procedure or measure of association

Cross-sectional/case-control studies
• Categorical (>2 groups) | Continuous | ANOVA
• Continuous | Continuous | Simple linear regression
• Multivariate (categorical and continuous) | Continuous | Multiple linear regression
• Categorical | Categorical | Chi-square test (or Fisher’s exact)
• Binary | Binary | Odds ratio, risk ratio
• Multivariate | Binary | Logistic regression

Cohort Studies/Clinical Trials
• Binary | Binary | Risk ratio
• Categorical | Time-to-event | Kaplan-Meier/log-rank test
• Multivariate | Time-to-event | Cox proportional hazards regression, hazard ratio
• Binary (two groups) | Continuous | T-test
• Binary | Ranks/ordinal | Wilcoxon rank-sum test
• Categorical | Continuous | Repeated-measures ANOVA
• Multivariate | Continuous | Mixed models; GEE modeling
Alternative summary: statistics for various types of outcome data

Outcome variable: Continuous (e.g., pain scale, cognitive function)
• Independent observations: T-test; ANOVA; linear correlation; linear regression
• Correlated observations: paired t-test; repeated-measures ANOVA; mixed models/GEE modeling
• Assumptions: outcome is normally distributed (important for small samples); outcome and predictor have a linear relationship

Outcome variable: Binary or categorical (e.g., fracture yes/no)
• Independent observations: difference in proportions; relative risks; chi-square test; logistic regression
• Correlated observations: McNemar’s test; conditional logistic regression; GEE modeling
• Assumptions: the chi-square test assumes sufficient numbers in each cell (≥5)

Outcome variable: Time-to-event (e.g., time to fracture)
• Independent observations: Kaplan-Meier statistics; Cox regression
• Correlated observations: n/a
• Assumptions: Cox regression assumes proportional hazards between groups
Binary or categorical outcomes (proportions); HRP 259/HRP 261

Outcome variable: Binary or categorical (e.g., fracture, yes/no)

Are the observations correlated?

Independent observations:
• Chi-square test: compares proportions between two or more groups
• Relative risks: odds ratios or risk ratios
• Logistic regression: multivariate technique used when the outcome is binary; gives multivariate-adjusted odds ratios

Correlated observations:
• McNemar’s chi-square test: compares a binary outcome between correlated groups (e.g., before and after)
• Conditional logistic regression: multivariate regression technique for a binary outcome when groups are correlated (e.g., matched data)
• GEE modeling: multivariate regression technique for a binary outcome when groups are correlated (e.g., repeated measures)

Alternatives to the chi-square test if cells are sparse:
• Fisher’s exact test: compares proportions between independent groups when there are sparse data (some cells <5)
• McNemar’s exact test: compares proportions between correlated groups when there are sparse data (some cells <5)
Time-to-event outcome (survival data); HRP 262

Outcome variable: Time-to-event (e.g., time to fracture)

Independent observations:
• Kaplan-Meier statistics: estimates survival functions for each group (usually displayed graphically); compares survival functions with the log-rank test
• Cox regression: multivariate technique for time-to-event data; gives multivariate-adjusted hazard ratios

Correlated observations: n/a (already over time)

Modifications to Cox regression if the proportional-hazards assumption is violated:
• Time-dependent predictors or time-dependent hazard ratios (tricky!)