MULTIPLE LINEAR REGRESSION
Avjinder Singh Kaler and Kristi Mai
We will look at a method for analyzing a linear relationship involving
more than two variables.
We focus on these key elements:
1. Finding the multiple regression equation.
2. Interpreting the adjusted R² and the p-value as measures of
how well the multiple regression equation fits the sample data.
• Multiple Regression Equation – given a collection of sample data with
several (𝑘-many) explanatory variables, the equation that algebraically
describes the relationship between the response variable 𝑦 and the
explanatory variables 𝑥1, 𝑥2, …, 𝑥𝑘 is:
ŷ = 𝑏0 + 𝑏1𝑥1 + 𝑏2𝑥2 + ⋯ + 𝑏𝑘𝑥𝑘
• We are using more than one explanatory variable to predict a response variable now
• In practice, you need large amounts of data to use several predictor/explanatory
variables
* Guideline: your sample size should be at least 10 times the number of 𝑥 variables *
• Multiple Regression Line – the graph of the multiple regression equation
• This multiple regression line still fits the sample points best according to the least squares
property
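As a quick illustration (not taken from the slides), the least-squares coefficients for a two-predictor model can be computed directly; the data below are made up so that the true coefficients are known in advance:

```python
import numpy as np

# Hypothetical sample data: two explanatory variables and a response.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
y = 1.0 + 2.0 * x1 + 3.0 * x2  # built so the true coefficients are known

# Design matrix with a leading column of ones for the intercept b0.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least-squares estimates b = (b0, b1, b2).
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)  # approximately [1. 2. 3.]
```

Because the response was constructed without noise, the fit recovers the coefficients exactly; with real data the estimates would only approximate the underlying relationship.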
• Visualization – scatterplots of each pair (𝑥𝑘, 𝑦) of quantitative data can still be
helpful in determining whether there is a relationship between two variables
• These scatterplots can be created one at a time, but it is common to visualize all
the pairs of variables within one plot. This is often called a pairs plot, pairwise
scatterplot, or scatterplot matrix.
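As a numeric companion to the scatterplot matrix, the pairwise correlations between all variables can be computed in one call; the data here are invented purely for illustration:

```python
import numpy as np

# Hypothetical data: two predictors and a response.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([5.0, 3.0, 4.0, 1.0, 2.0])
y = 2.0 * x1 + np.array([0.1, -0.2, 0.0, 0.2, -0.1])  # y tracks x1 closely

# Rows are variables, columns are observations.
R = np.corrcoef(np.vstack([x1, x2, y]))
print(np.round(R, 2))  # entry [0, 2] is the correlation between x1 and y
```

A strong pairwise correlation with 𝑦 suggests a variable is worth keeping as a predictor, which mirrors what the corresponding panel of the scatterplot matrix would show.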
Population parameter (equation): 𝑦 = 𝛽0 + 𝛽1𝑥1 + 𝛽2𝑥2 + ⋯ + 𝛽𝑘𝑥𝑘
Sample statistic (equation): ŷ = 𝑏0 + 𝑏1𝑥1 + 𝑏2𝑥2 + ⋯ + 𝑏𝑘𝑥𝑘
Note:
• ŷ is the predicted value of 𝑦
• 𝑘 is the number of predictor variables (also called independent variables or 𝑥 variables)
• Requirements for Regression:
1. The sample data are a Simple Random Sample of quantitative data
2. Each of the pairs of data (𝑥𝑘, 𝑦) has a bivariate normal distribution
(recall this definition)
3. Random errors associated with the regression equation (i.e. residuals)
are independent and normally distributed with a mean of 0 and a
standard deviation 𝜎
• Formulas for 𝑏 𝑘:
• Statistical software will be used to calculate the individual coefficient
estimates, 𝑏 𝑘
1. Use common sense and practical considerations to include or
exclude variables
2. Consider the P-value for the test of overall model significance
• Hypotheses:
𝐻0: 𝛽1 = 𝛽2 = ⋯ = 𝛽 𝑘 = 0
𝐻1: 𝐴𝑡 𝑙𝑒𝑎𝑠𝑡 𝑜𝑛𝑒 𝛽 𝑘 ≠ 0
• Test Statistic: 𝐹 = 𝑀𝑆(𝑅𝑒𝑔𝑟𝑒𝑠𝑠𝑖𝑜𝑛) / 𝑀𝑆(𝐸𝑟𝑟𝑜𝑟)
• This will result in an ANOVA table with a p-value that expresses the overall
statistical significance of the model
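The overall F test above can be sketched from scratch; the data below are simulated (all numbers are illustrative, not from the slides):

```python
import numpy as np
from scipy import stats

# Hypothetical data: n = 10 observations, k = 2 predictors.
rng = np.random.default_rng(0)
x1 = np.arange(10.0)
x2 = rng.normal(size=10)                       # a noise predictor
y = 3.0 + 1.5 * x1 + rng.normal(scale=0.5, size=10)

n, k = len(y), 2
X = np.column_stack([np.ones(n), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b

ss_reg = np.sum((y_hat - y.mean()) ** 2)       # explained sum of squares
ss_err = np.sum((y - y_hat) ** 2)              # residual sum of squares
F = (ss_reg / k) / (ss_err / (n - k - 1))      # MS(Regression) / MS(Error)
p_value = stats.f.sf(F, k, n - k - 1)          # right-tail F probability
print(F, p_value)
```

Since the simulated response has a strong linear signal in 𝑥1, the p-value is tiny and the null hypothesis that all slopes are zero is rejected; statistical software reports exactly this F and p-value in its ANOVA table.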
3. Consider equations with high adjusted 𝑅² values
• 𝑅 is the multiple correlation coefficient that describes the correlation
between the observed 𝑦 values and the predicted 𝑦 values
• 𝑅² is the multiple coefficient of determination and measures how well the
multiple regression equation fits the sample data
• Problem: this measure of model “fitness” never decreases as more variables
are included, no matter how little the most recently added predictor
variable actually contributes
• Adjusted 𝑅² is the multiple coefficient of determination modified to
account for the number of variables in the model and the sample size
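The adjustment can be sketched directly from the usual formula, adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1); the R² values below are hypothetical:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for n observations and k predictor variables."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Adding a near-useless predictor raises plain R^2 slightly (0.70 -> 0.705)
# but the adjusted value goes DOWN, penalizing the extra variable.
print(adjusted_r2(0.70, 40, 2))   # ≈ 0.6838
print(adjusted_r2(0.705, 40, 3))  # ≈ 0.6804
```

This is why adjusted 𝑅², not plain 𝑅², is used to compare candidate models with different numbers of predictors.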
4. Consider equations with the fewest number of predictor/explanatory
variables if models that are being compared are nearly equivalent in
terms of significance and fit (i.e. p-value and adjusted 𝑅2)
• This is known as the “Law of Parsimony”
• We are looking for the simplest yet most informative model
• Individual t-tests of particular regression parameters may help select the
correct model and eliminate insignificant explanatory variables
Notice: If the regression equation does not appear to be useful for predictions,
the best predicted value of a 𝑦 variable is still its point estimate [i.e. the sample
mean of the 𝑦 variable would be the best predicted value for that variable]
• Identify the response and potential explanatory variables by
constructing a scatterplot matrix
• Create a multiple regression model
• Perform the appropriate tests of the following:
• Overall model significance (the ANOVA i.e. the 𝐹 test)
• Individual variable significance (𝑡 tests)
• In addition, find the following:
• Find the adjusted 𝑅2 value to assess the predictive power of the model
• Perform a Residual Analysis to verify the Requirements for Linear
Regression have been satisfied:
1. Construct a residual plot and verify that there is no pattern (other than a
straight-line pattern) and that the residual plot does not become thicker
or thinner (fan out)
• Examples are shown below:
2. Use a histogram, normal quantile plot, or Shapiro–Wilk test of normality
to confirm that the values of the residuals have a distribution that is
approximately normal
• Normal Quantile Plot (aka QQ Plot) * Examples on the next 3 slides *
• Shapiro–Wilk Normality Test
• This will help you assess the normality of a given set of data (in this case, the
normality of the residuals) when the visual examination of the QQ Plot and/or
the histogram of the data seem unclear to you and leave you stumped!
• Hypotheses:
H0: The data come from a normal distribution
H1: The data do not come from a normal distribution
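A minimal sketch of running the Shapiro–Wilk test on a set of residuals; here the "residuals" are simulated normal data, so we expect no evidence against H0:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
residuals = rng.normal(loc=0.0, scale=1.0, size=50)  # stand-in residuals

# W statistic near 1 and a large p-value: no evidence against normality.
# A small p-value (e.g. < 0.05) would lead us to reject H0.
stat, p = stats.shapiro(residuals)
print(f"W = {stat:.3f}, p = {p:.3f}")
```

In practice you would pass the actual residuals from the fitted regression model rather than simulated values.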
Normal: Histogram of IQ scores is close to being bell-shaped, suggests that the IQ
scores are from a normal distribution. The normal quantile plot shows points that are
reasonably close to a straight-line pattern. It is safe to assume that these IQ scores
are from a normally distributed population.
Uniform: Histogram of data having a uniform distribution. The corresponding
normal quantile plot suggests that the points are not normally distributed because
the points show a systematic pattern that is not a straight-line pattern. These
sample values are not from a population having a normal distribution.
Skewed: Histogram of the amounts of rainfall in Boston for every Monday during
one year. The shape of the histogram is skewed, not bell-shaped. The
corresponding normal quantile plot shows points that are not at all close to a
straight-line pattern. These rainfall amounts are not from a population having a
normal distribution.
The table to the right includes a random
sample of heights of mothers, fathers, and their
daughters (based on data from the National Health and Nutrition
Examination Survey).
Find the multiple regression equation in which
the response (y) variable is the height of a
daughter and the predictor (x) variables are
the height of the mother and height of the
father.
The StatCrunch results are shown here:
From the display, we see that the multiple
regression equation is:
𝐷𝑎𝑢𝑔ℎ𝑡𝑒𝑟 = 7.5 + 0.707𝑀𝑜𝑡ℎ𝑒𝑟 + 0.164 𝐹𝑎𝑡ℎ𝑒𝑟
We could write this equation as:
𝑦 = 7.5 + 0.707𝑥1 + 0.164𝑥2
where 𝑦 is the predicted height of a
daughter,
𝑥1 is the height of the mother, and 𝑥2 is the
height of the father.
The preceding technology display shows the adjusted coefficient of
determination as R-Sq(adj) = 63.7%.
When we compare this multiple regression equation to others, it is better
to use the adjusted R² of 63.7% than the unadjusted R².
Based on StatCrunch, the p-value is less than 0.0001, indicating that the
multiple regression equation has good overall significance and is usable
for predictions.
That is, it makes sense to predict the heights of daughters based on heights
of mothers and fathers.
The p-value results from a test of the null hypothesis that β1 = β2 = 0, and
rejection of this hypothesis indicates the equation is effective in predicting
the heights of daughters.
Data Set 2 in Appendix B includes the age, foot length, shoe print length,
shoe size, and height for each of 40 different subjects.
Using those sample data, find the regression equation that is the best for
predicting height.
The table on the next slide includes key results from the combinations of
the five predictor variables.
Using critical thinking and statistical analysis:
1. Delete the variable age.
2. Delete the variable shoe size, because it is really a rounded form of foot length.
3. For the remaining variables of foot length and shoe print length, select foot length
because its adjusted R2 of 0.7014 is greater than 0.6520 for shoe print length.
4. Although it appears that foot length alone is best, we note that criminals usually wear
shoes, so shoe print lengths are more likely to be found than foot lengths.
Hence, the final regression equation includes only foot length:
𝑦 = 𝛽0 + 𝛽1𝑥1
where 𝛽0 is the intercept and 𝛽1 is the coefficient of the 𝑥1 variable (foot length).
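A sketch of fitting this one-predictor model by least squares; the (foot length, height) pairs below are invented for illustration and are not the Appendix B data:

```python
import numpy as np

# Hypothetical (foot length, height) pairs in cm -- illustrative only.
foot = np.array([25.0, 26.5, 27.0, 28.2, 29.1, 30.0])
height = np.array([168.0, 172.0, 174.5, 178.0, 181.0, 184.5])

# np.polyfit with degree 1 returns (slope b1, intercept b0).
b1, b0 = np.polyfit(foot, height, 1)
predicted = b0 + b1 * foot  # fitted heights
print(b0, b1)
```

With an intercept in the model, the least-squares residuals sum to zero, and the positive slope reflects that taller people tend to have longer feet.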
The methods of the above section (Multiple Linear Regression) rely on variables
that are continuous in nature. Many times we are interested in dichotomous or
binary variables.
These variables have only two possible categorical outcomes such as
male/female, success/failure, dead/alive, etc.
Indicator or dummy variables are artificial variables that can be used to specify
the categories of the binary variable such as 0=male/1=female.
If an indicator variable is included in the regression model as a
predictor/explanatory variable, the methods we have are appropriate.
HOWEVER, can we handle a situation where the variable we are trying to predict
is categorical and/or binary? Notice that this is a different situation.
But, YES!!
The data in the table also include
the dummy variable of sex (coded
as 0 = female and 1 = male).
Given that a mother is 63 inches tall
and a father is 69 inches tall, find the
regression equation and use it to
predict the height of a daughter and
a son.
Using technology, we get the regression equation:
𝐻𝑒𝑖𝑔ℎ𝑡 𝑜𝑓 𝐶ℎ𝑖𝑙𝑑 = 25.6 + 0.377 𝐻𝑒𝑖𝑔ℎ𝑡 𝑜𝑓 𝑀𝑜𝑡ℎ𝑒𝑟 + 0.195 𝐻𝑒𝑖𝑔ℎ𝑡 𝑜𝑓 𝐹𝑎𝑡ℎ𝑒𝑟 + 4.15(𝑠𝑒𝑥)
We substitute in 0 for the sex variable, 63 for the mother, and 69 for the
father, and predict the daughter will be 62.8 inches tall.
We substitute in 1 for the sex variable, 63 for the mother, and 69 for the
father, and predict the son will be 67 inches tall.
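The two predictions can be checked by plugging into the fitted equation from the slides:

```python
# Fitted equation from the slides:
# height = 25.6 + 0.377*mother + 0.195*father + 4.15*sex
def predict_height(mother, father, sex):
    return 25.6 + 0.377 * mother + 0.195 * father + 4.15 * sex

daughter = predict_height(63, 69, 0)  # sex = 0 codes female
son = predict_height(63, 69, 1)       # sex = 1 codes male
print(round(daughter, 1), round(son, 1))  # 62.8 67.0
```

The only difference between the two predictions is the dummy-variable term, so sons are predicted to be 4.15 inches taller than daughters of the same parents.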