MULTIPLE LINEAR REGRESSION
Avjinder Singh Kaler and Kristi Mai
We will look at a method for analyzing a linear relationship involving
more than two variables.
We focus on these key elements:
1. Finding the multiple regression equation.
2. Interpreting the adjusted R² and the p-value as measures of
how well the multiple regression equation fits the sample data.
• Multiple Regression Equation – given a collection of sample data with
several (𝑘-many) explanatory variables, the equation that algebraically
describes the relationship between the response variable 𝑦 and the
explanatory variables 𝑥1, 𝑥2, …, 𝑥𝑘 is:
ŷ = 𝑏0 + 𝑏1𝑥1 + 𝑏2𝑥2 + ⋯ + 𝑏𝑘𝑥𝑘
• We are using more than one explanatory variable to predict a response variable now
• In practice, you need large amounts of data to use several predictor/explanatory
variables
* Guideline: your sample size should be at least 10 times the number of 𝑥 variables *
• Multiple Regression Line – the graph of the multiple regression equation
• This multiple regression line still fits the sample points best according to the least squares
property
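As a quick illustration (not taken from the slides), the least-squares coefficients for a two-predictor model can be computed directly; the data below are made up so that the true coefficients are known in advance:

```python
import numpy as np

# Hypothetical sample data: two explanatory variables and a response.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
y = 1.0 + 2.0 * x1 + 3.0 * x2  # built so the true coefficients are known

# Design matrix with a leading column of ones for the intercept b0.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least-squares estimates b = (b0, b1, b2).
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)  # approximately [1. 2. 3.]
```

Because the response was constructed without noise, the fit recovers the coefficients exactly; with real data the estimates would only approximate the underlying relationship.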
• Visualization – scatterplots of each pair (𝑥𝑘, 𝑦) of quantitative data can still be
helpful in determining whether there is a relationship between two variables
• These scatterplots can be created one at a time, but it is common to visualize all
the pairs of variables within one plot. This is often called a pairs plot, pairwise
scatterplot, or scatterplot matrix.
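As a numeric companion to the scatterplot matrix, the pairwise correlations between all variables can be computed in one call; the data here are invented purely for illustration:

```python
import numpy as np

# Hypothetical data: two predictors and a response.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([5.0, 3.0, 4.0, 1.0, 2.0])
y = 2.0 * x1 + np.array([0.1, -0.2, 0.0, 0.2, -0.1])  # y tracks x1 closely

# Rows are variables, columns are observations.
R = np.corrcoef(np.vstack([x1, x2, y]))
print(np.round(R, 2))  # entry [0, 2] is the correlation between x1 and y
```

A strong pairwise correlation with 𝑦 suggests a variable is worth keeping as a predictor, which mirrors what the corresponding panel of the scatterplot matrix would show.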
Population parameter (equation): 𝑦 = 𝛽0 + 𝛽1𝑥1 + 𝛽2𝑥2 + ⋯ + 𝛽𝑘𝑥𝑘
Sample statistic (equation): ŷ = 𝑏0 + 𝑏1𝑥1 + 𝑏2𝑥2 + ⋯ + 𝑏𝑘𝑥𝑘
Note:
• ŷ is the predicted value of 𝑦
• 𝑘 is the number of predictor variables (also called independent variables or 𝑥 variables)
• Requirements for Regression:
1. The sample data are a Simple Random Sample of quantitative data
2. Each of the pairs of data (𝑥𝑘, 𝑦) has a bivariate normal distribution
(recall this definition)
3. Random errors associated with the regression equation (i.e. residuals)
are independent and normally distributed with a mean of 0 and a
standard deviation 𝜎
• Formulas for 𝑏 𝑘:
• Statistical software will be used to calculate the individual coefficient
estimates, 𝑏 𝑘
1. Use common sense and practical considerations to include or
exclude variables
2. Consider the P-value for the test of overall model significance
• Hypotheses:
𝐻0: 𝛽1 = 𝛽2 = ⋯ = 𝛽 𝑘 = 0
𝐻1: 𝐴𝑡 𝑙𝑒𝑎𝑠𝑡 𝑜𝑛𝑒 𝛽 𝑘 ≠ 0
• Test Statistic: 𝐹 = 𝑀𝑆(𝑅𝑒𝑔𝑟𝑒𝑠𝑠𝑖𝑜𝑛) / 𝑀𝑆(𝐸𝑟𝑟𝑜𝑟)
• This will result in an ANOVA table with a p-value that expresses the overall
statistical significance of the model
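The overall F test above can be sketched from scratch; the data below are simulated (all numbers are illustrative, not from the slides):

```python
import numpy as np
from scipy import stats

# Hypothetical data: n = 10 observations, k = 2 predictors.
rng = np.random.default_rng(0)
x1 = np.arange(10.0)
x2 = rng.normal(size=10)                       # a noise predictor
y = 3.0 + 1.5 * x1 + rng.normal(scale=0.5, size=10)

n, k = len(y), 2
X = np.column_stack([np.ones(n), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b

ss_reg = np.sum((y_hat - y.mean()) ** 2)       # explained sum of squares
ss_err = np.sum((y - y_hat) ** 2)              # residual sum of squares
F = (ss_reg / k) / (ss_err / (n - k - 1))      # MS(Regression) / MS(Error)
p_value = stats.f.sf(F, k, n - k - 1)          # right-tail F probability
print(F, p_value)
```

Since the simulated response has a strong linear signal in 𝑥1, the p-value is tiny and the null hypothesis that all slopes are zero is rejected; statistical software reports exactly this F and p-value in its ANOVA table.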
3. Consider equations with high adjusted 𝑅² values
• 𝑅 is the multiple correlation coefficient that describes the correlation
between the observed 𝑦 values and the predicted 𝑦 values
• 𝑅² is the multiple coefficient of determination and measures how well the
multiple regression equation fits the sample data
• Problem: this measure of model “fitness” never decreases as more variables
are included, no matter how little the most recently added predictor
variable actually contributes
• Adjusted 𝑅² is the multiple coefficient of determination modified to
account for the number of variables in the model and the sample size
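The adjustment can be sketched directly from the usual formula, adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1); the R² values below are hypothetical:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for n observations and k predictor variables."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Adding a near-useless predictor raises plain R^2 slightly (0.70 -> 0.705)
# but the adjusted value goes DOWN, penalizing the extra variable.
print(adjusted_r2(0.70, 40, 2))   # ≈ 0.6838
print(adjusted_r2(0.705, 40, 3))  # ≈ 0.6804
```

This is why adjusted 𝑅², not plain 𝑅², is used to compare candidate models with different numbers of predictors.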
4. Consider equations with the fewest number of predictor/explanatory
variables if models that are being compared are nearly equivalent in
terms of significance and fit (i.e. p-value and adjusted 𝑅2)
• This is known as the “Law of Parsimony”
• We are looking for the simplest yet most informative model
• Individual t-tests of particular regression parameters may help select the
correct model and eliminate insignificant explanatory variables
Notice: If the regression equation does not appear to be useful for predictions,
the best predicted value of a 𝑦 variable is still its point estimate [i.e. the sample
mean of the 𝑦 variable would be the best predicted value for that variable]
• Identify the response and potential explanatory variables by
constructing a scatterplot matrix
• Create a multiple regression model
• Perform the appropriate tests of the following:
• Overall model significance (the ANOVA i.e. the 𝐹 test)
• Individual variable significance (𝑡 tests)
• In addition, find the following:
• Find the adjusted 𝑅2 value to assess the predictive power of the model
• Perform a Residual Analysis to verify the Requirements for Linear
Regression have been satisfied:
1. Construct a residual plot and verify that there is no pattern (other than a
straight-line pattern) and that the residual plot does not become thicker
or thinner (fan out)
• Examples are shown below:
2. Use a histogram, normal quantile plot, or Shapiro–Wilk test of normality
to confirm that the values of the residuals have a distribution that is
approximately normal
• Normal Quantile Plot (aka QQ Plot) * Examples on the next 3 slides *
• Shapiro–Wilk Normality Test
• This will help you assess the normality of a given set of data (in this case, the
normality of the residuals) when the visual examination of the QQ Plot and/or
the histogram of the data seem unclear to you and leave you stumped!
• Hypotheses:
H0: The data come from a normal distribution
H1: The data do not come from a normal distribution
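A minimal sketch of running the Shapiro–Wilk test on a set of residuals; here the "residuals" are simulated normal data, so we expect no evidence against H0:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
residuals = rng.normal(loc=0.0, scale=1.0, size=50)  # stand-in residuals

# W statistic near 1 and a large p-value: no evidence against normality.
# A small p-value (e.g. < 0.05) would lead us to reject H0.
stat, p = stats.shapiro(residuals)
print(f"W = {stat:.3f}, p = {p:.3f}")
```

In practice you would pass the actual residuals from the fitted regression model rather than simulated values.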
Normal: Histogram of IQ scores is close to being bell-shaped, suggests that the IQ
scores are from a normal distribution. The normal quantile plot shows points that are
reasonably close to a straight-line pattern. It is safe to assume that these IQ scores
are from a normally distributed population.
Uniform: Histogram of data having a uniform distribution. The corresponding
normal quantile plot suggests that the points are not normally distributed because
the points show a systematic pattern that is not a straight-line pattern. These
sample values are not from a population having a normal distribution.
Skewed: Histogram of the amounts of rainfall in Boston for every Monday during
one year. The shape of the histogram is skewed, not bell-shaped. The
corresponding normal quantile plot shows points that are not at all close to a
straight-line pattern. These rainfall amounts are not from a population having a
normal distribution.
The table to the right includes a random
sample of heights of mothers, fathers, and their
daughters (based on data from the National Health and Nutrition
Examination Survey).
Find the multiple regression equation in which
the response (y) variable is the height of a
daughter and the predictor (x) variables are
the height of the mother and height of the
father.
The StatCrunch results are shown here:
From the display, we see that the multiple
regression equation is:
𝐷𝑎𝑢𝑔ℎ𝑡𝑒𝑟 = 7.5 + 0.707𝑀𝑜𝑡ℎ𝑒𝑟 + 0.164 𝐹𝑎𝑡ℎ𝑒𝑟
We could write this equation as:
𝑦 = 7.5 + 0.707𝑥1 + 0.164𝑥2
where 𝑦 is the predicted height of a
daughter,
𝑥1 is the height of the mother, and 𝑥2 is the
height of the father.
The preceding technology display shows the adjusted coefficient of
determination as R-Sq(adj) = 63.7%.
When we compare this multiple regression equation to others, it is better
to use the adjusted R² of 63.7% than the unadjusted R².
Based on StatCrunch, the p-value is less than 0.0001, indicating that the
multiple regression equation has good overall significance and is usable
for predictions.
That is, it makes sense to predict the heights of daughters based on heights
of mothers and fathers.
The p-value results from a test of the null hypothesis that β1 = β2 = 0, and
rejection of this hypothesis indicates the equation is effective in predicting
the heights of daughters.
Data Set 2 in Appendix B includes the age, foot length, shoe print length,
shoe size, and height for each of 40 different subjects.
Using those sample data, find the regression equation that is the best for
predicting height.
The table on the next slide includes key results from the combinations of
the five predictor variables.
Using critical thinking and statistical analysis:
1. Delete the variable age.
2. Delete the variable shoe size, because it is really a rounded form of foot length.
3. For the remaining variables of foot length and shoe print length, select foot length
because its adjusted R2 of 0.7014 is greater than 0.6520 for shoe print length.
4. Although it appears that foot length alone is best, we note that criminals usually wear
shoes, so shoe print lengths are more likely to be found than foot lengths.
Hence, the final regression equation includes only foot length:
𝑦 = 𝛽0 + 𝛽1𝑥1
where 𝛽0 is the intercept and 𝛽1 is the coefficient of the 𝑥1 variable (foot length).
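A sketch of fitting this one-predictor model by least squares; the (foot length, height) pairs below are invented for illustration and are not the Appendix B data:

```python
import numpy as np

# Hypothetical (foot length, height) pairs in cm -- illustrative only.
foot = np.array([25.0, 26.5, 27.0, 28.2, 29.1, 30.0])
height = np.array([168.0, 172.0, 174.5, 178.0, 181.0, 184.5])

# np.polyfit with degree 1 returns (slope b1, intercept b0).
b1, b0 = np.polyfit(foot, height, 1)
predicted = b0 + b1 * foot  # fitted heights
print(b0, b1)
```

With an intercept in the model, the least-squares residuals sum to zero, and the positive slope reflects that taller people tend to have longer feet.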
The methods of the above section (Multiple Linear Regression) rely on variables
that are continuous in nature. Many times we are interested in dichotomous or
binary variables.
These variables have only two possible categorical outcomes such as
male/female, success/failure, dead/alive, etc.
Indicator or dummy variables are artificial variables that can be used to specify
the categories of the binary variable such as 0=male/1=female.
If an indicator variable is included in the regression model as a
predictor/explanatory variable, the methods we have are appropriate.
HOWEVER, can we handle a situation where the variable we are trying to predict
is categorical and/or binary? Notice that this is a different situation.
But, YES!!
The data in the table also include
the dummy variable of sex (coded
as 0 = female and 1 = male).
Given that a mother is 63 inches tall
and a father is 69 inches tall, find the
regression equation and use it to
predict the height of a daughter and
a son.
Using technology, we get the regression equation:
𝐻𝑒𝑖𝑔ℎ𝑡 𝑜𝑓 𝐶ℎ𝑖𝑙𝑑 = 25.6 + 0.377 𝐻𝑒𝑖𝑔ℎ𝑡 𝑜𝑓 𝑀𝑜𝑡ℎ𝑒𝑟 + 0.195 𝐻𝑒𝑖𝑔ℎ𝑡 𝑜𝑓 𝐹𝑎𝑡ℎ𝑒𝑟 + 4.15(𝑠𝑒𝑥)
We substitute in 0 for the sex variable, 63 for the mother, and 69 for the
father, and predict the daughter will be 62.8 inches tall.
We substitute in 1 for the sex variable, 63 for the mother, and 69 for the
father, and predict the son will be 67 inches tall.
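The two predictions can be checked by plugging into the fitted equation from the slides:

```python
# Fitted equation from the slides:
# height = 25.6 + 0.377*mother + 0.195*father + 4.15*sex
def predict_height(mother, father, sex):
    return 25.6 + 0.377 * mother + 0.195 * father + 4.15 * sex

daughter = predict_height(63, 69, 0)  # sex = 0 codes female
son = predict_height(63, 69, 1)       # sex = 1 codes male
print(round(daughter, 1), round(son, 1))  # 62.8 67.0
```

The only difference between the two predictions is the dummy-variable term, so sons are predicted to be 4.15 inches taller than daughters of the same parents.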