Simple Regression Scatterplots: Graphing Relationships
Background Information Pharmex is a chain of drugstores that operates around the country. To see how effective their advertising and other promotional activities are, the company has collected data from 50 randomly selected metropolitan regions. In each region it has compared its own promotional expenditures and sales to those of the leading competitor in the region over the past year.
Background Information -- continued There are two variables, each of which is an index, not a dollar amount. Promote: Pharmex’s promotional expenditures as a percentage of those of the leading competitor. Sales: Pharmex’s sales as a percentage of those of the leading competitor. The company expects that there is a positive relationship between the two variables, so that regions with relatively more expenditures have relatively more sales. However, it is not clear what the nature of this relationship is.
PHARMEX.XLS The data are listed in this file. Here is a partial listing. What type of relationship, if any, is apparent in a scatterplot?
Creating the Scatterplot In preparing to create the scatterplot we must decide which variable should be on the horizontal axis. In regression analysis, we always put the explanatory variable on the horizontal axis and the response variable on the vertical axis. In this example the company believes that larger promotional expenditures “cause” larger values of sales, so we put Sales on the vertical axis and Promote on the horizontal axis.
Creating the Scatterplot -- continued We create the following scatterplot using StatPro’s Scatterplot procedure.
Interpretation The scatterplot indicates that there is a positive relationship between Promote and Sales - the points tend to rise from bottom left to top right - but the relationship is not perfect. The correlation of 0.673 is shown automatically on the plot. The important things to note about the correlation are that it is positive and that its magnitude is moderately large.
Causation Unless the data are obtained in a carefully controlled experiment - not the case here - we can never make definitive statements about causation in regression analysis. The reason for this is that we can almost never rule out the possibility that some other variable is causing the variation in both of the observed variables.
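The correlation reported on the plot can be reproduced by hand. The following is a minimal sketch of the usual sample correlation formula; the function name `pearson_r` and the seven index values are made up for illustration and are not taken from PHARMEX.XLS.

```python
import math

def pearson_r(x, y):
    """Sample correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squares of deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical index values, NOT the Pharmex data.
promote = [80, 90, 95, 100, 105, 110, 120]
sales = [88, 92, 101, 97, 108, 104, 118]

r = pearson_r(promote, sales)
print(f"correlation = {r:.3f}")
```

A value between 0 and 1 corresponds to points that tend to rise from bottom left to top right without lying exactly on a line.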
Background Information In Example 13.1 we created scatterplots for Pharmex.  We found that there was a positive but not perfect relationship between Promote and Sales. We now want to find the least squares line for the Pharmex drugstore data, using Sales as the response variable and Promote as the explanatory variable.
PHARMEX.XLS The data are listed in this file. Here is a partial listing.
Least Squares Estimation Since there are hints of a linear relationship between the two variables we can draw a line through the points to produce a reasonably good fit. However, we need to proceed systematically and not just randomly draw lines. We must choose the line that makes the vertical distances from the points to the line as small as possible. The fitted value is the vertical distance from the horizontal axis to the line and the residual is the vertical distance from the line to the point.
Least Squares Estimation -- continued The idea is simple. By using a straight line to reflect the relationship between Promote and Sales, we expect a given Sales to be at the height of the line above any particular value of Promote. That is, we expect Sales to equal the fitted value. But the relationship is not perfect. Not all points lie exactly on the line. The differences are the residuals. They show how much the observed values differ from the fitted values.
Least Squares Estimation -- continued We can now explain how to choose the “best fitting” line through the points in the scatterplot. We choose the line with the smallest sum of the squared residuals. This line is called the least squares line. Most statistical packages perform the calculations to find this line, so we need not be concerned with the technical details or with calculating it by hand.
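For readers who want to see what the package does, here is a minimal sketch of the standard closed-form solution for the line with the smallest sum of squared residuals. The function names and the five data points are illustrative assumptions, not part of the lecture or of StatPro.

```python
def least_squares_line(x, y):
    """Return (intercept, slope) of the least squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def sum_squared_residuals(x, y, intercept, slope):
    """Sum of squared vertical distances from the points to the line."""
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))

# Hypothetical data, used only to illustrate the idea.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = least_squares_line(x, y)
best = sum_squared_residuals(x, y, b0, b1)
# Any other line, for example one nudged away from the fit, does worse.
other = sum_squared_residuals(x, y, b0 + 0.5, b1 - 0.1)
print(f"least squares line: {best:.4f}, nudged line: {other:.4f}")
```

Comparing `best` with `other` shows the defining property: no other choice of intercept and slope gives a smaller sum of squared residuals.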
Finding the Least Squares Line with StatPro We use the StatPro/Regression Analysis/Simple menu item. After specifying that Sales is the response (dependent) variable and that Promote is the explanatory (independent) variable, we see the dialog box for scatterplot options as seen here.
Finding the Least Squares Line with StatPro -- continued This gives us the option of creating several scatterplots involving the fitted values and residuals. The regression output includes three parts. The first two are a list of fitted values and residuals, placed in columns next to the data set, and any scatterplots selected from the dialog box. The third part of the output is the most important. It is shown on the next slide.
Regression Output Table
The Regression Output We will eventually learn what all the output in the table means, but for now we will concentrate on a small part. Specifically, we find the intercept and slope of the least squares line under the Coefficient label in cells C16 and C17. They imply that the equation for the least squares line is Predicted Sales = 25.1264 + 0.7623 Promote
Least Squares Line Equation We can interpret this equation as follows. The slope 0.7623 indicates that the sales index tends to increase by about 0.76 for each unit increase in the promotional expenses index. The interpretation of the intercept is less important. It is literally the predicted sales index for a region that does no promotions. For instances like this when the range of observed explanatory variable values does not include 0, it is best to think of the intercept as an “anchor” for the least squares line.
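The interpretation of the slope can be checked with a few lines of arithmetic. This sketch uses the intercept and slope reported in the regression output; the function name `predicted_sales` and the Promote values of 100 and 101 are illustrative assumptions.

```python
INTERCEPT = 25.1264  # from the regression output
SLOPE = 0.7623       # from the regression output

def predicted_sales(promote):
    """Predicted sales index for a given promotional expenses index."""
    return INTERCEPT + SLOPE * promote

# Hypothetical region that spends exactly as much as the leading competitor.
at_100 = predicted_sales(100)
at_101 = predicted_sales(101)
print(f"Predicted Sales at Promote = 100: {at_100:.2f}")
print(f"Change for one extra unit of Promote: {at_101 - at_100:.4f}")
```

The difference between the two predictions equals the slope, which is the "about 0.76 per unit" statement in the text.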
The Scatterplot A useful graph in almost any regression analysis is a scatterplot of residuals (on the vertical axis) versus fitted values. The scatterplot for this data appears on the following slide. We typically examine the scatterplot for striking patterns. A “good” fit not only has small residuals, but it has residuals scattered  randomly  around 0 with no apparent pattern. This is the case here.
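As a rough illustration of what goes into this graph, the sketch below computes fitted values and residuals for a least squares line and lists them side by side. The data are made up and the helper `fit_line` is an assumption; StatPro produces these columns automatically.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical values, NOT the Pharmex data.
x = [80, 90, 95, 100, 105, 110, 120]
y = [88, 92, 101, 97, 108, 104, 118]

b0, b1 = fit_line(x, y)
fitted = [b0 + b1 * a for a in x]
residuals = [obs - fit for obs, fit in zip(y, fitted)]

# Each (fitted, residual) pair is one point in the residual plot.
for fit, res in zip(fitted, residuals):
    print(f"fitted = {fit:7.2f}   residual = {res:6.2f}")
print(f"sum of residuals = {sum(residuals):.6f}")
```

For a least squares line with an intercept the residuals always sum to zero (up to rounding), so the thing to look for in the plot is not their average but whether they show a pattern, such as a curve or a fan shape.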
The Scatterplot of Residuals versus Fitted Values for Pharmex
Scatterplots: Graphing Relationships
Background Information The Bendrix Company manufactures various types of parts for automobiles. The manager of the factory wants to get a better understanding of overhead costs. These overhead costs include supervision, indirect labor, supplies, payroll taxes, overtime premiums, depreciation, and a number of miscellaneous items such as insurance, utilities, and janitorial and maintenance expenses.
Background Information -- continued Some of the overhead costs are “fixed” in the sense they do not vary appreciably with the volume of work being done, whereas others are “variable” and do vary directly with the volume of work being done. It is not easy to draw a clear line between the fixed and variable overhead components. The Bendrix manager has tracked total overhead costs for 36 months.
Background Information -- continued To help explain these he also collected data on two variables that are related to the amount of work done at the factory. These variables are: MachHrs: number of machine hours used during the month ProdRuns: the number of separate production runs during the month To understand this variable we must know that Bendrix manufactures parts in fairly large batches called production runs. Between each run there is a downtime. The manager believes both of these variables might be responsible for variations in overhead costs. Do scatterplots support his belief?
BENDRIX1.XLS The data collected by the manager appear in this file. Each observation (row) corresponds to a single month. We want to investigate any possible relationship between the Overhead variable and the MachHrs and ProdRuns variables, but because these are time series variables we should also look out for relationships between these variables and the Month variable.
The Scatterplots This data set illustrates, even with a modest number of variables, how the number of potentially useful scatterplots can grow quickly. At the least, we need to look at the scatterplots between each potential explanatory variable (MachHrs and ProdRuns) and the response variable (Overhead). These scatterplots are as follows:
Scatterplot of Overhead versus Machine Hours
Scatterplot of Overhead versus Production Runs
The Scatterplots -- continued To check for possible time series patterns we can also create a time series plot for any of the variables. This is equivalent to a scatterplot of the variable versus the Month, with the points joined by lines. One of these is the time series plot for Overhead. The plot is shown next and it shows a fairly random pattern through time, with no apparent upward trend or other obvious time series pattern. We can check that the MachHrs and ProdRuns also indicate no obvious pattern.
Time Series Plot of Overhead versus Month
The Scatterplots -- continued Finally, when multiple explanatory variables exist we can check for relationships between them. The scatterplot of MachHrs versus ProdRuns is a cloud of points that indicates no relationship worth pursuing.
In Summary The Bendrix manager should continue to explore the positive relationship between Overhead and each of the MachHrs and ProdRuns variables. However, none of the variables appear to have any time series behavior, and the two potential explanatory variables do not appear to be related to each other.
Simple Linear Regression
Background Information In Example 13.2 we created scatterplots for Bendrix.  We found that there was a positive relationship between Overhead and each of the MachHrs and ProdRuns variables. However, none of the variables appear to have any time series behavior, and the two potential explanatory variables of MachHrs and ProdRuns do not appear to be related to each other.
BENDRIX1.XLS The data collected by the manager appear in this file. The Bendrix manufacturing data set has two explanatory variables, MachHrs and ProdRuns. Eventually we will estimate a regression equation with both of the variables included. However, if we include only one at a time, what do they tell us about the overhead costs?
Regression Output for Overhead versus MachHrs
Regression Output for Overhead versus ProdRuns
Least Squares Line Equations The two least squares lines are therefore Predicted Overhead = 48,621 + 34.7 MachHrs and Predicted Overhead = 75,606 + 655.1 ProdRuns. Clearly these two equations are quite different, although each effectively breaks Overhead into a fixed component and a variable component. The equations imply that expected overhead increases by about $35 for each extra machine hour and about $655 for each extra production run.
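To make the fixed and variable split concrete, the sketch below evaluates both equations for one month. The coefficients come from the two regression outputs; the function names and the month with 1,500 machine hours and 40 production runs are hypothetical and may not resemble any month in BENDRIX1.XLS.

```python
def overhead_from_machhrs(mach_hrs):
    """Return (fixed, variable) overhead from the MachHrs-only equation."""
    return 48621, 34.7 * mach_hrs

def overhead_from_prodruns(prod_runs):
    """Return (fixed, variable) overhead from the ProdRuns-only equation."""
    return 75606, 655.1 * prod_runs

# Hypothetical month, chosen only for illustration.
fixed_a, variable_a = overhead_from_machhrs(1500)
fixed_b, variable_b = overhead_from_prodruns(40)

print(f"MachHrs equation:  fixed {fixed_a}, variable {variable_a:.0f}, "
      f"total {fixed_a + variable_a:.0f}")
print(f"ProdRuns equation: fixed {fixed_b}, variable {variable_b:.0f}, "
      f"total {fixed_b + variable_b:.0f}")
```

The two equations attribute very different amounts to the fixed component, which is one way to see that neither tells the whole story.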
Least Squares Line Equations -- continued The differences between these two lines can be attributed to neither one telling the whole story. If the manager’s goal is to split overhead into a fixed and variable component, then the variable component should include both of the measures of work activity to give a more complete explanation of overhead. We will see how this can be done using multiple regression at a later time.
Multiple Regression
Background Information In Example 13.2 we created scatterplots for Bendrix and in Example 13.2a we determined that the variable component of overhead must include both MachHrs and ProdRuns.  We found that there was a positive relationship between Overhead and each of the MachHrs and ProdRuns variables. However, none of the variables appear to have any time series behavior, and the two potential explanatory variables of MachHrs and ProdRuns do not appear to be related to each other.
BENDRIX1.XLS The data collected by the manager appear in this file. The Bendrix manufacturing data set has two explanatory variables, MachHrs and ProdRuns. We need to estimate and interpret the equation for Overhead when both explanatory variables, MachHrs and ProdRuns, are included in the regression equation.
Solution To obtain the desired output we use StatPro/Regression Analysis/Multiple menu item. We select Overhead as the response (dependent) variable and select MachHrs and ProdRuns as the explanatory (independent) variables. The dialog box shown here then gives us options of which scatterplots to obtain and whether we want columns of fitted values and residuals placed next to the data set. For this example we will fill it in as shown.
Solution -- continued The main regression output appears in the next table.
Results The coefficients in the range C16-C18 indicate that the estimated regression equation is Predicted Overhead = 3997 + 43.54 MachHrs + 883.62 ProdRuns
Interpretation of Equation The interpretation of the equation is that if the number of production runs is held constant, then the overhead cost is expected to increase by $43.54 for each extra machine hour; and if the number of machine hours is held constant, the overhead is expected to increase by $883.62 for each extra production run. The Bendrix manager can interpret $3997 as the fixed component of overhead. The slope terms involving MachHrs and ProdRuns are the variable components of overhead.
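The "held constant" reading of each coefficient can be verified directly from the equation. This is a small sketch using the $3997, $43.54 and $883.62 figures quoted in the interpretation above; the function name and the month with 1,500 machine hours and 40 production runs are hypothetical.

```python
FIXED = 3997           # fixed component of overhead
PER_MACH_HR = 43.54    # extra overhead per machine hour
PER_PROD_RUN = 883.62  # extra overhead per production run

def predicted_overhead(mach_hrs, prod_runs):
    """Predicted monthly overhead from the two-variable equation."""
    return FIXED + PER_MACH_HR * mach_hrs + PER_PROD_RUN * prod_runs

# Hypothetical month, chosen only for illustration.
base = predicted_overhead(1500, 40)
one_more_hour = predicted_overhead(1501, 40)  # ProdRuns held constant
one_more_run = predicted_overhead(1500, 41)   # MachHrs held constant

print(f"base prediction: {base:.2f}")
print(f"effect of one extra machine hour: {one_more_hour - base:.2f}")
print(f"effect of one extra production run: {one_more_run - base:.2f}")
```

Changing one input while leaving the other fixed moves the prediction by exactly that input's coefficient, which is what the interpretation states.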
Equation Comparison It is interesting to compare this equation with the separate equations found in the previous example: Predicted Overhead = 48,621 + 34.7 MachHrs and Predicted Overhead = 75,606 + 655.1 ProdRuns. Note that both coefficients have increased. Also, the intercept is now lower than either intercept in the single variable equations. It is difficult to guess the changes that more explanatory variables will cause, but it is likely that changes will occur.
Equation Comparison -- continued The reasoning for this is that when MachHrs is the only variable in the equation, we are obviously not holding ProdRuns constant - we are ignoring it - so in effect the coefficient 34.7 of MachHrs indicates the effect of MachHrs and the omitted ProdRuns on Overhead. But when we include both variables, the coefficient 43.5 of MachHrs indicates the effect of MachHrs only, holding ProdRuns constant. Since the coefficients have different meanings, it is not surprising that we obtain different estimates.
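The effect described here can be reproduced on a toy example. In the sketch below the response is built as exactly 10 + 2*x1 + 5*x2, and x1 and x2 are slightly negatively related; all numbers and function names are invented for illustration and have nothing to do with the Bendrix data. The two-variable fit uses the standard normal-equation solution for two explanatory variables.

```python
def centered_sums(u, v):
    """Sum of products of deviations from the means of u and v."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))

def simple_slope(x, y):
    """Slope when x is the only explanatory variable."""
    return centered_sums(x, y) / centered_sums(x, x)

def multiple_slopes(x1, x2, y):
    """Slopes (b1, b2) when both explanatory variables are included."""
    s11, s22, s12 = centered_sums(x1, x1), centered_sums(x2, x2), centered_sums(x1, x2)
    s1y, s2y = centered_sums(x1, y), centered_sums(x2, y)
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

# Synthetic data: y depends on both variables with known coefficients.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [3, 5, 2, 6, 1, 4]
y = [10 + 2 * a + 5 * b for a, b in zip(x1, x2)]

alone = simple_slope(x1, y)
b1, b2 = multiple_slopes(x1, x2, y)
print(f"slope of x1 alone: {alone:.4f}")
print(f"slope of x1 with x2 included: {b1:.4f}, slope of x2: {b2:.4f}")
```

With x2 omitted, the slope of x1 absorbs part of the effect of x2 and comes out below the value used to build the data; with both variables included the original coefficients are recovered.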
Modeling Possibilities
BANK.XLS The Fifth National Bank of Springfield is facing a gender-discrimination suit. The charge is that its female employees receive substantially smaller salaries than its male employees. The bank’s employee database is listed in this file. Here is a partial list of the data.
Variables For each of the 208 employees, the data set includes the following variables: EducLev: education level, a categorical variable with categories 1 (finished high school), 2 (finished some college courses), 3 (obtained a bachelor’s degree), 4 (took some graduate courses) and 5 (obtained a graduate degree) JobGrade: a categorical variable indicating the current job level, the possible levels being from 1-6 (6 is highest) YrHired: year employee was hired YrBorn: year employee was born Gender: a categorical variable with values “Female” and “Male”
Variables -- continued YrsPrior: number of years of work experience at another bank prior to working at Fifth National PCJob: a dummy variable with value 1 if the employee’s current job is computer-related and value 0 otherwise Salary: current annual salary in thousands of dollars Do the data provide evidence that females are discriminated against in terms of salary?
Naïve Approach A naïve approach to the problem is to compare the average salaries of the males and females. The average of all salaries is $39,922, the average female salary is $37,210, and the average male salary is $45,505. The difference between the averages is statistically significant. The females are definitely earning less, but perhaps there is a reason. The question is whether the difference between the average salaries is still evident after taking other attributes into account. A perfect task for regression.
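The three averages quoted above are tied together by simple arithmetic, since the overall average is a weighted average of the two group averages. The sketch below works out the raw gap and the group sizes that the reported figures imply; the group sizes are a derivation from the rounded averages, not numbers stated in the lecture.

```python
n_total = 208          # number of employees in the data set
mean_all = 39922       # average of all salaries, in dollars
mean_female = 37210    # average female salary
mean_male = 45505      # average male salary

# Raw difference between the group averages.
gap = mean_male - mean_female

# mean_all = (n_female * mean_female + n_male * mean_male) / n_total
# with n_female + n_male = n_total, solved for n_male.
n_male = n_total * (mean_all - mean_female) / (mean_male - mean_female)
n_female = n_total - n_male

print(f"gap between averages: ${gap}")
print(f"implied group sizes: about {n_male:.1f} males and {n_female:.1f} females")
```

Because the averages are rounded, the implied sizes are only approximate, and this raw gap is the quantity that regression will later adjust for attributes such as education and experience.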

Lecture7a Applied Econometrics and Economic Modeling

  • 1. Simple Regression Scatterplots: Graphing Relationships
  • 2. Background Information Pharmex is a chain of drugstores that operates around the country. To see how effective their advertising and other promotional activities are, the company has collected data from 50 randomly selected metropolitan regions. In each region it has compared its own promotional expenditures and sales to those of the leading competitor in the region over the past year.
  • 3. Background Information -- continued There are two variables each of which are indexes, not dollar amounts. Promote: Pharmex’s promotional expenditures as a percentage of those of the leading competitor Sales: Pharmex’s sales as a percentage of those of the leading competitor The company expects that there is a positive relationship between the two variables, so that regions with relatively more expenditures have relatively more sales. However, it is not clear what the nature of this relationship is.
  • 4. PHARMEX.XLS The data are listed in this file. Here is a partial listing. What type of relationship, if any, is apparent in a scatterplot?
  • 5. Creating the Scatterplot In preparing to create the scatterplot we must decide which variable should be on the horizontal axis. In regression analysis, we always put the explanatory variable on the horizontal axis and the response variable on the vertical axis. In this example the store tends to believe that large promotional expenditures “cause” larger values of sales, so we put Sales on the vertical axis and Promote on the horizontal axis.
  • 6. Creating the Scatterplot -- continued We create the following scatterplot using StatPro’s Scatterplot procedure.
  • 7. Interpretation The scatterplot indicates that there is a positive relationship between Promote and Sales - the points tend to rise from bottom left to top right - but the relationship is not perfect. The correlation of 0.673 is shown automatically on the plot. The important things to note about the correlation is that it is positive and its magnitude is moderately large.
  • 8. Causation Unless the data is obtained in a carefully controlled experiment - not the case here - we can never make definitive statements about causation in regression analysis. The reason for this is that we can almost never rule out the possibility that some other variable is causing the variation in both of the observed variables.
  • 9. Background Information In Example 13.1 we created scatterplots for Pharmex. We found that there was a positive but not perfect relationship between Promote and Sales. We now want to find the least squares line for the Pharmex drugstore data, using Sales as the response variable and Promote as the explanatory variable.
  • 10. PHARMEX.XLS The data are listed in this file. Here is a partial listing.
  • 11. Least Squares Estimation Since there are hints of a linear relationship between the two variables we can draw a line through the points to produce a reasonably good fit. However, we need to proceed systematically and not just randomly draw lines. We must choose the line that makes the vertical distances from the points to the line as small as possible. The fitted value is the vertical distance from the horizontal axis to the line and the residual is the vertical distance from the line to the point.
  • 12. Least Squares Estimation -- continued The idea is simple. By using a straight line to reflect the relationship between Promote and Sales, we expect a given Sales to be at the height of the line above any particular value of Promote. That is, we expect Sales to equal the fitted value. But the relationship is not perfect. Not all points lie exactly on the line. The differences are the residuals.. They show how much the observed values differ from the fitted values.
  • 13. Least Squares Estimation -- continued We can now explain how to choose the “best fitting” line through the points in the scatterplot. We choose the line with the smallest sum of the squared residuals. This line is called the least squares line. Most statistical packages perform the calculations to find this line so we need not be concerned with the technical details and hand calculating.
  • 14. Finding the Least Squares Line with StatPro We use the StatPro/Regression Analysis /Simple menu item. After specifying that Sales is the response (dependent) variable and that Promote is the explanatory (independent) variable, we see the dialog box for scatterplot options as seen here.
  • 15. Finding the Least Squares Line with StatPro -- continued This gives us the option of creating several scatterplots involving the fitted values and residuals. The regression output includes three parts. The first two are a list of fitted values and residuals, placed in columns next to the data set, and any scatterplots selected from the dialog box. The third part of the output is the most important. It is shown on the next slide.
  • 17. The Regression Output We will eventually learn what all the output in the table means but for now we will concentrate on a small part. Specifically we find the intercept and slope of the least squares line under the Coefficient label in cells C16 and C17. They imply that the equation for the least squares line is Predicated Sales = 25.1264 + 0.7623Promote
  • 18. Least Square Line Equation We can interpret this equation as follows. The slope 0.7623 indicates that the sales index tends to increase by about 0.76 for each unit increase in the promotional expenses index. The interpretation of the intercept is less important. It is literally the predicted sales index for a region that does no promotions. For instances like this when the range of observed explanatory variable values does not include 0, it is best to think of the intercept as an “anchor” for the least squares line.
  • 19. The Scatterplot A useful graph in almost any regression analysis is a scatterplot of residuals (on the vertical axis) versus fitted values. The scatterplot for this data appears on the following slide. We typically examine the scatterplot for striking patterns. A “good” fit not only has small residuals, but it has residuals scattered randomly around 0 with no apparent pattern. This is the case here.
  • 20. The Scatterplot of Residuals versus Fitted Values for Pharmex
  • 22. Background Information The Bendrix Company manufactures various types of parts for automobiles. The manager of the factory wants to get a better understanding of overhead costs. These overhead costs include supervision, indirect labor, supplies, payroll taxes, overtime premiums,depreciation, and a number of miscellaneous items such as insurance, utilities, and janitorial and maintenance expenses.
  • 23. Background Information -- continued Some of the overhead costs are “fixed” in the sense they do not vary appreciably with the volume of work being done, whereas others are “variable” and do vary directly with the volume of work being done. It is not easy to draw a clear line between the fixed and variable overhead components. The Bendrix manager has tracked total overhead costs for 36 months.
  • 24. Background Information -- continued To help explain these he also collected data on two variables that are related to the amount of work done at the factory. These variables are: MachHrs: number of machine hours used during the month ProdRuns: the number of separate production runs during the month To understand this variable we must know that Bendrix manufactures parts in fairly large batches called production runs. Between each run there is a downtime. The manager believes both of these variables might be responsible for variations in overhead costs. Do scatterplots support his belief?
  • 25. BENDRIX1.XLS The data collected by the manager appears in this file. Each observation (row) corresponds to a single month. We want to investigate any possible relationship between the Overhead variable and the MachHrs and ProdRuns variables but because these are time series variables we should also look out for relationships between these variables and the Month variable.
  • 26. The Scatterplots This data set illustrates, even with the modest number of variables, how the number of potentially useful scatterplots can grow quickly. At the least, we need to look at the scatterplots between each potential explanatory variable (MacHrs and ProdRuns) and the response variable (Overhead). These scatterplots are as follows:
  • 27. Scatterplot of Overhead versus Machine Hours
  • 28. Scatterplot of Overhead versus Production Runs
  • 29. The Scatterplots -- continued To check for possible time series patterns we can also create a time series plot for any of the variables. This is equivalent to a scatterplot of the variable versus the Month, with the points joined by lines. One of these is the time series plot for Overhead. The plot is shown next and it shows a fairly random pattern through time, with no apparent upward trend or other obvious time series pattern. We can check that the MachHrs and ProdRuns also indicate no obvious pattern.
  • 30. Time Series Plot of Overhead versus Month
  • 31. The Scatterplots -- continued Finally, when multiple explanatory variables exist we can check for relationships between them. The scatterplot of MachHrs versus ProdRuns is a cloud of points that indicates no relationship worth pursuing.
  • 32. In Summary The Bendrix manager should continue to explore the positive relationship between Overhead and each of the MachHrs and ProdRuns variables. However, none of the variables appear to have any time series behavior, and the two potential explanatory variables do not appear to be related to each other.
  • 34. Background Information In Example 13.2 we created scatterplots for Bendrix. We found that there was a positive relationship between Overhead and each of the MachHrs and ProdRuns variables. However, none of the variables appear to have any time series behavior, and the two potential explanatory variables of MachHrs and ProdRuns do not appear to be related to each other.
  • 35. BENDRIX1.XLS The data collected by the manager appear in this file. The Bendrix manufacturing data set has two explanatory variables, MachHrs and ProdRuns. Eventually we will estimate a regression equation with both of the variables included. However, if we include only one at a time, what do they tell us about the overhead costs?
  • 36. Regression Output for Overhead versus MachHrs
  • 37. Regression Output for Overhead versus ProdRuns
  • 38. Least Squares Line Equations The two least squares lines are therefore Predicted Overhead = 48,621 + 34.7MachHrs and Predicted Overhead = 75,606 + 655.1ProdRuns. Clearly these two equations are quite different, although each effectively breaks Overhead into a fixed component and a variable component. The equations imply that expected overhead increases by about $35 for each extra machine hour and about $655 for each extra production run.
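The fixed-plus-variable reading of the two lines can be sketched by evaluating them directly. The coefficients are the ones quoted on the slide; the function names and the example month (1,500 machine hours, 35 production runs) are hypothetical values chosen only for illustration and are not taken from BENDRIX1.XLS.

```python
def overhead_from_mach_hrs(mach_hrs: float) -> float:
    """Single-variable line: fixed component 48,621 plus 34.7 per machine hour."""
    return 48621 + 34.7 * mach_hrs


def overhead_from_prod_runs(prod_runs: float) -> float:
    """Single-variable line: fixed component 75,606 plus 655.1 per production run."""
    return 75606 + 655.1 * prod_runs


# Hypothetical month, for illustration only.
mach_hrs, prod_runs = 1500, 35

print(f"From MachHrs:  {overhead_from_mach_hrs(mach_hrs):,.2f}")
print(f"From ProdRuns: {overhead_from_prod_runs(prod_runs):,.2f}")
```

The two lines give different predictions for the same month and different estimates of the fixed component, since each uses only one measure of work activity.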
  • 39. Least Squares Line Equations -- continued The differences between these two lines can be attributed to neither one telling the whole story. If the manager’s goal is to split overhead into a fixed and variable component, then the variable component should include both of the measures of work activity to give a more complete explanation of overhead. We will see how this can be done using multiple regression at a later time.
  • 41. Background Information In Example 13.2 we created scatterplots for Bendrix and in Example 13.2a we determined that the variable component of overhead must include both MachHrs and ProdRuns. We found that there was a positive relationship between Overhead and each of the MachHrs and ProdRuns variables. However, none of the variables appear to have any time series behavior, and the two potential explanatory variables of MachHrs and ProdRuns do not appear to be related to each other.
  • 42. BENDRIX1.XLS The data collected by the manager appear in this file. The Bendrix manufacturing data set has two explanatory variables, MachHrs and ProdRuns. We need to estimate and interpret the equation for Overhead when both explanatory variables, MachHrs and ProdRuns, are included in the regression equation.
  • 43. Solution To obtain the desired output we use the StatPro/Regression Analysis/Multiple menu item. We select Overhead as the response (dependent) variable and select MachHrs and ProdRuns as the explanatory (independent) variables. The dialog box shown here then gives us options of which scatterplots to obtain and whether we want columns of fitted values and residuals placed next to the data set. For this example we will fill it in as shown.
  • 44. Solution -- continued The main regression output appears in the next table.
  • 45. Results The coefficients in the range C16-C18 indicate that the estimated regression equation is Predicted Overhead = 3997 + 43.54MachHrs + 883.62ProdRuns
  • 46. Interpretation of Equation The interpretation of the equation is that if the number of production runs is held constant, then the overhead cost is expected to increase by $43.54 for each extra machine hour; and if the number of machine hours is held constant, the overhead is expected to increase by $883.62 for each extra production run. The Bendrix manager can interpret $3997 as the fixed component of overhead. The slope terms involving MachHrs and ProdRuns are the variable components of overhead.
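The "held constant" interpretation can be checked with a small sketch that evaluates the estimated equation, using the $43.54 and $883.62 figures quoted in the interpretation above. The function name and the baseline month (1,500 machine hours, 35 production runs) are hypothetical and serve only to illustrate the arithmetic.

```python
# Coefficients as quoted in the interpretation of the estimated equation.
INTERCEPT = 3997.0     # fixed component of overhead
B_MACH_HRS = 43.54     # dollars per extra machine hour, ProdRuns held constant
B_PROD_RUNS = 883.62   # dollars per extra production run, MachHrs held constant


def predicted_overhead(mach_hrs: float, prod_runs: float) -> float:
    """Predicted Overhead from the two-variable regression equation."""
    return INTERCEPT + B_MACH_HRS * mach_hrs + B_PROD_RUNS * prod_runs


# Hypothetical baseline month, for illustration only.
base = predicted_overhead(1500, 35)

# Change one variable by one unit while holding the other fixed.
extra_hour = predicted_overhead(1501, 35) - base
extra_run = predicted_overhead(1500, 36) - base

print(f"One extra machine hour:   {extra_hour:.2f}")
print(f"One extra production run: {extra_run:.2f}")
```

Because the equation is linear, these one-unit changes equal the slope coefficients regardless of the baseline month chosen.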
  • 47. Equation Comparison It is interesting to compare this equation with the separate equations found in the previous example: Predicted Overhead = 48,621 + 34.7MachHrs and Predicted Overhead = 75,606 + 655.1ProdRuns. Note that both coefficients have increased. Also, the intercept is now lower than either intercept in the single-variable equations. It is difficult to guess the changes that more explanatory variables will cause, but it is likely that changes will occur.
  • 48. Equation Comparison -- continued The reasoning for this is that when MachHrs is the only variable in the equation, we are obviously not holding ProdRuns constant - we are ignoring it - so in effect the coefficient 34.7 of MachHrs indicates the effect of MachHrs and the omitted ProdRuns on Overhead. But when we include both variables, the coefficient 43.5 of MachHrs indicates the effect of MachHrs only, holding ProdRuns constant. Since the coefficients have different meanings, it is not surprising that we obtain different estimates.
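The link between the two MachHrs coefficients can be written as a standard least squares identity. This is a sketch that does not appear in the original slides: the symbols b_simple, b_1, b_2 and d are introduced here for illustration, and the value of d is implied by the rounded coefficients quoted earlier (34.7, 43.54 and 883.62) rather than reported in the regression output.

```latex
% b_simple : slope of MachHrs when Overhead is regressed on MachHrs alone
% b_1, b_2 : coefficients of MachHrs and ProdRuns in the two-variable equation
% d        : slope of MachHrs when ProdRuns is regressed on MachHrs (same data)
\[
b_{\text{simple}} = b_1 + b_2\, d
\]
\[
34.7 \approx 43.54 + 883.62\, d
\quad\Longrightarrow\quad
d \approx \frac{34.7 - 43.54}{883.62} \approx -0.010
\]
```

Under these assumptions, a very slight negative association between the two explanatory variables (about one hundredth of a production run per machine hour) is enough to move the MachHrs coefficient from about 43.5 to 34.7, because the ProdRuns coefficient is large. Such a weak association is compatible with a scatterplot that looks like a shapeless cloud of points.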
  • 50. BANK.XLS The Fifth National Bank of Springfield is facing a gender-discrimination suit. The charge is that its female employees receive substantially smaller salaries than its male employees. The bank’s employee database is listed in this file. Here is a partial list of the data.
  • 51. Variables For each of the 208 employees, the data set includes the following variables: EducLev: education level, a categorical variable with categories 1 (finished high school), 2 (finished some college courses), 3 (obtained a bachelor’s degree), 4 (took some graduate courses) and 5 (obtained a graduate degree); JobGrade: a categorical variable indicating the current job level, the possible levels being from 1-6 (6 is highest); YrHired: year employee was hired; YrBorn: year employee was born; Gender: a categorical variable with values “Female” and “Male”
  • 52. Variables -- continued YrsPrior: number of years of work experience at another bank prior to working at Fifth National; PCJob: a dummy variable with value 1 if the employee’s current job is computer-related and value 0 otherwise; Salary: current annual salary in thousands of dollars. Do the data provide evidence that females are discriminated against in terms of salary?
  • 53. Naïve Approach A naïve approach to the problem is to compare the average salaries of the males and females. The average of all salaries is $39,922, the average female salary is $37,210, and the average male salary is $45,505. The difference between the averages is statistically significant. The females are definitely earning less on average, but perhaps there is a reason. The question is whether the difference between the average salaries is still evident after taking other attributes into account. This is a natural task for regression.
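The three averages above are tied together by a weighted-average relationship, which gives a quick consistency check. The sketch below uses only the figures quoted in the slides (208 employees and the three averages). The group sizes it produces are implied by those rounded averages and are not stated in the slides, so they should be treated as an inference.

```python
N_EMPLOYEES = 208
MEAN_ALL = 39_922      # average of all salaries, in dollars
MEAN_FEMALE = 37_210   # average female salary, in dollars
MEAN_MALE = 45_505     # average male salary, in dollars

# The overall mean is a weighted average of the group means:
#   n_f * MEAN_FEMALE + (N - n_f) * MEAN_MALE = N * MEAN_ALL
# Solving for n_f gives the implied number of female employees.
n_female_exact = N_EMPLOYEES * (MEAN_MALE - MEAN_ALL) / (MEAN_MALE - MEAN_FEMALE)
n_female = round(n_female_exact)
n_male = N_EMPLOYEES - n_female

# Raw gap between the group averages, before adjusting for any other attribute.
gap = MEAN_MALE - MEAN_FEMALE

# Recombine the group means to confirm they reproduce the overall mean.
recombined = (n_female * MEAN_FEMALE + n_male * MEAN_MALE) / N_EMPLOYEES

print(n_female, n_male, gap)  # 140 68 8295
print(round(recombined))      # 39922
```

The raw gap of $8,295 is what the naïve comparison reports; it does not account for education, job grade, experience or the other variables in the data set.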