Lecture7a Applied Econometrics and Economic Modeling

Simple Regression Scatterplots: Graphing Relationships

Background Information Pharmex is a chain of drugstores that operates around the country. To see how effective their advertising and other promotional activities are, the company has collected data from 50 randomly selected metropolitan regions. In each region it has compared its own promotional expenditures and sales to those of the leading competitor in the region over the past year.

Background Information -- continued There are two variables, both of which are indexes rather than dollar amounts.
Promote: Pharmex’s promotional expenditures as a percentage of those of the leading competitor
Sales: Pharmex’s sales as a percentage of those of the leading competitor
The company expects a positive relationship between the two variables, so that regions with relatively more expenditures have relatively more sales. However, the nature of this relationship is not clear.

PHARMEX.XLS The data are listed in this file. Here is a partial listing. What type of relationship, if any, is apparent in a scatterplot?

Creating the Scatterplot In preparing to create the scatterplot we must decide which variable should be on the horizontal axis. In regression analysis, we always put the explanatory variable on the horizontal axis and the response variable on the vertical axis. In this example the company believes that larger promotional expenditures “cause” larger values of sales, so we put Sales on the vertical axis and Promote on the horizontal axis.

Creating the Scatterplot -- continued We create the following scatterplot using StatPro’s Scatterplot procedure.

Interpretation The scatterplot indicates that there is a positive relationship between Promote and Sales - the points tend to rise from bottom left to top right - but the relationship is not perfect. The correlation of 0.673 is shown automatically on the plot. The important things to note about the correlation are that it is positive and that its magnitude is moderately large.
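
The correlation reported on the plot can be reproduced by hand. The sketch below is only an illustration: the eight data points are made up (they are not the PHARMEX.XLS values), and the function name `correlation` is an assumption introduced here.

```python
import numpy as np

# Hypothetical index values (NOT the Pharmex data): promotional spending and
# sales, each expressed as a percentage of the leading competitor's figure.
promote = np.array([77.0, 110.0, 110.0, 93.0, 90.0, 95.0, 100.0, 85.0])
sales = np.array([85.0, 103.0, 102.0, 109.0, 85.0, 103.0, 110.0, 86.0])

def correlation(x, y):
    """Pearson correlation: sum of cross-deviations scaled by both spreads."""
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

r = correlation(promote, sales)
print(f"correlation = {r:.3f}")
```

A positive value means the points tend to rise from bottom left to top right; a value close to 1 or -1 means they lie close to a straight line.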

Causation Unless the data are obtained in a carefully controlled experiment - not the case here - we can never make definitive statements about causation in regression analysis. The reason is that we can almost never rule out the possibility that some other variable is causing the variation in both of the observed variables.

Background Information In Example 13.1 we created scatterplots for Pharmex. We found that there was a positive but not perfect relationship between Promote and Sales. We now want to find the least squares line for the Pharmex drugstore data, using Sales as the response variable and Promote as the explanatory variable.

PHARMEX.XLS The data are listed in this file. Here is a partial listing.

Least Squares Estimation Since there are hints of a linear relationship between the two variables we can draw a line through the points to produce a reasonably good fit. However, we need to proceed systematically and not just randomly draw lines. We must choose the line that makes the vertical distances from the points to the line as small as possible. The fitted value is the vertical distance from the horizontal axis to the line and the residual is the vertical distance from the line to the point.

Least Squares Estimation -- continued The idea is simple. By using a straight line to reflect the relationship between Promote and Sales, we expect Sales, for any particular value of Promote, to be at the height of the line above that value. That is, we expect Sales to equal the fitted value. But the relationship is not perfect. Not all points lie exactly on the line. The differences are the residuals. They show how much the observed values differ from the fitted values.

Least Squares Estimation -- continued We can now explain how to choose the “best fitting” line through the points in the scatterplot. We choose the line with the smallest sum of squared residuals. This line is called the least squares line. Most statistical packages perform the calculations to find this line, so we need not be concerned with the technical details or with hand calculation.
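
For readers who want to see what the software does, here is a minimal sketch of the calculation using the standard closed-form formulas for the slope and intercept. The five data points and the function names are assumptions made for illustration, not part of the lecture.

```python
import numpy as np

def least_squares_line(x, y):
    """Return (intercept, slope) of the line with the smallest sum of squared residuals."""
    dx = x - x.mean()
    slope = float((dx * (y - y.mean())).sum() / (dx ** 2).sum())
    intercept = float(y.mean() - slope * x.mean())
    return intercept, slope

def sum_squared_residuals(x, y, intercept, slope):
    """Residual = observed value minus fitted value; square and add them up."""
    residuals = y - (intercept + slope * x)
    return float((residuals ** 2).sum())

# Hypothetical data, chosen only to keep the arithmetic easy to follow.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

a, b = least_squares_line(x, y)
best = sum_squared_residuals(x, y, a, b)

# Any other line does worse; here the slope is nudged upward by 0.1.
nudged = sum_squared_residuals(x, y, a, b + 0.1)
print(f"intercept={a:.2f}, slope={b:.2f}, SSR={best:.2f}, nudged SSR={nudged:.2f}")
```

Comparing `best` with `nudged` shows the defining property directly: moving away from the least squares line increases the sum of squared residuals.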

Finding the Least Squares Line with StatPro We use the StatPro/Regression Analysis /Simple menu item. After specifying that Sales is the response (dependent) variable and that Promote is the explanatory (independent) variable, we see the dialog box for scatterplot options as seen here.

Finding the Least Squares Line with StatPro -- continued This gives us the option of creating several scatterplots involving the fitted values and residuals. The regression output includes three parts. The first two are a list of fitted values and residuals, placed in columns next to the data set, and any scatterplots selected from the dialog box. The third part of the output is the most important. It is shown on the next slide.

The Regression Output We will eventually learn what all the output in the table means but for now we will concentrate on a small part. Specifically we find the intercept and slope of the least squares line under the Coefficient label in cells C16 and C17. They imply that the equation for the least squares line is Predicted Sales = 25.1264 + 0.7623Promote

Least Squares Line Equation We can interpret this equation as follows. The slope 0.7623 indicates that the sales index tends to increase by about 0.76 for each unit increase in the promotional expenses index. The interpretation of the intercept is less important. It is literally the predicted sales index for a region that does no promotions. In cases like this, where the range of observed explanatory variable values does not include 0, it is best to think of the intercept as an “anchor” for the least squares line.
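
The interpretation of the slope can be checked with a small calculation that uses the coefficients reported above. The Promote values 100 and 101 are illustrative inputs chosen here, and the function name `predicted_sales` is an assumption.

```python
# Coefficients taken from the regression output quoted in the lecture.
INTERCEPT = 25.1264
SLOPE = 0.7623

def predicted_sales(promote):
    """Predicted sales index for a given promotional expenditure index."""
    return INTERCEPT + SLOPE * promote

at_100 = predicted_sales(100.0)   # illustrative value of Promote
at_101 = predicted_sales(101.0)   # one unit higher

# The difference between the two predictions is the slope, whatever the starting point.
print(f"{at_100:.4f} -> {at_101:.4f}, change = {at_101 - at_100:.4f}")
```

Evaluating `predicted_sales(0.0)` returns the intercept, which is why the intercept is only an anchor when no region has a Promote value near 0.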

The Scatterplot A useful graph in almost any regression analysis is a scatterplot of residuals (on the vertical axis) versus fitted values. The scatterplot for this data appears on the following slide. We typically examine the scatterplot for striking patterns. A “good” fit not only has small residuals, but it has residuals scattered randomly around 0 with no apparent pattern. This is the case here.
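
A short sketch of how the two plotted quantities are obtained may help. It uses made-up data and NumPy's `polyfit` rather than StatPro, so it is an illustration of the idea and not the lecture's procedure.

```python
import numpy as np

# Hypothetical data (not from PHARMEX.XLS).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])

# np.polyfit with degree 1 returns the slope first, then the intercept.
slope, intercept = np.polyfit(x, y, 1)

fitted = intercept + slope * x    # horizontal axis of the residual plot
residuals = y - fitted            # vertical axis of the residual plot

residual_sum = float(residuals.sum())
resid_fitted_corr = float(np.corrcoef(fitted, residuals)[0, 1])
print(f"sum of residuals = {residual_sum:.2e}, correlation with fitted = {resid_fitted_corr:.2e}")
```

For a least squares line with an intercept, the residuals always sum to zero and are uncorrelated with the fitted values, so neither fact signals a good fit. What we look for in the plot are other patterns, such as curvature or a spread that widens as the fitted values grow.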

The Scatterplot of Residuals versus Fitted Values for Pharmex

Scatterplots: Graphing Relationships

Background Information The Bendrix Company manufactures various types of parts for automobiles. The manager of the factory wants to get a better understanding of overhead costs. These overhead costs include supervision, indirect labor, supplies, payroll taxes, overtime premiums, depreciation, and a number of miscellaneous items such as insurance, utilities, and janitorial and maintenance expenses.

Background Information -- continued Some of the overhead costs are “fixed” in the sense they do not vary appreciably with the volume of work being done, whereas others are “variable” and do vary directly with the volume of work being done. It is not easy to draw a clear line between the fixed and variable overhead components. The Bendrix manager has tracked total overhead costs for 36 months.

Background Information -- continued To help explain these costs, he also collected data on two variables that are related to the amount of work done at the factory:
MachHrs: the number of machine hours used during the month
ProdRuns: the number of separate production runs during the month
To understand the ProdRuns variable we must know that Bendrix manufactures parts in fairly large batches called production runs. Between runs there is downtime. The manager believes both of these variables might be responsible for variations in overhead costs. Do scatterplots support his belief?

BENDRIX1.XLS The data collected by the manager appear in this file. Each observation (row) corresponds to a single month. We want to investigate any possible relationship between the Overhead variable and the MachHrs and ProdRuns variables, but because these are time series variables, we should also look for relationships between these variables and the Month variable.

The Scatterplots This data set illustrates, even with a modest number of variables, how the number of potentially useful scatterplots can grow quickly. At the least, we need to look at the scatterplots between each potential explanatory variable (MachHrs and ProdRuns) and the response variable (Overhead). These scatterplots are as follows:

Scatterplot of Overhead versus Machine Hours

Scatterplot of Overhead versus Production Runs

The Scatterplots -- continued To check for possible time series patterns we can also create a time series plot for any of the variables. This is equivalent to a scatterplot of the variable versus the Month, with the points joined by lines. One of these is the time series plot for Overhead. The plot is shown next and it shows a fairly random pattern through time, with no apparent upward trend or other obvious time series pattern. We can check that the MachHrs and ProdRuns also indicate no obvious pattern.

Time Series Plot of Overhead versus Month

The Scatterplots -- continued Finally, when multiple explanatory variables exist we can check for relationships between them. The scatterplot of MachHrs versus ProdRuns is a cloud of points that indicates no relationship worth pursuing.

In Summary The Bendrix manager should continue to explore the positive relationship between Overhead and each of the MachHrs and ProdRuns variables. However, none of the variables appear to have any time series behavior, and the two potential explanatory variables do not appear to be related to each other.

Background Information In Example 13.2 we created scatterplots for Bendrix. We found that there was a positive relationship between Overhead and each of the MachHrs and ProdRuns variables. However, none of the variables appear to have any time series behavior, and the two potential explanatory variables of MachHrs and ProdRuns do not appear to be related to each other.

BENDRIX1.XLS The data collected by the manager appear in this file. The Bendrix manufacturing data set has two explanatory variables, MachHrs and ProdRuns. Eventually we will estimate a regression equation with both of the variables included. However, if we include only one at a time, what do they tell us about the overhead costs?

Regression Output for Overhead versus MachHrs

Regression Output for Overhead versus ProdRuns

Least Squares Line Equations The two least squares lines are therefore Predicted Overhead = 48,621 + 34.7MachHrs and Predicted Overhead = 75,606 + 655.1ProdRuns Clearly these two equations are quite different, although each effectively breaks Overhead into a fixed component and a variable component. The equations imply that expected overhead increases by about $35 for each extra machine hour and about $655 for each extra production run.

Least Squares Line Equations -- continued The differences between these two lines can be attributed to neither one telling the whole story. If the manager’s goal is to split overhead into a fixed and variable component, then the variable component should include both of the measures of work activity to give a more complete explanation of overhead. We will see how this can be done using multiple regression at a later time.

Background Information In Example 13.2 we created scatterplots for Bendrix and in Example 13.2a we determined that the variable component of overhead must include both MachHrs and ProdRuns. We found that there was a positive relationship between Overhead and each of the MachHrs and ProdRuns variables. However, none of the variables appear to have any time series behavior, and the two potential explanatory variables of MachHrs and ProdRuns do not appear to be related to each other.

BENDRIX1.XLS The data collected by the manager appear in this file. The Bendrix manufacturing data set has two explanatory variables, MachHrs and ProdRuns. We need to estimate and interpret the equation for Overhead when both explanatory variables, MachHrs and ProdRuns, are included in the regression equation.

Solution To obtain the desired output we use the StatPro/Regression Analysis/Multiple menu item. We select Overhead as the response (dependent) variable and select MachHrs and ProdRuns as the explanatory (independent) variables. The dialog box shown here then gives us options of which scatterplots to obtain and whether we want columns of fitted values and residuals placed next to the data set. For this example we will fill it in as shown.

Solution -- continued The main regression output appears in the next table.

Results The coefficients in the range C16-C18 indicate that the estimated regression equation is Predicted Overhead = 3997 + 43.54MachHrs + 883.62ProdRuns

Interpretation of Equation The interpretation of the equation is that if the number of production runs is held constant, then the overhead cost is expected to increase by $43.54 for each extra machine hour; and if the number of machine hours is held constant, the overhead is expected to increase by $883.62 for each extra production run. The Bendrix manager can interpret $3997 as the fixed component of overhead. The slope terms involving MachHrs and ProdRuns are the variable components of overhead.
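
The mechanics of fitting an equation with two explanatory variables can be sketched with NumPy instead of StatPro. The six monthly values below are invented, and the overhead figures are generated without noise from the coefficients quoted in the interpretation above, so the fit simply recovers them. This shows the procedure and is not a reanalysis of BENDRIX1.XLS.

```python
import numpy as np

# Hypothetical monthly activity levels (NOT the Bendrix data).
mach_hrs = np.array([1500.0, 1400.0, 1550.0, 1300.0, 1600.0, 1450.0])
prod_runs = np.array([30.0, 35.0, 40.0, 28.0, 45.0, 33.0])

# Synthetic overhead built exactly from the quoted coefficients, with no noise.
overhead = 3997.0 + 43.54 * mach_hrs + 883.62 * prod_runs

# Design matrix: a column of ones for the intercept, then one column per variable.
X = np.column_stack([np.ones(len(overhead)), mach_hrs, prod_runs])
coef, *_ = np.linalg.lstsq(X, overhead, rcond=None)
fixed, per_machine_hour, per_production_run = coef

# Holding ProdRuns constant, one extra machine hour changes the prediction
# by the MachHrs coefficient.
base = coef @ np.array([1.0, 1500.0, 40.0])
plus_one_hour = coef @ np.array([1.0, 1501.0, 40.0])
print(f"fixed={fixed:.0f}, per hour={per_machine_hour:.2f}, per run={per_production_run:.2f}")
```

The difference `plus_one_hour - base` equals the MachHrs coefficient, which is the "holding production runs constant" interpretation in numerical form.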

Equation Comparison It is interesting to compare this equation with the separate equations found in the previous example: Predicted Overhead = 48,621 + 34.7MachHrs and Predicted Overhead = 75,606 + 655.1ProdRuns Note that both coefficients have increased. Also, the intercept is now lower than either intercept in the single-variable equations. It is difficult to guess the changes that more explanatory variables will cause, but it is likely that changes will occur.

Equation Comparison -- continued The reason for this is that when MachHrs is the only variable in the equation, we are obviously not holding ProdRuns constant - we are ignoring it - so in effect the coefficient 34.7 of MachHrs indicates the effect of MachHrs and the omitted ProdRuns on Overhead. But when we include both variables, the coefficient 43.5 of MachHrs indicates the effect of MachHrs only, holding ProdRuns constant. Since the coefficients have different meanings, it is not surprising that we obtain different estimates.
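
The statement that the single-variable coefficient mixes two effects can be made exact. For least squares fits with an intercept, the slope from regressing y on x1 alone equals the x1 coefficient from the two-variable fit plus the x2 coefficient times the slope from regressing x2 on x1. The sketch below checks this identity on made-up numbers; the data and the helper name `ols` are assumptions for illustration.

```python
import numpy as np

def ols(columns, y):
    """Least squares coefficients, intercept first, for the given explanatory columns."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Hypothetical data in which x1 and x2 happen to move in opposite directions.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([5.0, 6.0, 3.0, 4.0, 1.0, 2.0])
noise = np.array([0.5, -0.5, 0.2, -0.2, 0.1, -0.1])
y = 10.0 + 2.0 * x1 + 3.0 * x2 + noise

_, simple_b1 = ols([x1], y)               # x2 ignored
_, multi_b1, multi_b2 = ols([x1, x2], y)  # x2 held constant
_, d = ols([x1], x2)                      # how x2 moves with x1 in the sample

combined = multi_b1 + multi_b2 * d
print(f"simple slope = {simple_b1:.4f}, multi_b1 + multi_b2 * d = {combined:.4f}")
```

In this example the sample relationship between x1 and x2 is negative, so the single-variable slope is smaller than the two-variable coefficient, the same direction of change seen for MachHrs. Even a weak sample relationship between the explanatory variables is enough to shift the estimates.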

BANK.XLS The Fifth National Bank of Springfield is facing a gender-discrimination suit. The charge is that its female employees receive substantially smaller salaries than its male employees. The bank’s employee database is listed in this file. Here is a partial list of the data.

Variables For each of the 208 employees, the data set includes the following variables:
EducLev: education level, a categorical variable with categories 1 (finished high school), 2 (finished some college courses), 3 (obtained a bachelor’s degree), 4 (took some graduate courses), and 5 (obtained a graduate degree)
JobGrade: a categorical variable indicating the current job level, from 1 to 6 (6 is highest)
YrHired: year the employee was hired
YrBorn: year the employee was born
Gender: a categorical variable with values “Female” and “Male”

Variables -- continued
YrsPrior: number of years of work experience at another bank prior to working at Fifth National
PCJob: a dummy variable with value 1 if the employee’s current job is computer-related and value 0 otherwise
Salary: current annual salary in thousands of dollars
Do the data provide evidence that females are discriminated against in terms of salary?

Naïve Approach A naïve approach to the problem is to compare the average salaries of the males and females. The average of all salaries is $39,922, the average female salary is $37,210, and the average male salary is $45,505. The difference between the averages is statistically significant. The females are definitely earning less, but perhaps there is a reason. The question is whether the difference between the average salaries is still evident after taking other attributes into account. This is a perfect task for regression.
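
The naïve comparison is itself a small regression, which is one way to see why regression is the natural next step. With the averages quoted above, the gap is 45,505 - 37,210 = $8,295. The sketch below uses eight invented salaries (not the BANK.XLS records) to show that regressing salary on a 0/1 female indicator reproduces the difference in group means; the variable names are assumptions.

```python
import numpy as np

# Hypothetical salaries in thousands of dollars (NOT the bank's data).
salary = np.array([32.0, 39.1, 33.2, 30.6, 29.0, 45.5, 50.0, 38.0])
female = np.array([1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0])  # 1 = female, 0 = male

mean_female = float(salary[female == 1].mean())
mean_male = float(salary[female == 0].mean())
gap = mean_female - mean_male

# Regress salary on an intercept and the female indicator.
X = np.column_stack([np.ones(len(salary)), female])
coef = np.linalg.lstsq(X, salary, rcond=None)[0]
intercept, dummy_coef = float(coef[0]), float(coef[1])

print(f"gap in means = {gap:.3f}, coefficient on indicator = {dummy_coef:.3f}")
```

The intercept equals the male average and the indicator's coefficient equals the gap. Adding further columns for attributes such as education level or years of experience would then show how much of the gap remains once those attributes are held constant.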
