Lecture7b Applied Econometrics and Economic Modeling

Dummy Variables Some potential explanatory variables are categorical and cannot be measured on a quantitative scale. However, we often need to use these variables because they are related to the response variable. The trick is to create dummy variables, also called indicator or 0-1 variables. These are variables that indicate the category a given observation is in.

Dummy Variables -- continued To create dummy variables we can use an IF statement or we can use StatPro’s Dummy variable procedure. The Dummy variable procedure is usually easier, particularly when there are multiple categories. Once the dummy variables are created, we can combine categories if we like by simply adding the corresponding dummy columns to get the dummy for the new, combined category.
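As a rough illustration of the IF-statement approach outside Excel, the sketch below builds 0-1 dummies in Python and then combines two of them by adding the columns. The sample values and the helper name make_dummy are hypothetical; they are not taken from the bank data or from StatPro.

```python
# Hypothetical category values for five observations (not the actual data set).
educ_lev = [1, 3, 2, 5, 3]

def make_dummy(values, category):
    """Return a 0-1 column: 1 where the observation is in `category`, else 0."""
    return [1 if v == category else 0 for v in values]

ed_2 = make_dummy(educ_lev, 2)
ed_3 = make_dummy(educ_lev, 3)

# Adding two dummy columns gives the dummy for the combined category "2 or 3".
combined = [a + b for a, b in zip(ed_2, ed_3)]
print(ed_2, ed_3, combined)
```

Because an observation belongs to only one category, the summed column still contains only 0s and 1s.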

Regression Analysis In this example we create dummy variables for Gender and EducLev. Then we can run a regression analysis with Salary as the response variable, using any combination of numerical and dummy explanatory variables. We must follow two rules: (1) We shouldn’t use any of the original categorical variables that the dummies are based on. (2) We should use one less dummy than the number of categories for any categorical variable.

Regression Analysis -- continued This second rule is a technical one. If we violate it, the software will give us an error message. For example, of the six education dummies Ed_1-Ed_6, any five can be used. The omitted dummy then corresponds to the reference category. As we will see, the dummy variable coefficients are all interpreted relative to this reference category. To get used to dummy variables in regression analysis, we will proceed in several stages.
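The reason behind the one-less rule can be seen in a small sketch: if all the dummies for a categorical variable are included, they sum to 1 in every row, which duplicates the constant term of the regression. The education levels below are hypothetical and serve only to show the pattern.

```python
educ_lev = [1, 6, 2, 4, 3, 5, 2]   # hypothetical education levels 1..6
categories = range(1, 7)

# One row of six dummies (Ed_1..Ed_6) per observation.
rows = [[1 if level == c else 0 for c in categories] for level in educ_lev]

# Every row sums to exactly 1, the same as the intercept column, so
# using all six dummies plus an intercept is perfectly collinear.
row_sums = [sum(r) for r in rows]
print(row_sums)

# Dropping one dummy (here Ed_1, the reference category) removes the redundancy:
# the remaining dummies sum to 0 for reference-category rows and 1 otherwise.
reduced = [r[1:] for r in rows]
reduced_sums = [sum(r) for r in reduced]
print(reduced_sums)
```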

Regression Analysis -- continued We first estimate a regression equation with only one explanatory variable, the Female dummy. The output is shown in this table. The resulting equation is Predicted Salary = 45.505 - 8.296Female

Regression Analysis -- continued To interpret this equation, recall that Female has only two possible values, 0 and 1. If we substitute 1, the predicted salary equals 37.209, and if we substitute 0, the predicted salary is 45.505. These are the average salaries of females and males, respectively. Therefore the interpretation of the coefficient of the Female dummy variable, -8.296 (the difference 37.209 - 45.505), is straightforward: it is the difference between the average female salary and the average male salary.
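The relationship between the two group averages and the dummy coefficient can be checked with a few lines of arithmetic. This is only a sketch using the two predicted values quoted above; the function name predicted_salary is an illustrative choice, and salaries appear to be in thousands of dollars, judging from the dollar interpretations given later.

```python
male_mean = 45.505    # predicted salary when Female = 0
female_mean = 37.209  # predicted salary when Female = 1

# With a single dummy, the intercept is the reference-group mean and the
# dummy coefficient is the difference between the two group means.
intercept = male_mean
female_coef = female_mean - male_mean

def predicted_salary(female):
    """Evaluate the one-dummy equation for Female = 0 or 1."""
    return intercept + female_coef * female

print(round(female_coef, 3))
print(predicted_salary(0), round(predicted_salary(1), 3))
```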

Regression Analysis -- continued The above equation tells only part of the story: it ignores all information except gender. We expand this equation by adding the experience variables. The output is shown in this table.

Regression Analysis -- continued The corresponding equation is Predicted Salary = 35.492 + 0.988YrsExper + 0.131YrsPrior - 8.080Female. It is useful to write two separate equations, one for females and one for males. For females (Female = 1): Predicted Salary = 27.412 + 0.988YrsExper + 0.131YrsPrior. For males (Female = 0): Predicted Salary = 35.492 + 0.988YrsExper + 0.131YrsPrior. We interpret the coefficient -8.080 of the Female dummy variable as the average salary disadvantage for females relative to males after controlling for job experience. But there is still more story to tell.
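The split into a female and a male equation can be reproduced by evaluating the combined equation with Female set to 1 or 0. The sketch below uses the coefficients quoted above; the experience values 10 and 2 are arbitrary illustrative inputs.

```python
def predicted_salary(yrs_exper, yrs_prior, female):
    """Combined equation with the Female dummy (1 = female, 0 = male)."""
    return 35.492 + 0.988 * yrs_exper + 0.131 * yrs_prior - 8.080 * female

# Intercepts of the two separate equations.
female_intercept = predicted_salary(0, 0, 1)
male_intercept = predicted_salary(0, 0, 0)

# For any fixed experience levels, the female-male gap is the dummy coefficient.
gap = predicted_salary(10, 2, 1) - predicted_salary(10, 2, 0)
print(round(female_intercept, 3), male_intercept, round(gap, 3))
```

The two lines are parallel: the dummy shifts the intercept but leaves the experience slopes unchanged.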

Regression Analysis -- continued We next add job grade to the equation by including five of the six job grade dummies. Although any five can be used, we use Job_2-Job_6. The resulting output is shown in this table.

Regression Analysis -- continued The estimated regression equation is now Predicted Salary = 30.230 + 0.408YrsExper + 0.149YrsPrior - 1.962Female + 2.57Job_2 + 6.295Job_3 + 10.475Job_4 + 16.011Job_5 + 27.647Job_6. There are now two categorical variables involved, gender and job grade. However, we can still write a separate equation for any combination of categories by setting the dummies to the appropriate values.

Regression Analysis -- continued For example, the equation for females at the fifth job grade is found by setting Female=1 and Job_5=1 and setting the other job dummies equal to 0. The resulting equation is Predicted Salary = 44.279 + 0.408YrsExper + 0.149YrsPrior. We interpret this equation as follows: For either gender and any job grade, the expected increase in salary for one extra year of experience with Fifth National is $408; the expected salary increase for one extra year of experience with another bank is $149.
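Setting the dummies for a given combination of categories amounts to adding the relevant coefficients to the base intercept. A minimal sketch, using the coefficients quoted in the estimated equation (the dictionary layout and function name are illustrative choices):

```python
BASE_INTERCEPT = 30.230
FEMALE_COEF = -1.962
# Job grade 1 is the omitted reference category, so its coefficient is 0.
JOB_COEF = {1: 0.0, 2: 2.57, 3: 6.295, 4: 10.475, 5: 16.011, 6: 27.647}

def intercept_for(female, job_grade):
    """Intercept of the salary equation for a gender/job-grade combination."""
    return BASE_INTERCEPT + FEMALE_COEF * female + JOB_COEF[job_grade]

# Females at the fifth job grade: 30.230 - 1.962 + 16.011.
print(round(intercept_for(1, 5), 3))
# Males at the reference job grade keep the base intercept.
print(intercept_for(0, 1))
```

The slopes on YrsExper and YrsPrior are the same for every combination; only the intercept changes.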

Regression Analysis -- continued The coefficients of the job dummies indicate the average increase in salary an employee can expect relative to the reference (lowest) job grade. The key coefficient, the negative $1962 for females, indicates the average salary disadvantage for females relative to males, given that they have the same experience levels and are in the same job grade. Although the “penalty” is still substantial, it is less than a fourth of the penalty we saw before. It appears that females might be getting paid less on average partly because they are in the lower job categories.

Regression Analysis -- continued We can check whether females are disproportionately in the lower job categories by using a pivot table with JobGrade in the row area, Gender in the column area and the count (expressed as a percentage) of any variable in the data area.

Regression Analysis -- continued Clearly, females tend to be concentrated at the lower job grades. This certainly helps to explain why females get lower salaries on average, but it doesn’t explain why females are at the lower job grades in the first place. We won’t be able to provide a thorough analysis of this issue but we can add one more piece to the puzzle now by adding education level, age, and PCJob to the equation.

Regression Analysis -- continued We don’t provide the whole equation but the resulting output is shown here.

Regression Analysis -- continued The coefficients can be seen in the output. The new variables don’t appear to add much to the previous equation. The “penalty” does, however, go up to $2555, which is somewhat greater than the $1962 found before. At face value, we can interpret the coefficients of the education dummies as the benefit (or loss, if negative) of extra education relative to a high school diploma, the reference category.

Regression Analysis -- continued The coefficient of PCJob implies that an employee with a computer-related job can expect an extra $4923 in salary relative to an employee without a computer-related job, provided the other variables are the same for each employee. The age coefficient is quite small and has little effect on salary.

Conclusion The main conclusion we can draw from the output is that there is still a plausible case to be made for discrimination against females, even after including information on all the variables in the database in the regression equation.

BANK.XLS The Fifth National Bank of Springfield is facing a gender-discrimination suit. The charge is that its female employees receive substantially smaller salaries than its male employees. The bank’s employee database is listed in this file. Here is a partial list of the data.

Question Earlier we estimated an equation for Salary using the numerical explanatory variables YrsExper and YrsPrior and the dummy variable Female. If we drop the YrsPrior variable from the equation (for simplicity) and rerun the regression, we obtain the equation Predicted Salary = 35.824 + 0.981YrsExper - 8.012Female. The R² value for this equation is 49.1%. If we decide to include an interaction variable between YrsExper and Female in this equation, what is the effect?

Interaction Terms An interaction variable algebraically is the product of two variables. Its effect is to allow the effect of one of the variables on Y to depend on the value of the other variable. The interaction term allows the slope of the regression line to differ between the two categories.

Solution We first need to form an interaction variable that is the product of YrsExper and Female. This can be done in two ways in Excel. We can do it manually by introducing a new variable that contains the product of the two variables involved, or we can use the StatPro/Data Utilities/Create Interaction Variable menu item. With the latter approach we select Female and YrsExper as the variables, and we do not check either of the boxes in the dialog box -- neither should be treated as a categorical variable.

Solution -- continued Once the interaction variable has been created, we include it in the regression equation in addition to the other variables. The multiple regression output is shown here.

Solution -- continued The estimated regression equation is Predicted Salary = 30.430 + 1.528YrsExper + 4.098Female - 1.248YrsExper_Female. As we discussed before, it is useful to write this equation as two separate equations, one for females and one for males. The female equation is Predicted Salary = (30.430 + 4.098) + (1.528 - 1.248)YrsExper = 34.528 + 0.280YrsExper, and the male equation is Predicted Salary = 30.430 + 1.528YrsExper. Next we can show these equations graphically.

Nonparallel Female and Male Salary Lines

Solution -- continued The Y-intercept for the female line is slightly higher - females with no experience at Fifth National Bank tend to start out slightly higher than males - but the slope of the female line is much lower. That is, males tend to move up the salary ladder much more quickly than females. Again, this provides another argument, although a somewhat different one, for gender discrimination against females. The R² value increased from 49.1% to 63.9%. The interaction variable has definitely added to the explanatory power of the equation.
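To see how quickly the steeper male line overtakes the female line, we can compute the point where the two lines cross. This is a sketch based only on the female and male equations quoted earlier; the 10-year comparison is an arbitrary illustrative input.

```python
# (intercept, slope on YrsExper) for each fitted line.
female_line = (34.528, 0.280)
male_line = (30.430, 1.528)

def predict(line, yrs_exper):
    """Predicted salary on one of the two lines."""
    intercept, slope = line
    return intercept + slope * yrs_exper

# The lines cross where the female head start in intercept is used up
# by the male advantage in slope.
crossover_years = (female_line[0] - male_line[0]) / (male_line[1] - female_line[1])
print(round(crossover_years, 2))

# After 10 years the male prediction is well above the female prediction.
print(round(predict(male_line, 10), 3), round(predict(female_line, 10), 3))
```

Under these equations the female advantage disappears after roughly three years of experience.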

Question A glance at the distribution of salaries of the 208 employees shows some skewness to the right - a few employees make substantially more than the majority of employees. Therefore, it might make sense to use the natural logarithm of Salary instead of Salary as the response variable. If we do this, how do we interpret the results?

Solution All of the analyses we did earlier with this data set could be repeated except with Log_Salary as the response variable. For the sake of discussion we will look only at the regression equation with Female and YrsExper as explanatory variables. After we create the Log_Salary variable and run the regression, we obtain the output shown here.

Regression Output with Log_Salary as Response Variable

Solution The estimated regression equation is Predicted Log_Salary = 3.5829 + 0.0188YrsExper - 0.1616Female. The R² and s_e values are 42.4% and 0.1794. For comparison, with Salary as the response these were 49.1% and 8.070. We first note that neither of these values is directly comparable to the corresponding Salary value. The two R² values are percentages explained of different response variables, Log_Salary and Salary. The fact that one is smaller does not mean a “worse” fit. They simply aren’t comparable.

Solution -- continued The situation for s_e is even worse. Each s_e is a measure of a typical residual, but the residuals in the Log_Salary equation are in log dollars, whereas the residuals in the Salary equation are in dollars. Therefore it is no surprise that the s_e for the Log_Salary equation is much smaller than the s_e for the Salary equation. If we want comparable standard error measures for the two equations, we should take antilogs of the fitted values from the Log_Salary equation to convert them back to dollars, subtract these from the original Salary values, and take the standard deviation of these residuals.
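The conversion described here can be sketched as a small function. The function name and the three sample observations are hypothetical and chosen only to show the mechanics; they are not values from the bank data.

```python
import math
import statistics

def dollar_scale_std_error(salaries, fitted_logs):
    """Standard deviation of residuals after converting log fits back to dollars."""
    fitted_dollars = [math.exp(f) for f in fitted_logs]            # antilogs
    residuals = [y - f for y, f in zip(salaries, fitted_dollars)]  # actual - fitted
    # Plain sample standard deviation, as described in the text.
    return statistics.stdev(residuals)

# Hypothetical inputs: actual salaries and fitted Log_Salary values.
salaries = [30.0, 40.0, 50.0]
fitted_logs = [math.log(31.0), math.log(38.0), math.log(52.0)]
result = dollar_scale_std_error(salaries, fitted_logs)
print(round(result, 3))
```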

Solution -- continued The resulting standard deviation is 7.74. This is somewhat smaller than the s_e from the Salary equation, an indication of a slightly better fit. Finally we interpret the equation itself. When the response variable is Log_Y and a term on the right-hand side of the equation is of the form bX, then whenever X increases by one unit, Y-hat changes by a constant percentage, and this percentage is approximately equal to b (written as a percentage).

Solution -- continued This means that for each year of experience with Fifth National, an employee’s salary can be expected to increase by about 1.88%. For the Female dummy, the expected percentage decrease in salary is about 16.16%. In other words, this equation implies that females can expect to make about 16% less than men for comparable years of experience.
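The percentage interpretation is an approximation that works best for small coefficients. The exact percentage change implied by a coefficient b in a log-response equation is exp(b) - 1, which the sketch below compares with the approximation for the two coefficients above. The function name is an illustrative choice.

```python
import math

def exact_pct_change(b):
    """Exact percentage change in Y-hat for a one-unit increase in X."""
    return (math.exp(b) - 1.0) * 100.0

for name, b in [("YrsExper", 0.0188), ("Female", -0.1616)]:
    approx = b * 100.0
    print(name, round(approx, 2), round(exact_pct_change(b), 2))
```

For YrsExper the two figures nearly coincide. For the larger Female coefficient the exact figure is closer to a 15% decrease than to 16%.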

POWER.XLS The Public Service Electric Company produces different quantities of electricity each month, depending on the demand. This file lists the number of units of electricity produced (Units) and the total cost of producing these (Cost) for a 36-month period. The data set appears on the next slide. How can regression be used to analyze the relationship between Cost and Units?

Solution A good place to start is with a scatterplot of Cost versus Units.

Solution -- continued The scatterplot indicates a definite positive relationship and one that is nearly linear. However, there is also some evidence of curvature in the plot. The points increase slightly less rapidly as Units increases from left to right. In economic terms, there may be economies of scale, where the marginal cost of electricity decreases as more units of electricity are produced. Nevertheless, we first use regression to estimate a linear relationship between Cost and Units.

Solution -- continued The resulting regression equation is Predicted Cost = 23,651 + 30.53Units. The corresponding R² and s_e are 73.6% and $2734. We also requested a scatterplot of the residuals versus the fitted values. The scatterplot is on the next slide. Obtaining this scatterplot is always a good idea if nonlinearity is suspected. The sign of nonlinearity in this plot is that the residuals to the far left and the far right are all negative, whereas the majority of the residuals in the middle are positive.

Residuals from a Straight-Line Fit

Solution -- continued Admittedly the pattern is far from perfect - there are a few negatives in the middle - but the plot does hint at nonlinear behavior. The negative-positive-negative behavior of the residuals suggests a parabola; that is, a quadratic equation with the square of Units included in the equation. We first create a new variable Sqr_Units in the data set. This can be done manually or using StatPro’s Transform Variables menu item.

Solution -- continued Then we use multiple regression to estimate the equation for Cost with both explanatory variables, Units and Sqr_Units, included. The resulting equation from the output on the next slide is Predicted Cost = 5793 + 98.3Units - 0.0600Sqr_Units. Note that R² has increased to 82.2% and s_e has decreased to $2281.

Regression Output with Squared Term Included

Solution -- continued One way to see how this regression equation fits the scatterplot of Cost versus Units is to use Excel’s trendline option. To do so, activate the scatterplot, click on any point, use the Chart/Add Trendline menu item, click the Type tab, and select the Polynomial type of order 2, that is, a quadratic. A graph of the equation is superimposed on the scatterplot on the following slide. It shows a reasonably good fit, plus an obvious curvature.

Solution -- continued The main downside to a quadratic regression equation is that there is no easy interpretation of the coefficients of Units and Sqr_Units. All we can say is that the terms in the equation combine to explain the nonlinear relationship between units produced and total cost. A final note about the equation concerns the coefficient of Sqr_Units. First, the fact that it is negative makes the parabola bend downward. This produces the decreasing marginal cost behavior, where every extra unit of electricity incurs a smaller additional cost.
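The decreasing marginal cost can be made explicit by differentiating the fitted quadratic with respect to Units, which gives a marginal cost of 98.3 - 2(0.0600)Units. The sketch below evaluates it at two production levels; the levels 400 and 600 are hypothetical and may not match the range of the actual data.

```python
B_UNITS = 98.3      # coefficient of Units
B_SQR = -0.0600     # coefficient of Sqr_Units

def predicted_cost(units):
    """Fitted quadratic cost equation."""
    return 5793 + B_UNITS * units + B_SQR * units ** 2

def marginal_cost(units):
    """Derivative of the fitted quadratic with respect to Units."""
    return B_UNITS + 2 * B_SQR * units

for units in (400, 600):   # hypothetical production levels
    print(units, round(marginal_cost(units), 2))
```

The derivative falls by 0.12 for each additional unit, so the estimated cost of one more unit is lower at higher production levels.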

Solution -- continued Second, we shouldn’t be fooled by the small magnitude of this coefficient. Remember that it is the coefficient of Units squared, which is a large quantity. Therefore, the effect of the product -0.0600Sqr_Units is sizable. One other possibility we might examine is a logarithmic fit. In this case we create a new variable Log_Units, the natural logarithm of Units, and then regress Cost against the single variable Log_Units.

Solution -- continued To create the new variable we can again use StatPro’s Transform Variables menu item. We can then superimpose a logarithmic curve on the scatterplot of Cost versus Units by using the trendline feature. This curve appears in the scatterplot on the next slide. To the naked eye, it appears to be similar to the quadratic curve and about as good a fit.

Solution -- continued The resulting regression equation is Predicted Cost = -63,993 + 16,654Log_Units. The values of R² and s_e are 79.8% and $2393. These values indicate that the logarithmic fit is not quite as good as the quadratic fit. However, the advantage of the logarithmic equation is that it is easier to interpret.

Solution -- continued In this case, where the log of the explanatory variable is used, we can interpret its coefficient as follows. Suppose Units increases by 1%, for example from 600 to 606. Then the equation implies that the expected Cost will increase approximately $166.54. In words, every 1% increase in Units is accompanied by an expected $166.54 increase in Cost. Note that for larger values of Units, a 1% increase represents a larger absolute increase. But each such 1% increase entails the same increase in Cost. This is another way of describing the decreasing marginal cost property.
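The 1% rule can be checked directly from the fitted equation. The sketch below computes the change in predicted Cost when Units rises by 1% from two different starting levels; the level 600 comes from the example above, while 1000 is an arbitrary second level.

```python
import math

def predicted_cost(units):
    """Fitted logarithmic cost equation (natural log of Units)."""
    return -63993 + 16654 * math.log(units)

def cost_change_for_one_pct(units):
    """Change in predicted Cost when Units increases by 1%."""
    return predicted_cost(units * 1.01) - predicted_cost(units)

print(round(cost_change_for_one_pct(600), 2))
print(round(cost_change_for_one_pct(1000), 2))
```

Both changes equal 16,654 ln(1.01), about $165.71, which is close to the $166.54 obtained from the coefficient-divided-by-100 approximation and does not depend on the starting level.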

CARDEMAND.XLS This file contains annual data (1970-1987) on domestic auto sales in the United States. The data set is shown on the next slide. The variables are defined as follows. Quantity: annual domestic auto sales (in number of units); Price: real price index of new cars; Income: real disposable income; Interest: prime rate of interest. Estimate and interpret a multiplicative (constant elasticity) relationship between Quantity and Price, Income, and Interest.

Constant Elasticity Relationships A particular type of nonlinear relationship that has firm grounding in economic theory is called a constant elasticity relationship. It is also called a multiplicative relationship. One property of this type of relationship is that the effect of a change in any explanatory variable X_i on Y depends on the levels of the other X’s in the equation.

Solution We first take the natural logs of all four variables. This can be done in one step using the Transform Variables menu item, or we can use Excel’s LN function. We then use multiple regression, with Log_Quantity as the response variable and Log_Price, Log_Income, and Log_Interest as the explanatory variables. The resulting output is shown on the next slide, and the corresponding equation is Predicted Log_Quantity = 4.675 - 1.185Log_Price + 2.183Log_Income - 0.191Log_Interest

Regression Output for Multiplicative Relationship

Solution -- continued If we like, we can convert this back to the original variables, that is, back to multiplicative form, by taking antilogs. The result is Predicted Quantity = 107.198 × Price^(-1.185) × Income^(2.183) × Interest^(-0.191). In either form the equation implies that the elasticities are approximately equal to -1.185, 2.183, and -0.191. When Price increases by 1%, Quantity tends to decrease by about 1.185%; when Income increases by 1%, Quantity tends to increase by about 2.183%; and when Interest increases by 1%, Quantity tends to decrease by about 0.191%.
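The constant-elasticity property can be verified numerically from the multiplicative form. In the sketch below the input levels for Price, Income, and Interest are hypothetical and do not come from the data set; the point is that a 1% price increase changes predicted Quantity by the same percentage at any levels.

```python
import math

def predicted_quantity(price, income, interest):
    """Multiplicative (constant elasticity) form of the fitted equation."""
    return 107.198 * price ** -1.185 * income ** 2.183 * interest ** -0.191

def pct_change_from_price_rise(price, income, interest):
    """Percentage change in predicted Quantity when Price rises by 1%."""
    before = predicted_quantity(price, income, interest)
    after = predicted_quantity(price * 1.01, income, interest)
    return (after / before - 1.0) * 100.0

# Two hypothetical sets of levels give the same percentage change.
pct_a = pct_change_from_price_rise(100.0, 50.0, 8.0)
pct_b = pct_change_from_price_rise(150.0, 80.0, 12.0)
print(round(pct_a, 3), round(pct_b, 3))

# The constant 107.198 is the antilog of the intercept 4.675, up to rounding.
print(round(math.exp(4.675), 2))
```

The change is about -1.17%, close to the elasticity of -1.185, which is exact only for very small percentage changes.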

Conclusions Does this multiplicative equation provide a better fit to the automobile data than does an additive relationship? Without doing considerably more work, it is difficult to answer this question with certainty. As we discussed previously, it is not sufficient to compare R² and s_e values for the two fits. We will simply state that the multiplicative relationship provides a reasonably good fit, and it makes sense economically.

LEARNING.XLS The Presario Company produces a variety of small industrial products. It has just finished producing 22 batches of a new product (new to Presario) for a customer. This file contains the times (in hours) to produce each batch. These data are in the table on the next slide. Clearly, the times have tended to decrease as Presario has gained more experience in making the product.

Data for Learning Curve Does the multiplicative learning model apply to these data, and what does it imply about the learning rate?

Learning Curve Model A final example of a multiplicative relationship is the learning curve model. A learning curve relates the unit production time (or cost) to the cumulative volume of output since that production process first began. Empirical studies indicate that production times tend to decrease by a relatively constant percentage every time cumulative output doubles. The constant percentage is called the learning rate.

Solution One way to check whether the multiplicative learning model is reasonable is to create the log variables Log_Time and Log_Batch in the usual way and then see whether a scatterplot of Log_Time versus Log_Batch is approximately linear. The multiplicative model implies that it should be. Such a scatterplot is shown on the next slide, along with a superimposed linear trend line. The fit appears to be quite good.

Scatterplot of Log Variables with Linear Trend Superimposed

Solution -- continued To estimate the relationship, we regress Log_Time on Log_Batch. The resulting equation is Predicted Log_Time = 4.834 - 0.155Log_Batch. There are a couple of ways of interpreting this equation. First, because it is based on a multiplicative relationship, we can interpret the coefficient -0.155 as an elasticity. That is, when Batch increases by 1%, Time tends to decrease by approximately 0.155%. Although this is correct, it is not as useful as the “doubling” interpretation.

Solution -- continued We know that the estimated learning rate satisfies -0.155 = ln(learning rate)/ln(2). Solving for the learning rate (multiplying through by ln(2) and then taking antilogs), we find that it is 0.898, or approximately 90%. In other words, whenever cumulative production doubles, the time to produce a batch decreases by about 10%.
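The learning-rate calculation takes only a couple of lines. The sketch below solves the relation above for the learning rate and confirms that it equals 2 raised to the slope; the variable names are illustrative.

```python
import math

slope = -0.155  # coefficient of Log_Batch in the fitted equation

# Multiply through by ln(2), then take antilogs.
learning_rate = math.exp(slope * math.log(2))

# Equivalent form: doubling Batch multiplies Time by 2 ** slope.
doubling_factor = 2 ** slope

print(round(learning_rate, 3), round(doubling_factor, 3))
print(round((1 - learning_rate) * 100, 1))  # percent decrease per doubling
```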

Predicting Future Production Times Presario could use this regression equation to predict future production times. For example, suppose the customer places an order for 15 more batches of the same product. We can use the equation to predict the log of production time for each batch, then take their antilogs and sum them to obtain the total production time. The calculations are shown in rows 26-42 of the following table. The total predicted time to finish is about 1115 hours.
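The prediction steps can be sketched as follows. The sketch assumes the 15 new batches are numbered 23 through 37, following the 22 batches already produced; that numbering is an assumption, since the table itself is not reproduced here.

```python
import math

INTERCEPT = 4.834
SLOPE = -0.155

def predicted_time(batch):
    """Predicted hours for a batch: antilog of the fitted Log_Time."""
    log_time = INTERCEPT + SLOPE * math.log(batch)
    return math.exp(log_time)

# Assumed batch numbers for the 15 additional batches.
new_batches = range(23, 38)
total_hours = sum(predicted_time(b) for b in new_batches)
print(round(total_hours))
```

With the rounded coefficients above, the total comes out close to the 1115 hours reported.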

Using the Learning Curve Model for Predictions
