ie project final

Caleb Engelbourg
Stat 525
IE Project
The Gender Pay Gap: A Statistical Analysis
1. Introduction
1.1 Background
In the early 1960s, a large feminist movement in the United States promoted women in
the workplace. Legislation was passed making it illegal to discriminate on the basis
of gender. Despite the laws passed by Congress, there appears to have been a
discernible pay gap between men and women in the same position. We aim to determine
whether wage differences are attributable to gender alone or whether other
characteristics also play a role in the pay discrepancy.
The Equal Pay Act of 1963 requires that organizations pay men and women the
same amount for the same work. According to the law, the organization is required to
have equal pay across genders and cannot bypass the law by creating different job
titles for the same occupation. However, a pay discrepancy that has a legitimate
basis, such as merit or seniority, does not violate the law (DeNisi and Griffin 2014).
There have been many cases dealing with Equal Pay Act discrimination; one
interesting case is Stanley v. University of Southern California. In 1999, Marianne
Stanley, the coach of the women’s basketball team at USC, refused to take a salary
lower than that of the men’s coach. When the university would not match the salary,
she sued the school under the Equal Pay Act of 1963. In its ruling, the court
determined that the women’s head coaching job was not substantially the same as the
men’s head coaching job. The court determined that the men’s head coach had better
skills, qualifications, and experience. In addition, the men’s coach had greater
responsibility in speaking and fundraising engagements (Sharp, Moorman and Claussen,
2007). The court ruled in favor of the university and Stanley did not remain as the coach
of the women’s team. There have been similar cases in the United States where women
have brought up lawsuits regarding compensation issues in positions where they were
of the same status as men.
1.2 Data
To determine whether wage differences are attributable to gender alone or whether
other characteristics also play a role, we use data from the 1985 Current Population
Survey (CPS), consisting of a random sample of 534 people. Information was collected
on wages, sex, years of education, years of work experience, occupational category,
region of residence, marital status, union membership, age, race, and sector. These
data will be used to analyze the determining factors in pay discrepancies.
1.3 Goals and Hypotheses
Our main goal is to determine whether there is a pay difference between men and
women. We also want to see whether any other variables influence the wage of an
employee. We predict that there will be a significant gender pay gap, and that gender
will be the biggest factor in determining wage.
1.4 Model
We expect that our final model will be linear in the form shown below.
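In standard multiple regression notation, such a linear form can be sketched as follows. The symbols are assumptions introduced here for illustration: Y_i is the wage response for person i, X_{i1}, ..., X_{i,p-1} are the predictors, and p counts the parameters including the intercept.

```latex
Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_{p-1} X_{i,p-1} + \varepsilon_i
```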

2. Methods / Results
2.1 Dependent Variable Selection
When looking at the wage data, we noticed strong right skewness. We therefore
decided that a natural log transformation would be appropriate. As seen in the
summary charts below (Figure 1), the data have a more centered distribution after the
transformation, although the distribution still has long tails. Before the
transformation there are many outliers, with one very extreme outlier on the high end.
After the transformation there is one outlier on the high end and one on the low end,
but the distribution is roughly symmetric. Additionally, the log transformation allows
the regression coefficients to be interpreted directly as approximate percent changes
in wages.
Figure 1
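The effect of a log transformation on right-skewed data can be illustrated with a minimal sketch. The values below are synthetic log-normal draws generated for illustration, not the CPS wages, and the function name `skewness` is hypothetical.

```python
import math
import random


def skewness(values):
    """Moment-based sample skewness: third central moment over sd cubed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Synthetic right-skewed "wages" (assumption: log-normal), NOT the CPS data.
rng = random.Random(0)
wages = [rng.lognormvariate(2.0, 0.5) for _ in range(534)]
log_wages = [math.log(w) for w in wages]

skew_raw = skewness(wages)
skew_log = skewness(log_wages)
print(f"skewness before log: {skew_raw:.2f}")
print(f"skewness after log:  {skew_log:.2f}")
```

A skewness near zero after the transformation corresponds to the more symmetric histogram described above.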
2.2 Independent Variable Selection
After reviewing the 1985 CPS data, we decided to include seven independent
variables in our initial model: education, experience, sex, union, race, south, and
occupation. To select these variables, we used the best subset method, as
described below.
Since the goal of our project is to estimate the effect of gender on wages, we
made sure to include gender in every subset of variables we selected from. Gender did
not always appear in the best subset outputs, but our final model includes it as an
independent variable. Race did appear in our best subset, but we later dropped it
because we found it was not significant, as we explain in Section 2.3.
When selecting our best subset, we looked for a high adjusted R-squared, a
C(p) close to p, and low SBC and AIC values. When comparing best subset outputs, we
had to keep in mind that each categorical variable is represented by several indicator
variables; therefore, to select the occupation variable, we needed to put all five
occupation indicators into the model. The top subsets ranked by adjusted R-squared,
C(p), SBC, and AIC included only some of the occupation indicators, so we could not
choose the models with the best values of these criteria. The results for our best
subsets are shown below, with our model in the first row (Table 1).
Table 1
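The four criteria can be computed from a candidate model's error sum of squares. The sketch below uses the common textbook definitions, which may differ by constants from what a given SAS procedure prints. All numeric inputs are hypothetical placeholders, not values from Table 1, and the function name `subset_criteria` is an assumption.

```python
import math


def subset_criteria(sse, ssto, mse_full, n, p):
    """Criteria for a candidate model with p parameters (intercept included).

    sse: error sum of squares of the candidate model
    ssto: total sum of squares
    mse_full: mean squared error of the model containing all candidate predictors
    """
    adj_r2 = 1 - (n - 1) / (n - p) * sse / ssto
    cp = sse / mse_full - (n - 2 * p)          # Mallows' C(p)
    aic = n * math.log(sse) - n * math.log(n) + 2 * p
    sbc = n * math.log(sse) - n * math.log(n) + math.log(n) * p
    return {"adj_r2": adj_r2, "cp": cp, "aic": aic, "sbc": sbc}


# Hypothetical sums of squares for illustration only.
n = 534
ssto = 120.0
p_full = 11
sse_full = 80.0
mse_full = sse_full / (n - p_full)

full = subset_criteria(sse_full, ssto, mse_full, n, p_full)
reduced = subset_criteria(86.0, ssto, mse_full, n, 6)

for name, result in (("full", full), ("reduced", reduced)):
    print(name, {k: round(v, 3) for k, v in result.items()})
```

For the full model, C(p) equals p by construction, which is why a C(p) close to p is taken as a sign of little bias in a smaller candidate model.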

For each categorical variable, we needed to create n-1 indicator variables, where n
is the number of categories. Occupation was split into five separate variables:
Management, Sales, Clerical, Service, and Professional. For any other occupation, all
five of these variables equal zero. Race was separated into two variables, White and
Hispanic; any other race is indicated when both of these variables equal zero.
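The n-1 indicator coding described above can be sketched as follows. The function and constant names are hypothetical, and the label "Other" for the baseline occupation is an assumption.

```python
# Five indicator variables for six occupation categories;
# the baseline ("Other", an assumed label) has all indicators equal to zero.
OCCUPATION_LEVELS = ["Management", "Sales", "Clerical", "Service", "Professional"]


def occupation_indicators(occupation):
    """Return a dict of 0/1 indicators, one per non-baseline occupation."""
    return {level: int(occupation == level) for level in OCCUPATION_LEVELS}


sales_row = occupation_indicators("Sales")
other_row = occupation_indicators("Other")
print(sales_row)
print(other_row)
```

Leaving out one category avoids perfect collinearity with the intercept; each coefficient is then read as a difference from the omitted baseline.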
Overall, after our exploration of the independent variables, we kept education,
experience, sex, union, race, south, and occupation.
2.3 The Model
With our predictors selected, we ran the regression in SAS. We found that the
race variables were not significant, so we removed the two race indicator variables
from our final model. Removing race changed the coefficient for gender by less than
0.003 and our adjusted R-squared by less than 0.01, so we were comfortable with this
choice. We then checked interaction terms to see whether sex combined with other
variables to have a significant effect on wages, but no interaction term was
statistically significant in our model.
Our final model included the variables education, experience, sex, union, south,
and occupation as predictors. The fitted model is shown below:
Table 2 shows our F-value and corresponding p-value, along with the R-squared and
adjusted R-squared values. Table 3 shows our parameter estimates with their
standard errors, t-values, p-values, and a 95% confidence interval for each predictor
in our final model.
F-Value    Pr > F    R-Square    Adj R-Square
27.97      <.0001    0.3485      0.3360
Table 2
Table 3
2.4 Residual Interpretation
For our final model, the normal Q-Q plot (Figure 2) shows good linearity with a
couple of possible outliers in the tails. However, our plot of residuals versus
predicted values (Figure 3) may show increasing variance. We further investigated the
residuals against each predictor and identified education as the likely cause, because
the residuals plotted against education also showed potentially increasing variance
(Figure 4). In order to remedy the residual
plot, we tried different transformations, but they actually made the residual plot
for education worse, as evidenced by our log transformation (Figure 5). We therefore
decided to proceed with our model even though there is possible increasing variance
in our residual versus predicted values plot.
Figure 2 Figure 3
Figure 4 Figure 5
2.5 Model Validation
In our influential cases analysis, we found no influential cases using DFFITS or
Cook's distance. However, a total of 115 cases were flagged as influential by
DFBETAS. Since the cases flagged by DFBETAS made up roughly 22% of our 534
observations, and we found no influential cases with the other methods, we decided to
keep all of our data in our final model.
In our final model, we found that we did not have any multicollinearity issues. All
of our predictors have VIF values close to 1 as seen in Table 4. Our average VIF is
1.47, indicating no multicollinearity issues.
Table 4
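For reference, the variance inflation factor for predictor j is conventionally computed from R_j^2, the R-squared obtained by regressing that predictor on all of the others. As a worked illustration only, a predictor whose VIF equaled the reported average of 1.47 would have roughly 32% of its variation explained by the other predictors:

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}, \qquad R_j^2 = 1 - \frac{1}{\mathrm{VIF}_j} = 1 - \frac{1}{1.47} \approx 0.32
```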
2.6 Statistical Inference
For our final model, our F-value is 27.97 with a corresponding p-value very close
to 0, so we reject the null hypothesis that all the slope coefficients b(i)=0.
Therefore, we can conclude that there is a regression relationship. Additionally, our
final model has an R-squared value of 0.3485, meaning that 34.85% of the variation in
log wages can be explained by the regression model.
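As a consistency check, the reported F-value and adjusted R-squared can be recovered from the R-squared alone. This sketch assumes p = 11 parameters, inferred from the predictor list (intercept, education, experience, sex, union, south, and five occupation indicators) rather than stated in the output; the function names are hypothetical.

```python
def overall_f(r2, n, p):
    """Overall F statistic for H0: all slopes are zero (p includes the intercept)."""
    return (r2 / (p - 1)) / ((1 - r2) / (n - p))


def adjusted_r2(r2, n, p):
    """Adjusted R-squared from R-squared, sample size n, and p parameters."""
    return 1 - (1 - r2) * (n - 1) / (n - p)


n = 534       # sample size from the CPS extract
p = 11        # assumed parameter count, inferred from the predictor list
r2 = 0.3485   # reported R-squared

f_stat = overall_f(r2, n, p)
adj = adjusted_r2(r2, n, p)
print(f"F = {f_stat:.2f}, adjusted R-squared = {adj:.4f}")
```

Both quantities agree with the reported values up to rounding of the R-squared.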
Since our variable of interest is sex, we ran a t-test to confirm that it is
statistically significant. Our t-value is -4.99 with a corresponding p-value very close
to 0, so we reject the null hypothesis that b(1)=0. Additionally, we are 95% confident
that the true value of b(1) lies in the interval [-0.290, -0.126].
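The interval can be reproduced approximately from the coefficient and its t-value. The sketch below takes the estimate of -0.208 from the 20.8% figure reported for sex, backs out the standard error as estimate / t, and uses a normal critical value as an approximation to the t distribution with 523 degrees of freedom (an assumption; SAS would use the exact t quantile).

```python
from statistics import NormalDist

estimate = -0.208   # coefficient for sex, from the reported 20.8% gap
t_value = -4.99     # reported t statistic

# Standard error implied by t = estimate / standard error.
std_error = estimate / t_value

# Normal approximation to the 97.5th percentile of t with 523 df (assumption).
critical = NormalDist().inv_cdf(0.975)

lower = estimate - critical * std_error
upper = estimate + critical * std_error
print(f"95% CI: [{lower:.3f}, {upper:.3f}]")
```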
Because we used a log transformation on Y, we interpret each coefficient as
follows: when predictor X(j) increases by 1, average wages change by approximately
100*b(j)%, all else constant. For categorical predictors, 100*b(j) represents the
approximate average percent difference in wages between that category and the
baseline category. Therefore, for our variable of interest, sex, we conclude that
women make 20.8% less on average than men, all else equal. Additionally, we are
95% confident that the true average percent difference between women’s and men’s
wages is in the interval [-29.0%, -12.6%].
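The 100*b(j)% reading is an approximation that works best for small coefficients. A minimal sketch of the exact conversion, 100*(exp(b) - 1), is given below for comparison; the function name is hypothetical and the inputs are the coefficient and interval endpoints reported for sex.

```python
import math


def exact_percent_change(b):
    """Exact percent change in Y implied by a coefficient b in a log-Y model."""
    return 100 * (math.exp(b) - 1)


approx_gap = 100 * -0.208                   # the 100*b approximation
exact_gap = exact_percent_change(-0.208)    # exact conversion of the estimate
exact_lower = exact_percent_change(-0.290)  # exact conversion of the CI endpoints
exact_upper = exact_percent_change(-0.126)

print(f"approximate: {approx_gap:.1f}%, exact: {exact_gap:.1f}%")
print(f"exact interval: [{exact_lower:.1f}%, {exact_upper:.1f}%]")
```

Under the exact conversion the estimated gap is somewhat smaller in magnitude than 20.8%, so the percentages quoted in this report are best read as approximations.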
3. Conclusion
Based on the 1985 Current Population Survey data, we found a significant
difference in the wages of men and women. Women make 20.8% less on average than
men, all else equal. The pay gap between genders was the largest effect among our
categorical predictors.
Although the wage gap for gender was the largest, additional factors also explain
wage differences. An occupation in management is associated with an average 20.5%
increase in wages, while an occupation in service is associated with an average
decrease of 20%. In addition, membership in a union is associated with an average
20.6% increase in wages. Two other noteworthy factors affect both genders: for every
additional year of experience, average wages increase by 1%, and an additional year
of education increases average wages by 6.9%.
Although the Equal Pay Act was passed in 1963, the legislation did not end the
gender pay gap. Our results show that in 1985 there was still a significant pay gap
between the genders, with women making 20.8% less on average than men. This problem
is ongoing, as legislators are still attempting to accomplish the goal of the
original Equal Pay Act of 1963.
One such recent law is the Lilly Ledbetter Fair Pay Act of 2009. Lilly Ledbetter
was a production supervisor at Goodyear who was paid 40% less than her lowest-paid
male counterpart. Her case made it all the way to the Supreme Court, where Goodyear
argued, “Ms. Ledbetter had been discriminated against BUT that the discrimination took
place more than 180 days before the charges were filed. Thus, the case could not be
raised because there was a 180 day limitation as part of the law” (DeNisi and Griffin).
The Supreme Court agreed with Goodyear’s defense and took away the money
awarded to her at lower court levels.
This ruling exposed a flaw in the existing legislation, so in 2009 President
Obama signed the Lilly Ledbetter Fair Pay Act into law. The new law states that the
180-day statute of limitations restarts with each paycheck that the employee
receives, allowing damages to be paid long after the discrimination first occurs
(DeNisi and Griffin). This provided a much-needed update to the Equal Pay Act and
will hopefully help to influence the payment practices of employers.
The gender pay gap is an ongoing issue in the United States, with legislation
struggling to change actual practices of employers. Data analysis of the 1985 CPS
reinforces this point, as women were still paid 20.8% less on average than men 22
years after the Equal Pay Act was passed. Further research into more current data is
necessary to see if this trend is a continuing issue in the United States.

Works Cited
DeNisi, Angelo S., and Ricky W. Griffin. HR2. Mason: South-Western Cengage
Learning, 2014. Print.
Sharp, Linda A., Anita M. Moorman, and Cathryn L. Claussen. Sport Law: A Managerial
Approach. Scottsdale, AZ: Holcomb Hathaway, 2007. Print.
