Week 4 Lecture 12
Significance
Earlier we discussed correlations without going into how we can
identify statistically
significant values. Our approach to this uses the t-test.
Unfortunately, Excel does not
automatically produce this form of the t-test, but setting it up
within an Excel cell is fairly easy.
And, with some slight algebra, we can determine the minimum
value that is statistically
significant for any table of correlations all of which have the
same number of pairs (for example,
a Correlation table for our data set would use 50 pairs of values,
since we have 50 members in
our sample).
The t-test formula for a correlation (r) is t = r * sqrt(n - 2) / sqrt(1 - r^2); the associated degrees
of freedom are n-2 (number of pairs – 2) (Lind, Marchel, &
Wathen, 2008). For some this might
look a bit off-putting, but remember that we can translate this
into Excel cells and functions and
have Excel do the arithmetic for us.
Excel Example
If we go back to our correlation table for Salary, Midpoint, Age, Perf Rat, Service, and Raise, we have:
Using Excel to create the formula and cell numbers for our key
values allows us to
quickly create a result. The T.DIST.2T function gives us a p-value easily.
The formula to use in finding the minimum correlation value
that is statistically
significant is r = sqrt(t^2/(t^2 + n-2)). We would find the
appropriate t value by using the
T.INV.2T(alpha, df) function with alpha = 0.05 and df = n - 2 = 48. Plugging these values into the function gives us a t-value of 2.0106, or 2.011 (rounded).
Putting 2.011 and 48 (n - 2) into our formula gives us an r value of 0.278; therefore, in a correlation table based on 50 pairs, any correlation with an absolute value greater than or equal to 0.278 would be statistically significant.
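For those who want to check these numbers outside of Excel, the same arithmetic can be sketched in Python with scipy (a sketch only; the sample size, alpha, and the example correlation of 0.544 come from this lecture's data):

from scipy import stats

n = 50
alpha = 0.05
df = n - 2                                      # degrees of freedom = 48

# Two-tailed critical t, the equivalent of Excel's T.INV.2T(0.05, 48)
t_crit = stats.t.ppf(1 - alpha / 2, df)         # about 2.0106

# Minimum statistically significant correlation: r = sqrt(t^2/(t^2 + n - 2))
r_min = (t_crit**2 / (t_crit**2 + df)) ** 0.5   # about 0.278

# p-value for an observed correlation r, the equivalent of T.DIST.2T
r = 0.544                                       # e.g., the Age-Salary correlation
t_stat = r * df**0.5 / (1 - r**2) ** 0.5
p_value = 2 * stats.t.sf(abs(t_stat), df)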
Technical Point. If you are interested in how we obtained the
formula for determining
the minimum r value, the approach is shown below. If you are
not interested in the math, you
can safely skip this paragraph.
t = r * sqrt(n - 2) / sqrt(1 - r^2)
Multiplying both sides by sqrt(1 - r^2) gives us: t * sqrt(1 - r^2) = r * sqrt(n - 2)
Squaring gives us: t^2 * (1 - r^2) = r^2 * (n - 2)
Multiplying out gives us: t^2 - t^2 * r^2 = n * r^2 - 2 * r^2
Adding t^2 * r^2 to both sides gives us: t^2 = n * r^2 - 2 * r^2 + t^2 * r^2
Factoring gives us: t^2 = r^2 * (n - 2 + t^2)
Dividing gives us: t^2 / (n - 2 + t^2) = r^2
Taking the square root gives us: r = sqrt(t^2 / (n - 2 + t^2))
Effect Size Measures
As we have discussed, there is a difference between statistical
and practical
significance. Virtually any statistic can become statistically
significant if the sample is large
enough. In practical terms, a correlation of .30 or below is generally considered too weak to be of any practical significance. Additionally, the effect size measure for Pearson's correlation is simply the absolute value of the correlation; the outcome has the same general interpretation as Cohen's d for the t-test (0.8 is strong and 0.2 is quite weak, for example) (Tanner & Youssef-Morgan, 2013).
Spearman’s Rank Correlation
Another type of correlation is the Spearman’s rank order
correlation. This correlation,
which is interpreted the same way as the Pearson’s Correlation,
can be performed on ordinal or
any ranked data. If the data used is ordinal (rankable), we use
Spearman’s rank order
correlation, rho (Tanner & Youssef-Morgan, 2013). Using the same data, but assuming at least one variable is ordinal, would give us the following results. Note that in ranking from low to high, tied values are each given the average of the ranks they occupy. For example, in the table below the raise of 4.7 occurs twice (in the 3rd and 4th places), so each gets a rank of 3.5.
PR Rank   Performance Rating   Raise   Raise Rank   Difference in rank   Difference squared
1         55                   3       1            0                    0
2         75                   3.6     2            0                    0
4         80                   4.7     3.5          0.5                  0.25
9         100                  4.7     3.5          5.5                  30.25
9         100                  4.8     5            4                    16
4         80                   4.9     6            -2                   4
4         80                   5.6     7            -3                   9
9         100                  5.7     8            1                    1
6.5       90                   5.8     9            -2.5                 6.25
6.5       90                   6       10           -3.5                 12.25
                                                    Sum =                79
Spearman's rank order correlation = 1 - 6 * (sum of squared rank differences) / (n * (n^2 - 1))
For this data, the sum of squared differences = 79 and n = 10. This gives us a value of
1 - 6 * (79 / (10 * (10^2 - 1))) = 1 - 6 * (79 / (10 * 99)) = 1 - 6 * (79 / 990) = 1 - 6 * 0.0798 = 0.52.
For comparison purposes, the Pearson Correlation equals 0.686.
Note that we have less information about the data when we use
ranks, particularly with
several ties in the data. This reduced information results in a
lower correlation value with
Spearman’s. This correlation is tested and interpreted the same
way as Pearson’s Coefficient is
(Lind, Marchel, & Wathen, 2008).
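To see the rank arithmetic without building the table by hand, here is a short sketch in Python with scipy; the two columns are typed in from the example above:

import numpy as np
from scipy import stats

perf = np.array([55, 75, 80, 100, 100, 80, 80, 100, 90, 90])
raises = np.array([3, 3.6, 4.7, 4.7, 4.8, 4.9, 5.6, 5.7, 5.8, 6])

# rankdata gives tied values the average of their ranks, as in the table
d = stats.rankdata(perf) - stats.rankdata(raises)
rho = 1 - 6 * np.sum(d**2) / (len(perf) * (len(perf)**2 - 1))   # 0.52 here

# scipy's spearmanr computes rho as the Pearson correlation of the ranks,
# which treats ties slightly differently than the shortcut formula above
rho2, p_rho = stats.spearmanr(perf, raises)
r, p_r = stats.pearsonr(perf, raises)    # Pearson on the raw values, for comparison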
References
Lind, D. A., Marchel, W. G., & Wathen, S. A. (2008). Statistical techniques in business & finance (13th ed.). Boston, MA: McGraw-Hill Irwin.
Tanner, D. E., & Youssef-Morgan, C. M. (2013). Statistics for managers. San Diego, CA: Bridgepoint Education.
Week 4 Lecture 11
Regression Analysis
Regression analysis is the development of an equation that
shows the impact of the
independent variables (the inputs we can generally control) on
the output result. While the
mathematical language may sound strange, most of you are
quite familiar with regression-like instructions and use them quite regularly.
To make a cake, we take 1 box mix, add 1¼ cups of water, ½
cup of oil, and 3 eggs. All
of this is combined and cooked. The recipe is an example of a
regression equation. The output
(or result or dependent variable) is the cake; the inputs (or independent variables) are the ingredients used. Each input is accompanied by a coefficient (AKA weight
or amount) that tells us how
“much” of the variable is “used” or weighted into the outcome.
So, in an equation format, this cake recipe might look like:
Y = 1*X1 + 1.25*X2 + 0.5*X3 + 3*X4, where:
Y = cake
X1 = box mix
X2 = cups of water
X3 = cups of oil
X4 = an egg.
Of course, for the cake, the recipe needs to go through the cooking process; for other regression equations, the inputs need to go through whatever "process" turns them into the output – this is often called "life."
Example
With a regression analysis, we can identify what factors
influence an outcome. So, with
our Salary issue, the natural question to help us answer our
research question of do males and
females get equal pay for equal work would be: what factors
influence or explain an individual’s
pay? This is a perfect question for a multi-variate regression.
Multi-variate simply means we have
multiple input variables with a single output variable (Lind,
Marchel, & Wathen, 2008).
Variables. A regression analysis uses two distinct types of data.
The first is variables that are at least interval level (the same as the other techniques we have used so far).
The other is called a dummy variable, a variable that can be
coded 0 or 1 indicating the presence
of some characteristic. In our data set, we have two variables
that can be used as dummy coded
variables in a regression, Degree and Gender; both coded 0 or 1.
In the case of Degree, the 0
stands for having a bachelor’s degree and the 1 stands for
having an advanced degree. For
Gender, 0 means a male and 1 means a female. How these are
interpreted in a regression output
will be discussed below. For now, the significance of dummy
coding is that it allows us to
include nominal or ordinal data in our analysis.
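If a raw data file stored these as text rather than as 0/1 codes, the conversion is straightforward; here is a sketch in Python with pandas (the file name, column names, and text values are all hypothetical):

import pandas as pd

df = pd.read_excel("employee_data.xlsx")         # hypothetical file name

# 1 = female, 0 = male; 1 = advanced degree, 0 = bachelor's degree
df["Gender"] = (df["Gender"] == "F").astype(int)
df["Degree"] = (df["Degree"] == "Graduate").astype(int)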
Excel Approach. For our question of what factors influence
pay, we will use Excel’s
Regression function found in the Data Analysis section. This
function will produce two output
tables of interest. The first table tests to see if the entire
regression equation is statistically
significant; that is, do the input variables significantly impact
the output variable. If so, we
would then examine the second table – the coefficients used in a
regression equation for each of
the variables. We would have a second set of hypothesis statements for each variable: the null would be that the coefficient equals 0 versus an alternate that the coefficient is not equal to 0.
Typically, we list these before we start the analysis.
Step 1: For the regression equation:
Ho: The regression equation is not significant
Ha: The regression equation is significant.
For the coefficients if the regression equation is significant:
Ho: The regression coefficient equals 0
Ha: The regression coefficient is not equal to 0.
Note: We would write one pair of statements for each variable; for space reasons, we include only one general pair that should be applied to each variable.
Step 2: Reject each null hypothesis claim if the related p-value is less than (<) alpha = .05.
Step 3: Regression Analysis
Step 4: Perform the test. Selecting the Regression option in
Data Analysis will open a
familiar data entry box. The Input Y Range would be the salary
range including the label. The
Input X range would be the labels and data for our input variables.
In this case we will use
Midpoint, Age, Performance Rating, Service, Raise, Degree,
and Gender. Be sure to check the Labels box and pick an upper left corner cell for the output range. This will result in the following output (values rounded to three decimal places):
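For those who want to replicate the analysis outside Excel, here is a sketch using Python's statsmodels (the file and column names are assumptions; statsmodels adds the intercept only when asked):

import pandas as pd
import statsmodels.api as sm

df = pd.read_excel("employee_data.xlsx")    # hypothetical file name

X = df[["Midpoint", "Age", "Perf Rat", "Service", "Raise", "Degree", "Gender"]]
X = sm.add_constant(X)                      # adds the intercept term (A)
model = sm.OLS(df["Salary"], X).fit()
print(model.summary())                      # R Square, the F test, and coefficient p-values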
Step 5: Conclusions and Interpretation. Let’s look at each table
separately.
The Regression Statistics table shows a Multiple R and an R Square value. Multiple R is the multiple correlation value. Similar to our Pearson Coefficient, it shows the relationship between the dependent (output, or Salary in this case) variable and all of the independent or input variables. R Square is the multiple coefficient of determination; similar to the Pearson coefficient of determination, it displays the percent of variation in common between the dependent and all of the independent variables.
The adjusted R Square reduces the R Square by a factor that involves the number of variables and the sample size; a large reduction suggests that the design (the number of variables relative to the sample size) impacted the outcome more than the variables did. Here we have an insignificant reduction. The standard error is a measure of variation in the outcome used for predictions. The observations count shows the number of cases used in the regression.
The ANOVA table, sometimes called ANOR – analysis of
regression – provides us with
our test of significance outcome. Similar to the ANOVA
covered in Week 3, we look at the
Significance of F (AKA P-value) to see if we reject or fail to
reject the null hypothesis of no
significance. In this case, the p-value of 8.44E-36 (equaling 0.00000000000000000000000000000000000844) is less than .05, so we reject the null of no
significance. The regression equation explains a significant
proportion of the variation in our
dependent variable of salary.
Now that we have a significant regression equation, we move on
to the final table that
presents and tests the coefficients for each variable. One of the
important parts of a regression
equation is that it shows us the impact of each factor if all other
factors are held constant. A
regression has the form:
Y = A + B1*X1 + B2*X2 + B3*X3 + ..., where Y is the output, A is the intercept (it places the line up or down on the Y axis when all other values are 0), the B's are the coefficient values, and the X's are the variable names. Before considering whether each coefficient is statistically significant or not, our equation would be:
Salary = -4.009 + 1.22*Midpoint + 0.029*Age - 0.096*Perf Rat - 0.074*Service + 0.834*Raise + 1.002*Degree + 2.552*Gender. Whew!
What does this mean? The intercept is an adjustment factor,
one that we do not need to
analyze. For midpoint, it means that as midpoint goes up by a
thousand dollars (remember salary
and midpoint are measured in thousands), the salary goes up by
1.22 thousand – higher graded
employees are paid relatively more compared to midpoint than
others (all others things equal).
For Performance Rating, employees lose $96 (-0.096) for every
higher PR point they have –
certainly not what HR would like!
Now, let’s look at our dummy variables, Degree and Gender.
For Degree, an extra $1,002 is added for employees having a Degree code = 1, since if Degree = 0, then 1.002 * 0 = 0; so graduate degree holders get an extra $1,002 per year. The same thing applies to Gender: those
coded 0 get nothing extra and those coded 1 get $2,552 more
per year (all other things equal).
Since females are coded 1, if this factor is significant, they
would be paid $2552 more than males
with all other factors equal (the definition of equal work).
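To see how the coefficients, including the dummy variables, combine into a predicted salary, here is a small sketch using the full equation above (the employee's input values are made up for illustration):

def predict_salary(midpoint, age, perf_rat, service, raise_pct, degree, gender):
    # Predicted salary in $000s from the full regression equation above
    return (-4.009 + 1.22 * midpoint + 0.029 * age - 0.096 * perf_rat
            - 0.074 * service + 0.834 * raise_pct + 1.002 * degree + 2.552 * gender)

# Hypothetical employee: midpoint 48, age 35, rating 90, 10 years of service,
# a 4.7% raise, an advanced degree (1), and female (1)
print(predict_salary(48, 35, 90, 10, 4.7, 1, 1))   # about 53.7, i.e., $53,700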
So, now let’s take a look at the statistical significance of each
of the variables. This is
determined with the P-value column (next to the t Stat value).
This is read the same way we
saw in the t-test and ANOVA tables: if the value is less than
0.05 we reject the null
hypothesis of no significance.
While the intercept has a significance value, we tend to ignore
it and include the intercept
in all equations. For the other variables, the only significant
variables are: Midpoint, Perf Rating
(unrounded it was 0.0497994…), and Gender. So, the regression
equation including only our
statistically significant factors is Sal = -4.009 + 1.22*Midpoint - 0.096*Perf Rat + 2.552*Gender.
So, we now have a clear answer to our question about males and
females getting equal
pay for equal work. Not only is the answer no (as gender is a
significant factor in determining
salary), but also females are paid $2,552 more annually, all other things equal!
This is certainly not the outcome most of us expected when we
began this journey. What
we see is that variation within any measure has some often
unanticipated outcomes, and unless
we examine the inputs into our results, we often do not
understand them very well. Single
measure tests such as the t and ANOVA tests are quite valuable for comparing similar results, but
they do not always get to the root of what causes differences.
Reference
Lind, D. A., Marchel, W. G., & Wathen, S. A. (2008). Statistical techniques in business & finance (13th ed.). Boston, MA: McGraw-Hill Irwin.
Week 4 Lecture 10
We have been examining the question of equal pay for equal
work for several weeks
now, but have been somewhat frustrated with the equal work
part. We suspect that salary varies
with grade level, so that equal work is not done if we compare
salaries across grades. We found
that we could control the effect of grades with either of two
techniques. The first is by choosing
a variable that does not include grade level variation such as
compa-ratios (the salary divided by
midpoint). The second is by statistically removing the impact of
grade level using the ANOVA
Two-factor without replication. Both of these gave us different
outcomes on the question of
male and female pay equality than examining salary only.
However, we still have not gotten a “clean” measure of equal
work as there are still other
factors that may impact work done such as performance levels
(measured by the performance
appraisal rating), seniority, education, etc. And there could be gender bias (and, for real-world companies, ethnic bias as well; we will not cover ethnic bias, but it can be dealt with in the same way as we will examine gender). We need to find a way to eliminate
the impact of these variables on
our pay measure as well.
This week we will look at two techniques that are very good at
examining and explaining
the influence of variables on outcomes. These are correlation
and regression techniques.
Linear Correlation
Correlation is a measure of how variables/things relate – that is,
if one variable changes
does another variable change in a predictable pattern as well?
One very well-known example is
the correlation (or relationship) between length/height of
children and weight. As children
become longer/taller their weight also increases (Tanner &
Youssef-Morgan, 2013). Using this
relationship, we can make predictions (using the technique of
regression discussed in Lecture 11
for this week) about how heavy a child should be for any given
height.
For variables that are at least interval in nature, two types of
correlation exist for a bivariate (two variables only) relationship – linear and curvilinear. As they sound, linear
correlations show the extent to which the data variables move in
a straight line. Curvilinear
correlations – which we will not cover – show the extent that
variables move in curved lines.
Scatter Diagrams
An effective way to see if the data do relate in predictable ways
involves generating a
scatter diagram (AKA scatter chart) – a visual display of how the data points (variable 1 value, corresponding variable 2 value) relate together (Lind, Marchel, & Wathen, 2008).
Example 1. One relationship we might expect to show a positive (both values increasing) relationship would be salary and performance rating, either for the entire salary range or at least within grades. The following scatter diagram (made with Excel's Insert Chart functions) shows the relationship with Performance Rating on the horizontal axis and Salary on the vertical axis. It shows that if we put a straight line through the data points, there is a very modest increase from the lower left to upper right.
Salary (Y-axis) and Performance Rating (X-axis)
Example 2. If we look at the same variables but include Grade as a factor, we get the second graph (below) and see the data separated by grade. Each grade seems to show (again, if we were to put a straight line through the data points for each grade) level lines, indicating no correlation at all. Neither graph gives us much hope that Performance Rating is related to Salary, something HR would probably not be happy with.
Salary Grades (Y-axis) and Performance Appraisal Rating (X-
axis)
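A scatter chart like the first one can also be produced outside Excel; here is a sketch in Python with matplotlib (the file and column names are assumptions):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_excel("employee_data.xlsx")    # hypothetical file name

plt.scatter(df["Perf Rat"], df["Sal"])
plt.xlabel("Performance Rating")
plt.ylabel("Salary ($000s)")
plt.title("Salary vs. Performance Rating")
plt.show()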
Correlation
We will be focusing our efforts on the Pearson Correlation
Coefficient – a mathematical
value that shows the strength of the linear (straight line)
relationship between two variables
(Lind, Marchel, & Wathen, 2008). The math formula is a bit
tedious, so we will not bother with
it – but, if interested, you can ask Excel to display it (either with Help or the "Tell me what you want to do" box; with the latter, I typed "show help on Pearson Correlation" and then selected the "show help…" line, getting a description and the math formula).
Pearson correlation ranges from a value of -1.00 to a +1.00.
Any value outside of this
range indicates an error in the math or setup. A perfect
negative correlation (-1.00) means that
the data points all fit exactly on a line that runs from the upper
left corner to the lower right on a
graph, a negative slope. A perfect positive correlation (+1.00)
has the line with a positive slope
and runs from the lower left to the upper right (Tanner &
Youssef-Morgan, 2013).
As the values move away from the perfect extremes, the data
points move away from a
line to a spread around the line. If we look at our first graph
above, the overall Salary and
Performance Rating relationship, we have a correlation of +.15,
considered very low and not
particularly impressive.
Pearson Correlation. Excel finds the Pearson Correlation
Coefficient using either the fx
function Correl or the Data Analysis function Correlation. The
former is used for a single data
set with two variables, while the latter can be used for a single
or multiple data sets. The Correl
output for the Performance Rating and Salary correlation result
is:
           Column 1   Column 2
Column 1   1
Column 2   0.151307   1
Note the variable names are not included, and we have three
correlations. Two will always show
a perfect +1.00 correlation of column 1 with column 1 and
column 2 with column 2; this diagonal convention will make more sense with the Correlation table we will
look at below. The third
correlation is the column 1 with column 2 variable. It does not
matter which variable is
considered in column 1 or 2, as the result will be the same as
switching the variable columns.
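The same single-pair result can be computed outside Excel; here is a sketch in Python with scipy (df and the column names are the same assumptions as in the scatter diagram sketch above):

from scipy import stats

r, p_value = stats.pearsonr(df["Perf Rat"], df["Sal"])
print(round(r, 6))                          # 0.151307 for this data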
We can use the Correlation function to identify correlations
between multiple data sets at
the same time, much as Descriptive Statistics could work with
multiple variables at once. In
trying to identify what variables might be impacting Salary, we
could generate the following
table. Remember that Pearson's Correlation requires at least
interval level data, so that not all of
our variables are used. In addition, since Salary and Compa-
ratio are two measures of the same
thing (pay) we do not want to include them in the same table.
          Sal      Mid      Age      Perf Rat  Service  Raise
Sal       1.000
Mid       0.989    1.000
Age       0.544    0.567    1.000
Perf Rat  0.151    0.192    0.139    1.000
Service   0.452    0.471    0.565    0.226     1.000
Raise     -0.041   -0.029   -0.180   0.674     0.103    1.000
To identify all of the correlations for a single variable, find the name in the left column. Then go across until you reach the 1.00 value, then go down. For Age, we find that the correlation with:
Sal = 0.544,
Mid = 0.567,
Age (itself) = 1.00,
Perf Rat = 0.139,
Service = 0.565, and
Raise = -0.180.
Side note: now we can see why the correlation with itself is
shown in the tables: it
provides the pivot point for reading the table outcomes. The
values above this diagonal of 1.00
values would be identical to those below, so they are not
provided to make the table visually
easier to read.
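The full table above can be reproduced in one call with pandas (again assuming df and the column names from the earlier sketches); squaring the entries gives the coefficients of determination discussed next:

cols = ["Sal", "Mid", "Age", "Perf Rat", "Service", "Raise"]
corr_matrix = df[cols].corr()               # Pearson correlations; pandas fills both triangles
cod = corr_matrix ** 2                      # coefficients of determination, e.g., 0.544**2 = .30 for Age and Salary
print(corr_matrix.round(3))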
Coefficient of Determination. We will look at determining
statistical significance of
correlations in Lecture 12, the third lecture for this week. But, in the
meantime, we can consider the Coefficient
of Determination as a rough measure of usefulness (we will look
at the effect size measure in
Lecture 12 as well). The coefficient of determination is the
square of the correlation, and
represents the percent of variation that the variables share in
common; that is, the amount of
variation in one variable’s changes that is explained by the
variation in the other variable. So,
for Age and Salary, the coefficient equals 0.544^2 = .30
(rounded). As a rule of thumb, variable
pairs with coefficients less than (<) 70% are generally not very
valuable for prediction purposes.
References
Lind, D. A., Marchel, W. G., & Wathen, S. A. (2008). Statistical techniques in business & finance (13th ed.). Boston, MA: McGraw-Hill Irwin.
Tanner, D. E., & Youssef-Morgan, C. M. (2013). Statistics for managers. San Diego, CA: Bridgepoint Education.