Week 4 Lecture 12
Significance
Earlier we discussed correlations without going into how we can
identify statistically
significant values. Our approach to this uses the t-test.
Unfortunately, Excel does not
automatically produce this form of the t-test, but setting it up
within an Excel cell is fairly easy.
And, with some slight algebra, we can determine the minimum
value that is statistically
significant for any table of correlations all of which have the
same number of pairs (for example,
a Correlation table for our data set would use 50 pairs of values,
since we have 50 members in
our sample).
The t-test formula for a correlation (r) is t = r * sqrt(n - 2) / sqrt(1 - r^2); the associated degrees
of freedom are n-2 (number of pairs – 2) (Lind, Marchel, &
Wathen, 2008). For some this might
look a bit off-putting, but remember that we can translate this
into Excel cells and functions and
have Excel do the arithmetic for us.
Excel Example
If we go back to our correlation table for Salary, Midpoint, Age, Perf Rat, Service, and Raise, we have:
Using Excel to create the formula and cell numbers for our key
values allows us to
quickly create a result. The T.DIST.2T function gives us a p-value easily.
The formula to use in finding the minimum correlation value
that is statistically
significant is r = sqrt(t^2/(t^2 + n-2)). We would find the
appropriate t value by using the
T.INV.2T(alpha, df) function with alpha = 0.05 and df = n - 2 = 48. Plugging these values into the function gives us a t-value of 2.0106, or 2.011 (rounded).
Putting 2.011 and 48 (n - 2) into our formula gives us an r value of 0.278; therefore, in a correlation table based on 50 pairs, any correlation with an absolute value greater than or equal to 0.278 would be statistically significant.
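For those who want to check these numbers outside of Excel, the same arithmetic can be sketched in Python with scipy (a sketch only; the sample size, alpha, and the example correlation of 0.544 come from this lecture's data):

from scipy import stats

n = 50
alpha = 0.05
df = n - 2                                      # degrees of freedom = 48

# Two-tailed critical t, the equivalent of Excel's T.INV.2T(0.05, 48)
t_crit = stats.t.ppf(1 - alpha / 2, df)         # about 2.0106

# Minimum statistically significant correlation: r = sqrt(t^2/(t^2 + n - 2))
r_min = (t_crit**2 / (t_crit**2 + df)) ** 0.5   # about 0.278

# p-value for an observed correlation r, the equivalent of T.DIST.2T
r = 0.544                                       # e.g., the Age-Salary correlation
t_stat = r * df**0.5 / (1 - r**2) ** 0.5
p_value = 2 * stats.t.sf(abs(t_stat), df)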
Technical Point. If you are interested in how we obtained the
formula for determining
the minimum r value, the approach is shown below. If you are
not interested in the math, you
can safely skip this paragraph.
t = r * sqrt(n - 2) / sqrt(1 - r^2)
Multiplying both sides by sqrt(1 - r^2) gives us: t * sqrt(1 - r^2) = r * sqrt(n - 2)
Squaring gives us: t^2 * (1 - r^2) = r^2 * (n - 2)
Multiplying out gives us: t^2 - t^2 * r^2 = n * r^2 - 2 * r^2
Adding t^2 * r^2 to both sides gives us: t^2 = n * r^2 - 2 * r^2 + t^2 * r^2
Factoring gives us: t^2 = r^2 * (n - 2 + t^2)
Dividing gives us: t^2 / (n - 2 + t^2) = r^2
Taking the square root gives us: r = sqrt(t^2 / (n - 2 + t^2))
Effect Size Measures
As we have discussed, there is a difference between statistical
and practical
significance. Virtually any statistic can become statistically
significant if the sample is large
enough. In practical terms, a correlation of .30 or below is generally considered too weak to be of any practical significance. Additionally, the effect size measure for Pearson's correlation is simply the absolute value of the correlation; the outcome has the same general interpretation as Cohen's d for the t-test (0.8 is strong and 0.2 is quite weak, for example) (Tanner & Youssef-Morgan, 2013).
Spearman’s Rank Correlation
Another type of correlation is the Spearman’s rank order
correlation. This correlation,
which is interpreted the same way as the Pearson’s Correlation,
can be performed on ordinal or
any ranked data. If the data used is ordinal (rankable), we use
Spearman’s rank order
correlation, rho (Tanner & Youssef-Morgan, 2013). Using the same data, but assuming at least one variable is ordinal, would give us the following results. Note that in ranking from low to high, tied values are each given the average of the ranks they occupy. For example, in the table below the raise of 4.7 occurs twice (in the 3rd and 4th places), so each gets a rank of 3.5.
PR Rank   Performance Rating   Raise   Raise Rank   Difference in rank   Difference squared
1         55                   3       1            0                    0
2         75                   3.6     2            0                    0
4         80                   4.7     3.5          0.5                  0.25
9         100                  4.7     3.5          5.5                  30.25
9         100                  4.8     5            4                    16
4         80                   4.9     6            -2                   4
4         80                   5.6     7            -3                   9
9         100                  5.7     8            1                    1
6.5       90                   5.8     9            -2.5                 6.25
6.5       90                   6       10           -3.5                 12.25
                                                    Sum =                79
Spearman's rank order correlation = 1 - 6 * (sum of squared rank differences) / (n * (n^2 - 1))
For this data, the sum of squared differences = 79 and n = 10. This gives us a value of
1 - 6 * (79 / (10 * (10^2 - 1))) = 1 - 6 * (79 / (10 * 99)) = 1 - 6 * (79 / 990) = 1 - 6 * 0.0798 = 0.52.
For comparison purposes, the Pearson Correlation equals 0.686.
Note that we have less information about the data when we use
ranks, particularly with
several ties in the data. This reduced information results in a
lower correlation value with
Spearman’s. This correlation is tested and interpreted the same
way as Pearson’s Coefficient is
(Lind, Marchel, & Wathen, 2008).
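To see the rank arithmetic without building the table by hand, here is a short sketch in Python with scipy; the two columns are typed in from the example above:

import numpy as np
from scipy import stats

perf = np.array([55, 75, 80, 100, 100, 80, 80, 100, 90, 90])
raises = np.array([3, 3.6, 4.7, 4.7, 4.8, 4.9, 5.6, 5.7, 5.8, 6])

# rankdata gives tied values the average of their ranks, as in the table
d = stats.rankdata(perf) - stats.rankdata(raises)
rho = 1 - 6 * np.sum(d**2) / (len(perf) * (len(perf)**2 - 1))   # 0.52 here

# scipy's spearmanr computes rho as the Pearson correlation of the ranks,
# which treats ties slightly differently than the shortcut formula above
rho2, p_rho = stats.spearmanr(perf, raises)
r, p_r = stats.pearsonr(perf, raises)    # Pearson on the raw values, for comparison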
References
Lind, D. A., Marchel, W. G., & Wathen, S. A. (2008). Statistical techniques in business & finance (13th ed.). Boston, MA: McGraw-Hill Irwin.
Tanner, D. E., & Youssef-Morgan, C. M. (2013). Statistics for managers. San Diego, CA: Bridgepoint Education.
Week 4 Lecture 11
Regression Analysis
Regression analysis is the development of an equation that
shows the impact of the
independent variables (the inputs we can generally control) on
the output result. While the
mathematical language may sound strange, most of you are
quite familiar with regression-like instructions and use them quite regularly.
To make a cake, we take 1 box mix, add 1¼ cups of water, ½
cup of oil, and 3 eggs. All
of this is combined and cooked. The recipe is an example of a
regression equation. The output
(or result or dependent variable) is the cake; the inputs (or independent variables) are the ingredients used. Each input is accompanied by a coefficient (AKA weight
or amount) that tells us how
“much” of the variable is “used” or weighted into the outcome.
So, in an equation format, this cake recipe might look like:
Y = 1*X1 + 1.25*X2 + 0.5*X3 + 3*X4, where:
Y = cake
X1 = box mix
X2 = cups of water
X3 = cups of oil
X4 = an egg.
Of course, for the cake, the recipe needs to go through the cooking process; for other regression equations, the inputs need to go through whatever "process" turns them into the output – this is often called "life."
Example
With a regression analysis, we can identify what factors
influence an outcome. So, with
our Salary issue, the natural question to help us answer our
research question of do males and
females get equal pay for equal work would be: what factors
influence or explain an individual’s
pay? This is a perfect question for a multi-variate regression.
Multi-variate simply means we have
multiple input variables with a single output variable (Lind,
Marchel, & Wathen, 2008).
Variables. A regression analysis uses two distinct types of data.
The first is variables that are at least interval level (the same as the other techniques we have used so far).
The other is called a dummy variable, a variable that can be
coded 0 or 1 indicating the presence
of some characteristic. In our data set, we have two variables
that can be used as dummy coded
variables in a regression, Degree and Gender; both coded 0 or 1.
In the case of Degree, the 0
stands for having a bachelor’s degree and the 1 stands for
having an advanced degree. For
Gender, 0 means a male and 1 means a female. How these are
interpreted in a regression output
will be discussed below. For now, the significance of dummy
coding is that it allows us to
include nominal or ordinal data in our analysis.
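If a raw data file stored these as text rather than as 0/1 codes, the conversion is straightforward; here is a sketch in Python with pandas (the file name, column names, and text values are all hypothetical):

import pandas as pd

df = pd.read_excel("employee_data.xlsx")         # hypothetical file name

# 1 = female, 0 = male; 1 = advanced degree, 0 = bachelor's degree
df["Gender"] = (df["Gender"] == "F").astype(int)
df["Degree"] = (df["Degree"] == "Graduate").astype(int)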
Excel Approach. For our question of what factors influence
pay, we will use Excel’s
Regression function found in the Data Analysis section. This
function will produce two output
tables of interest. The first table tests to see if the entire
regression equation is statistically
significant; that is, do the input variables significantly impact
the output variable. If so, we
would then examine the second table – the coefficients used in a
regression equation for each of
the variables. We would have a second set of hypothesis statements for each variable: the null would be that the coefficient equals 0 versus an alternate that the coefficient is not equal to 0.
Typically, we list these before we start the analysis.
Step 1: For the regression equation:
Ho: The regression equation is not significant
Ha: The regression equation is significant.
For the coefficients if the regression equation is significant:
Ho: The regression coefficient equals 0
Ha: The regression coefficient is not equal to 0.
Note: We would write one pair of statements for each variable; for space reasons, we include only one general pair that should be applied to each variable.
Step 2: Reject each null hypothesis claim if the related p-value is less than (<) alpha = .05.
Step 3: Regression Analysis
Step 4: Perform the test. Selecting the Regression option in
Data Analysis will open a
familiar data entry box. The Input Y Range would be the salary
range including the label. The
Input X range would be the labels and data for our input variables.
In this case we will use
Midpoint, Age, Performance Rating, Service, Raise, Degree,
and Gender. Be sure to check the Labels box and pick an upper left corner cell for the output range. This will result in the following output (values rounded to three decimal places):
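For those who want to replicate the analysis outside Excel, here is a sketch using Python's statsmodels (the file and column names are assumptions; statsmodels adds the intercept only when asked):

import pandas as pd
import statsmodels.api as sm

df = pd.read_excel("employee_data.xlsx")    # hypothetical file name

X = df[["Midpoint", "Age", "Perf Rat", "Service", "Raise", "Degree", "Gender"]]
X = sm.add_constant(X)                      # adds the intercept term (A)
model = sm.OLS(df["Salary"], X).fit()
print(model.summary())                      # R Square, the F test, and coefficient p-values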
Step 5: Conclusions and Interpretation. Let’s look at each table
separately.
The Regression Statistics table shows a Multiple R and an R Square value. Multiple R is the multiple correlation value. Similar to our Pearson Coefficient, it shows the relationship between the dependent (output, or Salary in this case) variable and all of the independent or input variables. R Square is the multiple coefficient of determination; similar to the Pearson coefficient of determination, it displays the percent of variation in common between the dependent and all of the independent variables.
The adjusted R Square reduces the R Square by a factor that involves the number of variables and the sample size; a large reduction suggests that the design (the number of variables relative to the sample size) impacted the outcome more than the variables did. Here we have an insignificant reduction. The standard error is a measure of variation in the outcome used for predictions. The observations count shows the number of cases used in the regression.
The ANOVA table, sometimes called ANOR – analysis of
regression – provides us with
our test of significance outcome. Similar to the ANOVA
covered in Week 3, we look at the
Significance of F (AKA P-value) to see if we reject or fail to
reject the null hypothesis of no
significance. In this case, the p-value of 8.44E-36 (equaling 0.00000000000000000000000000000000000844) is less than .05, so we reject the null of no
significance. The regression equation explains a significant
proportion of the variation in our
dependent variable of salary.
Now that we have a significant regression equation, we move on
to the final table that
presents and tests the coefficients for each variable. One of the
important parts of a regression
equation is that it shows us the impact of each factor if all other
factors are held constant. A
regression has the form:
Y = A + B1*X1 + B2*X2 + B3*X3 + ..., where Y is the output, A is the intercept (it places the line up or down on the Y axis when all other values are 0), the B's are the coefficient values, and the X's are the variable names. Before considering whether each coefficient is statistically significant or not, our equation would be:
Salary = -4.009 + 1.22*Midpoint + 0.029*Age - 0.096*Perf Rat - 0.074*Service + 0.834*Raise + 1.002*Degree + 2.552*Gender. Whew!
What does this mean? The intercept is an adjustment factor,
one that we do not need to
analyze. For midpoint, it means that as midpoint goes up by a
thousand dollars (remember salary
and midpoint are measured in thousands), the salary goes up by
1.22 thousand – higher graded
employees are paid relatively more compared to midpoint than
others (all others things equal).
For Performance Rating, employees lose $96 (-0.096) for every
higher PR point they have –
certainly not what HR would like!
Now, let’s look at our dummy variables, Degree and Gender.
For Degree, an extra $1,002 is added for employees having a Degree code = 1, since if Degree = 0, then 1.002 * 0 = 0; so graduate degree holders get an extra $1,002 per year. The same thing applies to Gender: those
coded 0 get nothing extra and those coded 1 get $2,552 more
per year (all other things equal).
Since females are coded 1, if this factor is significant, they
would be paid $2552 more than males
with all other factors equal (the definition of equal work).
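To see how the coefficients, including the dummy variables, combine into a predicted salary, here is a small sketch using the full equation above (the employee's input values are made up for illustration):

def predict_salary(midpoint, age, perf_rat, service, raise_pct, degree, gender):
    # Predicted salary in $000s from the full regression equation above
    return (-4.009 + 1.22 * midpoint + 0.029 * age - 0.096 * perf_rat
            - 0.074 * service + 0.834 * raise_pct + 1.002 * degree + 2.552 * gender)

# Hypothetical employee: midpoint 48, age 35, rating 90, 10 years of service,
# a 4.7% raise, an advanced degree (1), and female (1)
print(predict_salary(48, 35, 90, 10, 4.7, 1, 1))   # about 53.7, i.e., $53,700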
So, now let’s take a look at the statistical significance of each
of the variables. This is
determined with the P-value column (next to the t Stat value).
This is read the same way we
saw in the t-test and ANOVA tables: if the value is less than
0.05 we reject the null
hypothesis of no significance.
While the intercept has a significance value, we tend to ignore
it and include the intercept
in all equations. For the other variables, the only significant
variables are: Midpoint, Perf Rating
(unrounded it was 0.0497994…), and Gender. So, the regression
equation including only our
statistically significant factors is Sal = -4.009 + 1.22*Midpoint - 0.096*Perf Rat + 2.552*Gender.
So, we now have a clear answer to our question about males and
females getting equal
pay for equal work. Not only is the answer no (as gender is a
significant factor in determining
salary), but also females are paid $2,552 more annually, all other things equal!
This is certainly not the outcome most of us expected when we
began this journey. What
we see is that variation within any measure has some often
unanticipated outcomes, and unless
we examine the inputs into our results, we often do not
understand them very well. Single
measure tests such as the t and ANOVA tests are quite valuable for comparing similar results, but
they do not always get to the root of what causes differences.
Reference
Lind, D. A., Marchel, W. G., & Wathen, S. A. (2008). Statistical techniques in business & finance (13th ed.). Boston, MA: McGraw-Hill Irwin.
Week 4 Lecture 10
We have been examining the question of equal pay for equal
work for several weeks
now, but have been somewhat frustrated with the equal work
part. We suspect that salary varies
with grade level, so that equal work is not done if we compare
salaries across grades. We found
that we could control the effect of grades with either of two
techniques. The first is by choosing
a variable that does not include grade level variation such as
compa-ratios (the salary divided by
midpoint). The second is by statistically removing the impact of
grade level using the ANOVA
Two-factor without replication. Both of these gave us different
outcomes on the question of
male and female pay equality than examining salary only.
However, we still have not gotten a “clean” measure of equal
work as there are still other
factors that may impact work done such as performance levels
(measured by the performance
appraisal rating), seniority, education, etc. And there could be gender bias (and, for real-world companies, ethnic bias as well; we will not cover ethnic bias, but it can be dealt with in the same way as we will examine gender). We need to find a way to eliminate
the impact of these variables on
our pay measure as well.
This week we will look at two techniques that are very good at
examining and explaining
the influence of variables on outcomes. These are correlation
and regression techniques.
Linear Correlation
Correlation is a measure of how variables/things relate – that is,
if one variable changes
does another variable change in a predictable pattern as well?
One very well-known example is
the correlation (or relationship) between length/height of
children and weight. As children
become longer/taller their weight also increases (Tanner &
Youssef-Morgan, 2013). Using this
relationship, we can make predictions (using the technique of
regression discussed in Lecture 11
for this week) about how heavy a child should be for any given
height.
For variables that are at least interval in nature, two types of
correlation exist for a bivariate (two variables only) relationship – linear and curvilinear. As they sound, linear
correlations show the extent to which the data variables move in
a straight line. Curvilinear
correlations – which we will not cover – show the extent that
variables move in curved lines.
Scatter Diagrams
An effective way to see if the data do relate in predictable ways
involves generating a
scatter diagram (AKA scatter chart) – a visual display of how the data points (variable 1 value, corresponding variable 2 value) relate together (Lind, Marchel, & Wathen, 2008).
Example 1. One relationship we might expect to show a positive (both values increasing) relationship would be salary and performance rating, either for the entire salary range or at least within grades. The following scatter diagram (made with Excel's Insert Chart functions) shows the relationship with Performance Rating on the horizontal axis and Salary on the vertical axis. It shows that if we put a straight line through the data points, there is a very modest increase from the lower left to upper right.
Salary (Y-axis) and Performance Rating (X-axis)
Example 2. If we look at the same variables but include Grade as a factor, we get the second graph (below) and see the data separated by grade. Each grade seems to show (again, if we were to put a straight line through the data points for each grade) level lines, indicating no correlation at all. Neither graph gives us much hope that Performance Rating is related to Salary, something HR would probably not be happy with.
Salary Grades (Y-axis) and Performance Appraisal Rating (X-
axis)
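A scatter chart like the first one can also be produced outside Excel; here is a sketch in Python with matplotlib (the file and column names are assumptions):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_excel("employee_data.xlsx")    # hypothetical file name

plt.scatter(df["Perf Rat"], df["Sal"])
plt.xlabel("Performance Rating")
plt.ylabel("Salary ($000s)")
plt.title("Salary vs. Performance Rating")
plt.show()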
Correlation
We will be focusing our efforts on the Pearson Correlation
Coefficient – a mathematical
value that shows the strength of the linear (straight line)
relationship between two variables
(Lind, Marchel, & Wathen, 2008). The math formula is a bit
tedious, so we will not bother with
it – but, if interested, you can ask Excel to display it (either with Help or the "Tell me what you want to do" box; with the latter, I typed "show help on Pearson Correlation" and then selected the "show help…" line, getting a description and the math formula).
Pearson correlation ranges from a value of -1.00 to a +1.00.
Any value outside of this
range indicates an error in the math or setup. A perfect
negative correlation (-1.00) means that
the data points all fit exactly on a line that runs from the upper
left corner to the lower right on a
graph, a negative slope. A perfect positive correlation (+1.00)
has the line with a positive slope
and runs from the lower left to the upper right (Tanner &
Youssef-Morgan, 2013).
As the values move away from the perfect extremes, the data
points move away from a
line to a spread around the line. If we look at our first graph
above, the overall Salary and
Performance Rating relationship, we have a correlation of +.15,
considered very low and not
particularly impressive.
Pearson Correlation. Excel finds the Pearson Correlation
Coefficient using either the fx
function Correl or the Data Analysis function Correlation. The
former is used for a single data
set with two variables, while the latter can be used for a single
or multiple data sets. The Correl
output for the Performance Rating and Salary correlation result
is:
           Column 1   Column 2
Column 1   1
Column 2   0.151307   1
Note the variable names are not included, and we have three
correlations. Two will always show
a perfect +1.00 correlation of column 1 with column 1 and
column 2 with column 2; this diagonal convention will make more sense with the Correlation table we will
look at below. The third
correlation is the column 1 with column 2 variable. It does not
matter which variable is
considered in column 1 or 2, as the result will be the same as
switching the variable columns.
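The same single-pair result can be computed outside Excel; here is a sketch in Python with scipy (df and the column names are the same assumptions as in the scatter diagram sketch above):

from scipy import stats

r, p_value = stats.pearsonr(df["Perf Rat"], df["Sal"])
print(round(r, 6))                          # 0.151307 for this data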
We can use the Correlation function to identify correlations
between multiple data sets at
the same time, much as Descriptive Statistics could work with
multiple variables at once. In
trying to identify what variables might be impacting Salary, we
could generate the following
table. Remember that Pearson's Correlation requires at least
interval level data, so that not all of
our variables are used. In addition, since Salary and Compa-
ratio are two measures of the same
thing (pay) we do not want to include them in the same table.
          Sal      Mid      Age      Perf Rat  Service  Raise
Sal       1.000
Mid       0.989    1.000
Age       0.544    0.567    1.000
Perf Rat  0.151    0.192    0.139    1.000
Service   0.452    0.471    0.565    0.226     1.000
Raise     -0.041   -0.029   -0.180   0.674     0.103    1.000
To identify all of the correlations for a single variable, find the name in the left column. Then go across until you reach the 1.00 value, then go down. For Age, we find that the correlation with:
Sal = 0.544,
Mid = 0.567,
Age (itself) = 1.00,
Perf Rat = 0.139,
Service = 0.565, and
Raise = -0.180.
Side note: now we can see why the correlation with itself is
shown in the tables: it
provides the pivot point for reading the table outcomes. The
values above this diagonal of 1.00
values would be identical to those below, so they are not
provided to make the table visually
easier to read.
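The full table above can be reproduced in one call with pandas (again assuming df and the column names from the earlier sketches); squaring the entries gives the coefficients of determination discussed next:

cols = ["Sal", "Mid", "Age", "Perf Rat", "Service", "Raise"]
corr_matrix = df[cols].corr()               # Pearson correlations; pandas fills both triangles
cod = corr_matrix ** 2                      # coefficients of determination, e.g., 0.544**2 = .30 for Age and Salary
print(corr_matrix.round(3))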
Coefficient of Determination. We will look at determining
statistical significance of
correlations in Lecture 12, the third lecture for this week. But, in the
meantime, we can consider the Coefficient
of Determination as a rough measure of usefulness (we will look
at the effect size measure in
Lecture 12 as well). The coefficient of determination is the
square of the correlation, and
represents the percent of variation that the variables share in
common; that is, the amount of
variation in one variable’s changes that is explained by the
variation in the other variable. So,
for Age and Salary, the coefficient equals 0.544^2 = .30
(rounded). As a rule of thumb, variable
pairs with coefficients less than (<) 70% are generally not very
valuable for prediction purposes.
References
Lind, D. A., Marchel, W. G., & Wathen, S. A. (2008). Statistical techniques in business & finance (13th ed.). Boston, MA: McGraw-Hill Irwin.
Tanner, D. E., & Youssef-Morgan, C. M. (2013). Statistics for managers. San Diego, CA: Bridgepoint Education.