REGRESSION
Regression is a statistical tool that allows you to predict the
value of one continuous variable
from one or more other variables. When you perform a
regression analysis, you create a
regression equation that predicts the values of your DV using
the values of your IVs. Each IV is
associated with specific coefficients in the equation that
summarizes the relationship between
that IV and the DV. Once we estimate a set of coefficients in a
regression equation, we can use
hypothesis tests and confidence intervals to make inferences
about the corresponding parameters
in the population. You can also use the regression equation to
predict the value of the DV given a
specified set of values for your IVs.
Simple Linear Regression
Simple linear regression is used to predict the value of a single
continuous DV (which we will
call Y) from a single continuous IV (which we will call X).
Regression assumes that the
relationship between IV and the DV can be represented by the
equation
Yi = β0 + β1Xi + εi,
where Yi is the value of the DV for case i, Xi is the value of the
IV for case i, β0 and β1 are
constants, and εi is the error in prediction for case i. When you
perform a regression, what you
are basically doing is determining estimates of β0 and β1 that
let you best predict values of Y
from values of X. You may remember from geometry that the
above equation is equivalent to a
straight line. This is no accident, since the purpose of simple
linear regression is to define the
line that represents the relationship between our two variables.
β0 is the intercept of the line,
indicating the expected value of Y when X = 0. β1 is the slope
of the line, indicating how much
we expect Y will change when we increase X by a single unit.
The regression equation above is written in terms of population
parameters. That indicates that
our goal is to determine the relationship between the two
variables in the population as a whole.
We typically do this by taking a sample and then performing
calculations to obtain the estimated
regression equation
Yi = b0 + b1Xi.
Once you estimate the values of b0 and b1, you can substitute in
those values and use the
regression equation to predict the expected values of the DV for
specific values of the IV.
Predicting the values of Y from the values of X is referred to as
regressing Y on X. When
analyzing data from a study you will typically want to regress
the values of the DV on the values
of the IV. This makes sense since you want to use the IV to
explain variability in the DV. We
typically calculate b0 and b1 using least squares estimation.
This chooses estimates that minimize
the sum of squared errors between the values of the estimated
regression line and the actual
observed values.
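To make the least squares idea concrete, here is a minimal sketch of the closed-form estimates, written in Python rather than SPSS and using invented numbers: b1 = cov(X, Y) / var(X) and b0 = mean(Y) - b1 * mean(X).

import numpy as np

# Hypothetical data: X = hours studied (IV), Y = exam score (DV)
X = np.array([2.0, 4.0, 5.0, 7.0, 9.0])
Y = np.array([65.0, 70.0, 72.0, 80.0, 88.0])

# Closed-form least squares estimates
b1 = np.cov(X, Y, ddof=1)[0, 1] / np.var(X, ddof=1)   # slope
b0 = Y.mean() - b1 * X.mean()                         # intercept

# Predicted values and residuals; least squares minimizes the sum of
# the squared residuals
Y_hat = b0 + b1 * X
residuals = Y - Y_hat

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
print("sum of squared errors:", np.sum(residuals**2))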
In addition to using the estimated regression equation for
prediction, you can also perform
hypothesis tests regarding the individual regression parameters.
The slope of the regression
equation (β1) represents the change in Y with a one-unit change
in X. If X predicts Y, then as X
increases, Y should change in some systematic way. You can
therefore test for a linear
relationship between X and Y by determining whether the slope
parameter is significantly
different from zero.
When performing linear regression, we typically make the
following assumptions about
the error terms εi.
1. The errors have a normal distribution.
2. The same amount of error in the model is found at each level
of X.
3. The errors in the model are all independent.
To perform a simple linear regression in SPSS
• Choose Analyze → Regression → Linear.
• Move the DV to the Dependent box.
• Move the IV to the Independent(s) box.
• Click the Continue button.
• Click the OK button.
The output from this analysis will contain the following
sections.
• Variables Entered/Removed. This section is only used in
model building and contains
no useful information in simple linear regression.
• Model Summary. The value listed below R is the correlation
between your variables.
The value listed below R Square is the proportion of variance in
your DV that can be
accounted for by your IV. The value in the Adjusted R Square
column is a measure of
model fit, adjusting for the number of IVs in the model. The
value listed below Std.
Error of the Estimate is the standard deviation of the residuals.
• ANOVA. Here you will see an ANOVA table, which provides
an F test of the
relationship between your IV and your DV. If the F test is
significant, it indicates that
there is a relationship.
• Coefficients. This section contains a table where each row
corresponds to a single
coefficient in your model. The row labeled Constant refers to
the intercept, while the
row containing the name of your IV refers to the slope. Inside
the table, the column
labeled B contains the estimates of the parameters and the
column labeled Std. Error
contains the standard error of those parameters. The column
labeled Beta contains the
standardized regression coefficient, which is the parameter
estimate that you would get if
you standardized both the IV and the DV by subtracting off
their mean and dividing by
their standard deviations. Standardized regression coefficients
are sometimes used in
multiple regression (discussed below) to compare the relative
importance of different IVs
when predicting the DV. In simple linear regression, the
standardized regression
coefficient will always be equal to the correlation between the
IV and the DV. The
column labeled t contains the value of the t-statistic testing
whether the value of each
parameter is equal to zero. The p-value of this test is found in
the column labeled Sig. If
the value for the IV is significant, then there is a relationship
between the IV and the DV.
Note that the square of the t statistic is equal to the F statistic in
the ANOVA table and
that the p-values of the two tests are equal. This is because both
of these are testing
whether there is a significant linear relationship between your
variables.
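The claims above about the standardized coefficient and about t squared equaling F can be checked numerically. The sketch below uses Python with the statsmodels and scipy packages on simulated data; it illustrates the algebraic relationships rather than reproducing any SPSS output.

import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 + 0.5 * x + rng.normal(size=50)

fit = sm.OLS(y, sm.add_constant(x)).fit()

t_slope = fit.tvalues[1]        # t statistic for the slope
F_model = fit.fvalue            # F statistic from the ANOVA table
r, _ = stats.pearsonr(x, y)     # correlation between the IV and the DV

# standardized slope = b1 * (SD of X / SD of Y)
beta_std = fit.params[1] * x.std(ddof=1) / y.std(ddof=1)

print(np.isclose(t_slope**2, F_model))   # True: t squared equals F
print(np.isclose(beta_std, r))           # True: standardized slope equals r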
Multiple Regression
Sometimes you may want to explain variability in a continuous
DV using several different
continuous IVs. Multiple regression allows us to build an
equation predicting the value of the
DV from the values of two or more IVs. The parameters of this
equation can be used to relate the
variability in our DV to the variability in specific IVs.
Sometimes people use the term
multivariate regression to refer to multiple regression, but most
statisticians do not use
ìmultiple" and ìmultivariate" as synonyms. Instead, they use the
term ìmultiple" to describe
analyses that examine the effect of two or more IVs on a single
DV, while they reserve the term
ìmultivariate" to describe analyses that examine the effect of
any number of IVs on two or more
DVs.
The general form of the multiple regression model is
Yi = β0 + β1Xi1 + β2Xi2 + … + βkXik + εi.
The elements in this equation are the same as those found in
simple linear regression, except that
we now have k different parameters which are multiplied by the
values of the k IVs to get our
predicted value. We can again use least squares estimation to
determine the estimates of these
parameters that best fit our observed data. Once we obtain these
estimates we can either use our
equation for prediction, or we can test whether our parameters
are significantly different from
zero to determine whether each of our IVs makes a significant
contribution to our model.
Care must be taken when making inferences based on the
coefficients obtained in multiple
regression. The way that you interpret a multiple regression
coefficient is somewhat different
from the way that you interpret coefficients obtained using
simple linear regression.
Specifically, the value of a multiple regression coefficient
represents the ability of the part of the
corresponding IV that is unrelated to the other IVs to predict
the part of the DV that is unrelated
to the other IVs. It therefore represents the unique ability of the
IV to account for variability in
the DV. One implication of the way coefficients are determined
is that your parameter estimates
become very difficult to interpret if there are large correlations
among your IVs. The effect of
these relationships on multiple regression coefficients is called
multicollinearity. This changes
the values of your coefficients and greatly increases their
variance. It can cause you to find that
none of your coefficients are significantly different from zero,
even when the overall model does
a good job predicting the value of the DV.
The typical effect of
multicollinearity is to reduce the size of your parameter
estimates. Since the value of the
coefficient is based on the unique ability for an IV to account
for variability in a DV, if there is a
portion of variability that is accounted for by multiple IVs, all
of their coefficients will be
reduced. Under certain circumstances multicollinearity can also
create a suppression effect. If
you have one IV that has a high correlation with another IV but
a low correlation with the DV,
you can find that the multiple regression coefficient for the
second IV from a model including
both variables can be larger (or even opposite in direction!)
compared to the coefficient from a
model that doesn't include the first IV. This happens when the
part of the second IV that is
independent of the first IV has a different relationship with the
DV than does the part that is
related to the first IV. It is called a suppression effect because
the relationship that appears in
multiple regression is suppressed when you just look at the
variable by itself.
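The sketch below illustrates this problem with simulated data (Python with statsmodels; the variables and numbers are invented). Two nearly collinear IVs can produce a model whose overall F test is clearly significant even though neither individual coefficient reaches significance, because their standard errors are inflated.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 60
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)        # nearly collinear with x1
y = 1.0 + 0.5 * x1 + 0.5 * x2 + rng.normal(size=n)

fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

# The overall model is significant, yet the individual coefficients may
# not be, because their standard errors are inflated by the high
# correlation between x1 and x2.
print("overall model F p-value:", fit.f_pvalue)
print("coefficient p-values:   ", fit.pvalues[1:])
print("coefficient std. errors:", fit.bse[1:])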
To perform a multiple regression in SPSS
• Choose Analyze → Regression → Linear.
• Move the DV to the Dependent box.
• Move all of the IVs to the Independent(s) box.
• Click the Continue button.
• Click the OK button.
The SPSS output from a multiple regression analysis contains
the following sections.
• Variables Entered/Removed. This section is only used in
model building and contains
no useful information in standard multiple regression.
• Model Summary. The value listed below R is the multiple
correlation between your IVs
and your DV. The value listed below R square is the proportion
of variance in your DV
that can be accounted for by your IVs. The value in the Adjusted
R Square column is a
measure of model fit, adjusting for the number of IVs in the
model. The value listed
below Std. Error of the Estimate is the standard deviation of the
residuals.
• ANOVA. This section provides an F test for your statistical
model. If this F is significant,
it indicates that the model as a whole (that is, all IVs combined)
predicts significantly
more variability in the DV compared to a null model that only
has an intercept parameter.
Notice that this test is affected by the number of IVs in the
model being tested.
• Coefficients. This section contains a table where each row
corresponds to a single
coefficient in your model. The row labeled Constant refers to
the intercept, while the
coefficients for each of your IVs appear in the row beginning
with the name of the IV.
Inside the table, the column labeled B contains the estimates of
the parameters and the
column labeled Std. Error contains the standard error of those
estimates. The column
labeled Beta contains the standardized regression coefficient.
The column labeled t
contains the value of the t-statistic testing whether the value of
each parameter is equal to
zero. The p-value of this test is found in the column labeled Sig.
A significant t-test
indicates that the IV is able to account for a significant amount
of variability in the DV,
independent of the other IVs in your regression model.
Multiple regression with interactions
In addition to determining the independent effect of each IV on
the DV, multiple regression can
also be used to detect interactions between your IVs. An
interaction measures the extent to
which the relationship between an IV and a DV depends on the
level of other IVs in the model.
For example, if you have an interaction between two IVs (called
a two-way interaction) then you
expect that the relationship between the first IV and the DV will
be different across different
levels of the second IV. Interactions are symmetric, so if you
have an interaction such that the
effect of IV1 on the DV depends on the level of IV2, then it is
also true that the effect of IV2 on
the DV depends on the level of IV1. It therefore does not matter
whether you say that you have
an interaction between IV1 and IV2 or an interaction between
IV2 and IV1. You can also have
interactions between more than two IVs. For example, you can
have a three-way interaction
between IV1, IV2, and IV3. This would mean that the two-way
interaction between IV1 and IV2
depends on the level of IV3. Just like two-way interactions,
three-way interactions are also
independent of the order of the variables. So the above three-
way interaction would also mean
that the two-way interaction between IV1 and IV3 is dependent
on the level of IV2, and that the
two-way interaction between IV2 and IV3 depends on the level
of IV1.
It is possible to have both main effects and interactions at the
same time. For example, you can
have a general trend that the value of the DV increases when the
value of a particular IV
increases along with an interaction such that the relationship is
stronger when the value of a
second IV is high than when the value of that second IV is low.
You can also have lower order
interactions in the presence of a higher order interaction. Again,
the lower-order interaction
would represent a general trend that is modified by the higher-
order interaction.
You can use linear regression to determine if there is an
interaction between a pair of IVs by
adding an interaction term to your statistical model. To detect
the interaction effect of two IVs
(X1 and X2) on a DV (Y) you would use linear regression to
estimate the equation
Yi = b0 + b1Xi1 + b2Xi2 + b3Xi1Xi2.
You construct the variable for the interaction term Xi1Xi2 by
literally multiplying the value of
X1 by the value of X2 for each case in your data set. If the test
of b3 is significant, then the two
predictors have an interactive effect on the outcome variable.
In addition to the interaction term itself, your model must
contain all of the main effects of the
variables involved in the interaction as well as all of the lower-
order interaction terms that can be
created using those main effects. For example, if you want to
test for a three-way interaction you
must include the three main effects as well as all of the possible
two-way interactions that can be
made from those three variables. If you do not include the
lower-order terms then the test on the
highest order interaction will produce incorrect results.
It is important to center the variables that are involved in an
interaction before including them in
your model. That is, for each independent variable, the analyst
should subtract the mean of the
independent variable from each participant's score on that
variable. The interaction term should
then be constructed from the centered variables by multiplying
them together. The model itself
should then be tested using the centered main effects and the
constructed interaction term.
Centering your independent variables will not change their
relationship to the dependent
variable, but it will reduce the collinearity between the main
effects and the interaction term.
If the variables are not centered then none of the coefficients on
terms involving IVs involved in
the interaction will be interpretable except for the highest-order
interaction. When the variables
are centered, however, then the coefficients on the IVs can be
interpreted as representing the
main effect of the IV on the DV, averaging over the other
variables in the interaction. The
coefficients on lower-order interaction terms can similarly be
interpreted as testing the
average strength of that lower-order interaction, averaging over
the variables that are excluded
from the lower-order interaction but included in the highest-order interaction term.
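As an illustration of the centering procedure described above, here is a sketch in Python (statsmodels, simulated data with invented variable names) that centers two IVs, builds the product term from the centered scores, and fits the model.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 150
x1 = rng.normal(50, 10, size=n)     # hypothetical IV 1
x2 = rng.normal(100, 15, size=n)    # hypothetical IV 2
y = (5 + 0.4 * x1 + 0.3 * x2
     + 0.02 * (x1 - 50) * (x2 - 100) + rng.normal(size=n))

# Center each IV, then build the interaction term from the centered scores
x1c = x1 - x1.mean()
x2c = x2 - x2.mean()
inter = x1c * x2c

X = sm.add_constant(np.column_stack([x1c, x2c, inter]))
fit = sm.OLS(y, X).fit()

# The coefficient on the product term tests the interaction; with centered
# IVs, the other coefficients can be read as average (main) effects.
print(fit.params)     # b0, b1, b2, b3
print(fit.pvalues)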
You can perform a multiple regression including interaction
terms in SPSS just like you would a
standard multiple regression if you create your interaction terms
ahead of time. However,
creating these variables can be tedious when analyzing models
that contain a large number of
interaction terms. Luckily, if you choose to analyze your data
using the General Linear Model
procedure, SPSS will create these interaction terms for you
(although you still need to center all
of your original IVs beforehand). To analyze a regression model
this way in SPSS
• Center the IVs involved in the interaction.
• Choose Analyze → General Linear Model → Univariate.
• Move your DV to the box labeled Dependent Variable.
• Move all of the main effect terms for your IVs to the box
labeled Covariate(s).
• Click the Options button.
• Check the box next to Parameter estimates. By default this
procedure will only provide
you with tests of your IVs and not the actual parameter
estimates.
• Click the Continue button.
• By default SPSS will not include interactions between
continuous variables in its
statistical models. However, if you build a custom model you
can include whatever
terms you like. You should therefore next build a model that
includes all of the main
effects of your IVs as well as any desired interactions. To do
this
o Click the Model button.
o Click the radio button next to Custom.
o Select all of your IVs, set the drop-down menu to Main
effects, and click the
arrow button.
o For each interaction term, select the variables involved in the
interaction, set the
drop-down menu to Interaction, and click the arrow button.
o If you want all of the possible two-way interactions between a
collection of IVs
you can just select the IVs, set the drop-down menu to All 2-
way, and click the
arrow button. This procedure can also be used to get all possible
three-way, four-
way, or five-way interactions between a collection of IVs by
setting the drop-
down menu to the appropriate interaction type.
• Click the Continue button.
• Click the OK button.
The output from this analysis will contain the same sections
found in standard multiple
regression. When referring to an interaction, SPSS will display
the names of the variables
involved in the interaction separated by asterisks (*). So the
interaction between the variables
RACE and GENDER would be displayed as RACE * GENDER.
So what does it mean if you obtain a significant interaction in
regression? Remember that in
simple linear regression, the slope coefficient (b1) indicates the
expected change in Y with a one-
unit change in X. In multiple regression, the slope coefficient
for X1 indicates the expected
change in Y with a one-unit change in X1, holding all other X
values constant. Importantly, this
change in Y with a one-unit change in X1 is the same no matter
what value the other X variables
in the model take on. However, if there is a significant
interaction, the interpretation of
coefficients is slightly different. In this case, the slope
coefficient for X1 depends on the level of
the other predictor variables in the model.
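A small worked example of this point, using invented estimates: in a centered model Y-hat = b0 + b1*X1c + b2*X2c + b3*(X1c*X2c), the slope relating Y to X1 at a particular value of X2 is b1 + b3*X2c.

# Hypothetical estimates from a centered interaction model
b1, b3 = 0.4, 0.02
x2_sd = 15.0    # hypothetical standard deviation of the centered X2

# The slope of Y on X1 changes with the level of X2
for x2c in (-x2_sd, 0.0, x2_sd):   # "low", "mean", and "high" X2
    print(f"slope of Y on X1 when X2c = {x2c:+.0f}: {b1 + b3 * x2c:.2f}")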
Polynomial regression
Polynomial regression models are used when the true
relationship between a continuous
predictor variable and a continuous dependent variable is a
polynomial function, or when the
curvilinear relationship is complex or unknown but can be
approximated by a polynomial
function.
A polynomial regression model with one predictor variable is
expressed in the following way:
Yi = β0 + β1Xi + β11Xi² + εi
The predictor variable (X) should be centered (discussed in the
section Multiple regression with
interactions), or else the X and X² terms will be highly
correlated and lead to severe
multicollinearity. Additionally, you lose the ability to interpret
the lower-order coefficients in a
straightforward manner.
In the above model, the coefficient β1 is typically called the
"linear effect" coefficient and β11 is called the "quadratic effect" coefficient. If the estimate of the
coefficient β11 is significantly
different from zero then you have a significant quadratic effect
in your data. If the highest-order
term in a polynomial model is not significant, conventionally
statisticians will remove that term
from the model and rerun the regression.
The best way to choose the highest order polynomial is through
a historical or theoretical
analysis. There are certain types of relationships that are well
known to be fitted by quadratic or
cubic models. You might also determine that a specific type of
relationship should exist because
of the mechanisms responsible for the relationship between the
IV and the DV. If you are
building your model in an exploratory fashion, however, you
can estimate how high of an order
function you should use by the shape of the relationship
between the DV and that IV. If your
data appears to reverse p times (has p curves in the graph), you
should use a function whose
highest order parameter is raised to the power of p +1. In
multiple regression you can see
whether you should add an additional term for an IV by
examining a graph of the residuals
against the IV. Again, if the relationship between the residuals
and the IV appears to reverse p
times, you should add terms whose highest order parameter is
raised to the power of p + 1.
It is quite permissible to have more than one predictor variable
represented in quadratic form in
the same model. For instance:
Yi = β0 + β1Xi1 + β2Xi2 + β11Xi1² + β22Xi2² + εi
is a model with two predictor variables, both with quadratic
terms.
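For illustration, the sketch below (Python with statsmodels, simulated data) centers a predictor, adds its square, and fits the quadratic model. It shows the underlying calculation only and is not part of the SPSS procedure described next.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=120)                    # hypothetical predictor
y = 2 + 1.5 * x - 0.12 * x**2 + rng.normal(size=120)

xc = x - x.mean()                                   # center first
X = sm.add_constant(np.column_stack([xc, xc**2]))   # linear and quadratic terms
fit = sm.OLS(y, X).fit()

# A significant coefficient on the squared term indicates a quadratic effect
print("linear effect:    b1  =", fit.params[1], " p =", fit.pvalues[1])
print("quadratic effect: b11 =", fit.params[2], " p =", fit.pvalues[2])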
To perform a polynomial regression in SPSS
• Determine the highest order term that you will use for each
IV.
• Center any IVs for which you will examine higher-order
terms.
• For each IV, create new variables that are equal to your IV
raised to the powers of 2
through the power of your highest order term. Be sure to use the
centered version of your
IV.
• Conduct a standard multiple regression including all of the
terms for each IV.
Simultaneously testing categorical and continuous IVs
Both ANOVA and regression are actually based on the same set
of statistical ideas, the general
linear model. SPSS implements these functions in different
menu selections, but the basic way
that the independent variables are tested is fundamentally the
same. It is therefore perfectly
reasonable to combine both continuous and categorical predictor
variables in the same model,
even though people are usually taught to think of ANOVA and
regression as separate types of
analyses.
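As an illustration of this point outside of SPSS, the sketch below fits one general linear model containing both a categorical and a continuous predictor using Python's statsmodels formula interface. The data frame and variable names are invented.

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical data: a categorical IV (group), a continuous IV (age),
# and a continuous DV (score)
df = pd.DataFrame({
    "score": [12, 15, 11, 18, 20, 17, 14, 22, 19],
    "group": ["a", "a", "a", "b", "b", "b", "c", "c", "c"],
    "age":   [23, 31, 27, 22, 35, 29, 41, 38, 30],
})

# One general linear model with both kinds of predictors; C() marks the
# categorical factor so it is dummy-coded automatically
fit = smf.ols("score ~ C(group) + age", data=df).fit()
print(fit.params)

# An F test for each term, similar in spirit to SPSS's Tests of
# Between-Subjects Effects table
print(sm.stats.anova_lm(fit, typ=2))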
To perform an analysis in SPSS using the General Linear Model
• Choose Analyze → General Linear Model → Univariate.
• Move your DV to the box labeled Dependent Variable.
• Move any categorical IVs to the box labeled Fixed Factor(s).
• Move any continuous IVs to the box labeled Covariate(s).
• By default SPSS will include all possible interactions between
your categorical IVs, but
will only include the main effects of your continuous IVs. If
this is not the model you
want then you will need to define it by hand by taking the
following steps.
o Click the Model button.
o Click the radio button next to Custom.
o Add all of your main effects to the model by clicking all of
the IVs in the box
labeled Factors and covariates, setting the pull-down menu to
Main effects, and
clicking the arrow button.
o Add each of the interaction terms to your model. You can do
this one at a time by
selecting the variables included in the interaction in the box
labeled Factors and
covariates, setting the pull-down menu to Interaction, and
clicking the arrow
button for each of your interactions.
o You can also use the setting on the pull-down menu to tell
SPSS to add all
possible 2-way, 3-way, 4-way, or 5-way interactions that can be
made between
the selected variables to your model.
o Click the Continue button.
• Click the OK button.
The SPSS output from running an analysis using the General
Linear Model contains the
following sections.
• Between-Subjects Factors. This table just lists out the
different levels of any categorical
variables included in your model.
• Tests of Between-Subjects Effects. This table provides an F
test of each main effect or
interaction that you included in your model. It indicates whether
or not the effect can
independently account for a significant amount of variability in
your DV. This provides
the same results as testing the change in model R2 that you get
from the test of the set of
terms representing the effect.
Post-hoc comparisons in mixed models. You can ask SPSS to
provide post-hoc contrasts
comparing the different levels within any of your categorical
predictor variables by clicking the
Contrasts button in the variable selection window. If you want
to compare the means of cells
resulting from combinations of your categorical predictors, you
will need to recode them all into
a single variable as described in the section Post-hoc
comparisons for when you have two or
more factors.
The easiest way to examine the main effect of a continuous
independent variable is to graph its
relationship to the dependent variable using simple linear
regression.. You can obtain this using
the following procedure:
• Choose Analyze → Regression → Curve Estimation.
• Move your dependent variable into the Dependent(s) box
• Move your independent variable into the Independent box
• Make sure that Plot Models is checked
• Under the heading Models, make sure that only Linear is
checked
This will produce a graph of your data along with the least-
squares regression line. If you want
to look at the interaction between a categorical and a continuous
independent variable, you can
use the Select Cases function (described above) to limit this
graph to cases that have a particular
value on the categorical variable. Using this method several
times, you can obtain graphs of the
relationship between the continuous variable and the dependent
variable separately for each level
of the categorical independent variable.
Another option you might consider would be to recode the
continuous variables as categorical,
separating them into groups based on their value on the
continuous variables. You can then run a
standard ANOVA and compare the means of the dependent
variable for those high or low on the
continuous variable. Even if you decide to do this, you should
still base all of your conclusions
on the analysis that actually treated the variable as continuous.
Numerous simulations have
shown that there is greater power and less error in analyses that
treat truly continuous variables as
continuous compared to those that analyze them in a categorical
fashion.
MEDIATION
When researchers find a relationship between an independent
variable (A) and a dependent
variable (C), they may seek to uncover variables that mediate
this relationship. That is, they may
believe that the effect of variable A on variable C exists because variable A leads to a change in a mediating variable (M), which in turn affects the dependent
variable (C). When a variable fully
mediates a relationship, the effect of variable A on variable C
disappears when controlling for
the mediating variable. A variable partially mediates a
relationship when the effect of variable A
on variable C is significantly reduced when controlling for the
mediator. A common way of
expressing these patterns is the following:
[Path diagram: the independent variable (A) predicts the mediating variable (M), which in turn predicts the dependent variable (C); A also has a direct path to C.]
You need to conduct three different regression analyses to determine if you have a mediated relationship using the traditional method:
Regression 1. Predict the dependent variable (C) from the
independent variable (A). The effect
of the independent variable in this model must be significant. If
there is no direct effect of A on
C, then there is no relationship to mediate.
Regression 2. Predict the mediating variable (M) from the
independent variable (A). The effect
of the independent variable in this model must be significant. If
the independent variable does
not reliably affect the mediator, the mediator cannot be
responsible for the relationship observed
between A and C.
Regression 3. Simultaneously predict the value of the
dependent variable (C) from both the
independent variable (A) and the mediating variable (M) using
multiple regression. The effect of
the independent variable should be nonsignificant (or at least
significantly reduced, compared to
Regression 1), whereas the effect of the mediating variable must
be significant. The reduction in
the relationship between A and C indicates that the mediator is
accounting for a significant
portion of this relationship. However, if the relationship
between M and C is not significant,
then you cannot clearly determine whether M mediates the
relationship between A and C, or if A
mediates the relationship between M and C.
One can directly test for a reduction in the effect of A → C when
controlling for the mediator by
performing a Sobel Test. This involves testing the significance
of the path between A and C
through M in Regression 3. While you cannot do a Sobel Test
in SPSS, the website
http://www.unc.edu/~preacher/sobel/sobel.htm will perform this
for you online. If you wish to
show mediation in a journal article, you will almost always be
required to show the results of the
Sobel Test.
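For readers who want to see the mechanics, here is a sketch of the three regressions and the Sobel z statistic in Python (statsmodels and scipy, simulated data). The Sobel formula used is z = a*b / sqrt(b^2*sa^2 + a^2*sb^2), where a is the A-to-M coefficient and b is the M-to-C coefficient controlling for A.

import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(4)
n = 200
A = rng.normal(size=n)                        # independent variable
M = 0.5 * A + rng.normal(size=n)              # mediator
C = 0.4 * M + 0.1 * A + rng.normal(size=n)    # dependent variable

reg1 = sm.OLS(C, sm.add_constant(A)).fit()                        # A -> C
reg2 = sm.OLS(M, sm.add_constant(A)).fit()                        # A -> M
reg3 = sm.OLS(C, sm.add_constant(np.column_stack([A, M]))).fit()  # A + M -> C

print("Regression 1, A -> C:     p =", reg1.pvalues[1])
print("Regression 2, A -> M:     p =", reg2.pvalues[1])
print("Regression 3, A (with M): p =", reg3.pvalues[1])
print("Regression 3, M (with A): p =", reg3.pvalues[2])

# Sobel test of the indirect path through M
a, se_a = reg2.params[1], reg2.bse[1]
b, se_b = reg3.params[2], reg3.bse[2]
z = (a * b) / np.sqrt(b**2 * se_a**2 + a**2 * se_b**2)
p = 2 * (1 - stats.norm.cdf(abs(z)))
print(f"Sobel z = {z:.2f}, p = {p:.4f}")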
CHI-SQUARE TEST OF INDEPENDENCE
A chi-square is a nonparametric test used to determine if there
is a relationship between two
categorical variables. Let's take a simple example. Suppose a researcher brought male and female participants into the lab and asked them which color they prefer: blue or green. The
researcher believes that color preference may be related to
gender. Notice that both gender
(male, female) and color preference (blue, green) are
categorical variables. If there is a
relationship between gender and color preference, we would
expect that the proportion of men
who prefer blue would be different than the proportion of
women who prefer blue. In general,
you have a relationship between two categorical variables when
the distribution of people across
the categories of the first variable changes across the different
categories of the second variable.
To determine if a relationship exists between gender and color
preference, the chi-square test
computes the distributions across the combination of your two
factors that you would expect if
there were no relationship between them. It then compares this
to the actual distribution found
in your data. In the example above, we have a 2 (gender: male,
female) X 2 (color preference:
green, blue) design. For each cell in the combination of the two
factors, we would compute
"observed" and "expected" counts. The observed counts are
simply the actual number of
observations found in each of the cells. The expected
proportion in each cell can be determined
by multiplying the marginal proportions found in a table. For
example, let us say that 52% of all
the participants preferred blue and 48% preferred green,
whereas 40% of the all of the
participants were men and 60% were women. The expected
proportions are presented in the
table below.
Expected proportion table
                      Males    Females    Marginal proportion
Blue                  20.8%    31.2%      52%
Green                 19.2%    28.8%      48%
Marginal proportion   40%      60%
As you can see, you get the expected proportion for a particular
cell by multiplying the two
marginal proportions together. You would then determine the
expected count for each cell by
multiplying the expected proportion by the total number of
participants in your study. The chi-
square statistic is a function of the difference between the
expected and observed counts across
all your cells. Luckily you do not actually need to calculate any
of this by hand, since SPSS will
compute the expected counts for each cell and perform the chi-
square test.
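If you want to check the expected counts yourself outside of SPSS, the calculation can be reproduced with a few lines of Python (scipy). The observed counts below are invented so that the marginal percentages match the example above.

import numpy as np
from scipy import stats

# Observed counts: rows = color preference (blue, green),
# columns = gender (male, female); 100 participants in total
observed = np.array([[21, 31],
                     [19, 29]])

chi2, p, dof, expected = stats.chi2_contingency(observed, correction=False)

# Each expected count is (row total * column total) / grand total
print("expected counts:\n", expected)
print(f"chi-square = {chi2:.3f}, df = {dof}, p = {p:.3f}")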
To perform a chi-square test of independence in SPSS
• Choose Analyze → Descriptive Statistics → Crosstabs.
• Put one of the variables in the Row(s) box
• Put the other variable in the Column(s) box
• Click the Statistics button.
• Check the box next to Chi-square.
• Click the Continue button.
• Click the OK button.
The output of this analysis will contain the following sections.
• Case Processing Summary. Provides information about
missing values in your two
variables.
• Crosstabulation. Provides you with the observed counts
within each combination of
your two variables.
• Chi-Square Tests. The first row of this table will give you the
chi-square value, its
degrees of freedom and the p-value associated with the test.
Note that the p-values
produced by a chi-square test are inappropriate if the expected count is less than 5 in 20% or more of the cells. If you are in this situation, you should
either redefine your coding
scheme (combining the categories with low cell counts with
other categories) or exclude
categories with low cell counts from your analysis.
LOGISTIC REGRESSION
The chi-square test allows us to determine if a pair of
categorical variables are related. But what
if you want to test a model using two or more independent
variables? Most of the inferential
procedures we have discussed so far require that the dependent
variable be a continuous variable.
The most common inferential statistics such as t-tests,
regression, and ANOVA, require that the
residuals have a normal distribution, and that the variance is
equal across conditions. Both of
these assumptions are likely to be seriously violated if the
dependent variable is categorical. The
answer is to use logistic regression, which does not make these
assumptions and so can be used
to determine the ability of a set of continuous or categorical
independent variables to predict the
value of a categorical dependent variable. However, standard
logistic regression assumes that all
of your observations are independent, so it cannot be directly
used to test within-subject factors.
Logistic regression generates equations that tell you exactly
how changes in your independent
variables affect the probability that the observation is in a level
of your dependent variable.
These equations are based on predicting the odds that a
particular observation is in one of two
groups. Let us say that you have two groups: a reference group
and a comparison group. The
odds that an observation is in the reference group is equal to the
probability that the observation
is in the reference group divided by the probability that it is in
the comparison group. So, if there
is a 75% chance that the observation is in the reference group,
the odds of it being in the
reference group would be .75/.25 = 3. We therefore talk about
odds in the same way that people
do when betting at a racetrack.
In logistic regression, we build an equation that predicts the
logarithm of the odds from the
values of the independent variables (which is why it's called
log-istic regression). For each
independent variable in our model, we want to calculate a
coefficient B that tells us what the
change in the log odds would be if we would increase the value
of the variable by 1. These
coefficients therefore parallel those found in a standard
regression model. However, they are
somewhat difficult to interpret because they relate the
independent variables to the log odds. To
make interpretation easier, people often transform the
coefficients into odds ratios by raising the
mathematical constant e to the power of the coefficient (e^B).
The odds ratio directly tells you
how the odds increase when you change the value of the
independent variable. Specifically, the
odds of being in the reference group are multiplied by the odds
ratio when the independent
variable increases by 1.
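A short numerical illustration of odds and odds ratios (plain Python, with an invented coefficient):

import numpy as np

# Odds from a probability: p = .75 gives odds of 3
p = 0.75
odds = p / (1 - p)
print("odds:", odds)

# A logistic regression coefficient B is a change in log odds;
# exponentiating it gives the odds ratio
B = 0.8                              # hypothetical coefficient
odds_ratio = np.exp(B)
print("odds ratio:", odds_ratio)

# A one-unit increase in the IV multiplies the odds by the odds ratio
print("odds after a one-unit increase:", odds * odds_ratio)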
One obvious limitation of this procedure is that we can only
compare two groups at a time. If we
want to examine a dependent variable with three or more levels,
we must actually create several
different logistic regression equations. If your dependent
variable has k levels, you will need a
total of k-1 logistic regression equations. What people typically
do is designate a specific level
of your dependent variable as the reference group, and then
generate a set of equations that each
compares one other level of the dependent variable to that
group. You must then examine the
behavior of your independent variables in each of your
equations to determine what their
influence is on your dependent variable.
To test the overall success of your model, you can determine the
probability that you can predict
the category of the dependent variable from the values of your
independent variables. The
higher this probability is, the stronger the relationship is
between the independent variables and
your dependent variable. You can determine this probability
iteratively using maximum
likelihood estimation. If you multiply the logarithm of this
probability by -2, you will obtain a
statistic that has an approximate chi-square distribution, with
degrees of freedom equal to the
number of parameters in your model. This is referred to as
-2LL (minus 2 log likelihood) and is
commonly used to assess the fit of the model. Large values of
-2LL indicate that the observed
model has poor fit. This statistic can also be used to provide a
statistical test of the relationship
between each independent variable and your dependent variable.
The importance of each term in
the model can be assessed by examining the increase in -2LL
when the term is dropped. This
difference also has a chi-square distribution, and can be used as
a statistical test of whether there
is an independent relationship between each term and the
dependent variable.
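The sketch below shows the logic of this likelihood ratio test outside of SPSS (Python with statsmodels and scipy, simulated data): fit the full model and a model that drops one term, take the difference in -2LL, and compare it to a chi-square distribution.

import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(5)
n = 300
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
logit_p = -0.3 + 1.0 * x1 + 0.5 * x2
y = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))

full = sm.Logit(y, sm.add_constant(np.column_stack([x1, x2]))).fit(disp=0)
reduced = sm.Logit(y, sm.add_constant(x1)).fit(disp=0)   # model without x2

# -2LL for each model, and the chi-square test on their difference,
# which tests whether x2 contributes beyond the other predictor
m2ll_full = -2 * full.llf
m2ll_reduced = -2 * reduced.llf
lr = m2ll_reduced - m2ll_full
p = stats.chi2.sf(lr, df=1)      # df = number of parameters dropped
print(f"-2LL full = {m2ll_full:.1f}, -2LL reduced = {m2ll_reduced:.1f}")
print(f"LR chi-square = {lr:.2f}, p = {p:.4f}")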
To perform a logistic regression in SPSS
• Choose Analyze → Regression → Multinomial Logistic.
• Move the categorical DV to the Dependent box.
• Move your categorical IVs to the Factor(s) box.
• Move your continuous independent variables to the
Covariate(s) box.
• By default, SPSS does not include any interaction terms in
your model. You will need to
click the Model button and manually build your model if you
want to include any
interactions.
• When you are finished, click the OK button to tell SPSS to
perform the analysis.
If your dependent variable only has two groups, you have the
option of selecting Analyze → Regression → Binary Logistic. Though this performs the
same basic analysis, this procedure
is primarily designed to perform model building. It organizes
the output in a less straightforward
way and does not provide you with the likelihood ratio test for
each of your predictors. You are
therefore better off using this selection only if you are specifically interested in the model-building procedures that it offers.
NOTE: The results from a binary logistic analysis in SPSS will
actually produce coefficients
that are opposite in sign when compared to the results of a
multinomial logistic regression
performed on exactly the same data. This is because the binary
procedure chooses to predict the
probability of choosing the category with the largest indicator
variable, while the multinomial
procedure chooses to predict the probability of choosing the
category with the smallest indicator
variable.
The Multinomial Logistic procedure will produce output with
the following sections.
• Case Processing Summary. Describes the levels of the
dependent variable and any
categorical independent variables.
• Model Fitting Information. Tells you the -2LL of both a null
model containing only the
intercept and the full model being tested. Recall that this
statistic follows a chi-square
distribution and that significant values indicate that there is a
significant amount of
variability in your DV that is not accounted for by your model.
• Pseudo R-Square. Provides a number of statistics that
researchers have developed to
represent the ability of a logistic regression model to account
for variability in the
dependent variable. Logistic regression does not have a true R-
square statistic because
the amount of variance is partly determined by the distribution
of the dependent variable.
The more even the observations are distributed among the levels
of the dependent
variable, the greater the variance in the observations. This
means that the R-square
values for models that have different distributions are not
directly comparable. However,
these statistics can be useful for comparing the fit of different
models predicting the same
response variable. The most commonly reported pseudo R-
square estimate is
Nagelkerke's R-square, which is provided by SPSS in this
section.
• Likelihood Ratio Tests. Provides the likelihood ratio tests for
the IVs. The first column
of the table contains the -2LL (a measurement of model error
having a chi-square
distribution) of a model that does not include the factor listed in
the row. The value in
the first row (labeled Intercept) is actually the -2LL for the full
model. The second
column is the difference between the -2LL for the full model and the -2LL for the model
that excludes the factor listed in the row. This is a measure of
the amount of variability
that is accounted for by the factor. This difference parallels the
Type III SS in a
regression model, and follows a chi-square distribution with
degrees of freedom equal to
the number of parameters it takes to code the factor. The final
column provides the p-
value for the test of the null hypothesis that the amount of error
in the model that
excludes the factor is the same as the amount of error in the full
model. A significant
statistic indicates that the factor does account for a significant
amount of the variability in
the dependent variable that is not captured by other variables in
the model.
• Parameter Estimates. Provides the specific coefficients of the
logistic regression
equations. You will have a number of equations equal to the
number of levels in your
dependent variable minus 1. Each equation predicts the log odds of
your observations being
in the highest numbered level of your dependent variable
compared to another level
(which is listed in the leftmost column of the chart). Within
each equation, you will see
estimates of the logistic regression coefficient for
each variable in the
model. These coefficients tell you the increase in the log odds
when the variable
increases by 1 (assuming everything else is held constant). The
next column contains the
standard errors of those coefficients. The Wald Statistic
provides another statistic testing
the significance of the individual coefficients, and is based on
the relationship between
the coefficient and its standard error. However, there is a flaw
in this statistic such that
large coefficients may have inappropriately large standard
errors, so researchers typically
prefer to use the likelihood ratio test to determine the
importance of individual factors in
the model. SPSS provides the odds ratio for the parameter
under the column Exp(B).
The last two columns in the table provide the upper and lower
bounds for a 95%
confidence interval around the odds ratio.
RELIABILITY
Ideally, the measurements that we take with a scale would
always replicate perfectly. However,
in the real world there are a number of external random factors
that can affect the way that
respondents provide answers to a scale. A particular
measurement taken with the scale is
therefore composed of two factors: the theoretical "true score"
of the scale and the variation
caused by random factors. Reliability is a measure of how much
of the variability in the observed
scores actually represents variability in the underlying true
score. Reliability ranges from 0 to 1.
In psychology it is preferred to have scales with reliability
greater than .7.
The reliability of a scale is heavily dependent on the number of
items composing the scale. Even
using items with poor internal consistency, you can get a
reliable scale if your scale is long
enough. For example, 10 items that have an average inter-item
correlation of only .2 will produce
a scale with a reliability of .714. However, the benefit of adding
additional items decreases as the
scale grows larger, and mostly disappears after 20 items. One
consequence of this is that adding
extra items to a scale will generally increase the scale's
reliability, even if the new items are not
particularly good. An item will have to significantly lower the
average inter-item correlation for
it to have a negative impact on reliability.
Reliability has specific implications for the utility of your scale.
The most that responses to your
scale can correlate with any other variable is equal to the square
root of the scale's reliability.
The variability in your measure will prevent anything higher.
Therefore, the higher the reliability
of your scale, the easier it is to obtain significant findings. This
is probably what you should
think about when you want to determine if your scale has a high
enough reliability.
It should also be noted that low reliability does not call into
question results obtained using a
scale. Low reliability only hurts your chances of finding
significant results. It cannot cause you
to obtain false significance. If anything, finding significant
results with an unreliable scale
indicates that you have discovered a particularly strong effect,
since it was able to overcome the
hindrances of your unreliable scale. In this way, using a scale
with low reliability is analogous to
conducting an experiment with a small number of participants.
Calculating reliability from parallel measurements
One way to calculate reliability is to correlate the scores on
parallel measurements of the scale.
Two measurements are defined as parallel if they are distinct
(are based on different data) but
equivalent (such that you expect responses to the two
measurements to have the same true score).
The two measurements must be performed on the same (or
matched) respondents so that the
correlation can be performed. There are a number of different
ways to measure reliability using
parallel measurements. Below are several examples.
Test-Retest method. In this method, you have respondents
complete the scale at two different
points in time. The reliability of the scale can then be estimated
by the correlation between the
two scores. The accuracy of this method rests on the assumption
that the participants are
fundamentally the same (i.e., possess the same true score on
your scale) during your two test
periods. One common problem is that completing the scale the
first time can change the way that
respondents complete the scale the second time. If they
remember any of their specific responses
from the first period, for example, it could artificially inflate
the reliability estimate. When using
this method, you should present evidence that this is not an
issue.
Alternate Forms method. This method, also referred to as
parallel forms, is basically the same
as the Test-Retest method, but with the use of different versions
of the scale during each session.
The use of different versions reduces the likelihood that the
first administration of the scale
influences responses to the second. The reliability of the scale
can then be estimated by the
correlation between the two scores. When using alternate forms,
you should show that the
administration of the first scale did not affect responses to the
second and that the two versions
of your scale are essentially the same. The use of this method is
generally preferred to the Test-
Retest method.
Split-Halves method. One difficulty with both the Test-Retest
and the Alternate Forms methods
is that the scale responses must be collected at two different
points in time. This requires more
work and introduces the possibility that some natural event
might change the actual true score
between the two administrations of the scale. In the Split-
Halves method you only have
respondents fill out your scale one time. You then divide your
scale items into two sections (such
as the even-numbered items and the odd-numbered items) and
calculate a score for each half.
You then determine the correlation between these two scores.
Unlike the other methods, this
correlation does not estimate your scale's reliability. Instead,
you get your estimate using the
formula:
ρ̂ = 2r / (1 + r),
where ρ̂ is the reliability estimate and r is the correlation that you obtain.
Note that if you split your scale in different ways, you will
obtain different reliability estimates.
Assuming that there are no confounding variables, all split-
halves should be centered on the true
reliability. In general it is best not to use a first half/second half
split of the questionnaire since
respondents may become tired as they work through the scale.
This would mean that you would
expect greater variability in the score from the second half than
in the score from the first half.
In this case, your two measurements are not actually parallel,
making your reliability estimate
invalid. A more acceptable method would be to divide your
scale into sections of odd-numbered
and even-numbered items.
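The split-halves calculation is easy to reproduce outside of SPSS. The sketch below (Python, simulated item responses) correlates the odd-item and even-item half scores and then applies the step-up formula given above (the Spearman-Brown correction for a doubled test length).

import numpy as np

# Hypothetical responses: rows = 100 respondents, columns = 6 scale items
rng = np.random.default_rng(6)
true_score = rng.normal(size=(100, 1))
items = true_score + rng.normal(scale=1.0, size=(100, 6))

odd_half = items[:, 0::2].sum(axis=1)    # items 1, 3, 5
even_half = items[:, 1::2].sum(axis=1)   # items 2, 4, 6

r = np.corrcoef(odd_half, even_half)[0, 1]
reliability = 2 * r / (1 + r)
print(f"half-score correlation r = {r:.3f}")
print(f"estimated reliability    = {reliability:.3f}")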
Calculating reliability from internal consistency
The other way to calculate reliability is to use a measure of
internal consistency. The most
popular of these reliability estimates is Cronbach's alpha.
Cronbach's alpha can be obtained
using the equation:
α = Nr / (1 + (N - 1)r),
where α is Cronbach's alpha, N is the number of items in the
scale, and r is the mean inter-item
correlation. From the equation we can see that α increases both
with increasing r as well as with
increasing N. Calculating Cronbach's alpha is the most
commonly used procedure to estimate
reliability. It is highly accurate and has the advantage of only
requiring a single administration of
the scale. The only real disadvantage is that it is difficult to
calculate by hand, as it requires you
to calculate the correlation between every single pair of items in
your scale. This is rarely an
issue, however, since SPSS will calculate it for you
automatically.
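If you want to see the formula in action outside of SPSS, here is a sketch in Python. It reproduces the ten-item example from earlier in this section and also computes alpha from simulated raw item responses using the mean inter-item correlation, which is the version of the formula given above.

import numpy as np

def cronbach_alpha(items):
    # Alpha from the mean inter-item correlation: N*r / (1 + (N - 1)*r)
    n_items = items.shape[1]
    corr = np.corrcoef(items, rowvar=False)
    mean_r = (corr.sum() - n_items) / (n_items * (n_items - 1))
    return n_items * mean_r / (1 + (n_items - 1) * mean_r)

# Example from the text: 10 items with an average inter-item correlation of .2
N, r = 10, 0.2
print(N * r / (1 + (N - 1) * r))    # 0.714

# The same formula applied to hypothetical raw item data
rng = np.random.default_rng(7)
data = rng.normal(size=(100, 1)) + rng.normal(size=(100, 5))
print(cronbach_alpha(data))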
To obtain the α of a set of items in SPSS:
• Choose Analyze → Scale → Reliability Analysis.
• Move all of the items in the scale to the Items box.
• Click the Statistics button.
• Check the box next to Scale if item deleted.
• Click the Continue button.
• Click the OK button.
Note: Before performing this analysis, make sure all items are
coded in the same direction. That
is, for every item, larger values should consistently indicate
either more of the construct or less
of the construct.
The output from this analysis will include a single section titled
Reliability. The reliability of
your scale will actually appear at the bottom of the output next
to the word Alpha. The top of
this section contains information about the consistency of each
item with the scale as a whole.
You use this to determine whether there are any "bad items" in
your scale (i.e., ones that are not
representing the construct you are trying to measure). The
column labeled Corrected Item-
Total Correlation tells you the correlation between each item
and the average of the other items
in your scale. The column labeled Alpha if Item Deleted tells
you what the reliability of your
scale would be if you would delete the given item. You will
generally want to remove any items
where the reliability of the scale would increase if it were
deleted, and you want to keep any
items where the reliability of the scale would drop if it were
deleted. If any of your items have a
negative item-total correlation it may mean that you
forgot to reverse code the item.
Inter-rater reliability
A final type of reliability that is commonly assessed in
psychological research is called "inter-rater reliability." Inter-rater reliability is used when judges are
asked to code some stimuli, and
the analyst wants to know how much those judges agree. If the
judges are making continuous
ratings, the analyst can simply calculate a correlation between
the judges' responses. More
commonly, judges are asked to make categorical decisions about
stimuli. In this case, reliability
is assessed via Cohen's kappa.
To obtain Cohen's kappa in SPSS, you first must set up your
data file in the appropriate manner.
The codes from each judge should be represented as separate
variables in the data set. For
example, suppose a researcher asked participants to list their
thoughts about a persuasive
message. Each judge was given a spreadsheet with one thought
per row. The two judges were
then asked to code each thought as: 1 = neutral response to the
message, 2 = positive response to
the message, 3 = negative response to the message, or 4 =
irrelevant thought. Once both judges
have rendered their codes, the analyst should create an SPSS
data file with two columns, one for
each judge's codes.
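Outside of SPSS, the same statistic can be computed with a few lines of Python (scikit-learn and pandas); the judges' codes below are invented for illustration.

import pandas as pd
from sklearn.metrics import cohen_kappa_score

# Hypothetical codes from the two judges, one entry per thought
# (1 = neutral, 2 = positive, 3 = negative, 4 = irrelevant)
judge_a = [1, 1, 2, 3, 4, 2, 3, 3, 1, 4]
judge_b = [1, 2, 2, 3, 4, 2, 3, 2, 1, 4]

kappa = cohen_kappa_score(judge_a, judge_b)
print(f"Cohen's kappa = {kappa:.3f}")

# The crosstabulation that SPSS displays can be reproduced as a frequency table
print(pd.crosstab(pd.Series(judge_a, name="Judge A"),
                  pd.Series(judge_b, name="Judge B")))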
To obtain Cohen's kappa in SPSS
• Choose Analyze → Descriptive Statistics → Crosstabs.
• Place Judge A's responses in the Row(s) box.
• Place Judge B's responses in the Column(s) box.
• Click the Statistics button.
• Check the box next to Kappa.
• Click the Continue button.
• Click the OK button.
The output from this analysis will contain the following
sections.
• Case Processing Summary. Reports the number of observations
on which you have
ratings from both of your judges.
• Crosstabulation. This table lists all the reported values from
each judge and the number
of times each combination of codes was rendered. For example,
assuming that each
judge used all the codes in the thought-listing example (e.g.,
code values 1 through 4), the
output would contain a cross-tabulation table like this:
Judge A * Judge B Crosstabulation
Count
                     Judge B
             1.00    2.00    3.00    4.00    Total
Judge A
    1.00        5       1                        6
    2.00                5       1                6
    3.00                1       7                8
    4.00                                7        7
Total           5       7       8       7       27
The counts on the diagonal represent agreements. That is, these
counts represent the
number of times both Judges A and B coded a thought with a 1,
2, 3, or 4. The more
agreements, the better the inter-rater reliability. Values not on
the diagonal represent
disagreements. In this example, we can see that there was one
occasion when Judge A
coded a thought in category 1 but Judge B coded that same
thought in category 2.
• Symmetric Measures. The value of kappa can be found in this
section at the
intersection of the Kappa row and the Value column. This
section also reports a p-value
for the Kappa, but this is not typically used in reliability
analysis.
Note that a kappa cannot be computed on a non-symmetric
table. For instance, if Judge A had
used codes 1 through 4, but Judge B never used code 1 at all, the table
would not be symmetric. This is
because there would be 4 rows for Judge A but only 3 columns
for Judge B. Should you have
this situation, you should first determine which values are not
used by both judges. You then
change each instance of these codes to some other value that is
not the value chosen by the
opposite judge. Since the original code was a mismatch, you
can preserve the original amount of
agreement by simply changing the value to a different
mismatch. This way you can remove the
unbalanced code from your scheme while retaining the
information from every observation. You
can then use the kappa obtained from this revised data set as an
accurate measure of the
reliability of the original codes.
FACTOR ANALYSIS
Factor analysis is a collection of methods used to examine how
underlying constructs influence
the responses on a number of measured variables. There are
basically two types of factor
analysis: exploratory and confirmatory. Exploratory factor
analysis (EFA) attempts to discover
the nature of the constructs influencing a set of responses.
Confirmatory factor analysis (CFA)
tests whether a specified set of constructs is influencing
responses in a predicted way. SPSS
only has the capability to perform EFA. CFAs require a
program with the ability to perform
structural equation modeling, such as LISREL or AMOS.
The primary objectives of an EFA are to determine the number
of factors influencing a set of
measures and the strength of the relationship between each
factor and each observed measure.
To perform an EFA, you first identify a set of variables that you
want to analyze. SPSS will then
examine the correlation matrix between those variables to
identify those that tend to vary
together. Each of these groups will be associated with a factor
(although it is possible that a
single variable could be part of several groups and several
factors). You will also receive a set of
factor loadings, which tells you how strongly each variable is
related to each factor. They also
allow you to calculate factor scores for each participant by
multiplying the response on each
variable by the corresponding factor loading. Once you identify
the construct underlying a
factor, you can use the factor scores to tell you how much of
that construct is possessed by each
participant.
Some common uses of EFA are to:
• Identify the nature of the constructs underlying responses in a
specific content area.
• Determine what sets of items ``hang together'' in a
questionnaire.
• Demonstrate the dimensionality of a measurement scale.
Researchers often wish to
develop scales that respond to a single characteristic.
• Determine what features are most important when classifying
a group of items.
• Generate ``factor scores'' representing values of the
underlying constructs for use in other
analyses.
• Create a set of uncorrelated factor scores from a set of highly
collinear predictor
variables.
• Use a small set of factor scores to represent the variability
contained in a larger set of
variables. This is often referred to as data reduction.
It is important to note that EFA does not produce any statistical
tests. It therefore cannot
provide concrete evidence that a particular structure exists in
your data; it can only suggest
what patterns may be present. If you want to actually test whether
a particular structure exists in
your data you should use CFA, which does allow you to test
whether your proposed structure is
able to account for a significant amount of variability in your
items.
EFA is strongly related to another procedure called principal
components analysis (PCA). The
two have basically the same purpose: to identify a set of
underlying constructs that can account
for the variability in a set of variables. However, PCA is based
on a different statistical model,
and produces slightly different results when compared to EFA.
EFA tends to produce better
results when you want to identify a set of latent factors that
underlie the responses on a set of
measures, whereas PCA works better when you want to perform
data reduction. Although SPSS
says that it performs "factor analysis," statistically it actually
performs PCA. The differences are
slight enough that you will generally not need to be concerned
about them; you can use the
results from a PCA for all of the same things that you would the
results of an EFA. However, if
you want to identify latent constructs, you should be aware that
you might be able to get slightly
better results if you used a statistical package that can actually
perform EFA, such as SAS,
AMOS, or LISREL.
Factor analyses require a substantial number of subjects to
generate reliable results. As a general
rule, the minimum sample size should be the larger of 100 or 5
times the number of items in your
factor analysis. Though you can still conduct a factor analysis
with fewer subjects, the results
will not be very stable.
To perform an EFA in SPSS
• Choose Analyze → Data Reduction → Factor.
• Move the variables you want to include in your factor analysis
to the Variables box.
• If you want to restrict the factor analysis to those cases that
have a particular value on a
variable, you can put that variable in the Selection Variable box
and then click Value to
tell SPSS which value you want the included cases to have.
• Click the Extraction button to indicate how many factors you
want to extract from your
items. The maximum number of factors you can extract is equal
to the number of items
in your analysis, although you will typically want to examine a
much smaller number.
There are several different ways to choose how many factors to
examine. First, you may
want to look for a specific number of factors for theoretical
reasons. Second, you can
choose to keep factors that have eigenvalues over 1. A factor
with an eigenvalue of 1 is
able to account for the amount of variability present in a single
item, so factors that
account for less variability than this will likely not be very
meaningful. A final method is
to create a Scree Plot, where you graph the amount of
variability that each of the factors
is able to account for in descending order. You then use all the
factors that occur prior to
the last major drop in the amount of variance accounted for. If
you wish to use this
method, you should run the factor analysis twice - once to
generate the Scree plot, and a
second time where you specify exactly how many factors you
want to examine.
• Click the Rotation button to select a rotation method. Though
you do not need to rotate
your solution, using a rotation typically provides you with more
interpretable factors by
locating solutions with more extreme factor loadings. There are
two broad classes of
rotations: orthogonal and oblique. If you choose an orthogonal
rotation, then your
resulting factors will all be uncorrelated with each other. If you
choose an oblique
rotation, you allow your factors to be correlated. Which you
should choose depends on
your purpose for performing the factor analysis, as well as your
beliefs about the
constructs that underlie responses to your items. If you think
that the underlying
constructs are independent, or if you are specifically trying to
get a set of uncorrelated
factor scores, then you should clearly choose an orthogonal
rotation. If you think that the
underlying constructs may be correlated, then you should
choose an oblique rotation.
Varimax is the most popular orthogonal rotation, whereas Direct
Oblimin is the most
popular oblique rotation. If you decide to perform a rotation on
your solution, you
usually ignore the parts of the output that deal with the initial
(unrotated) solution since
the rotated solution will generally provide more interpretable
results. If you want to use
direct oblimin rotation, you will also need to specify the
parameter delta. This parameter
influences the extent that your final factors will be correlated.
Negative values lead to
lower correlations whereas positive values lead to higher
correlations. You should not
choose a value over .8 or else the high correlations will make it
very difficult to
differentiate the factors.
• If you want SPSS to save the factor scores as variables in your
data set, then you can
click the Scores button and check the box next to Save as
variables.
• Click the OK button when you are ready for SPSS to perform
the analysis.
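The same analysis can be submitted through syntax. The sketch below assumes hypothetical item names item01 to item20 and requests a principal-components extraction keeping factors with eigenvalues over 1, a scree plot, a varimax rotation, and saved factor scores:
factor
  /variables item01 to item20
  /plot eigen
  /criteria mineigen(1)
  /extraction pc
  /rotation varimax
  /save reg(all).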
The output from a factor analysis will vary depending on the
type of rotation you chose. Both
orthogonal and oblique rotations will contain the following
sections.
• Communalities. The communality of a given item is the
proportion of its variance that
can be accounted for by your factors. In the first column you'll
see that the communality
for the initial extraction is always 1. This is because the full
set of factors is specifically
designed to account for the variability in the full set of items.
The second column
provides the communalities of the final set of factors that you
decided to extract.
• Total Variance Explained. Provides you with the eigenvalues
and the amount of
variance explained by each factor in both the initial and the
rotated solutions. If you
requested a Scree plot, this information will be presented in a
graph following the table.
• Component Matrix. Presents the factor loadings for the initial
solution. Factor loadings
can be interpreted as standardized regression coefficients from
regressing each measure on the
factors. Factor loadings less than .3 are considered weak,
loadings between .3 and .6
are considered moderate, and loadings greater than .6 are
considered to be large.
Factor analyses using an orthogonal rotation will include the
following section.
• Rotated Component Matrix. Provides the factor loadings for
the orthogonal rotation.
The rotated factor loadings can be interpreted in the same way
as the unrotated factor
loadings.
• Component Transformation Matrix. Provides the correlations
between the factors in
the original and in the rotated solutions.
Factor analyses using an oblique rotation will include the
following sections.
• Pattern Matrix. Provides the factor loadings for the oblique
rotation. The rotated factor
loadings can be interpreted in the same way as the unrotated
factor loadings.
• Structure Matrix. Holds the correlations between the factors
and each of the items.
This is not going to look the same as the pattern matrix because
the factors themselves
can be correlated. This means that an item can have a factor
loading of zero for one
factor but still be correlated with the factor, simply because it
loads on other factors that
are correlated with the first factor.
• Component Correlation Matrix. Provides you with the
correlations among your rotated
factors.
After you obtain the factor loadings, you will want to come up
with a theoretical interpretation of
each of your factors. You define a factor by considering the
possible constructs that could be
responsible for the observed pattern of positive and negative
loadings. You should examine the
items that have the largest loadings and consider what they have
in common. To ease
interpretation, you have the option of multiplying all of the
loadings for a given factor by -1.
This essentially reverses the scale of the factor, allowing you,
for example, to turn an
``unfriendliness'' factor into a ``friendliness'' factor.
VECTORS AND LOOPS
Vectors and loops are two tools drawn from computer
programming that can be very useful
when manipulating data. Their primary use is to perform a large
number of similar computations
using a relatively small program. Some of the more complicated
types of data manipulation can
only reasonably be done using vectors and loops.
A vector is a set of variables that are linked together because
they represent similar things. The
purpose of the vector is to provide a single name that can be
used to access any of the entire set
of variables. A loop is used to tell the computer to perform a
set of procedures a specified
number of times. Often we need to perform the same
transformation on a large number of
variables. By using a loop, we only need to define the
transformation once, and can then tell the
computer to do the same thing to all the variables using a loop.
If you have computer-programming experience then you have
likely come across these ideas
before. However, what SPSS calls a "vector" is typically
referred to as an "array" in most
programming languages. If you are familiar with arrays and
loops from a computer-
programming course, you are a step ahead. Vectors and loops
are used in data manipulation in
more or less the same way that arrays and loops are used in
standard computer programming.
Vectors
Vectors can only be defined and used in syntax. Before you can
use a vector you first need to
define it. You must specify the name of the vector and list what
variables are associated with it.
Variables referenced by a vector are called "elements" of that
vector. You declare a vector using
the following syntax.
vector Vname = varX1 to varX2.
If the variables in the vector have not already been declared,
you can do so as part of the
vector statement. For more information on this, see page 904 of
the SPSS Base Syntax
Reference Guide. The following are all acceptable vector
declarations.
vector V = v1 to v8.
vector Myvector = entry01 to entry64.
vector Grade = grade1 to grade12.
vector Income = in1992 to in2000.
The vector is given the name Vname and is used to reference a
set of variables defined by the
variable list. The elements in the vector must be declared using
the syntax first variable to last
variable. You cannot list them out individually. This means
that the variables to be included in
a vector must all be grouped together in your data set.
Vectors can be used in transformation statements just like
variables. However, the vector itself
isn't able to hold values. Instead, the vector acts as a mediator
between your statement and the
variables it references. The variables included in a vector are
placed in a specific order,
determined by the declaration statement. So if you give SPSS a
vector and an order number
(referred to as the index), it knows what specific element you
want to access. You do not need to
know what the exact name of the variable is - you just need to
know its location in the vector.
References to items within a vector are typically made using the
format
vname (index)
where vname is the name of the vector, and index is the
numerical position of the desired
element. Using this format, you can use a vector to reference a
variable in any place that you
would normally insert a variable name. For example, all of the
following would be valid SPSS
statements, assuming that we had defined the four variables
above.
compute V(4) = 6.
if (Myvector(30)='house') correct = correct + 1.
compute sum1 = Grade(1) + Grade(2) + Grade(3).
compute change = Income(9) - Income(1).
Note that the index used by a vector only takes into account the
position of elements in the vector
- not the names of the variables. To reference the variable
in1993 in the Income vector
above, you would use the phrase income(2), not income(1993).
Using vectors this way doesn't provide us with much of an
advantage - we are not really saving
ourselves any effort by referring to a particular variable as
Myvector(1) instead of entry01. The
advantage comes from the fact that the index of the vector
itself can be a variable. In this case,
the element that the vector will reference will depend on the
value of the index variable. So the
exact variable that is changed by the statement
compute Grade(t) = Grade(t) + 1.
depends on the value of t when this statement is executed. If t
has the value of 1, then the
variable grade1 will be incremented by 1. If t has a value of 8,
then the variable grade8 will be
incremented by 1. This means that the same statement can be
used to perform many different
things, simply depending on what value you assign to t. This
allows you to use vectors to write
"generic" sections of code, where you control exactly what the
code does by assigning different
values to the index variables.
Loops
Vectors are most useful when they are combined with loops. A
loop is a statement that lets you
tell the computer to perform a set of commands a specified
number of times. In SPSS you can
tell the computer to perform a loop by using the following code:
loop loop_variable = lower_limit to upper_limit.
--commands to be repeated appear here--
end loop.
When SPSS encounters a loop statement, what it does first is set
the value of the loop variable to
be equal to the lower limit. It then performs all of the
commands inside the loop until it reaches
the end loop statement. At that point the computer adds 1 to the
loop variable, and then
compares it to the upper limit. If the new value of the loop
variable is less than or equal to the
upper limit, it goes back to the beginning of the loop and goes
through all of the commands
again. If the new value is greater than the upper limit, the
computer then moves to the statement
after the end loop statement. Basically, this means that the
computer performs the statements
inside the loop a total number of times equal to (upper limit -
lower limit + 1).
The following is an example of an SPSS program that uses a
loop to calculate a sum:
compute x = 0.
loop #t = 4 to 8.
+ compute x = x + #t.
end loop.
The first line simply initializes the variable x to the value
of zero. The second line defines
the conditions of the loop. The loop variable is named #t, and
starts with a value of 4. The loop
cycles until the value of #t is greater than 8. This causes the
program to perform a total of 5
cycles. During each cycle the current value of #t is added to x.
At the end of this set of statements,
the variable x would have the value of 4 + 5 + 6 + 7 + 8 = 30.
In this example, the loop variable is denoted as a "scratch
variable" because its first character is a
number sign (#). When something is denoted as a scratch
variable in SPSS it is not saved in the
final data set. Typically we are not interested in storing the
values of our loop variables, so it is
common practice to denote them as scratch variables. For more
information on scratch variables
see page 32 of the SPSS Base Syntax Reference Guide.
You will also notice the plus sign (+) placed before the compute
statement in line 3. SPSS needs
you to start all new commands in the first column of each line.
Here we wish to indent the
command to indicate that it is part of the loop. We therefore put
the plus symbol in the first
column which tells SPSS that the actual command starts later on
the line.
Just in case you were wondering, the first statement setting x =
0 is actually necessary for the
sum to be calculated. In SPSS syntax, new variables start with
missing values. Adding anything to a missing value produces a
missing value, so we must
explicitly start the variable x at zero to be able to obtain the
sum.
The Power of Combining Vectors and Loops
Though you can work with vectors and loops alone, they were
truly designed to be used together.
A combination of vectors and loops can save you incredible
amounts of time when performing
certain types of repetitive transformations. Consider the
characteristics of vectors and loops. A
vector lets you reference a set of related variables using a single
name and an index. The index
can be a variable or a mathematical expression involving one or
more variables. A loop
repeatedly performs a set of commands, incrementing a loop
variable after each cycle. What
would happen if a statement inside of a loop referenced a vector
using the loop variable as the
index? During each cycle, the loop variable increases by 1. So
during each cycle, the vector
would refer to a different variable. If you correctly design the
upper and lower limits of your
loop, you could use a loop to perform a transformation on every
element of a vector.
For an example, let's say that you conducted a reaction-time
study where research participants
observed strings of letters on the screen and judged whether
they composed a real word or not. In
your study, you had a total of 200 trials in several experimental
conditions. You want to analyze
your data with an ANOVA to see if the reaction time varies by
condition, but you find that the
data has a right skew (which is common). To use ANOVA, you
will need to transform the data
so that it has a normal distribution, which involves taking the
logarithm of the response time on
each trial. In terms of your data set, what you need is a set of
200 new variables whose values are
equal to the logarithms of the 200 response time variables.
Without using vectors or loops, you
would need to write 200 individual transformation statements to
create each log variable from
the corresponding response time variable. Using vectors and
loops, however, we can do the same
work with the following simple program. The program assumes
that the original response time
variables are rt001 to rt200, and the desired log variables will
be lrt001 to lrt200.
vector Rtvector = rt001 to rt200.
vector Lvector = lrt001 to lrt200.
loop #item = 1 to 200.
+ compute Lvector(#item) = log(Rtvector(#item)).
end loop.
The first two statements set up a pair of vectors, one to
represent the original response time
variables and one to represent the transformed variables. The
third statement creates a loop with
200 cycles. Each cycle of the loop corresponds to a trial in the
experiment. The fourth line
actually performs the desired transformation. During each cycle
it takes one variable from
Lvector and sets it equal to the log of the corresponding
variable in Rtvector. The fifth
line simply ends the loop. By the time this program completes,
it will have created 200 new
variables holding the log values that you desire.
In addition to greatly reducing the number of programming
lines, there are other advantages to
performing transformations using vectors and loops. If you need
to make a change to the
transformation you only need to change a single statement. If
you write separate transformations
for each variable, you must change every single statement
anytime you want to change the
specifics of the transformation. It is also much easier to read
programs that use loops than
programs with large numbers of transformation statements. The
loops naturally group together
transformations that are all of the same type, whereas with a list
you must examine each
individual transformation to find out what it does.
T TESTS
Many analyses in psychological research involve testing
hypotheses about means or mean
differences. Below we describe the SPSS procedures that allow
you to determine if a given
mean is equal to either a fixed value or some other mean.
One-sample t-test
You perform a one-sample t-test when you want to determine if
the mean value of a target
variable is different from a hypothesized value.
To perform a one-sample t-test in SPSS
• Choose Analyze → Compare Means → One-sample t-test.
• Move the variable of interest to the Test variable(s) box.
• Change the test value to the hypothesized value.
• Click the OK button.
The output from this analysis will contain the following
sections.
• One-Sample Statistics. Provides the sample size, mean,
standard deviation, and
standard error of the mean for the target variable.
• One-Sample Test. Provides the results of a t-test comparing
the mean of the target
variable to the hypothesized value. A significant test statistic
indicates that the sample
mean differs from the hypothesized value. This section also
contains the upper and lower
bounds for a 95% confidence interval around the difference between the sample mean and the hypothesized value.
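In syntax, this test is a single T-TEST command; score is a hypothetical variable name and 50 a hypothetical test value:
t-test
  /testval=50
  /variables=score.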
Independent-samples t-test
You perform an independent-samples t-test (also called a
between-subjects t-test) when you want
to determine if the mean value on a given target variable for one
group differs from the mean
value on the target variable for a different group. This test is
only valid if the two groups have
entirely different members. To perform this test in SPSS you
must have a variable representing
group membership, such that different values on the group
variable correspond to different
groups.
To perform an independent-samples t-test in SPSS
• Choose Analyze → Compare Means → Independent-samples
t-test.
• Move the target variable to the Test variable(s) box.
• Move the group variable to the Grouping variable box.
• Click the Define groups button.
• Enter the values corresponding to the two groups you want to
compare in the boxes
labeled Group 1 and Group 2.
• Click the Continue button.
• Click the OK button.
The output from this analysis will contain the following
sections.
• Group Statistics. Provides descriptive information about your
two groups, including the
sample size, mean, standard deviation, and the standard error of
the mean.
• Independent Samples Test. Provides the results of two t-tests
comparing the means of
your two groups. The first row reports the results of a test
assuming that the two
variances are equal, while the second row reports the results of
a test that does not
assume the two variances are equal. The columns labeled
Levene’s Test for Equality of
Variances report an F test comparing the variances of your two
groups. If the F test is
significant then you should use the test in the second row. If it
is not significant then you
should use the test in the first row. A significant t-test
indicates that the two groups have
different means. The last two columns provide the upper and
lower bounds for a 95%
confidence interval around the difference between your two
groups.
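A syntax sketch of this test, assuming a hypothetical outcome variable score and a grouping variable group coded 1 and 2:
t-test groups=group(1 2)
  /variables=score.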
Paired-samples t-test
You perform a paired samples t-test (also called a within-
subjects t-test) when you want to
determine whether a single group of participants differs on two
measured variables. Probably
the most common use of this test would be to compare
participants' response on a measure
before a manipulation to their response after a manipulation.
This test works by first computing
a difference score for each participant between the within-
subject conditions (e.g., post-test minus pre-
test). The mean of these difference scores is then compared to
zero. This is the same thing as
determining whether there is a significant difference between
the means of the two variables.
To perform a paired-samples t-test in SPSS
• Choose Analyze → Compare Means → Paired-samples t-test.
• Click the two variables you want to compare in the box on the
left-hand side.
• Click the arrow button.
• Click the OK button.
The output from this analysis will contain the following
sections.
• Paired Samples Statistics. Provides descriptive information
about the two variables,
including the sample size, mean, standard deviation, and the
standard error of the mean.
• Paired Samples Correlations. Provides the correlation
between the two variables.
• Paired Samples Test. Provides the results of a t-test
comparing the means of the two
variables. A significant t-test indicates that there is a
difference between the two
variables. It also contains the upper and lower bounds of a
95% confidence interval
around the difference between the two means.
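A syntax sketch, assuming hypothetical variables named pretest and posttest:
t-test pairs=pretest with posttest (paired).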
ANALYSIS OF VARIANCE (ANOVA)
One-way between-subjects ANOVA
A one-way between-subjects ANOVA allows you to determine
if there is a relationship between
a categorical independent variable (IV) and a continuous
dependent variable (DV), where each
subject is only in one level of the IV. To determine whether
there is a relationship between the
IV and the DV, a one-way between-subjects ANOVA tests
whether the means of all of the
groups are the same. If there are any differences among the
means, we know that the value of
the DV depends on the value of the IV. The IV in an ANOVA
is referred to as a factor, and the
different groups composing the IV are referred to as the levels
of the factor. A one-way ANOVA
is also sometimes called a single factor ANOVA.
A one-way ANOVA with two groups is analogous to an
independent-samples t-test. The p-
values of the two tests will be the same, and the F statistic from
the ANOVA will be equal to the
square of the t statistic from the t-test.
To perform a one-way between-subjects ANOVA in SPSS
• Choose Analyze → General Linear Model → Univariate.
• Move the DV to the Dependent Variable box.
• Move the IV to the Fixed Factor(s) box.
• Click the OK button.
The output from this analysis will contain the following
sections.
• Between-Subjects Factors. Lists how many subjects are in
each level of your factor.
• Tests of Between-Subjects Effects. The row next to the name
of your factor reports a
test of whether there is a significant relationship between your
IV and the DV. A
significant F statistic means that at least two group means are
different from each other,
indicating the presence of a relationship.
You can ask SPSS to provide you with the means within each
level of your between-subjects
factor by clicking the Options button in the variable selection
window and moving your
factor to the Display Means For box. This will add a
section to your output titled
Estimated Marginal Means containing a table with a row for
each level of your factor. The
values within each row provide the mean, standard error of the
mean, and the boundaries for a
95% confidence interval around the mean for observations
within that cell.
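The same analysis can be written in syntax; the sketch below uses hypothetical names score (DV) and group (factor) and requests the marginal means just described:
unianova score by group
  /emmeans=tables(group)
  /design=group.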
Post-hoc analyses for one-way between-subjects ANOVA. A
significant F statistic tells you
that at least two of your means are different from each other,
but does not tell you where the
differences may lie. Researchers commonly perform post-hoc
analyses following a significant
ANOVA to help them understand the nature of the relationship
between the IV and the DV. The
most commonly reported post-hoc tests are (in order from most
to least liberal): LSD (Least
Significant Difference test), SNK (Student-Newman-Keuls),
Tukey, and Bonferroni. The more
liberal a test is, the more likely it will find a significant
difference between your means, but the
more likely it is that this difference is actually just due to
chance.
Although it is the most liberal, simulations have demonstrated
that using LSD post-hoc analyses
will not substantially increase your experimentwide error rate as
long as you only perform the
post-hoc analyses after you have already obtained a significant
F statistic from an ANOVA. We
therefore recommend this method since it is most likely to
detect any differences among your
groups.
To perform post-hoc analyses in SPSS
• Repeat the steps necessary for a one-way ANOVA, but do not
press the OK button at the
end.
• Click the Post-Hoc button.
• Move the IV to the Post-Hoc Tests for box.
• Check the boxes next to the post-hoc tests you want to
perform.
• Click the Continue button.
• Click the OK button.
Requesting a post-hoc test will add one or both of the following
sections to your ANOVA
output.
• Multiple Comparisons. This section is produced by LSD,
Tukey, and Bonferroni tests.
It reports the difference between every possible pair of factor
levels and tests whether
each is significant. It also includes the boundaries for a 95%
confidence interval around
the size of each difference.
• Homogeneous Subsets. This section is produced by SNK and
Tukey tests. It reports a
number of different subsets of your different factor levels. The
mean values for the factor
levels within each subset are not significantly different from
each other. This means that
there is a significant difference between the means of two factor
levels only if they do not
appear in any of the same subsets.
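In syntax, post-hoc tests are requested on the POSTHOC subcommand of the same UNIANOVA command (hypothetical variable names as before):
unianova score by group
  /posthoc=group(lsd tukey)
  /emmeans=tables(group)
  /design=group.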
Multifactor between-subjects ANOVA
Sometimes you want to examine more than one factor in the
same experiment. Although you
could analyze the effect of each factor separately, testing them
together in the same analysis
allows you to look at two additional things. First, it lets you
determine the independent influence
of each of the factors on the DV, controlling for the other IVs in
the model. The test of each IV
in a multifactor ANOVA is based solely on the part of the DV
that it can predict that is not
predicted by any of the other IVs.
Second, including multiple IVs in the same model allows you to
test for interactions among your
factors. The presence of an interaction between two variables
means that the effect of the first
IV on the DV depends on the level of the second IV. An
interaction between three variables
means that the nature of the two-way interaction between the
first two variables depends on the
level of a third variable. It is possible to have an interaction
between any number of variables.
However, researchers rarely examine interactions containing
more than three variables because
they are difficult to interpret and require large sample sizes to
detect.
Note that to obtain a valid test of a given interaction effect your
model must also include all
lower-order main effects and interactions. This means that the
model has to include terms
representing all of the main effects of the IVs involved in the
interaction, as well as all the
possible interactions between those IVs. So, if you want to test
a 3-way interaction between
variables A, B, and C, the model must include the main effects
for those variables, as well as the
AxB, AxC, and the BxC interactions.
To perform a multifactor ANOVA in SPSS
• Choose Analyze → General Linear Model → Univariate.
• Move the DV to the Dependent Variable box.
• Move all of your IVs to the Fixed Factor(s) box.
• By default SPSS will include all possible interactions between
your categorical IVs. If
this is not the model you want then you will need to define it by
hand by taking the
following steps.
o Click the Model button.
o Click the radio button next to Custom.
o Add all of your main effects to the model by clicking all of
the IVs in the box
labeled Factors and covariates, setting the pull-down menu to
Main effects, and
clicking the arrow button.
o Add each of the interaction terms to your model. You can do
this one at a time by
selecting the variables included in the interaction in the box
labeled Factors and
covariates, setting the pull-down menu to Interaction, and
clicking the arrow
button for each of your interactions. You can also use the
setting on the pull-
down menu to tell SPSS to add all possible 2-way, 3-way, 4-
way, or 5-way
interactions that can be made between the selected variables to
your model.
o Click the Continue button.
• Click the Options button and move each independent variable
and all interaction terms to
the Display means for box.
• Click the Continue button.
• Click the OK button.
The output of this analysis will contain the following sections.
• Between-Subjects Factors. Lists how many subjects are in
each level of each of your
factors.
• Tests of Between-Subjects Effects. The row next to the name
of each factor or
interaction reports a test of whether there is a significant
relationship between that effect
and the DV, independent of the other effects in the model.
You can ask SPSS to provide you with the means within the
levels of your main effects or your
interactions by clicking the Options button in the variable
selection window and moving the
appropriate term to the Display Means For box. This will add a
section to your output titled
Estimated Marginal Means containing a table for each main
effect or interaction in your
model. The table will contain a row for each cell within the
effect. The values within each row
provide the mean, standard error of the mean, and the
boundaries for a 95% confidence interval
around the mean for observations within that cell.
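For reference, a syntax sketch of a two-factor model with its interaction, using hypothetical names score, a, and b, and requesting the marginal means just described:
unianova score by a b
  /emmeans=tables(a)
  /emmeans=tables(b)
  /emmeans=tables(a*b)
  /design=a b a*b.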
Graphing Interactions in an ANOVA. It is often useful to
examine a plot of the means by
condition when trying to interpret a significant interaction.
To get a plot of means by condition from SPSS
• Perform a multifactor ANOVA as described above, but do not
click the OK button to
perform the analysis.
• Click the Plots button.
• Define all the plots you want to see.
o To plot a main effect, move the factor to the Horizontal Axis
box and click the
Add button.
o To plot a two-way interaction, move the first factor to the
Horizontal Axis box,
move the second factor to the Separate Lines box, and click the
Add button.
o To plot a three-way interaction, move the first factor to the
Horizontal Axis box,
move the second factor to the Separate Lines box, move the
third factor to the
Separate Plots box, and click the Add button.
• Click the Continue button.
• Click the OK button.
In addition to the standard ANOVA output, the plots you
requested will appear in a section titled
Profile Plots.
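In syntax, the same plots come from the PLOT subcommand of UNIANOVA; a sketch for a hypothetical three-factor design with factors a, b, and c and DV score:
unianova score by a b c
  /plot=profile(a a*b a*b*c)
  /design=a b c a*b a*c b*c a*b*c.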
Post-hoc comparisons for when you have two or more factors.
Graphing the means from a
two-way or three-way between-subject ANOVA shows you the
basic form of the significant
interaction. However, the analyst may also wish to perform
post-hoc analyses to determine
which means differ from one another. If you want to compare
the levels of a single factor to one
another, you can follow the post-hoc procedures described in
the section on one-way ANOVA.
Comparing the individual cells formed by the combination of
two or more factors, however, is
slightly more complicated. SPSS does not provide options to
directly make such comparisons.
Fortunately, there is a very easy method that allows one to
perform post-hocs comparing all cell
means to one another within a between-subjects interaction.
We will work with a specific example to illustrate how to
perform this analysis in SPSS.
Suppose that you wanted to compare all of the means within a
2x2x3 between-subjects factorial
design. The basic idea is to create a new variable that has a
different value for each cell in the
above design, and then use the post-hoc procedures available in
one-way ANOVA to perform
your comparisons. The total number of cells in an interaction
can be determined by multiplying
together the number of levels in each factor composing the
interaction. In our example, this
would mean that our new variable would need to have 2*2*3=12
different levels, each
corresponding to a unique combination of our three IVs.
One way to create this variable would be to use the Recode
function described above. However,
there is an easier way to do this if your IVs all use numbers to
code the different levels. In our
example we will assume that the first factor (A) has two levels
coded by the values 1 and 2, the
second factor (B) has two levels again coded by the values 1
and 2, and that the third factor (C)
has three levels coded by the values 1, 2, and 3. In this case,
you can use the Compute function
to calculate your new variable using the formula:
newcode = (A*100) + (B*10) + C
In this example, newcode would always be a three-digit number.
The first digit would be equal
to the level on variable A, the second digit would be equal to
the level on variable B, while the
third digit would be equal to the level on variable C. There are
two benefits to using this
transformation. First, it can be completed in a single step,
whereas assigning the groups manually
would take several separate steps. Second, you can directly see
the correspondence between the
levels of the original factors and the level of the composite
variable by looking at the digits of the
composite variable. If you actually used the values of 1 through
12 to represent the different cells
in your new variable, you would likely need to reference a table
to know the relationships
between the values of the composite and the values of the
original variables. If you ever want to
create a composite of a different number of factors (besides 3
factors, like in this example), you
follow the same general principle, basically multiplying each
factor by decreasing powers of 10,
such as the following examples.
newcode = (A*10) + B (for a two-way interaction)
newcode = (A*1000) + (B*100) + (C*10) + D (for a four-way
interaction)
Regardless of which procedure you use to create the composite
variable, you would perform the
post-hoc in SPSS by taking the following steps.
• Choose Analyze → General Linear Model → Univariate.
• Move the DV to the Dependent Variable box.
• Move the composite variable to the Fixed Factor(s) box.
• Click the Post-Hoc button.
• Move the composite variable to the Post-Hoc Tests for box.
• Check the boxes next to the post-hoc tests you want to
perform.
• Click the Continue button.
• Click the OK button.
The post-hoc analyses will be reported in the Multiple
Comparisons and Homogeneous Subsets
sections, as described above under one-way between-subjects
ANOVA.
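Putting the pieces together in syntax, with score as a hypothetical DV name and A, B, and C as the factors from the example above, the composite variable and its post-hoc tests could be produced as follows:
* Build the composite cell variable, then treat it as a one-way factor.
compute newcode = (A*100) + (B*10) + C.
execute.
unianova score by newcode
  /posthoc=newcode(lsd)
  /design=newcode.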
One-way within-subjects ANOVA
A one-way within-subjects ANOVA allows you to determine if
there is a relationship between a
categorical IV and a continuous DV, where each subject is
measured at every level of the IV.
Within-subject ANOVA should be used whenever you want to
compare 3 or more groups where the
same subjects are in all of the groups. To perform a within-
subject ANOVA in SPSS you must
have your data set organized so that the subject is the unit of
analysis and you have different
variables containing the value of the DV at each level of your
within-subjects factor.
To perform a within-subject ANOVA in SPSS:
• Choose Analyze → General Linear Model → Repeated
Measures.
• Type the name of the factor in the Within-Subjects Factor
Name box.
• Type the number of groups the factor represents in the Number
of Levels box.
• Click the Add button.
• Click the Define button.
• Move the variables representing the different levels of the
within-subjects factor to the
Within-Subjects Variables box.
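A syntax sketch of the same analysis, assuming the within-subjects factor has three levels stored in hypothetical variables time1, time2, and time3:
glm time1 time2 time3
  /wsfactor=time 3 polynomial
  /wsdesign=time.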
Multiple Regression
Sometimes you may want to explain variability in a continuous
DV using several different
continuous IVs. Multiple regression allows us to build an
equation predicting the value of the
DV from the values of two or more IVs. The parameters of this
equation can be used to relate the
variability in our DV to the variability in specific IVs.
Sometimes people use the term
multivariate regression to refer to multiple regression, but most
statisticians do not use
ìmultiple" and ìmultivariate" as synonyms. Instead, they use the
term ìmultiple" to describe
analyses that examine the effect of two or more IVs on a single
DV, while they reserve the term
ìmultivariate" to describe analyses that examine the effect of
any number of IVs on two or more
DVs.
The general form of the multiple regression model is
Yi = β0 + β1Xi1 + β2Xi2 + … + βkXik + εi,
The elements in this equation are the same as those found in
simple linear regression, except that
we now have k different parameters which are multiplied by the
values of the k IVs to get our
predicted value. We can again use least squares estimation to
determine the estimates of these
parameters that best fit our observed data. Once we obtain these
estimates we can either use our
equation for prediction, or we can test whether our parameters
are significantly different from
zero to determine whether each of our IVs makes a significant
contribution to our model.
Care must be taken when making inferences based on the
coefficients obtained in multiple
regression. The way that you interpret a multiple regression
coefficient is somewhat different
from the way that you interpret coefficients obtained using
simple linear regression.
Specifically, the value of a multiple regression coefficient
represents the ability of the part of the
corresponding IV that is unrelated to the other IVs to predict
the part of the DV that is unrelated
to the other IVs. It therefore represents the unique ability of the
IV to account for variability in
the DV. One implication of the way coefficients are determined
is that your parameter estimates
become very difficult to interpret if there are large correlations
among your IVs. The effect of
these relationships on multiple regression coefficients is called
multicollinearity. This changes
the values of your coefficients and greatly increases their
variance. It can cause you to find that
none of your coefficients are significantly different from zero,
even when the overall model does
a good job predicting the value of the DV.
The typical effect of
multicollinearity is to reduce the size of your parameter
estimates. Since the value of the
coefficient is based on the unique ability for an IV to account
for variability in a DV, if there is a
portion of variability that is accounted for by multiple IVs, all
of their coefficients will be
reduced. Under certain circumstances multicollinearity can also
create a suppression effect. If
you have one IV that has a high correlation with another IV but
a low correlation with the DV,
you can find that the multiple regression coefficient for the
second IV from a model including
both variables can be larger (or even opposite in direction!)
compared to the coefficient from a
model that doesn't include the first IV. This happens when the
part of the second IV that is
independent of the first IV has a different relationship with the
DV than does the part that is
related to the first IV. It is called a suppression effect because
the relationship that appears in
multiple regression is suppressed when you just look at the
variable by itself.
To perform a multiple regression in SPSS
• Choose Analyze → Regression → Linear.
• Move the DV to the Dependent box.
• Move all of the IVs to the Independent(s) box.
• Click the OK button.
The SPSS output from a multiple regression analysis contains
the following sections.
• Variables Entered/Removed. This section is only used in
model building and contains
no useful information in standard multiple regression.
• Model Summary. The value listed below R is the multiple
correlation between your IVs
and your DV. The value listed below R square is the proportion
of variance in your DV
that can be accounted for by your IV. The value in the Adjusted
R Square column is a
measure of model fit, adjusting for the number of IVs in the
model. The value listed
below Std. Error of the Estimate is the standard deviation of the
residuals.
• ANOVA. This section provides an F test for your statistical
model. If this F is significant,
it indicates that the model as a whole (that is, all IVs combined)
predicts significantly
more variability in the DV compared to a null model that only
has an intercept parameter.
Notice that this test is affected by the number of IVs in the
model being tested.
• Coefficients. This section contains a table where each row
corresponds to a single
coefficient in your model. The row labeled Constant refers to
the intercept, while the
coefficients for each of your IVs appear in the row beginning
with the name of the IV.
Inside the table, the column labeled B contains the estimates of
the parameters and the
column labeled Std. Error contains the standard error of those
estimates. The column
labeled Beta contains the standardized regression coefficient.
The column labeled t
contains the value of the t-statistic testing whether the value of
each parameter is equal to
zero. The p-value of this test is found in the column labeled Sig.
A significant t-test
indicates that the IV is able to account for a significant amount
of variability in the DV,
independent of the other IVs in your regression model.
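The corresponding syntax is a single REGRESSION command; outcome, iv1, iv2, and iv3 below are hypothetical variable names:
regression
  /statistics coeff outs r anova
  /dependent outcome
  /method=enter iv1 iv2 iv3.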
Multiple regression with interactions
In addition to determining the independent effect of each IV on
the DV, multiple regression can
also be used to detect interactions between your IVs. An
interaction measures the extent to
which the relationship between an IV and a DV depends on the
level of other IVs in the model.
For example, if you have an interaction between two IVs (called
a two-way interaction) then you
expect that the relationship between the first IV and the DV will
be different across different
levels of the second IV. Interactions are symmetric, so if you
have an interaction such that the
effect of IV1 on the DV depends on the level of IV2, then it is
also true that the effect of IV2 on
the DV depends on the level of IV1. It therefore does not matter
whether you say that you have
an interaction between IV1 and IV2 or an interaction between
IV2 and IV1. You can also have
interactions between more than two IVs. For example, you can
have a three-way interaction
between IV1, IV2, and IV3. This would mean that the two-way
interaction between IV1 and IV2
depends on the level of IV3. Just like two-way interactions,
three-way interactions are also
CORRELATION
Pearson correlation
A Pearson correlation measures the strength of the linear
relationship between two continuous
variables. A linear relationship is one that can be captured by
drawing a straight line on a
scatterplot between the two variables of interest. The value of
the correlation provides
information both about the nature and the strength of the
relationship.
• Correlations range between -1.0 and 1.0.
• The sign of the correlation describes the direction of the
relationship. A positive sign
indicates that as one variable gets larger the other also tends to
get larger, while a
negative sign indicates that as one variable gets larger the other
tends to get smaller.
• The magnitude of the correlation describes the strength of the
relationship. The further
that a correlation is from zero, the stronger the relationship is
between the two variables.
A zero correlation would indicate that the two variables aren't
related to each other at all.
Correlations only measure the strength of the linear relationship
between the two variables.
Sometimes you have a relationship that would be better
measured by a curve of some sort rather
than a straight line. In this case the correlation coefficient
would not provide a very accurate
measure of the strength of the relationship. If a line accurately
describes the relationship
between your two variables, your ability to predict the value of
one variable from the value of the
other is directly related to the correlation between them. When
the points in your scatterplot are
all clustered closely about a line your correlation will be large
and the accuracy of the predictions
will be high. If the points tend to be widely spread your
correlation will be small and the
accuracy of your predictions will be low.
The Pearson correlation assumes that both of your variables
have normal distributions. If this is
not the case then you might consider performing a Spearman
rank-order correlation instead
(described below).
To perform a Pearson correlation in SPSS
• Choose Analyze → Correlate → Bivariate.
• Move the variables you want to correlate to the Variables box.
• Click the OK button.
The output of this analysis will contain the following section.
• Correlations. This section contains the correlation matrix of
the variables you selected.
A variable always has a perfect correlation with itself, so the
diagonals of this matrix will
always have values of 1. The other cells in the table provide
you with the correlation
between the variable listed at the top of the column and the
variable listed to the left of
the row. Below this is a p-value testing whether the correlation
differs significantly from
zero. Finally, the bottom value in each box is the sample size
used to compute the
correlation.
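A syntax sketch of this analysis, with hypothetical variable names:
correlations
  /variables=var1 var2 var3
  /print=twotail nosig.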
Point-biserial correlation
The point-biserial correlation captures the relationship between
a dichotomous (two-value)
variable and a continuous variable. If the analyst codes the
dichotomous variable with values of
0 and 1, and then computes a standard Pearson correlation using
this variable, it is
mathematically equivalent to the point-biserial correlation. The
interpretation of this correlation is
similar to the interpretation of the Pearson correlation. A
positive correlation indicates that the
group associated with the value of 1 has larger values than the
group associated with the value of
0. A negative correlation indicates that the group associated with
the value of 1 has smaller values
than the group associated with the value of 0. A value near zero
indicates no relationship
between the two variables.
To perform a point-biserial correlation in SPSS
• Make sure your categories are indicated by values of 0 and 1.
• Obtain the Pearson correlation between the categorical
variable and the continuous
variable, as discussed above.
The result of this analysis will include the same sections as
discussed in the Pearson correlation
section.
Spearman rank correlation
The Spearman rank correlation is a nonparametric equivalent to
the Pearson correlation. The
Pearson correlation assumes that both of your variables have
normal distributions. If this
assumption is violated for either of your variables then you may
choose to perform a Spearman
rank correlation instead. However, the Spearman rank
correlation is a less powerful measure of
association, so people will commonly choose to use the standard
Pearson correlation even when
the variables being considered are moderately nonnormal.
The Spearman rank correlation is
typically preferred over Kendall's tau, another nonparametric
correlation measure, because its
scaling is more consistent with the standard Pearson correlation.
To perform a Spearman rank correlation in SPSS
• Choose Analyze → Correlate → Bivariate.
• Move the variables you want to correlate to the Variables box.
• Check the box next to Spearman.
• Click the OK button.
The output of this analysis will contain the following section.
• Correlations. This section contains the correlation matrix of
the variables you selected.
The Spearman rank correlations can be interpreted in exactly
the same way as you
interpret a standard Pearson correlation. Below each correlation
SPSS provides a p-value
testing whether the correlation is significantly different from
zero, and the sample size
used to compute the correlation.
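A syntax sketch, again with hypothetical variable names:
nonpar corr
  /variables=var1 var2
  /print=spearman twotail nosig.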
DESCRIPTIVE STATISTICS
Analyses often begin by examining basic descriptive-level
information about data. The most
common and useful descriptive statistics are
• Mean
• Median
• Mode
• Frequency
• Quartiles
• Sum
• Variance
• Standard deviation
• Minimum/Maximum
• Range
Note: All of these are appropriate for continuous variables, and
frequency and mode are also
appropriate for categorical variables.
If you just want to obtain the mean and standard deviation for a
set of variables
• Choose Analyze → Descriptive Statistics → Descriptives.
• Move the variables of interest to the Variable(s) box.
• Click the OK button.
If you want to obtain any other statistics
• Choose Analyze → Descriptive Statistics → Frequencies.
• Move the variables of interest to the Variable(s) box.
• Click the Statistics button.
• Check the boxes next to the statistics you want.
• Click the Continue button.
• Click the OK button.
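Both menu paths correspond to short syntax commands. The sketch below uses hypothetical variable names; NTILES=4 requests the quartiles:
* Means and standard deviations only.
descriptives variables=var1 var2
  /statistics=mean stddev.
* Fuller set of descriptive statistics, suppressing the frequency tables.
frequencies variables=var1 var2
  /format=notable
  /ntiles=4
  /statistics=mean median mode sum variance stddev range minimum maximum.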
CHI-SQUARE TEST OF INDEPENDENCE
A chi-square is a nonparametric test used to determine if there
is a relationship between two
categorical variables. Let's take a simple example. Suppose a
researcher brought male and
female participants into the lab and asked them which color
they prefer: blue or green. The
researcher believes that color preference may be related to
gender. Notice that both gender
(male, female) and color preference (blue, green) are
categorical variables. If there is a
relationship between gender and color preference, we would
expect that the proportion of men
who prefer blue would be different than the proportion of
women who prefer blue. In general,
you have a relationship between two categorical variables when
the distribution of people across
the categories of the first variable changes across the different
categories of the second variable.
To determine if a relationship exists between gender and color
preference, the chi-square test
computes the distributions across the combination of your two
factors that you would expect if
there were no relationship between them. It then compares this
to the actual distribution found
in your data. In the example above, we have a 2 (gender: male,
female) X 2 (color preference:
green, blue) design. For each cell in the combination of the two
factors, we would compute
"observed" and "expected" counts. The observed counts are
simply the actual number of
observations found in each of the cells. The expected
proportion in each cell can be determined
by multiplying the marginal proportions found in a table. For
example, let us say that 52% of all
the participants preferred blue and 48% preferred green,
whereas 40% of all of the
participants were men and 60% were women. The expected
proportions are presented in the
table below.
Expected proportion table
                       Males    Females   Marginal proportion
Blue                   20.8%    31.2%     52%
Green                  19.2%    28.8%     48%
Marginal proportion    40%      60%
As you can see, you get the expected proportion for a particular
cell by multiplying the two
marginal proportions together. You would then determine the
expected count for each cell by
multiplying the expected proportion by the total number of
participants in your study. The chi-
square statistic is a function of the difference between the
expected and observed counts across
all your cells. Luckily you do not actually need to calculate any
of this by hand, since SPSS will
compute the expected counts for each cell and perform the chi-
square test.
To perform a chi-square test of independence in SPSS
• Choose Analyze → Descriptive Statistics → Crosstabs.
• Put one of the variables in the Row(s) box.
• Put the other variable in the Column(s) box.
• Click the Statistics button.
• Check the box next to Chi-square.
• Click the Continue button.
• Click the OK button.
The output of this analysis will contain the following sections.
• Case Processing Summary. Provides information about
missing values in your two
variables.
• Crosstabulation. Provides you with the observed counts
within each combination of
your two variables.
• Chi-Square Tests. The first row of this table will give you the
chi-square value, its
degrees of freedom and the p-value associated with the test.
Note that the p-values
produced by a chi-square test are inappropriate if the expected
count is less than 5 in 20%
or more of the cells. If you are in this situation, you should
either redefine your coding
scheme (combining the categories with low cell counts with
other categories) or exclude
categories with low cell counts from your analysis.
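In syntax, the same test can be requested as follows (gender and preference are hypothetical variable names); adding EXPECTED to the CELLS subcommand prints the expected counts so you can check the small-count rule above:
crosstabs
  /tables=gender by preference
  /statistics=chisq
  /cells=count expected.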

LOGISTIC REGRESSION

The chi-square test allows us to determine if a pair of categorical variables are related. But what if you want to test a model using two or more independent variables? Most of the inferential procedures we have discussed so far require that the dependent variable be a continuous variable.
The most common inferential statistics, such as t-tests, regression, and ANOVA, require that the residuals have a normal distribution and that the variance is equal across conditions. Both of these assumptions are likely to be seriously violated if the dependent variable is categorical. The answer is to use logistic regression, which does not make these assumptions and so can be used to determine the ability of a set of continuous or categorical independent variables to predict the value of a categorical dependent variable. However, standard logistic regression assumes that all of your observations are independent, so it cannot be directly used to test within-subject factors.

Logistic regression generates equations that tell you exactly how changes in your independent variables affect the probability that the observation is in a level of your dependent variable. These equations are based on predicting the odds that a particular observation is in one of two groups. Let us say that you have two groups: a reference group and a comparison group. The odds that an observation is in the reference group is equal to the probability that the observation is in the reference group divided by the probability that it is in the comparison group. So, if there is a 75% chance that the observation is in the reference group, the odds of it being in the reference group would be .75/.25 = 3. We therefore talk about odds in the same way that people do when betting at a racetrack.

In logistic regression, we build an equation that predicts the logarithm of the odds from the values of the independent variables (which is why it's called log-istic regression). For each independent variable in our model, we want to calculate a coefficient B that tells us what the change in the log odds would be if we increased the value of the variable by 1. These coefficients therefore parallel those found in a standard regression model. However, they are somewhat difficult to interpret because they relate the independent variables to the log odds. To make interpretation easier, people often transform the coefficients into odds ratios by raising the mathematical constant e to the power of the coefficient (e^B). The odds ratio directly tells you how the odds increase when you change the value of the independent variable. Specifically, the odds of being in the reference group are multiplied by the odds ratio when the independent variable increases by 1.

One obvious limitation of this procedure is that we can only compare two groups at a time. If we want to examine a dependent variable with three or more levels, we must actually create several different logistic regression equations. If your dependent variable has k levels, you will need a total of k-1 logistic regression equations. What people typically do is designate a specific level of your dependent variable as the reference group, and then generate a set of equations that each compares one other level of the dependent variable to that group. You must then examine the behavior of your independent variables in each of your equations to determine what their influence is on your dependent variable.
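
The arithmetic behind odds and odds ratios is easy to check by hand. The Python sketch below is illustrative only: the probability of .75 comes from the example above, while the coefficient B = 0.4 is a made-up value used to show how a log-odds coefficient maps onto an odds ratio.

import math

# Odds from a probability: a 75% chance of being in the reference group
p_reference = 0.75
odds = p_reference / (1 - p_reference)   # .75 / .25 = 3

# A hypothetical coefficient B is a change in the log odds; exponentiating it
# gives the odds ratio that SPSS reports as Exp(B).
B = 0.4
odds_ratio = math.exp(B)

print(odds, odds_ratio)
print(odds * odds_ratio)   # odds after a 1-unit increase in the IV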
To test the overall success of your model, you can determine the probability that you can predict the category of the dependent variable from the values of your independent variables. The higher this probability is, the stronger the relationship is between the independent variables and your dependent variable. You can determine this probability iteratively using maximum likelihood estimation. If you multiply the logarithm of this probability by -2, you will obtain a statistic that has an approximate chi-square distribution, with degrees of freedom equal to the number of parameters in your model. This is referred to as -2LL (minus 2 log likelihood) and is commonly used to assess the fit of the model. Large values of -2LL indicate that the observed model has poor fit.

This statistic can also be used to provide a statistical test of the relationship between each independent variable and your dependent variable. The importance of each term in the model can be assessed by examining the increase in -2LL when the term is dropped. This difference also has a chi-square distribution, and can be used as a statistical test of whether there is an independent relationship between each term and the dependent variable.

To perform a logistic regression in SPSS
• Choose Analyze → Regression → Multinomial Logistic.
• Move the categorical DV to the Dependent box.
• Move your categorical IVs to the Factor(s) box.
• Move your continuous independent variables to the Covariate(s) box.
• By default, SPSS does not include any interaction terms in your model. You will need to click the Model button and manually build your model if you want to include any interactions.
• When you are finished, click the OK button to tell SPSS to perform the analysis.

If your dependent variable only has two groups, you have the option of selecting Analyze → Regression → Binary Logistic. Though this performs the same basic analysis, this procedure is primarily designed for model building. It organizes the output in a less straightforward way and does not provide you with the likelihood ratio test for each of your predictors. You are therefore better off using this selection only if you are specifically interested in the model-building procedures that it offers.

NOTE: The results from a binary logistic analysis in SPSS will actually produce coefficients that are opposite in sign when compared to the results of a multinomial logistic regression performed on exactly the same data. This is because the binary procedure predicts the probability of the category with the largest indicator value, while the multinomial procedure predicts the probability of the category with the smallest indicator value.
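
For readers working outside SPSS, here is a minimal sketch of the same kind of analysis in Python with statsmodels. The data and variable names (choice, age, gender) are hypothetical, so this is an illustration of the general approach rather than a reproduction of the SPSS procedure; the likelihood ratio test at the end mirrors the drop-one-term -2LL logic described above.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy.stats import chi2

# Hypothetical data: a binary DV (choice) and two IVs (age, gender).
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "gender": rng.integers(0, 2, n),
})
true_logit = -3 + 0.07 * data["age"] + 0.5 * data["gender"]
data["choice"] = (rng.random(n) < 1 / (1 + np.exp(-true_logit))).astype(int)

# Full model: intercept + age + gender.
X_full = sm.add_constant(data[["age", "gender"]])
full = sm.Logit(data["choice"], X_full).fit(disp=0)
print(np.exp(full.params))   # Exp(B): odds ratio for each predictor

# Likelihood ratio test for gender: increase in -2LL when the term is dropped.
X_reduced = sm.add_constant(data[["age"]])
reduced = sm.Logit(data["choice"], X_reduced).fit(disp=0)
lr_stat = 2 * (full.llf - reduced.llf)
p_value = chi2.sf(lr_stat, 1)        # chi-square with 1 df (one dropped parameter)
print(lr_stat, p_value)

# For a DV with three or more categories, sm.MNLogit fits the k-1 equations
# described above.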
The Multinomial Logistic procedure will produce output with the following sections.
• Case Processing Summary. Describes the levels of the dependent variable and any categorical independent variables.
• Model Fitting Information. Tells you the -2LL of both a null model containing only the intercept and the full model being tested. Recall that this statistic follows a chi-square distribution and that significant values indicate that there is a significant amount of variability in your DV that is not accounted for by your model.
• Pseudo R-Square. Provides a number of statistics that researchers have developed to represent the ability of a logistic regression model to account for variability in the dependent variable. Logistic regression does not have a true R-square statistic because the amount of variance is partly determined by the distribution of the dependent variable. The more evenly the observations are distributed among the levels of the dependent variable, the greater the variance in the observations. This means that the R-square values for models that have different distributions are not directly comparable. However, these statistics can be useful for comparing the fit of different models predicting the same response variable. The most commonly reported pseudo R-square estimate is Nagelkerke's R-square, which is provided by SPSS in this section.
• Likelihood Ratio Tests. Provides the likelihood ratio tests for the IVs. The first column of the table contains the -2LL (a measurement of model error having a chi-square distribution) of a model that does not include the factor listed in the row. The value in the first row (labeled Intercept) is actually the -2LL for the full model. The second column is the difference between the -2LL for the full model and the -2LL for the model that excludes the factor listed in the row. This is a measure of the amount of variability that is accounted for by the factor. This difference parallels the Type III SS in a regression model, and follows a chi-square distribution with degrees of freedom equal to the number of parameters it takes to code the factor. The final column provides the p-value for the test of the null hypothesis that the amount of error in the model that excludes the factor is the same as the amount of error in the full model. A significant statistic indicates that the factor does account for a significant amount of the variability in the dependent variable that is not captured by other variables in the model.
• Parameter Estimates. Provides the specific coefficients of the logistic regression equations. You will have a number of equations equal to the number of levels in your dependent variable minus 1. Each equation predicts the log odds of your observations being in the highest numbered level of your dependent variable compared to another level (which is listed in the leftmost column of the chart). Within each equation, you will see estimates of the standardized logistic regression coefficient for each variable in the model. These coefficients tell you the increase in the log odds when the variable increases by 1 (assuming everything else is held constant). The next column contains the standard errors of those coefficients. The Wald statistic provides another test of the significance of the individual coefficients, and is based on the relationship between the coefficient and its standard error. However, there is a flaw in this statistic such that large coefficients may have inappropriately large standard errors, so researchers typically prefer to use the likelihood ratio test to determine the importance of individual factors in the model. SPSS provides the odds ratio for the parameter under the column Exp(B). The last two columns in the table provide the upper and lower bounds for a 95% confidence interval around the odds ratio.
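
If you ever need to recover the Exp(B) column and its confidence interval from a reported coefficient, the conversion is just exponentiation. The values below (B = 0.40 with a standard error of 0.15) are made up for illustration.

import math

B, se = 0.40, 0.15                  # hypothetical coefficient and standard error
odds_ratio = math.exp(B)            # the Exp(B) column
ci_low = math.exp(B - 1.96 * se)    # lower bound of the 95% confidence interval
ci_high = math.exp(B + 1.96 * se)   # upper bound of the 95% confidence interval
print(round(odds_ratio, 3), round(ci_low, 3), round(ci_high, 3))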
RELIABILITY

Ideally, the measurements that we take with a scale would always replicate perfectly. However, in the real world there are a number of external random factors that can affect the way that respondents provide answers to a scale. A particular measurement taken with the scale is therefore composed of two factors: the theoretical "true score" of the scale and the variation caused by random factors. Reliability is a measure of how much of the variability in the observed scores actually represents variability in the underlying true score. Reliability ranges from 0 to 1. In psychology it is preferred to have scales with reliability greater than .7.

The reliability of a scale is heavily dependent on the number of items composing the scale. Even using items with poor internal consistency, you can get a reliable scale if your scale is long enough. For example, 10 items that have an average inter-item correlation of only .2 will produce a scale with a reliability of .714. However, the benefit of adding additional items decreases as the scale grows larger, and mostly disappears after 20 items. One consequence of this is that adding extra items to a scale will generally increase the scale's reliability, even if the new items are not particularly good. An item would have to substantially lower the average inter-item correlation to have a negative impact on reliability.

Reliability has specific implications for the utility of your scale. The most that responses to your scale can correlate with any other variable is equal to the square root of the scale's reliability. The variability in your measure will prevent anything higher.
Therefore, the higher the reliability of your scale, the easier it is to obtain significant findings. This is probably what you should think about when you want to determine whether your scale has a high enough reliability. It should also be noted that low reliability does not call into question results obtained using a scale. Low reliability only hurts your chances of finding significant results. It cannot cause you to obtain false significance. If anything, finding significant results with an unreliable scale indicates that you have discovered a particularly strong effect, since it was able to overcome the hindrances of your unreliable scale. In this way, using a scale with low reliability is analogous to conducting an experiment with a small number of participants.

Calculating reliability from parallel measurements

One way to calculate reliability is to correlate the scores on parallel measurements of the scale. Two measurements are defined as parallel if they are distinct (are based on different data) but equivalent (such that you expect responses to the two measurements to have the same true score). The two measurements must be performed on the same (or matched) respondents so that the correlation can be performed. There are a number of different ways to measure reliability using parallel measurements. Below are several examples.

Test-Retest method. In this method, you have respondents complete the scale at two different points in time. The reliability of the scale can then be estimated by the correlation between the two scores. The accuracy of this method rests on the assumption that the participants are fundamentally the same (i.e., possess the same true score on your scale) during your two test periods. One common problem is that completing the scale the first time can change the way that respondents complete the scale the second time. If they remember any of their specific responses from the first period, for example, it could artificially inflate the reliability estimate. When using this method, you should present evidence that this is not an issue.

Alternate Forms method. This method, also referred to as parallel forms, is basically the same as the Test-Retest method, but with the use of different versions of the scale during each session. The use of different versions reduces the likelihood that the first administration of the scale influences responses to the second. The reliability of the scale can then be estimated by the correlation between the two scores. When using alternate forms, you should show that the administration of the first scale did not affect responses to the second and that the two versions of your scale are essentially the same. This method is generally preferred to the Test-Retest method.
Split-Halves method. One difficulty with both the Test-Retest and the Alternate Forms methods is that the scale responses must be collected at two different points in time. This requires more work and introduces the possibility that some natural event might change the actual true score between the two administrations of the scale. In the Split-Halves method you only have respondents fill out your scale one time. You then divide your scale items into two sections (such as the even-numbered items and the odd-numbered items) and calculate a score for each half. You then determine the correlation between these two scores. Unlike the other methods, this correlation does not estimate your scale's reliability. Instead, you get your estimate using the formula

ρ̂ = 2r / (1 + r),

where ρ̂ is the reliability estimate and r is the correlation that you obtain. Note that if you split your scale in different ways, you will obtain different reliability estimates. Assuming that there are no confounding variables, all split-halves should be centered on the true reliability.
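
As a sketch of what this looks like in code, assuming items is a respondents-by-items NumPy array of scores (the data below are invented rather than taken from the text), the odd/even split and the correction formula above can be written as:

import numpy as np

rng = np.random.default_rng(1)
# Hypothetical 100 x 10 matrix of item scores sharing a common factor
items = rng.normal(size=(100, 10)) + rng.normal(size=(100, 1))

odd_half = items[:, 0::2].sum(axis=1)        # items 1, 3, 5, ...
even_half = items[:, 1::2].sum(axis=1)       # items 2, 4, 6, ...

r = np.corrcoef(odd_half, even_half)[0, 1]   # correlation between the half scores
split_half_reliability = 2 * r / (1 + r)     # the correction formula given above
print(round(r, 3), round(split_half_reliability, 3))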
In general it is best not to use a first half/second half split of the questionnaire, since respondents may become tired as they work through the scale. This would mean that you would expect greater variability in the score from the second half than in the score from the first half. In this case, your two measurements are not actually parallel, making your reliability estimate invalid. A more acceptable method is to divide your scale into sections of odd-numbered and even-numbered items.

Calculating reliability from internal consistency

The other way to calculate reliability is to use a measure of internal consistency. The most popular of these reliability estimates is Cronbach's alpha, which can be obtained using the equation

α = Nr / (1 + (N - 1)r),

where α is Cronbach's alpha, N is the number of items in the scale, and r is the mean inter-item correlation. From the equation we can see that α increases both with increasing r and with increasing N.
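
To connect the formula to the earlier claim that 10 items with an average inter-item correlation of .2 give a reliability of .714, and to show how alpha is usually computed from raw data, here is a small illustrative sketch. The covariance-based function in the second half is the standard computational form of Cronbach's alpha, included as general background rather than something taken from this document, and the data matrix is hypothetical.

import numpy as np

# The formula above with the numbers from the text: 10 items, mean r of .2
N, r_bar = 10, 0.2
alpha_from_r = N * r_bar / (1 + (N - 1) * r_bar)
print(round(alpha_from_r, 3))   # 0.714

# Standard covariance-based Cronbach's alpha for a raw data matrix
# (rows = respondents, columns = items)
def cronbach_alpha(items: np.ndarray) -> float:
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances.sum() / total_variance)

rng = np.random.default_rng(2)
scores = rng.normal(size=(200, 10)) + rng.normal(size=(200, 1))  # hypothetical scores
print(round(cronbach_alpha(scores), 3))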
commonly used procedure to estimate reliability. It is highly accurate and has the advantage of only requiring a single administration of the scale. The only real disadvantage is that it is difficult to calculate by hand, as it requires you to calculate the correlation between every single pair of items in your scale. This is rarely an issue, however, since SPSS will calculate it for you automatically.
To obtain the α of a set of items in SPSS:
• Choose Analyze → Scale → Reliability analysis.
• Move all of the items in the scale to the Items box.
• Click the Statistics button.
• Check the box next to Scale if item deleted.
• Click the Continue button.
• Click the OK button.
Note: Before performing this analysis, make sure all items are coded in the same direction. That is, for every item, larger values should consistently indicate either more of the construct or less of the construct.
The output from this analysis will include a single section titled Reliability. The reliability of your scale will appear at the bottom of the output next to the word Alpha. The top of this section contains information about the consistency of each item with the scale as a whole. You use this to determine whether there are any "bad items" in your scale (i.e., ones that are not representing the construct you are trying to measure).
The column labeled Corrected Item-Total Correlation tells you the correlation between each item and the average of the other items in your scale. The column labeled Alpha if Item Deleted tells you what the reliability of your scale would be if you deleted the given item. You will generally want to remove any items where the reliability of the scale would increase if they were deleted, and keep any items where the reliability of the scale would drop if they were deleted. If any of your items have a negative corrected item-total correlation, it may mean that you forgot to reverse code the item.
Inter-rater reliability
A final type of reliability that is commonly assessed in psychological research is called "inter-rater reliability." Inter-rater reliability is used when judges are asked to code some stimuli, and the analyst wants to know how much those judges agree. If the judges are making continuous ratings, the analyst can simply calculate a correlation between the judges' responses. More commonly, judges are asked to make categorical decisions about stimuli. In this case, reliability is assessed via Cohen's kappa. To obtain Cohen's kappa in SPSS, you first must set up your data file in the appropriate manner. The codes from each judge should be represented as separate variables in the data set. For example, suppose a researcher asked participants to list their thoughts about a persuasive message. Each judge was given a spreadsheet with one thought per row. The two judges were then asked to code each thought as: 1 = neutral response to the message, 2 = positive response to
the message, 3 = negative response to the message, or 4 = irrelevant thought. Once both judges have rendered their codes, the analyst should create an SPSS data file with two columns, one for each judge's codes.
To obtain Cohen's kappa in SPSS
• Choose Analyze → Descriptive Statistics → Crosstabs.
• Place Judge A's responses in the Row(s) box.
• Place Judge B's responses in the Column(s) box.
• Click the Statistics button.
• Check the box next to Kappa.
• Click the Continue button.
• Click the OK button.
The output from this analysis will contain the following sections.
• Case Processing Summary. Reports the number of observations on which you have ratings from both of your judges.
• Crosstabulation. This table lists all the reported values from each judge and the number of times each combination of codes was rendered. For example, assuming that each judge used all the codes in the thought-listing example (code values 1 to 4), the output would contain a cross-tabulation table like this:
Judge A * Judge B Crosstabulation (counts)

                  Judge B
              1.00  2.00  3.00  4.00  Total
Judge A 1.00     5     1                  6
        2.00           5     1            6
        3.00           1     7            8
        4.00                       7      7
Total            5     7     8     7     27

The counts on the diagonal represent agreements. That is, these counts represent the number of times both Judges A and B coded a thought with a 1, 2, 3, or 4. The more agreements, the better the inter-rater reliability. Values not on the diagonal represent disagreements. In this example, we can see that there was one occasion when Judge A coded a thought in category 1 but Judge B coded that same thought in category 2.
• Symmetric Measures. The value of kappa can be found in this section at the intersection of the Kappa row and the Value column. This section also reports a p-value for the kappa, but this is not typically used in reliability analysis.
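If you prefer to work in syntax, the same table and kappa statistic can be requested with the CROSSTABS command. This is just a sketch: the variable names judgea and judgeb below are placeholders for whatever you named the two judges' code variables.

* Agreement between two judges (Cohen's kappa).
crosstabs
  /tables = judgea by judgeb
  /cells = count
  /statistics = kappa.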
Note that a kappa cannot be computed on a non-symmetric table. For instance, if Judge A had used codes 1 to 4, but Judge B never used code 1 at all, the table would not be symmetric. This is because there would be 4 rows for Judge A but only 3 columns for Judge B. Should you have this situation, you should first determine which values are not used by both judges. You then change each instance of these codes to some other value that is not the value chosen by the opposite judge. Since the original code was a mismatch, you can preserve the original amount of agreement by simply changing the value to a different mismatch. This way you can remove the unbalanced code from your scheme while retaining the information from every observation. You can then use the kappa obtained from this revised data set as an accurate measure of the reliability of the original codes.
FACTOR ANALYSIS
Factor analysis is a collection of methods used to examine how underlying constructs influence the responses on a number of measured variables. There are
basically two types of factor analysis: exploratory and confirmatory. Exploratory factor analysis (EFA) attempts to discover the nature of the constructs influencing a set of responses. Confirmatory factor analysis (CFA) tests whether a specified set of constructs is influencing responses in a predicted way. SPSS only has the capability to perform EFA. CFAs require a program with the ability to perform structural equation modeling, such as LISREL or AMOS. The primary objectives of an EFA are to determine the number of factors influencing a set of measures and the strength of the relationship between each factor and each observed measure. To perform an EFA, you first identify a set of variables that you want to analyze. SPSS will then examine the correlation matrix among those variables to identify those that tend to vary together. Each of these groups will be associated with a factor (although it is possible that a single variable could be part of several groups and several factors). You will also receive a set of factor loadings, which tell you how strongly each variable is related to each factor. They also allow you to calculate factor scores for each participant by multiplying the response on each variable by the corresponding factor loading and summing the products. Once you identify the construct underlying a factor, you can use the factor scores to tell you how much of that construct is possessed by each participant. Some common uses of EFA are to:
• Identify the nature of the constructs underlying responses in a specific content area.
• Determine what sets of items "hang together" in a questionnaire.
• Demonstrate the dimensionality of a measurement scale. Researchers often wish to develop scales that respond to a single characteristic.
• Determine what features are most important when classifying a group of items.
• Generate "factor scores" representing values of the underlying constructs for use in other analyses.
• Create a set of uncorrelated factor scores from a set of highly collinear predictor variables.
• Use a small set of factor scores to represent the variability contained in a larger set of variables. This is often referred to as data reduction.
It is important to note that EFA does not produce any statistical tests. It therefore cannot ever provide concrete evidence that a particular structure exists in your data; it can only direct you to what patterns there may be. If you want to actually test whether a particular structure exists in your data you should use CFA, which does allow you to test whether your proposed structure is able to account for a significant amount of variability in your items. EFA is strongly related to another procedure called principal components analysis (PCA). The two have basically the same purpose: to identify a set of
underlying constructs that can account for the variability in a set of variables. However, PCA is based on a different statistical model, and produces slightly different results when compared to EFA. EFA tends to produce better results when you want to identify a set of latent factors that underlie the responses on a set of measures, whereas PCA works better when you want to perform data reduction. Although SPSS labels the procedure "factor analysis," its default extraction method is actually PCA. The differences are slight enough that you will generally not need to be concerned about them; you can use the results from a PCA for all of the same things that you would the results of an EFA. However, if you want to identify latent constructs, you should be aware that you might be able to get slightly better results if you used a statistical package that can actually perform EFA, such as SAS, AMOS, or LISREL. Factor analyses require a substantial number of subjects to generate reliable results. As a general rule, the minimum sample size should be the larger of 100 or 5 times the number of items in your factor analysis. Though you can still conduct a factor analysis with fewer subjects, the results will not be very stable.
To perform an EFA in SPSS
• Choose Analyze → Data Reduction → Factor.
• Move the variables you want to include in your factor analysis to the Variables box.
• If you want to restrict the factor analysis to those cases that have a particular value on a variable, you can put that variable in the Selection Variable box and then click Value to tell SPSS which value you want the included cases to have.
• Click the Extraction button to indicate how many factors you want to extract from your items. The maximum number of factors you can extract is equal to the number of items in your analysis, although you will typically want to examine a much smaller number. There are several different ways to choose how many factors to examine. First, you may want to look for a specific number of factors for theoretical reasons. Second, you can choose to keep factors that have eigenvalues over 1. A factor with an eigenvalue of 1 is able to account for the amount of variability present in a single item, so factors that account for less variability than this will likely not be very meaningful. A final method is to create a Scree Plot, where you graph the amount of variability that each of the factors is able to account for in descending order. You then use all the factors that occur prior to the last major drop in the amount of variance accounted for. If you wish to use this method, you should run the factor analysis twice: once to generate the Scree plot, and a second time where you specify exactly how many factors you want to examine.
• Click the Rotation button to select a rotation method. Though you do not need to rotate your solution, using a rotation typically provides you with more interpretable factors by locating solutions with more extreme factor loadings. There are two broad classes of rotations: orthogonal and oblique. If you choose an orthogonal rotation, then your resulting factors will all be uncorrelated with each other. If you choose an oblique rotation, you allow your factors to be correlated. Which you should choose depends on your purpose for performing the factor analysis, as well as your beliefs about the constructs that underlie responses to your items. If you think that the underlying constructs are independent, or if you are specifically trying to get a set of uncorrelated factor scores, then you should clearly choose an orthogonal rotation. If you think that the underlying constructs may be correlated, then you should choose an oblique rotation. Varimax is the most popular orthogonal rotation, whereas Direct Oblimin is the most popular oblique rotation. If you decide to perform a rotation on your solution, you usually ignore the parts of the output that deal with the initial (unrotated) solution since the rotated solution will generally provide more interpretable results. If you want to use
direct oblimin rotation, you will also need to specify the parameter delta. This parameter influences the extent to which your final factors will be correlated. Negative values lead to lower correlations whereas positive values lead to higher correlations. You should not choose a value over .8, or else the high correlations will make it very difficult to differentiate the factors.
• If you want SPSS to save the factor scores as variables in your data set, then you can click the Scores button and check the box next to Save as variables.
• Click the OK button when you are ready for SPSS to perform the analysis.
The output from a factor analysis will vary depending on the type of rotation you chose. Both orthogonal and oblique rotations will contain the following sections.
• Communalities. The communality of a given item is the proportion of its variance that can be accounted for by your factors. In the first column you'll see that the communality for the initial extraction is always 1. This is because the full set of factors is specifically designed to account for the variability in the full set of items. The second column provides the communalities based on the final set of factors that you decided to extract.
• Total Variance Explained. Provides you with the eigenvalues and the amount of
variance explained by each factor in both the initial and the rotated solutions. If you requested a Scree plot, this information will be presented in a graph following the table.
• Component Matrix. Presents the factor loadings for the initial solution. Factor loadings can be interpreted as standardized regression coefficients, regressing each measure on the factors. Factor loadings less than .3 are considered weak, loadings between .3 and .6 are considered moderate, and loadings greater than .6 are considered to be large.
Factor analyses using an orthogonal rotation will include the following sections.
• Rotated Component Matrix. Provides the factor loadings for the orthogonal rotation. The rotated factor loadings can be interpreted in the same way as the unrotated factor loadings.
• Component Transformation Matrix. Provides the correlations between the factors in the original and in the rotated solutions.
Factor analyses using an oblique rotation will include the following sections.
• Pattern Matrix. Provides the factor loadings for the oblique rotation. The rotated factor loadings can be interpreted in the same way as the unrotated factor loadings.
• Structure Matrix. Holds the correlations between the factors and each of the items. This is not going to look the same as the pattern matrix because the factors themselves can be correlated. This means that an item can have a factor loading of zero for one factor but still be correlated with that factor, simply because it loads on other factors that are correlated with the first factor.
• Component Correlation Matrix. Provides you with the correlations among your rotated factors.
After you obtain the factor loadings, you will want to come up with a theoretical interpretation of each of your factors. You define a factor by considering the possible constructs that could be responsible for the observed pattern of positive and negative loadings. You should examine the items that have the largest loadings and consider what they have in common. To ease interpretation, you have the option of multiplying all of the loadings for a given factor by -1. This essentially reverses the scale of the factor, allowing you, for example, to turn an "unfriendliness" factor into a "friendliness" factor.
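For reference, the dialog choices described above can also be submitted as syntax. The sketch below is only one possible specification: it assumes your items are named item01 through item20 and that you want two varimax-rotated components, a scree plot, and saved factor scores. All of these names and settings are placeholders to adapt to your own analysis.

* EFA (principal components extraction) with varimax rotation.
factor
  /variables = item01 to item20
  /print = initial extraction rotation
  /plot = eigen
  /criteria = factors(2)
  /extraction = pc
  /rotation = varimax
  /save = reg(all).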
VECTORS AND LOOPS
Vectors and loops are two tools drawn from computer programming that can be very useful when manipulating data. Their primary use is to perform a large number of similar computations using a relatively small program. Some of the more complicated types of data manipulation can only reasonably be done using vectors and loops. A vector is a set of variables that are linked together because they represent similar things. The purpose of the vector is to provide a single name that can be used to access any of the entire set of variables. A loop is used to tell the computer to perform a set of procedures a specified number of times. Often we need to perform the same transformation on a large number of variables. By using a loop, we only need to define the transformation once and can then tell the computer to apply it to all of the variables. If you have computer-programming experience then you have likely come across these ideas before. However, what SPSS calls a "vector" is typically referred to as an "array" in most programming languages. If you are familiar with arrays and loops from a computer-programming course, you are a step ahead. Vectors and loops are used in data manipulation in
more or less the same way that arrays and loops are used in standard computer programming.
Vectors
Vectors can only be defined and used in syntax. Before you can use a vector you first need to define it. You must specify the name of the vector and list what variables are associated with it. Variables referenced by a vector are called "elements" of that vector. You declare a vector using the following syntax.
vector Vname = varX1 to varX2.
If the variables in the vector have not already been declared, you can do so as part of the vector statement. For more information on this, see page 904 of the SPSS Base Syntax Reference Guide. The following are all acceptable vector declarations.
vector V = v1 to v8.
vector Myvector = entry01 to entry64.
vector Grade = grade1 to grade12.
vector Income = in1992 to in2000.
The vector is given the name Vname and is used to reference a set of variables defined by the variable list. The elements in the vector must be declared using the syntax first variable to last variable. You cannot list them out individually. This means that the variables to be included in a vector must all be grouped together in your data set. Vectors can be used in transformation statements just like variables. However, the vector itself
isn't able to hold values. Instead, the vector acts as a mediator between your statement and the variables it references. The variables included in a vector are placed in a specific order, determined by the declaration statement. So if you give SPSS a vector and an order number (referred to as the index), it knows what specific element you want to access. You do not need to know what the exact name of the variable is; you just need to know its location in the vector. References to items within a vector are typically made using the format vname(index), where vname is the name of the vector and index is the numerical position of the desired element. Using this format, you can use a vector to reference a variable in any place that you would normally insert a variable name. For example, all of the following would be valid SPSS statements, assuming that we had defined the four vectors above.
compute V(4) = 6.
if (Myvector(30)='house') correct = correct + 1.
compute sum1 = Grade(1) + Grade(2) + Grade(3).
compute change = Income(9) - Income(1).
Note that the index used by a vector only takes into account the position of elements in the vector, not the names of the variables.
To reference the variable in1993 from the Income vector above, you would use the phrase Income(2), not Income(1993). Using vectors this way doesn't provide us with much of an advantage; we are not really saving ourselves any effort by referring to a particular variable as Myvector(1) instead of entry01. The advantage comes from the fact that the index of the vector itself can be a variable. In this case, the element that the vector will reference will depend on the value of the index variable. So the exact variable that is changed by the statement
compute Grade(t) = Grade(t) + 1.
depends on the value of t when this statement is executed. If t has the value of 1, then the variable grade1 will be incremented by 1. If t has a value of 8, then the variable grade8 will be incremented by 1. This means that the same statement can be used to perform many different things, depending on what value you assign to t. This allows you to use vectors to write "generic" sections of code, where you control exactly what the code does by assigning different values to the index variables.
Loops
Vectors are most useful when they are combined with loops. A loop is a statement that lets you tell the computer to perform a set of commands a specified number of times. In SPSS you can tell the computer to perform a loop by using the following code:
loop loop_variable = lower_limit to upper_limit.
--commands to be repeated appear here--
end loop.
When SPSS encounters a loop statement, what it does first is set the value of the loop variable to be equal to the lower limit. It then performs all of the commands inside the loop until it reaches the end loop statement. At that point the computer adds 1 to the loop variable, and then compares it to the upper limit. If the new value of the loop variable is less than or equal to the upper limit, it goes back to the beginning of the loop and goes through all of the commands again. If the new value is greater than the upper limit, the computer then moves to the statement after the end loop statement. Basically, this means that the computer performs the statements inside the loop a total number of times equal to (upper limit - lower limit + 1). The following is an example of an SPSS program that uses a loop to calculate a sum:
compute x = 0.
loop #t = 4 to 8.
+ compute x = x + #t.
end loop.
The first line simply initializes the variable x to the value of zero. The second line defines the conditions of the loop. The loop variable is named #t, and starts with a value of 4. The loop
cycles until the value of #t is greater than 8. This causes the program to perform a total of 5 cycles. During each cycle the current value of #t is added to x. At the end of this set of statements, the variable x would have the value of 4 + 5 + 6 + 7 + 8 = 30. In this example, the loop variable is denoted as a "scratch variable" because its first character is a number sign (#). When something is denoted as a scratch variable in SPSS it is not saved in the final data set. Typically we are not interested in storing the values of our loop variables, so it is common practice to denote them as scratch variables. For more information on scratch variables see page 32 of the SPSS Base Syntax Reference Guide. You will also notice the plus sign (+) placed before the compute statement in line 3. SPSS needs you to start all new commands in the first column of each line. Here we wish to indent the command to indicate that it is part of the loop. We therefore put the plus symbol in the first column, which tells SPSS that the actual command starts later on the line. Just in case you were wondering, the first statement setting x = 0 is actually necessary for the sum to be calculated. Most programming languages, including SPSS syntax, start variables with missing values. Adding anything to a missing value produces a missing value, so we must explicitly start the variable x at zero to be able to obtain the sum.
The Power of Combining Vectors and Loops
Though you can work with vectors and loops alone, they were
truly designed to be used together. A combination of vectors and loops can save you incredible amounts of time when performing certain types of repetitive transformations. Consider the characteristics of vectors and loops. A vector lets you reference a set of related variables using a single name and an index. The index can be a variable or a mathematical expression involving one or more variables. A loop repeatedly performs a set of commands, incrementing a loop variable after each cycle. What would happen if a statement inside of a loop referenced a vector using the loop variable as the index? During each cycle, the loop variable increases by 1. So during each cycle, the vector would refer to a different variable. If you correctly design the upper and lower limits of your loop, you could use a loop to perform a transformation on every element of a vector. For example, let's say that you conducted a reaction-time study where research participants observed strings of letters on the screen and judged whether they composed a real word or not. In your study, you had a total of 200 trials in several experimental conditions. You want to analyze your data with an ANOVA to see if the reaction time varies by condition, but you find that the data has a right skew (which is common). To use ANOVA, you will need to transform the data so that it has a normal distribution, which involves taking the
logarithm of the response time on each trial. In terms of your data set, what you need is a set of 200 new variables whose values are equal to the logarithms of the 200 response time variables. Without using vectors or loops, you would need to write 200 individual transformation statements to create each log variable from the corresponding response time variable. Using vectors and loops, however, we can do the same work with the following simple program. The program assumes that the original response time variables are rt001 to rt200, and the desired log variables will be lrt001 to lrt200.
vector Rtvector = rt001 to rt200.
vector Lvector = lrt001 to lrt200.
loop #item = 1 to 200.
+ compute Lvector(#item) = ln(Rtvector(#item)).
end loop.
The first two statements set up a pair of vectors, one to represent the original response time variables and one to represent the transformed variables. The third statement creates a loop with 200 cycles. Each cycle of the loop corresponds to a trial in the experiment. The fourth line actually performs the desired transformation. During each cycle it takes one variable from Lvector and sets it equal to the log of the corresponding variable in Rtvector. The fifth line simply ends the loop. By the time this program completes, it will have created 200 new variables holding the log values that you desire. In addition to greatly reducing the number of programming lines, there are other advantages to
performing transformations using vectors and loops. If you need to make a change to the transformation, you only need to change a single statement. If you write separate transformations for each variable, you must change every single statement anytime you want to change the specifics of the transformation. It is also much easier to read programs that use loops than programs with large numbers of transformation statements. The loops naturally group together transformations that are all of the same type, whereas with a list you must examine each individual transformation to find out what it does.
T TESTS
Many analyses in psychological research involve testing hypotheses about means or mean differences. Below we describe the SPSS procedures that allow you to determine if a given mean is equal to either a fixed value or some other mean.
One-sample t-test
You perform a one-sample t-test when you want to determine if the mean value of a target variable is different from a hypothesized value.
To perform a one-sample t-test in SPSS
• Choose Analyze → Compare Means → One-sample t-test.
• Move the variable of interest to the Test variable(s) box.
• Change the test value to the hypothesized value.
• Click the OK button.
The output from this analysis will contain the following sections.
• One-Sample Statistics. Provides the sample size, mean, standard deviation, and standard error of the mean for the target variable.
• One-Sample Test. Provides the results of a t-test comparing the mean of the target variable to the hypothesized value. A significant test statistic indicates that the sample mean differs from the hypothesized value. This section also contains the upper and lower bounds for a 95% confidence interval around the difference between the sample mean and the hypothesized value.
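The same analysis can be requested in syntax with the T-TEST command. In the sketch below, score is a placeholder for your target variable and 50 is a placeholder for the hypothesized value.

* One-sample t-test against a hypothesized value.
t-test
  /testval = 50
  /variables = score.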
Independent-samples t-test
You perform an independent-samples t-test (also called a between-subjects t-test) when you want to determine if the mean value on a given target variable for one group differs from the mean value on the target variable for a different group. This test is only valid if the two groups have entirely different members. To perform this test in SPSS you must have a variable representing group membership, such that different values on the group variable correspond to different groups.
To perform an independent-samples t-test in SPSS
• Choose Analyze → Compare Means → Independent-samples t-test.
• Move the target variable to the Test variable(s) box.
• Move the group variable to the Grouping variable box.
• Click the Define groups button.
• Enter the values corresponding to the two groups you want to compare in the boxes labeled group 1 and group 2.
• Click the Continue button.
• Click the OK button.
The output from this analysis will contain the following sections.
• Group Statistics. Provides descriptive information about your two groups, including the sample size, mean, standard deviation, and the standard error of the mean.
  • 63. should use the test in the first row. A significant t-test indicates that the two groups have different means. The last two columns provide the upper and lower bounds for a 95% confidence interval around the difference between your two groups. Paired-samples t-test You perform a paired samples t-test (also called a within- subjects t-test) when you want to determine whether a single group of participants differs on two measured variables. Probably the most common use of this test would be to compare participantsí response on a measure before a manipulation to their response after a manipulation. This test works by first computing a difference score for each participant between the within- subject conditions (e.g. post-test ñ pre- test). The mean of these difference scores is then compared to zero. This is the same thing as determining whether there is a significant difference between the means of the two variables. To perform a paired-samples t-test in SPSS • Choose Analyze!!!! Compare Means !!!! Paired-samples t-test. • Click the two variables you want to compare in the box on the left-hand side. • Click the arrow button. • Click the OK button. The output from this analysis will contain the following sections.
Paired-samples t-test
You perform a paired-samples t-test (also called a within-subjects t-test) when you want to determine whether a single group of participants differs on two measured variables. Probably the most common use of this test would be to compare participants' responses on a measure before a manipulation to their responses after the manipulation. This test works by first computing a difference score for each participant between the within-subject conditions (e.g., post-test minus pre-test). The mean of these difference scores is then compared to zero. This is the same thing as determining whether there is a significant difference between the means of the two variables.
To perform a paired-samples t-test in SPSS
• Choose Analyze → Compare Means → Paired-samples t-test.
• Click the two variables you want to compare in the box on the left-hand side.
• Click the arrow button.
• Click the OK button.
The output from this analysis will contain the following sections.
• Paired Samples Statistics. Provides descriptive information about the two variables, including the sample size, mean, standard deviation, and the standard error of the mean.
• Paired Samples Correlations. Provides the correlation between the two variables.
• Paired Samples Test. Provides the results of a t-test comparing the means of the two variables. A significant t-test indicates that there is a difference between the two variables. It also contains the upper and lower bounds of a 95% confidence interval around the difference between the two means.
ANALYSIS OF VARIANCE (ANOVA)
One-way between-subjects ANOVA
A one-way between-subjects ANOVA allows you to determine if there is a relationship between a categorical independent variable (IV) and a continuous dependent variable (DV), where each subject is only in one level of the IV. To determine whether there is a relationship between the IV and the DV, a one-way between-subjects ANOVA tests whether the means of all of the groups are the same. If there are any differences among the means, we know that the value of the DV depends on the value of the IV. The IV in an ANOVA is referred to as a factor, and the different groups composing the IV are referred to as the levels of the factor.
A one-way ANOVA is also sometimes called a single factor ANOVA. A one-way ANOVA with two groups is analogous to an independent-samples t-test. The p-values of the two tests will be the same, and the F statistic from the ANOVA will be equal to the square of the t statistic from the t-test.
To perform a one-way between-subjects ANOVA in SPSS
• Choose Analyze → General Linear Model → Univariate.
• Move the DV to the Dependent Variable box.
• Move the IV to the Fixed Factor(s) box.
• Click the OK button.
The output from this analysis will contain the following sections.
• Between-Subjects Factors. Lists how many subjects are in each level of your factor.
• Tests of Between-Subjects Effects. The row next to the name of your factor reports a test of whether there is a significant relationship between your IV and the DV. A significant F statistic means that at least two group means are different from each other, indicating the presence of a relationship.
You can ask SPSS to provide you with the means within each level of your between-subjects factor by clicking the Options button in the variable selection
window and moving your between-subjects factor to the Display Means For box. This will add a section to your output titled Estimated Marginal Means containing a table with a row for each level of your factor. The values within each row provide the mean, standard error of the mean, and the boundaries for a 95% confidence interval around the mean for observations within that cell.
Post-hoc analyses for one-way between-subjects ANOVA. A significant F statistic tells you that at least two of your means are different from each other, but does not tell you where the differences may lie. Researchers commonly perform post-hoc analyses following a significant ANOVA to help them understand the nature of the relationship between the IV and the DV. The most commonly reported post-hoc tests are (in order from most to least liberal): LSD (Least Significant Difference test), SNK (Student-Newman-Keuls), Tukey, and Bonferroni. The more liberal a test is, the more likely it will find a significant difference between your means, but the more likely it is that this difference is actually just due to chance. Although it is the most liberal, simulations have demonstrated that using LSD post-hoc analyses will not substantially increase your experimentwide error rate as long as you only perform the
post-hoc analyses after you have already obtained a significant F statistic from an ANOVA. We therefore recommend this method since it is most likely to detect any differences among your groups.
To perform post-hoc analyses in SPSS
• Repeat the steps necessary for a one-way ANOVA, but do not press the OK button at the end.
• Click the Post-Hoc button.
• Move the IV to the Post-Hoc Tests for box.
• Check the boxes next to the post-hoc tests you want to perform.
• Click the Continue button.
• Click the OK button.
Requesting a post-hoc test will add one or both of the following sections to your ANOVA output.
• Multiple Comparisons. This section is produced by LSD, Tukey, and Bonferroni tests. It reports the difference between every possible pair of factor levels and tests whether each is significant. It also includes the boundaries for a 95% confidence interval around the size of each difference.
• Homogeneous Subsets. This section is produced by SNK and Tukey tests. It reports a number of different subsets of your different factor levels. The mean values for the factor
levels within each subset are not significantly different from each other. This means that there is a significant difference between the means of two factor levels only if they do not appear in any of the same subsets.
Multifactor between-subjects ANOVA
Sometimes you want to examine more than one factor in the same experiment. Although you could analyze the effect of each factor separately, testing them together in the same analysis allows you to look at two additional things. First, it lets you determine the independent influence of each of the factors on the DV, controlling for the other IVs in the model. The test of each IV in a multifactor ANOVA is based solely on the part of the DV that it can predict that is not predicted by any of the other IVs. Second, including multiple IVs in the same model allows you to test for interactions among your factors. The presence of an interaction between two variables means that the effect of the first IV on the DV depends on the level of the second IV. An interaction between three variables means that the nature of the two-way interaction between the first two variables depends on the level of a third variable. It is possible to have an interaction between any number of variables. However, researchers rarely examine interactions containing more than three variables because they are difficult to interpret and require large sample sizes to detect. Note that to obtain a valid test of a given interaction effect your model must also include all lower-order main effects and interactions. This means that the model has to include terms representing all of the main effects of the IVs involved in the interaction, as well as all the possible lower-order interactions between those IVs. So, if you want to test a 3-way interaction between variables A, B, and C, the model must include the main effects for those variables, as well as the AxB, AxC, and BxC interactions.
To perform a multifactor ANOVA in SPSS
• Choose Analyze → General Linear Model → Univariate.
• Move the DV to the Dependent Variable box.
• Move all of your IVs to the Fixed Factor(s) box.
• By default SPSS will include all possible interactions between your categorical IVs. If this is not the model you want then you will need to define it by hand by taking the following steps.
  o Click the Model button.
  o Click the radio button next to Custom.
  o Add all of your main effects to the model by clicking all of the IVs in the box labeled Factors and covariates, setting the pull-down menu to Main effects, and clicking the arrow button.
  o Add each of the interaction terms to your model. You can do this one at a time by selecting the variables included in the interaction in the box labeled Factors and covariates, setting the pull-down menu to Interaction, and clicking the arrow button for each of your interactions. You can also use the setting on the pull-down menu to tell SPSS to add all possible 2-way, 3-way, 4-way, or 5-way interactions that can be made between the selected variables to your model.
  o Click the Continue button.
• Click the Options button and move each independent variable and all interaction terms to the Display means for box.
• Click the Continue button.
• Click the OK button.
The output of this analysis will contain the following sections.
• Between-Subjects Factors. Lists how many subjects are in each level of each of your factors.
• Tests of Between-Subjects Effects. The row next to the name of each factor or interaction reports a test of whether there is a significant relationship between that effect and the DV, independent of the other effects in the model.
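A rough syntax equivalent of the steps above is sketched below. The names dv, a, and b are placeholders for your dependent variable and two factors; the DESIGN subcommand lists the main effects and the interaction explicitly, and the EMMEANS subcommands request the estimated marginal means discussed next.

* Two-factor between-subjects ANOVA with cell means.
unianova dv by a b
  /emmeans = tables(a)
  /emmeans = tables(b)
  /emmeans = tables(a*b)
  /design = a b a*b.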
You can ask SPSS to provide you with the means within the levels of your main effects or your interactions by clicking the Options button in the variable selection window and moving the appropriate term to the Display Means For box. This will add a section to your output titled Estimated Marginal Means containing a table for each main effect or interaction in your model. The table will contain a row for each cell within the effect. The values within each row provide the mean, standard error of the mean, and the boundaries for a 95% confidence interval around the mean for observations within that cell.
Graphing Interactions in an ANOVA. It is often useful to examine a plot of the means by condition when trying to interpret a significant interaction.
To get a plot of the means by condition from SPSS
• Perform a multifactor ANOVA as described above, but do not click the OK button to perform the analysis.
• Click the Plots button.
• Define all the plots you want to see.
  o To plot a main effect, move the factor to the Horizontal Axis box and click the Add button.
  o To plot a two-way interaction, move the first factor to the Horizontal Axis box, move the second factor to the Separate Lines box, and click the Add button.
  o To plot a three-way interaction, move the first factor to the Horizontal Axis box, move the second factor to the Separate Lines box, move the third factor to the Separate Plots box, and click the Add button.
• Click the Continue button.
• Click the OK button.
In addition to the standard ANOVA output, the plots you requested will appear in a section titled Profile Plots.
Post-hoc comparisons for when you have two or more factors. Graphing the means from a two-way or three-way between-subjects ANOVA shows you the basic form of the significant interaction. However, the analyst may also wish to perform post-hoc analyses to determine which means differ from one another. If you want to compare the levels of a single factor to one another, you can follow the post-hoc procedures described in the section on one-way ANOVA. Comparing the individual cells formed by the combination of two or more factors, however, is slightly more complicated. SPSS does not provide options to directly make such comparisons. Fortunately, there is a very easy method that allows one to perform post-hocs comparing all cell means to one another within a between-subjects interaction.
We will work with a specific example to illustrate how to perform this analysis in SPSS. Suppose that you wanted to compare all of the means within a 2x2x3 between-subjects factorial design. The basic idea is to create a new variable that has a different value for each cell in the above design, and then use the post-hoc procedures available in one-way ANOVA to perform your comparisons. The total number of cells in an interaction can be determined by multiplying together the number of levels in each factor composing the interaction. In our example, this would mean that our new variable would need to have 2*2*3 = 12 different levels, each corresponding to a unique combination of our three IVs. One way to create this variable would be to use the Recode function described above. However, there is an easier way to do this if your IVs all use numbers to code the different levels. In our example we will assume that the first factor (A) has two levels coded by the values 1 and 2, the second factor (B) has two levels again coded by the values 1 and 2, and that the third factor (C) has three levels coded by the values 1, 2, and 3. In this case, you can use the Compute function to calculate your new variable using the formula
newcode = (A*100) + (B*10) + C
  • 74. The first digit would be equal to the level on variable A, the second digit would be equal to the level on variable B, while the third digit would be equal to the level on variable C. There are two benefits to using this transformation. First, it can be completed in a single step, whereas assigning the groups manually would take several separate steps. Second, you can directly see the correspondence between the levels of the original factors and the level of the composite variable by looking at the digits of the composite variable. If you actually used the values of 1 through 12 to represent the different cells in your new variable, you would likely need to reference a table to know the relationships between the values of the composite and the values of the original variables. If you ever want to create a composite of a different number of factors (besides 3 factors, like in this example), you follow the same general principle, basically multiplying each factor by decreasing powers of 10, such as the following examples. newcode = (A*10) + B (for a two-way interaction) newcode = (A*1000) + (B*100) + (C*10) + D (for a four-way interaction) Regardless of which procedure you use to create the composite variable, you would perform the post-hoc in SPSS by taking the following steps. • Choose Analyze !!!! General Linear Model !!!! Univariate. • Move the DV to the Dependent Variable box. • Move the composite variable to the Fixed Factor(s) box. • Click the Post-Hoc button. • Move the composite variable to the Post-Hoc Tests for box.
• Check the boxes next to the post-hoc tests you want to perform.
• Click the Continue button.
• Click the OK button.
The post-hoc analyses will be reported in the Multiple Comparisons and Homogeneous Subsets sections, as described above under one-way between-subjects ANOVA.
One-way within-subjects ANOVA
A one-way within-subjects ANOVA allows you to determine if there is a relationship between a categorical IV and a continuous DV, where each subject is measured at every level of the IV. A within-subjects ANOVA should be used whenever you want to compare three or more conditions in which the same subjects appear. To perform a within-subjects ANOVA in SPSS you must have your data set organized so that the subject is the unit of analysis and you have different variables containing the value of the DV at each level of your within-subjects factor.
To perform a within-subjects ANOVA in SPSS:
• Choose Analyze → General Linear Model → Repeated Measures.
• Type the name of the factor in the Within-Subjects Factor Name box.
• Type the number of groups the factor represents in the Number of Levels box.
• Click the Add button.
• Click the Define button.
• Move the variables representing the different levels of the within-subjects factor to the Within-Subjects Variables box.
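In syntax, a one-way within-subjects ANOVA can be requested with the GLM command. The sketch below assumes three placeholder variables, time1 to time3, holding the DV at each level of a within-subjects factor that we have arbitrarily named time.

* One-way within-subjects (repeated measures) ANOVA.
glm time1 time2 time3
  /wsfactor = time 3 polynomial
  /print = descriptive
  /wsdesign = time.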
Multiple Regression
Sometimes you may want to explain variability in a continuous DV using several different continuous IVs. Multiple regression allows us to build an equation predicting the value of the DV from the values of two or more IVs. The parameters of this equation can be used to relate the variability in our DV to the variability in specific IVs. Sometimes people use the term multivariate regression to refer to multiple regression, but most statisticians do not use "multiple" and "multivariate" as synonyms. Instead, they use the term "multiple" to describe analyses that examine the effect of two or more IVs on a single DV, while they reserve the term "multivariate" to describe analyses that examine the effect of any number of IVs on two or more DVs. The general form of the multiple regression model is
Yi = β0 + β1Xi1 + β2Xi2 + … + βkXik + εi.
The elements in this equation are the same as those found in simple linear regression, except that we now have k different parameters which are multiplied by the values of the k IVs to get our
predicted value. We can again use least squares estimation to determine the estimates of these parameters that best fit our observed data. Once we obtain these estimates we can either use our equation for prediction, or we can test whether our parameters are significantly different from zero to determine whether each of our IVs makes a significant contribution to our model. Care must be taken when making inferences based on the coefficients obtained in multiple regression. The way that you interpret a multiple regression coefficient is somewhat different from the way that you interpret coefficients obtained using simple linear regression. Specifically, the value of a multiple regression coefficient represents the ability of the part of the corresponding IV that is unrelated to the other IVs to predict the part of the DV that is unrelated to the other IVs. It therefore represents the unique ability of the IV to account for variability in the DV. One implication of the way coefficients are determined is that your parameter estimates become very difficult to interpret if there are large correlations among your IVs. The effect of these relationships on multiple regression coefficients is called multicollinearity. This changes the values of your coefficients and greatly increases their variance. It can cause you to find that none of your coefficients are significantly different from zero, even when the overall model does a good job predicting the value of the DV.
The typical effect of multicollinearity is to reduce the size of your parameter estimates. Since the value of the coefficient is based on the unique ability of an IV to account for variability in a DV, if there is a portion of variability that is accounted for by multiple IVs, all of their coefficients will be reduced. Under certain circumstances multicollinearity can also create a suppression effect. If you have one IV that has a high correlation with another IV but a low correlation with the DV, you can find that the multiple regression coefficient for the second IV from a model including both variables can be larger (or even opposite in direction!) compared to the coefficient from a model that doesn't include the first IV. This happens when the part of the second IV that is independent of the first IV has a different relationship with the DV than does the part that is
related to the first IV. It is called a suppression effect because the relationship that appears in multiple regression is suppressed when you just look at the variable by itself.
To perform a multiple regression in SPSS
• Choose Analyze → Regression → Linear.
• Move the DV to the Dependent box.
• Move all of the IVs to the Independent(s) box.
• Click the Continue button.
• Click the OK button.
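As with the other procedures, the same model can be requested in syntax. In the sketch below, y is a placeholder for the DV and x1, x2, and x3 are placeholders for the IVs.

* Multiple regression of y on three predictors.
regression
  /statistics = coeff r anova
  /dependent = y
  /method = enter x1 x2 x3.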
  • 80. the parameters and the column labeled Std. Error contains the standard error of those estimates. The column labeled Beta contains the standardized regression coefficient. The column labeled t contains the value of the t-statistic testing whether the value of each parameter is equal to zero. The p-value of this test is found in the column labeled Sig. A significant t-test indicates that the IV is able to account for a significant amount of variability in the DV, independent of the other IVs in your regression model. Multiple regression with interactions In addition to determining the independent effect of each IV on the DV, multiple regression can also be used to detect interactions between your IVs. An interaction measures the extent to which the relationship between an IV and a DV depends on the level of other IVs in the model. For example, if you have an interaction between two IVs (called a two-way interaction) then you expect that the relationship between the first IV and the DV will be different across different levels of the second IV. Interactions are symmetric, so if you have an interaction such that the effect of IV1 on the DV depends on the level of IV2, then it is also true that the effect of IV2 on the DV depends on the level of IV1. It therefore does not matter whether you say that you have an interaction between IV1 and IV2 or an interaction between IV2 and IV1. You can also have interactions between more than two IVs. For example, you can have a three-way interaction between IV1, IV2, and IV3. This would mean that the two-way
  • 81. interaction between IV1 and IV2 depends on the level of IV3. Just like two-way interactions, three-way interactions are also 28 CORRELATION Pearson correlation A Pearson correlation measures the strength of the linear relationship between two continuous variables. A linear relationship is one that can be captured by drawing a straight line on a scatterplot between the two variables of interest. The value of the correlation provides information both about the nature and the strength of the relationship. • Correlations range between -1.0 and 1.0. • The sign of the correlation describes the direction of the relationship. A positive sign indicates that as one variable gets larger the other also tends to get larger, while a negative sign indicates that as one variable gets larger the other tends to get smaller. • The magnitude of the correlation describes the strength of the relationship. The further that a correlation is from zero, the stronger the relationship is between the two variables. A zero correlation would indicate that the two variables aren't related to each other at all.
  • 82. Correlations only measure the strength of the linear relationship between the two variables. Sometimes you have a relationship that would be better measured by a curve of some sort rather than a straight line. In this case the correlation coefficient would not provide a very accurate measure of the strength of the relationship. If a line accurately describes the relationship between your two variables, your ability to predict the value of one variable from the value of the other is directly related to the correlation between them. When the points in your scatterplot are all clustered closely about a line your correlation will be large and the accuracy of the predictions will be high. If the points tend to be widely spread your correlation will be small and the accuracy of your predictions will be low. The Pearson correlation assumes that both of your variables have normal distributions. If this is not the case then you might consider performing a Spearman rank-order correlation instead (described below). To perform a Pearson correlation in SPSS • Choose Analyze!!!! Correlate!!!! Bivariate. • Move the variables you want to correlate to the Variables box. • Click the OK button. The output of this analysis will contain the following section. • Correlations. This section contains the correlation matrix of
  • 83. the variables you selected. A variable always has a perfect correlation with itself, so the diagonals of this matrix will always have values of 1. The other cells in the table provide you with the correlation between the variable listed at the top of the column and the variable listed to the left of the row. Below this is a p-value testing whether the correlation differs significantly from zero. Finally, the bottom value in each box is the sample size used to compute the correlation. Point-biserial correlation The point-biserial correlation captures the relationship between a dichotomous (two-value) variable and a continuous variable. If the analyst codes the dichotomous variable with values of 29 0 and 1, and then computes a standard Pearson correlation using this variable, it is mathematically equivalent to the point-biserial correlation. The interpretation of this variable is similar to the interpretation of the Pearson correlation. A positive correlation indicates that group associated with the value of 1 has larger values than the group associated with the value of 0. A negative correlation indicates that group associated with the value of 1 has smaller values than the group associated with the value of 0. A value near zero indicates no relationship
To perform a point-biserial correlation in SPSS
• Make sure your categories are indicated by values of 0 and 1.
• Obtain the Pearson correlation between the categorical variable and the continuous variable, as discussed above.

The result of this analysis will include the same sections as discussed in the Pearson correlation section.
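If your grouping variable is not already coded 0 and 1, you can recode it and then request an ordinary Pearson correlation in syntax. The sketch below assumes hypothetical variables named gender (coded 1 and 2) and score; substitute your own variable names and codes.

    * Recode a two-category variable to 0 and 1 before correlating it with a continuous variable.
    RECODE gender (1=0) (2=1) INTO gender01.
    EXECUTE.

    * The Pearson correlation of a 0/1 variable with a continuous variable is the point-biserial correlation.
    CORRELATIONS
      /VARIABLES=gender01 score
      /PRINT=TWOTAIL NOSIG.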
Spearman rank correlation

The Spearman rank correlation is a nonparametric equivalent to the Pearson correlation. The Pearson correlation assumes that both of your variables have normal distributions. If this assumption is violated for either of your variables, you may choose to perform a Spearman rank correlation instead. However, the Spearman rank correlation is a less powerful measure of association, so the standard Pearson correlation is commonly used even when the variables of interest are moderately nonnormal. The Spearman rank correlation is typically preferred over Kendall's tau, another nonparametric correlation measure, because its scaling is more consistent with that of the standard Pearson correlation.

To perform a Spearman rank correlation in SPSS
• Choose Analyze → Correlate → Bivariate.
• Move the variables you want to correlate to the Variables box.
• Check the box next to Spearman.
• Click the OK button.

The output of this analysis will contain the following section.
• Correlations. This section contains the correlation matrix of the variables you selected. The Spearman rank correlations can be interpreted in exactly the same way as a standard Pearson correlation. Below each correlation, SPSS provides a p-value testing whether the correlation is significantly different from zero, along with the sample size used to compute the correlation.
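The menu choices above correspond to the NONPAR CORR command in syntax. The variable names in this sketch are again placeholders.

    * Spearman rank correlation between two variables.
    NONPAR CORR
      /VARIABLES=height weight
      /PRINT=SPEARMAN TWOTAIL NOSIG
      /MISSING=PAIRWISE.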
DESCRIPTIVE STATISTICS

Analyses often begin by examining basic descriptive information about the data. The most common and useful descriptive statistics are
• Mean
• Median
• Mode
• Frequency
• Quartiles
• Sum
• Variance
• Standard deviation
• Minimum/Maximum
• Range

Note: All of these are appropriate for continuous variables, while frequency and mode are also appropriate for categorical variables.

If you just want to obtain the mean and standard deviation for a set of variables
• Choose Analyze → Descriptive Statistics → Descriptives.
• Move the variables of interest to the Variable(s) box.
• Click the OK button.

If you want to obtain any other statistics
• Choose Analyze → Descriptive Statistics → Frequencies.
• Move the variables of interest to the Variable(s) box.
• Click the Statistics button.
• Check the boxes next to the statistics you want.
• Click the Continue button.
• Click the OK button.
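The equivalent syntax is sketched below for two placeholder variables named age and income. The first command produces means and standard deviations; the second requests a wider set of statistics (including quartiles via /NTILES=4) while suppressing the frequency tables themselves.

    * Means, standard deviations, and minimum and maximum values only.
    DESCRIPTIVES VARIABLES=age income
      /STATISTICS=MEAN STDDEV MIN MAX.

    * A fuller set of descriptive statistics, without printing the full frequency tables.
    FREQUENCIES VARIABLES=age income
      /FORMAT=NOTABLE
      /STATISTICS=MEAN MEDIAN MODE SUM VARIANCE STDDEV RANGE MINIMUM MAXIMUM
      /NTILES=4.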
CHI-SQUARE TEST OF INDEPENDENCE

A chi-square test of independence is a nonparametric test used to determine whether there is a relationship between two categorical variables. Let's take a simple example. Suppose a researcher brought male and female participants into the lab and asked them which color they prefer, blue or green. The researcher believes that color preference may be related to gender. Notice that both gender (male, female) and color preference (blue, green) are categorical variables. If there is a relationship between gender and color preference, we would expect that the proportion of men who prefer blue would be different from the proportion of women who prefer blue. In general, you have a relationship between two categorical variables when the distribution of people across the categories of the first variable changes across the different categories of the second variable.

To determine whether a relationship exists between gender and color preference, the chi-square test computes the distribution across the combination of your two factors that you would expect if there were no relationship between them. It then compares this to the actual distribution found in your data. In the example above, we have a 2 (gender: male, female) x 2 (color preference: green, blue) design. For each cell in the combination of the two factors, we would compute "observed" and "expected" counts. The observed counts are simply the actual number of observations found in each of the cells. The expected proportion in each cell can be determined by multiplying together the marginal proportions of the table. For example, let us say that 52% of all of the participants preferred blue and 48% preferred green, whereas 40% of the participants were men and 60% were women.
The expected proportions are presented in the table below.

Expected proportion table
                      Males    Females   Marginal proportion
Blue                  20.8%    31.2%     52%
Green                 19.2%    28.8%     48%
Marginal proportion   40%      60%

As you can see, you get the expected proportion for a particular cell by multiplying the two marginal proportions together. You would then determine the expected count for each cell by multiplying the expected proportion by the total number of participants in your study.
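For example, if the study had included a hypothetical total of 200 participants, the expected count for men who prefer blue would be .208 × 200 = 41.6, and the expected count for women who prefer green would be .288 × 200 = 57.6. Expected counts do not need to be whole numbers.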
The chi-square statistic is a function of the difference between the expected and observed counts across all of your cells. Luckily, you do not actually need to calculate any of this by hand, since SPSS will compute the expected counts for each cell and perform the chi-square test.

To perform a chi-square test of independence in SPSS
• Choose Analyze → Descriptive Statistics → Crosstabs.
• Put one of the variables in the Row(s) box.
• Put the other variable in the Column(s) box.
• Click the Statistics button.
• Check the box next to Chi-square.
• Click the Continue button.
• Click the OK button.

The output of this analysis will contain the following sections.
• Case Processing Summary. Provides information about missing values in your two variables.
• Crosstabulation. Provides you with the observed counts within each combination of your two variables.
• Chi-Square Tests. The first row of this table gives you the chi-square value, its degrees of freedom, and the p-value associated with the test. Note that the p-values produced by a chi-square test are inappropriate if the expected count is less than 5 in 20% or more of the cells. If you are in this situation, you should either redefine your coding scheme (combining the categories with low cell counts with other categories) or exclude categories with low cell counts from your analysis.
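In syntax form, the same analysis might look like the sketch below, where gender and preference are placeholder variable names. The /CELLS subcommand also requests the expected counts so you can check the low-expected-count condition described above.

    * Chi-square test of independence, printing observed and expected cell counts.
    CROSSTABS
      /TABLES=gender BY preference
      /STATISTICS=CHISQ
      /CELLS=COUNT EXPECTED.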