Assignment #
Deadline: Day 22/10/2017 @ 23:59
[Total Mark for this Assignment is 25]
System Analysis and Design
IT 243
College of Computing and Informatics
Question One
5 Marks
Learning Outcome(s):
Understand the need for feasibility analysis in project approval
and its types
What is feasibility analysis? List and briefly discuss three kinds
of feasibility analysis.
Question Two
5 Marks
Learning Outcome(s):
Understand the various cost incurred in project development
How can you classify costs? Describe each cost classification
and provide a typical example of each category.
Question Three
5 Marks
Learning Outcome(s):
System Development Life Cycle methodologies (Waterfall &
Prototyping)
There are several development methodologies for the System
Development Life Cycle (SDLC). Among these are the
Waterfall and System Prototyping models. Compare the two
methodologies in detail in terms of the following criteria.
Criteria                  Waterfall    System Prototyping
Description
Requirements clarity
System complexity
Project time schedule
Question Four
5 Marks
Learning Outcome(s):
Understand JAD Session and its procedure
What is a JAD session? Describe the five major steps in
conducting JAD sessions.
Question Five
5 Marks
Learning Outcome(s):
Ability to distinguish between functional and non-functional
requirements
State what is meant by functional and non-functional
requirements. What are the primary types of non-functional
requirements? Give two examples of each. What role do
non-functional requirements play in the project overall?
4 - PRELIMINARY DATA SCREENING
4.1 Introduction: Problems in Real Data
Real datasets often contain errors, inconsistencies in responses
or measurements, outliers, and missing values. Researchers
should conduct thorough preliminary data screening to identify
and remedy potential problems with their data prior to running
the data analyses that are of primary interest. Analyses based on
a dataset that contains errors, or data that seriously violate
assumptions that are required for the analysis, can yield
misleading results.
Some of the potential problems with data are as follows: errors
in data coding and data entry, inconsistent responses, missing
values, extreme outliers, nonnormal distribution shapes, within-
group sample sizes that are too small for the intended analysis,
and nonlinear relations between quantitative variables.
Problems with data should be identified and remedied (as
adequately as possible) prior to analysis. A research report
should include a summary of problems detected in the data and
any remedies that were employed (such as deletion of outliers
or data transformations) to address these problems.
4.2 Quality Control During Data Collection
There are many different possible methods of data collection. A
psychologist may collect data on personality or attitudes by
asking participants to answer questions on a questionnaire. A
medical researcher may use a computer-controlled blood
pressure monitor to assess systolic blood pressure (SBP) or
other physiological responses. A researcher may record
observations of animal behavior. Physical measurements (such
as height or weight) may be taken. Most methods of data
collection are susceptible to recording errors or artifacts, and
researchers need to know what kinds of errors are likely to
occur.
For example, researchers who use self-report data to do research
on personality or attitudes need to be aware of common
problems with this type of data. Participants may distort their
answers because of social desirability bias; they may
misunderstand questions; they may not remember the events that
they are asked to report about; they may deliberately try to
“fake good” or “fake bad”; they may even make random
responses without reading the questions. A participant may
accidentally skip a question on a survey and, subsequently, use
the wrong lines on the answer sheet to enter each response; for
example, the response to Question 4 may be filled in as Item 3
on the answer sheet, the response to Question 5 may be filled in
as Item 4, and so forth. In addition, research assistants have
been known to fill in answer sheets themselves instead of
having the participants complete them. Good quality control in
the collection of self-report data requires careful consideration
of question wording and response format and close supervision
of the administration of surveys. Converse and Presser (1999);
Robinson, Shaver, and Wrightsman (1991); and Stone, Turkkan,
Kurtzman, Bachrach, and Jobe (1999) provide more detailed
discussion of methodological issues in the collection of self-
report data.
For observer ratings, it is important to consider issues of
reactivity (i.e., the presence of an observer may actually change
the behavior that is being observed). It is important to establish
good interobserver reliability through training of observers and
empirical assessment of interobserver agreement. See Chapter
21 in this book for discussion of reliability, as well as Aspland
and Gardner (2003), Bakeman (2000), Gottman and Notarius
(2002), and Reis and Gable (2000) for further discussion of
methodological issues in the collection of observational data.
For physiological measures, it is necessary to screen for
artifacts (e.g., when electroencephalogram electrodes are
attached near the forehead, they may detect eye blinks as well
as brain activity; these eye blink artifacts must be removed from
the electroencephalogram signal prior to other processing). See
Cacioppo, Tassinary, and Berntson (2000) for methodological
issues in the collection of physiological data.
This discussion does not cover all possible types of
measurement problems, of course; it only mentions a few of the
many possible problems that may arise in data collection.
Researchers need to be aware of potential problems or sources
of artifact associated with any data collection method that they
use, whether they use data from archival sources, experiments
with animals, mass media, social statistics, or other methods not
mentioned here.
4.3 Example of an SPSS Data Worksheet
The dataset used to illustrate data-screening procedures in this
chapter is named bpstudy.sav. The scores appear in Table 4.1,
and an image of the corresponding SPSS worksheet appears in
Figure 4.1.
This file contains selected data from a dissertation that assessed
the effects of social stress on blood pressure (Mooney, 1990).
The most important features in Figure 4.1 are as follows. Each
row in the Data View worksheet corresponds to the data for 1
case or 1 participant. In this example, there are a total of N = 65
participants; therefore, the dataset has 65 rows. Each column in
the Data View worksheet corresponds to a variable; the SPSS
variable names appear along the top of the data worksheet. In
Figure 4.1, scores are given for the following SPSS variables:
idnum, GENDER, SMOKE (smoking status), AGE, SYS1
(systolic blood pressure or SBP at Time 1), DIA1 (diastolic
blood pressure, DBP, at Time 1), HR1 (heart rate at Time 1),
and WEIGHT. The numerical values contained in this data file
were typed into the SPSS Data View worksheet by hand.
Table 4.1 Data for the Blood Pressure/Social Stress Study
SOURCE: Mooney (1990).
NOTES: 1. idnum = arbitrary, unique identification number for
each participant. 2. GENDER was coded 1 = male, 2 = female.
3. SMOKE was coded 1 = nonsmoker, 2 = light smoker, 3 =
moderate smoker, 4 = heavy or regular smoker. 4. AGE = age in
years. 5. SYS1 = systolic blood pressure at Time 1/baseline. 6.
DIA1 = diastolic blood pressure at Time 1/baseline. 7. HR1 =
heart rate at Time 1/baseline. 8. WEIGHT = body weight in
pounds.
The menu bar across the top of the SPSS Data View worksheet
in Figure 4.1 can be used to select menus for different types of
procedures. The pull-down menu for <File> includes options
such as opening and saving data files. The pull-down menus for
<Analyze> and <Graphs> provide access to SPSS procedures for
data analysis and graphics, respectively.
The two tabs near the lower left-hand corner of the Data View
of the SPSS worksheet can be used to toggle back and forth
between the Data View (shown in Figure 4.1) and the Variable
View (shown in Figure 4.2) versions of the SPSS data file.
The Variable View of an SPSS worksheet, shown in Figure 4.2,
provides a place to document and describe the characteristics of
each variable, to supply labels for variables and score values,
and to identify missing values.
For example, examine the row of the Variable View worksheet
that corresponds to the variable named GENDER. The scores on
this variable were numerical; that is, the scores are in the form
of numbers (rather than alphabetic characters). Other possible
variable types include dates or string variables that consist of
alphabetic characters instead of numbers. If the researcher
needs to identify a variable as string or date type, he or she
clicks on the cell for Variable Type and selects the appropriate
variable type from the pull-down menu list. In the datasets used
as examples in this textbook, almost all the variables are
numerical.
Figure 4.1 SPSS Worksheet for the Blood Pressure/Social
Stress Study (Data View) in bpstudy.sav
SOURCE: Mooney (1990).
The Width column indicates how many significant digits the
scores on each variable can have. For this example, the
variables GENDER and SMOKE were each allowed a one-digit
code, the variable AGE was allowed a two-digit code, and the
remaining variables (heart rate, blood pressure, and body
weight) were each allowed three digits. The Decimals column
indicates how many digits are displayed after the decimal point.
All the variables in this dataset (such as age in years and body
weight in pounds) are given to the nearest integer value, and so
all these variables are displayed with 0 digits to the right of the
decimal place. If a researcher has a variable, such as grade point
average (GPA), that is usually reported to two decimal places
(as in GPA = 2.67), then he or she would select 2 as the number
of digits to display after the decimal point.
Figure 4.2 SPSS Worksheet for the Blood Pressure/Social
Stress Study (Variable View)
The next column, Label, provides a place where each variable
name can be associated with a longer descriptive label. This is
particularly helpful when brief SPSS variable names are not
completely self-explanatory. For example, “body weight in
pounds” appears as a label for the variable WEIGHT. The
Values column provides a place where labels can be associated
with the individual score values of each variable; this is
primarily used with nominal or categorical variables. Figure 4.3
shows the dialog window that opens up when the user clicks on
the cell for Values for the variable GENDER. To associate each
score with a verbal label, the user types in the score (such as 1)
and the corresponding verbal label (such as male) and then
clicks the Add button to add this label to the list of value labels.
When all the labels have been specified, clicking on OK returns
to the main Variable View worksheet. In this example, a score
of 1 on GENDER corresponds to male and a score of 2 on
GENDER corresponds to female.
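The idea behind the Values column can be sketched outside SPSS as a simple code-to-label mapping. This is an illustrative Python sketch, not SPSS itself; the dictionary mirrors the GENDER coding from Table 4.1, and the behavior for an unlabeled code is an assumption for illustration.

```python
# Value labels as a dictionary, mirroring the Values column in Variable View.
# Coding taken from Table 4.1: 1 = male, 2 = female.
GENDER_LABELS = {1: "male", 2: "female"}

def label_score(score, labels):
    """Return the verbal label for a numeric code, or flag an unlabeled code."""
    return labels.get(score, f"unlabeled code {score}")

print(label_score(1, GENDER_LABELS))  # male
print(label_score(3, GENDER_LABELS))  # unlabeled code 3
```

An unlabeled code surfacing this way is often the first hint of a data-entry error of the kind discussed in Section 4.4.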
The column headed Missing provides a place to identify scores
as codes for missing values. Consider the following example to
illustrate the problem that arises in data analysis when there are
missing values. Suppose that a participant did not answer the
question about body weight. If the data analyst enters a value of
0 for the body weight of this person who did not provide
information about body weight and does not identify 0 as a code
for a missing value, this value of 0 would be included when
SPSS sums the scores on body weight to compute a mean for
weight. The sample mean is not robust to outliers; that is, a
sample mean for body weight will be substantially lower when a
value of 0 is included for a participant than it would be if that
value of 0 was excluded from the computation of the sample
mean. What should the researcher do to make sure that missing
values are not included in the computation of sample statistics?
SPSS provides two different ways to handle missing score
values. The first option is to leave the cell in the SPSS Data
View worksheet that corresponds to the missing score blank. In
Figure 4.1, participant number 12 did not answer the question
about smoking status; therefore, the cell that corresponds to the
response to the variable SMOKE for Participant 12 was left
blank. By default, SPSS treats empty cells as “system missing”
values. If a table of frequencies is set up for scores on SMOKE,
the response for Participant 12 is labeled as a missing value. If
a mean is calculated for scores on smoking, the score for
Participant 12 is not included in the computation as a value 0;
instead, it is omitted from the computation of the sample mean.
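The distortion described above can be made concrete with a small sketch in Python (hypothetical weights, not the Mooney data): a missing weight miscoded as 0 is averaged in, whereas a truly blank value (here `None`, playing the role of an empty SPSS cell) is excluded.

```python
from statistics import mean

# A missing body weight entered as 0 drags the mean down;
# treating the value as missing (None, like a blank cell) excludes it.
weights_with_zero = [150, 180, 165, 0]      # nonresponse miscoded as 0
weights_with_none = [150, 180, 165, None]   # nonresponse left blank

mean_wrong = mean(weights_with_zero)                              # 123.75
mean_right = mean(w for w in weights_with_none if w is not None)  # 165
```

The "wrong" mean is pulled far below every valid observation, which is exactly why a 0 must be declared as a missing-value code if it is ever entered.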
Figure 4.3 Value Labels for Gender
A second method is available to handle missing values; it is
possible to use different code numbers to represent different
types of missing data. For example, a survey question that is a
follow-up about the amount and frequency of smoking might be
coded 9 if it was not applicable to an individual (because that
individual never smoked), 99 if the question was not asked
because the interviewer ran out of time, and 88 if the
respondent refused to answer the question. For the variable
WEIGHT, body weight in pounds, a score of 999 was identified
as a missing value by clicking on the cell for Missing and then
typing a score of 999 into one of the windows for missing
values; see the Missing Values dialog window in Figure 4.4. A
score of 999 is defined as a missing value code, and therefore,
these scores are not included when statistics are calculated. It is
important to avoid using codes for missing values that
correspond to possible valid responses. Consider the question,
How many children are there in a household? It would not make
sense to use a score of 0 or a score of 9 as a code for missing
values, because either of these could correspond to the number
of children in some households. It would be acceptable to use a
code of 99 to represent a missing value for this variable because
no single-family household could have such a large number of
children.
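The logic of user-declared missing-value codes can be sketched as a filter: any score matching a declared sentinel is dropped before statistics are computed. This Python sketch assumes the single code 999 used for WEIGHT in Figure 4.4; the function name is hypothetical.

```python
# Declared missing-value codes, as in Figure 4.4 (999 = missing for WEIGHT).
MISSING_CODES = {999}

def valid_scores(scores, missing_codes=MISSING_CODES):
    """Drop any score that matches a declared missing-value code."""
    return [s for s in scores if s not in missing_codes]

weights = [150, 999, 180, 165]
print(valid_scores(weights))  # [150, 180, 165]
```

With multiple codes (e.g., 9, 88, 99 for different reasons for missingness), the set simply grows; the key design point from the text is that no code may coincide with a possible valid response.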
The next few columns in the SPSS Variable View worksheet
provide control over the way the values are displayed in the
Data View worksheet. The column headed Columns indicates
the display width of each column in the SPSS Data View
worksheet, in number of characters. This was set at eight
characters wide for most variables. The column headed Align
indicates whether scores will be shown left justified, centered,
or (as in this example) right justified in each column of the
worksheet.
Finally, the column in the Variable View worksheet that is
headed Measure indicates the level of measurement for each
variable. SPSS designates each numerical variable as nominal,
ordinal, or scale (scale is equivalent to interval/ratio level of
measurement, as described in Chapter 1 of this textbook). In
this sample dataset, idnum (an arbitrary and unique
identification number for each participant) and GENDER were
identified as categorical or nominal variables. Smoking status
(SPSS variable name SMOKE) was coded on an ordinal scale
from 1 to 4, with 1 = nonsmoker, 2 = light smoker, 3 = moderate
smoker, and 4 = heavy smoker. The other variables (heart rate,
blood pressure, body weight, and age) are quantitative and
interval/ratio, so they were designated as “scale” in level of
measurement.
Figure 4.4 Missing Values for Weight
4.4 Identification of Errors and Inconsistencies
The SPSS data file should be proofread and compared with
original data sources (if these are accessible) to correct errors in
data coding or data entry. For example, if self-report data are
obtained using computer-scorable answer sheets, the
correspondence between scores on these answer sheets and
scores in the SPSS data file should be verified. This may
require proofreading data line by line and comparing the scores
with the data on the original answer sheets. It is helpful to have
a unique code number associated with each case so that each
line in the data file can be matched with the corresponding
original data sheet.
Even when line-by-line proofreading has been done, it is useful
to run simple exploratory analyses as an additional form of data
screening. Rosenthal (cited in D. B. Wright, 2003) called this
process of exploration “making friends with your data.”
Subsequent sections of this chapter show how examination of
the frequency distribution tables and graphs provides an
overview of the characteristics of people in this sample—for
example, how many males and females were included in the
study, how many nonsmokers versus heavy smokers were
included, and the range of scores on physiological responses
such as heart rate.
Examining response consistency across questions or
measurements is also useful. If a person chooses the response “I
have never smoked” to one question and then reports smoking
10 cigarettes on an average day in another question, these
responses are inconsistent. If a participant’s responses include
numerous inconsistencies, the researcher may want to consider
removing that participant’s data from the data file.
On the basis of knowledge of the variables and the range of
possible response alternatives, a researcher can identify some
responses as “impossible” or “unlikely.” For example, if
participants are provided a choice of the following responses to
a question about smoking status: 1 = nonsmoker, 2 = light
smoker, 3 = moderate smoker, and 4 = heavy smoker, and a
participant marks response number “6” for this question, the
value of 6 does not correspond to any of the response
alternatives provided for the question. When impossible,
unlikely, or inconsistent responses are detected, there are
several possible remedies. First, it may be possible to go back
to original data sheets or experiment logbooks to locate the
correct information and use it to replace an incorrect score
value. If that is not possible, the invalid score value can be
deleted and replaced with a blank cell entry or a numerical code
that represents a missing value. It is also possible to select out
(i.e., temporarily or permanently remove) cases that have
impossible or unlikely scores.
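A simple validity screen of the kind described here can be sketched as a range check: any score outside the set of legitimate response alternatives is replaced with a blank (system-missing) value. The SMOKE coding below follows Table 4.1; the data and function name are hypothetical.

```python
# SMOKE responses must be 1-4 (1 = nonsmoker ... 4 = heavy smoker),
# so a recorded 6 is an impossible value and is blanked out.
VALID_SMOKE_CODES = {1, 2, 3, 4}

def screen_impossible(scores, valid):
    """Replace impossible codes with None (a blank, system-missing cell)."""
    return [s if s in valid else None for s in scores]

smoke = [1, 4, 6, 2]
print(screen_impossible(smoke, VALID_SMOKE_CODES))  # [1, 4, None, 2]
```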
4.5 Missing Values
Journal editors and funding agencies now expect more
systematic evaluation of missing values than was customary in
the past. SPSS has a Missing Values add-on procedure to assess
the amount and pattern of missing data and replace missing
scores with imputed values. Research proposals should include
a plan for identification and handling of missing data; research
reports should document the amount and pattern of missing data
and imputation procedures for replacement. Within the SPSS
program, an empty or blank cell in the data worksheet is
interpreted as a System Missing value. Alternatively, as
described earlier, the Missing Value column in the Variable
View worksheet in SPSS can be used to identify some specific
numerical codes as missing values and to use different
numerical codes to correspond to different types of missing
data. For example, for a variable such as verbal Scholastic
Aptitude Test (SAT) score, codes such as 888 = student did not
take the SAT and 999 = participant refused to answer could be
used to indicate different reasons for the absence of a valid
score.
Ideally, a dataset should have few missing values. A systematic
pattern of missing observations suggests possible bias in
nonresponse. For example, males might be less willing than
females to answer questions about negative emotions such as
depression; students with very low SAT scores may refuse to
provide information about SAT performance more often than
students with high SAT scores. To assess whether missing
responses on depression are more common among some groups
of respondents or are associated with scores on some other
variable, the researcher can set up a variable that is coded 1
(respondent answered a question about depression) versus 0
(respondent did not answer the question about depression).
Analyses can then be performed to see whether this variable,
which represents missing versus nonmissing data on one
variable, is associated with scores on any other variable. If the
researcher finds, for example, that a higher proportion of men
than women refused to answer a question about depression, it
signals possible problems with generalizability of results; for
example, conclusions about depression in men can be
generalized only to the kinds of men who are willing to answer
such questions.
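The indicator-variable approach described above can be sketched in Python with hypothetical data: code each case 1 if the depression item was answered and 0 if not, then compare answer rates across groups.

```python
# Hypothetical data: depression scores (None = no answer) and gender codes.
depression = [22, None, None, 28, 30, 25]
gender     = [1, 1, 1, 2, 2, 2]          # 1 = male, 2 = female

# Missing-data indicator: 1 = answered, 0 = did not answer.
answered = [0 if d is None else 1 for d in depression]

def answer_rate(group_code):
    """Proportion of the group that answered the depression item."""
    flags = [a for a, g in zip(answered, gender) if g == group_code]
    return sum(flags) / len(flags)

print(answer_rate(1), answer_rate(2))  # males: 1/3, females: 1.0
```

A large gap between the groups, as in this toy example, is the pattern that would signal the generalizability problem the text describes.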
It is useful to assess whether specific individual participants
have large numbers of missing scores; if so, data for these
participants could simply be deleted. Similarly, it may be useful
to see whether certain variables have very high nonresponse
rates; it may be necessary to drop these variables from further
analysis.
When analyses involving several variables (such as
computations of all possible correlations among a set of
variables) are performed in SPSS, it is possible to request either
listwise or pairwise deletion. For example, suppose that the
researcher wants to use the bivariate correlation procedure in
SPSS to run all possible correlations among variables named
V1, V2, V3, and V4. If listwise deletion is chosen, the data for
a participant are completely ignored when all these correlations
are calculated if the participant has a missing score on any one
of the variables included in the list. In pairwise deletion, each
correlation is computed using data from all the participants who
had nonmissing values on that particular pair of variables. For
example, suppose that there is one missing score on the variable
V1. If listwise deletion is chosen, then the data for the
participant who had a missing score on V1 are not used to
compute any of the correlations (between V1 and V2, V2 and
V3, V2 and V4, etc.). On the other hand, if pairwise deletion is
chosen, the data for the participant who is missing a score on
V1 cannot be used to calculate any of the correlations that
involve V1 (e.g., V1 with V2, V1 with V3, V1 with V4), but the
data from this participant will be used when correlations that
don’t require information about V1 are calculated (correlations
between V2 and V3, V2 and V4, and V3 and V4). When using
listwise deletion, the same number of cases and subset of
participants are used to calculate all the correlations for all
pairs of variables. When using pairwise deletion, depending on
the pattern of missing values, each correlation may be based on
a different N and a different subset of participants than those
used for other correlations.
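The difference between the two deletion rules shows up directly in the case counts. This Python sketch (hypothetical rows; `None` = missing) counts the N that listwise and pairwise deletion would each make available, which is the quantity that varies across correlations under pairwise deletion.

```python
# Four cases with scores on V1-V3; None marks a missing value.
rows = [
    {"V1": 5,    "V2": 7,    "V3": 2},
    {"V1": None, "V2": 6,    "V3": 3},   # missing V1
    {"V1": 4,    "V2": None, "V3": 1},   # missing V2
    {"V1": 6,    "V2": 8,    "V3": 5},
]

def listwise_n(rows, variables):
    """Cases with no missing score on ANY variable in the list."""
    return sum(all(r[v] is not None for v in variables) for r in rows)

def pairwise_n(rows, a, b):
    """Cases with nonmissing scores on this particular pair of variables."""
    return sum(r[a] is not None and r[b] is not None for r in rows)

print(listwise_n(rows, ["V1", "V2", "V3"]))  # 2: every correlation uses N = 2
print(pairwise_n(rows, "V2", "V3"))          # 3: this pair keeps an extra case
print(pairwise_n(rows, "V1", "V2"))          # 2
```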
The default for handling missing data in most SPSS procedures
is listwise deletion. The disadvantage of listwise deletion is that
it can result in a rather small N of participants, and the
advantage is that all correlations are calculated using the same
set of participants. Pairwise deletion can be selected by the
user, and it preserves the maximum possible N for the
computation of each correlation; however, both the number of
participants and the composition of the sample may vary across
correlations, and this can introduce inconsistencies in the values
of the correlations (as described in more detail by Tabachnick &
Fidell, 2007).
When a research report includes a series of analyses and each
analysis includes a different set of variables, the N of scores
that are included may vary across analyses (because different
people have missing scores on each variable). This can raise a
question in readers’ minds: Why do the Ns change across pages
of the research report? When there are large numbers of missing
scores, quite different subsets of data may be used in each
analysis, and this may make the results not comparable across
analyses. To avoid these potential problems, it may be
preferable to select out all the cases that have missing values on
all the variables that will be used ahead of time, so that the
same subset of participants (and the same N of scores) are used
in all the analyses in a paper.
The default in SPSS is that cases with system missing values or
scores that are specifically identified as missing values are
excluded from computations, but this can result in a substantial
reduction in the sample size for some analyses. Another way to
deal with missing data is by substitution of a reasonable
estimated score value to replace each missing response. Missing
value replacement can be done in many different ways; for
example, the mean score on a variable can be substituted for all
missing values on that variable, or estimated values can be
calculated separately for each individual participant using
regression methods to predict that person’s missing score from
her or his scores on other, related variables. This is often called
imputation of missing data. Procedures for missing value
replacement can be rather complex (Schafer, 1997, 1999;
Schafer & Olsen, 1998).
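The simplest replacement method mentioned above, mean substitution, can be sketched in a few lines of Python (hypothetical scores; regression-based imputation, as in the sources cited, is considerably more involved).

```python
from statistics import mean

def impute_mean(scores):
    """Replace each missing (None) score with the mean of the observed scores."""
    observed = [s for s in scores if s is not None]
    m = mean(observed)
    return [m if s is None else s for s in scores]

print(impute_mean([10, None, 14, 12]))  # [10, 12, 14, 12]
```

Note the cost the text warns about: every imputed case sits exactly at the mean, which shrinks the variable's variance relative to the complete data, one reason imputation is "a choice among several bad alternatives."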
Tabachnick and Fidell (2007) summarized their discussion of
missing value replacement by saying that the seriousness of the
problem of missing values depends on “the pattern of missing
data, how much is missing, and why it is missing” (p. 62). They
also noted that the decision about how to handle missing data
(e.g., deletion of cases or variables, or estimation of scores to
replace missing values) is “a choice among several bad
alternatives” (p. 63). If some method of imputation or
estimation is employed to replace missing values, it is desirable
to repeat the analysis with the missing values omitted. Results
are more believable, of course, if they are essentially the same
with and without the replacement scores.
4.6 Empirical Example of Data Screening for Individual
Variables
In this textbook, variables are treated as either categorical or
quantitative (see Chapter 1 for a review of this distinction).
Different types of graphs and descriptive statistics are
appropriate for use with categorical versus quantitative
variables, and for that reason, data screening is discussed
separately for categorical and quantitative variables.
4.6.1 Frequency Distribution Tables
For both categorical (nominal) and quantitative (scale)
variables, a table of frequencies can be obtained to assess the
number of persons or cases who had each different score value.
These frequencies can be converted to proportions or
percentages. Examination of a frequency table quickly provides
answers to the following questions about each categorical
variable: How many groups does this variable represent? What
is the number of persons in each group? Are there any groups
with ns that are too small for the group to be used in analyses
that compare groups (e.g., analysis of variance, ANOVA)?
If a group with a very small number (e.g., 10 or fewer) cases is
detected, the researcher needs to decide what to do with the
cases in that group. The group could be dropped from all
analyses, or if it makes sense to do so, the small n group could
be combined with one or more of the other groups (by recoding
the scores that represent group membership on the categorical
variable).
For both categorical and quantitative variables, a frequency
distribution also makes it possible to see if there are any
“impossible” score values. For instance, if the categorical
variable GENDER on a survey has just two response options, 1
= male and 2 = female, then scores of “3” and higher are not
valid or interpretable responses. Impossible score values should
be detected during proofreading, but examination of frequency
tables provides another opportunity to see if there are any
impossible score values on categorical variables.
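The kind of frequency table produced by the SPSS Frequencies procedure can be sketched with Python's standard library (hypothetical responses): an invalid code stands out immediately as a category with a nonzero count.

```python
from collections import Counter

# GENDER responses coded 1 = male, 2 = female; a 3 is an impossible value.
gender = [1, 2, 2, 1, 3, 2, 1]
freq = Counter(gender)
print(sorted(freq.items()))  # [(1, 3), (2, 3), (3, 1)]
```

The single case with a score of 3 is exactly the sort of entry that the frequency table for GENDER in Figure 4.7 reveals.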
Figure 4.5 SPSS Menu Selections: <Analyze> → <Descriptive
Statistics> → <Frequencies>
Figure 4.6 SPSS Dialog Window for the Frequencies
Procedure
SPSS was used to obtain a frequency distribution table for the
variables GENDER and AGE. Starting from the data worksheet
view (as shown in Figure 4.1), the following menu selections
(as shown in Figure 4.5) were made: <Analyze> → <Descriptive
Statistics> → <Frequencies>.
The SPSS dialog window for the Frequencies procedure appears
in Figure 4.6. To specify which variables are included in the
request for frequency tables, the user points to the names of the
two variables (GENDER and AGE) and clicks the right-pointing
arrow to move these variable names into the right-hand window.
Output from this procedure appears in Figure 4.7.
In the first frequency table in Figure 4.7, there is one
“impossible” response for GENDER. The response alternatives
provided for the question about gender were 1 = male and 2 =
female, but a response of 3 appears in the summary table; this
does not correspond to a valid response option. In the second
frequency table in Figure 4.7, there is an extreme score (88
years) for AGE. This is a possible, but unusual, age for a
college student. As will be discussed later in this chapter,
scores that are extreme or unusual are often identified as
outliers and sometimes removed from the data prior to doing
other analyses.
4.6.2 Removal of Impossible or Extreme Scores
In SPSS, the Select Cases command can be used to remove
cases from a data file prior to other analyses. To select out the
participant with a score of “3” for GENDER and also the
participant with an age of 88, the following SPSS menu
selections (see Figure 4.8) would be used: <Data> → <Select
Cases>.
The initial SPSS dialog window for Select Cases appears in
Figure 4.9.
A logical “If” conditional statement can be used to exclude
specific cases. For example, to exclude the data for the person
who reported a value of “3” for GENDER, click the radio button
for “If condition is satisfied” in the first Select Cases dialog
window. Then, in the “Select Cases: If” window, type in the
logical condition “GENDER ~= 3.” The symbol “~=” represents
the logical comparison “not equal to”; thus, this logical “If”
statement tells SPSS to include the data for all participants
whose scores for GENDER are not equal to “3.” The entire line
of data for the person who reported “3” as a response to
GENDER is (temporarily) filtered out or set aside as a result of
this logical condition.
It is possible to specify more than one logical condition. For
example, to select cases that have valid scores on GENDER and
that do not have extremely high scores on AGE, we could set up
the logical condition “GENDER ~= 3 and AGE < 70,” as shown
in Figures 4.10 and 4.11. SPSS evaluates this logical statement
for each participant. Any participant with a score of “3” on
GENDER and any participant with a score greater than or equal
to 70 on AGE is excluded or selected out by this Select If
statement.
When a case has been selected out using the Select Cases
command, a crosshatch mark appears over the case number for
that case (on the far left-hand side of the SPSS data worksheet).
Cases that are selected out can be temporarily filtered or
permanently deleted. In Figure 4.12, the SPSS data worksheet is
shown as it appears after the execution of the Data Select If
commands just described. Case number 11 (a person who had a
score of 3 on GENDER) and case number 15 (a person who had
a score of 88 on AGE) are now shown with a crosshatch mark
through the case number in the left-hand column of the data
worksheet. This crosshatch indicates that unless the Select If
condition is explicitly removed, the data for these 2 participants
will be excluded from all future analyses. Note that the original
N of 65 cases has been reduced to an N of 63 by this Data
Select If statement. If the researcher wants to restore
temporarily filtered cases to the sample, it can be done by
selecting the radio button for All Cases in the Select Cases
dialog window.
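The Select Cases logic "GENDER ~= 3 and AGE < 70" amounts to keeping only the rows that satisfy the condition. This Python sketch uses hypothetical rows echoing the two filtered cases described above; a case failing either part of the conjunction is set aside.

```python
# Cases echoing the example: id 11 has the impossible GENDER code,
# id 15 the extreme AGE, and both fail the selection condition.
cases = [
    {"id": 10, "GENDER": 1, "AGE": 21},
    {"id": 11, "GENDER": 3, "AGE": 20},   # impossible gender code
    {"id": 15, "GENDER": 2, "AGE": 88},   # extreme age
    {"id": 16, "GENDER": 2, "AGE": 19},
]

# Equivalent of the SPSS "If" rule: GENDER ~= 3 and AGE < 70.
kept = [c for c in cases if c["GENDER"] != 3 and c["AGE"] < 70]
print([c["id"] for c in kept])  # [10, 16]
```

Unlike SPSS's temporary filtering (the crosshatched rows), this list comprehension drops the cases outright; restoring them would mean returning to the original `cases` list, analogous to reselecting All Cases.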
Figure 4.7 Output From the SPSS Frequencies Procedure
(Prior to Removal of “Impossible” Score Values)
4.6.3 Bar Chart for a Categorical Variable
For categorical or nominal variables, a bar chart can be used to
represent the distribution of scores graphically. A bar chart for
GENDER was created by making the following SPSS menu
selections (see Figure 4.13): <Graphs> → <Legacy Dialogs> →
<Bar [Chart]>.
Figure 4.8 SPSS Menu Selections for <Data> → <Select
Cases> Procedure
The first dialog window for the bar chart procedure appears in
Figure 4.14. In this example, the upper left box was clicked in
the Figure 4.14 dialog window to select the “Simple” type of
bar chart; the radio button was selected for “Summaries for
groups of cases”; then, the Define button was clicked. This
opened the second SPSS dialog window, which appears in
Figure 4.15.
Figure 4.9 SPSS Dialog Windows for the Select Cases
Command
Figure 4.10 Logical Criteria for Select Cases
NOTE: Include only persons who have a score for GENDER
that is not equal to 3 and who have a score for AGE that is less
than 70.
Figure 4.11 Appearance of the Select Cases Dialog Window
After Specification of the Logical “If” Selection Rule
To specify the form of the bar chart, use the cursor to highlight
the name of the variable that you want to graph, and click on
the arrow that points to the right to move this variable name
into the window under Category Axis. Leave the radio button
selection as the default choice, “Bars represent N of cases.”
This set of menu selections will yield a bar graph with one bar
for each group; for GENDER, this is a bar graph with one bar
for males and one for females. The height of each bar represents
the number of cases in each group. The output from this
procedure appears in Figure 4.16. Note that because the invalid
score of 3 has been selected out by the prior Select If statement,
this score value of 3 is not included in the bar graph in Figure
4.16. A visual examination of a set of bar graphs, one for each
categorical variable, is a useful way to detect impossible values.
The frequency table and bar graphs also provide a quick
indication of group size; in this dataset, there are N = 28 males
(score of 1 on GENDER) and N = 36 females (score of 2 on
GENDER). The bar chart in Figure 4.16, like the frequency
table in Figure 4.7, indicates that the male group had fewer
participants than the female group.
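The frequency information that the bar chart displays can also be tabulated directly. A minimal Python sketch (the GENDER codes are invented to match the counts reported above):

```python
from collections import Counter

# Hypothetical GENDER codes (1 = male, 2 = female); a bar chart
# plots one bar per distinct code, with height equal to its count.
gender = [1] * 28 + [2] * 36

counts = Counter(gender)
print(counts[1], counts[2])  # 28 males, 36 females

# An impossible code would show up as an unexpected bar/count:
valid_codes = {1, 2}
impossible = [code for code in counts if code not in valid_codes]
print(impossible)  # empty once invalid scores have been selected out
```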
Figure 4.12 Appearance of SPSS Data Worksheet After the
Select Cases Procedure in Figure 4.11
4.6.4 Histogram for a Quantitative Variable
For a quantitative variable, a histogram is a useful way to assess
the shape of the distribution of scores. As described in Chapter
3, many analyses assume that scores on quantitative variables
are at least approximately normally distributed. Visual
examination of the histogram is a way to evaluate whether the
distribution shape is reasonably close to normal or to identify
the shape of a distribution if it is quite different from normal.
In addition, summary statistics can be obtained to provide
information about central tendency and dispersion of scores.
The mean (M), median, or mode can be used to describe central
tendency; the range, standard deviation (s or SD), or variance
(s2) can be used to describe variability or dispersion of scores.
A comparison of means, variances, and other descriptive
statistics provides the information that a researcher needs to
characterize his or her sample and to judge whether the sample
is similar enough to some broader population of interest so that
results might possibly be generalizable to that broader
population (through the principle of “proximal similarity,”
discussed in Chapter 1). If a researcher conducts a political poll
and finds that the range of ages of persons in the sample is from
age 18 to 22, for instance, it would not be reasonable to
generalize any findings from that sample to populations of
persons older than age 22.
Figure 4.13 SPSS Menu Selections for the <Graphs> →
<Legacy Dialogs> → <Bar [Chart]> Procedure
Figure 4.14 SPSS Bar Charts Dialog Window
Figure 4.15 SPSS Define Simple Bar Chart Dialog Window
Figure 4.16 Bar Chart: Frequencies for Each Gender Category
When the distribution shape of a quantitative variable is
nonnormal, it is preferable to assess central tendency and
dispersion of scores using graphic methods that are based on
percentiles (such as a boxplot, also called a box and whiskers
plot). Issues that can be assessed by looking at frequency tables,
histograms, or box and whiskers plots for quantitative scores
include the following:
1. Are there impossible or extreme scores?
2. Is the distribution shape normal or nonnormal?
3. Are there ceiling or floor effects? Consider a set of test
scores. If a test is too easy and most students obtain scores of
90% and higher, the distribution of scores shows a “ceiling
effect”; if the test is much too difficult, most students will
obtain scores of 10% and below, and this would be called a
“floor effect.” Either of these would indicate a problem with the
measurement, in particular, a lack of sensitivity to individual
differences at the upper end of the distribution (when there is a
ceiling effect) or the lower end of the distribution (when there
is a floor effect).
4. Is there a restricted range of scores? For many measures,
researchers know a priori what the minimum and maximum
possible scores are, or they have a rough idea of the range of
scores. For example, suppose that Verbal SAT scores can range
from 250 to 800. If the sample includes scores that range from
550 to 580, the range of Verbal SAT scores in the sample is
extremely restricted compared with the range of possible scores.
Generally, researchers want a fairly wide range of scores on
variables that they want to correlate with other variables. If a
researcher wants to “hold a variable constant”—for example, to
limit the impact of age on the results of a study by including
only persons between 18 and 21 years of age—then a restricted
range would actually be preferred.
The procedures for obtaining a frequency table for a
quantitative variable are the same as those discussed in the
previous section on data screening for categorical variables.
Distribution shape for a quantitative variable can be assessed by
examining a histogram obtained by making these SPSS menu
selections (see Figure 4.17): <Graphs> → <Legacy Dialogs> →
<Histogram>.
These menu selections open the Histogram dialog window
displayed in Figure 4.18. In this example, the variable selected
for the histogram was HR1 (baseline or Time 1 heart rate).
Placing a checkmark in the box next to Display Normal Curve
requests a superimposed smooth normal distribution function on
the histogram plot. To obtain the histogram, after making these
selections, click the OK button. The histogram output appears in
Figure 4.19. The mean, standard deviation, and N for HR1
appear in the legend below the graph.
Figure 4.17 SPSS Menu Selections: <Graphs> → <Legacy
Dialogs> → <Histogram>
Figure 4.18 SPSS Dialog Window: Histogram Procedure
Figure 4.19 Histogram of Heart Rates With Superimposed
Normal Curve
An assumption common to all the parametric analyses covered
in this book is that scores on quantitative variables should be
(at least approximately) normally distributed. In practice, the
normality of distribution shape is usually assessed visually; a
histogram of scores is examined to see whether it is
approximately “bell shaped” and symmetric. Visual examination
of the histogram in Figure 4.19 suggests that the distribution
shape is not exactly normal; it is slightly asymmetrical.
However, this distribution of sample scores is similar enough to
a normal distribution shape to allow the use of parametric
statistics such as means and correlations. This distribution
shows a reasonably wide range of heart rates, no evidence of
ceiling or floor effects, and no extreme outliers.
There are many ways in which the shape of a distribution can
differ from an ideal normal distribution shape. For example, a
distribution is described as skewed if it is asymmetric, with a
longer tail on one side (see Figure 4.20 for an example of a
distribution with a longer tail on the right). Positively skewed
distributions similar to the one that appears in Figure 4.20 are
quite common; many variables, such as reaction time, have a
minimum possible value of 0 (which means that the lower tail of
the distribution ends at 0) but do not have a fixed limit at the
upper end of the distribution (and therefore the upper tail can be
quite long). (Distributions with many zeros pose special
problems; refer back to comments on Figure 1.4. Also see
discussions by Atkins & Gallop [2007] and G. King & Zeng
[2001]; options include Poisson regression [Cohen, Cohen,
West, & Aiken, 2003, chap. 13], and negative binomial
regression [Hilbe, 2011].)
Figure 4.20 Histogram of Positively Skewed Distribution
NOTE: Skewness index for this variable is +2.00.
A numerical index of skewness for a sample set of X scores
denoted by (X1, X2, …, XN) can be calculated using the
following formula (Equation 4.1):
skewness = Σ (Xi − Mx)³ / (N × s³),
where Mx is the sample mean of the X scores, s is the sample
standard deviation of the X scores, and N is the number of
scores in the sample.
For a perfectly normal and symmetrical distribution, skewness
has a value of 0. If the skewness statistic is positive, it indicates
that there is a longer tail on the right-hand/upper end of the
distribution (as in Figure 4.20); if the skewness statistic is
negative, it indicates that there is a longer tail on the lower end
of the distribution (as in Figure 4.21).
Figure 4.21 Histogram of a Negatively Skewed Distribution
NOTE: Skewness index for this variable is −2.00.
A distribution is described as platykurtic if it is flatter than an
ideal normal distribution and leptokurtic if it has a
sharper/steeper peak in the center than an ideal normal
distribution (see Figure 4.22). A numerical index of kurtosis can
be calculated using the following formula (Equation 4.2):
kurtosis = Σ (Xi − Mx)⁴ / (N × s⁴),
where Mx is the sample mean of the X scores, s is the sample
standard deviation of the X scores, and N is the number of
scores in the sample.
Using Equation 4.2, the kurtosis for a normal distribution
corresponds to a value of 3; most computer programs actually
report “excess kurtosis”—that is, the degree to which the
kurtosis of the scores in a sample differs from the kurtosis
expected in a normal distribution. This excess kurtosis is given
by the following formula (Equation 4.3):
excess kurtosis = [Σ (Xi − Mx)⁴ / (N × s⁴)] − 3
Figure 4.22 Leptokurtic and Platykurtic Distributions
SOURCE: Adapted from
http://www.murraystate.edu/polcrjlst/p660kurtosis.htm
A positive score for excess kurtosis indicates that the
distribution of scores in the sample is more sharply peaked than
in a normal distribution (this is shown as leptokurtic in Figure
4.22). A negative score for kurtosis indicates that the
distribution of scores in a sample is flatter than in a normal
distribution (this corresponds to a platykurtic distribution shape
in Figure 4.22). The value that SPSS reports as kurtosis
corresponds to excess kurtosis (as in Equation 4.3).
A normal distribution is defined as having skewness and
(excess) kurtosis of 0. A numerical index of skewness and
kurtosis can be obtained for a sample of data to assess the
degree of departure from a normal distribution shape.
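The skewness index and the kurtosis indexes of Equations 4.2 and 4.3 can be computed directly from a batch of scores. The sketch below uses the simple moment formulas with a divisor-N standard deviation (an assumption; SPSS applies small-sample corrections, so its reported values differ somewhat in small samples):

```python
import math

def skewness(scores):
    """Skewness index: sum((X - M)**3) / (N * s**3), s with divisor N."""
    n = len(scores)
    m = sum(scores) / n
    s = math.sqrt(sum((x - m) ** 2 for x in scores) / n)
    return sum((x - m) ** 3 for x in scores) / (n * s ** 3)

def excess_kurtosis(scores):
    """sum((X - M)**4) / (N * s**4), minus 3 (the normal-curve kurtosis)."""
    n = len(scores)
    m = sum(scores) / n
    s = math.sqrt(sum((x - m) ** 2 for x in scores) / n)
    return sum((x - m) ** 4 for x in scores) / (n * s ** 4) - 3

symmetric = [1, 2, 3, 4, 5]
print(skewness(symmetric))         # 0.0 for a symmetric distribution
print(excess_kurtosis(symmetric))  # negative: flatter than normal (platykurtic)
```

A distribution with a long right tail, such as [1, 1, 1, 10], yields a positive skewness value, matching the sign convention described above.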
Additional summary statistics for a quantitative variable such as
HR1 can be obtained from the SPSS Descriptives procedure by
making the following menu selections: <Analyze> →
<Descriptive Statistics> → <Descriptives>.
The menu selections shown in Figure 4.23 open the Descriptive
Statistics dialog box shown in Figure 4.24. The Options button
opens up a dialog box that has a menu with check boxes that
offer a selection of descriptive statistics, as shown in Figure
4.25. In addition to the default selections, the boxes for
skewness and kurtosis were also checked. The output from this
procedure appears in Figure 4.26. The upper panel in Figure
4.26 shows the descriptive statistics for scores on HR1 that
appeared in Figure 4.19; skewness and kurtosis for the sample
of scores on the variable HR1 were both fairly close to 0. The
lower panel shows the descriptive statistics for the artificially
generated data that appeared in Figures 4.20 (a set of positively
skewed scores) and 4.21 (a set of negatively skewed scores).
Figure 4.23 SPSS Menu Selections: <Analyze> →
<Descriptive Statistics> → <Descriptives>
Figure 4.24 Dialog Window for SPSS Descriptive Statistics
Procedure
Figure 4.25 Options for the Descriptive Statistics Procedure
Figure 4.26 Output From the SPSS Descriptive Statistics
Procedure for Three Types of Distribution Shape
NOTE: Scores for HR (from Figure 4.19) are not skewed; scores
for posskew (from Figure 4.20) are positively skewed; and
scores for negskew (from Figure 4.21) are negatively skewed.
It is possible to set up a statistical significance test (in the form
of a z ratio) for skewness because SPSS also reports the
standard error (SE) for this statistic:
z = skewness / SEskewness (Equation 4.4)
When the N of cases is reasonably large, the resulting z ratio
can be evaluated using the standard normal distribution; that is,
skewness is statistically significant at the α = .05 level (two-
tailed) if the z ratio given in Equation 4.4 is greater than 1.96 in
absolute value.
A z test can also be set up to test the significance of (excess)
kurtosis:
z = kurtosis / SEkurtosis (Equation 4.5)
The tests in Equations 4.4 and 4.5 provide a way to evaluate
whether an empirical frequency distribution differs significantly
from a normal distribution in skewness or kurtosis. There are
formal mathematical tests to evaluate the degree to which an
empirical distribution differs from some ideal or theoretical
distribution shape (such as the normal curve). If a researcher
needs to test whether the overall shape of an empirical
frequency distribution differs significantly from normal, it can
be done by using the Kolmogorov-Smirnov or Shapiro-Wilk test
(both are available in SPSS). In most situations, visual
examination of distribution shape is deemed sufficient.
In general, empirical distribution shapes are considered
problematic only when they differ dramatically from normal.
Some earlier examples of drastically nonnormal distribution
shapes appeared in Figures 1.2 (a roughly uniform distribution)
and 1.3 (an approximately exponential or J-shaped distribution).
Multimodal distributions or very seriously skewed distributions
(as in Figure 4.20) may also be judged problematic. A
distribution that resembles the one in Figure 4.19 is often
judged close enough to normal shape.
4.7 Identification and Handling of Outliers
An outlier is an extreme score on either the low or the high end
of a frequency distribution of a quantitative variable. Many
different decision rules can be used to decide whether a
particular score is extreme enough to be considered an outlier.
When scores are approximately normally distributed, about
99.7% of the scores should fall within +3 and −3 standard
deviations of the sample mean. Thus, for normally distributed
scores, z
scores can be used to decide which scores to treat as outliers.
For example, a researcher might decide to treat scores that
correspond to values of z that are less than −3.30 or greater than
+3.30 as outliers.
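The z-score decision rule just described is straightforward to apply in code. A minimal sketch (invented scores; z is computed here with the divisor-N standard deviation, an assumption):

```python
import statistics

def z_outliers(scores, cutoff=3.3):
    """Return scores whose |z| exceeds the cutoff (3.3 is an arbitrary standard)."""
    m = statistics.fmean(scores)
    s = statistics.pstdev(scores)  # divisor-N standard deviation
    return [x for x in scores if abs((x - m) / s) > cutoff]

# Nineteen typical scores plus one extreme score:
scores = [10] * 19 + [100]
print(z_outliers(scores))  # only the extreme score is flagged
```

Note that with a very small N, no score can reach |z| > 3.3 (the maximum possible |z| in a sample of size N is bounded), which is one reason boxplot-based rules are often preferred for small samples.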
Another method for the detection of outliers uses a graph called
a boxplot (or a box and whiskers plot). This is a nonparametric
exploratory procedure that uses medians and quartiles as
information about central tendency and dispersion of scores.
The following example uses a boxplot of scores on WEIGHT
separately for each gender group, as a means of identifying
potential outliers on WEIGHT. To set up this boxplot for the
distribution of weight within each gender group, the following
SPSS menu selections were made: <Graphs> → <Legacy
Dialogs> → <Box[plot]>.
This opens up the first SPSS boxplot dialog box, shown in
Figure 4.27. For this example, the box marked Simple was
clicked, and the radio button for “Summaries for groups of
cases” was selected to obtain a boxplot for just one variable
(WEIGHT) separately for each of two groups (male and female).
Clicking on the Define button opened up the second boxplot
dialog window, as shown in Figure 4.28. The name of the
quantitative dependent variable, WEIGHT, was placed in the top
window (as the name of the variable); the categorical or
“grouping” variable (GENDER) was placed in the window for
the Category Axis. Clicking the OK button generated the
boxplot shown in Figure 4.29, with values of WEIGHT shown
on the Y axis and the categories male and female shown on the
X axis.
Rosenthal and Rosnow (1991) noted that there are numerous
variations of the boxplot; the description here is specific to the
boxplots generated by SPSS and may not correspond exactly to
descriptions of boxplots given elsewhere. For each group, a
shaded box corresponds to the middle 50% of the distribution of
scores in that group. The line that bisects this box horizontally
(not necessarily exactly in the middle) represents the 50th
percentile (the median). The lower and upper edges of this
shaded box correspond to the 25th and 75th percentiles of the
weight distribution for the corresponding group (labeled on the
X axis). The 25th and 75th percentiles of each distribution of
scores, which correspond to the bottom and top edges of the
shaded box, respectively, are called the hinges. The distance
between the hinges (i.e., the difference between scores at the
75th and 25th percentiles) is called the H-spread. The vertical
lines that extend above and below the 75th and 25th percentiles
are called “whiskers,” and the horizontal lines at the ends of the
whiskers mark the “adjacent values.” The adjacent values are
the most extreme scores in the sample that lie between the hinge
and the inner fence (not shown on the graph; each inner fence
usually lies a distance of 1.5 times the H-spread beyond the
nearer hinge). Generally, any data points that lie beyond these
adjacent values are considered outliers. In the boxplot, outliers
that lie outside the adjacent values are graphed using small
circles. Observations that are extreme outliers are shown as
asterisks (*).
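The hinge and fence arithmetic behind the boxplot can be sketched in a few lines of Python (invented weights; statistics.quantiles with method="inclusive" gives simple interpolated quartiles, an assumption, since SPSS's own percentile algorithm may place the hinges slightly differently):

```python
import statistics

def boxplot_outliers(scores):
    """Flag scores beyond the inner fences: hinges +/- 1.5 * H-spread."""
    q1, q2, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    h_spread = q3 - q1                    # distance between the hinges (IQR)
    lower_fence = q1 - 1.5 * h_spread
    upper_fence = q3 + 1.5 * h_spread
    return [x for x in scores if x < lower_fence or x > upper_fence]

weights = [110, 115, 120, 125, 130, 135, 140, 190]
print(boxplot_outliers(weights))  # the 190-lb score lies beyond the upper fence
```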
Figure 4.27 SPSS Dialog Window for Boxplot Procedure
Figure 4.28 Define Simple Boxplot: Distribution of Weight
Separately by Gender
Figure 4.29 Boxplot of WEIGHT for Each Gender Group
Figure 4.29 indicates that the middle 50% of the distribution of
body weights for males was between about 160 and 180 lb, and
there was one outlier on WEIGHT (Participant 31 with a weight
of 230 lb) in the male group. For females, the middle 50% of
the distribution of weights was between about 115 and 135 lb,
and there were two outliers on WEIGHT in the female group;
Participant 50 was an outlier (with weight = 170), and
Participant 49 was an extreme outlier (with weight = 190). The
data record numbers that label the outliers in Figure 4.29 can be
used to look up the exact score values for the outliers in the
entire listing of data in the SPSS data worksheet or in Table 4.1.
In this dataset, the value of idnum (a variable that provides a
unique case number for each participant) was the same as the
SPSS line number or record number for all 65 cases. If the
researcher wants to exclude the 3 participants who were
identified as outliers in the boxplot of weight scores for the two
gender groups, it could be done by using the following Select If
statement: idnum ~= 31 and idnum ~= 49 and idnum ~= 50.
Parametric statistics (such as the mean, variance, and Pearson
correlation) are not particularly robust to outliers; that is, the
value of M for a batch of sample data can be quite different
when it is calculated with an outlier included than when an
outlier is excluded. This raises a problem: Is it preferable to
include outliers (recognizing that a single extreme score may
have a disproportionate impact on the outcome of the analysis)
or to omit outliers (understanding that the removal of scores
may change the outcome of the analysis)? It is not possible to
state a simple rule that can be uniformly applied to all research
situations. Researchers have to make reasonable judgment calls
about how to handle extreme scores or outliers. Researchers
need to rely on both common sense and honesty in making these
judgments.
When the total N of participants in the dataset is relatively
small, and when there are one or more extreme outliers, the
outcomes for statistical analyses that examine the relation
between a pair of variables can be quite different when outliers
are included versus excluded from an analysis. The best way to
find out whether the inclusion of an outlier would make a
difference in the outcome of a statistical analysis is to run the
analysis both including and excluding the outlier score(s).
However, making decisions about how to handle outliers post
hoc (after running the analyses of interest) gives rise to a
temptation: Researchers may wish to make decisions about
outliers based on the way the outliers influence the outcome of
statistical analyses. For example, a researcher might find a
significant positive correlation between variables X and Y when
outliers are included, but the correlation may become
nonsignificant when outliers are removed from the dataset. It
would be dishonest to report a significant correlation without
also explaining that the correlation becomes nonsignificant
when outliers are removed from the data. Conversely, a
researcher might also encounter a situation where there is no
significant correlation between scores on the X and Y variables
when outliers are included, but the correlation between X and Y
becomes significant when the data are reanalyzed with outliers
removed. An honest report of the analysis should explain that
outlier scores were detected and removed as part of the data
analysis process, and there should be a good rationale for
removal of these outliers. The fact that dropping outliers yields
the kind of correlation results that the researcher hopes for is
not, by itself, a satisfactory justification for dropping outliers.
It should be apparent that if researchers arbitrarily drop enough
cases from their samples, they can prune their data to fit just
about any desired outcome. (Recall the myth of King
Procrustes, who cut off the limbs of his guests so that they
would fit his bed; we must beware of doing the same thing to
our data.)
A less problematic way to handle outliers is to state a priori that
the study will be limited to a specific population—that is, to
specific ranges of scores on some of the variables. If the
population of interest in the blood pressure study is healthy
young adults whose blood pressure is within the normal range,
this a priori specification of the population of interest would
provide a justification for the decision to exclude data for
participants with age older than 30 years and SBP above 140.
Another reasonable approach is to use a standard rule for
exclusion of extreme scores (e.g., a researcher might decide at
an early stage in data screening to drop all values that
correspond to z scores in excess of 3.3 in absolute value; this
value of 3.3 is an arbitrary standard).
Another method of handling extreme scores (trimming) involves
dropping the top and bottom scores (or some percentage of
scores, such as the top and bottom 1% of scores) from each
group. Winsorizing is yet another method of reducing the
impact of outliers: The most extreme score at each end of a
distribution is recoded to have the same value as the next
highest score.
Another way to reduce the impact of outliers is to apply a
nonlinear transformation (such as taking the base 10 logarithm
[log] of the original X scores). This type of data transformation
can bring outlier values at the high end of a distribution closer
to the mean.
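These three remedies (trimming, Winsorizing, and a log transformation) can be sketched with stdlib Python. The scores are invented, and the one-score-per-end versions of trimming and Winsorizing shown here are an assumption; percentage-based variants work the same way:

```python
import math

scores = sorted([2, 5, 7, 9, 400])  # one extreme high score

# Trimming: drop the lowest and highest score.
trimmed = scores[1:-1]
print(trimmed)  # [5, 7, 9]

# Winsorizing: recode each end's extreme score to the adjacent value.
winsorized = [scores[1]] + scores[1:-1] + [scores[-2]]
print(winsorized)  # [5, 5, 7, 9, 9]

# Log transformation: pulls a high-end outlier in toward the mean.
logged = [math.log10(x) for x in scores]
print(round(logged[-1], 3))  # log10(400) is about 2.602, far closer to the rest
```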
Whatever the researcher decides to do with extreme scores
(throw them out, Winsorize them, or modify the entire
distribution by taking the log of scores), it is a good idea to
conduct analyses with the outlier included and with the outlier
excluded to see what effect (if any) the decision about outliers
has on the outcome of the analysis. If the results are essentially
identical no matter what is done to outliers, then either
approach could be reported. If the results are substantially
different when different things are done with outliers, the
researcher needs to make a thoughtful decision about which
version of the analysis provides a more accurate and honest
description of the situation. In some situations, it may make
sense to report both versions of the analysis (with outliers
included and excluded) so that it is clear to the reader how the
extreme individual score values influenced the results. None of
these choices are ideal solutions; any of these procedures may
be questioned by reviewers or editors.
It is preferable to decide on simple exclusion rules for outliers
before data are collected and to remove outliers during the
preliminary screening stages rather than at later stages in the
analysis. It may be preferable to have a consistent rule for
exclusion (e.g., excluding all scores that show up as extreme
outliers in boxplots) rather than to tell a different story to
explain why each individual outlier received the specific
treatment that it did. The final research report should explain
what methods were used to detect outliers, identify the scores
that were identified as outliers, and make it clear how the
outliers were handled (whether extreme scores were removed or
modified).
4.8 Screening Data for Bivariate Analyses
There are three possible combinations of types of variables in
bivariate analysis. Both variables may be categorical, both may
be quantitative, or one may be categorical and the other
quantitative. Separate bivariate data-screening methods are
outlined for each of these situations.
4.8.1 Bivariate Data Screening for Two Categorical Variables
When both variables are categorical, it does not make sense to
compute means (the numbers serve only as labels for group
memberships); instead, it makes sense to look at the numbers of
cases within each group. When two categorical variables are
considered jointly, a cross-tabulation or contingency table
summarizes the number of participants in the groups for all
possible combinations of scores. For example, consider
GENDER (coded 1 = male and 2 = female) and smoking status
(coded 1 = nonsmoker, 2 = occasional smoker, 3 = frequent
smoker, and 4 = heavy smoker). A table of cell frequencies for
these two categorical variables can be obtained using the SPSS
Crosstabs procedure by making the following menu selections:
<Analyze> → <Descriptive Statistics> → <Crosstabs>.
These menu selections open up the Crosstabs dialog window,
which appears in Figure 4.30. The names of the row variable (in
this example, GENDER) and the column variable (in this
example, SMOKE) are entered into the appropriate boxes.
Clicking on the button labeled Cells opens up an additional
dialog window, shown in Figure 4.31, where the user specifies
the information to be presented in each cell of the contingency
table. In this example, both observed (O) and expected (E)
frequency counts are shown in each cell (see Chapter 8 in this
textbook to see how expected cell frequencies are computed
from the total number in each row and column of a contingency
table). Row percentages were also requested.
The observed cell frequencies in Figure 4.32 show that most of
the males and the females were nonsmokers (SMOKE = 1). In
fact, there were very few occasional smokers and heavy smokers
(and no frequent smokers). As a data-screening result, this has two
implications: If we wanted to do an analysis (such as a chi-
square test of association, as described in Chapter 8) to assess
how gender is related to smoking status, the data do not satisfy
an assumption about the minimum expected cell frequencies
required for the chi-square test of association. (For a 2 × 2
table, none of the expected cell frequencies should be less than
5; for larger tables, various sources recommend different
standards for minimum expected cell frequencies, but a
minimum expected frequency of 5 is recommended here.) In
addition, if we wanted to see how gender and smoking status
together predict some third variable, such as heart rate, the
numbers of participants in most of the groups (such as heavy
smoker/females with only N = 2 cases) are simply too small.
What would we hope to see in preliminary screening for
categorical variables? The marginal frequencies (e.g., number
of males, number of females; number of nonsmokers, occasional
smokers, and heavy smokers) should all be reasonably large.
That is clearly not the case in this example: There were so few
heavy smokers that we cannot judge whether heavy smoking is
associated with gender.
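The expected cell frequencies that Crosstabs reports come directly from the marginal totals: E = (row total × column total) / N. A minimal Python sketch with an invented 2 × 2 table:

```python
# Hypothetical observed counts: rows = gender, columns = smoking status.
observed = {
    ("male", "nonsmoker"): 20, ("male", "smoker"): 10,
    ("female", "nonsmoker"): 30, ("female", "smoker"): 40,
}

n = sum(observed.values())
row_totals, col_totals = {}, {}
for (row, col), count in observed.items():
    row_totals[row] = row_totals.get(row, 0) + count
    col_totals[col] = col_totals.get(col, 0) + count

# Expected frequency for each cell: (row total * column total) / N.
expected = {cell: row_totals[cell[0]] * col_totals[cell[1]] / n
            for cell in observed}
print(expected[("male", "nonsmoker")])  # 30 * 50 / 100 = 15.0

# Screening check: for a 2 x 2 table, no expected frequency should fall below 5.
print(min(expected.values()) >= 5)
```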
Figure 4.30 SPSS Crosstabs Dialog Window
Figure 4.31 SPSS Crosstabs: Information to Display in Cells
Figure 4.32 Cross-Tabulation of Gender by Smoking Status
NOTE: Expected cell frequencies less than 10 in three cells.
The 2 × 3 contingency table in Figure 4.32 has four cells with
expected cell frequencies less than 5. There are two ways to
remedy this problem. One possible solution is to remove groups
that have small marginal total Ns. For example, only 3 people
reported that they were “heavy smokers.” If this group of 3
people were excluded from the analysis, the two cells with the
lowest expected cell frequencies would be eliminated from the
table. Another possible remedy is to combine groups (but only
if this makes sense). In this example, the SPSS recode command
can be used to recode scores on the variable SMOKE so that
there are just two values: 1 = nonsmokers and 2 = smokers
(occasional, frequent, or heavy). The SPSS menu selections
<Transform> → <Recode> → <Into Different Variables> appear
in Figure 4.33; these menu
selections open up the Recode into Different Variables dialog
box, as shown in Figure 4.34.
In the Recode into Different Variables dialog window, the
existing variable SMOKE is identified as the numeric variable
by moving its name into the window headed Numeric Variable
→ Output Variable. The name for the new variable (in this
example, SMOKE2) is typed into the right-hand window under
the heading Output Variable, and if the button marked Change
is clicked, SPSS identifies SMOKE2 as the (new) variable that
will contain the recoded values that are based on scores for the
existing variable SMOKE. Clicking on the button marked Old
and New Values opens up the next SPSS dialog window, which
appears in Figure 4.35.
The Old and New Values dialog window that appears in Figure
4.35 can be used to enter a series of pairs of scores that show
how old scores (on the existing variable SMOKE) are used to
create new recoded scores (on the output variable SMOKE2).
For example, under Old Value, the value 1 is entered; under
New Value, the value 1 is entered; then, we click the Add
button to add this to the list of recode commands. People who
have a score of 1 on SMOKE (i.e., they reported themselves as
nonsmokers) will also have a score of 1 on SMOKE2 (this will
also be interpreted as “nonsmokers”). For the old values 2, 3,
and 4 on the existing variable SMOKE, each of these values is
mapped to a score of 2 on the new variable SMOKE2. In other
words, people who chose responses 2, 3, or 4 on the variable
SMOKE (occasional, frequent, or heavy smokers) will be coded
2 (smokers) on the new variable SMOKE2. Click
Continue and then OK to make the recode commands take
effect. After the recode command has been executed, a new
variable called SMOKE2 will appear in the far right-hand
column of the SPSS Data View worksheet; this variable will
have scores of 1 (nonsmoker) and 2 (smoker).
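The recode rule itself is just a value mapping. A Python sketch of the same SMOKE → SMOKE2 recode (the responses are invented):

```python
# Old SMOKE codes: 1 = nonsmoker, 2-4 = increasing levels of smoking.
# New SMOKE2 codes: 1 = nonsmoker, 2 = smoker.
recode = {1: 1, 2: 2, 3: 2, 4: 2}

smoke = [1, 1, 2, 4, 1, 3]           # hypothetical responses
smoke2 = [recode[v] for v in smoke]  # recode into a *different* variable
print(smoke2)  # [1, 1, 2, 2, 1, 2]

# The original variable is retained in its original form,
# as the text recommends, so no information is lost.
print(smoke)   # unchanged
```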
Figure 4.33 SPSS Menu Selection for the Recode Command
Figure 4.34 SPSS Recode Into Different Variables Dialog
Window
Figure 4.35 Old and New Values for the Recode Command
Figure 4.36 Crosstabs Using the Recoded Smoking Variable
(SMOKE2)
While it is possible to replace the scores on the existing
variable SMOKE with recoded values, it is often preferable to
put recoded scores into a new output variable. It is easy to lose
track of recodes as you continue to work with a data file. It is
helpful to retain the variable in its original form so that
information remains available.
After the recode command has been used to create a new
variable (SMOKE2), with codes for light, moderate, and heavy
smoking combined into a single code for smoking, the Crosstabs
procedure can be run using this new version of the smoking
variable. The contingency table for GENDER by SMOKE2
appears in Figure 4.36. Note that this new table has no cells
with minimum expected cell frequencies less than 5. Sometimes
this type of recoding results in reasonably large marginal
frequencies for all groups. In this example, however, the total
number of smokers in this sample is still small.
4.8.2 Bivariate Data Screening for One Categorical and One
Quantitative Variable
Data analysis methods that compare means of quantitative
variables across groups (such as ANOVA) have all the
assumptions that are required for univariate parametric
statistics:
1. Scores on quantitative variables should be normally
distributed.
2. Observations should be independent.
When means on quantitative variables are compared across
groups, there is one additional assumption: The variances of the
populations (from which the samples are drawn) should be
equal. This can be stated as a formal null hypothesis: H0: σ1² =
σ2² = … = σk².
Assessment of possible violations of Assumptions 1 and 2 was
described in earlier sections of this chapter. Graphic methods,
such as boxplots (as described in an earlier section of this
chapter), provide a way to see whether groups have similar
ranges or variances of scores.
The SPSS t test and ANOVA procedures provide a significance
test for the null hypothesis that the population variances are
equal (the Levene test). Usually, researchers hope that this
assumption is not violated, and thus, they usually hope that the
F ratio for the Levene test will be nonsignificant. However,
when the Ns in the groups are equal and reasonably large
(approximately N > 30 per group), ANOVA is fairly robust to
violations of the equal variance assumption (Myers & Well,
1991, 1995).
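For readers working outside SPSS, the Levene statistic can be sketched directly (a minimal illustration, not SPSS's implementation): it is simply a one-way ANOVA F ratio computed on the absolute deviations of each score from its own group mean.

```python
def levene_statistic(*groups):
    """Levene's W: a one-way ANOVA F ratio computed on the absolute
    deviations of each score from its own group mean."""
    k = len(groups)
    n = [len(g) for g in groups]
    N = sum(n)
    means = [sum(g) / len(g) for g in groups]
    z = [[abs(y - m) for y in g] for g, m in zip(groups, means)]
    zbar_i = [sum(zi) / len(zi) for zi in z]          # group means of z
    zbar = sum(sum(zi) for zi in z) / N               # grand mean of z
    between = sum(ni * (zb - zbar) ** 2 for ni, zb in zip(n, zbar_i))
    within = sum((zij - zb) ** 2 for zi, zb in zip(z, zbar_i) for zij in zi)
    return ((N - k) / (k - 1)) * between / within

# Two hypothetical groups with identical spreads: W = 0
print(round(levene_statistic([1, 2, 3, 4, 5], [2, 3, 4, 5, 6]), 3))  # 0.0
```

A large W (evaluated against an F distribution) signals heterogeneity of variance; groups whose spreads differ sharply yield W well above zero.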
Small sample sizes create a paradox with respect to the
assessment of violations of many assumptions. When N is small,
significance tests for possible violations of assumptions have
low statistical power, and violations of assumptions are more
problematic for the analysis. For example, consider a one-way
ANOVA with only 5 participants per group. With such a small
N, the test for heterogeneity of variance may be significant only
when the differences among sample variances are extremely
large; however, with such a small N, small differences among
sample variances might be enough to create problems in the
analysis. Conversely, in a one-way ANOVA with 50 participants
per group, quite small differences in variance across groups
could be judged statistically significant, but with such a large
N, only fairly large differences in group variances would be a
problem. Doing the preliminary test for heterogeneity of
variance when Ns are very large is something like sending out a
rowboat to see if the water is safe for the Queen Mary.
Therefore, it may be reasonable to use very small α levels, such
as α = .001, for significance tests of violations of assumptions
in studies with large sample sizes. On the other hand,
researchers may want to set α values that are large (e.g., α of
.20 or larger) for preliminary tests of assumptions when Ns are
small.
Tabachnick and Fidell (2007) provide extensive examples of
preliminary data screening for comparison of groups. These
generally involve repeating the univariate data-screening
procedures described earlier (to assess normality of distribution
shape and identify outliers) separately for each group and, in
addition, assessing whether the homogeneity of variance
assumption is violated.
It is useful to assess the distribution of quantitative scores
within each group and to look for extreme outliers within each
group. Refer back to Figure 4.29 to see an example of a boxplot
that identified outliers on WEIGHT within the gender groups. It
might be desirable to remove these outliers or, at least, to
consider how strongly they influence the outcome of a t test to
compare male and female mean weights. The presence of these
outlier scores on WEIGHT raises the mean weight for each
group; the presence of these outliers also increases the within-
group variance for WEIGHT in both groups.
4.8.3 Bivariate Data Screening for Two Quantitative Variables
Statistics that are part of the general linear model (GLM), such
as the Pearson correlation, require several assumptions. Suppose
we want to use Pearson’s r to assess the strength of the
relationship between two quantitative variables, X (diastolic
blood pressure [DBP]) and Y (systolic blood pressure [SBP]).
For this analysis, the data should satisfy the following
assumptions:
1. Scores on X and Y should each have a univariate normal
distribution shape.
2. The joint distribution of scores on X and Y should have a
bivariate normal shape (and there should not be any extreme
bivariate outliers).
3. X and Y should be linearly related.
4. The variance of Y scores should be the same at each level of
X (the homogeneity or homoscedasticity of variance
assumption).
The first assumption (univariate normality of X and Y) can be
evaluated by setting up a histogram for scores on X and Y and
by looking at values of skewness as described in Section 4.6.4.
The other two assumptions (a bivariate normal distribution
shape and a linear relation) can be assessed by examining an X,
Y scatter plot.
To obtain an X, Y scatter plot, the following menu selections
are used: <Graph> → <Scatter>.
From the initial Scatter/Dot dialog box (see Figure 4.37), the
Simple Scatter type of scatter plot was selected by clicking on
the icon in the upper left part of the Scatter/Dot dialog window.
The Define button was used to move on to the next dialog
window. In the next dialog window (shown in Figure 4.38), the
name of the predictor variable (DBP at Time 1) was placed in
the window marked X Axis, and the name of the outcome
variable (SBP at Time 1) was placed in the window marked Y
Axis. (Generally, if there is a reason to distinguish between the
two variables, the predictor or “causal” variable is placed on the
X axis in the scatter plot. In this example, either variable could
have been designated as the predictor.) The output for the
scatter plot showing the relation between scores on DBP and
SBP in Figure 4.39 shows a strong positive association between
DBP and SBP. The relation appears to be fairly linear, and there
are no bivariate outliers.
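The Pearson correlation that such a scatter plot screens for can be computed directly from raw scores; a minimal Python sketch, using hypothetical DBP and SBP readings (not the dataset shown in the figures):

```python
import math

def pearson_r(x, y):
    # r = sum of cross-products / sqrt(SS_x * SS_y)
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical diastolic (x) and systolic (y) blood pressure readings
dbp = [70, 75, 80, 85, 90, 95]
sbp = [110, 118, 121, 132, 140, 144]
print(round(pearson_r(dbp, sbp), 3))  # 0.991: strong positive association
```

Pearson's r summarizes only the linear component of the association, which is why the scatter plot must be examined first for curvilinearity and bivariate outliers.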
The assumption of bivariate normal distribution is more
difficult to evaluate than the assumption of univariate
normality, particularly in relatively small samples. Figure 4.40
represents an ideal theoretical bivariate normal distribution.
Figure 4.41 is a bar chart that shows the frequencies of scores
with specific pairs of X, Y values; it corresponds approximately
to an empirical bivariate normal distribution (note that these
figures were not generated using SPSS). X and Y have a
bivariate normal distribution if Y scores are normally
distributed for each value of X (and vice versa). In either graph,
if you take any specific value of X and look at that cross section
of the distribution, the univariate distribution of Y should be
normal. In practice, even relatively large datasets (N > 200)
often do not have enough data points to evaluate whether the
scores for each pair of variables have a bivariate normal
distribution.
Several problems may be detectable in a bivariate scatter plot.
A bivariate outlier (see Figure 4.42) is a score that falls outside
the region in the X, Y scatter plot where most X, Y values are
located. In Figure 4.42, one individual has a body weight of
about 230 lb and SBP of about 110; this combination of score
values is “unusual” (in general, persons with higher body
weight tended to have higher blood pressure). To be judged a
bivariate outlier, a score does not have to be a univariate outlier
on either X or Y (although it may be). A bivariate outlier can
have a disproportionate impact on the value of Pearson’s r
compared with other scores, depending on its location in the
scatter plot. Like univariate outliers, bivariate outliers should
be identified and examined carefully. It may make sense in
some cases to remove bivariate outliers, but it is preferable to
do this early in the data analysis process, with a well-thought-
out justification, rather than late in the data analysis process,
because the data point does not conform to the preferred linear
model.
Figure 4.37 SPSS Dialog Window for the Scatter Plot
Procedure
Figure 4.38 Scatter Plot: Identification of Variables on X and
Y Axes
Heteroscedasticity or heterogeneity of variance refers to a
situation where the variance in Y scores is greater for some
values of X than for others. In Figure 4.43, the variance of Y
scores is much higher for X scores near 50 than for X values
less than 30.
This unequal variance in Y across levels of X violates the
assumption of homoscedasticity of variance; it also indicates
that prediction errors for high values of X will be systematically
larger than prediction errors for low values of X. Sometimes a
log transformation on a Y variable that shows heteroscedasticity
across levels of X can reduce the problem of unequal variance
to some degree. However, if this problem cannot be corrected,
then the graph that shows the unequal variances should be part
of the story that is reported, so that readers understand: It is not
just that Y tends to increase as X increases, as in Figure 4.43;
the variance of Y also tends to increase as X increases. Ideally,
researchers hope to see reasonably uniform variance in Y scores
across levels of X. In practice, the number of scores at each
level of X is often too small to evaluate the shape and variance
of Y values separately for each level of X.
Figure 4.39 Bivariate Scatter Plot for Diastolic Blood
Pressure (DIA1) and Systolic Blood Pressure (SYS1)
(Moderately Strong, Positive, Linear Relationship)
Figure 4.40 Three-Dimensional Representation of an Ideal
Bivariate Normal Distribution
SOURCE: Reprinted with permission from Hartlaub, B., Jones,
B. D., & Karian, Z. A., downloaded from
www2.kenyon.edu/People/hartlaub/MellonProject/images/bivari
ate17.gif, supported by the Andrew W. Mellon Foundation.
Figure 4.41 Three-Dimensional Histogram of an Empirical
Bivariate Distribution (Approximately Bivariate Normal)
SOURCE: Reprinted with permission from Dr. P. D. M.
MacDonald.
NOTE: Z1 and Z2 represent scores on the two variables, while
the vertical heights of the bars along the “frequency” axis
represent the number of cases that have each combination of
scores on Z1 and Z2. A clear bivariate normal distribution is
likely to appear only for datasets with large numbers of
observations; this example only approximates bivariate normal.
4.9 Nonlinear Relations
Students should be careful to distinguish between these two
situations: no relationship between X and Y versus a nonlinear
relationship between X and Y (i.e., a relationship between X
and Y that is not linear). An example of a scatter plot that
shows no relationship of any kind (either linear or curvilinear)
between X and Y appears in Figure 4.44. Note that as the value
of X increases, the value of Y does not either increase or
decrease.
In contrast, an example of a curvilinear relationship between X
and Y is shown in Figure 4.45. This shows a strong relationship
between X and Y, but it is not linear; as scores on X increase
from 0 to 30, scores on Y tend to increase, but as scores on X
increase between 30 and 50, scores on Y tend to decrease. An
example of a real-world research situation that yields results
similar to those shown in Figure 4.45 is a study that examines
arousal or level of stimulation (on the X axis) as a predictor of
task performance (on the Y axis). For example, suppose that the
score on the X axis is a measure of anxiety and the score on the
Y axis is a score on an examination. At low levels of anxiety,
exam performance is not very good: Students may be sleepy or
not motivated enough to study. At moderate levels of anxiety,
exam performance is very good: Students are alert and
motivated. At the highest levels of anxiety, exam performance
is not good: Students may be distracted, upset, and unable to
focus on the task. Thus, there is an optimum (moderate) level of
anxiety; students perform best at moderate levels of anxiety.
Figure 4.42 Bivariate Scatter Plot for Weight and Systolic
Blood Pressure (SYS1)
NOTE: Bivariate outlier can be seen in the lower right corner of
the graph.
Figure 4.43 Illustration of Heteroscedasticity of Variance
NOTE: Variance in Y is larger for values of X near 50 than for
values of X near 0.
Figure 4.44 No Relationship Between X and Y
Figure 4.45 Bivariate Scatter Plot: Inverse U-Shaped
Curvilinear Relation Between X and Y
If a scatter plot reveals this kind of curvilinear relation between
X and Y, Pearson’s r (or other analyses that assume a linear
relationship) will not do a good job of describing the strength of
the relationship and will not reveal the true nature of the
relationship. Other analyses may do a better job in this
situation. For example, Y can be predicted from both X and X²
(a function that includes an X² term is a curve rather than a
straight line). (For details, see Aiken & West, 1991, chap. 5.)
Alternatively, students can be separated into high-, medium-,
and low-anxiety groups based on their scores on X, and a one-
way ANOVA can be performed to assess how mean Y test
scores differ across these three groups. However, recoding
scores on a quantitative variable into categories can result in
substantial loss of information, as pointed out by Fitzsimons
(2008).
Another possible type of curvilinear function appears in Figure
4.46. This describes a situation where responses on Y reach an
asymptote as X increases. After a certain point, further
increases in X scores begin to result in diminishing returns on
Y. For example, some studies of social support suggest that
most of the improvements in physical health outcomes occur
between no social support and low social support and that there
is little additional improvement in physical health outcomes
between low social support and higher levels of social support.
Here also, Pearson’s r or another statistic that assumes a linear
relation between X and Y may understate the strength and fail
to reveal the true nature of the association between X and Y.
Figure 4.46 Bivariate Scatter Plot: Curvilinear Relation
Between X1 and Y
If a bivariate scatter plot of scores on two quantitative variables
reveals a nonlinear or curvilinear relationship, this nonlinearity
must be taken into account in the data analysis. Some nonlinear
relations can be turned into linear relations by applying
appropriate data transformations; for example, in
psychophysical studies, the log of the physical intensity of a
stimulus may be linearly related to the log of the perceived
magnitude of the stimulus.
4.10 Data Transformations
A linear transformation is one that changes the original X score
by applying only simple arithmetic operations (addition,
subtraction, multiplication, or division) using constants. If we
let b and c represent any two values that are constants within a
study, then the arithmetic function (X – b)/c is an example of a
linear transformation. The linear transformation that is most
often used in statistics is the one that involves the use of M as
the constant b and the sample standard deviation s as the
constant c: z = (X – M)/s. This transformation changes the mean
of the scores to 0 and the standard deviation of the scores to 1,
but it leaves the shape of the distribution of X scores
unchanged.
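A minimal Python sketch of this z transformation (with hypothetical scores); the point is that the transformed scores have a mean of 0 and a sample standard deviation of 1, while the distribution shape is unchanged:

```python
def z_scores(x):
    # z = (X - M) / s: mean becomes 0, SD becomes 1, shape is unchanged
    n = len(x)
    m = sum(x) / n
    s = (sum((xi - m) ** 2 for xi in x) / (n - 1)) ** 0.5  # sample SD
    return [(xi - m) / s for xi in x]

scores = [2, 4, 4, 4, 5, 5, 7, 9]     # hypothetical raw scores
z = z_scores(scores)
print(round(sum(z) / len(z), 6) == 0)  # True: mean of the z scores is 0
```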
Sometimes, we want a data transformation that will change the
shape of a distribution of scores (or alter the nature of the
relationship between a pair of quantitative variables in a scatter
plot). Some data transformations for a set of raw X scores (such
as the log of X and the log of Y) tend to reduce positive
skewness and also to bring extreme outliers at the high end of
the distribution closer to the body of the distribution (see
Tabachnick & Fidell, 2007, chap. 4, for further discussion).
Thus, if a distribution is skewed, taking the log or square root
of scores sometimes makes the shape of the distribution more
nearly normal. For some variables (such as reaction time), it is
conventional to do this; log of reaction time is very commonly
reported (because reaction times tend to be positively skewed).
However, note that changing the scale of a variable (from heart
rate to log of heart rate) changes the meaning of the variable
and can make interpretation and presentation of results
somewhat difficult.
Figure 4.47 Illustration of the Effect of the Base 10 Log
Transformation
NOTE: In Figure 4.47, raw scores for body weight are plotted
on the X axis; raw scores for metabolic rate are plotted on the Y
axis. In Figure 4.48, both variables have been transformed using
base 10 log. Note that the log plot has more equal spaces among
cases (there is more information about the differences among
low-body-weight animals, and the outliers have been moved
closer to the rest of the scores). Also, when logs are taken for
both variables, the relation between them becomes linear. Log
transformations do not always create linear relations, of course,
but there are some situations where they do.
Sometimes a nonlinear transformation of scores on X and Y can
change a nonlinear relation between X and Y to a linear
relation. This is extremely useful, because the analyses included
in the family of methods called general linear models usually
require linear relations between variables.
A common and useful nonlinear transformation of X is the base
10 log of X, denoted by log10(X). When we find the base 10 log
of X, we find a number p such that 10^p = X. For example, the
base 10 log of 1,000 is 3, because 10^3 = 1,000. The p exponent
indicates order of magnitude.
Consider the graph shown in Figure 4.47. This is a graph of
body weight (in kilograms) on the X axis with mean metabolic
rate on the Y axis; each data point represents a mean body
weight and a mean metabolic rate for one species. There are
some ways in which this graph is difficult to read; for example,
all the data points for physically smaller animals are crowded
together in the lower left-hand corner of the scatter plot. In
addition, if you wanted to fit a function to these points, you
would need to fit a curve (rather than a straight line).
Figure 4.48 Graph Illustrating That the Relation Between
Base 10 Log of Body Weight and Base 10 Log of Metabolic
Rate Across Species Is Almost Perfectly Linear
SOURCE: Reprinted with permission from Dr. Tatsuo
Motokawa.
Figure 4.48 shows the base 10 log of body weight and the base
10 log of metabolic rate for the same set of species as in Figure
4.47. Note that now, it is easy to see the differences among
species at the lower end of the body-size scale, and the relation
between the logs of these two variables is almost perfectly
linear.
In Figure 4.47, the tick marks on the X axis represented equal
differences in terms of kilograms. In Figure 4.48, the equally
spaced points on the X axis now correspond to equal spacing
between orders of magnitude (e.g., 10^1, 10^2, 10^3, …); a one-
tick-mark change on the X axis in Figure 4.48 represents a
change from 10 to 100 kg or 100 to 1,000 kg or 1,000 to 10,000
kg. A cat weighs something like 10 times as much as a dove, a
human being weighs something like 10 times as much as a cat, a
horse about 10 times as much as a human, and an elephant about
10 times as much as a horse. If we take the log of body weight,
these log values (p = 1, 2, 3, etc.) represent these orders of
magnitude, 10^p (10^1 for a dove, 10^2 for a cat, 10^3 for a human,
and so on). If we graphed weights in kilograms using raw
scores, we would find a much larger difference between
elephants and humans than between humans and cats. The
log10(X) transformation yields a new way of scaling weight in
terms of p, the relative orders of magnitude.
When the raw X scores have a range that spans several orders of
magnitude (as in the sizes of animals, which vary from < 1 g up
to 10,000 kg), applying a log transformation reduces the
distance between scores on the high end of the distribution
much more than it reduces distances between scores on the low
end of the distribution. Depending on the original distribution
of X, outliers at the high end of the distribution of X are
brought “closer” by the log(X) transformation. Sometimes when
raw X scores have a distribution that is skewed to the right,
log(X) is nearly normal.
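The compression of the high end can be seen in a short Python sketch with hypothetical body weights spanning five orders of magnitude:

```python
import math

# Hypothetical body weights (kg) spanning five orders of magnitude
weights = [0.1, 1.0, 10.0, 100.0, 10_000.0]
logs = [math.log10(w) for w in weights]
print(logs)  # [-1.0, 0.0, 1.0, 2.0, 4.0]
```

On the raw scale, the gap between the two largest weights is 9,900 kg; on the log scale it is only 2 units, so the extreme score is pulled in toward the body of the distribution.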
Some relations between variables (such as the physical
magnitude of a stimulus, e.g., a weight or a light source) and
subjective judgments (of heaviness or brightness) become linear
when log or power transformations are applied to the scores on
both variables.
Note that when a log transformation is applied to a set of scores
with a limited range of possible values (e.g., Likert ratings of 1,
2, 3, 4, 5), this transformation has little effect on the shape of
the distribution. However, when a log transformation is applied
to scores that vary across orders of magnitude (e.g., the highest
score is 10,000 times as large as the lowest score), the log
transformation may change the distribution shape substantially.
Log transformations tend to be much more useful for variables
where the highest score is orders of magnitude larger than the
smallest score; for example, maximum X is 100 or 1,000 or
10,000 times minimum X.
Other transformations that are commonly used involve power
functions—that is, replacing X with X², X^c (where c is some
exponent, not necessarily an integer value), or √X. For specific
types of data (such as scores that represent proportions,
percentages, or correlations), other types of nonlinear
transformations are needed.
Usually, the goals that a researcher hopes to achieve through
data transformations include one or more of the following: to
make a nonnormal distribution shape more nearly normal, to
minimize the impact of outliers by bringing those values closer
to other values in the distribution, or to make a nonlinear
relationship between variables linear.
One argument against the use of nonlinear transformations has
to do with interpretability of the transformed scores. If we take
the square root of “number of times a person cries per week,”
how do we talk about the transformed variable? For some
variables, certain transformations are so common that they are
expected (e.g., psychophysical data are usually modeled using
power functions; measurements of reaction time usually have a
log transformation applied to them).
4.11 Verifying That Remedies Had the Desired Effects
Researchers should not assume that the remedies they use to try
to correct problems with their data (such as removal of outliers,
or log transformations) are successful in achieving the desired
results. For example, after one really extreme outlier is
removed, when the frequency distribution is graphed again,
other scores may still appear to be relatively extreme outliers.
After the scores on an X variable are transformed by taking the
natural log of X, the distribution of the natural log of X may
still be nonnormal. It is important to repeat data screening using
the transformed scores to make certain that the data
transformation had the desired effect. Ideally, the transformed
scores will have a nearly normal distribution without extreme
outliers, and relations between pairs of transformed variables
will be approximately linear.
4.12 Multivariate Data Screening
Data screening for multivariate analyses (such as multiple
regression and multivariate analysis of variance) begins with
screening for each individual variable and bivariate data
screening for all possible pairs of variables as described in
earlier sections of this chapter. When multiple predictor or
multiple outcome variables are included in an analysis,
correlations among these variables are reported as part of
preliminary screening. More complex assumptions about data
structure will be reviewed as they arise in later chapters.
Complete data screening in multivariate studies requires careful
examination not just of the distributions of scores for each
individual variable but also of the relationships between pairs
of variables and among subsets of variables. It is possible to
obtain numerical indexes (such as Mahalanobis d) that provide
information about the degree to which individual scores are
multivariate outliers. Excellent examples of multivariate data
screening are presented in Tabachnick and Fidell (2007).
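For the two-variable case, this kind of distance index can be sketched directly in Python (an illustration with hypothetical scores, not the SPSS procedure): each point's squared Mahalanobis distance from the centroid is computed from the sample covariance matrix, and the bivariate outlier receives the largest value.

```python
def mahalanobis_sq(x, y):
    """Squared Mahalanobis distance of each (x_i, y_i) pair from the
    centroid, using the 2 x 2 sample covariance matrix."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x) / (n - 1)
    syy = sum((b - my) ** 2 for b in y) / (n - 1)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    det = sxx * syy - sxy * sxy  # determinant of the covariance matrix
    return [(syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
            for dx, dy in ((a - mx, b - my) for a, b in zip(x, y))]

# Five points on a line plus one bivariate outlier (hypothetical scores)
x = [1, 2, 3, 4, 5, 20]
y = [1, 2, 3, 4, 5, 1]
d2 = mahalanobis_sq(x, y)
print(d2.index(max(d2)))  # 5: the (20, 1) point is farthest from the rest
```

Note that the outlying point is not extreme on either variable's scale alone; it is the unusual combination of values that produces the large distance.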
4.13 Reporting Preliminary Data Screening
Many journals in psychology and related fields use the style
guidelines published by the American Psychological
Association (APA, 2009). This section covers some of the basic
guidelines. All APA-style research reports should be double-
spaced and single-sided with at least 1-in. margins on each
page.
A Results section should report data screening and the data
analyses that were performed (including results that run counter
to predictions). Interpretations and discussion of implications of
the results are generally placed in the Discussion section of the
paper (except in very brief papers with combined
Results/Discussion sections).
Although null hypothesis significance tests are generally
reported, the updated fifth and sixth editions of the APA (2001,
2009) Publication Manual also call for the inclusion of effect-
size information and confidence intervals (CIs), wherever
possible, for all major outcomes. Include the basic descriptive
statistics that are needed to understand the nature of the results;
for example, a report of a one-way ANOVA should include
group means and standard deviations as well as F values,
degrees of freedom, effect-size information, and CIs.
Standard abbreviations are used for most statistics—for
example, M for mean and SD for standard deviation (APA,
2001, pp. 140–144). These should be in italic font (APA, 2001,
p. 101). Parentheses are often used when these are reported in
the context of a sentence, as in, “The average verbal SAT for
the sample was 551 (SD = 135).”
The sample size (N) or the degrees of freedom (df) should
always be included when reporting statistics. Often the df
values appear in parentheses immediately following the
statistic, as in this example: “There was a significant gender
difference in mean score on the Anger In scale, t(61) = 2.438, p
= .018, two-tailed, with women scoring higher on average than
men.” Generally, results are rounded to two decimal places,
except that p values are sometimes given to three decimal
places. It is more informative to report exact p values than to
make directional statements such as p < .05. If the printout
shows a p of .000, it is preferable to report p < .001 (the risk of
Type I error indicated by p is not really zero). When it is
possible for p values to be either one-tailed or two-tailed (for
the independent samples t test, for example), this should be
stated explicitly.
Tables and figures are often useful ways of summarizing a large
amount of information—for example, a list of t tests with
several dependent variables, a table of correlations among
several variables, or the results from multivariate analyses such
as multiple regression. See APA (2001, pp. 147–201) for
detailed instructions about the preparation of tables and figures.
(Tufte, 1983, presents wonderful examples of excellence and
awfulness in graphic representations of data.) All tables and
figures should be discussed in the text; however, the text should
not repeat all the information in a table; it should point out only
the highlights. Table and figure headings should be informative
enough to be understood on their own. It is common to denote
statistical significance using asterisks (e.g., * for p < .05, ** for
p < .01, and *** for p < .001), but these should be described by
footnotes to the table. Each column and row of the table should
have a clear heading. When there is not sufficient space to type
out the entire names for variables within the table, numbers or
abbreviations may be used in place of variable names, and this
should also be explained fully in footnotes to the table.
Horizontal rules or lines should be used sparingly within tables
(i.e., not between each row but only in the headings and at the
bottom). Vertical lines are not used in tables. Spacing should be
sufficient so that the table is readable.
In general, Results sections should include the following
information. For specific analyses, additional information may
be useful or necessary.
1. The opening sentence of each Results section should state
what analysis was done, with what variables, and to answer
what question. This sounds obvious, but sometimes this
information is difficult to find in published articles. An example
of this type of opening sentence is, “In order to assess whether
there was a significant difference between the mean Anger In
scores of men and women, an independent samples t test was
performed using the Anger In score as the dependent variable.”
2. Next, describe the data screening that was done to decide
whether assumptions were violated, and report any steps that
were taken to correct the problems that were detected. For
example, this would include examination of distribution shapes
using graphs such as histograms, detection of outliers using
boxplots, and tests for violations of homogeneity of variance.
Remedies might include deletion of outliers, data
transformations such as the log, or choice of a statistical test
that is more robust to violations of the assumption.
3. The next sentence should report the test statistic and the
associated exact p value; also, a statement of whether or not it
achieved statistical significance, according to the predetermined
alpha level, should be included: “There was a significant gender
difference in mean score on the Anger In scale, t(61) = 2.438, p
= .018, two-tailed, with women scoring higher on average than
men.” The significance level can be given as a range (p < .05)
or as a specific obtained value (p = .018). For nonsignificant
results, any of the following methods of reporting may be used:
p > .05 (i.e., a statement that the p value on the printout was
larger than a preselected α level of .05), p = .38 (i.e., an exact
obtained p value), or just ns (an abbreviation for
nonsignificant). Recall that the p value is an estimate of the risk
of Type I error; in theory, this risk is never zero, although it
may be very small. Therefore, when the printout reports a
significance or p value of .000, it is more accurate to report it
as “p < .001” than as “p = .000.”
4. Information about the strength of the relationship should be
reported. Most statistics have an accompanying effect-size
measure. For example, for the independent samples t test,
Cohen’s d and η² are common effect-size indexes. For this
example, η² = t²/(t² + df) = (2.438)²/((2.438)² + 61) = .09.
Verbal labels may be used to characterize an effect-size
estimate as small, medium, or large. Reference books such as
Cohen’s (1988) Statistical Power Analysis for the Behavioral
Sciences suggest guidelines for the description of effect size.
5. Where possible, CIs should be reported for estimates. In this
example, the 95% CI for the difference between the sample
means was from .048 to .484.
6. It is important to make a clear statement about the nature of
relationships (e.g., the direction of the difference between group
means or the sign of a correlation). In this example, the mean
Anger In score for females (M = 2.36, SD = .484) was higher
than the mean Anger In score for males (M = 2.10, SD = .353).
Descriptive statistics should be included to provide the reader
with the most important information. Also, note whether the
outcome was consistent with or contrary to predictions; detailed
interpretation/discussion should be provided in the Discussion
section.
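The effect-size arithmetic in step 4 can be checked in a few lines of code. This is a minimal sketch (not part of the text) using the values t = 2.438 and df = 61 quoted in the worked example above:

```python
# Effect size (eta squared) for an independent-samples t test,
# computed from the t statistic and its degrees of freedom:
#   eta^2 = t^2 / (t^2 + df)

def eta_squared(t, df):
    """Proportion of variance in the outcome associated with group membership."""
    return t ** 2 / (t ** 2 + df)

# Values from the worked example: t(61) = 2.438
eta2 = eta_squared(2.438, 61)
print(round(eta2, 2))  # -> 0.09
```

This reproduces the η2 = .09 reported above, which Cohen's (1988) guidelines would describe as a medium effect.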
Many published studies report multiple analyses. In these
situations, it is important to think about the order in which the
analyses are reported.
Sometimes, basic demographic information is reported in the
section about participants in the Methods/Participants section of
the paper. However, it is also common for the first table in the
Results section to provide means and standard deviations for all
the quantitative variables and group sizes for all the categorical
variables. Preliminary analyses that examine the reliabilities of
variables are reported prior to analyses that use those variables.
It is helpful to organize the results so that analyses that examine
closely related questions are grouped together. It is also helpful
to maintain a parallel structure throughout the research paper.
That is, questions are outlined in the Introduction, the Methods
section describes the variables that are manipulated and/or
measured to answer those questions, the Results section reports
the statistical analyses that were employed to try to answer each
question, and the Discussion section interprets and evaluates the
findings relevant to each question. It is helpful to keep the
questions in the same order in each section of the paper.
Sometimes, a study has both confirmatory and exploratory
components (as discussed in Chapter 1). For example, a study
might include an experiment that tests the hypotheses derived
from earlier research (confirmatory), but it might also examine
the relations among variables to look for patterns that were not
predicted (exploratory). It is helpful to make a clear distinction
between these two types of results. The confirmatory Results
section usually includes a limited number of analyses that
directly address questions that were stated in the Introduction;
when a limited number of significance tests are presented, there
should not be a problem with inflated risk of Type I error. On
the other hand, it may also be useful to present the results of
other exploratory analyses; however, when many significance
tests are performed and no a priori predictions were made, the
results should be labeled as exploratory, and the author should
state clearly that any p values that are reported in this context
are likely to underestimate the true risk of Type I error.
Usually, data screening that leads to a reduced sample size and
assessment of measurement reliability are reported in the
Methods section prior to the Results section. In some cases, it
may be possible to make a general statement such as, “All
variables were normally distributed, with no extreme outliers”
or “Group variances were not significantly heterogeneous,” as a
way of indicating that assumptions for the analysis are
reasonably well satisfied.
An example of a Results section that illustrates some of these
points follows. The SPSS printout that yielded these numerical
results is not included here; it is provided in the Instructor
Supplement materials for this textbook.
Results
An independent samples t test was performed to assess whether
there was a gender difference in mean Anger In scores.
Histograms and boxplots indicated that scores on the dependent
variable were approximately normally distributed within each
group with only one outlier in each group. Because these
outliers were not extreme, these scores were retained in the
analysis. The Levene test showed a nonsignificant difference
between the variances; because the homogeneity of variance
assumption did not appear to be violated, the pooled variances t
test was used. The male and female groups had 28 and 37
participants, respectively. The difference in mean Anger In
scores was found to be statistically significant, t(63) = 2.50, p =
.015, two-tailed. The mean Anger In score for females (M =
2.37, SD = .482) was higher than the mean Anger In score for
males (M = 2.10, SD = .353). The effect size, indexed by η2,
was .09. The 95% CI around the difference between these
sample means ranged from .05 to .49.
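The numbers in this example Results section can be reproduced from the reported summary statistics alone. The sketch below is illustrative (it is not how the SPSS output was generated), and the two-tailed critical t for df = 63 is hard-coded as 1.998, which is an assumed value rather than one given in the text:

```python
import math

# Summary statistics reported in the example Results section
n_f, m_f, sd_f = 37, 2.37, 0.482   # females
n_m, m_m, sd_m = 28, 2.10, 0.353   # males

df = n_f + n_m - 2                                       # 63
sp2 = ((n_f - 1) * sd_f**2 + (n_m - 1) * sd_m**2) / df   # pooled variance
se = math.sqrt(sp2 * (1 / n_f + 1 / n_m))                # SE of mean difference

t = (m_f - m_m) / se                 # pooled-variances t
eta2 = t**2 / (t**2 + df)            # effect size

# 95% CI for the mean difference; 1.998 is the two-tailed critical t for
# df = 63 (hard-coded here as an assumption, not a value from the text)
diff = m_f - m_m
ci = (diff - 1.998 * se, diff + 1.998 * se)

print(round(t, 2), round(eta2, 2))        # -> 2.5 0.09
print(round(ci[0], 2), round(ci[1], 2))   # -> 0.05 0.49
```

Note how t(63) = 2.50, η2 = .09, and the CI of .05 to .49 all follow directly from the group means, standard deviations, and sizes, which is one reason complete descriptive statistics belong in a Results section.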
4.14 Summary and Checklist for Data Screening
The goals of data screening include the following: identification
and correction of data errors, detection and decisions about
outliers, and evaluation of patterns of missing data and
decisions regarding how to deal with missing data. For
categorical variables, the researcher needs to verify that all
groups that will be examined in analyses (such as Crosstabs or
ANOVA) have a reasonable number of cases. For quantitative
variables, it is important to assess the shape of the distribution
of scores and to see what information the distribution provides
about outliers, ceiling or floor effects, and restricted range.
Assumptions specific to the analyses that will be performed
(e.g., the assumption of homogeneous population variances for
the independent samples t test, the assumption of linear
relations between variables for Pearson’s r) should be
evaluated. Possible remedies for problems with general linear
model assumptions that are identified include dropping scores,
modifying scores through data transformations, or choosing a
different analysis that is more appropriate to the data. After
deleting outliers or transforming scores, it is important to check
(by rerunning frequency distributions and replotting graphs)
that the data modifications actually had the desired effects. A
checklist of data-screening procedures is given in Table 4.2.
Preliminary screening also yields information that may be
needed to characterize the sample. The Methods section
typically reports the numbers of male and female participants,
mean and range of age, and other demographic information.
Table 4.2 Checklist for Data Screening
1. Proofread scores in the SPSS data worksheet against original
data sources, if possible.
2. Identify response inconsistencies across variables.
3. During univariate screening of scores on categorical variables,
   a. check for values that do not correspond to valid response
   alternatives, and
   b. note groups that have Ns too small to be examined separately
   in later analyses (decide what to do with small-N groups, e.g.,
   combine them with other groups or drop them from the dataset).
4. During univariate screening of scores on quantitative variables,
look for
   a. normality of distribution shape (e.g., skewness, kurtosis,
   other departures from normal shape),
   b. outliers,
   c. scores that do not correspond to valid response alternatives
   or possible values, and
   d. ceiling or floor effects, restricted range.
5. Consider dropping individual participants or variables that show
high levels of incorrect or inconsistent responses.
6. Note the pattern of "missing" data. If not random, describe how
missing data are patterned. Imputation may be used to replace
missing scores.
7. For bivariate analyses involving two categorical variables (e.g.,
chi-squared),
   a. examine the marginal distributions to see whether the Ns in
   each row and column are sufficiently large (if not, consider
   dropping some categories or combining them with other
   categories), and
   b. check whether expected values in all cells are greater than 5
   (if this is not the case, consider alternatives to χ2 such as the
   Fisher exact test).
8. For bivariate analyses of two continuous variables (e.g.,
Pearson's r), examine the scatter plot:
   a. Assess possible violations of bivariate normality.
   b. Look for bivariate outliers or disproportionately influential
   scores.
   c. Assess whether the relation between X and Y is linear. If it
   is not linear, consider whether to use a different approach to
   analysis (e.g., divide scores into low, medium, and high groups
   based on X scores and do an ANOVA) or use nonlinear
   transformations such as log to make the relation more nearly
   linear.
   d. Assess whether variance in Y scores is uniform across levels
   of X (i.e., the assumption of homoscedasticity of variance).
9. For bivariate analyses with one categorical and one continuous
variable,
   a. assess the distribution shapes for scores within each group
   (Are the scores normally distributed?),
   b. look for outliers within each group,
   c. test for possible violations of homogeneity of variance, and
   d. make sure that group sizes are adequate.
10. Verify that any remedies that have been attempted were
successful. For example, after removal of outliers, does a
distribution of scores on a quantitative variable now appear
approximately normal in shape? After taking a log of X, is the
distribution of X more nearly normal, and is the relation of X
with Y more nearly linear?
11. Based on data screening and the success or failure of remedies
that were attempted,
   a. Are assumptions for the intended parametric analysis (such as
   t test, ANOVA, or Pearson's r) sufficiently well met to go ahead
   and use parametric methods?
   b. If there are problems with these assumptions, should a
   nonparametric method of data analysis be used?
12. In the report of results, include a description of data-screening
procedures and any remedies (such as dropping outliers, imputing
values for missing data, or data transformations) that were
applied to the data prior to other analyses.
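Step 7b of the checklist (expected cell counts greater than 5) is easy to automate. A minimal sketch, using a made-up 2 × 2 table of observed counts (the data here are hypothetical, not from the text):

```python
def expected_counts(table):
    """Expected cell counts for a chi-squared test of independence:
    E[i][j] = (row total * column total) / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    return [[r * c / grand for c in col_totals] for r in row_totals]

# Hypothetical observed counts; one margin is deliberately small
observed = [[3, 7],
            [12, 28]]

expected = expected_counts(observed)
small_cells = [e for row in expected for e in row if e <= 5]
if small_cells:
    # Checklist remedy: consider combining categories or a Fisher exact test
    print("cells with expected count <= 5:", small_cells)  # -> [3.0]
```

When any expected count falls at or below 5, the checklist's advice applies: combine sparse categories or use an alternative such as the Fisher exact test.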
4.15 Final Notes
Removal of scores, cases, groups, or variables from an analysis
based on data screening and on whether the results of analysis
are statistically significant can lead to a problem, discussed in
more detail in a recent paper by Simmons, Nelson, and
Simonsohn (2011). They provide empirical demonstrations that
many common research practices, such as dropping outliers,
dropping groups, adding or omitting variables in the final
reported analysis, and continuing to collect data until the effect
of interest is found to be statistically significant, raise the risk
of Type I error. They acknowledge that researchers cannot
always make all decisions about the analysis (e.g., which cases
to include) in advance. However, they noted correctly that when
researchers go through a process in which they try out many
variations of the analysis, searching for a version of the
analysis that yields a statistically significant outcome, there is
an inflated risk of Type I error. They call this “excess
flexibility” in analysis. They recommend a list of research
design and reporting requirements that would make it possible
for readers to evaluate whether the authors have tried out a
large number of alternative analyses before settling on one
version to report. First, authors should decide on a rule for
terminating data collection before data collection is started (as
opposed to continuing to collect data, analyzing the data, and
terminating data collection only when the result of interest is
statistically significant). Second, for cells or groups, authors
should collect at least 20 cases per group or provide a
compelling reason why this would be too costly. Third, authors
should list all variables included in a study. (I would add to this
the suggestion that it should be made clear which variables were
and were not included in exploratory analyses.) Fourth, authors
should report all experimental conditions, including any groups
for which manipulations failed to work as predicted. Fifth, if
authors remove observations such as outliers, they should report
what the statistical results would be if those observations were
included. Sixth, if covariates are used (see Chapter 17 in this
book), authors should report differences among group means
when the covariates are excluded, as well as when covariates
are included. Simmons et al. further recommended that journal
reviewers should ask authors to make it clear that the results
reported do not depend on arbitrary decisions (such as omission
of outliers or inclusion of covariates for which there is no
theoretical justification).
Another source of false-positive results (Type I errors) arises in
research labs or programs where many studies are conducted,
but only those that yield p < .05 are published.
There is a basic contradiction between exploratory and
confirmatory/hypothesis testing approaches to data analysis.
Both approaches can be valuable. However, it is fairly common
practice for researchers to engage in a very “exploratory” type
of analysis; they try out a variety of analyses searching for an
analysis for which the p values are less than .05, and then
sometimes after the fact formulate a theoretical explanation
consistent with this result. If this is reported in a paper that
states hypotheses at the beginning, this form of reporting makes
it appear that the data analysis approach was confirmatory,
hides the fact that the reported analysis was selected from a
large number of other analyses that were conducted that did not
support the author’s conclusions, and leads the reader to believe
that the p value should be an accurate indication of the risk of
Type I error. As clearly demonstrated by Simmons et al. (2011),
doing many variations of the analysis inflates the risk of Type I
error.
What should the data analyst do about this problem? First,
before data are collected, researchers can establish (and should
adhere to) some simple rules about handling of outliers. In
experiments, researchers should avoid collecting data, running
analyses, and then continuing to collect data until a point is
reached where group differences are statistically significant;
instead, sample size should be decided before data collection
begins.
When a researcher wants to do exploratory work to see what
patterns may emerge from data, the best approach is to collect
enough data to do a cross-validation. For example, a researcher
might obtain 600 cases and randomly divide the data into two
datasets of 300 cases each. Exploratory analyses using the first
batch of data should be clearly described in the research report
as exploratory, with cautions about the inflated risk of Type I
error that accompany this approach. The second batch of data
can then be used to test whether results of a limited number of
these exploratory analyses produce the same results on a new
batch of data.
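The cross-validation strategy described above can be sketched in a few lines. This is an illustrative split (the case IDs and random seed are arbitrary), not a procedure specified in the text:

```python
import random

# 600 hypothetical case IDs, randomly divided into two batches of 300
cases = list(range(600))
rng = random.Random(42)        # fixed seed so the split is reproducible
rng.shuffle(cases)

exploratory = cases[:300]      # batch 1: exploratory analyses (labeled as such)
confirmation = cases[300:]     # batch 2: retest the limited set of findings

assert len(exploratory) == len(confirmation) == 300
assert not set(exploratory) & set(confirmation)   # no case appears in both
```

Only findings that replicate in the second batch, which played no role in the exploration, carry the evidentiary weight of a confirmatory test.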
Comprehension Questions
1. What are the goals of data screening?
2. What SPSS procedures can be used for data screening of
categorical variables?
3. What SPSS procedures can be used for data screening of
quantitative variables?
4. What do you need to look for in bivariate screening (for
each combination of categorical and quantitative variables)?
5. What potential problems should you look for in the
univariate distributions of categorical and quantitative scores?
6. How can a box and whiskers plot (or boxplot) be used to
look for potential outliers?
7. How can you identify and remedy the following: errors in
data entry, outliers, and missing data?
8. Why is it important to assess whether missing values are
randomly distributed throughout the participants and measures?
Or in other words, why is it important to understand what
processes lead to missing values?
9. Why are log transformations sometimes applied to scores?
10. Outline the information that should be included in an APA-
style Results section.
Data Analysis Project for Univariate and Bivariate Data
Screening
Data for this assignment may be provided by your instructor, or
use one of the datasets found on the website for this textbook.
Note that in addition to the variables given in the SPSS file, you
can also use variables that are created by compute statements,
such as scale scores formed by summing items (e.g., Hostility =
H1 + H2 + H3 + H4).
1. Select three variables from the dataset. Choose two of the
variables such that they are good candidates for
correlation/regression and one other variable as a bad candidate.
Good candidates are variables that meet the assumptions (e.g.,
normally distributed, reliably measured, interval/ratio level of
measurement, etc.). Bad candidates are variables that do not
meet assumptions or that have clear problems (restricted range,
extreme outliers, gross nonnormality of distribution shape,
etc.).
2. For each of the three variables, use the Frequencies
procedure to obtain a histogram and all univariate descriptive
statistics.
3. For the two “good candidate” variables, obtain a scatter plot.
Also, obtain a scatter plot for the “bad candidate” variable with
one of the two good variables.
Hand in your printouts for these analyses along with your
answers to the following questions (there will be no Results
section in this assignment).
1. Explain which variables are good and bad candidates for a
correlation analysis, and give your rationale. Comment on the
empirical results from your data screening—both the histograms
and the scatter plots—as evidence that these variables meet or
do not meet the basic assumptions necessary for correlation to
be meaningful and “honest.” Also, can you think of other
information you would want to have about the variables to make
better informed judgments?
2. Is there anything that could be done (in terms of data
transformations, eliminating outliers, etc.) to make your “bad
candidate” variable better? If so, what would you recommend?
Warner, R. M. (2012). Applied Statistics: From Bivariate Through
Multivariate Techniques (2nd ed.). SAGE Publications.
Chapter 2 - BASIC STATISTICS, SAMPLING ERROR, AND
CONFIDENCE INTERVALS
2.1 Introduction
The first few chapters of a typical introductory statistics book
present simple methods for summarizing information about the
distribution of scores on a single variable. It is assumed that
readers understand that information about the distribution of
scores for a quantitative variable, such as heart rate, can be
summarized in the form of a frequency distribution table or a
histogram and that readers are familiar with concepts such as
central tendency and dispersion of scores. This chapter reviews
the formulas for summary statistics that are most often used to
describe central tendency and dispersion of scores in batches of
data (including the mean, M, and standard deviation, s). These
formulas provide instructions that can be used for by-hand
computation of statistics such as the sample mean, M. A few
numerical examples are provided to remind readers how these
computations are done. The goal of this chapter is to lead
students to think about the formula for each statistic (such as
the sample mean, M). A thoughtful evaluation of each equation
makes it clear what information each statistic is based on, the
range of possible values for the statistic, and the patterns in the
data that lead to large versus small values of the statistic.
Each statistic provides an answer to some question about the
data. The sample mean, M, is one way to answer the question,
What is a typical score value? It is instructive to try to imagine
these questions from the point of view of the people who
originally developed the statistical formulas and to recognize
why they used the arithmetic operations that they did. For
example, summing scores for all participants in a sample is a
way of summarizing or combining information from all
participants. Dividing a sum of scores by N corrects for the
impact of sample size on the magnitude of this sum.
The notation used in this book is summarized in Table 2.1. For
example, the mean of scores in a sample batch of data is
denoted by M. The (usually unknown) mean of the population
that the researcher wants to estimate or make inferences about,
using the sample value of M, is denoted by μ (Greek letter mu).
One of the greatest conceptual challenges for students who are
taking a first course in statistics arises when the discussion
moves beyond the behavior of single X scores and begins to
consider how sample statistics (such as M) vary across different
batches of data that are randomly sampled from the same
population. On first passing through the material, students are
often so preoccupied with the mechanics of computation that
they lose sight of the questions about the data that the statistics
are used to answer. This chapter discusses each formula as
something more than just a recipe for computation; each
formula can be understood as a meaningful sentence. The
formula for a sample statistic (such as the sample mean, M)
tells us what information in the data is taken into account when
the sample statistic is calculated. Thinking about the formula
and asking what will happen if the values of X increase in size
or in number make it possible for students to answer questions
such as the following: Under what circumstances (i.e., for what
patterns in the data) will the value of this statistic be a large or
a small number? What does it mean when the value of the
statistic is large or when its value is small?
The basic research questions in this chapter will be illustrated
by using a set of scores on heart rate (HR); these are contained
in the file hr130.sav. For a variable such as HR, how can we
describe a typical HR? We can answer this question by looking
at measures of central tendency such as mean or median HR.
How much does HR vary across persons? We can assess this by
computing a variance and standard deviation for the HR scores
in this small sample. How can we evaluate whether an
individual person has an HR that is relatively high or low
compared with other people’s HRs? When scores are normally
distributed, we can answer questions about the location of an
individual score relative to a distribution of scores by
calculating a z score to provide a unit-free measure of distance
of the individual HR score from the mean HR and using a table
of the standard normal distribution to find areas under the
normal distribution that correspond to distances from the mean.
These areas can be interpreted as proportions and used to
answer questions such as, Approximately what proportion of
people in the sample had HR scores higher than a specific value
such as 84?
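The z-score question posed at the end of this paragraph can be answered numerically with the standard normal CDF (here computed from the error function rather than a printed table). This sketch assumes the population values reported later in the chapter for hr130.sav (μ = 73.76, σ = 7.06):

```python
import math

def normal_cdf(z):
    """Cumulative probability of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu, sigma = 73.76, 7.06   # population mean and SD for hr130.sav
hr = 84

z = (hr - mu) / sigma            # unit-free distance from the mean
p_above = 1.0 - normal_cdf(z)    # proportion with HR above 84

print(round(z, 2))         # -> 1.45
print(round(p_above, 3))   # -> 0.073
```

So an HR of 84 bpm lies about 1.45 standard deviations above the mean, and roughly 7% of scores in a normal distribution with these parameters would exceed it.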
Table 2.1 Notation for Sample Statistics and Population
Parameters
a. The first notation listed for each sample statistic is the
notation most commonly used in this book.
We will consider the issues that must be taken into account
when we use the sample mean, M, for a small random sample to
estimate the population mean, μ, for a larger population. In
introductory statistics courses, students are introduced to the
concept of sampling error, that is, variation in values of the
sample mean, M, across different batches of data that are
randomly sampled from the same population. Because of
sampling error, the sample mean, M, for a single sample is not
likely to be exactly correct as an estimate of μ, the unknown
population mean. When researchers report a sample mean, M, it
is important to include information about the magnitude of
sampling error; this can be done by setting up a confidence
interval (CI). This chapter reviews the concepts that are
involved in setting up and interpreting CIs.
2.2 Research Example: Description of a Sample of HR Scores
In the following discussion, the population of interest consists
of 130 persons; each person has a score on HR, reported in
beats per minute (bpm). Scores for this hypothetical population
are contained in the data file hr130.sav. Shoemaker (1996)
generated these hypothetical data so that sample statistics such
as the sample mean, M, would correspond to the outcomes from
an empirical study reported by Mackowiak, Wasserman, and
Levine (1992). For the moment, it is useful to treat this set of
130 scores as the population of interest and to draw one small
random sample (consisting of N = 9 cases) from this population.
This will provide us with a way to evaluate how accurately a
mean based on a random sample of N = 9 cases estimates the
mean of the population from which the sample was selected. (In
this case, we can easily find the actual population mean, μ,
because we have HR data for the entire population of 130
persons.) IBM SPSS® Version 19 is used for examples in this
book. SPSS has a procedure that allows the data analyst to
select a random sample of cases from a data file; the data
analyst can specify either the percentage of cases to be included
in the sample (e.g., 10% of the cases in the file) or the number
of cases (N) for the sample. In the following exercise, a random
sample of N = 9 HR scores was selected from the population of
130 cases in the SPSS file hr130.sav.
Figure 2.1 shows the Data View for the SPSS worksheet for the
hr130.sav file. Each row in this worksheet corresponds to scores
for one participant. Each column in the SPSS worksheet
corresponds to one variable. The first column gives each
person’s HR in beats per minute (bpm).
Clicking on the tab near the bottom left corner of the worksheet
shown in Figure 2.1 changes to the Variable View of the SPSS
dataset, displayed in Figure 2.2. In this view, the names of
variables are listed in the first column. Other cells provide
information about the nature of each variable—for example,
variable type. In this dataset, HR is a numerical variable, and
the variable type is “scale” (i.e., quantitative or approximately
interval/ratio) level of measurement. HR is conventionally
reported in whole numbers; the choice of “0” in the decimal
points column for this variable instructs SPSS to include no
digits after the decimal point when displaying scores for this
variable.
Readers who have never used SPSS will find a brief
introduction to SPSS in the appendix to this chapter; they may
also want to consult an introductory user’s guide for SPSS, such
as George and Mallery (2010).
Figure 2.1 The SPSS Data View for the First 23 Lines From the
SPSS Data File hr130.sav
Figure 2.2 The Variable View for the SPSS Worksheet for
hr130.sav
Prior to selection of a random sample, let’s look at the
distribution of this population of 130 scores. A histogram can
be generated for this set of scores by starting in the Data View
worksheet, selecting the <Graphs> menu from the menu bar
along the top of the SPSS Data View worksheet, and then
selecting <Legacy Dialogs> and <Histogram> from the pull-
down menus, as shown in Figure 2.3.
Figure 2.4 shows the SPSS dialog window for the Histogram
procedure. Initially, the names of all the variables in the file (in
this example, there is only one variable, HR) appear in the left-
hand panel, which shows the available variables. To designate
HR as the variable for the histogram, highlight it with the
cursor and click on the right-pointing arrow to move the
variable name HR into the small window on the right-hand side
under the heading Variable. (Notice that the variable named HR
has a “ruler” icon associated with it. This ruler icon indicates
that scores on this variable are scale [i.e., quantitative or
interval/ratio] level of measurement.) To request a
superimposed normal curve, click the check box for Display
normal curve. Finally, to run the procedure, click the OK button
in the upper right-hand corner of the Histogram dialog window.
The output from this procedure appears in Figure 2.5, along
with the values for the population mean μ = 73.76 and
population standard deviation σ = 7.06 for the entire population
of 130 scores.
Figure 2.3 SPSS Menu Selections <Graphs> → <Legacy
Dialogs> → <Histogram> to Open the Histogram Dialog
Window
NOTE: IBM SPSS Version 19 was used for all examples in this
book.
To select a random sample of size N = 9 from the entire
population of 130 scores in the SPSS dataset hr130.sav, make
the following menu selections, starting from the SPSS Data
View worksheet, as shown in Figure 2.6: <Data> → <Select
Cases>. This opens the SPSS dialog window for Select Cases,
which appears in Figure 2.7. In the Select Cases dialog window,
click the radio button for Random sample of cases. Then, click
the Sample button; this opens the Select Cases: Random Sample
dialog window in Figure 2.8. Within this box under the heading
Sample Size, click the radio button that corresponds to the word
“Exactly” and enter in the desired sample size (9) and the
number of cases in the entire population (130). The resulting
SPSS command is, “Randomly select exactly 9 cases from the
first 130 cases.” Click the Continue button to return to the main
Select Cases dialog window. To save this random sample of N =
9 HR scores into a separate, smaller file, click on the radio
button for “Copy selected cases to a new dataset” and provide a
name for the dataset that will contain the new sample of nine
cases—in this instance, hr9.sav. Then, click the OK button.
Figure 2.4 SPSS Histogram Dialog Window
Figure 2.5 Output: Histogram for the Entire Population of Heart
Rate (HR) Scores in hr130.sav
Figure 2.6 SPSS Menu Selection for <Data> → <Select Cases>
Figure 2.7 SPSS Dialog Window for Select Cases
Figure 2.8 SPSS Dialog Window for Select Cases: Random
Sample
When this was done, a random sample of nine cases was
obtained; these nine HR scores appear in the first column of
Table 2.2. (The computation of the values in the second and
third columns in Table 2.2 will be explained in later sections of
this chapter.) Of course, if you give the same series of
commands, you will obtain a different subset of nine scores as
the random sample.
The next few sections show how to compute descriptive
statistics for this sample of nine scores: the sample mean, M;
the sample variance, s2; and the sample standard deviation, s.
The last part of the chapter shows how this descriptive
information about the sample can be used to help evaluate
whether an individual HR score is relatively high or low,
relative to other scores in the sample, and how to set up a CI
estimate for μ using the information from the sample.
Table 2.2 Summary Statistics for Random Sample of N = 9
Heart Rate (HR) Scores
NOTES: Sample mean for HR: M = ∑X/N = 658/9 = 73.11.
Sample variance for HR: s2 = SS/(N − 1) = 244.89/8 = 30.61.
Sample standard deviation for HR: s = √s2 = √30.61 = 5.53.
2.3 Sample Mean (M)
A sample mean provides information about the size of a
“typical” score in a sample. The interpretation of a sample
mean, M, can be worded in several different ways. A sample
mean, M, corresponds to the center of a distribution of scores in
a sample. It provides us with one kind of information about the
size of a typical X score. Scores in a sample can be represented
as X1, X2, …, XN, where N is the number of observations or
participants and Xi is the score for participant number i. For
example, the HR score for a person with the SPSS case record
number 2 in Figure 2.1 could be given as X2 = 69. Some
textbooks, particularly those that offer more mathematical or
advanced treatments of statistics, include subscripts on X
scores; in this book, the i subscript is used only when omitting
subscripts would create ambiguity about which scores are
included in a computation. The sample mean, M, is obtained by
summing all the X scores in a sample of N scores and dividing
by N, the number of scores:

M = ∑X/N.   (2.1)
Adding the scores is a way of summarizing information across
all participants. The size of ∑X depends on two things: the
magnitudes of the individual X scores and N, the number of
scores. If N is held constant and all X scores are positive, ∑X
increases if the values of individual X scores are increased.
Assuming all X scores are positive, ∑X also increases as N gets
larger. To obtain a sample mean that represents the size of a
typical score and that is independent of N, we have to correct
for sample size by dividing ∑X by N, to yield M, our sample
mean. Equation 2.1 is more than just instructions for
computation. It is also a statement or “sentence” that tells us the
following:
1. What information is the sample statistic M based on? It is
based on the sum of the Xs and the N of cases in the sample.
2. Under what circumstances will the statistic (M) turn out to
have a large or small value? M is large when the individual X
scores are large and positive. Because we divide by N when
computing M to correct for sample size, the magnitude of M is
independent of N.
In this chapter, we explore what happens when we use a sample
mean, M, based on a random sample of N = 9 cases to estimate
the population mean μ (in this case, the entire set of 130 HR
scores in the file hr130.sav is the population of interest). The
sample of N = 9 randomly selected HR scores appears in the
first column of Table 2.2. For the set of the N = 9 HR scores
shown in Table 2.2, we can calculate the mean by hand:

M = ∑X/N = (70 + 71 + 74 + 80 + 73 + 75 + 82 + 64 + 69)/9 = 658/9 = 73.11.
(Note that the values of sample statistics are usually reported up
to two decimal places unless the original X scores provide
information that is accurate up to more than two decimal
places.)
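As a quick check, the by-hand mean can be reproduced in a few lines of Python. The nine scores below are the sample values implied by the deviations in column 2 of Table 2.2; the ordering is illustrative.

```python
# Nine HR scores from the random sample (values implied by Table 2.2).
hr_scores = [70, 71, 74, 80, 73, 75, 82, 64, 69]

# Equation 2.1: M = (sum of X) / N.
n = len(hr_scores)
mean_hr = sum(hr_scores) / n

print(round(mean_hr, 2))  # 73.11
```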
The SPSS Descriptive Statistics: Frequencies procedure was
used to obtain the sample mean and other simple descriptive
statistics for the set of scores in the file hrsample9.sav. On the Data
View worksheet, find the Analyze option in the menu bar at the
top of the worksheet and click on it. Select Descriptive
Statistics from the pull-down menu that appears (as shown in
Figure 2.9); this leads to another drop-down menu. Because we
want to see a distribution of frequencies and also obtain simple
descriptive statistics such as the sample mean, M, click on the
Frequencies procedure from this second pull-down menu.
This series of menu selections displayed in Figure 2.9 opens the
SPSS dialog window for the Descriptive Statistics: Frequencies
procedure shown in Figure 2.10. Move the variable name HR
from the left-hand panel into the right-hand panel under the
heading Variables to indicate that the Frequencies procedure
will be performed on scores for the variable HR. Clicking the
Statistics button at the bottom of the SPSS Frequencies dialog
window opens up the Frequencies: Statistics dialog window;
this contains a menu of basic descriptive statistics for
quantitative variables (see Figure 2.11). Check box selections
can be used to include or omit any of the statistics on this menu.
In this example, the following sample statistics were selected:
Under the heading Central Tendency, Mean and Sum were
selected, and under the heading Dispersion, Standard deviation
and Variance were selected. Click Continue to return to the
main Frequencies dialog window. When all the desired menu
selections have been made, click the OK button to run the
analysis for the selected variable, HR. The results from this
analysis appear in Figure 2.12. The top panel of Figure 2.12
reports the requested summary statistics, and the bottom panel
reports the table of frequencies for each score value included in
the sample. The value for the sample mean that appears in the
SPSS output in Figure 2.12, M = 73.11, agrees with the
numerical value obtained by the earlier calculation.
Figure 2.9 SPSS Menu Selections for the Descriptive Statistics
and Frequencies Procedures Applied to the Random Sample of
N = 9 Heart Rate Scores in the Dataset Named hrsample9.sav
How can this value of M = 73.11 be used? If we wanted to
estimate or guess any one individual’s HR, in the absence of
any other information, the best guess for any randomly selected
individual member of this sample of N = 9 persons would be M
= 73.11 bpm. Why do we say that the mean M is the “best”
prediction for any randomly selected individual score in this
sample? It is best because it is the estimate that makes the sum
of the prediction errors (i.e., the X – M differences) zero and
minimizes the overall sum of squared prediction errors across
all participants.
To see this, reexamine Table 2.2. The second column of Table
2.2 shows the deviation of each score from the sample mean (X
− M), for each of the nine scores in the sample. This deviation
from the mean is the prediction error that arises if M is used to
estimate that person’s score; the magnitude of error is given by
the difference X − M, the person’s actual HR score minus the
sample mean HR, M. For instance, if we use M to estimate
Participant 1’s score, the prediction error for Case 1 is (70 −
73.11) = −3.11; that is, Participant 1’s actual HR score is 3.11
points below the estimated value of M = 73.11.
Figure 2.10 The SPSS Dialog Window for the Frequencies
Procedure
Figure 2.11 The Frequencies: Statistics Window With Check
Box Menu for Requested Descriptive Statistics
Figure 2.12 SPSS Output From Frequencies Procedure for the
Sample of N = 9 Heart Rate Scores in the File hrsample9.sav
Randomly Selected From the File hr130.sav
How can we summarize information about the magnitude of
prediction error across persons in the sample? One approach
that might initially seem reasonable is summing the X − M
deviations across all the persons in the sample. The sum of
these deviations appears at the bottom of the second column of
Table 2.2. By definition, the sample mean, M, is the value for
which the sum of the deviations across all the scores in a
sample equals 0. In that sense, using M to estimate X for each
person in the sample results in the smallest possible sum of
prediction errors. It can be demonstrated that taking deviations
of these X scores from any constant other than the sample mean,
M, yields a sum of deviations that is not equal to 0. However,
the fact that ∑(X − M) always equals 0 for a sample of data
makes this sum uninformative as summary information about
dispersion of scores.
We can avoid the problem that the sum of the deviations always
equals 0 in a simple manner: If we first square the prediction
errors or deviations (i.e., if we square the X − M value for each
person, as shown in the third column of Table 2.2) and then sum
these squared deviations, the resulting term ∑(X − M)2 is a
number that gets larger as the magnitudes of the deviations of
individual X values from M increase.
There is a second sense in which M is the best predictor of HR
for any randomly selected member of the sample. M is the value
for which the sum of squared deviations (SS), ∑(X − M)2, is
minimized. The sample mean is the best predictor of any
randomly selected person’s score because it is the estimate for
which prediction errors sum to 0, and it is also the estimate that
has the smallest sum of squared prediction errors. The term
ordinary least squares (OLS) refers to this criterion; a statistic
meets the criterion for best OLS estimator when it minimizes
the sum of squared prediction errors.
This empirical demonstration1 only shows that ∑(X − M) = 0
for this particular batch of data. An empirical demonstration is
not equivalent to a formal proof. Formal proofs for the claim
that ∑(X − M) = 0 and the claim that M is the value for which
the SS, ∑(X − M)2, is minimized are provided in mathematical
statistics textbooks such as deGroot and Schervish (2001). The
present textbook provides demonstrations rather than formal
proofs.
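Both senses in which M is "best" can be checked numerically. The sketch below verifies, for the same nine HR scores, that the deviations from M sum to zero (up to floating-point error) and that the sum of squared prediction errors is smaller at M than at nearby candidate constants; like the discussion above, this is a demonstration rather than a proof.

```python
hr_scores = [70, 71, 74, 80, 73, 75, 82, 64, 69]
m = sum(hr_scores) / len(hr_scores)

# Sense 1: the prediction errors (X - M) sum to zero.
dev_sum = sum(x - m for x in hr_scores)

# Sense 2 (OLS criterion): M minimizes the sum of squared errors.
def sse(c):
    """Sum of squared prediction errors when the constant c predicts every score."""
    return sum((x - c) ** 2 for x in hr_scores)

print(abs(dev_sum) < 1e-9)   # True
print(sse(m) < sse(m - 1))   # True
print(sse(m) < sse(m + 1))   # True
```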
Based on the preceding demonstration (and the proofs provided
in mathematical statistics books), the mean is the best estimate
for any individual score when we do not have any other
information about the participant. Of course, if a researcher can
obtain information about the participant’s drug use, smoking,
age, gender, anxiety level, aerobic fitness, and other variables
that may be predictive of HR (or that may influence HR), better
estimates of an individual’s HR may be obtainable by using
statistical analyses that take one or more of these predictor
variables into account. Two other statistics are commonly used
to describe the average or typical score in a sample: the mode
and the median. The mode is simply the score value that occurs
most often. This is not a very useful statistic for this small
batch of sample data because each score value occurs only once;
no single score value has a larger number of occurrences than
other scores. The median is obtained by rank ordering the scores
in the sample from lowest to highest and then counting the
scores. Here is the set of nine scores from Figure 2.1 and Table
2.2 arranged in rank order:
[64, 69, 70, 71, 73, 74, 75, 80, 82]
The score that has half the scores above it and half the scores
below it is the median; in this example, the median is 73.
Because M is computed using ∑X, the inclusion of one or two
extremely large individual X scores tends to increase the size of
M. For instance, suppose that the minimum score of “64” was
replaced by a much higher score of “190” in the set of nine
scores above. The mean for this new set of nine scores would be
given by

M = ∑X/N = 784/9 = 87.11.
However, the median for this new set of nine scores with an
added outlier of X = 190,
[69, 70, 71, 73, 74, 75, 80, 82, 190],
would change to 74, which is still quite close to the original
median (without the outlier) of 73.
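This comparison of the mean and median, with and without the outlier, can be reproduced with Python's statistics module:

```python
from statistics import mean, median

original = [64, 69, 70, 71, 73, 74, 75, 80, 82]
with_outlier = [69, 70, 71, 73, 74, 75, 80, 82, 190]  # 64 replaced by 190

print(round(mean(original), 2), median(original))          # 73.11 73
print(round(mean(with_outlier), 2), median(with_outlier))  # 87.11 74
```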
The preceding example demonstrates that the inclusion of one
extremely high score typically has little effect on the size of the
sample median. However, the presence of one extreme score can
make a substantial difference in the size of the sample mean, M.
In this sample of N = 9 scores, adding an extreme score of X =
190 raises the value of M from 73.11 to 87.11, but it changes
the median by only one point. Thus, the mean is less “robust” to
extreme scores or outliers than the median; that is, the value of
a sample mean can be changed substantially by one or two
extreme scores. It is not desirable for a sample statistic to
change drastically because of the presence of one extreme
score, of course. When researchers use statistics (such as the
mean) that are not very robust to outliers, they need to pay
attention to extreme scores when screening the data. Sometimes
extreme scores are removed or recoded to avoid situations in
which the data for one individual participant have a
disproportionately large impact on the value of the mean (see
Chapter 4 for a more detailed discussion of identification and
treatment of outliers).
When scores are perfectly normally distributed, the mean,
median, and mode are equal. However, when scores have
nonnormal distributions (e.g., when the distribution of scores
has a longer tail on the high end), these three indexes of central
tendency are generally not equal. When the distribution of
scores in a sample is nonnormal (or skewed), the researcher
needs to consider which of these three indexes of central
tendency is the most appropriate description of the center of a
distribution of scores.
Despite the fact that the mean is not robust to the influence of
outliers, the mean is more widely reported than the mode or
median. The most extensively developed and widely used
statistical methods, such as analysis of variance (ANOVA), use
group means and deviations from group means as the basic
building blocks for computations. ANOVA assumes that the
scores on the quantitative outcome variable are normally
distributed. When this assumption is satisfied, the use of the
mean as a description of central tendency yields reasonable
results.
2.4 Sum of Squared Deviations (SS) and Sample Variance (s2)
The question we want to answer when we compute a sample
variance can be worded in several different ways. How much do
scores differ among the members of a sample? How widely
dispersed are the scores in a batch of data? How far do
individual X scores tend to be from the sample mean M? The
sample variance provides summary information about the
distance of individual X scores from the mean of the sample.
Let’s build the formula for the sample variance (denoted by s2)
step by step.
First, we need to know the distance of each individual X score
from the sample mean. To answer this question, a deviation
from the mean is calculated for each score as follows (the i
subscript indicates that this is done for each person in the
sample—that is, for scores that correspond to person number i
for i = 1, 2, 3, …, N). The deviation of person number i’s score
from the sample mean is given by Equation 2.2:

Deviation for person i = Xi − M. (Equation 2.2)
The value of this deviation for each person in the sample
appears in the second column of Table 2.2. The sign of this
deviation tells us whether an individual person’s score is above
M (if the deviation is positive) or below M (if the deviation is
negative). The magnitude of the deviation tells us whether a
score is relatively close to, or far from, the sample mean.
To obtain a numerical index of variance, we need to summarize
information about distance from the mean across subjects. The
most obvious approach to summarizing information across
subjects would be to sum the deviations from the mean for all
the scores in the sample:

∑(Xi − M). (Equation 2.3)
As noted earlier, this sum turns out to be uninformative
because, by definition, deviations from a sample mean in a
batch of sample data sum to 0. We can avoid this problem by
squaring the deviation for each subject and then summing the
squared deviations. This SS is an important piece of information
that appears in the formulas for many of the more advanced
statistical analyses discussed later in this textbook:

SS = ∑(Xi − M)2. (Equation 2.4)
What range of values can SS have? SS has a minimum possible
value of 0; this occurs in situations where all the X scores in a
sample are equal to each other and therefore also equal to M.
(Because a squared deviation can never be negative, and SS is a
sum of squared deviations, SS cannot be a negative number.) The
value of SS has no upper limit. Other factors
being equal, SS tends to increase when
1. the number of squared deviations included in the sum
increases, or
2. the individual Xi − M deviations get larger in absolute value.
A different version of the formula for SS is often given in
introductory textbooks:

SS = ∑X2 − (∑X)2/N. (Equation 2.5)
Equation 2.5 is a more convenient procedure for by-hand
computation of the SS than is Equation 2.4 because it involves
fewer arithmetic operations and results in less rounding error.
This version of the formula also makes it clear that SS depends
on both ∑X, the sum of the Xs, and ∑X2, the sum of the squared
Xs. Formulas for more complex statistics often include these
same terms: ∑X and ∑X2. When these terms (∑X and ∑X2) are
included in a formula, their presence implies that the
computation takes both the mean and the variance of X scores
into account. These chunks of information are the essential
building blocks for the computation of most of the statistics
covered later in this book.
From Table 2.2, the numerical result for SS = ∑(X – M)2 is
244.89.
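The definitional formula (Equation 2.4) and the computational formula (Equation 2.5) can be verified to yield the same SS for the nine HR scores (order taken from Table 2.2):

```python
x = [70, 71, 74, 80, 73, 75, 82, 64, 69]
n = len(x)
m = sum(x) / n

# Definitional formula: SS = sum of squared deviations from M.
ss_def = sum((xi - m) ** 2 for xi in x)

# Computational formula: SS = sum(X^2) - (sum(X))^2 / N.
ss_comp = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n

print(round(ss_def, 2), round(ss_comp, 2))  # 244.89 244.89
```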
How can the value of SS be used or interpreted? The minimum
possible value of SS occurs when all the X scores are equal to
each other and, therefore, equal to M. For example, in the set of
scores [73, 73, 73, 73, 73], the SS term would equal 0.
However, there is no upper limit, in practice, for the maximum
value of SS. SS values tend to be larger when they are based on
large numbers of deviations and when the individual X scores
have large deviations from the mean, M. To interpret SS as
information about variability, we need to correct for the fact
that SS tends to be larger when the number of squared
deviations included in the sum is large.
2.5 Degrees of Freedom (df) for a Sample Variance
It might seem logical to divide SS by N to correct for the fact
that the size of SS gets larger as N increases. However, the
computation (SS/N) produces a sample variance that is a biased
estimate of the population variance; that is, the sample statistic
SS/N tends to be smaller than σ2, the true population variance.
This can be empirically demonstrated by taking hundreds of
small samples from a population, computing a value of s2 for
each sample by using the formula s2 = SS/N, and tabulating the
obtained values of s2. When this experiment is performed, the
average of the sample s2 values turns out to be smaller than the
population variance, σ2.2 This is called bias in the size of s2; s2
calculated as SS/N is smaller on average than σ2, and thus, it
systematically underestimates σ2. SS/N is a biased estimate
because the SS term is actually based on fewer than N
independent pieces of information. How many independent
pieces of information is the SS term actually based on?
Let’s reconsider the batch of HR scores for N = 9 people and
the corresponding deviations from the mean; these deviations
appear in column 2 of Table 2.2. As mentioned earlier, for this
batch of data, the sum of deviations from the sample mean
equals 0; that is, ∑(Xi − M) = −3.11 − 2.11 + .89 + 6.89 − .11 +
1.89 + 8.89 − 9.11 − 4.11 = 0. In general, the sum of deviations
of sample scores from the sample mean, ∑(Xi – M), always
equals 0. Because of the constraint that ∑(X − M) = 0, only the
first N − 1 values (in this case, 8) of the X − M deviation terms
are “free to vary.” Once we know any eight deviations for this
batch of data, we can deduce what the remaining ninth deviation
must be; it has to be whatever value is needed to make ∑(X −
M) = 0. For example, once we know that the sum of the
deviations from the mean for Persons 1 through 8 in this sample
of nine HR scores is +4.11, we know that the deviation from the
mean for the last remaining case must be −4.11. Therefore, we
really have only N − 1 (in this case, 8) independent pieces of
information about variability in our sample of 9 subjects. The
last deviation does not provide new information. The number of
independent pieces of information that a statistic is based on is
called the degrees of freedom, or df. For a sample variance for a
set of N scores, df = N − 1. The SS term is based on only N − 1
independent deviations from the sample mean.
It can be demonstrated empirically and proved formally that
computing the sample variance by dividing the SS term by N
results in a sample variance that systematically underestimates
the true population variance. This underestimation or bias can
be corrected by using the degrees of freedom as the divisor. The
preferred (unbiased) formula for computation of a sample
variance for a set of X scores is thus

s2 = SS/(N − 1) = SS/df. (Equation 2.6)
Whenever a sample statistic is calculated using sums of squared
deviations, it has an associated degrees of freedom that tells us
how many independent deviations the statistic is based on.
These df terms are used to compute statistics such as the sample
variance and, later, to decide which distribution (in the family
of t distributions, for example) should be used to look up
critical values for statistical significance tests.
For this hypothetical batch of nine HR scores, the deviations
from the mean appear in column 2 of Table 2.2; the squared
deviations appear in column 3 of Table 2.2; the SS is 244.89; df
= N − 1 = 8; and the sample variance, s2, is 244.89/8 = 30.61.
This agrees with the value of the sample variance in the SPSS
output from the Frequencies procedure in Figure 2.12.
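The sampling experiment described earlier, showing why SS/N is biased, can be sketched in Python. The population here is a hypothetical normal distribution whose variance is chosen to mimic the HR data; the exact averages depend on the random seed.

```python
import random

random.seed(1)
sigma2 = 30.61            # hypothetical population variance
n, reps = 9, 20000

biased_vals, unbiased_vals = [], []
for _ in range(reps):
    sample = [random.gauss(73, sigma2 ** 0.5) for _ in range(n)]
    m = sum(sample) / n
    ss = sum((x - m) ** 2 for x in sample)
    biased_vals.append(ss / n)          # SS/N underestimates sigma^2
    unbiased_vals.append(ss / (n - 1))  # SS/(N-1) corrects the bias

avg_biased = sum(biased_vals) / reps
avg_unbiased = sum(unbiased_vals) / reps
# E[SS/N] = sigma^2 * (N-1)/N, so avg_biased hovers near (8/9) * 30.61.
print(round(avg_biased, 2), round(avg_unbiased, 2))
```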
It is useful to think about situations that would make the sample
variance s2 take on larger or smaller values. The smallest
possible value of s2 occurs when all the scores in the sample
have the same value; for example, the set of scores [73, 73, 73,
73, 73, 73, 73, 73, 73] would have a variance s2 = 0. The value
of s2 would be larger for a sample in which individual
deviations from the sample mean are relatively large, for
example, [44, 52, 66, 97, 101, 119, 120, 135, 151], than for the
set of scores [72, 73, 72, 71, 71, 74, 70, 73], where individual
deviations from the mean are relatively small.
The value of the sample variance, s2, has a minimum of 0.
There is, in practice, no fixed upper limit for values of s2; they
increase as the distances between individual scores and the
sample mean increase. The sample variance s2 = 30.61 is in
“squared HR in beats per minute.” We will want to have
information about dispersion that is in terms of HR (rather than
HR squared); this next step in the development of sample
statistics is discussed in Section 2.7. First, however, let’s
consider an important question: Why is there variance? Why do
researchers want to know about variance?
2.6 Why Is There Variance?
The best question ever asked by a student in my statistics class
was, “Why is there variance?” This seemingly naive question is
actually quite profound; it gets to the heart of research
questions in behavioral, educational, medical, and social
science research. The general question of why is there variance
can be asked specifically about HR: Why do some people have
higher and some people lower HR scores than average? Many
factors may influence HR—for example, family history of
cardiovascular disease, gender, smoking, anxiety, caffeine
intake, and aerobic fitness. The initial question that we consider
when we compute a variance for our sample scores is, How
much variability of HR is there across the people in our study?
In subsequent analyses, researchers try to account for at least
some of this variability by noting that factors such as gender,
smoking, anxiety, and caffeine use may be systematically
related to and therefore predictive of HR. In other words, the
question of why is there variance in HR can be partially
answered by noting that people have varying exposure to all
sorts of factors that may raise or lower HR, such as aerobic
fitness, smoking, anxiety, and caffeine consumption. Because
people experience different genetic and environmental
influences, they have different HRs. A major goal of research is
to try to identify the factors that predict (or possibly even
causally influence) each individual person’s score on the
variable of interest, such as HR.
Similar questions can be asked about all attributes that vary
across people or other subjects of study; for example, Why do
people have differing levels of anxiety, satisfaction with life,
body weight, or salary?
The implicit model that underlies many of the analyses
discussed later in this textbook is that an observed score can be
broken down into components and that each component of the
score is systematically associated with a different predictor
variable. Consider Participant 7 (let’s call him Joe), with an HR
of 82 bpm. If we have no information about Joe’s background, a
reasonable initial guess would be that Joe’s HR is equal to the
mean resting HR for the sample, M = 73.11. However, let’s
assume that we know that Joe smokes cigarettes and that we
know that cigarette smoking tends to increase HR by about 5
bpm. If Joe is a smoker, we might predict that his HR would be
5 points higher than the sample mean of 73.11 (73.11, the
overall mean, plus 5 points, the effect of smoking on HR, would
yield a new estimate of 78.11 for Joe’s HR). Joe’s actual HR
(82) is a little higher than this predicted value (78.11), which
combines information about what is average for most people
with information about the effect of smoking on HR. An
estimate of HR that is based on information about only one
predictor variable (in this example, smoking) probably will not
be exactly correct because many other factors are likely to
influence Joe’s HR (e.g., body weight, family history of
cardiovascular disease, drug use). These other variables that are
not included in the analysis are collectively called sources of
“error.” The difference between Joe’s actual HR of 82 and his
predicted HR of 78.11 (82 − 78.11 = +3.89) is a prediction
error. Perhaps Joe’s HR is a little higher than we might predict
based on overall average HR and Joe’s smoking status because
Joe has poor aerobic fitness or was anxious when his HR was
measured. It might be possible to reduce this prediction error to
a smaller value if we had information about additional variables
(such as aerobic fitness and anxiety) that are predictive of HR.
Because we do not know all the factors that influence or predict
Joe’s HR, a predicted HR based on just a few variables is
generally not exactly equal to Joe’s actual HR, although it may
be a better estimate of his HR than we would have if we just
used the sample mean to estimate his score.
Statistical analyses covered in later chapters will provide us
with a way to “take scores apart” into components that represent
how much of the HR score is associated with each predictor
variable. In other words, we can “explain” why Joe’s HR of 82
is 8.89 points higher than the sample mean of 73.11 by
identifying parts of Joe’s HR score that are associated with, and
predictable from, specific variables such as smoking, aerobic
fitness, and anxiety. More generally, a goal of statistical
analysis is to show that we can predict whether individuals tend
to have high or low scores on an outcome variable of interest
(such as HR) from scores on a relatively small number of
predictor variables. We want to explain or account for the
variance in HR by showing that some components of each
person’s HR score can be predicted from his or her scores on
other variables.
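The decomposition of Joe's score can be written out explicitly; the 5-bpm smoking effect below is the hypothetical value assumed in the text.

```python
sample_mean = 73.11      # M for the nine HR scores
smoking_effect = 5.0     # hypothetical effect of smoking on HR (from the text)
joe_actual = 82          # Joe's observed HR

predicted = sample_mean + smoking_effect   # estimate using one predictor
prediction_error = joe_actual - predicted  # "error": unmeasured influences

print(round(predicted, 2), round(prediction_error, 2))  # 78.11 3.89
```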
2.7 Sample Standard Deviation (s)
An inconvenient property of the sample variance that was
calculated in Section 2.5 (s2 = 30.61) is that it is given in
squared HR rather than in the original units of measurement.
The original scores were measures of HR in beats per minute,
and it would be easier to talk about typical distances of
individual scores from the mean if we had a measure of
dispersion that was in the original units of measurement. To
describe how far a typical subject’s HR is from the sample
mean, it is helpful to convert the information about dispersion
contained in the sample variance, s2, back into the original
units of measurement (scores on HR rather than HR squared).
To obtain an estimate of the sample standard deviation (s), we
take the square root of the variance. The formula used to
compute the sample standard deviation (an estimate of the
population standard deviation, σ) is as follows:

s = √s2 = √(SS/(N − 1)). (Equation 2.7)
For the set of N = 9 HR scores given above, the variance was
30.61; the sample standard deviation s is the square root of this
value, 5.53. The sample standard deviation, s = 5.53, tells us
something about typical distances of individual X scores from
the mean, M. Note that the numerical estimate for the sample
standard deviation, s, obtained from this computation agrees
with the value of s reported in the SPSS output from the
Frequencies procedure that appears in Figure 2.12.
How can we use the information that we obtain from sample
values of M and s? If we know that scores are normally
distributed, and we have values for the sample mean and
standard deviation, we can work out an approximate range that
is likely to include most of the score values in the sample.
Recall from Chapter 1 that in a normal distribution, about 95%
of the scores lie within ±1.96 standard deviations from the
mean. For a sample with M = 73.11 and s = 5.53, if we assume
that HR scores are normally distributed, an estimated range that
should include most of the values in the sample is obtained by
finding M ± 1.96 × s. For this example, 73.11 ± (1.96 × 5.53) =
73.11 ± 10.84; this is a range from 62.27 to 83.95. These values
are fairly close to the actual minimum (64) and maximum (82)
for the sample. The approximation of range obtained by using M
and s tends to work much better when the sample has a larger N
of participants and when scores are normally distributed within
the sample. What we know at this point is that the average for
HR was about 73 bpm and that the range of HR in this sample
was from 64 to 82 bpm. Later in the chapter, we will ask, How
can we use this information from the sample (M and s) to
estimate μ, the mean HR for the entire population?
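The M ± 1.96s range can be computed directly from the rounded sample statistics reported in the text:

```python
m, s = 73.11, 5.53     # rounded sample mean and standard deviation

half_width = 1.96 * s  # distance covering about 95% of a normal distribution
low, high = m - half_width, m + half_width

print(round(low, 2), round(high, 2))  # 62.27 83.95
```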
However, several additional issues need to be considered before
we take on the problem of making inferences about μ, the
unknown population mean. These are discussed in the next few
sections.
2.8 Assessment of Location of a Single X Score Relative to a
Distribution of Scores
We can use the mean and standard deviation of a population, if
these are known (μ and σ, respectively), or the mean and
standard deviation for a sample (M and s, respectively) to
evaluate the location of a single X score (relative to the other
scores in a population or a sample).
First, let’s consider evaluating a single X score relative to a
population for which the mean and standard deviation, μ and σ,
respectively, are known. In real-life research situations,
researchers rarely have this information. One clear example of a
real-life situation where the values of μ and σ are known to
researchers involves scores on standardized tests such as the
Wechsler Adult Intelligence Scale (WAIS).
Suppose you are told that an individual person has received a
score of 110 points on the WAIS. How can you interpret this
score? To answer this question, you need to know several
things. Does this score represent a high or a low score relative
to other people who have taken the test? Is it far from the mean
or close to the mean of the distribution of scores? Is it far
enough above the mean to be considered “exceptional” or
unusual? To evaluate the location of an individual score, you
need information about the distribution of the other scores. If
you have a detailed frequency table that shows exactly how
many people obtained each possible score, you can work out an
exact percentile rank (the percentage of test takers who got
scores lower than 110) using procedures that are presented in
detail in introductory statistics books. When the distribution of
scores has a normal shape, a standard score or z score provides
a good description of the location of that single score relative to
other people’s scores without the requirement for complete
information about the location of every other individual score.
In the general population, scores on the WAIS intelligence
quotient (IQ) test have been scaled so that they are normally
distributed with a mean μ = 100 and a standard deviation σ of
15. The first thing you might do to assess an individual score is
to calculate the distance from the mean—that is, X − μ (in this
example, 110 − 100 = +10 points). This result tells you that the
score is above average (because the deviation has a positive
sign). But it does not tell whether 10 points correspond to a
large or a small distance from the mean when you consider the
variability or dispersion of IQ scores in the population.
To obtain an index of distance from the mean that is “unit free”
or standardized, we compute a z score; we divide the deviation
from the mean (X − μ) by the standard deviation of population
scores (σ) to find out the distance of the X score from the mean
in number of standard deviations, as shown in Equation 2.8:

z = (X − μ)/σ. (Equation 2.8)
If the z transformation is applied to every X score in a normally
distributed population, the shape of the distribution of scores
does not change, but the mean of the distribution is changed to
0 (because we have subtracted μ from each score), and the
standard deviation is changed to 1 (because we have divided
deviations from the mean by σ). Each z score now represents
how far an X score is from the mean in “standard units”—that
is, in terms of the number of standard deviations. The mapping
of scores from a normally shaped distribution of raw scores,
with a mean of 100 and a standard deviation of 15, to a standard
normal distribution, with a mean of 0 and a standard deviation
of 1, is illustrated in Figure 2.13.
For a score of X = 110, z = (110 − 100)/15 = +.67. Thus, an X
score of 110 IQ points corresponds to a z score of +.67, which
corresponds to a distance of two thirds of a standard deviation
above the population mean.
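The z-score computation from Equation 2.8, applied to the WAIS example:

```python
MU, SIGMA = 100, 15   # WAIS IQ scaling: mean 100, standard deviation 15

def z_score(x):
    # Equation 2.8: distance from the mean in standard-deviation units.
    return (x - MU) / SIGMA

print(round(z_score(110), 2))  # 0.67
```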
Recall from the description of the normal distribution in
Chapter 1 that there is a fixed relationship between distance
from the mean (given as a z score, i.e., numbers of standard
deviations) and area under the normal distribution curve. We
can deduce approximately what proportion or percentage of
people in the population had IQ scores higher (or lower) than
110 points by (a) finding out how far a score of 110 is from the
mean in standard score or z score units and (b) looking up the
areas in the normal distribution that correspond to the z score
distance from the mean.
Figure 2.13 Mapping of Scores From a Normal Distribution of
Raw IQ Scores (With μ = 100 and σ = 15) to a Standard Normal
Distribution (With μ = 0 and σ = 1)
The proportion of the area of the normal distribution that
corresponds to outcomes greater than z = +.67 can be evaluated
by looking up the area that corresponds to the obtained z value
in the table of the standard normal distribution in Appendix A.
The obtained value of z (+.67) and the corresponding areas
appear in the three columns on the right-hand side of the first
page of the standard normal distribution table, about eight lines
from the top. Area C corresponds to the proportion of area
under a normal curve that lies to the right of z = +.67; from the
table, area C = .2514. Thus, about 25% of the area in the normal
distribution lies above z = +.67. The areas for sections of the
normal distribution are interpretable as proportions; if they are
multiplied by 100, they can be interpreted as percentages. In
this case, we can say that the proportion of the population that
had z scores equal to or above +.67 and/or IQ scores equal to or
above 110 points was .2514. Equivalently, we could say that
25.14% of the population had IQ scores equal to or above 110.
Note that the table in Appendix A can also be used to assess the
proportion of cases that lie below z = +.67. The proportion of
area in the lower half of the distribution (from z = –∞ to z =
.00) is .50. The proportion of area that lies between z = .00 and
z = +.67 is shown in column B (area = .2486) of the table. To
find the total area below z = +.67, these two areas are summed:
.5000 + .2486 = .7486. If this value is rounded to two decimal
places and multiplied by 100 to convert the information into a
percentage, it implies that about 75% of persons in the
population had IQ scores below 110. This tells us that a score of
110 is above average, although it is not an extremely high score.
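The table lookup in Appendix A can be replicated with the standard normal distribution in Python's standard library; this sketch computes the areas above and below z = +.67 that the text reads from the table.

```python
from statistics import NormalDist

std = NormalDist()                  # standard normal: mean 0, SD 1
z = round((110 - 100) / 15, 2)      # +0.67, as looked up in the table

below = std.cdf(z)                  # total area below z
above = 1 - below                   # area above z (column C in Appendix A)
print(round(above, 4), round(below, 4))  # 0.2514 0.7486
```

The two areas necessarily sum to 1, mirroring the .2514 + .7486 breakdown in the text.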
Consider another possible IQ score. If a person has an IQ score
of 145, that person’s z score is (145 − 100)/15 = +3.00. This
person scored 3 standard deviations above the mean. The
proportion of the area of a normal distribution that lies above z
= +3.00 is .0013. That is, only about 1 in 1,000 people have z
scores greater than or equal to +3.00 (which would correspond
to IQs greater than or equal to 145).
By convention, scores that fall in the most extreme 5% of a
distribution are regarded as extreme, unusual, exceptional, or
unlikely. (While 5% is the most common criterion for
“extreme,” sometimes researchers choose to look at the most
extreme 1% or .1%.) Because the most extreme 5% (combining
the outcomes at both the upper and the lower extreme ends of
the distribution) is so often used as a criterion for an “unusual”
or “extreme” outcome, it is useful to remember that 2.5% of the
area in a normal distribution lies below z = −1.96, and 2.5% of
the area in a normal distribution lies above z = +1.96. When the
areas in the upper and lower tails are combined, the most
extreme 5% of the scores in a normal distribution correspond to
z values ≤ −1.96 and ≥ +1.96. Thus, anyone whose score on a
test yields a z score greater than 1.96 in absolute value might be
judged “extreme” or unusual. For example, a person whose test
score corresponds to a value of z that is greater than +1.96 is
among the top 2.5% of all test scorers in the population.
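The ±1.96 cutoffs for the most extreme 5% can likewise be recovered from the inverse normal CDF, as this short Python check shows.

```python
from statistics import NormalDist

std = NormalDist()
# z cutoffs leaving 2.5% in each tail (the most extreme 5% combined)
lower = std.inv_cdf(0.025)
upper = std.inv_cdf(0.975)
print(round(lower, 2), round(upper, 2))  # -1.96 1.96
```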
2.9 A Shift in Level of Analysis: The Distribution of Values of
M Across Many Samples From the Same Population
At this point in the discussion, we need to make a major shift in
thinking. Up to this point, the discussion has examined the
distributions of individual X scores in populations and in
samples. We can describe the central tendency or average score
by computing a mean; we describe the dispersion of individual
X scores around the mean by computing a standard deviation.
We now move to a different level of analysis: We will ask
analogous questions about the behavior of the sample mean, M;
that is, What is the average value of M across many samples,
and how much does the value of M vary across samples? It may
be helpful to imagine this as a sort of “thought experiment.”
In actual research situations, a researcher usually has only one
sample. The researcher computes a mean and a variance for the
data in that one sample, and often the researcher wants to use
the mean and variance from one sample to make inferences
about (or estimates of) the mean and variance of the population
from which the sample was drawn.
Note, however, that the single sample mean, M, reported for a
random sample of N = 9 cases from the hr130 file (M = 73.11)
was not exactly equal to the population mean μ of 73.76 (in
Figure 2.5). The difference M − μ (in this case, 73.11 − 73.76)
represents an estimation error; if we used the sample mean
value M = 73.11 to estimate the population mean of μ = 73.76,
in this instance, our estimate will be off by 73.11 − 73.76 =
−.65. It is instructive to stop and think, Why was the value of M
in this one sample different from the value of μ?
It may be useful for the reader to repeat this sampling exercise.
Using the <Data> → <Select Cases> → <Random> SPSS menu
selections, as shown in Figures 2.6 and 2.7 earlier, each member
of the class might draw a random sample of N = 9 cases from
the file hr130.sav and compute the sample mean, M. If students
report their values of M to the class, they will see that the value
of M differs across their random samples. If the class sets up a
histogram to summarize the values of M that are obtained by
class members, this is a “sampling distribution” for M—that is,
a set of different values for M that arise when many random
samples of size N = 9 are selected from the same population.
Why is it that no two students obtain the same answer for the
value of M?
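The classroom exercise above can also be run as a small simulation. The sketch below stands in for the hr130.sav data (which is not reproduced here) by drawing from a normal population with the μ and σ reported in the text; each simulated "student" draws one random sample of N = 9 and reports M.

```python
import random
from statistics import mean, stdev

random.seed(1)

# Stand-in for the hr130 population: normal with the parameters from the text.
mu, sigma, n = 73.76, 7.062, 9

# 500 "students" each draw a random sample of N = 9 and compute M.
sample_means = [
    mean(random.gauss(mu, sigma) for _ in range(n))
    for _student in range(500)
]

print(round(mean(sample_means), 1))   # close to mu = 73.76
print(round(stdev(sample_means), 1))  # close to sigma / sqrt(n), about 2.35
```

The histogram of these 500 values of M is exactly the "sampling distribution" the text describes: centered near μ, but visibly narrower than the distribution of individual scores.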
2.10 An Index of Amount of Sampling Error: The Standard
Error of the Mean (σM)
Different samples drawn from the same population typically
yield different values of M because of sampling error. Just by
“luck of the draw,” some random samples contain one or more
individuals with unusually low or high scores on HR; for those
samples, the value of the sample mean, M, will be lower (or
higher) than the population mean, μ. The question we want to
answer is, How much do values of M, the sample mean, tend to
vary across different random samples drawn from the same
population, and how much do values of M tend to differ from
the value of μ, the population mean that the researcher wants to
estimate? It turns out that we can give a precise answer to this
question. That is, we can quantify the magnitude of sampling
error that arises when we take hundreds of different random
samples (of the same size, N) from the same population. It is
useful to have information about the magnitude of sampling
error; we will need this information later in this chapter to set
up CIs, and we will also use this information in later chapters to
set up statistical significance tests.
The outcome for this distribution of values of M—that is, the
sampling distribution of M—is predictable from the central
limit theorem. A reasonable statement of this theorem is
provided by Jaccard and Becker (2002):
Given a population [of individual X scores] with a mean of μ
and a standard deviation of σ, the sampling distribution of the
mean [M] has a mean of μ and a standard deviation [generally
called the “[population] standard error,” σM] of σ/√N, and approaches
a normal distribution as the sample size on which it is based, N,
approaches infinity. (p. 189)
For example, an instructor using the entire dataset hr130.sav
can compute the population mean μ = 73.76 and the population
standard deviation σ = 7.062 for this population of 130 scores.
If the instructor asks each student in the class to draw a random
sample of N = 9 cases, the instructor can use the central limit
theorem to predict the distribution of outcomes for M that will
be obtained by class members. (This prediction will work well
for large classes; e.g., in a class of 300 students, there are
enough different values of the sample mean to obtain a good
description of the sampling distribution; for classes smaller than
30 students, the outcomes may not match the predictions from
the central limit theorem very closely.)
When hundreds of class members bring in their individual
values of M, mean HR (each based on a different random
sample of N = 9 cases), the instructor can confidently predict
that when all these different values of M are evaluated as a set,
they will be approximately normally distributed with a mean
close to 73.76 bpm (the population mean) and with a standard
deviation or standard error, σM, of σ/√N = 7.062/√9 = 2.35 bpm. The middle 95% of the
sampling distribution of M should lie within the range μ −
1.96σM and μ + 1.96σM; in this case, the instructor would
predict that about 95% of the values of M obtained by class
members should lie approximately within the range between
73.76 − 1.96 × 2.35 and 73.76 + 1.96 × 2.35, that is, mean HR
between 69.15 and 78.37 bpm. On the other hand, about 2.5% of
students are expected to obtain sample mean M values below
69.15, and about 2.5% of students are expected to obtain sample
mean M values above 78.37. In other words, before the students
go through all the work involved in actually drawing hundreds
of samples and computing a mean M for each sample and then
setting up a histogram and frequency table to summarize the
values of M across the hundreds of class members, the
instructor can anticipate the outcome; while the instructor
cannot predict which individual students will obtain unusually
high or low values of M, the instructor can make a fairly
accurate prediction about the range of values of M that most
students will obtain.
The fact that we can predict the outcome of this time-consuming
experiment on the behavior of the sample statistic M based on
the central limit theorem means that we do not, in practice, need
to actually obtain hundreds of samples from the same
population to estimate the magnitude of sampling error, σM. We
only need to know the values of σ and N and to apply the
central limit theorem to obtain fairly precise information about
the typical magnitude of sampling error.
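The point that no sampling experiment is needed can be made concrete in two lines: given σ and N, the central limit theorem delivers σM and the predicted 95% range for M directly (values below are the hr130 parameters from the text).

```python
import math

# Standard error of the mean from the population SD and the sample size.
sigma, n = 7.062, 9
sigma_m = sigma / math.sqrt(n)
print(round(sigma_m, 2))  # 2.35

# Predicted middle 95% of sample means: mu - 1.96*sigma_m to mu + 1.96*sigma_m
mu = 73.76
print(round(mu - 1.96 * sigma_m, 2), round(mu + 1.96 * sigma_m, 2))  # 69.15 78.37
```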
The difference between each individual student’s value of M
and the population mean, μ, is attributable to sampling error.
When we speak of sampling error, we do not mean that the
individual student has necessarily done something wrong
(although students could make mistakes while computing M
from a set of scores). Rather, sampling error represents the
differences between the values of M and μ that arise just by
chance. When individual students carry out all the instructions
for the assignment correctly, most students obtain values of M
that differ from μ by relatively small amounts, and a few
students obtain values of M that are quite far from μ.
Prior to this section, the statistics that have been discussed
(such as the sample mean, M, and the sample standard
deviation, s) have described the distribution of individual X
scores. Beginning in this section, we use the population
standard error of the mean, σM, to describe the variability of a
sample statistic (M) across many samples. The standard error of
the mean describes the variability of the distribution of values
of M that would be obtained if a researcher took thousands of
samples from one population, computed M for each sample, and
then examined the distribution of values of M; this distribution
of many different values of M is called the sampling
distribution for M.
2.11 Effect of Sample Size (N) on the Magnitude of the
Standard Error (σM)
When the instructor sets up a histogram of the M values for
hundreds of students, the shape of this distribution is typically
close to normal; the mean of the M values is close to μ, the
population mean, and the standard error (essentially, the
standard deviation) of this distribution of M values is close to
the theoretical value given by σM = σ/√N.
Refer back to Figure 2.5 to see the histogram for the entire
population of 130 HR scores. Because this population of 130
observations is small, we can calculate the population mean μ =
73.76 and the population standard deviation σ = 7.062 (these
statistics appeared along with the histogram in Figure 2.5).
Suppose that each student in an extremely large class (500 class
members) draws a sample of size N = 9 and computes a mean M
for this sample; the values of M obtained by 500 members of the
class would be normally distributed and centered at μ = 73.76,
with σM = σ/√N = 7.062/√9 = 2.35,
as shown in Figure 2.15. When comparing the distribution of
individual X scores in Figure 2.5 with the distribution of values
of M based on 500 samples each with an N of 9 in Figure 2.15,
the key thing to note is that they are both centered at the same
value of μ (in this case, 73.76), but the variance or dispersion of
the distribution of M values is less than the variance of the
individual X scores. In general, as N (the size of each sample)
increases, the variance of the M values across samples
decreases.
Recall that σM is computed as σ/√N. It is useful to examine this
formula and to ask, Under what circumstances will σM be larger
or smaller? For any fixed value of N, this equation says that as
σ increases, σM also increases. In other words, when there is an
increase in the variance of the original individual X scores, it is
intuitively obvious that random samples are more likely to
include extreme scores, and these extreme scores in the samples
will produce sample values of M that are farther from μ.
For any fixed value of σ, as N increases, the value of σM will
decrease. That is, as the number of cases (N) in each sample
increases, the estimate of M for any individual sample tends to
be closer to μ. This should seem intuitively reasonable; larger
samples tend to yield sample means that are better estimates of
μ—that is, values of M that tend to be closer to μ. When N = 1,
σM = σ; that is, for samples of size 1, the standard error is the
same as the standard deviation of the individual X scores.
Figures 2.14 through 2.17 illustrate that as the N per sample is
increased, the dispersion of values of M in the sampling
distributions continues to decrease in a predictable way. The
numerical values of the standard errors for the histograms
shown in Figures 2.14 through 2.17 are approximately equal to
the theoretical values of σM computed from σ and N: for N = 4,
σM = 7.062/√4 = 3.53; for N = 9, σM = 2.35; for N = 25, σM =
1.41; and for N = 64, σM = 0.88.
Figure 2.14 The Sampling Distribution of 500 Sample Means,
Each Based on an N of 4, Drawn From the Population of 130
Heart Rate Scores in the hr130.sav Dataset
Figure 2.15 The Sampling Distribution of 500 Sample Means,
Each Based on an N of 9, Drawn From the Population of 130
Heart Rate Scores in the hr130.sav Dataset
The standard error, σM, provides information about the
predicted dispersion of sample means (values of M) around μ
(just as σ provided information about the dispersion of
individual X scores around M).
We want to know the typical magnitude of differences between
M, an individual sample mean, and μ, the population mean, that
we want to estimate using the value of M from a single sample.
When we use M to estimate μ, the difference between these two
values (M − μ) is an estimation error. Recall that σ, the standard
deviation for a population of X scores, provides summary
information about the distances between individual X scores and
μ, the population mean. In a similar way, the standard error of
the mean, σM, provides summary information about the
distances between M and μ, and these distances correspond to
the estimation error that arises when we use individual sample
M values to try to estimate μ. We hope to make the magnitudes
of estimation errors, and therefore the magnitude of σM, small.
Information about the magnitudes of estimation errors helps us
to evaluate how accurate or inaccurate our sample statistics are
likely to be as estimates of population parameters. Information
about the magnitude of sampling errors is used to set up CIs and
to conduct statistical significance tests.
Because the sampling distribution of M has a normal shape (and
σM is the “standard deviation” of this distribution) and we
know from Chapter 1 (Figure 1.4) that 95% of the area under a
standard normal distribution lies between z = −1.96 and z =
+1.96, we can reason that approximately 95% of the means of
random samples of size N drawn from a normally distributed
population of X scores, with a mean of μ and standard deviation
of σ, should fall within a range given by μ − 1.96 × σM and μ
+ 1.96 × σM.
Figure 2.16 The Sampling Distribution of 500 Sample Means,
Each Based on an N of 25, Drawn From the Population of 130
Heart Rate Scores in the hr130.sav Dataset
2.12 Sample Estimate of the Standard Error of the Mean (SEM)
The preceding section described the sampling distribution of M
in situations where the value of the population standard
deviation, σ, is known. In most research situations, the
population mean and standard deviation are not known; instead,
they are estimated by using information from the sample. We
can estimate σ by using the sample value of the standard
deviation; in this textbook, as in most other statistics textbooks,
the sample standard deviation is denoted by s. Many journals,
including those published by the American Psychological
Association, use SD as the symbol for the sample standard
deviations reported in journal articles.
Figure 2.17 The Sampling Distribution of 500 Sample Means,
Each Based on an N of 64, Drawn From the Population of 130
Heart Rate Scores in the hr130.sav Dataset
Earlier in this chapter, we sidestepped the problem of working
with populations whose characteristics are unknown by
arbitrarily deciding that the set of 130 scores in the file named
hr130.sav was the “population of interest.” For this dataset, the
population mean, μ, and standard deviation, σ, can be obtained
by having SPSS calculate these values for the entire set of 130
scores that are defined as the population of interest. However,
in many real-life research problems, researchers do not have
information about all the scores in the population of interest,
and they do not know the population mean, μ, and standard
deviation, σ. We now turn to the problem of evaluating the
magnitude of prediction error in the more typical real-life
situation, where a researcher has one sample of data of size N
and can compute a sample mean, M, and a sample standard
deviation, s, but does not know the values of the population
parameters μ or σ. The researcher will want to estimate μ using
the sample M from just one sample. The researcher wants to
have a reasonably clear idea of the magnitude of estimation
error that can be expected when the mean from one sample of
size N is used to estimate μ, the mean of the corresponding
population.
When σ, the population standard deviation, is not known, we
cannot find the value of σM. Instead, we calculate an estimated
standard error (SEM), using the sample standard deviation s to
replace the unknown value of σ in the formula for the standard
error of the mean, as follows (when σ is known):

σM = σ/√N.

When σ is unknown, we use s to estimate σ and relabel the
resulting standard error to make it clear that it is now based on
information about sample variability rather than population
variability of scores:

SEM = s/√N.
The substitution of the sample statistic s as an estimate of the
population σ introduces additional sampling error. Because of
this additional sampling error, we can no longer use the
standard normal distribution to evaluate areas that correspond to
distances from the mean. Instead, a family of distributions
(called t distributions) is used to find areas that correspond to
distances from the mean.
Thus, when σ is not known, we use the sample value of SEM to
estimate σM, and because this substitution introduces additional
sampling error, the shape of the sampling distribution changes
from a normal distribution to a t distribution. When the standard
deviation from a sample (s) is used to estimate σ, the sampling
distribution of M has the following characteristics:
1. It is distributed as a t distribution with df = N − 1.
2. It is centered at μ.
3. The estimated standard error is SEM = s/√N.
2.13 The Family of t Distributions
The family of “t” distributions is essentially a set of “modified”
normal distributions, with a different t distribution for each
value of df (or N). Like the standard normal distribution, a t
distribution is scaled so that t values are unit free. As N and df
decrease, assuming that other factors remain constant, the
magnitude of sampling error increases, and the required amount
of adjustment in distribution shape also increases. A t
distribution (like a normal distribution) is bell shaped and
symmetrical; however, as the N and df decrease, t distributions
become flatter in the middle compared with a normal
distribution, with thicker (heavier) tails than the normal distribution. Thus,
when we have a small df value, such as df = 3, the distance
from the mean that corresponds to the middle 95% of the t
distribution is larger than the corresponding distance in a
normal distribution.
As the value of df increases, the shape of the t distribution
becomes closer to that of a normal distribution; for df > 100, a t
distribution is essentially identical to a normal distribution.
Figure 2.18 shows t distributions for df values of 3, 6, and ∞.
Figure 2.18 Graph of the t Distribution for Three Different df
Values (df = 3, 6, and Infinity, or ∞)
SOURCE:
www.psychstat.missouristate.edu/introbook/sbk24.htm
For a research situation where the sample mean is based on N =
7 cases, df = N − 1 = 6. In this case, the sampling distribution
of the mean would have the shape described by a t distribution
with 6 df; a table for the distribution with df = 6 would be used
to look up the values of t that cut off the top and bottom 2.5%
of the area. The area that corresponds to the middle 95% of the t
distribution with 6 df can be obtained either from the table of
the t distribution in Appendix B or from the diagram in Figure
2.18. When df = 6, 2.5% of the area in the t distribution lies
below t = −2.45, 95% of the area lies between t = −2.45 and t =
+2.45, and 2.5% of the area lies above t = +2.45.
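Because Python's standard library has no t distribution, the tabled cutoff for df = 6 can instead be checked by simulation: draw many samples of N = 7 from a normal population, form the t statistic for each, and find the 97.5th percentile. This is a Monte Carlo sketch, not the table-lookup procedure the text describes.

```python
import math
import random

random.seed(7)

# Monte Carlo check of the tabled cutoff t = +2.45 for df = 6.
n, trials = 7, 100_000
ts = []
for _ in range(trials):
    sample = [random.gauss(0.0, 1.0) for _ in range(n)]
    m = sum(sample) / n
    s = math.sqrt(sum((x - m) ** 2 for x in sample) / (n - 1))  # sample SD
    ts.append(m / (s / math.sqrt(n)))  # t = (M - mu) / SEM, with mu = 0

ts.sort()
t_crit_est = ts[int(0.975 * trials)]  # empirical 97.5th percentile
print(round(t_crit_est, 2))           # close to the tabled 2.45
```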
2.14 Confidence Intervals
2.14.1 The General Form of a CI
When a single value of M in a sample is reported as an estimate
of μ, it is called a point estimate. An interval estimate (CI)
makes use of information about sampling error. A CI is reported
by giving a lower limit and an upper limit for likely values of μ
that correspond to some probability or level of confidence that,
across many samples, the CI will include the actual population
mean μ. The level of “confidence” is an arbitrarily selected
probability, usually 90%, 95%, or 99%.
The computations for a CI make use of the reasoning, discussed
in earlier sections, about the sampling error associated with
values of M. On the basis of our knowledge about the sampling
distribution of M, we can figure out a range of values around μ
that will probably contain most of the sample means that would
be obtained if we drew hundreds or thousands of samples from
the population. SEM provides information about the typical
magnitude of estimation error—that is, the typical distance
between values of M and μ. Statistical theory tells us that (for
values of df larger than 100) approximately 95% of obtained
sample means will likely be within a range of about 1.96 SEM
units on either side of μ.
When we set up a CI around an individual sample mean, M, we
are essentially using some logical sleight of hand and saying
that if values of M tend to be close to μ, then the unknown
value of μ should be reasonably close to (most) sample values
of M. However, the language used to interpret a CI is tricky. It
is incorrect to say that a CI computed using data from a single
sample has a 95% chance of including μ. (It either does or
doesn’t.) We can say, however, that in the long run,
approximately 95% of the CIs that are set up by applying these
procedures to hundreds of samples from a normally distributed
population with mean = μ will include the true population mean,
μ, between the lower and the upper limits. (The other 5% of CIs
will not contain μ.)
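This long-run interpretation can be demonstrated by simulation: build a 95% CI for each of many samples and count how often the interval contains μ. The sketch below assumes a normal population with the hr130 parameters from the text and uses the known-σ interval M ± 1.96σM.

```python
import math
import random

random.seed(42)

# How often does a 95% CI (known sigma) actually capture mu?
mu, sigma, n, trials = 73.76, 7.062, 25, 2000
sigma_m = sigma / math.sqrt(n)

hits = 0
for _ in range(trials):
    m = sum(random.gauss(mu, sigma) for _ in range(n)) / n
    if m - 1.96 * sigma_m <= mu <= m + 1.96 * sigma_m:
        hits += 1

coverage = hits / trials
print(round(coverage, 2))  # close to 0.95
```

Any single interval either contains μ or it does not; it is the long-run proportion of intervals, about .95, that the confidence level describes.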
2.14.2 Setting Up a CI for M When σ Is Known
To set up a 95% CI to estimate the mean when σ, the population
standard deviation, is known, the researcher needs to do the
following:
1. Select a “level of confidence.” In the empirical example that
follows, the level of confidence is set at 95%. In applications of
CIs, 95% is the most commonly used level of confidence.
2. For a sample of N observations, calculate the sample statistic
(such as M) that will be used to estimate the corresponding
population parameter (μ).
3. Use the value of σ (the population standard deviation) and
the sample size N to calculate σM.
4. When σ is known, use the standard normal distribution to
look up the “critical values” of z that correspond to the middle
95% of the area in the standard normal distribution. These
values can be obtained by looking at the table of the standard
normal distribution in Appendix A. For a 95% level of
confidence, from Appendix A, we find that the critical values of
z that correspond to the middle 95% of the area are z = −1.96
and z = +1.96.
This provides the information necessary to calculate the lower
and upper limits for a CI. In the equations below, LL stands for
the lower limit (or boundary) of the CI, and UL stands for the
upper limit (or boundary) of the CI. Because the level of
confidence was set at 95%, the critical values of z, zcritical,
were obtained by looking up the distance from the mean that
corresponds to the middle 95% of the normal distribution. (If a
90% level of confidence is chosen, the z values that correspond
to the middle 90% of the area under the normal distribution
would be used.)
The lower and upper limits of a CI for a sample mean M
correspond to the following:

LL = M − zcritical × σM, (2.11)
UL = M + zcritical × σM. (2.12)
As an example, suppose that a student researcher collects a
sample of N = 25 scores on IQ for a random sample of people
drawn from the population of students at Corinth College. The
WAIS IQ test is known to have σ equal to 15. Suppose the
student decides to set up a 95% CI. The student obtains a
sample mean IQ, M, equal to 128.
The student needs to do the following:
1. Find the value of σM = σ/√N = 15/√25 = 15/5 = 3.
2. Look up the critical values of z that correspond to the middle
95% of a standard normal distribution. From the table of the
normal distribution in Appendix A, these critical values are z =
–1.96 and z = +1.96.
3. Substitute the values for σM and zcritical into Equations 2.11
and 2.12 to obtain the following results:

Lower limit = 128 − 1.96 × 3 = 128 − 5.88 = 122.12;
Upper limit = 128 + 1.96 × 3 = 128 + 5.88 = 133.88.
What conclusions can the student draw about the mean IQ of the
population (all students at Corinth College) from which the
random sample was drawn? It would not be correct to say that
“there is a 95% chance that the true population mean IQ, μ, for
all Corinth College students lies between 122.12 and 133.88.” It
would be correct to say that “the 95% CI around the sample
mean lies between 122.12 and 133.88.” (Note that the value of
100, which corresponds to the mean, μ, for the general adult
population, is not included in this 95% CI for a sample of
students drawn from the population of all Corinth College
students. It appears, therefore, that the population mean WAIS
score for Corinth College students may be higher than the
population mean IQ for the general adult population.)
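The three steps of the Corinth College example can be collected into a short script; the sketch below uses the exact critical value from the inverse normal CDF rather than the rounded 1.96 from the table.

```python
import math
from statistics import NormalDist

# Corinth College IQ example: M = 128, N = 25, sigma = 15 (WAIS).
m, sigma, n = 128.0, 15.0, 25
sigma_m = sigma / math.sqrt(n)       # 15 / 5 = 3.0
z = NormalDist().inv_cdf(0.975)      # about 1.96

lower = m - z * sigma_m              # Equation 2.11
upper = m + z * sigma_m              # Equation 2.12
print(round(lower, 2), round(upper, 2))  # 122.12 133.88
```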
To summarize, the 95% confidence level is not the probability
that the true population mean, μ, lies within the CI that is based
on data from one sample (μ either does lie in this interval or
does not). The confidence level is better understood as a long-
range prediction about the performance of CIs when these
procedures for setting up CIs are followed. We expect that
approximately 95% of the CIs that researchers obtain in the
long run will include the true value of the population mean, μ.
The other 5% of the CIs that researchers obtain using these
procedures will not include μ.
2.14.3 Setting Up a CI for M When the Value of σ Is Not
Known
In a typical research situation, the researcher does not know the
values of μ and σ; instead, the researcher has values of M and s
from just one sample of size N and wants to use this sample
mean, M, to estimate μ. In Section 2.12, I explained that when σ
is not known, we can use s to calculate an estimate of SEM.
However, using SEM (rather than σM) to set up CIs introduces additional sampling error.
To adjust for this additional sampling error, we use the t
distribution with N − 1 degrees of freedom (rather than the
normal distribution) to look up distances from the mean that
correspond to the middle 95% of the area in the sampling
distribution. When N is large (>100), the t distribution
converges to the standard normal distribution; therefore, when
samples are large (N > 100), the standard normal distribution
can be used to obtain the critical values for a CI.
The formulas for the upper and lower limits of the CI when σ is
not known, therefore, differ in two ways from the formulas for
the CI when σ is known. First, when σ is unknown, we replace
σM with SEM. Second, when σ is unknown and N < 100, we
replace zcritical with tcritical, using a t distribution with N − 1
df to look up the critical values (for N ≥ 100, zcritical may be
used).
For example, suppose that the researcher wants to set up a 95%
CI using the sample mean data reported in an earlier section of
this chapter with N = 9, M = 73.11, and s = 5.533 (sample
statistics are from Figure 2.12). The procedure is as follows:
1. Find the value of SEM = s/√N = 5.533/√9 = 5.533/3 = 1.844.
2. Find the tcritical values that correspond to the middle 95% of
the area for a t distribution with df = N − 1 = 9 − 1 = 8. From
the table of the distribution of t, using 8 df, in Appendix B,
these are tcritical = −2.31 and tcritical = +2.31.
3. Substitute the values of M, tcritical, and SEM into the
following equations:
Lower limit = M − [tcritical × SEM] = 73.11 − [2.31 × 1.844] =
73.11 − 4.26 = 68.85;
Upper limit = M + [tcritical × SEM] = 73.11 + [2.31 × 1.844] =
73.11 + 4.26 = 77.37.
What conclusions can the student draw about the mean HR of
the population (all 130 cases in the file named hr130.sav) from
which the random sample of N = 9 cases was drawn? The
student can report that “the 95% CI for mean HR ranges from
68.85 to 77.37.” In this particular situation, we know what μ
really is; the population mean HR for all 130 scores was 73.76
(from Figure 2.5). In this example, we know that the CI that was
set up using information from the sample actually did include μ.
(However, about 5% of the time, when a 95% level of
confidence is used, the CI that is set up using sample data will
not include μ.)
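The σ-unknown procedure for the heart rate sample can be sketched the same way. Because the standard library has no t distribution, the tabled critical value t = 2.31 for df = 8 (Appendix B) is hard-coded here.

```python
import math

# Heart rate example: N = 9, M = 73.11, s = 5.533.
m, s, n = 73.11, 5.533, 9
sem = s / math.sqrt(n)    # 5.533 / 3, about 1.844
t_crit = 2.31             # tabled value for df = 8; not computed here

lower = m - t_crit * sem
upper = m + t_crit * sem
print(round(lower, 2), round(upper, 2))  # 68.85 77.37
```

As the text notes, this particular interval happens to contain the known population mean of 73.76 bpm.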
The sample mean, M, is not the only statistic that has a
sampling distribution and a known standard error. The sampling
distributions for many other statistics are known; thus, it is
possible to identify an appropriate sampling distribution and to
estimate the standard error and set up CIs for many other
sample statistics, such as Pearson’s r.
2.14.4 Reporting CIs
On the basis of recommendations made by Wilkinson and the Task
Force on Statistical Inference (1999), the Publication Manual of
the American Psychological Association (American
Psychological Association [APA], 2009) states that CI
information should be provided for major outcomes wherever
possible. SPSS provides CI information for many, but not all,
outcome statistics of interest. For some sample statistics and for
effect sizes, researchers may need to calculate CIs by hand
(Kline, 2004).
When we report CIs, such as a CI for a sample mean, we remind
ourselves (and our readers) that the actual value of the
population parameter that we are trying to estimate is generally
unknown and that the values of sample statistics are influenced
by sampling error. Note that it may be inappropriate to use CIs
to make inferences about the means for any specific real-world
population if the CIs are based on samples that are not
representative of a specific, well-defined population of interest.
As pointed out in Chapter 1, the widespread use of convenience
samples (rather than random samples from clearly defined
populations) may lead to situations where the sample is not
representative of any real-world population. It would be
misleading to use sample statistics (such as the sample mean,
M) to make inferences about the population mean, μ, for real-
world populations if the members of the sample are not similar
to, or representative of, that real-world population. At best,
when researchers work with convenience samples, they can
make inferences about hypothetical populations that have
characteristics similar to those of the sample.
The results obtained from the analysis of a random sample of
nine HR scores could be reported as follows:
Results
Using the SPSS random sampling procedure, a random sample
of N = 9 cases was selected from the population of 130 scores in
the hr130.sav data file. The scores in this sample appear in
Table 2.2. For this sample of nine cases, mean HR M = 73.11
beats per minute (bpm), with SD = 5.53 bpm. The 95% CI for
the mean based on this sample had a lower limit of 68.85 and an
upper limit of 77.37.
2.15 Summary
Many statistical analyses include relatively simple terms that
summarize information across X scores, such as ∑X and ∑X2. It
is helpful to recognize that whenever a formula includes ∑X,
information about the mean of X is being taken into account;
when terms involving ∑X2 are included, information about
variance is included in the computations.
This chapter reviewed several basic concepts from introductory
statistics:
1. The computation and interpretation of sample statistics,
including the mean, variance, and standard deviation, were
discussed.
2. A z score is used as a unit-free index of the distance of a
single X score from the mean of a normal distribution of
individual X scores. Because values of z have a fixed
relationship to areas under the normal distribution curve, a z
score can be used to answer questions such as, What proportion
or percentage of cases have scores higher than X?
3. Sampling error arises because the value of a sample statistic
such as M varies across samples when many random samples are
drawn from the same population.
4. Given some assumptions (e.g., that the distribution of scores
in the population of interest is normal in shape), it is possible to
predict the shape, mean, and variance of the sampling
distribution of M. When σ is known, the sampling distribution
of M has the following known characteristics: It is normal in
shape; the mean of the distribution of values of M corresponds
to μ, the population mean; and the standard deviation or
standard error that describes typical distances of sample mean
values of M from μ is given by σ/√N. When σ is not known and
the researcher uses a sample standard deviation s to estimate σ,
a second source of sampling error arises; we now have potential
errors in estimation of σ using s as well as errors of estimation
of μ using M. The magnitude of this additional sampling error
depends on N, the size of the samples that are used to calculate
M and s.
5. Additional sampling error arises when s is used to estimate σ.
This additional sampling error requires us to refer to a different
type of sampling distribution when we evaluate distances of
individual M values from the center of the sampling
distribution—that is, the family of t distributions (instead of the
standard normal distribution).
6. The family of t distributions has a different distribution
shape for each degree of freedom. As the df for the t
distribution increases, the shape of the t distribution becomes
closer to that of a standard normal distribution. When N (and
therefore df) becomes greater than 100, the difference between
the shape of the t and normal distributions becomes so small
that distances from the mean can be evaluated using the normal
distribution curve.
7. All these pieces of information come together in the formula
for the CI. We can set up an “interval estimate” for μ based on
the sample value of M and the amount of sampling error that is
theoretically expected to occur.
8. Recent reporting guidelines for statistics (e.g., Wilkinson and
the Task Force on Statistical Inference, 1999) recommend that
CIs should be included for all important statistical outcomes in
research reports wherever possible.
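Point 6 can be illustrated numerically. The sketch below (not part of the Warner text, and assuming SciPy is available) prints two-tailed .05 critical values of t for increasing degrees of freedom; they converge toward the normal-distribution value of 1.96.

```python
from scipy import stats

# Two-tailed .05 critical values of t shrink toward the normal value 1.96 as df grows
for df in (5, 30, 100, 1000):
    print(df, round(stats.t.ppf(0.975, df), 3))
# 5 2.571 / 30 2.042 / 100 1.984 / 1000 1.962
```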
Appendix on SPSS
The examples in this textbook use IBM SPSS Version 19.0.
Students who have never used SPSS (or programs that have
similar capabilities) may need an introduction to SPSS, such as
George and Mallery (2010). As with other statistical packages,
students may either purchase a personal copy of the SPSS
software and install it on a PC or use a version installed on their
college or university computer network. When SPSS access has
been established (either by installing a personal copy of SPSS
on a PC or by doing whatever is necessary to access the college
or university network version of SPSS), an SPSS® icon appears
on the Windows desktop, or an SPSS for Windows folder can be
opened by clicking on Start in the lower left corner of the
computer screen and then on All Programs. When SPSS is
started in this manner, the initial screen asks the user whether
he or she wants to open an existing data file or type in new
data.
When students want to work with existing SPSS data files, such
as the SPSS data files on the website for this textbook, they can
generally open these data files just by clicking on the SPSS data
file; as long as the student has access to the SPSS program,
SPSS data files will automatically be opened using this
program. SPSS can save and read several different file formats.
On the website that accompanies this textbook, each data file is
available in two formats: as an SPSS system file (with a full file
name of the form dataset.sav) and as an Excel file (with a file
name of the form dataset.xls). Readers who use programs other
than SPSS will need to use the drop-down menu that lists
various “file types” to tell their program (such as SAS) to look
for and open a file that is in Excel XLS format (rather than the
default SAS format).
SPSS examples are presented in sufficient detail in this
textbook so that students should be able to reproduce any of the
analyses that are discussed. Some useful data-handling features
of SPSS (such as procedures for handling missing data) are
discussed in the context of statistical analyses, but this textbook
does not provide a comprehensive treatment of the features in
SPSS. Students who want a more comprehensive treatment of
SPSS may consult books by Norusis and SPSS (2010a, 2010b).
Note that the titles of recent books sometimes refer to SPSS as
PASW, a name that applied only to Version 18 of SPSS.
Notes
1. Demonstrations do not constitute proofs; however, they
require less lengthy explanations and less mathematical
sophistication from the reader than proofs or formal
mathematical derivations. Throughout this book, demonstrations
are offered instead of proofs, but readers should be aware that a
demonstration only shows that a result works using the specific
numbers involved in the demonstration; it does not constitute a
proof.
2. The population variance, σ2, is defined as σ2 = ∑(X − μ)2/N.
I have already commented that when we calculate a sample
variance, s2, using the formula s2 = ∑(X − M)2/(N − 1), we need
to use N − 1 as the divisor to take into account the fact that we
only have N − 1 independent deviations from the sample mean.
However, a second problem arises when we calculate s2; that is,
we calculate s2 using M, an estimate of μ that is also subject to
sampling error.
Comprehension Questions
1.
Consider the following small set of scores. Each number
represents the number of siblings reported by each of the N = 6
persons in the sample: X scores are [0, 1, 1, 1, 2, 7].
a.
Compute the mean (M) for this set of six scores.
b.
Compute the six deviations from the mean (X − M), and list
these six deviations.
c.
What is the sum of the six deviations from the mean you
reported in (b)? Is this outcome a surprise?
d.
Now calculate the sum of squared deviations (SS) for this set of
six scores.
e.
Compute the sample variance, s2, for this set of six scores.
f.
When you compute s2, why should you divide SS by (N − 1)
rather than by N?
g.
Finally, compute the sample standard deviation (denoted by
either s or SD).
2.
In your own words, what does an SS tell us about a set of data?
Under what circumstances will the value of SS equal 0? Can SS
ever be negative?
3.
For each of the following lists of scores, indicate whether the
value of SS will be negative, 0, between 0 and +15, or greater
than +15. (You do not need to actually calculate SS.)
Sample A: X = [103, 156, 200, 300, 98]
Sample B: X = [103, 103, 103, 103, 103, 103]
Sample C: X = [101, 102, 103, 102, 101]
4.
For a variable that interests you, discuss why there is variance
in scores on that variable. (In Chapter 2, e.g., there is a
discussion of factors that might create variance in heart rate,
HR.)
5.
Assume that a population of thousands of people whose
responses were used to develop the anxiety test had scores that
were normally distributed with μ = 30 and σ = 10. What
proportion of people in this population would have anxiety
scores within each of the following ranges of scores?
a.
Below 20
b.
Above 30
c.
Between 10 and 50
d.
Below 10
e.
Below 50
f.
Above 50
g.
Either below 10 or above 50
Assuming that a score in the top 5% of the distribution
would be considered extremely anxious, would a person whose
anxiety score was 50 be considered extremely anxious?
6.
What is a confidence interval (CI), and what information is
required to set up a CI?
7.
What is a sampling distribution? What do we know about the
shape and characteristics of the sampling distribution for M, the
sample mean?
8.
What is SEM? What does the value of SEM tell you about the
typical magnitude of sampling error?
a.
As s increases, how does the size of SEM change (assuming that
N stays the same)?
b.
As N increases, how does the size of SEM change (assuming
that s stays the same)?
9.
How is a t distribution similar to a standard normal
distribution? How is it different?
10.
Under what circumstances should a t distribution be used rather
than the standard normal distribution to look up areas or
probabilities associated with distances from the mean?
11.
Consider the following questions about CIs.
A researcher tests emotional intelligence (EI) for a random
sample of children selected from a population of all students
who are enrolled in a school for gifted children. The researcher
wants to estimate the mean EI for the entire school. The
population standard deviation, σ, for EI is not known.
Let’s suppose that a researcher wants to set up a 95% CI for
IQ scores using the following information:
The sample mean M = 130.
The sample standard deviation s = 15.
The sample size N = 120.
The df = N − 1 = 119.
For the values given above, the limits of the 95% CI are as
follows:
Lower limit = 130 − 1.96 × 1.37 = 127.31;
Upper limit = 130 + 1.96 × 1.37 = 132.69.
The following exercises ask you to experiment to see how
changing some of the values involved in computing the CI
influences the width of the CI.
Recalculate the CI above to see how the lower and upper
limits (and the width of the CI) change as you vary the N in the
sample (and leave all the other values the same).
a.
What are the upper and lower limits of the CI and the width of
the 95% CI if all the other values remain the same (M = 130, s =
15) but you change the value of N to 16?
For N = 16, lower limit = _________ and upper limit =
____________.
Width (upper limit − lower limit) = ______________________.
Note that when you change N, you need to change two things:
the computed value of SEM and the degrees of freedom used to
look up the critical values for t.
b.
What are the upper and lower limits of the CI and the width of
the 95% CI if all the other values remain the same but you
change the value of N to 25?
For N = 25, lower limit = __________ and upper limit =
___________.
Width (upper limit – lower limit) = _______________________.
c.
What are the upper and lower limits of the CI and the width of
the 95% CI if all the other values remain the same (M = 130, s =
15) but you change the value of N to 49?
For N = 49, lower limit = __________ and upper limit =
___________. Width (upper limit – lower limit) =
______________________.
d.
Based on the numbers you reported for sample size N of 16, 25,
and 49, how does the width of the CI change as N (the number
of cases in the sample) increases?
e.
What are the upper and lower limits and the width of this CI if
you change the confidence level to 80% (and continue to use M
= 130, s = 15, and N = 49)?
For an 80% CI, lower limit = ________ and upper limit =
__________.
Width (upper limit – lower limit) = ______________________.
f.
What are the upper and lower limits and the width of the CI if
you change the confidence level to 99% (continue to use M =
130, s = 15, and N = 49)?
For a 99% CI, lower limit = ________ and upper limit =
___________.
Width (upper limit – lower limit) = ______________________.
g.
How does increasing the level of confidence from 80% to 99%
affect the width of the CI?
12.
Data Analysis Project:
The N = 130 scores in the temphr.sav file are hypothetical
data created by Shoemaker (1996) so that they yield results
similar to those obtained in an actual study of temperature and
HR (Mackowiak et al., 1992).
Use the Temperature data in the temphr.sav file to do the
following:
Note that temperature in degrees Fahrenheit (tempf) can be
converted into temperature in degrees centigrade (tempc) by the
following: tempc = (tempf − 32)/1.8.
The following analyses can be done on tempf, tempc, or both
tempf and tempc.
a.
Find the sample mean, M; standard deviation, s; and standard
error of the mean, SEM, for scores on temperature.
b.
Examine a histogram of scores on temperature. Is the shape of
the distribution reasonably close to normal?
c.
Set up a 95% CI for the sample mean, using your values of M, s,
and N (N = 130 in this dataset).
d.
The temperature that is popularly believed to be “average” or
“healthy” is 98.6°F (or 37°C). Does the 95% CI based on this
sample include the value 98.6, which is widely believed to
represent an “average/healthy” temperature? What conclusion
might you draw from this result?
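As a quick check of the conversion formula used in this exercise, a minimal sketch (the function name is illustrative):

```python
def tempf_to_tempc(tempf):
    """Convert degrees Fahrenheit to degrees centigrade: tempc = (tempf - 32) / 1.8."""
    return (tempf - 32) / 1.8

print(tempf_to_tempc(98.6))  # the "average/healthy" temperature, about 37 degrees C
```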
(Warner 71-80)
Warner, R. M. (2012). Applied Statistics: From
Bivariate Through Multivariate Techniques (2nd ed.). SAGE
Publications, Inc. VitalBook file.
In Unit 1, you read about the difference between descriptive
statistics and inferential statistics in Chapter 1 of
your Warner text. For the next two units, we will focus on the
theory, logic, and application of descriptive
statistics. This introduction focuses on scales of measurement,
measures of central tendency and dispersion, the
visual inspection of histograms, and the detection and
processing of outliers.
An important concept in understanding descriptive statistics is
the scales of measurement. The Warner (2013)
text defines four scales of measurement—nominal, ordinal,
interval, and ratio:
• Nominal data refer to numbers arbitrarily assigned to
represent group membership, such as gender
(male = 1; female = 2). Nominal data are useful in comparing
groups, but they are meaningless in terms
of measures of central tendency and dispersion.
• Ordinal data represent ranked data, such as coming in first,
second, or third in a marathon. However,
ordinal data do not tell us how much of a difference there is
between measurements. The first-place and
second-place finishers could finish 1 second apart, whereas the
third-place finisher arrives 2 minutes later.
Ordinal data lack equal intervals.
• Interval data refer to equal intervals between data points. An
example is degrees measured in Fahrenheit. Interval data lack a
"true zero" value (e.g., 0 degrees Fahrenheit does not mean "no
temperature"; water freezes at 32 degrees Fahrenheit).
• Ratio data do have a true zero, such as heart rate, where "0"
represents a heart that is not beating. This
is often seen as "count" data in social research. For example,
how many days did an employee miss from
work? Zero is a meaningful unit in this example.
These four scales of measurement are routinely reviewed in
introductory statistics textbooks as the classic way
of differentiating measurements. However, the boundaries
between the measurement scales are fuzzy. For
example, is intelligence quotient (IQ) measured on the ordinal
or interval scale? Recently, researchers have
argued for a simpler dichotomy in terms of selecting an
appropriate statistic: categorical versus continuous
measures.
• A categorical variable is a nominal variable. It simply
categorizes things according to group membership
(for example, apple = 1, banana = 2, grape = 3).
• A continuous measure represents a difference in magnitude of
something, such as a continuum of "low
to high" statistics anxiety. In contrast to categorical variables
designated by arbitrary values, a
quantitative measure allows for a variety of arithmetic
operations, including equal (=), less than (<),
greater than (>), addition (+), subtraction (−), multiplication (*
or ×), and division (/ or ÷). Arithmetic
operations generate a variety of descriptive statistics discussed
next.
Measures of Central Tendency and Dispersion
Chapter 2 of Warner (2013) reviews descriptive statistics that
measure central tendency (mean, median, mode)
and dispersion (range, sum of squares, variance, standard
deviation). To visualize central tendency and
dispersion, refer to Figure 2.5 on page 46 of the Warner text for
an illustration of how heart rate data are
represented in a histogram. The horizontal axis represents heart
rate ("hr"). The vertical axis represents the total
number of people who were recorded at a particular heart rate
("Frequency"). Measures of centrality summarize
where data clump together at the center of a distribution of
scores. (For example, in Figure 2.5 this occurs
around hr = 74.)
Unit 2 - Descriptive Statistics: Theory and Logic
INTRODUCTION
To simplify, consider the following measured heart rates: 65,
70, 75, 75, 130.
The simplest measure of central tendency is the mode. It is the
most frequent score within a distribution of
scores (for example, two scores of hr = 75). Technically, in a
distribution of scores, you can have two or more
modes. An advantage of the mode is that it can be applied to
categorical data. It is also not sensitive to
extreme scores.
The median is the positional center of a distribution because of
how it is calculated: all scores are arranged in ascending order,
and the score in the middle is the median. In the
five heart rates above, the middle score is 75. If
you have an even number of scores, the average of the two
middle scores is used. The median also has the
advantage of not being sensitive to extreme scores.
The mean is probably what most people consider to be an
average score. In the example above, the mean
heart rate is (65 + 70 + 75 + 75 + 130) ÷ 5 = 83. Although the
mean is more sensitive to extreme scores (such as
130) relative to the mode and median, it can be more stable
across samples, and it is the best estimate of the
population mean. It is also used in many of the inferential
statistics studied in this course, such as t tests and
analysis of variance (ANOVA).
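These three measures of central tendency can be verified for the five heart rates with Python's standard library (an illustrative sketch, not part of the course materials):

```python
import statistics

hr = [65, 70, 75, 75, 130]

print(statistics.mode(hr))    # 75: the most frequent score
print(statistics.median(hr))  # 75: the middle score after sorting
print(statistics.mean(hr))    # 83: (65 + 70 + 75 + 75 + 130) / 5
```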
In contrast to measures of central tendency, measures of
dispersion summarize how far apart data are spread on
a distribution of scores. The range is a basic measure of
dispersion quantifying the distance between the lowest
score and the highest score in a distribution (for example, 130 −
65 = 65). A deviance represents the difference
between an individual score and the mean. For example, the
deviance for the first heart rate score (65) is 65 −
83, which is −18. By calculating the deviance for each score
above from a mean of 83, we arrive at −18, −13,
−8, −8, and +47. Summing all of the deviances equals 0, which
is not a very informative measure of dispersion.
A somewhat more informative measure of dispersion is the sum of
squares (SS), which you will see again in Units 9
and 10 in the study of analysis of variance (ANOVA). To get
around the problem of summing to zero, the sum of
squares involves calculating the square of each deviation and
then summing those squares. In the example
above, SS = [(−18)² + (−13)² + (−8)² + (−8)² + (+47)²] =
[324 + 169 + 64 + 64 + 2209] = 2830. The
problem with SS is that it increases as data points increase
(Field, 2013), and it still is not a very informative
measure of dispersion.
This problem is solved by next calculating the sample variance
(s2), which is approximately the average squared distance between
the mean and a particular score. Instead of dividing SS by 5 for
the example above, we divide by N − 1, or 4; see pages 56–57 of
your Warner text for an explanation. The variance is therefore
SS ÷ (N − 1), or 2830 ÷ 4 = 707.5. The problem with interpreting
variance is that it is the
average distance of "squared units" from the
mean. What is, for example, a "squared" heart rate score?
The final step is calculating the sample standard deviation (s),
which is simply the square root of
the sample variance, or in our example, √707.5 = 26.60. The
sample standard deviation represents the average
deviation of scores from the mean. In other words, the average
distance of heart rate scores to the mean is 26.6
beats per minute. If the extreme score of 130 is replaced with a
score closer to the mean, such as 90, then s =
9.35. Thus, small standard deviations (relative to the mean)
represent a small amount of dispersion; large
standard deviations (relative to the mean) represent a large
amount of dispersion (Field, 2013). The standard
deviation is an important component of the normal distribution.
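The chain of dispersion calculations above (range, SS, variance, standard deviation) can likewise be verified with Python's standard library (an illustrative sketch):

```python
import statistics

hr = [65, 70, 75, 75, 130]
m = statistics.mean(hr)

rng = max(hr) - min(hr)               # range: 130 - 65 = 65
ss = sum((x - m) ** 2 for x in hr)    # sum of squared deviations: 2830
var = statistics.variance(hr)         # SS / (N - 1) = 2830 / 4 = 707.5
sd = statistics.stdev(hr)             # square root of the variance, about 26.60
```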
Visual Inspection of a Distribution of Scores
An assumption of the statistical tests that you will study in this
course is that the scores for a dependent variable
are normal (or approximately normal) in shape. This assumption
is first checked by examining a histogram of the
distribution. Figure 4.19 in the Warner text (p. 147) represents a
distribution of heart rate scores that are
approximately normal in shape and visualized in terms of a bell-
shaped curve. Notice that the tails of the
distribution are approximately symmetrical, meaning that they
are near mirror images to the left and right of the
mean. This distribution technically has two modes at hr = 70
and hr = 76, but the close proximity of these
modes suggests a unimodal distribution.
Departures from normality and symmetry are assessed in terms
of skew and kurtosis. Skewness is the tilt, or the extent to
which a distribution deviates from symmetry around the mean.
A distribution that is positively skewed has a
longer tail extending to the right (the "positive" side of the
distribution) as shown in Figure 4.20 of the Warner
text (p. 148). A distribution that is negatively skewed has a
longer tail extending to the left (the "negative" side
of the distribution) as shown in Figure 4.21 of the Warner text
(p. 149). In contrast to skewness, kurtosis is
defined as the peakedness of a distribution of scores. Figure
4.22 of the Warner text (p. 150) illustrates a
distribution with normal kurtosis, negative kurtosis (a "flat"
distribution; platykurtic), and positive kurtosis (a
"sharp" peak; leptokurtic).
The use of these terms is not limited to your description of a
distribution following a visual inspection. They are
included in your list of descriptive statistics and should be
included when analyzing your distribution of scores.
Skewness and kurtosis values near zero indicate a shape that is
symmetric and normally peaked, respectively. Values
of −1 to +1 are considered ideal, whereas values ranging from
−2 to +2 are considered acceptable for
psychometric purposes.
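Skewness and excess kurtosis can be estimated from the central moments of the data. The sketch below (illustrative only; SPSS applies small-sample corrections, so its reported values differ slightly) uses the five toy heart rates, where the extreme score of 130 produces a clear positive skew:

```python
import statistics

hr = [65, 70, 75, 75, 130]
n = len(hr)
m = statistics.mean(hr)

# Central moments: mean of deviations raised to the 2nd, 3rd, and 4th powers
m2 = sum((x - m) ** 2 for x in hr) / n
m3 = sum((x - m) ** 3 for x in hr) / n
m4 = sum((x - m) ** 4 for x in hr) / n

skewness = m3 / m2 ** 1.5    # positive: the 130 stretches the right tail
kurtosis = m4 / m2 ** 2 - 3  # excess kurtosis; 0 for a normal curve
```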
Outliers
Outliers are defined as extreme scores on either the left or right
tail of a distribution, and they can influence the
overall shape of that distribution. There are a variety of
methods for identifying and adjusting for outliers.
Outliers can be detected by calculating z scores (reviewed in
Unit 4) or by inspection of a box plot. Once an
outlier is detected, the researcher must determine how to handle
it. The outlier may represent a data entry
error that should be corrected, or the outlier may be a valid
extreme score. The outlier can be left alone,
deleted, or transformed. Whatever decision is made regarding an
outlier, the researcher must be transparent
and justify his or her decision.
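The z-score approach to outlier detection mentioned above can be sketched as follows; the data and the 2.5 cutoff are illustrative (published conventions vary, e.g., 3.0 or 3.29):

```python
import statistics

# Hypothetical heart-rate sample containing one suspect extreme score
hr = [65, 70, 72, 74, 75, 75, 76, 78, 80, 130]
m, sd = statistics.mean(hr), statistics.stdev(hr)

# Flag scores whose z score exceeds 2.5 in absolute value
outliers = [x for x in hr if abs((x - m) / sd) > 2.5]
print(outliers)  # [130]
```

A flagged score is only a candidate: as the paragraph above notes, the researcher must still decide whether it is a data-entry error or a valid extreme value.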
References
Field, A. (2013). Discovering statistics using IBM SPSS (4th
ed.). Thousand Oaks, CA: Sage.
Warner, R. M. (2013). Applied statistics: From bivariate
through multivariate techniques (2nd ed.). Thousand
Oaks, CA: Sage.
OBJECTIVES
To successfully complete this learning unit, you will be
expected to:
1. Analyze the strengths and limitations of descriptive statistics.
2. Identify previous experience with and future applications of
descriptive statistics.
3. Analyze the purpose and reporting of confidence intervals.
4. Discuss standard error and confidence intervals.
Unit 2 Study 1- Readings
Use your Warner text, Applied Statistics: From Bivariate
Through Multivariate Techniques, to complete
the following:
• Read Chapter 2, "Basic Statistics, Sampling Error, and
Confidence Intervals," pages 41–80. This
reading addresses the following topics:
◦ Sample mean (M).
◦ Sum of squared deviations (SS).
◦ Sample variance (s2).
◦ Sample standard deviation (s).
◦ Sample standard error (SE).
◦ Confidence intervals (CIs).
• Read Chapter 4, "Preliminary Data Screening" pages 125–184.
This reading addresses the following
topics:
◦ Problems in real data.
◦ Identification of errors and inconsistencies.
◦ Missing values.
◦ Data screening for individual variables.
◦ Data screening for bivariate analysis.
◦ Data transformations.
◦ Reporting preliminary data screening.
SOE Learners – Suggested Readings
Young, J. R., Young, J. L., & Hamilton, C. (2014). The use of
confidence intervals as a meta-analytic lens to
summarize the effects of teacher education technology courses
on preservice teacher TPACK. Journal
of Research on Technology in Education, 46(2), 149–172.
For this discussion:
• Discuss your previous experience with descriptive statistics.
For example, you have probably
encountered descriptive statistics in an undergraduate course
and in journal articles.
• Analyze the strengths and limitations of descriptive statistics.
• Finally, discuss how you might use descriptive statistics in
your professional or academic future.
• Remember to cite your supporting references.

BP 704 T. NOVEL DRUG DELIVERY SYSTEMS (UNIT 1)
IGGE1 Understanding the Self1234567891011
Trump Administration's workforce development strategy
What if we spent less time fighting change, and more time building what’s rig...
AI-driven educational solutions for real-life interventions in the Philippine...
Hazard Identification & Risk Assessment .pdf
B.Sc. DS Unit 2 Software Engineering.pptx
Paper A Mock Exam 9_ Attempt review.pdf.
202450812 BayCHI UCSC-SV 20250812 v17.pptx
OBE - B.A.(HON'S) IN INTERIOR ARCHITECTURE -Ar.MOHIUDDIN.pdf
My India Quiz Book_20210205121199924.pdf
Empowerment Technology for Senior High School Guide
Unit 4 Computer Architecture Multicore Processor.pptx
Chinmaya Tiranga Azadi Quiz (Class 7-8 )
A powerpoint presentation on the Revised K-10 Science Shaping Paper
Black Hat USA 2025 - Micro ICS Summit - ICS/OT Threat Landscape
MBA _Common_ 2nd year Syllabus _2021-22_.pdf
Soft-furnishing-By-Architect-A.F.M.Mohiuddin-Akhand.doc
FORM 1 BIOLOGY MIND MAPS and their schemes

Pg. 05Question FiveAssignment #Deadline Day 22.docx

4 - PRELIMINARY DATA SCREENING

4.1 Introduction: Problems in Real Data

Real datasets often contain errors, inconsistencies in responses or measurements, outliers, and missing values. Researchers should conduct thorough preliminary data screening to identify and remedy potential problems with their data prior to running the data analyses that are of primary interest. Analyses based on a dataset that contains errors, or data that seriously violate assumptions that are required for the analysis, can yield misleading results. Some of the potential problems with data are as follows: errors in data coding and data entry, inconsistent responses, missing values, extreme outliers, nonnormal distribution shapes, within-group sample sizes that are too small for the intended analysis, and nonlinear relations between quantitative variables. Problems with data should be identified and remedied (as adequately as possible) prior to analysis. A research report should include a summary of problems detected in the data and any remedies that were employed (such as deletion of outliers or data transformations) to address these problems.
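The screening steps listed above can also be sketched outside SPSS. Here is a minimal first pass, assuming pandas; the variable names echo the blood pressure study used later in this chapter, but the scores are invented:

```python
import pandas as pd

# Invented scores loosely echoing the bpstudy.sav variables used below;
# None marks a missing value, and 88 is a deliberately extreme age.
df = pd.DataFrame({
    "GENDER": [1, 2, 2, 1, 2],
    "AGE":    [19, 21, 88, 20, None],
})

# A first screening pass: ranges and counts per variable, then missing values.
print(df.describe())    # min/max expose out-of-range and extreme scores
print(df.isna().sum())  # number of missing observations per variable
```

This is only an overview step; the sections that follow look at each kind of problem (errors, missing values, outliers) in turn.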
4.2 Quality Control During Data Collection

There are many different possible methods of data collection. A psychologist may collect data on personality or attitudes by asking participants to answer questions on a questionnaire. A medical researcher may use a computer-controlled blood pressure monitor to assess systolic blood pressure (SBP) or other physiological responses. A researcher may record observations of animal behavior. Physical measurements (such as height or weight) may be taken. Most methods of data collection are susceptible to recording errors or artifacts, and researchers need to know what kinds of errors are likely to occur.

For example, researchers who use self-report data to do research on personality or attitudes need to be aware of common problems with this type of data. Participants may distort their answers because of social desirability bias; they may misunderstand questions; they may not remember the events that they are asked to report about; they may deliberately try to “fake good” or “fake bad”; they may even make random responses without reading the questions. A participant may accidentally skip a question on a survey and, subsequently, use the wrong lines on the answer sheet to enter each response; for example, the response to Question 4 may be filled in as Item 3 on the answer sheet, the response to Question 5 may be filled in as Item 4, and so forth. In addition, research assistants have been known to fill in answer sheets themselves instead of having the participants complete them. Good quality control in the collection of self-report data requires careful consideration of question wording and response format and close supervision of the administration of surveys. Converse and Presser (1999); Robinson, Shaver, and Wrightsman (1991); and Stone, Turkkan, Kurtzman, Bachrach, and Jobe (1999) provide more detailed discussion of methodological issues in the collection of self-report data.

For observer ratings, it is important to consider issues of reactivity (i.e., the presence of an observer may actually change the behavior that is being observed). It is important to establish good interobserver reliability through training of observers and empirical assessment of interobserver agreement. See Chapter 21 in this book for discussion of reliability, as well as Aspland and Gardner (2003), Bakeman (2000), Gottman and Notarius (2002), and Reis and Gable (2000) for further discussion of methodological issues in the collection of observational data.

For physiological measures, it is necessary to screen for artifacts (e.g., when electroencephalogram electrodes are attached near the forehead, they may detect eye blinks as well as brain activity; these eye blink artifacts must be removed from the electroencephalogram signal prior to other processing). See Cacioppo, Tassinary, and Berntson (2000) for methodological issues in the collection of physiological data. This discussion does not cover all possible types of measurement problems, of course; it only mentions a few of the many possible problems that may arise in data collection. Researchers need to be aware of potential problems or sources of artifact associated with any data collection method that they use, whether they use data from archival sources, experiments with animals, mass media, social statistics, or other methods not mentioned here.

4.3 Example of an SPSS Data Worksheet

The dataset used to illustrate data-screening procedures in this chapter is named bpstudy.sav. The scores appear in Table 4.1, and an image of the corresponding SPSS worksheet appears in Figure 4.1. This file contains selected data from a dissertation that assessed the effects of social stress on blood pressure (Mooney, 1990). The most important features in Figure 4.1 are as follows. Each row in the Data View worksheet corresponds to the data for 1 case or 1 participant. In this example, there are a total of N = 65 participants; therefore, the dataset has 65 rows.
Each column in the Data View worksheet corresponds to a variable; the SPSS variable names appear along the top of the data worksheet. In Figure 4.1, scores are given for the following SPSS variables: idnum, GENDER, SMOKE (smoking status), AGE, SYS1 (systolic blood pressure, SBP, at Time 1), DIA1 (diastolic blood pressure, DBP, at Time 1), HR1 (heart rate at Time 1), and WEIGHT. The numerical values contained in this data file were typed into the SPSS Data View worksheet by hand.

Table 4.1 Data for the Blood Pressure/Social Stress Study
SOURCE: Mooney (1990).
NOTES:
1. idnum = arbitrary, unique identification number for each participant.
2. GENDER was coded 1 = male, 2 = female.
3. SMOKE was coded 1 = nonsmoker, 2 = light smoker, 3 = moderate smoker, 4 = heavy or regular smoker.
4. AGE = age in years.
5. SYS1 = systolic blood pressure at Time 1/baseline.
6. DIA1 = diastolic blood pressure at Time 1/baseline.
7. HR1 = heart rate at Time 1/baseline.
8. WEIGHT = body weight in pounds.

The menu bar across the top of the SPSS Data View worksheet in Figure 4.1 can be used to select menus for different types of procedures. The pull-down menu for <File> includes options such as opening and saving data files. The pull-down menus for <Analyze> and <Graphs> provide access to SPSS procedures for data analysis and graphics, respectively. The two tabs near the lower left-hand corner of the Data View of the SPSS worksheet can be used to toggle back and forth between the Data View (shown in Figure 4.1) and the Variable View (shown in Figure 4.2) versions of the SPSS data file.

The Variable View of an SPSS worksheet, shown in Figure 4.2, provides a place to document and describe the characteristics of each variable, to supply labels for variables and score values, and to identify missing values. For example, examine the row of the Variable View worksheet that corresponds to the variable named GENDER. The scores on this variable were numerical; that is, the scores are in the form of numbers (rather than alphabetic characters). Other possible variable types include dates or string variables that consist of alphabetic characters instead of numbers. If the researcher needs to identify a variable as string or date type, he or she clicks on the cell for Variable Type and selects the appropriate variable type from the pull-down menu list. In the datasets used as examples in this textbook, almost all the variables are numerical.

Figure 4.1 SPSS Worksheet for the Blood Pressure/Social Stress Study (Data View) in bpstudy.sav
SOURCE: Mooney (1990).

The Width column indicates how many significant digits the scores on each variable can have. For this example, the variables GENDER and SMOKE were each allowed a one-digit code, the variable AGE was allowed a two-digit code, and the remaining variables (heart rate, blood pressure, and body weight) were each allowed three digits. The Decimals column indicates how many digits are displayed after the decimal point. All the variables in this dataset (such as age in years and body weight in pounds) are given to the nearest integer value, and so all these variables are displayed with 0 digits to the right of the decimal place. If a researcher has a variable, such as grade point average (GPA), that is usually reported to two decimal places (as in GPA = 2.67), then he or she would select 2 as the number of digits to display after the decimal point.

Figure 4.2 SPSS Worksheet for the Blood Pressure/Social Stress Study (Variable View)

The next column, Label, provides a place where each variable name can be associated with a longer descriptive label. This is particularly helpful when brief SPSS variable names are not completely self-explanatory. For example, “body weight in pounds” appears as a label for the variable WEIGHT. The Values column provides a place where labels can be associated with the individual score values of each variable; this is primarily used with nominal or categorical variables. Figure 4.3 shows the dialog window that opens up when the user clicks on the cell for Values for the variable GENDER. To associate each score with a verbal label, the user types in the score (such as 1) and the corresponding verbal label (such as male) and then clicks the Add button to add this label to the list of value labels. When all the labels have been specified, clicking on OK returns to the main Variable View worksheet. In this example, a score of 1 on GENDER corresponds to male and a score of 2 on GENDER corresponds to female.

The column headed Missing provides a place to identify scores as codes for missing values. Consider the following example to illustrate the problem that arises in data analysis when there are missing values. Suppose that a participant did not answer the question about body weight. If the data analyst enters a value of 0 for the body weight of this person and does not identify 0 as a code for a missing value, this value of 0 would be included when SPSS sums the scores on body weight to compute a mean for weight. The sample mean is not robust to outliers; that is, a sample mean for body weight will be substantially lower when a value of 0 is included for a participant than it would be if that value of 0 was excluded from the computation of the sample mean.

What should the researcher do to make sure that missing values are not included in the computation of sample statistics? SPSS provides two different ways to handle missing score values. The first option is to leave the cell in the SPSS Data View worksheet that corresponds to the missing score blank. In Figure 4.1, Participant 12 did not answer the question about smoking status; therefore, the cell that corresponds to the response to the variable SMOKE for Participant 12 was left blank. By default, SPSS treats empty cells as “system missing” values. If a table of frequencies is set up for scores on SMOKE, the response for Participant 12 is labeled as a missing value. If a mean is calculated for scores on smoking, the score for Participant 12 is not included in the computation as a value of 0; instead, it is omitted from the computation of the sample mean.
Figure 4.3 Value Labels for Gender

A second method is available to handle missing values; it is possible to use different code numbers to represent different types of missing data. For example, a survey question that is a follow-up about the amount and frequency of smoking might be coded 9 if it was not applicable to an individual (because that individual never smoked), 99 if the question was not asked because the interviewer ran out of time, and 88 if the respondent refused to answer the question. For the variable WEIGHT, body weight in pounds, a score of 999 was identified as a missing value by clicking on the cell for Missing and then typing a score of 999 into one of the windows for missing values; see the Missing Values dialog window in Figure 4.4. A score of 999 is defined as a missing value code, and therefore, these scores are not included when statistics are calculated.

It is important to avoid using codes for missing values that correspond to possible valid responses. Consider the question, How many children are there in a household? It would not make sense to use a score of 0 or a score of 9 as a code for missing values, because either of these could correspond to the number of children in some households. It would be acceptable to use a code of 99 to represent a missing value for this variable because no single-family household could have such a large number of children.

The next few columns in the SPSS Variable View worksheet provide control over the way the values are displayed in the Data View worksheet. The column headed Columns indicates the display width of each column in the SPSS Data View worksheet, in number of characters. This was set at eight characters wide for most variables. The column headed Align indicates whether scores will be shown left justified, centered, or (as in this example) right justified in each column of the worksheet.
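User-missing codes such as the 999 defined for WEIGHT have a close analogue outside SPSS. Assuming pandas, the codes can be declared when a file is read so that they become missing values and never enter any statistic; the CSV fragment below is invented for illustration:

```python
import io
import pandas as pd

# Hypothetical CSV fragment; 999 plays the same role as the SPSS
# user-missing code defined for WEIGHT in Figure 4.4.
csv = io.StringIO("idnum,WEIGHT\n1,140\n2,999\n3,180\n")

# Per-column missing codes: only WEIGHT treats 999 as missing.
df = pd.read_csv(csv, na_values={"WEIGHT": [999]})

print(df["WEIGHT"].mean())  # 999 excluded: (140 + 180) / 2 = 160.0
```

Declaring the code per column matters for the same reason discussed above: a value such as 999 may be a valid score on some other variable.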
Finally, the column in the Variable View worksheet that is headed Measure indicates the level of measurement for each variable. SPSS designates each numerical variable as nominal, ordinal, or scale (scale is equivalent to interval/ratio level of measurement, as described in Chapter 1 of this textbook). In this sample dataset, idnum (an arbitrary and unique identification number for each participant) and GENDER were identified as categorical or nominal variables. Smoking status (SPSS variable name SMOKE) was coded on an ordinal scale from 1 to 4, with 1 = nonsmoker, 2 = light smoker, 3 = moderate smoker, and 4 = heavy smoker. The other variables (heart rate, blood pressure, body weight, and age) are quantitative and interval/ratio, so they were designated as “scale” in level of measurement.

Figure 4.4 Missing Values for Weight

4.4 Identification of Errors and Inconsistencies

The SPSS data file should be proofread and compared with original data sources (if these are accessible) to correct errors in data coding or data entry. For example, if self-report data are obtained using computer-scorable answer sheets, the correspondence between scores on these answer sheets and scores in the SPSS data file should be verified. This may require proofreading data line by line and comparing the scores with the data on the original answer sheets. It is helpful to have a unique code number associated with each case so that each line in the data file can be matched with the corresponding original data sheet. Even when line-by-line proofreading has been done, it is useful to run simple exploratory analyses as an additional form of data screening. Rosenthal (cited in D. B. Wright, 2003) called this process of exploration “making friends with your data.” Subsequent sections of this chapter show how examination of the frequency distribution tables and graphs provides an overview of the characteristics of people in this sample—for example, how many males and females were included in the study, how many nonsmokers versus heavy smokers were included, and the range of scores on physiological responses such as heart rate.

Examining response consistency across questions or measurements is also useful. If a person chooses the response “I have never smoked” to one question and then reports smoking 10 cigarettes on an average day in another question, these responses are inconsistent. If a participant’s responses include numerous inconsistencies, the researcher may want to consider removing that participant’s data from the data file.

On the basis of knowledge of the variables and the range of possible response alternatives, a researcher can identify some responses as “impossible” or “unlikely.” For example, if participants are provided a choice of the following responses to a question about smoking status: 1 = nonsmoker, 2 = light smoker, 3 = moderate smoker, and 4 = heavy smoker, and a participant marks response number “6” for this question, the value of 6 does not correspond to any of the response alternatives provided for the question. When impossible, unlikely, or inconsistent responses are detected, there are several possible remedies. First, it may be possible to go back to original data sheets or experiment logbooks to locate the correct information and use it to replace an incorrect score value. If that is not possible, the invalid score value can be deleted and replaced with a blank cell entry or a numerical code that represents a missing value. It is also possible to select out (i.e., temporarily or permanently remove) cases that have impossible or unlikely scores.

4.5 Missing Values

Journal editors and funding agencies now expect more systematic evaluation of missing values than was customary in the past. SPSS has a Missing Values add-on procedure to assess the amount and pattern of missing data and replace missing scores with imputed values. Research proposals should include a plan for identification and handling of missing data; research reports should document the amount and pattern of missing data and imputation procedures for replacement.
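Consistency checks like the smoking example in Section 4.4 are easy to script. The sketch below assumes pandas and invents two hypothetical columns, SMOKE and CIGSPERDAY; neither is part of the bpstudy.sav file:

```python
import pandas as pd

# Hypothetical columns: SMOKE (1 = nonsmoker) and CIGSPERDAY
# (self-reported cigarettes smoked on an average day).
df = pd.DataFrame({
    "idnum":      [1, 2, 3],
    "SMOKE":      [1, 1, 3],
    "CIGSPERDAY": [0, 10, 15],
})

# A "nonsmoker" reporting cigarettes per day is an inconsistent response.
inconsistent = df[(df["SMOKE"] == 1) & (df["CIGSPERDAY"] > 0)]
print(inconsistent["idnum"].tolist())  # [2]
```

Flagged cases can then be checked against the original data sheets, recoded as missing, or removed, as described above.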
Within the SPSS program, an empty or blank cell in the data worksheet is interpreted as a System Missing value. Alternatively, as described earlier, the Missing Value column in the Variable View worksheet in SPSS can be used to identify some specific numerical codes as missing values and to use different numerical codes to correspond to different types of missing data. For example, for a variable such as verbal Scholastic Aptitude Test (SAT) score, codes such as 888 = student did not take the SAT and 999 = participant refused to answer could be used to indicate different reasons for the absence of a valid score.

Ideally, a dataset should have few missing values. A systematic pattern of missing observations suggests possible bias in nonresponse. For example, males might be less willing than females to answer questions about negative emotions such as depression; students with very low SAT scores may refuse to provide information about SAT performance more often than students with high SAT scores. To assess whether missing responses on depression are more common among some groups of respondents or are associated with scores on some other variable, the researcher can set up a variable that is coded 1 (respondent answered a question about depression) versus 0 (respondent did not answer the question about depression). Analyses can then be performed to see whether this variable, which represents missing versus nonmissing data on one variable, is associated with scores on any other variable. If the researcher finds, for example, that a higher proportion of men than women refused to answer a question about depression, it signals possible problems with generalizability of results; for example, conclusions about depression in men can be generalized only to the kinds of men who are willing to answer such questions. It is useful to assess whether specific individual participants have large numbers of missing scores; if so, data for these participants could simply be deleted.
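The answered-versus-not-answered indicator variable described above can be built in one step. A minimal pandas sketch, with invented gender codes and depression scores (NaN/None marks a skipped item):

```python
import pandas as pd

# Hypothetical survey data: None means the depression item was skipped.
df = pd.DataFrame({
    "GENDER":     [1, 1, 1, 2, 2, 2],
    "DEPRESSION": [None, None, 4.0, 3.0, 5.0, None],
})

# 1 = respondent answered the item, 0 = respondent did not answer.
df["ANSWERED"] = df["DEPRESSION"].notna().astype(int)

# Crosstab of response status by gender to look for systematic nonresponse.
print(pd.crosstab(df["GENDER"], df["ANSWERED"]))
```

A chi-square test on such a table (or a correlation between the indicator and another variable) then assesses whether nonresponse is systematic.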
Similarly, it may be useful to see whether certain variables have very high nonresponse rates; it may be necessary to drop these variables from further analysis.

When analyses involving several variables (such as computations of all possible correlations among a set of variables) are performed in SPSS, it is possible to request either listwise or pairwise deletion. For example, suppose that the researcher wants to use the bivariate correlation procedure in SPSS to run all possible correlations among variables named V1, V2, V3, and V4. If listwise deletion is chosen, the data for a participant are completely ignored when all these correlations are calculated if the participant has a missing score on any one of the variables included in the list. In pairwise deletion, each correlation is computed using data from all the participants who had nonmissing values on that particular pair of variables.

For example, suppose that there is one missing score on the variable V1. If listwise deletion is chosen, then the data for the participant who had a missing score on V1 are not used to compute any of the correlations (between V1 and V2, V2 and V3, V2 and V4, etc.). On the other hand, if pairwise deletion is chosen, the data for the participant who is missing a score on V1 cannot be used to calculate any of the correlations that involve V1 (e.g., V1 with V2, V1 with V3, V1 with V4), but the data from this participant will be used when correlations that don’t require information about V1 are calculated (correlations between V2 and V3, V2 and V4, and V3 and V4).

When using listwise deletion, the same number of cases and subset of participants are used to calculate all the correlations for all pairs of variables. When using pairwise deletion, depending on the pattern of missing values, each correlation may be based on a different N and a different subset of participants than those used for other correlations. The default for handling missing data in most SPSS procedures is listwise deletion.
The disadvantage of listwise deletion is that it can result in a rather small N of participants; the advantage is that all correlations are calculated using the same set of participants. Pairwise deletion can be selected by the user, and it preserves the maximum possible N for the computation of each correlation; however, both the number of participants and the composition of the sample may vary across correlations, and this can introduce inconsistencies in the values of the correlations (as described in more detail by Tabachnick & Fidell, 2007).

When a research report includes a series of analyses and each analysis includes a different set of variables, the N of scores that are included may vary across analyses (because different people have missing scores on each variable). This can raise a question in readers’ minds: Why do the Ns change across pages of the research report? When there are large numbers of missing scores, quite different subsets of data may be used in each analysis, and this may make the results not comparable across analyses. To avoid these potential problems, it may be preferable to select out ahead of time all the cases that have missing values on any of the variables that will be used, so that the same subset of participants (and the same N of scores) is used in all the analyses in a paper. The default in SPSS is that cases with system missing values or scores that are specifically identified as missing values are excluded from computations, but this can result in a substantial reduction in the sample size for some analyses.

Another way to deal with missing data is by substitution of a reasonable estimated score value to replace each missing response. Missing value replacement can be done in many different ways; for example, the mean score on a variable can be substituted for all missing values on that variable, or estimated values can be calculated separately for each individual participant using regression methods to predict that person’s missing score from her or his scores on other, related variables. This is often called imputation of missing data. Procedures for missing value replacement can be rather complex (Schafer, 1997, 1999; Schafer & Olsen, 1998).
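The listwise/pairwise distinction is easy to demonstrate outside SPSS. In pandas, DataFrame.corr() uses pairwise deletion by default, while dropping incomplete cases first reproduces listwise deletion; the scores below are invented:

```python
import pandas as pd

# Hypothetical scores with one missing value on V1.
df = pd.DataFrame({
    "V1": [1.0, 2.0, None, 4.0, 5.0],
    "V2": [2.0, 1.0, 4.0,  5.0, 6.0],
    "V3": [1.0, 3.0, 2.0,  5.0, 4.0],
})

# Pairwise deletion: each correlation uses every case complete on that pair.
pairwise = df.corr()

# Listwise deletion: drop any case with a missing score before correlating.
listwise = df.dropna().corr()

# The V2-V3 correlation differs: pairwise uses 5 cases, listwise only 4.
print(pairwise.loc["V2", "V3"], listwise.loc["V2", "V3"])
```

Note how a single missing score on V1 changes a correlation that does not involve V1 at all under listwise deletion, which is exactly why the two methods can yield inconsistent correlation matrices.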
Tabachnick and Fidell (2007) summarized their discussion of missing value replacement by saying that the seriousness of the problem of missing values depends on “the pattern of missing data, how much is missing, and why it is missing” (p. 62). They also noted that the decision about how to handle missing data (e.g., deletion of cases or variables, or estimation of scores to replace missing values) is “a choice among several bad alternatives” (p. 63). If some method of imputation or estimation is employed to replace missing values, it is desirable to repeat the analysis with the missing values omitted. Results are more believable, of course, if they are essentially the same with and without the replacement scores.

4.6 Empirical Example of Data Screening for Individual Variables

In this textbook, variables are treated as either categorical or quantitative (see Chapter 1 for a review of this distinction). Different types of graphs and descriptive statistics are appropriate for use with categorical versus quantitative variables, and for that reason, data screening is discussed separately for categorical and quantitative variables.

4.6.1 Frequency Distribution Tables

For both categorical (nominal) and quantitative (scale) variables, a table of frequencies can be obtained to assess the number of persons or cases who had each different score value. These frequencies can be converted to proportions or percentages. Examination of a frequency table quickly provides answers to the following questions about each categorical variable: How many groups does this variable represent? What is the number of persons in each group? Are there any groups with ns that are too small for the group to be used in analyses that compare groups (e.g., analysis of variance, ANOVA)? If a group with a very small number (e.g., 10 or fewer) of cases is detected, the researcher needs to decide what to do with the cases in that group. The group could be dropped from all analyses, or, if it makes sense to do so, the small-n group could be combined with one or more of the other groups (by recoding the scores that represent group membership on the categorical variable).
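Combining a small-n group by recoding can be sketched as follows; the SMOKE frequencies below are invented, and pandas is assumed:

```python
import pandas as pd

# Hypothetical SMOKE frequencies: heavy smokers (code 4) form a small group.
df = pd.DataFrame({"SMOKE": [1] * 40 + [2] * 12 + [3] * 9 + [4] * 4})

print(df["SMOKE"].value_counts().sort_index())  # group sizes before recoding

# Combine the small groups by recoding: 1 = nonsmoker, 2 = any smoker.
df["SMOKE2"] = df["SMOKE"].map({1: 1, 2: 2, 3: 2, 4: 2})
print(df["SMOKE2"].value_counts().sort_index())
```

The recoded variable keeps every case but yields group sizes large enough for a between-groups comparison.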
For both categorical and quantitative variables, a frequency distribution also makes it possible to see if there are any “impossible” score values. For instance, if the categorical variable GENDER on a survey has just two response options, 1 = male and 2 = female, then scores of “3” and higher are not valid or interpretable responses. Impossible score values should be detected during proofreading, but examination of frequency tables provides another opportunity to see if there are any impossible score values on categorical variables.

Figure 4.5 SPSS Menu Selections: <Analyze> → <Descriptive Statistics> → <Frequencies>

Figure 4.6 SPSS Dialog Window for the Frequencies Procedure

SPSS was used to obtain a frequency distribution table for the variables GENDER and AGE. Starting from the data worksheet view (as shown in Figure 4.1), the following menu selections (as shown in Figure 4.5) were made: <Analyze> → <Descriptive Statistics> → <Frequencies>. The SPSS dialog window for the Frequencies procedure appears in Figure 4.6. To specify which variables are included in the request for frequency tables, the user points to the names of the two variables (GENDER and AGE) and clicks the right-pointing arrow to move these variable names into the right-hand window. Output from this procedure appears in Figure 4.7.

In the first frequency table in Figure 4.7, there is one “impossible” response for GENDER. The response alternatives provided for the question about gender were 1 = male and 2 = female, but a response of 3 appears in the summary table; this does not correspond to a valid response option. In the second frequency table in Figure 4.7, there is an extreme score (88 years) for AGE. This is a possible, but unusual, age for a college student. As will be discussed later in this chapter, scores that are extreme or unusual are often identified as outliers and sometimes removed from the data prior to doing other analyses.
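The same detection step can be reproduced with pandas, where value_counts() plays the role of the SPSS Frequencies table. The scores below are invented but contain the same two problems found in Figure 4.7 (a GENDER code of 3 and an age of 88):

```python
import pandas as pd

# Hypothetical GENDER and AGE scores with one impossible code and one
# extreme value, mirroring the problems described for Figure 4.7.
df = pd.DataFrame({
    "GENDER": [1, 2, 2, 1, 3, 2],
    "AGE":    [19, 21, 20, 22, 18, 88],
})

# value_counts() is the pandas analogue of an SPSS frequency table.
print(df["GENDER"].value_counts().sort_index())  # the single "3" stands out
print(df["AGE"].value_counts().sort_index())     # 88 sits alone at the tail
```

As in the SPSS output, a stray category with a frequency of 1 and a score far beyond the rest of the distribution are both immediately visible.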
  • 18. 4.6.2 Removal of Impossible or Extreme Scores In SPSS, the Select Cases command can be used to remove cases from a data file prior to other analyses. To select out the participant with a score of “3” for GENDER and also the participant with an age of 88, the following SPSS menu selections (see Figure 4.8) would be used: <Data> → <Select Cases>. The initial SPSS dialog window for Select Cases appears in Figure 4.9. A logical “If” conditional statement can be used to exclude specific cases. For example, to exclude the data for the person who reported a value of “3” for GENDER, click the radio button for “If condition is satisfied” in the first Select Cases dialog window. Then, in the “Select Cases: If” window, type in the logical condition “GENDER ~= 3.” The symbol “~=” represents the logical comparison “not equal to”; thus, this logical “If” statement tells SPSS to include the data for all participants whose scores for GENDER are not equal to “3.” The entire line of data for the person who reported “3” as a response to GENDER is (temporarily) filtered out or set aside as a result of this logical condition. It is possible to specify more than one logical condition. For example, to select cases that have valid scores on GENDER and that do not have extremely high scores on AGE, we could set up the logical condition “GENDER ~= 3 and AGE < 70,” as shown in Figures 4.10 and 4.11. SPSS evaluates this logical statement for each participant. Any participant with a score of “3” on GENDER and any participant with a score greater than or equal to 70 on AGE is excluded or selected out by this Select If statement. When a case has been selected out using the Select Cases command, a crosshatch mark appears over the case number for that case (on the far left-hand side of the SPSS data worksheet). Cases that are selected out can be temporarily filtered or permanently deleted. 
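The Select Cases logic above maps directly onto ordinary filtering in code. This minimal sketch (with hypothetical cases, not the chapter's dataset) mirrors the SPSS condition “GENDER ~= 3 and AGE < 70”:

```python
# Hypothetical cases as (gender, age) pairs; the third case has an
# invalid GENDER code and the fourth an extreme AGE, as in the example.
cases = [(1, 19), (2, 21), (3, 20), (2, 88), (1, 22)]

# Equivalent of the SPSS logical condition "GENDER ~= 3 and AGE < 70":
# keep only cases with a valid gender code and a non-extreme age.
screened = [(g, a) for (g, a) in cases if g != 3 and a < 70]

print(f"{len(cases)} cases -> {len(screened)} after screening")
```

As with SPSS filtering, the original list is untouched, so the excluded cases can always be restored by going back to the unscreened data.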
In Figure 4.12, the SPSS data worksheet is shown as it appears after the execution of the Data Select If
  • 19. commands just described. Case number 11 (a person who had a score of 3 on GENDER) and case number 15 (a person who had a score of 88 on AGE) are now shown with a crosshatch mark through the case number in the left-hand column of the data worksheet. This crosshatch indicates that unless the Select If condition is explicitly removed, the data for these 2 participants will be excluded from all future analyses. Note that the original N of 65 cases has been reduced to an N of 63 by this Data Select If statement. If the researcher wants to restore temporarily filtered cases to the sample, it can be done by selecting the radio button for All Cases in the Select Cases dialog window. Figure 4.7 Output From the SPSS Frequencies Procedure (Prior to Removal of “Impossible” Score Values) 4.6.3 Bar Chart for a Categorical Variable For categorical or nominal variables, a bar chart can be used to represent the distribution of scores graphically. A bar chart for GENDER was created by making the following SPSS menu selections (see Figure 4.13): <Graphs> → <Legacy Dialogs> → <Bar [Chart]>. Figure 4.8 SPSS Menu Selections for <Data> → <Select Cases> Procedure The first dialog window for the bar chart procedure appears in Figure 4.14. In this example, the upper left box was clicked in the Figure 4.14 dialog window to select the “Simple” type of bar chart; the radio button was selected for “Summaries for groups of cases”; then, the Define button was clicked. This opened the second SPSS dialog window, which appears in Figure 4.15. Figure 4.9 SPSS Dialog Windows for the Select Cases Command Figure 4.10 Logical Criteria for Select Cases
  • 20. NOTE: Include only persons who have a score for GENDER that is not equal to 3 and who have a score for AGE that is less than 70. Figure 4.11 Appearance of the Select Cases Dialog Window After Specification of the Logical “If” Selection Rule To specify the form of the bar chart, use the cursor to highlight the name of the variable that you want to graph, and click on the arrow that points to the right to move this variable name into the window under Category Axis. Leave the radio button selection as the default choice, “Bars represent N of cases.” This set of menu selections will yield a bar graph with one bar for each group; for GENDER, this is a bar graph with one bar for males and one for females. The height of each bar represents the number of cases in each group. The output from this procedure appears in Figure 4.16. Note that because the invalid score of 3 has been selected out by the prior Select If statement, this score value of 3 is not included in the bar graph in Figure 4.16. A visual examination of a set of bar graphs, one for each categorical variable, is a useful way to detect impossible values. The frequency table and bar graphs also provide a quick indication of group size; in this dataset, there are N = 28 males (score of 1 on GENDER) and N = 36 females (score of 2 on GENDER). The bar chart in Figure 4.16, like the frequency table in Figure 4.7, indicates that the male group had fewer participants than the female group. Figure 4.12 Appearance of SPSS Data Worksheet After the Select Cases Procedure in Figure 4.11 4.6.4 Histogram for a Quantitative Variable For a quantitative variable, a histogram is a useful way to assess the shape of the distribution of scores. As described in Chapter 3, many analyses assume that scores on quantitative variables are at least approximately normally distributed. Visual examination of the histogram is a way to evaluate whether the
  • 21. distribution shape is reasonably close to normal or to identify the shape of a distribution if it is quite different from normal. In addition, summary statistics can be obtained to provide information about central tendency and dispersion of scores. The mean (M), median, or mode can be used to describe central tendency; the range, standard deviation (s or SD), or variance (s2) can be used to describe variability or dispersion of scores. A comparison of means, variances, and other descriptive statistics provides the information that a researcher needs to characterize his or her sample and to judge whether the sample is similar enough to some broader population of interest so that results might possibly be generalizable to that broader population (through the principle of “proximal similarity,” discussed in Chapter 1). If a researcher conducts a political poll and finds that the range of ages of persons in the sample is from age 18 to 22, for instance, it would not be reasonable to generalize any findings from that sample to populations of persons older than age 22. Figure 4.13 SPSS Menu Selections for the <Graphs> → <Legacy Dialogs> → <Bar [Chart]> Procedure Figure 4.14 SPSS Bar Charts Dialog Window Figure 4.15 SPSS Define Simple Bar Chart Dialog Window Figure 4.16 Bar Chart: Frequencies for Each Gender Category When the distribution shape of a quantitative variable is nonnormal, it is preferable to assess central tendency and dispersion of scores using graphic methods that are based on percentiles (such as a boxplot, also called a box and whiskers plot). Issues that can be assessed by looking at frequency tables, histograms, or box and whiskers plots for quantitative scores include the following: 1. Are there impossible or extreme scores?
  • 22. 2. Is the distribution shape normal or nonnormal? 3. Are there ceiling or floor effects? Consider a set of test scores. If a test is too easy and most students obtain scores of 90% and higher, the distribution of scores shows a “ceiling effect”; if the test is much too difficult, most students will obtain scores of 10% and below, and this would be called a “floor effect.” Either of these would indicate a problem with the measurement, in particular, a lack of sensitivity to individual differences at the upper end of the distribution (when there is a ceiling effect) or the lower end of the distribution (when there is a floor effect). 4. Is there a restricted range of scores? For many measures, researchers know a priori what the minimum and maximum possible scores are, or they have a rough idea of the range of scores. For example, suppose that Verbal SAT scores can range from 250 to 800. If the sample includes scores that range from 550 to 580, the range of Verbal SAT scores in the sample is extremely restricted compared with the range of possible scores. Generally, researchers want a fairly wide range of scores on variables that they want to correlate with other variables. If a researcher wants to “hold a variable constant”—for example, to limit the impact of age on the results of a study by including only persons between 18 and 21 years of age—then a restricted range would actually be preferred. The procedures for obtaining a frequency table for a quantitative variable are the same as those discussed in the previous section on data screening for categorical variables. Distribution shape for a quantitative variable can be assessed by examining a histogram obtained by making these SPSS menu selections (see Figure 4.17): <Graphs> → <Legacy Dialogs> → <Histogram>. These menu selections open the Histogram dialog window displayed in Figure 4.18. In this example, the variable selected for the histogram was HR1 (baseline or Time 1 heart rate). 
Placing a checkmark in the box next to Display Normal Curve requests a superimposed smooth normal distribution function on
  • 23. the histogram plot. To obtain the histogram, after making these selections, click the OK button. The histogram output appears in Figure 4.19. The mean, standard deviation, and N for HR1 appear in the legend below the graph. Figure 4.17 SPSS Menu Selections: <Graphs> → <Legacy Dialogs> → <Histogram> Figure 4.18 SPSS Dialog Window: Histogram Procedure Figure 4.19 Histogram of Heart Rates With Superimposed Normal Curve An assumption common to all the parametric analyses covered in this book is that scores on quantitative variables should be (at least approximately) normally distributed. In practice, the normality of distribution shape is usually assessed visually; a histogram of scores is examined to see whether it is approximately “bell shaped” and symmetric. Visual examination of the histogram in Figure 4.19 suggests that the distribution shape is not exactly normal; it is slightly asymmetrical. However, this distribution of sample scores is similar enough to a normal distribution shape to allow the use of parametric statistics such as means and correlations. This distribution shows a reasonably wide range of heart rates, no evidence of ceiling or floor effects, and no extreme outliers. There are many ways in which the shape of a distribution can differ from an ideal normal distribution shape. For example, a distribution is described as skewed if it is asymmetric, with a longer tail on one side (see Figure 4.20 for an example of a distribution with a longer tail on the right). Positively skewed distributions similar to the one that appears in Figure 4.20 are quite common; many variables, such as reaction time, have a minimum possible value of 0 (which means that the lower tail of the distribution ends at 0) but do not have a fixed limit at the upper end of the distribution (and therefore the upper tail can be quite long). (Distributions with many zeros pose special
problems; refer back to comments on Figure 1.4. Also see discussions by Atkins & Gallop [2007] and G. King & Zeng [2001]; options include Poisson regression [Cohen, Cohen, West, & Aiken, 2003, chap. 13] and negative binomial regression [Hilbe, 2011].)

Figure 4.20 Histogram of Positively Skewed Distribution

NOTE: Skewness index for this variable is +2.00.

A numerical index of skewness for a sample set of X scores denoted by (X1, X2, …, XN) can be calculated using the following formula:

skewness = Σ (Xi − MX)³ / (N × s³),   (4.1)

where MX is the sample mean of the X scores, s is the sample standard deviation of the X scores, and N is the number of scores in the sample. For a perfectly normal and symmetrical distribution, skewness has a value of 0. If the skewness statistic is positive, it indicates that there is a longer tail on the right-hand/upper end of the distribution (as in Figure 4.20); if the skewness statistic is negative, it indicates that there is a longer tail on the lower end of the distribution (as in Figure 4.21).

Figure 4.21 Histogram of a Negatively Skewed Distribution

NOTE: Skewness index for this variable is −2.00.

A distribution is described as platykurtic if it is flatter than an ideal normal distribution and leptokurtic if it has a sharper/steeper peak in the center than an ideal normal distribution (see Figure 4.22). A numerical index of kurtosis can be calculated using the following formula:

kurtosis = Σ (Xi − MX)⁴ / (N × s⁴),   (4.2)

where MX is the sample mean of the X scores, s is the sample standard deviation of the X scores, and N is the number of scores in the sample. Using Equation 4.2, the kurtosis for a normal distribution
corresponds to a value of 3; most computer programs actually report “excess kurtosis”—that is, the degree to which the kurtosis of the scores in a sample differs from the kurtosis expected in a normal distribution. This excess kurtosis is given by the following formula:

excess kurtosis = [Σ (Xi − MX)⁴ / (N × s⁴)] − 3.   (4.3)

Figure 4.22 Leptokurtic and Platykurtic Distributions

SOURCE: Adapted from http://www.murraystate.edu/polcrjlst/p660kurtosis.htm

A positive score for excess kurtosis indicates that the distribution of scores in the sample is more sharply peaked than in a normal distribution (this is shown as leptokurtic in Figure 4.22). A negative score for kurtosis indicates that the distribution of scores in a sample is flatter than in a normal distribution (this corresponds to a platykurtic distribution shape in Figure 4.22). The value that SPSS reports as kurtosis corresponds to excess kurtosis (as in Equation 4.3). A normal distribution is defined as having skewness and (excess) kurtosis of 0. A numerical index of skewness and kurtosis can be obtained for a sample of data to assess the degree of departure from a normal distribution shape.

Additional summary statistics for a quantitative variable such as HR1 can be obtained from the SPSS Descriptives procedure by making the following menu selections: <Analyze> → <Descriptive Statistics> → <Descriptives>. The menu selections shown in Figure 4.23 open the Descriptive Statistics dialog box shown in Figure 4.24. The Options button opens up a dialog box that has a menu with check boxes that offer a selection of descriptive statistics, as shown in Figure 4.25. In addition to the default selections, the boxes for skewness and kurtosis were also checked. The output from this procedure appears in Figure 4.26. The upper panel in Figure 4.26 shows the descriptive statistics for scores on HR1 that appeared in Figure 4.19; skewness and kurtosis for the sample
of scores on the variable HR1 were both fairly close to 0. The lower panel shows the descriptive statistics for the artificially generated data that appeared in Figures 4.20 (a set of positively skewed scores) and 4.21 (a set of negatively skewed scores).

Figure 4.23 SPSS Menu Selections: <Analyze> → <Descriptive Statistics> → <Descriptives>

Figure 4.24 Dialog Window for SPSS Descriptive Statistics Procedure

Figure 4.25 Options for the Descriptive Statistics Procedure

Figure 4.26 Output From the SPSS Descriptive Statistics Procedure for Three Types of Distribution Shape

NOTE: Scores for HR (from Figure 4.19) are not skewed; scores for posskew (from Figure 4.20) are positively skewed; and scores for negskew (from Figure 4.21) are negatively skewed.

It is possible to set up a statistical significance test (in the form of a z ratio) for skewness because SPSS also reports the standard error (SE) for this statistic:

z = skewness / SE(skewness).   (4.4)

When the N of cases is reasonably large, the resulting z ratio can be evaluated using the standard normal distribution; that is, skewness is statistically significant at the α = .05 level (two-tailed) if the z ratio given in Equation 4.4 is greater than 1.96 in absolute value. A z test can also be set up to test the significance of (excess) kurtosis:

z = kurtosis / SE(kurtosis).   (4.5)

The tests in Equations 4.4 and 4.5 provide a way to evaluate whether an empirical frequency distribution differs significantly from a normal distribution in skewness or kurtosis. There are formal mathematical tests to evaluate the degree to which an empirical distribution differs from some ideal or theoretical
distribution shape (such as the normal curve). If a researcher needs to test whether the overall shape of an empirical frequency distribution differs significantly from normal, it can be done by using the Kolmogorov-Smirnov or Shapiro-Wilk test (both are available in SPSS). In most situations, visual examination of distribution shape is deemed sufficient. In general, empirical distribution shapes are considered problematic only when they differ dramatically from normal. Some earlier examples of drastically nonnormal distribution shapes appeared in Figures 1.2 (a roughly uniform distribution) and 1.3 (an approximately exponential or J-shaped distribution). Multimodal distributions or very seriously skewed distributions (as in Figure 4.20) may also be judged problematic. A distribution that resembles the one in Figure 4.19 is often judged close enough to normal shape.

4.7 Identification and Handling of Outliers

An outlier is an extreme score on either the low or the high end of a frequency distribution of a quantitative variable. Many different decision rules can be used to decide whether a particular score is extreme enough to be considered an outlier. When scores are approximately normally distributed, about 99.7% of the scores should fall within +3 and −3 standard deviations of the sample mean. Thus, for normally distributed scores, z scores can be used to decide which scores to treat as outliers. For example, a researcher might decide to treat scores that correspond to values of z that are less than −3.30 or greater than +3.30 as outliers.

Another method for the detection of outliers uses a graph called a boxplot (or a box and whiskers plot). This is a nonparametric exploratory procedure that uses medians and quartiles as information about central tendency and dispersion of scores. The following example uses a boxplot of scores on WEIGHT separately for each gender group, as a means of identifying potential outliers on WEIGHT. 
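The moment-based skewness and excess-kurtosis indices, and the |z| > 3.30 outlier rule, are straightforward to compute by hand. The sketch below is a simplified illustration on made-up scores; note that SPSS uses slightly different small-sample adjustments, so its reported values will not match these moment formulas exactly:

```python
import statistics

def skewness(xs):
    """Moment-based skewness: sum((X - M)^3) / (N * s^3)."""
    m = statistics.mean(xs)
    s = statistics.pstdev(xs)   # SD with N in the denominator
    n = len(xs)
    return sum((x - m) ** 3 for x in xs) / (n * s ** 3)

def excess_kurtosis(xs):
    """Moment-based kurtosis minus 3 (0 for a normal distribution)."""
    m = statistics.mean(xs)
    s = statistics.pstdev(xs)
    n = len(xs)
    return sum((x - m) ** 4 for x in xs) / (n * s ** 4) - 3

def z_outliers(xs, cutoff=3.3):
    """Flag scores whose |z| exceeds the cutoff (3.30 in the text)."""
    m = statistics.mean(xs)
    s = statistics.stdev(xs)    # sample SD (n - 1 denominator)
    return [x for x in xs if abs((x - m) / s) > cutoff]

# Hypothetical, roughly symmetric heart-rate-like scores
data = [70, 72, 68, 74, 71, 69, 73, 75, 70, 72] * 3
print(round(skewness(data), 2), round(excess_kurtosis(data), 2))
print(z_outliers(data + [200]))
```

One practical caveat: with a very small N, the sample SD is inflated by the outlier itself, so a single extreme score can never reach |z| > 3.3; the z-score rule is only trustworthy with reasonably large samples.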
To set up this boxplot for the distribution of weight within each gender group, the following SPSS menu selections were made: <Graphs> → <Legacy
  • 28. Dialogs> → <Box[plot]>. This opens up the first SPSS boxplot dialog box, shown in Figure 4.27. For this example, the box marked Simple was clicked, and the radio button for “Summaries for groups of cases” was selected to obtain a boxplot for just one variable (WEIGHT) separately for each of two groups (male and female). Clicking on the Define button opened up the second boxplot dialog window, as shown in Figure 4.28. The name of the quantitative dependent variable, WEIGHT, was placed in the top window (as the name of the variable); the categorical or “grouping” variable (GENDER) was placed in the window for the Category Axis. Clicking the OK button generated the boxplot shown in Figure 4.29, with values of WEIGHT shown on the Y axis and the categories male and female shown on the X axis. Rosenthal and Rosnow (1991) noted that there are numerous variations of the boxplot; the description here is specific to the boxplots generated by SPSS and may not correspond exactly to descriptions of boxplots given elsewhere. For each group, a shaded box corresponds to the middle 50% of the distribution of scores in that group. The line that bisects this box horizontally (not necessarily exactly in the middle) represents the 50th percentile (the median). The lower and upper edges of this shaded box correspond to the 25th and 75th percentiles of the weight distribution for the corresponding group (labeled on the X axis). The 25th and 75th percentiles of each distribution of scores, which correspond to the bottom and top edges of the shaded box, respectively, are called the hinges. The distance between the hinges (i.e., the difference between scores at the 75th and 25th percentiles) is called the H-spread. 
The vertical lines that extend above and below the 75th and 25th percentiles are called “whiskers,” and the horizontal lines at the ends of the whiskers mark the “adjacent values.” The adjacent values are the most extreme scores in the sample that lie between the hinge and the inner fence (not shown on the graph; the inner fence is usually a distance from the median that is 1.5 times the H-
  • 29. spread). Generally, any data points that lie beyond these adjacent values are considered outliers. In the boxplot, outliers that lie outside the adjacent values are graphed using small circles. Observations that are extreme outliers are shown as asterisks (*). Figure 4.27 SPSS Dialog Window for Boxplot Procedure Figure 4.28 Define Simple Boxplot: Distribution of Weight Separately by Gender Figure 4.29 Boxplot of WEIGHT for Each Gender Group Figure 4.29 indicates that the middle 50% of the distribution of body weights for males was between about 160 and 180 lb, and there was one outlier on WEIGHT (Participant 31 with a weight of 230 lb) in the male group. For females, the middle 50% of the distribution of weights was between about 115 and 135 lb, and there were two outliers on WEIGHT in the female group; Participant 50 was an outlier (with weight = 170), and Participant 49 was an extreme outlier (with weight = 190). The data record numbers that label the outliers in Figure 4.29 can be used to look up the exact score values for the outliers in the entire listing of data in the SPSS data worksheet or in Table 4.1. In this dataset, the value of idnum (a variable that provides a unique case number for each participant) was the same as the SPSS line number or record number for all 65 cases. If the researcher wants to exclude the 3 participants who were identified as outliers in the boxplot of weight scores for the two gender groups, it could be done by using the following Select If statement: idnum ~= 31 and idnum ~= 49 and idnum ~=50. Parametric statistics (such as the mean, variance, and Pearson correlation) are not particularly robust to outliers; that is, the value of M for a batch of sample data can be quite different when it is calculated with an outlier included than when an outlier is excluded. This raises a problem: Is it preferable to include outliers (recognizing that a single extreme score may
  • 30. have a disproportionate impact on the outcome of the analysis) or to omit outliers (understanding that the removal of scores may change the outcome of the analysis)? It is not possible to state a simple rule that can be uniformly applied to all research situations. Researchers have to make reasonable judgment calls about how to handle extreme scores or outliers. Researchers need to rely on both common sense and honesty in making these judgments. When the total N of participants in the dataset is relatively small, and when there are one or more extreme outliers, the outcomes for statistical analyses that examine the relation between a pair of variables can be quite different when outliers are included versus excluded from an analysis. The best way to find out whether the inclusion of an outlier would make a difference in the outcome of a statistical analysis is to run the analysis both including and excluding the outlier score(s). However, making decisions about how to handle outliers post hoc (after running the analyses of interest) gives rise to a temptation: Researchers may wish to make decisions about outliers based on the way the outliers influence the outcome of statistical analyses. For example, a researcher might find a significant positive correlation between variables X and Y when outliers are included, but the correlation may become nonsignificant when outliers are removed from the dataset. It would be dishonest to report a significant correlation without also explaining that the correlation becomes nonsignificant when outliers are removed from the data. Conversely, a researcher might also encounter a situation where there is no significant correlation between scores on the X and Y variables when outliers are included, but the correlation between X and Y becomes significant when the data are reanalyzed with outliers removed. 
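The boxplot fence rule described above can also be applied numerically. This sketch uses the conventional Tukey rule (fences 1.5 × H-spread beyond the hinges); exact quartile and fence conventions vary across programs, so the hypothetical weights below are for illustration only:

```python
import statistics

def boxplot_outliers(xs):
    """Flag scores beyond 1.5 x H-spread (IQR) outside the hinges."""
    xs = sorted(xs)
    q = statistics.quantiles(xs, n=4)   # [Q1, median, Q3]
    q1, q3 = q[0], q[2]
    h_spread = q3 - q1                  # distance between the hinges
    lower_fence = q1 - 1.5 * h_spread
    upper_fence = q3 + 1.5 * h_spread
    return [x for x in xs if x < lower_fence or x > upper_fence]

# Hypothetical male WEIGHT scores; middle 50% roughly 160-180 lb
weights = [155, 160, 162, 165, 168, 170, 172, 175, 178, 180, 185, 230]
print(boxplot_outliers(weights))
```

Running the screen with and without any flagged score, as the text recommends, then shows whether the decision about that score materially changes the results.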
An honest report of the analysis should explain that outlier scores were detected and removed as part of the data analysis process, and there should be a good rationale for removal of these outliers. The fact that dropping outliers yields the kind of correlation results that the researcher hopes for is
  • 31. not, by itself, a satisfactory justification for dropping outliers. It should be apparent that if researchers arbitrarily drop enough cases from their samples, they can prune their data to fit just about any desired outcome. (Recall the myth of King Procrustes, who cut off the limbs of his guests so that they would fit his bed; we must beware of doing the same thing to our data.) A less problematic way to handle outliers is to state a priori that the study will be limited to a specific population—that is, to specific ranges of scores on some of the variables. If the population of interest in the blood pressure study is healthy young adults whose blood pressure is within the normal range, this a priori specification of the population of interest would provide a justification for the decision to exclude data for participants with age older than 30 years and SBP above 140. Another reasonable approach is to use a standard rule for exclusion of extreme scores (e.g., a researcher might decide at an early stage in data screening to drop all values that correspond to z scores in excess of 3.3 in absolute value; this value of 3.3 is an arbitrary standard). Another method of handling extreme scores (trimming) involves dropping the top and bottom scores (or some percentage of scores, such as the top and bottom 1% of scores) from each group. Winsorizing is yet another method of reducing the impact of outliers: The most extreme score at each end of a distribution is recoded to have the same value as the next highest score. Another way to reduce the impact of outliers is to apply a nonlinear transformation (such as taking the base 10 logarithm [log] of the original X scores). This type of data transformation can bring outlier values at the high end of a distribution closer to the mean. 
Whatever the researcher decides to do with extreme scores (throw them out, Winsorize them, or modify the entire distribution by taking the log of scores), it is a good idea to conduct analyses with the outlier included and with the outlier
  • 32. excluded to see what effect (if any) the decision about outliers has on the outcome of the analysis. If the results are essentially identical no matter what is done to outliers, then either approach could be reported. If the results are substantially different when different things are done with outliers, the researcher needs to make a thoughtful decision about which version of the analysis provides a more accurate and honest description of the situation. In some situations, it may make sense to report both versions of the analysis (with outliers included and excluded) so that it is clear to the reader how the extreme individual score values influenced the results. None of these choices are ideal solutions; any of these procedures may be questioned by reviewers or editors. It is preferable to decide on simple exclusion rules for outliers before data are collected and to remove outliers during the preliminary screening stages rather than at later stages in the analysis. It may be preferable to have a consistent rule for exclusion (e.g., excluding all scores that show up as extreme outliers in boxplots) rather than to tell a different story to explain why each individual outlier received the specific treatment that it did. The final research report should explain what methods were used to detect outliers, identify the scores that were identified as outliers, and make it clear how the outliers were handled (whether extreme scores were removed or modified). 4.8 Screening Data for Bivariate Analyses There are three possible combinations of types of variables in bivariate analysis. Both variables may be categorical, both may be quantitative, or one may be categorical and the other quantitative. Separate bivariate data-screening methods are outlined for each of these situations. 
4.8.1 Bivariate Data Screening for Two Categorical Variables When both variables are categorical, it does not make sense to compute means (the numbers serve only as labels for group memberships); instead, it makes sense to look at the numbers of cases within each group. When two categorical variables are
considered jointly, a cross-tabulation or contingency table summarizes the number of participants in the groups for all possible combinations of scores. For example, consider GENDER (coded 1 = male and 2 = female) and smoking status (coded 1 = nonsmoker, 2 = light smoker, 3 = moderate smoker, and 4 = heavy smoker). A table of cell frequencies for these two categorical variables can be obtained using the SPSS Crosstabs procedure by making the following menu selections: <Analyze> → <Descriptive Statistics> → <Crosstabs>.

These menu selections open up the Crosstabs dialog window, which appears in Figure 4.30. The names of the row variable (in this example, GENDER) and the column variable (in this example, SMOKE) are entered into the appropriate boxes. Clicking on the button labeled Cells opens up an additional dialog window, shown in Figure 4.31, where the user specifies the information to be presented in each cell of the contingency table. In this example, both observed (O) and expected (E) frequency counts are shown in each cell (see Chapter 8 in this textbook to see how expected cell frequencies are computed from the total number in each row and column of a contingency table). Row percentages were also requested.

The observed cell frequencies in Figure 4.32 show that most of the males and the females were nonsmokers (SMOKE = 1). In fact, there were very few light smokers and heavy smokers (and no moderate smokers). As a data-screening result, this has two implications: If we wanted to do an analysis (such as a chi-square test of association, as described in Chapter 8) to assess how gender is related to smoking status, the data do not satisfy an assumption about the minimum expected cell frequencies required for the chi-square test of association. 
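The expected frequency for each cell is computed from the marginal totals as (row total × column total) / N, which is how the screening problem of small expected frequencies arises. A minimal sketch with hypothetical gender-by-smoking data (not the chapter's dataset):

```python
from collections import Counter

# Hypothetical (gender, smoking) pairs: 1/2 = male/female;
# smoking 1 = nonsmoker, 2 = light, 4 = heavy (no moderate smokers)
pairs = ([(1, 1)] * 20 + [(2, 1)] * 30 + [(1, 2)] * 4
         + [(2, 2)] * 3 + [(1, 4)] * 1 + [(2, 4)] * 2)

observed = Counter(pairs)
row_totals = Counter(g for g, _ in pairs)
col_totals = Counter(s for _, s in pairs)
n = len(pairs)

# Expected frequency for each cell: (row total x column total) / N
for g in sorted(row_totals):
    for s in sorted(col_totals):
        expected = row_totals[g] * col_totals[s] / n
        print(f"gender={g} smoke={s}: O={observed[(g, s)]}, E={expected:.1f}")
```

With only a handful of smokers in the sample, several cells inevitably have expected frequencies well below 5, which is exactly the violation the chi-square screening check is meant to catch.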
(For a 2 × 2 table, none of the expected cell frequencies should be less than 5; for larger tables, various sources recommend different standards for minimum expected cell frequencies, but a minimum expected frequency of 5 is recommended here.) In addition, if we wanted to see how gender and smoking status together predict some third variable, such as heart rate, the
  • 34. numbers of participants in most of the groups (such as heavy smoker/females with only N = 2 cases) are simply too small. What would we hope to see in preliminary screening for categorical variables? The marginal frequencies (e.g., number of males, number of females; number of nonsmokers, light smokers, and heavy smokers) should all be reasonably large. That is clearly not the case in this example: There were so few heavy smokers that we cannot judge whether heavy smoking is associated with gender. Figure 4.30 SPSS Crosstabs Dialog Window Figure 4.31 SPSS Crosstabs: Information to Display in Cells Figure 4.32 Cross-Tabulation of Gender by Smoking Status NOTE: Expected cell frequencies less than 10 in three cells. The 2 × 3 contingency table in Figure 4.32 has four cells with expected cell frequencies less than 5. There are two ways to remedy this problem. One possible solution is to remove groups that have small marginal total Ns. For example, only 3 people reported that they were “heavy smokers.” If this group of 3 people were excluded from the analysis, the two cells with the lowest expected cell frequencies would be eliminated from the table. Another possible remedy is to combine groups (but only if this makes sense). In this example, the SPSS recode command can be used to recode scores on the variable SMOKE so that there are just two values: 1 = nonsmokers and 2 = light or heavy smokers. The SPSS menu selections <Compute> → <Recode> → <Into Different Variable> appear in Figure 4.33; these menu selections open up the Recode into Different Variables dialog box, as shown in Figure 4.34. In the Recode into Different Variables dialog window, the existing variable SMOKE is identified as the numeric variable by moving its name into the window headed Numeric Variable → Output Variable. The name for the new variable (in this
example, SMOKE2) is typed into the right-hand window under the heading Output Variable; when the button marked Change is clicked, SPSS identifies SMOKE2 as the (new) variable that will contain the recoded values based on scores for the existing variable SMOKE. Clicking on the button marked Old and New Values opens the next SPSS dialog window, which appears in Figure 4.35. The Old and New Values dialog window in Figure 4.35 can be used to enter a series of pairs of scores that show how old scores (on the existing variable SMOKE) are used to create new recoded scores (on the output variable SMOKE2). For example, under Old Value, the value 1 is entered; under New Value, the value 1 is entered; then, we click the Add button to add this to the list of recode commands. People who have a score of 1 on SMOKE (i.e., they reported themselves as nonsmokers) will also have a score of 1 on SMOKE2 (which will also be interpreted as “nonsmokers”). Each of the old values 2, 3, and 4 on the existing variable SMOKE is associated with a score of 2 on the new variable SMOKE2. In other words, people who chose responses 2, 3, or 4 on the variable SMOKE (light, moderate, or heavy smokers) will be coded 2 (smokers) on the new variable SMOKE2. Click Continue and then OK to make the recode commands take effect. After the recode command has been executed, a new variable called SMOKE2 will appear in the far right-hand column of the SPSS Data View worksheet; this variable will have scores of 1 (nonsmoker) and 2 (smoker).
Figure 4.33 SPSS Menu Selection for the Recode Command
Figure 4.34 SPSS Recode Into Different Variables Dialog Window
Figure 4.35 Old and New Values for the Recode Command
Figure 4.36 Crosstabs Using the Recoded Smoking Variable (SMOKE2)
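The recode logic just described is menu-driven in SPSS. As a minimal sketch of the same mapping in Python (the SMOKE/SMOKE2 variable names and codes come from the text; the sample scores are invented for illustration):

```python
# Recode SMOKE (1 = nonsmoker; 2, 3, 4 = light/moderate/heavy smoker)
# into SMOKE2 (1 = nonsmoker; 2 = smoker), leaving the original intact.
recode_map = {1: 1, 2: 2, 3: 2, 4: 2}

smoke = [1, 1, 2, 4, 3, 1, 2]            # hypothetical raw SMOKE scores
smoke2 = [recode_map[s] for s in smoke]  # new variable, original preserved

print(smoke2)  # [1, 1, 2, 2, 2, 1, 2]
```

Keeping the recoded scores in a new variable, as the text recommends, corresponds here to building a new list rather than overwriting `smoke`.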
While it is possible to replace the scores on the existing variable SMOKE with recoded values, it is often preferable to put recoded scores into a new output variable: It is easy to lose track of recodes as you continue to work with a data file, and retaining the variable in its original form keeps that information available. After the recode command has been used to create a new variable (SMOKE2), with codes for light, moderate, and heavy smoking combined into a single code for smoking, the Crosstabs procedure can be run using this new version of the smoking variable. The contingency table for GENDER by SMOKE2 appears in Figure 4.36. Note that this new table has no cells with minimum expected cell frequencies less than 5. Sometimes this type of recoding results in reasonably large marginal frequencies for all groups. In this example, however, the total number of smokers in this sample is still small.
4.8.2 Bivariate Data Screening for One Categorical and One Quantitative Variable
Data analysis methods that compare means of quantitative variables across groups (such as ANOVA) have all the assumptions that are required for univariate parametric statistics:
1. Scores on quantitative variables should be normally distributed.
2. Observations should be independent.
When means on quantitative variables are compared across groups, there is one additional assumption: The variances of the populations (from which the samples are drawn) should be equal. This can be stated as a formal null hypothesis: H0: σ1² = σ2² = … = σk². Assessment of possible violations of Assumptions 1 and 2 was described in earlier sections of this chapter. Graphic methods, such as boxplots (as described in an earlier section of this
chapter), provide a way to see whether groups have similar ranges or variances of scores. The SPSS t test and ANOVA procedures provide a significance test of the null hypothesis that the population variances are equal (the Levene test). Usually, researchers hope that this assumption is not violated, and thus, they usually hope that the F ratio for the Levene test will be nonsignificant. However, when the Ns in the groups are equal and reasonably large (approximately N > 30 per group), ANOVA is fairly robust to violations of the equal variance assumption (Myers & Well, 1991, 1995). Small sample sizes create a paradox with respect to the assessment of violations of many assumptions. When N is small, significance tests for possible violations of assumptions have low statistical power, yet violations of assumptions are more problematic for the analysis. For example, consider a one-way ANOVA with only 5 participants per group. With such a small N, the test for heterogeneity of variance may be significant only when the differences among sample variances are extremely large; however, with such a small N, small differences among sample variances might be enough to create problems in the analysis. Conversely, in a one-way ANOVA with 50 participants per group, quite small differences in variance across groups could be judged statistically significant, but with such a large N, only fairly large differences in group variances would be a problem. Doing the preliminary test for heterogeneity of variance when Ns are very large is something like sending out a rowboat to see if the water is safe for the Queen Mary. Therefore, it may be reasonable to use very small α levels, such as α = .001, for significance tests of violations of assumptions in studies with large sample sizes. On the other hand, researchers may want to set α values that are large (e.g., α of .20 or larger) for preliminary tests of assumptions when Ns are small.
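The Levene test itself is simple to state: it is a one-way ANOVA performed on the absolute deviations of scores from their group means. A stdlib-only Python sketch of that idea (my own minimal implementation for illustration, not SPSS's procedure; real analyses would use a statistics package, and the demo groups are invented):

```python
from statistics import mean

def levene_F(*groups):
    """Levene test statistic: a one-way ANOVA F computed on the
    absolute deviations of scores from their group means."""
    z = [[abs(x - mean(g)) for x in g] for g in groups]  # abs deviations
    k = len(z)                            # number of groups
    n = sum(len(g) for g in z)            # total N
    grand = mean([x for g in z for x in g])
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in z)
    ss_within = sum((x - mean(g)) ** 2 for g in z for x in g)
    df1, df2 = k - 1, n - k
    return (ss_between / df1) / (ss_within / df2), df1, df2

# Two invented groups with very different spreads: F comes out large.
F, df1, df2 = levene_F([5, 6, 7, 6], [1, 12, 3, 14])
```

Groups with similar spreads give an F near zero; unequal spreads inflate F, which is exactly the pattern researchers hope *not* to see in this preliminary test.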
Tabachnick and Fidell (2007) provide extensive examples of preliminary data screening for comparison of groups. These
generally involve repeating the univariate data-screening procedures described earlier (to assess normality of distribution shape and identify outliers) separately for each group and, in addition, assessing whether the homogeneity of variance assumption is violated. It is useful to assess the distribution of quantitative scores within each group and to look for extreme outliers within each group. Refer back to Figure 4.29 to see an example of a boxplot that identified outliers on WEIGHT within the gender groups. It might be desirable to remove these outliers or, at least, to consider how strongly they influence the outcome of a t test to compare male and female mean weights. The presence of these outlier scores on WEIGHT raises the mean weight for each group; the presence of these outliers also increases the within-group variance for WEIGHT in both groups.
4.8.3 Bivariate Data Screening for Two Quantitative Variables
Statistics that are part of the general linear model (GLM), such as the Pearson correlation, require several assumptions. Suppose we want to use Pearson’s r to assess the strength of the relationship between two quantitative variables, X (diastolic blood pressure [DBP]) and Y (systolic blood pressure [SBP]). For this analysis, the data should satisfy the following assumptions:
1. Scores on X and Y should each have a univariate normal distribution shape.
2. The joint distribution of scores on X and Y should have a bivariate normal shape (and there should not be any extreme bivariate outliers).
3. X and Y should be linearly related.
4. The variance of Y scores should be the same at each level of X (the homogeneity or homoscedasticity of variance assumption).
The first assumption (univariate normality of X and Y) can be evaluated by setting up a histogram for scores on X and Y and
by looking at values of skewness as described in Section 4.6.4. The next two assumptions (a bivariate normal distribution shape and a linear relation) can be assessed by examining an X, Y scatter plot. To obtain an X, Y scatter plot, the following menu selections are used: <Graph> → <Scatter>. From the initial Scatter/Dot dialog box (see Figure 4.37), the Simple Scatter type of scatter plot was selected by clicking on the icon in the upper left part of the Scatter/Dot dialog window. The Define button was used to move on to the next dialog window. In the next dialog window (shown in Figure 4.38), the name of the predictor variable (DBP at Time 1) was placed in the window marked X Axis, and the name of the outcome variable (SBP at Time 1) was placed in the window marked Y Axis. (Generally, if there is a reason to distinguish between the two variables, the predictor or “causal” variable is placed on the X axis in the scatter plot. In this example, either variable could have been designated as the predictor.) The scatter plot in Figure 4.39 shows a strong positive association between DBP and SBP. The relation appears to be fairly linear, and there are no bivariate outliers. The assumption of bivariate normal distribution is more difficult to evaluate than the assumption of univariate normality, particularly in relatively small samples. Figure 4.40 represents an ideal theoretical bivariate normal distribution. Figure 4.41 is a bar chart that shows the frequencies of scores with specific pairs of X, Y values; it corresponds approximately to an empirical bivariate normal distribution (note that these figures were not generated using SPSS). X and Y have a bivariate normal distribution if Y scores are normally distributed for each value of X (and vice versa).
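Pearson's r itself is straightforward to compute from raw scores. A small Python sketch of the formula (the DBP/SBP-like values below are invented for illustration; any statistics package provides this calculation):

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient computed from raw scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))  # cross-products
    sxx = sum((a - mx) ** 2 for a in x)                   # sum of squares, X
    syy = sum((b - my) ** 2 for b in y)                   # sum of squares, Y
    return sxy / sqrt(sxx * syy)

# Hypothetical DBP/SBP-like scores with a strong, positive, linear relation:
dbp = [70, 75, 80, 85, 90]
sbp = [110, 118, 125, 133, 140]
r = pearson_r(dbp, sbp)  # close to +1 for these nearly collinear scores
```

Remember the caveat from the text: a large r is only meaningful if the scatter plot has already been checked for linearity and bivariate outliers.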
In either graph, if you take any specific value of X and look at that cross section of the distribution, the univariate distribution of Y should be normal. In practice, even relatively large datasets (N > 200) often do not have enough data points to evaluate whether the
scores for each pair of variables have a bivariate normal distribution. Several problems may be detectable in a bivariate scatter plot. A bivariate outlier (see Figure 4.42) is a score that falls outside the region in the X, Y scatter plot where most X, Y values are located. In Figure 4.42, one individual has a body weight of about 230 lb and SBP of about 110; this combination of score values is “unusual” (in general, persons with higher body weight tended to have higher blood pressure). To be judged a bivariate outlier, a score does not have to be a univariate outlier on either X or Y (although it may be). A bivariate outlier can have a disproportionate impact on the value of Pearson’s r compared with other scores, depending on its location in the scatter plot. Like univariate outliers, bivariate outliers should be identified and examined carefully. It may make sense in some cases to remove bivariate outliers, but it is preferable to do this early in the data analysis process, with a well-thought-out justification, rather than late in the data analysis process, because the data point does not conform to the preferred linear model.
Figure 4.37 SPSS Dialog Window for the Scatter Plot Procedure
Figure 4.38 Scatter Plot: Identification of Variables on X and Y Axes
Heteroscedasticity or heterogeneity of variance refers to a situation in which the variance in Y scores is greater for some values of X than for others. In Figure 4.43, the variance of Y scores is much higher for X scores near 50 than for X values less than 30. This unequal variance in Y across levels of X violates the assumption of homoscedasticity of variance; it also indicates that prediction errors for high values of X will be systematically larger than prediction errors for low values of X. Sometimes a log transformation on a Y variable that shows heteroscedasticity
across levels of X can reduce the problem of unequal variance to some degree. However, if this problem cannot be corrected, then the graph that shows the unequal variances should be part of the story that is reported, so that readers understand: It is not just that Y tends to increase as X increases, as in Figure 4.43; the variance of Y also tends to increase as X increases. Ideally, researchers hope to see reasonably uniform variance in Y scores across levels of X. In practice, the number of scores at each level of X is often too small to evaluate the shape and variance of Y values separately for each level of X.
Figure 4.39 Bivariate Scatter Plot for Diastolic Blood Pressure (DIA1) and Systolic Blood Pressure (SYS1) (Moderately Strong, Positive, Linear Relationship)
Figure 4.40 Three-Dimensional Representation of an Ideal Bivariate Normal Distribution
SOURCE: Reprinted with permission from Hartlaub, B., Jones, B. D., & Karian, Z. A., downloaded from www2.kenyon.edu/People/hartlaub/MellonProject/images/bivariate17.gif, supported by the Andrew W. Mellon Foundation.
Figure 4.41 Three-Dimensional Histogram of an Empirical Bivariate Distribution (Approximately Bivariate Normal)
SOURCE: Reprinted with permission from Dr. P. D. M. MacDonald.
NOTE: Z1 and Z2 represent scores on the two variables, while the vertical heights of the bars along the “frequency” axis represent the number of cases that have each combination of scores on Z1 and Z2. A clear bivariate normal distribution is likely to appear only for datasets with large numbers of observations; this example only approximates bivariate normal.
4.9 Nonlinear Relations
Students should be careful to distinguish between these two
situations: no relationship between X and Y versus a nonlinear relationship between X and Y (i.e., a relationship between X and Y that is not linear). An example of a scatter plot that shows no relationship of any kind (either linear or curvilinear) between X and Y appears in Figure 4.44. Note that as the value of X increases, the value of Y does not either increase or decrease. In contrast, an example of a curvilinear relationship between X and Y is shown in Figure 4.45. This shows a strong relationship between X and Y, but it is not linear; as scores on X increase from 0 to 30, scores on Y tend to increase, but as scores on X increase between 30 and 50, scores on Y tend to decrease. An example of a real-world research situation that yields results similar to those shown in Figure 4.45 is a study that examines arousal or level of stimulation (on the X axis) as a predictor of task performance (on the Y axis). For example, suppose that the score on the X axis is a measure of anxiety and the score on the Y axis is a score on an examination. At low levels of anxiety, exam performance is not very good: Students may be sleepy or not motivated enough to study. At moderate levels of anxiety, exam performance is very good: Students are alert and motivated. At the highest levels of anxiety, exam performance is not good: Students may be distracted, upset, and unable to focus on the task. Thus, there is an optimum (moderate) level of anxiety; students perform best at moderate levels of anxiety.
Figure 4.42 Bivariate Scatter Plot for Weight and Systolic Blood Pressure (SYS1)
NOTE: Bivariate outlier can be seen in the lower right corner of the graph.
Figure 4.43 Illustration of Heteroscedasticity of Variance
NOTE: Variance in Y is larger for values of X near 50 than for values of X near 0.
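One rough numerical companion to the visual check for heteroscedasticity is to split the cases at the median of X and compare the variance of Y in the low-X and high-X halves. This median-split rule and the sample data below are my own illustration, not a procedure from the text, and it is only a crude stand-in for examining the scatter plot:

```python
from statistics import median, variance

def y_variance_by_x_half(x, y):
    """Split cases at the median of X; return the sample variance of Y
    in the low-X half and in the high-X half (crude spread comparison)."""
    cut = median(x)
    low = [b for a, b in zip(x, y) if a <= cut]
    high = [b for a, b in zip(x, y) if a > cut]
    return variance(low), variance(high)

# Invented "fan-shaped" data: Y spreads out as X increases.
x_scores = [1, 2, 3, 4, 5, 6]
y_scores = [10, 11, 10, 5, 25, 40]
low_var, high_var = y_variance_by_x_half(x_scores, y_scores)
# high_var far exceeds low_var, flagging possible heteroscedasticity
```

A large imbalance between the two variances is a cue to look at the scatter plot (and perhaps a Levene-type test) rather than a formal test in itself.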
Figure 4.44 No Relationship Between X and Y
Figure 4.45 Bivariate Scatter Plot: Inverse U-Shaped Curvilinear Relation Between X and Y
If a scatter plot reveals this kind of curvilinear relation between X and Y, Pearson’s r (or other analyses that assume a linear relationship) will not do a good job of describing the strength of the relationship and will not reveal its true nature. Other analyses may do a better job in this situation. For example, Y can be predicted from both X and X² (a function that includes an X² term is a curve rather than a straight line). (For details, see Aiken & West, 1991, chap. 5.) Alternatively, students can be separated into high-, medium-, and low-anxiety groups based on their scores on X, and a one-way ANOVA can be performed to assess how mean Y test scores differ across these three groups. However, recoding scores on a quantitative variable into categories can result in substantial loss of information, as pointed out by Fitzsimons (2008). Another possible type of curvilinear function appears in Figure 4.46. This describes a situation where responses on Y reach an asymptote as X increases. After a certain point, further increases in X scores begin to result in diminishing returns on Y. For example, some studies of social support suggest that most of the improvements in physical health outcomes occur between no social support and low social support and that there is little additional improvement in physical health outcomes between low social support and higher levels of social support. Here also, Pearson’s r (or another statistic that assumes a linear relation between X and Y) may understate the strength of the association and fail to reveal its true nature.
Figure 4.46 Bivariate Scatter Plot: Curvilinear Relation Between X1 and Y
If a bivariate scatter plot of scores on two quantitative variables
reveals a nonlinear or curvilinear relationship, this nonlinearity must be taken into account in the data analysis. Some nonlinear relations can be turned into linear relations by applying appropriate data transformations; for example, in psychophysical studies, the log of the physical intensity of a stimulus may be linearly related to the log of the perceived magnitude of the stimulus.
4.10 Data Transformations
A linear transformation is one that changes the original X score by applying only simple arithmetic operations (addition, subtraction, multiplication, or division) using constants. If we let b and c represent any two values that are constants within a study, then the arithmetic function (X – b)/c is an example of a linear transformation. The linear transformation that is most often used in statistics is the one that involves the use of M as the constant b and the sample standard deviation s as the constant c: z = (X – M)/s. This transformation changes the mean of the scores to 0 and the standard deviation of the scores to 1, but it leaves the shape of the distribution of X scores unchanged. Sometimes, we want a data transformation that will change the shape of a distribution of scores (or alter the nature of the relationship between a pair of quantitative variables in a scatter plot). Some data transformations for a set of raw X scores (such as the log of X and the log of Y) tend to reduce positive skewness and also to bring extreme outliers at the high end of the distribution closer to the body of the distribution (see Tabachnick & Fidell, 2007, chap. 4, for further discussion). Thus, if a distribution is skewed, taking the log or square root of scores sometimes makes the shape of the distribution more nearly normal. For some variables (such as reaction time), it is conventional to do this; log of reaction time is very commonly reported (because reaction times tend to be positively skewed).
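The z-score transformation z = (X – M)/s described above can be sketched in a few lines of Python (the raw scores are invented for illustration):

```python
from statistics import mean, stdev

def z_scores(x):
    """Linear transformation z = (X - M) / s: the mean becomes 0 and the
    SD becomes 1, but the shape of the distribution is unchanged."""
    m, s = mean(x), stdev(x)
    return [(v - m) / s for v in x]

raw = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up raw scores
z = z_scores(raw)               # mean(z) = 0, stdev(z) = 1
```

Because this is a linear transformation, a skewed distribution stays skewed after standardizing; only nonlinear transformations such as the log change distribution shape.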
However, note that changing the scale of a variable (from heart rate to log of heart rate) changes the meaning of the variable and can make interpretation and presentation of results
somewhat difficult.
Figure 4.47 Illustration of the Effect of the Base 10 Log Transformation
NOTE: In Figure 4.47, raw scores for body weight are plotted on the X axis; raw scores for metabolic rate are plotted on the Y axis. In Figure 4.48, both variables have been transformed using base 10 logs. Note that the log plot has more equal spacing among cases (there is more information about the differences among low-body-weight animals, and the outliers have been moved closer to the rest of the scores). Also, when logs are taken for both variables, the relation between them becomes linear. Log transformations do not always create linear relations, of course, but there are some situations where they do.
Sometimes a nonlinear transformation of scores on X and Y can change a nonlinear relation between X and Y to a linear relation. This is extremely useful, because the analyses included in the family of methods called general linear models usually require linear relations between variables. A common and useful nonlinear transformation of X is the base 10 log of X, denoted by log10(X). When we find the base 10 log of X, we find a number p such that 10^p = X. For example, the base 10 log of 1,000 is 3, because 10³ = 1,000. The exponent p indicates order of magnitude. Consider the graph shown in Figure 4.47. This is a graph of body weight (in kilograms) on the X axis with mean metabolic rate on the Y axis; each data point represents a mean body weight and a mean metabolic rate for one species. There are some ways in which this graph is difficult to read; for example, all the data points for physically smaller animals are crowded together in the lower left-hand corner of the scatter plot. In addition, if you wanted to fit a function to these points, you would need to fit a curve (rather than a straight line).
Figure 4.48 Graph Illustrating That the Relation Between Base 10 Log of Body Weight and Base 10 Log of Metabolic
Rate Across Species Is Almost Perfectly Linear
SOURCE: Reprinted with permission from Dr. Tatsuo Motokawa.
Figure 4.48 shows the base 10 log of body weight and the base 10 log of metabolic rate for the same set of species as in Figure 4.47. Note that now, it is easy to see the differences among species at the lower end of the body-size scale, and the relation between the logs of these two variables is almost perfectly linear. In Figure 4.47, the tick marks on the X axis represented equal differences in terms of kilograms. In Figure 4.48, the equally spaced points on the X axis now correspond to equal spacing between orders of magnitude (e.g., 10¹, 10², 10³, …); a one-tick-mark change on the X axis in Figure 4.48 represents a change from 10 to 100 kg, or 100 to 1,000 kg, or 1,000 to 10,000 kg. A cat weighs something like 10 times as much as a dove, a human being weighs something like 10 times as much as a cat, a horse about 10 times as much as a human, and an elephant about 10 times as much as a horse. If we take the log of body weight, these log values (p = 1, 2, 3, etc.) represent these orders of magnitude, 10^p (10¹ for a dove, 10² for a cat, 10³ for a human, and so on). If we graphed weights in kilograms using raw scores, we would find a much larger difference between elephants and humans than between humans and cats. The log10(X) transformation yields a new way of scaling weight in terms of p, the relative order of magnitude. When the raw X scores have a range that spans several orders of magnitude (as in the sizes of animals, which vary from < 1 g up to 10,000 kg), applying a log transformation reduces the distances between scores on the high end of the distribution much more than it reduces distances between scores on the low end of the distribution. Depending on the original distribution of X, outliers at the high end of the distribution of X are brought “closer” by the log(X) transformation. Sometimes when
raw X scores have a distribution that is skewed to the right, log(X) is nearly normal. Some relations between variables (such as the physical magnitude of a stimulus, e.g., a weight or a light source) and subjective judgments (of heaviness or brightness) become linear when log or power transformations are applied to the scores on both variables. Note that when a log transformation is applied to a set of scores with a limited range of possible values (e.g., Likert ratings of 1, 2, 3, 4, 5), this transformation has little effect on the shape of the distribution. However, when a log transformation is applied to scores that vary across orders of magnitude (e.g., the highest score is 10,000 times as large as the lowest score), the log transformation may change the distribution shape substantially. Log transformations tend to be much more useful for variables where the highest score is orders of magnitude larger than the smallest score; for example, maximum X is 100 or 1,000 or 10,000 times minimum X. Other transformations that are commonly used involve power functions—that is, replacing X with X², with X^c (where c is some power to which X is raised, not necessarily an integer value), or with √X. For specific types of data (such as scores that represent proportions, percentages, or correlations), other types of nonlinear transformations are needed. Usually, the goals that a researcher hopes to achieve through data transformations include one or more of the following: to make a nonnormal distribution shape more nearly normal, to minimize the impact of outliers by bringing those values closer to other values in the distribution, or to make a nonlinear relationship between variables linear. One argument against the use of nonlinear transformations has to do with the interpretability of the transformed scores. If we take the square root of “number of times a person cries per week,” how do we talk about the transformed variable?
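The order-of-magnitude compression produced by the base 10 log can be seen directly in a short Python sketch (the weights below are rough, invented illustrative values in the spirit of the dove/cat/human/horse/elephant example, not real species data):

```python
from math import log10

# Body weights in kg spanning several orders of magnitude
# (invented values: dove ~0.1, cat ~4, human ~70, horse ~500, elephant ~5000).
weights = [0.1, 4, 70, 500, 5000]
log_weights = [round(log10(w), 2) for w in weights]

print(log_weights)  # [-1.0, 0.6, 1.85, 2.7, 3.7]
```

Raw gaps of thousands of kilograms at the high end become small, roughly even steps on the log scale, which is exactly why extreme high-end outliers are pulled in toward the rest of the distribution.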
For some variables, certain transformations are so common that they are expected (e.g., psychophysical data are usually modeled using
power functions; measurements of reaction time usually have a log transformation applied to them).
4.11 Verifying That Remedies Had the Desired Effects
Researchers should not assume that the remedies they use to try to correct problems with their data (such as removal of outliers, or log transformations) are successful in achieving the desired results. For example, after one really extreme outlier is removed, when the frequency distribution is graphed again, other scores may still appear to be relatively extreme outliers. After the scores on an X variable are transformed by taking the natural log of X, the distribution of the natural log of X may still be nonnormal. It is important to repeat data screening using the transformed scores to make certain that the data transformation had the desired effect. Ideally, the transformed scores will have a nearly normal distribution without extreme outliers, and relations between pairs of transformed variables will be approximately linear.
4.12 Multivariate Data Screening
Data screening for multivariate analyses (such as multiple regression and multivariate analysis of variance) begins with screening for each individual variable and bivariate data screening for all possible pairs of variables, as described in earlier sections of this chapter. When multiple predictor or multiple outcome variables are included in an analysis, correlations among these variables are reported as part of preliminary screening. More complex assumptions about data structure will be reviewed as they arise in later chapters. Complete data screening in multivariate studies requires careful examination not just of the distributions of scores for each individual variable but also of the relationships between pairs of variables and among subsets of variables. It is possible to obtain numerical indexes (such as Mahalanobis d) that provide information about the degree to which individual scores are multivariate outliers.
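For the two-variable case, the squared Mahalanobis distance mentioned above can be written out by hand, since the 2 × 2 covariance matrix inverts easily. A stdlib-only Python sketch (my own minimal two-variable version for illustration; multivariate packages generalize this to any number of variables, and the weight/SBP-like data are invented, with the last case echoing the text's bivariate outlier):

```python
from statistics import mean

def mahalanobis_sq(x, y):
    """Squared Mahalanobis distance of each (x, y) case from the centroid,
    using the sample covariance matrix (two-variable case only)."""
    n = len(x)
    mx, my = mean(x), mean(y)
    # entries of the 2x2 sample covariance matrix
    sxx = sum((a - mx) ** 2 for a in x) / (n - 1)
    syy = sum((b - my) ** 2 for b in y) / (n - 1)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    det = sxx * syy - sxy ** 2  # determinant of the covariance matrix
    d2 = []
    for a, b in zip(x, y):
        dx, dy = a - mx, b - my
        # (dx, dy) * inverse covariance * (dx, dy)^T, expanded for the 2x2 case
        d2.append((syy * dx ** 2 - 2 * sxy * dx * dy + sxx * dy ** 2) / det)
    return d2

# Invented weight/SBP-like data; (230, 110) is the "unusual" combination.
d2 = mahalanobis_sq([150, 160, 170, 180, 230], [120, 125, 130, 135, 110])
# the last case has the largest distance from the centroid
```

The case that breaks the overall weight–pressure pattern gets the largest distance, even though neither of its scores is extreme on its own, which is the defining feature of a bivariate outlier.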
Excellent examples of multivariate data screening are presented in Tabachnick and Fidell (2007).
4.13 Reporting Preliminary Data Screening
Many journals in psychology and related fields use the style guidelines published by the American Psychological Association (APA, 2009). This section covers some of the basic guidelines. All APA-style research reports should be double-spaced and single-sided, with at least 1-in. margins on each page. A Results section should report data screening and the data analyses that were performed (including results that run counter to predictions). Interpretations and discussion of the implications of the results are generally placed in the Discussion section of the paper (except in very brief papers with combined Results/Discussion sections). Although null hypothesis significance tests are generally reported, the updated fifth and sixth editions of the APA (2001, 2009) Publication Manual also call for the inclusion of effect-size information and confidence intervals (CIs), wherever possible, for all major outcomes. Include the basic descriptive statistics that are needed to understand the nature of the results; for example, a report of a one-way ANOVA should include group means and standard deviations as well as F values, degrees of freedom, effect-size information, and CIs. Standard abbreviations are used for most statistics—for example, M for mean and SD for standard deviation (APA, 2001, pp. 140–144). These should be in italic font (APA, 2001, p. 101). Parentheses are often used when these are reported in the context of a sentence, as in, “The average verbal SAT for the sample was 551 (SD = 135).” The sample size (N) or the degrees of freedom (df) should always be included when reporting statistics.
Often the df values appear in parentheses immediately following the statistic, as in this example: “There was a significant gender difference in mean score on the Anger In scale, t(61) = 2.438, p = .018, two-tailed, with women scoring higher on average than men.” Generally, results are rounded to two decimal places, except that p values are sometimes given to three decimal places. It is more informative to report exact p values than to
make directional statements such as p < .05. If the printout shows a p of .000, it is preferable to report p < .001 (the risk of Type I error indicated by p is not really zero). When it is possible for p values to be either one-tailed or two-tailed (for the independent samples t test, for example), this should be stated explicitly. Tables and figures are often useful ways of summarizing a large amount of information—for example, a list of t tests with several dependent variables, a table of correlations among several variables, or the results from multivariate analyses such as multiple regression. See APA (2001, pp. 147–201) for detailed instructions about the preparation of tables and figures. (Tufte, 1983, presents wonderful examples of excellence and awfulness in graphic representations of data.) All tables and figures should be discussed in the text; however, the text should not repeat all the information in a table; it should point out only the highlights. Table and figure headings should be informative enough to be understood on their own. It is common to denote statistical significance using asterisks (e.g., * for p < .05, ** for p < .01, and *** for p < .001), but these should be described in footnotes to the table. Each column and row of the table should have a clear heading. When there is not sufficient space to type out the entire names for variables within the table, numbers or abbreviations may be used in place of variable names; this should also be explained fully in footnotes to the table. Horizontal rules or lines should be used sparingly within tables (i.e., not between each row but only in the headings and at the bottom). Vertical lines are not used in tables. Spacing should be sufficient so that the table is readable. In general, Results sections should include the following information. For specific analyses, additional information may be useful or necessary.
1.
The opening sentence of each Results section should state what analysis was done, with what variables, and to answer
what question. This sounds obvious, but sometimes this information is difficult to find in published articles. An example of this type of opening sentence is, “In order to assess whether there was a significant difference between the mean Anger In scores of men and women, an independent samples t test was performed using the Anger In score as the dependent variable.”
2. Next, describe the data screening that was done to decide whether assumptions were violated, and report any steps that were taken to correct the problems that were detected. For example, this would include examination of distribution shapes using graphs such as histograms, detection of outliers using boxplots, and tests for violations of homogeneity of variance. Remedies might include deletion of outliers, data transformations such as the log, or choice of a statistical test that is more robust to violations of the assumption.
3. The next sentence should report the test statistic and the associated exact p value; also, a statement whether or not it achieved statistical significance, according to the predetermined alpha level, should be included: “There was a significant gender difference in mean score on the Anger In scale, t(61) = 2.438, p = .018, two-tailed, with women scoring higher on average than men.” The significance level can be given as a range (p < .05) or as a specific obtained value (p = .018). For nonsignificant results, any of the following methods of reporting may be used: p > .05 (i.e., a statement that the p value on the printout was larger than a preselected α level of .05), p = .38 (i.e., an exact obtained p value), or just ns (an abbreviation for nonsignificant). Recall that the p value is an estimate of the risk of Type I error; in theory, this risk is never zero, although it may be very small. Therefore, when the printout reports a significance or p value of .000, it is more accurate to report it as “p < .001” than as “p = .000.”
4.
Information about the strength of the relationship should be reported. Most statistics have an accompanying effect-size measure. For example, for the independent samples t test, Cohen’s d and η2 are common effect-size indexes. For this
  • 52. example, η2 = t2/(t2 + df) = (2.438)2/((2.438)2 + 61) = .09. Verbal labels may be used to characterize an effect-size estimate as small, medium, or large. Reference books such as Cohen’s (1988) Statistical Power Analysis for the Behavioral Sciences suggest guidelines for the description of effect size. 5. Where possible, CIs should be reported for estimates. In this example, the 95% CI for the difference between the sample means was from .048 to .484. 6. It is important to make a clear statement about the nature of relationships (e.g., the direction of the difference between group means or the sign of a correlation). In this example, the mean Anger In score for females (M = 2.36, SD = .484) was higher than the mean Anger In score for males (M = 2.10, SD = .353). Descriptive statistics should be included to provide the reader with the most important information. Also, note whether the outcome was consistent with or contrary to predictions; detailed interpretation/discussion should be provided in the Discussion section. Many published studies report multiple analyses. In these situations, it is important to think about the sequence. Sometimes, basic demographic information is reported in the section about participants in the Methods/Participants section of the paper. However, it is also common for the first table in the Results section to provide means and standard deviations for all the quantitative variables and group sizes for all the categorical variables. Preliminary analyses that examine the reliabilities of variables are reported prior to analyses that use those variables. It is helpful to organize the results so that analyses that examine closely related questions are grouped together. It is also helpful to maintain a parallel structure throughout the research paper. 
That is, questions are outlined in the Introduction, the Methods section describes the variables that are manipulated and/or measured to answer those questions, the Results section reports the statistical analyses that were employed to try to answer each question, and the Discussion section interprets and evaluates the findings relevant to each question. It is helpful to keep the
questions in the same order in each section of the paper.

Sometimes, a study has both confirmatory and exploratory components (as discussed in Chapter 1). For example, a study might include an experiment that tests the hypotheses derived from earlier research (confirmatory), but it might also examine the relations among variables to look for patterns that were not predicted (exploratory). It is helpful to make a clear distinction between these two types of results. The confirmatory Results section usually includes a limited number of analyses that directly address questions that were stated in the Introduction; when a limited number of significance tests are presented, there should not be a problem with inflated risk of Type I error. On the other hand, it may also be useful to present the results of other exploratory analyses; however, when many significance tests are performed and no a priori predictions were made, the results should be labeled as exploratory, and the author should state clearly that any p values that are reported in this context are likely to underestimate the true risk of Type I error. Usually, data screening that leads to a reduced sample size and assessment of measurement reliability are reported in the Methods section prior to the Results section. In some cases, it may be possible to make a general statement such as, “All variables were normally distributed, with no extreme outliers” or “Group variances were not significantly heterogeneous,” as a way of indicating that assumptions for the analysis are reasonably well satisfied. An example of a Results section that illustrates some of these points follows. The SPSS printout that yielded these numerical results is not included here; it is provided in the Instructor Supplement materials for this textbook.

Results

An independent samples t test was performed to assess whether there was a gender difference in mean Anger In scores.
Histograms and boxplots indicated that scores on the dependent
  • 54. variable were approximately normally distributed within each group with only one outlier in each group. Because these outliers were not extreme, these scores were retained in the analysis. The Levene test showed a nonsignificant difference between the variances; because the homogeneity of variance assumption did not appear to be violated, the pooled variances t test was used. The male and female groups had 28 and 37 participants, respectively. The difference in mean Anger In scores was found to be statistically significant, t(63) = 2.50, p = .015, two-tailed. The mean Anger In score for females (M = 2.37, SD = .482) was higher than the mean Anger In score for males (M = 2.10, SD = .353). The effect size, indexed by η2, was .09. The 95% CI around the difference between these sample means ranged from .05 to .49. 4.14 Summary and Checklist for Data Screening The goals of data screening include the following: identification and correction of data errors, detection and decisions about outliers, and evaluation of patterns of missing data and decisions regarding how to deal with missing data. For categorical variables, the researcher needs to verify that all groups that will be examined in analyses (such as Crosstabs or ANOVA) have a reasonable number of cases. For quantitative variables, it is important to assess the shape of the distribution of scores and to see what information the distribution provides about outliers, ceiling or floor effects, and restricted range. Assumptions specific to the analyses that will be performed (e.g., the assumption of homogeneous population variances for the independent samples t test, the assumption of linear relations between variables for Pearson’s r) should be evaluated. Possible remedies for problems with general linear model assumptions that are identified include dropping scores, modifying scores through data transformations, or choosing a different analysis that is more appropriate to the data. 
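The numbers in this sample Results section can be reproduced from the reported group statistics alone. Below is a short Python sketch of the pooled-variances t test, the η² = t²/(t² + df) effect size quoted in the chapter, and the 95% CI; the critical value 1.998 for t(63) is a standard tabled value assumed here, not something given in the text.

```python
import math

# Group summary statistics reported in the example Results section
n_m, mean_m, sd_m = 28, 2.10, 0.353   # males
n_f, mean_f, sd_f = 37, 2.37, 0.482   # females

df = n_m + n_f - 2                                    # 63
# Pooled variance: the two sample variances weighted by their df
sp2 = ((n_m - 1) * sd_m**2 + (n_f - 1) * sd_f**2) / df
se_diff = math.sqrt(sp2 * (1 / n_m + 1 / n_f))        # SE of the mean difference
t_stat = (mean_f - mean_m) / se_diff

# Effect size, using the formula quoted in the chapter
eta_squared = t_stat**2 / (t_stat**2 + df)

# 95% CI for the mean difference; 1.998 is the tabled two-tailed
# critical value of t for df = 63 (an assumption, not from the text)
t_crit = 1.998
diff = mean_f - mean_m
ci = (diff - t_crit * se_diff, diff + t_crit * se_diff)

print(round(t_stat, 2))                    # 2.5, matching t(63) = 2.50
print(round(eta_squared, 2))               # 0.09
print(round(ci[0], 2), round(ci[1], 2))    # 0.05 0.49
```

The pooled (rather than separate) variances form is used here because the text justifies it with the nonsignificant Levene test it reports.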
After deleting outliers or transforming scores, it is important to check (by rerunning frequency distributions and replotting graphs)
that the data modifications actually had the desired effects. A checklist of data-screening procedures is given in Table 4.2. Preliminary screening also yields information that may be needed to characterize the sample. The Methods section typically reports the numbers of male and female participants, mean and range of age, and other demographic information.

Table 4.2 Checklist for Data Screening

1. Proofread scores in the SPSS data worksheet against original data sources, if possible.
2. Identify response inconsistencies across variables.
3. During univariate screening of scores on categorical variables,
a. check for values that do not correspond to valid response alternatives, and
b. note groups that have Ns too small to be examined separately in later analyses (decide what to do with small-N groups—e.g., combine them with other groups, drop them from the dataset).
4. During univariate screening of scores on quantitative variables, look for
a. normality of distribution shape (e.g., skewness, kurtosis, other departures from normal shape),
b. outliers,
c. scores that do not correspond to valid response alternatives or possible values, and
d. ceiling or floor effects, restricted range.
5. Consider dropping individual participants or variables that show high levels of incorrect responses or responses that are inconsistent.
6. Note the pattern of “missing” data. If not random, describe how missing data are patterned. Imputation may be used to replace missing scores.
7. For bivariate analyses involving two categorical variables (e.g., chi-squared),
a. examine the marginal distributions to see whether the Ns in each row and column are sufficiently large (if not, consider dropping some categories or combining them with other categories), and
b. check whether expected values in all cells are greater than 5 (if this is not the case, consider alternatives to χ² such as the Fisher exact test).
8. For bivariate analyses of two continuous variables (e.g., Pearson’s r), examine the scatter plot:
a. Assess possible violations of bivariate normality.
b. Look for bivariate outliers or disproportionately influential scores.
c. Assess whether the relation between X and Y is linear. If it is not linear, consider whether to use a different approach to analysis (e.g., divide scores into low, medium, and high groups based on X scores and do an ANOVA) or use nonlinear transformations such as log to make the relation more nearly linear.
d. Assess whether variance in Y scores is uniform across levels of X (i.e., the assumption of homoscedasticity of variance).
9. For bivariate analyses with one categorical and one continuous variable,
a. assess the distribution shapes for scores within each group (Are the scores normally distributed?),
b. look for outliers within each group,
c. test for possible violations of homogeneity of variance, and
d. make sure that group sizes are adequate.
10. Verify that any remedies that have been attempted were successful—for example, after removal of outliers, does a distribution of scores on a quantitative variable now appear approximately normal in shape? After taking a log of X, is the distribution of X more nearly normal, and is the relation of X with Y more nearly linear?
11. Based on data screening and the success or failure of remedies that were attempted,
a. Are assumptions for the intended parametric analysis (such as t test, ANOVA, or Pearson’s r) sufficiently well met to go ahead and use parametric methods?
b. If there are problems with these assumptions, should a nonparametric method of data analysis be used?
12. In the report of results, include a description of data-screening procedures and any remedies (such as dropping outliers, imputing values for missing data, or data transformations) that were applied to the data prior to other analyses.

4.15 Final Notes

Removal of scores, cases, groups, or variables from an analysis based on data screening and on whether the results of analysis are statistically significant can lead to a problem, discussed in more detail in a recent paper by Simmons, Nelson, and Simonsohn (2011). They provide empirical demonstrations that many common research practices, such as dropping outliers, dropping groups, adding or omitting variables in the final reported analysis, and continuing to collect data until the effect of interest is found to be statistically significant, raise the risk of Type I error. They acknowledge that researchers cannot always make all decisions about the analysis (e.g., which cases to include) in advance. However, they noted correctly that when researchers go through a process in which they try out many variations of the analysis, searching for a version of the analysis that yields a statistically significant outcome, there is an inflated risk of Type I error. They call this “excess flexibility” in analysis. They recommend a list of research
  • 59. design and reporting requirements that would make it possible for readers to evaluate whether the authors have tried out a large number of alternative analyses before settling on one version to report. First, authors should decide on a rule for terminating data collection before data collection is started (as opposed to continuing to collect data, analyzing the data, and terminating data collection only when the result of interest is statistically significant). Second, for cells or groups, authors should collect at least 20 cases per group or provide a compelling reason why this would be too costly. Third, authors should list all variables included in a study. (I would add to this the suggestion that it should be made clear which variables were and were not included in exploratory analyses.) Fourth, authors should report all experimental conditions, including any groups for which manipulations failed to work as predicted. Fifth, if authors remove observations such as outliers, they should report what the statistical results would be if those observations were included. Sixth, if covariates are used (see Chapter 17 in this book), authors should report differences among group means when the covariates are excluded, as well as when covariates are included. Simmons et al. further recommended that journal reviewers should ask authors to make it clear that the results reported do not depend on arbitrary decisions (such as omission of outliers or inclusion of covariates for which there is no theoretical justification). Another source of false-positive results (Type I errors) arises in research labs or programs where many studies are conducted, but only those that yield p < .05 are published. There is a basic contradiction between exploratory and confirmatory/hypothesis testing approaches to data analysis. Both approaches can be valuable. 
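Simmons et al.'s claim that "excess flexibility" inflates Type I error can be demonstrated with a small simulation. This is an illustrative sketch, not a reproduction of their analyses: the null data, the 20-per-group sample sizes, the approximate critical value 2.03, and the drop-one-outlier rule are all assumptions made here.

```python
import math
import random
import statistics

random.seed(7)  # fixed seed so the illustration is reproducible

T_CRIT = 2.03  # approximate two-tailed .05 critical value for df near 38 (assumed)

def pooled_t(a, b):
    """Pooled-variances independent samples t statistic."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * statistics.variance(a) +
           (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(sp2 * (1 / na + 1 / nb))

def drop_most_extreme(group):
    """Return the group with its single most extreme score removed."""
    m = statistics.mean(group)
    out = list(group)
    out.remove(max(out, key=lambda x: abs(x - m)))
    return out

trials = 2000
single = flexible = 0
for _ in range(trials):
    # Both groups come from the same population, so any "effect" is a Type I error
    a = [random.gauss(0, 1) for _ in range(20)]
    b = [random.gauss(0, 1) for _ in range(20)]
    t1 = abs(pooled_t(a, b))
    # The "flexible" analyst also tries the test with an outlier dropped per group
    t2 = abs(pooled_t(drop_most_extreme(a), drop_most_extreme(b)))
    single += t1 > T_CRIT
    flexible += (t1 > T_CRIT) or (t2 > T_CRIT)

# The flexible false-positive rate is larger than the single-test rate
print(single / trials, flexible / trials)
```

Because the flexible analyst rejects whenever either version of the analysis is significant, the false-positive rate can only grow as more variants are tried, which is exactly the inflation Simmons et al. describe.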
However, it is fairly common practice for researchers to engage in a very “exploratory” type of analysis: they try out a variety of analyses, searching for one in which the p values are less than .05, and then, sometimes after the fact, formulate a theoretical explanation consistent with this result. If this is reported in a paper that
states hypotheses at the beginning, this form of reporting makes it appear that the data analysis approach was confirmatory, hides the fact that the reported analysis was selected from a large number of other analyses that were conducted that did not support the author’s conclusions, and leads the reader to believe that the p value should be an accurate indication of the risk of Type I error. As clearly demonstrated by Simmons et al. (2011), doing many variations of the analysis inflates the risk of Type I error. What should the data analyst do about this problem? First, before data are collected, researchers can establish (and should adhere to) some simple rules about handling of outliers. In experiments, researchers should avoid collecting data, running analyses, and then continuing to collect data until a point is reached where group differences are statistically significant; instead, sample size should be decided before data collection begins. When a researcher wants to do exploratory work to see what patterns may emerge from data, the best approach is to collect enough data to do a cross-validation. For example, a researcher might obtain 600 cases and randomly divide the data into two datasets of 300 cases each. Exploratory analyses using the first batch of data should be clearly described in the research report as exploratory, with cautions about the inflated risk of Type I error that accompany this approach. The second batch of data can then be used to test whether a limited number of these exploratory analyses produce the same results on a new batch of data.

Comprehension Questions

1. What are the goals of data screening?
2. What SPSS procedures can be used for data screening of categorical variables?
3. What SPSS procedures can be used for data screening of quantitative variables?
4. What do you need to look for in bivariate screening (for each combination of categorical and quantitative variables)?
5. What potential problems should you look for in the univariate distributions of categorical and quantitative scores?
6. How can a box and whiskers plot (or boxplot) be used to look for potential outliers?
7. How can you identify and remedy the following: errors in data entry, outliers, and missing data?
8. Why is it important to assess whether missing values are randomly distributed throughout the participants and measures? In other words, why is it important to understand what processes lead to missing values?
9. Why are log transformations sometimes applied to scores?
10. Outline the information that should be included in an APA-style Results section.

Data Analysis Project for Univariate and Bivariate Data Screening

Data for this assignment may be provided by your instructor, or use one of the datasets found on the website for this textbook. Note that in addition to the variables given in the SPSS file, you can also use variables that are created by compute statements, such as scale scores formed by summing items (e.g., Hostility = H1 + H2 + H3 + H4).

1. Select three variables from the dataset. Choose two of the variables such that they are good candidates for correlation/regression and one other variable as a bad candidate. Good candidates are variables that meet the assumptions (e.g., normally distributed, reliably measured, interval/ratio level of measurement, etc.). Bad candidates are variables that do not meet assumptions or that have clear problems (restricted range, extreme outliers, gross nonnormality of distribution shape, etc.).
2. For each of the three variables, use the Frequencies procedure to obtain a histogram and all univariate descriptive statistics.
3. For the two “good candidate” variables, obtain a scatter plot. Also, obtain a scatter plot for the “bad candidate” variable with one of the two good variables.

Hand in your printouts for these analyses along with your answers to the following questions (there will be no Results section in this assignment).

1. Explain which variables are good and bad candidates for a correlation analysis, and give your rationale. Comment on the empirical results from your data screening—both the histograms and the scatter plots—as evidence that these variables meet or do not meet the basic assumptions necessary for correlation to be meaningful and “honest.” Also, can you think of other information you would want to have about the variables to make better informed judgments?
2. Is there anything that could be done (in terms of data transformations, eliminating outliers, etc.) to make your “bad candidate” variable better? If so, what would you recommend?

Warner, Rebecca M. Applied Statistics: From Bivariate Through Multivariate Techniques, 2nd Edition. SAGE Publications, Inc., 04/2012. VitalBook file.

Chapter 2 - BASIC STATISTICS, SAMPLING ERROR, AND CONFIDENCE INTERVALS

2.1 Introduction

The first few chapters of a typical introductory statistics book present simple methods for summarizing information about the distribution of scores on a single variable. It is assumed that readers understand that information about the distribution of scores for a quantitative variable, such as heart rate, can be summarized in the form of a frequency distribution table or a
  • 63. histogram and that readers are familiar with concepts such as central tendency and dispersion of scores. This chapter reviews the formulas for summary statistics that are most often used to describe central tendency and dispersion of scores in batches of data (including the mean, M, and standard deviation, s). These formulas provide instructions that can be used for by-hand computation of statistics such as the sample mean, M. A few numerical examples are provided to remind readers how these computations are done. The goal of this chapter is to lead students to think about the formula for each statistic (such as the sample mean, M). A thoughtful evaluation of each equation makes it clear what information each statistic is based on, the range of possible values for the statistic, and the patterns in the data that lead to large versus small values of the statistic. Each statistic provides an answer to some question about the data. The sample mean, M, is one way to answer the question, What is a typical score value? It is instructive to try to imagine these questions from the point of view of the people who originally developed the statistical formulas and to recognize why they used the arithmetic operations that they did. For example, summing scores for all participants in a sample is a way of summarizing or combining information from all participants. Dividing a sum of scores by N corrects for the impact of sample size on the magnitude of this sum. The notation used in this book is summarized in Table 2.1. For example, the mean of scores in a sample batch of data is denoted by M. The (usually unknown) mean of the population that the researcher wants to estimate or make inferences about, using the sample value of M, is denoted by μ (Greek letter mu). 
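The distinction between the sample mean M and the population mean μ can be made concrete with a short simulation showing how M varies from one random sample to the next. This is an illustrative sketch with made-up scores, not the hr130.sav data.

```python
import random
import statistics

random.seed(42)  # fixed seed so the illustration is reproducible

# A hypothetical population of 130 heart-rate-like scores (not the hr130.sav file)
population = [random.gauss(74, 7) for _ in range(130)]
mu = statistics.mean(population)

# Draw several random samples of N = 9 and compute M for each;
# the sample means scatter around mu because of sampling error
sample_means = [statistics.mean(random.sample(population, 9)) for _ in range(5)]
print([round(m, 1) for m in sample_means], "vs mu =", round(mu, 1))
```

Each draw yields a somewhat different M, which is exactly the sampling-error idea developed in the next paragraphs.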
One of the greatest conceptual challenges for students who are taking a first course in statistics arises when the discussion moves beyond the behavior of single X scores and begins to consider how sample statistics (such as M) vary across different batches of data that are randomly sampled from the same population. On first passing through the material, students are often so preoccupied with the mechanics of computation that
  • 64. they lose sight of the questions about the data that the statistics are used to answer. This chapter discusses each formula as something more than just a recipe for computation; each formula can be understood as a meaningful sentence. The formula for a sample statistic (such as the sample mean, M) tells us what information in the data is taken into account when the sample statistic is calculated. Thinking about the formula and asking what will happen if the values of X increase in size or in number make it possible for students to answer questions such as the following: Under what circumstances (i.e., for what patterns in the data) will the value of this statistic be a large or a small number? What does it mean when the value of the statistic is large or when its value is small? The basic research questions in this chapter will be illustrated by using a set of scores on heart rate (HR); these are contained in the file hr130.sav. For a variable such as HR, how can we describe a typical HR? We can answer this question by looking at measures of central tendency such as mean or median HR. How much does HR vary across persons? We can assess this by computing a variance and standard deviation for the HR scores in this small sample. How can we evaluate whether an individual person has an HR that is relatively high or low compared with other people’s HRs? When scores are normally distributed, we can answer questions about the location of an individual score relative to a distribution of scores by calculating a z score to provide a unit-free measure of distance of the individual HR score from the mean HR and using a table of the standard normal distribution to find areas under the normal distribution that correspond to distances from the mean. These areas can be interpreted as proportions and used to answer questions such as, Approximately what proportion of people in the sample had HR scores higher than a specific value such as 84? 
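The z-score logic just described can be sketched in Python using the population values reported later in the chapter (μ = 73.76, σ = 7.06 for the hr130.sav scores); the normal-curve area is computed here from the standard error function rather than a printed table.

```python
import math

mu, sigma = 73.76, 7.06   # population mean and SD for the hr130.sav scores
x = 84                    # an individual heart rate score

z = (x - mu) / sigma      # unit-free distance of the score from the mean

# Proportion of a normal distribution above z, via the cumulative
# normal Phi(z) = 0.5 * (1 + erf(z / sqrt(2)))
prop_above = 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))

print(round(z, 2))           # 1.45
print(round(prop_above, 3))  # roughly .07: about 7% of scores exceed 84
```

This is the same table-lookup procedure described in the text, just done numerically.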
Table 2.1 Notation for Sample Statistics and Population Parameters
  • 65. a. The first notation listed for each sample statistic is the notation most commonly used in this book. We will consider the issues that must be taken into account when we use the sample mean, M, for a small random sample to estimate the population mean, μ, for a larger population. In introductory statistics courses, students are introduced to the concept of sampling error, that is, variation in values of the sample mean, M, across different batches of data that are randomly sampled from the same population. Because of sampling error, the sample mean, M, for a single sample is not likely to be exactly correct as an estimate of μ, the unknown population mean. When researchers report a sample mean, M, it is important to include information about the magnitude of sampling error; this can be done by setting up a confidence interval (CI). This chapter reviews the concepts that are involved in setting up and interpreting CIs. 2.2 Research Example: Description of a Sample of HR Scores In the following discussion, the population of interest consists of 130 persons; each person has a score on HR, reported in beats per minute (bpm). Scores for this hypothetical population are contained in the data file hr130.sav. Shoemaker (1996) generated these hypothetical data so that sample statistics such as the sample mean, M, would correspond to the outcomes from an empirical study reported by Mackowiak, Wasserman, and Levine (1992). For the moment, it is useful to treat this set of 130 scores as the population of interest and to draw one small random sample (consisting of N = 9 cases) from this population. This will provide us with a way to evaluate how accurately a mean based on a random sample of N = 9 cases estimates the mean of the population from which the sample was selected. (In this case, we can easily find the actual population mean, μ, because we have HR data for the entire population of 130 persons.) IBM SPSS® Version 19 is used for examples in this book. 
SPSS has a procedure that allows the data analyst to select a random sample of cases from a data file; the data analyst can specify either the percentage of cases to be included
  • 66. in the sample (e.g., 10% of the cases in the file) or the number of cases (N) for the sample. In the following exercise, a random sample of N = 9 HR scores was selected from the population of 130 cases in the SPSS file hr130.sav. Figure 2.1 shows the Data View for the SPSS worksheet for the hr130.sav file. Each row in this worksheet corresponds to scores for one participant. Each column in the SPSS worksheet corresponds to one variable. The first column gives each person’s HR in beats per minute (bpm). Clicking on the tab near the bottom left corner of the worksheet shown in Figure 2.1 changes to the Variable View of the SPSS dataset, displayed in Figure 2.2. In this view, the names of variables are listed in the first column. Other cells provide information about the nature of each variable—for example, variable type. In this dataset, HR is a numerical variable, and the variable type is “scale” (i.e., quantitative or approximately interval/ratio) level of measurement. HR is conventionally reported in whole numbers; the choice of “0” in the decimal points column for this variable instructs SPSS to include no digits after the decimal point when displaying scores for this variable. Readers who have never used SPSS will find a brief introduction to SPSS in the appendix to this chapter; they may also want to consult an introductory user’s guide for SPSS, such as George and Mallery (2010). Figure 2.1 The SPSS Data View for the First 23 Lines From the SPSS Data File hr130.sav Figure 2.2 The Variable View for the SPSS Worksheet for hr130.sav Prior to selection of a random sample, let’s look at the distribution of this population of 130 scores. A histogram can be generated for this set of scores by starting in the Data View worksheet, selecting the <Graphs> menu from the menu bar along the top of the SPSS Data View worksheet, and then
selecting <Legacy Dialogs> and <Histogram> from the pull-down menus, as shown in Figure 2.3. Figure 2.4 shows the SPSS dialog window for the Histogram procedure. Initially, the names of all the variables in the file (in this example, there is only one variable, HR) appear in the left-hand panel, which shows the available variables. To designate HR as the variable for the histogram, highlight it with the cursor and click on the right-pointing arrow to move the variable name HR into the small window on the right-hand side under the heading Variable. (Notice that the variable named HR has a “ruler” icon associated with it. This ruler icon indicates that scores on this variable are scale [i.e., quantitative or interval/ratio] level of measurement.) To request a superimposed normal curve, click the check box for Display normal curve. Finally, to run the procedure, click the OK button in the upper right-hand corner of the Histogram dialog window. The output from this procedure appears in Figure 2.5, along with the values for the population mean μ = 73.76 and population standard deviation σ = 7.06 for the entire population of 130 scores.

Figure 2.3 SPSS Menu Selections <Graphs> → <Legacy Dialogs> → <Histogram> to Open the Histogram Dialog Window
NOTE: IBM SPSS Version 19 was used for all examples in this book.

To select a random sample of size N = 9 from the entire population of 130 scores in the SPSS dataset hr130.sav, make the following menu selections, starting from the SPSS Data View worksheet, as shown in Figure 2.6: <Data> → <Select Cases>. This opens the SPSS dialog window for Select Cases, which appears in Figure 2.7. In the Select Cases dialog window, click the radio button for Random sample of cases. Then, click the Sample button; this opens the Select Cases: Random Sample dialog window in Figure 2.8. Within this box under the heading
Sample Size, click the radio button that corresponds to the word “Exactly” and enter the desired sample size (9) and the number of cases in the entire population (130). The resulting SPSS command is, “Randomly select exactly 9 cases from the first 130 cases.” Click the Continue button to return to the main Select Cases dialog window. To save this random sample of N = 9 HR scores into a separate, smaller file, click on the radio button for “Copy selected cases to a new dataset” and provide a name for the dataset that will contain the new sample of nine cases—in this instance, hr9.sav. Then, click the OK button.

Figure 2.4 SPSS Histogram Dialog Window
Figure 2.5 Output: Histogram for the Entire Population of Heart Rate (HR) Scores in hr130.sav
Figure 2.6 SPSS Menu Selection for <Data> → <Select Cases>
Figure 2.7 SPSS Dialog Window for Select Cases
Figure 2.8 SPSS Dialog Window for Select Cases: Random Sample

When this was done, a random sample of nine cases was obtained; these nine HR scores appear in the first column of Table 2.2. (The computation of the values in the second and third columns in Table 2.2 will be explained in later sections of this chapter.) Of course, if you give the same series of commands, you will obtain a different subset of nine scores as the random sample. The next few sections show how to compute descriptive statistics for this sample of nine scores: the sample mean, M; the sample variance, s²; and the sample standard deviation, s. The last part of the chapter shows how this descriptive information about the sample can be used to help evaluate whether an individual HR score is relatively high or low, relative to other scores in the sample, and how to set up a CI
estimate for μ using the information from the sample.

Table 2.2 Summary Statistics for Random Sample of N = 9 Heart Rate (HR) Scores

NOTES: Sample mean for HR: M = ∑X/N = 658/9 = 73.11. Sample variance for HR: s² = SS/(N − 1) = 244.89/8 = 30.61. Sample standard deviation for HR: s = √s² = √30.61 = 5.53.

2.3 Sample Mean (M)

A sample mean provides information about the size of a “typical” score in a sample. The interpretation of a sample mean, M, can be worded in several different ways. A sample mean, M, corresponds to the center of a distribution of scores in a sample. It provides us with one kind of information about the size of a typical X score. Scores in a sample can be represented as X1, X2, …, XN, where N is the number of observations or participants and Xi is the score for participant number i. For example, the HR score for a person with the SPSS case record number 2 in Figure 2.1 could be given as X2 = 69. Some textbooks, particularly those that offer more mathematical or advanced treatments of statistics, include subscripts on X scores; in this book, the i subscript is used only when omitting subscripts would create ambiguity about which scores are included in a computation. The sample mean, M, is obtained by summing all the X scores in a sample of N scores and dividing by N, the number of scores:

M = ∑X/N. (2.1)

Adding the scores is a way of summarizing information across all participants. The size of ∑X depends on two things: the magnitudes of the individual X scores and N, the number of scores. If N is held constant and all X scores are positive, ∑X increases if the values of individual X scores are increased. Assuming all X scores are positive, ∑X also increases as N gets larger. To obtain a sample mean that represents the size of a typical score and that is independent of N, we have to correct for sample size by dividing ∑X by N, to yield M, our sample mean. Equation 2.1 is more than just instructions for
computation. It is also a statement or “sentence” that tells us the following:

1. What information is the sample statistic M based on? It is based on the sum of the Xs and the N of cases in the sample.
2. Under what circumstances will the statistic (M) turn out to have a large or small value? M is large when the individual X scores are large and positive. Because we divide by N when computing M to correct for sample size, the magnitude of M is independent of N.

In this chapter, we explore what happens when we use a sample mean, M, based on a random sample of N = 9 cases to estimate the population mean μ (in this case, the entire set of 130 HR scores in the file hr130.sav is the population of interest). The sample of N = 9 randomly selected HR scores appears in the first column of Table 2.2. For the set of N = 9 HR scores shown in Table 2.2, we can calculate the mean by hand: M = 658/9 = 73.11. (Note that the values of sample statistics are usually reported to two decimal places unless the original X scores provide information that is accurate to more than two decimal places.) The SPSS Descriptive Statistics: Frequencies procedure was used to obtain the sample mean and other simple descriptive statistics for the set of scores in the file hr9.sav. On the Data View worksheet, find the Analyze option in the menu bar at the top of the worksheet and click on it. Select Descriptive Statistics from the pull-down menu that appears (as shown in Figure 2.9); this leads to another drop-down menu. Because we want to see a distribution of frequencies and also obtain simple descriptive statistics such as the sample mean, M, click on the Frequencies procedure in this second pull-down menu. The series of menu selections displayed in Figure 2.9 opens the SPSS dialog window for the Descriptive Statistics: Frequencies procedure shown in Figure 2.10. Move the variable name HR from the left-hand panel into the right-hand panel under the
  • 71. heading Variables to indicate that the Frequencies procedure will be performed on scores for the variable HR. Clicking the Statistics button at the bottom of the SPSS Frequencies dialog window opens up the Frequencies: Statistics dialog window; this contains a menu of basic descriptive statistics for quantitative variables (see Figure 2.11). Check box selections can be used to include or omit any of the statistics on this menu. In this example, the following sample statistics were selected: Under the heading Central Tendency, Mean and Sum were selected, and under the heading Dispersion, Standard deviation and Variance were selected. Click Continue to return to the main Frequencies dialog window. When all the desired menu selections have been made, click the OK button to run the analysis for the selected variable, HR. The results from this analysis appear in Figure 2.12. The top panel of Figure 2.12 reports the requested summary statistics, and the bottom panel reports the table of frequencies for each score value included in the sample. The value for the sample mean that appears in the SPSS output in Figure 2.12, M = 73.11, agrees with the numerical value obtained by the earlier calculation. Figure 2.9 SPSS Menu Selections for the Descriptive Statistics and Frequencies Procedures Applied to the Random Sample of N = 9 Heart Rate Scores in the Dataset Named hrsample9.sav How can this value of M = 73.11 be used? If we wanted to estimate or guess any one individual’s HR, in the absence of any other information, the best guess for any randomly selected individual member of this sample of N = 9 persons would be M = 73.11 bpm. Why do we say that the mean M is the “best” prediction for any randomly selected individual score in this sample? It is best because it is the estimate that makes the sum of the prediction errors (i.e., the X – M differences) zero and minimizes the overall sum of squared prediction errors across all participants. 
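The arithmetic behind M and the zero-sum property of the prediction errors can be sketched in a few lines of code (illustrative only; SPSS is the tool used in the text), using the nine HR scores from Table 2.2:

```python
# Nine HR scores randomly sampled from hr130.sav (first column of Table 2.2)
hr = [64, 69, 70, 71, 73, 74, 75, 80, 82]

m = sum(hr) / len(hr)                 # sample mean M = sum(X) / N
print(round(m, 2))                    # 73.11

deviations = [x - m for x in hr]      # prediction errors X - M
# Deviations from the sample mean always sum to 0 (up to floating-point error)
print(abs(sum(deviations)) < 1e-9)    # True
```

This reproduces the value M = 73.11 reported in the SPSS Frequencies output and confirms that the X − M prediction errors sum to zero.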
To see this, reexamine Table 2.2. The second column of Table 2.2 shows the deviation of each score from the sample mean (X
− M), for each of the nine scores in the sample. This deviation from the mean is the prediction error that arises if M is used to estimate that person’s score; the magnitude of the error is given by the difference X − M, the person’s actual HR score minus the sample mean HR, M. For instance, if we use M to estimate Participant 1’s score, the prediction error for Case 1 is (70 − 73.11) = −3.11; that is, Participant 1’s actual HR score is 3.11 points below the estimated value of M = 73.11.

Figure 2.10 The SPSS Dialog Window for the Frequencies Procedure
Figure 2.11 The Frequencies: Statistics Window With Check Box Menu for Requested Descriptive Statistics
Figure 2.12 SPSS Output From Frequencies Procedure for the Sample of N = 9 Heart Rate Scores in the File hrsample9.sav Randomly Selected From the File hr130.sav

How can we summarize information about the magnitude of prediction error across persons in the sample? One approach that might initially seem reasonable is summing the X − M deviations across all the persons in the sample. The sum of these deviations appears at the bottom of the second column of Table 2.2. By definition, the sample mean, M, is the value for which the sum of the deviations across all the scores in a sample equals 0. In that sense, using M to estimate X for each person in the sample results in the smallest possible sum of prediction errors. It can be demonstrated that taking deviations of these X scores from any constant other than the sample mean, M, yields a sum of deviations that is not equal to 0. However, the fact that ∑(X − M) always equals 0 for a sample of data makes this sum uninformative as summary information about the dispersion of scores. We can avoid the problem that the sum of the deviations always equals 0 in a simple manner: If we first square the prediction errors or deviations (i.e., if we square the X − M value for each
  • 73. person, as shown in the third column of Table 2.2) and then sum these squared deviations, the resulting term ∑(X − M)2 is a number that gets larger as the magnitudes of the deviations of individual X values from M increase. There is a second sense in which M is the best predictor of HR for any randomly selected member of the sample. M is the value for which the sum of squared deviations (SS), ∑(X − M)2, is minimized. The sample mean is the best predictor of any randomly selected person’s score because it is the estimate for which prediction errors sum to 0, and it is also the estimate that has the smallest sum of squared prediction errors. The term ordinary least squares (OLS) refers to this criterion; a statistic meets the criterion for best OLS estimator when it minimizes the sum of squared prediction errors. This empirical demonstration1 only shows that ∑(X − M) = 0 for this particular batch of data. An empirical demonstration is not equivalent to a formal proof. Formal proofs for the claim that ∑(X − M) = 0 and the claim that M is the value for which the SS, ∑(X − M)2, is minimized are provided in mathematical statistics textbooks such as deGroot and Schervish (2001). The present textbook provides demonstrations rather than formal proofs. Based on the preceding demonstration (and the proofs provided in mathematical statistics books), the mean is the best estimate for any individual score when we do not have any other information about the participant. Of course, if a researcher can obtain information about the participant’s drug use, smoking, age, gender, anxiety level, aerobic fitness, and other variables that may be predictive of HR (or that may influence HR), better estimates of an individual’s HR may be obtainable by using statistical analyses that take one or more of these predictor variables into account. Two other statistics are commonly used to describe the average or typical score in a sample: the mode and the median. 
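The OLS criterion named above (M is the value that minimizes the sum of squared prediction errors) can be checked numerically for the nine HR scores: shifting the predictor away from the mean in either direction inflates the sum of squared errors. A minimal sketch:

```python
hr = [64, 69, 70, 71, 73, 74, 75, 80, 82]
m = sum(hr) / len(hr)  # sample mean, 73.11

def sse(center):
    # Sum of squared prediction errors when every score is predicted by `center`
    return sum((x - center) ** 2 for x in hr)

# The sum of squared errors is smallest at the mean and grows
# as the predictor moves away from it in either direction.
print(round(sse(m), 2))      # 244.89
print(round(sse(m - 1), 2))  # 253.89
print(round(sse(m + 1), 2))  # 253.89
```

As in the text, this is a demonstration for one batch of data, not a formal proof; the proof appears in mathematical statistics texts.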
The mode is simply the score value that occurs most often. This is not a very useful statistic for this small batch of sample data because each score value occurs only once;
  • 74. no single score value has a larger number of occurrences than other scores. The median is obtained by rank ordering the scores in the sample from lowest to highest and then counting the scores. Here is the set of nine scores from Figure 2.1 and Table 2.2 arranged in rank order: [64, 69, 70, 71, 73, 74, 75, 80, 82] The score that has half the scores above it and half the scores below it is the median; in this example, the median is 73. Because M is computed using ∑X, the inclusion of one or two extremely large individual X scores tends to increase the size of M. For instance, suppose that the minimum score of “64” was replaced by a much higher score of “190” in the set of nine scores above. The mean for this new set of nine scores would be given by However, the median for this new set of nine scores with an added outlier of X = 190, [69, 70, 71, 73, 74, 75, 80, 82, 190], would change to 74, which is still quite close to the original median (without the outlier) of 73. The preceding example demonstrates that the inclusion of one extremely high score typically has little effect on the size of the sample median. However, the presence of one extreme score can make a substantial difference in the size of the sample mean, M. In this sample of N = 9 scores, adding an extreme score of X = 190 raises the value of M from 73.11 to 87.11, but it changes the median by only one point. Thus, the mean is less “robust” to extreme scores or outliers than the median; that is, the value of a sample mean can be changed substantially by one or two extreme scores. It is not desirable for a sample statistic to change drastically because of the presence of one extreme score, of course. When researchers use statistics (such as the mean) that are not very robust to outliers, they need to pay attention to extreme scores when screening the data. Sometimes extreme scores are removed or recoded to avoid situations in which the data for one individual participant have a
  • 75. disproportionately large impact on the value of the mean (see Chapter 4 for a more detailed discussion of identification and treatment of outliers). When scores are perfectly normally distributed, the mean, median, and mode are equal. However, when scores have nonnormal distributions (e.g., when the distribution of scores has a longer tail on the high end), these three indexes of central tendency are generally not equal. When the distribution of scores in a sample is nonnormal (or skewed), the researcher needs to consider which of these three indexes of central tendency is the most appropriate description of the center of a distribution of scores. Despite the fact that the mean is not robust to the influence of outliers, the mean is more widely reported than the mode or median. The most extensively developed and widely used statistical methods, such as analysis of variance (ANOVA), use group means and deviations from group means as the basic building blocks for computations. ANOVA assumes that the scores on the quantitative outcome variable are normally distributed. When this assumption is satisfied, the use of the mean as a description of central tendency yields reasonable results. 2.4 Sum of Squared Deviations (SS) and Sample Variance (s2) The question we want to answer when we compute a sample variance can be worded in several different ways. How much do scores differ among the members of a sample? How widely dispersed are the scores in a batch of data? How far do individual X scores tend to be from the sample mean M? The sample variance provides summary information about the distance of individual X scores from the mean of the sample. Let’s build the formula for the sample variance (denoted by s2) step by step. First, we need to know the distance of each individual X score from the sample mean. To answer this question, a deviation from the mean is calculated for each score as follows (the i subscript indicates that this is done for each person in the
sample—that is, for scores that correspond to person number i for i = 1, 2, 3, …, N). The deviation of person number i’s score from the sample mean is given by Equation 2.2:

Deviation for person i = Xi − M. (2.2)

The value of this deviation for each person in the sample appears in the second column of Table 2.2. The sign of this deviation tells us whether an individual person’s score is above M (if the deviation is positive) or below M (if the deviation is negative). The magnitude of the deviation tells us whether a score is relatively close to, or far from, the sample mean. To obtain a numerical index of variance, we need to summarize information about distance from the mean across subjects. The most obvious approach to summarizing information across subjects would be to sum the deviations from the mean for all the scores in the sample:

∑(Xi − M). (2.3)

As noted earlier, this sum turns out to be uninformative because, by definition, deviations from a sample mean in a batch of sample data sum to 0. We can avoid this problem by squaring the deviation for each subject and then summing the squared deviations. This sum of squares (SS) is an important piece of information that appears in the formulas for many of the more advanced statistical analyses discussed later in this textbook:

SS = ∑(Xi − M)2. (2.4)

What range of values can SS have? SS has a minimum possible value of 0; this occurs in situations where all the X scores in a sample are equal to each other and therefore also equal to M. (Because squaring a deviation must yield a nonnegative number, and SS is a sum of squared deviations, SS cannot be a negative number.) The value of SS has no upper limit. Other factors being equal, SS tends to increase when

1. the number of squared deviations included in the sum increases, or
2. the individual Xi − M deviations get larger in absolute value.
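Both factors just listed can be seen in a short numerical sketch (the small score sets below are made up purely for illustration):

```python
def ss(scores):
    # Sum of squared deviations about the mean of `scores`
    m = sum(scores) / len(scores)
    return sum((x - m) ** 2 for x in scores)

tight = [72, 73, 74]    # deviations of -1, 0, +1 about the mean of 73
spread = [60, 73, 86]   # same mean of 73, but deviations of -13, 0, +13

print(ss(tight))        # 2.0
print(ss(spread))       # 338.0 (larger deviations -> larger SS)
print(ss(tight * 3))    # 6.0   (more squared deviations summed -> larger SS)
```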
A different version of the formula for SS is often given in introductory textbooks:

SS = ∑X2 − (∑X)2/N. (2.5)

Equation 2.5 is a more convenient procedure for by-hand computation of the SS than Equation 2.4 because it involves fewer arithmetic operations and results in less rounding error. This version of the formula also makes it clear that SS depends on both ∑X, the sum of the Xs, and ∑X2, the sum of the squared Xs. Formulas for more complex statistics often include these same terms: ∑X and ∑X2. When these terms (∑X and ∑X2) are included in a formula, their presence implies that the computation takes both the mean and the variance of the X scores into account. These chunks of information are the essential building blocks for the computation of most of the statistics covered later in this book. From Table 2.2, the numerical result for SS = ∑(X − M)2 is 244.89. How can the value of SS be used or interpreted? The minimum possible value of SS occurs when all the X scores are equal to each other and, therefore, equal to M. For example, in the set of scores [73, 73, 73, 73, 73], the SS term would equal 0. However, there is, in practice, no upper limit for the maximum value of SS. SS values tend to be larger when they are based on large numbers of deviations and when the individual X scores have large deviations from the mean, M. To interpret SS as information about variability, we need to correct for the fact that SS tends to be larger when the number of squared deviations included in the sum is large.

2.5 Degrees of Freedom (df) for a Sample Variance

It might seem logical to divide SS by N to correct for the fact that the size of SS gets larger as N increases. However, the computation SS/N produces a sample variance that is a biased estimate of the population variance; that is, the sample statistic SS/N tends to be smaller than σ2, the true population variance. This can be empirically demonstrated by taking hundreds of small samples from a population, computing a value of s2 for
  • 78. each sample by using the formula s2 = SS/N, and tabulating the obtained values of s2. When this experiment is performed, the average of the sample s2 values turns out to be smaller than the population variance, σ2.2 This is called bias in the size of s2; s2 calculated as SS/N is smaller on average than σ2, and thus, it systematically underestimates σ2. SS/N is a biased estimate because the SS term is actually based on fewer than N independent pieces of information. How many independent pieces of information is the SS term actually based on? Let’s reconsider the batch of HR scores for N = 9 people and the corresponding deviations from the mean; these deviations appear in column 2 of Table 2.2. As mentioned earlier, for this batch of data, the sum of deviations from the sample mean equals 0; that is, ∑(Xi − M) = −3.11 − 2.11 + .89 + 6.89 − .11 + 1.89 + 8.89 − 9.11 − 4.11 = 0. In general, the sum of deviations of sample scores from the sample mean, ∑(Xi – M), always equals 0. Because of the constraint that ∑(X − M) = 0, only the first N − 1 values (in this case, 8) of the X − M deviation terms are “free to vary.” Once we know any eight deviations for this batch of data, we can deduce what the remaining ninth deviation must be; it has to be whatever value is needed to make ∑(X − M) = 0. For example, once we know that the sum of the deviations from the mean for Persons 1 through 8 in this sample of nine HR scores is +4.11, we know that the deviation from the mean for the last remaining case must be −4.11. Therefore, we really have only N − 1 (in this case, 8) independent pieces of information about variability in our sample of 9 subjects. The last deviation does not provide new information. The number of independent pieces of information that a statistic is based on is called the degrees of freedom, or df. For a sample variance for a set of N scores, df = N − 1. The SS term is based on only N − 1 independent deviations from the sample mean. 
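The sampling experiment described above is easy to simulate. The sketch below draws repeated samples of N = 9 from a hypothetical normally distributed HR population (the population parameters and sample counts are arbitrary choices for illustration) and compares the average of SS/N with the average of SS/(N − 1):

```python
import random

random.seed(1)  # fixed seed so the demonstration is reproducible

# A hypothetical HR population, roughly normal with mean 73 and SD 6
population = [random.gauss(73, 6) for _ in range(10_000)]
pop_mean = sum(population) / len(population)
# Population variance uses N in the divisor because all scores are in hand
pop_var = sum((x - pop_mean) ** 2 for x in population) / len(population)

n, reps = 9, 5_000
biased, unbiased = [], []
for _ in range(reps):
    sample = random.sample(population, n)
    m = sum(sample) / n
    ss = sum((x - m) ** 2 for x in sample)
    biased.append(ss / n)           # SS/N: tends to underestimate the variance
    unbiased.append(ss / (n - 1))   # SS/df: the degrees-of-freedom correction

avg_biased = sum(biased) / reps
avg_unbiased = sum(unbiased) / reps
# On average, SS/N falls short of the population variance by roughly
# the factor (n - 1)/n = 8/9, while SS/(n - 1) lands close to it.
print(round(pop_var, 1), round(avg_biased, 1), round(avg_unbiased, 1))
```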
It can be demonstrated empirically and proved formally that computing the sample variance by dividing the SS term by N results in a sample variance that systematically underestimates the true population variance. This underestimation or bias can
be corrected by using the degrees of freedom as the divisor. The preferred (unbiased) formula for computation of a sample variance for a set of X scores is thus

s2 = SS/(N − 1) = ∑(X − M)2/(N − 1).

Whenever a sample statistic is calculated using sums of squared deviations, it has an associated degrees of freedom that tells us how many independent deviations the statistic is based on. These df terms are used to compute statistics such as the sample variance and, later, to decide which distribution (in the family of t distributions, for example) should be used to look up critical values for statistical significance tests. For this hypothetical batch of nine HR scores, the deviations from the mean appear in column 2 of Table 2.2; the squared deviations appear in column 3 of Table 2.2; the SS is 244.89; df = N − 1 = 8; and the sample variance, s2, is 244.89/8 = 30.61. This agrees with the value of the sample variance in the SPSS output from the Frequencies procedure in Figure 2.12. It is useful to think about situations that would make the sample variance s2 take on larger or smaller values. The smallest possible value of s2 occurs when all the scores in the sample have the same value; for example, the set of scores [73, 73, 73, 73, 73, 73, 73, 73, 73] would have a variance s2 = 0. The value of s2 would be larger for a sample in which individual deviations from the sample mean are relatively large, for example, [44, 52, 66, 97, 101, 119, 120, 135, 151], than for a set of scores such as [72, 73, 72, 71, 71, 74, 70, 73], where individual deviations from the mean are relatively small. The value of the sample variance, s2, has a minimum of 0. There is, in practice, no fixed upper limit for values of s2; values increase as the distances between individual scores and the sample mean increase.
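As a check on the arithmetic, the definitional formula ∑(X − M)2 and the computational shortcut ∑X2 − (∑X)2/N give the same SS for the nine HR scores, and dividing by df = N − 1 reproduces the variance reported in Table 2.2. A minimal sketch:

```python
hr = [64, 69, 70, 71, 73, 74, 75, 80, 82]
n = len(hr)
m = sum(hr) / n

ss_definitional = sum((x - m) ** 2 for x in hr)                # sum((X - M)^2)
ss_computational = sum(x * x for x in hr) - sum(hr) ** 2 / n   # sum(X^2) - (sum(X))^2 / N

sample_variance = ss_definitional / (n - 1)                    # s2 = SS / df, df = N - 1
print(round(ss_definitional, 2))   # 244.89
print(round(ss_computational, 2))  # 244.89
print(round(sample_variance, 2))   # 30.61
```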
The sample variance s2 = 30.61 is in “squared HR in beats per minute.” We will want to have information about dispersion that is in terms of HR (rather than HR squared); this next step in the development of sample statistics is discussed in Section 2.7. First, however, let’s consider an important question: Why is there variance? Why do
  • 80. researchers want to know about variance? 2.6 Why Is There Variance? The best question ever asked by a student in my statistics class was, “Why is there variance?” This seemingly naive question is actually quite profound; it gets to the heart of research questions in behavioral, educational, medical, and social science research. The general question of why is there variance can be asked specifically about HR: Why do some people have higher and some people lower HR scores than average? Many factors may influence HR—for example, family history of cardiovascular disease, gender, smoking, anxiety, caffeine intake, and aerobic fitness. The initial question that we consider when we compute a variance for our sample scores is, How much variability of HR is there across the people in our study? In subsequent analyses, researchers try to account for at least some of this variability by noting that factors such as gender, smoking, anxiety, and caffeine use may be systematically related to and therefore predictive of HR. In other words, the question of why is there variance in HR can be partially answered by noting that people have varying exposure to all sorts of factors that may raise or lower HR, such as aerobic fitness, smoking, anxiety, and caffeine consumption. Because people experience different genetic and environmental influences, they have different HRs. A major goal of research is to try to identify the factors that predict (or possibly even causally influence) each individual person’s score on the variable of interest, such as HR. Similar questions can be asked about all attributes that vary across people or other subjects of study; for example, Why do people have differing levels of anxiety, satisfaction with life, body weight, or salary? 
The implicit model that underlies many of the analyses discussed later in this textbook is that an observed score can be broken down into components and that each component of the score is systematically associated with a different predictor variable. Consider Participant 7 (let’s call him Joe), with an HR
  • 81. of 82 bpm. If we have no information about Joe’s background, a reasonable initial guess would be that Joe’s HR is equal to the mean resting HR for the sample, M = 73.11. However, let’s assume that we know that Joe smokes cigarettes and that we know that cigarette smoking tends to increase HR by about 5 bpm. If Joe is a smoker, we might predict that his HR would be 5 points higher than the population mean of 73.11 (73.11, the overall mean, plus 5 points, the effect of smoking on HR, would yield a new estimate of 78.11 for Joe’s HR). Joe’s actual HR (82) is a little higher than this predicted value (78.11), which combines information about what is average for most people with information about the effect of smoking on HR. An estimate of HR that is based on information about only one predictor variable (in this example, smoking) probably will not be exactly correct because many other factors are likely to influence Joe’s HR (e.g., body weight, family history of cardiovascular disease, drug use). These other variables that are not included in the analysis are collectively called sources of “error.” The difference between Joe’s actual HR of 82 and his predicted HR of 78.11 (82 − 78.11 = +3.89) is a prediction error. Perhaps Joe’s HR is a little higher than we might predict based on overall average HR and Joe’s smoking status because Joe has poor aerobic fitness or was anxious when his HR was measured. It might be possible to reduce this prediction error to a smaller value if we had information about additional variables (such as aerobic fitness and anxiety) that are predictive of HR. Because we do not know all the factors that influence or predict Joe’s HR, a predicted HR based on just a few variables is generally not exactly equal to Joe’s actual HR, although it may be a better estimate of his HR than we would have if we just used the sample mean to estimate his score. 
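The Joe example can be written out as explicit arithmetic. The 5-bpm smoking effect is the text's illustrative figure, not a value estimated from data:

```python
grand_mean = 73.11        # sample mean HR: the best guess with no other information
smoking_effect = 5        # illustrative adjustment for smokers (assumed in the text)
joe_actual = 82           # Joe's observed HR

joe_predicted = grand_mean + smoking_effect      # prediction using one predictor
prediction_error = joe_actual - joe_predicted    # residual left unexplained
print(round(joe_predicted, 2))    # 78.11
print(round(prediction_error, 2)) # 3.89
```

Adding further predictors (fitness, anxiety, and so on) would, if they carry real information, shrink this residual; that is the logic behind the analyses in later chapters.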
Statistical analyses covered in later chapters will provide us with a way to “take scores apart” into components that represent how much of the HR score is associated with each predictor variable. In other words, we can “explain” why Joe’s HR of 82 is 8.89 points higher than the sample mean of 73.11 by
identifying parts of Joe’s HR score that are associated with, and predictable from, specific variables such as smoking, aerobic fitness, and anxiety. More generally, a goal of statistical analysis is to show that we can predict whether individuals tend to have high or low scores on an outcome variable of interest (such as HR) from scores on a relatively small number of predictor variables. We want to explain or account for the variance in HR by showing that some components of each person’s HR score can be predicted from his or her scores on other variables.

2.7 Sample Standard Deviation (s)

An inconvenient property of the sample variance calculated in Section 2.5 (s2 = 30.61) is that it is given in squared HR rather than in the original units of measurement. The original scores were measures of HR in beats per minute, and it would be easier to talk about typical distances of individual scores from the mean if we had a measure of dispersion that was in the original units of measurement. To describe how far a typical subject’s HR is from the sample mean, it is helpful to convert the information about dispersion contained in the sample variance, s2, back into the original units of measurement (scores on HR rather than HR squared). To obtain the sample standard deviation (s), we take the square root of the variance. The formula used to compute the sample standard deviation (which estimates the population standard deviation, σ) is as follows:

s = √s2 = √[SS/(N − 1)].

For the set of N = 9 HR scores given above, the variance was 30.61; the sample standard deviation s is the square root of this value, 5.53. The sample standard deviation, s = 5.53, tells us something about typical distances of individual X scores from the mean, M. Note that the numerical estimate of the sample standard deviation, s, obtained from this computation agrees with the value of s reported in the SPSS output from the Frequencies procedure that appears in Figure 2.12.
  • 83. How can we use the information that we obtain from sample values of M and s? If we know that scores are normally distributed, and we have values for the sample mean and standard deviation, we can work out an approximate range that is likely to include most of the score values in the sample. Recall from Chapter 1 that in a normal distribution, about 95% of the scores lie within ±1.96 standard deviations from the mean. For a sample with M = 73.11 and s = 5.53, if we assume that HR scores are normally distributed, an estimated range that should include most of the values in the sample is obtained by finding M ± 1.96 × s. For this example, 73.11 ± (1.96 × 5.53) = 73.11 ± 10.84; this is a range from 62.27 to 83.95. These values are fairly close to the actual minimum (64) and maximum (82) for the sample. The approximation of range obtained by using M and s tends to work much better when the sample has a larger N of participants and when scores are normally distributed within the sample. What we know at this point is that the average for HR was about 73 bpm and that the range of HR in this sample was from 64 to 82 bpm. Later in the chapter, we will ask, How can we use this information from the sample (M and s) to estimate μ, the mean HR for the entire population? However, several additional issues need to be considered before we take on the problem of making inferences about μ, the unknown population mean. These are discussed in the next few sections. 2.8 Assessment of Location of a Single X Score Relative to a Distribution of Scores We can use the mean and standard deviation of a population, if these are known (μ and σ, respectively), or the mean and standard deviation for a sample (M and s, respectively) to evaluate the location of a single X score (relative to the other scores in a population or a sample). First, let’s consider evaluating a single X score relative to a population for which the mean and standard deviation, μ and σ, respectively, are known. 
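The back-of-envelope range described above (M ± 1.96s, valid under a normality assumption) can be verified in a few lines:

```python
m, s = 73.11, 5.53        # sample mean and standard deviation of the nine HR scores

half_width = 1.96 * s     # about 10.84
low, high = m - half_width, m + half_width
print(round(low, 2))      # 62.27
print(round(high, 2))     # 83.95
```

As noted in the text, these limits are close to the observed sample minimum (64) and maximum (82); the approximation improves with larger N and more nearly normal scores.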
In real-life research situations, researchers rarely have this information. One clear example of a
  • 84. real-life situation where the values of μ and σ are known to researchers involves scores on standardized tests such as the Wechsler Adult Intelligence Scale (WAIS). Suppose you are told that an individual person has received a score of 110 points on the WAIS. How can you interpret this score? To answer this question, you need to know several things. Does this score represent a high or a low score relative to other people who have taken the test? Is it far from the mean or close to the mean of the distribution of scores? Is it far enough above the mean to be considered “exceptional” or unusual? To evaluate the location of an individual score, you need information about the distribution of the other scores. If you have a detailed frequency table that shows exactly how many people obtained each possible score, you can work out an exact percentile rank (the percentage of test takers who got scores lower than 110) using procedures that are presented in detail in introductory statistics books. When the distribution of scores has a normal shape, a standard score or z score provides a good description of the location of that single score relative to other people’s scores without the requirement for complete information about the location of every other individual score. In the general population, scores on the WAIS intelligence quotient (IQ) test have been scaled so that they are normally distributed with a mean μ = 100 and a standard deviation σ of 15. The first thing you might do to assess an individual score is to calculate the distance from the mean—that is, X − μ (in this example, 110 − 100 = +10 points). This result tells you that the score is above average (because the deviation has a positive sign). But it does not tell whether 10 points correspond to a large or a small distance from the mean when you consider the variability or dispersion of IQ scores in the population. 
To obtain an index of distance from the mean that is “unit free” or standardized, we compute a z score; we divide the deviation from the mean (X − μ) by the standard deviation of the population scores (σ) to find the distance of the X score from the mean in numbers of standard deviations, as shown in Equation 2.8:

z = (X − μ)/σ. (2.8)
  • 85. If the z transformation is applied to every X score in a normally distributed population, the shape of the distribution of scores does not change, but the mean of the distribution is changed to 0 (because we have subtracted μ from each score), and the standard deviation is changed to 1 (because we have divided deviations from the mean by σ). Each z score now represents how far an X score is from the mean in “standard units”—that is, in terms of the number of standard deviations. The mapping of scores from a normally shaped distribution of raw scores, with a mean of 100 and a standard deviation of 15, to a standard normal distribution, with a mean of 0 and a standard deviation of 1, is illustrated in Figure 2.13. For a score of X = 110, z = (110 − 100)/15 = +.67. Thus, an X score of 110 IQ points corresponds to a z score of +.67, which corresponds to a distance of two thirds of a standard deviation above the population mean. Recall from the description of the normal distribution in Chapter 1 that there is a fixed relationship between distance from the mean (given as a z score, i.e., numbers of standard deviations) and area under the normal distribution curve. We can deduce approximately what proportion or percentage of people in the population had IQ scores higher (or lower) than 110 points by (a) finding out how far a score of 110 is from the mean in standard score or z score units and (b) looking up the areas in the normal distribution that correspond to the z score distance from the mean. Figure 2.13 Mapping of Scores From a Normal Distribution of Raw IQ Scores (With μ = 100 and σ = 15) to a Standard Normal Distribution (With μ = 0 and σ = 1) The proportion of the area of the normal distribution that corresponds to outcomes greater than z = +.67 can be evaluated by looking up the area that corresponds to the obtained z value in the table of the standard normal distribution in Appendix A. The obtained value of z (+.67) and the corresponding areas
appear in the three columns on the right-hand side of the first page of the standard normal distribution table, about eight lines from the top. Area C corresponds to the proportion of area under a normal curve that lies to the right of z = +.67; from the table, area C = .2514. Thus, about 25% of the area in the normal distribution lies above z = +.67. The areas for sections of the normal distribution are interpretable as proportions; if they are multiplied by 100, they can be interpreted as percentages. In this case, we can say that the proportion of the population that had z scores equal to or above +.67 and/or IQ scores equal to or above 110 points was .2514. Equivalently, we could say that 25.14% of the population had IQ scores equal to or above 110. Note that the table in Appendix A can also be used to assess the proportion of cases that lie below z = +.67. The proportion of area in the lower half of the distribution (from z = –∞ to z = .00) is .50. The proportion of area that lies between z = .00 and z = +.67 is shown in column B (area = .2486) of the table. To find the total area below z = +.67, these two areas are summed: .5000 + .2486 = .7486. If this value is rounded to two decimal places and multiplied by 100 to convert the information into a percentage, it implies that about 75% of persons in the population had IQ scores below 110. This tells us that a score of 110 is above average, although it is not an extremely high score. Consider another possible IQ score. If a person has an IQ score of 145, that person’s z score is (145 − 100)/15 = +3.00. This person scored 3 standard deviations above the mean. The proportion of the area of a normal distribution that lies above z = +3.00 is .0013. That is, only about 1 in 1,000 people have z scores greater than or equal to +3.00 (which would correspond to IQs greater than or equal to 145).
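The z-score-to-area lookups described above can be sketched with Python’s standard library. This is a hedged illustration, not part of the text’s SPSS workflow: `statistics.NormalDist` stands in for the printed table in Appendix A, and small discrepancies with tabled values reflect the table’s rounding of z to two decimals.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Unit-free distance of x from the mean, in standard deviations (Equation 2.8)."""
    return (x - mu) / sigma

std_normal = NormalDist(0, 1)

# IQ example from the text: mu = 100, sigma = 15, X = 110
z = z_score(110, 100, 15)            # about +0.67 (two thirds of an SD above the mean)
area_above = 1 - std_normal.cdf(z)   # proportion scoring above 110; the table gives .2514
area_below = std_normal.cdf(z)       # proportion scoring below 110; about .75

# The conventional two-tailed 5% cutoff used later in the chapter:
z_crit = std_normal.inv_cdf(0.975)   # about 1.96
```

Because the exact z here is .6667 rather than the rounded .67, `area_above` comes out near .2525; the tabled value .2514 corresponds to the rounded z.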
By convention, scores that fall in the most extreme 5% of a distribution are regarded as extreme, unusual, exceptional, or unlikely. (While 5% is the most common criterion for “extreme,” sometimes researchers choose to look at the most extreme 1% or .1%.) Because the most extreme 5% (combining the outcomes at both the upper and the lower extreme ends of
the distribution) is so often used as a criterion for an “unusual” or “extreme” outcome, it is useful to remember that 2.5% of the area in a normal distribution lies below z = −1.96, and 2.5% of the area in a normal distribution lies above z = +1.96. When the areas in the upper and lower tails are combined, the most extreme 5% of the scores in a normal distribution correspond to z values ≤ −1.96 and ≥ +1.96. Thus, anyone whose score on a test yields a z score greater than 1.96 in absolute value might be judged “extreme” or unusual. For example, a person whose test score corresponds to a value of z that is greater than +1.96 is among the top 2.5% of all test scorers in the population. 2.9 A Shift in Level of Analysis: The Distribution of Values of M Across Many Samples From the Same Population At this point in the discussion, we need to make a major shift in thinking. Up to this point, the discussion has examined the distributions of individual X scores in populations and in samples. We can describe the central tendency or average score by computing a mean; we describe the dispersion of individual X scores around the mean by computing a standard deviation. We now move to a different level of analysis: We will ask analogous questions about the behavior of the sample mean, M; that is, What is the average value of M across many samples, and how much does the value of M vary across samples? It may be helpful to imagine this as a sort of “thought experiment.” In actual research situations, a researcher usually has only one sample. The researcher computes a mean and a variance for the data in that one sample, and often the researcher wants to use the mean and variance from one sample to make inferences about (or estimates of) the mean and variance of the population from which the sample was drawn.
Note, however, that the single sample mean, M, reported for a random sample of N = 9 cases from the hr130 file (M = 73.11) was not exactly equal to the population mean μ of 73.76 (in Figure 2.5). The difference M − μ (in this case, 73.11 − 73.76) represents an estimation error; if we used the sample mean value M = 73.11 to estimate the population mean of μ = 73.76,
in this instance, our estimate will be off by 73.11 − 73.76 = −.65. It is instructive to stop and think, Why was the value of M in this one sample different from the value of μ? It may be useful for the reader to repeat this sampling exercise. Using the <Data> → <Select Cases> → <Random> SPSS menu selections, as shown in Figures 2.6 and 2.7 earlier, each member of the class might draw a random sample of N = 9 cases from the file hr130.sav and compute the sample mean, M. If students report their values of M to the class, they will see that the value of M differs across their random samples. If the class sets up a histogram to summarize the values of M that are obtained by class members, this is a “sampling distribution” for M—that is, a set of different values for M that arise when many random samples of size N = 9 are selected from the same population. Why is it that no two students obtain the same answer for the value of M? 2.10 An Index of Amount of Sampling Error: The Standard Error of the Mean (σM) Different samples drawn from the same population typically yield different values of M because of sampling error. Just by “luck of the draw,” some random samples contain one or more individuals with unusually low or high scores on HR; for those samples, the value of the sample mean, M, will be lower (or higher) than the population mean, μ. The question we want to answer is, How much do values of M, the sample mean, tend to vary across different random samples drawn from the same population, and how much do values of M tend to differ from the value of μ, the population mean that the researcher wants to estimate? It turns out that we can give a precise answer to this question. That is, we can quantify the magnitude of sampling error that arises when we take hundreds of different random samples (of the same size, N) from the same population.
It is useful to have information about the magnitude of sampling error; we will need this information later in this chapter to set up CIs, and we will also use this information in later chapters to set up statistical significance tests.
The outcome for this distribution of values of M—that is, the sampling distribution of M—is predictable from the central limit theorem. A reasonable statement of this theorem is provided by Jaccard and Becker (2002): Given a population [of individual X scores] with a mean of μ and a standard deviation of σ, the sampling distribution of the mean [M] has a mean of μ and a standard deviation [generally called the “[population] standard error,” σM] of σM = σ/√N and approaches a normal distribution as the sample size on which it is based, N, approaches infinity. (p. 189) For example, an instructor using the entire dataset hr130.sav can compute the population mean μ = 73.76 and the population standard deviation σ = 7.062 for this population of 130 scores. If the instructor asks each student in the class to draw a random sample of N = 9 cases, the instructor can use the central limit theorem to predict the distribution of outcomes for M that will be obtained by class members. (This prediction will work well for large classes; e.g., in a class of 300 students, there are enough different values of the sample mean to obtain a good description of the sampling distribution; for classes smaller than 30 students, the outcomes may not match the predictions from the central limit theorem very closely.)
When hundreds of class members bring in their individual values of M, mean HR (each based on a different random sample of N = 9 cases), the instructor can confidently predict that when all these different values of M are evaluated as a set, they will be approximately normally distributed with a mean close to 73.76 bpm (the population mean) and with a standard deviation or standard error, σM, of σ/√N = 7.062/√9 = 2.35 bpm. The middle 95% of the sampling distribution of M should lie within the range μ − 1.96σM and μ + 1.96σM; in this case, the instructor would predict that about 95% of the values of M obtained by class members should lie approximately within the range between 73.76 − 1.96 × 2.35 and 73.76 + 1.96 × 2.35, that is, mean HR between 69.15 and 78.37 bpm. On the other hand, about 2.5% of
students are expected to obtain sample mean M values below 69.15, and about 2.5% of students are expected to obtain sample mean M values above 78.37. In other words, before the students go through all the work involved in actually drawing hundreds of samples and computing a mean M for each sample and then setting up a histogram and frequency table to summarize the values of M across the hundreds of class members, the instructor can anticipate the outcome; while the instructor cannot predict which individual students will obtain unusually high or low values of M, the instructor can make a fairly accurate prediction about the range of values of M that most students will obtain. The fact that we can predict the outcome of this time-consuming experiment on the behavior of the sample statistic M based on the central limit theorem means that we do not, in practice, need to actually obtain hundreds of samples from the same population to estimate the magnitude of sampling error, σM. We only need to know the values of σ and N and to apply the central limit theorem to obtain fairly precise information about the typical magnitude of sampling error. The difference between each individual student’s value of M and the population mean, μ, is attributable to sampling error. When we speak of sampling error, we do not mean that the individual student has necessarily done something wrong (although students could make mistakes while computing M from a set of scores). Rather, sampling error represents the differences between the values of M and μ that arise just by chance. When individual students carry out all the instructions for the assignment correctly, most students obtain values of M that differ from μ by relatively small amounts, and a few students obtain values of M that are quite far from μ.
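The class “thought experiment” just described is easy to simulate. The sketch below is hypothetical (the hr130.sav scores themselves are not reproduced here, so a normal population with μ = 73.76 and σ = 7.062 stands in for them): it draws repeated random samples of N = 9 and compares the spread of the sample means with the theoretical standard error σ/√N.

```python
import random
from statistics import mean, pstdev

random.seed(1)
MU, SIGMA, N, REPS = 73.76, 7.062, 9, 5000

# Each simulated "student" draws a random sample of N cases and reports its mean M.
sample_means = [mean(random.gauss(MU, SIGMA) for _ in range(N)) for _ in range(REPS)]

center = mean(sample_means)           # should land near mu = 73.76
empirical_se = pstdev(sample_means)   # spread of M across the 5,000 samples
theoretical_se = SIGMA / N ** 0.5     # sigma_M = sigma / sqrt(N), about 2.35
```

With 5,000 simulated samples, the empirical standard error typically agrees with σ/√N ≈ 2.35 to about two decimal places, and roughly 95% of the M values fall between 69.15 and 78.37 bpm, as the text predicts.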
Prior to this section, the statistics that have been discussed (such as the sample mean, M, and the sample standard deviation, s) have described the distribution of individual X scores. Beginning in this section, we use the population standard error of the mean, σM, to describe the variability of a
sample statistic (M) across many samples. The standard error of the mean describes the variability of the distribution of values of M that would be obtained if a researcher took thousands of samples from one population, computed M for each sample, and then examined the distribution of values of M; this distribution of many different values of M is called the sampling distribution for M. 2.11 Effect of Sample Size (N) on the Magnitude of the Standard Error (σM) When the instructor sets up a histogram of the M values for hundreds of students, the shape of this distribution is typically close to normal; the mean of the M values is close to μ, the population mean, and the standard error (essentially, the standard deviation) of this distribution of M values is close to the theoretical value given by σM = σ/√N. Refer back to Figure 2.5 to see the histogram for the entire population of 130 HR scores. Because this population of 130 observations is small, we can calculate the population mean μ = 73.76 and the population standard deviation σ = 7.062 (these statistics appeared along with the histogram in Figure 2.5). Suppose that each student in an extremely large class (500 class members) draws a sample of size N = 9 and computes a mean M for this sample; the values of M obtained by 500 members of the class would be normally distributed and centered at μ = 73.76, with σM = 7.062/√9 = 2.35, as shown in Figure 2.15. When comparing the distribution of individual X scores in Figure 2.5 with the distribution of values of M based on 500 samples each with an N of 9 in Figure 2.15, the key thing to note is that they are both centered at the same value of μ (in this case, 73.76), but the variance or dispersion of the distribution of M values is less than the variance of the individual X scores. In general, as N (the size of each sample) increases, the variance of the M values across samples decreases.
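The shrinking dispersion as N grows follows directly from σM = σ/√N. A quick sketch, using the population σ = 7.062 from the heart-rate data, tabulates the theoretical standard error for each of the sample sizes used in the figures that follow:

```python
SIGMA = 7.062  # population standard deviation of the 130 heart-rate scores

for n in (1, 4, 9, 25, 64):
    sigma_m = SIGMA / n ** 0.5
    print(f"N = {n:2d}  ->  sigma_M = {sigma_m:.3f} bpm")
```

Note the square-root law: going from N = 4 to N = 64 (a 16-fold increase in sample size) cuts σM by a factor of only 4, from 3.531 to 0.883 bpm.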
Recall that σM is computed as σ/√N. It is useful to examine this formula and to ask, Under what circumstances will σM be larger or smaller? For any fixed value of N, this equation says that as σ increases, σM also increases. In other words, when there is an increase in the variance of the original individual X scores, it is intuitively obvious that random samples are more likely to include extreme scores, and these extreme scores in the samples will produce sample values of M that are farther from μ. For any fixed value of σ, as N increases, the value of σM will decrease. That is, as the number of cases (N) in each sample increases, the estimate of M for any individual sample tends to be closer to μ. This should seem intuitively reasonable; larger samples tend to yield sample means that are better estimates of μ—that is, values of M that tend to be closer to μ. When N = 1, σM = σ; that is, for samples of size 1, the standard error is the same as the standard deviation of the individual X scores. Figures 2.14 through 2.17 illustrate that as the N per sample is increased, the dispersion of values of M in the sampling distributions continues to decrease in a predictable way. The numerical values of the standard errors for the histograms shown in Figures 2.14 through 2.17 are approximately equal to the theoretical values of σM computed from σ and N: σM = 7.062/√4 = 3.53 for N = 4, σM = 7.062/√9 = 2.35 for N = 9, σM = 7.062/√25 = 1.41 for N = 25, and σM = 7.062/√64 = 0.88 for N = 64. Figure 2.14 The Sampling Distribution of 500 Sample Means, Each Based on an N of 4, Drawn From the Population of 130 Heart Rate Scores in the hr130.sav Dataset Figure 2.15 The Sampling Distribution of 500 Sample Means, Each Based on an N of 9, Drawn From the Population of 130 Heart Rate Scores in the hr130.sav Dataset The standard error, σM, provides information about the predicted dispersion of sample means (values of M) around μ (just as σ provided information about the dispersion of individual X scores around μ). We want to know the typical magnitude of differences between
M, an individual sample mean, and μ, the population mean, that we want to estimate using the value of M from a single sample. When we use M to estimate μ, the difference between these two values (M − μ) is an estimation error. Recall that σ, the standard deviation for a population of X scores, provides summary information about the distances between individual X scores and μ, the population mean. In a similar way, the standard error of the mean, σM, provides summary information about the distances between M and μ, and these distances correspond to the estimation error that arises when we use individual sample M values to try to estimate μ. We hope to make the magnitudes of estimation errors, and therefore the magnitude of σM, small. Information about the magnitudes of estimation errors helps us to evaluate how accurate or inaccurate our sample statistics are likely to be as estimates of population parameters. Information about the magnitude of sampling errors is used to set up CIs and to conduct statistical significance tests. Because the sampling distribution of M has a normal shape (and σM is the “standard deviation” of this distribution) and we know from Chapter 1 (Figure 1.4) that 95% of the area under a standard normal distribution lies between z = −1.96 and z = +1.96, we can reason that approximately 95% of the means of random samples of size N drawn from a normally distributed population of X scores, with a mean of μ and standard deviation of σ, should fall within a range given by μ − 1.96 × σM and μ + 1.96 × σM. Figure 2.16 The Sampling Distribution of 500 Sample Means, Each Based on an N of 25, Drawn From the Population of 130 Heart Rate Scores in the hr130.sav Dataset 2.12 Sample Estimate of the Standard Error of the Mean (SEM) The preceding section described the sampling distribution of M in situations where the value of the population standard deviation, σ, is known.
In most research situations, the population mean and standard deviation are not known; instead, they are estimated by using information from the sample. We
can estimate σ by using the sample value of the standard deviation; in this textbook, as in most other statistics textbooks, the sample standard deviation is denoted by s. Many journals, including those published by the American Psychological Association, use SD as the symbol for the sample standard deviations reported in journal articles. Figure 2.17 The Sampling Distribution of 500 Sample Means, Each Based on an N of 64, Drawn From the Population of 130 Heart Rate Scores in the hr130.sav Dataset Earlier in this chapter, we sidestepped the problem of working with populations whose characteristics are unknown by arbitrarily deciding that the set of 130 scores in the file named hr130.sav was the “population of interest.” For this dataset, the population mean, μ, and standard deviation, σ, can be obtained by having SPSS calculate these values for the entire set of 130 scores that are defined as the population of interest. However, in many real-life research problems, researchers do not have information about all the scores in the population of interest, and they do not know the population mean, μ, and standard deviation, σ. We now turn to the problem of evaluating the magnitude of estimation error in the more typical real-life situation, where a researcher has one sample of data of size N and can compute a sample mean, M, and a sample standard deviation, s, but does not know the values of the population parameters μ or σ. The researcher will want to estimate μ using the sample M from just one sample. The researcher wants to have a reasonably clear idea of the magnitude of estimation error that can be expected when the mean from one sample of size N is used to estimate μ, the mean of the corresponding population. When σ, the population standard deviation, is not known, we cannot find the value of σM.
Instead, we calculate an estimated standard error (SEM), using the sample standard deviation s to replace the unknown value of σ in the formula for the standard error of the mean, as follows (when σ is known): σM = σ/√N.
When σ is unknown, we use s to estimate σ and relabel the resulting standard error to make it clear that it is now based on information about sample variability rather than population variability of scores: SEM = s/√N. The substitution of the sample statistic s as an estimate of the population σ introduces additional sampling error. Because of this additional sampling error, we can no longer use the standard normal distribution to evaluate areas that correspond to distances from the mean. Instead, a family of distributions (called t distributions) is used to find areas that correspond to distances from the mean. Thus, when σ is not known, we use the sample value of SEM to estimate σM, and because this substitution introduces additional sampling error, the shape of the sampling distribution changes from a normal distribution to a t distribution. When the standard deviation from a sample (s) is used to estimate σ, the sampling distribution of M has the following characteristics: 1. It is distributed as a t distribution with df = N − 1. 2. It is centered at μ. 3. The estimated standard error is SEM = s/√N. 2.13 The Family of t Distributions The family of “t” distributions is essentially a set of “modified” normal distributions, with a different t distribution for each value of df (or N). Like the standard normal distribution, a t distribution is scaled so that t values are unit free. As N and df decrease, assuming that other factors remain constant, the magnitude of sampling error increases, and the required amount of adjustment in distribution shape also increases. A t distribution (like a normal distribution) is bell shaped and symmetrical; however, as the N and df decrease, t distributions become flatter in the middle compared with a normal distribution, with thicker tails (that is, heavier tails than the normal distribution). Thus,
when we have a small df value, such as df = 3, the distance from the mean that corresponds to the middle 95% of the t distribution is larger than the corresponding distance in a normal distribution. As the value of df increases, the shape of the t distribution becomes closer to that of a normal distribution; for df > 100, a t distribution is essentially identical to a normal distribution. Figure 2.18 shows t distributions for df values of 3, 6, and ∞. As df increases, the shape of the t distribution converges toward the normal distribution; a t distribution with df > 100 is essentially indistinguishable from a normal distribution. Figure 2.18 Graph of the t Distribution for Three Different df Values (df = 3, 6, and Infinity, or ∞) SOURCE: www.psychstat.missouristate.edu/introbook/sbk24.htm For a research situation where the sample mean is based on N = 7 cases, df = N − 1 = 6. In this case, the sampling distribution of the mean would have the shape described by a t distribution with 6 df; a table for the distribution with df = 6 would be used to look up the values of t that cut off the top and bottom 2.5% of the area. The area that corresponds to the middle 95% of the t distribution with 6 df can be obtained either from the table of the t distribution in Appendix B or from the diagram in Figure 2.18. When df = 6, 2.5% of the area in the t distribution lies below t = −2.45, 95% of the area lies between t = −2.45 and t = +2.45, and 2.5% of the area lies above t = +2.45. 2.14 Confidence Intervals 2.14.1 The General Form of a CI When a single value of M in a sample is reported as an estimate of μ, it is called a point estimate. An interval estimate (CI) makes use of information about sampling error. A CI is reported by giving a lower limit and an upper limit for likely values of μ that correspond to some probability or level of confidence that, across many samples, the CI will include the actual population
mean μ. The level of “confidence” is an arbitrarily selected probability, usually 90%, 95%, or 99%. The computations for a CI make use of the reasoning, discussed in earlier sections, about the sampling error associated with values of M. On the basis of our knowledge about the sampling distribution of M, we can figure out a range of values around μ that will probably contain most of the sample means that would be obtained if we drew hundreds or thousands of samples from the population. SEM provides information about the typical magnitude of estimation error—that is, the typical distance between values of M and μ. Statistical theory tells us that (for values of df larger than 100) approximately 95% of obtained sample means will likely be within a range of about 1.96 SEM units on either side of μ. When we set up a CI around an individual sample mean, M, we are essentially using some logical sleight of hand and saying that if values of M tend to be close to μ, then the unknown value of μ should be reasonably close to (most) sample values of M. However, the language used to interpret a CI is tricky. It is incorrect to say that a CI computed using data from a single sample has a 95% chance of including μ. (It either does or doesn’t.) We can say, however, that in the long run, approximately 95% of the CIs that are set up by applying these procedures to hundreds of samples from a normally distributed population with mean = μ will include the true population mean, μ, between the lower and the upper limits. (The other 5% of CIs will not contain μ.) 2.14.2 Setting Up a CI for M When σ Is Known To set up a 95% CI to estimate the mean when σ, the population standard deviation, is known, the researcher needs to do the following: 1. Select a “level of confidence.” In the empirical example that follows, the level of confidence is set at 95%. In applications of
CIs, 95% is the most commonly used level of confidence. 2. For a sample of N observations, calculate the sample statistic (such as M) that will be used to estimate the corresponding population parameter (μ). 3. Use the value of σ (the population standard deviation) and the sample size N to calculate σM. 4. When σ is known, use the standard normal distribution to look up the “critical values” of z that correspond to the middle 95% of the area in the standard normal distribution. These values can be obtained by looking at the table of the standard normal distribution in Appendix A. For a 95% level of confidence, from Appendix A, we find that the critical values of z that correspond to the middle 95% of the area are z = −1.96 and z = +1.96. This provides the information necessary to calculate the lower and upper limits for a CI. In the equations below, LL stands for the lower limit (or boundary) of the CI, and UL stands for the upper limit (or boundary) of the CI. Because the level of confidence was set at 95%, the critical values of z, zcritical, were obtained by looking up the distance from the mean that corresponds to the middle 95% of the normal distribution. (If a 90% level of confidence is chosen, the z values that correspond to the middle 90% of the area under the normal distribution would be used.) The lower and upper limits of a CI for a sample mean M correspond to the following: LL = M − zcritical × σM (Equation 2.11) and UL = M + zcritical × σM (Equation 2.12). As an example, suppose that a student researcher collects a sample of N = 25 scores on IQ for a random sample of people drawn from the population of students at Corinth College. The WAIS IQ test is known to have σ equal to 15. Suppose the student decides to set up a 95% CI. The student obtains a sample mean IQ, M, equal to 128. The student needs to do the following:
1. Find the value of σM = σ/√N = 15/√25 = 3.00. 2. Look up the critical values of z that correspond to the middle 95% of a standard normal distribution. From the table of the normal distribution in Appendix A, these critical values are z = –1.96 and z = +1.96. 3. Substitute the values for σM and zcritical into Equations 2.11 and 2.12 to obtain the following results: Lower limit = 128 − 1.96 × 3.00 = 122.12; Upper limit = 128 + 1.96 × 3.00 = 133.88. What conclusions can the student draw about the mean IQ of the population (all students at Corinth College) from which the random sample was drawn? It would not be correct to say that “there is a 95% chance that the true population mean IQ, μ, for all Corinth College students lies between 122.12 and 133.88.” It would be correct to say that “the 95% CI around the sample mean lies between 122.12 and 133.88.” (Note that the value of 100, which corresponds to the mean, μ, for the general adult population, is not included in this 95% CI for a sample of students drawn from the population of all Corinth College students. It appears, therefore, that the population mean WAIS score for Corinth College students may be higher than the population mean IQ for the general adult population.) To summarize, the 95% confidence level is not the probability that the true population mean, μ, lies within the CI that is based on data from one sample (μ either does lie in this interval or does not). The confidence level is better understood as a long-range prediction about the performance of CIs when these procedures for setting up CIs are followed. We expect that approximately 95% of the CIs that researchers obtain in the long run will include the true value of the population mean, μ. The other 5% of the CIs that researchers obtain using these procedures will not include μ. 2.14.3 Setting Up a CI for M When the Value of σ Is Not Known In a typical research situation, the researcher does not know the values of μ and σ; instead, the researcher has values of M and s from just one sample of size N and wants to use this sample
mean, M, to estimate μ. In Section 2.12, I explained that when σ is not known, we can use s to calculate an estimate of SEM. However, when we use SEM (rather than σM) to set up CIs, the use of SEM to estimate σM results in additional sampling error. To adjust for this additional sampling error, we use the t distribution with N − 1 degrees of freedom (rather than the normal distribution) to look up distances from the mean that correspond to the middle 95% of the area in the sampling distribution. When N is large (>100), the t distribution converges to the standard normal distribution; therefore, when samples are large (N > 100), the standard normal distribution can be used to obtain the critical values for a CI. The formulas for the upper and lower limits of the CI when σ is not known, therefore, differ in two ways from the formulas for the CI when σ is known. First, when σ is unknown, we replace σM with SEM. Second, when σ is unknown and N < 100, we replace zcritical with tcritical, using a t distribution with N − 1 df to look up the critical values (for N ≥ 100, zcritical may be used). For example, suppose that the researcher wants to set up a 95% CI using the sample mean data reported in an earlier section of this chapter with N = 9, M = 73.11, and s = 5.533 (sample statistics are from Figure 2.12). The procedure is as follows: 1. Find the value of SEM = s/√N = 5.533/√9 = 1.844. 2. Find the tcritical values that correspond to the middle 95% of the area for a t distribution with df = N − 1 = 9 − 1 = 8. From the table of the distribution of t, using 8 df, in Appendix B, these are tcritical = −2.31 and tcritical = +2.31. 3. Substitute the values of M, tcritical, and SEM into the following equations: Lower limit = M − [tcritical × SEM] = 73.11 − [2.31 × 1.844] = 73.11 − 4.26 = 68.85; Upper limit = M + [tcritical × SEM] = 73.11 + [2.31 × 1.844] = 73.11 + 4.26 = 77.37.
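Both CI recipes can be collected into short helper functions. This is a sketch, not SPSS output: the t critical value must be supplied from a table such as Appendix B because Python's standard library has no t distribution, and 2.306 (the two-tailed .05 value for 8 df, which the text rounds to 2.31) is used here as that tabled value.

```python
from statistics import NormalDist

def ci_known_sigma(m, sigma, n, confidence=0.95):
    """CI limits for a mean when the population SD is known (Equations 2.11 and 2.12)."""
    z_crit = NormalDist(0, 1).inv_cdf((1 + confidence) / 2)  # 1.96 for 95%
    sigma_m = sigma / n ** 0.5
    return m - z_crit * sigma_m, m + z_crit * sigma_m

def ci_unknown_sigma(m, s, n, t_crit):
    """CI limits when sigma is unknown: SEM = s / sqrt(n), with a tabled
    t critical value for df = n - 1 supplied by the caller."""
    sem = s / n ** 0.5
    return m - t_crit * sem, m + t_crit * sem

# Corinth College IQ example: M = 128, sigma = 15, N = 25
iq_lo, iq_hi = ci_known_sigma(128, 15, 25)               # about (122.12, 133.88)

# Heart-rate sample: M = 73.11, s = 5.533, N = 9, df = 8
hr_lo, hr_hi = ci_unknown_sigma(73.11, 5.533, 9, 2.306)  # about (68.86, 77.36)
```

The heart-rate limits differ from the text's 68.85 and 77.37 only because the text rounds tcritical to 2.31 before multiplying.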
What conclusions can the student draw about the mean HR of the population (all 130 cases in the file named hr130.sav) from which the random sample of N = 9 cases was drawn? The student can report that “the 95% CI for mean HR ranges from 68.85 to 77.37.” In this particular situation, we know what μ really is; the population mean HR for all 130 scores was 73.76 (from Figure 2.5). In this example, we know that the CI that was set up using information from the sample actually did include μ. (However, about 5% of the time, when a 95% level of confidence is used, the CI that is set up using sample data will not include μ.) The sample mean, M, is not the only statistic that has a sampling distribution and a known standard error. The sampling distributions for many other statistics are known; thus, it is possible to identify an appropriate sampling distribution and to estimate the standard error and set up CIs for many other sample statistics, such as Pearson’s r. 2.14.4 Reporting CIs On the basis of recommendations made by Wilkinson and the Task Force on Statistical Inference (1999), the Publication Manual of the American Psychological Association (American Psychological Association [APA], 2009) states that CI information should be provided for major outcomes wherever possible. SPSS provides CI information for many, but not all, outcome statistics of interest. For some sample statistics and for effect sizes, researchers may need to calculate CIs by hand (Kline, 2004). When we report CIs, such as a CI for a sample mean, we remind ourselves (and our readers) that the actual value of the population parameter that we are trying to estimate is generally unknown and that the values of sample statistics are influenced by sampling error. Note that it may be inappropriate to use CIs to make inferences about the means for any specific real-world population if the CIs are based on samples that are not representative of a specific, well-defined population of interest.
As pointed out in Chapter 1, the widespread use of convenience
samples (rather than random samples from clearly defined populations) may lead to situations where the sample is not representative of any real-world population. It would be misleading to use sample statistics (such as the sample mean, M) to make inferences about the population mean, μ, for real-world populations if the members of the sample are not similar to, or representative of, that real-world population. At best, when researchers work with convenience samples, they can make inferences about hypothetical populations that have characteristics similar to those of the sample. The results obtained from the analysis of a random sample of nine HR scores could be reported as follows:

Results

Using the SPSS random sampling procedure, a random sample of N = 9 cases was selected from the population of 130 scores in the hr130.sav data file. The scores in this sample appear in Table 2.2. For this sample of nine cases, mean HR M = 73.11 beats per minute (bpm), with SD = 5.53 bpm. The 95% CI for the mean based on this sample had a lower limit of 68.85 and an upper limit of 77.37.

2.15 Summary

Many statistical analyses include relatively simple terms that summarize information across X scores, such as ∑X and ∑X². It is helpful to recognize that whenever a formula includes ∑X, information about the mean of X is being taken into account; when terms involving ∑X² are included, information about variance is included in the computations. This chapter reviewed several basic concepts from introductory statistics:
1. The computation and interpretation of sample statistics, including the mean, variance, and standard deviation, were
discussed.
2. A z score is used as a unit-free index of the distance of a single X score from the mean of a normal distribution of individual X scores. Because values of z have a fixed relationship to areas under the normal distribution curve, a z score can be used to answer questions such as, What proportion or percentage of cases have scores higher than X?
3. Sampling error arises because the value of a sample statistic such as M varies across samples when many random samples are drawn from the same population.
4. Given some assumptions (e.g., that the distribution of scores in the population of interest is normal in shape), it is possible to predict the shape, mean, and variance of the sampling distribution of M. When σ is known, the sampling distribution of M has the following known characteristics: It is normal in shape; the mean of the distribution of values of M corresponds to μ, the population mean; and the standard deviation or standard error that describes typical distances of sample mean values of M from μ is given by σ/√N. When σ is not known and the researcher uses a sample standard deviation s to estimate σ, a second source of sampling error arises; we now have potential errors in estimation of σ using s as well as errors of estimation of μ using M. The magnitude of this additional sampling error depends on N, the size of the samples that are used to calculate M and s.
5. Additional sampling error arises when s is used to estimate σ. This additional sampling error requires us to refer to a different type of sampling distribution when we evaluate distances of individual M values from the center of the sampling distribution—that is, the family of t distributions (instead of the standard normal distribution).
6. The family of t distributions has a different distribution shape for each degree of freedom. As the df for the t distribution increases, the shape of the t distribution becomes closer to that of a standard normal distribution.
When N (and therefore df) becomes greater than 100, the difference between
the shape of the t and normal distributions becomes so small that distances from the mean can be evaluated using the normal distribution curve.
7. All these pieces of information come together in the formula for the CI. We can set up an “interval estimate” for μ based on the sample value of M and the amount of sampling error that is theoretically expected to occur.
8. Recent reporting guidelines for statistics (e.g., Wilkinson and the Task Force on Statistical Inference, 1999) recommend that CIs should be included for all important statistical outcomes in research reports wherever possible.

Appendix on SPSS

The examples in this textbook use IBM SPSS Version 19.0. Students who have never used SPSS (or programs that have similar capabilities) may need an introduction to SPSS, such as George and Mallery (2010). As with other statistical packages, students may either purchase a personal copy of the SPSS software and install it on a PC or use a version installed on their college or university computer network. When SPSS access has been established (either by installing a personal copy of SPSS on a PC or by doing whatever is necessary to access the college or university network version of SPSS), an SPSS® icon appears on the Windows desktop, or an SPSS for Windows folder can be opened by clicking on Start in the lower left corner of the computer screen and then on All Programs. When SPSS is started in this manner, the initial screen asks the user whether he or she wants to open an existing data file or type in new data. When students want to work with existing SPSS data files, such as the SPSS data files on the website for this textbook, they can generally open these data files just by clicking on the SPSS data file; as long as the student has access to the SPSS program, SPSS data files will automatically be opened using this program. SPSS can save and read several different file formats.
On the website that accompanies this textbook, each data file is available in two formats: as an SPSS system file (with a full file
name of the form dataset.sav) and as an Excel file (with a file name of the form dataset.xls). Readers who use programs other than SPSS will need to use the drop-down menu that lists various “file types” to tell their program (such as SAS) to look for and open a file that is in Excel XLS format (rather than the default SAS format). SPSS examples are presented in sufficient detail in this textbook so that students should be able to reproduce any of the analyses that are discussed. Some useful data-handling features of SPSS (such as procedures for handling missing data) are discussed in the context of statistical analyses, but this textbook does not provide a comprehensive treatment of the features in SPSS. Students who want a more comprehensive treatment of SPSS may consult books by Norusis and SPSS (2010a, 2010b). Note that the titles of recent books sometimes refer to SPSS as PASW, a name that applied only to Version 18 of SPSS.

Notes

1. Demonstrations do not constitute proofs; however, they require less lengthy explanations and less mathematical sophistication from the reader than proofs or formal mathematical derivations. Throughout this book, demonstrations are offered instead of proofs, but readers should be aware that a demonstration only shows that a result works using the specific numbers involved in the demonstration; it does not constitute a proof.
2. The population variance, σ², is defined as σ² = ∑(X − μ)²/N. I have already commented that when we calculate a sample variance, s², using the formula s² = ∑(X − M)²/(N − 1), we need to use N − 1 as the divisor to take into account the fact that we only have N − 1 independent deviations from the sample mean. However, a second problem arises when we calculate s²; that is, we calculate s² using M, an estimate of μ that is also subject to sampling error.

Comprehension Questions

1. Consider the following small set of scores. Each number
represents the number of siblings reported by each of the N = 6 persons in the sample: X scores are [0, 1, 1, 1, 2, 7].
a. Compute the mean (M) for this set of six scores.
b. Compute the six deviations from the mean (X − M), and list these six deviations.
c. What is the sum of the six deviations from the mean you reported in (b)? Is this outcome a surprise?
d. Now calculate the sum of squared deviations (SS) for this set of six scores.
e. Compute the sample variance, s², for this set of six scores.
f. When you compute s², why should you divide SS by (N − 1) rather than by N?
g. Finally, compute the sample standard deviation (denoted by either s or SD).
2.
In your own words, what does an SS tell us about a set of data? Under what circumstances will the value of SS equal 0? Can SS ever be negative?
3. For each of the following lists of scores, indicate whether the value of SS will be negative, 0, between 0 and +15, or greater than +15. (You do not need to actually calculate SS.)
Sample A: X = [103, 156, 200, 300, 98]
Sample B: X = [103, 103, 103, 103, 103, 103]
Sample C: X = [101, 102, 103, 102, 101]
4. For a variable that interests you, discuss why there is variance in scores on that variable. (In Chapter 2, e.g., there is a discussion of factors that might create variance in heart rate, HR.)
5. Assume that a population of thousands of people whose responses were used to develop the anxiety test had scores that were normally distributed with μ = 30 and σ = 10. What proportion of people in this population would have anxiety scores within each of the following ranges of scores?
a. Below 20
b. Above 30
c. Between 10 and 50
d. Below 10
e. Below 50
f. Above 50
g. Either below 10 or above 50
Assuming that a score in the top 5% of the distribution would be considered extremely anxious, would a person whose anxiety score was 50 be considered extremely anxious?
6. What is a confidence interval (CI), and what information is required to set up a CI?
7. What is a sampling distribution? What do we know about the shape and characteristics of the sampling distribution for M, the sample mean?
8. What is SEM? What does the value of SEM tell you about the typical magnitude of sampling error?
a. As s increases, how does the size of SEM change (assuming that N stays the same)?
b. As N increases, how does the size of SEM change (assuming that s stays the same)?
9. How is a t distribution similar to a standard normal distribution? How is it different?
10. Under what circumstances should a t distribution be used rather than the standard normal distribution to look up areas or probabilities associated with distances from the mean?
11.
Consider the following questions about CIs. A researcher tests emotional intelligence (EI) for a random sample of children selected from a population of all students who are enrolled in a school for gifted children. The researcher wants to estimate the mean EI for the entire school. The population standard deviation, σ, for EI is not known. Let’s suppose that the researcher wants to set up a 95% CI for these EI scores using the following information: The sample mean M = 130. The sample standard deviation s = 15. The sample size N = 120. The df = N − 1 = 119. For the values given above (with SEM = s/√N = 15/√120 = 1.37 and, because N > 100, zcritical = 1.96), the limits of the 95% CI are as follows: Lower limit = 130 − 1.96 × 1.37 = 127.31; Upper limit = 130 + 1.96 × 1.37 = 132.69. The following exercises ask you to experiment to see how changing some of the values involved in computing the CI influences the width of the CI. Recalculate the CI above to see how the lower and upper
limits (and the width of the CI) change as you vary the N in the sample (and leave all the other values the same).
a. What are the upper and lower limits of the CI and the width of the 95% CI if all the other values remain the same (M = 130, s = 15) but you change the value of N to 16? For N = 16, lower limit = _________ and upper limit = ____________. Width (upper limit − lower limit) = ______________________. Note that when you change N, you need to change two things: the computed value of SEM and the degrees of freedom used to look up the critical values for t.
b. What are the upper and lower limits of the CI and the width of the 95% CI if all the other values remain the same but you change the value of N to 25? For N = 25, lower limit = __________ and upper limit = ___________. Width (upper limit − lower limit) = _______________________.
c. What are the upper and lower limits of the CI and the width of the 95% CI if all the other values remain the same (M = 130, s = 15) but you change the value of N to 49? For N = 49, lower limit = __________ and upper limit = ___________. Width (upper limit − lower limit) = ______________________.
d. Based on the numbers you reported for sample sizes N of 16, 25, and 49, how does the width of the CI change as N (the number of cases in the sample) increases?
e.
What are the upper and lower limits and the width of this CI if you change the confidence level to 80% (and continue to use M = 130, s = 15, and N = 49)? For an 80% CI, lower limit = ________ and upper limit = __________. Width (upper limit − lower limit) = ______________________.
f. What are the upper and lower limits and the width of the CI if you change the confidence level to 99% (continue to use M = 130, s = 15, and N = 49)? For a 99% CI, lower limit = ________ and upper limit = ___________. Width (upper limit − lower limit) = ______________________.
g. How does increasing the level of confidence from 80% to 99% affect the width of the CI?
12. Data Analysis Project: The N = 130 scores in the temphr.sav file are hypothetical data created by Shoemaker (1996) so that they yield results similar to those obtained in an actual study of temperature and HR (Mackowiak et al., 1992). Use the Temperature data in the temphr.sav file to do the following: Note that temperature in degrees Fahrenheit (tempf) can be converted into temperature in degrees centigrade (tempc) by the following: tempc = (tempf − 32)/1.8. The following analyses can be done on tempf, tempc, or both tempf and tempc.
a. Find the sample mean, M; standard deviation, s; and standard error of the mean, SEM, for scores on temperature.
b. Examine a histogram of scores on temperature. Is the shape of the distribution reasonably close to normal?
c. Set up a 95% CI for the sample mean, using your values of M, s, and N (N = 130 in this dataset).
d. The temperature that is popularly believed to be “average” or “healthy” is 98.6°F (or 37°C). Does the 95% CI based on this sample include the value 98.6, which is widely believed to represent an “average/healthy” temperature? What conclusion might you draw from this result?

(Warner 71-80) Warner, Rebecca (Becky) (Margaret). Applied Statistics: From Bivariate Through Multivariate Techniques, 2nd Edition. SAGE Publications, Inc, 04/2012. VitalBook file. The citation provided is a guideline. Please check each citation for accuracy before use.

In Unit 1, you read about the difference between descriptive statistics and inferential statistics in Chapter 1 of your Warner text. For the next two units, we will focus on the theory, logic, and application of descriptive statistics. This introduction focuses on scales of measurement, measures of central tendency and dispersion, the visual inspection of histograms, and the detection and processing of outliers.
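The CI-width exercises in question 11 above can also be checked numerically. The sketch below is an illustration, not part of the course materials; it assumes SciPy is available for the t critical values, and it uses the exercise values M = 130 and s = 15.

```python
# Sketch: how the width of a 95% CI shrinks as N grows (M = 130, s = 15).
import math
from scipy import stats

M, s = 130, 15
for N in (16, 25, 49, 120):
    sem = s / math.sqrt(N)                 # SEM changes with N...
    t_crit = stats.t.ppf(0.975, df=N - 1)  # ...and so does the critical t
    lower, upper = M - t_crit * sem, M + t_crit * sem
    print(N, round(lower, 2), round(upper, 2), round(upper - lower, 2))
```

Replacing 0.975 with 0.90 or 0.995 gives the 80% and 99% intervals asked about in parts e and f (for a two-tailed interval, the quantile is 1 minus half the tail area).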
An important concept in understanding descriptive statistics is the scales of measurement. The Warner (2013) text defines four scales of measurement—nominal, ordinal, interval, and ratio:
• Nominal data refer to numbers arbitrarily assigned to represent group membership, such as gender (male = 1; female = 2). Nominal data are useful in comparing groups, but they are meaningless in terms of measures of central tendency and dispersion.
• Ordinal data represent ranked data, such as coming in first, second, or third in a marathon. However, ordinal data do not tell us how much of a difference there is between measurements. The first-place and second-place finishers could finish 1 second apart, whereas the third-place finisher arrives 2 minutes later. Ordinal data lack equal intervals.
• Interval data refer to equal intervals between data points. An example is temperature measured in degrees Fahrenheit. Interval data lack a “true zero” value: 0 degrees Fahrenheit is an arbitrary point on the scale, not an absence of temperature (water freezes at 32 degrees Fahrenheit, not at 0).
• Ratio data do have a true zero, such as heart rate, where “0” represents a heart that is not beating. This is often seen as “count” data in social research. For example, how many days did an employee miss from work? Zero is a meaningful unit in this example.
These four scales of measurement are routinely reviewed in introductory statistics textbooks as the classic way of differentiating measurements. However, the boundaries between the measurement scales are fuzzy. For example, is intelligence quotient (IQ) measured on the ordinal
or interval scale? Recently, researchers have argued for a simpler dichotomy in terms of selecting an appropriate statistic: categorical versus continuous measures.
• A categorical variable is a nominal variable. It simply categorizes things according to group membership (for example, apple = 1, banana = 2, grape = 3).
• A continuous measure represents a difference in magnitude of something, such as a continuum of “low to high” statistics anxiety. In contrast to categorical variables designated by arbitrary values, a quantitative measure allows for a variety of arithmetic operations, including equal (=), less than (<), greater than (>), addition (+), subtraction (−), multiplication (* or ×), and division (/ or ÷). Arithmetic operations generate a variety of descriptive statistics discussed next.

Measures of Central Tendency and Dispersion

Chapter 2 of Warner (2013) reviews descriptive statistics that measure central tendency (mean, median, mode) and dispersion (range, sum of squares, variance, standard deviation). To visualize central tendency and dispersion, refer to Figure 2.5 on page 46 of the Warner text for an illustration of how heart rate data are represented in a histogram. The horizontal axis represents heart rate (“hr”). The vertical axis represents the total number of people who were recorded at a particular heart rate (“Frequency”). Measures of centrality summarize where data clump together at the center of a distribution of scores. (For example, in Figure 2.5 this occurs around hr = 74.)
Unit 2 - Descriptive Statistics: Theory and Logic

INTRODUCTION

To simplify, consider the following measured heart rates: 65, 70, 75, 75, 130. The simplest measure of central tendency is the mode. It is the most frequent score within a distribution of scores (for example, the two scores of hr = 75). Technically, in a distribution of scores, you can have two or more modes. An advantage of the mode is that it can be applied to categorical data. It is also not sensitive to extreme scores. The median is the positional center (midpoint) of a distribution because of how it is calculated: all scores are arranged in ascending order, and the score in the middle is the median. In the five heart rates above, the middle score is 75. If you have an even number of scores, the average of the two middle scores is used. The median also has the advantage of not being sensitive to extreme scores. The mean is probably what most people consider to be an average score. In the example above, the mean heart rate is (65 + 70 + 75 + 75 + 130) ÷ 5 = 83. Although the mean is more sensitive to extreme scores (such as 130) relative to the mode and median, it can be more stable across samples, and it is the best estimate of the population mean. It is also used in many of the inferential statistics studied in this course, such as t tests and analysis of variance (ANOVA). In contrast to measures of central tendency, measures of dispersion summarize how far apart data are spread on
a distribution of scores. The range is a basic measure of dispersion quantifying the distance between the lowest score and the highest score in a distribution (for example, 130 − 65 = 65). A deviance represents the difference between an individual score and the mean. For example, the deviance for the first heart rate score (65) is 65 − 83, which is −18. By calculating the deviance for each score above from a mean of 83, we arrive at −18, −13, −8, −8, and +47. Summing all of the deviances equals 0, which is not a very informative measure of dispersion. A somewhat more informative measure of dispersion is the sum of squares (SS), which you will see again in Units 9 and 10 in the study of analysis of variance (ANOVA). To get around the problem of summing to zero, the sum of squares involves calculating the square of each deviation and then summing those squares. In the example above, SS = [(−18)² + (−13)² + (−8)² + (−8)² + (+47)²] = [(324) + (169) + (64) + (64) + (2209)] = 2830. The problem with SS is that it increases as data points increase (Field, 2013), and it still is not a very informative measure of dispersion. This problem is solved by next calculating the sample variance (s²), which is the average squared distance between the mean and a particular score. Instead of dividing SS by 5 for the example above, we divide by N − 1, or 4; see pages 56–57 of your Warner text for an explanation. The variance is therefore SS ÷ (N − 1), or 2830 ÷ 4 = 707.5. The problem with interpreting variance is that it is the average distance of “squared units” from the mean. What is, for example, a “squared” heart rate score? The final step is calculating the sample standard deviation (s), which is simply calculated as the square root of
the sample variance, or in our example, √707.5 = 26.60. The sample standard deviation represents the average deviation of scores from the mean. In other words, the average distance of heart rate scores to the mean is 26.6 beats per minute. If the extreme score of 130 is replaced with a score closer to the mean, such as 90, then s = 9.35. Thus, small standard deviations (relative to the mean) represent a small amount of dispersion; large standard deviations (relative to the mean) represent a large amount of dispersion (Field, 2013). The standard deviation is an important component of the normal distribution.

Visual Inspection of a Distribution of Scores

An assumption of the statistical tests that you will study in this course is that the scores for a dependent variable are normal (or approximately normal) in shape. This assumption is first checked by examining a histogram of the distribution. Figure 4.19 in the Warner text (p. 147) represents a distribution of heart rate scores that are approximately normal in shape and visualized in terms of a bell-shaped curve. Notice that the tails of the distribution are approximately symmetrical, meaning that they are near mirror images to the left and right of the mean. This distribution technically has two modes at hr = 70 and hr = 76, but the close proximity of these modes suggests a unimodal distribution. Departures from normality and symmetry are assessed in terms of skew and kurtosis. Skewness is the tilt or extent a distribution deviates from symmetry around the mean. A distribution that is positively skewed has a longer tail extending to the right (the “positive” side of the
distribution) as shown in Figure 4.20 of the Warner text (p. 148). A distribution that is negatively skewed has a longer tail extending to the left (the “negative” side of the distribution) as shown in Figure 4.21 of the Warner text (p. 149). In contrast to skewness, kurtosis is defined as the peakedness of a distribution of scores. Figure 4.22 of the Warner text (p. 150) illustrates a distribution with normal kurtosis, negative kurtosis (a “flat” distribution; platykurtic), and positive kurtosis (a “sharp” peak; leptokurtic). The use of these terms is not limited to your description of a distribution following a visual inspection. They are included in your list of descriptive statistics and should be included when analyzing your distribution of scores. Skewness and kurtosis values near zero indicate a shape that is close to symmetric and close to normal, respectively. Values of −1 to +1 are considered ideal, whereas values ranging from −2 to +2 are considered acceptable for psychometric purposes.

Outliers

Outliers are defined as extreme scores on either the left or right tail of a distribution, and they can influence the overall shape of that distribution. There are a variety of methods for identifying and adjusting for outliers. Outliers can be detected by calculating z scores (reviewed in Unit 4) or by inspection of a box plot. Once an outlier is detected, the researcher must determine how to handle it. The outlier may represent a data entry error that should be corrected, or the outlier may be a valid extreme score. The outlier can be left alone, deleted, or transformed. Whatever decision is made regarding an outlier, the researcher must be transparent and justify his or her decision.
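The heart-rate example above (65, 70, 75, 75, 130) can be reproduced in a few lines of Python using only the standard library. This is an illustrative sketch, not part of the course materials, and z-score screening is just one of the outlier-detection methods mentioned above.

```python
# Sketch: mode, median, mean, SS, variance, SD, and z-score screening
# for the five example heart rates.
import statistics

hr = [65, 70, 75, 75, 130]

mode = statistics.mode(hr)      # most frequent score
median = statistics.median(hr)  # middle score when sorted
mean = statistics.mean(hr)      # (65 + 70 + 75 + 75 + 130) / 5 = 83

ss = sum((x - mean) ** 2 for x in hr)  # sum of squared deviations
variance = ss / (len(hr) - 1)          # divide by N - 1, not N
sd = statistics.stdev(hr)              # same as variance ** 0.5

print(mode, median, mean)          # 75 75 83
print(ss, variance, round(sd, 2))  # 2830 707.5 26.6

# z score for each case; scores far from the mean get large |z| values.
# Here 130 has the largest |z| (about 1.77), flagging it for inspection.
z = [(x - mean) / sd for x in hr]
print([round(v, 2) for v in z])
```

Swapping 130 for 90 and rerunning shows the drop in dispersion described above (s falls from about 26.6 to about 9.35).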
References

Field, A. (2013). Discovering statistics using IBM SPSS (4th ed.). Thousand Oaks, CA: Sage.
Warner, R. M. (2013). Applied statistics: From bivariate through multivariate techniques (2nd ed.). Thousand Oaks, CA: Sage.

OBJECTIVES

To successfully complete this learning unit, you will be expected to:
1. Analyze the strengths and limitations of descriptive statistics.
2. Identify previous experience with and future applications of descriptive statistics.
3. Analyze the purpose and reporting of confidence intervals.
4. Discuss standard error and confidence intervals.

Unit 2 Study 1 - Readings

Use your Warner text, Applied Statistics: From Bivariate Through Multivariate Techniques, to complete the following:
• Read Chapter 2, "Basic Statistics, Sampling Error, and Confidence Intervals," pages 41–80. This reading addresses the following topics:
◦ Sample mean (M).
◦ Sum of squared deviations (SS).
◦ Sample variance (s²).
◦ Sample standard deviation (s).
◦ Sample standard error (SE).
◦ Confidence intervals (CIs).
• Read Chapter 4, "Preliminary Data Screening," pages 125–184. This reading addresses the following topics:
◦ Problems in real data.
◦ Identification of errors and inconsistencies.
◦ Missing values.
◦ Data screening for individual variables.
◦ Data screening for bivariate analysis.
◦ Data transformations.
◦ Reporting preliminary data screening.

SOE Learners – Suggested Readings

Young, J. R., Young, J. L., & Hamilton, C. (2014). The use of confidence intervals as a meta-analytic lens to summarize the effects of teacher education technology courses on preservice teacher TPACK. Journal of Research on Technology in Education, 46(2), 149–172.

For this discussion:
• Discuss your previous experience with descriptive statistics. For example, you have probably encountered descriptive statistics in an undergraduate course and in journal articles.
• Analyze the strengths and limitations of descriptive statistics.
• Finally, discuss how you might use descriptive statistics in your professional or academic future.
• Remember to cite your supporting references.