Validity and Reliability in Research
•Agenda
At the end of this lesson, you should be able to:
1. Discuss validity
2. Discuss reliability
3. Discuss validity in qualitative research
4. Discuss validity in experimental design
5. Discuss how to achieve validity and reliability
 The consistency of scores or answers from one administration of
an instrument to another, or from one set of items to another.
 A reliable instrument yields similar results if given to a similar
population at different times.
Reliability
 Appropriateness, meaningfulness, correctness, and
usefulness of the inferences a researcher makes.
 Validity of what?
 The instrument?
 The data?
Validity
Validity
• Internal validity is the extent to which research findings are free
from bias and the effects of extraneous variables
• External validity is the extent to which the findings can be
generalised
• Content-related evidence of validity focuses on the content and
format of an instrument.
• Is it appropriate?
• Comprehensive?
• Is it logical?
• How well do the items or questions represent the content? Is the
format appropriate?
Validity - Content-related evidence
This refers to the relationship between the scores obtained using
the instrument and the scores obtained using one or more
other instruments or measures. For example, are students’
scores on teacher-made tests consistent with their scores on
standardized tests in the same subject areas?
Validity - Criterion-related evidence
Construct validity is defined as “establishing correct operational
measures for the concepts being studied” (Yin, 1984).
For example, if one is looking at problem solving in leaders, how
well does a particular instrument explain the relationship
between being able to problem solve and effectiveness as a
leader?
Validity - Construct-related evidence
ATTAINING VALIDITY AND
RELIABILITY
 Adequacy: the size and scope of the questions must be large
enough to cover the topic.
 Format of the instrument: Clarity of printing, type size,
adequacy of work area, appropriateness of language, clarity of
directions, etc.
Elements of content-related evidence
 Consult other experts who rate the items.
 Rate items, eliminating or changing those that do not meet the
specified content.
 Repeat until all raters agree on the questions and answers.
How to achieve content validity
To obtain criterion-related validity, researchers identify a
characteristic, assess it using one instrument (e.g., IQ test) and
compare the score with performance on an external measure,
such as GPA or an achievement test.
 A validity coefficient is obtained by correlating a set of scores on
one test (a predictor) with a set of scores on another (the
criterion).
 The degree to which the predictor and the criterion relate is the
validity coefficient. A predictor that has a strong relationship to
a criterion test would have a high coefficient.
Criterion-related validity
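The validity coefficient described above is simply the Pearson correlation between a set of predictor scores and a set of criterion scores. A minimal sketch in Python (the IQ and GPA figures below are made up purely for illustration):

```python
import math

def validity_coefficient(predictor, criterion):
    """Pearson correlation between predictor scores and criterion scores.

    A coefficient near 1.0 indicates a strong predictor-criterion
    relationship; values near 0 indicate little relationship.
    """
    n = len(predictor)
    mean_x = sum(predictor) / n
    mean_y = sum(criterion) / n
    # Covariance term (numerator) and the two sum-of-squares terms
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(predictor, criterion))
    ss_x = math.sqrt(sum((x - mean_x) ** 2 for x in predictor))
    ss_y = math.sqrt(sum((y - mean_y) ** 2 for y in criterion))
    return cov / (ss_x * ss_y)

# Hypothetical example: IQ-test scores (predictor) vs. GPA (criterion)
iq = [95, 110, 102, 130, 88, 118]
gpa = [2.8, 3.4, 3.0, 3.9, 2.5, 3.6]
print(round(validity_coefficient(iq, gpa), 2))  # ≈ 0.99 (a high coefficient)
```

In practice one would use a statistics package (e.g., `scipy.stats.pearsonr`) rather than hand-rolling the formula; the sketch only shows what the coefficient measures.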
 This type of validity is more typically associated with research
studies than testing.
 It relates to psychological traits, so multiple sources are used to
collect evidence. Oftentimes a combination of observation,
surveys, focus groups, and other measures is used to identify
how much of the trait being measured is possessed by the
person being observed.
Construct-related validity
Example construct: proactive coping skills
The consistency of scores obtained from one
instrument to another, or from the same
instrument over different groups.
Reliability
 Every test or instrument has errors of measurement
associated with it.
 These can be due to a number of things: testing conditions,
student health or motivation, test anxiety, etc.
 Instrument/test developers work hard to try to ensure that their
errors are not grounded in flaws with the instrument/test itself.
Errors of measurement
 Test-retest: Same test to same group
 Equivalent-forms: A different form of the same instrument is
given to the same group of individuals
 Internal consistency: Split-half procedure
 Kuder-Richardson: Mathematically computes reliability from
the number of items, the mean, and the standard deviation of the test.
Reliability Methods
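The Kuder-Richardson estimate listed above uses only the number of items, the test mean, and the test standard deviation, which corresponds to the KR-21 formula. A small sketch (the score summary passed in is hypothetical):

```python
def kr21(k, mean, sd):
    """Kuder-Richardson formula 21: estimates the reliability of a test of
    k dichotomously scored items from the test mean and standard deviation
    alone, without item-level data."""
    return (k / (k - 1)) * (1 - mean * (k - mean) / (k * sd ** 2))

# Hypothetical 50-item test with mean score 40 and standard deviation 6
print(round(kr21(k=50, mean=40, sd=6), 2))  # ≈ 0.79
```

KR-21 assumes all items are of roughly equal difficulty; when item-level responses are available, KR-20 (or Cronbach's alpha) is the more precise choice.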
• Reliability coefficient - a number that tells us how likely one
instrument is to be consistent over repeated administrations
• Alpha or Cronbach’s alpha
• used on instruments where answers aren’t scored “right” or “wrong”.
It is often used to test the reliability of survey instruments.
Reliability coefficient
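Cronbach's alpha can be computed from the variances of the individual items and the variance of respondents' total scores. A minimal sketch, using made-up survey responses for illustration:

```python
def cronbach_alpha(item_scores):
    """Cronbach's alpha for a score matrix: one row per respondent,
    one column per item (this layout is an assumption of the sketch)."""
    k = len(item_scores[0])  # number of items

    def variance(values):
        # Sample variance (n - 1 denominator), used consistently throughout
        m = sum(values) / len(values)
        return sum((v - m) ** 2 for v in values) / (len(values) - 1)

    item_vars = [variance([row[i] for row in item_scores]) for i in range(k)]
    total_scores = [sum(row) for row in item_scores]
    return (k / (k - 1)) * (1 - sum(item_vars) / variance(total_scores))

# Five respondents answering a three-item Likert-style survey (hypothetical)
responses = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [3, 3, 3],
    [4, 4, 5],
]
print(round(cronbach_alpha(responses), 2))  # ≈ 0.91
```

An alpha around 0.9 would suggest the three items are answered very consistently; values above roughly 0.7 are conventionally taken as acceptable for survey instruments.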
INTERNAL VALIDITY
•Validity
• The term validity is used in three ways:
• instrument or measurement validity
• external or generalization validity
• internal validity, which means that the relationship a
researcher observes between two variables should
be clear in its meaning rather than due to
some other, unintended factor (“something else”)
• Any one (or more) of these conditions:
• Age or ability of subjects
• Conditions under which the study was conducted
• Type of materials used in the study
• Technically, the “something else” is called a threat to internal validity.
What is “something else”?
• Subject characteristics
• Loss of subjects
• Location
• Instrumentation
• Testing
• History
• Maturation
• Attitude of subjects
• Implementation
Threats to internal validity
•Subject characteristics
• Subject characteristics can pose a threat if
there is selection bias, or if there are
unintended factors present within or among
groups selected for a study. For example, in
group studies, members may differ on the basis
of age, gender, ability, socioeconomic
background, etc. They must be controlled for
in order to ensure that the key variables in the
study, not these, explain differences.
•Subject characteristics
• Age, strength, maturity, gender, ethnicity, coordination, speed
• Intelligence, vocabulary, reading ability, fluency, manual
dexterity, socioeconomic status, religious/political beliefs
•Loss of subjects (mortality)
• Loss of subjects limits generalizability, but it can also
affect internal validity if the subjects who don’t
respond or participate are overrepresented in a
group.
•Location
• The place where data collection occurs (the
“location”) might pose a threat. For example,
hot, noisy, unpleasant conditions might affect
scores; situations where privacy is important
for the results, but where people are streaming
in and out of the room, might pose a threat.
• Decay: If the nature of the instrument or the scoring procedure
is changed in some way, instrument decay occurs.
• Data Collector Characteristics: The person collecting data can
affect the outcome.
• Data Collector Bias: The data collector might hold an opinion
that is at odds with the respondents’, and this affects the
administration.
Instrumentation
• In longitudinal studies, data are often collected through more
than one administration of a test.
• If the previous test influences subsequent ones by getting the
subject to engage in learning or some other behavior that he or
she might not otherwise have done, there is a testing threat.
Testing
• If an unanticipated or unplanned event occurs during a study
or intervention, there might be a history threat.
History
• Sometimes the very fact of being studied influences subjects.
The best known example of this is the Hawthorne Effect.
Attitude of subjects
• This threat can be caused by various things; different data
collectors, teachers, conditions in treatment, method bias, etc.
Implementation
• Standardize conditions of study
• Obtain more information on subjects
• Obtain as much information on details of the study: location,
history, instrumentation, subject attitude, implementation
• Choose an appropriate design
• Train data collectors
Minimizing Threats
Qualitative Research
Validity and reliability??
• Many qualitative researchers
contend that validity and
reliability are irrelevant to their
work because they study one
phenomenon and don’t seek to
generalize
• Fraenkel and Wallen - any
instrument or design used to
collect data should be credible
and backed by evidence
consistent with quantitative
studies.
• Trustworthiness
•Qualitative research
•Quantitative vs. Qualitative

Traditional criteria for judging    Alternative criteria for judging
quantitative research               qualitative research
--------------------------------    --------------------------------
Internal validity                   Credibility
External validity                   Transferability
Reliability                         Dependability
Objectivity                         Confirmability
In qualitative research
• Reliability pertains to the extent to which the study is
replicable, and to how accurately the research methods and
techniques produce data
• Objectivity of the researcher - researcher must look at her bias
and preconceived notions of what she will find before she
begins her research.
• Objectivity of the interviewee
• Triangulation
• Member check
• Audit trail
In qualitative research
Let’s look at one particular design
Validity in experimental research
Experimental designs should be developed to ensure the
internal and external validity of the study.
Internal Validity:
• Are the results of the study (DV) caused by the factors
included in the study (IV), or are they caused by other
factors (EV) that were not part of the study?
(Selection Bias/Differential Selection) -- The groups may have been
different from the start. If you were testing instructional strategies to
improve reading and one group enjoyed reading more than the
other group, they may improve more in their reading because they
enjoy it, rather than the instructional strategy you used.
Subject
Characteristics
Threats to Internal Validity
(Mortality) -- All of the high- or low-scoring subjects may
have dropped out or were missing from one of the
groups. If we collected posttest data on a day when the
debate society was on a field trip, the mean for the
treatment group would probably be much lower than it
really should have been.
Loss of Subjects
Threats to Internal Validity
Perhaps one group was at a
disadvantage because of their
location. The city may have been
demolishing a building next to one of
the schools in our study and there are
constant distractions which interfere
with our treatment.
Location
Threats to Internal Validity
The testing instruments may not be scored similarly.
Perhaps the person grading the posttest is fatigued
and pays less attention to the last set of papers
reviewed. It may be that those papers are from one
of our groups and will receive different scores than
the earlier group's papers.
Threats to Internal Validity
Instrumentation
Instrument Decay
The subjects of one group may react differently to the data collector
than the other group. A male interviewing males and females about
their attitudes toward a type of math instruction may not receive the
same responses from females as a female interviewing females would.
Threats to Internal Validity
Data Collector
Characteristics
The person collecting data may favor one group, or some
characteristic some subjects possess, over another. A principal
who favors strict classroom management may rate students'
attention under different teaching conditions with a bias toward
one of the teaching conditions.
Threats to Internal Validity
Data Collector Bias
The act of taking a pretest or posttest may influence the results of the
experiment. Suppose we were conducting a unit to increase student
sensitivity to racial prejudice. As a pretest we have the control and
treatment groups watch a movie on racism and write a reaction essay.
The pretest may have actually increased both groups' sensitivity and we
find that our treatment group didn't score any higher on a posttest given
later than the control group did. If we hadn't given the pretest, we might
have seen differences in the groups at the end of the study.
Threats to Internal Validity
Testing
Something may happen at one site during our study that influences the results.
Perhaps a classmate was injured in a car accident at the control site for a study
teaching children bike safety. The control group may actually demonstrate more
concern about bike safety than the treatment group.
Threats to Internal Validity
History
There may be natural changes in
the subjects that can account for
the changes found in a study. A
critical thinking unit may appear
more effective if it is taught during a
time when children are developing
abstract reasoning.
Threats to Internal Validity
Maturation
The subjects may respond differently just because they are being studied. The
name comes from a classic study in which researchers were studying the effect
of lighting on worker productivity. As the intensity of the factory lights increased,
so did the worker productivity. One researcher suggested that they reverse the
treatment and lower the lights. The productivity of the workers continued to
increase. It appears that being observed by the researchers was increasing
productivity, not the intensity of the lights.
Threats to Internal Validity
Hawthorne Effect
One group may feel that it is in competition with the other group and may work
harder than they would under normal circumstances. This generally is applied to
the control group "taking on" the treatment group.
Threats to Internal Validity
John
Henry
Effect
The control group may become discouraged because it is not
receiving the special attention that is given to the treatment
group. They may perform lower than usual because of this.
Threats to Internal Validity
Resentful
Demoralization of
the Control Group
(Statistical Regression) -- A class that scores particularly low can
be expected to score slightly higher just by chance. Likewise, a
class that scores particularly high will have a tendency to score
slightly lower by chance. The change in these scores may have
nothing to do with the treatment.
Threats to Internal Validity
Regression
The treatment may not be implemented as intended. A
study where teachers are asked to use student modeling
techniques may not show positive results, not because
modeling techniques don't work, but because the teacher
didn't implement them or didn't implement them as they
were designed.
Threats to Internal Validity
Implementation
Threats to Internal Validity
Compensatory
Equalization of
Treatment
Someone may feel sorry for the control group because they
are not receiving much attention and give them special
treatment. For example, a researcher could be studying the
effect of laptop computers on students' attitudes toward
math. The teacher feels sorry for the class that doesn't have
computers and sponsors a popcorn party during math
class. The control group begins to develop a more positive
attitude about mathematics.
Experimental Treatment
Diffusion
Threats to Internal Validity
Sometimes the control group actually
implements the treatment. If two different
techniques are being tested in two
different third grades in the same
building, the teachers may share what
they are doing. Unconsciously, the control
teacher may use some of the techniques he or she
learned from the treatment teacher.
Once the researchers are confident that the outcome (dependent
variable) of the experiment they are designing is the result of their
treatment (independent variable) [internal validity], they determine
to which people or situations the results of their study apply
[external validity].
External Validity:
• Are the results of the study generalizable to other
populations and settings?
• Population
• Ecological
Threats to External Validity (Population)
Population validity is the extent to which the results of a study can be
generalized from the specific sample that was studied to a larger group of
subjects. It involves the extent to which one can generalize from the study
sample to a defined population. If the sample is drawn from an accessible
population, rather than the target population, generalizing the research
results from the accessible population to the target population is risky.
Ecological validity is the extent
to which the results of an experiment can be generalized from the set
of environmental conditions created by the researcher to other
settings and conditions.
Threats to External Validity (Ecological)
There are 10 common
threats to external
validity.
(not sufficiently described for others to replicate) If the
researcher fails to adequately describe how he or
she conducted a study, it is difficult to determine
whether the results are applicable to other
settings.
Threats to External Validity (Ecological)
Explicit description of
the experimental
treatment
(catalyst effect)
If a researcher were to apply several treatments,
it is difficult to determine how well each of the
treatments would work individually. It might be
that only the combination of the treatments is
effective.
Threats to External Validity (Ecological)
Multiple-treatment
interference
(attention causes differences)
Subjects perform differently because they know they
are being studied. "...External validity of the experiment
is jeopardized because the findings might not
generalize to a situation in which researchers or others
who were involved in the research are not present"
(Gall, Borg, & Gall, 1996, p. 475)
Threats to External Validity (Ecological)
Hawthorne effect
Threats to External Validity (Ecological)
(anything different makes a difference)
A treatment may work because it is novel and the subjects respond to the
uniqueness, rather than the actual treatment. The opposite may also occur,
the treatment may not work because it is unique, but given time for the
subjects to adjust to it, it might have worked.
Novelty and
disruption effect
(it only works with this experimenter)
The treatment might have worked because of the
person implementing it. Given a different person, the
treatment might not work at all.
Threats to External Validity (Ecological)
Experimenter effect
(pretest sets the stage)
A treatment might only work if a pretest is
given. Because they have taken a pretest, the
subjects may be more sensitive to the
treatment. Had they not taken a pretest, the
treatment would not have worked.
Threats to External Validity (Ecological)
Pretest sensitization
(posttest helps treatment "fall into place")
The posttest can become a learning experience. "For
example, the posttest might cause certain ideas presented
during the treatment to 'fall into place' “ . If the subjects had
not taken a posttest, the treatment would not have worked.
Threats to External Validity (Ecological)
Posttest sensitization
Interaction of
history and
treatment effect
Threats to External Validity (Ecological)
(...to everything there is a time...)
Not only should researchers be cautious about generalizing to other
populations; caution should also be taken in generalizing to a different time
period. As time passes, the conditions under which treatments work
change.
(maybe only works with M/C tests)
A treatment may only be evident with certain types of
measurements. A teaching method may produce
superior results when its effectiveness is tested with an
essay test, but show no differences when the
effectiveness is measured with a multiple choice test.
Threats to External Validity (Ecological)
Measurement of
the dependent
variable
Interaction of time
of measurement
and treatment
effect
Threats to External Validity (Ecological)
(it takes a while for the treatment to kick in)
It may be that the treatment effect does not occur until several weeks after the end
of the treatment. In this situation, a posttest at the end of the treatment would
show no impact, but a posttest a month later might show an impact.
NEXT WEEK
Consultation

More Related Content

PDF
Validity and reliability of the instrument
PPTX
Research instruments
PPTX
measurement Data collection
PPTX
Reliability
PPTX
Validity in Research
PPTX
Presentation on validity and reliability in research methods
PDF
reliablity and validity in social sciences research
PDF
Validity and reliability
Validity and reliability of the instrument
Research instruments
measurement Data collection
Reliability
Validity in Research
Presentation on validity and reliability in research methods
reliablity and validity in social sciences research
Validity and reliability

What's hot (20)

PPTX
Qualitative Research Method
PPTX
Analysis of data in research
PPTX
Introduction to NVivo
PPT
Qualitative Research Methods
PPSX
Tools in Qualitative Research: Validity and Reliability
PPT
Data collection and analysis
PPT
Qualitative research designs
PPTX
Qualitative Research in Education
PPT
Topic 1 introduction to quantitative research
PPTX
Data Analysis, Presentation and Interpretation of Data
PPT
Advanced research methods
PPTX
Mixed research-methods (1)
PPTX
Specifying a purpose, Purpose statement, Hypostheses and research questions
PPTX
Experimental research
PPTX
Quantitative reseach method
PPTX
Mixed method research
PPSX
Inferential statistics.ppt
PPT
The research instruments
PPTX
The Research Problem
PPTX
Mixed methods research in Education pptx
Qualitative Research Method
Analysis of data in research
Introduction to NVivo
Qualitative Research Methods
Tools in Qualitative Research: Validity and Reliability
Data collection and analysis
Qualitative research designs
Qualitative Research in Education
Topic 1 introduction to quantitative research
Data Analysis, Presentation and Interpretation of Data
Advanced research methods
Mixed research-methods (1)
Specifying a purpose, Purpose statement, Hypostheses and research questions
Experimental research
Quantitative reseach method
Mixed method research
Inferential statistics.ppt
The research instruments
The Research Problem
Mixed methods research in Education pptx
Ad

Viewers also liked (20)

PDF
8. validity and reliability of research instruments
PPT
Presentation Validity & Reliability
PPTX
Validity and Reliability
PPTX
Threats to internal and external validity
PPT
Validity, its types, measurement & factors.
PPT
Louzel Report - Reliability & validity
PPTX
Validity & reliability an interesting powerpoint slide i created
PPT
Reliability and validity
PPTX
Validity and reliability of questionnaires
PPT
Threats to Internal and External Validity
PPTX
Prof. dr. Rolf Fasting
PDF
Validity & Ethics in Research
PDF
Reliability, validity, generalizability and the use of multi-item scales
PPT
Reliability & validity
PPT
Reliability and validity1
PPT
Internal and external validity factors
PDF
Experimental Research
PPTX
Validity, reliability & practicality
PPTX
validity its types and importance
PPT
Reliability and validity
8. validity and reliability of research instruments
Presentation Validity & Reliability
Validity and Reliability
Threats to internal and external validity
Validity, its types, measurement & factors.
Louzel Report - Reliability & validity
Validity & reliability an interesting powerpoint slide i created
Reliability and validity
Validity and reliability of questionnaires
Threats to Internal and External Validity
Prof. dr. Rolf Fasting
Validity & Ethics in Research
Reliability, validity, generalizability and the use of multi-item scales
Reliability & validity
Reliability and validity1
Internal and external validity factors
Experimental Research
Validity, reliability & practicality
validity its types and importance
Reliability and validity
Ad

Similar to Week 9 validity and reliability (20)

PPTX
Establlishing Reliability-Validity.pptx
PPT
Validity and Reliabilty.ppt
PPTX
VALIDITY
PDF
Validity and Reliability.pdf
PDF
Validity and Reliability.pdf
PPT
15th batch NPTI Validity & Reliablity Business Research Methods
PPT
Chapter 8 compilation
PPTX
Data collection reliability
PPTX
VALIDITY OF DATA.pptx
PPT
23APR_NR_Data collection Methods_Part 3.ppt
PPT
23APR_NR_Data collection Methods_Part 3.ppt
PPT
Reliability and validity
PPT
Test characteristics
PPTX
VALIDITY.pptx.statistics.011111917181111
PPT
Validity, reliability & Internal validity in Researches
PPTX
week_10._validity_and_reliability_0.pptx
PPT
validityitstypesmeasurementfactors-130908120814- (1).ppt
PPTX
Validity and reliability (aco section 6a) sheena jayma msgs ed
PPTX
Validity & reliability seminar
PPTX
Threats to validity
Establlishing Reliability-Validity.pptx
Validity and Reliabilty.ppt
VALIDITY
Validity and Reliability.pdf
Validity and Reliability.pdf
15th batch NPTI Validity & Reliablity Business Research Methods
Chapter 8 compilation
Data collection reliability
VALIDITY OF DATA.pptx
23APR_NR_Data collection Methods_Part 3.ppt
23APR_NR_Data collection Methods_Part 3.ppt
Reliability and validity
Test characteristics
VALIDITY.pptx.statistics.011111917181111
Validity, reliability & Internal validity in Researches
week_10._validity_and_reliability_0.pptx
validityitstypesmeasurementfactors-130908120814- (1).ppt
Validity and reliability (aco section 6a) sheena jayma msgs ed
Validity & reliability seminar
Threats to validity

More from wawaaa789 (20)

DOCX
DOCX
Research proposal
PPTX
Week 10 apa powerpoint
PPT
Week 10 writing research proposal
DOCX
Transcript qualitative
PPTX
Week 7 spss 2 2013
PPT
Week 7 spss
PPT
Qualitative
PPT
Week 7 a statistics
PPTX
Week 8 sampling and measurements 2015
PDF
Survey design
PPT
Experimental
DOCX
Ethnography
PPT
Correlation case study
PPT
Causal comparative study
PPT
Case study research by maureann o keefe
DOCX
Research proposal 1
PPT
Week 4 variables and designs
PDF
Qual and quant
PDF
Kornfeld dissertation 12 15-09-1
Research proposal
Week 10 apa powerpoint
Week 10 writing research proposal
Transcript qualitative
Week 7 spss 2 2013
Week 7 spss
Qualitative
Week 7 a statistics
Week 8 sampling and measurements 2015
Survey design
Experimental
Ethnography
Correlation case study
Causal comparative study
Case study research by maureann o keefe
Research proposal 1
Week 4 variables and designs
Qual and quant
Kornfeld dissertation 12 15-09-1

Recently uploaded (20)

PPTX
human mycosis Human fungal infections are called human mycosis..pptx
PDF
Insiders guide to clinical Medicine.pdf
PPTX
Cell Structure & Organelles in detailed.
PDF
3rd Neelam Sanjeevareddy Memorial Lecture.pdf
PPTX
Lesson notes of climatology university.
PDF
Black Hat USA 2025 - Micro ICS Summit - ICS/OT Threat Landscape
PDF
grade 11-chemistry_fetena_net_5883.pdf teacher guide for all student
PDF
O5-L3 Freight Transport Ops (International) V1.pdf
PPTX
BOWEL ELIMINATION FACTORS AFFECTING AND TYPES
PDF
BÀI TẬP BỔ TRỢ 4 KỸ NĂNG TIẾNG ANH 9 GLOBAL SUCCESS - CẢ NĂM - BÁM SÁT FORM Đ...
PPTX
school management -TNTEU- B.Ed., Semester II Unit 1.pptx
PDF
Saundersa Comprehensive Review for the NCLEX-RN Examination.pdf
PPTX
master seminar digital applications in india
PPTX
1st Inaugural Professorial Lecture held on 19th February 2020 (Governance and...
PDF
Complications of Minimal Access Surgery at WLH
PDF
01-Introduction-to-Information-Management.pdf
PDF
Basic Mud Logging Guide for educational purpose
PPTX
Final Presentation General Medicine 03-08-2024.pptx
PDF
Computing-Curriculum for Schools in Ghana
PPTX
Introduction_to_Human_Anatomy_and_Physiology_for_B.Pharm.pptx
human mycosis Human fungal infections are called human mycosis..pptx
Insiders guide to clinical Medicine.pdf
Cell Structure & Organelles in detailed.
3rd Neelam Sanjeevareddy Memorial Lecture.pdf
Lesson notes of climatology university.
Black Hat USA 2025 - Micro ICS Summit - ICS/OT Threat Landscape
grade 11-chemistry_fetena_net_5883.pdf teacher guide for all student
O5-L3 Freight Transport Ops (International) V1.pdf
BOWEL ELIMINATION FACTORS AFFECTING AND TYPES
BÀI TẬP BỔ TRỢ 4 KỸ NĂNG TIẾNG ANH 9 GLOBAL SUCCESS - CẢ NĂM - BÁM SÁT FORM Đ...
school management -TNTEU- B.Ed., Semester II Unit 1.pptx
Saundersa Comprehensive Review for the NCLEX-RN Examination.pdf
master seminar digital applications in india
1st Inaugural Professorial Lecture held on 19th February 2020 (Governance and...
Complications of Minimal Access Surgery at WLH
01-Introduction-to-Information-Management.pdf
Basic Mud Logging Guide for educational purpose
Final Presentation General Medicine 03-08-2024.pptx
Computing-Curriculum for Schools in Ghana
Introduction_to_Human_Anatomy_and_Physiology_for_B.Pharm.pptx

Week 9 validity and reliability

  • 2. •Agenda AT the end of this lesson, you should be able to: Discuss validity Discuss reliability Discuss validity in qualitative research Discuss validity in experimental design 1 2 5 3 4 Discuss how to achieve validity and reliability
  • 3.  The consistency of scores or answers from one administration of an instrument to another, or from one set of items to another.  A reliable instrument yields similar results if given to a similar population at different times. Reliability
  • 4.  Appropriateness, meaningfulness, correctness, and usefulness of inferences a researcher makes.  Validity of ??  Instrument?  Data? Validity
  • 5. Validity • Internal validity is the extent to which research findings are free from bias and effects • External validity is the extent to which the findings can be generalised
  • 6. • Content-related evidence of validity focuses on the content and format of an instrument. • Is it appropriate? • Comprehensive? • Is it logical? • How do the items or questions represent the content? Is the format appropriate? Validity - Content-related evidence
  • 7. This refers to the relationship between the scores obtained using the instrument and the scores obtained using one or more other instruments or measures. For example, are students’ scores on teacher made tests consistent with their scores on standardized tests in the same subject areas? Validity - Criterion-related evidence
  • 8. Construct validity is defined as “establishing correct operational measures for the concepts being studied” (Yin, 1984). For example, if one is looking at problem solving in leaders, how well does a particular instrument explain the relationship between being able to problem solve and effectiveness as a leader. Validity - Construct-related evidence
  • 10.  Adequacy : the size and scope of the questions must be large enough to cover the topic.  Format of the instrument: Clarity of printing, type size, adequacy of work area, appropriateness of language, clarity of directions, etc. Elements of content-related evidence
  • 11.  Consult other experts who rate the items.  Rate items, eliminating or changing those that do not meet the specified content.  Repeat until all raters agree on the questions and answers. How to achieve content validity
  • 12. To obtain criterion-related validity, researchers identify a characteristic, assess it using one instrument (e.g., IQ test) and compare the score with performance on an external measure, such as GPA or an achievement test.  A validity coefficient is obtained by correlating a set of scores on one test (a predictor) with a set of scores on another (the criterion).  The degree to which the predictor and the criterion relate is the validity coefficient. A predictor that has a strong relationship to a criterion test would have a high coefficient. Criterion-related validity
  • 13.  This type of validity is more typically associated with research studies than testing.  It relates to psychological traits, so multiple sources are used to collect evidence. Often times a combination of observation, surveys, focus groups, and other measures are used to identify how much of the trait being measured is possessed by the observee. Construct-related validity Proactive Coping Skills
  • 14. The consistency of scores obtained from one instrument to another, or from the same instrument over different groups. Reliability
  • 15.  Every test or instrument has associated with its errors of measurement.  These can be due to a number of things: testing conditions, student health or motivation, test anxiety, etc.  Instrument/test developers work hard to try to ensure that their errors are not grounded in flaws with the instrument/test itself. Errors of measurement
  • 16.  Test-retest: Same test to same group  Equivalent-forms: A different form of the same instrument is given to the same group of individuals  Internal consistency: Split-half procedure  Kuder-Richardson: Mathematically computes reliability from the # of items, the mean, and the standard deviation of the test. Reliability Methods
  • 17. • Reliability coefficient - a number that tells us how likely one instrument is to be consistent over repeated administrations • Alpha or Cronbach’s alpha • used on instruments where answers aren’t scored “right” and “wrong”. It is often used to test the reliability of survey instruments. Reliability coefficient
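Cronbach's alpha follows the same logic for items that are not scored "right" and "wrong", such as Likert-scale survey responses. A minimal sketch using invented ratings:

```python
# Hypothetical data: five respondents rating four Likert items (1-5);
# one row per respondent, one column per item.
ratings = [
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 5, 5, 4],
    [3, 3, 2, 3],
    [4, 4, 4, 5],
]

def cronbach_alpha(rows):
    """Cronbach's alpha: k/(k-1) * (1 - sum of the item variances
    divided by the variance of the total scores)."""
    k = len(rows[0])

    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    item_vars = [var([row[i] for row in rows]) for i in range(k)]
    total_var = var([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

print(round(cronbach_alpha(ratings), 3))
```

When the items "hang together" (respondents who agree with one item tend to agree with the others), the item variances are small relative to the total-score variance and alpha approaches 1, as it does for these made-up data.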
  • 19. •Validity • Validity can be used in three ways. • instrument or measurement validity • external or generalization validity • internal validity, which means that what a researcher observes between two variables should be clear in its meaning, rather than attributable to “something else”
  • 20. • Any one (or more) of these conditions: • Age or ability of subjects • Conditions under which the study was conducted • Type of materials used in the study • Technically, the “something else” is called a threat to internal validity. What is “something else”?
  • 21. • Subject characteristics • Loss of subjects • Location • Instrumentation • Testing • History • Maturation • Attitude of subjects • Implementation Threats to internal validity
  • 22. •Subject characteristics • Subject characteristics can pose a threat if there is selection bias, or if there are unintended factors present within or among groups selected for a study. For example, in group studies, members may differ on the basis of age, gender, ability, socioeconomic background, etc. These characteristics must be controlled for in order to ensure that the key variables in the study, and not these differences, explain the results.
  • 23. •Subject characteristics • Age • Intelligence • Strength • Vocabulary • Maturity • Reading ability • Gender • Fluency • Ethnicity • Manual dexterity • Coordination • Socioeconomic status • Speed • Religious/political beliefs
  • 24. •Loss of subjects (mortality) • Loss of subjects limits generalizability, but it can also affect internal validity if the subjects who don’t respond or participate are overrepresented in one group.
  • 25. •Location • The place where data collection occurs (the “location”) might pose a threat. For example, hot, noisy, unpleasant conditions might affect scores; situations where privacy is important for the results, but where people are streaming in and out of the room, might pose a threat.
  • 26. • Decay: If the nature of the instrument or the scoring procedure is changed in some way, instrument decay occurs. • Data Collector Characteristics: The person collecting data can affect the outcome. • Data Collector Bias: The data collector might hold an opinion that is at odds with respondents’, and this affects the administration. Instrumentation
  • 27. • In longitudinal studies, data are often collected through more than one administration of a test. • If the previous test influences subsequent ones by getting the subject to engage in learning or some other behavior that he or she might not otherwise have done, there is a testing threat. Testing
  • 28. • If an unanticipated or unplanned event occurs prior to a study or intervention, there might be a history threat. History
  • 29. • Sometimes the very fact of being studied influences subjects. The best known example of this is the Hawthorne Effect. Attitude of subjects
  • 30. • This threat can be caused by various things; different data collectors, teachers, conditions in treatment, method bias, etc. Implementation
  • 31. • Standardize conditions of study • Obtain more information on subjects • Obtain as much information on details of the study: location, history, instrumentation, subject attitude, implementation • Choose an appropriate design • Train data collectors Minimizing Threats
  • 33. • Many qualitative researchers contend that validity and reliability are irrelevant to their work because they study one phenomenon and don’t seek to generalize • Fraenkel and Wallen – any instrument or design used to collect data should be credible and backed by evidence, consistent with quantitative studies • Trustworthiness •Qualitative research
  • 34. •Quantitative vs. Qualitative • Traditional criteria for judging quantitative research → alternative criteria for judging qualitative research: • Internal validity → Credibility • External validity → Transferability • Reliability → Dependability • Objectivity → Confirmability
  • 35. In qualitative research • Reliability pertains to the extent to which the study is replicable and how accurately the research methods and techniques produce data • Objectivity of the researcher – the researcher must examine her bias and preconceived notions of what she will find before she begins her research • Objectivity of the interviewee
  • 36. • Triangulation • Member check • Audit trail In qualitative research
  • 37. Let’s look at one particular design Validity in experimental research
  • 38. Experimental Designs Should be Developed to Ensure Internal and External Validity of the Study
  • 39. Internal Validity: • Are the results of the study (DV) caused by the factors included in the study (IV) or are they caused by other factors (EV) which were not part of the study?
  • 40. (Selection Bias/Differential Selection) -- The groups may have been different from the start. If you were testing instructional strategies to improve reading and one group enjoyed reading more than the other group, they may improve more in their reading because they enjoy it, rather than the instructional strategy you used. Subject Characteristics Threats to Internal Validity
  • 41. (Mortality) -- All of the high or low scoring subjects may have dropped out or were missing from one of the groups. If we collected posttest data on a day when the debate society was on a field trip, the mean for the treatment group would probably be much lower than it really should have been. Loss of Subjects Threats to Internal Validity
  • 42. Perhaps one group was at a disadvantage because of their location. The city may have been demolishing a building next to one of the schools in our study and there are constant distractions which interfere with our treatment. Location Threats to Internal Validity
  • 43. The testing instruments may not be scored similarly. Perhaps the person grading the posttest is fatigued and pays less attention to the last set of papers reviewed. It may be that those papers are from one of our groups and will receive different scores than the earlier group's papers. Threats to Internal Validity Instrumentation Instrument Decay
  • 44. The subjects of one group may react differently to the data collector than the other group. A male interviewing males and females about their attitudes toward a type of math instruction may not receive the same responses from females as a female interviewing females would. Threats to Internal Validity Data Collector Characteristics
  • 45. The person collecting data may favor one group, or some characteristic some subjects possess, over another. A principal who favors strict classroom management may rate students' attention under different teaching conditions with a bias toward one of the teaching conditions. Threats to Internal Validity Data Collector Bias
  • 46. The act of taking a pretest or posttest may influence the results of the experiment. Suppose we were conducting a unit to increase student sensitivity to racial prejudice. As a pretest we have the control and treatment groups watch a movie on racism and write a reaction essay. The pretest may have actually increased both groups' sensitivity and we find that our treatment groups didn't score any higher on a posttest given later than the control group did. If we hadn't given the pretest, we might have seen differences in the groups at the end of the study. Threats to Internal Validity Testing
  • 47. Something may happen at one site during our study that influences the results. Perhaps a classmate was injured in a car accident at the control site for a study teaching children bike safety. The control group may actually demonstrate more concern about bike safety than the treatment group. Threats to Internal Validity History
  • 48. There may be natural changes in the subjects that can account for the changes found in a study. A critical thinking unit may appear more effective if it taught during a time when children are developing abstract reasoning. Threats to Internal Validity Maturation
  • 49. The subjects may respond differently just because they are being studied. The name comes from a classic study in which researchers were studying the effect of lighting on worker productivity. As the intensity of the factory lights increased, so did the worker productivity. One researcher suggested that they reverse the treatment and lower the lights. The productivity of the workers continued to increase. It appears that being observed by the researchers was increasing productivity, not the intensity of the lights. Threats to Internal Validity Hawthorne Effect
  • 50. One group may view that it is in competition with the other group and may work harder than they would under normal circumstances. This generally is applied to the control group "taking on" the treatment group. Threats to Internal Validity John Henry Effect
  • 51. The control group may become discouraged because it is not receiving the special attention that is given to the treatment group. They may perform lower than usual because of this. Threats to Internal Validity Resentful Demoralization of the Control Group
  • 52. (Statistical Regression) -- A class that scores particularly low can be expected to score slightly higher just by chance. Likewise, a class that scores particularly high, will have a tendency to score slightly lower by chance. The change in these scores may have nothing to do with the treatment. Threats to Internal Validity Regression
  • 53. The treatment may not be implemented as intended. A study where teachers are asked to use student modeling techniques may not show positive results, not because modeling techniques don't work, but because the teacher didn't implement them or didn't implement them as they were designed. Threats to Internal Validity Implementation
  • 54. Threats to Internal Validity Compensatory Equalization of Treatment Someone may feel sorry for the control group because they are not receiving much attention and give them special treatment. For example, a researcher could be studying the effect of laptop computers on students' attitudes toward math. The teacher feels sorry for the class that doesn't have computers and sponsors a popcorn party during math class. The control group begins to develop a more positive attitude about mathematics.
  • 55. Experimental Treatment Diffusion Threats to Internal Validity Sometimes the control group actually implements the treatment. If two different techniques are being tested in two different third grades in the same building, the teachers may share what they are doing. Unconsciously, the control teacher may use the techniques she or he learned from the treatment teacher.
  • 56. Once the researchers are confident that the outcome (dependent variable) of the experiment they are designing is the result of their treatment (independent variable) [internal validity], they determine for which people or situations the results of their study apply [external validity].
  • 57. External Validity: • Are the results of the study generalizable to other populations and settings? • Population • Ecological
  • 58. Threats to External Validity (Population) Population Validity is the extent to which the results of a study can be generalized from the specific sample that was studied to a larger group of subjects. It involves the extent to which one can generalize from the study sample to a defined population -- if the sample is drawn from an accessible population, rather than the target population, generalizing the research results from the accessible population to the target population is risky.
  • 59. Ecological Validity is the extent to which the results of an experiment can be generalized from the set of environmental conditions created by the researcher to other environmental conditions (settings and conditions). Threats to External Validity (Ecological) There are 10 common threats to external validity.
  • 60. (not sufficiently described for others to replicate) If the researcher fails to adequately describe how he or she conducted a study, it is difficult to determine whether the results are applicable to other settings. Threats to External Validity (Ecological) Explicit description of the experimental treatment
  • 61. (catalyst effect) If a researcher were to apply several treatments, it is difficult to determine how well each of the treatments would work individually. It might be that only the combination of the treatments is effective. Threats to External Validity (Ecological) Multiple-treatment interference
  • 62. (attention causes differences) Subjects perform differently because they know they are being studied. "...External validity of the experiment is jeopardized because the findings might not generalize to a situation in which researchers or others who were involved in the research are not present" (Gall, Borg, & Gall, 1996, p. 475) Threats to External Validity (Ecological) Hawthorne effect
  • 63. Threats to External Validity (Ecological) (anything different makes a difference) A treatment may work because it is novel and the subjects respond to the uniqueness, rather than the actual treatment. The opposite may also occur, the treatment may not work because it is unique, but given time for the subjects to adjust to it, it might have worked. Novelty and disruption effect
  • 64. (it only works with this experimenter) The treatment might have worked because of the person implementing it. Given a different person, the treatment might not work at all. Threats to External Validity (Ecological) Experimenter effect
  • 65. (pretest sets the stage) A treatment might only work if a pretest is given. Because they have taken a pretest, the subjects may be more sensitive to the treatment. Had they not taken a pretest, the treatment would not have worked. Threats to External Validity (Ecological) Pretest sensitization
  • 66. (posttest helps treatment "fall into place") The posttest can become a learning experience. For example, the posttest might cause certain ideas presented during the treatment to "fall into place". If the subjects had not taken a posttest, the treatment would not have worked. Threats to External Validity (Ecological) Posttest sensitization
  • 67. Interaction of history and treatment effect Threats to External Validity (Ecological) (...to everything there is a time...) Not only should researchers be cautious about generalizing to other populations; caution should also be taken in generalizing to a different time period. As time passes, the conditions under which treatments work change.
  • 68. (maybe only works with M/C tests) A treatment may only be evident with certain types of measurements. A teaching method may produce superior results when its effectiveness is tested with an essay test, but show no differences when the effectiveness is measured with a multiple choice test. Threats to External Validity (Ecological) Measurement of the dependent variable
  • 69. Interaction of time of measurement and treatment effect Threats to External Validity (Ecological) (it takes a while for the treatment to kick in) It may be that the treatment effect does not occur until several weeks after the end of the treatment. In this situation, a posttest at the end of the treatment would show no impact, but a posttest a month later might show an impact.