Reliability and Validity
What is reliability?
A quality of test scores which refers to the
consistency of measures across different times,
test forms, raters, and other characteristics of
the measurement context. Synonyms for
reliability are: dependability, stability,
consistency, predictability and accuracy.
For instance, a reliable man is one whose behaviour is consistent, dependable and predictable, i.e. what he will do tomorrow and next week will be consistent with what he does today and what he did last week.
-In language testing, for example, if a student receives a low score on a test one day and a high score on the same test two days later (the test does not yield consistent results), the scores cannot be considered reliable indicators of the individual’s ability.
-If two raters give widely different ratings to the same sample, we say that the ratings are not reliable.
-The notion of reliability has to do with accuracy of measurement. This kind of accuracy is reflected in obtaining similar results when measurement is repeated on different occasions, with different instruments, or by different persons.
-According to Henning [1987], reliability is a measure of the accuracy, consistency, dependability, or fairness of scores resulting from the administration of a particular examination, e.g. 75% on a test today but 83% tomorrow signals a problem with reliability.
Sources of Variance
Potential sources of score variance:
1. Those creating variance related to the purpose of the test: meaningful variance. Meaningful variance on a test is defined as the variance that is directly attributable to the testing purpose.
2. Those generating variance due to other, extraneous sources: measurement error. Measurement error is a term that describes the variance in scores on a test that is not directly related to the purpose of the test.
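In classical test theory terms, each observed score can be thought of as a true score (the meaningful variance) plus random measurement error, and reliability is the proportion of observed-score variance attributable to true scores. A minimal simulation sketch, with all numbers hypothetical:

```python
import random
import statistics

random.seed(0)

# Hypothetical "true" abilities (meaningful variance) for 1,000 examinees
true_scores = [random.gauss(70, 10) for _ in range(1000)]
# Random measurement error from extraneous sources
errors = [random.gauss(0, 5) for _ in range(1000)]
observed = [t + e for t, e in zip(true_scores, errors)]

# Reliability = meaningful (true-score) variance / total observed variance
reliability = statistics.variance(true_scores) / statistics.variance(observed)
print(round(reliability, 2))  # roughly 10**2 / (10**2 + 5**2) = 0.8
```

Shrinking the error term (better administration, scoring, and items) pushes the ratio toward 1; inflating it drags reliability toward 0.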
Potential sources of measurement error
1. Environment (location, space, noise, etc.)
2. Administration procedures (directions, timing, equipment, etc.)
3. Scoring procedures (errors in scoring, subjectivity, etc.)
4. Test and test items (item type, item quality, item security, etc.)
5. Examinees (health, motivation, memory, forgetfulness, guessing, etc.)
Dr. R. Green, Aug 2006
Test and test items
 quality of items
 number of items
 difficulty level of items
 level of item discrimination
 type of test methods
 number of test methods
 time allowed
 clarity of instructions
 use of the test
 selection of content
 sampling of content
 invalid constructs
Test taker
 familiarity with test method
 attitude towards the test i.e. interest,
motivation, emotional/mental state
 degree of guessing employed
 level of ability
Test administration
 consistency of administration procedure
 degree of interaction between invigilators and
test takers
 time of day the test is administered
 clarity of instructions
 test environment – light / heat / noise /
space / layout of room
 quality of equipment used e.g. for listening
tests
Scoring
 accuracy of the key e.g. does it include all
possible alternatives?
 inter-rater reliability e.g. in writing,
speaking
 intra-rater reliability e.g. in writing,
speaking
 machine vs. human
In order to maximize reliability we should try to minimize measurement error. For example, factors such as health, lack of interest or motivation, and test-wiseness can affect individuals’ test performance, but are not generally associated with language ability and are thus not characteristics we want to measure with a language test. Test method facets are another source of error. When we minimize the effect of these various factors, we minimize measurement error and maximize reliability.
The investigation of reliability is concerned with the question:
“How much of an individual’s test performance is due to measurement error, or to factors other than the language ability we want to measure?” and with minimizing the effects of these factors on test scores.
Unreliability
Inconsistencies in the results of measurement
stemming from factors other than the abilities
we want to measure. These are called
random factors. Such factors would result in
measurement errors which in turn can
introduce fluctuations in the observed scores
and thus reduce reliability.
Systematic factors: factors such as test method facets which are uniform from one test administration to the next. Attributes of individuals that are not related to language ability, including individual characteristics such as cognitive style and knowledge of particular content areas, and group characteristics such as sex and ethnic background, are also categorized as systematic factors. Two different effects are associated with systematic error:
1. General effect: the effect of systematic error which is consistent across all observations; it affects the scores of all individuals who take the test.
2. Specific effect: the effect which varies across individuals; it affects different individuals differently.
Test wiseness
A test taker’s capacity to exploit the characteristics and formats of the test and/or the test-taking situation to guess the correct answer and hence receive a higher score. It includes a variety of general strategies such as conscious pacing of one’s time, reading questions before the passages upon which they are based, and ruling out as many alternatives as possible in multiple-choice items and then guessing among those remaining.
Test wiseness can be divided into two basic parts:
1. Independent elements: these elements are independent of the test constructor. They are the kind often included in books that help students prepare for standardized tests, and relate to reading directions, use of time, guessing, etc.
2. Dependent elements: these elements refer to cues in the stem or in the distracters that must be reasoned out by the test taker.
Test method facet
The specific characteristics of test methods which constitute the “how” of language testing. Test performance is affected by these “facets” of test method: while one person might perform well on a multiple-choice test of reading, another may find such tasks very difficult and do better on an oral interview. Test method facets are systematic to the extent that they are uniform from one test administration to the next.
Test method facets include:
1. Testing environment
2. Test rubric (characteristics that specify how test takers are expected to proceed in taking the test)
3. Testing organization (the collection of parts in a test, which may be individual items or subtests)
4. Time allocation
5. Instructions
Factors which may contribute to unreliability
1. Fluctuations in the learner
a. Changes in the learner
b. Temporary psychological or physical changes
2. Fluctuations in scoring
a. Intra-rater variance
b. Inter-rater variance
3. Fluctuations in test administration
a. Regulatory fluctuations
b. Fluctuations in the administrative environment
4. The behavior of test characteristics
a. Test length
b. Difficulty level and boundary effect
c. Discriminability
d. Speededness
e. Homogeneity
5. Error associated with response characteristics
a. Response arbitrariness
b. Wiseness and familiarity responses
How can we measure reliability?
Test-retest
 the same test administered to the same test takers following an interval of no more than 2 weeks
Inter-rater reliability
 two or more independent estimates of the same performance, e.g. written scripts marked by two raters independently and the results compared
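A common way to estimate inter-rater reliability is simply to correlate the two raters’ scores. A small sketch with hypothetical ratings:

```python
# Hypothetical scores given to the same ten written scripts by two raters
rater_a = [5, 7, 6, 8, 4, 9, 6, 7, 5, 8]
rater_b = [6, 7, 5, 8, 4, 9, 7, 7, 5, 7]

n = len(rater_a)
mean_a = sum(rater_a) / n
mean_b = sum(rater_b) / n

# Pearson correlation between the two sets of ratings
cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(rater_a, rater_b))
sd_a = sum((a - mean_a) ** 2 for a in rater_a) ** 0.5
sd_b = sum((b - mean_b) ** 2 for b in rater_b) ** 0.5
r = cov / (sd_a * sd_b)
print(round(r, 2))  # 0.91 here: the two raters largely agree
```

Test-retest reliability is estimated the same way, except that the two score sets come from two administrations of the same test rather than from two raters.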
Measuring reliability [2]
Internal consistency reliability estimates
e.g.
 Split half reliability
 Cronbach’s alpha / Kuder Richardson 20
[KR20]
Split half reliability
 the test administered to a group of test takers is divided into halves, and scores on each half are correlated with the other half
 the resulting coefficient is then adjusted with the Spearman-Brown prophecy formula to allow for the fact that the total score is based on an instrument twice as long as its halves
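The procedure can be sketched as follows, using hypothetical 0/1 item scores, an odd/even split, and the Spearman-Brown adjustment:

```python
# Hypothetical 0/1 item scores for six test takers on an eight-item test
scores = [
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 1, 1, 0, 1],
    [0, 1, 0, 0, 0, 1, 0, 0],
]

# Odd/even split: total on odd-numbered items vs. total on even-numbered items
odd = [sum(row[0::2]) for row in scores]
even = [sum(row[1::2]) for row in scores]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

r_half = pearson(odd, even)

# Spearman-Brown adjustment: predicted reliability of the full-length test
r_full = 2 * r_half / (1 + r_half)
print(round(r_half, 2), round(r_full, 2))  # 0.86 0.93
```

The adjusted coefficient is higher than the half-test correlation, reflecting that the full test is twice as long as either half.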
Cronbach's Alpha [KR 20]
 this approach looks at how test takers perform on each individual item and then compares that performance against their performance on the test as a whole
 the resulting coefficient has a maximum of +1 and, like a discrimination index, can fall below zero when something is seriously wrong
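Concretely, alpha compares the sum of the individual item variances with the variance of the total scores; for dichotomously scored (0/1) items it reduces to KR-20. A minimal sketch with hypothetical data:

```python
import statistics

# Hypothetical 0/1 item scores for six test takers on a five-item test
scores = [
    [1, 1, 1, 1, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0],
]

k = len(scores[0])                     # number of items
totals = [sum(row) for row in scores]  # total score per test taker

# Population variances (divide by n), as in the classical alpha formula
item_vars = [statistics.pvariance(col) for col in zip(*scores)]
total_var = statistics.pvariance(totals)

# Cronbach's alpha; for 0/1-scored items this equals KR-20
alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(round(alpha, 2))  # 0.8
```

When the items pull together, the total-score variance is large relative to the item variances and alpha approaches 1; unrelated items drive it toward 0.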
Reliability is influenced by …
 the longer the test, the more reliable it is likely to be [though there is a point of diminishing returns]
 items which discriminate add to reliability; if the items are too easy or too difficult, reliability is likely to be lower
 if there is a wide range of abilities amongst the test takers, the test is likely to have higher reliability
 the more homogeneous the items are, the higher the reliability is likely to be
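The length effect in the first bullet can be illustrated with the general Spearman-Brown prophecy formula, which predicts reliability when a test is lengthened by a factor n (the starting reliability of 0.6 below is hypothetical):

```python
def spearman_brown(r, n):
    """Predicted reliability when a test is lengthened by a factor n,
    assuming the added items are comparable to the existing ones."""
    return n * r / (1 + (n - 1) * r)

r = 0.6  # hypothetical reliability of the original test
for n in (1, 2, 4, 8):
    print(n, round(spearman_brown(r, n), 2))
# Lengthening helps (0.6 -> 0.75 -> 0.86 -> 0.92), but each doubling
# buys less than the last: the point of diminishing returns.
```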
Validity
The extent to which the inferences or decisions we make on the basis of test scores are meaningful, appropriate, and useful. In other words, a test is said to be valid to the extent that it measures what it is supposed to measure, or can be used for the purposes for which it is intended.
The matter of concern in testing is to ensure that any test employed is valid for the purpose for which it is administered. Validity tells us what can be inferred from test scores. Validity is a quality of test interpretation and use. If test scores are affected by abilities other than the one we want to measure, they will not be meaningful indicators of that particular ability.
An example of validity:
If we ask students to listen to a lecture and to write a short
essay based on that lecture, the essays they write will be
affected by both their writing ability and their ability to
comprehend the lecture. Ratings of their essays,
therefore, might not be valid measures of their writing
ability.
It is important for test developers and test users to realize
that test validation is an ongoing process and that the
interpretations we make of test scores can never be
considered absolutely valid.
 For most kinds of validity, reliability is a
necessary but not sufficient condition.
Some types of validity:
1. Content validity
2. Construct validity
3. Consequential validity
4. Criterion-related validity
Content validity?
A form of validity which is based on the degree to which a test adequately and sufficiently measures the particular skills or behavior it sets out to measure. For example, a test of pronunciation skills in a language would have low content validity if it tested only some of the skills required for accurate pronunciation, such as a test which tested the ability to pronounce isolated sounds, but not stress, intonation, or the pronunciation of sounds within words.
The test would have content validity only if it included a proper sample of the relevant structures. Just what the relevant structures are will depend, of course, upon the purpose of the test. In order to judge whether or not a test has content validity, we need a specification of the skills, structures, etc. that it is meant to cover.
Construct validity?
A form of validity which is based on the degree to which
the items in a test reflect the essential aspects of the
theory on which the test is based.
A test is said to have construct validity if it can be
demonstrated that it measures just the ability it is
supposed to measure. For example, if the assumption
is held that systematic language habits are best
acquired at the elementary level by means of the
structural approach, then a test which emphasizes the
communication aspects of the language will have low
construct validity.