Reliability and Validity
What is reliability?
A quality of test scores which refers to the
consistency of measures across different times,
test forms, raters, and other characteristics of
the measurement context. Synonyms for
reliability are: dependability, stability,
consistency, predictability and accuracy.
For instance, a reliable man is one whose behaviour is consistent, dependable and predictable, i.e. what he will do tomorrow and next week will be consistent with what he does today and what he did last week.
-In language testing, for example, if a student receives a low score on a test one day and a high score on the same test two days later (the test does not yield consistent results), the scores cannot be considered reliable indicators of the individual’s ability.
-If two raters give widely different ratings to the same sample, we say that the ratings are not reliable.
-The notion of reliability has to do with accuracy of measurement. This kind of accuracy is reflected in obtaining similar results when measurement is repeated on different occasions, with different instruments, or by different persons.
-According to Henning [1987], reliability is a measure of the accuracy, consistency, dependability, or fairness of scores resulting from the administration of a particular examination, e.g. 75% on a test today but 83% tomorrow signals a problem with reliability.
Sources of Variance
Potential sources of score variance:
1. Those creating variance related to the purpose of the test: meaningful variance. Meaningful variance on a test is defined as the variance that is directly attributable to the testing purpose.
2. Those generating variance due to other, extraneous sources: measurement error. Measurement error is a term that describes the variance in scores on a test that is not directly related to the purpose of the test.
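In classical test theory terms, each observed score can be thought of as a true score (the meaningful variance) plus random measurement error, and reliability is the proportion of observed-score variance attributable to true scores. A minimal simulation sketch, with all numbers hypothetical:

```python
import random
import statistics

random.seed(0)

# Hypothetical "true" abilities (meaningful variance) for 1,000 examinees
true_scores = [random.gauss(70, 10) for _ in range(1000)]
# Random measurement error from extraneous sources
errors = [random.gauss(0, 5) for _ in range(1000)]
observed = [t + e for t, e in zip(true_scores, errors)]

# Reliability = meaningful (true-score) variance / total observed variance
reliability = statistics.variance(true_scores) / statistics.variance(observed)
print(round(reliability, 2))  # roughly 10**2 / (10**2 + 5**2) = 0.8
```

Shrinking the error term (better administration, scoring, and items) pushes the ratio toward 1; inflating it drags reliability toward 0.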
Potential sources of measurement error
1. Environment (location, space, noise, etc.)
2. Administration procedures (directions, timing, equipment, etc.)
3. Scoring procedures (errors in scoring, subjectivity, etc.)
4. Test and test items (item type, item quality, item security, etc.)
5. Examinees (health, motivation, memory, forgetfulness, guessing, etc.)
Dr. R. Green, Aug 2006
Test and test items
 quality of items
 number of items
 difficulty level of items
 level of item discrimination
 type of test methods
 number of test methods
 time allowed
 clarity of instructions
 use of the test
 selection of content
 sampling of content
 invalid constructs
Test taker
 familiarity with test method
 attitude towards the test i.e. interest,
motivation, emotional/mental state
 degree of guessing employed
 level of ability
Test administration
 consistency of administration procedure
 degree of interaction between invigilators and
test takers
 time of day the test is administered
 clarity of instructions
 test environment – light / heat / noise /
space / layout of room
 quality of equipment used e.g. for listening
tests
Scoring
 accuracy of the key e.g. does it include all
possible alternatives?
 inter-rater reliability e.g. in writing,
speaking
 intra-rater reliability e.g. in writing,
speaking
 machine vs. human
In order to maximize reliability we should try to minimize measurement error. For example, factors such as health, lack of interest or motivation, and test-wiseness can affect individuals’ test performance, but are not generally associated with language ability and are thus not characteristics we want to measure with a language test. Test method facets are another source of error. When we minimize the effect of these various factors, we minimize measurement error and maximize reliability.
The investigation of reliability is concerned with the question:
“How much of an individual’s test performance is due to measurement error, or to factors other than the language ability we want to measure?” and with minimizing the effects of these factors on test scores.
Unreliability
Inconsistencies in the results of measurement
stemming from factors other than the abilities
we want to measure. These are called
random factors. Such factors would result in
measurement errors which in turn can
introduce fluctuations in the observed scores
and thus reduce reliability.
Systematic factors: factors such as test method facets which are uniform from one test administration to the next. Attributes of individuals that are not related to language ability, including individual characteristics such as cognitive style and knowledge of particular content areas, and group characteristics such as sex and ethnic background, are also categorized as systematic factors. Two different effects are associated with systematic error:
1. General effect: the effect of systematic error which is consistent across all observations; it affects the scores of all individuals who take the test.
2. Specific effect: the effect which varies across individuals; it affects different individuals differently.
Test wiseness
A test taker’s capacity to exploit the characteristics and formats of the test and/or the test-taking situation to guess the correct answer and hence receive a higher score. It includes a variety of general strategies such as conscious pacing of one’s time, reading questions before the passages upon which they are based, and ruling out as many alternatives as possible in multiple-choice items and then guessing among those remaining.
Test wiseness can be divided into two basic parts:
1. Independent elements: these elements are independent of the test constructor. They are the kind often included in books that help students prepare for standardized tests, and relate to reading directions, use of time, guessing, etc.
2. Dependent elements: these elements refer to cues in the stem or in the distracters that must be reasoned out by the test taker.
Test method facet
The specific characteristics of test methods which constitute the “how” of language testing. Test performance is affected by these “facets” of test method: while one person might perform well on a multiple-choice test of reading, another may find such tasks very difficult and do better on an oral interview. Test method facets are systematic to the extent that they are uniform from one test administration to the next.
Test method facets include:
1. Testing environment
2. Test rubric (characteristics that specify how test takers are expected to proceed in taking the test)
3. Testing organization (the collection of parts in a test, which may be individual items or subtests)
4. Time allocation
5. Instructions
Factors which may contribute to unreliability
1. Fluctuations in the learner
a. Changes in the learner
b. Temporary psychological or physical changes
2. Fluctuations in scoring
a. Intra-rater variance
b. Inter-rater variance
3. Fluctuations in test administration
a. Regulatory fluctuations
b. Fluctuations in the administrative environment
4. The behavior of test characteristics
a. Test length
b. Difficulty level and boundary effect
c. Discriminability
d. Speededness
e. Homogeneity
5. Error associated with response characteristics
a. Response arbitrariness
b. Wiseness and familiarity responses
How can we measure reliability?
Test-retest
 the same test administered to the same test takers following an interval of no more than 2 weeks
Inter-rater reliability
 two or more independent estimates of the same performance, e.g. written scripts marked by two raters independently and the results compared
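A common way to estimate inter-rater reliability is simply to correlate the two raters’ scores. A small sketch with hypothetical ratings:

```python
# Hypothetical scores given to the same ten written scripts by two raters
rater_a = [5, 7, 6, 8, 4, 9, 6, 7, 5, 8]
rater_b = [6, 7, 5, 8, 4, 9, 7, 7, 5, 7]

n = len(rater_a)
mean_a = sum(rater_a) / n
mean_b = sum(rater_b) / n

# Pearson correlation between the two sets of ratings
cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(rater_a, rater_b))
sd_a = sum((a - mean_a) ** 2 for a in rater_a) ** 0.5
sd_b = sum((b - mean_b) ** 2 for b in rater_b) ** 0.5
r = cov / (sd_a * sd_b)
print(round(r, 2))  # 0.91 here: the two raters largely agree
```

Test-retest reliability is estimated the same way, except that the two score sets come from two administrations of the same test rather than from two raters.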
Measuring reliability [2]
Internal consistency reliability estimates
e.g.
 Split half reliability
 Cronbach’s alpha / Kuder Richardson 20
[KR20]
Split half reliability
 the test administered to a group of test takers is divided into halves, and scores on each half are correlated with the other half
 the resulting coefficient is then adjusted with the Spearman-Brown prophecy formula to allow for the fact that the total score is based on an instrument twice as long as its halves
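The procedure can be sketched as follows, using hypothetical 0/1 item scores, an odd/even split, and the Spearman-Brown adjustment:

```python
# Hypothetical 0/1 item scores for six test takers on an eight-item test
scores = [
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 1, 1, 0, 1],
    [0, 1, 0, 0, 0, 1, 0, 0],
]

# Odd/even split: total on odd-numbered items vs. total on even-numbered items
odd = [sum(row[0::2]) for row in scores]
even = [sum(row[1::2]) for row in scores]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

r_half = pearson(odd, even)

# Spearman-Brown adjustment: predicted reliability of the full-length test
r_full = 2 * r_half / (1 + r_half)
print(round(r_half, 2), round(r_full, 2))  # 0.86 0.93
```

The adjusted coefficient is higher than the half-test correlation, reflecting that the full test is twice as long as either half.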
Cronbach's Alpha [KR 20]
 this approach looks at how test takers perform on each individual item and then compares that performance against their performance on the test as a whole
 the resulting coefficient has a maximum of +1 and, like a discrimination index, can fall below zero when something is seriously wrong
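Concretely, alpha compares the sum of the individual item variances with the variance of the total scores; for dichotomously scored (0/1) items it reduces to KR-20. A minimal sketch with hypothetical data:

```python
import statistics

# Hypothetical 0/1 item scores for six test takers on a five-item test
scores = [
    [1, 1, 1, 1, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0],
]

k = len(scores[0])                     # number of items
totals = [sum(row) for row in scores]  # total score per test taker

# Population variances (divide by n), as in the classical alpha formula
item_vars = [statistics.pvariance(col) for col in zip(*scores)]
total_var = statistics.pvariance(totals)

# Cronbach's alpha; for 0/1-scored items this equals KR-20
alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(round(alpha, 2))  # 0.8
```

When the items pull together, the total-score variance is large relative to the item variances and alpha approaches 1; unrelated items drive it toward 0.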
Reliability is influenced by …
 the longer the test, the more reliable it is likely to be [though there is a point of diminishing returns]
 items which discriminate add to reliability; if the items are too easy or too difficult, reliability is likely to be lower
 if there is a wide range of abilities amongst the test takers, the test is likely to have higher reliability
 the more homogeneous the items are, the higher the reliability is likely to be
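The length effect in the first bullet can be illustrated with the general Spearman-Brown prophecy formula, which predicts reliability when a test is lengthened by a factor n (the starting reliability of 0.6 below is hypothetical):

```python
def spearman_brown(r, n):
    """Predicted reliability when a test is lengthened by a factor n,
    assuming the added items are comparable to the existing ones."""
    return n * r / (1 + (n - 1) * r)

r = 0.6  # hypothetical reliability of the original test
for n in (1, 2, 4, 8):
    print(n, round(spearman_brown(r, n), 2))
# Lengthening helps (0.6 -> 0.75 -> 0.86 -> 0.92), but each doubling
# buys less than the last: the point of diminishing returns.
```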
Validity
The extent to which the inferences or decisions we make on the basis of test scores are meaningful, appropriate, and useful. In other words, a test is said to be valid to the extent that it measures what it is supposed to measure, or can be used for the purposes for which it is intended.
The matter of concern in testing is to ensure that any test employed is valid for the purpose for which it is administered. Validity tells us what can be inferred from test scores. Validity is a quality of test interpretation and use. If test scores are affected by abilities other than the one we want to measure, they will not be meaningful indicators of that particular ability.
An example of validity:
If we ask students to listen to a lecture and to write a short
essay based on that lecture, the essays they write will be
affected by both their writing ability and their ability to
comprehend the lecture. Ratings of their essays,
therefore, might not be valid measures of their writing
ability.
It is important for test developers and test users to realize
that test validation is an ongoing process and that the
interpretations we make of test scores can never be
considered absolutely valid.
 For most kinds of validity, reliability is a
necessary but not sufficient condition.
Some types of validity:
1. Content validity
2. Construct validity
3. Consequential validity
4. Criterion-related validity
Content validity?
A form of validity which is based on the degree to which a test adequately and sufficiently measures the particular skills or behavior it sets out to measure. For example, a test of pronunciation skills in a language would have low content validity if it tested only some of the skills required for accurate pronunciation, such as a test which tested the ability to pronounce isolated sounds, but not stress, intonation, or the pronunciation of sounds within words.
The test would have content validity only if it included a proper sample of the relevant structures. Just what the relevant structures are will depend, of course, upon the purpose of the test. In order to judge whether or not a test has content validity, we need a specification of the skills, structures, etc. that it is meant to cover.
Construct validity?
A form of validity which is based on the degree to which
the items in a test reflect the essential aspects of the
theory on which the test is based.
A test is said to have construct validity if it can be
demonstrated that it measures just the ability it is
supposed to measure. For example, if the assumption
is held that systematic language habits are best
acquired at the elementary level by means of the
structural approach, then a test which emphasizes the
communication aspects of the language will have low
construct validity.