Unit A4: Test
Specifications
and Designs
Presented by:
Amir Hamid Forough Ameri
ahfameri@gmail.com
October 2015
Introduction
Test specifications – usually called ‘specs’ – are generative explanatory
documents for the creation of test tasks.
Specs tell us the nuts and bolts of:
•how to phrase the test items,
•how to structure the test layout,
•how to locate the passages,
•and how to make a host of difficult choices as we prepare test materials.
More importantly, they tell us the rationale behind the various choices that we make.
• The idea of specs is rather old.
• Ruch (1924) may be the earliest proponent of
something like test specs.
• Fulcher and Davidson believe that specs are
actually a common-sense notion in test
development.
Planning seems wise in test authorship.
• Specs are often called blueprints. Blueprints are used to build structures,
and from blueprints many equivalent structures can be erected.
• The classic utility of specs lies in test equivalence. Suppose we had a
particular test task and we wanted an equivalent task with
• the same difficulty level,
• the same testing objective, but
• different content.
We want this equivalence so that we can vary our test without varying its results.
• We simply want a new version of the same test with the same assurance of
reliability and validity. This was the original purpose of specs.
• Equivalence, reliability, and validity are
all assured because, in addition to
the mechanical blueprint-like
function of specs, they also serve
as a formal record of critical
dialogue.
PLANNING IN TEST AUTHORING
• How much can we actually plan in any test?
According to Ruch (1924), detailed rules of procedure in the
construction of an objective examination can hardly be
formulated.
The type of questions must be decided on the basis of such
facts as:
• the school subject concerned,
• the purposes of the examination,
• the length and reliability of the proposed examination,
• preferences of teachers and pupils, …
• Kehoe (1995) presents a series of guidelines for creating
multiple-choice test items.
• These guidelines are spec-like in their advice:
1. Before writing the stem, identify the one point to be tested by
that item. In general, the stem should not pose more than one
problem.
2. Construct the stem to be either an incomplete statement or a
direct question, avoiding stereotyped phraseology.
Example:
In a recent survey, eighty per cent of drivers were found to
wear seat belts. Which of the following could be true?
(a) Eighty of one hundred drivers surveyed were wearing seat
belts.
(b) Twenty of one hundred drivers surveyed were wearing seat
belts.
(c) One hundred drivers were surveyed, and all were wearing seat
belts.
(d) One hundred drivers were surveyed, and none were wearing
seat belts.
GUIDING LANGUAGE VERSUS SAMPLES
All test specifications have two components: the sample item (or task) itself and the
guiding language that surrounds it.
Note: Guiding language comprises all parts of the test spec other than the sample itself.
For the above seat-belt sample, guiding language might include:
[1] This is a four-option multiple-choice test question.
[2] The stem shall be a statement followed by a question about
the statement.
[3] Each choice shall be plausible against real-world knowledge,
and each choice shall be internally grammatical.
[4] The key shall be the only inference that is feasible from the
statement in the stem.
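To make the two-component structure concrete, here is a minimal sketch in Python (the class and field names are illustrative, not part of Fulcher and Davidson's framework) that pairs the guiding language above with its sample item:

from dataclasses import dataclass, field

@dataclass
class TestSpec:
    """A test specification: guiding language plus one or more samples."""
    title: str
    guiding_language: list[str]          # e.g., rules [1]-[4] above
    sample_items: list[str] = field(default_factory=list)

seat_belt_spec = TestSpec(
    title="Inference from a survey statement",
    guiding_language=[
        "[1] This is a four-option multiple-choice test question.",
        "[2] The stem shall be a statement followed by a question "
        "about the statement.",
        "[3] Each choice shall be plausible against real-world knowledge, "
        "and each choice shall be internally grammatical.",
        "[4] The key shall be the only inference that is feasible "
        "from the statement in the stem.",
    ],
    sample_items=["In a recent survey, eighty per cent of drivers ..."],
)

From such an object, an item writer can draft new items against guiding_language and file the results under sample_items.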
CONGRUENCE (OR FIT-TO-SPEC)
• Congruence (also called fit-to-spec) is the degree to which a new item
fits an existing spec.
• Our next step, above, was to induce guiding language from the seat-belt item
so that equivalent items could be written. For example (a fit-to-spec sketch
follows this item):
• The vast majority of parents (in a recent survey) favour stricter attendance
regulations at their children’s schools. Which of the following could be true?
• (a) Most parents want stricter attendance rules.
• (b) Many parents want stricter attendance rules.
• (c) Only a few parents think current attendance rules are acceptable.
• (d) Some parents think current attendance rules are acceptable.
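Congruence review is ultimately a human judgement, but some checks can be written down mechanically. The sketch below is illustrative Python, not an actual fit-to-spec tool: each piece of guiding language becomes a predicate, and a report shows which rules a candidate item satisfies. Rule [3] (real-world plausibility) is left out because it needs human judgement.

from typing import Callable

Rule = Callable[[dict], bool]

RULES: dict[str, Rule] = {
    "[1] four options": lambda item: len(item["choices"]) == 4,
    "[2] statement then question": lambda item:
        "." in item["stem"] and item["stem"].rstrip().endswith("?"),
    "[4] exactly one key": lambda item: len(item["keys"]) == 1,
}

def congruence_report(item: dict) -> dict[str, bool]:
    """Rule-by-rule fit of a candidate item to the spec."""
    return {name: rule(item) for name, rule in RULES.items()}

attendance_item = {
    "stem": ("The vast majority of parents favour stricter attendance "
             "regulations at their children's schools. "
             "Which of the following could be true?"),
    "choices": ["(a)", "(b)", "(c)", "(d)"],
    "keys": ["(a)"],   # assumed key, for illustration only
}
print(congruence_report(attendance_item))
# all three encoded rules report True for this item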
HOW DO TEST QUESTIONS ORIGINATE?
• Reverse engineering
• Archetypes
• Spec writing is an organic process.
• Critical dialogue and revision can cause the spec to grow, to
evolve, and to better represent what its
development team wishes to accomplish.
• We can follow this organic process by use of
an audit trail (Li, 2006), which is a narrative of
how we feel the test has improved.
REVERSE ENGINEERING
• Most tests are created from tests we have already seen or
have already experienced (perhaps as test takers).
• Often, test creation starts by looking at existing tests – a
process known as reverse engineering.
• Reverse engineering (RE) is an idea of ancient origin; the
name was coined by Davidson and Lynch (2002):
• RE is an analytical process of test creation that begins with an
actual test question and infers the guiding language that drives
it, such that equivalent items can be generated.
There are five types of RE:
•Straight RE: this is when you infer guiding
language about existing items without changing
the existing items at all.
•The purpose is solely to produce equivalent test
questions.
• Historical RE: this is straight RE across several
existing versions of a test.
• If the archives at your teaching institution
contain tests that have changed and evolved,
you can do RE on each version to try to
understand how and why the tests changed.
• Critical RE: perhaps the most common form of
RE. As we analyse an item, we think critically:
• Are we testing what we want?
• Do we wish to make changes in our test
design?
• Test deconstruction RE:
• The term ‘test deconstruction’ was coined by Elatia (2003).
• Whether critical or straight, whether historical or not, test
deconstruction provides insight beyond the test setting. We may
discover larger realities.
• Why, for instance, would our particular test setting so value
close reading for students in the seat-belt and attendance
items?
• What role does close inferential reading play in the school
setting?
• Parallel RE: In some cases, teachers are asked to produce
tests according to external influences, what Davidson and
Lynch (2002) call the ‘mandate’.
• Teachers may feel compelled to design tests that adhere to
these external standards.
• If we obtain sample test questions from several teachers which
(the teachers tell us) measure the same thing, and then
perform straight RE on the samples, and then compare the
resulting specs, we are using RE as a tool to determine
parallelism.
1. Charles sent the package __________ Chicago.
   *a. to    b. on    c. at    d. with
2. Mary put the paper __________ the folder.
   a. at    b. for    *c. in    d. to
3. Steven turned __________ the radio.
   *a. on    b. at    c. with    d. to
• At first glance, the items all seem to test the English grammar of
prepositions. Item 3, however, seems somewhat different. It is
probably a test of two-word or “phrasal” verbs. If we reverse-
engineered a spec from these items, we might have to decide
whether item 3 is actually in a different skill domain and should be
reverse-engineered into a different spec. (Davidson & Lynch, 2002)
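As a crude picture of the comparison step in parallel RE, applied to examples like the one above, one could treat each reverse-engineered rule set as a set of strings and measure their overlap. This is a toy Python sketch; real spec comparison is a qualitative judgement, and the rule names are invented:

def rule_overlap(spec_a: set[str], spec_b: set[str]) -> float:
    """Jaccard overlap between two reverse-engineered rule sets."""
    return len(spec_a & spec_b) / len(spec_a | spec_b)

# Rules inferred (hypothetically) from items 1-2 versus item 3 above.
items_1_2 = {"four-option MCQ", "one blank per stem", "tests prepositions"}
item_3    = {"four-option MCQ", "one blank per stem", "tests phrasal verbs"}
print(rule_overlap(items_1_2, item_3))  # 0.5 -> only partly parallel specs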
ARCHETYPES
• An ‘archetype’ is a canonical item or task; it is the typical
way to measure a particular target skill.
• When we write an item and then step back to look at it, we will often see
echoes of an item or task that we have used in the past, that
we have done (as test takers) or that we have studied in a
testing textbook.
• Perhaps all supposed spec-driven testing is actually a form of
reverse engineering: we try to write a blueprint to fit item or
task types – archetypes – that are already in our experience.
TOWARDS SPEC-DRIVEN THEORY
• Specs exist. Somehow, somewhere, the system is able to
articulate the generative guiding language behind any sample
test items or tasks.
• Written specs have two great advantages:
• firstly, they spark creative critical dialogue about the testing;
• secondly, they are wonderful training tools when newcomers
join the organization.
• Specs evolve. They grow and change, in creative engineering-
like response to changes in theory, in curriculum, in funding,
and so on.
• The specs (and the test they generate) are not launched until
ready. The organization works hard to discuss and to try out
test specs and their items and tasks, resisting a rush to
production, examining all the evidence available.
• Discussion happens, and that leads to transparency. Spec-
driven testing can lead to ‘shared authority, collaboration, and
involvement of different stake-holders’.
• And this is discussion to which all are welcome.
Unit A5: Writing
Items and Tasks
INTRODUCTION
• One of the most common mistakes made in language testing is for the test
writer to begin the process by writing a set of items or tasks that are
intuitively felt to be relevant to the test takers.
• Instead, an iterative process, one in which all tasks are undertaken in a
cyclical fashion, should be used.
• So test specifications are not set until the final test is ready to be rolled out in
actual use.
• A well-built test can be adapted and changed even after it is operational and
as things change, provided that we have a clear record (specifications) of
how it was built in the first place and evidence is collected to support the link
between the task and the construct.
• A test task is a device that allows the tester to collect evidence. This
evidence is a response from the test taker. The ‘response as evidence’
indicates that we are using the responses to make inferences about the
ability of the test taker to use language in the domains and situations defined
in the test specifications.
EVIDENCE-CENTRED DESIGN (ECD)
• It is important that we see the tasks or items that we design for tests as part
of a larger picture, and one approach to doing this in a systematic way is
ECD. ECD is a methodology for designing assessments that underscores
the central role of evidentiary reasoning in assessment design. ECD is
based on three premises:
(1) An assessment must be built around the important knowledge in the domain of
interest;
(2) The chain of reasoning from what participants say and do in assessments
to inferences about what they know must be based on the principles of
evidentiary reasoning;
(3) Purpose must be the driving force behind design decisions, which reflect
constraints, resources and conditions of use. (Mislevy et al., 2003: 20)
• Evidentiary reasoning underpins the validity argument. → The argument
shows the reasoning that supports the inferences we make
from test scores to what we claim those scores mean.
• Evidentiary reasoning is the reasoning that leads from evidence
to the evaluation of the strength of the validity claim.
• All tests are indirect. From test performance we obtain a score,
and from the score we draw inferences about the constructs the
test is designed to measure. It is therefore a ‘construct-centered
approach’, as described by Messick (1994).
• The way in which we do things reflects how we think we know things.
Peirce (1878) argued that there were four ways of knowing:
■ Tenacity: we believe what we do because we have always believed this and not
questioned it. → What we do is therefore based on what we have always done.
■ Authority: we believe what we do because these beliefs come from an authoritative
source. → What we do therefore follows the practice dictated by this source.
■ A priori: what we believe appears reasonable to us when we think about it, because
it feels intuitively right. → What we do is therefore what we think is reasonable and
right.
■ Scientific: what we believe is established by investigating what happens in the world.
→ What we do is therefore based on methods that lead to an increase in
knowledge.
→ECD treats knowledge as scientific because it is primarily a method that leads
to the test designer understanding more about relations between variables for
a particular assessment context.
The structure of ECD
ECD claims to provide a very systematic way of thinking about the process of
assessment and the place of the task within that process.
ECD is considered to be a ‘framework’ in the sense that it is structured and formal
and thus enables ‘the actual work of designing and implementing assessments’
(Mislevy et al., 2003: 4) in a way that makes a validity argument more explicit.
It is sometimes referred to as a conceptual assessment framework (CAF).
Within this framework are a number of models, and these are defined as design
objects. These design objects help us to think about how to go about the practical work
of designing a test.
Within ECD-style test specification there are six models or design objects:
1. Student model. This comprises a statement of the particular mix of knowledge,
skills or abilities about which we wish to make claims as a result of the test. In other
words, it is the list of constructs that are relevant to a particular testing situation. This is
the highest-level model, and needs to be designed before any other models can be
addressed, because it defines what we wish to claim about an individual test taker: what
are we testing?
2. Evidence models. Once we have selected constructs for the student model, we
need to ask what evidence we need to collect in order to make inferences from
performance to underlying knowledge or ability: what evidence do we need to test the
construct(s)? In ECD the evidence is frequently referred to as a work product, which
means whatever comes from what the test takers do. It includes two components:
a) an evaluation component: the focus here is on the evidentiary interrelationships
drawn among characteristics of students, of what they say and do, and of task and
real-world situations;
b) a measurement component, which links the observable variables to the student
model by specifying how we score the evidence. This turns what we observe into the
score from which we make inferences.
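As a toy illustration of the measurement component, the Python sketch below turns an observed work product into an observable variable; the 0/1 scoring rule and the data are invented for the example, not taken from Mislevy et al.:

def score_mc_response(work_product: str, key: str) -> int:
    """Measurement component (toy version): turn an observed response
    into an observable variable, here a simple right/wrong score."""
    return 1 if work_product.strip().lower() == key.lower() else 0

responses = ["a", "c", "b"]   # hypothetical work products
keys = ["a", "c", "a"]        # hypothetical scoring key
total = sum(score_mc_response(r, k) for r, k in zip(responses, keys))
print(f"{total}/{len(keys)} correct")   # 2/3 correct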
3. Task models: How do we collect the evidence? Task models therefore describe the
situations in which test takers respond to items or tasks that generate the evidence we
need. Task models minimally comprise three elements: the presentation material,
or input; the work products, or what the test takers actually do; and finally, the task
model variables that describe task features. Task features are those elements that tell
us what the task looks like, and which parts of the task are likely to make it more or
less difficult.
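The three minimal elements of a task model can be sketched as a simple data structure (illustrative Python; the field names and values are mine, not ECD's formal notation):

from dataclasses import dataclass

@dataclass
class TaskModel:
    """The three minimal elements of an ECD task model."""
    presentation_material: str             # the input shown to the test taker
    work_product: str                      # what the test taker actually does
    task_model_variables: dict[str, str]   # features that shape difficulty

reading_task = TaskModel(
    presentation_material="300-word expository text with four MC items",
    work_product="options selected for items 1-4",
    task_model_variables={
        "text_length": "300 words",
        "lexical_level": "B2",
        "response_format": "multiple choice",
    },
)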
4. Presentation model. Items and tasks can be presented in many different formats.
A text and set of reading items may be presented in paper and pencil format, or on
a computer. The presentation model describes how these will be laid out and
presented to the test takers.
5. Assembly model. An assembly model accounts for how the student model,
evidence models and task models work together. It does this by specifying two
elements: targets and constraints.
• A target is the reliability with which each construct in a student model should be
measured.
• A constraint relates to the mix of items or tasks on the test that must be included in
order to represent the domain adequately: how much do we need to test?
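A sketch of the assembly model's two elements, again in illustrative Python with invented numbers: targets state how reliably each construct must be measured, and constraints state what mix of items any assembled form must contain:

# Targets: minimum reliability per construct in the student model (assumed values).
targets = {"reading_inference": 0.80, "grammar": 0.85}

# Constraints: minimum number of items per construct on any assembled form.
constraints = {"reading_inference": 10, "grammar": 15}

def form_is_acceptable(reliabilities: dict, item_counts: dict) -> bool:
    """Check an assembled form against the assembly model."""
    meets_targets = all(reliabilities.get(c, 0.0) >= r
                        for c, r in targets.items())
    meets_constraints = all(item_counts.get(c, 0) >= n
                            for c, n in constraints.items())
    return meets_targets and meets_constraints

print(form_is_acceptable({"reading_inference": 0.82, "grammar": 0.87},
                         {"reading_inference": 12, "grammar": 15}))  # True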
6. Delivery model. This final model is not independent of the others, but explains
how they will work together to deliver the actual test – for example, how the modules
will operate if they are delivered in computer-adaptive mode, or as set paper and
pencil forms. Of course, changes at this level will also impact on other models and
how they are designed. This model would also deal with issues that are relevant at
the level of the entire test, such as test security and the timing of sections of the test
(Mislevy et al., 2003: 13). However, the delivery model also contains four processes,
referred to as the delivery architecture (see further discussion in Unit A8): the
presentation process, response processing, summary scoring and activity selection.
[Figure] Models in the conceptual assessment framework of ECD (adapted from Mislevy
et al., 2003): Student Models, Evidence Models, Task Models, Assembly Model,
Presentation Model, Delivery Model.
Thank You
More Related Content
• How to make tests more reliable (PPT)
• Testing Pronunciation (PPT)
• Test specifications and designs (PPT)
• Achieving beneficial blackwash (PPTX)
• The universal grammar approach (PPTX)
• Contrastive analysis (PPTX)
• Fulcher and Davidson Unit a5 (PDF)
• Notional functional syllabus design (PPT)
What's hot (20)
• Content based syllabus (PPTX)
• discrete-point and integrative testing (PPTX)
• Testing listening slide (PPTX)
• Testing grammar and vocabulary (PPTX)
• Testing Grammar and Vocabulary Skill (PPTX)
• MELT 104 Functional Grammar (PPTX)
• Discourse Analysis and Pragmatics (PPT)
• Assessing grammar (PPTX)
• Practical Language Testing Glenn Fulcher (PPTX)
• Binding theory (PPTX)
• Testing Grammar (PPTX)
• Communicative Testing (PPTX)
• Reading test specifications assignment-01-ppt (PPTX)
• What is co text (PPTX)
• Language curriculum design (PPTX)
• Reliability in Language Testing (PPT)
• Social Interaction Approach (PPT)
• Contrastive analysis (PPTX)
• Interactionist approach (mobin bozorgi) (PPTX)
• Corpus linguistics the basics (PPTX)
Viewers also liked (17)
• 2016 resume v3 (PDF)
• Mandi's Resume (ODT)
• Domaine le jardin presentation (1) (PDF)
• Diario Resumen 20170105 (PDF)
• Los precursores del marginalismo (DOCX)
• El ser humano.ppt (PPT)
• Ppt on cognitive items, items types and computer based test items (PPT)
• 1998 2006 kpds_kelimeleri_listesi (PDF)
• Alexander - Education in the Internet of Everything (PDF)
• Toxic Workplace: Bullying (PPTX)
• Testing oral ability (PPTX)
• Change agent (PPTX)
• (untitled) (PDF)
• change agent (PPT)
• FinTech Belgium - Compliance Regulation & Innovation - Monizze (PPTX)
• The Ultimate Guide To Converting JPG to DWG (PDF)
Similar to Test specifications and designs session 4 (20)
• test specification for MA Linguisticsppt.pptx (PPTX)
• Bioscience Laboratory Workforce Skills - part II (PPTX)
• Readingtestspecifications assignment-01-ppt-141130013903-conversion-gate01 - ... (PPTX)
• La notes (5 10) (PPTX)
• English Proficiency Test (PPTX)
• Clean tests (PDF)
• HDP PPT.pptx (PPTX)
• Lt j-test construction procedure (PPTX)
• Business Research Method - Unit II, AKTU, Lucknow Syllabus (PPTX)
• notebook-lesson_science_subject_in_educa (PPTX)
• Test Construction in language skills.pptx (PPTX)
• Item development.pdf for national examination development (PDF)
• Assessment 2 Briefing and Guidance (Full).pptx (PPTX)
• Ssr test construction admin and scoring (PPT)
• Construction of a proper test (PPTX)
• Designing Calssroom Language Test and Test Methods (DOCX)
• Resaerch-design-Presentation-MTTE-1st-sem-2023.pptx (PPTX)
• HND_MSCP_W5_Reliability_and_Validity_of_Research.pdf (PDF)
• RM & IPR Module1 PPT by Prof. Manjula K, Assistant Professor, Dept. of ECE, S... (PPTX)
More from Amir Hamid Forough Ameri (20)
• The task based approach some questions and suggestions littlewood (PPTX)
• Task based research and language pedagogy ellis (PPTX)
• (untitled) (PPTX)
• Notional functional syllabus (PPTX)
• Integrated syllabus (PPTX)
• Exploring culture by ah forough ameri (PPTX)
• Critical pedagogy in l2 learning and teaching suresh canagarajah (PPTX)
• Critical literacy and second language learning luke and dooley (PPTX)
• Context culture .... m. wendt (PPTX)
• Thesis summary by amir hamid forough ameri (PPTX)
• The role of corrective feedback in second language learning (PPTX)
• Standards based classroom assessments of english proficiency (PPTX)
• Reliability bachman 1990 chapter 6 (PPTX)
• Reliability and dependability by neil jones (PPTX)
• Language testing the social dimension (PPTX)
• Extroversion introversion (PPTX)
• Developing a comprehensive, empirically based research framework for classroo... (PPTX)
• Classroom assessment, glenn fulcher (PPTX)
• Behavioral view of motivation (PPT)