Designing Classroom
Language Tests
Chapter Three
Introduction
• In this chapter, you will begin the process of designing tests or revising existing tests. To start that
process, you need to ask some critical questions:
• What is the purpose of the test? Why am I creating this test or why was it created by someone else?
For an evaluation of overall proficiency? To place students into a course? To measure
achievement within a course? Once you have established the major purpose of a test, you can
determine its objectives.
• What are the objectives of the test? What specifically am I trying to find out? Establishing
appropriate objectives involves a number of issues, ranging from relatively simple ones about
forms and functions covered in a course unit to much more complex ones about constructs to be
operationalized in the test.
• How will the test specifications reflect both the purpose and the objectives? To evaluate or design a
test, you must make sure that the objectives are incorporated into a structure that appropriately
weights the various competencies being assessed.
• How will the test tasks be selected and the separate items arranged? The tasks that the test-takers
must perform need to be practical in the ways defined in the previous chapter. They should also
achieve content validity by presenting tasks that mirror those of the course (or segment thereof)
being assessed. Further, they should be able to be evaluated reliably by the teacher or scorer. The
tasks themselves should strive for authenticity, and the progression of tasks ought to be biased for
best performance.
• What kind of scoring, grading, and/or feedback is expected? Tests vary in the form and function
of feedback, depending on their purpose. For every test, the way results are reported is an
important consideration. Under some circumstances a letter grade or a holistic score may be
appropriate; other circumstances may require that a teacher offer substantive washback to the
learner.
TEST TYPES
• The first task you will face in designing a test for your students is to determine the purpose for the test.
Defining your purpose will help you choose the right kind of test, and it will also help you to focus on the
specific objectives of the test.
• Language Aptitude Tests
• One type of test, although admittedly not a very common one, predicts a person's success prior to exposure to
the second language. A language aptitude test is designed to measure capacity or general ability to learn a
foreign language and ultimate success in that undertaking. Language aptitude tests are designed to apply to
the classroom learning of any language.
• Tasks in the Modern Language Aptitude Test:
• Number Learning
• Phonetic script
• Spelling clues
• Words in sentences
Language Aptitude Tests
• There is no research to show that those kinds of tasks
predict communicative success in a language, especially
untutored acquisition of the language. Because of this
limitation, standardized aptitude tests are seldom used
today. Instead, attempts to measure language aptitude
more often provide learners with information about
their preferred styles and their potential strengths and
weaknesses, with follow-up strategies for capitalizing on
the strengths and overcoming the weaknesses. Any test
that claims to predict success in learning a language is
undoubtedly flawed because we now know that with
appropriate self-knowledge, active strategic involvement
in learning, and/or strategies-based instruction, virtually
everyone can succeed eventually.
Proficiency Tests
• If your aim is to test global competence in a language, then you are, in conventional terminology, testing
proficiency. A proficiency test is not limited to any one course, curriculum, or single skill in the language;
rather, it tests overall ability. Proficiency tests have traditionally consisted of standardized multiple-choice
items on grammar, vocabulary, reading comprehension, and aural comprehension. Sometimes a sample of
writing is added, and more recent tests also include oral production performance.
• Proficiency tests are almost always summative and norm-referenced. They provide results in the form of a
single score which is a sufficient result for the gate-keeping role they play of accepting or denying someone
passage into the next stage of a journey. And because they measure performance against a norm, with equated
scores and percentile ranks taking on paramount importance, they are usually not equipped to provide
diagnostic feedback.
• A typical example of a standardized proficiency test is the Test of English as a Foreign Language
(TOEFL), produced by the Educational Testing Service. The TOEFL is used by more than a thousand
institutions of higher education in the United States as an indicator of a prospective student's ability to
undertake academic work in an English-speaking milieu. The TOEFL consists of sections on listening
comprehension, structure (or grammatical accuracy), reading comprehension, and written expression.
Placement Tests
• Certain proficiency tests can act in the role of placement tests, the purpose of which is to place a student into a
particular level or section of a language curriculum or school.
• The English as a Second Language Placement Test (ESLPT) at San Francisco State University has three parts.
In Part I, students read a short article and then write a summary essay. In Part II, students write a composition
in response to an article. Part III is multiple-choice: students read an essay and identify grammar errors in it.
• Teachers and administrators in the ESL program at SFSU are satisfied with this test's capacity to discriminate
appropriately, and they feel that it is a more authentic test than its multiple-choice, discrete-point, grammar-
vocabulary predecessor. The practicality of the ESLPT is relatively low: human evaluators are required for the
first two parts, a process more costly in both time and money than machine-scoring multiple-choice responses.
• Placement tests come in many varieties: assessing comprehension and production, responding through
written and oral performance, open-ended and limited responses, selection (e.g., multiple-choice) and gap-
filling formats, depending on the nature of a program and its needs. Some programs simply use existing
standardized proficiency tests because of their obvious advantages in practicality: cost, speed in scoring, and
efficient reporting of results. Others prefer the performance data available in more open-ended written and/or
oral production. The ultimate objective of a placement test is, of course, to correctly place a student into a
course or level.
• Secondary benefits to consider include face validity, diagnostic information on students' performance, and
authenticity.
Diagnostic Tests
• A diagnostic test is designed to diagnose specified aspects of a language. A test in pronunciation,
for example, might diagnose the phonological features of English that are difficult for learners and
should therefore become part of a curriculum. Usually, such tests offer a checklist of features for
the administrator (often the teacher) to use in pinpointing difficulties. A writing diagnostic would
elicit a writing sample from students that would allow the teacher to identify those rhetorical and
linguistic features on which the course needed to focus special attention.
• There is also a fine line of difference between a diagnostic test and a general achievement test.
Achievement tests analyse the extent to which students have acquired language features that have
already been taught; diagnostic tests should elicit information on what students need to work on in
the future. Therefore, a diagnostic test will typically offer more detailed subcategorized
information on the learner. In a curriculum that has a form-focused phase, for example, a
diagnostic test might offer information about a learner's acquisition of verb tenses, modal
auxiliaries, definite articles, relative clauses, and the like.
Diagnostic Tests
• A typical diagnostic test of oral production was created by Clifford Prator (1972) to accompany a manual
of English pronunciation. Test-takers are directed to read a 150-word passage while they are tape-recorded.
The test administrator then refers to an inventory of phonological items for analysing a learner's
production. After multiple listenings, the administrator produces a checklist of errors in five separate
categories, each of which has several subcategories. The main categories include
• stress and rhythm,
• intonation,
• vowels,
• consonants, and
• other factors.
• An example of subcategories is shown in this list for the first category (stress and rhythm):
• a. "stress on the wrong syllable (in multi-syllabic words)
• b. incorrect sentence stress
• c. incorrect division of sentences into thought groups
• d. failure to make smooth transitions between words or syllables (Prator, 1972)
Achievement Tests
• An achievement test is related directly to classroom lessons, units, or even a total curriculum. Achievement
tests are (or should be) limited to particular material addressed in a curriculum within a particular time frame
and are offered after a course has focused on the objectives in question.
• Achievement tests can also serve the diagnostic role of indicating what a student needs to continue to work on
in the future, but the primary role of an achievement test is to determine whether course objectives have been
met, and appropriate knowledge and skills acquired, by the end of a period of instruction.
• Achievement tests are often summative because they are administered at the end of a unit or term of study.
They also play an important formative role. An effective achievement test will offer washback about the
quality of a learner's performance in subsets of the unit or course. This washback contributes to the formative
nature of such tests.
• The specifications for an achievement test should be determined by
• the objectives of the lesson, unit, or course being assessed,
• the relative importance (or weight) assigned to each objective,
• the tasks employed in classroom lessons during the unit of time,
• practicality issues, such as the time frame for the test and turnaround time, and
• the extent to which the test structure lends itself to formative washback.
• Achievement tests range from five- or ten-minute quizzes to three-hour final examinations, with an almost
infinite variety of item types and formats.
SOME PRACTICAL STEPS TO TEST CONSTRUCTION
• Assessing Clear, Unambiguous Objectives:
• Remember that every curriculum should have appropriately framed assessable objectives, that
is, objectives that are stated in terms of overt performance by students. Thus, an objective that
states "Students will learn tag questions" or simply names the grammatical focus "Tag questions"
is not testable. You don't know whether students should be able to understand them in spoken or
written language, or whether they should be able to produce them orally or in writing. Nor do
you know in what context (a conversation? an essay? an academic lecture?) those linguistic forms
should be used. Your first task in designing a test, then, is to determine appropriate objectives.
Drawing Up Test Specifications
• In the unit discussed before, your
specifications will simply comprise (a) a
broad outline of the test, (b) what skills you
will test, and (c) what the items will look like.
• (a) Outline of the test and (b) skills to be
included. Because of the constraints of your
curriculum, your unit test must take no more
than 30 minutes. This is an integrated
curriculum, so you need to test all four skills.
• (c) Item types and tasks. The next and
potentially more complex choices involve the
item types and tasks to use in this test.
• Consider the options: the test prompt can be
oral (student listens) or written (student
reads), and the student response can be oral or written.
Drawing Up Test Specifications
These informal, classroom-oriented specifications give you an indication of
the topics (objectives) you will cover,
the implied elicitation and response formats for items,
the number of items in each section, and
the time to be allocated for each.
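To make this concrete, informal specifications of this kind can be jotted down as a simple table or, if you prefer, a few lines of code. The section names, item counts, and timings below are hypothetical placeholders for a 30-minute, four-skill unit test, not the actual specifications discussed in this chapter:

```python
# Hypothetical informal specifications for a 30-minute, four-skill unit test.
test_specs = [
    # (section / skill, elicitation -> response format, items, minutes)
    ("Oral interview (speaking)", "oral prompt -> oral response",      4,  5),
    ("Listening comprehension",   "oral prompt -> multiple-choice",    8,  8),
    ("Reading",                   "written prompt -> multiple-choice", 6,  7),
    ("Writing",                   "written prompt -> short paragraph", 1, 10),
]

# The curriculum constrains the test to 30 minutes in total.
total_minutes = sum(minutes for *_, minutes in test_specs)
assert total_minutes <= 30, "Unit test must fit the 30-minute constraint."
```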
Devising Test Tasks
• Your oral interview comes first, and so you draft questions to conform to the accepted pattern of oral
interviews. You begin and end with non-scored items (warm-up and wind-down) designed to set students at
ease, and then sandwich between them items intended to test the objective (level check) and a little beyond
(probe).
Devising Test Tasks
• You are now ready to draft other test items. To provide a sense of authenticity and interest, you have decided
to conform your items to the context of a recent TV sitcom that you used in class to illustrate certain
discourse and form-focused factors. The sitcom depicted a loud, noisy party with lots of small talk. As you
devise your test items, consider such factors as how students will perceive them (face validity), the extent to
which authentic language and contexts are present, potential difficulty caused by cultural schemata, the length
of the listening stimuli, how well a story line comes across, how things like the cloze testing format will work,
and other practicalities.
• Let's say your first draft of items produces the following possibilities within each section:
Devising Test Tasks
In revising your draft, you will want to ask yourself some important questions:
Are the directions to each section absolutely clear?
Is there an example item for each section?
Does each item measure a specified objective?
Is each item stated in clear, simple language?
Does each multiple-choice item have appropriate distractors; that is, are the wrong items clearly
wrong and yet sufficiently "alluring" that they are not ridiculously easy? (See below for a primer
on creating effective distractors.)
Is the difficulty of each item appropriate for your students?
Is the language of each item sufficiently authentic?
Do the sum of the items and the test as a whole adequately reflect the learning objectives?
Designing Multiple-Choice Test Items
• Hughes (2003, pp. 76-78) cautions against a number of weaknesses of multiple-choice items:
• The technique tests only recognition knowledge.
• Guessing may have a considerable effect on test scores.
• The technique severely restricts what can be tested.
• It is very difficult to write successful items.
• Cheating may be facilitated.
• Since there will be occasions when multiple-choice items are appropriate, consider the following four
guidelines for designing multiple-choice items for classroom-based and large-scale situations (adapted from
Gronlund, 1998, pp. 60-75, and J. D. Brown, 1996, pp. 54-57).
• 1. Design each item to measure a specific objective. Consider this item introduced, and then revised, in the
sample test above:
• Multiple-choice item, revised
Designing Multiple-Choice Test Items
2. State both stem and options as simply and directly as possible.
• We are sometimes tempted to make multiple-choice items too wordy. A good rule of thumb is to get directly
to the point. Here's an example.
You might argue that the first two sentences of this item give it some authenticity and accomplish a bit of
schema setting. But if you simply want a student to identify the type of medical professional who deals with
eyesight issues, those sentences are superfluous. Moreover, by lengthening the stem, you have introduced a
potentially confounding lexical item, deteriorate, that could distract the student unnecessarily.
Designing Multiple-Choice Test Items
Another rule of succinctness is to remove needless redundancy from your options. In the following item, the phrase
"which were" is repeated in all three options. It should be placed in the stem to keep the item as succinct as possible.
3. Make certain that the intended answer is clearly the only correct one.
In the proposed unit test described earlier, the following item appeared in the original draft:
4. Use item indices to accept, discard, or revise items.
• The appropriate selection and arrangement of suitable multiple-choice items on a test can best be accomplished by
measuring items against three indices: item facility (or item difficulty), item discrimination (sometimes called item
differentiation), and distractor analysis. Although measuring these factors on classroom tests would be useful, you
probably will have neither the time nor the expertise to do this for every classroom test you create, especially one-time
tests. But they are a must for standardized norm-referenced tests that are designed to be administered a number of
times and/or administered in multiple forms.
• 1. Item facility (or IF) is the extent to which an item is easy or difficult for the proposed group of test-takers. You may
wonder why that is important if in your estimation the item achieves validity. The answer is that an item that is too easy
(say 99 percent of respondents get it right) or too difficult (99 percent get it wrong) really does nothing to separate
high-ability and low-ability test-takers. It is not really performing much "work" for you on a test.
• IF simply reflects the percentage of students answering the item correctly. The formula is: IF = (number of students answering the item correctly) ÷ (total number of students responding to the item).
For example, if you have an item on which 13 out of 20 students respond correctly, your IF index is 13 divided by 20 or
.65 (65 percent). There is no absolute IF value that must be met to determine if an item should be included in the test as
is, modified, or thrown out, but appropriate test items will generally have IFs that range between .15 and .85. Two good
reasons for occasionally including a very easy item (.85 or higher) are to build in some affective feelings of "success"
among lower ability students and to serve as warm-up items. And very difficult items can provide a challenge to the
highest-ability students.
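Because IF is simply a proportion, it can be computed in a few lines. The following Python sketch is illustrative (the function names are placeholders, not from the text) and uses the 13-out-of-20 example above:

```python
def item_facility(num_correct, num_respondents):
    """Item facility (IF): proportion of test-takers answering the item correctly."""
    return num_correct / num_respondents

def flag_for_review(if_value, low=0.15, high=0.85):
    """Flag items outside the commonly cited .15-.85 range for possible revision."""
    return if_value < low or if_value > high

# Example from the text: 13 of 20 students answer the item correctly.
print(item_facility(13, 20))                    # 0.65
print(flag_for_review(item_facility(13, 20)))   # False
```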
2. Item discrimination (ID) is the extent to which an item differentiates between high- and low-ability test-takers. An
item on which high-ability students (who did well in the test) and low-ability students (who didn't) score equally well
would have poor ID because it did not discriminate between the two groups. Conversely, an item that garners correct
responses from most of the high-ability group and incorrect responses from most of the low-ability group has good
discrimination power.
Suppose your class of 30 students has taken a test. Once you have calculated final scores for all 30 students, divide
them roughly into thirds; that is, create three rank-ordered ability groups including the top 10 scores, the middle 10,
and the lowest 10. To find out which of your 50 or so test items were most "powerful" in discriminating between high
and low ability, eliminate the middle group, leaving two groups with results that might look something like this on a
particular item: for instance, 7 of the 10 high-ability students answer the item correctly, but only 2 of the 10 low-ability students do.
Using the ID formula (7 - 2 = 5; 5 ÷ 10 = .50), you would find that this item has an ID of .50, or a moderate level. The
formula for calculating ID is: ID = (number correct in the high group - number correct in the low group) ÷ (½ × the total number of test-takers in the two groups).
The result of this example item tells you that the item has a moderate level of ID. High discriminating power would
approach a perfect 1.0, and no discriminating power at all would be zero. In most cases, you would want to discard
an item that scored near zero. As with IF, no absolute rule governs the establishment of acceptable and unacceptable
ID indices.
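As a rough illustration, the ID calculation can be sketched in the same way; the function name below is a placeholder, and the figures are the 7-versus-2 example above:

```python
def item_discrimination(correct_high, correct_low, n_high, n_low):
    """Item discrimination (ID): difference in correct responses between the
    high-scoring and low-scoring groups, divided by half the total number of
    test-takers in the two comparison groups."""
    return (correct_high - correct_low) / ((n_high + n_low) / 2)

# Example from the text: 7 of the top 10 and 2 of the bottom 10 answer correctly.
print(item_discrimination(7, 2, 10, 10))  # 0.5, a moderate level of ID
```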
One clear, practical use for ID indices is to select items from a test bank that includes more items than you need. You
might decide to discard or improve some items with lower ID because you know they won't be as powerful an
indicator of success on your test.
3. Distractor efficiency is one more important measure of a multiple-choice item's value in a test, and one that is
related to item discrimination. The efficiency of distractors is the extent to which (a) the distractors "lure" a sufficient
number of test-takers, especially lower-ability ones, and (b) those responses are somewhat evenly distributed across
all distractors. Those of you who have a fear of mathematical formulas will be happy to read that there is no formula
for calculating distractor efficiency and that an inspection of a distribution of responses will usually yield the
information you need.
Consider the following. The same item (#23) used above is a multiple-choice item with five choices, and responses
across upper- and lower-ability students are distributed as follows:
No mathematical formula is needed to tell you that this item successfully attracts seven of the ten high-ability students
toward the correct response, while only two of the low-ability students get this one right. As shown above, its ID is .50,
which is acceptable, but the item might be improved in two ways: (a) Distractor D doesn't fool anyone. No one picked
it, and therefore it probably has no utility. A revision might provide a distractor that actually attracts a response or two.
(b) Distractor E attracts more responses (2) from the high-ability group than the low-ability group (0). Why are good
students choosing this one? Perhaps it includes a subtle reference that entices the high group but is "over the head" of the
low group, and therefore the latter students don't even consider it.
The other two distractors (A and B) seem to be fulfilling their function of attracting some attention from lower-ability
students.
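Because distractor efficiency is judged by inspecting the response distribution rather than by a formula, a simple tally is all that is needed. The sketch below assumes a hypothetical distribution consistent with the description of item #23 (C is taken to be the key, and the exact split between distractors A and B is invented for illustration):

```python
# Hypothetical response counts for item #23 (10 high-ability, 10 low-ability students).
# The A/B split is illustrative; the text only says they attract "some" low-ability students.
responses = {
    "high": {"A": 0, "B": 1, "C": 7, "D": 0, "E": 2},
    "low":  {"A": 3, "B": 5, "C": 2, "D": 0, "E": 0},
}
key = "C"

for option in "ABDE":  # inspect the distractors only
    high, low = responses["high"][option], responses["low"][option]
    if high + low == 0:
        print(f"Distractor {option}: attracts no one; consider replacing it.")
    elif high > low:
        print(f"Distractor {option}: lures more high- than low-ability students; check its wording.")
    else:
        print(f"Distractor {option}: working as intended ({low} low vs. {high} high).")
```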
SCORING, GRADING, AND GIVING FEEDBACK:
As you design a classroom test, you must consider how the test will be scored and graded. Your scoring plan reflects the
relative weight that you place on each section and on the items within each section. The integrated-skills class that we have been
using as an example focuses on listening and speaking skills with some attention to reading and writing. Three of your
nine objectives target reading and writing skills. How do you assign scoring to the various components of this test?
Conclusion