Language Testing: The Social Dimension
Tim McNamara and
Carsten Roever
(Pages 1-64)
Presented by: Amir Hamid Forough Ameri
ahfameri@gmail.com
December 2015
Chapter 2: Validity and the Social Dimension
of Language Testing
In what ways is the social dimension of language assessment reflected
in current theories of the validation of language tests?
Mislevy
Lynch
Cronbach
Kunnan
Bachman
Shohamy
Messick
Chapelle
Kane
Cronbach
Contemporary discussions of validity in educational assessment are
heavily influenced by the thinking of the American Lee Cronbach;
The “father” of construct validity.
Meehl and Challman coined the term “construct validity.”
Cronbach
Their new concept of construct validity was an alternative to criterion-
related validity:
Construct validity is ordinarily studied when the tester has no definite
criterion measure of the quality with which he [sic] is concerned and
must use indirect measures. Here the trait or quality underlying the test
is of central importance, rather than either the test behavior or the scores
on the criteria. (Cronbach & Meehl, 1955, p. 283)
Cronbach
Since Cronbach and Meehl’s article:
A: There has been the increasingly central role taken by construct
validity,
B: There is also clear recognition that validity is not a mathematical
property like discrimination or reliability, but a matter of judgment.
Cronbach (1989) emphasized the need for a validity argument, which
focuses on collecting evidence for or against a certain interpretation of
test scores: In other words, it is the validity of inferences that construct
validation work is concerned with, rather than the validity of
instruments.
Cronbach
In fact, Cronbach argued that there is no such thing as a “valid test,”
“One does not validate a test, but only a principle for making
inferences” (Cronbach & Meehl, 1955, p. 297);
“One validates not a test, but an interpretation of data arising from a
specified procedure” (Cronbach, 1971, p. 447).
Cronbach
Cronbach and Meehl (1955) distinguished between a weak and a
strong program for construct validation:
 The weak one is a fairly haphazard collection of any sort of evidence
(mostly correlational) that supports the particular interpretation to be
validated.
In contrast, the strong program is based on the falsification idea
advanced by Popperian philosophy (Popper, 1962): Rival hypotheses
for interpretations are proposed and logically or empirically examined.
Cronbach
Cronbach (1988) admitted that the actual approach taken in most
validation research is “confirmationist” rather than “falsificationist”
and aimed at refuting rival hypotheses.
Through his experiences in program evaluation, Cronbach highlighted
the role of beliefs and values in validity arguments, which “must link
concepts, evidence, social and personal consequences, and values”
(Cronbach, 1988, p. 4).
What we have here then is a concern for social consequences as a kind
of corrective to an earlier entirely cognitive and individualistic way of
thinking about tests.
Messick
 The most influential current theory of validity was developed by Samuel
Messick (1989).
 Messick incorporated a social dimension of assessment quite explicitly
within his model.
 Messick, like Cronbach, saw assessment as a process of reasoning and
evidence gathering carried out in order for inferences to be made about
individuals and saw the task of establishing the meaningfulness of those
inferences as being the primary task of assessment development and
research.
 This reflects an individualist, psychological tradition of measurement
concerned with fairness.
Messick on Construct Validity
Setting out the nature of the claims that we wish to make about test
takers and providing arguments and evidence in support of them, as in
Messick’s cell 1, represents the process of construct definition and
validation.
Those claims then provide the rationale for making decisions about
individuals (cells 2 and 4) on the basis of test scores.
Consider tests such as IELTS, TOEFL iBT, Occupational English Test
(McNamara, 1996):
Messick on Construct Validity
The path from the observed test performance to the predicted real-
world performance (in performance assessments) or estimate of the
actual knowledge of the target domain (in knowledge-based tests)
involves a chain of inferences.
In principle, how the person will fare in the target setting cannot be
known directly, but must be predicted.
 Deciding whether the person should be admitted then depends on two prior steps:
1. Modeling what you believe the demands of the target setting are and
2. predicting what the standing of the individual is in relation to this
construct.
Messick on Construct Validity
 The test is a procedure for gathering evidence in support of decisions that
need to be made.
 The relationships among test, construct and target are set out in Figure 2.3.
(Next Slide)
 Validity therefore implies considerations of social responsibility, both to
the candidate (protecting him or her against unfair exclusion) and to the
receiving institution.
 Fairness in this sense can only be achieved through carefully planning the
design of the observations of candidate performance and carefully
articulating the relationship between the evidence we gain and the inferences
about candidate standing.
Messick on Construct Validity
Test validation steers between what Messick called
1. construct underrepresentation, and
2. construct-irrelevant variance,
The former: the assessment requires less of the test taker than is
required in reality.
The latter: differences in scores might not be due only to differences
in the ability being measured but that other factors are illegitimately
affecting scores.
Mislevy
Central to assessment is the chain of reasoning from the observations
to the claims we make about test takers, on which the decisions about
them will be based. Mislevy calls this the “assessment argument”.
According to Mislevy: An assessment is a machine for reasoning about
what students know, can do, or have accomplished, based on a handful
of things they say, do, or make in particular settings.
Mislevy has developed an approach called Evidence Centered Design
(Figure 2.5), which focuses on the chain of reasoning in designing
tests.
Construct Definition and Validation:
Mislevy
A preliminary first stage, Domain Analysis, involves what in
performance assessment is called job analysis (the testing equivalent of
needs analysis).
Here the test developer needs to develop insight into the conceptual
and organizational structure of the target domain.
What follows is the most crucial stage of the process, Domain
Modeling.
It involves modeling three things: claims, evidence, and tasks.
Together, the modeling of claims and evidence is equivalent to
articulating the test construct.
Construct Definition and Validation:
Mislevy
Step 1 involves the test designer in articulating the claims the test will
make about candidates on the basis of test performance.
This involves conceptualizing the aspects of knowledge or
performance ability to which the evidence of the test will be directed
and on which decisions about candidates will be based.
Claims might be stated in broader or narrower terms; the latter
approach brings us closer to the specification of criterion behaviors in
criterion-referenced assessment (Brown & Hudson, 2004).
Construct Definition and Validation:
Mislevy
Step 2 involves determining the kind of evidence that would be
necessary to support the claims.
This stage depends on a theory of the characteristics of a successful
performance, e.g., the categories of rating scales used to judge the
adequacy of the performance.
Step 3 involves defining in general terms the kinds of task in which the
candidate will be required to engage.
Construct Definition and Validation:
Mislevy
All three steps precede the actual writing of specifications for test
tasks; they constitute the “thinking stage” of test design.
Only when this chain of reasoning is completed can the specifications
for test tasks be written.
In further stages of ECD, Mislevy deals with turning the conceptual
framework developed in the domain modeling stage into an actual
assessment.
The final outcome of this is an operational assessment.
Construct Definition and Validation:
Mislevy
The consideration of the social dimension of assessment in Mislevy’s
conceptual analysis remains implicit and limited to issues of fairness
(McNamara, 2003):
1. Mislevy does not consider the context in which tests are
commissioned.
2. Nor does Mislevy deal directly with the uses of test scores.
Kane
Kane points out that we interpret scores as having meaning. The same
score might have different interpretations.
Whatever the interpretation we choose, Kane argues, we need an
argument to defend the relationship of the score to that interpretation.
He calls this an interpretative argument, defined as a:
“chain of inferences from the observed
performances to conclusions and decisions included
in the interpretation”
Kane
Kane proposes four types of inference in the chain of inferences. He
uses the metaphor of bridges for each of these inferences; all bridges
need to be crossed safely for the final interpretations to be reached.
The ultimate decisions are vulnerable to weaknesses in any of the
preceding steps; in this way, Kane is clear about the dependency of
valid interpretations on the reliability of scores.
Kane
The first inference is from observation to observed score. In order
for assessment to be possible, an instance of learner behavior needs to
be observable.
This behavior is then scored.
The first type of inference is that the observed score is a reflection of
the observed behavior (i.e., that there is a clear scoring procedure.)
Kane
The second inference is from the observed score to what Kane called
the universe score. This inference is that the observed score is
consistent across tasks, judges, and occasions.
This involves reliability and can be studied effectively using
generalizability theory, item response modeling, etc.
A number of variables can threaten the validity of this inference,
including raters, task, rating scale, candidate characteristics, and
interactions among these.
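The generalizability reasoning behind this second inference can be illustrated numerically. The following is a minimal, illustrative Python sketch (the score matrix is invented) estimating a one-facet generalizability coefficient for candidates crossed with raters; it is not the full G-theory machinery, only the simplest persons-by-raters case:

```python
import numpy as np

# Hypothetical data: rows = candidates (persons), columns = raters.
X = np.array([
    [3, 4, 3],
    [5, 5, 4],
    [2, 3, 2],
    [4, 4, 5],
], dtype=float)

n_p, n_r = X.shape
grand = X.mean()
person_means = X.mean(axis=1)
rater_means = X.mean(axis=0)

# Sums of squares for a two-way crossed design without replication
ss_p = n_r * ((person_means - grand) ** 2).sum()
ss_r = n_p * ((rater_means - grand) ** 2).sum()
ss_res = ((X - grand) ** 2).sum() - ss_p - ss_r

ms_p = ss_p / (n_p - 1)
ms_res = ss_res / ((n_p - 1) * (n_r - 1))

# Variance components (negative estimates truncated at zero)
var_res = ms_res
var_p = max((ms_p - ms_res) / n_r, 0.0)

# Relative G-coefficient for the mean score over n_r raters:
# the share of score variance attributable to real candidate differences.
g_coeff = var_p / (var_p + var_res / n_r)
print(round(g_coeff, 2))  # 0.9
```

A coefficient near 1 supports the inference from observed score to universe score; a low coefficient signals that raters, tasks, or their interactions, rather than candidate ability, are driving the scores.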
Kane
The third inference, from the universe score to the target score, concerns
construct validity and is closest to the first cell of Messick's validity
matrix.
This inference involves extrapolation to nontest behavior—in some
cases, via explanation in terms of a model.
The fourth inference, from the target score to the decision based on
the score, moves the test into the world of test use and test context;
 It encompasses the material in the second, third, and fourth cells of
Messick’s matrix involving questions of relevance (cell 2), values (cell
3), and consequences (cell 4)
Kane
Kane thus distinguishes two types of inference:
1. semantic inferences
2. policy inferences
Kane also distinguishes two related types of interpretation:
1. Interpretations that only involve semantic inferences are called
descriptive interpretations;
2. Interpretations involving policy inferences are called decision-based
interpretations.
The Social Dimension of Validity in Language Testing
Some of the implications for assessment of the more socially oriented
views of communication were represented in the work of Hymes (1967,
1972)
However, the most influential discussion of the validity of
communicative language tests, Bachman’s landmark Fundamental
Considerations in Language Testing (1990), builds its discussion
around Messick’s approach to validity.
Earlier accounts, such as Lado (1961) and Davies (1977), had reflected the
existing modeling of validity in terms of different types: criterion validity
(concurrent/predictive), content validity, and construct validity.
The Social Dimension of Validity in Language Testing
Following Messick, Bachman presented validity as a unitary concept,
requiring evidence to support the inferences that we make on the basis
of test scores.
Bachman (1990) introduced his model of communicative language ability,
refining the earlier formulations of Canale and Swain (1980) and
Canale (1983).
Conceptualizing the demands of the target domain or criterion is a
critical stage in the development of a test validation framework.
The Social Dimension of Validity in Language Testing
Bachman does this in two stages:
First, he assumes that all contexts make specific demands on aspects
of test-taker competence;
The famous model of communicative language ability
Second, Bachman handled characterization of specific target language
use situations and of test content in terms of what he called
Test method facets.
The Social Dimension of Validity in Language Testing
Problems of Bachman’s Model:
 This a priori approach to characterizing the social context of use in
terms of a model of individual ability obviously severely constrains
the conceptualization of the social dimension of the assessment
context.
 The model is clearly primarily cognitive and psychological.
 Those who have used the test method facets approach in actual
research and development projects have found it difficult to use.
The Social Dimension of Validity in Language Testing
One influential attempt to make language test development and
validation more manageable is Bachman and Palmer's (1996) notion of test
usefulness: reliability, construct validity, authenticity, interactiveness,
impact, and practicality.
Note: Authenticity, interactiveness and impact are three qualities
that many measurement specialists consider to be part of validity.
The Social Dimension of Validity in Language Testing
Bachman (2004): the process of validation includes two interrelated
activities:
1. Articulating an interpretive argument (also referred to as a validation
argument), which provides the logical framework linking test
performance to an intended interpretation and use.
Following Kane
2. Collecting relevant evidence in support of the intended
interpretations and uses. Following Mislevy
The Social Dimension of Validity in Language Testing
 Bachman describes procedures for validation of test use decisions,
following Mislevy et al. (2003) in suggesting that they follow the structure
of a Toulmin argument (i.e., a procedure for practical reasoning involving
articulating claims and providing arguments and evidence both in their
favor [warrants or backing] and against [rebuttals]).
 An assessment use argument (AUA) is an overall logical framework for
linking assessment performance to use (decisions). This assessment use
argument includes two parts: an assessment utilization argument, linking an
interpretation to a decision, and an assessment validity argument, which
links assessment performance to an interpretation.
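The Toulmin structure behind an assessment use argument can be made concrete with a small data-structure sketch. The following Python example is purely illustrative; the admission scenario, score, and rebuttals are invented, and the class simply names the parts of a Toulmin argument rather than implementing Bachman's AUA itself:

```python
from dataclasses import dataclass, field

@dataclass
class ToulminArgument:
    claim: str                       # what we assert about the test taker
    data: str                        # the observed evidence (test performance)
    warrant: str                     # why the data is taken to support the claim
    backing: list = field(default_factory=list)    # evidence supporting the warrant
    rebuttals: list = field(default_factory=list)  # conditions under which the claim fails

# Hypothetical admission decision based on a speaking score
admission = ToulminArgument(
    claim="The candidate can cope with English-medium academic study.",
    data="Band 7 on a face-to-face speaking test.",
    warrant="Speaking-test scores generalize to academic oral tasks.",
    backing=["Extrapolation study correlating test scores with seminar performance"],
    rebuttals=[
        "Interlocutor effects inflated the score",
        "Test tasks underrepresent the academic domain",
    ],
)
print(admission.claim)
```

The point of the structure is that rebuttals are first-class: a validation effort must actively seek and weigh evidence against the claim, not only warrants in its favor.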
The Social Dimension of Validity in Language Testing
Following Bachman, Chapelle, Enright, and Jamieson (2004) used
Kane’s framework as the basis for a validity argument for the new
TOEFL iBT, arguably the largest language test development and
validation effort yet undertaken.
In both the work of Kane and in work in language testing, with one or
two notable exceptions, the wider social context in which language
tests are commissioned and have their place is still not adequately
theorized.
The Social Dimension of Validity in Language Testing
A kind of optimism about the social role of tests is reflected by
Kunnan and Bachman.
In sharp contrast to this position is Shohamy’s analysis of the political
function of tests: critical language testing (1998, 2001).
Tests have become symbols of power for both individuals and society.
 Lynch (2001), like Shohamy: tests have the potential to be sites of
unfairness or injustice at the societal level.
Chapter 3: The Social Dimension of
Proficiency: How Testable Is It?
 The changes in assessment are associated with the advent of the communicative
movement in language teaching in the mid-1970s.
 The most significant contribution of that movement was a renewed focus on
performance tests.
 There was a shift from seeing language proficiency in terms of knowledge of structure
which could be tested using discrete-point, multiple-choice items, to an emphasis on the
integration of many aspects of language knowledge and skill in performance.
 Two areas will be considered in which the assessment construct involves a social
dimension of communication:
• the assessment of face-to-face interaction
• the assessment of pragmatics.
Assessing Face-to-face Interaction
The advent of communicative language testing saw a growing
preference for face-to-face interaction as the context in which the
assessment of spoken language skills would occur.
In the past, face-to-face assessment existed in the U.S. Government
Foreign Service Institute (FSI) test and in the British tradition of
examinations, but these practices were seriously undertheorized.
Assessing Face-to-face Interaction
Language assessment in face-to-face interaction takes place within
what Goffman (1983) called the interaction order:
Social interaction . . . [is] that which uniquely transpires in social
situations, that is, environments in which two or more individuals are
physically in one another’s response presence. . . . which might be
titled the interaction order . . . [and] whose preferred method of study is
microanalysis.
Face-to-face interaction has its own rules, its own processes, and its
own structure.
Assessing Face-to-face Interaction
The nature of behavior in the face-to-face domain, and the rules that
constrain it, were systematically explored in the work of Goffman's
students Sacks and Schegloff, who developed Conversation Analysis.
The realization was that for speaking tests, the presence of an
interlocutor in the test setting introduces an immediate and overt
social context, which presented fundamental challenges for the existing
individualistic theories of language proficiency.
Assessing Face-to-face Interaction
A social view of performance is incompatible with taking the
traditional view of performance as a simple projection or display of
individual competence.
Psychology, linguistics, and psychometrics assume that it is possible to
read off underlying individual cognitive abilities from the data of
performance.
 However, the position of Conversation Analysis is that the
interlocutor is implicated in each move by the candidate.
 How are we to speak of communicative competence as residing in the
individual if we are to “include the hearer in the speaker’s processes”?
Assessing Face-to-face Interaction
Research on this issue has taken a number of directions. Discourse
analysts have focused on defining and clarifying the interactional
nature of the oral proficiency interview, e.g., Young and He (1998);
Johnson (2001). Results:
1. To emphasize the contribution of both parties to the performance, thus
making it difficult to isolate the contribution of the candidate, which
is the target of measurement and reporting.
2. To disclose the peculiar character of the oral proficiency interview as
a communicative event.
Assessing Face-to-face Interaction
I do not agree that the interview is more natural than the
other forms of tests, because if I’m being interviewed and
I know that my salary and my promotion depend on it, no
matter how charming the interviewer and his assistants
are, this couldn’t be any more unnatural.
(Jones & Spolsky, 1975, p. 7)
Assessing Face-to-face Interaction
The response from within the psychometric tradition is to maintain the
goal of disentangling the role of the interlocutor by treating it as a
variable like any other, to be controlled for, thus allowing us to focus
again exclusively on the candidate.
The most comprehensive study of the role of the interlocutor in
performances in oral proficiency interviews is that of Brown (2003,
2005).
Assessing Face-to-face Interaction
Findings:
 It was possible to identify score patterns (higher or lower scores) for
candidates paired with particular interlocutors.
What was the source of this effect? Two possible sources:
 The behavior of interlocutors as revealed through discourse analysis,
 The comments of raters as they listened to the performances, using
think-aloud techniques.
Assessing Face-to-face Interaction
The social character of the interaction has also been conceptualized
from more macrosociological points of view, as potentially being
influenced by the identities of the candidate and interlocutor/rater.
 What is at issue here is the extent to which such features as
the gender of participants,
the professional identity and experience of the interviewer/rater,
the native-speaker status (or otherwise) of the interviewer/rater,
the language background of the candidate,
and so on influence the interaction and its outcome.
Assessing Face-to-face Interaction
This is the subject of intensive ongoing debate in discourse studies of
macrosocial phenomena e.g.:
o feminist discourse studies,
o studies of racism,
o the movement known as Critical Discourse Analysis in general, and
o discursive psychology.
Assessing Second Language Pragmatics
Assessment of L2 pragmatics tests language use in social settings, but
unlike oral proficiency tests, it does not necessarily focus on
conversation or on eliciting speech samples.
Because of its highly contextualized nature, assessment of pragmatics
leads to significant tension between
the construction of authentic assessment tasks and
practicality;
only a few tests are available in this area.
Second Language Pragmatics
 Pragmatics is the study of language use in a social context, and language
users’ pragmatic competence is their “ability to act and interact by means of
language” (Kasper & Roever, 2005, p. 317).
 Pragmatics covers
 implicature,
 deixis,
 speech acts,
 conversational management,
 situational routines, and others (Leech, 1983; Levinson, 1983; Mey, 2001).
Second Language Pragmatics
Components of pragmatic competence (both are equally necessary):
A: “sociopragmatic” knowledge:
knowledge of the target language community’s
social rules, appropriateness norms, discourse practices,
and accepted behaviors
B: “pragmalinguistic” knowledge:
the linguistic tools necessary to
“do things with words”
(Austin, 1962)
Second Language Pragmatics
 Because of the close connection between pragmalinguistics and sociopragmatics, it is
difficult to design a test that tests pragmalinguistics to the exclusion of sociopragmatics or
vice versa.
Pragmatic Tests
Hudson, Detmer, & Brown’s (1995) test of English as a second language
(ESL) sociopragmatics
Bouton’s test of ESL implicature (Bouton, 1988, 1994, 1999)
Roever’s Web-based test of ESL pragmalinguistics (Roever, 2005, 2006b)
Liu’s (2006) test of EFL sociopragmatics
Testing Sociopragmatics
 Hudson et al. (1995) designed a test battery for assessing Japanese ESL
learners’ ability to produce and recognize appropriate realizations of the
speech acts of request, apology, and refusal.
 Their battery consisted of five sections:
1. a written DCT,
2. an oral (language lab) DCT,
3. a multiple-choice DCT,
4. a role-play, as well as
5. self-assessment measures for
• the DCTs and the role play.
Testing Sociopragmatics
Result:
A central problem of sociopragmatically oriented tests that focus
on appropriateness: Judgments of what is and what is not appropriate
differ widely among NSs and are probably more a function of
personality and social background variables than of language
knowledge.
Testing Implicature
 Bouton (1988, 1994, 1999) designed a 33-item test of implicature, incorporating two
major types of implicature, which he termed “idiosyncratic implicature” and
“formulaic implicature.”
 Idiosyncratic implicature is conversational implicature in Grice’s terms (1975); that is, it
violates a Gricean maxim and forces the hearer to infer meaning beyond the literal
meaning of the utterance by using background knowledge.
 Bouton viewed formulaic implicature as a specific kind of implicature, which follows a
routinized schema.
 He placed the Pope Q (“Is the Pope Catholic?”), indirect criticism (“How did you like the
food?”—“Let’s just say it was colorful.”), and sequences of events in this category.
Testing Pragmalinguistics
 Roever (2005, 2006b) developed a test battery that focused squarely on the
pragmalinguistic side of pragmatic knowledge.
 Unlike Hudson et al., who limited themselves to speech acts, and Bouton,
who assessed only implicature, Roever tested three aspects of ESL
pragmalinguistic competence:
1. recognition of situational routine formulas,
2. comprehension of implicature, and
3. knowledge of speech act strategies.
 Roever tried to strike a balance between practicality and broad content
coverage to avoid construct underrepresentation.
Apologies and Requests for Chinese EFL
Learners
 Liu (2006) developed a test of requests and apologies for Chinese EFL learners, which
consisted of
 a multiple-choice DCT,
a written DCT, and
 a self-assessment instrument.
 All three test papers contained 24 situations, evenly split between apologies and requests.
 As Liu (2006) stated quite clearly, his test exclusively targets Chinese EFL learners, so its
usability for other first language (L1) groups is unclear.
 It also only considers one aspect of L2 pragmatic competence—speech acts—so that
conclusions drawn from it and decisions based on scores would have to be fairly limited
and restricted.
The Tension in Testing Pragmatics: Keeping It
Social but Practical
If pragmatics is understood as language use in social settings, tests
would necessarily have to construct such social settings.
The usual DCT method of cramming all possible speech act
strategies into one gap is highly inauthentic in terms of actual
conversation.
The obvious alternative would be to test pragmatics through face-to-
face interaction, most likely role-plays. In this way, speech acts could
unfold naturally as they would in real-world interaction.
The Tension in Testing Pragmatics: Keeping It
Social but Practical
 Problems with Role Plays:
 Practicality.
 Time-consuming to conduct
 Requiring multiple ratings.
 A standardization issue because every role-play would be somewhat unique
if the conversation is truly co-constructed.
 How can this dilemma be solved?
 One way is to acknowledge that pragmatics does not exclusively equal
speech acts.
The Tension in Testing Pragmatics: Keeping It
Social but Practical
 It is possible to test other aspects of pragmatic competence without simulating
conversations. Certain aspects of pragmatics can be tested more easily in isolation than
others. For example,
 Routine formulas
 Implicature
 However, such a limitation raises the issue of construct underrepresentation.
 Given that pragmatics is a fairly broad area, it is difficult to design a single test that
assesses the entirety of a learner’s pragmatic competence.
 Depending on the purpose of the test, different aspects of pragmatic competence can be
tested, and for some purposes, it might be unavoidable to use role-plays or other
simulations of social situations
PPTX
Microbial diseases, their pathogenesis and prophylaxis
PDF
Chapter 2 Heredity, Prenatal Development, and Birth.pdf
PDF
01-Introduction-to-Information-Management.pdf
Chinmaya Tiranga quiz Grand Finale.pdf
Trump Administration's workforce development strategy
Complications of Minimal Access Surgery at WLH
VCE English Exam - Section C Student Revision Booklet
OBE - B.A.(HON'S) IN INTERIOR ARCHITECTURE -Ar.MOHIUDDIN.pdf
Classroom Observation Tools for Teachers
FourierSeries-QuestionsWithAnswers(Part-A).pdf
Cell Structure & Organelles in detailed.
school management -TNTEU- B.Ed., Semester II Unit 1.pptx
Anesthesia in Laparoscopic Surgery in India
Yogi Goddess Pres Conference Studio Updates
RTP_AR_KS1_Tutor's Guide_English [FOR REPRODUCTION].pdf
Final Presentation General Medicine 03-08-2024.pptx
2.FourierTransform-ShortQuestionswithAnswers.pdf
Soft-furnishing-By-Architect-A.F.M.Mohiuddin-Akhand.doc
The Lost Whites of Pakistan by Jahanzaib Mughal.pdf
human mycosis Human fungal infections are called human mycosis..pptx
Microbial diseases, their pathogenesis and prophylaxis
Chapter 2 Heredity, Prenatal Development, and Birth.pdf
01-Introduction-to-Information-Management.pdf

Language testing the social dimension

  • 1. Language Testing: The Social Dimension Tim McNamara and Carsten Roever (Pages 1-64) Presented by: Amir Hamid Forough Ameri ahfameri@gmail.com December 2015
  • 2. Chapter 2: Validity and the Social Dimension of Language Testing In what ways is the social dimension of language assessment reflected in current theories of the validation of language tests? Mislevy Lynch Cronbach Kunnan BachmanShohamy Messick Chapelle Kane
  • 3. Cronbach Contemporary discussions of validity in educational assessment are heavily influenced by the thinking of the American Lee Cronbach; The “father” of construct validity. Meehl and Challman coined the term “construct validity.”
  • 4. Cronbach Their new concept of construct validity was an alternative to criterion- related validity: Construct validity is ordinarily studied when the tester has no definite criterion measure of the quality with which he [sic] is concerned and must use indirect measures. Here the trait or quality underlying the test is of central importance, rather than either the test behavior or the cores on the criteria. (Cronbach & Meehl, 1955, p. 283)
  • 5. Cronbach Since Cronbach and Meehl’s article: A: There has been the increasingly central role taken by construct validity, B: There is also clear recognition that validity is not a mathematical property like discrimination or reliability, but a matter of judgment. Cronbach (1989) emphasized the need for a validity argument, which focuses on collecting evidence for or against a certain interpretation of test scores: In other words, it is the validity of inferences that construct validation work is concerned with, rather than the validity of instruments.
  • 6. Cronbach In fact, Cronbach argued that there is no such thing as a “valid test,” “One does not validate a test, but only a principle for making inferences” (Cronbach & Meehl, 1955, p. 297); “One validates not a test, but an interpretation of data arising from a specified procedure” (Cronbach, 1971, p. 447).
  • 7. Cronbach Cronbach and Meehl (1955) distinguished between a weak and a strong program for construct validation:  The weak one is a fairly haphazard collection of any sort of evidence (mostly correlational) that supports the particular interpretation to be validated. In contrast, the strong program is based on the falsification idea advanced by Popperian philosophy (Popper, 1962): Rival hypotheses for interpretations are proposed and logically or empirically examined.
  • 8. Cronbach Cronbach (1988) admitted that the actual approach taken in most validation research is “confirmationist” rather than “falsificationist” and aimed at refuting rival hypotheses. Through his experiences in program evaluation, Cronbach highlighted the role of beliefs and values in validity arguments, which “must link concepts, evidence, social and personal consequences, and values” (Cronbach, 1988, p. 4). What we have here then is a concern for social consequences as a kind of corrective to an earlier entirely cognitive and individualistic way of thinking about tests.
  • 9. Messick  The most influential current theory of validity is developed by Samuel Messick (1989).  Messick incorporated a social dimension of assessment quite explicitly within his model.  Messick, like Cronbach, saw assessment as a process of reasoning and evidence gathering carried out in order for inferences to be made about individuals and saw the task of establishing the meaningfulness of those inferences as being the primary task of assessment development and research.  This reflects an individualist, psychological tradition of measurement concerned with fairness.
  • 12. Messick on Construct Validity Setting out the nature of the claims that we wish to make about test takers and providing arguments and evidence in support of them, as in Messick’s cell 1, represents the process of construct definition and validation. Those claims then provide the rationale for making decisions about individuals (cells 2 and 4) on the basis of test scores. Consider tests such as IELTS, TOEFL iBT, Occupational English Test (McNamara, 1996):
  • 13. Messick on Construct Validity The path from the observed test performance to the predicted real- world performance (in performance assessments) or estimate of the actual knowledge of the target domain (in knowledge-based tests) involves a chain of inferences. In principle, how the person will fare in the target setting cannot be known directly, but must be predicted.  Deciding whether the person should be admitted then depends on two prior steps: 1. Modeling what you believe the demands of the target setting are and 2. predicting what the standing of the individual is in relation to this construct.
  • 14. Messick on Construct Validity  The test is a procedure for gathering evidence in support of decisions that need to be made.  The relationships among test, construct and target are set out in Figure 2.3. (Next Slide)  Validity therefore implies considerations of social responsibility, both to the candidate (protecting him or her against unfair exclusion) and to the receiving institution.  Fairness in this sense can only be achieved through carefully planning the design of the observations of candidate performance and carefully articulating the relationship between the evidence we gain and the inferences about candidate standing.
  • 16. Messick on Construct Validity Test validation steers between what Messick called 1. construct underrepresentation, and 2. construct-irrelevant variance, The former: the assessment requires less of the test taker than is required in reality. The latter: differences in scores might not be due only to differences in the ability being measured but that other factors are illegitimately affecting scores.
  • 17. Mislevy Central to assessment is the chain of reasoning from the observations to the claims we make about test takers, on which the decisions about them will be based. Mislevy calls this the “assessment argument”. According to Mislevy: An assessment is a machine for reasoning about what students know, can do, or have accomplished, based on a handful of things they say, do, or make in particular settings. Mislevy has developed an approach called Evidence Centered Design (Figure 2.5), which focuses on the chain of reasoning in designing tests.
  • 18. Construct Definition and Validation: Mislevy
  • 22. Construct Definition and Validation: Mislevy A preliminary first stage, Domain Analysis, involves what in performance assessment is called job analysis (the testing equivalent of needs analysis). Here the test developer needs to develop insight into the conceptual and organizational structure of the target domain. What follows is the most crucial stage of the process, Domain Modeling. It involves modeling three things: claims, evidence, and tasks. Together, the modeling of claims and evidence is equivalent to articulating the test construct.
  • 23. Construct Definition and Validation: Mislevy Step 1 involves the test designer in articulating the claims the test will make about candidates on the basis of test performance. This involves conceptualizing the aspects of knowledge or performance ability to which the evidence of the test will be directed and on which decisions about candidates will be based. Claims might be stated in broader or narrower terms; the latter approach brings us closer to the specification of criterion behaviors in criterion-referenced assessment (Brown & Hudson, 2004).
  • 24. Construct Definition and Validation: Mislevy Step 2 involves determining the kind of evidence that would be necessary to support the claims. This stage depends on a theory of the characteristics of a successful performance, e.g., the categories of rating scales used to judge the adequacy of the performance. Step 3 involves defining in general terms the kinds of task in which the candidate will be required to engage.
  • 25. Construct Definition and Validation: Mislevy All three steps precede the actual writing of specifications for test tasks; they constitute the “thinking stage” of test design. Only when this chain of reasoning is completed can the specifications for test tasks be written. In further stages of ECD, Mislevy deals with turning the conceptual framework developed in the domain modeling stage into an actual assessment. The final outcome of this is an operational assessment.
  • 26. Construct Definition and Validation: Mislevy The consideration of the social dimension of assessment in Mislevy’s conceptual analysis remains implicit and limited to issues of fairness (McNamara, 2003): 1. Mislevy does not consider the context in which tests are commissioned. 2. Nor does Mislevy deal directly with the uses of test scores.
  • 27. Kane Kane points out that we interpret scores as having meaning. The same score might have different interpretations. Whatever the interpretation we choose, Kane argues, we need an argument to defend the relationship of the score to that interpretation. He calls this an interpretative argument, defined as a: “chain of inferences from the observed performances to conclusions and decisions included in the interpretation”
  • 28. Kane
  • 29. Kane Kane proposes four types of inference in the chain of inferences. He uses the metaphor of bridges for each of these inferences; all bridges need to be crossed safely for the final interpretations to be reached. The ultimate decisions are vulnerable to weaknesses in any of the preceding steps; in this way, Kane is clear about the dependency of valid interpretations on the reliability of scores.
  • 30. Kane
  • 31. Kane The first inference is from observation to observed score. In order for assessment to be possible, an instance of learner behavior needs to be observable. This behavior is then scored. The first type of inference is that the observed score is a reflection of the observed behavior (i.e., that there is a clear scoring procedure.)
  • 32. Kane The second inference is from the observed score to what Kane called the universe score. This inference is that the observed score is consistent across tasks, judges, and occasions. This involves reliability and can be studied effectively using generalizability theory, item response modeling, etc. A number variables can threaten the validity of this inference, including raters, task, rating scale, candidate characteristics, and interactions among these.
  • 33. Kane The third inference, from the universe score to the target score: construct validity which is closest to the first cell of Messick’s validity matrix. This inference involves extrapolation to nontest behavior—in some cases, via explanation in terms of a model. The fourth inference, from the target score to the decision based on the score, moves the test into the world of test use and test context;  It encompasses the material in the second, third, and fourth cells of Messick’s matrix involving questions of relevance (cell 2), values (cell 3), and consequences (cell 4)
  • 34. Kane Kane thus distinguishes two types of inference: 1. semantic inferences 2. policy inferences Kane also distinguishes two related types of interpretation: 1. Interpretations that only involve semantic inferences are called descriptive interpretations; 2. Interpretations involving policy inferences are called decision-based interpretations.
  • 35. The Social Dimension of Validity in Language Testing Some of the implications for assessment of the more socially oriented views of communication were represented in the work of Hymes (1967, 1972) However, the most influential discussion of the validity of communicative language tests, Bachman’s landmark Fundamental Considerations in Language Testing (1990), builds its discussion around Messick’s approach to validity. Earlier treatments, such as Lado (1961) and Davies (1977), had reflected the existing modeling of validity in terms of different types: criterion validity (concurrent/predictive), content validity, and construct validity.
  • 36. The Social Dimension of Validity in Language Testing Following Messick, Bachman presented validity as a unitary concept, requiring evidence to support the inferences that we make on the basis of test scores. Bachman (1990) introduced the Bachman model of communicative language ability, clarifying the earlier interpretations by Canale and Swain (1980) and Canale (1983). Conceptualizing the demands of the target domain or criterion is a critical stage in the development of a test validation framework.
  • 37. The Social Dimension of Validity in Language Testing Bachman does this in two stages: First, he assumes that all contexts make specific demands on aspects of test-taker competence; The famous model of communicative language ability Second, Bachman handled characterization of specific target language use situations and of test content in terms of what he called Test method facets.
  • 38. The Social Dimension of Validity in Language Testing Problems of Bachman’s Model:  This a priori approach to characterizing the social context of use in terms of a model of individual ability obviously severely constrains the conceptualization of the social dimension of the assessment context.  The model is clearly primarily cognitive and psychological.  Those who have used the test method facets approach in actual research and development projects have found it difficult to use.
  • 39. The Social Dimension of Validity in Language Testing One influential attempt to make language test development and validation more manageable is Bachman and Palmer’s (1996) notion of test usefulness: reliability, construct validity, authenticity, interactiveness, impact, and practicality. Note: Authenticity, interactiveness and impact are three qualities that many measurement specialists consider to be part of validity.
  • 40. The Social Dimension of Validity in Language Testing Bachman (2004): the process of validation includes two interrelated activities: 1. Articulating an interpretive argument (also referred to as a validation argument), which provides the logical framework linking test performance to an intended interpretation and use. Following Kane 2. Collecting relevant evidence in support of the intended interpretations and uses. Following Mislevy
  • 41. The Social Dimension of Validity in Language Testing  Bachman describes procedures for validation of test use decisions, following Mislevy et al. (2003) in suggesting that they follow the structure of a Toulmin argument (i.e., a procedure for practical reasoning involving articulating claims and providing arguments and evidence both in their favor [warrants or backing] and against [rebuttals]).  An assessment use argument (AUA) is an overall logical framework for linking assessment performance to use (decisions). This assessment use argument includes two parts: an assessment utilization argument, linking an interpretation to a decision, and an assessment validity argument, which links assessment performance to an interpretation.
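The Toulmin structure behind an assessment use argument — claims defended by warrants with backing, and exposed to rebuttals — can be rendered schematically in code. The sketch below is hypothetical: the class names, fields, and example statements are illustrative, not Bachman's or Mislevy's formalism.

```python
# A schematic rendering of a Toulmin-style assessment use argument:
# a claim stands only if its warrants have backing and its rebuttals
# have been answered. All names and examples are illustrative.
from dataclasses import dataclass, field

@dataclass
class Warrant:
    statement: str
    backing: list = field(default_factory=list)  # evidence for the warrant

@dataclass
class Rebuttal:
    statement: str
    refuted: bool = False  # has counter-evidence been answered?

@dataclass
class Claim:
    statement: str
    warrants: list = field(default_factory=list)
    rebuttals: list = field(default_factory=list)

    def is_defensible(self):
        """Every warrant needs backing; every rebuttal must be refuted."""
        return (all(w.backing for w in self.warrants)
                and all(r.refuted for r in self.rebuttals))

claim = Claim(
    "Candidate X can cope with English-medium study",
    warrants=[Warrant("Test tasks sample academic language use",
                      backing=["domain analysis of lectures and readings"])],
    rebuttals=[Rebuttal("Scores may reflect test-wiseness", refuted=False)],
)
print(claim.is_defensible())  # False: an unanswered rebuttal weakens the claim
```

The point of the structure is exactly the one the slide makes: validation is not a property of the instrument but an explicit argument, and an unaddressed rebuttal leaves the intended use undefended.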
  • 42. The Social Dimension of Validity in Language Testing Following Bachman, Chapelle, Enright, and Jamieson (2004) used Kane’s framework as the basis for a validity argument for the new TOEFL iBT, arguably the largest language test development and validation effort yet undertaken. In both the work of Kane and in work in language testing, with one or two notable exceptions, the wider social context in which language tests are commissioned and have their place is still not adequately theorized.
  • 43. The Social Dimension of Validity in Language Testing A kind of optimism about the social role of tests is reflected by Kunnan and Bachman. In sharp contrast to this position is Shohamy’s analysis of the political function of tests: critical language testing (1998, 2001). Tests have become symbols of power for both individuals and society  Lynch (2001), like Shohamy: tests have the potential to be sites of unfairness or injustice at the societal level.
  • 44. Chapter 3: The Social Dimension of Proficiency: How Testable Is It?  The changes in assessment are associated with the advent of the communicative movement in language teaching in the mid-1970s.  The most significant contribution of that movement was a renewed focus on performance tests.  There was a shift from seeing language proficiency in terms of knowledge of structure which could be tested using discrete-point, multiple-choice items, to an emphasis on the integration of many aspects of language knowledge and skill in performance.  Two areas will be considered in which the assessment construct involves a social dimension of communication: • the assessment of face-to-face interaction • the assessment of pragmatics.
  • 45. Assessing Face-to-face Interaction The advent of communicative language testing saw a growing preference for face-to-face interaction as the context in which the assessment of spoken language skills would occur. In the past: U.S. Government Foreign Service Institute (FSI) test and the British tradition of examinations, but these practices were seriously undertheorized.
  • 46. Assessing Face-to-face Interaction Language assessment in face-to-face interaction takes place within what Goffman (1983) called the interaction order: Social interaction . . . [is] that which uniquely transpires in social situations, that is, environments in which two or more individuals are physically in one another’s response presence. . . . which might be titled the interaction order . . . [and] whose preferred method of study is microanalysis. Face-to-face interaction has its own regulations; it has its own processes and its own structure.
  • 47. Assessing Face-to-face Interaction The nature of behavior in face-to-face domain, and the rules that constrain it was systematically explored in the work of students of Goffman, Sacks and Schegloff, who developed Conversation Analysis. The realization was that for speaking tests, the presence of an interlocutor in the test setting introduces an immediate and overt social context, which presented fundamental challenges for the existing individualistic theories of language proficiency.
  • 48. Assessing Face-to-face Interaction A social view of performance is incompatible with taking the traditional view of performance as a simple projection or display of individual competence. Psychology, linguistics, and psychometrics assume that it is possible to read off underlying individual cognitive abilities from the data of performance.  However, the position of Conversation Analysis is that the interlocutor is implicated in each move by the candidate.  How are we to speak of communicative competence as residing in the individual if we are to “include the hearer in the speaker’s processes”?
  • 49. Assessing Face-to-face Interaction Research on this issue has taken a number of directions. Discourse analysts have focused on defining and clarifying the interactional nature of the oral proficiency interview, e.g., Young and He (1998); Johnson (2001). Results: 1. To emphasize the contribution of both parties to the performance, thus making it difficult to isolate the contribution of the candidate, which is the target of measurement and reporting. 2. To disclose the peculiar character of the oral proficiency interview as a communicative event.
  • 50. Assessing Face-to-face Interaction I do not agree that the interview is more natural than the other forms of tests, because if I’m being interviewed and I know that my salary and my promotion depend on it, no matter how charming the interviewer and his assistants are, this couldn’t be any more unnatural. (Jones & Spolsky, 1975, p. 7)
  • 51. Assessing Face-to-face Interaction The response from within the psychometric tradition is to maintain the goal of disentangling the role of the interlocutor by treating it as a variable like any other, to be controlled for, thus allowing us to focus again exclusively on the candidate. The most comprehensive study of the role of the interlocutor in performances in oral proficiency interviews is that of Brown (2003, 2005).
  • 52. Assessing Face-to-face Interaction Findings:  It was possible to identify score patterns (higher or lower scores) for candidates paired with particular interlocutors. What was the source of this effect? Two possible sources:  The behavior of interlocutors as revealed through discourse analysis,  The comments of raters as they listened to the performances, using think-aloud techniques.
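The score patterns Brown identified can be screened for with a simple descriptive check: compare mean candidate scores by interlocutor and flag pairings that sit unusually high or low. The sketch below uses invented data and a hypothetical threshold; it is a first-pass screen, not Brown's method.

```python
# Minimal sketch (hypothetical data) of screening for interlocutor effects:
# flag interlocutors whose candidates' mean score departs from the overall mean.
from collections import defaultdict

def interlocutor_means(records):
    """records: (interlocutor_id, score) pairs from interview sessions."""
    by_int = defaultdict(list)
    for interlocutor, score in records:
        by_int[interlocutor].append(score)
    return {i: sum(s) / len(s) for i, s in by_int.items()}

def flag_outliers(records, threshold=0.5):
    """Return interlocutors whose mean deviates from the overall mean."""
    means = interlocutor_means(records)
    overall = sum(score for _, score in records) / len(records)
    return sorted(i for i, m in means.items() if abs(m - overall) > threshold)

records = [("A", 4), ("A", 5), ("B", 3), ("B", 2), ("C", 4), ("C", 4)]
print(flag_outliers(records))  # A patterns high, B low
```

A mean-deviation screen like this cannot by itself separate interlocutor behavior from candidate ability — which is precisely why Brown turned to discourse analysis of the interlocutors' talk and raters' think-aloud comments to locate the source of the effect.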
  • 53. Assessing Face-to-face Interaction The social character of the interaction has also been conceptualized from more macrosociological points of view, as potentially being influenced by the identities of the candidate and interlocutor/rater.  What is at issue here is the extent to which such features as the gender of participants, the professional identity and experience of the interviewer/rater, the native-speaker status (or otherwise) of the interviewer/rater, the language background of the candidate, and so on influence the interaction and its outcome.
  • 54. Assessing Face-to-face Interaction This is the subject of intensive ongoing debate in discourse studies of macrosocial phenomena e.g.: o feminist discourse studies, o studies of racism, o the movement known as Critical Discourse Analysis in general, and o discursive psychology.
  • 55. Assessing Second Language Pragmatics Assessment of L2 pragmatics tests language use in social settings, but unlike oral proficiency tests, it does not necessarily focus on conversation or extracting speech samples. Because of its highly contextualized nature, assessment of pragmatics leads to significant tension between the construction of authentic assessment tasks and practicality; only a few tests are available in this area.
  • 56. Second Language Pragmatics  Pragmatics is the study of language use in a social context, and language users’ pragmatic competence is their “ability to act and interact by means of language” (Kasper & Roever, 2005, p. 317).  Pragmatics covers  implicature,  deixis,  speech acts,  conversational management,  situational routines, and others (Leech, 1983; Levinson, 1983; Mey, 2001).
  • 57. Second Language Pragmatics Components of pragmatic competence (both are equally necessary): A: “sociopragmatic” knowledge: knowledge of the target language community’s social rules, appropriateness norms, discourse practices, and accepted behaviors; B: “pragmalinguistic” knowledge: the linguistic tools necessary to “do things with words” (Austin, 1962)
  • 58. Second Language Pragmatics  Because of the close connection between pragmalinguistics and sociopragmatics, it is difficult to design a test that tests pragmalinguistics to the exclusion of sociopragmatics or vice versa. Pragmatic Tests Hudson, Detmer, & Brown’s (1995) test of English as a second language (ESL) sociopragmatics Bouton’s test of ESL implicature (Bouton, 1988, 1994, 1999) Roever’s Web-based test of ESL pragmalinguistics (Roever, 2005, 2006b) Liu’s (2006) test of EFL sociopragmatics.
  • 59. Testing Sociopragmatics  Hudson et al. (1995) designed a test battery for assessing Japanese ESL learners’ ability to produce and recognize appropriate realizations of the speech acts of request, apology, and refusal.  Their battery consisted of the following sections: 1. a written DCT, 2. an oral (language lab) DCT, 3. a multiple-choice DCT, 4. a role-play, as well as 5. self-assessment measures for the DCTs and the role-play.
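The multiple-choice DCT section of such a battery scores mechanically against a key of "most appropriate" options. The sketch below is hypothetical: the item IDs and key are invented for illustration, not taken from Hudson et al.'s instrument.

```python
# Hypothetical sketch of scoring a multiple-choice DCT section:
# each situation has one keyed "most appropriate" option.
# Items and key are invented for illustration.
KEY = {"request_01": "b", "apology_01": "c", "refusal_01": "a"}

def score_mc_dct(responses, key=KEY):
    """responses: item_id -> chosen option; returns (raw score, proportion)."""
    raw = sum(1 for item, choice in responses.items() if key.get(item) == choice)
    return raw, raw / len(key)

responses = {"request_01": "b", "apology_01": "a", "refusal_01": "a"}
print(score_mc_dct(responses))
```

The mechanical scoring hides the hard part: deciding which option to key as "appropriate" in the first place, which is exactly where native-speaker judgments diverge.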
  • 60. Testing Sociopragmatics Result: A central problem of sociopragmatically oriented tests that focus on appropriateness: Judgments of what is and what is not appropriate differ widely among native speakers (NSs) and are probably more a function of personality and social background variables than of language knowledge.
  • 61. Testing Implicature  Bouton (1988, 1994, 1999) designed a 33-item test of implicature, incorporating two major types of implicature, which he termed “idiosyncratic implicature” and “formulaic implicature.”  Idiosyncratic implicature is conversational implicature in Grice’s terms (1975); that is, it violates a Gricean maxim and forces the hearer to infer meaning beyond the literal meaning of the utterance by using background knowledge.  Bouton viewed formulaic implicature as a specific kind of implicature, which follows a routinized schema.  He placed the Pope Q (“Is the Pope Catholic?”), indirect criticism (“How did you like the food?”—“Let’s just say it was colorful.”), and sequences of events in this category.
  • 62. Testing Pragmalinguistics  Roever (2005, 2006b) developed a test battery that focused squarely on the pragmalinguistic side of pragmatic knowledge.  Unlike Hudson et al., who limited themselves to speech acts, and Bouton, who assessed only implicature, Roever tested three aspects of ESL pragmalinguistic competence: 1. recognition of situational routine formulas, 2. comprehension of implicature, and 3. knowledge of speech act strategies.  Roever tried to strike a balance between practicality and broad content coverage to avoid construct underrepresentation.
  • 63. Apologies and Requests for Chinese EFL Learners  Liu (2006) developed a test of requests and apologies for Chinese EFL learners, which consisted of  a multiple-choice DCT,  a written DCT, and  a self-assessment instrument.  All three test papers contained 24 situations, evenly split between apologies and requests.  As Liu (2006) stated quite clearly, his test exclusively targets Chinese EFL learners, so its usability for other first language (L1) groups is unclear.  It also only considers one aspect of L2 pragmatic competence—speech acts—so that conclusions drawn from it and decisions based on scores would have to be fairly limited and restricted.
  • 64. The Tension in Testing Pragmatics: Keeping It Social but Practical If pragmatics is understood as language use in social settings, tests would necessarily have to construct such social settings. The usual DCT method of cramming all possible speech act strategies into one gap is highly inauthentic in terms of actual conversation. The obvious alternative would be to test pragmatics through face-to-face interaction, most likely role-plays. In this way, speech acts could unfold naturally as they would in real-world interaction.
  • 65. The Tension in Testing Pragmatics: Keeping It Social but Practical  Problems with Role-Plays:  Practicality.  Being time-consuming to conduct  Requiring multiple ratings.  A standardization issue, because every role-play would be somewhat unique if the conversation is truly co-constructed.  How can this dilemma be solved?  One way is to acknowledge that pragmatics does not exclusively equal speech acts.
  • 66. The Tension in Testing Pragmatics: Keeping It Social but Practical  It is possible to test other aspects of pragmatic competence without simulating conversations. Certain aspects of pragmatics can be tested more easily in isolation than others. For example,  Routine formulas  Implicature  However, such a limitation raises the question of construct underrepresentation.  Given that pragmatics is a fairly broad area, it is difficult to design a single test that assesses the entirety of a learner’s pragmatic competence.  Depending on the purpose of the test, different aspects of pragmatic competence can be tested, and for some purposes, it might be unavoidable to use role-plays or other simulations of social situations.