© Fariba Chamani, 2015
GLEN FULCHER (2010). PRACTICAL
LANGUAGE TESTING
CHAPTER 5: ALIGNING TESTS TO
STANDARDS
Content of this chapter
It’s as old as the hills
The definition of ‘standards’
The uses of standards
Unintended consequences revisited
Using standards for harmonization and identity
How many standards can we afford?
Performance level descriptors (PLDs) and test scores
Some initial decisions
Standard-setting methodologies
Evaluating standard setting
Training
The special case of the CEFR
You can always count on uncertainty
It’s as old as the hills
Standard setting = The process of establishing
one or more cut scores on examinations
Standards-based assessment = Using tests to
assess learner performance and achievement in
relation to an absolute standard
A development of criterion-referenced testing
Using large-scale standardized tests
Pre-dating the criterion-referenced testing movement
Definition of ‘standard’
Standard = a level of performance
required or experienced (Davies et al.,
1999).
Example: The standard required for entry
to the university is an A in English.
The uses of standards
Educational purposes: (achievement tests)
Professional purposes: (certification of aircraft
engineers)
Political purposes: (NCLB & AYP)
Immigration Policy purposes
Unintended consequences
In the case of NCLB: the ELL group always scores below
the standard, and resources are not channeled to
where they are most needed.
Mandatory use of English in tests of content
subjects puts pressure on indigenous people to
abandon education in their own language.
The use of language tests for immigration leads to
fraudulent practices & short-term paper marriages.
Using standards for harmonization & identity
To enforce conformity to a single model that helps
to create and maintain political unity and identity.
Examples:
Carolingian empire of Charlemagne (CE 800–814)
CEFR (Now)
Carolingian empire of Charlemagne
Within the empire of Charlemagne, in Central and Western
Europe, various groups followed different calendars, and the
main Christian festivals fell on different dates.
In order to bring uniformity, Charlemagne set a new
standard for ‘computists’ who worked out the time of
festivals. They were required to pass a test in order to get their
certificate.
There are no ‘correct answers’ for the questions in the
Carolingian test; they are scored as ‘correct’ because they
are defined as such by the standard, and the standard is
arbitrarily chosen with the intention of harmonizing practice.
CEFR (Common European Framework of Reference)
CEFR = A set of standards (six-level scales and their
descriptors) that provides a European model for language
testing and learning to enhance European identity and
harmonization.
Teachers should align their curriculum and tests to CEFR
standards (linking); otherwise many European institutions
will not recognize the certificates they award.
Problems with CEFR
It drains creativity among teachers.
The same set of standards is used for all people
across different contexts, with different purposes.
Validation is based on linking the test to the CEFR.
This is against validity theories.
The use of standards and tests for harmonization
ultimately leads to a desire for more control.
How many standards can we afford?
The number of performance levels depends on the goals and
the use of the test.
Choosing the fewest performance levels (pass or fail) is
ideal because the more numerous the classes, the greater the
danger that a small difference in marks places a test taker in
the wrong class.
Index of Separation estimates the number of performance
levels into which a test can reliably place test takers.
Sometimes we have to use numerous categories to motivate
young learners.
Performance level descriptors (PLDs) & test scores
PLDs are often developed using intuitive and experiential
methods, and the labels and descriptors are simple reflections of the
values of policy makers.
There are often around four levels, e.g. ‘advanced – proficient – basic –
below basic’.
The PLDs provide a conceptual hierarchy of performance that is
an indication of the ability or knowledge of the test taker.
Standard-setting is the process of deciding on a cut score for a
test to mark the boundary between two PLDs. If we have two
performance levels (pass and fail), we’ll need a single cut score.
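The relation between cut scores and PLDs can be sketched in a few lines of Python. This is only an illustration: the function name `classify`, the 0-100 scale, and the cut scores are hypothetical, and placing a score that equals a cut score in the higher level is an assumption of the sketch, not something the chapter prescribes.

```python
from bisect import bisect_right

def classify(score, cut_scores, labels):
    """Return the performance level label for a score.

    cut_scores must be sorted in ascending order. A score equal to a
    cut score is placed in the higher level (an assumption of this sketch).
    """
    if len(labels) != len(cut_scores) + 1:
        raise ValueError("need exactly one more label than cut scores")
    # bisect_right counts how many cut scores are <= score
    return labels[bisect_right(cut_scores, score)]

# Hypothetical cut scores on a 0-100 scale: three cuts give four levels
labels = ["below basic", "basic", "proficient", "advanced"]
cuts = [40, 60, 80]
print(classify(59, cuts, labels))  # basic
print(classify(60, cuts, labels))  # proficient

# Two levels (fail/pass) need a single cut score
print(classify(50, [50], ["fail", "pass"]))  # pass
```

With n performance levels, n - 1 cut scores are needed, which is why a pass/fail test needs only one.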
Standards-based tests, CRT & scoring rubrics
It is said that tests used in standards-based testing are
criterion-referenced, yet for Glaser the criterion was
the domain, and it did not have anything to do with
standard setting and classification.
The standards-based testing movement has interpreted
‘criterion’ to mean ‘standard’.
The focus within PLDs is on the general levels of
competence, proficiency, or performance while scoring
rubrics address only single items.
Some initial decisions
All standard setting methods involve expert judgemental
decision making at some level (Jaeger, 1979).
Decision 1: Compensatory or non-compensatory
marking? In compensatory marking, strength in other areas
‘compensates’ for weakness in one area.
Decision 2: What classification errors can you tolerate?
Decision 3: Are you going to allow test takers who ‘fail’
a test to retake it? If so, what time lapse is required to
retake the test?
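Decision 1 can be made concrete with a small Python sketch. The section names, scores, and cut scores below are invented for illustration, and the two function names are hypothetical; the point is only that the same test taker can pass under one rule and fail under the other.

```python
def passes_compensatory(section_scores, overall_cut):
    """Pass if the total score reaches the overall cut score,
    so strength in one section can offset weakness in another."""
    return sum(section_scores.values()) >= overall_cut

def passes_non_compensatory(section_scores, section_cuts):
    """Pass only if every section reaches its own cut score."""
    return all(section_scores[name] >= cut
               for name, cut in section_cuts.items())

# Hypothetical test taker: strong reading and listening, weak writing
scores = {"reading": 28, "listening": 25, "writing": 12}

print(passes_compensatory(scores, overall_cut=60))  # True (total is 65)
print(passes_non_compensatory(
    scores, {"reading": 15, "listening": 15, "writing": 15}))  # False
```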
[Diagram: Classification of standard-setting methods. Labels: test-centered, examinee-centered, criterion-referenced, norm-referenced]
Standard-setting methodologies
Test-centered
• Angoff
• Ebel
• Nedelsky
• Bookmark
Examinee-centered
• Method of contrasting groups
• Method of borderline group
Common process of standard setting
Select an appropriate standard setting method depending upon the
purpose of the standard setting, available data, and personnel.
Select a panel of judges based upon explicit criteria.
Prepare the PLDs and other materials as appropriate.
Train the judges to use the method selected.
Rate items or persons, collect and store data.
Provide feedback on rating and initiate discussion for judges to
explain their ratings, listen to others, and revise their views or
decisions, before another round of judging.
Collect final ratings and establish cut scores.
Ask the judges to evaluate the process.
Document the process in order to justify the conclusions reached.
Test-centered methods
The judges are presented with individual
items or tasks and required to make a
decision about the expected performance on
them by a test taker who is just below the
border between two standards.
Angoff method
• Experts are given a set of items and rate the probability
that a hypothetical learner (who is on the borderline) would
answer each test item correctly.
• These probabilities are summed over items for each judge, and
the average across judges or raters is the cut score.
• If the test contains polytomous items or tasks, the
proportion of the maximum score is used instead
of the probability (modified Angoff).
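A minimal Python sketch of the Angoff calculation follows. It assumes one common reading of the method: each judge's item probabilities are summed to give that judge's expected raw score for the borderline learner, and the judges' totals are then averaged. The ratings are invented and the function name `angoff_cut_score` is hypothetical.

```python
from statistics import mean

def angoff_cut_score(ratings):
    """ratings: one list per judge, holding a probability (0 to 1)
    for each item. Returns the cut score in raw-score points."""
    # Each judge's expected raw score for the borderline learner
    judge_totals = [sum(judge) for judge in ratings]
    # Average the judges' totals to get the panel's cut score
    return mean(judge_totals)

# Hypothetical ratings: three judges, four dichotomous items
ratings = [
    [0.5, 0.75, 0.25, 0.5],   # judge 1, total 2.0
    [0.5, 0.5, 0.5, 0.5],     # judge 2, total 2.0
    [0.75, 0.75, 0.5, 0.5],   # judge 3, total 2.5
]
print(round(angoff_cut_score(ratings), 2))  # 2.17 out of 4 items
```

For a modified Angoff study, the same function could be fed expected scores per task rather than probabilities, provided all judges rate on the same scale.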
Advantages & disadvantages
Advantages (+)
• Clarity
• Simplicity
Disadvantages (-)
• Cognitive difficulty in conceptualizing the borderline learner
by all judges in precisely the same way.
Ebel method
• 2 rounds
• Experts independently classify test items by:
I. level of difficulty: easy, medium, hard
II. level of relevance: essential, important, acceptable, questionable
Ebel method
The judges estimate the percentage of items a borderline
test taker would get correct for each cell.
Then the percentage for each cell is multiplied by the
number of items in that cell, so if the ‘easy/essential’ cell has 20
items and the estimate is 85%, 20 × 85 = 1700.
These numbers for each of the 12 cells are added up and
then divided by the total number of items to give the cut
score for a single judge.
Finally, these are averaged across judges to give a final
cut score.
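The arithmetic described above can be sketched in Python. The cell counts and percentages are invented (only the 20 × 85 cell echoes the slide's example), only three of the twelve cells are filled for brevity, and the function names are hypothetical.

```python
from statistics import mean

def ebel_judge_cut(cells):
    """cells: list of (number_of_items, percent_correct) pairs,
    one pair per non-empty cell of the difficulty x relevance grid.
    Returns one judge's cut score as a percentage."""
    total_items = sum(n for n, _ in cells)
    if total_items == 0:
        raise ValueError("no items classified")
    # Multiply each cell's percentage by its item count, then sum
    weighted = sum(n * pct for n, pct in cells)
    return weighted / total_items

def ebel_cut_score(judges):
    """Average the per-judge cut scores across the panel."""
    return mean(ebel_judge_cut(cells) for cells in judges)

# Hypothetical data: two judges, three non-empty cells each
judge_a = [(20, 85), (10, 60), (10, 40)]  # (1700 + 600 + 400) / 40 = 67.5
judge_b = [(20, 80), (10, 50), (10, 30)]  # (1600 + 500 + 300) / 40 = 60.0
print(ebel_cut_score([judge_a, judge_b]))  # 63.75
```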
All items can be classified into the 12 cells of a 3 × 4 grid defined by the three
difficulty categories and the four relevance categories, as in the example
(A = number of items in a category, B = % of correctly performed items):

Categories                Expert №3         Expert №4         Expert №5
                          A    B    A×B     A    B    A×B     A    B    A×B
Essential     Easy        11   60   660     10   70   700     13   75   975
              Medium       1   25    25      3   25    75      1    0     0
              Hard         0   10     0      1    0     0      0    0     0
Questionable  Easy         0    0     0      0    0     0      0    0     0
              Medium       0    0     0      0    0     0      0    0     0
              Hard         0    0     0      0    0     0      0    0     0
Mean                      25.1              26.7              35
Mean for all experts      28
Cut-score                 12
…
Problems with the Ebel method
The complex cognitive requirements of classifying items
according to two criteria in relation to an imagined borderline
student may be challenging for the judges.
As it is assumed that some items may have questionable
relevance to the construct of interest, it implicitly throws into
doubt the rigor of the test development process and validity
arguments.
Nedelsky method (Multiple-choice)
The experts estimate, for each multiple-choice item, the number of
distractors a borderline test taker would be able to eliminate.
In a four-option item with three distractors, if a candidate
can eliminate all 3 distractors the chance of getting the
item right is 1 (100%), but if the candidate can only rule out 1 of
the distractors, the chance of answering the item correctly is 1
in 3 (33%).
These probabilities are averaged across all items for each
judge, and then across all judges to arrive at a cut score.
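As a rough Python sketch of the Nedelsky arithmetic: the chance of answering an item correctly is taken as one over the number of options left after elimination, and the averaging follows the description above. The judgments are invented and the function names are hypothetical.

```python
from statistics import mean

def nedelsky_item_probability(n_options, n_eliminated):
    """Chance of a correct answer when guessing at random among the
    options that remain after eliminating n_eliminated distractors."""
    if not 0 <= n_eliminated <= n_options - 1:
        raise ValueError("can eliminate at most all the distractors")
    return 1 / (n_options - n_eliminated)

def nedelsky_cut_score(judges, n_options=4):
    """judges: one list per judge, holding the number of distractors
    eliminated on each item. Returns the cut score as a proportion."""
    per_judge = [
        mean(nedelsky_item_probability(n_options, k) for k in eliminated)
        for eliminated in judges
    ]
    return mean(per_judge)

print(nedelsky_item_probability(4, 3))  # 1.0
print(nedelsky_item_probability(4, 1))  # one in three

# Hypothetical panel: two judges, four four-option items each
panel = [[3, 1, 2, 0], [2, 2, 2, 2]]
cut = nedelsky_cut_score(panel)
```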
Problems with the Nedelsky method
• It assumes that test takers answer multiple-choice items by
eliminating the options that they think are distractors and then
guessing randomly between the remaining options. However, it
is highly unlikely that test takers answer items in this way.
• The Nedelsky method tends to produce lower cut scores than other
methods and is therefore likely to increase the number of false
positives.
Bookmark method
Essential materials:
• Directions to Bookmark participants
• Ordered item booklet
• Booklet guideline
• Student exemplar papers
• Scoring guide
Standard setting: basic steps of the procedure
Round I: Experts are informed about the required number
of cut-scores to establish. Experts work in small groups, and
all the essential material is introduced to them.
Round II: Overview of the cut-scores established by every
expert; the same procedure as in the first step is repeated.
Round III: Presentation of the percentage of students falling
into each performance level and each median cut-score from
Round 2. After discussion, individual judgments are made.
Procedures in Bookmark method
• Judges are presented with the necessary materials.
• Then they are asked to keep in mind a borderline student,
and place a ‘bookmark’ in the booklet between two items,
such that the candidate is more likely to answer the items
below the bookmark correctly, and the items above it incorrectly.
• The bookmarks are discussed in groups, and finally the
median of the bookmarks for each cut point is taken as that
group’s recommendation for that cut point.
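The final step, taking the median bookmark per cut point, might look like this in Python. The bookmark positions (page numbers in the ordered item booklet) and the cut-point names are invented, and the function name is hypothetical. Converting a bookmark position into a score on the reporting scale is a separate step that is not shown here.

```python
from statistics import median

def group_recommendation(bookmarks_by_cut_point):
    """Map each cut point to the median of the judges' bookmarks."""
    return {cut_point: median(positions)
            for cut_point, positions in bookmarks_by_cut_point.items()}

# Hypothetical bookmark placements (page in the ordered item booklet)
bookmarks = {
    "basic/proficient": [12, 14, 13, 15, 13],
    "proficient/advanced": [24, 27, 25, 26],
}
print(group_recommendation(bookmarks))
# {'basic/proficient': 13, 'proficient/advanced': 25.5}
```

The median is less sensitive than the mean to a single judge who places a bookmark far from the rest of the group.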
Examinee-centered methods
The judges make decisions about whether
individual test takers are likely to be just
below a particular standard; the test is
then administered to the test takers to
discover where the cut score should lie.
Borderline group method
The judges define what borderline candidates are
like, and then identify borderline candidates who fit
the definition.
Once the students have been placed into groups, the
test can be administered. The median score for the
group defined as borderline is used as the cut score.
The main problem: the cut score is dependent upon
the group being used in the study.
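A short Python sketch shows both the calculation and the stated problem. The two sets of scores are invented and the function name is hypothetical; the same test yields a different cut score depending on which borderline group happens to be sampled.

```python
from statistics import median

def borderline_cut_score(borderline_scores):
    """Cut score = median test score of the group judged borderline."""
    return median(borderline_scores)

# Hypothetical scores from two different borderline groups
group_1 = [41, 45, 47, 50, 52, 44, 48]
group_2 = [38, 40, 43, 45, 46]

print(borderline_cut_score(group_1))  # 47
print(borderline_cut_score(group_2))  # 43
```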
Method of contrasting groups
The procedure involves testing two groups of examinees, classified as
competent and non-competent:
• The classification must be done using independent criteria,
such as teacher judgments.
• The test is then given, and the score distributions are
calculated. There are likely to be overlaps in the
distributions.
• The cut score is set in the region where the two
distributions overlap.
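One way to locate a cut score within the overlap is sketched below in Python. Choosing the cut that minimizes the total number of misclassifications is an assumption of this sketch, not a rule from the chapter; in practice false positives and false negatives may be weighted differently. The scores and function names are hypothetical.

```python
def decision_errors(competent, non_competent, cut):
    """Return (false_positives, false_negatives) for a cut score.
    A test taker passes when the score is >= cut."""
    false_negatives = sum(s < cut for s in competent)
    false_positives = sum(s >= cut for s in non_competent)
    return false_positives, false_negatives

def best_cut(competent, non_competent):
    """Pick the observed score that minimizes total misclassifications.
    Ties are resolved in favour of the lowest candidate score."""
    candidates = sorted(set(competent) | set(non_competent))
    return min(
        candidates,
        key=lambda c: sum(decision_errors(competent, non_competent, c)),
    )

# Hypothetical scores for the two independently classified groups
competent = [48, 55, 60, 62, 65, 70, 72]
non_competent = [30, 35, 40, 45, 50, 52, 58]

cut = best_cut(competent, non_competent)
print(cut)                                             # 55
print(decision_errors(competent, non_competent, cut))  # (1, 1)
```

Here one non-competent test taker (58) would pass and one competent test taker (48) would fail, which is the kind of decision-error estimate this method makes available.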
Aligning tests to standards
Which method is the ‘best’?
It depends on what kind of judgments you can get for your
standard-setting study, and the quality of the judges that you
have available.
However, using the contrasting groups approach is
recommended where possible because it is the only method
that allows the calculation of likely decision errors (false
positives and false negatives) for cut scores.
The problem is getting the judgments of a number of people
on a large group of individuals.
Evaluating standard setting (Kane, 1994)
Procedural evidence
• What procedures were used for the standard-setting to ensure that
the process is systematic?
• Were the judges properly trained in the methodology and allowed
to express their views freely?
Internal evidence
• Deals with the consistency of results arising from the procedure.
• It also estimates the extent of agreement between judges (Cohen’s
kappa).
External evidence
• Correlation of scores of learners in a borderline group study with
some other test of the same construct.
• High correlation = the established cut scores are defensible.
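For the internal evidence, Cohen's kappa for two judges can be computed by hand, as in the following Python sketch. The pass/fail judgments are invented and the function name is hypothetical; kappa compares observed agreement with the agreement expected by chance from each judge's own rating frequencies.

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two judges rating the same cases."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)
    # Proportion of cases on which the two judges agree
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance from each judge's marginal frequencies
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)

# Hypothetical pass/fail judgments by two judges on eight test takers
judge_a = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "fail"]
judge_b = ["pass", "pass", "fail", "pass", "pass", "fail", "fail", "fail"]
print(cohens_kappa(judge_a, judge_b))  # 0.5
```

In this example the judges agree on 6 of 8 cases (0.75), chance agreement is 0.5, and so kappa is (0.75 - 0.5) / (1 - 0.5) = 0.5.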
Training: a critical part of standard setting
Training activities include familiarization with the PLDs and
the test, looking at the scoring keys, making practice
judgments, and getting feedback.
Different views may lead to disagreements among the judges.
Training should not be designed to eliminate these variations
but to allow free discussion among judges. If the judges do not
converge, the outcome should be accepted by the researchers.
The training process should not force agreement (cloning)
because removing the judges’ individuality and inducing agreement is
a threat to validity.
The special case of the CEFR
• The CEFR Manual contains performance level descriptors for
standard setting in order to introduce a common language and a single
reporting system into Europe.
• It recommends five processes to ‘relate’ language examinations to
the Common European Framework of Reference for Languages:
Learning, Teaching, Assessment (CEFR). These processes are:
familiarization, specification, standardization training/benchmarking,
standard-setting, and validation.
• Familiarization, standard-setting, and validation are uncontentious
because they reflect common international assessment practice that is
not unique to Europe; however, the other two sections are problematic.
PLDs in CEFR & in other standards-based systems
CEFR
• The use of PLDs in the CEFR is institutionalized & their
meaning is generalized across nations.
• Standardization facilitates ‘the implementation of a common
understanding of CEFR’, and training is cloning rather than
familiarization.
• Benchmarking = the process of rating individual performance
samples using the CEFR PLDs.
• Standard-setting = ‘mapping’ the existing cut scores from tests
onto CEFR levels.
Other standards-based systems
• PLDs are evaluated in terms of their usefulness and
meaningfulness; they can be discarded or changed.
• Standardization & training ensure that everyone understands
the standard-setting method, yet judgments are freely made.
• Benchmarking = the typical performances that are identified
after standard-setting.
• Standard-setting = establishing cut scores on tests.
You can always count on uncertainty
Standards-based testing can be positive if people can
reach a consensus, rather than being forced to see the
world through a single lens. Used in this way,
standards are never fixed, monolithic edifices. They
are open to change, and even rejection, in the service
of language education.
Standards-based testing fails if it is used as a policy
tool to achieve control of educational systems with the
intention of imposing a single acceptable teaching and
assessment discourse upon professionals.
Thank You
For Your Attention

Aligning tests to standards

  • 1. © Fariba Chamani, 2015 GLEN FULCHER (2010). PRACTICAL LANGUAGE TESTING CHAPTER 5: ALIGNING TESTS TO STANDARDS
  • 2. Content of this chapter It’s as old as the hills The definition of ‘standards’ The uses of standards Unintended consequences revisited Using standards for harmonization and identity How many standards can we afford? Performance level descriptors (PLDs) and test scores Some initial decisions Standard-setting methodologies Evaluating standard setting Training The special case of the CEFR You can always count on uncertainty
  • 3. It’s as old as the hills Standard setting = The process of establishing one or more cut scores on examinations Standards-based assessment = Using tests to assess learner performance and achievement in relation to an absolute standard A development of criterion-referenced testing Using large-scale standardized tests Pre-dating the criterion-referenced testing move
  • 4. Definition of ‘standard’ Standard = a level of performance required or experienced (Davies et al., 1999). Example: The standard required for entry to the university is an A in English.
  • 5. The uses of standards Educational purposes: (achievement tests) Professional purposes: (certification of aircraft engineers) Political purposes: (NCLB & AYP) Immigration Policy purposes
  • 6. Unintended consequences In case of NCLB: ELL group is always lower than the standard & resources are not channeled to where they are most needed. Mandatory use of English in tests of content subjects puts pressure on the indigenous people to abandon education in their own language The use of language tests for immigration leads to fraudulent practices & short-term paper marriages
  • 7. Using standards for harmonization & identity To enforce conformity to a single model that helps to create and maintain political unity and identity. Examples: Carolingian empire of Charlemagne (CE 800– 814) CEFR (Now)
  • 8. Carolingian empire of Charlemagne Within the empire of Charlemagne, in Central and Western Europe, various groups followed different calendars, and the main Christian festivals fell on different dates. In order to bring uniformity, Charlemagne set a new standard for ‘computists’ who worked out the time of festivals. They required to pass a test in order to get their certificate. There are no ‘correct answers’ for the questions in the Carolingian test, they are scored as ‘correct’ because they are defined as such by the standard, and the standard is arbitrarily chosen with the intention of harmonizing practice.
  • 9. CEFR (Common European Framework of Reference ) CEFR = A set of standards (six-level scales and their descriptors ) that provides a European model for language testing and learning to enhance European identity and harmonization. Teachers should align their curriculum and tests to CEFR standards (Linking) otherwise many European institutions will not recognize the certificate they awarded.
  • 10. Problems with CEFR It drains creativity among teachers. The same set of standards are used for all people across different contexts, with different purposes. Validation is based on linking the test to the CEFR. This is against validity theories. The use of standards and tests for harmonization ultimately leads to a desire for more control.
  • 11. How many standards can we afford? The number of performance levels depends on the goals and the use of the test. Choosing the fewest performance levels (pass or fail) is ideal because the more numerous the classes, the greater will be the danger of a small difference in marks. Index of Separation estimates the number of performance levels into which a test can reliably place test takers. Sometimes we have to use numerous categories to motivate young learners.
  • 12. Performance level descriptors (PLDs) & Test scores PLDs are often developed based on intuitive and experiential method & the labels and descriptors are simple reflections of the values of policy makers. There are around four levels: ‘advanced – proficient – basic – below basic’. The PLDs provide a conceptual hierarchy of performance that is an indication of the ability or knowledge of the test taker. Standard-setting is the process of deciding on a cut score for a test to mark the boundary between two PLDs. If we have two performance levels (pass and fail), we’ll need a single cut score.
  • 13. Standard based tests, CRT & scoring rubrics It is said that tests used in standards-based testing are criterion- referenced, yet for Glaser the criterion was the domain, and it does not have anything to do with standard setting and classification. The standards-based testing movement has interpreted ‘criterion’ to mean ‘standard’. The focus within PLDs is on the general levels of competence, proficiency, or performance while scoring rubrics address only single items.
  • 14. Some initial decisions All standard setting methods involve expert judgemental decision making at some level (Jaegar, 1979). Decision 1: Compensatory or non-compensatory marking? The strength in other areas ‘compensates’ for the weakness in one area.  Decision 2: What classification errors can you tolerate? Decision 3: Are you going to allow test takers who ‘fail’ a test to retake it? If so, what time lapse is required to retake the test?
  • 15. Second Page : "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum." Cycle Diagram Test-centered Criterion-referenced Norm- referenced Examinee- centered Standard-Setting Methods Classification of
  • 16. Standard-setting methodologies Test-centered • Angoff • Ebel • Nedelsky • Bookmark Examinee-centered • Method of Contrasting Groups • Method of Borderline group
  • 17. Common process of standard setting Select an appropriate standard setting method depending upon the purpose of the standard setting, available data, and personnel. Select a panel of judges based upon explicit criteria. Prepare the PLDs and other materials as appropriate. Train the judges to use the method select. Rate items or persons, collect and store data. Provide feedback on rating and initiate discussion for judges to explain their ratings, listen to others, and revise their views or decisions, before another round of judging. Collect final ratings and establish cut scores. Ask the judges to evaluate the process. Document the process in order to justify the conclusions reached.
  • 18. Test-centered methods The judges are presented with individual items or tasks and required to make a decision about the expected performance on them by a test taker who is just below the border between two standards.
  • 19. Angoff method  Experts are given a set of items and they need to rate the probability that a hypothetical learner (who is on the borderline) would answer each test item correctly.  The average of these probabilities across judges or raters is the cut score.  If the test contains polytomous items or tasks, the proportion of the maximum score is used instead of the probability (modified Angoff).
  • 20. Advantages & disadvantages  Clarity  Simplicity Cognitive difficulty in conceptualizing the borderline learner by all judges in precisely the same way. + -
  • 21. Ebel method  2 Rounds  Experts classify independently test items by: I level of difficulty II level of relevance easy medium hard essential important acceptable questionable
  • 22. Ebel method The judges estimate the percentage of items a borderline test taker would get correct for each cell. Then the percentage for each cell is multiplied by the number of items, so if the ‘easy/essential’ cell has 20 items, 20 *􏰀 85 = 1700. These numbers for each of the 12 cells are added up and then divided by the total number of items to give the cut score for a single judge. Finally, these are averaged across judges to give a final cut score.
  • 23. All items could be classified: 12 cells in a 3*4 grid defined by the three difficulty and four relevance category. As in the example: categories Expert №3 Expert №4 Expert №5 Number of items in a category (А) % correctly performed items (В) А*В Number of items in a category (А) % correctly performed items (В) А*В Number of items in a category (А) % correctly performed items (В) А*В Essential Easy 11 60 660 10 70 700 13 75 975 Medium 1 25 25 3 25 75 1 0 0 Hard 0 10 0 1 0 0 0 0 0 Questionabl e Easy 0 0 0 0 0 0 0 0 0 Medium 0 0 0 0 0 0 0 0 0 Hard 0 0 0 0 0 0 0 0 0 Mean 25.1 26.7 35 Mean for all experts 28 Cut-score 12 …
  • 24. Problems with EBEL The complex cognitive requirements of classifying items according to two criteria in relation to an imagined borderline student may be challenging for the judges. As it is assumed that some items may have questionable relevance to the construct of interest, it implicitly throws into doubt the rigor of the test development process and validity arguments.
  • 25. Nedelsky method (Multiple-choice) The experts estimate which options in each multiple-choice item a borderline test taker would be able to eliminate. In a four-option item with three distractors, if a candidate can eliminate all 3 distractors, the chance of getting the item right is 1 (100%), but if the candidate can rule out only 1 of the distractors, the chance of answering the item correctly is 1 in 3 (33%). These probabilities are averaged across all items for each judge, and then across all judges to arrive at a cut score.
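The Nedelsky probabilities follow directly from the number of options left after elimination. The sketch below is a minimal Python illustration with hypothetical function names and judgments. It assumes, as the method does, that the borderline test taker guesses at random among the remaining options.

```python
from statistics import mean


def nedelsky_item_probability(n_options, n_eliminated):
    """Chance of a correct answer after eliminating n_eliminated distractors."""
    remaining = n_options - n_eliminated
    if n_eliminated < 0 or remaining < 1:
        raise ValueError("only distractors can be eliminated")
    return 1 / remaining


def nedelsky_cut_score(judgments, n_options=4):
    """judgments[j][i] = number of distractors judge j thinks a borderline
    test taker can eliminate on item i. Returns a proportion of the maximum."""
    per_judge = [
        mean([nedelsky_item_probability(n_options, k) for k in judge])
        for judge in judgments
    ]
    return mean(per_judge)


# The two cases from the slide: a four-option item with 3 or 1 distractors removed.
print(nedelsky_item_probability(4, 3))   # all distractors eliminated
print(nedelsky_item_probability(4, 1))   # one distractor eliminated

# Hypothetical judgments from two judges on two items.
print(nedelsky_cut_score([[3, 1], [2, 2]]))
```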
  • 26. Problems with the Nedelsky method It assumes that test takers answer multiple-choice items by eliminating the options that they think are distractors and then guessing randomly among the remaining options. However, it is highly unlikely that test takers answer items in this way. The Nedelsky method tends to produce lower cut scores than other methods and is therefore likely to increase the number of false positives.
  • 27. Bookmark method Essential materials: directions to Bookmark participants; ordered item booklet; booklet guideline; student exemplar papers; scoring guide.
  • 28. Standard Setting Basic steps of the procedure. Round I: Experts work in small groups, and all the essential material is introduced to them. Experts are informed about the number of cut-scores they need to establish. Round II: Overview of the cut-scores established by every expert, followed by a repeat of the same procedure as in the first step. Round III: Presentation of the percentage of students falling into each performance level and each median cut-score from Round II. After discussion, individual judgments are made.
  • 29. Procedures in Bookmark method Judges are presented with the necessary materials. They are then asked to keep a borderline student in mind and place a ‘bookmark’ in the booklet between two items, such that the candidate is more likely to answer the items below the bookmark correctly and the items above it incorrectly. The bookmarks are discussed in groups, and finally the median of the bookmarks for each cut point is taken as that group’s recommendation for that cut point.
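The final aggregation step of the Bookmark procedure, taking the median bookmark per cut point, can be sketched as follows. The function name and the placements are hypothetical. A bookmark is represented here simply as a position in the ordered item booklet, which is an assumption of this sketch. Translating that position into a reported score is a separate step not covered here.

```python
from statistics import median


def bookmark_recommendation(placements):
    """Median bookmark position for each cut point.

    placements[j][k] is the booklet position where judge j placed the
    bookmark for cut point k.
    """
    if not placements:
        raise ValueError("at least one judge is required")
    n_cuts = len(placements[0])
    if any(len(p) != n_cuts for p in placements):
        raise ValueError("every judge must place one bookmark per cut point")
    return [median([p[k] for p in placements]) for k in range(n_cuts)]


# Hypothetical placements: three judges, two cut points each.
placements = [
    [12, 25],
    [14, 27],
    [13, 30],
]
print(bookmark_recommendation(placements))
```

With an even number of judges, `statistics.median` returns the midpoint of the two middle placements, so the recommendation may fall between two booklet positions.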
  • 30. Examinee-centered methods The judges make decisions about whether individual test takers are likely to be just below a particular standard; the test is then administered to the test takers to discover where the cut score should lie.
  • 31. Borderline group method The judges define what borderline candidates are like, and then identify borderline candidates who fit the definition.  Once the students have been placed into groups the test can be administered. The median score for a group defined as borderline is used as the cut score. The main problem: the cut score is dependent upon the group being used in the study.
  • 32. Method of contrasting groups The procedure involves testing two groups of examinees: competent and non-competent. • The classification must be done using independent criteria, such as teacher judgments. • The test is then given, and the score distributions are calculated. There are likely to be overlaps in the distributions. • The cut score will be where overlap is observed in the distributions.
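One way to locate a cut score inside the overlap of the two distributions is to pick the score that misclassifies the fewest examinees. The slides do not prescribe this rule, so the sketch below is an assumption made for illustration. The function name and the scores are hypothetical as well. An examinee passes when the score is at or above the cut.

```python
def contrasting_groups_cut(competent, non_competent):
    """Return (cut, false_positives, false_negatives) with the fewest errors.

    false positives: non-competent examinees scoring at or above the cut
    false negatives: competent examinees scoring below the cut
    Ties are resolved in favour of the lowest candidate cut score.
    """
    candidates = sorted(set(competent) | set(non_competent))
    best = None
    for cut in candidates:
        false_pos = sum(score >= cut for score in non_competent)
        false_neg = sum(score < cut for score in competent)
        if best is None or false_pos + false_neg < best[1] + best[2]:
            best = (cut, false_pos, false_neg)
    return best


# Hypothetical scores for two groups classified by teacher judgment.
competent = [55, 60, 62, 70, 75]
non_competent = [30, 40, 45, 52, 58]
cut, false_pos, false_neg = contrasting_groups_cut(competent, non_competent)
print(cut, false_pos, false_neg)
```

Because the error counts are available for every candidate, the same loop can be adapted to weight false positives and false negatives differently when one kind of error is more costly.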
  • 34. Which method is the ‘best’? It depends on what kind of judgments you can get for your standard-setting study, and the quality of the judges that you have available. However, the contrasting groups approach is recommended where possible because it is the only method that allows the calculation of likely decision errors (false positives and false negatives) for cut scores. The problem is getting the judgments of a number of people on a large group of individuals.
  • 35. Evaluating standard setting (Kane, 1994) Procedural evidence: • What procedures were used for the standard setting to ensure that the process is systematic? • Were the judges properly trained in the methodology and allowed to express their views freely? Internal evidence: • Deals with the consistency of results arising from the procedure. • It also estimates the extent of agreement between judges (Cohen’s kappa). External evidence: • Correlation of scores of learners in a borderline group study with some other test of the same construct. • High correlation = the established cut scores are defensible.
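Cohen’s kappa, mentioned above as internal evidence, corrects the observed agreement between two judges for the agreement expected by chance. The sketch below is a plain-Python illustration with a hypothetical function name and hypothetical classifications. It handles exactly two judges, so a larger panel would need pairwise comparisons or a different statistic.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two judges classifying the same test takers."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both judges must classify the same non-empty set")
    n = len(rater_a)
    # Proportion of test takers on whom the two judges agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each judge's category frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when both judges use one category")
    return (observed - expected) / (1 - expected)


# Hypothetical pass/fail classifications of six test takers.
judge_1 = ["pass", "pass", "fail", "fail", "pass", "fail"]
judge_2 = ["pass", "fail", "fail", "fail", "pass", "pass"]
print(round(cohens_kappa(judge_1, judge_2), 3))
```

A value of 1 indicates perfect agreement and a value near 0 indicates agreement no better than chance.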
  • 36. Training: a critical part of standard setting Training activities include familiarization with the PLDs and the test, looking at the scoring keys, making practice judgments, and getting feedback. Different views may lead to disagreements among the judges. Training should not be designed to eliminate these variations but to allow free discussion among judges. If the judges do not converge, the outcome should be accepted by the researchers. The training process should not force agreement (cloning) because removing their individuality and inducing agreement is a threat to validity.
  • 37. The special case of the CEFR • The CEFR Manual contains performance level descriptors for standard setting in order to introduce a common language and a single reporting system into Europe. • It recommends five processes to ‘relate’ language examinations to the Common European Framework of Reference for Languages: Learning, Teaching, Assessment (CEFR). These processes are: familiarization, specification, standardization training/benchmarking, standard-setting, and validation. • Familiarization, standard-setting, and validation are uncontentious because they reflect common international assessment practice that is not unique to Europe; however, the other two (specification and standardization training/benchmarking) are problematic.
  • 38. PLDs in CEFR & in other standard-based systems CEFR: The use of PLDs in the CEFR is institutionalized & their meaning is generalized across nations. Standardization facilitates ‘the implementation of a common understanding of CEFR’, and training is cloning rather than familiarization. Benchmarking = the process of rating individual performance samples using the CEFR PLDs. Standard-setting = ‘mapping’ the existing cut scores from tests onto CEFR levels. Other standard-based systems: PLDs are evaluated in terms of their usefulness and meaningfulness; they can be discarded or changed. Standardization & training ensure that everyone understands the standard-setting method, yet judgments are freely made. Benchmarking = the typical performances that are identified after standard-setting. Standard-setting = establishing cut scores on tests.
  • 39. You can always count on uncertainty Standards-based testing can be positive if people can reach a consensus, rather than being forced to see the world through a single lens. Used in this way, standards are never fixed, monolithic edifices. They are open to change, and even rejection, in the service of language education. Standards-based testing fails if it is used as a policy tool to achieve control of educational systems with the intention of imposing a single acceptable teaching and assessment discourse upon professionals.
  • 40. Thank You For Your Attention