Using Item Response Theory to Improve Assessment

Day 1 PM: Using IRT
Item and test information
Comparison of IRT to Classical Test Theory
How to do IRT analysis

Part 1
Item and test information

Information
 Information is the tool that IRT uses
to build tests
 It is a statistical term that quantifies
how much something “adds” to a
procedure
 Or, alternatively, how much
uncertainty (error) it decreases
 A good test has a lot of information!

Item information
 IRT calculates information for each
item and test at each level of θ
 It is therefore not a single number –
it is a function across ability
 Each item has an item information
function
 Each test has a test information
function

Item information
 Some items provide information for
high students, some for low
 Same is true for tests: a test can be
more accurate for certain score
ranges – and IRT will tell you which

Information
 Item information is additive, that
is, it can be summed across items to
obtain the test information function (TIF)
 Then we know where to add/subtract
items
 Bonus: The TIF can also be inverted
to obtain a predicted SEM curve

Item information
 With CTT, “information” can be
conceptualized by jointly considering
the P and rpbis
◦ Obviously, a higher rpbis is better
 Definitely don’t want negative!
◦ P represents which examinees it is most
appropriate for
 P = 0.95 is easy, good for low examinees
 P = 0.50 is hard, good for high examinees
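As a rough illustration of the two classical statistics named on this slide, the sketch below computes P and rpbis for a single item. The response data and function names are hypothetical, and the point-biserial is the uncorrected version (the item is left in the total score).

```python
import math

def p_value(item_scores):
    """Proportion of examinees answering the item correctly (CTT difficulty P)."""
    return sum(item_scores) / len(item_scores)

def point_biserial(item_scores, total_scores):
    """Pearson correlation between a 0/1 item score and the total score."""
    n = len(item_scores)
    mean_x = sum(item_scores) / n
    mean_y = sum(total_scores) / n
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(item_scores, total_scores)) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in item_scores) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in total_scores) / n)
    return cov / (sd_x * sd_y)

# Hypothetical data: six examinees, one item, and their total test scores.
item = [0, 0, 1, 1, 1, 1]
total = [3, 5, 6, 8, 9, 10]

P = p_value(item)
rpbis = point_biserial(item, total)
print(f"P = {P:.2f}, rpbis = {rpbis:.2f}")
```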

Item information
 But since items and examinees are
not on the same scale, there is no
direct connection
 With IRT, there is
 Item with b = 0.7 is good for person
with θ = 0.7
◦ This is the basis of adaptive testing –
doing this continually
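The matching idea can be sketched in a few lines. This is a simplified illustration, not how any particular adaptive engine works: it picks the unused item whose b is closest to the current θ estimate, whereas operational adaptive tests usually select on item information. The bank values and function name are hypothetical.

```python
def pick_closest_item(theta, difficulties, used=()):
    """Return the index of the unused item whose b is nearest to theta."""
    candidates = [i for i in range(len(difficulties)) if i not in used]
    return min(candidates, key=lambda i: abs(difficulties[i] - theta))

# Hypothetical bank of b parameters.
b_values = [-2.0, -1.0, -0.5, 1.0, 0.0]

first = pick_closest_item(0.7, b_values)
second = pick_closest_item(0.7, b_values, used={first})
print(first, second)
```

In a real adaptive test θ would be re-estimated after every response, so the target of the search moves as the test proceeds.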

Item information
 Item information takes this idea and
quantifies it across the spectrum
 It is therefore a function of θ as well as
the item parameters
$$ I_i(\theta) = D^2 a_i^2 \, \frac{Q_i(\theta)}{P_i(\theta)} \left( \frac{P_i(\theta) - c_i}{1 - c_i} \right)^2 $$
 Where P(θ) is the probability of a
correct answer for a given value of θ and
Q(θ) is 1 - P(θ)
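A minimal sketch of the 3PL item information calculation, I(θ) = D²a²(Q/P)((P - c)/(1 - c))², may make the formula more concrete. The function names are illustrative, and D = 1.7 is an assumption; some programs use D = 1.0.

```python
import math

D = 1.7  # assumed scaling constant; not specified in the slides

def p3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def item_info(theta, a, b, c):
    """3PL item information at theta."""
    p = p3pl(theta, a, b, c)
    q = 1.0 - p
    return (D * a) ** 2 * (q / p) * ((p - c) / (1.0 - c)) ** 2

# Same a and b, with and without guessing, evaluated at theta = b.
info_no_guessing = item_info(0.0, 1.0, 0.0, 0.0)
info_with_guessing = item_info(0.0, 1.0, 0.0, 0.25)
print(info_no_guessing, info_with_guessing)
```

With c = 0 the expression reduces to D²a²PQ, and raising c lowers the information at the same θ.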

Item information
 That is the computational equation
 Conceptual version that is seen in the
literature is
$$ I_i(\theta) = \frac{[P_i'(\theta)]^2}{P_i(\theta)\,(1 - P_i(\theta))} $$
 Or the slope squared over the
conditional variance
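To see that the conceptual version (slope squared over conditional variance) agrees with the computational 3PL formula, the sketch below approximates the slope of the IRF numerically and compares the two results. The function names, the step size h, and D = 1.7 are assumptions made for illustration.

```python
import math

D = 1.7  # assumed scaling constant

def p3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def info_computational(theta, a, b, c):
    """Information from the closed-form 3PL expression."""
    p = p3pl(theta, a, b, c)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def info_conceptual(theta, a, b, c, h=1e-5):
    """Information as (slope of the IRF)^2 / (P * (1 - P)), slope by central difference."""
    slope = (p3pl(theta + h, a, b, c) - p3pl(theta - h, a, b, c)) / (2.0 * h)
    p = p3pl(theta, a, b, c)
    return slope ** 2 / (p * (1.0 - p))

comp = info_computational(-1.5, 1.00, -2.00, 0.26)
conc = info_conceptual(-1.5, 1.00, -2.00, 0.26)
print(comp, conc)
```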

Graphing info
 So what does this mean?
 We calculate with that equation, and
it will be higher wherever the slope
of the IRF is higher (for a given value
of θ)
 This is the item information function
(IIF)

Graphing info
 So the location of the item
determines the location of the IIF
 The discrimination of the item
determines the spread/peakedness of
the IIF
 Information decreases as the guessing
parameter increases

Some example items
Seq    a      b      c
1     1.00  -2.00   0.26
2     0.70  -1.00   0.21
3     0.40  -0.50   0.30
4     0.50   1.00   0.00
5     0.80   0.00   0.22

Graphing info functions
 Note that a lower slope is not ALL
bad
 Even though Item 3’s peak is lower, it
provides some info at a much wider
range
 So items like that are quite useful
when info is needed across a wide
range
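The trade-off between peak height and spread can be checked numerically for the five example items. The sketch below is illustrative only: it assumes the 3PL model with D = 1.7, evaluates information on a θ grid from -8 to 8, and uses "width of the range at or above half the item's own peak" as a hypothetical measure of spread.

```python
import math

D = 1.7  # assumed scaling constant

# (a, b, c) for the five example items
ITEMS = [
    (1.00, -2.00, 0.26),
    (0.70, -1.00, 0.21),
    (0.40, -0.50, 0.30),
    (0.50, 1.00, 0.00),
    (0.80, 0.00, 0.22),
]

def item_info(theta, a, b, c):
    """3PL item information at theta."""
    p = c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

GRID = [i / 20.0 for i in range(-160, 161)]  # theta from -8 to 8 in steps of 0.05

def peak_and_width(a, b, c):
    """Peak information and width of the theta range at or above half that peak."""
    infos = [item_info(t, a, b, c) for t in GRID]
    peak = max(infos)
    above = [t for t, i in zip(GRID, infos) if i >= peak / 2.0]
    return peak, above[-1] - above[0]

summary = [peak_and_width(*item) for item in ITEMS]
for seq, (peak, width) in enumerate(summary, start=1):
    print(f"Item {seq}: peak info = {peak:.3f}, half-peak width = {width:.2f}")
```

Under these assumptions Item 3 has a lower peak than Item 1 but spreads its information over a wider range of θ.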

Using item info
 Item information is inversely related
to error in measurement
 If the item provides more info, it
reduces error
 The equation:
$$ SEM(\theta) = \frac{1}{\sqrt{I(\theta)}} $$
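As a small worked example of the inverse relationship, the sketch below converts information values to SEM using SEM(θ) = 1/√I(θ). The information values are made up for illustration.

```python
import math

def sem_from_info(info):
    """Standard error of measurement implied by an information value."""
    return 1.0 / math.sqrt(info)

sem_low_info = sem_from_info(4.0)    # less information, more error
sem_high_info = sem_from_info(16.0)  # four times the information
print(sem_low_info, sem_high_info)
```

Because of the square root, quadrupling the information only halves the SEM.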

Using item info
Key point: an item has less
error where it has more
information
--> where it has more slope
A test has less error where
it has more information
(items)

Using item info
 IIFs are another way to examine
items individually
 They are also what adaptive testing
utilizes for item selection
 But the best use of item info: test
information and test assembly…

Test information
 As a result of the assumption of local
independence, IIFs can be summed to
obtain a test information function
(TIF)
 Same is true for IRFs – they can be
summed into a TRF
◦ This converts thetas to estimated raw
score
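Both sums can be sketched directly. The example below assumes the 3PL model with D = 1.7 and uses the five example items; the function names are illustrative.

```python
import math

D = 1.7  # assumed scaling constant

# (a, b, c) for the five example items
ITEMS = [
    (1.00, -2.00, 0.26),
    (0.70, -1.00, 0.21),
    (0.40, -0.50, 0.30),
    (0.50, 1.00, 0.00),
    (0.80, 0.00, 0.22),
]

def p3pl(theta, a, b, c):
    """Item response function (probability correct) under the 3PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def item_info(theta, a, b, c):
    """3PL item information at theta."""
    p = p3pl(theta, a, b, c)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def trf(theta):
    """Test response function: expected raw score at theta (sum of IRFs)."""
    return sum(p3pl(theta, *item) for item in ITEMS)

def tif(theta):
    """Test information function: sum of the IIFs at theta."""
    return sum(item_info(theta, *item) for item in ITEMS)

for theta in (-3.0, 0.0, 3.0):
    print(f"theta = {theta:+.1f}: expected raw score = {trf(theta):.2f}, "
          f"info = {tif(theta):.2f}, CSEM = {1.0 / math.sqrt(tif(theta)):.2f}")
```

The expected raw score never falls below the sum of the c parameters, since even very low examinees are expected to get some items right by guessing.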

Test information
 Test information, like item
information, shows how well a test
measures at each value of θ
 Also inverts to CSEM
 This is extremely useful for test
assembly (aka construction, design,
or building)

Test information
 Consider the 5 IRFs…

Test information
 The TRF is…

Test information
 Consider the 5 IIFs…

Test information
 The TIF is…

Test information
 The CSEM curve is…

Test assembly
 Form building is more efficient and
better directed with IRT
 Reason: we can predict measurement
error (SEM) at each level of θ, not
just overall reliability

Test assembly
 This then allows you to build test
forms with specific TIFs or CSEMs in
mind
 Or multiple forms with the same TIF
 The following figures have the same
average a (0.9) but differ in where
they provide information

Test development
 You can build your test with a specific
TRF/TIF/SEM graph in mind
 Peak at cutscore?
 This can be done inside item bankers
(FastTEST & FT Web) or in separate
spreadsheets (my Form Building Tool)

Bank development
 You can also build the bank for a
testing program with the desired TIF
in mind
 If you know you want it to be peaked,
write items at the desired level of
difficulty to build an adequate bank

Bank development
 Otherwise you risk overexposure
 Don’t use all your best items at once
to make a peaked TIF – or any TIF for
that matter
 In the theoretical IRT world, we don’t
have to worry about that, but
exposure is a real issue

Bank development
 That is the reason linear-on-the-fly
testing (LOFT) was developed – to massively
reduce exposure and increase
security
◦ Every person gets a very similar TIF, but
a completely different test
◦ These tests are parallel, from an IRT
point of view
◦ Tests are conventional fixed-form

Part 2
A brief comparison of CTT and IRT

CTT and IRT Assumptions
 IRT:
◦ Unidimensionality and local independence
◦ Responses modeled by IRF
◦ Parameters, not statistics (sample
independence)
 CTT:
◦ X = T + E
◦ (1) true scores and error scores are
uncorrelated; (2) the average error score in
the sample is zero
◦ Statistics (not parameters) are sample-based

Comparing CTT and IRT
 CTT is said to have weaker assumptions
◦ Does not explicitly assume
unidimensionality
 But if not there, statistics will be iffy, and rpbis
and reliability suffer
 Sum scoring implicitly assumes items are
equivalent, which means unidimensional (all
items count equally on one total score)

 CTT is said to have weaker assumptions
◦ Does not explicitly assume IRF
 But if the idea of an IRF is not working, then the
item isn’t either
 And if you use rpbis, you assume a linear IRF,
which is actually impossible!

 CTT item statistics are at odds with
each other
◦ P says that there is one common
probability of a correct response
(binomial)
◦ But rpbis says that P increases with total
score (~ability)

 Classical SEM: same for everyone
 IRT SEM: different for everyone –
depends on the items you see and
your ability
 Which is more realistic?
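For comparison, the classical SEM is commonly computed as SD × √(1 - reliability), which yields one value for the whole sample. The sketch below uses made-up values for the score standard deviation and the reliability.

```python
import math

def classical_sem(score_sd, reliability):
    """Classical SEM: a single value applied to every examinee."""
    return score_sd * math.sqrt(1.0 - reliability)

# Hypothetical test: raw-score SD of 10 and reliability of 0.91.
sem = classical_sem(10.0, 0.91)
print(round(sem, 2))
```

The IRT counterpart, 1/√I(θ), changes with θ and with the items administered, which is the contrast this slide draws.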

 Direct comparison of item statistics
◦ We still use “difficulty” and
“discrimination”
◦ How different are they from CTT?
◦ Difficulty correlates highly (>0.90)
◦ Discrimination does not – because rpbis
is linear and IRT is not

 IRT and CTT scores also correlate
>0.95
 So why use IRT?
 There are distinct advantages…

Advantages of IRT
 IRT has parameters, not statistics
 Sample-independent… within a linear
transformation
 Huh? This means that if you have two
calibration groups of different levels,
we can convert parameters/scores
with a simple y = mx + b
 (Linking)
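One common way to find the m and b of that linear transformation is the mean/sigma method, sketched below. The slides do not say which linking method is intended, so treat this as an assumption; the anchor-item difficulties and function names are hypothetical.

```python
import statistics

def mean_sigma_constants(b_new, b_base):
    """Slope A and intercept B that place new-scale b values on the base scale."""
    A = statistics.pstdev(b_base) / statistics.pstdev(b_new)
    B = statistics.mean(b_base) - A * statistics.mean(b_new)
    return A, B

def transform_item(a, b, c, A, B):
    """Rescale one item's parameters: b is stretched and shifted, a is divided by A."""
    return a / A, A * b + B, c

# Hypothetical anchor items calibrated in two groups of different ability.
b_new = [-1.0, 0.0, 1.0]
b_base = [-0.5, 1.0, 2.5]

A, B = mean_sigma_constants(b_new, b_base)
a_star, b_star, c_star = transform_item(1.2, 0.0, 0.2, A, B)
print(A, B, (a_star, b_star, c_star))
```

Examinee scores transform the same way as b (θ* = Aθ + B), while c is unchanged because it is not on the θ scale.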

Advantages of IRT
 Items and people are on the same
scale
 Easier to interpret, and allows
adaptive testing

Advantages of IRT
 Information provides an important
tool for test building and bank
development
 Better match the purposes of a test
 IRT CSEM allows far better
description of precision

Advantages of IRT
 More precise scores
 CTT number correct scoring is limited
to k + 1 scores
 3PL has 2^k scores
 Compare with 10 items:
◦ 11 vs 1024 possible scores
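The counts can be verified by enumeration. The sketch below counts the number-correct scores and the distinct right/wrong response patterns for k = 10 items; each pattern can in principle produce a different θ estimate under the 3PL, which is where the 2^k figure comes from.

```python
from itertools import product

k = 10  # number of dichotomous items

number_correct_scores = len(range(0, k + 1))                    # 0, 1, ..., k
response_patterns = sum(1 for _ in product((0, 1), repeat=k))   # every right/wrong pattern

print(number_correct_scores, response_patterns)
```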

Advantages of IRT
 Scores take item difficulty into
account
 Allows direct comparison of
examinees that saw different sets of
items
 Scores also account for guessing

Advantages of IRT
 Nonlinear IRF – the linear IRF
assumed by CTT is impossible
 Allows for different SEM for every
examinee
 Not realistic to assume they are all
the same

Disadvantages of IRT
 Sample size
 CTT: 50 is OK, 100 is great
◦ It is much easier to fit a straight line
“model” than an IRF because it is an
oversimplification
 IRT: 100 is bare minimum for 1PL
◦ 3PL? ~500
◦ Puts it out of reach of small testing
programs

 No “native” distractor analysis unless
polytomous models
 Can adapt the CTT idea of
quantile/distractor plot with IRT
◦ IRT programs will also give you option P
and rpbis

 Complexity
◦ Not only do you have to understand it
yourself, but…
◦ You also have to explain it to
stakeholders!

 However, note that these are not big
problems
◦ Many places have plenty of sample size
◦ You can still use CTT for distractor
analysis (always use both!!!!)
◦ The complexity is not too bad unless
using complex models
◦ Often, the biggest issue is the
stakeholders!

IRT Analysis
How do I go about doing this?

IRT Analysis
 Xcalibre 4 for IRT
 CTT analysis with Iteman 4 (not
necessary, but sometimes helps)
 Also:
◦ Scoring and graphing tool
◦ Form building tool
◦ Empirical IRFs in Excel
◦ Have we covered these sufficiently?

IRT Analysis
 I’m assuming here we are analyzing
just one sample of one test
 What would I look for? Basic…
◦ Items with good parameters (keep/clone)
◦ Items with bad parameters (retire)
 Evaluate their CTT option statistics
◦ TIF/CSEM – meet our needs? (not
good/bad in absolute sense)

IRT Analysis
 What would I look for? Advanced…
◦ Dimensionality assessment (reliability,
any items/sections “off on their own”)
◦ Item fit (also dimensionality, and possible
item issues)
◦ Test sections – any stand out for being
hard, easy, low discriminations, poor
precision, etc?
◦ CSEM/TIF for sections: anything under-
measured?

IRT Analysis
 What would I look for? Advanced…
◦ Finally: what do you want to see in the
data, and how will the test be used?
 Later, we’ll talk about more
advanced uses like:
◦ Linking and equating multiple forms
◦ Test assembly
◦ Adaptive testing
◦ Dimensionality evaluation

Iteman 4.1
 Performs comprehensive classical
analysis
 Quantile plots allow broad evaluation
of IRF shape
 Advantages:
◦ Easily understandable – can use with SMEs
◦ Includes distractors

Xcalibre 4.1
 Provides a comprehensive and user-
friendly IRT analysis
 Allows evaluation of individual items
and test as a whole
 All major graphs
 Many summary graphs (freqs etc.)
 Classical analysis too

Reasons for Xcalibre 4.1
 Currently available software (Parscale,
Bilog, Multilog, ConQuest, WinSteps,
ICL) still requires programming skills
 Some still run on DOS!
 If IRT is to be more widely used, it
needs a user-friendly system
◦ Input and output

 Better input
◦ Yes: Point and click buttons
◦ No: DOS programming quasi-language
 Better output
◦ Yes: Word docs (RTF), spreadsheets (CSV)
◦ No: DOS txt files with ugly tables

 Advanced users with programming
skills and need for customized analysis
can still utilize previous software
 Xcalibre 4.1 is designed for a wider
range of users
 The following description is of Xcalibre
4, but also applies to Iteman 4

Xcalibre 4.1 Interface
 Divided into tabs
 Move left to right…

Xcalibre 4.1 Interface
 All options are specified with buttons
or simple entry boxes
 No code based on keywords
◦ Best example: IRT models (you’ll see)
 Also: usable error messages

Specify files/input; choose options
 I’ll now show how to use X4, and do
some analysis of real data…
