More on the estimation of item parameters

$$L_{CML} = \mathrm{Prob}(\mathbf{x}_1, \ldots, \mathbf{x}_N \mid s_1, \ldots, s_N) = \frac{\mathrm{Prob}(\mathbf{x}_1, \ldots, \mathbf{x}_N, s_1, \ldots, s_N)}{\mathrm{Prob}(s_1, \ldots, s_N)} = \prod_{t=1}^{N} \frac{\prod_{i=1}^{n} \varepsilon_i^{x_{ti}}}{\gamma_{s_t}}$$

$$\sum_{t=1}^{N} x_{ti} = \sum_{t=1}^{N} P(x_{ti} = 1 \mid s_t) = \sum_{t=1}^{N} \frac{\varepsilon_i \, \gamma^{(i)}_{s_t - 1}}{\gamma_{s_t}}$$

Special attention must be given to the presence of missing values when item parameters are estimated. There are two cases to consider: data can be missing by design or not. Data are missing by design when, for example, the number of items is too large to present all items to an examinee. Then subtests can be administered to different examinee groups. The subtests must have common items in order to obtain item and person parameter estimates on a common scale (see Chapter 11).

All maximum likelihood methods can be generalized to incorporate values missing by design. For the MML approach, the consequence is that latent distributions must be defined for multiple groups. Data can also be missing because items are skipped. In achievement testing, skipped items are sometimes treated as incorrect responses. Another, more adequate approach to skipped items in the multiple-choice format is suggested by Lord (1980). When the presence of missing values correlates with latent ability, data are not missing at random (Little and Rubin, 1987), and MML estimation of item parameters is affected. This is the case when the test is speeded, that is, when some examinees do not reach the items at the end of the test, or when the time limit on the test stimulates strategic answering behavior and examinees rapidly guess on (more difficult) items (Wang and Zhang, 2006; Wise and DeMars, 2006). In that case, blind application of, for example, the 3PL model is inadequate.
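When data are missing by design, and the missingness is therefore unrelated to ability, a person's likelihood simply omits the items that person never saw. The sketch below illustrates this for a joint 2PL log-likelihood with abilities treated as known. It assumes the logistic 2PL without the scaling constant D, and it encodes a not-administered item as `None`; the function name and this encoding are hypothetical choices for illustration, not part of any particular program. An MML implementation would instead integrate over group-specific latent distributions, but the principle of skipping unobserved cells is the same.

```python
import math


def log_likelihood_2pl(responses, thetas, a, b):
    """Joint 2PL log-likelihood that skips responses missing by design.

    responses[t][i] is 1 (correct), 0 (incorrect), or None when item i was
    not administered to person t. Assumes the logistic 2PL without the
    scaling constant D and ignorable (design-based) missingness.
    """
    total = 0.0
    for t, row in enumerate(responses):
        for i, x in enumerate(row):
            if x is None:  # not administered: contributes no factor to the likelihood
                continue
            p = 1.0 / (1.0 + math.exp(-a[i] * (thetas[t] - b[i])))
            total += math.log(p) if x == 1 else math.log(1.0 - p)
    return total


# Two subtests linked by a common item (item 1):
# group 1 took items 0 and 1, group 2 took items 1 and 2.
responses = [
    [1, 0, None],
    [1, 1, None],
    [None, 1, 0],
    [None, 0, 1],
]
thetas = [0.0, 1.0, -0.5, 0.5]
a = [1.0, 1.2, 0.8]
b = [-0.5, 0.0, 0.5]
print(log_likelihood_2pl(responses, thetas, a, b))
```

The common item is what ties the two groups to one scale: it is the only item whose parameters enter the likelihood of both groups.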

For accurate estimation of a difficulty parameter, it is important that the group of persons that took the test has an average ability level comparable to the item difficulty. In the 2PL model and the 3PL model, discrimination parameters must be estimated. These parameters define the slopes of the ICCs. Information on the steepness of a slope is available only when the latent abilities are reasonably well spread. The Rasch model does not have a discrimination parameter.

In the Rasch model, item parameter estimation can be accurate even if all persons have the same ability. This advantage is of limited value, however: if all abilities are equal, there is no way of discriminating between alternative models. The estimation of c in the 3PL model is more accurate when the sample contains more examinees of relatively low ability. Inaccurate estimation of the pseudo-chance-level parameter also affects the estimation of the discrimination and difficulty parameters, because the estimates of the item parameters are correlated.

For known abilities, the information matrix of the item parameters, that is, the inverse of the matrix of error variances and covariances of the item parameter estimates, is given in Lord (1980).
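The role of ability spread for the discrimination parameter, and the correlation between item parameter estimates, can both be seen in this information matrix. The following is a minimal sketch for a single item under an assumed logistic 2PL, P(θ) = 1/(1 + exp(−a(θ − b))), without the scaling constant D and with abilities treated as known; the function names are hypothetical. Each examinee contributes P(1 − P) times the outer product of the derivatives of the logit with respect to a and b, and inverting the 2 × 2 result gives the approximate sampling covariance of (â, b̂).

```python
import math


def item_information_2pl(thetas, a, b):
    """2x2 Fisher information matrix for (a, b) of one 2PL item, abilities known.

    Assumes P(theta) = 1 / (1 + exp(-a (theta - b))), no scaling constant D.
    """
    i_aa = i_ab = i_bb = 0.0
    for theta in thetas:
        p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
        w = p * (1.0 - p)             # Bernoulli variance at this ability
        i_aa += w * (theta - b) ** 2  # from d log L / d a = (x - P)(theta - b)
        i_ab += -w * a * (theta - b)  # cross term of the two score functions
        i_bb += w * a ** 2            # from d log L / d b = -(x - P) a
    return [[i_aa, i_ab], [i_ab, i_bb]]


def invert_2x2(m):
    """Inverse of a 2x2 matrix; here the asymptotic covariance of (a_hat, b_hat)."""
    (m00, m01), (m10, m11) = m
    det = m00 * m11 - m01 * m10
    return [[m11 / det, -m01 / det], [-m10 / det, m00 / det]]


# Same item (a = 1, b = 0), 100 examinees in each sample:
spread = [-2.0 + 4.0 * k / 99 for k in range(100)]  # abilities well spread
narrow = [-0.1 + 0.2 * k / 99 for k in range(100)]  # abilities nearly equal
cov_spread = invert_2x2(item_information_2pl(spread, 1.0, 0.0))
cov_narrow = invert_2x2(item_information_2pl(narrow, 1.0, 0.0))
print("var(a_hat), spread sample:", cov_spread[0][0])
print("var(a_hat), narrow sample:", cov_narrow[0][0])
```

With the narrow sample, the (θ − b)² terms are tiny, so the information on a nearly vanishes and its sampling variance becomes large. When the abilities are not centered on b, the off-diagonal term is nonzero and the estimates of a and b are correlated.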

The CML estimation procedure for the Rasch model has a clear statistical advantage over the other estimation procedures, as discussed in the previous section. Software has been developed for CML estimation of item parameters in the Rasch model; CML estimation can also be done with a special kind of log-linear analysis (Heinen, 1996; Kelderman, 1984). CML is, however, computationally demanding. This problem might be avoided by using MML, in which the characteristics of the population distribution are estimated along with the item parameters (De Leeuw and Verhelst, 1999). There is another disadvantage of using software dedicated to the Rasch model: if the Rasch model does not fit the data very well, one could consider fitting other models, and in that case other software is needed. The Rasch model can be viewed as a submodel of the 2PL model and the 3PL model. There is much to be said for using the same software to compute item parameter estimates for alternative models and to compare the outcomes of these models.
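To see where the computational cost of CML comes from, here is a minimal sketch for the Rasch model, assuming a parameterization in which each item has an easiness parameter ε_i (for example ε_i = exp(−b_i); this choice is an assumption). Conditional on the total score s_t, a response pattern has probability ∏ ε_i^{x_ti} / γ_{s_t}, where γ_s is the elementary symmetric function of order s of the ε values. The sketch computes all γ_s with a standard summation recursion that adds one item at a time. An estimation routine must recompute these, together with the versions γ^{(i)} that leave item i out, for every item at every iteration. All function names are hypothetical.

```python
import math


def elementary_symmetric(eps):
    """gamma[s] = sum, over all response patterns with total score s,
    of the product of eps_i over the items answered correctly.

    Summation recursion: add one item at a time, O(n^2) operations.
    """
    gamma = [1.0]  # no items yet: only score 0, with gamma_0 = 1
    for e in eps:
        new = gamma + [0.0]  # new item answered incorrectly: score unchanged
        for s in range(1, len(new)):
            new[s] += e * gamma[s - 1]  # new item answered correctly: score + 1
        gamma = new
    return gamma


def cml_log_likelihood(responses, eps):
    """log L_CML = sum over persons of [sum_i x_ti log eps_i - log gamma_{s_t}]."""
    gamma = elementary_symmetric(eps)
    ll = 0.0
    for x in responses:
        s = sum(x)
        ll += sum(xi * math.log(e) for xi, e in zip(x, eps)) - math.log(gamma[s])
    return ll


def prob_correct_given_score(i, s, eps):
    """P(x_i = 1 | s) = eps_i * gamma^{(i)}_{s-1} / gamma_s, where gamma^{(i)}
    is computed with item i left out."""
    if s == 0:
        return 0.0
    gamma = elementary_symmetric(eps)
    gamma_without_i = elementary_symmetric(eps[:i] + eps[i + 1:])
    return eps[i] * gamma_without_i[s - 1] / gamma[s]


eps = [0.5, 1.0, 2.0]  # hypothetical item easiness values
print(elementary_symmetric(eps))  # [1.0, 3.5, 3.5, 1.0]
print(cml_log_likelihood([[1, 0, 0], [0, 1, 1]], eps))
```

Persons with a score of 0 or n contribute nothing to L_CML: for them the conditional pattern probability equals 1 whatever the item parameters, which is why such persons are usually removed before CML estimation.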

Commercial software is available for the analysis of item responses, both for well-known item response models such as the IRT models for dichotomous data discussed here and for other models. Information on software can be found in books and articles that describe applications of, or research with, the software, from software houses, and in the software review sections of journals such as Applied Psychological Measurement. Embretson and Reise (2000), who introduce many IRT models, discuss a selection of the commercially available computer programs:

TESTFACT (Wilson, Wood, and Gibbons, 1991; www.ssicentral.com) for the full-information factor analysis of dichotomous data with the two- and three-parameter normal ogive models (with fixed values of c)

BILOG (Mislevy and Bock, 1990) for the estimation of the 1PL, 2PL, and 3PL models, and BILOG-MG (Zimowski, Muraki, Mislevy, and Bock, 1996; www.ssicentral.com) for the analysis of multiple groups (BILOG is no longer available; see BILOG-MG3 of Assessment Systems Corporation)

MULTILOG (Thissen, 1991; www.ssicentral.com) and PARSCALE (Muraki and Bock, 1997; www.ssicentral.com) for dichotomous as well as polytomous items

XCALIBRE (Assessment Systems Corporation, 1996; www.assess.com) for the estimation of parameters in the 2PL and 3PL models

RUMM (Andrich, Sheridan, and Luo, 2000; www.rummlab.com) for the estimation of parameters of various Rasch models

The authors note that no final review of software is possible, because programs have been revised and will continue to be revised. They also note that different programs may unexpectedly produce different results even though the model specifications are identical.

More comparative studies of IRT programs, and of possible flaws in certain programs, are therefore needed. It is to be hoped that such studies lead to improvements in IRT software.

For WINSTEPS and information on other software packages for Rasch analyses, see www.winsteps.com.
