Instead of the equipercentile method or the linear method of equating, a method that is based on IRT can be used. Equating with IRT has a large advantage over equating with the classical approach. With an IRT model that fits, the nonlinearities inherent in equating do not present a problem. IRT models can be used in horizontal equating as well as in vertical equating. In horizontal equating, different tests are meant for persons of similar abilities; equating as discussed so far is horizontal equating. In vertical equating, tests are constructed for target groups of different ability levels. The difference in test difficulty is planned, but for score interpretation, scores should be brought to the same scale. It is still necessary that all items be relevant for all examinees. Equating is not achieved if younger examinees have not been exposed to material tested in the unique items of the higher-level test tailored to the ability of a group of older examinees (Petersen et al., 1989). It should also be clear that in vertical equating, tests are not equated in the sense that they may be used interchangeably after equating.
In principle, three equating approaches for two test forms X and Y sharing a common set of items are possible within the IRT context (Petersen, Cook, and Stocking, 1983):
A. Simultaneous scaling: The item parameters of both tests are estimated jointly in one analysis. For this approach, we need software that allows for incomplete data—each person has answered only a subset of all items.
B. The responses to tests X and Y are analyzed separately. In the analysis of the second test, the item parameters of the common items are fixed to their values obtained in the analysis of the first test. The scales of X and Y can be related to each other by means of the scale values of the common items.
C. The responses to tests X and Y are analyzed separately. The difference with approach B is that the parameter values of the common items are not fixed to their values obtained in the analysis of the first test. Again, the scales of tests X and Y can be related to each other by means of the scale values of the common items.
When approach A is chosen and MML is the estimation method, characteristics of the latent ability distributions involved should be allowed to differ. Alternative C seems easiest to implement. Let us consider this approach in the context of the three most popular IRT models: the Rasch model, the 2PL model, and the 3PL model.
11.7.1 The Rasch model
With the Rasch model, the third approach is very straightforward. We need the averages of the b parameters of the common items in test X and in test Y. Suppose that we have k common items with the following averages:
(11.10)  $\bar{b}_{X(c)} = \frac{1}{k}\sum_{i=1}^{k} b_{iX(c)}, \qquad \bar{b}_{Y(c)} = \frac{1}{k}\sum_{i=1}^{k} b_{iY(c)}$

The b parameters of both tests would be on a common scale if the average parameter value for the common items were equal for both tests. So, the b values and θ values of test Y can be brought onto the same scale as those of test X with the following transformation:

(11.11)  $b_i^* = b_i - \bar{b}_{Y(c)} + \bar{b}_{X(c)}, \qquad \theta^* = \theta - \bar{b}_{Y(c)} + \bar{b}_{X(c)}$
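As a small sketch of the transformation in Equation 11.11: the shift is simply the difference between the common-item averages. The difficulty estimates below are hypothetical, chosen only for illustration:

```python
import numpy as np

# Hypothetical Rasch difficulty estimates for k = 5 common items,
# obtained from separate analyses of forms X and Y.
b_common_X = np.array([-1.2, -0.4, 0.1, 0.8, 1.5])
b_common_Y = np.array([-0.9, -0.1, 0.4, 1.1, 1.8])

# Equation 11.11: shift the Y scale by the difference of the
# common-item averages so that both forms share the X scale.
shift = b_common_X.mean() - b_common_Y.mean()

def to_X_scale(value_on_Y_scale):
    """Transform a b or theta value from the Y scale to the X scale."""
    return value_on_Y_scale + shift

# All item and person parameters of form Y are transformed the same way.
b_Y_all = np.array([-1.5, -0.9, -0.1, 0.4, 1.1, 1.8, 2.2])
b_Y_on_X_scale = to_X_scale(b_Y_all)
```

After the transformation, the common items have the same average difficulty in both forms, which is exactly the condition the Rasch equating imposes.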
We do not have the parameter values of the difficulty parameters, but only estimated values. The estimated values are not equally accurate. So we might consider using weighted averages instead of the unweighted average in Equation 11.11. Such a method has been proposed by Linn, Levine, Hastings, and Wardrop (1981).
11.7.2 The 2PL model
In the 2PL model, equating is a bit more complicated because the parameters are defined on an interval scale. The common items can have different a values as well as different b values. Because of the interval character of the latent scale, b parameter values of the common items of test Y are linearly related to the values for test X:
(11.12)  $b_{i(X)} = d\,b_{i(Y)} + e$

and the values of the a parameters are related through

(11.13)  $a_{i(X)} = a_{i(Y)}/d$

The coefficients d and e must be obtained in order to bring the parameters of the common items, and consequently the parameters of all items, to the same scale.

The simplest solution is to find the transformation by which the average b value of the common items and the standard deviation of the b values of the common items are equal in both tests. This is the mean and sigma method. With this method, the value of d is

(11.14)  $d = s_{b_{X(c)}} / s_{b_{Y(c)}}$

that is, the ratio of the standard deviation of the common b values in test form X to the standard deviation of the common b values in test form Y, and the value of e is

(11.15)  $e = \bar{b}_{X(c)} - d\,\bar{b}_{Y(c)}$

A robust alternative is the previously mentioned weighted method.
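The mean and sigma method can be sketched as follows; the 2PL estimates for the common items are hypothetical:

```python
import numpy as np

# Hypothetical 2PL difficulty estimates for the common items
# from the separate analyses of forms X and Y.
b_common_X = np.array([-1.0, -0.2, 0.5, 1.3])
b_common_Y = np.array([-0.6, 0.3, 1.0, 2.1])

# Mean and sigma method (Equations 11.14 and 11.15).
d = b_common_X.std(ddof=1) / b_common_Y.std(ddof=1)
e = b_common_X.mean() - d * b_common_Y.mean()

# Equations 11.12 and 11.13: place all form-Y parameters on the X scale.
def b_to_X(b_Y):
    return d * b_Y + e

def a_to_X(a_Y):
    return a_Y / d
```

By construction, the transformed common-item b values have the same mean and standard deviation as their counterparts in form X.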
We have two sets of parameter estimates for the common items.
One set is computed along with the other item parameters in test X.
The other set is computed along with the other item parameters in test Y. We also can compute two test characteristic curves—the sums of the ICCs of the items in the tests—for the subset of common items.
After test equating, these two test characteristic curves should be similar. In the characteristic curve methods (Haebara, 1980; Stocking and Lord, 1983), coefficients d and e are obtained for which these test characteristic curves are as similar as possible.
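A characteristic curve criterion of this kind can be sketched by minimizing the squared difference between the two test characteristic curves over a grid of θ values. The item parameters below are hypothetical, and the loss function is one simple instance of a Stocking–Lord-type criterion, not a definitive implementation:

```python
import numpy as np
from scipy.optimize import minimize

def icc_2pl(theta, a, b):
    """2PL item characteristic curve."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def tcc(theta, a, b):
    """Test characteristic curve: the sum of the ICCs of the items."""
    return sum(icc_2pl(theta, ai, bi) for ai, bi in zip(a, b))

# Hypothetical common-item estimates from the two separate analyses.
a_X = np.array([1.0, 1.4, 0.8]); b_X = np.array([-0.5, 0.2, 1.0])
a_Y = np.array([1.1, 1.3, 0.9]); b_Y = np.array([-0.2, 0.6, 1.3])

theta_grid = np.linspace(-4, 4, 81)

def loss(coef):
    d, e = coef
    # Transform the Y estimates (Equations 11.12 and 11.13) and
    # compare the two common-item test characteristic curves.
    return np.sum((tcc(theta_grid, a_X, b_X)
                   - tcc(theta_grid, a_Y / d, d * b_Y + e)) ** 2)

result = minimize(loss, x0=[1.0, 0.0], method="Nelder-Mead")
d_hat, e_hat = result.x
```

The coefficients d and e found this way make the two test characteristic curves for the common items as similar as possible in the least-squares sense.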
11.7.3 The 3PL model
The 3PL model is also defined on an interval scale, but the presence of a pseudo-chance-level parameter c complicates the equating of tests.
When we analyze two tests X and Y separately, the estimated c of a common item can have one value in test Y and another in test X. This difference is related to differences in the other parameter estimates of the particular item. The errors in the pseudo-chance-level parameters can have a disturbing effect on the relationship between the b parameters and the a parameters of the common items. In other words, we may expect a disturbing influence on the linear relationship between the item difficulties in test X and those in test Y. Equating tests X and Y is not simply achieved by using a linear transformation for the b values of the common items. With the 3PL model, more steps are needed. In a preliminary analysis, we obtain estimates of the parameters c. For a common item, one value for c is chosen on the basis of the two different values obtained in the analyses of tests X and Y. The chosen value can be the average of the two estimates. In the final analysis, the c parameter of a common item is fixed to this value for both tests.
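The preliminary step for the c parameters is simple; with hypothetical estimates for three common items, averaging looks like this:

```python
import numpy as np

# Hypothetical pseudo-chance-level estimates for three common items,
# from the separate preliminary analyses of forms X and Y.
c_from_X = np.array([0.18, 0.22, 0.15])
c_from_Y = np.array([0.14, 0.26, 0.19])

# One value per common item: here, the average of the two estimates.
# In the final analyses of both forms, the c parameter of each
# common item is fixed to this value.
c_fixed = (c_from_X + c_from_Y) / 2.0
```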
After scaling the two test forms on a common latent scale, the relation between the true scores of both test forms can be computed.
For each value of θ, the corresponding true score of test form X and the corresponding true score of test form Y can be computed:
(11.16)  $\tau_X = G(\theta) = \sum_{i \in X} P_i(\theta)$

and

(11.17)  $\tau_Y = H(\theta) = \sum_{i \in Y} P_i(\theta)$
The two true scores corresponding to the same θ are equated with the following formula:

(11.18)  $\tau_X = G(H^{-1}(\tau_Y))$

that is, we take the true score on test form Y, compute the corresponding value of θ, and, next, compute the true score on form X for this value of θ. True-score equating does not work in the 3PL model for equating observed scores below the chance level. One obvious procedure to obtain the relation between the tests below the pseudo-chance level is to use (linear) interpolation. Lord (1980) suggested an alternative, a raw-score adaptation of the IRT-equating method. In this procedure, the distribution of θ is estimated for some group. Given this distribution, the marginal distributions of x and y can be estimated. Next, X and Y can be equated through equipercentile equating. The outcome depends to some extent on the group.
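A minimal sketch of this true-score equating under the 3PL model, with hypothetical item parameters assumed to be on a common scale already; the inverse of H is obtained numerically with a root finder, which works only for true scores above the chance level of form Y:

```python
import numpy as np
from scipy.optimize import brentq

def icc_3pl(theta, a, b, c):
    """3PL item characteristic curve."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

# Hypothetical (a, b, c) parameters for the two forms.
items_X = [(1.2, -0.5, 0.20), (0.9, 0.3, 0.15), (1.5, 1.0, 0.20)]
items_Y = [(1.0, -0.2, 0.18), (1.1, 0.5, 0.20), (0.8, 1.2, 0.25)]

def true_score(theta, items):
    """Equations 11.16 and 11.17: the sum of the ICCs of the items."""
    return sum(icc_3pl(theta, a, b, c) for a, b, c in items)

def equate_Y_to_X(tau_Y):
    """Equation 11.18: solve H(theta) = tau_Y, then return G(theta).

    Valid only for tau_Y above the sum of the c parameters of form Y
    (the chance level); below that, interpolation would be needed.
    """
    theta = brentq(lambda t: true_score(t, items_Y) - tau_Y, -8.0, 8.0)
    return true_score(theta, items_X)
```

Because the test characteristic curve H is strictly increasing in θ, the root is unique and the bracketing interval only needs to be wide enough to contain it.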
11.7.4 Other models
The equating methodology can be extended to the linking of tests with polytomous items. Cohen and Kim (1998) present an overview of linking methods under the graded response model. This model is sometimes used in connection with the judgment by raters of constructed responses. The fact that judges play a role complicates the linking process, for it is by no means certain that judges have, for example, a stable year-to-year severity of judgment (Ercika et al., 1998; Tate, 1999).