Part of Statistical Test Theory for the Behavioral Sciences (pp. 208-211)

Chapter 10 Applications of Item Response Theory

10.6 Computerized adaptive testing (CAT)

Tests can be computer administered. A wide variety of item formats is available in computer-based tests, both items with a forced response format and items with an open-response format. Computerized testing makes it possible for different examinees to take a test on different occasions. A different test can be composed for each examinee, among other things to avoid the risk that items become known. Tests frequently are composed using a stratified random selection procedure. In that case, results can be analyzed with generalizability theory and, when items are scored dichotomously, with the approaches discussed in Chapter 6.
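A stratified random selection procedure of the kind mentioned above can be sketched in a few lines. This is an illustrative sketch only: the item bank, stratum names, and function name are invented for the example, and a real assembly procedure would add content and statistical constraints.

```python
# Hypothetical sketch of stratified random test assembly: item IDs are
# grouped by content stratum, and the same number of items is drawn at
# random from each stratum for every examinee. All names are illustrative.
import random

def assemble_test(item_bank, items_per_stratum, seed=None):
    """item_bank: dict mapping stratum name -> list of item IDs."""
    rng = random.Random(seed)
    test = []
    for stratum, items in item_bank.items():
        test.extend(rng.sample(items, items_per_stratum))
    return test

bank = {
    "algebra":  ["A1", "A2", "A3", "A4"],
    "geometry": ["G1", "G2", "G3", "G4"],
}
form = assemble_test(bank, items_per_stratum=2, seed=1)
print(form)  # four item IDs, two drawn from each stratum
```

Because each examinee receives an independent random draw per stratum, forms differ across examinees while content coverage stays balanced, which is what makes the generalizability-theory analysis mentioned above applicable.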

With computerized testing more is possible. It is possible, for exam- ple, to use a sequential testing design. One example of such an approach is the closed sequential procedure mentioned in Chapter 6.

With item response theory, computerized testing can be made even more flexible. First, consider a traditional test. Such a test is meant for measurements in a population of persons—the target population.

No test can be equally accurate for all persons from the target population. With computerized adaptive testing, however, each person can be administered a test in such a way that the test score is as accurate as possible.

If we knew the ability of a person, we could administer a test tailored to that ability level. However, we do not know the ability level of a person; if we did, there would be no need for testing. Using item response theory, a testing strategy can be followed in which a person's ability is estimated step by step. At each consecutive step, the estimate becomes more precise. The choice of the item or subset of items at each step is tailored to the ability estimated at the previous step. This requires items for which item parameters have already been estimated. All items are stored in an item bank, and for this large set of items the IRT item parameter estimates are known.

More technically, for the administration of the first item, we can start with the not unreasonable assumption that the person to be tested is randomly chosen from the target population. The population distribution can be regarded as the prior distribution for the ability of this person. After each response, we can compute the posterior distribution of θ from the prior distribution g(θ) and the likelihood of all responses L(x|θ) (cf. Equation 9.29). This posterior can be used as the new prior distribution for the ability of the person, and we choose a new item that is optimal with respect to this prior. We might, for example, after a response to an item, compute the posterior mean, the EAP estimate, and select the item that has the highest item information at the level of the EAP estimate. After a correct response, the estimated ability is higher than after an incorrect response; therefore, a more difficult item is administered after a correct response than after an incorrect one. We might stop when the error variance is smaller than some criterion. When the EAP estimate is used, the relevant error variance is the posterior variance of θ (Bock and Mislevy, 1982).
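The loop just described can be simulated in a short script. This is a minimal sketch under invented assumptions: a 2PL model, a randomly generated item bank, a simulated examinee, and a posterior approximated on a grid of θ values rather than computed analytically; none of these details come from the text.

```python
import numpy as np

# Illustrative simulation of the CAT loop described above, under the 2PL
# model. Item bank, parameters, and the simulated examinee are invented
# for this sketch; the posterior is approximated on a theta grid.

theta = np.linspace(-4, 4, 161)                 # grid of ability values
prior = np.exp(-0.5 * theta**2)                 # N(0, 1) prior, unnormalized

def p_correct(a, b, th):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-a * (th - b)))

def item_info(a, b, th):
    """2PL Fisher information: a^2 * P * (1 - P)."""
    p = p_correct(a, b, th)
    return a**2 * p * (1.0 - p)

def eap_and_var(post):
    """Posterior mean (EAP) and posterior variance on the grid."""
    w = post / post.sum()
    eap = float(np.sum(w * theta))
    return eap, float(np.sum(w * (theta - eap) ** 2))

rng = np.random.default_rng(0)
bank = [(rng.uniform(0.8, 2.0), rng.uniform(-2.5, 2.5)) for _ in range(200)]
true_theta = 1.0                                # simulated examinee

post = prior.copy()
used = set()
for _ in range(30):                             # upper bound on test length
    eap, pvar = eap_and_var(post)
    if pvar < 0.1:                              # stop when accurate enough
        break
    # select the unused item with maximum information at the EAP estimate
    j = max((k for k in range(len(bank)) if k not in used),
            key=lambda k: item_info(*bank[k], eap))
    used.add(j)
    a, b = bank[j]
    x = rng.random() < p_correct(a, b, true_theta)   # simulated response
    likelihood = p_correct(a, b, theta) if x else 1.0 - p_correct(a, b, theta)
    post = post * likelihood                    # posterior becomes new prior

eap, pvar = eap_and_var(post)
print(f"EAP = {eap:.2f}, posterior variance = {pvar:.3f}, items = {len(used)}")
```

Note how the two ingredients of the procedure appear explicitly: item selection maximizes information at the current EAP estimate, and the stopping rule compares the posterior variance of θ with a criterion, as in Bock and Mislevy (1982).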

This CAT procedure is illustrated in Figure 10.3. For practical reasons, another stopping rule is frequently used: the test length of the CAT procedure is fixed. Test length is variable, however, in applications where a decision about mastery must be made (i.e., when it must be decided whether an examinee has an ability level equal to or higher than a minimum proficiency level θc) (Chang, 2005).

Sometimes it is profitable to redefine the unit of presentation in CAT and to group items into testlets. One argument for grouping could be that several items are based on the same subject or the same text, but there might be other reasons for grouping items as well (Wainer and Kiely, 1987). With a redefinition of the unit of presentation, a different choice of item response model might be in order (Li, Bolt, and Fu, 2006; Wainer and Wang, 2000).

Another approach to CAT is exemplified by the ALEKS software, used for learning in highly structured knowledge domains such as basic math (www.aleks.com). An early psychometric contribution to this approach is Falmagne (1989). Another contribution to flexible testing has been made by Williamson, Almond, Mislevy, and Levy (2006).

Computerized adaptive testing can be very efficient in comparison to traditional testing (Sands, Waters, and McBride, 1997; Van der Linden and Glas, 2000). With a relatively short test length, we already obtain a highly accurate ability estimate. This also removes the objection to the use of a prior distribution: with an accurate test, the weight of the prior in the final ability estimate is very small. CAT can also be used in connection with multidimensional traits. Li and Schafer (2005) discuss multidimensional CAT in which the items have a simple structure (i.e., each item loads on only one of the latent trait dimensions).

In practice, concessions have to be made to make CAT feasible. If we used only the items with maximum information given the estimated ability, a limited number of items from a large item pool would be used frequently and other items would never be used. Several methods have been proposed to deal with this exposure problem (Revuelta and Ponsoda, 1998). Van der Linden and Veldkamp (2004) used the concept of shadow tests for constraining item exposure.
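One simple member of the family of exposure-control methods reviewed by Revuelta and Ponsoda (1998) picks at random among the few most informative items rather than always administering the single best one. The sketch below is illustrative only; the function name and item values are invented, and the shadow-test approach of Van der Linden and Veldkamp (2004) is considerably more elaborate.

```python
import random

# Illustrative "randomesque" exposure-control rule: instead of always
# administering the single most informative item, choose at random among
# the k most informative unused items. Names and values are made up.

def select_item(infos, used, k=5, rng=random):
    """infos: item informations at the current ability estimate."""
    candidates = [i for i in range(len(infos)) if i not in used]
    candidates.sort(key=lambda i: infos[i], reverse=True)
    return rng.choice(candidates[:k])

infos = [0.2, 0.9, 0.5, 0.8, 0.1, 0.7]
pick = select_item(infos, used={1}, k=3)
print(pick)  # one of the three most informative unused items: 3, 5, or 2
```

The cost of this rule is a small loss of information per item; the benefit is that exposure is spread over more of the pool, reducing the risk that the best items become known.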

When CAT is to be introduced, a few aspects of testing with CAT must be attended to. Answering a CAT test is different from answering a traditional test: the items must be answered consecutively, and skipping items is not allowed. Therefore, it is sound practice to study the validity of the test procedure. We should be alert to the possibility that the validity of the test changes with a change in procedure.

The interest in CAT is growing tremendously, especially because of its prospects in educational assessment. The future of testing will be determined, among other factors, by CAT (see, e.g., Standards, APA et al., 1999).

Figure 10.3 Flowchart of computerized adaptive testing (CAT) with a stopping rule based on estimation accuracy. [Flowchart: Start → Starting Value for Ability Estimate → Select Optimal Item → Present Item → Update Ability Estimate → Estimate Accurate Enough? If No, return to Select Optimal Item; if Yes, Stop.]

In 1995 the American Council on Education published the Guidelines for Computer-Adaptive Test Development and Use in Education.
