Classical Test Theory and Reliability

From Statistical Test Theory for the Behavioral Sciences (pages 32-42)


3.1 Introduction

Classical test theory gives the foundations of the basic true-score model, as discussed in Chapter 2. In this chapter, we will first go into some properties of the classical true-score model and define the basic concepts of reliability and standard error of measurement (Section 3.2). Then the concept of parallel tests will be discussed. Reliability estimation will be considered in the context of parallel tests (Section 3.3). Defining the reliability of measurement instruments is theoretically straightforward; estimating reliability, on the other hand, requires explicitly taking into account the major sources of error variance. In Chapter 4, the most important reliability estimation procedures will be discussed more extensively.

The reliability of tests is influenced by, among other things, test length (i.e., the number of parts or items in the test) and the homogeneity of the group of subjects to whom the test is administered. This is the subject of Sections 3.4 and 3.5. Section 3.6 is concerned with the estimation of subjects' true scores. Finally, we could ask ourselves what the correlation between two variables X and Y would be "ideally" (i.e., when errors of measurement affect neither variable). In Section 3.7 the correction for attenuation is presented.

3.2 The definition of reliability and the standard error of measurement

An important development in the context of the classical true-score model is the concept of reliability. Starting from the variances and covariances of the components of the classical model, the concept of reliability can be defined directly. First, consider the covariance between observed scores and true scores. The covariance between


observed and true scores, using the basic assumptions of the classical model discussed in Chapter 2, is as follows:

σXT = σ(T + E, T) = σ²T + σ(T, E) = σ²T.

Now the formula for the correlation between true scores and observed scores can be derived as

ρXT = σXT/(σX σT) = σ²T/(σX σT) = σT/σX,

the quantity also known as the reliability index. The reliability of a test is defined as the squared correlation between true scores and observed scores, which is equal to the ratio of true-score variance to observed-score variance:

ρ²XT = σ²T/σ²X = σ²T/(σ²T + σ²E). (3.1)

The reliability indicates to what extent observed-score differences reflect true-score differences. In many test applications, it is important to be able to discriminate between persons, and a high test reliability is a prerequisite. A measurement instrument that is reliable in a particular population of persons is not necessarily reliable in another population. From Equation 3.1, it is clear that the size of the test reliability is population dependent. In a population with relatively small true-score differences, reliability is necessarily relatively low.

Estimation of test reliability has always been one of the important issues in test theory. We will discuss reliability estimation extensively in the next chapter. For the moment, we assume that reliability is known. Now we can define the concept of standard error of measurement. We derive the following from Equation 3.1:

σ²T = ρ²XT σ²X (3.2)

and

σ²E = σ²X − ρ²XT σ²X.


The standard error of measurement is defined as

σE = σX √(1 − ρ²XT). (3.3)

The reliability coefficient of a test and the standard error of measurement are essential characteristics (cf. Standards, APA, AERA, and NCME, 1999, Chapter 2). From the theoretical definition of reliability (Equation 3.1), and taking into account that variances cannot be negative, the upper and lower limits of the reliability coefficient can easily be derived as

0 ≤ ρ²XT ≤ 1.0;

ρ²XT = 0 if all observed-score variance equals error variance. If no errors of measurement occur, observed-score variance is equal to true-score variance and the measurement instrument is perfectly reliable (assuming that there is true-score variation).

The observed-score variance is population or sample dependent, as is the reliability coefficient. Reporting only the reliability coefficient of a test is insufficient—the standard error of measurement must also be reported.
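The definitions above can be checked numerically. A minimal sketch (all variance components are invented for illustration) computing the reliability (Equation 3.1) and the standard error of measurement (Equation 3.3):

```python
import math

# Invented variance components for one population
var_true = 16.0    # sigma_T^2
var_error = 9.0    # sigma_E^2
var_observed = var_true + var_error       # sigma_X^2 = sigma_T^2 + sigma_E^2

reliability = var_true / var_observed     # Eq. 3.1: 16/25 = 0.64

# Eq. 3.3: sigma_E = sigma_X * sqrt(1 - reliability)
sem = math.sqrt(var_observed) * math.sqrt(1.0 - reliability)

print(reliability)      # 0.64
print(round(sem, 6))    # 3.0, which is sqrt(var_error)
```

Note that the standard error of measurement recovers √σ²E, as it should, since Equation 3.3 is just Equation 3.2 solved for the error variance.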

3.3 The definition of parallel tests

Generally speaking, parallel tests are completely interchangeable: they are perfectly equivalent. But how can equivalence be cast in statistical terms? Parallel tests are defined as tests that have identical true scores and identical person-specific error variances. Needless to say, parallel tests must measure the same construct or underlying trait.

For two parallel tests X and X′, we have, as defined,

τp = τ′p for all persons p from the population (3.4a)

and

σ²Ep = σ²E′p for all p. (3.4b)


Using the definition of parallel tests and the assumptions of the classical true-score model, we can now derive typical properties of two parallel tests X and X′:

μX = μX′, (3.5a)

σ²T = σ²T′, (3.5b)

σ²E = σ²E′, (3.5c)

σ²X = σ²X′, (3.5d)

and

ρXY = ρX′Y for all tests Y different from tests X and X′. (3.5e)

In other words, strictly parallel tests have equal means of observed scores; equal observed-score, true-score, and error-score variances; and equal correlations with any other test Y.

Now working out the correlation between two parallel tests X and X′, it follows that

ρXX′ = σTT′/(σX σX′) = σ²T/σ²X = ρ²XT. (3.6)

A second theoretical formulation of test reliability is that it is the correlation of a test with a parallel test. This result gives us a first way to estimate test reliability: correlate the test with a parallel test. A critical question with this method, however, is how to verify that a second test is parallel. Also, parallelism is not a well-defined property: a test might have different sets of parallel tests (Guttman, 1953; see also Exhibit 3.1). Further, if we do not have a parallel test, we must find another way to estimate reliability.

Exhibit 3.1 On parallelism and other types of equivalence

To be sure, a certain test may have different sets of parallel tests (Guttman, 1953). Does it matter, for all practical purposes, if a test has different sets of parallel forms? An investigator will always look for meaningfulness and interpretability of the measurement results. If certain parallel



forms do not suit the purpose of an investigator using a specific test, this investigator might well choose the most appropriate form of parallel test. Appropriateness may be checked against criteria relevant for the study at issue.

Parallel tests give rise to equal score means, equal observed-score and error variances, and equal correlations with a third test. Gulliksen (1950) mentions the Votaw-Wilks tests for this strict parallelism. These tests, among others, are also embedded in some computer programs for what is known as confirmatory factor analysis. "Among others" implies that other types of equivalence can also be tested statistically by confirmatory factor analysis.

3.4 Reliability and test length

In general, to obtain more precise measurements, more observations of the same kind have to be collected. If we want a precise measure of body weight, we could increase the number of observations: instead of one measurement, we could take ten measurements and take their mean. This mean is a more precise estimate of body weight than the result of a single measurement. This is what elementary statistics teaches us. If we have a measurement instrument for which two or more parallel tests are available, we might consider the possibility of combining them into one longer, more reliable test.

Assume that we have k parallel tests. The variance of the true scores on the test lengthened by a factor k is

var(kT) = k²σ²T.

Due to the fact that the errors are uncorrelated, the variance of the measurement errors of the lengthened test is

var(E1 + E2 + … + Ek) = kσ²E.

The variance of the measurement errors has a lower growth rate than the variance of true scores.

The reliability of the test lengthened by a factor k is

ρX(k)X′(k) = k²σ²T/(k²σ²T + kσ²E).


After dividing the numerator and denominator of the right-hand side by σ²X, we obtain

ρX(k)X′(k) = kρXX′/(1 + (k − 1)ρXX′). (3.7)

This is known as the general Spearman-Brown formula for the reliability of a lengthened test.

It should be noted that, as mentioned earlier, the lengthened test must measure the same characteristic, attribute, or trait as the initial test. That is to say, some form of parallelism is required between the added parts and the initial test. Adding a less-discriminating item might lower test reliability. For (partly) speeded tests, adding items to boost reliability has its specific problems. Lengthening a partly speeded multiple-choice test might also result in a lower reliability (Attali, 2005).
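Equation 3.7 is easy to apply directly. A small sketch (reliability values invented for illustration):

```python
def spearman_brown(rho, k):
    """Reliability of a test lengthened by a factor k (Eq. 3.7)."""
    return k * rho / (1.0 + (k - 1.0) * rho)

# Doubling a test with reliability .50 gives 2(.5)/(1 + .5)
print(spearman_brown(0.5, 2))  # 0.666...
# Returns diminish as the test gets longer
print(spearman_brown(0.5, 4))  # 0.8
print(spearman_brown(0.5, 8))
```

The diminishing returns follow from the formula itself: as k grows, the reliability approaches 1 but each additional part contributes less.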

3.5 Reliability and group homogeneity

A reliability coefficient depends also on the variation of the true scores among subjects. So, the homogeneity of the group of subjects is an important characteristic to consider in the context of reliability. If a test has been developed to measure reading skill, then the true scores for a group of subjects consisting of children of a primary school will have a wider range, or a larger true-score variance, than the true scores of, for example, the fifth-grade children only. If we assume, as is frequently done, that the error-score variance is equal for all relevant groups of subjects, we can compute the reliability coefficient for a target group from the reliability in the original group:

ρUU′ = 1 − σ²E/σ²U = 1 − (σ²X/σ²U)(1 − ρXX′), (3.8)

where σ²U is the variance of the observed scores in the target group, σ²X its counterpart in the original group, and ρXX′ the reliability in the original group.

It is, however, advised to verify whether the size of the error variance varies systematically with the true-score level. One method for the computation of the conditional error variance, an important issue for



reporting errors of measurement of test scores (see Standards, APA et al., 1999, Chapter 2) has been suggested by Woodruff (1990). At several places in this book we will pay attention to the subject of conditional error variance.
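Equation 3.8 can be sketched in code (numbers invented for illustration); it shows how reliability drops in a more homogeneous group and rises in a more heterogeneous one:

```python
def reliability_in_group(rho_xx, var_x_orig, var_u_target):
    """Reliability in a target group U, assuming equal error variance (Eq. 3.8)."""
    return 1.0 - (var_x_orig / var_u_target) * (1.0 - rho_xx)

# Invented values: original group has reliability .80, observed variance 100
print(round(reliability_in_group(0.80, 100.0, 50.0), 2))   # 0.6, more homogeneous group
print(round(reliability_in_group(0.80, 100.0, 200.0), 2))  # 0.9, more heterogeneous group
```

The error variance implied by the original group, σ²X(1 − ρXX′) = 20, is held fixed; only the observed-score variance of the target group changes.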

3.6 Estimating the true score

The true score can be estimated by the observed score, and this is done frequently. Assuming that the measurement errors are approximately normally distributed, we can construct a 95% confidence interval:

xp − 1.96σE ≤ τp ≤ xp + 1.96σE. (3.9)

Unfortunately, the point estimate and the confidence interval in Equation 3.9 are misleading for two reasons. The first reason is that we can safely assume that the variance of measurement errors varies from person to person. Persons with a high or low true score have a relatively low error variance due to a ceiling or a floor effect, respectively. So, we should estimate error variance as a function of true score.

We will discuss the second reason in more detail. We start with a simple demonstration. Suppose all true scores are equal. Then the true-score variance equals zero, the observed-score variance equals the variance of measurement errors, and the reliability equals zero. Which estimate of a person's true score seems most adequate? In this case, the best true-score estimate for all persons is the population mean μX.

More generally, we might estimate τ using an equation of the form axp + b, where a and b are chosen in such a way that the sum of the squared differences between the true scores τ and their estimates is minimal. The resulting formula is the formula for the regression of true score on observed score:

τ̂ = (σT ρXT/σX)(x − μX) + μT.

This formula can be rewritten as follows:

τ̂ = ρXX′ x + (1 − ρXX′)μX. (3.10)


with a standard error of estimation (for estimating true score from observed score) equal to

σε = σT √(1 − ρ²XT) = σX √ρXX′ √(1 − ρXX′) = √ρXX′ σE. (3.11)

Formula 3.10 is known as the Kelley regression formula (Kelley, 1947). From Equation 3.11, it is clear that the Kelley estimate is better than the observed score as an estimate of true score.
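Equations 3.10 and 3.11 can be sketched as follows (observed scores, means, and reliabilities invented for illustration):

```python
import math

def kelley(x, rho_xx, mu_x):
    """Kelley regression estimate of the true score (Eq. 3.10)."""
    return rho_xx * x + (1.0 - rho_xx) * mu_x

# The estimate shrinks the observed score toward the population mean,
# more strongly when the reliability is low
print(kelley(20.0, 0.6, 50.0))  # 32.0
print(kelley(20.0, 0.9, 50.0))  # 23.0

# Eq. 3.11: the standard error of estimation equals sqrt(rho) * sigma_E,
# so it is always smaller than the standard error of measurement
sigma_e, rho = 3.0, 0.6
print(round(math.sqrt(rho) * sigma_e, 3))
```

The shrinkage toward μX is exactly what makes the Kelley estimate more accurate on average than the observed score itself.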

The use of the Kelley formula can also be criticized:

1. The standard error of estimation (Equation 3.11) also supposes a constant error variance.

2. The true regression might be nonlinear.

3. The Kelley estimate of the true score depends on the population. Persons with the same observed score coming from different populations might have different true-score estimates and might consequently be treated differently.

4. The estimator is biased. The expected value of the Kelley formula equals τp only when the true score equals the population mean.

5. The regression formula is inaccurately estimated in small samples.

Under a few distributional assumptions, the Kelley formula can be derived from a Bayesian point of view. Assume that we have a prior distribution of true scores N(μT, σ²T), that is, a normal distribution with mean μT and variance σ²T. Empirical Bayesians take the estimated population distribution of T as the prior distribution of true scores.

Also assume that the distribution of the observed score given true score τ equals N(τ, σ²E). Under these assumptions, the mean of the posterior distribution of τ given observed score x equals Kelley's estimate with μX replaced by μT. When a second measurement is taken, it is averaged with the first measurement in order to obtain a refined estimate of the true score. After a second measurement, the variance of the measurement errors is not equal to σ²E but to σ²E/2. After k measurements, we have

τ̂ = (σ²T x(k) + (σ²E/k) μT)/(σ²T + σ²E/k), (3.12)

where x(k) is the average score after k measurements, which serves as the estimate of true score. As k becomes larger, the expected value of Equation 3.12 gets closer to the value τ. So, the bias of the estimator does not seem to be a real issue.
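Equation 3.12 can be sketched as a weighted average of the mean observed score and the prior mean (all parameter values invented for illustration):

```python
def posterior_mean(xbar_k, k, var_t, var_e, mu_t):
    """Posterior mean of the true score after k measurements (Eq. 3.12)."""
    var_e_k = var_e / k                 # error variance of the averaged score
    return (var_t * xbar_k + var_e_k * mu_t) / (var_t + var_e_k)

# For k = 1 this is Kelley's formula with the prior mean mu_T;
# as k grows, the estimate moves toward the average observed score
print(posterior_mean(30.0, 1, 16.0, 9.0, 40.0))    # 33.6
print(posterior_mean(30.0, 100, 16.0, 9.0, 40.0))
```

With k = 100 the weight of the prior is nearly zero, which illustrates why the bias of the estimator vanishes as k increases.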

3.7 Correction for attenuation

The correlation between two variables X and Y, ρXY, is small if the two true-score variables are weakly related. The correlation can also be small if one or both variables have a large measurement error. With the correlation being weakened or attenuated due to measurement errors, one might ask how large the correlation would be without errors (i.e., the correlation between the true-score variables). This is an old problem in test theory, and the answer is simple. The correlation between the true-score variables is

ρTXTY = σTXTY/(σTX σTY) = σXY/(√ρXX′ σX √ρYY′ σY) = ρXY/√(ρXX′ ρYY′). (3.13)

Formula 3.13 is the correction for attenuation. In practice, the problem is to obtain a good estimate of reliability. Frequently, only an underestimate of reliability is available. In that case, the corrected coefficient (Equation 3.13) can take a value larger than one when the correlation between the true-score variables is high.

When data are available for several variables X, Y, Z, and so forth, we can model the relationship between the latent variables underlying the observed variables. In structural equation modeling, the fit of the structure that has been proposed can be investigated. So, structural equation modeling produces information on the true relationship between two variables.
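Equation 3.13 in code (correlations and reliabilities invented for illustration); the second call shows how a corrected coefficient can exceed one when the reliabilities are underestimated:

```python
import math

def disattenuate(r_xy, rho_xx, rho_yy):
    """Correction for attenuation (Eq. 3.13)."""
    return r_xy / math.sqrt(rho_xx * rho_yy)

# Observed r = .30, reliabilities .49 and .81:
# the true-score correlation is .30 / (0.7 * 0.9) = .30 / .63
print(round(disattenuate(0.30, 0.49, 0.81), 3))  # 0.476

# An observed correlation too high relative to the (under)estimated
# reliabilities pushes the corrected value above 1
print(disattenuate(0.80, 0.49, 0.81))
```

A corrected value above one is a signal that the reliability estimates, the correlation, or both are off, not that the true-score correlation really exceeds one.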

Exercises

3.1 The reliability of a test is 0.75. The standard deviation of observed scores is 10.0. Compute the standard error of measurement.

3.2 The reliability of a test is 0.5. Compute the test reliability if the test is lengthened by a factor k = 2, 3, 4,…, 14 (k = 2(1)14, for short).


3.3 Compute the ratio of the standard error of estimation to the standard error of measurement for ρXX′ = 0.5 and ρXX′ = 0.9. Compute the Kelley estimate of the true score for an observed score equal to 30 and μX = 40, for ρXX′ = 0.5 and ρXX′ = 0.9, respectively.

3.4 The reliability of test X equals 0.49. What is the maximum correlation that can be obtained between test X and a criterion? Explain your answer. Suggestion: use the formula for the correction for attenuation.

3.5 Let ρXY be the validity of test X with respect to test Y. Write the validity of test X lengthened by a factor k in terms of ρXY, σX, σY, and ρXX′. What happens when k becomes very large?
