The standard error of measurement, both overall and conditional (if relevant), should be reported both in raw scores or original scale units and in units of each derived score recommended for use in test interpretations. (Standards, APA et al., 1999, p. 31)
The standard error of measurement can vary with true-score level.
Conditional standard errors of measurement are standard errors of measurement conditional on true-score level. They offer an alternative way to convey reliability information: they can be used to construct a confidence interval for an examinee’s true score, universe score (to be discussed in Chapter 5), or percentile rank. Earlier, three types of standard errors were discussed: the standard error of measurement (Equation 3.3), the standard error of estimate of true score (Equation 3.11), and the standard error of prediction (Equation 4.17).
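As a minimal sketch of the confidence-interval idea, the snippet below builds a symmetric interval around an observed score. It assumes normally distributed errors and uses the familiar classical formula for the overall standard error of measurement (standard deviation times the square root of one minus the reliability). The function names and all numbers are hypothetical; when a conditional standard error is available for the examinee's score level, it can be passed in place of the overall value.

```python
import math


def overall_sem(sd, reliability):
    """Classical overall standard error of measurement: sd * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)


def true_score_interval(observed, sem, z=1.96):
    """Return (lower, upper) bounds of the interval observed +/- z * sem.

    z = 1.96 gives an approximate 95% interval under normally distributed errors.
    """
    half_width = z * sem
    return observed - half_width, observed + half_width


# Hypothetical test: standard deviation 10, reliability 0.84, observed score 50.
sem = overall_sem(sd=10.0, reliability=0.84)
low, high = true_score_interval(observed=50.0, sem=sem)
print(f"SEM = {sem:.2f}, interval = ({low:.2f}, {high:.2f})")
```

With a conditional standard error, only the `sem` argument changes; the interval then becomes wider or narrower depending on the score level.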
Woodruff (1990) studied the conditional standard error of measurement for assessing the precision of a test on its score scale. He proposed splitting a test into two parallel halves, X and X'. ANOVA is then used to estimate the values $\sigma^2(E' \mid X)$ as substitutes for $\sigma^2(E' \mid T)$. Finally, the outcomes are corrected for the fact that the test was split into two halves (using the customary assumption that the error variance doubles when a test is lengthened by a factor of 2).
Feldt and Qualls (1996) proposed a method for estimating the conditional error variance based on a split of the test into a number of essentially tau-equivalent subtests. A split into two halves is possible, but it proves better to split the test into many subtests, as long as all subtests can be considered essentially tau-equivalent measurement instruments. Let there be n subtests. For person p, the estimated error variance of the subtests is given by Equation 4.19, in which the scores are corrected for the test effects $x_{.i} - x_{..}$. In the terminology of ANOVA, two-way interactions are used in Equation 4.19. Suppose that the subtests have equal score ranges. Then a consequence of the assumption of essentially tau-equivalent subtests, on which Equation 4.19 is based, is that the error variance associated with a perfect score is nonzero when the subtests differ in difficulty level. In a nonlinear true-score model, a model based on item response theory (IRT), such an anomaly does not occur.
Again, we must multiply the estimate by a constant in order to obtain the error variance for the total test. When the n subtests add up to the total test, the total test is n times as long as a subtest, and the result in Equation 4.19 must be multiplied by n.
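The two steps just described can be sketched in a few lines of NumPy. The sketch assumes a complete persons-by-subtests score matrix. For each person it takes the squared deviations of the subtest scores from the person mean, corrected for the subtest effects, sums them, divides by n − 1, and then multiplies by n to move to the total-test scale. The function names and the data are hypothetical.

```python
import numpy as np


def subtest_error_variance(scores):
    """Per-person error variance on the subtest scale.

    scores: 2-D array, rows = persons, columns = the n subtests.
    """
    scores = np.asarray(scores, dtype=float)
    n = scores.shape[1]
    person_means = scores.mean(axis=1, keepdims=True)    # x_p.
    subtest_means = scores.mean(axis=0, keepdims=True)   # x_.i
    grand_mean = scores.mean()                           # x_..
    # Person deviations corrected for the subtest (test) effects.
    residuals = (scores - person_means) - (subtest_means - grand_mean)
    return (residuals ** 2).sum(axis=1) / (n - 1)


def total_error_variance(scores):
    """Scale the subtest-level estimate by n to obtain the total-test value."""
    n = np.asarray(scores).shape[1]
    return n * subtest_error_variance(scores)


# Hypothetical data: 4 persons, 3 subtests, each subtest scored 0-5.
scores = np.array([
    [5, 5, 5],   # perfect score on every subtest
    [4, 3, 2],
    [3, 2, 1],
    [4, 2, 0],
])
estimates = total_error_variance(scores)
print(estimates)
```

In this made-up data set the subtest means are 4, 3, and 2, so the subtests differ in difficulty. The four estimates are 3, 0, 0, and 3: the first person, who has a perfect score on every subtest, still receives a nonzero error variance, which illustrates the anomaly noted above for subtests of unequal difficulty.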
Next, the error variances for all persons with the same total score can be averaged. This produces the estimated relationship between the size of the conditional error variance and total score. Feldt and Qualls suggest reducing sampling variation further by smoothing the empirical relationship between error variance and total score. This can be achieved by a polynomial regression, in which the error variance is regressed on powers of X ($X$, $X^2$, etc.).
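The averaging and smoothing steps might look as follows. This is a sketch under stated assumptions, not a reproduction of Feldt and Qualls's procedure: the polynomial degree (2 here) is an arbitrary choice, and the regression is fitted without weights to the per-score averages, whereas one could equally weight by the number of persons at each score. The function name and the data are hypothetical.

```python
import numpy as np


def smooth_conditional_error_variance(total_scores, error_variances, degree=2):
    """Average person-level error variances per total score, then smooth.

    Returns the distinct total scores, the raw averages, and the values
    fitted by a polynomial regression of the averages on powers of X.
    """
    total_scores = np.asarray(total_scores, dtype=float)
    error_variances = np.asarray(error_variances, dtype=float)

    # Step 1: average the estimates of all persons with the same total score.
    levels = np.unique(total_scores)
    means = np.array([error_variances[total_scores == x].mean() for x in levels])

    # Step 2: regress the averages on X, X^2, ... (unweighted; an assumption).
    coefficients = np.polyfit(levels, means, deg=degree)
    smoothed = np.polyval(coefficients, levels)
    return levels, means, smoothed


# Hypothetical person-level results: two persons at each total score 0..4.
total_scores = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4])
error_variances = np.array([3.5, 4.5, 5.0, 6.0, 5.5, 6.5, 5.0, 6.0, 3.5, 4.5])

levels, means, smoothed = smooth_conditional_error_variance(
    total_scores, error_variances
)
for x, raw, fit in zip(levels, means, smoothed):
    print(f"X = {x:.0f}: averaged = {raw:.2f}, smoothed = {fit:.2f}")
```

The example data were constructed so that the averages lie exactly on a quadratic curve; with real data the smoothed values would differ from the raw averages, which is the point of the smoothing step.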
It is instructive to compare Equation 4.19 with a formula for the conditional error variance developed within the context of generalizability theory. For this purpose, Equation 4.19 is rewritten as Equation 4.20, which is comparable to Equation 5.41.
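The equivalence of the two forms can be checked numerically. The sketch below computes, for every person in a randomly generated score matrix, the sum-of-squares form and the variance-plus-variance-minus-twice-covariance form, and confirms that they agree. It assumes that the variances and the covariance in the rewritten form are taken over the n subtests with n − 1 in the denominator; the function names and the data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 6 persons, 4 subtests, integer scores 0-5.
scores = rng.integers(0, 6, size=(6, 4)).astype(float)
n = scores.shape[1]
subtest_means = scores.mean(axis=0)   # x_.i, one value per subtest
grand_mean = scores.mean()            # x_..


def sum_of_squares_form(x_p):
    """Squared corrected deviations summed over subtests, divided by n - 1."""
    residuals = (x_p - x_p.mean()) - (subtest_means - grand_mean)
    return (residuals ** 2).sum() / (n - 1)


def variance_covariance_form(x_p):
    """s^2(x_pi | p) + s^2(x_.i) - 2 cov(x_pi, x_.i | p), all over subtests."""
    var_person = x_p.var(ddof=1)
    var_subtests = subtest_means.var(ddof=1)
    covariance = np.cov(x_p, subtest_means, ddof=1)[0, 1]
    return var_person + var_subtests - 2.0 * covariance


form_a = np.array([sum_of_squares_form(row) for row in scores])
form_b = np.array([variance_covariance_form(row) for row in scores])
print(np.allclose(form_a, form_b))
```

The agreement follows because the corrected deviation for subtest i is simply the difference x_pi − x_.i minus its own mean over subtests, so the first form is the sample variance of that difference.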
$$ s_E^2(p) = \frac{\sum_{i=1}^{n} \left[ (x_{pi} - x_{p.}) - (x_{.i} - x_{..}) \right]^2}{n - 1} \tag{4.19} $$

$$ s_E^2(p) = s^2(x_{pi} \mid p) + s^2(x_{.i}) - 2\,\mathrm{cov}(x_{pi}, x_{.i} \mid p) \tag{4.20} $$

More methods for estimating conditional standard errors of measurement are described by Lee, Brennan, and Kolen (2000). Methods for obtaining conditional error variances have been proposed specifically within the context of generalizability theory (Chapter 5) and for dichotomous items (Chapter 6). In IRT, the problem of the conditional standard error of measurement is approached in another way (see Section 6.4).
Exercises
4.1 A test X is given with three subtests, X1, X2, and X3. The variance–covariance matrix for the subtests is given in the table below. Estimate reliability with coefficient α.

        X1     X2     X3
  X1    8.0    6.0    8.0
  X2    6.0   12.0   12.0
  X3    8.0   12.0   17.0

4.2 Use the variance–covariance matrix from Exercise 4.1 for estimating test reliability according to the model of congeneric tests. Use Equation 4.6 for the estimation of the $a_i$.
4.3 Prove that for parallel test items coefficient alpha equals the Spearman–Brown formula for the reliability of a lengthened test.
4.4 Two tests X1 and X2 are congeneric measurement instruments. Their correlations with other variables Y1, Y2, and so on, differ. Is there a pattern to be found in the correlations?
4.5 Given are two tests X and Y with $\sigma_X^2 = 16.0$, $\sigma_Y^2 = 16.0$, $\rho_{XX'} = \rho_{YY'} = 0.8$, and $\rho_{XY} = 0.7$.
a. Compute the observed-score variance, the true-score variance, and the reliability of the difference scores $X - Y$.
b. Compare the variance of the raw score differences with $\sigma^2_{E(X-Y)}$ of Equation 4.18.
4.6 In a test, several items cover the same subject. Which assumption of classical test theory might be violated? What should we do when we want to estimate reliability with coefficient α?
4.7 We have three tests X1, X2, and X3 measuring the same construct. Their correlations with test Y equal 0.80, 0.70, and 0.60. Their covariances with Y are equal to 0.20. The means of the tests are 16.0, 16.0, and 20.0, respectively. Are these tests parallel tests, tau-equivalent, essentially tau-equivalent, or congeneric? Discuss your answer.
4.8 A test has a mean score equal to 40.0, a standard deviation equal to 10.0, and a reliability equal to 0.5. Which difference score do you expect after a retest when the first score of a person equals 30?
4.9 Two tests X and Y are available. The tests have equal observed-score variances: $\sigma_X^2 = \sigma_Y^2 = 25.0$. The reliability of test X is 0.8, the reliability of test Y is 0.6. Their intercorrelation is zero. Compute the reliability of the composite test X + Y. Also, compute the reliability of the composite after doubling the test length.
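For readers who want to check their hand computations, the following sketch shows one way to compute coefficient α from a variance–covariance matrix of subtests, using the standard formula α = n/(n − 1) · (1 − sum of subtest variances / variance of the total score). The matrix is illustrative only and is deliberately not the one from Exercise 4.1; the function name is hypothetical.

```python
import numpy as np


def coefficient_alpha(cov_matrix):
    """Coefficient alpha from a square variance-covariance matrix of n parts."""
    cov_matrix = np.asarray(cov_matrix, dtype=float)
    n = cov_matrix.shape[0]
    sum_part_variances = np.trace(cov_matrix)   # variances on the diagonal
    total_variance = cov_matrix.sum()           # variance of the summed score
    return n / (n - 1) * (1.0 - sum_part_variances / total_variance)


# Illustrative matrix for three subtests (not the data of Exercise 4.1).
example = np.array([
    [4.0, 2.0, 2.0],
    [2.0, 5.0, 3.0],
    [2.0, 3.0, 6.0],
])
alpha = coefficient_alpha(example)
print(round(alpha, 4))
```

Here the subtest variances sum to 15 and the total-score variance is 29, so α = 1.5 × (1 − 15/29) = 21/29, or about 0.72.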