10 STATISTICAL TEST THEORY FOR THE BEHAVIORAL SCIENCES
Suppose that we obtained a measurement $x_{pi}$ on person p with measurement instrument i. Suppose, for example, that we read this person's weight from a particular weighing machine and record the outcome. Next, we take a new measurement and notice that it differs from the first. The obtained measurements can be thought of as arising from a probability distribution for measurements $X_p$ with realizations $x_p$.
Measurement in the behavioral sciences presents a similar situation. We obtain a measurement, and we would expect a different outcome if we were able to repeat the measuring procedure. In the behavioral sciences, however, we are frequently unable to obtain a series of comparable measurements with the same measurement instrument, because the act of measuring may affect the person being measured. Memory effects prevent independent replications of the measurement procedure. We might, however, administer a second test constructed to measure the same construct and notice that the person obtains a different score on this test than on the first. This calls for an appropriate theory of errors, or error model. The simplest such model rests on the idea that the observed test score is contaminated by measurement error. The observed score is considered to be composed of a true score and a measurement error (see also Figure 2.1):
$$x_p = \tau_p + e_p \tag{2.1}$$
If the measurement could be repeated many times under the condition that the different measurements are experimentally independent, then the average of these measurements would give a reasonable approximation to $\tau_p$. In formal terms, the true score is defined as the expected value of the variable $X_p$ ($x_p$ in Equation 2.1 is a realization of the random variable $X_p$):
$$\tau_p = \mathcal{E} X_p \tag{2.2}$$
where $\mathcal{E}$ represents the expectation over independent replications.
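The definition in Equation 2.2 can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the true score of 70, the error standard deviation of 2, and the normal error distribution are hypothetical choices, not values taken from the text. It draws many independent replications of $x_p = \tau_p + e_p$ and shows that their average comes close to $\tau_p$.

```python
import random
import statistics


def simulate_replications(true_score, error_sd, n, seed=0):
    """Draw n independent observed scores x_p = tau_p + e_p."""
    rng = random.Random(seed)
    # Assumption: errors are normal with mean zero (illustrative only).
    return [true_score + rng.gauss(0.0, error_sd) for _ in range(n)]


TAU_P = 70.0     # hypothetical true score of person p
ERROR_SD = 2.0   # hypothetical standard deviation of the errors

scores = simulate_replications(TAU_P, ERROR_SD, n=10_000)
mean_score = statistics.fmean(scores)

print("first three observed scores:", [round(x, 2) for x in scores[:3]])
print(f"mean of {len(scores)} replications: {mean_score:.3f}")
```

Individual observed scores scatter around the true score, while the mean over replications settles near it. In practice such independent replications are rarely available, which is why the true score is defined as an expectation rather than as an observed average.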
Figure 2.1 The decomposition of observed scores in classical test theory. [Figure: a true score together with observed scores x1, x2, and x3 and their errors e1, e2, and e3.]
CLASSICAL TEST THEORY 11
The definition of true score as an expected value seems obvious if the measurements to be taken can be considered exchangeable. In other words, this definition seems obvious if we do not know anything about a particular measurement. But consider the situation in which different measurement instruments are available and we have information on these instruments. For example, assume we have some raters as measurement instruments. Assume also that the raters differ in leniency, a fact known to occur. Does the definition of true score as an expected value do justice to this situation? Should we not correct the scores given by a rater with a known constant bias? The answer is that we can correct the scores without rejecting the idea of a true score, for it is possible to use the score scale of a particular rater and define a true score for this rater. Scores obtained on this scale can be transformed to another scale, comparable to the transformation of degrees Fahrenheit into degrees Celsius. The transformation of scores to scales defined by other measurement instruments will be discussed in Chapter 11.
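The correction for a known constant bias can be sketched in a few lines. The rater names, the bias values, and the function `to_reference_scale` below are hypothetical and serve only to illustrate the idea: a score on one rater's scale is carried over to a reference scale by a fixed transformation, in the same way that a temperature is carried over from Fahrenheit to Celsius.

```python
def fahrenheit_to_celsius(deg_f):
    """Fixed transformation between two temperature scales."""
    return (deg_f - 32.0) * 5.0 / 9.0


def to_reference_scale(score, rater_bias):
    """Remove a known constant rater bias from a score.

    Assumption: the bias is additive and constant; this is an
    illustrative simplification, not a method prescribed by the text.
    """
    return score - rater_bias


# Hypothetical leniency of two raters relative to a reference scale.
RATER_BIAS = {"rater_a": 0.5, "rater_b": -0.25}

score_from_a = 7.5
corrected = to_reference_scale(score_from_a, RATER_BIAS["rater_a"])

print("score on rater A's scale:", score_from_a)
print("score on the reference scale:", corrected)
print("212 degrees Fahrenheit in Celsius:", fahrenheit_to_celsius(212.0))
```

The true score on rater A's own scale remains well defined; the transformation only expresses it on another scale.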
In other situations, the characteristics of a particular rater are unknown. It is not necessary to have information on this rater, because the next measurement is likely to be taken by another rater. Then the rater effect can be considered part of the measurement error. In Exhibit 2.1, more information on multiple sources of measurement error is given.
The foregoing means that the definition of measurement error and, consequently, the definition of true score depend on the situation in which measurements are taken and used. If a particular aspect of the measurement situation has an effect on the measurements and if this aspect can be considered as fixed, one can define true score so as to incorporate this effect. This is the case when one tries to minimize noise in the data to be obtained through the testing procedure by standardization. In other cases, one is not able or not prepared to fix an aspect, and the variation due to fluctuations in the measurement context is considered part of the measurement error.
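The difference between fixing an aspect of the measurement situation and letting it vary can be made concrete with a simulation. The sketch below rests on hypothetical values throughout (three raters with leniency effects of -1, 0, and 1, a residual error standard deviation of 1, and a score of 50 on a neutral scale). With a fixed rater, the rater effect is absorbed into the true score; with a randomly drawn rater, it adds to the error variance.

```python
import random
import statistics


def observed_scores(base_score, rater_effects, noise_sd, n, fixed_rater, seed=0):
    """Simulate n ratings of one person.

    fixed_rater=True : every rating comes from the first rater.
    fixed_rater=False: each rating comes from a randomly drawn rater.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n):
        effect = rater_effects[0] if fixed_rater else rng.choice(rater_effects)
        scores.append(base_score + effect + rng.gauss(0.0, noise_sd))
    return scores


RATER_EFFECTS = [-1.0, 0.0, 1.0]  # hypothetical leniency effects
BASE_SCORE = 50.0                 # hypothetical score on a neutral scale

fixed = observed_scores(BASE_SCORE, RATER_EFFECTS, 1.0, 20_000, fixed_rater=True)
varying = observed_scores(BASE_SCORE, RATER_EFFECTS, 1.0, 20_000, fixed_rater=False)

mean_fixed = statistics.fmean(fixed)
var_fixed = statistics.pvariance(fixed)
var_varying = statistics.pvariance(varying)

print(f"fixed rater:   mean {mean_fixed:.2f}, variance {var_fixed:.2f}")
print(f"varying rater: mean {statistics.fmean(varying):.2f}, variance {var_varying:.2f}")
```

Under the fixed rater, the expected score shifts by that rater's effect and the spread reflects the residual error alone. Under varying raters, the spread is larger because the rater effect has become part of the measurement error.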
Exhibit 2.1 Measurement error: Systematic and unsystematic
Classical test theory assumes unsystematic measurement errors. Systematic measurement error may occur when a test consistently measures something other than what it purports to measure. A depression inventory, for example, may tap not only depression, the intended trait, but also anxiety. In this case, a reasonable decomposition of observed scores on the depression inventory would be
$$X = \tau + E_D + E_U$$
where $X$ is the observed score, $\tau$ is the true score, $E_D$ is the systematic error due to the anxiety component, and $E_U$ is the combined effect of unsystematic error.
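A short simulation shows why the two kinds of error behave differently. The sketch assumes, for illustration only, that the systematic component is a constant 3 points for this person, that the unsystematic error is normal with standard deviation 2, and that the true score is 20; none of these values comes from the text. Averaging over replications removes the unsystematic error but leaves the systematic error in place.

```python
import random
import statistics


def inventory_scores(true_score, systematic_error, unsystematic_sd, n, seed=0):
    """Simulate X = tau + E_D + E_U for one person over n replications."""
    rng = random.Random(seed)
    # Assumption: E_D is constant across replications; E_U is normal noise.
    return [
        true_score + systematic_error + rng.gauss(0.0, unsystematic_sd)
        for _ in range(n)
    ]


TAU = 20.0   # hypothetical true depression score
E_D = 3.0    # hypothetical systematic error from the anxiety component

scores = inventory_scores(TAU, E_D, unsystematic_sd=2.0, n=10_000)
mean_score = statistics.fmean(scores)
remaining_bias = mean_score - TAU

print(f"mean observed score: {mean_score:.2f}")
print(f"bias left after averaging: {remaining_bias:.2f}")
```

The bias that remains after averaging is close to the assumed systematic error, which is why systematic error cannot be treated as measurement error in the classical sense.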
Clearly, the decomposition of observed score according to classical test theory is the most rudimentary form of linear model decomposition.
Generalizability theory (see Chapter 5) has more to say about the decomposition of observed scores. Structural equation modeling might be used to unravel the components of observed scores.
Classical test theory can deal with only one true score and one measurement error. Therefore, the test researcher or test user must formulate precisely which aspects belong to the true score and which are due to measurement error. This choice also restricts the choice of methods to estimate reliability, which is the extent to which obtained score differences reflect true differences. Suppose we want to measure a characteristic that fluctuates from day to day but is also relatively stable in the long term. We might be interested in the momentary state or in the long-term expectation. If we are interested in measuring the momentary state, the value of the test–retest correlation does not have much relevance. A systematic framework for the many aspects of measurement errors and true scores was developed in generalizability theory.
From the definition of true score, we can deduce that the measurement error has an expected value equal to zero:
$$\mathcal{E} E_p = 0 \tag{2.3}$$
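The deduction can be written out in two steps. Writing $\mathcal{E}$ for the expectation over independent replications and $E_p$ for the random error variable of which $e_p$ is a realization, Equation 2.3 follows from the decomposition in Equation 2.1 and the definition in Equation 2.2, because $\tau_p$ is a constant for a given person:

```latex
\begin{align*}
E_p &= X_p - \tau_p \\
\mathcal{E} E_p &= \mathcal{E} X_p - \tau_p = \tau_p - \tau_p = 0
\end{align*}
```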
The variance of measurement errors equals
$$\sigma^2(E_p) = \sigma^2(X_p) \tag{2.4}$$
The square root of the variance in Equation 2.4 is the standard error of measurement for person p, the person-specific standard error of measurement.
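Equation 2.4 holds because, for a single person, the true score is a constant, so subtracting it from the observed scores shifts them without changing their spread. The sketch below checks this numerically under hypothetical assumptions (a true score of 100 and normal errors with standard deviation 5) and computes the person-specific standard error of measurement as the square root of the error variance.

```python
import random
import statistics

TAU_P = 100.0    # hypothetical true score of person p
ERROR_SD = 5.0   # hypothetical standard deviation of the errors

rng = random.Random(1)
# Assumption: normal, independent errors (illustrative only).
observed = [TAU_P + rng.gauss(0.0, ERROR_SD) for _ in range(5_000)]
errors = [x - TAU_P for x in observed]

var_x = statistics.pvariance(observed)   # variance of observed scores
var_e = statistics.pvariance(errors)     # variance of errors
sem_p = var_e ** 0.5                     # person-specific standard error

print(f"mean error: {statistics.fmean(errors):.3f}")
print(f"variance of observed scores: {var_x:.3f}")
print(f"variance of errors: {var_e:.3f}")
print(f"person-specific standard error of measurement: {sem_p:.3f}")
```

The two variances coincide up to rounding, and the standard error of measurement recovers the assumed error standard deviation to within sampling error.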