Validity and its sources of evidence

Part of the document Statistical Test Theory for the Behavioral Sciences (pages 120–123)

Chapter 7 Validity and Validation of Tests

7.2 Validity and its sources of evidence

“Validity refers to the degree to which evidence and theory support the interpretation of test scores entailed by proposed uses of tests.”

This is the opening sentence of the chapter on validity in the latest edition of the Standards (APA et al., 1999, p. 9). It is not a definition in the Aristotelian sense (i.e., per genus proximum et differentiam specificam). Neither is it an operational definition: it does not explicitly refer to the relevant operations to ensure validity. The Standards therefore proceed by stating: "The process of validation involves accumulating evidence to provide a sound scientific basis for the proposed score interpretations" (l.c., p. 9). Fortunately, this is an explicit statement:

the unified view of validity entails that validity is evidence based, and the sources of evidence are:

• test content

• response processes

• internal structure


• relations to other variables

• information on the consequences of testing

The latter source of evidence also has to do with social policy and decision making.

The evidence based on test content can be obtained by analyzing the relationship between a test's content and the construct it is intended to measure. Response processes refer to the detailed nature of performance. Evidence about them generally comes from analyses of individual responses (e.g., do test takers use particular performance or response strategies; are there deviant responses on certain items?). The evidence based on internal structure comes from the analysis of the internal structure of a test (e.g., can the relationships among test items be accounted for by a single dimension of behavior?). In Chapter 3 we already met the analysis of the internal structure of test items in the context of internal consistency reliability. The latter form of reliability is worked out (and liberalized, so to say, from the assumptions of classical test theory) in the broader framework of generalizability theory. G theory, therefore, bridges the gap between reliability and validity (cf. Cronbach et al., 1972). Performance assessment is generally thought to have the right content, but it needs further validation (Messick, 1994; Lane and Stone, 2006).
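As a small, hypothetical illustration of evidence based on internal structure, internal consistency can be computed directly from an item-score matrix with coefficient alpha, as met in Chapter 3. The function name and toy data below are ours, not from the text:

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's coefficient alpha for an (n_persons, n_items) score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)      # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of the total scores
    return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

# Toy data: three persons, two perfectly parallel items -> alpha = 1.0
alpha = cronbach_alpha([[1, 1], [2, 2], [3, 3]])
```

A high alpha is consistent with, though not proof of, a single dominant dimension; factorial analyses (discussed below) probe dimensionality more directly.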

The largest category of evidence is evidence based on relations to other variables. This category analyzes the relationship of test scores to external variables (e.g., measures of the same or similar constructs, measures of related and different constructs, performance measures used as criteria). The unified view avoids old-fashioned labels such as concurrent validity; instead, it refers to the ways in which evidence for validity can be obtained. The category based on relations to other variables includes the following:

• Convergent and discriminant evidence

• Test-criterion relationships

• Validity generalization

The first subcategory, convergent and discriminant evidence, has its early beginnings with Cronbach and Meehl (1955) and, most importantly, with Campbell and Fiske (1959). This subcategory of what was called construct-related validity is presented in Section 7.5. The study of test–criterion relationships covers what has been called criterion-related validity and, still earlier, predictive validity. Validity generalization is the evidence obtained by summing up earlier findings on similar research questions (e.g., the findings of criterion-related correlation studies with the same or comparable dependent and independent variables). Validity generalization is also known under the terms meta-analysis, research synthesis, or cumulation of studies. A new development that should be mentioned is the argument-based approach to validity. One could call this the hermeneutic or interpretative argument, as Kane (2006, pp. 22–30) has it.

This development is too fresh to include it in the present chapter.
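The statistical core of validity generalization can be sketched minimally, assuming only a set of validity coefficients and sample sizes from comparable studies. The sketch below pools correlations via Fisher's z; the artifact corrections of a full meta-analysis (e.g., for criterion unreliability or range restriction) are omitted, and the function name is ours:

```python
import math

def mean_validity(rs, ns):
    """Sample-size-weighted mean correlation via Fisher's z transform.

    A bare-bones validity-generalization summary: each r is transformed
    to z = atanh(r), averaged with weights n - 3 (the inverse sampling
    variance of z), and transformed back.
    """
    zs = [math.atanh(r) for r in rs]
    weights = [n - 3 for n in ns]
    z_bar = sum(w * z for w, z in zip(weights, zs)) / sum(weights)
    return math.tanh(z_bar)

# Hypothetical validity coefficients from three comparable studies.
pooled = mean_validity([0.30, 0.42, 0.25], [60, 120, 45])
```

The pooled estimate necessarily lies between the smallest and largest study-level coefficients, with larger studies pulling it more strongly.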

So far, it is all rather abstract. How can it be made more concrete? How do we proceed in the validation of a test? Ironically, to make clear how we study validity empirically, we do better to go back to the trichotomy of test validity in the 1985 Standards.

The following are the three validities in the 1985 Standards:

1. Content-related validity: In general, content-related evidence demonstrates the degree to which the sample of items, tasks, or questions on a test is representative of some defined universe or domain of content.

2. Criterion-related validity: Criterion-related evidence demonstrates that scores are systematically related to one or more outcome criteria. In this context, the criterion is the variable of primary interest, as determined by a school system, the management of a firm, or clients, for example. The choice of the criterion and the measurement procedures used to obtain criterion scores are of central importance. Logically, the value of a criterion-related study depends on the relevance of the criterion measure that is used.

3. Construct-related validity: The evidence classed in the construct-related category focuses primarily on the test score as a measure of the characteristics of interest. Reasoning ability, spatial visualization, and reading comprehension are constructs, as are personality characteristics such as sociability and introversion. Such characteristics are referred to as constructs because they are theoretical constructions about the nature of human behavior (APA et al., 1985, pp. 9–11).

Each of these validities leads to methods for obtaining evidence for the specific type of validity. The methods for content-related validity, for example, often rely on expert judgments to assess the relationship between parts of the test and the defined universe. This line of thinking is embedded in generalizability theory, as discussed earlier. In addition, certain logical and empirical procedures can be used (see, e.g., Cronbach, 1971).
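Such expert judgments can be quantified very simply, for example as per-item agreement among judges on the relevance of each item to the defined domain (a content-validity-index-style summary). The ratings below are entirely hypothetical:

```python
# Hypothetical expert ratings: 1 if the expert judges the item relevant
# to the defined content domain, 0 otherwise (rows = items, cols = experts).
ratings = [
    [1, 1, 1, 1],  # item 1: all four experts endorse
    [1, 1, 0, 1],  # item 2: three of four endorse
    [0, 1, 0, 0],  # item 3: a candidate for removal from the item pool
]

# Per-item agreement: proportion of experts endorsing each item.
item_agreement = [sum(row) / len(row) for row in ratings]
```

Items with low agreement would be flagged for revision or removal before the test is taken to represent the content universe.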


Methods for expressing the relationship between test scores and criterion measures vary. The general question is always: how accurately can criterion performance be predicted from test scores? Depending on the context, a given degree of accuracy is judged high or low, useful or not useful. Two basic designs can be distinguished for obtaining information on the accuracy of test data. One is the predictive study, in which test data are related to criterion scores obtained at a later point in time. The second is the so-called concurrent study, in which test data and criterion data are obtained simultaneously.
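A predictive study can be sketched with simulated data: the validity coefficient is simply the correlation between test scores gathered at selection and criterion scores obtained later. All numbers below are simulated for illustration:

```python
import numpy as np

# Hypothetical predictive study: test scores x are collected at selection
# time; criterion scores y (e.g., later performance) are obtained afterwards.
rng = np.random.default_rng(1)
x = rng.normal(size=200)                       # test scores
y = 0.6 * x + rng.normal(scale=0.8, size=200)  # criterion, partly driven by x

r_xy = np.corrcoef(x, y)[0, 1]  # predictive validity coefficient
r_squared = r_xy ** 2           # share of criterion variance accounted for
```

In a concurrent study the code would be identical; only the timing of the data collection differs, which is why concurrent coefficients can overstate or understate the operational predictive validity.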

The value or utility of a predictor test can also be judged in a decision theory framework. This will be exemplified in a later section. There, errors of classification will be considered as evidence for criterion-related validity.
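The flavor of that decision-theoretic treatment can be previewed with a simulated accept/reject rule: choose a cutoff on the test, a success criterion on the outcome, and count the two kinds of classification error. Cutoffs and data below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=500)                       # test scores
y = 0.5 * x + rng.normal(scale=0.9, size=500)  # criterion scores

accepted = x >= 0.0    # decision rule: accept applicants above the test cutoff
successful = y >= 0.0  # outcome: success above the criterion cutoff

false_positives = np.mean(accepted & ~successful)  # accepted, but unsuccessful
false_negatives = np.mean(~accepted & successful)  # rejected, but would succeed
hit_rate = np.mean(accepted == successful)         # proportion correct decisions
```

With a positively valid test, the hit rate exceeds the 50% expected from cutting at the median by chance alone, and the two error rates trade off as the cutoff moves.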

Empirical evidence for the construct interpretation of a test may be obtained from a variety of sources. The most straightforward procedure would be to use the intercorrelations among items to support the assertion that a test measures primarily or substantially a single construct. Technically, quite a number of analytical procedures are available to do so (e.g., factor analysis, multidimensional scaling (MDS), IRT models).
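A crude factor-analytic check along these lines looks at the eigenvalues of the inter-item correlation matrix: one dominant eigenvalue is informal evidence for a single construct. The function name and simulated one-factor data below are ours, a rough stand-in for a full factor analysis:

```python
import numpy as np

def dominant_eigen_ratio(item_scores):
    """Ratio of the first to the second eigenvalue of the inter-item
    correlation matrix; a large ratio suggests one dominant dimension."""
    r = np.corrcoef(np.asarray(item_scores, dtype=float), rowvar=False)
    eigenvalues = np.sort(np.linalg.eigvalsh(r))[::-1]  # descending order
    return eigenvalues[0] / eigenvalues[1]

# Simulated data: 300 persons, four items all loading on one common factor.
rng = np.random.default_rng(0)
factor = rng.normal(size=(300, 1))
items = factor + 0.5 * rng.normal(size=(300, 4))
ratio = dominant_eigen_ratio(items)
```

Formal methods (maximum-likelihood factor analysis, IRT fit statistics) replace this eyeball criterion with testable models, but the underlying question is the same.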

Another procedure would be to study the substantial relationships of a test with other measures that are purportedly of the same construct, and the weakness of its relationships to measures that are purportedly of different constructs. These relationships support both the identification of constructs and the distinctions among them. This quite abstract formulation is taken from the Standards (APA et al., 1985, p. 10). In a later section the so-called multitrait–multimethod approach to construct validation will be considered more concretely and in more detail.

Before going into specific aspects and procedures of validation studies, it is important to consider the problem of selection and its effect on the correlation between, for example, test X and criterion Y, that is, on the (predictive) validity of test X with respect to criterion Y. Essentially, this is applying statistics in the field of psychometrics: what is the influence of restriction of range on the value of the validity of a test?
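One widely used correction for direct restriction of range on X (commonly attributed to Thorndike's Case 2) can be sketched as follows; the function name and the example numbers are ours:

```python
import math

def correct_for_restriction(r_restricted, sd_unrestricted, sd_restricted):
    """Estimate the validity in the unrestricted population from the
    correlation observed in a group directly selected on predictor X.

    With u = SD(X, unrestricted) / SD(X, restricted) and r the restricted
    correlation, the corrected value is u*r / sqrt(1 + (u^2 - 1) * r^2).
    """
    u = sd_unrestricted / sd_restricted
    r = r_restricted
    return (u * r) / math.sqrt(1.0 + (u * u - 1.0) * r * r)

# Validity observed in a selected group whose SD on X is half that of the
# full applicant population: the corrected coefficient is markedly larger.
corrected = correct_for_restriction(0.30, 10.0, 5.0)
```

The correction illustrates the point of the question above: selection on X shrinks the observed validity, so a coefficient computed in a selected group understates the validity in the population of interest.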
