4
Tests as Measurements

The high-stakes decisions for individual students on which this report focuses—tracking, promotion and retention, and graduation or denial of high school diplomas—have profound implications for the development and future life chances of young people. Tests used for such high-stakes purposes must therefore meet professional standards of reliability, validity, and fairness.

In this chapter, we examine these key concepts of testing to provide a basis for the discussion in the rest of the report of the psychometrics of particular high-stakes uses of tests. In addition, the principles of reliability, validity, and fairness in testing have been codified in various forms by professional organizations, and these codes are also addressed in this chapter. Although reliability and fairness are in fact aspects of the overarching concept of validity, the three concepts are addressed in turn to highlight their distinctive features.

In the simplest terms, reliability refers to the stability or reproducibility of a test's results. A test is highly reliable if a student taking it on two different occasions will get two very similar if not identical scores. The key issue of reliability, then, is to establish that something is being measured with a certain degree of consistency.

The key issue of validity is to determine the nature of that something—specifically, whether the test measures what it purports to measure
and what meaning can be drawn from the results—and whether the conclusions and inferences drawn from the test results are appropriate.

Fairness incorporates not just technical issues of reliability and validity but also social values of equity and justice—for example, whether a test systematically underestimates the knowledge or skill of members of a particular group.

Reliability of Measurement

Reliability is typically estimated in one of three ways. One is to estimate the consistency of a test's results on different occasions, as explained above. A second way is to examine consistency across parallel forms of a test, which are developed to be equivalent in content and technical characteristics. That is, to what extent does performance on one form of the test correlate with performance on a parallel form? A third way is to determine how consistently examinees perform across similar items or subsets of items, intended to measure the same knowledge or skill, within a single test form. This concept of reliability is called internal consistency.

For judgmentally scored tests, such as essays, another widely used index is the coefficient of scorer reliability, which addresses consistency across different observers, raters, or scorers. That is, do the scores assigned by one judge using a set of designated rating criteria agree with those given by another judge using the same criteria?

How reliable must a test be? That depends on the nature of the construct—that is, the abstract skill, attribute, or domain of knowledge—being measured. For a very homogeneous, narrow construct, such as adding two-digit numbers, internal-consistency reliability should be extremely high. We would expect somewhat less high reliability for a more heterogeneous, broad construct, such as algebra, given the same length test. Measures of certain constructs, such as mood or anxiety (that is, states as opposed to traits), are generally less stable; thus high reliability would not be expected.

For most purposes, a more useful index than reliability is the standard error of measurement, which is related to the unreliability of a test. This index defines a range of likely variation, or uncertainty, around the test score—similar to when public opinion polls report a margin of error of plus or minus x points. The standard error thus quantifies and makes explicit the uncertainty involved in interpreting a student's level of performance;
for example, "We can be 95 percent confident that this student's true score falls between x and y." This degree of uncertainty is particularly important to take into account when test scores are used to make high-stakes decisions about individual students.

Validity of Test Interpretation and Use

Validity asks what a test is measuring, and what meaning can be drawn from the results. Hence, what is to be validated is not the test per se but rather the inferences derived from the test scores and the actions that follow (Cronbach, 1971). On one hand, for example, the validity of a proficiency test can be subverted by inappropriate test preparation, such as having students practice on the actual test items or teaching students testwise strategies that might increase test scores without actually improving the skills the test is intended to measure. On the other hand, test preparation that familiarizes students with the test format and reduces anxiety may actually improve validity: scores that formerly were invalidly low because of anxiety might now become validly higher (Messick, 1982).

In essence, then, test validation is an empirical evaluation of test meaning and use. It is both a scientific and a rhetorical process, requiring both evidence and argument. Because the meaning of a test score is a construction based on an understanding of the performance underlying the score, as well as the pattern of relationships with other variables, the literature of psychometrics views the fundamental issue as construct validity.

The major threats to construct validity are construct underrepresentation (the test does not capture important aspects of the construct) and construct irrelevance (the test measures more than its intended construct). Test validation seeks evidence and arguments to discount these two threats and to evaluate the actions that are taken as a result of the scores.
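
Returning briefly to the standard error of measurement described under Reliability of Measurement, the "between x and y" statement can be made concrete with a small sketch. It assumes the classical test theory relation SEM = SD * sqrt(1 - reliability), which this chapter does not state explicitly, and it centers the band on the observed score, a common approximation. The function names and the sample numbers are illustrative assumptions, not values from this report.

```python
import math


def standard_error_of_measurement(score_sd, reliability):
    """Classical test theory SEM (assumed formula): SD * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return score_sd * math.sqrt(1.0 - reliability)


def score_band(observed, score_sd, reliability, z=1.96):
    """Approximate band around an observed score; z=1.96 gives about 95 percent."""
    sem = standard_error_of_measurement(score_sd, reliability)
    return observed - z * sem, observed + z * sem


# Hypothetical test: scores have SD 10 and the reliability estimate is 0.91.
low, high = score_band(observed=70, score_sd=10, reliability=0.91)
print(f"95 percent band: {low:.1f} to {high:.1f}")
```

With these illustrative inputs the standard error is about 3 points, so the band runs from roughly 64 to 76. A cut score that falls inside such a band cannot confidently classify the student on this score alone.
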

Six Aspects of Construct Validity

Evaluating the validity of a test requires attention to a number of interrelated and persistent questions, such as:

Are the right things being measured in the right balance?
Is the scoring system consistent with the structure of the domain about which inferences or predictions are being made?
Are the scores reliable and consistent across the different contexts for which they are used, as well as across different population groups?
Are the scores applied fairly for the proposed purposes—that is, consistently and equitably across individuals and groups?
What are the short- and long-term consequences of score interpretation and use?
Are the consequences supportive of the general purposes for giving the test in the first place?

Validity is now widely viewed as an integral or unified concept (American Educational Research Association et al., 1985). Therefore, establishing validity requires the collection and integration of multiple complementary forms of evidence to answer an interdependent set of questions, such as those above. Nevertheless, differentiating validity into its several distinct aspects can clarify issues and nuances that might otherwise be downplayed or overlooked.

One useful way of looking at validity is to distinguish aspects of construct validity: content, substantive, structural, generalizable, external, and consequential. In effect, these six aspects function as general validity criteria or standards for all educational and psychological measurement (Messick, 1989, 1995). Taken together, these aspects of construct validity incorporate the three standards for test use named in Chapter 1—that is, appropriate measurement, appropriate attribution of cause, and appropriate treatment. In subsequent chapters, examples of the types of evidence that might be collected to address each of the six are provided in the context of using test scores for tracking, promotion, and graduation decisions.

The content aspect of construct validity (Lennon, 1956; Messick, 1989) refers to the extent to which test content represents an appropriate sample of the skills and knowledge that are the goals of instruction. Key issues here are specifying the boundaries of the content domain to be assessed and selecting tasks that are representative, so that all important parts of the domain are covered. Experts usually make these judgments. Also of concern here is the technical quality of the test items—for example, is the reading level appropriate and is the phrasing unambiguous?

The substantive aspect refers to the cognitive processes that underlie student performance and correlations across items. This aspect of validity calls for models of the cognitive processes required by the tasks
(Embretson, 1983), as well as empirical evidence that test takers are in fact using those processes. Note these two important points: the need for tests to assess processes in addition to the traditional coverage of content and the need to move beyond traditional professional judgment of content to accrue empirical evidence that the assumed processes are actually at work (Embretson, 1983; Loevinger, 1957; Messick, 1989). For instance, it would be desirable to have evidence that a test item intended to measure problem solving does in fact tap those skills and not just elicit a memorized solution. One way to collect such evidence during test development might be to observe a sample of students and ask them to think aloud as they work the test items.

The structural aspect (Loevinger, 1957; Messick, 1989) appraises the degree to which the score scales are consistent with the structure of the domain being measured. The theory of the construct domain should guide not only the creation of assessment tasks but also the development of scoring criteria. For example, on a test of American history and government, an item dealing with the functions of the judiciary might be weighted more heavily than an item asking the date of the Gadsden purchase.

The generalizable aspect examines the extent to which scores and interpretations are consistent across assessment tasks, populations, and settings (Cook and Campbell, 1979; Feldt and Brennan, 1989; Shulman, 1970). An assessment should provide representative coverage of the content and processes of the domain being tested, so that the score is a valid measure of the student's knowledge of the broader construct, not just the particular sample of items on the test. For instance, a test might require students to write short essays on several topics, each with a particular purpose and audience in mind. The degree to which one can generalize about a student's writing skill from such a test depends on the strength of the correlations between the tasks focusing on different topics and genres. In one sense, this aspect of validity intersects with reliability: it refers to the consistency of performance across the tasks, occasions, and raters of a particular assessment (Feldt and Brennan, 1989). But in a second sense, generalizable validity refers to transfer, that is, the consistency of performance across tasks that are representative of the broader construct domain. Transfer refers to the range of tasks that performance on the tested tasks facilitates the learning of—or, more generally, is predictive of (Ferguson, 1956).
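
The dependence of generalizability on the consistency of performance across tasks can be illustrated with a short sketch. It assumes coefficient alpha as the summary index of cross-task consistency, a common choice that this chapter does not itself prescribe. The function name `cronbach_alpha` and the essay ratings are hypothetical.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Coefficient alpha for a table of scores.

    scores: one row per student, one column per task (for example, essay topics).
    """
    n_tasks = len(scores[0])
    if len(scores) < 2 or n_tasks < 2:
        raise ValueError("need at least two students and two tasks")
    # Sample variance of each task's scores across students.
    task_variances = [variance([row[j] for row in scores]) for j in range(n_tasks)]
    # Sample variance of the students' total scores.
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total scores do not vary")
    return (n_tasks / (n_tasks - 1)) * (1 - sum(task_variances) / total_variance)


# Hypothetical ratings: five students, three essay tasks, each rated 1 to 5.
essays = [[4, 5, 4], [2, 3, 3], [5, 5, 4], [3, 2, 3], [1, 2, 2]]
print(round(cronbach_alpha(essays), 3))
```

For these made-up ratings alpha is about 0.92, because students who do well on one topic tend to do well on the others. If the tasks ranked students very differently, alpha would fall, and a score based on a few tasks would support weaker generalizations about writing skill.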

The issue of generalizable validity is particularly relevant to so-called performance assessments, which are designed to measure higher-order thinking skills in real-world contexts. Examples of performance assessments include writing an essay, conducting an experiment, and solving an open-ended mathematical problem and explaining one's reasoning. Performance assessments tend to involve a small number of tasks, each of which takes a lot of time. Thus, there is a conflict in performance assessment between time-intensive depth of examination on any one task and the number of tasks needed for broad domain coverage. This problem must be carefully negotiated in designing performance assessments (Wiggins, 1993).

The external aspect of construct validity refers to the extent to which performance on a test is related to external variables. These correlations may be either high or low; they are predicted by the theory underlying the construct being assessed (Campbell and Fiske, 1959; Cronbach and Gleser, 1965). Convergent evidence shows that the test measure in question is in fact related to other variables that it should, theoretically, relate to. For example, a test of math computation would be expected to correlate highly with whether a person can make correct change in a cashier's job. Discriminant evidence shows that the measure is not unduly related to other measures. Other things being equal (e.g., testing conditions, reliability of the measures), a computation test should not correlate as highly with a reading test as with another computation test.

It is especially important to examine the external relationships between test scores and criterion measures (that is, the desired behaviors that the test is intended to indicate or predict) when using test scores for selection, placement, certification of competence, program evaluation, and other kinds of accountability. So, for instance, before a college admissions officer uses a test to make decisions, she must have evidence that there is indeed a relationship between scores on that test and performance in college course work.

The consequential aspect—which corresponds most directly to the third of the three standards for test use named in Chapter 1, appropriate treatment—includes evidence and rationales for evaluating the intended and unintended consequences of score interpretation and use in both the short and long terms. Ideally, there should be no adverse consequences associated with bias in scoring and interpretation, with unfairness in test use, or with negative effects on teaching and learning (Messick, 1980, 1989). Test makers view it as their responsibility to minimize negative
impact on individuals or groups due to any source of test invalidity, such as construct underrepresentation or construct-irrelevant variance (Messick, 1989). That is, validity is compromised when the assessment is missing something relevant to the focal construct that, if present, would have permitted the affected examinees to display their competence. Similarly, scores may be invalidly low because the measurement contains something irrelevant that interferes with the examinees' demonstration of competence. In contrast, adverse consequences associated with the accurate measure of an individual's knowledge or skills—such as low scores resulting from poor teaching or limited opportunity to learn—are not the test makers' responsibility but that of the test users. Adverse consequences that result from such test scores represent problems not of measurement but of something else, such as teaching or social policy.

It is important that a strong set of validity evidence be collected when there are high individual stakes attached to test use. It should be clear that test validity cannot rely on any single one of these complementary forms of evidence. Neither does overall validity require a high level of every form, if there is good evidence supporting score meaning. What is required is a compelling argument that the available evidence justifies the test interpretation and use, even though some pertinent evidence may be lacking.

Validity as Integrative Summary

The six aspects of construct validity explained above apply to all educational and psychological measurement, including performance and other alternative assessments. Taken together, they provide a way of addressing the multiple and interrelated validity questions that need to be answered in justifying score interpretation and use. This is what is meant by validity as a unified concept. One can set priorities about the forms of evidence needed to justify the inferences drawn from test scores (Kane, 1992; Shepard, 1993). The key point is that the six aspects of construct validity provide a means of checking that the rationale or argument that supports a particular test use touches the important bases. If not, an argument should be provided that explains why such omissions are defensible.

It should be clear that what needs to be validated is not the test in general or in the abstract, but rather each inference that is made from the
test scores and each specific use to which the test is put. Although there is a natural tendency to use existing tests for new and different purposes, each new purpose must be validated in its own right.

Fairness in Testing

There remains one overarching issue related to the validity of test use: fairness. Fairness, like validity, is not just a psychometric issue. It is also a social value, and there are alternative views about its essential features. In regard to test use, the core meaning of fairness we are concerned with here is comparable validity: a fair test is one that yields comparably valid scores from person to person, group to group, and setting to setting (Willingham, 1998). So, for example, if an assessment results in scores that substantially underestimate or overestimate the knowledge or skills of members of a particular group, then the test would be considered unfair. If an assessment claims to measure a single construct across groups, but in fact measures different constructs in different groups, it would also be unfair.

Alternative Views of Fairness

There are alternative views of fairness, but most relate to the central idea of comparable validity. For instance, the 1998 revision, Draft Standards for Educational and Psychological Testing (American Educational Research Association et al., 1998), cites four alternative views of fairness found in the technical and popular literature. Two of these views characterize fairness, respectively, as the absence of bias and as equitable treatment of all examinees in the testing process.

Bias is said to arise when deficiencies in the test itself result in different meanings for scores earned by members of different identifiable subgroups. For example, a test intended to measure verbal reasoning should include words in general use, not words and expressions associated with particular cultures or locations, as this might unfairly advantage test takers from these cultural or geographical groups. If these words or expressions are not removed from the test, then the unfair advantage could result in a lack of comparable score meaning across groups of test takers.

Fairness as equitable treatment of all examinees in the testing process requires that examinees be given a comparable opportunity to demonstrate
their understanding of the construct(s) the assessment is intended to measure. Fair treatment includes such factors as appropriate testing conditions, opportunity to become familiar with the test format, and access to practice materials. There is broad consensus that tests should be free from bias and that all examinees should be treated fairly in the testing process itself.

The third view found in the literature characterizes fairness in terms of opportunity to learn. Opportunity to learn is an important issue to consider when evaluating the comparability of score meaning across groups. For example, if two classes of students are given the same test, and students from class A have been previously taught the material whereas students from class B have not, then the resulting scores would have different meanings for the two groups. Opportunity to learn is especially relevant in the context of high-stakes assessments of what a test taker knows or can do as a result of formal instruction. If some test takers have not had the opportunity to learn the material covered by the test, they are more likely to get low scores. These scores may accurately reflect their knowledge, but only because they have not had the opportunity to learn the material tested. In this instance, using these test scores as a basis for a high-stakes decision, such as withholding a high school diploma, would be viewed as unfair. The issue of opportunity to learn is discussed further in Chapters 6 and 7.

The fourth view of fairness involves equality of testing outcomes. But the idea that fairness requires overall passing rates to be equal across groups is not generally accepted in the professional literature. This is because unequal test outcomes among groups do not in themselves signify test unfairness: tests may validly document group differences that are real and may be reflective in part of unequal opportunity to learn (as discussed above). There is consensus, however, that a test should not systematically underpredict the performance of any group.

One final point needs to be made here. In discussing fairness, it is important to distinguish between equality (the state of being the same) and equity (justness or fairness) and to recognize that not all inequalities are inequities. Indeed, in education as in medicine, the watchword should not be equal treatment, but rather treatment appropriate to the characteristic and sufficient to the need (Gordon, 1998). This brings us to the issue of allowing room for accommodations to the different needs of students.
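
As an aside on the consensus point above, that a test should not systematically underpredict the performance of any group, the following sketch shows one simple way such a check might be set up. It assumes a single straight-line prediction of later performance from test scores, fitted to all examinees together, and then compares the average prediction error by group. The function names, group labels, and data are hypothetical, and this is an illustration rather than a method prescribed in this report.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least-squares line y = intercept + slope * x."""
    mean_x, mean_y = mean(x), mean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("test scores do not vary")
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    return slope, mean_y - slope * mean_x


def mean_residual_by_group(records):
    """records: (group, test_score, later_performance) tuples.

    Returns the average of (actual - predicted) per group, using one common
    prediction line. A clearly positive value means that group performed
    better than the test predicted, that is, it was underpredicted.
    """
    slope, intercept = fit_line([r[1] for r in records], [r[2] for r in records])
    residuals = {}
    for group, score, outcome in records:
        residuals.setdefault(group, []).append(outcome - (intercept + slope * score))
    return {group: mean(values) for group, values in residuals.items()}


# Hypothetical data: group B earns one point more than group A at every score.
records = [("A", 1, 1), ("A", 2, 2), ("A", 3, 3),
           ("B", 1, 2), ("B", 2, 3), ("B", 3, 4)]
print(mean_residual_by_group(records))
```

In this made-up example the common line underpredicts group B by half a point on average and overpredicts group A by the same amount. Real analyses would also need to consider sampling error and the reliability of both measures.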

Equity and Accommodations in the Testing Process

The issue of equity and the need for testing accommodations goes directly to the heart of comparable construct validity and the fairness of the testing process. It is important to distinguish two kinds of comparability. One, called score comparability, means that the properties of scores, such as reliabilities, internal patterns of relationships between items, and external relationships with other variables, are comparable across groups and settings. Score comparability is important in justifying uniform score interpretation and use for different groups and in different circumstances. The other kind, called task comparability, means that the tested task elicits the same cognitive processes across different groups and different circumstances.

Within task comparability, two types of processes may be distinguished: those that are relevant to the construct measured and those that are ancillary to the construct but nonetheless involved in task performance. Comparability of construct-relevant processes is necessary to sustain common score meaning across groups and contexts. Ancillary processes may be modified without jeopardizing score meaning. This provides a fair and legitimate basis for accommodating tests to the needs of students with disabilities and those who are English-language learners (Willingham et al., 1988). For example, a fair accommodation might be to read a mathematics test aloud to a student with certain disabilities, because reading is ancillary to the construct being measured (mathematics), whereas it would not be fair to read a reading test aloud. The availability of multimedia test presentation and response modes on computers promises an accommodation to serve the needs of certain students with disabilities, such as visually impaired and hearing-impaired students (Bennett, 1998).

Thus, comparable validity—and test fairness—do not require identical task conditions, but rather common construct-relevant processes, with ignorable construct-irrelevant or ancillary processes that may be different across individuals and groups. Such accommodations, of course, have to be justified with evidence that score meaning and properties have not been unduly eroded in the process.

Fairness Issues Throughout the Testing Process

Fairness, like validity, cannot be properly addressed as an afterthought once the test has been developed, administered, and used. It must be confronted throughout the interconnected phases of the testing process,
from test design and development to administration, scoring, interpretation, and use. Indeed, one of the most critical fairness issues occurs at the design stage: the choice of constructs to measure. For example, consider the possible test requirements for awarding high school diplomas. If the test emphasizes reading and writing rather than science and mathematics, then graduation rates for males and females, as well as for language-minority students, will be quite different (Willingham and Cole, 1997). Any finite set of subjects covered by the test is likely to yield different graduation rates for different groups, because it underrepresents the broad construct of school learning and because students have different opportunities to learn. Some alternatives are to assess school learning more comprehensively, to use more than one assessment mode (high school grades as well as test scores), and to justify any limited choice of subjects in terms of the social values of the school and the community.

There are other fairness considerations in test design, such as the format of the items (short-answer versus multiple-choice) and the contexts in which items are cast, which may be more familiar to some examinees than to others. Bias in test questions is usually addressed empirically by examining whether individuals having the same knowledge level (as defined by their overall score on the test) but from different groups have different probabilities of getting a particular question correct.

Fairness issues arise in the administration of tests because of nonstandard testing conditions, such as those related to the environment (lighting, space, temperature) and the directions given to students, that may disadvantage some examinees.
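
The empirical check on test questions described above, comparing examinees with the same overall score but from different groups, can be sketched as follows. This minimal example assumes the Mantel-Haenszel common odds ratio as the summary statistic, a widely used choice that this chapter does not name. The function name, the group labels "reference" and "focal", and the data are hypothetical.

```python
from collections import defaultdict


def mantel_haenszel_odds_ratio(responses, reference="reference"):
    """Common odds ratio for one item, stratified by total test score.

    responses: (group, total_score, item_correct) tuples.
    A value near 1 suggests that examinees with the same total score have
    similar odds of answering the item correctly in both groups; a value
    well above 1 suggests the item favors the reference group.
    """
    # Per stratum: [reference right, reference wrong, focal right, focal wrong]
    strata = defaultdict(lambda: [0, 0, 0, 0])
    for group, total_score, item_correct in responses:
        cell = (0 if group == reference else 2) + (0 if item_correct else 1)
        strata[total_score][cell] += 1
    numerator = denominator = 0.0
    for ref_right, ref_wrong, focal_right, focal_wrong in strata.values():
        n = ref_right + ref_wrong + focal_right + focal_wrong
        numerator += ref_right * focal_wrong / n
        denominator += ref_wrong * focal_right / n
    if denominator == 0:
        raise ValueError("odds ratio is undefined for these data")
    return numerator / denominator


# Hypothetical item: among examinees with a total score of 10, 8 of 10
# reference-group members answer correctly versus 4 of 10 focal-group members.
data = ([("reference", 10, True)] * 8 + [("reference", 10, False)] * 2
        + [("focal", 10, True)] * 4 + [("focal", 10, False)] * 6)
print(mantel_haenszel_odds_ratio(data))
```

Matching on total score is what separates this check from a simple comparison of pass rates: a group difference on an item counts as evidence of possible bias only when it persists among examinees at the same overall level.
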

Fairness is also an issue whenever scoring is not completely objective, as with the hand-scoring of constructed-response items, or when raters are influenced by group-related characteristics of an examinee that are irrelevant to the construct and purpose of the test. There is an inherent conflict of interest when teachers administer high-stakes tests to their own students or score their own students' exams. On one hand, teachers want valid information about how well their students are performing. On the other hand, there is often substantial external pressure on teachers (as well as principals and other school personnel) for their students to earn high scores. This external pressure may lead some teachers to provide inappropriate assistance to their students before and during the test administration or to mis-score exams.

Fairness issues related to test use include relying unduly on a single score and basing decisions on an underrepresented view of the relevant
--> construct (Willingham, 1998). In contexts in which tests are used to make predictions of subsequent performance (e.g., grades), fairness also requires comparability of predictions for different groups. The latter concern is particularly important in the case of tests used for placement, such as tracking and some types of promotion decisions. For such uses, there should be evidence that the relationships between scores on the test and subsequent performance in certain tracks or at a certain grade level are comparable from group to group. 1 In conclusion, what needs to be comparable across groups and settings for fair test use is score meaning and the actions that follow. That is, test fairness derives from comparable construct validity (which may draw on all six aspects of validity discussed earlier). These issues of fairness surrounding test use are explored in greater detail in Chapters 5, 6, and 7. Codified Standards The issue of testing standards is not new, and there have been a number of useful documents over the years attempting to codify the principles of good practice. The most recent efforts bearing on the educational uses of tests include the Standards for Educational and Psychological Testing of the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education (1985), currently under revision; the Code of Fair Testing Practices in Education (Joint Committee on Testing Practices, 1988); Responsibilities of Users of Standardized Tests (Association for Measurement 1   Considerable attention has been given to developing fair selection models in the context of college admissions and job entry. These models put a heavy emphasis on predictive validity (the extent to which test scores predict some desired future performance) but at the expense of other aspects of construct validity. 
In one way or another, all of the fair selection models address the possibility of differences in the predictor-criterion relationship for different groups (Cleary, 1968; Cole, 1973; Linn, 1973; Thorndike, 1971). With the recognition that fundamental value differences are at issue in fair selection, several utility models were developed that go beyond these selection models in that they require specific value positions to be articulated (Cronbach, 1976; Gross and Su, 1975; Petersen and Novick, 1976; Sawyer et al., 1976). In this way, social values are incorporated explicitly into the measurement technology involved in selection models. The need to make values explicit does not, however, determine or make easier the hard choices among them.

OCR for page 71
and Evaluation in Counseling and Development, 1992); Responsible Test Use: Case Studies for Assessing Human Behavior (Eyde et al., 1993); and the Code of Professional Responsibilities in Educational Measurement (National Council on Measurement in Education, 1995). These official statements of professional societies offer helpful guidelines; this report attempts both to build on and to go beyond them. The existing codes alert practitioners to important issues that deserve attention, but they do so in general terms. In this volume, we attempt to inform professional judgment specifically, with respect to the use of tests for student tracking, for grade promotion or retention, and for awarding or withholding diplomas.

One of the limitations of existing testing guidelines is that compliance is essentially voluntary. There are no monitoring or enforcement mechanisms in place to ensure that producers and users of tests will understand and follow the guidelines. Chapter 11 considers some potential methods, practices, and safeguards that might be put in place in the future to better ensure proper test use.

References

American Educational Research Association, American Psychological Association, and National Council on Measurement in Education
1985 Standards for Educational and Psychological Testing. Washington, DC: American Psychological Association.
1998 Draft Standards for Educational and Psychological Testing. Washington, DC: American Psychological Association.

Association for Measurement and Evaluation in Counseling and Development
1992 Responsibilities of Users of Standardized Tests. Alexandria, VA: American Association for Counseling and Development.

Bennett, R.E.
1998 Computer-based testing for examinees with disabilities: On the road to generalized accommodation. In Assessment in Higher Education, S. Messick, ed. Mahwah, NJ: Erlbaum.

Campbell, D.T., and D.W. Fiske
1959 Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin 56:81–105.

Cleary, T.A.
1968 Test bias: Prediction of grades of Negro and white students in integrated colleges. Journal of Educational Measurement 5(2):115–124.

Cole, N.S.
1973 Bias in selection. Journal of Educational Measurement 10(4):237–255.

Cook, T.D., and D.T. Campbell
1979 Quasi-Experimentation: Design and Analysis Issues for Field Settings. Chicago, IL: Rand McNally.

Cronbach, L.J.
1971 Test validation. Pp. 443–507 in Educational Measurement, 2nd Edition, R.L. Thorndike, ed. Washington, DC: American Council on Education.
1976 Equity in selection: Where psychometrics and political philosophy meet. Journal of Educational Measurement 13(1):31–41.

Cronbach, L.J., and G.C. Gleser
1965 Psychological Tests and Personnel Decisions, 2nd Edition. Urbana: University of Illinois Press.

Embretson (Whitely), S.
1983 Construct validity: Construct representation versus nomothetic span. Psychological Bulletin 93:179–197.

Eyde, L.D., G.J. Robertson, S.E. Krug, K.L. Moreland, A.G. Robertson, C.M. Shewan, P.L. Harrison, B.E. Porch, A.L. Hammer, and E.S. Primoff
1993 Responsible Test Use: Case Studies for Assessing Human Behavior. Washington, DC: American Psychological Association.

Feldt, L.S., and R.L. Brennan
1989 Reliability. Pp. 105–146 in Educational Measurement, 3rd Edition, R.L. Linn, ed. New York: American Council on Education and Macmillan Publishing Co.

Ferguson, G.A.
1956 On transfer and the abilities of man. Canadian Journal of Psychology 10:121–131.

Gordon, E.
1998 Human diversity and equitable assessment. In Assessment in Higher Education, S. Messick, ed. Mahwah, NJ: Erlbaum.

Gross, A.L., and W. Su
1975 Defining a "fair" and "unbiased" selection model: A question of utilities. Journal of Applied Psychology 60:345–351.

Joint Committee on Testing Practices
1988 Code of Fair Testing Practices in Education. Washington, DC: National Council on Measurement in Education.

Kane, M.T.
1992 An argument-based approach to validity. Psychological Bulletin 112(Nov):527–535.

Lennon, R.T.
1956 Assumptions underlying the use of content validity. Educational and Psychological Measurement 16:294–304.

Linn, R.L.
1973 Fair test use in selection. Review of Educational Research 43:139–161.

Loevinger, J.
1957 Objective tests as instruments of psychological theory. Psychological Reports 3:635–694 (Monograph Supplement 9).

Messick, S.
1980 Test validity and the ethics of assessment. American Psychologist 35(11):1012–1027.
1982 Issues of effectiveness and equity in the coaching controversy: Implications for educational and testing policy. Educational Psychologist 17(2):67–91.
1989 Validity. Pp. 13–103 in Educational Measurement, 3rd Edition, R.L. Linn, ed. New York: Macmillan.
1995 Validity of psychological assessment: Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. American Psychologist 50(9):741–749.

National Council on Measurement in Education, Ad Hoc Committee on the Development of a Code of Ethics
1995 Code of Professional Responsibilities in Educational Measurement. Washington, DC: National Council on Measurement in Education.

Petersen, N.S., and M.R. Novick
1976 An evaluation of some models for culture-fair selection. Journal of Educational Measurement 13(1):3–29.

Sawyer, R.L., N.S. Cole, and J.W.L. Cole
1976 Utilities and the issue of fairness in a decision theoretic model for selection. Journal of Educational Measurement 13(1):59–76.

Shepard, L.A.
1993 Evaluating test validity. Review of Research in Education 19:405–450.

Shulman, L.S.
1970 Reconstruction of educational research. Review of Educational Research 40:371–396.

Thorndike, R.L.
1971 Concepts of culture fairness. Journal of Educational Measurement 8(2):63–70.

Wiggins, G.
1993 Assessment: Authenticity, context, and validity. Phi Delta Kappan 75(3):200–214.

Willingham, W.W.
1998 A systemic view of test validity. In Assessment in Higher Education, S. Messick, ed. Mahwah, NJ: Erlbaum.

Willingham, W.W., and N.S. Cole
1997 Gender Bias and Fair Assessment. Hillsdale, NJ: Erlbaum.

Willingham, W.W., M. Ragosta, R.E. Bennett, H. Braun, D.A. Rock, and D.E. Powers, eds.
1988 Testing Handicapped People. Boston: Allyn and Bacon.

PART II USES OF TESTS TO MAKE HIGH-STAKES DECISIONS ABOUT INDIVIDUALS