7 Judging the Quality and Utility of Assessments

In this chapter we review important characteristics of assessment instruments that can be used to determine their quality and their utility for defined situations and purposes. We review significant psychometric concepts, including validity and reliability, and their relevance to selecting assessment instruments, and we discuss two major classes of instruments and the features that determine the uses to which they may appropriately be put. Next we review methods for evaluating the fairness of instruments, and finally we present three scenarios illustrating how the process of selecting assessment instruments can work in a variety of early childhood care and educational assessment circumstances.

Many tests and other assessment tools are poorly designed. The failure of assessment instruments to meet the psychometric criteria of validity and reliability may be hard for the practitioner or policy maker to recognize, but these failings reduce the usefulness of an instrument severely. Such characteristics as ease of administration and attractiveness are, understandably, likely to be influential in test selection, but they are of less significance than the validity and reliability considerations outlined here.

Validity and reliability are technical concepts, and this chapter addresses some technical issues. Appendix A is a glossary of words and concepts to assist the reader. Especially for Chapter 7, many readers may want to focus primarily on identifying the questions they need to ask about assessments under consideration and understanding the concepts well enough to appreciate the responses, rather than on a deep understanding of the statistical processes that determine how those questions can be answered.

VALIDITY AND RELIABILITY OF ASSESSMENTS

Before an assessment instrument or test is used for the purpose of making decisions about children, it is necessary to have evidence showing that the assessment does what it claims to do, namely, that it accurately measures a characteristic or construct (or "outcome" as we are referring to it in this report). The evidence that is gathered to support the use of an assessment is referred to as validity evidence. Generally, when one asks the question "Is the assessment doing what it is supposed to do?" one is asking for validity evidence. A special kind of validity evidence relates to the consistency of the assessment; this may be consistency over repeated assessment or over different versions or forms of the assessment. This is termed reliability evidence.

This chapter reviews the history and logic of validity and reliability evidence, especially as it pertains to infants and young children. It is important to note that, first, when judging validity or reliability, one is judging a weight of evidence. Hence, one does not say that an assessment is "valid" or is "reliable"; instead, one uses an accumulation of evidence of diverse kinds to judge whether the assessment is suitable for the task for which it is intended. Second, when mustering evidence for validity or reliability, the evidence will pertain to specific types of uses (i.e., types of decisions). Some forms of evidence inform a wider range of types of decisions than others. Nonetheless, one should always consider evidence as pertaining to a specific set of decisions.
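
As a rough illustration of consistency over repeated assessment, the sketch below correlates scores from two administrations of the same instrument. Using a Pearson correlation for this purpose is a common convention rather than something this chapter prescribes, and the scores, the two-week interval, and all variable names are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for six children assessed twice, two weeks apart.
first_administration = [12, 15, 9, 20, 17, 11]
second_administration = [13, 14, 10, 19, 18, 10]

test_retest_r = pearson_r(first_administration, second_administration)
print(f"test-retest correlation: {test_retest_r:.2f}")
```

A coefficient like this is one strand of reliability evidence for one use of the instrument; in keeping with the weight-of-evidence view above, it would not by itself justify calling the assessment "reliable."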

Brief Overview of the History of Validity Evidence

The field of assessment of human behavior and development is an evolving one and has undergone many changes in the last half-century. Some changes are the result of developments in the field itself; others are responses to the social and political context in which the field operates. Validity is an enduring criterion of the quality and utility of assessments, although conceptions of what constitutes validity of assessments have changed over time.

Criterion Validity

Originally, the conception of assessment validity was limited to prediction, specifically, to the closeness of agreement between what the assessment actually assesses or measures and what it is intended to assess or measure (Cureton, 1951). Put differently, at the core of this definition of validity is the relationship between the actual scores obtained on a test or other assessment instrument and the score on another instrument considered to be a good assessment of the underlying "true" variable or construct. Under this model of validity (the criterion model), if there already exists a criterion assessment that is considered to be a true measure of the construct, then a test or other measurement instrument is judged to be valid for that construct if the latter instrument provides accurate estimates of the criterion (Gulliksen, 1950). The accuracy of the estimates is usually gauged using a correlation coefficient.

Among the advantages of the criterion model of validity are its relevance and potential objectivity. After a criterion has been specified, data can be collected and analyzed in a straightforward manner to ascertain its correlation with the measure being validated. It is not always easy, however, to identify a suitable or adequate criterion. When one considers criterion-related validity evidence, for example, the size of the correlation between test scores and criterion can differ across settings, contexts, or populations, suggesting that a measure be validated separately for every situation, context, or population for which it may be used. In many instances, criterion-related evidence is quite relevant to the interpretations or claims that can be made about the uses of assessments.
In addition, questions about the validity of the criterion itself often remain unanswered or are difficult to answer without resorting to circular reasoning, for example, when scores on a test of cognitive development are the validity criterion for scores on a different test of cognitive development. Moreover, decisions involving the choice of a criterion involve judgments about the value of the criterion.
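
The point that the test-criterion correlation can differ across settings or populations can be made concrete with a small sketch. The code below computes a separate correlation for each of two hypothetical settings; the data, the setting labels, and the helper name `pearson_r` are illustrative assumptions, not material from the studies cited above.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return cov / sqrt(var_x * var_y)


# Hypothetical (new_test_score, criterion_score) pairs, grouped by setting.
scores_by_setting = {
    "setting_a": [(10, 11), (14, 15), (18, 17), (22, 23), (26, 25)],
    "setting_b": [(10, 18), (14, 12), (18, 20), (22, 15), (26, 21)],
}

# One criterion-related coefficient per setting, rather than a single pooled value.
validity_coefficients = {}
for setting, pairs in scores_by_setting.items():
    test_scores = [test for test, _ in pairs]
    criterion_scores = [criterion for _, criterion in pairs]
    validity_coefficients[setting] = pearson_r(test_scores, criterion_scores)

for setting, r in validity_coefficients.items():
    print(f"{setting}: r = {r:.2f}")
```

In this made-up example the new test tracks the criterion closely in one setting and only weakly in the other, which is the kind of pattern that would argue for validating the measure separately for each intended use. Five cases per setting is far too few for a real study; the numbers are kept small only for readability.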

The "Three Types of Validity" Approach

If agreement with a criterion were the only form of validity evidence, then one could never validate a measure in a new area, because there is no preexisting criterion in the new area. Thus, new and broader types of evidence were needed. The criterion model of validity was followed by a more nuanced and amplified view of validity, which identified three types: content, construct, and criterion validity.

1. Content validity. The content model of validation seeks to provide a basis for validation without appealing to external criteria. The process of establishing content validity involves establishing a rational link between the procedures used to generate the test scores and the proposed interpretation or use of those scores (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 1999; Cronbach, 1971; Kane, 2006). In developing an assessment procedure or system, a set of specifications of the content domain is usually set forth describing the content areas in detail and the item types. Content here refers to the themes, wording, and format of the assessment items (e.g., tasks, questions) as well as the guidelines and procedures for administration and scoring.

Defining the content domain becomes critical because validity inferences can be challenged by suggestions that the domain definition is incomplete, irrelevant, or inappropriate. It is important to evaluate the appropriateness of an assessment tool's content domain with respect to the proposed uses of that tool. For example, an off-the-shelf test that is used for the purposes of evaluating an educational program may cover content that is part of the program's curriculum as well as material that was not part of that curriculum.
It is then up to those who interpret the program evaluation results to evaluate the children's achievement with respect to both the content-represented and content-unrepresented parts of the test.

Studies of alignment between early learning standards (e.g., state early learning standards, the Head Start Child Outcomes Framework) and assessments are a new variant of content-related validity evidence. Such standards are descriptions of what children should know and be able to do; benchmarks, a related concept, refer to descriptions of knowledge and skills that children should acquire by a particular age or grade level.

It is generally agreed by measurement professionals that content-related validity evidence is necessary but insufficient for validation. Other forms of validity evidence, such as empirical evidence based on relationships between scores and other variables, are also essential. The current shift in emphasis toward learning standards and aligned assessments does not alter this necessity for additional forms of validity evidence, and the growing consequences of assessments increase the importance of empirical evidence (Koretz and Hamilton, 2006).

2. Construct validity. Construct validity was originally introduced by Cronbach and Meehl (1955) as an alternative to content and criterion validity for assessments that sought to measure attributes or qualities that are theoretically defined but for which there is no adequate empirical criterion or definitive measure nor a domain of content to sample. They went on to emphasize, however, that "determining what psychological constructs account for test performance is desirable for almost any test" (p. 282). In other words, even if an assessment is validated through content- and criterion-related evidence, a deeper understanding of the construct underlying the performance on the test requires construct-related evidence (Kane, 2006).

Construct validity is also concerned with what research methodologists refer to as "confounding" (Campbell and Stanley, 1966; Cook and Campbell, 1979). This refers to the possibility that an assessment procedure that is intended to produce a measure of a particular construct, such as a child's level of quantitative knowledge, produces instead a measure that can be construed in terms of more than one construct.
For example, a measure of a child's quantitative knowledge might be confounded with the child's willingness to cooperate with the stranger who is conducting the assessment. This reaction of the child to the assessor is thus a rival interpretation of that intended by the assessment procedure. To circumvent this rival interpretation, the assessment procedure might include more efforts to establish rapport between the child and the assessor, paying special attention to the fact that some children are temperamentally shyer than others. If no correlation can be observed between a measure of shyness or willingness to cooperate and the measure of quantitative knowledge, then the rival interpretation can be ruled out.

It is a mistake to think that construct validity applies only to measures of theory-based constructs. In this report we depart from some historical uses of the term "construct," which limit the term to characteristics that are not directly observable but that are inferred from interrelated sets of observations. As noted in the Standards for Educational and Psychological Testing (1999), this limited use invites confusion because it causes some tests but not others to be viewed as measures of constructs. Following the Standards, we use the term "construct" more broadly as "the concept or characteristic that a test is designed to measure" (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 1999, p. 5).

3. Integrated views of validity. Current conceptions of assessment validity replace the content/criterion/construct trinitarian model and its reference to types of validity by a discussion of sources, or strands, of validity evidence, often including evidence regarding the consequences of the use of assessments. Cronbach (1971) argued that in order to explain a test score, "one must bring to bear some sort of theory about the causes of the test performance and about its implications" (p. 443). While recognizing the practicality of subdividing validity evidence into criterion, content, and construct, he called for "a comprehensive, integrated evaluation of a test" (p. 445). He emphasized that "one validates not a test, but an interpretation of data arising from a specified procedure" (p. 447). Messick (1989) echoed this emphasis.
The aim of current conceptions of assessment validity is to seek information relevant to a specific interpretation and use of the assessments; many strands of evidence can contribute to an understanding of the meaning of assessments.

Validity as Argument

Kane's (2006) treatment of validity is consonant with Messick's approach, although Kane emphasizes a general methodology for validation based on validity conceptualized as argument. In Kane's formulation, "to validate a proposed interpretation or use of test scores is to evaluate the rationale for its interpretation for use" (2006, p. 23).

In Kane's approach, validation involves two kinds of argument. An interpretive argument specifies the proposed interpretations and uses of test results. This argument consists of articulating the inferences and assumptions that link the observed behavior or test performance to the conclusions and decisions that are to be based on that behavior or performance. The validity argument is an evaluation of the interpretive argument. "To claim that a proposed interpretation or use is valid is to claim that the interpretive argument is coherent, that its inferences are reasonable, and that its assumptions are plausible" (Kane, 2006, p. 23). In other words, the validity argument begins by reviewing the interpretive argument as a whole to ascertain whether it makes sense. If the interpretive argument is reasonable, then its inferences and assumptions are evaluated by means of appropriate evidence. Any interpretive argument potentially contains many assumptions. If there is any reason for not taking for granted a particular assumption, that assumption needs to be evaluated.

The interpretive argument makes explicit the reasoning behind the proposed interpretations and uses, so that it can be clearly understood and evaluated. It also indicates which claims are to be evaluated through validation. For example, a child assessment procedure or instrument usually takes some performances by or observations of the child that are intended to be a sample of all possible performances or observations that constitute the instrument's target content domain.
The procedure assumes that the child's score on the instrument can be generalized to the entire domain, although the actual observed behaviors or performances may be only a small subset of the entire target domain. In addition, they may or may not be a representative sample of the domain. Standardization typically further restricts the sample of performances or observations by specifying the conditions of observation or performance. Although standardization is necessary to reduce measurement error, it causes the range of possible observations or performances to be narrower than that of the target domain. In other words, it can be seen that the interpretation of the child's observed behavior or performance as an indicator of his or her standing in the target domain requires a complex chain of inferences and generalizations that must be made clear as a part of the interpretive argument.

An interpretive argument for a measure of children's cognitive development in the area of quantitative reasoning, for example, may include inferences ranging from those involved in the scoring procedure (Is the scoring rule that is used to convert an observed behavior or performance by the child to an observed score appropriate? Is it applied accurately and consistently? If any scaling model is used in scoring, does the model fit the data?); to those involved in the generalization from observed score to universe of scores (Are the observations made of the child in the testing or observation situation representative of the universe of observations or performances defining the target cognitive domain? Is the sample of observations of the child's behavior sufficiently large to control for sampling error?); to extrapolation from domain score to level of development (or level of proficiency) of the competencies for that domain (Is the acquisition of lower level skills a prerequisite for attaining higher level skills? Are there systematic domain-irrelevant sources of variability that would bias the interpretation of scores as measures of the child's level of development of the target domain attributes?); to the decisions that are made, or implications drawn, on the basis of conclusions about developmental level on the target outcome domain (e.g., children with lower levels of the attribute are not likely to succeed in first grade; programs with strong effects on this measure are more desirable than those with weak effects). The decision inference usually involves assumptions that rest on value judgments.
These value assumptions may represent widely held cultural values for which there is societal consensus, or they may represent values on which there is no consensus or even bitter divisions, in which case they are readily identifiable for the purposes of validation. When the underlying decision assumptions represent widely held values, they can be difficult to identify or articulate for validation through scientific analysis.

The interpretive argument may also involve highly technical inferences and assumptions (e.g., scaling, equating). The technical sophistication of measurement models has reached such a high degree of complexity that they have become a "black box" even for many measurement professionals (Brennan, 2006, p. 14). Moreover, as Brennan further points out, many measurement models are operationalized in proprietary computer programs that can sometimes make it difficult or impossible for users to know important details of the algorithms and assumptions that underlie the manner in which measurement data are generated.

Ideally, the interpretive argument should be made as a part of the development of the assessment procedure or system. From the outset, the goal should be to develop an assessment procedure or system congruent with the proposed interpretation and use. In addition, efforts to identify and control sources of unwanted variance can help to rule out plausible alternative interpretations. Efforts to make the assessment system or procedure congruent with the proposed interpretation and uses provide support for the plausibility of the interpretive argument. In practice, this developmental stage is likely to overlap considerably with the appraisal stage, but at some point in the process "a shift to a more arm's-length and critical stance is necessary in order to provide a convincing evaluation of the proposed interpretation and uses" (Kane, 2006, p. 25). Kane views this shift as necessary because it is human nature (appropriate and probably inevitable) for the developers to have a confirmationist bias since they are trying to make the assessment system as good as it can be. The development stage thus has a legitimate confirmationist bias: its purpose is to develop an assessment procedure and a plausible interpretive argument that reflects the proposed interpretations and uses of test scores.

After the assessment instrument or system is developed but still as a part of the development process, the inferences and assumptions in the interpretive argument should be evaluated to the extent possible.
Any problems or weaknesses revealed by this process would indicate a need for alterations in either the interpretive argument or the assessment instrument. This iterative process would continue until the developers are satisfied with the congruence between the assessment instrument and the interpretive argument. This iterative process is similar to that of theory development and refinement in science; here the interpretive argument plays the role of the theory.

When the development process is considered complete, it is appropriate for the validation process to take a "more neutral or even critical stance" (Kane, 2006, p. 26). Thus begins the appraisal stage. If the development stage has not delivered an explicit, coherent, detailed interpretive argument linking observed behavior or performance to the proposed interpretation and uses, then the development stage is considered incomplete, and thus a critical evaluation of the proposed interpretation is premature (Kane, 2006). The following events should occur during the appraisal stage:

1. Conduct studies of questionable inferences and assumptions in the interpretive argument. To the extent that the proposed interpretive argument withstands these challenges, confidence in the claims increases. "If they do not withstand these challenges, then either the assessment procedure or the interpretive argument has to be revised or abandoned" (Kane, 2006, p. 26).
2. Search for hidden assumptions, including value judgments, seeking to make such assumptions explicit and subject them to scrutiny (e.g., by individuals with different values).
3. Conduct investigations of alternative possible interpretations of the scores. An effective way to challenge an interpretive argument is to propose an alternative, more plausible argument. The evaluation of plausible competing interpretations is an important component in the appraisal of the proposed interpretive argument.

Ruling Out Plausible Alternative Hypotheses

It is important to recognize that one never establishes the validity of an assessment instrument or system; rather, one validates a score, and its typical uses, yielded by the instrument (Messick, 1989). For example, depending on the circumstances surrounding an assessment (e.g., the manner of test administration, the characteristics of the target population), the same instrument can produce valid or invalid scores.

The essence of validity, then, can be stated in the question, "To what extent is an observed score a true or accurate measure of the construct that the assessment instrument intends to measure?" Potential threats to validity are extraneous sources of variance (or construct-irrelevant variance) in the observed scores. These extraneous or irrelevant sources of variance are typically called measurement error. As in the process of conducting scientific research, the validity question can be stated in the form of a hypothesis: "The observed score is a true or accurate reflection of the target construct." The task of validating is to identify and rule out plausible alternate hypotheses regarding what the observed score measures. In a very fundamental sense, as is the case in science, one never "proves" or "confirms" the assessment hypothesis; rather, the successful assessment hypothesis is tested and escapes being disconfirmed. (The term assessment hypothesis is used here to refer to the hypothesis that specifies what the intended meaning of the observed score is, i.e., what the assessment instrument is intended to measure.) In this sense, the results of the validation process "probe" but do not prove the assessment hypothesis (Campbell and Stanley, 1966; Cook and Campbell, 1979). A valid set of scores is one that has survived such probing, but it may always be challenged and rejected by a new empirical probe. The task of validation, then, is to expose the assessment hypothesis to disconfirmation.

In short, varying degrees of confirmation are conferred upon the assessment hypothesis through the number of plausible rival hypotheses (Campbell and Stanley, 1966) available to explain the meaning of the observed scores. That is, the smaller the number of such rival hypotheses remaining, the greater the degree of confirmation of the assessment hypothesis.
Thus, the list of potential sources of assessment invalidity is essentially a list of plausible hypotheses that are rival to the assessment hypothesis that specifies what the meaning of the observed score is intended to be. Studies need to be designed and conducted to test the tenability of each plausible rival hypothesis in order to determine whether each can be ruled out as a plausible explanation of the observed scores. Where the assessment procedure properly and convincingly "controls" for a potential source of invalidity, the procedure renders the rival hypothesis implausible.
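
One simple way to probe a rival hypothesis, in the spirit of the shyness example given earlier in the chapter, is to check whether scores on the assessment correlate with a measure of the construct-irrelevant characteristic. The sketch below does this for two hypothetical rivals. All data and names are invented for illustration, and the cutoff of 0.30 for a "negligible" correlation is an arbitrary assumption, not a standard from the sources cited here.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return cov / sqrt(var_x * var_y)


# Hypothetical quantitative-knowledge scores for eight children.
quantitative_scores = [14, 9, 17, 12, 20, 11, 16, 13]

# Hypothetical measures of characteristics the assessment is NOT meant to measure.
rival_measures = {
    "shyness_rating": [3, 2, 4, 5, 3, 4, 2, 5],
    "cooperation_rating": [2, 1, 4, 2, 5, 1, 4, 3],
}

NEGLIGIBLE = 0.30  # illustrative cutoff only (an assumption)

# A rival stays on the list if its measure is non-negligibly related to the scores.
still_plausible = []
for name, values in rival_measures.items():
    r = pearson_r(quantitative_scores, values)
    if abs(r) >= NEGLIGIBLE:
        still_plausible.append(name)
    print(f"{name}: r = {r:.2f}")

print("rival hypotheses not yet ruled out:", still_plausible)
```

With these invented numbers, the shyness rating is essentially unrelated to the scores while the cooperation rating is strongly related, so only the second rival remains on the list. A near-zero correlation in a small sample is weak grounds for dismissing a rival; a real validation study would need adequate samples and a design that controls for the source of invalidity.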

The Contemporary Synthesis of Views About Validity Evidence

The current Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 1999) lays out five sources of evidence for validity, which need to be combined to form the basis for a validity argument. These are based on the discussions above and are only briefly described here. For an extended account of how to use these types of evidence in the validity argument for a particular assessment, see Wilson (2005).

1. Evidence Based on Instrument Content. To compose the evidence based on an assessment's content, the measurer must engage in "an analysis of the relationship between a test's content and the construct it is intended to measure" (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 1999, p. 11) and interpret that analysis in an argument concerning the validity of using the instrument. This is generally not an empirical argument in itself, although it may well be based on the results of earlier empirical studies. This is what has been described above in the section on content validity.

2. Evidence Based on the Response Process. If one chooses to assemble evidence based on response processes, one must engage in a detailed analysis of children's responses to the assessment, either while they are taking the assessment or just after, in an exit interview.

In the standard think-aloud investigation (also called "cognitive labs"; American Institutes for Research, 2000), children are asked to talk aloud about what they are thinking while they are actually responding to the item. What the respondent says is recorded, transcribed, and analyzed. Asking a child to think aloud is of limited value with infants, but children in the preschool years can provide useful information.
However, in a variant of this, observation rather than questioning may be the source of the evidence. Children may be videotaped and other characteristics may be recorded, such as having their eye movements tracked. Children must be familiarized with such observational situations and allowed to explore the environment so that they are comfortable. The results can provide insights ranging from the very plain ("the children were very distracted when responding") to the very detailed, including evidence about particular behaviors and actions that were evident when they were responding.

The exit interview is similar in aim but is timed to occur after the child has made his or her responses. It may be conducted after each item or after the assessment as a whole, depending on whether the measurer judges that the delay will or will not interfere with the child's memory. Again, limitations with infants and toddlers are obvious. The types of information gained will be similar to those from the think-aloud, although generally it will not be so detailed. It may be that a data collection strategy that involves both think-alouds or observations and exit interviews will be best.

3. Evidence Based on Internal Structure. To collect evidence based on internal structure, the measurer must first ensure that there is an intention of internal structure. Although this idea of intended structure may not always be evident, it must always exist, even if it is treated as being so obvious that it need not be mentioned or only informally acknowledged in some cases. We refer to this internal structure as the construct. This is what has been described above in the section on construct validity. Note that the issue of differential item functioning (DIF), discussed later in this chapter, is one element of this type of evidence, specifically one related to fairness of the assessment.

4. Evidence Based on Relations to Other Variables. If there are other "external" variables that the construct should (according to theory) be related to, and especially if another instrument is intended to measure the same or similar variable, a strong relation (or lack of a strong relation) between the assessment under scrutiny and these external variables can be used as validity evidence. Typical examples of these external variables are (a) caregiver judgments and (b) scores on other assessments. Another source of external variables is treatment studies: if the measurer has good evidence that a treatment does indeed change the construct, then the contrast on the assessment between a treatment and a control group can be used as an external variable. (One has to be careful about circularity of argument here; it should not be the case that the evidence for the treatment's efficacy is the same data being used to investigate the validity of the assessment.) Note that the relationship predicted from theory may be positive, negative, or null; that is, it is equally important that the instrument be supported by evidence that it is measuring what it should measure (convergent evidence, which may be positive or negative depending on the way the variables are scored), as it is that it is not measuring what it shouldn't (divergent evidence, which would be indicated by a null relationship).

Evaluations of early childhood interventions have the potential to provide important information regarding the validity of assessments for young children. Rather than using assessment instruments to evaluate the effectiveness of interventions, psychometricians use interventions as one means to evaluate the validity of assessments. For example, evidence of validity for a specific instrument of social skills is obtained when intervention effects on that instrument emerge from interventions designed to improve social skills. Typically one uses assessment instruments to evaluate the effectiveness of an intervention based on the assumption that those instruments have sufficient psychometric reliability and validity to be useful. In contrast, in the validity context, one is using successful interventions to evaluate the external validity of assessment instruments.

The logic of using intervention data to establish validity involves several conditions. First, it assumes that the intervention is based on a theory of change in specific child characteristics or outcomes. These outcomes are the child's abilities, skills, or beliefs targeted for change by the intervention. Second, the intervention successfully changes those outcomes.
Third, the outcomes are measured with assessment instruments that are aligned (i.e., directly measure the designated outcomes). When these conditions are met, then the magnitude of the difference between treated and untreated children can be used as an index of external validity. Under this logic, more intensive interventions should yield larger treatment effects than less intensive interventions.

5. Evidence Based on Consequences of Using an Assessment Instrument. Under an integrated, comprehensive approach to
validity, information regarding the consequences of the assessment becomes part of the evidentiary basis for judging the validity of the assessment. An illustration can be drawn from high-stakes assessments in education, through which policy makers have sought to establish accountability. As with any form of assessment, these can have intended or unintended, desirable or undesirable consequences. An alleged potential consequence of high-stakes assessments is that they can drive instructional decisions in unintended and undesirable ways, usually by over-emphasizing the skills tested ("teaching to the test"). They can also possibly have a corrupting influence, since the motivation to misuse or misrepresent test scores can be compelling. In addition, the psychometric characteristics of the test can vary depending on whether it is administered under low- or high-stakes conditions (e.g., level of motivation or anxiety as construct-irrelevant sources of variance in test performance). It is also possible that new and future technologies used to administer, score, or report assessments will have unintended, unanticipated consequences, as many new technologies have had.

Social Consequences of Assessment

As in the field of medicine, in assessment there is an obligation to do no harm to those assessed. As such, it is important to inquire into the intended as well as unintended consequences of assessment. Validity theoreticians differ from one another in the extent to which they incorporate the consequences of assessment under the purview of validity. Thus, although evidence about consequences can inform judgments about validity, it is important to distinguish between evidence that is directly relevant to validity and consequences that may inform broader concerns, such as educational or social policy. For example, concerns have been raised about the impact of certain forms of assessment on narrowing the curriculum.
(That is, it is often said that assessments should not have the effect of unduly narrowing the early childhood program's focus to the detriment of the program's wider or comprehensive goals.) For example, an educational assessment system should not lead
teachers to concentrate instruction on a few or narrowly defined learning objectives merely for the sake of the children's passing a test, or to concentrate on a few discrete skills that can be achieved through routine drill, to the exclusion of coverage of the program's other goals for learning and development. Similarly, the assessment should not cause children to acquire habits of mind that emphasize shallow learning and quick forgetting, it should not take away the joy and excitement of engaging in intellectual inquiry, and it should not have the effect of discouraging them from taking responsibility for their own expanded learning. Such impact, if it occurs, may not in itself necessarily diminish the validity of an assessment score, although it raises issues surrounding test use.

If, however, a consequence of an assessment is the result of a threat to assessment validity (for example, if there is construct-irrelevant variance, such as children's language skills, affecting their performance on a test intended to measure only quantitative reasoning, a situation resulting in English language learners scoring as a group lower than other children on that test), then the social consequence is clearly linked to validity.

When claims are made about the benefits or harms of assessment that go beyond the uses of assessment (for example, claims that the use of assessments will encourage better classroom instruction by holding educators accountable), then the validation process should be informed by whether or not those claims hold true.

The relevance of unintended consequences is not always easily ascertained. For example, there can be confusion about whose intent is under consideration (e.g., the test developer's intent or the user's) and about whether a consequence is positive or negative.
Moreover, the user is often an individual with little or no technical knowledge to determine the validity of a score interpretation that she or he might make (e.g., newspaper readers trying to make sense of newspaper reports based on test data).

Validity of Assessments Used for Judging Program Quality

Concerning assessment instruments that are to be used for the purposes of judging program quality, a fundamental question is,
"Can the instrument adequately gauge program quality?" This is really a threefold question: (1) Do the scores (or other data that are derived from the instrument) have the technical characteristics (e.g., reliability) to show measurable improvement in children's developmental level on the program's intended outcomes (Popham, 2007)? (2) Is there evidence available that the scores (or other data that are derived from the instrument) have appropriate validity characteristics (e.g., internal construct validity, external variable validity, etc.) for measuring the program's intended outcomes (Popham, 2007)? (3) Is the evaluation design strong enough that improvement can be attributed to program effects?

The program may or may not specify targets for attaining particular developmental levels on its intended outcomes. If the program has specific developmental outcome targets, then questions that should be asked in relation to the assessment instrument include (a) "What are those targets?" and (b) "Can the instrument accurately measure those targets?" It is important, for example, to ensure that the instrument does not have a ceiling short of those targets.

One should also ask, "What is the yardstick used to measure a program's success?" For example, is the outcome target the percentage of children who score at or above the chronological age norms for that outcome? If so, are those norms for the nation as a whole or are they subgroup norms, such as state norms or ethnic, language minority, or socioeconomic group norms? If subgroup norms are used, it may be important to establish the metrics of correspondence between them (Popham, 2007). For example, a 1-decile improvement at the lower tail of the distribution may or may not mean the same thing as a 1-decile improvement at the higher tail end.
Thus, more program resources may be required to obtain improvements for one group of children than for another group, or for one portion of the normative curve than for another.

Moreover, in making judgments about program effectiveness on the basis of assessment data, one should also ask, "Are those program targets realistic?" Although this question does not bear on the quality of the assessment instrument per se, it nevertheless bears on the appropriateness of its use. What is a realistic level of expectation for children's attaining a particular level of development on a program's intended outcomes? What is the timeline for attaining a program's outcome targets?

If assessment results are used for the purposes of accountability, it is important that the assessment should reflect the domains or areas of development or learning that the program or policy was intended to influence. For example, a pre-K program that was not designed to provide nutrition should not be held accountable for children's nutritional status. This is discussed further in Chapter 10 on assessment systems.

Reliability Evidence

The traditional quality-control approach to score consistency has been to find ways to measure the consistency of the scores; this is the so-called reliability coefficient. There are several ways to do this, for example as (a) how much of the observed variance in scores is attributable to the underlying "true" score (as a proportion), (b) the consistency over time, and (c) the consistency over different sets of items (i.e., different "forms"). These constitute three different perspectives on measurement error and are termed internal consistency, test-retest, and alternate forms reliability, respectively.

The internal consistency reliability coefficients are calculated using the information about variability that is contained in the data from a single administration of the instrument; effectively, they are investigating the proportion of variance accounted for by the "true" score. This "variance explained" formulation is familiar to many through its use in analysis of variance and regression methods. Examples are the Kuder-Richardson 20 and 21 (Kuder and Richardson, 1937) for dichotomous responses and coefficient alpha (Cronbach, 1951) for polytomous responses. (Dichotomous means there are two possible responses, such as yes/no or true/false. Polytomous means there are more than two possible responses, as in partial-credit items.)

As described above, there are many sources of measurement error beyond a single administration of an instrument. Each such source could be the basis for calculating a different reliability coefficient. One type of coefficient that is commonly used is the test-retest reliability coefficient. In a test-retest reliability coefficient, the respondents give responses to the questions twice, then the reliability coefficient is calculated simply as the correlation between the two sets of scores. On one hand, the test and the retest should be so far apart that it is reasonable to assume that the respondents are not answering the second time by remembering the first but are genuinely responding to each item anew. This may be difficult to achieve for some sorts of complex items, which may be quite memorable. On the other hand, as the aim is to investigate variation in the scores not due to real change in respondents' true scores, the measurements should be close enough together for it to be reasonable to assume that there has been little real change. Obviously, this form of reliability index will work better when a stable construct is being measured with forgettable items, compared with a less stable construct being measured with memorable items.

Another type of reliability coefficient is the alternate forms reliability coefficient. With this coefficient, two sets of items are developed for the instrument, each following the same construction process. The two alternate copies of the instrument are administered, and the two sets of scores are then correlated to produce the alternate forms reliability coefficient. This coefficient is particularly useful as a means of evaluating the consistency with which the test has been developed.

Other classical consistency indices that have also been developed have their equivalents in the construct modeling approach.
For example, in the so-called split-halves reliability coefficient, the instrument is split into two different (nonintersecting) but similar parts, and the correlation between them is used as a reliability coefficient after adjustment with a factor that attempts to predict what the reliability would be if there were twice as many items in each half. The adjustment is a special case of the Spearman-Brown formula:

\[ r' = \frac{L r}{1 + (L - 1)\, r}, \]

where L is the ratio of the number of items in the hypothetical test to the number of items in the real one (i.e., if the number of items were to be doubled, L = 2).
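The split-halves adjustment described above can be sketched in a few lines of Python. This is a minimal illustration, not a procedure taken from the chapter: the function names (`pearson`, `spearman_brown`, `split_half_reliability`) are hypothetical, and the odd/even item split is only one assumed way of forming two "similar parts."

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman_brown(r, L=2.0):
    """Spearman-Brown formula: r' = L*r / (1 + (L - 1)*r).

    L is the ratio of the hypothetical test length to the actual length.
    """
    return L * r / (1 + (L - 1) * r)


def split_half_reliability(item_scores):
    """Split-half reliability with the Spearman-Brown adjustment (L = 2).

    item_scores: one list of item scores per child.
    Assumption: items are split into odd- and even-numbered halves.
    """
    odd_half = [sum(row[0::2]) for row in item_scores]
    even_half = [sum(row[1::2]) for row in item_scores]
    r_half = pearson(odd_half, even_half)
    return spearman_brown(r_half, L=2.0)
```

Setting L = 1 leaves the correlation unchanged, which is a quick sanity check on the formula; a different way of splitting the items will generally give a somewhat different coefficient.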
These reliability coefficients can be calculated separately, and the results will be quite useful for understanding the consistency of the instrument's measures across each of the different circumstances. In practice, such influences will occur simultaneously, and it would be better to have ways of investigating the influences simultaneously also. Such methods have indeed been developed; for example, generalizability theory (e.g., Shavelson and Webb, 1991) is an expansion of the analysis of variance approach mentioned above.

One of the issues in interpreting reliability coefficients is the lack of any absolute standards for what is acceptable. It is certainly true that a value of 0.90 is better than one of 0.84, but not so good as one of 0.95. At what point should one say that a test is "good enough"? At what point is it not? One reason that it is difficult to set a single uniform acceptable standard is that instruments are used for multiple purposes. A better approach is to consider each type of application individually and develop specific standards based on the context. For example, when an instrument is to be used to make a single division into two groups (pass/fail, positive/negative, etc.), then a reliability coefficient may be quite misleading, using, as it does, data from the entire spectrum of the respondent locations. It may be better to investigate false positive and false negative rates in a region near the cut score.

Measurement Choices: Direct Assessment and Observation-based Assessment

Choosing what type of assessment to use is a critical decision for the design of an early childhood program evaluation or an accountability system. As others have noted, it is a decision for which there are no easy answers because there are serious shortcomings in all currently available approaches (Meisels, 2007).
Two sharply contrasting measurement approaches (which we have discussed in other chapters) can be used with children under age 5: direct assessments and observation-based (often called authentic) measures.

A direct assessment involves an adult, possibly a familiar adult but sometimes a stranger, sitting with a child and asking
him or her to respond to a number of requests, such as pointing to a picture or counting objects. The conditions for administration, such as the directions and how the materials are presented, are standardized to ensure that each child is being presented with identical testing conditions.

Observation-based measures, such as those involving observation of children's behaviors or a portfolio collecting records of observations together with products of children's work, use regularly occurring classroom activities and products as the evidence for what children know and can do. Observation-based measures encompass a variety of tools, including checklists of a series of items that a teacher or parent completes based on general knowledge of the child, and classroom-based observation tools, with which the teacher is expected to make extensive annotations based on what the child is doing in the classroom and use that documentation to complete the observation items. Portfolio assessment involves collecting and analyzing records of such observations or samples of children's work.

Both direct assessment and observation-based assessment have strengths and weaknesses. Direct assessments, however, have been used far more frequently in large-scale research projects, such as the Early Childhood Longitudinal Study; program evaluations, such as the evaluation of Early Head Start; and accountability efforts, such as the Head Start National Reporting System. Consequently, there is more known about both the strengths and weaknesses of this approach. Observation-based and performance methods are routinely recommended as tools for teachers to use to plan and guide instruction (National Association for the Education of Young Children and National Association of Early Childhood Specialists in State Departments of Education, 2003).
Even the recommendation to regularly use such measures to assess children's progress in early childhood classrooms is a relatively new development, so there is much yet to be learned about the large-scale use of authentic tools for any purpose, and that certainly includes program evaluation and accountability.

In an extensive review of assessment approaches, researchers at Mathematica Policy Research (2007) noted challenges associated with using both direct assessment and observation-based measures for program evaluation and accountability purposes.
Direct assessments often have been found to be predictive of school achievement. However, they are strongly associated with socioeconomic status and may not show whether a program is supporting children across all developmental domains. The dilemma is that as a direct measure gets longer and more comprehensive, it also taxes the energy and attention span of young children. The limitations of direct assessment derive from the nature of the young child; that nature is not well matched to the demands of a standardized testing situation. Potential problems include the following:

• The child may not be familiar with this type of task or be able to stay focused.
• Young children have a limited response repertoire, being more likely to show rather than tell what they know.
• Young children may have difficulty responding to situation cues and verbal directions.
• Young children may not understand how to weigh alternative choices, for example, what it means for one answer to be the "best" answer.
• Young children may be confused by the language demands, such as negatives and subordinate clauses.
• Young children do not respond consistently when asked to do something for an adult.
• In some cultures, direct questioning is considered rude.
• The direct, decontextualized questioning about disconnected events may be inconsistent with the types of questions children encounter in the classroom.
• Measurement error may not be randomly distributed across programs if some classrooms typically use more direct questioning, like that found in a standardized testing situation.

These problems may not be shown in traditional ways of assessing validity, which compare children's performance on one type of direct assessment with their performance on a similarly structured test (so-called external validity evidence). Mathematica Policy Research reports on a study by La Paro and Pianta (2000) that found that about 25 percent of the variance in
academic achievement in primary grades was predicted by assessments administered in preschool or kindergarten. This provides a ceiling for possible external validity evidence.

Observation-based measures present an entirely different set of issues. They do not present any of the problems associated with the young child's ability to understand and comply with the demands of a structured testing situation, since the child's day-to-day behavior is the basis for the inference of knowledge and skills. Teachers and caregivers collect data over a variety of contexts and over time to gain a more valid and reliable picture of what children know and can do. Observation-based assessment approaches also are consistent with recommended practices for the assessment of young children. The challenges associated with observation-based measures are centered around the caregiver or teacher as the source of the information. Mathematica Policy Research (2007) has summarized challenges related to observation-based assessments:

• There is a need to establish trust in teachers' and caregivers' judgments. Research has identified the conditions under which their ratings are reliable, but there is an ongoing need to monitor reliability.
• Teachers and caregivers must be well trained in the administration of the tool to achieve reliable results. More research is needed to specify the level of training needed to obtain reliable ratings from preschool teachers. (Assessors of direct assessments need to be trained as well, but the protocol may be more straightforward.)
• The assessment needs to contain well-defined rubrics and scoring guides.
• Teachers and caregivers may be inclined to inflate their ratings if they know the information is being used for program accountability.
• Not all teachers or caregivers will be good assessors.
• Measurement carried out by teachers and caregivers requires that additional steps be taken to ensure the validity and reliability of the data, such as periodic monitoring.

A strength of observation-based measures is that the information has utility for instructional as well as accountability purposes.
This means the time invested in training teachers to become good observers and the time teachers spend collecting the information are of direct benefit to classroom practice, which is not true for direct assessment. Mathematica Policy Research concludes that it is wiser "to invest in training teachers to be better observers and more reliable assessors than to spend those resources training and paying for outside assessors to administer on-demand tasks to young children in unfamiliar contexts that will provide data with the added measurement error inherent in assessing young children from diverse backgrounds" (Mathematica Policy Research, 2007).

More research needs to be done on the use of observation-based assessment tools for program evaluation and accountability. If teachers or caregivers are not well trained or do not complete the tool reliably because they want their programs to look good for accountability, then the information is useless for both accountability and instructional purposes. Several states have elected to use observation-based measurement in their preschool accountability systems, but it is so new that very limited data are available.

One large program evaluation was able to document that early childhood teachers could be trained to use observation-based measures reliably. Bagnato and colleagues (Bagnato, Smith-Jones, et al., 2002; Bagnato, Suen, et al., 2002) used an authentic assessment approach to document improved outcomes for 1,350 preschoolers participating in an innovative community-based urban preschool initiative. The highest level of education was a high school diploma for 42 percent of the teachers working with the children and thus providing the child outcomes data.
To ensure the outcomes data were valid and reliable, the evaluation team provided initial, booster, and follow-up training until mastery was reached; supervised caregiver assessments during a set week each quarter; and once a year conducted random, authentic assessments on children as a concurrent validation of teacher and parent assessments.

Although we have presented direct assessments and observation-based assessments as distinct choices in the paragraphs above, a more recent perspective sees them as constituting different parts of an assessment system or net (Wilson, 2005; Wilson and Adams, 1996). In this perspective, no single type of
assessment is seen as being fully satisfactory, hence a multipart assessment system is developed, which uses a combination of specific assessment types to ensure that the measures are usable under a range of circumstances and the entire system can adapt to changing circumstances. The strengths of item response modeling are used to establish both the validity and the usefulness of this approach. In a classic example drawn primarily from K-12 education, the two assessment types were multiple-choice items and open-response items (Wilson and Adams, 1996), but in the context of early childhood education, a more likely combination would be a mixture of direct assessment and observation-based assessments, such as teacher observations and portfolios. The judicious deployment of such a combination allows the different assessment types to "bootstrap" one another in terms of validity, going a long way to helping establish (a) whether the direct assessments did indeed suffer from problems of unfamiliarity and (b) whether observation-based assessments suffered from such problems as teacher bias. Moreover, systematic use of a combination of assessments enables the monitoring of assessments as an ongoing possibility, not just a special study carried out during initial implementation.

Methods for Assessing Test and Item Bias

Developing tests for educational and psychological purposes requires a thorough consideration of the populations for which the test is appropriate. Specifically, the test development process should include several phases designed to ensure that tests and items are free from bias across the populations for which the test is intended. These steps include the subjective review of items and test content by subject matter and bias review panels, as well as more objective or quantitative examination of item and test properties.
In modern test development, the examination of test bias favors these more quantitative examinations of item and test bias for their ability to quantify the extent to which items and tests may function differently across populations of interest, and because of the strong psychometric theory that supports their development and use, but interpretation will still rely heavily on qualitative approaches. The following section is an overview of these quantitative
methods for examining (a) test bias and (b) differential item functioning (DIF). These issues are most relevant for three populations of young children, which are the subject of the next chapter: minority children, English language learners, and children with disabilities.

Differential Item Functioning

Assessments are typically made of children from a variety of backgrounds. One standard requirement of fairness in assessment practice is that, for children who are at the same level of ability on the variable being measured, the items in the instrument behave in a reasonably similar way across different subgroups. That is, the items should show no evidence of bias due to DIF (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 1999, p. 13). Typically these subgroups are gender, ethnic and racial, language, or socioeconomic groups, although other groupings may be relevant in particular circumstances.

First, it is necessary to make an important distinction. If the responses to an item have different frequencies for different subgroups, then that is evidence of differential impact of the item on those subgroups. Although such results may well be of interest for other reasons, they are not generally the focus of DIF studies. Instead, DIF studies focus on whether children at the same locations on the score distribution give similar responses across the different subgroups.

DIF is not always indicated when different groups perform differently on an assessment or on particular items. For example, suppose that more English language learners than native speakers got a particular item wrong on an assessment of "speaking in English"; that would constitute differential impact on the results of the assessment and could well be an interesting result in itself.
But the issue of DIF would not necessarily be raised by such a result: it is to be expected that someone learning a language will find it harder to speak that language than native speakers, and hence the result does not challenge the contention that the instrument was accurately measuring that difference in their speaking performance. However, if children from the two groups who scored at
around the same level on the whole assessment had response rates on that item that were very different, that would be evidence of DIF for that item. The item is sensitive to some special characteristic of the children that goes beyond what is being assessed generally across the range of the items in the assessment (e.g., interest in the topic or content of the item). In order to be more fair to children from different subgroups, one would wish to reduce the influence of items from the assessment that had notable amounts of DIF, or perhaps amend them to eliminate this characteristic.

Second, one must be careful to distinguish between DIF and item bias. For one thing, it is possible that a test may include two items that exhibit DIF between two groups, but in opposite directions, so that they tend to "cancel out." Also, DIF may not always be a flaw, since it could be due to "a kind of multidimensionality that may be unexpected or may conform to the test framework" (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 1999, p. 13). However, despite these considerations, most test developers seek to reduce or eliminate instances of DIF in their tests. The Educational Testing Service has developed criteria for judging DIF effects (Longford, Holland, and Thayer, 1993).

Several techniques are available for investigating DIF, among them techniques based on linear and logistic regression and techniques based on log-linear models (see Holland and Wainer, 1993, for an overview). For example, consider the results of a (hypothetical) DIF analysis examining the differences between males and females on one item (item "Z") of a certain test, shown in Figure 7-1. For each score on the test as a whole, the proportions of boys and girls who responded correctly to the item have been plotted separately.
If there were no DIF, those proportions would be the same (except for sampling error) for all scores. Looking at the figure, we see that for most whole-test scores boys are more likely to respond correctly to this item than are girls. That is DIF, and it means that this item indicates a larger difference in proficiency between boys and girls than does the test as a whole. Examination of item Z may well reveal that
there is something about it that unintentionally favors boys.

[FIGURE 7-1 Examining differential item functioning: proportion answering item Z correctly vs. score on entire test, for male and female subjects (hypothetical data).]

There are many statistical procedures available to judge whether there is statistically sound evidence of DIF that are useful for different kinds of test items and sample sizes; see Wilson (2005), Dorans and Holland (1993), and Thissen, Steinberg, and Wainer (1993) for examples.

Consider an example involving Chinese and U.S. children who were administered a test of cognitive development in their own languages (see Huang, 2007). Applying effect size criteria like those mentioned above (Longford et al., 1993) to the statistically significant difference found shows that indeed the DIF for several items is "large." One such item concerns the use of comparatives, for example, "more" and "fewer" in English and their equivalents in Chinese. It is easier for Chinese children than U.S. children (at the same overall cognitive development status) to get the comparative item correct. (Remember that this applies to just that item, but not the other items.) In fact, it turns out that this effect is common to five other items involving both comparatives and superlatives (Huang, 2007). In investigating this, we note that the Chinese language has some interesting differences in comparison to English. For example, the two languages differ greatly in the formation of comparatives and superlatives. In English, the words for comparatives and superlatives often used are "more," "most," "less/fewer," "least/fewest," "as many," and "equal." Some of these words are used differentially depending on whether the nouns they are applied to are countable or not. For example, we say "less butter" but "fewer sheep," "the least of the butter" but "fewest of the sheep," and "as many sheep" but "as much butter." But note that one can say "more sheep" as well as "more butter," so the rule is not a consistent one. In contrast, in Chinese, nouns are not differentiated to be countable or not. Moreover, instead of using different words, the same two characters (duo and shao) and the same comparative (geng) and superlative (zui) are used. The function is morphologically easier in Chinese than in English. Zhou and Boehm (1999) found Chinese and U.S. elementary children developed differently on those concepts. So it is not surprising that the five DIF items testing children's ability to compare quantities all favored Chinese children.

To get an idea of the effect size of this difference, the relative odds of getting the item correct for children in the two groups can be calculated and they are 1:2.77 (U.S.:Chinese); that is, for respondents at the same level of cognitive development, approximately 1 U.S. child for every 3 Chinese children would be predicted to get the item correct. This effect size needs to be embedded in a real-world context to decide whether it is important or not. However, it seems to be reasonable to say that the difference is quite noticeable and likely to be interpretable in many contexts.
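One common way to obtain relative odds of this kind is to compare the two groups within each level of the total score and then pool across levels. The sketch below uses the Mantel-Haenszel common odds ratio for that purpose. It is an illustration only: the chapter does not say which procedure produced the 1:2.77 figure, and the function name, the record layout, and the group labels "ref" and "focal" are hypothetical.

```python
from collections import defaultdict


def mantel_haenszel_odds_ratio(records):
    """Mantel-Haenszel common odds ratio for one item.

    records: iterable of (group, total_score, correct) tuples, where
    group is "ref" or "focal" and correct is truthy if the child got
    the item right. Children are matched on total_score, so the result
    compares children at the same overall level.
    Returns the odds of a correct response for the reference group
    relative to the focal group (1.0 means no DIF by this index).
    """
    # One 2x2 table per total-score stratum: [n_correct, n_incorrect].
    tables = defaultdict(lambda: {"ref": [0, 0], "focal": [0, 0]})
    for group, score, correct in records:
        tables[score][group][0 if correct else 1] += 1

    numerator = 0.0
    denominator = 0.0
    for table in tables.values():
        a, b = table["ref"]    # reference group: correct, incorrect
        c, d = table["focal"]  # focal group: correct, incorrect
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0:
        raise ValueError("odds ratio undefined: no informative strata")
    return numerator / denominator
```

A value well above or below 1.0 flags the item for the kind of substantive review described in the text. Testing organizations often re-express this ratio on a logarithmic "delta" scale before applying size criteria, but the cut points used for "large" DIF should be taken from the cited sources rather than from this sketch.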
Once an item exhibiting DIF has been identified, one must decide what to do about it. Recall that not all instances of empirical DIF threaten the item; as mentioned earlier, the grouping characteristics may not be ones of concern for issues determined to be important, such as fairness. It is sobering to realize that, for each item, it is almost inevitable that there will be some grouping that could be constructed for which the item will exhibit DIF. It is first necessary to establish that the DIF is indeed not a result of random fluctuations, and then the same steps are needed: (a) repeated samplings and (b) development of a "theory of DIF"
for that particular item. If one is indeed confident that DIF is established, then the best strategy is to develop alternative items that do not exhibit DIF. However, it may not be possible to replace the DIF item in the instrument; in the case above, the question would be whether comparatives and superlatives were indeed considered necessary to one's conception of cognitive development. Then the measurer must make a judgment about the best way to address the issue. Note that a technical solution is available here: the measurer can use two different calibrations for the two groups, but this is seldom a chosen strategy, as it involves complex issues of fairness and interpretation.

Validity Generalization

As described earlier, the validity of inferences based on test data is critical to the selection and use of tests. Test users need to know the extent to which the inferences that they make on the basis of test performance would still apply if the child had been tested by a different examiner, on a different day of the week, in a different setting, using an alternate form of the same test, or even using a different assessment of the same skill or ability. Whether a particular test will function in the same way for different populations (e.g., for minority and nonminority children) and in different settings (e.g., in a Head Start program and a private, for-profit preschool program) is a question for research. However, because there are virtually infinite ways in which to characterize subpopulations of interest, and there are infinitely many settings across which one might wish to use assessments, it is impractical to consider that all tests might be studied on all populations in all settings of interest.
Even if it were practical, doing so might not provide the best answer to questions about the validity of specific assessments, because individual studies can suffer from methodological shortcomings that can affect the estimation of validity coefficients in ways that do not affect the validity of inferences based on the test. Put another way, the information one seeks concerns population properties of the test, but the individual research studies provide only imperfect estimates of these quantities. Even well-designed studies can provide only imperfect information about the test properties.

Note: To carry this out (that is, to use different calibrations for the two groups), one would use item estimates anchored (for the non-DIF items) on the whole sample to estimate different difficulties for the DIF items, then make sure that the two metrics are equated.

A number of methodological factors can affect estimates of test validity. Several obvious candidates include sampling error, unreliability in the specific test being studied, unreliability in the specific criterion being used (e.g., another test measure, performance in a course, success at the next grade level), and restriction of range in the study sample. When assessing whether tests function similarly across different settings, such as in one preschool compared with another, or for different populations, such methodological factors that affect the size of the validity coefficients must be taken into consideration.

The portability of test validity across different settings and populations has come to be known as validity generalization (Murphy, 2003). Studies of validity generalization rely on the methods of meta-analysis to examine the factors affecting variability in validity coefficients. The basic logic of the validity generalization argument rests on the ability of meta-analysis techniques to adjust validity coefficients for sampling error and other methodological artifacts that affect sample estimates of validity coefficients and then to estimate the magnitude of the remaining variance in validity coefficients. If the variability in the validity coefficients is statistically not different from zero once sampling error and other methodological study artifacts have been controlled, then one would conclude that validity will generalize to other settings and populations.
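The logic just described can be sketched with a "bare-bones" calculation in the style of Hunter and Schmidt (1990), which adjusts only for sampling error. This is a simplified illustration, not a full validity generalization analysis: it ignores unreliability and restriction of range, and the function name and the example correlations and sample sizes in the usage note are hypothetical.

```python
def bare_bones_meta_analysis(studies):
    """Summarize validity coefficients from several studies.

    studies: list of (r, n) pairs, where r is an observed validity
    coefficient (a correlation) and n is that study's sample size.

    Returns a dict with the sample-size-weighted mean correlation, the
    observed variance of the correlations, the variance expected from
    sampling error alone, and the residual variance left after
    subtracting the sampling-error variance (floored at zero).
    """
    if not studies:
        raise ValueError("at least one study is required")
    total_n = sum(n for _, n in studies)
    mean_r = sum(r * n for r, n in studies) / total_n
    observed_var = sum(n * (r - mean_r) ** 2 for r, n in studies) / total_n
    mean_n = total_n / len(studies)
    # Approximate sampling-error variance of a correlation, evaluated
    # at the mean correlation and the average sample size.
    sampling_var = (1.0 - mean_r ** 2) ** 2 / (mean_n - 1.0)
    residual_var = max(observed_var - sampling_var, 0.0)
    return {
        "mean_r": mean_r,
        "observed_var": observed_var,
        "sampling_var": sampling_var,
        "residual_var": residual_var,
    }
```

For example, with two hypothetical studies `[(0.2, 100), (0.4, 100)]`, most of the observed variance in the coefficients is of the size expected from sampling error alone, leaving only a small residual. A residual near zero is what the validity generalization argument looks for, subject to the cautions about statistical power that apply to any such test.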
Validity generalization has been widely used in the industrial and organizational psychology literature to examine the portability across employment settings of the validity of cognitive ability tests and other assessments used in employee selection. In the employment context, there are many studies providing data on the use of tests to measure specific ability domains. Interest often centers on the role of specific domains of assessment in predicting job performance more than on the validity evidence for specific tests. However, the techniques of validity generalization can also be used to study the validity evidence for specific tests and the use of specific tests in different populations. In studying test properties, validity generalization techniques are statistically
preferable to isolated comparisons across populations. Because such statistical artifacts as sampling error, unreliability in the test and criterion, restriction of range in study samples, and other study design features can be controlled through the techniques of meta-analysis, validity generalization studies can provide better inferences about the comparability or noncomparability of test properties across settings and populations than simple comparisons of test correlations in individual studies or from narrative research reviews.

Although the concept of validity generalization has been used most widely in employment research, related concepts have been discussed in other contexts. For example, the concept of population generalizability (Laosa, 1991) has been used to describe the extent to which inferences about tests or treatment effects in the normative population will also apply to other populations of interest. Although much of the literature on validity generalization is focused on the use of tests in employment settings, its relevance to educational and early childhood settings is clear.

Limits of Validity Generalization

There are significant limitations to the use of validity generalization to infer the absence of test bias. In part, these limitations are inherent in the use of meta-analysis and the logic of statistical hypothesis testing. The inference that validity generalization holds is based on a test of the statistical hypothesis that validity coefficients (i.e., population correlations) do not vary across populations or contexts. Practically speaking, this is based on a test to determine that the variability in observed validity coefficients is not different from zero once sampling error and other methodological artifacts have been controlled. Thus, the inference of validity generalization is tantamount to accepting a null hypothesis in statistical hypothesis testing.
As in other hypothesis-testing contexts, one cannot prove that validity generalization holds; one can only disprove it. Consequently, one can really infer only that the current evidence does not disprove the validity generalization.

There are many reasons why the evidence might not support rejecting the validity generalization hypothesis even though
validity coefficients vary across populations or contexts. Just as in the case of DIF and differential test functioning, the statistical power of the hypothesis test must be considered. In DIF, power is primarily a function of the sample size in each subgroup and the magnitude of the difference in item parameters across populations. In meta-analysis, the power of the variance test is principally affected by the number of studies, the sample sizes in those studies, and the magnitude of differences in the validity coefficients across populations. If the number of studies in the meta-analysis is small, or the magnitude of the variability in validity coefficients is small, or the sample sizes in the included studies in the meta-analysis are small, power may be low for the test of variability in the validity coefficients.

A complete discussion of the validity generalization literature or the use of meta-analysis to study validity generalization is beyond the scope of this volume. Interested readers are referred to Goldstein (1996), Hunter and Schmidt (1990), and Murphy (2003). For considerations about the use of validity generalization techniques in the study of test bias, see National Research Council (1989).

Selecting Assessments and Developing Systems: Example Scenarios

In the following section we describe three scenarios in which an individual or organization has decided to develop an assessment component for an early childhood program. These scenarios are intended to demonstrate the processes that the individual or organization might establish for achieving its objectives. They are illustrative and are not intended to be definitive or comprehensive. They apply to assessments of children and of early care and education environments, though we have focused mostly on child assessments.
When designing an assessment system to accomplish multiple purposes involving multiple domains (e.g., assessing children's status; guiding intervention; or measuring program improvements in language, arithmetic, and socioemotional development), one must replicate many of the processes involved in selecting a test to measure performance in a single domain. Consequently, we begin with a simple scenario
in which a program director wishes to assess children's language skills at entry into an early childhood educational program. We then consider a more complex scenario in which a consortium of early childhood programs seeks to establish an assessment system that can be used across all programs in the consortium to make instructional decisions for the children in the consortium's care. Finally, we consider the situation in which the local school board of a large urban school district has decided to incorporate child assessments into its evaluation of the district's new preschool initiative aimed at improving children's school readiness, socioemotional development, and physical health. All of the scenarios are fictitious and any resemblance to actual people or programs is entirely coincidental.

We understand that assessment circumstances vary in the real world. A local program may have constraints on time, money, knowledge, and/or autonomy that limit its freedom in selecting assessment designs and instruments. A state-sponsored program may have state standards to meet, may need assessments that will provide information on how well those standards are being met, and may have to use assessments selected by the state. A federally sponsored program, similarly, operates in the context of standards imposed and assessment decisions made at the federal rather than the local level. We discuss these possibly conflicting requirements in Chapter 10. In the scenarios we mention some constraints on assessment design and implementation (e.g., cost). The following scenarios, therefore, represent cases in which people at the program levels specified have assessment needs that they wish to satisfy, within the constraints of their particular situations.

Selecting One or More Tests to Meet a Local Need

Jane Conway is the director of the Honeycomb Early Childhood Center, serving a small rural community.
The child population at Honeycomb has historically been largely Caucasian, but in more recent years the population has become increasingly diverse, with more African American and Latino families. Ms. Conway has decided that, in order to better serve the families and children at Honeycomb, she wishes to evaluate the language proficiency of children at the time of their enrollment. In order to achieve
her objective, she establishes a test selection committee that is composed of herself, her best teacher, a parent, and Rebecca Thompson, a retired school psychologist. She asks Dr. Thompson to chair the committee, because of her experience working in school settings with diverse child populations, including children who are not native speakers of English. Dr. Thompson and Ms. Conway meet and agree to complete the committee's work in 45 days. To achieve this goal, they will need to rely on information about specific assessments from external sources, such as Tests in Print (Murphy, Spies, and Plake, 2006) and the Buros Mental Measurements Yearbook (Buros Institute of Mental Measurements, 2007), both products of the Buros Institute of Mental Measurements at the University of Nebraska; publications focused on preschool assessment, such as the Child Trends (http://www.childtrends.org) compendium, Early Childhood Measures Profiles (Child Trends, 2004), and the compendium developed by Mathematica Policy Research for Head Start (Mathematica Policy Research, 2003); and online databases, such as those provided by Buros, the Educational Testing Service, and others.

The first committee meeting is focused on clarifying the purpose for using the test. Ms. Conway explains that her desire is to have information about the incoming language skills of all of the children and to be able to gauge how much language skill the children gain over the course of their time at Honeycomb. Thus, she would like a test that measures both receptive and expressive language, including vocabulary and the ability to follow directions, and children's knowledge and understanding of grammar (e.g., the ability to form the simple past tense of common verbs). She wants to know how the children at Honeycomb compare with other typically developing 3- and 4-year-old children.
She is especially concerned to know the overall language skills, not just the English language skills, of the English language learners. This will help her teachers provide the necessary visual and linguistic supports to their children and opportunities to develop language skills through their interactions with the teacher, the environment, and the other children, as well as to measure their progress over the course of the year to ensure that their language skills are developing at an appropriate pace and that they will be ready for kindergarten when they finish at Honeycomb.

Note: See Appendix D for a list and descriptions of useful sources of information on instruments.

The committee discusses these purposes and works to further clarify the assessment setting. They discuss who will administer and score the assessments, who will interpret the assessments, what specific decisions will be made on the basis of the assessment results, when these decisions will need to be made and how often they will be reviewed and possibly revised, which children will participate in the assessments, and what the characteristics of these children are: their ages, their race/ethnicity, their primary language, their socioeconomic status, and other aspects of their background and culture that might affect the assessment of their language skills.

Dr. Thompson concludes, on the basis of the answers to these questions and refinement of their purposes in assessing children's language, that either a direct assessment or a natural language assessment might be used. Ms. Conway likes the idea of using a natural language assessment but considers that such an assessment may be too costly. The committee decides not to preclude any particular form of assessment until they have more information on the available assessments; their reliability and validity for the purposes they have specified with children like those at Honeycomb; and the specific costs associated with using each of them, including the costs of training personnel to administer, score, and interpret the assessments and the costs associated with reporting and storing the assessment results so that they will be useful to teachers.

The committee next considers how they will go about identifying suitable tests. They consider what tests are being used in other programs like Honeycomb. In one nearby program, the director has adopted the use of a locally developed assessment. Ms.
Conway considers that perhaps Honeycomb could also use this assessment, since the other program appears to be obtaining excellent results with it. However, Dr. Thompson points out that such a locally developed test, because it has not been normed with a nationally representative sample, will not meet at least one of the stated purposes for assessment, namely, to provide the teacher with information about how each assessed child is doing relative to other typically developing children. Knowledge about how
the children at Honeycomb compare with typically developing children is a sufficiently important purpose that the committee rejects the idea of using any locally developed assessments that do not support this kind of inference.

Having clarified their purposes for collecting language assessments and given careful consideration to the requirements and limitations of their specific setting, the committee collects information on specific assessments. They search online publishers of major commercial tests for new and existing assessments and consult the print and online resources mentioned above to gather general descriptive information about the skills measured by each assessment, its format (both stimuli and response formats), training requirements or skills of examiners, costs, and the kinds of scores and interpretive information that are provided. Because they anticipate finding a large number of assessments that meet their general needs, they decide not to examine specific review information until after they have narrowed the field to a manageable number (e.g., 10). They do agree, however, to consider tests that measure only some of the language skills of interest, although they believe that it would be preferable to have one assessment that measures all of the skills of interest.

Dr. Thompson has developed an electronic form on which to record this information for each test that they identify as meeting their primary needs. Committee members arrange the information to be collected and the general characteristics to be rated in a hierarchy from most important to least important.
Information on the name of the test and the publisher is to be obtained on all potentially suitable tests, including those that will ultimately be eliminated, in order that the committee has a record of each test examined at any level and the reason that it was rejected or not given further consideration. They arrange the criteria in the following order: (1) measures some or all of the language skills of interest, (2) has been normed on a nationally representative sample and provides normative information for each subgroup of interest to Honeycomb, (3) is suitable for use with children in the age range found at Honeycomb, and (4) is suitable for administration by preschool teachers. For each characteristic, the individual gathering the information is to mark "Yes," "No," or
"?". A test obtaining a "No" response to any characteristic will not be given further consideration, as it clearly fails to meet at least one important purpose. Tests with a "Yes" for all characteristics are highly valued, but it is expected that at least some information may not be available through online sources and will require further research. Because of the potential time required to complete this research, the committee can undertake this research only for tests that are otherwise highly promising. Thus, tests with "?" can remain in the pool for now, and, depending on the characteristics of the set of tests that remain in the pool, they may be further researched or dropped.

At the second committee meeting, the spreadsheets are assembled and the collection of tests is reviewed to see which tests show the most promise on the basis of the first-stage review. The ultimate objective of this meeting is to reduce to a manageable size the number of tests on which more detailed information will be sought. The committee reviews rejected tests to ensure that everyone agrees with the reason that the reviewer rejected those tests. Disagreements are settled at this point by keeping tests in the pool. The final disposition of these assessments may depend on the number of clear winners in the pool. If there are many outstanding options to choose from, then there is little or no need to give further consideration to tests that may be marginal, but if there is a limited number of tests that have been scored positively across all dimensions, then these "iffy" tests might merit further examination and review. (Two committees confronted with the same information may make different decisions about the disposition of such tests, and there is no single right answer to the number of tests to consider for more detailed review.)
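The first-stage screening rule described above can be sketched in a few lines. This is only an illustration of the Yes/No/? logic: the function name, the group labels, and the criterion names in the usage note are hypothetical, and a real committee would still review each rejection by hand, as the scenario describes.

```python
def first_stage_screen(ratings):
    """Sort one test into a first-stage group from its criterion ratings.

    ratings: dict mapping a criterion name to "Yes", "No", or "?".
    A single "No" rejects the test; all "Yes" marks it as a top
    candidate; any remaining "?" keeps it in the pool pending research.
    """
    marks = list(ratings.values())
    unexpected = [m for m in marks if m not in ("Yes", "No", "?")]
    if unexpected:
        raise ValueError(f"unexpected rating(s): {unexpected}")
    if "No" in marks:
        return "rejected"
    if all(m == "Yes" for m in marks):
        return "top candidate"
    return "needs further research"
```

For instance, a test rated "Yes" on skills measured and age range but "?" on norming would be returned as "needs further research" rather than rejected, which matches the committee's decision to keep such tests in the pool for now.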
Thus, at this stage there are at least three groups of assessments: those for which additional review information will be sought, those that have been clearly rejected because of one or more "No" responses on the primary dimensions, and those that are seen as less desirable than tests in the top group but that are nevertheless not clearly rejected. It is helpful to rank-order the best of the tests in this last group. Occasionally, the more detailed review process may eliminate all of the top candidates, necessitating that one give further consideration to tests that were in this middle category.
Having rank-ordered these tests as alternates can save time in this situation.

After the second meeting, the committee members collect, distribute, and read detailed review information on the top assessments prior to the next meeting. The committee chair assembles technical information, including any information on test reliability relevant to each test, adding it to the spreadsheet. The most relevant information is kept for each test. For example, if specific information is available on reliability for 3- and 4-year-old children, this information is preferred over reliability information that is not delineated by age group. Similarly, information that is provided for specific subgroups of children, such as Spanish-speaking children, African American children, and children with disabilities, is recorded separately. For some tests, this information must be found in technical manuals or in published research that uses the test. Thus, for tests that look promising, an effort will be made to seek out this information through a broader search of the literature and technical documents from the publisher. If this information cannot be secured in time for the next meeting, the committee will consider extended efforts to get it, if the test otherwise looks promising.

Following the collection and distribution of the detailed review information, the committee meets for a third time to narrow down the list of acceptable tests to a set of top contenders. Factors to consider at this point include the technical information from the reviews as well as cost information. For each of the tests that fare well in this stage of the review process, every reasonable effort will be made to obtain a copy of the test, so that the full technical manual and administration procedures can be reviewed in-house. The review materials on each test will be examined to ensure that the test supports the kinds of inferences that Ms.
Conway and her teachers wish to make about their children's language skills and development. This judgment will be based on the information about reliability and validity that has been accumulated from all available sources.

Note 4: We know it may be difficult or expensive to obtain copies of tests and manuals, and it may not always be practical to do this. Workarounds may be possible, for example by tapping the expertise of committee members, bringing in a consultant familiar with the test and its manual, or relying on sample items or limited access arrangements on publisher websites. It is always preferable for decision makers to see the full instrument and its manual.

It is tempting to think that the best decision will be obvious and that everyone would make the same decision in the face of the same information, but each setting is somewhat different, and choosing between tests is a matter of balancing competing objectives. For example, reviewers may differ in how much weight they put on the desire for short testing times compared with the desire for high reliability and validity for all subgroups of children, or the desire for a single assessment compared with the desire to measure all of the identified skills. Thus, decisions may vary from setting to setting, or even between members of the same committee in a given setting. These differences can be reduced by deciding on specific weights for each criterion that all reviewers will use, but in most situations these differences of opinion will need to be resolved by the committee.

It is important to keep in mind that, at this point, the goal is simply to settle on a small slate of possible tests to review directly. The committee can always decide to keep an extra test in the review when opinions about it are divided. Some information will prevent a test from further consideration, such as a test that has been shown to function differently for language-minority children, children with disabilities, or other important subgroups (see the section on differential item and differential test functioning), or a test found to have poor reliability for one or more subgroups, or a test that is found to have special requirements for test administrators that cannot be met in the current setting.

Lack of information is not, in and of itself, a reason to reject a test. For example, a test that appears strong on all other criteria may have no information on its functioning for language-minority children.
Specifically, the published information may not discuss the issue of test bias, and there may be no normative information or validity studies that focus on the use of the test with this population. The decision that one makes about this test may depend largely on: (1) the strength of other tests in the pool with respect to their use with language-minority children, (2) the ability to locate information from other sources that can provide the missing information on the test in question, and (3) the capacity of the center to generate its own information on how the test functions with this population of children through systematic use of the test and
collection of data that can address this question. In the absence of strong alternatives, a center that has the capacity to use the test in a research mode prior to using the test operationally to make decisions on individual children might choose to do so.

There are two critical points to continue to keep in mind here. First, lack of information is not the same thing as negative information. Second, each suggests different courses of action. Negative information indicates that the test does not function as desired and should not be used for a particular purpose with a particular population. In contrast, lack of information simply indicates that it is not yet known how the test functions. Lack of information does not necessarily imply that the test is biased or functions poorly when used with the target population, but it also does not imply that the test can be assumed to function well in this population or to function comparably across populations of interest. Lack of information should not, by itself, lead to rejection of a test; rather, it should lead to a suspension of judgment about the test until relevant information can be located or generated. For a center that lacks the capacity to locate or generate such information, there may be no practical difference in these two situations for choosing an assessment at a given point in time. In either case, the test is of no use to the center at that point in time.

Having compiled all of the collected information on each of the tests, the committee evaluates the information to identify the top two or three tests that best meet the purposes that they detailed at the outset. This process amounts to weighing the strengths and weaknesses of each test, taking into account the dimensions that the committee has agreed are most important for their purposes, and taking into account when information might be lacking for a particular test.
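One way a committee might make this weighing explicit is a simple weighted score that keeps missing information visible instead of silently counting it against a test. The sketch below is only one possible approach; the function name, the rating scale, the criterion names, and the weights in the usage note are all hypothetical, and a score of this kind would inform the committee's judgment, not replace it.

```python
def weighted_review_score(ratings, weights):
    """Combine a test's criterion ratings into one weighted average.

    ratings: dict mapping criterion name to a numeric rating, or None
             when no information is available for that criterion.
    weights: dict mapping criterion name to its agreed importance.

    Returns (score, missing). The score averages only the criteria with
    known ratings, and `missing` lists the criteria on which judgment is
    suspended, so that lack of information is reported separately and
    is not treated as negative information.
    """
    known = {name: value for name, value in ratings.items() if value is not None}
    missing = sorted(name for name, value in ratings.items() if value is None)
    total_weight = sum(weights[name] for name in known)
    if total_weight == 0:
        return None, missing
    score = sum(weights[name] * value for name, value in known.items()) / total_weight
    return score, missing
```

For example, with ratings `{"reliability": 4, "cost": 2, "subgroup evidence": None}` and weights `{"reliability": 3, "cost": 1, "subgroup evidence": 2}`, the function returns a score of 3.5 together with `["subgroup evidence"]` as the list of open questions. Reporting the two side by side keeps a strong but under-documented test from being confused with a test that has known problems.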
Those tests rated as the top two or three will be obtained from the publisher (see note 4, above), along with the technical manuals and any supporting materials that accompany the test. All of this information will be examined firsthand by the committee. This review will typically involve a thorough and direct examination of test items and administration procedures, review of the rationale behind the format of the test and the construction of test items, and a complete reading of the administration guidelines and scoring procedures and information on the interpretation of test scores. The committee may also
elect to show the tests to the teachers who will use them, to have teachers rate the difficulty of learning to administer the test, and to pilot the tests with a few children in order to get a sense of how they react to the procedures. This information will be compiled, along with the technical and descriptive information about the test, the information on cost, and the committee's best judgment about any special infrastructure that might be needed to support a particular test (e.g., a test may require computerized scoring to obtain standard scores).

At this point, the committee can choose the test or tests that will best meet the assessment needs of the center. The decision about which test or tests to adopt will boil down to a compromise across the many criteria agreed on by the committee. In this case, these included the desire to have an assessment process that is both child and teacher friendly, minimizes lost instructional time, meets the highest standards of evidence for reliability and validity for the purposes for which assessment is being planned and with the particular kinds of children who make up the center's population, and can be purchased and supported within the budgetary limits set out by the director. To no one's surprise, no test has emerged that is rated at the top on all of the committee's dimensions. Nevertheless, the committee's diligence in collecting and reviewing information and in their deliberations has given them the best possible chance of selecting a test that will best meet their needs.

Selecting Tests for Multiple Related Entities

In this scenario we consider a consortium of early childhood programs that seeks to establish an assessment system to guide instructional decisions that can be used across all programs in the consortium. The process is similar in many respects to the process followed by Ms. Conway and the team at Honeycomb.
Unique to this situation are the facts that the consortium wishes to use assessment to guide instructional decision making and that the consortium would like to use the assessment system across all members of the consortium. These differences suggest that the processes adopted by Honeycomb should be modified in specific
ways, namely, in the construction of the committee and in the criteria for distinguishing among the tests.

The expansion of the test setting to multiple members of a consortium has specific implications for the constitution of the selection committee. It is critical that the committee that will clarify the purposes of assessment, gather and review test information, and ultimately select the test should be expanded to include representation from across the consortium. It may not be possible to have representation from each member on the committee, but some process should be put in place to ensure that the differing needs and populations across the member programs of the consortium are adequately represented on the committee. It is equally, if not more, important to ensure that the necessary expertise is present on the committee for clarifying assessment purposes, gathering and reviewing the technical information, and choosing among the tests. Just as choosing among the tests will involve weighing advantages and disadvantages and making compromises, with some elements nonnegotiable, establishing the committee to carry out the process will involve choices, compromises, and nonnegotiable elements to be decided on by the leadership of the consortium.

The expansion of the assessment setting to cover all members of a consortium also has implications for implementing the assessment plan. In the case of a single entity, it is immediately obvious who will be responsible for each phase of the assessment plan, from purchasing the assessment, to training those who will administer the test, to scoring, interpreting, and acting on the test.
When a consortium is involved and the desire exists to have all entities using the same assessment, a number of other questions must be addressed, and the consortium must decide whether only a single answer will be allowed to each question or whether individual members will be allowed to answer the question in different ways. For example, when will testing be conducted? Who will be responsible for conducting the assessment? Who will train the assessors, and who will coordinate the training? What steps will be taken to ensure that training quality is uniformly high and that all assessors have been trained and meet the same standards? Will results of assessments be shared across members of the consortium, and if so, in what way? Who will be responsible for collecting the data,
in what form will the data be collected, and how will the data be stored and aggregated for reporting purposes? Who will decide on report formats and the process of disseminating the results? This list is not exhaustive, but it highlights some of the additional challenges that arise when more than one entity is involved in the testing enterprise.

Another major difference between the current scenario and the Honeycomb scenario is the focus on using assessment results to guide instructional decisions. Using assessments to guide instructional decisions implies that assessments will occur at intervals throughout the year, which may imply that different assessments are used at different times during the year, or that different forms of the same assessments are used at different times during the year. In part this distinction hinges on the nature of the instructional decisions to be made throughout the year. Decisions that relate to monitoring progress in a single domain would generally argue for the use of different forms of the same assessment over time, whereas decisions that relate to the introduction of instruction in a new domain or transitioning from one form of instruction to another (e.g., from native language instruction to English instruction) might argue for the use of a different assessment.

Several questions must be considered when the focus is on guiding instruction. The first is whether or not the assessment is expected to assess progress against a specific set of standards set forth by the state, the district, the consortium, or some other entity. Ideally, there will not be multiple sets of standards against which performance must be gauged, as each set of standards potentially increases the number of behaviors that have to be assessed and monitored, and the more standards that exist, the more likely it becomes that sets of standards will come into conflict with one another.
A second major question that must be addressed is the distinction between status and growth. If the assessment is to monitor growth over time, it should be clear in what domain growth is being measured and whether growth in that domain is captured through quantitative change (i.e., change in level of performance), through qualitative change (i.e., change in type), or both. Measuring quantitative change requires that additional psychometric work has been done
on the instruments to develop a scale for tracking performance gains over time, and that it is clear how to interpret differences between scores at different points on the score scale. Major tests have begun introducing such developmental scales, as they are often called, but these are by no means ubiquitous, and the lack of a strong, psychometrically sound developmental scale can seriously hinder accurate interpretation of performance gains over time.

Finally, unlike the Honeycomb scenario, which focused on status at entry relative to national norms, the focus on using assessment to guide instruction suggests that the members of the consortium might well be interested in, and best be served by, a locally developed assessment. To the extent that the standards and instructional decisions are mostly local, it is far more likely that a locally developed assessment, tailored to reflect local standards and approaches to instruction, will meet the needs of the consortium. However, this likelihood also has implications for the test review and selection committee. In particular, locally developed tests are not likely to be covered in the available assessment reviews, and are not likely to have been developed to the same rigorous psychometric standards as tests that are intended for use with a broader audience. Thus, the committee may need to gather technical information on more assessments, and may find that little or no technical information is available for many of them. Information about test bias in particular is likely to be missing, with the result that it will have to be investigated in the local setting for the selected assessments.

Except for these major differences, the process for the consortium is much the same as the process for Honeycomb.
The consortium's committee must spend time clarifying their purposes for assessment and determining the precise reasons for using assessment, the kinds of decisions to be made on the basis of assessment results, and the domains to be assessed. The potential focus on multiple domains of assessment adds complexity to their task, namely, the need to differentiate between domains that may be highly related to one another, and the necessity of restricting the domains to a number that can be reasonably assessed. The process of gathering information about tests and the steps required to adequately review and choose between tests are essentially the same for the consortium committee and the Honeycomb committee. Although the consortium committee may decide to give priority to tests that can assess all of the domains that they have chosen to measure, it is unlikely that they will be able to restrict the review to such tests until later in the review process, when it is clear what tests are available to address their needs. Because the process of gathering information, reviewing it, and selecting among the tests is essentially the same as in the first scenario, that information is not repeated here.

Selecting Tests in a Program Evaluation Context

Finally, we consider Novatello School District, a large urban school district in which the school board has decided to incorporate child assessments into the evaluation of its new preschool initiative, which is aimed at improving children's school readiness, socioemotional development, and physical health. Novatello has a diverse population of children from many ethnic and linguistic backgrounds with considerable economic diversity in all ethnic groups and approximately 140 home languages other than English. In addition, Novatello provides kindergarten instruction either in English or in the native language for children whose primary language is either Spanish or Farsi, the two predominant languages among Novatello's school population. The Spanish language kindergartens are located throughout the district, whereas the Farsi programs are located in a small region, reflecting the housing patterns of the community.

Novatello's situation differs in important ways from the two previous scenarios. The program evaluation or accountability purpose of the assessment has the greatest implications for the design of the assessment system. The context of multilingual instruction carries further implications, which must be taken into account if the assessments are to enable valid inferences about the program's effects on children's school readiness, socioemotional development, and physical health.
Program evaluation or accountability carries with it significant implications for the use of assessments that were not present in the first two scenarios. In particular, in the prior scenarios, the assessments were decidedly low stakes; the decisions being made on the basis of the children's performance on the assessments had
minor consequences, if any, for the children and teachers. In the program evaluation context, one cannot assume that the consequences for children and teachers will be negligible. If program closure is a potential consequence of a poor evaluation outcome, then the consequences for both children and teachers are very high. If children might be prevented from entering kindergarten on the basis of the results of school readiness assessments, then the consequences for children are high. Similarly, if teachers' employment with the district or pay raises are tied to children's performance, then the consequences for teachers are high.

As the consequences associated with decisions based on assessment scores become greater, there is a correspondingly greater burden to demonstrate the validity of inferences that are based on those assessment scores, which in turn requires greater precision in assessment scores. Precision can be increased with uniformity in the assessment setting, standardization of instructions and scoring, and security of assessment information. However, with young children, efforts to standardize assessment conditions can create artificiality in the assessor-child interactions, which may negatively affect the validity of the assessment scores.

More importantly, the program evaluation context requires that scores obtained from children support inferences about the programs in which the scores were obtained, even though such assessments are designed to support inferences about children, not necessarily the programs that serve them. Determining whether these same assessment scores support valid inferences about the educational context in which the scores were obtained requires a level of abstraction beyond the inference from the score to the child. The validity of the inference from the score to the program cannot be assumed on the basis of the validity of inferences about children's abilities.
The validity of inferences about programs must also be demonstrated, just as the validity of inferences about children's knowledge, skills, and abilities must be demonstrated and cannot be assumed on the basis of assessment construction or other properties of assessment scores.

Reliance on child assessments in program evaluations carries an explicit assumption that differences between programs in child outcomes at the end of the year can be attributed to differences in
the educational quality of the programs. Unambiguous inferences about program differences on the basis of end-of-year differences in child performance are most justifiable when the assignment of children to programs has been controlled in some meaningful way, which is not generally the case. In the absence of controlled assignment, inferences about program differences require considerable care and caution, especially when those inferences are based, in part, on the results of child assessments. In particular, in the absence of controlled assignment, one must justify any assumption that differences between programs in child assessments are attributable only to differences between programs in factors that are under the control of the programs. Support for this assumption is context specific, and it may or may not be defensible in a single district, let alone in a single state. Thus, developing a suitable context for program evaluation will require substantial dialogue among program leaders to identify and address factors that differ among programs and that relate to differences in child outcomes but that are, nonetheless, outside the control of the programs. Failure to account for such differences will negatively affect the validity of inferences about differences in program quality that are based on differences in child outcomes.

In the current context, two factors that could affect the validity of inferences about programs based on child assessment results are the primary language of the child and the language of instruction used in the preschool program. The committee developing the assessment program for Novatello must determine the conditions governing whether children should be assessed in English or in their primary language.
Because the language of instruction model varies across programs that will be evaluated, and because children differ in their primary language within and between programs, there are several factors to consider. In Novatello, children are allowed primary language instruction prior to kindergarten along with English language development if they speak either Farsi or Spanish. These children will receive their instruction in kindergarten in their primary language, and thus there is consistency between the language of instruction prior to and during kindergarten. Because primary language instruction is not available in other languages, speakers of languages other than Spanish and Farsi are instructed prior to and during kindergarten in English.
The Novatello assessment development committee decides that children should be assessed in the language in which they are instructed for all assessment domains that link directly to skills and abilities related to instruction. At the same time, all children, including those instructed in a language other than English, will be assessed for English language acquisition because of the programs' focus on English acquisition for all children. The committee agrees that near-term outcome expectations for children must be adjusted to reflect their status as nonnative speakers of English and to reflect the language of instruction. These adjustments are agreed on in order to ensure that short-term performance expectations adequately reflect the different developmental trajectories of children who are at different stages of acquiring English. Although Novatello expects that all children who enter school at preschool or kindergarten will reach proficiency with English by the end of elementary school, they have established outcome expectations for preschool and kindergarten that reflect children's different backgrounds in order to set realistic and comparable performance expectations for all programs. Without these adjustments, programs in areas with high concentrations of nonnative speakers of English or children with the greatest educational needs would be disadvantaged by the evaluation system.

The Novatello assessment committee faces all the same challenges that were faced by Honeycomb and the consortium. They must define the domains of interest and all of the purposes of assessment. They must consider whether they are collecting child assessments for purposes other than program evaluation, such as to assess the different educational needs of entering children, to monitor learning and progress, and to make instructional decisions regarding individual children.
If their sole purpose is program evaluation, then it is not necessary to assess all children on all occasions; rather, a sampling strategy could be employed to reduce the burden of the assessment on children and programs, while still ensuring accurate estimation of the entry characteristics of the child population and program outcomes. Challenges of sampling include controlling the sampling process, ensuring that sampling is representative, and obtaining adequate samples of all subpopulations in each program, to the extent that outcomes for subgroups will be monitored separately. If, however,
program evaluation is not the primary purpose for collecting child assessment data, then the committee must clarify all of the primary purposes for assessing children and ensure that the instrument review and selection process adequately reflects all of these purposes, prioritizing them according to their agreed-on importance.

The expansion of the assessment framework to include such domains as socioemotional functioning and physical well-being does not fundamentally alter the instrument review and selection process. The committee will have to expand its search to identify available assessments and to locate review information on those assessments. However, the process itself of identifying assessments, gathering and reviewing technical information, considering training needs and challenges, and addressing issues of assessment use with learners from different cultural and linguistic backgrounds is not fundamentally different from the process used by Honeycomb to evaluate language assessments. Of course, the expansion to multiple domains and to domains outside of academic achievement makes the total scope of work much greater, and decreases the chances that a single assessment can be found that will address all of the committee's needs. Thus, issues relating to total assessment time across the set of selected assessments will likely lead to compromises in choosing assessments for any particular domain; the most thorough assessment of each domain may generate time demands and training requirements that are excessive when considering multiple domains.

Unlike the consortium context, in which aggregation of data and centralized reporting were an option to be discussed and decided on by the members of the consortium, the program evaluation context by definition requires that child assessment results will flow to a centralized repository and reporting authority.
Precisely what information will be centralized and stored and the process whereby such information will flow to the central agency can be a matter of discussion, but clearly there must be some centralization of child assessment results. The creation of an infrastructure that can support the collection and reporting of this information must be addressed by Novatello. This infrastructure may not fall under the purview of the assessment review and selection committee, but decisions made regarding the
infrastructure most definitely affect the committee's work. Some assessments may lend themselves more readily to use within the planned infrastructure than others, and this information should be considered in evaluating the usefulness of assessments. While ease of integration with the infrastructure would not drive a choice between two instruments that differ substantially in their technical adequacy, it could be a factor in choosing between two instruments of comparable technical merit. When examining the costs associated with the two assessments, the costs of incorporating the assessments into the reporting infrastructure must also be considered.

Summary

This section provides three different assessment scenarios that might arise in early childhood settings. They are intended to highlight the kinds of processes that one might establish to identify suitable instruments, gather information about those instruments, compile and evaluate the information, and ultimately select the instruments and make them operational for the stated purposes. While each new scenario introduces elements not present in the preceding ones, there is considerable overlap in key aspects of the process of refining one's purpose; identifying assessments; gathering, compiling, and reviewing information; and ultimately selecting instruments and making them operational in the particular context.

One other way in which all of the scenarios are alike is in the need for regular review. Like most educational undertakings, assessments and assessment programs should be subject to periodic review, evaluation, and revision. Over time, the effectiveness of assessment systems for meeting their stated purposes may diminish.
Regular review of the stated purposes of assessment, along with regular review of the strengths and weaknesses of the assessment system and consideration of alternatives (some of which may not have been available at the time of the previous review), can ensure that the individual assessments and the entire assessment system remain effective and efficient for meeting the organization's current purposes. If the process for selecting tests in the first place is rigorous and principled, the review and evaluation process will be greatly simplified.