7
Judging the Quality and Utility of Assessments

In this chapter we review important characteristics of assessment instruments that can be used to determine their quality and their utility for defined situations and purposes. We review significant psychometric concepts, including validity and reliability, and their relevance to selecting assessment instruments, and we discuss two major classes of instruments and the features that determine the uses to which they may appropriately be put. Next we review methods for evaluating the fairness of instruments, and finally we present three scenarios illustrating how the process of selecting assessment instruments can work in a variety of early childhood care and educational assessment circumstances.

Many tests and other assessment tools are poorly designed. The failure of assessment instruments to meet the psychometric criteria of validity and reliability may be hard for the practitioner or policy maker to recognize, but these failings reduce the usefulness of an instrument severely. Such characteristics as ease of administration and attractiveness are, understandably, likely to be influential in test selection, but they are of less significance than the validity and reliability considerations outlined here.

Validity and reliability are technical concepts, and this chapter addresses some technical issues. Appendix A is a glossary of words and concepts to assist the reader. Especially for Chapter 7, many readers may want to focus primarily on identifying the questions they need to ask about assessments under consideration and understanding the concepts well enough to appreciate the responses, rather than on a deep understanding of the statistical processes that determine how those questions can be answered.



VALIDITY AND RELIABILITY OF ASSESSMENTS

Before an assessment instrument or test is used for the purpose of making decisions about children, it is necessary to have evidence showing that the assessment does what it claims to do, namely, that it accurately measures a characteristic or construct (or “outcome” as we are referring to it in this report). The evidence that is gathered to support the use of an assessment is referred to as validity evidence. Generally, when one asks the question “Is the assessment doing what it is supposed to do?” one is asking for validity evidence. A special kind of validity evidence relates to the consistency of the assessment—this may be consistency over repeated assessment or over different versions or forms of the assessment. This is termed reliability evidence.

This chapter reviews the history and logic of validity and reliability evidence, especially as it pertains to infants and young children. It is important to note that, first, when judging validity or reliability, one is judging a weight of evidence. Hence, one does not say that an assessment is “valid” or is “reliable”; instead, one uses an accumulation of evidence of diverse kinds to judge whether the assessment is suitable for the task for which it is intended. Second, when mustering evidence for validity or reliability, the evidence will pertain to specific types of uses (i.e., types of decisions). Some forms of evidence inform a wider range of types of decisions than others. Nonetheless, one should always consider evidence as pertaining to a specific set of decisions.

Brief Overview of the History of Validity Evidence

The field of assessment of human behavior and development is an evolving one and has undergone many changes in the last half-century. Some changes are the result of developments in the field itself; others are responses to the social and political context in which the field operates. Validity is an enduring criterion of the quality and utility of assessments, although conceptions of what constitutes validity of assessments have changed over time.

Criterion Validity

Originally, the conception of assessment validity was limited to prediction—specifically, to the closeness of agreement between what the assessment actually assesses or measures and what it is intended to assess or measure (Cureton, 1951). Put differently, at the core of this definition of validity is the relationship between the actual scores obtained on a test or other assessment instrument and the score on another instrument considered to be a good assessment of the underlying “true” variable or construct. Under this model of validity—the criterion model—if there already exists a criterion assessment that is considered to be a true measure of the construct, then a test or other measurement instrument is judged to be valid for that construct if the latter instrument provides accurate estimates of the criterion (Gulliksen, 1950). The accuracy of the estimates is usually expressed as a correlation coefficient.

Among the advantages of the criterion model of validity are its relevance and potential objectivity. After a criterion has been specified, data can be collected and analyzed in a straightforward manner to ascertain its correlation with the measure being validated. It is not always easy, however, to identify a suitable or adequate criterion. When one considers criterion-related validity evidence, for example, the size of the correlation between test scores and criterion can differ across settings, contexts, or populations, suggesting that a measure be validated separately for every situation, context, or population for which it may be used. In many instances, criterion-related evidence is quite relevant to the interpretations or claims that can be made about the uses of assessments. In addition, questions about the validity of the criterion itself often remain unanswered or are difficult to answer without resorting to circular reasoning—for example, when scores on a test of cognitive development are the validity criterion for scores on a different test of cognitive development. Moreover, decisions involving the choice of a criterion involve judgments about the value of the criterion.
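
As a rough illustration of how criterion-related evidence is typically quantified, the sketch below computes a correlation between scores on a new instrument and scores on an established criterion measure. The data, variable names, and use of Python are assumptions for illustration only; the chapter prescribes no particular software or coefficient beyond the correlation itself.

```python
import numpy as np

def criterion_validity(new_scores, criterion_scores):
    """Return the Pearson correlation between a new measure and a criterion.

    A higher absolute correlation is read as stronger criterion-related
    evidence, although, as noted above, the size of the coefficient can
    differ across settings, contexts, and populations.
    """
    new_scores = np.asarray(new_scores, dtype=float)
    criterion_scores = np.asarray(criterion_scores, dtype=float)
    return np.corrcoef(new_scores, criterion_scores)[0, 1]

# Hypothetical scores for ten children on the new instrument and the criterion.
new = [12, 15, 9, 20, 14, 11, 18, 16, 10, 13]
criterion = [31, 38, 25, 47, 35, 28, 44, 40, 27, 33]
print(f"criterion-related validity coefficient: {criterion_validity(new, criterion):.2f}")
```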

The “Three Types of Validity” Approach

If agreement with a criterion were the only form of validity evidence, then one could never validate a measure in a new area, because there is no preexisting criterion in the new area. Thus, new and broader types of evidence were needed. The criterion model of validity was followed by a more nuanced and amplified view of validity, which identified three types: content, construct, and criterion validity.

Content Validity. The content model of validation seeks to provide a basis for validation without appealing to external criteria. The process of establishing content validity involves establishing a rational link between the procedures used to generate the test scores and the proposed interpretation or use of those scores (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 1999; Cronbach, 1971; Kane, 2006). In developing an assessment procedure or system, a set of specifications of the content domain is usually set forth describing the content areas in detail and the item types. Content here refers to the themes, wording, and format of the assessment items (e.g., tasks, questions) as well as the guidelines and procedures for administration and scoring.

Defining the content domain becomes critical because validity inferences can be challenged by suggestions that the domain definition is incomplete, irrelevant, or inappropriate. It is important to evaluate the appropriateness of an assessment tool’s content domain with respect to the proposed uses of that tool. For example, an off-the-shelf test that is used for the purposes of evaluating an educational program may cover content that is part of the program’s curriculum as well as material that was not part of that curriculum. It is then up to those who interpret the program evaluation results to evaluate the children’s achievement with respect to both the content-represented and content-unrepresented parts of the test.

Studies of alignment between early learning standards (e.g., state early learning standards, the Head Start Child Outcomes Framework) and assessments are a new variant of content-related validity evidence. Such standards are descriptions of what children should know and be able to do; benchmarks, a related concept, refer to descriptions of knowledge and skills that children should acquire by a particular age or grade level.

It is generally agreed by measurement professionals that content-related validity evidence is necessary but insufficient for validation. Other forms of validity evidence—such as empirical evidence based on relationships between scores and other variables—are also essential. The current shift in emphasis toward learning standards and aligned assessments does not alter this necessity for additional forms of validity evidence, and the growing consequences of assessments increase the importance of empirical evidence (Koretz and Hamilton, 2006).

Construct Validity. Construct validity was originally introduced by Cronbach and Meehl (1955) as an alternative to content and criterion validity for assessments that sought to measure attributes or qualities that are theoretically defined but for which there is no adequate empirical criterion or definitive measure nor a domain of content to sample. They went on to emphasize, however, that “determining what psychological constructs account for test performance is desirable for almost any test” (p. 282). In other words, even if an assessment is validated through content- and criterion-related evidence, a deeper understanding of the construct underlying the performance on the test requires construct-related evidence (Kane, 2006).

Construct validity is also concerned with what research methodologists refer to as “confounding” (Campbell and Stanley, 1966; Cook and Campbell, 1979). This refers to the possibility that an assessment procedure that is intended to produce a measure of a particular construct, such as a child’s level of quantitative knowledge, produces instead a measure that can be construed in terms of more than one construct. For example, a measure of a child’s quantitative knowledge might be confounded with the child’s willingness to cooperate with the stranger who is conducting the assessment. This reaction of the child to the assessor is thus a rival interpretation of that intended by the assessment procedure. To circumvent this rival interpretation, the assessment procedure might include more efforts to establish rapport between the child and the assessor, paying special attention to the fact that some children are temperamentally shyer than others. If no correlation can be observed between a measure of shyness or willingness to cooperate and the measure of quantitative knowledge, then the rival interpretation can be ruled out.
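
One concrete way to probe such a rival interpretation, sketched below with entirely hypothetical data and variable names, is to examine whether scores on the quantitative-knowledge measure are related to an independent rating of shyness or willingness to cooperate; a correlation near zero weakens the confounding interpretation, while a sizable one keeps it in play.

```python
from scipy.stats import pearsonr

# Hypothetical data: quantitative-knowledge scores and assessor ratings of
# each child's shyness (higher rating = shyer, less willing to engage).
quant_scores = [14, 9, 17, 12, 20, 8, 15, 11, 18, 13]
shyness_ratings = [3, 2, 4, 1, 3, 2, 5, 3, 1, 4]

r, p_value = pearsonr(quant_scores, shyness_ratings)
print(f"r(quantitative, shyness) = {r:.2f} (p = {p_value:.3f})")

# A correlation close to zero is evidence against the rival interpretation
# that the scores mainly reflect cooperation with the assessor; a strong
# negative correlation would leave that interpretation plausible.
```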

It is a mistake to think that construct validity applies only to measures of theory-based constructs. In this report we depart from some historical uses of the term “construct,” which limit the term to characteristics that are not directly observable but that are inferred from interrelated sets of observations. As noted in the Standards for Educational and Psychological Testing (1999), this limited use invites confusion because it causes some tests but not others to be viewed as measures of constructs. Following the Standards, we use the term “construct” more broadly as “the concept or characteristic that a test is designed to measure” (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 1999, p. 5).

Integrated Views of Validity. Current conceptions of assessment validity replace the content/criterion/construct trinitarian model and its reference to types of validity by a discussion of sources, or strands, of validity evidence, often including evidence regarding the consequences of the use of assessments. Cronbach (1971) argued that in order to explain a test score, “one must bring to bear some sort of theory about the causes of the test performance and about its implications” (p. 443). While recognizing the practicality of subdividing validity evidence into criterion, content, and construct, he called for “a comprehensive, integrated evaluation of a test” (p. 445). He emphasized that “one validates not a test, but an interpretation of data arising from a specified procedure” (p. 447). Messick (1989) echoed this emphasis. The aim of current conceptions of assessment validity is to seek information relevant to a specific interpretation and use of the assessments; many strands of evidence can contribute to an understanding of the meaning of assessments.

Validity as Argument

Kane’s (2006) treatment of validity is consonant with Messick’s approach, although Kane emphasizes a general methodology for validation based on validity conceptualized as argument. In Kane’s formulation, “to validate a proposed interpretation or use of test scores is to evaluate the rationale for its interpretation for use” (2006, p. 23).

In Kane’s approach, validation involves two kinds of argument. An interpretive argument specifies the proposed interpretations and uses of test results. This argument consists of articulating the inferences and assumptions that link the observed behavior or test performance to the conclusions and decisions that are to be based on that behavior or performance. The validity argument is an evaluation of the interpretive argument. “To claim that a proposed interpretation or use is valid is to claim that the interpretive argument is coherent, that its inferences are reasonable, and that its assumptions are plausible” (Kane, 2006, p. 23). In other words, the validity argument begins by reviewing the interpretive argument as a whole to ascertain whether it makes sense. If the interpretive argument is reasonable, then its inferences and assumptions are evaluated by means of appropriate evidence. Any interpretive argument potentially contains many assumptions. If there is any reason for not taking for granted a particular assumption, that assumption needs to be evaluated.

The interpretive argument makes explicit the reasoning behind the proposed interpretations and uses, so that it can be clearly understood and evaluated. It also indicates which claims are to be evaluated through validation. For example, a child assessment procedure or instrument usually takes some performances by or observations of the child that are intended to be a sample of all possible performances or observations that constitute the instrument’s target content domain. The procedure assumes that the child’s score on the instrument can be generalized to the entire domain, although the actual observed behaviors or performances may be only a small subset of the entire target domain. In addition, they may or may not be a representative sample of the domain. Standardization typically further restricts the sample of performances or observations by specifying the conditions of observation or performance. Although standardization is necessary to reduce measurement error, it causes the range of possible observations or performances to be narrower than that of the target domain. In other words, it can be seen that the interpretation of the child’s observed behavior or performance as an indicator of his or her standing in the target domain requires a complex chain of inferences and generalizations that must be made clear as a part of the interpretive argument.

An interpretive argument for a measure of children’s cognitive development in the area of quantitative reasoning, for example, may include inferences ranging from those involved in the scoring procedure (Is the scoring rule that is used to convert an observed behavior or performance by the child to an observed score appropriate? Is it applied accurately and consistently? If any scaling model is used in scoring, does the model fit the data?); to those involved in the generalization from observed score to universe of scores (Are the observations made of the child in the testing or observation situation representative of the universe of observations or performances defining the target cognitive domain? Is the sample of observations of the child’s behavior sufficiently large to control for sampling error?); to extrapolation from domain score to level of development (or level of proficiency) of the competencies for that domain (Is the acquisition of lower level skills a prerequisite for attaining higher level skills? Are there systematic domain-irrelevant sources of variability that would bias the interpretation of scores as measures of the child’s level of development of the target domain attributes?); to the decisions that are made, or implications drawn, on the basis of conclusions about developmental level on the target outcome domain (e.g., children with lower levels of the attribute are not likely to succeed in first grade; programs with strong effects on this measure are more desirable than those with weak effects).

The decision inference usually involves assumptions that rest on value judgments. These value assumptions may represent widely held cultural values for which there is societal consensus, or they may represent values on which there is no consensus or even bitter divisions, in which case they are readily identifiable for the purposes of validation. When the underlying decision assumptions represent widely held values, they can be difficult to identify or articulate for validation through scientific analysis.
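
One informal way to keep the chain of inferences just described visible during development and appraisal is to write each link down alongside the questions it raises, so that it is clear which links have been examined and which have not. The sketch below does this for the hypothetical quantitative-reasoning example; the structure, names, and wording are illustrative only and are not a format proposed by the chapter.

```python
# An illustrative (not prescribed) representation of an interpretive argument
# for a hypothetical quantitative-reasoning measure: each inference in the
# chain is paired with questions that validation evidence must address.
interpretive_argument = [
    ("scoring",
     ["Is the scoring rule appropriate?",
      "Is it applied accurately and consistently?",
      "If a scaling model is used, does it fit the data?"]),
    ("generalization",
     ["Are the observed tasks representative of the target domain?",
      "Is the sample of behavior large enough to control sampling error?"]),
    ("extrapolation",
     ["Are lower-level skills prerequisite to higher-level skills?",
      "Do domain-irrelevant sources of variability bias the scores?"]),
    ("decision",
     ["What value judgments underlie the intended uses?",
      "Are the consequences of decisions based on the scores defensible?"]),
]

for inference, questions in interpretive_argument:
    print(inference.upper())
    for question in questions:
        print(f"  - {question}")
```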

The interpretive argument may also involve highly technical inferences and assumptions (e.g., scaling, equating). The technical sophistication of measurement models has reached such a high degree of complexity that they have become a “black box” even for many measurement professionals (Brennan, 2006, p. 14). Moreover, as Brennan further points out, many measurement models are operationalized in proprietary computer programs that can sometimes make it difficult or impossible for users to know important details of the algorithms and assumptions that underlie the manner in which measurement data are generated.

Ideally, the interpretive argument should be made as a part of the development of the assessment procedure or system. From the outset, the goal should be to develop an assessment procedure or system congruent with the proposed interpretation and use. In addition, efforts to identify and control sources of unwanted variance can help to rule out plausible alternative interpretations. Efforts to make the assessment system or procedure congruent with the proposed interpretation and uses provide support for the plausibility of the interpretive argument. In practice, this developmental stage is likely to overlap considerably with the appraisal stage, but at some point in the process “a shift to a more arm’s-length and critical stance is necessary in order to provide a convincing evaluation of the proposed interpretation and uses” (Kane, 2006, p. 25). Kane views this shift as necessary because it is human nature (appropriate and probably inevitable) for the developers to have a confirmationist bias, since they are trying to make the assessment system as good as it can be. The development stage thus has a legitimate confirmationist bias: its purpose is to develop an assessment procedure and a plausible interpretive argument that reflects the proposed interpretations and uses of test scores.

After the assessment instrument or system is developed but still as a part of the development process, the inferences and assumptions in the interpretive argument should be evaluated to the extent possible. Any problems or weaknesses revealed by this process would indicate a need for alterations in either the interpretive argument or the assessment instrument. This iterative process would continue until the developers are satisfied with the congruence between the assessment instrument and the interpretive argument. This iterative process is similar to that of theory development and refinement in science; here the interpretive argument plays the role of the theory.

When the development process is considered complete, it is appropriate for the validation process to take a “more neutral or even critical stance” (Kane, 2006, p. 26). Thus begins the appraisal stage. If the development stage has not delivered an explicit, coherent, detailed interpretive argument linking observed behavior or performance to the proposed interpretation and uses, then the development stage is considered incomplete, and thus a critical evaluation of the proposed interpretation is premature (Kane, 2006). The following events should occur during the appraisal stage:

1. Conduct studies of questionable inferences and assumptions in the interpretive argument. To the extent that the proposed interpretive argument withstands these challenges, confidence in the claims increases. “If they do not withstand these challenges, then either the assessment procedure or the interpretive argument has to be revised or abandoned” (Kane, 2006, p. 26).

2. Search for hidden assumptions, including value judgments, seeking to make such assumptions explicit and subject them to scrutiny (e.g., by individuals with different values).

3. Conduct investigations of alternative possible interpretations of the scores. An effective way to challenge an interpretive argument is to propose an alternative, more plausible argument. The evaluation of plausible competing interpretations is an important component in the appraisal of the proposed interpretive argument.

Ruling Out Plausible Alternative Hypotheses

It is important to recognize that one never establishes the validity of an assessment instrument or system; rather, one validates a score, and its typical uses, yielded by the instrument (Messick, 1989). For example, depending on the circumstances surrounding an assessment (e.g., the manner of test administration, the characteristics of the target population), the same instrument can produce valid or invalid scores.

The essence of validity, then, can be stated in the question, “To what extent is an observed score a true or accurate measure of the construct that the assessment instrument intends to measure?” Potential threats to validity are extraneous sources of variance—or construct-irrelevant variance—in the observed scores. These extraneous or irrelevant sources of variance are typically called measurement error. As in the process of conducting scientific research, the validity question can be stated in the form of a hypothesis: “The observed score is a true or accurate reflection of the target construct.” The task of validating is to identify and rule out plausible alternate hypotheses regarding what the observed score measures. In a very fundamental sense, as is the case in science, one never “proves” or “confirms” the assessment hypothesis—rather, the successful assessment hypothesis is tested and escapes being disconfirmed. (The term assessment hypothesis is used here to refer to the hypothesis that specifies what the intended meaning of the observed score is, i.e., what the assessment instrument is intended to measure.) In this sense, the results of the validation process “probe” but do not prove the assessment hypothesis (Campbell and Stanley, 1966; Cook and Campbell, 1979). A valid set of scores is one that has survived such probing, but it may always be challenged and rejected by a new empirical probe. The task of validation, then, is to expose the assessment hypothesis to disconfirmation.

In short, varying degrees of confirmation are conferred upon the assessment hypothesis through the number of plausible rival hypotheses (Campbell and Stanley, 1966) available to explain the meaning of the observed scores. That is, the smaller the number of such rival hypotheses remaining, the greater the degree of confirmation of the assessment hypothesis. Thus, the list of potential sources of assessment invalidity is essentially a list of plausible hypotheses that are rival to the assessment hypothesis that specifies what the meaning of the observed score is intended to be. Studies need to be designed and conducted to test the tenability of each plausible rival hypothesis in order to determine whether each can be ruled out as a plausible explanation of the observed scores. Where the assessment procedure properly and convincingly “controls” for a potential source of invalidity, the procedure renders the rival hypothesis implausible.
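
The idea of extraneous variance in an observed score can be stated compactly in classical test theory notation. The chapter does not spell this out; the decomposition below is offered only as the standard backdrop for the discussion, under the usual assumption that error is uncorrelated with the true score.

```latex
% Classical test theory decomposition (standard notation, not from the chapter):
% X is an observed score, T the true score, E an error term uncorrelated with T.
\[
  X = T + E, \qquad \operatorname{Var}(X) = \operatorname{Var}(T) + \operatorname{Var}(E)
\]
% Validation asks whether T reflects the intended construct; construct-irrelevant
% variance either inflates E or systematically distorts T, weakening the assessment
% hypothesis that X accurately measures the target construct.
```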

elect to show the tests to the teachers who will use them, to have teachers rate the difficulty of learning to administer the test, and to pilot the tests with a few children in order to get a sense of how they react to the procedures. This information will be compiled, along with the technical and descriptive information about the test, the information on cost, and the committee’s best judgment about any special infrastructure that might be needed to support a particular test (e.g., a test may require computerized scoring to obtain standard scores). At this point, the committee can choose the test or tests that will best meet the assessment needs of the center.

The decision about which test or tests to adopt will boil down to a compromise across the many criteria agreed on by the committee. In this case, these included the desire to have an assessment process that is both child and teacher friendly, minimizes lost instructional time, meets the highest standards of evidence for reliability and validity for the purposes for which assessment is being planned and with the particular kinds of children that comprise the center’s population, and that can be purchased and supported within the budgetary limits set out by the director. To no one’s surprise, no test has emerged that is rated at the top on all of the committee’s dimensions. Nevertheless, the committee’s diligence in collecting and reviewing information and in their deliberations has given them the best possible chance of selecting a test that will best meet their needs.
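
A committee that wants to make this compromise across criteria explicit can tabulate it as a simple weighted scoring matrix, as sketched below. The criteria, weights, ratings, and test names are entirely hypothetical; the point is only to organize the trade-off the text describes, not to replace judgment, and nonnegotiable requirements (such as adequate validity evidence for the intended use) should be screened before any weighting.

```python
# Hypothetical criteria, weights, and 1-5 ratings; none of these numbers are
# recommendations. Higher ratings are always "better" on the stated criterion.
weights = {
    "validity_reliability_evidence": 0.40,
    "child_teacher_friendliness": 0.20,
    "instructional_time": 0.15,     # higher rating = less instructional time lost
    "training_infrastructure": 0.15, # higher rating = lighter training/support demands
    "cost": 0.10,                    # higher rating = more affordable within budget
}

ratings = {
    "Test A": {"validity_reliability_evidence": 5, "child_teacher_friendliness": 3,
               "instructional_time": 2, "training_infrastructure": 3, "cost": 4},
    "Test B": {"validity_reliability_evidence": 4, "child_teacher_friendliness": 4,
               "instructional_time": 4, "training_infrastructure": 4, "cost": 3},
}

for test, scores in ratings.items():
    total = sum(weights[criterion] * scores[criterion] for criterion in weights)
    print(f"{test}: weighted score = {total:.2f}")
```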

Selecting Tests for Multiple Related Entities

In this scenario we consider a consortium of early childhood programs that seeks to establish an assessment system to guide instructional decisions that can be used across all programs in the consortium. The process is similar in many respects to the process followed by Ms. Conway and the team at Honeycomb. Unique to this situation are the facts that the consortium wishes to use assessment to guide instructional decision making and that the consortium would like to use the assessment system across all members of the consortium. These differences suggest that the processes adopted by Honeycomb should be modified in specific ways, namely, in the construction of the committee and in the criteria for distinguishing among the tests.

The expansion of the test setting to multiple members of a consortium has specific implications for the constitution of the selection committee. It is critical that the committee that will clarify the purposes of assessment, gather and review test information, and ultimately select the test be expanded to include representation from across the consortium. It may not be possible to have representation from each member on the committee, but some process should be put in place to ensure that the differing needs and populations across the member programs of the consortium are adequately represented on the committee. It is equally, if not more, important to ensure that the necessary expertise is present on the committee for clarifying assessment purposes, gathering and reviewing the technical information, and choosing among the tests. Just as choosing among the tests will involve weighing advantages and disadvantages and making compromises, with some elements nonnegotiable, establishing the committee to carry out the process will involve choices, compromises, and nonnegotiable elements to be decided on by the leadership of the consortium.

The expansion of the assessment setting to cover all members of a consortium also has implications for implementing the assessment plan. In the case of a single entity, it is immediately obvious who will be responsible for each phase of the assessment plan, from purchasing the assessment, to training those who will administer the test, to scoring, interpreting, and acting on the test. When a consortium is involved and the desire exists to have all entities using the same assessment, a number of other questions must be addressed, and the consortium must decide whether only a single answer will be allowed to each question or whether individual members will be allowed to answer the question in different ways. For example, when will testing be conducted? Who will be responsible for conducting the assessment? Who will train the assessors, and who will coordinate the training? What steps will be taken to ensure that training quality is uniformly high and that all assessors have been trained and meet the same standards? Will results of assessments be shared across members of the consortium, and if so, in what way? Who will be responsible for collecting the data, in what form will the data be collected, and how will the data be stored and aggregated for reporting purposes? Who will decide on report formats and the process of disseminating the results? This list is not exhaustive, but it highlights some of the additional challenges that arise when more than one entity is involved in the testing enterprise.

Another major difference between the current scenario and the Honeycomb scenario is the focus on using assessment results to guide instructional decisions. Using assessments to guide instructional decisions implies that assessments will occur at intervals throughout the year, which may imply that different assessments are used at different times during the year, or that different forms of the same assessments are used at different times during the year. In part this distinction hinges on the nature of the instructional decisions to be made throughout the year. Decisions that relate to monitoring progress in a single domain would generally argue for the use of different forms of the same assessment over time, whereas decisions that relate to the introduction of instruction in a new domain or transitioning from one form of instruction to another (e.g., from native language instruction to English instruction) might argue for the use of a different assessment.

Several questions must be considered when the focus is on guiding instruction. The first is whether or not the assessment is expected to assess progress against a specific set of standards set forth by the state, the district, the consortium, or some other entity. Ideally, there will not be multiple sets of standards against which performance must be gauged, as each set of standards potentially increases the number of behaviors that have to be assessed and monitored, and the more standards that exist, the more likely it becomes that sets of standards will come into conflict with one another.

A second major question that must be addressed is the distinction between status and growth. If the assessment is to monitor growth over time, it should be clear in what domain growth is being measured, whether growth in that domain is captured through quantitative change (i.e., change in level of performance), or whether growth in that domain is captured through qualitative change (i.e., change in type), or both. Measuring quantitative change requires that additional psychometric work has been done on the instruments to develop a scale for tracking performance gains over time, and that it is clear how to interpret differences between scores at different points on the score scale. Major tests have begun introducing such developmental scales, as they are often called, but these are by no means ubiquitous, and the lack of a strong, psychometrically sound developmental scale can seriously hinder accurate interpretation of performance gains over time.

Finally, unlike the Honeycomb scenario, which focused on status at entry relative to national norms, the focus on using assessment to guide instruction suggests that the members of the consortium might well be interested in, and best be served by, a locally developed assessment. To the extent that the standards and instructional decisions are mostly local, it is far more likely that a locally developed assessment, tailored to reflect local standards and approaches to instruction, will meet the needs of the consortium. However, this likelihood also has implications for the test review and selection committee. In particular, locally developed tests are not likely to be covered in the available assessment reviews, and are not likely to have been developed to the same rigorous psychometric standards as tests that are intended for use with a broader audience. Thus, the committee may need to gather technical information on more assessments, and may find little or no technical information is available for many of them. Information about test bias in particular is likely to be missing, with the result that it will have to be investigated in the local setting for the selected assessments.

Except for these major differences, the process for the consortium is much the same as the process for Honeycomb. The consortium’s committee must spend time clarifying their purposes for assessment and determining the precise reasons for using assessment, the kinds of decisions to be made on the basis of assessment results, and the domains to be assessed. The potential focus on multiple domains of assessment adds complexity to their task, namely, the need to differentiate between domains that may be highly related to one another, and the necessity of restricting the domains to a number that can be reasonably assessed. The process of gathering information about tests and the steps required to adequately review and choose between tests are essentially the same for the consortium committee and the Honeycomb committee. Although the consortium committee may decide to give priority to tests that can assess all of the domains that they have chosen to measure, it is unlikely that they will be able to restrict the review to such tests until later in the review process, when it is clear what tests are available to address their needs. Because the process of gathering information, reviewing it, and selecting among the tests is essentially the same as in the first scenario, that information is not repeated here.

Selecting Tests in a Program Evaluation Context

Finally, we consider Novatello School District, a large urban school district in which the school board has decided to incorporate child assessments into the evaluation of its new preschool initiative, which is aimed at improving children’s school readiness, socioemotional development, and physical health. Novatello has a diverse population of children from many ethnic and linguistic backgrounds with considerable economic diversity in all ethnic groups and approximately 140 home languages other than English. In addition, Novatello provides kindergarten instruction either in English or in the native language for children whose primary language is either Spanish or Farsi, the two predominant languages among Novatello’s school population. The Spanish language kindergartens are located throughout the district, whereas the Farsi programs are located in a small region, reflecting the housing patterns of the community.

Novatello’s situation differs in important ways from the two previous scenarios. The program evaluation or accountability purpose of the assessment has the greatest implications for the design of the assessment system. The context of multilingual instruction carries further implications, which must be taken into account if the assessments are to enable valid inferences about the program’s effects on children’s school readiness, socioemotional development, and physical health.

Program evaluation or accountability carries with it significant implications for the use of assessments that were not present in the first two scenarios. In particular, in the prior scenarios, the assessments were decidedly low stakes; the decisions being made on the basis of the children’s performance on the assessments had minor consequences, if any, for the children and teachers. In the program evaluation context, one cannot assume that the consequences for children and teachers will be negligible. If program closure is a potential consequence of a poor evaluation outcome, then the consequences for both children and teachers are very high. If children might be prevented from entering kindergarten on the basis of the results of school readiness assessments, then the consequences for children are high. Similarly, if teachers’ employment with the district or pay raises are tied to children’s performance, then the consequences for teachers are high.

As the consequences associated with decisions based on assessment scores become greater, there is a correspondingly greater burden to demonstrate the validity of inferences that are based on those assessment scores, which in turn requires greater precision in assessment scores. Precision can be increased with uniformity in the assessment setting, standardization of instructions and scoring, and security of assessment information. However, with young children, efforts to standardize assessment conditions can create artificiality in the assessor-child interactions, which may negatively affect the validity of the assessment scores. More importantly, the program evaluation context requires that scores obtained from children support inferences about the programs in which the scores were obtained, even though such assessments are designed to support inferences about children, not necessarily the programs that serve them.
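
A standard index of the score precision mentioned above, not developed in the chapter but common in psychometrics, is the standard error of measurement, which ties precision directly to score reliability.

```latex
% Standard error of measurement (a standard psychometric relation, not from the
% chapter): sigma_X is the standard deviation of observed scores and rho_XX' the
% reliability coefficient; higher reliability yields a smaller SEM, i.e., more
% precise individual scores.
\[
  \mathrm{SEM} = \sigma_X \sqrt{1 - \rho_{XX'}}
\]
```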

Determining whether these same assessment scores support valid inferences about the educational context in which the scores were obtained requires a level of abstraction beyond the inference from the score to the child. The validity of the inference from the score to the program cannot be assumed on the basis of the validity of inferences about children’s abilities. The validity of inferences about programs must also be demonstrated, just as the validity of inferences about children’s knowledge, skills, and abilities must be demonstrated and cannot be assumed on the basis of assessment construction or other properties of assessment scores.

Reliance on child assessments in program evaluations carries an explicit assumption that differences between programs in child outcomes at the end of the year can be attributed to differences in the educational quality of the programs. Unambiguous inferences about program differences on the basis of end-of-year differences in child performance are most justifiable when the assignment of children to programs has been controlled in some meaningful way, which is not generally the case. In the absence of controlled assignment, inferences about program differences require considerable care and caution, especially when those inferences are based, in part, on the results of child assessments. In particular, in the absence of controlled assignment, one must justify any assumption that differences between programs in child assessments are attributable only to differences between programs in factors that are under the control of the programs. Support for this assumption is context specific, and it may or may not be defensible in a single district, let alone in a single state. Thus, developing a suitable context for program evaluation will require substantial dialogue among program leaders to identify and address factors that differ among programs and that relate to differences in child outcomes but that are, nonetheless, outside the control of the programs. Failure to account for such differences will negatively affect the validity of inferences about differences in program quality that are based on differences in child outcomes.

In the current context, two factors that could affect the validity of inferences about programs based on child assessment results are the primary language of the child and the language of instruction used in the preschool program. The committee developing the assessment program for Novatello must determine the conditions governing whether children should be assessed in English or in their primary language. Because the language of instruction model varies across programs that will be evaluated, and because children differ in their primary language within and between programs, there are several factors to consider. In Novatello, children are allowed primary language instruction prior to kindergarten along with English language development if they speak either Farsi or Spanish. These children will receive their instruction in kindergarten in their primary language, and thus there is consistency between the language of instruction prior to and during kindergarten. Because primary language instruction is not available in other languages, speakers of languages other than Spanish and Farsi are instructed prior to and during kindergarten in English.

The Novatello assessment development committee decides that children should be assessed in the language in which they are instructed for all assessment domains that link directly to skills and abilities related to instruction. At the same time, all children, including those instructed in a language other than English, will be assessed for English language acquisition because of the programs’ focus on English acquisition for all children. The committee agrees that near-term outcome expectations for children must be adjusted to reflect their status as nonnative speakers of English and to reflect the language of instruction. These adjustments are agreed on in order to ensure that short-term performance expectations adequately reflect the different developmental trajectories of children who are at different stages of acquiring English. Although Novatello expects that all children who enter school at preschool or kindergarten will reach proficiency with English by the end of elementary school, it has established outcome expectations for preschool and kindergarten that reflect children’s different backgrounds in order to set realistic and comparable performance expectations for all programs. Without these adjustments, programs in areas with high concentrations of nonnative speakers of English or children with the greatest educational needs would be disadvantaged by the evaluation system.

The Novatello assessment committee faces all the same challenges that were faced by Honeycomb and the consortium. They must define the domains of interest and all of the purposes of assessment. They must consider whether they are collecting child assessments for purposes other than program evaluation, such as to assess the different educational needs of entering children, to monitor learning and progress, and to make instructional decisions regarding individual children. If their singular purpose is program evaluation, then it is not necessary to assess all children on all occasions; rather, a sampling strategy could be employed to reduce the burden of the assessment on children and programs, while still ensuring accurate estimation of the entry characteristics of the child population and program outcomes. Challenges of sampling include controlling the sampling process, ensuring that sampling is representative, and obtaining adequate samples of all subpopulations in each program, to the extent that outcomes for subgroups will be monitored separately.
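
One simple version of the sampling strategy mentioned above is to draw a random sample of children within each program, stratified by the subgroups whose outcomes will be reported. The sketch below assumes a flat roster of child records with hypothetical field names; it is illustrative only and is not a recommended sampling plan.

```python
import random
from collections import defaultdict

def stratified_sample(children, n_per_stratum, seed=0):
    """Sample up to n_per_stratum children from each (program, subgroup) cell.

    `children` is a list of dicts with hypothetical keys "id", "program",
    and "subgroup"; small cells are taken in full rather than oversampled.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for child in children:
        strata[(child["program"], child["subgroup"])].append(child)

    sample = []
    for cell in strata.values():
        k = min(n_per_stratum, len(cell))
        sample.extend(rng.sample(cell, k))
    return sample

# Hypothetical roster: two programs, two home-language subgroups, five children per cell.
roster = [{"id": i, "program": p, "subgroup": s}
          for i, (p, s) in enumerate([("Program 1", "Spanish"), ("Program 1", "Farsi"),
                                      ("Program 2", "Spanish"), ("Program 2", "Farsi")] * 5)]
print(len(stratified_sample(roster, n_per_stratum=3)))  # 12 children sampled
```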

If, however, program evaluation is not the primary purpose for collecting child assessment data, then the committee must clarify all of the primary purposes for assessing children and ensure that the instrument review and selection process adequately reflects all of these purposes, prioritizing them according to their agreed-on importance.

The expansion of the assessment framework to include such domains as socioemotional functioning and physical well-being does not fundamentally alter the instrument review and selection process. The committee will have to expand its search to identify available assessments and to locate review information on those assessments. However, the process itself of identifying assessments, gathering and reviewing technical information, considering training needs and challenges, and addressing issues of assessment use with learners from different cultural and linguistic backgrounds is not fundamentally different from the process used by Honeycomb to evaluate language assessments. Of course, the expansion to multiple domains and to domains outside of academic achievement makes the total scope of work much greater, and decreases the chances that a single assessment can be found that will address all of the committee’s needs. Thus, issues relating to total assessment time across the set of selected assessments will likely lead to compromises in choosing assessments for any particular domain; the most thorough assessment of each domain may generate time demands and training requirements that are excessive when considering multiple domains.

Unlike the consortium context, in which aggregation of data and centralized reporting were an option to be discussed and decided on by the members of the consortium, the program evaluation context by definition requires that child assessment results flow to a centralized repository and reporting authority. Precisely what information will be centralized and stored and the process whereby such information will flow to the central agency can be a matter of discussion, but clearly there must be some centralization of child assessment results. The creation of an infrastructure that can support the collection and reporting of this information must be addressed by Novatello. This infrastructure may not fall under the purview of the assessment review and selection committee, but decisions made regarding the infrastructure most definitely affect the committee’s work. Some assessments may lend themselves more readily to use within the planned infrastructure than others, and this information should be considered in evaluating the usefulness of assessments. While ease of integration with the infrastructure would not drive a choice between two instruments that differ substantially in their technical adequacy, it could be a factor in choosing between two instruments of comparable technical merit. When examining the costs associated with the two assessments, the costs of incorporating the assessments into the reporting infrastructure must also be considered.

Summary

This section provides three different assessment scenarios that might arise in early childhood settings. They are intended to highlight the kinds of processes that one might establish to identify suitable instruments, gather information about those instruments, compile and evaluate the information, and ultimately select the instruments and make them operational for the stated purposes. While each new scenario introduces elements not present in the preceding ones, there is considerable overlap in key aspects of the process of refining one’s purpose; identifying assessments; gathering, compiling, and reviewing information; and ultimately selecting instruments and making them operational in the particular context. One other way in which all of the scenarios are alike is in the need for regular review. Like most educational undertakings, assessments and assessment programs should be subject to periodic review, evaluation, and revision. Over time, the effectiveness of assessment systems for meeting their stated purposes may diminish. Regular review of the stated purposes of assessment, along with regular review of the strengths and weaknesses of the assessment system and consideration of alternatives—some of which may not have been available at the time of the previous review—can ensure that the individual assessments and the entire assessment system remain effective and efficient for meeting the organization’s current purposes. If the process for selecting tests in the first place is rigorous and principled, the review and evaluation process will be greatly simplified.
