9
Implementation of Early Childhood Assessments

As noted in earlier chapters, there is a substantial body of evidence on the importance of considering the reliability and validity of early childhood assessments, both in the selection of measures and in understanding and interpreting the information obtained from them. In addition to looking at the psychometric properties of the assessment tools themselves, there is emerging evidence that it is also important to attend carefully to the ways in which assessments are actually carried out. Indeed, as noted in Chapter 7, problems with implementation can pose a challenge to the validity of the data obtained. A poorly trained assessor, or a child so distracted that she does not engage fully with the assessment, can lead to questionable data.

Careful consideration of implementation issues can also help advance the underlying goals of assessment. For example, if the goal is to use ongoing monitoring or evaluation to strengthen early childhood programs, then planning for implementation can include consideration of how results will be summarized and communicated to programs. These issues may be particularly salient when early childhood assessments are implemented on a broad scale, for example, when assessments are carried out focusing on a population of children or of early childhood programs.

The purpose of this chapter is to summarize the emerging evidence on implementation issues in conducting early childhood assessments. That is, we complement the earlier summary of research that looks within the assessment tools by considering the evidence on the way in which they are implemented. Relative to the substantial body of work on the reliability and validity of specific early childhood assessments, research on issues of implementation is much more limited. While summarizing available evidence, this chapter also identifies areas in which future research could contribute to the understanding of implementation issues in early childhood assessment.

The discussion of implementation issues is organized into three areas, moving sequentially from preparation for administration to the actual administration, and then to follow-up steps:

1. Preparation for administration: clarifying the purpose of assessment, communicating with parents, training of assessors, and protection against unintended use of data.
2. Administration of assessments: degree of familiarity of the child with the assessor, children's responses to the assessment situation, issues in administering assessments to English language learners, and adaptations for children with special needs.
3. Following up on administration: helping programs use the information from assessments and taking costs to programs into account in planning for next steps.

Preparing for Administration

Determining and Communicating the Purpose of the Assessments

In a summary of principles of early childhood assessment that continues to serve as an important resource (see Meisels and Atkins-Burnett, 2006; Snow, 2006), the National Education Goals Panel (Shepard, Kagan, and Wurtz, 1998) identified four underlying purposes for conducting assessments of young children.
They cautioned that problems can occur when there is a lack of clarity or agreement about the underlying purpose of carrying out assessments, because decisions about which assessment is used, the circumstances under which the information is collected, who is assessed, the technical requirements for the assessment, and how
the information is communicated all follow from the purpose. As one illustration, Shepard, Kagan, and Wurtz (1998) note that assessments for the purpose of improving instruction have the least stringent requirements for reliability and validity, while assessments with high stakes have the most stringent ones. Assessments to guide instruction can be gathered repeatedly over the course of the year through observations in the classroom, and instruction can be modified as the most recent observations update and change earlier information. This flexibility is not present for high-stakes assessment, in which information gathered in one or only a few assessments must provide a sufficient basis for important decisions.

In a more recent discussion of this issue, Mathematica Policy Research (2008) similarly notes that while careful attention needs to be paid to standardization in the implementation of early childhood assessments when the goal is evaluation research, there is greater flexibility in administration when the goal is screening. For example, for screening purposes it may be warranted to repeat the administration of an item if doing so helps establish the child's best possible performance. Chapter 7 includes a detailed explanation of the process of matching assessments to purposes.

Shepard and colleagues (1998) also cautioned against inappropriate uses of assessment resulting from poor understanding of the purpose. For example, screening assessments, intended to provide an initial indication of whether a child should receive in-depth diagnostic assessment by a specialist, are sometimes inappropriately used to make a final determination of children's special needs. Screening assessments are also sometimes used to guide instruction, without the further detailed information that is needed to glean how children's learning is progressing in relation to a set of goals or a curriculum.
Assessments carried out by teachers through ongoing observation in the classroom (such as work sampling) have sometimes been used for ongoing program monitoring, although it has been questioned whether data collection of this type is sufficiently reliable to be used for this further purpose. Use of data from ongoing classroom observations for a purpose other than informing instruction also has the potential to introduce bias, as incentives or consequences come to be connected to teacher reports (Snow, 2006).

Interviews carried out with staff in a small but nationally representative sample of Head Start programs regarding implementation of the Head Start National Reporting System (NRS) suggest that there was ambiguity as to whether the information from the child assessments was to be used for evaluation and monitoring purposes (with the intent of informing program improvement and tracking whether improvements were occurring over time) or whether it was intended for high-stakes purposes (to make determinations about program funding). Staff in 63 percent of the programs in this study indicated that they felt it was not clear how the results of the assessment were going to be used (Mathematica Policy Research, 2006). This study concluded that when systems of early childhood assessment are implemented, information should be shared with programs about how data will be used. Furthermore, if the intent is to guide program improvement, results at the program level should be shared in time to guide decisions for the coming year, and guidance should be provided on how to use those results.

Communicating with Parents

A further issue of importance in planning for the implementation of early childhood assessments is whether informed consent is required of parents and how parents will be informed of results. Mathematica Policy Research (2006) reports that in the representative sample of Head Start programs studied to document implementation of the NRS, nearly all programs had informed parents that their children would be participating in the assessments. However, there was ambiguity as to whether informed consent was needed. In the second year of implementation, two-thirds of the programs in this sample had obtained written consent from parents, a substantial increase over the proportion collecting written consent in the first year.
Thus, in preparing for administration of early childhood assessments, a clear decision should be made about whether informed consent will be obtained from parents, and that decision should be consistently implemented.* Mathematica Policy Research notes in addition that the availability of written information for parents about planned assessments would help to ensure that parents receive uniform information, both regarding their children's participation in an assessment and for interpreting results when they become available.

*A report of the spring 2006 NRS administration was published in 2008 and was received too late for inclusion here.

Assessor Training

The quality of data obtained from child assessments relies heavily on the appropriate training of assessors. A process for certifying that assessors have completed training and are prepared to administer an assessment reliably has now been implemented in multiple large-scale studies involving early childhood assessments. These include FACES (the Family and Child Experiences Survey), the Early Head Start Research and Evaluation Study, and the Early Childhood Longitudinal Study-Birth Cohort (Mathematica Policy Research, 2008; Spier, Sprachman, and Rowand, 2004). In the Early Head Start Research and Evaluation Study, for example, the certification process included videotaping interviewers administering the adaptation of the Bayley Scales of Infant Development developed for that study. The interviewers evaluated their own adherence to the administration protocol, and then their administration of the assessment was reviewed and rated by research staff. Interviewers were required to score 85 percent or above on a set of criteria on two tapes in order to be certified to administer the assessment.

Mathematica Policy Research (2006) describes the results of direct observations carried out as more than 300 assessments of Head Start children were conducted for the NRS. Each administration was coded using a set of criteria similar to those noted by Mathematica Policy Research (2008), such as coaching the child, deviations from the assessment script, and errors in scoring particular items.
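A certification decision of the kind described here can be illustrated with a brief sketch. The criterion labels below, and the rule that at least two taped administrations must each meet the 85 percent threshold, are illustrative assumptions rather than the actual coding scheme used in these studies.

```python
# Hypothetical criteria an observer might code for each taped administration.
# These labels are illustrative, not the studies' actual codes.
CRITERIA = [
    "no_coaching",
    "follows_script",
    "allows_nonverbal_response",
    "neutral_encouragement",
    "correct_item_scoring",
]

def tape_score(codes: dict) -> float:
    """Percentage of criteria met on one taped administration."""
    met = sum(1 for c in CRITERIA if codes.get(c, False))
    return 100.0 * met / len(CRITERIA)

def is_certified(tapes: list, threshold: float = 85.0) -> bool:
    """Certified only if at least two tapes each score at or above the threshold."""
    passing = [t for t in tapes if tape_score(t) >= threshold]
    return len(passing) >= 2
```

In practice the coded criteria would be far more numerous and specific (coaching, script deviations, item-scoring errors), but the decision logic reduces to aggregating coded behaviors into a score per tape and applying a fixed threshold.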
Results indicated that 84 percent of assessments were conducted in such a way that assessors would have been certified. The use of a certification process with scoring of specific types of deviations from an assessment protocol allows for identification of the types of problems occurring in administration. According to
Mathematica Policy Research (2008), administration issues in the FACES early childhood assessment identified in the certification process included nonneutral encouragement of children, coaching, failure to allow for nonverbal responding, deviation from the script developed to standardize administration, and errors in scoring particular items in the assessment battery. Providing feedback to assessors on their errors can help establish and maintain adherence to the assessment protocol.

Protection Against Unintended Use of Data

Maxwell and Clifford (2004) note that there is always a possibility that early childhood assessments may be used for high-stakes purposes even when that was not the original intent of data collection. For example, data collected for tracking and monitoring the overall functioning of a program may be used to make decisions about the progress of individual children or teachers. Maxwell and Clifford note that protections should be put in place against inappropriate uses of data for high-stakes decision making when that was not the intent and when the data do not have the technical characteristics needed for such purposes. One possibility for protecting against the unintended use of data for high-stakes decisions about children is to collect data for a sample of children rather than for all children in a program (with the caution that an appropriate sampling approach needs to be developed). Another possibility involves a data entry and reporting system that provides reports only at the level of analysis intended (such as the program level) rather than for individual children or classrooms.

Administration of Assessments

Degree of Familiarity of the Assessor to the Child

A study by Kim, Baydar, and Greek (2003) raises the possibility that having a familiar person present during an assessment may influence children's assessment results.
In this study, having someone familiar, such as a parent, present in the room in addition to the assessor was associated with higher scores for children ages 6 to 9 on a measure of receptive vocabulary assessed in the home as part of the National Longitudinal Survey of Youth-Child Supplement. This finding suggests that a familiar presence may help the child relax and focus during an assessment. It is also possible that the causal direction runs the other way: children who have closer, more supportive, and stimulating relationships with their parents, and who therefore may tend to score higher on a vocabulary assessment, may also tend to have parents who want to stay with them and monitor a situation in which an unfamiliar adult is present. In addition, in this study, when the child's and the assessor's race matched, the race-related gap in assessment scores on measures of vocabulary, reading, and mathematics was significantly reduced.

Counterbalancing these findings are reports from the study by Mathematica Policy Research (2006) indicating that familiarity between the assessor and the child can also pose difficulties. In the small but representative sample of Head Start programs in which implementation of the NRS was studied, teachers were used as assessors in 60 percent of programs. Furthermore, teachers were often permitted to assess the children in their own classes (reported in 75 percent of programs that used teachers as assessors). According to the report, teacher assessors sometimes became frustrated when a child responded incorrectly to an assessment question the teacher felt the child knew the answer to (for example, when the child could name more letters than he or she responded to correctly on the letter-naming task). Teachers sometimes felt uncomfortable with the standardization required for the assessments, especially not being able to provide praise when the child performed well.
Some children also reportedly became concerned because of the discrepant behavior of their teachers in not providing positive feedback.

Systematic study of the effects of assessor familiarity on children's assessment scores would make an important contribution. While evidence to date concerns variation in children's scores and reactions to the assessment situation when familiarity with the assessor has varied naturally (that is, at the discretion of families or programs regarding who should be present during an assessment), an important next step would be to randomly
assign children to be assessed by someone familiar or unfamiliar. Such work should examine not only children's outcomes when familiarity with the assessor is varied, but also how fidelity of administration may vary with the assessor's familiarity to the child, along with observations of the child's reactions in the assessment situation.

Children's Responses to the Assessment Context

Researchers have begun to study various factors that contribute to assessment burden in young children. For example, the length of assessments in relation to children's performance is a topic receiving increasing attention. Sprachman et al. (2007) observe that "researchers need to balance the desire to assess many domains of child development against the potential threats to measurement posed by long administrations. Minimizing child burden while maintaining high reliability of estimates of achievement is an ongoing objective" (p. 3919). Efforts have been made to develop abbreviated versions of assessments, such as the short form of the Bayley Scales of Infant Development developed for the Early Childhood Longitudinal Study-Birth Cohort, in order to minimize burden (Spier et al., 2004). That abbreviation process involves multiple empirical steps. Shortening an assessment or using only selected items always requires great care in ensuring the validity and reliability of the abbreviated measure, as explained in Chapter 7.

Another approach to reducing respondent burden when young children are involved focuses on limiting the duration of assessment or splitting assessments into multiple sessions. As a starting point, research to date has examined children's performance under naturally occurring variation in the length of assessments. It is important to note, as do the researchers involved in these early studies, that this is only a starting point for examining this set of issues.
It needs to be followed by research that intentionally varies the duration of assessment and examines children's responses to the assessment context as well as their performance. As the researchers note, when duration is studied as it occurs naturally, the length of an assessment may reflect how many items a child can complete or how long a child can persist in an assessment task.
One recent study examined variations in children's performance associated with session length on the assessments carried out for the Preschool Curriculum Evaluation Research Study (PCER; Rowand et al., 2005). While the FACES early childhood assessments and the assessments carried out for the Head Start Impact Study required about 20 minutes to administer, and the NRS took approximately 15 minutes, the PCER assessment battery was substantially longer, requiring about 60 minutes. Because the PCER study was designed to evaluate the full range of impacts of different early childhood curricula, it was important that multiple domains of development be assessed. An important question, however, was whether the longer assessment had implications for the children's performance. Rowand et al. (2005) found that children who took longer to complete the PCER assessments scored higher, probably because these children were administered more items before reaching their ceiling.

These researchers also asked whether children generally scored as well on literacy subtests administered earlier versus later in the assessment battery. They found that 63 percent of children showed consistent performance on the early- and late-administered literacy assessments. The 37 percent of children whose performance varied on earlier versus later subtests of the same domain included 21 percent who scored worse as the assessment proceeded (perhaps reflecting fatigue with the long assessment) and 16 percent who scored better on the related assessment carried out later in the session.

In a sample of 1,168 preschool-age children, 228 needed two sessions instead of one to complete the assessment. Performance on four key outcomes did not differ significantly according to the number of sessions required to complete the assessment.
However, interviewers rated children as more persistent, more likely to sit still, and less likely to make frequent comments if they completed the assessment in one session. These results suggest that long assessment batteries may be difficult for some young children to complete, and that it is important to train assessors to identify when to take breaks or split the administration. The authors of this study note the need for a random assignment study in which children are assigned to complete the same battery of assessments in one versus two sessions. This would eliminate issues of self-selection in the research
to date, with children who need to complete the assessment in two sittings showing differing initial characteristics.

In another study focusing on data from the PCER study, researchers asked whether scores would have differed systematically if different "stop rules" had been used, that is, if different requirements had been set for the number of incorrect responses needed before discontinuing the Woodcock-Johnson Letter-Word Identification subtest or the Applied Problems mathematics subtest (Sprachman et al., 2007). According to these researchers, "the WJ III tests use stop rules that often take children into questions that are well beyond their ability, which can result in frustration for both the child and the assessor. Although these rules add just a few minutes to the assessment on any one test, the extra minutes have a cumulative effect" (p. 3919). The researchers note that because the Woodcock-Johnson assessments were intended to assist in determining whether individual children needed special services, care was taken to build in conservative rules for deciding when a child could no longer respond correctly: the child had to give six incorrect responses and reach the bottom of the items on a particular easel. While not varying administration, this study examined what scores would have been if the stop rule had instead been six incorrect items without going to the bottom of an easel, or three incorrect items.

Stop-rule procedures were particularly important to scores on the Applied Problems subtest. For example, there was an exact score match for only 64 percent of children's scores in the fall preschool administration and 56 percent of children's scores in the spring administration when the stop rule of three items was used as opposed to standard scoring.
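The rescoring exercise described here can be sketched as follows. The sketch simplifies the actual WJ III discontinue rules: it treats "six incorrect" as six consecutive incorrect responses and uses a hypothetical fixed easel grouping, whereas the published basal and ceiling rules are more detailed.

```python
def raw_score(responses, easel_size, n_incorrect, finish_easel):
    """Number of items answered correctly before a stop rule halts testing.

    responses   -- list of 1 (correct) / 0 (incorrect) in administration order
    easel_size  -- items per easel page (a hypothetical grouping)
    n_incorrect -- consecutive incorrect responses that trigger discontinuation
    finish_easel -- if True, keep administering to the bottom of the current easel
    """
    score = 0
    run = 0          # current run of consecutive incorrect responses
    stop_at = None   # index at which administration ends once the rule fires
    for i, r in enumerate(responses):
        if stop_at is not None and i >= stop_at:
            break
        score += r
        run = 0 if r else run + 1
        if run >= n_incorrect and stop_at is None:
            if finish_easel:
                # continue through the last item on the current easel
                stop_at = ((i // easel_size) + 1) * easel_size
            else:
                stop_at = i + 1
    return score
```

Comparing `raw_score` under the conservative rule and the abbreviated rules on the same hypothetical response string shows how an earlier discontinue point can yield a lower raw score for the same child, which is the source of the score mismatches reported above.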
While the match was better for the Letter-Word Identification subtest, scores matched closely in both the fall and spring assessments for only about three-fourths of the children (74 percent in the fall and 77 percent in the spring). At the same time, however, the cross-time stability of scores, correlations across the two subtests, and prediction from the subtest scores to other measures in the same domain were very similar whether the standard stop rule or an abbreviated stop rule was used. Caution thus appears to be needed in assuming that scores will be similar under differing stop rules. Yet, as the researchers note, further work will be needed to examine whether systematically varying
stop rules can affect overall performance by diminishing the total length of assessments or the sense of frustration in answering questions that are difficult for a child.

Length of assessment was sometimes a concern even in the substantially shorter NRS assessment. Head Start staff in the selected sites of the implementation study were asked about their perceptions of the children's reactions to the NRS assessment (Mathematica Policy Research, 2006). The responses were mixed. Staff in 63 percent of the programs sampled indicated that "most children responded positively" to the assessment. At the same time, 43 percent of the staff members interviewed felt that the assessment protocol was too long and that this contributed to behavioral issues in the children. Behaviors of concern to staff included children becoming bored or restless during the Peabody Picture Vocabulary Test (PPVT) or letter-naming tasks and needing redirection. Children sometimes pointed again and again to the same quadrant of the PPVT rather than varying their responses to match the word provided. By staff report, however, some children enjoyed the one-on-one time that the assessment permitted. Staff members also often reported that children's comfort level with the assessment situation increased from the fall to the spring assessment. It would be valuable to examine children's assessment scores in light of assessor perceptions of child comfort, in order to determine whether children's comfort level is associated with higher scores.

Administration for Children Who Are Learning English

Multiple implementation issues arise in administering early childhood assessments to children who are learning English.
These include the order of assessments when they will be carried out in two languages; the length and potential burden to children of receiving the assessment in two languages; the availability of skilled bilingual assessors; and the adequacy of training for conducting assessments in two languages. The assessment of children who are learning English also requires a reconsideration of the purposes of assessment. We note that issues pertaining to the content, reliability, and validity of assessments in a language other than English are covered in Chapter 8.
Revisiting the Issue of Underlying Purpose When Assessing in More Than One Language

The decision to administer an assessment in both the home language and English to a child who is learning English is closely tied to the purpose of the assessment for English language learners and to the goals of instruction for these children in their early care and education program (Espinosa, 2005). For example, if the intent is to measure how far along a child is in learning English, it might suffice to assess only in English once he or she has passed a screener in English. Another possible purpose for assessing children who are English language learners is to assess their maintenance of and progress in their home language while they are learning English. If this is the goal, then it would be important to assess the child in both languages, and analyses would report on both. Yet another possibility is that the aim of the assessment is to measure the child's mastery of certain concepts or of overall vocabulary, irrespective of the language involved. If this is the goal, an appropriate assessment practice would be to encourage a child to respond to assessment questions in either the home language or English and to feel free to use both.

The availability of new approaches both to screening and to the administration of assessments to children who are learning English will help make it possible to select procedures that align with the underlying purpose. Thus, for example, new language routing procedures were developed for the First Five LA Universal Preschool Child Outcomes Study, a study that needed to address the challenge of having many children in the study population learning English from a range of different home languages (Mathematica Policy Research, 2007).
The new routing procedures involve three steps: asking parents about the child's language use, examining the child's performance on two subtests from the Oral Language Development Scale or the Pre-Language Assessment Scale (Pre-LAS) (Duncan and De Avila, 1998), and observing the language in which the child tends to respond on a conceptually scored receptive vocabulary test. The routing procedures provide for the possibility that the initial language of assessment may be revised during the course of administration in response to the child's spontaneous language use.
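The three-step logic of a routing procedure of this kind can be sketched as follows. The parent-report categories, the treatment of the screener as a simple pass/fail, and the majority rule for switching languages are all illustrative assumptions, not the study's actual specifications.

```python
def initial_language(parent_report: str, screener_pass: bool) -> str:
    """Steps 1 and 2: parent report plus screener performance pick the
    starting language of assessment. Category labels are hypothetical."""
    if parent_report == "mostly_english":
        return "english"
    return "english" if screener_pass else "home_language"

def revised_language(current: str, child_responses: list) -> str:
    """Step 3: if the child mostly responds in the other language during a
    conceptually scored receptive vocabulary task, switch the language of
    assessment mid-administration (an assumed majority rule)."""
    if not child_responses:
        return current
    other = "home_language" if current == "english" else "english"
    share_other = child_responses.count(other) / len(child_responses)
    return other if share_other > 0.5 else current
```

The point of the sketch is the structure, not the thresholds: routing combines a static decision before administration with a dynamic revision rule driven by the child's observed language use.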
The conceptual scoring on the receptive vocabulary assessment is intended to acknowledge that children learning English may have mastered particular words in one or the other language, giving the child the opportunity to show mastery of vocabulary across languages. This matches the purpose, noted above, of assessing overall mastery of concepts and vocabulary rather than vocabulary in a particular language, an approach that will not be appropriate if the underlying purpose is to assess retention of the home language or progress in English. The important point is that the range of options for routing and for approaches to assessing children learning English is expanding and will enable better matching with the underlying purpose of assessment.

Order of Administration

Questions about the order of administration of assessments for children learning English arose in the initial year of the NRS and resulted in a change in practice (Mathematica Policy Research, 2006). In the first year, all children receiving the assessment in both Spanish and English started with the English assessment. However, there was feedback that this was discouraging to children whose mastery of English was still limited. There was concern that scores on the Spanish language assessment were being affected by these children's initial negative experience with the English assessment.

In the second year of administration, the order was reversed, so that the Spanish version of the assessment was always to be given first to children receiving the assessment in both Spanish and English. Interestingly, this too caused some problems, particularly in the spring administration. By this point, children who were accustomed to speaking only English in their Head Start programs were not always comfortable being assessed in Spanish.
According to Mathematica Policy Research, the children's discomfort may have arisen for several different reasons: they may have been taught not to speak Spanish in their Head Start programs, their Spanish may never have been very strong, or their Spanish may have been deteriorating. There were also some observed deviations from the sequencing of the
assessments in the small observational study of assessments conducted in both Spanish and English. Three of 23 programs that participated in this study were observed continuing to administer the assessment in English prior to the Spanish version after the change in administration guidelines.

These findings indicate that when the decision is made to administer assessments in two languages, the decision about order of administration is not an easy one, because there are potential issues with either ordering. Decisions about ordering may need to take into account the nature and goals of the early childhood program, especially whether the primary goal is to maintain two languages or to introduce English. There is a need for systematic study of whether scores for young children learning English vary according to the order of administration of home language and English versions of assessments.

Length of Administration

The NRS implementation study found that administration of the Spanish assessment took several minutes longer than the English assessment (18.6 compared with 15.8 minutes). In addition, children who received the assessments in two languages had to spend double that time or a little more in the assessment situation. The guidance that sites received was to try to administer both assessments the same day, but to reserve the English language assessment for another day if the child seemed bored or tired. Interviews with program staff about their experiences in administering the NRS assessment indicated concern about the burden to Spanish-speaking children of taking the assessment in two languages (Mathematica Policy Research, 2006). There is a need for systematic study of whether children's assessment scores are related to whether assessments in two languages are conducted in a single session or broken up into two sessions.
Availability of Bilingual Assessors and Trainers

A further issue may be finding assessors who are sufficiently bilingual to administer assessments in both Spanish and English. Although the study of assessments conducted in both Spanish and
English as part of the NRS was small, it helps to identify issues that other large-scale systems of early childhood assessment may face. For example, results reported by Mathematica Policy Research (2006, p. 29) indicate that "observers at about half the sites with observed Spanish assessments reported that some Spanish-language assessors either were not very fluent in Spanish or knew Spanish to speak but not to read; they had difficulty reading or pronouncing words in the assessment, and in rare cases, had difficulty communicating with the children (for example, they had trouble understanding questions in Spanish)." In addition, 17 percent of the programs in the study sample administering the assessment in Spanish indicated that there was a problem with finding certified trainers who could provide training on the Spanish version of the assessment.

Overall, while 84 percent of the observed English language administrations of the NRS protocol achieved a certification score of 85 percent or higher and would have been certified, the proportion of observed Spanish language administrations that attained or surpassed the certification criterion was 78 percent. Analyses have not been reported on whether children's assessment scores are related to assessors' fluency in Spanish or to the degree to which assessors would have met certification criteria.

These results indicate that an important set of issues for those setting up a system of early childhood assessment with an increasingly diverse population of children will be not only finding appropriate assessments but also finding people qualified to administer the assessments in Spanish and other languages, as well as ensuring an appropriate process for training on the administration of the assessment in languages other than English.
Inclusion of Children with Disabilities in a System of Assessment

The 2008 Mathematica Policy Research report identifies preparation for working with children with disabilities as a key issue in training assessors. In preparing to conduct assessments at a particular site, assessors need to be trained to collect information on appropriate accommodations for individual children: for example, to ascertain whether an aide should be present, whether children need to take frequent breaks, or whether it is important to confirm that hearing aids or other assistive devices are working properly. It is possible that certification on assessments could include a requirement to tape an assessment with a child who has a disability. Such a procedure would help to ensure that assessors are aware of and are implementing appropriate practices for children with special needs.

In the small study of NRS implementation, 30 of 35 programs reported carrying out assessments with children with disabilities. Staff in these programs usually indicated that they were comfortable with the accommodations made for these children. However, about one in six programs would have liked additional information on when to include children with disabilities in the assessment process and when to exempt them, and on the kinds of accommodations that were appropriate during the assessments. Some direct observations of assessments carried out as part of the study indicated that children who could have been exempted were nonetheless being assessed. These findings suggest that in implementing a system of early childhood assessments, it is a high priority to articulate clearly the decision rules for including children with disabilities in the assessments, as well as to provide appropriate training for assessors on the use of accommodations.

Following Up on Administration

Guiding the Use of Information from Assessments

Key implementation decisions for a system of early childhood assessments do not stop once the assessments have been administered and the data analyzed and summarized. Decisions have to be made about how assessment results will be reported back to programs and program sponsors/funding agencies, and what guidance will be provided on how programs should use the information from the assessments.
Fundamental decisions need to be made about how results will be used when assessments are carried out for program monitoring and evaluation or for high-stakes purposes.
Turning again to the study of implementation carried out as part of the NRS, problems with the guidance provided to programs on how to use the results of the assessments often concerned the unit of analysis. In more than half of the sample programs in the study, respondents felt that it would have been more useful to report results at the classroom or center level rather than the program level (which may have involved multiple centers), because those were the units in which quality improvement efforts were most meaningful. Furthermore, about half of the programs participating in the study indicated that local assessments (such as ongoing observational assessment through work sampling) were more useful for program improvement purposes than the program-level results of the NRS, because results were available more quickly, covered a wider range of domains of children's development, and could be summarized at the classroom level or even for individual children.

Thus, when designing a system of assessment, it is important to look forward in time to the point of communicating results and to consider in advance the extent to which results are appropriate for use in program improvement, as well as how best to summarize them so that implications for programs are clear.

Assessing the Costs of Implementing a System of Assessment

Finally, a key follow-up step involves taking stock of the costs of the assessments to programs. Limited research is available on this issue. Direct examination of the costs of purchasing materials, conducting training, and implementing early childhood assessments would be extremely valuable. Some pertinent findings come from program directors participating in the NRS who reported their perceptions of the costs of implementation (Mathematica Policy Research, 2006).
These data should be seen as a starting point in the examination of this issue, not only because of the small sample size but also because director perceptions were not accompanied by direct measures of costs. In this study, 77 percent of the program directors interviewed indicated that implementation of the NRS assessments had imposed substantial in-kind as well as monetary costs on their programs. An in-kind cost they reported was that of having staff taken away from their usual activities, including instruction of children, to conduct assessments. A monetary cost they reported was the need to hire substitute teachers so that teachers could carry out the assessments, or to hire contract staff to conduct the assessments.

Information on costs to programs can be used as input into decisions for the future about the frequency of assessments (for example, whether to conduct them at one or multiple time points), whether assessments are conducted universally or for a sample of children, and whether resources need to be made available to programs to cover the additional costs of assessments.

Conclusion

Emerging evidence indicates that implementing a reliable and valid system of early childhood assessment requires careful consideration not only of which assessments to use but also of how they are prepared for, how they are put into practice, and how results are communicated to programs. In the next chapter we stress the particular importance of these issues in large-scale, systemwide implementation of assessments. However, such issues as clear communication of the purpose of the assessments, consistent practices regarding communication with parents and obtaining informed consent, training of assessors, circumstances of administration to children, appropriate training and assessment practice for children learning English as well as for children with disabilities, and communication of results to programs are important whether the assessments occur only within specific programs or at a broader level, such as across a state or for a national program. There is a clear need for research focusing explicitly on such issues as how child performance may vary as a function of variations in the length of assessment, familiarity of the assessor, and procedures for assessing children who are learning English.