7
Options and Strategies

This section of the report summarizes the various suggestions for practical steps that could be taken in making performance assessments in adult education useful to educators and students, psychometrically acceptable, adequate for national reporting purposes, and feasible without overtaxing the personnel or fiscal resources of any state or program. The strategies presented are grouped under the problem they are meant to address.

PROBLEM 1: LIMITED RESOURCES FOR THE DEVELOPMENT OF ASSESSMENTS

A common refrain from workshop participants was that considerable resources are required to create, pilot test, score, and norm assessments of any sort, as well as develop guidelines for interpreting their results. Developing good assessments is time-consuming and expensive, and it also demands specific expertise that is somewhat rare and may be difficult to access. Thus, it would be inefficient for each program or even each state to develop its own assessments, even if the resources were available to do so. Furthermore, in the current funding situation, many smaller states and states with particularly limited resources for adult education are simply unable to assume the task of developing assessments on their own. Workshop participants offered a variety of strategies to address the resource issue.



The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.





Pooling Resources Through Consortia

One strategy for overcoming the problems posed by limited resources is to form consortia. Yen, Braun, Plake, and Impara suggested that states form consortia in which they could pool their resources to find the expertise needed and to do the work required to develop assessments useful to all of them. In forming consortia, states would have to team up with others that have defined the content to be assessed in similar ways. The states within a consortium would also need to have adult education programs with a similar profile (in the percentage of English-language learners or the distribution of GED versus employment preparation students, for example) and thus with similar demands on the assessments. As Barbara Plake said, “When you have limited resources and a common set of regulations, it makes great sense … to circle the wagons and maximize the utility of the resources you have in developing these programs.” The work of the National Institute for Literacy with Equipped for the Future (EFF) could also produce benefits similar to those of a consortium, as there is a defined domain and predetermined assessments.

Utilizing Test Publishers’ Resources

Another strategy is to utilize the resources available through test publishers. Involving test publishers in this work has a number of potential advantages: publishers can access the expertise needed for test development, they have the fiscal resources to invest in it, and they stand to gain from well-designed tests because they are in a position to market and profit from them. Several speakers suggested establishing agreements with publishers to develop assessments for particular purposes that can be used by many states or state consortia and can also be marketed more widely. This would be an effective way to reduce demands on state resources while developing usable assessments. Wendy Yen recommended that directors of adult education programs seek guidance from state testing directors in the K-12 sector because they are highly skilled in working with publishers to develop the kinds of tests they want.

Collaborating with Psychometricians

Workshop speakers encouraged consultation with psychometricians, professionals with highly specialized training in designing and implementing assessment programs. Workshop presenters such as Ronald Hambleton, Stephen Sireci, and other psychometricians expressed their willingness to become involved in the challenges currently facing adult basic education. Indeed, one of the encouraging messages of the workshop was the enthusiasm and interest with which the psychometricians in attendance addressed the issues formulated by the adult education specialists. One suggestion that arose from workshop discussion was that the federal government establish a panel of expert psychometricians to provide guidance to the Department of Education (DOEd) on issues related to the National Reporting System (NRS) and other measurement concerns.

Prioritizing Assessment Goals

A final strategy discussed was to prioritize assessment goals so as to make test development more manageable. Several presenters recommended narrowing the domain coverage assessed as one means to accomplish this. Not all aspects of student growth or program functioning need to be extensively assessed or assessed with shared instruments. The demands of test development could be greatly reduced by being practical and focused in thinking about what needs to be assessed for the purposes of program and/or state comparisons.

PROBLEM 2: DEVELOPING A USABLE SUITE OF ASSESSMENTS

A common refrain throughout the workshop was that a single assessment, no matter how perfect, will never serve all needs. One suggestion aimed at improving the assessment landscape within adult education was to think about a serviceable “suite” of assessments, that is, a variety or array of tests, including multiple-choice tests and various kinds of performance assessment tasks, available for use by adult educators. These tests could be used for particular purposes, including instruction, local benchmarking, within-state program evaluation, and national reporting. They could be adapted to the needs of the various groups served by adult basic education programs, including GED students, adults with literacy problems related to learning disabilities, adult ESL learners with and without educational experience and literacy skills in their first languages, and so on.

Workshop participants seemed to agree that assessments that are divorced in content from the goals of instruction are not useful for the student or the teacher; in the ABE system, as within the K-12 system, alignment of standards, curriculum, and assessment is key. Kit Viator pointed out that the state of Massachusetts placed a great deal of emphasis on the alignment of both content and performance standards with assessments. Leah Bricker discussed the work of Project 2061 of the American Association for the Advancement of Science (AAAS) on developing an analysis procedure for the alignment of K-12 math and science assessments with national and state standards. AAAS’s procedure reveals the degree of alignment between a state’s standards and its assessments; this is helpful for states that are evaluating the alignment of assessment tasks to specific learning goals. Achieving alignment requires formulating standards (much harder in ABE than in K-12) and including measures of curricular content, as well as selecting or generating appropriate assessments. One aspect of ABE programs that must be considered in the context of an array of assessments is that students often come to their programs without a specific goal or credential in mind.

In making suggestions about the components of a suite of assessments, workshop participants noted the importance of strategies for using technology, making decisions about when to use which assessments, and improving practitioners’ and administrators’ knowledge base about the values and limitations of assessment. They also noted the trade-offs associated with developing and using performance assessments. First, like all kinds of assessments, performance assessments require a clear definition of curricular goals and content. Second, performance assessments are expensive and technically difficult to develop; the current system may be too restricted by limited funding, time, and expertise to develop high-quality performance assessments. Third, good performance assessments take a lot of time, which must be subtracted from instructional time (which is quite limited in ABE). If instructors do not see the connection between assessments and instruction, they can undermine validity in their presentation of the assessment exercise to the student. Moreover, many students are mandated by a court or social services office to attend a program; others come voluntarily but do not seek any particular credential; this makes it hard to define what outcomes can be judged as “good enough.” If students perceive there is little at stake for them, they may be unmotivated to perform their best, and this, too, can undermine validity.

A number of workshop participants stressed that good assessment systems are dynamic. They should be expected to change over time in order to remain current. Their development is never finished because the perfect test has never been written. When tests are used for purposes of accountability and of supporting instruction, assessment items should be shared with the public at regular intervals. This is done regularly at the K-12 level, as both Viator and Mark Moody noted. Public release of items means development must be ongoing. Thus, assessments—the items used, the scoring, the guidelines for administration, and so on—need to be reconstituted regularly to incorporate lessons learned from previous administrations.

Making Use of Available Technology

On the one hand, developing computer-administered tests can be extremely expensive and requires specialized expertise; on the other hand, these tests can greatly decrease the testing burden by providing brief, efficient, individually adapted versions of tests, and they can increase the value of test data by ensuring accurate calculation of students’ proficiency levels. Plake suggested that computer delivery has the advantages of on-demand administration and immediate scoring to provide preliminary test results at the conclusion of the testing session, and that it can achieve higher precision of measurement with fewer items and potentially less administration time. Although much assessment in adult education will continue to take place without the use of technology, Cascallar, Braun, Impara, and others strongly urged the strategic use of technology for certain limited purposes. Sireci said that computer-based testing (CBT) could minimize testing time and be widely accessible to students in remote locations. He recommended reviewing how the Test of English as a Foreign Language (TOEFL) and ACCUPLACER are able to administer their tests in a cost-efficient manner.1 Hambleton noted that CBT could be useful in developing shorter and more precise assessments targeted to ability and in improving the ease of scoring and testing security.
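To illustrate how adaptive delivery can achieve precision with fewer items, as Plake and Hambleton describe, here is a minimal sketch of item selection under a Rasch (one-parameter IRT) model. The item bank, difficulty values, and the simple ability-update rule are hypothetical simplifications for exposition, not a description of TOEFL, ACCUPLACER, or any operational test.

```python
import math

# Hypothetical item bank: difficulty parameters on a logit scale.
ITEM_BANK = {"easy": -1.5, "medium": 0.0, "hard": 1.5, "harder": 2.5}

def p_correct(theta, b):
    """Rasch model: probability that a learner with ability theta
    answers an item of difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def information(theta, b):
    """Fisher information of a Rasch item; largest when b is near theta,
    which is why adaptive tests need fewer items for the same precision."""
    p = p_correct(theta, b)
    return p * (1.0 - p)

def next_item(theta, remaining):
    """Adaptive rule: pick the unused item most informative at the
    current ability estimate."""
    return max(remaining, key=lambda name: information(theta, ITEM_BANK[name]))

def update(theta, b, correct, step=0.5):
    """Crude ability update: move by the 'surprise' (observed minus
    expected); real CATs use maximum-likelihood or Bayesian estimation."""
    return theta + step * ((1.0 if correct else 0.0) - p_correct(theta, b))

# Simulated session: start at theta = 0 and answer two items.
theta = 0.0
remaining = set(ITEM_BANK)
first = next_item(theta, remaining)            # difficulty closest to theta
theta = update(theta, ITEM_BANK[first], correct=True)
remaining.discard(first)
second = next_item(theta, remaining)           # harder item after a success
```

After a correct answer the ability estimate rises, so the next item selected is more difficult; this targeting is the mechanism behind the shorter, more precise assessments discussed above.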
Finally, Henry Braun endorsed the idea of using technology as a vehicle for delivering professional development. He offered the following advice:

A professional development program that combines a couple of hours of contact time with rich materials on the Web may be a way of circumventing some of these issues for the teachers. If you create a higher level of expertise among teachers, the investment pays off enormously in terms of their influence on the students.

1 For the Test of English as a Foreign Language (TOEFL), see www.toefl.org [May 14, 2002]. For ACCUPLACER, see www.collegeboard.com/accuplacer/html/accupla1.html [May 14, 2002].

Using Test Development to Create Professional Development

At the workshop, concerns were voiced about the demands of the NRS in adult education and the increased demands for accountability in education in general. These concerns derive in part from the fear that tests used for comparability judgments will supplant tests that practitioners know to be useful in their own instruction. Enhancing teachers’ understanding of the variety of assessments that are available, the purposes for which they should be used, and their specific demands could help practitioners use the full range of assessments in a more targeted way.

Limited resources and an enormous need to improve instruction are likely to remain constant characteristics of the ABE system. Under these circumstances, Sireci suggested using the NRS procedure as a mechanism for supporting instructional enhancement and as an opportunity for professional development. He commented that work with K-12 teachers in item-writing workshops, standard-setting studies, and content validity studies has been informative and useful both in developing better tests and in improving teachers’ instructional practice. Too few states offer professional development on developing assessments, reviewing student performance in a guided way, formulating standards for acceptable performance, and scoring assessments. As noted in the overview of the adult education system in Chapter 2, many classrooms are served by instructors who lack training or expertise for their major task of teaching, let alone for implementing and using assessments. Speakers raised concerns that a focus on developing assessment strategies without concomitant attention to instruction would constitute a misdirection of resources. Hambleton stressed the need to train adult education teachers in topics such as constructing and scoring a test and interpreting data, to ensure that tests are used appropriately in the classroom and that test results are understood by educators.

Using Performance Assessments Appropriately

In many situations, teachers and students appear to value performance assessments over other sorts of assessments. Performance assessments have face validity and a sense of authenticity, and they are thought to have considerable educational value because of their capacity to reflect a wide variety of accomplishments and to be connected organically to the material taught. However, data from performance assessments are also time-consuming to collect and analyze, and they provide less direct comparability between students, programs, and states than other forms of assessment. Thus, many speakers at the workshop suggested that performance assessments be used selectively and in combination with a variety of other assessment instruments, including standardized multiple-choice tests. Several speakers endorsed the use of a mix of performance and traditional assessments for instructional purposes in the classroom. It was suggested that optimal use of assessments would be ensured if programs and educators were given a strategy for selecting the combination of assessments that would match the time and money available and provide the information needed.

Cultivating Existing Knowledge About Performance Assessment

Discussions at the workshop called attention to the fact that performance assessments can benefit from the contributions of committed amateurs, but they cannot achieve sufficient levels of validity, reliability, and comparability without the substantial involvement of a professional psychometrician. Chapters 3 and 5 outlined clear procedures for creating performance assessments. Knowledge of the assessment process and of the Standards for Educational and Psychological Testing (American Educational Research Association et al., 1999), little of which seems to be currently available within the adult education community, would help states and local programs make educated decisions about implementing performance assessments and would provide valuable guidance during the development process.

Developing Program Support for Alternative Assessments

Well-designed and well-maintained portfolios are one component of a suite of assessments useful for instructional purposes that could also be used as input to a summative assessment of student progress.
Plake stressed the potential of portfolios for fulfilling the reporting needs of the states while at the same time providing useful information about students’ progress. However, she cautioned that cost, time, and resources are concerns. Furthermore, the reliability of portfolios as a single basis for student assessment has been challenged in the K-8 educational system (Koretz, 1994). Additionally, if instructors are devoting time and energy to the compilation of portfolios, some attention to the question of their optimal use for a variety of purposes is clearly warranted. Plake and several others suggested that using portfolios as part of a larger suite of assessments is an idea that deserves further exploration.

PROBLEM 3: DATA QUALITY

Thus far the primary focus of this report has been on quality standards for performance assessments. In addition, there are issues of quality in maintaining records of students’ performance and other data required by the NRS. Local programs collect the initial data on students and forward them to the states. States are required to implement and maintain a computerized student-level data system and to forward valid and reliable information to the DOEd. At each data collection point, quality controls need to be in place to ensure that data represent what they are intended to represent and that they can be interpreted in the desired ways. The meaning of the collected data relies on the diligence and accuracy of recordkeeping by all parties.

The data the DOEd receives from the states are used to obtain national totals and averages for the various indicators required by the NRS. These averages are used by states as they negotiate their levels of performance, and they are reported to Congress and other interested parties. Data from individual states are used to assess whether they have met their negotiated levels of performance and, thereby, meet the WIA Title II requirement to qualify for an incentive grant.

An accountability system like the NRS is inherently dependent on the quality and integrity of the data it accepts. However, the NRS does not yet provide minimum standards for data quality or for auditing or other verification of the integrity of the data collected. Further, it is not clear that states and local programs have the resources and systems in place to collect and maintain the mandated data.
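Quality controls of the kind called for here could include automated plausibility checks applied to state-reported figures before they enter national totals. The sketch below is purely illustrative; the field names and thresholds are hypothetical examples, not NRS rules.

```python
# Illustrative plausibility checks for state-reported, NRS-style data.
# Field names and thresholds are hypothetical, not part of the NRS.

def check_state_record(record):
    """Return a list of warnings for figures that look implausible
    or internally inconsistent."""
    warnings = []
    hours = record["avg_attendance_hours"]
    completion = record["pct_completing_level"]
    spend = record["spend_per_student"]

    if not 0 <= completion <= 100:
        warnings.append("completion percentage out of range")
    # Flag very high level-completion rates paired with very low
    # average instruction time.
    if hours < 40 and completion > 50:
        warnings.append("high completion rate with low attendance hours")
    # Flag very high level-completion rates paired with very low
    # per-student spending.
    if spend < 500 and completion > 50:
        warnings.append("high completion rate with low per-student spending")
    return warnings

suspect = {"avg_attendance_hours": 31,
           "pct_completing_level": 67.9,
           "spend_per_student": 233}
flags = check_state_record(suspect)
```

A record like `suspect` would be flagged for human review rather than rejected outright, since an unusual combination of figures may be genuine.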
For instance, Bob Bickerton presented results from a survey conducted in the first half of the current performance period (program year 2001): in their responses, 18 states reported that they had not yet implemented the required systems.

Issues of data quality enter into the validity of analyses conducted with the data. Bickerton provided examples of the improbable situations that can emerge when standards for data are not yet implemented. In one report, a state with an average of 116 hours of student attendance per year indicated that 36 percent of its students completed a federally defined instructional level (i.e., progressed from one educational functioning level to the next), while another state with only 31 hours per student per year reported that almost 68 percent of its students completed a federal level in program year 2000. In a second example, a state that spent $2,084 per student per year reported that 36.7 percent of its students completed a federal level in program year 1999, while another state that spent only $233 per student per year reported that 90.2 percent did so. While there is a chance that these results are true, they may also reflect inaccuracies or inconsistencies in the way data are collected, maintained, and reported. If the goal is to achieve comparability across states, quality controls need to be in place to ensure that the meaning of data is consistent.

PROBLEM 4: ACHIEVING A BASIS FOR COMPARABILITY USING PERFORMANCE ASSESSMENTS

States that decide to implement performance assessments will need multiple versions of the assessment. To avoid practice effects, different versions will be needed for pretests and posttests. For security reasons, different versions will be needed for different testing years. Furthermore, because states will develop their own performance assessments, there will be different versions from state to state. A major problem with performance assessments is the difficulty of achieving comparability across different versions.

One goal of the NRS is to make comparisons—comparisons of students’ performance from pretest to posttest and comparisons from state to state. For instance, the goal is that student performance interpreted as moving to the next level in Oregon would also qualify a student to move to the next level in Ohio. This requires a common basis for comparing performance.
Workshop participants voiced concern that performance assessments may not generate adequate levels of comparability. Some thought it might be possible to implement a systematic process that used social moderation to roughly align scores from a variety of performances. Widely used tests such as the TABE, CASAS, and others could also be used as part of the process to establish a link between performance assessments and NRS levels. Under social moderation, judgment is used to align scores on assessments with one another or with a common reporting scale even though the assessments may measure somewhat different knowledge with different kinds of tasks (Linn, 1993; Mislevy, 1992). The question remains whether this level of comparability would be sufficient in the high-stakes environment established by the ABE National Reporting System. The crux of the issue is the degree to which students placed at one level of the NRS with one assessment would be placed at a different level with another assessment—a source of uncertainty in addition to the other measurement error associated with scores on either of the two assessments (e.g., due to low reliability levels).

There are ways to estimate the uncertainty associated with using social moderation. One way would be to have multiple panels of experts independently carry out the alignment task and then estimate the frequencies of the classification discrepancies that would result (that is, the frequency with which the judgments varied from one panel to the next). It should also be noted that the real impact of social-moderation uncertainty depends on the way test results are used. For instance, one proposed use of test results under the NRS is to compare performance across states. Here, differences in the way scores are aligned will influence estimates of the proportions of students in each state who are considered to be performing at the various NRS levels. Increases or decreases in the proportion of a state’s students at a given level could simply be due to differences in the way the scores are aligned (e.g., variability in the judgment-based decisions). More lenient judgments (i.e., lower cut scores associated with an NRS level) could increase the proportions; harsher judgments (i.e., higher cut scores) could decrease them. Another proposed use of results, however, is to set gain-score targets independently within states.
This use is affected much less by the uncertainty associated with social moderation, as it concerns only changes on a single assessment (i.e., the state’s own), even though the scores may have been mapped through moderation onto a common NRS metric.

Develop Benchmarks Identified with NRS Levels

A major challenge to the use of performance assessments for accountability purposes, such as those stated in the NRS, is that performance assessments usually cannot be designed to be precise enough to reflect relatively small developmental increments in skill. Because of external factors in their lives, a large majority of ABE students participate in education programs for a limited length of time and study with limited intensity. Hence, it is not likely that their progress will show up as movement from one NRS level to the next. Several speakers expressed concern that students might not actually show increments in skill sufficient to be measured by performance assessments. Some also recommended that midway points be identified within an NRS level to address this issue. Several concerns were also raised about using NRS levels to measure student proficiency and educational gain, regardless of which assessment is administered (see Chapter 4 for further discussion of this issue). Several speakers suggested engaging in a consensual effort to develop benchmarks identified with the transitions between NRS levels and possibly even with identified midway points. This could generate a basis for decisions about student progress within the context of the NRS design without undermining the current use of the wide variety of assessments in ABE programs across the country.

Balancing Comparability with Flexibility

The issues related to comparability of assessments and the methods of linking or cross-walking assessments have been highlighted throughout this report. Several speakers called for work on linking paper-and-pencil as well as performance assessments. Braun and others recommended that the process of matching NRS descriptive levels with benchmarks (or cut scores) by a variety of test publishers be revisited to ensure that it can support inferences about students’ skill levels. No one at the workshop thought that a simple basis for linkage across the various assessments would emerge, but possibilities exist for developing post hoc subtests within those tests that might aid in linking, using social moderation or statistical moderation (e.g., EFF benchmarking tasks), or for defining ranges of scores on the various tests that could be considered roughly equivalent to one another.
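One way to gauge whether such judgment-based links are dependable is the approach described under Problem 4: have several panels set cut scores independently and measure how often the resulting classifications disagree. A minimal sketch, with invented panel cut scores and student scores (none of these numbers come from the workshop):

```python
# Estimate the classification discrepancy between independent panels.
# Panel cut scores and student scores below are invented for illustration.

def classify(score, cuts):
    """Map a raw score to an NRS-style level: the number of cut scores
    at or below the score (0 = lowest level)."""
    return sum(score >= c for c in cuts)

# Two hypothetical panels set cut scores for the same assessment.
panel_a = [10, 20, 30]   # somewhat harsher judgments
panel_b = [8, 18, 32]    # somewhat more lenient at the lower levels

scores = [5, 9, 15, 19, 25, 31, 40]

levels_a = [classify(s, panel_a) for s in scores]
levels_b = [classify(s, panel_b) for s in scores]

# Discrepancy rate: the share of students whom the two panels would
# place at different levels.
disagreements = sum(a != b for a, b in zip(levels_a, levels_b))
discrepancy_rate = disagreements / len(scores)
```

With more than two panels, the same comparison run over every pair (or against a consensus classification) yields the frequency of classification discrepancies that the workshop discussion proposed as an estimate of social-moderation uncertainty.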
Greater comparability could be achieved through standardization (i.e., the same content standards and tests across states), but it would come at the cost of decreased flexibility at the program or state level in the choice of assessments. Thus the trade-offs need to be kept in mind. In the words of Braun:

We gain evidential value and construct representation but we pay a cost . . . in terms of development, scoring, testing time, and reliability. . . . How they play off against each other will have to be worked out in the context of our particular purposes and constraints.