5
Developing Performance Assessments for the National Reporting System

Many in the adult education community believe that performance assessments are more congruent with the goals and real-life scenarios of adult learners and allow for broader measurement of adult learners’ skills than standardized multiple-choice tests, such as CASAS and TABE (described in Chapter 2). Specifically, many believe that performance assessment tasks provide students with a better opportunity to demonstrate their knowledge of the content by producing or constructing a response to an item or task, rather than simply selecting a response from available options. (As Myrna Manly and Stephen Dunbar noted, this contrast between performance assessments and multiple-choice assessments is overly stark. While performance assessments do indeed make it possible to gather rich evidence about students by presenting them with more complex situations, thoughtfully constructed multiple-choice tests can engage higher-order thinking, and poorly constructed performance assessments can obfuscate students’ achievement with demands for irrelevant knowledge.) The trade-off of using performance assessment tasks instead of selected-response tasks is that considerably fewer questions can be asked in the same period of time. This raises concerns about the ability of performance assessments to represent the full scope of the content and skills the assessments are meant to cover.



The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.




ACHIEVING DOMAIN COVERAGE

Several approaches for achieving domain coverage were suggested during the workshop. Two of these are examined in this section: the critical indicator approach and the domain sampling approach.

The Critical Indicator Approach

Mark Reckase described the critical indicator approach. With this approach, specific skills are identified as more important than others in a particular content area; tasks are then designed to assess those critical skills. The approach rests on the inferences that can be made about a student’s mastery of the targeted set of content and skills based on his or her ability to complete the critical tasks successfully. If the student can perform the critical tasks, it is reasonable to infer that he or she can also perform other, less critical tasks that assess similar content and skills.

Reckase said that identifying the critical tasks and knowledge that indicate competence in a given content and skill area is the most important component of the critical indicator approach; it requires an in-depth understanding of the domain to be assessed. If the assessment tasks are not on target, the results will not provide useful information. The development of scoring procedures is consistent with the procedures for performance assessments discussed in Chapter 3. It includes conducting range-finding activities with initial products to determine the number of score levels that can be supported, and developing rater evaluation and training materials.

Reckase offered several examples of skills in adult education that might serve as critical tasks: a critical writing task might be creating a multi-paragraph memo on job-related tasks, and a critical mathematics task might be a business application of mathematics. Identifying critical reading tasks would entail identifying types of texts that can be used to show accomplishment of the reading goals related to understanding text. The inference is that if an individual can read and understand a complex type of text, then he or she can comprehend other, less complex texts.

The process of identifying critical tasks requires consideration of scoring procedures. Specifically, it is important that the task allow individuals at different skill levels to demonstrate their proficiency. Further, it is important to ensure that a rating scale can be developed that allows responses at multiple skill levels to be scored.

Once critical indicators are identified, the time requirement for assessment is low. By focusing on critical skills and using a short screening pretest to assign performance tasks, this strategy attempts to use what is known about the structure of the domain, and what is learned about the functioning level of the examinee, to administer only those tasks most likely to be informative for that particular person. A disadvantage of this approach is that inferences about mastery of the targeted set of content and skills are based on a limited sampling of behavior. In addition, those who are developing the assessment must agree on the overall domain, have a deep understanding of the skills and knowledge required, and be able to select the critical tasks in each content area.

The Domain Sampling Approach

Another alternative for selecting tasks is to sample from the domain, that is, the targeted set of content and skills in a given subject area. In this approach, a large number of tasks are developed that represent all of the content and skills in the particular subject area. For a given test administration, a smaller number of tasks are randomly selected and administered to the student. Thus, any given form of the assessment is assumed to be a representative sample of the skills and knowledge included in the domain. The idea is that if a student can do well on an assessment of a representative sample of the skills and knowledge in a content area, then it is appropriate to infer that he or she possesses mastery of the domain. In this approach, the goal of assessment development is to produce an instrument that contains tasks that are an appropriate sample of a domain. Ideally, a content framework would be translated into specifications that clearly delimit the types of performance items included in the domain. Test developers would then produce many items that represent the domain, and forms would be developed by sampling from the set of items.
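The sampling step in this approach can be shown in a short sketch. The code below is an illustration only, not a procedure described at the workshop: the task pool, the content strands, and the counts per strand are all invented for the example. It assembles a form by drawing the same number of tasks at random from each content strand, so that every form mirrors the structure of the domain specification.

```python
import random

# Hypothetical pool: each task is tagged with the content strand of the
# domain specification it was written to represent.
task_pool = {
    "number_sense":  [f"NS-{i}" for i in range(1, 21)],
    "measurement":   [f"ME-{i}" for i in range(1, 21)],
    "data_analysis": [f"DA-{i}" for i in range(1, 21)],
}

def build_form(pool, tasks_per_strand=2, seed=None):
    """Assemble one test form by randomly sampling the same number of
    tasks from every strand, so each form is an (approximately)
    representative sample of the domain."""
    rng = random.Random(seed)
    form = []
    for strand, tasks in pool.items():
        form.extend(rng.sample(tasks, tasks_per_strand))
    return form

form_a = build_form(task_pool, seed=1)
form_b = build_form(task_pool, seed=2)
```

Two forms built this way generally contain different tasks but have identical strand composition, which is what licenses the inference from performance on one form to mastery of the domain.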
A primary benefit of this approach is its familiarity; it draws on procedures usually used to develop a pool of assessment tasks to assess the range of knowledge and skills in a content domain. The targeted knowledge and skills are first mapped out; then tasks are created that not only assess that knowledge but further specify exactly what the student is expected to learn. This step has the added benefit of educating students and teachers as to what is covered by the test. As with the critical indicator approach, the test developers must agree on the range of tasks that represent the skills and knowledge of the content area, and they must be confident that these tasks are representative.

Reckase said that there are two disadvantages to the domain sampling approach. First, it takes a substantial amount of time to obtain good domain coverage, a problem not unique to adult education. A second issue, which may be more serious in adult education, is the apparent lack of consensus on a definition of the domain. If different states or different adult education programs disagree on what the domain includes, sampling in the same way from the same pool of tasks will not adequately meet the purposes of all the programs. One possible solution to this problem is to construct large and comprehensive pools of tasks, from which programs would specify areas of interest and construct their own sampling plans. Yet this option increases the difficulty of both constructing assessments and comparing results across states and programs.

HOW STUDENTS DEMONSTRATE PROFICIENCY

For those states and local programs that want to use performance assessments to measure students’ proficiency, there are several options. These options were proposed as ways to adhere to the NRS requirements and to assess a variety of content areas.

Types of Performance Assessments

Performance assessments use different modes for students to provide responses to questions. One mode calls for examinees to actively demonstrate their responses. Another mode makes use of a written response, while others involve students constructing portfolios of their work.

Performance Tasks

A performance task requires examinees to actively demonstrate their skills. One example of a numeracy task that uses mathematics to solve a realistic problem is a consumer math problem in which students are asked to plan a trip to the supermarket. They have a certain amount of money to spend and must generate a shopping list for the week.
Using a simulated newspaper ad or worksheet, they must find the prices for each item on their list and calculate the total bill. This is one of many possible examples of authentic performance assessment tasks that assess numeracy skills. As Myrna Manly commented, context is a key feature that increases the authenticity of the task.

Written Scenarios

In a written scenario, a type of on-demand writing task, students are required to write a response to an oral or written prompt (the question and tasks posed to the examinee). This performance assessment task requires that students apply previous knowledge and pose solutions to realistic problems. The responses may vary in length and usually have several parts. The written scenario has a title, a prompt, and instructions to the student on the specific questions that he or she must address and the aspects of the content that should be included. The evaluation criteria should also be included so that students know which skills will be evaluated in their responses. An example of a written scenario appears below:

Scenario: Ben’s goal is to find work as a sales associate for a department store. He has never worked in a department store before, but he feels that he has good interpersonal skills. Ben’s strategy for finding a job is to look at the job ads in the paper every week and send his resume in response to the ads. It has been two months, and Ben has not yet found a job or even been asked for a job interview. Because Ben is your friend, he comes to you for advice on seeking work as a sales associate.

Instructions: What feedback would you give to Ben on his strategy for finding a job as a sales associate? Specifically, describe two strategies you think Ben should consider to be more effective in seeking work as a sales associate. Explain how you would present these strategies to Ben. Your response will be evaluated on your ability to: plan (evaluate a plan’s effectiveness in achieving goals); solve problems and make decisions (generate strategies or options for effective action); convey ideas in writing; guide others (Ananda, 2000:10).
Written scenarios are easy to develop and administer; they can be modified for either a short or long response; and they can be administered in either individual or group format. At the same time, as Mari Pearlman pointed out, substantial effort is required to design a system for evaluating responses to written scenarios. Shared rubrics, illustrated by examples of performances and ratings that are linked to scoring levels, would increase the comparability of results from this kind of task.

Portfolio Assessment

The type of performance assessment that received the most attention at the workshop was the portfolio task. As discussed in Chapter 3, a portfolio is a systematic collection of work or educational products created over a certain period of time. The workshop speakers believed that the portfolio was a feasible assessment vehicle for either domain coverage approach discussed above. Less clear is how portfolio assessment would fit into the pretest/posttest paradigm of the NRS. Reckase suggested the use of structured portfolios, in which the student would have to follow a prescribed table of contents to create the portfolio; this would ensure some commonality of evidence across examinees. According to Reckase, the table of contents would be useful in narrowing the scope of content coverage and in ensuring that similar information is collected from students from one testing occasion to the next. A fixed menu of options would also allow for the advance development of scoring rubrics for the kinds of assessments included in the table of contents. A menu of options can be developed through either method of achieving domain coverage or in some other way. Developing structured scoring procedures and rubrics is important for maintaining consistency in the scoring process and for enabling comparisons of students’ work from one testing occasion to another.
For the menu to be useful, each task or work sample description would have to be defined clearly enough to be specific about the kinds of work students should include in the structured portfolio. Using the menu, a student and teacher could select the task that most appropriately matches the student’s personal and instructional goals. Reckase envisioned a one-page description of the work sample that would be generic enough for the student and teacher to adapt to their stated goal. He also stressed the importance, for both the student and the teacher, of understanding what constitutes acceptable and unacceptable entries. He highlighted the time, cost, and difficulty of developing scoring procedures for portfolios but said that it can be done (National Board for Professional Teaching Standards, 2000; Reckase, 1995; Reckase and Welch, 1999). Reckase also emphasized how important competent and well-trained raters are to the success of this process.

Reckase provided an example of a portfolio menu for English Language Arts. Some of the tasks and work sample descriptions include the following: analysis/evaluation (analyze or evaluate different aspects or parts of a subject, object, or idea); explanatory writing (explain a process or concept to another person through writing); proposing a solution (define a problem and offer a plausible solution); and research/investigative writing (research a subject, gather and organize material, and present it clearly with well-documented sources). At the end of the instructional period, the student and teacher would select the best piece of work to be evaluated for each relevant task. Reckase recommended producing a handbook for students and teachers that describes the scoring procedures and the rubric and provides examples of work that would fit into the different score categories. He suggested that a minimum of five entries of student work would be needed in a portfolio to obtain a reliable student score in a particular content area.

Reckase commented that although portfolios can be a very effective tool for evaluating growth, they do present some complications. For a structured portfolio to be used as part of the assessment system, instructors must agree on the content and types of activities that a student should include. It is difficult to develop scoring procedures that are reliable enough to enable the comparison of different work products by different students. Specifically, the cross-task generalizability of performance measures can be weak; that is, a person’s performance may depend on the task he or she is given. Furthermore, portfolios often include products of both successful and unsuccessful performance on different tasks.
Thus, the scoring process needs to include ways to handle portfolios in which students do well on one task and not as well on others; that is, their performance across tasks is uneven. Reckase noted that these factors make it difficult to achieve comparability of students’ performance and of program effectiveness. The fact that different students’ portfolio entries are tailored to the substance and the levels they are working on means that each student’s performance is more relevant to him or her individually and less commensurate with those of other students. As a result, more judgment is involved in mapping performances into a common framework (such as the NRS levels), and a greater burden is placed on the need to achieve consistency of evaluations across students, over time, and among programs.

Reckase emphasized the need for scorers who have knowledge of the content and skill area being assessed and who have gone through a thorough training process. Scoring guides should be developed for the training process, and they should include rating points with clear descriptions and exemplar papers. This is especially critical if the assessment system is to provide information about the six levels of performance prescribed in the NRS. There must also be a provision for monitoring the quality of portfolio scoring and for refresher training. These caveats apply not only to the assessment of portfolios but to other performance assessments as well.

One discussant cautioned that developing performance assessments that meet technical standards is challenging. He pointed out that earlier K-12 education reform efforts in Vermont and Kentucky were unsuccessful in their attempts to use portfolio assessment as the foundation for their high-stakes accountability systems (see Koretz, Stecher, Klein, and McCaffrey, 1994, for more information).

WAYS TO IMPROVE EFFICIENCY

Multi-Stage Testing

One suggestion, promoted by Wendy Yen and others, is multi-stage testing. In multi-stage testing, students take an initial “routing test,” or locator test. The locator test is a short, broad measure of the content that provides an initial estimate of the student’s level of skills. On the basis of their performance on the locator test, students are routed to a test approximately at their skill level.
The second-stage tests are of varying levels of difficulty; they are longer and provide a more precise estimate of the student’s skill level. Multi-stage testing can be conducted with either paper-and-pencil tests or computers.

Computerized Testing

The use of computerized testing can greatly improve the testing process, making it more efficient and flexible. With computer-based testing, a paper-and-pencil test is converted to a computer-administered test. Questions are presented to examinees in the same sequence as on the paper-and-pencil test, and examinees select their answers in the same manner as they would on a paper-and-pencil test. Once an examinee finishes responding to all the questions, the test is scored.

Another, more technically sophisticated form of computer-administered test is the computerized adaptive test (CAT). CATs rely on programmed algorithms that use an examinee’s response to a given question to select the next question. The difficulty level of the administered items is adapted to the skill level of the examinee, so test takers spend less time answering questions that are too hard or too easy for them. CATs greatly increase test efficiency because examinees do not have to answer all the questions: the algorithm continues presenting items until the examinee’s skill level can be estimated with sufficient precision. Computer-adaptive testing is a special type of multi-stage testing that exploits the capability of the computer in the presentation of questions and in scoring. Ronald Hambleton explained that computer-adaptive testing makes it possible to target the assessment to the student’s ability, to build flexibility into test scheduling, and to increase test security as well. (Additional information on computer-adaptive testing can be found in Wainer et al., 2000.) Although the technology available for computer-adaptive testing makes it most feasible for use with multiple-choice test items, CAT has also been used to develop simulations of real-life situations, using selected-response items, which are included on licensure exams for doctors and architects. According to Hambleton, the computer technology for automated scoring is advancing rapidly.
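The adaptive logic described above can be sketched in a few lines. The sketch below is a deliberate simplification rather than an operational design: real CATs estimate ability with item response theory and stop on a precision criterion, whereas this toy version uses a halving step rule in their place, and the item difficulties and simulated examinee are invented for the example.

```python
def run_cat(item_bank, respond, theta=0.0, step=2.0, min_step=0.25):
    """Toy adaptive test: repeatedly administer the unused item closest
    in difficulty to the current ability estimate, move the estimate up
    on a correct answer and down on an incorrect one, and shrink the
    step each time.  The shrinking step stands in for the precision-
    based stopping rule of a real CAT."""
    administered = []
    while step >= min_step and item_bank:
        item = min(item_bank, key=lambda d: abs(d - theta))
        item_bank.remove(item)
        correct = respond(item)
        administered.append((item, correct))
        theta += step if correct else -step
        step /= 2
    return theta, administered

# Simulated examinee whose true ability is 1.0: answers correctly
# whenever the item is no harder than that level.
estimate, given = run_cat(
    item_bank=[-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0],
    respond=lambda difficulty: difficulty <= 1.0,
)
# The estimate lands near 1.0 after only 4 of the 7 items are given.
```

Because the item sequence tracks the examinee, most of the very easy and very hard items are never administered, which is the source of the efficiency gain Hambleton describes.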
A number of workshop participants described examples of performance tasks that use automatic scoring, such as the simulation-based networking tasks used in the Microsoft certification exams (http://www.microsoft.com/traincert/mcp [March 28, 2002]) and the computerized patient management problems used in the National Board of Medical Examiners’ licensing examination for physicians (http://www.usmle.org/ [May 14, 2002]).

Hambleton said that the positive features of computer-based testing for adult education include (1) flexibility in scheduling tests (participants can take tests when they are ready and without the aid of a test administrator; consequently, instructors are not overwhelmed by testing responsibilities) and (2) increased test security (the tests are in the computer and not available in paper form, and new test designs and item formats are possible). CAT permits the targeting of assessments to the ability levels of the examinees. In adult education, with its wide range of abilities among students, targeting the difficulty of the test to each student would be a major advantage: students would experience less frustration, measurement precision could be increased, and testing time could be shortened.

Presenters agreed that the introduction of computer technology into assessment practices provides several advantages for adult education: more valid assessments can be developed; assessments can be individualized; flexibility in test scheduling is possible; feedback and scoring can be immediate; and testing time can be minimized. Computer technology is also useful in addressing psychometric issues such as scaling and measuring a large continuum of skill levels; it provides more options for analyzing the data (scaling, calibration), and it allows administration across sites and localities. Hambleton and others cautioned, however, that computer technology will require a large item bank; items will still need to be field-tested and calibrated; and the initial cost of computers is substantial.

Item Sampling: Maryland School Performance Assessment Program

In the early 1990s, the state of Maryland implemented an innovative and challenging educational reform program that held schools, not students, accountable for student performance. The reform program dramatically altered Maryland’s student assessment program and led to the design of an assessment system that uses performance-based assessments for school evaluation. The Maryland School Performance Assessment Program (MSPAP) is administered annually to third, fifth, and eighth graders.
It includes assessments in reading, writing, language usage, mathematics, science, and social studies. All the assessment tasks are integrated across the content and are authentic in that students respond to queries based on problems solved during the examination process. According to Mark Moody, the assessment was designed to embody sound instructional practices, to represent good principles of instruction, and, most important, to obtain reliable school-level scores (because the focus is on program evaluation) rather than accurate scores for individual students (thereby reducing the burden on individual students). The model is described here for its instructional purposes even though it would not fulfill the NRS requirements.

MSPAP is a criterion-referenced assessment3 based on the Maryland learning outcomes. The MSPAP uses matrix sampling, so that students are assessed on different aspects of the content, with no student completing all items on the assessment. Aside from the greater complexity of administering such an assessment, this design offers measurement advantages for Maryland’s objectives: assessing schools over a very broad range of content while minimizing the testing burden on individual students.

According to Moody, there are several advantages to Maryland’s performance-based assessments. State policy makers believe that MSPAP is a test worth teaching to; in Moody’s words, “It embodies the spirit of good instruction.” Maryland has also found that the assessment has face validity4 with constituents, provides models of performance opportunities, and has provided a rich source of data for school improvement.

But the MSPAP also has several disadvantages. The assessment is complex, it is expensive, and it does not provide individual student results. The cycle for creating an edition of the test is 30 months, and about 24 months of that cycle are spent writing the items. Moody reported that it is challenging to find authentic materials and readings, and the developers encounter copyright issues in what material can be used and how it can be used. The expense of the MSPAP is calculated at about $60 per student for development, scoring, and reporting, not including expenses associated with test administration time. Administering performance assessment tasks can be more time- and labor-intensive than administering other types of assessments. Approximately 180,000 students are tested yearly. Finally, many constituents are interested in individual student scores rather than in school scores.5 Moody cited some of the lessons learned from Maryland’s experience with the MSPAP.
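The matrix sampling design described above is simple to illustrate in miniature. In the sketch below, the items, the number of forms, and the class roster are all invented, and MSPAP’s actual design is far more elaborate; the point is only the mechanism: the pool is split into short forms, forms are rotated across students, and coverage is complete at the school level even though no student sees every item.

```python
from itertools import cycle

def matrix_assign(items, n_forms, students):
    """Split the item pool into n_forms disjoint forms, then deal the
    forms out to students in rotation.  Each item is administered to
    roughly 1/n_forms of the students, while each student answers only
    a fraction of the pool."""
    forms = [items[i::n_forms] for i in range(n_forms)]
    rotation = cycle(range(n_forms))
    return forms, {s: next(rotation) for s in students}

items = [f"item{i:02d}" for i in range(12)]        # hypothetical pool
students = [f"student{i:02d}" for i in range(30)]  # one school
forms, assignment = matrix_assign(items, n_forms=3, students=students)

# School-level coverage: the three forms jointly exhaust the pool,
# even though each student answers only 4 of the 12 items.
covered = {item for form in forms for item in form}
```

Aggregating results over the whole roster is what yields reliable school-level scores from these partial per-student records, which is the trade-off Moody describes.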
He finds that the most valuable lesson of the last 10 years pertains to the four aspects of task development. He stressed that when a performance task is constructed, it is crucial to consider these questions: (1) What is the content of the task? (2) How is the task going to be scored? (3) What materials does the task require? and (4) Can the task be administered? Moody and his colleagues have learned that a lot of good ideas cannot be administered, and a lot of tasks that can be administered are not very interesting. He recommended multiple levels of review for the tasks at different levels of the school system. Finally, he suggested the formation of an advisory group of experts to offer guidance on psychometric rigor and administration of the assessment.

3. A criterion-referenced test is used to ascertain an individual’s status with respect to a defined assessment domain.

4. The items and tasks on the test appear to be reasonable representations of the content and skills the test is intended to measure.

5. Since the workshop, Nancy Grasmick, Maryland’s State Superintendent of Schools, has decided to replace the MSPAP with a test that is more aligned with a new high school proficiency exam and meets new federal requirements that state tests provide individual scores.

ALTERNATIVE REPORTING MODELS TO THE NATIONAL REPORTING SYSTEM

Given the differences in ABE instruction and student goals across states, many presenters shared their concerns that a set of uniform, standardized performance assessments may not work within the NRS. Even though participants understood that the charge of the committee was to address the use of performance assessments within the NRS framework, a number of workshop speakers stimulated long-range thinking by describing some alternative reporting models.

Jim Impara described a model used in Nebraska at the K-12 level. The state has adopted content standards, and local school districts must report on the percentage of students who meet these standards. The school districts are allowed to choose their own assessments, but an independent group formed by the state evaluates each local assessment system using a scale of quality measures. The state uses six criteria to evaluate the assessment system of individual school districts: (1) the assessments reflect state or local standards; (2) students have an opportunity to learn the content; (3) the assessments are free from bias or offensive situations; (4) the level is appropriate for students; (5) there is consistency in scoring; and (6) mastery levels are appropriate.
(For more information, see http://www.nde.state.ne.us [April 29, 2002].) The state then publishes the district-reported percentage of students meeting the standards along with the evaluative rating of the quality of the local assessments. Districts that receive low ratings for their assessments but report that their students seem to be doing well do not have as much credibility as districts whose assessments receive high ratings. The Nebraska model could be applied within the NRS in the following way: a national audit of the assessment system used by each state could be conducted, and a “weight” or grade could then be assigned to each state’s system. Adjustments could be made to account for any major differences across the states.

Another model was proposed by Richard Hill and is described here even though it does not adhere to the pretest/posttest assessment design of the NRS. Hill suggested allowing ABE programs within each state to establish individual “contracts” with each student. The accountability index would be based on the proportion of individual contracts in which students met their goals. Hill believes that an advantage of this system is that it would be comparable across all types of adult education programs. For example, the same questions could be asked of all programs and all students, whether a program was designed to provide training for a specific job-related task or to provide preparation for postsecondary education.
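Hill’s proposed index reduces to a simple proportion, sketched below. The contract records and their format are invented for illustration; the workshop described no particular data structure.

```python
def accountability_index(contracts):
    """Hill's proposed index: the proportion of student contracts in
    which the stated goal was met.  `contracts` is a list of records
    with a boolean `goal_met` field (a format assumed here for
    illustration)."""
    if not contracts:
        return 0.0
    met = sum(1 for c in contracts if c["goal_met"])
    return met / len(contracts)

# The same question is asked of every program and student, regardless
# of what each student's individual goal was.
program = [
    {"student": "A", "goal": "pass GED writing test", "goal_met": True},
    {"student": "B", "goal": "earn workplace safety certificate", "goal_met": False},
    {"student": "C", "goal": "read a bus schedule", "goal_met": True},
    {"student": "D", "goal": "enter community college", "goal_met": True},
]
index = accountability_index(program)  # 3 of 4 contracts met -> 0.75
```

Because the index is goal-agnostic, a job-training program and a college-preparation program are scored on the same footing, which is the comparability Hill cites as the model’s advantage.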