3 Assessment and Test Design

Over the last 15 years, there has been a proliferation in the use of assessment for accountability purposes at the national, state, and local district levels. Test results have been used as indices in making decisions about individual students, such as advancement from one grade to the next or graduation from high school. Test results have also been aggregated across individuals to make decisions about groups; they have been used to judge the quality of schools or to determine funding allotments within a district or state. Furthermore, test results have been aggregated to the state level and used as a tool to make comparisons among states. In short, tests have been used for many purposes (National Research Council [NRC], 1999b, 2001a).

At the workshop, Pamela Moss, a member of the joint committee that developed the Standards for Educational and Psychological Testing (American Educational Research Association [AERA] et al., 1999), observed that tests in educational settings are typically designed to fulfill one of three general purposes: (1) to provide diagnostic information, (2) to evaluate student progress, or (3) to evaluate programs (see NRC, 2001a, for more information about the purposes of assessment).

During the discussion about the purposes tests can serve in educational settings and in her overview of the Standards, Moss alluded to a number of measurement concepts. To assist readers not fully acquainted with measurement issues, the following background information about designing assessments is provided. Readers interested in more in-depth information on assessment design are referred to introductory measurement texts such as Millman and Greene (1993) and
Popham (1999, 2000). This chapter concludes with a discussion of the trade-offs to consider in designing and selecting assessments.

PURPOSES OF ASSESSMENT

An assessment that provides diagnostic information about a student's achievement level is considered a formative assessment; its intended purpose is to identify a student's areas of mastery and weakness in the content being studied in the classroom. Formative assessments can include classroom projects, teacher observation, written classwork, homework, and informal conversations with students. Through formative assessment, the teacher gathers knowledge about what the student has learned, and that knowledge is used to make instructional decisions about what content should be covered next. The results of formative assessments can also provide feedback to individual students to help them focus their learning activities. To be of most benefit, formative assessment of student learning should be ongoing, closely aligned with instruction, and designed to support inferences about students' developing competence with the content.

An assessment that evaluates student progress is a summative assessment; its intended purpose is to determine whether a student has obtained an established level of competency after completing a particular course of education, be it a classroom unit or 12 years of schooling. End-of-unit tests and letter grades are summative assessments. Summative assessments can also be large-scale assessments, such as the Massachusetts Comprehensive Assessment System (MCAS) or the General Educational Development (GED) exam.

Finally, assessments can be used to evaluate the overall performance of a particular program or group, such as a classroom, a school, a school district, or a state. For accountability, policy makers sometimes use data at the individual level, as well as data aggregated to the group level, to make judgments about the quality and effectiveness of educational programs and institutions.
Examples of this kind of assessment include the Stanford 9, a standardized test that is designed to report scores at the individual level and is often aggregated to the group level, and the Maryland School Performance Assessment Program (MSPAP), which administers tests to samples of students and reports results for relevant groups. Results from assessments designed for program evaluation are used to support inferences about the overall performance of the group and often to make statements about the effectiveness of a given program.
Assessments can provide valuable information to help students, teachers, school administrators, and policy makers make a variety of decisions. Although a given assessment is generally designed to address a particular purpose, in practice that assessment is often used for multiple purposes. For instance, some state tests are used both to make decisions about the performance of individuals and, when aggregated, to make judgments about the performance of a group (e.g., a classroom, school, or district).

ASSESSMENT DESIGN

According to the Standards (AERA et al., 1999), there are four phases of the test development process: "delineation of the purpose(s) of the test and the scope of the domain [content and skills] to be measured; development and evaluation of the test specifications; development, field testing, evaluation, and selection of items and scoring guides and procedures; and assembly and evaluation of the test for operational use" (p. 37). This development process should be followed regardless of the kind of assessment being designed. Each aspect of test development is examined below.

Defining the Purpose and Identifying the Content and Skills to Be Assessed

A clear statement of purpose provides the test developer with a framework upon which to begin designing the assessment. In fact, in assessments such as the National Assessment of Educational Progress (NAEP), there is a document called the Framework that lays out the purpose of the assessments and defines the content to be measured. According to the Standards, the first step in test development is to define the purpose of the assessment and delineate the scope of the content and skills to be covered.
When the assessments are to be used in educational settings, this process includes consideration of certain issues, such as how results will be used, the consequences—intended and unintended—of these uses, the articulated content standards, the material covered by the curriculum, and the ways in which students are to demonstrate mastery of the material. The breadth of content and skill coverage included in assessments can vary considerably and will be guided by the intended purpose of the assessment and the inferences to be based on test results. For example, consider
the GED, which students take to achieve the equivalent of a high school diploma. The purpose of the GED is to measure the skills associated with a high school education. The GED tests provide a standard measure of students' knowledge in mathematics, social studies, science, writing, literature, and the arts. But because a short series of tests (five content areas) cannot measure all the skills students should possess by high school graduation, the GED is designed to broadly measure knowledge and skills equivalent to those of graduating high school seniors; that is, the GED is normed against high school seniors who have been certified by their school registrars as having completed all the requirements for graduation.

In contrast, consider an end-of-unit mathematics test. The test might be designed to indicate whether students have mastered fractions as parts of a whole and as division, including familiar fractions such as halves, thirds, fourths, fifths, and tenths (National Council of Teachers of Mathematics, 2000). Here, the objective of the assessment is to measure a more narrowly defined content area with considerable depth. The results of the test tell the teacher which students are ready to move to the next mathematics topic and which need more instruction in this content area.

Another issue in defining the scope of the material covered by the assessment is determining how test takers are to demonstrate their mastery of the content and skills. In educational measurement, there are two formats for collecting performance information about test takers: selected-response items, for which examinees select responses from several offered choices, and constructed-response items, for which examinees construct their own responses to test questions. Selected-response formats, such as multiple-choice, matching, or true-false, are suitable for many testing purposes and can easily and objectively be machine-scored.
Other test purposes may be more effectively served by the constructed-response format. Short-answer items require a response of one or a few words. Extended-response formats may require the test taker to write a response of one or more sentences or paragraphs, design and carry out an investigation, or explain a solution to a practical problem that requires several steps. Most constructed-response items are scored by human scorers. Manual scoring of constructed-response items requires more time per item than machine scoring of selected-response items. Subjectivity is also an issue in scoring constructed-response items because scoring relies on judgments made by human scorers.1

1 As mentioned in Chapter 1, several papers were commissioned after the workshop. A paper by Larry Frase addresses how technology can facilitate scoring of constructed-response items. Please contact the DOEd to obtain the paper.
Included in the category of the constructed-response format are performance assessments. Performance assessments often seek to emulate the context or conditions in which the intended knowledge or skills would actually be applied, and they are characterized by the kind of response required from the test taker. Performance assessments generally require test takers to demonstrate their skills and content knowledge in settings that closely resemble real-life settings (AERA et al., 1999:41).

One type of performance assessment is the standardized job or work sample. Job or work samples might include, for example, the assessment of a health care practitioner's skill in making an accurate diagnosis and recommending treatment for a defined medical condition, a manager's skill in articulating goals for an organization, or a student's proficiency in performing a science laboratory experiment.

Another type of performance assessment is the portfolio. Portfolios are systematic collections of work or educational products usually created over time. A well-designed portfolio specifies the nature of the work that is to be put in the portfolio, which may include entries such as representative products, the best work of the test taker, or indicators of progress (AERA et al., 1999:42). For more information on developing portfolios, see LeMahieu, Gitomer, and Eresh (1995).

Whatever the format of the performance assessment, those who are involved in determining the scope of content and skills to be addressed in the assessment usually include subject-matter experts, experienced practitioners, and other stakeholders. The process often includes consideration of the impact of the test on instruction because the material covered on the test may come to define the scope of what is taught in the classroom.
Using assessments to affect instruction at the classroom level can have both positive and negative consequences, depending on how well the knowledge and skills covered by the assessment match the knowledge and skills the instruction is supposed to cultivate.

Although portfolios and other types of performance assessment tasks provide a means for evaluating skills that are not easily measured by selected-response items (e.g., performance in real-life situations), there are a number of attributes that cannot be reliably assessed even with performance assessments. For instance, in discussing assessment of teachers, workshop speaker Mari Pearlman pointed out that qualities such as determination, perseverance, flexibility, and a sense of humor are critical for effective teaching, but the science of assessment cannot reliably define and measure these qualities and characteristics. Thus, while performance
assessments offer a new approach to assessment, there are limits to what they can be expected to do. As Pearlman noted, "Our technical knowledge is not quite ready for some of the challenges presented by performance assessments."

Developing Test Specifications

Once the purpose of the assessment and the scope of content and skill coverage have been determined, the test specifications can be developed. Test specifications are derived from the designated purpose of the test, and they provide a guide for developing multiple forms of the assessment. Test specifications can be considered the blueprint for the test (Mislevy, 1992), as they identify the number of items with specific characteristics to be included on each form. For instance, the test specifications might state the number of items measuring each content and skill area along with the number of items of each format (e.g., the number of selected-response and constructed-response items). Test specifications play a key role in enabling test forms to be constructed so that they cover similar skills in similar ways and produce results that are comparable.

Developing Items

The next stage of test construction is to develop items that measure the targeted content and skill areas laid out in the test specifications. The kinds of claims or inferences that are to be made about the knowledge or skills of interest must be considered in developing items. Items should be designed to provide salient evidence to support these claims. Once items have been developed, they must undergo a number of reviews for appropriate content, clarity and lack of ambiguity, sensitivity to gender or cultural issues, and fairness (AERA et al., 1999:39). The quality of the items is usually ascertained through item review procedures and pilot testing. Often, a field test is developed and administered to a group of test takers who are representative of the target population.
The field test helps determine some of the psychometric properties of the test items, such as an item’s difficulty and its ability to discriminate among test takers with different skill levels, information that is used to identify appropriate and inappropriate items.
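The item statistics gathered from a field test can be sketched in code. In classical test theory, an item's difficulty is simply the proportion of examinees who answer it correctly (the "p-value"), and its discrimination is commonly estimated as the point-biserial correlation between the item score and the rest of the test score. The following is an illustrative sketch with invented response data, not an implementation used by any testing program.

```python
# Illustrative sketch (not from the chapter): classical item analysis on a
# tiny, made-up field-test matrix. Rows = examinees, columns = items;
# 1 = correct, 0 = incorrect.
from statistics import mean, pstdev

responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
]

def item_difficulty(matrix, j):
    """Proportion of examinees answering item j correctly (the p-value).
    Note that higher values indicate an *easier* item."""
    return mean(row[j] for row in matrix)

def item_discrimination(matrix, j):
    """Point-biserial correlation between item j and the rest score
    (total score excluding item j). Positive values indicate the item
    separates stronger from weaker examinees."""
    item = [row[j] for row in matrix]
    rest = [sum(row) - row[j] for row in matrix]
    sd_i, sd_r = pstdev(item), pstdev(rest)
    if sd_i == 0 or sd_r == 0:
        return 0.0  # a constant column carries no discrimination information
    cov = mean(i * r for i, r in zip(item, rest)) - mean(item) * mean(rest)
    return cov / (sd_i * sd_r)

for j in range(4):
    print(f"item {j}: difficulty={item_difficulty(responses, j):.2f}, "
          f"discrimination={item_discrimination(responses, j):.2f}")
```

In this toy data set, an item answered correctly by very few examinees shows up with a low p-value, and an item whose correct answers come mostly from high scorers shows a positive discrimination; operational programs flag items with near-zero or negative discrimination as candidates for removal.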
Assembling Test Forms

The final stage in test development is to assemble items into forms of the test or to identify an item pool for a computerized adaptive test. When the goal is to develop forms, it is important that each form meet the requirements of the test specifications. When the goal is to create an item pool for a computerized test, there should be enough items to address the test specifications. During the assembly of a test form, it is also important that scoring procedures be consistent with the purposes of the test and facilitate meaningful score interpretation. How the scores will be used determines the importance of the psychometric characteristics of items in the test construction process.

DESIGNING PERFORMANCE ASSESSMENTS2

There are two critical components of a performance assessment: the task the student must carry out and the scoring guide, or rubric, used to judge the adequacy of the student's response. When test developers are designing performance tasks, they must first determine that examinees' skill levels can be assessed with a performance assessment task—that is, will the smaller number of items in a performance assessment be sufficient to support inferences about a student's mastery of the targeted content and skills? And for high-stakes assessments, will different forms of the performance assessment lead to comparable decisions about students?

Scoring rubrics specify the criteria for evaluating performance. The scoring rubric describes the key features that must be included in a response to be awarded a specific score. It is useful to have samples of examinees' responses to demonstrate what is meant by the narrative description of each score level; these should include examples of responses scored at the upper and lower bounds for each level.
The process for identifying the exemplar papers for each score level is called "range finding." Range finding is an important part of the rubric development process and involves determining, for each score level on the rubric, both the weakest and the strongest response. The scoring guide for a performance assessment comprises the rubric and the example papers.

2 The following text provides an overview of key aspects of developing and scoring performance assessments. It is not intended as a comprehensive guide. For additional information, see Popham, 2000.
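The rubric-and-key-features structure described above can be loosely illustrated as a data structure. In this sketch, which is not from the chapter and uses entirely hypothetical score levels and criteria, each level lists the features a response must exhibit, and a response earns the highest level it fully satisfies.

```python
# Hypothetical sketch of a scoring rubric as a data structure. The score
# levels and feature names are invented for illustration; a real rubric's
# criteria come from range finding and subject-matter expert review.

RUBRIC = {  # score level -> key features required at that level
    4: {"states_claim", "cites_evidence", "explains_reasoning", "addresses_counterargument"},
    3: {"states_claim", "cites_evidence", "explains_reasoning"},
    2: {"states_claim", "cites_evidence"},
    1: {"states_claim"},
}

def score_response(observed_features: set) -> int:
    """Return the highest score level whose required features are all
    present in the response; 0 if no level is satisfied."""
    for level in sorted(RUBRIC, reverse=True):
        if RUBRIC[level] <= observed_features:  # subset test
            return level
    return 0

print(score_response({"states_claim", "cites_evidence"}))  # mid-level response
```

The point of the sketch is only that a well-specified rubric makes the scoring decision mechanical in principle; in practice, deciding whether a feature is "present" in a student response is exactly the human judgment that scorer training and exemplar papers are meant to standardize.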
The reliability of the scoring is an important issue. That is, it should not matter which scorer rates a particular paper: with highly reliable scoring, the ratings from different scorers on the same paper will be essentially the same. Ensuring high reliability requires carefully and unambiguously defined rubrics and extensive, careful training of scorers. To obtain a reliable score for each student, scoring procedures must indicate whether each critical dimension of the performance criteria is to be judged independently and scored separately or only an overall score is to be provided for each student. This determination will depend in part on the purposes of the assessment and on costs. Often more detailed information is needed for formative assessments, while fewer dimensions are scored for large-scale summative assessments. The reader is referred to Brennan (1983, 2001), Brennan and Johnson (1995), and Reckase (1995) for additional information on reliability in the context of performance assessments.

For scores on different forms of an assessment to be comparable, they must mean the same thing. This can be a special challenge with performance assessments. A statistical procedure called equating is typically used to adjust scores derived from different test forms that are developed according to the same specifications. Equating requires carefully managed test construction, a data collection design, and statistical analyses. Equating works primarily because of the care that goes into ensuring that the tests measure essentially the same skills with essentially the same degree of reliability. Assembling equivalent forms is much more difficult for performance assessments than for selected-response tests because there are fewer tasks to work with, and each one requires some unique knowledge and skills. This leads to concerns that the same skill may not be measured on different versions of the performance assessments.
Thus, rigorous statistical equating is usually not possible for performance assessments, and educators must use other methods for linking that have less stringent assumptions and provide lower degrees of comparability. Alternative linking methods (ways of making assessment results comparable across tests) are discussed in the next chapter.3

3 The reader is referred to Kolen and Brennan (1995) for additional information about equating and to Green (1995) for a discussion of equating in the context of performance assessment.
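To make the idea of equating concrete, here is a minimal sketch, not taken from the chapter, of one of the simplest approaches: linear equating by the mean-sigma method, in which a score on a new form is mapped onto the reference form's scale by matching the two forms' means and standard deviations. The score distributions below are invented for illustration.

```python
# Illustrative sketch of linear (mean-sigma) equating with made-up data.
# A raw score y on form Y maps onto the form X scale via:
#   x = (sd_X / sd_Y) * (y - mean_Y) + mean_X
# so that the equated Y scores share form X's mean and standard deviation.
from statistics import mean, pstdev

form_x_scores = [12, 15, 18, 20, 22, 25, 28]  # hypothetical reference form
form_y_scores = [10, 12, 14, 16, 18, 20, 22]  # hypothetical new form (slightly harder)

def linear_equate(y, x_scores, y_scores):
    """Map a form-Y raw score onto the form-X scale (mean-sigma method)."""
    slope = pstdev(x_scores) / pstdev(y_scores)
    return slope * (y - mean(y_scores)) + mean(x_scores)

# A form-Y score at the form-Y mean maps to the form-X mean:
print(linear_equate(16, form_x_scores, form_y_scores))
```

Even this simple method presumes that the two forms measure the same construct with similar reliability, which is precisely the assumption the chapter says is hard to defend for performance assessments; that is why weaker linking methods are used there instead.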
CONSIDERATIONS IN CHOOSING AMONG ITEM FORMATS

There are benefits and issues associated with using either selected-response or constructed-response items in assessments. When the selected-response format is used, more questions can be asked in a shorter period of time, scoring is faster and more straightforward, and it is easier to create comparable test forms. Because selected-response items can usually be answered quickly, many items covering several areas of the content can be administered in a relatively short period of time. Selected-response items are also machine-scoreable, which allows for quicker and more objective scoring. The psychometric value of an objective scoring process is the reduction of error arising from variability in scoring procedures and thus higher score reliability. The costs of scoring tests with selected-response items are also generally fixed: it costs the same to score the test whether or not the items include maps or diagrams; the number of items on the assessment does not change the cost; and the price does not change from grade level to grade level (because issues like length of response are not relevant for selected-response items). Because of these factors, the per-item cost of scoring tends to be minimal.

Workshop speakers discussed the advantages and disadvantages of different item formats. One disadvantage of selected-response items is that they often assess only test takers' recall and recognition skills and fail to capture higher-order thinking skills. Stephen Dunbar challenged this notion, saying that it is possible to write selected-response items that measure higher-order thinking skills. Writing high-quality selected-response items is difficult and requires skilled item writers, and writing selected-response items that assess more complex cognitive processes is even more difficult. Selected-response items are also susceptible to guessing.
For example, with a binary-choice item such as a true-false question, examinees have a 50 percent chance of answering correctly whether or not they have mastered the material. There is also the perception that selected-response items often do not require application of the assessed skills and thus do not provide authentic information about a student's response to a real-life situation or problem.

An assessment that uses constructed-response items has the potential for obtaining richer information about the depth of student knowledge and understanding of a particular content area. Constructed-response items can certainly be written to tap students' higher-order thinking skills along with content knowledge. There is also a common perception (which is
sometimes correct) that constructed-response items, especially performance assessment tasks, are more authentic because the tasks resemble real-world situations—they present real-world problems that require real-world problem solving. Yet using constructed-response items raises both efficiency and economic issues. Constructed-response items usually require more time to answer; consequently, fewer items can be included on an assessment, and the coverage of content will be sparser than could be attained with selected-response items. In his presentation, Mark Reckase explained that 50 to 100 selected-response items can be administered to an adult in one hour, while no more than 10 performance assessment tasks can be given in the same period of time. This leads to questions about content coverage and generalizability.

With fewer tasks, it becomes difficult to generalize from performance on one sampling of tasks to another. A problem with some performance tasks is that they have low generalizability. That is, students may do well on some tasks but not on others, and this is not a consequence of their skill level but simply because they are more engaged by some tasks than by others. A student may have the necessary background to be successful on a given task but may react to the specific context or other extraneous characteristics of the task. This reaction is referred to as a "person by task interaction" (Brennan and Johnson, 1995). When there are many tasks, as with selected-response questions, the lack of generalizability of a particular question is not a major issue, because results are averaged across many questions. When there are few questions, the "person by task interaction" becomes more important. Scoring of constructed-response items, especially performance assessment tasks, is also difficult and costly.
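The generalizability concern just described, that an assessment built from only a handful of tasks yields less dependable scores than one built from many short items, can be quantified with the Spearman-Brown prophecy formula from classical test theory. The single-task reliability used below is an invented illustrative value, not a figure reported at the workshop.

```python
# Illustrative sketch: the Spearman-Brown prophecy formula. Given the
# reliability of a one-task assessment, it predicts the reliability of an
# assessment assembled from n parallel tasks:
#   rho_n = n * rho_1 / (1 + (n - 1) * rho_1)

def spearman_brown(rho_1: float, n: int) -> float:
    """Predicted reliability of a test lengthened to n parallel tasks."""
    return n * rho_1 / (1 + (n - 1) * rho_1)

rho_1 = 0.30  # hypothetical reliability of a single performance task
for n in (1, 5, 10, 50):
    print(f"{n:3d} tasks -> predicted reliability {spearman_brown(rho_1, n):.2f}")
```

Under this (hypothetical) single-task reliability, a 10-task assessment is predicted to reach only about 0.8, while a 50-item selected-response test built from comparable material approaches 0.96, which is the quantitative face of the trade-off between a few rich tasks and many small items.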
Dunbar cautioned that there is a per-item cost associated with developing and scoring performance assessment items, as rubrics have to be developed to evaluate the quality of students' responses. Although development and scoring costs recur for both selected-response and constructed-response items, the recurring cost is considerably higher for tests that include performance assessment items. In addition, Dunbar reminded participants that the development of open-ended questions is not restricted to simply writing questions; it involves writing questions and anticipating answers. Although test developers attempt to design rubrics that accommodate many varied types of responses, unanticipated factors that arise during scoring often require the test developer to make adjustments, which can also mean additional scoring costs and retraining of scorers. Pearlman added, "There are many sad stories from the
world of performance assessment where we forgot to think about scoring because we were seduced by the really enjoyable task of designing what people will do. That happens to be the most sexy part of all of this and is by far the most dangerous."

If the desire is to have scores that are comparable from one person to the next and from one testing occasion to the next, vigorous efforts need to be made to ensure consistent application of the scoring criteria across responses and across administrations. Scorers need extensive training so that they apply the scoring criteria similarly, and their scoring must be monitored throughout the process. Variability in the way the scoring criteria are applied can result when score descriptions are vague or scorers have biases not corrected during training. Some of the quality control procedures used with selected-response items, such as analysis of differential item functioning,4 are more costly or difficult to carry out with constructed-response questions. There is also significant cost associated with training and paying scorers.

Reckase emphasized in his presentation that a defensible scoring procedure requires careful reader or scorer training. Reckase said that constructed-response items take much longer to score than selected-response items: even under optimal conditions he has found that only 10 performance assessment responses can be scored in one hour. This varies somewhat by the type of task and the skill level of the examinees. For example, the responses of someone whose achievement level is at a second grade level are likely to be shorter and quicker to score than the responses of someone whose achievement level is at the tenth grade level. Although these are serious issues for assessments designed to be formative and used for low-stakes purposes, they become crucial and raise fundamental questions of fairness in high-stakes tests in which results must be compared across tasks, raters, and programs.
In either case, implementing performance assessments introduces additional measurement complexities and cost issues.

4 Differential item functioning occurs when examinees from different groups have differing probabilities of getting an item correct after being matched on ability (see Camilli and Shepard, 1994, or Holland and Wainer, 1993).

BALANCING TRADE-OFFS

Trade-offs are inevitable in designing an assessment and selecting item formats that both appropriately measure particular content and skill areas
and serve the purpose of the assessment. In his closing comments, Reckase stressed that the kind of information that can be gathered from a set of items is limited by the information per unit of time. Multiple-choice items give many small bits of potentially unconnected information in a unit of time. Constructed-response items, particularly performance assessments, give fewer but larger and richer pieces of information, and they have the potential of providing more in-depth measurement of students' knowledge within the content and skill area.

ALIGNING TEST DESIGN WITH TEST PURPOSE

Over the last decade, assessment in adult education has generally been used to evaluate students' progress in various content areas. Most of the assessments in adult education programs are standardized tests (many of them norm-referenced)5 that have been used by teachers to make placement decisions and determinations about student advancement through the program. In other programs, the tests have been used to measure advancement toward individual student goals such as learning to read or obtaining a GED. Requirements for NRS reporting, however, underscore the need to obtain information in a form in which it can be accumulated and compared at a national level.

5 A norm-referenced test is used to ascertain an individual's status with respect to the performance of other individuals on the test (Popham, 2000).

Not surprisingly, some of the differences in views about the purposes of assessment in adult education can be traced to differences in views about the purpose of adult education itself. In his summary presentation, David Thissen shared his perception that workshop participants, like practitioners in the broader adult education arena, hold different beliefs about the fundamental purpose of adult education programs. Thissen heard some participants speak of the goal of adult education programs as aiding in the accomplishment of the idiosyncratic and often functional goals that brought students to each program. In contrast, he said, other participants seemed to believe that the goal of adult education programs is to help each student make progress toward "becoming an educated person." In this view, adult education programs are serving as an alternative to traditional K-12 school systems. According to Thissen, this point of view is implicitly expressed in the structure of the current NRS, which makes extensive use of difference scores6 computed within a six-level scale that to some extent mirrors "progress" through elementary and secondary school systems in such subject areas as language and mathematics.

6 Difference scores are the change in scores from the pretest to the posttest.

Thissen commented that participants also considered different purposes of assessment at different times in the workshop. In Table 3-1 the two broad purposes of adult education programs are crossed with the three traditional purposes of assessment in educational settings to form a display of six cells (labeled A through F). Each cell is defined below and grouped by purpose.

TABLE 3-1

                                        Purpose of the Adult Education Program
Purpose of the Assessment               Accomplishment of       Progress toward becoming
                                        idiosyncratic goals     an educated person
Providing Diagnostic Information                A                        B
Evaluating Student Progress                     C                        D
Evaluating the Program                          E                        F

SOURCE: Thissen (2002).

The first purpose of adult education to consider is the accomplishment of idiosyncratic goals—that is, the individual and often functional goals that bring each person to the program. In Cell A the purpose of assessment is to provide diagnostic information about a student. An assessment for a student whose goal is to learn to read and write English might be a performance assessment designed to evaluate his or her proficiency in English. When the purpose of assessment changes to evaluating student progress (Cell C), an appropriate assessment might be one of the exams offered as a part of the Microsoft Certification program, which certifies an individual as an MCP (Microsoft Certified Professional) or an MCSA (Microsoft Certified Systems Administrator) (see http://www.Microsoft.com/traincert/mcp/default.asp [April 29, 2002] for more information about these exams). Adult learners seek this kind of certification in order to qualify for a position or for career advancement. When the purpose of assessment is program evaluation, as in Cell E, the question becomes: Does the program help the student achieve his or her goals?
An appropriate assessment might be a performance assessment designed at the local level for an adult education center to evaluate teachers’ effectiveness.
If the purpose of adult education is the advancement of the student toward becoming an educated person (however that is defined), the kind of assessment changes for each assessment purpose. An assessment that meets this purpose of adult education and provides diagnostic information (Cell B) is the Degrees of Reading Power (DRP). DRP tests are holistic measures of students' comprehension of text. Test results are reported on a readability scale—the same scale that is used to measure the reading difficulty of printed material. By linking students' DRP test scores with the readability values of books, teachers are able to locate, assign, or recommend textbooks, literature, and popular titles of appropriate difficulty for their students. If the purpose of the assessment is to evaluate student progress (Cell D), assessments currently administered in adult education programs, such as the TABE or CASAS, are appropriate. Finally, an assessment that serves the purpose of program evaluation (Cell F) is Maryland's MSPAP. The MSPAP is administered to third, fifth, and eighth graders, but scores are reported only at the school and district level, not at the level of the individual student.

The initiation of the NRS has led to the use of assessments for more than one purpose, and Thissen enumerated several concerns about this situation. For example, some largely multiple-choice tests that were originally designed to evaluate student progress, such as the assessments in Cell D, are now being used to provide program evaluation data (Cell F). Thissen said that, as a test designer, his first question would be, "In which cell does the task fall?" The answer would guide decision making about assessment design. From his perspective as a test designer, Thissen believes that the selection of the cell in which the problem lies is not a measurement issue. Rather, that selection needs to be made first, and the measurement issues and mechanics of developing an appropriate assessment can follow.
Thus, sorting out the issues raised by purpose of programs and purposes of assessments, as illustrated in this table, is necessary for making sound decisions about the design and selection of adult education assessments.