
Measuring Literacy: Performance Levels for Adults (2005)

Chapter: 5 Developing Performance-Level Descriptions and Setting Cut Scores

Suggested Citation:"5 Developing Performance-Level Descriptions and Setting Cut Scores." National Research Council. 2005. Measuring Literacy: Performance Levels for Adults. Washington, DC: The National Academies Press. doi: 10.17226/11267.


5 Developing Performance-Level Descriptions and Setting Cut Scores

In this chapter, we detail the processes we used for developing descriptions of the performance levels as well as the methods we used to determine the cut scores to be associated with each of the performance levels. The performance-level descriptions were developed through an iterative process in which the descriptions evolved as we drafted wording, solicited feedback, reviewed the assessment frameworks and tasks, and made revisions. The process of determining the cut scores involved using procedures referred to as "standard setting," which were introduced in Chapter 3.

As we noted in Chapter 3, standard setting is intrinsically judgmental. Science enters the process only as a way of ensuring the internal and external validity of informed judgments (e.g., that the instructions are clear and understood by the panelists; that the standards are statistically reliable and reasonably consistent with external data, such as levels of completed schooling). Given the judgmental nature of the task, it is not easy to develop methods and procedures that are scientifically defensible; indeed, standard-setting procedures have provoked considerable controversy (e.g., National Research Council [NRC], 1998; Hambleton et al., 2001). In developing our procedures, we have familiarized ourselves with these controversies and have relied on the substantial research base on standard setting[1] and, in particular, on the research on setting achievement levels for the National Assessment of Educational Progress (NAEP).

[1] While we familiarized ourselves with a good deal of this research, we do not provide an exhaustive listing of these articles and cite only the studies that are most relevant for the present project. There are several works that provide overviews of methods, their variations, and advantages and disadvantages, such as Jaeger's article in Educational Measurement (1989) and the collection of writings in Cizek's (2001b) Setting Performance Standards. We frequently refer readers to these writings because they provide a convenient and concise means for learning more about standard setting; however, we do not intend to imply that these were the only documents consulted.

NAEP's standard-setting procedures are perhaps the most intensely scrutinized procedures in existence today, having been designed, guided, and evaluated by some of the most prominent measurement experts in the country. The discussions about NAEP's procedures, both the favorable comments and the criticisms, provide guidance for those designing a standard-setting procedure. We attempted to implement procedures that reflected the best of what NAEP does and that addressed the criticisms that have been leveled against NAEP's procedures. Below we highlight the major criticisms and describe how we addressed them. We raise these issues, not to take sides on the various controversies, but to explain how we used this information to design our standard-setting methods.

NAEP has for some time utilized the modified Angoff method for setting cut scores, a procedure that some consider to yield defensible standards (Hambleton and Bourque, 1991; Hambleton et al., 2000; Cizek, 1993, 2001a; Kane, 1993, 1995; Mehrens, 1995; Mullins and Green, 1994) and some believe to pose an overly complex cognitive task for judges (National Research Council, 1999; Shepard, Glaser, and Linn, 1993). While the modified Angoff method is still widely used, especially for licensing and certification tests, many other methods are available. In fact, although the method is still used for setting the cut scores for NAEP's achievement levels, other methods are being explored with the assessment (Williams and Schulz, 2005).

Given the unresolved controversies about the modified Angoff method, we chose not to use it. Instead, we selected a relatively new method, the bookmark standard-setting method, which appears to be growing in popularity. The bookmark method was designed specifically to reduce the cognitive complexity of the task posed to panelists (Mitzel et al., 2001). The procedure was endorsed as a promising method for use on NAEP (National Research Council, 1999) and, based on recent estimates, is used by more than half of the states in their K-12 achievement tests (Egan, 2001).

Another issue that has been raised in relation to NAEP's standard-setting procedures is that different standard-setting methods were required for NAEP's multiple-choice and open-ended items. The use of different methods led to widely disparate cut scores, and there has been disagreement about how to resolve these differences (Hambleton et al., 2000; National Research Council, 1999; Shepard, Glaser, and Linn, 1993). An advantage of the bookmark procedure is that it is appropriate for both item types. While neither the National Adult Literacy Survey (NALS) nor the National Assessment of Adult Literacy (NAAL) uses multiple-choice items, both include open-ended items, some of which were scored as right or wrong and some of which were scored according to a partial credit scoring scheme (e.g., wrong, partially correct, fully correct). The bookmark procedure is suitable for both types of scoring schemes.

Another issue discussed in relation to NAEP's achievement-level setting was the collection of evidence used to evaluate the reasonableness of the cut scores. Concerns were expressed about the discordance between cut scores that resulted from different standard-setting methods (e.g., the modified Angoff method and the contrasting groups method yielded different cut scores for the assessment) and the effect of these differences on the percentages of students categorized into each of the achievement levels. Concerns were also expressed about whether the percentages of students in each achievement level were reasonable given other indicators of students' academic achievement in the United States (e.g., performance on the SAT, percentage of students enrolled in Advanced Placement programs), although there was considerable disagreement about the appropriateness of such comparisons. While we do not consider that our charge required us to resolve these disagreements about NAEP's cut scores, we did try to address the criticisms.

As a first step to address these concerns, we used the background data available from the assessment as a means for evaluating the reasonableness of the bookmark cut scores. To accomplish this, we developed an adapted version of the contrasting groups method, which utilizes information about examinees apart from their actual test scores. This quasi-contrasting groups (QCG) approach was not used as a strict standard-setting technique but as a means for considering adjustments to the bookmark cut scores. While validation of the recommended cut scores should be the subject of a thorough research endeavor that would be beyond the scope of the committee's charge, comparison of the cut scores to pertinent background data provides initial evidence.
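
The committee's quasi-contrasting groups procedure is described at the end of this chapter; purely to illustrate the generic contrasting-groups idea it adapts, the sketch below uses hypothetical scale scores for two adjacent groups of examinees (for example, groups formed from a background variable such as educational attainment) and picks the cut score that minimizes misclassification between them. The scores, group labels, and the misclassification criterion are illustrative assumptions, not the committee's actual procedure.

```python
# Illustrative sketch only: a generic contrasting-groups cut score between two
# adjacent examinee groups. Scores and group membership are hypothetical.

lower_group = [215, 228, 240, 252, 260, 268, 275]   # e.g., adults in the lower group
upper_group = [255, 262, 270, 281, 290, 298, 310]   # e.g., adults in the adjacent higher group

def misclassified(cut, lower, upper):
    """Lower-group members placed at or above the cut, plus upper-group members below it."""
    return sum(s >= cut for s in lower) + sum(s < cut for s in upper)

# Search candidate cut points across the observed score range and keep the one
# that separates the two groups with the fewest misclassifications.
candidates = range(min(lower_group), max(upper_group) + 1)
cut_score = min(candidates, key=lambda c: misclassified(c, lower_group, upper_group))
print("cut score minimizing misclassification:", cut_score)
```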

We begin our discussion with an overview of the bookmark standard-setting method and the way we implemented it. Participants in the standard settings provided feedback on the performance-level descriptions, and we present the different versions of the descriptions and explain why they were revised. The results of the standard settings appear at the end of this chapter, where we also provide a description of the adapted version of the contrasting groups procedure that we used and make our recommendations for cut scores. The material in this chapter provides an overview of the bookmark procedures and highlights the most crucial results from the standard setting; additional details about the standard setting are presented in Appendixes C and D.

THE BOOKMARK STANDARD-SETTING METHOD

A relatively new approach, the bookmark procedure was designed to simplify the judgmental task by asking panelists to directly set the cut scores, rather than asking them to make judgments about test questions in isolation, as in the modified Angoff method (Mitzel et al., 2001). The method has the advantage of allowing participants to focus on the content and skills assessed by the test questions rather than just on the difficulty of the questions, as panelists are given "item maps" that detail item content (Zieky, 2001). The method also provides an opportunity to revise performance-level descriptions at the completion of the standard-setting process so they are better aligned with the cut scores.

In a bookmark standard-setting procedure, test questions are presented in a booklet arranged in order from easiest to hardest according to their estimated level of difficulty, which is derived from examinees' answers to the test questions. Panelists receive a set of performance-level descriptions to use while making their judgments. They review the test questions in these booklets, called "ordered item booklets," and place a "bookmark" to demark the set of questions that examinees who have the skills described by a given performance level will be required to answer correctly with a given level of accuracy. To explain, using the committee's performance-level categories, panelists would consider the description of skills associated with the basic literacy category and, for each test question, make a judgment about whether an examinee with these skills would be likely to answer the question correctly or incorrectly. Once the bookmark is placed for the first performance-level category, the panelists would proceed to consider the skills associated with the second performance-level category (intermediate) and place a second bookmark to denote the set of items that individuals who score in this category would be expected to answer correctly with a specified level of accuracy. The procedure is repeated for each of the performance-level categories.

The bookmark method requires specification of what it means to be "likely" to answer a question correctly. The designers of the method suggest that "likely" be defined as "67 percent of the time" (Mitzel et al., 2001, p. 260). This concept of "likely" is important because it is the response probability value used in calculating the difficulty of each test question (that is, the scale score associated with the item). Although a response probability of 67 percent (referred to as rp67) is common with the bookmark procedure, other values could be used, and we address this issue in more detail later in this chapter.
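
The mapping from items to scale scores depends on the item response theory model and the response probability that are used; the operational NALS/NAAL item parameters are not reproduced here. As a minimal sketch, assuming a two-parameter logistic (2PL) item model on an arbitrary reporting scale and invented item parameters, the code below computes the scale score at which each item reaches a chosen response probability (rp67 by default) and sorts the items from easiest to hardest, which is conceptually how an ordered item booklet is assembled.

```python
import math

# Hypothetical 2PL item parameters: a = discrimination, b = difficulty,
# both expressed on an arbitrary reporting scale (not actual NALS/NAAL values).
items = {
    "item_A": (0.045, 230.0),
    "item_B": (0.030, 260.0),
    "item_C": (0.055, 275.0),
    "item_D": (0.040, 300.0),
}

def rp_location(a, b, rp=0.67):
    """Scale score at which the 2PL probability of a correct response equals rp."""
    return b + math.log(rp / (1.0 - rp)) / a

# Order items from easiest to hardest by their rp67 locations, as in an
# ordered item booklet.
ordered_booklet = sorted(items, key=lambda name: rp_location(*items[name]))
for name in ordered_booklet:
    print(f"{name}: rp67 location = {rp_location(*items[name]):.1f}")
```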

To demonstrate how the response probability value is used in making bookmark judgments, we rely on the performance levels that we recommended in Chapter 4. Panelists first consider the description of the basic literacy performance level and the content and skills assessed by the first question in the ordered item booklet, the easiest question in the booklet. Each panelist considers whether an individual with the skills described in the basic category would have a 67 percent chance of answering this question correctly (or, stated another way, whether an individual with the skills described in the basic category would be likely to correctly answer a question measuring these specific skills two out of three times). If a panelist judges this to be true, he or she proceeds to the next question in the booklet. This continues until the panelist comes to a question that he or she judges a basic-level examinee does not have a 67 percent chance of answering correctly (or would not be likely to answer correctly two out of three times). The panelist places his or her bookmark for the basic level on this question. The panelist then moves to the description of the intermediate level and proceeds through the ordered item booklet until reaching an item that he or she judges an individual with intermediate-level skills would not be likely to answer correctly 67 percent of the time. The intermediate-level bookmark would be placed on this item. Determination of the placement of the bookmark for the advanced level proceeds in a similar fashion.
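
In practice the placement decisions are human judgments, but the mechanics can be made concrete by simulating a panelist whose judgments happen to agree with an item response model. The sketch below uses hypothetical 2PL parameters and a hypothetical borderline-basic ability of 255; it walks the ordered booklet and puts the bookmark on the first item the borderline examinee would not answer correctly with probability 0.67. Converting the placement to a cut score here uses the rp67 location of the bookmarked item, which is only one of several conventions.

```python
import math

def p_correct(theta, a, b):
    """2PL probability that an examinee of ability theta answers the item correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def rp_location(a, b, rp=0.67):
    """Scale score at which the item's response probability equals rp."""
    return b + math.log(rp / (1.0 - rp)) / a

# Ordered item booklet: (name, a, b), already sorted from easiest to hardest.
ordered_items = [("q1", 0.05, 210.0), ("q2", 0.04, 240.0),
                 ("q3", 0.05, 265.0), ("q4", 0.04, 290.0)]

def place_bookmark(theta_borderline, rp=0.67):
    """Index of the first item the borderline examinee would NOT answer
    correctly with probability rp; the bookmark goes on that item."""
    for i, (_, a, b) in enumerate(ordered_items):
        if p_correct(theta_borderline, a, b) < rp:
            return i
    return len(ordered_items) - 1      # fallback: bookmark on the last item

idx = place_bookmark(theta_borderline=255.0)
name, a, b = ordered_items[idx]
cut_score = rp_location(a, b)          # one convention for deriving the cut score
print(f"bookmark on {name}; implied cut score about {cut_score:.0f}")
```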

Panelists sit at a table with four or five other individuals who are all working with the same set of items, and the bookmark standard-setting procedure is implemented in an iterative fashion. There are three opportunities, or rounds, for panelists to decide where to place their bookmarks. Panelists make their individual decisions about bookmark placements during Round 1, with no input from other panelists. Afterward, panelists seated at the same table compare and discuss their ratings and then make a second set of judgments as part of Round 2. As part of the bookmark process, panelists discuss their bookmark placements, and agreement about the placements is encouraged. Panelists are not required to come to consensus about the placement of bookmarks, however.

After Round 2, bookmark placements are transformed to test scale scores, and the median scale score is determined for each performance level. At this stage, the medians are calculated by considering the bookmark placements for all panelists who are working on a given test booklet (e.g., all panelists at all tables who are working on the prose ordered item booklet).

Panelists are usually provided with information about the percentage of test takers whose scores would fall into each performance-level category based on these medians. This feedback is referred to as "impact data" and serves as a reality check to allow panelists to adjust and fine-tune their judgments. Usually, all the panelists working on a given ordered item booklet assemble and review the bookmark placements, the resulting median scale scores, and the impact data together. Panelists then make a final set of judgments during Round 3, working individually at their respective tables.

The median scale scores are recalculated after the Round 3 judgments are made. Usually, mean scale scores are also calculated, and the variability in panelists' judgments is examined to evaluate the extent to which they disagree about bookmark placements. At the conclusion of the standard setting, it is customary to allot time for panelists to discuss and write performance-level descriptions for the items reviewed during the standard setting.
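
A sketch of the bookkeeping just described, using hypothetical panelist cut scores (already transformed to the reporting scale) and hypothetical examinee scale scores: the median cut score is taken across panelists for each level, and the impact data are simply the percentages of examinees falling between adjacent cut scores. A real analysis would also apply sampling weights, which are omitted here.

```python
from bisect import bisect_right
from statistics import median

# Hypothetical panelist cut scores (scale units) after one round, by level.
panelist_cuts = {
    "basic":        [208, 212, 215, 210, 218],
    "intermediate": [262, 268, 270, 265, 259],
    "advanced":     [335, 342, 338, 350, 340],
}
median_cuts = {level: median(vals) for level, vals in panelist_cuts.items()}

# Impact data: percentage of (hypothetical, unweighted) examinees in each level.
scores = [188, 202, 215, 248, 259, 270, 301, 322, 345, 360]
cuts = sorted(median_cuts.values())
levels = ["below basic", "basic", "intermediate", "advanced"]

counts = [0] * len(levels)
for s in scores:
    counts[bisect_right(cuts, s)] += 1     # index of the interval this score falls in

for level, n in zip(levels, counts):
    print(f"{level}: {100 * n / len(scores):.0f}%")
```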

Committee's Approach with the Bookmark Method

The committee conducted two bookmark standard-setting sessions, one in July 2004 with data from the 1992 NALS and one in September 2004 with data from the 2003 NAAL. This allowed us to use two different groups of panelists, to try out our procedures with the 1992 data and then make corrections (as needed) before the standard setting with the 2003 data was conducted, and to develop performance-level descriptions that would generalize to both versions of the assessment.

Richard Patz, one of the developers of the bookmark method, served as consultant to the committee and led the standard-setting sessions. Three additional consultants and National Research Council project staff assisted with the sessions, and several committee members observed the sessions. The agendas for the two standard-setting sessions appear in Appendixes C and D.

Because the issue of response probability had received so much attention in relation to NALS results (see Chapter 3), we arranged to collect data from panelists about the impact of using different instructions about response probabilities. This data collection was conducted during the July standard setting with the 1992 data and is described in the section of this chapter called "Bookmark Standard Setting with 1992 Data."

The standard-setting sessions were organized to provide opportunity to obtain feedback on the performance-level descriptions. During the July session, time was provided for the panelists to suggest changes in the descriptions based on the placement of their bookmarks after the Round 3 judgments had been made. The committee reviewed their feedback, refined the descriptions, and in August invited several of the July panelists to review the revised descriptions. The descriptions were again refined, and a revised version was prepared for the September standard setting. An extended feedback session was held at the conclusion of the September standard setting to finalize the descriptions.

The July and September bookmark procedures were implemented in relation to the top four performance levels only—below basic, basic, intermediate, and advanced. This was a consequence of a decision made by the Department of Education during the development of NAAL. As mentioned in Chapter 2, in 1992, a significant number of people were unable to complete any of the NALS items and therefore produced test results that were clearly low but essentially unscorable. Rather than expanding the coverage of NAAL into low levels of literacy at the letter, word, and simple sentence level, the National Center for Education Statistics (NCES) chose to develop a separate low-level assessment, the Adult Literacy Supplemental Assessment (ALSA). ALSA items were not put on the same scale as the NAAL items or classified into the three literacy areas. As a result, we could not use the ALSA questions in the bookmark procedure. This created a de facto cut score between the nonliterate in English and below basic performance levels. Consequently, all test takers who performed poorly on the initial screening questions (the core questions) and were administered ALSA are classified into the nonliterate in English category.[2]

[2] Some potential test takers were not able to participate due to various literacy-related reasons, as determined by the interviewer, and are also classified as nonliterate in English. These nonparticipants include individuals who have difficulty with reading or writing or who are not able to communicate in English or Spanish. Another group of individuals who were not able to participate are those with a mental disability, such as retardation, a learning disability, or other mental or emotional conditions. Given the likely wide variation in literacy skills of individuals in this group, these individuals are treated as nonparticipants and are not included in the nonliterate in English category. Since some of these individuals are likely to have low literacy skills, however, an upper bound on the size of the nonliterate in English category could be obtained by including these individuals in the nonliterate in English category.

As a result, the performance-level descriptions used for the bookmark procedures included only the top four levels, and the skills evaluated on ALSA were incorporated into the below basic description. After the standard settings, each of the performance-level descriptions for the below basic category was revised, and the nonliterate in English category was formulated. The below basic description was split to separate the skills that individuals who took ALSA would be likely to have from the skills that individuals who were administered NAAL, but who were not able to answer enough questions correctly to reach the basic level, would be likely to have.

Initially, the committee hoped to consolidate prose, document, and quantitative items into a single ordered item booklet for the bookmark standard setting, which would have produced cut scores for an overall, combined literacy scale. This was not possible, however, because of an operational decision made by NCES and its contractors to scale the test items separately by literacy area. That is, the difficulty level of each item was determined separately for prose, document, and quantitative items. This means that it was impossible to determine, for example, if a given prose item was harder or easier than a given document item. This decision appears to have been based on the assumption that the three scales measure different dimensions of literacy and that it would be inappropriate to combine them into a single scale. Regardless of the rationale for the decision, it precluded our setting an overall cut score.
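
The footnote above implies a simple bounding calculation: the reported size of the nonliterate in English category excludes nonparticipants with mental or emotional conditions, and an upper bound is obtained by adding them in. The sketch below uses entirely hypothetical counts to show the arithmetic; it does not reproduce actual NAAL figures.

```python
# Hypothetical counts for illustration only (not actual NAAL figures).
total_sample = 18_000
classified_nonliterate_in_english = 900   # ALSA takers plus literacy-related nonparticipants
excluded_nonparticipants = 250            # nonparticipants with mental or emotional conditions

lower_bound = classified_nonliterate_in_english / total_sample
upper_bound = (classified_nonliterate_in_english + excluded_nonparticipants) / total_sample

print(f"nonliterate in English: {100 * lower_bound:.1f}% to {100 * upper_bound:.1f}% of the sample")
```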

Participants in the Bookmark Standard Settings

Selecting Panelists

Research and experience suggest that the background and expertise the panelists bring to the standard-setting activity are factors that influence the cut score decisions (Cizek, 2001a; Hambleton, 2001; Jaeger, 1989, 1991; Raymond and Reid, 2001). Furthermore, the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 1999) specify that panelists should be highly knowledgeable about the domain in which judgments are required and familiar with the population of test takers. We therefore set up a procedure to solicit recommendations for potential panelists for both standard-setting sessions, review their credentials, and invite those with appropriate expertise to participate. Our goal was to assemble a group of panelists who were knowledgeable about acquisition of literacy skills, had an understanding of the literacy demands placed on adults in this country and the strategies adults use when presented with a literacy task, had some background in standardized testing, and would be expected to understand and correctly implement the standard-setting tasks.

Solicitations for panelists were sent to a variety of individuals: stakeholders who participated in the committee's public forum, state directors of adult education programs, directors of boards of adult education organizations, directors of boards of professional organizations for curriculum and instruction of adult education programs, and officials with the Council for Applied Linguistics, the National Council of Teachers of English, and the National Council of Teachers of Mathematics. The committee also solicited recommendations from state and federal correctional institutions as well as from the university community for researchers in the areas of workplace, family, and health literacy. Careful attention was paid to including representatives from as many states as possible, including representatives from the six states that subsidized additional testing of adults in 2003 (Kentucky, Maryland, Massachusetts, Missouri, New York, and Oklahoma).

The result of this extensive networking process produced a panel of professionals who represented adult education programs in urban, suburban, and rural geographic areas and a mix of practitioners, including teachers, tutors, coordinators, and directors. Almost all of the panelists had participated at some point in a range-finding or standard-setting activity, which helped them understand the connection between the performance-level descriptions and the task of determining an appropriate cut score.

Panelists' Areas of Expertise

Because NALS and NAAL are assessments of adult literacy, we first selected panelists with expertise in the fields of adult education and adult literacy. Adult educators may specialize in curriculum and instruction of adult basic education (ABE) skills, preparation of students for the general educational development (GED) certificate, or English for speakers of other languages. In addition, adult education and adult literacy professionals put forth significant curricular, instructional, and research efforts in the areas of workplace literacy, family literacy, and health literacy. Expertise in all of these areas was represented among the panelists.[3]

[3] We note that we considered including college faculty as panelists, as they would have brought a different perspective to the standard setting. In the end, we were somewhat concerned about their familiarity with adults with lower literacy skills and thought that it would be difficult for those who primarily work in college settings to make judgments about the skills of adults who would be classified at the levels below intermediate. There was a limit to the number of panelists we could include, and we tried to include those with experience working with adults whose skills fell at the levels primarily assessed on NALS and NAAL.

For the July standard setting, only individuals working in adult education and adult literacy were selected to participate. Based on panelist feedback following this standard setting, we decided to broaden the areas of expertise for the September standard setting. Specifically, panelists indicated they would have valued additional perspectives from individuals in areas affected by adult education services, such as human resource management, as well as from teachers who work with middle school and high school students. Therefore, for the second session, we selected panelists from two additional fields: (1) middle or high school language arts teachers and (2) industrial and organizational psychologists who specialize in skill profiling or employee assessment for job placement.

The language arts classroom teachers broadened the standard-setting discussions by providing input on literacy instruction for adolescents who were progressing through the grades in a relatively typical manner, whereas teachers of ABE or GED had experience working with adults who, for whatever reason, did not acquire the literacy skills attained by most students who complete the U.S. school system.

The industrial and organizational psychologists who participated came from academia and corporate environments and brought a research focus and a practitioner perspective to the discussion that complemented those of the other panelists, who were primarily immersed in the adult education field. Table 5-1 gives a profile of the panelists who participated in the two standard-setting sessions.

BOOKMARK STANDARD SETTING WITH 1992 DATA

The first standard-setting session was held to obtain panelists' judgments about cut scores for the 1992 NALS and to collect their feedback about the performance-level descriptions. A total of 42 panelists participated in the session. Panelists were assigned to groups, and each group was randomly assigned to two of the three literacy areas (prose, document, or quantitative). Group 1 worked with the prose and document items; Group 2 worked with the prose and quantitative items; and Group 3 worked with the document and quantitative items. The sequence in which they worked on the different literacy scales was alternated in an attempt to balance any potential order effects.

For each literacy area, an ordered item booklet was prepared that rank-ordered the test questions from least to most difficult according to NALS examinees' responses. The ordered item booklets consisted of all the available NALS tasks for a given literacy area, even though with the balanced incomplete block spiraling (see Chapter 2), no individual actually responded to all test questions. The number of items in each NALS ordered item booklet was 39 for prose literacy, 71 for document literacy, and 42 for quantitative literacy.

Two training sessions were held, one for the "table leaders," the individuals assigned to be discussion facilitators for the tables of panelists, and one for all panelists. The role of the table leader was to serve as a discussion facilitator but not to dominate the discussion or to try to bring the tablemates to consensus about cut scores.

The bookmark process began by having each panelist respond to all the questions in the NALS test booklet for their assigned literacy scale. For this task, the test booklets contained the full complement of NALS items for each literacy scale, arranged in the order test takers would see them but not rank-ordered as in the ordered item booklets. Afterward, the table leader facilitated discussion of differences among items with respect to knowledge, skills, and competencies required and what was measured by the scoring rubrics.

Panelists then received the ordered item booklets. They discussed each item and noted characteristics they thought made one item more difficult than another. Each table member then individually placed their Round 1 bookmarks representing cut points for basic, intermediate, and advanced literacy.

TABLE 5-1 Profile of Panelists Involved in the Committee's Standard Settings

Participant Characteristics                        July (N = 42)    September (N = 30)

Gender
  Female                                           83 [a]           77
  Male                                             17               23
Ethnicity
  Black                                             2                7
  Caucasian                                        69               83
  Hispanic                                          0                3
  Native American                                   2                0
  Not reported                                     26                7
Geographic Region [b]
  Midwest                                          26               37
  Northeast                                        33               23
  South                                             7               13
  Southeast                                        19                7
  West                                             14               20
Occupation [c]
  University instructors                            7               10
  Middle school, high school, or adult
    education instructors                          19               30
  Program coordinators or directors                38               40
  Researchers                                      12                7
  State office of adult education representative   24               13
Area of Expertise
  Adult education                                 100               70
  Classroom teacher                                 0               17
  Human resources or industrial and
    organizational psychology                       0               13
Work Setting                                       NA [d]
  Rural                                                              3
  Suburban                                                          33
  Urban                                                             43
  Combination of all three settings                                 10
  Other or not reported                                             10

[a] Percentage.
[b] The geographic regions were grouped in the following way: Midwest (IA, IL, IN, KY, MI, MN, MO, ND, OH, WI), Northeast (CT, DE, MA, ME, MD, NH, NJ, NY, PA, VT), South (AL, LA, MS, OK, TN, TX), Southeast (FL, GA, NC, SC, VA), and West (AZ, CA, CO, MT, NM, NV, OR, UT, WA, WY).
[c] Many panelists reported working in a variety of adult education settings where their work entailed aspects of instruction, curriculum development, program management, and research. For the purposes of constructing this table, the primary duties and/or job title of each panelist, as specified on the panelist's resume, was used to determine which of the five categories of occupation was appropriate for each panelist.
[d] Data not collected in July.

In preparation for Round 2, each table received a summary of the Round 1 bookmark placements made by each table member and was provided the medians of the bookmark placements (calculated for each table). Table leaders facilitated discussion among table members about their respective bookmark placements, and panelists were then asked to independently make their Round 2 judgments.

In preparation for Round 3, each table received a summary of the Round 2 bookmark placements made by each table member as well as the medians for the table. In addition, each table received information about the proportion of the 1992 population who would have been categorized as having below basic, basic, intermediate, or advanced literacy based on the table's median cut points. After discussion, each panelist made his or her final, Round 3, judgments about bookmark placements for the basic, intermediate, and advanced literacy levels. At the conclusion of Round 3, panelists were asked to provide feedback about the performance-level descriptions by reviewing the items that fell between each of their bookmarks and editing the descriptions accordingly.

The processes described above were repeated for the second literacy area. The bookmark session concluded with a group session to obtain feedback from the panelists, both orally and through a written survey.

Using Different Response Probability Instructions

In conjunction with the July standard setting, the committee collected information about the impact of varying the instructions given to panelists with regard to the criteria used to judge the probability that an examinee would answer a question correctly (the response probability).

The developers of the bookmark method recommend that a response probability of 67 (or two out of three times) be used and have offered both technical and nontechnical reasons for their recommendation. Their technical rationale stems from an analysis by Huynh (2000) in which the author demonstrated mathematically that the item information provided by a correct response to an open-ended item is maximized at the score point associated with a response probability of 67.[4] From a less technical standpoint, the developers of the bookmark method argue that a response probability of 67 percent is easier for panelists to conceive of than less familiar probabilities, such as 57.3 percent (Mitzel et al., 2001). They do not entirely rule out use of other response probabilities, such as 65 or 80, but argue that a response probability of 50 would seem to be conceptually difficult for panelists. They note, however, that research is needed to further understand the ways in which panelists apply response probability instructions and pose three questions that they believe remain to be answered: (1) Do panelists understand, internalize, and use the response probability criterion? (2) Are panelists sensitive to the response probability criterion such that scaling with different levels will systematically affect cut score placements? (3) Do panelists have a native or baseline conception of mastery that corresponds to a response probability?

[4] We refer the reader to the original article or to Mitzel et al. (2001) for more detailed information.

Given these questions about the ways in which panelists apply response probability instructions, and the controversies surrounding the use of a response probability of 80 in 1992, the committee chose to investigate this issue further. We wanted to find out more about (1) the extent to which panelists understand and can make sense of the concept of response probability level when making judgments about cut scores and (2) the extent to which panelists make different choices when faced with different response probability levels. The committee decided to explore panelists' use and understanding of three response probability values—67, since it is commonly used with the bookmark procedures, as well as 80 and 50, since these values were discussed in relation to NALS in 1992.
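
The practical effect of the response probability choice can be seen by mapping the same item at rp50, rp67, and rp80: a higher response probability pushes an item's mapped location, and therefore the scale score a panelist is implicitly judging, upward. The sketch below reuses the hypothetical 2PL item model from the earlier examples; it is illustrative only and does not reproduce the 1992 or 2003 scaling.

```python
import math

def rp_location(a, b, rp):
    """Scale score at which a 2PL item's probability of a correct response equals rp."""
    return b + math.log(rp / (1.0 - rp)) / a

# One hypothetical item (a = discrimination, b = difficulty on an arbitrary scale).
a, b = 0.04, 270.0

for rp in (0.50, 0.67, 0.80):
    print(f"rp{int(rp * 100)}: item mapped at scale score {rp_location(a, b, rp):.1f}")
```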

The panelists were grouped into nine tables of five panelists each. Each group was given different instructions and worked with different ordered item booklets. Three tables (approximately 15 panelists) worked with booklets in which the items were ordered with a response probability of 80 percent and received instructions to use 80 percent as the likelihood that the examinee would answer an item correctly. Similarly, three tables used ordered item booklets and instructions consistent with a response probability of 67 percent, and three tables used ordered item booklets and instructions consistent with a response probability of 50 percent.

Panelists received training in small groups about their assigned response probability instructions (see Appendix C for the exact wording). Each group was asked not to discuss the instructions about response probability level with anyone other than their tablemates so as not to cause confusion among panelists working with different response probability levels. Each table of panelists used the same response probability level for the second content area as they did for the first.

Refining the Performance-Level Descriptions

The performance-level descriptions used at the July standard setting consisted of overall and subject-specific descriptors for the top four performance levels (see Table 5-2). Panelists' written comments about and edits of the performance levels were reviewed. This feedback was invaluable in helping the committee rethink and reword the level descriptions in ways that better addressed the prose, document, and quantitative literacy demands suggested by the assessment items. Four panelists who had participated in the July standard-setting session were invited to review the revised performance-level descriptions prior to the September standard setting, and their feedback was used to further refine the descriptions. The performance-level descriptions used in the September standard setting are shown in Table 5-3.

BOOKMARK STANDARD SETTING WITH 2003 DATA

A total of 30 panelists from the fields of adult education, middle and high school English language arts, industrial and organizational psychology, and state offices of adult education participated in the second standard setting. Similar procedures were followed as in July, with the exception that all panelists used the 67 percent response probability instructions.

Panelists were assigned to groups, and the groups were then randomly assigned to literacy areas, with the subject-area assignments balanced as they had been in July. Two tables worked on prose literacy first; one of these tables then worked on document literacy and the other on quantitative literacy. Two tables worked on document literacy first; one of these tables was assigned to work on quantitative literacy and the other to work on prose literacy. The remaining two tables that worked on quantitative literacy first were similarly divided for the second content area: one table was assigned to work on prose literacy while the other was assigned to work on document literacy.

TABLE 5-2 Performance-Level Descriptions Used During the July 2004 NALS Standard Setting

A. Overall Descriptions

An individual who scores at this level:

I. Below Basic Literacy: May be able to recognize some letters, common sight words, or digits in English; has difficulty reading and understanding simple words, phrases, numbers, or quantities.

II. Basic Literacy: Can read and understand simple words, phrases, numbers, and quantities in English and locate information in short texts about commonplace events and situations; has some difficulty with drawing inferences and making use of quantitative information in such texts.

III. Intermediate Literacy: Can read and understand written material in English sufficiently well to locate information in denser, less commonplace texts, construct straightforward summaries, and draw simple inferences; has difficulty with drawing inferences from complex, multipart written material and with making use of quantitative information when multiple operations are involved.

IV. Advanced Literacy: Can read and understand complex written material in English sufficiently well to locate and integrate multiple pieces of information, perform sophisticated analytical tasks such as making systematic comparisons, draw sophisticated inferences from that material, and can make use of quantitative information when multiple operations are involved.

The National Adult Literacy Survey measures competence across a broad range of literacy development. Nonetheless, there exist meaningful distinctions in literacy outside of this range, including degrees of competence well above those described as required for "Advanced Literacy" and below what is required for "Basic Literacy." The "Below Basic Literacy" and "Advanced Literacy" levels by definition encompass all degrees of literacy below or above, respectively, those levels described in the above performance-level descriptors.

B. Subject-Area Descriptions

An individual who scores at this level:

I. Below Basic Literacy
  Prose: May be able to recognize letters but not able to consistently match sounds with letters; may be able to recognize a few common sight words.
  Document: May be able to recognize letters, numbers, and/or common sight words in familiar contexts such as on labels or signs; is not able to follow written instructions on simple documents.
  Quantitative: May be able to recognize numbers and/or locate numbers in brief familiar contexts; is not able to perform simple arithmetic operations.

II. Basic Literacy
  Prose: Is able to read and locate information in brief, commonplace text, but has difficulty drawing appropriate conclusions from the text, distinguishing fact from opinion or identifying an implied theme or idea in a selection.
  Document: Is able to understand or follow instructions on simple documents; able to locate and/or enter information based on a literal match of information in the question to information called for in the document itself.
  Quantitative: Is able to locate easily identified numeric information in simple texts, graphs, tables, or charts; able to perform simple arithmetic operations or solve simple word problems when the operation is specified or easily inferred.

III. Intermediate Literacy
  Prose: Is able to read and understand moderately dense, less commonplace text that contains long paragraphs; able to summarize, make inferences, determine cause and effect, and recognize author's purpose.
  Document: Is able to locate information in dense, complex documents in which repeated reviewing of the document is involved.
  Quantitative: Is able to locate numeric information that is not easily identified in texts, graphs, tables, or charts; able to perform routine arithmetic operations when the operation is not specified or easily inferred.

IV. Advanced Literacy
  Prose: Is able to read lengthy, complex, abstract texts; able to handle conditional text; able to synthesize information and perform complex inferences.
  Document: Is able to integrate multiple pieces of information in documents that contain complex displays; able to compare and contrast information; able to analyze and synthesize information from multiple sources.
  Quantitative: Is able to locate and integrate numeric information in complex texts, graphs, tables, or charts; able to perform multiple and/or fairly complex arithmetic operations when the operation(s) is not specified or easily inferred.

TABLE 5-3 Performance-Level Descriptions Used During September 2004 NAAL Standard Setting

A. Overall Descriptions

An individual who scores at this level independently and in English:

I. Below Basic Literacy: May independently be able to recognize some letters, common sight words, or digits in English; may sometimes be able to locate and make use of simple words, phrases, numbers, and quantities in short texts or displays (e.g., charts, figures, or forms) in English that are based on commonplace contexts and situations; may sometimes be able to perform simple one-step arithmetic operations; has some difficulty with reading and understanding information in sentences and short texts.

II. Basic Literacy: Is independently able to read and understand simple words, phrases, numbers, and quantities in English; able to locate information in short texts based on commonplace contexts and situations and enter such information into simple forms; is able to solve simple one-step problems in which the operation is stated or easily inferred; has some difficulty with drawing inferences from texts and making use of more complicated quantitative information.

III. Intermediate Literacy: Is independently able to read, understand, and use written material in English sufficiently well to locate information in denser, less commonplace texts, construct straightforward summaries, and draw simple inferences; able to make use of quantitative information when the arithmetic operation or mathematical relationship is not specified or easily inferred; able to generate written responses that demonstrate these skills; has difficulty with drawing inferences from more complex, multipart written material and with making use of quantitative information when multiple operations or complex relationships are involved.

IV. Advanced Literacy: Is independently able to read, understand, and use more complex written material in English sufficiently well to locate and integrate multiple pieces of information, perform sophisticated analytical tasks such as making systematic comparisons, draw more sophisticated inferences from that material, and can make use of quantitative information when multiple operations or more complex relationships are involved; able to generate written responses that demonstrate these skills.

NOTE: The National Assessment of Adult Literacy measures competence across a broad range of literacy development. Nonetheless, there exist meaningful distinctions in literacy outside of this range, including degrees of competence well above those described as required for "Advanced Literacy" and below what is required for "Basic Literacy." The "Below Basic Literacy" and "Advanced Literacy" levels by definition encompass all degrees of literacy below or above, respectively, those levels described in the above performance-level descriptors.

B. Subject-Area Descriptions

An individual who scores at this level independently and in English:

I. Below Basic Literacy
Prose: May be able to recognize letters but not able to consistently match sounds with letters; may be able to recognize a few common sight words; may sometimes be able to locate information in short texts when the information is easily identifiable; has difficulty reading and understanding simple sentences.
Document: May be able to recognize letters, numbers, and/or common sight words in frequently encountered contexts such as on labels or signs; may sometimes be able to follow written instructions on simple displays (e.g., charts, figures, or forms); may sometimes be able to locate easily identified information or to enter basic personal information in simple forms.
Quantitative: May be able to recognize numbers and/or locate numbers in frequently encountered contexts; may sometimes be able to perform simple arithmetic operations in commonly used formats or in simple problems when the mathematical information is very concrete and mathematical relationships are primarily additive.

II. Basic Literacy
Prose: Is able to read, understand, and locate information in short, commonplace texts when the information is easily identifiable; has difficulty using text to draw appropriate conclusions, distinguish fact from opinion or identify an implied theme or idea in a selection.
Document: Is able to read, understand, and follow instructions on simple displays; able to locate and/or enter easily identifiable information that primarily involves making a literal match of information in the question to information in the display.
Quantitative: Is able to locate and use easily identified numeric information in simple texts or displays; able to solve simple one-step problems when the arithmetic operation is specified or easily inferred, the mathematical information is familiar and relatively easy to manipulate, and mathematical relationships are primarily additive.

III. Intermediate Literacy
Prose: Is able to read and understand moderately dense, less commonplace text that may contain long paragraphs; able to summarize, make simple inferences, determine cause and effect, and recognize author's purpose; able to generate written responses that demonstrate these skills.
Document: Is able to locate information in dense, complex displays in which repeated cycling through the display is involved; able to make simple inferences about the information in the display; able to generate written responses that demonstrate these skills.
Quantitative: Is able to locate and use numeric information that is not easily identified in texts or displays; able to solve problems when the arithmetic operation is not specified or easily inferred, and mathematical information is less familiar and more difficult to manipulate.

IV. Advanced Literacy
Prose: Is able to read lengthy, complex, abstract texts that are less commonplace and may include figurative language, to synthesize information and make complex inferences; able to generate written responses that demonstrate these skills.
Document: Is able to integrate multiple pieces of information located in complex displays; able to compare and contrast information, and to analyze and synthesize information from multiple sources; able to generate written responses that demonstrate these skills.
Quantitative: Is able to locate and use numeric information in complex texts and displays; able to solve problems that involve multiple steps and multiple comparisons of displays when the operation(s) is not specified or easily inferred, the mathematical relationships are more complex, and the mathematical information is more abstract and requires more complex manipulations.

The ordered item booklets used for the second standard setting were organized in the same way as for the first standard setting, with the exception that some of the NAAL test questions were scored according to a partial credit scheme. When a partial credit scoring scheme is used, a difficulty value is estimated for both the partially correct score and the fully correct score. As a result, the test questions have to appear multiple times in the ordered item booklet, once for the difficulty value associated with partially correct and a second time for the difficulty value associated with fully correct. The ordered item booklets included the scoring rubric for determining partial credit and full credit scores.

Training procedures in September were similar to those used in July. Table leader training was held the day before the standard setting, and panelist training was held on the first day of the standard setting.

The procedures used in September were similar to those used in July, with the exception that the committee decided that all panelists in September should use the instructions for a response probability of 67 (the rationale for this decision is documented in the results section of this chapter). This meant that more typical bookmark procedures could be used for the Round 3 discussions. That is, groups of panelists usually work on the same ordered item booklet at different tables during Rounds 1 and 2 but join each other for Round 3 discussions. Therefore, in September, both tables working on the same literacy scale were merged for the Round 3 discussion.

During Round 3, panelists received data summarizing bookmark placements for the two tables combined. This included a listing of each panelist's bookmark placements and the median bookmark placements by table. In addition, the combined median scale score (based on the data from both tables) was calculated for each level, and impact data were provided about the percentages of adults who would fall into the below basic, basic, intermediate, and advanced categories if the combined median values were used as cut scores.5 Panelists from both tables discussed their reasons for choosing different bookmark placements, after which each panelist independently made a final judgment about the items that separated the basic, intermediate, and advanced literacy levels.

5. Data from the prison sample and the state samples were not ready in time for the September standard setting. Because the 2003 data file was incomplete, the 1992 data were used to generate the population proportions rather than the 2003 data.
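The booklet organization described at the start of this subsection can be made concrete with a few lines of code. The sketch below is illustrative only: the item labels, scale locations, logistic response function, and slope are assumptions rather than operational NAAL values. It simply shows how each score point of a partial-credit item contributes its own entry to the ordered item booklet, with entries sorted by the scale score at which a respondent would have a 67 percent chance of earning that score point.

```python
import math

# Hypothetical (item, score point, scale location) entries. A partial-credit item
# contributes one entry for its partial-credit score point and one for full credit,
# each with its own estimated difficulty.
item_locations = [
    ("Q01", "full",    245.0),
    ("Q02", "partial", 230.0),  # partial-credit score point of Q02
    ("Q02", "full",    280.0),  # full-credit score point of Q02
    ("Q03", "full",    210.0),
]

def rp_location(location, rp=0.67, slope=0.04):
    """Scale score at which the probability of earning the score point equals rp,
    assuming an illustrative logistic response function with the given slope."""
    return location + math.log(rp / (1.0 - rp)) / slope

# Order the score points from easiest to hardest at the chosen response probability;
# this ordering defines the pages of the ordered item booklet.
booklet = sorted(item_locations, key=lambda entry: rp_location(entry[2]))

for page, (item, score_point, location) in enumerate(booklet, start=1):
    print(f"page {page}: {item} ({score_point} credit), rp67 location = {rp_location(location):.1f}")
```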

Revising the Performance-Level Descriptions

At the conclusion of the September standard setting, 12 of the panelists were asked to stay for an extended session to write performance-level descriptions for the NAAL items. At least one member from each of the six tables participated in the extended session, and there was representation from each of the three areas of expertise (adult education, middle and high school English language arts, and industrial and organizational psychology). The 12 participants were split into 3 groups, each focusing on one of the three NAAL content areas. Panelists were instructed to review the test items that would fall into each performance level (based on the Round 3 median cut scores) and prepare more detailed versions of the performance-level descriptions, including specific examples from the stimuli and associated tasks. The revised descriptions are shown in Table 5-4.

RESULTS FROM THE STANDARD-SETTING SESSIONS

Comparison of Results from Differing Response Probability Instructions

The purpose of using the different instructions in the July session was to evaluate the extent to which the different response probability criteria influenced panelists' judgments about bookmark placements. It would be expected that panelists using the higher probability criteria would place their bookmarks earlier in the ordered item booklets, and as the probability criteria decrease, the bookmarks would be placed later in the booklet. For example, panelists working with rp50 instructions were asked to select the items that individuals at a given performance level would be expected to get right 50 percent of the time. This is a relatively low criterion for success on a test question, and, as a result, the panelist should require the test taker to get more items correct than if a higher criterion for success were used (e.g., rp67 or rp80). Therefore, for a given performance level, the bookmark placement should be in reverse order of the values of the response probability criteria: the rp80 bookmark placement should come first in the booklet, the rp67 bookmark should come next, and the rp50 bookmark should be furthest into the booklet.

Tables 5-5a, 5-5b, and 5-5c present the results from the July standard setting, respectively, for the prose, document, and quantitative areas. The first row of each table shows the median bookmark placements for basic, intermediate, and advanced based on the different response probability instructions. For example, Table 5-5a shows that the median bookmark placements for the basic performance level in prose were on item 6 under the rp80 and rp67 instructions and on item 8 under the rp50 instructions.

Ideally, panelists would compensate for the different response criteria by placing their bookmarks earlier or later in the ordered item booklet, depending on the response probability instructions.

When panelists respond to the bookmark instructions by conceptualizing a person whose skills match the performance-level descriptions, using different response probability instructions would shift their bookmark placements in such a way that the shifts compensated exactly for the differences in the translation of bookmark placements into cut scores. When panelists are informing their judgments in this way, the cut score associated with the bookmark placement would be identical under the three different response probability instructions, even though the bookmark locations would differ. As the tables show, however, this does not appear to be the case. For example, the second row of Table 5-5a shows that the median cut scores for basic were different: 226, 211, and 205.5, respectively, for rp80, rp67, and rp50.

It is not surprising that panelists fail to place bookmarks in this ideal way, for the ideal assumes prior knowledge of the likelihood that persons at each level of literacy will answer each item correctly. A more relevant issue is whether judges have a sufficient subjective understanding of probability to change bookmark placements in response to different instructions about response probabilities. Our analysis yields weak evidence in favor of the latter hypothesis.6

We conducted tests to evaluate the statistical significance of the differences in bookmark placements and in cut scores. The results indicated that, for a given literacy area and performance level, the bookmark placements were tending in the right direction but were generally not statistically significantly different under the three response probability instructions. In contrast, for a given literacy area and performance level, the differences among the cut scores were generally statistically significant. Additional details about the analyses we conducted appear in Appendix C.

Tables 5-5a, 5-5b, and 5-5c also present the means and standard deviations of the cut scores under the different response probability instructions. The standard deviations provide an estimate of the extent of variability among the panelists' judgments. Although the bookmark method does not strive for consensus among panelists, the judgments should not be widely disparate. Comparison of the standard deviations across the different response probability instructions reveals no clear pattern; that is, there is no indication that certain response probability instructions were superior to the others in terms of the variability among panelists' judgments. A more practical way to evaluate these differences is by looking at the impact data.

6. In addition, a follow-up questionnaire asked panelists what adjustments they would have made to their bookmark placements had they been instructed to use different rp criteria. For each of the three rp criteria, panelists were asked if they would have placed their bookmarks earlier or later in the ordered item booklet if they had been assigned to use a different rp instruction. Of the 37 panelists, 27 (73 percent) indicated adjustments that reflected a correct understanding of the rp instructions.
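The reason identical bookmark placements can still yield different cut scores lies in the final translation step: the cut score is the scale score at which a respondent would have the specified probability of answering the bookmarked item correctly. The sketch below illustrates that translation under an assumed logistic item response function; the item location, slope, and scale metric are hypothetical and are not the operational NALS or NAAL item parameters.

```python
import math

def cut_score(item_location, rp, slope=0.04):
    """Scale score at which the probability of answering an item at the given
    location correctly equals rp, under an assumed logistic response function
    P(correct | score) = 1 / (1 + exp(-slope * (score - location)))."""
    return item_location + math.log(rp / (1.0 - rp)) / slope

location = 250.0  # hypothetical difficulty of the bookmarked item
for rp in (0.80, 0.67, 0.50):
    print(f"rp{int(rp * 100)}: cut score = {cut_score(location, rp):.1f}")

# The same bookmarked item maps to a higher cut score under rp80 than under rp67
# or rp50, so panelists would have to move an rp80 bookmark earlier in the booklet
# to arrive at the same cut score.
```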

TABLE 5-4 Performance-Level Descriptions and Subject-Area Descriptions with Exemplar NAAL Items

A. Overall Description

Nonliterate in English
Description: May independently recognize some letters, numbers, and/or common sight words in English in frequently encountered contexts.
Sample tasks associated with level:
• Identify a simple letter, word, number, or date on a consumer food item or road sign
• Read aloud a number or date
• Identify information in a contextually-based format (e.g., locate and/or read aloud the company name provided on a bill statement)

Below Basic
Description: May independently be able to locate and make use of simple words, phrases, numbers, and quantities in short texts or displays (e.g., charts, figures, forms) in English that are based on commonplace contexts and situations; may sometimes be able to perform simple one-step arithmetic operations.
Sample tasks associated with level:
• Underline or otherwise identify a specific sentence in a government form or newspaper article
• Calculate change in a situation involving money

Basic
Description: Is independently able to read and understand simple words, phrases, numbers, and quantities in English when the information is easily identifiable with a minimal amount of distracting information; able to locate information in short texts based on commonplace contexts and situations and enter such information into simple forms; is able to solve simple one-step problems in which the operation is stated or easily inferred.
Sample tasks associated with level:
• Read a short story or newspaper article and underline or circle the sentence that answers a question (e.g., why an event occurred or what foods an athlete ate) (prose)
• Complete a telephone message slip with information about the caller (document)
• Calculate total amount of money to be deposited into a bank account (quantitative)

Intermediate
Description: Is independently able to read, understand, and use written material in English sufficiently well to locate information in denser, less commonplace texts that may contain a greater number of distractors; able to construct straightforward summaries, and draw simple inferences; able to make use of quantitative information when the arithmetic operation or mathematical relationship is not specified or easily inferred; able to generate written responses that demonstrate these skills.
Sample tasks associated with level:
• Use an almanac or other reference material to find three food sources that contain Vitamin E (prose)
• Identify information on a government form (e.g., when an employee is eligible for medical insurance) (document)
• Record several transactions on a check ledger and calculate account balance after each transaction (quantitative)

Advanced
Description: Is independently able to read, understand, and use more complex written material in English sufficiently well to locate and integrate multiple pieces of information, perform more sophisticated analytical tasks such as making systematic comparisons, draw more sophisticated inferences from that material, and can make use of quantitative information when multiple operations or more complex relationships are involved; able to generate written responses that demonstrate these skills.
Sample tasks associated with level:
• Read a newspaper article and identify the argument used by the author (prose)
• Interpret a timetable (e.g., bus, airplane, or train schedule or a television program listing) and use that information to make a decision (document)
• Calculate the tip on a restaurant food order (quantitative)

NOTE: The NAAL measures competence across a broad range of literacy development. Nonetheless, there exist meaningful distinctions in literacy outside of this range, including degrees of competence well above those described as required for "Advanced Literacy" and below what is required for "Basic Literacy." The "Below Basic Literacy" and "Advanced Literacy" levels, by definition, encompass all degrees of literacy below or above, respectively, those levels described in the above performance-level descriptors.

B. Prose Literacy Content Area

An individual who scores at this level independently and in English:

Below Basic
Description: May sometimes be able to locate information in short texts when the information is easily identifiable.
Sample of NAAL tasks associated with the level:
• Use the text of a short paragraph to answer a question where a literal match occurs (e.g., use the statement, "Terry is from Ireland" to answer the question, "What country is Terry from?")
• Underline or otherwise identify a specific sentence in a government form or newspaper article

Basic
Description: Is able to read, understand, follow directions, copy, and locate information in short, commonplace texts (e.g., simple newspaper articles, advertisements, short stories, government forms) when the information is easily identifiable with a minimal number of distractors in the main text. May be able to work with somewhat complex texts to complete a literal match of information in the question and text.*
Sample of NAAL tasks associated with the level:
• Read a short story or newspaper article and underline or circle the sentence that answers a question (e.g., why an event occurred)
• Locate specific information on a government form (e.g., definition of "blind" on a Social Security Administration informational handout)

Intermediate
Description: Is able to read and understand moderately dense, less commonplace text that may contain long paragraphs, a greater number of distractors, a higher level vocabulary, longer sentences, more complex sentence structure; able to summarize, make simple inferences, determine cause and effect, and recognize author's purpose; able to generate written responses (e.g., words, phrases, lists, sentences, short paragraphs) that demonstrate these skills.*
Sample of NAAL tasks associated with the level:
• Locate information in a short newspaper article or government form (e.g., a government form regarding Social Security benefits)
• Use an almanac or other reference material (e.g., to find three food sources that contain Vitamin E)
• Read a short poem and identify or infer the situation described by the poem
• Write a letter to a credit department informing them of an error on a bill statement

Advanced
Description: Is able to read lengthy, complex, abstract texts that are less commonplace and may include figurative language and/or unfamiliar vocabulary; able to synthesize information and make complex inferences; compare and contrast viewpoints; able to generate written responses that demonstrate these skills.*
Sample of NAAL tasks associated with the level:
• Read a newspaper article and identify the argument used by the author
• Orally summarize a short newspaper article
• Identify differences between terms found on a benefits handout (e.g., educational assistance and tuition aid benefits)
• Read a short poem and identify or infer the author's purpose
• Compare and contrast viewpoints in an editorial

*When presented with a task that measures these skills, the individual would be likely to respond correctly 2 out of 3 times.

C. Document Literacy Content Area

An individual who scores at this level independently and in English:

Below Basic
Description: May sometimes be able to follow written instructions on simple displays (e.g., charts, figures, or forms); may sometimes be able to locate easily identified information or to enter basic personal information on simple forms; may be able to sign name in right place on form.
Sample of NAAL tasks associated with the level:
• Put a signature on a government form (e.g., Social Security card)
• Read a pay stub and identify the current net pay amount

Basic
Description: Is able to read, understand, and follow one-step instructions on simple displays (e.g., government, banking, and employment application forms, short newspaper articles or advertisements, television or public transportation schedules, bar charts or circle graphs of a single variable); able to locate and/or enter easily identifiable information that primarily involves making a literal match between the question and the display.*
Sample of NAAL tasks associated with the level:
• Identify a single piece of information on a document (e.g., the time when, or room number where, a meeting will take place)
• Use a television program listing to identify a television program that airs at a specific time on a specific channel
• Record the name of caller and caller's telephone number on a message slip

Intermediate
Description: Is able to locate information in dense, complex displays (e.g., almanacs or other reference materials, maps and legends, government forms and instruction sheets, supply catalogues and product charts, more complex graphs and figures that contain trends and multiple variables) when repeated cycling or re-reading is involved; able to make simple inferences about the information displayed; able to generate written responses that demonstrate these skills.*
Sample of NAAL tasks associated with the level:
• Identify a specific location on a map
• Complete a bank deposit slip or check
• Write the shipping information on a product order form
• Make a decision based on information given in a schedule of events (e.g., television program listing)

Advanced
Description: Is able to integrate multiple pieces of information located in complex displays; able to compare and contrast information, and to analyze and synthesize information from multiple sources; able to generate written responses that demonstrate these skills.*
Sample of NAAL tasks associated with the level:
• Locate specific information in an almanac, transportation timetable, utility bill, or television program listing
• Determine the appropriateness of a product or statement based upon information given in a display
• Interpret a display that utilizes multiple variables (e.g., a chart with blood pressure, age, and physical activity), compare information from two displays of data, or transfer data from one display to another
• Use a map and follow directions to identify one or more change(s) in location
• Make a decision based on information given in a schedule of events where the time given is not written explicitly on the schedule (e.g., reader must infer that 8:15 a.m. is between 8:00 a.m. and 8:30 a.m.)

*When presented with a task that measures these skills, the individual would be likely to respond correctly 2 out of 3 times.

D. Quantitative Literacy Content Area

An individual who scores at this level independently and in English:

Below Basic
Description: May sometimes be able to perform simple arithmetic operations in commonly used formats or in simple problems when the mathematical information is very concrete and mathematical relationships are primarily additive.
Sample of NAAL tasks associated with the level:
• Calculate change in a situation involving money
• Add two numbers entered on an order form or bank deposit slip

Basic
Description: Is able to locate and use easily identified numeric information in simple texts or displays; able to solve simple one-step problems when the arithmetic operation is specified or easily inferred, the mathematical information is familiar and relatively easy to manipulate, and mathematical relationships are primarily additive.*
Sample of NAAL tasks associated with the level:
• Complete a bank deposit slip and calculate the total dollar amount of the deposit
• Compute or compare information (e.g., ticket prices to two events)

Intermediate
Description: Is able to locate numeric information that is embedded in texts or in complex displays and use that information to solve problems; is able to infer the arithmetic operation or mathematical relationship when it is not specified; is able to use fractions, decimals, or percents and to apply concepts of area and perimeter in real-life contexts.*
Sample of NAAL tasks associated with the level:
• Record several transactions on a check ledger and calculate account balance after each transaction
• Use a transportation schedule to make a decision regarding travel plans

Advanced
Description: Is able to locate and use numeric information in complex texts and displays; able to solve problems that involve multiple steps and multiple comparisons of displays when the operation(s) is/are not specified or easily inferred, the mathematical relationships are more complex, and the mathematical information is more abstract and requires more complex manipulations.*
Sample of NAAL tasks associated with the level:
• Compute amount of money needed to purchase one or more items and/or the amount of change that will be returned
• Compute and compare costs for consumer items (e.g., miles per gallon, energy efficiency rating, cost per ounce for food items)
• Use information given in a government form to compute values (e.g., monthly or annual Social Security payments)

*When presented with a task that measures these skills, the individual would be likely to respond correctly 2 out of 3 times.

The final row of Tables 5-5a, 5-5b, and 5-5c compares the percentage of the population scoring below each of the cut scores when the different response probability instructions were used. Comparison of the impact data reveals that the effects of the different response probability instructions were larger for the cut scores for the document and quantitative areas than for prose.

These findings raise several questions. First, the findings might lead one to question the credibility of the cut scores produced by the bookmark method. However, there is ample evidence that people have difficulty interpreting probabilistic information (Tversky and Kahneman, 1983). The fact that bookmark panelists have difficulties with this aspect of the procedure is not particularly surprising. In fact, the developers of the procedure appear to have anticipated this, saying "it is not reasonable to suggest that lack of understanding of the response probability criterion invalidates a cut score judgment any more than a lack of understanding of [item response theory] methods invalidates the interpretation of a test score" (Mitzel et al., 2001, p. 262).

In our opinion, the bookmark procedure had been implemented very carefully with strict attention to key factors that can affect the results (Cizek, Bunch, and Koons, 2004; Hambleton, 2001; Kane, 2001; Plake, Melican, and Mills, 1992; Raymond and Reid, 2001). The standard-setting panelists had been carefully selected and had appropriate background qualifications. The instructions to panelists were very clear, and there was ample time for clarification. Committee members and staff observing the process were impressed with how it was carried out, and the feedback from the standard-setting panelists was very positive. Kane (2001) speaks of this as "procedural evidence" in support of the appropriateness of performance standards, noting that "procedural evidence is a widely accepted basis for evaluating policy decisions" (p. 63). Thus, while the findings indicated that panelists had difficulty implementing the response probability instructions exactly as intended, we judged that this did not seem to be sufficient justification for discrediting the bookmark method entirely.

The second issue presented by the findings was that if the different response probability instructions had produced identical cut scores, it would not have mattered which response probability the committee decided to use for the bookmark procedure. However, the findings indicated that different cut scores were produced by the different instructions; hence, the committee had to select among the options for response probability values.

As discussed in Chapter 3, the choice of a response probability value involves weighing both technical and nontechnical information to make a judgment about the most appropriate value given the specific assessment context. We had hoped that the comparison of different response probability instructions would provide evidence to assist in this choice. However, none of the data suggested that one response probability value was "better" than another.

TABLE 5-5a Median Bookmark Placements and Cut Scores for the Three Response Probability (RP) Instructions in the July 2004 Standard Setting with NALS Prose Items (n = 39 items)

                                  Basic                   Intermediate            Advanced
RP instructions                   80%    67%    50%       80%    67%    50%       80%    67%    50%
Median bookmark placement         6      6      8         20     20     23.5      32     33     36.5
Median cut score                  226.0  211.0  205.5     289.0  270.0  277.0     362.0  336.0  351.5
Mean cut score                    236.2  205.3  207.2     301.0  273.0  270.9     357.3  341.6  345.8
Standard deviation                15.5   6.9    14.3      13.9   14.3   18.7      33.0   22.7   33.7
Percent below median cut score    20.6   15.5   14.1      56.2   43.6   48.2      93.9   84.5   90.9

NOTE: Number of panelists for the rp80, rp67, and rp50 conditions, respectively, were 9, 10, and 8.

TABLE 5-5b Median Bookmark Placements and Cut Scores for the Three Response Probability (RP) Instructions in the July 2004 Standard Setting with NALS Document Items (n = 71 items)

                                  Basic                   Intermediate            Advanced
RP instructions                   80%    67%    50%       80%    67%    50%       80%    67%    50%
Median bookmark placement         13     12.5   23        43.5   51     54        64     70.5   68
Median cut score                  213.0  189.0  191.0     263.5  255.0  213.5     330.0  343.5  305.0
Mean cut score                    215.3  192.3  189.5     264.1  257.3  230.2     333.1  345.1  306.7
Standard deviation                8.1    8.5    2.6       11.7   10.3   5.1       36.7   29.7   21.3
Percent below median cut score    18.2   12.0   12.4      43.6   38.0   18.6      83.9   89.8   70.2

NOTE: Number of panelists for the rp50, rp67, and rp80 conditions, respectively, were 8, 10, and 8.

TABLE 5-5c Median Bookmark Placements and Cut Scores for the Three Response Probability (RP) Instructions in the July 2004 Standard Setting with NALS Quantitative Items (n = 42 items)

                                  Basic                   Intermediate            Advanced
RP instructions                   80%    67%    50%       80%    67%    50%       80%    67%    50%
Median bookmark placement         14     10.5   11        30     25     26        39     38     39
Median cut score                  283.0  244.0  235.0     349.0  307.0  284.0     389.0  351.5  323.0
Mean cut score                    279.1  243.6  241.1     334.9  295.2  289.4     384.2  367.5  339.9
Standard deviation                11.5   29.5   12.9      23.1   17.6   8.7       26.3   39.3   22.5
Percent below median cut score    51.7   29.2   25.4      88.9   67.6   52.4      97.8   90.0   77.0

NOTE: Number of panelists for the rp50, rp67, and rp80 conditions, respectively, were 9, 10, and 9.
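The "percent below median cut score" rows are the impact data referred to in the text: the share of the weighted adult population that would fall below each candidate cut score. A minimal sketch of that calculation appears below. The cut scores are the July rp67 prose medians from Table 5-5a, but the respondent scores and survey weights are made-up values, not NALS data.

```python
# Impact data: percentage of the weighted population scoring below each cut score.
# The respondent scale scores and survey weights below are hypothetical.
respondents = [
    # (scale score, survey weight)
    (188.0, 1200.0),
    (224.0,  950.0),
    (265.0, 1100.0),
    (301.0,  870.0),
    (342.0,  640.0),
]

cut_scores = {"basic": 211.0, "intermediate": 270.0, "advanced": 336.0}

total_weight = sum(weight for _, weight in respondents)
for level, cut in cut_scores.items():
    below = sum(weight for score, weight in respondents if score < cut)
    print(f"percent below the {level} cut score: {100.0 * below / total_weight:.1f}")
```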

In follow-up debriefing sessions, panelists commented that the rp50 instructions were difficult to apply, in that it was hard to determine bookmark placement when thinking about a 50 percent chance of responding correctly. This concurs with findings from a recent study conducted in connection with standard setting on the NAEP (Williams and Schulz, 2005). As stated earlier, the developers of the bookmark method also believe this value to be conceptually difficult for panelists.

A response probability of 80 percent had been used in 1992, in part to reflect what is often considered to be mastery level in the education field. The committee debated about the appropriateness of this criterion versus the 67 percent criterion, given the purposes and uses of the assessment results. The stakes associated with the assessment are low; that is, no scores are reported for individuals, and no decisions affecting an individual are based on the results. A stringent criterion, like 80 percent, would be called for when it is important to have a high degree of certainty that the individual has truly mastered the specific content or skills, such as in licensing examinations.

A response probability of 67 percent is recommended in the literature by the developers of the bookmark procedure (Mitzel et al., 2001) and is the value generally used in practice. Since there was no evidence from our comparison of response probabilities to suggest that we should use a value other than the developers' recommendation, the committee decided to use a response probability of 67 percent for the bookmark procedure for NALS and NAAL. Therefore, all panelists in the September standard setting used this criterion. In determining the final cut scores from the bookmark procedure, we used all of the judgments from September but only the judgments from July based on the rp67 criterion.

We are aware that many in the adult education, adult literacy, and health literacy fields have grown accustomed to using the rp80 criterion in relation to NALS results, and that some may at first believe that use of a response probability of 67 constitutes "lowering the standards." We want to emphasize that this represents a fundamental, albeit not surprising, misunderstanding. Changing the response probability level does not alter the test in any way; the same content and skills are evaluated. Changing the response probability level does not alter the distributions of scores. Distributions of skills are what they are estimated to be, regardless of response probability levels. The choice of response probability levels should not in principle affect proportions of people in regions of the distribution, although some differences were apparent in our comparisons. Choice of response probability levels does affect a user's attention in terms of condensed, everyday-language conceptions of what it means to be at a level (e.g., what it means to be "proficient").

It does appear that some segments of the literacy community prefer the higher response probability value of 80 percent as a reporting and interpretive device, if for nothing other than continuity with previous literacy assessments. The response probability level of 80 percent is robust to the fact that a response probability level is mapped to a verbal expression, such as "can consistently" or "can usually" do items of a given difficulty (or worse, more simplistic interpretations, such as "can" as opposed to "cannot" do items of a given difficulty level). It is misapplying this ambiguous mapping from precise and invariant quantitative descriptions to imprecise, everyday verbal descriptions that gives the impression of lowering standards. Changing the response probability criterion in the report may be justified by the reasons discussed above, but we acknowledge that disadvantages to this recommendation include the potential for misinterpretations and a less preferable interpretation in the eyes of some segments of the user community.

In addition, use of a response probability of 67 percent for the bookmark standard-setting procedure does not preclude using a value of 80 percent in determining exemplary items for the performance levels. That is, for each of the performance levels, it is still possible to select exemplar items that demonstrate the types of questions individuals have an 80 percent chance of answering correctly. Furthermore, it is possible to select exemplary items that demonstrate other probabilities of success (67 percent, 50 percent, 35 percent, etc.). We discussed this issue in Chapter 3 and return to it in Chapter 6.
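Selecting exemplar items at a chosen probability of success is a straightforward screening step once a cut score and item parameters are in hand. The sketch below shows that step under an assumed logistic item response function; the item names, locations, slope, cut score, and tolerance are all hypothetical rather than NAAL values.

```python
import math

def p_correct(score, location, slope=0.04):
    """Probability that a person at the given scale score answers an item at the
    given location correctly, under an assumed logistic response function."""
    return 1.0 / (1.0 + math.exp(-slope * (score - location)))

# Hypothetical item locations on the reporting scale.
items = {"itemA": 210.0, "itemB": 232.0, "itemC": 265.0, "itemD": 300.0}

intermediate_cut = 265.0  # hypothetical cut score
target_rp = 0.80          # report items that people at the cut answer about 80% of the time

exemplars = [
    name for name, location in items.items()
    if abs(p_correct(intermediate_cut, location) - target_rp) <= 0.05
]
print("exemplar items near an 80 percent success probability:", exemplars)
# -> ['itemB'] with these made-up numbers
```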

Comparison of Results from the July and September Bookmark Procedure

Table 5-6 presents the median cut scores that resulted from the rp67 instructions for the July standard setting (column 1) along with the median cut scores that resulted from the September standard setting (column 2). Column 3 shows the overall median cut scores that resulted when the July and September judgments were combined, and column 5 shows the overall mean cut score. To provide a sense of the spread of panelists' judgments about the placement of the bookmarks, two measures of variability are shown. The "interquartile range" of the cut scores is shown in column 4. Whereas the median cut score represents the cut score at the 50th percentile in the distribution of panelists' judgments, the interquartile range shows the range of cut score values from the 25th percentile to the 75th percentile. Column 6 presents the standard deviation, and column 7 shows the range bounded by the mean plus and minus one standard deviation.

TABLE 5-6 Summary Statistics from the Committee's Standard Settings for Adult Literacy

Columns: (1) July median cut score(a); (2) September median cut score(b); (3) overall median cut score(c); (4) interquartile range(d); (5) overall mean cut score; (6) standard deviation; (7) mean ± one standard deviation.

Prose Literacy
(1) Basic             211    219    211    206-221    214.2    11.0    199.6-221.6
(2) Intermediate      270    281    270    264-293    275.9    16.2    254.2-286.7
(3) Advanced          336    345    345    336-366    355.6    33.5    311.9-378.8
Document Literacy
(4) Basic             189    210    203    192-210    200.1    13.4    189.8-216.6
(5) Intermediate      255    254    254    247-259    254.0     9.1    244.7-262.8
(6) Advanced          344    345    345    324-371    343.0    30.8    314.2-375.9
Quantitative Literacy
(7) Basic             244    244    244    230-245    241.3    19.7    223.8-263.3
(8) Intermediate      307    295    296    288-307    293.8    17.1    279.4-313.5
(9) Advanced          352    356    356    343-398    368.6    41.8    313.9-397.6

a. The July standard setting used the items from the 1992 NALS. The cut scores are based on the bookmark placements set by panelists using the rp67 guidelines.
b. The September standard setting used items from the 2003 NAAL. All panelists used rp67 guidelines.
c. The overall median is the median cut score when both the July rp67 and September data were combined.
d. Range of cut scores from the first quartile (first value in range) to the third quartile (second value in range).

Comparison of the medians from the July and September standard-setting sessions reveals that the September cut scores tended to be slightly higher than the July cut scores, although overall the cut scores were quite similar. The differences in median cut scores ranged from 0 to 21, with the largest difference occurring for the basic cut score for document literacy. Examination of the spread in cut scores based on the standard deviation reveals more variability in the advanced cut score than for the other performance levels. Comparison of the variability in cut scores in each literacy area shows that, for all literacy areas, the standard deviation for the advanced cut score was at least twice as large as the standard deviation for the intermediate or basic cut scores. Comparison of the variability in cut scores across literacy areas shows that, for all of the performance levels, the standard deviations for the quantitative literacy cut scores were slightly higher than for the other two sections. There was considerable discussion (and some disagreement) among the panelists about the difficulty level of the quantitative section, which probably contributed to the larger variability in these cut scores. We address this issue in more detail later in this chapter. Appendixes C and D include additional results from the bookmark standard setting.
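For readers who want to reproduce the kinds of summaries reported in Table 5-6, the sketch below computes an overall median, interquartile range, mean, standard deviation, and mean ± one standard deviation band from a pooled set of panelist cut scores. The cut-score values in the list are hypothetical; the committee's panelist-level judgments are not reproduced here, and different percentile conventions can shift the interquartile range slightly.

```python
import statistics

# Hypothetical pooled cut-score judgments (July rp67 and September panelists combined).
pooled_cut_scores = [198, 203, 206, 209, 211, 214, 219, 221, 224, 230]

overall_median = statistics.median(pooled_cut_scores)
q1, _, q3 = statistics.quantiles(pooled_cut_scores, n=4)  # 25th and 75th percentiles
mean = statistics.mean(pooled_cut_scores)
sd = statistics.stdev(pooled_cut_scores)

print(f"overall median cut score: {overall_median:.1f}")
print(f"interquartile range: {q1:.0f}-{q3:.0f}")
print(f"mean cut score: {mean:.1f}, standard deviation: {sd:.1f}")
print(f"mean ± one standard deviation: {mean - sd:.1f} to {mean + sd:.1f}")
```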

Estimating the Variability of the Cut Scores Across Judges

The Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, National Council on Measurement in Education, 1999) recommend reporting information about the amount of variation in cut scores that might be expected if the standard-setting procedure were replicated. The design of our bookmark sessions provided a means for estimating the extent to which the cut scores would be likely to vary if another standard setting was held on a different occasion with a different set of judges.

As described earlier, participants in the July and September standard-setting sessions were divided into groups, each of which focused on two of the three literacy areas. At each session, panelists worked on their first assigned literacy area during the first half of the session (which can be referred to as "Occasion 1") and their second assigned literacy area during the second half of the session (referred to as "Occasion 2"). This design for the standard setting allowed for cut score judgments to be obtained on four occasions that were essentially replications of each other: two occasions from July and two occasions from September. Thus, the four occasions can be viewed as four replications of the standard-setting procedures. The median cut score for each occasion was determined based on the panelists' Round 3 bookmark placements; these medians are shown in Table 5-7.

TABLE 5-7 Confidence Intervals for the Bookmark Cut Scores

Columns: July Occasion 1 median (n = 5); July Occasion 2 median (n = 5); September Occasion 1 median (n = 10); September Occasion 2 median (n = 10); weighted average of the medians(a); standard deviation; standard error(b); 95% confidence interval for the weighted average(c).

Prose Literacy
Basic           197.0   211.0   208.0   227.0   213.0   12.7    6.4   200.5 to 225.5
Intermediate    270.0   263.0   267.5   293.0   275.7   14.6    7.3   261.3 to 290.0
Advanced        343.0   336.0   345.0   382.5   355.7   22.6   11.3   333.5 to 377.8
Document Literacy
Basic           202.0   185.0   210.0   201.0   201.5    9.8    4.9   191.9 to 211.1
Intermediate    271.0   247.0   258.0   248.5   255.2    9.9    5.0   245.5 to 264.8
Advanced        378.0   324.0   364.5   325.0   346.8   26.6   13.3   320.7 to 372.9
Quantitative Literacy
Basic           216.0   271.0   244.0   245.0   244.2   18.7    9.4   225.9 to 262.5
Intermediate    276.0   309.0   298.5   292.0   294.3   11.7    5.9   282.8 to 305.8
Advanced        347.0   410.0   381.0   349.5   369.7   27.2   13.6   343.0 to 396.4

a. Each median was weighted by the number of panelists submitting judgments.
b. The standard error reflects the variation in cut scores across the four occasions and was calculated as the standard deviation divided by √4.
c. The confidence interval is the weighted average plus or minus the bound, where the bound was calculated as the standard score at the .05 confidence level multiplied by the standard error.

The average of these occasion medians was calculated by weighting each median by the number of panelists. The 95 percent confidence intervals for the weighted averages were computed, which indicate the range in which the cut scores would be expected to fall if the standard-setting session were repeated. For example, a replication of the standard-setting session would be likely to yield a cut score for the basic level of prose literacy in the range of 200.5 to 225.5. We revisit these confidence intervals later in the chapter when we make recommendations for the cut scores.
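The confidence-interval calculation in Table 5-7 can be reproduced from the occasion medians. The sketch below recovers the prose basic row (213.0, 12.7, 6.4, and 200.5 to 225.5). The weighted-variance formula shown is one common convention for reliability weights that matches the published values; the report itself states only that the medians were weighted by the number of panelists and that the standard error is the standard deviation divided by √4.

```python
import math

# Prose basic occasion medians from Table 5-7, with the number of panelists per occasion.
medians = [197.0, 211.0, 208.0, 227.0]
weights = [5, 5, 10, 10]

w_sum = sum(weights)
weighted_avg = sum(w * m for w, m in zip(weights, medians)) / w_sum

# Weighted variance with a reliability-weight correction (an assumption; it reproduces
# the published standard deviations).
ss = sum(w * (m - weighted_avg) ** 2 for w, m in zip(weights, medians))
variance = ss / (w_sum - sum(w * w for w in weights) / w_sum)
sd = math.sqrt(variance)

se = sd / math.sqrt(4)   # footnote b: standard deviation divided by the square root of 4
bound = 1.96 * se        # standard score at the .05 confidence level
print(f"weighted average: {weighted_avg:.1f}")                     # 213.0
print(f"standard deviation: {sd:.1f}, standard error: {se:.1f}")   # 12.7, 6.4
print(f"95% CI: {weighted_avg - bound:.1f} to {weighted_avg + bound:.1f}")  # 200.5 to 225.5
```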

CONTRASTING GROUPS STANDARD-SETTING METHOD

In a typical contrasting groups procedure, the standard-setting panelists are individuals who know the examinees firsthand in teaching, learning, or work environments. Using the performance-level descriptions, the panelists are asked to place examinees into the performance categories in which they judge the examinees belong, without reference to their actual performance on the test. Cut scores are then determined from the actual test scores attained by the examinees placed in the distinct categories. The goal is to set the cut score such that the number of misclassifications is roughly the same in both directions (Kane, 1995); that is, the cut score that minimizes the number of individuals who correctly belong in an upper group but are placed into a lower group (false negative classification errors) and likewise minimizes the number of individuals who correctly belong in a lower group but are placed into an upper group (false positive classification errors).

Because data collection procedures for NALS and NAAL guarantee the anonymity of test takers, there was no way to implement the contrasting groups method as it is typically conceived. Instead, the committee designed a variation of this procedure that utilized the information collected via the background questionnaire to form groups of test takers. For example, test takers can be separated into two distinct groups based on their responses about the amount of help they need with reading: those who report they need a lot of help with reading and those who report they do not need a lot of help. Comparison of the distribution of literacy scores for these two groups provides information that can be used in determining cut scores.

This approach, while not a true application of the contrasting groups method, seemed promising as a viable technique for generating a second set of cut scores with which to judge the reasonableness of the bookmark cut scores. This QCG method differs from a true contrasting groups approach in two key ways.

First, because it was impossible to identify and contact respondents after the fact, no panel of judges was assembled to classify individuals into the performance categories. Second, due to the nature of the background questions, the groups were not distinguished on the basis of characteristics described by the performance-level descriptions. Instead, we used background questions as proxies for the functional consequences of the literacy levels and, as described in the next section, aligned the information with the performance levels in ways that seemed plausible. We note that implementation of this procedure was limited by the available background information. In particular, there is little information on the background questionnaire that can serve as functional consequences of advanced literacy. As discussed in Chapter 4, additional background information about advanced literacy habits (e.g., number and character of books read in the past year, types of newspapers read, daily or weekly writing habits) would have helped refine the distinction between intermediate and advanced literacy skills.

Implementing the QCG Method Through Analyses with Background Data

From the set of questions available in both the NALS and NAAL background questionnaires, we identified the following variables to include in the QCG analyses: education level, occupation, two income-related variables (receiving federal assistance, receiving interest or dividend income), self-rating of reading skills, level of assistance needed with reading, and participation in reading activities (reading the newspaper, using reading at work). We examined the distribution of literacy scores for specific response options to the background questions.

The below basic and basic levels originated partly from policy distinctions about the provision of supplemental adult education services; thus, we expected the cut score between below basic and basic to be related to a recognized need for adult literacy services. Therefore, for each literacy area, the bookmark cut score between below basic and basic was compared with the QCG cut score that separated individuals with 0-8 years of formal education (i.e., no high school) and those with some high school education. To determine this QCG cut score, we examined the distributions of literacy scores for the two groups to identify the point below which most of those with 0-8 years of education scored and above which most of those with some high school scored. To accomplish this, we determined the median score (50th percentile) in each literacy area for those with no high school education and the median score (50th percentile) for those with some high school education. We then found the midpoint between these two medians (which is simply the average of the two medians).7

Table 5-8 presents this information. For example, the table shows that in 1992 the median prose score for those with no high school was 182; the corresponding median for those with some high school was 236. The midpoint between these two medians is 209. Likewise, for 2003, the median prose score for those with no high school was 159 and for those with some high school was 229. The midpoint between these two medians is 194.

We also judged that self-rating of reading skills should be related to the distinction between below basic and basic, and the key relevant contrast would be between those who say they do not read well and those who say they do read well. Following the procedures described above, for each literacy area, we determined the median score for those who reported that they do not read well (e.g., in 1992, the value for prose was 140) and those who reported that they read well (e.g., in 1992, the value for prose was 285). The midpoint between these two values is 212.5. The corresponding median prose scores for the 2003 participants were 144 for those who report they do not read well and 282 for those who report that they read well, which results in a midpoint of 213.

We then combined the cut scores suggested by these two contrasts (no high school versus some high school; do not read well versus read well) by averaging the four midpoints for the 1992 and 2003 results (209, 194, 212.5, and 213). We refer to this value as the QCG cut score. Combining the information across multiple background variables enhances the stability of the cut score estimates. Table 5-8 presents the QCG cut scores for the basic performance level for prose (207.1), document (205.1), and quantitative (209.9) literacy.

7. We could have used discriminant function analysis to determine the cut score, but under the usual normality assumption, the maximally discriminating point on the literacy scale would be the point at which equal proportions of the higher group were below and the lower group were above. Assuming common variance and normality for the two groups, this is in fact the midpoint between the two group medians (or the mean of the medians). If the two groups have different variances, the point will be higher or lower than the midpoint, in the direction of the mean of the group with the smaller variance.
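The QCG calculation just described reduces to a simple average of contrast midpoints. The sketch below reproduces the prose cut score of 207.1 for basic literacy from the group medians reported in Table 5-8 (shown below); the only assumption is the arithmetic itself: the midpoint of each pair of group medians, averaged across contrasts and survey years. Footnote 7's point can be read alongside it: under equal-variance normal distributions for the two groups, that midpoint is the score at which the two misclassification rates are equal.

```python
# Median prose scores for the contrasted groups, from Table 5-8, keyed by survey year.
contrasts = {
    "education (no high school vs. some high school)": {1992: (182, 236), 2003: (159, 229)},
    "self-rated reading (do not read well vs. read well)": {1992: (140, 285), 2003: (144, 282)},
}

midpoints = []
for name, by_year in contrasts.items():
    for year, (lower_group, upper_group) in by_year.items():
        midpoint = (lower_group + upper_group) / 2  # halfway between the group medians
        midpoints.append(midpoint)
        print(f"{year}, {name}: midpoint = {midpoint:.1f}")

qcg_cut = sum(midpoints) / len(midpoints)
print(f"QCG cut score for basic prose literacy: {qcg_cut:.1f}")  # 207.1
```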

TABLE 5-8 Comparison of Weighted Median Scaled Scores for Groups Contrasted to Determine the QCG Cut Scores for Basic Literacy

Groups Contrasted                        Weighted Median Score(a)
                                         1992      2003

Prose Literacy
Education:
  No high school                         182       159
  Some high school                       236       229
  Average of medians                     209.0     194.0
Self-perception of reading skills:
  Do not read well                       140       144
  Read well                              285       282
  Average of medians                     212.5     213.0
Contrasting groups cut score for prose: 207.1(b)

Document Literacy
Education:
  No high school                         173       160
  Some high school                       232       231
  Average of medians                     202.5     195.5
Self-perception of reading skills:
  Do not read well                       138       152
  Read well                              279       276
  Average of medians                     208.5     214.0
Contrasting groups cut score for document: 205.1

Quantitative Literacy
Education:
  No high school                         173       165
  Some high school                       233       231
  Average of medians                     203.0     198.0
Self-perception of reading skills:
  Do not read well                       138       166
  Read well                              285       288
  Average of medians                     211.5     227.0
Contrasting groups cut score for quantitative: 209.9

a. For 1992, the median scores are calculated on a sample representing the entire adult population. For 2003, the median scores are calculated on a sample that excludes respondents with no responses to literacy tasks due to various "literacy-related reasons," as determined by the interviewer. These excluded respondents correspond to roughly 2 percent of the adult population. Assuming that these respondents are at the lower end of the literacy scale (since they do not have answers for literacy-related reasons), their exclusion causes an upward bias in the calculated medians as an estimate of the true median of the full adult population. The impact of this bias on the standard setting is likely to be small for two reasons. First, a comparison of the medians for 1992 and 2003 suggests that the medians are relatively close and that the bias is probably not large. Second, the averaging procedure in the QCG calculation dilutes the effect of the biased 2003 results by averaging them with the unbiased 1992 results.
b. The cut score is the overall average of the weighted medians for the groups contrasted.

The contrast between the basic and intermediate levels was developed to reflect a recognized need for GED preparation services. Therefore, the bookmark cut score between these two performance levels was compared with the contrast between individuals without a high school diploma or GED certificate and those with a high school diploma or GED. Furthermore, because of a general policy expectation that most individuals can and should achieve a high school level education but not necessarily more, we expected the contrast between the basic and intermediate levels to be associated with a number of other indicators of unsuccessful versus successful functioning in society available on the background questionnaire, specifically the contrast between:

• Needing a lot of help with reading versus not needing a lot of help with reading.
• Never reading the newspaper versus sometimes reading the newspaper.
• Working in a job in which reading is never used versus working in a job in which reading is used.
• Receiving Aid to Families with Dependent Children or food stamps versus receiving interest or dividend income.

Following the procedures described above for the basic performance level, we determined the cut score for the contrasted groups in the above list, and Table 5-9 presents these medians for the three types of literacy. For example, the median prose score in 1992 for those with some high school was 236; the corresponding median for those with a high school diploma was 274; and the midpoint between these medians was 255. We determined the corresponding medians from the 2003 results (which were 229 for those with some high school and 262 for those with a high school diploma, yielding a midpoint of 245.5). We then averaged the midpoints resulting from the contrasts on these five variables to yield the QCG cut score. These QCG cut scores for prose (243.5), document (241.6), and quantitative (245.4) literacy areas appear in Table 5-9.

The contrast between the intermediate and advanced levels was intended to relate to pursuit of postsecondary education or entry into professional, managerial, or technical occupations.

TABLE 5-9 Comparison of Weighted Median Scaled Scores for Groups Contrasted to Determine the QCG Cut Scores for Intermediate Literacy

                                              Weighted Median Score^a
Groups Contrasted                             1992        2003

Prose Literacy
  Education:
    Some high school                          236         229
    High school diploma                       274         262
    Average of medians                        255.0       245.5
  Extent of help needed with reading:
    A lot                                     135         153
    Not a lot                                 281         277
    Average of medians                        208.0       215.0
  Read the newspaper:
    Never                                     161         173
    Sometimes, or more                        283         280
    Average of medians                        222.0       226.5
  Read at work:
    Never                                     237         222
    Sometimes, or more                        294         287
    Average of medians                        265.5       254.5
  Financial status:
    Receive federal assistance                246         241
    Receive interest, dividend income         302         296
    Average of medians                        274.0       268.5
  Contrasting groups cut score for prose: 243.5^b

Document Literacy
  Education:
    Some high school                          232         231
    High school diploma                       267         259
    Average of medians                        249.5       245.0
  Extent of help needed with reading:
    A lot                                     128         170
    Not a lot                                 275         273
    Average of medians                        201.5       221.5
  Read the newspaper:
    Never                                     154         188
    Sometimes, or more                        278         275
    Average of medians                        216.0       231.5
  Read at work:
    Never                                     237         228
    Sometimes, or more                        289         282
    Average of medians                        263.0       255.0

TABLE 5-9 Continued

                                              Weighted Median Score^a
Groups Contrasted                             1992        2003

Document Literacy (continued)
  Financial status:
    Receive federal assistance                242         240
    Have interest/dividend income             295         288
    Average of medians                        268.5       264.0
  Contrasting groups cut score for document: 241.6

Quantitative Literacy
  Education:
    Some high school                          233         231
    High school diploma                       275         270
    Average of medians                        254.0       250.5
  Extent of help needed with reading:
    A lot                                     114         162
    Not a lot                                 282         285
    Average of medians                        198.0       223.5
  Read the newspaper:
    Never                                     145         197
    Sometimes, or more                        284         287
    Average of medians                        214.5       242.0
  Read at work:
    Never                                     236         233
    Sometimes, or more                        294         294
    Average of medians                        265.0       263.5
  Financial status:
    Receive federal assistance                240         237
    Have interest/dividend income             303         305
    Average of medians                        271.5       271.0
  Contrasting groups cut score for quantitative: 245.4

^a For 1992, the median scores are calculated on a sample representing the entire adult population. For 2003, the median scores are calculated on a sample that excludes respondents with no responses to literacy tasks due to various “literacy-related reasons,” as determined by the interviewer. These excluded respondents correspond to roughly 2 percent of the adult population. Assuming that these respondents are at the lower end of the literacy scale (since they do not have answers for literacy-related reasons), their exclusion causes an upward bias in the calculated medians as an estimate of the true median of the full adult population. The impact of this bias on the standard setting is likely to be small for two reasons. First, a comparison of the medians for 1992 and 2003 suggests that the medians are relatively close and that the bias is probably not large. Second, the averaging procedure in the QCG calculation dilutes the effect of the biased 2003 results by averaging them with the unbiased 1992 results.
^b The cut score is the overall average of the weighted medians for the groups contrasted.

TABLE 5-10 Comparison of Weighted Median Scaled Scores for Groups Contrasted to Determine the QCG Cut Scores for Advanced Literacy

                                              Median Score^a
Groups Contrasted                             1992        2003

Prose Literacy
  Education:
    High school diploma                       274         262
    College degree                            327         316
    Average of medians                        300.5       289.0
  Occupational status:
    Low formal training requirements          267         261
    High formal training requirements         324         306
    Average of medians                        295.5       283.5
  Contrasting groups cut score for prose: 292.1^b

Document Literacy
  Education:
    High school diploma                       267         259
    College degree                            319         304.5
    Average of medians                        293.0       281.8
  Occupational status:
    Low formal training requirements          264         258
    High formal training requirements         315         298
    Average of medians                        289.5       278.0
  Contrasting groups cut score for document: 285.6

The contrast between the intermediate and advanced levels was intended to relate to pursuit of postsecondary education or entry into professional, managerial, or technical occupations. Therefore, the bookmark cut score between intermediate and advanced literacy was compared with the contrast between those who have a high school diploma (or GED) and those who graduated from college. We expected that completing postsecondary education would be related to occupation. Thus, for each type of literacy, we determined the median score for occupations with minimal formal training requirements (e.g., laborer, assembler, fishing, farming) and those occupations that require formal training or education (e.g., manager, professional, technician). These QCG cut scores for prose (292.1), document (285.6), and quantitative (296.1) literacy appear in Table 5-10.

TABLE 5-10 Continued

                                              Median Score^a
Groups Contrasted                             1992        2003

Quantitative Literacy
  Education:
    High school diploma                       275         270
    College degree                            326         324
    Average of medians                        300.5       297.0
  Occupational status:
    Low formal training requirements          269         267
    High formal training requirements         323         315
    Average of medians                        296.0       291.0
  Contrasting groups cut score for quantitative: 296.1

^a For 1992, the median scores are calculated on a sample representing the entire adult population. For 2003, the median scores are calculated on a sample that excludes respondents with no responses to literacy tasks due to various “literacy-related reasons,” as determined by the interviewer. These excluded respondents correspond to roughly 2 percent of the adult population. Assuming that these respondents are at the lower end of the literacy scale (since they do not have answers for literacy-related reasons), their exclusion causes an upward bias in the calculated medians as an estimate of the true median of the full adult population. The impact of this bias on the standard setting is likely to be small for two reasons. First, a comparison of the medians for 1992 and 2003 suggests that the medians are relatively close and that the bias is probably not large. Second, the averaging procedure in the QCG calculation dilutes the effect of the biased 2003 results by averaging them with the unbiased 1992 results.
^b The cut score is the overall average of the weighted medians for the groups contrasted.

In examining the relationships described above, it is important to note that for those who speak little English, the relationship between literacy levels in English and educational attainment in the home country may be skewed, since it is possible to have high levels of education from one’s home country yet not be literate in English. To see if inclusion of non-English speakers would skew the results in any way, we examined the medians for all test takers and just for English speakers. There were no meaningful differences among the resulting medians; thus we decided to report medians for the full aggregated dataset.

Procedures for Using QCG Cut Scores to Adjust Bookmark Cut Scores

Most authorities on standard setting (e.g., Green, Trimble, and Lewis, 2003; Hambleton, 1980; Jaeger, 1989; Shepard, 1980; Zieky, 2001) suggest that, when setting cut scores, it is prudent to use and compare the results from different standard-setting methods. At the same time, they acknowledge that different methods, or even the same method replicated with different panelists, are likely to produce different cut scores. This presents a dilemma to those who must make decisions about cut scores. Geisinger (1991, p. 17) captured this idea when he noted that “running a standard-setting panel is only the beginning of the standard-setting process.” At the conclusion of the standard setting, one has only proposed cut scores that must be accepted, rejected, or adjusted.

The standard-setting literature contains discussions about how to proceed with making decisions about proposed cut scores, but there do not appear to be any hard and fast rules. Several quantitative approaches have been explored. For example, in the early 1980s, two quantitative techniques were devised for “merging” results from different standard-setting procedures (Beuck, 1984; Hofstee, 1983). These methods involve obtaining additional sorts of judgments from the panelists, besides the typical standard-setting judgments, to derive the cut scores. In the Beuck technique, panelists are asked to make judgments about the optimal pass rate on the test. In the Hofstee approach, panelists are asked their opinions about the highest and lowest possible cut scores and the highest and lowest possible failing rate^8 (a rough sketch of the Hofstee calculation is shown below).

Another quantitative approach is to set reasonable ranges for the cut scores and to make adjustments within this range. One way to establish a range is by using estimates of the standard errors of the proposed cut scores (Zieky, 2001). Also, Huff (2001) described a method of triangulating results from three standard-setting procedures in which a reasonable range was determined from the results of one of the standard-setting methods. The cut scores from the two other methods fell within this range and were therefore averaged to determine the final set of cut scores.

While these techniques use quantitative information in determining final cut scores, they are not devoid of judgments (e.g., someone must decide whether a quantitative procedure should be used, which one to use and how to implement it, and so on). Like the standard-setting procedure itself, determination of final cut scores is ultimately a judgment-based task that authorities on standard setting maintain should be based on both quantitative and qualitative information.

^8 The reader is referred to the original articles or Geisinger (1991) for additional detail on how the procedures are implemented.
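The committee did not use either compromise method, but for readers unfamiliar with them, the following rough sketch shows the general shape of the Hofstee calculation. The function, the grid-search approximation of the intersection, and the simulated data are ours and are offered only as an illustration, not as an implementation from the standard-setting literature.

```python
import numpy as np

def hofstee_cut_score(scores, k_min, k_max, f_min, f_max, step=0.5):
    """Rough sketch of the Hofstee-style compromise.

    k_min, k_max : panelists' lowest/highest acceptable cut scores
    f_min, f_max : panelists' lowest/highest acceptable failing proportions (0-1)
    The compromise lies where the observed "proportion failing" curve meets the
    line running from (k_min, f_max) down to (k_max, f_min).
    """
    scores = np.asarray(scores, dtype=float)
    cuts = np.arange(k_min, k_max + step, step)
    observed_fail = np.array([(scores < c).mean() for c in cuts])
    hofstee_line = f_max + (f_min - f_max) * (cuts - k_min) / (k_max - k_min)
    # Take the candidate cut where the two curves are closest (approximates their intersection).
    return cuts[np.argmin(np.abs(observed_fail - hofstee_line))]

# Hypothetical example: simulated scale scores and invented panelist bounds.
rng = np.random.default_rng(0)
simulated_scores = rng.normal(275, 60, size=5000)
print(hofstee_cut_score(simulated_scores, k_min=240, k_max=300, f_min=0.20, f_max=0.50))
```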

For example, The Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, and the National Council on Measurement in Education, 1999, p. 54) note that determining cut scores cannot be a “purely technical matter,” indicating that they should “embody value judgments as well as technical and empirical considerations.” In his landmark article on certifying students’ competence, Jaeger (1989, p. 500) recommended considering all of the results from the standard setting together with “extra-statistical factors” to determine the final cut scores. Geisinger (1991) suggests that a panel composed of informed members of involved groups should be empowered to make decisions about final cut scores. Green et al. (2003) proposed convening a separate judgment-based procedure wherein a set of judges synthesizes the various results to determine a final set of cut scores or submitting the different sets of cut scores to a policy board (e.g., a board of education) for final determination.

As should be obvious from this discussion, there is no consensus in the measurement field about ways to determine final cut scores and no absolute guidance in the literature that the committee could rely on in making final decisions about cut scores. Using the advice that can be gleaned from the literature and guidance from the Standards that the process should be clearly documented and defensible, we developed an approach for utilizing the information from the two bookmark standard-setting sessions and the QCG procedure to develop our recommendations for final cut scores.

We judged that the cut scores resulting from the two bookmark sessions were sufficiently similar to warrant combining them, and we formed median cut scores based on the two sets of panelist judgments. Since we decided to use the cut scores from the QCG procedure solely to complement the information from the bookmark procedure, we did not want to combine these two sets of cut scores in such a way that they were accorded equal weight. There were two reasons for this. One reason, as described above, was that the background questions used for the QCG procedure were correlates of the constructs evaluated on the assessment and were not intended as direct measures of these constructs. Furthermore, as explained earlier in this chapter, the available information was not ideal and did not include questions that would be most useful in distinguishing between certain levels of literacy.

The other reason related to our judgment that the bookmark procedure had been implemented appropriately according to the guidelines documented in the literature (Hambleton, 2001; Kane, 2001; Plake, Melican, and Mills, 1992; Raymond and Reid, 2001) and that key factors had received close attention. We therefore chose to use a method for combining the results that accorded more weight to the bookmark cut scores than the QCG cut scores.

The cut scores produced by the bookmark and QCG approaches are summarized in the first two rows of Table 5-11 for each type of literacy. Comparison of these cut scores reveals that the QCG cut scores are always lower than the bookmark cut scores.

TABLE 5-11 Summary of Cut Scores Resulting from Different Procedures

                                            Basic          Intermediate   Advanced

Prose
  QCG cut score                             207.1          243.5          292.1
  Bookmark cut score                        211            270            345
  Interquartile range of bookmark cut score 206-221        264-293        336-366
  Adjusted cut scores                       211.0          267.0          340.5
  Average of cut scores                     209.1          256.8          318.6
  Confidence interval for cut scores        200.5-225.5    261.3-290.0    333.5-377.8

Document
  QCG cut score                             205.1          241.6          285.6
  Bookmark cut score                        203            254            345
  Interquartile range of bookmark cut score 192-210        247-259        324-371
  Adjusted cut scores                       203.0          250.5          334.5
  Average of cut scores                     204.1          247.8          315.3
  Confidence interval for cut scores        191.9-211.1    245.5-264.8    320.7-372.9

Quantitative
  QCG cut score                             209.9          245.4          296.1
  Bookmark cut score                        244            296            356
  Interquartile range of bookmark cut score 230-245        288-307        343-398
  Adjusted cut scores                       237.0          292.0          349.5
  Average of cut scores                     227.0          275.2          326.1
  Confidence interval for cut scores        225.9-262.5    282.8-305.8    343.0-396.4

The differences between the two sets of cut scores are smaller for the basic and intermediate performance levels for prose and document literacy, with differences ranging from 2 to 26 points. Differences between the cut scores are somewhat larger for all performance levels in the quantitative literacy area and for the advanced performance level for all three types of literacy, with differences ranging from 34 to 60 points. Overall, this comparison suggests that the bookmark cut scores should be lowered slightly.

We designed a procedure for combining the two sets of cut scores that was intended to make only minor adjustments to the bookmark cut scores, and we examined its effects on the resulting impact data. The adjustment procedure is described below, and the resulting cut scores are also presented in Table 5-11. The table also includes the cut scores that would result from averaging the bookmark and QCG cut scores; although we did not consider this averaging a viable alternative, we provide it as a comparison with the cut scores that resulted from the adjustment.

ADJUSTING THE BOOKMARK CUT SCORES

We devised a procedure for adjusting the bookmark cut scores that involved specifying a reasonable range for the cut scores and making adjustments within this range. We decided that the adjustment should keep the cut scores within the interquartile range of the bookmark cut scores (that is, the range encompassed by the 25th and 75th percentile scaled scores produced by the bookmark judgments) and used the QCG cut scores to determine the direction of the adjustment within this range. Specifically, we compared each QCG cut score to the respective interquartile range from the bookmark procedure. If the cut score lay within the interquartile range, no adjustment was made. If the cut score lay outside the interquartile range, the bookmark cut score was adjusted using the following rules:

• If the QCG cut score is lower than the lower bound of the interquartile range (i.e., lower than the 25th percentile), determine the difference between the bookmark cut score and the lower bound of the interquartile range. Reduce the bookmark cut score by half of this difference (essentially, the midpoint between the 25th and 50th percentiles of the bookmark cut scores).

• If the QCG cut score is higher than the upper bound of the interquartile range (i.e., higher than the 75th percentile), determine the difference between the bookmark cut score and the upper bound of the interquartile range. Increase the bookmark cut score by half of this difference (essentially, the midpoint between the 50th and 75th percentiles of the bookmark cut scores).

To demonstrate this procedure, the QCG cut score for the basic performance level in prose is 207.1, and the bookmark cut score is 211 (see Table 5-11). The corresponding interquartile range based on the bookmark procedure is 206 to 221. Since 207.1 falls within the interquartile range, no adjustment is made. The QCG cut score for intermediate is 243.5. Since 243.5 is lower than the 25th percentile score (interquartile range of 264 to 293), the bookmark cut score of 270 needs to be reduced. The amount of the reduction is half the difference between the bookmark cut score of 270 and the lower bound of the interquartile range (264), which is 3 points. Therefore, the bookmark cut score would be reduced from 270 to 267. Application of these rules to the remaining cut scores indicates that all of the bookmark cut scores should be adjusted except the basic cut scores for prose and document literacy. The adjusted cut scores produced by this adjustment are presented in Table 5-11.
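The adjustment rule lends itself to a compact statement in code. The following minimal sketch is ours, not something the committee produced, and the example values are simply those used in the worked example above.

```python
def adjust_bookmark_cut_score(bookmark, qcg, iqr_low, iqr_high):
    """Apply the adjustment rule described above.

    bookmark          : median bookmark cut score (50th percentile of panelist judgments)
    qcg               : quasi-contrasting-groups cut score
    iqr_low, iqr_high : 25th and 75th percentiles of the bookmark judgments
    """
    if iqr_low <= qcg <= iqr_high:
        return bookmark                                # QCG inside the interquartile range: no change
    if qcg < iqr_low:
        return bookmark - (bookmark - iqr_low) / 2     # move halfway toward the 25th percentile
    return bookmark + (iqr_high - bookmark) / 2        # move halfway toward the 75th percentile

# Worked example from the text: prose literacy, intermediate level.
print(adjust_bookmark_cut_score(bookmark=270, qcg=243.5, iqr_low=264, iqr_high=293))  # 267.0
```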

Rounding the Adjusted Cut Scores

In 1992, the test designers noted that the break points determined by the analyses that produced the performance levels did not necessarily occur at exact 50-point intervals on the scales. As we described in Chapter 3, the test designers judged that assigning the exact range of scores to each level would imply a level of precision of measurement that was inappropriate for the methodology adopted, and they therefore rounded the cut scores. In essence, this rounding procedure reflected the notion that there is a level of uncertainty associated with the specification of cut scores.

The procedures we used for the bookmark standard setting allowed determination of confidence intervals for the cut scores, which also reflect the level of uncertainty in the cut scores. Like the test designers in 1992, we judged that the cut scores should be rounded and suggest that they be rounded to multiples of five. Tables 5-12a, 5-12b, and 5-12c show, for prose, document, and quantitative literacy, respectively, the original cut scores from the bookmark procedure and the adjustment procedure after rounding to the nearest multiple of five. For comparison, the tables also present the confidence intervals for the cut scores to indicate the level of uncertainty associated with the specific cut scores.

Another consideration when making use of cut scores from different standard-setting methods is the resulting impact data; that is, the percentages of examinees who would be placed into each performance category based on the cut scores. Tables 5-12a, 5-12b, and 5-12c show the percentage of the population who scored below the rounded cut scores. Again for comparison purposes, the tables also present impact data for the confidence intervals.

Impact data were examined both for the original cut scores that resulted from the bookmark procedure and for the adjusted values of the cut scores. Comparison of the impact results based on the original and adjusted cut scores shows that the primary effect of the adjustment was to slightly lower the cut scores, more so for quantitative literacy than the other sections. A visual depiction of the differences in the percentages of adults classified into each performance level based on the two sets of cut scores is presented in Figures 5-1 through 5-6, respectively, for the prose, document, and quantitative sections. The top bar shows the percentages of adults that would be placed into each performance level based on the adjusted cut scores, and the bottom bar shows the distribution based on the original bookmark cut scores.
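The rounding convention, and the impact data that the tables report, can be illustrated with a short sketch. This is our illustration only; it assumes simple arrays of respondent scores and sampling weights rather than the NALS/NAAL data files, and the function names are ours.

```python
import numpy as np

def round_to_nearest_five(cut_score):
    """Round a cut score to the nearest multiple of five, as suggested above."""
    return 5 * round(cut_score / 5)

def percent_below(scores, weights, cut_score):
    """Weighted percentage of the population scoring below a cut score (impact data)."""
    scores, weights = np.asarray(scores), np.asarray(weights)
    return 100 * weights[scores < cut_score].sum() / weights.sum()

# The adjusted prose cut scores from Table 5-11 round to the values shown in Table 5-12a.
print([round_to_nearest_five(x) for x in (211.0, 267.0, 340.5)])  # [210, 265, 340]

# For hypothetical arrays prose_scores and prose_weights, percent_below(prose_scores,
# prose_weights, 265) would yield the kind of entry shown in the "percent below cut
# score" rows of Table 5-12a.
```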

Overall, the adjustment procedure tended to produce a distribution of participants across the performance levels that resembled the distribution produced by the original bookmark cut scores. The largest changes were in the quantitative section, in which the adjustment slightly lowered the cut scores. The result of the adjustment is a slight increase in the percentages of individuals in the basic, intermediate, and advanced categories.

In our view, the procedures used to determine the adjustment were sensible and served to align the bookmark cut scores more closely with the relevant background measures. The adjustments were relatively small and made only slight differences in the impact data. The adjusted values remained within the confidence intervals. We therefore recommend the cut scores produced by the adjustment.

RECOMMENDATION 5-1: The scale score intervals associated with each of the levels should be as shown below for prose, document, and quantitative literacy.

                  Nonliterate
                  in English    Below Basic   Basic       Intermediate   Advanced
Prose:            Took ALSA     0-209         210-264     265-339        340-500
Document:         Took ALSA     0-204         205-249     250-334        335-500
Quantitative:     Took ALSA     0-234         235-289     290-349        350-500

We remind the reader that the nonliterate in English category was intended to comprise the individuals who were not able to answer the core questions in 2003 and were given the ALSA instead of NAAL. Below basic is the lowest performance level for 1992, since the ALSA did not exist at that time.^9

^9 For the 2003 assessment, the nonliterate in English category is intended to include those who were correctly routed to ALSA based on the core questions, those who should have been routed to ALSA but were misrouted to NAAL, and those who could not participate in the literacy assessment because their literacy levels were too low. The below basic category is intended to encompass those who were correctly routed to NAAL, and they should be classified into below basic using their performance on NAAL.

DIFFICULTIES WITH THE UPPER AND LOWER ENDS OF THE SCORE SCALE

With respect to setting achievement levels on the NAAL, we found that there were significant problems at both the lower and upper ends of the literacy scale.

TABLE 5-12a Comparison of Impact Data for Prose Literacy Based on Rounded Bookmark Cut Scores, Rounded Adjusted Cut Scores, and Rounded Confidence Interval for Cut Scores

                                            Basic            Intermediate   Advanced

Rounded^a bookmark cut score                210              270            345
Percent below cut score:
  1992                                      16.5^b,c         46.8           87.4
  2003                                      15.4^c,d         46.8           88.8
Rounded^a adjusted cut score                210              265            340
Percent below cut score:
  1992                                      16.5^b,c         43.7           85.7
  2003                                      15.4^c,d         43.6           87.1
Rounded^e confidence interval               201-226          261-290        334-378
Percent below cut scores:
  1992                                      13.8-22.7^b,c    41.2-59.4      83.2-95.6
  2003                                      12.6-21.5^c,d    40.9-60.1      84.6-96.5

^a Rounded to nearest multiple of five.
^b Includes those who took NALS and scored below the cut score as well as those who were not able to participate in the assessment for literacy-related reasons (having difficulty with reading or writing or unable to communicate in English or Spanish); nonparticipants for literacy-related reasons comprised 3 percent of the sample in 1992.
^c This is an underestimate because it does not include the 1 percent of individuals who could not participate due to a mental disability such as retardation, a learning disability, or other mental/emotional conditions. An upper bound on the percent below basic could be obtained by including this percentage.
^d Includes those who took NAAL and scored below the basic cut score, those who took ALSA, and those who were not able to participate in the assessment for literacy-related reasons (having difficulty with reading or writing or unable to communicate in English or Spanish); nonparticipants for literacy-related reasons comprised 2 percent of the sample in 2003.
^e Rounded to nearest whole number.

The problems with the lower end relate to decisions about the nature of the ALSA component. ALSA was implemented as a separate low-level assessment. ALSA and NAAL items were not analyzed or calibrated together and hence were not placed on the same scale. We were therefore not able to use the ALSA items in our procedures for setting the cut scores. These decisions about the ways to process ALSA data created a de facto cut score between the nonliterate in English and below basic categories. Consequently, all test takers in 2003 who performed poorly on the initial screening questions (the core questions) and were administered ALSA are classified into the nonliterate in English category (see footnote 9).
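As a concrete reading of the category definitions in Recommendation 5-1 and footnote 9, the sketch below classifies a 2003 respondent on the prose scale. The function and its names are ours, offered only as an illustration of the recommended intervals, not as NCES scoring code.

```python
def prose_level_2003(took_alsa, prose_score=None):
    """Map a 2003 respondent to a prose performance level (intervals from Recommendation 5-1)."""
    if took_alsa:              # routed to ALSA by the core screening questions
        return "nonliterate in English"
    if prose_score <= 209:
        return "below basic"
    if prose_score <= 264:
        return "basic"
    if prose_score <= 339:
        return "intermediate"
    return "advanced"          # 340-500

print(prose_level_2003(took_alsa=False, prose_score=255))  # basic (the 210-264 interval)
```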

TABLE 5-12b Comparison of Impact Data for Document Literacy Based on Rounded Bookmark Cut Scores, Rounded Adjusted Cut Scores, and Rounded Confidence Interval for Cut Scores

                                            Basic            Intermediate   Advanced

Rounded^a bookmark cut score                205              255            345
Percent below cut score:
  1992                                      16.8^b,c         40.8           89.2
  2003                                      14.2^c,d         39.4           91.1
Rounded^a adjusted cut score                205              250            335
Percent below cut score:
  1992                                      16.8             37.8           85.8
  2003                                      14.2             36.1           87.7
Rounded^e confidence interval               192-211          246-265        321-373
Percent below cut scores:
  1992                                      12.9-18.9        35.5-47.0      79.9-95.6
  2003                                      10.5-16.3        33.7-46.0      81.6-96.9

See footnotes to Table 5-12a.

TABLE 5-12c Comparison of Impact Data for Quantitative Literacy Based on Rounded Bookmark Cut Scores, Rounded Adjusted Cut Scores, and Rounded Confidence Interval for Cut Scores

                                            Basic            Intermediate   Advanced

Rounded^a bookmark cut score                245              300            355
Percent below cut score:
  1992                                      33.3^b,c         65.1           89.3
  2003                                      27.9^c,d         61.3           88.6
Rounded^a adjusted cut score                235              290            350
Percent below cut score:
  1992                                      28.5             59.1           87.9
  2003                                      23.1             55.1           87.0
Rounded^e confidence interval               226-263          283-306        343-396
Percent below cut scores:
  1992                                      24.7-42.9        55.0-68.5      85.6-97.1
  2003                                      19.2-37.9        50.5-64.9      84.1-97.2

See footnotes to Table 5-12a.
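Figures 5-1 through 5-6 restate the cumulative "percent below cut score" entries of Tables 5-12a through 5-12c as the percentage of adults in each level. The small sketch below shows that conversion; the function is ours, and the example uses the 1992 prose values under the adjusted cut scores.

```python
def level_percentages(below_basic, below_intermediate, below_advanced):
    """Convert cumulative percent-below-cut-score values into percent in each level."""
    return {
        "below basic":  round(below_basic, 1),
        "basic":        round(below_intermediate - below_basic, 1),
        "intermediate": round(below_advanced - below_intermediate, 1),
        "advanced":     round(100 - below_advanced, 1),
    }

# 1992 prose literacy with the adjusted cut scores (Table 5-12a): 16.5, 43.7, 85.7 percent below.
print(level_percentages(16.5, 43.7, 85.7))
# {'below basic': 16.5, 'basic': 27.2, 'intermediate': 42.0, 'advanced': 14.3}  (matches Figure 5-1)
```

For 2003, the figures additionally split off the nonliterate in English group (4.7 percent of the population), so the tabled "percent below the basic cut score" corresponds to the nonliterate in English and below basic segments combined.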

FIGURE 5-1 Comparison of the percentages of adults in each performance level based on the bookmark cut scores and adjusted cut scores for 1992 prose literacy.
  Adjusted cut scores: 16.5 below basic, 27.2 basic, 42.0 intermediate, 14.3 advanced.
  Bookmark cut scores: 16.5 below basic, 30.3 basic, 40.6 intermediate, 12.6 advanced.

FIGURE 5-2 Comparison of the percentages of adults in each performance level based on the bookmark cut scores and adjusted cut scores for 2003 prose literacy.
  Adjusted cut scores: 4.7 nonliterate in English, 10.7 below basic (15.4 combined), 28.2 basic, 43.5 intermediate, 12.9 advanced.
  Bookmark cut scores: 4.7 nonliterate in English, 10.7 below basic (15.4 combined), 31.4 basic, 42.0 intermediate, 11.2 advanced.
  *The nonliterate in English category comprises 4.7% of the 2003 population. This percentage plus those in the below basic category would be equivalent to the 1992 below basic category.

FIGURE 5-3 Comparison of the percentages of adults in each performance level based on the bookmark cut scores and adjusted cut scores for 1992 document literacy.
  Adjusted cut scores: 16.8 below basic, 21.0 basic, 48.0 intermediate, 14.2 advanced.
  Bookmark cut scores: 16.8 below basic, 24.0 basic, 48.4 intermediate, 10.8 advanced.

FIGURE 5-4 Comparison of the percentages of adults in each performance level based on the bookmark cut scores and adjusted cut scores for 2003 document literacy.
  Adjusted cut scores: 4.7 nonliterate in English, 9.5 below basic (14.2 combined), 21.9 basic, 51.6 intermediate, 12.3 advanced.
  Bookmark cut scores: 4.7 nonliterate in English, 9.5 below basic (14.2 combined), 25.2 basic, 51.7 intermediate, 8.9 advanced.
  *The nonliterate in English category comprises 4.7% of the 2003 population. This percentage plus those in the below basic category would be equivalent to the 1992 below basic category.

FIGURE 5-5 Comparison of the percentages of adults in each performance level based on the bookmark cut scores and adjusted cut scores for 1992 quantitative literacy.
  Adjusted cut scores: 28.5 below basic, 30.6 basic, 28.8 intermediate, 12.1 advanced.
  Bookmark cut scores: 33.3 below basic, 31.8 basic, 24.2 intermediate, 10.7 advanced.

FIGURE 5-6 Comparison of the percentages of adults in each performance level based on the bookmark cut scores and adjusted cut scores for 2003 quantitative literacy.
  Adjusted cut scores: 4.7 nonliterate in English, 18.4 below basic (23.1 combined), 32.0 basic, 31.9 intermediate, 13.0 advanced.
  Bookmark cut scores: 4.7 nonliterate in English, 23.2 below basic (27.9 combined), 33.4 basic, 27.3 intermediate, 11.4 advanced.
  *The nonliterate in English category comprises 4.7% of the 2003 population. This percentage plus those in the below basic category would be equivalent to the 1992 below basic category.

This creates problems in making comparisons between the 1992 and 2003 data. Since ALSA was not a part of NALS in 1992, there is no way to identify the group of test takers who would have been classified into the nonliterate in English category. As a result, the below basic and nonliterate in English categories will need to be combined to examine trends between 1992 and 2003.

With regard to the upper end of the scale, we found that feedback from the bookmark panelists, combined with our review of the items, suggests that the assessment does not adequately cover the upper end of the distribution of literacy proficiency. We developed the description of this level based on what we thought was the natural progression of skills beyond the intermediate level. In devising the wording of the description, we reviewed samples of NALS items and considered the 1992 descriptions of NALS Levels 4 and 5. A number of panelists in the bookmark procedure, however, commented on the lack of difficulty represented by the items, particularly the quantitative items. A few judged that an individual at the advanced level should be able to answer all of the items correctly, which essentially means that these panelists did not set a cut score for the advanced category. We therefore conclude that the assessment is very weak at the upper end of the scale. Although there are growing concerns about readiness for college-level work and preparedness for entry into professional and technical professions, we think that NAAL, as currently designed, will not allow for detection of problems at these levels of proficiency. It is therefore with some reservations that we include the advanced category in our recommendation for performance levels, and we leave it to NCES to ultimately decide on the utility and meaning of this category.

With regard to the lower and upper ends of the score scale, we make the following recommendation:

RECOMMENDATION 5-2: Future development of NAAL should include more comprehensive coverage at the lower end of the continuum of literacy skills, including assessment of the extent to which individuals are able to recognize letters and numbers and read words and simple sentences, to allow determination of which individuals have the basic foundation skills in literacy and which individuals do not. This assessment should be part of NAAL and should yield information used in calculating scores for each of the three types of literacy. At the upper end of the continuum of literacy skills, future development of NAAL should also include assessment items necessary to identify the extent to which policy interventions are needed at the postsecondary level and above.

