5
Developing Performance-Level Descriptions and Setting Cut Scores

In this chapter, we detail the processes we used for developing descriptions of the performance levels as well as the methods we used to determine the cut scores to be associated with each of the performance levels. The performance-level descriptions were developed through an iterative process in which the descriptions evolved as we drafted wording, solicited feedback, reviewed the assessment frameworks and tasks, and made revisions. The process of determining the cut scores involved using procedures referred to as “standard setting,” which were introduced in Chapter 3.

As we noted in Chapter 3, standard setting is intrinsically judgmental. Science enters the process only as a way of ensuring the internal and external validity of informed judgments (e.g., that the instructions are clear and understood by the panelists; that the standards are statistically reliable and reasonably consistent with external data, such as levels of completed schooling). Given the judgmental nature of the task, it is not easy to develop methods and procedures that are scientifically defensible; indeed, standard-setting procedures have provoked considerable controversy (e.g., National Research Council [NRC], 1998; Hambleton et al., 2001). In developing our procedures, we have familiarized ourselves with these controversies and have relied on the substantial research base on standard setting1 and, in particular, on the research on setting achievement levels for the National Assessment of Educational Progress (NAEP).

1  While we familiarized ourselves with a good deal of this research, we do not provide an exhaustive listing of these articles and cite only the studies that are most relevant for the present project. There are several works that provide overviews of methods, their variations, and advantages and disadvantages, such as Jaeger's article in Educational Measurement (1989) and the collection of writings in Cizek's (2001b) Setting Performance Standards. We frequently refer readers to these writings because they provide a convenient and concise means for learning more about standard setting; however, we do not intend to imply that these were the only documents consulted.



NAEP's standard-setting procedures are perhaps the most intensely scrutinized procedures in existence today, having been designed, guided, and evaluated by some of the most prominent measurement experts in the country. The discussions about NAEP's procedures, both the favorable comments and the criticisms, provide guidance for those designing a standard-setting procedure. We attempted to implement procedures that reflected the best of what NAEP does and that addressed the criticisms that have been leveled against NAEP's procedures. Below we highlight the major criticisms and describe how we addressed them. We raise these issues, not to take sides on the various controversies, but to explain how we used this information to design our standard-setting methods.

NAEP has for some time used the modified Angoff method for setting cut scores, a procedure that some consider to yield defensible standards (Hambleton and Bourque, 1991; Hambleton et al., 2000; Cizek, 1993, 2001a; Kane, 1993, 1995; Mehrens, 1995; Mullins and Green, 1994) and some believe to pose an overly complex cognitive task for judges (National Research Council, 1999; Shepard, Glaser, and Linn, 1993). While the modified Angoff method is still widely used, especially for licensing and certification tests, many other methods are available. In fact, although the method is still used for setting the cut scores for NAEP's achievement levels, other methods are being explored with the assessment (Williams and Schulz, 2005). Given the unresolved controversies about the modified Angoff method, we chose not to use it. Instead, we selected a relatively new method, the bookmark standard-setting method, which appears to be growing in popularity. The bookmark method was designed specifically to reduce the cognitive complexity of the task posed to panelists (Mitzel et al., 2001). The procedure was endorsed as a promising method for use on NAEP (National Research Council, 1999) and, based on recent estimates, is used by more than half of the states in their K-12 achievement tests (Egan, 2001).
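Because the modified Angoff method is discussed in this chapter only by name, a minimal sketch of the basic Angoff computation may help readers unfamiliar with it. This illustrates the unmodified, number-correct form of the method, not NAEP's operational procedure; all numbers are hypothetical.

```python
from statistics import mean

def angoff_cut_score(ratings):
    """Basic Angoff computation: ratings[p][i] is panelist p's judged
    probability that a borderline (minimally qualified) examinee answers
    item i correctly.  The cut score on the number-correct scale is the
    average, across panelists, of each panelist's summed probabilities."""
    return mean(sum(panelist) for panelist in ratings)

# Three hypothetical panelists rating a five-item test.
ratings = [
    [0.90, 0.80, 0.60, 0.40, 0.30],
    [0.85, 0.75, 0.65, 0.50, 0.35],
    [0.95, 0.80, 0.55, 0.45, 0.25],
]
print(angoff_cut_score(ratings))  # roughly 3 of 5 items correct
```

The cognitive burden noted in the criticisms above comes from asking panelists to produce an item-by-item probability judgment of this kind; the bookmark method replaces that task with a single placement decision per performance level.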

Another issue that has been raised in relation to NAEP's standard-setting procedures is that different standard-setting methods were required for NAEP's multiple-choice and open-ended items. The use of different methods led to widely disparate cut scores, and there has been disagreement about how to resolve these differences (Hambleton et al., 2000; National Research Council, 1999; Shepard, Glaser, and Linn, 1993). An advantage of the bookmark procedure is that it is appropriate for both item types. While neither the National Adult Literacy Survey (NALS) nor the National Assessment of Adult Literacy (NAAL) uses multiple-choice items, both include open-ended items, some of which were scored as right or wrong and some of which were scored according to a partial credit scoring scheme (e.g., wrong, partially correct, fully correct). The bookmark procedure is suitable for both types of scoring schemes.

Another issue discussed in relation to NAEP's achievement-level setting was the collection of evidence used to evaluate the reasonableness of the cut scores. Concerns were expressed about the discordance between cut scores that resulted from different standard-setting methods (e.g., the modified Angoff method and the contrasting groups method yielded different cut scores for the assessment) and the effect of these differences on the percentages of students categorized into each of the achievement levels. Concerns were also expressed about whether the percentages of students in each achievement level were reasonable given other indicators of students' academic achievement in the United States (e.g., performance on the SAT, percentage of students enrolled in Advanced Placement programs), although there was considerable disagreement about the appropriateness of such comparisons. While we do not consider that our charge required us to resolve these disagreements about NAEP's cut scores, we did try to address the criticisms.

As a first step to address these concerns, we used the background data available from the assessment as a means for evaluating the reasonableness of the bookmark cut scores. To accomplish this, we developed an adapted version of the contrasting groups method, which utilizes information about examinees apart from their actual test scores. This quasi-contrasting groups (QCG) approach was not used as a strict standard-setting technique but as a means for considering adjustments to the bookmark cut scores. While validation of the recommended cut scores should be the subject of a thorough research endeavor that would be beyond the scope of the committee's charge, comparison of the cut scores to pertinent background data provides initial evidence.
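The committee's QCG procedure is described later in the chapter; as background, the following is a generic sketch of the contrasting-groups idea it adapts, not the committee's actual implementation. Examinees are split into two groups using an external indicator (the grouping variable, threshold, and data below are hypothetical), and the cut score is taken as the lowest score region in which membership in the higher group becomes more likely than membership in the lower group.

```python
def contrasting_groups_cut(scores, in_upper_group, bin_width=10):
    """Generic contrasting-groups illustration: bin the score scale and
    return the lower edge of the first bin in which more than half of the
    examinees belong to the externally defined upper group."""
    bins = {}
    for score, upper in zip(scores, in_upper_group):
        b = int(score // bin_width) * bin_width
        hits, total = bins.get(b, (0, 0))
        bins[b] = (hits + int(upper), total + 1)
    for b in sorted(bins):
        hits, total = bins[b]
        if hits / total > 0.5:
            return b
    return None

# Hypothetical data: scores paired with an external indicator
# (e.g., completed high school or not).
scores = [180, 195, 205, 215, 225, 235, 245, 255, 265, 275]
upper = [False, False, False, False, True, False, True, True, True, True]
print(contrasting_groups_cut(scores, upper))  # 220 with these toy numbers
```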

We begin our discussion with an overview of the bookmark standard-setting method and the way we implemented it. Participants in the standard settings provided feedback on the performance-level descriptions, and we present the different versions of the descriptions and explain why they were revised. The results of the standard settings appear at the end of this chapter, where we also provide a description of the adapted version of the contrasting groups procedure that we used and make our recommendations for cut scores. The material in this chapter provides an overview of the bookmark procedures and highlights the most crucial results from the standard setting; additional details about the standard setting are presented in Appendixes C and D.

THE BOOKMARK STANDARD-SETTING METHOD

Relatively new, the bookmark procedure was designed to simplify the judgmental task by asking panelists to directly set the cut scores, rather than asking them to make judgments about test questions in isolation, as in the modified Angoff method (Mitzel et al., 2001). The method has the advantage of allowing participants to focus on the content and skills assessed by the test questions rather than just on the difficulty of the questions, as panelists are given "item maps" that detail item content (Zieky, 2001). The method also provides an opportunity to revise performance-level descriptions at the completion of the standard-setting process so they are better aligned with the cut scores.

In a bookmark standard-setting procedure, test questions are presented in a booklet arranged in order from easiest to hardest according to their estimated level of difficulty, which is derived from examinees' answers to the test questions. Panelists receive a set of performance-level descriptions to use while making their judgments. They review the test questions in these booklets, called "ordered item booklets," and place a "bookmark" to demarcate the set of questions that examinees who have the skills described by a given performance level would be expected to answer correctly with a given level of accuracy. To explain, using the committee's performance-level categories, panelists would consider the description of skills associated with the basic literacy category and, for each test question, make a judgment about whether an examinee with these skills would be likely to answer the question correctly or incorrectly. Once the bookmark is placed for the first performance-level category, the panelists would proceed to consider the skills associated with the second performance-level category (intermediate) and place a second bookmark to denote the set of items that individuals who score in this category would be expected to answer correctly with a specified level of accuracy. The procedure is repeated for each of the performance-level categories.

The bookmark method requires specification of what it means to be "likely" to answer a question correctly. The designers of the method suggest that "likely" be defined as "67 percent of the time" (Mitzel et al., 2001, p. 260). This concept of "likely" is important because it is the response probability value used in calculating the difficulty of each test question (that is, the scale score associated with the item). Although a response probability of 67 percent (referred to as rp67) is common with the bookmark procedure, other values could be used, and we address this issue in more detail later in this chapter.

To demonstrate how the response probability value is used in making bookmark judgments, we rely on the performance levels that we recommended in Chapter 4. Panelists first consider the description of the basic literacy performance level and the content and skills assessed by the first question in the ordered item booklet, the easiest question in the booklet. Each panelist considers whether an individual with the skills described in the basic category would have a 67 percent chance of answering this question correctly (or, stated another way, whether an individual with the skills described in the basic category would be likely to correctly answer a question measuring these specific skills two out of three times). If a panelist judges this to be true, he or she proceeds to the next question in the booklet. This continues until the panelist comes to a question that he or she judges a basic-level examinee does not have a 67 percent chance of answering correctly (or would not be likely to answer correctly two out of three times). The panelist places his or her bookmark for the basic level on this question. The panelist then moves to the description of the intermediate level and proceeds through the ordered item booklet until reaching an item that he or she judges an individual with intermediate-level skills would not be likely to answer correctly 67 percent of the time. The intermediate-level bookmark would be placed on this item. Determination of the placement of the bookmark for the advanced level proceeds in a similar fashion.
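To make the rp67 arithmetic concrete, the sketch below shows how an item's rp67 location could be computed under a two-parameter logistic IRT model and how a bookmark placement might then translate into a cut score. The item parameters, the linear transformation to the 0-500 reporting scale, and the convention of taking the cut at the last item judged likely to be answered correctly are all simplifying assumptions for illustration; the operational NALS/NAAL scaling and the bookmark implementation used by the committee involve additional details.

```python
import math

def rp_location(a, b, rp=0.67, D=1.7):
    """Ability (theta) at which a 2PL item with discrimination `a` and
    difficulty `b` is answered correctly with probability `rp`."""
    return b + math.log(rp / (1.0 - rp)) / (D * a)

def to_scale(theta, slope=50.0, intercept=250.0):
    """Hypothetical linear transformation from theta to a 0-500 reporting scale."""
    return slope * theta + intercept

# A small, hypothetical ordered item booklet: items sorted by rp67 location.
items = [(1.0, -1.5), (0.9, -0.7), (1.1, 0.1), (1.0, 0.8), (1.2, 1.6)]
locations = sorted(to_scale(rp_location(a, b)) for a, b in items)

def bookmark_cut(bookmark_index, locations):
    """The bookmark is placed on the first item the target examinee is NOT
    judged likely (rp67) to answer correctly; here the cut score is taken as
    the rp67 location of the preceding item (conventions vary on this point)."""
    return locations[bookmark_index - 2]  # bookmark_index is 1-based

print(bookmark_cut(4, locations))  # cut at the 3rd item's rp67 location
```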

Panelists sit at a table with four or five other individuals who are all working with the same set of items, and the bookmark standard-setting procedure is implemented in an iterative fashion. There are three opportunities, or rounds, for panelists to decide where to place their bookmarks. Panelists make their individual decisions about bookmark placements during Round 1, with no input from other panelists. Afterward, panelists seated at the same table compare and discuss their ratings and then make a second set of judgments as part of Round 2. As part of the bookmark process, panelists discuss their bookmark placements, and agreement about the placements is encouraged. Panelists are not required to come to consensus about the placement of bookmarks, however. After Round 2, bookmark placements are transformed to test scale scores, and the median scale score is determined for each performance level. At this stage, the medians are calculated by considering the bookmark placements for all panelists who are working on a given test booklet (e.g., all panelists at all tables who are working on the prose ordered item booklet). Panelists are usually provided with information about the percentage of test takers whose scores would fall into each performance-level category based on these medians. This feedback is referred to as "impact data" and serves as a reality check to allow panelists to adjust and fine-tune their judgments.

Usually, all the panelists working on a given ordered item booklet assemble and review the bookmark placements, the resulting median scale scores, and the impact data together. Panelists then make a final set of judgments during Round 3, working individually at their respective tables. The median scale scores are recalculated after the Round 3 judgments are made. Usually, mean scale scores are also calculated, and the variability in panelists' judgments is examined to evaluate the extent to which they disagree about bookmark placements. At the conclusion of the standard setting, it is customary to allot time for panelists to discuss and write performance-level descriptions for the items reviewed during the standard setting.

Committee's Approach with the Bookmark Method

The committee conducted two bookmark standard-setting sessions, one in July 2004 with data from the 1992 NALS and one in September 2004 with data from the 2003 NAAL. This allowed us to use two different groups of panelists, to try out our procedures with the 1992 data and then make corrections (as needed) before the standard setting with the 2003 data was conducted, and to develop performance-level descriptions that would generalize to both versions of the assessment. Richard Patz, one of the developers of the bookmark method, served as consultant to the committee and led the standard-setting sessions. Three additional consultants and National Research Council project staff assisted with the sessions, and several committee members observed the sessions. The agendas for the two standard-setting sessions appear in Appendixes C and D.

Because the issue of response probability had received so much attention in relation to NALS results (see Chapter 3), we arranged to collect data from panelists about the impact of using different instructions about response probabilities. This data collection was conducted during the July standard setting with the 1992 data and is described in the section of this chapter called "Bookmark Standard Setting with 1992 Data."

The standard-setting sessions were organized to provide an opportunity to obtain feedback on the performance-level descriptions. During the July session, time was provided for the panelists to suggest changes in the descriptions based on the placement of their bookmarks after the Round 3 judgments had been made. The committee reviewed their feedback, refined the descriptions, and in August invited several of the July panelists to review the revised descriptions. The descriptions were again refined, and a revised version was prepared for the September standard setting. An extended feedback session was held at the conclusion of the September standard setting to finalize the descriptions.

The July and September bookmark procedures were implemented in relation to the top four performance levels only: below basic, basic, intermediate, and advanced. This was a consequence of a decision made by the Department of Education during the development of NAAL. As mentioned in Chapter 2, in 1992, a significant number of people were unable to complete any of the NALS items and therefore produced test results that were clearly low but essentially unscorable. Rather than expanding the coverage of NAAL into low levels of literacy at the letter, word, and simple sentence level, the National Center for Education Statistics (NCES) chose to develop a separate low-level assessment, the Adult Literacy Supplemental Assessment (ALSA). ALSA items were not put on the same scale as the NAAL items or classified into the three literacy areas. As a result, we could not use the ALSA questions in the bookmark procedure. This created a de facto cut score between the nonliterate in English and below basic performance levels. Consequently, all test takers who performed poorly on the initial screening questions (the core questions) and were administered ALSA are classified into the nonliterate in English category.2

2  Some potential test takers were not able to participate due to various literacy-related reasons, as determined by the interviewer, and are also classified as nonliterate in English. These nonparticipants include individuals who have difficulty with reading or writing or who are not able to communicate in English or Spanish. Another group of individuals who were not able to participate are those with a mental disability, such as retardation, a learning disability, or other mental or emotional conditions. Given the likely wide variation in literacy skills of individuals in this group, these individuals are treated as nonparticipants and are not included in the nonliterate in English category. Since some of these individuals are likely to have low literacy skills, however, an upper bound on the size of the nonliterate in English category could be obtained by including these individuals in the nonliterate in English category.

As a result, the performance-level descriptions used for the bookmark procedures included only the top four levels, and the skills evaluated on ALSA were incorporated into the below basic description. After the standard settings, the performance-level description for the below basic category was revised, and the nonliterate in English category was formulated. The below basic description was split to separate the skills that individuals who took ALSA would be likely to have from the skills that individuals who were administered NAAL, but who were not able to answer enough questions correctly to reach the basic level, would be likely to have.
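The routing and classification logic just described, including the treatment of nonparticipants in footnote 2, can be summarized schematically as follows. The function, argument names, and placeholders are hypothetical; the actual processing rules are those described in this chapter and its footnotes.

```python
def classify_2003(took_alsa, nonparticipant_reason, naal_score, basic_cut):
    """Schematic classification of a 2003 respondent (illustration only).

    took_alsa: respondent failed the core screening questions and was given ALSA.
    nonparticipant_reason: None, "literacy-related", or "mental disability".
    naal_score: scale score for respondents who took NAAL (None otherwise).
    basic_cut: the basic cut score for the relevant literacy scale.
    """
    if nonparticipant_reason == "literacy-related":
        return "nonliterate in English"          # per footnote 2
    if nonparticipant_reason == "mental disability":
        return "nonparticipant (excluded)"       # not counted, per footnote 2
    if took_alsa:
        return "nonliterate in English"          # the de facto cut described above
    if naal_score < basic_cut:
        return "below basic"
    return "basic or above (apply the remaining cut scores)"
```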

Initially, the committee hoped to consolidate prose, document, and quantitative items into a single ordered item booklet for the bookmark standard setting, which would have produced cut scores for an overall, combined literacy scale. This was not possible, however, because of an operational decision made by NCES and its contractors to scale the test items separately by literacy area. That is, the difficulty level of each item was determined separately for prose, document, and quantitative items. This means that it was impossible to determine, for example, whether a given prose item was harder or easier than a given document item. This decision appears to have been based on the assumption that the three scales measure different dimensions of literacy and that it would be inappropriate to combine them into a single scale. Regardless of the rationale for the decision, it precluded our setting an overall cut score.

Participants in the Bookmark Standard Settings

Selecting Panelists

Research and experience suggest that the background and expertise the panelists bring to the standard-setting activity are factors that influence the cut score decisions (Cizek, 2001a; Hambleton, 2001; Jaeger, 1989, 1991; Raymond and Reid, 2001). Furthermore, the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 1999) specify that panelists should be highly knowledgeable about the domain in which judgments are required and familiar with the population of test takers. We therefore set up a procedure to solicit recommendations for potential panelists for both standard-setting sessions, review their credentials, and invite those with appropriate expertise to participate. Our goal was to assemble a group of panelists who were knowledgeable about the acquisition of literacy skills, had an understanding of the literacy demands placed on adults in this country and the strategies adults use when presented with a literacy task, had some background in standardized testing, and would be expected to understand and correctly implement the standard-setting tasks.

Solicitations for panelists were sent to a variety of individuals: stakeholders who participated in the committee's public forum, state directors of adult education programs, directors of boards of adult education organizations, directors of boards of professional organizations for curriculum and instruction of adult education programs, and officials with the Council for Applied Linguistics, the National Council of Teachers of English, and the National Council of Teachers of Mathematics. The committee also solicited recommendations from state and federal correctional institutions as well as from the university community for researchers in the areas of workplace, family, and health literacy. Careful attention was paid to including representatives from as many states as possible, including representatives from the six states that subsidized additional testing of adults in 2003 (Kentucky, Maryland, Massachusetts, Missouri, New York, and Oklahoma).

The result of this extensive networking process was a panel of professionals who represented adult education programs in urban, suburban, and rural geographic areas and a mix of practitioners, including teachers, tutors, coordinators, and directors. Almost all of the panelists had participated at some point in a range-finding or standard-setting activity, which helped them understand the connection between the performance-level descriptions and the task of determining an appropriate cut score.

Panelists' Areas of Expertise

Because NALS and NAAL are assessments of adult literacy, we first selected panelists with expertise in the fields of adult education and adult literacy. Adult educators may specialize in curriculum and instruction of adult basic education (ABE) skills, preparation of students for the general educational development (GED) certificate, or English for speakers of other languages. In addition, adult education and adult literacy professionals put forth significant curricular, instructional, and research efforts in the areas of workplace literacy, family literacy, and health literacy. Expertise in all of these areas was represented among the panelists.3

3  We note that we considered including college faculty as panelists, as they would have brought a different perspective to the standard setting. In the end, we were somewhat concerned about their familiarity with adults with lower literacy skills and thought that it would be difficult for those who primarily work in college settings to make judgments about the skills of adults who would be classified at the levels below intermediate. There was a limit to the number of panelists we could include, and we tried to include those with experience working with adults whose skills fell at the levels primarily assessed on NALS and NAAL.

For the July standard setting, only individuals working in adult education and adult literacy were selected to participate. Based on panelist feedback following this standard setting, we decided to broaden the areas of expertise for the September standard setting. Specifically, panelists indicated they would have valued additional perspectives from individuals in areas affected by adult education services, such as human resource management, as well as from teachers who work with middle school and high school students. Therefore, for the second session, we selected panelists from two additional fields: (1) middle or high school language arts teachers and (2) industrial and organizational psychologists who specialize in skill profiling or employee assessment for job placement. The language arts classroom teachers broadened the standard-setting discussions by providing input on literacy instruction for adolescents who were progressing through the grades in a relatively typical manner, whereas teachers of ABE or GED had experience working with adults who, for whatever reason, did not acquire the literacy skills attained by most students who complete the U.S. school system.

The industrial and organizational psychologists who participated came from academia and corporate environments and brought a research focus and a practitioner perspective to the discussion that complemented those of the other panelists, who were primarily immersed in the adult education field. Table 5-1 gives a profile of the panelists who participated in the two standard-setting sessions.

TABLE 5-1 Profile of Panelists Involved in the Committee's Standard Settings

Participant Characteristics                         July Standard Setting    September Standard Setting
                                                    (N = 42)                 (N = 30)
Gender
  Female                                            83(a)                    77
  Male                                              17                       23
Ethnicity
  Black                                             2                        7
  Caucasian                                         69                       83
  Hispanic                                          0                        3
  Native American                                   2                        0
  Not reported                                      26                       7
Geographic Region(b)
  Midwest                                           26                       37
  Northeast                                         33                       23
  South                                             7                        13
  Southeast                                         19                       7
  West                                              14                       20
Occupation(c)
  University instructors                            7                        10
  Middle school, high school, or adult
    education instructors                           19                       30
  Program coordinators or directors                 38                       40
  Researchers                                       12                       7
  State office of adult education representative    24                       13

BOOKMARK STANDARD SETTING WITH 1992 DATA

The first standard-setting session was held to obtain panelists' judgments about cut scores for the 1992 NALS and to collect their feedback about the performance-level descriptions. A total of 42 panelists participated in the session. Panelists were assigned to groups, and each group was randomly assigned to two of the three literacy areas (prose, document, or quantitative). Group 1 worked with the prose and document items; Group 2 worked with the prose and quantitative items; and Group 3 worked with the document and quantitative items. The sequence in which they worked on the different literacy scales was alternated in an attempt to balance any potential order effects.

For each literacy area, an ordered item booklet was prepared that rank-ordered the test questions from least to most difficult according to NALS examinees' responses. The ordered item booklets consisted of all the available NALS tasks for a given literacy area, even though, with the balanced incomplete block spiraling (see Chapter 2), no individual actually responded to all test questions. The number of items in each NALS ordered item booklet was 39 for prose literacy, 71 for document literacy, and 42 for quantitative literacy.

Two training sessions were held, one for the "table leaders," the individuals assigned to be discussion facilitators for the tables of panelists, and one for all panelists. The role of the table leader was to serve as a discussion facilitator but not to dominate the discussion or to try to bring the tablemates to consensus about cut scores.

The bookmark process began by having each panelist respond to all the questions in the NALS test booklet for their assigned literacy scale. For this task, the test booklets contained the full complement of NALS items for each literacy scale, arranged in the order test takers would see them but not rank-ordered as in the ordered item booklets. Afterward, the table leader facilitated discussion of differences among items with respect to the knowledge, skills, and competencies required and what was measured by the scoring rubrics.

Panelists then received the ordered item booklets. They discussed each item and noted characteristics they thought made one item more difficult than another.

Each table member then individually placed their Round 1 bookmarks representing cut points for basic, intermediate, and advanced literacy. In preparation for Round 2, each table received a summary of the Round 1 bookmark placements made by each table member and was provided the medians of the bookmark placements (calculated for each table). Table leaders facilitated discussion among table members about their respective bookmark placements, and panelists were then asked to independently make their Round 2 judgments.

In preparation for Round 3, each table received a summary of the Round 2 bookmark placements made by each table member as well as the medians for the table. In addition, each table received information about the proportion of the 1992 population who would have been categorized as having below basic, basic, intermediate, or advanced literacy based on the Round 2 bookmark placements.
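The following sketch shows one way the feedback given to panelists between rounds could be computed: median cut scores from the bookmark placements and impact data from the weighted 1992 score distribution. The variable names, toy data, and simple weighting are illustrative assumptions; the operational analyses used the assessment's own scaling and weighting procedures.

```python
from statistics import median

def median_cut_score(placements, item_locations):
    """Median cut score implied by panelists' bookmark placements (1-based
    positions in the ordered item booklet), given each item's rp67 scale
    location.  For simplicity, the bookmarked item's own location is used."""
    return median(item_locations[p - 1] for p in placements)

def impact(cut_scores, scale_scores, weights):
    """Impact data: weighted percentage of examinees scoring below each cut."""
    total = sum(weights)
    return {name: round(100.0 * sum(w for s, w in zip(scale_scores, weights) if s < cut) / total, 1)
            for name, cut in cut_scores.items()}

# Toy example: five items, six panelists' basic-level placements, ten examinees.
item_locations = [190, 210, 230, 255, 280]
basic_cut = median_cut_score([2, 3, 3, 2, 4, 3], item_locations)
cuts = {"basic": basic_cut, "intermediate": 255, "advanced": 280}
scores = [170, 185, 200, 215, 228, 240, 252, 263, 275, 295]
weights = [1.0] * len(scores)
print(basic_cut, impact(cuts, scores, weights))
```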

Authorities on standard setting acknowledge that different methods, or even the same method replicated with different panelists, are likely to produce different cut scores. This presents a dilemma to those who must make decisions about cut scores. Geisinger (1991, p. 17) captured this idea when he noted that "running a standard-setting panel is only the beginning of the standard-setting process." At the conclusion of the standard setting, one has only proposed cut scores that must be accepted, rejected, or adjusted. The standard-setting literature contains discussions about how to proceed with making decisions about proposed cut scores, but there do not appear to be any hard and fast rules.

Several quantitative approaches have been explored. For example, in the early 1980s, two quantitative techniques were devised for "merging" results from different standard-setting procedures (Beuck, 1984; Hofstee, 1983). These methods involve obtaining additional sorts of judgments from the panelists, besides the typical standard-setting judgments, to derive the cut scores. In the Beuck technique, panelists are asked to make judgments about the optimal pass rate on the test. In the Hofstee approach, panelists are asked their opinions about the highest and lowest possible cut scores and the highest and lowest possible failing rate.8 Another quantitative approach is to set reasonable ranges for the cut scores and to make adjustments within this range. One way to establish a range is by using estimates of the standard errors of the proposed cut scores (Zieky, 2001). Also, Huff (2001) described a method of triangulating results from three standard-setting procedures in which a reasonable range was determined from the results of one of the standard-setting methods. The cut scores from the two other methods fell within this range and were therefore averaged to determine the final set of cut scores.

8  The reader is referred to the original articles or Geisinger (1991) for additional detail on how the procedures are implemented.

While these techniques use quantitative information in determining final cut scores, they are not devoid of judgments (e.g., someone must decide whether a quantitative procedure should be used, which one to use and how to implement it, and so on). Like the standard-setting procedure itself, determination of final cut scores is ultimately a judgment-based task that authorities on standard setting maintain should be based on both quantitative and qualitative information. For example, the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, and the National Council on Measurement in Education, 1999, p. 54) note that determining cut scores cannot be a "purely technical matter," indicating that they should "embody value judgments as well as technical and empirical considerations." In his landmark article on certifying students' competence, Jaeger (1989, p. 500) recommended considering all of the results from the standard setting together with "extra-statistical factors" to determine the final cut scores. Geisinger (1991) suggests that a panel composed of informed members of involved groups should be empowered to make decisions about final cut scores. Green et al. (2003) proposed convening a separate judgment-based procedure wherein a set of judges synthesizes the various results to determine a final set of cut scores or submitting the different sets of cut scores to a policy board (e.g., a board of education) for final determination.

As should be obvious from this discussion, there is no consensus in the measurement field about ways to determine final cut scores and no absolute guidance in the literature that the committee could rely on in making final decisions about cut scores. Using the advice that can be gleaned from the literature and guidance from the Standards that the process should be clearly documented and defensible, we developed an approach for utilizing the information from the two bookmark standard-setting sessions and the QCG procedure to develop our recommendations for final cut scores. We judged that the cut scores resulting from the two bookmark sessions were sufficiently similar to warrant combining them, and we formed median cut scores based on the two sets of panelist judgments.

Since we decided to use the cut scores from the QCG procedure solely to complement the information from the bookmark procedure, we did not want to combine these two sets of cut scores in such a way that they were accorded equal weight. There were two reasons for this. One reason, as described above, was that the background questions used for the QCG procedure were correlates of the constructs evaluated on the assessment and were not intended as direct measures of these constructs. Furthermore, as explained earlier in this chapter, the available information was not ideal and did not include questions that would be most useful in distinguishing between certain levels of literacy. The other reason related to our judgment that the bookmark procedure had been implemented appropriately according to the guidelines documented in the literature (Hambleton, 2001; Kane, 2001; Plake, Melican, and Mills, 1992; Raymond and Reid, 2001) and that key factors had received close attention. We therefore chose to use a method for combining the results that accorded more weight to the bookmark cut scores than to the QCG cut scores.

The cut scores produced by the bookmark and QCG approaches are summarized in the first two rows of Table 5-11 for each type of literacy. Comparison of these cut scores reveals that the QCG cut scores are always lower than the bookmark cut scores.

TABLE 5-11 Summary of Cut Scores Resulting from Different Procedures

                                               Basic         Intermediate   Advanced
Prose
  QCG cut score                                207.1         243.5          292.1
  Bookmark cut score                           211           270            345
  Interquartile range of bookmark cut score    206-221       264-293        336-366
  Adjusted cut scores                          211.0         267.0          340.5
  Average of cut scores                        209.1         256.8          318.6
  Confidence interval for cut scores           200.5-225.5   261.3-290.0    333.5-377.8
Document
  QCG cut score                                205.1         241.6          285.6
  Bookmark cut score                           203           254            345
  Interquartile range of bookmark cut score    192-210       247-259        324-371
  Adjusted cut scores                          203.0         250.5          334.5
  Average of cut scores                        204.1         247.8          315.3
  Confidence interval for cut scores           191.9-211.1   245.5-264.8    320.7-372.9
Quantitative
  QCG cut score                                209.9         245.4          296.1
  Bookmark cut score                           244           296            356
  Interquartile range of bookmark cut score    230-245       288-307        343-398
  Adjusted cut scores                          237.0         292.0          349.5
  Average of cut scores                        227.0         275.2          326.1
  Confidence interval for cut scores           225.9-262.5   282.8-305.8    343.0-396.4

The differences between the two sets of cut scores are smaller for the basic and intermediate performance levels for prose and document literacy, with differences ranging from 2 to 26 points. Differences between the cut scores are somewhat larger for all performance levels in the quantitative literacy area and for the advanced performance level for all three types of literacy, with differences ranging from 34 to 60 points. Overall, this comparison suggests that the bookmark cut scores should be lowered slightly.

We designed a procedure for combining the two sets of cut scores that was intended to make only minor adjustments to the bookmark cut scores, and we examined its effects on the resulting impact data. The adjustment procedure is described below, and the resulting cut scores are also presented in Table 5-11. The table also includes the cut scores that would result from averaging the bookmark and QCG cut scores; although we did not consider this a viable alternative, we provide it as a comparison with the cut scores that resulted from the adjustment.

ADJUSTING THE BOOKMARK CUT SCORES

We devised a procedure for adjusting the bookmark cut scores that involved specifying a reasonable range for the cut scores and making adjustments within this range. We decided that the adjustment should keep the cut scores within the interquartile range of the bookmark cut scores (that is, the range encompassed by the 25th and 75th percentile scale scores produced by the bookmark judgments) and used the QCG cut scores to determine the direction of the adjustment within this range. Specifically, we compared each QCG cut score to the respective interquartile range from the bookmark procedure. If the cut score lay within the interquartile range, no adjustment was made. If the cut score lay outside the interquartile range, the bookmark cut score was adjusted using the following rules:

If the QCG cut score is lower than the lower bound of the interquartile range (i.e., lower than the 25th percentile), determine the difference between the bookmark cut score and the lower bound of the interquartile range. Reduce the bookmark cut score by half of this difference (essentially, the midpoint between the 25th and 50th percentiles of the bookmark cut scores).

If the QCG cut score is higher than the upper bound of the interquartile range (i.e., higher than the 75th percentile), determine the difference between the bookmark cut score and the upper bound of the interquartile range. Increase the bookmark cut score by half of this difference (essentially, the midpoint between the 50th and 75th percentiles of the bookmark cut scores).

To demonstrate this procedure, the QCG cut score for the basic performance level in prose is 207.1, and the bookmark cut score is 211 (see Table 5-11). The corresponding interquartile range based on the bookmark procedure is 206 to 221. Since 207.1 falls within the interquartile range, no adjustment is made. The QCG cut score for intermediate is 243.5. Since 243.5 is lower than the 25th percentile score (interquartile range of 264 to 293), the bookmark cut score of 270 needs to be reduced. The amount of the reduction is half the difference between the bookmark cut score of 270 and the lower bound of the interquartile range (264), which is 3 points. Therefore, the bookmark cut score would be reduced from 270 to 267. Application of these rules to the remaining cut scores indicates that all of the bookmark cut scores should be adjusted except the basic cut scores for prose and document literacy. The cut scores produced by this adjustment are presented in Table 5-11.
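The adjustment rules above reduce to a simple computation, sketched below. The worked values in the assertions are taken from Table 5-11 (prose literacy); the function itself is our restatement of the rules, not code used by the committee.

```python
def adjust_cut_score(bookmark, qcg, q25, q75):
    """Leave the bookmark cut score unchanged when the QCG cut score falls
    inside the interquartile range of the bookmark judgments; otherwise move
    the bookmark cut score halfway toward the nearer bound of that range."""
    if qcg < q25:
        return bookmark - (bookmark - q25) / 2.0
    if qcg > q75:
        return bookmark + (q75 - bookmark) / 2.0
    return float(bookmark)

# Worked examples from Table 5-11, prose literacy.
assert adjust_cut_score(211, 207.1, 206, 221) == 211.0   # QCG inside IQR: no change
assert adjust_cut_score(270, 243.5, 264, 293) == 267.0   # QCG below IQR: lowered by 3
assert adjust_cut_score(345, 292.1, 336, 366) == 340.5   # QCG below IQR: lowered by 4.5
```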

Rounding the Adjusted Cut Scores

In 1992, the test designers noted that the break points determined by the analyses that produced the performance levels did not necessarily occur at exact 50-point intervals on the scales. As we described in Chapter 3, the test designers judged that assigning the exact range of scores to each level would imply a level of precision of measurement that was inappropriate for the methodology adopted, and they therefore rounded the cut scores. In essence, this rounding procedure reflected the notion that there is a level of uncertainty associated with the specification of cut scores. The procedures we used for the bookmark standard setting allowed determination of confidence intervals for the cut scores, which also reflect the level of uncertainty in the cut scores. Like the test designers in 1992, we judged that the cut scores should be rounded and suggest that they be rounded to multiples of five.

Tables 5-12a, 5-12b, and 5-12c show, for prose, document, and quantitative literacy, respectively, the original cut scores from the bookmark procedure and the cut scores from the adjustment procedure, after rounding to the nearest multiple of five. For comparison, the tables also present the confidence intervals for the cut scores to indicate the level of uncertainty associated with the specific cut scores.

Another consideration when making use of cut scores from different standard-setting methods is the resulting impact data; that is, the percentages of examinees who would be placed into each performance category based on the cut scores. Tables 5-12a, 5-12b, and 5-12c show the percentage of the population who scored below the rounded cut scores. Again for comparison purposes, the tables also present impact data for the confidence intervals. Impact data were examined both for the original cut scores that resulted from the bookmark procedure and for the adjusted values of the cut scores.

Comparison of the impact results based on the original and adjusted cut scores shows that the primary effect of the adjustment was to slightly lower the cut scores, more so for quantitative literacy than for the other sections. A visual depiction of the differences in the percentages of adults classified into each performance level based on the two sets of cut scores is presented in Figures 5-1 through 5-6 for the prose, document, and quantitative sections. In each figure, the top bar shows the percentages of adults that would be placed into each performance level based on the adjusted cut scores, and the bottom bar shows the distribution based on the original bookmark cut scores.

Overall, the adjustment procedure tended to produce a distribution of participants across the performance levels that resembled the distribution produced by the original bookmark cut scores. The largest changes were in the quantitative section, in which the adjustment slightly lowered the cut scores. The result of the adjustment is a slight increase in the percentages of individuals in the basic, intermediate, and advanced categories.
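A short sketch of the rounding rule follows, with a few values checked against Tables 5-11 and 5-12a; the helper function is ours, written only to make the rule explicit.

```python
def round_to_five(x):
    """Round a cut score to the nearest multiple of five."""
    return 5 * round(x / 5)

# Adjusted prose cut scores from Table 5-11 and their rounded values in Table 5-12a.
assert round_to_five(211.0) == 210
assert round_to_five(267.0) == 265
assert round_to_five(340.5) == 340
```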

In our view, the procedures used to determine the adjustment were sensible and served to align the bookmark cut scores more closely with the relevant background measures. The adjustments were relatively small and made only slight differences in the impact data. The adjusted values remained within the confidence intervals. We therefore recommend the cut scores produced by the adjustment.

RECOMMENDATION 5-1: The scale score intervals associated with each of the levels should be as shown below for prose, document, and quantitative literacy.

                 Nonliterate
                 in English    Below Basic   Basic      Intermediate   Advanced
Prose            Took ALSA     0-209         210-264    265-339        340-500
Document         Took ALSA     0-204         205-249    250-334        335-500
Quantitative     Took ALSA     0-234         235-289    290-349        350-500

We remind the reader that the nonliterate in English category was intended to comprise the individuals who were not able to answer the core questions in 2003 and were given the ALSA instead of NAAL. Below basic is the lowest performance level for 1992, since the ALSA did not exist at that time.9

9  For the 2003 assessment, the nonliterate in English category is intended to include those who were correctly routed to ALSA based on the core questions, those who should have been routed to ALSA but were misrouted to NAAL, and those who could not participate in the literacy assessment because their literacy levels were too low. The below basic category is intended to encompass those who were correctly routed to NAAL, and they should be classified into below basic using their performance on NAAL.

DIFFICULTIES WITH THE UPPER AND LOWER ENDS OF THE SCORE SCALE

With respect to setting achievement levels on the NAAL, we found that there were significant problems at both the lower and upper ends of the literacy scale. The problems with the lower end relate to decisions about the nature of the ALSA component. ALSA was implemented as a separate low-level assessment. ALSA and NAAL items were not analyzed or calibrated together and hence were not placed on the same scale. We were therefore not able to use the ALSA items in our procedures for setting the cut scores. These decisions about the ways to process ALSA data created a de facto cut score between the nonliterate in English and below basic categories. Consequently, all test takers in 2003 who performed poorly on the initial screening questions (the core questions) and were administered ALSA are classified into the nonliterate in English category (see footnote 9).

TABLE 5-12a Comparison of Impact Data for Prose Literacy Based on Rounded Bookmark Cut Scores, Rounded Adjusted Cut Scores, and Rounded Confidence Interval for Cut Scores

                                      Basic            Intermediate   Advanced
Rounded(a) bookmark cut score         210              270            345
Percent below cut score:
  1992                                16.5(b,c)        46.8           87.4
  2003                                15.4(c,d)        46.8           88.8
Rounded(a) adjusted cut score         210              265            340
Percent below cut score:
  1992                                16.5(b,c)        43.7           85.7
  2003                                15.4(c,d)        43.6           87.1
Rounded(e) confidence interval        201-226          261-290        334-378
Percent below cut scores:
  1992                                13.8-22.7(b,c)   41.2-59.4      83.2-95.6
  2003                                12.6-21.5(c,d)   40.9-60.1      84.6-96.5

a Rounded to nearest multiple of five.
b Includes those who took NALS and scored below the cut score as well as those who were not able to participate in the assessment for literacy-related reasons (having difficulty with reading or writing or unable to communicate in English or Spanish); nonparticipants for literacy-related reasons comprised 3 percent of the sample in 1992.
c This is an underestimate because it does not include the 1 percent of individuals who could not participate due to a mental disability such as retardation, a learning disability, or other mental/emotional conditions. An upper bound on the percent below basic could be obtained by including this percentage.
d Includes those who took NAAL and scored below the basic cut score, those who took ALSA, and those who were not able to participate in the assessment for literacy-related reasons (having difficulty with reading or writing or unable to communicate in English or Spanish); nonparticipants for literacy-related reasons comprised 2 percent of the sample in 2003.
e Rounded to nearest whole number.

TABLE 5-12b Comparison of Impact Data for Document Literacy Based on Rounded Bookmark Cut Scores, Rounded Adjusted Cut Scores, and Rounded Confidence Interval for Cut Scores

                                      Basic            Intermediate   Advanced
Rounded(a) bookmark cut score         205              255            345
Percent below cut score:
  1992                                16.8(b,c)        40.8           89.2
  2003                                14.2(c,d)        39.4           91.1
Rounded(a) adjusted cut score         205              250            335
Percent below cut score:
  1992                                16.8             37.8           85.8
  2003                                14.2             36.1           87.7
Rounded(e) confidence interval        192-211          246-265        321-373
Percent below cut scores:
  1992                                12.9-18.9        35.5-47.0      79.9-95.6
  2003                                10.5-16.3        33.7-46.0      81.6-96.9

See footnotes to Table 5-12a.

TABLE 5-12c Comparison of Impact Data for Quantitative Literacy Based on Rounded Bookmark Cut Scores, Rounded Adjusted Cut Scores, and Rounded Confidence Interval for Cut Scores

                                      Basic            Intermediate   Advanced
Rounded(a) bookmark cut score         245              300            355
Percent below cut score:
  1992                                33.3(b,c)        65.1           89.3
  2003                                27.9(c,d)        61.3           88.6
Rounded(a) adjusted cut score         235              290            350
Percent below cut score:
  1992                                28.5             59.1           87.9
  2003                                23.1             55.1           87.0
Rounded(e) confidence interval        226-263          283-306        343-396
Percent below cut scores:
  1992                                24.7-42.9        55.0-68.5      85.6-97.1
  2003                                19.2-37.9        50.5-64.9      84.1-97.2

See footnotes to Table 5-12a.

FIGURE 5-1 Comparison of the percentages of adults in each performance level based on the bookmark cut scores and adjusted cut scores for 1992 prose literacy.

FIGURE 5-2 Comparison of the percentages of adults in each performance level based on the bookmark cut scores and adjusted cut scores for 2003 prose literacy.
*The nonliterate in English category comprises 4.7% of the 2003 population. This percentage plus those in the below basic category would be equivalent to the 1992 below basic category.

FIGURE 5-3 Comparison of the percentages of adults in each performance level based on the bookmark cut scores and adjusted cut scores for 1992 document literacy.

FIGURE 5-4 Comparison of the percentages of adults in each performance level based on the bookmark cut scores and adjusted cut scores for 2003 document literacy.
*The nonliterate in English category comprises 4.7% of the 2003 population. This percentage plus those in the below basic category would be equivalent to the 1992 below basic category.

FIGURE 5-5 Comparison of the percentages of adults in each performance level based on the bookmark cut scores and adjusted cut scores for 1992 quantitative literacy.

FIGURE 5-6 Comparison of the percentages of adults in each performance level based on the bookmark cut scores and adjusted cut scores for 2003 quantitative literacy.
*The nonliterate in English category comprises 4.7% of the 2003 population. This percentage plus those in the below basic category would be equivalent to the 1992 below basic category.

This creates problems in making comparisons between the 1992 and 2003 data. Since ALSA was not a part of NALS in 1992, there is no way to identify the group of test takers who would have been classified into the nonliterate in English category. As a result, the below basic and nonliterate in English categories will need to be combined to examine trends between 1992 and 2003.

With regard to the upper end of the scale, feedback from the bookmark panelists, combined with our review of the items, suggests that the assessment does not adequately cover the upper end of the distribution of literacy proficiency. We developed the description of this level based on what we thought was the natural progression of skills beyond the intermediate level. In devising the wording of the description, we reviewed samples of NALS items and considered the 1992 descriptions of NALS Levels 4 and 5. A number of panelists in the bookmark procedure commented about the lack of difficulty represented by the items, however, particularly the quantitative items. A few judged that an individual at the advanced level should be able to answer all of the items correctly, which essentially means that these panelists did not set a cut score for the advanced category.

We therefore conclude that the assessment is very weak at the upper end of the scale. Although there are growing concerns about readiness for college-level work and preparedness for entry into professional and technical occupations, we think that NAAL, as currently designed, will not allow for detection of problems at these levels of proficiency. It is therefore with some reservations that we include the advanced category in our recommendation for performance levels, and we leave it to NCES to ultimately decide on the utility and meaning of this category.

With regard to the lower and upper ends of the score scale, we make the following recommendation:

RECOMMENDATION 5-2: Future development of NAAL should include more comprehensive coverage at the lower end of the continuum of literacy skills, including assessment of the extent to which individuals are able to recognize letters and numbers and read words and simple sentences, to allow determination of which individuals have the basic foundation skills in literacy and which individuals do not. This assessment should be part of NAAL and should yield information used in calculating scores for each of the three types of literacy. At the upper end of the continuum of literacy skills, future development of NAAL should also include assessment items necessary to identify the extent to which policy interventions are needed at the postsecondary level and above.