
Measuring Literacy: Performance Levels for Adults (2005)

Chapter 3: Developing Performance Levels for the National Adult Literacy Survey



3  Developing Performance Levels for the National Adult Literacy Survey

In this chapter, we document our observations and findings about the procedures used to develop the performance levels for the 1992 National Adult Literacy Survey (NALS). The chapter begins with some background information on how performance levels and the associated cut scores are typically determined. We then provide a brief overview of the test development process used for NALS, as it relates to the procedures for determining performance levels, and describe how the performance levels were determined and the cut scores set. The chapter also includes a discussion of the role of response probabilities in setting cut scores and in identifying assessment tasks to exemplify performance levels; the technical note at the end of the chapter provides additional details about this topic.

BACKGROUND ON DEVELOPING PERFORMANCE LEVELS

When the objective of a test is to report results using performance levels, the number of levels and the descriptions of the levels are usually articulated early in the test development process and serve as the foundation for test development. The process of determining the number of levels and their descriptions usually involves consideration of the content and skills evaluated on the test as well as discussions with stakeholders about the inferences to be based on the test results and the ways the test results will be used. When the number of levels and the descriptions of the levels are laid out in advance, development efforts can focus on constructing items that measure the content and skills described by the levels. It is important to develop a sufficient number of items that measure the skills described by each of the levels.

This allows for more reliable estimates of test takers' skills and more accurate classification of individuals into the various performance levels.

While determination of the performance-level descriptions is usually completed early in the test development process, determination of the cut scores between the performance levels is usually made after the test has been administered and examinees' answers are available. Typically, the process of setting cut scores involves convening a group of panelists with expertise in areas relevant to the subject matter covered on the test and familiarity with the test-taking population, who are instructed to make judgments about what test takers need to know and be able to do (e.g., which test items individuals should be expected to answer correctly) in order to be classified into a given performance level. These judgments are used to determine the cut scores that separate the performance levels.

Methods for setting cut scores are used in a wide array of assessment contexts, from the National Assessment of Educational Progress (NAEP) and state-sponsored achievement tests, in which procedures are used to determine the level of performance required to classify students into one of several performance levels (e.g., basic, proficient, or advanced), to licensing and certification tests, in which procedures are used to determine the level of performance required to pass such tests in order to be licensed or certified.

There is a broad literature on procedures for setting cut scores on tests. In 1986, Berk documented 38 methods and variations on these methods, and the literature has grown substantially since. All of the methods rely on panels of judges, but the tasks posed to the panelists and the procedures for arriving at the cut scores differ. The methods can be classified as test-centered, examinee-centered, and standards-centered.

The modified Angoff and bookmark procedures are two examples of test-centered methods. In the modified Angoff procedure, the task posed to the panelists is to imagine a typical minimally competent examinee and to decide on the probability that this hypothetical examinee would answer each item correctly (Kane, 2001). The bookmark method requires placing all of the items in a test in order by difficulty; panelists are asked to place a "bookmark" at the point between the most difficult item borderline test takers would be likely to answer correctly and the easiest item borderline test takers would be likely to answer incorrectly (Zieky, 2001).

The borderline group and contrasting group methods are two examples of examinee-centered procedures. In the borderline group method, the panelists are tasked with identifying examinees who just meet the performance standard; the cut score is set equal to the median score for these examinees (Kane, 2001). In the contrasting group method, the panelists are asked to categorize examinees into two groups: an upper group that has clearly met the standard and a lower group that has not met the standard.
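To make the contrast between these families of methods concrete, the sketch below computes a cut score under two of them. It is a minimal illustration only: the panelist ratings and examinee scores are invented for the example, and operational standard settings involve training, feedback, and multiple rounds of judgment that are not shown here.

```python
# Minimal illustration of two cut-score methods described above, using
# made-up panelist ratings and examinee scores (all numbers hypothetical).
from statistics import mean, median

# Modified Angoff: each panelist estimates the probability that a borderline
# ("minimally competent") examinee answers each item correctly.
# Rows = panelists, columns = items.
angoff_ratings = [
    [0.9, 0.7, 0.6, 0.4, 0.3],
    [0.8, 0.8, 0.5, 0.5, 0.2],
    [0.9, 0.6, 0.6, 0.4, 0.4],
]
# The cut score on the number-correct scale is the sum, over items,
# of the mean rating given to that item.
n_items = len(angoff_ratings[0])
angoff_cut = sum(mean(p[i] for p in angoff_ratings) for i in range(n_items))
print(f"Modified Angoff cut score: {angoff_cut:.1f} of {n_items} items")

# Borderline group: panelists identify examinees who just meet the standard;
# the cut score is the median test score of that borderline group.
borderline_scores = [271, 265, 280, 268, 275, 262, 284]
print(f"Borderline group cut score: {median(borderline_scores)}")
```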

The cut score is the score that best discriminates between the two groups.

The Jaeger-Mills integrated judgment procedure and the body of work procedure are examples of standards-centered methods. With these methods, panelists examine full sets of examinees' responses and match the full set of responses to a performance level (Jaeger and Mills, 2001; Kingston et al., 2001). Texts such as Jaeger (1989) and Cizek (2001a) provide full descriptions of these and the other available methods.

Although the methods differ in their approaches to setting cut scores, all ultimately rely on judgments. The psychometric literature documents procedures for systematizing the process of obtaining judgments about cut scores (e.g., see Jaeger, 1989; Cizek, 2001a). Use of systematic and careful procedures can increase the likelihood of obtaining fair and reasoned judgments, thus improving the reliability and validity of the results. Nevertheless, the psychometric field acknowledges that there are no "correct" standards, and the ultimate judgments depend on the method used, the way it is carried out, and the panelists themselves (Brennan, 1998; Green, Trimble, and Lewis, 2003; Jaeger, 1989; Zieky, 2001).

The literature on setting cut scores includes critiques of the various methods that document their strengths and weaknesses. As might be expected, methods that have been used widely and for some time, such as the modified Angoff procedure, have been the subject of more scrutiny than recently developed methods like the bookmark procedure. A review of these critiques quickly reveals that there are no perfect or correct methods. Like the cut-score-setting process itself, choice of a specific procedure requires making an informed judgment about the most appropriate method for a given assessment situation. Additional information about methods for setting cut scores appears in Chapter 5, where we describe the procedures we used.

DEVELOPMENT OF NALS TASKS

The NALS tasks were drawn from the contexts that adults encounter on a daily basis. As mentioned in Chapter 2, these contexts include work, home and family, health and safety, community and citizenship, consumer economics, and leisure and recreation. Some of the tasks had been used on the earlier adult literacy assessments (the Young Adult Literacy Survey in 1985 and the survey of job seekers in 1990), to allow comparison with the earlier results, and some were newly developed for NALS.

The tasks that were included on NALS were intended to profile and describe performance in each of the specified contexts. However, NALS was not designed to support inferences about the level of literacy adults need in order to function in the various contexts. That is, there was no attempt to systematically define the critical literacy demands in each of the contexts.

The test designers specifically emphasize this, saying: "[The literacy levels] do not reveal the types of literacy demands that are associated with particular contexts. . . . They do not enable us to say what specific level of prose, document, or quantitative skill is required to obtain, hold, or advance in a particular occupation, to manage a household, or to obtain legal or community services" (Kirsch et al., 1993, p. 9). This is an important point, because it demonstrates that some of the inferences made by policy makers and the media about the 1992 results were clearly not supported by the test development process and the intent of the assessment.

The approach toward test development used for NALS does not reflect typical procedures used when the objective of an assessment is to distinguish individuals with adequate levels of skills from those whose skills are inadequate. We point this out, not to criticize the process, but to clarify the limitations placed on the inferences that can be drawn about the results. To explain, it is useful to contrast the test development procedures used for NALS with procedures used in other assessment contexts, such as licensing and credentialing or state achievement testing.

Licensing and credentialing assessments are generally designed to distinguish between performance that demonstrates sufficient competence in the targeted knowledge, skills, and capabilities to be judged as passing and performance that is inadequate and judged as failing. Typically, licensing and certification tests are intentionally developed to distinguish between adequate and inadequate performance. The test development process involves specification of the skills critical to adequate performance, generally determined by systematically collecting judgments from experts in the specific field (e.g., via surveys) about what a licensed practitioner needs to know and be able to do. The process for setting cut scores relies on expert judgments about just how much of the specific knowledge, skills, and capabilities is needed for a candidate to be placed in the passing category.

The process for test development and determining performance levels for state K-12 achievement tests is similar. Under ideal circumstances, the performance-level categories and their descriptions are determined in advance of or concurrent with item development, and items are developed to measure skills described by the performance levels. The process of setting the cut scores then focuses on determining the level of performance considered to be adequate mastery of the content and skills (often called "proficient"). Categories of performance below and above the proficient level are also often described to characterize the score distribution of the group of test takers.

The process for developing NALS and determining the performance levels was different. This approach toward test development does not—and was not intended to—provide the necessary foundation for setting standards for what adults need in order to adequately function in society, and there is no way to compensate for this after the fact.

That is, there is no way to set a specific cut score that would separate adults who have sufficient literacy skills to function in society from those who do not. This does not mean that performance levels should not be used for reporting NALS results or that cut scores should not be set. But it does mean that users need to be careful to distinguish the inferences about the test results that can be supported from those that cannot.

DEVELOPMENT OF PERFORMANCE-LEVEL DESCRIPTIONS AND CUT SCORES

Overview of the Process Used for the 1992 NALS

The process of determining performance levels for the 1992 NALS was based partially on analyses conducted on data from the two earlier assessments of adults' literacy skills. The analyses focused on identifying the features of the assessment tasks and stimulus materials that contributed to the difficulty of the test questions. These analyses had been used to determine performance levels for the Survey of Workplace Literacy, the survey of job seekers conducted in 1990.[1] The analyses conducted on the prior surveys were not entirely replicated for NALS. Instead, new analyses were conducted to evaluate the appropriateness of the performance levels and associated cut scores that had been used for the survey of job seekers. Based on these analyses, slight adjustments were made in the existing performance levels before adopting them for NALS. This process is described more fully below.

The first step in the process that ultimately led to the formulation of NALS performance levels was an in-depth examination of the items included on the Young Adult Literacy Survey and the Survey of Workplace Literacy, to identify the features judged to contribute to their complexity.[2] For the prose literacy items, four features were judged to contribute to their complexity:

• Type of match: whether finding the information needed to answer the question involved simply locating the answer in the text, cycling through the text iteratively, integrating multiple pieces of information, or generating new information based on prior knowledge.
• Abstractness of the information requested.
• Plausibility of distractors: the extent and location of information related to the question, other than the correct answer, that appears in the stimulus.
• Readability, as estimated using Fry's (1977) readability index.

[1] The analyses were conducted on the Young Adult Literacy Survey, but performance levels were not used in reporting its results. The analyses were partly replicated and extended to yield performance levels for the Survey of Workplace Literacy.
[2] See Chapter 13 of the NALS Technical Manual for additional details about the process (http://www.nces.ed.gov/pubsearch/pubsinfo.asp?pubid=2001457).

The features judged to contribute to the complexity of document literacy items were the same as for prose, with the exception that an index of the structural complexity of the display was substituted for the readability index. For the quantitative literacy items, the identified features included type of match and plausibility of the distractors, as with the prose items, and structural complexity, as with the document items, along with two other features:

• Operation specificity: the process required for identifying the operation to perform and the numbers to manipulate.
• Type of calculation: the type and number of arithmetic operations.

A detailed schema was developed for use in "scoring" items according to these features, and the scores were referred to as complexity ratings.

The next step in the process involved determination of the cut scores for the performance levels used for reporting results of the 1990 Survey of Workplace Literacy. The process involved rank-ordering the items according to a statistical estimate of their difficulty, which was calculated using data from the actual survey respondents. The items were listed in order from least to most difficult, and the judgment-based ratings of complexity were displayed on the listing. Tables 3-1 through 3-3, respectively, present the lists of prose, document, and quantitative items rank-ordered by difficulty level. This display was visually examined for natural groupings or break points. According to Kirsch, Jungeblut, and Mosenthal (2001, p. 332), "visual inspection of the distribution of [the ratings] along each of the literacy scales revealed several major [break] points occurring at roughly 50 point intervals beginning with a difficulty score of 225 on each scale."

The process of determining the break points was characterized as containing "some noise" and not accounting for all the score variance associated with performance on the literacy scales. It was noted that the shifts in complexity ratings did not necessarily occur at exactly 50-point intervals on the scales, but that assigning the exact range of scores to each level (e.g., 277-319 for Level 3 of document literacy and 331-370 for Level 4 of quantitative literacy) would imply a level of precision of measurement that the test designers believed was inappropriate for the methodology adopted.
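The listing-and-inspection step described above can be pictured with a small sketch: items are sorted by their RP80 difficulty values and their judgment-based complexity ratings are printed alongside so that shifts in the ratings can be spotted. The identifiers, RP80 values, and ratings below are invented for illustration; the real listings are the ones shown in Tables 3-1 through 3-3.

```python
# A sketch of the inspection step: items sorted by an IRT-based difficulty
# estimate (the RP80 scale value) with judgment-based complexity ratings
# listed alongside. All values below are illustrative, not taken from
# Tables 3-1 to 3-3.
items = [
    # (identifier, rp80_scale_value, complexity ratings)
    ("P01", 189, {"match": 1, "distractors": 1, "info_type": 1}),
    ("P02", 216, {"match": 1, "distractors": 1, "info_type": 2}),
    ("P03", 245, {"match": 1, "distractors": 2, "info_type": 3}),
    ("P04", 263, {"match": 3, "distractors": 1, "info_type": 3}),
    ("P05", 298, {"match": 3, "distractors": 2, "info_type": 3}),
    ("P06", 329, {"match": 4, "distractors": 4, "info_type": 4}),
    ("P07", 384, {"match": 5, "distractors": 4, "info_type": 4}),
]
for ident, rp80, ratings in sorted(items, key=lambda t: t[1]):
    print(f"{ident}  RP80={rp80:3d}  " +
          "  ".join(f"{name}={value}" for name, value in ratings.items()))
# The published procedure relied on visual inspection of such a listing to
# locate break points at roughly 50-point intervals beginning at 225.
```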

56 MEASURING LITERACY: PERFORMANCE LEVELS FOR ADULTS TABLE 3-1 List of Prose Literacy Tasks, Along with RP80 Task Difficulty, IRT Item Parameters, and Values of Variables Associated with Task Difficulty: 1990 Survey of the Literacy of Job-Seekers Scaled Identifier Task Description RP80 Level 1 A111301 Toyota, Acura, Nissan 189 AB21101 Swimmer: Underline sentence telling 208 what Ms. Chanin ate A120501 Blood donor pamphlet 216 A130601 Summons for jury service 237 Level 2 A120301 Blood donor pamphlet 245 A100201 PHP subscriber letter 249 A111401 Toyota, Acura, Nissan 250 A121401 Dr. Spock column: Alterntv to phys punish 251 AB21201 Swimmer: Age Ms. Chanin began to swim 250 competitively A131001 Shadows Columbus saw 280 AB80801 Illegal questions 265 AB41001 Declaration: Describe what poem is 263 about AB81101 New methods for capital gains 277 AB71001 Instruction to return appliance: 275 Indicate best note AB90501 Questions for new jurors 281 AB90701 Financial security tips 262 A130901 Shadows Columbus saw 282 Level 3 AB60201 Make out check: Write letter explaining 280 bill error AB90601 Financial security tips 299 A121201 Dr. Spock column: Why 285 phys punish accptd AB70401 Almanac vitamins: List correct info 289 from almanac A100301 PHP subscriber letter 294 A130701 Shadows Columbus saw 298 A130801 Shadows Columbus saw 303 AB60601 Economic index: Underline sentence 305 explaining action A121301 Dr. Spock column: 2 cons against 312 phys punish AB90401 Questions for new jurors 300 AB80901 Illegal questions 316 A111101 Toyota, Acura, Nissan 319

DEVELOPING PERFORMANCE LEVELS 57 IRT Parameters Type of Distractor Information a b c Readability Match Plausibility Type 0.868 –2.488 0.000 8 1 1 1 1.125 –1.901 0.000 8 1 1 1 0.945 –1.896 0.000 7 1 1 2 1.213 –1.295 0.000 7 3 2 2 0.956 –1.322 0.000 7 1 2 3 1.005 –1.195 0.000 10 3 1 3 1.144 –1.088 0.000 8 3 2 4 1.035 –1.146 0.000 8 2 2 3 1.070 –1.125 0.000 8 3 4 2 1.578 –0.312 0.000 9 3 1 2 1.141 –0.788 0.000 6 3 2 2 0.622 –1.433 0.000 4 3 1 3 1.025 –0.638 0.000 7 4 1 3 1.378 –0.306 0.266 5 3 2 3 1.118 –0.493 0.000 6 4 2 1 1.563 –0.667 0.000 8 3 2 4 1.633 –0.255 0.000 9 3 4 1 1.241 –0.440 0.000 7 3 2 4 1.295 –0.050 0.000 8 2 2 4 1.167 –0.390 0.000 8 3 2 4 0.706 –0.765 0.000 7 3 4 1 0.853 –0.479 0.000 10 4 3 2 1.070 –0.203 0.000 9 3 2 3 0.515 –0.929 0.000 9 3 2 2 0.809 –0.320 0.000 10 3 2 4 0.836 –0.139 0.000 8 3 3 4 1.230 –0.072 0.000 6 4 2 3 0.905 –0.003 0.000 6 4 3 3 0.772 –0.084 0.000 8 4 3 2 continued

58 MEASURING LITERACY: PERFORMANCE LEVELS FOR ADULTS TABLE 3-1 Continued Scaled Identifier Task Description RP80 Level 4 AB40901 Korean Jet: Give argument made in article 329 A131101 Shadows Columbus saw 332 AB90801 Financial security tips 331 AB30601 Technology: Orally explain info from article 333 AB50201 Panel: Determine surprising future headline 343 A101101 AmerExp: 2 similarities in handling receipts 346 AB71101 Explain difference between 2 types of benefits 348 AB81301 New methods for capital gains 355 A120401 Blood donor pamphlet 358 AB31201 Dickinson: Describe what is expessed in poem 363 AB30501 Technology: Underline sentence explaining action 371 Level 5 AB81201 New methods for capital gains 384 A111201 Toyota, Acura, Nissan 404 A101201 AmExp: 2 diffs in handling receipts 441 AB50101 Panel: Find information from article 469 TABLE 3-2 List of Document Literacy Tasks, Along with RP80 Task Difficulty Score, IRT Item Parameters, and Values of Variables Associated with Task Difficulty (structural complexity, type of match, plausibility of distractor, type of information): 1990 Survey of the Literacy of Job-Seekers Identifier Task Description RP80 Level 1 SCOR100 Social Security card: Sign name on line 70 SCOR300 Driver’s license: Locate expiration date 152 SCOR200 Traffic signs 176 AB60803 Nurses’ convention: What is time of program? 181 AB60802 Nurses’ convention: What is date of program? 187 SCOR400 Medicine dosage 186 AB71201 Mark correct movie from given information 189 A110501 Registration & tuition info 189 AB70104 Job application: Complete personal information 193 AB60801 Nurses’ convention: Write correct day of program 199 SCOR500 Theatre trip information 197

DEVELOPING PERFORMANCE LEVELS 59 IRT Parameters Type of Distractor Information a b c Readability Match Plausibility Type 0.826 0.166 0.000 10 4 4 4 0.849 0.258 0.000 9 5 4 1 0.851 0.236 0.000 8 5 5 2 0.915 0.347 0.000 8 4 4 4 1.161 0.861 0.196 13 4 4 4 0.763 0.416 0.000 8 4 2 4 0.783 0.482 0.000 9 6 2 5 0.803 0.652 0.000 7 5 5 3 0.458 –0.056 0.000 7 4 5 2 0.725 0.691 0.000 6 6 2 4 0.591 0.593 0.000 8 6 4 4 0.295 –0.546 0.000 7 2 4 2 0.578 1.192 0.000 8 8 4 5 0.630 2.034 0.000 8 7 5 5 0.466 2.112 0.000 13 6 5 4 IRT Parameters Type of Distractor Information a b c Complexity Match Plausibility Type 0.505 –4.804 0.000 1 1 1 1 0.918 –2.525 0.000 2 1 2 1 0.566 –2.567 0.000 1 1 1 1 1.439 –1.650 0.000 1 1 1 1 1.232 –1.620 0.000 1 1 1 1 0.442 –2.779 0.000 2 1 2 2 0.940 –1.802 0.000 8 2 2 1 0.763 –1.960 0.000 3 1 2 2 0.543 –2.337 0.000 1 2 1 2 1.017 –1.539 0.000 1 1 2 1 0.671 –1.952 0.000 2 1 2 2 continued

60 MEASURING LITERACY: PERFORMANCE LEVELS FOR ADULTS TABLE 3-2 Continued Identifier Task Description RP80 AB60301 Phone message: Write correct name of caller 200 AB60302 Phone message: Write correct number of caller 202 AB80301 How companies share market 203 AB60401 Food coupons 204 AB60701 Nurses’ convention: Who would be asked questions 206 A120601 MasterCard/Visa statement 211 AB61001 Nurses’ convention: Write correct place for tables 217 A110301 Dessert recipes 216 AB70903 Checking deposit: Enter correct amount of check 223 AB70901 Checking deposit: Enter correct date 224 AB50801 Wage & tax statement: What is current net pay? 224 A130201 El Paso Gas & Electric bill 223 Level 2 AB70801 Classified: Match list with coupons 229 AB30101 Street map: Locate intersection 232 AB30201 Sign out sheet: Respond to call about resident 232 AB40101 School registration: Mark correct age information 234 A131201 Tempra dosage chart 233 AB31301 Facts about fire: Mark information in article 235 AB80401 How companies share market 236 AB60306 Phone message: Write whom message is for 237 AB60104 Make out check: Enter correct amount written out 238 AB21301 Bus schedule 238 A110201 Dessert recipes 239 AB30301 Sign out sheet: Respond to call about resident 240 AB30701 Major medical: Locate eligibility from table 245 AB60103 Make out check: Enter correct amount in numbers 245 AB60101 Make out check: Enter correct date on check 246 AB60102 Make out check: Paid to the correct place 246 AB50401 Catalog order: Order product one 247 AB60303 Phone message: Mark “please call” box 249 AB50701 Almanac football: Explain why an award is given 254 AB20101 Energy graph: Find answer for given conditions (1) 255 A120901 MasterCard/Visa statement 257 A130101 El Paso Gas & Electric bill 257 AB91101 Minimum wage power 260 AB81001 Consumer Reports books 261 AB90101 Pest control warning 261 AB21501 With graph, predict sales for spring 1985 261 AB20601 Yellow pages: Find place open Saturday 266 A130401 El Paso Gas & Electric bill 270 AB70902 Checking deposit: Enter correct cash amount 271

DEVELOPING PERFORMANCE LEVELS 61 IRT Parameters Type of Distractor Information a b c Complexity Match Plausibility Type 1.454 –1.283 0.000 1 1 2 1 1.069 –1.434 0.000 1 1 1 1 1.292 –1.250 0.000 7 2 2 2 0.633 –1.898 0.000 3 2 2 1 1.179 –1.296 0.000 1 2 2 1 0.997 –1.296 0.000 6 1 2 2 0.766 –1.454 0.000 1 1 2 2 1.029 –1.173 0.000 5 3 2 1 1.266 –0.922 0.000 3 2 2 1 0.990 –1.089 0.000 3 1 1 1 0.734 –1.366 0.000 5 2 2 2 1.317 –0.868 0.000 8 1 2 2 1.143 –0.881 0.000 8 2 3 1 0.954 –0.956 0.000 4 2 2 2 0.615 –1.408 0.000 2 3 2 1 0.821 –1.063 0.000 6 2 2 3 1.005 –0.872 0.000 5 2 3 3 0.721 –1.170 0.000 1 2 3 2 1.014 –0.815 0.000 7 3 2 2 0.948 –0.868 0.000 1 2 3 1 1.538 –0.525 0.000 6 3 2 1 0.593 –1.345 0.000 2 2 3 2 0.821 –0.947 0.000 5 3 2 1 0.904 –0.845 0.000 2 2 2 3 0.961 –0.703 0.000 4 2 2 2 0.993 –0.674 0.000 6 3 2 1 1.254 –0.497 0.000 6 3 2 1 1.408 –0.425 0.000 6 3 2 1 0.773 –0.883 0.000 8 3 2 1 0.904 –0.680 0.000 1 2 2 2 1.182 –0.373 0.000 6 2 2 3 1.154 –0.193 0.228 4 3 2 1 0.610 –0.974 0.000 6 1 2 2 0.953 –0.483 0.000 8 2 2 2 0.921 –0.447 0.000 4 3 3 2 1.093 –0.304 0.000 4 3 2 1 0.889 –0.471 0.000 2 3 3 2 0.799 –0.572 0.000 5 3 2 2 1.078 –0.143 0.106 7 3 2 1 0.635 –0.663 0.000 8 3 3 2 0.858 –0.303 0.000 3 3 3 2 continued

62 MEASURING LITERACY: PERFORMANCE LEVELS FOR ADULTS TABLE 3-2 Continued Identifier Task Description RP80 Level 3 AB50601 Almanac football: Locate page of info in almanac 276 A110701 Registration & tuition info 277 AB20201 Energy graph: Find answer for given conditions (2) 278 AB31101 Abrasive gd: Can product be used in given case? 280 AB80101 Burning out of control 281 AB70701 Follow directions on map: Give correct location 284 A110801 Washington/Boston schedule 284 AB70301 Almanac vitamins: Locate list of info in almanac 287 AB20401 Yellow pages: Find a list of stores 289 AB20501 Yellow pages: Find phone number of given place 291 AB60305 Phone message: Write who took the message 293 AB30401 Sign out sheet: Respond to call about resident (2) 297 AB31001 Abrasive guide: Type of sandpaper for sealing 304 AB20301 Energy: Yr 2000 source prcnt power larger than 71 307 AB90901 U.S. Savings Bonds 308 AB60304 Phone message: Write out correct message 310 AB81002 Consumer Reports books 311 AB20801 Bus schd: Take correct bus for given condition (2) 313 AB50402 Catalog order: Order product two 314 AB40401 Almanac: Find page containing chart for given info 314 AB21001 Bus schd: Take correct bus for given condition (4) 315 AB60502 Petroleum graph: Complete graph including axes 318 A120701 MasterCard/Visa statement 320 AB20701 Bus schd: Take correct bus for given condition (1) 324 Level 4 A131301 Tempra dosage chart 326 AB50501 Telephone bill: Mark information on bill 330 AB91401 Consumer Reports index 330 AB30801 Almanac: Find page containing chart for given info 347 AB20901 Bus schd: After 2:35, how long til Flint&Acad bus 348 A130301 El Paso Gas & Electric bill 362 A120801 MasterCard/Visa statement 363 AB91301 Consumer Reports index 367 Level 5 AB60501 Petroleum graph: Label axes of graph 378 AB30901 Almanac: Determine pattern in exports across years 380 A100701 Spotlight economy 381 A100501 Spotlight economy 386 A100401 Spotlight economy 406 AB51001 Income tax table 421 A100601 Spotlight economy 465

DEVELOPING PERFORMANCE LEVELS 63 IRT Parameters Type of Distractor Information a b c Complexity Match Plausibility Type 1.001 –0.083 0.000 5 3 2 2 0.820 –0.246 0.000 3 2 5 2 0.936 –0.023 0.097 4 4 2 1 0.762 –0.257 0.000 10 5 2 3 0.550 –0.656 0.000 2 3 2 2 0.799 –0.126 0.000 4 4 2 2 0.491 –0.766 0.000 9 2 4 2 0.754 –0.134 0.000 5 3 4 2 0.479 –0.468 0.144 7 2 5 1 0.415 –0.772 0.088 7 2 4 2 0.640 –0.221 0.000 1 5 2 1 0.666 –0.089 0.000 2 2 1 4 0.831 0.285 0.000 10 4 2 2 1.090 0.684 0.142 4 4 2 1 0.932 0.479 0.000 6 4 4 2 0.895 0.462 0.000 1 5 2 3 0.975 0.570 0.000 4 3 5 2 1.282 0.902 0.144 10 3 5 2 1.108 0.717 0.000 8 4 4 3 0.771 0.397 0.000 5 4 3 2 0.730 0.521 0.144 10 3 4 2 1.082 0.783 0.000 10 6 2 2 0.513 –0.015 0.000 6 2 4 2 0.522 0.293 0.131 10 3 4 2 0.624 0.386 0.000 5 4 4 2 0.360 –0.512 0.000 7 4 4 2 0.852 0.801 0.000 7 3 5 3 0.704 0.929 0.000 5 4 5 2 1.169 1.521 0.163 10 5 4 2 0.980 1.539 0.000 8 5 4 5 0.727 1.266 0.000 6 5 4 2 0.620 1.158 0.000 7 4 5 3 1.103 1.938 0.000 11 7 2 5 0.299 0.000 0.000 7 5 5 3 0.746 1.636 0.000 10 5 5 2 0.982 1.993 0.000 10 5 5 5 0.489 1.545 0.000 10 5 5 2 0.257 0.328 0.000 9 4 5 2 0.510 2.737 0.000 10 7 5 2

64 MEASURING LITERACY: PERFORMANCE LEVELS FOR ADULTS TABLE 3-3 List of Quantitative Literacy Tasks, Along with RP80 Task Difficulty, IRT Item Parameters, and Values of Variables Associated with Task Difficulty (structural complexity, type of match, plausibility of distractors, type of calculation, and specificity of operation): 1990 Survey of the Literacy of Job-Seekers Identifier Quantitative Literacy Items RP80 Level 1 AB70904 Enter total amount of both checks being deposited 221 Level 2 AB50404 Catalog order: Shipping, handling, and total 271 AB91201 Tempra coupon 271 AB40701 Check ledger: Complete ledger (1) 277 A121001 Insurance protection workform 275 Level 3 AB90102 Pest control warning 279 AB40702 Check ledger: Complete ledger (2) 281 AB40703 Check ledger: Complete ledger (3) 282 A131601 Money rates: Thursday vs. one year ago 281 AB40704 Check ledger: Complete ledger (4) 283 AB80201 Burning out of control 286 A110101 Dessert recipes 289 AB90201 LPGA money leaders 294 A120101 Businessland printer stand 300 AB81003 Consumer Reports books 301 AB80601 Valet airport parking discount 307 AB40301 Unit price: Mark economical brand 311 A131701 Money rates: Compare S&L w/mutual funds 312 AB80701 Valet airport parking discount 315 A100101 Pizza coupons 316 AB90301 LPGA money leaders 320 A110401 Dessert recipes 323 A131401 Tempra dosage chart 322 Level 4 AB40501 Airline schedule: Plan travel arrangements (1) 326 AB70501 Lunch: Determine correct change using info in menu 331 A120201 Businessland printer stand 340 A110901 Washington/Boston train schedule 340 AB60901 Nurses’ convention: Write number of seats needed 346 AB70601 Lunch: Determine 10% tip using given info 349 A111001 Washington/Boston train schedule 355 A130501 El Paso Gas & Electric bill 352 A100801 Spotlight economy 356

DEVELOPING PERFORMANCE LEVELS 65 IRT Parameters Type of Distractor Calculation Op a b c Complexity Match Plausibility Type Specfy 0.869 –1.970 0.000 2 1 1 1 1 0.968 –0.952 0.000 6 3 2 1 3 0.947 –0.977 0.000 1 2 1 5 4 1.597 –0.501 0.000 3 2 2 1 4 0.936 –0.898 0.000 2 3 2 3 2 0.883 –0.881 0.000 2 3 3 1 4 1.936 –0.345 0.000 3 2 2 2 4 1.874 –0.332 0.000 3 1 2 2 4 1.073 –0.679 0.000 4 3 2 2 4 1.970 –0.295 0.000 3 2 2 2 4 0.848 –0.790 0.000 2 3 2 2 4 0.813 –0.775 0.000 5 3 2 2 4 0.896 –0.588 0.000 5 2 2 2 4 1.022 –0.369 0.000 2 3 3 2 4 0.769 –0.609 0.000 7 2 3 1 4 0.567 –0.886 0.000 2 3 3 2 4 0.816 0.217 0.448 2 2 3 4 6 1.001 –0.169 0.000 4 3 3 2 2 0.705 –0.450 0.000 2 2 3 3 4 0.690 –0.472 0.000 2 3 3 1 4 1.044 0.017 0.000 5 1 2 4 3 1.180 0.157 0.000 5 3 2 3 6 1.038 0.046 0.000 5 3 3 2 4 0.910 0.006 0.000 3 3 3 5 3 0.894 0.091 0.000 2 2 2 5 4 0.871 0.232 0.000 2 3 4 3 5 1.038 0.371 0.000 7 4 4 2 5 0.504 –0.355 0.000 3 4 4 1 5 0.873 0.384 0.000 2 1 2 5 7 0.815 0.434 0.000 7 4 4 2 5 0.772 0.323 0.000 8 3 4 2 2 0.874 0.520 0.000 8 5 4 2 2 continued

TABLE 3-3 Continued

Identifier  Quantitative Literacy Items  RP80
AB40201  Unit price: Estimate cost/oz of peanut butter  356
A121101  Insurance protection workform  356
A100901  Camp advertisement  366
A101001  Camp advertisement  366
AB80501  How companies share market  371
Level 5
A131501  Tempra dosage chart  381
AB50403  Catalog order: Order product three  382
AB91001  U.S. Savings Bonds  385
A110601  Registration & tuition info  407
AB50301  Interest charges: Orally explain computation  433

Thus, identical score intervals were adopted for each of the three literacy scales, as shown below:

• Level 1: 0–225
• Level 2: 226–275
• Level 3: 276–325
• Level 4: 326–375
• Level 5: 376–500

Performance-level descriptions were developed by summarizing the features of the items whose difficulty values fell within each of the score ranges.

These procedures were not entirely replicated to determine the performance levels for NALS, in part because NALS used some of the items from the two earlier assessments. Instead, statistical estimates of difficulty were computed for the newly developed NALS items (the items that had not been used on the earlier assessments), and the correlation between these difficulty estimates and the item complexity ratings was determined. The test designers judged the correlations to be sufficiently similar to those from the earlier assessments and chose to use the same score scale breakpoints for NALS as had been used for the performance levels for the Survey of Workplace Literacy. Minor adjustments were made to the language describing the existing performance levels. The resulting performance-level descriptions appear in Table 3-4.
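The score intervals listed above translate directly into a simple classification rule. The sketch below applies those published boundaries to scale scores (for example, to the RP80 values listed in Tables 3-1 through 3-3); only the interval endpoints come from the text, and everything else is illustrative.

```python
# A small helper reflecting the score intervals listed above: it assigns a
# NALS scale score (0-500) to one of the five performance levels.
def nals_level(scale_score: float) -> int:
    """Return the 1992 NALS performance level for a 0-500 scale score."""
    if scale_score <= 225:
        return 1
    if scale_score <= 275:
        return 2
    if scale_score <= 325:
        return 3
    if scale_score <= 375:
        return 4
    return 5

# Example: the RP80 values of two prose tasks from Table 3-1.
print(nals_level(189))  # -> 1
print(nals_level(298))  # -> 3
```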

TABLE 3-3 Continued
a  b  c  Complexity  Type of Match  Distractor Plausibility  Calculation Type  Op Specfy
0.818  0.455  0.000  2  1  2  4  5
0.860  0.513  0.000  2  1  2  5  4
0.683  0.447  0.000  2  2  4  5  4
0.974  0.795  0.000  2  3  4  5  4
1.163  1.027  0.000  6  3  2  3  6
0.916  1.031  0.000  5  3  5  3  5
0.609  0.601  0.000  6  4  5  5  5
0.908  1.083  0.000  6  4  5  2  4
0.624  1.078  0.000  8  2  5  5  5
0.602  1.523  0.000  2  5  5  5  7

Findings About the Process Used for the 1992 NALS

The available written documentation about the procedures used for determining performance levels for NALS does not specify some of the more important details about the process (see Kirsch, Jungeblut, and Mosenthal, 2001, Chapter 13). For instance, it is not clear who participated in producing the complexity ratings or exactly how this task was handled. Determination of the cut scores involved examination of the listing of items for break points, but the break points are not entirely obvious. It is not clear that other people looking at this list would make the same choices for break points. In addition, it is not always clear whether the procedures described in the technical manual pertain to NALS or to one of the earlier assessments.

A more open and public process combined with more explicit, transparent documentation is likely to lead to better understanding of how the levels were determined and what conclusions can be drawn about the results. The performance levels produced by this approach were score ranges based on the cognitive processes required to respond to the items. While the 1992 score levels were used to inform a variety of programmatic decisions, there is a benefit to developing performance levels through open discussions with stakeholders. Such a process would result in levels that would be more readily understood.

The process for determining the cut scores for the performance levels used for reporting NALS in 1992 did not involve one of the typical methods documented in the psychometric literature.

TABLE 3-4 National Adult Literacy Survey (NALS) Performance-Level Descriptions

Level 1 (0-225)
Prose: Most of the tasks in this level require the reader to read relatively short text to locate a single piece of information which is identical to or synonymous with the information given in the question or directive. If plausible but incorrect information is present in the text, it tends not to be located near the correct information.
Document: Tasks in this level tend to require the reader either to locate a piece of information based on a literal match or to enter information from personal knowledge onto a document. Little, if any, distracting information is present.
Quantitative: Tasks in this level require readers to perform single, relatively simple arithmetic operations, such as addition. The numbers to be used are provided and the arithmetic operation to be performed is specified.

Level 2 (226-275)
Prose: Some tasks in this level require readers to locate a single piece of information in the text; however, several distractors or plausible but incorrect pieces of information may be present, or low-level inferences may be required. Other tasks require the reader to integrate two or more pieces of information or to compare and contrast easily identifiable information based on a criterion provided in the question or directive.
Document: Tasks in this level are more varied than those in Level 1. Some require the readers to match a single piece of information; however, several distractors may be present, or the match may require low-level inferences. Tasks in this level may also ask the reader to cycle through information in a document or to integrate information from various parts of a document.
Quantitative: Tasks in this level typically require readers to perform a single operation using numbers that are either stated in the task or easily located in the material. The operation to be performed may be stated in the question or easily determined from the format of the material (for example, an order form).

Level 3 (276-325)
Prose: Tasks in this level tend to require readers to make literal or synonymous matches between the text and information given in the task, or to make matches that require low-level inferences. Other tasks ask readers to integrate information from dense or lengthy text that contains no organizational aids such as headings. Readers may also be asked to generate a response based on information that can be easily identified in the text. Distracting information is present, but is not located near the correct information.
Document: Some tasks in this level require the reader to integrate multiple pieces of information from one or more documents. Others ask readers to cycle through rather complex tables or graphs which contain information that is irrelevant or inappropriate to the task.
Quantitative: In tasks in this level, two or more numbers are typically needed to solve the problem, and these must be found in the material. The operation(s) needed can be determined from the arithmetic relation terms used in the question or directive.

Level 4 (326-375)
Prose: These tasks require readers to perform multiple-feature matches and to integrate or synthesize information from complex or lengthy passages. More complex inferences are needed to perform successfully. Conditional information is frequently present in tasks at this level and must be taken into consideration by the reader.
Document: Tasks in this level, like those at the previous levels, ask readers to perform multiple-feature matches, cycle through documents, and integrate information; however, they require a greater degree of inferencing. Many of these tasks require readers to provide numerous responses but do not designate how many responses are needed. Conditional information is also present in the document tasks at this level and must be taken into account by the reader.
Quantitative: These tasks tend to require readers to perform two or more sequential operations or a single operation in which the quantities are found in different types of displays, or the operations must be inferred from semantic information given or drawn from prior knowledge.

Level 5 (376-500)
Prose: Some tasks in this level require the reader to search for information in dense text which contains a number of plausible distractors. Others ask readers to make high-level inferences or use specialized background knowledge. Some tasks ask readers to contrast complex information.
Document: Tasks in this level require the reader to search through complex displays that contain multiple distractors, to make high-level text-based inferences, and to use specialized knowledge.
Quantitative: These tasks require readers to perform multiple operations sequentially. They must disembed the features of the problem from text or rely on background knowledge to determine the quantities or operations needed.

Source: U.S. Department of Education, National Center for Education Statistics, National Adult Literacy Survey, 1992.

This is not to criticize the test designers' choice of procedures, as it appears that they were not asked to set standards for NALS, and hence one would not expect them to use one of these methods. It is our view, however, that there are benefits to using one or more of these documented methods. Use of established procedures for setting cut scores allows one to draw from the existing research and experiential base to gather information about the method, such as prescribed ways to implement the method, variations on the method, research on its advantages and disadvantages, and so on. In addition, use of established procedures facilitates communication with others about the general process. For example, if the technical manual for an assessment program indicates that the body of work method was used to set the cut scores, people can refer to the research literature for further details about what this typically entails.

CHOICE OF RESPONSE PROBABILITY VALUES

The Effects of Response Probability Values on the Performance Levels

The difficulty level of test questions can be estimated using a statistical procedure called item response theory (IRT). With IRT, a curve is estimated that gives the probability of a correct response from individuals across the range of proficiency. The curve is described in terms of parameters in a mathematical model. One of the parameter estimates, the difficulty parameter, typically corresponds to the score (or proficiency level) at which an individual has a 50 percent chance of answering the question correctly. Under this approach, it is also possible to designate, for the purposes of interpreting an item's response curve, the proficiency at which the probability is any particular value that users find helpful. In 1992 the test developers chose to calculate test question difficulty values representing the proficiency level at which an individual had an 80 percent chance of answering an item correctly. The items were rank-ordered according to this estimate of their difficulty levels. Thus, the scaled scores used in determining the score ranges associated with the five performance levels were the scaled scores associated with an 80 percent probability of responding correctly.

The choice of the specific response probability value (e.g., 50, 65, or 80 percent) does not affect either the estimates of item response curves or distributions of proficiency. It is nevertheless an important decision because it affects users' interpretations of the value of the scale scores used to separate the performance levels. Furthermore, due to the imprecision of the connection between the mathematical definitions of response probability values and the linguistic descriptions of their implications for performance that judges use to set standards, the cut scores could be higher or lower simply as a consequence of the response probability selected.
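The relationship between a response probability criterion and the resulting scale point can be made explicit with a short calculation. The sketch below inverts a three-parameter logistic item response function to find the proficiency (theta) at which an item reaches a chosen probability of a correct response. The functional form and the conventional 1.7 scaling constant are assumptions of this illustration; the published RP80 values in Tables 3-1 through 3-3 reflect a further linear transformation of theta onto the 0-500 reporting scale, which is not reproduced here.

```python
import math

def theta_at_rp(a: float, b: float, c: float, p: float, D: float = 1.7) -> float:
    """Proficiency (theta) at which a 3PL item is answered correctly with
    probability p, given P(theta) = c + (1 - c) / (1 + exp(-D * a * (theta - b)))."""
    if not (c < p < 1.0):
        raise ValueError("p must lie strictly between c and 1")
    return b + math.log((p - c) / (1.0 - p)) / (D * a)

# One set of prose item parameters (a, b, c) listed in Table 3-1.
a, b, c = 0.868, -2.488, 0.0
print(theta_at_rp(a, b, c, 0.80))  # proficiency needed for an 80 percent chance
print(theta_at_rp(a, b, c, 0.50))  # equals b: the usual difficulty parameter
```

For p = 0.5 the expression reduces to theta = b, matching the interpretation of the difficulty parameter noted above; higher criteria such as 0.80 move the associated scale point upward for the same item.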

As mentioned earlier, the decision to use a response probability of 80 percent for the 1992 NALS has been the subject of subsequent debate, which has centered on whether the use of a response probability of 80 percent may have misrepresented the literacy levels of adults in the United States by producing cut scores that were too high (Baron, 2002; Kirsch, 2002; Kirsch et al., 2001, Ch. 14; Matthews, 2001; Sticht, 2004), to the extent that having a probability lower than 80 percent was misinterpreted as "not being able to do" the task required by an item.

In the final chapter of the technical manual (see Kirsch et al., 2001, Chapter 14), Kolstad demonstrated how the choice of a response probability value affects the value of the cut scores, under the presumption that response probability values might change considerably, while the everyday interpretation of the resulting numbers did not. He conducted a reanalysis of NALS data using a response probability value of 50 percent; that is, he calculated the difficulty of the items based on a 50 percent probability of responding correctly. This reanalysis demonstrated that use of a response probability value of 50 percent rather than 80 percent, with both given the same everyday language interpretation (e.g., that an individual at that level was likely to get an item correct), would have lowered the cut scores associated with the performance levels in such a way that a much smaller percentage of adults would have been classified at the lowest level. For example, the cut score based on a response probability of 80 placed slightly more than 20 percent of respondents in the lowest performance level; the cut score based on a response probability of 50 classified only 9 percent at this level.

It is important to point out here that the underlying distribution of scores did not change (and clearly could not change) with this reanalysis. There were no differences in the percentages of individuals scoring at each scale score. The only changes were the response probability criteria and the interpretation of the cut scores. Using 80 percent as the response probability criterion, we would say that 20 percent of the population could perform the skills described by the first performance level with 80 percent accuracy. If the accuracy level was set at 50 percent and the same everyday language interpretation was applied, a larger share of the population could be said to perform these skills.
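Kolstad's point, that the proficiency distribution itself is untouched and only the cut score and its everyday reading move, can be illustrated with a toy calculation. Everything in the sketch below is hypothetical: the simulated score distribution and the two cut values are chosen only so that the resulting percentages land near the 20 percent and 9 percent figures cited above.

```python
import random

random.seed(0)
# A purely hypothetical proficiency distribution on the 0-500 reporting
# scale; the real NALS distribution is not reproduced here.
scores = [random.gauss(272, 60) for _ in range(100_000)]

def pct_below(cut: float) -> float:
    """Percentage of the simulated population scoring below a cut score."""
    return 100.0 * sum(s < cut for s in scores) / len(scores)

# The same fixed distribution evaluated against two different Level 1/Level 2
# boundaries: one derived with an rp80 criterion and a lower one such as an
# rp50 criterion would produce (both cut values are illustrative).
print(f"Below rp80-based cut (225): {pct_below(225):.1f}%")
print(f"Below a lower rp50-based cut (195): {pct_below(195):.1f}%")
```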

Findings About the Choice of Response Probability Values

Like many decisions made in connection with developing a test, the choice of a specific response probability value requires both technical and nontechnical considerations. For example, a high response probability may be adopted when the primary objective of the test is to certify, with a high degree of certainty, that test takers have mastered the content and skills. In licensing decisions, one would want to have a high degree of confidence that a potential license recipient has truly mastered the requisite subject matter and skills. When there are no high-stakes decisions associated with test results, a lower response probability value may be more appropriate.

Choice of a response probability value requires making a judgment, and reasonable people may disagree about which of several options is most appropriate. For this reason, it is important to lay out the logic behind the decision. It is not clear from the NALS Technical Manual (Kirsch et al., 2001) that the consequences associated with the choice of a response probability of 80 percent were fully explored or that other options were considered. Furthermore, the technical manual (Kirsch et al., 2001) contains contradictory information—one chapter that specifies the response probability value used and another chapter that demonstrates how alternate choices would have affected the resulting cut scores. Including contradictory information like this in a technical manual is very disconcerting to those who must interpret and use the assessment results. It is our opinion that the choice of a response probability value to use in setting cut scores should be based on a thorough consideration of technical and nontechnical factors, such as the difficulty level of the test in relation to the proficiency level of the examinees, the objectives of the assessment, the ways the test results are used, and the consequences associated with these uses of test results. The logic and rationale for the choice should be clearly documented. Additional discussion of response probabilities appears in the technical note to this chapter, and we revisit the topic in Chapter 5.

MAPPING ITEMS TO PERFORMANCE LEVELS

Response probabilities are calculated for purposes other than determining cut scores. One of the most common uses of response probability values is to "map" items to specific score levels in order to more tangibly describe what it means to score at a specific level. For NALS, as described in the preceding section, the scale score associated with an 80 percent probability of responding correctly—abbreviated in the measurement literature as rp80—was calculated for each NALS item. Selected items were then mapped to the performance level whose associated score range encompassed the rp80 difficulty value. The choice of rp80 (as opposed to rp65 or some other value) appears to have been made both to conform to conventional item mapping practices at the time (e.g., NAEP used rp80 at the time, although it has since changed to rp67) and because it represents the concept of "mastery" as it is generally conceptualized in the field of education (Kirsch et al., 2001; personal communication, August 2004).

Item mapping is a useful tool for communicating about test performance. A common misperception occurs with its use, however: namely, that individuals who score at the specific level will respond correctly and those at lower levels will respond incorrectly. Many of the publicly reported NALS results displayed items mapped to only a single performance level, the level associated with a response probability of 80 percent. This all-or-nothing interpretation ignores the continuous nature of response probabilities. That is, for any given item, individuals at every score point have some probability of responding correctly.

Table 3-5, which originally appeared in Chapter 14 of the technical manual as Figure 14-4 (Kirsch et al., 2001), demonstrates this point using four sample NALS prose tasks. Each task is mapped to four different scale scores according to four different probabilities of a correct response (rp80, rp65, rp50, and rp35). Consider the first mapped prose task, "identify country in short article." According to the table, individuals who achieved a scaled score of 149 had an 80 percent chance of responding correctly; those who scored 123 had a 65 percent chance of responding correctly; those with a score of 102 had a 50 percent chance of responding correctly; and those who scored 81 had a 35 percent chance of responding correctly.

Although those who worked on NALS had a rationale for selecting an rp80 criterion for use in mapping exemplary items to the performance levels, other response probability values might have been used and displays such as in Table 3-5 might have been prepared. If item mapping procedures are to be used in describing performance on NAAL, we encourage use of displays more like that in Table 3-5. Additional information about item mapping appears in the technical note to this chapter. We also revisit this issue in Chapter 6, where we discuss methods of communicating about NAAL results.

Recommendation 3-1: If the Department of Education decides to use an item mapping procedure to exemplify performance on the National Assessment of Adult Literacy (NAAL), displays should demonstrate that individuals who score at all of the performance levels have some likelihood of responding correctly to the items.

CONCLUSION

As clearly stated by the test designers, the decision to collapse the NALS score distribution into five categories or ranges of performance was not done with the intent or desire to establish standards reflecting the extent of literacy skills that adults in the United States need or should have. Creating such levels was a means to convey the summary of performance on NALS.

Some of the more important details about the process were not specified in the NALS Technical Manual (Kirsch et al., 2001). Determination of the cut scores involved examination of the listing of items for break points, but the actual break points were not entirely obvious. It is not clear who participated in this process or how decisions were made. In addition, the choice of the response probability value of 80 percent is not fully documented. All of this suggests that one should not automatically accept the five NALS performance categories as the representation of defensible or justified levels of performance expectations.

The performance levels produced by the 1992 approach were groupings based on judgments about the complexity of the thinking processes required to respond to the items. While these levels might be useful for characterizing adults' literacy skills, the process through which they were determined is not one that would typically be used to derive performance levels expected to inform policy interventions or to identify needed programs. It is the committee's view that a more open, transparent process that relies on and utilizes stakeholder feedback is more likely to result in performance levels informative for the sorts of decisions expected to be based on the results.

Such a process is more in line with currently accepted practices for setting cut scores. The Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 1999) specifically call for (1) clear documentation of the rationale and procedures used for establishing cut scores (Standard 4.19), (2) investigation of the relations between test scores and relevant criteria (Standard 4.20), and (3) designing the judgmental process so that judges can bring their knowledge and experience to bear in a reasonable way (Standard 4.21). We relied on this guidance offered by the Standards in designing our approach to developing performance levels and setting cut scores, which is the subject of the remainder of this report.

TECHNICAL NOTE

Item Response Theory and Response Probabilities: A More Technical Explanation

This technical note provides additional details about item response theory and response probabilities. The section begins with a brief introduction to the two-parameter item response model. This is followed by a discussion of how some of the features of item response models can be exploited to devise ways to map test items to scale score levels and further exemplify the skills associated with specified proficiency levels. The section

TABLE 3-5 Difficulty Values of Selected Tasks Along the Prose Literacy Scale, Mapped at Four Response Probability Criteria: The 1992 National Adult Literacy Survey

Prose task                                                        RP 80   RP 65   RP 50   RP 35
Identify country in short article (a)                               149     123     102      81
Underline sentence explaining action stated in short article        224     194     169     145
State in writing an argument made in a long newspaper story         329     300     278     255
Interpret a brief phrase from a lengthy news article                424     398     378     358

(a) At a scale score of 149, an individual has an 80 percent chance of a correct response to this item. At a scale score of 123, an individual has a 65 percent chance of a correct response. At scale scores of 102 and 81, individuals have, respectively, a 50 percent and a 35 percent chance of responding correctly to the item.

concludes with a discussion of factors to consider when selecting response probability values.

Overview of the Two-Parameter Item Response Model

As mentioned above, IRT methodology was used for scaling the 1992 NALS items. While some of the equations and computations required by IRT are complicated, the underlying theoretical concept is actually quite straightforward, and the methodology provides some statistics very useful for interpreting assessment results. The IRT equation (referred to as the two-parameter logistic model, or 2-PL for short) used for scaling the 1992 NALS data appears below:

P(x_i = 1 \mid \theta_j) = \frac{1}{1 + e^{-a_i(\theta_j - b_i)}}     (3-1)

The left-hand side of the equation symbolizes the probability (P) of responding correctly to an item (e.g., item i) given a specified ability level (referred to as theta or θ). The right-hand side of the equation gives the mechanism for calculating the probability of responding correctly, where a_i and b_i are referred to as "item parameters,"3 and θ is the specified ability level. In IRT, this equation is typically used to estimate the probability that an individual, with a specified ability level θ, will correctly respond to an item. Alternatively, the probability P of a correct response can be specified along with the item parameters (a_i and b_i), and the equation can be solved for the value of theta associated with the specified probability value.

Exemplifying Assessment Results

A hallmark of IRT is the way it describes the relation of the probability of an item response to scores on the scale reflecting the level of performance on the construct measured by the test. That description has two parts, as illustrated in Figure 3-1. The first part describes the population density, or distribution of persons over the variable being measured. For the illustration in Figure 3-1, the variable being measured is prose literacy as defined by the 1992 NALS. A hypothetical population distribution is shown in the upper panel of Figure 3-1, simulated as a normal distribution.4

3 Item discrimination is denoted by a_i; item location (difficulty) is denoted by b_i.

4 A normal distribution is used for simplicity. The actual NALS distribution was skewed (see page N-3 of the NALS Technical Manual).
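To make equation 3-1 and its inverse concrete, the short Python sketch below (not part of the original report) evaluates the 2-PL trace line and solves it for the scale score at which an item reaches a chosen response probability. The item parameters are not the published NALS estimates; they are approximate values inferred from the rp80 and rp50 locations reported for the "identify country in short article" task in Table 3-5, so the output is only an illustration.

    import math

    def p_correct(theta, a, b):
        # Equation 3-1: probability of a correct response for a person at
        # scale score theta on an item with discrimination a and difficulty b.
        return 1.0 / (1.0 + math.exp(-a * (theta - b)))

    def theta_at(rp, a, b):
        # Invert the trace line: the scale score at which the probability
        # of a correct response equals rp.
        return b + math.log(rp / (1.0 - rp)) / a

    # Approximate (inferred, not published) parameters for the "identify
    # country in short article" task: the rp50 location is about 102, so
    # b is near 102; a is chosen so that the rp80 location lands near 149.
    a, b = math.log(4) / 47.0, 102.0   # a is roughly 0.03

    for rp in (0.80, 0.65, 0.50, 0.35):
        print(f"rp{round(rp * 100)}: scale score of about {theta_at(rp, a, b):.0f}")
    # Output should fall close to the Table 3-5 entries: 149, 123, 102, and 81.

The same two functions are all that is needed for either direction of use described above: given theta, compute a probability; given a target probability, compute the corresponding scale score.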

FIGURE 3-1 Upper panel: Distribution of proficiency in the population for the prose literacy scale. Lower panel: The trace line, or item characteristic curve, for a sample prose item (the "write letter explaining bill error" task).

The second part of an IRT description of item performance is the trace line, or item characteristic curve. A trace line shows the probability of a correct response to an item as a function of proficiency (in this case, prose literacy). Such a curve is shown in the lower panel of Figure 3-1 for an item that is described as requiring "the reader to write a brief letter explaining that an error has been made on a credit card bill" (Kirsch et al., 1993, p. 78). For this item, the trace line in Figure 3-1 shows that people with prose literacy scale scores higher than 300 are nearly certain to respond correctly, while those with scores lower than 200 are nearly certain to fail. The

probability of a correct response rises relatively quickly as scores increase from 200 to 300.

Making Use of Trace Lines

Trace lines can be determined for each item on the assessment. The trace lines are estimated from the assessment data in a process called item calibration. Trace lines for the 39 open-ended items on the prose scale for the 1992 NALS are shown in Figure 3-2. The trace line shown in Figure 3-1 is one of those in the center of Figure 3-2.

The variation in the trace lines for the different items in Figure 3-2 shows how the items vary in difficulty. Some trace lines are shifted to the left, indicating that lower scoring individuals have a high probability of responding correctly. Some trace lines are shifted to the right, which means the items are more difficult and only very high-scoring individuals are likely to respond correctly. As Figure 3-2 shows, some trace lines are steeper than others. The steeper the trace line, the more discriminating the item; that is, items with higher discrimination values are better at distinguishing among test takers' proficiency levels.

FIGURE 3-2 Trace lines for the 39 open-ended items on the prose scale for the 1992 NALS.

FIGURE 3-3 Division of the 1992 NALS prose literacy scale into five levels.

The collection of trace lines is used for several purposes. One purpose is the computation of scores for persons with particular patterns of item responses. Another purpose is to link the scales from repeated assessments. Such trace lines for items repeated between assessments were used to link the scale of the 1992 NALS to the 1985 Young Adult Literacy Survey. A similar linkage was constructed between the 1992 NALS and the 2003 NAAL.

In addition, the trace lines for each item may be used to describe how responses to the items are related to alternate reporting schemes for the literacy scale. For reporting purposes, the prose literacy scale for the 1992 NALS was divided into five levels using cut scores that are shown embedded in the population distribution in Figure 3-3. Using these levels for reporting, the proportion of the population scoring 225 or lower was said to be in Level 1, with the proportions in Levels 2, 3, and 4 representing score ranges of 50 points, and finally Level 5 included scores exceeding 375.

Mapping Items to Specific Scale Score Values

With a response probability (rp) criterion specified, it is possible to use the IRT model to "place" the items at some specific level on the scale. Placing an item at a specific level allows one to make statements or predictions about the likelihood that a person who scores at the level will answer the question correctly. For the 1992 NALS, items were placed at a specific

level as part of the process that was used to decide on the cut scores among the five levels and for use in reporting examples of items. For the 1992 NALS, an rp value of .80 was used. This means that each item was said to be "at" the value of the prose score scale for which the probability of a correct response was .80. For example, for the "write letter" item, it was said "this task is at 280 on the prose scale" (Kirsch et al., 1993, p. 78), as shown by the dotted lines in Figure 3-4.

FIGURE 3-4 Scale scores associated with rp values of .50, .67, and .80 for a sample item from the NALS prose scale (the rp50, rp67, and rp80 locations shown are 246, 264, and 280, respectively).

Using these placements, items were said to be representative of what persons scoring in each level could do. Depending on where the item was placed within the level, it was noted whether an item was one of the easier or more difficult items in the level. For example, the "write letter" item was described as "one of the easier Level 3 tasks" (Kirsch et al., 1993, p. 78). These placements of items were also shown on item maps, such as the one that appeared on page 10 of Kirsch et al., 1993 (see Table 3-6); the purpose of the item maps is to aid in the interpretation of the meaning of scores on the scale and in the levels.

Some procedures, such as the bookmark standard-setting procedures, require the specification of an rp value to place the items on the scale. However, even when it is necessary to place an item at a specific point on the scale, it is important to remember that an item can be placed anywhere on the scale, with some rp value. For example, as illustrated in Figure 3-4, the "write letter" item is "at" 280 (and "in" Level 3, because that location is above 275) for an rp value of .80. However, this item is at 246, which places it in the lower middle of Level 2 (between 226 and 275) for an rp value of .50, and it is at 264, which is in the upper middle of Level 2 for an rp value of .67.
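As a hedged illustration of how a single trace line produces all three placements in Figure 3-4, the sketch below inverts a 2-PL trace line at rp50, rp67, and rp80 and assigns each resulting scale score to one of the five 1992 reporting levels. The parameters are again inferred approximations (from the 246 and 280 locations quoted above for the "write letter" task), not the published item estimates.

    import math

    # 1992 NALS prose reporting levels: Level 1 is 225 or lower, Level 2 is
    # 226-275, Level 3 is 276-325, Level 4 is 326-375, Level 5 is above 375.
    CUTS = (225, 275, 325, 375)

    def level(score):
        # Count how many cut scores the score exceeds; exceeding none = Level 1.
        return 1 + sum(score > cut for cut in CUTS)

    def theta_at(rp, a, b):
        # Scale score at which a 2-PL item reaches response probability rp.
        return b + math.log(rp / (1.0 - rp)) / a

    # Approximate parameters for the "write letter" task, inferred from its
    # reported rp50 location (about 246.5) and rp80 location (about 280).
    a, b = math.log(4) / 33.5, 246.5   # a is roughly 0.04

    for rp in (0.50, 0.67, 0.80):
        score = theta_at(rp, a, b)
        print(f"rp{round(rp * 100)}: about {score:.0f} (Level {level(score)})")
    # Roughly reproduces Figure 3-4: 246 (Level 2), 264 (Level 2), 280 (Level 3).

The point of the exercise is the one made in the text: nothing about the item changes; only the rp criterion, and therefore the level into which the item is said to fall, changes.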

TABLE 3-6 National Adult Literacy Survey (NALS) Item Map (selected tasks listed by literacy scale at their rp80 scale scores)

Prose scale
  149 Identify country in short article
  210 Locate one piece of information in sports article
  224 Underline sentence explaining action stated in short article
  226 Underline meaning of a term given in government brochure on supplemental security income
  250 Locate two features of information in sports article
  275 Interpret instructions from an appliance warranty
  280 Write a brief letter explaining error made on a credit card bill
  304 Read a news article and identify a sentence that provides interpretation of a situation
  316 Read lengthy article to identify two behaviors that meet a stated condition
  328 State in writing an argument made in lengthy newspaper article
  347 Explain difference between two types of employee benefits
  359 Contrast views expressed in two editorials on technologies available to make fuel-efficient cars
  362 Generate unfamiliar theme from short poems
  374 Compare two metaphors used in poem
  382 Compare approaches stated in narrative on growing up
  410 Summarize two ways lawyers may challenge prospective jurors
  423 Interpret a brief phrase from a lengthy news article

Document scale
  69 Sign your name
  151 Locate expiration date on driver's license
  180 Locate time of meeting on a form
  214 Using pie graph, locate type of vehicle having specific sales
  232 Locate intersection on a street map
  245 Locate eligibility from table of employee benefits
  259 Identify and enter background information on application for social security card
  277 Identify information from bar graph depicting source of energy and year
  296 Use sign out sheet to respond to call about resident
  314 Use bus schedule to determine appropriate bus for given set of conditions
  323 Enter information given into an automobile maintenance record form
  342 Identify the correct percentage meeting specified conditions from a table of such information
  348 Use bus schedule to determine appropriate bus for given set of conditions
  379 Use table of information to determine pattern in oil exports across years
  387 Using table comparing credit cards, identify the two categories used and write two differences between them
  396 Use a table depicting information about parental involvement in school survey to write a paragraph summarizing extent to which parents and teachers agree

Quantitative scale
  191 Total a bank deposit entry
  238 Calculate postage and fees for certified mail
  246 Determine difference in price between tickets for two shows
  270 Calculate total costs of purchase from an order form
  278 Using calculator, calculate difference between regular and sale price from an advertisement
  308 Using calculator, determine the discount from an oil bill if paid within 10 days
  325 Plan travel arrangements for meeting using flight schedule
  331 Determine correct change using information in a menu
  350 Using information stated in news article, calculate amount of money that should go to raising a child
  368 Using eligibility pamphlet, calculate the yearly amount a couple would receive for basic supplemental security income
  375 Calculate miles per gallon using information given on mileage record chart
  382 Determine individual and total costs on an order form for items in a catalog
  405 Using information in news article, calculate difference in times for completing a race
  421 Using calculator, determine the total cost of carpet to cover a room

Source: U.S. Department of Education, National Center for Education Statistics, National Adult Literacy Survey, 1992.

FIGURE 3-5 Percentage expected to answer the sample item correctly within each of the five levels of the 1992 NALS scale (for the "write letter" item shown, the percentages are 14, 55, 88, 98, and 99+ for Levels 1 through 5, respectively).

It should be emphasized that it is not necessary to place items at a single score location. For example, in reporting the results of the assessment, it is not necessary to say that an item is "at" some value (such as 280 for the "write letter" item). Furthermore, there are more informative alternatives to placing items at a single score location. If an item is said to be "at" some scale value or "in" some level (as the "write letter" item is at 280 and in Level 3), it suggests that people scoring lower, or in lower levels, do not respond correctly. That is not the case. The trace line itself, as shown in Figure 3-4, reminds us that many people scoring in Level 2 (more than the upper half of those in Level 2) have a better than 50-50 chance of responding correctly to this item.

A more accurate depiction of the likelihood of a correct response was presented in Appendix D of the 1992 technical manual (Kirsch et al., 2001). That appendix includes a representation of the trace line for each item at seven equally spaced scale scores between 150 and 450 (along with the rp80 value). This type of representation would allow readers to make inferences about this item much like those suggested by Figure 3-4.

Figure 3-5 shows the percentage expected to answer the "write letter" item in each of the five levels. These values can be computed from the IRT model (represented by equation 3-1), in combination with the population distribution.5

5 They are the weighted average of the probabilities correct given by the trace line for each score within the level, weighted by the population density of persons at that score (in the upper panel of Figure 3-1). Using the Gaussian population distribution, those values are not extremely accurate for 1992 NALS; however, they are used here for illustrative purposes.
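Footnote 5 describes how the within-level percentages in Figure 3-5 can be computed: average the trace line over the scores within each level, weighting by the population density. The sketch below carries out that calculation with an assumed normal population distribution (the mean of 275 and standard deviation of 60 are chosen only for illustration) and the inferred "write letter" parameters used earlier; because the actual 1992 NALS distribution was skewed and these values are approximations, the output will only roughly resemble the 14, 55, 88, 98, and 99+ percent shown in Figure 3-5.

    import math

    def p_correct(theta, a=math.log(4) / 33.5, b=246.5):
        # 2-PL trace line for the "write letter" task (inferred parameters).
        return 1.0 / (1.0 + math.exp(-a * (theta - b)))

    def normal_pdf(x, mean, sd):
        return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

    def expected_pct_correct(lo, hi, mean=275.0, sd=60.0, step=0.5):
        # Footnote 5's calculation, approximated with a simple Riemann sum:
        # average the trace line over [lo, hi], weighted by the population density.
        num = den = 0.0
        score = lo
        while score < hi:
            w = normal_pdf(score, mean, sd)
            num += w * p_correct(score)
            den += w
            score += step
        return 100.0 * num / den

    levels = {1: (100, 225), 2: (225, 275), 3: (275, 325), 4: (325, 375), 5: (375, 450)}
    for lev, (lo, hi) in levels.items():
        print(f"Level {lev}: about {expected_pct_correct(lo, hi):.0f}% expected correct")
    # With these assumed values the results fall in the neighborhood of the
    # percentages shown in Figure 3-5.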

With access to the data, one can alternatively simply tabulate the observed proportion of examinees who responded correctly at each reporting level. The latter has been done often in recent NAEP reports (e.g., The Nation's Report Card: Reading 2002, http://www.nces.ed.gov/pubsearch/pubsinfo.asp?pubid=2003521, Chapter 4, pp. 102ff).

The values in Figure 3-5 show clearly how misconceptions can arise from statements such as "this item is 'in' Level 3" (using an rp value of .80). While the item may be "in" Level 3, 55 percent of people in Level 2 responded correctly. So statements such as "because the item is in Level 3, people scoring in Level 2 would respond incorrectly" are wrong. For reporting results using sets of levels, a graphical or numerical summary of the probability of a correct response at multiple points on the score scale, such as shown in Figure 3-5, is likely to be more informative and lead to more accurate interpretations.

Use of Response Probabilities in Standard Setting

As previously mentioned, for some purposes, such as the bookmark method of standard setting, it is essential that items be placed at a single location on the score scale. An rp value must be selected to accomplish that. The bookmark method of standard setting requires an "ordered item booklet" in which the items are placed in increasing order of difficulty. With the kinds of IRT models that are used for NALS and NAAL, different rp values place the items in different orders. For example, Figure 3-2 includes dotted lines that denote three rp values: rp80, rp67, and rp50. The item trace lines cross the dotted line representing an rp value of 80 percent in one sequence, while they cross the dotted line representing an rp value of 67 percent in another sequence, and they cross the dotted line representing an rp value of 50 percent in yet another sequence.
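The reordering just described is easy to demonstrate with two artificial 2-PL items whose parameters are chosen purely for illustration (they are not NALS items): a flatter, less discriminating item and a steeper, more discriminating item can trade places in the difficulty ordering as the rp criterion moves from 50 toward 80 percent, which is why a single rp value has to be fixed before an ordered item booklet can be assembled.

    import math

    def theta_at(rp, a, b):
        # Scale score at which a 2-PL item reaches response probability rp.
        return b + math.log(rp / (1.0 - rp)) / a

    # Two hypothetical items: X is less discriminating (flatter trace line) and
    # slightly easier at rp50; Y is more discriminating (steeper) and slightly
    # harder at rp50.
    items = {"X": (0.02, 245.0), "Y": (0.06, 250.0)}   # (a, b)

    for rp in (0.50, 0.67, 0.80):
        order = sorted(items, key=lambda name: theta_at(rp, *items[name]))
        locations = ", ".join(f"{name}={theta_at(rp, *items[name]):.0f}" for name in order)
        print(f"rp{round(rp * 100)}: easiest to hardest is {' then '.join(order)} ({locations})")
    # At rp50 item X (245) precedes item Y (250); at rp67 and rp80 the flatter
    # item X climbs to about 280 and 314 while Y reaches only about 262 and 273,
    # so the booklet order reverses.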

There are a number of factors to consider in selecting an rp criterion.

Factors to Consider in Selecting a Response Probability Value

One source of information on which to base the selection of an rp value involves empirical studies of the effects of different rp values on the standard-setting process (e.g., Williams and Schultz, 2005). Another source of information relevant to the selection of an rp value is purely statistical in nature, having to do with the relative precision of estimates of the scale scores associated with various rp values. To illustrate, Figure 3-6 shows the trace line for the "write letter" item as it passes through the middle of the prose score scale. The trace line is enclosed in dashed lines that represent the boundaries of a 95 percent confidence envelope for the curve. The confidence envelope for a curve is a region that includes the curves corresponding to the central 95 percent confidence interval for the (item) parameters that produce the curve. That is, the confidence envelope translates statistical uncertainty (due to random sampling) in the estimation of the item parameters into a graphical display of the consequent uncertainty in the location of the trace line itself.6

FIGURE 3-6 A 95 percent confidence envelope for the trace line for the sample item on the NALS prose scale.

A striking feature of the confidence envelope in Figure 3-6 is that it is relatively narrow. This is because the standard errors for the item parameters (reported in Appendix A of the 1992 NALS Technical Manual) are very small. Because the confidence envelope is very narrow, it is difficult to see in Figure 3-6 that it is actually narrower (either vertically or horizontally) around rp50 than it is around rp80. This means that there is less uncertainty associated with proficiency estimates based on rp50 than on rp80. While this finding is not evident in the visual display (Figure 3-6), it has been previously documented (see Thissen and Wainer, 1990, for illustrations of confidence envelopes that are not so narrow and show their characteristic asymmetries more clearly).

Nonetheless, the confidence envelope may be used to translate the uncertainty in the item parameter estimates into descriptions of the uncertainty of the scale scores corresponding to particular rp values. Using the "write letter" NALS item as an illustration, at rp50 the confidence envelope

6 For a more detailed description of confidence envelopes in the context of IRT, see Thissen and Wainer (1990), who use results obtained by Thissen and Wainer (1982) and an algorithm described by Hauck (1983) to produce confidence envelopes like the dashed lines in Figure 3-6.

encloses trace lines that would place the corresponding scale score anywhere between 245 and 248 (as shown by the solid lines connected to the dotted line for 0.50 in Figure 3-6). That range of three points is smaller than the four-point range for rp67 (from 262 to 266), which is, in turn, smaller than the range for the rp80 scale score (278-283).7

The rp80 values, as used for reporting the 1992 NALS results, have statistical uncertainty that is almost twice as large (5 points, from 278 to 283, around the reported value of 280 for the "write letter" item) as the rp50 values (3 points, from 245 to 248, for this item). The rp50 values are always most precisely estimated. So a purely statistical answer to the question, "What rp value is most precisely estimated, given the data?" would be rp50 for the item response model used for the binary-scored open-ended items in NALS and NAAL. The statistical uncertainty in the scale scores associated with rp values simply increases as the rp value increases above 0.50. It actually becomes very large for rp values of 90, 95, or 99 percent (which is no doubt the reason such rp values are never considered in practice).

Nevertheless, the use of rp50 has been reported to be very difficult for judges in standard-setting processes, as well as other consumers, to interpret usefully (Williams and Schulz, 2004). What does it mean to say "the score at which the person has a 50-50 chance of responding correctly"? While that value may be useful (and interpretable) for a data analyst developing models for item response data, it is not so useful for consumers of test results who are more interested in ideas like "mastery." An rp value of 67 percent, now commonly used in bookmark procedures (Mitzel et al., 2001), represents a useful compromise for some purposes. That is, the idea that there is a 2 in 3 chance that the examinee will respond correctly is readily interpretable as "more likely than not." Furthermore, the statistical uncertainty of the estimate of the scale score associated with rp67 is larger than for rp50 but not as large as for rp80.

Figure 3-2 illustrates another statistical property of the trace lines used for NALS and NAAL that provides motivation for choosing an rp value closer to 50 percent. Note in Figure 3-2 that not only are the trace lines in a different (horizontal) order for rp values of 50, 67, and 80 percent, but they are also considerably more variable (more widely spread) at rp80 than

7 Some explanation is needed. First, the rp50 interval is actually symmetrical. Earlier (Figure 3-4), the rp50 value was claimed to be 246. The actual value, before rounding, is very close to 246.5, so the interval from 245 to 248 (which is rounded very little) is both correct and symmetrical. The intervals for the higher rp values are supposed to be asymmetrical.

they are at rp50. These greater variations at rp80, and the previously described wider confidence envelope, are simply due to the inherent shape of the trace line. As it approaches a value of 1.0, it must flatten out, and so it must develop a "shoulder" that has very uncertain location (in the left-right direction) for any particular value of the probability of a correct response (in the vertical direction). Figure 3-2 shows that variation in the discrimination of the items greatly accentuates the variation in the scale score location of high and low rp values.

Again, these kinds of purely statistical considerations would lead to a choice of rp50. Considerations of mastery for the presentation and description of the results to many audiences suggest higher rp values. We suggest a compromise value of rp67, combined with a reminder that the rp values are arbitrary values used in the standard-setting process and that reports of the results can describe the likelihood of correct responses for any level or scale score.
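As a closing illustration of the comparison described in this section, the sketch below propagates sampling uncertainty in the item parameters into uncertainty in the rp50, rp67, and rp80 locations by simple Monte Carlo simulation. The point estimates are the inferred "write letter" values used earlier, and the standard errors are invented small numbers used only for illustration (the actual standard errors are reported in Appendix A of the NALS Technical Manual); the qualitative pattern, a spread that is smallest at rp50 and grows as the rp value increases, is the point of the exercise.

    import math
    import random
    import statistics

    def theta_at(rp, a, b):
        # Scale score at which a 2-PL item reaches response probability rp.
        return b + math.log(rp / (1.0 - rp)) / a

    # Inferred point estimates for the "write letter" task, plus invented
    # (purely illustrative) standard errors for the two item parameters.
    a_hat, b_hat = math.log(4) / 33.5, 246.5
    se_a, se_b = 0.002, 1.0

    random.seed(0)
    draws = [(random.gauss(a_hat, se_a), random.gauss(b_hat, se_b)) for _ in range(10000)]

    for rp in (0.50, 0.67, 0.80):
        locations = [theta_at(rp, a, b) for a, b in draws]
        print(f"rp{round(rp * 100)}: location about {statistics.mean(locations):.1f}, "
              f"simulated spread (SD) about {statistics.stdev(locations):.1f}")
    # The spread is smallest at rp50 and grows with the rp value, the same
    # qualitative pattern as the 3-, 4-, and 5-point ranges discussed above.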
