3
Developing Performance Levels for the National Adult Literacy Survey

In this chapter, we document our observations and findings about the procedures used to develop the performance levels for the 1992 National Adult Literacy Survey (NALS). The chapter begins with some background information on how performance levels and the associated cut scores are typically determined. We then provide a brief overview of the test development process used for NALS, as it relates to the procedures for determining performance levels, and describe how the performance levels were determined and the cut scores set. The chapter also includes a discussion of the role of response probabilities in setting cut scores and in identifying assessment tasks to exemplify performance levels; the technical notes at the end of the chapter provide additional details about this topic.

BACKGROUND ON DEVELOPING PERFORMANCE LEVELS

When the objective of a test is to report results using performance levels, the number of levels and the descriptions of the levels are usually articulated early in the test development process and serve as the foundation for test development. The process of determining the number of levels and their descriptions usually involves consideration of the content and skills evaluated on the test as well as discussions with stakeholders about the inferences to be based on the test results and the ways the test results will be used. When the number of levels and the descriptions of the levels are laid out in advance, development efforts can focus on constructing items that measure the content and skills described by the levels. It is important to develop a sufficient number of items that measure the skills
described by each of the levels. This allows for more reliable estimates of test-takers’ skills and more accurate classification of individuals into the various performance levels.

While determination of the performance-level descriptions is usually completed early in the test development process, determination of the cut scores between the performance levels is usually made after the test has been administered and examinees’ answers are available. Typically, the process of setting cut scores involves convening a group of panelists with expertise in areas relevant to the subject matter covered on the test and familiarity with the test-taking population, who are instructed to make judgments about what test takers need to know and be able to do (e.g., which test items individuals should be expected to answer correctly) in order to be classified into a given performance level. These judgments are used to determine the cut scores that separate the performance levels.

Methods for setting cut scores are used in a wide array of assessment contexts, from the National Assessment of Educational Progress (NAEP) and state-sponsored achievement tests, in which procedures are used to determine the level of performance required to classify students into one of several performance levels (e.g., basic, proficient, or advanced), to licensing and certification tests, in which procedures are used to determine the level of performance required to pass such tests in order to be licensed or certified.

There is a broad literature on procedures for setting cut scores on tests. In 1986, Berk documented 38 methods and variations on these methods, and the literature has grown substantially since. All of the methods rely on panels of judges, but the tasks posed to the panelists and the procedures for arriving at the cut scores differ. The methods can be classified as test-centered, examinee-centered, and standards-centered.

The modified Angoff and bookmark procedures are two examples of test-centered methods. In the modified Angoff procedure, the task posed to the panelists is to imagine a typical minimally competent examinee and to decide on the probability that this hypothetical examinee would answer each item correctly (Kane, 2001). The bookmark method requires placing all of the items in a test in order by difficulty; panelists are asked to place a “bookmark” at the point between the most difficult item borderline test takers would be likely to answer correctly and the easiest item borderline test takers would be likely to answer incorrectly (Zieky, 2001).

The borderline group and contrasting group methods are two examples of examinee-centered procedures. In the borderline group method, the panelists are tasked with identifying examinees who just meet the performance standard; the cut score is set equal to the median score for these examinees (Kane, 2001). In the contrasting group method, the panelists are asked to categorize examinees into two groups—an upper group that has clearly met
the standard and a lower group that has not met the standard. The cut score is the score that best discriminates between the two groups.

The Jaeger-Mills integrated judgment procedure and the body of work procedure are examples of standards-centered methods. With these methods, panelists examine full sets of examinees’ responses and match the full set of responses to a performance level (Jaeger and Mills, 2001; Kingston et al., 2001). Texts such as Jaeger (1989) and Cizek (2001a) provide full descriptions of these and the other available methods.

Although the methods differ in their approaches to setting cut scores, all ultimately rely on judgments. The psychometric literature documents procedures for systematizing the process of obtaining judgments about cut scores (e.g., see Jaeger, 1989; Cizek, 2001a). Use of systematic and careful procedures can increase the likelihood of obtaining fair and reasoned judgments, thus improving the reliability and validity of the results. Nevertheless, the psychometric field acknowledges that there are no “correct” standards, and the ultimate judgments depend on the method used, the way it is carried out, and the panelists themselves (Brennan, 1998; Green, Trimble, and Lewis, 2003; Jaeger, 1989; Zieky, 2001).

The literature on setting cut scores includes critiques of the various methods that document their strengths and weaknesses. As might be expected, methods that have been used widely and for some time, such as the modified Angoff procedure, have been the subject of more scrutiny than recently developed methods like the bookmark procedure. A review of these critiques quickly reveals that there are no perfect or correct methods. Like the cut-score-setting process itself, choice of a specific procedure requires making an informed judgment about the most appropriate method for a given assessment situation. Additional information about methods for setting cut scores appears in Chapter 5, where we describe the procedures we used.

DEVELOPMENT OF NALS TASKS

The NALS tasks were drawn from the contexts that adults encounter on a daily basis. As mentioned in Chapter 2, these contexts include work, home and family, health and safety, community and citizenship, consumer economics, and leisure and recreation. Some of the tasks had been used on the earlier adult literacy assessments (the Young Adult Literacy Survey in 1985 and the survey of job seekers in 1990), to allow comparison with the earlier results, and some were newly developed for NALS.

The tasks that were included on NALS were intended to profile and describe performance in each of the specified contexts. However, NALS was not designed to support inferences about the level of literacy adults need in order to function in the various contexts. That is, there was no
attempt to systematically define the critical literacy demands in each of the contexts. The test designers specifically emphasize this, saying: “[The literacy levels] do not reveal the types of literacy demands that are associated with particular contexts…. They do not enable us to say what specific level of prose, document, or quantitative skill is required to obtain, hold, or advance in a particular occupation, to manage a household, or to obtain legal or community services” (Kirsch et al., 1993, p. 9). This is an important point, because it demonstrates that some of the inferences made by policy makers and the media about the 1992 results were clearly not supported by the test development process and the intent of the assessment.

The approach toward test development used for NALS does not reflect typical procedures used when the objective of an assessment is to distinguish individuals with adequate levels of skills from those whose skills are inadequate. We point this out, not to criticize the process, but to clarify the limitations placed on the inferences that can be drawn about the results. To explain, it is useful to contrast the test development procedures used for NALS with procedures used in other assessment contexts, such as licensing and credentialing or state achievement testing.

Licensing and credentialing assessments are generally designed to distinguish between performance that demonstrates sufficient competence in the targeted knowledge, skills, and capabilities to be judged as passing and performance that is inadequate and judged as failing. Typically, licensing and certification tests are intentionally developed to distinguish between adequate and inadequate performance. The test development process involves specification of the skills critical to adequate performance, generally determined by systematically collecting judgments from experts in the specific field (e.g., via surveys) about what a licensed practitioner needs to know and be able to do. The process for setting cut scores relies on expert judgments about just how much of the specific knowledge, skills, and capabilities is needed for a candidate to be placed in the passing category.

The process for test development and determining performance levels for state K-12 achievement tests is similar. Under ideal circumstances, the performance-level categories and their descriptions are determined in advance of or concurrent with item development, and items are developed to measure skills described by the performance levels. The process of setting the cut scores then focuses on determining the level of performance considered to be adequate mastery of the content and skills (often called “proficient”). Categories of performance below and above the proficient level are also often described to characterize the score distribution of the group of test takers.

The process for developing NALS and determining the performance levels was different. This approach toward test development does not—and was not intended to—provide the necessary foundation for setting
standards for what adults need in order to adequately function in society, and there is no way to compensate for this after the fact. That is, there is no way to set a specific cut score that would separate adults who have sufficient literacy skills to function in society from those who do not. This does not mean that performance levels should not be used for reporting NALS results or that cut scores should not be set. But it does mean that users need to be careful about which inferences from the test results can be supported and which cannot.

DEVELOPMENT OF PERFORMANCE-LEVEL DESCRIPTIONS AND CUT SCORES

Overview of the Process Used for the 1992 NALS

The process of determining performance levels for the 1992 NALS was based partially on analyses conducted on data from the two earlier assessments of adults’ literacy skills. The analyses focused on identifying the features of the assessment tasks and stimulus materials that contributed to the difficulty of the test questions. These analyses had been used to determine performance levels for the Survey of Workplace Literacy, the survey of job seekers conducted in 1990.[1]

[1] The analyses were conducted on the Young Adult Literacy Survey, but performance levels were not used in reporting its results. The analyses were partly replicated and extended to yield performance levels for the Survey of Workplace Literacy.

The analyses conducted on the prior surveys were not entirely replicated for NALS. Instead, new analyses were conducted to evaluate the appropriateness of the performance levels and associated cut scores that had been used for the survey of job seekers. Based on these analyses, slight adjustments were made in the existing performance levels before adopting them for NALS. This process is described more fully below.

The first step in the process that ultimately led to the formulation of NALS performance levels was an in-depth examination of the items included on the Young Adult Literacy Survey and the Survey of Workplace Literacy, to identify the features judged to contribute to their complexity.[2]

[2] See Chapter 13 of the NALS Technical Manual for additional details about the process (http://www.nces.ed.gov/pubsearch/pubsinfo.asp?pubid=2001457).
For the prose literacy items, four features were judged to contribute to their complexity:

- Type of match: whether finding the information needed to answer the question involved simply locating the answer in the text, cycling through the text iteratively, integrating multiple pieces of information, or generating new information based on prior knowledge.
- Abstractness of the information requested.
- Plausibility of distractors: the extent and location of information related to the question, other than the correct answer, that appears in the stimulus.
- Readability: estimated using Fry’s (1977) readability index.

The features judged to contribute to the complexity of document literacy items were the same as for prose, with the exception that an index of the structural complexity of the display was substituted for the readability index. For the quantitative literacy items, the identified features included type of match and plausibility of the distractors, as with the prose items, and structural complexity, as with the document items, along with two other features:

- Operation specificity: the process required for identifying the operation to perform and the numbers to manipulate.
- Type of calculation: the type and number of arithmetic operations.

A detailed schema was developed for use in “scoring” items according to these features, and the scores were referred to as complexity ratings.

The next step in the process involved determination of the cut scores for the performance levels used for reporting results of the 1990 Survey of Workplace Literacy. The process involved rank-ordering the items according to a statistical estimate of their difficulty, which was calculated using data from the actual survey respondents. The items were listed in order from least to most difficult, and the judgment-based ratings of complexity were displayed on the listing. Tables 3-1 through 3-3, respectively, present the lists of prose, document, and quantitative items rank-ordered by difficulty level. This display was visually examined for natural groupings or break points. According to Kirsch, Jungeblut, and Mosenthal (2001, p. 332), “visual inspection of the distribution of [the ratings] along each of the literacy scales revealed several major [break] points occurring at roughly 50 point intervals beginning with a difficulty score of 225 on each scale.” The process of determining the break points was characterized as containing “some noise” and not accounting for all the score variance associated with performance on the literacy scales. It was noted that the shifts in complexity ratings did not necessarily occur at exactly 50 point intervals on the scales, but that assigning the exact range of scores to each level (e.g., ...
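The rank-ordering step just described can be sketched in a few lines of Python. The sketch below is illustrative only: it takes a handful of prose tasks from Table 3-1 (which follows), sorts them by their RP80 difficulty estimates, and prints the judgment-based type-of-match rating beside each one so that shifts in the ratings, the possible break points, can be inspected by eye.

```python
# A sketch of the rank-ordering and visual-inspection step, using a few prose
# tasks from Table 3-1: sort by the RP80 difficulty estimate and display the
# judgment-based "type of match" complexity rating next to each task.
tasks = [
    # (identifier, RP80 difficulty, type-of-match rating), from Table 3-1
    ("A111301", 189, 1), ("A120501", 216, 1), ("A130601", 237, 3),
    ("A100201", 249, 3), ("AB90701", 262, 3), ("AB81101", 277, 4),
    ("AB60201", 280, 3), ("A121301", 312, 3), ("AB40901", 329, 4),
    ("AB81301", 355, 5), ("AB81201", 384, 2), ("A101201", 441, 7),
]

for ident, rp80, match in sorted(tasks, key=lambda t: t[1]):
    print(f"{ident}  RP80={rp80:3d}  type of match={match}  {'#' * match}")
# The ratings generally drift upward with difficulty, but not perfectly
# (AB81201 is an example), which is the "noise" noted in the text; the adopted
# break points fell at roughly 50-point intervals beginning at 225.
```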
TABLE 3-1 List of Prose Literacy Tasks, Along with RP80 Task Difficulty, IRT Item Parameters, and Values of Variables Associated with Task Difficulty: 1990 Survey of the Literacy of Job-Seekers

| Level | Identifier | Task description | Scaled RP80 | a | b | c | Readability | Type of match | Distractor plausibility | Information type |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | A111301 | Toyota, Acura, Nissan | 189 | 0.868 | -2.488 | 0.000 | 8 | 1 | 1 | 1 |
| 1 | AB21101 | Swimmer: Underline sentence telling what Ms. Chanin ate | 208 | 1.125 | -1.901 | 0.000 | 8 | 1 | 1 | 1 |
| 1 | A120501 | Blood donor pamphlet | 216 | 0.945 | -1.896 | 0.000 | 7 | 1 | 1 | 2 |
| 1 | A130601 | Summons for jury service | 237 | 1.213 | -1.295 | 0.000 | 7 | 3 | 2 | 2 |
| 2 | A120301 | Blood donor pamphlet | 245 | 0.956 | -1.322 | 0.000 | 7 | 1 | 2 | 3 |
| 2 | A100201 | PHP subscriber letter | 249 | 1.005 | -1.195 | 0.000 | 10 | 3 | 1 | 3 |
| 2 | A111401 | Toyota, Acura, Nissan | 250 | 1.144 | -1.088 | 0.000 | 8 | 3 | 2 | 4 |
| 2 | A121401 | Dr. Spock column: Alterntv to phys punish | 251 | 1.035 | -1.146 | 0.000 | 8 | 2 | 2 | 3 |
| 2 | AB21201 | Swimmer: Age Ms. Chanin began to swim competitively | 250 | 1.070 | -1.125 | 0.000 | 8 | 3 | 4 | 2 |
| 2 | A131001 | Shadows Columbus saw | 280 | 1.578 | -0.312 | 0.000 | 9 | 3 | 1 | 2 |
| 2 | AB80801 | Illegal questions | 265 | 1.141 | -0.788 | 0.000 | 6 | 3 | 2 | 2 |
| 2 | AB41001 | Declaration: Describe what poem is about | 263 | 0.622 | -1.433 | 0.000 | 4 | 3 | 1 | 3 |
| 2 | AB81101 | New methods for capital gains | 277 | 1.025 | -0.638 | 0.000 | 7 | 4 | 1 | 3 |
| 2 | AB71001 | Instruction to return appliance: Indicate best note | 275 | 1.378 | -0.306 | 0.266 | 5 | 3 | 2 | 3 |
| 2 | AB90501 | Questions for new jurors | 281 | 1.118 | -0.493 | 0.000 | 6 | 4 | 2 | 1 |
| 2 | AB90701 | Financial security tips | 262 | 1.563 | -0.667 | 0.000 | 8 | 3 | 2 | 4 |
| 2 | A130901 | Shadows Columbus saw | 282 | 1.633 | -0.255 | 0.000 | 9 | 3 | 4 | 1 |
| 3 | AB60201 | Make out check: Write letter explaining bill error | 280 | 1.241 | -0.440 | 0.000 | 7 | 3 | 2 | 4 |
| 3 | AB90601 | Financial security tips | 299 | 1.295 | -0.050 | 0.000 | 8 | 2 | 2 | 4 |
| 3 | A121201 | Dr. Spock column: Why phys punish accptd | 285 | 1.167 | -0.390 | 0.000 | 8 | 3 | 2 | 4 |
| 3 | AB70401 | Almanac vitamins: List correct info from almanac | 289 | 0.706 | -0.765 | 0.000 | 7 | 3 | 4 | 1 |
| 3 | A100301 | PHP subscriber letter | 294 | 0.853 | -0.479 | 0.000 | 10 | 4 | 3 | 2 |
| 3 | A130701 | Shadows Columbus saw | 298 | 1.070 | -0.203 | 0.000 | 9 | 3 | 2 | 3 |
| 3 | A130801 | Shadows Columbus saw | 303 | 0.515 | -0.929 | 0.000 | 9 | 3 | 2 | 2 |
| 3 | AB60601 | Economic index: Underline sentence explaining action | 305 | 0.809 | -0.320 | 0.000 | 10 | 3 | 2 | 4 |
| 3 | A121301 | Dr. Spock column: 2 cons against phys punish | 312 | 0.836 | -0.139 | 0.000 | 8 | 3 | 3 | 4 |
| 3 | AB90401 | Questions for new jurors | 300 | 1.230 | -0.072 | 0.000 | 6 | 4 | 2 | 3 |
| 3 | AB80901 | Illegal questions | 316 | 0.905 | -0.003 | 0.000 | 6 | 4 | 3 | 3 |
| 3 | A111101 | Toyota, Acura, Nissan | 319 | 0.772 | -0.084 | 0.000 | 8 | 4 | 3 | 2 |
| 4 | AB40901 | Korean Jet: Give argument made in article | 329 | 0.826 | 0.166 | 0.000 | 10 | 4 | 4 | 4 |
| 4 | A131101 | Shadows Columbus saw | 332 | 0.849 | 0.258 | 0.000 | 9 | 5 | 4 | 1 |
| 4 | AB90801 | Financial security tips | 331 | 0.851 | 0.236 | 0.000 | 8 | 5 | 5 | 2 |
| 4 | AB30601 | Technology: Orally explain info from article | 333 | 0.915 | 0.347 | 0.000 | 8 | 4 | 4 | 4 |
| 4 | AB50201 | Panel: Determine surprising future headline | 343 | 1.161 | 0.861 | 0.196 | 13 | 4 | 4 | 4 |
| 4 | A101101 | AmerExp: 2 similarities in handling receipts | 346 | 0.763 | 0.416 | 0.000 | 8 | 4 | 2 | 4 |
| 4 | AB71101 | Explain difference between 2 types of benefits | 348 | 0.783 | 0.482 | 0.000 | 9 | 6 | 2 | 5 |
| 4 | AB81301 | New methods for capital gains | 355 | 0.803 | 0.652 | 0.000 | 7 | 5 | 5 | 3 |
| 4 | A120401 | Blood donor pamphlet | 358 | 0.458 | -0.056 | 0.000 | 7 | 4 | 5 | 2 |
| 4 | AB31201 | Dickinson: Describe what is expressed in poem | 363 | 0.725 | 0.691 | 0.000 | 6 | 6 | 2 | 4 |
| 4 | AB30501 | Technology: Underline sentence explaining action | 371 | 0.591 | 0.593 | 0.000 | 8 | 6 | 4 | 4 |
| 5 | AB81201 | New methods for capital gains | 384 | 0.295 | -0.546 | 0.000 | 7 | 2 | 4 | 2 |
| 5 | A111201 | Toyota, Acura, Nissan | 404 | 0.578 | 1.192 | 0.000 | 8 | 8 | 4 | 5 |
| 5 | A101201 | AmExp: 2 diffs in handling receipts | 441 | 0.630 | 2.034 | 0.000 | 8 | 7 | 5 | 5 |
| 5 | AB50101 | Panel: Find information from article | 469 | 0.466 | 2.112 | 0.000 | 13 | 6 | 5 | 4 |

TABLE 3-2 List of Document Literacy Tasks, Along with RP80 Task Difficulty Score, IRT Item Parameters, and Values of Variables Associated with Task Difficulty (structural complexity, type of match, plausibility of distractor, type of information): 1990 Survey of the Literacy of Job-Seekers

| Level | Identifier | Task description | RP80 | a | b | c | Structural complexity | Type of match | Distractor plausibility | Information type |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | SCOR100 | Social Security card: Sign name on line | 70 | 0.505 | -4.804 | 0.000 | 1 | 1 | 1 | 1 |
| 1 | SCOR300 | Driver’s license: Locate expiration date | 152 | 0.918 | -2.525 | 0.000 | 2 | 1 | 2 | 1 |
| 1 | SCOR200 | Traffic signs | 176 | 0.566 | -2.567 | 0.000 | 1 | 1 | 1 | 1 |
| 1 | AB60803 | Nurses’ convention: What is time of program? | 181 | 1.439 | -1.650 | 0.000 | 1 | 1 | 1 | 1 |
| 1 | AB60802 | Nurses’ convention: What is date of program? | 187 | 1.232 | -1.620 | 0.000 | 1 | 1 | 1 | 1 |
| 1 | SCOR400 | Medicine dosage | 186 | 0.442 | -2.779 | 0.000 | 2 | 1 | 2 | 2 |
| 1 | AB71201 | Mark correct movie from given information | 189 | 0.940 | -1.802 | 0.000 | 8 | 2 | 2 | 1 |
| 1 | A110501 | Registration & tuition info | 189 | 0.763 | -1.960 | 0.000 | 3 | 1 | 2 | 2 |
| 1 | AB70104 | Job application: Complete personal information | 193 | 0.543 | -2.337 | 0.000 | 1 | 2 | 1 | 2 |
| 1 | AB60801 | Nurses’ convention: Write correct day of program | 199 | 1.017 | -1.539 | 0.000 | 1 | 1 | 2 | 1 |
| 1 | SCOR500 | Theatre trip information | 197 | 0.671 | -1.952 | 0.000 | 2 | 1 | 2 | 2 |
TABLE 3-2 Continued

| Level | Identifier | Task description | RP80 |
|---|---|---|---|
| 1 | AB60301 | Phone message: Write correct name of caller | 200 |
| 1 | AB60302 | Phone message: Write correct number of caller | 202 |
| 1 | AB80301 | How companies share market | 203 |
| 1 | AB60401 | Food coupons | 204 |
| 1 | AB60701 | Nurses’ convention: Who would be asked questions | 206 |
| 1 | A120601 | MasterCard/Visa statement | 211 |
| 1 | AB61001 | Nurses’ convention: Write correct place for tables | 217 |
| 1 | A110301 | Dessert recipes | 216 |
| 1 | AB70903 | Checking deposit: Enter correct amount of check | 223 |
| 1 | AB70901 | Checking deposit: Enter correct date | 224 |
| 1 | AB50801 | Wage & tax statement: What is current net pay? | 224 |
| 1 | A130201 | El Paso Gas & Electric bill | 223 |
| 2 | AB70801 | Classified: Match list with coupons | 229 |
| 2 | AB30101 | Street map: Locate intersection | 232 |
| 2 | AB30201 | Sign out sheet: Respond to call about resident | 232 |
| 2 | AB40101 | School registration: Mark correct age information | 234 |
| 2 | A131201 | Tempra dosage chart | 233 |
| 2 | AB31301 | Facts about fire: Mark information in article | 235 |
| 2 | AB80401 | How companies share market | 236 |
| 2 | AB60306 | Phone message: Write whom message is for | 237 |
| 2 | AB60104 | Make out check: Enter correct amount written out | 238 |
| 2 | AB21301 | Bus schedule | 238 |
| 2 | A110201 | Dessert recipes | 239 |
| 2 | AB30301 | Sign out sheet: Respond to call about resident | 240 |
| 2 | AB30701 | Major medical: Locate eligibility from table | 245 |
| 2 | AB60103 | Make out check: Enter correct amount in numbers | 245 |
| 2 | AB60101 | Make out check: Enter correct date on check | 246 |
| 2 | AB60102 | Make out check: Paid to the correct place | 246 |
| 2 | AB50401 | Catalog order: Order product one | 247 |
| 2 | AB60303 | Phone message: Mark “please call” box | 249 |
| 2 | AB50701 | Almanac football: Explain why an award is given | 254 |
| 2 | AB20101 | Energy graph: Find answer for given conditions (1) | 255 |
| 2 | A120901 | MasterCard/Visa statement | 257 |
| 2 | A130101 | El Paso Gas & Electric bill | 257 |
| 2 | AB91101 | Minimum wage power | 260 |
| 2 | AB81001 | Consumer Reports books | 261 |
| 2 | AB90101 | Pest control warning | 261 |
| 2 | AB21501 | With graph, predict sales for spring 1985 | 261 |
| 2 | AB20601 | Yellow pages: Find place open Saturday | 266 |
| 2 | A130401 | El Paso Gas & Electric bill | 270 |
| 2 | AB70902 | Checking deposit: Enter correct cash amount | 271 |
... concludes with a discussion of factors to consider when selecting response probability values.

Overview of the Two-Parameter Item Response Model

As mentioned above, IRT methodology was used for scaling the 1992 NALS items. While some of the equations and computations required by IRT are complicated, the underlying theoretical concept is actually quite straightforward, and the methodology provides some statistics very useful for interpreting assessment results. The IRT equation (referred to as the two-parameter logistic model, or 2-PL for short) used for scaling the 1992 NALS data appears below:

    P_i(θ) = 1 / (1 + exp[-1.7 a_i (θ - b_i)])        (3-1)

The left-hand side of the equation symbolizes the probability (P) of responding correctly to an item (e.g., item i) given a specified ability level (referred to as theta or θ). The right-hand side of the equation gives the mechanism for calculating the probability of responding correctly, where a_i and b_i are referred to as “item parameters,”[3] and θ is the specified ability level. In IRT, this equation is typically used to estimate the probability that an individual, with a specified ability level θ, will correctly respond to an item. Alternatively, the probability P of a correct response can be specified along with the item parameters (a_i and b_i), and the equation can be solved for the value of theta associated with the specified probability value.

[3] Item discrimination is denoted by a_i; item location (difficulty) is denoted by b_i.

Exemplifying Assessment Results

A hallmark of IRT is the way it describes the relation of the probability of an item response to scores on the scale reflecting the level of performance on the construct measured by the test. That description has two parts, as illustrated in Figure 3-1. The first part describes the population density, or distribution of persons over the variable being measured. For the illustration in Figure 3-1, the variable being measured is prose literacy as defined by the 1992 NALS. A hypothetical population distribution is shown in the upper panel of Figure 3-1, simulated as a normal distribution.[4]

[4] A normal distribution is used for simplicity. The actual NALS distribution was skewed (see page N-3 of the NALS Technical Manual).
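A minimal sketch of equation 3-1 in Python appears below. It assumes the logistic form with the conventional 1.7 scaling constant and uses the discrimination and difficulty estimates reported in Table 3-1 for the “write letter” task (AB60201); the theta values are on the underlying IRT scale rather than the 0-500 reporting scale.

```python
import math

D = 1.7  # scaling constant assumed for the logistic form of the model

def p_correct(theta, a, b):
    """Equation 3-1: probability of a correct response for a person at
    proficiency theta, for an item with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

# 1990 estimates for the "write letter" task (AB60201) from Table 3-1.
a, b = 1.241, -0.440

for theta in (-2.0, -1.0, b, 0.0, 1.0, 2.0):
    print(f"theta = {theta:+.2f}   P(correct) = {p_correct(theta, a, b):.3f}")
# At theta = b the probability is exactly .50; it approaches 1.0 as theta
# increases and 0 as theta decreases, tracing the S-shaped curve of Figure 3-1.
```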
FIGURE 3-1 Upper panel: Distribution of proficiency in the population for the prose literacy scale. Lower panel: The trace line, or item characteristic curve, for a sample prose item.

The second part of an IRT description of item performance is the trace line, or item characteristic curve. A trace line shows the probability of a correct response to an item as a function of proficiency (in this case, prose literacy). Such a curve is shown in the lower panel of Figure 3-1 for an item that is described as requiring “the reader to write a brief letter explaining that an error has been made on a credit card bill” (Kirsch et al., 1993, p. 78). For this item, the trace line in Figure 3-1 shows that people with prose literacy scale scores higher than 300 are nearly certain to respond correctly, while those with scores lower than 200 are nearly certain to fail. The probability of a correct response rises relatively quickly as scores increase from 200 to 300.
Making Use of Trace Lines

Trace lines can be determined for each item on the assessment. The trace lines are estimated from the assessment data in a process called item calibration. Trace lines for the 39 open-ended items on the prose scale for the 1992 NALS are shown in Figure 3-2. The trace line shown in Figure 3-1 is one of those in the center of Figure 3-2.

FIGURE 3-2 Trace lines for the 39 open-ended items on the prose scale for the 1992 NALS.

The variation in the trace lines for the different items in Figure 3-2 shows how the items vary in difficulty. Some trace lines are shifted to the left, indicating that lower scoring individuals have a high probability of responding correctly. Some trace lines are shifted to the right, which means the items are more difficult and only very high-scoring individuals are likely to respond correctly.

As Figure 3-2 shows, some trace lines are steeper than others. The steeper the trace line, the more discriminating the item. That is, items with higher discrimination values are better at distinguishing among test takers’ proficiency levels.
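The sketch below evaluates several trace lines at a range of proficiency values to show, numerically, what Figure 3-2 shows graphically: easier items have curves shifted to the left, and items with larger discrimination parameters have steeper curves. The parameters are the 1990 estimates from Table 3-1, on the theta scale, and the 1.7 scaling constant is an assumption carried over from the previous example.

```python
import math

D = 1.7

def trace(theta, a, b):
    # 2-PL trace line (item characteristic curve), as in equation 3-1.
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

# A few prose items from Table 3-1 (1990 estimates, theta scale).
items = [
    ("AB21101 (easy)",       1.125, -1.901),
    ("AB60201 (middle)",     1.241, -0.440),
    ("A131001 (steep)",      1.578, -0.312),
    ("AB50101 (hard, flat)", 0.466,  2.112),
]

thetas = [-3, -2, -1, 0, 1, 2, 3]
print("item".ljust(24) + "".join(f"{t:>7}" for t in thetas))
for name, a, b in items:
    print(name.ljust(24) + "".join(f"{trace(t, a, b):7.2f}" for t in thetas))
# Easier items show high probabilities even at low theta (curves shifted left);
# a larger discrimination a makes the probabilities rise more steeply.
```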
The collection of trace lines is used for several purposes. One purpose is the computation of scores for persons with particular patterns of item responses. Another purpose is to link the scales from repeated assessments. Such trace lines for items repeated between assessments were used to link the scale of the 1992 NALS to the 1985 Young Adult Literacy Survey. A similar linkage was constructed between the 1992 NALS and the 2003 NAAL. In addition, the trace lines for each item may be used to describe how responses to the items are related to alternate reporting schemes for the literacy scale.

FIGURE 3-3 Division of the 1992 NALS prose literacy scale into five levels.

For reporting purposes, the prose literacy scale for the 1992 NALS was divided into five levels using cut scores that are shown embedded in the population distribution in Figure 3-3. Using these levels for reporting, the proportion of the population scoring 225 or lower was said to be in Level 1, the proportions in Levels 2, 3, and 4 represented score ranges of 50 points each, and Level 5 included scores exceeding 375.
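The level boundaries can be expressed directly in code. The short example below classifies prose scale scores into the five reporting levels and, purely for illustration, computes the share of a hypothetical normally distributed population falling in each level; the mean and standard deviation used are assumptions, not the actual (skewed) NALS distribution.

```python
from statistics import NormalDist

# Cut scores dividing the 0-500 prose scale into the five 1992 NALS levels.
CUTS = [225, 275, 325, 375]            # Level 1 <= 225, ..., Level 5 > 375

def level(score):
    """Return the reporting level (1-5) for a prose scale score."""
    return 1 + sum(score > cut for cut in CUTS)

print(level(280))                      # -> 3: a score of 280 falls in Level 3

# Share of a hypothetical population in each level, assuming a normal
# proficiency distribution; the mean and SD here are illustrative only.
pop = NormalDist(mu=280, sigma=60)
bounds = [float("-inf")] + CUTS + [float("inf")]
for lvl in range(1, 6):
    share = pop.cdf(bounds[lvl]) - pop.cdf(bounds[lvl - 1])
    print(f"Level {lvl}: {share:.1%} of the hypothetical population")
```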
Mapping Items to Specific Scale Score Values

With a response probability (rp) criterion specified, it is possible to use the IRT model to “place” the items at some specific level on the scale. Placing an item at a specific level allows one to make statements or predictions about the likelihood that a person who scores at the level will answer the question correctly. For the 1992 NALS, items were placed at a specific level as part of the process that was used to decide on the cut scores among the five levels and for use in reporting examples of items.

For the 1992 NALS, an rp value of .80 was used. This means that each item was said to be “at” the value of the prose score scale for which the probability of a correct response was .80. For example, for the “write letter” item, it was said “this task is at 280 on the prose scale” (Kirsch et al., 1993, p. 78), as shown by the dotted lines in Figure 3-4.

FIGURE 3-4 Scale scores associated with rp values of .50, .67, and .80 for a sample item from the NALS prose scale.

Using these placements, items were said to be representative of what persons scoring in each level could do. Depending on where the item was placed within the level, it was noted whether an item was one of the easier or more difficult items in the level. For example, the “write letter” item was described as “one of the easier Level 3 tasks” (Kirsch et al., 1993, p. 78). These placements of items were also shown on item maps, such as the one that appeared on page 10 of Kirsch et al. (1993) (see Table 3-6); the purpose of the item maps is to aid in the interpretation of the meaning of scores on the scale and in the levels.

Some procedures, such as the bookmark standard-setting procedures, require the specification of an rp value to place the items on the scale. However, even when it is necessary to place an item at a specific point on the scale, it is important to remember that an item can be placed anywhere on the scale, with some rp value. For example, as illustrated in Figure 3-4, the “write letter” item is “at” 280 (and “in” Level 3, because that location is above 275) for an rp value of .80. However, this item is at 246, which places it in the lower middle of Level 2 (between 226 and 275), for an rp value of .50, and it is at 264, which is in the upper middle of Level 2, for an rp value of .67.
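Solving equation 3-1 for theta at a chosen response probability gives the placement directly. In the sketch below, the “write letter” parameters from Table 3-1 are combined with a rough, unofficial linear link from the theta scale to the 0-500 prose reporting scale; the link constants are assumptions chosen only so that the example reproduces the placements quoted above (about 246, 264, and 280).

```python
import math

D = 1.7

def theta_at_rp(p, a, b):
    """Proficiency at which the trace line reaches response probability p
    (the inverse of equation 3-1): theta = b + ln(p / (1 - p)) / (D * a)."""
    return b + math.log(p / (1.0 - p)) / (D * a)

def to_scale(theta):
    # Rough, unofficial linear link from theta to the 0-500 prose reporting
    # scale, chosen only so this example reproduces the quoted placements.
    return 269.0 + 51.7 * theta

# "Write letter" task (AB60201) parameters from Table 3-1.
a, b = 1.241, -0.440

for rp in (0.50, 0.67, 0.80):
    theta = theta_at_rp(rp, a, b)
    print(f"rp{int(rp * 100)}: theta = {theta:+.3f}, scale score about {to_scale(theta):.0f}")
# Placements land near 246 (rp50), 264 (rp67), and 280 (rp80): the same item
# is "in" Level 2 or Level 3 depending on the response probability chosen.
```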
TABLE 3-6 National Adult Literacy Survey (NALS) Item Map
FIGURE 3-5 Percentage expected to answer the sample item correctly within each of the five levels of the 1992 NALS scale.

It should be emphasized that it is not necessary to place items at a single score location. For example, in reporting the results of the assessment, it is not necessary to say that an item is “at” some value (such as 280 for the “write letter” item). Furthermore, there are more informative alternatives to placing items at a single score location. If an item is said to be “at” some scale value or “in” some level (as the “write letter” item is at 280 and in Level 3), it suggests that people scoring lower, or in lower levels, do not respond correctly. That is not the case. The trace line itself, as shown in Figure 3-4, reminds us that many people scoring in Level 2 (more than the upper half of those in Level 2) have a better than 50-50 chance of responding correctly to this item.

A more accurate depiction of the likelihood of a correct response was presented in Appendix D of the 1992 technical manual (Kirsch et al., 2001). That appendix includes a representation of the trace line for each item at seven equally spaced scale scores between 150 and 450 (along with the rp80 value). This type of representation would allow readers to make inferences about this item much like those suggested by Figure 3-4.

Figure 3-5 shows the percentage expected to answer the “write letter” item in each of the five levels. These values can be computed from the IRT model (represented by equation 3-1), in combination with the population distribution.[5] With access to the data, one can alternatively simply tabulate the observed proportion of examinees who responded correctly at each reporting level. The latter has been done often in recent NAEP reports (e.g., The Nation’s Report Card: Reading 2002, http://www.nces.ed.gov/pubsearch/pubsinfo.asp?pubid=2003521, Chapter 4, pp. 102ff).

[5] They are the weighted average of the probabilities correct given by the trace line for each score within the level, weighted by the population density of persons at that score (in the upper panel of Figure 3-1). Using the Gaussian population distribution, those values are not extremely accurate for the 1992 NALS; however, they are used here for illustrative purposes.
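The computation described in note 5 can be sketched numerically: average the trace line over the scores within each level, weighting by the population density. The population distribution and the theta-to-scale link used below are illustrative assumptions (the same ones as in the earlier examples), so the printed percentages only approximate the pattern shown in Figure 3-5.

```python
import math
from statistics import NormalDist

D = 1.7
A_LINK, B_LINK = 51.7, 269.0       # same rough theta-to-scale link as above

def trace_at_score(score, a=1.241, b=-0.440):
    """Trace line for the "write letter" item evaluated at a 0-500 scale
    score, via the illustrative link from scale scores back to theta."""
    theta = (score - B_LINK) / A_LINK
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

pop = NormalDist(mu=280, sigma=60)     # assumed population, for illustration
levels = {1: (0, 225), 2: (226, 275), 3: (276, 325), 4: (326, 375), 5: (376, 500)}

for lvl, (lo, hi) in levels.items():
    # Average P(correct) over scores in the level, weighted by the population
    # density: a numerical version of the computation described in note 5.
    grid = [lo + k * (hi - lo) / 200.0 for k in range(201)]
    weights = [pop.pdf(s) for s in grid]
    expected = sum(w * trace_at_score(s) for w, s in zip(weights, grid)) / sum(weights)
    print(f"Level {lvl}: about {expected:.0%} expected to answer correctly")
```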
The values in Figure 3-5 show clearly how misconceptions can arise from statements such as “this item is ‘in’ Level 3” (using an rp value of .80). While the item may be “in” Level 3, 55 percent of people in Level 2 responded correctly. So statements such as “because the item is in Level 3, people scoring in Level 2 would respond incorrectly” are wrong. For reporting results using sets of levels, a graphical or numerical summary of the probability of a correct response at multiple points on the score scale, such as shown in Figure 3-5, is likely to be more informative and lead to more accurate interpretations.

Use of Response Probabilities in Standard Setting

As previously mentioned, for some purposes, such as the bookmark method of standard setting, it is essential that items be placed at a single location on the score scale. An rp value must be selected to accomplish that. The bookmark method of standard setting requires an “ordered item booklet” in which the items are placed in increasing order of difficulty. With the kinds of IRT models that are used for NALS and NAAL, different rp values place the items in different orders. For example, Figure 3-2 includes dotted lines that denote three rp values: rp80, rp67, and rp50. The item trace lines cross the dotted line representing an rp value of 80 percent in one sequence, while they cross the dotted line representing an rp value of 67 percent in another sequence, and they cross the dotted line representing an rp value of 50 percent in yet another sequence. There are a number of factors to consider in selecting an rp criterion.
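The reordering can be demonstrated with two items from Table 3-1: A121301 is highly discriminating, while AB81201 is not, and their relative difficulty flips as the rp value moves from .50 toward .80 (which is also why AB81201 lands at 384, in Level 5, under the rp80 convention despite its low b parameter). The sketch below is illustrative, using the 1990 parameters on the theta scale and the assumed 1.7 scaling constant.

```python
import math

D = 1.7

def theta_at_rp(p, a, b):
    # Inverse of the trace line: the proficiency at which P(correct) = p.
    return b + math.log(p / (1.0 - p)) / (D * a)

# Two prose items from Table 3-1 (1990 estimates, theta scale).
items = {"A121301": (1.230, -0.072),   # highly discriminating
         "AB81201": (0.295, -0.546)}   # weakly discriminating

for rp in (0.50, 0.67, 0.80):
    order = sorted(items, key=lambda name: theta_at_rp(rp, *items[name]))
    placed = ", ".join(f"{n} at {theta_at_rp(rp, *items[n]):+.2f}" for n in order)
    print(f"rp{int(rp * 100)} ordering (easiest first): {placed}")
# At rp50 the flatter item (AB81201) looks easier; as the rp value rises the
# order reverses, which is why different rp values produce different
# "ordered item booklets" for bookmark standard setting.
```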
Measuring Literacy: Performance Levels for Adults FIGURE 3-6 A 95 percent confidence envelope for the trace line for the sample item on the NALS prose scale. eters that produce the curve. That is, the confidence envelope translates statistical uncertainty (due to random sampling) in the estimation of the item parameters into a graphical display of the consequent uncertainty in the location of the trace line itself.6 A striking feature of the confidence envelope in Figure 3-6 is that it is relatively narrow. This is because the standard errors for the item parameters (reported in Appendix A of the 1992 NALS Technical Manual) are very small. Because the confidence envelope is very narrow, it is difficult to see in Figure 3-6 that it is actually narrower (either vertically or horizontally) around rp50 than it is around rp80. This means that there is less uncertainty associated with proficiency estimates based on rp50 than on rp80. While this finding is not evident in the visual display (Figure 3-6), it has been previously documented (see Thissen and Wainer, 1990, for illustrations of confidence envelopes that are not so narrow and show their characteristic asymmetries more clearly). Nonetheless, the confidence envelope may be used to translate the uncertainty in the item parameter estimates into descriptions of the uncertainty of the scale scores corresponding to particular rp values. Using the “write letter” NALS item as an illustration, at rp50 the confidence envelope 6   For a more detailed description of confidence envelopes in the context of IRT, see Thissen and Wainer (1990), who use results obtained by Thissen and Wainer (1982) and an algorithm described by Hauck (1983) to produce confidence envelopes like the dashed lines in Figure 3-6.

OCR for page 50
encloses trace lines that would place the corresponding scale score anywhere between 245 and 248 (as shown by the solid lines connected to the dotted line for 0.50 in Figure 3-6). That range of three points is smaller than the four-point range for rp67 (from 262 to 266), which is, in turn, smaller than the range for the rp80 scale score (278-283).[7]

[7] Some explanation is needed. First, the rp50 interval is actually symmetrical. Earlier (Figure 3-4), the rp50 value was claimed to be 246. The actual value, before rounding, is very close to 246.5, so the interval from 245 to 248 (which is rounded very little) is both correct and symmetrical. The intervals for the higher rp values are supposed to be asymmetrical.

The rp80 values, as used for reporting the 1992 NALS results, have statistical uncertainty that is almost twice as large (5 points, from 278 to 283, around the reported value of 280 for the “write letter” item) as the rp50 values (3 points, from 245 to 248, for this item). The rp50 values are always most precisely estimated. So a purely statistical answer to the question, “What rp value is most precisely estimated, given the data?” would be rp50 for the item response model used for the binary-scored open-ended items in NALS and NAAL. The statistical uncertainty in the scale scores associated with rp values simply increases as the rp value increases above 0.50. It actually becomes very large for rp values of 90, 95, or 99 percent (which is no doubt the reason such rp values are never considered in practice).

Nevertheless, the use of rp50 has been reported to be very difficult for judges in standard-setting processes, as well as other consumers, to interpret usefully (Williams and Schulz, 2004). What does it mean to say “the score at which the person has a 50-50 chance of responding correctly”? While that value may be useful (and interpretable) for a data analyst developing models for item response data, it is not so useful for consumers of test results who are more interested in ideas like “mastery.” An rp value of 67 percent, now commonly used in bookmark procedures (Mitzel et al., 2001), represents a useful compromise for some purposes. That is, the idea that there is a 2 in 3 chance that the examinee will respond correctly is readily interpretable as “more likely than not.” Furthermore, the statistical uncertainty of the estimate of the scale score associated with rp67 is larger than for rp50 but not as large as for rp80.
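The effect can be illustrated crudely without the Hauck (1983) envelope algorithm cited in note 6: perturb the item parameters within a plus-or-minus 1.96 standard-error box and see how far theta(rp) moves. The standard errors below are hypothetical placeholders (the actual values are in Appendix A of the technical manual), so only the pattern, not the magnitudes, is meaningful.

```python
import math

D = 1.7

def theta_at_rp(p, a, b):
    return b + math.log(p / (1.0 - p)) / (D * a)

# "Write letter" parameters from Table 3-1 (theta scale), with hypothetical
# standard errors used only to illustrate the idea; the real standard errors
# are reported in Appendix A of the 1992 NALS Technical Manual.
a, b = 1.241, -0.440
se_a, se_b = 0.05, 0.03

for rp in (0.50, 0.67, 0.80):
    # Crude stand-in for the confidence envelope: evaluate theta(rp) at the
    # corners of a +/- 1.96 standard-error box around (a, b).
    corners = [theta_at_rp(rp, a + da, b + db)
               for da in (-1.96 * se_a, 1.96 * se_a)
               for db in (-1.96 * se_b, 1.96 * se_b)]
    width = max(corners) - min(corners)
    print(f"rp{int(rp * 100)}: theta(rp) ranges over {width:.3f} units")
# The range is narrowest at rp50 (only the uncertainty in b matters there)
# and grows as the rp value increases, mirroring the pattern reported for the
# scale-score intervals (3 points at rp50, about 5 points at rp80).
```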
Figure 3-4 illustrates another statistical property of the trace lines used for NALS and NAAL that provides motivation for choosing an rp value closer to 50 percent. Note in Figure 3-2 that not only are the trace lines in a different (horizontal) order for rp values of 50, 67, and 80 percent, but they are also considerably more variable (more widely spread) at rp80 than they are at rp50. These greater variations at rp80, and the previously described wider confidence envelope, are simply due to the inherent shape of the trace line. As it approaches a value of 1.0, it must flatten out and so it must develop a “shoulder” that has a very uncertain location (in the left-right direction) for any particular value of the probability of a correct response (in the vertical direction). Figure 3-2 shows that variation in the discrimination of the items greatly accentuates the variation in the scale score location of high and low rp values.

Again, these kinds of purely statistical considerations would lead to a choice of rp50. Considerations of mastery for the presentation and description of the results to many audiences suggest higher rp values. What we suggest is a compromise value of rp67, combined with a reminder that the rp values are arbitrary values used in the standard-setting process, and with reporting that describes the likelihood of correct responses for any level or scale score.